Rx protocol specification draft
Nickolai Zeldovich, nickolai@csail.mit.edu
Updated by Jeffrey Altman, jaltman@auristor.com

Introduction
============

Rx is a client-server RPC protocol, an extended and combined version
of the older R and RFTP protocols.  This document describes Rx, but
the details of Rx security classes (such as Rxkad and RxGK) are not
specified.

Rx communicates via UDP datagrams on a user-specified port.  Rx also
provides for multiplexing of Rx services on a single port, via a
16-bit service ID that, akin to a port number, identifies a
particular Rx service listening on a given port.  Therefore, an Rx
service is identified by a triple of <IP address; UDP port number;
Rx service ID>.

The protocol is connection-oriented -- a client and a server must
first hand-shake and establish a connection before Rx calls can be
made.  Said hand-shaking is implicit upon the first request if no
authentication is desired, or can consist of a pair of Challenge
and Response requests in order to establish authentication between
the client and the server.

In order to prevent unauthenticated Rx requests from being used for
amplification attacks it is recommended that a reachability test
consisting of an exchange of PING ACK and PING-RESPONSE ACK be
performed before issuing any response larger than the received
request packet.

Protocol Overview
=================

As mentioned above, Rx uses UDP/IP datagrams on a user-specified
port to communicate.  Each Rx server may provide multiple services,
specified by the Service ID.  This allows for service multiplexing,
much in the same way as UDP port numbers allow for multiplexing of UDP
datagrams addressed to the same host.

Each Rx service must offer one or more security classes.  When the
service does not require authentication, integrity protection, or
encryption it can offer the Rxnull security class.  Each multiplexed
Rx service can offer an independent set of security classes.

Each Rx service must specify a protocol for the description and
encoding of data.  Sun RPC's XDR (External Data Representation) is
commonly used but is not required.  Multiplexed Rx services may
require different data encoding protocols.

Each client and server pair that want to communicate using Rx must
establish an Rx connection, which can be thought of as a context
for all subsequent Rx activity between these two parties.  An Rx
connection can only be associated with a single Rx service.

Each Rx connection context contains multiple channels, which are
used for data transmission and actually performing an RPC call.
The channels are independent of each other, allowing multiple
RPC calls to be performed over the same Rx connection simultaneously.

An Rx call involves the transmission of call arguments over an Rx
channel to the server and subsequent reception of the reply data.
For each Rx call, an available Rx channel must be allocated exclusively
to that call.  The channel cannot be used for anything else until the
call completes.  After call completion, the channel may be reused
for subsequent Rx calls.

Rx Connections
==============

This section makes many references to fields of an Rx header; see
the ``Packet Formats'' section for specific layout of the Rx header.

The connection epoch is a relatively unique 31-bit value chosen by
the initiator of each connection.  Historically, the epoch was selected
at startup by computing the seconds since Unix epoch (midnight 1 January
1970 UTC/GMT not counting leap seconds).  It was used to identify the
set of connections initiated by the peer.  Modern Rx implementations
should randomly generate the Rx epoch instead of relying upon a
timestamp.  Rx implementations may generate a unique random epoch for
each initiated connection to reduce the ability to track activity across
multiple peers.

An Rx connection between two Rx peers is identified by:

  1. Direction
  2. Epoch
  3. Connection ID
  4. Service ID
  5. Security Class ID
  6. Peer Address
  7. Peer Port

The Rx Connection direction is determined by the presence of or lack
of the CLIENT-INITIATED bit in the Flags field.  (see below)

Bit 31 (the high-order bit) of the 32-bit epoch field is the
Ignore-Source Flag.  When set, the Ignore-Source Flag indicates that the
peer is multi-homed and that it might send packets from more than one
endpoint (network address and port).  Packets received for Rx
Connections with the Ignore-Source Flag set can be accepted from any
endpoint.  Conversely, if the Ignore-Source Flag is not set, packets
received from distinct endpoints must be considered part of distinct
Rx connections.
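
As an illustration of the rule above, the following minimal sketch builds
a connection lookup key that either includes or omits the peer endpoint.
The names IGNORE_SOURCE_FLAG and connection_key are hypothetical, and the
placement of the flag in the high-order bit of the 32-bit epoch field is
an assumption of this sketch.

```python
IGNORE_SOURCE_FLAG = 0x80000000  # assumed: high-order bit of the epoch field


def connection_key(direction, epoch, conn_id, service_id, security_index,
                   peer_addr, peer_port):
    """Build a lookup key from the values that identify an Rx connection."""
    if epoch & IGNORE_SOURCE_FLAG:
        # Multi-homed peer: packets may arrive from any endpoint, so the
        # address and port are left out of the key.
        return (direction, epoch, conn_id, service_id, security_index)
    # Otherwise a different endpoint means a different connection.
    return (direction, epoch, conn_id, service_id, security_index,
            peer_addr, peer_port)
```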

Rx implementations are discouraged from accepting packets from
distinct endpoints for the same Rx Connection.  Instead, Rx peers
should be designed to reliably transmit reply packets from the
network interface on which they were received; and Rx services
should implement peer identification either via Rx security class
authentication or out of band mechanisms.

The Connection ID is a 30-bit value chosen by the connection initiator.
Although not required, the Connection ID should be unique to the peer
independent of the Epoch, Service ID, Security Class ID, Peer Address,
and Peer Port.

Sharing the 32-bit field with the Connection ID is the 2-bit Channel
ID.  Each connection can therefore multiplex up to four calls, one per
channel.  Only one call at a time can be issued on each Rx channel.
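
The shared 32-bit field could be split and rebuilt as sketched below.
This is only an illustration: it assumes the Channel ID occupies the two
low-order bits, and the names split_cid and join_cid are hypothetical.
The ``Packet Formats'' section remains authoritative for the layout.

```python
CHANNEL_BITS = 2
CHANNEL_MASK = (1 << CHANNEL_BITS) - 1  # 0x3: four channels per connection


def split_cid(field):
    """Split the 32-bit field into (connection_id, channel_id)."""
    field &= 0xFFFFFFFF
    return field >> CHANNEL_BITS, field & CHANNEL_MASK


def join_cid(connection_id, channel_id):
    """Inverse of split_cid; rejects values that do not fit."""
    if not 0 <= connection_id < (1 << 30):
        raise ValueError("Connection ID must fit in 30 bits")
    if not 0 <= channel_id <= CHANNEL_MASK:
        raise ValueError("Channel ID must fit in 2 bits")
    return (connection_id << CHANNEL_BITS) | channel_id
```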

The signed 32-bit Call ID identifies a distinct call within a channel;
there are four call numbers associated with each Rx connection.  Each
new call must start with a higher number than the previous call in the
same channel, and typically this is just the previous call number + 1.
The initial call number must be a positive integer.  Call ID zero (0)
indicates a connection-level Rx packet (see below).  The Call ID is
chosen by the peer initiating the call.  Although only one call can
use a channel at one time, the Call ID allows peers to distinguish
packets on the same channel that belong to different calls.  Once
the call whose ID is 2147483647 (the largest positive signed 32-bit
value) completes, the call channel must cease being used.  Although
the rx_header.callNumber field is an unsigned 32-bit field, all
implementations treat it as a signed 32-bit value and check for
overflow.
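
A minimal sketch of per-channel Call ID allocation with the overflow
check follows.  It takes the largest positive signed 32-bit value as the
limit, which is an assumption of this sketch, and the name next_call_id
is hypothetical.

```python
MAX_CALL_ID = 2**31 - 1  # assumed limit: largest positive signed 32-bit value


def next_call_id(previous):
    """Return the next Call ID for a channel, or None if it is exhausted.

    `previous` is the last Call ID used on the channel (0 if none yet).
    """
    if previous < 0 or previous > MAX_CALL_ID:
        raise ValueError("Call ID out of range")
    if previous == MAX_CALL_ID:
        return None  # the channel must cease being used
    return previous + 1
```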

The unsigned 32-bit Sequence number is similar to the sequence number
in TCP, but instead of bytes it labels sequential DATA packets
transmitted in a single direction within a call.  The first DATA packet
sent in each direction of a call must have the Sequence number one (1).
Sequence numbers are assigned to DATA packets in sequential order of
data packetization.  Each assigned Sequence number is one greater than
that assigned to the previous DATA packet.  Unlike Call IDs, Sequence numbers
are permitted to wrap to zero after Sequence number 4294967295 is
reached.  It is the responsibility of Rx peers to ensure proper
ordering of DATA packets.  In Rx, acknowledgements of received DATA
and retransmissions are performed on a packet-by-packet basis,
identified by these Sequence numbers.  The Sequence number should be
zero (0) in all packets which are unassociated with a call or are
a non-DATA packet type.

Every outgoing packet issued on an Rx connection is stamped with an
unsigned 32-bit serial number in the serial field.  The serial number is
incremented by one (1) for every packet sent regardless of the packet
type or Call ID.  When retransmitting a DATA packet, the Sequence
Number remains the same and the Serial number is unique for each
retransmission. These serial numbers might be used by the flow
control mechanisms (described below).  The serial number for a
connection should start at one (1) and is permitted to wrap to one (1)
when Serial number 4294967295 is reached.  Serial number zero (0) must
be skipped when sending DATA packets because an rx_ackPacket.serial
of zero (0) means that the ACK packet was not sent in response to
receipt of a DATA or ACK packet.  Packets which are unassociated with
an Rx connection should be transmitted with a Serial number of zero (0).
(See Historical Implementation Notes: ACK <Serial>).
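
The two counters behave differently at their upper bound, which the
following sketch tries to make concrete.  The function names are
hypothetical, and wrapping every Serial number (not only those of DATA
packets) past zero is a simplifying assumption.

```python
SERIAL_MAX = 0xFFFFFFFF


def next_serial(last_serial):
    """Return the Serial number following `last_serial` on a connection.

    Pass 0 for a connection that has not yet sent a packet.
    """
    if last_serial >= SERIAL_MAX:
        return 1  # wrap past zero, which is never assigned on a connection
    return last_serial + 1


def next_sequence(last_sequence):
    """Return the next DATA Sequence number in one direction of a call."""
    return (last_sequence + 1) & 0xFFFFFFFF  # permitted to wrap to zero
```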

The unsigned 16-bit Service ID identifies a particular Rx service
running on a given network endpoint.  This is analogous to how UDP port
numbers allow multiplexing packets to a single IP address.  Note that
an Rx connection is created to a specific Service ID.  The associated
Service ID cannot be changed after the first Rx Call on the
Connection.  Existing implementations cache the Service ID value for
a given Connection, and will ignore Service ID values in subsequent
packets.  Attempts to switch the Service ID for a single call will
result in the designated opcode being executed on the cached service.

The unsigned 8-bit SecurityIndex field specifies the type of Security
Class in use on this connection.  The mapping of non-zero SecurityIndex
values to Security Classes is defined by each Rx Service
specification.  SecurityIndex value zero (0) is reserved for the
rxnull Security class.

The interpretation of the 16-bit securitySpecific (aka Checksum) field
is dependent upon the active Rx Security Class.  Unless otherwise
specified by the Security Class specification this field should be
set to zero (0).

An Rx Security class can also modify the packet payload in any way,
for instance by encrypting the contents or adding headers or trailers
specific to the Security class protocol (although the end result must
be a properly sized packet that Rx will be able to transmit.)

The userStatus field allows for additional user flags to be transported
with each DATA or ACK packet.  These have no significance to the Rx
protocol itself.  The meaning of userStatus flags is Rx Service
specific and might have unique meanings based upon the in-flight call
context.  The userStatus field in an ACK packet provides an out of
band communication channel in the reverse direction of the active
call. Unless otherwise defined the userStatus field should be set to
zero.

The "Flags" field consists of a number of single-bit flags with
meanings as follows.  The actual bit values are defined below,
in the ``Protocol Constants'' section.

 * CLIENT-INITIATED
     This packet originated from the connection initiator
     (as opposed to the acceptor).  This flag must be set on all
     outgoing packets (regardless of type) sent by the connection
     initiator and must be cleared on all outgoing packets (also
     regardless of type) sent by the connection acceptor.

 * REQUEST-ACK
     Sender is requesting acknowledgement of this packet,
     via an Ack packet response.  The REQUEST-ACK flag is
     defined for DATA and ACK packet types.

 * LAST-PACKET
     This packet is the last DATA packet transmitted in
     the current direction by the sender.  When the acceptor
     receives all DATA packets including the LAST-PACKET,
     it is safe to switch the direction of the call and
     begin transmitting the response DATA packets.  When the
     LAST-PACKET DATA packet sent by the acceptor is received
     and acknowledged by the initiator the call is complete.

 * MORE-PACKETS
     This advisory flag can be set on all but the last DATA
     packet by the sender initially transmitting a sequential
     list of DATA packets at once.  This flag must not be
     set on retransmitted DATA packets.

 * EXTENDED-SACK
     This flag must be set on all ACK packets of the Rx call
     if the sender wishes to send ACK packets using the Extended
     format.  This flag must be set even if the
     <Receive Window Size> is less than or equal to 255 packets.
     Setting the EXTENDED-SACK flag signals that the meaning of
     <Previous Packet> and the three unused octets following the
     first SACK table are specified as per this document.

 * SLOW-START-OK
     This advisory flag can be set on ACK packets to indicate
     that the sender of this packet supports the slow-start
     mechanism, described below under ``Flow Control''.

 * JUMBO-PACKET
     When set in a DATA packet, this flag indicates that the
     DATA packet is part of a jumbogram, and is not the final
     DATA packet.  See the ``Jumbograms'' section below for
     more details.

Packet Types
============

The "Type" field indicates the contents of this packet.  Actual
values are specified in the ``Protocol Constants'' section.
This section describes the simpler packet types, and subsequent
sections cover more complex packet types in more detail.

Certain packet types are connection-only requests (that is, they
are not associated with an RPC call).  A connection-only request
is indicated by a zero call number.  Valid packet types in a
connection-only context are ABORT, CHALLENGE, RESPONSE, DEBUG,
VERSION, and the three unused PARAMS packet types.  Valid packet
types in a call context are DATA, ACK, ABORT, BUSY and ACKALL.
All other packet types are reserved for future use.
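
The context rule above can be sketched as a simple membership test.
Packet types are represented by their names rather than by their numeric
values, the three PARAMS types are folded into one name, and
valid_in_context is a hypothetical helper.

```python
# "PARAMS" stands in for the three unused PARAMS packet types.
CONNECTION_TYPES = {"ABORT", "CHALLENGE", "RESPONSE", "DEBUG", "VERSION",
                    "PARAMS"}
CALL_TYPES = {"DATA", "ACK", "ABORT", "BUSY", "ACKALL"}


def valid_in_context(packet_type, call_number):
    """Return True if `packet_type` is valid for the given call number."""
    if call_number == 0:
        # A zero call number marks a connection-only packet.
        return packet_type in CONNECTION_TYPES
    return packet_type in CALL_TYPES
```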

The payload of the packet following the header depends on the
value of the Type field, as follows:

 * DATA type (Standard data packet)

     The payload of a data packet is simply the Rx payload,
     corresponding to the sequence number and call specified
     in the header.  The actual data that is transmitted in
     Rx data packets is described below.

     The receipt of a data packet by the initiator implicitly
     acknowledges that the acceptor has received and processed
     all the DATA packets that were transmitted by the
     initiator as part of this call.

 * ACK type (Acknowledgement of received data)

     An acknowledgement packet provides information about
     which packets were or were not received by the peer,
     and other useful parameters.  The semantics of these
     packets are described below in the ``Call Layer''
     section.

 * BUSY type (Busy response)

     When an initiator attempts to start a new call on a channel
     which the acceptor considers in-use, a busy response
     is returned.  The call and channel number in the packet
     header indicate which call is being rejected.  This packet
     type has no payload associated with it.

 * ABORT type (Abort packet)

     Indicates that the relevant call or connection (if the
     call number field is zero) has encountered an error and
     has been terminated.  The payload of the packet has a
     network-byte-order 32-bit user error code.

     Some Rx services (such as the Ubik VOTE service) use the
     ABORT packet instead of a response DATA packet to convey
     the output of a call.  This technique is discouraged.  Rx
     Security classes cannot provide authentication, integrity
     protection or privacy for ABORT packets.  Rx Services
     should consider sending error results via DATA packets
     instead of ABORT packets.

 * ACKALL type (Acknowledgement of all packets)

     An acknowledge-all packet indicates the obvious: the peer
     wants to acknowledge the receipt of all DATA packets sent
     to it as part of the specified call.  This can be used,
     for example, when a connection is being closed and the
     connection initiator wants to ensure that no retransmissions
     are attempted after it exits.

     Alternatively, the initiator could send a connection
     level ABORT to the peer to ensure that no further
     packets are accepted or transmitted by the acceptor.
     Use of the ACKALL packet type is discouraged.

     There is no payload associated with an acknowledge-all
     packet.

 * CHALLENGE, RESPONSE types (Challenge request/response)

     The payload and use of these packet types are Security
     Class specific data, and are used to authenticate an Rx
     connection.

     CHALLENGE packets are sent from acceptor to initiator.
     CHALLENGE packets received by acceptors are dropped by Rx
     and not delivered to the Security Class.

     RESPONSE packets are sent from initiator to acceptor after
     receiving a CHALLENGE.  RESPONSE packets that are received
     by initiators are dropped by Rx and not delivered to the
     Security Class.  RESPONSE packets received by an acceptor
     without a pending CHALLENGE may be dropped by Rx and not
     delivered to the Security Class.

     Security class specifications are not included in this
     document.

 * DEBUG type (Debug packet)

     Rx supports an optional debugging interface; see the
     ``Debugging'' section below for more details.

     DEBUG packets might or might not be associated with an Rx
     connection.  When associated with an Rx connection the
     securityClass must be Rxnull.  Rx acceptors that associate
     DEBUG packets with Rx connections and require a successful
     reachability test before issuing a response will not
     interoperate with Rx initiators that do not associate DEBUG
     packets with an Rx connection.

 * PARAMS types (Parameter exchange)

     Three types were assigned in IBM AFS 3.2 as Connection
     level packet types but never used in a production
     deployment, and therefore have no protocol significance
     at this time.

     It should be noted that packets with these types are
     currently ignored on receipt and should not be responded
     to with a connection level ABORT.

 * VERSION type (Get Rx version)

     If a peer receives a VERSION packet with the CLIENT-INITIATED
     flag set, it may respond with a VERSION packet containing a
     NUL-terminated payload.  The payload might identify the version
     of Rx software it is running.  The response must not have the
     CLIENT-INITIATED flag set.

     Nothing should respond to a VERSION packet without the
     CLIENT-INITIATED flag set, to avoid infinite packet loops.

     VERSION packets might or might not be associated with an Rx
     connection.  When associated with an Rx connection the
     securityClass must be Rxnull.  Rx acceptors that associate
     VERSION packets with Rx connections and require a successful
     reachability test before issuing a response will not
     interoperate with Rx initiators that do not associate VERSION
     packets with an Rx connection.
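
The VERSION exchange described above can be sketched as follows.  The
reply is modelled as a plain dictionary purely for illustration, and
handle_version and its field names are hypothetical.

```python
def handle_version(client_initiated, version_string):
    """Return a reply description for a VERSION packet, or None.

    Only requests carrying the CLIENT-INITIATED flag are answered, and
    the reply never carries that flag, so two peers cannot keep
    answering each other.
    """
    if not client_initiated:
        return None
    return {
        "type": "VERSION",
        "client_initiated": False,
        # NUL-terminated payload that might identify the Rx software.
        "payload": version_string.encode("ascii") + b"\x00",
    }
```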

Call Layer
==========

	The call layer provides a reliable data transport over an
	Rx channel, and is used by the RPC layer to make Rx calls.
	One of the most important pieces of the call layer is the
	Rx ACK packet.  The ACK packet is used by Rx to determine
	when retransmissions are needed, as well as determining
	the proper transmission / receiving parameters to use
	(such as the transmit window size and jumbogram length,
	described in more detail below).

	A new call is established by the initiator sending a
	DATA packet with sequence number one (1) to the acceptor
	on an available channel.  Either side can indicate that
	they have no more DATA packets to send by setting the
	LAST-PACKET flag in their final DATA packet (which might
	be sequence number one).  Each call remains open until
	the upper layer informs Rx that it is done with the call.
	(The upper layer in this case would most likely be the Rx
	RPC layer.)

	The structure of an Rx ACK packet is described in the
	"Packet Formats" section.  This section will refer to
	particular fields of the ACK packet by names.

	The <Buffer Space> field is unused and should be set to
	zero (0).  It was originally intended to store the number
	of packet buffers that the ACK sender can provide for
	receiving DATA packets for this call but was never used
	for this or any other purpose.

	The <Max Skew> field is unused and should be set to zero (0).
	It was originally intended to share the maximum packet skew
	that the sender of the ACK packet has observed for this Rx
	peer with the intent that retransmission be avoided due to
	expected out of order delivery.  See the "Historical
	Implementation Notes" section.

	For example, if a packet is received N packets later than
	expected (based on the packet's serial number, i.e. if
	the last received packet's serial number is N higher than
	this packet's), then it is defined to have a skew of N.
	This can be used to avoid retransmission because of packet
	reordering.  However, the reliance on serial numbers, which
	are assigned to packets of all types across all calls of a
	connection, makes this skew measurement particularly
	unreliable.

	The <First Packet> field specifies the Sequence number of
	the first DATA packet that would be explicitly acknowledged
	(either positively or negatively) by this packet if the
	<Ack Count> is non-zero.  All DATA packets with Sequence
	numbers smaller than this are implicitly acknowledged.

	The <First Packet> Sequence number must never go backwards.
	Any ACK packet received with an out-of-order <First Packet>
	Sequence number should be ignored.

	When the EXTENDED-SACK Rx header flag is not set, the
	<Previous Packet> field should specify the largest DATA
	packet Sequence number accepted (that is, not dropped) by
	the issuer of the ACK packet.

	If the EXTENDED-SACK flag is set, this field must specify the
	largest DATA packet Sequence number accepted by the issuer of
	the ACK packet.  The Sequence number must not be that of a
	discarded packet such as one that exceeded the window size,
	nor should the value of <Previous Packet> ever go backwards,
	although it is permitted to wrap.

	See the "Historical Implementation Notes" section for details
	on various implementation specific deviations that make use
	of this field unreliable.

	The <Serial> field indicates the serial number of the
	packet which triggered this ACK packet, or zero if there
	is no such packet (i.e. the ACK packet was delayed and should not
	be used for round-trip time computation).  The receiver should
	note that any DATA packets transmitted with a serial number less
	than this, which are not acknowledged by this packet, are likely
	lost or reordered.  Thus, these packets may be retransmitted,
	after a possible delay to allow for packet reordering (as
	measured by packet skew).

	The <Reason> field specifies a particular type of an ACK packet.
	Valid reason codes are specified in the ``Packet Formats and
	Protocol Constants'' section; their meanings are as follows:

	REQUESTED
		Acknowledgement was requested.  The sender of this ACK
		received a DATA or ACK packet with the REQUEST-ACK
		flag set, and this packet is acknowledging it.  If
		sent in response to a DATA packet, the DATA packet's
		serial number is in the <Serial> field.

	DUPLICATE
		A duplicate DATA packet was received.  The duplicate
		DATA packet's serial number is in the <Serial> field.

	OUT-OF-SEQUENCE
		A DATA packet was received out of sequence.  The serial
		number of the DATA packet is in the <Serial> field.

	EXCEEDS-WINDOW
		A DATA packet was received whose Sequence number
		exceeded the current receive window, and was dropped.
		The serial number of the DATA packet is in the <Serial>
		field.

	NOSPACE
		A DATA packet was received, but no buffer space was
		available and therefore it was dropped.  The serial
		number of the dropped DATA packet is in the <Serial>
		field.

	PING
		This is a keep-alive packet, used to verify that
		the peer is still alive.  If the REQUEST-ACK flag
		in the Rx packet is set, the recipient of this
		packet should reply with a PING-RESPONSE packet.

	PING-RESPONSE
		This is a response to a PING ACK packet with the
		REQUEST-ACK flag set.  The serial number of the
		PING ACK is in the <Serial> field.

	DELAY
		A delayed acknowledgement, usually because a certain
		amount of time has passed since the receipt of the
		last DATA packet and there are outstanding
		unacknowledged DATA packets.

		DELAY ACKs should not be used for RTT computations.

	IDLE
		Similar to DELAY but can be used for RTT computation.
		Introduced by OpenAFS 1.2.

	A peer should never delay the transmission of an ACK packet
	in response to a received packet unless it sets the <Reason>
	field to DELAY.  This is because ACK packets (except for
	DELAY ones) are used for RTT computation by Rx peers.

	All acknowledgement packets should clear the REQUEST-ACK
	flag in the Rx header, except when the <Reason> field is
	set to PING.

	The <Ack Count> field specifies the size of the variable-
	length Selective Acknowledgements Table or <SACK>.  The
	<Ack Count> field can specify a <SACK> size between 0 and
	255 octets.

	The <SACK> is a variable-length Selective Acknowledgements
	Table whose size is specified by the <Ack Count> field.
	When the <Ack Count> is zero, there is no <SACK> table.

	When the <Ack Count> is greater than zero, the 0th bit of
	each octet represents a single DATA packet by sequence
	number.  The range of DATA packets whose acknowledgement
	state is represented by the <SACK> is <First Packet>
	through (<First Packet> + <Ack Count> - 1) inclusive.

	The meaning of each bit is as follows:

	0	Explicit negative acknowledgement: packet with the
		corresponding sequence number has not been received
		or has been dropped.

	1	Explicit acknowledgement: packet with the corresponding
		sequence number has been received but may yet be
		dropped.

	When the EXTENDED-SACK Rx Packet Header flag is set and the
	<Ack Count> equals 255, the number of DATA packets represented
	in the SACK table is

	NumAcks0 := MIN(<Previous Packet> - <First Packet> + 1, 2048)

	If <NumAcks0> is greater than 255, the SACK table is extended
	to a total of 256 octets by stealing the first of the reserved
	octets before the trailers.  This octet must be set to zero if
	<NumAcks0> is less than 256.

	It's important to note the distinction between packets with
	Sequence numbers prior to <First Packet>, between <First Packet>
	and (<First Packet> + <Ack Count> - 1), and those with Sequence
	numbers of at least (<First Packet> + <Ack Count>).  Those in
	the first category have been hard-acknowledged and must not be
	dropped in the future; the DATA sender (ACK recipient) is
	permitted to recycle DATA packets once the leading edge of the
	window advances.

	Packets in the second category are individually soft-acknowledged
	in the <SACK>, either as being queued for the application or
	not received.  The DATA sender (ACK recipient) must keep all
	packets with sequence numbers in this range, but avoid
	retransmitting the positively acknowledged ones.  Negatively
	acknowledged packets should be retransmitted according to the
	DATA sender's flow control algorithm.

	Packets in the third category are not acknowledged at all,
	and the DATA sender (ACK recipient) should assume no knowledge
	of their state; even if the Rx receive window exceeds the size
	of the <SACK>.
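
	The three categories above can be sketched as a small classifier.
	It covers only the non-extended <SACK> format, ignores Sequence
	number wrap, and uses hypothetical names and labels.

```python
def classify(seq, first_packet, sack):
    """Classify DATA packet `seq` against the contents of an ACK.

    `sack` is the list of <SACK> octets; its length is the <Ack Count>.
    """
    if seq < first_packet:
        return "hard-acked"  # may be recycled by the DATA sender
    index = seq - first_packet
    if index < len(sack):
        # Bit 0 of each octet carries the acknowledgement state.
        return "soft-acked" if sack[index] & 1 else "nacked"
    return "unknown"  # beyond the table: assume nothing about its state
```

	For example, with a <First Packet> of 10 and a <SACK> of
	[1, 0, 1], packet 11 is negatively acknowledged and is a
	candidate for retransmission, while packet 13 is not covered.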

	If the EXTENDED-SACK flag is set and <NumAcks0> is greater
	than 255, the <SACK> size is 256 octets.  The extended <SACK>
	acknowledges up to 2048 DATA packets using horizontal
	striping.  The two octets following the <SACK> are
	<Trailer Count> and <Extra SACK Count>.  <Trailer Count> is
	the number of 32-bit <Ack Trailer> fields; currently four (4).
	<Extra SACK Count> is the number of additional extended SACK
	tables that follow the <Ack Trailer> fields.  Each additional
	Extended SACK can acknowledge up to an additional 2048 DATA
	packets.

	The optional <Ack Trailer> fields are not 32-bit aligned with
	respect to the packet.  If the EXTENDED-SACK flag is unset, the
	first field begins three octets after the end of the variable-
	length <SACK>.  If the EXTENDED-SACK flag is set, the first
	field begins following the <Extra SACK Count> field.  Four
	32-bit fields are defined: <Max MTU>, <Interface MTU>,
	<Datagram Packets>, and <Receive Window Size>.  Unrecognized
	<Ack Trailer> fields should be ignored.  Their presence
	depends on the version of the Rx peer; see the "Historical
	Implementation Notes" section for details.

	The <Max MTU> and <Interface MTU> packet sizes are,
	respectively, the largest possible packet size that the peer
	is willing to accept from the ACK receiver, and the size of
	the packet the peer would prefer to receive.  In the absence
	of these fields, it should be assumed that the maximum and
	interface Rx packet sizes are 1444 bytes.

	    (1500 - IPv6 header (40) - IPv6 fragment header (8)
	    - UDP header (8))

	The <Receive Window Size> indicates the size of the ACK
	sender's receive window, in packets.  The maximum <Receive
	Window Size> is 65535 packets.  If this field is absent, the
	implementation must assume a receive window of 16 packets;
	Rx implementations that do not support this trailing field
	implemented a fixed window size of 16 packets.

	The <Datagram Packets> field indicates how many DATA packets
	the ACK sender is willing to receive in a jumbogram (also
	described below).  All DATA packets in a jumbogram (except
	the last one) are always 1412 bytes, regardless of the <Max MTU>
	and <Interface MTU> packet sizes described above.  When the
	<Datagram Packets> field is missing, the ACK receiver must
	assume a value of one (1) packet.
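
	The defaults that apply when <Ack Trailer> fields are absent can
	be collected as sketched below.  The dictionary keys and the
	function name are hypothetical; the sketch works on already
	decoded values and says nothing about the order of the fields
	on the wire.

```python
# 1500 minus the IPv6 (40), IPv6 fragment (8), and UDP (8) headers.
DEFAULT_PACKET_SIZE = 1500 - 40 - 8 - 8

TRAILER_DEFAULTS = {
    "max_mtu": DEFAULT_PACKET_SIZE,
    "interface_mtu": DEFAULT_PACKET_SIZE,
    "datagram_packets": 1,
    "receive_window": 16,
}


def effective_trailers(received):
    """Merge decoded trailer fields over the defaults for missing ones."""
    params = dict(TRAILER_DEFAULTS)
    for name, value in received.items():
        if name in params:  # unrecognized fields are ignored
            params[name] = value
    return params
```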

 * Round-trip time computation

	To determine when packet retransmission is necessary, Rx
	computes some statistics about the round-trip time between
	the two hosts:  exponentially-decaying averages of the
	round-trip time and the standard deviation thereof.  Each
	acknowledgement packet which mentions a specific packet in
	the <Serial> field and is not delayed is used to update the
	round-trip statistics.  First, the round-trip time for this
	packet (R) is computed as the difference between the arrival
	time of the ack packet and the time we transmitted the
	packet with the serial number specified in <Serial>.

	Next, the round-trip time average and standard deviation
	values are updated.  For instance, this algorithm could
	be used:

		RTTdev = RTTdev * (3/4) + |RTTavg - R| / 4
		RTTavg = RTTavg * (7/8) + R / 8
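
	The example update rule above could be written as follows.  This
	is a sketch rather than a required algorithm; the function names
	are hypothetical and all times are assumed to be in seconds.

```python
def usable_for_rtt(reason, serial):
    """An ACK yields a sample only if it names a packet and was not delayed."""
    return serial != 0 and reason != "DELAY"


def update_rtt(rtt_avg, rtt_dev, sample):
    """Fold one round-trip sample R into the running statistics."""
    # The deviation uses the average from before this sample is applied,
    # matching the order of the two formulas above.
    new_dev = rtt_dev * 3 / 4 + abs(rtt_avg - sample) / 4
    new_avg = rtt_avg * 7 / 8 + sample / 8
    return new_avg, new_dev
```

	With an average of 100 ms, a deviation of 20 ms, and a new sample
	of 180 ms, the updated average is 110 ms and the updated
	deviation is 35 ms.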

 * Packet retransmission

	In order to support reliable data transport, Rx must retransmit
	packets that are lost in the network.  This must not be done
	too early, otherwise we might retransmit a packet whose first
	copy is still in transit, thereby wasting bandwidth.

	Rx computes a retransmit timeout value T, and retransmits any
	packet which hasn't been positively acknowledged since last
	transmission for at least T seconds.  This timeout could be
	computed as follows from the round-trip statistics above:

		T = RTTavg + 4 * RTTdev + 0.350

	This allows the packet to be up to 4 deviations late and still
	not be retransmitted.  The 350 msec fudge factor is used to
	compensate for bursty networks, though it is likely becoming
	less relevant (and accurate) with time.

	A more clever algorithm could take into account the maximum
	packet skew rate and the likelihood that a given packet has
	been reordered, and give such a packet extra time before
	retransmission.
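
	A minimal sketch of the timeout formula and the resulting
	retransmission test follows.  The names are hypothetical, times
	are in seconds, and the skew-aware refinement is not modelled.

```python
FUDGE_SECONDS = 0.350  # the 350 msec allowance for bursty networks


def retransmit_timeout(rtt_avg, rtt_dev):
    """Compute T from the round-trip statistics."""
    return rtt_avg + 4 * rtt_dev + FUDGE_SECONDS


def needs_retransmit(now, last_sent, acked, timeout):
    """True if a packet has gone unacknowledged for at least `timeout`."""
    return (not acked) and (now - last_sent) >= timeout
```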

 * Keepalive and Timeout

	The upper layer (either the Rx RPC layer or the application)
	has to specify a timeout, T, to the call layer.  If the peer
	is not heard from within T seconds, the call layer declares
	the call to be dead and propagates the error to the upper
	layer.

	In order to determine whether the peer is still alive or not,
	keepalive requests are used.  These take the form of PING and
	PING-RESPONSE ACK packets.  When the client has not received
	any response from the server, either to the original request
	or the keepalive requests, in T seconds, the call times out.

	The following strategy may be used to determine when to send
	keepalive requests:

		Compute a keepalive timeout, KT = T/6

		If the call was initiated KT seconds ago, or KT
		seconds have passed since the last keepalive
		request transmission, send a keepalive packet.

	This strategy limits the number of transmitted keepalive
	packets to a fixed number in the case of a dead server,
	and proportional to the real timeout in case of a slow
	server.  It also allows up to 5 keepalives to be dropped
	before the server is erroneously declared dead.
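
	The strategy above could be sketched as follows.  The function
	names are hypothetical and times are in seconds measured on a
	single clock.

```python
def keepalive_interval(call_timeout):
    """KT = T/6."""
    return call_timeout / 6


def keepalive_due(now, call_start, last_keepalive, call_timeout):
    """True if KT has elapsed since the call began or the last keepalive."""
    reference = call_start if last_keepalive is None else last_keepalive
    return now - reference >= keepalive_interval(call_timeout)


def call_dead(now, last_heard, call_timeout):
    """True if the peer has not been heard from within T seconds."""
    return now - last_heard >= call_timeout
```

	With T set to 60 seconds, a keepalive is sent every 10 seconds
	until the peer answers or the call is declared dead.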

 * Flow Control

	Every Rx client or server has associated with each Rx call a
	receive and transmit window.  These windows indicate the number
	of packets that have not been fully acknowledged (that is, not
	read by the peer's application) that an Rx sender can have
	outstanding at any time.  A sender's transmit window may
	never be greater than its peer's receive window for that call.
	The receive windows are exchanged via the <Receive Window Size>
	parameter in an Ack packet.

	Rx ``sliding windows'' are similar to those used by TCP, except
	they measure packets rather than bytes.  Also, in TCP the window
	effectively applies to bytes in flight between the two peers,
	whereas in Rx the window applies to packets between the user
	applications.  For example, a transmit window of 8 on a certain
	Rx connection means that at most 8 packets can be transmitted
	and not yet read by the peer's application at any time.  The
	sequence number of the first packet that hasn't been read by
	the application is indicated by the <First Packet> field of
	an Ack packet.

	The selection of initial window sizes isn't strictly defined by
	the Rx protocol, but historically the initial window size must
	be 16 packets.  The <Receive Window Size> Ack Trailer field can
	adjust the window up or down as necessary.

	Rx uses the slow start, congestion avoidance, and fast recovery
	algorithms[6].  The algorithms are modified to work in the context
	of Rx packet-based transmission windows, and are described below.

	These algorithms require two additional variables to be maintained
	for each active Rx call: a congestion window, cwind, and a slow
	start threshold, ssthresh.

	Define a "negative ack" as an Ack packet that contains a negative
	acknowledgement followed by a positive one.  Similarly, define a
	"positive ack" to be any Ack that is not negative.  Upon receiving
	three negative acks for a call in a row since the last congestion
	avoidance attempt (if any), the Rx protocol enters congestion
	avoidance for that Rx call.

	 * Slow start, congestion avoidance, and fast recovery algorithms

		First, the congestion window, cwind, is initialized to 1.
		The number of unread transmitted packets is now limited not
		only by the transmission window, but also by the congestion
		window.  The latter limit is a little different:  Rx may
		send up to cwind packets (by sequence number) past the last
		contiguous positively acknowledged packet.  For example,
		if an Ack packet indicates that packets 1, 2 and 8 were
		received, and cwind is 2, Rx may transmit packets 3 and 4.

		When congestion occurs (indicated by a negative ack or a
		packet retransmission timeout), Rx enters congestion avoidance
		and fast recovery.  The slow-start threshold, ssthresh, is
		set to half of the effective transmission window (minimum of
		cwind and transmit window), but no less than 2 packets.

		If triggered by a negative ack, any negatively acknowledged
		packets should be retransmitted as soon as possible (i.e.
		window-permitting).

		If triggered by a retransmission timeout, the congestion
		window is reset to a single packet.

		When in fast-recovery mode, every additional negative ack
		packet received causes cwind to be increased by one packet.
		A positive ack packet causes cwind to be set to ssthresh,
		and terminates fast recovery.  At this point we are back
		to congestion avoidance, since the cwind is half the original
		transmission window.

		When packet acknowledgements are received, the congestion
		window should be increased.  If cwind is less than ssthresh,
		cwind should be increased by 1 for each newly acknowledged
		packet.  If cwind is at least ssthresh, cwind is increased
		by 1 for each newly received Ack packet.
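
	The window updates described above can be sketched as a small
	state machine.  This is an illustration under stated
	assumptions, not a reference implementation: all names are
	hypothetical, the initial ssthresh is assumed to equal the
	transmit window (the text does not give one), a retransmission
	timeout is assumed to leave fast recovery, and the caller is
	assumed to decide when a negative ack signals congestion.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-call congestion state, counted in packets. */
struct rx_cc {
    uint32_t cwind;         /* congestion window */
    uint32_t ssthresh;      /* slow start threshold */
    uint32_t twind;         /* transmit window */
    bool     fast_recovery;
};

void rx_cc_init(struct rx_cc *cc, uint32_t twind)
{
    cc->cwind = 1;
    cc->ssthresh = twind;   /* assumption: no initial value is specified */
    cc->twind = twind;
    cc->fast_recovery = false;
}

/* ssthresh = half of min(cwind, transmit window), but at least 2. */
static void rx_cc_halve(struct rx_cc *cc)
{
    uint32_t eff = cc->cwind < cc->twind ? cc->cwind : cc->twind;

    cc->ssthresh = eff / 2 < 2 ? 2 : eff / 2;
}

/* A negative ack enters fast recovery; each additional negative ack
 * received while in fast recovery grows cwind by one packet. */
void rx_cc_on_negative_ack(struct rx_cc *cc)
{
    if (cc->fast_recovery) {
        cc->cwind += 1;
        return;
    }
    rx_cc_halve(cc);
    cc->fast_recovery = true;
}

/* A retransmission timeout resets cwind to a single packet. */
void rx_cc_on_timeout(struct rx_cc *cc)
{
    rx_cc_halve(cc);
    cc->cwind = 1;
    cc->fast_recovery = false;  /* assumption, see lead-in */
}

/* A positive ack ends fast recovery, or else grows cwind: by one per
 * newly acknowledged packet below ssthresh, by one per Ack otherwise. */
void rx_cc_on_positive_ack(struct rx_cc *cc, uint32_t newly_acked)
{
    if (cc->fast_recovery) {
        cc->cwind = cc->ssthresh;
        cc->fast_recovery = false;
    } else if (cc->cwind < cc->ssthresh) {
        cc->cwind += newly_acked;
    } else {
        cc->cwind += 1;
    }
}

/* Highest sequence number the congestion window permits sending,
 * given the last contiguous positively acknowledged packet. */
uint32_t rx_cc_highest_sendable(const struct rx_cc *cc, uint32_t last_contig)
{
    return last_contig + cc->cwind;
}
```

	With cwind equal to 2 and packets 1 and 2 contiguously
	acknowledged, rx_cc_highest_sendable() yields 4, matching the
	example above.  The transmit window limit still has to be
	applied separately by the caller.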

	The advertised <Receive Window Size> can be larger than the size
	of the Rx ACK packet Selective Acknowledgement table.  The sender
	of DATA packets whose sequence numbers do not fit within the SACK
	will not receive any feedback on the reception state of such
	packets.

Debugging
=========

Rx provides for an optional debugging interface, using the DEBUG
packet type, allowing remote Rx clients to query an Rx peer for
some Rx protocol statistics.  Implementations are not required
to implement this interface.  Some parts of this interface may also
be specific to a particular implementation of Rx.  In order to prevent
packet loops, a server should only reply to DEBUG packets with the
client-initiated flag set.

The payload of a debug request packet is always the same; both of
the 32-bit quantities are in network byte order:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           Debug Type                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           Debug Index                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The debug type indicates the kind of debug information being sent
or requested, and determines the format of the rest of the packet.
The debug index allows some debug types to export array-like data,
indexed by this field.  The following debug types are defined for
the Transarc implementation:

	0x01	Retrieve basic connection statistics
	0x02	Get information about some connections
	0x03	Get information about all connections
	0x04	Get all Rx stats
	0x05	Get all peers of this server

The index field in the debug packet indicates which element of the
debug information the client wants to access, in cases where there
are multiple entries in question.
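
As an illustration, the eight-octet request payload could be encoded
as shown here.  The constant and function names are hypothetical and
are not taken from rx/rx.h; only the layout (two 32-bit quantities in
network byte order) comes from the description above.

```c
#include <stddef.h>
#include <stdint.h>

/* Debug types listed above; the names are illustrative only. */
#define DBG_TYPE_BASIC_STATS 0x01u
#define DBG_TYPE_SOME_CONNS  0x02u
#define DBG_TYPE_ALL_CONNS   0x03u
#define DBG_TYPE_RX_STATS    0x04u
#define DBG_TYPE_PEERS       0x05u

#define DBG_REQUEST_LEN      8   /* Debug Type + Debug Index */

/* Store a 32-bit value in network byte order. */
static void dbg_put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Read a 32-bit value stored in network byte order. */
uint32_t rx_debug_get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Write a debug request payload into buf.  Returns the number of
 * octets written, or 0 if buf is too small. */
size_t rx_debug_encode_request(uint8_t *buf, size_t buflen,
                               uint32_t type, uint32_t index)
{
    if (buflen < DBG_REQUEST_LEN)
        return 0;
    dbg_put_be32(buf, type);
    dbg_put_be32(buf + 4, index);
    return DBG_REQUEST_LEN;
}
```

A client walking an array-like debug type would issue one request
per index value and stop at the end-of-list marker for that type.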

The responses to each of those debug queries contain the following
information:

1. Retrieve basic connection stats

	An array of general statistics about packet allocation,
	server performance, and so on.  The first octet in this
	response represents the debug protocol version being used
	by the server.  See RX_DEBUGI_VERSION* in rx/rx.h.

2, 3. Get information about connections

	Both of these calls return a struct rx_debugConn (see
	rx/rx.h), indexed by the "index" field.

	The first version of the debug call (type 2) only retrieves
	information about connections which are deemed interesting,
	that is, connections which are active, or about to be
	reaped.

	The end of the list is signaled by a response where the
	connection ID value is 0xFFFFFFFF.

4. Get Rx stats

	This call returns a struct rx_stats to the client in network
	byte order, containing various statistics about the state of
	Rx on the server (see rx/rx.h).

5. Get all Rx peers

	Similar to the connection request above (2, 3) this call
	returns all the Rx peers of the server (in a network-byte-order
	struct rx_debugPeer), indexed by the index field in the request.
	End of list is indicated by a host value of 0xFFFFFFFF.  (These
	are the first 4 octets.)

In response to unknown requests, the server returns 0xFFFFFFF8 in the
debug type field.

	XXX	The response interface should probably be fixed
		to include a fixed header that indicates whether
		the request was successfully completed.

Jumbograms
==========

To be able to transmit more data in a single packet, Rx supports
``jumbograms'', which are single UDP datagrams containing multiple
sequential Rx DATA packets.  In a jumbogram, all packets except the
last one must be of a fixed maximal size (1412 bytes).  Because all
the packets in the jumbogram are sequential, only one full header
is needed.  Here is what a jumbogram could look like:

  +-----------+---------------+--------------+---------------+
  | Rx header | 1412 byte pkt | Short header | 1412 byte pkt | ->
  +-----------+---------------+--------------+---------------+

      +--------------+-   -+-----------------------+
   -> | Short header | ... | <= 1412 byte last pkt |
      +--------------+-   -+-----------------------+

Every Rx packet in a jumbogram except the first one must be preceded
by the short Rx header, and all packets except the last one must have
the Jumbogram Rx flag set in their respective headers.  The number of
packets in a jumbogram may not exceed the peer's advertised Max Packets
Per Jumbogram value in the Ack packet.

The maximum number of packets per jumbogram should be assumed to be 1
(i.e., no jumbograms) unless explicitly specified otherwise by an Ack
packet.  If an Ack packet is received without the packet-per-jumbogram
field, it might indicate that the peer is now running a version of Rx
that does not support jumbograms, and therefore no jumbograms should
be sent until they are explicitly enabled again.

The short header in a jumbogram has the following makeup:

    0                   1
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Flags     |    Reserved   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |       Security Specific       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

All the packets in the jumbogram have the same Rx header fields
(from the full Rx header) except for <Flags>, <Security Specific>,
<Sequence>, and <Serial>.  The <Flags> and <Security Specific>
fields for subsequent packets are taken from the short header
preceding that packet in the jumbogram.  The Sequence and Serial
Numbers are assumed to be consecutive, and are incremented by 1
from the first packet in the jumbogram (i.e., the full Rx header).

Retransmitted packets should not be sent in a jumbogram.
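
The layout described in this section can be walked as sketched here.
The function name is hypothetical, and the sketch assumes that the
buffer starts immediately after the full Rx header and that the first
octet of each short header is <Flags>, as drawn above.

```c
#include <stddef.h>
#include <stdint.h>

#define JUMBO_FLAG      0x20  /* JUMBO-PACKET flag (DATA only) */
#define JUMBO_PKT_SIZE  1412  /* size of every packet but the last */
#define JUMBO_SHORT_HDR 4     /* Flags, Reserved, Security Specific */

/* Count the DATA packets carried in one jumbogram.
 *
 * first_flags is <Flags> from the full Rx header; payload and len
 * describe the octets that follow that header.  Returns the packet
 * count, or -1 if the datagram is malformed. */
int rx_jumbo_count(uint8_t first_flags, const uint8_t *payload, size_t len)
{
    uint8_t flags = first_flags;
    size_t off = 0;
    int count = 0;

    /* Each packet with the jumbo flag set is exactly 1412 octets and
     * is followed by the short header of the next packet. */
    while (flags & JUMBO_FLAG) {
        if (len - off < JUMBO_PKT_SIZE + JUMBO_SHORT_HDR)
            return -1;
        off += JUMBO_PKT_SIZE;
        flags = payload[off];   /* <Flags> of the next packet */
        off += JUMBO_SHORT_HDR;
        count++;
    }

    /* The last packet may be shorter, but never longer. */
    if (len - off > JUMBO_PKT_SIZE)
        return -1;
    return count + 1;
}
```

A receiver would additionally derive the Sequence and Serial numbers
of packet k by adding k to the values in the full Rx header.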

RPC Layer
=========

This section discusses how an RPC call is made using the Rx protocol.
There are two common ``types'' of Rx calls: simple and streaming
(aka split).  These mostly reflect a difference in the upper-level
API rather than in the Rx protocol.  A simple Rx call has a fixed
number of input variables and a fixed number of output variables.
A streaming (or split) Rx call, in addition to the above, allows
the user to send and receive arbitrary amounts of data (whose length
should be specified as a fixed-length argument).

In either case, an Rx call consists of two basic stages: client
sending the data to the server, and server sending the response
back to the client.  No data can be sent by the client after the
client sends a DATA packet with the LAST-PACKET flag set.  The
server must not send any data packets to the client before it
receives all of the client's DATA packets up to and including the
LAST-PACKET.  The call successfully completes when the client
receives all of the server's DATA packets up to and including the
LAST-PACKET.  After receiving the LAST-PACKET, the receiver must
confirm that the sender did not transmit more DATA than was
expected.

When Rx services use XDR for marshaling, each remote function call
associated with the Rx service (identified by the IP-port-serviceId
triplet) is assigned a 32-bit integer opcode number.  To make a
simple Rx call, the caller must transmit the opcode number followed
by the expected arguments for that call over an Rx channel using XDR
encoding.  The callee uses XDR to unmarshal the opcode and input
arguments, performs a function call corresponding to that opcode
and arguments, and then uses XDR to encode the return values back
to the caller.  The caller then uses XDR to receive the output
variables.

For streaming calls which send data from the caller to the callee,
one convention is to include the length of the data to be sent as
one of the fixed-length arguments, and send the variable-length
data immediately after the fixed-length portion.  For streaming
calls which receive data, one convention is for the callee to first
reply with a fixed-length field specifying the number of bytes it's
about to send, and then send those bytes.  Upon completion of the
streaming part of the call, the output arguments are sent back to
the caller in fixed-length XDR form, as with simple calls.
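
As a rough sketch, the request for a simple call could be marshaled
as shown here.  The function names are hypothetical, and every
argument is assumed to be an XDR unsigned integer (four octets, most
significant octet first); real services mix argument types according
to their own interface definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Store one XDR unsigned integer: 4 octets, big-endian. */
static void xdr_put_u32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Marshal the opcode followed by nargs 32-bit arguments.  Returns
 * the number of octets written, or 0 if buf is too small. */
size_t rx_marshal_simple_call(uint8_t *buf, size_t cap, uint32_t opcode,
                              const uint32_t *args, size_t nargs)
{
    size_t need = 4 * (nargs + 1);
    size_t i;

    if (cap < need)
        return 0;
    xdr_put_u32(buf, opcode);
    for (i = 0; i < nargs; i++)
        xdr_put_u32(buf + 4 * (i + 1), args[i]);
    return need;
}
```

The resulting octets would then be written to the call as DATA
packets, with the LAST-PACKET flag set on the final one.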

Packet Formats and Protocol Constants
=====================================

 * Rx packet

	Every simple Rx packet has an Rx header, of the form below.
	All quantities are in network byte order.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |+|                     Connection Epoch                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Connection ID                     | * |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          Call Number                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Sequence Number                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Serial Number                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Type      |     Flags     |  User Status  |  Security ID  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Security Specific      |          Service ID           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Payload  ....
   +-+-+-+-+-

	[*]	The field marked with * is the Channel ID.  The last
		two bits of the connection ID are used to multiplex
		between 4 parallel calls.

	[+]	The bit marked with + is the Ignore-Source Flag used
		to indicate that only the connection ID should be used
		to identify this connection, and sender host/port
		should not be used.
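
	A minimal parsing sketch for this header is shown here.  The
	structure and function names are hypothetical; the field
	offsets follow the diagram above.

```c
#include <stddef.h>
#include <stdint.h>

#define RX_HDR_SIZE       28           /* seven 32-bit words */
#define RX_IGNORE_SOURCE  0x80000000u  /* [+] high bit of the Epoch */
#define RX_CHANNEL_MASK   0x3u         /* [*] low two bits of the CID */

struct rx_hdr {
    uint32_t epoch;
    uint32_t cid;
    uint32_t call_number;
    uint32_t sequence;
    uint32_t serial;
    uint8_t  type;
    uint8_t  flags;
    uint8_t  user_status;
    uint8_t  security_id;
    uint16_t security_specific;
    uint16_t service_id;
};

static uint32_t hdr_rd32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint16_t hdr_rd16(const uint8_t *p)
{
    return (uint16_t)(((unsigned)p[0] << 8) | p[1]);
}

/* Decode the header from network byte order.  Returns 0 on success
 * or -1 if fewer than 28 octets are available. */
int rx_hdr_parse(const uint8_t *buf, size_t len, struct rx_hdr *h)
{
    if (len < RX_HDR_SIZE)
        return -1;
    h->epoch             = hdr_rd32(buf);
    h->cid               = hdr_rd32(buf + 4);
    h->call_number       = hdr_rd32(buf + 8);
    h->sequence          = hdr_rd32(buf + 12);
    h->serial            = hdr_rd32(buf + 16);
    h->type              = buf[20];
    h->flags             = buf[21];
    h->user_status       = buf[22];
    h->security_id       = buf[23];
    h->security_specific = hdr_rd16(buf + 24);
    h->service_id        = hdr_rd16(buf + 26);
    return 0;
}

/* Channel ID: which of the 4 parallel calls this packet belongs to. */
unsigned rx_hdr_channel(const struct rx_hdr *h)
{
    return h->cid & RX_CHANNEL_MASK;
}

/* Nonzero when the Ignore-Source Flag is set in the Epoch. */
int rx_hdr_ignore_source(const struct rx_hdr *h)
{
    return (h->epoch & RX_IGNORE_SOURCE) != 0;
}
```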

	The following packet type values are defined:

	1		DATA		Standard data packet
	2		ACK		Acknowledgement of received data
	3		BUSY		Busy response
	4		ABORT		Abort packet
	5		ACKALL		Acknowledgement of all packets
	6		CHALLENGE	Challenge request
	7		RESPONSE	Challenge response
	8		DEBUG		Debug packet
	9		PARAMS		Exchange of parameters (ignored)
	10		UNUSED_1	Unused and ignored
	11		UNUSED_2	Unused and ignored
	12		UNUSED_3	Unused and RX_PROTOCOL_ERROR abort
	13		VERSION		Get Rx version
	Any other unrecognized packet type returns an RX_PROTOCOL_ERROR abort

	The values for the Flags field are defined as follows:

	0000 0001	CLIENT-INITIATED (Any packet type)
	0000 0010	REQUEST-ACK 	 (DATA and ACK only)
	0000 0100	LAST-PACKET 	 (DATA only)
	0000 1000	MORE-PACKETS 	 (DATA only)
	0000 1000   	EXTENDED-SACK    (ACK only)
	0001 0000	- Reserved -	 (See Historical Implementation Notes)
	0010 0000	SLOW-START-OK 	 (ACK only)
	0010 0000	JUMBO-PACKET  	 (DATA only)

	AFS3 Rx services commonly, but not necessarily, use the
	following value mappings for the Security field:

	0		No security or encryption
	1		bcrypt security, only used in AFS 2.0
	2		"krb4" rxkad
	3		"krb4" rxkad with encryption (sometimes)

 * Rx acknowledgement packet (EXTENDED-SACK Flag unset)

   This is the legacy ACK packet format which is applicable whenever
   an ACK packet is received without the Rx Packet Header Flag
   EXTENDED-SACK (8) set.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Buffer Space          |          Max Skew             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          First Packet                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Previous Packet                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                            Serial                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Reason    |   Ack Count   | SACK table (0 to 255 octets)...
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ ..

	   ...  -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       ... SACK    |  Reserved[0]  |  Reserved[1]  |  Reserved[2]  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Maximum Packet Size                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  Recommended Packet Size                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Receive Window Size                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 Max Packets per Jumbogram                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

	Note that the trailing fields can have arbitrary alignment,
	determined by the size of the variable length SACK table in
	the packet.

	There are three reserved and unaligned octets between the
	SACK table and the start of the trailing fields: <Reserved[0]>,
	<Reserved[1]> and <Reserved[2]>.  All three reserved octets
	should be zero.

	The valid values for the Reason code are:

	1		REQUESTED
	2		DUPLICATE
	3		OUT_OF_SEQUENCE
	4		EXCEEDS_WINDOW
	5		NOSPACE
	6		PING
	7		PING_RESPONSE
	8		DELAY
	9		IDLE
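
	Because the trailing fields follow a variable length SACK
	table, a parser has to compute their offset and read them
	octet by octet.  The sketch below illustrates this for the
	legacy format.  The names are hypothetical, and the sketch
	assumes the buffer starts at the ACK payload (after the Rx
	header) and that a peer may send fewer than four trailers.

```c
#include <stddef.h>
#include <stdint.h>

#define ACK_FIXED_SIZE   18  /* Buffer Space through Ack Count */
#define ACK_RESERVED     3   /* Reserved[0..2] after the SACK table */
#define ACK_MAX_TRAILERS 4

struct rx_ack_view {
    uint32_t first_packet;
    uint32_t previous_packet;
    uint32_t serial;
    uint8_t  reason;
    uint8_t  ack_count;       /* octets in the SACK table */
    size_t   sack_offset;     /* offset of the SACK table */
    size_t   trailer_offset;  /* offset of Maximum Packet Size */
    unsigned ntrailers;       /* complete trailers present */
    uint32_t trailer[ACK_MAX_TRAILERS];
};

/* Unaligned big-endian read. */
static uint32_t ack_rd32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Returns 0 on success, -1 if the fixed part or the SACK table is
 * truncated.  Missing trailers are not treated as an error. */
int rx_ack_parse_legacy(const uint8_t *p, size_t len, struct rx_ack_view *v)
{
    size_t off;
    unsigned i;

    if (len < ACK_FIXED_SIZE)
        return -1;
    v->first_packet    = ack_rd32(p + 4);
    v->previous_packet = ack_rd32(p + 8);
    v->serial          = ack_rd32(p + 12);
    v->reason          = p[16];
    v->ack_count       = p[17];
    v->sack_offset     = ACK_FIXED_SIZE;

    off = ACK_FIXED_SIZE + v->ack_count;
    if (len < off)
        return -1;

    off += ACK_RESERVED;
    v->trailer_offset = off;
    v->ntrailers = 0;
    for (i = 0; i < ACK_MAX_TRAILERS && off + 4 <= len; i++, off += 4) {
        v->trailer[i] = ack_rd32(p + off);
        v->ntrailers++;
    }
    return 0;
}
```

	In this sketch trailer[2] holds <Receive Window Size> and
	trailer[3] holds <Max Packets per Jumbogram> when present.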

 * Extended Rx acknowledgement packet (EXTENDED-SACK Flag set)

   This is the Extended ACK packet format which is applicable whenever
   an ACK packet is received with the Rx Packet Header Flag
   EXTENDED-SACK (8) set.  This ACK packet format is designed to be
   backward compatible with Rx peers that do not recognize the
   meaning of the EXTENDED-SACK Rx Header Flag.  It differs from the
   Legacy ACK packet format as follows:

   1. The <Previous Packet> field must be set to the largest DATA
      packet Sequence number accepted by the Rx peer.  It must not
      be set to the Sequence number of a dropped DATA packet.  Its
      value must not go backwards (although it is permitted to wrap.)

   2. If <Ack Count> is 255 then the number of DATA packets that are
      explicitly ACKed or NACKed within the first SACK table is

	NumAcks0 := MIN(<Previous Packet> - <First Packet> + 1, 2048)

   3. If <NumAcks0> is greater than 255, the SACK table is extended to
      a total of 256 octets with the addition of <Reserved[0]>.
      <Reserved[0]> must be set to zero if <NumAcks0> is less than 256.

   4. The acknowledgement state of each DATA packet is represented by
      a single bit using horizontal striping.  Up to 2048 DATA packets
      starting with <First Packet> can be represented in the SACK
      table.  The striping pattern is:

	<Offset0> := <Sequence Number> - <First Packet>

	bit     <Offset0>
	 0	0 ..  255
	 1    256 ..  511
	 2    512 ..  767
	 3    768 .. 1023
	 4   1024 .. 1279
	 5   1280 .. 1535
	 6   1536 .. 1791
	 7   1792 .. 2047

   5. <Reserved[1]> becomes <Trailer Count> representing the number of
      32-bit trailers.  At present this value is set to four (4) but
      can be increased if new trailer fields are defined.

   6. <Reserved[2]> becomes <Extra SACKS> representing the number of
      Extra SACK tables that are present in this ACK packet following
      the <Trailer Count> trailers.

   7. Each Extra SACK table is preceded by a one octet <SACK Size>
      field that specifies how many additional octets of the SACK table
      are present.  <SACK Size> of zero indicates an Extra SACK table
      consisting of one octet.  The maximum size of each Extra SACK
      table is 256 octets which is represented by <SACK Size> equal to
      255.

      The number of acknowledged DATA packets in the first Extra
      SACK table is:

	NumAcks1 := MIN(<Previous Packet> - <First Packet> - 2047, 2048)

      The striping pattern is:

	<Offset1> := <Sequence Number> - (<First Packet> + 2048)

	bit     <Offset1>    <Offset0> + 2048
	 0	0 ..  255      2048 .. 2303
	 1    256 ..  511      2304 .. 2559
	 2    512 ..  767      2560 .. 2815
	 3    768 .. 1023      2816 .. 3071
	 4   1024 .. 1279      3072 .. 3327
	 5   1280 .. 1535      3328 .. 3583
	 6   1536 .. 1791      3584 .. 3839
	 7   1792 .. 2047      3840 .. 4095

      Subsequent Extra SACK tables repeat the pattern.

   8. Up to three Extra SACK tables can be included in the ACK packet
      before the ACK packet grows beyond the mandatory minimum IPv6
      MTU size (1280 octets).  The minimum IPv4 MTU size (576 octets)
      is exceeded by the inclusion of one Extra SACK table.  However,
      according to RFC9000[7] the IPv6 minimum is supported by most
      IPv4 networks.

   9. Receivers of ACK packets must ignore any Extra SACK tables
      that are missing or truncated.  The receiver should process the
      ACK packet as if the missing or truncated SACK tables were
      intentionally not sent.

  10. The <Receive Window Size> may be larger than the number of
      DATA packets represented in the provided SACK tables.
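
   The size limits in item 8 can be checked with a little arithmetic.
   The sketch below assumes a 28 octet Rx header, no security class
   overhead, a full 256 octet first SACK table, four trailers, full
   size Extra SACK tables, and IP headers without options (20 octets
   for IPv4, 40 for IPv6) plus an 8 octet UDP header.  The function
   name is hypothetical.

```c
#include <stddef.h>

enum {
    EA_RX_HEADER  = 28,   /* full Rx header */
    EA_ACK_FIXED  = 18,   /* Buffer Space through Ack Count */
    EA_FIRST_SACK = 256,  /* first SACK table including Reserved[0] */
    EA_COUNTS     = 2,    /* Trailer Count and Extra SACKs */
    EA_TRAILERS   = 16,   /* four 32-bit trailers */
    EA_EXTRA_SACK = 257,  /* SACK Size octet plus 256 octet table */
    EA_UDP_HEADER = 8,
    EA_IPV4_HEADER = 20,
    EA_IPV6_HEADER = 40
};

/* UDP payload size of an Extended ACK carrying n full Extra SACK
 * tables, under the assumptions listed above. */
size_t rx_ext_ack_size(unsigned n)
{
    return EA_RX_HEADER + EA_ACK_FIXED + EA_FIRST_SACK + EA_COUNTS +
           EA_TRAILERS + (size_t)n * EA_EXTRA_SACK;
}
```

   Under these assumptions an ACK with no Extra SACK tables is 320
   octets.  Over IPv6, three Extra SACK tables give 1091 + 48 = 1139
   octets and a fourth gives 1396, which exceeds 1280.  Over IPv4,
   one Extra SACK table gives 577 + 28 = 605 octets, which exceeds
   576.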


    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Buffer Space          |          Max Skew             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          First Packet                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Previous Packet (largest accepted DATA Sequence number)    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                            Serial                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Reason    |  Ack Count    | SACK table (0 to 256 octets)...
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ ..

     ... +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    ... SACK (horizontal striping) | Trailer Count |  Extra SACKs  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Maximum Packet Size                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  Recommended Packet Size                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Receive Window Size                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 Max Packets per Jumbogram                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |SACK Size (N-1)| SACK table (0 to 256 octets) ...
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ ...

     ... +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    ...	SACK (horizontal striping)                                 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |SACK Size (N-1)| SACK table (0 to 256 octets) ...
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ ...

     ... +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    ...	SACK (horizontal striping)                                 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
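
   The striping pattern can be illustrated with a lookup routine for
   a single SACK table.  This is a sketch under two assumptions that
   the text above does not state explicitly: bit 0 is taken to be the
   least significant bit of each octet, and the octet index is taken
   to be the offset modulo 256.  The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SACK_MAX_OCTETS  256
#define SACK_MAX_PACKETS 2048  /* 8 bits for each of 256 octets */

/* NumAcks0 := MIN(<Previous Packet> - <First Packet> + 1, 2048) */
uint32_t rx_sack_num_acks0(uint32_t first_packet, uint32_t previous_packet)
{
    uint32_t n = previous_packet - first_packet + 1;

    return n < SACK_MAX_PACKETS ? n : SACK_MAX_PACKETS;
}

/* Look up one DATA packet in the first SACK table.
 *
 * Returns 1 if the bit for seq is set, 0 if it is clear, and -1 if
 * seq is not represented in the table provided. */
int rx_sack_lookup(const uint8_t *table, size_t table_octets,
                   uint32_t first_packet, uint32_t seq)
{
    uint32_t offset0 = seq - first_packet;  /* modular arithmetic */
    size_t octet;
    unsigned bit;

    if (offset0 >= SACK_MAX_PACKETS)
        return -1;
    octet = offset0 % SACK_MAX_OCTETS;  /* position within the stripe */
    bit = offset0 / SACK_MAX_OCTETS;    /* stripe: 0..255 is bit 0, etc. */
    if (octet >= table_octets)
        return -1;
    return (table[octet] >> bit) & 1;
}
```

   For an Extra SACK table the same routine could be applied with
   <First Packet> + 2048 (and so on) passed as first_packet.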


Historical Implementation Notes
===============================

Rx Packet Header Flag LAST-PACKET (4):

Some older Rx implementations use DATA packets without the
LAST-PACKET flag set to restrict the MTU.  Notable examples are
IBM AFS 3.2 and earlier, which supported neither Jumbo Packets nor
the ACK packet trailers introduced in later versions.  Such Rx peers
adjust down the Path MTU to match the size of received DATA packets
that do not have the LAST-PACKET flag set.


Rx Packet Header Flag Bit-5 (16):

AFS 3.2 through OpenAFS 1.0.1 used Rx Packet Header Flag Bit-5 (16)
to indicate that the packet structure was located on an internal
free packet queue.  Any packet received with Bit-5 (16) set could
be misinterpreted as a free packet.  Therefore, Bit-5 must not be
set in any Rx Packet written to the network.


ACK <Max Skew>:

Use of the ackPacket.maxSkew field was broken prior to March 1989
and is unused.  Some early Rx implementations mistakenly assigned

  ackPacket.maxSkew = htonl(rx_peer.inPacketSkew)

even though the rx_peer.inPacketSkew field was an unsigned short.
Another bug was a failure to ignore maxSkew computations when the
ackPacket.serial number was zero.

Since March 1989, the ackPacket.maxSkew field has been ignored when
processing ACK packets and has been assigned zero when sending ACK
packets.

Prior to March 1989 the rx_peer.inPacketSkew field was calculated as

  skew = rx_conn.lastSerial - packet.header.serial
  rx_conn.lastSerial = packet.header.serial
  if (skew > 0 && skew > peer->inPacketSkew)
     peer->inPacketSkew = skew

The most recently received ackPacket.maxSkew value was then used to
restrict retransmissions:

  if (!packet.acked && packet.sent
      && packet.header.serial + ackPacket.maxSkew < ackPacket.serial)
  then retransmit packet

The above computation failed to take into account ACK packets whose
ackPacket.serial is zero or the possibility that serial numbers could
wrap when more than 2^31-1 packets were sent using a connection in a
single direction.  As a result, lost DATA packets could fail to be
retransmitted and the call could stall indefinitely.

Computing maximum skew from connection serial numbers made sense
when all Rx peers were single-threaded.  In multi-threaded peers,
connection serial numbers are allocated as the data stream is
packetized.  The operating system scheduler, per socket queues, and
per-call window management can result in packets being written to the
wire from multiple calls interleaved.  Therefore, receipt of serial
numbers out of sequence implies nothing about the skew of the network
path.

Enough time has passed that the ackPacket.maxSkew field could be
considered unused and reserved for future use.  Knowledge of the
maximum skew between two Rx peers is useful information and could
be leveraged to reduce unnecessary retransmissions.  However, it
cannot be computed by use of Rx connection serial numbers.


ACK <Serial>:

The original and current meaning of the ackPacket.serial field when
non-zero is that it contains the packet.header.serial of the
incoming packet to which the ACK packet was immediately sent in
response.  The incoming packet could be a DATA or an ACK; it might
or might not have the ACK_REQUESTED flag set.

Originally, ACK packets with reason RX_ACK_PING_RESPONSE were sent
with ackPacket.serial set to zero.

From March 1989 through May 1989 the ackPacket.serial field was
given a different meaning.  Rx connections stored the maximum serial
received from the peer (rx_conn.maxSerial).  When constructing an ACK
packet, the ackPacket.serial was assigned as follows

  if (ackPacket.reason == RX_ACK_REQUESTED)
      ackPacket.serial = htonl(conn->maxSerial)
  else
      ackPacket.serial = htonl(conn->maxSerial + 1)

The intent of the change was to force packets to be retransmitted in
response to receipt of an ACK even if the retransmission timer had
been mistakenly turned off.  In the previously deployed Rx
implementations packets would not be sent if

  ackPacket.maxSkew > ackPacket.serial - sentPacket.serial

is true for any DATA packet in the sent queue.

From May 1989 through OpenAFS 1.2.7 the ackPacket.serial field was
set to htonl(conn->maxSerial) regardless of ackPacket.reason.

OpenAFS 1.2.8 restored the original meaning of ackPacket.serial.
A non-zero ackPacket.serial indicates the serial number of the
DATA or ACK packet it was sent in response to.  This now includes
ACK packets with reason RX_ACK_PING_RESPONSE.

During the period from May 1989, the receipt by an incoming connection
of an ackPacket.serial greater than conn.nextSerial would advance
conn.nextSerial.

  if (conn.type == RX_SERVER_CONNECTION && conn.nextSerial < ackPacket.serial)
      conn.nextSerial = ackPacket.serial + 1;

Advancing the incoming connection's serial number was necessary in case
the incoming connection had been previously garbage collected.

OpenAFS 1.2.8 removed the advancement of conn.nextSerial by incoming
connections.


ACK <Previous Packet>:

The value assigned to <Previous Packet> has been inconsistent across
Rx implementations.  Sometimes the value of <Previous Packet> contained
the sequence number of the DATA packet that triggered the ACK packet.
Sometimes it contained the sequence number of the DATA packet received
before the DATA packet that triggered the ACK packet.  The value might
be a sequence number of a DATA packet that was dropped because it was
a duplicate or because it was outside the receive window.

In AFS 3.0 the value was the sequence number of the previously
received DATA packet except when the ackPacket.type is DELAY.  The
value could be a sequence number of a dropped packet.  The value
can go backwards in subsequently received ACK packets.  There is
no relationship between <Previous Packet> and the ackPacket.nAcks
size.

Starting with AFS 3.5 <Previous Packet> is set to the sequence
number of the DATA packet with the ACK_REQUESTED flag set when
sending an ACK packet of ackPacket.type REQUESTED.  This change in
behavior made the REQUESTED-ACK consistent with the DELAY-ACK.

Kernel Rx versions AFS 3.5 up to and including OpenAFS 1.6.22 stored
the Rx serial number of a dropped DATA packet instead of the DATA
packet's sequence number.  This serial number would be sent in
subsequent ACK packets until the next DATA packet was received.

OpenAFS 1.2 Rx modified the behavior of IDLE-ACK packets by setting
<Previous Packet> to the sequence number of the DATA packet that
triggered the sending of the ACK.

OpenAFS 1.4 Rx modified the behavior of OUT-OF-SEQUENCE-ACK packets
by setting <Previous Packet> to the sequence number of the DATA packet
that triggered the sending of the ACK.

In theory the <Previous Packet> field combined with the <First Packet>
could be used to detect and ignore out-of-sequence ACK packets.  For
that to be true <Previous Packet> must consistently contain the
largest DATA packet sequence number accepted by the Rx peer.

AuriStor Rx always sets <Previous Packet> to the largest DATA packet
sequence number of an accepted DATA packet.  <Previous Packet> never
goes backward provided that ACK packets are processed in order.
<Previous Packet> is never set to a sequence number of a dropped DATA
packet.

When the EXTENDED-SACK Flag is set <Previous Packet> must be set to the
largest DATA packet sequence of any DATA packet accepted by the
receiver for the call.  The <Previous Packet> value must never go
backwards, although it is permitted to wrap to zero.


ACK Trailer fields:

The <Ack Trailer> fields were introduced as follows:

  AFS 3.3:
	The <Ack Trailer> fields <Max MTU> and <Interface MTU>
	were introduced in AFS 3.3.

  AFS 3.4:
	The <Ack Trailer> field <Receive Window Size> was introduced
	in AFS 3.4.  The AFS 3.4 ACK receiver would only reduce
	the size of the active call's transmit window by the
	advertised <Receive Window Size>.

  AFS 3.5:
	The <Ack Trailer> field <Datagram Packets> was
	introduced in AFS 3.5 in conjunction with Jumbograms.

	The <Ack Trailer> field <Receive Window Size> can grow the
	size of the active call's transmit window by the
	advertised <Receive Window Size>.  Although the field is
	unsigned 32-bit, the maximum receive window is constrained
	to unsigned short (65535).


RX Connection Epoch and the Meaning of the High Epoch Bit:

The RX Connection Epoch is an unsigned 32-bit value which when
combined with the unsigned 30-bit Connection ID (CID) forms the
primary identifier for any RX connection.  These values when
combined with the source and destination endpoints, the
direction (as measured by the setting of the CLIENT-INITIATED
Flag bit) and the Security Index uniquely identify the packets
belonging to a connection.  The choice of Epoch and CID for any
RX Connection belongs solely to the Connection Initiator.  Once
the Epoch and CID are selected for a Connection, they must not
change.

  AFS 3.0:

	The RX stack initialized a global RX Epoch value to seconds
	since UNIX Epoch at startup.  This value was assigned to each
	initiated RX Connection.  CID values started with 1 and
	incremented sequentially.  CID values could wrap to 0.

	RX FindConnection matched incoming packets to connections with
	an exact match of Epoch, CID, Direction (type), Security Index,
	Source Address and Source Port.

	RX FindConnection bound the peer endpoint to the connection
	when created and did not permit the peer endpoint to change.

  AFS 3.0 (Aug 1990 patch release):

	RX FindConnection ignored the source endpoint when matching
	incoming packets to connections if the receiver initiated the
	connection and the source port number matched the connection
	peer endpoint port.

  AFS 3.1b:

	Changes were made in response to the pending publication
	of "Hijacking AFS"[8].

	Bit 31 (the high-order bit) of the RX Connection Epoch is
	designated the Ignore-Source Flag.

	The RX Epoch is re-initialized to a random value, with the
	Ignore-Source Flag set, when the RXKAD Security Class is
	initialized.  The Ignore-Source enabled Epoch is used for all
	subsequently initiated connections regardless of whether or
	not the RXKAD security class is used.

	  "This function allows rxkad to set the epoch to a suitably
	  random number which rx_NewConnection will use in the future.
	  The principle purpose is to get rxnull connections to use the
	  same epoch as the rxkad connections do, at least once the
	  first rxkad connection is established.  This is important now
	  that the host/port addresses aren't used in FindConnection:
	  the uniqueness of epoch/cid matters and the start time won't
	  do."

	RX FindConnection ignores the source endpoint when matching
	incoming packets to connections if the Ignore-Source Flag is
	set.  This was true for both the acceptor and the initiator.

	  "epoch's high order bits mean route for security reasons only
	  on the cid, not the host and port fields."

	RX connections with the Ignore-Source Flag set can accept packets
	from alternative endpoints provided that each peer continues to
	receive packets on the original source and destination endpoints.

	RX connections initiated by fileservers to vlservers and UBIK
	servers to each other used RXKAD and would therefore set the
	Ignore-Source Flag.  Note that fileserver initiated RXAFSCB
	connections were rxnull but had the Ignore-Source Flag set.
	RXAFSCB services would therefore accept packets from any
	interface on a multihomed fileserver.

  AFS 3.4:

	Initialized KERNEL RX Epoch to seconds since UNIX Epoch and
	set the Ignore-Source Flag even though the Epoch is strictly
	time based and not unique.

	No change was made to the FindConnection logic.

  AFS 3.5:

	Introduced the rxLastConn cache pointer.

	Stopped treating the <Security Index> as part of the Connection
	identity.  Packets that match a connection but have a different
	<Security Index> are dropped.

	Began to update the connection peer endpoint to the endpoint of
	the most recently received packet matched to the connection.

	  "Ensure that the peer structure is set up in such a way that
	  replies in this connection go back to that remote interface
	  from which the last packet was sent out. In case, this packet's
	  source IP address does not match the peer struct for this conn,
	  then drop the refCount on conn->peer and get a new peer structure.
	  We can check the host,port field in the peer structure without the
	  rx_peerHashTable_lock because the peer structure has its refCount
	  incremented and the only time the host,port in the peer struct gets
	  updated is when the peer structure is created."

  OpenAFS 1.2.9 RX OPENAFS-SA-2003-002:

	82523baf9f76eca38fc4856f52bc7cdabddf14b3 ("Clean up code in
	rxi_FindConnection") removed the logic introduced in IBM AFS 3.5
	that updated the rx_connection peer upon the receipt of each
	accepted rx_packet.

	Restricted the application of Ignore-Source Flag acceptance only
	to packets received by the connection initiator.

  OpenAFS 1.4:

	OpenAFS b4566d725e1aa4f57d1e6db5821c590a4b6da7c0
	("partly-revert-rx-cleanup-20040804") reverted the restriction
	that limited Ignore-Source Flag acceptance to connection
	initiators, because the restriction broke the RXAFSCB service's
	receipt of packets when the cache manager is multi-homed.

	  "if there's a callback connection to a multihomed client, you
	  need this or you end up with multiple connections, one per IP,
	  being made from the single connection".

  OpenAFS 1.8:

	OpenAFS 39b165cdda941181845022c183fea1c7af7e4356
	("Move epoch and cid generation into the rx core") moved the
	generation of random Epochs and setting of the Ignore-Source
	Flag out of RXKAD and into RX proper.  Now all connections
	have the Ignore-Source Flag set.

  AuriStorFS v0.192:

	AuriStorFS 506ba040fdc3b4325461ff9d8d8e2b5660e68111
	("rx: do not permit client connection packets to switch endpoints")
	removed the test of Ignore-Source Flag from rxi_conn_find().
	Since connection initiators accept packets from any endpoint
	provided that the port number matches, this change only prevents
	connection acceptors from matching packets to connections when
	the endpoint changes.  Acceptors will create a new connection
	to bind the incoming call to a new endpoint.

	Without this change calls could enter a zombie state where ACK
	PING packets are successfully responded to but DELAYED ACK, DATA
	and ABORT response packets are ignored after the initiator moves
	to a new network endpoint and can no longer receive packets at
	the original endpoint.

  AuriStorFS v0.207:

	AuriStorFS 5d544dd373418539ef7e850c3cc0fd64bfdd7904
	("rx: identify connections by direction, epoch, cid, and
	securityIndex") restored inclusion of <Security Index> in the
	connection identity.  Once again multiple connections that
	differ only by <Security Index> can exist between peers.

  AuriStorFS v2021.05-16:

	AuriStorFS 427754b023515881553afcf4382c84ed18931c6a
	("rx: clear epoch high-bit to prevent conn endpoint switching")
	removes the setting of the Ignore-Source Flag even though the
	Epoch is random.  This has no impact on communication with
	AuriStorFS services since their checks for the high Epoch bit
	were already removed.  It forces IBM/OpenAFS acceptors to
	allocate a new connection upon receipt of a call from a new
	endpoint in order to bind the connection to the new endpoint.
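
The matching rules in this history can be summarized with a small
model.  The sketch below is illustrative only and does not reproduce
any FindConnection implementation.  The class and function names are
hypothetical, the <Security Index> is treated as part of the identity,
and a single honor_flag switch stands in for the difference between
stacks whose acceptors test the Ignore-Source Flag and stacks whose
acceptors do not.

```python
from dataclasses import dataclass

IGNORE_SOURCE_FLAG = 0x80000000  # high-order bit of the 32-bit Epoch


@dataclass(frozen=True)
class PacketId:
    epoch: int
    cid: int
    client_initiated: bool  # CLIENT-INITIATED flag carried by the packet
    security_index: int
    source: tuple           # (host, port) the packet arrived from


@dataclass
class Connection:
    epoch: int
    cid: int
    is_initiator: bool      # True if the local side initiated
    security_index: int
    peer: tuple             # (host, port) bound to the connection


def ignore_source(epoch: int) -> bool:
    """Report whether the Ignore-Source Flag is set in an Epoch."""
    return bool(epoch & IGNORE_SOURCE_FLAG)


def matches(conn: Connection, pkt: PacketId, honor_flag: bool) -> bool:
    """Decide whether pkt belongs to conn under the modelled policy."""
    if (conn.epoch, conn.cid, conn.security_index) != (
            pkt.epoch, pkt.cid, pkt.security_index):
        return False
    # Packets sent by the initiator belong to the acceptor's side of
    # the connection, and packets sent by the acceptor to the initiator's.
    if conn.is_initiator == pkt.client_initiated:
        return False
    if conn.peer == pkt.source:
        return True
    if conn.is_initiator:
        # Initiators accept any host as long as the port matches.
        return conn.peer[1] == pkt.source[1]
    # Acceptors follow an endpoint change only when the stack honors
    # the flag and the initiator set it in the Epoch.
    return honor_flag and ignore_source(conn.epoch)
```

With honor_flag=False an acceptor that sees a known Epoch and CID from
a new endpoint gets no match and therefore creates a new connection
bound to that endpoint.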


The Myth of Server Restart Detection Using the RX Connection Epoch

OpenAFS 8d359e6dff5317698597e77f0a1dd5ba2bfb569a removed a March
1989 attempt at RX peer restart detection.  The 1989 commit
included the following statement: "The right way to detect a
server restart in the midst of a call is to notice that the
server epoch changed, btw."  This statement is incorrect because
changing the Connection Epoch will result in a distinct connection
whose packets will not be mixed with those associated with a
prior connection.


Behavior of Rx peers that do not recognize DEBUG and/or VERSION packet types

DEBUG and VERSION packets were not part of the original Rx implementation.
DEBUG packets were introduced prior to the release of AFS 3.0 and
VERSION packets were introduced in the AFS 3.3 release.  When DEBUG
or VERSION packets are unrecognized, the acceptor will attempt to match
the incoming packet with an Rx connection using the epoch, cid, host,
port and/or security index and, if the call number is non-zero, will
attempt to match an Rx call.  The unrecognized packet type will result
in an ABORT packet with error code RX_PROTOCOL_ERROR.  If the call
number is zero, the ABORT will result in the connection being placed
into an error state.
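
The reaction of such a peer can be sketched as below.  This is a
hedged illustration rather than code from any Rx stack: the function
name is hypothetical, the set of "known" types models a peer that
predates DEBUG and VERSION, and the numeric constants are the values
commonly found in rx.h and should be treated as assumptions here.

```python
RX_PACKET_TYPE_ABORT = 4
RX_PACKET_TYPE_DEBUG = 8
RX_PACKET_TYPE_VERSION = 13
RX_PROTOCOL_ERROR = -5

# Assumed type set of a legacy peer: DATA (1) through RESPONSE (7).
LEGACY_KNOWN_TYPES = frozenset(range(1, 8))


def legacy_reaction(packet_type: int, call_number: int):
    """Return (reply type, error code, scope), or None if type is known.

    scope is "call" when the packet named a call and "connection" when
    the call number was zero, in which case the whole connection ends
    up in an error state.
    """
    if packet_type in LEGACY_KNOWN_TYPES:
        return None
    scope = "call" if call_number != 0 else "connection"
    return (RX_PACKET_TYPE_ABORT, RX_PROTOCOL_ERROR, scope)
```

A requester that receives such an ABORT in reply to a DEBUG or VERSION
probe can conclude that the peer does not implement that packet type.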


History of Rx VERSION response data

Rx VERSION packets were introduced in AFS 3.3.  The AFS 3.3 requester
implemented a char[64] version buffer into which it copied the received
version data.  It would read up to 1500 bytes from the network
datagram.  If the length read was at least 28 bytes, it would copy
MIN(64, (bytesRead - 28)) octets from &responseData[28] to the version
buffer.  It did not add a terminating NUL.
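
The copy performed by the AFS 3.3 requester can be sketched as
follows.  The function names are hypothetical and the code is a model
of the behaviour described above, not the original implementation.
The second helper shows one defensive way for a modern requester to
decode the result given that no NUL is guaranteed.

```python
VERSION_DATA_OFFSET = 28   # offset of the version data in the datagram
VERSION_BUFFER_SIZE = 64   # char[64] buffer described above
MAX_READ = 1500            # maximum number of bytes read


def afs33_copy_version(datagram: bytes) -> bytes:
    """Return the octets that would land in the 64-octet buffer."""
    data = datagram[:MAX_READ]
    if len(data) < VERSION_DATA_OFFSET:
        return b""
    count = min(VERSION_BUFFER_SIZE, len(data) - VERSION_DATA_OFFSET)
    return data[VERSION_DATA_OFFSET:VERSION_DATA_OFFSET + count]


def printable_version(raw: bytes) -> str:
    """Decode up to the first NUL, or the whole buffer if none exists."""
    return raw.split(b"\0", 1)[0].decode("ascii", errors="replace")
```

Because the buffer may be filled completely, any consumer that treats
it as a C-string must bound the read to 64 octets itself.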

The AFS 3.3 Rx VERSION acceptor always wrote 65 octets from a static
array which contained the product version as a C-string.  If the product
version string exceeded 64 octets, the data copied to the VERSION
response packet would not include a trailing NUL.
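
The fixed 65-octet payload can be modelled as below.  This is a
sketch under stated assumptions: the function name is hypothetical,
and padding short strings with NUL octets is an assumption about the
contents of the static array beyond the end of the string.

```python
VERSION_PAYLOAD_SIZE = 65  # octets always written by the acceptor


def afs33_version_payload(product_version: bytes) -> bytes:
    """Return the 65 octets an AFS 3.3 style acceptor would send.

    Strings of up to 64 octets are followed by at least one NUL.
    Longer strings fill all 65 octets and leave no trailing NUL.
    """
    head = product_version[:VERSION_PAYLOAD_SIZE]
    return head.ljust(VERSION_PAYLOAD_SIZE, b"\0")
```

A 64-octet version string is therefore the longest one that still
arrives NUL-terminated within the payload.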

AFS 3.3 Rx VERSION acceptors did not validate the presence of the
CLIENT-INITIATED flag before issuing a response.

AFS 3.3 Rx VERSION acceptors reused the incoming Rx packet to transmit
the response.  They reused the Rx header as received.  As a result, the
CLIENT-INITIATED flag was not cleared when transmitting the response.
This could lead to packet loops.  Nor did they clear the Sequence
number, Serial number, UserStatus or other flag bits.

OpenAFS 1.1.1 corrected the use of the CLIENT-INITIATED flag,
truncated the response version string at 65 bytes, and appended a
trailing NUL.  As of 1.1.1 it is safe to send a version string
longer than 63 bytes plus a trailing NUL.

However, OpenAFS 1.1.1 did not reset any other incoming flag bits,
nor were the Sequence number, Serial number, and UserStatus fields
set to zero.  Instead, the values received in the incoming Rx header
are replayed in the response packet.

Linux RxRPC does not reuse the incoming Rx packet for the response
and only copies the Epoch, CID, and Call Number to the response header.
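
The header handling discussed in this history can be sketched as a
response builder that starts from a zeroed header instead of reusing
the request.  This is an illustration, not the Linux RxRPC code.  The
function name is hypothetical, and the 28-octet layout (Epoch, CID,
Call Number, Sequence, Serial, Type, Flags, UserStatus, Security
Index, and two 16-bit fields) and the numeric constants are
assumptions based on common Rx headers.

```python
import struct

# Network byte order, 28 octets in total.
HEADER = struct.Struct("!IIIIIBBBBHH")
RX_CLIENT_INITIATED = 0x01
RX_PACKET_TYPE_VERSION = 13


def version_reply_header(request: bytes):
    """Build a VERSION response header, or return None to stay silent."""
    if len(request) < HEADER.size:
        return None
    (epoch, cid, call_number, _seq, _serial, ptype, flags,
     _user_status, _security_index, _spare, _service_id) = \
        HEADER.unpack_from(request)
    if ptype != RX_PACKET_TYPE_VERSION:
        return None
    if not flags & RX_CLIENT_INITIATED:
        # Answering a packet that is itself a response is what allows
        # two peers to bounce VERSION packets back and forth.
        return None
    # Copy only Epoch, CID and Call Number; every other field is zero,
    # so CLIENT-INITIATED is cleared and nothing is replayed.
    return HEADER.pack(epoch, cid, call_number, 0, 0,
                       RX_PACKET_TYPE_VERSION, 0, 0, 0, 0, 0)
```

Feeding the generated header back into the same function yields None,
which is the property that prevents the packet loops described above.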


History of Rx DEBUG response data

Rx DEBUG packets were introduced prior to AFS 3.0.

AFS 3.0 Rx DEBUG acceptors did not validate the presence of the
CLIENT-INITIATED flag before issuing a response.

AFS 3.0 Rx DEBUG acceptors reused the incoming Rx packet to transmit
the response.  They reused the Rx header as received.  As a result, the
CLIENT-INITIATED flag was not cleared when transmitting the response.
This could lead to packet loops.  Nor did they clear the Sequence
number, Serial number, UserStatus or other flag bits.

OpenAFS 1.1.1 corrected the use of the CLIENT-INITIATED flag.
However, OpenAFS 1.1.1 did not reset any other incoming flag bits,
nor were the Sequence number, Serial number, and UserStatus fields
set to zero.  Instead, the values received in the incoming Rx header
are replayed in the response packet.

Linux RxRPC ignores received DEBUG packets.

Package  Name                  Intro      Description
----------------------------------------------------------------------------
   1     RX_DEBUGI_GETSTATS    AFS 3.0    Get basic Rx stats
   2     RX_DEBUGI_GETCONN     AFS 3.0    Get connection info
   3     RX_DEBUGI_GETALLCONN  AFS 3.1    Get even uninteresting connections
   4     RX_DEBUGI_RXSTATS     AFS 3.1    Get all Rx stats
   5     RX_DEBUGI_GETPEER     AFS 3.5p2  Get all peer stats
  -8     RX_DEBUGI_BADTYPE     AFS 3.0    Requested package is unknown

RX_DEBUGI_VERSION values identify which RX_DEBUGI_GETSTATS fields or
RX_DEBUGI_GETPEER fields are available from the responding peer.  The
value is communicated via the struct rx_debugStats.version field returned
from the RX_DEBUGI_GETSTATS package.

Version  Intro      Description
----------------------------------------------------------------------------
  'L'    AFS 3.0    Earliest production version.  Unaligned connections
  'M'    AFS 3.1    Supports GETALLCONN and RXSTATS
  'N'    AFS 3.3    Adds rx_debugStats.nWaiting
  'O'    AFS 3.5    Adds rx_debugStats.idleThreads
  'P'    AFS 3.5p1  Adds new rx_stats fields: ignorePacketDally,
                    receiveCbufPktAllocFailures, sendCbufPktAllocFailures
  'Q'    AFS 3.5p2  Supports GETPEER
  'R'    OAFS 1.4.0 Adds rx_debugStats.nWaited
  'S'    OAFS 1.6.0 Adds rx_debugStats.nPackets
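
A requester can use the two tables above to decide which packages and
fields to expect from a peer.  The sketch below encodes them directly.
The class, dictionary and function names are hypothetical, and the
version is passed as a one-character string (for example
chr(stats.version)) purely for readability.

```python
from enum import IntEnum


class DebugPackage(IntEnum):
    GETSTATS = 1
    GETCONN = 2
    GETALLCONN = 3
    RXSTATS = 4
    GETPEER = 5


RX_DEBUGI_BADTYPE = -8  # returned when the requested package is unknown

# Earliest RX_DEBUGI_VERSION that supports each package.
MIN_VERSION = {
    DebugPackage.GETSTATS: "L",
    DebugPackage.GETCONN: "L",
    DebugPackage.GETALLCONN: "M",
    DebugPackage.RXSTATS: "M",
    DebugPackage.GETPEER: "Q",
}

# rx_debugStats fields and the version that added each of them.
DEBUG_STATS_FIELDS = {
    "N": "nWaiting",
    "O": "idleThreads",
    "R": "nWaited",
    "S": "nPackets",
}


def peer_supports(package: DebugPackage, version: str) -> bool:
    """Report whether a peer at the given version offers the package."""
    return version >= MIN_VERSION[package]


def available_stats_fields(version: str) -> list:
    """List the optional rx_debugStats fields present at this version."""
    return [name for added, name in sorted(DEBUG_STATS_FIELDS.items())
            if version >= added]
```

For instance, a peer reporting 'P' does not support GETPEER, so a
requester can skip that package rather than wait for
RX_DEBUGI_BADTYPE.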

AFS 3.2 altered the struct rx_stats structure by increasing the
size of the packetsRead[] and packetsSent[] arrays from 9 to 10 elements
when RX_PACKET_TYPE_PARAMS was allocated.  There is no new RX_DEBUGI_VERSION
value matching this change.

AFS 3.3 altered the struct rx_stats structure by increasing the size of the
packetsRead[] and packetsSent[] arrays from 10 to 13 elements when
RX_PACKET_TYPE_VERSION was allocated.  This can be detected by
RX_DEBUGI_VERSION 'N'.
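
When decoding an RXSTATS response, the array lengths implied by the
two paragraphs above can be derived from the reported version.  This
is a sketch of one possible reading: the function name is
hypothetical, and it assumes that an AFS 3.2 peer reports version 'M',
which is why that version maps to two candidate lengths.

```python
def packet_counter_lengths(version: str) -> tuple:
    """Return candidate lengths of packetsRead[] and packetsSent[].

    More than one candidate means the version alone cannot tell the
    layouts apart and the response length must be checked as well.
    """
    if version >= "N":
        return (13,)
    if version == "M":
        # AFS 3.1 used 9 elements, AFS 3.2 used 10, both without a
        # new RX_DEBUGI_VERSION value.
        return (9, 10)
    return (9,)
```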


Acknowledgements
================

Jeffrey Hutzelman <jhutz@cmu.edu> reviewed an early draft of this
specification and provided much-appreciated feedback on technical
details as well as document structuring.

Love Hornquist-Astrand <lha@stacken.kth.se> made many corrections
to this specification, especially regarding backwards-compatibility
with older Rx implementations.

References
==========

  [1] /afs/sipb.mit.edu/contrib/doc/AFS/hijacking-afs.ps.gz

  [2] OpenAFS: src/rx/

  [3] /afs/sipb.mit.edu/contrib/doc/AFS/ps/rx-spec.ps

  [4] /afs/stacken.kth.se/ftp/pub/arla/prog-afs/shadow/doc/r.vdoc

  [5] /afs/stacken.kth.se/ftp/pub/arla/prog-afs/shadow/doc/rx.mss

  [6] https://datatracker.ietf.org/doc/html/rfc5681

  [7] https://datatracker.ietf.org/doc/html/rfc9000

  [8] https://www.researchgate.net/publication/2513329_Hijacking_AFS/link/02e7e51eeaf7360cfb000000/download