Rx protocol specification draft

Nickolai Zeldovich, nickolai@csail.mit.edu
Updated by Jeffrey Altman, jaltman@auristor.com

Introduction
============

Rx is a client-server RPC protocol, an extended and combined version of the older R and RFTP protocols. This document describes Rx, but the details of Rx security classes (such as Rxkad and RxGK) are not specified.

Rx communicates via UDP datagrams on a user-specified port. Rx also provides for multiplexing of Rx services on a single port, via a 16-bit service ID which identifies a particular Rx service that is listening on a given port, akin to a port number. Therefore, an Rx service is identified by a triple of .

The protocol is connection-oriented -- a client and a server must first handshake and establish a connection before Rx calls can be made. The handshake is implicit upon the first request if no authentication is desired, or can consist of a pair of Challenge and Response requests in order to establish authentication between the client and the server.

In order to prevent unauthenticated Rx requests from being used for amplification attacks, it is recommended that a reachability test consisting of an exchange of PING ACK and PING-RESPONSE ACK be performed before issuing any response larger than the received request packet.

Protocol Overview
=================

As mentioned above, Rx uses UDP/IP datagrams on a user-specified port to communicate. Each Rx server may provide multiple services, specified by the Service ID. This allows for service multiplexing, much in the same way as UDP port numbers allow for multiplexing of UDP datagrams addressed to the same host.

Each Rx service must offer one or more security classes. When the service does not require authentication, integrity protection, or encryption, it can offer the Rxnull security class. Each multiplexed Rx service can offer an independent set of security classes.

Each Rx service must specify a protocol for the description and encoding of data.
Sun RPC's XDR (External Data Representation) is commonly used but is not required. Multiplexed Rx services may require different data encoding protocols.

Each client and server pair that wants to communicate using Rx must establish an Rx connection, which can be thought of as a context for all subsequent Rx activity between these two parties. An Rx connection can only be associated with a single Rx service.

Each Rx connection context contains multiple channels, which are used for data transmission and for actually performing an RPC call. The channels are independent of each other, allowing multiple RPC calls to be performed over the same Rx connection simultaneously.

An Rx call involves the transmission of call arguments over an Rx channel to the server and subsequent reception of the reply data. For each Rx call, an available Rx channel must be allocated exclusively to that call. The channel cannot be used for anything else until the call completes. After call completion, the channel may be reused for subsequent Rx calls.

Rx Connections
==============

This section makes many references to fields of an Rx header; see the ``Packet Formats'' section for the specific layout of the Rx header.

The connection epoch is a relatively unique 31-bit value chosen by the initiator of each connection. Historically, the epoch was selected at startup by computing the seconds since the Unix epoch (midnight 1 January 1970 UTC/GMT, not counting leap seconds). It was used to identify the set of connections initiated by the peer. Modern Rx implementations should randomly generate the Rx epoch instead of relying upon a timestamp. Rx implementations may generate a unique random epoch for each initiated connection to reduce the ability to track activity across multiple peers.

An Rx connection between two Rx peers is identified by:

1. Direction
2. Epoch
3. Connection ID
4. Service ID
5. Security Class ID
6. Peer Address
7.
Peer Port

The Rx Connection direction is determined by the presence or absence of the CLIENT-INITIATED bit in the Flags field (see below).

Bit 31 of the epoch field, the bit left over from the 31-bit epoch value, is the Ignore-Source Flag. When set, the Ignore-Source Flag indicates that the peer is multi-homed and that it might send packets from more than one endpoint (network address and port). Packets received for Rx Connections with the Ignore-Source Flag set can be accepted from any endpoint. Conversely, if the Ignore-Source Flag is not set, packets received from distinct endpoints must be considered part of a unique Rx connection.

Rx implementations are discouraged from accepting packets from distinct endpoints for the same Rx Connection. Instead, Rx peers should be designed to reliably transmit reply packets from the network interface on which they were received, and Rx services should implement peer identification either via Rx security class authentication or out of band mechanisms.

The Connection ID is a 30-bit value chosen by the connection initiator. Although not required, the Connection ID should be unique to the peer independent of the Epoch, Service ID, Security Class ID, Peer Address, and Peer Port.

Sharing the 32-bit field with the Connection ID is the 2-bit Channel ID. Each connection can multiplex four calls on the same connection. One call at a time can be issued in each Rx channel.

The signed 32-bit Call ID identifies a distinct call within a channel; there are four call numbers associated with each Rx connection. Each new call must start with a higher number than the previous call in the same channel, and typically this is just the previous call number + 1. The initial call number must be a positive integer. Call ID zero (0) indicates a connection-level Rx packet (see below). The Call ID is chosen by the peer initiating the call. Although only one call can use a channel at one time, the Call ID allows peers to distinguish packets on the same channel that belong to different calls.
Once the call whose ID is 2147483647 (the largest positive signed 32-bit value) completes, the call channel must cease being used. Although the rx_header.callNumber field is unsigned 32-bit, all implementations use a signed 32-bit field and check for overflow.

The unsigned 32-bit Sequence number is similar to the sequence number in TCP, but instead of bytes it labels sequential DATA packets transmitted in a single direction within a call. The first DATA packet sent in each direction of a call must have the Sequence number one (1). Sequence numbers are assigned to DATA packets in sequential order of data packetization. Each assigned Sequence number is one greater than that assigned to the previous DATA packet. Unlike Call IDs, Sequence numbers are permitted to wrap to zero after Sequence number 4294967295 is reached. It is the responsibility of Rx peers to ensure proper ordering of DATA packets. In Rx, acknowledgements of received DATA and retransmissions are performed on a packet-by-packet basis, identified by these Sequence numbers. The Sequence number should be zero (0) in all packets which are unassociated with a call or are of a non-DATA packet type.

Every outgoing packet issued on an Rx connection is stamped with an unsigned 32-bit serial number in the serial field. The serial number is incremented by one (1) for every packet sent regardless of the packet type or Call ID. When retransmitting a DATA packet, the Sequence number remains the same and the Serial number is unique for each retransmission. These serial numbers might be used by the flow control mechanisms (described below). The serial number for a connection should start at one (1) and is permitted to wrap to one (1) when Serial number 4294967295 is reached. Serial number zero (0) must be skipped when sending DATA packets because an rx_ackPacket.serial of zero (0) means that the ACK packet was not sent in response to receipt of a DATA or ACK packet.
Packets which are unassociated with an Rx connection should be transmitted with a Serial number of zero (0). (See Historical Implementation Notes: ACK )

The unsigned 16-bit Service ID identifies a particular Rx service running on a given network endpoint. This is analogous to how UDP port numbers allow multiplexing packets to a single IP address. Note that an Rx connection is created to a specific Service ID. The associated Service ID cannot be changed after the first Rx Call on the Connection. Existing implementations cache the Service ID value for a given Connection, and will ignore Service ID values in subsequent packets. Attempts to switch the Service ID for a single call will result in the designated opcode being executed on the cached service.

The unsigned 8-bit SecurityIndex field specifies the type of Security Class in use on this connection. The mapping of non-zero SecurityIndex values to Security Classes is defined by each Rx Service specification. SecurityIndex value zero (0) is reserved for the rxnull Security class.

The interpretation of the 16-bit securitySpecific (aka Checksum) field is dependent upon the active Rx Security Class. Unless otherwise specified by the Security Class specification, this field should be set to zero (0). An Rx Security class can also modify the packet payload in any way, for instance by encrypting the contents or adding headers or trailers specific to the Security class protocol (although the end result must be a properly sized packet that Rx will be able to transmit).

The userStatus field allows for additional user flags to be transported with each DATA or ACK packet. These have no significance to the Rx protocol itself. The meaning of the userStatus flags is Rx Service specific and might depend upon the in-flight call context. The userStatus field in an ACK packet provides an out of band communication channel in the reverse direction of the active call.
Unless otherwise defined, the userStatus field should be set to zero.

The "Flags" field consists of a number of single-bit flags with meanings as follows. The actual bit values are defined below, in the ``Protocol Constants'' section.

* CLIENT-INITIATED
    This packet originated from the connection initiator (as opposed to the acceptor). This flag must be set on all outgoing packets (regardless of type) sent by the connection initiator and must be cleared on all outgoing packets (also regardless of type) sent by the connection acceptor.

* REQUEST-ACK
    Sender is requesting acknowledgement of this packet, via an Ack packet response. The REQUEST-ACK flag is defined for DATA and ACK packet types.

* LAST-PACKET
    This packet is the last DATA packet transmitted in the current direction by the sender. When the acceptor receives all DATA packets including the LAST-PACKET, it is safe to switch the direction of the call and begin transmitting the response DATA packets. When the LAST-PACKET DATA packet sent by the acceptor is received and acknowledged by the initiator, the call is complete.

* MORE-PACKETS
    This advisory flag can be set on all but the last DATA packet by the sender initially transmitting a sequential list of DATA packets at once. This flag must not be set on retransmitted DATA packets.

* EXTENDED-SACK
    This flag must be set on all ACK packets of the Rx call if the sender wishes to send ACK packets using the Extended format. This flag must be set even if the is less than or equal to 255 packets. Setting the EXTENDED-SACK flag signals that the meaning of and the three unused octets following the first SACK table are specified as per this document.

* SLOW-START-OK
    This advisory flag can be set on ACK packets to indicate that the sender of this packet supports the slow-start mechanism, described below under ``Flow Control''.

* JUMBO-PACKET
    When set in a DATA packet, this flag indicates that the DATA packet is part of a jumbogram, and is not the final DATA packet.
    See the ``Jumbograms'' section below for more details.

Packet Types
============

The "Type" field indicates the contents of this packet. Actual values are specified in the ``Protocol Constants'' section. This section describes the simpler packet types, and subsequent sections cover more complex packet types in more detail.

Certain packet types are connection-only requests (that is, they are not associated with an RPC call). A connection-only request is indicated by a zero call number. Valid packet types in a connection-only context are ABORT, CHALLENGE, RESPONSE, DEBUG, VERSION, and the three unused PARAMS packet types.

Valid packet types in a call context are DATA, ACK, ABORT, BUSY and ACKALL.

All other packet types are reserved for future use.

The payload of the packet following the header depends on the value of the Type field, as follows:

* DATA type (Standard data packet)
    The payload of a data packet is simply the Rx payload, corresponding to the sequence number and call specified in the header. The actual data that is transmitted in Rx data packets is described below. The receipt of a data packet by the initiator implicitly acknowledges that the acceptor has received and processed all the DATA packets that were transmitted by the initiator as part of this call.

* ACK type (Acknowledgement of received data)
    An acknowledgement packet provides information about which packets were or were not received by the peer, and other useful parameters. The semantics of these packets are described below in the ``Call Layer'' section.

* BUSY type (Busy response)
    When an initiator attempts to start a new call on a channel which the acceptor considers in-use, a busy response is returned. The call and channel number in the packet header indicate which call is being rejected. This packet type has no payload associated with it.

* ABORT type (Abort packet)
    Indicates that the relevant call or connection (if the call number field is zero) has encountered an error and has been terminated.
    The payload of the packet is a 32-bit user error code in network byte order. Some Rx services (such as the Ubik VOTE service) use the ABORT packet instead of a response DATA packet to convey the output of a call. This technique is discouraged. Rx Security classes cannot provide authentication, integrity protection or privacy for ABORT packets. Rx Services should consider sending error results via DATA packets instead of ABORT packets.

* ACKALL type (Acknowledgement of all packets)
    An acknowledge-all packet indicates that the peer wants to acknowledge the receipt of all DATA packets sent to it as part of the specified call. This can be used, for example, when a connection is being closed and the connection initiator wants to ensure that no retransmissions are attempted after it exits. Alternatively, the initiator could send a connection level ABORT to the peer to ensure that no further packets are accepted or transmitted by the acceptor. Use of the ACKALL packet type is discouraged. There is no payload associated with an acknowledge-all packet.

* CHALLENGE, RESPONSE types (Challenge request/response)
    The payload and use of these packet types are Security Class specific, and they are used to authenticate an Rx connection. CHALLENGE packets are sent from acceptor to initiator. CHALLENGE packets received by acceptors are dropped without delivery to the Security Class. RESPONSE packets are sent from initiator to acceptor after receiving a CHALLENGE. RESPONSE packets that are received by initiators are dropped by Rx and not delivered to the Security Class. RESPONSE packets received by an acceptor without a pending CHALLENGE may be dropped by Rx and not delivered to the Security Class. Security class specifications are not included in this document.

* DEBUG type (Debug packet)
    Rx supports an optional debugging interface; see the ``Debugging'' section below for more details. DEBUG packets might or might not be associated with an Rx connection.
    When associated with an Rx connection, the securityClass must be Rxnull. Rx acceptors that associate DEBUG packets with Rx connections and require a successful reachability test before issuing a response will not interoperate with Rx initiators that do not associate DEBUG packets with an Rx connection.

* PARAMS types (Parameter exchange)
    Three types were assigned in IBM AFS 3.2 as Connection level packet types but never used in a production deployment, and therefore have no protocol significance at this time. Packets with these types are currently ignored on receipt and should not be answered with a connection level ABORT.

* VERSION type (Get Rx version)
    If a peer receives a VERSION packet with the CLIENT-INITIATED flag set, it may respond with a VERSION packet containing a NUL-terminated payload. The payload might identify the version of Rx software it is running. The response must not have the CLIENT-INITIATED flag set. Nothing should respond to a VERSION packet without the CLIENT-INITIATED flag set, to avoid infinite packet loops. VERSION packets might or might not be associated with an Rx connection. When associated with an Rx connection, the securityClass must be Rxnull. Rx acceptors that associate VERSION packets with Rx connections and require a successful reachability test before issuing a response will not interoperate with Rx initiators that do not associate VERSION packets with an Rx connection.

Call Layer
==========

The call layer provides a reliable data transport over an Rx channel, and is used by the RPC layer to make Rx calls. One of the most important pieces of the call layer is the Rx ACK packet. The ACK packet is used by Rx to determine when retransmissions are needed, as well as to determine the proper transmission and receiving parameters to use (such as the transmit window size and jumbogram length, described in more detail below).
A new call is established by the initiator sending a DATA packet with sequence number one (1) to the acceptor on an available channel. Either side can indicate that they have no more DATA packets to send by setting the LAST-PACKET flag in their final DATA packet (which might be sequence number one). Each call remains open until the upper layer informs Rx that it is done with the call. (The upper layer in this case would most likely be the Rx RPC layer.)

The structure of an Rx ACK packet is described in the "Packet Formats" section. This section will refer to particular fields of the ACK packet by names.

The field is unused and should be set to zero (0). It was originally intended to store the number of packet buffers that the ACK sender can provide for receiving DATA packets for this call but was never used for this or any other purpose.

The field is unused and should be set to zero (0). It was originally intended to share the maximum packet skew that the sender of the ACK packet has observed for this Rx peer with the intent that retransmission be avoided due to expected out of order delivery. See the "Historical Implementation Notes" section. For example, if a packet is received N packets later than expected (based on the packet's serial number, i.e. if the last received packet's serial number is N higher than this packet's), then it is defined to have a skew of N. This can be used to avoid retransmission because of packet reordering. However, the reliance on the use of serial numbers which are assigned to packets of all types across all calls of a connection make this skew measurement particularly unreliable.

The field specifies the Sequence number of the first DATA packet that would be explicitly acknowledged (either positively or negatively) by this packet if the is non-zero. All DATA packets with Sequence numbers smaller than this are implicitly acknowledged. The Sequence number must never go backwards.
Any ACK packet received with an out-of-order Sequence number should be ignored.

The field should specify the largest DATA packet Sequence number accepted (aka not dropped) by the issuer of the ACK packet unless the EXTENDED-SACK Rx header flag is set. If the EXTENDED-SACK flag is set, this field must specify the largest DATA packet Sequence number accepted by the issuer of the ACK packet. The Sequence number must not be that of a discarded packet such as one that exceeded the window size. Nor should the value of ever go backwards, although it is permitted to wrap. See the "Historical Implementation Notes" section for details on various implementation specific deviations that make use of this field unreliable.

The field indicates the serial number of the packet which triggered this ACK packet, or zero if there is no such packet (i.e. the ack packet was delayed and should not be used for round-trip time computation). The receiver should note that any DATA packets transmitted with a serial number less than this, which are not acknowledged by this packet, are likely lost or reordered. Thus, these packets may be retransmitted, after a possible delay to allow for packet reordering (as measured by packet skew).

The field specifies a particular type of an ACK packet. Valid reason codes are specified in the ``Packet Formats and Protocol Constants'' section; their meanings are as follows:

REQUESTED
    Acknowledgement was requested. The peer received a DATA or ACK packet from us with the REQUEST-ACK flag set, and this packet is acknowledging it. If sent in response to a DATA packet, the DATA packet's serial number is in the field.

DUPLICATE
    A duplicate DATA packet was received. The duplicate DATA packet's serial number is in the field.

OUT-OF-SEQUENCE
    A DATA packet was received out of sequence. The serial number of the DATA packet is in the field.

EXCEEDS-WINDOW
    A DATA packet was received whose Sequence number exceeded the current receive window, and was dropped.
    The serial number of the DATA packet is in the field.

NOSPACE
    A DATA packet was received, but no buffer space was available and therefore it was dropped. The serial number of the dropped DATA packet is in the field.

PING
    This is a keep-alive packet, used to verify that the peer is still alive. If the REQUEST-ACK flag in the Rx packet is set, the recipient of this packet should reply with a PING-RESPONSE packet.

PING-RESPONSE
    This is a response to a PING ACK packet with the REQUEST-ACK flag set. The serial number of the PING ACK is in the field.

DELAY
    A delayed acknowledgement, usually because a certain amount of time has passed since the receipt of the last DATA packet and there are outstanding unacknowledged DATA packets. DELAY ACKs should not be used for RTT computations.

IDLE
    Similar to DELAY but can be used for RTT computation. Introduced by OpenAFS 1.2.

A peer should never delay the transmission of an ACK packet in response to a received packet unless it sets the field to DELAY. This is because ACK packets (except for DELAY ones) are used for RTT computation by Rx peers.

All acknowledgement packets should clear the REQUEST-ACK flag in the Rx header, except when the field is set to PING.

The field specifies the size of the variable-length Selective Acknowledgements Table or . The field can specify a size between 0 and 255 octets.

The is a variable-length Selective Acknowledgements Table whose size is specified by the field. When the is zero, there is no table. When the is greater than zero, the 0th bit of each octet represents a single DATA packet by sequence number. The range of DATA packets whose acknowledgement state is represented by the are through ( + - 1) inclusive. The meaning of each bit is as follows:

    0   Explicit negative acknowledgement: packet with the corresponding sequence number has not been received or has been dropped.

    1   Explicit acknowledgement: packet with the corresponding sequence number has been received but may yet be dropped.
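
As an illustration of the table layout described above, the following Python sketch decodes a non-extended Selective Acknowledgements Table, in which each octet carries the state of one DATA packet in its low-order bit. The function name decode_sack and the parameters first_packet (the Sequence number represented by the first octet) and acks (the raw table octets) are hypothetical names chosen for this example; they are not taken from this specification or from any Rx implementation.

```python
def decode_sack(first_packet: int, acks: bytes) -> dict[int, bool]:
    """Map each covered DATA Sequence number to True (positively
    acknowledged) or False (negatively acknowledged).

    first_packet and acks are illustrative names (assumptions).
    """
    status = {}
    for offset, octet in enumerate(acks):
        # Sequence numbers are unsigned 32-bit and are permitted to wrap.
        seq = (first_packet + offset) & 0xFFFFFFFF
        # Only bit 0 of each octet carries the acknowledgement state.
        status[seq] = bool(octet & 0x01)
    return status


table = decode_sack(10, bytes([1, 0, 1, 1]))
missing = [seq for seq, received in table.items() if not received]
```

In this example the table covers Sequence numbers 10 through 13, and only packet 11 is negatively acknowledged. An empty table yields an empty mapping, matching the rule that a size of zero means there is no table.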
When the EXTENDED-SACK Rx Packet Header flag is set and the equals 255, the number of DATA packets represented in the SACK table is

    NumAcks0 := MIN( - + 1, 2048)

If is greater than 255, the SACK table is extended to a total of 256 octets by stealing the first of the reserved octets before the trailers. This octet must be set to zero if is less than 256.

It's important to note the distinction between packets with Sequence numbers prior to , between and ( + - 1), and those with Sequence numbers of at least ( + ). Those in the first category have been hard-acknowledged and must not be dropped in the future; the DATA sender (ACK recipient) is permitted to recycle DATA packets once the leading edge of the window advances. Packets in the second category are individually soft-acknowledged in the , either as being queued for the application or not received. The DATA sender (ACK recipient) must keep all packets with sequence numbers in this range, but avoid retransmitting the positively acknowledged ones. Negatively acknowledged packets should be retransmitted according to the DATA sender's flow control algorithm. Packets in the third category are not acknowledged at all, and the DATA sender (ACK recipient) should assume no knowledge of their state, even if the Rx receive window exceeds the size of the .

If the EXTENDED-SACK flag is set and is greater than 255, the size is 256 octets. The extended acknowledges up to 2048 DATA packets using horizontal striping. The two octets following the are and . is the number of 32-bit fields; currently four (4). is the number of additional extended SACK tables that follow the fields. Each additional Extended SACK can acknowledge up to an additional 2048 DATA packets.

The optional fields are not 32-bit aligned with respect to the packet. If the EXTENDED-SACK flag is unset, the first field begins three octets after the end of the variable-length . If the EXTENDED-SACK flag is set, the first field begins following the field.
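
The three categories of Sequence numbers distinguished earlier in this section can be sketched from the DATA sender's point of view as follows. This is a minimal illustration for the non-extended table only (one octet per packet), and it ignores Sequence number wrap-around for brevity. The names classify_sequence, first_packet and acks are hypothetical and are not taken from this specification.

```python
def classify_sequence(seq: int, first_packet: int, acks: bytes) -> str:
    """Describe what one ACK packet says about DATA packet `seq`.

    first_packet is the Sequence number covered by acks[0]; both
    names are assumptions made for this sketch.
    """
    if seq < first_packet:
        # First category: implicitly (hard) acknowledged; the sender
        # is permitted to recycle the packet.
        return "hard-acked"
    offset = seq - first_packet
    if offset < len(acks):
        # Second category: the packet must be kept either way, but
        # only negatively acknowledged packets are retransmitted.
        return "soft-acked" if acks[offset] & 0x01 else "negatively-acked"
    # Third category: the ACK carries no information about this packet.
    return "unknown"


acks = bytes([1, 0, 1])
states = {seq: classify_sequence(seq, 5, acks) for seq in range(3, 10)}
```

Here packets 3 and 4 fall before the table, packets 5 through 7 are covered by it, and packets 8 and 9 lie beyond it, so the sender should assume nothing about their state.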
Four 32-bit fields are defined: , , , and . Unrecognized fields should be ignored. Their presence depends on the version of the Rx peer; see the "Historical Implementation Notes" section for details.

The and packet sizes are, respectively, the largest possible packet size that the peer is willing to accept from the ACK receiver, and the size of the packet the peer would prefer to receive. In the absence of these fields, it should be assumed that the maximum and interface Rx packet sizes are 1444 bytes. (1500 - IPv6 header (40) - IPv6 fragment header (8) - UDP header (8))

The indicates the size of the ACK sender's receive window, in packets. The maximum is 65535 packets. If this field is absent, the implementation must assume a receive window of 16 packets; Rx implementations that do not support this trailing field implemented a fixed window size of 16 packets.

The field indicates how many DATA packets the ACK sender is willing to receive in a jumbogram (also described below). All DATA packets in a jumbogram (except the last one) are always 1412 bytes, regardless of the and packet sizes described above. When the field is missing, the ACK receiver must assume a value of one (1) packet.

* Round-trip time computation

To determine when packet retransmission is necessary, Rx computes some statistics about the round-trip time between the two hosts: exponentially-decaying averages of the round-trip time and the standard deviation thereof.

Each acknowledgement packet which mentions a specific packet in the field and is not delayed is used to update the round-trip statistics. First, the round-trip time for this packet (R) is computed as the difference between the arrival time of the ack packet and the time we transmitted the packet with the serial number specified in . Next, the round-trip time average and standard deviation values are updated.
For instance, this algorithm could be used:

    RTTdev = RTTdev * (3/4) + |RTTavg - R| / 4
    RTTavg = RTTavg * (7/8) + R / 8

* Packet retransmission

In order to support reliable data transport, Rx must retransmit packets which are lost in the network. This must not be done too early, otherwise we might retransmit a packet whose first copy is still in transit, thereby wasting bandwidth.

Rx computes a retransmit timeout value T, and retransmits any packet which hasn't been positively acknowledged since last transmission for at least T seconds. This timeout could be computed as follows from the round-trip statistics above:

    T = RTTavg + 4 * RTTdev + 0.350

This allows the packet to be up to 4 deviations late and still not be retransmitted. The 350 msec fudge factor is used to compensate for bursty networks, though it is likely becoming less relevant (and accurate) with time.

A more clever algorithm could take into account the maximum packet skew rate, and improve the retransmission strategy to take into account the likelihood that a given packet has been reordered, and give it extra time before retransmission.

* Keepalive and Timeout

The upper layer (either the Rx RPC layer or the application) has to specify a timeout, T, to the call layer. If the peer is not heard from within T seconds, the call layer declares the call to be dead and propagates the error to the upper layer.

In order to determine whether the peer is still alive or not, keepalive requests are used. These take the form of PING and PING-RESPONSE ACK packets. When the client has not received any response from the server, either to the original request or the keepalive requests, in T seconds, the call times out.

The following strategy may be used to determine when to send keepalive requests:

    Compute a keepalive timeout, KT = T/6

    If the call was initiated KT seconds ago, or KT seconds have passed since the last keepalive request transmission, send a keepalive packet.
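
The round-trip, retransmission timeout, and keepalive computations described above can be sketched in Python as follows. This is only an illustration of the example formulas given in this section, not a description of any particular implementation. All names (RttEstimator, keepalive_due, and so on) are hypothetical, times are in seconds, and the initial values of the average and deviation are supplied by the caller because this section does not define them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RttEstimator:
    rtt_avg: float  # RTTavg; initial value is the caller's assumption
    rtt_dev: float  # RTTdev; initial value is the caller's assumption

    def update(self, r: float) -> None:
        """Fold one round-trip sample R into the running statistics."""
        # The deviation is updated first, using the previous average,
        # in the order the two formulas are listed in the text.
        self.rtt_dev = self.rtt_dev * 3 / 4 + abs(self.rtt_avg - r) / 4
        self.rtt_avg = self.rtt_avg * 7 / 8 + r / 8

    def retransmit_timeout(self) -> float:
        """T = RTTavg + 4 * RTTdev + 0.350 (the 350 msec fudge factor)."""
        return self.rtt_avg + 4 * self.rtt_dev + 0.350


def keepalive_interval(call_timeout: float) -> float:
    """KT = T/6, where T is the call timeout set by the upper layer."""
    return call_timeout / 6


def keepalive_due(now: float, call_start: float,
                  last_keepalive: Optional[float],
                  call_timeout: float) -> bool:
    """True once KT seconds have passed since the call started or since
    the last keepalive request was sent, whichever is later."""
    reference = call_start if last_keepalive is None else last_keepalive
    return now - reference >= keepalive_interval(call_timeout)


est = RttEstimator(rtt_avg=0.100, rtt_dev=0.0)
est.update(0.200)
timeout = est.retransmit_timeout()
```

Note that this section uses the symbol T both for the retransmit timeout and for the call timeout; the sketch keeps the two apart as retransmit_timeout() and call_timeout.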
The keepalive strategy above limits the number of transmitted keepalive packets to a fixed number in the case of a dead server, and to a number proportional to the real timeout in the case of a slow server. It also allows up to 5 keepalives to be dropped before the server is erroneously declared dead.

* Flow Control

Every Rx client or server has associated with each Rx call a receive and transmit window. These windows indicate the number of packets that have not been fully acknowledged (that is, not read by the peer's application) that an Rx sender can have outstanding at any time. A sender's transmit window may never be greater than its peer's receive window for that call. The receive windows are exchanged via the parameter in an Ack packet.

Rx ``sliding windows'' are similar to those used by TCP, except they measure packets rather than bytes. Also, in TCP the window effectively applies to bytes in flight between the two peers, whereas in Rx the window applies to packets between the user applications.

For example, a transmit window of 8 on a certain Rx connection means that at most 8 packets can be transmitted and not yet read by the peer's application at any time. The sequence number of the first packet that hasn't been read by the application is indicated by the field of an Ack packet.

The selection of initial window sizes isn't strictly defined by the Rx protocol, but historically the initial window size must be 16 packets. The Ack Trailer field can adjust the window up or down as necessary.

Rx uses the slow start, congestion avoidance, and fast recovery algorithms[6]. The algorithms are modified to work in the context of Rx packet-based transmission windows, and are described below. These algorithms require two additional variables to be maintained for each active Rx call: a congestion window, cwind, and a slow start threshold, ssthresh.

Define a "negative ack" as an Ack packet that contains a negative acknowledgement followed by a positive one.
Similarly, define a "positive ack" to be any Ack that is not negative. Upon receiving three negative acks for a call in a row since the last congestion avoidance attempt (if any), the Rx protocol enters congestion avoidance for that Rx call.

* Slow start, congestion avoidance, and fast recovery algorithms

First, the congestion window, cwind, is initialized to 1. The number of unread transmitted packets is now limited not only by the transmission window, but also by the congestion window. The latter limit is a little different: Rx may send up to cwind packets (by sequence number) past the last contiguous positively acknowledged packet. For example, if an Ack packet indicates that packets 1, 2 and 8 were received, and cwind is 2, Rx may transmit packets 3 and 4.

When congestion occurs (indicated by a negative ack or a packet retransmission timeout), Rx enters congestion avoidance and fast recovery. The slow-start threshold, ssthresh, is set to half of the effective transmission window (minimum of cwind and transmit window), but no less than 2 packets. If triggered by a negative ack, any negatively acknowledged packets should be retransmitted as soon as possible (i.e. window-permitting). If triggered by a retransmission timeout, the congestion window is reset to a single packet.

When in fast-recovery mode, every additional negative ack packet received causes cwind to be increased by one packet. A positive ack packet causes cwind to be set to ssthresh, and terminates fast recovery. At this point we are back to congestion avoidance, since the cwind is half the original transmission window.

When packet acknowledgements are received, the congestion window should be increased. If cwind is less than ssthresh, cwind should be increased by 1 for each newly acknowledged packet. If cwind is at least ssthresh, cwind is increased by 1 for each newly received Ack packet.

The advertised can be larger than the size of the Rx ACK packet Selective Acknowledgement table.
The sender of DATA packets whose sequence numbers do not fit within the SACK will not receive any feedback on the reception state of such packets.

Debugging
=========

Rx provides for an optional debugging interface, using the DEBUG packet type, allowing remote Rx clients to query an Rx peer for some Rx protocol statistics. Implementations are not required to implement this interface. Some parts of this interface may also be specific to a particular implementation of Rx. In order to prevent packet loops, a server should only reply to DEBUG packets with the client-initiated flag set.

The payload of a debug request packet is always the same; both of the 32-bit quantities are in network byte order:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          Debug Type                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          Debug Index                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The debug type indicates the kind of debug information being sent or requested, and determines the format of the rest of the packet. The debug index allows some debug types to export array-like data, indexed by this field.

The following debug types are defined for the Transarc implementation:

    0x01    Retrieve basic connection statistics
    0x02    Get information about some connections
    0x03    Get information about all connections
    0x04    Get all Rx stats
    0x05    Get all peers of this server

The index field in the debug packet indicates which element of the debug information the client wants to access, in cases where there are multiple entries in question.

The responses to each of those debug queries contain the following information:

1.  Retrieve basic connection stats

    An array of general statistics about packet allocation, server performance, and so on. The first octet in this response represents the debug protocol version being used by the server. See RX_DEBUGI_VERSION* in rx/rx.h.

2, 3.
Get information about connections

    Both of these calls return a struct rx_debugConn (see rx/rx.h), indexed by the "index" field. The first version of the debug call (type 2) only retrieves information about connections which are deemed interesting, that is, connections which are active, or about to be reaped. The end of the list is signaled by a response where the connection ID value is 0xFFFFFFFF.

4.  Get Rx stats

    This call returns a struct rx_stats to the client in network byte order, containing various statistics about the state of Rx on the server (see rx/rx.h).

5.  Get all Rx peers

    Similar to the connection request above (2, 3) this call returns all the Rx peers of the server (in a network-byte-order struct rx_debugPeer), indexed by the index field in the request. End of list is indicated by a host value of 0xFFFFFFFF. (These are the first 4 octets.)

In response to unknown requests, the server returns 0xFFFFFFF8 in the debug type field.

XXX The response interface should probably be fixed to include a fixed header that indicates whether the request was successfully completed.

Jumbograms
==========

To be able to transmit more data in a single packet, Rx supports ``jumbograms'', which are single UDP datagrams containing multiple sequential Rx DATA packets. In a jumbogram, all packets except the last one must be of a fixed maximal size (1412 bytes). Because all the packets in the jumbogram are sequential, only one full header is needed. Here is what a jumbogram could look like:

+-----------+---------------+--------------+---------------+
| Rx header | 1412 byte pkt | Short header | 1412 byte pkt | ->
+-----------+---------------+--------------+---------------+

   +--------------+-   -+-----------------------+
-> | Short header | ... | <= 1412 byte last pkt |
   +--------------+-   -+-----------------------+

Every Rx packet in a jumbogram except the first one must be preceded by the short Rx header, and all packets except the last one must have the JUMBO-PACKET Rx flag set in their respective headers. The number of packets in a jumbogram may not exceed the peer's advertised Max Packets Per Jumbogram value in the Ack packet. The maximum number of packets per jumbogram should be assumed to be 1 (i.e., no jumbograms) unless explicitly specified otherwise by an Ack packet. If an Ack packet is received without the packet-per-jumbogram field, it might indicate that the peer is now running a version of Rx that does not support jumbograms, and therefore no jumbograms should be sent until they are explicitly enabled again.

The short header in a jumbogram has the following makeup:

 0                   1
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Flags     |   Reserved    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       Security Specific       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

All the packets in the jumbogram have the same Rx header fields (from the full Rx header) except for , , , and . The and fields for subsequent packets are taken from the short header preceding that packet in the jumbogram. The Sequence and Serial Numbers are assumed to be consecutive, and are incremented by 1 from the first packet in the jumbogram (i.e., the full Rx header).

Retransmitted packets should not be sent in a jumbogram.

RPC Layer
=========

This section discusses how an RPC call is made using the Rx protocol. There are two common ``types'' of Rx calls: simple and streaming (aka split). These mostly reflect a difference in the upper-level API rather than in the Rx protocol. A simple Rx call has a fixed number of input variables and a fixed number of output variables. A streaming (or split) Rx call, in addition to the above, allows the user to send and receive arbitrary amounts of data (whose length should be specified as a fixed-length argument.)
In either case, an Rx call consists of two basic stages: client sending the data to the server, and server sending the response back to the client. No data can be sent by the client after the client sends a DATA packet with the LAST-PACKET flag set. The server must not send any data packets to the client before it receives all of the client's DATA packets up to and including the LAST-PACKET. The call successfully completes when the client receives all of the server's DATA packets up to and including the LAST-PACKET. After receiving the LAST-PACKET, the receiver must confirm that the sender did not transmit more DATA than was expected. When Rx services use XDR for marshaling, each remote function call associated with the Rx service (identified by the IP-port-serviceId triplet) is assigned a 32-bit integer opcode number. To make a simple Rx call, the caller must transmit the opcode number followed by the expected arguments for that call over an Rx channel using XDR encoding. The callee uses XDR to unmarshall the opcode and input arguments, performs a function call corresponding to that opcode and arguments, and then uses XDR to encode the return values back to the caller. The caller then uses XDR to receive the output variables. For streaming calls which send data from the caller to the callee, one convention is to include the length of the data to be sent as one of the fixed-length arguments, and send the variable-length data immediately after the fixed-length portion. For streaming calls which receive data, one convention is for the callee to first reply with a fixed-length field specifying the number of bytes it's about to send, and then send those bytes. Upon completion of the streaming part of the call, the output arguments are sent back to the caller in fixed-length XDR form, as with simple calls. Packet Formats and Protocol Constants ===================================== * Rx packet Every simple Rx packet has an Rx header, of the form below. 
All quantities are in network byte order.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|+|                      Connection Epoch                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Connection ID                       | * |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          Call Number                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Sequence Number                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Serial Number                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |     Flags     |  User Status  |  Security ID  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       Security Specific       |          Service ID           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Payload ....
+-+-+-+-+-

[*] The field marked with * is the Channel ID. The last two bits of the connection ID are used to multiplex between 4 parallel calls.

[+] The bit marked with + is the Ignore-Source Flag used to indicate that only the connection ID should be used to identify this connection, and sender host/port should not be used.
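The header layout above can be decoded with a few big-endian reads. The following is a minimal sketch in C, not the code of any particular Rx implementation: the struct name, field names, and helper functions are hypothetical and chosen only for illustration. It assumes the fixed header is 28 octets (seven 32-bit words, as drawn) and that the Ignore-Source Flag is the most significant bit of the epoch word.

```c
#include <stddef.h>
#include <stdint.h>

/* Fixed Rx header size as drawn in the diagram: seven 32-bit words. */
#define RX_HEADER_SIZE 28

/* Hypothetical host-order view of the header; names are illustrative only. */
struct rx_header_sketch {
    uint32_t epoch;             /* bit 31 is the Ignore-Source Flag        */
    uint32_t cid;               /* Connection ID; low 2 bits = Channel ID  */
    uint32_t call_number;
    uint32_t seq;
    uint32_t serial;
    uint8_t  type;
    uint8_t  flags;
    uint8_t  user_status;
    uint8_t  security_id;
    uint16_t security_specific;
    uint16_t service_id;
};

/* Read a 32-bit network-byte-order value. */
static uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Read a 16-bit network-byte-order value. */
static uint16_t get_be16(const uint8_t *p)
{
    return (uint16_t)(((unsigned)p[0] << 8) | (unsigned)p[1]);
}

/* Decode the header of a received datagram. Returns 0, or -1 if too short. */
int rx_header_decode(const uint8_t *buf, size_t len,
                     struct rx_header_sketch *h)
{
    if (len < RX_HEADER_SIZE)
        return -1;
    h->epoch             = get_be32(buf + 0);
    h->cid               = get_be32(buf + 4);
    h->call_number       = get_be32(buf + 8);
    h->seq               = get_be32(buf + 12);
    h->serial            = get_be32(buf + 16);
    h->type              = buf[20];
    h->flags             = buf[21];
    h->user_status       = buf[22];
    h->security_id       = buf[23];
    h->security_specific = get_be16(buf + 24);
    h->service_id        = get_be16(buf + 26);
    return 0;
}

/* Ignore-Source Flag: the bit marked [+] in the diagram. */
int rx_ignore_source(const struct rx_header_sketch *h)
{
    return (int)(h->epoch >> 31);
}

/* Channel ID: the two low-order bits marked [*] in the diagram. */
unsigned rx_channel_id(const struct rx_header_sketch *h)
{
    return (unsigned)(h->cid & 0x3u);
}

/* Connection ID with the Channel ID bits masked off. */
uint32_t rx_conn_base(const struct rx_header_sketch *h)
{
    return h->cid & ~(uint32_t)0x3;
}
```

A receiver would call rx_header_decode() on each datagram before any other processing and treat a return of -1 as a malformed packet. Reading octet by octet avoids depending on host alignment or byte order.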
The following packet type values are defined:

     1  DATA        Standard data packet
     2  ACK         Acknowledgement of received data
     3  BUSY        Busy response
     4  ABORT       Abort packet
     5  ACKALL      Acknowledgement of all packets
     6  CHALLENGE   Challenge request
     7  RESPONSE    Challenge response
     8  DEBUG       Debug packet
     9  PARAMS      Exchange of parameters (ignored)
    10  UNUSED_1    Unused and ignored
    11  UNUSED_2    Unused and ignored
    12  UNUSED_3    Unused and RX_PROTOCOL_ERROR abort
    13  VERSION     Get Rx version

Any other unrecognized packet type returns an RX_PROTOCOL_ERROR abort.

The values for the Flags field are defined as follows:

    0000 0001   CLIENT-INITIATED  (Any packet type)
    0000 0010   REQUEST-ACK       (DATA and ACK only)
    0000 0100   LAST-PACKET       (DATA only)
    0000 1000   MORE-PACKETS      (DATA only)
    0000 1000   EXTENDED-SACK     (ACK only)
    0001 0000   - Reserved -      (See Historical Implementation Notes)
    0010 0000   SLOW-START-OK     (ACK only)
    0010 0000   JUMBO-PACKET      (DATA only)

AFS3 Rx services commonly, but not necessarily, use the following value mappings for the Security field:

    0   No security or encryption
    1   bcrypt security, only used in AFS 2.0
    2   "krb4" rxkad
    3   "krb4" rxkad with encryption (sometimes)

* Rx acknowledgement packet (EXTENDED-SACK Flag unset)

This is the legacy ACK packet format which is applicable whenever an ACK packet is received without the Rx Packet Header Flag EXTENDED-SACK (8) set.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Buffer Space          |           Max Skew            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         First Packet                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Previous Packet                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            Serial                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|    Reason     |   Ack Count   | SACK table (0 to 255 octets)...
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  ..
...          -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
... SACK        |  Reserved[0]  |  Reserved[1]  |  Reserved[2]  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Maximum Packet Size                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Recommended Packet Size                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Receive Window Size                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Max Packets per Jumbogram                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Note that the trailing fields can have arbitrary alignment, determined by the size of the variable length SACK table in the packet. There are three reserved and unaligned octets between the SACK table and the start of the trailing fields: , and . All three reserved octets should be zero.

The valid values for the Reason code are:

    1   REQUESTED
    2   DUPLICATE
    3   OUT_OF_SEQUENCE
    4   EXCEEDS_WINDOW
    5   NOSPACE
    6   PING
    7   PING_RESPONSE
    8   DELAY
    9   IDLE

* Extended Rx acknowledgement packet (EXTENDED-SACK Flag set)

This is the Extended ACK packet format which is applicable whenever an ACK packet is received with the Rx Packet Header Flag EXTENDED-SACK (8) set. This ACK packet format is designed to be backward compatible with Rx peers that do not recognize the meaning of the EXTENDED-SACK Rx Header Flag. It differs from the Legacy ACK packet format as follows:

1. The field must be set to the largest DATA packet Sequence number accepted by the Rx peer. It must not be set to the Sequence number of a dropped DATA packet. Its value must not go backwards (although it is permitted to wrap.)

2. If is 255 then the number of DATA packets that are explicitly ACKed or NACKed within the first SACK table is

       NumAcks0 := MIN( - + 1, 2048)

3. If is greater than 255, the SACK table is extended to a total of 256 octets with the addition of . must be set to zero if is less than 256.

4. The acknowledgement state of each DATA packet is represented by a single bit using horizontal striping.
Up to 2048 DATA packets starting with can be represented in the SACK table. The striping pattern is:

       := -

       bit
        0         0 ..  255
        1       256 ..  511
        2       512 ..  767
        3       768 .. 1023
        4      1024 .. 1279
        5      1280 .. 1535
        6      1536 .. 1791
        7      1792 .. 2047

5. becomes representing the number of 32-bit trailers. At present this value is set to four (4) but can be increased if new trailer fields are defined.

6. becomes representing the number of Extra SACK tables that are present in this ACK packet following the trailers.

7. Each Extra SACK table is preceded by a one octet field that specifies how many additional octets of the SACK table are present. of zero indicates an Extra SACK table consisting of one octet. The maximum size of each Extra SACK table is 256 octets which is represented by equal to 255. The number of acknowledged DATA packets in the first Extra SACK table is:

       NumAcks1 := MIN( - - 2047, 2048)

   The striping pattern is:

       := - ( + 2048)

       bit                         + 2048
        0         0 ..  255     2048 .. 2303
        1       256 ..  511     2304 .. 2559
        2       512 ..  767     2560 .. 2815
        3       768 .. 1023     2816 .. 3071
        4      1024 .. 1279     3072 .. 3327
        5      1280 .. 1535     3328 .. 3583
        6      1536 .. 1791     3584 .. 3839
        7      1792 .. 2047     3840 .. 4095

   Subsequent Extra SACK tables repeat the pattern.

8. Up to three Extra SACK tables can be included in the ACK packet before the ACK packet grows beyond the mandatory minimum IPv6 MTU size (1280 octets). The minimum IPv4 MTU size (576 octets) is exceeded by the inclusion of one Extra SACK table. However, according to RFC9000[7] the IPv6 minimum is supported by most IPv4 networks.

9. Receivers of ACK packets must ignore any Extra SACK tables that are missing or truncated. The receiver should process the ACK packet as if the missing or truncated SACK tables were intentionally not sent.

10. The may be larger than the number of DATA packets represented in the provided SACK tables.
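The horizontal striping described in the list above maps a packet's offset from the start of the table to one bit of one octet: the bit position is the offset divided by 256 and the octet index is the offset modulo 256. The following C sketch illustrates that lookup for a single SACK table. The function name and return convention are hypothetical. It assumes that "bit 0" is the least significant bit of each octet, which is consistent with the backward-compatibility goal stated above but is not spelled out in this section, and it assumes that a table shorter than 256 octets carries no information in bits 1 through 7.

```c
#include <stddef.h>
#include <stdint.h>

#define RX_SACK_STRIPE   256u   /* packets covered by one bit position      */
#define RX_SACK_MAX_PKTS 2048u  /* 8 bit positions x 256 octets per table   */

/*
 * Look up the state of the DATA packet at `offset` packets past the first
 * packet covered by this SACK table.
 *
 * Returns 1 if acknowledged, 0 if negatively acknowledged, and -1 if the
 * table does not describe that packet.
 *
 * Assumption: bit 0 is the least significant bit of each octet.
 */
int rx_sack_lookup(const uint8_t *table, size_t table_len, uint32_t offset)
{
    unsigned bit;
    size_t octet;

    if (offset >= RX_SACK_MAX_PKTS)
        return -1;                    /* belongs to a later Extra SACK table */

    bit   = offset / RX_SACK_STRIPE;  /* stripe: 0 for 0..255, 1 for 256..511 */
    octet = offset % RX_SACK_STRIPE;  /* position within the stripe           */

    if (octet >= table_len)
        return -1;                    /* octet not present (short table)      */
    if (bit > 0 && table_len < RX_SACK_STRIPE)
        return -1;                    /* higher stripes need a full table     */

    return (table[octet] >> bit) & 1;
}
```

For an Extra SACK table the same function could be reused by subtracting a further 2048 from the offset for each preceding table, following the second striping pattern above.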
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Buffer Space          |           Max Skew            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         First Packet                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|    Previous Packet (largest accepted DATA Sequence number)    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            Serial                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|    Reason     |   Ack Count   | SACK table (0 to 256 octets)...
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  ..
...   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
... SACK (horizontal striping)  | Trailer Count |  Extra SACKs  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Maximum Packet Size                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Recommended Packet Size                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Receive Window Size                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Max Packets per Jumbogram                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|SACK Size (N-1)| SACK table (0 to 256 octets) ...
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ ...
...   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
... SACK (horizontal striping)                                  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|SACK Size (N-1)| SACK table (0 to 256 octets) ...
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ ...
...   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
... SACK (horizontal striping)                                  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Historical Implementation Notes
===============================

Rx Packet Header Flag LAST-PACKET (4):

Some older Rx implementations use DATA packets without the LAST-PACKET flag set to restrict the MTU.
This notably applies to IBM AFS 3.2 and earlier, which supported neither Jumbo Packets nor the ACK packet trailers introduced in later versions. Such Rx peers adjust down the Path MTU to match the size of received DATA packets that do not have the LAST-PACKET flag set.

Rx Packet Header Flag Bit-5 (16):

AFS 3.2 through OpenAFS 1.0.1 used Rx Packet Header Flag Bit-5 (16) to indicate that the packet structure was located on an internal free packet queue. Any packet received with Bit-5 (16) set could be misinterpreted as a free packet. Therefore, Bit-5 must not be set in any Rx Packet written to the network.

ACK :

Use of the ackPacket.maxSkew field was broken prior to March 1989 and is unused. Some early Rx implementations mistakenly assigned ackPacket.maxSkew = htonl(rx_peer.inPacketSkew) even though the rx_peer.inPacketSkew field was an unsigned short. Another bug was a failure to ignore maxSkew computations when the ackPacket.serial number was zero. Since March 1989, the ackPacket.maxSkew field has been ignored when processing ACK packets and has been assigned zero when sending ACK packets.

Prior to March 1989 the rx_peer.inPacketSkew field was calculated as

    skew = rx_conn.lastSerial - packet.header.serial
    rx_conn.lastSerial = packet.header.serial
    if (skew > 0 && skew > peer->inPacketSkew)
        peer->inPacketSkew = skew

The most recently received ackPacket.maxSkew value was then used to restrict retransmissions:

    if (!packet.acked && packet.sent &&
        packet.header.serial + ackPacket.maxSkew < ackPacket.serial)
    then
        retransmit packet

The above computation failed to take into account ACK packets whose ackPacket.serial is zero or the possibility that serial numbers could wrap when more than 2^31-1 packets were sent using a connection in a single direction. As a result, lost DATA packets could fail to be retransmitted and the call could stall indefinitely.

Computing maximum skew from connection serial numbers made sense when all Rx peers were single-threaded.
In multi-threaded peers, connection serial numbers are allocated as the data stream is packetized. The operating system scheduler, per socket queues, and per-call window management can result in packets being written to the wire from multiple calls interleaved. Therefore, receipt of serial numbers out of sequence implies nothing about the skew of the network path.

Enough time has passed that the ackPacket.maxSkew field could be considered unused and reserved for future use. Knowledge of the maximum skew between two Rx peers is useful information and could be leveraged to reduce unnecessary retransmissions. However, it cannot be computed by use of Rx connection serial numbers.

ACK :

The original and current meaning of the ackPacket.serial field when non-zero is that it contains the packet.header.serial of the incoming packet to which the ACK packet was immediately sent in response. The incoming packet could be a DATA or an ACK; it might or might not have the ACK_REQUESTED flag set. Originally, ACK packets with reason RX_ACK_PING_RESPONSE were sent with ackPacket.serial set to zero.

From March 1989 through May 1989 the ackPacket.serial field was given a different meaning. Rx connections stored the maximum serial received from the peer (rx_conn.maxSerial). When constructing an ACK packet, the ackPacket.serial was assigned as follows:

    if (ackPacket.reason == RX_ACK_REQUESTED)
        ackPacket.serial = htonl(conn->maxSerial)
    else
        ackPacket.serial = htonl(conn->maxSerial + 1)

The intent of the change was to force packets to be retransmitted in response to receipt of an ACK even if the retransmission timer had been mistakenly turned off. In the previously deployed Rx implementations packets would not be sent if

    ackPacket.skew > ackPacket.serial - sentPacket.serial

is true for any DATA packet in the sent queue.

From May 1989 through OpenAFS 1.2.7 the ackPacket.serial field was set to htonl(conn->maxSerial) regardless of ackPacket.reason.
OpenAFS 1.2.8 restored the original meaning of ackPacket.serial. A non-zero ackPacket.serial indicates the serial number of the DATA or ACK packet it was sent in response to. This now includes ACK packets with reason RX_ACK_PING_RESPONSE. During the period from May 1989, the receipt by an incoming connection of an ackPacket.serial greater than conn.nextSerial would advance conn.nextSerial. if (conn.type == RX_SERVER_CONNECTION && conn.nextSerial < ackPacket.serial) conn.nextSerial = ackPacket.serial + 1; Advancing the incoming connection's serial number was necessary in case the incoming connection had been previously garbage collected. OpenAFS 1.2.8 removed the advancement of conn.nextSerial by incoming connections. ACK : The value assigned to has been inconsistent across Rx implementations. Sometimes the value of contained the sequence number of the DATA packet that triggered the ACK packet. Sometimes it contained the sequence number of the DATA packet received before the DATA packet that triggered the ACK packet. The value might be a sequence number of a DATA packet that was dropped because it was a duplicate or because it was outside the receive window. In AFS 3.0 the value was the sequence number of the previously received DATA packet except when the ackPacket.type is DELAY. The value could be a sequence number of a dropped packet. The value can go backwards in subsequently received ACK packets. There is no relationship between and the ackPacket.nAcks size. Starting with AFS 3.5 is set to the sequence number of the DATA packet with the ACK_REQUESTED flag set when sending an ACK packet of ackPacket.type REQUESTED. This change in behavior made the REQUESTED-ACK consistent with the DELAY-ACK. Kernel Rx versions AFS 3.5 up to and including OpenAFS 1.6.22 stored the Rx serial number of a dropped DATA packet instead of the DATA packet's sequence number. This serial number would be sent in subsequent ACK packets until the next DATA packet was received. 
OpenAFS 1.2 Rx modified the behavior of IDLE-ACK packets by setting to the sequence number of the DATA packet that triggered the sending of the ACK. OpenAFS 1.4 Rx modified the behavior of OUT-OF-SEQUENCE-ACK packets by setting to the sequence number of the DATA packet that triggered the sending of the ACK. In theory the field combined with the could be used to detect and ignore out-of-sequence ACK packets. For that to be true must consistently contain the largest DATA packet sequence number accepted by the Rx peer. AuriStor Rx always sets to the largest DATA packet sequence number of an accepted DATA packet. never goes backward provided that ACK packets are processed in order. is never set to a sequence number of a dropped DATA packet. When the EXTENDED-SACK Flag is set must be set to the largest DATA packet sequence of any DATA packet accepted by the receiver for the call. The value must never go backwards; although it is permitted to wrap to zero. ACK Trailer fields: The fields were introduced as follows: AFS 3.3: The fields and were introduced in AFS 3.3. AFS 3.4: The field was introduced in AFS 3.4. The AFS 3.4 ACK receiver would only reduce the size of the active call's transmit window by the advertised . AFS 3.5: The field field was introduced in AFS 3.5 in conjunction with Jumbograms. The field can grow the size of the active call's transmit window by the advertised . Although the field is unsigned 32-bit, the maximum receive window is constrained to unsigned short (65535). RX Connection Epoch and the Meaning of the High Epoch Bit: The RX Connection Epoch is an unsigned 32-bit value which when combined with the unsigned 30-bit Connection ID (CID) forms the primary identifier for any RX connection. These values when combined with the source and destination endpoints, the direction (as measured by the setting of the CLIENT-INITIATED Flag bit) and the Security Index uniquely identify the packets belonging to a connection. 
The choice of Epoch and CID for any RX Connection belongs solely to the Connection Initiator. Once the Epoch and CID are selected for a Connection, they must not change.

AFS 3.0:

The RX stack initialized a global RX Epoch value to seconds since UNIX Epoch at startup. This value was assigned to each initiated RX Connection. CID values started with 1 and incremented sequentially. CID values could wrap to 0.

RX FindConnection matched incoming packets to connections with an exact match of Epoch, CID, Direction (type), Security Index, Source Address and Source Port. RX FindConnection bound the peer endpoint to the connection when created and did not permit the peer endpoint to change.

AFS 3.0 (Aug 1990 patch release):

RX FindConnection ignored the source endpoint when matching incoming packets to connections if the receiver initiated the connection and the source port number matched the connection peer endpoint port.

AFS 3.1b:

Changes were made in response to the pending publication of "Hijacking AFS"[8]. Bit 31 (the high-order bit) of the RX Connection Epoch is designated the Ignore-Source Flag. When the RXKAD Security Class is initialized, the RX Epoch is re-initialized to a random value and the Ignore-Source Flag is set. The Ignore-Source enabled Epoch is used for all initiated connections regardless of whether or not the RXKAD security class is used.

"This function allows rxkad to set the epoch to a suitably random number which rx_NewConnection will use in the future. The principle purpose is to get rxnull connections to use the same epoch as the rxkad connections do, at least once the first rxkad connection is established. This is important now that the host/port addresses aren't used in FindConnection: the uniqueness of epoch/cid matters and the start time won't do."

RX FindConnection ignores the source endpoint when matching incoming packets to connections if the Ignore-Source Flag is set. This was true for both the acceptor and the initiator.
"epoch's high order bits mean route for security reasons only on the cid, not the host and port fields." RX connections with the Ignore-Source Flag set can accept packets from alternative endpoints provided that each peer continues to receive packets on the original source and destination endpoints. RX connections initiated by fileservers to vlservers and UBIK servers to each other used RXKAD and would therefore set the Ignore-Source Flag. Note that fileserver initiated RXAFSCB connections were rxnull but had the Ignore-Source Flag set. RXAFSCB services would therefore accept packets from any interface on a multihomed fileserver. AFS 3.4: Initialized KERNEL RX Epoch to seconds since UNIX Epoch and set the Ignore-Source Flag even though the Epoch is strictly time based and not unique. No change was made to the FindConnection logic. AFS 3.5: Introduced the rxLastConn cache pointer. Stopped treating the as part of the Connection identity. Packets that match a connection but have a different are dropped. Began to update the connection peer endpoint to the endpoint of the most recently received packet matched to the connection. "Ensure that the peer structure is set up in such a way that replies in this connection go back to that remote interface from which the last packet was sent out. In case, this packet's source IP address does not match the peer struct for this conn, then drop the refCount on conn->peer and get a new peer structure. We can check the host,port field in the peer structure without the rx_peerHashTable_lock because the peer structure has its refCount incremented and the only time the host,port in the peer struct gets updated is when the peer structure is created." OpenAFS 1.2.9 RX OPENAFS-SA-2003-002: 82523baf9f76eca38fc4856f52bc7cdabddf14b3 ("Clean up code in rxi_FindConnection") removed the logic introduced in IBM AFS 3.5 that updated the rx_connection peer upon the receipt of each accepted rx_packet. 
Restricted the application of Ignore-Source Flag acceptance only to packets received by the connection initiator. OpenAFS 1.4 OpenAFS b4566d725e1aa4f57d1e6db5821c590a4b6da7c0 ("partly-revert-rx-cleanup-20040804") reverted the restriction on acceptance of packets when the Ignore-Source Flag is set except by connection initiators because it broke the RXAFSCB service receipt of packets when the cache manager is multi-homed. "if there's a callback connection to a multihomed client, you need this or you end up with multiple connections, one per IP, being made from the single connection". OpenAFS 1.8 OpenAFS 39b165cdda941181845022c183fea1c7af7e4356 ("Move epoch and cid generation into the rx core") moved the generation of random Epochs and setting of the Ignore-Source Flag out of RXKAD and into RX proper. Now all connections have the Ignore-Source Flag set. AuriStorFS v0.192 AuriStorFS 506ba040fdc3b4325461ff9d8d8e2b5660e68111 ("rx: do not permit client connection packets to switch endpoints") removed the test of Ignore-Source Flag from rxi_conn_find(). Since connection initiators accept packets from any endpoint provided that the port number matches this change only prevents connection acceptors from matching packets to connections when the endpoint changes. Acceptors will create a new connection to bind the incoming call to a new endpoint. Without this change calls could enter a zombie state where ACK PING packets are successfully responded to but DELAYED ACK, DATA and ABORT response packets are ignored after the initiator moves to a new network endpoint and can no longer receive packets at the original endpoint. AuriStorFS v0.207 AuriStorFS 5d544dd373418539ef7e850c3cc0fd64bfdd7904 ("rx: identify connections by direction, epoch, cid, and securityIndex") restored inclusion of in the connection identity. Once again multiple connections that differ only by can exist between peers. 
AuriStorFS v2021.05-16 AuriStorFS 427754b023515881553afcf4382c84ed18931c6a ("rx: clear epoch high-bit to prevent conn endpoint switching") removes the setting of Ignore-Source Flag even though the Epoch is random. This will have no impact on communication with AuriStorFS services since the checks for High Epoch Bits were already removed. It will force IBM/OpenAFS acceptors to allocate a new connection upon receipt of a call from a new endpoint in order to bind the connection to the new endpoint. The Myth of Server Restart Detection Using the RX Connection Epoch OpenAFS 8d359e6dff5317698597e77f0a1dd5ba2bfb569a removed a March 1989 attempt at RX peer restart detection. The 1989 commit included the following statement: "The right way to detect a server restart in the midst of a call is to notice that the server epoch changed, btw." This statement is incorrect because changing the Connection Epoch will result in a distinct connection whose packets will not be mixed with those associated with a prior connection. Behavior of Rx peers that do not recognize DEBUG and/or VERSION packet types DEBUG and VERSION packets were not part of the original Rx implementation. DEBUG packets were introduced prior to the release of AFS 3.0 and VERSION packets were introduced in the AFS 3.3 release. When DEBUG or VERSION packets are unrecognized, the acceptor will attempt to match the incoming packets with an Rx connection using the epoch, cid, host, port and/or security index; and if the call number is non-zero attempt to match an Rx call. The unrecognized packet type will result in an ABORT packet with error code RX_PROTOCOL_ERROR. If the call number is zero, the ABORT will result in the connection being placed into an error state. History of Rx VERSION response data Rx VERSION packets were introduced in AFS 3.3. AFS 3.3 implemented a char[64] version buffer to copy the received version data into. It would read up to 1500 bytes from the network datagram. 
If the length read was at least 28 bytes, it would copy MIN(64, (bytesRead - 28)) octets from &responseData[28] to the version buffer. It did not add a terminating NUL.

The AFS 3.3 Rx VERSION acceptor always wrote 65 octets from a static array which contained the product version as a C-string. If the product version string exceeded 64 octets, the buffer, when copied to the VERSION response packet, would not include a trailing NUL.

AFS 3.3 Rx VERSION acceptors did not validate the presence of the CLIENT-INITIATED flag before issuing a response. They reused the incoming Rx packet, including the Rx header as received, to transmit the response. As a result, the CLIENT-INITIATED flag was not cleared when transmitting the response. This could lead to packet loops. Nor did they clear the Sequence number, Serial number, UserStatus or other flag bits.

OpenAFS 1.1.1 corrected the use of the CLIENT-INITIATED flag, truncated the response version string at 65 bytes, and appended a trailing NUL. As of 1.1.1 it is safe to send a version string longer than 63 bytes plus a trailing NUL. However, OpenAFS 1.1.1 did not reset any other incoming flag bits, nor were the Sequence number, Serial number, and UserStatus fields set to zero. Instead, the values received in the incoming Rx header are replayed in the response packet.

Linux RxRPC does not reuse the incoming Rx packet for the response and only copies the Epoch, CID, and Call Number to the response header.

History of Rx DEBUG response data

Rx DEBUG packets were introduced prior to AFS 3.0. AFS 3.0 Rx DEBUG acceptors did not validate the presence of the CLIENT-INITIATED flag before issuing a response. They reused the incoming Rx packet, including the Rx header as received, to transmit the response. As a result, the CLIENT-INITIATED flag was not cleared when transmitting the response. This could lead to packet loops.
Nor did they clear the Sequence number, Serial number, UserStatus or other flag bits.

OpenAFS 1.1.1 corrected the use of the CLIENT-INITIATED flag. However, OpenAFS 1.1.1 did not reset any other incoming flag bits, nor were the Sequence number, Serial number, and UserStatus fields set to zero. Instead, the values received in the incoming Rx header are replayed in the response packet.

Linux RxRPC ignores received DEBUG packets.

Package  Name                  Intro      Description
----------------------------------------------------------------------------
   1     RX_DEBUGI_GETSTATS    AFS 3.0    Get basic Rx stats
   2     RX_DEBUGI_GETCONN     AFS 3.0    Get connection info
   3     RX_DEBUGI_GETALLCONN  AFS 3.1    Get even uninteresting connections
   4     RX_DEBUGI_RXSTATS     AFS 3.1    Get all Rx stats
   5     RX_DEBUGI_GETPEER     AFS 3.5p2  Get all peer stats
  -8     RX_DEBUGI_BADTYPE     AFS 3.0    Requested package is unknown

RX_DEBUGI_VERSION values identify which RX_DEBUGI_GETSTATS fields or RX_DEBUGI_GETPEER fields are available from the responding peer. The value is communicated via the struct rx_debugStats.version field returned from the RX_DEBUGI_GETSTATS package.

Version  Intro       Description
----------------------------------------------------------------------------
  'L'    AFS 3.0     Earliest production version. Unaligned connections
  'M'    AFS 3.1     Supports GETALLCONN and RXSTATS
  'N'    AFS 3.3     Adds rx_debugStats.nWaiting
  'O'    AFS 3.5     Adds rx_debugStats.idleThreads
  'P'    AFS 3.5p1   Adds new rx_stats fields: ignorePacketDally,
                     receiveCbufPktAllocFailures, sendCbufPktAllocFailures
  'Q'    AFS 3.5p2   Supports GETPEER
  'R'    OAFS 1.4.0  Adds rx_debugStats.nWaited
  'S'    OAFS 1.6.0  Adds rx_debugStats.nPackets

AFS 3.2 altered the struct rx_stats structure by increasing the size of the packetsRead[] and packetsSent[] arrays from 9 to 10 elements when RX_PACKET_TYPE_PARAMS was allocated. There is no new RX_DEBUGI_VERSION value matching this change.
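
As a rough illustration of how a querying tool might interpret the RX_DEBUGI_VERSION value, the version table above can be expressed as data. The following is a minimal sketch in Python, not taken from any Rx implementation. The names RX_DEBUGI_VERSIONS, normalize_version, additions_available and supports_getpeer are hypothetical, and the sketch assumes that versions are cumulative and ordered by character code, which the table suggests but does not state.

```python
# Sketch only: the RX_DEBUGI_VERSION table expressed as data.
# All names below are hypothetical, not part of any Rx implementation.

# (version, release that introduced it, what that version adds)
RX_DEBUGI_VERSIONS = [
    ("L", "AFS 3.0", "earliest production version"),
    ("M", "AFS 3.1", "GETALLCONN and RXSTATS packages"),
    ("N", "AFS 3.3", "rx_debugStats.nWaiting"),
    ("O", "AFS 3.5", "rx_debugStats.idleThreads"),
    ("P", "AFS 3.5p1", "ignorePacketDally, receiveCbufPktAllocFailures, "
                       "sendCbufPktAllocFailures in rx_stats"),
    ("Q", "AFS 3.5p2", "GETPEER package"),
    ("R", "OAFS 1.4.0", "rx_debugStats.nWaited"),
    ("S", "OAFS 1.6.0", "rx_debugStats.nPackets"),
]


def normalize_version(version):
    """Accept the raw version octet (int) or a one-character string."""
    if isinstance(version, int):
        version = chr(version)
    if not isinstance(version, str) or len(version) != 1:
        raise ValueError("version must be a single octet or character")
    return version


def additions_available(version):
    """List what a peer reporting `version` is expected to provide.

    Assumption: versions are cumulative and ordered by character code.
    """
    version = normalize_version(version)
    return [adds for v, _, adds in RX_DEBUGI_VERSIONS if v <= version]


def supports_getpeer(version):
    """The table lists GETPEER support starting at version 'Q'."""
    return normalize_version(version) >= "Q"
```

Under these assumptions, a peer reporting 'P' would not be expected to answer RX_DEBUGI_GETPEER, while a peer reporting 'Q' or later would. The rx_stats array size change made in AFS 3.2 is deliberately left out of the sketch because no version value corresponds to it.
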

AFS 3.3 altered the struct rx_stats structure by increasing the size of the packetsRead[] and packetsSent[] arrays from 10 to 13 when RX_PACKET_TYPE_VERSION was allocated. This can be detected by RX_DEBUGI_VERSION 'N'.

Acknowledgements
================

Jeffrey Hutzelman reviewed an early draft of this specification, and provided much appreciated feedback on technical details as well as document structuring.

Love Hornquist-Astrand made many corrections to this specification, especially regarding backwards-compatibility with older Rx implementations.

References
==========

[1] /afs/sipb.mit.edu/contrib/doc/AFS/hijacking-afs.ps.gz
[2] OpenAFS: src/rx/
[3] /afs/sipb.mit.edu/contrib/doc/AFS/ps/rx-spec.ps
[4] /afs/stacken.kth.se/ftp/pub/arla/prog-afs/shadow/doc/r.vdoc
[5] /afs/stacken.kth.se/ftp/pub/arla/prog-afs/shadow/doc/rx.mss
[6] https://datatracker.ietf.org/doc/html/rfc5681
[7] https://datatracker.ietf.org/doc/html/rfc9000
[8] https://www.researchgate.net/publication/2513329_Hijacking_AFS/link/02e7e51eeaf7360cfb000000/download