OSF DCE SIG                                            M. Karuzis (HP)
Request For Comments: 20.0                                October 1992

                   DCE RPC/DG PROTOCOL ENHANCEMENTS

1. INTRODUCTION

This paper discusses three proposed enhancements to the DG protocol
component of the DCE RPC runtime library:

    (a) Private Client Sockets

    (b) Multi-Buffer Fragments

    (c) Sending Message Vectors

The primary motivation for each of these proposals is to increase RPC
performance over the DG protocol. Note that this paper addresses only
areas relevant to DCE RPC running over the datagram protocol. As such,
"sockets" refer to UDP/IP sockets, "packets" refer to packets used by
the datagram protocol, etc.

2. PRIVATE CLIENT SOCKETS

2.1. Functional Overview

The DG protocol opens one socket for each network address family
supported on a host. Once opened, these sockets are kept in a pool for
use whenever the process needs to make another RPC over that
particular address family. In the event that concurrent calls are made
over the same address family, the calls share a single socket from the
pool. Making this concurrency work requires a "helper" thread that
reads from all of the open sockets and passes received data on to the
call thread for which it is intended.

This model works well for the case in which a client makes multiple
concurrent RPCs. However, it is inefficient for those applications
that don't require this degree of concurrency. To remedy this
situation, we propose that along with the usual shared sockets in the
socket pool, there be a small number of sockets (1 or 2) that are
tagged as "private". Requests for sockets are always satisfied by
returning a private socket, if one is available. After all private
sockets are in use, subsequent socket requests are satisfied by
returning a reference to the "shared" socket for that address family.

As the name implies, a private socket is for the exclusive use of the
call thread to which it is allocated; any received data that is not
addressed to this call can be discarded. This being the case, the call
thread does not need to rely on the listener thread for receiving
data. It can read directly from the socket, avoiding the thread
switches required for the listener thread to multiplex packets. In
fact, in a client application that doesn't require a high degree of
concurrency, there is no need to pay for the overhead of having a
listener thread around at all.

A second benefit of allowing a thread to read its own data is that it
avoids the overhead of having the listener thread search through lists
of UUIDs trying to determine for which thread a given packet is
ultimately destined. Under most circumstances a call thread knows
exactly which packet it expects next, allowing it to short circuit
much of the general purpose packet handling code.

2.2. Implementation Details

2.2.1. Socket creation

The DG socket pool is modified to the following structure:

                      pool
              ________/\________
             /                  \
         server               client
        sockets              sockets
                          ______/\______
                         /              \
                     shared           private
                    sockets           sockets

The first client socket for a particular network address family (NAF)
is created in the private sockets' area. While being used in a call,
the socket is marked "in use" and can't be used by other calls. In a
client that does not make concurrent RPCs, each subsequent RPC over
this NAF will use this private socket.
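To make the allocation policy concrete, here is a rough sketch of the
preference order just described. It is illustrative only: the types,
names, and the helper create_udp_socket() are hypothetical stand-ins,
not the runtime's actual data structures, and the per-NAF limit of two
private sockets follows the figure suggested above.

    /*
     * Hypothetical sketch of the private/shared socket allocation
     * policy: prefer an idle private socket, create another private
     * socket while under the per-NAF limit, otherwise fall back to
     * the shared socket multiplexed by the listener thread.
     */
    #include <stdbool.h>

    #define MAX_PRIVATE_SOCKS 2          /* per-NAF limit suggested above */

    typedef struct {
        int  fd;                         /* UDP socket descriptor */
        bool in_use;                     /* allocated to a call thread? */
    } pool_sock_t;

    typedef struct {
        pool_sock_t priv[MAX_PRIVATE_SOCKS];
        int         n_priv;              /* private sockets created so far */
        pool_sock_t shared;              /* shared socket for this NAF */
        bool        shared_created;
    } naf_pool_t;

    extern int create_udp_socket(void);  /* hypothetical helper */

    pool_sock_t *alloc_call_socket(naf_pool_t *pool)
    {
        /* 1. Prefer an existing private socket that is not in use. */
        for (int i = 0; i < pool->n_priv; i++) {
            if (!pool->priv[i].in_use) {
                pool->priv[i].in_use = true;
                return &pool->priv[i];
            }
        }

        /* 2. Create another private socket while under the limit. */
        if (pool->n_priv < MAX_PRIVATE_SOCKS) {
            pool_sock_t *s = &pool->priv[pool->n_priv++];
            s->fd = create_udp_socket();
            s->in_use = true;
            return s;
        }

        /* 3. All private sockets are busy: return the shared socket,
         *    which any number of calls may reference concurrently. */
        if (!pool->shared_created) {
            pool->shared.fd = create_udp_socket();
            pool->shared_created = true;
        }
        return &pool->shared;
    }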
For a multi-threaded client which makes concurrent RPCs, it may happen
that a call is made over a NAF for which a private socket exists, but
that the socket is already being used in another call. In this case,
if we have not already created the maximum number of private sockets
for this NAF (currently 2), we create another private socket. If we've
already created the maximum number of sockets, and they're all in use,
we create a shared socket for this NAF. The shared socket can then be
used by any number of concurrent threads whenever all of the private
sockets are in use.

At the time the first shared socket for any NAF is created, the
listener thread must be started. Socket data received on a shared
socket is always read in by the listener thread and then queued to the
appropriate call thread.

The principal difference in the way private and shared sockets are
created is in the socket's blocking mode. In an effort to improve
performance, the runtime sometimes "guesses" when data might be
available on a shared socket. In such cases, rather than calling
select(), recvfrom() is called directly, in an attempt to avoid an
extra system call. Since this call to recvfrom() is being made from
the listener thread, and since it is possible that the "guess" was
wrong, we need to ensure that the call won't block; the listener
thread may have other sockets that it needs to monitor. For this
reason, shared sockets are created in non-blocking mode. With private
sockets, on the other hand, the intention is to allow individual call
threads to block on their sockets, so the socket is left in blocking
mode.

2.2.2. Socket handling

The majority of the code changes required to support private client
sockets are in the path between the point that a call thread decides
to block waiting for data, and the time that a newly received packet
has been processed by the packet handling routines.

With shared sockets, when a call thread decides to block for data, it
calls the routine rpc__dg_call_wait() and blocks on a condition
variable. New data sent to the call is read by the listener thread,
which determines which type of message it is and calls the appropriate
packet handling routine. In the normal case, the packet will contain
the data which the call is waiting for, and the call will be woken up
to continue its processing.

With private sockets, calls handle their own sockets. Thus, when the
call_wait() routine detects that a call thread is using a private
socket, rather than blocking the thread on a condition variable, it
redirects the thread into the packet processing code used by the
listener thread. The thread then blocks directly on its private
socket. When a packet is received, it is processed by the same packet
handling routines, with the exception that now they are being called
from within the call thread itself, not the listener thread. Again, in
the normal case, the newly received packet will contain data the call
was waiting for, except that now, rather than signalling a condition
variable to wake up the thread, the thread simply returns up its stack
until it reaches the call_wait() routine. From this point on the
processing done by users of private and shared sockets is identical.
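The split described in this section can be pictured with a small
sketch. The structure and routine names below are simplified stand-ins
(the real logic lives in rpc__dg_call_wait() and the listener thread's
packet-handling routines); process_packet() is a placeholder for those
shared routines.

    /*
     * Simplified, hypothetical view of the call_wait() split: a call
     * with a private socket blocks directly in recvfrom() and runs the
     * packet-handling code itself; a call on a shared socket sleeps on
     * a condition variable until the listener thread hands it data.
     */
    #include <pthread.h>
    #include <stdbool.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    typedef struct call_rep {
        bool            has_private_socket;
        int             sock;         /* private socket, if any */
        pthread_mutex_t mutex;
        pthread_cond_t  cv;           /* signalled by the listener thread */
        bool            data_ready;
    } call_rep_t;

    /* Placeholder for the shared packet-handling routines. */
    extern void process_packet(call_rep_t *call, char *pkt, ssize_t len);

    static void call_wait_for_data(call_rep_t *call)
    {
        if (call->has_private_socket) {
            /* Read our own socket; no listener thread, no thread
             * switch.  Data not addressed to this call may simply be
             * discarded by the packet-handling code. */
            char pkt[1464];
            ssize_t len = recvfrom(call->sock, pkt, sizeof pkt, 0, NULL, NULL);
            if (len > 0)
                process_packet(call, pkt, len);
        } else {
            /* Shared socket: the listener thread reads the socket and
             * signals us once a packet for this call has been handled. */
            pthread_mutex_lock(&call->mutex);
            while (!call->data_ready)
                pthread_cond_wait(&call->cv, &call->mutex);
            call->data_ready = false;
            pthread_mutex_unlock(&call->mutex);
        }
    }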
2.2.3. Call blocking

The discussion above deals only with calls that block waiting for
data. This type of blocking emanates from the call_receive(),
call_transmit(), and call_xmitq_push() routines (the latter two are
waiting for ACKs to open up transmit window space). It is also
possible for calls to block for other reasons, such as waiting for a
packet pool reservation or waiting for a packet. Obviously, in such
cases it would not be appropriate to allow the call to block on a
socket. To handle this distinction, the call_wait() routine now takes
an argument that specifies whether the call thread is blocking on a
"network event". If not, the call thread does the normal condition
variable wait.

2.2.4. Handling failures

It may be the case that a server dies and a call thread is blocked on
a socket through which no more data will ever come. Timeouts are
detected by the timer thread, which periodically inspects the state of
all active calls. With a shared socket, a timeout can be handled by
setting a flag in the call handle and signalling the call's condition
variable. The call, which is sleeping in the call_wait() routine, will
wake up, recognize the timeout condition, and handle it appropriately.

A different mechanism must be used in the case of private sockets,
since the call will not be blocked on a condition variable. In this
case, we must post a cancel against the call thread, since that's the
only way to wake it out of recvfrom(). Note that posting a cancel to a
thread is probably less efficient than signalling a condition
variable. However, since these actions only occur in the presence of a
failure condition, their effect on performance is of less concern. Of
more importance is the fact that supporting this mechanism requires
wrapping each socket receive call within a TRY/CATCH macro. This will
have some performance impact on every call.

Depending on the impact of using the exception macro, it may be
desirable to implement a new version of the recvfrom() call, one which
accepts a timeout parameter in the same manner as the select() call.
With the existence of such a call, it would no longer be necessary to
rely on the timer thread to detect timeouts; the call thread could
determine the amount of time until the next timeout, and then block on
its socket for only that amount of time. (Such a call might be
implemented by adding it into the kernel RPC code, in the same way as
the sendmsgv() call discussed below.)
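The intended semantics of such a timed receive can be illustrated with
a user-space approximation built from select() and recvfrom(). This is
not the proposed kernel-level call, only a sketch of the interface the
text describes; the function name and error conventions are
hypothetical.

    /*
     * Illustration only: approximate the proposed "recvfrom() with a
     * timeout" in user space.  The real proposal is a kernel-level
     * call; this sketch just shows the intended behavior (give up
     * after a caller-supplied interval instead of blocking forever).
     */
    #include <sys/select.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <sys/types.h>

    /* Returns bytes received, 0 on timeout, or -1 on error. */
    ssize_t recvfrom_with_timeout(int sock, void *buf, size_t len,
                                  struct timeval *timeout)
    {
        fd_set readfds;

        FD_ZERO(&readfds);
        FD_SET(sock, &readfds);

        /* Wait until data arrives or the timeout expires. */
        int n = select(sock + 1, &readfds, NULL, NULL, timeout);
        if (n <= 0)
            return n;           /* 0 == timed out, -1 == error */

        return recvfrom(sock, buf, len, 0, NULL, NULL);
    }

With a call like this, the call thread could compute the time until
its next timeout and block on its private socket for only that long,
as described above.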
3. MULTI-BUFFER FRAGMENTS

3.1. Terminology

Since this discussion involves entities at several different levels in
the protocol hierarchy, it will be helpful to fix some of the terms
used:

    (a) Fragment

        An RPC protocol message containing data. From the RPC
        runtime's perspective, data transfers occur through the
        transfer of one or more fragments. Fragments are always
        processed one at a time. A fragment is a protocol abstraction,
        with no implication on how the bytes are stored, or how they
        are transferred across a network.

    (b) Packet-Buffer

        The data structure used by the runtime to store the input
        and/or output arguments to an RPC.

    (c) Frame

        The physical network transmission unit.

3.2. Choosing an Appropriate Fragment Size

In general, the easiest way to increase the throughput of the runtime
is to increase the size of the fragments that are transferred from
sender to receiver. If the runtime is viewed as a machine for
processing fragments, the larger we can make the fragments, the fewer
times we need to run the machine.

The push toward increasing the fragment size is constrained by the
following considerations:

    (a) Physical networks have maximum frame sizes.

    (b) Network protocols have limits on transfer sizes.

    (c) Platforms have limited buffering capacity.

    (d) Healthy networks require good network citizens (i.e., sharing
        bandwidth).

To give a specific example, consider the DG protocol running over
UDP/IP. UDP can handle up to 65,527 bytes of data per call. If a given
UDP data unit is too big to fit into a single network frame, the IP
layer provides a simple protocol for fragmenting/reassembling the data
in order to get it to its destination.

3.3. IP Fragmentation

The IP fragmentation protocol has been the subject of much criticism,
and is generally considered something unfit on which to rely.
Unfortunately, avoiding IP fragmentation requires that a socket user
(in this case, the RPC runtime) have some knowledge about the frame
size of the underlying network(s). This knowledge would make it
possible to limit the amount of data handed to UDP, per call, so that
IP is never forced to fragment. As it turns out, there are several
reasons why it is not always possible to accurately determine what
frame size will be in effect for any given datagram:

    (a) Supported platforms don't provide a standard way in which this
        information may be queried.

    (b) The host may be multi-homed, with each network using a
        different frame size.

    (c) The destination may be across multiple hops, with different
        sized frames for each.

The current strategy, which is correct for the vast majority of DCE
environments today, is to simply assume that the underlying network is
ethernet, and to choose fragment/packet-buffer sizes that will fit
within a single ethernet frame. As a result, the DG fragment size is
equal to the maximum amount of data that could be carried in an
ethernet frame, minus the size of the UDP and IP headers. This value
works out to be:

    1500    max ethernet frame data field
    -  8    size of LLC frame for SNAP protocol used by IP
    - 20    size of IP header
    -  8    size of UDP header
    ----
    1464    max size of DG protocol fragments

3.4. Going Beyond Ethernet

Obviously, this is probably not the right thing to do when the target
network is something other than ethernet. The two most compelling
examples are the anticipated arrival of FDDI networks, with frames
that can carry almost three times as much data as ethernet frames, and
the use of intra-machine (loopback) RPC, where the frame size is moot,
and fragment sizes are only limited by local network buffering
considerations. In both of these cases, adequate performance requires
the use of larger DG fragments.
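For reference, the fragment-size arithmetic from section 3.3 can be
parameterized by the frame data size, as in the sketch below. The
per-layer overheads are the ones listed above for ethernet (and assume
an IP header with no options); treating them as constants for other
media is an assumption made only for illustration.

    /*
     * Sketch: derive the maximum DG fragment size from a network's
     * frame data size, using the same overheads as the ethernet
     * calculation above.  Illustrative only.
     */
    #define LLC_SNAP_BYTES  8     /* LLC/SNAP framing used by IP   */
    #define IP_HDR_BYTES    20    /* IP header, no options assumed */
    #define UDP_HDR_BYTES   8     /* UDP header                    */

    static unsigned int max_dg_frag_size(unsigned int frame_data_bytes)
    {
        return frame_data_bytes - LLC_SNAP_BYTES - IP_HDR_BYTES - UDP_HDR_BYTES;
    }

    /* max_dg_frag_size(1500) == 1464, the value currently used for
     * ethernet. */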
3.5. IP Fragmentation Reconsidered

Despite the known limitations of the IP fragmentation protocol, it
became apparent during testing done at HP that IP fragmentation may
actually lead to better performance under certain circumstances. Much
of the criticism of the protocol revolves around its inability to
adequately handle congested, multi-hop networks. The evidence to
support this observation is compelling. However, it appears that the
simplicity of the protocol leads to increased throughput on relatively
quiet, local networks.

Under the current runtime implementation, each network frame received
by a host must travel up through all of the layers of protocol (IP,
UDP, RPC) until it reaches the specific call thread to which it is
directed. If we allow IP to do fragmentation, most received frames
need only be looked at by the IP layer; after several frames have been
collected and reassembled, they can then be passed as a single buffer
up through UDP and on to RPC. As can be seen from the results
presented in Appendix A, use of IP fragmentation resulted in a 30%
increase in inter-machine bulk transfer performance, and a 100%
increase in intra-machine bulk transfer performance.

These tests were done on the sort of quiet, local network on which the
IP fragmentation protocol appears to perform admirably. Since we
expect that this will be a common environment in which DCE RPC
applications will run, it is important that we take advantage of this
performance opportunity. At the same time, we must guarantee
acceptable performance over congested, multi-hop networks, in which it
appears that the IP fragmentation protocol may not be an acceptable
alternative. The key to achieving this flexibility is in allowing the
runtime to adjust the size of the RPC fragments that it uses, thus
controlling whether or not IP fragmentation is necessary.

3.6. Testing with Variable Sized Fragments: MBFs and LBFs

This paper reports the results from two different test scenarios in
which the runtime was modified to use larger fragment sizes. The first
scheme, called "multi-buffer fragments" (MBF), retains the runtime's
current default sized packet-buffer (1.5KB) but allows fragments to
span multiple buffers. The packet-buffers are collected into a vector
and transferred with the sendmsg()/recvmsg() socket calls.

For comparison purposes, a second scheme called "large-buffer
fragments" (LBF) is also evaluated. The idea behind LBF is to simply
increase the size of the runtime packet-buffers so that each fragment
can still be sent in a single buffer. Note that the LBF scheme is not
considered a practical solution, since it is wasteful on systems that
must continue to support both small and large fragment sizes; it is
useful only as a yardstick for measuring the efficiency of the MBF
implementation.

3.7. MBF Implementation Details

With MBFs, a single DG fragment is allowed to span multiple
packet-buffer boundaries. Packet-buffers belonging to a single
fragment are associated by a new "more_data" pointer added to xqe's
and rqe's (xqe/rqe = transmit/receive queue element, the buffers that
hold the data sent from/to an RPC). These transmit/receive queues thus
become two-tiered.

On the sending side, if the maximum fragment size agreed upon for a
given conversation exceeds the packet pool buffer size, the sender
will string together several buffers to contain the data. Multiple
packet-buffers can be sent as a single UDP datagram with the sendmsg()
call.

On the receiving side, the listener thread now allocates 15
packet-buffers for reading in data from the network. Multiple
packet-buffers can be used to read in a single UDP datagram with the
recvmsg() call. By looking at the length returned from the recvmsg()
call, and knowing the lengths of each buffer, it is possible to
determine how many packet-buffers were used to hold a given fragment.
If a datagram fits in fewer than 15 packet-buffers, only those
actually used are passed on for processing. The rest are reused by the
listener thread, cutting down on the number of packets that need to be
reallocated. The process of queueing a fragment remains unchanged; the
more_data field is never seen by any of the packet handling logic.
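To illustrate the receive path just described, here is a sketch of
reading a single (possibly multi-buffer) fragment with recvmsg() and
computing how many packet-buffers it occupied. The names and the flat
buffer array are hypothetical simplifications; the real listener
thread works with rqe's from the packet pool and chains them via
more_data.

    /*
     * Sketch of the MBF receive path: post an iovec per packet-buffer,
     * read one UDP datagram with recvmsg(), then compute how many of
     * the buffers were actually used.  Buffer management and rqe
     * chaining are omitted; names are illustrative, not the runtime's.
     */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    #define PKT_BUFS      15      /* buffers posted by the listener thread */
    #define PKT_BUF_SIZE  1464    /* default packet-buffer data size */

    typedef struct {
        char data[PKT_BUF_SIZE];
    } pkt_buf_t;

    /* Returns the number of packet-buffers the datagram occupied,
     * or -1 on a receive error. */
    static int recv_mbf(int sock, pkt_buf_t bufs[PKT_BUFS], ssize_t *frag_len)
    {
        struct iovec  iov[PKT_BUFS];
        struct msghdr msg;
        int i;

        for (i = 0; i < PKT_BUFS; i++) {
            iov[i].iov_base = bufs[i].data;
            iov[i].iov_len  = PKT_BUF_SIZE;
        }

        /* No source address captured here; a real receiver would also
         * fill in msg_name to learn the sender's address. */
        msg.msg_name       = NULL;
        msg.msg_namelen    = 0;
        msg.msg_iov        = iov;
        msg.msg_iovlen     = PKT_BUFS;
        msg.msg_control    = NULL;
        msg.msg_controllen = 0;
        msg.msg_flags      = 0;

        *frag_len = recvmsg(sock, &msg, 0);
        if (*frag_len < 0)
            return -1;

        /* Knowing each buffer's length, the number of buffers used is
         * the ceiling of (bytes received / buffer size). */
        return (int)((*frag_len + PKT_BUF_SIZE - 1) / PKT_BUF_SIZE);
    }

Any buffers beyond the returned count can be reused for the next
datagram, which is the recycling behavior described above.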
At the time the fragment is dequeued it is necessary to check whether
the fragment consists of multiple buffers. If so, only the first is
returned to the stubs. This requires that the
rpc__dg_call_receive_int() routine muck around with the queue pointers
a bit, to make sure the "head" and "last_in_order" pointers are set
correctly.

4. SENDING MESSAGE VECTORS

When running on networks where we can't, or don't want to, force IP
fragmentation, another way to increase performance is to improve the
efficiency with which data is passed into the kernel. Since it is
necessary to make a separate system call for each datagram sent, small
fragment sizes force the runtime to make a large number of system
calls.

The ideal solution to this problem would be for the socket API to be
augmented to allow applications to send multiple datagrams with a
single call. A more practical solution involves providing this
functionality ourselves, on platforms that already support kernel RPC.
On such systems it is possible to add our own implementation of the
augmented socket call into the kernel RPC library. The user space
runtime can then pass multiple RPC fragments into the kernel with a
single system call; the kernel routine, in turn, sends each of the
fragments as an individual datagram.

4.1. Implementation Details

The sendmsgv() call is analogous to the sendmsg() call. The difference
is that while the sendmsg() call takes a scatter/gather array which it
turns into a single UDP datagram, the sendmsgv() call takes multiple
scatter/gather arrays and turns them into multiple, independent UDP
datagrams. Ideally, such a routine belongs in the socket system call
module, and should have a system call entry point by which user-space
applications can invoke it. At the present time, the code which
implements the sendmsgv() call is bound into the KRPC library, and is
made available to applications through an ioctl() on the device at
/dev/ncs. (KRPC registers itself as the handler for this device.)

The sendmsgv() ioctl interface takes a "msgs_t" argument, which is
constructed as follows:

    typedef struct {
        int fd;                   /* user space descriptor referring to socket */
        int *cc;                  /* return number of bytes actually sent */
        int flags;                /* not currently used */
        int addr_len;             /* length of socket address */
        struct sockaddr *addr;    /* where to send data */
    } msgs_hdr_t, *msgs_hdr_p_t;

    typedef struct {
        struct iovec *msg_iov;    /* an io-vector describing a single UDP datagram */
        int msg_iovcnt;           /* the number of entries in the above vector */
    } msgvec_elt_t, *msgvec_elt_p_t;

    typedef struct {
        msgs_hdr_t hdr;
        int msgv_len;             /* no. of elements in the following array */
        msgvec_elt_t msgv[1];     /* array of independent UDP datagrams */
    } msgs_t, *msgs_p_t;

This ends up looking like the following:

[Figure: layout of the msgs_t argument, drawn across the kernel-space/
user-space boundary. The user-space msgs_t contains the header fields
(fd, cc, addr_len, addr), msgv_len, and the msgv[] array; addr points
to the destination sockaddr, and each msgv[i].msg_iov points to an
array of iovecs whose iov_base/iov_len entries describe the data
buffers making up one independent UDP datagram.]
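As a purely illustrative example of how the user-space runtime might
fill in this structure (using the types defined above), the following
builds a msgs_t describing two outgoing datagrams. The iovec arrays,
socket descriptor, and destination address are assumed to have been
set up elsewhere, and build_msgs() itself is hypothetical; the actual
ioctl() invocation is shown next.

    /*
     * Illustrative only: construct a msgs_t describing two independent
     * UDP datagrams.  Uses the msgs_t/msgvec_elt_t definitions above;
     * iov0/iov1 are iovec arrays describing the fragments to send.
     */
    #include <stdlib.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    msgs_t *build_msgs(int sock_fd, struct sockaddr *dest, int dest_len,
                       struct iovec *iov0, int cnt0,
                       struct iovec *iov1, int cnt1,
                       int *bytes_sent)
    {
        /* msgv[] is declared with one element; allocate space for two. */
        msgs_t *m = malloc(sizeof(msgs_t) + sizeof(msgvec_elt_t));

        if (m == NULL)
            return NULL;

        m->hdr.fd       = sock_fd;      /* user-space socket descriptor */
        m->hdr.cc       = bytes_sent;   /* kernel reports bytes sent here */
        m->hdr.flags    = 0;
        m->hdr.addr     = dest;
        m->hdr.addr_len = dest_len;

        m->msgv_len           = 2;      /* two independent datagrams */
        m->msgv[0].msg_iov    = iov0;
        m->msgv[0].msg_iovcnt = cnt0;
        m->msgv[1].msg_iov    = iov1;
        m->msgv[1].msg_iovcnt = cnt1;

        return m;
    }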
The actual call (made from the user-space runtime) looks like this:

    rpc_socket_error_t serr;
    int ncsdev_fd;
    msgs_t *msg;

    serr = ioctl(ncsdev_fd, NIOCSCK(sizeof(msgs_t)), msg);

The NIOCSCK macro does the bit shifting to create the ioctl() command;
the sizeof parameter determines how many bytes of the "msg" argument
need to be copied into kernel space. The ioctl() handler for /dev/ncs
then calls the sendmsgv() routine, which has the following signature:

    PRIVATE rpc_socket_error_t rpc__socket_sendmsgv (
        struct file *fp,
        msgvec_elt_t *msgv,
        int msgv_len,
        struct sockaddr *addr,
        int addr_len,
        int flags,
        int *cc
    );

APPENDIX A. PERFORMANCE MEASUREMENTS

A.1. Private Client Sockets

Private client sockets have been implemented on an HP Series 400
running HPUX 8.0, and resulted in a 3-5% increase in throughput and a
5-7% decrease in latency (null call time).

A.2. Multi-Buffer Fragments

The following tests were run with executables derived from the
dce1.0.1b18 code base.[1]

    __________
    1. Note that although fragment sizes were allowed to increase
       substantially, the total number of bytes allowed in all
       outstanding (i.e., in transit) fragments was not increased --
       23,936 bytes. Thus, at a fragment size of 16,384, only one
       fragment could be sent at a time. Some testing done with an
       unlimited window size (for comparison) saw transfer rates of up
       to 540 KB/s, mixed with much lower rates due to dropped
       packets. It seems likely that some compromise between the
       extremes would safely lead to higher performance.

    (a) MBF Inter-machine Bulk Data Tests

        server: Series 400/HPUX 8.0 (16 MB)
        client: Series 300/HPUX 8.0 (16 MB)

        Fragment Size (bytes)    Null (ms)    Ins (KB/s)    Outs (KB/s)
        base (b18)                 10.2          324            323
        1454                       10.7          320            291
        4096                       10.7          413            403
        8192                       10.7          451            453
        12282                      10.7          452            469
        16384                      10.6          398            400
        20480                      10.7          420            426

    (b) MBF Inter-machine Bulk Data Tests

        server: Series 700-OSF/1 1.0.2 (32 MB)
        client: Series 700-OSF/1 1.0.2 (16 MB)

        [These tests were done on an open network.]

        Fragment Size (bytes)    Null (ms)    Ins (KB/s)    Outs (KB/s)
        base (b18)                  5.6          671            627
        1454                        6.1          673            612
        4096                        5.9          871            766
        8192                        6.2          898            844
        12282                       5.7          859            809
        16384                       5.8          669            649
        20480                       5.8          675            673

    (c) MBF Intra-machine Bulk Data Tests

        Series 400/HPUX 8.0 (16 MB)

        Fragment Size (bytes)    Null (ms)    Ins (KB/s)    Outs (KB/s)
        base (b18)                  9.8          276            195
        1454                       10.4          250            189
        4096                       10.5          395            354
        8192                       10.5          506            470
        12282                      10.5          533            530
        16384                      10.5          574            562
        20480                      10.5          620            603

    (d) MBF Intra-machine Bulk Data Tests

        Series 700-OSF/1 1.0.2 (32 MB)

        Fragment Size (bytes)    Null (ms)    Ins (KB/s)    Outs (KB/s)
        base (b18)                 4.42          641            666
        1454                       4.50          648            646
        4096                       4.56          992           1044
        8192[2]                    4.57         1190           1222
        12282                      4.58         1283           1300
        16384                      4.54         1334           1341
        20480                      4.58         1378           1390

    (e) MBF Intra-machine Bulk Data Tests

        Series 700-OSF/1 1.0.2 (16 MB)

        Fragment Size (bytes)    Null (ms)    Ins (KB/s)    Outs (KB/s)
        base (b18)                 7.14          398            416
        1454                       7.23          391            407
        4096                       7.42          623            672
        8192[3]                       -            -              -
        12282                      7.49          870            883
        16384                      7.47          926            946
        20480                      7.46          990            986

    __________
    2. Because of dropped packets, two of the ten trials showed well
       below average performance.
       The figures here show the average after dropping the two lowest
       and two highest results.

    3. It's not clear why the OSF/1 platforms saw so many dropped
       packets at the 8192 byte fragment size.

A.2.1. Performance under stress

One other test performed was to see how an overloaded receiver behaves
with and without the presence of IP fragmentation. There was a concern
that fragmentation might tax the kernel's memory resources, and that
dropped fragments and retransmissions would result in a decrease in
performance. A review of the IP fragmentation algorithm suggests that
there is no inherent reason why a heavily loaded receiver should
perform any worse under fragmentation. The following testing was done
on HPUX:

    (a) MBF Inter-machine Concurrent Bulk Data Tests

        10-threaded client/server testing
        server: Series 400/HPUX 8.0 (16 MB)
        client: Series 300/HPUX 8.0 (16 MB)

        Fragment Size (bytes)    Ins (KB/s)
        base (b18)                   30
        1454                         28
        4096                         42
        8192                         50
        12282                        52
        16384                        57
        20480                        59

These results show no degradation in performance. Compare 10 clients
using 8K MBFs with the results from a single client running on the
same platform using the same sized MBFs: in the first case there are
10 clients transferring 50 KB/s each; in the second case, a single
client sending 451 KB/s. Also, there was no evidence that the server
was dropping fragments during the testing.

It must be pointed out that a more useful stress test would have been
to run the 10 clients from 10 different machines; however, these
resources were not available at the time of the testing. This remains
an area that will be closely monitored.

A.2.2. Comparison with LBF

For comparison purposes, we also present the performance measurements
for a runtime which was modified to use only 16KB packet-buffers. This
implementation is useful for assessing the overhead of the
multi-buffer scheme, but would be highly inefficient in an environment
that required the ability to support both small and large fragment
sizes.

    (a) LBF Inter-machine Bulk Data Tests

        server: Series 400/HPUX 8.0 (16 MB)
        client: Series 300/HPUX 8.0 (16 MB)

        Fragment Size (bytes)    Throughput (KB/s)
        1454                           338
        4096                           458
        8192                           503
        12282                          486
        16384                          497

    (b) LBF Intra-machine Bulk Data Tests

        Series 400/HPUX 8.0 (16 MB)

        Fragment Size (bytes)    Throughput (KB/s)
        1454                           239
        4096                           378
        8192                           523
        12282                          591
        16384                          647

In comparison with the results of the MBF testing done on the same
platform, these figures indicate that the current implementation of
MBF imposes an overhead of approximately 10% over the simpler LBF
scheme. Considering that LBF presents a "best case" scenario tuned to
one particular environment (one which only supports large fragment
sizes), this cost seems quite reasonable.

A.3. The Sendmsgv() Call

The sendmsgv() call was implemented and bound into an OSF/1 1.0.2
kernel running on an HP Series 700 workstation.

    (a) Intra-machine Bulk Data Tests

        Series 700-OSF/1 1.0.2 (16 MB)

                              Null Call (ms)    Ins (KB/s)    Outs (KB/s)
        base (b18)                 6.43            443            452
        sendmsgv w/1 pkt           6.69            430            441
        sendmsgv w/vector          6.62            462 (+4%)      516 (+14%)

    (b) Inter-machine Bulk Data Tests

        sender  : Series 700-OSF/1 1.0.2 (16 MB)
        receiver: Series 700-OSF/1 1.0.2 (32 MB)

        [Testing done on an open network.]
                              Null Call (ms)    Ins (KB/s)    Outs (KB/s)
        base (b18)                 3.95            729            717
        sendmsgv w/1 pkt           4.08            726            720
        sendmsgv w/vector          4.08            794 (+9%)      799 (+11%)

For each of the above, the first line shows the results of the tests
run on dce1.0.1b18. The second line shows the results of implementing
the sendmsgv() call, but not using its ability to handle vectors of
messages. That is, sendmsgv() was simply substituted for the current
use of sendmsg() to check on the overhead of the new call. The last
test shows the results of using the sendmsgv() call's ability to
handle multiple messages.

It was expected that the first two tests would have exhibited similar
performance. This was not the case, and probably points to some
inefficiency in the implementation of the sendmsgv() call. If this
inefficiency can be corrected, it would be expected that the figures
shown in line 3 would improve by a similar margin.

It's not clear why there is a 10% difference in the performance
increase of the intra-machine Ins/Outs tests. The small increase in
performance for the Ins test appears to be due to a loss of
parallelism that happens any time client and server share a single
CPU. Normally, a sender queues its next batch of packets concurrently
with the receiver's processing of the last batch. When the receiver is
ready for more data (which it signals by sending an ACK), the client
should have a complete vector ready to send. Without this concurrency,
when the sender receives an ACK, it does not have a full batch of
packets queued, and can only send whatever data is ready. With fewer
fragments ready to send, the vectoring ability of the sendmsgv() call
goes unused. By looking at the number of fragments sent per call it
was possible to confirm that the Ins test was suffering from a lack of
concurrency. Presumably there is some imbalance in the protocol
overhead between senders and receivers that prevented the Outs test
from exhibiting the same degradation.

Finally, although it wasn't implemented in time for this paper, it is
possible to avoid affecting NULL call performance by not using the
vector based call for single packet transmits. With this modification
it is expected that NULL call performance would remain unchanged.

APPENDIX B. PACKET RATIONING IMPLICATIONS OF MBF

The DG packet rationing scheme makes the assumption that a one-to-one
correspondence exists between packet-buffers and fragments. This is no
longer the case with MBFs. Since the runtime is not capable of dealing
with partial fragments, it's clear that what we really need is a
fragment rationing scheme.

Currently, under rationing, calls are allowed to use only a single
packet-buffer at a time. This would work fine for senders of MBFs,
since they can just scale down their fragment size whenever the system
enters a rationing state. Unfortunately, this will not work on the
receiving side. Since the receiver has advertised that it can handle
MBFs, it may be the case that the sender has queued up several of
these larger fragments at the time the receiver begins rationing. At
this point it is not possible to force the receiver to use one
packet-buffer at a time, since the next few fragments it will receive
are MBFs. It's also not practical to tell the sender to repackage its
fragments, since this would require changing the fragments' sequence
numbers, which might conflict with out-of-order fragments already
queued at the receiver.
The most straightforward solution is to require a call to make as many
packet-buffer reservations as are required to guarantee that it can
always read in the largest MBF size that it advertises. For instance,
an intra-machine call which wants to use 16K MBFs would need to make
12 reservations (12 buffers of 1454 bytes each). In effect, the call
is making a fragment reservation. (If it later turns out that the
other side of the conversation can not handle MBFs, some of the
reservations could be returned.)

First, let's consider if this is reasonable in a user-space runtime.
Packet rationing is used to guarantee that a system that is low on
packet-buffers, and high on buffer users, will not deadlock. In the
kernel, where space is at a premium, there are only a limited number
of buffers in the packet pool, and a high probability that this
resource will get strained. In user space, however, land is cheap and
the packet pool can be allowed to grow fairly large. Under these
circumstances, the bottleneck for a system is likely to be the CPU,
rather than memory.

The user space runtime defines the maximum number of packet-buffers to
be 100,000. In practice, this number is meaningless since there are
other constraints that would preclude this many buffers from ever
being in use. For example, each call is limited to queueing a maximum
of 96 buffers; there would need to be over 1000 concurrent calls
running for the maximum number of buffers to be in use.

Packet rationing begins when the number of packet-buffers remaining in
the pool is equal to the number of all current reservations. In 1.0.1,
this means that if there are 1000 concurrent calls, packet rationing
would begin when the following number of packet-buffers were in use:

    100,000    number of buffers
    - 2,000    2 reservations for each of 1000 calls
    -------
     98,000    packet rationing threshold

But if each call is only allowed to queue up a maximum of 96 buffers,
that means the maximum number of buffers in use by all calls is
96,000. Therefore, under these conditions, packet rationing would
never be necessary.

What happens if we make 13 reservations for each call? Working out the
math, the number of concurrent calls is lowered to 917:

    100,000    buffers
    -11,921    13 reservations for 917 calls
    -------
     88,079    rationing threshold

        917    no. of concurrent calls
       x 96    max. queue length per call
    -------
     88,032    total possible buffers in use

Of course, if an application tried to run more than 917 concurrent
calls, everything would still work (slowly); the rationing code would
kick in and guarantee that the calls ran without deadlock.

Given all this, I would propose that we modify the packet rationing
scheme to allow call threads to make multiple packet reservations. The
call thread would then use the number of reservations received to
determine the MBF size it wanted to advertise. From a cursory
examination of the code, I think the rest of the packet rationing
logic can remain the same. Users of the packet rationing code are not
dependent on the implementation of the "reservation" abstraction; they
just care about whether they have one or not.

The picture for KRPC is not as promising. The maximum number of
packet-buffers that KRPC will allocate is 64, which means that the
system can support a maximum of 31 concurrent calls. If we assume that
these calls may need to do an RPC to complete their processing, the
number of concurrent calls is reduced to 15.
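The user-space arithmetic above reduces to a one-line rule of thumb;
the helper below is only a restatement of that reasoning, not proposed
runtime code: with R reservations per call and a per-call queue limit
of Q buffers, rationing can never be triggered while the number of
concurrent calls is at most pool / (R + Q).

    /*
     * Sketch of the reservation arithmetic above.  Rationing begins
     * when buffers-in-use reaches pool - calls*R; since a call can
     * queue at most Q buffers, that point is never reached as long as
     * calls * (R + Q) <= pool.
     */
    static unsigned int max_calls_without_rationing(unsigned int pool_size,
                                                    unsigned int resv_per_call,
                                                    unsigned int max_queue_per_call)
    {
        return pool_size / (resv_per_call + max_queue_per_call);
    }

    /* max_calls_without_rationing(100000, 13, 96) == 917, matching the
     * figure worked out above. */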
Again, I think the right way to think about this is that rationing is
the state where you have some maximum number of calls running, each
using one fragment (not buffer) at a time. If you allow the fragment
size to increase, the number of concurrent calls decreases. With 4K
MBFs, you'd be limited to 15 concurrent calls. With 8K MBFs, the limit
would be 9. If KRPC continues to allow a maximum of 64 packet-buffers,
I think we probably should not allow the use of MBFs. If we could
double the number of packet-buffers, we could then allow 4K MBFs.

AUTHOR'S ADDRESS

Mark Karuzis                           Internet email: markar@apollo.hp.com
Distributed Object Computing Program   Telephone: +1-508-436-4337
Hewlett-Packard Co.
250 Apollo Drive
Chelmsford, MA 01824
USA