OSF DCE SIG                                            M. Karuzis (HP)
Request For Comments: 20.0                                October 1992

                   DCE RPC/DG PROTOCOL ENHANCEMENTS

1. INTRODUCTION

This paper discusses three proposed enhancements to the DG protocol
component of the DCE RPC runtime library:

    (a) Private Client Sockets

    (b) Multi-Buffer Fragments

    (c) Sending Message Vectors

The primary motivation for each of these proposals is to increase RPC
performance over the DG protocol. Note that this paper addresses only
areas relevant to DCE RPC running over the datagram protocol. As such,
"sockets" refer to UDP/IP sockets, "packets" refer to packets used by
the datagram protocol, etc.

2. PRIVATE CLIENT SOCKETS

2.1. Functional Overview

The DG protocol opens one socket for each network address family
supported on a host. Once opened, these sockets are kept in a pool for
use whenever the process needs to make another RPC over that
particular address family. In the event that concurrent calls are made
over the same address family, the calls share a single socket from the
pool. Making this concurrency work requires a "helper" thread that
reads from all of the open sockets and passes received data on to the
call thread for which it is intended.

This model works well for the case in which a client makes multiple
concurrent RPCs. However, it is inefficient for those applications
that don't require this degree of concurrency. To remedy this
situation, we propose that along with the usual shared sockets in the
socket pool, there be a small number of sockets (1 or 2) that are
tagged as "private". Requests for sockets are always satisfied by
returning a private socket, if one is available. After all private
sockets are in use, subsequent socket requests are satisfied by
returning a reference to the "shared" socket for that address family.

As the name implies, a private socket is for the exclusive use of the
call thread to which it is allocated; any received data that is not
addressed to this call can be discarded. This being the case, the call
thread does not need to rely on the listener thread for receiving
data. It can read directly from the socket, avoiding the thread
switches required for the listener thread to multiplex packets. In
fact, in a client application that doesn't require a high degree of
concurrency, there is no need to pay for the overhead of having a
listener thread around at all.

A second benefit of allowing a thread to read its own data is that it
avoids the overhead of having the listener thread search through lists
of UUIDs trying to determine for which thread a given packet is
ultimately destined. Under most circumstances a call thread knows
exactly which packet it expects next, allowing it to short circuit
much of the general purpose packet handling code.

2.2. Implementation Details

2.2.1. Socket creation

The DG socket pool is modified to the following structure:

                      pool
              ________/\________
             /                  \
         server               client
        sockets              sockets
                          ______/\______
                         /              \
                     shared           private
                    sockets           sockets

The first client socket for a particular network address family (NAF)
is created in the private sockets' area. While being used in a call,
the socket is marked "in use" and can't be used by other calls. In a
client that does not make concurrent RPCs, each subsequent RPC over
this NAF will use this private socket.
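To make the allocation policy concrete, here is a rough sketch of the
preference order just described. It is illustrative only: the types,
names, and the helper create_udp_socket() are hypothetical stand-ins,
not the runtime's actual data structures, and the per-NAF limit of two
private sockets follows the figure suggested above.

    /*
     * Hypothetical sketch of the private/shared socket allocation
     * policy: prefer an idle private socket, create another private
     * socket while under the per-NAF limit, otherwise fall back to
     * the shared socket multiplexed by the listener thread.
     */
    #include <stdbool.h>

    #define MAX_PRIVATE_SOCKS 2          /* per-NAF limit suggested above */

    typedef struct {
        int  fd;                         /* UDP socket descriptor */
        bool in_use;                     /* allocated to a call thread? */
    } pool_sock_t;

    typedef struct {
        pool_sock_t priv[MAX_PRIVATE_SOCKS];
        int         n_priv;              /* private sockets created so far */
        pool_sock_t shared;              /* shared socket for this NAF */
        bool        shared_created;
    } naf_pool_t;

    extern int create_udp_socket(void);  /* hypothetical helper */

    pool_sock_t *alloc_call_socket(naf_pool_t *pool)
    {
        /* 1. Prefer an existing private socket that is not in use. */
        for (int i = 0; i < pool->n_priv; i++) {
            if (!pool->priv[i].in_use) {
                pool->priv[i].in_use = true;
                return &pool->priv[i];
            }
        }

        /* 2. Create another private socket while under the limit. */
        if (pool->n_priv < MAX_PRIVATE_SOCKS) {
            pool_sock_t *s = &pool->priv[pool->n_priv++];
            s->fd = create_udp_socket();
            s->in_use = true;
            return s;
        }

        /* 3. All private sockets are busy: return the shared socket,
         *    which any number of calls may reference concurrently. */
        if (!pool->shared_created) {
            pool->shared.fd = create_udp_socket();
            pool->shared_created = true;
        }
        return &pool->shared;
    }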
For a multi-threaded client which makes concurrent RPCs, it may happen
that a call is made over a NAF for which a private socket exists, but
that the socket is already being used in another call. In this case,
if we have not already created the maximum number of private sockets
for this NAF (currently 2), we create another private socket. If we've
already created the maximum number of sockets, and they're all in use,
we create a shared socket for this NAF. The shared socket can then be
used by any number of concurrent threads whenever all of the private
sockets are in use.

At the time the first shared socket for any NAF is created, the
listener thread must be started. Socket data received on a shared
socket is always read in by the listener thread and then queued to the
appropriate call thread.

The principal difference in the way private and shared sockets are
created is in the socket's blocking mode. In an effort to improve
performance, the runtime sometimes "guesses" when data might be
available on a shared socket. In such cases, rather than calling
select(), recvfrom() is called directly, in an attempt to avoid an
extra system call. Since this call to recvfrom() is being made from
the listener thread, and since it is possible that the "guess" was
wrong, we need to ensure that the call won't block; the listener
thread may have other sockets that it needs to monitor. For this
reason, shared sockets are created in non-blocking mode. With private
sockets, on the other hand, the intention is to allow individual call
threads to block on their sockets, so the socket is left in blocking
mode.

2.2.2. Socket handling

The majority of the code changes required to support private client
sockets are in the path between the point that a call thread decides
to block waiting for data, and the time that a newly received packet
has been processed by the packet handling routines.

With shared sockets, when a call thread decides to block for data, it
calls the routine rpc__dg_call_wait() and blocks on a condition
variable. New data sent to the call is read by the listener thread,
which determines which type of message it is and calls the appropriate
packet handling routine. In the normal case, the packet will contain
the data which the call is waiting for, and the call will be woken up
to continue its processing.

With private sockets, calls handle their own sockets. Thus, when the
call_wait() routine detects that a call thread is using a private
socket, rather than blocking the thread on a condition variable, it
redirects the thread into the packet processing code used by the
listener thread. The thread then blocks directly on its private
socket. When a packet is received, it is processed by the same packet
handling routines, with the exception that now they are being called
from within the call thread itself, not the listener thread. Again, in
the normal case, the newly received packet will contain data the call
was waiting for, except that now, rather than signalling a condition
variable to wake up the thread, the thread simply returns up its stack
until it reaches the call_wait() routine. From this point on the
processing done by users of private and shared sockets is identical.
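The split described in this section can be pictured with a small
sketch. The structure and routine names below are simplified stand-ins
(the real logic lives in rpc__dg_call_wait() and the listener thread's
packet-handling routines); process_packet() is a placeholder for those
shared routines.

    /*
     * Simplified, hypothetical view of the call_wait() split: a call
     * with a private socket blocks directly in recvfrom() and runs the
     * packet-handling code itself; a call on a shared socket sleeps on
     * a condition variable until the listener thread hands it data.
     */
    #include <pthread.h>
    #include <stdbool.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    typedef struct call_rep {
        bool            has_private_socket;
        int             sock;         /* private socket, if any */
        pthread_mutex_t mutex;
        pthread_cond_t  cv;           /* signalled by the listener thread */
        bool            data_ready;
    } call_rep_t;

    /* Placeholder for the shared packet-handling routines. */
    extern void process_packet(call_rep_t *call, char *pkt, ssize_t len);

    static void call_wait_for_data(call_rep_t *call)
    {
        if (call->has_private_socket) {
            /* Read our own socket; no listener thread, no thread
             * switch.  Data not addressed to this call may simply be
             * discarded by the packet-handling code. */
            char pkt[1464];
            ssize_t len = recvfrom(call->sock, pkt, sizeof pkt, 0, NULL, NULL);
            if (len > 0)
                process_packet(call, pkt, len);
        } else {
            /* Shared socket: the listener thread reads the socket and
             * signals us once a packet for this call has been handled. */
            pthread_mutex_lock(&call->mutex);
            while (!call->data_ready)
                pthread_cond_wait(&call->cv, &call->mutex);
            call->data_ready = false;
            pthread_mutex_unlock(&call->mutex);
        }
    }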
2.2.3. Call blocking

The discussion above deals only with calls that block waiting for
data. This type of blocking emanates from the call_receive(),
call_transmit(), and call_xmitq_push() routines (the latter two are
waiting for ACKs to open up transmit window space). It is also
possible for calls to block for other reasons, such as waiting for a
packet pool reservation or waiting for a packet. Obviously, in such
cases it would not be appropriate to allow the call to block on a
socket. To handle this distinction, the call_wait() routine now takes
an argument that specifies whether the call thread is blocking on a
"network event". If not, the call thread does the normal condition
variable wait.

2.2.4. Handling failures

It may be the case that a server dies and a call thread is blocked on
a socket through which no more data will ever come. Timeouts are
detected by the timer thread, which periodically inspects the state of
all active calls. With a shared socket, a timeout can be handled by
setting a flag in the call handle and signalling the call's condition
variable. The call, which is sleeping in the call_wait() routine, will
wake up, recognize the timeout condition, and handle it appropriately.

A different mechanism must be used in the case of private sockets,
since the call will not be blocked on a condition variable. In this
case, we must post a cancel against the call thread, since that's the
only way to wake it out of recvfrom(). Note that posting a cancel to a
thread is probably less efficient than signalling a condition
variable. However, since these actions only occur in the presence of a
failure condition, their effect on performance is of less concern. Of
more importance is the fact that supporting this mechanism requires
wrapping each socket receive call within a TRY/CATCH macro. This will
have some performance impact on every call.

Depending on the impact of using the exception macro, it may be
desirable to implement a new version of the recvfrom() call, one which
accepts a timeout parameter in the same manner as the select() call.
With the existence of such a call, it would no longer be necessary to
rely on the timer thread to detect timeouts; the call thread could
determine the amount of time until the next timeout, and then block on
its socket for only that amount of time. (Such a call might be
implemented by adding it into the kernel RPC code, in the same way as
the sendmsgv() call discussed below.)
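The intended semantics of such a timed receive can be illustrated with
a user-space approximation built from select() and recvfrom(). This is
not the proposed kernel-level call, only a sketch of the interface the
text describes; the function name and error conventions are
hypothetical.

    /*
     * Illustration only: approximate the proposed "recvfrom() with a
     * timeout" in user space.  The real proposal is a kernel-level
     * call; this sketch just shows the intended behavior (give up
     * after a caller-supplied interval instead of blocking forever).
     */
    #include <sys/select.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <sys/types.h>

    /* Returns bytes received, 0 on timeout, or -1 on error. */
    ssize_t recvfrom_with_timeout(int sock, void *buf, size_t len,
                                  struct timeval *timeout)
    {
        fd_set readfds;

        FD_ZERO(&readfds);
        FD_SET(sock, &readfds);

        /* Wait until data arrives or the timeout expires. */
        int n = select(sock + 1, &readfds, NULL, NULL, timeout);
        if (n <= 0)
            return n;           /* 0 == timed out, -1 == error */

        return recvfrom(sock, buf, len, 0, NULL, NULL);
    }

With a call like this, the call thread could compute the time until
its next timeout and block on its private socket for only that long,
as described above.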
3. MULTI-BUFFER FRAGMENTS

3.1. Terminology

Since this discussion involves entities at several different levels in
the protocol hierarchy, it will be helpful to fix some of the terms
used:

    (a) Fragment

        An RPC protocol message containing data. From the RPC
        runtime's perspective, data transfers occur through the
        transfer of one or more fragments. Fragments are always
        processed one at a time. A fragment is a protocol abstraction,
        with no implication on how the bytes are stored, or how they
        are transferred across a network.

    (b) Packet-Buffer

        The data structure used by the runtime to store the input
        and/or output arguments to an RPC.

    (c) Frame

        The physical network transmission unit.

3.2. Choosing an Appropriate Fragment Size

In general, the easiest way to increase the throughput of the runtime
is to increase the size of the fragments that are transferred from
sender to receiver. If the runtime is viewed as a machine for
processing fragments, the larger we can make the fragments, the fewer
times we need to run the machine.

The push toward increasing the fragment size is constrained by the
following considerations:

    (a) Physical networks have maximum frame sizes.

    (b) Network protocols have limits on transfer sizes.

    (c) Platforms have limited buffering capacity.

    (d) Healthy networks require good network citizens (i.e., sharing
        bandwidth).

To give a specific example, consider the DG protocol running over
UDP/IP. UDP can handle up to 65,527 bytes of data per call. If a given
UDP data unit is too big to fit into a single network frame, the IP
layer provides a simple protocol for fragmenting/reassembling the data
in order to get it to its destination.

3.3. IP Fragmentation

The IP fragmentation protocol has been the subject of much criticism,
and is generally considered something unfit on which to rely.
Unfortunately, avoiding IP fragmentation requires that a socket user
(in this case, the RPC runtime) have some knowledge about the frame
size of the underlying network(s). This knowledge would make it
possible to limit the amount of data handed to UDP, per call, so that
IP is never forced to fragment. As it turns out, there are several
reasons why it is not always possible to accurately determine what
frame size will be in effect for any given datagram:

    (a) Supported platforms don't provide a standard way in which this
        information may be queried.

    (b) The host may be multi-homed, with each network using a
        different frame size.

    (c) The destination may be across multiple hops, with different
        sized frames for each.

The current strategy, which is correct for the vast majority of DCE
environments today, is to simply assume that the underlying network is
ethernet, and to choose fragment/packet-buffer sizes that will fit
within a single ethernet frame. As a result, the DG fragment size is
equal to the maximum amount of data that could be carried in an
ethernet frame, minus the size of the UDP and IP headers. This value
works out to be:

    1500    max ethernet frame data field
    -  8    size of LLC frame for SNAP protocol used by IP
    - 20    size of IP header
    -  8    size of UDP header
    ----
    1464    max size of DG protocol fragments

3.4. Going Beyond Ethernet

Obviously, this is probably not the right thing to do when the target
network is something other than ethernet. The two most compelling
examples are the anticipated arrival of FDDI networks, with frames
that can carry almost three times as much data as ethernet frames, and
the use of intra-machine (loopback) RPC, where the frame size is moot,
and fragment sizes are only limited by local network buffering
considerations. In both of these cases, adequate performance requires
the use of larger DG fragments.
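For reference, the fragment-size arithmetic from section 3.3 can be
parameterized by the frame data size, as in the sketch below. The
per-layer overheads are the ones listed above for ethernet (and assume
an IP header with no options); treating them as constants for other
media is an assumption made only for illustration.

    /*
     * Sketch: derive the maximum DG fragment size from a network's
     * frame data size, using the same overheads as the ethernet
     * calculation above.  Illustrative only.
     */
    #define LLC_SNAP_BYTES  8     /* LLC/SNAP framing used by IP   */
    #define IP_HDR_BYTES    20    /* IP header, no options assumed */
    #define UDP_HDR_BYTES   8     /* UDP header                    */

    static unsigned int max_dg_frag_size(unsigned int frame_data_bytes)
    {
        return frame_data_bytes - LLC_SNAP_BYTES - IP_HDR_BYTES - UDP_HDR_BYTES;
    }

    /* max_dg_frag_size(1500) == 1464, the value currently used for
     * ethernet. */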
3.5. IP Fragmentation Reconsidered

Despite the known limitations of the IP fragmentation protocol, it
became apparent during testing done at HP that IP fragmentation may
actually lead to better performance under certain circumstances. Much
of the criticism of the protocol revolves around its inability to
adequately handle congested, multi-hop networks. The evidence to
support this observation is compelling. However, it appears that the
simplicity of the protocol leads to increased throughput on relatively
quiet, local networks.

Under the current runtime implementation, each network frame received
by a host must travel up through all of the layers of protocol (IP,
UDP, RPC) until it reaches the specific call thread to which it is
directed. If we allow IP to do fragmentation, most received frames
need only be looked at by the IP layer; after several frames have been
collected and reassembled, they can then be passed as a single buffer
up through UDP and on to RPC. As can be seen from the results
presented in Appendix A, use of IP fragmentation resulted in a 30%
increase in inter-machine bulk transfer performance, and a 100%
increase in intra-machine bulk transfer performance.

These tests were done on the sort of quiet, local network on which the
IP fragmentation protocol appears to perform admirably. Since we
expect that this will be a common environment in which DCE RPC
applications will run, it is important that we take advantage of this
performance opportunity. At the same time, we must guarantee
acceptable performance over congested, multi-hop networks, in which it
appears that the IP fragmentation protocol may not be an acceptable
alternative. The key to achieving this flexibility is in allowing the
runtime to adjust the size of the RPC fragments that it uses, thus
controlling whether or not IP fragmentation is necessary.

3.6. Testing with Variable Sized Fragments: MBFs and LBFs

This paper reports the results from two different test scenarios in
which the runtime was modified to use larger fragment sizes. The first
scheme, called "multi-buffer fragments" (MBF), retains the runtime's
current default sized packet-buffer (1.5KB) but allows fragments to
span multiple buffers. The packet-buffers are collected into a vector
and transferred with the sendmsg()/recvmsg() socket calls.

For comparison purposes, a second scheme called "large-buffer
fragments" (LBF) is also evaluated. The idea behind LBF is to simply
increase the size of the runtime packet-buffers so that each fragment
can still be sent in a single buffer. Note that the LBF scheme is not
considered a practical solution, since it is wasteful on systems that
must continue to support both small and large fragment sizes; it is
useful only as a yardstick for measuring the efficiency of the MBF
implementation.

3.7. MBF Implementation Details

With MBFs, a single DG fragment is allowed to span multiple
packet-buffer boundaries. Packet-buffers belonging to a single
fragment are associated by a new "more_data" pointer added to xqe's
and rqe's (xqe/rqe = transmit/receive queue element, the buffers that
hold the data sent from/to an RPC). These transmit/receive queues thus
become two-tiered.

On the sending side, if the maximum fragment size agreed upon for a
given conversation exceeds the packet pool buffer size, the sender
will string together several buffers to contain the data. Multiple
packet-buffers can be sent as a single UDP datagram with the sendmsg()
call.

On the receiving side, the listener thread now allocates 15
packet-buffers for reading in data from the network. Multiple
packet-buffers can be used to read in a single UDP datagram with the
recvmsg() call. By looking at the length returned from the recvmsg()
call, and knowing the lengths of each buffer, it is possible to
determine how many packet-buffers were used to hold a given fragment.
If a datagram fits in fewer than 15 packet-buffers, only those
actually used are passed on for processing. The rest are reused by the
listener thread, cutting down on the number of packets that need to be
reallocated. The process of queueing a fragment remains unchanged; the
more_data field is never seen by any of the packet handling logic.
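To illustrate the receive path just described, here is a sketch of
reading a single (possibly multi-buffer) fragment with recvmsg() and
computing how many packet-buffers it occupied. The names and the flat
buffer array are hypothetical simplifications; the real listener
thread works with rqe's from the packet pool and chains them via
more_data.

    /*
     * Sketch of the MBF receive path: post an iovec per packet-buffer,
     * read one UDP datagram with recvmsg(), then compute how many of
     * the buffers were actually used.  Buffer management and rqe
     * chaining are omitted; names are illustrative, not the runtime's.
     */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    #define PKT_BUFS      15      /* buffers posted by the listener thread */
    #define PKT_BUF_SIZE  1464    /* default packet-buffer data size */

    typedef struct {
        char data[PKT_BUF_SIZE];
    } pkt_buf_t;

    /* Returns the number of packet-buffers the datagram occupied,
     * or -1 on a receive error. */
    static int recv_mbf(int sock, pkt_buf_t bufs[PKT_BUFS], ssize_t *frag_len)
    {
        struct iovec  iov[PKT_BUFS];
        struct msghdr msg;
        int i;

        for (i = 0; i < PKT_BUFS; i++) {
            iov[i].iov_base = bufs[i].data;
            iov[i].iov_len  = PKT_BUF_SIZE;
        }

        /* No source address captured here; a real receiver would also
         * fill in msg_name to learn the sender's address. */
        msg.msg_name       = NULL;
        msg.msg_namelen    = 0;
        msg.msg_iov        = iov;
        msg.msg_iovlen     = PKT_BUFS;
        msg.msg_control    = NULL;
        msg.msg_controllen = 0;
        msg.msg_flags      = 0;

        *frag_len = recvmsg(sock, &msg, 0);
        if (*frag_len < 0)
            return -1;

        /* Knowing each buffer's length, the number of buffers used is
         * the ceiling of (bytes received / buffer size). */
        return (int)((*frag_len + PKT_BUF_SIZE - 1) / PKT_BUF_SIZE);
    }

Any buffers beyond the returned count can be reused for the next
datagram, which is the recycling behavior described above.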
At the time the fragment is dequeued it is necessary to check whether
the fragment consists of multiple buffers. If so, only the first is
returned to the stubs. This requires that the
rpc__dg_call_receive_int() routine muck around with the queue pointers
a bit, to make sure the "head" and "last_in_order" pointers are set
correctly.

4. SENDING MESSAGE VECTORS

When running on networks where we can't, or don't want to, force IP
fragmentation, another way to increase performance is to improve the
efficiency with which data is passed into the kernel. Since it is
necessary to make a separate system call for each datagram sent, small
fragment sizes force the runtime to make a large number of system
calls.

The ideal solution to this problem would be for the socket API to be
augmented to allow applications to send multiple datagrams with a
single call. A more practical solution involves providing this
functionality ourselves, on platforms that already support kernel RPC.
On such systems it is possible to add our own implementation of the
augmented socket call into the kernel RPC library. The user space
runtime can then pass multiple RPC fragments into the kernel with a
single system call; the kernel routine, in turn, sends each of the
fragments as an individual datagram.

4.1. Implementation Details

The sendmsgv() call is analogous to the sendmsg() call. The difference
is that while the sendmsg() call takes a scatter/gather array which it
turns into a single UDP datagram, the sendmsgv() call takes multiple
scatter/gather arrays and turns them into multiple, independent UDP
datagrams. Ideally, such a routine belongs in the socket system call
module, and should have a system call entry point by which user-space
applications can invoke it. At the present time, the code which
implements the sendmsgv() call is bound into the KRPC library, and is
made available to applications through an ioctl() on the device at
/dev/ncs. (KRPC registers itself as the handler for this device.)

The sendmsgv() ioctl interface takes a "msgs_t" argument, which is
constructed as follows:

    typedef struct {
        int fd;                   /* user space descriptor referring to socket */
        int *cc;                  /* return number of bytes actually sent */
        int flags;                /* not currently used */
        int addr_len;             /* length of socket address */
        struct sockaddr *addr;    /* where to send data */
    } msgs_hdr_t, *msgs_hdr_p_t;

    typedef struct {
        struct iovec *msg_iov;    /* an io-vector describing a single UDP datagram */
        int msg_iovcnt;           /* the number of entries in the above vector */
    } msgvec_elt_t, *msgvec_elt_p_t;

    typedef struct {
        msgs_hdr_t hdr;
        int msgv_len;             /* no. of elements in the following array */
        msgvec_elt_t msgv[1];     /* array of independent UDP datagrams */
    } msgs_t, *msgs_p_t;

This ends up looking like the following:

[Figure: layout of the msgs_t argument, drawn across the kernel-space/
user-space boundary. The user-space msgs_t contains the header fields
(fd, cc, addr_len, addr), msgv_len, and the msgv[] array; addr points
to the destination sockaddr, and each msgv[i].msg_iov points to an
array of iovecs whose iov_base/iov_len entries describe the data
buffers making up one independent UDP datagram.]
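As a purely illustrative example of how the user-space runtime might
fill in this structure (using the types defined above), the following
builds a msgs_t describing two outgoing datagrams. The iovec arrays,
socket descriptor, and destination address are assumed to have been
set up elsewhere, and build_msgs() itself is hypothetical; the actual
ioctl() invocation is shown next.

    /*
     * Illustrative only: construct a msgs_t describing two independent
     * UDP datagrams.  Uses the msgs_t/msgvec_elt_t definitions above;
     * iov0/iov1 are iovec arrays describing the fragments to send.
     */
    #include <stdlib.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    msgs_t *build_msgs(int sock_fd, struct sockaddr *dest, int dest_len,
                       struct iovec *iov0, int cnt0,
                       struct iovec *iov1, int cnt1,
                       int *bytes_sent)
    {
        /* msgv[] is declared with one element; allocate space for two. */
        msgs_t *m = malloc(sizeof(msgs_t) + sizeof(msgvec_elt_t));

        if (m == NULL)
            return NULL;

        m->hdr.fd       = sock_fd;      /* user-space socket descriptor */
        m->hdr.cc       = bytes_sent;   /* kernel reports bytes sent here */
        m->hdr.flags    = 0;
        m->hdr.addr     = dest;
        m->hdr.addr_len = dest_len;

        m->msgv_len           = 2;      /* two independent datagrams */
        m->msgv[0].msg_iov    = iov0;
        m->msgv[0].msg_iovcnt = cnt0;
        m->msgv[1].msg_iov    = iov1;
        m->msgv[1].msg_iovcnt = cnt1;

        return m;
    }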
The actual call (made from the user-space runtime) looks like this:

    rpc_socket_error_t serr;
    int ncsdev_fd;
    msgs_t *msg;

    serr = ioctl(ncsdev_fd, NIOCSCK(sizeof(msgs_t)), msg);

The NIOCSCK macro does the bit shifting to create the ioctl() command;
the sizeof parameter determines how many bytes of the "msg" argument
need to be copied into kernel space. The ioctl() handler for /dev/ncs
then calls the sendmsgv() routine, which has the following signature:

    PRIVATE rpc_socket_error_t rpc__socket_sendmsgv (
        struct file *fp,
        msgvec_elt_t *msgv,
        int msgv_len,
        struct sockaddr *addr,
        int addr_len,
        int flags,
        int *cc
    );

APPENDIX A. PERFORMANCE MEASUREMENTS

A.1. Private Client Sockets

Private client sockets have been implemented on an HP Series 400
running HPUX 8.0, and resulted in a 3-5% increase in throughput and a
5-7% decrease in latency (null call time).

A.2. Multi-Buffer Fragments

The following tests were run with executables derived from the
dce1.0.1b18 code base.[1]

    __________
    1. Note that although fragment sizes were allowed to increase
       substantially, the total number of bytes allowed in all
       outstanding (i.e., in transit) fragments was not increased --
       23,936 bytes. Thus, at a fragment size of 16,384, only one
       fragment could be sent at a time. Some testing done with an
       unlimited window size (for comparison) saw transfer rates of up
       to 540 KB/s, mixed with much lower rates due to dropped
       packets. It seems likely that some compromise between the
       extremes would safely lead to higher performance.

    (a) MBF Inter-machine Bulk Data Tests

        server: Series 400/HPUX 8.0 (16 MB)
        client: Series 300/HPUX 8.0 (16 MB)

        Fragment Size (bytes)    Null (ms)    Ins (KB/s)    Outs (KB/s)
        base (b18)                 10.2          324            323
        1454                       10.7          320            291
        4096                       10.7          413            403
        8192                       10.7          451            453
        12282                      10.7          452            469
        16384                      10.6          398            400
        20480                      10.7          420            426

    (b) MBF Inter-machine Bulk Data Tests

        server: Series 700-OSF/1 1.0.2 (32 MB)
        client: Series 700-OSF/1 1.0.2 (16 MB)

        [These tests were done on an open network.]

        Fragment Size (bytes)    Null (ms)    Ins (KB/s)    Outs (KB/s)
        base (b18)                  5.6          671            627
        1454                        6.1          673            612
        4096                        5.9          871            766
        8192                        6.2          898            844
        12282                       5.7          859            809
        16384                       5.8          669            649
        20480                       5.8          675            673

    (c) MBF Intra-machine Bulk Data Tests

        Series 400/HPUX 8.0 (16 MB)

        Fragment Size (bytes)    Null (ms)    Ins (KB/s)    Outs (KB/s)
        base (b18)                  9.8          276            195
        1454                       10.4          250            189
        4096                       10.5          395            354
        8192                       10.5          506            470
        12282                      10.5          533            530
        16384                      10.5          574            562
        20480                      10.5          620            603

    (d) MBF Intra-machine Bulk Data Tests

        Series 700-OSF/1 1.0.2 (32 MB)

        Fragment Size (bytes)    Null (ms)    Ins (KB/s)    Outs (KB/s)
        base (b18)                 4.42          641            666
        1454                       4.50          648            646
        4096                       4.56          992           1044
        8192[2]                    4.57         1190           1222
        12282                      4.58         1283           1300
        16384                      4.54         1334           1341
        20480                      4.58         1378           1390

    (e) MBF Intra-machine Bulk Data Tests

        Series 700-OSF/1 1.0.2 (16 MB)

        Fragment Size (bytes)    Null (ms)    Ins (KB/s)    Outs (KB/s)
        base (b18)                 7.14          398            416
        1454                       7.23          391            407
        4096                       7.42          623            672
        8192[3]                       -            -              -
        12282                      7.49          870            883
        16384                      7.47          926            946
        20480                      7.46          990            986

    __________
    2. Because of dropped packets, two of the ten trials showed well
       below average performance.
       The figures here show the average after dropping the two lowest
       and two highest results.

    3. It's not clear why the OSF/1 platforms saw so many dropped
       packets at the 8192 byte fragment size.

A.2.1. Performance under stress

One other test performed was to see how an overloaded receiver behaves
with and without the presence of IP fragmentation. There was a concern
that fragmentation might tax the kernel's memory resources, and that
dropped fragments and retransmissions would result in a decrease in
performance. A review of the IP fragmentation algorithm suggests that
there is no inherent reason why a heavily loaded receiver should
perform any worse under fragmentation. The following testing was done
on HPUX:

    (a) MBF Inter-machine Concurrent Bulk Data Tests

        10-threaded client/server testing
        server: Series 400/HPUX 8.0 (16 MB)
        client: Series 300/HPUX 8.0 (16 MB)

        Fragment Size (bytes)    Ins (KB/s)
        base (b18)                   30
        1454                         28
        4096                         42
        8192                         50
        12282                        52
        16384                        57
        20480                        59

These results show no degradation in performance. Compare 10 clients
using 8K MBFs with the results from a single client running on the
same platform using the same sized MBFs: in the first case there are
10 clients transferring 50 KB/s each; in the second case, a single
client sending 451 KB/s. Also, there was no evidence that the server
was dropping fragments during the testing.

It must be pointed out that a more useful stress test would have been
to run the 10 clients from 10 different machines; however, these
resources were not available at the time of the testing. This remains
an area that will be closely monitored.

A.2.2. Comparison with LBF

For comparison purposes, we also present the performance measurements
for a runtime which was modified to use only 16KB packet-buffers. This
implementation is useful for assessing the overhead of the
multi-buffer scheme, but would be highly inefficient in an environment
that required the ability to support both small and large fragment
sizes.

    (a) LBF Inter-machine Bulk Data Tests

        server: Series 400/HPUX 8.0 (16 MB)
        client: Series 300/HPUX 8.0 (16 MB)

        Fragment Size (bytes)    Throughput (KB/s)
        1454                           338
        4096                           458
        8192                           503
        12282                          486
        16384                          497

    (b) LBF Intra-machine Bulk Data Tests

        Series 400/HPUX 8.0 (16 MB)

        Fragment Size (bytes)    Throughput (KB/s)
        1454                           239
        4096                           378
        8192                           523
        12282                          591
        16384                          647

In comparison with the results of the MBF testing done on the same
platform, these figures indicate that the current implementation of
MBF imposes an overhead of approximately 10% over the simpler LBF
scheme. Considering that LBF presents a "best case" scenario tuned to
one particular environment (one which only supports large fragment
sizes), this cost seems quite reasonable.

A.3. The Sendmsgv() Call

The sendmsgv() call was implemented and bound into an OSF/1 1.0.2
kernel running on an HP Series 700 workstation.

    (a) Intra-machine Bulk Data Tests

        Series 700-OSF/1 1.0.2 (16 MB)

                              Null Call (ms)    Ins (KB/s)    Outs (KB/s)
        base (b18)                 6.43            443            452
        sendmsgv w/1 pkt           6.69            430            441
        sendmsgv w/vector          6.62            462 (+4%)      516 (+14%)

    (b) Inter-machine Bulk Data Tests

        sender  : Series 700-OSF/1 1.0.2 (16 MB)
        receiver: Series 700-OSF/1 1.0.2 (32 MB)

        [Testing done on an open network.]
                              Null Call (ms)    Ins (KB/s)    Outs (KB/s)
        base (b18)                 3.95            729            717
        sendmsgv w/1 pkt           4.08            726            720
        sendmsgv w/vector          4.08            794 (+9%)      799 (+11%)

For each of the above, the first line shows the results of the tests
run on dce1.0.1b18. The second line shows the results of implementing
the sendmsgv() call, but not using its ability to handle vectors of
messages. That is, sendmsgv() was simply substituted for the current
use of sendmsg() to check on the overhead of the new call. The last
test shows the results of using the sendmsgv() call's ability to
handle multiple messages.

It was expected that the first two tests would have exhibited similar
performance. This was not the case, and probably points to some
inefficiency in the implementation of the sendmsgv() call. If this
inefficiency can be corrected, it would be expected that the figures
shown in line 3 would improve by a similar margin.

It's not clear why there is a 10% difference in the performance
increase of the intra-machine Ins/Outs tests. The small increase in
performance for the Ins test appears to be due to a loss of
parallelism that happens any time client and server share a single
CPU. Normally, a sender queues its next batch of packets concurrently
with the receiver's processing of the last batch. When the receiver is
ready for more data (which it signals by sending an ACK), the client
should have a complete vector ready to send. Without this concurrency,
when the sender receives an ACK, it does not have a full batch of
packets queued, and can only send whatever data is ready. With fewer
fragments ready to send, the vectoring ability of the sendmsgv() call
goes unused. By looking at the number of fragments sent per call it
was possible to confirm that the Ins test was suffering from a lack of
concurrency. Presumably there is some imbalance in the protocol
overhead between senders and receivers that prevented the Outs test
from exhibiting the same degradation.

Finally, although it wasn't implemented in time for this paper, it is
possible to avoid affecting NULL call performance by not using the
vector based call for single packet transmits. With this modification
it is expected that NULL call performance would remain unchanged.

APPENDIX B. PACKET RATIONING IMPLICATIONS OF MBF

The DG packet rationing scheme makes the assumption that a one-to-one
correspondence exists between packet-buffers and fragments. This is no
longer the case with MBFs. Since the runtime is not capable of dealing
with partial fragments, it's clear that what we really need is a
fragment rationing scheme.

Currently, under rationing, calls are allowed to use only a single
packet-buffer at a time. This would work fine for senders of MBFs,
since they can just scale down their fragment size whenever the system
enters a rationing state. Unfortunately, this will not work on the
receiving side. Since the receiver has advertised that it can handle
MBFs, it may be the case that the sender has queued up several of
these larger fragments at the time the receiver begins rationing. At
this point it is not possible to force the receiver to use one
packet-buffer at a time, since the next few fragments it will receive
are MBFs. It's also not practical to tell the sender to repackage its
fragments, since this would require changing the fragments' sequence
numbers, which might conflict with out-of-order fragments already
queued at the receiver.
The most straightforward solution is to require a call to make as many
packet-buffer reservations as are required to guarantee that it can
always read in the largest MBF size that it advertises. For instance,
an intra-machine call which wants to use 16K MBFs would need to make
12 reservations (12 buffers of 1454 bytes each). In effect, the call
is making a fragment reservation. (If it later turns out that the
other side of the conversation can not handle MBFs, some of the
reservations could be returned.)

First, let's consider if this is reasonable in a user-space runtime.
Packet rationing is used to guarantee that a system that is low on
packet-buffers, and high on buffer users, will not deadlock. In the
kernel, where space is at a premium, there are only a limited number
of buffers in the packet pool, and a high probability that this
resource will get strained. In user space, however, land is cheap and
the packet pool can be allowed to grow fairly large. Under these
circumstances, the bottleneck for a system is likely to be the CPU,
rather than memory.

The user space runtime defines the maximum number of packet-buffers to
be 100,000. In practice, this number is meaningless since there are
other constraints that would preclude this many buffers from ever
being in use. For example, each call is limited to queueing a maximum
of 96 buffers; there would need to be over 1000 concurrent calls
running for the maximum number of buffers to be in use.

Packet rationing begins when the number of packet-buffers remaining in
the pool is equal to the number of all current reservations. In 1.0.1,
this means that if there are 1000 concurrent calls, packet rationing
would begin when the following number of packet-buffers were in use:

    100,000    number of buffers
    - 2,000    2 reservations for each of 1000 calls
    -------
     98,000    packet rationing threshold

But if each call is only allowed to queue up a maximum of 96 buffers,
that means the maximum number of buffers in use by all calls is
96,000. Therefore, under these conditions, packet rationing would
never be necessary.

What happens if we make 13 reservations for each call? Working out the
math, the number of concurrent calls is lowered to 917:

    100,000    buffers
    -11,921    13 reservations for 917 calls
    -------
     88,079    rationing threshold

        917    no. of concurrent calls
       x 96    max. queue length per call
    -------
     88,032    total possible buffers in use

Of course, if an application tried to run more than 917 concurrent
calls, everything would still work (slowly); the rationing code would
kick in and guarantee that the calls ran without deadlock.

Given all this, I would propose that we modify the packet rationing
scheme to allow call threads to make multiple packet reservations. The
call thread would then use the number of reservations received to
determine the MBF size it wanted to advertise. From a cursory
examination of the code, I think the rest of the packet rationing
logic can remain the same. Users of the packet rationing code are not
dependent on the implementation of the "reservation" abstraction; they
just care about whether they have one or not.

The picture for KRPC is not as promising. The maximum number of
packet-buffers that KRPC will allocate is 64, which means that the
system can support a maximum of 31 concurrent calls. If we assume that
these calls may need to do an RPC to complete their processing, the
number of concurrent calls is reduced to 15.
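The user-space arithmetic above reduces to a one-line rule of thumb;
the helper below is only a restatement of that reasoning, not proposed
runtime code: with R reservations per call and a per-call queue limit
of Q buffers, rationing can never be triggered while the number of
concurrent calls is at most pool / (R + Q).

    /*
     * Sketch of the reservation arithmetic above.  Rationing begins
     * when buffers-in-use reaches pool - calls*R; since a call can
     * queue at most Q buffers, that point is never reached as long as
     * calls * (R + Q) <= pool.
     */
    static unsigned int max_calls_without_rationing(unsigned int pool_size,
                                                    unsigned int resv_per_call,
                                                    unsigned int max_queue_per_call)
    {
        return pool_size / (resv_per_call + max_queue_per_call);
    }

    /* max_calls_without_rationing(100000, 13, 96) == 917, matching the
     * figure worked out above. */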
Again, I think the right way to think about this is that rationing is
the state where you have some maximum number of calls running, each
using one fragment (not buffer) at a time. If you allow the fragment
size to increase, the number of concurrent calls decreases. With 4K
MBFs, you'd be limited to 15 concurrent calls. With 8K MBFs, the limit
would be 9. If KRPC continues to allow a maximum of 64 packet-buffers,
I think we probably should not allow the use of MBFs. If we could
double the number of packet-buffers, we could then allow 4K MBFs.

AUTHOR'S ADDRESS

Mark Karuzis                           Internet email: markar@apollo.hp.com
Distributed Object Computing Program   Telephone: +1-508-436-4337
Hewlett-Packard Co.
250 Apollo Drive
Chelmsford, MA 01824
USA