Resent-From: icsc-nativewg@opengroup.org
From: Matthew Pearson - APL Software
Date: August 17, 2004 3:25:08 PM EDT
Resent-To: icsc-nativewg@opengroup.org
To: icsc-nativewg@opengroup.org
Subject: draft meeting minutes 2004/08/17

Participant code: 356608
US Dialin: 1 866 874 0872
Intl Dialin: +44 1452 562 905
UK Dialin: 0845 146 2019

8/17/04 Meeting agenda
(Taking minutes: Sun for IBM)

cascading ascii art attendance diagram.
(if you have more than 1 minus visible, you are not eligible to vote.)

       hp    ibm    netapp  sun
       ----  -----  ------  ---
m-3
m-2
m-1    +     +      -       +    <- when enforcement started
m-0    +     +      -       +

new action items:
  AI fn to post to reflector and gather input on exposing local port number.

present:
  jim hamrick (jh) [HP]
  matt pearson (mp) [Sun]
  fredy neeser (fn) [IBM]
  jay rosser (jr) [HP]

Agenda bashing, approve minutes

Minutes to approve: Email from Fred Worley, subject "ICSC ITWG draft minutes, 8/4/04", sent 8/3/04 6:31PM PT

so approved. wonder and amazement that fred was able to send out the minutes before the meeting happened.

Action item review

mp hasn't sent out old minutes, can't remember if there are any outstanding ai's. will post today or tomorrow.

FN - MPA REQ/REP above or below RNIC-PI

fn two possibilities. one is to do the req-rep handshake above the rnic-pi interface. the other is to have it go on below. we would need different rnic-pi support in each case. tried to identify what rnic-pi would need in order to implement these two different options. for either implementation we may need support beyond the RDMAC verbs. already have seen things like marker settings that are not mentioned in the verbs.

fn also did a review of modify qp verb. although the verb does not mention mpa req/reply, if you want to use them you basically have to use the last streaming mode message to send the mpa reply. from the verbs perspective presence or nonpresence of LSM message determines active/passive role. we now have several definitions of active/passive.
need to be clear which definition we mean. for example LSM from our ulp is not lsm from modify qp perspective.

fn if we want to use the modify qp lsm as in the verbs it makes sense to have the handshake above the rnic-pi. however since the mpa protocol is offloaded might also make sense to offload its startup as well. tcp CE: tcp not yet present, yet the syn-ack messages are considered part of the tcp spec. maybe this is just a formal reason for decision - believe complexity is most important consideration.

fn next i compared the options both for TI interface as well as TD interface. let's look at I1 (above rnic-pi).

[ lost thread. see 1.2 of his email message ]

fn marker setting for request/reply. need to set this ... see 1.3. cannot change marker setting for send. this is determined by peer. all we can do is modify receive markers. one option is to override them, enable it - ignore what rnic thinks is best. but, maybe there is a markerless rnic out there that for some reason performs better w/o markers. we should have a way to query rnic to find this out.

fn not advocating a markerless implementation. this is merely to cover cases where those implementations exist.

jr what are the implications of this? if the device says it is better to use in a markerless mode, do we surface that to the user?

fn think it could be hidden. all that is needed is that we need to fill in the marker bit in the mpa req/reply.

mp TI or TD case?

fn TD case.

mp so consumers have control of markers but no visibility in whether to use them?

fn no way in itapi to ask rnic that question. this is an rnic-pi requirement. once we have convinced ourselves whether we need additional interfaces, we should contact rnic-pi group.

jr also should consider what we would do with the information we request. there are other models that may arise, if we can obtain additional information from the device. more negotiation, for example, or expose to consumer. this can also relate to IRD/ORD issue.
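
illustration of the marker-bit point above (fn: "fill in the marker bit in the mpa req/reply"). this is a minimal sketch, not anything agreed in the meeting. the frame layout is assumed from the later published MPA specification (RFC 5044): a 16-byte key, one flags byte with M, C and R in the top three bits, a revision byte, and a 2-byte private data length. the drafts current in 2004 may differ, and the default revision value is an assumption. want_receive_markers() and its rnic_prefers_markerless argument are hypothetical: they stand in for the rnic query that, per the discussion, does not exist yet.

```python
import struct

# Layout assumed from RFC 5044; the 2004 drafts may differ.
MPA_REQ_KEY = b"MPA ID Req Frame"   # 16 bytes
MPA_REP_KEY = b"MPA ID Rep Frame"   # 16 bytes
FLAG_MARKERS = 0x80  # M: sender wants markers in the stream it receives
FLAG_CRC = 0x40      # C: sender's CRC preference
FLAG_REJECT = 0x20   # R: connection rejected (meaningful in a reply)
MAX_PRIVATE_DATA = 512


def want_receive_markers(rnic_prefers_markerless, force_markers=False):
    """Hypothetical policy: follow the RNIC's preference unless overridden."""
    return force_markers or not rnic_prefers_markerless


def build_mpa_frame(key, markers, crc, private_data=b"", rev=1, reject=False):
    """Build an MPA request or reply frame with the given flag settings."""
    if len(key) != 16:
        raise ValueError("MPA key must be exactly 16 bytes")
    if len(private_data) > MAX_PRIVATE_DATA:
        raise ValueError("private data too long")
    flags = 0
    if markers:
        flags |= FLAG_MARKERS
    if crc:
        flags |= FLAG_CRC
    if reject:
        flags |= FLAG_REJECT
    header = struct.pack("!BBH", flags, rev, len(private_data))
    return key + header + private_data


def parse_mpa_frame(frame):
    """Split a frame back into its fields (no validation of the key)."""
    flags, rev, pd_len = struct.unpack("!BBH", frame[16:20])
    return {
        "key": frame[:16],
        "markers": bool(flags & FLAG_MARKERS),
        "crc": bool(flags & FLAG_CRC),
        "reject": bool(flags & FLAG_REJECT),
        "rev": rev,
        "private_data": frame[20:20 + pd_len],
    }
```

only the receive direction is set here, matching fn's remark that the send marker setting is determined by the peer. whether the rnic preference is hidden from the consumer or surfaced is the open question jr raised; the sketch simply hides it behind want_receive_markers().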
fn legacy support is another issue for thought. another issue is do we enforce enablement of receive markers? this would be a verbs extension, and of course querying a marker pref would be a verbs extension. there are arrows (=>) in my note which indicate the rnic-pi requirements.

jr I1.4 and I1.6 are the same requirement?

fn yes. for I2, it's trickier. if you want to do handshake below rnic-pi, interface is more complicated. I2.1 you would need a connect QP/accept QP interface. feel this is too complex. based on this feel it is safe to assume I1 is the path to go.

jr I would agree. seems considerably more feasible.

fn there were a few comments by caitlin, but i think she asks a different question: if it's above the rnic-pi, would it be a user level socket or a kernel socket that establishes the connection? think she was talking about immediate transition to rdma. think her concern was reducing the number of ctx switches, mainly. don't see how that is being reduced by using a kernel socket - in TI interface you still have private data that must be exchanged with consumers. not sure that handling req/reply in kernel reduces number of switches at all.

jr private data is only exposed via the TI interface. trying to wrap my head around this in a user-mode CM.

fn what she had in mind in her note is this. let's say a usermode app wants an rdma mode connection with immediate transition. one could do two things. user socket, exchange mpa req/reply through user socket and then the context of the tcp connection would be migrated from user space to the rnic. basically with the modify qp to rts. the alternative would be to do it_ep_connect and immediately forward call from user library to a kernel layer and instead a kernel socket could be used to establish the connection. might be easier to transfer context in that case to the rnic. again, private data still needs to be exchanged so a context switch may still be necessary. not sure i fully understood caitlin's questions.
maybe i will follow up and ask for more detail.

jr if you use ep_accept you could make a single call and transfer to kernel mode and send private data all in one call, could it not? at least from the application's point of view.

fn not sure. if you think about the connection request, on the responding side, that still needs to be surfaced. so even if both sides use kernel sockets the private data in the connection requests need to be surfaced. that would imply a context switch [at least on receive]. and similarly for the CE event on the initiating side that also needs to be surfaced, another ctx switch.

jr haven't spent enough time with caitlin's email to count ctx switches. think we need to do an outright comparison here... it's not been part of my model to think of doing CE from user space. begs the question - is that something we need to seriously consider / is it a requirement?

fn so your assumption is in the TI case one would always use a kernel socket?

jr that had been my thought. it seems like a cleaner interface there.

fn then you need to pass private data through some proxy interface between your kernel socket and your usermode application, on both sides.

jr but the interfaces defined in the api already have a calling interface. you could have it_ep_connect map to a system call and map to the kernel. there's an interface to get it down there.

fn so you're saying the syscall interface could easily support this.

jr yes, i believe so.

fn while this is for the TI case. for the TD, of course, we cannot interfere with the application. they might have usermode sockets.

jr i don't know what you mean exactly by usermode socket. do you mean the conventional application sockets api?

fn where part of the socket structure would be in userspace.

jr do you mean something beyond berkeley sockets?
usermode sockets could mean a lot on these technologies - right up to an SDP interface - entire library, or almost, in userspace - far fewer transitions than a berkeley sockets impl. trying to mix an offloaded sockets impl with this would be *very* complex. model i had in mind was a traditional sockets model.

fn i guess language was confusing. i was also referring to berkeley sockets.

jr you are using the term immediate transition with TI interface - which is slightly overloaded in iwarp spec. on llp establishment, rdma mode is IMMEDIATELY established. no streaming transition at all. that is a model we haven't really looked at very seriously. we could support it easily with the api, but the underlying mechanisms ... then there is the other model where at least you have a LSM.

fn is this a terminology problem?

jr maybe. you don't mean when you use the TI interface that we do an iwarp immediate transition (no LSM). you mean immediate from the consumer perspective.

fn it is not that immediate. it's not like in infiniband. i was just trying to contrast it to the deferred transition from streaming mode.

jr see appendix 13.1 connection initiation at LLP startup for what i am getting at. see also 6.6.1. an RI MUST support CE after some streaming mode messages have been sent. an RI MAY support CE with no streaming mode data. when doing CE i have assumed the MUST.

fn have you considered a proposal to distinguish these three modes? two immediates and one deferred. i will be more careful with my use of the term immediate.

iWARP CM additional high level requirements

Email thread started by Jay Rosser, subject "Ballot - IT API Phase 2 Additional Requirements for iWARP CM 8/4/04", sent 8/4/04 12:00PM PT

jr the only requirements that were rejected were TD private data.
iWARP CM detailed requirements

Email from Jay Rosser, subject "New draft detailed iWARP CM requirements and state diagram", sent 8/4/04 5:01PM PT

Email thread started by Fredy Neeser, subject "Action Item - Generating/parsing MPA REQ/REP above or below the RNIC-PI", sent 8/16/04 9:47AM PT

jr fredy sent some email clarifying how to get the states to appear more similar and consistent. looks like we need some new terms: conversion initiator and conversion responder. these would have the inverse to rdma initiator and rdma responder. and thus states would look the same for both TI and TD case.

fn first figure in email [] is the TI case. the active/passive roles here are determined by who calls connect and who calls connect. the left side here should correspond to active and the right to passive. i've defined here the left side as the rdma initiator. and the right as responder.

fn second page. this is now the conversion interface precisely as jr had outlined it. it's just drawn in such a way that the mpa req is still sent from left to right. similarly the first rdma message is sent from left to right. clearly if we want the same itapi states to take the same roles the left side should be the active state and the right side the passive. on the other hand we have the right hand side in this case, which sends the last ULP streaming mode message with it_socket_convert, in a sense the right hand side is the initiator of the conversion. i think we can get round this dilemma by just defining new terminology. for the TD case in figure two the conversion initiator shall be the rdma responder and the conversion responder will be the rdma initiator.

fn figure 3 shows that the roles can be the same without req/rep.

jr how do we expose this to the consumer?

fn think the person who calls convert should always be called the initiator.

jr in figure 1, the left side cannot successfully initiate anything [ unless? ] the passive side has waited.
if the first it_socket_convert call has a notion of being a passive interface?

fn somewhat counterintuitive?

jr yes. but you still need to do a listen before you do a connect, no?

fn most people would say the callers of ep_connect would be active. if we have the same api call, the conversion on both sides, then whoever issues the call first? think that is just natural. unfortunately the way these modes have been outlined in mpa there are these two different ways of assigning roles to the peers. don't see a way to avoid making this explicit to the itapi user. the consumer should know the concept of conversion initiator and responder. the active/passive states correspond to the rdma level, not conversion level, and the consumer should know that for some reason they are flipped between the TI and TD case.

fn but i like the idea of calling the peer who sends the last ulp streaming mode message the initiator. it is very clear. it works for sdp, iser, without mpa req/rep, but it applies only to the TD case.

jr to be honest this issue does not give me a great deal of heartburn. you gave me an AI to rewrite the state diagrams. haven't done that yet so can't comment in more detail.

fn i may be creating additional work and apologize for that but on the other hand it might turn out to be easier in the end. if you need to explain the it_ep_state manpage - states play the same role in TI or TD case?

jr [ asks a question ]

fn think an additional state is necessary on passive side. right now we have conn pending. one more state would be needed to recv/parse the mpa request.

jr that is my next issue.

fn perhaps we can iterate, i can give you intermediate feedback.

Email thread started by Fredy Neeser, subject "IRD/ORD negotiation with TI vs. TD interface", sent 8/16/04 10:13AM PT

fn in the TI case you have support for this in the ULP header. a similar negotiation is not possible without private data in the TD case.
so you need to know the IRD ORD in advance or modify them after the connection is up.

jr and i replied to you with the iSER req that ORD be modified after CE. that's written into the protocol. but that is orthogonal to this issue though. had an offline discussion with one of my colleagues here and it's not something i had thought of, but it is worth considering, defining itapi private data so we could stick with our contention that the TD interface to consumers doesn't need private data, but the impl could use that private data in the mpa headers. we could negotiate ird/ord but that would require an interface to allow users to request it. another option is to define something in the cep header that defines the characteristics of the remote device. so you could find out via the TD private data what your remote side was capable of.

mp but that would be only if the other side was running ITAPI? concerned. one main goal of TD is to speak to non ITAPI clients. if we use private data we break that.

fn would this be a limitation? is there a conceivable application where a TI client operates with a non ITAPI peer?

jr not possible unless the remote side uses the CEP header.

fn what if there is an option to disable CEP header? then you could talk to a nonITAPI implementation that doesn't use a proprietary header? in other words there'd just be a standard on how to use the private data.

jr downside is consumers could not negotiate ird/ord using standard interface.

fn could still negotiate ird/ord via private data himself.

jh what mechanism would you use to turn off CEP headers in TI case? would need to be backwards compatible with existing API. there's a two-way and three-way flag right now, could add it there - but what is the benefit of complicating the TI model with this?

mp sounds to me like this proposal is halfway between TI and TD case?

jr goal is to write a TI api that can run on any transport.
constraint is introduced by iwarp transport - imposes a programming model, you need to send a first framing pdu to pop the other side into connected mode. that's not backwards compatible. IB has no such requirement. that's another reason to revise major version number of itapi. such an application would work on ib, but ib apps wouldn't necessarily work on this.

jr we are being pushed out of easily supporting legacy 1.0 apps unless they were written with iwarp in mind.

fn main point is, you can talk to a non itapi consumer with TD model and deferred transition. can't with TI model. asymmetry here is a concern.

jr even if you did have the option to disable CEP, we still have a requirement that the first messages be request/reply frames. that defines a ulp albeit a small one. potential for incompatibility.

fn you are right, it's still a ulp.

fn what is the side effect of this small ulp (non CEP)? requirement to send a message within a reasonable period of time.

jr can't remember if we have timers in the detailed requirements. in IB, after the connection, either side may send a message. in iwarp, it's a requirement on the initiator to send the first framing pdu to get the responder in rdma mode.

jh effect of not doing that is the remote side hangs.

fn another concern is to make sure remote side doesn't post too soon.

jh we need to rev the version number to make this clear, that consumers with a ulp model where the receiver posts the first message will not work.

fn need to point this out somewhere.

jr another issue is that the consumer needs to post a message to the ep before initiating connection.

fn could do it in pending state?

jr potential for race condition? ...

fn ird. suppose initiator has ird of 16, responder has 2. no way for initiator to reduce his ird? doesn't this waste resources?

jr yes, but it doesn't break anything.

jh could work around this but there would be more complexity. expose another flag to consumers?
or tell consumer to prepare for not being able to change this.

jr see page 74 of the verbs to see what we can do. ird reduction is optional, and ord is mandatory.

[ discussion of state diagrams and who does what in what order ]

jr thinking back to iwarp case of initiator sending first message, i had thought that was a TD requirement, but really it's required for all, to ensure interoperability. need a page or section to describe iwarp characteristics that all clients should follow.

fn suggest add message sequence charts to new section on iwarp. easier to understand than state diagrams. along the lines of the time charts that jay and i have produced, initiator/responder and the messages they exchange. should be done for TI and TD case.

jr feel these belong in the it_ep_state_t manpage. even though they are transport specific, clients who want to code transport independently must follow them.

fn question on ce-6.0.1. says it_conn_qual_t, should be conn_qual_type_t. connection qualifier contains remote port, right?

jr yes.

fn so if you use the TI model with connect are we exposing a local port to the consumer? do we allow the consumer to read the local port, or even to specify it?

jr we have no such mechanism in itapi as it stands. question stands. is that a good thing? we are assuming ephemeral ports it seems. open issue of what if you want to bind to a well known port. haven't found a really strong case for it.

fn perhaps we should ask this on the reflector?

AI fn to post to reflector and gather input on exposing local port number.

jh the way we define a path structure, the local port number would be a part of it. not sure how a bind would work.

jr currently the path only has an address, not a port. caitlin had proposed a local ip address and a qos characteristic. we punted on the latter seeing as that is a phase 3 issue.

fn in iwarp now there is a union called remote. how does this fit in?
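
illustration of the local port question raised above (read the local port vs. specify it). this is a minimal sketch using plain berkeley sockets over loopback, not itapi, which per jr has no such mechanism today. the helper names connect_from(), pick_free_port() and demo() are made up for the example.

```python
import socket

LOOPBACK = "127.0.0.1"


def connect_from(server_addr, local_port=None):
    """Connect to server_addr and return the local port the socket ended up with.

    local_port=None leaves the choice to the kernel (ephemeral port);
    an integer binds the socket to that port before connecting.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        if local_port is not None:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            s.bind((LOOPBACK, local_port))
        s.connect(server_addr)
        # "Reading" the local port: ask the kernel what it assigned or accepted.
        return s.getsockname()[1]


def pick_free_port():
    """Ask the kernel for a currently unused port, then release it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.bind((LOOPBACK, 0))
        return probe.getsockname()[1]


def demo():
    """Return (ephemeral_port, requested_port, port_actually_bound)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind((LOOPBACK, 0))
        listener.listen(4)
        server_addr = listener.getsockname()
        ephemeral = connect_from(server_addr)      # kernel picks the port
        requested = pick_free_port()
        bound = connect_from(server_addr, requested)  # caller picks the port
    return ephemeral, requested, bound
```

the first connect corresponds to the ephemeral-port assumption in the minutes. the second corresponds to a consumer specifying the port, which in an api would need somewhere to carry it, for example the path structure jh mentions.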
fn page 247 comment above definition of remote union in it_path_t that currently says the "remote component of the path" and should say "the transport dependent part of the path".

fn can someone explain what a spigot means?

jh the spigot is intended to indicate an addressable entity on your ia. most natural way of implementing it is mapping a spigot to a port on your local adapter.

fn per port or per ip address?

jh per port. each spigot can have multiple addresses. lot of flexibility for ia creators on what they wish to associate with a spigot.

fn mostly though it's for mapping multiport adapters.

jh in IB there are events that tell you fabric has gone up or gone down and are on a per port basis. can't remember if our events map to that.

fn 6.0.4.7.1. says that for the rdmac transport it mpa marker control should always be it_false. i thought that rdmac transports would be selected per connection per this mpa no request/reply flag? not sure i understand this requirement.

jr goal is to express the characteristics of the rnic itself. orthogonal to mpa startup frames.

fn sometimes we use the rdmac/ietf distinction to mark differences in hardware and sometimes in protocol.

jr would it be more clear if we said rdmac rnic rather than rdmac transport?

fn ce-1.4.1 will meet the requirement for 1.5 on the TD interface only? maybe should add a comment to make it really clear the interworking is only TD.

jr so just add a sentence that says specifically interworking with non itapi peers is available on TD interface only.

fn editorial suggestion on 1.4.2.1. says the ietf version etc optional use of mpa markers. should be mentioned that this is for receive only. whereas the rdmac requires it for both.

fn so some of these would be voted down?

jr as far as detailed requirements go we've tried to come to an agreement on them as a whole, rather than vote on each individually.
jr going back to the rnic pi issue, if we go with i1, it is our implementation that is emitting these frames. so does the ietf requirement that you send frames, does that have any teeth? it seems on such a device, with i1, we'd have the option to not send these, right?

jr somewhere in the requirements we have a statement that consumers trying to turn off no-req-rep on top of an ietf rnic - should get an error. but if these frames are a function of our sw implementation and not of the device, now this seems artificial.

jh we want to return an error when we request something of a device it isn't capable of, we don't want to do it when we don't need to.

fn with the mpa handshake above the rnic-pi it would always be possible to suppress this. how can we have a device where we cannot suppress this handshake?

jh in your investigation, had you decided that you were not going to support devices that did i2?

fn yes. it would be too complicated to support such a device.

jh then i agree with you.

jr then we are making an assumption that itapi will run on rnic-pi, yes?

jh let me see if i understand this. you are talking about tossing an entire category of devices? i2 devices?

jr i thought we were trying to generate requirements for the rnic pi.

fn came to the conclusion that i1 is the logical thing to do. a few requirements, like verbs extensions come out of this. we should suggest i1 to rnic-pi. if they have compelling arguments for another implementation that is totally different from verbs we should consider it. but if one class of implementations is much easier, let's simplify it.

jr feel a slight bit of resistance to this. the only requirement on us would be to be able to surface an error trying to disable req-rep. over the rnic-pi such an error would never occur.

jh why do you want to encourage people to design hardware that has interoperability problems?

jr don't have visibility into how much our encouragement affects designs.
jh if someone can point to an rnic vendor that has gone down the bad path we should consider this, but if no one has yet, then...

jr don't want to design ourselves into a hole.

fn as a side question, how should we proceed with the rnic pi given this ai? does it make sense to propose one of those implementations to further discussion and see how they feel about it?

jr agree that single recommendation and asking questions would be a good approach.

fn they can still come back and say they would do it differently.

-- meeting adjourned at 3:03 ET