8/31/04 ICSC ITWG Meeting Minutes

Taking minutes: HP (FW)

  n  HP      Jim Hamrick (JH)
  Y  HP      Jay Rosser (JR)
  Y  HP      Fred Worley (FW)
  Y  IBM     Fredy Neeser (FN)
  n  NetApp  Arkady Kanevsky (AK)
  Y  Sun     Matt Pearson (MP)

Cascading ASCII art attendance diagram.
(If you have more than 1 minus visible, you are not eligible to vote.)

         hp    ibm    netapp  sun
         ----  -----  ------  ---
  m-3
  m-2    +     +      -       +     <- when enforcement started
  m-1    +     +      -       +
  m-0    +     +      -       +

Next Meeting: Tuesday 9/7/04

Minutes:

1) Approval of previous minutes

   Minutes to approve:
   - Email from Matt Pearson, subject "draft meeting minutes 2004/08/17", sent 8/17/04 12:25PM PT

   Minutes approved

2) Action Item Review

   a) Capture handling of iWARP transport dependency regarding the first RDMA send message and the matching RDMA receive DTO in a requirement / input for man pages (JR)

   Discussion:
   - issue we were discussing last week is the need for additional states in the diagram to express that, until the consumer receives the first send from the remote, they cannot go into the connected state
   - con.: not introduce additional states to the diagram, but instead put in strongly worded advisory text that says this behavior (verbs behavior) must be adhered to or conn. est. may fail
   - see Jay email today 8/31/04 - detailed requirement number CE-xx

      I had an action item to produce text for the man pages describing the need to respect the iWARP verbs recommendations on posting the first send on the RDMA initiator and the corresponding requirements on posting the first receive on the responder.

      CE-X.X The man pages must have text added describing the iWARP use model required of the Consumer to establish connections successfully. The following requirements specify what the text should convey:

      CE-X.X.1 On the iWARP transport, on receipt of a Connected Event, the Consumer on the RDMA Initiating side must post a first Send DTO to the newly connected Endpoint in order for the remote Consumer to transition into the Connected state.

      CE-X.X.2 On the iWARP transport, the Consumer on the RDMA Responding side must post at least one Receive DTO to their to-be-connected Endpoint before attempting to respond to an incoming connection attempt.

      CE-X.X.2.1 Failure to post a Receive DTO as specified in CE-X.X.2 may cause the connection to be torn down on receipt of the remote Send.

      Jay

   Comments:

   FN:
   - one reason for conn. to be torn down is a timeout
   - if initiator fails to do a send after entering a connected state, what will happen?
   - JR: nothing
   - FN: so the initiator hangs in the connected state; so the timeout on the initiator side is not effective
   JR: but the initiator can do RDMA
   - it really *is* connected
   - if responder had registered memory and conveyed RKEYs by some OOB mechanism, he should be able to touch them
   FN: thinking of some application which sent the first message in the opposite direction (which would not work)
   - does accept have a timeout on the passive side?
   JR: no
   - connect takes a path, has a timeout in the path, but no corresponding data structure in accept
   FN: so passive side will hang indefinitely if it assumes that it should send first
   JR: not exactly hang
   - just won't receive a connected event; can do things, but cannot do RDMA on that endpoint

   Agreed to discuss / review internally and discuss again in next week's meeting

3) iWARP CM issues

   See email thread started by Fredy Neeser, subject "Connection Management - MSCs and state diagrams", sent 8/24/04 1:57PM PT

   JR: device would need to receive a DDP packet to determine whether or not to go into RDMA mode
   - to manifest such a capability is outside the verbs
   - supporting such is a slippery slope
   - since the verbs do not provide a mech to do this, we are constrained more by the suggestions in the verbs text
   FN:
   - two conflicting specifications
   - verbs warn consumer that they should be cautious and not send ? "too early"
   - spec says protocol imp should make sure that first DDP packet has been received before anything can be sent out
   - MPA imp has to hold back any FPDUs that are in the line to be sent
   JR: which contradicts the notion of a ? state as defined in the verbs
   FN: no intermediate state is defined in the verbs that would hold back ?
   JR: discussed with local iWARP expert
   - opinion was quite strong that we should stick to the verbs
   - seems a slippery slope to move away from the verbs
   FN: Is there any conceivable application of being able to send away earlier, before the underlying MPA has actually reached this state of receiving a DDP message?
   - that would mean the application blindly sends away the messages without being sure that the underlying MPA has received anything
   JR: Not sure that is what Caitlin is suggesting
   - still recommending that the underlying implementation be aware that remote has gone into MPA mode by receipt of that DDP packet, so distinction between Caitlin's recommendation and Verbs recommendation is how soon a connected event from API p.o.v. appears to the consumer and allows them to post
   - differential in time between her recommendation and what the verbs recommend is the amount of time taken to surface a connected event up to the consumer, e.g. to schedule their process; to get to the point where they can then post
   - that is the difference in elapsed time
   FN: not sure this is what Caitlin meant
   - in her opinion, ? would reach RTS state anyway, regardless of whether a DDP message has been received
   - then it is the MPA protocol's duty to hold back messages that the consumer may have sent
   - Caveat: may have misunderstood Caitlin's argument
   JR: may be requirement on the device to hold back
   - consumer could post but posting could not be transmitted until ...
   FN: that would mean then that the queue would fill up and the consumer would get an error message?
   JR: yes; consumer must be prepared for this anyway
   - bigger issue is how is such an interface to be architected
   - issue is that this is outside the verbs
   - how is it conveyed that the device is ready to send?
   FN: if RTS comes on too early, and still the first DDP segment has not been received, that would not make any sense (IMO)
   JR: RTS is clearly defined - the transmitter can transmit. Doing that too early would be problematic
   FN: agree it is a slippery slope to assume that MPA will take care of it
   - it doesn't hurt at all to have this check in the ITAPI implementation
   - do not expect check to be difficult to do
   JR: imposes a programming model on the consumer - until they do a send, remote will not go into a connected state; is this a problem?
   - IMO not a problem
   FN: will consult with others at IBM

4) MM Requirements

   See email from Matt Pearson, subject "RMR (Un)linking", sent 8/24/04 9:51AM PT

   MP: MM section 9 is about window binding, sec 11 is about unbinding memory windows
   - FN and MP have not edited lately; believe them to be done
   - would like to put to vote next week
   - will review today in prep for vote

   D requirements are detailed requirements
   - D1, D2 are definitions

   Question: what to do about cases where we want to add arguments to existing API calls
   - done in D3.5 as suffix of "2" to the existing call name
   - expectation is that calls would have the same man page, but a suffix of 2 for the new version with more arguments
   - could also change the definition of the call (to add more arguments), but that would impact existing applications
   - note that we have agreed to change the major number of the API, so we are allowed to break backward compatibility
   - FN: should be possible for an application to use more than one major number of the API
   - JR: burden is on the implementation provider to have 2 libraries, e.g.
   JR: Suggest that we make a list of all the changes that will / could impact users of the v1 API
   - when we see the complete list, it may provide clarity on the reasons for changing the major number and the impact of making this type of change
   MP: For votable version, will leave wording as "interface will be enhanced, such as by adding a 2 suffix" but not ruling out changing the number of arguments when this decision is made

   D3.1
   - Default allows return of either narrow or wide
   - immediate error if you select one the imp does not support

   D3.5 (remark)
   Preserving the existing meaning of RMR Create doesn't work on iWARP
   - RMR Create implicitly creates wide windows which are not supported by iWARP
   MP: Whenever you try to bind an RMR that is already bound, you will get an error
   JR: if you create a narrow RMR it is bound?
   MP: No
   - if you create a narrow and bind it at one place and then bind it somewhere else, you don't need a special error for that
   - any re-bind will generate an error
   - for iWARP, if you link it, you MUST unlink it before you relink or you will get an error
   - possibly not on IB
   - API should encourage/mandate unlink-before-relink for source portability.

   D4 - Binding rules / changes to bind function

   D4.1:
   - address base for RMR
   - can bind an RMR saying you want to use 0-based addressing (offset rather than VA)
   - can write to the 100th byte of the RMR by selecting 0-based addressing
   - old method would be RKey and a non-zero VA
   - remote side would need to do pointer arithmetic to write to offset 100
   - Each time RKey is rebound, new base VA must be sent to remote host so it can do its pointer arithmetic properly, e.g. +100.
   - when bind, can now say "ADDRBASE_VADDR" or "_ZERO"
   - ZERO will only work on narrow RMRs
   - FN: what about implementations that may not have this feature?
   - MP: Do we need to have a queryable attribute to set this up?
   - MP: sure that _VADDR will work everywhere; not sure that _ZERO will work on narrow RMR

   AI: Add MM-6.1.D1 to the votable list
       Determine if _ZERO will always work on a narrow RMR, specifically w.r.t. IB VE
       OWNER: MP

   JR: IT RMR LINK is very similar to it_rmr_bind
   - MP: could have been bind, too, but thought link had a better sound to it
   - link/unlink works for both types of RMRs
   - terminology also works for LMRs in future
   - works for the priv mode stuff that will go out to the reflector soon

   Discussed need to have votable on naming conventions ("link" vs "bind")

   FN: lmr_create and lmr_create2:
   - since new args and features, must make default assumptions about these args if they are missing
   - how would the old it_api function work if the information is missing?
   - defaults need to be set
   - MP: D3.5 does tell you what you need to do for it_rmr_create
   - MM subgroup should make sure that defaults make sense
   - JR: assuming that it_rmr_create would be replaced with a new version with more arguments (using default
   MP: agrees that adding... keeping the functions and arguments the same and adding new functions with very similar names that have more arguments has the potential to be quite confusing to consumers
   - if it_rmr_link and it_rmr_bind do the same thing with different arguments, that may be too much of a sacrifice for backward compatibility

   Implementation options:
   a) present 1.0 bindings and 2.0 bindings
   b) only present 2.0 bindings
   c) change the names

   JR: reiterates call for justification for changing the major number
   - we have requirements that have forced us to change the major number, but don't recall what they are
   MP: can come up with a list of the ways that src code will need to change that would / could change source code compatibility
   - could do for all calls (not necessarily down to fields, but list of which calls are going to be done)
   FN: summary
   a) two sets of bindings - good idea to create a list with all the differences that we have for v1 calls
   b) generate an appendix in the manual describing how to implement issue one calls with issue 2 calls
      - doing this would provide clear information for consumers about how to generate equiv behavior with the new calls
   JR: suggestion to generate a proposal on this and discuss on the reflector
   - would like to have broader review of this issue

   D4.2:
   - broad address on wide RMR

   D4.3

   AI: Massage text on D4.3 - Change "undefined" to "unsupported"
       OWNER: MP

   D5 - adding flags to param_mask_t
   - type, addr_base, unlinkable
   - type - wide or narrow RMR
   - addr_base - were you bound with zero based or va based addressing
   - unlinkable hasn't been fully fleshed out yet, but tells if MW can be locally or remotely invalidated
   - holdover from remote validation issues in MM20; not that far in first phase, but it will need to be there
   FN: Why would RMR not be invalidatable?
   - MP: there is an it_unlikable_t in the LMRs
   - LMRs can be not remotely or locally unlinkable
   - FN: clear for LMRs; as in IB
   - MP: true, for RMRs it will never be 0
   - Why not remove D5.1.3 / D5.2.3?
   - MP: some RMRs that are remotely invalidatable, but not all
   - See MM 20.1

   AI: Put MM 20.1 on the ballot
       OWNER: MP

   FN: If QP coming in from a peer matches the QP of an RMR, and the QP has... is it still possible that the RMR is not invalidatable?
   - MP: Wide RMR in IB cannot be invalidated
   - will generate a protection error
   - if you query that type of MW and look at unlinkable flag, it will say can only be invalidated locally
   - on iWARP, all RMRs would say IT_INVAL_REMOTE
   - on IB, some would say INVAL_LOCAL, some INVAL_REMOTE
   - could tell based on what type of mem window you have and what transport you are on, so possible to get this data without interface, but since we have for LMRs, makes sense to include similar interface for RMRs
   - MP: will delete 0 from possible values in D5.2.3; will be either REMOTE or LOCAL, never 0

   AI: Remove 0 from possible values for D5.2.3
       OWNER: MP

   D7 - no disc

   D8 - if 'bound' is false, you know addrbase is invalid

   D9 remark
   - preview for local fence flag
   - ability to stall pipeline on both sides to make sure bind succeeds
   - very strict ordering
   - not quite ready for first drop
   - will not be voting on 13.0, so remove reference in remark

   AI: Remove remark in D9
       OWNER: MP

   9.1

   AI: Edit 9.1.D1 and 9.2.D1 so that they correctly capture that the same structure element will be used to capture wide/narrow support
       OWNER: MP

   9.3.D1.4 and D1.5
   - MP: if you have an RMR that is bound, you can only unbind with a WRQ to the same EP it is bound to
   - you may get an incoming message to unbind when you have already unbound the RMR
   - verbs authors agree that unbinding a MR that is not bound should not be surfaced as an error (provided there is no PZ violation)

   If a narrow RMR is bound, you can only unbind it from that EP to which it was bound
   - if unbound, you can submit as many unbind requests to any EP in any PZ that you like - it is not an error
   - this is why 1.5 is wider in scope than 1.4
   - nice to add a positive statement saying what you can do, e.g.:

   AI: Add Remark at 1.5.1, along the lines of: "You can unbind an unbound narrow RMR as many times as you like as long as all the unbinds go to EPs that are in the same PZ as the RMR"
       OWNER: MP

   9.3.D1.6
   - can't bind mem windows to 0-based addressed MRs, ever

   9.3.D1.7
   - priv aren't part of initial ph2 deliverable
   - good to have this here anyway
   - huge security hole if this were done

   9.3.D1.8
   - already in IT API 1, but put here also as a reminder
   - FN: what is tech reason for this requirement?
   - if LMR does not allow remote write, you can still enable remote write for RMR that is bound to that RMR - this makes more sense to me
   - what is technical reason for this one?
   - MP: If you have bound an LMR with local read permissions then the OS (or MMU or someone) might refuse to give the HBA or RNIC write permission to that block of memory
   - "I allow the HBA to write here"
   - if you disallow local write, there may be no way for IA to write on behalf of an incoming write message (from an OS memory management perspective)
   - JR: doesn't this also map onto the IB verbs?
   - MP: yes, in both specs

   MP: This completes the current list of completion errors, with the caveat that it would be an impressive feat at this point to have the list 100% complete. Others may be added as the detailed requirements process continues.

   9.3.D2
   - MM-9.3.D2 The API shall NOT return any error if the Consumer attempts to bind a Narrow RMR with RDMA permissions that exceed those of the target EP.
   - permissions checked not at bind time, but at access time
   - at bind time, check x and y in sync
   - at access time, check y and z in sync
   - JR: should there be text advising consumers not to do this?
   - MP: consumer should be warned: if they do as described in 9.3.D2 and do not change the permissions of the EP target where they just did the bind, then the first incoming request that violates the EP's permissions will break the connection (Should be captured as advice to the consumer in the man page, not as a requirement)

   AI: Update the [presumably it_rmr_bind()] man page as follows: Consumer should be warned that if they do as described in 9.3.D2 (attempt to bind a Narrow RMR with RDMA permissions that exceed those of the target EP) and do not change the permissions of the EP target where they just did the bind, then the first incoming request that violates the EP's permissions will break the connection (Should be captured as advice to the consumer in the man page, not as a requirement)
       OWNER: MP

   MM-9.5 The API shall only allow a remote Consumer to access a Narrow RMR if the EP to which the RDMA operation was directed matches the EP ID of the EP to which the Narrow RMR was last bound.
   - wording is somewhat confusing, but clarified by 9.5.D1 through 9.5.D1.3
   - plan is to not expose "EP ID" anywhere in spec

   MM-9.6 The API shall only allow a remote Consumer to access any RMR if the PZ of the EP to which the RDMA operation was directed matches the PZ of the EP to which the RMR was last bound. [Addressed by MM-9.5.D1.1.]
   - HW is not required to behave this way
   - our API will behave this way even if hardware does not
   FN: For narrow RMRs, obvious that 9.6 will hold
   MP: already have this for wide RMRs
   - should not require any additional text in the man page to explain this

   MM-9.7 The API shall constrain the binding of Narrow RMRs to Endpoints.

   Zipped through the rest without major discussion

   MP: Believes this is complete votable list, barring a few additional requirements from other discussions and the AIs captured above.

   Questions:

   1) What happens when a non-privileged user tries to call a privileged function?
   FN: if you attempt to fast register a QP that is not in priv mode you will get an error
   - user lib is not configured to be safe; HW will ensure that non-priv user does not try to access phys memory

   it_lmr_link is for fast registration
   - if non-priv user tries to use it_lmr_link should get an error
   - JR: well-behaved implementation should reject non-priv user from doing this
   - MP: expect we will need to expose the permissions error
   - unlikely as this case may be, some implementations may not prevent this case; we can't override the hardware
   - JR: no objections to doing so
   - FN: hardware helps us by providing a completion header
   - makes sense to pass the completion error to the consumer
   - other cases where high-level call is allowed only to priv consumers, but HW does not check in any way
   - JR: in those cases, e.g. for stage 1 of phase 2 w/o priv support, those calls should, IMO, be required in the imp. to be simple - check your priv and return an error
   - FN: can an imp necessarily easily check whether a call is priv or not?
   - JR: "Easily" is a good question - not efficiently, as priv check is a priv op
   - FN: hardware will not check if you can unlink LMRs
   - MP: no security risk to do so
   - FN: no, but it's not a good idea
   - MP: perhaps should consider allowing users to unlink LMRs
   - would create a "zombie" LMR that they would have to deal with (e.g. delete it)
   - JR: if there is no security hole, it seems unnecessary to impose a restriction on the implementation
   - you can shoot yourself in the foot, but you cannot expose resources that you should not be able to expose
   - MP: now, if WRQ - if you are non-priv consumer and using ITAPI 2 and call priv WRQ generation function you will get priv error
   - already a hardware error that says this
   - DTO reqs priv, and you don't have it, you get priv error
   - other ones, you have to trap to kernel anyway - it can check and return an immediate error
   - concern that one might be able to get an immediate error or a delayed error or both
   - need to think about this more, tho

   Second part:
   - if you try to do something that requires priv, like it_lmr_create, depending on certain flags you require priv. to do that
   - it_lmr_create reqs priv anyway
   - if consumer is not priv, can return immediate error
   - WRQ request where there is not a context switch, then the HW will do priv check
   - error will be returned as an immediate error or as a completion, but never both
   - for calls that require a context switch anyway, it is easy to return an immediate error if there is a priv violation

   Proposal:
   - add text to [presumably it_lmr_create - not clear if this comment would also refer to other functions] man pages regarding "zombie" LMR that this is not a recommended thing to do

   AI: add text to [presumably it_lmr_create - not clear if this comment would also refer to other functions] man pages regarding "zombie" LMR that this is not a recommended thing to do
       OWNER: MP

5) Additional high-level MM issues

   See email from Matt Pearson, subject "Fwd: High level issues for today's concall", sent 8/31/04 2:36PM PT

   Discussion of excerpts from above email:

   > o support for lists of DTOs (requirement DTO-2.0):
   >   - Right now, we have many separate API calls for work requests,
   >     including it_post_send, it_post_rdma_write, it_lmr_link, etc.
   >   - Does this in any way limit ourselves in trying to fulfill DTO-2.0?
   >   - Are we planning to introduce a work request data structure with
   >     a work request type? How do we submit an array of DTOs?

   - should there be a universal DTO representation?
   - is this objective achievable with our current approach
   - JR: if you define a list of DTOs call, it would be a superset of existing calls
   - FN: would make current work request obsolete
   - FN: on the other hand, might be inefficient to do so - structure could be quite large to support all cases
   - FN: looks like there is nothing that precludes list of DTOs with current approach
   - JR: this was consciously not adopted in v1, even though it was an option then, but not aware of anything that precludes lists

   > o Interpretation of PBL mode (block list mode vs. page list mode).
   >   - Does block list mode include page list mode as a special case?
   >   - it_ia_create is lacking the PBL mode expected by "Open RNIC"

   FN: Initially thought that consumer has to open an RNIC in one mode and then cannot use in the other mode
   - on re-reading, seems like page-list mode is a subset of block-list mode
   - MP: interface to it is the same; if registering a page list, could do so in either page list or block list mode
   - if you open an RNIC in block list mode, will it be slower for page accesses than if you opened it in page list mode?
   - not sure, but added flexibility of block list mode may have a cost
   FN: Related issue - to open RNIC in this mode, have to pass this flag
   - in ia_create there is no such flag
   - should we allow passing an additional flag at create time?
   - MP: flag for this in new requirements

   AI: Review old minutes for discussion on this topic from pre-high level requirements discussions
       OWNER: FW

   [Note: I found the notes of this conversation (see 9/24/03 meeting minutes). It does not, as I had remembered, resolve that this is not an issue - it is a discussion that it is not clear how to expose the feature to consumers.
   The discussion notes that, depending on the vote to or not to expose this feature to consumers, how to expose the feature may not be an issue and explicitly delays the decision of whether and how to deal with it to the votables. Per the vote update of 5/24/04, the ITWG has voted to address this issue with specifics left to detailed requirements.]

Goal to release the 9.0 and 11.0 Memory Management sections this week as votables for a vote on 9/14/04
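
[Note: the D4.1 discussion under item 4 contrasts VA-based and 0-based addressing for a bound RMR. The following is a minimal sketch of the pointer arithmetic the remote side would do in each case. It was not presented at the meeting; the enum, its values, and the function name are hypothetical and are not IT-API definitions.]

```c
#include <stdint.h>

/* Hypothetical address-base selector, modeled on the "ADDRBASE_VADDR"
 * and "_ZERO" names mentioned in the D4.1 notes. Not an IT-API type. */
typedef enum {
    ADDRBASE_VADDR, /* remote addresses are virtual addresses */
    ADDRBASE_ZERO   /* remote addresses are offsets from the window start */
} addr_base_t;

/* Return the target address a remote peer would place in an RDMA request
 * to reach byte `offset` of a bound window.
 *
 * window_base_va is the base VA of the current binding. It matters only
 * for ADDRBASE_VADDR, which is why a rebind forces the new base VA to be
 * sent to the remote host in that mode. */
static uint64_t rdma_target_addr(addr_base_t base,
                                 uint64_t window_base_va,
                                 uint64_t offset)
{
    if (base == ADDRBASE_ZERO) {
        /* 0-based: the offset is the address; no arithmetic needed. */
        return offset;
    }
    /* VA-based: the remote side adds the offset to the advertised base. */
    return window_base_va + offset;
}
```

With 0-based addressing, a write to offset 100 uses target address 100 regardless of where the window is bound; with VA-based addressing the remote side must know the current base VA and add 100 to it.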