comm=ofi completion #12264

gbtitus · 2019-02-06T18:34:44Z

The ofi comm layer "implements" the whole comm interface in the sense that all the functions are present and any Chapel program will link. But a number of the functions, although functionally correct, don't behave exactly as their names or the interface imply. For example, chpl_comm_get_nb() actually does a blocking GET, not a nonblocking one. Here we list what still needs doing.

gbtitus · 2019-02-26T18:18:12Z

I just sent a question to libfabric-users@lists.openfabrics.org asking about libfabric and processor/network atomic operation non-coherence.

gbtitus · 2019-02-27T18:38:30Z

We have an authoritative response on atomic coherence. Basically, see the paragraph on visibility in the fi_atomic(3) man page (which is going to be updated for clarity, also). For the Chapel comm layer, which does not request target-side completions, atomic results are guaranteed to be visible when the associated completion is seen on the initiating side. Achieving coherence for a given location, when multiple target-side actors (NIC and CPU, or multiple NICs) are performing AMOs on that location, requires that all ops done by one actor must be visible (completions seen) before any ops are initiated by another actor.
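To make the rule concrete, here is a minimal sketch in C of per-location bookkeeping that would enforce it. This is only an illustration, not the comm layer's code: the names `amo_loc_state_t`, `amo_may_initiate`, `amo_initiate`, and `amo_complete` are hypothetical, and it assumes a single thread updates the state for a given location.

```c
#include <stdbool.h>

// Hypothetical: who is performing AMOs on a location at the target side.
typedef enum { ACTOR_NONE, ACTOR_NIC, ACTOR_CPU } actor_t;

// Hypothetical per-location state: which actor has ops in flight, and
// how many of those ops have not yet had their completions seen.
typedef struct {
  actor_t owner;
  unsigned in_flight;
} amo_loc_state_t;

// An actor may initiate an AMO only if no other actor has ops whose
// completions are still outstanding.
static bool amo_may_initiate(const amo_loc_state_t* s, actor_t a) {
  return s->in_flight == 0 || s->owner == a;
}

// Record an initiation; returns false if the rule forbids it right now.
static bool amo_initiate(amo_loc_state_t* s, actor_t a) {
  if (!amo_may_initiate(s, a)) return false;
  s->owner = a;
  s->in_flight++;
  return true;
}

// Record that one completion has been seen on the initiating side.
static void amo_complete(amo_loc_state_t* s) {
  if (s->in_flight > 0 && --s->in_flight == 0) s->owner = ACTOR_NONE;
}
```

With this bookkeeping, a CPU-side AMO is refused while any NIC-side AMO on the same location is still awaiting its completion, and is allowed again once the last completion has been recorded.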

An implication of this rule is that if the provider says it can do network AMOs for at least one kind of operation on a given location (non-fetching, fetching, or comparison) but cannot do them for at least one other kind, then the comm layer needs to limit itself to non-network operations for all ops on that location. An easier way to put this: when we're deciding whether to do a network AMO or a processor AMO via AM, we have to choose the latter unless the provider says it can do all operations of concern to us using the network. Otherwise we can end up with one initiator doing one kind of AMO and another initiator doing a different kind of AMO, to the same location, using different techniques (network or processor-via-AM) but not synchronizing, thus breaking the coherence protocol.
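The "easier version" of the rule can be sketched as an all-or-nothing check. This is a hedged illustration in C, not the actual comm layer logic: `amo_net_caps_t`, `amo_method_t`, and `amo_choose_method` are hypothetical names. In a real implementation the three flags would presumably be filled in from the provider's answers, for example via libfabric's `fi_atomicvalid()`, `fi_fetch_atomicvalid()`, and `fi_compare_atomicvalid()` queries for every datatype and op of concern; here they are plain booleans.

```c
#include <stdbool.h>

// Hypothetical summary of what the provider says it can do over the
// network, one flag per kind of AMO discussed above.
typedef struct {
  bool nonfetching;  // e.g. atomic add with no result returned
  bool fetching;     // e.g. fetch-and-add
  bool comparison;   // e.g. compare-and-swap
} amo_net_caps_t;

typedef enum {
  AMO_VIA_NETWORK,  // native network AMO
  AMO_VIA_AM        // processor AMO done by an active message handler
} amo_method_t;

// Use the network only if every kind of AMO we care about is supported.
// Any gap forces all AMOs through AMs, so that two initiators can never
// mix techniques on the same location.
static amo_method_t amo_choose_method(const amo_net_caps_t* caps) {
  bool all_supported =
      caps->nonfetching && caps->fetching && caps->comparison;
  return all_supported ? AMO_VIA_NETWORK : AMO_VIA_AM;
}
```

Making the choice once for all operation kinds, rather than per operation, is what keeps the method consistent across initiators without any extra synchronization.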

gbtitus · 2022-04-26T21:19:02Z

I think one could make the argument that comm=ofi is "complete" in the sense that it passes testing on several different network architectures with different providers. Not everything in the description block has been done, but much of what's not done yet can be viewed as functionally good enough, or else a performance improvement. So, I'm removing myself as the assignee but leaving this open for the group to make the decision.

gbtitus self-assigned this Feb 6, 2019

gbtitus added the area: Runtime, type: Performance, type: Portability, type: Unimplemented Feature, and Epic labels Feb 6, 2019

gbtitus mentioned this issue Feb 7, 2019

Nearly final comm=ofi full testing updates. #12275

Merged

gbtitus removed their assignment Apr 26, 2022
