-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix found issues of comm replay #180
fix found issues of comm replay #180
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HI Sanshan,
Thanks for implementing this P2P feature. Overall looks good. Please check my inline comments.
Also please provide a couple ET for P2P cases. We will put them into our integration tests.
Regards
Sheng
@@ -144,8 +144,10 @@ def __init__(self) -> None: | |||
"reduce_scatter_base": self.reduce_scatter_base, # pyre-ignore[16]: | |||
"scatter": self.scatter, # pyre-ignore[16]: | |||
"barrier": self.barrier, | |||
"incast": self.incast, # pyre-ignore[16]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove the code since it is deprecated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
et_replay/comm/commsTraceParser.py
Outdated
@@ -122,10 +122,12 @@ def _parse_comms_op_node( # noqa: C901 | |||
comm_nodes = ( | |||
node for node in in_trace.nodes.values() if node.name == "record_param_comms" | |||
) | |||
is_seq = lambda x: isinstance(x, list) and len(x) == 2 and isinstance(x[0], int) and isinstance(x[1], bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can is_seq be renamed to is_seq_id? It makes easier to understand the code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
et_replay/comm/commsTraceParser.py
Outdated
for node in comm_nodes: | ||
# according to macro RECORD_PARAM_COMMS and RECORD_PARAM_COMMS_DATA in torch/csrc/distributed/c10d/ParamCommsUtils.hpp | ||
# ["wait", "barrier", "init"] record 1st element as seq, others record starting from input tensor | ||
index_base = 0 if isinstance(node.inputs[0], int) else 1 | ||
# ["wait", "barrier", "init"] record 1st element as seq, whose 1st element is sequence number as int, 2nd element is isP2P as bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
["wait", "barrier", "init"] record 1st element as seq, whose 1st element is sequence number as int, 2nd element is isP2P as bool -->
["wait", "barrier", "init"] record 1st element as seq_id, whose 1st element is an integer for sequence number, and 2nd element is a bool for isP2P
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -1435,12 +1435,9 @@ def readArgs(self, parser: ArgumentParser) -> argparse.Namespace: | |||
help="The backend to be used in PyTorch distributed process group", | |||
) # backend used for the network stack | |||
parser.add_argument( | |||
"--z", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have "--b" as a short version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
# ExecutionTraceObserver dump records from cpp level, which are always async send/recv. | ||
# Then replay in torch.distributed API level, we should use isend/irecv. | ||
self.collectiveFunc["send"] = self.isend |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since there is no isend and irecv, can we also change self.isend ->self.send, self.irecv -> self.recv
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. Because
- "send" in trace is in c++ context, whose semantics is always async send
self.isend
function here is in python context, wheresend
represent sync send, butisend
represent async send
So, I think we should keep the name convention here, as I explained in the comment.
@@ -474,6 +474,7 @@ def all_gather_base(self, collectiveArgs, retFlag=False, pair=False): | |||
if retFlag: | |||
return retObj | |||
|
|||
''' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's clean up the dead code instead of commenting it out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -527,6 +530,7 @@ def multicast(self, collectiveArgs): | |||
self.irecv(collectiveArgs, collectiveArgs.srcOrDst) | |||
else: | |||
self.recv(collectiveArgs, collectiveArgs.srcOrDst) | |||
''' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove send and rename isend -> send. The same for recv.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done for removing.
29b5a46
to
094a2ab
Compare
@shengfukevin |
094a2ab
to
b682ffb
Compare
Summary: fix replay to send/recv fix blocking option fix process group update of barrier Comment out unused function incast and multicast remove function extractCommsInfo() for deprecatd basic trace fix wait comm op for both collective and p2p Test Plan: comm_replay --trace-type et --trace-path /home/sanshang/lustre_storage/000_code/chakra/comm_replay_eval_06/training/et_iter_50_51 --num-replays 10 ## additional notes This is copied from #180. Differential Revision: D64312970 Pulled By: shengfukevin
@TaekyungHeo, do you want to address the linter errors? |
Summary: fix replay to send/recv fix blocking option fix process group update of barrier Comment out unused function incast and multicast remove function extractCommsInfo() for deprecatd basic trace fix wait comm op for both collective and p2p Test Plan: comm_replay --trace-type et --trace-path /home/sanshang/lustre_storage/000_code/chakra/comm_replay_eval_06/training/et_iter_50_51 --num-replays 10 ## additional notes This is copied from #180. Differential Revision: D64312970 Pulled By: shengfukevin
Summary: fix replay to send/recv fix blocking option fix process group update of barrier Comment out unused function incast and multicast remove function extractCommsInfo() for deprecatd basic trace fix wait comm op for both collective and p2p Pull Request resolved: #182 Test Plan: comm_replay --trace-type et --trace-path /home/sanshang/lustre_storage/000_code/chakra/comm_replay_eval_06/training/et_iter_50_51 --num-replays 10 ## additional notes This is copied from #180. Reviewed By: briancoutinho Differential Revision: D64312970 Pulled By: shengfukevin fbshipit-source-id: 42c3657f3ba789a29a8a25db4fdfa4f54a394c27
the duplicated PR has been merged aleady. |
Depend On
Summary
TestPlan
comm_replay --trace-type et --trace-path /home/sanshang/lustre_storage/000_code/chakra/comm_replay_eval_06/training/et_iter_50_51 --num-replays 10