
fix found issues of comm replay #180

Closed

Conversation

TaekyungHeo
Contributor

@TaekyungHeo TaekyungHeo commented Sep 16, 2024

Depends On

  1. [c10d] fix sequence numbers for coalesced operations #135132
  2. fix sequence number for group #134578

Summary

  1. Fix replay of send/recv
  2. Fix the blocking option
  3. Fix the process group update of barrier
  4. Comment out the unused functions incast and multicast
  5. Remove the function extractCommsInfo() for the deprecated basic trace
  6. Fix the wait comm op for both collective and p2p

Test Plan

comm_replay --trace-type et --trace-path /home/sanshang/lustre_storage/000_code/chakra/comm_replay_eval_06/training/et_iter_50_51 --num-replays 10

@facebook-github-bot facebook-github-bot added the CLA Signed label on Sep 16, 2024
@TaekyungHeo TaekyungHeo marked this pull request as ready for review September 17, 2024 02:12
@TaekyungHeo TaekyungHeo changed the title from "Draft: fix found issues of comm replay" to "fix found issues of comm replay" on Sep 18, 2024
Contributor

@shengfukevin shengfukevin left a comment


Hi Sanshang,

Thanks for implementing this P2P feature. Overall looks good. Please check my inline comments.

Also, please provide a couple of ETs for P2P cases. We will put them into our integration tests.

Regards
Sheng

@@ -144,8 +144,10 @@ def __init__(self) -> None:
"reduce_scatter_base": self.reduce_scatter_base, # pyre-ignore[16]:
"scatter": self.scatter, # pyre-ignore[16]:
"barrier": self.barrier,
"incast": self.incast, # pyre-ignore[16]:
Contributor

Please remove the code since it is deprecated.


Done

@@ -122,10 +122,12 @@ def _parse_comms_op_node( # noqa: C901
comm_nodes = (
node for node in in_trace.nodes.values() if node.name == "record_param_comms"
)
is_seq = lambda x: isinstance(x, list) and len(x) == 2 and isinstance(x[0], int) and isinstance(x[1], bool)
Contributor

Can is_seq be renamed to is_seq_id? It makes the code easier to understand.


Done.

for node in comm_nodes:
# according to macro RECORD_PARAM_COMMS and RECORD_PARAM_COMMS_DATA in torch/csrc/distributed/c10d/ParamCommsUtils.hpp
# ["wait", "barrier", "init"] record 1st element as seq, others record starting from input tensor
index_base = 0 if isinstance(node.inputs[0], int) else 1
# ["wait", "barrier", "init"] record 1st element as seq, whose 1st element is sequence number as int, 2nd element is isP2P as bool
Contributor

["wait", "barrier", "init"] record 1st element as seq, whose 1st element is sequence number as int, 2nd element is isP2P as bool -->

["wait", "barrier", "init"] record 1st element as seq_id, whose 1st element is an integer for sequence number, and 2nd element is a bool for isP2P


Done
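For readers skimming the thread, here is a minimal self-contained sketch of the predicate after the suggested rename, matching the `[sequence_number, isP2P]` pair described in the diff above; the surrounding variable names are illustrative, not copied from the PR:

```python
from typing import Any, List


def is_seq_id(x: Any) -> bool:
    """True if x looks like the [sequence_number, isP2P] pair recorded by c10d."""
    return (
        isinstance(x, list)
        and len(x) == 2
        and isinstance(x[0], int)   # sequence number
        and isinstance(x[1], bool)  # isP2P flag
    )


# "wait", "barrier", and "init" ops record this pair as their first input;
# other ops start with tensor arguments, hence the index_base adjustment above.
example_inputs: List[Any] = [[42, True], "tensor_arg_placeholder"]
if is_seq_id(example_inputs[0]):
    seq, is_p2p = example_inputs[0]
    print(f"seq={seq}, isP2P={is_p2p}")
```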

@@ -1435,12 +1435,9 @@ def readArgs(self, parser: ArgumentParser) -> argparse.Namespace:
help="The backend to be used in PyTorch distributed process group",
) # backend used for the network stack
parser.add_argument(
"--z",
Contributor

Can we have "--b" as a short version?


Done
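As background for the flag rename, argparse lets one argument register several option strings, so a short "--b" alias can sit next to a longer flag. A minimal sketch; the long flag name, dest, and help text below are hypothetical and not taken from the PR:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--b",
    "--blocking",  # hypothetical long form; the PR's actual flag may differ
    dest="blocking",
    action="store_true",
    help="run collectives in blocking mode (illustrative only)",
)

args = parser.parse_args(["--b"])
print(args.blocking)  # True
```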


# ExecutionTraceObserver dumps records at the cpp level, which are always async send/recv.
# When replaying at the torch.distributed API level, we should therefore use isend/irecv.
self.collectiveFunc["send"] = self.isend
Contributor

Since there is no isend and irecv, can we also change self.isend -> self.send and self.irecv -> self.recv?

@GSSBMW GSSBMW Oct 11, 2024

I don't think so, because:

  1. "send" in the trace is in the c++ context, whose semantics are always an async send
  2. the self.isend function here is in the python context, where send represents a sync send but isend represents an async send

So I think we should keep the naming convention here, as I explained in the comment.
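To make the sync/async distinction above concrete, here is a minimal sketch of the two python-level APIs, not the replay code itself; the initialized process group, tensor, and peer rank are assumed:

```python
import torch
import torch.distributed as dist


def replay_send_async(tensor: torch.Tensor, dst: int):
    # Mirrors the always-async c++-level "send" in the trace: isend returns
    # immediately with a Work handle that can be completed later.
    work = dist.isend(tensor, dst)
    return work  # the corresponding wait op would call work.wait()


def replay_send_blocking(tensor: torch.Tensor, dst: int) -> None:
    # dist.send is the blocking python-level variant and returns no Work
    # handle, so it cannot represent the traced async send faithfully.
    dist.send(tensor, dst)
```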

@@ -474,6 +474,7 @@ def all_gather_base(self, collectiveArgs, retFlag=False, pair=False):
if retFlag:
return retObj

'''
Contributor

Let's clean up the dead code instead of commenting it out


Done

@@ -527,6 +530,7 @@ def multicast(self, collectiveArgs):
self.irecv(collectiveArgs, collectiveArgs.srcOrDst)
else:
self.recv(collectiveArgs, collectiveArgs.srcOrDst)
'''
Contributor

Remove send and rename isend -> send. The same for recv.


Done for removing.

@TaekyungHeo TaekyungHeo force-pushed the sanshang/fix_comm_replay branch from 29b5a46 to 094a2ab on October 11, 2024 10:09
@GSSBMW

GSSBMW commented Oct 11, 2024

@shengfukevin
I have tried to resolve all your comments. Please help review, thanks!
For the ET of P2P, does the pytorch test suite have such cases to generate them? Otherwise I need to write a test to generate such traces.

@TaekyungHeo TaekyungHeo force-pushed the sanshang/fix_comm_replay branch from 094a2ab to b682ffb on October 14, 2024 11:31
facebook-github-bot pushed a commit that referenced this pull request Oct 14, 2024
Summary:
fix replay to send/recv
fix blocking option
fix process group update of barrier
Comment out unused function incast and multicast
remove function extractCommsInfo() for deprecated basic trace
fix wait comm op for both collective and p2p


Test Plan:
comm_replay --trace-type et --trace-path /home/sanshang/lustre_storage/000_code/chakra/comm_replay_eval_06/training/et_iter_50_51 --num-replays 10

## additional notes
This is copied from #180.

Differential Revision: D64312970

Pulled By: shengfukevin
@c-p-i-o
Copy link

c-p-i-o commented Oct 15, 2024

@TaekyungHeo, do you want to address the linter errors?
You can run `make lint` followed by `lintrunner -a`; that should hopefully take care of the linter errors for you.

facebook-github-bot pushed a commit that referenced this pull request Oct 16, 2024, with the same commit message and test plan as above.

facebook-github-bot pushed a commit that referenced this pull request Oct 18, 2024
Summary:
fix replay to send/recv
fix blocking option
fix process group update of barrier
Comment out unused function incast and multicast
remove function extractCommsInfo() for deprecated basic trace
fix wait comm op for both collective and p2p

Pull Request resolved: #182

Test Plan:
comm_replay --trace-type et --trace-path /home/sanshang/lustre_storage/000_code/chakra/comm_replay_eval_06/training/et_iter_50_51 --num-replays 10

## additional notes
This is copied from #180.

Reviewed By: briancoutinho

Differential Revision: D64312970

Pulled By: shengfukevin

fbshipit-source-id: 42c3657f3ba789a29a8a25db4fdfa4f54a394c27
@shengfukevin
Contributor

The duplicated PR has been merged already.
