Commit
et_replay refactoring: removed hack code for FBGEMM, introduced skip-node option, MAST launcher update (#181)

Summary:
Pull Request resolved: #181

This DIFF includes the following et_replay refactoring:

1. Cleaned up the hack code for FBGEMM. The current implementation for FBGEMM-related ops relied on guessing the parameters of the original FBGEMM module from its forward and backward function calls. Since FBGEMM keeps evolving, most of that code is outdated, and this is not a sustainable way to support these ops. The main reason these ops cannot be replayed is that they usually take index input tensors; if random data is used for these integer tensors, the replay usually runs into illegal memory accesses. To fix this, another DIFF (https://www.internalfb.com/diff/D62889784) is going to capture index tensors based on the user's selection, and in replay the index tensors are loaded for FBGEMM ops. This has been verified on the ICVR model: with the index tensor data, all FBGEMM ops can be replayed.
2. Introduced new options --skip-node-file and --update-skip-node-file. If --skip-node-file is given, the JSON file that defines the nodes to skip is loaded and those ops are skipped. --update-skip-node-file is a special run mode: it goes through all compute ops, and if an op fails to run, the skip-node file is updated to include the failed op.
3. The MAST launcher has been updated to create a new FBPKG for et_replay.
4. The DFS traverser that collects the ops has been simplified. If a node is an operator, the children of that node are ignored. The only exception is c10:: related ops, since record_param_comms is a child of a c10:: op and comm_replay only uses record_param_comms.
5. generate_io_tensor for CommsReplayManager in et_replay.py has been removed temporarily because it does not handle all collectives correctly when creating input/output tensors. The current version uses comms_replay to allocate the tensors. It can be put back when that function is ready.
6. Some other minor fixes, for example, using logger instead of print.

Reviewed By: briancoutinho
Differential Revision: D61055957
fbshipit-source-id: 34ca74b221b3525b4e2a81df59df60f8924253c5
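The skip-node workflow and the simplified traversal described in the summary can be illustrated with a small sketch. This is not the et_replay code: the `Node` class, `collect_ops`, `load_skip_nodes`, and `replay` are hypothetical names, and the skip-node file is assumed here to be a JSON list of node ids (the commit message does not spell out the actual schema). The sketch also assumes that a c10:: op is collected together with its children.

```python
import json
import os
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Node:
    """Hypothetical stand-in for an execution-trace node."""
    id: int
    name: str
    is_op: bool = False
    children: List["Node"] = field(default_factory=list)


def collect_ops(node: Node, out: List[Node]) -> List[Node]:
    """DFS that stops descending once it reaches an operator node."""
    if node.is_op:
        out.append(node)
        # Exception named in the summary: record_param_comms is a child of a
        # c10:: op, so keep walking below c10:: operators only.
        if not node.name.startswith("c10::"):
            return out
    for child in node.children:
        collect_ops(child, out)
    return out


def load_skip_nodes(path: str) -> Set[int]:
    """Load node ids to skip; assumed format is a JSON list of ids."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(json.load(f))


def replay(ops: List[Node], run_op: Callable[[Node], None],
           skip_path: str, update_skip_file: bool = False) -> List[int]:
    """Run ops, skipping listed nodes; in update mode, record failures."""
    skip = load_skip_nodes(skip_path)
    executed: List[int] = []
    for op in ops:
        if op.id in skip:
            continue
        try:
            run_op(op)
            executed.append(op.id)
        except Exception:
            if not update_skip_file:
                raise
            # Update mode: remember the failing op instead of aborting.
            skip.add(op.id)
    if update_skip_file:
        with open(skip_path, "w") as f:
            json.dump(sorted(skip), f)
    return executed
```

A first pass with `update_skip_file=True` would populate the file with the ops that fail, and later passes with the same file would skip them. Whether the real update mode merges with an existing file or starts from scratch is not stated in the summary; the sketch merges.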
1 parent e196340, commit 1ac7959
Showing 5 changed files with 483 additions and 436 deletions.