Introduce multi-node training setup #26
Conversation
Tested this on multi-GPU without any problems. Will test multi-node on our cluster as soon as I can get my hands on more than one node.
An update on my testing of this: the SLURM constants are also read correctly on our cluster, but I have yet to get multi-node training working. I think this is unrelated to this code, though, and more likely due to me not having the correct setup for running multi-node jobs on our cluster. I will ask around to see if I can get it working. In the meantime, @leifdenby (or anyone at DMI 😄), do you have a SLURM setup that you could test this on? I just think it's a good idea to test on multiple different clusters to make sure that this is general enough.
I have implemented the latest feedback, updated the CHANGELOG and added an example SLURM submission script to `docs/examples/`.
As discussed at the dev meeting just now, here is a SLURM submission script example. I realized I already put everything plus some documentation in this PR: docs/examples/submit_slurm_job.sh
#103 implements multi-node training, even more simply than this PR.
## Describe your changes

This PR adds support for multi-node GPU training using the SLURM job scheduler. The changes allow setting the number of nodes with the CLI argument `num_nodes`. It is also possible to select a subset of visible GPUs using the argument `devices` (only when not using SLURM). Replaces #26 with a simpler method based on advice from @sadamov. (A sketch of how these arguments might be declared follows this comment.)

## Type of change

- [ ] 🐛 Bug fix (non-breaking change that fixes an issue)
- [x] ✨ New feature (non-breaking change that adds functionality)
- [ ] 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] 📖 Documentation (addition or improvements to documentation)

## Checklist before requesting a review

- [x] My branch is up-to-date with the target branch - if not, update your fork with the changes from the target branch (use `pull` with the `--rebase` option if possible).
- [x] I have performed a self-review of my code
- [x] For any new/modified functions/classes I have added docstrings that clearly describe their purpose, expected inputs and returned values
- [x] I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
- [x] I have updated the [README](README.MD) to cover the introduced code changes
- [ ] I have added tests that prove my fix is effective or that my feature works
- [x] I have given the PR a name that clearly describes the change, written in imperative form ([context](https://www.gitkraken.com/learn/git/best-practices/git-commit-message#using-imperative-verb-form)).
- [x] I have requested a reviewer and an assignee (assignee is responsible for merging). This applies only if you have write access to the repo; otherwise feel free to tag a maintainer to add a reviewer and assignee.

## Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

- [x] the code is readable
- [ ] the code is well tested
- [x] the code is documented (including return types and parameters)
- [x] the code is easy to maintain

## Author checklist after completed review

- [ ] I have added a line to the CHANGELOG describing this change, in a section reflecting the type of change (add the section where missing):
  - *added*: when you have added new functionality
  - *changed*: when default behaviour of the code has been changed
  - *fixes*: when your contribution fixes a bug

## Checklist for assignee

- [ ] PR is up to date with the base branch
- [ ] the tests pass
- [ ] author has added an entry to the changelog (and designated the change as *added*, *changed* or *fixed*)
- Once the PR is ready to be merged, squash commits and merge the PR.

---------

Co-authored-by: Simon Kamuk Christiansen <skc@volta.dmi.dk>
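For illustration, here is a minimal sketch of how the two new CLI arguments (`num_nodes` and `devices`) could be declared with argparse. The flag spellings, defaults, help texts, and the script name in the usage comment are assumptions for this sketch, not the repository's actual code.

```python
# Illustrative only: how the new CLI arguments might be declared with argparse.
import argparse

parser = argparse.ArgumentParser(description="Train model (illustrative entry point)")
parser.add_argument(
    "--num_nodes",
    type=int,
    default=1,
    help="Number of nodes to use for multi-node training (default: 1)",
)
parser.add_argument(
    "--devices",
    nargs="+",
    type=int,
    default=None,
    help="Subset of visible GPU indices to use; only applies when not running under SLURM",
)
args = parser.parse_args()

# Example invocation (script name is hypothetical):
#   python train.py --num_nodes 2 --devices 0 1
```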
Enable multi-node GPU training with SLURM
This PR adds support for multi-node GPU training using the SLURM job scheduler. The changes allow the code to detect if it is running within a SLURM job and automatically configure the number of devices and nodes based on the SLURM environment variables.
Key changes
use_distributed_sampler
toTrue
when not in evaluation mode to enable distributed trainingSLURM_JOB_ID
environment variabledevices
) based on theSLURM_GPUS_PER_NODE
environment variable, falling back totorch.cuda.device_count()
if not setnum_nodes
) based on theSLURM_JOB_NUM_NODES
environment variable, defaulting to 1 if not setRationale for using SLURM
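To make the list above concrete, here is a minimal sketch of how the device and node counts could be derived from the SLURM environment variables and handed to a PyTorch Lightning `Trainer`. The helper name `resolve_devices_and_nodes` and the surrounding `Trainer` call are illustrative assumptions, not the repository's actual implementation.

```python
# Hypothetical sketch: derive device/node counts from SLURM environment
# variables, falling back to local GPU detection outside of SLURM.
import os

import pytorch_lightning as pl
import torch


def resolve_devices_and_nodes():
    """Return (devices, num_nodes) for the Trainer (illustrative only)."""
    if "SLURM_JOB_ID" in os.environ:
        # Running inside a SLURM job: trust the scheduler's allocation.
        # Assumes SLURM_GPUS_PER_NODE holds a plain integer count.
        devices = int(os.environ.get("SLURM_GPUS_PER_NODE", torch.cuda.device_count()))
        num_nodes = int(os.environ.get("SLURM_JOB_NUM_NODES", 1))
    else:
        # Not under SLURM: use all locally visible GPUs on a single node.
        devices = torch.cuda.device_count()
        num_nodes = 1
    return devices, num_nodes


devices, num_nodes = resolve_devices_and_nodes()
trainer = pl.Trainer(
    accelerator="gpu" if torch.cuda.device_count() > 0 else "cpu",
    devices=devices if devices > 0 else 1,
    num_nodes=num_nodes,
    use_distributed_sampler=True,  # enabled when not in evaluation mode
)
```

Checking for `SLURM_JOB_ID` is a cheap way to tell whether the process was launched by the scheduler, since SLURM sets it for every job allocation.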
Rationale for using SLURM
SLURM (Simple Linux Utility for Resource Management) is a widely used job scheduler and resource manager for high-performance computing (HPC) clusters. It provides a convenient way to allocate and manage resources, including GPUs, across multiple nodes in a cluster.
By leveraging SLURM, we can easily scale our training to utilize multiple GPUs across multiple nodes without the need for manual configuration.