
Introduce multi-node training setup #26

Closed
wants to merge 5 commits

Conversation

@sadamov (Collaborator) commented May 4, 2024

Enable multi-node GPU training with SLURM

This PR adds support for multi-node GPU training using the SLURM job scheduler. The changes allow the code to detect if it is running within a SLURM job and automatically configure the number of devices and nodes based on the SLURM environment variables.

Key changes

  • Set use_distributed_sampler to True when not in evaluation mode to enable distributed training
  • Detect if running within a SLURM job by checking for the SLURM_JOB_ID environment variable
  • If running with SLURM:
    • Set the number of devices per node (devices) based on the SLURM_GPUS_PER_NODE environment variable, falling back to torch.cuda.device_count() if not set
    • Set the total number of nodes (num_nodes) based on the SLURM_JOB_NUM_NODES environment variable, defaulting to 1 if not set (see the sketch after this list)
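
A minimal sketch of this detection logic, assuming a PyTorch Lightning `Trainer`; the variable names (e.g. `eval_mode`) are illustrative and not necessarily the exact identifiers used in this PR:

```python
import os

import pytorch_lightning as pl
import torch

eval_mode = False  # illustrative flag: True when evaluating rather than training

if "SLURM_JOB_ID" in os.environ:
    # Inside a SLURM job: read the per-node GPU count and the node count
    # from the SLURM environment, with the fallbacks described above.
    devices = int(os.environ.get("SLURM_GPUS_PER_NODE", torch.cuda.device_count()))
    num_nodes = int(os.environ.get("SLURM_JOB_NUM_NODES", "1"))
else:
    # Not under SLURM: use whatever GPUs are visible on this single node.
    devices = torch.cuda.device_count() or 1
    num_nodes = 1

trainer = pl.Trainer(
    devices=devices,
    num_nodes=num_nodes,
    # Distributed sampler only when training, as described above.
    use_distributed_sampler=not eval_mode,
)
```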

Rationale for using SLURM

SLURM (Simple Linux Utility for Resource Management) is a widely used job scheduler and resource manager for high-performance computing (HPC) clusters. It provides a convenient way to allocate and manage resources, including GPUs, across multiple nodes in a cluster.

By leveraging SLURM, we can easily scale our training to utilize multiple GPUs across multiple nodes without the need for manual configuration.

@sadamov requested a review from joeloskarsson May 4, 2024 13:44
@sadamov added the enhancement label May 4, 2024
@sadamov requested a review from leifdenby May 14, 2024 05:32
@joeloskarsson (Collaborator) left a comment

Tested this on multi-GPU without any problems. Will test multi-node on our cluster as soon as I can get my hands on more than one node.

@leifdenby changed the title from "Introduces multi-node training setup" to "Introduce multi-node training setup" May 30, 2024
@joeloskarsson (Collaborator) commented

An update on my testing of this: the SLURM environment variables are read correctly on our cluster as well, but I have not yet been able to get multi-node training working. I think this is unrelated to this code, however, and more likely due to me not having the correct setup for running multi-node jobs on our cluster. Will ask around to see if I can get it working.

In the meantime, @leifdenby (or anyone at DMI 😄), do you have a slurm setup that you could test this on? I just think it's a good idea to test on multiple different clusters to make sure that this is general enough.

@sadamov (Collaborator, Author) commented Jun 7, 2024

I have implemented the latest feedback, updated the CHANGELOG, and added an example SLURM submission script to /docs/examples (is that a good location?), as discussed with @leifdenby. A new small section was added to the README.md.
@joeloskarsson yes, every cluster is different and I also have to adapt my submission scripts after major changes. Do you have ticket support with your HPC provider? They usually know what to do...

@sadamov (Collaborator, Author) commented Dec 16, 2024

As discussed at the dev meeting just now, here is an example SLURM submission script. I realized I had already put everything, plus some documentation, into this PR: docs/examples/submit_slurm_job.sh

@SimonKamuk mentioned this pull request Jan 22, 2025
@sadamov (Collaborator, Author) commented Jan 22, 2025

#103 implements multi-node training in an even simpler way than this PR.

@sadamov closed this Jan 22, 2025
SimonKamuk added a commit that referenced this pull request Jan 23, 2025
## Describe your changes

This PR adds support for multi-node GPU training using the SLURM job
scheduler. The changes allow setting the number of nodes with the CLI
argument `num_nodes`. It is also possible to select a subset of visible
GPUs using the argument `devices` (only when not using SLURM).
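
A rough sketch of the argument handling described above; this is a minimal illustration, and the exact parsing code in the PR may differ:

```python
import argparse

import pytorch_lightning as pl

parser = argparse.ArgumentParser()
parser.add_argument("--num_nodes", type=int, default=1,
                    help="number of nodes to train on")
parser.add_argument("--devices", type=str, default="auto",
                    help="which GPUs to use; only honoured outside SLURM")
args = parser.parse_args()

# Lightning accepts "auto", an int, or a comma-separated GPU list for devices.
trainer = pl.Trainer(num_nodes=args.num_nodes, devices=args.devices)
```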

Replaces #26 with a simpler
method based on advice from @sadamov

## Type of change

- [ ] 🐛 Bug fix (non-breaking change that fixes an issue)
- [x] ✨ New feature (non-breaking change that adds functionality)
- [ ] 💥 Breaking change (fix or feature that would cause existing
functionality to not work as expected)
- [ ] 📖 Documentation (Addition or improvements to documentation)

## Checklist before requesting a review

- [x] My branch is up-to-date with the target branch - if not update
your fork with the changes from the target branch (use `pull` with
`--rebase` option if possible).
- [x] I have performed a self-review of my code
- [x] For any new/modified functions/classes I have added docstrings
that clearly describe their purpose, expected inputs and returned values
- [x] I have placed in-line comments to clarify the intent of any
hard-to-understand passages of my code
- [x] I have updated the [README](README.MD) to cover introduced code
changes
- [ ] I have added tests that prove my fix is effective or that my
feature works
- [x] I have given the PR a name that clearly describes the change,
written in imperative form
([context](https://www.gitkraken.com/learn/git/best-practices/git-commit-message#using-imperative-verb-form)).
- [x] I have requested a reviewer and an assignee (assignee is
responsible for merging). This applies only if you have write access to
the repo, otherwise feel free to tag a maintainer to add a reviewer and
assignee.

## Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should
check the following:
- [x] the code is readable
- [ ] the code is well tested
- [x] the code is documented (including return types and parameters)
- [x] the code is easy to maintain

## Author checklist after completed review

- [ ] I have added a line to the CHANGELOG describing this change, in a
section
  reflecting type of change (add section where missing):
  - *added*: when you have added new functionality
  - *changed*: when default behaviour of the code has been changed
  - *fixes*: when your contribution fixes a bug

## Checklist for assignee

- [ ] PR is up to date with the base branch
- [ ] the tests pass
- [ ] author has added an entry to the changelog (and designated the
change as *added*, *changed* or *fixed*)
- Once the PR is ready to be merged, squash commits and merge the PR.

---------

Co-authored-by: Simon Kamuk Christiansen <skc@volta.dmi.dk>