
Introduce multi-node training setup #26

Closed
wants to merge 5 commits

Conversation

@sadamov (Collaborator) commented May 4, 2024

Enable multi-node GPU training with SLURM

This PR adds support for multi-node GPU training using the SLURM job scheduler. The changes allow the code to detect if it is running within a SLURM job and automatically configure the number of devices and nodes based on the SLURM environment variables.

Key changes

  • Set use_distributed_sampler to True when not in evaluation mode to enable distributed training
  • Detect if running within a SLURM job by checking for the SLURM_JOB_ID environment variable
  • If running with SLURM:
    • Set the number of devices per node (devices) based on the SLURM_GPUS_PER_NODE environment variable, falling back to torch.cuda.device_count() if not set
    • Set the total number of nodes (num_nodes) based on the SLURM_JOB_NUM_NODES environment variable, defaulting to 1 if not set (see the sketch after this list)
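
A minimal sketch of this detection logic, assuming a PyTorch Lightning `Trainer`; the variable names (e.g. `eval_mode`) are illustrative and not necessarily the exact identifiers used in this PR:

```python
import os

import pytorch_lightning as pl
import torch

eval_mode = False  # illustrative flag: True when evaluating rather than training

if "SLURM_JOB_ID" in os.environ:
    # Inside a SLURM job: read the per-node GPU count and the node count
    # from the SLURM environment, with the fallbacks described above.
    devices = int(os.environ.get("SLURM_GPUS_PER_NODE", torch.cuda.device_count()))
    num_nodes = int(os.environ.get("SLURM_JOB_NUM_NODES", "1"))
else:
    # Not under SLURM: use whatever GPUs are visible on this single node.
    devices = torch.cuda.device_count() or 1
    num_nodes = 1

trainer = pl.Trainer(
    devices=devices,
    num_nodes=num_nodes,
    # Distributed sampler only when training, as described above.
    use_distributed_sampler=not eval_mode,
)
```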

Rationale for using SLURM

SLURM (Simple Linux Utility for Resource Management) is a widely used job scheduler and resource manager for high-performance computing (HPC) clusters. It provides a convenient way to allocate and manage resources, including GPUs, across multiple nodes in a cluster.

By leveraging SLURM, we can easily scale our training to utilize multiple GPUs across multiple nodes without the need for manual configuration.

@sadamov requested a review from joeloskarsson May 4, 2024 13:44
@sadamov added the enhancement label May 4, 2024
@sadamov requested a review from leifdenby May 14, 2024 05:32
@joeloskarsson (Collaborator) left a comment

Tested this on multi-GPU without any problems. Will test multi-node on our cluster as soon as I can get my hands on more than one node.

@leifdenby changed the title from "Introduces multi-node training setup" to "Introduce multi-node training setup" May 30, 2024
@joeloskarsson (Collaborator) commented

An update on my testing of this: the SLURM environment variables are read correctly on our cluster as well, but I have not yet been able to get multi-node training working. I think this is unrelated to this code, however, and more likely due to me not having the correct setup for running multi-node jobs on our cluster. Will ask around to see if I can get it working.

In the meantime, @leifdenby (or anyone at DMI 😄), do you have a slurm setup that you could test this on? I just think it's a good idea to test on multiple different clusters to make sure that this is general enough.

@sadamov (Collaborator, Author) commented Jun 7, 2024

I have implemented the latest feedback, updated the CHANGELOG, and added an example SLURM submission script to /docs/examples (is that a good location?), as discussed with @leifdenby. A new small section was added to the README.md.
@joeloskarsson yes, every cluster is different and I also have to adapt my submission scripts after major changes. Do you have ticket support with your HPC provider? They usually know what to do...

@sadamov (Collaborator, Author) commented Dec 16, 2024

As discussed at the dev meeting just now, here is an example SLURM submission script. I realized I had already put everything, plus some documentation, into this PR: docs/examples/submit_slurm_job.sh

@SimonKamuk mentioned this pull request Jan 22, 2025
@sadamov (Collaborator, Author) commented Jan 22, 2025

#103 implements multi-node training in an even simpler way than this PR.

@sadamov closed this Jan 22, 2025
SimonKamuk added a commit that referenced this pull request Jan 23, 2025
## Describe your changes

This PR adds support for multi-node GPU training using the SLURM job
scheduler. The changes allow setting the number of nodes with the CLI
argument `num_nodes`. It is also possible to select a subset of visible
GPUs using the argument `devices` (only when not using SLURM).
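
A rough sketch of the argument handling described above; this is a minimal illustration, and the exact parsing code in the PR may differ:

```python
import argparse

import pytorch_lightning as pl

parser = argparse.ArgumentParser()
parser.add_argument("--num_nodes", type=int, default=1,
                    help="number of nodes to train on")
parser.add_argument("--devices", type=str, default="auto",
                    help="which GPUs to use; only honoured outside SLURM")
args = parser.parse_args()

# Lightning accepts "auto", an int, or a comma-separated GPU list for devices.
trainer = pl.Trainer(num_nodes=args.num_nodes, devices=args.devices)
```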

Replaces #26 with a simpler
method based on advice from @sadamov

## Type of change

- [ ] 🐛 Bug fix (non-breaking change that fixes an issue)
- [x] ✨ New feature (non-breaking change that adds functionality)
- [ ] 💥 Breaking change (fix or feature that would cause existing
functionality to not work as expected)
- [ ] 📖 Documentation (Addition or improvements to documentation)

## Checklist before requesting a review

- [x] My branch is up-to-date with the target branch - if not update
your fork with the changes from the target branch (use `pull` with
`--rebase` option if possible).
- [x] I have performed a self-review of my code
- [x] For any new/modified functions/classes I have added docstrings
that clearly describe their purpose, expected inputs and returned values
- [x] I have placed in-line comments to clarify the intent of any
hard-to-understand passages of my code
- [x] I have updated the [README](README.MD) to cover introduced code
changes
- [ ] I have added tests that prove my fix is effective or that my
feature works
- [x] I have given the PR a name that clearly describes the change,
written in imperative form
([context](https://www.gitkraken.com/learn/git/best-practices/git-commit-message#using-imperative-verb-form)).
- [x] I have requested a reviewer and an assignee (assignee is
responsible for merging). This applies only if you have write access to
the repo, otherwise feel free to tag a maintainer to add a reviewer and
assignee.

## Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should
check the following:
- [x] the code is readable
- [ ] the code is well tested
- [x] the code is documented (including return types and parameters)
- [x] the code is easy to maintain

## Author checklist after completed review

- [ ] I have added a line to the CHANGELOG describing this change, in a
section
  reflecting type of change (add section where missing):
  - *added*: when you have added new functionality
  - *changed*: when default behaviour of the code has been changed
  - *fixes*: when your contribution fixes a bug

## Checklist for assignee

- [ ] PR is up to date with the base branch
- [ ] the tests pass
- [ ] author has added an entry to the changelog (and designated the
change as *added*, *changed* or *fixed*)
- Once the PR is ready to be merged, squash commits and merge the PR.

---------

Co-authored-by: Simon Kamuk Christiansen <skc@volta.dmi.dk>