This is a starter project for distributed deep learning with PyTorch and Slurm.
Prerequisites:

- Mamba
- Git
- Access to a Slurm cluster (e.g. balfrin@cscs.ch)
Clone the latest code from GitHub:

```bash
git clone git@github.com:sadamov/ddp_starter.git
cd ddp_starter
```
Create a new conda environment and install the dependencies:

```bash
mamba env create -f environment.yml
```
Submit a test job to the Slurm cluster:

```bash
sbatch test_slurm.sh
```
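For orientation, `test_slurm.sh` launches a training script on the compute nodes. The following is a minimal sketch of what such a Lightning DDP entry point can look like; the model, the dummy data, and the Trainer settings are illustrative assumptions, not the repo's actual code:

```python
# train.py -- illustrative sketch only; the real ddp_starter script may differ.
import torch
import lightning as L
from lightning.pytorch.loggers import CSVLogger
from torch.utils.data import DataLoader, TensorDataset


class TinyModel(L.LightningModule):
    """Stand-in classifier (4 features -> 3 classes, like Iris)."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 3)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self.layer(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


def main():
    # Random stand-in data; the real project uses the Iris dataset.
    x, y = torch.randn(150, 4), torch.randint(0, 3, (150,))
    loader = DataLoader(TensorDataset(x, y), batch_size=16)

    # strategy="ddp" runs one process per device and averages gradients
    # across them. Under Slurm, Lightning reads the node/task layout from
    # the environment, so these settings should match the job request.
    trainer = L.Trainer(
        max_epochs=10,
        accelerator="auto",
        devices="auto",
        strategy="ddp",
        logger=CSVLogger("."),  # writes ./lightning_logs/version_*/metrics.csv
    )
    trainer.fit(TinyModel(), loader)


if __name__ == "__main__":
    main()
```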
Then check the logs in `./lightning_logs` to see whether the run was successful. The `metrics.csv` file contains the training and validation losses across all epochs. A line like

```
`Trainer.fit` stopped: `max_epochs=10` reached.
```

at the end of the log means that the run completed successfully.
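If you prefer to inspect the losses programmatically, something like the sketch below works. The `version_0` directory is the default CSVLogger layout, and the `train_loss`/`val_loss` column names are assumptions that depend on what the training script actually logs:

```python
# Summarize per-epoch losses from the CSV log; adjust the version
# directory and column names to match your run.
import pandas as pd

metrics = pd.read_csv("lightning_logs/version_0/metrics.csv")

# CSVLogger writes one row per logging event, so train and validation
# metrics typically appear in different rows; take the last recorded
# value of each metric per epoch.
per_epoch = metrics.groupby("epoch").last()
print(per_epoch[["train_loss", "val_loss"]])
```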
For real training runs, you will need to adjust `batch_size` and `num_workers` in the dataloaders of the `IrisDataModule` class to make the best use of the available GPU and CPU resources, as sketched below.
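As a rough sketch of where those knobs live (the actual `IrisDataModule` is in the repo; the scikit-learn loading details here are an assumption):

```python
# Sketch of a LightningDataModule exposing batch_size and num_workers;
# the real IrisDataModule in the repo may differ in its details.
import torch
import lightning as L
from sklearn.datasets import load_iris
from torch.utils.data import DataLoader, TensorDataset, random_split


class IrisDataModule(L.LightningDataModule):
    def __init__(self, batch_size: int = 32, num_workers: int = 4):
        super().__init__()
        # batch_size trades GPU memory for throughput; num_workers is the
        # number of CPU processes that feed batches to each GPU process.
        self.batch_size = batch_size
        self.num_workers = num_workers

    def setup(self, stage=None):
        data = load_iris()
        x = torch.tensor(data.data, dtype=torch.float32)
        y = torch.tensor(data.target, dtype=torch.long)
        self.train_set, self.val_set = random_split(
            TensorDataset(x, y), [120, 30]
        )

    def train_dataloader(self):
        return DataLoader(
            self.train_set,
            batch_size=self.batch_size,
            num_workers=self.num_workers,
            shuffle=True,
        )

    def val_dataloader(self):
        return DataLoader(
            self.val_set,
            batch_size=self.batch_size,
            num_workers=self.num_workers,
        )
```

A common starting point is to grow `batch_size` until GPU memory is nearly full and to set `num_workers` close to the number of CPU cores available per GPU, then benchmark from there.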