diff --git a/README.md b/README.md
index 7c7cd3a1..801edd73 100644
--- a/README.md
+++ b/README.md
@@ -448,6 +448,36 @@ python -m neural_lam.train_model --model hi_lam_parallel --graph hierarchical ..
 Checkpoint files for our models trained on the MEPS data are available upon request.
 
+### High Performance Computing
+
+The training script can be run on a cluster with multiple GPU nodes. Neural LAM is set up to use PyTorch Lightning's `DDP` strategy for distributed training.
+The code can be used on systems both with and without SLURM. If the cluster has multiple nodes, set the `--num_nodes` argument accordingly.
+
+Under SLURM, the job can be submitted with `sbatch slurm_job.sh`, using a shell script like the following.
+```
+#!/bin/bash -l
+#SBATCH --job-name=Neural-LAM
+#SBATCH --time=24:00:00
+#SBATCH --nodes=2
+#SBATCH --ntasks-per-node=4
+#SBATCH --gres=gpu:4
+#SBATCH --partition=normal
+#SBATCH --mem=444G
+#SBATCH --no-requeue
+#SBATCH --exclusive
+#SBATCH --output=lightning_logs/neurallam_out_%j.log
+#SBATCH --error=lightning_logs/neurallam_err_%j.log
+
+# Load necessary modules or activate environment, for example:
+conda activate neural-lam
+
+srun -ul python -m neural_lam.train_model \
+    --config_path /path/to/config.yaml \
+    --num_nodes $SLURM_JOB_NUM_NODES
+```
+
+On a system without SLURM, where all GPUs are visible, a subset of GPUs can be selected for training with the `--devices` CLI argument, e.g. `--devices 0 1` to use the first two GPUs.
+
 ## Evaluate Models
 
 Evaluation is also done using `python -m neural_lam.train_model --config_path `, but using the `--eval` option. Use `--eval val` to evaluate the model on the validation set and `--eval test` to evaluate on test data.
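
For reference, the non-SLURM launch described above can be sketched as a direct invocation of the same entry point. This is an illustrative command, not part of the patch: the config path is a placeholder, and the `--devices` values depend on which GPUs your system exposes.

```shell
# Hypothetical single-node launch without SLURM, using the first two GPUs.
# /path/to/config.yaml is a placeholder; adjust --devices to your hardware.
python -m neural_lam.train_model \
    --config_path /path/to/config.yaml \
    --devices 0 1
```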