GPUs and GPU usage #39
-
Hello authors, thank you for your great repo and for ContraGAN! I had a couple of quick questions:

How many GPUs do you use to train the pre-trained models provided in the README (especially the BigGAN-2048 on ImageNet)?

I'm finding that GPU utilization is quite low when using multiple GPUs.

Thank you so much for your help!
-
Hi. I am sorry for the late reply.

How many GPUs do you use to train the pre-trained models provided in the README (especially the BigGAN-2048 on ImageNet)?
=>
- Models trained on CIFAR10: 1 GPU (2080 Ti, RTX TITAN, V100, A100, etc.)
- Models trained on Tiny_ImageNet: RTX TITAN x 1 (from DCGAN to SAGAN), RTX TITAN x 4 (from BigGAN to ContraGAN + ADA)
- Models trained on ImageNet: V100 32GB x 4 with Sync_BN (from SNGAN to BigGAN with batch size 256), V100 32GB x 8 with Sync_BN and DP (BigGAN with batch size 2048; training takes almost a month)

I'm finding that GPU utilization is quite low when using multiple GPUs.
=> Yes, that is likely because you are training the model with DataParallel (DP). If you train a model using DistributedDataParallel (DDP)…

Thank you:)
Best, Minguk
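For context on that last answer: DP runs a single process that scatters inputs and gathers gradients on one device every step, while DDP runs one process per GPU and overlaps gradient communication with the backward pass, which usually fixes the low-utilization symptom. Below is a minimal, generic DDP sketch -- the toy model, random data, and dummy loss are placeholders, and this is not StudioGAN's actual training entry point -- which also shows the SyncBatchNorm conversion that "Sync_BN" above refers to:

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def worker(rank, world_size):
    # One process per GPU, NCCL backend; DDP all-reduces gradients during
    # backward(), unlike DP's single-process scatter/gather.
    dist.init_process_group("nccl", init_method="tcp://127.0.0.1:23456",
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Toy stand-in; real training would build the generator/discriminator.
    model = torch.nn.Sequential(
        torch.nn.Conv2d(3, 16, 3, padding=1),
        torch.nn.BatchNorm2d(16),
        torch.nn.ReLU(),
    ).cuda(rank)
    # "Sync_BN": convert BatchNorm layers so batch statistics are
    # synchronized across all GPUs instead of computed per device.
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = DDP(model, device_ids=[rank])

    # DistributedSampler gives each process a disjoint shard of the data.
    data = TensorDataset(torch.randn(512, 3, 32, 32))
    sampler = DistributedSampler(data, num_replicas=world_size, rank=rank)
    loader = DataLoader(data, batch_size=64, sampler=sampler)

    opt = torch.optim.Adam(model.parameters(), lr=2e-4)
    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for (x,) in loader:
            opt.zero_grad()
            out = model(x.cuda(rank, non_blocking=True))
            loss = out.pow(2).mean()  # dummy loss just to drive backward()
            loss.backward()
            opt.step()
    dist.destroy_process_group()


if __name__ == "__main__":
    n = torch.cuda.device_count()
    mp.spawn(worker, args=(n,), nprocs=n)
```

Run as a plain `python script.py`; `mp.spawn` launches one worker per visible GPU.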
-
Hi Minguk,

Thank you for the response! I really appreciate the detailed answers, and your updates to the repo today were great. To clarify, what were the exact commands used to train the BigGAN-256 and BigGAN-2048 models? If I'm going to train one of these for 3 or 4 weeks, I want to get the command right :)

For the models you have trained with 4/8 GPUs, did you use DP or DDP? From your previous response I gather that you used DP -- have you had success training models with DDP? Have you had any success with mixed precision training?

Also, when you use standing statistics for evaluation, …

Best,
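For context on the mixed-precision question: PyTorch's native route is torch.cuda.amp, combining autocast for the forward pass with GradScaler for the backward pass. A minimal, generic sketch follows -- the toy model and loss are for illustration only, and this is not StudioGAN's code:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

device = "cuda"
model = torch.nn.Linear(128, 10).to(device)  # toy model, not a GAN
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = GradScaler()  # scales the loss to avoid fp16 gradient underflow

for step in range(100):
    x = torch.randn(64, 128, device=device)
    target = torch.randint(0, 10, (64,), device=device)
    opt.zero_grad()
    with autocast():  # run the forward pass in mixed fp16/fp32 precision
        loss = torch.nn.functional.cross_entropy(model(x), target)
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(opt)               # unscales grads, skips step on inf/NaN
    scaler.update()                # adjusts the scale factor over time
```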
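On standing statistics (the question above is cut off in the thread): in BigGAN, evaluation-time BatchNorm statistics are re-estimated by running the generator on fresh noise batches rather than reusing the running averages accumulated during training. A generic sketch of that idea -- `accumulate_standing_stats` is a hypothetical helper, not StudioGAN's API, and it assumes an unconditional generator:

```python
import torch

def accumulate_standing_stats(generator, z_dim, n_batches=16,
                              batch_size=64, device="cuda"):
    # Hypothetical helper (not StudioGAN's API): re-estimate BatchNorm
    # running statistics with fresh noise batches before evaluation,
    # in the spirit of BigGAN's "standing statistics".
    generator.train()  # BN layers only update their stats in train mode
    for m in generator.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()
            m.momentum = None  # None = cumulative average over all batches
    with torch.no_grad():
        for _ in range(n_batches):
            # Assumes generator(z); a conditional model would also
            # need class labels sampled here.
            z = torch.randn(batch_size, z_dim, device=device)
            generator(z)
    generator.eval()  # evaluate with the freshly accumulated statistics
```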