I want to train on multiple GPUs, and I have tried 8, 4, and 2 GPUs. But the GPU utilization of some GPUs is very low, almost 0%. One training epoch on 8 GPUs takes almost 20 minutes longer than on a single GPU.
Your code sets the default number of GPUs to 4, but when I try 4 cards, one card's GPU utilization is always 0%. With 2 cards there is no 0% utilization, but one of the cards still sits at only about 20%.
This is the GPU usage when training on 4 cards:
I am not very clear about sharding. Do I need to modify the code to train on multiple GPUs and speed up training?
Looking forward to your reply!
You mentioned you ran the code with 1 or 2 GPUs. Did you have this problem in those runs too? I suggest turning on log_device in the config file and comparing the single-GPU run with the 4/8-GPU runs.
I haven't had this problem before, although GPU utilization was around 50-60% across all GPUs.
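For reference, if the code is built on TensorFlow 1.x, the log_device option most likely maps to log_device_placement in the session config. Below is a minimal sketch of what that looks like; the names here are illustrative, not necessarily this repo's exact config keys:

```python
# Hedged sketch: assumes TensorFlow 1.x; this repo's config option "log_device"
# presumably toggles something equivalent to log_device_placement.
import tensorflow as tf

config = tf.ConfigProto(
    log_device_placement=True,   # print which device each op is placed on
    allow_soft_placement=True,   # fall back to CPU if an op has no GPU kernel
)

with tf.Session(config=config) as sess:
    a = tf.constant([1.0, 2.0], name="a")
    b = tf.constant([3.0, 4.0], name="b")
    # The per-op device mapping is written to stderr when the session runs.
    print(sess.run(a + b))
```

Comparing that device mapping between the single-GPU run and the 4/8-GPU runs should show whether ops are actually being placed on the idle cards.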
GPU utilization was 70-80% with 1 GPU, and 50% and 20% respectively with 2 GPUs. But there is always one GPU whose utilization is 0% the whole time. I turned on log_device to get the device mapping and have sent it to you by email.
Also, I want to ask whether the experimental results in the paper are averaged over the 3 datasets (3/4/5-turn Reddit). I ran all the epochs, but my results differ from the paper. Could you please provide your results on each dataset?