
Fix Batch Size Calculation for Multi-GPU Training #5

Open

wants to merge 1 commit into base: master

Conversation

@cosineai cosineai[bot] commented Nov 15, 2024

This pull request addresses inconsistent batch size calculation during multi-GPU training. Previously, the number of batches per epoch did not account for the number of GPUs, leading to an incorrect effective batch size. The fix adjusts the data parallel world size and rank when no model parallel unit (mpu) is defined, so that each data-parallel replica processes its share of the training data and the per-replica batch count is correctly computed as the training data size divided by the number of GPUs, matching the expected behavior. The changes are made in the deepspeed/runtime/engine.py file.
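A minimal sketch of the intended logic (not the exact patch), assuming `torch.distributed` is available and that an `mpu`, when supplied, exposes `get_data_parallel_world_size()` and `get_data_parallel_rank()`; the helper names here are illustrative, not the actual names in `deepspeed/runtime/engine.py`:

```python
import torch.distributed as dist

def configure_data_parallel(mpu=None):
    """Return (dp_world_size, dp_rank), falling back to the global
    process group when no model-parallel unit (mpu) is provided."""
    if mpu is not None:
        dp_world_size = mpu.get_data_parallel_world_size()
        dp_rank = mpu.get_data_parallel_rank()
    else:
        # Without an mpu, every process is a pure data-parallel replica.
        dp_world_size = dist.get_world_size() if dist.is_initialized() else 1
        dp_rank = dist.get_rank() if dist.is_initialized() else 0
    return dp_world_size, dp_rank

def batches_per_epoch(num_samples, micro_batch_size, dp_world_size):
    # Each replica sees num_samples / dp_world_size samples per epoch,
    # so the per-replica batch count divides by the number of GPUs.
    return num_samples // (micro_batch_size * dp_world_size)
```

With this fallback, a run on 4 GPUs without an mpu divides the per-epoch batch count by 4 instead of treating each GPU as if it were processing the full dataset.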


Created by Genie. You can follow its reasoning on Cosine
