
Sharing CUDA tensors error when using mp with CUDA #18

Closed
Xiao-Chenguang opened this issue Oct 24, 2024 · 2 comments · Fixed by #19

@Xiao-Chenguang (Owner)

Using CUDA with multiprocessing generates a warning about shared CUDA tensors.
The log is listed below. The warning does not break training, however.

2024-10-24 21:32:36,219 INFO [MainProcess] Get following configs:
ACTIVE_CLIENT: 3
BATCH_SIZE: 64
CLIENT_EPOCHS: 2
DEVICE: cuda
LOG_LEVEL: 20
LR: 0.1
NUM_CLIENT: 100
NUM_PROCESS: 3
OPTIM:
  LR: 0.1
  MOMENTUM: 0.9
  NAME: SGD
SERVER_EPOCHS: 2
WB_ENTITY: example_entity
WB_PROJECT: example_project

2024-10-24 21:32:36,242 INFO [MainProcess] Start FedAvg.
2024-10-24 21:32:36,243 INFO [MainProcess] Round 1/2
2024-10-24 21:32:40,633 INFO [SpawnProcess-2] Worker-1 started.
2024-10-24 21:32:40,842 INFO [SpawnProcess-3] Worker-2 started.
2024-10-24 21:32:40,855 INFO [SpawnProcess-1] Worker-0 started.
2024-10-24 21:32:41,212 INFO [MainProcess] Train loss: 1.6377
2024-10-24 21:32:42,137 INFO [MainProcess] Test Loss: 36.2461, Accuracy: 0.7875
2024-10-24 21:32:42,138 INFO [MainProcess] Round 2/2
2024-10-24 21:32:42,302 INFO [MainProcess] Train loss: 0.7252
2024-10-24 21:32:43,085 INFO [MainProcess] Test Loss: 24.4372, Accuracy: 0.8069
[W1024 21:32:43.054855859 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[W1024 21:32:43.057068283 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[W1024 21:32:43.058267514 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
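
For context, a generic minimal sketch (not this repo's code, assuming a CUDA device is available) of the pattern that can trigger this warning, where a spawned worker sends a CUDA tensor back to a main process that still holds it when the worker exits:

import torch
import torch.multiprocessing as mp


def worker(queue):
    # The tensor is allocated in the child and shared back via CUDA IPC.
    queue.put(torch.ones(4, device="cuda"))


if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    p = ctx.Process(target=worker, args=(queue,))
    p.start()
    x = queue.get()  # main process now references the shared CUDA storage
    p.join()         # worker exits while x is still referenced -> warning like above
    del x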
@Xiao-Chenguang self-assigned this Oct 24, 2024
@Xiao-Chenguang added the bug label Oct 24, 2024
@Xiao-Chenguang (Owner, Author)

According to the PyTorch documentation, the best practice is to:

1. Release memory in the consumer as soon as possible.

E.g.:

## Good
x = queue.get()
# do something with x
del x  # release the shared CUDA storage as soon as it is no longer needed

2. Keep the producer process running until all consumers exit. This prevents the producer process from releasing memory that is still in use by a consumer. (A combined, runnable sketch of both points follows after the snippet below.)

E.g.:

## producer
# send tensors, do something
event.wait()  # stay alive until the consumer signals that it is done

## consumer
# receive tensors, use them, then release them (see point 1)
event.set()   # let the producer exit safely
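
Putting the two points together, a minimal runnable sketch with torch.multiprocessing and the spawn start method could look like the following (illustrative names, not this repo's API; assumes a CUDA device is available):

import torch
import torch.multiprocessing as mp


def producer(queue, done_event):
    x = torch.ones(4, device="cuda")
    queue.put(x)       # share the CUDA tensor with the consumer
    done_event.wait()  # stay alive until the consumer has released it


def consumer(queue, done_event):
    x = queue.get()
    # ... use x ...
    del x              # release the shared storage as soon as possible
    done_event.set()   # now the producer may exit safely


if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    queue, done_event = ctx.Queue(), ctx.Event()
    p = ctx.Process(target=producer, args=(queue, done_event))
    c = ctx.Process(target=consumer, args=(queue, done_event))
    p.start(); c.start()
    c.join(); p.join()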

@Xiao-Chenguang (Owner, Author)

Since we have Queues in both directions, for tasks from server to client and for results in the reverse direction, we need to take care of the refcounting in both ways.

Because we ensure the subprocesses end before the main process, tensors shared from the main process to the subprocesses are safe: each subprocess clears its garbage before closing.

However, we also have a results Queue that passes tensors generated in a subprocess back to the main process. We should make sure those tensors are released in the main process before the subprocess exits. This is where the problem lies.

As the code uses 3 subprocesses, there are 3 warnings, one from each subprocess.
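
A sketch of the direction a fix can take in the main process (hypothetical helper, not necessarily what the linked PR implements): copy each result off the shared CUDA storage and drop the IPC reference right away, so that nothing shared is still referenced when the subprocesses exit.

def drain_results(result_queue, num_results):
    results = []
    for _ in range(num_results):
        shared = result_queue.get()              # backed by the worker's shared CUDA storage
        results.append(shared.detach().clone())  # own, independent copy in the main process
        del shared                               # release the IPC reference immediately
    return results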

@Xiao-Chenguang linked a pull request (#19) on Oct 25, 2024 that will close this issue