
Sharing CUDA tensors error when using mp with CUDA #18

Closed
Xiao-Chenguang opened this issue Oct 24, 2024 · 2 comments · Fixed by #19

@Xiao-Chenguang (Owner)

Using CUDA with multiprocessing generates a warning about shared CUDA tensors.
The log is listed below. The warning does not break training, however.

2024-10-24 21:32:36,219 INFO [MainProcess] Get following configs:
ACTIVE_CLIENT: 3
BATCH_SIZE: 64
CLIENT_EPOCHS: 2
DEVICE: cuda
LOG_LEVEL: 20
LR: 0.1
NUM_CLIENT: 100
NUM_PROCESS: 3
OPTIM:
  LR: 0.1
  MOMENTUM: 0.9
  NAME: SGD
SERVER_EPOCHS: 2
WB_ENTITY: example_entity
WB_PROJECT: example_project

2024-10-24 21:32:36,242 INFO [MainProcess] Start FedAvg.
2024-10-24 21:32:36,243 INFO [MainProcess] Round 1/2
2024-10-24 21:32:40,633 INFO [SpawnProcess-2] Worker-1 started.
2024-10-24 21:32:40,842 INFO [SpawnProcess-3] Worker-2 started.
2024-10-24 21:32:40,855 INFO [SpawnProcess-1] Worker-0 started.
2024-10-24 21:32:41,212 INFO [MainProcess] Train loss: 1.6377
2024-10-24 21:32:42,137 INFO [MainProcess] Test Loss: 36.2461, Accuracy: 0.7875
2024-10-24 21:32:42,138 INFO [MainProcess] Round 2/2
2024-10-24 21:32:42,302 INFO [MainProcess] Train loss: 0.7252
2024-10-24 21:32:43,085 INFO [MainProcess] Test Loss: 24.4372, Accuracy: 0.8069
[W1024 21:32:43.054855859 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[W1024 21:32:43.057068283 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[W1024 21:32:43.058267514 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
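
For context, a generic minimal sketch (not this repo's code, assuming a CUDA device is available) of the pattern that can trigger this warning, where a spawned worker sends a CUDA tensor back to a main process that still holds it when the worker exits:

import torch
import torch.multiprocessing as mp


def worker(queue):
    # The tensor is allocated in the child and shared back via CUDA IPC.
    queue.put(torch.ones(4, device="cuda"))


if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    p = ctx.Process(target=worker, args=(queue,))
    p.start()
    x = queue.get()  # main process now references the shared CUDA storage
    p.join()         # worker exits while x is still referenced -> warning like above
    del x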
@Xiao-Chenguang self-assigned this Oct 24, 2024
@Xiao-Chenguang added the bug label Oct 24, 2024
@Xiao-Chenguang (Owner, Author)

According to the PyTorch documentation, the best practice is to:

1. Release memory in the consumer as soon as possible.

E.g.:

## Good
x = queue.get()
# do something with x
del x  # release the shared CUDA storage as soon as it is no longer needed

2. Keep the producer process running until all consumers exit. This prevents the producer process from releasing memory that is still in use by a consumer. (A combined, runnable sketch of both points follows after the snippet below.)

E.g.:

## producer
# send tensors, do something
event.wait()  # stay alive until the consumer signals that it is done

## consumer
# receive tensors, use them, then release them (see point 1)
event.set()   # let the producer exit safely
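
Putting the two points together, a minimal runnable sketch with torch.multiprocessing and the spawn start method could look like the following (illustrative names, not this repo's API; assumes a CUDA device is available):

import torch
import torch.multiprocessing as mp


def producer(queue, done_event):
    x = torch.ones(4, device="cuda")
    queue.put(x)       # share the CUDA tensor with the consumer
    done_event.wait()  # stay alive until the consumer has released it


def consumer(queue, done_event):
    x = queue.get()
    # ... use x ...
    del x              # release the shared storage as soon as possible
    done_event.set()   # now the producer may exit safely


if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    queue, done_event = ctx.Queue(), ctx.Event()
    p = ctx.Process(target=producer, args=(queue, done_event))
    c = ctx.Process(target=consumer, args=(queue, done_event))
    p.start(); c.start()
    c.join(); p.join()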

@Xiao-Chenguang (Owner, Author)

Since we have Queues in both directions, for tasks from server to client and for results in the reverse direction, we need to take care of the refcounting in both ways.

Because we ensure the subprocesses end before the main process, tensors shared from the main process to the subprocesses are safe: each subprocess clears its garbage before closing.

However, we also have a results Queue that passes tensors generated in a subprocess back to the main process. We should make sure those tensors are released in the main process before the subprocess exits. This is where the problem lies.

As the code uses 3 subprocesses, there are 3 warnings, one from each subprocess.
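
A sketch of the direction a fix can take in the main process (hypothetical helper, not necessarily what the linked PR implements): copy each result off the shared CUDA storage and drop the IPC reference right away, so that nothing shared is still referenced when the subprocesses exit.

def drain_results(result_queue, num_results):
    results = []
    for _ in range(num_results):
        shared = result_queue.get()              # backed by the worker's shared CUDA storage
        results.append(shared.detach().clone())  # own, independent copy in the main process
        del shared                               # release the IPC reference immediately
    return results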

@Xiao-Chenguang linked a pull request (#19) on Oct 25, 2024 that will close this issue