Using CUDA with multiprocessing generates a warning about shared CUDA tensors. The log is listed below. The warning did not break training.
2024-10-24 21:32:36,219 INFO [MainProcess] Get following configs:
ACTIVE_CLIENT: 3
BATCH_SIZE: 64
CLIENT_EPOCHS: 2
DEVICE: cuda
LOG_LEVEL: 20
LR: 0.1
NUM_CLIENT: 100
NUM_PROCESS: 3
OPTIM:
  LR: 0.1
  MOMENTUM: 0.9
  NAME: SGD
SERVER_EPOCHS: 2
WB_ENTITY: example_entity
WB_PROJECT: example_project
2024-10-24 21:32:36,242 INFO [MainProcess] Start FedAvg.
2024-10-24 21:32:36,243 INFO [MainProcess] Round 1/2
2024-10-24 21:32:40,633 INFO [SpawnProcess-2] Worker-1 started.
2024-10-24 21:32:40,842 INFO [SpawnProcess-3] Worker-2 started.
2024-10-24 21:32:40,855 INFO [SpawnProcess-1] Worker-0 started.
2024-10-24 21:32:41,212 INFO [MainProcess] Train loss: 1.6377
2024-10-24 21:32:42,137 INFO [MainProcess] Test Loss: 36.2461, Accuracy: 0.7875
2024-10-24 21:32:42,138 INFO [MainProcess] Round 2/2
2024-10-24 21:32:42,302 INFO [MainProcess] Train loss: 0.7252
2024-10-24 21:32:43,085 INFO [MainProcess] Test Loss: 24.4372, Accuracy: 0.8069
[W1024 21:32:43.054855859 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[W1024 21:32:43.057068283 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[W1024 21:32:43.058267514 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
The second recommendation from PyTorch's note on sharing CUDA tensors (referenced in the warning) applies here:

2. Keep the producer process running until all consumers exit. This prevents the situation where the producer process releases memory that is still in use by a consumer.

E.g.:
```python
## producer
# send tensors, do something
event.wait()

## consumer
# receive tensors and use them
event.set()
```
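A runnable version of that pattern might look like the following minimal sketch (the `producer`/`consumer` functions and the queue/event names are illustrative, and it assumes a single CUDA device):

```python
import torch
import torch.multiprocessing as mp


def producer(queue, event):
    # Create a CUDA tensor and share it with the consumer via the queue.
    t = torch.ones(4, device="cuda")
    queue.put(t)
    # Stay alive until the consumer signals that it has released the tensor.
    event.wait()


def consumer(queue, event):
    t = queue.get()
    print(t.sum().item())   # use the shared tensor
    del t                   # drop the last reference to the shared storage
    event.set()             # now the producer may exit safely


if __name__ == "__main__":
    mp.set_start_method("spawn")  # spawn (or forkserver) is needed when subprocesses use CUDA
    queue, event = mp.Queue(), mp.Event()
    p = mp.Process(target=producer, args=(queue, event))
    c = mp.Process(target=consumer, args=(queue, event))
    p.start()
    c.start()
    c.join()
    p.join()
```

The key point is that the producer blocks on `event.wait()` until the consumer has dropped its last reference to the shared tensor, so the producer never exits while the storage is still mapped.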
Since we have Queues in both directions, tasks from the server to the clients and results from the clients back to the server, we need to take care of refcounting for both directions.
Because we ensure each subprocess ends before the main process, tensors shared from the main process to a subprocess are safe: the subprocess clears its references before closing.
However, we also have a results Queue that passes tensors generated in a subprocess back to the main process. We have to make sure those tensors are released in the main process before the subprocess exits, and this is where the problem lies.
Since the code uses 3 subprocesses, there are 3 warnings, one from each subprocess.
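One possible fix, sketched below and not taken from the repository's actual code (the names `worker`, `task_queue`, `result_queue`, and `round_done` are hypothetical): each worker puts its result on the results Queue and then blocks on an Event, and the main process copies the results out of the shared CUDA storage, drops the shared references, and only then sets the Event so the workers can exit.

```python
import torch
import torch.multiprocessing as mp


def worker(task_queue, result_queue, round_done):
    # Receive the global model state shared by the main process.
    state = task_queue.get()
    # Stand-in for local client training: produce new CUDA tensors in this process.
    update = {k: v * 0.99 for k, v in state.items()}
    result_queue.put(update)   # shares worker-owned CUDA tensors with the main process
    round_done.wait()          # keep this producer alive until the main process releases them


def main():
    mp.set_start_method("spawn")  # needed when subprocesses use CUDA
    task_queue, result_queue = mp.Queue(), mp.Queue()
    round_done = mp.Event()

    workers = [
        mp.Process(target=worker, args=(task_queue, result_queue, round_done))
        for _ in range(3)
    ]
    for w in workers:
        w.start()

    global_state = {"weight": torch.randn(2, 2, device="cuda")}
    for _ in workers:
        task_queue.put(global_state)  # main outlives the workers, so this direction is safe

    results = []
    for _ in workers:
        shared = result_queue.get()
        # Copy out of the shared CUDA storage and drop the shared reference
        # while the producing worker is still alive.
        results.append({k: v.cpu() for k, v in shared.items()})
        del shared

    round_done.set()  # workers may now exit without triggering the warning
    for w in workers:
        w.join()
    # ... aggregate `results` here (FedAvg averaging) ...


if __name__ == "__main__":
    main()
```

An alternative is to move each result to CPU inside the worker before putting it on the Queue, which avoids CUDA IPC for the results direction entirely at the cost of an extra device-to-host copy per round.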