About training #121
Comments
Please provide more details about your training environment and training logs.
The training speed on Ubuntu with 2 GPUs is lower than on Ubuntu with 1 GPU. Here are some training logs:

```
[2024-12-31 19:32:24,400][ INFO] {'backbone': 'resnet50',
[2024-12-31 19:32:25,021][ INFO] Total params: 40.5M
user-Precision-7920-Tower:24437:24437 [0] NCCL INFO Bootstrap : Using enp0s31f6:192.168.207.78<0>
[2024-12-31 20:12:46,550][ INFO] ***** Evaluation ***** >>>> Class [0 hard roofs] F1: 60.80
[2024-12-31 20:12:46,551][ INFO] ***** Evaluation original ***** >>>> Kappa: 55.87
[2024-12-31 20:12:46,551][ INFO] ***** Evaluation original ***** >>>> OA: 67.73
[2024-12-31 20:12:46,551][ INFO] ***** Evaluation ***** >>>> Class [0 hard roofs] UA: 69.69
```
Sorry for the late response. Are you using an A100 for the training?
I am using a 4090 for the training.
I think the speed is within expectation. From our training log, an A100 GPU completes the first 43 iterations in 1 minute, whereas your first 84 iterations take 5 minutes. Since the iterations per epoch are doubled and the GPU is switched from an A100 to a 4090, I'd guess the roughly 5x longer time per epoch is normal.
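For reference, a minimal sketch of the arithmetic behind that estimate; the iteration counts come from the logs quoted in this thread, and the 2x iterations-per-epoch factor is the assumption stated above, not something measured here:

```python
# Sanity check of the ~5x per-epoch slowdown estimate.
# Numbers are taken from this thread; the 2x iterations/epoch
# factor is the stated assumption for this setup.

a100_iters_per_min = 43          # A100: first 43 iterations in ~1 minute
rtx4090_iters_per_min = 84 / 5   # 4090: first 84 iterations in ~5 minutes

# How much slower each iteration runs on the 4090 vs. the A100.
per_iter_slowdown = a100_iters_per_min / rtx4090_iters_per_min

# Each epoch has twice as many iterations in this configuration.
per_epoch_slowdown = per_iter_slowdown * 2

print(f"per-iteration slowdown: {per_iter_slowdown:.2f}x")  # ~2.56x
print(f"per-epoch slowdown:     {per_epoch_slowdown:.2f}x") # ~5.12x
```

The ~2.6x per-iteration gap times the 2x iteration count gives the ~5x per-epoch figure mentioned above.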
Why does it take more than a week to complete the UniMatch training experiment with one-sixteenth of the data?