Set these two places as shown in the screenshot. By default it trains on two GPUs.
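For readers without the screenshot, here is a rough sketch of the idea behind the fix: the number of training processes should not exceed the number of GPUs on the machine. The traceback in this thread shows `train.py` calling `dist.spawn(..., nprocs=args.ngpu)`, so the value to cap is `ngpu`. The helper name `safe_nprocs` is hypothetical and not part of PaddleSpeech; obtaining the GPU count via `paddle.device.cuda.device_count()` is an assumption about how you would wire it in.

```python
def safe_nprocs(requested: int, available: int) -> int:
    """Cap the requested process count at the number of available GPUs.

    `requested` stands for the ngpu value passed to training;
    `available` is the GPU count on the machine (for example, from
    paddle.device.cuda.device_count(), which is an assumption here).
    """
    if available < 1:
        raise ValueError("no GPU available; run on CPU or check the driver")
    # Never spawn fewer than one process or more than one per GPU.
    return max(1, min(requested, available))


# With 8 requested processes and a single GPU, only one process is spawned.
nprocs = safe_nprocs(requested=8, available=1)
print(nprocs)
```

On a single-GPU machine, the same effect can be had by setting the GPU count to 1 wherever the recipe configures it.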
-
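I followed README.md step by step, and I can confirm that the dataset location is correct. The failure seems to happen at the second step under Get Started, Model Training. My Paddle version is paddlepaddle-gpu==2.5.0 with cudatoolkit=11.6. The error output is below; I can't tell which part is going wrong.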
E0719 10:46:00.151664 387175 place.cc:347] Invalid CUDAPlace(1), must inside [0, 1), because GPU number on your machine is 1
E0719 10:46:00.158169 387177 place.cc:347] Invalid CUDAPlace(3), must inside [0, 1), because GPU number on your machine is 1
I0719 10:46:00.170012 387174 tcp_utils.cc:181] The server starts to listen on IP_ANY:38913
I0719 10:46:00.170167 387174 tcp_utils.cc:130] Successfully connected to 127.0.0.1:38913
E0719 10:46:00.173277 387180 place.cc:347] Invalid CUDAPlace(6), must inside [0, 1), because GPU number on your machine is 1
E0719 10:46:00.177418 387176 place.cc:347] Invalid CUDAPlace(2), must inside [0, 1), because GPU number on your machine is 1
E0719 10:46:00.179487 387179 place.cc:347] Invalid CUDAPlace(5), must inside [0, 1), because GPU number on your machine is 1
C++ Traceback (most recent call last):
0 paddle::distributed::TCPStore::TCPStore(std::string, unsigned short, bool, unsigned long, int)
1 paddle::distributed::TCPStore::waitWorkers()
2 paddle::distributed::TCPStore::get(std::string const&)
3 paddle::distributed::TCPStore::wait(std::string const&)
4 void paddle::distributed::tcputils::receive_bytes<paddle::distributed::ReplyType>(int, paddle::distributed::ReplyType*, unsigned long)
Error Message Summary:
FatalError:
Termination signal
is detected by the operating system.[TimeInfo: *** Aborted at 1689734760 (unix time) try "date -d @1689734760" if you are using GNU date ***]
[SignalInfo: *** SIGTERM (@0x3e80005e7f8) received by PID 387174 (TID 0x7f1a1be7d180) from PID 387064 ***]
E0719 10:46:00.199708 387181 place.cc:347] Invalid CUDAPlace(7), must inside [0, 1), because GPU number on your machine is 1
C++ Traceback (most recent call last):
No stack trace in paddle, may be caused by external reasons.
Error Message Summary:
FatalError:
Termination signal
is detected by the operating system.[TimeInfo: *** Aborted at 1689734760 (unix time) try "date -d @1689734760" if you are using GNU date ***]
[SignalInfo: *** SIGTERM (@0x3e80005e7f8) received by PID 387178 (TID 0x7fd5c10ac180) from PID 387064 ***]
Traceback (most recent call last):
File "/home/iie/PaddleSpeech/paddlespeech/t2s/exps/ernie_sat/train.py", line 203, in <module>
main()
File "/home/iie/PaddleSpeech/paddlespeech/t2s/exps/ernie_sat/train.py", line 197, in main
dist.spawn(train_sp, (args, config), nprocs=args.ngpu)
File "/home/iie/anaconda3/envs/paddlespeech_env/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 606, in spawn
while not context.join():
File "/home/iie/anaconda3/envs/paddlespeech_env/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 413, in join
self._throw_exception(error_index)
File "/home/iie/anaconda3/envs/paddlespeech_env/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 423, in _throw_exception
raise Exception("Process %d terminated with exit code %d." %
Exception: Process 1 terminated with exit code 255.
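The `place.cc` lines are the informative part of this log: each one names a GPU index that a worker tried to use and the number of GPUs Paddle actually found. A small sketch for pulling those two facts out of a saved log follows. The function name `summarize_invalid_places` is hypothetical, and the regular expression is written against the message format seen in this thread only, so it may not match other Paddle versions.

```python
import re

# Matches messages such as:
#   Invalid CUDAPlace(3), must inside [0, 1), because GPU number ...
PLACE_ERROR = re.compile(r"Invalid CUDAPlace\((\d+)\), must inside \[0, (\d+)\)")


def summarize_invalid_places(log_text: str):
    """Return (sorted invalid GPU indices, reported GPU count or None)."""
    invalid = set()
    available = None
    for match in PLACE_ERROR.finditer(log_text):
        invalid.add(int(match.group(1)))
        available = int(match.group(2))
    return sorted(invalid), available


# Two lines copied from the log above.
sample = (
    "E0719 10:46:00.151664 387175 place.cc:347] Invalid CUDAPlace(1), "
    "must inside [0, 1), because GPU number on your machine is 1\n"
    "E0719 10:46:00.158169 387177 place.cc:347] Invalid CUDAPlace(3), "
    "must inside [0, 1), because GPU number on your machine is 1\n"
)
invalid, available = summarize_invalid_places(sample)
print(f"workers asked for GPUs {invalid}, but only {available} GPU(s) exist")
```

If the highest invalid index is at or above the reported GPU count, more worker processes were spawned than there are GPUs, which is consistent with the `Process 1 terminated with exit code 255` exception at the end of the traceback.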