Jinhua Liang
https://github.com/PaddlePaddle/Fleet/blob/develop/examples/quick-start/distributed_train.py#L37 On 1.7.1, fleet.init_worker() needs to be added here, and fleet.stop_worker() needs to be called once training has finished.
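A minimal sketch of where those two calls would go, assuming the Paddle 1.7.x parameter-server Fleet API used by the quick-start example; the model/optimizer setup and the training loop are elided placeholders, not the actual code at L37:

```python
from paddle import fluid
from paddle.fluid.incubate.fleet.base import role_maker
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet

role = role_maker.PaddleCloudRoleMaker()
fleet.init(role)

# ... build the model, wrap the optimizer with fleet.distributed_optimizer(),
# and call minimize() here ...

if fleet.is_server():
    fleet.init_server()
    fleet.run_server()
elif fleet.is_worker():
    fleet.init_worker()      # on 1.7.1, call this before the training loop
    exe = fluid.Executor(fluid.CPUPlace())
    exe.run(fleet.startup_program)
    # ... training loop over fleet.main_program ...
    fleet.stop_worker()      # on 1.7.1, call this after training finishes
```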
https://github.com/PaddlePaddle/Fleet/blob/develop/examples/ctr/criteo_reader.py#L54 This expression behaves differently on Python 3 and Python 2, so it either needs to be wrapped in list(), or Paddle's _gen_str() needs to check for both cases.
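The likely root cause is that map() returns a list on Python 2 but a lazy iterator on Python 3. The snippet below only illustrates that difference and the list() workaround; the values and variable name are made up and are not the actual expression at L54:

```python
# Python 2: map() returns a list that can be indexed and iterated repeatedly.
# Python 3: map() returns a one-shot map object, so indexing raises TypeError
# and a second pass over it yields nothing.
values = map(str.strip, ["1 ", "2 ", "3 "])

# Wrapping the result in list() makes both interpreters behave the same:
values = list(map(str.strip, ["1 ", "2 ", "3 "]))
print(values[0])  # works on both Python 2 and Python 3
```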
https://github.com/PaddlePaddle/benchmark/blob/master/NeuralMachineTranslation/Transformer/fluid/train/train.py#L616

```bash
2019-05-21 09:29:28,729-INFO: Namespace(batch_size=4096, device='GPU', enable_ce=True, fetch_steps=100, local=True, opts=['dropout_seed', '10', 'learning_rate', '2.0', 'warmup_steps', '8000', 'beta2', '0.997', 'd_model', '512', 'd_inner_hid', '2048', 'n_head', '8', 'prepostprocess_dropout', '0.1', 'attention_dropout', '0.1', 'relu_dropout', '0.1', 'weight_sharing', ...
```
This reproduces unreliably with a single card and with multiple processes; it has not yet appeared with a single process and multiple cards.
Paddle commit-id: 977e9fcb274f9497a193baf59303f4a2024f1791. Run script: https://github.com/PaddlePaddle/benchmark/blob/master/se-resnext/paddle/run_with_multi_process.sh. With CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7, training hangs after running for a while, with two of the cards stuck at 100% GPU utilization; with CUDA_VISIBLE_DEVICES=0,1,2,3 there is no problem.
The training log is as follows:

```bash
step 75, loss: 2.736500, step_time_cost: 0.151 s
step 76, loss: 2.795518, step_time_cost: 0.150 s
step 77, loss: 2.817705, step_time_cost: 0.150 s
step 78, loss: 2.724798, step_time_cost: 0.149 s
...
```