Transformer reports an error in multi-process single-GPU mode

Open ccmeteorljh opened this issue 6 years ago • 4 comments

https://github.com/PaddlePaddle/benchmark/blob/master/NeuralMachineTranslation/Transformer/fluid/train/train.py#L616

2019-05-21 09:29:28,729-INFO: Namespace(batch_size=4096, device='GPU', enable_ce=True, fetch_steps=100, local=True, opts=['dropout_seed', '10', 'learning_rate', '2.0', 'warmup_steps', '8000', 'beta2', '0.997', 'd_model', '512', 'd_inner_hid', '2048', 'n_head', '8', 'prepostprocess_dropout', '0.1', 'attention_dropout', '0.1', 'relu_dropout', '0.1', 'weight_sharing', 'True', 'pass_num', '1', 'model_dir', 'tmp_models', 'ckpt_dir', 'tmp_ckpts'], pool_size=200000, shuffle=False, shuffle_batch=False, sort_type='pool', special_token=['<s>', '<e>', '<unk>'], src_vocab_fpath='data/vocab.bpe.32000', sync=True, token_delimiter=' ', train_file_pattern='data/train.tok.clean.bpe.32000.en-de', trg_vocab_fpath='data/vocab.bpe.32000', update_method='pserver', use_default_pe=False, use_mem_opt=True, use_py_reader=True, use_token_batch=True, val_file_pattern=None)
Traceback (most recent call last):
  File "train.py", line 784, in <module>
    train(args)
  File "train.py", line 641, in train
    dev_count = get_device_num()
  File "train.py", line 616, in get_device_num
    device_num = subprocess.check_output(['nvidia-smi','-L']).decode().count('\n')
NameError: global name 'subprocess' is not defined
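
The NameError above means `train.py` calls `subprocess.check_output` without importing the module. A minimal sketch of a device-count helper with the import in place follows. The `CUDA_VISIBLE_DEVICES` check and the `count_listed_gpus` helper are assumptions added for illustration and are not taken from the repository's `train.py`.

```python
import os
import subprocess  # the import whose absence triggers the NameError above


def count_listed_gpus(listing):
    """Count non-empty lines in `nvidia-smi -L` style output (one line per GPU)."""
    return sum(1 for line in listing.splitlines() if line.strip())


def get_device_num():
    """Return the number of GPUs visible to this process.

    Assumption: if CUDA_VISIBLE_DEVICES is set, its comma-separated entries
    are counted; otherwise the output of `nvidia-smi -L` is counted.
    """
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible:
        return len([d for d in visible.split(",") if d.strip()])
    listing = subprocess.check_output(["nvidia-smi", "-L"]).decode()
    return count_listed_gpus(listing)
```

Counting non-empty lines rather than `'\n'` characters gives the same result for normal `nvidia-smi -L` output and also tolerates a missing trailing newline.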

ccmeteorljh avatar May 21 '19 11:05 ccmeteorljh

@ccmeteorljh Why multi-process on a single GPU? Did you not set the environment variable (CUDA_VISIBLE_DEVICES)?

chengduoZH avatar May 21 '19 23:05 chengduoZH

> @ccmeteorljh Why multi-process on a single GPU? Did you not set the environment variable (CUDA_VISIBLE_DEVICES)?

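I did set it. I wanted to compare the speed of a single GPU in multi-process mode against single-process single-GPU mode. The problem above is fixed by simply adding the missing import.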

Traceback (most recent call last):
  File "train.py", line 785, in <module>
    train(args)
  File "train.py", line 703, in train
    token_num, predict, pyreader)
  File "train.py", line 534, in train_loop
    feed=feed_dict_list)
  File "/opt/python/cp27-cp27mu/lib/python2.7/site-packages/paddle/fluid/parallel_executor.py", line 286, in run
    return_numpy=return_numpy)
  File "/opt/python/cp27-cp27mu/lib/python2.7/site-packages/paddle/fluid/executor.py", line 640, in run
    return_numpy=return_numpy)
  File "/opt/python/cp27-cp27mu/lib/python2.7/site-packages/paddle/fluid/executor.py", line 482, in _run_parallel
    "Feed a list of tensor, the list should be the same size as places"
ValueError: Feed a list of tensor, the list should be the same size as places
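
The message in this traceback says that a list-style feed must contain exactly one entry per place (device). The sketch below only illustrates that condition in plain Python. `check_feed_matches_places` is a hypothetical helper, not PaddlePaddle's implementation, and the feed keys are made-up names.

```python
def check_feed_matches_places(feed_dict_list, num_places):
    """Raise ValueError unless the feed list has one entry per place.

    Hypothetical stand-in for the check described by the error message;
    it does not call into PaddlePaddle.
    """
    if len(feed_dict_list) != num_places:
        raise ValueError(
            "Feed a list of tensor, the list should be the same size as places "
            "(got %d feeds for %d places)" % (len(feed_dict_list), num_places)
        )


# One feed dict per device passes the check.
check_feed_matches_places([{"src_word": [1, 2, 3]}], num_places=1)
```

Under this reading, the error would arise when the number of feed dicts built by the training loop differs from the number of places the executor was created with.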

ccmeteorljh avatar May 22 '19 02:05 ccmeteorljh

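> I did set it. I wanted to compare the speed of a single GPU in multi-process mode against single-process single-GPU mode. The problem above is fixed by simply adding the missing import.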

How did you solve this? Please advise, I am running into the same problem.

QianShengWu avatar Aug 13 '19 09:08 QianShengWu

@QianShengWu Multi-process single-GPU mode is not supported yet.

chengduoZH avatar Aug 14 '19 01:08 chengduoZH