DeepLearningExamples
[Transformer-XL/PyTorch] RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Related to Transformer-XL/PyTorch
**Describe the bug**
I got a `RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)` while running evaluation:
```
Run evaluation...
0: thread affinity: {0}
Experiment dir : LM-TFM
Namespace(affinity='single_unique', batch_size=16, clamp_len=400, cuda=True, data='../data/wikitext-103/', dataset='wt103', debug=False, dllog_file='eval_log.json', ext_len=0, fp16=False, load_torchscript=None, local_rank=0, log_all_ranks=False, log_interval=10, manual=None, manual_config=None, manual_vocab='word', max_size=None, mem_len=640, model='', no_env=False, percentiles=[90, 95, 99], repeat=1, same_length=True, save_data=False, save_torchscript=None, seed=1111, split='test', target_perplexity=None, target_throughput=None, tgt_len=64, type='pytorch', work_dir='LM-TFM')
Collecting environment information...
PyTorch version: 1.6.0a0+9907a3e
Is debug build: No
CUDA used to build PyTorch: 11.0
OS: Ubuntu 18.04.4 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: version 3.14.0
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce GTX 1650
Nvidia driver version: 460.91.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.1
Versions of relevant libraries:
[pip] msgpack-numpy==0.4.3.2
[pip] numpy==1.18.1
[pip] pytorch-transformers==1.1.0
[pip] torch==1.6.0a0+9907a3e
[pip] torchtext==0.6.0
[pip] torchvision==0.7.0a0
[conda] magma-cuda110 2.5.2 5 local
[conda] mkl 2019.1 144
[conda] mkl-include 2019.1 144
[conda] msgpack-numpy 0.4.3.2 py36_0
[conda] nomkl 3.0 0
[conda] numpy 1.18.1 py36h94c655d_0
[conda] numpy-base 1.18.1 py36h2f8d375_1
[conda] pytorch-transformers 1.1.0 pypi_0 pypi
[conda] torch 1.6.0a0+9907a3e pypi_0 pypi
[conda] torchtext 0.6.0 pypi_0 pypi
[conda] torchvision 0.7.0a0 pypi_0 pypi
Loading checkpoint from LM-TFM/checkpoint_best.pt
Loading cached dataset...
Evaluating with: math fp32 type pytorch bsz 16 tgt_len 64 ext_len 0 mem_len 640 clamp_len 400
```
```
Traceback (most recent call last):
  File "eval.py", line 515, in <module>
    main()
  File "eval.py", line 456, in main
    loss = evaluate(iter, model, meters, args.log_interval, args.max_size, args.repeat)
  File "eval.py", line 194, in evaluate
    loss, mems = model(data, target, mems)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/transformer-xl/pytorch/mem_transformer.py", line 788, in forward
    hidden, new_mems = self._forward(data, mems=mems)
  File "/workspace/transformer-xl/pytorch/mem_transformer.py", line 711, in _forward
    pos_emb = self.pos_emb(pos_seq)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/transformer-xl/pytorch/mem_transformer.py", line 39, in forward
    sinusoid_inp = torch.ger(pos_seq, self.inv_freq)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'eval.py', '--local_rank=0', '--config_file', 'wt103_base.yaml', '--type', 'pytorch']' returned non-zero exit status 1.
```
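For context on where this fails: `torch.ger(pos_seq, self.inv_freq)` computes an outer product, and on the GPU PyTorch dispatches it to cuBLAS, which is why the error surfaces at `cublasCreate(handle)` rather than in the model code itself. A minimal pure-Python sketch of the math (hypothetical values, no torch dependency, not a fix):

```python
def outer(a, b):
    """Outer product: result[i][j] = a[i] * b[j] -- what torch.ger computes."""
    return [[x * y for y in b] for x in a]

pos_seq = [2.0, 1.0, 0.0]   # hypothetical descending position sequence
inv_freq = [1.0, 0.1]       # hypothetical inverse frequencies
sinusoid_inp = outer(pos_seq, inv_freq)
print(sinusoid_inp)  # [[2.0, 0.2], [1.0, 0.1], [0.0, 0.0]]
```

The op itself is trivial; the cuBLAS handle fails to initialize before it even runs, so the problem is in GPU/library setup rather than in the evaluation inputs.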
**To Reproduce**
I followed steps 1 to 4 of the Quick Start Guide and downloaded the "Transformer-XL PyTorch checkpoint (base, amp)" from NVIDIA NGC:
- git clone https://github.com/NVIDIA/DeepLearningExamples
- cd DeepLearningExamples/PyTorch/LanguageModeling/Transformer-XL
- bash getdata.sh
- bash pytorch/scripts/docker/build.sh
- bash pytorch/scripts/docker/interactive.sh
- wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/transformerxl_pyt_ckpt_base_amp/versions/19.11.0/zip -O transformerxl_pyt_ckpt_base_amp_19.11.0.zip
- unzip transformerxl_pyt_ckpt_base_amp_19.11.0.zip
- bash run_wt103_base.sh eval 1 --type pytorch --model checkpoint_best.pt
**Expected behavior**
I expected the run to report "test loss" and "test ppl", as in the example output.
**Environment**
Please provide at least:
- Container version (e.g. pytorch:19.05-py3): transformer-xl:latest
- GPUs in the system (e.g. 8x Tesla V100-SXM2-16GB): GeForce GTX 1650
- CUDA driver version (e.g. 418.67): 460.91.03