
CUDA error: an illegal memory access was encountered

Open johnchienbronci opened this issue 2 years ago • 0 comments

I encountered the error below when running run_speech_recognition_ctc_streaming.sh with DeepSpeed (torchrun --nproc_per_node 1 ...). This issue occurs consistently with my custom corpora, whereas I can fine-tune successfully using the Common Voice corpus. Does anyone have any ideas?

Environment:
- GPU number: 1
- export CUDA_LAUNCH_BLOCKING=1
- export TORCH_USE_CUDA_DSA=1

terminate called after throwing an instance of 'c10::Error'                                                                                                                           
  what():  CUDA error: an illegal memory access was encountered                                                                                                                        
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.                                                                                                                    
                                                                                                                                                                                       
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):                                                                      
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7b400ef097 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)                                   
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f7b400aaa33 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)                                                                                                                                                                                      
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f7b4019d5a8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)                                                                                                                                                                                     
frame #3: <unknown function> + 0x1f3de (0x7f7b401663de in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)                                                            
frame #4: <unknown function> + 0x22650 (0x7f7b40169650 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)                                                            
frame #5: <unknown function> + 0x22a35 (0x7f7b40169a35 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)                                                            
frame #6: <unknown function> + 0x4ef710 (0x7f7af1667710 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)                                                       
frame #7: c10::TensorImpl::~TensorImpl() + 0x1e3 (0x7f7b400cc393 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)                                                       
frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f7b400cc529 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)                                                         
frame #9: <unknown function> + 0x7761b8 (0x7f7af18ee1b8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)                                                       
frame #10: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7f7af18ee506 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)                                     
frame #11: <unknown function> + 0x1388e1 (0x5580685a58e1 in /usr/bin/python3)                                                                                                          
frame #12: <unknown function> + 0x1386dc (0x5580685a56dc in /usr/bin/python3)                                                                                                          
frame #13: <unknown function> + 0x138787 (0x5580685a5787 in /usr/bin/python3)                                                                                                          
frame #14: <unknown function> + 0x174ac1 (0x5580685e1ac1 in /usr/bin/python3)                                                                                                          
frame #15: <unknown function> + 0x153090 (0x5580685c0090 in /usr/bin/python3)                                                                                                          
frame #16: <unknown function> + 0x166918 (0x5580685d3918 in /usr/bin/python3)                                                                                                          
frame #17: <unknown function> + 0x2593a7 (0x5580686c63a7 in /usr/bin/python3)                                                                                                          
frame #18: <unknown function> + 0x17a7b0 (0x5580685e77b0 in /usr/bin/python3)                                                                                                          
frame #19: <unknown function> + 0x25f5c1 (0x5580686cc5c1 in /usr/bin/python3)                                                                                                          
frame #20: _PyEval_EvalFrameDefault + 0x7a99 (0x5580685b9b49 in /usr/bin/python3)                                                                                                      
frame #21: <unknown function> + 0x16ac31 (0x5580685d7c31 in /usr/bin/python3)                                                                                                          
frame #22: PyObject_Call + 0x122 (0x5580685d88e2 in /usr/bin/python3)                                                                                                                  
frame #23: <unknown function> + 0x27c30c (0x5580686e930c in /usr/bin/python3)                                                                                                          
frame #24: _PyObject_MakeTpCall + 0x25b (0x5580685c04ab in /usr/bin/python3)                                                                                                           
frame #25: _PyEval_EvalFrameDefault + 0x1a2f (0x5580685b3adf in /usr/bin/python3)                                                                                                      
frame #26: <unknown function> + 0x16ac31 (0x5580685d7c31 in /usr/bin/python3)                                                                                                          
frame #27: _PyEval_EvalFrameDefault + 0x1a2f (0x5580685b3adf in /usr/bin/python3)                                                                                                      
frame #28: _PyFunction_Vectorcall + 0x7c (0x5580685ca1ec in /usr/bin/python3)                                                                                                          
frame #29: _PyEval_EvalFrameDefault + 0x6d5 (0x5580685b2785 in /usr/bin/python3)                                                                                                       
frame #30: <unknown function> + 0x141ed6 (0x5580685aeed6 in /usr/bin/python3)                                                                                                          
frame #31: PyEval_EvalCode + 0x86 (0x5580686a5366 in /usr/bin/python3)                                                                                                                 
frame #32: <unknown function> + 0x265108 (0x5580686d2108 in /usr/bin/python3)                                                                                                          
frame #33: <unknown function> + 0x25df5b (0x5580686caf5b in /usr/bin/python3)                                                                                                          
frame #34: <unknown function> + 0x264e55 (0x5580686d1e55 in /usr/bin/python3)                                                                                                          
frame #35: _PyRun_SimpleFileObject + 0x1a8 (0x5580686d1338 in /usr/bin/python3)                                                                                                        
frame #36: _PyRun_AnyFileObject + 0x43 (0x5580686d1033 in /usr/bin/python3)                                                                                                            
frame #37: Py_RunMain + 0x2be (0x5580686c22de in /usr/bin/python3)                                                                                                                     
frame #38: Py_BytesMain + 0x2d (0x55806869832d in /usr/bin/python3)                                                                                                                    
frame #39: <unknown function> + 0x29d90 (0x7f7b5c24ad90 in /lib/x86_64-linux-gnu/libc.so.6)                                                                                            
frame #40: __libc_start_main + 0x80 (0x7f7b5c24ae40 in /lib/x86_64-linux-gnu/libc.so.6)                                                                                                
frame #41: _start + 0x25 (0x558068698225 in /usr/bin/python3)                                                                                                                          

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 24134) of binary: /usr/bin/python3

Reinstalling PyTorch with the following command doesn't solve my problem: pip3 install numpy --pre torch torchvision torchaudio --force-reinstall --index-url https://download.pytorch.org/whl/nightly/cu117
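
Because the failure appears only with the custom corpora and not with Common Voice, one thing that may be worth ruling out is a label id that falls outside the model's vocabulary, which is a common trigger for out-of-bounds GPU accesses. The following is a minimal sketch of such a check, not a confirmed diagnosis. The helper name `find_out_of_range_labels` is hypothetical, the toy label lists stand in for the tokenized transcripts, and the `ignore_id=-100` padding value is an assumption about how the data collator pads labels.

```python
from typing import List, Sequence, Tuple


def find_out_of_range_labels(
    label_seqs: Sequence[Sequence[int]],
    vocab_size: int,
    ignore_id: int = -100,  # assumed padding value for labels; adjust if different
) -> List[Tuple[int, int, int]]:
    """Return (example_index, position, label_id) for ids outside [0, vocab_size)."""
    bad = []
    for example_idx, seq in enumerate(label_seqs):
        for pos, label_id in enumerate(seq):
            if label_id == ignore_id:
                continue  # padded positions are skipped
            if label_id < 0 or label_id >= vocab_size:
                bad.append((example_idx, pos, label_id))
    return bad


# Toy data standing in for tokenized transcripts (hypothetical values).
toy_labels = [[1, 4, 2], [3, 31, -100], [0, 5]]
print(find_out_of_range_labels(toy_labels, vocab_size=31))
# prints [(1, 1, 31)]
```

In a real run, `vocab_size` would be the size of the model's output layer and the label sequences would come from the preprocessed custom dataset. An empty result would suggest the cause lies elsewhere.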

johnchienbronci · Jul 01 '23 02:07