About GPU memory leak

Open xurong1981 opened this issue 6 years ago • 0 comments

Has anyone encountered the following "CUDA_ERROR_ILLEGAL_ADDRESS" error? I switched from multiprocessing to a single process, but the same problem still occurs.

My GPU is a GeForce RTX 2080 8GB (driver: 440.33.01), and my environment is tensorflow 1.12.0, cuda 9.0, cudnn 7.5.0.

The training command is:

CUDA_VISIBLE_DEVICES=0 python main.py \
    --model_name=model_roerich \
    --batch_size=1 \
    --phase=train \
    --image_size=768 \
    --lr=0.0002 \
    --dsr=0.8 \
    --ptcd=./data/Places2/data_large \
    --ptad=./data/artist/nicholas-roerich

Finally, the error messages are:

tensorflow::CurrentStackTrace()
stream_executor::cuda::CUDADriver::SynchronizeContext(stream_executor::cuda::CudaContext*)
stream_executor::StreamExecutor::SynchronizeAllActivity()
tensorflow::GPUUtil::SyncAll(tensorflow::Device*)
tensorflow::BaseGPUDevice::Sync()

Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int)
std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&)


clone

*** End stack trace ***

2020-01-26 00:10:03.045956: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 0x5402c10: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Traceback (most recent call last):
  File "/home/username/.local/share/virtualenvs/adaptive-style-transfer-PbxNnQ9W/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/username/.local/share/virtualenvs/adaptive-style-transfer-PbxNnQ9W/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/username/.local/share/virtualenvs/adaptive-style-transfer-PbxNnQ9W/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: GPU sync failed

Jan 25 '20 15:01 xurong1981