
Cannot train (RTX PROBLEM)

Open HappiestCow22 opened this issue 5 years ago • 7 comments

Cannot train for longer than 5 minutes.

The training CMD prompt window stops updating the iteration count, and the training window no longer updates when I press P. (Enter does nothing either; closing the CMD prompt unloads the VRAM without saving the model.) Training simply stops, with no error message of any kind.

I have a 2080 Ti, and I'd like to mention that I had the same problem on a non-Ti 2080. My theory is that it has something to do with the RTX CUDA drivers packaged with DFL.

Older versions of DFL (the RTX-specific branch) that still used the original SAE model can still train; I have a 2.5M-iteration model that had almost no issues over its entire training life.

This issue appeared as soon as SAEHD was added months ago. Back when you could pick between SAEHD and SAE, I always had to stay with SAE because SAEHD would crash every time. I have tried models with very low resolution and parameters and hit the same issue, so it's not related to OOM errors.

HappiestCow22 avatar Jun 16 '20 05:06 HappiestCow22
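
A quick way to test the "bundled CUDA" theory is to ask the TensorFlow build that ships with DFL what it was compiled against and which GPU it detects. A minimal diagnostic sketch, assuming you run it with the Python interpreter bundled in the DFL package (the exact folder layout varies by release):

    # Diagnostic sketch: report the bundled TensorFlow's CUDA support and the GPU it sees.
    # Run with the Python interpreter that ships inside the DFL build (layout varies by release).
    import tensorflow as tf
    from tensorflow.python.client import device_lib

    print("TensorFlow version:", tf.__version__)
    print("Built with CUDA:", tf.test.is_built_with_cuda())

    # List every device TF can use, with its memory limit and driver-reported description.
    for dev in device_lib.list_local_devices():
        print(dev.name, dev.memory_limit, dev.physical_device_desc)

If the 2080 Ti shows up here with a sensible memory limit, the freeze is less likely to be a simple device-detection problem.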

I have seen very few people with the same issue, but the three I have seen all said they were using a 2080 Ti as well.

HappiestCow22 avatar Jun 16 '20 05:06 HappiestCow22

Happens to me too, often between 15 and 60 minutes in, on a 2080 Ti. I didn't have it before the latest update.

holycowdude avatar Jun 23 '20 20:06 holycowdude

If I use MSI Afterburner to reduce my GPU's maximum power limit to below 50% and do the same to my CPU through Windows, I can train indefinitely without issues. I'm now thinking it's a power-draw problem. I've ordered a 1000 W power supply and I'm hoping that will fix it.

HappiestCow22 avatar Jun 25 '20 23:06 HappiestCow22
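
For anyone who wants to test the power-draw theory without Afterburner: nvidia-smi can log power draw while the trainer runs, and (from an elevated prompt) nvidia-smi -pl <watts> lowers the board power limit directly. A minimal logging sketch, assuming nvidia-smi is on PATH and the card is device 0; the one-hour duration is just a placeholder:

    # Sketch: poll GPU power draw and temperature once per second alongside training,
    # so a freeze or crash can be correlated with a power spike in the log.
    import subprocess, time

    QUERY = ["nvidia-smi", "--id=0",
             "--query-gpu=timestamp,power.draw,temperature.gpu,utilization.gpu",
             "--format=csv,noheader"]

    with open("gpu_power_log.csv", "a") as log:
        for _ in range(3600):        # ~1 hour at 1-second intervals; placeholder duration
            log.write(subprocess.check_output(QUERY, text=True))
            log.flush()
            time.sleep(1)

If the last samples before a freeze show the card sitting at its power limit, that points the same way as the Afterburner result.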

Is this resolved?

test1230-lab avatar Jul 09 '20 00:07 test1230-lab

I'm still getting a similar issue randomly with a res 320 model (settings 384/92/72/22, UD architecture). It usually throws an error somewhere between 1 and 12 hours of training; I've not seen the error in the first hour. RTX 2080 Ti, 32 GB RAM.

2020-07-16 13:09:30.473189: E tensorflow/stream_executor/cuda/cuda_event.cc:48] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2020-07-16 13:09:30.473377: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:274] Unexpected Event status: 1

I experienced the freeze issue described in the first post previously, but with the latest versions I've tried I only get the illegal memory access error. Not sure if it's related?

holycowdude avatar Jul 16 '20 13:07 holycowdude
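
CUDA_ERROR_ILLEGAL_ADDRESS is reported asynchronously, so the event-polling line TensorFlow logs is usually not where the fault actually happened. A common way to narrow it down is to run the trainer with CUDA_LAUNCH_BLOCKING=1 so kernels launch synchronously and the failing op is reported closer to its source. This slows training down considerably and is purely a debugging aid; the trainer command below is a placeholder for whatever your train .bat actually invokes:

    # Debugging sketch: force synchronous CUDA launches so the illegal-address error
    # surfaces at the offending op instead of at a later event poll. Debug use only.
    import os, subprocess

    env = dict(os.environ)
    env["CUDA_LAUNCH_BLOCKING"] = "1"   # slows training considerably

    # Placeholder command: substitute the arguments your DFL train script uses.
    subprocess.run(["python", "main.py", "train"], env=env)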

I have a PC with a GTX 1070 8 GB that works out of the box, but another PC with an RTX 2080 8 GB is getting memory allocation errors like "failed to allocate 2gb", which makes no sense...

EDIT: both with the same settings.

badjano avatar Aug 05 '20 23:08 badjano
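
The 2080-only allocation failures look like TensorFlow trying to reserve nearly all VRAM up front, which the RTX 20-series was notoriously touchy about at the time. Letting the allocator grow on demand is the standard first thing to try; this is only a sketch of that workaround, not a confirmed fix for this issue. The environment variable is read by newer TF releases (roughly 1.14 and later), while the session-config lines are the TF 1.x route:

    # Sketch of the usual RTX 20-series workaround: allocate VRAM incrementally
    # instead of reserving almost all of it at startup.
    import os

    # Honoured by newer TensorFlow releases; harmless if ignored.
    os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

    import tensorflow as tf

    # TF 1.x equivalent: create the session with allow_growth enabled.
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    sess = tf.Session(config=config)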

Is this resolved? I'm planning to get a 2080 Ti, but if it behaves like this, it's a deal breaker.

2blackbar avatar Feb 23 '22 23:02 2blackbar

Does this issue still persist? If not, would you mind closing it?

joolstorrentecalo avatar Jun 08 '23 20:06 joolstorrentecalo