NaN errors 5-10 min into training.

Open zabique opened this issue 3 years ago • 2 comments

VM (Win10, CPU Epyc 2.3 GHz x 48 cores, 128 GB RAM, RTX A100 80 GB)

Tested on my own workspace (data), aligned and extracted at 512 res. The test model is 416 res, liae-udt, with slightly bumped dims. Tried with gradient clipping on and off; with it on, training lasts a little longer.

Training starts and lasts 400-4000 iterations, then crashes with NaN in the src/dst loss line, and the preview window crashes too.
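To pin down where the loss first goes bad, the logged src/dst values can be scanned for the first non-finite entry. This is only a minimal sketch: `first_nonfinite_iter` and the `history` list are hypothetical names for illustration, not part of DeepFaceLab, and the loss values are made up.

```python
import math

def first_nonfinite_iter(loss_history):
    """Return the index of the first iteration whose losses contain NaN or inf, else None.

    loss_history is assumed to be a list of (src_loss, dst_loss) tuples.
    """
    for i, losses in enumerate(loss_history):
        if not all(math.isfinite(v) for v in losses):
            return i
    return None

# Illustrative values only, not taken from the real run.
history = [(0.61, 0.58), (0.60, 0.57), (float("nan"), float("nan"))]
print(first_nonfinite_iter(history))
```

Knowing the first bad iteration makes it easier to compare runs, for example with gradient clipping on versus off.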

================================================================================
Starting. Press "Enter" to stop training and save model.

Traceback (most recent call last):an].6185]
  File "G:\DeepFaceLab_NVIDIA_RTX3000_series_build_11_20_2021\DeepFaceLab_NVIDIA_RTX3000_series_internal\DeepFaceLab\main.py", line 343, in
    arguments.func(arguments)
  File "G:\DeepFaceLab_NVIDIA_RTX3000_series_build_11_20_2021\DeepFaceLab_NVIDIA_RTX3000_series_internal\DeepFaceLab\main.py", line 132, in process_train
    Trainer.main(**kwargs)
  File "G:\DeepFaceLab_NVIDIA_RTX3000_series_build_11_20_2021\DeepFaceLab_NVIDIA_RTX3000_series_internal\DeepFaceLab\mainscripts\Trainer.py", line 317, in main
    lh_img = models.ModelBase.get_loss_history_preview(loss_history_to_show, iter, w, c)
  File "G:\DeepFaceLab_NVIDIA_RTX3000_series_build_11_20_2021\DeepFaceLab_NVIDIA_RTX3000_series_internal\DeepFaceLab\models\ModelBase.py", line 627, in get_loss_history_preview
    ph_max = int ( (plist_max[col][p] / plist_abs_max) * (lh_height-1) )
ValueError: cannot convert float NaN to integer
[22:56:26][#001284][0175ms][nan][nan]

=================================================================================
SO FAR:
- Tested a lower-spec RTM 224 res model with the stock Elon/Stark footage: same result, but it takes longer to crash.
- Limited the number of CPU cores in the system to 8.
- Tried a different fork.
- Unable to enable GPU scheduling: the option does not exist for some reason, even though I know how to set it on other systems.
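The `ValueError` in the traceback comes from passing a NaN ratio to `int()` while drawing the loss-history preview. A guard along the following lines would stop the preview from crashing. It is a sketch only: `safe_bar_height` is a hypothetical helper, not a DeepFaceLab function, and its parameters merely mirror the names in the failing line.

```python
import math

def safe_bar_height(value, abs_max, lh_height):
    """Scale a loss value to a pixel height, returning 0 for unusable input."""
    # NaN/inf values or a non-positive maximum cannot be scaled meaningfully.
    if not (math.isfinite(value) and math.isfinite(abs_max)) or abs_max <= 0:
        return 0
    # Clamp the ratio so the result always fits inside the preview image.
    ratio = min(max(value / abs_max, 0.0), 1.0)
    return int(ratio * (lh_height - 1))

print(safe_bar_height(float("nan"), 1.0, 100))
print(safe_bar_height(0.5, 1.0, 101))
```

A guard like this only hides the symptom in the preview. The src/dst loss is still NaN, so the underlying training problem remains.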

I appreciate any feedback

zabique avatar May 28 '22 06:05 zabique

Hello,

Same issue. Did you fix it?

jeremybarbaud avatar Aug 17 '22 11:08 jeremybarbaud

By switching to the conda Linux version.

zabique avatar Aug 17 '22 12:08 zabique

Issue solved / already answered (or it seems like user error), please close it.

joolstorrentecalo avatar Jun 08 '23 22:06 joolstorrentecalo