
Memory fully exhausted and training quits with errors during 40k-hour ASR training

haihua opened this issue on Apr 12 '24 · 11 comments

During training, we noticed CPU memory usage increasing steadily over time, until about 74% of the way through training there was no memory left and the job quit with the errors below. We are training Conformer + CTC with 1.2 TB of CPU memory and 8 dataloader workers. We suspect this is related to PyTorch Lightning holding on to some memory until an epoch completes. There are no errors with smaller datasets, but once the training data grows large enough the bug is triggered. Please give us tips on how to cure the problem, thanks!

```
[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800190 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800190 milliseconds before timing out.
(the same ProcessGroupNCCL.cpp:475/489/495/916 watchdog-timeout and shutdown messages repeat for ranks 1-7)
asr_scripts_78/egs/sg/run.sh: line 547: 181694 Aborted (core dumped) python3 $scripts/asr_trainer.py --conf $cfg --devices $devices \
    --accumulate_grad_batches 8 --accelerator 'cuda' --save_top_k 5 --val_check_interval 200 \
    --checkpoint_path ${ckpt_path} --model.optim.lr 0.25 --model.optim.sched.warmup_steps 10000 \
    --model.train_ds.max_duration 18.2 --model.train_ds.num_workers 8 --model.optim.sched.name NoamAnnealing \
    --model.train_ds.manifest_filepath ${train_data} --resume_from_checkpoint "${pretrained_mdl}" \
    --model.tokenizer.dir ${tokenizer} --model.tokenizer.type 'bpe' \
    --model.train_ds.batch_size 64 --model.validation_ds.batch_size 64 \
    --model.validation_ds.manifest_filepath ${valid_data} \
    --model.interctc.loss_weights "[]" --model.interctc.apply_at_layers "[]" \
    --model.optim.sched.last_epoch 25000
```
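For what it's worth, the watchdog timeout above is typically a symptom rather than the root cause: one rank stalls or gets killed (here, plausibly by the CPU OOM), and the remaining ranks sit in an allreduce until the default 30-minute NCCL timeout fires. While debugging, the process-group timeout can be raised for extra headroom. A minimal sketch assuming a plain torch.distributed setup (with PyTorch Lightning, the equivalent is passing a timeout to the DDP strategy):

```python
import datetime

import torch.distributed as dist

# Minimal sketch: raise the NCCL collective timeout from the default
# 30 minutes to 2 hours. This only buys debugging headroom; it does
# not fix the underlying CPU memory leak that stalls/kills a rank.
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),
)
```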

haihua · Apr 12 '24

Is it GPU or CPU memory that is exhausted? And how many nodes are you using?

What version of NeMo are you using? Without sufficient details it's not possible to debug.

What I can say is that we train on nodes with 400 GB of RAM each and A100 GPUs with 80 GB of memory, on 90-400K hours of speech, without OOM in either CPU or GPU memory.

If you can visibly see CPU RAM constantly increasing during training, a pseudo-fix is to set exp_manager.max_time_per_run to a reasonable value such as one day; the job then stops after a day and you can restart it, sidestepping the memory leak. It's not a fix, but a temporary workaround (see the sketch below).
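A minimal sketch of that workaround, assuming a recent NeMo where ExpManagerConfig exposes max_time_per_run (the directory name and trainer settings are illustrative; the time format follows Lightning's DD:HH:MM:SS convention):

```python
import pytorch_lightning as pl
from omegaconf import OmegaConf

from nemo.utils.exp_manager import exp_manager

# Illustrative sketch: cap each run at ~23h50m so the job exits cleanly
# before host memory is exhausted, and resume from the latest checkpoint
# on the next submission.
exp_cfg = OmegaConf.create({
    "exp_dir": "exp/conformer_ctc",        # hypothetical output directory
    "max_time_per_run": "00:23:50:00",     # DD:HH:MM:SS
    "resume_if_exists": True,              # pick up the last checkpoint on restart
    "resume_ignore_no_checkpoint": True,   # first run has no checkpoint yet
})

trainer = pl.Trainer(accelerator="gpu", devices=8, max_steps=-1)
exp_manager(trainer, exp_cfg)
```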

titu1994 · Apr 12 '24

> Is it GPU or CPU memory that is exhausted? And how many nodes are you using?

  1. CPU, not GPU.
  2. Just a single node. We added one line, self.log('loss', loss_value, on_step=True, prog_bar=True, on_epoch=False), in nemo/collections/asr/models/ctc_models.py. Previously we used on_epoch=True, but the problem still remains after changing it to False (see the sketch below).
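
For context, a sketch of the two logging variants (the exact surrounding code in ctc_models.py may differ). With on_epoch=True, Lightning has to keep epoch-level state for the metric until the epoch ends so it can compute the epoch aggregate, which is why it was suspected of holding memory across these very long epochs:

```python
# Inside the LightningModule's training_step (illustrative):

# Variant used previously: Lightning also aggregates the value across
# the whole epoch, keeping metric state alive until the epoch ends.
self.log('loss', loss_value, on_step=True, prog_bar=True, on_epoch=True)

# Variant after the change: the value is logged per step only, so no
# epoch-level accumulation is needed. (The leak persisted regardless.)
self.log('loss', loss_value, on_step=True, prog_bar=True, on_epoch=False)
```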

> What version of NeMo are you using?

```
$ git log
commit 0d3d8fa1bc78135e72400e8deecd69cfb3d9aa3f (HEAD -> main)
Author: anteju [email protected]
Date:   Wed Nov 15 16:56:29 2023 -0800

    [ASR] GSS-based mask estimator (#7849)

    * Added GSS-based mask estimator for multispeaker scenarios

    Signed-off-by: Ante Jukić <[email protected]>

    * Addressed PR comments

    Signed-off-by: Ante Jukić <[email protected]>

    ---------

    Signed-off-by: Ante Jukić <[email protected]>
    Co-authored-by: Taejin Park <[email protected]>
```

Actually, it's very easy to verify: just submit a training task with, say, LibriSpeech data, and you can observe that CPU memory keeps increasing within an epoch. Such an increase doesn't hurt by itself, since memory grows slowly, and after an epoch usage somehow goes down again. Here, if we decrease our training data to 30k hours, with 1.2 TB of CPU memory we can finish an epoch normally. (A monitoring sketch follows below.)
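
One way to make that growth visible, a minimal sketch using psutil (the callback name and interval are illustrative, and the hook signature assumes PyTorch Lightning 2.x; note it only tracks the main process, not the dataloader workers):

```python
import psutil
import pytorch_lightning as pl


class HostMemoryMonitor(pl.Callback):
    """Illustrative callback: print the process RSS every N training
    steps so within-epoch CPU memory growth becomes visible."""

    def __init__(self, every_n_steps: int = 500):
        self.every_n_steps = every_n_steps
        self._proc = psutil.Process()

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if batch_idx % self.every_n_steps == 0:
            rss_gb = self._proc.memory_info().rss / 1024**3
            print(f"step {trainer.global_step}: host RSS = {rss_gb:.1f} GiB")


# Usage: pl.Trainer(..., callbacks=[HostMemoryMonitor()])
```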

haihua · Apr 12 '24


How many nodes have you used? If you use a lot of nodes, then you might not trigger the bug; say, if you have used 8 nodes, there might be no issues ...

Regards,

Haihua


haihua · Apr 12 '24

That NeMo version is 6 months old; can you use r1.23 and see if it persists? We do not see constantly increasing CPU memory per epoch, but that may be because we use multiple nodes (minimum 4 nodes).

titu1994 · Apr 12 '24

Hi, is this issue resolved? I've been running into the same issue. (I can confirm that it happens on 1.23 as well)

riqiang-dp · Apr 26 '24

Hi there,

Just checking in and wondering whether this has been resolved? I am facing the same issue.

Thank you.

ROZBEH · May 23 '24

Using multiple nodes to train can avoid the problem (a sketch of a multi-node setup is below).
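
For reference, a minimal sketch of what a multi-node launch looks like with PyTorch Lightning (the node and GPU counts are illustrative; the cluster launcher, e.g. SLURM or torchrun, must start one process per GPU on every node and set the rank/world-size environment variables):

```python
import pytorch_lightning as pl

# Illustrative multi-node DDP setup: 4 nodes x 8 GPUs = 32 processes.
# Every node runs this same script under the cluster launcher.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,      # GPUs per node
    num_nodes=4,    # total number of nodes
    strategy="ddp",
)
```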


haihua · May 24 '24

Thanks @haihua, I'm indeed using 5 nodes with 5 GPUs each. Is that what you mean?

ROZBEH · May 24 '24

Yes, that's it.


haihua · May 24 '24

I see, but the above issue persists for me even with multiple nodes, and I'd like to get it working.

ROZBEH · May 24 '24

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] · Jun 24 '24

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] · Jul 01 '24