[BUG] Out of memory when training; is streaming mode supported?
description
- dataset: 1.1 GB Chinese corpus, about 2 million lines
- devices
- CPU=8 GPU=1 Memory=320G Node=1 Type=A100-SXM-80GB
question
The training process is always killed due to OOM (Out Of Memory), even though the total size of the dataset is only 1.1 GB, far smaller than the available memory (320G).
- The program stops during the tokenizing step:
cd DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/other_language
sh run_chinese.sh
I'm wondering whether the whole dataset is loaded into memory, which leads to the OOM.
Does the loading process support streaming mode?
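For illustration, here is a minimal sketch of the kind of streaming/lazy tokenization I have in mind, assuming a line-per-example text file and a Hugging Face tokenizer. The corpus path and model name below are placeholders, not part of DeepSpeed-Chat.

# Hypothetical sketch: read and tokenize the corpus lazily instead of
# tokenizing all 2 million lines up front. Path and model name are placeholders.
from torch.utils.data import IterableDataset, DataLoader
from transformers import AutoTokenizer

class StreamingTextDataset(IterableDataset):
    """Yields tokenized examples one line at a time from a text file."""

    def __init__(self, path, tokenizer, max_seq_len=512):
        self.path = path
        self.tokenizer = tokenizer
        self.max_seq_len = max_seq_len

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                enc = self.tokenizer(
                    line.strip(),
                    max_length=self.max_seq_len,
                    truncation=True,
                    padding="max_length",
                    return_tensors="pt",
                )
                yield {
                    "input_ids": enc["input_ids"].squeeze(0),
                    "attention_mask": enc["attention_mask"].squeeze(0),
                }

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")  # placeholder model
train_dataset = StreamingTextDataset("chinese_corpus.txt", tokenizer)  # placeholder path
train_loader = DataLoader(train_dataset, batch_size=4)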
@wqw547243068, can you please share ds_report output and stack trace?
Try setting batch size = 1.
The ds_report output is attached below:
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/root/miniconda3/envs/deepspeed/lib/python3.7/site-packages/torch']
torch version .................... 1.11.0+cu113
deepspeed install path ........... ['/root/miniconda3/envs/deepspeed/lib/python3.7/site-packages/deepspeed']
deepspeed info ................... 0.9.0+970d827f, 970d827f, master
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0
How long does one training epoch take on your hardware? 2 million lines could take a long time...
By my estimate, step 1 training may take about 20 hours per epoch, but I think the main problem is not in the training process itself. It is in the tokenization of the training dataset, before the torch.save(train_dataset, train_fname) stage. When I run this step, memory usage climbs to 320G (once I get into the training step, it takes 176G). Do you have any idea how I should resolve this problem?
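One workaround I am considering is to tokenize the corpus in chunks and save each chunk separately, so that peak memory stays bounded. A rough sketch follows; the chunk size, file names, and model name are only illustrative, not part of DeepSpeed-Chat.

# Hypothetical sketch: tokenize in chunks of 100k lines and save each chunk,
# instead of holding all 2 million tokenized lines in memory before torch.save.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")  # placeholder model

chunk, chunk_idx, chunk_size = [], 0, 100_000
with open("chinese_corpus.txt", encoding="utf-8") as f:  # placeholder corpus path
    for line in f:
        enc = tokenizer(line.strip(), max_length=512, truncation=True)
        chunk.append(enc["input_ids"])
        if len(chunk) == chunk_size:
            torch.save(chunk, f"train_tokens_{chunk_idx}.pt")
            chunk, chunk_idx = [], chunk_idx + 1
if chunk:
    torch.save(chunk, f"train_tokens_{chunk_idx}.pt")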