[BUG] Out of memory when training; is streaming mode supported?
description
- dataset: 1.1 GB Chinese corpus, about 2 million lines
- devices
- CPU=8 GPU=1 Memory=320G Node=1 Type=A100-SXM-80GB
question
The training process is always killed due to OOM (Out Of Memory), even though the total size of the dataset is only 1.1 GB, far smaller than the available memory (320G).
- The program stops during the tokenizing step:
cd DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/other_language
sh run_chinese.sh
I'm wondering whether the whole dataset is loaded into memory, which leads to the OOM.
Does the loading process support streaming mode?
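For illustration, here is a minimal sketch of the kind of streaming/lazy tokenization I have in mind, assuming a line-per-example text file and a Hugging Face tokenizer. The corpus path and model name below are placeholders, not part of DeepSpeed-Chat.

# Hypothetical sketch: read and tokenize the corpus lazily instead of
# tokenizing all 2 million lines up front. Path and model name are placeholders.
from torch.utils.data import IterableDataset, DataLoader
from transformers import AutoTokenizer

class StreamingTextDataset(IterableDataset):
    """Yields tokenized examples one line at a time from a text file."""

    def __init__(self, path, tokenizer, max_seq_len=512):
        self.path = path
        self.tokenizer = tokenizer
        self.max_seq_len = max_seq_len

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                enc = self.tokenizer(
                    line.strip(),
                    max_length=self.max_seq_len,
                    truncation=True,
                    padding="max_length",
                    return_tensors="pt",
                )
                yield {
                    "input_ids": enc["input_ids"].squeeze(0),
                    "attention_mask": enc["attention_mask"].squeeze(0),
                }

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")  # placeholder model
train_dataset = StreamingTextDataset("chinese_corpus.txt", tokenizer)  # placeholder path
train_loader = DataLoader(train_dataset, batch_size=4)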
@wqw547243068, can you please share ds_report output and stack trace?
Try setting batch size = 1.
The ds_report output is attached below:
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/root/miniconda3/envs/deepspeed/lib/python3.7/site-packages/torch']
torch version .................... 1.11.0+cu113
deepspeed install path ........... ['/root/miniconda3/envs/deepspeed/lib/python3.7/site-packages/deepspeed']
deepspeed info ................... 0.9.0+970d827f, 970d827f, master
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0
How long does one training epoch take on your hardware? 2 million lines could take a long time...
By my estimate, step 1 training may take about 20 hours per epoch, but I think the main problem is not in the training process itself. It is in the tokenization of the training dataset, before the torch.save(train_dataset, train_fname) stage. When I run this step, memory usage climbs to 320G (once I get into the training step, it takes 176G). Do you have any idea how I should resolve this problem?
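One workaround I am considering is to tokenize the corpus in chunks and save each chunk separately, so that peak memory stays bounded. A rough sketch follows; the chunk size, file names, and model name are only illustrative, not part of DeepSpeed-Chat.

# Hypothetical sketch: tokenize in chunks of 100k lines and save each chunk,
# instead of holding all 2 million tokenized lines in memory before torch.save.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")  # placeholder model

chunk, chunk_idx, chunk_size = [], 0, 100_000
with open("chinese_corpus.txt", encoding="utf-8") as f:  # placeholder corpus path
    for line in f:
        enc = tokenizer(line.strip(), max_length=512, truncation=True)
        chunk.append(enc["input_ids"])
        if len(chunk) == chunk_size:
            torch.save(chunk, f"train_tokens_{chunk_idx}.pt")
            chunk, chunk_idx = [], chunk_idx + 1
if chunk:
    torch.save(chunk, f"train_tokens_{chunk_idx}.pt")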