Step 3 fails with OOM hint; log shows ImportError: libcurand.so.10
Machine configuration information
(deepspeed) [menkeyi@gpu1 ~]$ df -Th
Filesystem                 Type      Size  Used Avail Use% Mounted on
none                       overlay    79G   29G   46G  39% /
192.168.100.44@o2ib:/data  lustre     98T  4.3T   89T   5% /home
dev                        devtmpfs  991M     0  991M   0% /dev
tmpfs                      tmpfs     504G  646M  504G   1% /dev/shm
tmpfs                      tmpfs     504G   19M  504G   1% /run
tmpfs                      tmpfs     504G     0  504G   0% /sys/fs/cgroup
(deepspeed) [menkeyi@gpu1 ~]$ free -m
              total        used        free      shared  buff/cache   available
Mem:        1031741      115892      846810        1328       69039      821477
Swap:             0           0           0
(deepspeed) [menkeyi@gpu1 ~]$ cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)
=============================train
(deepspeed) [menkeyi@gpu1 DeepSpeed-Chat]$ python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_node
---=== Running Step 1 ===---
Running:
bash /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_node/run_1.3b.sh /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b
---=== Finished Step 1 in 2:45:20 ===---
---=== Running Step 2 ===---
Running:
bash /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/training_scripts/single_node/run_350m.sh /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m
---=== Finished Step 2 in 3:56:06 ===---
---=== Running Step 3 ===---
Running:
bash /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/single_node/run_1.3b.sh /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m '' '' /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/output/step3-models/1.3b
Traceback (most recent call last):
File "/home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/train.py", line 210, in
Launch command: bash /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/single_node/run_1.3b.sh /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m '' '' /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/output/step3-models/1.3b
Log output: /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/output/step3-models/1.3b/training.log
Please see our tutorial at https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning
Please check that you have installed our requirements: pip install -r requirements.txt
If you are seeing an OOM error, try modifying /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/single_node/run_1.3b.sh:
- Reduce --per_device_*_batch_size
- Increase --zero_stage {0,1,2,3} on multi-gpu setups
- Enable --gradient_checkpointing or --only_optimizer_lora
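The hint above is conditional ("If you are seeing an OOM error"), so it helps to check the step's training.log for the actual exception before changing batch sizes. A minimal sketch follows; `classify_failure` is a hypothetical helper (not part of DeepSpeed-Chat), and the grep patterns are assumptions about how PyTorch and Python typically word these errors.

```shell
#!/usr/bin/env bash
# Sketch: report which kind of failure a training log contains.
# classify_failure is a hypothetical helper, not part of DeepSpeed-Chat.
# The patterns below are assumptions about typical error wording.
classify_failure() {
    local log="$1"
    if grep -qE 'CUDA out of memory|OutOfMemoryError' "$log"; then
        echo "oom"
    elif grep -qE 'ImportError|ModuleNotFoundError' "$log"; then
        echo "import-error"
    else
        echo "unknown"
    fi
}

# Example call, using the path from the "Log output:" line above:
# classify_failure /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/output/step3-models/1.3b/training.log
```

If the result is "import-error", the OOM suggestions are unlikely to help and the missing library should be addressed first.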
=========================================training.log
(deepspeed) [menkeyi@gpu1 ~]$ tail -f DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b/training.log
[2023-04-17 22:53:45,410] [INFO] [logging.py:96:log_dist] [Rank 0] step=4130, skipped=74, lr=[7.642152964180552e-09, 7.642152964180552e-09], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-17 22:53:46,284] [INFO] [timer.py:199:stop] epoch=1/micro_step=2065/global_step=4130, RunningAvgSamplesPerSec=30.58440684669661, CurrSamplesPerSec=30.61588841412397, MemAllocated=4.94GB, MaxMemAllocated=23.6GB
***** Evaluating perplexity, Epoch 2/2 *****
ppl: 2.7952592372894287
saving the final model ...
[2023-04-17 22:54:07,899] [INFO] [launch.py:460:main] Process 24672 exits successfully.
[2023-04-17 22:54:07,899] [INFO] [launch.py:460:main] Process 24668 exits successfully.
[2023-04-17 22:54:08,901] [INFO] [launch.py:460:main] Process 24669 exits successfully.
[2023-04-17 22:54:08,901] [INFO] [launch.py:460:main] Process 24671 exits successfully.
[2023-04-17 22:54:08,902] [INFO] [launch.py:460:main] Process 24673 exits successfully.
[2023-04-17 22:54:09,903] [INFO] [launch.py:460:main] Process 24670 exits successfully.
[2023-04-17 22:54:09,904] [INFO] [launch.py:460:main] Process 24667 exits successfully.
[2023-04-17 22:54:10,905] [INFO] [launch.py:460:main] Process 24666 exits successfully.
====================================step3-models/1.3b/training.log
(deepspeed) [menkeyi@gpu1 output]$ tail -f /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/output/step3-models/1.3b/training.log
self.module = DeepSpeedOPTInference(_config, mp_group=self.mp_group)
File "/home/menkeyi/.conda/envs/deepspeed/lib/python3.9/site-packages/deepspeed/model_implementations/transformers/ds_opt.py", line 20, in __init__
ImportError: libcurand.so.10: cannot open shared object file: No such file or directory
super().__init__(config, mp_group, quantize_scales, quantize_groups, merge_count, mlp_extra_grouping)
File "/home/menkeyi/.conda/envs/deepspeed/lib/python3.9/site-packages/deepspeed/model_implementations/transformers/ds_transformer.py", line 54, in __init__
inference_cuda_module = builder.load()
File "/home/menkeyi/.conda/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 449, in load
return self.jit_load(verbose)
File "/home/menkeyi/.conda/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
op_module = load(name=self.name,
File "/home/menkeyi/.conda/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/home/menkeyi/.conda/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1534, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
File "/home/menkeyi/.conda/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1936, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
File "
I searched for libcurand.so.10. PyTorch was installed in the conda deepspeed environment with:
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
(deepspeed) [menkeyi@gpu1 output]$ find / -name libcurand.so.10
...
/home/menkeyi/.conda/envs/deepspeed/lib/python3.9/site-packages/nvidia/curand/lib/libcurand.so.10
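The find result shows that libcurand.so.10 exists inside the conda environment, yet the loader reports it cannot open it, which usually means that directory is not on the dynamic loader's search path. One common workaround, offered here as a sketch and not a confirmed fix for this report, is to prepend the directory to LD_LIBRARY_PATH before relaunching step 3. `prepend_ld_path` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: add a directory to the front of LD_LIBRARY_PATH, at most once.
# prepend_ld_path is a hypothetical helper name.
prepend_ld_path() {
    local dir="$1"
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$dir:"*) ;;  # already present, nothing to do
        *) export LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
}

# Directory reported by the find command above.
prepend_ld_path /home/menkeyi/.conda/envs/deepspeed/lib/python3.9/site-packages/nvidia/curand/lib
echo "$LD_LIBRARY_PATH"
```

Because child processes inherit the environment, rerunning the step 3 launch command from the same shell should let the JIT-built extension resolve the library, assuming no other CUDA libraries are missing.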