No disk space left while loading llama2-70B for SFT
Describe the bug
Followed https://docs.nvidia.com/nemo-framework/user-guide/latest/playbooks/llama2sft.html
docker image: nvcr.io/nvidia/nemo:24.01.01.framework
I converted the llama2-70B HF model to NeMo using the doc above; the resulting .nemo file is 129 GB. I have 1.2 TB of disk space. While running SFT on 2x H100 nodes (16 GPUs total), I get the following error:
Traceback (most recent call last):
File "/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_sft.py", line 225, in main
model = load_from_nemo(MegatronGPTSFTModel, cfg, trainer, gpt_cfg, modify_confg_fn=_modify_config)
File "/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_sft.py", line 116, in load_from_nemo
model = cls.restore_from(
File "/opt/NeMo/nemo/collections/nlp/models/nlp_model.py", line 465, in restore_from
return super().restore_from(
File "/opt/NeMo/nemo/core/classes/modelPT.py", line 450, in restore_from
instance = cls._save_restore_connector.restore_from(
File "/opt/NeMo/nemo/collections/nlp/parts/nlp_overrides.py", line 1067, in restore_from
loaded_params = super().load_config_and_state_dict(
File "/opt/NeMo/nemo/core/connectors/save_restore_connector.py", line 143, in load_config_and_state_dict
self._unpack_nemo_file(
File "/opt/NeMo/nemo/core/connectors/save_restore_connector.py", line 572, in _unpack_nemo_file
tar.extractall(path=out_folder)
File "/usr/lib/python3.10/tarfile.py", line 2257, in extractall
self._extract_one(tarinfo, path, set_attrs=not tarinfo.isdir(),
File "/usr/lib/python3.10/tarfile.py", line 2324, in _extract_one
self._handle_fatal_error(e)
File "/usr/lib/python3.10/tarfile.py", line 2320, in _extract_one
self._extract_member(tarinfo, os.path.join(path, tarinfo.name),
File "/usr/lib/python3.10/tarfile.py", line 2403, in _extract_member
self.makefile(tarinfo, targetpath)
File "/usr/lib/python3.10/tarfile.py", line 2456, in makefile
copyfileobj(source, target, tarinfo.size, ReadError, bufsize)
File "/usr/lib/python3.10/tarfile.py", line 255, in copyfileobj
dst.write(buf)
OSError: [Errno 28] No space left on device
Steps/Code to reproduce bug
Followed the doc: https://docs.nvidia.com/nemo-framework/user-guide/latest/playbooks/llama2sft.html
WORLD_SIZE=16 srun --kill-on-bad-exit=0 -N 2 --ntasks-per-node=8 --cpus-per-task=24 --ntasks=16 \
--container-image="docker://nvcr.io#nvidia/nemo:24.01.01.framework" --container-name=nemo_llama_slurm --container-mounts="${_cont_mounts}" \
python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_sft.py trainer.precision=bf16 trainer.devices=8 trainer.num_nodes=2 \
trainer.val_check_interval=0.1 trainer.max_steps=50 model.restore_from_path=${MODEL} model.micro_batch_size=1 model.global_batch_size=128 \
model.tensor_model_parallel_size=${TP_SIZE} model.activations_checkpoint_num_layers=1 model.pipeline_model_parallel_size=${PP_SIZE} model.megatron_amp_O2=True \
model.sequence_parallel=True model.activations_checkpoint_granularity=full model.activations_checkpoint_method=uniform model.optim.name=distributed_fused_adam \
model.optim.lr=5e-6 model.answer_only_loss=True model.data.train_ds.file_names=${TRAIN} model.data.validation_ds.file_names=${VALID} \
model.data.test_ds.file_names=${TEST} model.data.train_ds.concat_sampling_probabilities=${CONCAT_SAMPLING_PROBS} model.data.train_ds.max_seq_length=2048 \
model.data.validation_ds.max_seq_length=2048 model.data.train_ds.micro_batch_size=1 model.data.train_ds.global_batch_size=128 \
model.data.validation_ds.micro_batch_size=1 model.data.validation_ds.global_batch_size=128 model.data.test_ds.micro_batch_size=1 \
model.data.test_ds.global_batch_size=256 model.data.train_ds.num_workers=0 model.data.validation_ds.num_workers=0 model.data.test_ds.num_workers=0 \
model.data.validation_ds.metric.name=loss model.data.test_ds.metric.name=loss exp_manager.create_wandb_logger=False exp_manager.explicit_log_dir=/tmp/results \
exp_manager.resume_if_exists=True exp_manager.resume_ignore_no_checkpoint=True exp_manager.create_checkpoint_callback=True \
exp_manager.checkpoint_callback_params.monitor=validation_loss exp_manager.checkpoint_callback_params.save_best_model=False \
exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True ++cluster_type=BCP
Expected behavior
SFT training should load the 70B .nemo checkpoint and run to completion without exhausting disk space.
Environment overview (please complete the following information)
- Environment location: Docker container launched via Slurm (srun --container-image) on 2x H100 nodes
- Method of NeMo install: NGC container image nvcr.io/nvidia/nemo:24.01.01.framework (no additional install)
Environment details
If NVIDIA docker image is used you don't need to specify these. Otherwise, please provide:
- OS version Ubuntu 20.04
- docker image: nvcr.io/nvidia/nemo:24.01.01.framework
Additional context
GPU model: H100 (2 nodes, 8 GPUs each)
This is because the /tmp folder inside your container does not have enough space. NeMo untars the .nemo file into that folder, and for a 70B model that requires a lot of space. You can mount an empty directory on the host to /tmp in your container.
Right, it's better to untar such large models with tar -xvf xyz.nemo -C /path and then use the save restore connector to restore the model by explicitly passing the path of the extracted dir. There are examples of this in the inference scripts in the LLM directories.
Here is an example - https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/megatron_gpt_eval.py#L185-L187 and pass the connector to restore_from https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/megatron_gpt_eval.py#L238
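For reference, here is a minimal, single-GPU sketch of that pattern, adapted from the eval example linked above. The path and the trainer settings are placeholders, not values from this issue; adjust them to your setup.

```python
import os

from pytorch_lightning import Trainer

from nemo.collections.nlp.models.language_modeling.megatron_gpt_sft_model import MegatronGPTSFTModel
from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy, NLPSaveRestoreConnector

# Placeholder: directory produced by `tar -xvf llama2-70b.nemo -C /data/llama2-70b-extracted`
restore_path = "/data/llama2-70b-extracted"

trainer = Trainer(strategy=NLPDDPStrategy(), accelerator="gpu", devices=1)

save_restore_connector = NLPSaveRestoreConnector()
if os.path.isdir(restore_path):
    # Point the connector at the already-extracted checkpoint so NeMo does not
    # untar the .nemo archive into /tmp.
    save_restore_connector.model_extracted_dir = restore_path

model = MegatronGPTSFTModel.restore_from(
    restore_path=restore_path,
    trainer=trainer,
    save_restore_connector=save_restore_connector,
)
```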
Thanks @qijiaxing and @titu1994 for the reply. Applying the tar command to the .nemo file and updating model.restore_from_path to the extracted path solved the issue I was facing.
@qijiaxing @titu1994, reopening as I'm getting the following error after training completes all steps.
cmd
WORLD_SIZE=16 srun --kill-on-bad-exit=0 -N 2 --ntasks-per-node=8 --cpus-per-task=24 --ntasks=16 --container-image="docker://nvcr.io#nvidia/nemo:24.01.01.framework" --container-name=nemo_llama_slurm --container-mounts="${_cont_mounts}" python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_sft.py trainer.precision=bf16 trainer.devices=8 trainer.num_nodes=2 trainer.val_check_interval=1.0 trainer.max_steps=5 model.restore_from_path=${MODEL} model.micro_batch_size=1 model.global_batch_size=128 model.tensor_model_parallel_size=${TP_SIZE} model.activations_checkpoint_num_layers=1 model.pipeline_model_parallel_size=${PP_SIZE} model.megatron_amp_O2=True model.sequence_parallel=False model.activations_checkpoint_granularity=full model.activations_checkpoint_method=uniform model.optim.name=distributed_fused_adam model.optim.lr=5e-6 model.answer_only_loss=True model.data.train_ds.file_names=${TRAIN} model.data.validation_ds.file_names=${VALID} model.data.test_ds.file_names=${TEST} model.data.train_ds.concat_sampling_probabilities=${CONCAT_SAMPLING_PROBS} model.data.train_ds.max_seq_length=512 model.data.validation_ds.max_seq_length=512 model.data.train_ds.micro_batch_size=1 model.data.train_ds.global_batch_size=128 model.data.validation_ds.micro_batch_size=1 model.data.validation_ds.global_batch_size=128 model.data.test_ds.micro_batch_size=1 model.data.test_ds.global_batch_size=256 model.data.train_ds.num_workers=0 model.data.validation_ds.num_workers=0 model.data.test_ds.num_workers=0 model.data.validation_ds.metric.name=loss model.data.test_ds.metric.name=loss exp_manager.create_wandb_logger=False exp_manager.explicit_log_dir=/workspace/result exp_manager.resume_if_exists=True exp_manager.resume_ignore_no_checkpoint=True exp_manager.create_checkpoint_callback=True exp_manager.checkpoint_callback_params.monitor=validation_loss exp_manager.checkpoint_callback_params.save_best_model=False exp_manager.checkpoint_callback_params.save_nemo_on_train_end=False ++cluster_type=BCP
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Traceback (most recent call last):
File "/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_sft.py", line 236, in main
trainer.fit(model)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
call._call_and_handle_interrupt(
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
return function(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 980, in _run
results = self._run_stage()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1023, in _run_stage
self.fit_loop.run()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
self.advance()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py", line 355, in advance
self.epoch_loop.run(self._data_fetcher)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/training_epoch_loop.py", line 134, in run
self.on_advance_end()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/training_epoch_loop.py", line 249, in on_advance_end
self.val_loop.run()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py", line 181, in _decorator
return loop_run(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 122, in run
return self.on_run_end()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 258, in on_run_end
self._on_evaluation_end()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 303, in _on_evaluation_end
call._call_callback_hooks(trainer, hook_name, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 194, in _call_callback_hooks
fn(trainer, trainer.lightning_module, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 311, in on_validation_end
self._save_topk_checkpoint(trainer, monitor_candidates)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 360, in _save_topk_checkpoint
self._save_monitor_checkpoint(trainer, monitor_candidates)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 663, in _save_monitor_checkpoint
self._update_best_and_save(current, trainer, monitor_candidates)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 714, in _update_best_and_save
self._save_checkpoint(trainer, filepath)
File "/opt/NeMo/nemo/utils/callbacks/nemo_model_checkpoint.py", line 383, in _save_checkpoint
super()._save_checkpoint(trainer, filepath)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 365, in _save_checkpoint
trainer.save_checkpoint(filepath, self.save_weights_only)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1316, in save_checkpoint
self._checkpoint_connector.save_checkpoint(filepath, weights_only=weights_only, storage_options=storage_options)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 507, in save_checkpoint
self.trainer.strategy.save_checkpoint(_checkpoint, filepath, storage_options=storage_options)
File "/opt/NeMo/nemo/collections/nlp/parts/nlp_overrides.py", line 356, in save_checkpoint
dist_checkpointing.save(sharded_state_dict=checkpoint, checkpoint_dir=checkpoint_dir)
File "/opt/megatron-lm/megatron/core/dist_checkpointing/serialization.py", line 278, in save
save_config(
File "/opt/megatron-lm/megatron/core/dist_checkpointing/core.py", line 76, in save_config
with config_path.open('w') as f:
File "/opt/megatron-lm/megatron/core/dist_checkpointing/core.py", line 76, in save_config
with config_path.open('w') as f:
File "/usr/lib/python3.10/pathlib.py", line 1119, in open
return self._accessor.open(self, mode, buffering, encoding, errors,
FileNotFoundError: [Errno 2] No such file or directory: '/workspace/result/checkpoints/megatron_gpt_sft--validation_loss=1.526-step=5-consumed_samples=640.0/metadata.json'
Also, how can I adapt the Tiny Shakespeare dataset?
By default, there is no /workspace/result folder inside the NeMo container. Can you try giving an existing dir to exp_manager.explicit_log_dir?
Also, how can I adapt the Tiny Shakespeare dataset?
SFT normally requires data in the <instruction, response> style, but the dataset you mentioned is not of that type. Maybe you can use it for pretraining instead?
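For reference, a hypothetical sketch of the JSONL format the SFT data loader expects. The "input"/"output" field names follow the Llama 2 SFT playbook defaults; check your model.data.*_ds config and prompt_template for the exact keys.

```python
import json

# Each training example is one JSON object per line with an instruction-style
# prompt and its response. Field names here assume the playbook defaults.
examples = [
    {
        "input": "Continue this play in the style of Shakespeare:\nROMEO: But, soft! what light through yonder window breaks?",
        "output": "It is the east, and Juliet is the sun.",
    },
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

A plain-text corpus like Tiny Shakespeare has no such instruction/response pairs, which is why pretraining is the better fit.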
By default, there is no /workspace/result folder inside the NeMo container. Can you try giving an existing dir to exp_manager.explicit_log_dir?
I mount my current working directory as /workspace and I already have a result directory created at that path, so it should be a valid path.
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.