
[BUG] Mapping the dataset with num_proc = 2 or 4 raises an error

Open nicosouth opened this issue 1 year ago • 8 comments

Running tokenizer on dataset (num_proc=2):   0%|          | 0/666 [00:00<?, ? examples/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/data/mnt/LMFlow-20240514/examples/finetune.py", line 61, in <module>
[rank0]:     main()
[rank0]:   File "/data/mnt/LMFlow-20240514/examples/finetune.py", line 57, in main
[rank0]:     tuned_model = finetuner.tune(model=model, dataset=dataset)
[rank0]:   File "/data/mnt/LMFlow-20240514/src/lmflow/pipeline/finetuner.py", line 237, in tune
[rank0]:     tokenized_dataset = model.tokenize(dataset)
[rank0]:   File "/data/mnt/LMFlow-20240514/src/lmflow/models/hf_decoder_model.py", line 622, in tokenize
[rank0]:     tokenized_datasets = raw_datasets.map(
[rank0]:   File "/data/mnt/LMFlow-20240514/src/lmflow/datasets/dataset.py", line 371, in map
[rank0]:     mapped_backend_dataset = self.backend_dataset.map(*args, **kwargs)
[rank0]:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
[rank0]:     out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
[rank0]:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
[rank0]:     out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
[rank0]:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3189, in map
[rank0]:     for rank, done, content in iflatmap_unordered(
[rank0]:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1394, in iflatmap_unordered
[rank0]:     [async_result.get(timeout=0.05) for async_result in async_results]
[rank0]:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1394, in <listcomp>
[rank0]:     [async_result.get(timeout=0.05) for async_result in async_results]
[rank0]:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/multiprocess/pool.py", line 771, in get
[rank0]:     raise self._value
[rank0]:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/multiprocess/pool.py", line 537, in _handle_tasks
[rank0]:     put(task)
[rank0]:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/multiprocess/connection.py", line 214, in send
[rank0]:     self._send_bytes(_ForkingPickler.dumps(obj))
[rank0]:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/multiprocess/reduction.py", line 54, in dumps
[rank0]:     cls(buf, protocol, *args, **kwds).dump(obj)
[rank0]:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/dill/_dill.py", line 498, in dump
[rank0]:     StockPickler.dump(self, obj)
[rank0]:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 487, in dump
[rank0]:     self.save(obj)
[rank0]:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
[rank0]:     f(self, obj)  # Call unbound method with explicit self
[rank0]:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 901, in save_tuple
[rank0]:     save(element)
[rank0]:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
[rank0]:     f(self, obj)  # Call unbound method with explicit self
[rank0]:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 886, in save_tuple
[rank0]:     save(element)
[rank0]:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
[rank0]:     f(self, obj)  # Call unbound method with explicit self
[rank0]:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/dill/_dill.py", line 990, in save_module_dict
[rank0]:     StockPickler.save_dict(pickler, obj)
[rank0]:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 971, in save_dict
[rank0]:     self._batch_setitems(obj.items())
[rank0]:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 997, in _batch_setitems
[rank0]:     save(v)
[rank0]:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
[rank0]:     f(self, obj)  # Call unbound method with explicit self
[rank0]:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/dill/_dill.py", line 1493, in save_function
[rank0]:     pickler.save_reduce(_create_function, (obj.__code__,
[rank0]:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 692, in save_reduce
[rank0]:     save(args)
[rank0]:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
[rank0]:     f(self, obj)  # Call unbound method with explicit self
[rank0]:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 901, in save_tuple
[rank0]:     save(element)
[rank0]:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
[rank0]:     f(self, obj)  # Call unbound method with explicit self
[rank0]:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 901, in save_tuple
[rank0]:     save(element)
[rank0]:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
[rank0]:     f(self, obj)  # Call unbound method with explicit self
[rank0]:   File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/dill/_dill.py", line 1226, in save_cell
[rank0]:     f = obj.cell_contents
[rank0]: ValueError: Cell is empty

nicosouth May 22 '24 03:05
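
The last frame of the traceback fails inside dill's `save_cell` while reading `obj.cell_contents`. As a minimal sketch in plain CPython, independent of LMFlow and `datasets`, the same `ValueError` appears whenever a closure cell is read before the enclosing scope has assigned the variable. The names `empty_cell_message`, `inner`, and `late_value` are made up for illustration, and this says nothing about which LMFlow function actually triggered the error.

```python
def empty_cell_message():
    """Return the error text raised when reading a closure cell with no value yet."""

    def inner():
        # `late_value` is a free variable here, so CPython creates a closure cell for it.
        return late_value

    cell = inner.__closure__[0]
    try:
        cell.cell_contents  # the cell exists, but nothing has been stored in it yet
    except ValueError as exc:
        message = str(exc)
    else:
        message = None

    late_value = 1  # the assignment that would eventually fill the cell
    return message


print(empty_cell_message())  # prints "Cell is empty"
```

Any serializer that walks a function's closure at that moment, as dill does when `num_proc` is greater than 1 and the function has to be sent to worker processes, would hit the same exception.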

Thanks for your interest in LMFlow! Could you please provide your .sh script? Also, what kind of dataset are you using?

wheresmyhair May 22 '24 09:05

OK, this is my script. I just added "--preprocessing_num_workers 4".

"""""""""
model_name_or_path=/home/llm/model/Qwen1.5-1.8B
dataset_path=/home/llm/data/text_test/
output_dir=/home/llm/model/output_models/finetune
conversation_template=empty
trust_remote_code=True

while [[ $# -ge 1 ]]; do
  key="$1"
  case ${key} in
    -m|--model_name_or_path)
      model_name_or_path="$2"
      shift
      ;;
    -d|--dataset_path)
      dataset_path="$2"
      shift
      ;;
    -o|--output_model_path)
      output_dir="$2"
      shift
      ;;
    --conversation_template)
      conversation_template="$2"
      shift
      ;;
    --deepspeed_args)
      deepspeed_args="$2"
      shift
      ;;
    --trust_remote_code)
      trust_remote_code="$2"
      shift
      ;;
    *)
      echo "error: unknown option \"${key}\"" 1>&2
      exit 1
  esac
  shift
done

deepspeed --include="localhost:5" --master_port=11999 \
  examples/finetune.py \
    --model_name_or_path ${model_name_or_path} \
    --trust_remote_code ${trust_remote_code} \
    --dataset_path ${dataset_path} \
    --output_dir ${output_dir} \
    --conversation_template ${conversation_template} \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --disable_group_texts 1 \
    --block_size 1024 \
    --per_device_train_batch_size 1 \
    --deepspeed configs/ds_config_zero0.json \
    --bf16 \
    --run_name finetune \
    --validation_split_percentage 0 \
    --logging_steps 20 \
    --do_train \
    --ddp_timeout 72000 \
    --save_steps 5000 \
    --dataloader_num_workers 1 \
    --preprocessing_num_workers 4 \
    | tee ${log_dir}/train.log \
    2> ${log_dir}/train.err
"""""""""

I use the ShuSheng dataset and convert the data into the format required by LMFlow.

Thank you!

nicosouth May 22 '24 09:05

> I use the ShuSheng dataset and convert the data into the format required by LMFlow.

What's the type of that dataset? Is it text_only, text2text, or conversation?

wheresmyhair May 22 '24 11:05

>> I use the ShuSheng dataset and convert the data into the format required by LMFlow.
>
> What's the type of that dataset? Is it text_only, text2text, or conversation?

It's text_only.

nicosouth May 22 '24 11:05

>>> I use the ShuSheng dataset and convert the data into the format required by LMFlow.
>>
>> What's the type of that dataset? Is it text_only, text2text, or conversation?
>
> It's text_only.

We have reproduced this bug and are working on a fix. For now, please finetune with --preprocessing_num_workers 1. Sorry for the inconvenience 🙏 If you have any other questions, please feel free to leave a comment.

wheresmyhair May 22 '24 13:05
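
Until a fix is available, one way to check whether a mapping function is likely to hit this error is to look for empty closure cells before requesting several workers. The sketch below is only a diagnostic under stated assumptions, not the fix adopted by LMFlow: `has_empty_cell`, `choose_num_proc`, and `build_tokenize_fn` are hypothetical names, and the check is shallow, since it does not follow functions reached through globals, which dill also serializes.

```python
def has_empty_cell(fn):
    """Return True if any closure cell of `fn` holds no value (shallow check)."""
    for cell in getattr(fn, "__closure__", None) or ():
        try:
            cell.cell_contents
        except ValueError:
            # Reading an unfilled cell raises ValueError("Cell is empty").
            return True
    return False


def choose_num_proc(fn, requested):
    """Hypothetical helper: fall back to a single process if `fn` has an empty cell."""
    return 1 if has_empty_cell(fn) else requested


def build_tokenize_fn():
    """Toy setup in which the mapped function refers to a name assigned too late."""

    def tokenize(example):
        return helper(example)

    # `helper` has not been assigned yet, so its closure cell is still empty here.
    workers = choose_num_proc(tokenize, requested=4)
    helper = str.upper
    return workers


print(build_tokenize_fn())        # falls back to 1
print(choose_num_proc(len, 4))    # no closure at all, keeps 4
```

With a single process the function never needs to be pickled for worker processes, which is presumably why `--preprocessing_num_workers 1` avoids the error.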

Thank you for your contributions.

nicosouth May 24 '24 03:05

> Thank you for your contributions.

FYI: We've located the bug, and the dev team needs to do a small-scale refactoring to fix it. We will do this ASAP. Sorry for the inconvenience 🙏

wheresmyhair May 30 '24 03:05

> Thank you for your contributions.

FYI: Bug fixed, please see https://github.com/OptimalScale/LMFlow/pull/845 🤗

wheresmyhair May 31 '24 02:05