
Help needed with RuntimeError: `<class 'models.quantization.QuantLinear'>' was not properly set up for sharding by zero.Init(). A subclass of torch.nn.Module must be defined before zero.Init() where an instance of the class is created.

Open starplatinum3 opened this issue 2 years ago • 13 comments

How do I train the quantized model? Can it be trained on a single GPU, and is changing the model path all that is needed? RuntimeError: `<class 'models.quantization.QuantLinear'>' was not properly set up for sharding by zero.Init(). A subclass of torch.nn.Module must be defined before zero.Init() where an instance of the class is created. ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3741) of binary: /j05025/condaEnvs/moss2/bin/python Traceback (most recent call last): File "/j05025/condaEnvs/moss2/bin/accelerate", line 8, in

-eval_step 200 \
> --save_step 2000

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
Traceback (most recent call last):
  File "/j05025/MOSS/finetune_moss.py", line 311, in <module>
    train(args)           
  File "/j05025/MOSS/finetune_moss.py", line 184, in train
    model = MossForCausalLM.from_pretrained(args.model_path, use_cache=False)
  File "/j05025/condaEnvs/moss2/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2276, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/j05025/condaEnvs/moss2/lib/python3.9/site-packages/transformers/utils/generic.py", line 344, in __exit__
    self.stack.__exit__(*args, **kwargs)
  File "/j05025/condaEnvs/moss2/lib/python3.9/contextlib.py", line 513, in __exit__
    raise exc_details[1]
  File "/j05025/condaEnvs/moss2/lib/python3.9/contextlib.py", line 498, in __exit__
    if cb(*exc_details):
  File "/j05025/condaEnvs/moss2/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 447, in __exit__
    self.remove_wrappers()
  File "/j05025/condaEnvs/moss2/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 511, in remove_wrappers
    raise RuntimeError(msg)
RuntimeError: `<class 'models.quantization.QuantLinear'>' was not properly set up for sharding by zero.Init(). A subclass of torch.nn.Module must be defined before zero.Init() where an instance of the class is created.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3741) of binary: /j05025/condaEnvs/moss2/bin/python
Traceback (most recent call last):
  File "/j05025/condaEnvs/moss2/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/j05025/condaEnvs/moss2/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/j05025/condaEnvs/moss2/lib/python3.9/site-packages/accelerate/commands/launch.py", line 900, in launch_command
    deepspeed_launcher(args)
  File "/j05025/condaEnvs/moss2/lib/python3.9/site-packages/accelerate/commands/launch.py", line 643, in deepspeed_launcher
    distrib_run.run(args)
  File "/j05025/condaEnvs/moss2/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/j05025/condaEnvs/moss2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/j05025/condaEnvs/moss2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
finetune_moss.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-28_05:59:13
  host      : 444d5q78sbnco-0
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3741)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
(/j05025/condaEnvs/moss2) root@444d5q78sbnco-0:/j05025/MOSS# 
cd /j05025/MOSS/
conda activate moss2
out_dir=/j05025/trainOut/MOSS
model_name=moss-moon-003-sft-int8
accelerate launch \
    --config_file ./configs/sft.yaml \
    --deepspeed_multinode_launcher standard finetune_moss.py \
    --model_path /j05025/model/fnlp/$model_name \
    --data_dir ./sft_data \
    --output_dir $out_dir/ckpts/$model_name \
    --log_dir $out_dir/train_logs/$model_name \
    --n_epochs 2 \
    --train_bsz_per_gpu 4 \
    --eval_bsz_per_gpu 4 \
    --learning_rate 0.000015 \
    --eval_step 200 \
    --save_step 2000

command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: null
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false
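For context: with zero_stage: 3 and zero3_init_flag: true in this config, the model in from_pretrained is constructed inside deepspeed.zero.Init(), which is the context manager visible in the partition_parameters.py frames of the traceback above. QuantLinear, however, is only imported dynamically while from_pretrained is already running, i.e. after that context has been entered. The toy script below is a sketch of the rule stated in the error message, not MOSS or DeepSpeed code; it assumes a single GPU and a DeepSpeed version from around this time (0.9.x, as in the pip list below), and the file name and launch command are illustrative only.

# zero_init_demo.py -- toy reproduction of the constraint named in the RuntimeError
import torch.nn as nn
import deepspeed


class DefinedBefore(nn.Module):
    """Defined before zero.Init() is entered, so DeepSpeed can wrap it for sharding."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 4)


with deepspeed.zero.Init():
    ok = DefinedBefore()  # fine: the class already existed when the context was entered

    class DefinedInside(nn.Module):
        """Created only after the context was entered, analogous to QuantLinear, which is
        imported lazily by from_pretrained(trust_remote_code=...)."""

        def __init__(self):
            super().__init__()
            self.fc = nn.Linear(4, 4)

    bad = DefinedInside()

# Leaving the `with` block is where DeepSpeed notices that DefinedInside was never set up
# for sharding and should raise the same "was not properly set up for sharding by
# zero.Init()" RuntimeError from remove_wrappers().

Run it with a launcher that sets up the distributed environment, for example: deepspeed --num_gpus 1 zero_init_demo.py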

pip list output (Package Version):


absl-py 1.4.0 accelerate 0.18.0 aiohttp 3.8.4 aiosignal 1.3.1 async-timeout 4.0.2 attrs 23.1.0 cachetools 5.3.0 certifi 2022.12.7 charset-normalizer 3.1.0 cmake 3.26.3 contourpy 1.0.7 cycler 0.11.0 datasets 2.11.0 deepspeed 0.9.1 dill 0.3.6 filelock 3.12.0 fonttools 4.39.3 frozenlist 1.3.3 fsspec 2023.4.0 google-auth 2.17.3 google-auth-oauthlib 1.0.0 grpcio 1.54.0 hjson 3.1.0 huggingface-hub 0.14.1 idna 3.4 importlib-metadata 6.6.0 importlib-resources 5.12.0 Jinja2 3.1.2 kiwisolver 1.4.4 lit 16.0.2 Markdown 3.4.3 MarkupSafe 2.1.2 matplotlib 3.7.1 mpmath 1.3.0 multidict 6.0.4 multiprocess 0.70.14 networkx 3.1 ninja 1.11.1 numpy 1.24.3 nvidia-cublas-cu11 11.10.3.66 nvidia-cuda-cupti-cu11 11.7.101 nvidia-cuda-nvrtc-cu11 11.7.99 nvidia-cuda-runtime-cu11 11.7.99 nvidia-cudnn-cu11 8.5.0.96 nvidia-cufft-cu11 10.9.0.58 nvidia-curand-cu11 10.2.10.91 nvidia-cusolver-cu11 11.4.0.1 nvidia-cusparse-cu11 11.7.4.91 nvidia-nccl-cu11 2.14.3 nvidia-nvtx-cu11 11.7.91 oauthlib 3.2.2 packaging 23.1 pandas 2.0.1 Pillow 9.5.0 pip 23.0.1 protobuf 4.22.3 psutil 5.9.5 py-cpuinfo 9.0.0 pyarrow 11.0.0 pyasn1 0.5.0 pyasn1-modules 0.3.0 pydantic 1.10.7 pyparsing 3.0.9 python-dateutil 2.8.2 pytz 2023.3 PyYAML 6.0 regex 2023.3.23 requests 2.29.0 requests-oauthlib 1.3.1 responses 0.18.0 rsa 4.9 sentencepiece 0.1.98 setuptools 66.0.0 six 1.16.0 sympy 1.11.1 tensorboard 2.12.2 tensorboard-data-server 0.7.0 tensorboard-plugin-wit 1.8.1 tokenizers 0.13.3 torch 2.0.0 tqdm 4.65.0 transformers 4.25.1 triton 2.0.0 typing_extensions 4.5.0 tzdata 2023.3 urllib3 1.26.15 Werkzeug 2.3.1 wheel 0.38.4 xxhash 3.2.0 yarl 1.9.2 zipp 3.15.0

Document: -eval_step 200.note Link: http://note.youdao.com/noteshare?id=85ca614740ef8e0e04432cf34f99d12b&sub=B02848659F4F407AB4893C6D9903986F

starplatinum3 avatar Apr 28 '23 06:04 starplatinum3

I ran into this problem too. I want to train the quantized model but hit this error. Did you manage to solve it?

KickyGong avatar May 06 '23 06:05 KickyGong

I ran into this problem too.

sunyi123 avatar May 10 '23 09:05 sunyi123

I ran into the same problem while trying to train the quantized model, with the error:

RuntimeError: `<class 'transformers_modules.local.quantization.QuantLinear'>' was not properly set up for sharding by zero.Init(). A subclass of torch.nn.Module must be defined before zero.Init() where an instance of the class is created.

lhtpluto avatar May 12 '23 12:05 lhtpluto

I hit the same problem and am looking forward to an answer.

WenjingBao avatar May 15 '23 03:05 WenjingBao

I solved it on my end: when I changed the --model_name_or_path line in run.sh to a local path, I had forgotten to prefix the path with ./ . Adding it made the error go away...

WenjingBao avatar May 16 '23 08:05 WenjingBao

> I solved it on my end: when I changed the --model_name_or_path line in run.sh to a local path, I had forgotten to prefix the path with ./ . Adding it made the error go away...

Nice, so the quantized model can be trained on a single GPU? Which quantized model did you train, and on which GPU?

KickyGong avatar May 16 '23 08:05 KickyGong

> I solved it on my end: when I changed the --model_name_or_path line in run.sh to a local path, I had forgotten to prefix the path with ./ . Adding it made the error go away...
>
> Nice, so the quantized model can be trained on a single GPU? Which quantized model did you train, and on which GPU?

It should be possible; I'm still working through other bugs that came up afterwards...

WenjingBao avatar May 16 '23 09:05 WenjingBao

I just realized that what I did earlier does not actually solve the problem; it only cleared the cache, which made a new bug show up earlier (facepalm).

The problem seems to be in the QuantLinear class around line 295 of ./models/quantization.py.

There is also a similar issue in the DeepSpeed repo, DeepSpeed/issues/2812, but the solution discussed there looks fairly involved, so I need to keep digging.

WenjingBao avatar May 16 '23 09:05 WenjingBao

This time it is really solved. I created a separate conda env for finetuning and installed the following packages:

pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
pip install pandas accelerate==0.17.1 numpy==1.24.2 regex==2022.10.31 tqdm==4.64.1 transformers==4.25.1 deepspeed tensorboard
conda install jupyterlab=3.5.3 -c conda-forge

Then run

accelerate test --config_file ./configs/sft.yaml

to generate the cache, and manually copy the five .py files from ./models into ~/.cache/huggingface/modules/transformers_modules/local/ (if you are not on Linux, look up where your system keeps this cache and put them there). The model now loads successfully and starts reading the data (although right after that it fails with train.jsonl not found).
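For reference, the copy step above might look like this on Linux (the cache path is the default Hugging Face location and may differ on your machine):

mkdir -p ~/.cache/huggingface/modules/transformers_modules/local
# copies the .py files shipped in ./models (the five files mentioned above)
cp ./models/*.py ~/.cache/huggingface/modules/transformers_modules/local/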

Hope this helps everyone.

WenjingBao avatar May 18 '23 09:05 WenjingBao

For the missing train.jsonl, see #282: https://github.com/OpenLMLab/MOSS/issues/282

lhtpluto avatar May 18 '23 09:05 lhtpluto

> This time it is really solved. I created a separate conda env for finetuning and installed the following packages:
>
> pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
> pip install pandas accelerate==0.17.1 numpy==1.24.2 regex==2022.10.31 tqdm==4.64.1 transformers==4.25.1 deepspeed tensorboard
> conda install jupyterlab=3.5.3 -c conda-forge
>
> Then run
>
> accelerate test --config_file ./configs/sft.yaml
>
> to generate the cache, and manually copy the five .py files from ./models into ~/.cache/huggingface/modules/transformers_modules/local/ (if you are not on Linux, look up where your system keeps this cache and put them there). The model now loads successfully and starts reading the data (although right after that it fails with train.jsonl not found).
>
> Hope this helps everyone.

One more note: I just found that running it this way fails again with "cannot find class" after the Jupyter kernel is reset. The fix is to delete

~/.cache/huggingface/modules/transformers_modules/__pycache__/
~/.cache/huggingface/modules/transformers_modules/local/__pycache__/

before rerunning.
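In other words, something along these lines before rerunning (default Linux cache location assumed):

rm -rf ~/.cache/huggingface/modules/transformers_modules/__pycache__ \
       ~/.cache/huggingface/modules/transformers_modules/local/__pycache__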

WenjingBao avatar May 18 '23 10:05 WenjingBao

Found example code for multi-GPU / single-GPU int8 finetuning:

https://github.com/yangzhipeng1108/moss-finetune-and-moss-finetune-int8

WenjingBao avatar May 22 '23 08:05 WenjingBao

> This time it is really solved. I created a separate conda env for finetuning and installed the following packages: pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117 pip install pandas accelerate==0.17.1 numpy==1.24.2 regex==2022.10.31 tqdm==4.64.1 transformers==4.25.1 deepspeed tensorboard conda install jupyterlab=3.5.3 -c conda-forge Then run accelerate test --config_file ./configs/sft.yaml to generate the cache, and manually copy the five .py files from ./models into ~/.cache/huggingface/modules/transformers_modules/local/ (if you are not on Linux, look up where your system keeps this cache and put them there). The model now loads successfully and starts reading the data (although right after that it fails with train.jsonl not found). Hope this helps everyone.
>
> One more note: I just found that running it this way fails again with "cannot find class" after the Jupyter kernel is reset. The fix is to delete ~/.cache/huggingface/modules/transformers_modules/__pycache__/ and ~/.cache/huggingface/modules/transformers_modules/local/__pycache__/ before rerunning.

Hi, after making these changes, do the five .py files under ./models get regenerated? I followed this method but still get the earlier error, even after deleting __pycache__.

summershape avatar May 24 '23 12:05 summershape