How should the script be changed to finetune (few-shot) the VisualGLM-6B model with only a single GPU?
It always fails with the following error:
[2023-06-06 17:50:42,704] [INFO] [checkpointing.py:764:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2023-06-06 17:50:42,704] [INFO] [checkpointing.py:231:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
[2023-06-06 17:50:42,704] [INFO] [RANK 0] building FineTuneVisualGLMModel model ...
/opt/conda/lib/python3.7/site-packages/torch/nn/init.py:403: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
replacing layer 0 attention with lora
replacing layer 14 attention with lora
replacing chatglm linear layer with 4bit
[2023-06-06 17:51:38,573] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 1122
[2023-06-06 17:51:38,575] [ERROR] [launch.py:434:sigkill_handler] ['/opt/conda/bin/python', '-u', 'finetune_visualglm.py', '--local_rank=0', '--experiment-name', 'finetune-visualglm-6b', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '300', '--resume-dataloader', '--max_source_length', '64', '--max_target_length', '128', '--lora_rank', '10', '--layer_range', '0', '14', '--pre_seq_len', '4', '--train-data', './fewshot-data/dataset.json', '--valid-data', './fewshot-data/dataset.json', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--save-interval', '300', '--eval-interval', '10000', '--save', './checkpoints', '--split', '1', '--eval-iters', '10', '--eval-batch-size', '8', '--zero-stage', '1', '--lr', '0.00001', '--batch-size', '1', '--gradient-accumulation-steps', '4', '--skip-init', '--fp16', '--use_qlora'] exits with return code = -9
Launch command: bash finetune/finetune_visualglm_qlora.sh
The finetune_visualglm_qlora.sh script is as follows:
#! /bin/bash
NUM_WORKERS=1
NUM_GPUS_PER_WORKER=1
MP_SIZE=1
script_path=$(realpath $0)
script_dir=$(dirname $script_path)
main_dir=$(dirname $script_dir)
MODEL_TYPE="visualglm-6b"
MODEL_ARGS="--max_source_length 64
--max_target_length 128
--lora_rank 10
--layer_range 0 14
--pre_seq_len 4"
#OPTIONS_SAT="SAT_HOME=$1" #"SAT_HOME=/raid/dm/sat_models"
#OPTIONS_NCCL="NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2"
HOST_FILE_PATH="hostfile"
HOST_FILE_PATH="hostfile_single"
train_data="./fewshot-data/dataset.json"
eval_data="./fewshot-data/dataset.json"
gpt_options="
--experiment-name finetune-$MODEL_TYPE
--model-parallel-size ${MP_SIZE}
--mode finetune
--train-iters 300
--resume-dataloader
$MODEL_ARGS
--train-data ${train_data}
--valid-data ${eval_data}
--distributed-backend nccl
--lr-decay-style cosine
--warmup .02
--checkpoint-activations
--save-interval 300
--eval-interval 10000
--save "./checkpoints"
--split 1
--eval-iters 10
--eval-batch-size 8
--zero-stage 1
--lr 0.00001
--batch-size 1
--gradient-accumulation-steps 4
--skip-init
--fp16
--use_qlora
"
run_cmd="${OPTIONS_NCCL} ${OPTIONS_SAT} deepspeed --master_port 16666 --include localhost:0 --hostfile ${HOST_FILE_PATH} finetune_visualglm.py ${gpt_options}"
echo ${run_cmd}
eval ${run_cmd}
set +x
Could you please take a look? Waiting online... My single GPU has 16 GB of VRAM. Is this because the distributed launch goes through launch.py, and a single-machine, single-GPU launch isn't supported? I can't tell what the problem is from the error message.
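One hint in the log: "exits with return code = -9" means the training process was killed by signal 9 (SIGKILL), which on Linux usually points to the kernel OOM killer rather than a Python or CUDA error. A minimal sketch of why the launcher prints a negative code (the child process here is a stand-in, not the real trainer):

```python
import subprocess
import sys

# A stand-in child that kills itself with SIGKILL, the way the Linux OOM
# killer would terminate an out-of-memory training process.
child = subprocess.run(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
)

# On POSIX, subprocess reports death-by-signal as a negative return code,
# which is exactly what launch.py logs as "exits with return code = -9".
print(child.returncode)  # → -9 on Linux
```

If the host really is being OOM-killed, dmesg usually confirms it with an "Out of memory: Killed process ..." line.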
The default is already single machine, single GPU. It looks like it might be the machine running out of host memory? I'm not sure though; I haven't run into this before.
Isn't the default single machine with eight GPUs? NUM_GPUS_PER_WORKER=8, OPTIONS_NCCL="NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2"
If I'm only using a single GPU, can the OPTIONS_NCCL settings be dropped? A single card shouldn't involve NCCL communication, right?
OK, I'll try a V100 with 32 GB of VRAM first.
The NUM_GPUS variable isn't actually used; the GPU is selected by the --include localhost:0 argument.
Also, it's not VRAM that's running out but host RAM: the model is first loaded into RAM and then moved to the GPU with .cuda().
OK. In your experience, how much host RAM does a single-machine, single-GPU setup need?
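As a rough, unofficial estimate (not from the maintainers): the training log reports 7811237376 parameters, and each fp16 parameter takes 2 bytes, so the weights alone need about 14.5 GiB of host RAM before .cuda(), with checkpoint-loading overhead on top. That makes 32 GB a much safer floor than 16 GB:

```python
# Back-of-the-envelope host-RAM estimate for holding the fp16 weights of the
# ~7.8B-parameter model before they are moved to the GPU. Real peak usage is
# higher (checkpoint deserialization, temporary CPU copies, Python overhead).
n_params = 7_811_237_376   # parameter count printed in the training log
bytes_per_param = 2        # fp16
weights_gib = n_params * bytes_per_param / 2**30
print(f"{weights_gib:.1f} GiB")  # → 14.5 GiB
```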
NameError: name 'HackLinearNF4' is not defined — I retried on a machine with more RAM. Is this module missing from the code?
pip install scipy bitsandbytes
from bitsandbytes.nn import LinearNF4
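As context for the NameError above, a likely cause (a sketch of the pattern, not the actual SwissArmyTransformer source) is a guarded import: if bitsandbytes cannot be imported, the 4-bit class is never defined, and the failure only surfaces later as "name 'HackLinearNF4' is not defined". Simulated here with a deliberately nonexistent module name:

```python
# Guarded-import sketch: when the import fails, HackLinearNF4 is silently
# never defined, so later references raise NameError instead of ImportError.
try:
    from bitsandbytes_missing_for_demo.nn import LinearNF4  # stand-in for bitsandbytes

    class HackLinearNF4(LinearNF4):
        pass
except ImportError:
    pass  # swallowed -> HackLinearNF4 stays undefined

err = None
try:
    HackLinearNF4  # reference it the way the qlora code path would
except NameError as exc:
    err = str(exc)

print(err)  # → name 'HackLinearNF4' is not defined
```

Which is why installing scipy and bitsandbytes, as suggested above, makes the real import succeed and the NameError disappear.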
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/bitsandbytes/libbitsandbytes_cpu.so
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/bitsandbytes/cextension.py:33: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/cuda-11.2/lib64'), PosixPath('/usr/local/cuda-11.2/targets/x86_64-linux'), PosixPath('/usr/local/cuda-11.2/targets/x86_64-linux/lib')}
warn(msg)
CUDA SETUP: WARNING! libcuda.so not found! Do you have a CUDA driver installed? If you are on a cluster, make sure you are on a CUDA machine!
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: No GPU detected! Check your CUDA paths. Proceeding to load CPU-only library...
warn(msg)
CUDA SETUP: Loading binary /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
Traceback (most recent call last):
File "
It looks like you don't have the latest bitsandbytes installed. Did you install from the PyPI index? The Tsinghua mirror seems to have been having problems recently.
I installed from the Tsinghua mirror; the PyPI index was too slow and timed out every time I tried. Could you provide a bitsandbytes build from PyPI?
Then try the Aliyun mirror: pip install -i https://mirrors.aliyun.com/pypi/simple/
The Aliyun mirror worked; the version matches the latest on PyPI, so that problem is solved. But I've run into a new one:
replacing layer 0 attention with lora
replacing layer 14 attention with lora
replacing chatglm linear layer with 4bit
[2023-06-06 23:39:50,871] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 7811237376
[2023-06-06 23:40:00,446] [INFO] [RANK 0] global rank 0 is loading checkpoint /home/aistudio/.sat_models/visualglm-6b/1/mp_rank_00_model_states.pt
[2023-06-06 23:42:00,679] [INFO] [RANK 0] > successfully loaded /home/aistudio/.sat_models/visualglm-6b/1/mp_rank_00_model_states.pt
[2023-06-06 23:42:09,236] [INFO] [RANK 0] Try to load tokenizer from Huggingface transformers...
[2023-06-06 23:42:09,527] [INFO] [RANK 0] Cannot find THUDM/chatglm-6b from Huggingface or sat. Creating a fake tokenizer...
Traceback (most recent call last):
File "finetune_visualglm.py", line 195, in <module>
    tokenizer.encode("<img>", add_special_tokens=False)
AttributeError: 'FakeTokenizer' object has no attribute 'encode'
[2023-06-06 23:42:11,387] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 5987
[2023-06-06 23:42:11,388] [ERROR] [launch.py:320:sigkill_handler] ['/opt/conda/envs/python35-paddle120-env/bin/python', '-u', 'finetune_visualglm.py', '--local_rank=0', '--experiment-name', 'finetune-visualglm-6b', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '300', '--resume-dataloader', '--max_source_length', '64', '--max_target_length', '256', '--lora_rank', '10', '--layer_range', '0', '14', '--pre_seq_len', '4', '--train-data', './fewshot-data/dataset.json', '--valid-data', './fewshot-data/dataset.json', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--save-interval', '300', '--eval-interval', '10000', '--save', './checkpoints', '--split', '1', '--eval-iters', '10', '--eval-batch-size', '8', '--zero-stage', '1', '--lr', '0.0001', '--batch-size', '1', '--gradient-accumulation-steps', '4', '--skip-init', '--fp16', '--use_qlora'] exits with return code = 1
It looks like your tokenizer failed to load. Try running the following in python:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('THUDM/chatglm-6b', trust_remote_code=True)
and see what error it reports.
This looks like a network connection problem; could you help take a look?

tokenizer = AutoTokenizer.from_pretrained('THUDM/chatglm-6b', trust_remote_code=True)
Traceback (most recent call last):
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/transformers/utils/hub.py", line 429, in cached_file
    local_files_only=local_files_only,
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/huggingface_hub/utils/_validators.py", line 120, in _inner_fn
    return fn(*args, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/huggingface_hub/file_download.py", line 1292, in hf_hub_download
    "Connection error, and we cannot find the requested files in"
huggingface_hub.utils._errors.LocalEntryNotFoundError: Connection error, and we cannot find the requested files in the disk cache. Please try again or make sure your Internet connection is on.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/transformers/models/auto/tokenization_auto.py", line 659, in from_pretrained
    pretrained_model_name_or_path, trust_remote_code=trust_remote_code, **kwargs
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/transformers/models/auto/configuration_auto.py", line 928, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/transformers/configuration_utils.py", line 574, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/transformers/configuration_utils.py", line 641, in _get_config_dict
    _commit_hash=commit_hash,
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/transformers/utils/hub.py", line 453, in cached_file
    f"We couldn't connect to '{HUGGINGFACE_CO_RESOLVE_ENDPOINT}' to load this file, couldn't find it in the"
OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like THUDM/chatglm-6b is not the path to a directory containing a file named config.json. Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
Did you solve it? I ran into the same problem.
Download the corresponding files from huggingface, then pass the path of the downloaded files to from_pretrained.
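To make that concrete (the directory name below is a placeholder, and this is a general sketch rather than project-specific instructions): download at minimum config.json and the tokenizer files into a local folder, then call AutoTokenizer.from_pretrained('./chatglm-6b-local', trust_remote_code=True). A quick pre-check gives a clearer error than the OSError above:

```python
from pathlib import Path

def looks_like_local_model_dir(model_dir: str) -> bool:
    # from_pretrained on a local path needs at least config.json; the
    # chatglm-6b tokenizer typically also ships tokenizer_config.json
    # and ice_text.model.
    d = Path(model_dir)
    return d.is_dir() and (d / "config.json").is_file()

# Placeholder path: wherever you downloaded the THUDM/chatglm-6b files.
print(looks_like_local_model_dir("./chatglm-6b-local"))
```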
Is there a way to partially load a model this large, to get around not being able to fit it into RAM all at once?
For 'FakeTokenizer' object has no attribute 'encode': check that the sat model downloaded correctly and the path is right; you can set SAT_HOME='path to sat model'.
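A minimal sketch of that suggestion (the path below is a placeholder): export SAT_HOME before launching, or set it in the Python process before the model is built, so SwissArmyTransformer resolves the visualglm-6b checkpoint from disk instead of re-downloading (the default cache is ~/.sat_models, as seen in the checkpoint path in the log above):

```python
import os

# Placeholder path: the directory that contains the downloaded
# visualglm-6b checkpoint folder.
os.environ["SAT_HOME"] = "/data/sat_models"
print(os.environ["SAT_HOME"])
```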
So if I download the model myself, put it in a folder, and change the path to point there, that should work in theory, right?