
How can I add my own vocabulary when training with internlm2_20b_qlora_msagent_react_e3_gpu8?

Open sxk000 opened this issue 1 year ago • 10 comments

First of all, thank you to the Shanghai AI Laboratory and its members for sharing the InternLM models, the code framework, and your technical experience!

How can I add my own vocabulary when training with internlm2_20b_qlora_msagent_react_e3_gpu8?

For example, treating breed_name, area_name, and similar strings each as a single token.

Thanks!

sxk000 avatar Jul 03 '24 01:07 sxk000

An approach that requires no code changes: just add the following to the config.

ADD_TOKENS_DECODER = {
    "0": {
        "content": "<unk>",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
    "1": {
        "content": "<s>",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
    "2": {
        "content": "</s>",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
    "92538": {
        "content": "<|plugin|>",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
    "92539": {
        "content": "<|interpreter|>",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
    "92540": {
        "content": "<|action_end|>",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
    "92541": {
        "content": "<|action_start|>",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
    "92542": {
        "content": "<|im_end|>",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
    "92543": {
        "content": "<|im_start|>",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
    # Add your new tokens here; just make sure the chosen ids are not already in use
    "92535": {
        "content": "breed_name",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
    "92536": {
        "content": "area_name",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
}
tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    added_tokens_decoder=ADD_TOKENS_DECODER,
    padding_side='right')
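
As a quick sanity check (a minimal sketch that reuses the ADD_TOKENS_DECODER and pretrained_model_name_or_path defined above), the new entries should resolve to the intended ids and survive tokenization as single tokens:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
    added_tokens_decoder=ADD_TOKENS_DECODER,
    padding_side='right')
print(tokenizer.convert_tokens_to_ids('breed_name'))  # expected: 92535
print(tokenizer.convert_tokens_to_ids('area_name'))   # expected: 92536
print(tokenizer('breed_name').input_ids)              # 'breed_name' should stay a single token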

Note, however, that QLoRA does not train the embedding layer by default, so I'm not sure how much this will affect performance.

hhaAndroid avatar Jul 03 '24 10:07 hhaAndroid

@sxk000 @hhaAndroid After modifying the tokenizer, you also need to pass the modified tokenizer into the model's configuration in the config, so that when the model is initialized its embedding layer and output layer are automatically expanded to match.

#######################################################################
#                      PART 2  Model & Tokenizer                      #
#######################################################################
model = dict(
+   tokenizer=tokenizer,
   ...)

HIT-cwh avatar Jul 04 '24 02:07 HIT-cwh

Define the newly added special tokens directly in the config and pass them to both the dataset and the model; the model then needs resize_embedding.

        for special_token in special_tokens:
            if special_token not in tokenizer.get_vocab():
                tokenizer.add_tokens([special_token], special_tokens=True)
        print(f'After adding special tokens, Vocabulary Size: {len(tokenizer)}')

KooSung avatar Jul 05 '24 03:07 KooSung

Define the newly added special tokens directly in the config and pass them to both the dataset and the model; the model then needs resize_embedding.

        for special_token in special_tokens:
            if special_token not in tokenizer.get_vocab():
                tokenizer.add_tokens([special_token], special_tokens=True)
        print(f'After adding special tokens, Vocabulary Size: {len(tokenizer)}')

Thanks! This approach works!

sxk000 avatar Jul 08 '24 03:07 sxk000

@sxk000 @hhaAndroid After modifying the tokenizer, you also need to pass the modified tokenizer into the model's configuration in the config, so that when the model is initialized its embedding layer and output layer are automatically expanded to match.

#######################################################################
#                      PART 2  Model & Tokenizer                      #
#######################################################################
model = dict(
+   tokenizer=tokenizer,
   ...)

Thanks for the reply! After adding tokenizer=tokenizer as described, I get an error saying there is no tokenizer parameter, as follows:

Traceback (most recent call last):                                                                                                                        
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/xtuner/tools/train.py", line 307, in <module>                                       
    main()                                                                                                                                                
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/xtuner/tools/train.py", line 303, in main                                           
    runner.train()                                                                                                                                        
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1182, in train                           
    self.strategy.prepare(                                                                                                                                
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 381, in prepare                              
    model = self.build_model(model)                                                                                                                       
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/mmengine/_strategy/base.py", line 306, in build_model                               
    model = MODELS.build(model)                                                                                                                           
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build                                  
    return self.build_func(cfg, *args, **kwargs, registry=self)                                                                                           
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 232, in build_model_from_cfg            
    return build_from_cfg(cfg, registry, default_args)                                                                                                    
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg                  
    obj = obj_cls(**args)  # type: ignore                                                                                                                 
TypeError: SupervisedFinetune.__init__() got an unexpected keyword argument 'tokenizer'

How should I fix this?

sxk000 avatar Jul 09 '24 11:07 sxk000

Define the newly added special tokens directly in the config and pass them to both the dataset and the model; the model then needs resize_embedding.

        for special_token in special_tokens:
            if special_token not in tokenizer.get_vocab():
                tokenizer.add_tokens([special_token], special_tokens=True)
        print(f'After adding special tokens, Vocabulary Size: {len(tokenizer)}')

A model trained this way produces degraded results.

For example, the training data is: user: Who are you? assistant: I'm the helper the monkey sent for!

After training, testing gives output like this: user: Who are you? assistant: Who are you, who are you.

If I don't extend the vocabulary, the model reproduces the training data as expected!

My procedure was: use the code below to extend the original model's vocabulary, save the tokenizer and model, and then fine-tune from the saved, vocabulary-extended model.

from transformers import AutoTokenizer, AutoModel

def new_token():
    pretrained_model_name_or_path = '/apply/model/original/internlm2-chat-20b'
    token_file = '/apply/data/finetune/token.txt'
    with open(token_file, 'r', encoding='utf8') as f:
        token_list = f.readlines()
    token_list = ''.join(token_list).split('\n')
    print(token_list)
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path)
    model = AutoModel.from_pretrained(pretrained_model_name_or_path)
    print('---1', tokenizer)
    for token_one in token_list:
        if token_one not in tokenizer.get_vocab():
            tokenizer.add_tokens([token_one], special_tokens=True)
    model.resize_token_embeddings(len(tokenizer))
    print('---2', tokenizer)
    tokenizer.save_pretrained(pretrained_model_name_or_path + '-new')
    model.save_pretrained(pretrained_model_name_or_path + '-new')
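
One detail worth double-checking in the script above (an observation about the Transformers API, not something confirmed in this thread): AutoModel loads only the InternLM2 backbone, which has no LM head, so resize_token_embeddings there grows the input embeddings only. Loading with AutoModelForCausalLM keeps the output projection in sync and saves it together with the embeddings, roughly:

from transformers import AutoModelForCausalLM, AutoTokenizer

pretrained_model_name_or_path = '/apply/model/original/internlm2-chat-20b'
tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path, trust_remote_code=True)
tokenizer.add_tokens(['breed_name', 'area_name'], special_tokens=True)

# The *ForCausalLM class carries the output layer, so both it and the input
# embedding matrix are resized to len(tokenizer) rows in one call.
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path, trust_remote_code=True)
model.resize_token_embeddings(len(tokenizer))

tokenizer.save_pretrained(pretrained_model_name_or_path + '-new')
model.save_pretrained(pretrained_model_name_or_path + '-new')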

sxk000 avatar Jul 09 '24 11:07 sxk000

@sxk000 @hhaAndroid After modifying the tokenizer, you also need to pass the modified tokenizer into the model's configuration in the config, so that when the model is initialized its embedding layer and output layer are automatically expanded to match.

#######################################################################
#                      PART 2  Model & Tokenizer                      #
#######################################################################
model = dict(
+   tokenizer=tokenizer,
   ...)

Thanks for the reply! After adding tokenizer=tokenizer as described, I get an error saying there is no tokenizer parameter, as follows:

Traceback (most recent call last):                                                                                                                        
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/xtuner/tools/train.py", line 307, in <module>                                       
    main()                                                                                                                                                
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/xtuner/tools/train.py", line 303, in main                                           
    runner.train()                                                                                                                                        
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1182, in train                           
    self.strategy.prepare(                                                                                                                                
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 381, in prepare                              
    model = self.build_model(model)                                                                                                                       
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/mmengine/_strategy/base.py", line 306, in build_model                               
    model = MODELS.build(model)                                                                                                                           
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build                                  
    return self.build_func(cfg, *args, **kwargs, registry=self)                                                                                           
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 232, in build_model_from_cfg            
    return build_from_cfg(cfg, registry, default_args)                                                                                                    
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg                  
    obj = obj_cls(**args)  # type: ignore                                                                                                                 
TypeError: SupervisedFinetune.__init__() got an unexpected keyword argument 'tokenizer'

How should I fix this?

Which version of xtuner are you using?
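
A quick way to check whether the installed xtuner already accepts the tokenizer argument (a small diagnostic sketch, assuming xtuner is importable in the training environment):

import inspect

from xtuner.model import SupervisedFinetune

# If 'tokenizer' does not appear in this signature, the installed xtuner release
# predates passing a tokenizer into the model config, and the TypeError above is
# expected; upgrading xtuner or pre-expanding the checkpoint are the workarounds.
print(inspect.signature(SupervisedFinetune.__init__))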

HIT-cwh avatar Jul 11 '24 04:07 HIT-cwh

I see you are using the QLoRA algorithm. QLoRA does not support extending the vocabulary, because the embedding layer is not trained during QLoRA training (the newly added embedding rows stay randomly initialized). If you want to extend the vocabulary, I'd suggest trying full-parameter fine-tuning.
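
If full-parameter fine-tuning is too heavy, a possible middle ground worth exploring (not an official xtuner recipe) is plain LoRA on an un-quantized base model, with PEFT's modules_to_save keeping the embedding and output layers trainable so the newly added rows are actually learned. A sketch of the lora field for the model dict; the module names tok_embeddings and output are assumed from InternLM2's modeling code and should be verified against your checkpoint:

from peft import LoraConfig

lora = dict(
    type=LoraConfig,
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    bias='none',
    task_type='CAUSAL_LM',
    # Trainable full copies of these modules are saved alongside the LoRA weights.
    modules_to_save=['tok_embeddings', 'output'])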

HIT-cwh avatar Jul 11 '24 04:07 HIT-cwh

I see you are using the QLoRA algorithm. QLoRA does not support extending the vocabulary, because the embedding layer is not trained during QLoRA training (the newly added embedding rows stay randomly initialized). If you want to extend the vocabulary, I'd suggest trying full-parameter fine-tuning.

Thanks for the explanation!

I've been busy with other things lately and didn't reply sooner; sorry about that!

xtuner 0.1.14

I'm using full-parameter fine-tuning; the config code is as follows:

tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    padding_side='right')

model = dict(
    type=SupervisedFinetune,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        trust_remote_code=True,
        torch_dtype=torch.float16))    
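
One consistency check that might help narrow this down (a sketch; the path is assumed to be the '-new' expanded checkpoint saved by the earlier script): before launching training, the expanded checkpoint's vocab_size and the tokenizer length should agree.

from transformers import AutoConfig, AutoTokenizer

path = '/apply/model/original/internlm2-chat-20b-new'  # assumed expanded checkpoint
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
config = AutoConfig.from_pretrained(path, trust_remote_code=True)
# A mismatch here means the embedding resize and the tokenizer extension are out of sync.
print(len(tokenizer), config.vocab_size)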

sxk000 avatar Jul 29 '24 07:07 sxk000

Define the newly added special tokens directly in the config and pass them to both the dataset and the model; the model then needs resize_embedding.

        for special_token in special_tokens:
            if special_token not in tokenizer.get_vocab():
                tokenizer.add_tokens([special_token], special_tokens=True)
        print(f'After adding special tokens, Vocabulary Size: {len(tokenizer)}')

A model trained this way produces degraded results.

For example, the training data is: user: Who are you? assistant: I'm the helper the monkey sent for!

After training, testing gives output like this: user: Who are you? assistant: Who are you, who are you.

If I don't extend the vocabulary, the model reproduces the training data as expected!

My procedure was: use the code below to extend the original model's vocabulary, save the tokenizer and model, and then fine-tune from the saved, vocabulary-extended model.

from transformers import AutoTokenizer, AutoModel

def new_token():
    pretrained_model_name_or_path = '/apply/model/original/internlm2-chat-20b'
    token_file = '/apply/data/finetune/token.txt'
    with open(token_file, 'r', encoding='utf8') as f:
        token_list = f.readlines()
    token_list = ''.join(token_list).split('\n')
    print(token_list)
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path)
    model = AutoModel.from_pretrained(pretrained_model_name_or_path)
    print('---1', tokenizer)
    for token_one in token_list:
        if token_one not in tokenizer.get_vocab():
            tokenizer.add_tokens([token_one], special_tokens=True)
    model.resize_token_embeddings(len(tokenizer))
    print('---2', tokenizer)
    tokenizer.save_pretrained(pretrained_model_name_or_path + '-new')
    model.save_pretrained(pretrained_model_name_or_path + '-new')

Hi, I have two questions:

1. Does the method above mean: first extend the vocabulary and save the model, then run SFT from that saved model, with no extra steps needed during SFT?
2. Why does a model trained with this method degrade? I previously added my custom special tokens directly to additional_special_tokens and added_tokens_decoder in the checkpoint's tokenizer_config, then trained Qwen3-4B on a multi-turn dialogue classification task where the required output label is exactly my custom special token. After training, only the first turn of a conversation outputs the label; later turns output nothing at all, and I can't figure out why.

Define the newly added special tokens directly in the config and pass them to both the dataset and the model; the model then needs resize_embedding.

        for special_token in special_tokens:
            if special_token not in tokenizer.get_vocab():
                tokenizer.add_tokens([special_token], special_tokens=True)
        print(f'After adding special tokens, Vocabulary Size: {len(tokenizer)}')

A model trained this way produces degraded results.

LiziLiziok avatar Sep 19 '25 07:09 LiziLiziok