How can I add my own vocabulary when training internlm2_20b_qlora_msagent_react_e3_gpu8?
First of all, thanks to Shanghai AI Laboratory and its members for sharing the InternLM models, the code framework, and their technical experience!
How can I add my own vocabulary when training internlm2_20b_qlora_msagent_react_e3_gpu8?
For example, treating breed_name, area_name, and similar strings each as a single token.
Thanks!
One approach that requires no code changes: just add the following to the config.
```python
ADD_TOKENS_DECODER = {
    "0": {
        "content": "<unk>",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
    "1": {
        "content": "<s>",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
    "2": {
        "content": "</s>",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
    "92538": {
        "content": "<|plugin|>",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
    "92539": {
        "content": "<|interpreter|>",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
    "92540": {
        "content": "<|action_end|>",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
    "92541": {
        "content": "<|action_start|>",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
    "92542": {
        "content": "<|im_end|>",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
    "92543": {
        "content": "<|im_start|>",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
    # Add your new tokens here; just make sure the chosen ids are not already in use
    "92535": {
        "content": "breed_name",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
    "92536": {
        "content": "area_name",
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True
    },
}

tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    added_tokens_decoder=ADD_TOKENS_DECODER,
    padding_side='right')
```
Note, though, that QLoRA does not train the embedding layer by default, so I am not sure how much this affects performance.
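As a quick sanity check that the new entries took effect, you can verify that each added token encodes to a single id. A minimal sketch, assuming the ADD_TOKENS_DECODER dict from above; the checkpoint path is a placeholder:

```python
from transformers import AutoTokenizer

# Placeholder path; substitute your local InternLM2 checkpoint.
pretrained_model_name_or_path = 'internlm/internlm2-chat-20b'
tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
    added_tokens_decoder=ADD_TOKENS_DECODER,
)

for tok in ('breed_name', 'area_name'):
    ids = tokenizer.encode(tok, add_special_tokens=False)
    print(tok, '->', ids)  # should be exactly one id if the entry took effect
```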
@sxk000 @hhaAndroid After modifying the tokenizer, you also need to pass the modified tokenizer into the model's entry in the config; that way, when the model is initialized, its embedding layer and output layer are automatically expanded to match.
```diff
 #######################################################################
 #                      PART 2  Model & Tokenizer                      #
 #######################################################################
 model = dict(
+    tokenizer=tokenizer,
     ...)
```
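For intuition, this roughly amounts to adding the tokens and resizing the embeddings by hand. A minimal stand-alone sketch of the same effect, with a placeholder path; this is not necessarily how XTuner implements it internally:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

path = 'internlm/internlm2-chat-20b'  # placeholder
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(path, trust_remote_code=True)

tokenizer.add_tokens(['breed_name', 'area_name'], special_tokens=True)
# Grow the input embedding and output head to the new vocabulary size.
model.resize_token_embeddings(len(tokenizer))
```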
Define the newly added special tokens directly in the config and pass them to both the dataset and the model; the model then needs to resize its embeddings:
```python
for special_token in special_tokens:
    if special_token not in tokenizer.get_vocab():
        tokenizer.add_tokens([special_token], special_tokens=True)
# Resize the model's embedding/output layers to match the enlarged vocabulary.
model.resize_token_embeddings(len(tokenizer))
print(f'After adding special tokens, Vocabulary Size: {len(tokenizer)}')
```
> Define the new special tokens directly in the config, pass them to the dataset and the model, and have the model resize its embeddings.
Thanks! This method works!
> @sxk000 @hhaAndroid After modifying the tokenizer, you also need to pass it into the model's entry in the config, so that the embedding and output layers are expanded automatically when the model is initialized.
Thanks for the reply!
Following this method and adding tokenizer=tokenizer, I get an error saying there is no tokenizer parameter:
```
Traceback (most recent call last):
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/xtuner/tools/train.py", line 307, in <module>
    main()
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/xtuner/tools/train.py", line 303, in main
    runner.train()
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1182, in train
    self.strategy.prepare(
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 381, in prepare
    model = self.build_model(model)
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/mmengine/_strategy/base.py", line 306, in build_model
    model = MODELS.build(model)
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 232, in build_model_from_cfg
    return build_from_cfg(cfg, registry, default_args)
  File "/root/miniconda3/envs/p310xtuner/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
TypeError: SupervisedFinetune.__init__() got an unexpected keyword argument 'tokenizer'
```
How should I fix this?
> Define the new special tokens directly in the config, pass them to the dataset and the model, and have the model resize its embeddings.
A model trained with this method shows degraded results.
For example, the training data is: user: Who are you? assistant: I'm the rescuer the Monkey King sent!
After training, testing gives output like: user: Who are you? assistant: Who are you, who are you.
Without the added vocabulary, the model reproduces the training data as expected!
My procedure was: use the code below to expand the original model's vocabulary, save the tokenizer and model, and then fine-tune from the saved, vocabulary-expanded model.
```python
from transformers import AutoTokenizer, AutoModel

def new_token():
    pretrained_model_name_or_path = '/apply/model/original/internlm2-chat-20b'
    token_file = '/apply/data/finetune/token.txt'
    # One new token per line.
    with open(token_file, 'r', encoding='utf8') as f:
        token_list = f.readlines()
    token_list = ''.join(token_list).split('\n')
    print(token_list)
    # trust_remote_code=True is needed for InternLM2's custom modeling code.
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path, trust_remote_code=True)
    model = AutoModel.from_pretrained(pretrained_model_name_or_path, trust_remote_code=True)
    print('---1', tokenizer)
    for token_one in token_list:
        if token_one not in tokenizer.get_vocab():
            tokenizer.add_tokens([token_one], special_tokens=True)
    # Expand the embedding/output layers to the new vocabulary size.
    model.resize_token_embeddings(len(tokenizer))
    print('---2', tokenizer)
    tokenizer.save_pretrained(pretrained_model_name_or_path + '-new')
    model.save_pretrained(pretrained_model_name_or_path + '-new')
```
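One caveat worth noting about this procedure: the rows that resize_token_embeddings appends start out essentially untrained, which can by itself perturb outputs. A commonly used mitigation, shown here as a hedged sketch that this thread does not verify, is to initialize the new rows to the mean of the existing embeddings; old_vocab_size is a hypothetical variable recorded before add_tokens is called:

```python
import torch

# Hypothetical tweak: initialize the appended embedding rows to the mean of the
# pre-existing rows instead of leaving them at their default initialization.
# `old_vocab_size` must have been captured before tokenizer.add_tokens(...).
with torch.no_grad():
    embed = model.get_input_embeddings().weight
    num_new = len(tokenizer) - old_vocab_size
    if num_new > 0:
        embed[-num_new:] = embed[:-num_new].mean(dim=0, keepdim=True)
```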
> ... TypeError: SupervisedFinetune.__init__() got an unexpected keyword argument 'tokenizer' ... How should I fix this?
Which version of xtuner are you using?
I see you are using the QLoRA algorithm. QLoRA does not support expanding the vocabulary, because the embedding layer is not trained under QLoRA (the new parameters would stay randomly initialized). If you want to expand the vocabulary, I suggest trying full-parameter fine-tuning.
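If parameter-efficient training with added tokens is still needed, a commonly used workaround outside this thread is to make the embedding and output layers fully trainable alongside the LoRA adapters via PEFT's modules_to_save. A hedged sketch, where the module names are assumptions based on InternLM2's naming; LLaMA-style models would use embed_tokens and lm_head instead:

```python
from peft import LoraConfig

# Sketch: train the embedding and output head in full alongside the LoRA
# adapters, so the rows added for new tokens actually receive gradient updates.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    # Attention/MLP projection names assumed for InternLM2's remote code.
    target_modules=['wqkv', 'wo', 'w1', 'w2', 'w3'],
    # These layers are saved and trained in full, not adapted with LoRA.
    modules_to_save=['tok_embeddings', 'output'],
)
```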
> I see you are using the QLoRA algorithm. QLoRA does not support expanding the vocabulary ... if you want to expand the vocabulary, I suggest trying full-parameter fine-tuning.
Thanks for the explanation!
I have been busy with other things recently and could not reply sooner; my apologies!
xtuner 0.1.14.
I am using full-parameter fine-tuning; the config code is as follows:
```python
tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    padding_side='right')

model = dict(
    type=SupervisedFinetune,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        trust_remote_code=True,
        torch_dtype=torch.float16))
```
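Assuming pretrained_model_name_or_path here points at the '-new' directory saved above, a quick consistency check before launching training; a sketch using the same placeholder path as earlier:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

path = '/apply/model/original/internlm2-chat-20b-new'  # the expanded checkpoint
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(path, trust_remote_code=True)

# The embedding matrix must cover every tokenizer id, including the added tokens.
assert model.get_input_embeddings().weight.shape[0] >= len(tokenizer), \
    'embedding size does not cover the expanded vocabulary'
```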
> Define the new special tokens directly in the config and have the model resize its embeddings ... A model trained with this method shows degraded results ... I expanded the vocabulary with the code above, saved the tokenizer and model, and then fine-tuned from the expanded checkpoint.
Hello, I would like to ask two questions:
1. Does the method above mean: first expand the vocabulary and save the model, then run SFT from that saved model, with no extra steps needed during SFT itself?
2. Why does a model trained this way show degraded results? I previously added my custom special tokens directly to additional_special_tokens and added_tokens_decoder in the checkpoint's tokenizer_config.json, then trained Qwen3-4B on a multi-turn dialogue classification task where the required output labels are exactly those custom special tokens. After training, only the first turn outputs a label; every later turn outputs nothing at all, and I cannot figure out why.