Support llama3 fine-tune
Why are these changes needed?
Support Llama-3 fine-tuning, extending https://github.com/lm-sys/FastChat/pull/3259. The length-1 tokenization mismatch is also fixed.
Related issue number (if applicable)
Checks
- [x] I've run `format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed.
- [x] I've made sure the relevant tests are passing (if applicable).
Currently, the tokenizer automatically adds another `bos_token_id` to the input prompt. Since the prompt already contains the `bos_token_id`, the resulting `input_ids` contain two `bos_token_id`s.
https://github.com/lm-sys/FastChat/blob/main/fastchat/train/train_with_template.py#L101C1-L107C16
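A minimal repro sketch (the model id and prompt string here are assumptions; any Llama-3 tokenizer shows the same behaviour):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
prompt = "<|begin_of_text|>Hello"  # the training prompt already spells out the BOS token

ids = tokenizer(prompt).input_ids
print(ids[0] == ids[1] == tokenizer.bos_token_id)  # True: a second BOS was prepended

ids = tokenizer(prompt, add_special_tokens=False).input_ids
print(ids.count(tokenizer.bos_token_id))           # 1: only the BOS already in the prompt
```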
Oh, I see. So we actually need to change this part of the code so that `input_ids[0][0] == input_ids[0][1] == tokenizer.bos_token_id` no longer happens?
@MrZhengXin I think we can add `add_special_tokens=False` (do not add special tokens); that works:
```python
def tokenize_conversations(conversations, tokenizer):
    input_ids = tokenizer(
        conversations,
        return_tensors="pt",
        padding="max_length",
        max_length=tokenizer.model_max_length,
        truncation=True,
        add_special_tokens=False,  # do not add special tokens; the prompt already contains the BOS
    ).input_ids
    targets = input_ids.clone()
    return input_ids, targets
```
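A quick hypothetical sanity check (the pad token and max length are normally set by the training script; they are set by hand here only for the check):

```python
tokenizer.pad_token = tokenizer.eos_token  # Llama-3's tokenizer ships without a pad token
tokenizer.model_max_length = 64

input_ids, targets = tokenize_conversations(["<|begin_of_text|>Hello"], tokenizer)
# With add_special_tokens=False, the only BOS left is the one already written in the prompt.
assert (input_ids[0] == tokenizer.bos_token_id).sum() == 1
```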
But I don't know why it automatically adds the bos_token. I checked the config file, and it is not obviously set to add the bos_token. Does anybody know why the tokenizer adds it automatically?
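One place to look might be the post-processor stored in tokenizer.json: for fast tokenizers the BOS prepending is configured there (as a TemplateProcessing entry) rather than in tokenizer_config.json, which is why it is easy to miss when reading the config. A hypothetical way to inspect it:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(tokenizer.backend_tokenizer.post_processor)              # the object that prepends <|begin_of_text|>
print(tokenizer("Hi").input_ids[0] == tokenizer.bos_token_id)  # True with the default add_special_tokens=True
```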
@MrZhengXin @Oscarjia It works, but it seems to cause another issue in the following function
(https://github.com/lm-sys/FastChat/blob/main/fastchat/train/train_with_template.py#L144).
When we use `user_turn_separator` to split the conversation, the first item is `<|begin_of_text|>` if the conversation has no system prompt. However, we have already ignored `<|begin_of_text|>` via `target[:cur_len] = IGNORE_TOKEN_ID`, where `cur_len` is 1. The first iteration of the loop then ignores it again, so `<|begin_of_text|>` is ignored twice, which causes the length-1 mismatch.
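For reference, a self-contained illustration of why the first split item is the BOS (the separator string here is an assumption, not FastChat's exact template):

```python
user_turn_separator = "<|start_header_id|>user<|end_header_id|>"
conversation = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHi<|eot_id|>"

turns = conversation.split(user_turn_separator)
print(repr(turns[0]))  # '<|begin_of_text|>' -- already masked via target[:cur_len] = IGNORE_TOKEN_ID,
                       # so masking (and counting) it again in the loop shifts cur_len by one
```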
Do you have any ideas on solving this?