
Support llama3 fine-tune

Open MrZhengXin opened this issue 1 year ago • 6 comments

Why are these changes needed?

Support llama3 fine-tune, which extends https://github.com/lm-sys/FastChat/pull/3259. Also, the length-1 tokenization mismatch is fixed.

Related issue number (if applicable)

Checks

  • [x] I've run format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed.
  • [x] I've made sure the relevant tests are passing (if applicable).

MrZhengXin avatar Apr 27 '24 11:04 MrZhengXin

Currently, the tokenizer will automatically add another 'bos_token_id' to the input prompt. Since the prompt already contains a 'bos_token_id', the resulting input_ids contain two 'bos_token_id' tokens.

https://github.com/lm-sys/FastChat/blob/main/fastchat/train/train_with_template.py#L101C1-L107C16
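
A quick way to reproduce the double BOS (a minimal sketch, assuming the transformers library; the checkpoint name and prompt string are only illustrative):

from transformers import AutoTokenizer

# Illustrative checkpoint; any Llama 3 tokenizer shows the same behavior.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# The chat template already puts <|begin_of_text|> at the start of the prompt.
prompt = "<|begin_of_text|>Hello"

input_ids = tokenizer(prompt).input_ids
# With the default add_special_tokens=True, the tokenizer prepends another BOS,
# so the first two ids are both bos_token_id.
print(input_ids[0] == input_ids[1] == tokenizer.bos_token_id)  # expected: True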

meet-cjli avatar May 05 '24 02:05 meet-cjli

Currently, the tokenizer will automatically add another 'bos_token_id' to the input prompt. Since the prompt already contains a 'bos_token_id', the resulting input_ids contain two 'bos_token_id' tokens.

https://github.com/lm-sys/FastChat/blob/main/fastchat/train/train_with_template.py#L101C1-L107C16

Oh, I see. So we actually need to change this part of the code to avoid input_ids[0][0] == input_ids[0][1] == tokenizer.bos_token_id?
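
One possible workaround in that direction (a rough sketch only, not the change adopted in this PR, and the helper name is made up) would be to drop the duplicated leading BOS after tokenization and re-pad on the right:

import torch

def drop_duplicate_bos(input_ids: torch.Tensor, bos_token_id: int, pad_token_id: int) -> torch.Tensor:
    # If every row starts with two BOS tokens (the prompt already contained
    # <|begin_of_text|> and the tokenizer prepended another one), remove one
    # and append a pad column so the max_length shape is preserved.
    if (input_ids[:, :2] == bos_token_id).all():
        pad_col = input_ids.new_full((input_ids.size(0), 1), pad_token_id)
        input_ids = torch.cat([input_ids[:, 1:], pad_col], dim=1)
    return input_ids

The simpler route, though, is to stop the tokenizer from adding the extra BOS in the first place, as suggested below.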

MrZhengXin avatar May 05 '24 05:05 MrZhengXin

@MrZhengXin I think we can add add_special_tokens=False  # Do not add special tokens, and it works.

def tokenize_conversations(conversations, tokenizer):
    input_ids = tokenizer(
        conversations,
        return_tensors="pt",
        padding="max_length",
        max_length=tokenizer.model_max_length,
        truncation=True,
        add_special_tokens=False,  # Do not add special tokens
    ).input_ids
    targets = input_ids.clone()
    return input_ids, targets

Oscarjia avatar May 05 '24 08:05 Oscarjia

@MrZhengXin I think we can add add_special_tokens=False  # Do not add special tokens, and it works.

def tokenize_conversations(conversations, tokenizer):
    input_ids = tokenizer(
        conversations,
        return_tensors="pt",
        padding="max_length",
        max_length=tokenizer.model_max_length,
        truncation=True,
        add_special_tokens=False,  # Do not add special tokens
    ).input_ids
    targets = input_ids.clone()
    return input_ids, targets

But I don't know why it auto-adds the bos_token. I checked the config file, and it is not obviously set to auto-add the bos_token. Does anybody know why the tokenizer auto-adds the bos_token?
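
For what it's worth, with the fast Llama 3 tokenizer the BOS appears to come from the post-processor stored inside tokenizer.json, not from an add_bos_token flag in tokenizer_config.json, which is why it is easy to miss in the config. A rough way to check (assuming the transformers library; the checkpoint name is only illustrative):

import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")  # illustrative checkpoint

# The fast tokenizer's post-processor is what prepends <|begin_of_text|>.
post_processor = json.loads(tokenizer.backend_tokenizer.to_str())["post_processor"]
print(post_processor)  # the template that inserts the BOS token

# add_special_tokens=True (the default) runs that post-processor; False skips it.
print(tokenizer("hi").input_ids)                            # starts with bos_token_id
print(tokenizer("hi", add_special_tokens=False).input_ids)  # no leading bos_token_id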

Oscarjia avatar May 05 '24 08:05 Oscarjia

@MrZhengXin @Oscarjia It works, but it seems to cause another issue in the following function

(https://github.com/lm-sys/FastChat/blob/main/fastchat/train/train_with_template.py#L144)

When we use user_turn_separator to split the conversation, the first item is <|begin_of_text|> if the conversation has no system prompt. However, we have already ignored <|begin_of_text|> via target[:cur_len] = IGNORE_TOKEN_ID, where cur_len is 1, and it gets ignored again in the first iteration of the loop. So <|begin_of_text|> is ignored twice, which causes the length-1 mismatch.
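
A stripped-down sketch of that accounting (made-up names, heavily simplified, not the actual FastChat code) shows where the extra count comes from; this is presumably the length-1 mismatch the PR description mentions:

IGNORE_TOKEN_ID = -100  # label value excluded from the loss

def count_turn_tokens_sketch(conversation, target, tokenizer, user_turn_separator):
    # Simplified illustration: the real loop also masks the user/instruction
    # part of every turn, which is omitted here.
    cur_len = 1
    target[:cur_len] = IGNORE_TOKEN_ID  # the leading <|begin_of_text|> is masked once here

    for turn in conversation.split(user_turn_separator):
        if turn == "":
            continue
        # Without a system prompt, the first `turn` is just "<|begin_of_text|>",
        # so its single token is counted a second time in the loop, leaving
        # cur_len one token ahead of the real sequence length at the end.
        cur_len += len(tokenizer(turn, add_special_tokens=False).input_ids)

    return cur_len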

meet-cjli avatar May 05 '24 16:05 meet-cjli

@MrZhengXin @Oscarjia It works, but it seems to cause another issue in the following function

(https://github.com/lm-sys/FastChat/blob/main/fastchat/train/train_with_template.py#L144)

When we use user_turn_separator to split the conversation, the first item is <|begin_of_text|> if the conversation has no system prompt. However, we have already ignored <|begin_of_text|> via target[:cur_len] = IGNORE_TOKEN_ID, where cur_len is 1, and it gets ignored again in the first iteration of the loop. So <|begin_of_text|> is ignored twice, which causes the length-1 mismatch.

Do you have any ideas on solving this?

ReeceResearch avatar Jul 08 '24 13:07 ReeceResearch