maxtext icon indicating copy to clipboard operation
maxtext copied to clipboard

Add tokenizer's chat template to SFT

Open SurbhiJainUSC opened this issue 8 months ago • 1 comments

Description

  • This PR standardizes how we format conversational data for SFT. Our previous custom chat template was causing issues during fine-tuning of student model for distillation because the distillation data generation script uses the tokenizer's specific chat template. To ensure consistency and avoid unexpected results, we're modifying the SFT code to also adopt the tokenizer's chat template. Here is an example of the chat template for Deepseek model: https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite-Chat#chat-completion.
  • This PR also reduces the golden data file size from 75M to 18M. Will change it to use default model in a follow-up PR.

Notice 1: Once all tests pass, the "pull ready" label will automatically be assigned. This label is used for administrative purposes. Please do not add it manually.

Notice 2: For external contributions, our settings currently require an approval from a MaxText maintainer to trigger CI tests.

Tests

  • Unit tests, integration tests and E2E tests.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • [x] I have performed a self-review of my code.
  • [x] I have necessary comments in my code, particularly in hard-to-understand areas.
  • [x] I have run end-to-end tests tests and provided workload links above if applicable.
  • [x] I have made or will make corresponding changes to the doc if needed.

SurbhiJainUSC avatar May 15 '25 18:05 SurbhiJainUSC

we're modifying the SFT code to also adopt the tokenizer's chat template

Could you add the template to the PR description? It would help for context while reviewing

The template is different for each model. I will add an example maybe for context.

SurbhiJainUSC avatar May 16 '25 22:05 SurbhiJainUSC

we're modifying the SFT code to also adopt the tokenizer's chat template

Could you add the template to the PR description? It would help for context while reviewing

The template is different for each model. I will add an example maybe for context.

+1 thank you and that would be very helpful. Something like (1) a raw data example from dataset; (2) transformed prompt after using a specific tokenizer's apply_chat_template.

hengtaoguo avatar May 18 '25 05:05 hengtaoguo