Add tokenizer's chat template to SFT
Description
- This PR standardizes how we format conversational data for SFT. Our previous custom chat template was causing issues during fine-tuning of student model for distillation because the distillation data generation script uses the tokenizer's specific chat template. To ensure consistency and avoid unexpected results, we're modifying the SFT code to also adopt the tokenizer's chat template. Here is an example of the chat template for Deepseek model: https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite-Chat#chat-completion.
- This PR also reduces the golden data file size from 75M to 18M. Will change it to use default model in a follow-up PR.
Notice 1: Once all tests pass, the "pull ready" label will automatically be assigned. This label is used for administrative purposes. Please do not add it manually.
Notice 2: For external contributions, our settings currently require an approval from a MaxText maintainer to trigger CI tests.
Tests
- Unit tests, integration tests and E2E tests.
Checklist
Before submitting this PR, please make sure (put X in square brackets):
- [x] I have performed a self-review of my code.
- [x] I have necessary comments in my code, particularly in hard-to-understand areas.
- [x] I have run end-to-end tests tests and provided workload links above if applicable.
- [x] I have made or will make corresponding changes to the doc if needed.
we're modifying the SFT code to also adopt the tokenizer's chat template
Could you add the template to the PR description? It would help for context while reviewing
The template is different for each model. I will add an example maybe for context.
we're modifying the SFT code to also adopt the tokenizer's chat template
Could you add the template to the PR description? It would help for context while reviewing
The template is different for each model. I will add an example maybe for context.
+1 thank you and that would be very helpful. Something like (1) a raw data example from dataset; (2) transformed prompt after using a specific tokenizer's apply_chat_template.