# Support easy concatenation of datasets
## 🎯 Goal (What & Why)
To streamline experiments, we need an easy way to concatenate datasets: sampling from two or more datasets with frequencies chosen so that it is equivalent to sampling from the concatenation of those datasets. Users should not have to specify the frequencies themselves; Fast-LLM should compute them automatically.
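Concretely, "concatenation-equivalent" weights are just the dataset sizes normalized to sum to one. A minimal sketch of the computation (the function name is hypothetical, not the actual Fast-LLM API; sizes would come from the bin/idx index files):

```python
def concatenation_weights(sizes: list[int]) -> list[float]:
    """Blend weights that make weighted sampling equivalent to sampling
    uniformly from the concatenation of the datasets.

    `sizes` are the dataset lengths, e.g. token or document counts.
    """
    total = sum(sizes)
    if total <= 0:
        raise ValueError("Datasets must contain at least one sample.")
    return [size / total for size in sizes]


# A 100M-token dataset blended with a 300M-token dataset:
print(concatenation_weights([100_000_000, 300_000_000]))  # [0.25, 0.75]
```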
## 🚀 Execution Plan

### Step 1: What is the smallest working version?

Support this only for shallow datasets that are themselves just a collection of bin/idx files.

### Step 2: What additional optimizations are possible (but optional)?

Support this for any hierarchical definition of datasets.
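For illustration, the user-facing config could look something like the sketch below. The `concatenated` type name and schema are hypothetical, chosen only to show the intent; paths are elided:

```yaml
training:
  # Hypothetical "concatenated" type: Fast-LLM would read the size of each
  # child dataset from its bin/idx files and derive the blend weights itself.
  type: concatenated
  datasets:
    - type: file
      path: /mnt/datasets/tokenized/.../starcoderdata/java/fast_llm_config.yaml
    - type: file
      path: /mnt/datasets/tokenized/.../starcoderdata/python/fast_llm_config.yaml
  # No weights field needed: weights are implied by the dataset sizes.
```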
## 📌 Acceptance Criteria (Must-Haves for Completion)
- The feature must be functional and tested.
- The implementation must be documented in practical terms.
- The PR must include a performance/impact summary.
- No refactors unless directly necessary for feature completion.
## 🛠️ Project Management

- [x] Assign the project to the Fast-LLM project.
- [ ] Set the `Estimate` field (in days) in the GitHub project.
- [ ] Use the `Size` field to categorize the PR size (Small/Medium/Large).
- [ ] Assign an owner when opening the issue.
Here is a sample training data mix I'm using:
```yaml
training:
  type: blended
  datasets:
    - type: file
      path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/starcoderdata/java/fast_llm_config.yaml
    - type: file
      path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/starcoderdata/javascript/fast_llm_config.yaml
    - type: file
      path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/starcoderdata/python/fast_llm_config.yaml
    - type: file
      path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/starcoderdata/sql/fast_llm_config.yaml
    - type: file
      path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-dclm/fast_llm_config.yaml
    - type: file
      path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/zyda-2/dolma-cc_crossdeduped-filtered/fast_llm_config.yaml
    - type: file
      path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/FineWeb2/deu_Latn/fast_llm_config.yaml
    - type: file
      path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/FineWeb2/fra_Latn/fast_llm_config.yaml
    - type: file
      path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/FineWeb2/ita_Latn/fast_llm_config.yaml
    - type: file
      path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/FineWeb2/nld_Latn/fast_llm_config.yaml
    - type: file
      path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/FineWeb2/por_Latn/fast_llm_config.yaml
    - type: file
      path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/FineWeb2/jpn_Jpan/fast_llm_config.yaml
    - type: file
      path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/FineWeb2/spa_Latn/fast_llm_config.yaml
    - type: file
      path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-mix-1124/flan/fast_llm_config.yaml
    - type: file
      path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-mix-1124/pes2o_fixed_parquet/fast_llm_config.yaml
    - type: file
      path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-mix-1124/wiki/fast_llm_config.yaml
    # Concatenated dolmino-wiki_rephrased_plus_no_scholar
    - type: blended
      datasets:
        - type: file
          path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/slam_stage2_additional_data/dolmino-wiki_rephrased_QA_plus/fast_llm_config.yaml
        - type: file
          path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/slam_stage2_additional_data/dolmino-wiki_rephrased_QA_with_context_plus/fast_llm_config.yaml
      weights: [0.578, 0.422]
    - type: file
      path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-mix-1124/stackexchange/fast_llm_config.yaml
    # Concatenated dolmino-mix-1124/math
    - type: blended
      datasets:
        - type: file
          path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-mix-1124/math/codesearchnet-owmfilter/fast_llm_config.yaml
        - type: file
          path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-mix-1124/math/dolmino_math_synth/fast_llm_config.yaml
        - type: file
          path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-mix-1124/math/gsm8k/main/train/fast_llm_config.yaml
        - type: file
          path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-mix-1124/math/gsm8k/socratic/train/fast_llm_config.yaml
        - type: file
          path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-mix-1124/math/mathcoder2-synthmath/ajibawa-2023/filtered-by-rule-education-college-students/fast_llm_config.yaml
        - type: file
          path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-mix-1124/math/mathcoder2-synthmath/ajibawa-2023/maths-college/fast_llm_config.yaml
        - type: file
          path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-mix-1124/math/mathcoder2-synthmath/m-a-p_Matrix/filtered-math/book_math/fast_llm_config.yaml
        - type: file
          path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-mix-1124/math/mathcoder2-synthmath/m-a-p_Matrix/filtered-math/book_science_fixed/fast_llm_config.yaml
        - type: file
          path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-mix-1124/math/metamath-owmfilter/fast_llm_config.yaml
        - type: file
          path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-mix-1124/math/tinyGSM-MIND/fast_llm_config.yaml
        - type: file
          path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-mix-1124/math/tulu_math/fast_llm_config.yaml
      weights: [0.000174, 0.00165, 0.000127, 0.000153, 0.000565, 0.06789, 0.15655, 0.11777, 0.0080569, 0.62585, 0.0212]
    # Concatenated slam_stage2_additional_data
    - type: blended
      datasets:
        - type: file
          path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/slam_stage2_additional_data/glaive-code-assistant-v3/fast_llm_config.yaml
        - type: file
          path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/slam_stage2_additional_data/OpenMathInstruct-2/fast_llm_config.yaml
        - type: file
          path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/slam_stage2_additional_data/self-oss-instruct-sc2-exec-filter-50k/fast_llm_config.yaml
        - type: file
          path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/slam_stage2_additional_data/squad_v2_processed/fast_llm_config.yaml
        - type: file
          path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/slam_stage2_additional_data/WebInstructSub/fast_llm_config.yaml
      weights: [0.073229, 0.807006, 0.002434, 0.000917, 0.116413]
    # Concatenated nvidia-sft
    - type: blended
      datasets:
        - type: file
          path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/nvidia_trace_sft/chat/fast_llm_config.yaml
        - type: file
          path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/nvidia_trace_sft/math/fast_llm_config.yaml
        - type: file
          path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/nvidia_trace_sft/safety_fixed/fast_llm_config.yaml
        - type: file
          path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/nvidia_trace_sft/science_fixed/fast_llm_config.yaml
        - type: file
          path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/nvidia_trace_sft/traces/fast_llm_config.yaml
        - type: file
          path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/nvidia_trace_sft/code/fast_llm_config.yaml
      weights: [0.0005, 0.2531, 0.0003, 0.0403, 0.5065, 0.1993]
  weights: [.0025, .01, .0045, .003, .476, .0145, .00357, .00357, .00357, .00357, .00357, .007, .00357, .1328, .0468, .03, .08, .0196, .1664, .075, .075]
```
As you can see, the config gets long and hard to maintain.
- I see we already have some version of concatenation here: https://github.com/ServiceNow/Fast-LLM/blob/929c1cf91e8a2cd86c0800aac4053eb3897ffde2/fast_llm/data/dataset/config.py#L97C7-L97C32. It would be better to expose that in the config so the manual weight calculations can be avoided.
- Currently we are not able to designate a split of the training dataset as validation, i.e. the older split setup no longer works and we have to create separate validation datasets. In this case that means repeating the same datasets used in training and explicitly shuffling to avoid train-validation overlap. It would be good to keep the split option available by default and, in addition, allow specifying extra validation datasets.
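As a sketch of what that could look like (the `split` and `validation_extra` field names are hypothetical, not an existing Fast-LLM schema; paths are elided):

```yaml
data:
  datasets:
    training:
      type: blended
      # ... the training mix as above ...
  # Hypothetical: carve a validation fraction out of the training data,
  # with the library handling shuffling to avoid train-validation overlap.
  split: [0.99, 0.01]
  # Hypothetical: extra, independently specified validation sets.
  validation_extra:
    - type: file
      path: /mnt/datasets/tokenized/.../held_out/fast_llm_config.yaml
```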