
Support easy concatenation of datasets

Open tscholak opened this issue 11 months ago • 1 comments

🎯 Goal (What & Why)

To streamline experiments, we need an easy way to concatenate datasets: sampling from two or more datasets with frequencies chosen so that the result is equivalent to sampling from the concatenation of those datasets. Users should not have to specify these frequencies themselves; Fast-LLM should compute them automatically.
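The weight computation this asks Fast-LLM to automate can be sketched as follows (a minimal illustration, not Fast-LLM's actual implementation; `concatenation_weights` and the example token counts are hypothetical):

```python
# Minimal sketch (not Fast-LLM's actual API): blending weights that make
# weighted sampling equivalent to sampling from the concatenation.
# A dataset's weight is simply its share of the total size (e.g. token count).

def concatenation_weights(sizes: list[int]) -> list[float]:
    """Return weights proportional to dataset sizes, summing to 1."""
    total = sum(sizes)
    if total <= 0:
        raise ValueError("dataset sizes must sum to a positive value")
    return [size / total for size in sizes]

# Example: three datasets with 100M, 300M, and 600M tokens.
print(concatenation_weights([100, 300, 600]))  # → [0.1, 0.3, 0.6]
```

With such a helper, the manually tuned `weights` lists below could be derived from the bin/idx file sizes instead of being computed by hand.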

🚀 Execution Plan

Step 1: What is the smallest working version?

Support this only for shallow datasets that are themselves only a collection of bin/idx files.

Step 2: What additional optimizations are possible (but optional)?

Support this for any hierarchical definition of datasets.
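As a sketch of what this could look like in a config (the `concatenated` type and its automatic-weight behavior are hypothetical, not an existing Fast-LLM feature):

```yaml
training:
  type: concatenated   # hypothetical: weights derived from dataset sizes
  datasets:
    - type: file
      path: /mnt/datasets/dataset_a/fast_llm_config.yaml
    - type: file
      path: /mnt/datasets/dataset_b/fast_llm_config.yaml
  # no `weights` field: equivalent to `blended` with size-proportional weights
```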

📌 Acceptance Criteria (Must-Haves for Completion)

  • The feature must be functional and tested.
  • The implementation must be documented in practical terms.
  • The PR must include a performance/impact summary.
  • No refactors unless directly necessary for feature completion.

🛠️ Project Management

  • [x] Assign the project to the Fast-LLM project.
  • [ ] Set the Estimate field (in days) in the GitHub project.
  • [ ] Use the Size field to categorize the PR size (Small/Medium/Large).
  • [ ] Assign an owner when opening the issue.

tscholak avatar Mar 24 '25 14:03 tscholak

Here is a sample training data-mix I'm using...

      training:
        type: blended
        datasets:
          - type: file
            path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/starcoderdata/java/fast_llm_config.yaml
          - type: file
            path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/starcoderdata/javascript/fast_llm_config.yaml
          - type: file
            path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/starcoderdata/python/fast_llm_config.yaml
          - type: file
            path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/starcoderdata/sql/fast_llm_config.yaml
          - type: file
            path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-dclm/fast_llm_config.yaml
          - type: file
            path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/zyda-2/dolma-cc_crossdeduped-filtered/fast_llm_config.yaml
          - type: file
            path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/FineWeb2/deu_Latn/fast_llm_config.yaml
          - type: file
            path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/FineWeb2/fra_Latn/fast_llm_config.yaml
          - type: file
            path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/FineWeb2/ita_Latn/fast_llm_config.yaml
          - type: file
            path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/FineWeb2/nld_Latn/fast_llm_config.yaml
          - type: file
            path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/FineWeb2/por_Latn/fast_llm_config.yaml
          - type: file
            path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/FineWeb2/jpn_Jpan/fast_llm_config.yaml
          - type: file
            path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/FineWeb2/spa_Latn/fast_llm_config.yaml
          - type: file
            path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-mix-1124/flan/fast_llm_config.yaml
          - type: file
            path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-mix-1124/pes2o_fixed_parquet/fast_llm_config.yaml
          - type: file
            path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-mix-1124/wiki/fast_llm_config.yaml
          # Concatenated dolmino-wiki_rephrased_plus_no_scholar
          - type: blended
            datasets:
              - type: file
                path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/slam_stage2_additional_data/dolmino-wiki_rephrased_QA_plus/fast_llm_config.yaml
              - type: file
                path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/slam_stage2_additional_data/dolmino-wiki_rephrased_QA_with_context_plus/fast_llm_config.yaml
            weights: [0.578, 0.422]
          - type: file
            path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-mix-1124/stackexchange/fast_llm_config.yaml
          # Concatenated dolmino-mix-1124/math
          - type: blended
            datasets:
              - type: file
                path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-mix-1124/math/codesearchnet-owmfilter/fast_llm_config.yaml
              - type: file
                path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-mix-1124/math/dolmino_math_synth/fast_llm_config.yaml
              - type: file
                path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-mix-1124/math/gsm8k/main/train/fast_llm_config.yaml
              - type: file
                path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-mix-1124/math/gsm8k/socratic/train/fast_llm_config.yaml
              - type: file
                path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-mix-1124/math/mathcoder2-synthmath/ajibawa-2023/filtered-by-rule-education-college-students/fast_llm_config.yaml
              - type: file
                path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-mix-1124/math/mathcoder2-synthmath/ajibawa-2023/maths-college/fast_llm_config.yaml
              - type: file
                path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-mix-1124/math/mathcoder2-synthmath/m-a-p_Matrix/filtered-math/book_math/fast_llm_config.yaml
              - type: file
                path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-mix-1124/math/mathcoder2-synthmath/m-a-p_Matrix/filtered-math/book_science_fixed/fast_llm_config.yaml
              - type: file
                path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-mix-1124/math/metamath-owmfilter/fast_llm_config.yaml
              - type: file
                path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-mix-1124/math/tinyGSM-MIND/fast_llm_config.yaml
              - type: file
                path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/dolmino-mix-1124/math/tulu_math/fast_llm_config.yaml
            weights: [0.000174, 0.00165, 0.000127, 0.000153, 0.000565, 0.06789, 0.15655, 0.11777, 0.0080569, 0.62585, 0.0212]
          # Concatenated slam_stage2_additional_data
          - type: blended
            datasets:
              - type: file
                path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/slam_stage2_additional_data/glaive-code-assistant-v3/fast_llm_config.yaml
              - type: file
                path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/slam_stage2_additional_data/OpenMathInstruct-2/fast_llm_config.yaml
              - type: file
                path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/slam_stage2_additional_data/self-oss-instruct-sc2-exec-filter-50k/fast_llm_config.yaml
              - type: file
                path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/slam_stage2_additional_data/squad_v2_processed/fast_llm_config.yaml
              - type: file
                path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/slam_stage2_additional_data/WebInstructSub/fast_llm_config.yaml
            weights: [0.073229, 0.807006, 0.002434, 0.000917, 0.116413]
          # Concatenated nvidia-sft
          - type: blended
            datasets:
              - type: file
                path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/nvidia_trace_sft/chat/fast_llm_config.yaml
              - type: file
                path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/nvidia_trace_sft/math/fast_llm_config.yaml
              - type: file
                path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/nvidia_trace_sft/safety_fixed/fast_llm_config.yaml
              - type: file
                path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/nvidia_trace_sft/science_fixed/fast_llm_config.yaml
              - type: file
                path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/nvidia_trace_sft/traces/fast_llm_config.yaml
              - type: file
                path: /mnt/datasets/tokenized/Mistral-Nemo-Base-2407/nvidia_trace_sft/code/fast_llm_config.yaml
            weights: [0.0005, 0.2531, 0.0003, 0.0403, 0.5065, 0.1993]
        weights: [.0025, .01, .0045, .003, .476, .0145, .00357, .00357, .00357, .00357, .00357, .007, .00357, .1328, .0468, .03, .08, .0196, .1664, .075, .075]

As you can see, such configurations get long and hard to maintain.

  1. I see we have some version of concat here https://github.com/ServiceNow/Fast-LLM/blob/929c1cf91e8a2cd86c0800aac4053eb3897ffde2/fast_llm/data/dataset/config.py#L97C7-L97C32 It would be better to expose that so the manual weight calculations can be avoided.
  2. Currently we are not able to specify a split of the training dataset as validation, i.e., the older split setup no longer works and we need to create separate validation datasets. In this case the validation set would be a repeat of the same datasets used in training, and we would need to shuffle explicitly to avoid train-validation overlap. It would be good to keep the split option available by default and additionally allow specifying other validation datasets.

nitsanluke avatar Apr 25 '25 01:04 nitsanluke