
Concat prompt and completion cols for tokenizing

Opened by nitsanluke 11 months ago · 0 comments

✨ Description

Migrated from #248. This PR allows a dataset with prompt and completion columns specifically, and more generally any pair of text columns (e.g., question and answer), to be concatenated and tokenized. (Input is limited to two columns, which covers the majority of use cases.) Additionally, if the user needs a loss-masking span based on the prompt (the first part of the sequence), they can include that as well. Note: merge after #255.
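The core idea can be sketched as follows. This is a minimal illustration, not Fast-LLM's actual implementation; the function name and return shape are assumptions:

```python
# Hypothetical sketch: join two text columns with a delimiter and record the
# character span of the prompt, so that a loss mask can later exclude prompt
# tokens from the training loss.
def concat_with_span(prompt: str, completion: str, delimiter: str = " "):
    text = prompt + delimiter + completion
    # The masked span covers the prompt plus the delimiter; the loss is then
    # computed only on the completion portion of the sequence.
    masked_span = (0, len(prompt) + len(delimiter))
    return text, masked_span

text, span = concat_with_span("What is 2+2?", "4", delimiter=" ")
# text == "What is 2+2? 4", span == (0, 13)
```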

🔍 Type of change

Select all that apply:

  • [ ] 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • [X] 🚀 New feature (non-breaking change that adds functionality)
  • [ ] ⚠️ Breaking change (a change that could affect existing functionality)
  • [ ] 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • [ ] 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • [ ] 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • [ ] 📝 Documentation change (updates documentation, including new content or typo fixes)
  • [ ] 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

List the key changes introduced in this PR:

  1. Create a new data source schema for prompt & completion style datasets
  2. Add the concatenation function and loss-masking span creation

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • [ ] 📜 I have read and followed the contributing guidelines.
  • [ ] 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • [ ] 🎉 The functionality is complete, and I have tested the changes.
  • [ ] 📝 I have updated the documentation if needed.
  • [ ] ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • [ ] 🧩 I have commented my code, especially in hard-to-understand areas.

Sample config:

```yaml
loading_workers: 1
tokenize_workers: 1
saving_workers: 1
output_path: ./debug_test/gsm8k
dataset:
  path: openai/gsm8k
  config_name: main
  split: train
  trust_remote_code: true
  source_schema:
    prompt_column: question
    completion_column: answer
    delimiter: " "

tokenizer:
  path: /mnt/slam_checkpoints/upstream/Mistral-Nemo-Base-2407/
```
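A hypothetical reading of what the `source_schema` above configures (the function name is invented for illustration, not Fast-LLM's API): each row's prompt and completion columns are joined with the delimiter before tokenization.

```python
# Illustration of the assumed source_schema behavior: join the configured
# prompt and completion columns of each row with the delimiter.
def apply_source_schema(rows, prompt_column, completion_column, delimiter=" "):
    return [r[prompt_column] + delimiter + r[completion_column] for r in rows]

# Example rows shaped like openai/gsm8k (question/answer columns).
rows = [{"question": "What is 2+2?", "answer": "4"}]
texts = apply_source_schema(rows, "question", "answer", delimiter=" ")
```

The resulting strings are what would then be fed to the tokenizer configured under `tokenizer.path`.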

nitsanluke · May 08 '25 22:05