Combine GPTHuggingfaceDatasetConfig input sources into `source_schema`

Open nitsanluke opened this issue 9 months ago • 0 comments

✨ Description

This PR creates a common interface for all `GPTHuggingfaceDatasetConfig` input columns through a new `source_schema` variable. Beyond the existing `field` variable, we need additional keys to preprocess and tokenize different types of datasets (e.g., SFT, combining columns). The new `source_schema` variable accommodates the preprocessing and tokenization specific to each data source. The current variables `field` and `loss_masking_spans` move into `TextColumnConfig`, which becomes one type of input/data source.
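As a rough illustration of the shape described above, here is a minimal sketch using plain dataclasses. It is not the Fast-LLM implementation: the base class `SourceSchemaSketch`, the wrapper `HuggingfaceDatasetConfigSketch`, the helper `columns_to_read`, the `path` attribute, and all default values are assumptions made for the example. Only the names `source_schema`, `TextColumnConfig`, `field`, and `loss_masking_spans` come from the description.

```python
from dataclasses import dataclass, field as dc_field
from typing import List, Optional


@dataclass
class SourceSchemaSketch:
    """Hypothetical base class: one subclass per kind of input source."""


@dataclass
class TextColumnConfig(SourceSchemaSketch):
    """Plain-text source; holds the two variables moved out of the dataset config."""

    field: str = "text"  # column containing the text (default is an assumption)
    loss_masking_spans: Optional[str] = None  # optional column with masking spans


@dataclass
class HuggingfaceDatasetConfigSketch:
    """Hypothetical stand-in for GPTHuggingfaceDatasetConfig."""

    path: str  # hypothetical dataset identifier
    # Single entry point for all source-specific preprocessing options.
    source_schema: SourceSchemaSketch = dc_field(default_factory=TextColumnConfig)


def columns_to_read(config: HuggingfaceDatasetConfigSketch) -> List[str]:
    """Return the dataset columns required by the configured source schema."""
    schema = config.source_schema
    if isinstance(schema, TextColumnConfig):
        columns = [schema.field]
        if schema.loss_masking_spans is not None:
            columns.append(schema.loss_masking_spans)
        return columns
    # Other source types (e.g. SFT or combined columns) would be handled here.
    raise NotImplementedError(f"Unsupported source schema: {type(schema).__name__}")
```

With this layout, adding a new dataset type would mean adding another schema subclass and a matching branch, while the top-level dataset config keeps a single `source_schema` entry.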

Merge after #245

🔍 Type of change

Select all that apply:

  • [ ] 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • [ ] 🚀 New feature (non-breaking change that adds functionality)
  • [ ] ⚠️ Breaking change (a change that could affect existing functionality)
  • [ ] 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • [X] 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • [ ] 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • [ ] 📝 Documentation change (updates documentation, including new content or typo fixes)
  • [ ] 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

List the key changes introduced in this PR:

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • [ ] 📜 I have read and followed the contributing guidelines.
  • [ ] 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • [ ] 🎉 The functionality is complete, and I have tested the changes.
  • [ ] 📝 I have updated the documentation if needed.
  • [ ] ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • [ ] 🧩 I have commented my code, especially in hard-to-understand areas.

Dependencies and Configuration

  • [ ] 🐋 I have updated the Docker configuration or dependencies, if applicable.
  • [ ] 🔄 I have ensured compatibility with the existing setup after dependency changes.

Testing

  • [ ] 🧪 I have added or updated tests to cover my changes.
  • [ ] ✔️ New and existing tests pass locally with my changes.
  • [ ] 🚦 I have tested these changes on GPUs and verified training stability.
  • [ ] 🏋️ I have tested the changes on realistic training workloads, if applicable.

nitsanluke • May 07 '25 19:05