Add data cleaning in fast-llm prepare, concept
## ✨ Description
part of #112
Closes #
## 🔍 Type of change
Select all that apply:
- [ ] 🐛 Bug fix (non-breaking change that addresses a specific issue)
- [ ] 🚀 New feature (non-breaking change that adds functionality)
- [ ] ⚠️ Breaking change (a change that could affect existing functionality)
- [ ] 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
- [ ] 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
- [ ] 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
- [ ] 📝 Documentation change (updates documentation, including new content or typo fixes)
- [ ] 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)
## 📝 Changes
List the key changes introduced in this PR:
- Change A
- Change B
## ✅ Checklist
Make sure the following tasks are completed before submitting the PR:
### General
- [ ] 📜 I have read and followed the contributing guidelines.
- [ ] 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
- [ ] 🎉 The functionality is complete, and I have tested the changes.
- [ ] 📝 I have updated the documentation if needed.
- [ ] ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
- [ ] 🧩 I have commented my code, especially in hard-to-understand areas.
### Dependencies and Configuration
- [ ] 🐋 I have updated the Docker configuration or dependencies, if applicable.
- [ ] 🔄 I have ensured compatibility with the existing setup after dependency changes.
### Testing
- [ ] 🧪 I have added or updated tests to cover my changes.
- [ ] ✔️ New and existing tests pass locally with my changes.
- [ ] 🚦 I have tested these changes on GPUs and verified training stability.
- [ ] 🏋️ I have tested the changes on realistic training workloads, if applicable.
### Performance Impact
- [ ] 📊 I have run benchmarks where applicable to evaluate the performance impact.
- [ ] ✅ The benchmarks show no performance regression.
- [ ] 🚀 The benchmarks indicate a potential performance improvement.
- [ ] ⚠️ The benchmarks indicate a potential performance degradation.
- [ ] 📈 I have provided benchmark results and detailed any performance impact below, if applicable.
## 📊 Performance Impact Details
If there is any impact on performance, describe it and provide benchmark results, if applicable:
## 🗒️ Additional Notes
Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns.
**Demonstration and Discussion of Concept**
This code serves as a demonstration and discussion of the proposed concept.
**Key Decisions:**

- **Avoiding `runnable`:** I skip using `runnable` for a config and `configurable` for a processor. Also, all steps will operate directly on HF `datasets.Dataset`.
- **Introducing the `Apply` interface:** A new interface, `Apply`, will be implemented. It will be HF-centric and allow applying any required processing to each dataset shard, be it `filter`, `map`, or any other step supported by `datasets`.
- **Step specification approach:** The approach used in `datasets` is followed, where step types are specified in the config file by a `type` field.
- **Aggregator class:** An aggregator class will sequentially invoke the processing steps; in the preparator config class it is called `processors`.
- **Hollow doc length filter:** Implemented an empty doc length filter step to demonstrate what a step could look like.
**Usage:**

To prepare a dataset, simply call:

```python
dataset = self._config.processors.apply(dataset)
```

The config would look something like this:

```yaml
processors:
  steps:
    - type: length_filter
      field: text
      min_length_chars: 100
      max_length_chars: 100000
    - ...
```
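To make the concept concrete, here is a minimal sketch of how the type-dispatched steps and the `processors` aggregator could fit together. All names here (`STEP_REGISTRY`, `LengthFilterStep`, `Processors`) are illustrative, not actual Fast-LLM classes, and the dataset is only assumed to expose a HF-style `.filter`:

```python
# Hypothetical sketch of the proposed Apply-style step registry.
STEP_REGISTRY = {}

def register_step(type_name):
    """Map a config `type` string to a step class."""
    def wrap(cls):
        STEP_REGISTRY[type_name] = cls
        return cls
    return wrap

class Apply:
    """Interface: apply a processing step to a HF-style dataset shard."""
    def apply(self, dataset):
        raise NotImplementedError

@register_step("length_filter")
class LengthFilterStep(Apply):
    def __init__(self, field, min_length_chars=0, max_length_chars=None):
        self.field = field
        self.min_chars = min_length_chars
        self.max_chars = max_length_chars

    def apply(self, dataset):
        # datasets.Dataset.filter takes a per-example predicate.
        def keep(example):
            n = len(example[self.field])
            if n < self.min_chars:
                return False
            return self.max_chars is None or n <= self.max_chars
        return dataset.filter(keep)

class Processors(Apply):
    """Aggregator: builds steps from config dicts and runs them in order."""
    def __init__(self, steps):
        self.steps = [STEP_REGISTRY[spec.pop("type")](**spec) for spec in steps]

    def apply(self, dataset):
        for step in self.steps:
            dataset = step.apply(dataset)
        return dataset
```

With this shape, the YAML `steps` list above maps directly to the `Processors` constructor argument.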
@jlamypoirier, @tscholak What do you think?
Hi @bigximik, thanks for putting this together. I appreciate the careful thinking you've put in here!
However, let's simplify significantly. The goal isn't to design a general, modular pipeline system. It's just about adding these very specific cleaning filters. We already know exactly what filters we want and in what order.
Here's what I'd suggest:
- Just write a single function `clean_dataset(dataset, config)` that explicitly calls each filter step in sequence.
- Each filter can be toggled with a simple boolean in the config. No need for abstract registries or complicated type-based dispatch.
- Each filter is just a simple, testable function that uses `.filter` or `.map` directly on the HuggingFace dataset.
- Let's keep the configuration as simple as possible (just booleans and thresholds).
We can always refactor if more complexity is actually required down the line, but let's get this feature shipped quickly and cleanly first.
Can you please move forward by just implementing the concrete filters directly?
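For reference, the single-function approach described above could look roughly like this. This is only a sketch: `CleaningConfig`, its field names, and the thresholds are illustrative assumptions, and the dataset is only assumed to expose a HF-style `.filter`:

```python
# Hypothetical sketch of the suggested clean_dataset(dataset, config) shape.
from dataclasses import dataclass

@dataclass
class CleaningConfig:
    # Plain booleans toggle each filter; thresholds live alongside them.
    filter_by_length: bool = True
    min_length_chars: int = 100
    filter_binary: bool = True
    max_non_printable_ratio: float = 0.1

def _long_enough(example, min_chars):
    return len(example["text"]) >= min_chars

def _not_binary(example, max_ratio):
    # Treat a document as binary if too many characters are non-printable.
    text = example["text"]
    if not text:
        return True
    bad = sum(1 for c in text if not c.isprintable() and c not in "\n\t\r")
    return bad / len(text) <= max_ratio

def clean_dataset(dataset, config):
    # Filters run in a fixed, explicit order; no registry or dispatch.
    if config.filter_by_length:
        dataset = dataset.filter(lambda ex: _long_enough(ex, config.min_length_chars))
    if config.filter_binary:
        dataset = dataset.filter(lambda ex: _not_binary(ex, config.max_non_printable_ratio))
    return dataset
```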
Created a basic implementation based on the feedback.
- Word splitting and number detection implemented using regex.
- Binary data detection is based on non-printable characters.
- Presidio Model Management: Needs proper handling of model downloading and loading. Currently, it relies on spaCy's internal package installation, which won't work in a frozen Docker image.
- ClamAV Integration: Needs a proper implementation. The ChatGPT-suggested approach does not work. A proper implementation requires database download, loading, and file scanning, either via an outdated Python wrapper or a direct approach.
- Unit tests now cover all processors except ClamAV.
- Need to implement test skipping for PII and malware detection processors.
- Integration testing with `prepare` is pending.
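For illustration, the regex-based word splitting and number detection mentioned above might look like the following. The patterns here are assumptions for the sketch, not the PR's actual regexes:

```python
# Illustrative regexes for word splitting and numeric-token detection.
import re

WORD_RE = re.compile(r"\S+")                 # whitespace-delimited tokens
NUMBER_RE = re.compile(r"^\d+([.,]\d+)?$")   # integers and simple decimals

def numeric_token_ratio(text):
    """Fraction of whitespace-delimited tokens that are purely numeric."""
    tokens = WORD_RE.findall(text)
    if not tokens:
        return 0.0
    numeric = sum(1 for t in tokens if NUMBER_RE.match(t))
    return numeric / len(tokens)
```

A filter would then compare this ratio against a configured threshold inside a `.filter` predicate.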
**Next Steps**
- Implement model download and loading management for Presidio.
- Implement multiprocessing for Presidio engine.
- Implement ClamAV database management and scanning.
- Add test skipping for specific processors.
- Verify integration with `prepare`.