Add data cleaning in fast-llm prepare, concept
## ✨ Description
part of #112
Closes #
## 🔍 Type of change
Select all that apply:
- [ ] 🐛 Bug fix (non-breaking change that addresses a specific issue)
- [ ] 🚀 New feature (non-breaking change that adds functionality)
- [ ] ⚠️ Breaking change (a change that could affect existing functionality)
- [ ] 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
- [ ] 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
- [ ] 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
- [ ] 📝 Documentation change (updates documentation, including new content or typo fixes)
- [ ] 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)
## 📝 Changes
List the key changes introduced in this PR:
- Change A
- Change B
## ✅ Checklist
Make sure the following tasks are completed before submitting the PR:
### General
- [ ] 📜 I have read and followed the contributing guidelines.
- [ ] 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
- [ ] 🎉 The functionality is complete, and I have tested the changes.
- [ ] 📝 I have updated the documentation if needed.
- [ ] ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
- [ ] 🧩 I have commented my code, especially in hard-to-understand areas.
### Dependencies and Configuration
- [ ] 🐋 I have updated the Docker configuration or dependencies, if applicable.
- [ ] 🔄 I have ensured compatibility with the existing setup after dependency changes.
### Testing
- [ ] 🧪 I have added or updated tests to cover my changes.
- [ ] ✔️ New and existing tests pass locally with my changes.
- [ ] 🚦 I have tested these changes on GPUs and verified training stability.
- [ ] 🏋️ I have tested the changes on realistic training workloads, if applicable.
### Performance Impact
- [ ] 📊 I have run benchmarks where applicable to evaluate the performance impact.
- [ ] ✅ The benchmarks show no performance regression.
- [ ] 🚀 The benchmarks indicate a potential performance improvement.
- [ ] ⚠️ The benchmarks indicate a potential performance degradation.
- [ ] 📈 I have provided benchmark results and detailed any performance impact below, if applicable.
## 📊 Performance Impact Details
If there is any impact on performance, describe it and provide benchmark results, if applicable:
## 🗒️ Additional Notes
Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns.
**Demonstration and Discussion of Concept**
This code serves as a demonstration and discussion of the proposed concept.
**Key Decisions:**

- **Avoiding `runnable`:** I skip using `runnable` for a config and `configurable` for a processor. Also, all steps will operate directly on HF `datasets.Dataset`.
- **Introducing the `Apply` interface:** A new interface, `Apply`, will be implemented. It will be HF-centric and allow applying any required processing to each dataset shard, be it `filter`, `map`, or any other step supported by `datasets`.
- **Step specification approach:** The approach used in `datasets` is followed, where step types are specified in the config file by a `type` field.
- **Aggregator class:** An aggregator class will sequentially invoke the processing steps; in the preparator config class it is called `processors`.
- **Hollow doc length filter:** Implemented an empty doc length filter step to demonstrate what a step could look like.
**Usage:**

To prepare a dataset, simply call:

```python
dataset = self._config.processors.apply(dataset)
```

The config would look something like this:

```yaml
processors:
  steps:
    - type: length_filter
      field: text
      min_length_chars: 100
      max_length_chars: 100000
    - ...
```
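To make the concept concrete, here is a minimal sketch of how the type-dispatched steps and the `processors` aggregator could fit together. All names here (`STEP_REGISTRY`, `LengthFilterStep`, `Processors`) are illustrative, not actual Fast-LLM classes, and the dataset is only assumed to expose a HF-style `.filter`:

```python
# Hypothetical sketch of the proposed Apply-style step registry.
STEP_REGISTRY = {}

def register_step(type_name):
    """Map a config `type` string to a step class."""
    def wrap(cls):
        STEP_REGISTRY[type_name] = cls
        return cls
    return wrap

class Apply:
    """Interface: apply a processing step to a HF-style dataset shard."""
    def apply(self, dataset):
        raise NotImplementedError

@register_step("length_filter")
class LengthFilterStep(Apply):
    def __init__(self, field, min_length_chars=0, max_length_chars=None):
        self.field = field
        self.min_chars = min_length_chars
        self.max_chars = max_length_chars

    def apply(self, dataset):
        # datasets.Dataset.filter takes a per-example predicate.
        def keep(example):
            n = len(example[self.field])
            if n < self.min_chars:
                return False
            return self.max_chars is None or n <= self.max_chars
        return dataset.filter(keep)

class Processors(Apply):
    """Aggregator: builds steps from config dicts and runs them in order."""
    def __init__(self, steps):
        self.steps = [STEP_REGISTRY[spec.pop("type")](**spec) for spec in steps]

    def apply(self, dataset):
        for step in self.steps:
            dataset = step.apply(dataset)
        return dataset
```

With this shape, the YAML `steps` list above maps directly to the `Processors` constructor argument.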
@jlamypoirier, @tscholak What do you think?
Hi @bigximik, thanks for putting this together. I appreciate the careful thinking you've put in here!
However, let's simplify significantly. The goal isn't to design a general, modular pipeline system. It's just about adding these very specific cleaning filters. We already know exactly what filters we want and in what order.
Here's what I'd suggest:
- Just write a single function `clean_dataset(dataset, config)` that explicitly calls each filter step in sequence.
- Each filter can be toggled with a simple boolean in the config. No need for abstract registries or complicated type-based dispatch.
- Each filter is just a simple, testable function that uses `.filter` or `.map` directly on the HuggingFace dataset.
- Let's keep the configuration as simple as possible (just booleans and thresholds).
We can always refactor if more complexity is actually required down the line, but let's get this feature shipped quickly and cleanly first.
Can you please move forward by just implementing the concrete filters directly?
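For reference, the single-function approach described above could look roughly like this. This is only a sketch: `CleaningConfig`, its field names, and the thresholds are illustrative assumptions, and the dataset is only assumed to expose a HF-style `.filter`:

```python
# Hypothetical sketch of the suggested clean_dataset(dataset, config) shape.
from dataclasses import dataclass

@dataclass
class CleaningConfig:
    # Plain booleans toggle each filter; thresholds live alongside them.
    filter_by_length: bool = True
    min_length_chars: int = 100
    filter_binary: bool = True
    max_non_printable_ratio: float = 0.1

def _long_enough(example, min_chars):
    return len(example["text"]) >= min_chars

def _not_binary(example, max_ratio):
    # Treat a document as binary if too many characters are non-printable.
    text = example["text"]
    if not text:
        return True
    bad = sum(1 for c in text if not c.isprintable() and c not in "\n\t\r")
    return bad / len(text) <= max_ratio

def clean_dataset(dataset, config):
    # Filters run in a fixed, explicit order; no registry or dispatch.
    if config.filter_by_length:
        dataset = dataset.filter(lambda ex: _long_enough(ex, config.min_length_chars))
    if config.filter_binary:
        dataset = dataset.filter(lambda ex: _not_binary(ex, config.max_non_printable_ratio))
    return dataset
```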
Created a basic implementation based on the feedback.
- Word splitting and number detection implemented using regex.
- Binary data detection is based on non-printable characters.
- Presidio Model Management: Needs proper handling of model downloading and loading. Currently, it relies on spaCy's internal package installation, which won't work in a frozen Docker image.
- ClamAV Integration: Needs a proper implementation. The ChatGPT-suggested approach does not work. A proper implementation requires database download, loading, and file scanning, either via an outdated Python wrapper or a direct approach.
- Unit tests now cover all processors except ClamAV.
- Need to implement test skipping for PII and malware detection processors.
- Integration testing with `prepare` is pending.
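For illustration, the regex-based word splitting and number detection mentioned above might look like the following. The patterns here are assumptions for the sketch, not the PR's actual regexes:

```python
# Illustrative regexes for word splitting and numeric-token detection.
import re

WORD_RE = re.compile(r"\S+")                 # whitespace-delimited tokens
NUMBER_RE = re.compile(r"^\d+([.,]\d+)?$")   # integers and simple decimals

def numeric_token_ratio(text):
    """Fraction of whitespace-delimited tokens that are purely numeric."""
    tokens = WORD_RE.findall(text)
    if not tokens:
        return 0.0
    numeric = sum(1 for t in tokens if NUMBER_RE.match(t))
    return numeric / len(tokens)
```

A filter would then compare this ratio against a configured threshold inside a `.filter` predicate.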
**Next Steps**
- Implement model download and loading management for Presidio.
- Implement multiprocessing for Presidio engine.
- Implement ClamAV database management and scanning.
- Add test skipping for specific processors.
- Verify integration with `prepare`.