data-juicer
Automatically split input dataset in ray mode
Description
Split the input dataset files into smaller pieces and process them in separate batches so that processing does not exceed Ray's memory limit.
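For illustration only, here is a minimal sketch of the splitting idea described above. It is not the implementation in this PR: the helper name `split_jsonl`, the `max_bytes` parameter, and the `part-00000.jsonl` naming scheme are all hypothetical, and it assumes the input is a JSON Lines file that can be cut on line boundaries. It uses only the Python standard library and makes no Ray calls.

```python
import os


def split_jsonl(src_path, out_dir, max_bytes):
    """Split a JSON Lines file into pieces of roughly max_bytes each.

    Hypothetical helper: pieces are cut on line boundaries, so a single
    line longer than max_bytes still ends up in a piece of its own.
    Returns the list of piece paths in order.
    """
    os.makedirs(out_dir, exist_ok=True)
    pieces = []
    out = None
    written = 0
    with open(src_path, "rb") as src:
        for line in src:
            # Open a new piece if none is open yet, or if appending this
            # line would push the current piece over the size budget.
            if out is None or (written > 0 and written + len(line) > max_bytes):
                if out is not None:
                    out.close()
                piece_path = os.path.join(out_dir, f"part-{len(pieces):05d}.jsonl")
                out = open(piece_path, "wb")
                pieces.append(piece_path)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return pieces
```

Each returned path could then be loaded and processed as its own batch, so that only one piece needs to be resident in memory at a time. How the pieces are actually handed to Ray is outside the scope of this sketch.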
This PR is marked as stale because there has been no activity for 21 days. Remove the stale label or add a new comment, or this PR will be closed in 3 days.
Close this stale PR.
Cc: @pan-x-c, @chenyushuo
When available, please add the new rule that accounts for Ray's auto-split feature to this PR and resolve the conflicts for CR.
Additionally, we need to incorporate the streaming_load_json patch into the main branch to align with our 2.0 paper.