Automatically split input dataset in ray mode

Open pan-x-c opened this issue 1 year ago • 3 comments

Description

Split the dataset files into small pieces and process them in different batches to avoid exceeding the memory limit of Ray.

pan-x-c, Sep 04 '24 12:09
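As an illustration of the idea described above, here is a minimal sketch that splits a JSON Lines file into size-bounded pieces and groups those pieces into batches. It is not data-juicer's implementation: the function names `split_jsonl` and `iter_batches`, the `max_bytes` and `files_per_batch` parameters, and the shard naming scheme are all hypothetical, and the sketch assumes the input is a JSONL file with one record per line.

```python
from pathlib import Path
from typing import Iterator, List


def split_jsonl(src: Path, out_dir: Path, max_bytes: int) -> List[Path]:
    """Split a JSON Lines file into shards of roughly max_bytes each.

    Records are never cut in half, so a shard only exceeds max_bytes
    when a single line is itself larger than the limit.
    (Hypothetical helper, not part of data-juicer or Ray.)
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    shards: List[Path] = []
    out = None
    written = 0
    with src.open("rb") as f:
        for line in f:
            # Start a new shard for the first line, or when adding this
            # line would push a non-empty shard past the size limit.
            if out is None or (written and written + len(line) > max_bytes):
                if out is not None:
                    out.close()
                shard = out_dir / f"{src.stem}-{len(shards):05d}.jsonl"
                shards.append(shard)
                out = shard.open("wb")
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return shards


def iter_batches(shards: List[Path], files_per_batch: int) -> Iterator[List[Path]]:
    """Yield consecutive groups of shard paths, one group per processing batch."""
    for i in range(0, len(shards), files_per_batch):
        yield shards[i:i + files_per_batch]
```

Each batch of paths could then be loaded and processed on its own (for example by passing the paths to `ray.data.read_json`), so that only one batch's worth of data is in flight at a time.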

This PR is marked as stale because there has been no activity for 21 days. Remove the stale label or add new comments, or this PR will be closed in 3 days.

github-actions[bot], Sep 27 '24 09:09

Close this stale PR.

github-actions[bot], Sep 30 '24 09:09

Cc: @pan-x-c, @chenyushuo

When you have time, please add the new rule that accounts for Ray's auto-split feature to this PR, and resolve the conflicts so it is ready for code review.

Additionally, we need to incorporate the streaming_load_json patch into the main branch to align with our 2.0 paper.

yxdyc, Dec 12 '24 03:12