[WIP] Fix datasets export to JSON

Open varadhbhatnagar opened this issue 1 year ago • 6 comments

varadhbhatnagar avatar Sep 29 '24 12:09 varadhbhatnagar

Linked Issue: #7037 Ideas: #7039

varadhbhatnagar avatar Sep 29 '24 15:09 varadhbhatnagar

@albertvillanova / @lhoestq any early feedback?

AFAIK there is no param orient in load_dataset(). So for orientations other than "records", the loading isn't very accurate. Any thoughts?
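
To make the concern concrete, here is a minimal stdlib-only sketch of the same rows in a "records" layout and a simplified "split" layout (pandas' own "split" output also carries an "index" key, omitted here). The helper name rows_from_split is hypothetical and not part of datasets or pandas; it only shows that a reader has to know the orientation up front to rebuild the rows.

```python
import json

rows = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]

# "records" layout: a JSON array with one object per row.
records_doc = json.dumps(rows)

# Simplified "split" layout: column names and row values stored separately.
columns = list(rows[0].keys())
split_doc = json.dumps(
    {"columns": columns, "data": [[row[c] for c in columns] for row in rows]}
)


def rows_from_split(doc):
    """Hypothetical helper: rebuild row dicts from a "split"-style document."""
    payload = json.loads(doc)
    return [dict(zip(payload["columns"], values)) for values in payload["data"]]


# A reader that assumes "records" would see one object with two keys here,
# not two rows, unless it is told which orientation was used.
restored = rows_from_split(split_doc)
```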

varadhbhatnagar avatar Sep 29 '24 15:09 varadhbhatnagar

orient="split" can also be handled. I will add the changes soon.

varadhbhatnagar avatar Oct 09 '24 18:10 varadhbhatnagar

Thanks for diving into this! I don't think we want the JSON export to be that complex though, especially since people can do ds.to_pandas().to_json(orient=...). Maybe we can just raise an error and suggest that users use pandas? Also note that it loads the full dataset in memory, so it's mainly for small-scale datasets. The only acceptable option for large-scale datasets is probably just JSON Lines anyway, since it enables streaming.
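
As a rough illustration of the streaming point, the stdlib-only sketch below writes one JSON object per line and reads the rows back lazily. The function names write_jsonl and iter_jsonl are made up for this example and are not the datasets API; an in-memory buffer stands in for a file.

```python
import io
import json


def write_jsonl(rows, fp):
    """Write one JSON object per line; rows can be any iterable, even a generator."""
    for row in rows:
        fp.write(json.dumps(row) + "\n")


def iter_jsonl(fp):
    """Yield rows one at a time without parsing the whole file first."""
    for line in fp:
        if line.strip():
            yield json.loads(line)


buffer = io.StringIO()
write_jsonl(({"id": i} for i in range(3)), buffer)

# Only the first line is parsed here; the rest of the buffer is left untouched.
buffer.seek(0)
first = next(iter_jsonl(buffer))
```

A single JSON document in any other orientation has to be parsed as a whole before the first row is available, which is why it only suits data that fits in memory.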

lhoestq avatar Oct 11 '24 13:10 lhoestq

@lhoestq Simply doing ds.to_pandas().to_json(orient=...) is not going to give any batching or multiprocessing benefits, right? Also, which function are you referring to when you say that it's meant for small-scale datasets only?

varadhbhatnagar avatar Oct 11 '24 17:10 varadhbhatnagar

Yes indeed. Though I think it's fine, since using something other than orient="lines" is only suitable/useful for small datasets. Or do you know of a case where a big dataset needs to be in a format other than orient="lines"?

lhoestq avatar Oct 14 '24 12:10 lhoestq

@lhoestq Let me close this PR and open another one where I will add an error message, as suggested here.

Thanks for diving into this! I don't think we want the JSON export to be that complex though, especially since people can do ds.to_pandas().to_json(orient=...). Maybe we can just raise an error and suggest that users use pandas? Also note that it loads the full dataset in memory, so it's mainly for small-scale datasets. The only acceptable option for large-scale datasets is probably just JSON Lines anyway, since it enables streaming.
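
A minimal sketch of what such a check might look like, assuming only "records" is supported on the export side. The names SUPPORTED_ORIENTS and check_orient and the message wording are hypothetical; the change that was actually made in #7273 may differ.

```python
# Assumption for this sketch: only the "records" orientation is supported.
SUPPORTED_ORIENTS = {"records"}


def check_orient(orient="records"):
    """Hypothetical guard: reject other orientations and point users to pandas."""
    if orient not in SUPPORTED_ORIENTS:
        raise ValueError(
            f"orient={orient!r} is not supported. For small datasets, consider "
            f"ds.to_pandas().to_json(orient={orient!r}) instead; note that this "
            "loads the full dataset in memory."
        )
    return orient
```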

varadhbhatnagar avatar Oct 27 '24 13:10 varadhbhatnagar

Addressed here: #7273 @lhoestq

varadhbhatnagar avatar Nov 01 '24 11:11 varadhbhatnagar