[WIP] Fix datasets export to JSON
Linked Issue: #7037 Ideas: #7039
@albertvillanova / @lhoestq any early feedback?
AFAIK there is no orient param in load_dataset(), so a dataset exported with an orientation other than "records" isn't loaded back accurately. Any thoughts?
orient="split" can also be handled. I will add the changes soon.
Thanks for diving into this! I don't think we want the JSON export to be that complex though, especially since people can do ds.to_pandas().to_json(orient=...). Maybe we can just raise an error and suggest that users use pandas? Also note that this loads the full dataset in memory, so it's mainly for small-scale datasets. The only acceptable option for large-scale datasets is probably just JSON Lines anyway, since it enables streaming.
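For reference, a minimal sketch of the pandas route suggested above. It assumes a small in-memory table; the DataFrame here is a hypothetical stand-in for the result of ds.to_pandas(), and the column names are made up for illustration.

```python
import io
import json

import pandas as pd

# Hypothetical stand-in for `ds.to_pandas()`; column names are illustrative only.
df = pd.DataFrame({"text": ["a", "b"], "label": [0, 1]})

# Export with a non-default orientation. The whole table is serialized at once,
# so this only makes sense for data that fits in memory.
split_json = df.to_json(orient="split")

# Reading it back requires passing the same orient explicitly.
restored = pd.read_json(io.StringIO(split_json), orient="split")

# Inspect the layout that orient="split" produces.
payload = json.loads(split_json)
print(sorted(payload))
```

The round trip only works because the reader is told which orientation was used, which is the piece load_dataset() does not expose.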
@lhoestq Simply doing ds.to_pandas().to_json(orient=...) is not going to give any batching or multiprocessing benefits, right? Also, which function are you referring to when you say that it's meant for small-scale datasets only?
Yes indeed. Though I think it's fine, since anything other than JSON Lines (lines=True) is only suitable/useful for small datasets. Or do you know of a case where a big dataset needs to be in a format other than JSON Lines?
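To illustrate why JSON Lines suits large datasets, here is a small standard-library sketch. The helper names write_jsonl and iter_jsonl are hypothetical and not part of datasets or pandas; the point is only that each line is an independent JSON document, so rows can be written and read one at a time.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, Iterator


def write_jsonl(rows: Iterable[dict], path: Path) -> None:
    """Write one JSON object per line; rows are never held in memory together."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield rows lazily, parsing a single line at a time."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


rows = [{"text": "a", "label": 0}, {"text": "b", "label": 1}]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.jsonl"
    write_jsonl(rows, path)
    loaded = list(iter_jsonl(path))
```

A single JSON document in "split" or "columns" layout cannot be read this way with a plain parser, because it is not valid until its closing bracket has been seen.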
@lhoestq Let me close this PR and open another one where I will add an error message, as suggested here.
Addressed here: #7273 @lhoestq