Smarter chunking when importing multiple csv files
When importing multiple csv files using csv_to_disk.frame, you can specify the number of chunks to output using nchunks. The import then appears to happen in two stages: (1) each csv file is split into nchunks fst files; (2) those fst files are combined so that the resulting output folder contains nchunks fst files. If you have many csv files and a large number of desired chunks, this results in quite a bit of temporary fst file writing.
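For reference, this is the usage pattern in question (the `data` and `out.df` paths are illustrative only):

```r
library(disk.frame)
setup_disk.frame()  # start the parallel workers

# csv_to_disk.frame accepts a vector of csv paths; with nchunks set,
# each file is split into nchunks fst files before being merged so the
# output folder ends up with nchunks chunks
csv_files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
df <- csv_to_disk.frame(csv_files, outdir = "out.df", nchunks = 16)
```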
It might be more efficient to import each csv file as a single fst file, and then combine those into nchunks fst files. In the best case, one might have, for example, 4 csv files and want nchunks = 4, in which case it would clearly be easier to just create a single fst for each csv file. (I believe this is how Spark handles multiple-file input, unless you repartition the frame.) Other combinations are obviously not as easy, but perhaps an intermediate_chunk argument could specify how the individual csv files should be chunked. In any event, if the number of chunks desired is less than or equal to the number of csv files, then it is a fair bet that (absent extremely unbalanced csv files) it would be safe to treat each csv as a single fst without re-chunking; see the sketch below.
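To make the suggestion concrete, here is a minimal sketch of that shortcut. `import_csvs_smart` is a hypothetical helper, not part of disk.frame: it writes one fst per csv when nchunks is at or below the file count, and otherwise delegates to the existing two-stage path.

```r
library(data.table)
library(fst)

# Sketch of the proposed heuristic, not disk.frame's actual code path:
# when nchunks <= number of csv files, write one fst per csv and skip
# the intermediate split/merge entirely (accepting possibly unbalanced
# chunks); otherwise fall back to the current behaviour.
import_csvs_smart <- function(csv_files, outdir, nchunks) {
  if (nchunks <= length(csv_files)) {
    dir.create(outdir, showWarnings = FALSE, recursive = TRUE)
    for (i in seq_along(csv_files)) {
      dt <- fread(csv_files[i])
      write_fst(dt, file.path(outdir, sprintf("%d.fst", i)))
    }
    # wrap the folder of fst files as a disk.frame
    disk.frame::disk.frame(outdir)
  } else {
    disk.frame::csv_to_disk.frame(csv_files, outdir = outdir,
                                  nchunks = nchunks)
  }
}
```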
Relatedly, it would be great if csv_to_disk.frame deleted the various intermediate folders it creates. For very large or repeated imports, disk usage could quickly get out of hand if the intermediate folders are not removed.
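In the meantime, a workaround is to clean up manually after the import finishes. A minimal sketch, assuming the intermediates land under tempdir() (where they actually land may vary by disk.frame version, so inspect tempdir() after an import before deleting):

```r
# Hypothetical cleanup: delete the temporary folders once the final
# disk.frame is written; the location of the intermediates is an
# assumption here
intermediates <- list.dirs(tempdir(), recursive = FALSE)
unlink(intermediates, recursive = TRUE, force = TRUE)
```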
Thanks. It looks like you have a good grasp of the CSV reading. I am focused on documentation and writing tests, so I have planned to write a CSV-reading deep dive to explain all the details.

Given your head start, do you think you will have time to contribute a PR to improve the CSV reading function?