[ENG-1809] Improve .csv dialect sniffing

Open cslzchen opened this issue 5 years ago • 0 comments

Ticket

https://openscience.atlassian.net/browse/ENG-1809

Purpose

Improve .csv dialect sniffing.

Changes

Provide a full row of data to csv.Sniffer().sniff() so it can effectively detect the correct dialect. MFR reads a small amount of data from the file and recursively read more until either the following happens.
- Nothing available. Return the empty string since there is no new line characters in this file at all.
- The to-sniff size goes over max render size allowed. Return TabularRenderError. However, this is more like a failsafe since oversized file should never reach the sniffer in the first place.
- Any type of new line characters found, trim (if necessary) and send the the full first row to the sniff.
- Please see the video captures in the tickets as a quick demo.
Updated existing and added new unit tests

Side effects

Memory usage for sniffing depends on the size of the first row of the file. In the worst case, it could sniff at most 10MB. Please note that we don't have enough statistical data on how many bytes the first row of a .csv file takes on average. However, files larger than 10MB have already failed the size check by the renderer before the sniffing starts.

Given that for CSV, we needs to read the full file into memory for rendering anyway and the partial sniff data is always deleted after use, I don't think it will be a problem.

[ ] For CR: maybe add a sentry alert when the sniffing size goes over 512KB (or some other size) so we know how often this happens and how effective this algorithm actually is.

QA Notes

TBD

Deployment Notes

N / A

May 29 '20 02:05 cslzchen