Document using --datetime-format to avoid over-aggressive date inference
Hi,
I'm seeing some odd type inference when using in2csv:
An Excel cell with the value 611A_M46___EXT050.png | 611A_M46___FIN300.png is converted to 6101-01-01T05:00:00 (a date?) when running in2csv data.xlsx > data.csv.
Other cells in that column with look-alike content are treated as text, e.g. 560A_B97___FR.png | 560A_B97___SIL.png | 560A_B97___SIR.png, 570A_N15___BA.png | 570A_N15___FR.png | 570A_N15___SIL.png | 570A_N15___SIR.png, 571A_061___SIL.png, etc.
Is this expected?
With in2csv -I data.xlsx > data.csv (that is, with the --no-inference parameter), the output for this particular cell is fine, but I'd really need the inference for other cells in the data...
Thanks!
FYI: this happens with in2csv 1.0.2 on macOS Sierra with Python 2.7.10. On another Mac with an older version of in2csv (0.9.1, I think, but I can't check because in2csv -V doesn't work there), also on macOS Sierra with Python 2.7.10, this behaviour doesn't occur!
For type inference on specific columns, see #151.
agate 1.6.1, which csvkit will upgrade to once it's released, fixes the over-aggressive date inference: https://github.com/wireservice/agate/issues/653
It looks like explicitly setting --datetime-format does disable some of the overeager conversion of text.
In my case, I'm using csvsql only to handle numeric locales, which requires type inference. But it was converting UUIDs to datetimes. Setting the strptime format fixes everything and gives roughly a 10x speedup, since all the datetime conversion is skipped.
@bvdputte
Re-opened to document this way to control type inference.
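To illustrate why an explicit strptime format helps, here is a minimal sketch using only Python's standard library. It is not csvkit's or agate's actual inference code; the helper name looks_like_datetime and the format string are assumptions chosen for the example. The point is that a strict format only accepts values that match it exactly, so look-alike text such as file names or UUIDs stays text.

```python
from datetime import datetime


def looks_like_datetime(value, fmt):
    """Return True only if `value` matches the strptime format `fmt` exactly.

    Hypothetical helper for illustration; not part of csvkit or agate.
    """
    try:
        datetime.strptime(value, fmt)
    except ValueError:
        # strptime raises ValueError when the text does not match the format.
        return False
    return True


# Assumed format for this example; use whatever your data actually contains.
FMT = "%Y-%m-%d %H:%M:%S"

samples = [
    "2017-03-01 12:30:00",                            # a real datetime
    "611A_M46___EXT050.png | 611A_M46___FIN300.png",  # cell from the report
    "3f2b8c1e-9d4a-4c6b-8f1e-2a7d5b9c0e11",           # a made-up UUID
]

# Map each sample to whether the strict format accepts it.
results = {s: looks_like_datetime(s, FMT) for s in samples}

for sample, is_dt in results.items():
    print(f"{sample!r}: {'datetime' if is_dt else 'text'}")
```

On the command line, the equivalent would presumably be something like in2csv --datetime-format "%Y-%m-%d %H:%M:%S" data.xlsx > data.csv, with the format adjusted to match the datetimes that actually appear in the data.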