Document using --datetime-format to avoid over-aggressive date inference

Open bvdputte opened this issue 8 years ago • 3 comments

Hi,

I'm seeing some odd type inference when using in2csv:

An Excel cell with the value 611A_M46___EXT050.png | 611A_M46___FIN300.png is converted to 6101-01-01T05:00:00 (a date?) when I run in2csv data.xlsx > data.csv.

Other cells in that column with "lookalike" content are treated as text, e.g. 560A_B97___FR.png | 560A_B97___SIL.png | 560A_B97___SIR.png , 570A_N15___BA.png | 570A_N15___FR.png | 570A_N15___SIL.png | 570A_N15___SIR.png , 571A_061___SIL.png , etc.

Is this expected?

With in2csv -I data.xlsx > data.csv (that is, with the --no-inference option), the output for this particular cell is fine, but I really need inference for the other cells in the data...

Thanks!

FYI: this is happening with in2csv 1.0.2 on macOS Sierra with Python 2.7.10. On another Mac with an older version of in2csv (I think 0.9.1, but I can't confirm the version because in2csv -V isn't working there), also on macOS Sierra with Python 2.7.10, this behaviour doesn't occur!

bvdputte avatar Jan 04 '18 12:01 bvdputte

For type inference on specific columns, see #151.

agate 1.6.1 fixes the over-aggressive date inference, and csvkit will upgrade to it once it's released: https://github.com/wireservice/agate/issues/653

jpmckinney avatar Jan 15 '18 15:01 jpmckinney

It looks like explicitly setting --datetime-format does disable some of the overeager conversion of text.

In my case, I'm using csvsql only to handle numeric locales, which requires type inference. But it was converting UUIDs to datetimes. Setting the strptime format fixes everything and gives something like a 10x speedup, since all that datetime conversion is skipped.
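To illustrate why an explicit strptime format narrows what counts as a datetime, here is a minimal sketch using only Python's standard library. It is not agate's or csvkit's actual inference code. The helper name `parses_as_datetime`, the format string `FMT`, and the sample UUID are assumptions made up for the illustration. The idea is only that a value has to match the given format exactly, so file names and UUIDs fall through and stay text.

```python
from datetime import datetime


def parses_as_datetime(value, fmt):
    """Return True only if `value` matches the strptime format `fmt` exactly.

    Hypothetical helper for illustration; not part of csvkit or agate.
    """
    try:
        datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True


# Assumed format for the example; use whatever format your data really has.
FMT = "%Y-%m-%dT%H:%M:%S"

samples = [
    "2018-01-04T12:01:00",                            # a real timestamp
    "611A_M46___EXT050.png | 611A_M46___FIN300.png",  # cell from this issue
    "123e4567-e89b-12d3-a456-426614174000",           # made-up UUID
]

# Map each sample to whether it would be accepted as a datetime.
results = {s: parses_as_datetime(s, FMT) for s in samples}

for sample, is_datetime in results.items():
    print(f"{sample!r}: {'datetime' if is_datetime else 'text'}")
```

On the command line, the corresponding option would be passed as something like csvsql --datetime-format "%Y-%m-%dT%H:%M:%S" data.csv, with the format string adjusted to match the data.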

@bvdputte

jnj16180340 avatar May 28 '19 22:05 jnj16180340

Re-opened to document this way to control type inference.

jpmckinney avatar Jun 03 '19 20:06 jpmckinney