Data explorer does not render the timestamp datatype

Open trygu opened this issue 2 years ago • 8 comments

I've tried the Data explorer on a bunch of test files, and it looks great. There is a small issue with the rendering of timestamp (the native date time datatype in Parquet): it is rendered as a numeric value rather than a date time. I'm attaching a test file (Parquet) and screenshots from both the Data explorer and DuckDB.

Data explorer screenshot Duckdb screenshot

output.parquet.zip

trygu avatar Dec 06 '23 09:12 trygu

Hey @trygu,

Thanks for the feedback!

Seems like an easy enough fix!

garronej avatar Dec 06 '23 09:12 garronej

@garronej I don't want to dampen your spirits, but it's not that simple. Clients (pyarrow, R arrow, Spark...) manage datetime differently. So, the content of a parquet file will depend on the client used to create it. This is a problem that @pengfei99 had documented here. Don't you remember, Jo? 😉

RLesur avatar Dec 06 '23 11:12 RLesur

I think the problem is how you should represent a date time field. My suggestion is to represent it in a fixed ISO 8601 format, just like DuckDB (and maybe later add a configurable option for the date time formatting).
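
A minimal sketch of that suggestion, assuming the raw value shown in the grid is a count of microseconds since the Unix epoch in UTC (the unit used by the attached file is not stated in this thread, so that is an assumption). The function name `epoch_micros_to_iso` is hypothetical and not part of the Data explorer.

```python
from datetime import datetime, timedelta, timezone

# Reference point for Unix timestamps, expressed in UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_micros_to_iso(value: int) -> str:
    """Render microseconds since the Unix epoch as an ISO 8601 string in UTC.

    Assumption: the input is in microseconds. A file written with another
    unit (milliseconds, nanoseconds) would need a different conversion.
    """
    return (EPOCH + timedelta(microseconds=value)).isoformat()


# A raw integer as a grid might display it, and its ISO 8601 rendering.
print(epoch_micros_to_iso(1701853920000000))  # 2023-12-06T09:12:00+00:00
```

Integer arithmetic through `timedelta` avoids the rounding that can occur when a large epoch value is first divided into a float number of seconds.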

trygu avatar Dec 06 '23 13:12 trygu

@garronej The Parquet format is the easiest one to work with. For the CSV format, you will have more trouble. For example, if a file is encoded with Windows-1252 and you try to open it as UTF-8, all the special characters will be wrongly interpreted. You will likely encounter this issue eventually.
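
The encoding mismatch can be reproduced with the standard library alone. This is only an illustration: the sample string is made up, and `cp1252` is Python's codec name for Windows-1252.

```python
# Made-up sample text containing characters outside ASCII.
text = "café à la crème"

# Bytes as they would be stored in a Windows-1252 encoded CSV file.
raw = text.encode("cp1252")

# Strict UTF-8 decoding rejects these bytes outright.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Lenient decoding substitutes U+FFFD for every byte it cannot interpret,
# so the accented characters are lost.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the encoding the file was written in restores the text.
right = raw.decode("cp1252")

print(strict_failed)
print(wrong)
print(right)
```

A reader has no reliable way to tell from the bytes alone which encoding was intended, which is why CSV is harder to handle than Parquet here.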

pengfei99 avatar Dec 06 '23 15:12 pengfei99

I agree, and I'm not sure it's that important to fix this for CSVs. As you say, they can be encoded (or not encoded) in a lot of different ways, but seldom as a large integer.

I think the most pressing issue is that the date time type in Parquet is displayed as a raw epoch value. There are several strategies to guess that a value is a date time, but by far the best would be to get this information from the Parquet file's metadata when the file is read, since the grid itself does not have this information. I also noticed that there is a milestone related to the File explorer. I would love for this issue to be part of that milestone as well (nudge @fcomte).
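
A rough sketch of the metadata-driven approach, as opposed to guessing from the values. It assumes the reader can hand the grid a per-column description; the dictionary layout and the name `format_cell` are hypothetical and do not come from the Data explorer or any Parquet library. The units MILLIS, MICROS and NANOS and the UTC-adjustment flag mirror the Parquet TIMESTAMP logical type, and the differing units are one way files written by different clients can diverge.

```python
from datetime import datetime, timedelta, timezone

# Naive reference point; a timezone is attached only when the metadata says
# the column is adjusted to UTC.
EPOCH = datetime(1970, 1, 1)


def format_cell(value: int, column_type: dict) -> str:
    """Format one grid cell using a (hypothetical) column type description."""
    if column_type.get("logical_type") != "TIMESTAMP":
        # Not a timestamp column: show the value unchanged.
        return str(value)

    unit = column_type["unit"]
    if unit == "MILLIS":
        micros = value * 1000
    elif unit == "MICROS":
        micros = value
    elif unit == "NANOS":
        # datetime has microsecond resolution, so sub-microsecond digits
        # are dropped in this sketch.
        micros = value // 1000
    else:
        raise ValueError(f"unknown timestamp unit: {unit}")

    moment = EPOCH + timedelta(microseconds=micros)
    if column_type.get("utc", False):
        moment = moment.replace(tzinfo=timezone.utc)
    return moment.isoformat()


# Hypothetical column descriptions, as a reader might derive from the footer.
created = {"logical_type": "TIMESTAMP", "unit": "MILLIS", "utc": True}
count = {"logical_type": "INT64"}

print(format_cell(1701853920000, created))  # 2023-12-06T09:12:00+00:00
print(format_cell(1701853920000, count))    # 1701853920000
```

The same integer is rendered as a date time in one column and as a plain number in the other, which a value-based guess could not distinguish reliably.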

trygu avatar Apr 07 '24 21:04 trygu