
Unify `from_json` and `parse_tabular` implementations

Open dtulga opened this issue 1 year ago • 4 comments

This issue is to unify the existing `from_json` and `from_jsonl` implementations with those in `parse_tabular`, `from_csv`, and `from_parquet`, so that dynamic model generation and schema inference are consolidated across these import functions. Current functionality (such as jmespath support) should be preserved, so the implementations likely cannot be identical, but they should share similar dynamic model generation, schema inference, etc. Ideally this should also remove the dependency on datamodel-code-generator, if possible.
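To make "shared schema inference plus dynamic model generation" concrete, here is a rough, standard-library-only sketch. It is not the datachain implementation: `infer_schema` and `build_model` are hypothetical helper names, and `dataclasses.make_dataclass` stands in for whatever model class the real code would generate. The idea is that any importer that can produce rows as dicts (JSON, JSONL, CSV, ...) could feed the same two helpers.

```python
import json
from dataclasses import field, fields, make_dataclass
from typing import Any, Optional


def infer_schema(rows):
    """Return {column: (python_type, nullable)} for a list of dict rows.

    Hypothetical helper: a column is nullable if it is missing or null in
    any row; mixed int/float widens to float; other conflicts become Any.
    """
    seen = {}         # column -> set of concrete types observed
    nullable = set()  # columns that were null or absent at least once
    for row in rows:
        for key, value in row.items():
            if value is None:
                nullable.add(key)
            else:
                seen.setdefault(key, set()).add(type(value))
    all_keys = set(seen) | nullable
    for row in rows:
        nullable |= all_keys - set(row)

    schema = {}
    for key in sorted(all_keys):
        types = seen.get(key, set())
        if types == {int, float}:
            base = float
        elif len(types) == 1:
            base = next(iter(types))
        else:
            base = Any
        schema[key] = (base, key in nullable)
    return schema


def build_model(name, schema):
    """Create a dataclass at runtime from an inferred schema (hypothetical)."""
    # Fields without defaults must come before fields with defaults.
    required = [(k, t) for k, (t, null) in schema.items() if not null]
    optional = [
        (k, Optional[t], field(default=None))
        for k, (t, null) in schema.items()
        if null
    ]
    return make_dataclass(name, required + optional)


# Example input: two JSON lines, the second with an extra "tag" column.
jsonl = '{"id": 1, "score": 0.5}\n{"id": 2, "score": 1, "tag": "a"}\n'
rows = [json.loads(line) for line in jsonl.splitlines() if line.strip()]

schema = infer_schema(rows)
Row = build_model("Row", schema)
records = [Row(**row) for row in rows]
```

Note that a dataclass does no validation or coercion (the `score` of the second record stays an `int`), so a real implementation would need a validating model type; the sketch only shows the shape of the shared code path.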

dtulga avatar Oct 28 '24 15:10 dtulga

This documentation page may be helpful in the future, as it covers pyarrow's support for JSON: https://arrow.apache.org/docs/python/generated/pyarrow.json.read_json.html

dtulga avatar Oct 29 '24 18:10 dtulga

thanks @dtulga !

shcheklein avatar Oct 29 '24 18:10 shcheklein

Hi, I am wondering if this issue is open for contribution under some guidance 🙂

PanGan21 avatar Nov 07 '24 10:11 PanGan21

@PanGan21 hi, yes, absolutely. Please take a look at the parse_tabular and from_json implementations, especially the part that depends on datamodel-code-generator - that's the hackiest part, and the one we would like to get rid of. Let us know if something is not clear. It may not be the simplest task tbh, but it can be an interesting one!

shcheklein avatar Nov 07 '24 18:11 shcheklein