datachain Unify `from_json` and `parse_tabular` implementations

This issue tracks unifying the existing from_json and from_jsonl implementations with those in parse_tabular, from_csv, and from_parquet, in order to consolidate dynamic model generation and schema inference across these import functions. Current functionality (such as jmespath support) should be preserved, so the implementations likely cannot be identical between these import functions, but they should share similar dynamic model generation, schema inference, etc. Ideally, this should also remove the dependency on datamodel-code-generator.
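
To make "schema inference plus dynamic model generation" concrete, here is a rough, standard-library-only sketch of the general idea. It is not datachain's code and not a proposed implementation: `infer_schema` and `build_model` are hypothetical names, dataclasses stand in for whatever model type the real code uses, and it assumes flat JSON objects whose keys are valid Python identifiers.

```python
import json
from dataclasses import field, make_dataclass
from typing import Any, Optional


def infer_schema(records):
    """Infer a {key: type} mapping from a list of flat JSON objects (hypothetical helper)."""
    types = {}
    seen = {}          # how many records contain a non-null value for each key
    for rec in records:
        for key, value in rec.items():
            if value is None:
                types.setdefault(key, None)
                continue
            seen[key] = seen.get(key, 0) + 1
            current, prev = type(value), types.get(key)
            if prev is None:
                types[key] = current
            elif prev is not current:
                # Widen int/float mixes to float; give up and use Any otherwise.
                types[key] = float if {prev, current} == {int, float} else Any
    schema = {}
    for key, typ in types.items():
        typ = Any if typ is None else typ
        # A key that is missing or null in some record becomes Optional.
        schema[key] = typ if seen.get(key, 0) == len(records) else Optional[typ]
    return schema


def build_model(name, schema):
    """Create a dataclass at runtime from the inferred schema (hypothetical helper)."""
    required = [(k, t) for k, t in schema.items() if t in (int, float, str, bool, list, dict)]
    required_keys = {k for k, _ in required}
    # Fields with defaults must come after fields without defaults.
    optional = [(k, t, field(default=None)) for k, t in schema.items() if k not in required_keys]
    return make_dataclass(name, required + optional)


# Example input in JSON Lines form, one object per line.
jsonl = '{"id": 1, "score": 0.5}\n{"id": 2, "score": 1, "tag": "a"}'
rows = [json.loads(line) for line in jsonl.splitlines()]
schema = infer_schema(rows)
Row = build_model("Row", schema)
objs = [Row(**row) for row in rows]
```

The same two-step shape (infer a schema once, then generate a model from it) is what the tabular and JSON paths could share; nested objects, jmespath selection, and non-identifier keys are deliberately left out of this sketch.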

Oct 28 '24 15:10 dtulga

This page may be helpful in the future, as it covers pyarrow's support for reading JSON: https://arrow.apache.org/docs/python/generated/pyarrow.json.read_json.html

Oct 29 '24 18:10 dtulga

thanks @dtulga !

Oct 29 '24 18:10 shcheklein

Hi, I am wondering if this issue is open for contribution under some guidance 🙂

Nov 07 '24 10:11 PanGan21

@PanGan21 hi, yes, absolutely. Please take a look at the parse_tabular and from_json implementations, especially the part that depends on datamodel-code-generator - that's the hackiest part, and the one we would like to get rid of. Let us know if something is not clear. It may not be the simplest task tbh, but it can be an interesting one!

Nov 07 '24 18:11 shcheklein

Unify `from_json` and `parse_tabular` implementations