Use a neutral format to have lossless interface with JSON, scipp, Astropy, pandas

Open loco-philippe opened this issue 1 year ago • 4 comments

Is your feature request related to a problem?

Each tool has a specific structure for processing multidimensional data with the following consequences:

  • interfaces dedicated to each tool,
  • partially processed data,
  • no unified representation of data structures

Describe the solution you'd like

The proposed format (see the Jupyter notebook, GitHub repository, and PyPI package) is based on the following principles:

  • a neutral format usable by tabular or multidimensional tools (e.g. NumPy, pandas, xarray, scipp, Astropy),
  • support for a wide variety of data types, as defined in the NTV format,
  • high interoperability: reversible (lossless round-trip) interfaces with tabular or multidimensional tools,
  • a reversible and compact JSON format,
  • ease of sharing and exchanging multidimensional and tabular data.

Describe alternatives you've considered

No response

Additional context

  • https://github.com/numpy/numpy/issues/12481#issuecomment-2049179803
  • https://github.com/astropy/astropy/issues/16286
  • https://github.com/scipp/scipp/issues/3422

loco-philippe, Apr 11 '24 08:04

Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!

welcome[bot], Apr 11 '24 08:04

It's not clear to me what changes you're asking for in xarray. If you want to create a new on-disk storage format you can, and you can teach xarray to read it using the backend entrypoint system. Are you asking for something that falls outside of that framework?

TomNicholas, Apr 11 '24 14:04

Thank you @TomNicholas for your quick response.

Currently, the interfaces between Xarray and other multidimensional tools such as scipp or Astropy's NDData only process part of the data, because the internal structures of each tool are different.

To have reversible (lossless round-trip) interfaces, it is necessary to define a common data structure and a mapping between this structure and the structure of each tool (here, Xarray).

This is what the proposed format defines and what the package linked above implements. This shows, for example, that an Xarray Dataset can be transformed reversibly into a Scipp Dataset and vice versa, or into JSON data in an equally reversible manner.
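
To make the idea of a common structure plus a reversible mapping concrete, here is a minimal standard-library sketch. It is not the NTV-based format from the linked package: the `NeutralArray` class and its field names are hypothetical, chosen only to show that dimension names, coordinates, dtype and metadata can all survive a JSON round trip when they are carried explicitly.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class NeutralArray:
    """Hypothetical tool-independent description of one labelled array."""

    name: str
    dims: list   # dimension names, e.g. ["x", "y"]
    shape: list  # length of each dimension
    dtype: str   # element type, stored explicitly so it survives JSON
    data: list   # flattened values in row-major order
    coords: dict = field(default_factory=dict)  # dimension name -> labels
    attrs: dict = field(default_factory=dict)   # free-form metadata (units, ...)

    def to_json(self) -> str:
        # asdict() yields plain dicts and lists, which json can serialize.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "NeutralArray":
        return cls(**json.loads(text))


temperature = NeutralArray(
    name="temperature",
    dims=["x", "y"],
    shape=[2, 3],
    dtype="float32",
    data=[10.5, 11.0, 12.25, 9.0, 8.5, 7.75],
    coords={"x": [0, 1], "y": ["a", "b", "c"]},
    attrs={"units": "degC"},
)

# Serialize to JSON and back, then compare field by field.
restored = NeutralArray.from_json(temperature.to_json())
round_trip_ok = restored == temperature
```

Each tool would then need only one pair of converters (tool to neutral structure, and back) instead of one dedicated interface per pair of tools.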

To be clearer, my requests for Xarray are as follows:

  • does Xarray wish to participate in the definition (or validation) of this common data structure (so as to ensure that it covers all the developments envisaged for Xarray)?
  • is Xarray interested in integrating the interface defined towards this structure (or is it better to include it in a third party)?
  • is Xarray interested in integrating the defined JSON interface (or is it better to include it in a third party)?
  • does Xarray have use cases associated with interfaces between tools (or is this rather a topic for Xarray Discussions)?

loco-philippe, Apr 11 '24 22:04

Thanks for the clarification @loco-philippe.

Xarray Dataset can be transformed reversibly into a Scipp Dataset and vice versa

That's cool to know!

I'll attempt to answer these questions, but others feel free to correct me.

does Xarray wish to participate in the definition (or validation) of this common data structure (so as to ensure that it covers all the developments envisaged for Xarray)?

is Xarray interested in integrating the interface defined towards this structure (or is it better to include it in a third party)?

I don't really think we need to be active participants until you ask for a specific change in xarray. Our data model is well-defined, and we would need a very good reason to change it.

is Xarray interested in integrating the defined JSON interface (or is it better to include it in a third party)?

Note that xarray maps well to the zarr format, which already stores all metadata in JSON files. If the numerical data arrays themselves can also be serialized to JSON (e.g. through https://github.com/numpy/numpy/issues/12481), then you have a JSON representation of an entire xarray.Dataset right there.

is Xarray interested in integrating the defined JSON interface (or is it better to include it in a third party)?

Xarray deliberately tries to make it easy for third parties to write code to serialize/deserialize to any data format they like. Again, see our backend entrypoint system. I don't see a need to add any Dataset.to_new_format() or open_new_format_as_dataset() functions to xarray, because these can live in your third party library (possibly as a BackendEntrypoint subclass). Once the new format becomes popular then we could consider accepting a PR.
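
As a rough illustration of the third-party route described above, the sketch below subclasses `BackendEntrypoint` to read a JSON file. Several things here are assumptions made for the example: the `JsonDictBackend` class, the `.xr.json` suffix, and the file layout, which is simply the output of xarray's existing `Dataset.to_dict()` rather than the format proposed in this issue.

```python
import json
import os
import tempfile

import xarray as xr
from xarray.backends import BackendEntrypoint


class JsonDictBackend(BackendEntrypoint):
    """Hypothetical backend reading the JSON form of Dataset.to_dict()."""

    description = "Illustrative JSON backend (not part of xarray)"

    def open_dataset(self, filename_or_obj, *, drop_variables=None):
        # Rebuild the Dataset from the nested dict stored in the file.
        with open(filename_or_obj) as f:
            ds = xr.Dataset.from_dict(json.load(f))
        if drop_variables is not None:
            ds = ds.drop_vars(drop_variables)
        return ds

    def guess_can_open(self, filename_or_obj):
        # Assumed file suffix, used only for engine auto-detection.
        return str(filename_or_obj).endswith(".xr.json")


original = xr.Dataset(
    {"temperature": (("x", "y"), [[10.5, 11.0], [9.0, 8.5]], {"units": "degC"})},
    coords={"x": [0, 1], "y": [10, 20]},
    attrs={"title": "demo"},
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.xr.json")
    with open(path, "w") as f:
        json.dump(original.to_dict(), f)
    # Passing the class as `engine` avoids installing an entry point.
    restored = xr.open_dataset(path, engine=JsonDictBackend)
    restored.load()
```

A published package would normally register such a class under the `xarray.backends` entry point group, so that users can select it by name through the `engine` argument. Note that `to_dict()` with data included does not record each variable's dtype, so this particular layout is not guaranteed lossless for every dataset.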

TomNicholas, Apr 11 '24 22:04