
Add printable representation for MDIOReader and MDIOWriter

Open srib opened this issue 3 years ago • 13 comments

Currently, MDIOReader and MDIOWriter print the default Python object representation. It would be useful to have a nicer printable representation.

class InfoReporter:
    """Render an object's info_items() as text or HTML (as in zarr.util)."""

    def __init__(self, obj):
        self.obj = obj

    def __repr__(self):
        # Plain-text report for the terminal / plain REPL.
        items = self.obj.info_items()
        return info_text_report(items)

    def _repr_html_(self):
        # Rich HTML report picked up automatically by Jupyter.
        items = self.obj.info_items()
        return info_html_report(items)

The `InfoReporter` pattern from here is a good model to follow.
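To make the pattern concrete, here is a minimal, self-contained sketch of how MDIOReader could expose such a repr. The `MDIOReader` fields and `info_items` contents below are hypothetical placeholders, not the real class:

```python
class MDIOReader:
    """Stand-in for the real MDIOReader, for illustration only."""

    def __init__(self, url, shape, dtype):
        self.url, self.shape, self.dtype = url, shape, dtype

    def info_items(self):
        # Key/value pairs describing the object; contents are illustrative.
        return [
            ("Type", type(self).__name__),
            ("URL", self.url),
            ("Shape", str(self.shape)),
            ("Dtype", str(self.dtype)),
        ]

    def __repr__(self):
        # Align keys into a simple two-column text report.
        width = max(len(k) for k, _ in self.info_items())
        return "\n".join(f"{k:<{width}} : {v}" for k, v in self.info_items())


print(MDIOReader("s3://bucket/file.mdio", (256, 512, 1024), "float32"))
```

A `_repr_html_` method following the same `info_items()` source would give the Jupyter-friendly HTML view.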

srib avatar Sep 01 '22 18:09 srib

Good idea.

https://github.com/pydata/xarray/issues/1627

Another excellent example from Xarray. We could also consider using Xarray as a backend to reduce duplicate effort.

tasansal avatar Sep 01 '22 20:09 tasansal

Xarray dev here! I just discovered this very cool project via twitter.

Would love to help you integrate Xarray into the library. I had a quick tour through the docs, and indeed it seems like Xarray could help reduce some boilerplate while bringing lots of features that would help your package. (It layers great on top of Zarr and Dask.) For an example of a domain-specific package built on top of these packages, check out sgkit.

Also tagging @tomnicholas, another Xarray dev with a big interest in energy. Let us know how we can help!

rabernat avatar Sep 02 '22 23:09 rabernat

@rabernat 👋🏽!

Thank you for your interest and your generous offer to help us out. Will browse through sgkit as you suggested.

@tasansal

srib avatar Sep 03 '22 00:09 srib

Hi @rabernat 👋

Great to see you here! Big fan of your work.

We should collaborate! We are planning to form a steering committee pretty soon and would love to have you and more Xarray developers on board.

Our main intention is to have an energy domain-specific library with some features similar to Xarray, plus some extra domain-specific features.

We have done a lot of heavy lifting incorporating exploration seismology data, and have some more implementations for wind resource data (to be integrated later).

tasansal avatar Sep 03 '22 02:09 tasansal

That sounds like a great vision! Our goal in Xarray is to be a generic container for multi-dimensional labeled arrays with metadata. We'd love it if you could rely on Xarray as a base data container. We have a section in our docs on extending Xarray which explains how a third-party package can add custom functionality to Xarray objects. We also have entry points for implementing your own backend for custom file formats. (Note that Xarray already has very strong support for Zarr I/O.)
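The accessor mechanism mentioned above looks roughly like this. A hedged sketch: the `"mdio"` accessor name, the `n_traces` property, and the `inline`/`crossline` dimension names are all hypothetical, but `register_dataset_accessor` is xarray's documented extension API:

```python
import numpy as np
import xarray as xr


@xr.register_dataset_accessor("mdio")
class MDIOAccessor:
    """Attach domain-specific functionality to any plain xr.Dataset."""

    def __init__(self, ds):
        self._ds = ds

    @property
    def n_traces(self):
        # Assumed seismic dimension names; purely illustrative.
        return self._ds.sizes["inline"] * self._ds.sizes["crossline"]


ds = xr.Dataset({"amplitude": (("inline", "crossline"), np.zeros((4, 8)))})
print(ds.mdio.n_traces)  # → 32
```

The accessor lives in a namespace (`ds.mdio.…`), so it adds functionality without subclassing or shadowing any built-in Dataset methods.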

If you feel like there are features missing from Xarray that are holding you back from adopting it internally, we would love to hear about it on our issue tracker.

rabernat avatar Sep 03 '22 21:09 rabernat

@rabernat I am starting to look into the Xarray backend integration.

In our exploration seismology data case, we have groups of rich information (arrays, metadata, etc.) that are related to other groups. Still, we want to keep them separate for various reasons:

  • We want to have GIS / CRS and coordinate data separately.
  • We want to have auxiliary variables separate from actual array data.
  • We want them to be able to share dimensions and coordinates.
  • Seismic data can have additional user interpretation created later, which we would want to keep in another group.

One of the first challenges for using Xarray as a backend is that Xarray can only work with one group at a time, which forces data duplication across groups. I know we can write datasets into different groups of a Zarr / NetCDF store; if I remember correctly, each group will have a copy of the coordinates, dimensions, etc., associated with the group data.

Is there a workaround to this? Would you be able to suggest a better alternative to our thought process? Given that Zarr v3 will separate performance-critical metadata from dataset groups, maybe it won't be a big problem for us anymore, but that is quite far from being mainstream.

tasansal avatar Jan 06 '23 20:01 tasansal

You may want to look into Xarray Datatree - https://xarray-datatree.readthedocs.io/ - a new package created by @tomnicholas. Soon this will become part of Xarray proper (see https://github.com/pydata/xarray/pull/7418).

rabernat avatar Jan 06 '23 20:01 rabernat

@tasansal I would love to help you get going with xarray! It sounds like datatree could fit some of your needs too.

One of the first challenges for using Xarray as a backend is that Xarray can only work with one group at a time, which forces data duplication across groups. I know we can write datasets into different groups of a Zarr / NetCDF store; if I remember correctly, each group will have a copy of the coordinates, dimensions, etc., associated with the group data.

Datatree empowers you to work with many groups at once. However at the moment you might still need to duplicate things across groups. One long-term solution to this might be to implement symbolic nodes in datatree, but I expect that using xarray and datatree would streamline your code a lot even without that feature.

Xarray and datatree work well with zarr already, so that should work nicely for you.

TomNicholas avatar Jan 06 '23 21:01 TomNicholas

Hey @rabernat and @TomNicholas

I am taking a stab at making our backend Xarray. I took a look at extending Xarray and sgkit libraries.

My understanding is:

  1. Do not inherit from DataArray and Dataset unless you want to re-implement the whole API.
  2. Use the custom dataset accessors to add more properties etc.
  3. Be like sgkit.

Here are some of the things we want to have on top of regular Xarray functionality:

  • a. Implement domain-specific repr and html_repr (adding more information to default reprs)
  • b. Add more required but hidden metadata (like _ARRAY_DIMENSIONS) that won't show in the repr but is used for the internal representation of the data (maybe doing "a" first will handle this if it hides anything with the _ prefix).
  • c. Have metadata conventions similar to ZEP0004
  • d. Utilize lower-level Zarr machinery like Zarr locks, fsspec caching, etc.
  • e. Support Zarr v3.
  • f. Have custom methods to access specific parts of the dataset.
  • g. Have hidden variables with a suffix on disk that are used to mask/unmask data. Similar to numpy's masked arrays, where you keep a bool mask alongside the array data but it is transparent to the user.
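Requirement (g) can be sketched with an accessor rather than a subclass. Everything here is hypothetical: the `_mask` naming convention, the `"masked"` accessor name, and the NaN fill policy:

```python
import numpy as np
import xarray as xr


@xr.register_dataset_accessor("masked")
class MaskedAccessor:
    """Apply a companion boolean variable ("<name>_mask") transparently."""

    def __init__(self, ds):
        self._ds = ds

    def get(self, name):
        # Return the variable with its companion mask applied as NaN fill.
        data = self._ds[name]
        mask_name = f"{name}_mask"
        if mask_name in self._ds:
            return data.where(self._ds[mask_name])
        return data


ds = xr.Dataset(
    {
        "amplitude": ("trace", np.array([1.0, 2.0, 3.0])),
        "amplitude_mask": ("trace", np.array([True, False, True])),
    }
)
print(ds.masked.get("amplitude").values)  # masked sample becomes NaN
```

Truly hiding the `*_mask` variable from the user would still need the custom repr from (a); the accessor only handles the mask application side.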

Given the above assumptions and requirements, what approach would work best for us? I am leaning towards option 1, unfortunately, since it gives the ultimate flexibility. But if these can all be done with option 2, that would be better!

I also noticed that documenting an API built with option 2 is a bit hacky, requiring a special Sphinx extension. We were planning to move to MkDocs, which may cause a problem. Any thoughts on what to do here?

To give you an idea about our roadmap, we are adding these features to MDIO:

  • Strict data models with version control.
  • Schematized dataset creation from JSON using Pydantic.
  • Separate the energy domain (oil & gas, wind, solar) functionality as plugins to MDIO and make the core more lightweight.
  • Domain-specific out of the box schemas for Seismic, Wind, and more in the future.
  • Somehow try to keep everything backwards compatible :-)
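The "schematized dataset creation from JSON using Pydantic" item could look roughly like the following. The model and field names are illustrative guesses, not MDIO's actual schema:

```python
import json

from pydantic import BaseModel


class VariableModel(BaseModel):
    """One array in the dataset schema (hypothetical fields)."""
    name: str
    dims: list[str]
    dtype: str


class DatasetModel(BaseModel):
    """Top-level schema with a version for strict, versioned data models."""
    api_version: str
    variables: list[VariableModel]


raw = """
{
  "api_version": "1.0",
  "variables": [
    {"name": "amplitude", "dims": ["inline", "crossline"], "dtype": "float32"}
  ]
}
"""
model = DatasetModel(**json.loads(raw))
print(model.variables[0].dtype)  # float32
```

Validation failures (a missing field, a wrong type) raise at parse time, which is what makes the "strict data models with version control" goal enforceable.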

Thanks!

tasansal avatar Sep 28 '23 16:09 tasansal

Hi @tasansal! This all sounds very ambitious and exciting!

My understanding is:

Yep pretty much! Also check out this new page on interoperability in xarray I wrote. (It's from a PR but should be released as part of the main docs soon).

Going through your list of features one-by-one:

a. Implement domain-specific repr and html_repr (adding more information to default reprs)

I'm actually not sure what the best way to fully overwrite the repr without subclassing would be. Monkey-patching xarray classes on import seems a bit hacky but might be enough...

b. Add more required but hidden metadata (like _ARRAY_DIMENSIONS) that won't show in the repr but is used for the internal representation of the data (maybe doing "a" first will handle this if it hides anything with the _ prefix).

Adding additional but hidden information would be a major change to the xarray data model - just storing the information normally but hiding it via the repr would be a lot easier if (a) is solved.

c. Have metadata conventions similar to ZEP0004

This seems fairly decoupleable from the other ideas, but for inspiration you should look at cf-xarray (which interprets CF conventions) and xarray-dataclasses.

d. Utilize lower-level Zarr machinery like Zarr locks, fsspec caching, etc.

Anything like this would likely be of interest to the wider xarray / Zarr community, so could be implemented as improvements to xarray's Zarr backend, for example.

e. Support Zarr v3.

This is in-progress for xarray, with most of the effort currently focused on making zarr-python support v3.

f. Have custom methods to access specific parts of the dataset.

This is easy using a custom accessor.

g. Have hidden variables with a suffix on disk that are used to mask/unmask data. Similar to numpy's masked arrays, where you keep a bool mask alongside the array data but it is transparent to the user.

I'm not sure I totally understand this one, but (1) if the mask is data-dependent, I feel it should still be explicitly listed instead of hidden, and (2) operations which require the mask could be re-implemented on an accessor. Still, this might be a decent reason to subclass.

Strict data models

Again take a look at xarray-dataclasses.

Given the above assumptions and requirements, what approach would work best for us?

I am leaning towards option 1, unfortunately, since it gives the ultimate flexibility.

I think it sounds plausible to solve all of this without subclassing (but I don't fully understand all the requirements, so please don't take that as gospel!). If you do decide you want to subclass then it would be amazing if you could help us out with a few upstream contributions to make the subclassing easier. See https://github.com/pydata/xarray/issues/3980

I also noticed API documentation using option 2 is a bit hacky using a special sphinx extension. We were planning to move to Mkdocs, which may cause a problem. Any thoughts on what to do here?

We also have plans to move our documentation to markdown (using MYST), so we would also be interested in any solution to this.

Does that help?

TomNicholas avatar Oct 26 '23 16:10 TomNicholas

Coming back to this @tasansal - looking at the MDIO docs it really seems like what you have here is very closely related to the general tensor storage engine Icechunk that we have built at Earthmover. I think you could get a lot of the features of MDIO with less development effort by simply using Icechunk, and our future roadmaps are also closely aligned.

Separately I also wonder if it would be possible to use MDIO's SEG-Y parser to make a VirtualiZarr reader for the SEG-Y format (see https://github.com/zarr-developers/VirtualiZarr/issues/218)...

TomNicholas avatar Apr 03 '25 22:04 TomNicholas

@TomNicholas, great timing! We are currently working on our MDIO v1 implementation which follows the xarray zarr encoding (interoperable w/ Xarray). We also decided to test subclassing Xarray's Dataset and DataArray as discussed above to inherit functionality and be able to add our custom methods or visualizations. I think all of the above will align nicely with icechunk as well.

Regarding VirtualiZarr, we had the same idea. I think it's theoretically possible but haven't had a chance to look in more detail. SEG-Y is basically a single chunk structured array (headers + sample data). We could calculate all binary offsets that fall on an N-D grid once we scan the file. The scan is painful for sure but will enable more convenient access. The performance won't match chunked Zarr, but better than YOLO'ing it with raw SEG-Y parsers.
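For fixed-length traces, the offset calculation described above is simple arithmetic: a classic SEG-Y file starts with a 3600-byte textual+binary header, followed by traces of a 240-byte trace header plus the samples. A sketch assuming 4-byte samples and uniform trace length (variable-length traces would break this):

```python
TEXT_BINARY_HEADER = 3600  # 3200-byte textual + 400-byte binary header
TRACE_HEADER = 240         # per-trace header size


def trace_offset(trace_index: int, n_samples: int, sample_size: int = 4) -> int:
    """Byte offset of the start of a given trace's header."""
    trace_bytes = TRACE_HEADER + n_samples * sample_size
    return TEXT_BINARY_HEADER + trace_index * trace_bytes


# e.g. the third trace (index 2) of a file with 1000 samples per trace:
print(trace_offset(2, 1000))  # 3600 + 2 * (240 + 4000) = 12080
```

A virtual reader would map each trace (or run of traces) to such a byte range instead of copying data, with the N-D grid scan supplying the trace-index-to-grid mapping.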

What are your thoughts?

tasansal avatar Apr 03 '25 23:04 tasansal

I think all of the above will align nicely with icechunk as well.

TBH I think that if Earthmover does its job well, eventually there won't be much need for MDIO to be a separate project at all - the only design goal I can see here that is not generic is supporting the SEG-Y file format.

I think it's theoretically possible but haven't had a chance to look in more detail. SEG-Y is basically a single chunk structured array (headers + sample data). We could calculate all binary offsets that fall on an N-D grid once we scan the file.

This sounds great, and like it has a good chance of working out. I would love to see you have a go at writing a virtual zarr reader, and I'm happy to provide guidance on that.

TomNicholas avatar Apr 04 '25 14:04 TomNicholas