
read-write Xarray with nc-dataset adapter.

Open pp-mo opened this issue 3 years ago • 7 comments

Targets feature branch, just to unify testing. This introduces a "netcdf dataset adapter" for xarray datasets, which makes it possible to read/write xarray data as if it were a netcdf file. ALSO combined here (and not really separate for now):

If this idea flies, it might be worth separating out the core-iris changes into a separate PR rather than doing it all at once?

Hence "WIP"

~Also, various TO-DOs noted on this old "private PR" : https://github.com/pp-mo/iris/pull/72~ (update: all done)

pp-mo avatar Jun 28 '22 11:06 pp-mo

Status update: Almost there with CI. Tests are functioning. Now just some docs-build errors.

pp-mo avatar Jun 28 '22 13:06 pp-mo

Lazy xarray output

Here's an interesting follow-on possibility ... https://github.com/pp-mo/iris/pull/73 (keeping this feature separate for now, just to simplify the discussion).

Without this, Iris "streams" all data into the netcdf variables, as it would to a file, so everything gets realised and stored in memory.

pp-mo avatar Jun 29 '22 16:06 pp-mo

Important Context : things this can be used for ?

In addition to just "fixing data a bit" on either load or save, I think this might possibly help with a number of outstanding "desired features" (all relating to netcdf) :

  • load chunks control : https://github.com/SciTools/iris/issues/3333
  • precise formatting round-trips : https://github.com/SciTools/iris/discussions/4215
  • lazy saving : https://github.com/SciTools/iris/issues/4190
  • output "append" mode : https://github.com/SciTools/iris/issues/565

Generally, all these issues may be better handled in Xarray, since they are close to the file representation -- so a solution there can work with specific variables+dimensions, rather than Iris CF-representing objects.

pp-mo avatar Jul 01 '22 10:07 pp-mo

Ping @dcherian

Just thought that I'd make you aware of this PR.

It's early days, but you may be able to offer some perspective on this...

bjlittle avatar Jul 25 '22 15:07 bjlittle

> Ping @dcherian

Thanks for making the link @bjlittle

FWIW some other comments on this that you might also like to consider ..

The current approach here is rather untidy. It grew out of an attempt to create a simple 'wrapper' object around an existing xarray dataset. But that basic approach actually fails once write support is added, since it is not possible to modify the dimensions of an existing xarray dataset, and the result is now a rather clunky "mixed" approach.

The principal limitation of this, as it stands, is that it expects an xarray dataset which does not interpret cf/coords/time-data - which is why those are all turned off in the test example. Likewise, the xarray that you get out of the 'to_array' method will also have shortcomings in those respects + not match a "normal" load from file (e.g. in terms of time data). [ update, 2022-08-18 :

  • in fact, we maybe should also be using "mask_and_scale=False" in the loading too
  • as I think @TomekTrzeciak also pointed out, this particular xarray encoding aspect is (perhaps uniquely?) non-reversible,
    • .. in that, it loses information by (potentially) mixing up NaN values and masked points (i.e. any occurrence of fill-value)
  • however, the other data transformations (which this disables) can all, in principle, be "undone" by design (the decode+encode methods). So there is a valid route to removing those interpretations from a 'regular' dataset, as is done on saving.

]
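
The information loss described in the update above can be shown without xarray at all. What follows is a minimal pure-Python sketch, assuming a simplified model of the masking step (fill-value points become NaN on decode, every NaN becomes the fill value on encode). The `decode` and `encode` helpers and the `FILL` value are hypothetical illustrations, not xarray's actual implementation.

```python
import math

FILL = -999.0  # hypothetical _FillValue for this variable (an assumption)


def decode(raw, fill=FILL):
    """Simplified masking step: replace fill-value points with NaN."""
    return [math.nan if v == fill else v for v in raw]


def encode(decoded, fill=FILL):
    """Simplified inverse: replace every NaN with the fill value."""
    return [fill if math.isnan(v) else v for v in decoded]


# Raw "file" content: one masked point (FILL) and one genuine NaN value.
raw = [1.0, FILL, math.nan, 4.0]

round_tripped = encode(decode(raw))

# The genuine NaN at index 2 comes back as a fill value, so it can no
# longer be told apart from the masked point at index 1.
nan_survived = math.isnan(round_tripped[2])
```

Under this model the round trip is not the identity: the two kinds of "missing" point collapse into one, which is why disabling the masking step on load avoids the problem.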

If this "load/save xarray as a netcdf dataset" approach has some general utility, then we should put it into xarray (not Iris!), and work on those problems, which I believe is entirely feasible.

pp-mo avatar Jul 25 '22 16:07 pp-mo

Thanks @bjlittle

Is there an issue I can read about the motivation for this?

If it is to take advantage of Xarray's read/write backend options, would it be easier to use the to_iris method?

dcherian avatar Jul 25 '22 17:07 dcherian

> Thanks @bjlittle
>
> Is there an issue I can read about the motivation for this?

You're quite right, there should probably be an issue. I will work towards that ... For now, I can basically say that the ideal goal is ~lossless conversion between iris, xarray and netcdf files.


> If it is to take advantage of Xarray's read/write backend options, would it be easier to use the to_iris method?

It seems to me that a big "problem" with xarray to_iris/from_iris is that it requires embedding knowledge of Iris into xarray. And also, effectively, knowledge of CF, and what parts of CF Iris supports -- all of which is really Iris' business.

Meanwhile, there is a reasonable argument that an iris to_xarray/from_xarray would be a better approach, since what Iris does is mostly "extend" xarray capabilities by adding CF handling. But that then causes essentially the same problems of injecting xarray-specifics into Iris.

And with both those solutions, there is a particular problem regarding dependencies and testing, especially for C.I. :

  • in order to test an xarray to/from-iris, the xarray CI needs to test against Iris, which is a costly dependency
  • and exactly the same is true for the "converse solution" (iris to/from_xarray)

However, this kind of "dataset adapter" approach only depends on knowledge of xarray and netCDF4.

  • so, as it has no Iris-specific knowledge, it really should not be here (**)
    • it could properly belong in xarray,
    • or in a separate package
  • and it could then be used by any other packages wanting to establish file-less linkage to xarray data
    • does that seem feasible ?

(**) but we still need certain changes in Iris :

  1. support reading/writing a netcdf-dataset (or similar) instead of a filepath
  2. pass through lazy data on write, instead of realise+store

Hence, code here is really just a PoC, showing how those solutions can work together
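
As a rough illustration of the two Iris-side changes listed above, here is a minimal sketch in plain Python. It assumes a hypothetical in-memory `DatasetLike` class that mimics a small part of the `netCDF4.Dataset` interface (`dimensions`, `variables`, `createDimension`, `createVariable`), and a hypothetical `save_to` function standing in for a saver; neither is the actual code in this PR. "Lazy" content is modelled as a zero-argument callable rather than a dask array.

```python
class VariableLike:
    """Hypothetical stand-in for a netCDF4 variable."""

    def __init__(self, name, dimensions):
        self.name = name
        self.dimensions = tuple(dimensions)
        self.data = None  # holds realised values or a lazy callable


class DatasetLike:
    """Hypothetical in-memory object mimicking part of netCDF4.Dataset."""

    def __init__(self):
        self.dimensions = {}  # name -> size (the real API stores Dimension objects)
        self.variables = {}   # name -> VariableLike

    def createDimension(self, dimname, size=None):
        self.dimensions[dimname] = size
        return size

    def createVariable(self, varname, datatype, dimensions=()):
        var = VariableLike(varname, dimensions)
        self.variables[varname] = var
        return var


def save_to(target, name, sizes, content, keep_lazy=False):
    """Write one variable into a dataset-like object.

    (1) `target` is an object with a dataset-style API, not a filepath.
    (2) With keep_lazy=True, lazy content is attached uncomputed instead
        of being realised and stored.
    """
    for dimname, size in sizes.items():
        if dimname not in target.dimensions:
            target.createDimension(dimname, size)
    var = target.createVariable(name, "f8", tuple(sizes))
    if callable(content) and not keep_lazy:
        content = content()  # "realise + store" behaviour
    var.data = content
    return var


# Usage: count how often the lazy content is actually computed.
calls = []


def lazy_values():
    calls.append("computed")
    return [0.0, 1.0, 2.0]


ds = DatasetLike()
streamed = save_to(ds, "a", {"x": 3}, lazy_values)
passed = save_to(ds, "b", {"x": 3}, lazy_values, keep_lazy=True)
```

Because the saver only touches the dataset-style methods, the same call could in principle target a real file-backed dataset or an adapter over an xarray dataset, which is the separation of concerns argued for above.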

pp-mo avatar Jul 26 '22 13:07 pp-mo

I have now broken this by advancing the feature branch with a mergeback. But that's ok, I am now rationalising this work to separate concerns a bit.

So the "let iris read+write nc datasets" part is now up, in https://github.com/SciTools/iris/pull/5024, other PRs to follow, against the updated feature-branch.

pp-mo avatar Oct 12 '22 15:10 pp-mo