read-write Xarray with nc-dataset adapter.
Targets the feature branch, just to unify testing. This introduces a "netcdf dataset adapter" for xarray datasets, which lets Iris read and write xarray data as if it were a netcdf file. Also combined here (and not really separate for now):
- a new loading "scheme" enabling iris to load from odd objects in place of file-specs
- changes enabling the netcdf file format to read from an open dataset (instead of filepath)
- changes enabling the netcdf file format to save to an open dataset (instead of filepath)
If this idea flies, it might be worth separating out the core-iris changes into a separate PR rather than doing it all at once?
Hence "WIP"
~Also, various TO-DOs noted on this old "private PR" : https://github.com/pp-mo/iris/pull/72~ (update: all done)
Status update: Almost there with CI. Tests are functioning. Now just some docs-build errors.
Lazy xarray output
Here's an interesting follow-on possibility ... https://github.com/pp-mo/iris/pull/73 (keeping this feature separate for now, just to simplify the discussion).
Without this, Iris "streams" all data into the netcdf variables, as it would to a file, so everything gets realised and stored in memory.
Important Context : things this can be used for ?
In addition to just "fixing data a bit" on either load or save, I think this might possibly help with a number of outstanding "desired features" (all relating to netcdf) :
- load chunks control : https://github.com/SciTools/iris/issues/3333
- precise formatting round-trips : https://github.com/SciTools/iris/discussions/4215
- lazy saving : https://github.com/SciTools/iris/issues/4190
- output "append" mode : https://github.com/SciTools/iris/issues/565
Generally, all these issues may be better handled in Xarray, since they are close to the file representation -- so a solution there can work with specific variables+dimensions, rather than Iris CF-representing objects.
Ping @dcherian
Just thought that I'd make you aware of this PR.
It's early days, but you may be able to offer some perspective on this...
> Ping @dcherian
Thanks for making the link @bjlittle
FWIW some other comments on this that you might also like to consider ..
The current approach here is rather untidy, and grew out of an attempt to create a simple 'wrapper' object to an existing xarray dataset. But that basic approach actually fails once write support is added, since it is not possible to modify the dimensions of an existing xarray dataset, and the result is now a rather clunky "mixed" approach.
The principal limitation of this, as it stands, is that it expects an xarray dataset which does not interpret cf/coords/time-data, which is why those are all turned off in the test example. Likewise, the xarray that you get out of the 'to_array' method will also have shortcomings in those respects, and will not match a "normal" load from file (e.g. in terms of time data). [ update, 2022-08-18 :
- in fact, we maybe should also be using "mask_and_scale=False" in the loading too
- as I think @TomekTrzeciak also pointed out, this particular xarray encoding aspect is (perhaps uniquely?) non-reversible,
- .. in that, it loses information by (potentially) mixing up NaN values and masked points (i.e. any occurrence of fill-value)
- however, the other data transformations (which this disables) can all, in principle, be "undone" by design (the decode+encode methods). So there is a valid route to removing those interpretations from a 'regular' dataset, as is done on saving.
]
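To make the non-reversibility point concrete, here is a minimal pure-Python sketch. It is not xarray's actual implementation: `decode`, `encode` and the `FILL` value are hypothetical names made up for this illustration, which only mimics the "fill-value becomes NaN on decode, NaN becomes fill-value on encode" behaviour described above.

```python
import math

FILL = -999.0  # hypothetical _FillValue, chosen only for this illustration


def decode(raw, fill_value=FILL):
    """Mimic a mask-and-scale style decode: fill-value points become NaN."""
    return [math.nan if v == fill_value else v for v in raw]


def encode(decoded, fill_value=FILL):
    """Mimic the matching encode: every NaN point becomes the fill value."""
    return [fill_value if math.isnan(v) else v for v in decoded]


# Raw "file" content holding one genuine NaN and one masked (fill-value) point.
raw = [1.0, math.nan, FILL, 4.0]

decoded = decode(raw)            # both special points are now NaN
round_tripped = encode(decoded)  # both special points are now the fill value

# The genuine NaN at index 1 has been turned into a masked point.
nan_survived = math.isnan(round_tripped[1])
```

After the round trip the genuine NaN can no longer be told apart from the masked point, which is the information loss that loading with "mask_and_scale=False" would avoid.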
If this "load/save xarray as a netcdf dataset" approach has some general utility, then we should put it into xarray (not Iris!), and work on those problems, which I believe is entirely feasible.
Thanks @bjlittle
Is there an issue I can read about the motivation for this?
If it is to take advantage of Xarray's read/write backend options, would it be easier to use the to_iris method?
> Thanks @bjlittle
> Is there an issue I can read about the motivation for this?
You're quite right, there should probably be an issue. I will work towards that ... For now, I can basically say that the ideal goal is ~lossless conversion between iris, xarray and netcdf files.
> If it is to take advantage of Xarray's read/write backend options, would it be easier to use the to_iris method?
It seems to me that a big "problem" with xarray to_iris/from_iris is that it requires embedding knowledge of Iris into xarray. And also, effectively, knowledge of CF, and what parts of CF Iris supports -- all of which is really Iris' business.
Meanwhile, there is a reasonable argument that an iris to_xarray/from_xarray would be a better approach, since what Iris does is mostly "extend" xarray capabilities by adding CF handling.
But that then causes essentially the same problems of injecting xarray-specifics into Iris.
And with both those solutions, there is a particular problem regarding dependencies and testing, especially for C.I. :
- in order to test an xarray to/from-iris, the xarray CI needs to test against Iris, which is a costly dependency
- and exactly the same is true for the "converse solution" (iris to/from_xarray)
However, this kind of "dataset adapter" approach only depends on knowledge of xarray and netCDF4.
- so, as it has no Iris-specific knowledge, it really should not be here (**)
- it could properly belong in xarray,
- or in a separate package
- and it could then be used by any other packages wanting to establish file-less linkage to xarray data
- does that seem feasible ?
(**) but we still need certain changes in Iris :
- support reading/writing a netcdf-dataset (or similar) instead of a filepath
- pass through lazy data on write, instead of realise+store
Hence, code here is really just a PoC, showing how those solutions can work together
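As a rough illustration of the "dataset adapter" idea (not the code in this PR), the sketch below shows an object that presents a small part of the netCDF4.Dataset surface (`dimensions`, `variables`, `createDimension`, `createVariable`, `setncattr`/`getncattr`) over plain Python containers. The names `DictDatasetAdapter`, `_Variable` and `write_1d` are hypothetical, and dimensions are simplified to plain lengths rather than Dimension objects. The point is only that a writer which talks to that surface needs no knowledge of what sits behind it.

```python
class _Variable:
    """Hypothetical variable object: holds data plus attributes in memory."""

    def __init__(self, name, datatype, dimensions):
        self.name = name
        self.datatype = datatype
        self.dimensions = dimensions
        self._data = None
        self._attrs = {}

    def __setitem__(self, key, values):
        # This sketch only supports whole-array assignment: var[:] = values
        if key != slice(None):
            raise NotImplementedError("only var[:] = values is supported")
        self._data = list(values)

    def __getitem__(self, key):
        return self._data[key]

    def setncattr(self, name, value):
        self._attrs[name] = value

    def getncattr(self, name):
        return self._attrs[name]


class DictDatasetAdapter:
    """Hypothetical adapter exposing a netCDF4.Dataset-like surface."""

    def __init__(self):
        self.dimensions = {}  # name -> length (simplified)
        self.variables = {}   # name -> _Variable

    def createDimension(self, dimname, size):
        self.dimensions[dimname] = size
        return size

    def createVariable(self, varname, datatype, dimensions=()):
        var = _Variable(varname, datatype, tuple(dimensions))
        self.variables[varname] = var
        return var


def write_1d(ds, name, dim_name, values, units):
    """Write one 1-D variable using only dataset-style calls on `ds`."""
    ds.createDimension(dim_name, len(values))
    var = ds.createVariable(name, "f8", (dim_name,))
    var.setncattr("units", units)
    var[:] = values
    return var


target = DictDatasetAdapter()
write_1d(target, "air_temperature", "time", [280.5, 281.0, 279.8], "K")
```

Because `write_1d` uses only dataset-style calls, the same function could in principle be pointed at an open netCDF4 dataset or at an xarray-backed adapter, which is the separation of concerns argued for above.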
I have now broken this by advancing the feature branch with a mergeback. But that's OK: I am now rationalising this work to separate concerns a bit.
So the "let iris read+write nc datasets" part is now up, in https://github.com/SciTools/iris/pull/5024, other PRs to follow, against the updated feature-branch.