Document use as a runtime translation layer
I realized there is now another use case for VirtualiZarr that is not clearly documented: using ManifestStore as a runtime translation layer to access other cloud-optimized formats (e.g. COGeoTIFF) through the zarr-python API.
Previously, to read data via zarr-python, you would have had to serialize the virtual references to Kerchunk/Icechunk first. Now (at least after a few PRs are merged) you can just do this:
# Sketch of the intended API once the pending PRs are merged
import xarray as xr
from virtualizarr.parsers import TiffParser
manifest_store = TiffParser("<file.tiff>")
xr.open_zarr(manifest_store)
Doing this for an already cloud-optimized format like COG is interesting, because it should be lightning fast: our parser can get all the metadata in one or two requests (that's what cloud-optimized means). This means that, at least for a single COG, there is no need to serialize the metadata to Kerchunk or Icechunk for later; we can just use it immediately.
This is important: it means that any Python program that can understand Zarr can immediately understand COG, with no performance penalty at runtime, no prior processing required, and no extra dependencies (beyond async_tiff for the parser).
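To make the "translation layer" idea concrete, here is a toy sketch of what a manifest-backed store does at read time: it maps a Zarr-style chunk key to a byte range inside an existing file and fetches only those bytes. This is not VirtualiZarr's actual implementation. The names `ChunkRef`, `ToyManifestStore`, `get`, and `demo` are hypothetical, the chunk keys are made up, and a local temporary file stands in for a remote COG.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRef:
    """Where one chunk lives: a file path plus a byte range (hypothetical type)."""
    path: str
    offset: int
    length: int


class ToyManifestStore:
    """Read-only lookup from Zarr-style chunk keys to bytes in existing files."""

    def __init__(self, manifest: dict[str, ChunkRef]) -> None:
        self._manifest = manifest

    def get(self, key: str) -> bytes:
        ref = self._manifest[key]  # raises KeyError for unknown chunk keys
        with open(ref.path, "rb") as f:
            f.seek(ref.offset)
            return f.read(ref.length)


def demo() -> tuple[bytes, bytes]:
    # Pretend this file is a TIFF: an 8-byte header followed by two 4-byte tiles.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"HEADER__" + b"AAAA" + b"BBBB")
        path = f.name
    try:
        store = ToyManifestStore({
            "band/c/0/0": ChunkRef(path, offset=8, length=4),
            "band/c/0/1": ChunkRef(path, offset=12, length=4),
        })
        return store.get("band/c/0/0"), store.get("band/c/0/1")
    finally:
        os.remove(path)
```

The point of the sketch is that nothing is copied or rewritten up front: the manifest is just metadata, and the original file is only touched when a chunk is requested.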
cc @maxrjones who I think realized this before I did, and @mdsumner who will be interested.
oh niice!! I wanted this kind of separation of the manifest, awesome
Note that you can use the same idea to quickly open data referenced by Kerchunk / DMR++ files, because those are also already cloud-optimized:
import xarray as xr
from virtualizarr.parsers import DMRPPParser
manifest_store = DMRPPParser("<file.dmrpp>")
ds = xr.open_zarr(manifest_store)
cc @ayushnag
Interestingly, when you do this for Kerchunk, the ManifestStore class is basically acting as a replacement for the fsspec "mapper" class and the zarr.storage.FsspecStore class (used here by kerchunk's xarray backend).
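As a rough illustration of that "mapper" role, here is a toy read-only mapping that resolves Kerchunk-style references (inline strings, or `[path]` / `[path, offset, length]` lists) into bytes on demand. It is only a sketch under those assumptions: `ToyReferenceMapper` and `demo_mapper` are hypothetical names, it handles local files only, and it is not how ManifestStore or fsspec are actually implemented.

```python
import base64
import os
import tempfile
from collections.abc import Iterator, Mapping


class ToyReferenceMapper(Mapping):
    """Hypothetical read-only mapping from store keys to bytes, driven by Kerchunk-style refs."""

    def __init__(self, refs: dict) -> None:
        self._refs = refs

    def __getitem__(self, key: str) -> bytes:
        ref = self._refs[key]
        if isinstance(ref, str):
            # Inline data: plain text, or base64 when prefixed with "base64:".
            if ref.startswith("base64:"):
                return base64.b64decode(ref[len("base64:"):])
            return ref.encode()
        with open(ref[0], "rb") as f:
            if len(ref) == 1:
                return f.read()  # reference to a whole file
            _, offset, length = ref
            f.seek(offset)
            return f.read(length)

    def __iter__(self) -> Iterator[str]:
        return iter(self._refs)

    def __len__(self) -> int:
        return len(self._refs)


def demo_mapper() -> dict:
    # A local file stands in for the remote object the references point at.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"xxxxDATAyyyy")
        path = f.name
    try:
        mapper = ToyReferenceMapper({
            ".zgroup": '{"zarr_format": 2}',  # inline metadata
            "a/0": [path, 4, 4],              # byte range inside the file
        })
        return dict(mapper)
    finally:
        os.remove(path)
```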
I was motivated to try this out because there are some issues with ReferenceFileSystem with Zarr-Python 3 that I have yet to fully understand (using FSMap is plagued by https://github.com/zarr-developers/zarr-python/issues/2706 and the kerchunk backend seems to hang when avoiding FSMap).
While we successfully use ManifestStore as a runtime translation layer in the tests to load directly from HDF5, etc., there's more work needed for loading data referenced in Kerchunk JSONs; see https://github.com/maxrjones/test-scripts/blob/main/test-load-virtual-zarr/test-virtualizarr-hdf5-zarr-python-3.py
Oh interesting, thanks for checking @maxrjones. At first glance that error looks like it might be quite an easy fix: it looks like we're passing a ManifestGroup object somewhere when we should be passing its string path within the store, or something like that.
Actually, it does work if you specify zarr_format=3. I updated the example linked above. It would probably be worth figuring out how to provide a more useful error message if someone omits zarr_format.