VirtualiZarr

Support flattened dmrpp files.

Open betolink opened this issue 9 months ago • 2 comments

Some dmrpp files created by OPeNDAP override the hierarchical structure of the HDF5/NetCDF format and flatten it. When they do, some dimensions are assigned the names phony_dim_1, phony_dim_2, etc., and variables are not parsed correctly.

ICESat-2: https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL06/006/2020/01/02/ATL06_20200102190333_01080603_006_01.h5.dmrpp
SMAP: https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/SMAP/SPL4SMGP/007/2023/12/31/SMAP_L4_SM_gph_20231231T223000_Vv7031_001.h5.dmrpp

According to @Mikejmnez, this wouldn't be a heavy lift, and it would allow us to support more collections (until they fix the dmrpp generation). cc @danielfromearth @ayushnag

betolink avatar Apr 29 '25 23:04 betolink

I will take a look at this. @betolink can you write a minimal example that reproduces the error? It will help me understand if the issue is the dmr++ itself, the parser, or the dataset (there may be a combination of those things).

Flattening is an option in the dmr++ build process. Flattened dmr++ files were produced until relatively recently because many other NASA tools that use dmr++ files (and clients talking directly to Hyrax data servers) could not understand or parse Groups. Those dmr++ files therefore flattened access to the data, even though the original file was not flat. Many of the same NASA tools, along with client APIs, are now compatible with Groups, so newer dmr++ files do not necessarily need to be flattened, and some of the DAACs are choosing this route.

And so the problem may be neither the parser itself nor the dmr++, but rather an error that arises when trying to access a hierarchical file as if it were flat (following the dmr++ route).
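To make the flat-versus-hierarchical distinction concrete, here is a minimal sketch that regroups flattened variable names into a nested group tree. It assumes, purely for illustration, that a flattened dmr++ keeps each variable's full path as a single name with `/` separators. The helper `regroup` and the sample names are hypothetical and are not part of VirtualiZarr or any dmr++ tooling.

```python
def regroup(flat_names):
    """Build a nested dict of groups from path-like variable names.

    Hypothetical helper: assumes names look like "group/subgroup/variable".
    Groups map to dicts; variables map to None.
    """
    tree = {}
    for name in flat_names:
        parts = [p for p in name.split("/") if p]
        if not parts:
            continue  # skip empty names
        node = tree
        for group in parts[:-1]:
            # Descend into the group, creating it on first sight.
            node = node.setdefault(group, {})
        node[parts[-1]] = None
    return tree


# Illustrative names only; not read from the files linked in this issue.
flat = [
    "gt1l/land_ice_segments/h_li",
    "gt1l/land_ice_segments/latitude",
    "orbit_info/cycle_number",
]
tree = regroup(flat)
```

A reader that treats the file as flat would look up `gt1l/land_ice_segments/h_li` as one top-level name, while a hierarchical reader would walk `tree["gt1l"]["land_ice_segments"]` first.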

phony_dims

The presence of {phony_dim_1, phony_dim_2, ..., phony_dim_N} means that the original file does not have named dimensions, neither global nor local. The same option that flattens the dmr++ also creates these missing named dimensions. That is probably not the error that @betolink is finding, but it is one that any user will stumble upon once the dmr++ gets updated and is no longer flattened. I have run into dmr++s that are not flattened and that do not have named dimensions. The current dmr++ parser errors when it cannot find a name for a dimension. Even when I attempt to create a dataset with xarray talking to the cloud OPeNDAP server, creating the xarray dataset fails because there is a mismatch between the shape of the arrays and the number of (named) dimensions.
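As a rough illustration of how placeholder names remove the shape/dimension mismatch, the sketch below gives every axis of every array a `phony_dim_N` name, sharing one name among axes of equal length. The function `fill_missing_dims` and the numbering rule (by order of first appearance) are assumptions made for this example; they do not describe what the VirtualiZarr parser or the dmr++ builder actually does.

```python
def fill_missing_dims(shapes):
    """Assign a phony dimension name to every axis of every variable.

    shapes: dict mapping variable name -> array shape (tuple of ints).
    Assumption for this sketch: axes with the same length share one name,
    numbered in order of first appearance.
    """
    by_size = {}  # axis length -> phony dimension name
    dims = {}
    for var, shape in shapes.items():
        names = []
        for size in shape:
            if size not in by_size:
                by_size[size] = f"phony_dim_{len(by_size)}"
            names.append(by_size[size])
        dims[var] = tuple(names)
    return dims


# Illustrative shapes only.
shapes = {"h_li": (100,), "latitude": (100,), "grid": (3, 100)}
dims = fill_missing_dims(shapes)
```

Each variable ends up with exactly as many dimension names as it has axes, which is the invariant xarray needs. Note that this simple rule would repeat a name within a single square array such as `(5, 5)`, so a real implementation would need to handle that case separately.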

Mikejmnez avatar Apr 30 '25 15:04 Mikejmnez

There is logic here in the parser to handle phony_dims; however, as you said, it may be a combination of factors that is causing it not to work for this dataset.

ayushnag avatar May 01 '25 18:05 ayushnag