zarr-python icon indicating copy to clipboard operation
zarr-python copied to clipboard

ZEP9: Parse Metadata Objects

Open brokkoli71 opened this issue 11 months ago • 6 comments

This PR implements the following aspects of ZEP9 (Phase 1)

  • If metadata JSON contains invalid keys, or if a value object contains invalid keys, the zarr array should be rejected unless it contains `"must_understand": false``.
  • It should be possible to specify a value in metadata in the following ways
    • as a JSON object
    • As a string value (only if the object would have no attributes other than its name).

this will help to enable zarr extensions

TODOs in code:

  • [x] read metadata fails accordingly (+test)
  • [x] specifying metadata value works as string (+test)
    • [x] in particular bytes codec should not be possible to specify as string value if it requires the "endian" argument for multi-byte data types

TODO:

  • [x] Add unit tests and/or doctests in docstrings
  • [x] Add docstrings and API docs for any new/modified user-facing classes and functions
  • [x] New/modified features documented in docs/user-guide/*.rst
  • [x] Changes documented as a new file in changes/
  • [x] GitHub Actions have all passed
  • [x] Test coverage is 100% (Codecov passes)

brokkoli71 avatar Feb 26 '25 10:02 brokkoli71

what are the current use cases for this?

d-v-b avatar May 13 '25 12:05 d-v-b

what are the current use cases for this?

There are a number of extensions appearing now in zarr-extensions. Potentially, there will also be modifications to existing extensions such as added attributes in the metadata. zarr-python should fail if it encounters unknown extensions or attributes, unless marked with must_understand=false. This PR helps with that.

normanrz avatar May 16 '25 07:05 normanrz

as I understand it, all of the extensions proposed in zarr-extensions have static JSON representations. Is this not true?

And this PR seems to allow the possibility that any one of those JSON representations might gain new fields, which should be ignored iff those fields are JSON objects containing the {"must_understand" : False}. That seems kind weird to me -- why would we allow an entirely optional subset of, e.g., the gzip codec JSON? I was under the impression that for codecs and data types, the JSON representation contained everything you need to know to make sense of the data type. I don't see how optional, ignorable fields fit into that model.

d-v-b avatar May 16 '25 07:05 d-v-b

There may be fields that are not strictly necessary for reading data that could be marked as optional. An example might be a "chunk_layout" in the sharding codec to denote how the chunks are ordered in the shard, e.g. morton, c, random etc.. While useful when writing, it is not necessary for reading because all chunk offsets are stored in the index.

Additionally, there may be new optional fields that are added to the root of the array or group metadata through a ZEP.

normanrz avatar May 16 '25 08:05 normanrz

There may be fields that are not strictly necessary for reading data that could be marked as optional. An example might be a "chunk_layout" in the sharding codec to denote how the chunks are ordered in the shard, e.g. morton, c, random etc.. While useful when writing, it is not necessary for reading because all chunk offsets are stored in the index.

Additionally, there may be new optional fields that are added to the root of the array or group metadata through a ZEP.

these examples are not yet in use, which is why I asked what the current use cases are. This PR makes changes to how metadata is parsed (e.g., checking the contents of gzip json metadata) that, as far as I can tell, have no use.

Currently, I think people can expect that zarr-python can round-trip zarr data. To me, that means that if zarr-python can read zarr data from one place, it should be able to create a structurally identical copy of that zarr data somewhere else. The concept in this PR -- that we would support extra metadata fields in any metadata object which can be ignored when reading -- violates this expectation. So I think we need to have a larger conversation about what these hypothetical optional metadata fields mean for zarr-python before we add support for them. Until there are real examples out there of metadata with these must_understand=False fields, I don't feel like I can properly evaluate this PR.

d-v-b avatar May 16 '25 08:05 d-v-b

Fair enough. Then we should scope this PR to

  • enforcing that unknown fields lead to an error
  • accepting both name-only and object notations of extensions

normanrz avatar May 16 '25 08:05 normanrz