
RFC: Zarr v3

Open normanrz opened this issue 1 year ago • 23 comments

This is an RFC proposal for adopting Zarr v3 as the new storage format for OME-Zarr.

It is a followup to the discussions in #206 and on image.sc.

Briefly, this proposal aims to adopt Zarr v3 as the new format for the next version of OME-Zarr. That unlocks the new features of Zarr v3 including sharding. Zarr v2 would not be allowed anymore (only through older versions of OME-Zarr). Additionally, there are some small changes to the OME-Zarr metadata to improve namespacing and versioning.

This RFC is currently in draft status with the goal of clarifying questions before the full review. Additional endorsements are also welcome.

Check this link for a review: https://ngff--227.org.readthedocs.build/rfc/2/index.html

normanrz avatar Feb 15 '24 13:02 normanrz

This pull request has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/adopt-zarr-v3-in-ome-zarr/84786/2

imagesc-bot avatar Feb 15 '24 14:02 imagesc-bot

I think moving to zarr v3 is a great step. I would be happy to "endorse" this RFC.

Thanks, @normanrz !

kevinyamauchi avatar Feb 16 '24 08:02 kevinyamauchi

Does the sharding codec need to be detailed here or can we just name-drop it as an advantage of zarr v3 and then link to zarr's information about it?

There are a couple of things which could change in NGFF to make it more zarr-y. They're not blockers to adoption, just something to reduce downstream users' headaches, which IMO will never be addressed if not now.

One is the case convention: zarr uses snake_case, NGFF uses camelCase.

The other is how enums are tagged: zarr uses adjacently tagged ({"name": "something", "config": {"somekey": ..., "anotherkey": ...}}) where NGFF uses internally tagged ({"name": "something", "somekey": ..., "anotherkey": ...}).
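
To make the two tagging styles concrete, here is a minimal sketch of converting between them. The helper names are hypothetical and the keys are the placeholders from the example above; `config` follows the wording used here, whereas Zarr v3 metadata spells the key `configuration`.

```python
# Illustrative helpers only; not part of any Zarr or NGFF library.

def to_adjacently_tagged(obj, tag="name", content="config"):
    """Move every key except the tag into a nested content object."""
    body = {k: v for k, v in obj.items() if k != tag}
    return {tag: obj[tag], content: body}


def to_internally_tagged(obj, tag="name", content="config"):
    """Flatten the nested content object so its keys sit next to the tag."""
    flat = {tag: obj[tag]}
    flat.update(obj.get(content, {}))
    return flat


internal = {"name": "something", "somekey": 1, "anotherkey": 2}
adjacent = to_adjacently_tagged(internal)
print(adjacent)
# {'name': 'something', 'config': {'somekey': 1, 'anotherkey': 2}}
```

One practical difference: with internal tagging, a variant can never have a field that is itself called `name`, while adjacent tagging keeps the tag and the payload in separate namespaces.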

clbarnes avatar Feb 16 '24 11:02 clbarnes

One is the case convention: zarr uses snake_case, NGFF uses camelCase.

The other is how enums are tagged: zarr uses adjacently tagged ({"name": "something", "config": {"somekey": ..., "anotherkey": ...}}) where NGFF uses internally tagged ({"name": "something", "somekey": ..., "anotherkey": ...}).

These are both very good points -- on the latter, you might want to weigh in over at https://github.com/ome/ngff/pull/138, where a lot of new enums are being minted.

d-v-b avatar Feb 16 '24 11:02 d-v-b

Does the sharding codec need to be detailed here or can we just name-drop it as an advantage of zarr v3 and then link to zarr's information about it?

I think sharding is a major motivation to move to v3. That is why it gets so much space in my proposal. It is intended as background information and, in the end, has no direct implications on the OME-Zarr spec.

There are a couple of things which could change in NGFF to make it more zarr-y. They're not blockers to adoption, just something to reduce downstream users' headaches, which IMO will never be addressed if not now.

One is the case convention: zarr uses snake_case, NGFF uses camelCase.

The other is how enums are tagged: zarr uses adjacently tagged ({"name": "something", "config": {"somekey": ..., "anotherkey": ...}}) where NGFF uses internally tagged ({"name": "something", "somekey": ..., "anotherkey": ...}).

Pinging @joshmoore for advice on whether that should be a separate RFC. I think it is possible to bundle multiple RFCs into one new version of the spec.

normanrz avatar Feb 16 '24 12:02 normanrz

Thanks @normanrz! I'm also happy to endorse this RFC!

bogovicj avatar Feb 16 '24 15:02 bogovicj

Also happy to endorse, this would be very beneficial for our use cases, huge thanks @normanrz !

matthewh-ebi avatar Feb 16 '24 17:02 matthewh-ebi

Thanks @normanrz, I am also happy to endorse this RFC.

tischi avatar Feb 16 '24 18:02 tischi

Thanks a lot @normanrz , I am also happy to endorse this RFC! It will be great to get sharding for OME-Zarrs.

jluethi avatar Feb 16 '24 21:02 jluethi

I also endorse this RFC!

constantinpape avatar Feb 18 '24 12:02 constantinpape

I'm happy to endorse this RFC! 👍

will-moore avatar Feb 19 '24 09:02 will-moore

Obviously I'm all in favour of supporting v3, but:

Zarr v2 would not be allowed anymore (only through older versions of OME-Zarr).

What is the motivation for this? Why should we couple ome-zarr and zarr so tightly? If someone has ome-zarr v0.5 (or whatever) metadata at the root of a v2 zarr folder, why should that be forbidden?

jni avatar Feb 20 '24 03:02 jni

Gaaaah, don't comment before reading the article! 😅 I just read:

The metadata of Zarr v3 arrays are not backwards compatible with Zarr v2.

which explains it. However, it does still seem lightweight to support both on the ome side?

jni avatar Feb 20 '24 03:02 jni

However, it does still seem lightweight to support both on the ome side?

While it is easy to support both versions in the OME spec document, I'm concerned with the complexity burden for implementations. I'd rather not add one dimension to the compatibility matrix. OME-Zarr implementations that build upon libraries probably have good support for v2 and v3 at the moment. However, this might change in the future (anyone remember what happened to Zarr v1?). For example, in zarr-python, we are working on a refactoring that is v3-first. It is not unlikely that in the future v2 support will become deprecated. Also, there are implementations that roll their own Zarr stack that would need to add support for v2 and v3.

From an OME-Zarr user perspective, the hard cut also makes things simpler: ≤ 0.5 => Zarr v2; > 0.5 => Zarr v3 (or whatever the version number will be). If users wish to upgrade their data from one OME-Zarr version to another, it would be easy to also migrate the core Zarr metadata to v3. This is a fairly cheap operation, because only JSON files are touched. Zarr v2 and v3 metadata could even live side-by-side in the same hierarchy. There are functions available that can migrate the metadata automatically (e.g. in zarrita and soon zarr-python).
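
To illustrate why the migration only touches JSON, here is a rough sketch of translating a v2 `.zarray` document into a v3 `zarr.json` document. This is not the zarrita or zarr-python implementation; the function name and the dtype table are made up for illustration, and it deliberately handles only uncompressed, C-order arrays with a few little-endian dtypes. It also assumes that a missing v2 fill value can be written as 0, which a real migration tool would have to decide more carefully.

```python
import json

# Assumption: a tiny subset of dtypes; a real tool would cover many more.
V2_TO_V3_DTYPE = {
    "<u2": "uint16",
    "<i4": "int32",
    "<f4": "float32",
    "<f8": "float64",
}


def v2_array_to_v3(zarray, zattrs=None):
    """Build a Zarr v3 `zarr.json` dict from v2 `.zarray` / `.zattrs` dicts."""
    if zarray.get("compressor") is not None or zarray.get("filters"):
        raise NotImplementedError("codec translation is out of scope for this sketch")
    if zarray.get("order", "C") != "C":
        raise NotImplementedError("only C order is handled in this sketch")
    fill_value = zarray.get("fill_value")
    return {
        "zarr_format": 3,
        "node_type": "array",
        "shape": zarray["shape"],
        "data_type": V2_TO_V3_DTYPE[zarray["dtype"]],
        "chunk_grid": {
            "name": "regular",
            "configuration": {"chunk_shape": zarray["chunks"]},
        },
        # The "v2" key encoding keeps the existing chunk keys valid,
        # so no chunk file has to be renamed or rewritten.
        "chunk_key_encoding": {
            "name": "v2",
            "configuration": {"separator": zarray.get("dimension_separator", ".")},
        },
        # Assumption: v3 needs an explicit fill value, so null becomes 0 here.
        "fill_value": 0 if fill_value is None else fill_value,
        "codecs": [{"name": "bytes", "configuration": {"endian": "little"}}],
        "attributes": dict(zattrs or {}),
    }


v2_meta = {
    "zarr_format": 2,
    "shape": [1024, 1024],
    "chunks": [256, 256],
    "dtype": "<u2",
    "compressor": None,
    "filters": None,
    "fill_value": None,
    "order": "C",
}
v3_meta = v2_array_to_v3(v2_meta, {"note": "example attribute"})
print(json.dumps(v3_meta, indent=2))
```

Because the output is a separate `zarr.json` file, it can sit next to the untouched `.zarray` and `.zattrs` files in the same directory.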

normanrz avatar Feb 20 '24 08:02 normanrz

Sure, I guess it is indeed easier as a user to know if you have an ome-zarr v0.5 file that all readers would support it, rather than have to understand whether your reader supports zarr v2. It would indeed be easy to get into a situation like "does this USB cable support data transfer and at what speed?" 😅

Anyway, please take my question as more for my own information rather than as a blocker: I too am happy to endorse this plan. 😊

jni avatar Feb 20 '24 10:02 jni

normanrz commented 4 days ago Pinging @joshmoore for advice on whether that should be a separate RFC. I think it is possible to bundle multiple RFCs into one new version of the spec.

The current plan is definitely to collect multiple RFCs into a single spec version, but that being said, I personally don't think sharding needs a separate RFC. With the move to v3, we have the chance to decouple the NGFF spec from specifics of the backend (looking at you, dimension separator), so I agree with @clbarnes that should be more referencing an existing spec (ZEP1 & ZEP2) but we do need to make users & implementers aware of the trade-offs that a given backend provides them. It's going to be a fine balance.

normanrz commented 3 hours ago

However, it does still seem lightweight to support both on the ome side?

While it is easy to support both versions in the OME spec document,

I still need to go through the text, but a :+1: for including any explanations you give here in the main text if they are not already there.

joshmoore avatar Feb 20 '24 14:02 joshmoore

joshmoore commented 19 hours ago ... so I agree with @clbarnes that should be more referencing an existing spec (ZEP1 & ZEP2)...

@normanrz gently pointed out that I had misunderstood his question. The point was whether or not this RFC should include issues about camelCasing, etc. I would think not. Judging simply by the number of endorsers this has already received, adding more issues, especially ones that are prone to bikeshedding, can only make things less clear. (And for such topics, it likely makes sense to do some consensus building outside of the RFC before bringing it for review (D2 "gather support") but that's beyond the scope of this PR. :smile:)

joshmoore avatar Feb 21 '24 16:02 joshmoore

Thanks for the endorsements and feedback! I moved the proposal to draft state D3, which is intended to clarify questions before the review. Any feedback and questions are appreciated. More endorsements are also very welcome!

Based on the feedback, I moved the sections about sharding to "Background" (because it is not really part of this RFC; just illustrates a motivation for adopting v3) and added the motivation for dropping v2 support in the text.

normanrz avatar Feb 21 '24 17:02 normanrz

+1 to the chorus of endorsements.

I am looking forward to using Zarr v3 with several of my hats.

perlman avatar Feb 22 '24 20:02 perlman

Thanks @normanrz! I endorse this RFC.

ziw-liu avatar Mar 12 '24 21:03 ziw-liu

I endorse this RFC.

Some comments:

  • I like that the OME-Zarr metadata can now also be stored in array attributes. That partially satisfies this request https://github.com/ome/ngff/issues/207. multiscales in an array will need additional constraints on the datasets part though (e.g., single element with empty path).
  • I initially thought there were some unspecified changes to the coordinateTransformations metadata in the example, but I see it is based on this draft https://github.com/ome/ngff/pull/138.

LDeakin avatar Mar 14 '24 22:03 LDeakin

  • I like that the OME-Zarr metadata can now also be stored in array attributes. That partially satisfies this request https://github.com/ome/ngff/issues/207. multiscales in an array will need additional constraints on the datasets part though (e.g., single element with empty path).

While I would like that, I didn't intend to change this behavior as part of this RFC.

normanrz avatar Mar 20 '24 16:03 normanrz

This pull request has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/ome-ngff-update-postponing-transforms-previously-v0-5/95617/1

imagesc-bot avatar Apr 30 '24 12:04 imagesc-bot

As described in https://forum.image.sc/t/ome-ngff-update-postponing-transforms-previously-v0-5/95617/2, merging this to move forward with the first round of reviews. Thanks all for your feedback & endorsements.

joshmoore avatar Apr 30 '24 14:04 joshmoore

Commenting solely in an individual capacity, I endorse the premise that "OME-Zarr should adopt Zarr v3 as the storage format". I have recently deployed Zarr v3 shards within Janelia on behalf of Philip Keller's Lab for use with neuroglancer, tensorstore, and other compatible tools.

The content of the RFC itself is confusing. It seems to mostly replicate parts of the Zarr v3 specification and highlight key changes. Rather than duplicate such specification content, the RFC should mainly reference the canonical Zarr v3 specification.

Lacking from the RFC is content pertinent to OME-Zarr as a standard. I would particularly like to see the following.

  1. Details about how OME metadata would be integrated into the new storage format.
  2. Clarity about the status of image data in Zarr v2. Is Zarr v2 deprecated? Do we plan to maintain Zarr v2 as part of the OME-Zarr specification?
  3. Should there be OME-Zarr v2 and OME-Zarr v3 specifications?
  4. Further guidance on how to transition Zarr v2 content to Zarr v3. For each of the metadata examples in the current specification, how do those appear in the Zarr v3 specification?
  5. Is it compliant for an array to contain both Zarr v2 and Zarr v3 metadata files side by side?

mkitti avatar May 01 '24 15:05 mkitti

This pull request has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/ome-ngff-update-postponing-transforms-previously-v0-5/95617/4

imagesc-bot avatar May 01 '24 15:05 imagesc-bot

Thanks for your feedback @mkitti! Apart from just referencing the Zarr v3 spec, I thought it would be useful to highlight some of the v3 features as a motivation for adopting the new version. I didn't want to assume that everybody in this community is aware of all the changes in Zarr v3. I'll make sure to better separate that from the actual changes to OME-Zarr in the next iteration of the RFC after the first round of reviews. I also hope that things will become clearer once I have added the changes to the spec document.

Lacking from the RFC is content pertinent to OME-Zarr as a standard. I would particularly like to see the following.

  1. Details about how OME metadata would be integrated into the new storage format.

The OME metadata will be stored in the attributes of the groups' zarr.json files under a new versioned namespace https://ngff.openmicroscopy.org/0.5. This is explained in this section in the RFC.
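
As a rough sketch of what that layout could look like, the snippet below builds a group `zarr.json` document with the OME metadata nested under the namespace key. The helper name and the empty `multiscales` payload are placeholders; the actual content is defined by the spec document, not by this example.

```python
import json

NGFF_NAMESPACE = "https://ngff.openmicroscopy.org/0.5"


def group_metadata(ome_metadata):
    """Wrap OME-Zarr metadata in a Zarr v3 group document (hypothetical helper)."""
    return {
        "zarr_format": 3,
        "node_type": "group",
        # All OME-Zarr keys live under one versioned namespace key.
        "attributes": {NGFF_NAMESPACE: ome_metadata},
    }


# Placeholder payload, for illustration only.
doc = group_metadata({"multiscales": []})
print(json.dumps(doc, indent=2))
```

A reader can then check for the namespace key in `attributes` to find out which OME-Zarr version a group was written with.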

  2. Clarity about the status of image data in Zarr v2. Is Zarr v2 deprecated? Do we plan to maintain Zarr v2 as part of the OME-Zarr specification?
  3. Should there be OME-Zarr v2 and OME-Zarr v3 specifications?

OME-Zarr <0.5 only supports Zarr v2 and OME-Zarr ≥0.5 will only support Zarr v3. It is recommended that implementations support a number of OME-Zarr versions so that existing data can still be read. I think that recommendation is useful not only for this RFC. This is discussed in this section in the RFC.

  4. Further guidance on how to transition Zarr v2 content to Zarr v3. For each of the metadata examples in the current specification, how do those appear in the Zarr v3 specification?

There is some mention of migration scripts in this section of the RFC. The next iteration of the RFC after the first round of reviews will also contain the changes to the spec document, with updated examples.

  5. Is it compliant for an array to contain both Zarr v2 and Zarr v3 metadata files side by side?

Yes, Zarr v2 and v3 metadata files can exist side-by-side. As a consequence, OME-Zarr 0.4 and 0.5 metadata should also be able to exist side-by-side. With the new versioned namespace, this will be even easier for future versions. We should probably add an explicit recommendation that implementations prefer the newest version of the metadata that they support.
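
A possible shape for that recommendation, as a sketch with hypothetical function names: given the OME-Zarr versions found in a hierarchy and the versions a reader supports, pick the newest one they have in common.

```python
def version_key(version):
    """Turn "0.5" into (0, 5) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in version.split("."))


def pick_version(available, supported):
    """Return the newest version that is both present and supported, or None."""
    candidates = set(available) & set(supported)
    return max(candidates, key=version_key) if candidates else None


print(pick_version(["0.4", "0.5"], supported=["0.4", "0.5"]))  # 0.5
print(pick_version(["0.4", "0.5"], supported=["0.4"]))  # 0.4
```

Comparing parsed tuples rather than raw strings avoids surprises once a version such as "0.10" exists.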

normanrz avatar May 01 '24 18:05 normanrz