What options do I have for `<U#` type data saving in zarr?

Open doronbehar opened this issue 11 months ago • 8 comments

What is your issue?

So I tried out xarray today with zarr version 3.0.4, and encountered these scary warnings:

/nix/store/qasysgiacqplrbda5yl65wg7jrs0gcjl-python3-3.12.9-env/lib/python3.12/site-packages/zarr/codecs/vlen_utf8.py:44: UserWarning: The codec `vlen-utf8` is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.
  return cls(**configuration_parsed)
/nix/store/qasysgiacqplrbda5yl65wg7jrs0gcjl-python3-3.12.9-env/lib/python3.12/site-packages/zarr/core/array.py:3991: UserWarning: The dtype `<U5` is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.
  meta = AsyncArray._create_metadata_v3(
/nix/store/qasysgiacqplrbda5yl65wg7jrs0gcjl-python3-3.12.9-env/lib/python3.12/site-packages/zarr/api/asynchronous.py:203: UserWarning: Consolidated metadata is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.
  warnings.warn(

An MWE is:

import xarray as xr
import numpy as np

xr.DataArray(np.array([
    "hello",
    "world",
])).to_zarr("test_utf8_strings.zarr")

Is <U5 a variable length utf8 type? It shouldn't be... Also, what are my alternatives?

doronbehar avatar Feb 25 '25 17:02 doronbehar

Is <U5 a variable length utf8 type? It shouldn't be...

This is the easy question! It's a fixed-length Unicode string (NumPy stores it as UTF-32, four bytes per character), but I believe Zarr does encode it as UTF-8.
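To make the fixed-length versus variable-length distinction concrete, here is a minimal sketch that uses only the Python standard library to mimic how many bytes one element takes in each representation. It assumes NumPy's in-memory layout for `<U5` (five UTF-32 code units, NUL-padded); the helper names are made up for illustration, and it says nothing about how Zarr actually lays out the bytes on disk.

```python
# Sketch only: mimics the in-memory size of one array element.
# Assumption: `<U5` means 5 little-endian UTF-32 code units, NUL-padded.
FIXED_WIDTH = 5  # the "5" in `<U5`


def fixed_width_utf32_size(word: str, width: int = FIXED_WIDTH) -> int:
    """Bytes used by one element of a fixed-width UTF-32 string array."""
    padded = word.ljust(width, "\0")  # short strings are padded with NUL
    return len(padded.encode("utf-32-le"))


def utf8_size(word: str) -> int:
    """Bytes used by the same string when encoded as variable-length UTF-8."""
    return len(word.encode("utf-8"))


for word in ["hello", "world", "hi"]:
    print(word, fixed_width_utf32_size(word), utf8_size(word))
```

Every element costs the same number of bytes in the fixed-width layout regardless of its content, while the UTF-8 size follows the actual string.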

For what it's worth, the future proof way to create NumPy arrays of UTF-8 data is to use the UTF-8 string dtype (np.dtypes.StringDType, which requires numpy v2). However, this is not (yet) the default in NumPy or Xarray.

Also, what are my alternatives?

You can write Zarr v2 files by passing zarr_version=2, which will silence most of these warnings, but not really resolve them, given that these are non-standard Zarr v2 conventions, too.

Otherwise, Zarr v3 needs a way to silence these warnings. And perhaps an advocate to push through the Zarr standardization process :).

shoyer avatar Mar 05 '25 23:03 shoyer

@d-v-b is working on expanding the dtype story in zarr3 now -- including fixed-length-strings. Expect an update within a month or so here.

jhamman avatar Mar 05 '25 23:03 jhamman

For what it's worth, the future proof way to create NumPy arrays of UTF-8 data is to use the UTF-8 string dtype (np.dtypes.StringDType, which requires numpy v2). However, this is not (yet) the default in NumPy or Xarray.

If you mean by "not the default" that np.array(["hello", "world"]) without explicitly specifying a dtype argument, doesn't use np.dtypes.StringDType, but uses <U5 by default, then I understand what you are saying. However, personally I don't think it should be the default :). Also, just to clear up a bit of ambiguity I found in that sentence, I tried:

xr.DataArray(np.array(
    ["hello", "world"],
    dtype=np.dtypes.StringDType,
)).to_zarr("test_utf8_strings.zarr")

And it failed miserably:

Traceback (most recent call last):
  File "/home/doron/repos/lab-ion-trap-simulations/./t.py", line 9, in <module>
    )).to_zarr("test_utf8_strings.zarr")
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/xarray/core/dataarray.py", line 4428, in to_zarr
    return to_zarr(  # type: ignore[call-overload,misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/xarray/backends/api.py", line 2216, in to_zarr
    dump_to_store(dataset, zstore, writer, encoding=encoding)
  File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/xarray/backends/api.py", line 1952, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/xarray/backends/zarr.py", line 1022, in store
    self.set_variables(
  File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/xarray/backends/zarr.py", line 1194, in set_variables
    zarr_array = self._create_new_array(
                 ^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/xarray/backends/zarr.py", line 1089, in _create_new_array
    zarr_array = self.zarr_group.create(
                 ^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/zarr/hierarchy.py", line 1195, in create
    return self._write_op(self._create_nosync, name, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/zarr/hierarchy.py", line 952, in _write_op
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/zarr/hierarchy.py", line 1201, in _create_nosync
    return create(store=self._store, path=path, chunk_store=self._chunk_store, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/zarr/creation.py", line 209, in create
    init_array(
  File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/zarr/storage.py", line 455, in init_array
    _init_array_metadata(
  File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/zarr/storage.py", line 584, in _init_array_metadata
    raise ValueError("missing object_codec for object array")
ValueError: missing object_codec for object array

The above was obtained with Zarr v2. With Zarr v3, I got the same warnings as in the top-level comment of this issue.

@d-v-b is working on expanding the dtype story in zarr3 now -- including fixed-length-strings. Expect an update within a month or so here.

OK, that's comforting, thanks :).

doronbehar avatar Mar 06 '25 05:03 doronbehar

For what it's worth, the future proof way to create NumPy arrays of UTF-8 data is to use the UTF-8 string dtype (np.dtypes.StringDType, which requires numpy v2). However, this is not (yet) the default in NumPy or Xarray.

If you mean by "not the default" that np.array(["hello", "world"]) without explicitly specifying a dtype argument, doesn't use np.dtypes.StringDType, but uses <U5 by default, then I understand what you are saying.

Yes, this is how things currently work.

However, personally I don't think it should be the default :).

I agree, UTF-8 would be a much saner default! It's just a relatively new NumPy feature, and NumPy is very conservative about making breaking changes.

shoyer avatar Mar 06 '25 05:03 shoyer

Is there an xarray issue here too? Despite the warnings, zarr3 does read back the StringDType while xarray does not.

xr.DataArray(
    np.array(["hello", "world"], dtype=np.dtypes.StringDType),
    name="test",
).to_zarr("test_utf8_strings.zarr", mode="w")

Reading it back with xarray ...

xr.open_dataarray("test_utf8_strings.zarr").dtype

gives dtype('O').

Reading it back with zarr ...

zarr.open_group("test_utf8_strings.zarr")["test"].dtype

gives StringDType().

xarray==2025.3.1 zarr==3.0.6

itcarroll avatar Apr 02 '25 01:04 itcarroll

Is there any update on how to fix these warnings?

sbatururimi avatar May 16 '25 16:05 sbatururimi

Full support for <U# data types will come in zarr-python 3.1, once https://github.com/zarr-developers/zarr-python/pull/2874 is merged and a spec for the <U# data types is finished (see this PR in the zarr-extensions repo).

Until the spec is done, we could look into ways to silence the warnings, but the underlying problem those warnings describe will remain.
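In the meantime, one user-side option is a scoped filter from Python's standard `warnings` module. The sketch below is not a zarr or xarray feature: the helper name is hypothetical, and the pattern is simply the phrase shared by the three warnings quoted at the top of this issue, so it would stop matching if zarr rewords them.

```python
import warnings
from contextlib import contextmanager

# Phrase shared by the three UserWarnings quoted in the first comment.
# Assumption: zarr keeps this wording; adjust the pattern if it changes.
ZARR_V3_SPEC_PATTERN = r".*is currently not part in the Zarr format 3 specification"


@contextmanager
def zarr_v3_spec_warnings_ignored():
    """Hypothetical helper: hide only the matching UserWarnings inside the block."""
    with warnings.catch_warnings():
        # The message argument is a regex matched against the start of the text.
        warnings.filterwarnings(
            "ignore", message=ZARR_V3_SPEC_PATTERN, category=UserWarning
        )
        yield
```

It would wrap the write call, for example `with zarr_v3_spec_warnings_ignored(): data_array.to_zarr(...)`, and the previous filters are restored when the block exits. This only hides the messages; the interoperability caveat they describe still applies.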

d-v-b avatar May 16 '25 16:05 d-v-b

Is this connected to the new numpy string dtype?

alippai avatar May 16 '25 16:05 alippai