What options do I have for `<U#` type data saving in zarr?
What is your issue?
So I tried out xarray today with zarr version 3.0.4, and encountered these scary warnings:
/nix/store/qasysgiacqplrbda5yl65wg7jrs0gcjl-python3-3.12.9-env/lib/python3.12/site-packages/zarr/codecs/vlen_utf8.py:44: UserWarning: The codec `vlen-utf8` is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.
return cls(**configuration_parsed)
/nix/store/qasysgiacqplrbda5yl65wg7jrs0gcjl-python3-3.12.9-env/lib/python3.12/site-packages/zarr/core/array.py:3991: UserWarning: The dtype `<U5` is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.
meta = AsyncArray._create_metadata_v3(
/nix/store/qasysgiacqplrbda5yl65wg7jrs0gcjl-python3-3.12.9-env/lib/python3.12/site-packages/zarr/api/asynchronous.py:203: UserWarning: Consolidated metadata is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.
warnings.warn(
An MWE is:
import xarray as xr
import numpy as np

xr.DataArray(np.array([
    "hello",
    "world",
])).to_zarr("test_utf8_strings.zarr")
Is <U5 a variable length utf8 type? It shouldn't be... Also, what are my alternatives?
Is <U5 a variable length utf8 type? It shouldn't be...
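This is the easy question! It's a fixed-length Unicode string dtype (at most 5 characters per element), but I believe Zarr does encode it as UTF-8, which would explain why the `vlen-utf8` codec shows up in the first warning.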
For what it's worth, the future-proof way to create NumPy arrays of UTF-8 data is to use the UTF-8 string dtype (np.dtypes.StringDType, which requires numpy v2). However, this is not (yet) the default in NumPy or Xarray.
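To make the difference between the two dtypes concrete, here is a minimal NumPy-only sketch. It assumes nothing about xarray or zarr; the variable names are illustrative, and the StringDType part only runs when the installed NumPy provides it (v2 or later).

```python
import numpy as np

# Default: NumPy infers a fixed-width Unicode dtype sized to the longest element.
fixed = np.array(["hello", "world"])
print(fixed.dtype.kind, fixed.dtype.itemsize)  # U 20 (5 characters * 4 bytes each)

# Fixed width means longer values are silently truncated on assignment.
fixed[0] = "greetings"
print(fixed[0])  # greet

# Opt-in variable-width UTF-8 strings; guarded because older NumPy lacks this dtype.
string_dtype = getattr(getattr(np, "dtypes", None), "StringDType", None)
if string_dtype is not None:
    variable = np.array(["hello", "world"], dtype=string_dtype())
    variable[0] = "greetings"  # stored in full, no truncation
    print(variable.dtype, variable[0])
```

Note that the sketch passes an instance, `StringDType()`, rather than the class itself.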
Also, what are my alternatives?
You can write Zarr v2 files by passing zarr_format=2 (zarr_version=2 in older xarray releases), which will silence most of these warnings but not really resolve them, given that these are non-standard Zarr v2 conventions, too.
Otherwise, Zarr v3 needs a way to silence these warnings. And perhaps an advocate to push through the Zarr standardization process :).
@d-v-b is working on expanding the dtype story in zarr3 now -- including fixed-length-strings. Expect an update within a month or so here.
For what it's worth, the future-proof way to create NumPy arrays of UTF-8 data is to use the UTF-8 string dtype (np.dtypes.StringDType, which requires numpy v2). However, this is not (yet) the default in NumPy or Xarray.
If by "not the default" you mean that np.array(["hello", "world"]), without an explicit dtype argument, uses <U5 rather than np.dtypes.StringDType, then I understand what you are saying. However, personally I don't think it should be the default :). Also, to clear up a bit of ambiguity I found in that sentence, I tried:
xr.DataArray(np.array(
    ["hello", "world"],
    dtype=np.dtypes.StringDType,
)).to_zarr("test_utf8_strings.zarr")
And it failed miserably:
Traceback (most recent call last):
File "/home/doron/repos/lab-ion-trap-simulations/./t.py", line 9, in <module>
)).to_zarr("test_utf8_strings.zarr")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/xarray/core/dataarray.py", line 4428, in to_zarr
return to_zarr( # type: ignore[call-overload,misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/xarray/backends/api.py", line 2216, in to_zarr
dump_to_store(dataset, zstore, writer, encoding=encoding)
File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/xarray/backends/api.py", line 1952, in dump_to_store
store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/xarray/backends/zarr.py", line 1022, in store
self.set_variables(
File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/xarray/backends/zarr.py", line 1194, in set_variables
zarr_array = self._create_new_array(
^^^^^^^^^^^^^^^^^^^^^^^
File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/xarray/backends/zarr.py", line 1089, in _create_new_array
zarr_array = self.zarr_group.create(
^^^^^^^^^^^^^^^^^^^^^^^
File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/zarr/hierarchy.py", line 1195, in create
return self._write_op(self._create_nosync, name, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/zarr/hierarchy.py", line 952, in _write_op
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/zarr/hierarchy.py", line 1201, in _create_nosync
return create(store=self._store, path=path, chunk_store=self._chunk_store, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/zarr/creation.py", line 209, in create
init_array(
File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/zarr/storage.py", line 455, in init_array
_init_array_metadata(
File "/nix/store/mynsacdp58wmf7j6yyydyrz16vl3imzb-python3-3.12.9-env/lib/python3.12/site-packages/zarr/storage.py", line 584, in _init_array_metadata
raise ValueError("missing object_codec for object array")
ValueError: missing object_codec for object array
The above was obtained with Zarr v2. With Zarr v3, I got the same warnings as in the top level comment of this issue.
@d-v-b is working on expanding the dtype story in zarr3 now -- including fixed-length-strings. Expect an update within a month or so here.
OK That's comforting, thanks :).
For what it's worth, the future-proof way to create NumPy arrays of UTF-8 data is to use the UTF-8 string dtype (np.dtypes.StringDType, which requires numpy v2). However, this is not (yet) the default in NumPy or Xarray.
If you mean by "not the default" that np.array(["hello", "world"]) without explicitly specifying a dtype argument, doesn't use np.dtypes.StringDType, but uses <U5 by default, then I understand what you are saying.
Yes, this is how things currently work.
However, personally I don't think it should be the default :).
I agree, UTF-8 would be a much saner default! It's just a relatively new NumPy feature, and NumPy is very conservative about making breaking changes.
Is there an xarray issue here too? Despite the warnings, zarr3 does read back the StringDType while xarray does not.
xr.DataArray(
    np.array(["hello", "world"], dtype=np.dtypes.StringDType),
    name="test",
).to_zarr("test_utf8_strings.zarr", mode="w")
Reading it back with xarray ...
xr.open_dataarray("test_utf8_strings.zarr").dtype
gives dtype('O').
Reading it back with zarr ...
zarr.open_group("test_utf8_strings.zarr")["test"].dtype
gives StringDType().
xarray==2025.3.1 zarr==3.0.6
Is there any update on how to fix these warnings?
Full support for <U# data types will come in zarr-python 3.1, once https://github.com/zarr-developers/zarr-python/pull/2874 is merged and a spec for the <U# data types is finished (see this PR in the zarr-extensions repo).
Until the spec is done, we could look into ways to silence the warnings, but the underlying problem those warnings describe will remain.
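In the meantime, the warnings can be hidden on the user side with Python's standard warnings module. The sketch below is only that: the helper name quiet_zarr_v3_warnings is hypothetical, the message pattern is copied from the warning text quoted at the top of this issue, and it has not been checked against every zarr release. It hides the messages and does nothing to make the store more portable.

```python
import contextlib
import warnings


@contextlib.contextmanager
def quiet_zarr_v3_warnings():
    """Hypothetical helper: ignore the 'not part in the Zarr format 3
    specification' UserWarnings while the with-block runs."""
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore",
            # Regex matched against the start of the warning message.
            message=r".*not part in the Zarr format 3 specification",
            category=UserWarning,
        )
        yield


# Stand-in for the real call, using the warning text quoted in this issue.
with quiet_zarr_v3_warnings():
    warnings.warn(
        "The dtype `<U5` is currently not part in the Zarr format 3 "
        "specification. It may not be supported by other zarr "
        "implementations and may change in the future.",
        UserWarning,
    )

# Assumed usage with the MWE above (requires xarray and zarr):
# with quiet_zarr_v3_warnings():
#     xr.DataArray(np.array(["hello", "world"])).to_zarr("test_utf8_strings.zarr")
```

Because the filter is scoped to the with-block and matches only this message, unrelated warnings still surface.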
Is this connected to the new numpy string dtype?