No. bytes is smaller than No. bytes stored for variable-length bytestrings
Zarr version
v2.17.0
Numcodecs version
v0.12.1
Python Version
v3.12.2
Operating System
Linux
Installation
Using micromamba (conda)
Description
I am trying to save variable-length byte strings to a Zarr array (to be precise, RDKit Mol objects as byte strings). I noticed that the number of bytes is drastically lower than the number of bytes stored. I am trying to better understand why this might happen. I assumed it would be because the byte strings are padded, but this does not seem to be the case. Any insights on what is happening here and how to optimize this?
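For context, the byte strings in my real use case come from serializing molecules roughly like this (RDKit's binary format; the toy example below substitutes random bytes so it is self-contained):
from rdkit import Chem

mol = Chem.MolFromSmiles("CCO")  # stand-in for one of my molecules
blob = mol.ToBinary()            # variable-length byte string to store in Zarr
roundtrip = Chem.Mol(blob)       # deserializes back to a Mol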
Steps to reproduce
Here is a little toy example to reproduce:
import os
import math
import random
import zarr
# Seed for reproducibility
random.seed(0)
# Generate 10k variable-length byte strings (1 to 1000 bytes each)
bytes_data = [os.urandom(math.ceil(random.random() * 1000)) for _ in range(10000)]
arr = zarr.array(bytes_data, dtype=bytes)
print(arr.info)
Which should print:
Type : zarr.core.Array
Data type : object
Shape : (10000,)
Chunk shape : (10000,)
Order : C
Read-only : False
Filter [0] : VLenBytes()
Compressor : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type : zarr.storage.KVStore
No. bytes : 80000 (78.1K)
No. bytes stored : 5051905 (4.8M)
Storage ratio : 0.0
Chunks initialized : 1/1
I would expect some overhead, but 60x seems excessive!
Things I tried
Disabling compression
arr = zarr.array(bytes_data, dtype=bytes, compressor=None)
No. bytes : 80000 (78.1K)
No. bytes stored : 5051781 (4.8M)
(Since we're using random bytes, compression has no effect, as expected. In the original use case of storing molecules, removing compression does increase No. bytes stored.)
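The same numbers can also be read back programmatically; a quick check, assuming the zarr v2 Array attributes nbytes and nbytes_stored are what back the info fields:
arr = zarr.array(bytes_data, dtype=bytes, compressor=None)
print(arr.nbytes)         # 80000, the "No. bytes" figure
print(arr.nbytes_stored)  # ~5051781, the "No. bytes stored" figure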
No padding
bytes_data = [os.urandom(1000) for i in range(10000)]
arr = zarr.array(bytes_data, dtype=bytes, compressor=None)
No. bytes : 80000 (78.1K)
No. bytes stored : 10040209 (9.6M)
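Doing the arithmetic by hand, the stored size looks consistent with the VLenBytes encoding rather than padding. A rough estimate, assuming each item is written with a 4-byte length prefix (I have not checked the numcodecs source):
expected = sum(4 + len(b) for b in bytes_data)  # per-item length prefix + payload
print(expected)  # 10040000, within a few hundred bytes of the 10040209 reported
The remainder is presumably the codec header plus the .zarray metadata.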
I think the issue here is simply that the "No. bytes" estimate is not correct for object dtypes.
The actual, raw size of your data is:
sum(len(b) for b in bytes_data)
# -> 5011549
This is very close to the compressed size.
However, because you are passing a list of Python objects, Zarr does not know a priori how big each item is. (Contrast this with, e.g., int64 data.) It seems that by default it assumes each of the 10000 items is 8 bytes, which is a massive underestimate of the actual data size.
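You can see where the 80000 comes from by looking at the object array NumPy builds under the hood; a minimal sketch, assuming a 64-bit platform where an object pointer is 8 bytes:
import numpy as np

obj = np.array(bytes_data, dtype=object)  # bytes_data from the original repro
print(obj.dtype.itemsize)  # 8 -- the pointer size, not the payload size
print(obj.nbytes)          # 80000 == 10000 * 8, matching "No. bytes"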
It would probably be best for the No. bytes field to evaluate to "unknown" in this case.
Does that make sense?
That makes perfect sense to me. Thank you for the quick response!
Would you like me to keep this issue open to track the change to "unknown"?