
no. bytes is smaller than no. bytes stored for variable length bytestrings

Open cwognum opened this issue 1 year ago • 2 comments

Zarr version

v2.17.0

Numcodecs version

v0.12.1

Python Version

v3.12.2

Operating System

Linux

Installation

Using micromamba (conda)

Description

I am trying to save variable-length byte strings to a Zarr array (specifically, RDKit Mol objects serialized as byte strings). I noticed that the number of bytes is drastically lower than the number of bytes stored, and I am trying to understand why. My first assumption was that the byte strings get padded, but that does not seem to be the case. Any insights into what is happening here and how to optimize this?
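For context, here is a minimal sketch of how the byte strings are produced in the real use case, assuming RDKit is installed (the SMILES strings below are just placeholders; the real data is a much larger set of molecules):

import zarr
from rdkit import Chem

# Hypothetical input molecules
mols = [Chem.MolFromSmiles(smi) for smi in ["CCO", "c1ccccc1", "CC(=O)O"]]

# RDKit's binary serialization gives one variable-length byte string per Mol
bytes_data = [mol.ToBinary() for mol in mols]

arr = zarr.array(bytes_data, dtype=bytes)
print(arr.info)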

Steps to reproduce

Here is a little toy example to reproduce:

import os
import math
import random

import zarr

# Seed for reproducibility
random.seed(0)

# Generate 10k byte strings of variable length (between 1 and 1000 bytes)
bytes_data = [os.urandom(math.ceil(random.random() * 1000)) for _ in range(10000)]

arr = zarr.array(bytes_data, dtype=bytes)
print(arr.info)

Which should print:

Type               : zarr.core.Array
Data type          : object
Shape              : (10000,)
Chunk shape        : (10000,)
Order              : C
Read-only          : False
Filter [0]         : VLenBytes()
Compressor         : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type         : zarr.storage.KVStore
No. bytes          : 80000 (78.1K)
No. bytes stored   : 5051905 (4.8M)
Storage ratio      : 0.0
Chunks initialized : 1/1

I would expect some overhead, but 60x seems excessive!

Things I tried

Disabling compression

arr = zarr.array(bytes_data, dtype=bytes, compressor=None)
No. bytes          : 80000 (78.1K)
No. bytes stored   : 5051781 (4.8M)

(Since we're using random bytes, compression has no effect as expected. In the original use case of storing molecules, removing compression increases the No. bytes stored as expected.)
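(As a quick sanity check, independent of Zarr, the same Blosc codec indeed cannot shrink random bytes; this is just a sketch using numcodecs directly:)

import os
from numcodecs import Blosc

codec = Blosc(cname="lz4", clevel=5, shuffle=Blosc.SHUFFLE)
raw = os.urandom(1_000_000)
compressed = codec.encode(raw)
print(len(compressed) / len(raw))  # ~1.0: random data is effectively incompressible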

No padding (all byte strings the same length)

bytes_data = [os.urandom(1000) for i in range(10000)] 
arr = zarr.array(bytes_data, dtype=bytes, compressor=None)
No. bytes          : 80000 (78.1K)
No. bytes stored   : 10040209 (9.6M)

Additional output

No response

cwognum avatar Mar 12 '24 15:03 cwognum

I think the issue here is simply that the "No. bytes" estimate is not correct for object dtypes.

The actual, raw size of your data is:

sum(len(b) for b in bytes_data)
# -> 5011549

This is very close to the compressed size.
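The small remaining gap between that raw size and the uncompressed stored size is, I believe, mostly framing added by the VLenBytes filter (assuming it writes a 4-byte length prefix per item plus a small header), with the rest being array metadata in the store:

raw = sum(len(b) for b in bytes_data)  # 5011549
framing = 4 + 4 * len(bytes_data)      # assumed VLenBytes overhead: header + per-item length prefix
print(raw + framing)                   # 5051553, close to the 5051781 stored without compression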

However, because you are passing a list of Python objects, Zarr does not know a priori how big each item is. (Contrast this with, e.g., int64 data.) It seems that, by default, it assumes each of the 10000 items is 8 bytes in size, which is a massive underestimate of the actual data.
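A quick way to see where the 80000 figure comes from (a sketch; on a 64-bit build, an object-dtype array just stores an 8-byte pointer per element):

import numpy as np

obj = np.array(bytes_data, dtype=object)
print(obj.dtype.itemsize)  # 8, the size of a Python object pointer
print(obj.nbytes)          # 80000, which matches the reported "No. bytes"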

It would probably be best for the No. bytes field to evaluate to "unknown" in this case.

Does that make sense?

rabernat avatar Mar 12 '24 16:03 rabernat

That makes perfect sense to me. Thank you for the quick response!

Would you like me to keep this issue open to track the change to "unknown"?

cwognum avatar Mar 12 '24 16:03 cwognum