
In-memory performance much slower than NumPy

nuric opened this issue 2 years ago • 6 comments

Zarr version

2.14.2

Numcodecs version

0.11.0

Python Version

3.10

Operating System

Linux

Installation

Pip with virtualenv

Description

I was using Zarr arrays as a grouped set of related NumPy arrays. I noticed that when I switched, the in-memory performance dropped significantly. I disabled the compressor and chunking to remove any overhead I could find.

I've attached a short snippet profiled with line_profiler to demonstrate the basic case of just writing elements to an array. 99 percent of the time is spent writing to the Zarr array rather than to the NumPy array of the same size and shape.

Having looked at the source code for MemoryStore, I can see that the chunk is serialised as bytes and stored in a dictionary under the key "0.0" with a bytes value, which I presume mirrors the filesystem layout, but this is perhaps where it becomes really slow compared to NumPy.
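
For illustration, a minimal sketch (zarr 2.x, same kind of single-chunk, uncompressed array as in the reproduction below) of what ends up in the MemoryStore mapping. The exact stored type may vary by zarr version, but in this setup the chunk shows up as a bytes blob under the key "0.0", as described above:

import numpy as np
import zarr

store = zarr.storage.MemoryStore()
z = zarr.zeros((4, 3), chunks=False, store=store, compressor=None, dtype=np.float32)
z[0] = np.arange(3)

# The store behaves like a dict: metadata under ".zarray", the single chunk under "0.0".
print(sorted(store.keys()))    # ['.zarray', '0.0']
print(type(store["0.0"]))      # bytes in this setup -- the encoded chunk is copied out
print(np.frombuffer(store["0.0"], dtype=np.float32).reshape(4, 3)[0])  # [0. 1. 2.]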

Is this not a use case for Zarr? Is it optimised for reads instead? I understand if this is out of scope for Zarr arrays. Thank you for your time.

Steps to reproduce

import numcodecs
import numpy as np
import tqdm
import zarr

print(zarr.__version__)
print(numcodecs.__version__)

mem_store = zarr.storage.MemoryStore()
z_array = zarr.zeros(
    (200000, 100), chunks=False, store=mem_store, compressor=None, dtype=np.float32, write_empty_chunks=False
)
np_array = np.zeros((200000, 100), dtype=np.float32)

print(z_array.info)


@profile
def row_by_row():
    """Row by row."""
    for i in tqdm.trange(100):
        r_array = np.random.random(100)
        np_array[i] = r_array
        z_array[i] = r_array


@profile
def in_chunks():
    """In chunks."""
    for i in tqdm.trange(100):
        r_array = np.random.random((200, 100))
        np_array[:200] = r_array
        z_array[:200] = r_array


def main():
    """Run the main function."""
    row_by_row()
    in_chunks()


if __name__ == "__main__":
    main()

Additional output

2.14.2
0.11.0
Type               : zarr.core.Array
Data type          : float32
Shape              : (200000, 100)
Chunk shape        : (200000, 100)
Order              : C
Read-only          : False
Compressor         : None
Store type         : zarr.storage.MemoryStore
No. bytes          : 80000000 (76.3M)
No. bytes stored   : 231
Storage ratio      : 346320.3
Chunks initialized : 0/1

Wrote profile results to scribble.py.lprof
Timer unit: 1e-06 s

Total time: 6.58158 s
File: scribble.py
Function: row_by_row at line 19

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    19                                           @profile
    20                                           def row_by_row():
    21                                               """Row by row."""
    22       100      14604.7    146.0      0.2      for i in tqdm.trange(100):
    23       100       1267.1     12.7      0.0          r_array = np.random.random(100)
    24       100        456.9      4.6      0.0          np_array[i] = r_array
    25       100    6565249.6  65652.5     99.8          z_array[i] = r_array

Total time: 6.55283 s
File: scribble.py
Function: in_chunks at line 28

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    28                                           @profile
    29                                           def in_chunks():
    30                                               """In chunks."""
    31       100      13821.0    138.2      0.2      for i in tqdm.trange(100):
    32       100      13546.1    135.5      0.2          r_array = np.random.random((200, 100))
    33       100       1375.9     13.8      0.0          np_array[:200] = r_array
    34       100    6524084.0  65240.8     99.6          z_array[:200] = r_array

nuric commented Apr 17 '23 21:04

I think this is expected.

Zarr stores compressed, chunked blobs in memory, as opposed to NumPy, which works directly with a raw, uncompressed buffer. As a result there will be more overhead with the former than with the latter.

Even with compression disabled and everything combined into one chunk, there is some Python overhead in working with the Zarr store.

Plus, to protect against accidental mutation of the Zarr data (due to other references to the original buffer lying around), we perform a copy to bytes. This ensures the stored data lives in a new buffer that is read-only from Python, protecting it against modification after it is stored.
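
For illustration only (plain NumPy, not zarr's actual code path), a small sketch of why a copy to bytes protects the stored data from later mutation of the source buffer:

import numpy as np

chunk = np.arange(4, dtype=np.float32)

by_reference = chunk            # keeps pointing at the caller's buffer
by_copy = chunk.tobytes()       # snapshots the data into a new, read-only buffer

chunk[:] = -1                   # caller mutates the original array afterwards

print(by_reference)                              # [-1. -1. -1. -1.]  (mutated)
print(np.frombuffer(by_copy, dtype=np.float32))  # [0. 1. 2. 3.]      (unchanged)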

We could probably do other things to improve on this, like using HUGEPAGES when copying, or managing our own buffer to avoid repeated allocation.
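
As a rough sketch of the "manage our own buffer" idea (a hypothetical helper, not an existing zarr API), a scratch buffer could be allocated once and reused for every chunk copy:

import numpy as np

class ReusableChunkBuffer:
    """Hypothetical scratch buffer: one up-front allocation, reused per chunk."""

    def __init__(self, nbytes: int):
        self._buf = bytearray(nbytes)

    def copy_in(self, chunk: np.ndarray) -> memoryview:
        raw = np.ascontiguousarray(chunk).view(np.uint8).reshape(-1)
        view = memoryview(self._buf)[: raw.nbytes]
        view[:] = raw              # copy into the preallocated buffer, no new allocation
        return view

buf = ReusableChunkBuffer(200_000 * 100 * 4)   # sized for the array in this issue
view = buf.copy_in(np.random.random((200, 100)).astype(np.float32))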

That said, there would still be some friction between the reliability side and the performance side of the discussion.

jakirkham commented Apr 17 '23 22:04

Hi everyone,

I love Zarr for its versatility, and although I appreciate that this is not the main use case Zarr is built for, I am surprised by the difference in read-only performance, even when I store the Zarr array as a single chunk, without any compression, and using the MemoryStore. Is this expected for reads as well?

For example:

import zarr
import numpy as np

# Create the Numpy array
np_arr = np.random.random((10000, 1024))

# Create an equivalent Zarr array
root = zarr.open(zarr.MemoryStore())
zarr_arr = root.array(
    name="A", 
    data=np_arr, 
    chunks=False,     # Disable chunks
    compressor=None,  # Disable compression
    read_only=True,   # Explicitly set as read-only
)

Then:

%%timeit
for i in range(10000):
    x = np_arr[i]

Gives: 501 µs ± 6.45 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

And:

%%timeit
for i in range(10000):
    x = zarr_arr[i]

Gives: 154 ms ± 1.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each). That's ~300x slower.

And finally:

%%timeit
for i in range(10000):
    x = root["A"][i]

Gives: 355 ms ± 9.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each). That's ~709x slower.
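
Part of the extra slowdown in this last case likely comes from root["A"] resolving a fresh Array object on every iteration; hoisting the lookup out of the loop should bring it back down to the ~154 ms case above:

zarr_arr = root["A"]      # resolve the Array object once, outside the loop
for i in range(10000):
    x = zarr_arr[i]       # per-row reads still pay the Zarr overhead, but only that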

cwognum commented Apr 10 '24 19:04

@cwognum, the issue here is that zarr-python currently does not implement any special case for datasets with no compression. It's still loading the entire chunk into memory (making a copy), rather than indexing into the bytes to pull out only the data that you need.
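
For intuition only (not what zarr-python currently does), a sketch of what "indexing into the bytes" of an uncompressed chunk could look like: np.frombuffer gives a zero-copy view over the stored blob, so reading a row touches only that row rather than copying the whole chunk:

import numpy as np

shape, dtype = (10000, 1024), np.float64
chunk_bytes = np.random.random(shape).astype(dtype).tobytes()   # stand-in for the stored chunk blob

view = np.frombuffer(chunk_bytes, dtype=dtype).reshape(shape)   # zero-copy view over the bytes
row = view[123]                                                 # only this row's data is read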

I predict that if you retry your example with smaller chunks, it will actually go faster.

rabernat commented Apr 10 '24 20:04

Thanks for the quick response @rabernat!

I tried again with:

store = zarr.MemoryStore()
root = zarr.open(store)
arr = root.array(
    name="A",
    data=np_arr,          # the NumPy array from the previous snippet
    chunks=(1, None),     # one chunk per row
    compressor=None,
    read_only=True,
)

And I find: 152 ms ± 2.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

cwognum commented Apr 10 '24 20:04

Either way, I don't think it's a big issue.

In my downstream code I can always load the data into a Numpy array myself and use the Numpy copy from there on out. I'm just surprised by the magnitude of the difference and would be curious to better understand what is causing it.
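
For example (using the zarr_arr from the earlier snippet), materialising the data once keeps plain NumPy indexing in the hot loop:

np_copy = zarr_arr[:]        # one full read into a NumPy array
for i in range(10000):
    x = np_copy[i]           # ordinary NumPy indexing from here on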

cwognum commented Apr 10 '24 20:04

> better understand what is causing it.

I'm almost certain it's all of the memory copies that happen.

We should find a way to optimize this path.

rabernat commented Apr 10 '24 20:04