
Help on read/write to disk with parallel compression

snehashis-roy opened this issue · 1 comment

Hello. I am trying to use the parallel feature of Blosc compression to write (and read) large (~100 GB) datasets to disk, similar to h5py's (or zarr's) create_dataset. I could not find a simple example of how to write a numpy ndarray to disk. Could you please point me in the right direction? Thanks.
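As a rough illustration of the pattern being asked for (chunked, parallel compression to disk), here is a standard-library-only sketch. It uses zlib and a thread pool as a stand-in for Blosc's multithreaded codec; this is not the cat4py API, and the helper names (write_dataset, read_dataset) are hypothetical:

```python
import os
import tempfile
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1 << 20  # 1 MiB chunks, a Blosc-style chunked layout

def compress_chunks(data, workers=4):
    """Split data into fixed-size chunks and compress them in parallel.

    zlib releases the GIL while compressing, so a thread pool gives
    real parallelism here, just as Blosc does internally.
    """
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda c: zlib.compress(c, 1), chunks))

def write_dataset(path, data):
    """Write length-prefixed compressed chunks so reads can locate them."""
    with open(path, "wb") as f:
        for comp in compress_chunks(data):
            f.write(len(comp).to_bytes(8, "little"))
            f.write(comp)

def read_dataset(path):
    """Read the length-prefixed chunks back and decompress them."""
    out = []
    with open(path, "rb") as f:
        while header := f.read(8):
            n = int.from_bytes(header, "little")
            out.append(zlib.decompress(f.read(n)))
    return b"".join(out)

if __name__ == "__main__":
    # Round-trip a few MiB of compressible synthetic data.
    payload = bytes(range(256)) * (4 * 4096)  # ~4 MiB
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "dataset.bin")
        write_dataset(path, payload)
        assert read_dataset(path) == payload
        print("on-disk size:", os.path.getsize(path), "bytes")
```

A real Blosc container adds a header with shape/dtype metadata and per-chunk indexes so slices can be read without decompressing everything, which is what cat4py provides on top of this idea.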

snehashis-roy avatar Jun 03 '20 00:06 snehashis-roy

Hi @piby2 ,

In the latest version there is a benchmark script that can help you: https://github.com/Blosc/cat4py/blob/master/bench/compare_getslice.py

To be able to run it, you should update to the latest version, either by cloning cat4py again (recommended):

git clone --recurse-submodules https://github.com/Blosc/cat4py

or by updating your master branch to the latest commit.

Then compile it using:

rm -rf _skbuild cat4py/*.so*  # If you have a previous build
python setup.py build_ext --build-type=RelWithDebInfo

To check the installation, run:

PYTHONPATH=. pytest

Finally, run the benchmark:

PYTHONPATH=. python bench/compare_getslice.py 1  # 1 enables persistency in this benchmark

PS: You shouldn't use the -O0 flag; it disables all compiler optimizations.

aleixalcacer avatar Jun 03 '20 08:06 aleixalcacer