Formalize support for buffer IO for .h5ad
So far we only support reading from and writing to paths; this PR adds tests and type hints for binary buffers / file-like objects.
- [x] TODO: tests
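Roughly, the intended usage (a sketch under the assumption that read_h5ad/write_h5ad pass file-like objects through to h5py's file-like support; the path is a placeholder):

```python
from io import BytesIO

import anndata as ad

adata = ad.read_h5ad("pbmc3k.h5ad")  # placeholder path

# write to an in-memory binary buffer instead of a filesystem path
buf = BytesIO()
adata.write_h5ad(buf)

# rewind, then read the object back from the same buffer
buf.seek(0)
roundtrip = ad.read_h5ad(buf)
```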
Codecov Report
Merging #800 (ef3e1e9) into master (cc3ba6f) will decrease coverage by 0.05%. The diff coverage is 86.95%.
```diff
@@            Coverage Diff             @@
##           master     #800      +/-   ##
==========================================
- Coverage   83.21%   83.15%   -0.06%
==========================================
  Files          34       34
  Lines        4450     4458       +8
==========================================
+ Hits         3703     3707       +4
- Misses        747      751       +4
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| anndata/_io/h5ad.py | 90.82% <81.25%> (-1.16%) | :arrow_down: |
| anndata/_core/anndata.py | 83.45% <100.00%> (+0.04%) | :arrow_up: |
| anndata/_core/merge.py | 93.73% <0.00%> (-0.28%) | :arrow_down: |
General question about the PR: why buffers specifically, and not h5py.File / h5py.Group or zarr stores directly?
That’s also a good idea, but this PR is about the most generic serialization format: a byte stream.
We can add the others at a later date.
See https://github.com/quiltdata/quilt/pull/2974 for motivation to get this in quickly.
Currently, this works:
```python
import h5py
import anndata as ad
from io import BytesIO
from anndata.experimental import read_elem, write_elem

adata = ad.read_h5ad("/Users/isaac/data/pbmc3k_raw.h5ad")

# write the whole AnnData object into an in-memory HDF5 file
bio = BytesIO()
with h5py.File(bio, "w") as f:
    write_elem(f, "/", adata)

# read it back out of the same buffer
with h5py.File(bio, "r") as f:
    from_bytes = read_elem(f["/"])
```
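As a sanity check, the round trip can be verified with anndata's test helper (assuming the import path anndata.tests.helpers, which is where the test suite keeps it):

```python
from anndata.tests.helpers import assert_equal

# raises if any component of the two objects differs
assert_equal(adata, from_bytes)
```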
I don't think I want to directly support file-like objects as an input type to read_h5ad. It's more for us to support, and the h5py docs for this feature are full of "use at your own risk" warnings. See also h5py/h5py#1698
What is the use case here? I'm not too familiar with Quilt.
Check out their promo material: https://quiltdata.com/
Basically it’s a catalog to browse and manage in-house or public data. The data is stored in versioned “packages” (file trees on S3) together with searchable/queryable metadata. Metadata can be validated using schemas.
But if read_elem / write_elem work, it's true that they can just use that.
For Quilt:
I saw that, but the specifics were a little unclear. Looks like it's only S3?
I think there are some similarities to things I'd like to do with zarr and OME, but I'm not quite sure yet.
For the specific use case: are you expecting the data to always be delivered as streams? If so, I think pickling would have less of the overhead that comes from chunking the arrays. If the data is on the cloud, zarr may be a nicer choice of storage format (plus it defaults to more modern compression libraries).
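To make the zarr suggestion concrete, here is a minimal sketch using anndata's existing zarr IO (the paths are illustrative):

```python
import anndata as ad

adata = ad.read_h5ad("pbmc3k_raw.h5ad")  # illustrative path

# zarr stores each array as chunked, compressed objects in a directory
# hierarchy, which maps naturally onto object stores like S3
adata.write_zarr("pbmc3k.zarr")
roundtrip = ad.read_zarr("pbmc3k.zarr")
```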
I agree about zarr; hdf5 is a legacy thing. I also don't know whether their (de)serialization supports S3 file hierarchies as opposed to individual files.
Regarding pickling: I avoid that format for anything but caches because of its lack of stability.
One can have backwards-compatible pickled objects; I believe both numpy and pandas do this.
I don't think it should be too hard for us, as anndata is basically built-in Python types plus numpy and pandas objects. I'd be up for a PR implementing this.
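A minimal sketch of what version-tolerant pickling could look like (hypothetical class, not anndata's actual API):

```python
class Versioned:
    """Toy example of a pickle format that can load older states."""

    _pickle_version = 2

    def __init__(self, data=None):
        self.data = data

    def __getstate__(self):
        # tag the state dict with the format version at write time
        state = self.__dict__.copy()
        state["_pickle_version"] = self._pickle_version
        return state

    def __setstate__(self, state):
        # migrate states written by older versions of the class
        version = state.pop("_pickle_version", 1)
        if version < 2:
            state["data"] = state.pop("X", None)  # field renamed in v2
        self.__dict__.update(state)
```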
Of course, you could then have references to items which don't have compatibility guarantees.
Yeah, that's the problem, right? It's easy for something to sneak in and break everything, and recovery is near impossible.