Formalize support for buffer IO for .h5ad
So far we only support reading from and writing to paths; this PR adds tests and type hints for binary buffers / file-like objects.
- [x] TODO: tests
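Roughly, the intended usage (a sketch under the assumption that read_h5ad/write_h5ad pass file-like objects through to h5py's file-like support; the path is a placeholder):

```python
from io import BytesIO

import anndata as ad

adata = ad.read_h5ad("pbmc3k.h5ad")  # placeholder path

# write to an in-memory binary buffer instead of a filesystem path
buf = BytesIO()
adata.write_h5ad(buf)

# rewind, then read the object back from the same buffer
buf.seek(0)
roundtrip = ad.read_h5ad(buf)
```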
Codecov Report
Merging #800 (ef3e1e9) into master (cc3ba6f) will decrease coverage by 0.05%. The diff coverage is 86.95%.
```diff
@@            Coverage Diff             @@
##           master     #800      +/-   ##
==========================================
- Coverage   83.21%   83.15%   -0.06%
==========================================
  Files          34       34
  Lines        4450     4458       +8
==========================================
+ Hits         3703     3707       +4
- Misses        747      751       +4
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| anndata/_io/h5ad.py | 90.82% <81.25%> (-1.16%) | :arrow_down: |
| anndata/_core/anndata.py | 83.45% <100.00%> (+0.04%) | :arrow_up: |
| anndata/_core/merge.py | 93.73% <0.00%> (-0.28%) | :arrow_down: |
General question about the PR: why buffers specifically, and not h5py.File / h5py.Group or zarr stores directly?
That’s also a good idea, but this PR is about the most generic serialization format: a byte stream.
We can add the others at a later date.
See https://github.com/quiltdata/quilt/pull/2974 for motivation to get this in quickly.
Currently, this works:
```python
import h5py
import anndata as ad
from io import BytesIO
from anndata.experimental import read_elem, write_elem

adata = ad.read_h5ad("/Users/isaac/data/pbmc3k_raw.h5ad")

# write the whole AnnData object into an in-memory HDF5 file
bio = BytesIO()
with h5py.File(bio, "w") as f:
    write_elem(f, "/", adata)

# read it back out of the same buffer
with h5py.File(bio, "r") as f:
    from_bytes = read_elem(f["/"])
```
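As a sanity check, the round trip can be verified with anndata's test helper (assuming the import path anndata.tests.helpers, which is where the test suite keeps it):

```python
from anndata.tests.helpers import assert_equal

# raises if any component of the two objects differs
assert_equal(adata, from_bytes)
```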
I don't think I want to directly support file-like objects as an input type to read_h5ad. It's more for us to support, and the h5py docs for this feature are full of "use at your own risk" warnings. See also h5py/h5py#1698
What is the use case here? I'm not too familiar with Quilt.
Check out their promo material: https://quiltdata.com/
Basically it’s a catalog to browse and manage in-house or public data. The data is stored in versioned “packages” (file trees on S3) together with searchable/queryable metadata. Metadata can be validated using schemas.
But if read_elem / write_elem work, it's true that they can just use that.
For Quilt:
I saw that, but the specifics were a little unclear. Looks like it's only S3?
I think there are some similarities to things I'd like to do with zarr and OME, but I'm not quite sure yet.
For the specific use case: are you expecting the data to always be delivered as streams? If so, I think pickling would have less of the overhead that comes from chunking the arrays. If the data is on the cloud, zarr may be a nicer choice of storage format (plus it defaults to more modern compression libraries).
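To make the zarr suggestion concrete, here is a minimal sketch using anndata's existing zarr IO (the paths are illustrative):

```python
import anndata as ad

adata = ad.read_h5ad("pbmc3k_raw.h5ad")  # illustrative path

# zarr stores each array as chunked, compressed objects in a directory
# hierarchy, which maps naturally onto object stores like S3
adata.write_zarr("pbmc3k.zarr")
roundtrip = ad.read_zarr("pbmc3k.zarr")
```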
I agree about zarr; hdf5 is a legacy thing. I also don't know whether their (de)serialization supports S3 file hierarchies as opposed to individual files.
Regarding pickling: I avoid that format for anything but caches because of its lack of stability.
One can have backwards-compatible pickled objects; I believe both numpy and pandas do this.
I don't think it should be too hard for us, as anndata is basically built-in Python types plus numpy and pandas objects. I'd be up for a PR implementing this.
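A minimal sketch of what version-tolerant pickling could look like (hypothetical class, not anndata's actual API):

```python
class Versioned:
    """Toy example of a pickle format that can load older states."""

    _pickle_version = 2

    def __init__(self, data=None):
        self.data = data

    def __getstate__(self):
        # tag the state dict with the format version at write time
        state = self.__dict__.copy()
        state["_pickle_version"] = self._pickle_version
        return state

    def __setstate__(self, state):
        # migrate states written by older versions of the class
        version = state.pop("_pickle_version", 1)
        if version < 2:
            state["data"] = state.pop("X", None)  # field renamed in v2
        self.__dict__.update(state)
```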
Of course, you could then have references to items which don't have compatibility guarantees.
Yeah, that's the problem, right? It's easy for something to sneak in and break everything, and recovery is near impossible.