human_gene_v2.5.h5 fails file integrity check
Not an issue with the GH repo per se, but I noticed that the latest ARCHS4 file for human gene expression (human_gene_v2.5.h5) fails a file integrity check.
Reproduce:
wget https://s3.dev.maayanlab.cloud/archs4/files/human_gene_v2.5.h5
sha1sum human_gene_v2.5.h5
Compare the result with the SHA1 listed on the download page:
Expected SHA1: a7b21b55515959add7b1d620371bc4b2fb610976
Actual SHA1: ae96de0519b9f008b0dc3a9f944ee9007daf2f6a
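For anyone without sha1sum available, the same check can be sketched in Python. This is only an illustration: the helper name sha1_of_stream and the 1 MiB chunk size are my own choices, not part of ARCHS4 or archs4py.

```python
import hashlib


def sha1_of_stream(stream, chunk_size=1024 * 1024):
    """Return the hex SHA-1 of a binary stream, read in fixed-size chunks.

    Reading in chunks keeps memory use flat even for very large files.
    """
    digest = hashlib.sha1()
    # iter() with a sentinel keeps calling read() until it returns b"" (EOF).
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Usage against the local download (uncomment to run):
# with open("human_gene_v2.5.h5", "rb") as fh:
#     print(sha1_of_stream(fh))
```

The printed value can then be compared by eye with the hash on the download page.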
To make sure it wasn't just a network issue on my end, I reconstructed the multipart ETag for the file on S3. The expected ETag is 7c26b4ebb22b89d795968bf37df4b5e4-5706 (from the output of curl -I https://s3.dev.maayanlab.cloud/archs4/files/human_gene_v2.5.h5). The multipart ETag I calculated for the file on my disk matches this value, so my local copy is identical to what is stored on S3.
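A rough sketch of how a multipart ETag like this can be recomputed locally, assuming the commonly observed convention for S3 multipart uploads: the MD5 of the concatenated per-part MD5 digests, followed by a dash and the part count. That convention is not a documented guarantee, and the part size used for this upload is not stated in this thread, so part_size below has to be supplied by whoever runs it. The function name multipart_etag is made up for this example.

```python
import hashlib


def multipart_etag(stream, part_size):
    """Recompute an S3-style multipart ETag for a binary stream.

    ASSUMPTION: the uploader used a fixed part_size (in bytes) for every
    part except possibly the last, and the object is not encrypted in a
    way that changes how the ETag is derived.
    """
    part_digests = []
    # Read one part at a time; each part is hashed on its own.
    for part in iter(lambda: stream.read(part_size), b""):
        part_digests.append(hashlib.md5(part).digest())
    if not part_digests:
        raise ValueError("stream is empty")
    # Hash the concatenated raw digests, then append the part count.
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"


# Usage (the part size here is a placeholder, not the real upload setting):
# with open("human_gene_v2.5.h5", "rb") as fh:
#     print(multipart_etag(fh, part_size=8 * 1024 * 1024))
```

If the part count in the result differs from the suffix of the ETag header (5706 here), the part size guess is wrong and the hashes will not be comparable.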
Hi, yes, the SHA1 sum on the download page does not seem to match the file. I have tested the file on S3 and it works as expected. I will look into it further. If the download were corrupted, the file would not open with archs4py (e.g. archs4py.ls(...)). I am also getting: ae96de0519b9f008b0dc3a9f944ee9007daf2f6a
Hi, thanks for looking into this.
I can load most of the data, but am having trouble with these four samples:
- GSM6998368
- GSM6998371
- GSM6998380
- GSM6998386
If I try to load these specific samples with h5py, I get OSError: Can't synchronously read data (inflate() failed), which I thought might mean there is corruption localized to a particular chunk.
If I try to load these samples with archs4py, it just looks like they have zero counts:
>>> a4.data.samples(file, ["GSM6998368","GSM6998371","GSM6998380","GSM6998386"])[:].sum()
GSM6998368 0
GSM6998371 0
GSM6998380 0
GSM6998386 0
dtype: uint64
>>>
Edit: I noticed archs4py.data.get_sample returns an array of zeros on exception, and doesn't raise. When I modified this function to raise the exception, it was the same exception raised from h5py: OSError: Can't synchronously read data (inflate() failed)
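In case it helps others check their own copy, here is a minimal sketch of how the failing samples can be located without the errors being swallowed. The helper names are hypothetical, and the dataset paths "data/expression" and "meta/samples/geo_accession", as well as the genes-by-samples orientation, are assumptions about the file layout rather than something confirmed in this thread.

```python
def find_unreadable_columns(dataset, indices):
    """Return (index, error message) pairs for columns that fail to read.

    dataset is anything that supports dataset[:, i], such as an h5py
    dataset. Read errors are recorded instead of being replaced by zeros.
    """
    failures = []
    for i in indices:
        try:
            dataset[:, i]  # forces decompression of the chunks for column i
        except OSError as exc:
            failures.append((i, str(exc)))
    return failures


def scan_archs4_file(path, sample_ids):
    """Report which of the given GSM ids cannot be read from the file."""
    import h5py  # imported lazily so the helper above has no dependencies

    with h5py.File(path, "r") as f:
        # ASSUMPTION: sample accessions and counts live at these paths.
        raw = f["meta/samples/geo_accession"][:]
        accessions = [s.decode() if isinstance(s, bytes) else str(s) for s in raw]
        index_of = {acc: i for i, acc in enumerate(accessions)}
        wanted = [index_of[s] for s in sample_ids if s in index_of]
        failures = find_unreadable_columns(f["data/expression"], wanted)
    return [(accessions[i], message) for i, message in failures]


# Usage:
# scan_archs4_file("human_gene_v2.5.h5", ["GSM6998368", "GSM6998371"])
```

Passing every column index instead of a few GSM ids would scan the whole file, though that is slow for a file of this size.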
Actually, this doesn't seem to be linked to the SHA1 mismatch because I have the same problem loading 5 samples from mouse_gene_v2.5.h5 (even though the hash matches the download page):
- GSM3723071
- GSM7230982
- GSM7230984
- GSM7230985
- GSM7230988
We have moved to a new website https://archs4.org with a new data packaging system in place. The newest data releases are under "latest". Some samples might be missing in the files. I have tested the samples and they seem to be present in the latest data release.
To download specific samples: go to https://archs4.org/data, enter the GSM ID in the search field at the top left, select mouse, and search. The sample search at the bottom right returns all samples from the same series. Choose download, then click "ready".