human_gene_v2.5.h5 fails file integrity check

Open wigginno opened this issue 1 year ago • 3 comments

Not an issue with the GH repo per se, but I noticed that the latest ARCHS4 file for human gene expression (human_gene_v2.5.h5) fails a file integrity check.

Reproduce:

wget https://s3.dev.maayanlab.cloud/archs4/files/human_gene_v2.5.h5
sha1sum human_gene_v2.5.h5

Expected SHA1 (from the download page): a7b21b55515959add7b1d620371bc4b2fb610976
Actual SHA1: ae96de0519b9f008b0dc3a9f944ee9007daf2f6a

To make sure it wasn't just a network issue on my end, I reconstructed the "etag" for the file on S3. The expected etag is 7c26b4ebb22b89d795968bf37df4b5e4-5706 (based on the output of curl -I https://s3.dev.maayanlab.cloud/archs4/files/human_gene_v2.5.h5). I verified that the multi-part file etag I calculated for the file on my disk does match this expected etag.
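For anyone who wants to repeat the etag check, here is a minimal sketch of how a multipart etag is commonly reconstructed: take the MD5 of each part, concatenate the raw digests, take the MD5 of that, and append "-" plus the part count. The function name multipart_etag is hypothetical, and part_size is an assumption: the thread does not state the part size used for the upload, so it has to be supplied or guessed by the reader.

```python
import hashlib


def multipart_etag(path, part_size):
    """Reconstruct an S3-style multipart etag for a local file.

    part_size is in bytes and must equal the part size used for the
    original upload (not stated in this thread, so it is an assumption).
    Always returns the multipart form "<hex>-<parts>".
    """
    part_digests = []
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(part_size)
            if not chunk:
                break
            # Keep the raw 16-byte MD5 digest of each part.
            part_digests.append(hashlib.md5(chunk).digest())
    # The etag is the MD5 of the concatenated part digests plus the part count.
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"
```

If the result for the local file equals the etag reported by curl -I, the bytes on disk match what is stored on S3, which points at the published SHA1 rather than the download.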

wigginno avatar Sep 08 '24 16:09 wigginno

Hi, yes, there seems to be a problem with the SHA1 sum on the download page: it does not match the file. I have tested the file on S3 and it works as expected. I will look into it further. If the download were corrupted, the file would not open with archs4py (e.g. archs4py.ls(...)). I am also getting: ae96de0519b9f008b0dc3a9f944ee9007daf2f6a

lachmann12 avatar Sep 09 '24 17:09 lachmann12

Hi, thanks for looking into this.

I can load most of the data, but am having trouble with these four samples:

  • GSM6998368
  • GSM6998371
  • GSM6998380
  • GSM6998386

If I try to load these specific samples with h5py, I get OSError: Can't synchronously read data (inflate() failed), which I thought might mean there is corruption localized to a particular chunk. If I try to load these samples with archs4py, it just looks like they have zero counts:

>>> a4.data.samples(file, ["GSM6998368","GSM6998371","GSM6998380","GSM6998386"])[:].sum()
GSM6998368    0
GSM6998371    0
GSM6998380    0
GSM6998386    0
dtype: uint64
>>>

Edit: I noticed archs4py.data.get_sample returns an array of zeros on exception, and doesn't raise. When I modified this function to raise the exception, it was the same exception raised from h5py: OSError: Can't synchronously read data (inflate() failed)
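Since a swallowed exception looks the same as a sample with zero counts, it can help to read the samples directly and record which ones fail. Below is a rough sketch, not archs4py's own code. The names unreadable_samples and scan_h5 are hypothetical, and the dataset paths data/expression (genes by samples) and meta/samples/geo_accession are assumptions about the file layout that should be checked against the actual file.

```python
def unreadable_samples(read_sample, sample_ids):
    """Call read_sample(sid) for each id and collect the ones that raise OSError.

    Returns a dict mapping sample id to the error message.
    """
    failed = {}
    for sid in sample_ids:
        try:
            read_sample(sid)
        except OSError as exc:
            # Record the failure instead of silently substituting zeros.
            failed[sid] = str(exc)
    return failed


def scan_h5(path, sample_ids,
            expr_path="data/expression",
            ids_path="meta/samples/geo_accession"):
    """Check the given GSM ids in an HDF5 file (dataset paths are assumptions)."""
    import h5py  # imported lazily so unreadable_samples works without h5py

    with h5py.File(path, "r") as f:
        all_ids = [s.decode() if isinstance(s, bytes) else str(s)
                   for s in f[ids_path][:]]
        index = {sid: i for i, sid in enumerate(all_ids)}
        expr = f[expr_path]
        # Assumes samples are columns; an id missing from the file raises KeyError.
        return unreadable_samples(lambda sid: expr[:, index[sid]], sample_ids)
```

Calling scan_h5("human_gene_v2.5.h5", ["GSM6998368", "GSM6998371"]) would then return only the ids whose reads raised, along with the message, so genuinely empty samples can be told apart from unreadable ones.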

wigginno avatar Sep 09 '24 19:09 wigginno

I can load most of the data, but am having trouble with these four samples:

  • GSM6998368
  • GSM6998371
  • GSM6998380
  • GSM6998386

Actually, this doesn't seem to be linked to the SHA1 mismatch because I have the same problem loading 5 samples from mouse_gene_v2.5.h5 (even though the hash matches the download page):

  • GSM3723071
  • GSM7230982
  • GSM7230984
  • GSM7230985
  • GSM7230988

wigginno avatar Sep 09 '24 19:09 wigginno

We have moved to a new website https://archs4.org with a new data packaging system in place. The newest data releases are under "latest". Some samples might be missing in the files. I have tested the samples and they seem to be present in the latest data release.

https://archs4.org/data -> enter GSM id in search field top left -> select mouse -> search -> bottom right sample search returns all samples from same series -> download -> click ready.

lachmann12 avatar Mar 06 '25 15:03 lachmann12