Some error while reading gzip file
I'm trying to do kmer counting on some GIAB fastq.gz files which have reads of varying length. However, when I try to run KMC I receive the very vague error message: "Some error while reading gzip file." Any ideas about what might be causing this? Thanks!
Are your input files publicly available, if yes could you point me how I can get them?
Sure! They're the whole exome files from the GIAB project for Ashkenazim Trio Son (NA24385). The BAM files can be found here, which I converted to fastq using samtools for KMC kmer counting (samtools fastq file.bam > file.sam).
Thanks! More detailed info would be appreciated, this is what I have found: ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/PacBio_MtSinai_NIST/CSHL_bwamem_bam_GRCh37/ but I am not sure if this is your input, so could you confirm or send URL to download your data? Also, exact command lines (samtools, compression if any and kmc) could help me to diagnose the problem.
BTW, I think you would rather want to name samtools output file.fastq:
samtools fastq file.bam > file.fastq
The errror you mentioned suggest that you compressed samtools output, so my guess is you command line was something like:
samtools file.bam > file.sam
gzip file.sam
mkdir tmp_dir
kmc -k28 file.sam.gz o tmp_dir
If you did not perform any compression to *.gz, KMC somehow incorrectly detected your input file as compressed. The exact command lines would be really helpful.
Some other notes: You may achieve results faster if you will transform bam to FASTA instead of FASTQ, than you must run kmc with -fa flag:
samtools fasta file.bam > file.fasta
kmc -k28 -fa file.fasta tmp_dir
In a couple of days, there should be a new release of KMC. The new version will support reading from bam input files directly, so no samtools conversion will be required in your case. Anyway, thanks for using KMC!
Apologies, I thought I had included a link to the exact data files! They can be found here: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/ion_exome/
You're absolutely correct that I forgot to mention using the gzip compression function in samtools, which I invoked as
samtools fastq file.bam > file.fastq.gz
(when detecting .gz in the file suffix, samtools automatically compresses the output).
It was these fastq.gz files that threw the vague error.
Excited to hear that KMC will soon support bam files! Does this include filtering bam files with kmc_tools? If possible, I would still appreciate insight on this particular fastq.gz case as we are primarily interested in filtering fastq files.
Hi. Thank you very much. I think I know the reason. When you run samtools this way it cannot detect that you redirected stdout to any file (in particular it cannot detect its extension). This is why the output is not compressed. Probably you should use
samtools -0 file.fastq.gz file.bam
KMC was confused because seeing gz ext it assumes gzipped file.
Unfortunately, currently only k-mer counting will be supported for bam files, we will consider adding it for kmc_tools filter operation. Thanks for the suggestion!
Aha! I'm so used to reading paired-end files. I believe your command is missing the fastq argument, however.
Unfortunately, now I am receiving a different error:
104953 Floating point exception
KMC seems to be making it through most of the file, since its printing out Stage updates, but crashes around 86%. Any idea what's causing this?
Hi! Unfortunately, I was not able to reproduce this bug. Below is my command line:
wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/ion_exome/HG002_NA24385_SRR1767406_IonXpress_020_rawlib_24028.bam
wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/ion_exome/HG002_NA24385_SRR1767407_IonXpress_020_rawlib_24030.bam
wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/ion_exome/HG002_NA24385_SRR1767408_IonXpress_020_rawlib_24050.bam
wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/ion_exome/HG002_NA24385_SRR1767409_IonXpress_020_rawlib_24038.bam
../samtools/samtools-1.8/samtools fastq -0 HG002_NA24385_SRR1767406_IonXpress_020_rawlib_24028.fastq.gz HG002_NA24385_SRR1767406_IonXpress_020_rawlib_24028.bam
../samtools/samtools-1.8/samtools fastq -0 HG002_NA24385_SRR1767407_IonXpress_020_rawlib_24030.fastq.gz HG002_NA24385_SRR1767407_IonXpress_020_rawlib_24030.bam
../samtools/samtools-1.8/samtools fastq -0 HG002_NA24385_SRR1767408_IonXpress_020_rawlib_24050.fastq.gz HG002_NA24385_SRR1767408_IonXpress_020_rawlib_24050.bam
../samtools/samtools-1.8/samtools fastq -0 HG002_NA24385_SRR1767409_IonXpress_020_rawlib_24038.fastq.gz HG002_NA24385_SRR1767409_IonXpress_020_rawlib_24038.bam
mkdir tmp
kmc3.0.0/kmc -k28 HG002_NA24385_SRR1767406_IonXpress_020_rawlib_24028.fastq.gz HG002_NA24385_SRR1767406_IonXpress_020_rawlib_24028 tmp
kmc3.0.0/kmc -k28 HG002_NA24385_SRR1767407_IonXpress_020_rawlib_24030.fastq.gz HG002_NA24385_SRR1767407_IonXpress_020_rawlib_24030 tmp
kmc3.0.0/kmc -k28 HG002_NA24385_SRR1767408_IonXpress_020_rawlib_24050.fastq.gz HG002_NA24385_SRR1767408_IonXpress_020_rawlib_24050 tmp
kmc3.0.0/kmc -k28 HG002_NA24385_SRR1767409_IonXpress_020_rawlib_24038.fastq.gz HG002_NA24385_SRR1767409_IonXpress_020_rawlib_24038 tmp
KMC output:
****************************
Stage 1: 100%
Stage 2: 100%
1st stage: 186.328s
2nd stage: 30.3182s
Total : 216.646s
Tmp size : 13807MB
Stats:
No. of k-mers below min. threshold : 1136493277
No. of k-mers above max. threshold : 0
No. of unique k-mers : 1528010005
No. of unique counted k-mers : 391516728
Total no. of k-mers : 13387124874
Total no. of reads : 82654309
Total no. of super-k-mers : 1294387447
*******************************
Stage 1: 100%
Stage 2: 100%
1st stage: 214.087s
2nd stage: 38.6248s
Total : 252.712s
Tmp size : 15778MB
Stats:
No. of k-mers below min. threshold : 1164315907
No. of k-mers above max. threshold : 0
No. of unique k-mers : 1568269000
No. of unique counted k-mers : 403953093
Total no. of k-mers : 15306907837
Total no. of reads : 92985629
Total no. of super-k-mers : 1478898440
*****************************
Stage 1: 100%
Stage 2: 100%
1st stage: 195.173s
2nd stage: 32.6008s
Total : 227.774s
Tmp size : 14167MB
Stats:
No. of k-mers below min. threshold : 1186570229
No. of k-mers above max. threshold : 0
No. of unique k-mers : 1631807441
No. of unique counted k-mers : 445237212
Total no. of k-mers : 13709252207
Total no. of reads : 90400856
Total no. of super-k-mers : 1328968760
*******************************
Stage 1: 100%
Stage 2: 100%
1st stage: 210.572s
2nd stage: 37.1707s
Total : 247.743s
Tmp size : 15266MB
Stats:
No. of k-mers below min. threshold : 1301868995
No. of k-mers above max. threshold : 0
No. of unique k-mers : 1751534818
No. of unique counted k-mers : 449665823
Total no. of k-mers : 14795773918
Total no. of reads : 92792496
Total no. of super-k-mers : 1431276858
To dig further I need to reproduce this bug, so could you please show me your command line (esspiecially how you run KMC, together with the way you prepare input data for KMC).
Huh, it seems to be working today. I did re-run my samtools commands just in case, so maybe I still hadn't been invoking samtools correctly. Thanks for your help!
Hi,
I copied your latest version (commit 609ce4074eb1168695501dd152280c39a0ad0ece), and compiled it on my machine, but an error occurs when reading gzip files. It implied this message: some error occurs when reading gzip files.
I checked the code, it happened at the 703 line in fastq_reader.cpp, while the stream.avail_in is 1. I replaced the code with your released code (v3.1.1rc1), i.e., when it's mulitstream just call inflateReset() and check if its return value is Z_OK . It worked fine for me.
So my questions are:
- Is it ok for just checking inflateReset(&stream) return value when coming to the end of ret is Z_STREAM_END file?
- what caused this error? I mean shouldn't avail_in be zero when Z_STREAM_END?
Thanks,
Dengfeng
Hi,
thanks for reporting that. I would appreciate if you could send me your input data and exact command you use to run kmc.
I think maybe I know how it should be fixed, but I would prefer to check it on the data causing the problem.
In general, the code that is present in the 609ce4074eb1168695501dd152280c39a0ad0ece commit is a fix for a case when there are trailing garbage data. In general gz standard allows for a following situation.
block of valid data (gz stream)
------------------------------
block of valid data (gz stream)
------------------------------
...
------------------------------
block of valid data (gz stream)
------------------------------
trailing garbage
------------------------------
The code that is present in kmc3.1.1rc1 may fail when there is trailing garbage (you may check issue #110 which was fixed with 88053157bb74fb9d1b39bb90a2994f79a257000c).
I think that in your case there is exactly one byte of trailing garbage, which I didn't take into account when introducing a fix for #110.
Answering your questions:
- I think it may be not sufficient, according to the docs of
inflateResetinzlib.h:
inflateReset returns Z_OK if success, or Z_STREAM_ERROR if the source
stream state was inconsistent (such as zalloc or state being Z_NULL).
, so I think it will never return Z_STREAM_END (but maybe I am wrong if you know how to check if the next data block is a valid zstream let me know, because now I need to check its two byte starting marker manually:
garbage = b1 != 0x1f || b2 != 0x8b;
- well, it may not be zero if the file contains multiple zstreams or some garbage after last valid zstream data.
If you cannot send me your input data, please let me know, maybe I will try to make a fix and you could check if it works?
BTW. thanks for such a level of details when reporting this bug, it is really helpful.
Hi,
Thanks for your answers.
I don't know how to check valid start for next block. Just I think if It's already z_stream_end, maybe you should just ignore the remaining bytes in avail_in, reset stream and goto next data, just a guess, could be wrong.
The data could be downloaded by the following command: aws s3 --no-sign-request sync s3://genomeark/species/Taeniopygia_guttata/bTaeGut1/genomic_data/10x/
and KMC command is: kmc -t12 -m20 @kmc_10x.fofn NA.res .
If you can't download the data, just email me your code and I can test it.
Please let me know if you fix the error.
Thanks,
Dengfeng
BTW, my email adress is [email protected], cheers.