KMC icon indicating copy to clipboard operation
KMC copied to clipboard

Some error while reading gzip file

Open irenatfh opened this issue 7 years ago • 11 comments

I'm trying to do kmer counting on some GIAB fastq.gz files which have reads of varying length. However, when I try to run KMC I receive the very vague error message: "Some error while reading gzip file." Any ideas about what might be causing this? Thanks!

irenatfh avatar May 07 '18 20:05 irenatfh

Are your input files publicly available, if yes could you point me how I can get them?

marekkokot avatar May 08 '18 06:05 marekkokot

Sure! They're the whole exome files from the GIAB project for Ashkenazim Trio Son (NA24385). The BAM files can be found here, which I converted to fastq using samtools for KMC kmer counting (samtools fastq file.bam > file.sam).

irenatfh avatar May 08 '18 20:05 irenatfh

Thanks! More detailed info would be appreciated, this is what I have found: ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/PacBio_MtSinai_NIST/CSHL_bwamem_bam_GRCh37/ but I am not sure if this is your input, so could you confirm or send URL to download your data? Also, exact command lines (samtools, compression if any and kmc) could help me to diagnose the problem.

BTW, I think you would rather want to name samtools output file.fastq:

samtools fastq file.bam > file.fastq

The errror you mentioned suggest that you compressed samtools output, so my guess is you command line was something like:

samtools file.bam > file.sam
gzip file.sam
mkdir tmp_dir
kmc -k28 file.sam.gz o tmp_dir

If you did not perform any compression to *.gz, KMC somehow incorrectly detected your input file as compressed. The exact command lines would be really helpful.

Some other notes: You may achieve results faster if you will transform bam to FASTA instead of FASTQ, than you must run kmc with -fa flag:

samtools fasta file.bam > file.fasta
kmc -k28 -fa file.fasta tmp_dir

In a couple of days, there should be a new release of KMC. The new version will support reading from bam input files directly, so no samtools conversion will be required in your case. Anyway, thanks for using KMC!

marekkokot avatar May 08 '18 20:05 marekkokot

Apologies, I thought I had included a link to the exact data files! They can be found here: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/ion_exome/

You're absolutely correct that I forgot to mention using the gzip compression function in samtools, which I invoked as

samtools fastq file.bam > file.fastq.gz

(when detecting .gz in the file suffix, samtools automatically compresses the output).

It was these fastq.gz files that threw the vague error.

Excited to hear that KMC will soon support bam files! Does this include filtering bam files with kmc_tools? If possible, I would still appreciate insight on this particular fastq.gz case as we are primarily interested in filtering fastq files.

irenatfh avatar May 09 '18 00:05 irenatfh

Hi. Thank you very much. I think I know the reason. When you run samtools this way it cannot detect that you redirected stdout to any file (in particular it cannot detect its extension). This is why the output is not compressed. Probably you should use

samtools -0 file.fastq.gz file.bam

KMC was confused because seeing gz ext it assumes gzipped file.

Unfortunately, currently only k-mer counting will be supported for bam files, we will consider adding it for kmc_tools filter operation. Thanks for the suggestion!

marekkokot avatar May 09 '18 05:05 marekkokot

Aha! I'm so used to reading paired-end files. I believe your command is missing the fastq argument, however.

Unfortunately, now I am receiving a different error:

104953 Floating point exception

KMC seems to be making it through most of the file, since its printing out Stage updates, but crashes around 86%. Any idea what's causing this?

irenatfh avatar May 11 '18 00:05 irenatfh

Hi! Unfortunately, I was not able to reproduce this bug. Below is my command line:

wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/ion_exome/HG002_NA24385_SRR1767406_IonXpress_020_rawlib_24028.bam
wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/ion_exome/HG002_NA24385_SRR1767407_IonXpress_020_rawlib_24030.bam
wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/ion_exome/HG002_NA24385_SRR1767408_IonXpress_020_rawlib_24050.bam
wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/ion_exome/HG002_NA24385_SRR1767409_IonXpress_020_rawlib_24038.bam

../samtools/samtools-1.8/samtools fastq -0 HG002_NA24385_SRR1767406_IonXpress_020_rawlib_24028.fastq.gz HG002_NA24385_SRR1767406_IonXpress_020_rawlib_24028.bam
../samtools/samtools-1.8/samtools fastq -0 HG002_NA24385_SRR1767407_IonXpress_020_rawlib_24030.fastq.gz HG002_NA24385_SRR1767407_IonXpress_020_rawlib_24030.bam
../samtools/samtools-1.8/samtools fastq -0 HG002_NA24385_SRR1767408_IonXpress_020_rawlib_24050.fastq.gz HG002_NA24385_SRR1767408_IonXpress_020_rawlib_24050.bam
../samtools/samtools-1.8/samtools fastq -0 HG002_NA24385_SRR1767409_IonXpress_020_rawlib_24038.fastq.gz HG002_NA24385_SRR1767409_IonXpress_020_rawlib_24038.bam

mkdir tmp
kmc3.0.0/kmc -k28 HG002_NA24385_SRR1767406_IonXpress_020_rawlib_24028.fastq.gz HG002_NA24385_SRR1767406_IonXpress_020_rawlib_24028 tmp
kmc3.0.0/kmc -k28 HG002_NA24385_SRR1767407_IonXpress_020_rawlib_24030.fastq.gz HG002_NA24385_SRR1767407_IonXpress_020_rawlib_24030 tmp
kmc3.0.0/kmc -k28 HG002_NA24385_SRR1767408_IonXpress_020_rawlib_24050.fastq.gz HG002_NA24385_SRR1767408_IonXpress_020_rawlib_24050 tmp
kmc3.0.0/kmc -k28 HG002_NA24385_SRR1767409_IonXpress_020_rawlib_24038.fastq.gz HG002_NA24385_SRR1767409_IonXpress_020_rawlib_24038 tmp

KMC output:

****************************
Stage 1: 100%
Stage 2: 100%
1st stage: 186.328s
2nd stage: 30.3182s
Total    : 216.646s
Tmp size : 13807MB

Stats:
   No. of k-mers below min. threshold :   1136493277
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :   1528010005
   No. of unique counted k-mers       :    391516728
   Total no. of k-mers                :  13387124874
   Total no. of reads                 :     82654309
   Total no. of super-k-mers          :   1294387447
*******************************
Stage 1: 100%
Stage 2: 100%
1st stage: 214.087s
2nd stage: 38.6248s
Total    : 252.712s
Tmp size : 15778MB

Stats:
   No. of k-mers below min. threshold :   1164315907
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :   1568269000
   No. of unique counted k-mers       :    403953093
   Total no. of k-mers                :  15306907837
   Total no. of reads                 :     92985629
   Total no. of super-k-mers          :   1478898440
*****************************
Stage 1: 100%
Stage 2: 100%
1st stage: 195.173s
2nd stage: 32.6008s
Total    : 227.774s
Tmp size : 14167MB

Stats:
   No. of k-mers below min. threshold :   1186570229
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :   1631807441
   No. of unique counted k-mers       :    445237212
   Total no. of k-mers                :  13709252207
   Total no. of reads                 :     90400856
   Total no. of super-k-mers          :   1328968760
*******************************
Stage 1: 100%
Stage 2: 100%
1st stage: 210.572s
2nd stage: 37.1707s
Total    : 247.743s
Tmp size : 15266MB

Stats:
   No. of k-mers below min. threshold :   1301868995
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :   1751534818
   No. of unique counted k-mers       :    449665823
   Total no. of k-mers                :  14795773918
   Total no. of reads                 :     92792496
   Total no. of super-k-mers          :   1431276858

To dig further I need to reproduce this bug, so could you please show me your command line (esspiecially how you run KMC, together with the way you prepare input data for KMC).

marekkokot avatar May 11 '18 09:05 marekkokot

Huh, it seems to be working today. I did re-run my samtools commands just in case, so maybe I still hadn't been invoking samtools correctly. Thanks for your help!

irenatfh avatar May 22 '18 00:05 irenatfh

Hi,

I copied your latest version (commit 609ce4074eb1168695501dd152280c39a0ad0ece), and compiled it on my machine, but an error occurs when reading gzip files. It implied this message: some error occurs when reading gzip files.

I checked the code, it happened at the 703 line in fastq_reader.cpp, while the stream.avail_in is 1. I replaced the code with your released code (v3.1.1rc1), i.e., when it's mulitstream just call inflateReset() and check if its return value is Z_OK . It worked fine for me.

So my questions are:

  1. Is it ok for just checking inflateReset(&stream) return value when coming to the end of ret is Z_STREAM_END file?
  2. what caused this error? I mean shouldn't avail_in be zero when Z_STREAM_END?

Thanks,

Dengfeng

dfguan avatar Apr 30 '19 08:04 dfguan

Hi,

thanks for reporting that. I would appreciate if you could send me your input data and exact command you use to run kmc.

I think maybe I know how it should be fixed, but I would prefer to check it on the data causing the problem.

In general, the code that is present in the 609ce4074eb1168695501dd152280c39a0ad0ece commit is a fix for a case when there are trailing garbage data. In general gz standard allows for a following situation.

block of valid data (gz stream)
------------------------------  
block of valid data (gz stream)
------------------------------ 
...
------------------------------   
block of valid data (gz stream)
------------------------------   
trailing garbage
------------------------------    

The code that is present in kmc3.1.1rc1 may fail when there is trailing garbage (you may check issue #110 which was fixed with 88053157bb74fb9d1b39bb90a2994f79a257000c).

I think that in your case there is exactly one byte of trailing garbage, which I didn't take into account when introducing a fix for #110.

Answering your questions:

  1. I think it may be not sufficient, according to the docs of inflateReset in zlib.h:
     inflateReset returns Z_OK if success, or Z_STREAM_ERROR if the source
   stream state was inconsistent (such as zalloc or state being Z_NULL).

, so I think it will never return Z_STREAM_END (but maybe I am wrong if you know how to check if the next data block is a valid zstream let me know, because now I need to check its two byte starting marker manually:

garbage = b1 != 0x1f || b2 != 0x8b;
  1. well, it may not be zero if the file contains multiple zstreams or some garbage after last valid zstream data.

If you cannot send me your input data, please let me know, maybe I will try to make a fix and you could check if it works?

BTW. thanks for such a level of details when reporting this bug, it is really helpful.

marekkokot avatar Apr 30 '19 15:04 marekkokot

Hi,

Thanks for your answers.

I don't know how to check valid start for next block. Just I think if It's already z_stream_end, maybe you should just ignore the remaining bytes in avail_in, reset stream and goto next data, just a guess, could be wrong.

The data could be downloaded by the following command: aws s3 --no-sign-request sync s3://genomeark/species/Taeniopygia_guttata/bTaeGut1/genomic_data/10x/

and KMC command is: kmc -t12 -m20 @kmc_10x.fofn NA.res .

If you can't download the data, just email me your code and I can test it.

Please let me know if you fix the error.

Thanks,

Dengfeng

BTW, my email adress is [email protected], cheers.

dfguan avatar Apr 30 '19 16:04 dfguan