
bgzipped VCF file over HTTP causes infinite loop

Open totakke opened this issue 3 years ago • 3 comments

Reading a bgzipped VCF file over HTTP causes an infinite loop.

$ gzip -c test-resources/vcf/test-v4_3.vcf > test-resources/vcf/test-v4_3-gzip.vcf.gz
$ bgzip -c test-resources/vcf/test-v4_3.vcf > test-resources/vcf/test-v4_3-bgzip.vcf.gz
$ python3 -m http.server 8000
(require '[cljam.io.vcf :as vcf])

;; The gzipped file is read to completion.
(with-open [rdr (vcf/vcf-reader "http://localhost:8000/test-resources/vcf/test-v4_3-gzip.vcf.gz")]
  (doall (vcf/read-variants rdr)))
;;=> ({:NA00001 {:GT "0|0", :GQ 48, :DP 1, :HQ (51 51)},
;;     :alt ["A"],
;;     :ref "G",
;;     :FORMAT (:GT :GQ :DP :HQ),
;;     :NA00002 {:GT "1|0", :GQ 48, :DP 8, :HQ (51 51)},
;;     :pos 14370,
;;     :filter (:PASS),
;;     :id "rs6054257",
;;     :info {:NS 3, :DP 14, :AF (0.5), :DB :exists, :H2 :exists},
;;     :qual 29.0,
;;     :NA00003 {:GT "1/1", :GQ 43, :DP 5, :HQ (nil nil)},
;;     :chr "20"}
;;    ...)

;; The bgzipped file causes an infinite loop.
(with-open [rdr (vcf/vcf-reader "http://localhost:8000/test-resources/vcf/test-v4_3-bgzip.vcf.gz")]
  (doall (vcf/read-variants rdr)))
;; infinite loop...

cljam can read the bgzipped file as a local file, so I think this issue is related to bgzf4j.

(with-open [rdr (vcf/vcf-reader "test-resources/vcf/test-v4_3-bgzip.vcf.gz")]
  (doall (vcf/read-variants rdr)))
;;=> ({:NA00001 {:GT "0|0", :GQ 48, :DP 1, :HQ (51 51)},
;;     :alt ["A"],
;;     :ref "G",
;;     :FORMAT (:GT :GQ :DP :HQ),
;;     :NA00002 {:GT "1|0", :GQ 48, :DP 8, :HQ (51 51)},
;;     :pos 14370,
;;     :filter (:PASS),
;;     :id "rs6054257",
;;     :info {:NS 3, :DP 14, :AF (0.5), :DB :exists, :H2 :exists},
;;     :qual 29.0,
;;     :NA00003 {:GT "1/1", :GQ 43, :DP 5, :HQ (nil nil)},
;;     :chr "20"}
;;    ...)
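
Since the local read above succeeds, one possible workaround while the HTTP path is broken is to copy the remote file to a temporary file and read that instead. The following is only a sketch: `download-to-temp` is a hypothetical helper (not part of cljam or bgzf4j), and it assumes the file is small enough to download in full.

```clojure
(require '[clojure.java.io :as io])
(import '(java.io File))

;; Hypothetical helper, not part of cljam or bgzf4j.
(defn download-to-temp
  "Copies the whole resource at `url` into a temporary file and returns
  that File. The .vcf.gz suffix is kept so the file still looks like a
  compressed VCF to whatever reads it next."
  ^File [url]
  (let [tmp (File/createTempFile "remote-" ".vcf.gz")]
    (.deleteOnExit tmp)
    ;; io/input-stream accepts a URL string; io/copy streams it to the file.
    (with-open [in (io/input-stream url)]
      (io/copy in tmp))
    tmp))

(comment
  ;; Usage with cljam, reusing the calls from the examples above.
  (require '[cljam.io.vcf :as vcf])
  (let [f (download-to-temp
           "http://localhost:8000/test-resources/vcf/test-v4_3-bgzip.vcf.gz")]
    (with-open [rdr (vcf/vcf-reader (.getPath f))]
      (doall (vcf/read-variants rdr)))))
```

This gives up random access over HTTP entirely, so it is a stopgap rather than a fix.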

I noticed the issue while reading COSMIC's bgzipped VCF, but the NegativeArraySizeException below seems to be a separate problem. Further investigation is needed.

(with-open [rdr (vcf/vcf-reader "http://localhost:8000/CosmicCodingMuts.vcf.gz")]
  (dorun (vcf/read-variants rdr)))
;; Caused by java.lang.NegativeArraySizeException
;; -481634578
;;
;;    BGZFInputStream.java:  365  bgzf4j.BGZFInputStream/inflateBlock
;;    BGZFInputStream.java:  353  bgzf4j.BGZFInputStream/readBlock
;;    BGZFInputStream.java:  102  bgzf4j.BGZFInputStream/available
;;    BGZFInputStream.java:  202  bgzf4j.BGZFInputStream/readLine
;;              reader.clj:  179  cljam.io.vcf.reader/read-data-lines
;;              reader.clj:  175  cljam.io.vcf.reader/read-data-lines
;;              reader.clj:  182  cljam.io.vcf.reader/read-data-lines/fn

totakke avatar May 09 '22 04:05 totakke

I found that SeekableHTTPStream in bgzf4j implements range reads via the HTTP Range header. However, Python's http.server does not support the Range header and always returns the entire document, which seems to cause the infinite read. You can confirm this in the server log below: http.server responds with 200, which, according to RFC 7233, means that it ignored the Range request and returned the entire document:

::ffff:127.0.0.1 - - [12/May/2022 hh:mm:ss] "GET /test-resources/vcf/test-v4_3-bgzip.vcf.gz HTTP/1.1" 200 -

Ideally, bgzf4j would handle servers that do not support the Range header. Failing that, it should at least check that the server returned 206 for a Range request.
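
The 206 check could look roughly like the sketch below. It uses only `java.net.HttpURLConnection` from the JDK. `range-honored?` is a hypothetical name and is not how bgzf4j's SeekableHTTPStream is actually written.

```clojure
(import '(java.net HttpURLConnection URL))

;; Hypothetical helper, not part of cljam or bgzf4j.
(defn range-honored?
  "Sends a one-byte Range request to `url` and returns true only when the
  server answers 206 Partial Content. A 200 means the Range header was
  ignored and the entire document is being returned."
  [^String url]
  (let [^HttpURLConnection conn (.openConnection (URL. url))]
    (try
      (.setRequestMethod conn "GET")
      ;; Ask for the first byte only; a Range-aware server replies with 206.
      (.setRequestProperty conn "Range" "bytes=0-0")
      (= HttpURLConnection/HTTP_PARTIAL (.getResponseCode conn))
      (finally
        (.disconnect conn)))))

(comment
  ;; Against the python3 http.server from the report, this is expected to
  ;; return false, matching the 200 in the log above.
  (range-honored?
   "http://localhost:8000/test-resources/vcf/test-v4_3-bgzip.vcf.gz"))
```

A reader could run such a check once when opening the stream and fail fast with a clear error, instead of looping, when the server does not return 206.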

athos avatar May 12 '22 07:05 athos

@athos Thank you for the investigation. I have created chrovis/bgzf4j#1 to resolve the problem with Range request support. I'm sorry that the bgzf4j project does not have a test environment (it is an old project...), but please review the PR when you are available.

totakke avatar May 13 '22 10:05 totakke

@totakke Thank you for making the PR 🙏 I’m done with my review and merged the change.

athos avatar May 16 '22 06:05 athos

Fixed by 38b7918

alumi avatar Apr 14 '23 05:04 alumi