Biostrings

Add zstd support

Open hjarnek opened this issue 1 year ago • 3 comments

Hi,

It would be great to have support for zstd compression and decompression, especially for FASTQ files, which can get very big with modern sequencing technologies. zstd looks more and more like the natural successor to gzip, and hopefully the field of bioinformatics will move away from gzip in the near future. zstd is much faster, has a better compression ratio, supports multithreading natively, and comes as a well-maintained C library. Are there any plans to implement this?

hjarnek, Oct 14 '24 20:10

I just tried it out with hg19.fa and won't bother with statistics. For compressing the large single sequence, zstd with default parameters seems very performant relative to gzip. I then checked whether it is part of the samtools/htslib stack and found https://github.com/samtools/htslib/pull/1770, so that does not look very favorable at the moment. It does pop up in a UKBB workflow: https://dnanexus.gitbook.io/uk-biobank-rap/science-corner/whole-exome-sequencing-oqfe-protocol/protocol-for-processing-ukb-whole-exome-sequencing-data-sets. @hjarnek please supply some links with information on uptake in bioinformatics so that we can assess the priority of such a move.

vjcitn, Oct 14 '24 21:10

I don't have any specific sources; it's just an observation that zstd is being used a lot in other contexts, and since gzip is getting old compared to more modern compression algorithms, I thought zstd could be a good successor. Who knows what the field will eventually settle on. I'm a biologist, not a computer scientist, but I think it's clear that data compression is becoming increasingly valued as the amounts of data grow, in bioinformatics too, so I find it logical that people will try to move away from gzip in the near future. There are of course other fast, high-compression algorithms besides zstd, and maybe one of them is better suited. I see the discussion was going strong for a while in the GH issue related to the PR you linked, and according to a pretty graph there, zstd seems to come out on top with bioinformatic data as well. But I'm not the right person to discuss technical details with.

hjarnek, Oct 14 '24 22:10

zstd is already used to compress raw data from Oxford Nanopore sequencing platforms, among other places:

  1. ont_fast5_api, which we use to compress fast5 files (>1 TB per 30x WGS sample).
  2. slow5tools and the slow5 article, whose authors suggest using zstd instead of the older zlib.
  3. conda and other tools, which have used zstd for a while.

I'm not a user of Biostrings, but like you I looked at how the htslib team treats zstd (https://github.com/samtools/htslib/pull/1770), which turned out to be a little disappointing.

From my experience replacing zlib (.gz) with zstd (.zst) wherever possible in my scientific projects, a large database over a limited alphabet (ATCGN) benefits a lot: disk usage drops significantly, and decompression does not take long. With zstd -dc (alias zstdcat), it fits seamlessly into existing bioinformatics pipelines. In addition, the output can be made rsyncable.
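
To make the zstd -dc pattern concrete, here is a minimal Python sketch that streams a .zst FASTQ file through the zstd command-line tool and parses records as they arrive. It assumes the zstd CLI is on PATH and that records use the plain four-line FASTQ layout; the function names parse_fastq and stream_fastq are hypothetical and are not part of Biostrings or any other library.

```python
import subprocess
from typing import Iterable, Iterator, Sequence, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records or at the end
        seq = next(it, None)
        plus = next(it, None)
        qual = next(it, None)
        if (
            not header.startswith("@")
            or seq is None
            or plus is None
            or qual is None
            or not plus.startswith("+")
        ):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield header[1:], seq, qual


def stream_fastq(
    path: str,
    decompress_cmd: Sequence[str] = ("zstd", "-dc"),  # assumption: zstd CLI installed
) -> Iterator[Tuple[str, str, str]]:
    """Decompress `path` in a child process and parse its stdout lazily."""
    with subprocess.Popen(
        [*decompress_cmd, path], stdout=subprocess.PIPE, text=True
    ) as proc:
        yield from parse_fastq(proc.stdout)
    # Only reached when the stream was read to the end.
    if proc.returncode != 0:
        raise RuntimeError(f"decompressor exited with status {proc.returncode}")
```

Because the decompressor is just a command prefix, the same function would also read .gz files with decompress_cmd=("gzip", "-dc"), which is roughly what "seamlessly" means here: the consumer only ever sees a text stream.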

As for the trade-off around compression level, I think you could tune the level to the number of available cores, or just leave it to the user.
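
As a rough illustration of leaving both knobs to the user, the Python sketch below builds a zstd command line from a compression level and a thread count. The helper name zstd_compress_cmd and its defaults are hypothetical; the flags themselves (a numeric level, -T for worker threads with 0 meaning auto-detect, and --rsyncable) are standard zstd CLI options.

```python
from typing import List


def zstd_compress_cmd(
    path: str,
    level: int = 3,        # zstd's own default level
    threads: int = 0,      # 0 lets zstd pick based on detected cores
    rsyncable: bool = False,
) -> List[str]:
    """Build (but do not run) a zstd compression command for `path`."""
    if not 1 <= level <= 19:
        # Levels 20-22 exist but need --ultra; kept out of this sketch.
        raise ValueError("level must be between 1 and 19")
    if threads < 0:
        raise ValueError("threads must be >= 0")
    cmd = ["zstd", f"-{level}", f"-T{threads}"]
    if rsyncable:
        cmd.append("--rsyncable")
    cmd += ["--", path]  # "--" guards against paths that start with "-"
    return cmd
```

The resulting list can be passed to subprocess.run. Any policy that ties the level to the core count would sit on top of this and is deliberately not guessed at here.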

YuanfengZhang, Jun 04 '25 09:06