Use ISA-L to speed up gzip decompression
Hi.
I sped up the read-loading part of strobealign in this PR: I replaced zlib's gzread for gzip decompression with a class that performs gzip decompression with double buffering using ISA-L. As a side effect, double buffering is also used when reading uncompressed FASTQ, which speeds that up as well.
The benchmark for gzip decompression with gzopen is shown below.
```
Total time mapping: 2610.28 s.
Total time reading read-file(s): 2284.26 s.
Total time creating strobemers: 25.25 s.
Total time finding NAMs (non-rescue mode): 42.51 s.
Total time finding NAMs (rescue mode): 72.65 s.
Total time sorting NAMs (candidate sites): 121.26 s.
Total time base level alignment (ssw): 167.53 s.
Total time writing alignment to files: 0.00 s.
```
With this PR, the same run looks like this:
```
Total time mapping: 727.12 s.
Total time reading read-file(s): 110.86 s.
Total time creating strobemers: 47.43 s.
Total time finding NAMs (non-rescue mode): 70.84 s.
Total time finding NAMs (rescue mode): 151.78 s.
Total time sorting NAMs (candidate sites): 232.37 s.
Total time base level alignment (ssw): 324.62 s.
Total time writing alignment to files: 0.00 s.
```
This PR improves scalability in systems with a very large number of cores.
Could you please review it?
I am trying to fix CI, but I can't figure out why it segfaults on macOS or why the build fails for the Python bindings. Could you give me some advice? I tried to create a similar environment locally for the Python bindings and the build passes; it may be a problem with the SOURCE_DIR variable, but I am not sure what the cause is...
For testing the Python bindings, you should compile via pip install. So something like this:
```
python3 -m venv .venv
.venv/bin/pip install .
```
I was able to reproduce the error when I did the above. But as I said, I will try to help with figuring out that issue.
1. Usage of ISA-L needs to be optional since it is not available everywhere (Apple Silicon for example)
Hm, it seems to be available for aarch64, so I take that back.
Once again regarding this:
2. We can tackle this later, but should keep it in mind: We need to make it possible to dynamically link to `libisal` so that an already installed one can be used. So when building the Conda package, for example, we would just add `isa-l` as a dependency.
Maybe it is actually better to always assume that libisal is available on the system. Looking at the isa-l wiki page, it’s clear that libisal is available on many Linux distributions and on macOS via brew. So this is probably simpler and more reliable than using ExternalProject.
I got this working using the system isa-l by replacing ExternalProject_* with this:
```cmake
find_package(PkgConfig REQUIRED)
pkg_check_modules(ISAL REQUIRED IMPORTED_TARGET GLOBAL libisal>=2.30.0)
target_include_directories(salib PUBLIC src/ ext/ ${PROJECT_BINARY_DIR} ${ISAL_INCLUDE_DIRS})
target_link_libraries(salib PUBLIC ZLIB::ZLIB Threads::Threads zstr::zstr PkgConfig::ISAL)
```
And in iowrap.hpp, I had to change `#include "igzip_lib.h"` to `#include "isa-l/igzip_lib.h"`. I then installed libisal with `sudo apt install libisal-dev`, and strobealign compiled fine; even `pip install .` worked.
Thanks for the review. I think I was able to fix all the points you made. I should have used isa-l from the package manager, as you pointed out.
I am not sure what is causing the segfault on macOS...
Thank you for your review. I have fixed the points you mentioned.
Regarding the macOS failure, this is the error message:
```
Total time writing index: 0.02 s
tests/run.sh: line 56: 2953 Segmentation fault: 11 strobealign -r 150 -i tests/phix.fasta
```
The strobealign -i command does not even read from any input FASTQ files, and it crashes right at the end shortly before it would exit anyway. Maybe some cleanup that doesn’t work as expected?
The reason I am double-buffering the gzip decompression is speed: decompressing without double buffering would be much slower, so I buffer to hide that latency.
But you are right, we should properly consider whether we really need gigabytes of buffer, so I ran some benchmarks. In the following, decompression is done n bytes at a time and the buffer for decompressed data is allocated as 4*n bytes.
| chunk size / decompressed buffer | Total time reading read-file(s) |
| --- | --- |
| 256 KB / 1 MB | 211.97 s |
| 1 MB / 4 MB | 109.14 s |
| 2 MB / 8 MB | 82.67 s |
| 4 MB / 16 MB | 85.07 s |
| 8 MB / 32 MB | 88.06 s |
| 16 MB / 64 MB | 85.09 s |
| 32 MB / 128 MB | 90.15 s |
| 64 MB / 256 MB | 91.16 s |
| 128 MB / 512 MB | 93.04 s |
| 256 MB / 1 GB | 92.55 s |
On my machine, with a suitable buffer size, reading generally finishes in around 80 s. With 256 KB chunks, the decompression time cannot be fully hidden. So I settled on decompressing in 2 MB chunks and changed the buffer for decompressed data to 8 MB.
Hi, I’m back from traveling, sorry for the delay. Thanks for the update, and great that you could even fix the macOS issue.
I’m trying to understand UncompressedReader and IsalGzipReader, but am a bit slow. I think a couple of comments would have helped. But can you help me for now by explaining a couple of things:
- What are `read_buffer` and `read_buffer_work`, and what is the difference?
- Why does `UncompressedReader` use a separate thread?
- Is it correct that a thread is created and joined for each read chunk? (I guess it’s ok because thread creation is efficient.)
I also wonder whether using mmap() for the input is appropriate. I’d tend to use that for files that need random access, but the input FASTQ needs to be streamed only once, so I don’t see why it would be necessary to map it in its entirety into memory. It also appears to make the code more complex as you now need to find out what the file size is in advance.
This has been sitting here for a while, but it would still be nice to have in strobealign. I made the modifications I think are necessary to merge this and opened #418 instead.
@telmin I squashed your contribution to a single commit (preserving authorship information).
I’m closing this now in favor of #418.