
Giraffe alignment is very slow and produces warning[vg::Watchdog] messages unless rescue is disabled

Open polchan opened this issue 1 year ago • 14 comments

Dear VG Team Members,

I hope this message finds you well.

While using the vg giraffe tool to align short reads, I encountered an issue similar to the one described in issue #4171. Specifically, some samples process successfully in approximately 20 minutes, while others fail to complete even after a full day of computation. This inconsistency was observed using vg version 1.58.0, with the graph constructed from 23 genomes using minigraph-cactus. Notably, the command executed for all samples was identical.

Given that some samples were processed successfully, I ruled out potential issues related to the graph and vg itself. Upon examining the differences between the samples, I noticed that the fastq.gz files for those that were easily processed are only half to a third of the size of those that encountered errors. These files originate from resequencing data provided by others. As such, I suspect that the errors may be due to the large size of the data in some samples.

Here is the log from a sample that failed to complete. The command was:

~/tools/vg giraffe -p -t 12 --max-extensions 1600 -Z panMalus.d2.gbz -d panMalus.d2.dist -m panMalus.d2.min -f /ngsproject/qmyu/ncbi/public/re_sequence/2021_Plant_Biotechnol_J/rawreads/SRR14557116_1.fastq.gz -f /ngsproject/qmyu/ncbi/public/re_sequence/2021_Plant_Biotechnol_J/rawreads/SRR14557116_2.fastq.gz > SRR14557116n.gam

Preparing Indexes
Loading Minimizer Index
Loading GBZ
Loading Distance Index v2
Paging in Distance Index v2
Initializing MinimizerMapper
Loading and initialization: 58.8633 seconds
Of which Distance Index v2 paging: 1.80507 seconds
Mapping reads to "-" (GAM)
--watchdog-timeout 10 --match 1 --mismatch 4 --gap-open 6 --gap-extend 1 --full-l-bonus 5 --max-multimaps 1 --hit-cap 10 --hard-hit-cap 500 --score-fraction 0.9 --max-min 500 --num-bp-per-min 1000 --distance-limit 200 --max-extensions 1600 --max-alignments 8 --cluster-score 50 --pad-cluster-score 20 --cluster-coverage 0.3 --extension-score 1 --extension-set 20 --rescue-attempts 15 --max-fragment-length 2000 --paired-distance-limit 2 --rescue-subgraph-size 4 --rescue-seed-limit 100 --chaining-cluster-distance 100 --precluster-connection-coverage-threshold 0.3 --min-precluster-connections 10 --max-precluster-connections 50 --max-lookback-bases 100 --min-lookback-items 1 --lookback-item-hard-cap 15 --chain-score-threshold 100 --min-chains 1 --chain-min-score 100 --max-chain-connection 100 --max-tail-length 100 --max-dp-cells 16777216 --rescue-algorithm dozeu
Using fragment length estimate: 433.691 +/- 19.1175
warning[vg::Watchdog]: Thread 0 has been checked in for 10 seconds processing: SRR14557116.18068, SRR14557116.18068
warning[vg::Watchdog]: Thread 2 has been checked in for 10 seconds processing: SRR14557116.20283, SRR14557116.20283
warning[vg::Watchdog]: Thread 11 has been checked in for 10 seconds processing: SRR14557116.19070, SRR14557116.19070
warning[vg::Watchdog]: Thread 6 has been checked in for 10 seconds processing: SRR14557116.25833, SRR14557116.25833
warning[vg::Watchdog]: Thread 4 has been checked in for 10 seconds processing: SRR14557116.26411, SRR14557116.26411
warning[vg::Watchdog]: Thread 1 has been checked in for 10 seconds processing: SRR14557116.29993, SRR14557116.29993
warning[vg::Watchdog]: Thread 8 has been checked in for 10 seconds processing: SRR14557116.28908, SRR14557116.28908
warning[vg::Watchdog]: Thread 10 has been checked in for 10 seconds processing: SRR14557116.29652, SRR14557116.29652
warning[vg::Watchdog]: Thread 7 has been checked in for 10 seconds processing: SRR14557116.32621, SRR14557116.32621
warning[vg::Watchdog]: Thread 9 has been checked in for 10 seconds processing: SRR14557116.33712, SRR14557116.33712
warning[vg::Watchdog]: Thread 7 finally checked out after 14 seconds and 20808 kb memory growth processing: SRR14557116.32621, SRR14557116.32621
warning[vg::Watchdog]: Thread 5 has been checked in for 10 seconds processing: SRR14557116.296057, SRR14557116.296057
warning[vg::Watchdog]: Thread 3 has been checked in for 10 seconds processing: SRR14557116.39673, SRR14557116.39673
warning[vg::Watchdog]: Thread 5 finally checked out after 23 seconds and 0 kb memory growth processing: SRR14557116.296057, SRR14557116.296057
warning[vg::Watchdog]: Thread 7 has been checked in for 10 seconds processing: SRR14557116.42053, SRR14557116.42053
warning[vg::Watchdog]: Thread 5 has been checked in for 10 seconds processing: SRR14557116.305002, SRR14557116.305002
warning[vg::Watchdog]: Thread 5 finally checked out after 20 seconds and 0 kb memory growth processing: SRR14557116.305002, SRR14557116.305002
warning[vg::Watchdog]: Thread 5 has been checked in for 10 seconds processing: SRR14557116.308221, SRR14557116.308221
warning[vg::Watchdog]: Thread 7 finally checked out after 1020 seconds and 24628 kb memory growth processing: SRR14557116.42053, SRR14557116.42053
warning[vg::Watchdog]: Thread 7 has been checked in for 10 seconds processing: SRR14557116.42113, SRR14557116.42113
warning[vg::Watchdog]: Thread 11 finally checked out after 10631 seconds and 508080 kb memory growth processing: SRR14557116.19070, SRR14557116.19070
warning[vg::Watchdog]: Thread 11 has been checked in for 10 seconds processing: SRR14557116.42588, SRR14557116.42588
warning[vg::Watchdog]: Thread 11 finally checked out after 2780 seconds and 0 kb memory growth processing: SRR14557116.42588, SRR14557116.42588
warning[vg::Watchdog]: Thread 11 has been checked in for 10 seconds processing: SRR14557116.43021, SRR14557116.43021

Despite running for 10 hours, the process did not complete and showed no signs of doing so even with extended time.

I experimented with various parameters, such as --max-extensions, --max-alignments, and --hard-hit-cap, but these adjustments had no effect and the samples continued to fail. However, when I modified the --rescue-algorithm parameter, I found that disabling rescue entirely allowed the run to complete. Here is the log from a successful run of the same sample, with the only difference being the rescue setting. The command was:

~/tools/vg giraffe -p -t 12 -A none --max-extensions 1600 -Z panMalus.d2.gbz -d panMalus.d2.dist -m panMalus.d2.min -f /ngsproject/qmyu/ncbi/public/re_sequence/2021_Plant_Biotechnol_J/rawreads/SRR14557116_1.fastq.gz -f /ngsproject/qmyu/ncbi/public/re_sequence/2021_Plant_Biotechnol_J/rawreads/SRR14557116_2.fastq.gz > SRR14557116.gam

Preparing Indexes
Loading Minimizer Index
Loading GBZ
Loading Distance Index v2
Paging in Distance Index v2
Initializing MinimizerMapper
Loading and initialization: 61.6039 seconds
Of which Distance Index v2 paging: 1.47501 seconds
Mapping reads to "-" (GAM)
--watchdog-timeout 10 --match 1 --mismatch 4 --gap-open 6 --gap-extend 1 --full-l-bonus 5 --max-multimaps 1 --hit-cap 10 --hard-hit-cap 500 --score-fraction 0.9 --max-min 500 --num-bp-per-min 1000 --distance-limit 200 --max-extensions 1600 --max-alignments 8 --cluster-score 50 --pad-cluster-score 20 --cluster-coverage 0.3 --extension-score 1 --extension-set 20 --rescue-attempts 0 --max-fragment-length 2000 --paired-distance-limit 2 --rescue-subgraph-size 4 --rescue-seed-limit 100 --chaining-cluster-distance 100 --precluster-connection-coverage-threshold 0.3 --min-precluster-connections 10 --max-precluster-connections 50 --max-lookback-bases 100 --min-lookback-items 1 --lookback-item-hard-cap 15 --chain-score-threshold 100 --min-chains 1 --chain-min-score 100 --max-chain-connection 100 --max-tail-length 100 --max-dp-cells 16777216 --rescue-algorithm none
Using fragment length estimate: 433.691 +/- 19.1175
Mapped 115609390 reads across 12 threads in 16840.4 seconds with 48.7877 additional single-threaded seconds.
Mapping speed: 571.946 reads per second per thread
Used 218606 CPU-seconds (including output).
Achieved 528.848 reads per CPU-second (including output)
Used 1164134888389127 CPU instructions (not including output).
Mapping slowness: 10.0696 M instructions per read at 5325.26 M mapping instructions per inclusive CPU-second
Memory footprint: 27.0476 GB

Based on this observation, could it be that the default --rescue-algorithm parameter is causing the warnings flagged by [vg::Watchdog] and leading to the failures? Additionally, I would like to inquire whether the Dozeu algorithm might be having an adverse effect when applied to samples with high sequencing depth.

I would greatly appreciate any insights or suggestions you might have regarding this issue.

Best regards,

Bo-Cheng Guo

polchan avatar Aug 07 '24 02:08 polchan

Hi, it could be the same problem as https://github.com/vgteam/vg/issues/4207. I have solved this problem.

zhangyixing3 avatar Aug 07 '24 04:08 zhangyixing3

Maybe you could filter the pangenome graph more aggressively when clipping, for example d5 or d8.

zhangyixing3 avatar Aug 07 '24 04:08 zhangyixing3

Hi, it could be the same problem #4207. I have solved this problem.

@zhangyixing3 Thanks for your reply. I have been exploring various solutions to this issue. Specifically, I tried renaming the graph, distance, and minimizer files and tested on both a local server and an HPC cluster. Unfortunately, these changes did not resolve the problem.

It was not until I modified the --rescue-algorithm parameter that I observed a significant improvement. Previously, I had been using the default Dozeu algorithm. By not using any rescue algorithm (i.e., setting -A none), the samples that had previously failed were successfully mapped into GAM files without triggering warning[vg::Watchdog].

Therefore, I suspect that the Dozeu rescue algorithm may be responsible for triggering the warning[vg::Watchdog] messages. In particular, this issue may lead to failed runs when dealing with high-depth sequencing files.

polchan avatar Aug 07 '24 06:08 polchan

In my opinion, some reads differ too much from the reference genome or map to complex regions, and Giraffe executes the rescue algorithm to handle such situations. You might find that some read pairs are particularly difficult to map, for example: Thread 1 finally checked out after 62309 seconds and 0 kb memory growth processing: SRR19465.19470728, SRR19465.19470728. Setting -A none directly might be another solution.

zhangyixing3 avatar Aug 07 '24 07:08 zhangyixing3

@zhangyixing3 In fact, all my samples belong to the same species, and the graph I used includes this species as well, making it unlikely that the issue is due to large sequence divergence. Admittedly, I have not tested GSSW, so I cannot comment on its performance. Overall, I believe that the Dozeu algorithm may have issues with processing certain samples.

polchan avatar Aug 07 '24 13:08 polchan

I think the specific issue here is the complexity of the rescue subgraph.

When Giraffe can find a good alignment for one read but not the pair, it extracts a subgraph at approximately the right distance from the aligned read. Then it tries to align the pair to the subgraph using a simplified version of the Giraffe aligner. We have some safeguards for too large subgraphs and for having too many seeds in the subgraph, but not for complex subgraphs.

Out of the two rescue algorithms, dozeu is faster but uses heuristics, while GSSW is slower but always finds the best alignment. Neither of them is haplotype-aware, and both require that the graph they are aligning to is acyclic. If (the relevant part of) the rescue subgraph contains cycles, it will be transformed into an equivalent DAG. And if the rescue subgraph is complex, the DAG can be very large and aligning the read to it can be slow.

We used to have a prototype haplotype-aware rescue algorithm, but it was too slow to be practical. Perhaps we could try using our new haplotype-aware WFA here.

jltsiren avatar Aug 07 '24 19:08 jltsiren

If we think the problem is that the rescue DAGs get too big, we could add a limit and abort dagification and rescue if we hit it.

@polchan I don't think that the depth specifically is the problem, except that it provides more reads and thus more chances to hit it. The watchdog times how long each read pair takes to map individually and complains when that particular pair takes a long time. So if you see:

warning[vg::Watchdog]: Thread 11 finally checked out after 10631 seconds and 508080 kb memory growth processing: SRR14557116.19070, SRR14557116.19070

Then you should be able to reproduce that warning with a FASTQ containing only that one SRR14557116.19070 read pair, and it should still take ten thousand seconds to map on its own.

Running a pair by itself with the --show-work and --track-provenance Giraffe options would make a GAM with embedded annotations noting how long it spent in each stage of the alignment process, and a debugging log that might have useful information about what it was doing that managed to take so long.
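
For example, here is a minimal sketch of such a rerun, assuming seqtk is available and that the read name in the FASTQ headers exactly matches the name printed in the warning (the output file names are just placeholders):

# extract the single slow read pair from both FASTQ files
echo SRR14557116.19070 > slow_read.txt
seqtk subseq /ngsproject/qmyu/ncbi/public/re_sequence/2021_Plant_Biotechnol_J/rawreads/SRR14557116_1.fastq.gz slow_read.txt > slow_1.fq
seqtk subseq /ngsproject/qmyu/ncbi/public/re_sequence/2021_Plant_Biotechnol_J/rawreads/SRR14557116_2.fastq.gz slow_read.txt > slow_2.fq

# re-map just that pair with the same indexes plus the debugging options
~/tools/vg giraffe -t 1 -p -Z panMalus.d2.gbz -d panMalus.d2.dist -m panMalus.d2.min \
  --show-work --track-provenance \
  -f slow_1.fq -f slow_2.fq > slow.gam 2> slow_debug.log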

adamnovak avatar Aug 08 '24 14:08 adamnovak

If we think the problem is that the rescue DAGs get too big, we could add a limit and abort dagification and rescue if we hit it.

If the rescue DAGs take too long, it would be better to impose some limitations rather than disable the rescue altogether.

When I was mapping sample 10023, I encountered the warnings shown below.

warning[vg::Watchdog]: Thread 3 finally checked out after 295 seconds and 23804 kb memory growth processing: SRR1946553.2920144, SRR1946553.2920144
warning[vg::Watchdog]: Thread 5 has been checked in for 10 seconds processing: SRR1946553.9728216, SRR1946553.9728216
warning[vg::Watchdog]: Thread 5 finally checked out after 143 seconds and 0 kb memory growth processing: SRR1946553.9728216, SRR1946553.9728216
warning[vg::Watchdog]: Thread 3 has been checked in for 10 seconds processing: SRR1946553.11939425, SRR1946553.11939425
warning[vg::Watchdog]: Thread 4 has been checked in for 10 seconds processing: SRR1946553.16735946, SRR1946553.16735946
warning[vg::Watchdog]: Thread 4 finally checked out after 13 seconds and 0 kb memory growth processing: SRR1946553.16735946, SRR1946553.16735946
warning[vg::Watchdog]: Thread 1 has been checked in for 10 seconds processing: SRR1946553.19470728, SRR1946553.19470728
warning[vg::Watchdog]: Thread 3 finally checked out after 2914 seconds and 30932 kb memory growth processing: SRR1946553.11939425, SRR1946553.11939425
warning[vg::Watchdog]: Thread 1 finally checked out after 62309 seconds and 0 kb memory growth processing: SRR1946553.19470728, SRR1946553.19470728
warning[vg::Watchdog]: Thread 8 has been checked in for 10 seconds processing: SRR1946553.25227565, SRR1946553.25227565
warning[vg::Watchdog]: Thread 8 finally checked out after 2871 seconds and 0 kb memory growth processing: SRR1946553.25227565, SRR1946553.25227565

I extracted SRR1946553.2920144, SRR1946553.9728216, SRR1946553.16735946, SRR1946553.25227565, SRR1946553.11939425, and SRR1946553.19470728 from the raw data. However, mapping them took only 5 seconds, and I did not reproduce the warnings.

vg giraffe -t 1 -p -Z ../../mapping.d5.giraffe.gbz -m ../../mapping.d5.min -d ../../mapping.d5.dist \
    -f test_1.fq -f test_2.fq --show-work --track-provenance 1>test_gam.log 2>test.log

test.log

test_gam.log

zhangyixing3 avatar Aug 12 '24 09:08 zhangyixing3

@zhangyixing3 When you are mapping a small subset of reads, you have to specify the fragment length distribution manually to ensure that the reads are mapped in the same way as in the full run.

For example, if you have the following line in the original log:

Using fragment length estimate: 433.691 +/- 19.1175

You add the following options to the vg giraffe command:

--fragment-mean 433.691 --fragment-stdev 19.1175

jltsiren avatar Aug 12 '24 09:08 jltsiren

Thanks, I've already rerun it.

zhangyixing3 avatar Aug 12 '24 10:08 zhangyixing3

Hi, dear sir. Using the following command:

vg giraffe -t 1 -p -Z ../../mapping.d5.giraffe.gbz  -m ../../mapping.d5.min  -d  ../../mapping.d5.dist  \
      -f test_1.fq  -f test_2.fq   \
     --show-work  --track-provenance --fragment-mean 334.005  --fragment-stdev 28.9223  1>test.gam 2>test.log

The mapping process for these six read pairs took a long time (from August 12, 18:15 to August 14, 08:53). Here are the results: test.gam.gz test.log.gz

zhangyixing3 avatar Aug 14 '24 03:08 zhangyixing3

Yeah, it definitely looks like the rescues are taking all the time. There's warning[vg::Watchdog]: Thread 0 finally checked out after 125536 seconds and 0 kb memory growth processing: SRR1946553.19470728, SRR1946553.19470728, and between the 10-second message and that one it's doing nothing but rescues and output. And the GAM-embedded stats for one of those reads show:

    "stage_align_results": 97,
    "stage_align_time": 0.016973553225398064,
    "stage_cluster_results": 66,
    "stage_cluster_time": 0.35473459959030151,
    "stage_extend_results": 66,
    "stage_extend_time": 0.14604765176773071,
    "stage_minimizer_results": 4,
    "stage_minimizer_time": 0.000012450000212993473,
    "stage_pairing_results": 0,
    "stage_pairing_time": 125535.9375,
    "stage_seed_results": 318,
    "stage_seed_time": 0.0020093880593776703,
    "stage_winner_results": 0,
    "stage_winner_time": 0.0099111190065741539

So all the time is going to the "pairing" stage where rescue happens.
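
One rough way to pull those per-read stage timings out of the annotated GAM, as a sketch that assumes jq is installed and that the --track-provenance fields appear under the annotation object when the GAM is converted to JSON:

# convert the GAM to JSON and report the pairing-stage time per read
vg view -a test.gam | jq '{name: .name, pairing_seconds: .annotation.stage_pairing_time}'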

We need to add those rescue work limits to resolve this.

adamnovak avatar Aug 14 '24 21:08 adamnovak

Is it possible to add a time limit for rescue? By counting the number of output reads, I found that only a very small portion of the reads (about 0.5%) exhibit this issue. I believe that even discarding these reads would not significantly affect the results.
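
For what it's worth, here is a quick sketch of one way to count the affected read pairs from the saved stderr log (the mapping.log file name is just a placeholder):

# count distinct read names that ever tripped the watchdog
grep 'warning\[vg::Watchdog\]' mapping.log | grep -o 'processing: [^,]*' | sort -u | wc -l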

zwh82 avatar Sep 10 '24 08:09 zwh82

We much prefer to express limits in terms of work rather than wall-clock time since otherwise the results are not deterministic across systems, and so runs are very hard to repeat.

adamnovak avatar Oct 02 '24 21:10 adamnovak

Hi, please let me know if I should open a new issue. However, I seem to be encountering a similar issue as above. I just wondered if there were any updates on resolving this. Thank you.

NinaMercedes avatar Feb 10 '25 12:02 NinaMercedes

@adamnovak @jltsiren Hello, I encountered some similar issues. I would like to know whether the Giraffe rescue algorithm's alignment is reproducible. When I use Giraffe to align certain sequencing data, I also encounter a large number of these warnings. When extracting node abundances for further analysis (on the same sequencing data), I find that the results from different alignments can vary significantly; sometimes they are quite good, while other times they are relatively poor. However, for sequencing data that doesn't produce these warnings, the results are quite stable.

warning[vg::giraffe]: Refusing to perform too-large rescue alignment of 151 bp against 13072 bp ordered subgraph for read SRR10610665.1277 which would use more than 1572864 cells and might exhaust Dozeu's allocator; suppressing further warnings.
warning[vg::Watchdog]: Thread 18 has been checked in for 10 seconds processing: SRR10610665.6814, SRR10610665.6814
warning[vg::Watchdog]: Thread 42 has been checked in for 10 seconds processing: SRR10610665.25202, SRR10610665.25202
warning[vg::Watchdog]: Thread 42 finally checked out after 10 seconds and 2401496 kb memory growth processing: SRR10610665.25202, SRR10610665.25202
warning[vg::Watchdog]: Thread 37 has been checked in for 10 seconds processing: SRR10610665.16479, SRR10610665.16479

Thanks.

zwh82 avatar Mar 31 '25 09:03 zwh82

The rescue algorithm (and Giraffe as a whole) is intended to be deterministic, and always give the same answer for the same graph, read pair, and fragment length parameters. If you pull out a single read pair and re-align it, you probably will need to specify the fragment length distribution parameters manually, since they won't be auto-detected from the initial input reads in that case.

@zwh82 I don't really know what you mean by "node abundances" (maybe covering read depth on each graph node?) or what it means for those to be "quite good" or "relatively poor".

It sounds like you have a read set where:

  • It contains slow-to-map reads that produce the warnings
  • Also, when you align it multiple times you get different alignments each time

Is that right? That would mean you've found a Giraffe bug: you should get the same alignments each time you run a given vg giraffe command, whether some reads are slow to map or not. Would you be able to share the data with us so we can reproduce and fix the problem?

adamnovak avatar Mar 31 '25 20:03 adamnovak

@zwh82 Can you open a fresh issue about your problem? It sounds like it's not really related to the alignment being too slow to finish.

adamnovak avatar Mar 31 '25 20:03 adamnovak

Sorry, I found that the problem seems to come from the constructed pangenome.

zwh82 avatar Apr 01 '25 01:04 zwh82