vg icon indicating copy to clipboard operation
vg copied to clipboard

error with vg mpmap

Open GSgen opened this issue 3 years ago • 6 comments

Hello,

I have made pggb graphs for individual chromosomes and used the following steps to run vg mpmap:

vg convert -g smooth.gfa -p > ch01.pg #all steps for all chromosomes separately
grep ch01 combined.gtf > ch01.gtf #extracting chromosome 1 for all the lines in the graph from the combined .gtf file
vg mod -X 256 ch01.pg > ch01_chopped.pg
vg rna -p -n ch01.gtf ch01_chopped.pg > spliced_ch01.pg
vg ids -j spliced_ch01.pg
cat spliced_ch*.pg > all_ch.pg #combining all Packed Graphs for individual chromosomes
vg prune -P all_ch.pg > all_ch_pruned.pg
vg index -p -g all_ch.gcsa all_ch_pruned.pg
vg index -x all_ch.xg all_ch_pruned.pg
vg snarls -T all_ch.xg > all_ch.snarls
vg index -x all_ch.xg -s all_ch.snarls -j all_ch.dist
vg mpmap -x all_ch_pruned.pg -g all_ch.gcsa -d all_ch.dist -n rna -f R_1.fastq.gz -f R_2.fastq.gz > mapped_reads.gamp

All the steps run without error. However, vg mpmap step gives:

[vg mpmap] elapsed time 0 s: Loading graph from all_ch_pruned.pg
[vg mpmap] elapsed time 1 s: Completed loading graph
[vg mpmap] elapsed time 1 s: Graph is in PackedGraph format. PackedGraph is memory efficient, but has some slow queries. See `vg convert` if you want to change graph formats.
[vg mpmap] elapsed time 1 s: Identifying reference paths
[vg mpmap] elapsed time 1 s: Loading distance index from all_ch.dist (in background)
[vg mpmap] elapsed time 1 s: Loading GCSA2 from all_ch.gcsa
[vg mpmap] elapsed time 1 s: Completed loading GCSA2
[vg mpmap] elapsed time 1 s: Loading LCP from all_ch.gcsa.lcp
[vg mpmap] elapsed time 1 s: Memoizing GCSA2 queries (in background)
[vg mpmap] elapsed time 1 s: Completed loading distance index
[vg mpmap] elapsed time 1 s: Completed loading LCP
[vg mpmap] elapsed time 6 s: Completed memoizing GCSA2 queries
[vg mpmap] elapsed time 6 s: Building null model to calibrate mismapping detection
[vg mpmap] elapsed time 9 s: Mapping reads from R_1.fastq.gz and R_2.fastq.gz using 32 threads
warning:[vg mpmap] Mapped 25000 read pairs as unpaired reads to learn fragment length distribution, but only obtained 6 unambiguous, consistently mapped pairs. Often this indicates data issues, such as reads that are pre-sorted with unmappable reads at the front, reads that are not actually paired, or mismatched indexes.

It ran for three days and did not produce the output .gamp file. I appreciate any suggestions about this. Thank you!

GSgen avatar Jun 10 '22 18:06 GSgen

One issue I see is that you should be using all_ch.xg to map with, not all_ch_pruned.pg. The pruned graph should only be used for indexing with GCSA2.

I suspect what's actually causing the problem is this line though:

cat spliced_ch*.pg > all_ch.pg

The PackedGraph format can't be concatenated like this. I think it's probably only loading the first chromosome and then trying to map everything to that chromosome, which is why you are getting that warning. The correct way to merge PackedGraph is with vg combine.

jeizenga avatar Jun 13 '22 16:06 jeizenga

Hi @jeizenga,

I did not get a chance to follow up earlier. I combined the individual spliced graphs using vg combine and used:

vg combine spliced_ch*.pg > all_ch.pg
vg prune -P all_ch.pg > all_ch_pruned.pg
vg index -p -t 16 -g all_ch.gcsa all_ch_pruned.pg

However, I am getting an error about OUT_OF_MEMORY for the vg index step. I have used ~250GB memory on a cluster.

GSgen avatar Jun 29 '22 01:06 GSgen

We've seen cases of complex graphs where the default pruning parameters are not quite enough to restrain the graph for GCSA2 indexing. Can you try adding -k 64 -M 64 to the vg prune command?

jeizenga avatar Jun 29 '22 03:06 jeizenga

@jeizenga I tried,

vg prune -k 64 -M 64 -P all_ch.pg > all_ch_pruned.pg
vg index -p -g all_ch.gcsa all_ch_pruned.pg

It gave an OUT_OF_MEMORY for the vg index step again.

GSgen avatar Jun 30 '22 18:06 GSgen

Can you share the inputs? I can check locally if anything is glaringly wrong.

jeizenga avatar Jun 30 '22 18:06 jeizenga

I sent you an email. Thank you for your help.

GSgen avatar Jun 30 '22 19:06 GSgen

Closing this issue because I believe we worked it out over email.

jeizenga avatar Dec 22 '22 16:12 jeizenga