ra RA failed - unequal lengths in sequence and overlap file for sequence with id xxxxx

Hi, RA has failed with the following error message

unequal lengths in sequence and overlap file for sequence with id 508204 without any further log message.

Here are the last few lines of the log-file:

[M::main] CMD: /disk2/nguinkal/ra/vendor/minimap2/minimap2 -t 100 -x ava-pb /disk2/nguinkal/mnt/winpro/I3-NGS-Daten/NGS-Fisch/Zander-Project_2017_2020/RawData/Zander_SMRT_Sequel/SMRT_Sequel-Pooled/zander_pooled.subreads.renamed.fasta /disk2/nguinkal/mnt/winpro/I3-NGS-Daten/NGS-Fisch/Zander-Project_2017_2020/RawData/Zander_SMRT_Sequel/SMRT_Sequel-Pooled/zander_pooled.subreads.renamed.fasta
[M::main] Real time: 12065.368 sec; CPU: 868149.797 sec
[Ra::run] preconstruction stage
[rala::Graph::initialize] loaded sequences
[rala::Overlap::transmute] error: unequal lengths in sequence and overlap file for sequence with id 508204!

I am running RA on Debian based linux Workstation. I have no clue what is causing that bug. Thanks for any hints. Cheers, Julien

Nov 22 '18 15:11 bbalog87

Hi Julien, please run the following commands from ra root folder and paste the output here:

cd vendor/rala
git status
git log

Best regards, Robert

Nov 22 '18 16:11 rvaser

Hi @rvaser Thanks for replying.

git status

On branch master
Your branch is up to date with 'origin/master'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)
  (commit or discard the untracked or modified content in submodules)

	modified:   ../../../vendor/rala (modified content, untracked content)

no changes added to commit (use "git add" and/or "git commit -a")

git log

Author: rvaser <[email protected]>
Date:   Sat Nov 17 09:10:56 2018 +0100

   Fixed illumina polishing extension

commit 20a39156f7efb59de36616e6f0314202b287b804
Author: rvaser <[email protected]>
Date:   Sat Nov 17 00:50:22 2018 +0100

   Updated rala

commit ca71a013084ed41f3ecc2ca3918be5f5a2f482d9
Author: rvaser <[email protected]>
Date:   Thu Nov 8 11:28:39 2018 +0100

   Updated rala

Nov 22 '18 16:11 bbalog87

Please run head -n 1016410 zander_pooled.subreads.renamed.fasta | tail -n 2 and paste the sequence causing the problem here.

Nov 22 '18 17:11 rvaser

Please also run git log from ra/vendor/rala.

Nov 23 '18 10:11 rvaser

> head -n 1016410 zander_pooled.subreads.renamed.fasta | tail -n 2 just output this: AGAGCTCTGTCCACGTCAATATCGCTCCTAATTTACATTTCACCTTATTTAAACGCTTTT GGTGTTTTGTGTAGGTTATTCCCTCTTTCAGTAATGTAAGCATTAACTGCGATGCCGCTA

Why not -n 508204 instead of -n 1016410? Best, Julien

> git log in ra/vendor/rala outputs this:

Author: rvaser <[email protected]>
Date:   Fri Nov 16 23:29:36 2018 +0100

    Reverted overlap trimming

commit 9cfb0076a20e36591f4329512a8f9f16e0b9eca1
Author: rvaser <[email protected]>
Date:   Fri Nov 16 19:58:07 2018 +0100

    Reverted some changes

Nov 23 '18 12:11 bbalog87

As each sequence is usually represented with two lines in FASTA format (one for header and one for the data), we are using 2 * read_id + 2 to obtain the sequence with head and tail. Nonetheless, your file seems to be wrapped to 60 characters per line which makes it a bit harder to extract the sequence by id. The error you have encountered is that my parser has obtained a sequences with length that differs from the length stored in the .paf file generated from minimap2, which is quite odd as I have tested this feature. I'll check again.

P.S. I truncated the outputs in above messages so I do not have to scroll too much :D

Nov 23 '18 14:11 rvaser

Wrapped files are forking for me. Do you mind sending me your file so I can investigate further?

Nov 23 '18 14:11 rvaser

Okay, I understand the issue. But how can I fix it? unfortunately I cannot send some sample data because the files are about ~950GB for illumina PE data and ~120GB for PacBio data, since RA expects two types of data: TSG and NGS data. PS: My PE-data are in file.R1.fq and file.R2 data. I have poooled them in one file before passing them to RA. Maybe, minimap2 has problems in mapping such pooled reads?

Nov 23 '18 15:11 bbalog87

I understand, that is a lot of data. The error occurred in the layout step where only TGS data is used. Can you please provide me with the command you run?

Nov 23 '18 15:11 rvaser

Ok, here is the command I ran

$: ./ra -t 100 -x pb /path/to/pacbioData/SMRT_Sequel-pooled/zander_pooled.subreads.renamed.fasta /path/to/pe-reads/pooled/in/one/singleFile/HiSeqX/pe.renamed.fastq

Nov 23 '18 16:11 bbalog87

That looks fine. I am not sure what could be wrong. By the extension .renamed.fasta, did you perhaps rename your reads with ids? If so can you grep the reported sequence from the file?

Nov 23 '18 16:11 rvaser

Yes, I renamed my reads with Ids like:

1 READ1 2 READ2 ...

Nov 26 '18 14:11 bbalog87

Sorry. No reads were renamed, just the headers in NGS reads file

Nov 26 '18 14:11 bbalog87

Can you please replace this lines https://github.com/rvaser/rala/blob/iterative/src/overlap.cpp#L55-L57 (located at vendor/rala/src/overlap.cpp) with:

fprintf(stderr, "[rala::Overlap::transmute] error: "
    "unequal lengths in sequence and overlap file for sequence with id %lu (%u - %zu)!\n",
    a_id_, a_length_, piles[a_id_]->data().size());

and https://github.com/rvaser/rala/blob/iterative/src/overlap.cpp#L74-L76 (in the same file) with:

fprintf(stderr, "[rala::Overlap::transmute] error: "
    "unequal lengths in sequence and overlap file for sequence with id %lu (%u - %zu)!\n",
    b_id_, b_length_, piles[b_id_]->data().size());

Run make afterwards and paste the log here again please.

Nov 26 '18 18:11 rvaser

Okay, I have just copied the whole log file, since I do not know in which part you are interested to.

Author: rvaser <[email protected]>
Date:   Fri Nov 16 23:29:36 2018 +0100

    Reverted overlap trimming

commit 9cfb0076a20e36591f4329512a8f9f16e0b9eca1
Author: rvaser <[email protected]>
Date:   Fri Nov 16 19:58:07 2018 +0100

Nov 27 '18 16:11 bbalog87

By log I meant that you run ra again with your data and paste here everything it outputs until the error occurs again :)

Nov 27 '18 17:11 rvaser

ok, sorry. It will take a while to run. I will post here tomorrow.

Nov 27 '18 17:11 bbalog87

Hi @rvaser ,

can I send you the whole log file by email? The same error has occurred.

[M::main] CMD: /Zander_Project/pipelines/assembly/ra/vendor/minimap2/minimap2 -t 130 -x ava-pb /Zander-Project_2017_2020/RawData/Zander_SMRT_Sequel/SMRT_Sequel-Pooled/zander_pooled.subreads.renamed.fasta /Zander-Project_2017_2020/RawData/Zander_SMRT_Sequel/SMRT_Sequel-Pooled/zander_pooled.subreads.renamed.fasta
[M::main] Real time: 11905.419 sec; CPU: 882940.242 sec
[Ra::run] preconstruction stage
[rala::Graph::initialize] loaded sequences
[rala::Overlap::transmute] error: unequal lengths in sequence and overlap file for sequence with id 508204 (23914 - 23317)!

Nov 28 '18 08:11 bbalog87

Please do so :)

Nov 28 '18 09:11 rvaser

Please run awk 'NR==508206' RS='>' zander_pooled.subreads.renamed.fasta > read_of_interest.fasta and send me the file via email.

Nov 28 '18 15:11 rvaser

@rvaser I have just sent the read of interest via email.

Nov 28 '18 18:11 bbalog87

Please run grep "m54229_180630_022602" zander_pooled.subreads.renamed.fasta and make sure there is only one match.

The read length that my parser obtains matches to the one in file (it even has the length in its name m54229_180630_022602 74908655 0_23317) . So I guess that there might be two sequences with the same name. Or something went wrong in minimap2 parsing which I'll look into if there is only one match in the above grep command.

Nov 29 '18 10:11 rvaser

Hi @rvaser,

It can't be only one match, since all reads (headers) in zander_pooled.subreads.renamed.fasta start with "m54229_180630_022602"

Nov 29 '18 18:11 bbalog87

That is the problem. I only take the read name up to the first white space to "save" memory, nonetheless I think it is expected that this part of the name should be unique. You can easily circumvent this problem by replacing all spaces with underscores in your read file. Something like this sed 's/ /_/g' zander_pooled.subreads.renamed.fasta > zander_pooled.subreads.fixed.fasta.

Nov 29 '18 18:11 rvaser

Ok Thank you. Your asembler (RA) doesn't get information from the reads name (fasta-headers)? Because some do, and renaming fasta-headers causes problems. I will rename as suggested and re-run RA-assembler. Thank you very much. I hope it fixes it.

Nov 29 '18 18:11 bbalog87

Headers are discarded immediately and replaced with ids. I hope that this fixes it as well :D

Nov 29 '18 18:11 rvaser

Okay, then it doesn't matter whatever is in the fast-headers :) I would let you know.

Nov 29 '18 18:11 bbalog87

PS: Do I need to reset the part of the code in rala, that I edited?

Nov 29 '18 18:11 bbalog87

Nope, it will only trigger if the same error occurs.

Nov 29 '18 18:11 rvaser

Okay. :+1:

Nov 29 '18 18:11 bbalog87

Hi @rvaser minimap2 stage finished without error. rala is now running. I think the error was triggered by non-unique reads identifiers in the fasta file. PS: Do you have an estimation of the expected run time of Ra depending on the data?

Dec 03 '18 15:12 bbalog87

Hi Julien, that depends on how repetitive is your genome. Depending on the size of your data and if you don't have anything plant related, I would say under a week. I hope you have enough RAM as you have 1TB Illumina reads :D

Best regards, Robert

Dec 03 '18 15:12 rvaser

Hi Robert, I have 1.5 TB RAM and I'm using 40 cpus. Illunima PE-data : ~ 950 GB PacBio data : ~120 GB I hope the RAM will not crash :) The genome is not so repetitive. (~12% repeat content, 0.1% hetorozygosity)

Best, Julien

Dec 03 '18 15:12 bbalog87

That will be a bit tight because overhead for Illumina data equals 0.5 times the size of the FASTQ file. You might want to downsample it a bit. How much is the coverage if you have freaking ~500GB of sequences (i.e. how large is the genome you are assembling)?

Dec 03 '18 15:12 rvaser

Actuallly, the illumina data are in fastq. Thus, the raw reads size is ~50% of the fastq (= ~450 GB). The theoretical (k-mer profile) mean coverage is ~140X. I have not computed yet the real sequencing coverage, since the genome size is unknown. But I expect a genome size < 2 Gb

Dec 03 '18 15:12 bbalog87

What is the most RAM-consuming step in the RA pipeline?

I think, it should be racon-stage. Currently, rala has a peak RAM consumption of only ~140 GB.

Dec 03 '18 15:12 bbalog87

Racon, as it loads all reads into memory. The problem will be the Illumina polishing step. You can try running ra without Illumina data and try to polish the assembly afterwards with Illumina and racon. If the program crashes, all intermediate files will be deleted, including the PB polished assembly.

Dec 03 '18 15:12 rvaser

Or you can let it run, be sneaky and copy the intermediate files while it polishes with Illumina data :)

Dec 03 '18 15:12 rvaser

Okay, this is a great suggestion :+1: . If ra only uses illumina data for polishing (not for consensus), then it also makes sense to run ra without NGS data. I can polish the PB contig afterwards using racon, as you suggested.

Dec 03 '18 15:12 bbalog87

Hi Robert,

As expected (and feared) ra has failed because of RAM overload :( I think, our data load and computational resources are just not suitable to run ra in NGS+TGS setup. RA can run assembly with only TGS, right? Maybe I should try running ra just with PB-data.

Best, Julien

Dec 06 '18 11:12 bbalog87

Hi Julien, as I thought you have unusual amount of Illumina data. You can of course run ra with PB only data. Later if you want to you can subsample the Illumina reads for manual polishing with racon.

Best regards, Robert

Dec 06 '18 11:12 rvaser

But according to your workflow description in github, even, with only TGS data, racon is ran twice over the preliminary PB only-based assembly. By manual polishing, you probably mean, polishing the contigs created by ra using a subset o high quality (e.g. Q>35) illumina reads. I will definitively try this approach.

Thanks for the kind overall support for this issue.

Dec 06 '18 11:12 bbalog87

Racon is run twice with TGS data, and once with NGS data. The NGS polishing is the part you will do manually due to memory shortage, i.e. polish the outputted ra contigs with subset of Illumina reads.

Dec 06 '18 11:12 rvaser

Hi there, sorry to reawaken this, but it seems I'm having the same problem with ra. Minimap2 runs just fine, creates an overlaps.paf, rala begins and creates an empty preconstruction.fasta file... end then the job stops, the working directory is emptied, and this gets printed to stdout:

[M::main] Version: 2.14-r892-dirty [M::main] CMD: /nfs/research1/marioni/claumer/ra/vendor/minimap2/minimap2 -t 40 -x ava-ont Lm10_151218_renamed.fastq Lm10_151218_renamed.fastq [M::main] Real time: 3842.330 sec; CPU: 57219.102 sec; Peak RSS: 57.284 GB [rala::Graph::initialize] loaded sequences [rala::Overlap::transmute] error: unequal lengths in sequence and overlap file for sequence with id 2345870! [Ra::run] preconstruction overlap stage [Ra::run] preconstruction stage [Ra::exit] warning: unable to clean work directory!

I'm calling ra like this:

../ra/build/bin/ra -t 40 -u -x ont Lm10_151218_renamed.fastq

And I do not think that it's a problem with unique headers as some of the messages above seemed to indicate... I saw the same problem both on my raw dataset and on a version where I used fastx_renamer to give each read a number in sequence.

I am running this on an amplified dataset: a longish-insert library that I have used adapter-driven PCR to amplify, following an optimised custom protocol that should minimize coverage bias, although it's possible that some exists. Because of the nature of PCR which is selective for shorter inserts, the results tend to look like this:

6823466 reads with read length N50 of 2643bp and maximum of 14993bp 8191293 reads with read length N50 of 1513bp and maximum of 3497bp

I'm having a bear of a time getting canu to finish on these (lots of relatively short reads causes the all-by-all mhap steps to take forever with it), so I was really excited when I came across ra which seems like it should have some considerable advantages over e.g. miniasm. But it seems that as written there's a difficulty parsing the paf...

Happy to provide any materials you might need to help debug this, log files, raw reads, etc.

Jan 10 '19 13:01 claumer

Hi Christopher, have you checked that the identifiers are all different? You can send me the data via mail if you don't mind so I can investigate locally.

Best regards, Robert

Jan 10 '19 13:01 rvaser

Hi Robert,

I've just sent you a link by email to download these data and play around yourself. I find with this dataset on my local cluster that minimap2 needs a minimum of 60 GB to work with the 40 threads I've been assigning it.

Pretty sure the headers all should be different after running fastx_renamer. So I think it should be some other kind of problem...

Grazie mille, Chris Laumer

Jan 10 '19 17:01 claumer