RA failed - unequal lengths in sequence and overlap file for sequence with id xxxxx
Hi, RA has failed with the following error message
unequal lengths in sequence and overlap file for sequence with id 508204 without any further log message.
Here are the last few lines of the log-file:
[M::main] CMD: /disk2/nguinkal/ra/vendor/minimap2/minimap2 -t 100 -x ava-pb /disk2/nguinkal/mnt/winpro/I3-NGS-Daten/NGS-Fisch/Zander-Project_2017_2020/RawData/Zander_SMRT_Sequel/SMRT_Sequel-Pooled/zander_pooled.subreads.renamed.fasta /disk2/nguinkal/mnt/winpro/I3-NGS-Daten/NGS-Fisch/Zander-Project_2017_2020/RawData/Zander_SMRT_Sequel/SMRT_Sequel-Pooled/zander_pooled.subreads.renamed.fasta
[M::main] Real time: 12065.368 sec; CPU: 868149.797 sec
[Ra::run] preconstruction stage
[rala::Graph::initialize] loaded sequences
[rala::Overlap::transmute] error: unequal lengths in sequence and overlap file for sequence with id 508204!
I am running RA on Debian based linux Workstation. I have no clue what is causing that bug. Thanks for any hints. Cheers, Julien
Hi Julien, please run the following commands from ra root folder and paste the output here:
cd vendor/rala
git status
git log
Best regards, Robert
Hi @rvaser Thanks for replying.
git status
On branch master
Your branch is up to date with 'origin/master'.
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
(commit or discard the untracked or modified content in submodules)
modified: ../../../vendor/rala (modified content, untracked content)
no changes added to commit (use "git add" and/or "git commit -a")
git log
Author: rvaser <[email protected]>
Date: Sat Nov 17 09:10:56 2018 +0100
Fixed illumina polishing extension
commit 20a39156f7efb59de36616e6f0314202b287b804
Author: rvaser <[email protected]>
Date: Sat Nov 17 00:50:22 2018 +0100
Updated rala
commit ca71a013084ed41f3ecc2ca3918be5f5a2f482d9
Author: rvaser <[email protected]>
Date: Thu Nov 8 11:28:39 2018 +0100
Updated rala
Please run head -n 1016410 zander_pooled.subreads.renamed.fasta | tail -n 2 and paste the sequence causing the problem here.
Please also run git log from ra/vendor/rala.
> head -n 1016410 zander_pooled.subreads.renamed.fasta | tail -n 2
just output this:
AGAGCTCTGTCCACGTCAATATCGCTCCTAATTTACATTTCACCTTATTTAAACGCTTTT
GGTGTTTTGTGTAGGTTATTCCCTCTTTCAGTAATGTAAGCATTAACTGCGATGCCGCTA
Why not -n 508204 instead of -n 1016410? Best, Julien
> git log in ra/vendor/rala
outputs this:
Author: rvaser <[email protected]>
Date: Fri Nov 16 23:29:36 2018 +0100
Reverted overlap trimming
commit 9cfb0076a20e36591f4329512a8f9f16e0b9eca1
Author: rvaser <[email protected]>
Date: Fri Nov 16 19:58:07 2018 +0100
Reverted some changes
As each sequence is usually represented with two lines in FASTA format (one for header and one for the data), we are using 2 * read_id + 2 to obtain the sequence with head and tail. Nonetheless, your file seems to be wrapped to 60 characters per line which makes it a bit harder to extract the sequence by id. The error you have encountered is that my parser has obtained a sequences with length that differs from the length stored in the .paf file generated from minimap2, which is quite odd as I have tested this feature. I'll check again.
P.S. I truncated the outputs in above messages so I do not have to scroll too much :D
Wrapped files are forking for me. Do you mind sending me your file so I can investigate further?
Okay, I understand the issue. But how can I fix it? unfortunately I cannot send some sample data because the files are about ~950GB for illumina PE data and ~120GB for PacBio data, since RA expects two types of data: TSG and NGS data. PS: My PE-data are in file.R1.fq and file.R2 data. I have poooled them in one file before passing them to RA. Maybe, minimap2 has problems in mapping such pooled reads?
I understand, that is a lot of data. The error occurred in the layout step where only TGS data is used. Can you please provide me with the command you run?
Ok, here is the command I ran
$: ./ra -t 100 -x pb /path/to/pacbioData/SMRT_Sequel-pooled/zander_pooled.subreads.renamed.fasta /path/to/pe-reads/pooled/in/one/singleFile/HiSeqX/pe.renamed.fastq
That looks fine. I am not sure what could be wrong. By the extension .renamed.fasta, did you perhaps rename your reads with ids? If so can you grep the reported sequence from the file?
Yes, I renamed my reads with Ids like:
1 READ1 2 READ2 ...
Sorry. No reads were renamed, just the headers in NGS reads file
Can you please replace this lines https://github.com/rvaser/rala/blob/iterative/src/overlap.cpp#L55-L57 (located at vendor/rala/src/overlap.cpp) with:
fprintf(stderr, "[rala::Overlap::transmute] error: "
"unequal lengths in sequence and overlap file for sequence with id %lu (%u - %zu)!\n",
a_id_, a_length_, piles[a_id_]->data().size());
and https://github.com/rvaser/rala/blob/iterative/src/overlap.cpp#L74-L76 (in the same file) with:
fprintf(stderr, "[rala::Overlap::transmute] error: "
"unequal lengths in sequence and overlap file for sequence with id %lu (%u - %zu)!\n",
b_id_, b_length_, piles[b_id_]->data().size());
Run make afterwards and paste the log here again please.
Okay, I have just copied the whole log file, since I do not know in which part you are interested to.
Author: rvaser <[email protected]>
Date: Fri Nov 16 23:29:36 2018 +0100
Reverted overlap trimming
commit 9cfb0076a20e36591f4329512a8f9f16e0b9eca1
Author: rvaser <[email protected]>
Date: Fri Nov 16 19:58:07 2018 +0100
By log I meant that you run ra again with your data and paste here everything it outputs until the error occurs again :)
ok, sorry. It will take a while to run. I will post here tomorrow.
Hi @rvaser ,
can I send you the whole log file by email? The same error has occurred.
[M::main] CMD: /Zander_Project/pipelines/assembly/ra/vendor/minimap2/minimap2 -t 130 -x ava-pb /Zander-Project_2017_2020/RawData/Zander_SMRT_Sequel/SMRT_Sequel-Pooled/zander_pooled.subreads.renamed.fasta /Zander-Project_2017_2020/RawData/Zander_SMRT_Sequel/SMRT_Sequel-Pooled/zander_pooled.subreads.renamed.fasta
[M::main] Real time: 11905.419 sec; CPU: 882940.242 sec
[Ra::run] preconstruction stage
[rala::Graph::initialize] loaded sequences
[rala::Overlap::transmute] error: unequal lengths in sequence and overlap file for sequence with id 508204 (23914 - 23317)!
Please do so :)
Please run awk 'NR==508206' RS='>' zander_pooled.subreads.renamed.fasta > read_of_interest.fasta and send me the file via email.
@rvaser I have just sent the read of interest via email.
Please run grep "m54229_180630_022602" zander_pooled.subreads.renamed.fasta and make sure there is only one match.
The read length that my parser obtains matches to the one in file (it even has the length in its name m54229_180630_022602 74908655 0_23317) . So I guess that there might be two sequences with the same name. Or something went wrong in minimap2 parsing which I'll look into if there is only one match in the above grep command.
Hi @rvaser,
It can't be only one match, since all reads (headers) in zander_pooled.subreads.renamed.fasta start with "m54229_180630_022602"
That is the problem. I only take the read name up to the first white space to "save" memory, nonetheless I think it is expected that this part of the name should be unique. You can easily circumvent this problem by replacing all spaces with underscores in your read file. Something like this sed 's/ /_/g' zander_pooled.subreads.renamed.fasta > zander_pooled.subreads.fixed.fasta.
Ok Thank you. Your asembler (RA) doesn't get information from the reads name (fasta-headers)? Because some do, and renaming fasta-headers causes problems. I will rename as suggested and re-run RA-assembler. Thank you very much. I hope it fixes it.
Headers are discarded immediately and replaced with ids. I hope that this fixes it as well :D
Okay, then it doesn't matter whatever is in the fast-headers :) I would let you know.
PS: Do I need to reset the part of the code in rala, that I edited?
Nope, it will only trigger if the same error occurs.
Okay. :+1:
Hi @rvaser minimap2 stage finished without error. rala is now running. I think the error was triggered by non-unique reads identifiers in the fasta file. PS: Do you have an estimation of the expected run time of Ra depending on the data?
Hi Julien, that depends on how repetitive is your genome. Depending on the size of your data and if you don't have anything plant related, I would say under a week. I hope you have enough RAM as you have 1TB Illumina reads :D
Best regards, Robert
Hi Robert, I have 1.5 TB RAM and I'm using 40 cpus. Illunima PE-data : ~ 950 GB PacBio data : ~120 GB I hope the RAM will not crash :) The genome is not so repetitive. (~12% repeat content, 0.1% hetorozygosity)
Best, Julien
That will be a bit tight because overhead for Illumina data equals 0.5 times the size of the FASTQ file. You might want to downsample it a bit. How much is the coverage if you have freaking ~500GB of sequences (i.e. how large is the genome you are assembling)?
Actuallly, the illumina data are in fastq. Thus, the raw reads size is ~50% of the fastq (= ~450 GB). The theoretical (k-mer profile) mean coverage is ~140X. I have not computed yet the real sequencing coverage, since the genome size is unknown. But I expect a genome size < 2 Gb
What is the most RAM-consuming step in the RA pipeline?
I think, it should be racon-stage. Currently, rala has a peak RAM consumption of only ~140 GB.
Racon, as it loads all reads into memory. The problem will be the Illumina polishing step. You can try running ra without Illumina data and try to polish the assembly afterwards with Illumina and racon. If the program crashes, all intermediate files will be deleted, including the PB polished assembly.
Or you can let it run, be sneaky and copy the intermediate files while it polishes with Illumina data :)
Okay, this is a great suggestion :+1: . If ra only uses illumina data for polishing (not for consensus), then it also makes sense to run ra without NGS data. I can polish the PB contig afterwards using racon, as you suggested.
Hi Robert,
As expected (and feared) ra has failed because of RAM overload :( I think, our data load and computational resources are just not suitable to run ra in NGS+TGS setup. RA can run assembly with only TGS, right? Maybe I should try running ra just with PB-data.
Best, Julien
Hi Julien, as I thought you have unusual amount of Illumina data. You can of course run ra with PB only data. Later if you want to you can subsample the Illumina reads for manual polishing with racon.
Best regards, Robert
But according to your workflow description in github, even, with only TGS data, racon is ran twice over the preliminary PB only-based assembly. By manual polishing, you probably mean, polishing the contigs created by ra using a subset o high quality (e.g. Q>35) illumina reads. I will definitively try this approach.
Thanks for the kind overall support for this issue.
Racon is run twice with TGS data, and once with NGS data. The NGS polishing is the part you will do manually due to memory shortage, i.e. polish the outputted ra contigs with subset of Illumina reads.
Hi there, sorry to reawaken this, but it seems I'm having the same problem with ra. Minimap2 runs just fine, creates an overlaps.paf, rala begins and creates an empty preconstruction.fasta file... end then the job stops, the working directory is emptied, and this gets printed to stdout:
[M::main] Version: 2.14-r892-dirty [M::main] CMD: /nfs/research1/marioni/claumer/ra/vendor/minimap2/minimap2 -t 40 -x ava-ont Lm10_151218_renamed.fastq Lm10_151218_renamed.fastq [M::main] Real time: 3842.330 sec; CPU: 57219.102 sec; Peak RSS: 57.284 GB [rala::Graph::initialize] loaded sequences [rala::Overlap::transmute] error: unequal lengths in sequence and overlap file for sequence with id 2345870! [Ra::run] preconstruction overlap stage [Ra::run] preconstruction stage [Ra::exit] warning: unable to clean work directory!
I'm calling ra like this:
../ra/build/bin/ra -t 40 -u -x ont Lm10_151218_renamed.fastq
And I do not think that it's a problem with unique headers as some of the messages above seemed to indicate... I saw the same problem both on my raw dataset and on a version where I used fastx_renamer to give each read a number in sequence.
I am running this on an amplified dataset: a longish-insert library that I have used adapter-driven PCR to amplify, following an optimised custom protocol that should minimize coverage bias, although it's possible that some exists. Because of the nature of PCR which is selective for shorter inserts, the results tend to look like this:
6823466 reads with read length N50 of 2643bp and maximum of 14993bp 8191293 reads with read length N50 of 1513bp and maximum of 3497bp
I'm having a bear of a time getting canu to finish on these (lots of relatively short reads causes the all-by-all mhap steps to take forever with it), so I was really excited when I came across ra which seems like it should have some considerable advantages over e.g. miniasm. But it seems that as written there's a difficulty parsing the paf...
Happy to provide any materials you might need to help debug this, log files, raw reads, etc.
Hi Christopher, have you checked that the identifiers are all different? You can send me the data via mail if you don't mind so I can investigate locally.
Best regards, Robert
Hi Robert,
I've just sent you a link by email to download these data and play around yourself. I find with this dataset on my local cluster that minimap2 needs a minimum of 60 GB to work with the 40 threads I've been assigning it.
Pretty sure the headers all should be different after running fastx_renamer. So I think it should be some other kind of problem...
Grazie mille, Chris Laumer