how to reduce run time
Hi,
I'm trying to predict target genes of lncRNAs with IntaRNA. I only want to know the target genes for each lncRNA and don't care about interaction details such as the position of the interaction, which mRNA region interacts with the lncRNA, and so on.
So here is what I do:
- prepare mRNA sequence with gff3 and genome.fa (target)
- prepare lncRNA sequence with gff3 and genome.fa (query)
- predict interaction with IntaRNA:
IntaRNA -q LncRNA.fa -t mRNA.fasta --out intarna_results.csv --outMode C --outOverlap N --threads 32
But this command takes a long time to run; it has been running for about 42 h so far. Can you give me some advice on how to achieve my goal and reduce the run time? Thanks!
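(For later: since only the target identities matter to me, I would collapse the CSV output to plain lncRNA-target pairs, roughly like this; the id1/id2 column names and the ';' separator are assumptions based on IntaRNA's default C-mode output:)

```python
import csv
import io

def target_pairs(csv_text, sep=";"):
    # Unique (lncRNA id, target id) pairs from IntaRNA C-mode CSV text;
    # column names id1 (target) and id2 (query) are assumed defaults.
    rows = csv.DictReader(io.StringIO(csv_text), delimiter=sep)
    return sorted({(r["id2"], r["id1"]) for r in rows})
```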
Best, sunshx
Hi sunshx,
IntaRNA was developed to investigate RNA-RNA interactions within a thermodynamically detailed model. This comes at the cost of some expensive computations, mainly the computation of accessibility profiles of the interacting RNAs.
I would guess that most of the runtime you observe is spent in computing these accessibility profiles before any interaction is predicted.
Possible workarounds:
- reduce the maximal length of considered interactions (--intLenMax=), which will also restrict the accessibility computation to regions of that maximal length
- since you are doing a target screen, you might resort to the IntaRNAsTar parameters or personality, which speed up computation too
- alternatively, the simpler helix-based computation of IntaRNAhelix might also speed things up (it can be combined with the IntaRNAsTar parameters mentioned above), but both mainly influence the cost of prediction, not the accessibility computation
- if you are doing multiple searches/runs against the same (long) RNAs, you might consider precomputing accessibilities and loading them from file
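For the last point, a rough sketch of how an IntaRNA call could reuse precomputed accessibility profiles instead of recomputing them (the --qAcc/--tAcc/--*AccFile flags are the ones relevant here; the file layout is an assumption):

```python
def intarna_cmd(query_fa, target_fa, q_acc_file, t_acc_file, out="results.csv"):
    # Build (not run) an IntaRNA call that loads precomputed accessibilities
    # from file ("P" mode) rather than recomputing them on every run.
    return ["IntaRNA", "-q", query_fa, "-t", target_fa,
            "--qAcc", "P", "--qAccFile", q_acc_file,
            "--tAcc", "P", "--tAccFile", t_acc_file,
            "--out", out, "--outMode", "C"]

cmd = intarna_cmd("lncRNA.fa", "mRNA.fasta", "acc/lnc_lunp", "acc/mrna_lunp")
```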
Hope this helps.
But in the end you are using a tool with a model for thermodynamic details that needs time to compute...
Best, Martin
Hi Martin,
Thanks for your advice. Here is what I do:
- first, computing accessibilities with RNAplfold:

fasta=mrna_lncrna.fasta
RNAplfold -u 100 < $fasta

This step takes about 6h for about 25000 sequences.
- then, predicting target genes with IntaRNA:
IntaRNA -q "GTAGTGGCCACAGCCTTACAGGCAGGCAG" -t "GGTACCAGAGCCAAGACCCTCGGCC" --out results.csv --intLenMax 29 --outMode C --outOverlap N -n 1 --outSep , --threads 1 --model=B --personality=IntaRNAsTar --qAcc P --tAcc P --qAccFile RNAplfold/ENSG00000000003_lunp --qId ENSG00000000003 --tAccFile RNAplfold/ENSG00000099869_lunp --tId ENSG00000099869
The qAccFile/tAccFile options take one file at a time, so I implement step 2 as an iteration:
import subprocess

for mrna_id, mrna_seq in mrna_seqs.items():
    for lnc_id, lnc_seq in lncrna_seqs.items():
        subprocess.run(["IntaRNA", "-q", lnc_seq, "-t", mrna_seq,
                        "--out", "results.csv", "--intLenMax", "29", "--outMode", "C",
                        "--outOverlap", "N", "-n", "1", "--outSep", ",", "--threads", "1",
                        "--model=B", "--personality=IntaRNAsTar",
                        "--qAcc", "P", f"--qAccFile=RNAplfold/{lnc_id}_lunp", f"--qId={lnc_id}",
                        "--tAcc", "P", f"--tAccFile=RNAplfold/{mrna_id}_lunp", f"--tId={mrna_id}"])
and it needs to compute about 20,000 (mRNA) * 5,000 (lncRNA) = 100,000,000 pairs in total. The problem is that a single mRNA iteration sometimes takes 10 h! I don't know whether I am doing something wrong.
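For scale, plugging in these numbers (taking the observed 10 h per mRNA as a worst case, not an average):

```python
n_mrna, n_lncrna = 20_000, 5_000
pairs = n_mrna * n_lncrna                # 100,000,000 query-target pairs
hours_per_mrna = 10                      # observed worst case for one mRNA
# serial worst case over all mRNAs, in years
worst_case_years = n_mrna * hours_per_mrna / (24 * 365)
print(pairs, round(worst_case_years, 1))  # ~22.8 years of serial compute
```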
And during running, a warning is sometimes generated:
# WARNING : Exception raised for : #thread 0 #target 0 #query 0 : std::bad_alloc
# WARNING : Exception raised : std::exception
==> Please report (including input) to the IntaRNA development team! Thanks!
In short, the time of computing accessibility profiles is acceptable, but the step of prediction requires a long time. Do you have some advice for this ?
Best, sunshx
Hi,
I don't think you are doing something wrong, but you are running an extremely big screen.
Guessing that your mRNAs and lncRNAs are long, even imposing maximal interaction length restrictions etc. on IntaRNA will cause a lot of computational work (as you experience).
Since the computation requires tabularization of local solutions for optimization, memory consumption grows strongly with sequence length. I think this causes the std::bad_alloc error, which is essentially a signal for "memory exhausted".
You might try going back to the "normal" prediction model --model=X instead of --model=B. I am not sure right now whether the latter is more memory demanding for long sequences...
A machine with more memory might be useful. If RAM is exhausted, most systems start "swapping" parts of the data to the hard drive, which dramatically slows down computation. So best double-check the memory consumption while running a single job on long sequences and see whether your memory is sufficient.
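One way to do that check is to run a single job through a small wrapper that reports the child process's peak memory (a sketch; note that ru_maxrss is reported in kB on Linux but in bytes on macOS):

```python
import resource
import subprocess
import sys

def peak_child_rss_mb(cmd):
    # Run cmd to completion, then report the peak resident set size
    # accumulated over finished child processes (kB on Linux -> MB here).
    subprocess.run(cmd, check=True)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss / 1024

# e.g. peak_child_rss_mb(["IntaRNA", "-q", "lncRNA.fa", "-t", "mRNA.fasta"])
```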
For such vast and computationally expensive screens, using a single computer is almost pointless. So you might want to check whether your university provides a compute cluster to run the jobs on.
In the end, you are using a tool built for high-detail analyses to perform a broad screen.
Maybe a tool like RIblast or RIsearch2 (https://rth.dk/resources/risearch/) might be better suited to your task, since these use simpler interaction models and sophisticated data structures for screens over long sequences. If interested, you could use IntaRNA in a second investigation layer to double-check the top results of one of these tools.
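If you go that two-layer route, the pre-filter output could be cut down to the top hits per lncRNA before the IntaRNA re-check, e.g. (a generic sketch; the column layout of the pre-filter tool's output is an assumption):

```python
from collections import defaultdict

def top_targets(rows, n=100):
    # rows: (query_id, target_id, energy) tuples from the fast pre-filter;
    # keep the n lowest-energy targets per query for the IntaRNA re-check.
    by_query = defaultdict(list)
    for query, target, energy in rows:
        by_query[query].append((energy, target))
    return {q: [t for _, t in sorted(hits)[:n]] for q, hits in by_query.items()}
```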
Hope that helps, best, Martin