Rascaf icon indicating copy to clipboard operation
Rascaf copied to clipboard

validate and filter the connections with databases

Open melop opened this issue 9 years ago • 11 comments

Hello,

   In the manual it mentioned an optional step of validating proposed connections by mapping to a public protein database. I think this is really a great idea, but I don't seem to find the script for doing that in the repository.
  Is this function currently implemented or do the users have to come up with their own solutions?

Best Regards, Rongfeng Cui Max-Planck Institute for the Biology of Ageing.

melop avatar Nov 28 '16 13:11 melop

Hi Rongfeng,

The feature is kind of there. To do this, you need to run "rascaf" with "-cs" option to output the sequence supporting the connection of two contigs.

As a result, you will have a file look like "*_cs.fa". Then you can blast those sequences against refseq_rna database. If you don't have the blast database locally, you can use the "-remote" option. For example: "blastn -db refseq_rna -task dc-megablast -query rascaf_cs.fa -out blast.out -outfmt 6 -remote -max_target_seqs 20" This procedure is quite slow and assumes you have blastn on your machine.

After that you can run the "EvaluateBlastResults.pl" script as: "perl ./EvaluateBlastResults.pl blast.out rascaf_cs.fa". The output will show the validation for the sequence from rascaf_cs.fa file. 1: valided. 0: uncertain. -1: misorientation -2: misorder -3: chimeric -4: no match in the blast database.

You can then adjust the corresponding connection in the rascaf.out file based on the validation result.

Thanks, Li

mourisl avatar Nov 28 '16 19:11 mourisl

Dear Li,

Thank you for the quick reply!

For the last step, is there a way to automate the process? I could write a script for it in case you don't already have one.

Another question is why not use tblastn to make the alignment at the amino acid level?

Thanks Ray On Nov 28, 2016 8:58 PM, "Li Song" [email protected] wrote:

Hi Rongfeng,

The feature is kind of there. To do this, you need to run "rascaf" with "-cs" option to output the sequence supporting the connection of two contigs.

As a result, you will have a file look like "*_cs.fa". Then you can blast those sequences against refseq_rna database. If you don't have the blast database locally, you can use the "-remote" option. For example: "blastn -db nt -task dc-megablast -query rascaf_cs.fa -out blast.out -outfmt 6 -remote -max_target_seqs 20" This procedure is quite slow and assumes you have blastn on your machine.

After that you can run the "EvaluateBlastResults.pl" script as: "perl ./EvaluateBlastResults.pl blast.out rascaf_cs.fa". The output will show the validation for the sequence from rascaf_cs.fa file. 1: valided. 0: uncertain. -1: misorientation -2: misorder -3: chimeric -4: no match in the blast database.

You can then adjust the corresponding connection in the rascaf.out file based on the validation result.

Thanks, Li

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mourisl/Rascaf/issues/2#issuecomment-263377020, or mute the thread https://github.com/notifications/unsubscribe-auth/ABlYwTB6WLOeovbvvSrLm55e6blEQ6Ayks5rCzJ3gaJpZM4K9yGt .

melop avatar Nov 28 '16 20:11 melop

Sorry I meant tblastx. On Nov 28, 2016 9:33 PM, "Ray Cui" [email protected] wrote:

Dear Li,

Thank you for the quick reply!

For the last step, is there a way to automate the process? I could write a script for it in case you don't already have one.

Another question is why not use tblastn to make the alignment at the amino acid level?

Thanks Ray On Nov 28, 2016 8:58 PM, "Li Song" [email protected] wrote:

Hi Rongfeng,

The feature is kind of there. To do this, you need to run "rascaf" with "-cs" option to output the sequence supporting the connection of two contigs.

As a result, you will have a file look like "*_cs.fa". Then you can blast those sequences against refseq_rna database. If you don't have the blast database locally, you can use the "-remote" option. For example: "blastn -db nt -task dc-megablast -query rascaf_cs.fa -out blast.out -outfmt 6 -remote -max_target_seqs 20" This procedure is quite slow and assumes you have blastn on your machine.

After that you can run the "EvaluateBlastResults.pl" script as: "perl ./EvaluateBlastResults.pl blast.out rascaf_cs.fa". The output will show the validation for the sequence from rascaf_cs.fa file. 1: valided. 0: uncertain. -1: misorientation -2: misorder -3: chimeric -4: no match in the blast database.

You can then adjust the corresponding connection in the rascaf.out file based on the validation result.

Thanks, Li

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mourisl/Rascaf/issues/2#issuecomment-263377020, or mute the thread https://github.com/notifications/unsubscribe-auth/ABlYwTB6WLOeovbvvSrLm55e6blEQ6Ayks5rCzJ3gaJpZM4K9yGt .

melop avatar Nov 28 '16 20:11 melop

For the last step, I'm sorry we don't have the script to remove the connections from rascaf.out now. If you can wait for 1 day, I can implement it now (this is on my todo list anyway).

I think tblastx would work as long as the output is the same as blastn (-outfmt 6).

Thanks, Li

mourisl avatar Nov 28 '16 20:11 mourisl

Hi Li,

    no problem, please take your time. Let me know when this script is

ready.

    If I want to use a protein database, like taxonomic specific

uniprot, is it also possible (with blastx -outfmt 6)? Best Regards, Ray

On Mon, Nov 28, 2016 at 9:45 PM, Li Song [email protected] wrote:

For the last step, I'm sorry we don't have the script to remove the connections from rascaf.out now. If you can wait for 1 day, I can implement it now (this is on my todo list anyway).

I think tblastx would work as long as the output is the same as blastn (-outfmt 6).

Thanks, Li

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mourisl/Rascaf/issues/2#issuecomment-263388955, or mute the thread https://github.com/notifications/unsubscribe-auth/ABlYwQmm2w5e4FakGKt0RJ55a1B9g0JYks5rCz2CgaJpZM4K9yGt .

melop avatar Nov 28 '16 20:11 melop

I just tested blastx with one sequence and I believe the script will work fine with blastx.

mourisl avatar Nov 28 '16 21:11 mourisl

Ok, thank you for testing!

Ray On Nov 28, 2016 10:26 PM, "Li Song" [email protected] wrote:

I just tested blastx with one sequence and I believe the script will work fine with blastx.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mourisl/Rascaf/issues/2#issuecomment-263399295, or mute the thread https://github.com/notifications/unsubscribe-auth/ABlYwUSjKUN_UczTq1RcLu22woQf-WTDks5rC0cBgaJpZM4K9yGt .

melop avatar Nov 28 '16 21:11 melop

I just uploaded the script "FilterRascafConnection.pl". You can use it after obtaining the validation results from blast. The commands is like: "perl FilterRascafConnection.pl blast_eval.out rascaf_cs.fa rascaf.out > rascaf_filter.out" .

Please let me know whether this works.

Thanks, Li

mourisl avatar Nov 29 '16 03:11 mourisl

Dear Li,

   thank you for your fast response. I will test this script.

   In your paper you have compared Rascaf to two other RNA scaffolders.

I am aware of an unpublished scaffolder called BESST_RNA. Out of curiosity I'm comparing results between them. So far it seems that BESST_RNA yielded a somewhat higher N50 (input N50: 815,196, BESST_RNA N50: 910,428, Rascaf N50: 848,421), and a slightly better BUSCO performance: (input complete BUSCOs: 4380, BESST_RNA: 4410, Rascaf: 4397). BESST_RNA requires merging of all bam files into a single input. Other than that I set a minimum link number to 4 for both programs. I'm wondering if you would have time to perform some systematic comparisons between the two regarding accuracy and performance.

Best Regards, Ray

On Tue, Nov 29, 2016 at 4:25 AM, Li Song [email protected] wrote:

I just uploaded the script "FilterRascafConnection.pl". You can use it after obtaining the validation results from blast. The commands is like: "perl FilterRascafConnection.pl blast_eval.out rascaf_cs.fa rascaf.out > rascaf_filter.out" .

Please let me know whether this works.

Thanks, Li

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mourisl/Rascaf/issues/2#issuecomment-263466183, or mute the thread https://github.com/notifications/unsubscribe-auth/ABlYwRQbMltGttxeXU1IsBA96NW-iom9ks5rC5s7gaJpZM4K9yGt .

melop avatar Nov 29 '16 09:11 melop

Thanks for letting me know the comparison. I tried to run BESST_RNA, but it crashed. It is likely a problem about python on our sever. I will try it on a different computer later.

mourisl avatar Nov 29 '16 22:11 mourisl

Dear Li,

    thanks for looking at this.

Best Regards, Ray

On Tue, Nov 29, 2016 at 11:15 PM, Li Song [email protected] wrote:

Thanks for letting me know the comparison. I tried to run BESST_RNA, but it crashed. It is likely a problem about python on our sever. I will try it on a different computer later.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mourisl/Rascaf/issues/2#issuecomment-263717325, or mute the thread https://github.com/notifications/unsubscribe-auth/ABlYwfEQgBgqBJKRiSYHENJhFFeXpGAWks5rDKP3gaJpZM4K9yGt .

melop avatar Nov 29 '16 22:11 melop