Extract exon sequences based on GFF3 and FASTA

Open marcelolaia opened this issue 5 years ago • 2 comments

Hi, after a long search online I found this awesome script, and it works nicely. However, I need to extract all exon sequences from a genome based on a GFF3 file and a FASTA file. Please find attached a sample GFF3 file. From that file I need to extract sequences like these:

>Eucgr.A00001.1.v2.0.exon.1
ACTGTGACA......
>Eucgr.A00001.1.v2.0.exon.2
ACTGTGACA......
>Eucgr.A00001.1.v2.0.exon.3
ACTGTGACA......
(...)
>Eucgr.A00001.1.v2.0.exon.12
ACTGTGACA......
(...)

Could you help me? Thank you so much!

sample_GFF3_tsv.txt

marcelolaia avatar Apr 23 '20 23:04 marcelolaia

You might be looking for the gff2fasta.pl script located here: https://github.com/ISUGIFsingularity/utilities/tree/master/utilities Please let me know if that is the case.

isugif avatar Apr 24 '20 00:04 isugif

On 23/04/20 at 05:29, Andrew Severin wrote:

You might be looking for the gff2fasta.pl script located here: https://github.com/ISUGIFsingularity/utilities/tree/master/utilities Please let me know if that is the case.

Hi Andrew, thank you for the prompt reply.

I found that script on the Biostars forum. I ran it and got the output files. However, the sequences in output.exon.fasta [1] don't match the exons in the GFF3 [2].

Please, could you help me?

For example:

The GFF3 file contains these exon lines:

Chr01 phytozomev10 exon 2787014 2787767 . - . ID=Eucgr.A00001.1.v2.0.exon.12;Parent=Eucgr.A00001.1.v2.0;pacid=32049109
Chr01 phytozomev10 exon 2787803 2787834 . - . ID=Eucgr.A00001.1.v2.0.exon.11;Parent=Eucgr.A00001.1.v2.0;pacid=32049109
Chr01 phytozomev10 exon 2788190 2788300 . - . ID=Eucgr.A00001.1.v2.0.exon.10;Parent=Eucgr.A00001.1.v2.0;pacid=32049109
Chr01 phytozomev10 exon 2789313 2789399 . - . ID=Eucgr.A00001.1.v2.0.exon.9;Parent=Eucgr.A00001.1.v2.0;pacid=32049109
Chr01 phytozomev10 exon 2789765 2789884 . - . ID=Eucgr.A00001.1.v2.0.exon.8;Parent=Eucgr.A00001.1.v2.0;pacid=32049109
Chr01 phytozomev10 exon 2789985 2790162 . - . ID=Eucgr.A00001.1.v2.0.exon.7;Parent=Eucgr.A00001.1.v2.0;pacid=32049109
Chr01 phytozomev10 exon 2790477 2790694 . - . ID=Eucgr.A00001.1.v2.0.exon.6;Parent=Eucgr.A00001.1.v2.0;pacid=32049109
Chr01 phytozomev10 exon 2790774 2790880 . - . ID=Eucgr.A00001.1.v2.0.exon.5;Parent=Eucgr.A00001.1.v2.0;pacid=32049109
Chr01 phytozomev10 exon 2790969 2791089 . - . ID=Eucgr.A00001.1.v2.0.exon.4;Parent=Eucgr.A00001.1.v2.0;pacid=32049109
Chr01 phytozomev10 exon 2791278 2791373 . - . ID=Eucgr.A00001.1.v2.0.exon.3;Parent=Eucgr.A00001.1.v2.0;pacid=32049109
Chr01 phytozomev10 exon 2791468 2791696 . - . ID=Eucgr.A00001.1.v2.0.exon.2;Parent=Eucgr.A00001.1.v2.0;pacid=32049109
Chr01 phytozomev10 exon 2792210 2792340 . - . ID=Eucgr.A00001.1.v2.0.exon.1;Parent=Eucgr.A00001.1.v2.0;pacid=32049109

However, the gff2fasta.pl script retrieves only one exon, instead of 12, for gene Eucgr.A00001.1.v2.0.

Thank you so much!

  1. https://www.dropbox.com/s/dicbsvo7hsuznq5/output.exon.fasta?dl=0

  2. https://www.dropbox.com/s/b8ge28sa6nwcuhl/Egrandis_297_v2.0.gene_exons.gff3?dl=0

  3. Reference genome (~700 MB) https://www.dropbox.com/s/4p0mxqak9erjil8/Egrandis_297_v2.0.fa?dl=0

-- Laia, ML https://publons.com/researcher/1755871/ http://orcid.org/0000-0001-6366-4558

marcelolaia avatar Apr 24 '20 01:04 marcelolaia
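
For reference, the output requested in this thread (one FASTA record per GFF3 exon line, named by the exon's ID attribute) can be sketched in plain Python. This is only an illustrative sketch, not how gff2fasta.pl works. It assumes the GFF3 is tab-separated with 1-based inclusive coordinates, that the first word of each FASTA header matches the GFF3 sequence name (e.g. Chr01), that minus-strand exons should be reverse-complemented, and that the whole genome fits in memory. The function names `read_fasta` and `exon_records` and the toy data are hypothetical.

```python
import io

# Complement table used to reverse-complement minus-strand exons.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_fasta(handle):
    """Return {name: sequence}, keyed by the first word of each header."""
    seqs, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def exon_records(gff_handle, genome):
    """Yield (exon_id, sequence) for every 'exon' line of a GFF3 file."""
    for line in gff_handle:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives, and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "exon":
            continue
        seqid, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        # GFF3 is 1-based and inclusive; Python slices are 0-based, end-exclusive.
        seq = genome[seqid][start - 1:end]
        if strand == "-":
            seq = seq.translate(_COMPLEMENT)[::-1]
        # Fall back to coordinates if an exon line has no ID attribute.
        yield attrs.get("ID", f"{seqid}:{start}-{end}"), seq


# Toy data (made up for illustration, not taken from the Eucalyptus files).
toy_fasta = io.StringIO(">Chr01 toy chromosome\nAACCG\nGTTAC\n")
toy_gff = io.StringIO(
    "##gff-version 3\n"
    "Chr01\ttoy\texon\t2\t5\t.\t+\t.\tID=ex.1;Parent=tx.1\n"
    "Chr01\ttoy\texon\t7\t10\t.\t-\t.\tID=ex.2;Parent=tx.1\n"
)
genome = read_fasta(toy_fasta)
records = list(exon_records(toy_gff, genome))
for exon_id, seq in records:
    print(f">{exon_id}\n{seq}")
```

With real files, the two `io.StringIO` objects would be replaced by `open(...)` handles on the genome FASTA and the GFF3. Because every exon line is emitted independently, a transcript with 12 exon lines yields 12 records.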