How to get genes from pycno genome?
want to be able to align sea star coelomocyte RNA seq data to the pycno genome.
The files provided that I have access to are of transcripts. Wiki page of the key files I have here
Is there a way to get the genes? I think @sr320 said that you did things with oyster stuff recently related to this?
Can you show head of alignment/ Deseq output so we know what to match... and go ahead and add head / description of what genome files are available here.
(Updated: no need for code)
Also, can you explain why you need the genes if you want to align RNA-seq data to genome? You have the RNA-seq data and you have the genome. So, you should be able to do the alignments, right?
The GTF file appears to have gene annotations, so you can extract genes from that if you need their positions. Are you asking for code on how to accomplish this task?
I think @kubu4 is on base with gtf idea.
I think another way to frame issue is: I have this count matrix file from hisat/ deseq output file but I am having trouble finding annotations associated with the iDs?
@grace-ac Is this accurate?
head of Deseq output
log2 fold change (MLE): condition exposed vs control
Wald test p-value: condition exposed vs control
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
g20392.t1 0.204575 1.475366 3.114837 0.473657 0.63574420 NA
g13761.t1 1.981251 -2.209603 2.073068 -1.065861 0.28648645 0.4346174
g4199.t4 84.995694 0.861167 0.825231 1.043546 0.29669543 0.4456007
g1746.t1 6.779187 0.807499 0.809692 0.997292 0.31862293 0.4689647
g16359.t1 99.172769 -0.907016 0.317377 -2.857846 0.00426528 0.0165324
g22398.t1 71.385761 -0.551942 0.308878 -1.786922 0.07395015 0.1585745
Genome files available:
P. helianthoides genome page from NCBI: here
The downloaded genome from the dataset lives on Raven in:
/home/shared/8TB_HDD_02/graceac9/GitHub/paper-pycno-sswd-2021-2022/data/ncbi_dataset/data/GCA_032158295.1/GCA_032158295.1_ASM3215829v1_genomic.fna and the head of the file looks like this:
>CM063243.1 Pycnopodia helianthoides isolate M0D057908R chromosome 1, whole genome shotgun sequence
gttaaaataatttgaatattgGATTAGTTTCAAACCCTCCCAGATCCTTCTAGATCCTCTTTGTTGAAATacaggattca
gaaggactgagggctgtaggcccaaatgacagttgcttatcactgggtcaggcatacagTAGGGCAGatgggtgatggga
P. helianthoides genome gene predictions from Dryad: here
Scheibelhut et. al 2023
though all of these files appear to be transcripts as each gene ends in a ".t#'.
augustus.hints.aa
>g3450.t1
MNYRTDDEIYEDEVDEETKLHHRIDERGVNGTDSRDEPTRVPPAKQLKIQAPVLASRLQL
EAVRRPHPPPLHPPQIHLYLEPHLEALQQRCDSYKGIVG*
augustus.hints.codingseq
>g3450.t1
ATGAACTACCGGACTGATGATGAAATATATGAAGATGAGGTGGACGAGGAGACAAAACTG
CATCACAGAATTGATGAGCGTGGAGTGAACGGAACTGATTCTCGAGATGAACCAACAAGA
augustus.hints.gtf
pycn_heli.0392 AUGUSTUS gene 1 4781 0.95 + . g1
pycn_heli.0392 AUGUSTUS transcript 1 4781 0.95 + g1.t1
pycn_heli.0392 AUGUSTUS intron 1 4506 0.95 + . transcript_id "g1.t1"; gene_id "g1";
interproscan.gff3
##gff-version 3
##feature-ontology http://song.cvs.sourceforge.net/viewvc/song/ontology/sofa.obo?revision=1.269
##interproscan-version 5.61-93.0
interproscan.tsv
g3531.t1 e585d160537c159881dfb8cf27e10150 635 CDD cd13970ABC1_ADCK3 288 537 2.58694E-146 T 19-04-2023 IPR034646 ADCK3-like domain -
g3531.t1 e585d160537c159881dfb8cf27e10150 635 PANTHER PTHR43851 - 6 633 8.3E-243 T 19-04-2023 - -
g3531.t1 e585d160537c159881dfb8cf27e10150 635 MobiDBLite mobidb-lite consensus disorder prediction 151 176 - T 19-04-2023 - -
Also, can you explain why you need the genes if you want to align RNA-seq data to genome? You have the RNA-seq data and you have the genome. So, you should be able to do the alignments, right?
Yep! I have the alignment already - that first line in my initial comment on this issue shouldn't be there. Alignment to genome was done!
I think I want the genes to do the DEG work... because right now i'm doing differentially expressed transcripts I think
Somebody double check- but looks like your alignment count matrix IDs correspond to - augustus.hints.codingseq So these annotations would work…
But another question is - is this hisat out put based on genes or transcript level.
On Tue, Sep 24, 2024 at 2:50 PM Grace Crandall @.***> wrote:
head of Deseq output
log2 fold change (MLE): condition exposed vs control Wald test p-value: condition exposed vs control DataFrame with 6 rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj
g20392.t1 0.204575 1.475366 3.114837 0.473657 0.63574420 NA g13761.t1 1.981251 -2.209603 2.073068 -1.065861 0.28648645 0.4346174 g4199.t4 84.995694 0.861167 0.825231 1.043546 0.29669543 0.4456007 g1746.t1 6.779187 0.807499 0.809692 0.997292 0.31862293 0.4689647 g16359.t1 99.172769 -0.907016 0.317377 -2.857846 0.00426528 0.0165324 g22398.t1 71.385761 -0.551942 0.308878 -1.786922 0.07395015 0.1585745 Genome files available: P. helianthoides genome page from NCBI: here https://urldefense.com/v3/__https://www.ncbi.nlm.nih.gov/datasets/taxonomy/7614/__;!!K-Hz7m0Vt54!lYKax_GqjNhRQ-CmZ8av_pzCqCxRrxmAu6TY4DwiGNvPJNf24geVGwEYzXbbwuxbKPXTOcz3qabVHqnbZGHTjY8$
The downloaded genome from the dataset lives on Raven in:
/home/shared/8TB_HDD_02/graceac9/GitHub/paper-pycno-sswd-2021-2022/data/ncbi_dataset/data/GCA_032158295.1/GCA_032158295.1_ASM3215829v1_genomic.fna and the head of the file looks like this:
CM063243.1 Pycnopodia helianthoides isolate M0D057908R chromosome 1, whole genome shotgun sequence gttaaaataatttgaatattgGATTAGTTTCAAACCCTCCCAGATCCTTCTAGATCCTCTTTGTTGAAATacaggattca gaaggactgagggctgtaggcccaaatgacagttgcttatcactgggtcaggcatacagTAGGGCAGatgggtgatggga
P. helianthoides genome gene predictions from Dryad: here https://urldefense.com/v3/__https://datadryad.org/stash/dataset/doi:10.5061/dryad.51c59zwfd__;!!K-Hz7m0Vt54!lYKax_GqjNhRQ-CmZ8av_pzCqCxRrxmAu6TY4DwiGNvPJNf24geVGwEYzXbbwuxbKPXTOcz3qabVHqnbsgfLDbo$
Scheibelhut et. al 2023 though all of these files appear to be transcripts as each gene ends in a ".t#'. augustus.hints.aa
g3450.t1 MNYRTDDEIYEDEVDEETKLHHRIDERGVNGTDSRDEPTRVPPAKQLKIQAPVLASRLQL EAVRRPHPPPLHPPQIHLYLEPHLEALQQRCDSYKGIVG*
augustus.hints.codingseq
g3450.t1 ATGAACTACCGGACTGATGATGAAATATATGAAGATGAGGTGGACGAGGAGACAAAACTG CATCACAGAATTGATGAGCGTGGAGTGAACGGAACTGATTCTCGAGATGAACCAACAAGA
augustus.hints.gtf
pycn_heli.0392 AUGUSTUS gene 1 4781 0.95 + . g1 pycn_heli.0392 AUGUSTUS transcript 1 4781 0.95 + g1.t1 pycn_heli.0392 AUGUSTUS intron 1 4506 0.95 + . transcript_id "g1.t1"; gene_id "g1";
interproscan.gff3
##gff-version 3 ##feature-ontology http://song.cvs.sourceforge.net/viewvc/song/ontology/sofa.obo?revision=1.269 https://urldefense.com/v3/__http://song.cvs.sourceforge.net/viewvc/song/ontology/sofa.obo?revision=1.269__;!!K-Hz7m0Vt54!lYKax_GqjNhRQ-CmZ8av_pzCqCxRrxmAu6TY4DwiGNvPJNf24geVGwEYzXbbwuxbKPXTOcz3qabVHqnb2GcYRcU$ ##interproscan-version 5.61-93.0
interproscan.tsv
g3531.t1 e585d160537c159881dfb8cf27e10150 635 CDD cd13970ABC1_ADCK3 288 537 2.58694E-146 T 19-04-2023 IPR034646 ADCK3-like domain - g3531.t1 e585d160537c159881dfb8cf27e10150 635 PANTHER PTHR43851 - 6 633 8.3E-243 T 19-04-2023 - - g3531.t1 e585d160537c159881dfb8cf27e10150 635 MobiDBLite mobidb-lite consensus disorder prediction 151 176 - T 19-04-2023 - -
Also, can you explain why you need the genes if you want to align RNA-seq data to genome? You have the RNA-seq data and you have the genome. So, you should be able to do the alignments, right?
Yep! I have the alignment already - that first line in my initial comment on this issue shouldn't be there. Alignment to genome was done!
I think I want the genes to do the DEG work... because right now i'm doing differentially expressed transcripts I think
— Reply to this email directly, view it on GitHub https://urldefense.com/v3/__https://github.com/RobertsLab/resources/issues/1965*issuecomment-2372450244__;Iw!!K-Hz7m0Vt54!lYKax_GqjNhRQ-CmZ8av_pzCqCxRrxmAu6TY4DwiGNvPJNf24geVGwEYzXbbwuxbKPXTOcz3qabVHqnbP0OBDDU$, or unsubscribe https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/ABB4PN4ET63G2DDPOPUJTCTZYHNAZAVCNFSM6AAAAABOZFXBAWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNZSGQ2TAMRUGQ__;!!K-Hz7m0Vt54!lYKax_GqjNhRQ-CmZ8av_pzCqCxRrxmAu6TY4DwiGNvPJNf24geVGwEYzXbbwuxbKPXTOcz3qabVHqnb4Zyk1OY$ . You are receiving this because you were mentioned.Message ID: @.***>
@kubu4 yes asking for code on how to accomplish!
here's the top 10 lines head of the .gtf file:
pycn_heli.0392 AUGUSTUS gene 1 4781 0.95 + . g1
pycn_heli.0392 AUGUSTUS transcript 1 4781 0.95 + . g1.t1
pycn_heli.0392 AUGUSTUS intron 1 4506 0.95 + . transcript_id "g1.t1"; gene_id "g1";
pycn_heli.0392 AUGUSTUS CDS 4507 4781 0.95 + 2 transcript_id "g1.t1"; gene_id "g1";
pycn_heli.0392 AUGUSTUS exon 4507 4781 . + . transcript_id "g1.t1"; gene_id "g1";
pycn_heli.0392 AUGUSTUS stop_codon 4779 4781 . + 0 transcript_id "g1.t1"; gene_id "g1";
pycn_heli.0392 AUGUSTUS gene 16379 16612 0.78 - . g2
pycn_heli.0392 AUGUSTUS transcript 16379 16612 0.78 - . g2.t1
pycn_heli.0392 AUGUSTUS stop_codon 16379 16381 . - 0 transcript_id "g2.t1"; gene_id "g2";
pycn_heli.0392 AUGUSTUS CDS 16379 16612 0.78 - 0 transcript_id "g2.t1"; gene_id "g2";
@sr320 I think you're right that the DESeq2 output matches the augustus.hints.codingseq... but aren't those all transcripts and not genes because they end in .t#?
I think another way to frame issue is: I have this count matrix file from hisat/ deseq output file but I am having trouble finding annotations associated with the iDs?
And yes, correct!
I'm feeling unsure of what is right or not ...
Quick way to pull genes from GFF/GTF:
awk '$3 == "gene"' augustus.hints.gtf
Basically, that says, if the third field (i.e. column) is the word "gene", then print the entire line.
the suspense is killing me - where do things stand on this?
If there were only a set a side time of the week where science could be discussed to go over this.. maybe it could be an hour or more...
had this response drafted and never submitted comment!!! i'm working on notebook post now for what i'm doing - yesterday and today
Ok, thanks! that worked.
pycn_heli.0255 AUGUSTUS gene 20802 21077 0.95 - . g24176
pycn_heli.0255 AUGUSTUS gene 21134 22597 1 - . g24177
pycn_heli.0086 AUGUSTUS gene 11884 12165 0.99 + . g24178
pycn_heli.0086 AUGUSTUS gene 12367 14223 0.39 + . g24179
pycn_heli.0086 AUGUSTUS gene 18738 23684 0 - . g24180
pycn_heli.0086 AUGUSTUS gene 23819 24382 0.85 - . g24181
pycn_heli.0086 AUGUSTUS gene 29454 35433 0.6 - . g24182
pycn_heli.0086 AUGUSTUS gene 45428 49864 0.93 + . g24183
pycn_heli.0086 AUGUSTUS gene 53564 54061 0.86 + . g24184
@grace-ac
But another question is - is this hisat out put based on genes or transcript level?
R code: paper-pycno-sswd-2021-2022/code/15-hisat2-deseq2-summer2022.Rmd
it looks like transcript level
Lets start workflow with gene level analysis.
running today --> https://github.com/grace-ac/paper-pycno-sswd-2021-2022/blob/main/code/20-hisat2-genecount-matrices.Rmd
still running? does an issue persist?
it completed and then i got side-tracked by data entry.
i checked it and there were only 20 files that were converted to bam files... should be 32.
there are 32 sam files.
so i'm running that conversion chunk again... maybe it got interrupted somehow? i'll keep tabs on it and once it's done will post issue if something is wrong still
it looks like it's running fine right now
step of converting the sam files (n=32) to bam files not working... I re-ran it and it only converted 20 files.
Code: https://github.com/grace-ac/paper-pycno-sswd-2021-2022/blob/main/code/20-hisat2-genecount-matrices.Rmd
Repo: https://github.com/grace-ac/paper-pycno-sswd-2021-2022/tree/main
The corresponding SAM files are empty (and the resulting BAM files), so there's nothing to sort...
oop! makes perfect sense now. that issue with the sam files happened a while back and i thought i fixed it but i didn't check it officially. thanks for pointing it out!
for the purpose of this question, just sharing one DEG list:
Have DEG list from 2021 data set
head DEGlist_2021_exposedVcontrol.tab
baseMean log2FoldChange lfcSE stat pvalue padj
g21712 15506.8006860189 -0.686001394104519 0.279769093821729 -2.45202707966608 0.0142053971059541 0.04104326062556
g21713 931.361342578574 0.765760120416294 0.279988740054853 2.73496755714631 0.00623864245462093 0.0207814457271467
g21711 12883.0935253533 -0.709970512692711 0.228673239474894 -3.10473807220743 0.00190447593029036 0.00763473087082639
g21769 95.2608451733162 3.6958184007692 1.00771829984027 3.66751144774787 0.000244922588350467 0.00136039062124746
g15181 7362.04249036577 2.13117880973538 0.606155663157497 3.5158935885115 0.000438276606665117 0.00222160900619904
g15182 848.103634775757 1.44107523921733 0.603474139884921 2.38796519017722 0.0169419463881676 0.047565397521059
g15183 916.675431757199 -0.700558231713056 0.24491415606803 -2.86042359886482 0.00423075479942892 0.0149434418878582
g7651 625.045813674949 1.17515444248357 0.338850532449784 3.46806137203791 0.000524227551336494 0.00257724022366272
g7656 369.553524158447 1.34279754324671 0.359840419468955 3.73164733752917 0.000190231694582979 0.00109975735292238
Next step: Annotation.
I have this BLAST output from before https://github.com/grace-ac/paper-pycno-sswd-2021-2022/blob/main/analyses/16-blast-annotation/blast_out_sep.tab:
head blast_out_sep.tab
g3452.t1 sp Q9Y6A2 CP46A_HUMAN 36.842 380 201 6 73 1119 16 387 6.38e-73 238
g3453.t1 sp Q9Y6A2 CP46A_HUMAN 39.574 470 267 7 79 1470 20 478 2.82e-111 342
g3454.t1 sp Q9H4F8 SMOC1_HUMAN 41.026 156 80 3 43 504 21 166 3.34e-31 129
g3455.t1 sp Q9D844 DNJC4_MOUSE 35.028 177 100 3 100 618 37 202 1.38e-22 95.9
g3456.t1 sp P80018 GLBC_MOLAR 27.559 127 91 1 115 495 17 142 1.81e-08 54.7
g3457.t1 sp Q9NRC6 SPTN5_HUMAN 31.110 1649 1112 11 1 4926 1850 3481 0.0 758
g3458.t1 sp P16546 SPTN1_MOUSE 36.947 655 411 2 67 2028 300 953 1.10e-123 414
g3459.t1 sp Q00963 SPTCB_DROME 45.908 782 414 4 40 2361 36 816 0.0 691
g3460.t1 sp Q9WVK8 CP46A_MOUSE 38.248 468 266 9 103 1470 22 478 2.16e-106 329
g3461.t1 sp Q9Y6A2 CP46A_HUMAN 37.681 414 218 9 79 1254 22 417 2.44e-86 275
but as you can see, it isn't the genes, it's transcripts. I used:
head -3 ../data/augustus.hints.codingseq
>g3450.t1
ATGAACTACCGGACTGATGATGAAATATATGAAGATGAGGTGGACGAGGAGACAAAACTG
CATCACAGAATTGATGAGCGTGGAGTGAACGGAACTGATTCTCGAGATGAACCAACAAGA
So now I want to re-run BLAST with the genes, not the transcripts....
And I'm not sure which file to use.
Do I use the genome? It doesn't have the gene IDs like the DEG list above does:
head -3 ../data/ncbi_dataset/data/GCA_032158295.1/GCA_032158295.1_ASM3215829v1_genomic.fna
>CM063243.1 Pycnopodia helianthoides isolate M0D057908R chromosome 1, whole genome shotgun sequence
gttaaaataatttgaatattgGATTAGTTTCAAACCCTCCCAGATCCTTCTAGATCCTCTTTGTTGAAATacaggattca
gaaggactgagggctgtaggcccaaatgacagttgcttatcactgggtcaggcatacagTAGGGCAGatgggtgatggga
Are all sequences from blast ending in .t1?
Two approaches:
- Use existing BLAST output, but strip the
.t1from the first column and then sort for uniques just on the first column.
sed 's/\.t1//' blast_out_sep.tab | sort --unique -k1,1
This is probably fine, since it's unlikely that multiple transcripts for a given gene would get annotated as a different protein.
Then, join with DEG list.
- Use genes from DEG list to extract genes from GTF to then extract sequences from
GCA_032158295.1_ASM3215829v1_genomic.fna. Then, BLAST those gene sequences.
Ok!
I think I'll go for option 1. But in the meantime, I can also do option 2 and have the BLAST running while I continue with the list for option 1. Unless that seems unnecessary.
pretty sure everything is all good!
here's post for 2021 results
post for 2022 results coming later today
Nice post! Heat maps look nice!
Did notice this, though...
I got these numbers in excel…
:thinking:
:wink:
ugh i know... i'm the worst.