resources icon indicating copy to clipboard operation
resources copied to clipboard

How to get genes from pycno genome?

Open grace-ac opened this issue 1 year ago • 21 comments

want to be able to align sea star coelomocyte RNA seq data to the pycno genome.

The files provided that I have access to are of transcripts. Wiki page of the key files I have here

Is there a way to get the genes? I think @sr320 said that you did things with oyster stuff recently related to this?

grace-ac avatar Sep 24 '24 21:09 grace-ac

Can you show head of alignment/ Deseq output so we know what to match... and go ahead and add head / description of what genome files are available here.

(Updated: no need for code)

sr320 avatar Sep 24 '24 21:09 sr320

Also, can you explain why you need the genes if you want to align RNA-seq data to genome? You have the RNA-seq data and you have the genome. So, you should be able to do the alignments, right?

The GTF file appears to have gene annotations, so you can extract genes from that if you need their positions. Are you asking for code on how to accomplish this task?

kubu4 avatar Sep 24 '24 21:09 kubu4

I think @kubu4 is on base with gtf idea.

I think another way to frame issue is: I have this count matrix file from hisat/ deseq output file but I am having trouble finding annotations associated with the iDs?

@grace-ac Is this accurate?

sr320 avatar Sep 24 '24 21:09 sr320

head of Deseq output

log2 fold change (MLE): condition exposed vs control 
Wald test p-value: condition exposed vs control 
DataFrame with 6 rows and 6 columns
           baseMean log2FoldChange     lfcSE      stat     pvalue      padj
          <numeric>      <numeric> <numeric> <numeric>  <numeric> <numeric>
g20392.t1  0.204575       1.475366  3.114837  0.473657 0.63574420        NA
g13761.t1  1.981251      -2.209603  2.073068 -1.065861 0.28648645 0.4346174
g4199.t4  84.995694       0.861167  0.825231  1.043546 0.29669543 0.4456007
g1746.t1   6.779187       0.807499  0.809692  0.997292 0.31862293 0.4689647
g16359.t1 99.172769      -0.907016  0.317377 -2.857846 0.00426528 0.0165324
g22398.t1 71.385761      -0.551942  0.308878 -1.786922 0.07395015 0.1585745

Genome files available:

P. helianthoides genome page from NCBI: here

The downloaded genome from the dataset lives on Raven in:
/home/shared/8TB_HDD_02/graceac9/GitHub/paper-pycno-sswd-2021-2022/data/ncbi_dataset/data/GCA_032158295.1/GCA_032158295.1_ASM3215829v1_genomic.fna and the head of the file looks like this:

>CM063243.1 Pycnopodia helianthoides isolate M0D057908R chromosome 1, whole genome shotgun sequence
gttaaaataatttgaatattgGATTAGTTTCAAACCCTCCCAGATCCTTCTAGATCCTCTTTGTTGAAATacaggattca
gaaggactgagggctgtaggcccaaatgacagttgcttatcactgggtcaggcatacagTAGGGCAGatgggtgatggga

P. helianthoides genome gene predictions from Dryad: here

Scheibelhut et. al 2023
though all of these files appear to be transcripts as each gene ends in a ".t#'.

augustus.hints.aa

>g3450.t1
MNYRTDDEIYEDEVDEETKLHHRIDERGVNGTDSRDEPTRVPPAKQLKIQAPVLASRLQL
EAVRRPHPPPLHPPQIHLYLEPHLEALQQRCDSYKGIVG*

augustus.hints.codingseq

>g3450.t1
ATGAACTACCGGACTGATGATGAAATATATGAAGATGAGGTGGACGAGGAGACAAAACTG
CATCACAGAATTGATGAGCGTGGAGTGAACGGAACTGATTCTCGAGATGAACCAACAAGA

augustus.hints.gtf

pycn_heli.0392	AUGUSTUS	gene	1	4781	0.95	+	.	g1
pycn_heli.0392	AUGUSTUS	transcript	1	4781	0.95	+	g1.t1
pycn_heli.0392	AUGUSTUS	intron	1	4506	0.95	+	.	transcript_id "g1.t1"; gene_id "g1";

interproscan.gff3

##gff-version 3
##feature-ontology http://song.cvs.sourceforge.net/viewvc/song/ontology/sofa.obo?revision=1.269
##interproscan-version 5.61-93.0

interproscan.tsv

g3531.t1	e585d160537c159881dfb8cf27e10150	635	CDD	cd13970ABC1_ADCK3	288	537	2.58694E-146	T	19-04-2023	IPR034646	ADCK3-like domain	-
g3531.t1	e585d160537c159881dfb8cf27e10150	635	PANTHER	PTHR43851	-	6	633	8.3E-243	T	19-04-2023	-	-
g3531.t1	e585d160537c159881dfb8cf27e10150	635	MobiDBLite	mobidb-lite	consensus disorder prediction	151	176	-	T	19-04-2023	-	-

Also, can you explain why you need the genes if you want to align RNA-seq data to genome? You have the RNA-seq data and you have the genome. So, you should be able to do the alignments, right?

Yep! I have the alignment already - that first line in my initial comment on this issue shouldn't be there. Alignment to genome was done!

I think I want the genes to do the DEG work... because right now i'm doing differentially expressed transcripts I think

grace-ac avatar Sep 24 '24 21:09 grace-ac

Somebody double check- but looks like your alignment count matrix IDs correspond to - augustus.hints.codingseq So these annotations would work…

But another question is - is this hisat out put based on genes or transcript level.

On Tue, Sep 24, 2024 at 2:50 PM Grace Crandall @.***> wrote:

head of Deseq output

log2 fold change (MLE): condition exposed vs control Wald test p-value: condition exposed vs control DataFrame with 6 rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj g20392.t1 0.204575 1.475366 3.114837 0.473657 0.63574420 NA g13761.t1 1.981251 -2.209603 2.073068 -1.065861 0.28648645 0.4346174 g4199.t4 84.995694 0.861167 0.825231 1.043546 0.29669543 0.4456007 g1746.t1 6.779187 0.807499 0.809692 0.997292 0.31862293 0.4689647 g16359.t1 99.172769 -0.907016 0.317377 -2.857846 0.00426528 0.0165324 g22398.t1 71.385761 -0.551942 0.308878 -1.786922 0.07395015 0.1585745

Genome files available: P. helianthoides genome page from NCBI: here https://urldefense.com/v3/__https://www.ncbi.nlm.nih.gov/datasets/taxonomy/7614/__;!!K-Hz7m0Vt54!lYKax_GqjNhRQ-CmZ8av_pzCqCxRrxmAu6TY4DwiGNvPJNf24geVGwEYzXbbwuxbKPXTOcz3qabVHqnbZGHTjY8$

The downloaded genome from the dataset lives on Raven in:

/home/shared/8TB_HDD_02/graceac9/GitHub/paper-pycno-sswd-2021-2022/data/ncbi_dataset/data/GCA_032158295.1/GCA_032158295.1_ASM3215829v1_genomic.fna and the head of the file looks like this:

CM063243.1 Pycnopodia helianthoides isolate M0D057908R chromosome 1, whole genome shotgun sequence gttaaaataatttgaatattgGATTAGTTTCAAACCCTCCCAGATCCTTCTAGATCCTCTTTGTTGAAATacaggattca gaaggactgagggctgtaggcccaaatgacagttgcttatcactgggtcaggcatacagTAGGGCAGatgggtgatggga

P. helianthoides genome gene predictions from Dryad: here https://urldefense.com/v3/__https://datadryad.org/stash/dataset/doi:10.5061/dryad.51c59zwfd__;!!K-Hz7m0Vt54!lYKax_GqjNhRQ-CmZ8av_pzCqCxRrxmAu6TY4DwiGNvPJNf24geVGwEYzXbbwuxbKPXTOcz3qabVHqnbsgfLDbo$

Scheibelhut et. al 2023 though all of these files appear to be transcripts as each gene ends in a ".t#'. augustus.hints.aa

g3450.t1 MNYRTDDEIYEDEVDEETKLHHRIDERGVNGTDSRDEPTRVPPAKQLKIQAPVLASRLQL EAVRRPHPPPLHPPQIHLYLEPHLEALQQRCDSYKGIVG*

augustus.hints.codingseq

g3450.t1 ATGAACTACCGGACTGATGATGAAATATATGAAGATGAGGTGGACGAGGAGACAAAACTG CATCACAGAATTGATGAGCGTGGAGTGAACGGAACTGATTCTCGAGATGAACCAACAAGA

augustus.hints.gtf

pycn_heli.0392 AUGUSTUS gene 1 4781 0.95 + . g1 pycn_heli.0392 AUGUSTUS transcript 1 4781 0.95 + g1.t1 pycn_heli.0392 AUGUSTUS intron 1 4506 0.95 + . transcript_id "g1.t1"; gene_id "g1";

interproscan.gff3

##gff-version 3 ##feature-ontology http://song.cvs.sourceforge.net/viewvc/song/ontology/sofa.obo?revision=1.269 https://urldefense.com/v3/__http://song.cvs.sourceforge.net/viewvc/song/ontology/sofa.obo?revision=1.269__;!!K-Hz7m0Vt54!lYKax_GqjNhRQ-CmZ8av_pzCqCxRrxmAu6TY4DwiGNvPJNf24geVGwEYzXbbwuxbKPXTOcz3qabVHqnb2GcYRcU$ ##interproscan-version 5.61-93.0

interproscan.tsv

g3531.t1 e585d160537c159881dfb8cf27e10150 635 CDD cd13970ABC1_ADCK3 288 537 2.58694E-146 T 19-04-2023 IPR034646 ADCK3-like domain - g3531.t1 e585d160537c159881dfb8cf27e10150 635 PANTHER PTHR43851 - 6 633 8.3E-243 T 19-04-2023 - - g3531.t1 e585d160537c159881dfb8cf27e10150 635 MobiDBLite mobidb-lite consensus disorder prediction 151 176 - T 19-04-2023 - -

Also, can you explain why you need the genes if you want to align RNA-seq data to genome? You have the RNA-seq data and you have the genome. So, you should be able to do the alignments, right?

Yep! I have the alignment already - that first line in my initial comment on this issue shouldn't be there. Alignment to genome was done!

I think I want the genes to do the DEG work... because right now i'm doing differentially expressed transcripts I think

— Reply to this email directly, view it on GitHub https://urldefense.com/v3/__https://github.com/RobertsLab/resources/issues/1965*issuecomment-2372450244__;Iw!!K-Hz7m0Vt54!lYKax_GqjNhRQ-CmZ8av_pzCqCxRrxmAu6TY4DwiGNvPJNf24geVGwEYzXbbwuxbKPXTOcz3qabVHqnbP0OBDDU$, or unsubscribe https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/ABB4PN4ET63G2DDPOPUJTCTZYHNAZAVCNFSM6AAAAABOZFXBAWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNZSGQ2TAMRUGQ__;!!K-Hz7m0Vt54!lYKax_GqjNhRQ-CmZ8av_pzCqCxRrxmAu6TY4DwiGNvPJNf24geVGwEYzXbbwuxbKPXTOcz3qabVHqnb4Zyk1OY$ . You are receiving this because you were mentioned.Message ID: @.***>

sr320 avatar Sep 24 '24 21:09 sr320

@kubu4 yes asking for code on how to accomplish!

here's the top 10 lines head of the .gtf file:

pycn_heli.0392	AUGUSTUS	gene	1	4781	0.95	+	.	g1
pycn_heli.0392	AUGUSTUS	transcript	1	4781	0.95	+	.	g1.t1
pycn_heli.0392	AUGUSTUS	intron	1	4506	0.95	+	.	transcript_id "g1.t1"; gene_id "g1";
pycn_heli.0392	AUGUSTUS	CDS	4507	4781	0.95	+	2	transcript_id "g1.t1"; gene_id "g1";
pycn_heli.0392	AUGUSTUS	exon	4507	4781	.	+	.	transcript_id "g1.t1"; gene_id "g1";
pycn_heli.0392	AUGUSTUS	stop_codon	4779	4781	.	+	0	transcript_id "g1.t1"; gene_id "g1";
pycn_heli.0392	AUGUSTUS	gene	16379	16612	0.78	-	.	g2
pycn_heli.0392	AUGUSTUS	transcript	16379	16612	0.78	-	.	g2.t1
pycn_heli.0392	AUGUSTUS	stop_codon	16379	16381	.	-	0	transcript_id "g2.t1"; gene_id "g2";
pycn_heli.0392	AUGUSTUS	CDS	16379	16612	0.78	-	0	transcript_id "g2.t1"; gene_id "g2";

@sr320 I think you're right that the DESeq2 output matches the augustus.hints.codingseq... but aren't those all transcripts and not genes because they end in .t#?

I think another way to frame issue is: I have this count matrix file from hisat/ deseq output file but I am having trouble finding annotations associated with the iDs?

And yes, correct!

I'm feeling unsure of what is right or not ...

grace-ac avatar Sep 25 '24 19:09 grace-ac

Quick way to pull genes from GFF/GTF:

awk '$3 == "gene"' augustus.hints.gtf

Basically, that says, if the third field (i.e. column) is the word "gene", then print the entire line.

kubu4 avatar Sep 25 '24 20:09 kubu4

the suspense is killing me - where do things stand on this?

sr320 avatar Sep 27 '24 15:09 sr320

If there were only a set a side time of the week where science could be discussed to go over this.. maybe it could be an hour or more...

sr320 avatar Sep 27 '24 17:09 sr320

had this response drafted and never submitted comment!!! i'm working on notebook post now for what i'm doing - yesterday and today

Ok, thanks! that worked.

pycn_heli.0255	AUGUSTUS	gene	20802	21077	0.95	-	.	g24176
pycn_heli.0255	AUGUSTUS	gene	21134	22597	1	-	.	g24177
pycn_heli.0086	AUGUSTUS	gene	11884	12165	0.99	+	.	g24178
pycn_heli.0086	AUGUSTUS	gene	12367	14223	0.39	+	.	g24179
pycn_heli.0086	AUGUSTUS	gene	18738	23684	0	-	.	g24180
pycn_heli.0086	AUGUSTUS	gene	23819	24382	0.85	-	.	g24181
pycn_heli.0086	AUGUSTUS	gene	29454	35433	0.6	-	.	g24182
pycn_heli.0086	AUGUSTUS	gene	45428	49864	0.93	+	.	g24183
pycn_heli.0086	AUGUSTUS	gene	53564	54061	0.86	+	.	g24184

grace-ac avatar Sep 27 '24 18:09 grace-ac

@grace-ac

But another question is - is this hisat out put based on genes or transcript level?

sr320 avatar Sep 27 '24 18:09 sr320

R code: paper-pycno-sswd-2021-2022/code/15-hisat2-deseq2-summer2022.Rmd

it looks like transcript level

grace-ac avatar Sep 27 '24 18:09 grace-ac

Lets start workflow with gene level analysis.

sr320 avatar Sep 27 '24 18:09 sr320

still running? does an issue persist?

sr320 avatar Oct 04 '24 23:10 sr320

it completed and then i got side-tracked by data entry.

i checked it and there were only 20 files that were converted to bam files... should be 32.

there are 32 sam files.

so i'm running that conversion chunk again... maybe it got interrupted somehow? i'll keep tabs on it and once it's done will post issue if something is wrong still

grace-ac avatar Oct 04 '24 23:10 grace-ac

Screenshot 2024-10-04 at 4 38 11 PM

it looks like it's running fine right now

grace-ac avatar Oct 04 '24 23:10 grace-ac

step of converting the sam files (n=32) to bam files not working... I re-ran it and it only converted 20 files.

Code: https://github.com/grace-ac/paper-pycno-sswd-2021-2022/blob/main/code/20-hisat2-genecount-matrices.Rmd

Repo: https://github.com/grace-ac/paper-pycno-sswd-2021-2022/tree/main

grace-ac avatar Oct 07 '24 18:10 grace-ac

The corresponding SAM files are empty (and the resulting BAM files), so there's nothing to sort...

kubu4 avatar Oct 07 '24 19:10 kubu4

image

kubu4 avatar Oct 07 '24 19:10 kubu4

oop! makes perfect sense now. that issue with the sam files happened a while back and i thought i fixed it but i didn't check it officially. thanks for pointing it out!

grace-ac avatar Oct 07 '24 22:10 grace-ac

for the purpose of this question, just sharing one DEG list:

Have DEG list from 2021 data set

head DEGlist_2021_exposedVcontrol.tab 
baseMean	log2FoldChange	lfcSE	stat	pvalue	padj
g21712	15506.8006860189	-0.686001394104519	0.279769093821729	-2.45202707966608	0.0142053971059541	0.04104326062556
g21713	931.361342578574	0.765760120416294	0.279988740054853	2.73496755714631	0.00623864245462093	0.0207814457271467
g21711	12883.0935253533	-0.709970512692711	0.228673239474894	-3.10473807220743	0.00190447593029036	0.00763473087082639
g21769	95.2608451733162	3.6958184007692	1.00771829984027	3.66751144774787	0.000244922588350467	0.00136039062124746
g15181	7362.04249036577	2.13117880973538	0.606155663157497	3.5158935885115	0.000438276606665117	0.00222160900619904
g15182	848.103634775757	1.44107523921733	0.603474139884921	2.38796519017722	0.0169419463881676	0.047565397521059
g15183	916.675431757199	-0.700558231713056	0.24491415606803	-2.86042359886482	0.00423075479942892	0.0149434418878582
g7651	625.045813674949	1.17515444248357	0.338850532449784	3.46806137203791	0.000524227551336494	0.00257724022366272
g7656	369.553524158447	1.34279754324671	0.359840419468955	3.73164733752917	0.000190231694582979	0.00109975735292238

Next step: Annotation.

I have this BLAST output from before https://github.com/grace-ac/paper-pycno-sswd-2021-2022/blob/main/analyses/16-blast-annotation/blast_out_sep.tab:

 head blast_out_sep.tab 
g3452.t1	sp	Q9Y6A2	CP46A_HUMAN	36.842	380	201	6	73	1119	16	387	6.38e-73	238
g3453.t1	sp	Q9Y6A2	CP46A_HUMAN	39.574	470	267	7	79	1470	20	478	2.82e-111	342
g3454.t1	sp	Q9H4F8	SMOC1_HUMAN	41.026	156	80	3	43	504	21	166	3.34e-31	129
g3455.t1	sp	Q9D844	DNJC4_MOUSE	35.028	177	100	3	100	618	37	202	1.38e-22	95.9
g3456.t1	sp	P80018	GLBC_MOLAR	27.559	127	91	1	115	495	17	142	1.81e-08	54.7
g3457.t1	sp	Q9NRC6	SPTN5_HUMAN	31.110	1649	1112	11	1	4926	1850	3481	0.0	758
g3458.t1	sp	P16546	SPTN1_MOUSE	36.947	655	411	2	67	2028	300	953	1.10e-123	414
g3459.t1	sp	Q00963	SPTCB_DROME	45.908	782	414	4	40	2361	36	816	0.0	691
g3460.t1	sp	Q9WVK8	CP46A_MOUSE	38.248	468	266	9	103	1470	22	478	2.16e-106	329
g3461.t1	sp	Q9Y6A2	CP46A_HUMAN	37.681	414	218	9	79	1254	22	417	2.44e-86	275

but as you can see, it isn't the genes, it's transcripts. I used:

head -3 ../data/augustus.hints.codingseq
>g3450.t1
ATGAACTACCGGACTGATGATGAAATATATGAAGATGAGGTGGACGAGGAGACAAAACTG
CATCACAGAATTGATGAGCGTGGAGTGAACGGAACTGATTCTCGAGATGAACCAACAAGA

So now I want to re-run BLAST with the genes, not the transcripts....

And I'm not sure which file to use.

Do I use the genome? It doesn't have the gene IDs like the DEG list above does:

head -3 ../data/ncbi_dataset/data/GCA_032158295.1/GCA_032158295.1_ASM3215829v1_genomic.fna
>CM063243.1 Pycnopodia helianthoides isolate M0D057908R chromosome 1, whole genome shotgun sequence
gttaaaataatttgaatattgGATTAGTTTCAAACCCTCCCAGATCCTTCTAGATCCTCTTTGTTGAAATacaggattca
gaaggactgagggctgtaggcccaaatgacagttgcttatcactgggtcaggcatacagTAGGGCAGatgggtgatggga

grace-ac avatar Oct 15 '24 18:10 grace-ac

Are all sequences from blast ending in .t1?

sr320 avatar Oct 15 '24 19:10 sr320

Two approaches:

  1. Use existing BLAST output, but strip the .t1 from the first column and then sort for uniques just on the first column.

sed 's/\.t1//' blast_out_sep.tab | sort --unique -k1,1

This is probably fine, since it's unlikely that multiple transcripts for a given gene would get annotated as a different protein.

Then, join with DEG list.

  1. Use genes from DEG list to extract genes from GTF to then extract sequences from GCA_032158295.1_ASM3215829v1_genomic.fna. Then, BLAST those gene sequences.

kubu4 avatar Oct 15 '24 19:10 kubu4

Ok!

I think I'll go for option 1. But in the meantime, I can also do option 2 and have the BLAST running while I continue with the list for option 1. Unless that seems unnecessary.

grace-ac avatar Oct 15 '24 20:10 grace-ac

pretty sure everything is all good!

here's post for 2021 results

post for 2022 results coming later today

grace-ac avatar Oct 18 '24 20:10 grace-ac

Nice post! Heat maps look nice!

Did notice this, though...

I got these numbers in excel…

:thinking:

:wink:

kubu4 avatar Oct 18 '24 21:10 kubu4

ugh i know... i'm the worst.

grace-ac avatar Oct 18 '24 21:10 grace-ac