
NCBI Sequin tbl format parser

Open AgustinPardo opened this issue 11 months ago • 5 comments

Hello,

Do you have any method to parse files in the "NCBI Sequin tbl" format?

This is an example of the format:

>Feature Chr_1
1	8618836	REFERENCE
			CFMR	12345
1495	550	gene
			locus_tag	Tatro_000001
1495	1230	mRNA
1171	550
			product	hypothetical protein
			transcript_id	gnl|ncbi|Tatro_000001-T1_mrna
			protein_id	gnl|ncbi|Tatro_000001-T1
1495	1230	CDS
1171	550
			codon_start	1
			db_xref	InterPro:IPR002410
			db_xref	PFAM:PF08386
			db_xref	InterPro:IPR000073
			db_xref	InterPro:IPR029058
			db_xref	InterPro:IPR013595
			db_xref	InterPro:IPR050266
			db_xref	PFAM:PF12697
			note	MEROPS:MER0025512
			product	hypothetical protein
			transcript_id	gnl|ncbi|Tatro_000001-T1_mrna
			protein_id	gnl|ncbi|Tatro_000001-T1
5108	3585	gene
			locus_tag	Tatro_000002
5108	4516	mRNA
4452	3585
			product	hypothetical protein
			transcript_id	gnl|ncbi|Tatro_000002-T1_mrna
			protein_id	gnl|ncbi|Tatro_000002-T1
4959	4516	CDS
4452	3781
			codon_start	1
			db_xref	PFAM:PF00172
			db_xref	InterPro:IPR001138
			db_xref	InterPro:IPR036864
			product	hypothetical protein
			transcript_id	gnl|ncbi|Tatro_000002-T1_mrna
			protein_id	gnl|ncbi|Tatro_000002-T1

Regards

AgustinPardo avatar Feb 22 '25 21:02 AgustinPardo

No, I don't think we have anything for this, although I did once look at the semi-related NCBI protein tables (*.ptt files): https://github.com/biopython/biopython/issues/1725

What is your use case (and can you use the GenBank format files instead)?

peterjc avatar Feb 24 '25 14:02 peterjc

> No, I don't think we have anything for this, although I did once look at the semi-related NCBI protein tables (*.ptt files) #1725
>
> What is your use case (and can you use the GenBank format files instead)?

I need to parse and process this specific file format.

AgustinPardo avatar Mar 02 '25 16:03 AgustinPardo

I think it could be parsed into SeqRecord objects (with missing sequences - although we do know their lengths) and SeqFeature objects, allowing it to fit under Bio.SeqIO. A simpler parser might suffice for your needs?

peterjc avatar Mar 03 '25 11:03 peterjc

> (with missing sequences - although we do know their lengths)

Then you can create a Seq object with a defined length but undefined sequence contents.

mdehoon avatar Mar 04 '25 00:03 mdehoon
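
The two suggestions above can be combined. What follows is a minimal sketch, assuming a recent Biopython release in which `Seq(None, length=...)` is available. It uses values from the example table and assumes the `REFERENCE` line spans the whole sequence, so that its end coordinate can serve as the length. The conversion from the table's 1-based coordinates to Biopython's 0-based, end-exclusive locations is an assumption about how one would map them, not something the thread specifies.

```python
from Bio.Seq import Seq, UndefinedSequenceError
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Length assumed from the "1 8618836 REFERENCE" line of the example table.
record = SeqRecord(Seq(None, length=8618836), id="Chr_1")

# Table line "1495 550 gene": start > stop, so the gene is on the minus strand.
# Biopython locations are 0-based and end-exclusive, hence 550 - 1 as the start.
gene = SeqFeature(
    FeatureLocation(550 - 1, 1495, strand=-1),
    type="gene",
    qualifiers={"locus_tag": ["Tatro_000001"]},
)
record.features.append(gene)

print(len(record.seq))  # the length is known
try:
    str(record.seq)  # the contents are not
except UndefinedSequenceError:
    print("sequence contents are undefined")
```

A record built this way carries the feature annotation without the sequence itself, so anything that needs the actual bases (for example translating a CDS) would still require the FASTA file that normally accompanies the table.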

In the end, I wrote my own parser from scratch, reading the file line by line.

AgustinPardo avatar May 29 '25 02:05 AgustinPardo
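
The parser itself was not posted in the thread. For readers with the same need, a line-by-line approach might look roughly like the sketch below. It uses only the standard library and assumes tab-separated columns as in the example above: a `>Feature` header, feature lines of the form start, stop, key, continuation lines with only start and stop, and qualifier lines that begin with three empty columns. The names `TblFeature` and `parse_tbl` are hypothetical, and details such as offsets are not handled.

```python
from dataclasses import dataclass, field


@dataclass
class TblFeature:
    """One feature from a table (hypothetical container, not a Biopython class)."""
    key: str
    intervals: list = field(default_factory=list)   # (start, stop) as written, 1-based
    qualifiers: dict = field(default_factory=dict)  # qualifier name -> list of values

    @property
    def strand(self):
        # In this format start > stop indicates the minus strand.
        start, stop = self.intervals[0]
        return -1 if start > stop else 1


def _coord(text):
    # Partial features mark a coordinate with "<" or ">"; this sketch drops the marker.
    return int(text.lstrip("<>"))


def parse_tbl(lines):
    """Return {sequence_id: [TblFeature, ...]} from an iterable of lines."""
    tables = {}
    features = None  # feature list of the table currently being read
    current = None   # feature that qualifiers and extra intervals attach to
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        if line.startswith(">Feature"):
            seq_id = line.split()[1]
            features = tables.setdefault(seq_id, [])
            current = None
            continue
        if features is None:
            raise ValueError("data line before any '>Feature' header")
        cols = line.split("\t")
        if cols[0] == "":
            # Qualifier line: three empty columns, then key and optional value.
            if current is None or len(cols) < 4:
                raise ValueError(f"unexpected qualifier line: {line!r}")
            value = cols[4] if len(cols) > 4 else ""
            current.qualifiers.setdefault(cols[3], []).append(value)
        elif len(cols) >= 3 and cols[2]:
            # New feature: start, stop, feature key.
            current = TblFeature(key=cols[2])
            current.intervals.append((_coord(cols[0]), _coord(cols[1])))
            features.append(current)
        else:
            # Continuation line: a further interval of the current feature.
            if current is None or len(cols) < 2:
                raise ValueError(f"unexpected interval line: {line!r}")
            current.intervals.append((_coord(cols[0]), _coord(cols[1])))
    return tables
```

Because `parse_tbl` accepts any iterable of lines, an open file handle can be passed to it directly. Repeated qualifiers such as `db_xref` are collected into a list under one key, which mirrors how Biopython stores qualifiers on `SeqFeature` objects.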