choice between fasta and HMM for alignment to a genome

Open dcopetti opened this issue 6 months ago • 1 comments

Hi, I have a set of protein sequences of TE genes in a FASTA file (there is some redundancy among closely related sequences, but proteins from all TE classes are represented, so variation is high). I am using them to query genome assemblies (200 Mb to 20 Gb) to find the location of TE coding regions. I am running BATH as bathsearch --cpu 20 -o output_bath --tblout output_tab TE_proteins.fa assembly.fa and it has been running for more than a week now. I wonder whether clustering the sequences by TE family and building an HMM for each cluster would speed up the searches I will run on the next assemblies. Would the time invested pay off in the long term? Do you have an estimate of the difference between the two options? Thanks, Dario

Aug 06 '25 21:08 dcopetti

Quick note for posterity: we've been in conversation with @dcopetti offline, and are helping him to look at his TE data.

Thoughts:

A big part of the slowness is probably that BATH, when given a set of query sequences, builds the corresponding internal pHMMs serially. One fix is to first run bathbuild on the sequences (which will run in parallel), then search with the resulting HMM file. Another is to wait a bit: we're currently working on parallelizing the internal HMM-building phase.
BATH with frameshifts enabled is pretty slow. We have long-term plans to address this, but in the short term, that's the way it is.
The sequences used by Dario are likely redundant. We're helping to cluster them to improve search performance (both speed and sensitivity). When done, we'll revise the documentation to include this as a use case. (Note: this is relevant to related work going on in RepeatMasker, which also has a mammalian-centric database of TE proteins.)
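
To illustrate the redundancy point, here is a minimal Python sketch of the crudest possible first pass: dropping exact duplicate protein sequences from the query FASTA before any real clustering. It is not the clustering procedure mentioned above, and it is not part of BATH. The function names (read_fasta, drop_exact_duplicates, write_fasta) are hypothetical, and treating sequences as identical regardless of case or a trailing stop character (*) is an assumption.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_exact_duplicates(records):
    """Keep the first record for each distinct sequence.

    Assumption: two sequences count as identical if they match after
    upper-casing and removing a trailing '*' (stop character).
    """
    seen = set()
    unique = []
    for header, seq in records:
        key = seq.upper().rstrip("*")
        if key in seen:
            continue
        seen.add(key)
        unique.append((header, seq))
    return unique


def write_fasta(records, width=60):
    """Format (header, sequence) pairs as FASTA text, wrapped at `width`."""
    out = []
    for header, seq in records:
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in out)
```

A possible use would be to pass an open handle on TE_proteins.fa to read_fasta, filter the records with drop_exact_duplicates, and write the result of write_fasta to a new file that then serves as the query set. Near-identical sequences are not removed by this, so a proper clustering step is still needed to address the redundancy described above.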

Aug 19 '25 19:08 traviswheeler