AMPlify ignores sequences containing stop codon indicator
We noticed that AMPlify strictly sticks to the 20 standard amino acids in input sequences and ignores all others, as stated in its help message:
$AMPlify -h
[...]
AMPlify v2.0.0
------------------------------------------------------
Predict whether a sequence is AMP or not.
Input sequences should be in fasta format.
Sequences should be shorter than 201 amino acids long,
and should not contain amino acids other than the 20 standard ones.
So far, so clear. But even if a stop codon is indicated with the commonly used asterisk *, the sequence is ignored. I believe this behaviour might not be desired, because several sequence annotation tools (e.g. Pyrodigal, Prodigal, Bakta, Prokka) append the * by default; for Prodigal, Prokka, and Bakta it is not even possible to deactivate the * as stop codon indicator. Thus, one cannot simply use the output from such annotation tools as input for AMPlify without first removing all *.
My feature request is thus to have AMPlify accept sequences with a stop codon indicator and remove the asterisk internally if necessary.
Minimum reproducible example:
- Download this FASTA file: amplify-failed-genes.faa.gz (contains two sequences: one too long and one with *)
zcat amplify-failed-genes.faa.gz > amplify-failed-genes.faa
AMPlify -s amplify-failed-genes.faa
I'll link another issue where this behaviour was observed.
Thank you for your message. We understand how the inclusion of stop codon indicators (such as *) in sequence outputs from annotation tools like Prodigal, Prokka, and Bakta can cause issues when used with AMPlify.
While the current behaviour was designed to strictly accept only the 20 standard amino acids to ensure clean inputs, we acknowledge that many annotation tools append the stop codon symbol (*) by default, and this can indeed interfere with direct input into AMPlify.
We appreciate your suggestion to automatically handle stop codons by removing the asterisk internally. This could enhance AMPlify’s usability, especially for users working with outputs from a variety of annotation pipelines (or users who do not know about AMPlify's behaviour/have not read the documentation). We will certainly consider adding this functionality to future versions, as it could streamline workflows and reduce the need for additional preprocessing.
In the meantime, as you’ve mentioned, a simple one-liner in Perl or another scripting language can resolve this issue by removing the asterisks before running AMPlify. We will also make sure to update our documentation to better highlight this behaviour for users who may not be familiar with it.
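For anyone who needs the workaround right away, here is a minimal Python sketch of such a preprocessing step. It is not part of AMPlify; the function name `strip_stop_indicators` is made up for illustration, and the sketch assumes that only a single `*` at the very end of a record should be dropped, so internal asterisks are left untouched.

```python
def strip_stop_indicators(fasta_text: str) -> str:
    """Remove one trailing '*' from each FASTA record.

    Hypothetical helper, not part of AMPlify. Asterisks inside a
    sequence are left as they are.
    """
    records = []  # list of (header, [sequence lines])
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            records.append((line, []))
        elif line.strip() and records:
            records[-1][1].append(line.strip())

    out = []
    for header, seq_lines in records:
        # Only the last sequence line of a record can carry the stop indicator.
        if seq_lines and seq_lines[-1].endswith("*"):
            seq_lines[-1] = seq_lines[-1][:-1]
        out.append(header)
        # Drop a line that became empty because it held only the '*'.
        out.extend(s for s in seq_lines if s)
    return "\n".join(out) + "\n" if out else ""
```

One possible way to use it is to read `amplify-failed-genes.faa`, pass its contents through the function, write the result to a new file, and give that file to `AMPlify -s`.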
Thanks again for your valuable feedback and interest in AMPlify. Rene
That sounds great, thanks @warrenlr for considering this request for a next AMPlify release 🚀
Hi @jasmezz,
Thank you once again for your valuable suggestion regarding AMPlify! We truly appreciate your insights.
After carefully considering your feature request, we have decided to implement functionality to handle stop codon indicators (*), as they provide users with an important distinction between biologically ‘complete’ peptides and right-truncated ones.
With the release of version 2.0.1, we now process asterisks by internally clipping them from sequences. However, users utilizing the predict script will still be able to see the original sequences, including the asterisks, as they were initially provided. This ensures a seamless experience for both training and prediction purposes. Please note that we still do not support asterisks located within the sequence or non-standard amino acids.
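The rules described above can be sketched roughly as follows. This is only an illustration of the stated behaviour, not AMPlify's actual code: the names `STANDARD_AA`, `MAX_LEN`, and `prepare_sequence` are hypothetical, and the sketch assumes upper-case input, a single trailing asterisk, and that the length limit from the help message applies after clipping.

```python
# Hypothetical names; this is an illustration, not AMPlify's implementation.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids
MAX_LEN = 200  # help message: "shorter than 201 amino acids long"


def prepare_sequence(seq: str):
    """Return the clipped sequence if it is acceptable, otherwise None."""
    # Clip a single trailing stop codon indicator (assumption: only one).
    clipped = seq[:-1] if seq.endswith("*") else seq
    # Assumption: the length check is applied to the clipped sequence.
    if not clipped or len(clipped) > MAX_LEN:
        return None
    # Internal asterisks and non-standard residues are still rejected.
    if any(ch not in STANDARD_AA for ch in clipped):
        return None
    return clipped
```

Under these assumptions, `MKV*` is accepted as `MKV`, while `MK*V` and `MKX` are still rejected.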
We sincerely appreciate your contribution in helping make AMPlify a more comprehensive tool. Berke
That's really cool, thank you for the quick response and release!
I just have to point out one tiny thing that you missed: The version number in AMPlify --help still shows v2.0.0 instead of v2.0.1. Any chance you could fix this, maybe in a patch release?