funannotate train: Alignment failed, BAM files empty. Please check logfile.

Hi @nextgenusfs, I apologize for posting the same/related problem as https://github.com/nextgenusfs/funannotate/issues/349. I've tried to find a way to address the issue, but so far without success. funannotate train exits with a non-zero status and the error "Alignment failed, BAM files empty. Please check logfile". PASA version is 2.4.1. For reference, I'm sharing funannotate-train.log, the list of funannotate dependencies installed on the system, and the ls -l output of the train output directory.

Thank you in advance for the help!

funannotate-train.log funannotate_check_show-versions.txt output_directory.txt

Sep 28 '21 10:09 papelypluma

Looks like the seqclean dependency of PASA likely failed -- check these two log files:

err_seqcl_trinity.fasta.log
seqcl_trinity.fasta.log

Oct 04 '21 03:10 nextgenusfs

Hi @nextgenusfs. Thanks for taking a look at this. Here are the log files. They don't seem to contain any errors, though. I've noticed that the training process stops during minimap2. I tried running just the minimap2 step to check what's going on, and here's what I got:

samtools sort: truncated file. Aborting
[E::sam_parse1] no SQ lines present in the header
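
For reference, the "no SQ lines present in the header" message can be checked directly on the SAM text that minimap2 writes. The following is only a minimal sketch: the function name `has_sq_lines` and the two sample records are made up for illustration and are not part of funannotate or samtools.

```python
import io


def has_sq_lines(sam_stream):
    """Return True if the SAM header contains at least one @SQ line."""
    for line in sam_stream:
        if not line.startswith("@"):
            break  # the header ends at the first alignment record
        if line.startswith("@SQ\t"):
            return True
    return False


# Hypothetical SAM snippets: one with a reference dictionary, one without.
with_sq = io.StringIO(
    "@HD\tVN:1.6\n"
    "@SQ\tSN:scaffold_1\tLN:1000\n"
    "read1\t0\tscaffold_1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
)
without_sq = io.StringIO(
    "@PG\tID:minimap2\n"
    "read1\t0\tscaffold_1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
)

print(has_sq_lines(with_sq))
print(has_sq_lines(without_sq))
```

In practice the stream would be the opened SAM file (or the output of `samtools view -H` on a BAM) rather than an in-memory string.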

seqcl_trinity.fasta.log err_seqcl_trinity.fasta.log

Oct 05 '21 08:10 papelypluma

Hi @nextgenusfs Jon. I've added the flag -I <int>G to minimap2, and the minimap2 step now seems to run without errors. Could it be that when the genome is larger than 4G, minimap2 switches to multi-part indexing by default, so that flags such as --split-prefix or -I <int>G are required for samtools to sort the BAM file properly? The genome I'm dealing with is ca. 5 Gb. If adding this flag (-I) addresses the issue, is there a way to specify it in the current version of funannotate train, or could it be added to the train.py module? I'm currently running the minimap2 step with this additional flag, and I'm thinking of re-invoking funannotate train in the hope that it picks up from this step.
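
A rough sketch of the idea, for anyone hitting the same problem: pick a -I value larger than the genome so that minimap2 builds a single-part index, then pipe the SAM output into samtools sort. This is not the command funannotate runs internally. The helper names, file names, thread count, and the 10% headroom are all assumptions for illustration, and the 4G floor simply mirrors the default mentioned in this post.

```python
import math
import shlex


def index_batch_size(genome_bases, headroom=1.1):
    """Return a minimap2 -I value (e.g. '6G') that exceeds the genome size."""
    gigabases = math.ceil(genome_bases * headroom / 1e9)
    return f"{max(gigabases, 4)}G"  # never go below the assumed 4G default


def minimap2_pipeline(genome, transcripts, out_bam, genome_bases, threads=8):
    """Build a shell pipeline string: minimap2 spliced alignment | samtools sort."""
    align = [
        "minimap2", "-ax", "splice",
        "-t", str(threads),
        "-I", index_batch_size(genome_bases),
        genome, transcripts,
    ]
    sort = ["samtools", "sort", "-@", str(threads), "-o", out_bam, "-"]
    return f"{shlex.join(align)} | {shlex.join(sort)}"


# Hypothetical file names; a ~5 Gb genome yields -I 6G.
print(minimap2_pipeline("genome.fasta", "trinity.fasta", "trinity.bam", 5_000_000_000))
```

The function only builds the command string so it can be inspected before running; it requires Python 3.8+ for `shlex.join`.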

Update: I was able to get past the minimap2 error by adding the flag '-I 6G' in library.py, which let the pipeline move on to PASA. However, PASA now fails with a different error. Sharing the log file here.

funannotate-train.log

Thank you.

Oct 06 '21 03:10 papelypluma

Looks like blat died in PASA -- not sure of the reason but I'd try to run one of those commands that it says failed manually and see if that gives you a hint. You could try to bypass this by passing --aligners minimap2 to only use the minimap2 alignments. Not sure, but perhaps blat is dying because of a memory issue.

The excessive memory is because you have >1.5 million Trinity transcripts? That seems rather excessive considering you probably are expecting something like ~20k genes? Is there something non-standard about the RNA-seq data you feed to the script? If you are running PASA with the SQLite backend and it is going through 1.5 million transcripts, it might take a very long time to run as SQLite is single threaded.
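
One quick way to sanity-check the transcript count is to count FASTA headers and collapse Trinity isoforms to their gene IDs. This is a minimal sketch that assumes the usual Trinity naming (`TRINITY_DN…_c…_g…_i…`); the function name and the sample records are made up for illustration.

```python
import io
import re

# Trinity isoform IDs end in _i<number>; stripping that suffix leaves the gene ID.
ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def count_trinity(fasta_stream):
    """Return (number of transcripts, number of distinct Trinity genes)."""
    transcripts = 0
    genes = set()
    for line in fasta_stream:
        if line.startswith(">"):
            transcripts += 1
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            genes.add(ISOFORM_SUFFIX.sub("", seq_id))
    return transcripts, len(genes)


# Hypothetical records: three isoforms belonging to two genes.
sample = io.StringIO(
    ">TRINITY_DN10_c0_g1_i1 len=300\nACGT\n"
    ">TRINITY_DN10_c0_g1_i2 len=280\nACGT\n"
    ">TRINITY_DN11_c0_g1_i1 len=500\nACGT\n"
)
print(count_trinity(sample))
```

A very large gap between the two numbers would point to many isoforms per gene rather than many genes.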

Oct 06 '21 17:10 nextgenusfs

Hi Jon, thanks for the suggestions! Assuming it runs without errors, how would specifying minimap2 as the only aligner affect the result? I don't see anything unusual with the RNA-seq data. Considering that PASA runs with SQLite by default, is there an easy way to make it run with MySQL on Conda? I looked at the installation instructions for PASA with MySQL, and it doesn't seem to be trivial.

Oct 07 '21 01:10 papelypluma