Possible IO performance issue
Maybe there is an IO bottleneck. When I have, let's say, 1000 hits, the command lib.runMultiProgress(runExonerate, Hits, args.cpus, progress=args.progress) (https://github.com/nextgenusfs/funannotate/blob/master/funannotate/aux_scripts/funannotate-p2g.py#L318) runs the method runExonerate 1000 times in parallel. Let's say that all of them are on the same scaffold, scaff1.
Each time, with open(os.path.join(tmpdir, 'scaffolds', ScaffID+'.fa'), 'r') as fullscaff: will be called with ScaffID = 'scaff1' (https://github.com/nextgenusfs/funannotate/blob/master/funannotate/aux_scripts/funannotate-p2g.py#L220), so the same FASTA file will be read 1000 times. Would it be better to preload the FASTA headers with their corresponding sequences and sequence lengths, and use this cache to retrieve, e.g., the hit sequence?
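Something like the following is what I have in mind (a minimal sketch only; load_scaffolds and the file names are hypothetical, not the actual funannotate code):

```python
from Bio import SeqIO

def load_scaffolds(fasta_path):
    # read the genome FASTA once, keeping the sequence (and implicitly its length) per header
    return {record.id: str(record.seq) for record in SeqIO.parse(fasta_path, "fasta")}

scaffolds = load_scaffolds("genome.fasta")   # built once, before the pool starts
hit_seq = scaffolds["scaff1"][100:5000]      # slice from memory instead of re-reading the file
```

Each runExonerate call could then slice its region out of the cache instead of opening the scaffold file again.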
It probably can be improved. I was never able to get exonerate to run with file-like objects, i.e. so it wouldn't need to write so many files. A PR is certainly welcome if you can speed it up.
Unfortunately, my python knowledge is limited.
Based on my currently running tests, writing the files does not seem to be an IO issue. Of course there are many of them, but they are small and are only written once the time-consuming CPU calculation is done. As far as I can see, only reading produces high IO.
There is a way to run exonerate in server mode; this might help: https://www.animalgenome.org/bioinfo/resources/manuals/exonerate/server.html
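Per that manual, the workflow would be roughly: build the server index once, start the server, then point each query at host:port instead of at a FASTA file. A sketch of what that could look like from Python (file names and the port are placeholders; untested against funannotate):

```python
import subprocess

# build the exonerate sequence database and index once (per the linked manual)
subprocess.run(["fasta2esd", "genome.fasta", "genome.esd"], check=True)
subprocess.run(["esd2esi", "genome.esd", "genome.esi"], check=True)

# serve the indexed genome, then query it over the socket instead of a file
server = subprocess.Popen(["exonerate-server", "genome.esi", "--port", "12886"])
try:
    subprocess.run(["exonerate", "--model", "protein2genome",
                    "proteins.fasta", "localhost:12886"], check=True)
finally:
    server.terminate()
```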
Okay, looking deeper: keeping the whole scaffold in memory might be excessive. This might be better if we use one of the DB index tools to retrieve the slice; the Biopython index code works reasonably well for that without pulling the whole sequence object into memory, I believe. http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec67 or http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec70?
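For instance, something along these lines with Bio.SeqIO.index (a sketch; the path and coordinates are placeholders):

```python
from Bio import SeqIO

# dict-like, lazy index: a record is only parsed from disk when accessed
genome = SeqIO.index("genome.fasta", "fasta")
record = genome["scaff1"]        # fetched on demand
region = record.seq[1000:5000]   # slice out just the hit region for exonerate
```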
Haven't looked at the code in a while, but if I didn't use SeqIO.index, that would be easy to add. Not sure if that will ease the read requirements?
Before we try a fix, I think we need to know whether the high IO is because exonerate is being run 1000 times and reopening a bunch of files (e.g. 1000 small slices of a chromosome), in which case it might be sped up by running exonerate in server mode and giving start/end coordinates (which I don't think it allows, though genewise did allow this),
or whether it is about the steps that re-open the scaffold file and produce the smaller chunks needed to run exonerate; that could be sped up by SeqIO.index, I believe.
Well it should only run n processes (--cpus) at a time via the multiprocessing pool. So I don't see how it could be running 1000 at a time unless cpus=1000.
We could change this to a ThreadPool, which would have less overhead since exonerate is launched as a separate process via subprocess anyway. The ThreadPool would then just use a new thread to launch each process, instead of a process launching another process. That change is literally just something like from multiprocessing.dummy import Pool as ThreadPool.
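A minimal sketch of that swap (the command list here is a placeholder for the real exonerate calls):

```python
from multiprocessing.dummy import Pool as ThreadPool  # thread-backed, same Pool API
import subprocess

def run_cmd(cmd):
    # each thread just blocks on the external process, so the GIL is
    # not in the way while the alignment itself runs
    subprocess.run(cmd, check=True)

commands = [["echo", "exonerate placeholder %d" % i] for i in range(8)]
with ThreadPool(4) as pool:
    pool.map(run_cmd, commands)
```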
Okay, I just looked at it. Yes, this code is rather old -- at the time I couldn't figure out how to pass kwargs properly to the multiprocessing pool (i.e. pass a SeqIO.index object as a parameter to that function). So it should be possible to index the genome with SeqIO.index() and then pass that object to the runExonerate() function, so the genome FASTA file isn't read over and over again.
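One caveat, I believe: the SeqIO.index object holds an open file handle and so can't be pickled into pool workers directly; a per-worker initializer is one way around that. A rough sketch (the function names, paths, and hit coordinates are hypothetical):

```python
import multiprocessing
from Bio import SeqIO

_genome = None  # per-worker index handle

def _init_worker(fasta_path):
    # rebuild the index once in each worker instead of pickling it across
    global _genome
    _genome = SeqIO.index(fasta_path, "fasta")

def run_exonerate(hit):
    scaff_id, start, end = hit
    region = _genome[scaff_id].seq[start:end]
    # ...write region + query to temp FASTA files and launch exonerate here
    return len(region)

if __name__ == "__main__":
    hits = [("scaff1", 0, 500)]  # placeholder hit coordinates
    with multiprocessing.Pool(4, initializer=_init_worker,
                              initargs=("genome.fasta",)) as pool:
        print(pool.map(run_exonerate, hits))
```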
Both of those sound smart, the threading and the index. It might be good to figure out how to build some benchmarking in there, but maybe just timing the start/end of the exonerate step is sufficient, and we can do this on a branch first.
I have something nearly done; give me a few more minutes. It might actually be slower, but it will allow for debugging the timings. I de-coupled the FASTA file generation from the exonerate run -- I actually think the old code was perhaps slightly unstable. I had a bunch of error checks in there to deal with that, but this newer version seems more robust.
This is with a smallish test set, but it should print timings of the steps in the log file. @Bjoernsen, if you are able to test and see whether the behavior is different/better, that would be great. It is in the p2g_io branch.
Each of the steps is timed: the first step is tblastn/diamond, the second step parses those data and writes the query and target FASTA files, and finally the exonerate commands are run in parallel.
(py3-funannotate) jon@Jons-MacBook-Pro:~/test-rna_seq_ed925828-fb1b-4713-a2a2-802db2ae1b6e$ time funannotate util prot2genome -g test.softmasked.fa -p protein.evidence.fasta -o index.gff3 --cpus 4 --logfile index.log -f diamond
[Oct 14 04:37 PM]: Mapping 1,065 proteins to genome using diamond and exonerate
[Oct 14 04:38 PM]: Found 1,505 preliminary alignments with diamond in 0:00:02 --> generated FASTA files for exonerate in 0:00:10
[Oct 14 04:38 PM]: Exonerate finished in 0:00:31: found 1,270 alignments
real 0m45.746s
user 1m34.527s
sys 0m16.428s
(py3-funannotate) jon@Jons-MacBook-Pro:~/test-rna_seq_ed925828-fb1b-4713-a2a2-802db2ae1b6e$ time funannotate util prot2genome -g test.softmasked.fa -p protein.evidence.fasta -o tblastn.gff3 --cpus 4 --logfile index-tblastn.log -f tblastn
[Oct 14 04:40 PM]: Mapping 1,065 proteins to genome using tblastn and exonerate
[Oct 14 04:40 PM]: Found 1,510 preliminary alignments with tblastn in 0:00:05 --> generated FASTA files for exonerate in 0:00:10
[Oct 14 04:41 PM]: Exonerate finished in 0:00:40: found 1,475 alignments
real 0m57.624s
user 2m29.750s
sys 0m17.999s
I tested the ThreadPool method -- it was slower because (at least on my Mac) it doesn't use the full CPUs, since it keeps everything as threads, so you end up with one Python process running many threads. Whereas when you use the standard Pool method, it launches a new process for every exonerate call, which utilizes the full CPUs available. It doesn't make a big difference on this small dataset, but it will be a large performance difference on a large genome and with something like UniProt (default).
Ran a test with 6 cpus with uniprot and a fungal genome (this is on my Mac with SSD hard drive):
$ time funannotate util prot2genome -g 24266-2.final.fasta -p /usr/local/share/funannotate/uniprot_sprot.fasta -o test-genome.p2g.gff3 --cpus 6 --logfile test-genome.log
[Oct 14 08:35 PM]: Mapping 562,755 proteins to genome using diamond and exonerate
[Oct 14 09:09 PM]: Found 268,032 preliminary alignments with diamond in 0:06:32 --> generated FASTA files for exonerate in 0:27:53
[Oct 14 10:17 PM]: Exonerate finished in 1:06:06: found 1,422 alignments
real 103m56.737s
user 347m13.948s
sys 70m57.558s
I also tried to multi-thread the generation of the FASTA files -- this did not work, I think because of too many simultaneous requests to open the index file. Perhaps it would work if it were all in memory, but I'm not sure. Or I could chunk the data differently, but that might not be worth the effort. This is working better than the old code, I think.
Very nice. Do you want me to post some test numbers from CentOS and see? I can also test with an SSD tmpdir vs. a network tmpdir.
Sure, that would be helpful. I think it is possible to multi-process the FASTA file generation, but we would need to split the scaffolds into one file each and then group all the calls by scaffold, so that each scaffold index is only accessed by a single process (see the sketch below). Not sure it is worth the extra work?
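Roughly this kind of grouping is what I mean (a sketch; write_fastas_for_scaffold and the hit fields are hypothetical):

```python
import multiprocessing
from collections import defaultdict

def write_fastas_for_scaffold(item):
    # hypothetical worker: handles every hit for one scaffold, so only this
    # process ever touches that scaffold's file/index
    scaffold, scaffold_hits = item
    for hit in scaffold_hits:
        pass  # write the query/target FASTA pair for this hit

if __name__ == "__main__":
    hits = [{"scaffold": "scaff1", "start": 0, "end": 500}]  # placeholder hits
    by_scaffold = defaultdict(list)
    for hit in hits:
        by_scaffold[hit["scaffold"]].append(hit)
    with multiprocessing.Pool(4) as pool:
        pool.map(write_fastas_for_scaffold, list(by_scaffold.items()))
```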
I was also going to see if there are some pident or evalue thresholds we can use to filter/reduce the number of exonerate calls without losing any valid alignments.
You can set a --score option for filtering, but exonerate doesn't compute an evalue. There's a '--best' or '--bestn' option too, but it may add computational cost, I'm not sure -- maybe it's the same as now, just with reduced output.
Or do you mean evalue filtering on the diamond step?
I mean the diamond/tblastn step. I set a rather low evalue threshold there, but we could do some post-filtering; i.e. in the run above it ran exonerate 250k times to find only 1500 alignments with pident greater than 80%.
There may be some options around Diamond re-aligning the HSPs, but I'm not sure; maybe computing the total % of the query aligned might also help speed things up (make it a parameter)? Basically, if you have peptides from a close or the same species you might want high stringency, while for species out on their own distant branches you want a more liberal input for exon support. Though truthfully, exonerate may not recover those alignments well enough anyway. We could try something empirical: compare the distribution of stats for the diamond/tblastn alignments (scores/pident/pct-aligned) to the distribution of what is actually kept when exonerate is run.
The Smith-Waterman cleanup can be helpful in NCBI tblastn too, but in the end we are doing one better with exonerate and the p2g model, I guess.
https://github.com/bbuchfink/diamond/releases/tag/v2.0.12
We could then set these params in the tblastn/diamond run (sketch after the list):
--id minimum identity% to report an alignment
--query-cover minimum query cover% to report an alignment
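A hypothetical sketch of where those flags would go, wrapped in Python like the rest of the pipeline (the database/query names and cutoff values are placeholders, not funannotate's actual invocation):

```python
import subprocess

# cut down the preliminary hits handed to exonerate by adding
# identity/coverage cutoffs to the diamond call; values are placeholders
cmd = [
    "diamond", "blastx",
    "--db", "proteins.dmnd",
    "--query", "genome.fasta",
    "--outfmt", "6",
    "--out", "prelim.tsv",
    "--id", "60",            # minimum identity % to report an alignment
    "--query-cover", "50",   # minimum query cover % to report an alignment
]
subprocess.run(cmd, check=True)
```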
~300k prelim alignments took >1.5 hours today just to parse the input FASTA files on a different (cloud) filesystem (and only 30 min to run exonerate with 24 cpus). So I'll probably multi-process that function. But I think we can also set some more intelligent thresholds, either in diamond/tblastn or just by filtering with Python after the fact, that could reduce this as well.
Okay, this should now write the FASTA files for exonerate in parallel processes. I've also exposed a --tmpdir option in funannotate util prot2genome and in funannotate predict that should allow the user to direct this temp p2g folder to a faster drive. @hyphaltip, let me know if this is faster for you. I'm trying it out now on AWS/cloud. If it needs to go faster, then we can run some experiments modifying the diamond/tblastn filtering parameters to yield fewer hits to run exonerate on.
This is sufficiently fast for me now (and fixed the race/conflict/unstable code).