Better parallelism for very long queries
Hello,
I ran fulgor pseudoalign with "-t 32" and this was one line of the output:
Can't make use of more parsing threads than file (pairs); setting # of parsing threads to 1
Am I correct in understanding that Fulgor only ran single-threaded in this case? Based on the runtime it seems like this is true. Would it be possible for Fulgor to have multiple threads process the reads in a single file?
No, this does not mean that Fulgor ran single-threaded. Rather, it means that it used only one parsing thread (but many query threads). A fundamental limitation of the fastq format is that, because records are plaintext, non-headered, and of arbitrary length, it is not really possible to parse a fastq file (or a pair of files for paired-end reads) effectively with more than one thread. Fulgor uses a producer-consumer model, so a single parsing thread can still feed multiple consumers (i.e., mapper threads). However, parsing the input fairly quickly becomes a bottleneck. If you have multiple separate fastq input files, it is possible to parse those in parallel (and therefore provide work to the workers more efficiently).
Thank you @rob-p for the perfect answer to @tbenavi1. I'll just add that you can also inspect your CPU usage using a visual tool such as htop from terminal.
Hello, my query file is a fasta file of a genome assembly. It has 25 sequences, each an entire chromosome. Here is what I observed:
- 1 thread, chr1 only: about a minute.
- 1 thread, entire genome: under 12 minutes.
- 25 threads, chr1 only: about a minute, as expected.
- 25 threads, entire genome: 10 minutes, which is not the expected speedup over a single thread.
Is it possible there is some minimum number of sequences given to a thread before the next one receives work? Can we reopen the issue? Thanks.
Hi @tbenavi1,
Yes; the parallel parser / worker strategy operates on the assumption that one is mapping many (generally shorter) queries against the index. To reduce the overhead of the lock-free queue through which parsed query records are shared, reads are placed onto the queue in “chunks” of a default size of several thousand. In the typical case, where there are hundreds of thousands to hundreds of millions of queries, placing each query on the queue individually would introduce non-trivial overhead. Granted, this means that parallelization will be limited (or even absent) in a use case like yours, where there are only a handful of queries.
Of course, it would be possible to modify this policy, but the many-short-queries case is much more common. So I think the current question is: what is the best way to ensure that something reasonable happens in both cases? We could try to detect this automatically, but that might be error-prone. Alternatively, we could introduce a flag that lets the user control the parallel batch size in terms of the number of records; for use cases like yours, you could set this value to 1. Any thoughts on this @jermp?
Yes, a flag for batch size works for me.
Are we talking about this https://github.com/jermp/fulgor/blob/main/tools/pseudoalign.cpp#L38 as "batch size"?
As @rob-p suggested, could you @tbenavi1 try to set this value to one and see how the parallelism behaves? Thanks! Ok, I'll re-open the issue with a different title :)
I can edit this variable and let you know. Glancing at the code, though, it looks like this variable has to do with the output, not with processing the input?
Yes, you're right. Indeed, I'm not sure either how to control the batch size, which could be an internal state of the parallel parser we are using. Surely @rob-p knows this better than I do, since I haven't written that code.
Hi, I am wondering what variable I need to edit to change the chunk size? Thanks.
Looking here https://github.com/jermp/fulgor/blob/main/tools/pseudoalign.cpp#L115, it seems you can pass the chunk size as a parameter to the parser's constructor. So, this line https://github.com/jermp/fulgor/blob/main/tools/pseudoalign.cpp#L115 should, I think, be replaced by
fastx_parser::FastxParser<fastx_parser::ReadSeq> rparser(query_filenames, num_threads, 1, 1);
for your case, i.e., make the last parameter (the chunk size) equal to 1. But I cannot guarantee this, since I haven't tried it myself.
Hi @tbenavi1, have you had any luck regarding this? In particular, have you tried the thing in my previous message? Please, let us know. Thanks!