Benchmarking the Two-Locus Framework and LD Calculator
To assess how the two-locus framework performs relative to the LD Calculator, I've run a series of benchmarks. Initially, the two-locus framework was rather slow relative to the LD Calculator (up to 10x slower). After addressing some of the low-hanging inefficiencies in the code, I'm happy to report that the two-locus framework is now within 2x the speed of the LD Calculator, and in some cases faster. I've written a document that thoroughly details the changes and the reasoning behind them: benchmarks.pdf. For brevity, I've listed some of the highlights below as a high-level overview.
Here is a brief description of the changes we made to improve performance:
- Base case: The original code that is currently in tskit (link).
- Malloc out of hot loop: Preallocate a structure of arrays used for temporary calculations instead of allocating and freeing for every pair of sites (link).
- Refactor Bit Arrays: Refactor bit array interface, removing the need for temporary arrays in many cases. All functions now take a row index as a parameter (link).
- Precompute Allele Counts and Biallelic Summary Function: Store precomputed bit arrays for each sample set and allele and the count of samples for each allele. Introduce a biallelic summary function that avoids multiple redundant computations, leaving the original normalized summary function for multiallelic sites (link).
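To make the "malloc out of hot loop" change concrete, here is a minimal C sketch of the pattern: scratch space is allocated once before the pairwise loop over sites, rather than allocated and freed for every pair. This is purely illustrative; the names (`two_locus_work_t`, `pairwise_total`, and so on) are hypothetical and this is not the actual tskit implementation.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of preallocated scratch space for the hot loop. */
typedef struct {
    double *hap_weights; /* reused temporary buffer, one entry per sample */
    size_t size;
} two_locus_work_t;

static int
two_locus_work_init(two_locus_work_t *work, size_t size)
{
    work->size = size;
    work->hap_weights = malloc(size * sizeof(*work->hap_weights));
    return work->hap_weights == NULL ? -1 : 0;
}

static void
two_locus_work_free(two_locus_work_t *work)
{
    free(work->hap_weights);
}

/* Stand-in for a per-pair statistic; reuses the preallocated buffer
 * instead of calling malloc/free here, inside the hot loop. */
static double
compute_pair_stat(two_locus_work_t *work)
{
    double sum = 0;
    size_t i;

    memset(work->hap_weights, 0, work->size * sizeof(*work->hap_weights));
    for (i = 0; i < work->size; i++) {
        work->hap_weights[i] = 1.0;
        sum += work->hap_weights[i];
    }
    return sum;
}

double
pairwise_total(size_t num_sites, size_t num_samples)
{
    two_locus_work_t work;
    double total = 0;
    size_t a, b;

    /* One allocation up front, instead of one per (a, b) pair. */
    if (two_locus_work_init(&work, num_samples) != 0) {
        return -1;
    }
    for (a = 0; a < num_sites; a++) {
        for (b = 0; b < num_sites; b++) {
            total += compute_pair_stat(&work);
        }
    }
    two_locus_work_free(&work);
    return total;
}
```

Since the pairwise loop runs over all site pairs, hoisting the allocation turns O(num_sites²) malloc/free calls into one.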
Python and C tests are all passing for each of these patches.
We benchmark each change (in our benchmarks, each change is layered on top of the previous one) on a set of tree sequences simulated with the following parameters:
| parameter | value |
|---|---|
| $r$ (recombination rate) | 1e-8 |
| $N_e$ (effective population size) | 1e4 |
| $L$ (sequence length) | 2e6 |
We use ploidy=1 and sample sizes ranging from $10^2$ to $10^5$, with 15 replicates per configuration, to obtain these results:
In these plots, the relative difference is $(\textrm{tl} - \textrm{ldc}) / \textrm{ldc}$, where ldc is the LD Calculator and tl is the two-locus framework.
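Spelled out as code (a trivial, purely illustrative helper, not anything in the codebase):

```c
/* Relative difference of two runtimes, as plotted: (tl - ldc) / ldc.
 * Negative values mean the two-locus framework is faster than the
 * LD Calculator; positive values mean it is slower. */
double
relative_difference(double tl_runtime, double ldc_runtime)
{
    return (tl_runtime - ldc_runtime) / ldc_runtime;
}
```

For example, runtimes of 1.5s (two-locus) vs 1.0s (LD Calculator) give a relative difference of 0.5 (50% slower), while 0.8s vs 1.0s gives -0.2 (20% faster).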
Next, we generated a larger tree sequence with L=6e6 and sample sizes ranging from $10^3$ to $10^6$. Comparing all optimizations on the smaller (panel A) and larger (panel B) tree sequences, we get the following results:
I'm not sure that there's any desire to deprecate/remove the LD Calculator at this point, but at least this provides some real-world numbers on the relative performance between these two methods. I'd like to incorporate these patches into the codebase if at all possible.
cc @jeromekelleher @petrelharp @apragsdale
Excellent - great to get up to speed with the old LD calc code!
Great, thanks a lot @lkirk! Really nice to see these efficiency improvements.
I'd suggest opening a single PR that includes each of these changes, and then we can move on to documenting the LD Matrix methods.
Just to be clear, what's being plotted? You've said the y-axis is "( tl − ldc ) / ldc where ldc is the LD Calculator and tl is the two-locus framework." I'm interpreting that to mean that tl is the 'new' method; ldc is the 'old' method, and the values are runtimes, but then wouldn't bigger numbers be worse, not better?
Also btw it'd be nice to plot the absolute runtimes of the tests here, so we know what scale of compute we're looking at.
@petrelharp I chose relative change because it is a relatively intuitive measurement to me; I didn't have a strong opinion about which sign should indicate an improvement. And yes, positive numbers are worse in the way I've set it up (the two-locus framework is slower than the LD Calculator when numbers are positive).
Here are the raw runtime plots from all combined patches. All of the raw runtime plots can be found in Section 3.2 of the document I uploaded (link).
Benchmarks on the smaller tree sequences
Benchmarks on the larger tree sequences
This looks pretty compelling to me, given the increase in flexibility of the new code.
I've opened #3291 for further review. If there are questions about the specific implementation, I'd suggest looking at Section 3.2 of the optimizations document (link). I'd be happy to answer any questions, but many of the details (especially with respect to count precomputation) are in there.
Okay, thanks. Just to make sure I know what's going on: the new code is slower in some cases, by quite a bit in relative terms, but those 'bad' cases are the smaller ones, and the absolute difference is actually not a big deal. And the new code is much better in a lot of other ways. And, now I see that each of the optimizations is improving things. That sounds great; and I agree, the absolute timing plots are very compelling.
ps. nice work!!
@petrelharp, thank you for taking a look. I missed the notifications for your messages.
Also, I just wanted to mention that while looking through the code comments, I realized that Section 3.2.4 of my document was incorrect. I've fixed the issue and fleshed out the explanation a bit; the document's description of how we store sample sets internally for LD matrices should now accurately reflect the code. The links above have been updated accordingly.
Ah, very nice writeup. I like Figure 13 there; very clear.
And, very nice work!!