Distance index 2
Changelog Entry
To be copied to the draft changelog by merger:
- Update distance index 2 to be more efficient and make clustering faster
- Minimizers now have a payload with two ints
- The index registry now defaults to building a new distance index for giraffe, but it will still work with the old version and it will still make minimizers for the old version if it's given an old distance index
- Existing DI2s will need to be rebuilt, as will minimizers that use the old distance index, because the payload is bigger
Description
This updates the distance index and makes the new version the default for giraffe (but not mpmap). Giraffe is almost as fast as with the old distance index for the hgsvc and 1000gp graphs, and is faster for the hprc graph.
graph/cache | dist size (GB) | min size (GB) | giraffe memory (GB) | speed single end | speed paired end
1000gp old | 19 | 26 | 68.3 | 4382.17 | 4607.25
1000gp new | 11 | 43 | 62.7 | 4023.52 | 4019.48
hgsvc old | 8.5 | 14 | 36.3 | 5805.82 | 5408.39
hgsvc new | 1.5 | 23 | 35.2 | 5571.27 | 4853.42
hprc old | 4.3 | 26 | 44.8 | 3494.98 | 3442.74
hprc new | 1.9 | 43 | 59.4 | 4007.94 | 3538.62
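To make the old-versus-new comparison easier to read, here is a minimal Python sketch that turns the speed columns of the table above into relative changes. The numbers are copied from the table. The `percent_change` helper and the variable names are illustrative only and are not part of vg, and the sketch assumes that a larger speed value means faster mapping.

```python
# Speeds copied from the table above, as (single end, paired end).
# Assumption: larger values mean faster mapping.
speeds = {
    "1000gp": {"old": (4382.17, 4607.25), "new": (4023.52, 4019.48)},
    "hgsvc": {"old": (5805.82, 5408.39), "new": (5571.27, 4853.42)},
    "hprc": {"old": (3494.98, 3442.74), "new": (4007.94, 3538.62)},
}


def percent_change(old, new):
    """Relative change from old to new, in percent (hypothetical helper)."""
    return (new - old) / old * 100.0


# For each graph, compare the new distance index against the old one.
changes = {
    graph: tuple(percent_change(o, n) for o, n in zip(v["old"], v["new"]))
    for graph, v in speeds.items()
}

for graph, (single, paired) in changes.items():
    print(f"{graph}: single end {single:+.1f}%, paired end {paired:+.1f}%")
```

A negative value means the new distance index was slower on that graph in this benchmark, and a positive value means it was faster.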
This is actually a bit slower since I merged vg/master, but I don't think the slowdown came from anything in the distance index or clustering. ddd07b is before merging master, 413887 is after merging master, and eebc21 is this PR.
commit | 1000gp single | 1000gp paired | hgsvc single | hgsvc paired | hprc single | hprc paired
ddd07b | 4142.25 | 4186.96 | 5814.54 | 5057.88 | 4084.02 | 3721.58
413887 | 3989.71 | 4052.68 | 5567.94 | 4861.54 | 4004.69 | 3560.97
eebc21 | 4023.52 | 4019.48 | 5571.27 | 4853.42 | 4007.94 | 3538.62
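As a rough check on where the slowdown appears, the following Python sketch computes the relative change between consecutive commits in the table above. The numbers are copied from the table. The helper and variable names are illustrative only, and the sketch assumes that a larger speed value means faster mapping.

```python
# Column order follows the table above.
columns = [
    "1000gp single", "1000gp paired",
    "hgsvc single", "hgsvc paired",
    "hprc single", "hprc paired",
]

# Speeds copied from the table above, keyed by commit.
speeds = {
    "ddd07b": [4142.25, 4186.96, 5814.54, 5057.88, 4084.02, 3721.58],
    "413887": [3989.71, 4052.68, 5567.94, 4861.54, 4004.69, 3560.97],
    "eebc21": [4023.52, 4019.48, 5571.27, 4853.42, 4007.94, 3538.62],
}


def percent_change(before, after):
    """Relative change from before to after, in percent (hypothetical helper)."""
    return (after - before) / before * 100.0


# Change across the merge of master (ddd07b -> 413887).
merge_effect = [
    percent_change(b, a) for b, a in zip(speeds["ddd07b"], speeds["413887"])
]
# Change from the post-merge commit to this PR (413887 -> eebc21).
pr_effect = [
    percent_change(b, a) for b, a in zip(speeds["413887"], speeds["eebc21"])
]

for name, merge, pr in zip(columns, merge_effect, pr_effect):
    print(f"{name}: merge {merge:+.2f}%, after merge {pr:+.2f}%")
```

With these numbers, every column gets slower across the merge, while the change from 413887 to eebc21 stays under 1% in either direction.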
It's also a bit wrong when I run giraffe with the clustering check turned on, but not by much, I think.
@adamnovak @jltsiren Could you look this over for me please? All the clustering code is in src/snarl_seed_clusterer
Thanks, Jouni!