vg

Distance index 2

Open xchang1 opened this issue 3 years ago • 4 comments

Changelog Entry

To be copied to the draft changelog by merger:

  • Update distance index 2 to be more efficient and to make clustering faster
  • Minimizers now have a payload with two ints
  • The index registry now defaults to building a new distance index for giraffe. It still works with the old version, and it still makes minimizers for the old version if it is given an old distance index
  • Existing DI2s will need to be rebuilt, as will minimizers that use the old distance index, because the payload is bigger

Description

This updates the distance index and makes the new version the default for giraffe (but not mpmap). Giraffe is almost as fast as with the old distance index for the hgsvc and 1000gp graphs, and is faster for the hprc graph.

graph/cache  |  dist size (GB)  |  min size (GB)  |  giraffe memory (GB)  |  speed single end  |  speed paired end
1000gp old   |  19              |  26             |  68.3                 |  4382.17           |  4607.25
1000gp new   |  11              |  43             |  62.7                 |  4023.52           |  4019.48
hgsvc old    |  8.5             |  14             |  36.3                 |  5805.82           |  5408.39
hgsvc new    |  1.5             |  23             |  35.2                 |  5571.27           |  4853.42
hprc old     |  4.3             |  26             |  44.8                 |  3494.98           |  3442.74
hprc new     |  1.9             |  43             |  59.4                 |  4007.94           |  3538.62
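
For a quick read of the old-versus-new numbers, the relative changes can be computed directly from the table. This is only a sketch: the values are copied from the table above, the names `RESULTS`, `COLUMNS`, `percent_change`, and `compare` are made up for illustration, and it assumes that a higher speed value means faster mapping (the table does not state the unit).

```python
# Benchmark values copied from the table above, per graph and index version.
# Tuple order: dist size (GB), min size (GB), giraffe memory (GB),
# single-end speed, paired-end speed.
COLUMNS = ("dist_gb", "min_gb", "memory_gb", "speed_single", "speed_paired")

RESULTS = {
    "1000gp": {"old": (19, 26, 68.3, 4382.17, 4607.25),
               "new": (11, 43, 62.7, 4023.52, 4019.48)},
    "hgsvc": {"old": (8.5, 14, 36.3, 5805.82, 5408.39),
              "new": (1.5, 23, 35.2, 5571.27, 4853.42)},
    "hprc": {"old": (4.3, 26, 44.8, 3494.98, 3442.74),
             "new": (1.9, 43, 59.4, 4007.94, 3538.62)},
}


def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def compare(results):
    """Percent change of every column, old index -> new index, per graph."""
    return {
        graph: {
            column: percent_change(old, new)
            for column, old, new in zip(COLUMNS, runs["old"], runs["new"])
        }
        for graph, runs in results.items()
    }


CHANGES = compare(RESULTS)

if __name__ == "__main__":
    for graph, row in CHANGES.items():
        cells = ", ".join(f"{column} {value:+.1f}%" for column, value in row.items())
        print(f"{graph}: {cells}")
```

Under that assumption, the distance index gets smaller and the minimizer index gets larger on all three graphs, while the speed change is negative for 1000gp and hgsvc and positive for hprc.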

xchang1 Sep 11 '22 04:09

This is actually a bit slower since I merged vg/master, but I don't think the slowdown came from anything in the distance index or clustering. ddd07b is before merging master, 413887 is after merging master, and eebc21 is this PR.

commit  |  1000gp single |  1000gp paired |  hgsvc single |  hgsvc paired |  hprc single |  hprc paired
ddd07b  |  4142.25       |  4186.96       |  5814.54      |  5057.88      |  4084.02     |  3721.58
413887  |  3989.71       |  4052.68       |  5567.94      |  4861.54      |  4004.69     |  3560.97
eebc21  |  4023.52       |  4019.48       |  5571.27      |  4853.42      |  4007.94     |  3538.62
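
To separate the effect of merging master from the effect of this PR, the same kind of arithmetic can be applied to the per-commit speeds. Again a sketch only: the numbers are copied from the table above, the names `SPEEDS`, `MERGE_DELTAS`, and `PR_DELTAS` are illustrative, and higher values are assumed to mean faster.

```python
# Speeds copied from the per-commit table above.
BENCHMARKS = ("1000gp single", "1000gp paired", "hgsvc single",
              "hgsvc paired", "hprc single", "hprc paired")

SPEEDS = {
    "ddd07b": (4142.25, 4186.96, 5814.54, 5057.88, 4084.02, 3721.58),  # before merging master
    "413887": (3989.71, 4052.68, 5567.94, 4861.54, 4004.69, 3560.97),  # after merging master
    "eebc21": (4023.52, 4019.48, 5571.27, 4853.42, 4007.94, 3538.62),  # this PR
}


def relative_change_pct(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


def deltas(speeds, from_commit, to_commit):
    """Percent speed change per benchmark between two commits."""
    return {
        name: relative_change_pct(before, after)
        for name, before, after in zip(BENCHMARKS, speeds[from_commit], speeds[to_commit])
    }


MERGE_DELTAS = deltas(SPEEDS, "ddd07b", "413887")  # effect of merging master
PR_DELTAS = deltas(SPEEDS, "413887", "eebc21")     # 413887 compared with this PR

if __name__ == "__main__":
    for name in BENCHMARKS:
        print(f"{name}: merge {MERGE_DELTAS[name]:+.2f}%, PR {PR_DELTAS[name]:+.2f}%")
```

With these numbers, every benchmark drops by roughly 2 to 4 percent across the master merge, while the difference between 413887 and eebc21 stays under 1 percent in either direction.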

xchang1 Sep 11 '22 04:09

It's also a bit wrong if I run giraffe with the clustering check turned on, but not by much, I think.

xchang1 Sep 11 '22 04:09

@adamnovak @jltsiren Could you look this over for me, please? All the clustering code is in src/snarl_seed_clusterer.

xchang1 Sep 21 '22 18:09

Thanks, Jouni!

xchang1 Sep 26 '22 20:09