Support larger alphabets and k via generic kmer_t
Hi @jermp ! This is a continuation of my work that I mentioned in #35.
With this change, I rewrite the current implementation of kmer_t as a full-fledged structure that handles bit manipulations over kmers. I think this helps compartmentalize the code that deals with the conversion between alphabet characters and bits, gives better semantics to some existing pieces of code (potentially improving their readability), and also allows using it with other alphabets.
I tried my best to avoid performance sacrifices while making it (hopefully) generic enough for the needs of Metagraph. The new kmer type still assumes that the underlying storage is an unsigned integer whose number of bits is divisible by 64, but it now supports arbitrarily large integer types (so 128-bit integers, or custom types with e.g. 256 bits, should also work). To accommodate larger alphabets, I keep the assumption that each alphabet character uses a fixed number of bits, but that number is now a template parameter rather than the hard-coded constant 2. I tried to highlight all the places where it was previously used as a hard-coded value.
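To make the idea concrete, here is a minimal sketch of what such a templated kmer type could look like; the names and the exact interface are mine and purely illustrative, not the actual code of the PR:

```cpp
#include <cstdint>

// Illustrative sketch only (names and interface are hypothetical, not the PR code).
// `Int` is the unsigned type backing the kmer (uint64_t, __uint128_t, a wider
// custom type, ...); `BitsPerChar` replaces the previously hard-coded constant 2.
template <typename Int, uint8_t BitsPerChar>
struct basic_kmer {
    static constexpr uint8_t bits_per_char = BitsPerChar;
    static constexpr uint16_t max_chars = sizeof(Int) * 8 / bits_per_char;

    Int bits = 0;

    // append a character code into the low bits
    void push_char(uint64_t c) { bits = (bits << bits_per_char) | Int(c); }

    // extract the i-th character code (0 = least significant)
    uint64_t char_at(uint16_t i) const {
        return uint64_t(bits >> (i * bits_per_char)) & ((uint64_t(1) << bits_per_char) - 1);
    }
};

// DNA with 2 bits per character, as before:
using kmer64  = basic_kmer<uint64_t, 2>;      // 32 characters fit; SSHash uses k <= 31 here
using kmer128 = basic_kmer<__uint128_t, 2>;   // 64 characters fit; SSHash uses k <= 63 here
```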
I also compared two implementations using the first example command from the readme file:
./sshash build -i ../data/unitigs_stitched/salmonella_enterica_k31_ust.fa.gz -k 31 -m 13 --bench -o salmonella_enterica.index
You may see the resulting stats below:
| before Pull Request | after Pull Request (with uint64_t) |
|---|---|
I guess there is some slight slowdown, especially if you switch to __uint128_t, but it also seems to me that the slowdown is not very significant compared to the generality that this approach brings.
And, of course, I also ran all the checks in debug mode, both on the example command above and on the datasets se.ust.k47.fa.gz and se.ust.k63.fa.gz. If you think this change aligns with your vision for SSHash, I would be happy if you could review it, so that it can potentially be merged in the future.
Hi @adamant-pwn and thanks for this! I quickly skimmed through the commits and the code seems really well done.
So, as far as I can see, you use a template for the internal storage of a kmer. I wonder whether you could test SSHash with __uint128_t for k=63 on larger datasets, for example with the datasets I have here: https://zenodo.org/records/7239205. You could try human.63 or something similar for a larger test. The performance should be compared with the numbers here: https://github.com/jermp/sshash/tree/master/benchmarks.
Along the same lines, do you think one could use __uint256_t? As far as I know, it is not standard yet, nor well supported as an "integer" data type, since it is actually a vector type...
I'm curious to see what could happen with, say, k=96 or k=127. The space of SSHash should improve a lot!
Thanks, -Giulio
Thanks for the pointers; I will go over the datasets you shared and get back with results here.
In the meantime, could you elaborate a bit on which __uint256_t you are referring to exactly? I think Metagraph uses the type from sdsl; do you want me to check that one?
Theoretically, I could also make some changes to support std::bitset<256>, as we mostly work with bit operations anyway; the exception is when we have to compute the mask (1<<b)-1, but that mask can just as well be constructed in increments of 64 bits, similarly to how the reverse-complement is computed. I'd expect a significant slowdown with standard bitsets, but it would allow arbitrarily large kmers.
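For illustration, a mask with the lowest b bits set could be assembled 64 bits at a time roughly like this (a sketch with my own naming; it only assumes the wide type supports construction from uint64_t plus shifts and ORs, which is what std::bitset<256> offers):

```cpp
#include <cstdint>

// Build the mask with the b lowest bits set for a wide unsigned type W,
// without relying on a native (W(1) << b) - 1.
// Sketch only; assumes 0 <= b <= 8 * sizeof(W).
template <typename W>
W low_bits_mask(unsigned b) {
    W mask = 0;
    unsigned shift = 0;
    while (b >= 64) {              // fill full 64-bit chunks
        mask |= W(~uint64_t(0)) << shift;
        shift += 64;
        b -= 64;
    }
    if (b > 0) {                   // fill the remaining partial chunk
        mask |= W((uint64_t(1) << b) - 1) << shift;
    }
    return mask;
}
```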
Also do you have any datasets to check against 256-bit kmers?
Hi @adamant-pwn, thanks for your willingness to test this out!
You're right, sorry for the confusion: I was referring to __m256i. In general, it would be very nice to be able to support kmers longer than 63 (hence needing more than 128 bits; for example, can we try this 256-bit library, https://github.com/calccrypto/uint256_t?). That's why in my previous answer in https://github.com/jermp/sshash/issues/35 I suggested looking for C++ libraries for arbitrarily long kmers.
-Giulio
Hi @jermp, I tried running build on human.k63 with __uint128_t for k=63 and m=31, and also with uint256_t from calccrypto. Unfortunately, uint256_t is extremely slow (e.g., reading seems to take 15x longer compared to __uint128_t), which makes it quite impractical to test on large datasets. I will look into other available 256-bit types and/or try to figure out the reason behind the slowdown.
In the meantime, please see below the results for human.k63 with the 128-bit type:
Bench log with __uint128_t
k = 63, m = 31, seed = 1, l = 6, c = 3, canonical_parsing = false, weighted = false
reading file '../data/unitigs_stitched/human.k63.unitigs.fa.ust.fa.gz'...
m_buffer_size 29411764
read 100000 sequences, 105696573 bases, 99496635 kmers
read 200000 sequences, 214697764 bases, 202297826 kmers
read 300000 sequences, 314555962 bases, 295956024 kmers
read 400000 sequences, 417096019 bases, 392296081 kmers
sorting buffer...
saving to file './sshash.tmp.run_1706705954074597201.minimizers.0.bin'...
read 500000 sequences, 531577423 bases, 500577485 kmers
read 600000 sequences, 641881766 bases, 604681828 kmers
read 700000 sequences, 753533625 bases, 710133687 kmers
read 800000 sequences, 860572728 bases, 810972790 kmers
read 900000 sequences, 966201746 bases, 910401808 kmers
sorting buffer...
saving to file './sshash.tmp.run_1706705954074597201.minimizers.1.bin'...
read 1000000 sequences, 1074799974 bases, 1012800036 kmers
read 1100000 sequences, 1177786303 bases, 1109586365 kmers
read 1200000 sequences, 1284445481 bases, 1210045543 kmers
read 1300000 sequences, 1387959208 bases, 1307359270 kmers
read 1400000 sequences, 1496344159 bases, 1409544221 kmers
sorting buffer...
saving to file './sshash.tmp.run_1706705954074597201.minimizers.2.bin'...
read 1500000 sequences, 1601642232 bases, 1508642294 kmers
read 1600000 sequences, 1704579905 bases, 1605379967 kmers
read 1700000 sequences, 1809734567 bases, 1704334629 kmers
read 1800000 sequences, 1913668987 bases, 1802069049 kmers
read 1900000 sequences, 2015410398 bases, 1897610460 kmers
sorting buffer...
saving to file './sshash.tmp.run_1706705954074597201.minimizers.3.bin'...
read 2000000 sequences, 2111050791 bases, 1987050853 kmers
read 2100000 sequences, 2208479896 bases, 2078279958 kmers
read 2200000 sequences, 2297838310 bases, 2161438372 kmers
read 2300000 sequences, 2384042166 bases, 2241442228 kmers
read 2400000 sequences, 2476010359 bases, 2327210421 kmers
read 2500000 sequences, 2554605267 bases, 2399605329 kmers
sorting buffer...
saving to file './sshash.tmp.run_1706705954074597201.minimizers.4.bin'...
read 2600000 sequences, 2632827152 bases, 2471627214 kmers
read 2700000 sequences, 2705855836 bases, 2538455898 kmers
read 2800000 sequences, 2777866668 bases, 2604266730 kmers
read 2900000 sequences, 2846376916 bases, 2666576978 kmers
read 3000000 sequences, 2913930048 bases, 2727930110 kmers
sorting buffer...
saving to file './sshash.tmp.run_1706705954074597201.minimizers.5.bin'...
read 3079563 sequences, 2961741299 bases, 2770808393 kmers
num_kmers 2770808393
num_super_kmers 165899482
num_pieces 3079564 (+0.137818 [bits/kmer])
=== step 1: 'parse_file' 646.326 [sec] (233.263 [ns/kmer])
== files to merge = 6
num_written_tuples = 50000000
num_written_tuples = 100000000
num_written_tuples = 150000000
num_written_tuples = 165899482
=== step 2: 'build_minimizers' 104.043 [sec] (37.5497 [ns/kmer])
bits_per_offset = ceil(log2(2961741362)) = 32
m_buffer_size 20833333
sorting buffer...
saving to file './sshash.tmp.run_1706706704653830813.bucket_pairs.0.bin'...
num_singletons 147668623/150850485 (97.8907%)
=== step 3: 'build_index' 123.475 [sec] (44.5628 [ns/kmer])
max_num_super_kmers_in_bucket 36004
log2_max_num_super_kmers_in_bucket 16
num_buckets_in_skew_index 23561/150850485(0.0156188%)
num_partitions 7
computing partitions...
num_kmers belonging to buckets of size > 64 and <= 128: 17348731
num_kmers belonging to buckets of size > 128 and <= 256: 14340210
num_kmers belonging to buckets of size > 256 and <= 512: 11881499
num_kmers belonging to buckets of size > 512 and <= 1024: 11484349
num_kmers belonging to buckets of size > 1024 and <= 2048: 8910908
num_kmers belonging to buckets of size > 2048 and <= 4096: 6277830
num_kmers belonging to buckets of size > 4096 and <= 36004: 10551588
num_kmers_in_skew_index 80795115(2.91594%)
building PTHash mphfs (with 8 threads) and positions...
lower 64; upper 128; num_bits_per_pos 7; keys_in_partition.size() 17348731
built mphs[0] for 17348731 keys; bits/key = 2.3249
built positions[0] for 17348731 keys; bits/key = 7.00002
lower 128; upper 256; num_bits_per_pos 8; keys_in_partition.size() 14340210
built mphs[1] for 14340210 keys; bits/key = 2.34678
built positions[1] for 14340210 keys; bits/key = 8.00003
lower 256; upper 512; num_bits_per_pos 9; keys_in_partition.size() 11881499
built mphs[2] for 11881499 keys; bits/key = 2.24441
built positions[2] for 11881499 keys; bits/key = 9.00003
lower 512; upper 1024; num_bits_per_pos 10; keys_in_partition.size() 11484349
built mphs[3] for 11484349 keys; bits/key = 2.2486
built positions[3] for 11484349 keys; bits/key = 10
lower 1024; upper 2048; num_bits_per_pos 11; keys_in_partition.size() 8910908
built mphs[4] for 8910908 keys; bits/key = 2.28181
built positions[4] for 8910908 keys; bits/key = 11
lower 2048; upper 4096; num_bits_per_pos 12; keys_in_partition.size() 6277830
built mphs[5] for 6277830 keys; bits/key = 2.32959
built positions[5] for 6277830 keys; bits/key = 12.0001
lower 4096; upper 36004; num_bits_per_pos 16; keys_in_partition.size() 10551588
built mphs[6] for 10551588 keys; bits/key = 2.25964
built positions[6] for 10551588 keys; bits/key = 16
num_bits_for_skew_index 985400144(0.355636 [bits/kmer])
=== step 4: 'build_skew_index' 120.959 [sec] (43.6548 [ns/kmer])
=== total_time 994.803 [sec] (359.03 [ns/kmer])
total index size: 1617.41 [MB]
SPACE BREAKDOWN:
minimizers: 0.154931 [bits/kmer]
pieces: 0.015002 [bits/kmer]
num_super_kmers_before_bucket: 0.0904985 [bits/kmer]
offsets: 1.91597 [bits/kmer]
strings: 2.13782 [bits/kmer]
skew_index: 0.355636 [bits/kmer]
weights: 5.31253e-07 [bits/kmer]
weight_interval_values: 9.23918e-08 [bits/kmer]
weight_interval_lengths: 3.46469e-07 [bits/kmer]
weight_dictionary: 9.23918e-08 [bits/kmer]
--------------
total: 4.66986 [bits/kmer]
avg_nanosec_per_positive_lookup 3659.03
avg_nanosec_per_negative_lookup 4382.01
avg_nanosec_per_positive_lookup_advanced 3074
avg_nanosec_per_negative_lookup_advanced 3624.09
avg_nanosec_per_access 1380.77
iterator: avg_nanosec_per_kmer 27.3282
I used the command
./sshash build -i ../data/unitigs_stitched/human.k63.unitigs.fa.ust.fa.gz -k 63 -m 31 --bench
with sshash built in Release mode; the command was executed on an AMD EPYC-Rome (15) @ 2.249GHz.
Hi @adamant-pwn,
thank you very much for the update! Nice that the generalization does not compromise anything for k <= 63 (i.e., when using __uint128_t as the internal representation).
Of course a 15x slowdown is not acceptable. I guess this comes from the uint256_t type being inefficient rather than from your code, since your code works very well for k <= 63. We have to come up with a more efficient way of handling larger kmers...
Hi @jermp,
I spent a bit more time investigating possible ways to use a 256-bit type. At first, I made some changes to accommodate std::bitset<256>, but it was, unsurprisingly, also quite slow. I couldn't find a way to speed up calccrypto's implementation either, and since I only really need some bit operations and not all the other arithmetic, I decided to just implement something ad hoc.
I ended up making a new structure bitpack<Int, height> that represents $b \cdot 2^{h}$-bit integers by putting builtin $b$-bit integers into the leaves of a templated full binary tree of height $h$. In particular, we can use bitpack<__uint128_t, 1> or bitpack<uint64_t, 2> to emulate $256$-bit integers with bit operations, or, more generally, bitpack<__uint128_t, h> for $128 \cdot 2^h$-bit integers.
For really long numbers, it would probably be more efficient to store them in a plain array or vector (like std::bitset does) rather than in a tree, but the tree structure should be fine as long as $k$-mers are not very large (and if they were, computational complexity would become a bottleneck for other reasons anyway).
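To give a rough idea of the recursion, here is a simplified sketch with just the bitwise operators (the actual bitpack in the PR also needs shifts that cross the boundary between the two halves, comparisons, and so on):

```cpp
#include <cstdint>

// Sketch of the idea behind bitpack<Int, height>: a full binary tree of
// builtin integers, with bitwise operators defined recursively on two halves.
template <typename Int, int height>
struct bitpack {
    using half = bitpack<Int, height - 1>;
    half lo, hi;  // low and high halves of the number

    bitpack operator&(const bitpack& o) const { return {lo & o.lo, hi & o.hi}; }
    bitpack operator|(const bitpack& o) const { return {lo | o.lo, hi | o.hi}; }
    bitpack operator^(const bitpack& o) const { return {lo ^ o.lo, hi ^ o.hi}; }
};

// Base case: height 0 is just the builtin integer itself.
template <typename Int>
struct bitpack<Int, 0> {
    Int value;
    bitpack operator&(const bitpack& o) const { return {Int(value & o.value)}; }
    bitpack operator|(const bitpack& o) const { return {Int(value | o.value)}; }
    bitpack operator^(const bitpack& o) const { return {Int(value ^ o.value)}; }
};

// 256-bit "integers" (bitwise part only) in two equivalent ways:
using u256_a = bitpack<__uint128_t, 1>;
using u256_b = bitpack<uint64_t, 2>;
```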
I removed every string shorter than 127 characters from human.k63, and ran the benchmark again:
Bench log with bitpack<__uint128_t> and k=127
k = 127, m = 31, seed = 1, l = 6, c = 3, canonical_parsing = false, weighted = false
reading file '../data/unitigs_stitched/human.k127.unitigs.fa.ust.fa.gz'...
m_buffer_size 29411764
read 100000 sequences, 182189115 bases, 169589241 kmers
read 200000 sequences, 353905567 bases, 328705693 kmers
read 300000 sequences, 535877434 bases, 498077560 kmers
read 400000 sequences, 717529176 bases, 667129302 kmers
read 500000 sequences, 894938693 bases, 831938819 kmers
read 600000 sequences, 1069155657 bases, 993555783 kmers
read 700000 sequences, 1238429639 bases, 1150229765 kmers
read 800000 sequences, 1407403279 bases, 1306603405 kmers
sorting buffer...
saving to file './sshash.tmp.run_1707337830575510935.minimizers.0.bin'...
read 900000 sequences, 1577475282 bases, 1464075408 kmers
read 1000000 sequences, 1743208932 bases, 1617209058 kmers
read 1100000 sequences, 1907995326 bases, 1769395452 kmers
read 1200000 sequences, 2065574063 bases, 1914374189 kmers
read 1300000 sequences, 2216324177 bases, 2052524303 kmers
read 1400000 sequences, 2362189594 bases, 2185789720 kmers
read 1500000 sequences, 2500616290 bases, 2311616416 kmers
read 1600000 sequences, 2626636583 bases, 2425036709 kmers
read 1700000 sequences, 2745079914 bases, 2530880040 kmers
sorting buffer...
saving to file './sshash.tmp.run_1707337830575510935.minimizers.1.bin'...
read 1797875 sequences, 2853028046 bases, 2626495796 kmers
num_kmers 2626495796
num_super_kmers 55359550
num_pieces 1797876 (+0.172498 [bits/kmer])
=== step 1: 'parse_file' 1869.21 [sec] (711.673 [ns/kmer])
== files to merge = 2
num_written_tuples = 50000000
num_written_tuples = 55359550
=== step 2: 'build_minimizers' 26.5036 [sec] (10.0909 [ns/kmer])
bits_per_offset = ceil(log2(2853028173)) = 32
m_buffer_size 20833333
sorting buffer...
saving to file './sshash.tmp.run_1707339726346319196.bucket_pairs.0.bin'...
num_singletons 50839844/51714801 (98.3081%)
=== step 3: 'build_index' 33.2175 [sec] (12.6471 [ns/kmer])
max_num_super_kmers_in_bucket 29204
log2_max_num_super_kmers_in_bucket 15
num_buckets_in_skew_index 5287/51714801(0.0102234%)
num_partitions 7
computing partitions...
num_kmers belonging to buckets of size > 64 and <= 128: 10007008
num_kmers belonging to buckets of size > 128 and <= 256: 7471182
num_kmers belonging to buckets of size > 256 and <= 512: 6508079
num_kmers belonging to buckets of size > 512 and <= 1024: 5386165
num_kmers belonging to buckets of size > 1024 and <= 2048: 3677016
num_kmers belonging to buckets of size > 2048 and <= 4096: 2186595
num_kmers belonging to buckets of size > 4096 and <= 29204: 4977749
num_kmers_in_skew_index 40213794(1.53108%)
building PTHash mphfs (with 8 threads) and positions...
lower 64; upper 128; num_bits_per_pos 7; keys_in_partition.size() 10007008
built mphs[0] for 10007008 keys; bits/key = 2.26646
built positions[0] for 10007008 keys; bits/key = 7.00004
lower 128; upper 256; num_bits_per_pos 8; keys_in_partition.size() 7471182
built mphs[1] for 7471182 keys; bits/key = 2.30375
built positions[1] for 7471182 keys; bits/key = 8.00004
lower 256; upper 512; num_bits_per_pos 9; keys_in_partition.size() 6508079
built mphs[2] for 6508079 keys; bits/key = 2.32488
built positions[2] for 6508079 keys; bits/key = 9.00005
lower 512; upper 1024; num_bits_per_pos 10; keys_in_partition.size() 5386165
built mphs[3] for 5386165 keys; bits/key = 2.35027
built positions[3] for 5386165 keys; bits/key = 10.0001
lower 1024; upper 2048; num_bits_per_pos 11; keys_in_partition.size() 3677016
built mphs[4] for 3677016 keys; bits/key = 2.26738
built positions[4] for 3677016 keys; bits/key = 11.0001
lower 2048; upper 4096; num_bits_per_pos 12; keys_in_partition.size() 2186595
built mphs[5] for 2186595 keys; bits/key = 2.34717
built positions[5] for 2186595 keys; bits/key = 12.0002
lower 4096; upper 29204; num_bits_per_pos 15; keys_in_partition.size() 4977749
built mphs[6] for 4977749 keys; bits/key = 2.36104
built positions[6] for 4977749 keys; bits/key = 15.0001
num_bits_for_skew_index 476511808(0.181425 [bits/kmer])
=== step 4: 'build_skew_index' 49.42 [sec] (18.816 [ns/kmer])
=== total_time 1978.35 [sec] (753.226 [ns/kmer])
total index size: 1026.85 [MB]
SPACE BREAKDOWN:
minimizers: 0.0575382 [bits/kmer]
pieces: 0.00957263 [bits/kmer]
num_super_kmers_before_bucket: 0.0321532 [bits/kmer]
offsets: 0.674475 [bits/kmer]
strings: 2.1725 [bits/kmer]
skew_index: 0.181425 [bits/kmer]
weights: 5.60443e-07 [bits/kmer]
weight_interval_values: 9.74683e-08 [bits/kmer]
weight_interval_lengths: 3.65506e-07 [bits/kmer]
weight_dictionary: 9.74683e-08 [bits/kmer]
--------------
total: 3.12766 [bits/kmer]
avg_nanosec_per_positive_lookup 4598.98
avg_nanosec_per_negative_lookup 5773.25
avg_nanosec_per_positive_lookup_advanced 4263.72
avg_nanosec_per_negative_lookup_advanced 5329.6
avg_nanosec_per_access 1585.56
iterator: avg_nanosec_per_kmer 38.8011
So, while there is some slowdown, I think it's reasonable this time, as there is no builtin support for 256-bit integers and we have to emulate them with 128-bit ones. Also, the index size is reduced quite significantly, from 1617.41 [MB] (4.67 bits/kmer) to 1026.85 [MB] (3.13 bits/kmer), which should make up for the slowdown per kmer.
I also tried running it with --check on some smaller datasets, and it seemed to work correctly. That being said, there is always a possibility that I overlooked something, and I'd appreciate any independent verification of the results.
Hi @adamant-pwn, wow that's very cool! So, if I'm understanding correctly, you use some template magic to let the compiler define recursively the bitwise operators working over two halves. Sweet.
And very nice space result: increasing k improves the space of SSHash. I wonder where it goes in the limit :)
Although storing a human genome in 3.13 bits/kmer is remarkable, I do not think the experiment is truly correct because the kmers should come from a de Bruijn graph built with k=127 and not from parsing the unitigs of a de Bruijn graph of order k=63. So, we should build a dBG with k=127 using BCALM2+UST as explained in the README, and then index it with SSHash. This will give the true number of bits/kmer (which should be close to the one you had). Let me know if something is unclear.
Thank you very much! -Giulio
Thanks for the quick response! Sure, the space result looks good. And I do hope that it is genuine and not the result of some bug in the code that misses or fails to account for something once k goes beyond 63 :smiling_face_with_tear:
Regarding testing on unitigs for k=127, I assume the appropriate raw data is Homo_sapiens.GRCh38.dna.toplevel.fa.gz?
> And I do hope that it is genuine and not the result of some bug in the code that misses or fails to account for something once k goes beyond 63.
I think there must necessarily be some errors when parsing a dBG of order 63 with k=127, but the space result should be quite close though.
Yes, that is the raw data: Homo_sapiens.GRCh38.dna.toplevel.fa.gz.
I would suggest to use GGCAT with the flag --eulertigs to output eulertigs in fasta format that will then be indexed by SSHash.
Ok, below is the log on the "toplevel" dataset. I prepared the input the same way as described in the README:
Bench log on the "toplevel" human DNA, k=127
k = 127, m = 31, seed = 1, l = 6, c = 3, canonical_parsing = false, weighted = false
reading file '../data/unitigs_stitched/Homo_sapiens.GRCh38.dna.toplevel.fa.unitigs.fa.ust.fa.gz'...
m_buffer_size 29411764
read 100000 sequences, 641461998 bases, 628862124 kmers
read 200000 sequences, 1299384399 bases, 1274184525 kmers
sorting buffer...
saving to file './sshash.tmp.run_1707385092832615700.minimizers.0.bin'...
read 300000 sequences, 1765189961 bases, 1727390087 kmers
read 400000 sequences, 2233033487 bases, 2182633613 kmers
read 500000 sequences, 2631516533 bases, 2568516659 kmers
read 600000 sequences, 2921679399 bases, 2846079525 kmers
sorting buffer...
saving to file './sshash.tmp.run_1707385092832615700.minimizers.1.bin'...
sorting buffer...
saving to file './sshash.tmp.run_1707385092832615700.minimizers.2.bin'...
read 608208 sequences, 2934546233 bases, 2857912025 kmers
num_kmers 2857912025
num_super_kmers 58916081
num_pieces 608209 (+0.0536296 [bits/kmer])
=== step 1: 'parse_file' 2035.8 [sec] (712.338 [ns/kmer])
== files to merge = 3
num_written_tuples = 50000000
num_written_tuples = 58916081
=== step 2: 'build_minimizers' 35.5332 [sec] (12.4333 [ns/kmer])
bits_per_offset = ceil(log2(2934546360)) = 32
m_buffer_size 20833333
sorting buffer...
saving to file './sshash.tmp.run_1707387164255862420.bucket_pairs.0.bin'...
num_singletons 51221382/52580188 (97.4157%)
=== step 3: 'build_index' 43.1616 [sec] (15.1025 [ns/kmer])
max_num_super_kmers_in_bucket 28977
log2_max_num_super_kmers_in_bucket 15
num_buckets_in_skew_index 9426/52580188(0.0179269%)
num_partitions 7
computing partitions...
num_kmers belonging to buckets of size > 64 and <= 128: 21978798
num_kmers belonging to buckets of size > 128 and <= 256: 17792929
num_kmers belonging to buckets of size > 256 and <= 512: 16696090
num_kmers belonging to buckets of size > 512 and <= 1024: 18053731
num_kmers belonging to buckets of size > 1024 and <= 2048: 17294062
num_kmers belonging to buckets of size > 2048 and <= 4096: 14098609
num_kmers belonging to buckets of size > 4096 and <= 28977: 20797100
num_kmers_in_skew_index 126711319(4.4337%)
building PTHash mphfs (with 8 threads) and positions...
lower 64; upper 128; num_bits_per_pos 7; keys_in_partition.size() 21978798
built mphs[0] for 21978798 keys; bits/key = 2.2951
built positions[0] for 21978798 keys; bits/key = 7.00002
lower 128; upper 256; num_bits_per_pos 8; keys_in_partition.size() 17792929
built mphs[1] for 17792929 keys; bits/key = 2.32146
built positions[1] for 17792929 keys; bits/key = 8.00002
lower 256; upper 512; num_bits_per_pos 9; keys_in_partition.size() 16696090
built mphs[2] for 16696090 keys; bits/key = 2.32747
built positions[2] for 16696090 keys; bits/key = 9.00002
lower 512; upper 1024; num_bits_per_pos 10; keys_in_partition.size() 18053731
built mphs[3] for 18053731 keys; bits/key = 2.31951
built positions[3] for 18053731 keys; bits/key = 10
lower 1024; upper 2048; num_bits_per_pos 11; keys_in_partition.size() 17294062
built mphs[4] for 17294062 keys; bits/key = 2.32405
built positions[4] for 17294062 keys; bits/key = 11
lower 2048; upper 4096; num_bits_per_pos 12; keys_in_partition.size() 14098609
built mphs[5] for 14098609 keys; bits/key = 2.35036
built positions[5] for 14098609 keys; bits/key = 12
lower 4096; upper 28977; num_bits_per_pos 15; keys_in_partition.size() 20797100
built mphs[6] for 20797100 keys; bits/key = 2.30088
built positions[6] for 20797100 keys; bits/key = 15
num_bits_for_skew_index 1592039440(0.557064 [bits/kmer])
=== step 4: 'build_skew_index' 209.907 [sec] (73.4478 [ns/kmer])
=== total_time 2324.4 [sec] (813.321 [ns/kmer])
total index size: 1199.73 [MB]
SPACE BREAKDOWN:
minimizers: 0.0537208 [bits/kmer]
pieces: 0.00327841 [bits/kmer]
num_super_kmers_before_bucket: 0.0309644 [bits/kmer]
offsets: 0.659683 [bits/kmer]
strings: 2.05363 [bits/kmer]
skew_index: 0.557064 [bits/kmer]
weights: 5.15061e-07 [bits/kmer]
weight_interval_values: 8.95759e-08 [bits/kmer]
weight_interval_lengths: 3.3591e-07 [bits/kmer]
weight_dictionary: 8.95759e-08 [bits/kmer]
--------------
total: 3.35834 [bits/kmer]
avg_nanosec_per_positive_lookup 4558.75
avg_nanosec_per_negative_lookup 5392.28
avg_nanosec_per_positive_lookup_advanced 4565.12
avg_nanosec_per_negative_lookup_advanced 5390.89
avg_nanosec_per_access 1563.39
iterator: avg_nanosec_per_kmer 34.9164
To summarize, the total size is now 1199.73 [MB] at 3.35834 [bits/kmer].
Sorry, I had already started BCALM2+UST yesterday and it took some time to run, so I haven't had an opportunity to try GGCAT yet. If it's still relevant, could you please provide the specific commands I should use to build the input dataset with GGCAT?
Hi @adamant-pwn, ok that makes sense! I think this is the correct result. I would still perform a check to see if all kmers are found, etc. (just add --check at index construction, as you know).
Thanks!
For GGCAT, it should be very easy to use: just run the construction (no colored dBG needed, so without option -c) with -k 127 and --eulertigs.
I'm sorry, I'm still a bit confused as to what I have to do with ggcat exactly.
I built ggcat, and tried using the following command:
$ gzip -d Homo_sapiens.GRCh38.dna.toplevel.fa.gz
$ ggcat build -k 127 -j 8 --eulertigs Homo_sapiens.GRCh38.dna.toplevel.fa
...
Compacted De Bruijn graph construction completed.
TOTAL TIME: 1387.88s
...
Final output saved to: output.fasta.lz4
$ lz4 -d output.fasta.lz4
Decoding file output.fasta
Decompressed : 135 MB Error 68 : Unfinished stream
$ ./sshash build -i ../data/unitigs_stitched/output.fasta -k 127 -m 31
...
total index size: 70.1673 [MB]
SPACE BREAKDOWN:
minimizers: 0.058696 [bits/kmer]
pieces: 0.0261435 [bits/kmer]
num_super_kmers_before_bucket: 0.032964 [bits/kmer]
offsets: 0.628415 [bits/kmer]
strings: 2.53139 [bits/kmer]
skew_index: 0.51633 [bits/kmer]
weights: 9.9489e-06 [bits/kmer]
weight_interval_values: 1.73024e-06 [bits/kmer]
weight_interval_lengths: 6.48841e-06 [bits/kmer]
weight_dictionary: 1.73024e-06 [bits/kmer]
--------------
total: 3.79395 [bits/kmer]
Clearly, something goes wrong here, but I can't tell what exactly. The error during the .lz4 decompression looks very weird, and sshash doesn't seem to work with .lz4 files directly. I also tried running ggcat on the compressed Homo_sapiens.GRCh38.dna.toplevel.fa.gz, but decompressing the lz4 output fails in that case too. Should I apply ggcat to the output of BCALM2+UST rather than to the raw file?
Running --check on the human genome file takes very long. I can try to do it over the weekend, but I also ran it on smaller datasets and everything seemed okay.
I think you did everything right. The output of GGCAT is some lz file that should be decompressed and given to SSHash as input. The point is that there must be a bug in GGCAT since the file did not uncompress without errors. Perhaps we should open an issue on the GGCAT repo.
Can you check if with some other small file GGCAT works just fine?
If so, then it is definitely a bug in GGCAT occurring for larger files.
The error actually persists even with smaller datasets:
$ gzip -d se.ust.k31.fa.gz
$ ggcat build -k 31 -j 8 --eulertigs se.ust.k31.fa
...
Final output saved to: output.fasta.lz4
$ lz4 -d output.fasta.lz4
Decoding file output.fasta
Error 68 : Unfinished stream
I opened a new issue about this in the ggcat repo (https://github.com/algbio/ggcat/issues/42), but maybe I'm just misusing ggcat somehow :thinking:
Thanks. I'll try it myself later and see what happens. Perhaps it is just the --eulertigs flag?
I think I tried without it as well and got the same results, but it would surely be nice to verify independently.
In the meantime, here is the full log with --check and --bench on the k=127 file that I built with BCALM2+UST. I also built sshash in Debug mode for this, so that any triggered assert would make it fail. So far it seems that the check completed successfully.
Beautiful, thanks Oleksandr!
Hi Giulio! I was on vacation for the last two weeks, but am now back and ready to work further on this. That is, if further work is required 🙂
I saw that you accepted a pull request from Harun that creates a slight merge conflict with this PR. I suppose I should resolve it now, which I'll try to do soon. Are there any other actions related to this PR that you'd like me to take?
Hi @adamant-pwn, sorry for the late reply. I do not think there are further actions :) SSHash on such larger k is so nice and powerful!
If you're interested (and given that you worked on this) we can discuss a potential SSHash-v2 paper with extensions and engineering. CC: @rob-p and @RagnarGrootKoerkamp who are definitely interested.
I added a new commit that merges upstream changes from Harun's PR. Should be ready for merge now 🙂
Regarding SSHash-v2 paper: Sure, I would be happy to contribute! Please note though that I don't really have any formal publication experience, so I might need quite a bit of explanation on what exactly is needed from me 🙂
As a side note, the current implementation for large k is more a proof of concept than something that should work for really large k: with larger k we eventually break the assumption that operations on k-mers are $O(1)$, and as things stand, I think there are many places where the number of operations scales with $\frac{k}{64}$ or even $\left(\frac{k}{64}\right)^2$. This is still fine when $\frac{k}{64}$ is a very small constant, but scaling further would probably require a different k-mer model (if that is ever of interest, which might not be the case for biological applications).
Hi @jermp! I noticed that there were some new commits to the master branch on your side, which created a new merge conflict for this PR. I can update the PR to resolve them too, but I'd appreciate it if you could share your plans regarding merging this PR, so that I don't get stuck in a loop of constantly updating it to resolve new merge conflicts :slightly_smiling_face:
Hi @adamant-pwn, sorry you're absolutely right!
What I merged yesterday into the master branch is something I began earlier and wanted to make the default, but whose integration I delayed for no good reason.
You did absolutely great work, and I'm looking forward to integrating this into the master branch. Before doing that, I will run some benchmarks to make sure we are sacrificing neither speed nor space, at least for the currently most important application of SSHash: indexing DNA (i.e., 2-bit encoded strings). But from your reports it looks like your implementation does not sacrifice anything.
As I said before, with @rob-p we're discussing plans for a SSHash-v2 paper and you're more than welcome if you want to be part of that.
Thanks!
-Giulio
Hi @jermp, thanks for the update! I've merged your changes into the PR and also briefly re-ran check and bench on se.ust.k63.fa.gz to make sure I didn't break anything while doing so. Everything still looks good to me:
Run log
okulkov@compute-biomed-05:~/apps/sshash/build$ ./sshash build -i ../data/unitigs_stitched/se.ust.k63.fa.gz -k 63 -m 31 --bench --check
k = 63, m = 31, seed = 1, l = 6, c = 3, canonical_parsing = false, weighted = false
reading file '../data/unitigs_stitched/se.ust.k63.fa.gz'...
m_buffer_size 29411764
sorting buffer...
saving to file './sshash.tmp.run_1711546442838258499.minimizers.0.bin'...
read 238 sequences, 4958815 bases, 4944059 kmers
num_kmers 4944059
num_super_kmers 290990
num_pieces 239 (+0.00599427 [bits/kmer])
=== step 1: 'parse_file' 1.29271 [sec] (261.467 [ns/kmer])
=== step 2: 'build_minimizers' 0.234789 [sec] (47.4891 [ns/kmer])
bits_per_offset = ceil(log2(4958878)) = 23
m_buffer_size 20833333
sorting buffer...
saving to file './sshash.tmp.run_1711546444365949536.bucket_pairs.0.bin'...
num_singletons 289962/290421 (99.842%)
=== step 3: 'build_index' 0.032482 [sec] (6.56991 [ns/kmer])
max_num_super_kmers_in_bucket 8
log2_max_num_super_kmers_in_bucket 3
num_buckets_in_skew_index 0/290421(0%)
=== step 4: 'build_skew_index' 0.001391 [sec] (0.281348 [ns/kmer])
=== total_time 1.56137 [sec] (315.807 [ns/kmer])
total index size: 2.25106 [MB]
SPACE BREAKDOWN:
minimizers: 0.189121 [bits/kmer]
pieces: 0.00109384 [bits/kmer]
num_super_kmers_before_bucket: 0.0921154 [bits/kmer]
offsets: 1.35377 [bits/kmer]
strings: 2.00601 [bits/kmer]
skew_index: 1.29448e-05 [bits/kmer]
weights: 0.000297731 [bits/kmer]
weight_interval_values: 5.17793e-05 [bits/kmer]
weight_interval_lengths: 0.000194172 [bits/kmer]
weight_dictionary: 5.17793e-05 [bits/kmer]
--------------
total: 3.64245 [bits/kmer]
checking correctness of access and positive lookup...
checked 4944059 kmers
EVERYTHING OK!
checking correctness of negative lookup with random kmers...
EVERYTHING OK!
checking correctness of navigational queries for kmers...
checked 4944059 kmers
EVERYTHING OK!
checking correctness of navigational queries for contigs...
checked 238 contigs
EVERYTHING OK!
checking correctness of kmer iterator...
EVERYTHING OK!
checking correctness of contig iterator...
EVERYTHING OK!
avg_nanosec_per_positive_lookup 639.533
avg_nanosec_per_negative_lookup 884.007
avg_nanosec_per_positive_lookup_advanced 639.922
avg_nanosec_per_negative_lookup_advanced 899.111
avg_nanosec_per_access 153.53
iterator: avg_nanosec_per_kmer 24.37
Looking forward to your benchmarks and to merging this into the master branch! Please let me know if there are any further questions or requests about the PR on your end.
Re: SSHash-v2 paper. As I wrote above, I'll be happy to contribute! Just let me know which actions are expected on my end for this :slightly_smiling_face:
Fantastic! Thank you so much. I'll be running some benchmarks soon and keep you posted here.
So, what's the largest kmer length we could use in your opinion and still get some fast queries? k=127 perhaps? In a follow-up paper, we should definitely include results for k in range(47, 127+1, 16).
Oh, and on proteins (as suggested by @rob-p) so that we can enjoy a larger alphabet.
For the largest kmer length, it's hard to tell because we haven't really tried anything beyond k=127 yet. To me, k=127 seemed reasonably fast, but I don't know whether that matches your expectations. We can also try k=255 or k=511 just to see how much time it takes; it should be quite slow, but hopefully still fast enough to at least gather benchmark data.
On the other hand, what's the largest "practically interesting" value of k?
I mean, e.g., if you wanted to look up with k=1023, would it even make sense to build sshash for such a k, instead of building it for k=127 and only looking up the first 127 characters? Or is the threshold perhaps at an even smaller k? I know that on random strings building such a structure for k=20 should be enough to almost never have any errors, but I don't know what the smallest k with this property is for real-life data.
Trying something with proteins would be cool, of course, but there is no test coverage for this case yet, and it could be that I still missed some places where e.g. the number of bits per character is hard-coded as the constant 2 rather than taken from kmer_t::bits_per_char (as you can see from one of the two latest commits); a toy illustration of what I mean is sketched after the next paragraph.
So, I guess we would need to double-check everything on small inputs against some naive algorithms when we try alternative alphabets. Or we can wait a bit until Metagraph integrates sshash to the point where generic alphabets are also used, which should hopefully happen in the foreseeable future :slightly_smiling_face:
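For the toy illustration mentioned above (a hypothetical snippet, not a literal diff from the PR):

```cpp
#include <cstdint>

// Hypothetical before/after of the kind of hard-coded constant I mean;
// the real code paths in SSHash differ.

// before: 2 bits per character baked into the formula
uint64_t kmer_num_bits_before(uint64_t k) { return 2 * k; }

// after: the encoding width comes from the kmer type itself, so a larger
// alphabet (e.g. proteins) only needs a kmer_t with a bigger bits_per_char
template <typename kmer_t>
uint64_t kmer_num_bits_after(uint64_t k) { return kmer_t::bits_per_char * k; }
```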
> For the largest kmer length, it's hard to tell because we haven't really tried anything beyond k=127 yet. To me, k=127 seemed reasonably fast, but I don't know whether that matches your expectations. We can also try k=255 or k=511 just to see how much time it takes; it should be quite slow, but hopefully still fast enough to at least gather benchmark data.
k=127 seems very reasonable to try. I've heard someone mention applications for very large k, although I can't recall where right now.
> On the other hand, what's the largest "practically interesting" value of k?
I think nobody has the answer right now :) k=31 and k=63 are used just because they are handy to handle and are sufficiently long to have good specificity. On the other hand, a very large k can break sensitivity in case of in/dels.
> Trying something with proteins would be cool, of course, but there is no test coverage for this case yet, and it could be that I still missed some places where e.g. the number of bits per character is hard-coded as the constant 2 rather than taken from kmer_t::bits_per_char.
Of course!
Further steps would be:
- do some benchmarks first;
- integrate your changes to the master branch;
- start an email thread with folks potentially interested in a SSHash-v2 paper/software.
Hi @adamant-pwn and @jermp,
There was a "very large k" request in the Cuttlefish repo some time back (https://github.com/COMBINE-lab/cuttlefish/issues/22). I think the most common use case for such a large k would be when one is doing comparative genomics on many very similar genomes. In that case, you'd like k to be large enough to span most repeats to disentangle the graph, and you expect relatively less branching from variation (because the genomes are very similar). Likewise, I'm not sure about 1000, but larger k-mers could become more useful for mapping as we move toward "ultra high quality" long reads.
--Rob
Hi @jermp any updates on merging this (and also on communications re: preparing a new paper)? 👀
I added two minor pull requests from Harun to our fork; they are very straightforward and just fix some warnings. But I expect that some further work might be needed in the future for the integration of MetaGraph with SSHash, and it would be easier if the current PR were merged, so that further changes in our fork don't interfere unduly with the PR.