pthash icon indicating copy to clipboard operation
pthash copied to clipboard

possibility to adapt builders for partitions with single elements

Open hmusta opened this issue 1 year ago • 5 comments

As part of unit testing, I've been creating hash tables with very few keys (e.g., making a sshash dictionary for a sequence set with a single super k-mer), frequently triggering this:

unknown file: Failure
C++ exception with description "each partition must contain more than one key: use less partitions" thrown in the test body.

I was wondering if it would be too much trouble to add support for single-element partitions to fix this?

hmusta avatar May 28 '24 21:05 hmusta

Hi @hmusta, it could be possible in principle but I don't think it makes much sense. The space benefit of any data structure is only well understood when the number of elements it represents is sufficiently large. For example, for a bunch of keys, any MPHF can easily take, say, 100 bits/key just because of the overhead of storing a binary vector made up of 64-bit words or fix-size metadata (such as size, hasher, etc.). Same applies to SSHash: why do you need an index for just one super-kmer? Can't you just scan it to locate the kmers?

jermp avatar Jun 06 '24 06:06 jermp

It's generally considered a good practice to not fail on corner cases.

Small inputs are often used for unit testing, or even manual tests, when you're more concerned with logical correctness of the program, rather than with its memory usage. As such, it would greatly benefit third-party users, such as Metagraph, if sshash wouldn't fail on small inputs and could be plugged in for unit tests as is, rather than having to add questionable hacks to the existing testing framework, to avoid using sshash-related code on small cases only.

That being said, we'd really appreciate it if this could be fixed.

adamant-pwn avatar Jun 06 '24 14:06 adamant-pwn

Hi all, this should be possibile now as of 94663599ab7d092a9301bf8c8709fbf558a04601, for internal memory construction, i.e., when using the method build_in_internal_memory() of both single_phf and partitioned_phf.

This has also been already integrated in SSHash (see latest commits to the master branch).

Can you please check this on your end?

PS. I need some more time (it's been busy hell here...) to reflect this for external memory construction too.

jermp avatar Jun 24 '24 08:06 jermp

Hi @jermp, thanks for the update! But sshash exclusively relies on external memory for minimizers here, so we still require some workarounds for this on our side.

adamant-pwn avatar Jul 01 '24 15:07 adamant-pwn

Yes, I know. I can look into this tomorrow. But this commit lets you build the skew index at least :)

jermp avatar Jul 01 '24 15:07 jermp

Hi @hmusta and @adamant-pwn, should be done as of 3042587a9a8d21032e06be263d1628158967d9dd.

I put a small script to test very small key sets that you can run as:

bash ../test/small_test.sh 1 3 4 > log.txt

Please, let me know if everything works on your end.

jermp avatar Jul 11 '24 07:07 jermp

Closing this for now, but feel free to re-open if any problem arises.

jermp avatar Jul 11 '24 07:07 jermp