KMC icon indicating copy to clipboard operation
KMC copied to clipboard

Converting database formats from KMC1.x to KMC2.x

Open cpockrandt opened this issue 6 years ago • 2 comments

Hi,

I'm using kmc_tools to reduce, intersect, etc. KMC2.x databases. reduce converts my KMC2.x database to a KMC1.x database and I was wondering whether there is a way to convert it back to KMC2.x?

As far as I know KMC2 introduced binning and signatures, so I guess that querying should be faster with a KMC2.x database?

cpockrandt avatar Jul 26 '19 00:07 cpockrandt

Hi,

well, it is not that obvious. In general, signatures are helpful during k-mer counting. At some point, I have done some measurements of querying kmc1 and kmc2 databases, and it turned out that kmc1 performed better. On the other hand, after that, I have spoken to my colleague who also uses kmc tell me that kmc2 format performed better.

Answering your question, currently, it is not possible to convert kmc1 database to kmc2 database. At least not easily and efficiently.

Is querying kmc database a bottleneck in your application?

It should be possible to query kmc database using multiple threads, in case of troubles with this I may help. I have also some idea to maybe increase a little querying, but it is to implement and check in practice. If querying is a bottleneck even using multiple threads let me know and we will consider what we may do.

The reason why the performance difference is not so obvious is as follows:

  1. In case of kmc1 k-mers are stored almost like an ordinary array (there is some trick to reduce query time and space requirement by splitting k-mers to prefixes and suffixes) and querying is simply a binary search
  2. In case of kmc2 there is up to 512 (when kmc run with defaults) arrays, the array for a specific k-mer is recognized by its signature, and binary search starts for such an array. So it seems that querying should work faster because the array to search is smaller, sometimes much smaller (the distribution of arrays sizes is unfortunately not uniform, so some array may be quite big). The problem is that before we choose apropriate array we need to determine k-mer signature, which requires finding minimum m-mer insinde k-mer.

On the other hand, as far as I remember (since I have implemented it long ago), there is also an option to query k-mers of a whole read (or some other sequence). The mean time of searching signature in a single k-mer should be lower than in case when you query for each k-mer of a sequence separatelly.

If you want to compare performance you may do the following (it is very inefficient workaround):

  1. Use kmc_tools dump operation do dump k-mers to text file
  2. Convert this text file to fasta format (probably some simple script will be required, keep in mind that if you want to have correct counters each k-mer must be repeated in the converted file multiple times)
  3. Count k-mers in such a file with kmc3.

Another option if you want to compare querying time is to count k-mers with kmc3, convert database to kmc1 with kmc_tools sort and just check how it will work.

marekkokot avatar Jul 26 '19 06:07 marekkokot

Thanks for your detailed reply!

My tools basically doesn't do anything else than querying k-mers for a read, so it is actually the "bottleneck". I'm already using the function GetCountersForRead(). Since I would assume that one or two cache misses more (during a binary search in the entire k-mer table) are more expensive than the mean running time for computing the signatures, I will benchmark it. That's a good workaround to convert the KMC1.x to a KMC2.x database by constructing a fasta file with each k-mer.

Thanks for the easy-to-read code and the many tools you provide for KMC databases, they are really useful! :)

cpockrandt avatar Aug 03 '19 05:08 cpockrandt