KMC union malfunction

Open TahaAslani opened this issue 6 years ago • 1 comments

Hi,

I am trying to use the complex command to compute the union of a number of k-mer count databases. However, some k-mers do not appear in my final output file. For example, the k-mer AAC exists in 1280.7257 but is missing from the union output when I dump it to a text file; instead, two ACC counts exist, which doesn't make sense. Also note that when counting the k-mers, I used the -b option to count reverse complements of the k-mers as well.

I am using KMC 3.1.1 (2019-05-19), and the command I run is "kmc_tools complex MERGE.txt", where MERGE.txt is a text file that adds the k-mers. I have uploaded my data here: https://drive.google.com/drive/folders/1BGCEj_PMOxqnZwekJ7b6CN4F585ejYdl?usp=sharing. The union dump text file is outUNION.txt.

Thanks, Taha

Jul 15 '19 16:07 TahaAslani

Thanks for reporting this issue. I suspect that it may be related to the total number of files opened simultaneously. On my machine kmc_tools crashed with a segmentation fault, which should be fixed by the latest commit. I will try to dig a little deeper into this issue, but for now, assuming that you are using Linux, could you please increase the maximum number of open files with the command:

ulimit -n 10240
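If kmc_tools is launched from a wrapper script rather than an interactive shell, the same limit can be raised from inside the script, because child processes inherit it. The following is a minimal sketch in Python, assuming a Linux system; the function name raise_open_file_limit is hypothetical, and the soft limit can only be raised up to the hard limit configured on the machine.

```python
import resource


def raise_open_file_limit(target=10240):
    """Raise the soft limit on open files toward `target`; return the new soft limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Nothing to do if the current soft limit is already unlimited or high enough.
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft
    # An unprivileged process cannot exceed the hard limit, so clamp to it.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return new_soft


if __name__ == "__main__":
    print("soft limit on open files:", raise_open_file_limit())
```

If the printed value is below 10240, the hard limit is the bottleneck and has to be raised by an administrator.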

and then run kmc_tools? As far as I remember, I have run into trouble with kmc_tools when using a large number of input files (by the way, if the blinking percentages annoy you, you can hide them with the -hp switch). Unfortunately, I do not remember the details. It may be reasonable to limit the number of input databases to some value (for example, 100 input databases at once) and write a simple script that wraps the kmc_tools execution. For example, if you have, let's say, 2000 input databases, you can run kmc_tools to create intermediate databases db0-99, db100-199, ..., db1900-1999. After that, you can run kmc_tools again using those 20 databases as input. I know this is a simple workaround and kmc_tools should probably do this by itself, but for now such functionality is not implemented.
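A wrapper along these lines could look roughly like the sketch below. It is only an illustration: the function names, the intermediate database names (tmp_union_L0_B0, ...) and the definition file names are hypothetical, and the definition text assumes the INPUT:/OUTPUT: layout of kmc_tools complex files with + as the union operator, which should be checked against the kmc_tools documentation for your version.

```python
import subprocess
from pathlib import Path


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def union_definition(inputs, output):
    """Build the text of a `kmc_tools complex` definition that unions `inputs`."""
    names = [f"s{i}" for i in range(len(inputs))]
    lines = ["INPUT:"]
    lines += [f"{name} = {path}" for name, path in zip(names, inputs)]
    lines += ["OUTPUT:", f"{output} = {' + '.join(names)}"]
    return "\n".join(lines) + "\n"


def plan_batched_union(databases, final_output, batch_size=100):
    """Return (output_db, definition_text) jobs in the order they must run."""
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")
    if not databases:
        raise ValueError("no input databases given")
    jobs, current, level = [], list(databases), 0
    # Keep merging in batches until the remaining databases fit in one run.
    while len(current) > batch_size:
        next_level = []
        for n, batch in enumerate(chunked(current, batch_size)):
            out = f"tmp_union_L{level}_B{n}"  # hypothetical intermediate name
            jobs.append((out, union_definition(batch, out)))
            next_level.append(out)
        current, level = next_level, level + 1
    jobs.append((final_output, union_definition(current, final_output)))
    return jobs


def run_jobs(jobs, workdir="."):
    """Write each definition file and run kmc_tools on it (-hp hides the progress)."""
    for out, text in jobs:
        definition = Path(workdir) / f"{out}.complex.txt"
        definition.write_text(text)
        subprocess.run(["kmc_tools", "-hp", "complex", str(definition)], check=True)
```

For 2000 input databases and a batch size of 100, plan_batched_union produces 20 intermediate jobs followed by one final job that merges the 20 intermediate databases. Calling run_jobs on the result executes them in order; the intermediate databases are not deleted automatically.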

This may be especially important for larger values of k, where the KMC output is in a more complex form and reading a single database may require more memory. In that case you can start by transforming each database to a simpler format using the sort command; the sorted format requires less memory to read.
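As a rough illustration of that preprocessing step, the sketch below builds the command line for sorting one database. It assumes the invocation form kmc_tools transform <input> sort <output>; the helper name sort_command and the database names in the usage comment are hypothetical.

```python
import subprocess


def sort_command(src_db, dst_db):
    """Build the argument list for converting one KMC database to sorted form."""
    # Assumed syntax: kmc_tools transform <input> sort <output>
    return ["kmc_tools", "transform", src_db, "sort", dst_db]


def sort_database(src_db, dst_db):
    """Run the sort; raises CalledProcessError if kmc_tools exits with an error."""
    subprocess.run(sort_command(src_db, dst_db), check=True)


# Usage (hypothetical names): sort_database("sample1", "sample1_sorted")
```

The sorted databases can then be used as the inputs to the union instead of the original ones.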

I would be glad if you could do the following:

1. Run ulimit first and check whether your example works.
2. Tell me the expected number of input files and the k values in your target workflow (I assume you are currently just experimenting with kmc_tools).

Let me know, and based on your answers I will try to advise you on the best way to use kmc_tools.

Regards, Marek

Jul 15 '19 21:07 marekkokot