The subtraction function of Dashing

Open liaoherui opened this issue 4 years ago • 8 comments

Firstly, thanks a lot for your excellent tool! It's really cool!

This is my question:

Currently, I have three files: A.fq, B.fa, and C.fa. I want to calculate the distance between A.fq and B.fa without considering the k-mers in C.fa. In other words, I need to compute dist(A-C, B-C). Can Dashing be used to do this?

If it works, what about many "B" files? I mean dist(A-C, "genome_path"-C), where "genome_path" refers to many FASTA files (just like the input file (genome_path.txt) of Dashing's dist function), not only one B.

Thanks a lot for your answering in advance!

liaoherui avatar May 17 '21 09:05 liaoherui

Hi there!

This isn't currently supported directly, but you could use Dashing to perform part of the computation. If you perform sketching with --full-khash-sets, then Dashing will create 64-bit hash sets and write them to disk as gzip files. The first 8 bytes of each file hold the number of hashes, and the rest of the file is the 64-bit hashes themselves. You could then load the hash sets for A and B and filter out any hashes that are present in C.

I see that it could be very useful, and we'll consider supporting it directly as development continues. Let me know if you need any further help or have any questions.

Thanks,

Daniel

dnbaker avatar May 17 '21 22:05 dnbaker

Hi, Daniel

Got it. Thanks a lot for your prompt reply!

I will try to use the "sketch" function. In this case, is it possible to load these hash files in Python? I ask because I am not very familiar with C++.

Another, less important question: when using the "sketch" function, would it be possible in a future version to save hashes with genome labels, like {kmer1->{A, C}, kmer2->{B}}? If so, it would be really helpful for developing k-mer-based analysis tools! But I guess it could be time-consuming or is not the target of Dashing. If so, you can ignore this question.

Regards, Herui Liao

liaoherui avatar May 18 '21 01:05 liaoherui

Hi, Daniel

I have just tested the "--full-khash-sets" parameter with one genome, but there is an error.

The command I used is dashing_s128 sketch -p12 --use-full-khash-sets -k31 test.fasta

And the error info is: image

I have assigned 200G memory for this job, which should be enough for one small bacterial genome (~5M).

liaoherui avatar May 18 '21 02:05 liaoherui

Hi Liao,

You're right, that probably isn't running out of memory. That was a bug, unfortunately, which I've found/fixed in this branch, and it's now been merged into main here.

You can download the new binaries from https://github.com/dnbaker/dashing/tree/main/release/.

You can do a lot of this from within Python. To parse each of these k-mer files, here's some Python code:

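def parse_khs(path):
    import gzip
    import numpy as np
    # File layout: one uint64 count, followed by that many uint64 hashes.
    with gzip.open(path, "rb") as fh:
        data = np.frombuffer(fh.read(), dtype=np.uint64)
    count, hashes = int(data[0]), data[1:]
    # Sanity check: the header count must match the number of hashes read.
    assert count == len(hashes)
    return hashes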

This yields a hash set in vector form. So after using the fixed/rebuilt code to eliminate the segmentation fault, you might try something like this:

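def filtered_jaccard(a, b, c):
    # a, b, c are paths to hash-set files; c holds the k-mers to exclude.
    adata, bdata, cdata = map(parse_khs, (a, b, c))
    # Build the filter set from the parsed hashes, not from the path string.
    cset = set(cdata.tolist())
    afilt = set(adata.tolist()) - cset
    bfilt = set(bdata.tolist()) - cset
    isz = afilt & bfilt
    union = afilt | bfilt
    # Guard against both filtered sets being empty.
    return len(isz) / len(union) if union else 0.0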
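
For the case of many "B" genomes raised in the original question, the same idea can be applied in a loop. The following is only a rough sketch, not part of Dashing: the names load_khs and filtered_jaccard_many are made up for illustration, it assumes the file layout described above (one uint64 count followed by the hashes, in the machine's native byte order), and it uses sorted NumPy arrays instead of Python sets so that A and C are loaded and filtered only once.

```python
import gzip

import numpy as np


def load_khs(path):
    """Read one hash-set file: a uint64 count followed by that many uint64 hashes."""
    with gzip.open(path, "rb") as fh:
        data = np.frombuffer(fh.read(), dtype=np.uint64)
    assert int(data[0]) == data.size - 1, "header count does not match payload"
    # np.unique returns the hashes sorted and deduplicated.
    return np.unique(data[1:])


def filtered_jaccard_many(a_path, genome_paths, c_path):
    """Return {path: Jaccard(A - C, B - C)} for every hash-set file B in genome_paths."""
    c = load_khs(c_path)
    # Remove C's hashes from A once, up front.
    a = np.setdiff1d(load_khs(a_path), c, assume_unique=True)
    result = {}
    for path in genome_paths:
        b = np.setdiff1d(load_khs(path), c, assume_unique=True)
        inter = np.intersect1d(a, b, assume_unique=True).size
        union = a.size + b.size - inter
        # Report 0.0 when both filtered sets are empty.
        result[path] = inter / union if union else 0.0
    return result
```

Here genome_paths would be the list of hash-set files produced by sketching each genome, not the FASTA files themselves. The values are Jaccard similarities, so 1 - value gives a Jaccard distance.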

dnbaker avatar May 18 '21 17:05 dnbaker

Hi, Daniel

Thanks a lot for your suggestions and the new version of dashing!

However, when I tried the new binary, there still seem to be some problems. As shown in the picture below, the new version cannot be executed, while the old version works well.

image

The same with s256 and s512.

liaoherui avatar May 19 '21 01:05 liaoherui

Thanks for letting me know.

What's your operating system?

You could check whether it has execute permissions (chmod 755 dashing_s128), in case that fixes it. Otherwise, cloning fresh and building manually might help:

git clone --recursive https://github.com/dnbaker/dashing
cd dashing && make dashing

You can then install it either with sudo make install or by manually copying the executable into a folder listed in the PATH environment variable.

dnbaker avatar May 20 '21 15:05 dnbaker

Thanks! Will try.

My operating system is Linux, and the version info is shown below. image

liaoherui avatar May 21 '21 02:05 liaoherui

By the way - while there's no such functionality in this software, I added to Dashing2 a feature which supports something like this. The feature uses the flag --filterset <path>, which reads all k-mers from that path and then skips them when sketching from the remaining files.

And your request inspired me to provide it - so thank you!

dnbaker avatar Aug 15 '21 19:08 dnbaker