
Can Mash accurately classify subspecies?

Open rpalcab opened this issue 3 years ago • 1 comments

Hello,

I'm currently working on Mycobacterium caprae and Mycobacterium bovis. These subspecies of the M. tuberculosis complex are phylogenetically very similar, so the task of identifying them is not always trivial.

In one of my analyses, I expected all the samples to be M. caprae, but the Mash screen results show that many of them could be assigned to either subspecies: the top hits have the same shared-hashes score and p-value, or differ by only 1 in the shared-hashes score.

#Sample A

0.99957	991/1000	77	0	GCF_001941665.1_ASM194166v1_genomic.fna.gz	NZ_CP016401.1 Mycobacterium caprae strain Allgaeu genome
0.99957	991/1000	77	0	GCF_001483905.1_ASM148390v1_genomic.fna.gz	NZ_CP013741.1 Mycobacterium bovis strain BCG-1 (Russia), complete genome
0.99957	991/1000	77	0	GCF_001274555.1_ASM127455v1_genomic.fna.gz	NZ_CP009243.1 Mycobacterium bovis BCG strain Russia 368, complete genome

#Sample B

0.999377	987/1000	193	0	GCF_000195835.1_ASM19583v1_genomic.fna.gz	NC_002945.3 Mycobacterium bovis AF2122/97 chromosome, complete genome
0.999329	986/1000	193	0	GCF_001941665.1_ASM194166v1_genomic.fna.gz	NZ_CP016401.1 Mycobacterium caprae strain Allgaeu genome
0.999329	986/1000	193	0	GCF_001580385.1_ASM158038v1_genomic.fna.gz	NZ_CP014566.1 Mycobacterium bovis BCG str. Tokyo 172 substrain TRCS, complete genome

This makes me wonder whether Mash screen can identify organisms at the subspecies level. Also, is a difference of 1 in the shared-hashes score robust enough to determine the taxonomy of an organism?

Thanks in advance

rpalcab avatar Mar 30 '22 14:03 rpalcab
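
For reference, the identity column in the output above can be reproduced from the shared-hashes column, which helps show how small a one-hash difference is. The sketch below assumes that Mash screen reports identity as (shared/total)^(1/k) and that the default k-mer size of 21 was used; the issue does not state k, but this assumption matches the values shown. The binomial standard deviation is only a rough guide to sampling noise (the hashes of closely related references overlap heavily, so the counts are not independent), and the function names are illustrative.

```python
import math

K = 21  # assumption: Mash's default k-mer size; the issue does not state the k used


def screen_identity(shared: int, total: int, k: int = K) -> float:
    """Estimate identity from the fraction of reference hashes found in the sample."""
    return (shared / total) ** (1.0 / k)


def shared_hash_std(shared: int, total: int) -> float:
    """Rough binomial standard deviation of the shared-hash count."""
    p = shared / total
    return math.sqrt(total * p * (1.0 - p))


# Shared-hash counts taken from the Sample A and Sample B tables above.
for shared in (991, 987, 986):
    identity = screen_identity(shared, 1000)
    noise = shared_hash_std(shared, 1000)
    print(f"{shared}/1000  identity={identity:.6f}  approx. std={noise:.2f} hashes")
```

Under these assumptions, 987/1000 and 986/1000 differ by about 0.00005 in identity, while the rough noise estimate at a sketch size of 1000 is a few hashes, so a gap of one hash is probably not enough on its own to separate the two subspecies.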

In my experience, a k-mer size of 17 (-k 17) and a sketch size of 50000 (-s 50000) are enough to differentiate Salmonella serovars. The default sketch size of just 1000 certainly doesn't provide enough resolution for subspecies etc.

sheikki avatar Nov 07 '23 18:11 sheikki
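
A minimal sketch of how the suggested settings might be applied: it builds the `mash sketch` and `mash screen` command lines with `-k 17` and `-s 50000` without running them. The file names, the `refs` output prefix, and the `mash_commands` helper are hypothetical placeholders, not taken from the thread.

```python
import shlex


def mash_commands(reference_fastas, reads, k=17, sketch_size=50000, prefix="refs"):
    """Build (not run) the mash sketch and mash screen command lines."""
    # mash sketch writes <prefix>.msh containing one sketch per reference file.
    sketch = ["mash", "sketch", "-k", str(k), "-s", str(sketch_size),
              "-o", prefix, *reference_fastas]
    # mash screen checks how well each reference sketch is contained in the reads.
    screen = ["mash", "screen", f"{prefix}.msh", reads]
    return sketch, screen


# Hypothetical file names, for illustration only.
sketch_cmd, screen_cmd = mash_commands(
    ["caprae.fna.gz", "bovis.fna.gz"], "sample_A.fastq.gz"
)
print(shlex.join(sketch_cmd))
print(shlex.join(screen_cmd))
```

The reference sketch has to be rebuilt with the new k-mer and sketch sizes before screening, since those settings are fixed when the sketch is created.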