The match is inconsistent with sketch size

Open vvsbiocode opened this issue 1 year ago • 0 comments

As I understand, the "match" output of mash distance is the number of matching hashes among all hashes. Number of all hashes is defined by sketch size. Right? I call mash through python subprocess to calculate mash dist pairwisely from lists of query and reference. The k-mer is constant for all pairs and sketch size = length of the longest sequence in pair (I compare short sequnces - genes). I make a sketch file separatly for two sequnces and then compare them with "mash dist". And sometimes in output I see that match value is different among pairs with same sketch size. E.g. sketch size=3155, k-mer=11 pair1 match=44/44 pair2 match=0/88 Why could it happen?

def mash_dist_tune_pairwise(reference_seq_dict,query_seq_dict,fna_dir,k_mer,mashapp,threads): #create matrix of scatch size for each comparison scketch_size_matrix=np.empty((len(reference_seq_dict),len(query_seq_dict))) #Create an empty array. scketch_size_matrix.fill(np. NaN) #create matrix of k-mer sizes for each comparison kmer_matrix=np.empty((len(reference_seq_dict),len(query_seq_dict))) kmer_matrix.fill(k_mer) #fill both matrix for i1,key1 in enumerate(reference_seq_dict.keys()): for i2,key2 in enumerate(query_seq_dict.keys()): g=np.max([len(reference_seq_dict[key1]['Sequence']),len(query_seq_dict[key2]['Sequence'])]) scketch_size_matrix[i1,i2]=round(g)
#creat empty dict dist_dict={} #make sketch for each pair for i1,key1 in enumerate(reference_seq_dict.keys()): dist_dict[key1]={} ref_filename="".join(["./",fna_dir,"/",key1,".fna"]) #sequence file path for i2,key2 in enumerate(query_seq_dict.keys()): query_filename="".join(["./",fna_dir,"/",key2,".fna"]) sketch_size=scketch_size_matrix[i1,i2] k_mer=kmer_matrix[i1,i2] #command cmd_sketch1=f"{mashapp} sketch -p {threads} -k {k_mer} -s {sketch_size} {ref_filename}" cmd_sketch2=f"{mashapp} sketch -p {threads} -k {k_mer} -s {sketch_size} {query_filename}" #run ref_sketch_res=subprocess.run([cmd_sketch1], shell=True, capture_output=True, text=True) query_sketch_res=subprocess.run([cmd_sketch2], shell=True, capture_output=True, text=True) #mash dist with Popen(f"{mashapp} dist -p {threads} {ref_filename}.msh {query_filename}.msh", shell=True, stdout=PIPE) as process: query_res = pd.read_csv(process.stdout,sep="\t",header=None) query_res.index=[key1] query_res=query_res.drop([0,1],axis=1) query_res.columns=["mshdist","p","match"] query_res_dict=query_res.to_dict(orient="dict") dist_dict[key1][key2]=query_res_dict
return (dist_dict,scketch_size_matrix,kmer_matrix)

UPD: I think (but not shure) that the reason is that there are not enough small hashes to get up to 3000 hashes. My assumption is based on this line from publication:

For a sketch size s and genome size n, a bottom sketch can be efficiently computed in O(n log s) time by maintaining a sorted list of size s and updating the current sketch only when a new hash is smaller than the current sketch maximum.

It is also evident from the fact that if I increas the sketch size (e.g. from 3000 to 10000), then the match is still the same (e.g. 44). What strikes me is that pairs of sequences have identity score 97, which means they are highly similar. So this is weird that mash does not recognise this. I had to decrease the k-mer size to 4 to recieve an expected low mash distance...

Jun 26 '24 20:06 vvsbiocode