Option to save and load trained model (some workaround suggested)
Hi
I tried to save the trained dictionary and reload it, but it is not working. Do you have any idea how to do it? Here is what I tried. To save the trained dictionary:
```python
import json

myData = dict()
myData["_words"] = ss._words
myData["_deletes"] = ss._deletes
myData["_below_threshold_words"] = ss._below_threshold_words
myData["_max_length"] = ss._max_length
myData["_distance_algorithm"] = ss._distance_algorithm
myData["_max_dictionary_edit_distance"] = ss._max_dictionary_edit_distance
myData["_prefix_length"] = ss._prefix_length
myData["_count_threshold"] = ss._count_threshold
myData["_compact_mask"] = ss._compact_mask

filename = 'SymSpell_dictionary.json'
print('Saving dictionary...')
with open(filename, 'w', encoding='ISO-8859-1') as fp:
    json.dump(myData, fp)
print('Saved dictionary...')
```
Once saved, I tried to reload it like this:
```python
import json

print('Loading dictionary...')
filename = 'SymSpell_dictionary.json'
with open(filename, 'r', encoding='ISO-8859-1') as fp:
    myData = json.load(fp)
print('Loaded dictionary...')

ss._words = myData["_words"]
ss._deletes = myData["_deletes"]
ss._below_threshold_words = myData["_below_threshold_words"]
ss._max_length = myData["_max_length"]
ss._distance_algorithm = myData["_distance_algorithm"]
ss._max_dictionary_edit_distance = myData["_max_dictionary_edit_distance"]
ss._prefix_length = myData["_prefix_length"]
ss._count_threshold = myData["_count_threshold"]
ss._compact_mask = myData["_compact_mask"]
```
It loads without errors, but spell correction does not work afterwards.
As a workaround, I added the following two functions to the main file, and they do work:
```python
def save_words_with_freq_as_json(self, filename, encoding="utf8"):
    print('Saving dictionary...')
    with open(filename, 'w', encoding=encoding) as fp:
        json.dump(self._words, fp)
    print('Saved dictionary...')

def load_words_with_freq_from_json_and_build_dictionary(self, filename, encoding="utf8"):
    print('Loading dictionary...')
    with open(filename, 'r', encoding=encoding) as fp:
        myData = json.load(fp)
    for word in myData:
        self._create_dictionary_entry(word, myData[word])
    print('Loaded dictionary...')
```
To use them, save like this:
```python
filename = 'SymSpell_Dctionary_Word.json'
ss.save_words_with_freq_as_json(filename, encoding='ISO-8859-1')
```
and load like this:
```python
ss = SymSpell(max_dictionary_edit_distance=3)
filename = 'SymSpell_Dctionary_Word.json'
ss.load_words_with_freq_from_json_and_build_dictionary(filename, encoding='ISO-8859-1')
```
The above works, if anyone is interested. But if we could save and load _deletes, _words, etc. directly, it would be faster than training every time.
Thanks for your suggestion.
There is a simple load_dictionary() method to read whitespace-separated word-count pairs from a file. Creating that file is not implemented, but as you correctly noticed, it is basically a dump of the _words dictionary.
Saving and loading the _deletes dictionary was not intended in the original SymSpell, because it is well optimized to build it quickly. I have not measured the speed yet, but it seems that building _deletes via load_dictionary() is fast enough.
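A minimal sketch of producing such a file from a `{term: count}` mapping like `_words` (the helper name and toy vocabulary below are mine, not part of the library; load_dictionary() expects one term/count pair per line, with the exact separator and column order set by its parameters):

```python
import os
import tempfile

def save_words_as_plain_text(words, path, separator=" "):
    # Write one "term<separator>count" pair per line -- the
    # whitespace-separated format that load_dictionary() reads back.
    with open(path, "w", encoding="utf8") as fp:
        for term, count in words.items():
            fp.write(f"{term}{separator}{count}\n")

# Round-trip check with a toy vocabulary
words = {"hello": 120, "world": 87}
path = os.path.join(tempfile.gettempdir(), "toy_symspell_dictionary.txt")
save_words_as_plain_text(words, path)
with open(path, encoding="utf8") as fp:
    lines = fp.read().splitlines()
print(lines)  # ['hello 120', 'world 87']
```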
It was a simple issue. When we load JSON from a file, the keys are stored as strings, whereas the _deletes keys are ints (hash values). We need to do something like:
```python
for hs in sorted(myDeleteData):
    ss._deletes[int(hs)] = myDeleteData[hs]
```
And the good news is that it works. The key here is converting hs to int (`int(hs)`) while transferring it to ss._deletes. I will be sending a pull request. I also implemented multi-threading for creating the hash table in Python: loading 500K words on my local machine took 3 minutes with four threads instead of 7 minutes without multi-threading.
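The root cause is easy to reproduce in isolation (the hash value and word list below are toy data, not from the library):

```python
import json

# JSON object keys are always strings, so int keys are
# stringified on dump and stay strings after load.
deletes = {1405466513: ["word", "ward"]}
restored = json.loads(json.dumps(deletes))
print(list(restored))  # ['1405466513'] -- the key came back as a string

# Converting the keys back to int restores the original structure
fixed = {int(k): v for k, v in restored.items()}
assert fixed == deletes
```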
Raised a pull request
Hi, thanks for the contribution, I appreciate it a lot. I will look through it shortly and accept.
Do you want to join forces to further improve it?
I would definitely like to join. I am now working on some more improvements and will let you know once they are done. But if you are merging my changes, revert the prime numbers back to the originals; I was not aware of the significance of those two numbers :-) (the FNV hash algorithm).