
Option to save and load trained model (some workaround suggested)

Open sekarpdkt opened this issue 8 years ago • 5 comments

Hi

I tried to save the trained dictionary and reload it, but it is not working. Do you have any idea how to do it? Here is what I tried. To save the trained dictionary:

import json

myData = dict()
myData["_words"] = ss._words
myData["_deletes"] = ss._deletes
myData["_below_threshold_words"] = ss._below_threshold_words
myData["_max_length"] = ss._max_length
myData["_distance_algorithm"] = ss._distance_algorithm
myData["_max_dictionary_edit_distance"] = ss._max_dictionary_edit_distance
myData["_prefix_length"] = ss._prefix_length
myData["_count_threshold"] = ss._count_threshold
myData["_compact_mask"] = ss._compact_mask

filename = 'SymSpell_dictionary.json'
print('Saving dictionary...')
with open(filename, 'w', encoding='ISO-8859-1') as fp:
    json.dump(myData, fp)
print('Saved dictionary...')

Once saved, I tried to reload it like this:

print('Loading dictionary...')
filename = 'SymSpell_dictionary.json'

with open(filename, 'r', encoding='ISO-8859-1') as fp:
    myData = json.load(fp)
print('Loaded dictionary...')

ss._words = myData["_words"]
ss._deletes = myData["_deletes"]
ss._below_threshold_words = myData["_below_threshold_words"]
ss._max_length = myData["_max_length"]
ss._distance_algorithm = myData["_distance_algorithm"]
ss._max_dictionary_edit_distance = myData["_max_dictionary_edit_distance"]
ss._prefix_length = myData["_prefix_length"]
ss._count_threshold = myData["_count_threshold"]
ss._compact_mask = myData["_compact_mask"]

It is not working: the file loads without errors, but spell correction does not work afterwards.

As a workaround, I added the following two functions to the main file, which work:

    def save_words_with_freq_as_json(self, filename, encoding="utf8"):
        print('Saving dictionary...')
        with open(filename, 'w', encoding=encoding) as fp:
            json.dump(self._words, fp)
        print('Saved dictionary...')

    def load_words_with_freq_from_json_and_build_dictionary(self, filename, encoding="utf8"):
        print('Loading dictionary...')
        with open(filename, 'r', encoding=encoding) as fp:
            myData = json.load(fp)
        for word in myData:
            self._create_dictionary_entry(word, myData[word])
        print('Loaded dictionary...')

To use it, you can save like this:

filename = 'SymSpell_Dctionary_Word.json'
ss.save_words_with_freq_as_json(filename, encoding='ISO-8859-1')

and load like this:

ss = SymSpell(max_dictionary_edit_distance=3)
filename = 'SymSpell_Dctionary_Word.json'
ss.load_words_with_freq_from_json_and_build_dictionary(filename, encoding='ISO-8859-1')

The above works, if anyone is interested. But if we could save and load _deletes, _words, etc. directly, it would be faster than retraining every time.
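One way to persist _deletes and _words directly (a sketch, not part of the library; save_internals/load_internals are hypothetical helper names) is pickle, which, unlike JSON, preserves int dictionary keys:

```python
import pickle

def save_internals(ss, filename):
    # Pickle keeps int keys intact, unlike JSON, so _deletes
    # (hash -> list of words) survives the round trip as-is.
    with open(filename, 'wb') as fp:
        pickle.dump({"_words": ss._words, "_deletes": ss._deletes}, fp)

def load_internals(ss, filename):
    with open(filename, 'rb') as fp:
        data = pickle.load(fp)
    ss._words = data["_words"]
    ss._deletes = data["_deletes"]
```

The usual pickle caveats apply (only load files you trust, and the format is tied to the class layout), but it avoids rebuilding the hash table entirely.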

sekarpdkt avatar Apr 16 '18 11:04 sekarpdkt

Thanks for your suggestion.

There is a simple load_dictionary() method that reads whitespace-separated word-count pairs from a file. Creating such a file is not implemented, but as you correctly noticed, it is basically a matter of dumping the _words dictionary.

Saving and loading the _deletes dictionary was not intended in the original SymSpell, because it is well optimized to build quickly. I have not measured the speed yet, but building _deletes via load_dictionary() seems fast enough.
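The missing save side could be as simple as dumping _words in the whitespace-separated format that load_dictionary() expects. A minimal sketch, assuming _words maps each word to its count (save_word_counts is a hypothetical name, not part of the library):

```python
def save_word_counts(words, filename, encoding="utf8"):
    # Write one "word count" pair per line, the whitespace-separated
    # format that load_dictionary() reads back.
    with open(filename, 'w', encoding=encoding) as fp:
        for word, count in words.items():
            fp.write(f"{word} {count}\n")
```

Calling load_dictionary() on the resulting file would then rebuild _deletes from scratch, which is the fast path described above.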

ne3x7 avatar Apr 16 '18 12:04 ne3x7

It was a simple issue. When we load JSON from a file, the keys are stored as strings, whereas the _deletes keys are ints (hash values). We need to do something like:

for hs in sorted(myDeleteData):
    ss._deletes[int(hs)] = myDeleteData[hs]

and the good news is that it works. The key here is converting hs to int (int(hs)) while transferring it to ss._deletes. I will be sending a pull request. I also implemented multithreading for building the hash table in Python: with four threads, loading 500K words on my local machine took 3 minutes instead of 7 minutes without multithreading.

sekarpdkt avatar Apr 17 '18 14:04 sekarpdkt

Raised a pull request.

sekarpdkt avatar Apr 17 '18 16:04 sekarpdkt

Hi, thanks for the contribution, I appreciate it a lot. I will look through it shortly and accept.

Do you want to join forces to further improve it?

ne3x7 avatar Apr 20 '18 21:04 ne3x7

I would definitely like to join. I am now working on some more improvements and will let you know once they are done. But if you are merging my changes, revert the prime number back to the original: I was not aware of the special properties of those two numbers :-) (the FNV hash algorithm).

sekarpdkt avatar Apr 21 '18 03:04 sekarpdkt