Saving and reloading the pretrained model's vocab breaks the tokenizer.
Describe the bug
So I picked nvidia/parakeet-ctc-0.6b and untarred the .nemo file. I then loaded the model and changed the vocab as shown below.
Steps/Code to reproduce bug
model.change_vocabulary(
    new_tokenizer_dir=vocab_extension_path, new_tokenizer_type="bpe"
)
where vocab_extension_path is the path of the pretrained model.
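For completeness, here is a rough end-to-end sketch of what I'm running (the .nemo path, the extraction directory, and vocab_extension_path are placeholders, not my exact paths):

import tarfile
import nemo.collections.asr as nemo_asr

# Load the pretrained checkpoint.
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-ctc-0.6b")

# Untar the .nemo file to get at the tokenizer artifacts.
with tarfile.open("parakeet-ctc-0.6b.nemo") as tar:
    tar.extractall("extracted_nemo")

# Point change_vocabulary back at the unmodified pretrained tokenizer.
vocab_extension_path = "extracted_nemo"  # placeholder
model.change_vocabulary(
    new_tokenizer_dir=vocab_extension_path, new_tokenizer_type="bpe"
)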
Expected behavior
The model's tokenizer is supposed to remain intact and not start generating gibberish, because I am just reloading the exact tokenizer that was used to pretrain the model.
Why I need this
I need to replace some tokens in the model's vocab while keeping the order of the tokens intact. If I can't keep the other parts of the tokenizer intact, then my token replacement cannot work.
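To make the goal concrete, this is the kind of edit I'm after (a sketch only: it assumes vocab.txt holds one token per line, the tokenizer_dir path and the replacements mapping are hypothetical, and the binary tokenizer.model would also have to stay consistent with the edited list):

# Replace a few tokens in place so every other entry keeps its position/ID.
# replacements is a hypothetical mapping of old token -> new token.
replacements = {"old_token": "new_token"}

with open("tokenizer_dir/vocab.txt", encoding="utf-8") as f:
    tokens = [line.rstrip("\n") for line in f]

tokens = [replacements.get(tok, tok) for tok in tokens]

with open("tokenizer_dir/vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(tokens) + "\n")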
It should be the path to a tokenizer directory, not to the model.
The directory should contain:
- tokenizer.model
- tokenizer.vocab
- vocab.txt
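If you want to sanity-check the directory before calling change_vocabulary, something like this works (the path is a placeholder; note that files extracted straight from a .nemo archive may carry a prefix and need renaming):

from pathlib import Path

tokenizer_dir = Path("extracted_nemo/tokenizer")  # placeholder path
expected = ["tokenizer.model", "tokenizer.vocab", "vocab.txt"]
missing = [name for name in expected if not (tokenizer_dir / name).is_file()]
if missing:
    raise FileNotFoundError(f"Missing tokenizer files: {missing}")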
Yes, that's what I'm doing. In fact, I've been able to edit the pretrained model's tokenizer and change the tokens inside it. What I found is that just reloading the pretrained tokenizer with the change_vocabulary method breaks the whole decoding process.
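Here is roughly how I'm seeing the breakage (a sketch; the audio file and tokenizer path are placeholders, not my actual ones):

import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-ctc-0.6b")
before = model.transcribe(["sample.wav"])  # sensible text

# Reload the *same* pretrained tokenizer directory, then transcribe again.
model.change_vocabulary(
    new_tokenizer_dir="extracted_nemo/tokenizer", new_tokenizer_type="bpe"
)
after = model.transcribe(["sample.wav"])  # gibberish, even though the tokenizer files are identical

print(before)
print(after)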
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.