
Saving and reloading the pretrained model's vocab breaks the tokenizer.

owos opened this issue 1 year ago · 2 comments

Describe the bug

I picked nvidia/parakeet-ctc-0.6b and untarred the .nemo file. I then loaded the model and changed the vocab this way:

Steps/Code to reproduce bug

    model.change_vocabulary(
        new_tokenizer_dir=vocab_extension_path, new_tokenizer_type="bpe"
    )

where vocab_extension_path is the path of the pretrained model.
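
For completeness, this is roughly the full flow I mean (a sketch rather than my exact script; sample.wav and extracted_tokenizer/ are placeholder paths, and the exact transcribe signature may vary between NeMo versions):

    import nemo.collections.asr as nemo_asr

    # Load the pretrained CTC-BPE model.
    model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("nvidia/parakeet-ctc-0.6b")

    # Baseline transcription with the original tokenizer.
    print(model.transcribe(["sample.wav"]))

    # Reload the very same tokenizer that ships inside the .nemo archive.
    # The directory holds tokenizer.model, tokenizer.vocab and vocab.txt.
    vocab_extension_path = "extracted_tokenizer/"
    model.change_vocabulary(
        new_tokenizer_dir=vocab_extension_path, new_tokenizer_type="bpe"
    )

    # Expected: identical output. Observed: gibberish.
    print(model.transcribe(["sample.wav"]))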

Expected behavior

The model's tokenizer should remain intact and not start generating gibberish, because I am simply reloading the exact tokenizer that was used to pretrain the model.

Why I need this

I need to replace some tokens in the model's vocab while keeping the order of the other tokens intact. If I can't keep the rest of the tokenizer intact, my token replacement cannot work.
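
To illustrate the kind of replacement I have in mind (a sketch using the SentencePiece protobuf API; the token strings below are placeholders, not the tokens I actually swap):

    import sentencepiece.sentencepiece_model_pb2 as spm_pb2

    # Load the SentencePiece model extracted from the .nemo archive.
    m = spm_pb2.ModelProto()
    with open("extracted_tokenizer/tokenizer.model", "rb") as f:
        m.ParseFromString(f.read())

    # Swap selected pieces in place so every other token keeps its id and position.
    replacements = {"▁foo": "▁bar"}  # placeholder token strings
    for piece in m.pieces:
        if piece.piece in replacements:
            piece.piece = replacements[piece.piece]

    with open("extracted_tokenizer/tokenizer.model", "wb") as f:
        f.write(m.SerializeToString())

Presumably the companion tokenizer.vocab and vocab.txt files need matching edits to stay consistent, but none of this matters unless reloading an unmodified tokenizer works in the first place.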

owos · Apr 23 '24 19:04

It should be the path to a tokenizer directory, not to the model.

The directory should contain (a quick check is sketched after this list):

  • tokenizer.model
  • tokenizer.vocab
  • vocab.txt
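
A sanity check along these lines (a sketch; the directory path is a placeholder) can confirm the layout before calling change_vocabulary:

    import os

    tokenizer_dir = "extracted_tokenizer/"  # placeholder path
    expected = ["tokenizer.model", "tokenizer.vocab", "vocab.txt"]

    # Fail early if any required tokenizer file is missing.
    missing = [f for f in expected if not os.path.isfile(os.path.join(tokenizer_dir, f))]
    assert not missing, f"tokenizer dir is missing: {missing}"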

nithinraok · May 08 '24 18:05

Yes, that's what I'm doing. In fact, I've been able to edit the pretrained model's tokenizer and change the tokens inside it. What I found is that merely reloading the pretrained tokenizer with the change_vocabulary method breaks the whole decoding process.

owos · May 08 '24 18:05

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] · Jun 08 '24 01:06

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] · Jun 15 '24 01:06