
Saving and reloading the pretrained model's vocab breaks the tokenizer.

owos opened this issue 1 year ago · 2 comments

Describe the bug

I picked nvidia/parakeet-ctc-0.6b and untarred the .nemo file. I then loaded the model and changed the vocab this way:

Steps/Code to reproduce bug

    model.change_vocabulary(
        new_tokenizer_dir=vocab_extension_path, new_tokenizer_type="bpe"
    )

where vocab_extension_path is the path of the pretrained model.
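
For completeness, this is roughly the full flow I mean (a sketch rather than my exact script; sample.wav and extracted_tokenizer/ are placeholder paths, and the exact transcribe signature may vary between NeMo versions):

    import nemo.collections.asr as nemo_asr

    # Load the pretrained CTC-BPE model.
    model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("nvidia/parakeet-ctc-0.6b")

    # Baseline transcription with the original tokenizer.
    print(model.transcribe(["sample.wav"]))

    # Reload the very same tokenizer that ships inside the .nemo archive.
    # The directory holds tokenizer.model, tokenizer.vocab and vocab.txt.
    vocab_extension_path = "extracted_tokenizer/"
    model.change_vocabulary(
        new_tokenizer_dir=vocab_extension_path, new_tokenizer_type="bpe"
    )

    # Expected: identical output. Observed: gibberish.
    print(model.transcribe(["sample.wav"]))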

Expected behavior

The model's tokenizer should remain intact and not start generating gibberish, because I am simply reloading the exact tokenizer that was used to pretrain the model.

Why I need this

I need to replace some tokens in the model's vocab while keeping the order of the other tokens intact. If I can't keep the rest of the tokenizer intact, my token replacement cannot work.
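
To illustrate the kind of replacement I have in mind (a sketch using the SentencePiece protobuf API; the token strings below are placeholders, not the tokens I actually swap):

    import sentencepiece.sentencepiece_model_pb2 as spm_pb2

    # Load the SentencePiece model extracted from the .nemo archive.
    m = spm_pb2.ModelProto()
    with open("extracted_tokenizer/tokenizer.model", "rb") as f:
        m.ParseFromString(f.read())

    # Swap selected pieces in place so every other token keeps its id and position.
    replacements = {"▁foo": "▁bar"}  # placeholder token strings
    for piece in m.pieces:
        if piece.piece in replacements:
            piece.piece = replacements[piece.piece]

    with open("extracted_tokenizer/tokenizer.model", "wb") as f:
        f.write(m.SerializeToString())

Presumably the companion tokenizer.vocab and vocab.txt files need matching edits to stay consistent, but none of this matters unless reloading an unmodified tokenizer works in the first place.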

owos · Apr 23 '24 19:04

It should be the path to a tokenizer directory, not to the model.

The directory should contain (a quick check is sketched after this list):

  • tokenizer.model
  • tokenizer.vocab
  • vocab.txt
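
A sanity check along these lines (a sketch; the directory path is a placeholder) can confirm the layout before calling change_vocabulary:

    import os

    tokenizer_dir = "extracted_tokenizer/"  # placeholder path
    expected = ["tokenizer.model", "tokenizer.vocab", "vocab.txt"]

    # Fail early if any required tokenizer file is missing.
    missing = [f for f in expected if not os.path.isfile(os.path.join(tokenizer_dir, f))]
    assert not missing, f"tokenizer dir is missing: {missing}"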

nithinraok · May 08 '24 18:05

Yes, that's what I'm doing. In fact, I've been able to edit the pretrained model's tokenizer and change the tokens inside it. What I found is that merely reloading the pretrained tokenizer with the change_vocabulary method breaks the whole decoding process.

owos · May 08 '24 18:05

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] · Jun 08 '24 01:06

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] · Jun 15 '24 01:06