How to turn off byte-fallback for Phi-3's tokenizer?
I have been trying out Phi-3 models and it's been a wonderful experience.
However, the tokenizer sometimes throws an exception.
The line of code
text = self.tokenizer.decode(output_tokens)
throws Exception: 'utf-8' codec can't decode byte 0xf0 in position 10283: invalid continuation byte
Most of the time this happens when the model's output is quite long (~800 words; counting brackets, dots, etc., it's ~1.4k elements, which is still far from the max_length of 4196 imo).
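For context, here is a hedged sketch of the error class itself (not Phi-3 specific, byte values are illustrative): a byte-fallback BPE tokenizer can emit raw-byte tokens such as <0xF0> for out-of-vocabulary characters, and 0xF0 is a UTF-8 lead byte that must be followed by specific continuation bytes. If the byte sequence the model produces is incomplete or inconsistent, strict UTF-8 decoding raises exactly this kind of exception:

```python
# Minimal reproduction: 0xF0 starts a 4-byte UTF-8 sequence, but the
# next byte ("A") is not a valid continuation byte, so strict decoding
# fails the same way the tokenizer.decode call does.
raw = b"hello \xf0A"

try:
    raw.decode("utf-8")  # strict mode, like the failing decode call
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xf0 ... invalid continuation byte

# A lenient decode substitutes U+FFFD for the bad byte instead of raising:
print(raw.decode("utf-8", errors="replace"))
```

This is only an illustration of the failure mode; whether a lenient decode is reachable depends on what the tokenizer API exposes.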
I have researched around and found that this can be fixed by turning off the byte-fallback of the BPE tokenizer, so that the tokenizer ignores non-UTF-8 tokens.
I have tried tweaking the tokenizer.json file:
- set model/byte_fallback to false
- remove the {"type": "ByteFallback"} item from the decoder/decoders section
but the error still happens.
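For reference, the two tweaks above can also be applied programmatically. This is a minimal sketch on a toy dict mimicking the tokenizer.json layout (the structure shown here is an assumption, not the full file):

```python
import json

def disable_byte_fallback(cfg: dict) -> dict:
    """Apply the two tweaks described above to a parsed tokenizer.json."""
    # 1) turn off byte-fallback in the model section
    cfg["model"]["byte_fallback"] = False
    # 2) drop the ByteFallback step from the decoder sequence, if present
    dec = cfg.get("decoder", {})
    if "decoders" in dec:
        dec["decoders"] = [d for d in dec["decoders"]
                           if d.get("type") != "ByteFallback"]
    return cfg

# Toy stand-in for tokenizer.json (the real file has many more fields):
cfg = json.loads("""
{
  "model": {"type": "BPE", "byte_fallback": true},
  "decoder": {"type": "Sequence",
              "decoders": [{"type": "Replace"}, {"type": "ByteFallback"}]}
}
""")

cfg = disable_byte_fallback(cfg)
print(json.dumps(cfg))
```

One thing worth checking is whether the runtime you use actually reads this tokenizer.json at all, or loads a separate/cached tokenizer artifact; if it is the latter, edits to the file would have no effect.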
I am using the mini-4k-instruct onnx-cuda-int4 version, btw.
I wonder why my changes did not work, and is there any way to fix this? Thanks for any help and suggestions!
@nguyenthekhoig7 Could you add this to the Hugging Face community discussions at https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx/discussions? That is the best place for queries such as this.
Hi @leestott, thanks for the suggestion. I have created a discussion on Hugging Face: discussion #10
@nguyenthekhoig7 Thanks I will close this issue.