How to turn off byte-fallback for Phi-3's tokenizer?
I have been trying out Phi-3 models and it's been a wonderful experience.
However, the tokenizer sometimes throws an exception.
The line of code
text = self.tokenizer.decode(output_tokens)
throws Exception: 'utf-8' codec can't decode byte 0xf0 in position 10283: invalid continuation byte
Most of the time this happens when the model's output is quite long (~800 words; counting brackets, dots, etc., it's ~1.4k elements, which is still far from the max_length of 4196 imo).
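For context, here is a hedged sketch of the error class itself (not Phi-3 specific, byte values are illustrative): a byte-fallback BPE tokenizer can emit raw-byte tokens such as <0xF0> for out-of-vocabulary characters, and 0xF0 is a UTF-8 lead byte that must be followed by specific continuation bytes. If the byte sequence the model produces is incomplete or inconsistent, strict UTF-8 decoding raises exactly this kind of exception:

```python
# Minimal reproduction: 0xF0 starts a 4-byte UTF-8 sequence, but the
# next byte ("A") is not a valid continuation byte, so strict decoding
# fails the same way the tokenizer.decode call does.
raw = b"hello \xf0A"

try:
    raw.decode("utf-8")  # strict mode, like the failing decode call
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xf0 ... invalid continuation byte

# A lenient decode substitutes U+FFFD for the bad byte instead of raising:
print(raw.decode("utf-8", errors="replace"))
```

This is only an illustration of the failure mode; whether a lenient decode is reachable depends on what the tokenizer API exposes.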
I have researched around and found that this can be fixed by turning off the byte-fallback of the BPE tokenizer, so that the tokenizer ignores non-UTF-8 tokens.
I have tried tweaking the tokenizer.json file:
- set model/byte_fallback to false
- remove the {"type": "ByteFallback"} item from the decoder/decoders section
but the error still happens.
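For reference, the two tweaks above can also be applied programmatically. This is a minimal sketch on a toy dict mimicking the tokenizer.json layout (the structure shown here is an assumption, not the full file):

```python
import json

def disable_byte_fallback(cfg: dict) -> dict:
    """Apply the two tweaks described above to a parsed tokenizer.json."""
    # 1) turn off byte-fallback in the model section
    cfg["model"]["byte_fallback"] = False
    # 2) drop the ByteFallback step from the decoder sequence, if present
    dec = cfg.get("decoder", {})
    if "decoders" in dec:
        dec["decoders"] = [d for d in dec["decoders"]
                           if d.get("type") != "ByteFallback"]
    return cfg

# Toy stand-in for tokenizer.json (the real file has many more fields):
cfg = json.loads("""
{
  "model": {"type": "BPE", "byte_fallback": true},
  "decoder": {"type": "Sequence",
              "decoders": [{"type": "Replace"}, {"type": "ByteFallback"}]}
}
""")

cfg = disable_byte_fallback(cfg)
print(json.dumps(cfg))
```

One thing worth checking is whether the runtime you use actually reads this tokenizer.json at all, or loads a separate/cached tokenizer artifact; if it is the latter, edits to the file would have no effect.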
I am using the mini-4k-instruct onnx-cuda-int4 version, btw.
I wonder why my changes did not work, and is there any way to fix this? Thanks for any help and suggestions!
@nguyenthekhoig7 Could you add this to the Hugging Face community discussions at https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx/discussions? That is the best place for queries such as this.
Hi @leestott, thanks for the suggestion. I have created a discussion on Hugging Face: discussion #10
@nguyenthekhoig7 Thanks I will close this issue.