
How to turn off byte-fallback for Phi-3's tokenizer?

Open nguyenthekhoig7 opened this issue 1 year ago • 2 comments

I have been trying out Phi-3 models and it's been a wonderful experience.

However, the tokenizer sometimes throws an exception:

The line of code

text = self.tokenizer.decode(output_tokens)

throws Exception: 'utf-8' codec can't decode byte 0xf0 in position 10283: invalid continuation byte

This mostly happens when the model's output is fairly long (~800 words; counting brackets, dots, etc., about 1.4k elements), which is still far below the max_length of 4196, imo.
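For context, 0xf0 is the lead byte of a four-byte UTF-8 sequence (emoji, for example), and Python reports "invalid continuation byte" when such a lead byte is not followed by valid continuation bytes. The sketch below reproduces that message with plain bytes and shows a lenient decode with `errors="replace"`. It does not involve the Phi-3 tokenizer at all; the helper names `strict_decode_error` and `lenient_decode` are made up for illustration, and whether the ONNX tokenizer exposes raw bytes to decode this way is an assumption.

```python
# Plain-Python illustration of the error message; no tokenizer involved.
complete = "😀".encode("utf-8")   # four bytes: b'\xf0\x9f\x98\x80'
broken = complete[:1] + b"abc"    # lead byte 0xf0 followed by ASCII bytes


def strict_decode_error(data: bytes) -> str:
    """Return the UnicodeDecodeError reason, or '' if strict decoding succeeds."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.reason
    return ""


def lenient_decode(data: bytes) -> str:
    """Decode while replacing undecodable bytes with U+FFFD instead of raising."""
    return data.decode("utf-8", errors="replace")


print(repr(strict_decode_error(complete)))
print(repr(strict_decode_error(broken)))
print(repr(lenient_decode(broken)))
```

If the failing bytes can be reached before decoding, a lenient decode like this would at least avoid the exception, at the cost of a replacement character in the text.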

I have researched this and found that it can supposedly be fixed by turning off byte-fallback in the BPE tokenizer, so that the tokenizer ignores non-UTF-8 tokens.

I have tried

Tweaked the tokenizer.json file:

  • Set the model/byte_fallback to false
  • and remove the item {"type": "ByteFallback"} in decoder/decoders section

but the error still happens.
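For reference, the edit described above can be sketched in Python as follows. This only restates the two changes; it does not explain why the error persists. The `toy_config` dict is a made-up stand-in containing just the keys mentioned in this post (the real tokenizer.json has many more fields), and `disable_byte_fallback` is a hypothetical helper name.

```python
import copy
import json


def disable_byte_fallback(config: dict) -> dict:
    """Return a copy of a tokenizer.json-style dict with byte fallback turned off."""
    patched = copy.deepcopy(config)
    # Change 1: set model/byte_fallback to false.
    patched["model"]["byte_fallback"] = False
    # Change 2: drop the {"type": "ByteFallback"} entry from decoder/decoders.
    decoder = patched.get("decoder") or {}
    if "decoders" in decoder:
        decoder["decoders"] = [
            entry for entry in decoder["decoders"]
            if entry.get("type") != "ByteFallback"
        ]
    return patched


# Toy stand-in for tokenizer.json (assumption: only the keys named in the post).
toy_config = {
    "model": {"type": "BPE", "byte_fallback": True},
    "decoder": {
        "type": "Sequence",
        "decoders": [{"type": "ByteFallback"}, {"type": "Fuse"}],
    },
}

patched = disable_byte_fallback(toy_config)
print(json.dumps(patched, indent=2))
```

To apply it to a real file, one would load the JSON with `json.load`, pass the resulting dict through the helper, and write it back with `json.dump`, ideally on a backup copy of tokenizer.json.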

I am using the mini-4k-instruct onnx-cuda-int4 version, btw.

I wonder

Why did my changes not work, and is there any way to fix this? Thanks for any help and suggestions!

nguyenthekhoig7, May 23 '24 10:05

@nguyenthekhoig7 Could you add this to the Hugging Face community discussions at https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx/discussions? That is the best place for queries like this.

leestott, May 23 '24 14:05

Hi @leestott, thanks for the suggestion. I have created a discussion on Hugging Face: discussion #10

nguyenthekhoig7, May 24 '24 02:05

@nguyenthekhoig7 Thanks, I will close this issue.

leestott, May 24 '24 16:05