Robert Burke
I don't think so:

```
>>> from transformers import GPT2Tokenizer
>>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
>>> tokenizer.encode('Ģ')
[128, 95]
```
I mean, I understand that I can use a separate lookup table mapping individual byte values to the tokens that represent single bytes. But BPE tokenizers generally work by applying pre-tokenization to divide...
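For what it's worth, here's a minimal sketch of that byte-to-token lookup table, built from the bytes_to_unicode helper in transformers' GPT-2 tokenizer module (the import path below is an assumption about recent releases; older versions exposed it elsewhere). It maps each of the 256 raw byte values to the id of its single-byte token, which is why 'Ģ' (two UTF-8 bytes, 0xC4 0xA2) comes back as the pair [128, 95] rather than a single id:

```
from transformers import GPT2Tokenizer
# assumption: recent transformers versions keep bytes_to_unicode here
from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# GPT-2's byte-level BPE first maps every raw byte value (0-255) to a printable
# unicode character, and each of those 256 characters is a base token in the vocab.
byte_to_char = bytes_to_unicode()                       # {byte value: unicode char}
byte_to_token = {b: tokenizer.convert_tokens_to_ids(c)  # {byte value: token id}
                 for b, c in byte_to_char.items()}

# 'Ģ' is U+0122, whose UTF-8 encoding is the two bytes 0xC4 0xA2.
raw = 'Ģ'.encode('utf-8')
print([byte_to_token[b] for b in raw])  # [128, 95]
print(tokenizer.encode('Ģ'))            # [128, 95], the same pair as above
```

Of course, encode() only falls back to those per-byte tokens after pre-tokenization and after the learned merges have been tried, so a character whose byte sequence does have a merge in the vocab would come back as a single id instead.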