
[`Core tokenization`] `add_dummy_prefix_space` option to help with latest issues

Open ArthurZucker opened this issue 2 years ago • 5 comments

What does this PR do?

Lets users control whether `tokenizer.tokenize` adds a prefix space. Let's also update the fast tokenizers!
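To make the behavior under discussion concrete, here is a minimal toy sketch of what a dummy prefix space does to SentencePiece-style tokenization. It is not the transformers implementation: the `toy_tokenize` function, the `TOY_VOCAB` set, and the greedy longest-match strategy are all hypothetical and chosen only for illustration. The keyword argument simply borrows the option name from this PR's title.

```python
SPIECE_UNDERLINE = "▁"  # U+2581, the SentencePiece whitespace marker

# Hypothetical toy vocabulary, not taken from any real model.
TOY_VOCAB = {"▁hello", "▁world", "▁", "hello", "world"}


def toy_tokenize(text, add_dummy_prefix_space=True):
    """Greedy longest-match tokenizer imitating SentencePiece-style spacing."""
    # Spaces become the underline marker, as in SentencePiece normalization.
    normalized = text.replace(" ", SPIECE_UNDERLINE)
    # The option under discussion: optionally prepend one extra marker.
    if add_dummy_prefix_space:
        normalized = SPIECE_UNDERLINE + normalized
    pieces, i = [], 0
    while i < len(normalized):
        # Try the longest candidate substring first.
        for j in range(len(normalized), i, -1):
            if normalized[i:j] in TOY_VOCAB:
                pieces.append(normalized[i:j])
                i = j
                break
        else:
            # Unknown character: emit it as its own piece.
            pieces.append(normalized[i])
            i += 1
    return pieces


print(toy_tokenize("hello world"))
print(toy_tokenize("hello world", add_dummy_prefix_space=False))
print(toy_tokenize(" world"))
print(toy_tokenize(" world", add_dummy_prefix_space=False))
```

In this toy setup, text that already starts with a space picks up a standalone `▁` piece when the dummy prefix is added, and tokenizes to a single `▁world` piece when it is not.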

ArthurZucker avatar Dec 13 '23 16:12 ArthurZucker

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Just wanted to say this would be hugely helpful for us over at https://github.com/probcomp/hfppl !

gabegrand avatar Jan 20 '24 20:01 gabegrand

Likewise, the ability to not include an extra SPIECE_UNDERLINE / Llama token 29871 when encoding a word with a space in front (` <word>`) would be huge for https://github.com/EleutherAI/lm-evaluation-harness !

haileyschoelkopf avatar Jan 21 '24 19:01 haileyschoelkopf

Will make it for the next release, I hope!

ArthurZucker avatar Feb 15 '24 13:02 ArthurZucker

Would love to see this as well! Thanks for working on this.

claudiosv avatar Feb 16 '24 20:02 claudiosv

Failing test is unrelated 😉

ArthurZucker avatar Feb 20 '24 11:02 ArthurZucker