[`Core tokenization`] `add_dummy_prefix_space` option to help with latest issues

Open ArthurZucker opened this issue 2 years ago • 5 comments

What does this PR do?

Allows users to use tokenizer.tokenize controlling the addition of prefix space. Let's also update fast!

Dec 13 '23 16:12 ArthurZucker

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Jan 16 '24 16:01 HuggingFaceDocBuilderDev

Just wanted to say this would be hugely helpful for us over at https://github.com/probcomp/hfppl !

Jan 20 '24 20:01 gabegrand

Likewise the ability to not include an extra SPIECE_UNDERLINE / Llama token 29871 when encoding a word with a space in front ( <word>) would be huge for https://github.com/EleutherAI/lm-evaluation-harness !

Jan 21 '24 19:01 haileyschoelkopf

will make it for next release I hope!

Feb 15 '24 13:02 ArthurZucker

Would love to see this as well! Thanks for working on this.

Feb 16 '24 20:02 claudiosv

Failing test is unrelated 😉

Feb 20 '24 11:02 ArthurZucker