[`Core tokenization`] `add_dummy_prefix_space` option to help with latest issues
What does this PR do?
Allows users to use tokenizer.tokenize controlling the addition of prefix space. Let's also update fast!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Just wanted to say this would be hugely helpful for us over at https://github.com/probcomp/hfppl !
Likewise the ability to not include an extra SPIECE_UNDERLINE / Llama token 29871 when encoding a word with a space in front ( <word>) would be huge for https://github.com/EleutherAI/lm-evaluation-harness !
will make it for next release I hope!
Would love to see this as well! Thanks for working on this.
Failing test is unrelated 😉