
token count is inconsistent with OpenAI tokenizer

Open · GorvGoyl opened this issue 2 years ago

As shown below:

[screenshots, 2023-11-21: Tiktokenizer and the OpenAI tokenizer reporting different token counts for the same text]

text:

<|im_start|>dd<|im_sep|>OpenAI's large language models (sometimes referred to as GPT's) process text using tokens, which are common sequences of characters found in a set of text. The models learn to understand the statistical relationships between these tokens, and excel at producing the next token in a sequence of tokens.<|im_end|><|im_start|>assistant<|im_sep|><|im_end|><|im_start|>assistant<|im_sep|>

GorvGoyl avatar Nov 21 '23 03:11 GorvGoyl

Any update on this? The counts match as long as the text doesn't contain the special tokens.

https://platform.openai.com/tokenizer [screenshot]

https://tiktokenizer.vercel.app/ [screenshot]

syntaxtrash avatar Dec 11 '23 07:12 syntaxtrash

Hello! The OpenAI tokenizer does not treat special tokens as single tokens, whereas Tiktokenizer does when gpt-3.5-turbo is selected. You can get the same behaviour by selecting cl100k_base or o200k_base.

dqbd avatar Feb 19 '25 08:02 dqbd
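
For anyone who wants to reproduce the difference locally, below is a minimal sketch using the tiktoken Python package. The custom-Encoding pattern is the one shown in the tiktoken README; the rank assigned to `<|im_sep|>` here is an assumption for illustration.

```python
# A minimal sketch of the discrepancy, using the tiktoken Python package
# (pip install tiktoken).
import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")

# Extend the base encoding with the chat special tokens, which is roughly
# what Tiktokenizer does when gpt-3.5-turbo is selected.
chat_enc = tiktoken.Encoding(
    name="cl100k_chat",
    pat_str=cl100k_base._pat_str,
    mergeable_ranks=cl100k_base._mergeable_ranks,
    special_tokens={
        **cl100k_base._special_tokens,
        "<|im_start|>": 100264,
        "<|im_end|>": 100265,
        "<|im_sep|>": 100266,  # assumed rank, for illustration only
    },
)

text = "<|im_start|>dd<|im_sep|>OpenAI's large language models ...<|im_end|>"

# OpenAI web tokenizer behaviour: the markers are ordinary text, so each
# one is split into several regular tokens.
plain = cl100k_base.encode(text, disallowed_special=())

# Tiktokenizer (gpt-3.5-turbo) behaviour: each marker is one special token.
special = chat_enc.encode(text, allowed_special="all")

print(len(plain), len(special))  # the plain-text count is larger
```

The difference in the two counts comes entirely from the special-token markers, which matches the observation above that the numbers agree once the special tokens are removed from the text.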