Wrong parameter passed to num_tokens_from_string
Fix a bug where the warning "Failed to get encoding for cl100k_base when getting num_tokens_from_string. Fall back to default encoding cl100k_base" is logged because an encoding name is passed as the model argument.
model is the correct parameter name here. Is there something upstream that may be causing your issue?
Code in entity_extraction_prompt.py:
tokens_left = (
    max_token_count
    - num_tokens_from_string(prompt, model=encoding_model)  # encoding_model = "o200k_base"
    - num_tokens_from_string(entity_types, model=encoding_model)
    if entity_types
    else 0
)
def num_tokens_from_string(
    string: str, model: str | None = None, encoding_name: str | None = None
) -> int:
    """Return the number of tokens in a text string."""
    if model is not None:
        try:
            # receives "o200k_base" here, which is an encoding name, not a model name
            encoding = tiktoken.encoding_for_model(model)
        except KeyError:
            msg = f"Failed to get encoding for {model} when getting num_tokens_from_string. Fall back to default encoding {DEFAULT_ENCODING_NAME}"
            log.warning(msg)
            encoding = tiktoken.get_encoding(DEFAULT_ENCODING_NAME)
    else:
        encoding = tiktoken.get_encoding(encoding_name or DEFAULT_ENCODING_NAME)
    return len(encoding.encode(string))
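For anyone reproducing this, here is a minimal sketch (assuming only that tiktoken is installed) of why passing an encoding name through the model parameter triggers the fallback warning: tiktoken.encoding_for_model only accepts model names such as "gpt-4o", while tiktoken.get_encoding accepts encoding names such as "o200k_base".

import tiktoken

# encoding_for_model expects a model name; an encoding name is not
# in its model-to-encoding lookup table, so it raises KeyError.
try:
    tiktoken.encoding_for_model("o200k_base")
except KeyError:
    print("KeyError: 'o200k_base' is an encoding name, not a model name")

# get_encoding is the right call for an encoding name.
encoding = tiktoken.get_encoding("o200k_base")
print(len(encoding.encode("hello world")))  # token count for the string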
That's correct. The parameter name for create_entity_extraction_prompt is encoding_model because it's a higher-level function, while the parameter in the token function is model because it refers directly to the model used by tiktoken. In create_entity_extraction_prompt we map between the two.
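To make the distinction concrete, here is a small sketch (not graphrag code, just the tiktoken calls involved): a model name is resolved to an encoding through tiktoken's internal mapping, while an encoding name is looked up directly.

import tiktoken

# A model name resolves to its encoding via tiktoken's internal table.
enc_from_model = tiktoken.encoding_for_model("gpt-4o")
print(enc_from_model.name)  # "o200k_base"

# An encoding name is looked up directly, no mapping involved.
enc_direct = tiktoken.get_encoding("o200k_base")
print(enc_direct.name)  # "o200k_base"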
Thanks @mmdsnb. I also encountered this issue earlier and included the fix in this PR.
@natoverse the problem here is that we're passing the encoding model name (i.e. o200k_base) using the wrong parameter. In num_tokens_from_string(...), the model parameter expects values like gpt-4o. To pass in the encoding model name, we just need to use a different parameter. It should look like this:
tokens_left = (
    max_token_count
    - num_tokens_from_string(prompt, encoding_name=encoding_model)
    - num_tokens_from_string(entity_types, encoding_name=encoding_model)
    if entity_types
    else 0
)
Notice that I changed model= to encoding_name= in both calls.
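With that change, num_tokens_from_string takes the else branch and resolves the encoding with tiktoken.get_encoding, so no warning is logged. A quick sanity check (hypothetical input string, same helper as above):

# Resolves via get_encoding("o200k_base"); no KeyError, no fallback warning.
prompt_tokens = num_tokens_from_string("some prompt text", encoding_name="o200k_base")
print(prompt_tokens)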
I see, thanks for the clarification. Closing this PR as it is resolved in your API refactor.