
Wrong parameter passed to num_tokens_from_string

mmdsnb opened this issue 1 year ago

Fix bug: the following warning is logged: `Failed to get encoding for cl100k_base when getting num_tokens_from_string. Fall back to default encoding cl100k_base`

mmdsnb avatar Aug 02 '24 13:08 mmdsnb

`model` is the correct parameter name here. Is there something upstream that may be causing your issue?

natoverse avatar Aug 05 '24 19:08 natoverse

Code in `entity_extraction_prompt.py`:

    tokens_left = (
        max_token_count
        - num_tokens_from_string(prompt, model=encoding_model)  # encoding_model = "o200k_base"
        - num_tokens_from_string(entity_types, model=encoding_model)
        if entity_types
        else 0
    )

    def num_tokens_from_string(
        string: str, model: str | None = None, encoding_name: str | None = None
    ) -> int:
        """Return the number of tokens in a text string."""
        if model is not None:
            try:
                # fails here: receives an encoding name ("o200k_base"), not a model name
                encoding = tiktoken.encoding_for_model(model)
            except KeyError:
                msg = f"Failed to get encoding for {model} when getting num_tokens_from_string. Fall back to default encoding {DEFAULT_ENCODING_NAME}"
                log.warning(msg)
                encoding = tiktoken.get_encoding(DEFAULT_ENCODING_NAME)
        else:
            encoding = tiktoken.get_encoding(encoding_name or DEFAULT_ENCODING_NAME)
        return len(encoding.encode(string))

mmdsnb avatar Aug 06 '24 06:08 mmdsnb


That's correct. The parameter name for `create_entity_extraction_prompt` is `encoding_model` because that's a higher-level function. The parameter in the token function is `model` because it refers directly to the model used by tiktoken. So in `create_entity_extraction_prompt` we map between the two.

natoverse avatar Aug 06 '24 22:08 natoverse

Thanks @mmdsnb. I also encountered this issue earlier and included the fix in this PR.

@natoverse the problem here is that we're passing the encoding model name (i.e. `o200k_base`) through the wrong parameter. In `num_tokens_from_string(...)`, the `model` parameter expects values like `gpt-4o`. To pass in the encoding model name, we just need to use a different parameter. It should look like this:

    tokens_left = (
        max_token_count
        - num_tokens_from_string(prompt, encoding_name=encoding_model)
        - num_tokens_from_string(entity_types, encoding_name=encoding_model)
        if entity_types
        else 0
    )

Notice where I changed `model=` to `encoding_name=`.

jgbradley1 avatar Aug 07 '24 05:08 jgbradley1

I see, thanks for the clarification. Closing this issue as it is resolved in your API refactor.

natoverse avatar Aug 07 '24 18:08 natoverse