fix: tiktoken error in offline mode

Open hikariming opened this issue 10 months ago • 1 comments

Summary

This PR aims to optimize the usage of tiktoken to support offline mode. The following steps have been taken:

  1. Download and Cache Vocabulary Files:

    • Created a directory named tiktoken and navigated into it.
    • Downloaded the vocab.bpe and encoder.json files from the official OpenAI public storage using wget.
    • Copied each file to the hashed file name that tiktoken looks up in its cache (6d1cbeee0f20b3d9449abfede4726ed8212e3aee for vocab.bpe and 6c7ea1a7e38e3a7f062df639a5b80947f075ffe6 for encoder.json).
    mkdir tiktoken
    cd tiktoken
    wget https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/vocab.bpe
    cp vocab.bpe 6d1cbeee0f20b3d9449abfede4726ed8212e3aee
    wget https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/encoder.json
    cp encoder.json 6c7ea1a7e38e3a7f062df639a5b80947f075ffe6
    
  2. Mount Tiktoken Cache:

    • Set the TIKTOKEN_CACHE_DIR environment variable to /app/api/.tiktoken/.
    • Mounted the local tiktoken directory to the corresponding container path /app/api/.tiktoken in the volume configuration.
    environment:
      TIKTOKEN_CACHE_DIR: /app/api/.tiktoken/
      ...
    volumes:
      - ./volumes/plugin_daemon:/app/storage
      - ./tiktoken:/app/api/.tiktoken
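
The hashed file names in step 1 can be derived rather than copied by hand. The sketch below assumes that tiktoken names each cached file after the SHA-1 hex digest of its download URL; that is my reading of tiktoken's cache loader, not something stated in this PR. If the assumption holds, the printed names should match the ones used in the cp commands. The helper names `cache_file_name` and `missing_cache_files` are hypothetical and exist only for this illustration.

```python
import hashlib
from pathlib import Path

# URLs taken from the wget commands in step 1.
GPT2_URLS = [
    "https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/vocab.bpe",
    "https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/encoder.json",
]


def cache_file_name(url: str) -> str:
    """Assumed tiktoken cache key: SHA-1 hex digest of the download URL."""
    return hashlib.sha1(url.encode()).hexdigest()


def missing_cache_files(cache_dir: str) -> list[str]:
    """Return the URLs whose hashed file is absent from cache_dir."""
    root = Path(cache_dir)
    return [url for url in GPT2_URLS if not (root / cache_file_name(url)).is_file()]


if __name__ == "__main__":
    # Print the expected cache file name next to each URL.
    for url in GPT2_URLS:
        print(cache_file_name(url), url)
    # "./tiktoken" is the local directory created in step 1.
    print("missing:", missing_cache_files("./tiktoken"))
```

Running this on the host before starting the containers gives a quick check that the mounted directory holds both files under the names the cache lookup is assumed to use.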
    

Reference

The steps follow "How to use tiktoken in offline mode computer" and https://github.com/langgenius/dify/issues/14565#issuecomment-2716290533

Testing

This method has been tested and confirmed to be effective.

Root Cause Analysis

The issue originates in tiktoken. Specifically, tiktoken_ext/openai_public.py#L17 tries to download the GPT-2 tokenizer files, which fails without network access, and Dify references the GPT-2 tokenizer at python/dify_plugin/interfaces/model/ai_model.py#L281.
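
To illustrate why a pre-populated cache avoids the failure, here is a simplified model of a cache-then-download lookup. It is not tiktoken's actual code: `read_cached` and `offline_fetch` are hypothetical names, and keying the cache by the SHA-1 of the URL is an assumption about how tiktoken behaves.

```python
import hashlib
import os
from typing import Callable


def read_cached(url: str, cache_dir: str, fetch: Callable[[str], bytes]) -> bytes:
    """Return the cached copy of url if present, otherwise call fetch(url)."""
    # Assumed cache key: SHA-1 hex digest of the URL.
    cache_path = os.path.join(cache_dir, hashlib.sha1(url.encode()).hexdigest())
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return f.read()  # cache hit: no network access needed
    data = fetch(url)  # cache miss: this is the step that fails offline
    os.makedirs(cache_dir, exist_ok=True)
    with open(cache_path, "wb") as f:
        f.write(data)  # store the download for later lookups
    return data


def offline_fetch(url: str) -> bytes:
    """Hypothetical stand-in for a download attempted without network access."""
    raise ConnectionError(f"cannot reach {url}")
```

With an empty cache directory, `read_cached(url, cache_dir, offline_fetch)` raises; once the file exists under its hashed name, the same call returns the cached bytes without touching the network.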

This optimization ensures that tiktoken can function properly in an offline environment.

  • Close https://github.com/langgenius/dify/issues/14565
  • Close https://github.com/langgenius/dify/issues/15372
  • Close https://github.com/langgenius/dify/issues/16287
  • Close https://github.com/langgenius/dify/issues/16042
  • Close https://github.com/langgenius/dify/issues/16149
  • Close https://github.com/langgenius/dify/issues/15849
  • Close https://github.com/langgenius/dify/issues/16232
  • Close https://github.com/langgenius/dify/issues/16427

Checklist

[!IMPORTANT]
Please review the checklist below before submitting your pull request.

  • [ ] This change requires a documentation update, included: Dify Document
  • [x] I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
  • [x] I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • [x] I've updated the documentation accordingly.
  • [x] I ran dev/reformat(backend) and cd web && npx lint-staged(frontend) to appease the lint gods

hikariming avatar Mar 27 '25 03:03 hikariming

In my experience, a more reliable way to guarantee the required tiktoken cache is to run tiktoken.encoding_for_model(model_name) in api/Dockerfile. A mounted tiktoken cache is fragile, and it is hard to explain what the hashed file names in the path mean or what the files actually contain.

bowenliang123 avatar Mar 27 '25 19:03 bowenliang123

Fixed it! Really appreciate the help!

imJaydenDu avatar Apr 23 '25 08:04 imJaydenDu