
[Feature]: Support huggingface transformers LLM model

Open Zjq9409 opened this issue 2 years ago • 15 comments

Is your feature request related to a problem? Please describe.

Can chat caching be supported for Hugging Face LLM models?

Describe the solution you'd like.

No response

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

Zjq9409 avatar May 11 '23 09:05 Zjq9409

You can try the GPTCache API; a simple example:

from gptcache.adapter.api import put, get, init_similar_cache

init_similar_cache()
put("hello", "foo")
print(get("hello"))

SimFG avatar May 11 '23 09:05 SimFG

Thank you for your prompt response. Does calling the cache through LangChainLLMs wrap the Hugging Face model? From reading the documentation, GPTCache does not seem to support the Hugging Face Hub.

[screenshot of the GPTCache documentation]

Zjq9409 avatar May 11 '23 13:05 Zjq9409

Yes, you can use LangChainLLMs, like:

from gptcache.adapter.langchain_models import LangChainLLMs
from langchain.llms import OpenAI

question = "what is GPTCache"  # any prompt string

langchain_openai = OpenAI(model_name="text-ada-001")
llm = LangChainLLMs(llm=langchain_openai)
answer = llm(prompt=question)

If you use LangChain, you can also use it like this:

import langchain
from langchain.cache import GPTCache
from gptcache.adapter.api import init_similar_cache

langchain.llm_cache = GPTCache(init_func=lambda cache: init_similar_cache(cache_obj=cache))

SimFG avatar May 11 '23 13:05 SimFG

@Zjq9409 If there are no other questions, I will close the issue.

SimFG avatar May 12 '23 03:05 SimFG

from langchain.llms import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, AutoModelForSeq2SeqLM
from langchain import PromptTemplate, HuggingFaceHub, LLMChain

from gptcache.adapter.langchain_models import LangChainLLMs
from gptcache.manager import get_data_manager, CacheBase, VectorBase
from gptcache import Cache
from gptcache.embedding import Onnx
from gptcache.processor.pre import get_prompt
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

import time
import os

os.environ['HUGGINGFACEHUB_API_TOKEN'] = ''
model_id = 'google/flan-t5-base'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
pipe = pipeline(
    "text2text-generation",
    model=model, 
    tokenizer=tokenizer, 
    max_length=100
)

local_llm = HuggingFacePipeline(pipeline=pipe)
llm_cache = Cache()
onnx = Onnx()
cache_base = CacheBase('sqlite')
vector_base = VectorBase('faiss', dimension=onnx.dimension)
data_manager = get_data_manager(cache_base, vector_base, max_size=10, clean_size=2)
llm_cache.init(
    pre_embedding_func=get_prompt,
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
question = "上海有什么好吃的"  # "What good food is there in Shanghai?"
before = time.time()
cached_llm = LangChainLLMs(llm=local_llm)
answer = cached_llm(prompt=question, cache_obj=llm_cache)
print(answer)
print("Read through Time Spent =", time.time() - before)

questions = [
    '上海有哪些好吃的地方',  # "Which places in Shanghai have good food?"
    '上海有哪些好吃的美食',  # "What delicious foods does Shanghai have?"
    '上海的美食有什么',      # "What food does Shanghai have?"
    '上海有什么好玩的地方',  # "What fun places are there in Shanghai?"
    '怎么还花呗'             # "How do I repay Huabei?" (unrelated question)
]
for question in questions:
    before = time.time()
    answer = cached_llm(prompt=question, cache_obj=llm_cache)
    print(f'Question: {question}')
    print(answer)
    print("Cache Hit Time Spent =", time.time() - before)
上海有什么好吃的 上海是一个美食之都,有很多好吃的选择。以下是一些可以尝试的:

1. 小笼包:小笼包是上海最有名的美食之一。可以去著名的“朱家角小笼包”品尝。

2. 生煎包:生煎包是上海另一种著名的美食。可以去“沈大成”或者“老正兴”品尝。

3. 红烧肉:红烧肉是上海传统的名菜之一。可以去“红烧肉大师”
Read through Time Spent = 0.35816311836242676
Question: 上海有哪些好吃的地方
上海有什么好吃的 上海是一个美食之都,有很多好吃的选择。以下是一些可以尝试的:

1. 小笼包:小笼包是上海最有名的美食之一。可以去著名的“朱家角小笼包”品尝。

2. 生煎包:生煎包是上海另一种著名的美食。可以去“沈大成”或者“老正兴”品尝。

3. 红烧肉:红烧肉是上海传统的名菜之一。可以去“红烧肉大师”
Cache Hit Time Spent = 0.14227008819580078
Question: 上海有哪些好吃的美食
上海有什么好吃的 上海是一个美食之都,有很多好吃的选择。以下是一些可以尝试的:

1. 小笼包:小笼包是上海最有名的美食之一。可以去著名的“朱家角小笼包”品尝。

2. 生煎包:生煎包是上海另一种著名的美食。可以去“沈大成”或者“老正兴”品尝。

3. 红烧肉:红烧肉是上海传统的名菜之一。可以去“红烧肉大师”
Cache Hit Time Spent = 0.13515877723693848
Question: 上海的美食有什么
上海有什么好吃的 上海是一个美食之都,有很多好吃的选择。以下是一些可以尝试的:

1. 小笼包:小笼包是上海最有名的美食之一。可以去著名的“朱家角小笼包”品尝。

2. 生煎包:生煎包是上海另一种著名的美食。可以去“沈大成”或者“老正兴”品尝。

3. 红烧肉:红烧肉是上海传统的名菜之一。可以去“红烧肉大师”
Cache Hit Time Spent = 0.13379120826721191
Question: 上海有什么好玩的地方
上海有什么好吃的 上海是一个美食之都,有很多好吃的选择。以下是一些可以尝试的:

1. 小笼包:小笼包是上海最有名的美食之一。可以去著名的“朱家角小笼包”品尝。

2. 生煎包:生煎包是上海另一种著名的美食。可以去“沈大成”或者“老正兴”品尝。

3. 红烧肉:红烧肉是上海传统的名菜之一。可以去“红烧肉大师”
Cache Hit Time Spent = 0.14519858360290527
Question: 怎么还花呗
上海有什么好吃的 上海是一个美食之都,有很多好吃的选择。以下是一些可以尝试的:

1. 小笼包:小笼包是上海最有名的美食之一。可以去著名的“朱家角小笼包”品尝。

2. 生煎包:生煎包是上海另一种著名的美食。可以去“沈大成”或者“老正兴”品尝。

3. 红烧肉:红烧肉是上海传统的名菜之一。可以去“红烧肉大师”
Cache Hit Time Spent = 0.14341497421264648

Why does the last question hit the cache?

Zjq9409 avatar May 12 '23 09:05 Zjq9409

Because onnx.to_embeddings is an English embedding model, it cannot tell the Chinese queries apart; you need a Chinese embedding model. Reference: #317
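To see why, here is a minimal pure-Python sketch (a toy stand-in, not the actual ONNX model): an embedding that only recognizes English vocabulary maps every all-Chinese query to the same vector, so the cache cannot tell them apart and any stored answer looks like a match.

```python
import math

# Toy "English-only" embedding: it counts occurrences of a small English
# vocabulary, so tokens outside that vocabulary contribute nothing and
# all-Chinese queries collapse onto the same zero vector.
VOCAB = ["food", "place", "shanghai", "fun", "repay"]

def english_only_embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a, b):
    # Cosine similarity; zero vectors fall back to norm 1.0 to avoid
    # division by zero.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

food_q = english_only_embed("上海有什么好吃的")   # the food question
unrelated_q = english_only_embed("怎么还花呗")    # the repayment question

# Identical zero vectors: the cache cannot distinguish the two queries.
print(food_q == unrelated_q)  # True

# The same embedding does separate English queries.
print(english_only_embed("shanghai food") == english_only_embed("repay my bill"))  # False
```

A Chinese or multilingual embedding model (see #317) produces distinct vectors for distinct Chinese queries, which is what SearchDistanceEvaluation needs in order to reject the unrelated question.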

SimFG avatar May 12 '23 09:05 SimFG

Does it support Hugging Face conversation caching?

Zjq9409 avatar May 13 '23 07:05 Zjq9409

Regarding conversations, it depends on how you call the LLM. If you pass all the conversation info to GPTCache, it will work fine, like the messages field of OpenAI's chat completion, which carries the full conversation.
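For example, a hedged sketch of that idea (flatten_conversation is illustrative, not GPTCache's API): the pre-embedding step joins the whole message history into one string, so the cache key reflects the full context rather than just the last user message.

```python
def flatten_conversation(messages):
    """Join all turns of an OpenAI-style message list
    ([{"role": ..., "content": ...}, ...]) into one string, so the
    cache key captures the full conversation context."""
    return "\n".join(f'{m["role"]}: {m["content"]}' for m in messages)

with_history = [
    {"role": "user", "content": "My name is Alice."},
    {"role": "assistant", "content": "Nice to meet you, Alice."},
    {"role": "user", "content": "What is my name?"},
]
without_history = [
    {"role": "user", "content": "What is my name?"},
]

# Same final question, different history -> different cache keys.
print(flatten_conversation(with_history) != flatten_conversation(without_history))  # True
```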

SimFG avatar May 15 '23 02:05 SimFG

from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms.base import LLM
from langchain.document_loaders import UnstructuredFileLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from transformers import AutoTokenizer, AutoModel, AutoConfig
from sentence_transformers import SentenceTransformer
import torch
import os
from typing import List
import re
from tqdm import tqdm
import datetime
import numpy as np
import intel_extension_for_pytorch as ipex

EMBEDDING_MODEL = "text2vec"  # embedding model name, key of embedding_model_dict
VECTOR_SEARCH_TOP_K = 6
LLM_MODEL = "chatglm-6b"      # LLM model name, key of llm_model_dict
LLM_HISTORY_LEN = 3
DEVICE = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
STREAMING = False
SENTENCE_SIZE = 100
CHUNK_SIZE = 250
embeddings = None
embedding_model_dict = {
    "text2vec": "/home/intel/zjq/prompt_test/text2vec-large-chinese/",
}

llm_model_dict = {
    "chatglm-6b-int4-qe": "THUDM/chatglm-6b-int4-qe",
    "chatglm-6b-int4": "THUDM/chatglm-6b-int4",
    "chatglm-6b": "/home/intel/zjq/chatglm",
}
VS_ROOT_PATH = os.path.join(os.path.dirname(os.path.dirname(__file__)), "vector_store")
class ChatGLM(LLM):
    max_token: int = 10000
    temperature: float = 0.8
    top_p = 0.9
    tokenizer: object = None
    model: object = None
    model_bf16 : object = None
    history_len: int = 10

    def __init__(self):
        super().__init__()

    @property
    def _llm_type(self) -> str:
        return "ChatGLM"

    def _call(self,
              prompt: str,
              history: List[List[str]] = [],
              streaming: bool = STREAMING,
              stop: List[str] = []
              ):  # -> Tuple[str, List[List[str]]]:
        import intel_extension_for_pytorch as ipex
        
        if streaming:
            
            for inum, (stream_resp, _) in enumerate(self.model_bf16.stream_chat(
                    self.tokenizer,
                    prompt,
                    history=history[-self.history_len:-1] if self.history_len > 0 else [],
                    max_length=self.max_token,
                    temperature=self.temperature,
                    top_p=self.top_p,
            )):
                if inum == 0:
                    history += [[prompt, stream_resp]]
                else:
                    history[-1] = [prompt, stream_resp]
                yield stream_resp, history
        else:
            response, _ = self.model_bf16.chat(
                self.tokenizer,
                prompt,
                history=history[-self.history_len:] if self.history_len > 0 else [],
                max_length=self.max_token,
                temperature=self.temperature,
                top_p=self.top_p,
            )
            history += [[prompt, response]]
            yield response, history

    def load_model(self,
                   model_name_or_path: str = "THUDM/chatglm-6b-int4",
                   llm_device=DEVICE,
                   **kwargs):
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_name_or_path,
            trust_remote_code=True
        )
        model_config = AutoConfig.from_pretrained(model_name_or_path, trust_remote_code=True)
        self.model = AutoModel.from_pretrained(model_name_or_path, config=model_config, trust_remote_code=True,
                                              **kwargs)
        self.model = self.model.float().to(llm_device)
        self.model = self.model.eval()
        self.model_bf16 = ipex.optimize(
                self.model,
                dtype=torch.bfloat16,
                graph_mode=True,
                auto_kernel_selection=True,
                inplace=True,
                replace_dropout_with_identity=True)
        self.model_bf16 = self.model_bf16.eval()

from gptcache.adapter.langchain_models import LangChainLLMs, LangChainChat

from gptcache.manager import get_data_manager, CacheBase, VectorBase
from gptcache import Cache
from gptcache.embedding import Onnx
from gptcache.processor.pre import get_prompt
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation
from gptcache.embedding import Huggingface


huggingface = Huggingface()  # instantiate the embedding model before using it below
llm_cache = Cache()
cache_base = CacheBase('sqlite')
vector_base = VectorBase('faiss', dimension=huggingface.dimension)
data_manager = get_data_manager(cache_base, vector_base, max_size=10, clean_size=2)
llm_cache.init(
    pre_embedding_func=get_prompt,
    embedding_func=huggingface.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)


llm = ChatGLM()
llm.load_model(model_name_or_path="THUDM/chatglm-6b-int4",
               llm_device='cpu')
cached_llm = LangChainLLMs(llm=llm)

answer = cached_llm(prompt="你好", cache_obj=llm_cache)

I used chatglm to generate chat and need to cache it, but got an error.

 /home/intel/zjq/anaconda3/envs/chatglm/lib/python3.10/site-packages/gptcache/adapter/langchain_m │
│ odels.py:60 in __call__                                                                          │
│                                                                                                  │
│    57 │   │   )                                                                                  │
│    58 │                                                                                          │
│    59 │   def __call__(self, prompt: str, stop: Optional[List[str]] = None, **kwargs) -> str:    │
│ ❱  60 │   │   return self._call(prompt=prompt, stop=stop, **kwargs)                              │
│    61                                                                                            │
│    62                                                                                            │
│    63 # pylint: disable=protected-access                                                         │
│                                                                                                  │
│ /home/intel/zjq/anaconda3/envs/chatglm/lib/python3.10/site-packages/gptcache/adapter/langchain_m │
│ odels.py:49 in _call                                                                             │
│                                                                                                  │
│    46 │                                                                                          │
│    47 │   def _call(self, prompt: str, stop: Optional[List[str]] = None, **kwargs) -> str:       │
│    48 │   │   session = self.session if "session" not in kwargs else kwargs.pop("session")       │
│ ❱  49 │   │   return adapt(                                                                      │
│    50 │   │   │   self.llm,                                                                      │
│    51 │   │   │   cache_data_convert,                                                            │
│    52 │   │   │   update_cache_callback,                                                         │
│                                                                                                  │
│ /home/intel/zjq/anaconda3/envs/chatglm/lib/python3.10/site-packages/gptcache/adapter/adapter.py: │
│ 142 in adapt                                                                                     │
│                                                                                                  │
│   139 │   │   │   llm_handler, cache_data_convert, update_cache_callback, *args, **kwargs        │
│   140 │   │   )                                                                                  │
│   141 │   else:                                                                                  │
│ ❱ 142 │   │   llm_data = llm_handler(*args, **kwargs)                                            │
│   143 │                                                                                          │
│   144 │   if cache_enable:                                                                       │
│   145 │   │   try:                                                                               │
│                                                                                                  │
│ /home/intel/zjq/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/llms/base.py:246   │
│ in __call__                                                                                      │
│                                                                                                  │
│   243 │                                                                                          │
│   244 │   def __call__(self, prompt: str, stop: Optional[List[str]] = None) -> str:              │
│   245 │   │   """Check Cache and run the LLM on the given prompt and input."""                   │
│ ❱ 246 │   │   return self.generate([prompt], stop=stop).generations[0][0].text                   │
│   247 │                                                                                          │
│   248 │   @property                                                                              │
│   249 │   def _identifying_params(self) -> Mapping[str, Any]:                                    │
│                                                                                                  │
│ /home/intel/zjq/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/llms/base.py:140   │
│ in generate                                                                                      │
│                                                                                                  │
│   137 │   │   │   │   output = self._generate(prompts, stop=stop)                                │
│   138 │   │   │   except (KeyboardInterrupt, Exception) as e:                                    │
│   139 │   │   │   │   self.callback_manager.on_llm_error(e, verbose=self.verbose)                │
│ ❱ 140 │   │   │   │   raise e                                                                    │
│   141 │   │   │   self.callback_manager.on_llm_end(output, verbose=self.verbose)                 │
│   142 │   │   │   return output                                                                  │
│   143 │   │   params = self.dict()                                                               │
│                                                                                                  │
│ /home/intel/zjq/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/llms/base.py:137   │
│ in generate                                                                                      │
│                                                                                                  │
│   134 │   │   │   │   {"name": self.__class__.__name__}, prompts, verbose=self.verbose           │
│   135 │   │   │   )                                                                              │
│   136 │   │   │   try:                                                                           │
│ ❱ 137 │   │   │   │   output = self._generate(prompts, stop=stop)                                │
│   138 │   │   │   except (KeyboardInterrupt, Exception) as e:                                    │
│   139 │   │   │   │   self.callback_manager.on_llm_error(e, verbose=self.verbose)                │
│   140 │   │   │   │   raise e                                                                    │
│                                                                                                  │
│ /home/intel/zjq/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/llms/base.py:325   │
│ in _generate                                                                                     │
│                                                                                                  │
│   322 │   │   generations = []                                                                   │
│   323 │   │   for prompt in prompts:                                                             │
│   324 │   │   │   text = self._call(prompt, stop=stop)                                           │
│ ❱ 325 │   │   │   generations.append([Generation(text=text)])                                    │
│   326 │   │   return LLMResult(generations=generations)                                          │
│   327 │                                                                                          │
│   328 │   async def _agenerate(                                                                  │
│                                                                                                  │
│ /home/intel/zjq/ChatGLM-6B/prompt_engine/pydantic/main.py:341 in                                 │
│ pydantic.main.BaseModel.__init__                                                                 │
│                                                                                                  │
│ [Errno 2] No such file or directory: '/home/intel/zjq/ChatGLM-6B/prompt_engine/pydantic/main.py' │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValidationError: 1 validation error for Generation

Zjq9409 avatar May 17 '23 06:05 Zjq9409

@Zjq9409 From the error stack, can you try running the LLM directly, like:

llm(prompt="你好")

because I guess the error is caused by empty text in generations.append([Generation(text=text)]), shown in the last frame of the trace.

SimFG avatar May 17 '23 07:05 SimFG

It still reports the same problem.

Zjq9409 avatar May 17 '23 09:05 Zjq9409

@Zjq9409 If you hit the same problem when you run:

llm(prompt="你好")

then it looks like it is not caused by GPTCache but by the LLM model itself.

SimFG avatar May 17 '23 09:05 SimFG

Actually, the usage method is llm._call("你好"), but I need to combine it with LangChainLLMs to use the cache, right?

Zjq9409 avatar May 17 '23 12:05 Zjq9409

@Zjq9409 yep

SimFG avatar May 17 '23 12:05 SimFG

When the cache is empty, it will call the original LLM model to get the answer, and the answer will then be saved to the cache. The next time you make a similar request, you will get the answer from the cache.
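This read-through flow can be sketched in a few lines of plain Python (an exact-match dict stands in for GPTCache's similarity search, and slow_llm is a hypothetical stand-in for the real model call):

```python
import time

cache = {}  # prompt -> answer; GPTCache uses vector similarity instead of exact keys

def slow_llm(prompt):
    """Hypothetical stand-in for the real model call."""
    time.sleep(0.05)
    return f"answer to: {prompt}"

def cached_call(prompt):
    # 1. Look in the cache first.
    if prompt in cache:
        return cache[prompt]      # cache hit: no model call
    # 2. On a miss, call the origin model ...
    answer = slow_llm(prompt)
    # 3. ... and save the answer for next time.
    cache[prompt] = answer
    return answer

t0 = time.time()
cached_call("hello")              # miss: goes to the model
miss_time = time.time() - t0

t0 = time.time()
cached_call("hello")              # hit: served from the cache
hit_time = time.time() - t0
print(hit_time < miss_time)  # True
```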

SimFG avatar May 17 '23 12:05 SimFG

@Zjq9409 Is your problem solved? If you want to use a Hugging Face transformers LLM model, you can use the GPTCache API. If you encounter other problems, you can open a new issue.

SimFG avatar May 22 '23 07:05 SimFG

I also encountered this problem. How can it be solved?

iunique avatar May 26 '23 08:05 iunique

@iunique hi, you can open a new issue and describe your problem.

SimFG avatar May 26 '23 09:05 SimFG