[Feature]: Support huggingface transformers LLM model
Is your feature request related to a problem? Please describe.
Can huggingface LLM model chat caching be support?
Describe the solution you'd like.
No response
Describe an alternate solution.
No response
Anything else? (Additional Context)
No response
You can try to use the GPTCache api, a simple example like:
from gptcache.adapter.api import put, get, init_similar_cache
init_similar_cache()
put("hello", "foo")
print(get("hello"))
Thank you for your prompt response. Is calling cache through LangChainLLMs encapsulating the hugging face model? GPTCache do not support Hugging Face Hub when reading the document.
Yes, you can use the LangChainLLMs, like:
from gptcache.adapter.langchain_models import LangChainLLMs
from langchain.llms import OpenAI
langchain_openai = OpenAI(model_name="text-ada-001")
llm = LangChainLLMs(llm=langchain_openai)
answer = llm(prompt=question)
if you use the langchain, you can also use it like:
import langchain
from langchain.cache import GPTCache
from gptcache.adapter.api import init_similar_cache
langchain.llm_cache = GPTCache(init_func=lambda cache: init_similar_cache(cache_obj=cache))
@Zjq9409 If there isn't other question, i will close the issue
from langchain.llms import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, AutoModelForSeq2SeqLM
from langchain import PromptTemplate, HuggingFaceHub, LLMChain
from langchain.llms import OpenAI
from langchain import PromptTemplate
from gptcache.adapter.langchain_models import LangChainLLMs
from gptcache.manager import get_data_manager, CacheBase, VectorBase
from gptcache import Cache
from gptcache.embedding import Onnx
from gptcache.processor.pre import get_prompt
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation
import time
import os
from transformers import AutoModel, AutoTokenizer
os.environ['HUGGINGFACEHUB_API_TOKEN'] = ''
model_id = 'google/flan-t5-base'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
pipe = pipeline(
"text2text-generation",
model=model,
tokenizer=tokenizer,
max_length=100
)
local_llm = HuggingFacePipeline(pipeline=pipe)
llm_cache = Cache()
onnx = Onnx()
cache_base = CacheBase('sqlite')
vector_base = VectorBase('faiss', dimension=onnx.dimension)
data_manager = get_data_manager(cache_base, vector_base, max_size=10, clean_size=2)
llm_cache.init(
pre_embedding_func=get_prompt,
embedding_func=onnx.to_embeddings,
data_manager=data_manager,
similarity_evaluation=SearchDistanceEvaluation(),
)
question = "上海有什么好吃的"
before = time.time()
cached_llm = LangChainLLMs(llm=local_llm)
answer = cached_llm(prompt=question, cache_obj=llm_cache)
print(answer)
print("Read through Time Spent =", time.time() - before)
questions = [
'上海有哪些好吃的地方',
'上海有哪些好吃的美食',
'上海的美食有什么',
'上海有什么好玩的地方',
'怎么还花呗'
]
for question in questions:
before = time.time()
answer = cached_llm(prompt=question, cache_obj=llm_cache)
print(f'Question: {question}')
print(answer)
print("Cache Hit Time Spent =", time.time() - before)
上海有什么好吃的 上海是一个美食之都,有很多好吃的选择。以下是一些可以尝试的:
1. 小笼包:小笼包是上海最有名的美食之一。可以去著名的“朱家角小笼包”品尝。
2. 生煎包:生煎包是上海另一种著名的美食。可以去“沈大成”或者“老正兴”品尝。
3. 红烧肉:红烧肉是上海传统的名菜之一。可以去“红烧肉大师”
Read through Time Spent = 0.35816311836242676
Question: 上海有哪些好吃的地方
上海有什么好吃的 上海是一个美食之都,有很多好吃的选择。以下是一些可以尝试的:
1. 小笼包:小笼包是上海最有名的美食之一。可以去著名的“朱家角小笼包”品尝。
2. 生煎包:生煎包是上海另一种著名的美食。可以去“沈大成”或者“老正兴”品尝。
3. 红烧肉:红烧肉是上海传统的名菜之一。可以去“红烧肉大师”
Cache Hit Time Spent = 0.14227008819580078
Question: 上海有哪些好吃的美食
上海有什么好吃的 上海是一个美食之都,有很多好吃的选择。以下是一些可以尝试的:
1. 小笼包:小笼包是上海最有名的美食之一。可以去著名的“朱家角小笼包”品尝。
2. 生煎包:生煎包是上海另一种著名的美食。可以去“沈大成”或者“老正兴”品尝。
3. 红烧肉:红烧肉是上海传统的名菜之一。可以去“红烧肉大师”
Cache Hit Time Spent = 0.13515877723693848
Question: 上海的美食有什么
上海有什么好吃的 上海是一个美食之都,有很多好吃的选择。以下是一些可以尝试的:
1. 小笼包:小笼包是上海最有名的美食之一。可以去著名的“朱家角小笼包”品尝。
2. 生煎包:生煎包是上海另一种著名的美食。可以去“沈大成”或者“老正兴”品尝。
3. 红烧肉:红烧肉是上海传统的名菜之一。可以去“红烧肉大师”
Cache Hit Time Spent = 0.13379120826721191
Question: 上海有什么好玩的地方
上海有什么好吃的 上海是一个美食之都,有很多好吃的选择。以下是一些可以尝试的:
1. 小笼包:小笼包是上海最有名的美食之一。可以去著名的“朱家角小笼包”品尝。
2. 生煎包:生煎包是上海另一种著名的美食。可以去“沈大成”或者“老正兴”品尝。
3. 红烧肉:红烧肉是上海传统的名菜之一。可以去“红烧肉大师”
Cache Hit Time Spent = 0.14519858360290527
Question: 怎么还花呗
上海有什么好吃的 上海是一个美食之都,有很多好吃的选择。以下是一些可以尝试的:
1. 小笼包:小笼包是上海最有名的美食之一。可以去著名的“朱家角小笼包”品尝。
2. 生煎包:生煎包是上海另一种著名的美食。可以去“沈大成”或者“老正兴”品尝。
3. 红烧肉:红烧肉是上海传统的名菜之一。可以去“红烧肉大师”
Cache Hit Time Spent = 0.14341497421264648
Why is the last question is cached?
because the onnx.to_embeddings is the english embedding model, you need a chinese embedding model, reference: #317
Does it support huggingface conversation caching?
About the conversation situation, it depends on how to use the llm. If you can give all the conversation info to the GPTCache, it will work finely, like the the messages of openai's chat complete, which provides the full conversation info.
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms.base import LLM
from langchain.document_loaders import UnstructuredFileLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from transformers import AutoTokenizer, AutoModel, AutoConfig
from sentence_transformers import SentenceTransformer
import torch
import os
import torch
from typing import List
import re
from tqdm import tqdm
import datetime
import numpy as np
import intel_extension_for_pytorch as ipex
EMBEDDING_MODEL = "text2vec" # embedding 模型,对应 embedding_model_dict
VECTOR_SEARCH_TOP_K = 6
LLM_MODEL = "chatglm-6b" # LLM 模型名,对应 llm_model_dict
LLM_HISTORY_LEN = 3
DEVICE = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
STREAMING = False
SENTENCE_SIZE = 100
CHUNK_SIZE = 250
embeddings = None
embedding_model_dict = {
"text2vec": "/home/intel/zjq/prompt_test/text2vec-large-chinese/",
}
llm_model_dict = {
"chatglm-6b-int4-qe": "THUDM/chatglm-6b-int4-qe",
"chatglm-6b-int4": "THUDM/chatglm-6b-int4",
"chatglm-6b": "/home/intel/zjq/chatglm",
}
VS_ROOT_PATH = os.path.join(os.path.dirname(os.path.dirname(__file__)), "vector_store")
class ChatGLM(LLM):
max_token: int = 10000
temperature: float = 0.8
top_p = 0.9
tokenizer: object = None
model: object = None
model_bf16 : object = None
history_len: int = 10
def __init__(self):
super().__init__()
@property
def _llm_type(self) -> str:
return "ChatGLM"
def _call(self,
prompt: str,
history: List[List[str]] = [],
streaming: bool = STREAMING,
stop: List[str] = []
): # -> Tuple[str, List[List[str]]]:
import intel_extension_for_pytorch as ipex
if streaming:
for inum, (stream_resp, _) in enumerate(self.model_bf16.stream_chat(
self.tokenizer,
prompt,
history=history[-self.history_len:-1] if self.history_len > 0 else [],
max_length=self.max_token,
temperature=self.temperature,
top_p=self.top_p,
)):
if inum == 0:
history += [[prompt, stream_resp]]
else:
history[-1] = [prompt, stream_resp]
yield stream_resp, history
else:
response, _ = self.model_bf16.chat(
self.tokenizer,
prompt,
history=history[-self.history_len:] if self.history_len > 0 else [],
max_length=self.max_token,
temperature=self.temperature,
top_p=self.top_p,
)
history += [[prompt, response]]
yield response, history
def load_model(self,
model_name_or_path: str = "THUDM/chatglm-6b-int4",
llm_device=DEVICE,
**kwargs):
self.tokenizer = AutoTokenizer.from_pretrained(
model_name_or_path,
trust_remote_code=True
)
model_config = AutoConfig.from_pretrained(model_name_or_path, trust_remote_code=True)
self.model = AutoModel.from_pretrained(model_name_or_path, config=model_config, trust_remote_code=True,
**kwargs)
self.model = self.model.float().to(llm_device)
self.model = self.model.eval()
self.model_bf16 = ipex.optimize(
self.model,
dtype=torch.bfloat16,
graph_mode=True,
auto_kernel_selection=True,
inplace=True,
replace_dropout_with_identity=True)
self.model_bf16 = self.model_bf16.eval()
from gptcache.adapter.langchain_models import LangChainLLMs
from gptcache.adapter.langchain_models import LangChainChat
from gptcache.manager import get_data_manager, CacheBase, VectorBase
from gptcache import Cache
from gptcache.embedding import Onnx
from gptcache.processor.pre import get_prompt
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation
from gptcache.embedding import Huggingface
llm_cache = Cache()
cache_base = CacheBase('sqlite')
vector_base = VectorBase('faiss', dimension=huggingface.dimension)
data_manager = get_data_manager('sqlite', vector_base, max_size=10, clean_size=2)
llm_cache.init(
pre_embedding_func=get_prompt,
embedding_func=huggingface.to_embeddings,
data_manager=data_manager,
similarity_evaluation=SearchDistanceEvaluation(),
)
from gptcache.adapter.langchain_models import LangChainChat
llm = ChatGLM()
llm.load_model(model_name_or_path="THUDM/chatglm-6b-int4",
llm_device='cpu')
cached_llm = LangChainLLMs(llm=llm)
answer = cached_llm(prompt="你好", cache_obj=llm_cache)
I used chatglm to generate chat and need to cache it, but got an error.
/home/intel/zjq/anaconda3/envs/chatglm/lib/python3.10/site-packages/gptcache/adapter/langchain_m │
│ odels.py:60 in __call__ │
│ │
│ 57 │ │ ) │
│ 58 │ │
│ 59 │ def __call__(self, prompt: str, stop: Optional[List[str]] = None, **kwargs) -> str: │
│ ❱ 60 │ │ return self._call(prompt=prompt, stop=stop, **kwargs) │
│ 61 │
│ 62 │
│ 63 # pylint: disable=protected-access │
│ │
│ /home/intel/zjq/anaconda3/envs/chatglm/lib/python3.10/site-packages/gptcache/adapter/langchain_m │
│ odels.py:49 in _call │
│ │
│ 46 │ │
│ 47 │ def _call(self, prompt: str, stop: Optional[List[str]] = None, **kwargs) -> str: │
│ 48 │ │ session = self.session if "session" not in kwargs else kwargs.pop("session") │
│ ❱ 49 │ │ return adapt( │
│ 50 │ │ │ self.llm, │
│ 51 │ │ │ cache_data_convert, │
│ 52 │ │ │ update_cache_callback, │
│ │
│ /home/intel/zjq/anaconda3/envs/chatglm/lib/python3.10/site-packages/gptcache/adapter/adapter.py: │
│ 142 in adapt │
│ │
│ 139 │ │ │ llm_handler, cache_data_convert, update_cache_callback, *args, **kwargs │
│ 140 │ │ ) │
│ 141 │ else: │
│ ❱ 142 │ │ llm_data = llm_handler(*args, **kwargs) │
│ 143 │ │
│ 144 │ if cache_enable: │
│ 145 │ │ try: │
│ │
│ /home/intel/zjq/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/llms/base.py:246 │
│ in __call__ │
│ │
│ 243 │ │
│ 244 │ def __call__(self, prompt: str, stop: Optional[List[str]] = None) -> str: │
│ 245 │ │ """Check Cache and run the LLM on the given prompt and input.""" │
│ ❱ 246 │ │ return self.generate([prompt], stop=stop).generations[0][0].text │
│ 247 │ │
│ 248 │ @property │
│ 249 │ def _identifying_params(self) -> Mapping[str, Any]: │
│ │
│ /home/intel/zjq/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/llms/base.py:140 │
│ in generate │
│ │
│ 137 │ │ │ │ output = self._generate(prompts, stop=stop) │
│ 138 │ │ │ except (KeyboardInterrupt, Exception) as e: │
│ 139 │ │ │ │ self.callback_manager.on_llm_error(e, verbose=self.verbose) │
│ ❱ 140 │ │ │ │ raise e │
│ 141 │ │ │ self.callback_manager.on_llm_end(output, verbose=self.verbose) │
│ 142 │ │ │ return output │
│ 143 │ │ params = self.dict() │
│ │
│ /home/intel/zjq/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/llms/base.py:137 │
│ in generate │
│ │
│ 134 │ │ │ │ {"name": self.__class__.__name__}, prompts, verbose=self.verbose │
│ 135 │ │ │ ) │
│ 136 │ │ │ try: │
│ ❱ 137 │ │ │ │ output = self._generate(prompts, stop=stop) │
│ 138 │ │ │ except (KeyboardInterrupt, Exception) as e: │
│ 139 │ │ │ │ self.callback_manager.on_llm_error(e, verbose=self.verbose) │
│ 140 │ │ │ │ raise e │
│ │
│ /home/intel/zjq/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/llms/base.py:325 │
│ in _generate │
│ │
│ 322 │ │ generations = [] │
│ 323 │ │ for prompt in prompts: │
│ 324 │ │ │ text = self._call(prompt, stop=stop) │
│ ❱ 325 │ │ │ generations.append([Generation(text=text)]) │
│ 326 │ │ return LLMResult(generations=generations) │
│ 327 │ │
│ 328 │ async def _agenerate( │
│ │
│ /home/intel/zjq/ChatGLM-6B/prompt_engine/pydantic/main.py:341 in │
│ pydantic.main.BaseModel.__init__ │
│ │
│ [Errno 2] No such file or directory: '/home/intel/zjq/ChatGLM-6B/prompt_engine/pydantic/main.py' │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValidationError: 1 validation error for Generation
@Zjq9409 From the error stack, can you try to run directly the llm, like:
llm(prompt="你好")
because i guess the error is caused by the empty text in the generations.append([Generation(text=text)]) from the last error trace
still report the same problem.
@Zjq9409 if there is the same problem when you run the code:
llm(prompt="你好")
Looks like it should not be caused by GPTCache and it's caused the llm model.
Actually, the usage method is: llm._call("你好") , but i need to combine with LangChainLLMs to use Cache, right?
@Zjq9409 yap
when the cache is empty, it will call the origin llm model to get the answer, and then the answer will be saved to cache. In the next time, you will get the answer from the cache when you request a similar request.
@Zjq9409 Is your problem solved? If you want to use the huggingface transformers LLM model, you can use the GPTCache api. If you encounter other problems, you can open a new issue.
I also encountered this problem, how to solve this problem
@iunique hi, you can open a new issue and describe your problem.