429 Request Error with Langchain Huggingface Endpoint
[x] I have checked the documentation and related resources and couldn't resolve my bug.
I want to create synthetic test data. Using the OpenAI or Anthropic API is very expensive, so I want to use the `HuggingFaceEndpoint` API and run it with Llama3-70B. I am running it on a GPU compute cluster and assume my compute is sufficient to load and run the model. However, I always run into rate-limit error 429, immediately when the script starts.
Ragas version: newest release (5th July)
Python version: 3.10
Please find the code below. I triple-checked it and it should be correct. Assume we already have a LangChain document artifact (`docs`); I excluded the data loading for readability.
**Code to Reproduce**
```python
import logging
import os
import random
import re
import time

import pandas as pd
import torch
from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain_huggingface import HuggingFaceEmbeddings, HuggingFaceEndpoint
from langchain_text_splitters import TokenTextSplitter
from ragas.testset.evolutions import conditional, multi_context, reasoning, simple
from ragas.testset.generator import TestsetGenerator

device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(f"running on device: {device}")

if __name__ == "__main__":

    def make_request_with_backoff(max_retries=10, max_wait_time=300):
        for attempt in range(max_retries):
            try:
                logging.info(f"Attempt {attempt + 1} of {max_retries}")
                testset = generator.generate_with_langchain_docs(
                    docs, 50, distributions, raise_exceptions=False
                )
                logging.info(f"Successfully generated {len(testset)} items")
                return testset
            except Exception as e:
                if "429" in str(e):
                    wait_time = min((2 ** attempt) + random.uniform(0, 1), max_wait_time)
                    logging.warning(f"Rate limit hit. Waiting {wait_time:.2f} seconds.")
                    time.sleep(wait_time)
                else:
                    logging.error(f"Unexpected error: {str(e)}")
                    raise e
        raise Exception("Max retries reached")

    # Load models
    llm = HuggingFaceEndpoint(
        repo_id="meta-llama/Meta-Llama-3-70B",
        task="text-generation",
        max_new_tokens=512,
        repetition_penalty=1.03,
    )
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-mpnet-base-v2"
    )

    generator_llm = llm
    critic_llm = llm

    generator = TestsetGenerator.from_langchain(
        generator_llm,
        critic_llm,
        embeddings,
    )

    # Change the resulting question type distribution
    distributions = {
        simple: 0.1,
        reasoning: 0.35,
        multi_context: 0.2,
        conditional: 0.35,
    }

    # Configure logging
    logging.basicConfig(level=logging.INFO)

    # Generate the test set with retry logic
    testset = make_request_with_backoff()

    # Store the data
    df = testset.to_pandas()
    df.to_parquet("synthetic_data/ragas_llama3_qa.parquet")
```
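For reference, the excluded data-loading step presumably looks something like the following, using the loaders already imported above. This is only a hedged sketch; the file path and chunking parameters are placeholders, not the original values.

```python
from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain_text_splitters import TokenTextSplitter

# Hypothetical loading step (the path and chunk sizes are placeholders).
loader = UnstructuredMarkdownLoader("docs/my_corpus.md")
raw_docs = loader.load()

# Split into token-sized chunks so each document fits the generator's context
splitter = TokenTextSplitter(chunk_size=512, chunk_overlap=50)
docs = splitter.split_documents(raw_docs)
```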
**Error trace**
File "/usr/local/lib/python3.10/dist-packages/aiohttp/client_reqrep.py", line 1005, in raise_for_status
raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 429, message='Too Many Requests', url=URL('https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-70B')
**Expected behavior**
I expect the script to run and use Llama3 to generate the test data.
**Additional context**
I am running this on 6 Nvidia A10s using a PyTorch image and Python 3.10. Specifications from the job script below:
```bash
#!/bin/bash
set -xe

srun \
  --gpus=6 \
  --mem=144GB \
  --container-image=/data/enroot/nvcr.io_nvidia_pytorch_23.06-py3.sqsh \
  --container-workdir=`pwd` \
  --container-mounts=/my_path...\
  ./job.sh
```
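Note that the traceback above shows the request going to `api-inference.huggingface.co`, i.e. the shared, rate-limited serverless Inference API: `HuggingFaceEndpoint` constructed with only a `repo_id` calls that hosted service, so the cluster's GPUs are never used for inference. If the intent is to serve Llama3-70B on the cluster itself, the client can instead be pointed at a self-hosted Text Generation Inference (TGI) server. A minimal sketch, assuming such a server is already running; the endpoint URL and port are placeholders:

```python
from langchain_huggingface import HuggingFaceEndpoint

# Point the client at a self-hosted TGI server rather than the
# shared serverless API (the endpoint URL is a placeholder).
llm = HuggingFaceEndpoint(
    endpoint_url="http://localhost:8080/",
    task="text-generation",
    max_new_tokens=512,
    repetition_penalty=1.03,
)
```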
Hi @jonas-nothnagel 👋, I'm here to assist you with any bugs, questions, or contributions you have regarding the repository.
To resolve the 429 "Too Many Requests" error when using the `HuggingFaceEndpoint` API with the Llama3-70B model, you can implement rate limiting and retry logic in your code. Here is an example of how you can do this in Python:
```python
import logging
import random
import time

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


# Function to make a request to the HuggingFace Inference API
def make_request(url, headers, data, max_retries=10, max_wait_time=300):
    session = requests.Session()
    # Note: this adapter already retries the listed statuses at the transport
    # level; the loop below adds an application-level backoff on top of it.
    retry = Retry(
        total=max_retries,  # Total number of retries
        backoff_factor=1,  # Time to wait between retries
        status_forcelist=[429, 500, 502, 503, 504],  # Retry on these status codes
        allowed_methods=["HEAD", "GET", "OPTIONS", "POST"],  # Retry on these methods
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)

    for attempt in range(max_retries):
        try:
            logging.info(f"Attempt {attempt + 1} of {max_retries}")
            response = session.post(url, headers=headers, json=data)
            response.raise_for_status()
            return response.json()
        except requests.exceptions.HTTPError as e:
            if e.response is not None and e.response.status_code == 429:
                wait_time = min((2 ** attempt) + random.uniform(0, 1), max_wait_time)
                logging.warning(f"Rate limit hit. Waiting {wait_time:.2f} seconds.")
                time.sleep(wait_time)
            else:
                logging.error(f"Unexpected error: {str(e)}")
                raise e
    raise Exception("Max retries reached")


# Example usage
url = "https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-70B"
headers = {"Authorization": "Bearer YOUR_API_KEY"}
data = {"inputs": "Your input data"}

response = make_request(url, headers, data)
if response:
    print("Request successful")
    print(response)
else:
    print("Request failed")
```
This code sets up a retry mechanism that automatically retries the request when it receives a 429 status code, with a backoff factor applied between retries. Adjust the `total` and `backoff_factor` parameters as needed to fit your use case [1].
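A possible refinement: if the endpoint includes a `Retry-After` header in its 429 responses, that value can be honored instead of relying purely on exponential backoff. A hedged sketch of just the wait computation; the fallback mirrors the backoff logic above:

```python
import random


def compute_wait(response, attempt, max_wait_time=300):
    """Prefer the server's Retry-After header, falling back to exponential backoff."""
    retry_after = response.headers.get("Retry-After")
    if retry_after is not None:
        try:
            return min(float(retry_after), max_wait_time)
        except ValueError:
            pass  # Retry-After may also be an HTTP date; this sketch ignores that form
    return min((2 ** attempt) + random.uniform(0, 1), max_wait_time)
```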
Additionally, you can use the `RunConfig` class in the ragas library to manage retries and rate limiting more effectively. Here is an example of how to configure it:
```python
from ragas.run_config import RunConfig

# Configure run_config for the docstore
run_config = RunConfig(max_retries=15, max_wait=90)
generator.set_run_config(run_config)

# Use the function with retry logic
testset = make_request_with_backoff()
```
This configuration will help manage rate limits and retries within the ragas library [2][3].
To continue talking to Dosu, mention @dosu.
Hi @dosu, where does `generator` come from?
@jonas-nothnagel were you able to figure it out?