Can Custom Metrics be added?
[x] I checked the documentation and related resources and couldn't find an answer to my question.
Your Question: Can I add custom metrics? I didn't find information related to it.
Additional context: I'm thinking of creating a metric to reflect how many times the input and the response are in the same language.
Thank you!
Hello @LDelPinoNT! I'm here to help you with any questions or issues you have. Let's get started!
Yes, you can add custom metrics to reflect how many times the input and the response are in the same language. You can create a new metric class that checks the language of both the input and the response and counts the matches. Here is an example of how you can implement this:
```python
from langchain_core.pydantic_v1 import BaseModel, Field
from ragas.metrics.base import MetricWithLLM, EvaluationMode
from langdetect import detect

import typing as t
import numpy as np
from dataclasses import dataclass


class LanguageMatchAnswer(BaseModel):
    input_language: str = Field(..., description="Language of the input")
    response_language: str = Field(..., description="Language of the response")
    match: int = Field(..., description="1 if languages match, 0 otherwise")


class LanguageMatchAnswers(BaseModel):
    __root__: t.List[LanguageMatchAnswer]

    def dicts(self) -> t.List[t.Dict]:
        return self.dict()["__root__"]


@dataclass
class LanguageMatch(MetricWithLLM):
    name: str = "language_match"
    evaluation_mode: EvaluationMode = EvaluationMode.qac

    def _detect_language(self, text: str) -> str:
        return detect(text)

    def _compute_score(self, answers: LanguageMatchAnswers):
        # Fraction of rows whose input and response languages match
        matches = sum(answer.match for answer in answers.__root__)
        num_answers = len(answers.__root__)
        if num_answers:
            score = matches / num_answers
        else:
            logger.warning("No answers were provided.")
            score = np.nan
        return score

    async def _ascore(self: t.Self, row: t.Dict, callbacks: Callbacks) -> float:
        # Detect the language of the question and the answer and record a 0/1 match
        input_text = row["question"]
        response_text = row["answer"]

        input_language = self._detect_language(input_text)
        response_language = self._detect_language(response_text)

        match = 1 if input_language == response_language else 0
        answers = LanguageMatchAnswers.parse_obj(
            [{"input_language": input_language, "response_language": response_language, "match": match}]
        )
        return self._compute_score(answers)


language_match = LanguageMatch()
```
This code defines a new metric LanguageMatch that detects the language of the input and the response and checks if they match. The _compute_score method calculates the proportion of matches.
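If you just want to sanity-check the metric outside of a full `evaluate` run, you can call its `_ascore` method on a single row. This is a minimal sketch for illustration only: it assumes the `logging`/`Callbacks` imports discussed later in this thread are in place, that `langdetect` is installed, and that calling the internal `_ascore` directly is acceptable for a quick test (no LLM call is involved, since the scoring relies only on language detection):

```python
import asyncio

# Hypothetical quick check on one question/answer pair
row = {
    "question": "¿Cuál es la capital de Francia?",
    "answer": "La capital de Francia es París.",
}
score = asyncio.run(language_match._ascore(row, callbacks=None))
print(score)  # 1.0 when both texts are detected as the same language, 0.0 otherwise
```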
Additionally, you can customize which LLMs and embeddings to use for this metric by leveraging the Ragas framework. You can pass any instance of BaseRagasLLM or BaseRagasEmbeddings directly to the metric or the evaluate function. For example:
```python
from langchain_openai.chat_models import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

openai_key = "<your-key-here>"

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002", openai_api_key=openai_key)
openai_completion = ChatOpenAI(
    model="gpt-3.5-turbo",
    temperature=0.7,
    max_tokens=4000,
    openai_api_key=openai_key,
)

# Attach the LLM and embeddings to the metric (wrapped in the Ragas interfaces)
language_match.llm = LangchainLLMWrapper(openai_completion)
language_match.embeddings = LangchainEmbeddingsWrapper(embeddings)

# Evaluate using the custom metric; `dataset` is your evaluation dataset
# (see the sketch below for a minimal example)
results = evaluate(dataset, metrics=[language_match], llm=openai_completion, embeddings=embeddings)
```
This allows you to use custom LLMs and embeddings for your LanguageMatch metric, ensuring that the evaluation is tailored to your specific needs [1][2][3][4][5].
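For completeness, here is one way such a `dataset` could be constructed. This is a minimal sketch assuming an in-memory Hugging Face `Dataset` with the `question`, `answer`, and `contexts` columns that the metric's `qac` evaluation mode expects; the sample rows are purely illustrative:

```python
from datasets import Dataset

# Tiny illustrative dataset; in practice this would be your own evaluation data
dataset = Dataset.from_dict({
    "question": ["What is the capital of France?", "¿Cuál es la capital de Francia?"],
    "answer": ["The capital of France is Paris.", "La capitale de la France est Paris."],
    "contexts": [
        ["Paris is the capital and largest city of France."],
        ["París es la capital de Francia."],
    ],
})
```

This `dataset` can then be passed as the first argument to `evaluate` as shown above; with these two rows, the second one would count as a language mismatch.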
@dosu can you explain to me what "t" is in the LanguageMatchAnswers class?
In the LanguageMatchAnswers class, t is an alias for the typing module, which provides support for type hints in Python. It is commonly used to specify the types of variables, function arguments, and return values.
Here's the same class written without the alias, importing the types from the typing module directly:
```python
from typing import List, Dict  # importing the necessary types from the typing module

class LanguageMatchAnswers(BaseModel):
    __root__: List[LanguageMatchAnswer]  # a list of LanguageMatchAnswer objects

    def dicts(self) -> List[Dict]:  # returns a list of dictionaries
        return self.dict()["__root__"]
```
In this context:
- `List[LanguageMatchAnswer]` indicates that `__root__` is a list of `LanguageMatchAnswer` objects.
- `List[Dict]` indicates that the `dicts` method returns a list of dictionaries.
By using t as an alias for typing, the code becomes more concise and readable.
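Concretely, the alias used in the metric code above is just the typing module bound to a shorter name; a small, purely illustrative example:

```python
import typing as t  # `t` is simply the `typing` module under a shorter name

numbers: t.List[int] = [1, 2, 3]                # same as typing.List[int]
row: t.Dict[str, str] = {"question": "Hello?"}  # same as typing.Dict[str, str]
```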
Hey @dosu, can you tell me which modules I should import for the "logger" object and "Callbacks" in the "LanguageMatch" class?
You should import the logging module for the "logger" object, and the Callbacks type alias from langchain_core.callbacks for the "Callbacks" annotation in the "LanguageMatch" class (Callbacks is an alias for an optional list of BaseCallbackHandler instances or a callback manager).
Here are the relevant imports:
```python
import logging
from langchain_core.callbacks import Callbacks
```
Supporting evidence from the codebase:
- For the "logger" object: `logger = logging.getLogger(__name__)`
- For "Callbacks" and callback handlers: `class CostCallbackHandler(BaseCallbackHandler):`
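Putting the two together, the top of the metric module would look something like this; a sketch, assuming the `Callbacks` type alias is what the `_ascore` signature refers to:

```python
import logging

from langchain_core.callbacks import Callbacks  # type alias used in the `_ascore` signature

# Module-level logger used by `_compute_score`
logger = logging.getLogger(__name__)
```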
For the record, the final, fixed code:

```python
from langchain_core.pydantic_v1 import BaseModel, Field
from ragas.metrics.base import MetricWithLLM, EvaluationMode
from langchain.callbacks.base import Callbacks
from langdetect import detect
import typing as t
from dataclasses import dataclass
import logging
import numpy as np
from ragas import evaluate
from ragas.evaluation import RunConfig
from datasets import load_dataset
import nest_asyncio

logger = logging.getLogger(__name__)


class LanguageMatchAnswer(BaseModel):
    input_language: str = Field(..., description="Language of the input")
    response_language: str = Field(..., description="Language of the response")
    match: int = Field(..., description="1 if languages match, 0 otherwise")


class LanguageMatchAnswers(BaseModel):
    __root__: t.List[LanguageMatchAnswer]

    def dicts(self) -> t.List[t.Dict]:
        return self.dict()["__root__"]


@dataclass
class LanguageMatch(MetricWithLLM):
    name: str = "language_match"
    evaluation_mode: EvaluationMode = EvaluationMode.qac

    def _detect_language(self, text: str) -> str:
        return detect(text)

    def _compute_score(self, answers: LanguageMatchAnswers):
        matches = sum(answer.match for answer in answers.__root__)
        num_answers = len(answers.__root__)
        if num_answers:
            score = matches / num_answers
        else:
            logger.warning("No answers were provided.")
            score = np.nan
        return score

    async def _ascore(self: t.Self, row: t.Dict, callbacks: Callbacks) -> float:
        input_text = row["question"]
        response_text = row["answer"]

        input_language = self._detect_language(input_text)
        response_language = self._detect_language(response_text)

        match = 1 if input_language == response_language else 0
        answers = LanguageMatchAnswers.parse_obj(
            [{"input_language": input_language, "response_language": response_language, "match": match}]
        )
        return self._compute_score(answers)


language_match = LanguageMatch()

nest_asyncio.apply()

# [insert your code for llm, embedding funcs and runconfig]
# (see the sketch below for one way to fill this in)

result = evaluate(
    ds,
    metrics=[language_match],
    llm=llm,
    embeddings=emb_fun,
    run_config=run_conf,
)
```
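For reference, one possible way to fill in the elided LLM/embeddings/run-config setup; a sketch only, where the model names, the in-memory sample dataset, and the `RunConfig` values are illustrative assumptions rather than the poster's actual setup:

```python
from langchain_openai.chat_models import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings
from datasets import Dataset

llm = ChatOpenAI(model="gpt-3.5-turbo", openai_api_key="<your-key-here>")
emb_fun = OpenAIEmbeddings(model="text-embedding-ada-002", openai_api_key="<your-key-here>")
run_conf = RunConfig(timeout=60, max_retries=3)

# Minimal in-memory dataset with the question/answer/contexts columns the metric expects
ds = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris is the capital of France."]],
})
```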
Closing this as resolved.
Check out `DiscreteMetric` in recent versions of Ragas.