Model monitoring dev
Hi team,
Here is the PR that adds the LLM metrics and application to our current model monitoring framework.
Here is the high-level design of the data flow.
What has been done so far:
- In `metrics.py` we have two types of metrics. The first type uses Hugging Face's `evaluate` package, which can compute traditional NLP metrics such as BLEU and ROUGE scores. The other type is LLM-as-a-judge metrics, with three child classes: `single grading`, `pairwise grading`, and `ref grading`. They use different prompt templates from `prompt.py`.
- LLM metrics need configs for the following attributes (see the sketch after this list):
  - which model to use as the judge
  - the inference config we want the judge to use
  - for `pairwise grading` and `ref grading`, a second LLM as the benchmark model
  - a self-defined metric name, definition, examples, and grading rubric
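To make the two metric types concrete, here is a minimal sketch. The `evaluate` calls use the real Hugging Face API; the judge-metric config is a hypothetical dict that only mirrors the attributes listed above, so the actual constructors in `metrics.py` may take them differently.

```python
import evaluate

# Traditional NLP metric via Hugging Face's evaluate package.
rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["The cat sat on the mat."],
    references=["A cat was sitting on the mat."],
)
print(scores)  # {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}

# Hypothetical config for an LLM-as-a-judge metric, mirroring the attributes
# listed above (judge model, inference config, benchmark model, and the
# self-defined name, definition, examples, and grading rubric).
judge_metric_config = {
    "judge_model": "gpt-4",                # which model to use as judge
    "judge_config": {"temperature": 0.0},  # inference config for the judge
    "benchmark_model": "llama-2-7b-chat",  # only for pairwise/ref grading
    "name": "professionalism",
    "definition": "How professional is the tone of the answer?",
    "examples": ["Q: ... A: ... Grade: 4"],
    "rubric": "1 = unprofessional ... 5 = highly professional",
}
```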
- `llm_application.py` has the `LLMModelMonitoringApp`, which holds a list of metrics (either evaluate metrics or LLM-as-judge metrics).
  - the method `compute_metrics_over_data` computes all the metric values as a `dict`.
  - the method `compute_one_metric_over_data` computes one metric over all the data and aggregates the grades as a mean value.
  - the method `build_radar_chart` builds a radar plot to compare the performance of your LLM with the benchmark model across the different metrics (a usage sketch follows below).

All of the above are tested under `tests/model_monitoring/genai` on our LLM server.
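For context, a rough usage sketch of the app is below. Only the class and method names come from this PR; the import path, constructor signature, method arguments, and the `my_metrics` list are assumptions.

```python
import pandas as pd

from llm_application import LLMModelMonitoringApp  # hypothetical import path

# Minimal input frame with the expected columns.
sample_df = pd.DataFrame(
    {
        "question": ["What is MLRun?"],
        "answer": ["MLRun is an MLOps framework."],
        "reference": ["MLRun is an open-source MLOps orchestration framework."],
    }
)

# `my_metrics` stands in for instances of the evaluate or LLM-as-judge
# metric classes from metrics.py, assumed to be constructed elsewhere.
app = LLMModelMonitoringApp(metrics=my_metrics)

# Compute all the metric values over the data as a dict.
results = app.compute_metrics_over_data(sample_df)

# Compute one metric over all the data, aggregated as a mean grade.
mean_grade = app.compute_one_metric_over_data(my_metrics[0], sample_df)

# Radar plot comparing your LLM with the benchmark model per metric.
app.build_radar_chart(results)
```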
What needs to be done next:
We need to test the `run_application` method of `LLMMonitoringApp` against our current model monitoring framework:
- deploy a custom LLM to serving as an endpoint.
- get the input stream with `sample_df`, which should have `question` (x), `answer` (y_prediction), and `reference` (y_true).
- trigger the `run_application` method to compute the metrics and detect drift.
- for the logic inside the `run_application` method, we need to do some design thinking about drift. For example, if we have the self-defined metrics `professionalism` and `correctness`, do we set a different threshold for each metric, say, flag drift when `professionalism` falls below 2, but only flag drift for `correctness` when it falls below 4? The current logic computes all the results and logs them as a dataframe. One possible per-metric threshold design is sketched below.
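Here is a minimal sketch of what such a per-metric threshold decision could look like, using the example thresholds above. This is a design option, not the current implementation.

```python
# Each metric gets its own drift threshold on its aggregated mean grade
# (values taken from the example above; they are assumptions).
DRIFT_THRESHOLDS = {"professionalism": 2.0, "correctness": 4.0}

def detect_drift(mean_grades: dict[str, float]) -> dict[str, bool]:
    """Flag drift per metric when its mean grade falls below its threshold."""
    return {
        name: grade < DRIFT_THRESHOLDS[name]
        for name, grade in mean_grades.items()
        if name in DRIFT_THRESHOLDS
    }

print(detect_drift({"professionalism": 1.5, "correctness": 4.2}))
# -> {'professionalism': True, 'correctness': False}
```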
Challenges: The biggest challenge is system-related. We have two servers: our LLM server doesn't have the current model monitoring framework in place, and the dev8 system has mlrun:1.6.0-rc16. However, I haven't run the system tests successfully yet.