Model monitoring dev
Hi team,
Here is the PR that adds the LLM metrics and application to our current model monitoring framework.
Here is the high-level design of the data flow.
What has been done so far:
- In `metrics.py` we have two types of metrics. The first type uses Hugging Face's `evaluate` package, which can compute traditional NLP metrics such as BLEU and ROUGE scores. The other type is LLM-as-a-judge metrics, with three child classes: `single grading`, `pairwise grading`, and `ref grading`. They use different prompt templates from `prompt.py`.
- LLM metrics need configs for the following attributes (see the sketch after this list):
  - which model to use as the judge
  - the inference config we want the judge to use
  - for `pairwise grading` and `ref grading`, a second LLM as the benchmark model
  - a self-defined metric name, definition, examples, and grading rubric
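To make the two metric types concrete, here is a minimal sketch. The `evaluate` calls use the real Hugging Face API; the judge-metric config is a hypothetical dict that only mirrors the attributes listed above, so the actual constructors in `metrics.py` may take them differently.

```python
import evaluate

# Traditional NLP metric via Hugging Face's evaluate package.
rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["The cat sat on the mat."],
    references=["A cat was sitting on the mat."],
)
print(scores)  # {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}

# Hypothetical config for an LLM-as-a-judge metric, mirroring the attributes
# listed above (judge model, inference config, benchmark model, and the
# self-defined name, definition, examples, and grading rubric).
judge_metric_config = {
    "judge_model": "gpt-4",                # which model to use as judge
    "judge_config": {"temperature": 0.0},  # inference config for the judge
    "benchmark_model": "llama-2-7b-chat",  # only for pairwise/ref grading
    "name": "professionalism",
    "definition": "How professional is the tone of the answer?",
    "examples": ["Q: ... A: ... Grade: 4"],
    "rubric": "1 = unprofessional ... 5 = highly professional",
}
```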
- `llm_application.py` has the `LLMModelMonitoringApp`, which holds a list of metrics (either evaluate metrics or LLM-as-judge metrics).
  - the method `compute_metrics_over_data` computes all the metric values as a `dict`.
  - the method `compute_one_metric_over_data` computes one metric over all the data and aggregates the grades as a mean value.
  - the method `build_radar_chart` builds a radar plot to compare the performance of your LLM with the benchmark model across the different metrics (a usage sketch follows below).

All of the above are tested under `tests/model_monitoring/genai` on our LLM server.
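For context, a rough usage sketch of the app is below. Only the class and method names come from this PR; the import path, constructor signature, method arguments, and the `my_metrics` list are assumptions.

```python
import pandas as pd

from llm_application import LLMModelMonitoringApp  # hypothetical import path

# Minimal input frame with the expected columns.
sample_df = pd.DataFrame(
    {
        "question": ["What is MLRun?"],
        "answer": ["MLRun is an MLOps framework."],
        "reference": ["MLRun is an open-source MLOps orchestration framework."],
    }
)

# `my_metrics` stands in for instances of the evaluate or LLM-as-judge
# metric classes from metrics.py, assumed to be constructed elsewhere.
app = LLMModelMonitoringApp(metrics=my_metrics)

# Compute all the metric values over the data as a dict.
results = app.compute_metrics_over_data(sample_df)

# Compute one metric over all the data, aggregated as a mean grade.
mean_grade = app.compute_one_metric_over_data(my_metrics[0], sample_df)

# Radar plot comparing your LLM with the benchmark model per metric.
app.build_radar_chart(results)
```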
What needs to be done next:
We need to test the `run_application` method of `LLMMonitoringApp` against our current model monitoring framework:
- deploy a custom LLM to serving as an endpoint.
- get the input stream with `sample_df`, which should have `question` (x), `answer` (y_prediction), and `reference` (y_true).
- trigger the `run_application` method to compute the metrics and detect drift.
- for the logic inside the `run_application` method, we need to do some design thinking about drift. For example, if we have the self-defined metrics `professionalism` and `correctness`, do we set a different threshold for each metric, say, flag drift when `professionalism` falls below 2, but only flag drift for `correctness` when it falls below 4? The current logic computes all the results and logs them as a dataframe. One possible per-metric threshold design is sketched below.
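Here is a minimal sketch of what such a per-metric threshold decision could look like, using the example thresholds above. This is a design option, not the current implementation.

```python
# Each metric gets its own drift threshold on its aggregated mean grade
# (values taken from the example above; they are assumptions).
DRIFT_THRESHOLDS = {"professionalism": 2.0, "correctness": 4.0}

def detect_drift(mean_grades: dict[str, float]) -> dict[str, bool]:
    """Flag drift per metric when its mean grade falls below its threshold."""
    return {
        name: grade < DRIFT_THRESHOLDS[name]
        for name, grade in mean_grades.items()
        if name in DRIFT_THRESHOLDS
    }

print(detect_drift({"professionalism": 1.5, "correctness": 4.2}))
# -> {'professionalism': True, 'correctness': False}
```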
Challenges: The biggest challenge is system-related. We have two servers: our LLM server doesn't have the current model monitoring framework in place, and the dev8 system has mlrun:1.6.0-rc16. However, I haven't run the system tests successfully yet.