Add Swiss legal evals as new community tasks
Adds new community tasks with Swiss legal evaluations. Currently, translation tasks are supported, but others may follow in the future.
@hynky1999 tagging you if you've got a couple of minutes to check the templating when you're back from the offsite.
Re templates:
We don't have any template for translation tasks atm.
There are many variants to go with (see the image below), but I would prefer going with the `[src]: [input] [tgt]:` one (variant A; see the sketch at the end of this comment). Since translation is inherently a cross-lingual task and it's not clear which language we should use (target or source?), such a template allows us to be language-independent (the language labels are more or less standardized, but yes, they will be in Latin script).
@JoelNiklaus Have you experimented with different prompt formats?
I can quickly make a PR for the translation template and we can convert it to that.
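For concreteness, variant A would look roughly like this (just a sketch of the format; the helper name and example sentence are made up, not the actual template code):

```python
# Sketch of the "[src]: [input] [tgt]:" format (variant A) — illustrative only.
def translation_prompt(src_lang: str, tgt_lang: str, src_text: str) -> str:
    # The model is expected to continue with the target-language translation.
    return f"{src_lang.upper()}: {src_text} {tgt_lang.upper()}:"

print(translation_prompt("de", "fr", "Das Gericht weist die Beschwerde ab."))
# DE: Das Gericht weist die Beschwerde ab. FR:
```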
I haven't experimented with prompts yet. Yes, going with variant A sounds good.
Thanks so much!
Btw, what is the reason you are not using the metrics from evaluate?
Evaluate is no longer actively maintained (it's indicated in the GitHub README). We also wanted lighteval to be light and not rely on a heap of dependencies.
I see. I used the direct implementations of COMET and METEOR rather than evaluate.
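For reference, "direct implementation" here means calling the metric packages themselves; a minimal sketch, assuming the unbabel-comet and nltk packages (illustrative only, not the exact code in the PR):

```python
# Illustrative sketch only — assumes `pip install unbabel-comet nltk`.
import nltk
from comet import download_model, load_from_checkpoint
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet")  # METEOR needs WordNet data

# METEOR: sentence-level score from pre-tokenized reference(s) and hypothesis.
reference = "Le tribunal rejette le recours.".split()
hypothesis = "Le tribunal a rejeté le recours.".split()
print(meteor_score([reference], hypothesis))

# COMET: download the wmt22-comet-da checkpoint and score (src, mt, ref) triples.
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{
    "src": "Das Gericht weist die Beschwerde ab.",
    "mt": "Le tribunal a rejeté le recours.",
    "ref": "Le tribunal rejette le recours.",
}]
print(model.predict(data, batch_size=8, gpus=0).system_score)
```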
PR looks great! Do the results on your evals look sound? Also, you can use the pre-commit hooks to format the files and fix the CI :)
pip install pre-commit
pre-commit install
pre-commit run --all-files
Great, thanks! Just ran the pre-commit hooks.
Couldn't run the evals yet because of the judge prompt. Hope to do that soon.
For some reason, bleurt_large, wmt22-comet-da, and judge_score_gpt-4o are saved to separate, duplicated rows in the details. Also, judge_score_gpt-4o does not show up in the overview:
|Task|Version|Metric|Value|   |Stderr|
|---|---|---|---|---|---|
|community:sdst-text_level:de-fr:3| 0|bleu |44.8267|± | 0.5706|
| | |chrf |77.1781|± | 0.5906|
| | |ter |54.5455|± | 0.3742|
| | |meteor |62.1061|± |18.7888|
| | |BERTScore-P |98.6389|± | 0.4811|
| | |BERTScore-R |98.6984|± | 0.6839|
| | |BERTScore-F |98.6685|± | 0.5824|
| | |bleurt_large |14.3253|± | 0.5687|
| | |wmt22-comet-da|84.5221|± | 0.4793|
I am running this command:
python -m lighteval accelerate \
--model_args openai,model=gpt-4o-mini \
--tasks "community|sdst-text_level:de-fr|3|0" \
--custom_tasks lighteval/community_tasks/swiss_legal_evals.py \
--output_dir outputs \
--override_batch_size 1 \
--save_details \
--max_samples 2
@clefourrier @NathanHB Do you know why this is happening and how I can fix it?
Duplicated rows: yes, each "metric type" gets its own row, since they are not parsed the same way (to make sure each comes with its own correct logprob-related info, for example). This is a feature, not a bug. However, I have no idea why the judge eval is not there.
Found the issue: the corpus_level_fn name did not match the metric name.
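In other words, the key used for the corpus-level aggregation has to match the metric name exactly, otherwise the metric is silently dropped from the summary table. A toy illustration in plain Python (not lighteval internals):

```python
# Toy illustration of the name mismatch — not lighteval code.
import statistics

metric_name = "judge_score_gpt-4o"
sample_scores = {metric_name: [0.8, 0.6]}

# Buggy: key "judge_score" does not match the metric name, so nothing is aggregated.
# corpus_level_fn = {"judge_score": statistics.mean}

# Fixed: key matches metric_name exactly.
corpus_level_fn = {metric_name: statistics.mean}

summary = {
    name: fn(sample_scores[name])
    for name, fn in corpus_level_fn.items()
    if name in sample_scores
}
print(summary)  # only the correctly named metric shows up in the overview
```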
Hey @JoelNiklaus! Any update on this?
There will be more tasks added to this PR from other people, but we could also merge this one and add the others in separate PRs. Whatever is best for you.
We can keep this open and add more tasks, that works for us!