Add Swiss legal evals as new community tasks
Adds new community tasks with Swiss legal evaluations. Currently, translation tasks are supported, but others may follow in the future.
@hynky1999 tagging you if you've got a couple of minutes to check the templating when you're back from the offsite.
Re templates:
We don't have any template for translation tasks atm.
There are many variants to go with (see the image below), but I would prefer going with the `[src]: [input] [tgt]:` one (variant A; see the sketch at the end of this comment). Since translation is inherently a cross-lingual task and it's not clear which language we should use (target or source?), such a template allows us to be language-independent (the language labels are more or less standardized, but yes, they will be in Latin script).
@JoelNiklaus Have you experimented with different prompt formats?
I can quickly make a PR for the translation template and we can convert it to that.
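For concreteness, variant A would look roughly like this (just a sketch of the format; the helper name and example sentence are made up, not the actual template code):

```python
# Sketch of the "[src]: [input] [tgt]:" format (variant A) — illustrative only.
def translation_prompt(src_lang: str, tgt_lang: str, src_text: str) -> str:
    # The model is expected to continue with the target-language translation.
    return f"{src_lang.upper()}: {src_text} {tgt_lang.upper()}:"

print(translation_prompt("de", "fr", "Das Gericht weist die Beschwerde ab."))
# DE: Das Gericht weist die Beschwerde ab. FR:
```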
I haven't experimented with prompts yet. Yes, going with variant A sounds good.
Thanks so much!
Btw, what is the reason you are not using the metrics from evaluate?
Evaluate is no longer actively maintained (it's indicated in the GitHub README). We also wanted lighteval to be light and not rely on a heap of dependencies.
I see. I used the direct implementations of COMET and METEOR rather than evaluate.
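For reference, "direct implementation" here means calling the metric packages themselves; a minimal sketch, assuming the unbabel-comet and nltk packages (illustrative only, not the exact code in the PR):

```python
# Illustrative sketch only — assumes `pip install unbabel-comet nltk`.
import nltk
from comet import download_model, load_from_checkpoint
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet")  # METEOR needs WordNet data

# METEOR: sentence-level score from pre-tokenized reference(s) and hypothesis.
reference = "Le tribunal rejette le recours.".split()
hypothesis = "Le tribunal a rejeté le recours.".split()
print(meteor_score([reference], hypothesis))

# COMET: download the wmt22-comet-da checkpoint and score (src, mt, ref) triples.
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{
    "src": "Das Gericht weist die Beschwerde ab.",
    "mt": "Le tribunal a rejeté le recours.",
    "ref": "Le tribunal rejette le recours.",
}]
print(model.predict(data, batch_size=8, gpus=0).system_score)
```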
PR looks great! Do the results on your evals look sound? Also, you can use the pre-commit hooks to format the files and fix the CI :)
pip install pre-commit
pre-commit install
pre-commit run --all-files
Great, thanks! Just ran the pre-commit hooks.
Couldn't run the evals yet because of the judge prompt. Hope to do that soon.
For some reason, bleurt_large, wmt22-comet-da, and judge_score_gpt-4o are saved to separate, duplicated rows in the details. Also, judge_score_gpt-4o does not show up in the overview:
|Task|Version|Metric|Value|   |Stderr|
|---|---|---|---|---|---|
|community:sdst-text_level:de-fr:3| 0|bleu |44.8267|± | 0.5706|
| | |chrf |77.1781|± | 0.5906|
| | |ter |54.5455|± | 0.3742|
| | |meteor |62.1061|± |18.7888|
| | |BERTScore-P |98.6389|± | 0.4811|
| | |BERTScore-R |98.6984|± | 0.6839|
| | |BERTScore-F |98.6685|± | 0.5824|
| | |bleurt_large |14.3253|± | 0.5687|
| | |wmt22-comet-da|84.5221|± | 0.4793|
I am running this command:
python -m lighteval accelerate \
--model_args openai,model=gpt-4o-mini \
--tasks "community|sdst-text_level:de-fr|3|0" \
--custom_tasks lighteval/community_tasks/swiss_legal_evals.py \
--output_dir outputs \
--override_batch_size 1 \
--save_details \
--max_samples 2
@clefourrier @NathanHB Do you know why this is happening and how I can fix it?
Duplicated rows: yes, each "metric type" gets its own row, since they are not parsed the same way (to make sure each comes with its own correct logprob-related info, for example). This is a feature, not a bug. However, I have no idea why the judge eval is not there.
Found the issue: the corpus_level_fn name did not match the metric name.
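In other words, the key used for the corpus-level aggregation has to match the metric name exactly, otherwise the metric is silently dropped from the summary table. A toy illustration in plain Python (not lighteval internals):

```python
# Toy illustration of the name mismatch — not lighteval code.
import statistics

metric_name = "judge_score_gpt-4o"
sample_scores = {metric_name: [0.8, 0.6]}

# Buggy: key "judge_score" does not match the metric name, so nothing is aggregated.
# corpus_level_fn = {"judge_score": statistics.mean}

# Fixed: key matches metric_name exactly.
corpus_level_fn = {metric_name: statistics.mean}

summary = {
    name: fn(sample_scores[name])
    for name, fn in corpus_level_fn.items()
    if name in sample_scores
}
print(summary)  # only the correctly named metric shows up in the overview
```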
Hey @JoelNiklaus! Any update on this?
There will be more tasks added to this PR from other people, but we could also merge this one and add the others in separate PRs. Whatever is best for you.
We can keep this open and add more tasks, that works for us!