Nathan Habib

29 issues by Nathan Habib

- Removes unneeded script files used to launch lighteval
- Moves the parser functions to `lighteval.commands.parsers`
- Creates a `lighteval` CLI executable with 3 subcommands (`list-tasks`, `nanotron`, `accelerate`)

how to use: ###...
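A minimal sketch of how a single `lighteval` entry point with these three subcommands could be wired using `argparse`. The subcommand names come from the PR summary; everything else here is illustrative, not lighteval's actual implementation.

```python
# Hypothetical CLI layout: one `lighteval` executable with three
# subcommands (list-tasks, nanotron, accelerate), as described in the PR.
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="lighteval")
    subparsers = parser.add_subparsers(dest="command", required=True)
    subparsers.add_parser("list-tasks", help="List all available tasks")
    subparsers.add_parser("nanotron", help="Evaluate a nanotron model")
    subparsers.add_parser("accelerate", help="Evaluate with accelerate")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["list-tasks"])
    print(args.command)
```

In the real CLI, each subparser would be populated by the parser functions moved to `lighteval.commands.parsers`.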

It is currently cumbersome to log details of what is happening inside metric functions (for example, logging the judge prompt in the LLM-as-judge metric). Passing them the evaluation tracker would...
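A hypothetical sketch of the requested change: the metric function receives the evaluation tracker and can log intermediate details such as the judge prompt. The `EvaluationTracker` class and the metric signature below are invented for illustration and do not reflect lighteval's actual API.

```python
# Hypothetical tracker and metric: all names here are illustrative.
from dataclasses import dataclass, field


@dataclass
class EvaluationTracker:
    details: list = field(default_factory=list)

    def log_detail(self, name: str, value: str) -> None:
        self.details.append({name: value})


def llm_judge_metric(prediction: str, reference: str,
                     tracker: EvaluationTracker) -> float:
    judge_prompt = (
        f"Rate this answer.\nReference: {reference}\nAnswer: {prediction}"
    )
    # With the tracker in scope, the prompt ends up in the run's logs.
    tracker.log_detail("judge_prompt", judge_prompt)
    # A real metric would call the judge model here; return a dummy score.
    return 1.0 if prediction == reference else 0.0


tracker = EvaluationTracker()
score = llm_judge_metric("Paris", "Paris", tracker)
```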

feature request

What this PR does:
- Adds a safeguard when computing the batch size so that models do not OOM
- Recomputes the batch size for each greedy-until task
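The safeguard can be sketched as a back-off loop: start from the requested batch size and halve it whenever an out-of-memory error is raised. `forward_pass` below is a stand-in for the actual model call, with the OOM condition simulated; the real PR's logic may differ.

```python
# Illustrative batch-size safeguard; forward_pass simulates a model call
# that raises (like torch does) when the batch does not fit in memory.
def forward_pass(batch_size: int, max_safe_batch: int) -> None:
    """Stand-in for a model forward pass."""
    if batch_size > max_safe_batch:
        raise RuntimeError("CUDA out of memory")


def find_safe_batch_size(start: int, max_safe_batch: int) -> int:
    batch_size = start
    while batch_size > 1:
        try:
            forward_pass(batch_size, max_safe_batch)
            return batch_size
        except RuntimeError as err:
            if "out of memory" not in str(err):
                raise  # only swallow OOM errors
            batch_size //= 2  # back off and retry with a smaller batch
    return 1
```

Recomputing this per greedy-until task matters because generation length, and therefore memory use, varies between tasks.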

What this PR does - Adds a try/except when applying the chat template, as some models do not support a system prompt in their templates.
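A sketch of that fallback, under the assumption that the fix folds the system prompt into the first user message when the template rejects a system role. `strict_template` is a toy stand-in for `tokenizer.apply_chat_template` on such a model; the actual PR may handle this differently.

```python
# Toy template that, like some models' chat templates, rejects system
# messages outright.
def strict_template(messages: list[dict]) -> str:
    for m in messages:
        if m["role"] == "system":
            raise ValueError("System role not supported")
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)


def apply_with_fallback(messages: list[dict]) -> str:
    try:
        return strict_template(messages)
    except ValueError:
        # Fallback: prepend the system prompt to the first user message.
        system = " ".join(
            m["content"] for m in messages if m["role"] == "system"
        )
        merged, injected = [], False
        for m in messages:
            if m["role"] == "system":
                continue
            if m["role"] == "user" and not injected:
                merged.append(
                    {"role": "user", "content": f"{system}\n{m['content']}"}
                )
                injected = True
            else:
                merged.append(m)
        return strict_template(merged)


prompt = apply_with_fallback([
    {"role": "system", "content": "Be concise."},
    {"role": "user", "content": "Hi"},
])
```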

Adds support for both model and data parallelism.

## Issue encountered
Lighteval does not allow evaluating models on tool usage.

## Solution/Feature
Add benchmarks for tool usage.

feature request

https://github.com/NVIDIA/RULER

Available context sizes: 4096, 8192, 16384, 32768, 65536, 131072

```
uv run lighteval vllm "model_name=meta-llama/Llama-3.1-8B,dtype=bfloat16,max_model_length=131072" "lighteval|ruler_{context size}|0|0"
```
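Assuming the `{context size}` placeholder in the command above is replaced by one of the listed sizes, a small helper can enumerate every resulting task spec:

```python
# Enumerate the RULER task specs for the context sizes listed above,
# following the "lighteval|ruler_{context size}|0|0" pattern from the
# command template.
CONTEXT_SIZES = [4096, 8192, 16384, 32768, 65536, 131072]


def ruler_task_specs(sizes: list[int] = CONTEXT_SIZES) -> list[str]:
    return [f"lighteval|ruler_{size}|0|0" for size in sizes]


for spec in ruler_task_specs():
    print(spec)
```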

new-task

## Evaluation short description
A Benchmark for Tool-Agent-User Interaction in Real-World Domains.

## Evaluation metadata
Provide all available:
- Paper URL:
- Github URL: https://github.com/sierra-research/tau-bench
- Dataset URL:

help wanted
new-task