Serbian LLM Benchmark Task
Serbian LLM Benchmark Task Configuration and Prompt Functions
Summary:
This pull request introduces task configurations and prompt functions for evaluating LLMs on various Serbian datasets. The module includes tasks for:
ARC (Easy and Challenge), BoolQ, HellaSwag, OpenBookQA, PIQA, Winogrande, and the custom OZ Eval dataset.
The tasks are defined using the `LightevalTaskConfig` class, and prompt generation is streamlined through a reusable `serbian_eval_prompt` function.
Changes:

Task Configurations:
- Configurations for ARC (Easy and Challenge), BoolQ, HellaSwag, OpenBookQA, PIQA, Winogrande, and OZ Eval tasks using `LightevalTaskConfig`.
- Enum class `HFSubsets` added for dataset subset management, improving code maintainability and clarity.
- A `create_task_config` function allows dynamic task creation, with dependency injection for flexibility in dataset and metric selection (see the sketch after this list).
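As a rough illustration, the factory plus the Enum could look like the sketch below. The subset values, repo id, and splits here are placeholders, and the `LightevalTaskConfig` keyword names should be checked against the lighteval version in use:

```python
from enum import Enum
from typing import Callable

from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig


class HFSubsets(Enum):
    # Placeholder subset names; the real values are defined in the PR.
    ARC_EASY = "arc_easy"
    ARC_CHALLENGE = "arc_challenge"
    BOOLQ = "boolq"


def create_task_config(
    task_name: str,
    prompt_function: Callable,
    hf_repo: str,
    hf_subset: HFSubsets,
    metric=Metrics.loglikelihood_acc,
) -> LightevalTaskConfig:
    # Dataset, prompt function, and metric are injected rather than
    # hard-coded, so adding a new Serbian task is a single call.
    return LightevalTaskConfig(
        name=task_name,
        prompt_function=prompt_function,
        suite=["community"],
        hf_repo=hf_repo,
        hf_subset=hf_subset.value,
        hf_avail_splits=["train", "test"],
        evaluation_splits=["test"],
        metric=[metric],
    )
```

With a factory like this, each benchmark becomes one `create_task_config(...)` call, and swapping the dataset or metric is a keyword change rather than a new config block.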
Prompt Functions:
- The `serbian_eval_prompt` function creates a structured multiple-choice prompt in Serbian (sketched below).
- The function supports dynamic query and choice generation with configurable tasks.
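A minimal sketch of such a prompt function, assuming dataset rows expose `query`, `choices`, and a numeric `answer` index (the real column names depend on the datasets wired up in this PR):

```python
from lighteval.tasks.requests import Doc


def serbian_eval_prompt(line: dict, task_name: str = None) -> Doc:
    """Build a Serbian multiple-choice prompt from one dataset row."""
    choices = line["choices"]
    # "Pitanje" = "Question", "Odgovor" = "Answer" in Serbian.
    query = f"Pitanje: {line['query']}\n"
    query += "".join(f"{letter}. {choice}\n" for letter, choice in zip("ABCDE", choices))
    query += "Odgovor:"
    return Doc(
        task_name=task_name,
        query=query,
        choices=[f" {c}" for c in choices],
        gold_index=int(line["answer"]),
    )
```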
Logging:
- A `hello_message` banner is printed upon task initialization, listing all available tasks (see the sketch below).
- Task names are dynamically generated and printed using `hlog_warn`.
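For illustration only, the banner logic could be as small as this; the message text and the helper name `announce_tasks` are made up here, while `hlog_warn` is the lighteval logging helper the PR refers to:

```python
from lighteval.logging.hierarchical_logger import hlog_warn

# Hypothetical banner text; the actual message lives in the PR.
hello_message = "Serbian LLM benchmark tasks loaded. Available tasks:"


def announce_tasks(task_configs):
    hlog_warn(hello_message)
    for cfg in task_configs:
        # Task names come straight from the configs, so the printed
        # list never drifts out of sync with what is registered.
        hlog_warn(f"  - {cfg.name}")
```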
Key Features:
- Modular Design: Task configurations are modular, reusable, and easily extendable to accommodate new datasets and tasks.
- Improved Readability: The introduction of the `HFSubsets` Enum improves the readability and maintainability of dataset subset references.
- Enhanced Flexibility: The `create_task_config` function simplifies task creation, promoting cleaner and more maintainable code.
- Clear Logging: Logging includes a friendly welcome message and a list of available tasks for easier debugging and interaction.
Future Enhancements:
- Additional prompt functions can be added for different task types.
- Unit tests should be written to ensure the integrity of prompt generation and task configuration.
Fixed `ruff format --check .` for CI.
It would be great if we used `pre-commit run`, but when it is run some files do not satisfy the criteria, and I don't want to mess with those files.
The files affected by `pre-commit run` are shown in the image below.
etc ...
Hmm, this should not happen. Are you sure you are running the correct versions?
Absolutely, try checking at least one of those files manually, e.g. evaluation-task-request.md:
https://raw.githubusercontent.com/huggingface/lighteval/refs/heads/main/.github/ISSUE_TEMPLATE/evaluation-task-request.md
Let's just wait for the quality check and see if we can merge.