
Serbian LLM Benchmark Task

DeanChugall opened this issue 1 year ago • 5 comments

Serbian LLM Benchmark Task Configuration and Prompt Functions

Summary:

This pull request introduces task configurations and prompt functions for evaluating LLMs on various Serbian datasets. The module includes tasks for:

ARC (Easy and Challenge), BoolQ, Hellaswag, OpenBookQA, PIQA, Winogrande, and a custom OZ Eval dataset.

The tasks are defined using the LightevalTaskConfig class, and prompt generation is streamlined through a reusable serbian_eval_prompt function.

Changes:

  1. Task Configurations:

    • Configurations for ARC (Easy and Challenge), BoolQ, Hellaswag, OpenBookQA, PIQA, Winogrande, and OZ Eval tasks using LightevalTaskConfig.
    • Enum class HFSubsets added for dataset subset management, improving code maintainability and clarity.
    • create_task_config function allows dynamic task creation with dependency injection for flexibility in dataset and metric selection.
  2. Prompt Functions:

    • The serbian_eval_prompt function creates a structured multiple-choice prompt in Serbian.
    • The function supports dynamic query and choice generation with configurable tasks.
  3. Logging:

    • A hello_message banner is printed upon task initialization, listing all available tasks.
    • Task names are dynamically generated and printed using hlog_warn.
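The configuration pattern described above can be sketched as follows. This is a minimal, self-contained illustration, not the PR's actual code: the subset values, the `TaskConfig` stand-in (the real module uses lighteval's `LightevalTaskConfig`), and the exact Serbian prompt wording are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class HFSubsets(Enum):
    """Maps each benchmark to its dataset subset name (values are assumed)."""
    ARC_EASY = "arc_easy"
    ARC_CHALLENGE = "arc_challenge"
    BOOLQ = "boolq"
    HELLASWAG = "hellaswag"
    OPENBOOK_QA = "openbookqa"
    PIQA = "piqa"
    WINOGRANDE = "winogrande"
    OZ_EVAL = "oz_eval"

@dataclass
class TaskConfig:
    """Simplified stand-in for lighteval's LightevalTaskConfig."""
    name: str
    prompt_function: Callable
    hf_subset: str
    metric: str

def serbian_eval_prompt(line: dict) -> str:
    """Build a multiple-choice prompt in Serbian from one dataset row."""
    query = f"Pitanje: {line['question']}\n\n"   # "Question:"
    for letter, choice in zip("ABCD", line["choices"]):
        query += f"{letter}. {choice}\n"
    query += "\nOdgovor:"                        # "Answer:"
    return query

def create_task_config(
    task_name: str,
    subset: HFSubsets,
    metric: str = "loglikelihood_acc",
) -> TaskConfig:
    """Create a task config with the prompt function and subset injected,
    so new datasets only need a new enum member and a call here."""
    return TaskConfig(
        name=task_name,
        prompt_function=serbian_eval_prompt,
        hf_subset=subset.value,
        metric=metric,
    )
```

With this shape, registering a new task is a one-liner such as `create_task_config("serbian_evals:arc_easy", HFSubsets.ARC_EASY)`, which is the dependency-injection flexibility the summary refers to.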

Key Features:

  • Modular Design: Task configurations are modular, reusable, and easily extendable to accommodate new datasets and tasks.
  • Improved Readability: Introduction of the HFSubsets Enum improves the readability and maintainability of the dataset subset references.
  • Enhanced Flexibility: create_task_config function simplifies task creation, promoting cleaner and more maintainable code.
  • Clear Logging: Logging includes a friendly welcome message and a list of available tasks for easier debugging and interaction.

Future Enhancements:

  • Additional prompt functions can be added for different task types.
  • Unit tests should be written to ensure the integrity of prompt generation and task configuration.
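As a starting point for the unit tests mentioned above, a minimal pytest-style sketch could check that the generated prompt contains the question and every choice. The prompt function is stubbed inline here so the example is self-contained; real tests would import `serbian_eval_prompt` from the module.

```python
# Hypothetical test sketch; the inline stub mirrors the assumed prompt shape.

def serbian_eval_prompt(line: dict) -> str:
    query = f"Pitanje: {line['question']}\n"
    for letter, choice in zip("ABCD", line["choices"]):
        query += f"{letter}. {choice}\n"
    return query + "Odgovor:"

def test_prompt_contains_question_and_all_choices():
    row = {
        "question": "Koji je glavni grad Srbije?",
        "choices": ["Niš", "Beograd", "Novi Sad", "Subotica"],
    }
    prompt = serbian_eval_prompt(row)
    assert prompt.startswith("Pitanje:")
    assert row["question"] in prompt
    for choice in row["choices"]:
        assert choice in prompt
    assert prompt.endswith("Odgovor:")
```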

DeanChugall avatar Oct 03 '24 15:10 DeanChugall

Fixed `ruff format --check .` for CI.

DeanChugall avatar Oct 07 '24 08:10 DeanChugall

It would be great if we used `pre-commit run`, but when it runs, some files don't satisfy the criteria, and I don't want to mess with those files. The files affected by `pre-commit run` are shown in the images below.

[Screenshots of `pre-commit run` output: 2024-10-07 10-52-25, 10-57-55, 10-58-10, etc.]

DeanChugall avatar Oct 07 '24 08:10 DeanChugall

Hmm, this should not happen. Are you sure you are running the correct versions?

NathanHB avatar Oct 07 '24 11:10 NathanHB

> Hmm, this should not happen. Are you sure you are running the correct versions?

Absolutely. Try checking at least one of those files manually, e.g. evaluation-task-request.md:

https://raw.githubusercontent.com/huggingface/lighteval/refs/heads/main/.github/ISSUE_TEMPLATE/evaluation-task-request.md

DeanChugall avatar Oct 07 '24 11:10 DeanChugall

Let's just wait for the quality check and see if we can merge.

NathanHB avatar Oct 08 '24 09:10 NathanHB