
Performance compared to lm-evaluation-harness

Open geoalgo opened this issue 1 year ago • 6 comments

Hi,

Thanks for sharing this package, it has lots of cool features!

I saw that arc-challenge was taking about twice as long as what I get with the harness. I ran the following command with lighteval:

# ran with 4 A100 GPUs -> 611s
time accelerate launch --multi_gpu --num_processes=4 lighteval/run_evals_accelerate.py --model_args="pretrained=meta-llama/Meta-Llama-3-8B" --tasks "leaderboard|arc:challenge|25|0" --output_dir "arc_challenge" --override_batch_size 8

and the following command with the harness (using the big-refactor branch):

# ran with 4 A100 GPUs (big-refactor branch) -> 319s
time accelerate launch -m lm_eval \
    --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B,dtype="bfloat16" \
    --tasks arc_challenge \
    --batch_size 8 \
    --num_fewshot=25

Of course, many things could cause this, but I wanted to know if you have faced something similar or benchmarked lighteval against the harness?

If not, would you have a suggestion for getting similar performance? (It seems bf16 is used by default, so precision should not be the culprit.)
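
As a side note, one way to double-check which precision is actually in play (a minimal sketch, assuming the standard transformers API and that the checkpoint is reachable) is to print the dtype recorded in the model's config:

# prints the torch_dtype stored in the checkpoint's config.json
python -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('meta-llama/Meta-Llama-3-8B').torch_dtype)"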

geoalgo avatar Apr 29 '24 13:04 geoalgo

Hi! Is your accelerate configuration the same in both cases? Otherwise I'll take a look. Thanks for opening this issue!

clefourrier avatar Apr 29 '24 15:04 clefourrier

Yes, I used the default one, thanks.

geoalgo avatar Apr 29 '24 19:04 geoalgo

When you say the default one, what do you mean precisely? (Could you share the output of your configuration?) I ask because when launching lighteval you explicitly select the number of processes to use, but you don't when launching lm_eval.

If the model is small enough to fit twice on a GPU, you could be doing DP8 with lm_eval but only DP4 with lighteval, which would also explain the difference in speed.
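
For reference, a quick way to dump the effective accelerate setup for both runs (a sketch, assuming the default config location) is:

# prints the accelerate version, config, and detected hardware
accelerate env
# the default config file, if one was ever written with `accelerate config`
cat ~/.cache/huggingface/accelerate/default_config.yaml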

clefourrier avatar Apr 30 '24 06:04 clefourrier

I was using DDP with 4 GPUs in both cases (if I got everything right 😅).

This is the accelerate config output I got with lighteval:

The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`

and the one I got with lm_eval:

	`--num_processes` was set to a value of `4`
		More than one GPU was found, enabling multi-GPU training.
		If this was unintended please pass in `--num_processes=1`.
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`

One thing I am now wondering is whether lighteval really uses bf16 by default; if it doesn't, that could explain a large part of the gap. I will rerun with it set explicitly and let you know.

geoalgo avatar Apr 30 '24 07:04 geoalgo

Thanks a lot!

clefourrier avatar Apr 30 '24 07:04 clefourrier

I reran with bf16 set explicitly and it took 11 min with the following command:

time accelerate launch --multi_gpu --num_processes=4 lighteval/run_evals_accelerate.py --model_args="pretrained=meta-llama/Meta-Llama-3-8B,dtype=bfloat16" --tasks "leaderboard|arc:challenge|25|0" --output_dir "arc_challenge2" --override_batch_size 8

and it still took longer than lm_eval, so bf16 does not seem to be the culprit.

geoalgo avatar Apr 30 '24 14:04 geoalgo