whisper_full_with_state: failed to decode
After running successfully for 6 minutes of audio, whisper.cpp (compiled with CUDA support) exits with the following message:
[00:06:04.000 --> 00:06:11.000] Gut, dann kommen wir zur Sache, nämlich zum digitale Dienste Gesetz.
whisper_full_with_state: failed to decode
./main: failed to process audio
This error occurs using ggml-tiny.bin as the model together with the -bs 8 -bo 8 flags. With the base model, the same issue appears a few minutes later.
Are there any tools available to get a more descriptive error message that could help narrow the source of the issue?
Stuck on the same problem, but with the large-v3 model. -bs 6 -bo 6 works fine; any -bs larger than 6 fails.
Same issue here with ggml-large-v2.bin or ggml-large-v3.bin, beam size 8, whisper.cpp 1.5.4. @ggerganov, any comment?
Can you provide a sample audio that reproduces the issue?
Sure. The file is too big for GitHub, so I had no choice but to upload it to file.io: https://file.io/t7PWhdPk5oHK
Sorry, haven't had much time lately to work on whisper.cpp. Maybe sometime next week. Hopefully someone figures it out in the meantime.
If I can help a bit: it seems to be a failure in:
whisper_kv_cache_find_slot: n_ctx:1344 n_tokens:8 cache.head:0 cache.n:1324 cache.size:1344
...
whisper_kv_cache_find_slot: n_ctx:1344 n_tokens:8 cache.head:0 cache.n:1340 cache.size:1344
whisper_kv_cache_find_slot: while loop: n_tested:0 cache.head:0
whisper_kv_cache_find_slot: while loop: n_tested:1 cache.head:1
whisper_kv_cache_find_slot: while loop: n_tested:2 cache.head:2
...
whisper_kv_cache_find_slot: while loop: n_tested:1336 cache.head:1336
whisper_kv_cache_find_slot: while loop: n_tested:1337 cache.head:1337
whisper_kv_cache_find_slot: while loop: context reached: n_tested=1337 cache.head=1337. Resetting head to 0.
whisper_kv_cache_find_slot: while loop: n_tested:1344 cache.head:0
whisper_kv_cache_find_slot: failed to find a slot for 8 tokens. n_tested=1345 n_ctx=1344 cache.head=1
whisper_decode_internal: failed to whisper_kv_cache_find_slot
whisper_full_with_state: failed to decode
if (n_tested >= n_ctx) {
WHISPER_LOG_ERROR("%s: failed to find a slot for %d tokens. n_tested=%d n_ctx=%d\n", __func__, n_tokens, n_tested, n_ctx);
return false;
}
Note:
- In the previously generated tokens, the max ctx is never reached.
- Note how cache.n is (dangerously) close to n_ctx (cache.size).
- Looking at the previously considered tokens, the network has entered a loop of hallucinations, repeating the same sequence of tokens again and again: no beam opts for the EOT token, and all beams opt for the same token.
@ggerganov I'm now wondering about these 2 lines: https://github.com/ggerganov/whisper.cpp/blob/34972dbe221709323714fc8402f2e24041d48213/src/whisper.cpp#L987C9-L991C10
if (cache.head + n_tokens > n_ctx) {
n_tested += n_ctx - cache.head;
cache.head = 0;
continue;
}
Restarting the search at the beginning of the KV cache (head = 0) makes sense, but why increase n_tested? Best
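For readers following along, here is a condensed, standalone sketch of the search loop, simplified from the snippets quoted above (the cell bookkeeping is reduced to a plain occupancy vector, so treat it as an illustration rather than the exact implementation). With a nearly full cache it reproduces the numbers from the trace:

#include <cstdio>
#include <vector>

// Simplified sketch of whisper_kv_cache_find_slot: look for a contiguous run
// of n_tokens free cells in a ring buffer of n_ctx cells, wrapping head to 0
// whenever the tail [head, n_ctx) is too short to hold the run.
static bool find_slot(const std::vector<bool> & occupied, int n_ctx, int n_tokens, int & head) {
    int n_tested = 0;
    while (true) {
        if (head + n_tokens > n_ctx) {
            n_tested += n_ctx - head; // the questioned line: tail cells count as tested
            head = 0;
            continue;
        }
        bool found = true;
        for (int i = 0; i < n_tokens; i++) {
            if (occupied[head + i]) {
                found = false;
                head     += i + 1; // resume just past the occupied cell
                n_tested += i + 1;
                break;
            }
        }
        if (found) {
            return true;
        }
        if (n_tested >= n_ctx) {
            return false; // the branch that logs "failed to find a slot"
        }
    }
}

int main() {
    const int n_ctx = 1344, n_tokens = 8;
    std::vector<bool> occupied(n_ctx, true);
    // leave only the last 4 cells free: no contiguous run of 8 exists
    for (int i = n_ctx - 4; i < n_ctx; i++) occupied[i] = false;
    int head = 0;
    // fails with n_tested = 1345, head = 1, matching the trace above
    printf("found slot: %s\n", find_slot(occupied, n_ctx, n_tokens, head) ? "yes" : "no");
    return 0;
}

In this configuration free cells do exist, just not eight contiguous ones, which also matches the observation below that the search can still fail even with the questioned line removed.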
Thanks anyway. Moving forward, the line
n_tested += n_ctx - cache.head;
is likely wrong and could be commented out. That alone does not resolve the issue, though: sometimes, for some audio files, the algorithm still runs out of slots, and whisper_kv_cache_find_slot still fails and returns false.
A solution I've found is simply to increase the self-attention text-context factor:
// at this point, we don't know yet how many decoders will be used, so we overallocate 3x ctx
// in theory, there can be a case where this is not enough, but in practice it should always be enough
const int factor = 3;
WHISPER_LOG_INFO("%s: init self-attn cache: n_ctx: %d\n", __func__, factor*ctx->model.hparams.n_text_ctx);
if (!kv_cache_init(ctx->model.hparams, state->kv_self, ctx->backend, ctx->itype, factor*ctx->model.hparams.n_text_ctx)) {
https://github.com/ggerganov/whisper.cpp/blob/451e9ee92c24a49134ba9b2c059da809c2402f98/src/whisper.cpp#L3413
The comment "but in practice it should always be enough" is sadly not true, as shown in this ticket. Increasing the factor of course uses more memory, but that is another question.
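For a sense of scale (my own back-of-the-envelope numbers, not from the ticket): Whisper's decoder context n_text_ctx is 448, so factor 3 yields exactly the n_ctx = 1344 seen in the trace above, while 8 beams, assuming each beam keeps its own sequence in the shared self-attention cache, can in the worst case demand far more:

#include <cstdio>

int main() {
    // assumption: worst-case demand scales with the number of beam decoders
    const int n_text_ctx = 448; // Whisper decoder context
    const int factor     = 3;   // current hard-coded overallocation
    const int n_beams    = 8;   // -bs 8, as used in this thread

    printf("allocated : %d cells\n", factor  * n_text_ctx); // 1344, as in the trace
    printf("worst case: %d cells\n", n_beams * n_text_ctx); // 3584 > 1344
    return 0;
}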
@ggerganov would you mind if I added a CLI/API argument so that the user can override that text factor (still defaulting to 3)?
Best regards WT
Ah, good find. Adding a CLI arg for this implementation detail is not desired. I think we have to set the factor to 5 and restrict the number of beams to a max of 5?
I see that OP used 8 beams - does that bring any improvements compared to 5?
Thanks.
Adding a CLI arg for this implementation detail is not desired.
It is certainly desired, by us at least: at the moment, for that audio file, I need to set factor=10, but who knows what settings another language or file would require. Now, if you prefer, I'm OK with adding an optional environment variable instead, e.g. "WHISPER_SELFATTN_TEXT_CTX_FACTOR".
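For illustration, the proposed override could look roughly like this (a hypothetical sketch built around the variable name suggested above; it is not what PR #2433 actually implements):

#include <cstdlib>

// Hypothetical: read the self-attn ctx factor from the environment,
// falling back to the current hard-coded default of 3 when the variable
// is unset or not a positive integer.
static int whisper_selfattn_ctx_factor() {
    if (const char * env = std::getenv("WHISPER_SELFATTN_TEXT_CTX_FACTOR")) {
        const int factor = std::atoi(env);
        if (factor > 0) {
            return factor;
        }
    }
    return 3;
}

kv_cache_init would then be called with whisper_selfattn_ctx_factor()*ctx->model.hparams.n_text_ctx instead of the hard-coded constant.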
I think we have to set the factor to 5 and restrict the number of beams to a max of 5?
It does not seem to help, at least for that file: I've just tried, and to get the best quality (no repetition/hallucination) without running out of slots, I have to set factor=10 and beam size 8. As shown here, trying to guess a hard-coded value for this kind of parameter rarely, if ever, works.
I see that OP used 8 beams - does that bring any improvements compared to 5?
Yes, that's actually the only solution I've found to prevent hallucination/repetition. And the speed is still very good (congrats, by the way).
@Kostyansa @MathiasSchindler PR to set the ctx factor; it should resolve your issue too: https://github.com/ggerganov/whisper.cpp/pull/2433
Ah, good find. Adding a CLI arg for this implementation detail is not desired. I think we have to set the factor to 5 and restrict the number of beams to a max of 5?
I see that OP used 8 beams - does that bring any improvements compared to 5?
I set the factor to 5 and recompiled. The whisper_full_with_state error no longer appears. Thank you.
I used 8 beams instead of 5 because of a different issue: it appeared that increasing the beam size to 8 reduces how often the model gets stuck in a loop, repeating the same line over and over again. I described this in #1949 and #2191, but other people did not see a benefit from it.