whisper_full_with_state: failed to decode
After running successfully for 6 minutes of audio, whisper.cpp (compiled with CUDA support) exits with the following message:
[00:06:04.000 --> 00:06:11.000] Gut, dann kommen wir zur Sache, nämlich zum digitale Dienste Gesetz.
whisper_full_with_state: failed to decode
./main: failed to process audio
This error occurs using ggml-tiny.bin as the model together with the -bs 8 -bo 8 flags. With the base model, the same issue appears a few minutes later.
Are there any tools available to get a more descriptive error message that could help narrow the source of the issue?
Stuck on the same problem, but with the large-v3 model. -bs 6 -bo 6 works fine; any -bs larger than 6 fails.
Same issue here with ggml-large-v2.bin or ggml-large-v3.bin, beam size 8, whisper.cpp 1.5.4. @ggerganov, any comment?
Can you provide a sample audio that reproduces the issue?
Sure. The file is too big for GitHub, so I had no choice but to upload it to file.io: https://file.io/t7PWhdPk5oHK
Sorry, haven't had much time lately to work on whisper.cpp. Maybe sometime next week. Hopefully someone figures it out in the meantime.
If I can help a bit: it seems to be a failure in:
whisper_kv_cache_find_slot: n_ctx:1344 n_tokens:8 cache.head:0 cache.n:1324 cache.size:1344
...
whisper_kv_cache_find_slot: n_ctx:1344 n_tokens:8 cache.head:0 cache.n:1340 cache.size:1344
whisper_kv_cache_find_slot: while loop: n_tested:0 cache.head:0
whisper_kv_cache_find_slot: while loop: n_tested:1 cache.head:1
whisper_kv_cache_find_slot: while loop: n_tested:2 cache.head:2
...
whisper_kv_cache_find_slot: while loop: n_tested:1336 cache.head:1336
whisper_kv_cache_find_slot: while loop: n_tested:1337 cache.head:1337
whisper_kv_cache_find_slot: while loop: context reached: n_tested=1337 cache.head=1337. Resetting head to 0.
whisper_kv_cache_find_slot: while loop: n_tested:1344 cache.head:0
whisper_kv_cache_find_slot: failed to find a slot for 8 tokens. n_tested=1345 n_ctx=1344 cache.head=1
whisper_decode_internal: failed to whisper_kv_cache_find_slot
whisper_full_with_state: failed to decode
if (n_tested >= n_ctx) {
WHISPER_LOG_ERROR("%s: failed to find a slot for %d tokens. n_tested=%d n_ctx=%d\n", __func__, n_tokens, n_tested, n_ctx);
return false;
}
Note:
- In the previously generated tokens, the max ctx is never reached.
- Note how cache.n is (dangerously) close to n_ctx (cache.size).
- Looking at the previously considered tokens, the network has entered a loop of hallucinations, repeating the same sequence of tokens again and again: no beam opts for the EOT token, and all beams opt for the same token.
@ggerganov I'm now wondering about these 2 lines: https://github.com/ggerganov/whisper.cpp/blob/34972dbe221709323714fc8402f2e24041d48213/src/whisper.cpp#L987C9-L991C10
if (cache.head + n_tokens > n_ctx) {
n_tested += n_ctx - cache.head;
cache.head = 0;
continue;
}
Restarting the search at the beginning of the KV cache (head = 0) makes sense, but why increase n_tested? Best
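For readers following along, here is a condensed, standalone sketch of the search loop, simplified from the snippets quoted above (the cell bookkeeping is reduced to a plain occupancy vector, so treat it as an illustration rather than the exact implementation). With a nearly full cache it reproduces the numbers from the trace:

#include <cstdio>
#include <vector>

// Simplified sketch of whisper_kv_cache_find_slot: look for a contiguous run
// of n_tokens free cells in a ring buffer of n_ctx cells, wrapping head to 0
// whenever the tail [head, n_ctx) is too short to hold the run.
static bool find_slot(const std::vector<bool> & occupied, int n_ctx, int n_tokens, int & head) {
    int n_tested = 0;
    while (true) {
        if (head + n_tokens > n_ctx) {
            n_tested += n_ctx - head; // the questioned line: tail cells count as tested
            head = 0;
            continue;
        }
        bool found = true;
        for (int i = 0; i < n_tokens; i++) {
            if (occupied[head + i]) {
                found = false;
                head     += i + 1; // resume just past the occupied cell
                n_tested += i + 1;
                break;
            }
        }
        if (found) {
            return true;
        }
        if (n_tested >= n_ctx) {
            return false; // the branch that logs "failed to find a slot"
        }
    }
}

int main() {
    const int n_ctx = 1344, n_tokens = 8;
    std::vector<bool> occupied(n_ctx, true);
    // leave only the last 4 cells free: no contiguous run of 8 exists
    for (int i = n_ctx - 4; i < n_ctx; i++) occupied[i] = false;
    int head = 0;
    // fails with n_tested = 1345, head = 1, matching the trace above
    printf("found slot: %s\n", find_slot(occupied, n_ctx, n_tokens, head) ? "yes" : "no");
    return 0;
}

In this configuration free cells do exist, just not eight contiguous ones, which also matches the observation below that the search can still fail even with the questioned line removed.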
Thanks anyway. Moving forward, the line
n_tested += n_ctx - cache.head;
is likely wrong and could be commented out. That alone does not resolve the issue, though: sometimes, for some audio files, the algorithm still runs out of slots, and whisper_kv_cache_find_slot still fails and returns false.
A solution I've found is simply to increase the self-attention text-context factor:
// at this point, we don't know yet how many decoders will be used, so we overallocate 3x ctx
// in theory, there can be a case where this is not enough, but in practice it should always be enough
const int factor = 3;
WHISPER_LOG_INFO("%s: init self-attn cache: n_ctx: %d\n", __func__, factor*ctx->model.hparams.n_text_ctx);
if (!kv_cache_init(ctx->model.hparams, state->kv_self, ctx->backend, ctx->itype, factor*ctx->model.hparams.n_text_ctx)) {
https://github.com/ggerganov/whisper.cpp/blob/451e9ee92c24a49134ba9b2c059da809c2402f98/src/whisper.cpp#L3413
The comment "but in practice it should always be enough" is sadly not true, as shown in this ticket. Increasing the factor of course uses more memory, but that is another question.
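For a sense of scale (my own back-of-the-envelope numbers, not from the ticket): Whisper's decoder context n_text_ctx is 448, so factor 3 yields exactly the n_ctx = 1344 seen in the trace above, while 8 beams, assuming each beam keeps its own sequence in the shared self-attention cache, can in the worst case demand far more:

#include <cstdio>

int main() {
    // assumption: worst-case demand scales with the number of beam decoders
    const int n_text_ctx = 448; // Whisper decoder context
    const int factor     = 3;   // current hard-coded overallocation
    const int n_beams    = 8;   // -bs 8, as used in this thread

    printf("allocated : %d cells\n", factor  * n_text_ctx); // 1344, as in the trace
    printf("worst case: %d cells\n", n_beams * n_text_ctx); // 3584 > 1344
    return 0;
}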
@ggerganov would you mind if I added a CLI/API argument so that the user can override that text factor (still defaulting to 3)?
Best regards WT
Ah, good find. Adding a CLI arg for this implementation detail is not desired. I think we have to set the factor to 5 and restrict the number of beams to a max of 5?
I see that OP used 8 beams - does that bring any improvements compared to 5?
Thanks.
Adding a CLI arg for this implementation detail is not desired.
It is certainly desired, by us at least: at the moment, for that audio file, I need to set factor=10, but who knows what settings another language or file would require. Now, if you prefer, I'm OK with adding an optional environment variable instead, e.g. "WHISPER_SELFATTN_TEXT_CTX_FACTOR".
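For illustration, the proposed override could look roughly like this (a hypothetical sketch built around the variable name suggested above; it is not what PR #2433 actually implements):

#include <cstdlib>

// Hypothetical: read the self-attn ctx factor from the environment,
// falling back to the current hard-coded default of 3 when the variable
// is unset or not a positive integer.
static int whisper_selfattn_ctx_factor() {
    if (const char * env = std::getenv("WHISPER_SELFATTN_TEXT_CTX_FACTOR")) {
        const int factor = std::atoi(env);
        if (factor > 0) {
            return factor;
        }
    }
    return 3;
}

kv_cache_init would then be called with whisper_selfattn_ctx_factor()*ctx->model.hparams.n_text_ctx instead of the hard-coded constant.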
I think we have to set the factor to 5 and restrict the number of beams to a max of 5?
It does not seem to help, at least for that file: I've just tried, and to get the best quality (no repetition/hallucination) without running out of slots, I have to set factor=10 and beam size 8. As shown here, trying to guess a hard-coded value for this kind of parameter rarely, if ever, works.
I see that OP used 8 beams - does that bring any improvements compared to 5?
Yes, that's actually the only solution I've found to prevent hallucination/repetition. And the speed is still very good (congrats, by the way).
@Kostyansa @MathiasSchindler PR to set the ctx factor; it should resolve your issue too: https://github.com/ggerganov/whisper.cpp/pull/2433
Ah, good find. Adding a CLI arg for this implementation detail is not desired. I think we have to set the factor to 5 and restrict the number of beams to a max of 5?
I see that OP used 8 beams - does that bring any improvements compared to 5?
I set the factor to 5 and recompiled. The whisper_full_with_state error no longer appears. Thank you.
I used 8 beams instead of 5 because of a different issue: it appeared that increasing the beam size to 8 reduces how often the model gets stuck in a loop, repeating the same line over and over again. I described this in #1949 and #2191, but other people did not see a benefit from it.