When using -pc output in the terminal, some Chinese characters are not displayed correctly
Just like https://github.com/ggerganov/whisper.cpp/issues/25, when transcribing in zh (Chinese) there are still some missing characters. The model is ggml-large.bin from Hugging Face (https://huggingface.co/datasets/ggerganov/whisper.cpp/tree/main).
Maybe the large and large-v1 models still need to be fixed?
error example:
[00:00:13.000 --> 00:00:15.000] 各种AI的应用��出不��
ground truth:
[00:00:13.000 --> 00:00:15.000] 各种AI的应用层出不穷
It's already converted using the solution in #25. Maybe there is some other problem with the encoding - not sure
I'm very sorry, I may have caused some misunderstanding. Through experiments I found that using -pc output in the terminal produces garbled characters, and I initially determined that this is caused by the per-token text returned here:
const char * text = whisper_full_get_token_text(ctx, i, j);
Like in #25, a Chinese character sometimes spans more than one token, so a single token's bytes are not a complete character. I guess this is the problem. I modified the output statement to add a space between tokens and got the following result, which may support my guess.
printf("%s%s %s%s", speaker.c_str(), k_colors[3].c_str(), text, "\033[0m");
[00:00:13.000 --> 00:00:15.000] 各 种 AI 的 应 用 � � 出 不 � �
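To illustrate what I mean, here is a minimal standalone sketch (the split shown here is hypothetical, just to demonstrate the effect; it is not the actual tokenizer output):
#include <cstdio>

int main() {
    // "穷" (U+7A77) is three bytes in utf-8: 0xE7 0xA9 0xB7
    // suppose the tokenizer returns it split across two tokens:
    const char * tok1 = "\xE7\xA9"; // incomplete sequence - not valid utf-8 on its own
    const char * tok2 = "\xB7";     // stray continuation byte - also invalid on its own

    // printing the tokens separately (with anything in between, e.g. the color
    // escape codes or a space) breaks the byte sequence, so the terminal shows �
    printf("separate: %s %s\n", tok1, tok2);

    // printing the concatenated bytes restores the character
    printf("joined:   %s%s\n", tok1, tok2);
    return 0;
}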
Finally, I would like to apologize again for any trouble my unclear description may have caused.
No problems. Just to make sure I understand - when you remove the -pc flag, do you get the correct characters?
Yes, it outputs normally
At the same time, when I output to a file it is normal.
Calling whisper_full_get_segment_text(ctx, i) gives me the correct output.
Any news on this issue? I am facing the same problem in Greek. The bug also occurs when using -ml with a small value, for example -ml 5.
I think this is caused by the fact that the correspondence between words and tokens is not one-to-one. I think this will also lead to errors when obtaining per-word timestamps. But I'm sorry, I haven't found a good way to fix this problem.
I indeed intend to get word-level timestamps. This error does not occur during normal transcription (without the -ml argument).
@ggerganov Any idea on what is causing this bug, or how it may be fixed?
If you add the --split-on-word argument does it fix the issue?
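For reference, that would be something like the following (the model path and audio file are just placeholders):
./main -m models/ggml-large.bin -l el -f audio.wav -ml 5 --split-on-word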
const char * text = whisper_full_get_token_text(ctx, i, j);
...
printf("%s%s%s%s", speaker.c_str(), k_colors[col].c_str(), text, "\033[0m");
The issue stems from the fact that the token text may not be a valid utf-8 string. With OpenAI's tiktoken tokenizer, the utf-8 encoding of a Chinese character can be split across multiple tokens, which leads to the problem. In such a scenario printf("%s", text) outputs a scrambled or unintelligible string.
To resolve the issue I use the icu library to check whether the token text is a valid utf-8 string. If it is, it is printed as usual; if not, the token text is appended to a temporary char buffer instead. The buffer is not printed until the bytes it holds form a valid utf-8 string.
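(For reference, the validity check itself can also be done by hand if you want to avoid the icu dependency. Here is a minimal sketch of such a check - my own simplified version, which validates sequence lengths and continuation bytes but does not reject overlong or surrogate encodings. The icu-based steps I actually used follow below.)
#include <string>

// returns true if `s` is a complete, structurally well-formed utf-8 byte sequence
static bool is_complete_utf8(const std::string & s) {
    size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = (unsigned char) s[i];
        size_t len;
        if      (c < 0x80)           len = 1; // ASCII
        else if ((c & 0xE0) == 0xC0) len = 2; // 2-byte sequence
        else if ((c & 0xF0) == 0xE0) len = 3; // 3-byte sequence (most Chinese characters)
        else if ((c & 0xF8) == 0xF0) len = 4; // 4-byte sequence
        else return false;                    // invalid lead byte
        if (i + len > s.size()) return false; // sequence truncated - keep buffering
        for (size_t k = 1; k < len; ++k) {
            if (((unsigned char) s[i + k] & 0xC0) != 0x80) return false; // bad continuation byte
        }
        i += len;
    }
    return true;
}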
So the quick fix steps are (Ubuntu):
- install the `icu` library: `sudo apt-get install libicu-dev`
- modify `LDFLAGS` in `./Makefile`: `LDFLAGS = -licuuc`
- modify `examples/main/main.cpp`:
// include `icu` header
#include <unicode/ustring.h>

...

// add `is_valid_utf8()` function to check whether the string is a valid utf-8 string or not
int is_valid_utf8(const char *str) {
    UErrorCode error = U_ZERO_ERROR;
    // preflight conversion (dest = NULL): nothing is converted, but the error is
    // set to U_INVALID_CHAR_FOUND if `str` is not well-formed utf-8
    u_strFromUTF8(NULL, 0, NULL, str, -1, &error);
    return error != U_INVALID_CHAR_FOUND;
}

...

// modify the `if (params.print_colors)` loop
if (params.print_colors) {
    // temp char buffer that accumulates token bytes until they form a valid utf-8 string
    char tmp[1024];
    tmp[0] = '\0';

    for (int j = 0; j < whisper_full_n_tokens(ctx, i); ++j) {
        if (params.print_special == false) {
            const whisper_token id = whisper_full_get_token_id(ctx, i, j);
            if (id >= whisper_token_eot(ctx)) {
                continue;
            }
        }

        const char * text = whisper_full_get_token_text(ctx, i, j);
        const float  p    = whisper_full_get_token_p   (ctx, i, j);

        const int col = std::max(0, std::min((int) k_colors.size() - 1, (int) (std::pow(p, 3)*float(k_colors.size()))));

        // push to the temp char buffer
        strcat(tmp, text);

        // print and reset the buffer only once it holds a valid utf-8 string
        if (is_valid_utf8(tmp)) {
            printf("%s%s%s%s", speaker.c_str(), k_colors[col].c_str(), tmp, "\033[0m");
            tmp[0] = '\0';
        }
    }
} else {
    const char * text = whisper_full_get_segment_text(ctx, i);
    printf("%s%s", speaker.c_str(), text);
}
before: (screenshot of the -pc output before the fix)

after: (screenshot of the -pc output after the fix)
One issue with the current implementation is that the buffered text is printed in the color assigned to the last token added to the buffer.
Note: I only ran some tests on Chinese; I'm not sure whether the fix is applicable to other languages as well.
Fixed in #1313