When using -pc output in the terminal, some Chinese characters are not displayed correctly
Just like https://github.com/ggerganov/whisper.cpp/issues/25, when transcribing in zh (Chinese) there are still some missing characters. The model is ggml-large.bin from Hugging Face (https://huggingface.co/datasets/ggerganov/whisper.cpp/tree/main).
Maybe the large and large-v1 models still need to be fixed?
error example:
[00:00:13.000 --> 00:00:15.000] 各种AI的应用��出不��
ground truth:
[00:00:13.000 --> 00:00:15.000] 各种AI的应用层出不穷
It's already converted using the solution in #25. Maybe there is some other problem with the encoding - not sure
I'm very sorry, I may have caused some misunderstanding. Through experiments I found that using -pc output in the terminal produces garbled characters, and I initially determined that this is caused by the per-token text returned here:
const char * text = whisper_full_get_token_text(ctx, i, j);
Like in #25, a Chinese character sometimes spans more than one token, so a single token's bytes are not a complete character. I guess this is the problem. I modified the output statement to add a space between tokens and got the following result, which may support my guess.
printf("%s%s %s%s", speaker.c_str(), k_colors[3].c_str(), text, "\033[0m");
[00:00:13.000 --> 00:00:15.000] 各 种 AI 的 应 用 � � 出 不 � �
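To illustrate what I mean, here is a minimal standalone sketch (the split shown here is hypothetical, just to demonstrate the effect; it is not the actual tokenizer output):
#include <cstdio>

int main() {
    // "穷" (U+7A77) is three bytes in utf-8: 0xE7 0xA9 0xB7
    // suppose the tokenizer returns it split across two tokens:
    const char * tok1 = "\xE7\xA9"; // incomplete sequence - not valid utf-8 on its own
    const char * tok2 = "\xB7";     // stray continuation byte - also invalid on its own

    // printing the tokens separately (with anything in between, e.g. the color
    // escape codes or a space) breaks the byte sequence, so the terminal shows �
    printf("separate: %s %s\n", tok1, tok2);

    // printing the concatenated bytes restores the character
    printf("joined:   %s%s\n", tok1, tok2);
    return 0;
}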
Finally, I would like to apologize again for any trouble my unclear description may have caused.
No problems. Just to make sure I understand - when you remove the -pc flag, do you get the correct characters?
Yes, it outputs normally
At the same time, when I output to a file it is normal.
Calling whisper_full_get_segment_text(ctx, i) gives me the correct output.
Any news on this issue? I am facing the same problem in Greek. The bug also occurs when using -ml with a small value, for example -ml 5.
I think this is caused by the fact that the correspondence between words and tokens is not one-to-one. I think this will also lead to errors when obtaining per-word timestamps. But I'm sorry, I haven't found a good way to fix this problem.
I indeed intend to get word-level timestamps. This error does not occur during normal transcription (without the -ml argument).
@ggerganov Any idea on what is causing this bug, or how it may be fixed?
If you add the --split-on-word argument does it fix the issue?
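For reference, that would be something like the following (the model path and audio file are just placeholders):
./main -m models/ggml-large.bin -l el -f audio.wav -ml 5 --split-on-word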
const char * text = whisper_full_get_token_text(ctx, i, j);
...
printf("%s%s%s%s", speaker.c_str(), k_colors[col].c_str(), text, "\033[0m");
The issue stems from the fact that the token text may not be a valid utf-8 string. With OpenAI's tiktoken tokenizer, the utf-8 encoding of a Chinese character can be split across multiple tokens, which leads to the problem. In such a scenario printf("%s", text) outputs a scrambled or unintelligible string.
To resolve the issue I use the icu library to check whether the token text is a valid utf-8 string. If it is, it is printed as usual; if not, the token text is appended to a temporary char buffer instead. The buffer is not printed until the bytes it holds form a valid utf-8 string.
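(For reference, the validity check itself can also be done by hand if you want to avoid the icu dependency. Here is a minimal sketch of such a check - my own simplified version, which validates sequence lengths and continuation bytes but does not reject overlong or surrogate encodings. The icu-based steps I actually used follow below.)
#include <string>

// returns true if `s` is a complete, structurally well-formed utf-8 byte sequence
static bool is_complete_utf8(const std::string & s) {
    size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = (unsigned char) s[i];
        size_t len;
        if      (c < 0x80)           len = 1; // ASCII
        else if ((c & 0xE0) == 0xC0) len = 2; // 2-byte sequence
        else if ((c & 0xF0) == 0xE0) len = 3; // 3-byte sequence (most Chinese characters)
        else if ((c & 0xF8) == 0xF0) len = 4; // 4-byte sequence
        else return false;                    // invalid lead byte
        if (i + len > s.size()) return false; // sequence truncated - keep buffering
        for (size_t k = 1; k < len; ++k) {
            if (((unsigned char) s[i + k] & 0xC0) != 0x80) return false; // bad continuation byte
        }
        i += len;
    }
    return true;
}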
So the quick fix steps are (Ubuntu):
- install the `icu` library: `sudo apt-get install libicu-dev`
- modify `LDFLAGS` in `./Makefile`: `LDFLAGS = -licuuc`
- modify `examples/main/main.cpp`:
// include `icu` header
#include <unicode/ustring.h>

...

// add `is_valid_utf8()` function to check whether the string is a valid utf-8 string or not
int is_valid_utf8(const char *str) {
    UErrorCode error = U_ZERO_ERROR;
    // preflight conversion (dest = NULL): nothing is converted, but the error is
    // set to U_INVALID_CHAR_FOUND if `str` is not well-formed utf-8
    u_strFromUTF8(NULL, 0, NULL, str, -1, &error);
    return error != U_INVALID_CHAR_FOUND;
}

...

// modify the `if (params.print_colors)` loop
if (params.print_colors) {
    // temp char buffer that accumulates token bytes until they form a valid utf-8 string
    char tmp[1024];
    tmp[0] = '\0';

    for (int j = 0; j < whisper_full_n_tokens(ctx, i); ++j) {
        if (params.print_special == false) {
            const whisper_token id = whisper_full_get_token_id(ctx, i, j);
            if (id >= whisper_token_eot(ctx)) {
                continue;
            }
        }

        const char * text = whisper_full_get_token_text(ctx, i, j);
        const float  p    = whisper_full_get_token_p   (ctx, i, j);

        const int col = std::max(0, std::min((int) k_colors.size() - 1, (int) (std::pow(p, 3)*float(k_colors.size()))));

        // push to the temp char buffer
        strcat(tmp, text);

        // print and reset the buffer only once it holds a valid utf-8 string
        if (is_valid_utf8(tmp)) {
            printf("%s%s%s%s", speaker.c_str(), k_colors[col].c_str(), tmp, "\033[0m");
            tmp[0] = '\0';
        }
    }
} else {
    const char * text = whisper_full_get_segment_text(ctx, i);
    printf("%s%s", speaker.c_str(), text);
}
before: (screenshot of the -pc output before the fix)

after: (screenshot of the -pc output after the fix)
One issue with the current implementation is that the buffered text is printed in the color assigned to the last token added to the buffer.
Note: I only ran some tests on Chinese; I'm not sure whether the fix is applicable to other languages as well.
Fixed in #1313