
Why do I have a lot of `code>` in generated Java code? What should I do to get rid of them?

Open ytxmobile98 opened this issue 1 year ago • 5 comments

I was running a code completion evaluation with codefuseEval on the Qwen2.5-Coder base model. When I ran the Java evaluation, I saw a lot of <|fim_prefix|> and code> markups in the generated code. So I followed issue #99 and added the special tokens to the tokenizer, as follows:

        tokenizer = AutoTokenizer.from_pretrained(
            path, trust_remote_code=True, use_fast=False, legacy=False)

        add_special_tokens = ["<|file_sep|>", "<film_pad|>",
                              "<|fim_prefix|>", "<|fim_suffix|>",
                              "<|fim_middle|>", "<|repo_name|>"]
        tokenizer.add_special_tokens({"additional_special_tokens": add_special_tokens},
                                     replace_additional_special_tokens=False)
        tokenizer.eos_token = "<|file_sep|>"
        tokenizer.eos_token_id = 151664

After adding the lines above to the evaluation code and running the evaluation again, the <|fim_prefix|> markups were gone, but the code> markups were still there.

What do I need to do in order to get rid of the code> markups?



ytxmobile98 avatar Oct 25 '24 08:10 ytxmobile98

It is weird, let me try the samples.

cyente avatar Nov 01 '24 08:11 cyente

> It is weird, let me try the samples.

@cyente Did you see anything unusual when you tried them out?

ytxmobile98 avatar Nov 29 '24 08:11 ytxmobile98

Maybe it's about "<film_pad|>"? I think it should be <|fim_pad|>.

> Maybe it's about "<film_pad|>"? I think it should be <|fim_pad|>.

Oh yes. Thanks!

ytxmobile98 avatar May 06 '25 07:05 ytxmobile98
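
For anyone hitting the same problem: a misspelled entry such as "<film_pad|>" is accepted without complaint when it is passed as an additional special token, so the mistake is easy to miss. Below is a minimal sketch of a sanity check that could be run on the token list before registering it. The helper name `find_suspect_tokens` and the stand-in vocabulary are hypothetical and not part of codefuseEval or the model repository, and the sketch assumes the intended spelling is `<|fim_pad|>`, following the pattern of the other `fim_` tokens. With a real Hugging Face tokenizer, the mapping returned by `tokenizer.get_vocab()` could be passed as `vocab` instead.

```python
import re
from typing import Collection, List

# Shape shared by the special tokens in the list from the question: <|name|>
SPECIAL_TOKEN_SHAPE = re.compile(r"^<\|[a-z_]+\|>$")


def find_suspect_tokens(tokens: List[str], vocab: Collection[str]) -> List[str]:
    """Return the tokens that are malformed or absent from the vocabulary."""
    suspects = []
    for token in tokens:
        if not SPECIAL_TOKEN_SHAPE.match(token) or token not in vocab:
            suspects.append(token)
    return suspects


# Stand-in vocabulary for illustration only (an assumption, not read from the model).
example_vocab = {
    "<|file_sep|>", "<|fim_pad|>", "<|fim_prefix|>",
    "<|fim_suffix|>", "<|fim_middle|>", "<|repo_name|>",
}

# The token list exactly as written in the question, including the typo.
requested = [
    "<|file_sep|>", "<film_pad|>", "<|fim_prefix|>",
    "<|fim_suffix|>", "<|fim_middle|>", "<|repo_name|>",
]

print(find_suspect_tokens(requested, example_vocab))  # prints ['<film_pad|>']
```

If the check returns an empty list, every requested token already exists in the vocabulary and has the expected `<|name|>` shape.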