Why does my generated Java code contain a lot of `code>`? What should I do to get rid of it?

Open ytxmobile98 opened this issue 1 year ago • 5 comments

I was running a code completion evaluation with codefuseEval on the Qwen2.5-Coder base model. When I ran the Java evaluation, I saw a lot of <|fim_prefix|> and code> markups in the generated code. Following issue #99, I added the special tokens to the tokenizer as follows:

        tokenizer = AutoTokenizer.from_pretrained(
            path, trust_remote_code=True, use_fast=False, legacy=False)

        add_special_tokens = ["<|file_sep|>", "<|fim_pad|>",
                              "<|fim_prefix|>", "<|fim_suffix|>",
                              "<|fim_middle|>", "<|repo_name|>"]
        tokenizer.add_special_tokens({"additional_special_tokens": add_special_tokens},
                                     replace_additional_special_tokens=False)
        tokenizer.eos_token = "<|file_sep|>"
        tokenizer.eos_token_id = 151664

Then I ran the evaluation again with the lines above added to the evaluation code. The <|fim_prefix|> markups were gone, but the code> markups were still there.

What do I need to do in order to get rid of the code> markups?
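
Until the root cause is known, one possible stopgap is to strip the leftover markups from each completion before scoring it. The following is only a minimal sketch of that idea, not part of codefuseEval: the function name `strip_markers`, the token list, and the regular expression are all assumptions made for illustration, and the pattern could also remove a legitimate Java expression such as `x.code>5`.

```python
import re

# Assumed marker list: the special tokens from the tokenizer snippet above.
SPECIAL_TOKENS = [
    "<|file_sep|>", "<|fim_pad|>", "<|fim_prefix|>",
    "<|fim_suffix|>", "<|fim_middle|>", "<|repo_name|>",
]

# Assumed shape of the stray fragment: "<code>", "</code>", or a bare "code>"
# that starts at a word boundary. Caveat: this also matches real Java such as
# "x.code>5", so review the output before relying on it.
STRAY_FRAGMENT = re.compile(r"</?code>|\bcode>")


def strip_markers(text: str) -> str:
    """Remove special tokens and stray code> fragments from generated text."""
    for token in SPECIAL_TOKENS:
        text = text.replace(token, "")
    return STRAY_FRAGMENT.sub("", text)
```

This only hides the symptom in the evaluated output; it does not explain why the model emits `code>` in the first place.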
