Why do I have a lot of `code>` in generated Java code? What should I do to get rid of them?
I was running a code completion evaluation with codefuseEval on the Qwen2.5-Coder base model. When I ran the Java evaluation, I saw a lot of <|fim_prefix|> and code> markers in the generated code. So I followed issue #99 and added the special tokens to the tokenizer, as follows:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    path, trust_remote_code=True, use_fast=False, legacy=False)
add_special_tokens = ["<|file_sep|>", "<film_pad|>",
                      "<|fim_prefix|>", "<|fim_suffix|>",
                      "<|fim_middle|>", "<|repo_name|>"]
tokenizer.add_special_tokens({"additional_special_tokens": add_special_tokens},
                             replace_additional_special_tokens=False)
tokenizer.eos_token = "<|file_sep|>"
tokenizer.eos_token_id = 151664
After adding the lines above to the evaluation code and rerunning the evaluation, the <|fim_prefix|> markers were gone, but the code> markers were still there.
What do I need to do to get rid of the code> markers?
Attachments
The minimal dataset to test (5 examples)
First evaluation
result_humaneval_java.jsonl.bak.txt result_humaneval_java_evaluation_result.jsonl.bak.txt
Second evaluation
result_humaneval_java.jsonl.txt result_humaneval_java_evaluation_result.jsonl.txt
It is weird, let me try the samples.
@cyente Did you see anything unusual when you tried it out?
Maybe it's about "<film_pad|>"? I think it should be <|fim_pad|>.
Oh yes. Thanks!
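For anyone who lands here with the same symptom: a misspelled entry in the special-token list is easy to miss, because `tokenizer.add_special_tokens` will add a string that is not already in the vocabulary as a new token instead of raising an error. One way to catch this is to compare the list against the existing vocabulary before registering it. The snippet below is a minimal sketch of that check. `find_unknown_tokens` is a hypothetical helper (not part of transformers or codefuseEval), and `known_vocab` is a hand-written stand-in for what `tokenizer.get_vocab()` would return, so the example runs without downloading a model.

```python
# Hypothetical helper; not part of transformers or codefuseEval.
def find_unknown_tokens(candidates, vocab):
    """Return the candidate tokens that are missing from the vocabulary."""
    return [token for token in candidates if token not in vocab]


# Stand-in for tokenizer.get_vocab(), which returns a token -> id mapping.
# With a real tokenizer, pass tokenizer.get_vocab() instead of this set.
# Assumption: these are the FIM/repo marker names in Qwen2.5-Coder.
known_vocab = {
    "<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>",
    "<|fim_pad|>", "<|repo_name|>", "<|file_sep|>",
}

# The list from the original post, typo included.
requested = ["<|file_sep|>", "<film_pad|>",
             "<|fim_prefix|>", "<|fim_suffix|>",
             "<|fim_middle|>", "<|repo_name|>"]

unknown = find_unknown_tokens(requested, known_vocab)
print(unknown)  # prints ['<film_pad|>']
```

If the returned list is non-empty, it is probably worth checking the spelling of those entries before calling `add_special_tokens`, since the model was never trained on a newly added token.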