Significant BLEU Score Gap Between evaluate and pycocoevalcap in Comma-Separated vs. Period-Separated Lists
Hi,
I encountered a significant discrepancy between BLEU scores computed with evaluate and pycocoevalcap, even though both libraries receive exactly the same predictions and references.
I ran two parallel experiments in which the predictions and references share the same structure, but one uses comma-separated lists while the other uses period-separated short sentences. In each experiment the prediction and reference differ by a single term.
The BLEU score from evaluate remains high in both cases, while the pycocoevalcap score drops significantly in the comma-separated case.
Here is the test case:
import evaluate
from pycocoevalcap.bleu.bleu import Bleu
# Load metrics
bleu_eval = evaluate.load("bleu")
bleu_pycoco = Bleu()
# ---------------------------
# Experiment 1: Comma-based
# ---------------------------
preds = ["opacity, consolidation, pleural effusion, and atelectasis are present."]
refs = ["opacity, consolidation, pleural effusion, and pneumonia are present."]
print("evaluate BLEU-4 (comma):", bleu_eval.compute(predictions=preds, references=refs, max_order=4)["bleu"])
gt_dict = {'1': [refs[0]]}
hypo_dict = {'1': [preds[0]]}
bleu_scores_1 = bleu_pycoco.compute_score(gt_dict, hypo_dict)
print("pycocoevalcap BLEU-4 (comma):", bleu_scores_1[0][3])
# ---------------------------
# Experiment 2: Period-based
# ---------------------------
preds = ["opacity . consolidation . pleural effusion . atelectasis are present ."]
refs = ["opacity . consolidation . pleural effusion . pneumonia are present ."]
print("evaluate BLEU-4 (period):", bleu_eval.compute(predictions=preds, references=refs, max_order=4)["bleu"])
gt_dict = {'1': [refs[0]]}
hypo_dict = {'1': [preds[0]]}
bleu_scores_2 = bleu_pycoco.compute_score(gt_dict, hypo_dict)
print("pycocoevalcap BLEU-4 (period):", bleu_scores_2[0][3])
The output is:
pycocoevalcap BLEU-4 (comma): 0.5946035573327129
evaluate BLEU-4 (period): 0.7016879391277372
pycocoevalcap BLEU-4 (period): 0.7016879389890388
Hi! I can take this.
I reproduced the discrepancy with the provided snippet. The gap stems from tokenization differences: evaluate’s BLEU applies punctuation-splitting tokenization (SacreBLEU-style 13a by default), while pycocoevalcap’s Bleu, when fed raw strings as in this snippet (no PTBTokenizer pass), falls back to a plain whitespace split. Under a whitespace split the commas stay attached to the words, so the comma-separated sentence has far fewer tokens, and the single differing word destroys a larger fraction of its n-grams; that is what drives the score drop in the comma-separated case.
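To make the n-gram arithmetic concrete, here is a self-contained sketch that reproduces both reported pycocoevalcap numbers by computing unsmoothed BLEU-4 over a plain whitespace split (the brevity penalty is 1 here because prediction and reference have equal length):

```python
from collections import Counter

pred_c = "opacity, consolidation, pleural effusion, and atelectasis are present."
ref_c = "opacity, consolidation, pleural effusion, and pneumonia are present."
pred_p = "opacity . consolidation . pleural effusion . atelectasis are present ."
ref_p = "opacity . consolidation . pleural effusion . pneumonia are present ."

def bleu4_whitespace(pred: str, ref: str) -> float:
    """Unsmoothed BLEU-4 over a plain whitespace split, i.e. the tokens
    pycocoevalcap's Bleu sees when raw strings are scored directly.
    Brevity penalty is 1 here because pred and ref have equal length."""
    p, r = pred.split(), ref.split()
    score = 1.0
    for n in range(1, 5):
        p_ngrams = Counter(tuple(p[i:i + n]) for i in range(len(p) - n + 1))
        r_ngrams = Counter(tuple(r[i:i + n]) for i in range(len(r) - n + 1))
        matched = sum(min(c, r_ngrams[g]) for g, c in p_ngrams.items())
        score *= matched / sum(p_ngrams.values())
    return score ** 0.25

# Whitespace-split lengths: commas stay glued to the words, so the comma
# sentence has only 8 tokens versus 11 for the period sentence.
print(len(pred_c.split()), len(pred_p.split()))  # 8 11

# One differing word therefore wipes out a larger fraction of the comma
# sentence's n-grams, reproducing the reported pycocoevalcap gap.
print(round(bleu4_whitespace(pred_c, ref_c), 4))  # 0.5946
print(round(bleu4_whitespace(pred_p, ref_p), 4))  # 0.7017
```

The modified precisions work out to 7/8 · 5/7 · 3/6 · 2/5 = 1/8 for the comma case and 10/11 · 8/10 · 6/9 · 4/8 = 8/33 for the period case, whose fourth roots match the reported 0.5946… and 0.7017… to within floating-point noise.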
Proposal: add an optional COCO/PTB-compatible tokenization mode to evaluate’s BLEU (e.g., tokenizer="coco" or tokenizer="ptb"), keeping the current default for backward compatibility. I’ll add tests that verify parity with pycocoevalcap on the two examples and document the option.
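To make the API question concrete, here is a hypothetical sketch of a COCO/PTB-style tokenizer callable. The regex is an illustrative approximation, not PTBTokenizer’s actual rules (the real tokenizer also removes some punctuation), and `coco_like_tokenize` is a made-up name; the commented usage assumes evaluate’s BLEU accepts a callable `tokenizer` argument, which should be confirmed against the current metric implementation:

```python
import re

def coco_like_tokenize(text: str) -> list[str]:
    """Hypothetical stand-in for COCO's PTBTokenizer: lowercase the text and
    split punctuation into separate tokens. Illustrative approximation only;
    the real PTBTokenizer has additional rules (e.g. punctuation removal)."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

print(coco_like_tokenize("Opacity, consolidation are present."))
# ['opacity', ',', 'consolidation', 'are', 'present', '.']

# Assumed usage, if evaluate's bleu accepts a tokenizer callable:
#   bleu = evaluate.load("bleu")
#   bleu.compute(predictions=preds, references=refs, tokenizer=coco_like_tokenize)
```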
Questions: any preference on the API name (tokenizer="coco"/"ptb" vs a boolean like coco_tokenization=True)? Should we mirror COCO’s punctuation normalization exactly, or only the tokenization step?
I can open a PR in the next few days. Please assign this issue to me.
Hi, thanks for your reply.