Double start-of-sequence token ids in TrOCR — is the start token added automatically?
Describe the bug
The model I am using: TrOCR (`microsoft/trocr-large-handwritten`).
The problem arises when using:
- [x] the official example scripts: following the fine-tuning tutorial by @NielsRogge
- [x] my own modified scripts: (see the script below)
```python
from datasets import load_metric
from transformers import TrOCRProcessor

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-large-handwritten")
cer_metric = load_metric("cer")  # CER metric, as in the tutorial

def compute_metrics(pred):
    labels_ids = pred.label_ids
    print('labels_ids', len(labels_ids), type(labels_ids), labels_ids)
    pred_ids = pred.predictions
    print('pred_ids', len(pred_ids), type(pred_ids), pred_ids)
    pred_str = processor.batch_decode(pred_ids, skip_special_tokens=True)
    print(pred_str)
    # replace -100 with the pad token id so the labels can be decoded
    labels_ids[labels_ids == -100] = processor.tokenizer.pad_token_id
    label_str = processor.batch_decode(labels_ids, skip_special_tokens=True)
    print(label_str)
    cer = cer_metric.compute(predictions=pred_str, references=label_str)
    return {"cer": cer}
```
```python
import torch
from PIL import Image
from torch.utils.data import Dataset


class OCRDataset(Dataset):  # renamed so it does not shadow torch.utils.data.Dataset
    def __init__(self, root_dir, df, processor, max_target_length=128):
        self.root_dir = root_dir
        self.df = df
        self.processor = processor
        self.max_target_length = max_target_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # get file name + text
        file_name = self.df['file_name'][idx]
        text = self.df['text'][idx]
        # prepare image (i.e. resize + normalize)
        image = Image.open(self.root_dir + file_name).convert("RGB")
        pixel_values = self.processor(image, return_tensors="pt").pixel_values
        # add labels (input_ids) by encoding the text
        labels = self.processor.tokenizer(text,
                                          padding="max_length",
                                          max_length=self.max_target_length).input_ids
        # important: make sure that PAD tokens are ignored by the loss function
        labels = [label if label != self.processor.tokenizer.pad_token_id else -100 for label in labels]
        return {"pixel_values": pixel_values.squeeze(), "labels": torch.tensor(labels)}
```
The script is invoked as `python3 train.py path/to/labels path/to/images/`.
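The trainer part of `train.py` is not shown above; for reference, this is a minimal sketch of how I wire the dataset and `compute_metrics` into a `Seq2SeqTrainer`, following the tutorial's setup (the `root_dir`, `train_df` and `test_df` variables are placeholders for my label/image paths):

```python
import torch
from transformers import (VisionEncoderDecoderModel, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments, default_data_collator)

model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-large-handwritten")

# special tokens used when building decoder_input_ids from the labels (as in the tutorial)
model.config.decoder_start_token_id = processor.tokenizer.cls_token_id
model.config.pad_token_id = processor.tokenizer.pad_token_id
model.config.eos_token_id = processor.tokenizer.sep_token_id
model.config.vocab_size = model.config.decoder.vocab_size

train_dataset = OCRDataset(root_dir=root_dir, df=train_df, processor=processor)
eval_dataset = OCRDataset(root_dir=root_dir, df=test_df, processor=processor)

training_args = Seq2SeqTrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    predict_with_generate=True,   # generation is what prepends decoder_start_token_id
    evaluation_strategy="steps",
    fp16=torch.cuda.is_available(),
)

trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=processor.feature_extractor,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=default_data_collator,
)
trainer.train()
```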
- Platform: Linux (Ubuntu), GCC 9.4.0
- PyTorch version (GPU?): 0.8.2+cu110
- transformers: 4.22.2
- Python version: 3.8.10
To Reproduce
Steps to reproduce the behavior:
- After training, and also during evaluation when the metrics are computed, I see that the model adds a double start-of-sequence token: the decoded text begins with `<s><s>`, and the generated ids look like `[0, 0, ......, 2, 1, 1, 1]`. Here is an example of the generated tokens printed inside `compute_metrics`:
  - Input predictions: `[[0, 0, 506, 4422, 8046, 2, 1, 1, 1, 1, 1]]`
  - Input references: `[[0, 597, 2747, ..., 1, 1, 1]]`
- Other examples appear while testing the model.
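To see where the two leading `0`s come from, a quick check (my own diagnostic sketch, not part of the training script) is to print what the tokenizer prepends to the labels and what the model prepends during generation:

```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-large-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-large-handwritten")

# the tokenizer already wraps the text in <s> ... </s>
print(processor.tokenizer("hello").input_ids)   # starts with the <s> id
print(processor.tokenizer.bos_token_id)         # id of <s>

# generate() additionally prepends decoder_start_token_id to every sequence;
# it may be set on the top-level config or on the decoder config
print(model.config.decoder_start_token_id, model.config.decoder.decoder_start_token_id)
```

If both the tokenizer's `<s>` id and `decoder_start_token_id` resolve to 0, every generated sequence starts with `0, 0` and decodes to `<s><s>...`.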
Expected behavior
In both reproduced cases I expect a single start token:
- During training, Input predictions should look like `[[0, 506, 4422, 8046, 2, 1, 1, 1, 1, 1]]` (only one leading `0`).
- During the testing phase, the generated ids should also have a single start token, e.g. `tensor([[0, 11867, 405, 22379, 1277, .........., 368, 2]])`, decoding to text with a single `<s>`:
  `<s>ennyit erről, tőlem fényképezz amennyit akarsz, a véleményem akkor</s>`
Related issue:
Another example, using the same Colab notebook, for a small dataset in another language.
cc @ArthurZucker
Thanks to Natabara for his comment. The solution is simply to skip the start token coming from the tokenizer (`labels = labels[1:]`), because the tokenizer already adds a start token and the TrOCR decoder adds one automatically, as mentioned in the TrOCR paper:
```python
labels = [label if label != self.processor.tokenizer.pad_token_id else -100 for label in labels]
# skip the <s> start token coming from the tokenizer, because the TrOCR model adds it itself
labels = labels[1:]
```
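Applied in context, the end of `__getitem__` then looks like this (a sketch of the same fix shown above):

```python
# inside OCRDataset.__getitem__
labels = self.processor.tokenizer(text,
                                  padding="max_length",
                                  max_length=self.max_target_length).input_ids
# mask PAD tokens so the loss ignores them
labels = [label if label != self.processor.tokenizer.pad_token_id else -100 for label in labels]
# drop the <s> added by the tokenizer; the TrOCR decoder prepends
# decoder_start_token_id itself during generation
labels = labels[1:]
return {"pixel_values": pixel_values.squeeze(), "labels": torch.tensor(labels)}
```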