Double start-of-sequence token ids in TrOCR — is the start token added automatically?
Describe the bug
The model I am using: TrOCR (`microsoft/trocr-large-handwritten`).
The problem arises when using:
- [x] the official example scripts: following the fine-tuning tutorial by @NielsRogge
- [x] my own modified scripts: (see the script below)
```python
from datasets import load_metric
from transformers import TrOCRProcessor

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-large-handwritten")
cer_metric = load_metric("cer")  # CER metric, as in the tutorial

def compute_metrics(pred):
    labels_ids = pred.label_ids
    print('labels_ids', len(labels_ids), type(labels_ids), labels_ids)
    pred_ids = pred.predictions
    print('pred_ids', len(pred_ids), type(pred_ids), pred_ids)
    pred_str = processor.batch_decode(pred_ids, skip_special_tokens=True)
    print(pred_str)
    # replace -100 with the pad token id so the labels can be decoded
    labels_ids[labels_ids == -100] = processor.tokenizer.pad_token_id
    label_str = processor.batch_decode(labels_ids, skip_special_tokens=True)
    print(label_str)
    cer = cer_metric.compute(predictions=pred_str, references=label_str)
    return {"cer": cer}
```
```python
import torch
from PIL import Image
from torch.utils.data import Dataset


class OCRDataset(Dataset):  # renamed so it does not shadow torch.utils.data.Dataset
    def __init__(self, root_dir, df, processor, max_target_length=128):
        self.root_dir = root_dir
        self.df = df
        self.processor = processor
        self.max_target_length = max_target_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # get file name + text
        file_name = self.df['file_name'][idx]
        text = self.df['text'][idx]
        # prepare image (i.e. resize + normalize)
        image = Image.open(self.root_dir + file_name).convert("RGB")
        pixel_values = self.processor(image, return_tensors="pt").pixel_values
        # add labels (input_ids) by encoding the text
        labels = self.processor.tokenizer(text,
                                          padding="max_length",
                                          max_length=self.max_target_length).input_ids
        # important: make sure that PAD tokens are ignored by the loss function
        labels = [label if label != self.processor.tokenizer.pad_token_id else -100 for label in labels]
        return {"pixel_values": pixel_values.squeeze(), "labels": torch.tensor(labels)}
```
The script is invoked as `python3 train.py path/to/labels path/to/images/`.
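The trainer part of `train.py` is not shown above; for reference, this is a minimal sketch of how I wire the dataset and `compute_metrics` into a `Seq2SeqTrainer`, following the tutorial's setup (the `root_dir`, `train_df` and `test_df` variables are placeholders for my label/image paths):

```python
import torch
from transformers import (VisionEncoderDecoderModel, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments, default_data_collator)

model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-large-handwritten")

# special tokens used when building decoder_input_ids from the labels (as in the tutorial)
model.config.decoder_start_token_id = processor.tokenizer.cls_token_id
model.config.pad_token_id = processor.tokenizer.pad_token_id
model.config.eos_token_id = processor.tokenizer.sep_token_id
model.config.vocab_size = model.config.decoder.vocab_size

train_dataset = OCRDataset(root_dir=root_dir, df=train_df, processor=processor)
eval_dataset = OCRDataset(root_dir=root_dir, df=test_df, processor=processor)

training_args = Seq2SeqTrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    predict_with_generate=True,   # generation is what prepends decoder_start_token_id
    evaluation_strategy="steps",
    fp16=torch.cuda.is_available(),
)

trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=processor.feature_extractor,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=default_data_collator,
)
trainer.train()
```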
- Platform: Linux (Ubuntu), GCC 9.4.0
- PyTorch version (GPU?): 0.8.2+cu110
- transformers: 4.22.2
- Python version: 3.8.10
To Reproduce
Steps to reproduce the behavior:
- After training, and also during evaluation when the metrics are computed, I see that the model adds a double start-of-sequence token: the decoded text begins with `<s><s>`, and the generated ids look like `[0, 0, ......, 2, 1, 1, 1]`. Here is an example of the generated tokens printed inside `compute_metrics`:
  - Input predictions: `[[0, 0, 506, 4422, 8046, 2, 1, 1, 1, 1, 1]]`
  - Input references: `[[0, 597, 2747, ..., 1, 1, 1]]`
- Other examples appear while testing the model.
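To see where the two leading `0`s come from, a quick check (my own diagnostic sketch, not part of the training script) is to print what the tokenizer prepends to the labels and what the model prepends during generation:

```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-large-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-large-handwritten")

# the tokenizer already wraps the text in <s> ... </s>
print(processor.tokenizer("hello").input_ids)   # starts with the <s> id
print(processor.tokenizer.bos_token_id)         # id of <s>

# generate() additionally prepends decoder_start_token_id to every sequence;
# it may be set on the top-level config or on the decoder config
print(model.config.decoder_start_token_id, model.config.decoder.decoder_start_token_id)
```

If both the tokenizer's `<s>` id and `decoder_start_token_id` resolve to 0, every generated sequence starts with `0, 0` and decodes to `<s><s>...`.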
Expected behavior
In both reproduced cases I expect a single start token:
- During training, Input predictions should look like `[[0, 506, 4422, 8046, 2, 1, 1, 1, 1, 1]]` (only one leading `0`).
- During the testing phase, the generated ids should also have a single start token, e.g. `tensor([[0, 11867, 405, 22379, 1277, .........., 368, 2]])`, decoding to text with a single `<s>`:
  `<s>ennyit erről, tőlem fényképezz amennyit akarsz, a véleményem akkor</s>`
Related issue:
Another example, using the same Colab notebook, for a small dataset in another language.
cc @ArthurZucker
Thanks to Natabara for his comment. The solution is simply to skip the start token coming from the tokenizer (`labels = labels[1:]`), because the tokenizer already adds a start token and the TrOCR decoder adds one automatically, as mentioned in the TrOCR paper:
```python
labels = [label if label != self.processor.tokenizer.pad_token_id else -100 for label in labels]
# skip the <s> start token coming from the tokenizer, because the TrOCR model adds it itself
labels = labels[1:]
```
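Applied in context, the end of `__getitem__` then looks like this (a sketch of the same fix shown above):

```python
# inside OCRDataset.__getitem__
labels = self.processor.tokenizer(text,
                                  padding="max_length",
                                  max_length=self.max_target_length).input_ids
# mask PAD tokens so the loss ignores them
labels = [label if label != self.processor.tokenizer.pad_token_id else -100 for label in labels]
# drop the <s> added by the tokenizer; the TrOCR decoder prepends
# decoder_start_token_id itself during generation
labels = labels[1:]
return {"pixel_values": pixel_values.squeeze(), "labels": torch.tensor(labels)}
```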