Donut finetuning RVL-CDIP ipynb -- add class names to tokenizer as empty strings?

Open plamb-viso opened this issue 2 years ago • 1 comments

The ipynb states:

Prepare dataset
The first thing we'll do is add the class names as added tokens to the vocabulary of the decoder of Donut, and the corresponding tokenizer.

And then shows:

additional_tokens = ["", "", "", "", "", "", "",
  "", "", "", "", "", "",
  "", "", ""]

Why does this step add empty strings rather than, e.g., these class names:

id2label = {
  0: "letter",
  1: "form",
  2: "email",
  3: "handwritten",
  4: "advertisement",
  5: "scientific_report",
  6: "scientific_publication",
  7: "specification",
  8: "file_folder",
  9: "news_article",
  10: "budget",
  11: "invoice",
  12: "presentation",
  13: "questionnaire",
  14: "resume",
  15: "memo"
}

Oct 06 '23 13:10 plamb-viso

It's because you're reading the notebook on GitHub. If you open the notebook in Colab, you will see the classes.

Oct 07 '23 09:10 NielsRogge
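
A likely explanation for the empty strings, sketched under an assumption: if the class tokens are written in an angle-bracket form such as `<letter/>` (a hypothetical spelling; the thread does not show the actual tokens), a viewer that treats cell text as HTML would parse each one as a tag and drop it, leaving only the surrounding quotes. The snippet below reproduces that effect with Python's standard `html.parser`; it is an illustration of the mechanism, not GitHub's actual rendering code.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keeps text nodes and silently drops anything parsed as a tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Only plain text reaches this callback; tags never do.
        self.parts.append(data)


def render_as_html(source: str) -> str:
    """Return the text that survives when `source` is parsed as HTML."""
    parser = TextOnly()
    parser.feed(source)
    parser.close()
    return "".join(parser.parts)


# Hypothetical token spelling, assumed for illustration only.
cell = 'additional_tokens = ["<letter/>", "<form/>", "<email/>"]'
rendered = render_as_html(cell)
print(rendered)
```

Under that assumption, the printed line shows the same pattern of empty strings as the excerpt quoted in the question, while the raw notebook source still contains the full tokens.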