Transformers-Tutorials
Donut finetuning RVL-CDIP ipynb -- add class names to tokenizer as empty strings?
The ipynb states:
Prepare dataset
The first thing we'll do is add the class names as added tokens to the vocabulary of the decoder of Donut, and the corresponding tokenizer.
And then shows:
additional_tokens = ["", "", "", "", "", "", "",
"", "", "", "", "", "",
"", "", ""]
Why does this step add empty strings rather than, e.g., these class names:
id2label = {
0: "letter",
1: "form",
2: "email",
3: "handwritten",
4: "advertisement",
5: "scientific_report",
6: "scientific_publication",
7: "specification",
8: "file_folder",
9: "news_article",
10: "budget",
11: "invoice",
12: "presentation",
13: "questionnaire",
14: "resume",
15: "memo"
}
That's because you're reading the notebook on GitHub. If you open the notebook in Colab, you will see the class names.
:)
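A likely explanation, sketched below under stated assumptions: if the notebook writes each class token in an angle-bracket form such as `<letter/>` (an assumption here, since the GitHub view does not show the format), a renderer that treats anything tag-like as HTML would drop it and leave only `""`. The `strip_html_like_tags` helper is hypothetical and only imitates that effect; it is not how GitHub actually renders notebooks.

```python
import re

id2label = {
    0: "letter", 1: "form", 2: "email", 3: "handwritten",
    4: "advertisement", 5: "scientific_report", 6: "scientific_publication",
    7: "specification", 8: "file_folder", 9: "news_article", 10: "budget",
    11: "invoice", 12: "presentation", 13: "questionnaire", 14: "resume",
    15: "memo",
}

# Assumed token format: each class name wrapped as a self-closing tag.
additional_tokens = [f"<{name}/>" for name in id2label.values()]


def strip_html_like_tags(text: str) -> str:
    """Hypothetical stand-in for a renderer that drops tag-like text."""
    return re.sub(r"<[^>]*>", "", text)


# What a tag-stripping renderer would leave behind for each token.
as_rendered = [strip_html_like_tags(token) for token in additional_tokens]

print(additional_tokens[:3])
print(as_rendered[:3])
```

In the notebook itself, a list like `additional_tokens` would then typically be passed to the tokenizer's `add_tokens` method, followed by resizing the decoder's token embeddings to the new vocabulary size; check the Colab version for the exact calls it uses.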