Is the vocabulary of BERT the same as the vocabulary of BERT-joint?
As mentioned in the technical report, special markup tokens such as "[Paragraph=N]" and "[Table=N]" were introduced. I don't think such tokens exist in the original BERT vocabulary, so the embedding table in the first layer of the transformer encoder seems to differ between BERT and BERT-joint. Yet BERT-joint was initialized from a pre-trained BERT model. I had a hard time understanding this part. Any ideas?
BERT-joint uses a different vocabulary from the original BERT. The vocab list used for BERT-joint replaces the [unusedN] placeholder tokens in the original BERT vocab with the special markup tokens, keeping the total size unchanged (the original vocab.txt includes ~1k [unusedN] entries). BERT-joint is initialized from a pre-trained BERT checkpoint, and as long as the total vocabulary size stays the same, initializing from pre-trained BERT works, because the embedding table keeps the same shape and the replaced entries simply reuse the (essentially untrained) [unusedN] embeddings.
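For concreteness, here is a minimal Python sketch of that idea: overwrite the [unusedN] lines in vocab.txt with markup tokens, leaving the file length (and therefore the embedding matrix shape) unchanged. The exact token names and counts below are illustrative assumptions, not taken from the released preprocessing code.

```python
# Sketch: swap BERT's [unusedN] placeholder tokens for special markup
# tokens while keeping the vocabulary size fixed, so the pre-trained
# embedding table (vocab_size x hidden_size) still loads unchanged.

# Assumed token names/counts for illustration only.
special_tokens = (
    [f"[Paragraph={i}]" for i in range(50)]
    + [f"[Table={i}]" for i in range(50)]
    + [f"[List={i}]" for i in range(50)]
)

with open("vocab.txt", encoding="utf-8") as f:
    vocab = [line.rstrip("\n") for line in f]

original_size = len(vocab)

# Indices of the placeholder tokens we are allowed to overwrite.
unused_ids = [i for i, tok in enumerate(vocab) if tok.startswith("[unused")]
assert len(special_tokens) <= len(unused_ids), "not enough [unusedN] slots"

for slot, tok in zip(unused_ids, special_tokens):
    vocab[slot] = tok

# Same number of entries as before, so the checkpoint's word-embedding
# matrix can be restored without any resizing.
assert len(vocab) == original_size

with open("vocab_bert_joint.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")
```

Since the [unusedN] embeddings were never trained on anything during pre-training, repurposing them this way costs nothing, and the new markup tokens get useful embeddings during fine-tuning.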