
Don't we need to reconcile SpaCy and BERT tokens?

Open hjpark2017 opened this issue 3 years ago • 2 comments

First of all, thank you for releasing the code from your paper. What I'm curious about is that SpaCy splits sentences into word-level tokens, while BERT splits them into WordPiece units, so the two sets of tokens may not map onto each other exactly. Which part of the released code handles this mismatch?

hjpark2017 avatar Oct 01 '22 06:10 hjpark2017


Hi, thanks for your question. I agree that SpaCy splits sentences into word units while BERT splits them into WordPiece units, so for a small number of samples the tokens are incongruent in SenticGCN-BERT. For the datasets used in this work, however, the tokenizations agree for most samples, so we do not handle this mismatch explicitly. You could certainly align the words with the BERT WordPiece units for better results. Please let me know if there is any problem. Thanks!!!
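(Editor's note: the alignment the reply mentions is not part of the released code; below is a minimal, self-contained sketch of one common way to do it. It assumes lowercase inputs matching an uncased BERT vocabulary and that continuation pieces carry the standard `##` prefix; the function name `align_wordpieces` is hypothetical, not from the repository.)

```python
def align_wordpieces(words, wordpieces):
    """Map each word to the span of WordPiece indices that covers it.

    Assumes continuation pieces start with '##' and that concatenating a
    word's pieces (with '##' stripped) reproduces the word exactly.
    Returns a list of (start, end) half-open index ranges into `wordpieces`,
    one per entry in `words`.
    """
    spans = []
    i = 0
    for word in words:
        start = i
        built = wordpieces[i]  # first piece of the word has no '##' prefix
        i += 1
        # Greedily consume continuation pieces until the word is rebuilt.
        while built != word and i < len(wordpieces):
            piece = wordpieces[i]
            built += piece[2:] if piece.startswith("##") else piece
            i += 1
        spans.append((start, i))
    return spans


# Example: SpaCy-style word tokens vs. BERT-style WordPiece tokens.
words = ["sentic", "graph", "convolution"]
wordpieces = ["sen", "##tic", "graph", "con", "##vo", "##lution"]
print(align_wordpieces(words, wordpieces))  # → [(0, 2), (2, 3), (3, 6)]
```

With these spans, a word-level representation can be recovered from BERT output by, for example, taking the first WordPiece embedding of each span or averaging over the span, so that the dependency graph built from SpaCy tokens lines up with the BERT features. With Hugging Face fast tokenizers, `BatchEncoding.word_ids()` gives the same alignment directly.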

BinLiang-NLP avatar Oct 06 '22 13:10 BinLiang-NLP

I'm sorry for the late greeting. Thank you for your kind explanation!

hjpark2017 avatar May 30 '23 04:05 hjpark2017