text icon indicating copy to clipboard operation
text copied to clipboard

Remove unk hardcode

Open mttk opened this issue 6 years ago • 6 comments

Delegate the unk_token to arguments when constructing the vocabulary.

Fixes #618 , relatively major issue.

mttk avatar Oct 24 '19 13:10 mttk

This is enough to fix the issue. I would suggest we follow the example of huggingface/transformers (https://github.com/huggingface/transformers/blob/master/transformers/tokenization_utils.py#L52-L68) and having something akin to these special symbols predefined as properties in the vocab, which should prove to be easier in the long run than using a single list. @zhangguanheng66

mttk avatar Oct 25 '19 12:10 mttk

This is enough to fix the issue. I would suggest we follow the example of huggingface/transformers (https://github.com/huggingface/transformers/blob/master/transformers/tokenization_utils.py#L52-L68) and having something akin to these special symbols predefined as properties in the vocab, which should prove to be easier in the long run than using a single list. @zhangguanheng66

I agree. It's more straightforward.

zhangguanheng66 avatar Oct 25 '19 14:10 zhangguanheng66

LGTM. Please let me know if you need a final review to merge the PR. Thanks.

zhangguanheng66 avatar Oct 29 '19 14:10 zhangguanheng66

Sure, it's ready for merging from my side, let me know if you see anything problematic.

mttk avatar Oct 29 '19 15:10 mttk

Is this backwards compatible? Do we need to include these changes in our release notes? Please also have a look at https://github.com/pytorch/text/releases/tag/0.4.0 which includes notes relevant to the UNK token.

cpuhrsch avatar Oct 29 '19 20:10 cpuhrsch

You are right, it won't be backwards compatible as loading previously stored models will error out due to the missing attribute. I'll add a fix later this week, I'm swamped right now. This change should be mentioned in the release notes in case someone was explicitly using Vocab.UNK in their code.

mttk avatar Nov 05 '19 19:11 mttk

Closing since Field class no longer exists in torchtext.

Nayef211 avatar Mar 10 '23 17:03 Nayef211