Remove unk hardcode
Delegate the unk_token to arguments when constructing the vocabulary.
Fixes #618, a relatively major issue.
This is enough to fix the issue. I would suggest we follow the example of huggingface/transformers (https://github.com/huggingface/transformers/blob/master/transformers/tokenization_utils.py#L52-L68) and predefine these special symbols as properties on the vocab, which should prove easier in the long run than using a single list. @zhangguanheng66
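A minimal sketch of the two ideas above: the unk token delegated to a constructor argument rather than hardcoded, and the special symbol exposed as a property. The class shape and names here are illustrative, not torchtext's actual implementation.

```python
from collections import Counter

class Vocab:
    """Illustrative vocabulary; unk_token is a constructor argument, not a hardcoded constant."""

    def __init__(self, counter, specials=('<unk>', '<pad>'), unk_token='<unk>'):
        self.unk_token = unk_token
        # specials come first so their indices are stable
        self.itos = list(specials) + sorted(counter)
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}

    @property
    def UNK(self):
        # special symbol exposed as a property, per the suggestion above
        return self.unk_token

    def __getitem__(self, token):
        # unknown tokens map to the index of unk_token
        return self.stoi.get(token, self.stoi[self.unk_token])

v = Vocab(Counter(['hello', 'world']))
```

With this layout, `v['hello']` returns the token's index and any out-of-vocabulary lookup falls back to the index of `v.UNK`.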
I agree. It's more straightforward.
LGTM. Please let me know if you need a final review to merge the PR. Thanks.
Sure, it's ready for merging from my side, let me know if you see anything problematic.
Is this backwards compatible? Do we need to include these changes in our release notes? Please also have a look at https://github.com/pytorch/text/releases/tag/0.4.0 which includes notes relevant to the UNK token.
You are right, it won't be backwards compatible as loading previously stored models will error out due to the missing attribute.
I'll add a fix later this week; I'm swamped right now. This change should be mentioned in the release notes in case someone was explicitly using Vocab.UNK in their code.
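One common way to handle this kind of missing-attribute breakage is to supply a default during deserialization, e.g. in `__setstate__`. This is only a hedged sketch of that pattern, not the fix that ultimately landed:

```python
class Vocab:
    """Illustrative class showing a backwards-compatibility shim for pickled objects."""

    def __init__(self, unk_token='<unk>'):
        self.unk_token = unk_token

    def __setstate__(self, state):
        # Older serialized vocabs predate the unk_token attribute;
        # fall back to the previous hardcoded default.
        state.setdefault('unk_token', '<unk>')
        self.__dict__.update(state)

# Simulate unpickling an old object whose state lacks 'unk_token'
restored = Vocab.__new__(Vocab)
restored.__setstate__({'itos': ['<unk>', 'hello']})
```

Because `pickle` calls `__setstate__` when restoring an instance, old models would load with the default rather than erroring out on the missing attribute.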
Closing since Field class no longer exists in torchtext.