text Remove unk hardcode

Delegate the unk_token to arguments when constructing the vocabulary.

Fixes #618 , relatively major issue.

Oct 24 '19 13:10 mttk

This is enough to fix the issue. I would suggest we follow the example of huggingface/transformers (https://github.com/huggingface/transformers/blob/master/transformers/tokenization_utils.py#L52-L68) and having something akin to these special symbols predefined as properties in the vocab, which should prove to be easier in the long run than using a single list. @zhangguanheng66

Oct 25 '19 12:10 mttk

This is enough to fix the issue. I would suggest we follow the example of huggingface/transformers (https://github.com/huggingface/transformers/blob/master/transformers/tokenization_utils.py#L52-L68) and having something akin to these special symbols predefined as properties in the vocab, which should prove to be easier in the long run than using a single list. @zhangguanheng66

I agree. It's more straightforward.

Oct 25 '19 14:10 zhangguanheng66

LGTM. Please let me know if you need a final review to merge the PR. Thanks.

Oct 29 '19 14:10 zhangguanheng66

Sure, it's ready for merging from my side, let me know if you see anything problematic.

Oct 29 '19 15:10 mttk

Is this backwards compatible? Do we need to include these changes in our release notes? Please also have a look at https://github.com/pytorch/text/releases/tag/0.4.0 which includes notes relevant to the UNK token.

Oct 29 '19 20:10 cpuhrsch

You are right, it won't be backwards compatible as loading previously stored models will error out due to the missing attribute. I'll add a fix later this week, I'm swamped right now. This change should be mentioned in the release notes in case someone was explicitly using Vocab.UNK in their code.

Nov 05 '19 19:11 mttk

Closing since Field class no longer exists in torchtext.

Mar 10 '23 17:03 Nayef211