Customize spaCy tokenizers
Is your feature request related to a problem? Please describe.
The spaCy tokenizers sometimes produce wrong tokens, e.g. for HTML data, tweets, or domain-specific terms.
For instance, for 'refinery is #opensource' I might want ['refinery', 'is', '#opensource'], but I get ['refinery', 'is', '#', 'opensource'].
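For reference, this can be reproduced with a default pipeline; a minimal sketch, assuming the en_core_web_sm model is installed:

import spacy

# default English pipeline; '#' is split off as a prefix by the standard tokenizer
nlp = spacy.load('en_core_web_sm')
print([token.text for token in nlp('refinery is #opensource')])
# prints: ['refinery', 'is', '#', 'opensource']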
Describe the solution you'd like
spaCy allows users to customize the tokenizer, e.g. as shown in this Stack Overflow thread: https://stackoverflow.com/questions/51012476/spacy-custom-tokenizer-to-include-only-hyphen-words-as-tokens-using-infix-regex
import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    # custom infix pattern: split on punctuation and quote characters
    infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')
    # reuse the default prefix and suffix rules
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
    return Tokenizer(nlp.vocab,
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=None)

nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut from the thread no longer works in spaCy v3+
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp(u'Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it\'s a male-dominated profession.')
print([token.text for token in doc])
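For the hashtag example above, one way to keep '#opensource' as a single token is the tokenizer's token_match hook; a minimal sketch, assuming a recent spaCy version in which token_match takes precedence over prefix/suffix splitting (the hashtag pattern is only an illustration):

import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex

def hashtag_tokenizer(nlp):
    # keep the default prefix/suffix/infix rules ...
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
    infix_re = compile_infix_regex(nlp.Defaults.infixes)
    # ... but treat whole hashtags as single, unsplittable tokens
    hashtag_re = re.compile(r"^#\w+$")
    return Tokenizer(nlp.vocab,
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=hashtag_re.match)

nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = hashtag_tokenizer(nlp)
print([token.text for token in nlp('refinery is #opensource')])
# expected: ['refinery', 'is', '#opensource']

Note that constructing a Tokenizer like this drops the default exception rules (e.g. for contractions); in practice one would probably also pass rules=nlp.Defaults.tokenizer_exceptions.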
We should allow users to customize their tokenizers in a similar way, analogous to our other programmatic interfaces.
Describe alternatives you've considered
NLTK offers a wider set of tokenizers (https://www.nltk.org/api/nltk.tokenize.html), e.g. one specifically for tweets (see the sketch below). However, I strongly believe we should stick to one tokenizer solution for now, which is spaCy.
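For reference, NLTK's tweet-aware tokenizer already keeps hashtags intact; a minimal sketch, assuming nltk is installed:

from nltk.tokenize import TweetTokenizer

# TweetTokenizer keeps hashtags, mentions and emoticons as single tokens
print(TweetTokenizer().tokenize('refinery is #opensource'))
# prints: ['refinery', 'is', '#opensource']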
Additional context -