Customize spaCy tokenizers
Is your feature request related to a problem? Please describe.
The spaCy tokenizers sometimes produce wrong tokens, e.g. for HTML data, tweets, or domain-specific terms.
For instance, for 'refinery is #opensource' I might want ['refinery', 'is', '#opensource'], but I get ['refinery', 'is', '#', 'opensource'].
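For reference, this can be reproduced with a default pipeline; a minimal sketch, assuming the en_core_web_sm model is installed:

import spacy

# default English pipeline; '#' is split off as a prefix by the standard tokenizer
nlp = spacy.load('en_core_web_sm')
print([token.text for token in nlp('refinery is #opensource')])
# prints: ['refinery', 'is', '#', 'opensource']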
Describe the solution you'd like
spaCy allows users to customize the tokenizer, e.g. as shown in this Stack Overflow thread: https://stackoverflow.com/questions/51012476/spacy-custom-tokenizer-to-include-only-hyphen-words-as-tokens-using-infix-regex
import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    # custom infix pattern: split on punctuation and quote characters
    infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')
    # reuse the default prefix and suffix rules
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
    return Tokenizer(nlp.vocab,
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=None)

nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut from the thread no longer works in spaCy v3+
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp(u'Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it\'s a male-dominated profession.')
print([token.text for token in doc])
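For the hashtag example above, one way to keep '#opensource' as a single token is the tokenizer's token_match hook; a minimal sketch, assuming a recent spaCy version in which token_match takes precedence over prefix/suffix splitting (the hashtag pattern is only an illustration):

import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex

def hashtag_tokenizer(nlp):
    # keep the default prefix/suffix/infix rules ...
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
    infix_re = compile_infix_regex(nlp.Defaults.infixes)
    # ... but treat whole hashtags as single, unsplittable tokens
    hashtag_re = re.compile(r"^#\w+$")
    return Tokenizer(nlp.vocab,
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=hashtag_re.match)

nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = hashtag_tokenizer(nlp)
print([token.text for token in nlp('refinery is #opensource')])
# expected: ['refinery', 'is', '#opensource']

Note that constructing a Tokenizer like this drops the default exception rules (e.g. for contractions); in practice one would probably also pass rules=nlp.Defaults.tokenizer_exceptions.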
We should allow users to customize their tokenizers in a similar way, analogous to our other programmatic interfaces.
Describe alternatives you've considered
NLTK offers a wider set of tokenizers (https://www.nltk.org/api/nltk.tokenize.html), e.g. one specifically for tweets (see the sketch below). However, I strongly believe we should stick to one tokenizer solution for now, which is spaCy.
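For reference, NLTK's tweet-aware tokenizer already keeps hashtags intact; a minimal sketch, assuming nltk is installed:

from nltk.tokenize import TweetTokenizer

# TweetTokenizer keeps hashtags, mentions and emoticons as single tokens
print(TweetTokenizer().tokenize('refinery is #opensource'))
# prints: ['refinery', 'is', '#opensource']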
Additional context -