
detokenize and correct_spaces mishandle hyphens and en dashes

atlijas opened this issue on Sep 17, 2024 · 0 comments

Using the latest version of Tokenizer (3.4.5):

>>> from tokenizer import split_into_sentences, detokenize, tokenize, correct_spaces
# En dash and detokenize
>>> sent = 'Hamarinn dugir – og meira en það.'
>>> detokenize(tokenize(sent))
# Expected output: 'Hamarinn dugir – og meira en það.'
# Output: 'Hamarinn dugir–og meira en það.'

# En dash and correct_spaces
>>> s = list(split_into_sentences(sent))[0]
>>> correct_spaces(s)
# Expected output: 'Hamarinn dugir – og meira en það.'
# Output: 'Hamarinn dugir–og meira en það.'

# Hyphen and detokenize
>>> sent = 'Hamarinn dugir - og meira en það.'
>>> detokenize(tokenize(sent))
# Expected output: 'Hamarinn dugir - og meira en það.'
# Output: 'Hamarinn dugir-og meira en það.'

# Hyphen and correct_spaces
>>> s = list(split_into_sentences(sent))[0]
>>> correct_spaces(s)
# Expected output: 'Hamarinn dugir - og meira en það.'
# Output: 'Hamarinn dugir- og meira en það.'
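
Until this is fixed upstream, one possible stopgap is to post-process the detokenized string and re-insert spaces around en dashes that were glued to the surrounding words. The helper restore_en_dash_spaces below is hypothetical and not part of the Tokenizer API; it is a minimal sketch that only covers the en dash case, since a spaced hyphen cannot reliably be told apart from an intra-word hyphen in compounds such as 'Vestur-Evrópa':

>>> import re
>>> _EN_DASH = re.compile(r'(?<=\S)–(?=\S)')
>>> def restore_en_dash_spaces(text):
...     # Hypothetical workaround: put a space on each side of an en dash
...     # that sits directly between two non-space characters.
...     return _EN_DASH.sub(' – ', text)
...
>>> restore_en_dash_spaces('Hamarinn dugir–og meira en það.')
'Hamarinn dugir – og meira en það.'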
