clean-text

Is it possible to remove punctuation but exclude cases like "drive-thru"?

Open Jess0-0 opened this issue 4 years ago • 4 comments

I'd like to remove punctuation from the text but keep "-" where it joins words. For example, "text---cleaning" should become "text cleaning", but "drive-thru" should still be "drive-thru" after cleaning.

Jess0-0, Nov 17 '21 18:11

Right now, this is not possible. But it seems to me like a feature this package should provide. I will look into it, but it may take a while.

jfilter, Nov 17 '21 23:11

You are mainly interested in keeping hyphens in compound words, right? So other punctuation such as "." or "," should still get removed.

jfilter, Jan 29 '22 20:01

Yes, that's correct. Other punctuation such as "." or "," should get removed.

Jess0-0, Feb 04 '22 05:02

I had the same kind of scenario. I solved it by swapping each exception for a placeholder before cleaning and restoring it afterwards:

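from cleantext import clean


def clean_with_exceptions(text, *args, **kwargs):
    """Run clean() while protecting the substrings listed in `exceptions`."""
    exceptions = kwargs.pop("exceptions", [])
    # Swap each exception for a lowercase, letters-only placeholder
    # ("expzexp", "expzzexp", ...). This assumes the placeholders do not
    # already occur in the input text.
    for idx, exp in enumerate(exceptions):
        text = text.replace(exp, "exp{}exp".format("z" * (idx + 1)))
    text = clean(text, *args, **kwargs)
    # Put the original substrings back after cleaning.
    for idx, exp in enumerate(exceptions):
        text = text.replace("exp{}exp".format("z" * (idx + 1)), exp)
    return text


text = "Stop at the drive-thru, please."  # example input
cleaned_text = clean_with_exceptions(
    text,
    exceptions=["-"],
    no_line_breaks=True,
    no_urls=True,  # replace all URLs with a special token
    no_emails=True,  # replace all email addresses with a special token
    no_currency_symbols=True,  # replace all currency symbols with a special token
    no_punct=True,
)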

It is a bit hackish, but it worked for my case.

tanwirahmad, Aug 08 '22 10:08
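
Note that with exceptions=["-"], the workaround above protects every hyphen, so "text---cleaning" comes back as "text---cleaning" rather than "text cleaning". A minimal sketch of the behaviour asked for in the original question, using only the standard library, might look like the following. The function name is hypothetical and not part of clean-text, and the sketch assumes that "punctuation" means the ASCII characters in string.punctuation and that a hyphen counts as a compound-word hyphen only when it is a single "-" with a letter or digit directly on both sides.

```python
import re
import string

# Hypothetical helper, not part of the clean-text package.
# Assumption: only ASCII punctuation (string.punctuation) is handled.
_PUNCT_NO_HYPHEN = string.punctuation.replace("-", "")

# Matches:
#   1. a hyphen NOT preceded by a letter/digit,
#   2. a hyphen NOT followed by a letter/digit,
#   3. any other ASCII punctuation character.
# A single hyphen between two letters/digits matches none of these and is kept.
_REMOVABLE = re.compile(
    r"(?<![^\W_])-|-(?![^\W_])|[" + re.escape(_PUNCT_NO_HYPHEN) + "]"
)


def remove_punct_keep_compound_hyphens(text):
    """Replace punctuation with spaces, keeping hyphens inside compound words."""
    without_punct = _REMOVABLE.sub(" ", text)
    # Collapse the whitespace left behind by removed punctuation.
    return re.sub(r"\s+", " ", without_punct).strip()


print(remove_punct_keep_compound_hyphens("text---cleaning"))
print(remove_punct_keep_compound_hyphens("Stop at the drive-thru, please."))
```

One possible way to combine this with clean-text would be to run it first and then call clean() without no_punct=True, so the remaining hyphens are not stripped again.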