
Other similarities

Open ierezell opened this issue 4 years ago • 2 comments

🚀 Feature

Replace random word with a phonetically similar one.

Alternatively, replace a random word with another word that has the same part of speech or the same lemma (an adjective with another adjective, or run with ran / running, etc.).

Motivation

I'm training a transformer-based model to spell-check utterances (like a reversed AugLy), e.g. "Hello r u fin tdy" => "Hello are you fine today".

I realized that spelling errors quite often come from phonetically similar words. For example (not a great example, but for the sake of the explanation): "I love jeans" vs "I love gins".

Also, augmenting by replacing words with others of the same POS, or with other inflections of the same lemma, would help in the same direction (corrupting the sentences more thoroughly to train a better spell-checking model).

Having this kind of built-in augmentation would help build better models.

Pitch

Having a built-in augmenter that creates mistakes not only with Levenshtein-like distances but also with phonetics. I've built mine using epitran for phonetics and spaCy for POS, but other frameworks exist.
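To make the idea concrete, here is a minimal sketch of a phonetic replacement step. It is not my epitran/spaCy implementation and not an AugLy API: `phonetic_key` is a hypothetical, deliberately crude stand-in for a real phonetic transcription, and `replace_with_soundalike` and its `vocabulary` argument are names made up for illustration.

```python
import random

# Assumption: a toy consonant-class table standing in for a real phonetic
# transcription (e.g. from epitran). It is crude and English-only.
_CONSONANT_CLASSES = {
    "b": "P", "p": "P", "f": "F", "v": "F",
    "c": "K", "k": "K", "q": "K",
    "g": "J", "j": "J",
    "s": "S", "z": "S", "x": "S",
    "d": "T", "t": "T",
    "l": "L", "m": "M", "n": "N", "r": "R",
}


def phonetic_key(word):
    """Map a word to a rough sound key (hypothetical helper).

    Vowels and unmapped characters are dropped, then consecutive
    repeats of the same consonant class are collapsed.
    """
    key = []
    for ch in word.lower():
        code = _CONSONANT_CLASSES.get(ch)
        if code and (not key or key[-1] != code):
            key.append(code)
    return "".join(key)


def replace_with_soundalike(text, vocabulary, rng=None):
    """Replace one random word with a vocabulary word sharing its key.

    Punctuation handling is omitted: words are split on whitespace only.
    Returns the text unchanged if no word has a sound-alike.
    """
    rng = rng or random.Random()
    words = text.split()

    # Index the candidate replacements by their phonetic key.
    by_key = {}
    for candidate in vocabulary:
        by_key.setdefault(phonetic_key(candidate), []).append(candidate)

    # Collect the positions that have at least one different sound-alike.
    options = []
    for i, word in enumerate(words):
        key = phonetic_key(word)
        if not key:  # skip words with no mapped consonants, e.g. "I"
            continue
        alternatives = [c for c in by_key.get(key, []) if c != word.lower()]
        if alternatives:
            options.append((i, alternatives))

    if not options:
        return text
    i, alternatives = rng.choice(options)
    words[i] = rng.choice(alternatives)
    return " ".join(words)


print(replace_with_soundalike("I love jeans", ["gins", "jeans", "today"]))
```

With this toy table, "jeans" and "gins" share the same key, so the sentence above can only become "I love gins". A real augmenter would swap `phonetic_key` for a proper phonetic transcription and compare transcriptions with a distance rather than exact equality.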

Alternatives

Implement my own augmenter (done). Using only text-based distances cannot find jean vs gin, cute vs beautiful, or run vs running, as they are textually too different, even though such substitutions are often found in chats.
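As a rough illustration of why text-based distances miss these pairs, here is a plain Levenshtein distance in pure Python (a standard dynamic-programming sketch, not taken from AugLy). The word pairs are the ones mentioned in this issue.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca -> cb (or match)
            ))
        prev = curr
    return prev[-1]


for pair in [("fin", "fine"), ("jean", "gin"), ("run", "running")]:
    print(pair, levenshtein(*pair))
```

A typo like fin vs fine is 1 edit away, while jean vs gin is 3 edits and run vs running is 4, so an augmenter limited to small edit distances would not generate them.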

ierezell avatar Jul 05 '21 21:07 ierezell

Hi @Ierezell! Thank you for all the awesome enhancements you're suggesting! This kind of augmentation is actually something we've talked about building internally, as these are very common misspellings that occur in the wild!

I'll take a look at the epitran library and see how we can support this!

jbitton avatar Jul 09 '21 17:07 jbitton

Seconding this, similar-sounding words should be included. IDK about phonetic hashing tho.

BradKML avatar Sep 01 '22 11:09 BradKML