Other similarities
🚀 Feature
Replace a random word with a phonetically similar one.
Optionally, also replace a random word with another word that has the same part of speech or the same lemma (an adjective with another adjective, or "run" with "ran" / "running", etc.).
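To make the lemma variant concrete, here is a minimal sketch. It assumes a tiny hand-written inflection table (`INFLECTIONS`) and a made-up helper name (`swap_inflection`); neither is part of AugLy, and a real augmenter would more likely get lemmas and POS tags from a tagger such as spaCy.

```python
import random

# Hypothetical toy table: each lemma maps to a few of its inflected forms.
# A real augmenter would derive these from a lemmatizer instead.
INFLECTIONS = {
    "run": ["run", "ran", "running", "runs"],
    "be": ["am", "is", "are", "was", "were"],
}

# Reverse index: inflected form -> lemma.
FORM_TO_LEMMA = {
    form: lemma for lemma, forms in INFLECTIONS.items() for form in forms
}


def swap_inflection(text: str, rng: random.Random) -> str:
    """Replace one random known word with another inflection of its lemma."""
    tokens = text.split()
    # Only words present in the toy table can be replaced.
    positions = [i for i, tok in enumerate(tokens) if tok.lower() in FORM_TO_LEMMA]
    if not positions:
        return text  # nothing to augment
    i = rng.choice(positions)
    original = tokens[i].lower()
    alternatives = [
        form for form in INFLECTIONS[FORM_TO_LEMMA[original]] if form != original
    ]
    tokens[i] = rng.choice(alternatives)
    return " ".join(tokens)


if __name__ == "__main__":
    print(swap_inflection("I ran home", random.Random(0)))
```

Splitting on whitespace is a simplification: punctuation attached to a word would stop it from matching the table.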
Motivation
I'm training a transformer-based model to spell-check utterances (like AugLy in reverse).
For example: "Hello r u fin tdy" => "Hello are you fine today".
I realized that spelling errors quite often come from phonetically similar words.
Example (not a great one, but good enough for the sake of explanation): "I love jeans" vs "I love gins".
Augmenting by replacing a word with another of the same POS, or with another inflection of the same lemma, would also help in the same direction (corrupting sentences more realistically to train a better spell-checking model).
Having this kind of built-in augmentation would help build better models.
Pitch
Have a built-in augmenter that creates mistakes not only with Levenshtein-like distances but also with phonetics. I've built mine using Epitran for phonetics and spaCy for POS, but other frameworks exist.
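A rough sketch of the phonetic part, to show the idea rather than my actual implementation. It assumes pronunciations are already available as IPA-like strings; the `PRONUNCIATIONS` table below is hand-written and approximate (in practice a grapheme-to-phoneme tool such as Epitran could produce it). The function names are hypothetical and not part of AugLy. Two words count as "similar sounding" when the edit distance between their pronunciations, not their spellings, is small.

```python
import random

# Hypothetical, hand-written pronunciation table (approximate IPA).
PRONUNCIATIONS = {
    "jeans": "dʒinz",
    "gins": "dʒɪnz",
    "their": "ðɛr",
    "there": "ðɛr",
    "love": "lʌv",
}


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def phonetic_neighbours(word, table, max_distance=1):
    """Words whose pronunciation is within max_distance edits of word's."""
    target = table.get(word)
    if target is None:
        return []
    return sorted(
        other
        for other, pron in table.items()
        if other != word and edit_distance(pron, target) <= max_distance
    )


def replace_similar_sounding_word(text, table, rng, max_distance=1):
    """Replace one random word that has at least one phonetic neighbour."""
    tokens = text.split()
    candidates = [
        i
        for i, tok in enumerate(tokens)
        if phonetic_neighbours(tok.lower(), table, max_distance)
    ]
    if not candidates:
        return text  # no word with a known similar-sounding alternative
    i = rng.choice(candidates)
    tokens[i] = rng.choice(phonetic_neighbours(tokens[i].lower(), table, max_distance))
    return " ".join(tokens)


if __name__ == "__main__":
    print(replace_similar_sounding_word("I love jeans", PRONUNCIATIONS, random.Random(0)))
```

With this toy table, "jeans" and "gins" are one edit apart phonetically but three edits apart in spelling, which is the gap a purely text-based distance misses.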
Alternatives
Implement my own augmenter (done).
Use only text-based distances, which cannot find "jean" vs "gin", "cute" vs "beautiful", or "run" vs "running": these pairs are textually too different, yet they are often found in chats.
Hi @Ierezell! Thank you for all the awesome enhancements you're suggesting! This kind of augmentation is actually something we've talked about building internally, as these are very common misspellings that occur in the wild!
I'll take a look at the epitran library and see how we can support this!
Seconding this, it would be great to have similar-sounding words included. IDK about phonetic hashing tho.