Command line for generating misspell texts

Open p16i opened this issue 4 years ago • 0 comments

Detailed description

From #614, we now have a way to produce misspells for Thai and English text.

One use case for the module is simulating out-of-distribution (OOD) datasets caused by misspelling. This command line would be an interface for practitioners who want to create such datasets.

More precisely, given a text file, the command line will read the file line by line and add misspells accordingly. The number of misspells should be configurable, as should the random seed. The user can specify the output path; if they do not, the command line defaults to the input filename with a suffix added.

Context

In my view, being able to simulate OOD situations has implications for a number of functionalities provided by PyThaiNLP, especially segmentation-related tasks.

Possible implementation

thainlp misspell --file ./some/data.txt --seed=1  --mispell-ratio 0.05 

# output file: ./some/data[-misspelled-r.05-seed1].txt

Remarks:

  • [...] is the suffix added by the command line;
  • mispell-ratio could be the number of misspells per 100 characters.
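
To make the proposal a bit more concrete, here is a rough Python sketch of the file handling and argument parsing. It is only an illustration under several assumptions: `add_misspells` is a hypothetical stand-in that replaces characters at random and is not the misspell module from #614; the `--output` flag and the function names are made up for this sketch; and the suffix format only approximates the one shown above. The ratio is treated here as a per-character probability, which is just one possible reading of the remark on mispell-ratio.

```python
import argparse
import random
from pathlib import Path


def default_output_path(input_path: Path, ratio: float, seed: int) -> Path:
    # Assumed naming scheme: data.txt -> data-misspelled-r0.05-seed1.txt
    suffix = f"-misspelled-r{ratio:g}-seed{seed}"
    return input_path.with_name(input_path.stem + suffix + input_path.suffix)


def add_misspells(line: str, ratio: float, rng: random.Random) -> str:
    # Hypothetical placeholder for the real misspell generator from #614:
    # each non-space character is replaced, with probability `ratio`,
    # by a character drawn at random from the same line.
    chars = list(line)
    for i, ch in enumerate(chars):
        if not ch.isspace() and rng.random() < ratio:
            chars[i] = rng.choice(line)
    return "".join(chars)


def misspell_file(input_path, output_path=None, ratio=0.05, seed=0) -> Path:
    input_path = Path(input_path)
    if output_path is None:
        output_path = default_output_path(input_path, ratio, seed)
    output_path = Path(output_path)
    rng = random.Random(seed)  # one seeded generator makes runs reproducible
    with input_path.open(encoding="utf-8") as src, \
            output_path.open("w", encoding="utf-8") as dst:
        for line in src:  # read the file line by line
            # Every output line ends with a newline, even if the input's
            # last line did not.
            dst.write(add_misspells(line.rstrip("\n"), ratio, rng) + "\n")
    return output_path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="thainlp misspell")
    parser.add_argument("--file", required=True, help="input text file")
    parser.add_argument("--seed", type=int, default=0, help="random seed")
    # Flag spelled as in the proposal above.
    parser.add_argument("--mispell-ratio", type=float, default=0.05)
    # Hypothetical flag for the user-specified output path.
    parser.add_argument("--output", default=None)
    return parser


def main(argv=None) -> Path:
    args = build_parser().parse_args(argv)
    return misspell_file(args.file, args.output, args.mispell_ratio, args.seed)
```

With this sketch, `main(["--file", "./some/data.txt", "--seed", "1", "--mispell-ratio", "0.05"])` would write `./some/data-misspelled-r0.05-seed1.txt`. Running it twice with the same seed and ratio should give identical output, which is the property needed for reproducible OOD datasets.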

What's next?

Once we have the command line, we could try to use it with datasets such as BEST2010 or other standard datasets and evaluate the behavior of segmentation algorithms provided by PyThaiNLP.

p16i, Oct 07 '21 06:10