WordTokenizers.jl

High performance tokenizers for natural language processing and other related tasks

Results: 12 WordTokenizers.jl issues, sorted by most recently updated

I was testing out the `SentencePieceModel` tokenizer on longer text (web articles of a few thousand characters) and noticed that tokenization was taking a long time. Taking a look at...
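For reference, a minimal sketch of how one might reproduce the timing, assuming the pretrained-model API shown in the package README (`load(ALBERT_V1)` and `tokenizer(spm, text)`); the article text is a placeholder:

```julia
using WordTokenizers

# Load a pretrained SentencePiece model (README API; assumed here).
spm = load(ALBERT_V1)

# Stand-in for a web article of a few thousand characters.
text = repeat("The quick brown fox jumps over the lazy dog. ", 100)

# The first call pays Julia's compilation cost; the second shows the
# steady-state tokenization time reported in this issue.
@time tokenizer(spm, text)
@time tokenizer(spm, text)
```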

Fix #60. We can also fix the issue by replacing `\n` with a space at the start, when we get the sentences; that is, we can add this line: `sentences = replace(sentences, r"\n" => Base.SubstitutionString(" "))`...
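As a sketch of the effect of that preprocessing, applied here to the raw text before splitting (the input string is illustrative):

```julia
using WordTokenizers

text = "This sentence was broken\nacross two lines by a PDF copy."

# Replace embedded newlines with spaces before splitting, as suggested above.
cleaned = replace(text, r"\n" => Base.SubstitutionString(" "))
split_sentences(cleaned)
```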

Hello everyone, this is a PR adding a GPT2 tokenizer to the pretrained tokenizers in **WordTokenizers.jl**. This might be helpful in the future when developing an end-to-end pipeline on top of GPT2...

Paragraphs often contain newlines carried over from the source document (particularly when the text is copied from a PDF), and these should be ignored when sentences are tokenized. This is the text...
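A small reproduction of the reported behaviour (the sample text is a placeholder for the PDF excerpt):

```julia
using WordTokenizers

# A hard line break in the middle of a sentence, as produced by a PDF copy-paste.
pdf_text = "Tokenization should treat this\nline break as plain whitespace. Next sentence."

# Inspect the result: per this report, the embedded '\n' is not treated
# as ordinary whitespace when the text is split into sentences.
split_sentences(pdf_text)
```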

Hi @Ayushk4 - @oxinabox and @aviks suggested that I ping you. I am interested in investigating and improving the sentence tokenizers part of WordTokenizers.jl. Would that be of...

We should benchmark against https://github.com/huggingface/tokenizers. I don't expect us to win, but it gives us a baseline to target.
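As a starting point, a sketch of the Julia side of such a benchmark using BenchmarkTools; the corpus is a placeholder, and the huggingface/tokenizers side would be timed separately from Python on the identical input:

```julia
using WordTokenizers
using BenchmarkTools

# Placeholder corpus; a fair comparison would feed the same text
# to huggingface/tokenizers from Python.
text = repeat("The quick brown fox jumps over the lazy dog. ", 1_000)

@btime tokenize($text)          # default word tokenizer
@btime split_sentences($text)   # sentence splitter
```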

It's easy to remove the empty string created in the final array. It's very hard to implement a function like `.strip()` here because we are just substituting the strings in...
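A minimal sketch of the easy half, dropping the empty entries after splitting (the input mirrors the example in the issue below):

```julia
using WordTokenizers

parts = split_sentences("This is a sentence. ")
# Drop empty strings, such as the trailing "" shown in the issue below.
filter(!isempty, parts)
```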

```
julia> WordTokenizers.split_sentences(" This is a sentence.Laugh Out Loud. Keep coding. No. Yes! True! ohh!ya! me too. ")
7-element Array{SubString{String},1}:
 " This is a sentence.Laugh Out Loud."
 "Keep coding."
 "No."
 "Yes!"
 ...
```

It might be better if the empty second element in the resulting array weren't there.
```
julia> WordTokenizers.split_sentences("This is a sentence. ")
2-element Array{SubString{String},1}:
 "This is a sentence."
 ""
```