WordTokenizers.jl

High performance tokenizers for natural language processing and other related tasks

Results: 12 WordTokenizers.jl issues, sorted by most recently updated

I was testing out the `SentencePieceModel` tokenizer on longer text (web articles of a few thousand characters) and noticed that tokenization was taking a long time. Taking a look at...
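For reference, a minimal sketch of how one might reproduce the timing, assuming the pretrained-model API shown in the package README (`load(ALBERT_V1)` and `tokenizer(spm, text)`); the article text is a placeholder:

```julia
using WordTokenizers

# Load a pretrained SentencePiece model (README API; assumed here).
spm = load(ALBERT_V1)

# Stand-in for a web article of a few thousand characters.
text = repeat("The quick brown fox jumps over the lazy dog. ", 100)

# The first call pays Julia's compilation cost; the second shows the
# steady-state tokenization time reported in this issue.
@time tokenizer(spm, text)
@time tokenizer(spm, text)
```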

Fix #60. We can also fix the issue by replacing `\n` with a space at the start, when we get the sentences; that is, we can add this line: `sentences = replace(sentences, r"\n" => Base.SubstitutionString(" "))`...
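As a sketch of the effect of that preprocessing, applied here to the raw text before splitting (the input string is illustrative):

```julia
using WordTokenizers

text = "This sentence was broken\nacross two lines by a PDF copy."

# Replace embedded newlines with spaces before splitting, as suggested above.
cleaned = replace(text, r"\n" => Base.SubstitutionString(" "))
split_sentences(cleaned)
```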

Hello everyone, this is a PR adding a GPT2 tokenizer to the pretrained tokenizers in **WordTokenizers.jl**. This might be helpful in the future when developing an end-to-end pipeline on top of GPT2...

Paragraphs often contain newlines carried over from the source document (particularly when the text is copied from a PDF), and these should be ignored when sentences are tokenized. This is the text...
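A small reproduction of the reported behaviour (the sample text is a placeholder for the PDF excerpt):

```julia
using WordTokenizers

# A hard line break in the middle of a sentence, as produced by a PDF copy-paste.
pdf_text = "Tokenization should treat this\nline break as plain whitespace. Next sentence."

# Inspect the result: per this report, the embedded '\n' is not treated
# as ordinary whitespace when the text is split into sentences.
split_sentences(pdf_text)
```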

Hi @Ayushk4 - @oxinabox and @aviks suggested that I ping you. I am interested in investigating and improving the sentence tokenizers part of WordTokenizers.jl. Would that be of...

We should benchmark against https://github.com/huggingface/tokenizers. I don't expect us to win, but it gives us a baseline to target.
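As a starting point, a sketch of the Julia side of such a benchmark using BenchmarkTools; the corpus is a placeholder, and the huggingface/tokenizers side would be timed separately from Python on the identical input:

```julia
using WordTokenizers
using BenchmarkTools

# Placeholder corpus; a fair comparison would feed the same text
# to huggingface/tokenizers from Python.
text = repeat("The quick brown fox jumps over the lazy dog. ", 1_000)

@btime tokenize($text)          # default word tokenizer
@btime split_sentences($text)   # sentence splitter
```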

It's easy to remove the empty string created in the final array. It's very hard to implement a function like `.strip()` here because we are just substituting the strings in...
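A minimal sketch of the easy half, dropping the empty entries after splitting (the input mirrors the example in the issue below):

```julia
using WordTokenizers

parts = split_sentences("This is a sentence. ")
# Drop empty strings, such as the trailing "" shown in the issue below.
filter(!isempty, parts)
```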

```
julia> WordTokenizers.split_sentences(" This is a sentence.Laugh Out Loud. Keep coding. No. Yes! True! ohh!ya! me too. ")
7-element Array{SubString{String},1}:
 " This is a sentence.Laugh Out Loud."
 "Keep coding."
 "No."
 "Yes!"
 ...
```

It might be better if the empty second element in the resulting array weren't there.
```
julia> WordTokenizers.split_sentences("This is a sentence. ")
2-element Array{SubString{String},1}:
 "This is a sentence."
 ""
```