tokenizer icon indicating copy to clipboard operation
tokenizer copied to clipboard

NLP tokenizers written in Go language

Results 23 tokenizer issues
Sort by recently updated
recently updated
newest added

This change adds the missing json file to the repo, as well as embeds both the vocab and merge files and uses the embedded FS to create the pre-trained tokenizer.

The old versions have the "file not found" bug when trying to use pretrained tokenizers.

Hi, First of all, thanks for this great package! I am running inference against a triton server serving transformer models from go, and this library is a tremendous help. One...

Convert int to int64 at API to make it easy for API consumers. For example: `tokenizer.Decode(ids []int64)`

enhancement

Serialization of the Tokenizer and all the parts (PreTokenizer, Normalizer, ...) using `encoding/gob`. It is now easy to save/load an entire tokenizer. See: https://stackoverflow.com/questions/28020070/golang-serialize-and-deserialize-back

enhancement

Hi, thanks for this lib! I found that a `log.Print` is used at init: https://github.com/sugarme/tokenizer/blob/master/init.go#L21 which I can't avoid. My application is using stdout & stderr to communicate. Do you...

Hey, I you mentioned this implementation was heavily inspired by the Huggingface one. I was wondering if it's possible to load a tokenizer trained with huggingface with this implementation ?...

I encountered a panic while encoding some documents. Unfortunately I can't provide the documents, as they are private. After a quick look, it seems that `pairEncoding` in `util.go:108` is nil,...

/home/gopath/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:875:22: cannot use 2 * gb (untyped int constant 2147483648) as int value in argument to scanner.Buffer (overflows)

Is something planned?