tokenizer
NLP tokenizers written in Go
This change adds the missing JSON file to the repo, embeds both the vocab and merges files, and uses the embedded FS to create the pre-trained tokenizer.
Older versions hit a "file not found" error when trying to use pretrained tokenizers.
Hi, first of all, thanks for this great package! I am running inference against a Triton server serving transformer models from Go, and this library is a tremendous help. One...
Convert `int` to `int64` at the API boundary to make life easier for API consumers. For example: `tokenizer.Decode(ids []int64)`
Serialization of the Tokenizer and all the parts (PreTokenizer, Normalizer, ...) using `encoding/gob`. It is now easy to save/load an entire tokenizer. See: https://stackoverflow.com/questions/28020070/golang-serialize-and-deserialize-back
Hi, thanks for this lib! I found that a `log.Print` is used at init: https://github.com/sugarme/tokenizer/blob/master/init.go#L21 which I can't avoid. My application uses stdout & stderr to communicate. Do you...
Hey, you mentioned this implementation was heavily inspired by the Hugging Face one. I was wondering whether it's possible to load a tokenizer trained with Hugging Face using this implementation?...
I encountered a panic while encoding some documents. Unfortunately I can't provide the documents, as they are private. After a quick look, it seems that `pairEncoding` in `util.go:108` is nil,...
/home/gopath/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:875:22: cannot use 2 * gb (untyped int constant 2147483648) as int value in argument to scanner.Buffer (overflows)
Is something planned?