BOS/EOS tokens

Open trpstra opened this issue 5 years ago • 1 comments

Hi,

First of all, thanks for this great package! I am running inference from Go against a Triton server that serves transformer models, and this library is a tremendous help.

One thing I couldn't figure out from the examples or the code is how to make the BPE tokenizer output encoded BOS and EOS tokens (i.e. < s > and < / s >). I checked that those tokens are part of my vocab.json, but they seem to be ignored. I tried adding them to the tokenizer manually as special tokens, and I tried wrapping my input sentence in "< s > ... < / s >" by hand, but I can't get either approach to work. What am I missing?

Cheers!

edit: changed formatting for < s > so markdown doesn't eat them.

trpstra, Dec 17 '20 15:12

Bump

cheshir, Apr 22 '21 11:04

Closing for now as this issue is too old. If anyone still runs into it, please provide a simple example. Thanks.

sugarme, Jun 24 '23 05:06