PreTrainedTokenizerFast from Tokenizer Does Not Keep the Same Properties?
I can't seem to create a PreTrainedTokenizerFast object from my original tokenizers tokenizer object that keeps the same properties. This is the code for a byte-pair tokenizer I have experimented with. The resulting fast tokenizer has no [PAD] token, and no special tokens at all.
from tokenizers import ByteLevelBPETokenizer, normalizers, pre_tokenizers, processors
from transformers import PreTrainedTokenizerFast

tokenizer = ByteLevelBPETokenizer()
# The wrapper has no `preprocessor` attribute; set the pre-tokenizer
# and normalizer on the underlying Tokenizer instead
tokenizer._tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
tokenizer._tokenizer.normalizer = normalizers.BertNormalizer()
tokenizer.train_from_iterator(
    docs,
    vocab_size=16_000,
    min_frequency=15,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer._tokenizer.post_processor = processors.BertProcessing(
    ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ("[CLS]", tokenizer.token_to_id("[CLS]")),
)
tokenizer.enable_truncation(max_length=256)
tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
The result of printing the fast_tokenizer is:
PreTrainedTokenizerFast(name_or_path='', vocab_size=16000, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', special_tokens={})
Both model_max_len and special_tokens are wrong here. Also, the fast_tokenizer object has no pad_token or pad_token_id (the warning for pad_token, for example: "Using pad_token, but it is not set yet."). Have I done anything wrong, or is this not supposed to happen?
The versions of libraries I'm using:
['tokenizers 0.10.3',
'transformers 4.10.0.dev0']
Please refer to #14561. Also, if you would like to keep the same value (e.g. model_max_len) that you set when training the tokenizer, you can construct the PreTrainedTokenizerFast as follows:
PreTrainedTokenizerFast(tokenizer_object=tokenizer, model_max_length=256)
Some information, like the semantics of special tokens, is not contained in this library (it has no clue HOW the tokens are used).
Have you tried doing something like
tokenizer = PreTrainedTokenizerFast(...)  # The correct tokenizer
tokenizer.save_pretrained("./local_tokenizer")
new_tokenizer = PreTrainedTokenizerFast.from_pretrained("./local_tokenizer")  # This should be better.
Basically, tokenizer.json contains only the "pure" tokenization part, and then transformers needs a few other files, for instance tokenizer_config.json, to save information like model_max_length and the semantic meaning of tokens (like PAD, etc.).
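You can see this split on disk: save_pretrained writes the transformers-side files next to tokenizer.json. A minimal sketch (the two-token WordLevel vocab is made up for illustration, and the exact file list may vary slightly across transformers versions):

```python
import os
import tempfile

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from transformers import PreTrainedTokenizerFast

# Toy vocab, purely for illustration
vocab = {"[UNK]": 0, "[PAD]": 1}
fast = PreTrainedTokenizerFast(
    tokenizer_object=Tokenizer(WordLevel(vocab, unk_token="[UNK]")),
    unk_token="[UNK]",
    pad_token="[PAD]",
)

with tempfile.TemporaryDirectory() as d:
    saved = fast.save_pretrained(d)
    # Typically includes tokenizer.json (the "pure" tokenizer) plus
    # tokenizer_config.json and special_tokens_map.json (the semantics)
    print(sorted(os.path.basename(f) for f in saved))
```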
tokenizers, the library, has no special treatment of any token (BOS/EOS, etc.).
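This is easy to see with tokenizers alone. In the sketch below (a toy WordLevel vocab, made up for illustration), "[PAD]" is just another vocabulary entry; nothing in the Tokenizer marks it as *the* padding token:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Toy vocab, purely for illustration
vocab = {"[UNK]": 0, "[PAD]": 1, "hello": 2, "world": 3}
tok = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

enc = tok.encode("hello world")
print(enc.ids)  # [2, 3]

# The Tokenizer knows "[PAD]" exists in the vocab, but carries no notion
# of padding semantics for it
print(tok.token_to_id("[PAD]"))  # 1
```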
@Narsil You're right, tokenizers has no special treatment of special tokens. As described in https://huggingface.co/course/chapter6/8?fw=pt:
To wrap the tokenizer in a PreTrainedTokenizerFast, we can either pass the tokenizer we built as a tokenizer_object or pass the tokenizer file we saved as tokenizer_file. The key thing to remember is that we have to manually set all the special tokens, since that class can’t infer from the tokenizer object which token is the mask token, the [CLS] token
So, I dealt with transformers & tokenizers in the following way:
from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

_tokenizer = Tokenizer.from_file('path/to/tokenizer.json')

tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=_tokenizer,
    # Set the special tokens manually
    # https://huggingface.co/course/chapter6/8?fw=pt
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
    bos_token="[BOS]",
    eos_token="[EOS]",
    model_max_length=BLOCK_SIZE,  # same as the block size of the model
)
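As a sanity check, wrapping even a tiny hand-built tokenizer this way surfaces the special tokens on the fast tokenizer. A minimal sketch (the WordLevel vocab here is made up for illustration):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from transformers import PreTrainedTokenizerFast

# Toy vocab, purely for illustration
vocab = {"[UNK]": 0, "[PAD]": 1, "hi": 2}
fast = PreTrainedTokenizerFast(
    tokenizer_object=Tokenizer(WordLevel(vocab, unk_token="[UNK]")),
    unk_token="[UNK]",
    pad_token="[PAD]",
    model_max_length=256,
)

# The special-token semantics now live on the transformers wrapper,
# and ids are resolved against the underlying vocab
print(fast.pad_token, fast.pad_token_id)  # [PAD] 1
print(fast.model_max_length)  # 256
```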