
Ability to re-train a Tokenizer with relevant parameters


Current state

When we want to train a Tokenizer, we need to give a Trainer initialized with a set of custom parameters:

from tokenizers import Tokenizer
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

# We need to provide the relevant parameters to avoid falling back to the general defaults
trainer = BpeTrainer(vocab_size=30000, special_tokens=[...], initial_alphabet=[...], ...)
tokenizer.train(files=[...], trainer=trainer)

Goal

Add the ability to re-train a Tokenizer using the same custom parameters that were used the first time it was trained. This would let users re-train pre-trained tokenizers provided by the community on their own dataset. We'd be able to do this:

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")
tokenizer.train(files=[ ... ])

and expect to get a Tokenizer very similar to the one we originally loaded (same special_tokens, vocab_size, ...), with a brand new vocabulary.

How

One way to achieve this is to have the Trainer save its training parameters on the Model during training, so that Model::get_trainer can return a Trainer instantiated with those same parameters. These parameters would also need to be included in the serialization process.
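
To make the idea concrete, here is a minimal, self-contained Rust sketch of that mechanism. It does not use the crate's real types: TrainerConfig, stored_trainer_config, and this toy get_trainer are hypothetical stand-ins for whatever parameters a real trainer such as BpeTrainer would persist.

use std::collections::HashMap;

// Hypothetical container for the parameters a trainer was configured with.
#[derive(Clone, Debug)]
struct TrainerConfig {
    vocab_size: usize,
    special_tokens: Vec<String>,
}

#[derive(Default)]
struct Model {
    vocab: HashMap<String, u32>,
    // The trainer writes its parameters here during training; this field
    // would also be part of the model's (de)serialization.
    stored_trainer_config: Option<TrainerConfig>,
}

struct Trainer {
    config: TrainerConfig,
}

impl Trainer {
    fn train(&self, model: &mut Model, words: &[&str]) {
        // Real training logic elided; we only build a toy vocabulary here.
        for (i, w) in words.iter().enumerate() {
            model.vocab.insert((*w).to_string(), i as u32);
        }
        // Persist the parameters so a later re-training can recover them.
        model.stored_trainer_config = Some(self.config.clone());
    }
}

impl Model {
    // Returns a Trainer configured like the one originally used,
    // falling back to defaults if the model was never trained.
    fn get_trainer(&self) -> Trainer {
        Trainer {
            config: self.stored_trainer_config.clone().unwrap_or(TrainerConfig {
                vocab_size: 30_000,
                special_tokens: vec![],
            }),
        }
    }
}

fn main() {
    let mut model = Model::default();
    let trainer = Trainer {
        config: TrainerConfig {
            vocab_size: 30_000,
            special_tokens: vec!["[UNK]".into(), "[CLS]".into()],
        },
    };
    trainer.train(&mut model, &["hello", "world"]);

    // Re-train later with the same parameters, without re-specifying them.
    let new_trainer = model.get_trainer();
    assert_eq!(new_trainer.config.vocab_size, 30_000);
    assert_eq!(new_trainer.config.special_tokens.len(), 2);
}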

Considerations

If some tokens were added with add_tokens or add_special_tokens, re-training is not currently supported, because AddedVocabulary adds tokens on top of an existing vocabulary and expects it to never change (cf. #523). This also depends on #527.

n1t0 avatar Nov 13 '20 17:11 n1t0

Just want to check that I'm facing the same problem:

use tokenizers::{
    models::bpe::{BpeTrainer, BPE},
    pre_tokenizers::whitespace::Whitespace,
    AddedToken, DecoderWrapper, Model, NormalizerWrapper, PostProcessorWrapper,
    PreTokenizerWrapper, TokenizerImpl,
};

fn main() -> Result<(), tokenizers::Error> {
    let mut tokenizer: TokenizerImpl<
        BPE,
        NormalizerWrapper,
        PreTokenizerWrapper,
        PostProcessorWrapper,
        DecoderWrapper,
    > = TokenizerImpl::new(
        BPE::builder()
            .unk_token("[UNK]".to_string())
            .build()
            .unwrap(),
    );

    let mut trainer = BpeTrainer::builder()
        .special_tokens(vec![
            AddedToken::from("[UNK]", true),
            AddedToken::from("[CLS]", true),
            AddedToken::from("[SEP]", true),
            AddedToken::from("[PAD]", true),
            AddedToken::from("[MASK]", true),
        ])
        .build();
    tokenizer.with_pre_tokenizer(Whitespace::default());
    let files = vec![
        "wikitext-103-raw/wiki.train.raw".into(),
        "wikitext-103-raw/wiki.test.raw".into(),
        "wikitext-103-raw/wiki.valid.raw".into(),
    ];
    tokenizer.train_from_files(&mut trainer, files)?;
    tokenizer.save("tokenizer-wiki.json", false)?;

    // Second round: re-train using a trainer derived from the already-trained model
    let mut new_trainer = tokenizer.get_model().get_trainer();
    let files = vec![
        "wikitext-103-raw/wiki.train.raw".into(),
        "wikitext-103-raw/wiki.test.raw".into(),
        "wikitext-103-raw/wiki.valid.raw".into(),
    ];
    tokenizer.train_from_files(&mut new_trainer, files)?;
    tokenizer.save("tokenizer-wiki.json", false)?;

    Ok(())
}

This is basically the code, and it crashes with:

thread 'main' panicked at 'Missing additional token', /.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.13.3/src/tokenizer/added_vocabulary.rs:293:26
stack backtrace:
   0: rust_begin_unwind
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:579:5
   1: core::panicking::panic_fmt
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:64:14
   2: core::panicking::panic_display
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:147:5
   3: core::panicking::panic_str
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:131:5
   4: core::option::expect_failed
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/option.rs:2045:5
   5: core::ops::function::impls::<impl core::ops::function::FnMut<A> for &mut F>::call_mut
   6: <core::iter::adapters::chain::Chain<A,B> as core::iter::traits::iterator::Iterator>::fold
   7: tokenizers::tokenizer::added_vocabulary::AddedVocabulary::add_special_tokens
   8: tokenizers::utils::iter::ResultShunt<I,E>::process
   9: tokenizers::tokenizer::TokenizerImpl<M,N,PT,PP,D>::train_from_files
  10: toktest::main

get_trainer returns a default Trainer rather than the one originally used, so it is missing the special tokens.

What would be the right strategy here?

Virviil avatar Jun 06 '23 13:06 Virviil

This issue is more of a feature request than a bug report. As the error indicates, you are doing something wrong: the special tokens are most likely missing from the tokenizer itself, while they are only added to the trainer builder. Yes, the feature described here would help you! Do you want to have a go at it? 🤗
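
For reference, until such parameters are persisted on the model, a practical workaround is to rebuild the second trainer explicitly with the same special tokens instead of relying on get_trainer. A sketch, assuming the same setup and imports as the program above:

// Workaround sketch: instead of `tokenizer.get_model().get_trainer()`,
// construct the second trainer with the same special tokens as the first one.
let mut new_trainer = BpeTrainer::builder()
    .special_tokens(vec![
        AddedToken::from("[UNK]", true),
        AddedToken::from("[CLS]", true),
        AddedToken::from("[SEP]", true),
        AddedToken::from("[PAD]", true),
        AddedToken::from("[MASK]", true),
    ])
    .build();
let files = vec![
    "wikitext-103-raw/wiki.train.raw".into(),
    "wikitext-103-raw/wiki.test.raw".into(),
    "wikitext-103-raw/wiki.valid.raw".into(),
];
tokenizer.train_from_files(&mut new_trainer, files)?;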

ArthurZucker avatar Sep 22 '23 00:09 ArthurZucker

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar May 04 '24 01:05 github-actions[bot]