Ability to re-train a Tokenizer with relevant parameters
Current state
When we want to train a Tokenizer, we need to provide a Trainer initialized with a set of custom parameters:
from tokenizers import Tokenizer
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")
# We need to provide the relevant parameters, to avoid using the general defaults
trainer = BpeTrainer(vocab_size=30000, special_tokens=[...], initial_alphabet=[...], ...)
tokenizer.train(files=[...], trainer=trainer)
Goal
Add the ability to re-train a Tokenizer using the same custom parameters that were used for training the first time.
This would allow users to re-train pre-trained tokenizers provided by the community on their own dataset. We'd be able to do this:
tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")
tokenizer.train(files=[ ... ])
and expect to get a Tokenizer very similar to the one we originally loaded (same special_tokens, vocab_size, ...), with a brand new vocabulary.
How
One way to achieve this is to have the Trainer save its training parameters on the Model during training, so that Model::get_trainer can return a Trainer instantiated as expected. These parameters would also need to be included in the serialization process, so that they survive saving and loading.
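A rough sketch of the idea (the TrainerConfig type and trainer_config field are hypothetical names, not part of the current crate):

use std::collections::HashSet;
use tokenizers::models::bpe::BpeTrainer;
use tokenizers::AddedToken;

// Hypothetical: parameters captured at training time and serialized with the model.
#[derive(Clone)]
struct TrainerConfig {
    vocab_size: usize,
    min_frequency: u64,
    special_tokens: Vec<AddedToken>,
    initial_alphabet: HashSet<char>,
}

// Hypothetical model holding the config of its last training run.
struct MyBpeModel {
    // ... existing fields (vocab, merges, ...)
    trainer_config: Option<TrainerConfig>,
}

impl MyBpeModel {
    fn get_trainer(&self) -> BpeTrainer {
        match &self.trainer_config {
            // Re-create the trainer with the parameters used last time.
            Some(cfg) => BpeTrainer::builder()
                .vocab_size(cfg.vocab_size)
                .min_frequency(cfg.min_frequency)
                .special_tokens(cfg.special_tokens.clone())
                .initial_alphabet(cfg.initial_alphabet.clone())
                .build(),
            // Current behaviour: fall back to a default trainer.
            None => BpeTrainer::builder().build(),
        }
    }
}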
Considerations
If some tokens were added with add_tokens or add_special_tokens, re-training is not currently supported, because AddedVocabulary adds tokens on top of an existing vocabulary and expects that vocabulary to never change (cf. #523); see the sketch after these notes.
This also depends on #527.
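To illustrate the AddedVocabulary constraint, a small sketch (file name and token are placeholders):

use tokenizers::{AddedToken, Tokenizer};

fn main() -> Result<(), tokenizers::Error> {
    let mut tokenizer = Tokenizer::from_file("byte-level-bpe.tokenizer.json")?;
    // Added tokens receive ids on top of the current model vocabulary,
    // e.g. id == vocab_size if "<my_token>" is not already in the vocab.
    let vocab_size = tokenizer.get_vocab_size(false);
    tokenizer.add_tokens(&[AddedToken::from("<my_token>", false)]);
    println!("model vocab size: {vocab_size}");
    // Re-training replaces the model vocabulary underneath these ids,
    // which is why re-training with added tokens is not supported yet.
    Ok(())
}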
Just want to check that I'm facing the same problem:
use tokenizers::{
    models::bpe::{BpeTrainer, BPE},
    pre_tokenizers::whitespace::Whitespace,
    AddedToken, DecoderWrapper, Model, NormalizerWrapper, PostProcessorWrapper,
    PreTokenizerWrapper, TokenizerImpl,
};

fn main() -> Result<(), tokenizers::Error> {
    let mut tokenizer: TokenizerImpl<
        BPE,
        NormalizerWrapper,
        PreTokenizerWrapper,
        PostProcessorWrapper,
        DecoderWrapper,
    > = TokenizerImpl::new(
        BPE::builder()
            .unk_token("[UNK]".to_string())
            .build()
            .unwrap(),
    );

    // First training pass, with the special tokens configured on the trainer.
    let mut trainer = BpeTrainer::builder()
        .special_tokens(vec![
            AddedToken::from("[UNK]", true),
            AddedToken::from("[CLS]", true),
            AddedToken::from("[SEP]", true),
            AddedToken::from("[PAD]", true),
            AddedToken::from("[MASK]", true),
        ])
        .build();
    tokenizer.with_pre_tokenizer(Whitespace::default());
    let files = vec![
        "wikitext-103-raw/wiki.train.raw".into(),
        "wikitext-103-raw/wiki.test.raw".into(),
        "wikitext-103-raw/wiki.valid.raw".into(),
    ];
    tokenizer.train_from_files(&mut trainer, files)?;
    tokenizer.save("tokenizer-wiki.json", false)?;

    // Next wave: re-train using the trainer recovered from the model.
    let mut new_trainer = tokenizer.get_model().get_trainer();
    let files = vec![
        "wikitext-103-raw/wiki.train.raw".into(),
        "wikitext-103-raw/wiki.test.raw".into(),
        "wikitext-103-raw/wiki.valid.raw".into(),
    ];
    tokenizer.train_from_files(&mut new_trainer, files)?;
    tokenizer.save("tokenizer-wiki.json", false)?;
    Ok(())
}
This is basically the code, and it crashes with:
thread 'main' panicked at 'Missing additional token', /.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.13.3/src/tokenizer/added_vocabulary.rs:293:26
stack backtrace:
0: rust_begin_unwind
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:579:5
1: core::panicking::panic_fmt
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:64:14
2: core::panicking::panic_display
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:147:5
3: core::panicking::panic_str
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:131:5
4: core::option::expect_failed
at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/option.rs:2045:5
5: core::ops::function::impls::<impl core::ops::function::FnMut<A> for &mut F>::call_mut
6: <core::iter::adapters::chain::Chain<A,B> as core::iter::traits::iterator::Iterator>::fold
7: tokenizers::tokenizer::added_vocabulary::AddedVocabulary::add_special_tokens
8: tokenizers::utils::iter::ResultShunt<I,E>::process
9: tokenizers::tokenizer::TokenizerImpl<M,N,PT,PP,D>::train_from_files
10: toktest::main
get_trainer doesn't return the trainer that was actually used, but a default one, which is missing some tokens.
What would be the right strategy here?
This issue is more of a feature request than a problem.
You are doing something wrong, as the error indicates: pretty sure the special tokens are missing from the tokenizer while they are only added to the trainer builder. Yes, the feature presented here would help you!
Do you want to have a go at it? 🤗
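Until that feature lands, a rough, untested sketch of what could be tried for the second training round, reusing the variables from the snippet above: re-specify the same special tokens by hand instead of relying on get_trainer.

// Instead of tokenizer.get_model().get_trainer(), which currently returns a
// default trainer, rebuild the trainer with the same special tokens by hand.
let mut new_trainer = BpeTrainer::builder()
    .special_tokens(vec![
        AddedToken::from("[UNK]", true),
        AddedToken::from("[CLS]", true),
        AddedToken::from("[SEP]", true),
        AddedToken::from("[PAD]", true),
        AddedToken::from("[MASK]", true),
    ])
    .build();
// Note: re-training with tokens already registered in AddedVocabulary is
// still subject to the limitation tracked in #523.
tokenizer.train_from_files(&mut new_trainer, files)?;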