Add the ability to serialize custom Python components
It is currently impossible to serialize custom Python components, so if a Tokenizer embeds some of them, the user can't save it.
I didn't really dig into this, so I don't know exactly what the constraints/requirements would be, but this is something we should explore at some point.
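To make the problem concrete, here is a minimal sketch. The `MyPreTokenizer` class is purely illustrative; any Python object attached through `PreTokenizer.custom` runs into the same limitation:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import PreTokenizer

class MyPreTokenizer:
    """Illustrative custom component implemented in Python."""
    def pre_tokenize(self, pretok):
        # A real implementation would call pretok.split(...);
        # a no-op is enough to show the serialization problem.
        pass

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = PreTokenizer.custom(MyPreTokenizer())

# Saving is expected to fail here: the Rust serializer has no JSON
# representation for an arbitrary Python object.
tokenizer.save("tok.json")
```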
This is a useful feature. We can probably serialize Python objects using pickle or dill. However, the serialization code is in Rust. Is it possible to serialize the custom Python components with pickle?
The end result has to be saved as JSON, so I don't think it's doable. Also, pickle is highly unsafe and not portable (despite being widely used).
Currently the workaround is to override the component before saving, and override it again after loading:
```python
from tokenizers import Tokenizer, pre_tokenizers

# The tokenizer carries a custom Python pre-tokenizer (not serializable)
tokenizer.pre_tokenizer = pre_tokenizers.PreTokenizer.custom(Custom())
# Swap in a serializable built-in component just before saving
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.save("tok.json")

## Load later
tokenizer = Tokenizer.from_file("tok.json")
# Re-attach the custom pre-tokenizer
tokenizer.pre_tokenizer = pre_tokenizers.PreTokenizer.custom(Custom())
```
It is a bit inconvenient but at least it's safe and portable.
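For reference, the saved file is plain JSON describing the built-in components, which is why an arbitrary Python object cannot go in there. A quick way to check what actually got stored (the exact layout may vary across versions):

```python
import json

# After the save above, the pre-tokenizer entry only describes the
# built-in component, e.g. something like {"type": "Whitespace"}
with open("tok.json") as f:
    print(json.load(f)["pre_tokenizer"])
```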
You also can't load it as a PreTrainedTokenizerFast if you have a custom component:
```python
from transformers import PreTrainedTokenizerFast

# Raises an error while the tokenizer contains a custom Python component
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
```
As a workaround I do:

```python
from transformers import PreTrainedTokenizerFast
from tokenizers.pre_tokenizers import PreTokenizer

# The custom component must not be set at this point
# (see the save/load workaround above)
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
# Re-attach the custom pre-tokenizer on the private backend tokenizer
fast_tokenizer._tokenizer.pre_tokenizer = PreTokenizer.custom(CustomPreTokenizer())
```
but overriding things via the private _tokenizer attribute may be unpredictably problematic.
Totally understandable.
What kind of pre-tokenizer are you saving? If some building blocks are missing, we could add them to make the whole thing more composable/portable/shareable.
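As an illustration of what "composable" means here, a lot of custom splitting logic can already be expressed by combining built-in pre-tokenizers, which do serialize; the particular combination below is only an example:

```python
from tokenizers import pre_tokenizers

# Built-in blocks compose with Sequence and serialize cleanly to JSON:
# split on whitespace, then isolate digits as separate pieces.
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Whitespace(),
    pre_tokenizers.Digits(individual_digits=True),
])
tokenizer.save("tok.json")  # works: no custom Python component involved
```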
Is it now possible to save a custom pre-tokenizer?
No. A custom component is Python code; it's not serializable by nature.