How to set the cache_dir in the Rust implementation?
Hey, thank you for your great work with these tokenizers.
When I use the tokenizers through the Python API via transformers, I can set a specific cache_dir like this:
from transformers import AutoTokenizer
self.tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_name, cache_dir=self.cache_dir)
How can I do that in Rust? How can I print the default cache dir (in Rust)?
I'm not sure there is a direct equivalent of cache_dir, but I use the HF_HOME environment variable. It seems that by default the ApiBuilder from the hf-hub crate is built from environment variables, with no option to pass a cache directory. Cheers!
@wheynelau
Thank you for your answer! Since I am not very familiar with the Rust Tokenizer Library yet, would you mind completing the minimal example with your idea?
use tokenizers::tokenizer::{Result, Tokenizer};

fn main() -> Result<()> {
    // Needs the `http` feature of the tokenizers crate enabled in Cargo.toml.
    let tokenizer = Tokenizer::from_pretrained("bert-base-cased", None)?;
    // `false` means: do not add special tokens such as [CLS] and [SEP].
    let encoding = tokenizer.encode("Hey there!", false)?;
    println!("{:?}", encoding.get_tokens());
    Ok(())
}
Kind regards
@sambaPython24 I don't really have an example, but the code you have looks good.
You can set the variable when you run it, something like:
HF_HOME=/path/to/cache cargo run --release
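To print the directory that ends up being used, here is a minimal std-only sketch. It rests on an assumption about the hf-hub convention that you should check against the version you have installed: the cache is `$HF_HOME/hub` when HF_HOME is set, and `~/.cache/huggingface/hub` otherwise. `resolve_hub_cache` is a hypothetical helper written for this example, not part of tokenizers or hf-hub.

```rust
use std::env;
use std::path::PathBuf;

/// Hypothetical helper: mirrors the assumed hf-hub lookup order.
/// `hf_home` is the value of HF_HOME (if set), `home` is the user's home directory.
fn resolve_hub_cache(hf_home: Option<&str>, home: Option<&str>) -> Option<PathBuf> {
    match (hf_home, home) {
        // HF_HOME wins when it is set: <HF_HOME>/hub
        (Some(dir), _) => Some(PathBuf::from(dir).join("hub")),
        // Otherwise fall back to <home>/.cache/huggingface/hub
        (None, Some(home)) => Some(
            PathBuf::from(home)
                .join(".cache")
                .join("huggingface")
                .join("hub"),
        ),
        // Neither is known, so there is nothing to report.
        (None, None) => None,
    }
}

fn main() {
    let hf_home = env::var("HF_HOME").ok();
    // HOME on Unix-like systems, USERPROFILE on Windows.
    let home = env::var("HOME").or_else(|_| env::var("USERPROFILE")).ok();

    match resolve_hub_cache(hf_home.as_deref(), home.as_deref()) {
        Some(path) => println!("expected cache dir: {}", path.display()),
        None => println!("could not determine a cache dir"),
    }
}
```

Running it once with and once without `HF_HOME=/path/to/cache` in front of `cargo run` should show whether the variable is picked up in your shell. Comparing the printed path with the files that appear on disk after the first `from_pretrained` call is a quick way to confirm the assumption.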