How to set the cache_dir in the Rust implementation?

Open sambaPython24 opened this issue 7 months ago • 3 comments

Hey, thank you for your great work with these tokenizers.

When I use the tokenizers through the Python API via transformers, I can set a specific cache_dir like this:

from transformers import AutoTokenizer
self.tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_name, cache_dir=self.cache_dir)

How can I do that in Rust? How can I print the default cache dir (in Rust)?

sambaPython24 · Sep 24 '25 18:09

I'm not sure there is an equivalent of cache_dir, but I use the HF_HOME environment variable. It seems that by default the `ApiBuilder` from the hf-hub crate is built from environment variables, with no option to pass a cache directory. Cheers!

wheynelau · Sep 25 '25 06:09

@wheynelau

Thank you for your answer! Since I am not very familiar with the Rust Tokenizer Library yet, would you mind completing the minimal example with your idea?

use tokenizers::tokenizer::{Result, Tokenizer};

fn main() -> Result<()> {
    // Needs the `http` feature of the tokenizers crate enabled.
    // Passing `None` uses the default download parameters.
    let tokenizer = Tokenizer::from_pretrained("bert-base-cased", None)?;

    let encoding = tokenizer.encode("Hey there!", false)?;
    println!("{:?}", encoding.get_tokens());
    Ok(())
}

Kind regards

sambaPython24 · Sep 25 '25 09:09

@sambaPython24 I don't really have an example; the code you have looks good.

You can do something like:

HF_HOME=/path/to/cache cargo run --release
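
To print the directory this is expected to resolve to, here is a minimal std-only sketch. It assumes the hf-hub convention, as far as I understand it, of using `$HF_HOME/hub` when HF_HOME is set and `~/.cache/huggingface/hub` otherwise. `resolve_hub_cache` is a hypothetical helper written for illustration, not part of tokenizers or hf-hub, so treat the output as a guess at the location and not as something the library reports.

```rust
use std::env;
use std::path::PathBuf;

/// Hypothetical helper (not part of tokenizers or hf-hub).
/// Mirrors the assumed lookup order: `$HF_HOME/hub` first,
/// then `<home>/.cache/huggingface/hub`.
fn resolve_hub_cache(hf_home: Option<&str>, home: Option<&str>) -> Option<PathBuf> {
    match hf_home {
        // HF_HOME is set: the hub cache is assumed to live in its `hub` subfolder.
        Some(dir) => Some(PathBuf::from(dir).join("hub")),
        // Otherwise fall back to the assumed default under the home directory.
        None => home.map(|h| {
            PathBuf::from(h)
                .join(".cache")
                .join("huggingface")
                .join("hub")
        }),
    }
}

fn main() {
    let hf_home = env::var("HF_HOME").ok();
    // HOME on Unix-like systems, USERPROFILE on Windows.
    let home = env::var("HOME").or_else(|_| env::var("USERPROFILE")).ok();

    match resolve_hub_cache(hf_home.as_deref(), home.as_deref()) {
        Some(path) => println!("expected cache dir: {}", path.display()),
        None => println!("could not determine a home directory"),
    }
}
```

Running this with `HF_HOME=/path/to/cache cargo run` prints the `hub` subfolder of that path; without HF_HOME it prints the assumed default under your home directory.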

wheynelau · Oct 06 '25 04:10