Networked Pull Through Cache

Open · wrmedford opened this issue 9 months ago · 0 comments

Feature request

Introduce a HF_DATASET_CACHE_NETWORK_LOCATION configuration (e.g. an environment variable) together with a companion network cache service.

Enable a three-tier cache lookup for datasets:

  1. Local on-disk cache
  2. Configurable network cache proxy
  3. Official Hugging Face Hub
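
To make the lookup order concrete, here is a rough sketch of how the three tiers could fall through. Everything in it is an assumption for illustration: `resolve_file`, `network_cache_location`, and the fetcher callables are hypothetical names, not part of `datasets`, and `HF_DATASET_CACHE_NETWORK_LOCATION` is only the variable name proposed in this issue. The fetchers are passed in as plain callables so the sketch makes no claim about how the network cache or the Hub would actually be queried.

```python
import os
from pathlib import Path
from typing import Callable, Optional, Tuple

# Variable name proposed in this issue; it does not exist in `datasets` today.
ENV_VAR = "HF_DATASET_CACHE_NETWORK_LOCATION"

# A fetcher takes a relative file key and returns its bytes, or None on a miss.
Fetcher = Callable[[str], Optional[bytes]]


def network_cache_location() -> Optional[str]:
    """Return the configured network cache location, or None if unset or empty."""
    return os.environ.get(ENV_VAR) or None


def resolve_file(
    key: str,
    local_dir: str,
    fetch_network: Optional[Fetcher],
    fetch_hub: Fetcher,
) -> Tuple[Path, str]:
    """Resolve `key` via local disk, then the network cache, then the Hub.

    Returns the local path and the name of the tier that served the file.
    """
    local_path = Path(local_dir) / key

    # Tier 1: local on-disk cache.
    if local_path.is_file():
        return local_path, "local"

    # Tiers 2 and 3: network cache (skipped if not configured), then the Hub.
    for tier, fetch in (("network", fetch_network), ("hub", fetch_hub)):
        if fetch is None:
            continue
        data = fetch(key)
        if data is not None:
            # Populate the local cache so the next lookup is a tier-1 hit.
            local_path.parent.mkdir(parents=True, exist_ok=True)
            local_path.write_bytes(data)
            return local_path, tier

    raise FileNotFoundError(f"{key} not found in any cache tier")
```

A caller would pass `fetch_network=None` whenever `network_cache_location()` returns `None`, which keeps today's two-tier behavior. In a true pull-through setup the cache service itself would fetch from the Hub on a miss, so the client-side Hub tier would mostly act as a fallback when the proxy is unreachable.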

Motivation

  • Distributed training & ephemeral jobs: In high-performance or containerized clusters, relying solely on a local disk cache either becomes a streaming bottleneck or incurs a heavy cold-start penalty as each job must re-download datasets.
  • Traffic & cost reduction: A pull-through network cache lets multiple consumers share a common cache layer, reducing duplicate downloads from the Hub and lowering egress costs.
  • Better streaming adoption: By offloading repeat dataset pulls to a locally managed cache proxy, streaming workloads can achieve higher throughput and more predictable latency.
  • Proven pattern: Similar proxy-cache solutions (e.g. Harbor’s Proxy Cache for Docker images) have demonstrated reliability and performance at scale: https://goharbor.io/docs/2.1.0/administration/configure-proxy-cache/

Your contribution

I’m happy to draft the initial PR for adding HF_DATASET_CACHE_NETWORK_LOCATION support in datasets and sketch out a minimal cache-service prototype.

I have limited bandwidth, so I would welcome collaborators if anyone else is interested.

wrmedford · Apr 30 '25 15:04