Networked Pull Through Cache

Open wrmedford opened this issue 9 months ago • 0 comments

Introduce a HF_DATASET_CACHE_NETWORK_LOCATION configuration (e.g. an environment variable) together with a companion network cache service.

Enable a three-tier cache lookup for datasets:

Distributed training & ephemeral jobs: In high-performance or containerized clusters, relying solely on a local disk cache either becomes a streaming bottleneck or incurs a heavy cold-start penalty as each job must re-download datasets.
Traffic & cost reduction: A pull-through network cache lets multiple consumers share a common cache layer, reducing duplicate downloads from the Hub and lowering egress costs.
Better streaming adoption: By offloading repeat dataset pulls to a locally managed cache proxy, streaming workloads can achieve higher throughput and more predictable latency.
Proven pattern: Similar proxy-cache solutions (e.g. Harbor’s Proxy Cache for Docker images) have demonstrated reliability and performance at scale: https://goharbor.io/docs/2.1.0/administration/configure-proxy-cache/

I’m happy to draft the initial PR for adding HF_DATASET_CACHE_NETWORK_LOCATION support in datasets and sketch out a minimal cache-service prototype.

I have limited bandwidth so I would be looking for collaborators if anyone else is interested.

Apr 30 '25 15:04 wrmedford