datasets
datasets copied to clipboard
Networked Pull Through Cache
Feature request
Introduce a HF_DATASET_CACHE_NETWORK_LOCATION configuration (e.g. an environment variable) together with a companion network cache service.
Enable a three-tier cache lookup for datasets:
- Local on-disk cache
- Configurable network cache proxy
- Official Hugging Face Hub
Motivation
- Distributed training & ephemeral jobs: In high-performance or containerized clusters, relying solely on a local disk cache either becomes a streaming bottleneck or incurs a heavy cold-start penalty as each job must re-download datasets.
- Traffic & cost reduction: A pull-through network cache lets multiple consumers share a common cache layer, reducing duplicate downloads from the Hub and lowering egress costs.
- Better streaming adoption: By offloading repeat dataset pulls to a locally managed cache proxy, streaming workloads can achieve higher throughput and more predictable latency.
- Proven pattern: Similar proxy-cache solutions (e.g. Harbor’s Proxy Cache for Docker images) have demonstrated reliability and performance at scale: https://goharbor.io/docs/2.1.0/administration/configure-proxy-cache/
Your contribution
I’m happy to draft the initial PR for adding HF_DATASET_CACHE_NETWORK_LOCATION support in datasets and sketch out a minimal cache-service prototype.
I have limited bandwidth so I would be looking for collaborators if anyone else is interested.