[FEATURE] Sharded Quantization to Enable Consumer GPUs to Quantize 80B+ LLMs
Hey guys, can you please consider introducing sharded quantization as a core feature? This would allow users to quantize extremely large models (e.g., 80B+ parameters, such as CohereForAI/c4ai-command-a-03-2025) on consumer-grade hardware by splitting the quantization process into manageable memory chunks. @Qubitium I tried to quantize command-a with GPTQModel on my 48GB of VRAM and it was a no-go (out of memory).
Why This Matters
- Hardware Limitations Today: Current quantization workflows require loading the full unquantized model into memory, which is impossible for 80B+ LLMs on most consumer GPUs (<24GB VRAM) or machines with limited RAM.
- Barrier to Adoption: Many of us lack access to data center-scale hardware. A sharded approach would democratize quantization, enabling users to work with large models locally without needing cloud GPUs.
- Scalability for Future Models: As model sizes keep growing, sharded quantization would keep local quantization practical for the next generation of LLMs.
What This Enables
- Progressive Processing: Quantize models in memory-safe chunks (shards) sized to the user's hardware. For example, a 111B model could be split into 10GB shards and processed sequentially on a GPU with 16GB of VRAM (a rough sketch of this flow follows this list).
- Users specify the maximum shard size (e.g., shard_size="8GB"), and the library handles the rest.
- Lower Memory Overhead: Only a fraction of the model’s weights are loaded into memory at any time, reducing peak RAM/VRAM usage.
- Seamless Integration: After quantization, the sharded model can be saved to disk and loaded piece by piece during inference, leveraging existing sharded-checkpoint tooling (e.g., safetensors shards) rather than requiring the full model in memory.
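To make the request concrete, here is a minimal sketch of the per-shard memory pattern, assuming the model is already stored as sharded safetensors files. The `quantize_tensor()` helper below is a naive round-to-nearest placeholder, not GPTQ (which also needs calibration activations), so this only illustrates how peak VRAM could stay bounded to roughly one shard at a time:

```python
# Sketch only: stream one shard (and one tensor) at a time so peak VRAM stays small.
import gc

import torch
from safetensors.torch import safe_open, save_file


def quantize_tensor(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    # Naive symmetric round-to-nearest -- a placeholder for the real GPTQ pass.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax().clamp(min=1e-8) / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale


def quantize_shard(in_path: str, out_path: str, device: str = "cuda") -> None:
    out = {}
    with safe_open(in_path, framework="pt", device="cpu") as f:
        for name in f.keys():
            w = f.get_tensor(name).to(device)     # only this tensor occupies VRAM
            out[name] = quantize_tensor(w).cpu()  # quantize, then park the result on CPU
            del w
    save_file(out, out_path)                      # write the quantized shard to disk
    del out
    gc.collect()
    torch.cuda.empty_cache()


# Process shards sequentially; file names are hypothetical.
for i in range(1, 12):
    quantize_shard(f"model-{i:05d}-of-00011.safetensors",
                   f"quantized-{i:05d}.safetensors")
```

The real feature would of course run the GPTQ algorithm within each shard; this only shows the load/quantize/save pattern that a `shard_size` option would control.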
Example Scenario
A user with a 16GB GPU wants to quantize an 80B LLM to reduce its VRAM footprint.
- Without sharding: the unquantized model's fp16 weights alone are roughly 160GB (80B params × 2 bytes), making it impossible to even load, let alone quantize.
- With sharding:
# Proposed usage: shard_size is the new option requested in this issue
from gptqmodel import GPTQModel, QuantizeConfig

calibration_dataset = ["..."]  # e.g. a few hundred rows of wikitext-2

config = QuantizeConfig(
    bits=4,
    shard_size="10GB",  # proposed: maximum size of each quantization shard
)

model = GPTQModel.load("80b-llm", config)  # hypothetical 80B checkpoint id
model.quantize(calibration_dataset)
model.save("quantized_80b")
The library processes the model in chunks, quantizes each shard, and saves them individually. The final quantized model fits on their GPU!
Would the team consider prioritizing this enhancement? I’d be happy to refine this proposal or help test implementations!
Thank you for your hard work in advancing model efficiency!
Thoughts?
@ColumbusAI This would be a good feature to have, but it has two issues. Even if the model weights are split tensor-parallel for the forward pass, the GPTQ quantization step itself is not parallelizable at the moment. So splitting across GPUs might improve quantization speed, but it would actually increase memory usage by about 1.5x.
This would be a great PR/feature to have.
Right now GPTQModel already places only the layer being processed on the GPU: quantize it, move it to CPU, then move the next layer onto the GPU and repeat.
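For context, a rough sketch of that per-layer offload loop (illustrative only, not GPTQModel's actual internals; `quantize_fn` is a stand-in for the GPTQ step):

```python
# Illustrative per-layer offload loop -- not GPTQModel's actual internals.
import torch
import torch.nn as nn


@torch.no_grad()
def layerwise_quantize(layers: nn.ModuleList, hidden: torch.Tensor,
                       quantize_fn, device: str = "cuda") -> torch.Tensor:
    hidden = hidden.to(device)
    for layer in layers:
        layer.to(device)        # only the current block is resident in VRAM
        hidden = layer(hidden)  # forward calibration activations through it
        quantize_fn(layer)      # stand-in for the GPTQ step on this block
        layer.to("cpu")         # offload before moving on to the next block
        torch.cuda.empty_cache()
    return hidden
```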
For inference we recommend vLLM or SGLang, as they natively support tensor parallelism for GPTQModel-quantized models.
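For example, a minimal vLLM sketch (paths and GPU count are placeholders; vLLM picks up the GPTQ settings from the checkpoint's quantization config):

```python
# Placeholder path and GPU count; vLLM detects GPTQ from the saved checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(model="./quantized_80b", tensor_parallel_size=2)  # shard weights across 2 GPUs
out = llm.generate(["Hello, world"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```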
Thanks @Qubitium! Appreciate the insight. I'll look into this further and see if I can take a crack at a PR. Appreciate you.