[Benchmark] 0.31ms Inference for BitNet on Tesla T4 (Sparse Ternary Kernel)

Open HyperFoldUK opened this issue 1 month ago • 0 comments

I was told that Fully Homomorphic Encryption (FHE) requires massive H100 clusters because 'bandwidth is the bottleneck.'

I disagreed. I argued that the bottleneck is not bandwidth but arithmetic density.

To prove it, I built a Sparse Ternary Kernel (HyperFold Phase 3) and ran it on a standard Tesla T4 (an old, free-tier GPU).
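The kernel itself is not shown in this post. As a rough illustration of the general idea behind a sparse ternary kernel (and not the HyperFold implementation), here is a minimal CPU-side sketch in Python. It assumes weights restricted to {-1, 0, +1}, stores only the positions of the non-zero entries, and replaces every multiplication with an add or a subtract. The function names `ternary_to_sparse` and `sparse_ternary_matvec` are hypothetical and chosen for this example only.

```python
def ternary_to_sparse(row):
    """Split one ternary weight row into index lists for +1 and -1 entries.

    Zero weights are dropped, which is where the sparsity saving comes from.
    """
    if any(w not in (-1, 0, 1) for w in row):
        raise ValueError("weights must be ternary: -1, 0 or +1")
    plus = [i for i, w in enumerate(row) if w == 1]
    minus = [i for i, w in enumerate(row) if w == -1]
    return plus, minus


def sparse_ternary_matvec(sparse_rows, x):
    """Compute y = W @ x using only additions and subtractions."""
    y = []
    for plus, minus in sparse_rows:
        acc = 0
        for i in plus:
            acc += x[i]   # weight +1: add the input
        for i in minus:
            acc -= x[i]   # weight -1: subtract the input
        y.append(acc)
    return y


# Illustrative data only, not taken from any benchmark.
W = [
    [1, 0, -1, 0],
    [0, 0, 1, 1],
    [-1, -1, 0, 0],
]
x = [3, 5, 7, 11]

sparse_rows = [ternary_to_sparse(row) for row in W]
y = sparse_ternary_matvec(sparse_rows, x)

# Reference result from an ordinary dense multiply-accumulate.
dense = [sum(w * v for w, v in zip(row, x)) for row in W]
print(y)  # [-4, 18, -8]
```

A real GPU kernel would differ in layout, packing and parallelization; the sketch only shows why ternary weights raise arithmetic density: each non-zero weight costs one add or subtract, and each zero costs nothing.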

The Results:

Standard FHE on T4: 20 ms or more per bootstrap.

HyperFold FHE on T4: 0.31 ms per bootstrap.
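As a quick sanity check on the ratio between those two T4 figures, the snippet below works out the implied speedup and throughput. It takes 20 ms as the lower bound for the baseline and assumes bootstraps run back to back with no other overhead, which is a simplification.

```python
baseline_ms = 20.0    # "Standard FHE on T4", lower bound quoted above
hyperfold_ms = 0.31   # "HyperFold FHE on T4"

# Ratio of the two per-bootstrap latencies on the same GPU.
speedup = baseline_ms / hyperfold_ms

# Sequential bootstraps per second at the faster latency.
bootstraps_per_second = 1000.0 / hyperfold_ms

print(f"speedup on T4: {speedup:.1f}x")                   # speedup on T4: 64.5x
print(f"bootstraps per second: {bootstraps_per_second:.0f}")  # bootstraps per second: 3226
```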

That is 3x the speed of state-of-the-art H100 benchmarks, on hardware that costs a tenth of the price.

This isn't just an optimization. It makes real-time FHE and 1.58-bit LLM inference viable on consumer silicon today. We don't need bigger pipes; we need better fuel.
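For readers unfamiliar with the "1.58-bit" label: a ternary weight carries log2(3) ≈ 1.585 bits of information. The sketch below shows one possible way to get close to that in storage, by packing five ternary values into a single byte as a base-3 number (3^5 = 243 fits in 256). This is an illustrative scheme only; it is an assumption for the example and not a description of the packing used by bitnet.cpp or by the HyperFold kernel, and `pack5` and `unpack5` are hypothetical names.

```python
import math

# Information content of one ternary weight.
bits_per_weight = math.log2(3)


def pack5(trits):
    """Pack five values from {-1, 0, +1} into one integer in 0..242."""
    if len(trits) != 5 or any(t not in (-1, 0, 1) for t in trits):
        raise ValueError("expected exactly five ternary values")
    value = 0
    # The first trit becomes the least significant base-3 digit.
    for t in reversed(trits):
        value = value * 3 + (t + 1)   # map -1, 0, +1 to digits 0, 1, 2
    return value


def unpack5(byte):
    """Inverse of pack5: recover the five ternary values."""
    trits = []
    for _ in range(5):
        trits.append(byte % 3 - 1)    # map digits 0, 1, 2 back to -1, 0, +1
        byte //= 3
    return trits


example = [1, 0, -1, 1, 0]
packed = pack5(example)

print(f"{bits_per_weight:.3f} bits per ternary weight")  # 1.585 bits per ternary weight
print(packed, unpack5(packed))                           # 140 [1, 0, -1, 1, 0]
```

With this scheme the storage cost is 8 / 5 = 1.6 bits per weight, slightly above the theoretical 1.585.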

I would be happy to share the repo and discuss how this HyperFold kernel could be integrated into the next release of bitnet.cpp.

Best,
Maurice Wilson
HyperFold Technologies

Dec 26 '25 09:12 HyperFoldUK