TheTom

turboquant_plus

AI · LLM Inference · Quantization · llama.cpp · Machine Learning

// summary

TurboQuant+ is an experimental research implementation for llama.cpp that provides extreme KV cache compression using PolarQuant combined with a Walsh-Hadamard rotation. The project's experiments indicate that the value cache tolerates aggressive compression well, while key cache precision remains critical for preserving attention quality. It also supports layer-adaptive compression and sparse decoding to reduce memory usage and improve throughput across hardware backends.
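The core idea behind rotation-before-quantization can be sketched in a few lines. The block below is an illustrative reimplementation, not the project's actual kernel: the function names are hypothetical, and a plain uniform 4-bit quantizer stands in for PolarQuant. The Walsh-Hadamard transform is an orthonormal rotation that spreads per-dimension outliers across the whole vector, which shrinks the dynamic range the quantizer has to cover; because the transform is self-inverse, applying it again recovers the original coordinates.

```python
def walsh_hadamard(vec):
    """Fast Walsh-Hadamard transform; len(vec) must be a power of 2.

    Scaled by n^(-1/2) so it is orthonormal (and thus self-inverse).
    """
    v = list(vec)
    n = len(v)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                x, y = v[j], v[j + h]
                v[j], v[j + h] = x + y, x - y
        h *= 2
    scale = n ** -0.5
    return [x * scale for x in v]

def quantize_4bit(vec):
    """Uniform symmetric 4-bit quantizer (hypothetical stand-in for PolarQuant)."""
    amax = max(abs(x) for x in vec) or 1.0
    step = amax / 7
    codes = [max(-8, min(7, round(x / step))) for x in vec]
    return codes, step

# Rotate a value-cache row, quantize it, then dequantize and rotate back.
row = [0.9, -1.2, 0.3, 0.05, 2.0, -0.4, 0.1, 0.7]
rotated = walsh_hadamard(row)
codes, step = quantize_4bit(rotated)
recovered = walsh_hadamard([c * step for c in codes])  # approximates `row`
```

This is only a minimal sketch: the real project additionally varies precision per layer and treats keys and values asymmetrically.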

// use cases

01
Extreme KV cache compression (up to 6.4x) to enable large-context inference on memory-constrained hardware.
02
Asymmetric K/V quantization to maintain model quality by preserving high-precision keys while compressing values.
03
Sparse V decoding to skip low-weight positions, improving inference speed by up to 22.8% at long context lengths.