TheTom

turboquant_plus

AI · LLM Inference · Quantization · llama.cpp · Machine Learning

// summary

TurboQuant+ is an experimental research implementation for llama.cpp that provides extreme KV cache compression using PolarQuant combined with a Walsh-Hadamard rotation. The project's experiments indicate that the value cache tolerates aggressive compression well, while key cache precision remains critical for preserving attention quality. It also supports layer-adaptive compression and sparse decoding to reduce memory usage and improve throughput across hardware backends.
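The core idea behind rotation-before-quantization can be sketched in a few lines. The block below is an illustrative reimplementation, not the project's actual kernel: the function names are hypothetical, and a plain uniform 4-bit quantizer stands in for PolarQuant. The Walsh-Hadamard transform is an orthonormal rotation that spreads per-dimension outliers across the whole vector, which shrinks the dynamic range the quantizer has to cover; because the transform is self-inverse, applying it again recovers the original coordinates.

```python
def walsh_hadamard(vec):
    """Fast Walsh-Hadamard transform; len(vec) must be a power of 2.

    Scaled by n^(-1/2) so it is orthonormal (and thus self-inverse).
    """
    v = list(vec)
    n = len(v)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                x, y = v[j], v[j + h]
                v[j], v[j + h] = x + y, x - y
        h *= 2
    scale = n ** -0.5
    return [x * scale for x in v]

def quantize_4bit(vec):
    """Uniform symmetric 4-bit quantizer (hypothetical stand-in for PolarQuant)."""
    amax = max(abs(x) for x in vec) or 1.0
    step = amax / 7
    codes = [max(-8, min(7, round(x / step))) for x in vec]
    return codes, step

# Rotate a value-cache row, quantize it, then dequantize and rotate back.
row = [0.9, -1.2, 0.3, 0.05, 2.0, -0.4, 0.1, 0.7]
rotated = walsh_hadamard(row)
codes, step = quantize_4bit(rotated)
recovered = walsh_hadamard([c * step for c in codes])  # approximates `row`
```

This is only a minimal sketch: the real project additionally varies precision per layer and treats keys and values asymmetrically.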

// use cases

01
Extreme KV cache compression (up to 6.4x) to enable large-context inference on memory-constrained hardware.
02
Asymmetric K/V quantization to maintain model quality by preserving high-precision keys while compressing values.
03
Sparse V decoding to skip low-weight positions, improving inference speed by up to 22.8% at long context lengths.