TurboQuant: Redefining AI efficiency with extreme compression
Google Research scientists Amir Zandieh and Vahab Mirrokni introduced TurboQuant on March 24, 2026. The algorithm, along with Quantized Johnson-Lindenstrauss (QJL) and PolarQuant, targets memory overhead in vector quantization for large language models and vector search engines.
TurboQuant compresses high-dimensional vectors used in key-value caches and similarity searches. QJL applies the Johnson-Lindenstrauss Transform to reduce vector components to single sign bits with zero added memory overhead. PolarQuant converts vectors to polar coordinates to remove the need for data normalization steps.
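To make the sign-bit idea concrete, here is a minimal sketch of a QJL-style estimator in plain Python. It is an illustration under stated assumptions, not Google's implementation: the function names are hypothetical, the projection is a dense Gaussian matrix, and the sketch assumes that one norm per key is stored alongside the sign bits. A key is projected and reduced to signs, the query is projected but left unquantized, and the inner product is recovered by rescaling with sqrt(pi/2).

```python
import math
import random


def dot(a, b):
    """Plain inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def gaussian_matrix(m, d, rng):
    """Hypothetical helper: an m x d matrix of i.i.d. N(0, 1) entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]


def qjl_quantize_key(S, k):
    """Project key k with S and keep only the sign of each projection.

    Returns (signs, norm). Each sign is +1 or -1 (one bit of information);
    storing the key norm is an assumption of this sketch.
    """
    signs = [1 if dot(row, k) >= 0 else -1 for row in S]
    return signs, math.sqrt(dot(k, k))


def qjl_inner_product(S, q, signs, k_norm):
    """Estimate <q, k> from the unquantized projection of q and the sign bits of k.

    For Gaussian rows s, E[(s.q) * sign(s.k)] = sqrt(2/pi) * <q, k> / ||k||,
    so rescaling by sqrt(pi/2) * ||k|| and averaging over rows is unbiased.
    """
    m = len(S)
    acc = sum(dot(row, q) * s for row, s in zip(S, signs))
    return math.sqrt(math.pi / 2) / m * k_norm * acc


if __name__ == "__main__":
    rng = random.Random(0)
    d, m = 16, 1024  # illustrative sizes only
    k = [rng.gauss(0.0, 1.0) for _ in range(d)]
    q = [x + 0.1 * rng.gauss(0.0, 1.0) for x in k]
    S = gaussian_matrix(m, d, rng)
    signs, k_norm = qjl_quantize_key(S, k)
    print("exact    :", dot(q, k))
    print("estimated:", qjl_inner_product(S, q, signs, k_norm))
```

The estimate gets tighter as the number of projection rows m grows. A production system would pack the signs into bit arrays and would likely use a structured transform rather than a dense matrix.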
The techniques were tested on LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval using Gemma and Mistral models. TurboQuant quantized key-value caches to 3 bits without training or fine-tuning and without accuracy loss. It delivered up to 8 times faster attention logit computation on H100 GPUs compared with 32-bit keys.
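For a sense of scale, the raw arithmetic behind a 3-bit key-value cache can be sketched as follows. The model shape here is hypothetical and chosen only for illustration. The calculation counts payload bits alone and ignores any per-block metadata a real quantization scheme might store.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits):
    """Payload bytes for keys plus values at `bits` per stored element."""
    elements = 2 * layers * kv_heads * head_dim * seq_len  # 2 = keys and values
    return elements * bits / 8


# Hypothetical model shape (not taken from the article).
shape = dict(layers=32, kv_heads=8, head_dim=128, seq_len=32_768)

for bits in (32, 16, 3):
    gib = kv_cache_bytes(bits=bits, **shape) / 2**30
    print(f"{bits:>2}-bit cache: {gib:.2f} GiB")
# prints 8.00 GiB, 4.00 GiB and 0.75 GiB respectively
```

Under these assumptions, 3 bits per element is roughly a 10.7-fold reduction from 32-bit storage and about 5.3-fold from 16-bit storage.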
In vector search tests against product quantization (PQ) and RaBitQ, TurboQuant achieved higher 1@k recall ratios. The methods require no dataset-specific tuning and support faster index building. TurboQuant, QJL, and PolarQuant are scheduled for presentation at ICLR 2026 and AISTATS 2026.
Published by Tech & Business, a media brand covering technology and business.
This story was sourced from Google Research and reviewed by the T&B editorial agent team.