# NVIDIA and Sarvam AI co-design optimizations delivering 4x inference gains for sovereign 30B multilingual models

_Friday, June 26, 2026 at 9:47 PM EDT · AI · Latest · Tier 2 — Notable_

![NVIDIA and Sarvam AI co-design optimizations delivering 4x inference gains for sovereign 30B multilingual models — Primary](https://developer-blogs.nvidia.com/wp-content/uploads/2026/02/ov-dgx-cloud-ari-blog-1920x1080-2.png)

Sarvam AI, a generative AI startup based in Bengaluru, India, collaborated with NVIDIA to co-design hardware and software optimizations for its sovereign Sarvam 30B model. The effort targeted the strict latency requirements of voice-to-voice agents and improved inference efficiency for the multilingual foundation model under sovereign data controls.

The optimizations produced a 4x speedup in inference performance on NVIDIA Blackwell GPUs compared with the baseline on NVIDIA H100 GPUs. The gain came in two stages that multiply: kernel and scheduling optimizations on H100 SXM GPUs contributed a 2x speedup, and Blackwell compute capabilities paired with NVFP4 weight quantization delivered a further 2x, with that second-stage gain reaching 2.8x at higher interactivity levels.
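The arithmetic behind the headline figure can be sketched in a few lines: independent stage speedups compose by multiplication, not addition. The function name `compound_speedup` is hypothetical, and the only inputs used are the two 2x factors reported above.

```python
def compound_speedup(*stage_speedups: float) -> float:
    """Multiply per-stage speedup factors into one end-to-end factor."""
    total = 1.0
    for factor in stage_speedups:
        if factor <= 0:
            raise ValueError("speedup factors must be positive")
        total *= factor
    return total


# Figures from the article: 2x from kernel and scheduling work on H100 SXM,
# then a further 2x from Blackwell with NVFP4 weight quantization.
software_gain = 2.0
hardware_gain = 2.0

print(compound_speedup(software_gain, hardware_gain))  # 4.0
```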

The Sarvam 30B model employs a heterogeneous mixture-of-experts (MoE) architecture with 19 layers (one dense and 18 MoE), 128 experts and top-6 routing. It uses grouped query attention and a shared expert design. The model supports 22 Indian languages along with English, math and code. It was pretrained from scratch with the NVIDIA NeMo Framework and Megatron-LM, with post-training handled through NeMo-RL.
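To make "128 experts and top-6 routing" concrete, here is a minimal sketch of top-k expert selection for a single token. It is illustrative only: the article does not describe how Sarvam 30B scores or weights its experts, so the softmax over the selected logits is an assumption, the name `route_token` is hypothetical, and the shared expert (which in a shared-expert design processes every token alongside the routed ones) is not modeled.

```python
import math
import random

NUM_EXPERTS = 128  # from the article
TOP_K = 6          # from the article


def route_token(router_logits, top_k=TOP_K):
    """Return [(expert_index, weight)] for the top_k highest-scoring experts.

    Assumption: weights are a softmax over only the selected logits.
    The real model's gating function is not described in the source.
    """
    if len(router_logits) < top_k:
        raise ValueError("need at least top_k router logits")
    # Indices of the top_k largest logits, highest first.
    chosen = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )[:top_k]
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    norm = sum(exps)
    return [(i, e / norm) for i, e in zip(chosen, exps)]


# Random logits stand in for a learned router's output for one token.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
assignment = route_token(logits)
print(len(assignment))  # 6
```

Each token therefore activates only 6 of the 128 expert networks in a MoE layer, which is why per-token compute stays well below what the total parameter count would suggest.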

The teams selected the SGLang inference engine and applied RadixAttention for prefix sharing along with a Cache-Aware Scheduler. They configured an expert-parallel degree of 2 and a data-parallel degree of 2 on two H100 SXM GPUs. Service-level targets included a P95 time to first token (TTFT) under 1000 milliseconds and a P95 inter-token latency (ITL) under 15 milliseconds. Profiling with NVIDIA Nsight Systems identified bottlenecks in MoE routing and positional embeddings, which were addressed through fused kernels.
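The service-level targets above can be expressed as a simple check over measured latencies. This is a minimal sketch, not the teams' benchmarking harness: the helper names are hypothetical, the sample data is made up for illustration, and the nearest-rank percentile definition is an assumption, since the article does not say how P95 was computed.

```python
import math

TTFT_P95_LIMIT_MS = 1000.0  # from the article
ITL_P95_LIMIT_MS = 15.0     # from the article


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value covering pct% of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_slo(ttft_ms, itl_ms):
    """True if both P95 latencies fall under their limits."""
    return (
        percentile(ttft_ms, 95) < TTFT_P95_LIMIT_MS
        and percentile(itl_ms, 95) < ITL_P95_LIMIT_MS
    )


# Made-up measurements: 5% of requests are slow, so P95 TTFT is still 400 ms.
ttft_samples = [400.0] * 95 + [1500.0] * 5
itl_samples = [10.0] * 100
print(meets_slo(ttft_samples, itl_samples))  # True
```

A P95 target tolerates up to 5% of requests exceeding the limit; in the sample data, one more slow request (6 of 100) would push the P95 TTFT to 1500 ms and fail the check.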

## Sources

- [NVIDIA Developer](https://developer.nvidia.com/blog/how-nvidia-extreme-hardware-software-co-design-delivered-a-large-inference-boost-for-sarvam-ais-sovereign-models/)

---
Canonical: https://techandbusiness.org/newswire/X0O85GNlLhBSz1ObTqRjlV
Retrieved: 2026-06-27T06:07:07.304Z
Publisher: Tech & Business (techandbusiness.org)