NVIDIA and Sarvam AI co-design optimizations delivering 4x inference gains for sovereign 30B multilingual models

Sarvam AI, a generative AI startup based in Bengaluru, India, collaborated with NVIDIA to co-design hardware and software optimizations for its Sovereign 30B model. The effort targeted strict latency requirements for voice-to-voice agents and improved inference efficiency for the multilingual foundation model under sovereign data controls.

The optimizations produced a 4x speedup in inference performance on NVIDIA Blackwell GPUs compared with baseline NVIDIA H100 GPUs. Kernel and scheduling optimizations on H100 SXM GPUs contributed a 2x speedup. Blackwell compute capabilities paired with NVFP4 weight quantization delivered the additional 2x, with a 2.8x gain at higher interactivity levels.

The Sarvam 30B model employs a heterogeneous mixture-of-experts (MoE) architecture with 19 layers (one dense and 18 MoE), 128 experts, and top-6 routing. It uses grouped query attention and a shared expert design. The model supports 22 Indian languages along with English, math, and code. It was pretrained from scratch with the NVIDIA NeMo Framework and Megatron-LM, with post-training handled through NeMo-RL.

For serving, the teams selected the SGLang inference engine and applied RadixAttention for prefix sharing along with a Cache-Aware Scheduler. They configured expert parallelism of 2 and data parallelism of 2 on two H100 SXM GPUs. Service-level targets were a P95 time to first token under 1000 milliseconds and a P95 inter-token latency under 15 milliseconds. Profiling with NVIDIA Nsight Systems identified bottlenecks in MoE routing and positional embeddings, which were addressed through fused kernels.
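The two service-level targets quoted above (P95 time to first token under 1000 ms, P95 inter-token latency under 15 ms) can be checked against measured latencies with a few lines of Python. This is a minimal sketch, not the teams' tooling: the function names `p95` and `meets_slo` are hypothetical, and the nearest-rank percentile definition is an assumption, since the article does not say how the percentiles were computed.

```python
# Thresholds taken from the article; everything else here is illustrative.
TTFT_P95_LIMIT_MS = 1000.0  # P95 time to first token
ITL_P95_LIMIT_MS = 15.0     # P95 inter-token latency


def p95(samples):
    """Return the 95th percentile using the nearest-rank method (an assumption)."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_slo(ttft_ms, itl_ms):
    """True when both P95 latencies fall strictly under their limits."""
    return p95(ttft_ms) < TTFT_P95_LIMIT_MS and p95(itl_ms) < ITL_P95_LIMIT_MS
```

Because the targets are percentiles rather than means, a run in which 10% of requests take 1200 ms to produce a first token fails the check even if the other 90% are fast.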
