Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems

Multi-agent LLM systems enable advanced reasoning and tool use through role specialization, but reliable reinforcement learning post-training for these systems has been difficult. A new method called Dr. MAS addresses the instability in such training. The work identifies a key issue with extending group-based RL to multiple agents: under GRPO-style optimization, a global normalization baseline may deviate from the reward distributions of diverse agents, and this deviation causes gradient-norm instability. Dr. MAS instead normalizes advantages per agent, using each agent's own reward statistics, which calibrates gradient scales and stabilizes training. The work also supplies an end-to-end framework supporting scalable orchestration, flexible serving configurations, and shared resource scheduling. Evaluations used Qwen2.5 and Qwen3 models on multi-agent math reasoning and multi-turn search benchmarks. On math, the method achieved gains over vanilla GRPO of 5.6 percent in avg@16 (average accuracy over 16 samples) and 4.6 percent in pass@16. On search, the gains were 15.2 percent and 13.1 percent on the same metrics. Gradient spikes were largely eliminated during training. The recipe also proved effective under heterogeneous agent-model assignments and improved efficiency.
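To make the difference between global and per-agent normalization concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the agent labels ("solver", "verifier"), the reward values, and the `eps` term are all illustrative assumptions, and the sketch ignores the per-prompt grouping that a full GRPO-style pipeline would also apply.

```python
from collections import defaultdict
from statistics import mean, pstdev


def global_advantages(rewards, eps=1e-6):
    """Normalize all rewards with one shared mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def per_agent_advantages(rewards, agent_ids, eps=1e-6):
    """Normalize each reward with the statistics of its own agent only."""
    by_agent = defaultdict(list)
    for r, a in zip(rewards, agent_ids):
        by_agent[a].append(r)
    # One (mean, std) pair per agent, computed from that agent's rewards.
    stats = {a: (mean(rs), pstdev(rs)) for a, rs in by_agent.items()}
    return [
        (r - stats[a][0]) / (stats[a][1] + eps)
        for r, a in zip(rewards, agent_ids)
    ]


# Hypothetical rewards: one agent scores high on average, the other low.
rewards = [0.9, 1.0, 0.8, 0.1, 0.0, 0.2]
agents = ["solver"] * 3 + ["verifier"] * 3

print(global_advantages(rewards))
print(per_agent_advantages(rewards, agents))
```

With the shared baseline, every "solver" sample lands above the global mean and every "verifier" sample below it, so each agent's advantages carry a constant offset unrelated to how good an individual sample was. With per-agent statistics, each agent's advantages are centered on zero and scaled by that agent's own spread, which is the calibration the summary above describes.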
