# Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems

_Friday, June 26, 2026 at 9:58 PM EDT · AI · Latest · Tier 2 — Notable_


Multi-agent LLM systems enable advanced reasoning and tool use through role specialization. Reliable reinforcement learning post-training for these systems has been difficult. A new method called Dr. MAS addresses instability in such training.

The work identifies a key issue with extending group-based RL to multi-agent systems. Under GRPO-style optimization, a global normalization baseline may deviate from the reward distributions of the individual agents, and this deviation leads to gradient-norm instability.
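To make the mismatch concrete, here is a minimal Python sketch. It is not taken from the paper: the two agent roles, the reward values, and the `global_normalize` helper are assumptions chosen for illustration, and the baseline is assumed to be a shared mean and standard deviation over the whole group.

```python
from statistics import mean, pstdev

def global_normalize(rewards, eps=1e-8):
    """Normalize every reward with one shared mean and std (assumed GRPO-style baseline)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards: one agent tends to score high, the other low.
solver_rewards = [0.9, 1.0, 0.8, 0.9]
verifier_rewards = [0.1, 0.2, 0.0, 0.1]

advantages = global_normalize(solver_rewards + verifier_rewards)
solver_adv = advantages[:len(solver_rewards)]
verifier_adv = advantages[len(solver_rewards):]

# The shared mean (0.5) sits between the two agents' reward ranges, so
# neither agent's advantages are centered on zero.
print("solver mean advantage:  ", mean(solver_adv))
print("verifier mean advantage:", mean(verifier_adv))
```

In this toy case every solver advantage is positive and every verifier advantage is negative, regardless of how good each sample was relative to that agent's own typical reward.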

Dr. MAS normalizes advantages per agent, using each agent's own reward statistics. This per-agent approach calibrates gradient scales and stabilizes training. The work also provides an end-to-end framework that supports scalable orchestration, flexible serving configurations, and shared resource scheduling.
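The per-agent idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: it assumes the "reward statistics" are the mean and standard deviation of each agent's rewards, and the `per_agent_advantages` helper, agent names, and reward values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def per_agent_advantages(samples, eps=1e-8):
    """samples: list of (agent_id, reward) pairs. Returns advantages in input order.

    Each reward is normalized with the mean and std of its own agent's rewards
    (assumed form of the per-agent statistics).
    """
    rewards_by_agent = defaultdict(list)
    for agent_id, reward in samples:
        rewards_by_agent[agent_id].append(reward)
    stats = {a: (mean(rs), pstdev(rs)) for a, rs in rewards_by_agent.items()}
    return [(r - stats[a][0]) / (stats[a][1] + eps) for a, r in samples]

# Hypothetical rewards for two agents with very different reward levels.
samples = [
    ("solver", 0.9), ("solver", 1.0), ("solver", 0.8), ("solver", 0.9),
    ("verifier", 0.1), ("verifier", 0.2), ("verifier", 0.0), ("verifier", 0.1),
]
advantages = per_agent_advantages(samples)

solver_adv = [a for (agent, _), a in zip(samples, advantages) if agent == "solver"]
verifier_adv = [a for (agent, _), a in zip(samples, advantages) if agent == "verifier"]
print("solver:  ", solver_adv)
print("verifier:", verifier_adv)
```

With this normalization each agent's advantages are centered on zero with roughly unit spread, so both agents see updates on a comparable scale even though their raw rewards differ widely.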

Evaluations used Qwen2.5 and Qwen3 models on multi-agent math reasoning and multi-turn search benchmarks. On math, the method achieved gains over vanilla GRPO of 5.6 percent in avg@16 (average accuracy over 16 samples) and 4.6 percent in pass@16. On search, the gains were 15.2 percent in avg@16 and 13.1 percent in pass@16.
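For readers unfamiliar with the two metrics, the sketch below shows how average accuracy over k samples and pass at k are commonly computed from per-sample correctness. The definitions are the common ones and are assumed here, since the article does not spell them out; the helper names and the toy data (k = 4 rather than 16) are hypothetical.

```python
def avg_at_k(correct_per_problem):
    """Per-problem fraction of correct samples, averaged over all problems."""
    per_problem = [sum(c) / len(c) for c in correct_per_problem]
    return sum(per_problem) / len(per_problem)

def pass_at_k(correct_per_problem):
    """Fraction of problems where at least one of the k samples is correct."""
    solved = [any(c) for c in correct_per_problem]
    return sum(solved) / len(solved)

# Toy data: 4 problems, k = 4 samples each (True means the sample was correct).
results = [
    [True, False, False, False],
    [False, False, False, False],
    [True, True, True, True],
    [True, True, False, False],
]

avg_score = avg_at_k(results)    # (0.25 + 0.0 + 1.0 + 0.5) / 4
pass_score = pass_at_k(results)  # 3 of 4 problems have a correct sample
print(avg_score, pass_score)
```

The average metric rewards consistency across samples, while the pass metric only asks whether any sample succeeds, which is why the two can move by different amounts.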

Gradient spikes were largely eliminated during training. The method also remained effective under heterogeneous agent-model assignments while improving efficiency.

## Sources

- [arXiv](https://arxiv.org/abs/2602.08847)

---
Canonical: https://techandbusiness.org/newswire/WMYow9Ig064KslncDOg7sO
Retrieved: 2026-06-27T06:23:02.593Z
Publisher: Tech & Business (techandbusiness.org)