
Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems

Multi-agent LLM systems enable advanced reasoning and tool use through role specialization, but reliable reinforcement learning (RL) post-training for these systems has been difficult. A new method called Dr. MAS addresses the instability in such training.

The work identifies a key issue with extending group-based RL to multi-agent settings: under GRPO-style optimization, a global normalization baseline may deviate from the reward distributions of diverse agents, and this deviation causes gradient-norm instability. Dr. MAS instead normalizes advantages per agent, using each agent's own reward statistics, which calibrates gradient scales and stabilizes training. The work also supplies an end-to-end framework supporting scalable orchestration, flexible serving configurations, and shared resource scheduling.

Evaluations used Qwen2.5 and Qwen3 models on multi-agent math reasoning and multi-turn search benchmarks. On math, the method achieved gains over vanilla GRPO of 5.6 percent in avg@16 (average accuracy over 16 samples) and 4.6 percent in pass@16. On search, the gains were 15.2 percent and 13.1 percent on the same metrics. Gradient spikes were largely eliminated during training. The recipe also remained effective under heterogeneous agent-model assignments while improving efficiency.
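To make the contrast between a global baseline and per-agent normalization concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the agent labels, the toy rewards, the use of the population standard deviation, and the `EPS` constant are all hypothetical choices made for this example.

```python
from collections import defaultdict
from statistics import mean, pstdev

EPS = 1e-6  # assumed small constant to avoid division by zero


def global_advantages(samples):
    """Global baseline: one mean/std computed over every agent's rewards."""
    rewards = [reward for _, reward in samples]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(reward - mu) / (sigma + EPS) for reward in rewards]


def per_agent_advantages(samples):
    """Per-agent baseline: each reward is normalized with its own agent's mean/std."""
    by_agent = defaultdict(list)
    for agent, reward in samples:
        by_agent[agent].append(reward)
    stats = {agent: (mean(rs), pstdev(rs)) for agent, rs in by_agent.items()}
    return [
        (reward - stats[agent][0]) / (stats[agent][1] + EPS)
        for agent, reward in samples
    ]


# Hypothetical rewards: the two agents operate on very different reward scales.
samples = [
    ("solver", 0.9),
    ("solver", 0.7),
    ("verifier", 0.2),
    ("verifier", 0.0),
]

for (agent, reward), g, p in zip(
    samples, global_advantages(samples), per_agent_advantages(samples)
):
    print(f"{agent:9s} reward={reward:.1f} global={g:+.2f} per_agent={p:+.2f}")
```

With these toy numbers, the global baseline assigns positive advantages to both of the higher-reward agent's samples and negative advantages to both of the other agent's samples, so the sign reflects which agent produced the sample rather than how good the sample was for that agent. Normalizing per agent centers each agent's advantages around zero on a comparable scale, which is the calibration effect the article attributes to Dr. MAS.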
Published by Tech & Business, a media brand covering technology and business. This story was sourced from arXiv and reviewed by the T&B editorial agent team.