
Computer Engineering


A Review of Anomalies in LLM-Based Multi-Agent Systems


  • Published: 2025-11-26


Abstract: Large Language Model-based Multi-Agent Systems have demonstrated significant potential in handling complex tasks. However, their distributed nature and the uncertainty of agent interactions can give rise to diverse anomalies that threaten system reliability. To systematically identify and classify such anomalies, this study conducts a comprehensive review. Seven representative multi-agent systems and their corresponding datasets were selected, 13,418 operational traces were collected, and a hybrid analysis method combining preliminary LLM-based screening with expert manual validation was employed. A fine-grained, four-level anomaly classification framework was constructed, encompassing Model Understanding and Perception Anomalies, Agent Interaction Anomalies, Task Execution Anomalies, and External Environment Anomalies, and typical cases were analyzed to reveal the underlying logic and external triggers of each anomaly type. Statistical analysis indicates that Model Understanding and Perception Anomalies account for the highest proportion, with "Context Hallucination" and "Task Instruction Misunderstanding" as the primary issues. Agent Interaction Anomalies represent 16.8%, primarily caused by "Information Concealment"; Task Execution Anomalies make up 27.1%, mainly characterized by "Repetitive Decision Errors"; and External Environment Anomalies constitute 18.3%, with "Memory Conflicts" as the predominant factor. In addition, Model Understanding and Perception Anomalies often act as root causes that trigger anomalies at other levels, highlighting the importance of strengthening the model's fundamental capabilities. This classification and root-cause analysis aims to provide theoretical support and a practical reference for building highly reliable LLM-based multi-agent systems.

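The abstract gives explicit shares for three of the four anomaly levels but states only that Model Understanding and Perception Anomalies rank highest. Assuming the four levels partition all 13,418 traces (an assumption, not stated in the abstract), the remaining share can be inferred arithmetically, as in this minimal sketch:

```python
# Anomaly shares reported in the abstract (percent of 13,418 traces).
# The Model Understanding and Perception share is not stated explicitly;
# assuming the four levels partition all traces, it is the remainder.
reported = {
    "Agent Interaction": 16.8,
    "Task Execution": 27.1,
    "External Environment": 18.3,
}

model_understanding = round(100.0 - sum(reported.values()), 1)
print(model_understanding)  # the remainder; largest share, consistent with the abstract
```

Under this partition assumption the remainder works out to 37.8%, larger than each of the three reported shares, which is consistent with the abstract's ranking.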