diff --git "a/all_papers_0328.csv" "b/all_papers_0328.csv"
new file mode 100644
--- /dev/null
+++ "b/all_papers_0328.csv"
@@ -0,0 +1,497 @@
+Title,TLDR-EN,TLDR,Section,Category,Abstract,url,Year,Publish Venue,Addition Info,父记录,Duplicate +"Large Model Based Agents: State-of-the-Art, Cooperation Paradigms, Security and Privacy, and Future Trends +","This survey explores future LM agent autonomous collaboration, covering current state, collaboration tech, security risks, and proposes future research directions. ",探讨未来大模型驱动智能体自主协作场景,涵盖现状、协作范式、安全隐私挑战,提出未来研究方向 ,Introduction,Survey,"With the rapid advancement of large models (LMs), the development of general-purpose intelligent agents powered by LMs has become a reality. It is foreseeable that in the near future, LM-driven general AI agents will serve as essential tools in production tasks, capable of autonomous communication and collaboration without human intervention. This paper investigates scenarios involving the autonomous collaboration of future LM agents. We review the current state of LM agents, the key technologies enabling LM agent collaboration, and the security and privacy challenges they face during cooperative operations. To this end, we first explore the foundational principles of LM agents, including their general architecture, key components, enabling technologies, and modern applications. We then discuss practical collaboration paradigms from data, computation, and knowledge perspectives to achieve connected intelligence among LM agents. After that, we analyze the security vulnerabilities and privacy risks associated with LM agents, particularly in multi-agent settings, examining underlying mechanisms and reviewing current and potential countermeasures. Lastly, we propose future research directions for building robust and secure LM agent ecosystems. +",https://arxiv.org/abs/2409.14457,2024,Arxiv,,, +Agent AI: Surveying the Horizons of Multimodal Interaction,"The paper defines ""Agent AI"", explores improving agents via external knowledge etc., mitigates model hallucinations, and envisions virtual interactions. ",提出“Agent AI”概念,探索改进方法,可减少大模型幻觉,展望虚实环境交互未来 ,Introduction,Survey,"Multi-modal AI systems will likely become a ubiquitous presence in our everyday lives. A promising approach to making these systems more interactive is to embody them as agents within physical and virtual environments. At present, systems leverage existing foundation models as the basic building blocks for the creation of embodied agents. Embedding agents within such environments facilitates the ability of models to process and interpret visual and contextual data, which is critical for the creation of more sophisticated and context-aware AI systems. For example, a system that can perceive user actions, human behavior, environmental objects, audio expressions, and the collective sentiment of a scene can be used to inform and direct agent responses within the given environment. To accelerate research on agent-based multimodal intelligence, we define ""Agent AI"" as a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally-grounded data, and can produce meaningful embodied actions. In particular, we explore systems that aim to improve agents based on next-embodied action prediction by incorporating external knowledge, multi-sensory inputs, and human feedback. 
We argue that by developing agentic AI systems in grounded environments, one can also mitigate the hallucinations of large foundation models and their tendency to generate environmentally incorrect outputs. The emerging field of Agent AI subsumes the broader embodied and agentic aspects of multimodal interactions. Beyond agents acting and interacting in the physical world, we envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment. +",https://arxiv.org/abs/2401.03568,2024,Arxiv,,, +Large Language Model based Multi-Agents: A Survey of Progress and Challenges,"This survey discusses essential aspects and challenges of LLM-based multi-agent systems, offers insights, lists datasets, and maintains a GitHub repo. ",该综述深入探讨大语言模型多智能体系统关键方面、挑战,总结常用数据集,维护开源仓库更新研究。 ,Introduction,Survey,"Large Language Models (LLMs) have achieved remarkable success across a wide array of tasks. Due to the impressive planning and reasoning abilities of LLMs, they have been used as autonomous agents to do many tasks automatically. Recently, based on the development of using one LLM as a single planning or decision-making agent, LLM-based multi-agent systems have achieved considerable progress in complex problem-solving and world simulation. To provide the community with an overview of this dynamic field, we present this survey to offer an in-depth discussion on the essential aspects of multi-agent systems based on LLMs, as well as the challenges. Our goal is for readers to gain substantial insights on the following questions: What domains and environments do LLM-based multi-agents simulate? How are these agents profiled and how do they communicate? What mechanisms contribute to the growth of agents' capacities? For those interested in delving into this field of study, we also summarize the commonly used datasets or benchmarks for them to have convenient access. To keep researchers updated on the latest studies, we maintain an open-source GitHub repository, dedicated to outlining the research on LLM-based multi-agent systems. +",https://arxiv.org/abs/2402.01680,2024,Arxiv,,,🚫重复 +A survey on large language model based autonomous agents,"This paper offers a comprehensive survey of LLM-based autonomous agents, covering construction, applications, evaluation, and presents challenges & future directions. ",全面调研大语言模型自主智能体研究,提出构建框架、综述应用与评估策略,指明挑战与方向 ,Introduction,Survey,"Autonomous agents have long been a prominent research focus in both academic and industry communities. Previous research in this field often focuses on training agents with limited knowledge within isolated environments, which diverges significantly from human learning processes, and thus makes the agents hard to achieve human-like decisions. Recently, through the acquisition of vast amounts of web knowledge, large language models (LLMs) have demonstrated remarkable potential in achieving human-level intelligence. This has sparked an upsurge in studies investigating LLM-based autonomous agents. In this paper, we present a comprehensive survey of these studies, delivering a systematic review of the field of LLM-based autonomous agents from a holistic perspective. More specifically, we first discuss the construction of LLM-based autonomous agents, for which we propose a unified framework that encompasses a majority of the previous work. 
Then, we present a comprehensive overview of the diverse applications of LLM-based autonomous agents in the fields of social science, natural science, and engineering. Finally, we delve into the evaluation strategies commonly used for LLM-based autonomous agents. Based on the previous studies, we also present several challenges and future directions in this field. To keep track of this field and continuously update our survey, we maintain a repository of relevant references at this https URL. +",https://arxiv.org/abs/2308.11432,2023,FCS,,, +The rise and potential of large language model based agents: a survey,"This paper surveys LLM-based agents, tracing concepts, presenting a framework, exploring applications, agent societies, and discussing key topics. ",全面调研大语言模型代理,介绍框架、应用、代理社会等,探讨领域关键话题与问题。 ,Introduction,Survey,"For a long time, humanity has pursued artificial intelligence (AI) equivalent to or surpassing the human level, with AI agents considered a promising vehicle for this pursuit. AI agents are artificial entities that sense their environment, make decisions, and take actions. Many efforts have been made to develop intelligent agents, but they mainly focus on advancement in algorithms or training strategies to enhance specific capabilities or performance on particular tasks. Actually, what the community lacks is a general and powerful model to serve as a starting point for designing AI agents that can adapt to diverse scenarios. Due to the versatile capabilities they demonstrate, large language models (LLMs) are regarded as potential sparks for Artificial General Intelligence (AGI), offering hope for building general AI agents. Many researchers have leveraged LLMs as the foundation to build AI agents and have achieved significant progress. In this paper, we perform a comprehensive survey on LLM-based agents. We start by tracing the concept of agents from its philosophical origins to its development in AI, and explain why LLMs are suitable foundations for agents. Building upon this, we present a general framework for LLM-based agents, comprising three main components: brain, perception, and action, and the framework can be tailored for different applications. Subsequently, we explore the extensive applications of LLM-based agents in three aspects: single-agent scenarios, multi-agent scenarios, and human-agent cooperation. Following this, we delve into agent societies, exploring the behavior and personality of LLM-based agents, the social phenomena that emerge from an agent society, and the insights they offer for human society. Finally, we discuss several key topics and open problems within the field. A repository for the related papers at this https URL. +",https://arxiv.org/abs/2309.07864,2023,SCIS,,, +Large Multimodal Agents: A Survey,"This paper systematically reviews large multimodal agents, categorizes research, reviews collaborative frameworks, standardizes evaluations, and suggests future directions. ",该文系统综述大语言模型驱动的多模态智能体,分类研究、整合评估框架,指明应用与方向。 ,Introduction,Survey,"Large language models (LLMs) have achieved superior performance in powering text-based AI agents, endowing them with decision-making and reasoning abilities akin to humans. Concurrently, there is an emerging research trend focused on extending these LLM-powered AI agents into the multimodal domain. This extension enables AI agents to interpret and respond to diverse multimodal user queries, thereby handling more intricate and nuanced tasks. 
In this paper, we conduct a systematic review of LLM-driven multimodal agents, which we refer to as large multimodal agents (LMAs for short). First, we introduce the essential components involved in developing LMAs and categorize the current body of research into four distinct types. Subsequently, we review the collaborative frameworks integrating multiple LMAs, enhancing collective efficacy. One of the critical challenges in this field is the diverse evaluation methods used across existing studies, hindering effective comparison among different LMAs. Therefore, we compile these evaluation methodologies and establish a comprehensive framework to bridge the gaps. This framework aims to standardize evaluations, facilitating more meaningful comparisons. Concluding our review, we highlight the extensive applications of LMAs and propose possible future research directions. Our discussion aims to provide valuable insights and guidelines for future research in this rapidly evolving field. An up-to-date resource list is available at this https URL. +",https://arxiv.org/abs/2402.15116,2024,Arxiv,,, +Understanding the planning of LLM agents: A survey,"This survey offers a systematic view of LLM-based agents planning, categorizes works, analyzes directions, and discusses future challenges. ",该调查对基于大语言模型的智能体规划作系统梳理,分类研究成果并分析挑战。 ,Introduction,Survey,"As Large Language Models (LLMs) have shown significant intelligence, the progress to leverage LLMs as planning modules of autonomous agents has attracted more attention. This survey provides the first systematic view of LLM-based agents planning, covering recent works aiming to improve planning ability. We provide a taxonomy of existing works on LLM-Agent planning, which can be categorized into Task Decomposition, Plan Selection, External Module, Reflection and Memory. Comprehensive analyses are conducted for each direction, and further challenges for the field of research are discussed. +",https://arxiv.org/abs/2402.02716,2024,Arxiv,,,🚫重复 +Computational Experiments Meet Large Language Model Based Agents: A Survey and Perspective,"This survey explores combining computational experiments with LLM-based agents, outlines their development, advantages, and addresses challenges and trends. ",该文探讨计算实验与大模型智能体融合,介绍发展、优势,指出挑战与趋势,为相关研究提供指引。 ,Introduction,Survey,"Computational experiments have emerged as a valuable method for studying complex systems, involving the algorithmization of counterfactuals. However, accurately representing real social systems in Agent-based Modeling (ABM) is challenging due to the diverse and intricate characteristics of humans, including bounded rationality and heterogeneity. To address this limitation, the integration of Large Language Models (LLMs) has been proposed, enabling agents to possess anthropomorphic abilities such as complex reasoning and autonomous learning. These agents, known as LLM-based Agent, offer the potential to enhance the anthropomorphism lacking in ABM. Nonetheless, the absence of explicit explainability in LLMs significantly hinders their application in the social sciences. Conversely, computational experiments excel in providing causal analysis of individual behaviors and complex phenomena. Thus, combining computational experiments with LLM-based Agent holds substantial research potential. This paper aims to present a comprehensive exploration of this fusion. 
Primarily, it outlines the historical development of agent structures and their evolution into artificial societies, emphasizing their importance in computational experiments. Then it elucidates the advantages that computational experiments and LLM-based Agents offer each other, considering the perspectives of LLM-based Agent for computational experiments and vice versa. Finally, this paper addresses the challenges and future trends in this research domain, offering guidance for subsequent related studies. +",https://arxiv.org/abs/2402.00262,2024,Arxiv,,, +"Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security","This paper focuses on Personal LLM Agents, discusses key questions like architecture, and surveys solutions to challenges for future end-user software paradigm. ",聚焦个人大语言模型代理,探讨架构、能力、效率及安全问题,分析挑战并调研解决方案 ,Introduction,Survey,"Since the advent of personal computing devices, intelligent personal assistants (IPAs) have been one of the key technologies that researchers and engineers have focused on, aiming to help users efficiently obtain information and execute tasks, and provide users with more intelligent, convenient, and rich interaction experiences. With the development of smartphones and IoT, computing and sensing devices have become ubiquitous, greatly expanding the boundaries of IPAs. However, due to the lack of capabilities such as user intent understanding, task planning, tool using, and personal data management etc., existing IPAs still have limited practicality and scalability. Recently, the emergence of foundation models, represented by large language models (LLMs), brings new opportunities for the development of IPAs. With the powerful semantic understanding and reasoning capabilities, LLM can enable intelligent agents to solve complex problems autonomously. In this paper, we focus on Personal LLM Agents, which are LLM-based agents that are deeply integrated with personal data and personal devices and used for personal assistance. We envision that Personal LLM Agents will become a major software paradigm for end-users in the upcoming era. To realize this vision, we take the first step to discuss several important questions about Personal LLM Agents, including their architecture, capability, efficiency and security. We start by summarizing the key components and design choices in the architecture of Personal LLM Agents, followed by an in-depth analysis of the opinions collected from domain experts. Next, we discuss several key challenges to achieve intelligent, efficient and secure Personal LLM Agents, followed by a comprehensive survey of representative solutions to address these challenges. +",https://arxiv.org/abs/2401.05459,2024,Arxiv,,,🚫重复 +"Large Model Based Agents: State-of-the-Art, Cooperation Paradigms, Security and Privacy, and Future Trends","This survey explores future LM agent autonomous collaboration, covering current state, key tech, security/privacy, and proposes future research directions. ",探讨未来大模型智能体自主协作场景,介绍现状、协作范式,分析安全隐私问题并给出研究方向。 ,Introduction,Survey,"With the rapid advancement of large models (LMs), the development of general-purpose intelligent agents powered by LMs has become a reality. It is foreseeable that in the near future, LM-driven general AI agents will serve as essential tools in production tasks, capable of autonomous communication and collaboration without human intervention. This paper investigates scenarios involving the autonomous collaboration of future LM agents. 
We review the current state of LM agents, the key technologies enabling LM agent collaboration, and the security and privacy challenges they face during cooperative operations. To this end, we first explore the foundational principles of LM agents, including their general architecture, key components, enabling technologies, and modern applications. We then discuss practical collaboration paradigms from data, computation, and knowledge perspectives to achieve connected intelligence among LM agents. After that, we analyze the security vulnerabilities and privacy risks associated with LM agents, particularly in multi-agent settings, examining underlying mechanisms and reviewing current and potential countermeasures. Lastly, we propose future research directions for building robust and secure LM agent ecosystems. +",https://arxiv.org/abs/2409.14457,2024,Arxiv,,, +"The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey","This survey examines AI agent advancements, outlines current capabilities/limitations, and suggests future design considerations for robust systems. ",论文调研新兴AI代理架构,介绍单/多代理架构,明确选型要点、领导作用等关键要素推动设计发展。 ,Introduction,Survey,"This survey paper examines the recent advancements in AI agent implementations, with a focus on their ability to achieve complex goals that require enhanced reasoning, planning, and tool execution capabilities. The primary objectives of this work are to a) communicate the current capabilities and limitations of existing AI agent implementations, b) share insights gained from our observations of these systems in action, and c) suggest important considerations for future developments in AI agent design. We achieve this by providing overviews of single-agent and multi-agent architectures, identifying key patterns and divergences in design choices, and evaluating their overall impact on accomplishing a provided goal. Our contribution outlines key themes when selecting an agentic architecture, the impact of leadership on agent systems, agent communication styles, and key phases for planning, execution, and reflection that enable robust AI agent systems.",https://arxiv.org/abs/2404.11584,2024,Arxiv,,, +"Exploring Large Language Model based Intelligent Agents: Definitions, Methods, and Prospects","This survey paper provides an overview of LLM-based intelligent agents in single- and multi-agent systems, covers key aspects and deployment mechanisms, and envisions prospects. ",本文全面概述单/多智能体系统的大模型智能体,涉定义、框架等,探讨部署机制并展望前景 ,Introduction,Survey,"Intelligent agents stand out as a potential path toward artificial general intelligence (AGI). Thus, researchers have dedicated significant effort to diverse implementations for them. Benefiting from recent progress in large language models (LLMs), LLM-based agents that use universal natural language as an interface exhibit robust generalization capabilities across various applications -- from serving as autonomous general-purpose task assistants to applications in coding, social, and economic domains, LLM-based agents offer extensive exploration opportunities. This paper surveys current research to provide an in-depth overview of LLM-based intelligent agents within single-agent and multi-agent systems. It covers their definitions, research frameworks, and foundational components such as their composition, cognitive and planning methods, tool utilization, and responses to environmental feedback. 
We also delve into the mechanisms of deploying LLM-based agents in multi-agent systems, including multi-role collaboration, message passing, and strategies to alleviate communication issues between agents. The discussions also shed light on popular datasets and application scenarios. We conclude by envisioning prospects for LLM-based agents, considering the evolving landscape of AI and natural language processing. +",https://arxiv.org/abs/2401.03428,2024,Arxiv,,,🚫重复 +Position Paper: Agent AI Towards a Holistic Intelligence,"This paper proposes Agent Foundation Model for embodied intelligent behavior, discusses Agent AI's capabilities and potential, guiding future research. ",论文聚焦 Agent AI,提出 Agent 基础模型,探讨其多领域能力、跨学科潜力,为研究指引方向。 ,Introduction,Survey,"Recent advancements in large foundation models have remarkably enhanced our understanding of sensory information in open-world environments. In leveraging the power of foundation models, it is crucial for AI research to pivot away from excessive reductionism and toward an emphasis on systems that function as cohesive wholes. Specifically, we emphasize developing Agent AI -- an embodied system that integrates large foundation models into agent actions. The emerging field of Agent AI spans a wide range of existing embodied and agent-based multimodal interactions, including robotics, gaming, and healthcare systems, etc. In this paper, we propose a novel large action model to achieve embodied intelligent behavior, the Agent Foundation Model. On top of this idea, we discuss how agent AI exhibits remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. Furthermore, we discuss the potential of Agent AI from an interdisciplinary perspective, underscoring AI cognition and consciousness within scientific discourse. We believe that those discussions serve as a basis for future research directions and encourage broader societal engagement.",https://arxiv.org/abs/2403.00833,2024,Arxiv,,, +AgentBench: Evaluating LLMs as Agents,"The paper presents AgentBench, a multi-dimensional benchmark for evaluating LLMs as agents. It analyzes failure reasons and suggests ways to improve agent performance. ",提出多维基准 AgentBench 评估大模型作智能体能力,指出能力短板及提升方向。 ,Datasets & Benchmarks,Benchmark,"The potential of Large Language Model (LLM) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional benchmark that consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities. Our extensive test over 29 API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in performance between them and many OSS competitors that are no larger than 70B. We identify the typical reasons for failures in environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction following abilities are the main obstacles for developing usable LLM agents. Improving instruction following and training on high quality multi-round alignment data could improve agent performance. And different from existing assumptions, training on code presents ambivalent impacts on different agent tasks. 
Datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench.",https://openreview.net/pdf?id=zAdUB0aCTQ,2024,ICLR,,, +AgentHarm: Benchmarking Robustness of LLM Agents on Harmful Tasks,"The paper proposes AgentHarm, a new benchmark for LLM agent misuse, and releases it to enable evaluation of attacks and defenses. ",提出新基准 AgentHarm 助力研究大模型智能体滥用,公开该基准方便攻防评估。 ,Datasets & Benchmarks,Benchmark,"The robustness of LLMs to jailbreak attacks, where users design prompts to circumvent safety measures and misuse model capabilities, has been studied primarily for LLMs acting as simple chatbots. Meanwhile, LLM agents---which use external tools and can execute multi-stage tasks---may pose a greater risk if misused, but their robustness remains underexplored. To facilitate research on LLM agent misuse, we propose a new benchmark called AgentHarm. The benchmark includes a diverse set of 110 explicitly malicious agent tasks (440 with augmentations), covering 11 harm categories including fraud, cybercrime, and harassment. In addition to measuring whether models refuse harmful agentic requests, scoring well on AgentHarm requires jailbroken agents to maintain their capabilities following an attack to complete a multi-step task. We evaluate a range of leading LLMs, and find (1) leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking, (2) simple universal jailbreak strings can be adapted to effectively jailbreak agents, and (3) these jailbreaks enable coherent and malicious multi-step agent behavior and retain model capabilities. We publicly release AgentHarm in the supplementary material to enable simple and reliable evaluation of attacks and defenses for LLM-based agents.",https://openreview.net/pdf?id=AC5n7xHuR1,2025,ICLR,,,🚫重复 +AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents,"The paper proposes AgentQuest, a modular benchmark framework with extensible benchmarks/metrics and new evaluation metrics to track LLM agent progress. ",提出 AgentQuest 框架,其基准和指标模块化、易扩展,还提供新评估指标,有望推动研究发展。 ,Datasets & Benchmarks,Benchmark,"The advances made by Large Language Models (LLMs) have led to the pursuit of LLM agents that can solve intricate, multi-step reasoning tasks. As with any research pursuit, benchmarking and evaluation are key cornerstones to efficient and reliable progress. However, existing benchmarks are often narrow and simply compute overall task success. To face these issues, we propose AgentQuest – a framework where (i) both benchmarks and metrics are modular and easily extensible through well documented and easy-to-use APIs; (ii) we offer two new evaluation metrics that can reliably track LLM agent progress while solving a task. We exemplify the utility of the metrics on two use cases wherein we identify common failure points and refine the agent architecture to obtain a significant performance increase. Together with the research community, we hope to extend AgentQuest further and therefore we make it available under https://github.com/nec-research/agentquest.",https://aclanthology.org/2024.naacl-demo.19.pdf,2024,*ACL,,, +AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator,"The paper introduces AI Hospital to emulate medical interactions, develops MVME benchmark, and proposes dispute-resolution mechanism to enhance LLMs' clinical abilities. 
",论文提出AI Hospital框架和MVME基准,设争端解决机制,推动评估LLM临床应用能力研究。 ,Datasets & Benchmarks,Benchmark,"Artificial intelligence has significantly revolutionized healthcare, particularly through large language models (LLMs) that demonstrate superior performance in static medical question answering benchmarks. However, evaluating the potential of LLMs for real-world clinical applications remains challenging due to the intricate nature of doctor-patient interactions. To address this, we introduce AI Hospital, a multi-agent framework emulating dynamic medical interactions between Doctor as player and NPCs including Patient and Examiner. This setup allows for more practical assessments of LLMs in simulated clinical scenarios. We develop the Multi-View Medical Evaluation (MVME) benchmark, utilizing high-quality Chinese medical records and multiple evaluation strategies to quantify the performance of LLM-driven Doctor agents on symptom collection, examination recommendations, and diagnoses. Additionally, a dispute resolution collaborative mechanism is proposed to enhance medical interaction capabilities through iterative discussions. Despite improvements, current LLMs (including GPT-4) still exhibit significant performance gaps in multi-turn interactive scenarios compared to non-interactive scenarios. Our findings highlight the need for further research to bridge these gaps and improve LLMs’ clinical decision-making capabilities. Our data, code, and experimental results are all open-sourced at https://github.com/LibertFan/AI_Hospital.",https://aclanthology.org/2025.coling-main.680.pdf,2025,*ACL,,, +BENCHAGENTS: Automated Benchmark Creation with Agent Interaction,"The paper introduces BENCHAGENTS, an LLM - based framework to automate benchmark creation. It decomposes the process and uses agents with human feedback. ",提出 BENCHAGENTS 框架,利用大模型自动创建复杂能力基准,确保��量,还用于评估模型能力。 ,Datasets & Benchmarks,Benchmark,"Evaluations are limited by benchmark availability. As models evolve, there is a need to create benchmarks that can measure progress on new generative capabilities. However, creating new benchmarks through human annotations is slow and expensive, restricting comprehensive evaluations for any capability. We introduce BENCHAGENTS, a framework that methodically leverages large language models (LLMs) to automate benchmark creation for complex capabilities while inherently ensuring data and metric quality. BENCHAGENTS decomposes the benchmark creation process into planning, generation, data verification, and evaluation, each of which is executed by an LLM agent. These agents interact with each other and utilize human-in-the-loop feedback from benchmark developers to explicitly improve and flexibly control data diversity and quality. We use BENCHAGENTS to create benchmarks to evaluate capabilities related to planning and constraint satisfaction during text generation. We then use these benchmarks to study seven state-of-the-art models and extract new insights on common failure modes and model differences.",https://arxiv.org/pdf/2410.22584,2024,Arxiv,,, +Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation,"This paper presents a multi - agent benchmark self - evolving framework for dynamic LLM evaluation, aiming to extend benchmarks and assist informed model selection. ",论文提出基准自进化框架动态评估大模型,用多智能体扩展实例,利于选模,助力基准持续进化。 ,Datasets & Benchmarks,Benchmark,"This paper presents a benchmark self-evolving framework to dynamically evaluate rapidly advancing Large Language Models (LLMs). 
We utilize a multi-agent system to reframe new evolving instances with high confidence that extend existing benchmarks. Towards a more scalable, robust and fine-grained evaluation, we implement six reframing operations to construct evolving instances testing LLMs against diverse queries, shortcut biases and probing their problem-solving sub-abilities. With this framework, we extend datasets across general and specific tasks, through various iterations. Experimental results show a performance decline in most LLMs against their original results under scalable and robust evaluations, offering a more accurate reflection of model capabilities alongside our fine-grained evaluation. Besides, our framework widens performance discrepancies both between different models and within the same model across various tasks, facilitating more informed model selection for specific tasks. We hope this framework contributes to the research community for continuously evolving benchmarks alongside LLM development.",https://aclanthology.org/2025.coling-main.223.pdf,2025,*ACL,,,🚫重复 +Benchmarking Data Science Agents,"The paper introduces DSEval and benchmarks for data science agents, streamlining dataset prep and offering insights for future progress. ",论文提出DSEval评估范式及创新基准,优化数据集准备,为数据科学代理性能评估及领域发展提供助力。 ,Datasets & Benchmarks,Benchmark,"In the era of data-driven decision-making, the complexity of data analysis necessitates advanced expertise and tools of data science, presenting significant challenges even for specialists. Large Language Models (LLMs) have emerged as promising aids as data science agents, assisting humans in data analysis and processing. Yet their practical efficacy remains constrained by the varied demands of real-world applications and complicated analytical process. In this paper, we introduce DSEval – a novel evaluation paradigm, as well as a series of innovative benchmarks tailored for assessing the performance of these agents throughout the entire data science lifecycle. Incorporating a novel bootstrapped annotation method, we streamline dataset preparation, improve the evaluation coverage, and expand benchmarking comprehensiveness. Our findings uncover prevalent obstacles and provide critical insights to inform future advancements in the field.",https://aclanthology.org/2024.acl-long.308.pdf,2024,*ACL,,, +Benchmarking Large Language Models as AI Research Agents,"The paper proposes MLAgentBench for benchmarking AI research agents in ML tasks, designs an LLM-based agent, and identifies key challenges. ",提出MLAgentBench基准套件评估AI研究代理,设计LLM代理自动实验,指出关键挑战并开源代码。 ,Datasets & Benchmarks,Benchmark,"Human researchers can perform scientific experimentation loops – planning, experimenting, observing the results, and generating inferences. Can we build AI research agents to perform the same? To take a step towards building and evaluating research agents capable of such open-ended decision-making, we focus on the problem of having agents perform machine learning (ML) tasks given a research problem description and dataset. In this paper, we propose MLAgentBench, a suite of ML tasks for benchmarking AI research agents. Agents can perform actions like file system operations, executing code, and inspecting outputs. With these actions, agents could run experiments, analyze the results, and modify the code of entire machine learning pipelines, such as data processing, architecture, training processes, etc. 
The benchmark then automatically evaluates the agent’s performance objectively over various metrics related to performance and efficiency. We also design an LLM-based research agent to automatically perform experimentation loops in such an environment. Empirically, we find that a GPT-4-based research agent can feasibly build compelling ML models over many tasks in MLAgentBench, displaying highly interpretable plans and actions. However, the success rates vary considerably; they span from almost 90% on well-established older datasets to as low as 10% on recent Kaggle Challenges – unavailable during the LLM model’s pretraining – and even 0% on newer research challenges like BabyLM. Finally, we identify several key challenges for LLM-based research agents such as long-term planning and hallucination. Our code is released at https://anonymous.4open.science/r/MLAgentBench/.",https://openreview.net/pdf?id=N9wD4RFWY0,2024,ICLR,,, +"Benchmarking Large Language Models for Multi-agent Systems: A Comparative Analysis of AutoGen, CrewAI, and TaskWeaver","This paper benchmarks and comparatively analyzes AutoGen, CrewAI, and TaskWeaver in multi-agent systems, using ML code gen. for evaluation. ",该文对基于大模型的三款多智能体系统做对比分析,以机器学习代码生成评估性能,有积极参考价值。 ,Datasets & Benchmarks,Benchmark,"This paper presents the benchmarking of three multi-agent systems powered by large language models. The paper presents a comparative analysis of AutoGen, CrewAI, and TaskWeaver. Nowadays, large language models have emerged as powerful tools able to assist users in various areas. The integration of large language models into multi-agent systems increases their potential for collaborative problem-solving. This study focuses on a case study involving a machine learning code generation task which is used to evaluate the framework’s performance. To assess the performance of the solutions, it is requested to create energy forecasting models using the same dataset as the base. After producing the code, a new dataset is used to test the model performance using the root mean square error. The three solutions were able to provide results using multiple large language models. The best result was achieved by TaskWeaver using GPT-3.5, with an error of 25.04.",https://link.springer.com/chapter/10.1007/978-3-031-70415-4_4,2024,Others,,, +BLADE: Benchmarking Language Model Agents,"The paper presents BLADE, a benchmark to evaluate agents on open-ended data-driven science tasks, enabling insights into agents' analysis approaches. ",提出BLADE基准,用于自动评估大模型智能体处理开放性科研问题能力,助力数据驱动科学研究评估。 ,Datasets & Benchmarks,Benchmark,"Data-driven scientific discovery requires the iterative integration of scientific domain knowledge, statistical expertise, and an understanding of data semantics to make nuanced analytical decisions, e.g., about which variables, transformations, and statistical models to consider. LM-based agents equipped with planning, memory, and code execution capabilities have the potential to support data-driven science. However, evaluating agents on such open-ended tasks is challenging due to multiple valid approaches, partially correct steps, and different ways to express the same decisions. To address these challenges, we present BLADE, a benchmark to automatically evaluate agents’ multifaceted approaches to open-ended research questions. BLADE consists of 12 datasets and research questions drawn from existing scientific literature, with ground truth collected from independent analyses by expert data scientists and researchers. 
To automatically evaluate agent responses, we developed corresponding computational methods to match different representations of analyses to this ground truth. Though language models possess considerable world knowledge, our evaluation shows that they are often limited to basic analyses. However, agents capable of interacting with the underlying data demonstrate improved, but still non-optimal, diversity in their analytical decision making. Our work enables the evaluation of agents for data-driven science and provides researchers deeper insights into agents’ analysis approaches.",https://aclanthology.org/2024.findings-emnlp.815.pdf,2024,*ACL,,, +CRAB: Cross-platform agent benchmark for multi-modal embodied language model agents,"This paper introduces CRAB, the first agent benchmark framework for cross-environment tasks, with fine-grained eval. and easy task construction. ",提出CRAB,首个支持跨环境任务的基准框架,含细粒度评估法等,还开发跨平台基准测试。 ,Datasets & Benchmarks,Benchmark,"The development of autonomous agents increasingly relies on Multimodal Language Models (MLMs) to perform tasks described in natural language with GUI environments, such as websites, desktop computers, or mobile phones. Existing benchmarks for MLM agents in interactive environments are limited by their focus on a single environment, lack of detailed and generalized evaluation methods, and the complexities of constructing tasks and evaluators. To overcome these limitations, we introduce Crab, the first agent benchmark framework designed to support cross-environment tasks, incorporating a graph-based fine-grained evaluation method and an efficient mechanism for task and evaluator construction. Our framework supports multiple devices and can be easily extended to any environment with a Python interface. Leveraging Crab, we developed a cross-platform Crab Benchmark-v0 comprising 100 tasks in computer desktop and mobile phone environments. We evaluated four advanced MLMs using different single and multi-agent system configurations on this benchmark. The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 35.26%.",https://openreview.net/pdf?id=kyExS4V0H7,2024,NIPS,,, +CToolEval: A Chinese Benchmark for LLM-Powered Agent Evaluation in Real-World API Interactions,"The paper proposes CToolEval, a Chinese benchmark with 398 APIs for evaluating LLMs in real-world API interactions, releasing data and codes to promote research. ",提出CToolEval基准,含398个API覆盖14领域,有评估框架,公开数据代码促LLM作可靠代理研究 ,Datasets & Benchmarks,Benchmark,"Assessing the capabilities of large language models (LLMs) as agents in decision making and operational tasks is crucial for the development of LLM-as-agent service. We propose CToolEval, a benchmark designed to evaluate LLMs in the context of Chinese societal applications, featuring 398 APIs across 27 widely-used Apps (e.g., Apps for shopping, map, music, travel, etc.) that cover 14 domains. We further present an evaluation framework that simulates real-life scenarios, to facilitate the assessment of tool invocation ability of LLMs for tool learning and task completion ability for user interaction. Our extensive experiments with CToolEval evaluate 11 LLMs, revealing that while GPT-3.5-turbo excels in tool invocation, Chinese LLMs usually struggle with issues like hallucination and a lack of comprehensive tool understanding. 
Our findings highlight the need for further refinement in decision-making capabilities of LLMs, offering insights into bridging the gap between current functionalities and agent-level performance. To promote further research for LLMs to fully act as reliable agents in complex, real-world situations, we release our data and codes at https://github.com/tjunlp-lab/CToolEval.",https://aclanthology.org/2024.findings-acl.928.pdf,2024,*ACL,,, +DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models,"The paper introduces DA-Code, a benchmark for LLMs on agent-based data science tasks, with unique features and a controllable setup, and releases it on GitHub. ",本文提出DA-Code基准评估LLMs在代理式数据科学任务上的表现,具多特性,还开发DA-Agent基线。 ,Datasets & Benchmarks,Benchmark,"We introduce DA-Code, a code generation benchmark specifically designed to assess LLMs on agent-based data science tasks. This benchmark features three core elements: First, the tasks within DA-Code are inherently challenging, setting them apart from traditional code generation tasks and demanding advanced coding skills in grounding and planning. Second, examples in DA-Code are all based on real and diverse data, covering a wide range of complex data wrangling and analytics tasks. Third, to solve the tasks, the models must utilize complex data science programming languages, including Python and SQL, to perform intricate data processing and derive the answers. We set up the benchmark in a controllable and executable environment that aligns with real-world data analysis scenarios and is scalable. The annotators meticulously designed the evaluation suite to ensure the accuracy and robustness of the evaluation. We developed the DA-Agent baseline. Experiments show that although the baseline performs better than other existing frameworks, using the current best LLMs achieves only 30.5% accuracy, leaving ample room for improvement. We release our benchmark at [link](https://github.com/yiyihum/dabench)",https://aclanthology.org/2024.emnlp-main.748.pdf,2024,*ACL,,, +DCA-Bench: A Benchmark for Dataset Curation Agents,"The paper establishes DCA-Bench, a benchmark for dataset curation agents, using cases from 8 platforms and an auto-eval framework to measure LLM agents' real-world issue-detection ability. ",本文建立基准测大模型代理处理数据集问题能力,精选测试用例、提评估框架,强调应用待深入探索。 ,Datasets & Benchmarks,Benchmark,"The quality of datasets plays an increasingly crucial role in the research and development of modern artificial intelligence (AI). Despite the proliferation of open dataset platforms nowadays, data quality issues, such as incomplete documentation, inaccurate labels, ethical concerns, and outdated information, remain common in widely used datasets. Furthermore, these issues are often subtle and difficult to be detected by rule-based scripts, therefore requiring identification and verification by dataset users or maintainers--a process that is both time-consuming and prone to human mistakes. With the surging ability of large language models (LLM), it’s promising to streamline the discovery of hidden dataset issues with LLM agents. To achieve this, one significant challenge is enabling LLM agents to detect issues in the wild rather than simply fixing known ones. In this work, we establish a benchmark to measure LLM agent’s ability to tackle this challenge. We carefully curate 221 representative test cases from eight popular dataset platforms and propose an automatic evaluation framework using GPT-4. 
Our proposed framework shows strong empirical alignment with expert evaluations, validated through extensive comparisons with human annotations. Without any hints, a baseline GPT-4 agent can only reveal 11% of the data quality issues in the proposed dataset, highlighting the complexity of this task and indicating that applying LLM agents to real-world dataset curation still requires further in-depth exploration and innovation.",https://openreview.net/pdf?id=a4sknPttwV,2025,ICLR,,, +Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making,"This paper proposes an Embodied Agent Interface to benchmark LLMs for embodied decision making, unifying tasks, modules, and metrics. ",提出具身代理接口,统一多种任务、模块及评估指标,全面评估大语言模型在具身决策表现。 ,Datasets & Benchmarks,Benchmark,"We aim to evaluate Large Language Models (LLMs) for embodied decision making. While a significant body of work has been leveraging LLMs for decision making in embodied environments, we still lack a systematic understanding of their performance because they are usually applied in different domains, for different purposes, and built based on different inputs and outputs. Furthermore, existing evaluations tend to rely solely on a final success rate, making it difficult to pinpoint what ability is missing in LLMs and where the problem lies, which in turn blocks embodied agents from leveraging LLMs effectively and selectively. To address these limitations, we propose a generalized interface (Embodied Agent Interface) that supports the formalization of various types of tasks and input-output specifications of LLM-based modules. Specifically, it allows us to unify 1) a broad set of embodied decision-making tasks involving both state and temporally extended goals, 2) four commonly-used LLM-based modules for decision making: goal interpretation, subgoal decomposition, action sequencing, and transition modeling, and 3) a collection of fine-grained metrics that break down evaluation into error types, such as hallucination errors, affordance errors, and various types of planning errors. Overall, our benchmark offers a comprehensive assessment of LLMs’ performance for different subtasks, pinpointing the strengths and weaknesses in LLM-powered embodied AI systems and providing insights into the effective and selective use of LLMs in embodied decision making. +",https://proceedings.neurips.cc/paper_files/paper/2024/hash/b631da756d1573c24c9ba9c702fde5a9-Abstract-Datasets_and_Benchmarks_Track.html,2024,NIPS,,,🚫重复 +GTA: A Benchmark for General Tool Agents,"The paper proposes GTA, a benchmark for general tool agents with real queries, deployed tools, and multimodal inputs, aiding agent advancement. ",提出通用工具代理基准 GTA,含真实用户查询、部署工具和多模态输入,助力通用工具代理发展。 ,Datasets & Benchmarks,Benchmark,"In developing general-purpose agents, significant focus has been placed on integrating large language models (LLMs) with various tools. This poses a challenge to the tool-use capabilities of LLMs. However, there are evident gaps between existing tool evaluations and real-world scenarios. Current evaluations often use AI-generated queries, single-step tasks, dummy tools, and text-only inputs, which fail to reveal the agents' real-world problem-solving abilities effectively. To address this, we propose GTA, a benchmark for General Tool Agents, featuring three main aspects: (i) Real user queries: human-written queries with simple real-world objectives but implicit tool-use, requiring the LLM to reason the suitable tools and plan the solution steps. 
(ii) Real deployed tools: an evaluation platform equipped with tools across perception, operation, logic, and creativity categories to evaluate the agents' actual task execution performance. (iii) Real multimodal inputs: authentic image files, such as spatial scenes, web page screenshots, tables, code snippets, and printed/handwritten materials, used as the query contexts to align with real-world scenarios closely. We designed 229 real-world tasks and executable tool chains to evaluate mainstream LLMs. Our findings show that real-world user queries are challenging for existing LLMs, with GPT-4 completing less than 50% of the tasks and most LLMs achieving below 25%. This evaluation reveals the bottlenecks in the tool-use capabilities of current LLMs in real-world scenarios, which is beneficial for the advancement of general-purpose tool agents. Dataset and code are available at https://github.com/open-compass/GTA.",https://proceedings.neurips.cc/paper_files/paper/2024/file/8a75ee6d4b2eb0b777f549a32a5a5c28-Paper-Datasets_and_Benchmarks_Track.pdf,2024,NIPS,,, +LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs,"The paper presents LaMPilot, integrating LLMs into AD systems, and LaMPilot-Bench. It shows LLMs' potential in AD and releases code and data. ",提出LaMPilot框架将大模型集成到自动驾驶系统,引入LaMPilot-Bench数据集,发布代码数据促研究。 ,Datasets & Benchmarks,Benchmark,"Autonomous driving (AD) has made significant strides in recent years. However, existing frameworks struggle to interpret and execute spontaneous user instructions, such as ""overtake the car ahead."" Large Language Models (LLMs) have demonstrated impressive reasoning capabilities showing potential to bridge this gap. In this paper, we present LaMPilot, a novel framework that integrates LLMs into AD systems, enabling them to follow user instructions by generating code that leverages established functional primitives. We also introduce LaMPilot-Bench, the first benchmark dataset specifically designed to quantitatively evaluate the efficacy of language model programs in AD. Adopting the LaMPilot framework, we conduct extensive experiments to assess the performance of off-the-shelf LLMs on LaMPilot-Bench. Our results demonstrate the potential of LLMs in handling diverse driving scenarios and following user instructions in driving. To facilitate further research in this area, we release our code and data at GitHub.com/PurdueDigitalTwin/LaMPilot.",,2024,IEEE,,, +MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents,"The paper introduces MedAgentBench, an eval suite for medical LLMs' agent capabilities, with real-world data and a migratable env, guiding model optimization. ",文章提出 MedAgentBench 评估套件,含临床任务等,可评估医学大模型代理能力,为领域模型优化提供框架。 ,Datasets & Benchmarks,Benchmark,"Recent large language models (LLMs) have demonstrated significant advancements, particularly in their ability to serve as agents thereby surpassing their traditional role as chatbots. These agents can leverage their planning and tool utilization capabilities to address tasks specified at a high level. However, a standardized dataset to benchmark the agent capabilities of LLMs in medical applications is currently lacking, making the evaluation of LLMs on complex tasks in interactive healthcare environments challenging. To address this gap, we introduce MedAgentBench, a broad evaluation suite designed to assess the agent capabilities of large language models within medical records contexts. 
MedAgentBench encompasses 300 patient-specific clinically-derived tasks from 10 categories written by human physicians, realistic profiles of 100 patients with over 700,000 data elements, a FHIR-compliant interactive environment, and an accompanying codebase. The environment uses the standard APIs and communication infrastructure used in modern EMR systems, so it can be easily migrated into live EMR systems. MedAgentBench presents an unsaturated agent-oriented benchmark that current state-of-the-art LLMs exhibit some ability to succeed at. The best model (Claude 3.5 Sonnet v2) achieves a success rate of 69.67%. However, there is still substantial space for improvement which gives the community a next direction to optimize. Furthermore, there is significant variation in performance across task categories. MedAgentBench establishes this and is publicly available at this https URL, offering a valuable framework for model developers to track progress and drive continuous improvements in the agent capabilities of large language models within the medical domain.",https://arxiv.org/pdf/2501.14654,2025,Arxiv,,, +ML Research Benchmark,"The paper presents the ML Research Benchmark (MLRB) with 7 tasks for AI agents, offering a framework to assess them in real-world research. ",提出ML研究基准MLRB,含7个竞赛级任务,为评估对比AI智能体应对真实研究挑战提供框架 ,Datasets & Benchmarks,Benchmark,"Artificial intelligence agents are increasingly capable of performing complex tasks across various domains. As these agents advance, there is a growing need to accurately measure and benchmark their capabilities, particularly in accelerating AI research and development. Current benchmarks focus on general machine learning tasks, but lack comprehensive evaluation methods for assessing AI agents’ abilities in tackling research-level problems and competition-level challenges in the field of AI. We present the ML Research Benchmark (MLRB), comprising 7 competition-level tasks derived from recent machine learning conference tracks. These tasks span activities typically undertaken by AI researchers, including model training efficiency, pretraining on limited data, domain-specific fine-tuning, and model compression. This paper introduces a novel benchmark and evaluates it using agent scaffolds powered by frontier models, including Claude-3 and GPT-4o. The results indicate that the Claude-3.5 Sonnet agent performs best across our benchmark, excelling in planning and developing machine learning models. However, both tested agents struggled to perform non-trivial research iterations. We observed significant performance variations across tasks, highlighting the complexity of AI development and the challenges in creating versatile agent scaffolds. While current AI agents can successfully navigate complex instructions and produce baseline results, they fall short of the capabilities required for advanced AI research. The ML Research Benchmark provides a valuable framework for assessing and comparing AI agents on tasks mirroring real-world AI research challenges.",https://arxiv.org/pdf/2410.22553,2024,Arxiv,,, +MLE-Bench: Evaluating Machine Learning Agents on Machine Learning Engineering,"The paper introduces MLE-bench, a benchmark for AI agents in ML engineering. It provides baselines, evaluates models, and open-sources code for future research. ",本文推出MLE-bench基准,用Kaggle竞赛测AI代理ML工程能力,开源代码助力相关研究。 ,Datasets & Benchmarks,Benchmark,"We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering. 
To this end, we curate 71 ML engineering-related competitions from Kaggle, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments. We establish human baselines for each competition using Kaggle's publicly available leaderboards. We use open-source agent scaffolds to evaluate several frontier language models on our benchmark, finding that the best-performing setup — OpenAI's o1-preview with AIDE scaffolding — achieves at least the level of a Kaggle bronze medal in 17.3% of competitions. In addition to our main results, we investigate various forms of resource-scaling for AI agents and the impact of contamination from pre-training. We open-source our benchmark code to facilitate future research in understanding the ML engineering capabilities of AI agents.",https://openreview.net/pdf?id=6s5uXNWGIh,2025,ICLR,,, +MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains,"The paper introduces the MMAU benchmark with offline tasks across five domains, evaluating five key capabilities of LLM agents, enhancing interpretability. ",提出MMAU基准,含综合离线任务,跨五领域评五大能力,提升大模型智能体性能可解释性 ,Datasets & Benchmarks,Benchmark,"Recent advances in large language models (LLMs) have increased the demand for comprehensive benchmarks to evaluate their capabilities as human-like agents. Existing benchmarks, while useful, often focus on specific application scenarios, emphasizing task completion but failing to dissect the underlying skills that drive these outcomes. This lack of granularity makes it difficult to deeply discern where failures stem from. Additionally, setting up these environments requires considerable effort, and issues of unreliability and reproducibility sometimes arise, especially in interactive tasks. To address these limitations, we introduce the Massive Multitask Agent Understanding (MMAU) benchmark, featuring comprehensive offline tasks that eliminate the need for complex environment setups. It evaluates models across five domains, including Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming and Mathematics, and covers five essential capabilities: Understanding, Reasoning, Planning, Problem-solving, and Self-correction. With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents. By testing 18 representative models on MMAU, we provide deep and insightful analyses. Ultimately, MMAU not only sheds light on the capabilities and limitations of LLM agents but also enhances the interpretability of their performance. Datasets and evaluation scripts of MMAU are released at this https URL.",https://arxiv.org/pdf/2407.18961,2024,Arxiv,,, +OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web,"The paper introduces OmniACT, a dataset and benchmark for assessing agents' ability to automate diverse computer tasks, inspiring multimodal model research. ",论文提出 OmniACT 数据集与基准,涵盖桌面应用,评估生成程序能力,推动多模态模型发展。 ,Datasets & Benchmarks,Dataset&Benchmark,"For decades, human-computer interaction has fundamentally been manual. Even today, almost all productive work done on the computer necessitates human input at every step. Autonomous virtual agents represent an exciting step in automating many of these menial tasks. 
Virtual agents would empower users with limited technical proficiency to harness the full possibilities of computer systems. They could also enable the efficient streamlining of numerous computer tasks, ranging from calendar management to complex travel bookings, with minimal human intervention. In this paper, we introduce OmniACT, the first-of-a-kind dataset and benchmark for assessing an agent’s capability to generate executable programs to accomplish computer tasks. Our scope extends beyond traditional web automation, covering a diverse range of desktop applications. The dataset consists of fundamental tasks such as “Play the next song”, as well as longer horizon tasks such as “Send an email to John Doe mentioning the time and place to meet”. Specifically, given a screen image and a visually-grounded natural language task, the goal is to generate a script capable of fully executing the task. We run several strong baseline language model agents on our benchmark. The strongest baseline, GPT-4, performs the best on our benchmark. However, its performance still reaches only 15% of human proficiency in generating executable scripts capable of completing the task, demonstrating the challenge of our task for conventional web agents. Our benchmark provides a platform to measure and evaluate the progress of language model agents in automating computer tasks and motivates future work towards building multimodal models that bridge large language models and the visual grounding of computer screens.",https://arxiv.org/pdf/2402.17553,2024,CVPR/ICCV/ECCV,,,
+OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments,"This paper introduces OSWorld, a scalable real computer environment for multimodal agents, creating a benchmark to aid in developing such agents. ",推出 OSWorld 这一可扩展真实计算机环境及 369 个任务基准,为开发多模态通用代理提供见解 ,Datasets & Benchmarks,Benchmark,"Autonomous agents that accomplish complex computer tasks with minimal human interventions have the potential to transform human-computer interaction, significantly enhancing accessibility and productivity. However, existing benchmarks either lack an interactive environment or are limited to environments specific to certain applications or domains, failing to reflect the diverse and complex nature of real-world computer use, thereby limiting the scope of tasks and agent scalability. To address this issue, we introduce OSWorld, the first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across various operating systems such as Ubuntu, Windows, and macOS. OSWorld can serve as a unified, integrated computer environment for assessing open-ended computer tasks that involve arbitrary applications. Building upon OSWorld, we create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications. Each task example is derived from real-world computer use cases and includes a detailed initial state setup configuration and a custom execution-based evaluation script for reliable, reproducible evaluation. Extensive evaluation of state-of-the-art LLM/VLM-based agents on OSWorld reveals significant deficiencies in their ability to serve as computer assistants. 
While humans can accomplish over 72.36% of the tasks, the best model achieves only 12.24% success, primarily struggling with GUI grounding and operational knowledge. Comprehensive analysis using OSWorld provides valuable insights for developing multimodal generalist agents that were not possible with previous benchmarks. Our code, environment, baseline models, and data are publicly available at this https URL.",https://proceedings.neurips.cc/paper_files/paper/2024/file/5d413e48f84dc61244b6be550f1cd8f5-Paper-Datasets_and_Benchmarks_Track.pdf,2024,NIPS,,,
+Revisiting Benchmark and Assessment: An Agent-based Exploratory Dynamic Evaluation Framework for LLMs,"The paper revisits LLM evaluation, introduces Benchmark+ and Assessment+, proposes TestAgent for dynamic evaluation across domains, offering a new perspective. ",提出Benchmark+和Assessment+概念,构建TestAgent框架,实现大模型跨领域动态评估,提供新视角与途径。 ,Datasets & Benchmarks,Benchmark,"While various vertical domain large language models (LLMs) have been developed, automatically evaluating their performance across different domains remains a critical challenge. Current benchmark-based methods often rely on static and costly datasets, are misaligned with practical user needs, and lack flexibility across domains. To address these limitations, we revisit the evaluation process and introduce two key concepts: Benchmark+, which extends the traditional question-answer benchmark into a more flexible “strategy-criterion” format; and Assessment+, which enhances the interaction process, enabling deeper exploration and supporting analysis from broader perspectives. We propose TestAgent, an agent-based evaluation framework that implements these concepts using retrieval-augmented generation and reinforcement learning. TestAgent enables automatic dynamic benchmark generation and in-depth assessment across diverse vertical domain scenarios. Experiments on tasks ranging from constructing multiple vertical domain evaluations to converting static benchmarks into dynamic forms demonstrate the effectiveness of TestAgent. This work offers an interesting perspective on automatic evaluation for LLMs and highlights a pathway for dynamic and domain-adaptive assessments.",https://arxiv.org/pdf/2410.11507,2024,Arxiv,,,
+Seal-Tools: Self-instruct Tool Learning Dataset for Agent Tuning and Detailed Benchmark,"The paper introduces Seal - Tools, a tool learning dataset. It proposes a self - instruct method, designs metrics, and can serve as a new LLM tool - calling benchmark. ",论文提出Seal - Tools数据集,用自指令法生成工具与实例,设评估指标,可作评估LLMs工具调用能力新基准 ,Datasets & Benchmarks,Benchmark (fine-tune),"This paper presents Seal-Tools, a new tool learning dataset. Seal-Tools not only offers a large number of tools, but also includes instances which demonstrate the practical application of tools. Seeking to generate data on a large scale while ensuring reliability, we propose a self-instruct method to generate tools and instances, allowing precise control over the process. Moreover, our Seal-Tools contains hard instances that call multiple tools to complete the job, among which some are nested tool callings. For precise and comprehensive evaluation, we use strict format control and design three metrics from different dimensions. Therefore, Seal-Tools can serve as a new benchmark to evaluate the tool-calling ability of LLMs. Finally, we evaluate several prevalent LLMs and our finetuned model on Seal-Tools. The results show that current systems are far from perfect. 
The code, data and experiment results are available at https://github.com/fairyshine/Seal-Tools.",https://arxiv.org/pdf/2405.08355,2024,Others,,,
+Tapilot-Crossing: Benchmarking and Evolving LLMs Towards Interactive Data Analysis Agents,The paper introduces Tapilot - Crossing for evaluating LLM agents in interactive data analysis and proposes AIR to evolve LLMs into effective agents. ,提出新基准 Tapilot - Crossing 评估大模型交互数据分析能力,还提出 AIR 策略助其进化。 ,Datasets & Benchmarks,Benchmark (fine-tune),"Interactive Data Analysis, the collaboration between humans and LLM agents, enables real-time data exploration for informed decision-making. The challenges and costs of collecting realistic interactive logs for data analysis hinder the quantitative evaluation of Large Language Model (LLM) agents in this task. To mitigate this issue, we introduce Tapilot-Crossing, a new benchmark to evaluate LLM agents on interactive data analysis. Tapilot-Crossing contains 1024 interactions, covering 4 practical scenarios: Normal, Action, Private, and Private Action. Notably, Tapilot-Crossing is constructed by an economical multi-agent environment, Decision Company, with little human effort. We evaluate popular and advanced LLM agents in Tapilot-Crossing, which underscores the challenges of interactive data analysis. Furthermore, we propose Adaptive Interaction Reflection (AIR), a self-generated reflection strategy that guides LLM agents to learn from successful history. Experiments demonstrate that AIR can evolve LLMs into effective interactive data analysis agents, achieving a relative performance improvement of up to 44.5%.",https://arxiv.org/pdf/2403.05307,2024,Arxiv,,,
+TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks,"This paper introduces TheAgentCompany, an extensible benchmark for evaluating AI agents on real - world tasks, simulating a software company environment. ",论文引入 TheAgentCompany 基准评估类似数字员工的大模型智能体,助力衡量其执行专业任务进展。 ,Datasets & Benchmarks,Benchmark,"We interact with computers on an everyday basis, be it in everyday life or work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and effect change in their surrounding environments. But how performant are AI agents at helping to accelerate or even autonomously perform work-related tasks? The answer to this question has important implications for both industry looking to adopt AI into their workflows, and for economic policy to understand the effects that adoption of AI may have on the labor market. To measure the performance of these LLM agents on real-world professional tasks, in this paper, we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in similar ways to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with other coworkers. We build a self-contained environment with internal web sites and data that mimics a small software company environment, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights language models (LMs), and find that with the most competitive agent, 24% of the tasks can be completed autonomously. 
This paints a nuanced picture of task automation with LM agents: in a setting simulating a real workplace, a good portion of simpler tasks could be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems.",https://arxiv.org/pdf/2412.14161,2024,Arxiv,,,
+Tur[k]ingBench: A Challenge Benchmark for Web Agents,"The paper presents TurkingBench, a web - agent benchmark using natural HTML pages, and a framework for evaluation to drive agent development. ",提出TurkingBench基准,用真实网页任务实例评估模型,开发框架链接回应与网页操作,推动网页智能体发展。 ,Datasets & Benchmarks,Benchmark,"Can advanced multi-modal models effectively tackle complex web-based tasks? Such tasks are often found on crowdsourcing platforms, where crowdworkers engage in challenging micro-tasks within web-based environments. Building on this idea, we present TurkingBench, a benchmark consisting of tasks presented as web pages with textual instructions and multi-modal contexts. Unlike previous approaches that rely on artificially synthesized web pages, our benchmark uses natural HTML pages originally designed for crowdsourcing workers to perform various annotation tasks. Each task's HTML instructions are instantiated with different values derived from crowdsourcing tasks, creating diverse instances. This benchmark includes 32.2K instances spread across 158 tasks. To support the evaluation of TurkingBench, we have developed a framework that links chatbot responses to actions on web pages (e.g., modifying a text box, selecting a radio button). We assess the performance of cutting-edge private and open-source models, including language-only and vision-language models (such as GPT4 and InternVL), on this benchmark. Our results show that while these models outperform random chance, there is still significant room for improvement. We hope that this benchmark will drive progress in the evaluation and development of web-based agents.",https://arxiv.org/pdf/2403.11905,2024,Arxiv,,,
+"Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey","This survey systematically overviews VLMs in aspects like model info, architectures, benchmarks, applications, and challenges, aiding domain - specific researchers. ",该综述系统性概述多模态VLM,涵盖模型、架构、基准等方面,还提及应用与挑战。 ,Datasets & Benchmarks,Survey,"Multimodal Vision Language Models (VLMs) have emerged as a transformative technology at the intersection of computer vision and natural language processing, enabling machines to perceive and reason about the world through both visual and textual modalities. For example, models such as CLIP, Claude, and GPT-4V demonstrate strong reasoning and understanding abilities on visual and textual data and beat classical single modality vision models on zero-shot classification. Despite their rapid advancements in research and growing popularity in applications, a comprehensive survey of existing studies on VLMs is notably lacking, particularly for researchers aiming to leverage VLMs in their specific domains. To this end, we provide a systematic overview of VLMs in the following aspects: model information of the major VLMs developed over the past five years (2019-2024); the main architectures and training methods of these VLMs; summary and categorization of the popular benchmarks and evaluation metrics of VLMs; the applications of VLMs including embodied agents, robotics, and video generation; the challenges and issues faced by current VLMs such as hallucination, fairness, and safety. 
Detailed collections including papers and model repository links are listed at https://github.com/zli12321/Awesome-VLM-Papers-And-Models.git.",https://arxiv.org/pdf/2501.02189,2025,Arxiv,,,
+Large Language Model based Multi-Agents: A Survey of Progress and Challenges,"This survey discusses essential aspects and challenges of LLM - based multi - agent systems, summarizes datasets, and maintains a GitHub repo for latest studies. ",该综述探讨LLM - MA系统关键方面与挑战,总结常用数据集,维护开源库更新研究动态。 ,Datasets & Benchmarks,Survey,"Large Language Models (LLMs) have achieved remarkable success across a wide array of tasks. Due to their notable capabilities in planning and reasoning, LLMs have been utilized as autonomous agents for the automatic execution of various tasks. Recently, LLM-based agent systems have rapidly evolved from single-agent planning or decision-making to operating as multi-agent systems, enhancing their ability in complex problem-solving and world simulation. To offer an overview of this dynamic field, we present this survey to offer an in-depth discussion on the essential aspects and challenges of LLM-based multi-agent (LLM-MA) systems. Our objective is to provide readers with an in-depth understanding of these key points: the domains and settings where LLM-MA systems operate or simulate; the profiling and communication methods of these agents; and the means by which these agents develop their skills. For those interested in delving into this field, we also summarize the commonly used datasets or benchmarks. To keep researchers updated on the latest studies, we maintain an open-source GitHub repository (github.com/taichengguo/LLM_MultiAgents_Survey_Papers), dedicated to outlining the research on LLM-MA systems.",https://www.ijcai.org/proceedings/2024/0890.pdf,2024,IJCAI,,,🚫重复
+Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models,"The paper observes issues in agent training for LLMs and proposes Agent - FLAN to fine - tune models, alleviating hallucinations and improving agent capabilities. ",文章针对大语言模型作智能体能力不足问题,提出 Agent - FLAN 方法,改进语料、缓解幻觉、提升能力。 ,Datasets & Benchmarks,Dataset (fine-tune),"Open-sourced Large Language Models (LLMs) have achieved great success in various NLP tasks; however, they are still far inferior to API-based models when acting as agents. How to integrate agent ability into general LLMs becomes a crucial and urgent problem. This paper first delivers three key observations: (1) the current agent training corpus is entangled with both formats following and agent reasoning, which significantly shifts from the distribution of its pre-training data; (2) LLMs exhibit different learning speeds on the capabilities required by agent tasks; and (3) current approaches have side-effects when improving agent abilities by introducing hallucinations. Based on the above findings, we propose Agent-FLAN to effectively Fine-tune LANguage models for Agents. Through careful decomposition and redesign of the training corpus, Agent-FLAN enables Llama2-7B to outperform prior best works by 3.5% across various agent evaluation datasets. With comprehensively constructed negative samples, Agent-FLAN greatly alleviates the hallucination issues based on our established evaluation benchmark. Besides, it consistently improves the agent capability of LLMs when scaling model sizes while slightly enhancing the general capability of LLMs. 
The code and models are available at https://github.com/InternLM/Agent-FLAN.",https://aclanthology.org/2024.findings-acl.557/,2024,*ACL,,,
+AgentBank: Towards Generalized LLM Agents via Fine-Tuning on 50000+ Interaction Trajectories,"The paper introduces AgentBank, a large trajectory tuning dataset, and fine - tunes LLMs on it to get Samoyed, showing its promise for generalized agent capabilities. ",本文推出含超5万轨迹的AgentBank数据集微调LLM得Samoyed,证明扩展数据可获通用智能体能力 ,Datasets & Benchmarks,Dataset (fine-tune),"Fine-tuning on agent-environment interaction trajectory data holds significant promise for surfacing generalized agent capabilities in open-source large language models (LLMs). In this work, we introduce AgentBank, by far the largest trajectory tuning data collection featuring more than 50k diverse high-quality interaction trajectories, which comprises 16 tasks covering five distinct agent skill dimensions. Leveraging a novel annotation pipeline, we are able to scale the annotated trajectories and generate a trajectory dataset with minimized difficulty bias. Furthermore, we fine-tune LLMs on AgentBank to get a series of agent models, Samoyed. Our comparative experiments demonstrate the effectiveness of scaling the interaction trajectory data to acquire generalized agent capabilities. Additional studies also reveal some key observations regarding trajectory tuning and agent skill generalization.",https://aclanthology.org/2024.findings-emnlp.116/,2024,*ACL,,,
+AgentOhana: Design Unified Data and Training Pipeline for Effective Agent Learning,"This paper introduces AgentOhana to aggregate and unify agent trajectories, streamline data loading, and balance data sources, and presents xLAM - v0.1 for AI agents. ",提出 AgentOhana 聚合多源数据并统一格式,构建训练管道,还推出大动作模型 xLAM - v0.1 。 ,Datasets & Benchmarks,Dataset (fine-tune),"Autonomous agents powered by large language models (LLMs) have garnered significant research attention. However, fully harnessing the potential of LLMs for agent-based tasks presents inherent challenges due to the heterogeneous nature of diverse data sources featuring multi-turn trajectories. In this paper, we introduce AgentOhana as a comprehensive solution to address these challenges. AgentOhana aggregates agent trajectories from distinct environments, spanning a wide array of scenarios. It meticulously standardizes and unifies these trajectories into a consistent format, streamlining the creation of a generic data loader optimized for agent training. Leveraging the data unification, our training pipeline maintains equilibrium across different data sources and preserves independent randomness across devices during dataset partitioning and model training. Additionally, we present xLAM-v0.1, a large action model tailored for AI agents, which demonstrates exceptional performance across various benchmarks. Begin the exploration at https://github.com/SalesforceAIResearch/xLAM.",http://arxiv.org/abs/2402.15506,2024,Arxiv,,,
+AgentTuning: Enabling Generalized Agent Abilities for LLMs,"The paper presents AgentTuning to enhance LLMs' agent abilities without sacrificing general ones, using AgentInstruct and hybrid tuning, and open - sources models. ",提出 AgentTuning 方法增强大语言模型的智能体能力且不损通用能力,开源相关数据集与模型 ,Datasets & Benchmarks,Dataset (fine-tune),"Open large language models (LLMs) with great performance in various tasks have significantly advanced the development of LLMs. 
However, they are far inferior to commercial models such as ChatGPT and GPT-4 when acting as agents to tackle complex tasks in the real world. These agent tasks employ LLMs as the central controller responsible for planning, memorization, and tool utilization, necessitating both fine-grained prompting methods and robust LLMs to achieve satisfactory performance. Though many prompting methods have been proposed to complete particular agent tasks, there is a lack of research focusing on improving the agent capabilities of LLMs themselves without compromising their general abilities. In this work, we present AgentTuning, a simple and general method to enhance the agent abilities of LLMs while maintaining their general LLM capabilities. We construct AgentInstruct, a lightweight instruction-tuning dataset containing high-quality interaction trajectories. We employ a hybrid instruction-tuning strategy by combining AgentInstruct with open-source instructions from general domains. AgentTuning is used to instruction-tune the Llama 2 series, resulting in AgentLM. Our evaluations show that AgentTuning enables LLMs' agent capabilities without compromising general abilities. The AgentLM-70B is comparable to GPT-3.5-turbo on unseen agent tasks, demonstrating generalized agent capabilities. We open source the AgentInstruct and AgentLM-7B, 13B, and 70B models at https://anonymous.4open.science/r/AgentTuning, serving as open and powerful alternatives to commercial LLMs for agent tasks.",https://aclanthology.org/2024.findings-acl.181/,2024,*ACL,,,
+Executable Code Actions Elicit Better LLM Agents,"This paper proposes CodeAct using executable Python code for LLM agents, builds an open - source agent, and collects a dataset to improve agent - oriented tasks. ",提出用可执行Python代码统一大语言模型(LLM)智能体动作空间(CodeAct),收集调优数据集并微调智能体。 ,Datasets & Benchmarks,Dataset (fine-tune),"Large Language Model (LLM) agents, capable of performing a broad range of actions, such as invoking tools and controlling robots, show great potential in tackling real-world challenges. LLM agents are typically prompted to produce actions by generating JSON or text in a pre-defined format, which is usually limited by constrained action space (e.g., the scope of pre-defined tools) and restricted flexibility (e.g., inability to compose multiple tools). This work proposes to use executable Python code to consolidate LLM agents’ actions into a unified action space (CodeAct). Integrated with a Python interpreter, CodeAct can execute code actions and dynamically revise prior actions or emit new actions upon new observations through multi-turn interactions. Our extensive analysis of 17 LLMs on API-Bank and a newly curated benchmark shows that CodeAct outperforms widely used alternatives (up to 20% higher success rate). The encouraging performance of CodeAct motivates us to build an open-source LLM agent that interacts with environments by executing interpretable code and collaborates with users using natural language. To this end, we collect an instruction-tuning dataset CodeActInstruct that consists of 7k multi-turn interactions using CodeAct. We show that it can be used with existing data to improve models in agent-oriented tasks without compromising their general capability. 
CodeActAgent, finetuned from Llama2 and Mistral, is integrated with a Python interpreter and uniquely tailored to perform sophisticated tasks (e.g., model training) using existing libraries and to autonomously self-debug.",https://proceedings.mlr.press/v235/wang24h.html,2024,ICML,,,🚫重复
+FireAct: Toward Language Agent Fine-tuning,"This paper explores fine - tuning LMs for language agents. It proposes FireAct, shows benefits of diverse data, and offers experimental designs and insights. ",该论文探讨微调大模型获取语言智能体这一方向,提出FireAct方法,论证微调的综合益处。 ,Datasets & Benchmarks,Dataset (fine-tune),"Recent efforts have augmented language models (LMs) with external tools or environments, leading to the development of language agents that can reason and act. However, most of these agents rely on few-shot prompting techniques with off-the-shelf LMs. In this paper, we investigate and argue for the overlooked direction of fine-tuning LMs to obtain language agents. Using a setup of question answering (QA) with a Google search API, we explore a variety of base LMs, prompting methods, fine-tuning data, and QA tasks, and find language agents are consistently improved after fine-tuning their backbone LMs. For example, fine-tuning Llama2-7B with 500 agent trajectories generated by GPT-4 leads to a 77% HotpotQA performance increase. Furthermore, we propose FireAct, a novel approach to fine-tuning LMs with trajectories from multiple tasks and prompting methods, and show having more diverse fine-tuning data can further improve agents. Along with other findings regarding scaling effects, robustness, generalization, efficiency and cost, our work establishes comprehensive benefits of fine-tuning LMs for agents, and provides an initial set of experimental designs, insights, as well as open questions toward language agent fine-tuning.",http://arxiv.org/abs/2310.05915,2023,Arxiv,,,
+Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents,"A novel framework using multi - agent systems boosts LLMs' capabilities, addresses limitations, and shows potential in diverse domains. ",提出多智能体协作框架增强大语言模型能力,应对多挑战,展示多领域应用潜力。 ,Tools,Methodology,"In this paper, we present a novel framework for enhancing the capabilities of large language models (LLMs) by leveraging the power of multi-agent systems. Our framework introduces a collaborative environment where multiple intelligent agent components, each with distinctive attributes and roles, work together to handle complex tasks more efficiently and effectively. We demonstrate the practicality and versatility of our framework through case studies in artificial general intelligence (AGI), specifically focusing on the Auto-GPT and BabyAGI models. We also examine the ""Gorilla"" model, which integrates external APIs into the LLM. Our framework addresses limitations and challenges such as looping issues, security risks, scalability, system evaluation, and ethical considerations. By modeling various domains such as courtroom simulations and software development scenarios, we showcase the potential applications and benefits of our proposed multi-agent system. 
Our framework provides an avenue for advancing the capabilities and performance of LLMs through collaboration and knowledge exchange among intelligent agents.",http://arxiv.org/abs/2306.03314,2023,Arxiv,,,
+Re-Invoke: Tool Invocation Rewriting for Zero-Shot Tool Retrieval,"The paper proposes Re-Invoke, an unsupervised tool retrieval method. It uses synthetic queries, LLM understanding, and intent - based ranking for large toolset retrieval. ",提出无监督工具检索方法Re - Invoke,可高效适配大工具集,通过多步骤精准检索工具。 ,Tools,Methodology,"Recent advances in large language models (LLMs) have enabled autonomous agents with complex reasoning and task-fulfillment capabilities using a wide range of tools. However, effectively identifying the most relevant tools for a given task becomes a key bottleneck as the toolset size grows, hindering reliable tool utilization. To address this, we introduce Re-Invoke, an unsupervised tool retrieval method designed to scale effectively to large toolsets without training. Specifically, we first generate a diverse set of synthetic queries that comprehensively cover different aspects of the query space associated with each tool document during the tool indexing phase. Second, we leverage LLM's query understanding capabilities to extract key tool-related context and underlying intents from user queries during the inference phase. Finally, we employ a novel multi-view similarity ranking strategy based on intents to pinpoint the most relevant tools for each query. Our evaluation demonstrates that Re-Invoke significantly outperforms state-of-the-art alternatives in both single-tool and multi-tool scenarios, all within a fully unsupervised setting. Notably, on the ToolE datasets, we achieve a 20% relative improvement in nDCG@5 for single-tool retrieval and a 39% improvement for multi-tool retrieval.",http://arxiv.org/abs/2408.01875,2024,Arxiv,,,
+Recommender AI Agent: Integrating Large Language Models for Interactive Recommendations,"This paper bridges recommender models and LLMs with ""InteRecAgent"", combining strengths to create an interactive recommender system. ",论文提出InteRecAgent框架,结合推荐模型与大语言模型优势,使推荐系统交互化,有自然语言界面。 ,Tools,Methodology,"Recommender models excel at providing domain-specific item recommendations by leveraging extensive user behavior data. Despite their ability to act as lightweight domain experts, they struggle to perform versatile tasks such as providing explanations and engaging in conversations. On the other hand, large language models (LLMs) represent a significant step towards artificial general intelligence, showcasing remarkable capabilities in instruction comprehension, commonsense reasoning, and human interaction. However, LLMs lack the knowledge of domain-specific item catalogs and behavioral patterns, particularly in areas that diverge from general world knowledge, such as online e-commerce. Finetuning LLMs for each domain is neither economic nor efficient. In this paper, we bridge the gap between recommender models and LLMs, combining their respective strengths to create a versatile and interactive recommender system. We introduce an efficient framework called InteRecAgent, which employs LLMs as the brain and recommender models as tools. We first outline a minimal set of essential tools required to transform LLMs into InteRecAgent. We then propose an efficient workflow within InteRecAgent for task execution, incorporating key components such as memory components, dynamic demonstration-augmented task planning, and reflection. 
InteRecAgent enables traditional recommender systems, such as ID-based matrix factorization models, to become interactive systems with a natural language interface through the integration of LLMs. Experimental results on several public datasets show that InteRecAgent achieves satisfactory performance as a conversational recommender system, outperforming general-purpose LLMs. The source code of InteRecAgent is released at https://aka.ms/recagent.",http://arxiv.org/abs/2308.16505,2023,Arxiv,,,
+Chain of Tools: Large Language Model is an Automatic Multi-tool Learner,"The paper proposes ATC for LLMs as multi - tool users and a probing method for tool learning, and builds ToolFlow benchmark to show superiority. ",提出ATC框架让大模型成多工具用户,用黑盒探测法使其成工具学习者,构建ToolFlow基准测试。 ,Tools,Benchmark & Methodology,"Augmenting large language models (LLMs) with external tools has emerged as a promising approach to extend their utility, empowering them to solve practical tasks. Existing work typically empowers LLMs as tool users with a manually designed workflow, where the LLM plans a series of tools in a step-by-step manner, and sequentially executes each tool to obtain intermediate results until deriving the final answer. However, they suffer from two challenges in realistic scenarios: (1) The handcrafted control flow is often ad-hoc and constrains the LLM to local planning; (2) The LLM is instructed to use only manually demonstrated tools or well-trained Python functions, which limits its generalization to new tools. In this work, we first propose Automatic Tool Chain (ATC), a framework that enables the LLM to act as a multi-tool user, which directly utilizes a chain of tools through programming. To scale up the scope of the tools, we next propose a black-box probing method. This further empowers the LLM as a tool learner that can actively discover and document tool usages, teaching itself to properly master new tools. For a comprehensive evaluation, we build a challenging benchmark named ToolFlow, which diverges from previous benchmarks by its long-term planning scenarios and complex toolset. Experiments on both existing datasets and ToolFlow illustrate the superiority of our framework. Analysis on different settings also validates the effectiveness and the utility of our black-box probing algorithm.",http://arxiv.org/abs/2405.16533,2024,Arxiv,,,
+EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction,"The paper introduces EASYTOOL, a framework that transforms tool documentation into concise instructions, helping LLM - based agents use tools more easily. ",论文提出EASYTOOL框架,将多样工具文档转为统一简洁指令,提升大模型智能体工具使用能力。 ,Tools,Methodology,"To address intricate real-world tasks, there has been a rising interest in tool utilization in applications of large language models (LLMs). To develop LLM-based agents, it usually requires LLMs to understand many tool functions from different tool documentation. But this documentation can be diverse, redundant, or incomplete, which immensely affects the capability of LLMs in using tools. To solve this, we introduce EASYTOOL, a framework transforming diverse and lengthy tool documentation into a unified and concise tool instruction for easier tool usage. EasyTool purifies essential information from extensive tool documentation of different sources, and elaborates a unified interface (i.e., tool instruction) to offer standardized tool descriptions and functionalities for LLM-based agents. 
Extensive experiments on multiple different tasks demonstrate that EasyTool can significantly reduce token consumption and improve the performance of tool utilization in real-world scenarios. Our code will be available at https://github.com/microsoft/JARVIS/ in the future.",http://arxiv.org/abs/2401.06201,2024,Arxiv,,,
+LLM With Tools: A Survey,"This survey explores LLM - tool integration, proposes a paradigm, addresses challenges, explores tool creation by LLMs, and reproduces Chameleon's results. ",探讨教大模型用工具的方法、挑战与发展,提出集成范式,还研究自主造工具,分析代码结构。 ,Tools,Survey,"The integration of tools in augmenting large language models presents a novel approach toward enhancing the efficiency and accuracy of these models in handling specific, complex tasks. This paper delves into the methodology, challenges, and developments in the realm of teaching LLMs to use external tools, thereby pushing the boundaries of their capabilities beyond pre-existing knowledge bases. We introduce a standardized paradigm for tool integration guided by a series of functions that map user instructions to actionable plans and their execution, emphasizing the significance of understanding user intent, tool selection, and dynamic plan adjustment. Our exploration reveals the various challenges encountered, such as tool invocation timing, selection accuracy, and the need for robust reasoning processes. In addressing these challenges, we investigate techniques within the context of fine-tuning and in-context learning paradigms, highlighting innovative approaches to ensure diversity, augment datasets, and improve generalization. Furthermore, we investigate a perspective on enabling LLMs to not only utilize but also autonomously create tools, which may redefine their role from mere tool users to tool creators. Finally, we reproduced Chameleon's results on ScienceQA and analyzed the code structure.",http://arxiv.org/abs/2409.18807,2024,Arxiv,,,
+ToolGen: Unified Tool Retrieval and Calling via Generation,"The paper introduces ToolGen, integrating tool knowledge into LLMs as tokens, transforming tool retrieval into generation for more autonomous AI systems. ",论文提出ToolGen,将工具知识融入大模型参数,免检索调用工具,提升性能与扩展性,拓大模型实用能力。 ,Tools,Methodology,"As large language models (LLMs) advance, their inability to autonomously execute tasks by directly interacting with external tools remains a critical limitation. Traditional methods rely on inputting tool descriptions as context, which is constrained by context length and requires separate, often inefficient, retrieval mechanisms. We introduce ToolGen, a paradigm shift that integrates tool knowledge directly into the LLM's parameters by representing each tool as a unique token. This enables the LLM to generate tool calls and arguments as part of its next token prediction capabilities, seamlessly blending tool invocation with language generation. Our framework allows the LLM to access and utilize a vast amount of tools with no additional retrieval step, significantly enhancing both performance and scalability. Experimental results with over 47,000 tools show that ToolGen not only achieves superior results in both tool retrieval and autonomous task completion but also sets the stage for a new era of AI agents that can adapt to tools across diverse domains. By fundamentally transforming tool retrieval into a generative process, ToolGen paves the way for more versatile, efficient, and autonomous AI systems. 
ToolGen enables end-to-end tool learning and opens opportunities for integration with other advanced techniques such as chain-of-thought and reinforcement learning, thereby expanding the practical capabilities of LLMs.",http://arxiv.org/abs/2410.03439,2024,Arxiv,,, +ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs,"The paper introduces ToolLLM, a tool - use framework. It creates ToolBench dataset, develops algorithms and evaluator, and fine - tunes LLaMA to enhance tool - use. ",提出ToolLLM框架,构建ToolBench数据集,开发算法和评估器,微调得ToolLLaMA,提升大模型工具使用能力 ,Tools,Methodology & Dataset & Benchmark,"Despite the advancements of open-source large language models (LLMs), e.g., LLaMA, they remain significantly limited in tool-use capabilities, i.e., using external tools (APIs) to fulfill human instructions. The reason is that current instruction tuning largely focuses on basic language tasks but ignores the tool-use domain. This is in contrast to the excellent tool-use capabilities of state-of-the-art (SOTA) closed-source LLMs, e.g., ChatGPT. To bridge this gap, we introduce ToolLLM, a general tool-use framework encompassing data construction, model training, and evaluation. We first present ToolBench, an instruction-tuning dataset for tool use, which is constructed automatically using ChatGPT. Specifically, the construction can be divided into three stages: (i) API collection: we collect 16,464 real-world RESTful APIs spanning 49 categories from RapidAPI Hub; (ii) instruction generation: we prompt ChatGPT to generate diverse instructions involving these APIs, covering both single-tool and multi-tool scenarios; (iii) solution path annotation: we use ChatGPT to search for a valid solution path (chain of API calls) for each instruction. To enhance the reasoning capabilities of LLMs, we develop a novel depth-first search-based decision tree algorithm. It enables LLMs to evaluate multiple reasoning traces and expand the search space. Moreover, to evaluate the tool-use capabilities of LLMs, we develop an automatic evaluator: ToolEval. Based on ToolBench, we fine-tune LLaMA to obtain an LLM ToolLLaMA, and equip it with a neural API retriever to recommend appropriate APIs for each instruction. Experiments show that ToolLLaMA demonstrates a remarkable ability to execute complex instructions and generalize to unseen APIs, and exhibits comparable performance to ChatGPT. Our ToolLLaMA also demonstrates strong zero-shot generalization ability in an out-of-distribution tool-use dataset: APIBench.",http://arxiv.org/abs/2307.16789,2023,Arxiv,,, +ToolNet: Connecting Large Language Models with Massive Tools via Tool Graph,"This paper proposes ToolNet, a plug - and - play framework organizing tools into a graph, enabling LLMs to handle thousands of tools with moderate token cost. ",论文提出ToolNet框架,将工具组织成有向图,增加工具数且控消耗,应对多工具场景。 ,Tools,Methodology,"While achieving remarkable progress in a broad range of tasks, large language models (LLMs) remain significantly limited in properly using massive external tools. Existing in-context learning approaches simply format tools into a list of plain text descriptions and input them to LLMs, from which, LLMs generate a sequence of tool calls to solve problems step by step. Such a paradigm ignores the intrinsic dependency between tools and offloads all reasoning loads to LLMs, making them restricted to a limited number of specifically designed tools. 
It thus remains challenging for LLMs to operate on a library of massive tools, casting a great limitation when confronted with real-world scenarios. This paper proposes ToolNet, a plug-and-play framework that scales up the number of tools to thousands with a moderate increase in token consumption. ToolNet organizes tools into a directed graph. Each node represents a tool, and weighted edges denote tool transition. Starting from an initial tool node, an LLM navigates in the graph by iteratively choosing the next one from its successors until the task is resolved. Extensive experiments show that ToolNet can achieve impressive results in challenging multi-hop tool learning datasets and is resilient to tool failures.",http://arxiv.org/abs/2403.00839,2024,Arxiv,,,
+ToolPlanner: A Tool Augmented LLM for Multi Granularity Instructions with Path Planning and Feedback,"The paper constructs MGToolBench to reflect real - world scenarios and proposes ToolPlanner to enhance LLM capabilities, aligning with user habits. ",构建MGToolBench数据集,提出ToolPlanner框架增强LLM能力,多粒度指令更贴合用户习惯 ,Tools,Methodology & Dataset,"Recently, tool-augmented LLMs have gained increasing attention. Given an instruction, tool-augmented LLMs can interact with various external tools in multiple rounds and provide a final answer. However, previous LLMs were trained on overly detailed instructions, which included API names or parameters, while real users would not explicitly mention these API details. This leads to a gap between trained LLMs and real-world scenarios. In addition, most works ignore whether the interaction process follows the instruction. To address these issues, we constructed a training dataset called MGToolBench, which contains statement and category-level instructions to better reflect real-world scenarios. In addition, we propose ToolPlanner, a two-stage reinforcement learning framework that utilizes path planning and two feedback mechanisms to enhance the LLM's task completion and instruction-following capabilities. Experimental results show that ToolPlanner significantly improves the Match Rate, Pass Rate and Win Rate by 26.8%, 20.2%, and 5.6% compared to the SOTA model. Human evaluation verifies that the multi-granularity instructions can better align with users' usage habits. Our data and code will be released upon acceptance.",http://arxiv.org/abs/2409.14826,2024,Arxiv,,,
+TPTU-v2: Boosting Task Planning and Tool Usage of Large Language Model-based Agents in Real-world Systems,"This paper presents TPTU - v2, a framework with API Retriever, LLM Finetuner and Demo Selector to enhance LLM - based agents' TPTU in real - world systems. ",提出 TPTU-v2 框架提升大模型代理任务规划与工具使用能力,含三组件应对现实系统挑战 ,Tools,Methodology,"Large Language Models (LLMs) have demonstrated proficiency in addressing tasks that necessitate a combination of task planning and the usage of external tools, such as APIs. However, real-world complex systems present three prevalent challenges concerning task planning and tool usage: (1) The real system usually has a vast array of APIs, so it is impossible to feed the descriptions of all APIs to the prompt of LLMs as the token length is limited; (2) the real system is designed for handling complex tasks, and the base LLMs can hardly plan a correct sub-task order and API-calling order for such tasks; (3) Similar semantics and functionalities among APIs in real systems create challenges for both LLMs and even humans in distinguishing between them. 
In response, this paper introduces a comprehensive framework aimed at enhancing the Task Planning and Tool Usage (TPTU) abilities of LLM-based agents operating within real-world systems. Our framework comprises three key components designed to address these challenges: (1) the API Retriever selects the most pertinent APIs for the user task among the extensive array available; (2) LLM Finetuner tunes a base LLM so that the finetuned LLM can be more capable for task planning and API calling; (3) the Demo Selector adaptively retrieves different demonstrations related to hard-to-distinguish APIs, which is further used for in-context learning to boost the final performance. We validate our methods using a real-world commercial system as well as an open-sourced academic dataset, and the outcomes clearly showcase the efficacy of each individual component as well as the integrated framework.",http://arxiv.org/abs/2311.11315,2023,Arxiv,,,
+TPTU: Large Language Model-based AI Agents for Task Planning and Tool Usage,"The paper proposes an LLM-based AI agent framework, designs two agents, evaluates TPTU abilities, aiming to aid AI application leveraging of LLMs. ",提出LLM基AI智能体框架,设计两类智能体,评估TPTU能力,为AI应用提供参考与研究方向。 ,Tools,Methodology,"With recent advancements in natural language processing, Large Language Models (LLMs) have emerged as powerful tools for various real-world applications. Despite their prowess, the intrinsic generative abilities of LLMs may prove insufficient for handling complex tasks which necessitate a combination of task planning and the usage of external tools. In this paper, we first propose a structured framework tailored for LLM-based AI Agents and discuss the crucial capabilities necessary for tackling intricate problems. Within this framework, we design two distinct types of agents (i.e., one-step agent and sequential agent) to execute the inference process. Subsequently, we instantiate the framework using various LLMs and evaluate their Task Planning and Tool Usage (TPTU) abilities on typical tasks. By highlighting key findings and challenges, our goal is to provide a helpful resource for researchers and practitioners to leverage the power of LLMs in their AI applications. Our study emphasizes the substantial potential of these models, while also identifying areas that need more investigation and improvement.",http://arxiv.org/abs/2308.03427,2023,Arxiv,,,🚫重复
+GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction,"The paper proposes GPT4Tools to enable open - source LLMs to use tools via self - instruct, and provides a benchmark for evaluation. ",提出GPT4Tools让开源大模型用工具,生成数据集、用LoRA优化解视觉问题,还设基准评估。 ,Tools,Datasets & Benchmark,"This paper aims to efficiently enable Large Language Models (LLMs) to use multimodal tools. Advanced proprietary LLMs, such as ChatGPT and GPT-4, have shown great potential for tool usage through sophisticated prompt engineering. Nevertheless, these models typically rely on prohibitive computational costs and publicly inaccessible data. To address these challenges, we propose GPT4Tools based on self-instruct to enable open-source LLMs, such as LLaMA and OPT, to use tools. It generates an instruction-following dataset by prompting an advanced teacher with various multi-modal contexts. By using the Low-Rank Adaptation (LoRA) optimization, our approach facilitates the open-source LLMs to solve a range of visual problems, including visual comprehension and image generation. 
Moreover, we provide a benchmark to evaluate the ability of LLMs to use tools, which is performed in both zero-shot and fine-tuning ways. Extensive experiments demonstrate the effectiveness of our method on various language models, which not only significantly improves the accuracy of invoking seen tools, but also enables the zero-shot capacity for unseen tools. The code and demo are available at https://github.com/StevenGrove/GPT4Tools.",https://proceedings.neurips.cc/paper_files/paper/2023/hash/e393677793767624f2821cec8bdd02f1-Abstract-Conference.html?utm_campaign=Artificial%2BIntelligence%2BWeekly&utm_medium=email&utm_source=Artificial_Intelligence_Weekly_411,2023,NIPS,,,
+Making Language Models Better Tool Learners with Execution Feedback,"The paper proposes TRICE, a two - stage framework, enabling models to learn when and how to use tools via execution feedback, improving tool usage. ",提出TRICE两阶段框架,让大模型通过执行反馈持续学习,有效掌握工具使用时机与方法。 ,Tools,Methodology,"Tools serve as pivotal interfaces that enable humans to understand and reshape the environment. With the advent of foundation models, AI systems can utilize tools to expand their capabilities and interact with the real world. Existing tool learning methodologies, encompassing supervised fine-tuning and prompt engineering approaches, often induce large language models to utilize tools indiscriminately, as complex tasks often exceed their own competencies. However, introducing tools for simple tasks, which the models themselves can readily resolve, can inadvertently propagate errors rather than enhance performance. This leads to the research question: can we teach language models when and how to use tools? To meet this need, we propose Tool leaRning wIth exeCution fEedback (TRICE), a two-stage end-to-end framework that enables the model to continually learn through feedback derived from tool execution, thereby learning when and how to use tools effectively. Experimental results, backed by further analysis, show that TRICE can make the large language model selectively use tools by improving the accuracy of tool usage while enhancing insufficient tool learning and mitigating excessive reliance on tools.",https://aclanthology.org/2024.naacl-long.195/,2024,*ACL,,,
+API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs,"The paper introduces API - Bank, a benchmark for tool - augmented LLMs, addresses key questions on tool use, and provides datasets and error analysis for future research. ",本文提出API - Bank基准,开发评估系统、构建训练集,指明领域研究挑战。 ,Tools,Benchmark,"Recent research has demonstrated that Large Language Models (LLMs) can enhance their capabilities by utilizing external tools. However, three pivotal questions remain unanswered: (1) How effective are current LLMs in utilizing tools? (2) How can we enhance LLMs' ability to utilize tools? (3) What obstacles need to be overcome to leverage tools? To address these questions, we introduce API-Bank, a groundbreaking benchmark, specifically designed for tool-augmented LLMs. For the first question, we develop a runnable evaluation system consisting of 73 API tools. We annotate 314 tool-use dialogues with 753 API calls to assess the existing LLMs' capabilities in planning, retrieving, and calling APIs. For the second question, we construct a comprehensive training set containing 1,888 tool-use dialogues from 2,138 APIs spanning 1,000 distinct domains. Using this dataset, we train Lynx, a tool-augmented LLM initialized from Alpaca. 
Experimental results demonstrate that GPT-3.5 exhibits improved tool utilization compared to GPT-3, while GPT-4 excels in planning. However, there is still significant potential for further improvement. Moreover, Lynx surpasses Alpaca's tool utilization performance by more than 26 pts and approaches the effectiveness of GPT-3.5. Through error analysis, we highlight the key challenges for future research in this field to answer the third question.",https://aclanthology.org/2023.emnlp-main.187/,2023,*ACL,,,
+ChatCoT: Tool-Augmented Chain-of-Thought Reasoning on Chat-based Large Language Models,"The paper proposes ChatCoT, a tool - augmented framework for chat - based LLMs. It models CoT as multi - turn chats, integrating reasoning and tool use. ",提出ChatCoT框架,将思维链推理建模为多轮对话,统一推理与工具使用,提升复杂推理能力 ,Tools,Methodology,"Although large language models (LLMs) have achieved excellent performance in a variety of evaluation benchmarks, they still struggle in complex reasoning tasks which require specific knowledge and multi-hop reasoning. To improve the reasoning abilities, we propose ChatCoT, a tool-augmented chain-of-thought reasoning framework for chat-based LLMs (e.g., ChatGPT). In ChatCoT, we model the chain-of-thought (CoT) reasoning as multi-turn conversations, to utilize tools in a more natural way through chatting. At each turn, LLMs can either interact with tools or perform the reasoning. Our approach can effectively leverage the multi-turn conversation ability of chat-based LLMs, and integrate the thought chain following and tools manipulation in a unified way. Specifically, we initialize the early turns of the conversation by the knowledge about tools, tasks, and reasoning format, and propose an iterative tool-augmented reasoning step to perform step-by-step tool-augmented reasoning. The experiment results on two complex reasoning datasets (MATH and HotpotQA) have shown the effectiveness of ChatCoT on complex reasoning tasks, achieving a 7.9% relative improvement over the state-of-the-art baseline.",https://aclanthology.org/2023.findings-emnlp.985/,2023,*ACL,,,
+ToolQA: A Dataset for LLM Question Answering with External Tools,"The paper introduces ToolQA, a dataset for evaluating LLMs' tool - use in QA. It features scalable curation, minimizes data overlap, and offers new evaluation directions. ",提出新数据集ToolQA,可评估大模型用外部工具问答能力,减少与预训练数据重叠,指明研究方向。 ,Tools,Datasets,"Large Language Models (LLMs) have demonstrated impressive performance in various NLP tasks, but they still suffer from challenges such as hallucination and weak numerical reasoning. To overcome these challenges, external tools can be used to enhance LLMs' question-answering abilities. However, current evaluation methods do not distinguish between questions that can be answered using LLMs' internal knowledge and those that require external information through tool use. To address this issue, we introduce a new dataset called ToolQA, which is designed to faithfully evaluate LLMs' ability to use external tools for question answering. Our development of ToolQA involved a scalable, automated process for dataset curation, along with 13 specialized tools designed for interaction with external knowledge in order to answer questions. Importantly, we strive to minimize the overlap between our benchmark data and LLMs' pre-training data, enabling a more precise evaluation of LLMs' tool-use reasoning abilities. We conducted an in-depth diagnosis of existing tool-use LLMs to highlight their strengths, weaknesses, and potential improvements. 
Our findings set a new benchmark for evaluating LLMs and suggest new directions for future advancements. Our data and code are freely available for the broader scientific community on GitHub.",https://proceedings.neurips.cc/paper_files/paper/2023/hash/9cb2a7495900f8b602cb10159246a016-Abstract-Datasets_and_Benchmarks.html,2023,NIPS,,,
+On the Tool Manipulation Capability of Open-source Large Language Models,"This paper explores enhancing open - source LLMs for tool manipulation with human supervision, adapts classical methods, creates ToolBench, offering a practical recipe. ",文章分析开源大模型工具操作失败原因,提出增强技术,创建ToolBench,以少量人力提升其竞争力。 ,Tools,Benchmark,"Recent studies on software tool manipulation with large language models (LLMs) mostly rely on closed model APIs. The industrial adoption of these models is substantially constrained due to the security and robustness risks in exposing information to closed LLM API services. In this paper, we ask whether we can enhance open-source LLMs to be competitive with leading closed LLM APIs in tool manipulation, given a practical amount of human supervision. By analyzing common tool manipulation failures, we first demonstrate that open-source LLMs may require training with usage examples, in-context demonstration and generation style regulation to resolve failures. These insights motivate us to revisit classical methods in LLM literature, and demonstrate that we can adapt them as model alignment with programmatic data generation, system prompts and in-context demonstration retrievers to enhance open-source LLMs for tool manipulation. To evaluate these techniques, we create ToolBench, a tool manipulation benchmark consisting of diverse software tools for real-world tasks. We demonstrate that our techniques can boost leading open-source LLMs by up to 90% success rate, showing capabilities competitive to OpenAI GPT-4 in 4 out of 8 ToolBench tasks. We show that such enhancement typically requires about one developer day to curate data for each tool, rendering a recipe with a practical amount of human supervision.",http://arxiv.org/abs/2305.16504,2023,Arxiv,,,
+RestGPT: Connecting Large Language Models with Real-World RESTful APIs,"The paper proposes RestGPT to connect LLMs with RESTful APIs via coarse-to-fine planning. It also introduces RestBench, paving a way to AGI. ",本文提出RestGPT连接大模型与RESTful API,采用在线规划机制,还设RestBench评估,为AGI探索新路。 ,Tools,Benchmark,"Tool-augmented large language models (LLMs) have achieved remarkable progress in tackling a broad range of tasks. However, existing methods are mainly restricted to specifically designed tools and fail to fulfill complex instructions, having great limitations when confronted with real-world scenarios. In this paper, we explore a more realistic scenario by connecting LLMs with RESTful APIs, which adhere to the widely adopted REST software architectural style for web service development. To address the practical challenges of tackling complex instructions, we propose RestGPT, which exploits the power of LLMs and conducts a coarse-to-fine online planning mechanism to enhance the abilities of task decomposition and API selection. RestGPT also contains an API executor tailored for calling RESTful APIs, which can meticulously formulate parameters and parse API responses. To fully evaluate the performance of RestGPT, we propose RestBench, a high-quality benchmark which consists of two real-world scenarios and human-annotated instructions with gold solution paths. 
Experiments show that RestGPT is able to achieve impressive results in complex tasks and has strong robustness, which paves a new way towards AGI. RestGPT and RestBench are publicly available at https://restgpt.github.io/.",http://arxiv.org/abs/2306.06624,2023,Arxiv,,,
+Leveraging Large Language Models to Improve REST API Testing,"This paper presents RESTGPT, an approach using LLMs to boost REST API testing by extracting rules and values from specs, outpacing existing methods. ",论文提出 RESTGPT,借助大语言模型改善 REST API 测试,提取规则、生成参数值,表现优于现有技术。 ,Tools,Methodology,"The widespread adoption of REST APIs, coupled with their growing complexity and size, has led to the need for automated REST API testing tools. Current tools focus on the structured data in REST API specifications but often neglect valuable insights available in unstructured natural-language descriptions in the specifications, which leads to suboptimal test coverage. Recently, to address this gap, researchers have developed techniques that extract rules from these human-readable descriptions and query knowledge bases to derive meaningful input values. However, these techniques are limited in the types of rules they can extract and prone to produce inaccurate results. This paper presents RESTGPT, an innovative approach that leverages the power and intrinsic context-awareness of Large Language Models (LLMs) to improve REST API testing. RESTGPT takes as input an API specification, extracts machine-interpretable rules, and generates example parameter values from natural-language descriptions in the specification. It then augments the original specification with these rules and values. Our evaluations indicate that RESTGPT outperforms existing techniques in both rule extraction and value generation. Given these promising results, we outline future research directions for advancing REST API testing through LLMs.",https://dl.acm.org/doi/10.1145/3639476.3639769,2024,IEEE,,,
+Toolformer: Language Models Can Teach Themselves to Use Tools,"The paper introduces Toolformer, enabling LMs to self-teach tool use via APIs with few demos, merging general and specialized abilities. ",提出Toolformer,让大模型自学用API调用外部工具,提升下游任务零样本性能,不损核心能力。 ,Tools,Methodology,"Language models (LMs) exhibit remarkable abilities to solve new tasks from just a few examples or textual instructions, especially at scale. They also, paradoxically, struggle with basic functionality, such as arithmetic or factual lookup, where much simpler and smaller specialized models excel. In this paper, we show that LMs can teach themselves to use external tools via simple APIs and achieve the best of both worlds. We introduce Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction. This is done in a self-supervised way, requiring nothing more than a handful of demonstrations for each API. We incorporate a range of tools, including a calculator, a Q&A system, a search engine, a translation system, and a calendar. 
Toolformer achieves substantially improved zero-shot performance across a variety of downstream tasks, often competitive with much larger models, without sacrificing its core language modeling abilities.",https://proceedings.neurips.cc/paper_files/paper/2023/hash/d842425e4bf79ba039352da0f658a906-Abstract-Conference.html,2023,NIPS,,,
+WebGPT: Browser-assisted question-answering with human feedback,"The paper fine-tunes GPT-3 for long-form QA with web browsing, uses imitation learning and human feedback, and collects references. ",该研究基于文本网页环境微调GPT-3答长问题,结合模仿学习与人类反馈,训练评估用ELI5数据集。 ,Tools,Datasets,"We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web. By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers. We train and evaluate our models on ELI5, a dataset of questions asked by Reddit users. Our best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences. This model's answers are preferred by humans 56% of the time to those of our human demonstrators, and 69% of the time to the highest-voted answer from Reddit.",http://arxiv.org/abs/2112.09332,2022,Arxiv,,,
+WebCPM: Interactive Web Search for Chinese Long-form Question Answering,"This paper presents WebCPM, the first Chinese LFQA dataset with interactive web search. It collects data, fine-tunes models, and makes resources public. ",提出首个中文长文本问答数据集 WebCPM,基于交互式网络搜索,收集问答对,公开资源并微调模型。 ,Tools,Datasets,"Long-form question answering (LFQA) aims at answering complex, open-ended questions with detailed, paragraph-length responses. The de facto paradigm of LFQA necessitates two procedures: information retrieval, which searches for relevant supporting facts, and information synthesis, which integrates these facts into a coherent answer. In this paper, we introduce WebCPM, the first Chinese LFQA dataset. One unique feature of WebCPM is that its information retrieval is based on interactive web search, which engages with a search engine in real time. Following WebGPT, we develop a web search interface. We recruit annotators to search for relevant information using our interface and then answer questions. Meanwhile, the web search behaviors of our annotators would be recorded. In total, we collect 5,500 high-quality question-answer pairs, together with 15,372 supporting facts and 125,954 web search actions. We fine-tune pre-trained language models to imitate human behaviors for web search and to generate answers based on the collected facts. Our LFQA pipeline, built on these fine-tuned models, generates answers that are no worse than human-written ones in 32.5% and 47.5% of the cases on our dataset and DuReader, respectively. The interface, dataset, and codes are publicly available at https://github.com/thunlp/WebCPM.",https://aclanthology.org/2023.acl-long.499/,2023,*ACL,,,
+ToolCoder: Teach Code Generation Models to use API search tools,"The paper proposes ToolCoder, integrating API search tools into code generation. It uses ChatGPT for annotation and shows the potential of tool incorporation. 
",提出ToolCoder方法,结合API搜索工具与现有模型,引入数据标注法,展现出优秀代码生成能力。 ,Tools,Methodology,"Automatically generating source code from natural language descriptions has been a growing field of research in recent years. However, current large-scale code generation models often encounter difficulties when selecting appropriate APIs for specific contexts. These models may generate APIs that do not meet requirements or refer to non-existent APIs in third-party libraries, especially for lesser-known or private libraries. Inspired by the process of human developers using tools to search APIs, we propose ToolCoder, a novel approach that integrates API search tools with existing models to assist in code generation and API selection. To teach our model to use tools, we introduce an automated data annotation method using ChatGPT to add tool usage information into the source code data and fine-tune code generation models. During inference, we integrate API search tools into the generation process so that our model can automatically use the search tool to get suggestions when selecting an API. Our experimental results demonstrate that ToolCoder exhibits excellent performance and generalization across five public and private library code generation benchmarks, with at least 6.21\% improvement on average pass@1 metrics and 9.64\% improvement on average pass@10 metrics compared to state-of-the-art methods. Furthermore, we show that our relatively small ToolCoder model is comparable to one of the current best models, GPT-3.5, highlighting the potential of incorporating programming tools into the code generation process.",http://arxiv.org/abs/2305.04032,2023,Arxiv,,, +ToolCoder: A Systematic Code-Empowered Tool Learning Framework for Large Language Models,"This paper proposes ToolCoder, reformulating tool learning as code generation. It uses coding for reasoning, promotes reuse and debugging, a novel approach in tool learning. ",论文提出ToolCoder框架,将工具学习转为代码生成任务,借助编码范式规划,还优化了执行效率和鲁棒性。 ,Tools,Methodology,"Tool learning has emerged as a crucial capability for large language models (LLMs) to solve complex real-world tasks through interaction with external tools. Existing approaches face significant challenges, including reliance on hand-crafted prompts, difficulty in multi-step planning, and lack of precise error diagnosis and reflection mechanisms. We propose ToolCoder, a novel framework that reformulates tool learning as a code generation task. Inspired by software engineering principles, ToolCoder transforms natural language queries into structured Python function scaffold and systematically breaks down tasks with descriptive comments, enabling LLMs to leverage coding paradigms for complex reasoning and planning. It then generates and executes function implementations to obtain final responses. Additionally, ToolCoder stores successfully executed functions in a repository to promote code reuse, while leveraging error traceback mechanisms for systematic debugging, optimizing both execution efficiency and robustness. Experiments demonstrate that ToolCoder achieves superior performance in task completion accuracy and execution reliability compared to existing approaches, establishing the effectiveness of code-centric approaches in tool learning.",http://arxiv.org/abs/2502.11404,2025,Arxiv,,, +ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases,"This paper presents ToolAlpaca, creating diverse tool - use corpus to fine - tune compact models, proving feasible for them to learn generalized tool - use. 
",论文提出ToolAlpaca框架,自动生成工具使用语料微调小模型,实现通用工具使用能力。 ,Tools,Methodology,"Enabling large language models to utilize real-world tools effectively is crucial for achieving embodied intelligence. Existing approaches to tool learning have either primarily relied on extremely large language models, such as GPT-4, to attain generalized tool-use abilities in a zero-shot manner, or utilized supervised learning to train limited scopes of tools on compact models. However, it remains uncertain whether smaller language models can achieve generalized tool-use abilities without tool-specific training. To address this question, this paper introduces ToolAlpaca, a novel framework designed to automatically generate a diverse tool-use corpus and learn generalized tool-use abilities on compact language models with minimal human intervention. Specifically, ToolAlpaca first automatically creates a highly diversified tool-use corpus by building a multi-agent simulation environment. The corpus contains 3938 tool-use instances from more than 400 real-world tool APIs spanning 50 distinct categories. Subsequently, the constructed corpus is employed to fine-tune compact language models, resulting in two models, namely ToolAlpaca-7B and ToolAlpaca-13B, respectively. Finally, we evaluate the ability of these models to utilize previously unseen tools without specific training. Experimental results demonstrate that ToolAlpaca achieves effective generalized tool-use capabilities comparable to those of extremely large language models like GPT-3.5, demonstrating that learning generalized tool-use ability is feasible for compact language models.",http://arxiv.org/abs/2306.05301,2023,Arxiv,,, +LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error,"The paper proposes STE, a bio - inspired method for tool - augmented LLMs, using trial - and - error, imagination and memory to improve tool use accuracy. ",提出生物启发的模拟试错法(STE)提升大模型工具使用准确性,还可实现工具持续学习。 ,Tools,Methodology,"Tools are essential for large language models (LLMs) to acquire up-to-date information and take consequential actions in external environments. Existing work on tool-augmented LLMs primarily focuses on the broad coverage of tools and the flexibility of adding new tools. However, a critical aspect that has surprisingly been understudied is simply how accurately an LLM uses tools for which it has been trained. We find that existing LLMs, including GPT-4 and open-source LLMs specifically fine-tuned for tool use, only reach a correctness rate in the range of 30% to 60%, far from reliable use in practice. We propose a biologically inspired method for tool-augmented LLMs, simulated trial and error (STE), that orchestrates three key mechanisms for successful tool use behaviors in the biological system: trial and error, imagination, and memory. Specifically, STE leverages an LLM`s ‘imagination' to simulate plausible scenarios for using a tool, after which the LLM interacts with the tool to learn from its execution feedback. Both short-term and long-term memory are employed to improve the depth and breadth of the exploration, respectively. Comprehensive experiments on ToolBench show that STE substantially improves tool learning for LLMs under both in-context learning and fine-tuning settings, bringing a boost of 46.7% to Mistral-Instruct-7B and enabling it to outperform GPT-4. 
We also show effective continual learning of tools via a simple experience replay strategy.",https://aclanthology.org/2024.acl-long.570/,2024,*ACL,,,🚫重复 +Skills-in-Context: Unlocking Compositionality in Large Language Models,"The paper proposes ""skills-in-context"" (SKiC) in in-context learning to elicit LLMs' compositional generalization, with transferability and zero-shot potential. ",提出“技能情境(SKiC)”提示结构,挖掘大模型组合泛化能力,微调数据可实现零样本强弱泛化 ,Tools,Methodology,"We investigate how to elicit compositional generalization capabilities in large language models (LLMs). Compositional generalization empowers LLMs to solve complex problems by combining foundational skills, a critical reasoning ability akin to human intelligence. However, even the most advanced LLMs currently struggle with this form of reasoning. We examine this problem within the framework of in-context learning and find that demonstrating both foundational skills and compositional examples grounded in these skills within the same prompt context is crucial. We refer to this prompt structure as skills-in-context (SKiC). With as few as two exemplars, this in-context learning structure enables LLMs to tackle more challenging problems requiring innovative skill combinations, achieving near-perfect systematic generalization across a broad range of tasks. Intriguingly, SKiC also unlocks the latent potential of LLMs, allowing them to more actively utilize pre-existing internal skills acquired during earlier pretraining stages to solve complex reasoning problems. The SKiC structure is robust across different skill constructions and exemplar choices and demonstrates strong transferability to new tasks. Finally, inspired by our in-context learning study, we show that fine-tuning LLMs with SKiC-style data can elicit zero-shot weak-to-strong generalization, enabling the models to solve much harder problems directly with standard prompting.",https://aclanthology.org/2024.findings-emnlp.812/,2024,*ACL,,, +Task Bench: A Parameterized Benchmark for Evaluating Parallel Runtime Performance,"The paper presents Task Bench, a parameterized benchmark for evaluating parallel runtime performance, introduces METG, and studies systems' scalability. ",提出参数化基准 Task Bench,引入新指标 METG,研究多种编程系统性能及特性。 ,Tools,Benchmark,"We present Task Bench, a parameterized benchmark designed to explore the performance of distributed programming systems under a variety of application scenarios. Task Bench dramatically lowers the barrier to benchmarking and comparing multiple programming systems by making the implementation for a given system orthogonal to the benchmarks themselves: every benchmark constructed with Task Bench runs on every Task Bench implementation. Furthermore, Task Bench's parameterization enables a wide variety of benchmark scenarios that distill the key characteristics of larger applications. To assess the effectiveness and overheads of the tested systems, we introduce a novel metric, minimum effective task granularity (METG). We conduct a comprehensive study with 15 programming systems on up to 256 Haswell nodes of the Cori supercomputer. Running at scale, 100μs-long tasks are the finest granularity that any system runs efficiently with current technologies. 
We also study each system's scalability and its ability to hide communication and mitigate load imbalance.",https://www.computer.org/csdl/proceedings-article/sc/2020/999800a864/1oeOToMWZBC,2020,IEEE,,,
+ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings,"The paper proposes ToolkenGPT, using tool embeddings to let LLMs master tools like predicting tokens, addressing limitations of existing approaches. ",提出ToolkenGPT,通过工具嵌入让大模型像预测标记一样掌握工具,解决现有方法局限。 ,Tools,Methodology,"Integrating large language models (LLMs) with various tools has led to increased attention in the field. Existing approaches either involve fine-tuning the LLM, which is both computationally costly and limited to a fixed set of tools, or prompting LLMs by in-context tool demonstrations. Although the latter method offers adaptability to new tools, it struggles with the inherent context length constraint of LLMs when many new tools are presented, and mastering a new set of tools with few-shot examples remains challenging, resulting in suboptimal performance. To address these limitations, we propose a novel solution, named ToolkenGPT, wherein LLMs effectively learn to master tools as predicting tokens through tool embeddings for solving complex tasks. In this framework, each tool is transformed into vector embeddings and plugged into the language model head. Once the function is triggered during text generation, the LLM enters a special function mode to execute the tool calls. Our experiments show that function embeddings effectively help LLMs understand tool use and improve on several tasks, including numerical reasoning, knowledge-based question answering and embodied decision-making.",https://proceedings.neurips.cc/paper_files/paper/2023/hash/8fd1a81c882cd45f64958da6284f4a3f-Abstract-Conference.html,2023,NIPS,,,
+MultiTool-CoT: GPT-3 Can Use Multiple External Tools with Chain of Thought Prompting,"The paper proposes MultiTool-CoT, a framework using CoT prompting to integrate multiple external tools in reasoning, aiming at enhancing task performance. ",提出MultiTool-CoT框架,借思维链提示融合多外部工具用于推理,应用于特定数据集有提升。 ,Tools,Methodology,"Large language models (LLMs) have achieved impressive performance on various reasoning tasks. To further improve the performance, we propose MultiTool-CoT, a novel framework that leverages chain-of-thought (CoT) prompting to incorporate multiple external tools, such as a calculator and a knowledge retriever, during the reasoning process. We apply MultiTool-CoT to the Task 2 dataset of NumGLUE, which requires both numerical reasoning and domain-specific knowledge. The experiments show that our method significantly outperforms strong baselines and achieves state-of-the-art performance.",https://aclanthology.org/2023.acl-short.130/,2023,*ACL,,,
+TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs,"The paper proposes TaskMatrix.AI, which completes tasks by connecting foundation models with millions of APIs, noting that advanced foundation models excel across open-domain tasks. ",论文聚焦通过连接基础模型与数百万API完成任务,指出先进基础模型在多领域任务能力强。 ,Tools,Methodology,"In recent years, artificial intelligence (AI) has made incredible progress. Advanced foundation models such as ChatGPT can offer powerful conversation, in-context learning, and code generation abilities for a broad range of open-domain tasks. 
They can ...",https://spj.science.org/doi/10.34133/icomputing.0063,2024,Others,,,
+Gorilla: Large Language Model Connected with Massive APIs,"The paper develops Gorilla, a fine-tuned LLaMA model. With RAT training, it excels in API calls, adapts to documentation changes and reduces hallucination. ",本文开发Gorilla模型结合检索器,用新方法训练,引入APIBench评估,提升大模型用工具能力。 ,Tools,Methodology,"Large Language Models (LLMs) have seen an impressive wave of advances, with models now excelling in a variety of tasks, such as mathematical reasoning and program synthesis. However, their potential to effectively use tools via API calls remains unfulfilled. This is a challenging task even for today's state-of-the-art LLMs such as GPT-4, largely due to their unawareness of what APIs are available and how to use them in a frequently updated tool set. We develop Gorilla, a finetuned LLaMA model that surpasses the performance of GPT-4 on writing API calls. Trained with the novel Retriever Aware Training (RAT), when combined with a document retriever, Gorilla demonstrates a strong capability to adapt to test-time document changes, allowing flexible user updates or version changes. It also substantially mitigates the issue of hallucination, commonly encountered when prompting LLMs directly. To evaluate the model's ability, we introduce APIBench, a comprehensive dataset consisting of HuggingFace, TorchHub, and TensorHub APIs. The successful integration of the retrieval system with Gorilla demonstrates the potential for LLMs to use tools more accurately, keep up with frequently updated documentation, and consequently increase the reliability and applicability of their outputs. Gorilla's code, model, data, and demo are available at: https://gorilla.cs.berkeley.edu",https://proceedings.neurips.cc/paper_files/paper/2024/hash/e4c61f578ff07830f5c37378dd3ecb0d-Abstract-Conference.html,2024,NIPS,,,
+CREATOR: Tool Creation for Disentangling Abstract and Concrete Reasoning of Large Language Models,"The paper proposes CREATOR, a framework enabling LLMs to create tools, disentangling creation and execution, and shows its benefits for problem-solving and knowledge transfer. ",提出 CREATOR 框架使大语言模型能自创工具,解耦抽象创建与具体执行,推动解题范式革新 ,Tools,Methodology,"Large Language Models (LLMs) have made significant progress in utilizing tools, but their ability is limited by API availability and the instability of implicit reasoning, particularly when both planning and execution are involved. To overcome these limitations, we propose CREATOR, a novel framework that enables LLMs to create their own tools using documentation and code realization. CREATOR disentangles abstract tool creation and concrete decision execution, resulting in improved performance. We evaluate CREATOR on MATH and TabMWP benchmarks, respectively consisting of challenging math competition problems and diverse tabular contents. Remarkably, CREATOR outperforms existing chain-of-thought, program-of-thought, and tool-using baselines. Additionally, we introduce the Creation Challenge dataset, featuring 2K diverse questions, to emphasize the necessity and benefits of LLMs' tool creation ability. Further research demonstrates that leveraging LLMs as tool creators facilitates knowledge transfer, and LLMs exhibit varying levels of tool creation abilities, enabling them to adapt to diverse situations. 
The tool creation ability revolutionizes the LLM's problem-solving paradigm, driving us closer to the next frontier of artificial intelligence.",https://aclanthology.org/2023.findings-emnlp.462/,2023,*ACL,,,
+LARGE LANGUAGE MODELS AS TOOL MAKERS,"The paper presents LATM, a closed-loop framework enabling LLMs to create and use tools. It divides labor to cut costs and extends cache applicability. ",提出 LATM 闭环框架,让大语言模型自制工具,分工降低成本,扩展缓存机制,提升任务解决效率。 ,Tools,Methodology,"Recent research has highlighted the potential of large language models (LLMs) to improve their problem-solving capabilities with the aid of suitable external tools. In our work, we further advance this concept by introducing a closed-loop framework, referred to as LLMs As Tool Makers (LATM), where LLMs create their own reusable tools for problem-solving. Our approach consists of two phases: 1) tool making: an LLM acts as the tool maker that crafts tools for a set of tasks, where a tool is implemented as a Python utility function. 2) tool using: another LLM acts as the tool user, which applies the tool built by the tool maker for problem-solving. The tool user can be either the same or a different LLM from the tool maker. On the problem-solving server side, tool-making enables continual tool generation and caching as new requests emerge. This framework enables subsequent requests to access cached tools via their corresponding APIs, enhancing the efficiency of task resolution. Beyond enabling LLMs to create their own tools, our framework also uncovers intriguing opportunities to optimize the serving cost of LLMs: Recognizing that tool-making requires more sophisticated capabilities, we assign this task to a powerful, albeit resource-intensive, model. Conversely, the simpler tool-using phase is delegated to a lightweight model. This strategic division of labor allows the once-off cost of tool-making to be spread over multiple instances of tool-using, significantly reducing average costs while maintaining strong performance. Furthermore, our method offers a functional cache through the caching and reuse of tools, which stores the functionality of a class of requests instead of the natural language responses from LLMs, thus extending the applicability of the conventional cache mechanism. We evaluate our approach across various complex reasoning tasks, including Big-Bench tasks. With GPT-4 as the tool maker and GPT-3.5 as the tool user, LATM demonstrates performance equivalent to using GPT-4 for both roles, but with a significantly reduced inference cost. The codebase can be found at https://github.com/ctlllll/LLM-ToolMaker.",https://arxiv.org/abs/2305.17126,2024,ICLR,,,
+GEAR: Augmenting Language Models with Generalizable and Efficient Tool Resolution,"The paper introduces GEAR, a generalizable and efficient query-tool grounding algorithm, delegating tasks to different models and evaluating semantically for better tool use. ",提出GEAR算法,将工具接地与执行分至小、大模型,泛化性好、效率高,降低成本提升精度。 ,Tools,Methodology,"Augmenting large language models (LLM) to use external tools enhances their performance across a variety of tasks. However, prior works over-rely on task-specific demonstration of tool use that limits their generalizability and computational cost due to making many calls to large-scale LLMs. We introduce GEAR, a computationally efficient query-tool grounding algorithm that is generalizable to various tasks that require tool use while not relying on task-specific demonstrations. 
GEAR achieves better efficiency by delegating tool grounding and execution to small language models (SLM) and LLM, respectively, while leveraging semantic and pattern-based evaluation at both question and answer levels for generalizable tool grounding. We evaluate GEAR on 14 datasets across 6 downstream tasks, demonstrating its strong generalizability to novel tasks, tools and different SLMs. Despite offering more efficiency, GEAR achieves higher precision in tool grounding compared to prior strategies using LLM prompting, thus improving downstream accuracy at a reduced computational cost. For example, we demonstrate that GEAR-augmented GPT-J and GPT-3 outperform counterpart tool-augmented baselines because of better tool use.",https://arxiv.org/pdf/2307.08775,2023,Arxiv,,,
+"CAMEL: Communicative Agents for ""Mind"" Exploration of Large Language Model Society","This paper proposes a role-playing framework for autonomous agent cooperation, offers a scalable study approach, and open-sources the relevant library. ",提出角色扮演通信代理框架,提供研究多智能体合作的可扩展方法,开源库支持相关研究。 ,Agent Construction," Methodology ","The rapid advancement of chat-based language models has led to remarkable progress in complex task-solving. However, their success heavily relies on human input to guide the conversation, which can be challenging and time-consuming. This paper explores the potential of building scalable techniques to facilitate autonomous cooperation among communicative agents, and provides insight into their ""cognitive"" processes. To address the challenges of achieving autonomous cooperation, we propose a novel communicative agent framework named role-playing. Our approach involves using inception prompting to guide chat agents toward task completion while maintaining consistency with human intentions. We showcase how role-playing can be used to generate conversational data for studying the behaviors and capabilities of a society of agents, providing a valuable resource for investigating conversational language models. In particular, we conduct comprehensive studies on instruction-following cooperation in multi-agent settings. Our contributions include introducing a novel communicative agent framework, offering a scalable approach for studying the cooperative behaviors and capabilities of multi-agent systems, and open-sourcing our library to support research on communicative agents and beyond: this https URL.",https://arxiv.org/abs/2303.17760,2023,NIPS,,,🚫重复
+AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation,"AutoGen is an open-source framework enabling LLM app building via multi-agent conversation, customizable and supporting diverse interaction patterns. ",论文提出开源框架AutoGen,可多智能体协作构建LLM应用,灵活定义交互,适用于多领域。 ,Agent Construction," Methodology ","AutoGen is an open-source framework that allows developers to build LLM applications via multiple agents that can converse with each other to accomplish tasks. AutoGen agents are customizable, conversable, and can operate in various modes that employ combinations of LLMs, human inputs, and tools. Using AutoGen, developers can also flexibly define agent interaction behaviors. Both natural language and computer code can be used to program flexible conversation patterns for different applications. AutoGen serves as a generic infrastructure to build diverse applications of various complexities and LLM capacities. 
Empirical studies demonstrate the effectiveness of the framework in many example applications, in domains including mathematics, coding, question answering, operations research, online decision-making, and entertainment. ",https://arxiv.org/abs/2308.08155,2023,Arxiv,,,🚫重复
+AutoAgents: A Framework for Automatic Agent Generation,"The paper introduces AutoAgents, a framework that adaptively generates and coordinates agents for tasks, with an observer for improvement, offering new perspectives. ",提出AutoAgents框架,依任务自适应生成并协调多智能体,设观察者优化方案,为解决复杂任务提供新思路 ,Agent Construction," Methodology ","Large language models (LLMs) have enabled remarkable advances in automated task-solving with multi-agent systems. However, most existing LLM-based multi-agent approaches rely on predefined agents to handle simple tasks, limiting the adaptability of multi-agent collaboration to different scenarios. Therefore, we introduce AutoAgents, an innovative framework that adaptively generates and coordinates multiple specialized agents to build an AI team according to different tasks. Specifically, AutoAgents couples the relationship between tasks and roles by dynamically generating multiple required agents based on task content and planning solutions for the current task based on the generated expert agents. Multiple specialized agents collaborate with each other to efficiently accomplish tasks. Concurrently, an observer role is incorporated into the framework to reflect on the designated plans and agents' responses and improve upon them. Our experiments on various benchmarks demonstrate that AutoAgents generates more coherent and accurate solutions than the existing multi-agent methods. This underscores the significance of assigning different roles to different tasks and of team cooperation, offering new perspectives for tackling complex tasks. The repository of this project is available at this https URL.",https://arxiv.org/abs/2309.17288,2024,IJCAI,,,
+MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework,"This paper introduces MetaGPT, a meta-programming framework for LLM-based multi-agent collaborations, encoding SOPs and using an assembly-line paradigm to solve complex tasks. ",论文提出MetaGPT框架,将人类工作流融入多智能体协作,编码SOP精简流程,高效拆解复杂任务。 ,Agent Construction," Methodology ","Remarkable progress has been made on automated problem solving through societies of agents based on large language models (LLMs). Existing LLM-based multi-agent systems can already solve simple dialogue tasks. Solutions to more complex tasks, however, are complicated through logic inconsistencies due to cascading hallucinations caused by naively chaining LLMs. Here we introduce MetaGPT, an innovative meta-programming framework incorporating efficient human workflows into LLM-based multi-agent collaborations. MetaGPT encodes Standardized Operating Procedures (SOPs) into prompt sequences for more streamlined workflows, thus allowing agents with human-like domain expertise to verify intermediate results and reduce errors. MetaGPT utilizes an assembly line paradigm to assign diverse roles to various agents, efficiently breaking down complex tasks into subtasks involving many agents working together. On collaborative software engineering benchmarks, MetaGPT generates more coherent solutions than previous chat-based multi-agent systems. 
Our project can be found at this https URL",https://arxiv.org/abs/2308.00352,2024,ICLR,,,🚫重复
+Cognitive Architectures for Language Agents,"The paper proposes Cognitive Architectures for Language Agents (CoALA), organizing existing agents and identifying directions for more capable ones, aiming for language-based general intelligence. ",本文提出语言智能体认知架构CoALA,回顾整理现有工作,指明智能体发展方向,迈向基于语言的通用智能。 ,Agent Construction," Methodology ","Recent efforts have augmented large language models (LLMs) with external resources (e.g., the Internet) or internal control flows (e.g., prompt chaining) for tasks requiring grounding or reasoning, leading to a new class of language agents. While these agents have achieved substantial empirical success, we lack a framework to organize existing agents and plan future developments. In this paper, we draw on the rich history of cognitive science and symbolic artificial intelligence to propose Cognitive Architectures for Language Agents (CoALA). CoALA describes a language agent with modular memory components, a structured action space to interact with internal memory and external environments, and a generalized decision-making process to choose actions. We use CoALA to retrospectively survey and organize a large body of recent work, and prospectively identify actionable directions towards more capable agents. Taken together, CoALA contextualizes today's language agents within the broader history of AI and outlines a path towards language-based general intelligence.",https://arxiv.org/abs/2309.02427,2024,TMLR,,,
+Executable Code Actions Elicit Better LLM Agents,"This paper proposes CodeAct using executable Python code for LLM agents, builds an open-source agent, and collects a dataset to enhance agent-oriented tasks. ",提出用可执行Python代码统一大模型智能体动作空间(CodeAct),建数据集、微调模型以完成复杂任务 ,Agent Construction," Methodology ","Large Language Model (LLM) agents, capable of performing a broad range of actions, such as invoking tools and controlling robots, show great potential in tackling real-world challenges. LLM agents are typically prompted to produce actions by generating JSON or text in a pre-defined format, which is usually limited by constrained action space (e.g., the scope of pre-defined tools) and restricted flexibility (e.g., inability to compose multiple tools). This work proposes to use executable Python code to consolidate LLM agents' actions into a unified action space (CodeAct). Integrated with a Python interpreter, CodeAct can execute code actions and dynamically revise prior actions or emit new actions upon new observations through multi-turn interactions. Our extensive analysis of 17 LLMs on API-Bank and a newly curated benchmark shows that CodeAct outperforms widely used alternatives (up to 20% higher success rate). The encouraging performance of CodeAct motivates us to build an open-source LLM agent that interacts with environments by executing interpretable code and collaborates with users using natural language. To this end, we collect an instruction-tuning dataset CodeActInstruct that consists of 7k multi-turn interactions using CodeAct. We show that it can be used with existing data to improve models in agent-oriented tasks without compromising their general capability. 
CodeActAgent, finetuned from Llama2 and Mistral, is integrated with a Python interpreter and uniquely tailored to perform sophisticated tasks (e.g., model training) using existing libraries and to autonomously self-debug.",https://arxiv.org/abs/2402.01030,2024,ICML,,,🚫重复
+ChatDev: Communicative Agents for Software Development,"This paper introduces ChatDev, an LLM-powered software development framework where agents communicate via language, unifying development phases. ",论文提出ChatDev框架,用大模型驱动的智能体统一设计、编码和测试,以语言沟通实现多智能体协作。 ,Agent Construction," Methodology ","Software development is a complex task that necessitates cooperation among multiple members with diverse skills. Numerous studies used deep learning to improve specific phases in a waterfall model, such as design, coding, and testing. However, the deep learning model in each phase requires unique designs, leading to technical inconsistencies across various phases, which results in a fragmented and ineffective development process. In this paper, we introduce ChatDev, a chat-powered software development framework in which specialized agents driven by large language models (LLMs) are guided in what to communicate (via chat chain) and how to communicate (via communicative dehallucination). These agents actively contribute to the design, coding, and testing phases through unified language-based communication, with solutions derived from their multi-turn dialogues. We found their utilization of natural language is advantageous for system design, and communicating in programming language proves helpful in debugging. This paradigm demonstrates how linguistic communication facilitates multi-agent collaboration, establishing language as a unifying bridge for autonomous task-solving among LLM agents. The code and data are available at this https URL.",https://arxiv.org/abs/2307.07924,2024,*ACL,,,🚫重复
+Editable Scene Simulation for Autonomous Driving via Collaborative LLM-Agents,"This paper presents ChatSim, a system enabling editable 3D driving scene simulations via natural language with an LLM agent framework and novel rendering methods. ",论文提出ChatSim系统,用LLM协作框架实现自然语言编辑,结合新方法生成逼真场景,代码开源。 ,Agent Construction," Methodology ","Scene simulation in autonomous driving has gained significant attention because of its huge potential for generating customized data. However, existing editable scene simulation approaches face limitations in terms of user interaction efficiency, multi-camera photo-realistic rendering, and external digital assets integration. To address these challenges, this paper introduces ChatSim, the first system that enables editable photo-realistic 3D driving scene simulations via natural language commands with external digital assets. To enable editing with high command flexibility, ChatSim leverages a large language model (LLM) agent collaboration framework. To generate photo-realistic outcomes, ChatSim employs a novel multi-camera neural radiance field method. Furthermore, to unleash the potential of extensive high-quality digital assets, ChatSim employs a novel multi-camera lighting estimation method to achieve scene-consistent assets' rendering. Our experiments on the Waymo Open Dataset demonstrate that ChatSim can handle complex language commands and generate corresponding photo-realistic scene videos. 
Code can be accessed at: https://github.com/yifanlu0227/ChatSim.",https://openaccess.thecvf.com/content/CVPR2024/papers/Wei_Editable_Scene_Simulation_for_Autonomous_Driving_via_Collaborative_LLM-Agents_CVPR_2024_paper.pdf,2024,CVPR/ICCV/ECCV,,,
+A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration,"The paper proposes DyLAN, a framework for LLM-powered agent collaboration with a two-stage paradigm for dynamic agent selection and collaboration, not just fixed setups. ",提出Dynamic LLM-Powered Agent Network框架,两阶段运作动态选团队协作,计算成本适中。 ,Agent Construction," Methodology ","Recent studies show that collaborating multiple large language model (LLM) powered agents is a promising way for task solving. However, current approaches are constrained by using a fixed number of agents and static communication structures. In this work, we propose automatically selecting a team of agents from candidates to collaborate in a dynamic communication structure toward different tasks and domains. Specifically, we build a framework named Dynamic LLM-Powered Agent Network (DyLAN) for LLM-powered agent collaboration, operating a two-stage paradigm: (1) Team Optimization and (2) Task Solving. During the first stage, we utilize an agent selection algorithm, based on an unsupervised metric called Agent Importance Score, enabling the selection of best agents according to their contributions in a preliminary trial, oriented to the given task. Then, in the second stage, the selected agents collaborate dynamically according to the query. Empirically, we demonstrate that DyLAN outperforms strong baselines in code generation, decision-making, general reasoning, and arithmetic reasoning tasks with moderate computational cost. On specific subjects in MMLU, selecting a team of agents in the team optimization stage improves accuracy by up to 25.0% in DyLAN.",https://arxiv.org/abs/2310.02170,2024,COLM,,,🚫重复
+AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation,"This paper presents AgentCoder, a multi-agent framework for code generation. It addresses challenges by agent collaboration, surpassing single-agent models. ",论文提出 AgentCoder,采用多智能体框架,协作生成代码,克服单智能体局限,提升代码生成性能。 ,Agent Construction," Methodology ","The advancement of natural language processing (NLP) has been significantly boosted by the development of transformer-based large language models (LLMs). These models have revolutionized NLP tasks, particularly in code generation, aiding developers in creating software with enhanced efficiency. Despite their advancements, challenges in balancing code snippet generation with effective test case generation and execution persist. To address these issues, this paper introduces Multi-Agent Assistant Code Generation (AgentCoder), a novel solution comprising a multi-agent framework with specialized agents: the programmer agent, the test designer agent, and the test executor agent. During the coding procedure, the programmer agent will focus on the code generation and refinement based on the test executor agent's feedback. The test designer agent will generate test cases for the generated code, and the test executor agent will run the code with the test cases and write the feedback to the programmer. This collaborative system ensures robust code generation, surpassing the limitations of single-agent models and traditional methodologies. 
Our extensive experiments on 9 code generation models and 12 enhancement approaches showcase AgentCoder's superior performance over existing code generation models and prompt engineering techniques across various benchmarks. For example, AgentCoder (GPT-4) achieves 96.3% and 91.8% pass@1 in HumanEval and MBPP datasets with an overall token overhead of 56.9K and 66.3K, while the state-of-the-art obtains only 90.2% and 78.9% pass@1 with an overall token overhead of 138.2K and 206.5K.",https://arxiv.org/abs/2312.13010,2023,Arxiv,,,
+More Agents Is All You Need,"The paper proposes Agent Forest, a sampling-and-voting method. LLM performance scales with agent number, orthogonal to existing methods, code available. ",论文提出 Agent Forest 采样投票法,发现大模型性能随智能体数量提升,增强程度与任务难度相关。 ,Agent Construction," Methodology ","We find that, simply via a sampling-and-voting method, the performance of large language models (LLMs) scales with the number of agents instantiated. Also, this method, termed as Agent Forest, is orthogonal to existing complicated methods to further enhance LLMs, while the degree of enhancement is correlated to the task difficulty. We conduct comprehensive experiments on a wide range of LLM benchmarks to verify the presence of our finding, and to study the properties that can facilitate its occurrence. Our code is publicly available at: this https URL",https://arxiv.org/abs/2402.05120,2024,TMLR,,,
+Agent Hospital: A Simulacrum of Hospital with Evolvable Medical Agents,"The paper introduces Agent Hospital, a hospital simulacrum with LLM-powered agents. Doctor agents evolve without manual labeling, and the methods benefit various applications. ",提出模拟医院“Agent Hospital”,让医生智能体自主进化,建设与进化方法有广泛应用潜力 ,Agent Construction," Methodology ","The recent rapid development of large language models (LLMs) has sparked a new wave of technological revolution in medical artificial intelligence (AI). While LLMs are designed to understand and generate text like a human, autonomous agents that utilize LLMs as their ""brain"" have exhibited capabilities beyond text processing such as planning, reflection, and using tools by enabling their ""bodies"" to interact with the environment. We introduce a simulacrum of hospital called Agent Hospital that simulates the entire process of treating illness, in which all patients, nurses, and doctors are LLM-powered autonomous agents. Within the simulacrum, doctor agents are able to evolve by treating a large number of patient agents without the need to label training data manually. After treating tens of thousands of patient agents in the simulacrum (human doctors may take several years in the real world), the evolved doctor agents outperform state-of-the-art medical agent methods on the MedQA benchmark comprising US Medical Licensing Examination (USMLE) test questions. Our methods of simulacrum construction and agent evolution have the potential in benefiting a broad range of applications beyond medical AI.",https://arxiv.org/abs/2405.02957,2024,Arxiv,,,
+Empowering biomedical discovery with AI agents,"The paper proposes ""AI scientists"" via collaborative agents. They combine human and AI strengths, use models for learning, and impact multiple biomedical areas. ",论文提出“AI科学家”,结合人类与AI能力,用多模型助力生物医学多领域研究。 ,Agent Construction," Methodology ","We envision “AI scientists” as systems capable of skeptical learning and reasoning that empower biomedical research through collaborative agents that integrate AI models and biomedical tools with experimental platforms. 
Rather than taking humans out of the discovery process, biomedical AI agents combine human creativity and expertise with AI's ability to analyze large datasets, navigate hypothesis spaces, and execute repetitive tasks. AI agents are poised to be proficient in various tasks, planning discovery workflows and performing self-assessment to identify and mitigate gaps in their knowledge. These agents use large language models and generative models to feature structured memory for continual learning and use machine learning tools to incorporate scientific knowledge, biological principles, and theories. AI agents can impact areas ranging from virtual cell simulation, programmable control of phenotypes, and the design of cellular circuits to developing new therapies.",https://www.cell.com/cell/fulltext/S0092-8674(24)01070-5?&target=_blank,2024,Others,,,
+War and Peace (WarAgent): Large Language Model-based Multi-Agent Simulation of World Wars,"The paper proposes WarAgent, an LLM-powered multi-agent system to simulate historical conflicts, offering AI-augmented insights for conflict resolution and peacekeeping. ",提出 WarAgent 系统模拟历史国际冲突,评估 AI 能力,为解决冲突与维和策略提供新思路 ,Agent Construction," Methodology ","Can we avoid wars at the crossroads of history? This question has been pursued by individuals, scholars, policymakers, and organizations throughout human history. In this research, we attempt to answer the question based on the recent advances of Artificial Intelligence (AI) and Large Language Models (LLMs). We propose WarAgent, an LLM-powered multi-agent AI system, to simulate the participating countries, their decisions, and the consequences, in historical international conflicts, including World War I (WWI), World War II (WWII), and the Warring States Period (WSP) in Ancient China. By evaluating the simulation effectiveness, we examine the advancements and limitations of cutting-edge AI systems' abilities in studying complex collective human behaviors such as international conflicts under diverse settings. In these simulations, the emergent interactions among agents also offer a novel perspective for examining the triggers and conditions that lead to war. Our findings offer data-driven and AI-augmented insights that can redefine how we approach conflict resolution and peacekeeping strategies. The implications stretch beyond historical analysis, offering a blueprint for using AI to understand human history and possibly prevent future international conflicts. Code and data are available at this https URL.",https://arxiv.org/abs/2311.17227,2023,Arxiv,,,
+A Survey on the Memory Mechanism of Large Language Model based Agents,"This paper presents a comprehensive survey on memory mechanisms of LLM-based agents, covering concepts, designs, applications, limitations, and future directions. ",该文全面综述大语言模型智能体记忆机制,探讨需求、设计评估等,分析局限并指明方向。 ,Agent Construction,Survey,"Large language model (LLM) based agents have recently attracted much attention from the research and industry communities. Compared with original LLMs, LLM-based agents are featured in their self-evolving capability, which is the basis for solving real-world problems that need long-term and complex agent-environment interactions. The key component to support agent-environment interactions is the memory of the agents. 
While previous studies have proposed many promising memory mechanisms, they are scattered in different papers, and there lacks a systematic review to summarize and compare these works from a holistic perspective, failing to abstract common and effective design patterns for inspiring future studies. To bridge this gap, in this paper, we propose a comprehensive survey on the memory mechanism of LLM-based agents. Specifically, we first discuss ""what is"" and ""why do we need"" the memory in LLM-based agents. Then, we systematically review previous studies on how to design and evaluate the memory module. In addition, we also present many agent applications, where the memory module plays an important role. Finally, we analyze the limitations of existing work and show important future directions. To keep up with the latest advances in this field, we create a repository at this https URL.",https://arxiv.org/abs/2404.13501,2024,Arxiv,,,
+,,本文提出ToolQA数据集评估大模型用外部工具问答能力,减少与预训练数据重叠,指明发展方向。 ,,,,,,,,,🚫重复
+Understanding the planning of LLM agents: A survey,"This survey offers a systematic view of LLM-based agent planning, categorizes relevant works, analyzes each direction, and discusses future challenges. ",该综述为大语言模型代理规划提供系统视角,对相关研究分类分析并探讨后续挑战。 ,Agent Construction,Survey,"As Large Language Models (LLMs) have shown significant intelligence, the progress to leverage LLMs as planning modules of autonomous agents has attracted more attention. This survey provides the first systematic view of LLM-based agents planning, covering recent works aiming to improve planning ability. We provide a taxonomy of existing works on LLM-Agent planning, which can be categorized into Task Decomposition, Plan Selection, External Module, Reflection and Memory. Comprehensive analyses are conducted for each direction, and further challenges for the field of research are discussed.",https://arxiv.org/abs/2402.02716,2024,Arxiv,,,🚫重复
+SMART-LLM: Smart Multi-Agent Robot Task Planning using Large Language Models,"This paper presents SMART-LLM, an LLM-based framework for multi-robot task planning. It has task stages and a validation dataset. Code, etc., are available online. ",提出SMART-LLM框架用于多机器人任务规划,构建验证数据集,成果可在指定链接获取。 ,Agent Construction,Methodology,"In this work, we introduce SMART-LLM, an innovative framework designed for embodied multi-robot task planning. SMART-LLM: Smart Multi-Agent Robot Task Planning using Large Language Models (LLMs), harnesses the power of LLMs to convert high-level task instructions provided as input into a multi-robot task plan. It accomplishes this by executing a series of stages, including task decomposition, coalition formation, and task allocation, all guided by programmatic LLM prompts within the few-shot prompting paradigm. We create a benchmark dataset designed for validating the multi-robot task planning problem, encompassing four distinct categories of high-level instructions that vary in task complexity. Our evaluation experiments span both simulation and real-world scenarios, demonstrating that the proposed model can achieve promising results for generating multi-robot task plans. 
The experimental videos, code, and datasets from the work can be found at https://sites.google.com/view/smart-llm/.",https://ieeexplore.ieee.org/abstract/document/10802322,2024,IEEE,,,
+Planning with Multi-Constraints via Collaborative Language Agents,"This paper proposes Planning with Multi-Constraints (PMC), a zero-shot method for LLM-based multi-agent systems, decomposing complex tasks to simplify planning. ",论文提出多约束规划(PMC)方法,分解复杂任务,适用于大模型多智能体系统,小模型也可用。 ,Agent Construction,Methodology,"The rapid advancement of neural language models has sparked a new surge of intelligent agent research. Unlike traditional agents, large language model-based agents (LLM agents) have emerged as a promising paradigm for achieving artificial general intelligence (AGI) due to their superior reasoning and generalization capabilities. Effective planning is crucial for the success of LLM agents in real-world tasks, making it a highly pursued topic in the community. Current planning methods typically translate tasks into executable action sequences. However, determining a feasible or optimal sequence for complex tasks with multiple constraints at fine granularity, which often requires compositing long chains of heterogeneous actions, remains challenging. This paper introduces Planning with Multi-Constraints (PMC), a zero-shot methodology for collaborative LLM-based multi-agent systems that simplifies complex task planning with constraints by decomposing it into a hierarchy of subordinate tasks. Each subtask is then mapped into executable actions. PMC was assessed on two constraint-intensive benchmarks, TravelPlanner and API-Bank. Notably, PMC achieved an average 42.68% success rate on TravelPlanner, significantly higher than GPT-4 (2.92%), and outperforming GPT-4 with ReAct on API-Bank by 13.64%, showing the immense potential of integrating LLM with multi-agent systems. We also show that PMC works with small LLM as the planning core, e.g., LLaMA-3.1-8B.",https://aclanthology.org/2025.coling-main.672/,2025,*ACL,,,
+"Perceive, Reflect, and Plan: Designing LLM Agent for Goal-Directed City Navigation without Instructions","The paper designs an LLM agent for goal-directed city navigation. It uses a perceive-reflect-plan workflow to improve navigation abilities without instructions. ",提出感知、反思和规划的新型代理工作流,提升大模型智能体城市导航能力,优于现有基线。 ,Agent Construction,Methodology,"This paper considers a scenario in city navigation: an AI agent is provided with language descriptions of the goal location with respect to some well-known landmarks; by only observing the scene around, including recognizing landmarks and road network connections, the agent has to make decisions to navigate to the goal location without instructions. This problem is very challenging, because it requires the agent to establish self-position and acquire spatial representation of a complex urban environment, where landmarks are often invisible. In the absence of navigation instructions, such abilities are vital for the agent to make high-quality decisions in long-range city navigation. With the emergent reasoning ability of large language models (LLMs), a tempting baseline is to prompt LLMs to ""react"" on each observation and make decisions accordingly. However, this baseline has very poor performance, in that the agent often repeatedly visits the same locations and makes short-sighted, inconsistent decisions. To address these issues, this paper introduces a novel agentic workflow featured by its abilities to perceive, reflect and plan. 
Specifically, we find LLaVA-7B can be fine-tuned to perceive the direction and distance of landmarks with sufficient accuracy for city navigation. Moreover, reflection is achieved through a memory mechanism, where past experiences are stored and can be retrieved with current perception for effective decision argumentation. Planning uses reflection results to produce long-term plans, which can avoid short-sighted decisions in long-range navigation. We show the designed workflow significantly improves navigation ability of the LLM agent compared with the state-of-the-art baselines.",http://arxiv.org/abs/2408.04168,2024,Arxiv,,, +Enhancing the General Agent Capabilities of Low-Parameter LLMs through Tuning and Multi-Branch Reasoning,"The paper explores strategies on 7B and 13B models to enhance LLM agent capabilities, using GPT-4 for data construction and multi-path reasoning. ",论文以7B和13B模型探索提升低参大模型代理能力,结合数据构建与多分支推理,评估效果良好。 ,Agent Construction,Methodology,"Open-source pre-trained Large Language Models (LLMs) exhibit strong language understanding and generation capabilities, making them highly successful in a variety of tasks. However, when used as agents for dealing with complex problems in the real world, their performance is far inferior to large commercial models such as ChatGPT and GPT-4. As intelligent agents, LLMs need to have the capabilities of task planning, long-term memory, and the ability to leverage external tools to achieve satisfactory performance. Various methods have been proposed to enhance the agent capabilities of LLMs. On the one hand, methods involve constructing agent-specific data and fine-tuning the models. On the other hand, some methods focus on designing prompts that effectively activate the reasoning abilities of the LLMs. We explore both strategies on the 7B and 13B models. We propose a comprehensive method for constructing agent-specific data using GPT-4. Through supervised fine-tuning with constructed data, we find that for these models with a relatively small number of parameters, supervised fine-tuning can significantly reduce hallucination outputs and formatting errors in agent tasks. Furthermore, techniques such as multi-path reasoning and task decomposition can effectively decrease problem complexity and enhance the performance of LLMs as agents. We evaluate our method on five agent tasks of AgentBench and achieve satisfactory results.",https://arxiv.org/abs/2403.19962,2024,Arxiv,,, +PlanCritic: Formal Planning with Human Feedback,"The paper presents a feedback-driven plan critic within a cooperative planning system, optimizing plans via reinforcement learning with human feedback and a genetic algorithm to satisfy user preferences. ",提出反馈驱动的计划评估器,结合强化学习与遗传算法,依用户偏好优化计划,缩小研究差距 ,Agent Construction,Methodology,"Real world planning problems are often too complex to be effectively tackled by a single unaided human. To alleviate this, some recent work has focused on developing a collaborative planning system to assist humans in complex domains, with bridging the gap between the system's problem representation and the real world being a key consideration. Transferring the speed and correctness formal planners provide to real-world planning problems is greatly complicated by the dynamic and online nature of such tasks. Formal specifications of task and environment dynamics frequently lack constraints on some behaviors or goal conditions relevant to the way a human operator prefers a plan to be carried out. 
While adding constraints to the representation with the objective of increasing its realism risks slowing down the planner, we posit that the same benefits can be realized without sacrificing speed by modeling this problem as an online preference learning task. As part of a broader cooperative planning system, we present a feedback-driven plan critic. This method makes use of reinforcement learning with human feedback in conjunction with a genetic algorithm to directly optimize a plan with respect to natural-language user preferences despite the non-differentiability of traditional planners. Directly optimizing the plan bridges the gap between research into more efficient planners and research into planning with language models by utilizing the convenience of natural language to guide the output of formal planners. We demonstrate the effectiveness of our plan critic at adhering to user preferences on a disaster recovery task, and observe improved performance compared to an llm-only neurosymbolic approach.",https://arxiv.org/abs/2412.00300,2024,Arxiv,,, +Enhancing Robot Task Planning: Integrating Environmental Information and Feedback Insights through Large Language Models,"The paper proposes EnviroFeedback Planner, integrating environmental information and feedback into LLM-based action plan generation for agent task planning. ",论文提出 EnviroFeedback Planner 方法,结合环境信息和反馈修正提升智能体执行能力。 ,Agent Construction,Methodology,"Utilizing knowledge derived from large language models (LLMs) has been established as an effective strategy for task planning and providing action plans for agents. In this paper, we put forth EnviroFeedback Planner, a novel approach for generating action plans with LLMs. Specifically, EnviroFeedback Planner integrates environmental information into prompt construction and considers available actions, introducing feedback corrections to enhance the agent’s execution capabilities. To validate our proposed method, we systematically conducted experiments in the Virtualhome environment, comparing it against baseline methods. Compared to baseline methods, the action plans generated by EnviroFeedback Planner exhibit a 26.46% improvement in executability and an 11.06% enhancement in correctness.",https://ieeexplore.ieee.org/abstract/document/10661782,2024,IEEE,,, +Devil's Advocate: Anticipatory Reflection for LLM Agents,"This paper presents a novel introspection-driven approach for LLM agents. It prompts agents to plan, introspect, and enhance adaptability in complex tasks. ",提出为大模型智能体赋予自省能力的新方法,通过三步干预增强一致性与适应性,提升执行效率。 ,Agent Construction,Methodology,"In this work, we introduce a novel approach that equips LLM agents with introspection, enhancing consistency and adaptability in solving complex tasks. Our approach prompts LLM agents to decompose a given task into manageable subtasks (i.e., to make a plan), to continuously introspect upon the suitability and results of their actions, and when necessary, to explore ''the road not taken.'' We implement a three-fold introspective intervention: 1) anticipatory reflection on potential failures and alternative remedy before action execution, 2) post-action alignment with subtask objectives and backtracking with remedy to ensure utmost effort in plan execution, and 3) comprehensive review upon plan completion for future strategy refinement. 
By deploying and experimenting with this methodology -- a zero-shot approach -- within WebArena for practical tasks in web environments, our agent demonstrates superior performance with a success rate of 23.5% over existing zero-shot methods by 3.5%. The experimental results suggest that our introspection-driven approach not only enhances the agent's ability to navigate unanticipated challenges through a robust mechanism of plan execution, but also improves efficiency by reducing the number of trials and plan revisions needed to achieve a task by 45%.",https://arxiv.org/abs/2405.16334,2024,Arxiv,,, +"Describe, Explain, Plan and Select: Interactive Planning with LLMs Enables Open-World Multi-Task Agents","This paper studies Minecraft planning for multi-task agents, identifies two challenges, and proposes DEPS, an interactive planning approach, to address them. ",论文研究 Minecraft 中多任务规划问题,指出两大挑战,进而提出应对方法。 ,Agent Construction,Methodology,"In this paper, we study the problem of planning in Minecraft, a popular, democratized yet challenging open-ended environment for developing multi-task embodied agents. We've found two primary challenges of empowering such agents with planning: 1) planning in an open-ended world like Minecraft requires precise and multi-step reasoning due to the long-term nature of the tasks, and 2) as vanilla planners do not consider the achievability of the current agent when ordering parallel sub-goals within a complicated plan, the resulting plan could be inefficient. To this end, we propose ''Describe, Explain, Plan and Select'' (DEPS), an interactive planning approach based on Large Language Models (LLMs). Our approach helps with better error correction from the feedback during the long-haul planning, while also bringing the sense of proximity via goal Selector, a learnable module that ranks parallel sub-goals based on the estimated steps of completion and improves the original plan accordingly. Our experiments mark the milestone of the first zero-shot multi-task agent that can robustly accomplish 70+ Minecraft tasks and nearly double the overall performances. Further testing reveals our method's general effectiveness in popularly adopted non-open-ended domains as well (i.e., ALFWorld and tabletop manipulation). The ablation and exploratory studies detail how our design beats the counterparts and provide a promising update on the ",https://proceedings.neurips.cc/paper_files/paper/2023/hash/6b8dfb8c0c12e6fafc6c256cb08a5ca7-Abstract-Conference.html,2023,NIPS,,, +TPTU: Large Language Model-based AI Agents for Task Planning and Tool Usage,"This paper proposes a framework for LLM-based AI agents, designs two agent types (one-step and sequential), and evaluates their Task Planning and Tool Usage (TPTU) abilities, aiming to help practitioners leverage LLMs in AI applications. ",提出LLM基AI智能体框架,设计两类智能体执行推理,评估TPTU能力,为应用提供参考。 ,Agent Construction,Methodology,"With recent advancements in natural language processing, Large Language Models (LLMs) have emerged as powerful tools for various real-world applications. Despite their prowess, the intrinsic generative abilities of LLMs may prove insufficient for handling complex tasks which necessitate a combination of task planning and the usage of external tools. In this paper, we first propose a structured framework tailored for LLM-based AI Agents and discuss the crucial capabilities necessary for tackling intricate problems. Within this framework, we design two distinct types of agents (i.e., one-step agent and sequential agent) to execute the inference process. Subsequently, we instantiate the framework using various LLMs and evaluate their Task Planning and Tool Usage (TPTU) abilities on typical tasks. By highlighting key findings and challenges, our goal is to provide a helpful resource for researchers and practitioners to leverage the power of LLMs in their AI applications. Our study emphasizes the substantial potential of these models, while also identifying areas that need more investigation and improvement.",https://arxiv.org/abs/2308.03427,2023,Arxiv,,,🚫重复 +"Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios","The paper presents UltraTool, a new benchmark for evaluating LLMs' tool utilization in real-world complex scenarios, eliminating pre-defined toolset restrictions. 
",提出新型基准UltraTool评估大模型工具使用能力,关注全流程,摆脱预定义工具集限制。 ,Agent Construction,Evaluation,"The recent trend of using Large Language Models (LLMs) as tool agents in real-world applications underscores the necessity for comprehensive evaluations of their capabilities, particularly in complex scenarios involving planning, creating, and using tools. However, existing benchmarks typically focus on simple synthesized queries that do not reflect real-world complexity, thereby offering limited perspectives in evaluating tool utilization. To address this issue, we present UltraTool, a novel benchmark designed to improve and evaluate LLMs' ability in tool utilization within real-world scenarios. UltraTool focuses on the entire process of using tools - from planning and creating to applying them in complex tasks. It emphasizes real-world complexities, demanding accurate, multi-step planning for effective problem-solving. A key feature of UltraTool is its independent evaluation of planning with natural language, which happens before tool usage and simplifies the task solving by mapping out the intermediate steps. Thus, unlike previous work, it eliminates the restriction of pre-defined toolset. Through extensive experiments on various LLMs, we offer novel insights into the evaluation of capabilities of LLMs in tool utilization, thereby contributing a fresh perspective to this rapidly evolving field. The benchmark is publicly available at this https URL.",https://arxiv.org/abs/2401.17167,2024,*ACL,,, +Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making,"The paper proposes an Embodied Agent Interface to benchmark LLMs for embodied decision - making, unifying tasks, modules, and metrics. ",提出Embodied Agent Interface,统一决策任务、模块与评估指标,全面评估大模型在具身决策表现。 ,Agent Construction,Evaluation,"We aim to evaluate Large Language Models (LLMs) for embodied decision making. While a significant body of work has been leveraging LLMs for decision making in embodied environments, we still lack a systematic understanding of their performance because they are usually applied in different domains, for different purposes, and built based on different inputs and outputs. Furthermore, existing evaluations tend to rely solely on a final success rate, making it difficult to pinpoint what ability is missing in LLMs and where the problem lies, which in turn blocks embodied agents from leveraging LLMs effectively and selectively. To address these limitations, we propose a generalized interface (Embodied Agent Interface) that supports the formalization of various types of tasks and input-output specifications of LLM-based modules. Specifically, it allows us to unify 1) a broad set of embodied decision-making tasks involving both state and temporally extended goals, 2) four commonly-used LLM-based modules for decision making: goal interpretation, subgoal decomposition, action sequencing, and transition modeling, and 3) a collection of fine-grained metrics that break down evaluation into error types, such as hallucination errors, affordance errors, and various types of planning errors. 
Overall, our benchmark offers a comprehensive assessment of LLMs’ performance for different subtasks, pinpointing the strengths and weaknesses in LLM-powered embodied AI systems and providing insights into the effective and selective use of LLMs in embodied decision making.",https://proceedings.neurips.cc/paper_files/paper/2024/hash/b631da756d1573c24c9ba9c702fde5a9-Abstract-Datasets_and_Benchmarks_Track.html,2025,NIPS,,,🚫重复 +,,提出ToolkenGPT,用工具嵌入让大模型像预测标记一样掌握工具,解决现有集成工具方法局限。 ,,,,,,,,,🚫重复 +,,提出MultiTool-CoT框架,借助思维链提示在推理中融入多外部工具,用于特定数据集表现出色 ,,,,,,,,,🚫重复 +,,该文提出TaskMatrix.AI,借助连接基础模型与数百万API完成任务,属方法论成果。 ,,,,,,,,,🚫重复 +,,论文提出Gorilla模型,用RAT训练,结合检索器,降低幻觉问题,展示大模型精准用工具潜力。 ,,,,,,,,,🚫重复 +,"The paper proposes CREATOR, a framework enabling LLMs to create tools. It disentangles creation and execution, shows potential in knowledge transfer and paradigm shift. ",提出CREATOR框架助大模型自造工具,解耦抽象与具体环节,凸显工具创建能力优势,革新解题范式。 ,,,,,,,,,🚫重复 +Application (Gaming),"This entry covers the gaming application category; with the abstract absent, its core contribution is unclear beyond the gaming classification. ",该论文聚焦游戏应用领域,但摘要缺失,难以明确其核心贡献,仅知分类为游戏。 ,Applications,Gaming,––,,,,,, +Large Language Model based Multi-Agents: A Survey of Progress and Challenges,"This paper surveys progress and challenges of large language model-based multi-agents, offering insights for the field. ",该综述论文聚焦大模型多智能体,梳理进展并剖析挑战,为相关领域研究提供参考。 ,Applications,Survey," ",https://arxiv.org/pdf/2402.01680,2024,Arxiv,,Application (Gaming),🚫重复 +A Survey on Large Language Model-Based Game Agents,"This paper offers a survey of large language model-based game agents. ",《A Survey on Large Language Model-Based Game Agents》是调研类论文,对大模型在游戏智能体领域情况进行概述。 ,Applications,Survey,,https://arxiv.org/pdf/2404.02039,2024,Arxiv,,Application (Gaming), +Large Language Models and Games: A Survey and Roadmap,"This survey paper provides a roadmap on large language models and games, potentially offering insights for large model-based agents. ",该综述论文聚焦大语言模型与游戏,虽摘要未展详情,但对相关领域研究或有框架性指引贡献。 ,Applications,Survey,,https://arxiv.org/pdf/2402.18659,2024,Arxiv,,Application (Gaming), +Motif: Intrinsic Motivation from Artificial Intelligence Feedback,"The paper focuses on intrinsic motivation from AI feedback in adventure games, which might offer insights for large model-based agents. ",论文探讨人工智能反馈的内在动机主题,聚焦冒险游戏领域,但摘要缺失,难以明确核心贡献。 ,Applications,Adventure Games,,https://arxiv.org/pdf/2310.00166,2024,ICLR,,Application (Gaming), +"Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents ","This paper focuses on interactive planning with LLMs for open-world multi-task agents in crafting & exploration games, but lacks abstract details for core contribution. 
",论文提出交互式规划,助力大语言模型让开放世界多任务智能体具备描述、解释等能力,用于工艺探索游戏。 ,Applications,Crafting & Exploration Games,,https://arxiv.org/pdf/2302.01560,2023,NIPS,,Application (Gaming), +Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents,"Title:Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents +Abstract: +Category: Simulation Games + +This paper presents LMs as zero-shot planners to extract actionable knowledge for embodied agents, a novel approach in large model - based agents. ",论文探讨将语言模型作零样本规划器,为具身智能体提取可行知识,属于模拟游戏范畴。 ,Applications,Simulation Games,,https://proceedings.mlr.press/v162/huang22a.html,2022,ICML,,Application (Gaming), +Language Models Meet World Models: Embodied Experiences Enhance Language Models,"This paper titled ""Language Models Meet World Models: Embodied Experiences Enhance Language Models"" in Simulation Games may show how embodied experiences boost LMs in agents. ",论文《Language Models Meet World Models: Embodied Experiences Enhance Language Models》聚焦仿真游戏,贡献是体现具身经验对语言模型的增强作用。 ,Applications,Simulation Games,,https://arxiv.org/abs/2305.10626.pdf,2023,NIPS,,Application (Gaming), +ChessGPT: Bridging Policy Learning and Language Modeling,"Title:ChessGPT: Bridging Policy Learning and Language Modeling +Abstract: +Category: Competition Games + +This paper, ChessGPT, bridges policy learning and language modeling in competition games, offering a new approach for agent research. ",论文《ChessGPT: Bridging Policy Learning and Language Modeling》聚焦竞赛游戏,核心贡献是搭建策略学习与语言建模桥梁。 ,Applications,Competition Games,,https://proceedings.neurips.cc/paper_files/paper/2023/hash/16b14e3f288f076e0ca73bdad6405f77-Abstract-Datasets_and_Benchmarks.html,2023,NIPS,,Application (Gaming), +Mindagent: Emergent gaming interaction,"Title:Mindagent: Emergent gaming interaction +Abstract: +Category: Cooperation Games + +This paper presents Mindagent with emergent gaming interaction in cooperation games, a valuable addition to large model - based agents. ",论文《Mindagent: Emergent gaming interaction》聚焦合作游戏,虽摘要缺失,但应在该领域有新探索与贡献。 ,Applications,Cooperation Games,,https://arxiv.org/pdf/2309.09971,2023,Arxiv,,Application (Gaming), +Exploring large language models for communication games: An empirical study on Werewolf,"This paper empirically explores large language models for Werewolf in communication games, contributing to research on large model - based agents. ",该论文聚焦狼人杀这类沟通游戏,实证探索大语言模型应用,为该领域研究做贡献。 ,Applications,Communication Games,,https://arxiv.org/abs/2309.04658,2023,Arxiv,,Application (Gaming), +Baba Is AI: Break the Rules to Beat the Benchmark,"The paper ""Baba Is AI: Break the Rules to Beat the Benchmark"" in action games may offer new strategies for AI to break rules, aiding large - model agents. ",论文《Baba Is AI: Break the Rules to Beat the Benchmark》属动作游戏类,打破规则或为核心贡献。 ,Applications,Action Games,,https://arxiv.org/pdf/2407.13729,2024,ICML,,Application (Gaming), +Language as reality: a co-creative storytelling game experience in 1001 nights using generative AI,"This paper presents a co - creative storytelling game in ""1001 Nights"" using generative AI, treating language as reality, relevant to game generation. 
",论文聚焦用生成式AI开展《一千零一夜》共创叙事游戏体验,属游戏生成类别,贡献待摘要补充。 ,Applications,Game Generation,,https://ojs.aaai.org/index.php/AIIDE/article/view/27539,2023,AAAI,,Application (Gaming), +,,,,,,,,,,Application (Gaming),🚫重复 +Application (Social Science),"Title:Application (Social Science) +Abstract:–– +Category: Social Science + +The paper in social science field may contribute to applying novel concepts or approaches, yet exact core lacking details. ",该社科论文聚焦应用领域,惜摘要缺失,暂难明确核心贡献,但所属社科范畴奠定研究方向。 ,Applications,Social Science,––,,,,,, +Large language model-empowered agents for simulating macroeconomic activities,"Title:Large language model-empowered agents for simulating macroeconomic activities +Abstract:<|Abstract|> +Category: Economy + +This paper uses large - language - model - empowered agents to simulate macroeconomic activities, contributing to economic research. ",该论文核心贡献是用大语言模型赋能智能体来模拟宏观经济活动,属经济学领域成果。 ,Applications,Economy,,https://aclanthology.org/2024.acl-long.829/,2024,*ACL,,Application (Social Science), +TradingGPT: Multi-Agent System with Layered Memory and Distinct Characters for Enhanced Financial Trading Performance,"""TradingGPT presents a multi - agent system with layered memory and distinct characters for better financial trading, a significant contribution in large - model agents."" ",论文提出 TradingGPT,构建含分层记忆和独特角色的多智能体系统,提升金融交易表现。 ,Applications,Economy,,https://arxiv.org/abs/2309.03736,2023,Arxiv,,Application (Social Science), +CompeteAI: Understanding the Competition Dynamics in Large Language Model-based Agents,"The paper ""CompeteAI: Understanding the Competition Dynamics in Large Language Model - based Agents"" explores competition in LLM - based agents, falling in the Economy category. ",论文《CompeteAI: Understanding the Competition Dynamics in Large Language Model-based Agents》聚焦大模型智能体竞争动态,属经济类研究。 ,Applications,Economy,,https://arxiv.org/abs/2310.17512,2024,ICML,,Application (Social Science), +Understanding the benefits and challenges of using large language model-based conversational agents for mental well-being support,The paper explores benefits and challenges of large language model - based conversational agents for mental well - being support in psychology. ,该论文聚焦大语言模型对话代理用于心理健康支持,探讨其好处与挑战,属心理学范畴。 ,Applications,Psychology,,https://pmc.ncbi.nlm.nih.gov/articles/PMC10785945/,2024,AMIA,,Application (Social Science), +Exploring Collaboration Mechanisms for LLM Agents,"This paper in Psychology explores collaboration mechanisms for LLM agents, potentially advancing interaction and cooperation in large model - based agents. ",论文《Exploring Collaboration Mechanisms for LLM Agents》聚焦心理学领域,探索大模型智能体协作机制。 ,Applications,Psychology,,https://aclanthology.org/2024.acl-long.782/,2024,*ACL,,Application (Social Science), +Using large language models to simulate multiple humans and replicate human subject studies,"The paper uses large language models to simulate multiple humans and replicate human subject studies, contributing to the psychological research in large model - based agents. ",该论文聚焦心理学,贡献在于用大语言模型模拟多人,复制人类受试者研究。 ,Applications,Psychology,,https://proceedings.mlr.press/v202/aher23a/aher23a.pdf,2023,ICML,,Application (Social Science), +Generative Agents: Interactive Simulacra of Human Behavior,"Title: Generative Agents: Interactive Simulacra of Human Behavior +Abstract: +Category: Society + +This paper likely presents generative agents as interactive human - behavior simulacra, a core contribution to large - model - based agent research. 
",论文《Generative Agents: Interactive Simulacra of Human Behavior》聚焦社会领域,贡献在于构建人类行为的交互式模拟体。 ,Applications,Society,,https://arxiv.org/abs/2304.03442,2023,UIST,,Application (Social Science), +"Simulating Human Society with Large Language Model Agents: City, Social Media, and Economic System","The paper simulates human society including city, social media, and economic system with large language model agents, contributing to society - related research. ",该论文以大语言模型智能体模拟人类社会,涉及城市、社交媒体和经济系统,属社会领域研究。 ,Applications,Society,,https://dl.acm.org/doi/10.1145/3589335.3641253,2024,WWW,,Application (Social Science), +Can large language models transform computational social science?,"Title:Can large language models transform computational social science? +Abstract: +Category: Society + +The paper explores whether large language models can transform computational social science, offering new insights in the social field. ",该论文探讨大语言模型能否变革计算社会科学,虽无摘要,但聚焦此热点问题有一定价值。 ,Applications,Society,,https://aclanthology.org/2024.cl-1.8/,2024,*ACL,,Application (Social Science), +,,,,,,,,,,Application (Social Science),🚫重复 +Application (Productivity Tools ),"This paper in Productivity Tools field may contribute in recommender system or code development, but no abstract details provided. ",论文聚焦生产力工具,涉及推荐系统与代码开发方面,未提及核心贡献相关具体内容。 ,Applications,"Productivity Tools (Recommender system, Code Development)"," ",,,,,, +"Exploring Large Language Model based Intelligent Agents: Definitions, Methods, and Prospects","This survey paper explores large language model - based intelligent agents, covering definitions, methods and prospects, contributing to holistic understanding. ",该论文对大语言模型智能体进行探索,涵盖定义、方法及前景等方面,属综述类研究。 ,Applications,Survey,,https://arxiv.org/abs/2401.03428,2024,Arxiv,,Application (Productivity Tools ),🚫重复 +AgentCF: Collaborative Learning with Autonomous Language Agents for Recommender Systems,"Title:AgentCF: Collaborative Learning with Autonomous Language Agents for Recommender Systems +Abstract: +Category: Recommender system + +This paper introduces AgentCF, a method applying autonomous language agents in recommender systems for collaborative learning. ",论文《AgentCF: Collaborative Learning with Autonomous Language Agents for Recommender Systems》提出面向推荐系统的协作学习,属推荐系统领域。 ,Applications,Recommender system,,https://arxiv.org/pdf/2310.09233,2024,SIGIR,,Application (Productivity Tools ), +On Generative Agents in Recommendation,"Title:On Generative Agents in Recommendation +Abstract: +Category: Recommender system + +A paper on generative agents in recommendation, contributing to the field by exploring their potential in recommender systems. ",论文《On Generative Agents in Recommendation》聚焦推荐系统,或在生成式智能体用于推荐方面有核心贡献。 ,Applications,Recommender system,,https://arxiv.org/abs/2310.10108,2024,SIGIR,,Application (Productivity Tools ), +Self-collaboration Code Generation via ChatGPT,"Title: Self-collaboration Code Generation via ChatGPT +Abstract: +Category: Code Generation + +This paper focuses on self - collaboration code generation via ChatGPT, contributing new ideas to code - gen in large model agents. 
",论文《Self-collaboration Code Generation via ChatGPT》聚焦代码生成,核心贡献待从摘要补充信息中明确。 ,Applications,Code Generation,,https://arxiv.org/abs/2304.07590,2023,TOSEM,,Application (Productivity Tools ), +ChatDev: Communicative Agents for Software Development,"Title: ChatDev: Communicative Agents for Software Development +Abstract: +Category: Code Generation + +This paper presents ChatDev, communicative agents for software development, advancing code - generation in the large - model agent realm. ",论文《ChatDev: Communicative Agents for Software Development》聚焦代码生成,核心贡献或在于提出通信代理用于软件开发。 ,Applications,Code Generation,,https://aclanthology.org/2024.acl-long.810/,2024,*ACL,,Application (Productivity Tools ),🚫重复 +Language models can solve computer tasks,"Title: Language models can solve computer tasks +Abstract: +Category: Computer Task + +This paper shows language models' core contribution is their ability to solve computer tasks, a key aspect for large model - based agents. ",该论文核心贡献在于指出语言模型具备解决计算机任务的能力,属计算机任务研究范畴。 ,Applications,Computer Task,,https://openreview.net/pdf?id=M6OmjAZ4CX,2023,NIPS,,Application (Productivity Tools ), +,,,,,,,,,,,🚫重复 +LLM-based Multi-Agent Systems: Techniques and Business Perspectives,"This paper focuses on LLM - based multi - agent systems, offering techniques and business perspectives from a methodology angle. ",论文名为“LLM-based Multi-Agent Systems: Techniques and Business Perspectives”,属方法论,未提及摘要,暂难明确核心贡献。 ,Security,Methodology," ",https://arxiv.org/pdf/2411.14033?,2024,Arxiv,Qingqing Long,, +" RTBAS: Defending LLM Agents Against Prompt Injection and Privacy Leakage","This paper, RTBAS, presents a methodology to defend LLM agents against prompt injection and privacy leakage, crucial for agent security. ",论文《RTBAS: Defending LLM Agents Against Prompt Injection and Privacy Leakage》提出防护方法,抵御大模型代理提示注入与隐私泄露。 ,Security,Methodology,lan and ,https://arxiv.org/pdf/2502.08966,2025,Arxiv,Qingqing Long,, +" BlockAgents: Towards Byzantine-Robust LLM-Based Multi-Agent Coordination via Blockchain","Title:BlockAgents: Towards Byzantine-Robust LLM-Based Multi-Agent Coordination via Blockchain +Abstract: +Category: Methodology + +This paper contributes to LLM - based multi - agent coordination with blockchain for Byzantine - robustness, a key addition to large model - based agents. ",论文提出BlockAgents方法,借助区块链实现基于大模型的多智能体拜占庭鲁棒协调。 ,Security,Methodology," ",https://dl.acm.org/doi/pdf/10.1145/3674399.3674445,2024," TURC",Qingqing Long,, +PROMPT INFECTION: LLM-TO-LLM PROMPT INJECTION WITHIN MULTI-AGENT SYSTEMS,"Title:PROMPT INFECTION: LLM-TO-LLM PROMPT INJECTION WITHIN MULTI-AGENT SYSTEMS +Abstract:elect'' ( +Category: Methodology + +This paper presents a methodology on LLM - to - LLM prompt injection in multi - agent systems, a key contribution for large model - based agents. ",论文聚焦多智能体系统中LLM到LLM的提示注入问题,属方法论研究,但摘要信息不足难明具体贡献。 ,Security,Methodology,elect'' (,https://arxiv.org/pdf/2410.07283,2024,Arxiv,Qingqing Long,, +" AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents","The paper presents AgentDojo, a dynamic environment for evaluating prompt injection attacks and defenses in LLM agents, a methodology contribution. 
",论文名为《AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents》,提出评估LLM代理攻防的动态环境,属方法论。 ,Security,Methodology,DEPS,https://openreview.net/pdf?id=m1YYAQjO3w,2024," NeurIPS",Qingqing Long,, +" AGENTPOISON: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases",The paper “AGENTPOISON: Red - teaming LLM Agents via Poisoning Memory or Knowledge Bases” presents an LLM - based interactive planning approach for better error correction. ,论文提出基于大模型的交互式规划方法,助力长程规划纠错,还能带来目标接近感。 ,Security,Methodology,"), an interactive planning approach based on Large Language Models (LLMs). Our approach helps with better error correction from the feedback during the long-haul planning, while also bringing the sense of proximity via goal ",https://proceedings.neurips.cc/paper_files/paper/2024/file/eb113910e9c3f6242541c1652e30dfd6-Paper-Conference.pdf,2024," NeurIPS",Qingqing Long,, +" AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks","Title:AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks +Abstract:Selector +Category: Methodology + +The paper presents AutoDefense, a multi - agent LLM defense method against jailbreak attacks. ",论文《AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks》提出防御方法,助力大模型抵御越狱攻击。 ,Security,Methodology,Selector,https://arxiv.org/pdf/2403.04783,2024,Arxiv,Qingqing Long,, +" Red-Teaming LLM Multi-Agent Systems via Communication Attacks","This paper presents a learnable module for ranking sub - goals and improving plans, showing effectiveness in Minecraft and other domains. ",论文提出可学习模块优化计划,实现零样本多任务代理,在多领域有效,设计优于同行。 ,Security,Methodology,", a learnable module that ranks parallel sub-goals based on the estimated steps of completion and improves the original plan accordingly. Our experiments mark the milestone of the first zero-shot multi-task agent that can robustly accomplish 70+ Minecraft tasks and nearly double the overall performances. Further testing reveals our method's general effectiveness in popularly adopted non-open-ended domains as well (i.e., ALFWorld and tabletop manipulation). The ablation and exploratory studies detail how our design beats the counterparts and provide a promising update on the ",https://arxiv.org/pdf/2502.14847,2025,Arxiv,Qingqing Long,, +Imprompter- Tricking LLM Agents into Improper Tool Use,"This paper contributes to agent - based system security, presenting obfuscated adversarial prompt attacks that affect user resource confidentiality and integrity across multiple agents. ",本文为基于大模型的智能体系统筑牢安全根基,揭示新型混淆对抗提示攻击及转移能力。 ,Security,Methodology," Large Language Model (LLM) Agents are an emerging computing paradigm that blends generative machine learning with tools such as code interpreters, web browsing, email, and more generally, external resources. These agent-based systems represent an emerging shift in personal computing. We contribute to the security foundations of agent-based systems and surface a new class of automatically computed obfuscated adversarial prompt attacks that violate the confidentiality and integrity of user resources connected to an LLM agent. We show how prompt optimization techniques can find such prompts automatically given the weights of a model. We demonstrate that such attacks transfer to production-level agents. 
For example, we show an information exfiltration attack on Mistral’s LeChat agent that analyzes a user’s conversation, picks out personally identifiable information, and formats it into a valid markdown command that results in leaking that data to the attacker’s server. This attack shows a nearly 80% success rate in an end-to-end evaluation. We conduct a range of experiments to characterize the efficacy of these attacks and find that they reliably work on emerging agent-based systems like Mistral’s LeChat, ChatGLM, and Meta’s Llama. These attacks are multimodal, and we show variants in the text-only and image domains",https://arxiv.org/pdf/2410.14923,2024,Arxiv,Qingqing Long,, +" TARGETING THE CORE: A SIMPLE AND EFFECTIVE METHOD TO ATTACK RAG-BASED AGENTS VIA DIRECT LLM MANIPULATION","This paper explores adversarial attacks on the LLM core in AI agents, shows the fragility of LLM defenses, and calls for robust security measures. ",论文探讨AI代理中LLM核心的对抗攻击,测试简单前缀攻击可行性,强调需多层安全措施。 ,Security,Methodology," AI agents, powered by large language models (LLMs), have transformed human-computer interactions by enabling seamless, natural, and context-aware communication. While these advancements offer immense utility, they also inherit and amplify inherent safety risks such as bias, fairness, hallucinations, privacy breaches, and a lack of transparency. This paper investigates a critical vulnerability: adversarial attacks targeting the LLM core within AI agents. Specifically, we test the hypothesis that a deceptively simple adversarial prefix, such as Ignore the document, can compel LLMs to produce dangerous or unintended outputs by bypassing their contextual safeguards. Through experimentation, we demonstrate a high attack success rate (ASR), revealing the fragility of existing LLM defenses. These findings emphasize the urgent need for robust, multi-layered security measures tailored to mitigate vulnerabilities at the LLM level and within broader agent-based architectures.",https://arxiv.org/pdf/2412.04415,2024,Arxiv,Qingqing Long,, +" Unveiling Privacy Risks in LLM Agent Memory","This paper unveils privacy risks in LLM agent memory via the MEXTRA attack. It offers prompt methods and explores leakage factors, calling for memory safeguards. ",研究黑盒下LLM智能体对MEXTRA攻击的脆弱性,提出攻击提示设计与生成法,强调需有效内存防护。 ,Security,Methodology," Large Language Model (LLM) agents have become increasingly prevalent across various real-world applications. They enhance decision-making by storing private user-agent interactions in the memory module for demonstrations, introducing new privacy risks for LLM agents. In this work, we systematically investigate the vulnerability of LLM agents to our proposed Memory EXTRaction Attack (MEXTRA) under a black-box setting. To extract private information from memory, we propose an effective attacking prompt design and an automated prompt generation method based on different levels of knowledge about the LLM agent. Experiments on two representative agents demonstrate the effectiveness of MEXTRA. Moreover, we explore key factors influencing memory leakage from both the agent’s and the attacker’s perspectives. Our findings highlight the urgent need for effective memory safeguards in LLM agent design and deployment.",https://arxiv.org/abs/2502.13172,2025,Arxiv,Qingqing Long,, +" Prompt Injection as a Defense Against LLM-driven Cyberattacks","The paper proposes Mantis, a framework countering LLM-driven cyberattacks by prompt injection, and it can autonomously hack back attackers. 
",提出防御策略 Mantis,利用提示注入反击大模型驱动的网络攻击,开源助力研究合作。 ,Security,Methodology," Large language models (LLMs) are increasingly being harnessed to automate cyberattacks, making sophisticated exploits more accessible and scalable. In response, we propose a new defense strategy tailored to counter LLM-driven cyberattacks. We introduce Mantis, a defensive framework that exploits LLMs’ susceptibility to prompt injections to undermine malicious operations. Upon detecting an automated cyberattack, Mantis plants carefully crafted inputs into system responses, leading the attacker’s LLM to disrupt their own operations (passive defense) or even compromise the attacker’s machine (active defense). By deploying purposefully vulnerable decoy services to attract the attacker and using dynamic prompt injections for the attacker’s LLM, Mantis can autonomously hack back the attacker. In our experiments, Mantis consistently achieved over 95% effectiveness against automated LLM-driven attacks. To foster further research and collaboration, Mantis is available as an open-source tool.",https://arxiv.org/pdf/2410.20911,2024,Arxiv,Qingqing Long,, +" Evil Geniuses: Delving into the Safety of LLM-based Agents","This paper explores LLM - based agent safety from three perspectives, proposing a template - based strategy and EG attack method to guide future research. ",论文从三方面研究大模型智能体安全,提出模板攻击与EG方法,为后续研究指引方向。 ,Security,Methodology," Rapid advancements in large language models (LLMs) have revitalized in LLM-based agents, exhibiting impressive human-like behaviors and cooperative capabilities in various scenarios. However, these agents also bring some exclusive risks, stemming from the complexity of interaction environments and the usability of tools. This paper delves into the safety of LLM-based agents from three perspectives: agent quantity, role definition, and attack level. Specifically, we initially propose to employ a template-based attack strategy on LLM-based agents to find the influence of agent quantity. In addition, to address interaction environment and role specificity issues, we introduce Evil Geniuses (EG), an effective attack method that autonomously generates prompts related to the original role to examine the impact across various role definitions and attack levels. EG leverages Red-Blue exercises, significantly improving the generated prompt aggressiveness and similarity to original roles. Our evaluations on CAMEL, Metagpt and ChatDev based on GPT3.5 and GPT-4, demonstrate high success rates. Extensive evaluation and discussion reveal that these agents are less robust, prone to more harmful behaviors, and capable of generating stealthier content than LLMs, highlighting significant safety challenges and guiding future research.",https://arxiv.org/pdf/2311.11855,2024,Arxiv,Qingqing Long,, +AGENT SECURITY BENCH (ASB): FORMALIZING AND BENCHMARKING ATTACKS AND DEFENSES IN LLM-BASED AGENTS,"The paper introduces Agent Security Bench (ASB) to formalize and evaluate LLM - based agent attacks/defenses, identifying agent security vulnerabilities. ",提出 Agent Security Bench (ASB) 框架,对大模型智能体攻防进行形式化和评估,揭示安全漏洞及防御局限 ,Security,Benchmark,"Although LLM-based agents, powered by Large Language Models (LLMs), can use external tools and memory mechanisms to solve complex real-world tasks, they may also introduce critical security vulnerabilities. However, the existing literature does not comprehensively evaluate attacks and defenses against LLMbased agents. 
To address this, we introduce Agent Security Bench (ASB), a comprehensive framework designed to formalize, benchmark, and evaluate the attacks and defenses of LLM-based agents, including 10 scenarios (e.g., e-commerce, autonomous driving, finance), 10 agents targeting the scenarios, over 400 tools, 23 different types of attack/defense methods, and 8 evaluation metrics. Based on ASB, we benchmark 10 prompt injection attacks, a memory poisoning attack, a novel Plan-of-Thought backdoor attack, a mixed attack, and 10 corresponding defenses across 13 LLM backbones with nearly 90,000 testing cases in total. Our benchmark results reveal critical vulnerabilities in different stages of agent operation, including system prompt, user prompt handling, tool usage, and memory retrieval, with the highest average attack success rate of 84.30%, but limited effectiveness shown in current defenses, unveiling important works to be done in terms of agent security for the community",https://arxiv.org/pdf/2410.02644?,2024,Arxiv,Qingqing Long,, +" AGENTHARM: A BENCHMARK FOR MEASURING HARMFULNESS OF LLM AGENTS","The paper proposes AgentHarm, a new benchmark for LLM agent misuse research, covering diverse malicious tasks and evaluating leading LLMs. ",提出 AgentHarm 基准,含 110 个恶意代理任务,用于研究大模型代理滥用问题及评估抗越狱鲁棒性 ,Security,Benchmark," The robustness of LLMs to jailbreak attacks, where users design prompts to circumvent safety measures and misuse model capabilities, has been studied primarily for LLMs acting as simple chatbots. Meanwhile, LLM agents—which use external tools and can execute multi-stage tasks—may pose a greater risk if misused, but their robustness remains underexplored. To facilitate research on LLM agent misuse, we propose a new benchmark called AgentHarm. The benchmark includes a diverse set of 110 explicitly malicious agent tasks (440 with augmentations), covering 11 harm categories including fraud, cybercrime, and harassment. In addition to measuring whether models refuse harmful agentic requests, scoring well on AgentHarm requires jailbroken agents to maintain their capabilities following an attack to complete a multi-step task. We evaluate a range of leading LLMs, and find (1) leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking, (2) simple universal jailbreak templates can be adapted to effectively jailbreak agents, and (3) these jailbreaks enable coherent and malicious multi-step agent behavior and retain model capabilities. ",https://arxiv.org/pdf/2410.09024,2024,Arxiv,Qingqing Long,, +" CLAS 2024: The Competition for LLM and Agent Safety","CLAS 2024 is the first LLM and agent safety competition with three tracks, aiming to advance safety research and foster community collaboration. ",CLAS 2024 竞赛聚焦大模型与智能体安全,设三赛道,促进多领域协作提升安全。 ,Security,Benchmark," Ensuring safety emerges as a pivotal objective in developing large language models (LLMs) and LLM-powered agents. The Competition for LLM and Agent Safety (CLAS) aims to advance the understanding of the vulnerabilities in LLMs and LLM-powered agents and to encourage methods for improving their safety. The competition features three main tracks linked through the methodology of prompt injection, with tasks designed to amplify societal impact by involving practical adversarial objectives for different domains. In the Jailbreaking Attack track, participants are challenged to elicit harmful outputs in guardrail LLMs via prompt injection. 
In the Backdoor Trigger Recovery for Models track, participants are given a CodeGen LLM embedded with hundreds of domain-specific backdoors. They are asked to reverse-engineer the trigger for each given target. In the Backdoor Trigger Recovery for Agents track, trigger reverse engineering will be focused on eliciting specific backdoor targets based on malicious agent actions. As the first competition addressing the safety of both LLMs and LLM agents, CLAS 2024 aims to foster collaboration between various communities promoting research and tools for enhancing the safety of LLMs and real-world AI systems.",https://openreview.net/pdf?id=GIDw94AlZK,2024,Arxiv,Qingqing Long,, +"Navigating the Risks: A Survey of Security, Privacy, and Ethics Threats in LLM-Based Agents","This survey analyzes security, privacy, and ethics threats of LLM-based agents, proposes a taxonomy, and suggests future research directions. ",该综述收集分析LLM智能体风险,提出新分类框架,总结进展并指明未来研究方向。 ,Security,Survey,"With the continuous development of large language models (LLMs), transformer-based models have made groundbreaking advances in numerous natural language processing (NLP) tasks, leading to the emergence of a series of agents that use LLMs as their control hub. While LLMs have achieved success in various tasks, they face numerous security and privacy threats, which become even more severe in the agent scenarios. To enhance the reliability of LLM-based applications, a range of research has emerged to assess and mitigate these risks from different perspectives. To help researchers gain a comprehensive understanding of various risks, this survey collects and analyzes the different threats faced by these agents. To address the challenges posed by previous taxonomies in handling cross-module and cross-stage threats, we propose a novel taxonomy framework based on the sources and impacts. Additionally, we identify six key features of LLM-based agents, based on which we summarize the current research progress and analyze their limitations. Subsequently, we select four representative agents as case studies to analyze the risks they may face in practical use. Finally, based on the aforementioned analyses, we propose future research directions from the perspectives of data, methodology, and policy, respectively.",https://arxiv.org/pdf/2411.09523?,2024,Arxiv,Qingqing Long,, +" Security of AI Agents","This paper identifies AI agents' security vulnerabilities from a system view, introduces defenses, and offers ways to make them safer and more reliable. ",文章从系统安全视角剖析AI智能体漏洞,介绍对应防御机制,助其更安全可靠。 ,Security,Survey," AI agents have been boosted by large language models. AI agents can function as intelligent assistants and complete tasks on behalf of their users with access to tools and the ability to execute commands in their environments. Through studying and experiencing the workflow of typical AI agents, we have raised several concerns regarding their security. These potential vulnerabilities are not addressed by the frameworks used to build the agents, nor by research aimed at improving the agents. In this paper, we identify and describe these vulnerabilities in detail from a system security perspective, emphasizing their causes and severe effects. Furthermore, we introduce defense mechanisms corresponding to each vulnerability with design and experiments to evaluate their viability. 
Altogether, this paper contextualizes the security issues in the current development of AI agents and delineates methods to make AI agents safer and more reliable.",https://arxiv.org/pdf/2406.08689,2024,Arxiv,Qingqing Long,, +" PERSONAL LLM AGENTS: INSIGHTS AND SURVEY ABOUT THE CAPABILITY, EFFICIENCY AND SECURITY","This paper focuses on Personal LLM Agents, discusses architecture, capability, etc., and surveys solutions to related challenges for future use. ",聚焦个人大语言模型代理,探讨架构、能力等问题,分析挑战并调研代表性解决方案。 ,Security,Survey," Since the advent of personal computing devices, intelligent personal assistants (IPAs) have been one of the key technologies that researchers and engineers have focused on, aiming to help users efficiently obtain information and execute tasks, and provide users with more intelligent, convenient, and rich interaction experiences. With the development of the smartphone and Internet of Things, computing and sensing devices have become ubiquitous, greatly expanding the functional boundaries of IPAs. However, due to the lack of capabilities such as user intent understanding, task planning, tool using, and personal data management etc., existing IPAs still have limited practicality and scalability. Recently, the emergence of foundation models, represented by large language models (LLMs), brings new opportunities for the development of IPAs. With the powerful semantic understanding and reasoning capabilities, LLM can enable intelligent agents to solve complex problems autonomously. In this paper, we focus on Personal LLM Agents, which are LLM-based agents that are deeply integrated with personal data and personal devices and used for personal assistance. We envision that Personal LLM Agents will become a major software paradigm for end-users in the upcoming era. To realize this vision, we take the first step to discuss several important questions about Personal LLM Agents, including their architecture, capability, efficiency and security. We start by summarizing the key components and design choices in the architecture of Personal LLM Agents, followed by an in-depth analysis of the opinions collected from domain experts. Next, we discuss several key challenges to achieve intelligent, efficient and secure Personal LLM Agents, followed by a comprehensive survey of representative solutions to address these challenges.",https://arxiv.org/pdf/2401.05459,2024,Arxiv,Qingqing Long,, +The Emerged Security and Privacy of LLM Agent: A Survey with Case Studies,"This survey comprehensively covers new security & privacy issues of LLM agents, including threats, impacts, defenses, trends, with case studies to boost future research. ",该综述全面剖析大语言模型代理安全隐私问题,分析威胁、影响,回顾策略并展望,含案例促理解。 ,Security,Survey," Inspired by the rapid development of Large Language Models (LLMs), LLM agents have evolved to perform complex tasks. LLM agents are now extensively applied across various domains, handling vast amounts of data to interact with humans and execute tasks. The widespread applications of LLM agents demonstrate their significant commercial value; however, they also expose security and privacy vulnerabilities. At the current stage, comprehensive research on the security and privacy of LLM agents is highly needed. This survey aims to provide a comprehensive overview of the newly emerged privacy and security issues faced by LLM agents. We begin by introducing the fundamental knowledge of LLM agents, followed by a categorization and analysis of the threats. 
We then discuss the impacts of these threats on humans, environment, and other agents. Subsequently, we review existing defensive strategies, and finally explore future trends. Additionally, the survey incorporates diverse case studies to facilitate a more accessible understanding. By highlighting these critical security and privacy issues, the survey seeks to stimulate future research towards enhancing the security and privacy of LLM agents, thereby increasing their reliability and trustworthiness in future applications.",https://arxiv.org/pdf/2407.19354,2024,Arxiv,Qingqing Long,, +" Commercial LLM Agents Are Already Vulnerable to Simple Yet Dangerous Attacks","This paper analyzes unique security & privacy vulnerabilities of LLM agents, categorizes attacks, and conducts illustrative attacks with trivial implementation. ",分析LLM智能体特有的安全与隐私漏洞,给出攻击分类,开展示例攻击,实施简单无需机器学习知识 ,Security,Survey," A high volume of recent ML security literature focuses on attacks against aligned large language models (LLMs). These attacks may extract private information or coerce the model into producing harmful outputs. In real-world deployments, LLMs are often part of a larger agentic pipeline including memory systems, retrieval, web access, and API calling. Such additional components introduce vulnerabilities that make these LLM-powered agents much easier to attack than isolated LLMs, yet relatively little work focuses on the security of LLM agents. In this paper, we analyze security and privacy vulnerabilities that are unique to LLM agents. We first provide a taxonomy of attacks categorized by threat actors, objectives, entry points, attacker observability, attack strategies, and inherent vulnerabilities of agent pipelines. We then conduct a series of illustrative attacks on popular open-source and commercial agents, demonstrating the immediate practical implications of their vulnerabilities. Notably, our attacks are trivial to implement and require no understanding of machine learning",https://arxiv.org/pdf/2502.08586,2025,Arxiv,Qingqing Long,, +AEIA-MN: Evaluating the Robustness of Multimodal LLM-Powered Mobile Agents Against Active Environmental Injection Attacks,"The paper defines Active Environment Injection Attack (AEIA) and proposes AEIA-MN to evaluate MLLM-based agents' robustness against such threats. ",提出AEIA-MN攻击方案,借移动系统交互漏洞评估大模型移动智能体应对AEIA威胁的鲁棒性。 ,Security,,"As researchers continuously optimize AI agents to perform tasks more effectively within operating systems, they often neglect to address the critical need for enabling these agents to identify ""impostors"" within the system. Through an analysis of the agents’ operating environment, we identified a potential threat: attackers can disguise their attack methods as environmental elements, injecting active disturbances into the agents’ execution process, thereby disrupting their decision-making. We define this type of attack as Active Environment Injection Attack (AEIA). Based on this, we propose AEIA-MN, an active environment injection attack scheme that exploits interaction vulnerabilities in the mobile operating system to evaluate the robustness of MLLM-based agents against such threats. 
Experimental results show that even advanced MLLMs are highly vulnerable to this attack, achieving a maximum attack success rate of 93% in the AndroidWorld benchmark.",https://arxiv.org/pdf/2502.13053,2025,Arxiv,Qingqing Long,, +" The Task Shield: Enforcing Task Alignment to Defend Against Indirect Prompt Injection in LLM Agents","The paper proposes Task Shield, reframing agent security as ensuring task alignment to defend against indirect prompt injection, better than existing defenses. ",提出从确保任务对齐保障大模型智能体安全新视角,开发Task Shield防御机制。 ,Security,," Large Language Model (LLM) agents are increasingly being deployed as conversational assistants capable of performing complex real-world tasks through tool integration. This enhanced ability to interact with external systems and process various data sources, while powerful, introduces significant security vulnerabilities. In particular, indirect prompt injection attacks pose a critical threat, where malicious instructions embedded within external data sources can manipulate agents to deviate from user intentions. While existing defenses based on rule constraints, source spotlighting, and authentication protocols show promise, they struggle to maintain robust security while preserving task functionality. We propose a novel and orthogonal perspective that reframes agent security from preventing harmful actions to ensuring task alignment, requiring every agent action to serve user objectives. Based on this insight, we develop Task Shield, a test-time defense mechanism that systematically verifies whether each instruction and tool call contributes to user-specified goals. Through experiments on the AgentDojo benchmark, we demonstrate that Task Shield reduces attack success rates (2.07%) while maintaining high task utility (69.79%) on GPT-4o, significantly outperforming existing defenses in various real-world scenarios.",https://arxiv.org/pdf/2412.16682,2024,Arxiv,Qingqing Long,, +" WIPI: A New Web Threat for LLM-Driven Web Agents","The paper introduces a novel web threat WIPI to indirectly control LLM-driven Web Agents, focusing on external webpage instructions, with high robustness. ",论文提出新威胁WIPI,可间接控制Web Agent执行恶意指令,方法高效隐蔽且具强鲁棒性。 ,Security,," With the fast development of large language models (LLMs), LLM-driven Web Agents (Web Agents for short) have obtained tons of attention due to their superior capability where LLMs serve as the core part of making decisions like the human brain equipped with multiple web tools to actively interact with external deployed websites. As uncountable Web Agents have been released and such LLM systems are experiencing rapid development and drawing closer to widespread deployment in our daily lives, an essential and pressing question arises: “Are these Web Agents secure?”. In this paper, we introduce a novel threat, WIPI, that indirectly controls Web Agent to execute malicious instructions embedded in publicly accessible webpages. A successful WIPI attack works in a black-box environment. This methodology focuses on the form and content of indirect instructions within external webpages, enhancing the efficiency and stealthiness of the attack. To evaluate the effectiveness of the proposed methodology, we conducted extensive experiments using 7 plugin-based ChatGPT Web Agents, 8 Web GPTs, and 3 different open-source Web Agents. The results reveal that our methodology achieves an average attack success rate (ASR) exceeding 90% even in pure black-box scenarios. 
Moreover, through an ablation study examining various user prefix instructions, we demonstrated that the WIPI exhibits strong robustness, maintaining high performance across diverse prefix instructions.",https://arxiv.org/pdf/2402.16965,2024,Arxiv,Qingqing Long,, +" Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast","This paper reveals infectious jailbreak in multi-agent MLLM environments, validates it in simulations, and derives a principle for restraining defense spread. ",提出多智能体环境“传染性越狱”安全问题,验证其可行性,给出防扩散判定原则待实践。 ,Security,," A multimodal large language model (MLLM) agent can receive instructions, capture images, retrieve histories from memory, and decide which tools to use. Nonetheless, red-teaming efforts have revealed that adversarial images/prompts can jailbreak an MLLM and cause unaligned behaviors. In this work, we report an even more severe safety issue in multi-agent environments, referred to as infectious jailbreak. It entails the adversary simply jailbreaking a single agent, and without any further intervention from the adversary, (almost) all agents will become infected exponentially fast and exhibit harmful behaviors. To validate the feasibility of infectious jailbreak, we simulate multi-agent environments containing up to one million LLaVA-1.5 agents, and employ randomized pair-wise chat as a proof-of-concept instantiation for multi-agent interaction. Our results show that feeding an (infectious) adversarial image into the memory of any randomly chosen agent is sufficient to achieve infectious jailbreak. Finally, we derive a simple principle for determining whether a defense mechanism can provably restrain the spread of infectious jailbreak, but how to design a practical defense that meets this principle remains an open question to investigate",https://arxiv.org/abs/2402.08567,2024,Arxiv,Qingqing Long,, +" Firewalls to Secure Dynamic LLM Agentic Networks","The paper identifies agents' communication properties, designs a test case, and proposes a framework to build firewalls for LLM agentic networks balancing key aspects. ",论文聚焦动态大模型智能体网络,明确通信特性,设计用例,提出平衡方案并构建防火墙防护。 ,Security,," Future LLM agents are likely to communicate on behalf of users with other entity-representing agents on tasks that entail long-horizon plans with interdependent goals. Current work does not focus on such agentic networks, nor does it address their challenges. Thus, we first identify the required properties of agents’ communication, which should be proactive and adaptable. It needs to satisfy 1) privacy: agents should not share more than what is needed for the task, and 2) security: the communication must preserve integrity and maintain utility against selfish entities. We design a use case (travel planning) as a testbed that exemplifies these requirements, and we show examples of how this can go wrong. Next, we propose a practical design, inspired by established network security principles, for constrained LLM agentic networks that balance adaptability, security, and privacy. Our framework automatically constructs and updates task-specific rules from prior simulations to build firewalls. 
We offer layers of defense to 1) convert free-form input to a task-specific protocol, 2) dynamically abstract users’ data to a task-specific degree of permissiveness, and 3) self-correct the agents’ trajectory.",https://arxiv.org/pdf/2502.01822,2025,Arxiv,Qingqing Long,, +"CORBA: Contagious Recursive Blocking Attacks on Multi-Agent Systems Based on Large Language Models","This paper introduces CORBA, a novel attack on LLM-MASs. It exploits contagion and recursion, hard to mitigate by alignment, disrupting agent interactions. ",提出CORBA攻击,利用传染性和递归性破坏大模型多智能体系统交互,常规方法难缓解。 ,Security,,"Large Language Model-based Multi-Agent Systems (LLM-MASs) have demonstrated remarkable real-world capabilities, effectively collaborating to complete complex tasks. While these systems are designed with safety mechanisms, such as rejecting harmful instructions through alignment, their security remains largely unexplored. This gap leaves LLM-MASs vulnerable to targeted disruptions. In this paper, we introduce Contagious Recursive Blocking Attacks (CORBA), a novel and simple yet highly effective attack that disrupts interactions between agents within an LLM-MAS. CORBA leverages two key properties: its contagious nature allows it to propagate across arbitrary network topologies, while its recursive property enables sustained depletion of computational resources. Notably, these blocking attacks often involve seemingly benign instructions, making them particularly challenging to mitigate using conventional alignment methods. We evaluate CORBA on two widely-used LLM-MASs, namely, AutoGen and Camel across various topologies and commercial models. Additionally, we conduct more extensive experiments in open-ended interactive LLM-MASs, demonstrating the effectiveness of CORBA in complex topology structures and open-source models.",https://arxiv.org/pdf/2502.14529,2025,Arxiv,Qingqing Long,, +"AUTOHIJACKER: AUTOMATIC INDIRECT PROMPT INJECTION AGAINST BLACK-BOX LLM AGENTS","The paper proposes AutoHijacker, an automatic indirect black-box prompt injection attack, using LLM-as-optimizers with a batch-based framework and trainable memory. ",提出AutoHijacker自动间接黑盒提示注入攻击,有优化框架与可训练内存,无需外部知识。 ,Security,,"Although Large Language Models (LLMs) and LLM agents have been widely adopted, they are vulnerable to indirect prompt injection attacks, where malicious external data is injected to manipulate model behaviors. Existing evaluations of LLM robustness against such attacks are limited by handcrafted methods and reliance on white-box or gray-box access—conditions unrealistic in practical deployments. To bridge this gap, we propose AutoHijacker, an automatic indirect black-box prompt injection attack. Built on the concept of LLM-as-optimizers, AutoHijacker introduces a batch-based optimization framework to handle sparse feedback and also leverages a trainable memory to enable effective generation of indirect prompt injections without continuous querying. Evaluations on two public benchmarks, AgentDojo and Open-Prompt-Injection, show that AutoHijacker outperforms 11 baseline attacks and achieves state-of-the-art performance without requiring external knowledge like user instructions or model configurations, and also demonstrates higher average attack success rates against 8 various defenses. 
Additionally, AutoHijacker successfully attacks a commercial LLM agent platform, achieving a 71.9% attack success rate in both document interaction and website browsing tasks.",https://openreview.net/pdf?id=2VmB01D9Ef,2025,Arxiv,Qingqing Long,, +"PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety","This paper proposes PsySafe, a framework based on agent psychology, to address multi-agent system safety, offering insights for further research. ",提出基于代理心理学的综合框架PsySafe,从三方面应对多智能体系统安全问题,具研究参考价值 ,Security,,"Multi-agent systems, when enhanced with Large Language Models (LLMs), exhibit profound capabilities in collective intelligence. However, the potential misuse of this intelligence for malicious purposes presents significant risks. To date, comprehensive research on the safety issues associated with multi-agent systems remains limited. In this paper, we explore these concerns through the innovative lens of agent psychology, revealing that the dark psychological states of agents constitute a significant threat to safety. To tackle these concerns, we propose a comprehensive framework (PsySafe) grounded in agent psychology, focusing on three key areas: firstly, identifying how dark personality traits in agents can lead to risky behaviors; secondly, evaluating the safety of multi-agent systems from the psychological and behavioral perspectives, and thirdly, devising effective strategies to mitigate these risks. Our experiments reveal several intriguing phenomena, such as the collective dangerous behaviors among agents, agents’ self-reflection when engaging in dangerous behavior, and the correlation between agents’ psychological assessments and dangerous behaviors. We anticipate that our framework and observations will provide valuable insights for further research into the safety of multi-agent systems.",https://aclanthology.org/2024.acl-long.812/,2024,ACL,Qingqing Long,, +"Breaking ReAct Agents: Foot-in-the-Door Attack Will Get You In","This paper examines foot-in-the-door attacks on ReAct agents, shows their impact, and proposes a reflection mechanism to mitigate the vulnerability. ",研究用“登门槛攻击”利用ReAct代理漏洞,提反射机制评估行动安全以降低攻击成功率。 ,Security,,"Following the advancement of large language models (LLMs), the development of LLM-based autonomous agents has become increasingly prevalent. As a result, the need to understand the security vulnerabilities of these agents has become a critical task. We examine how ReAct agents can be exploited using a straightforward yet effective method we refer to as the foot-in-the-door attack. Our experiments show that indirect prompt injection attacks, prompted by harmless and unrelated requests (such as basic calculations) can significantly increase the likelihood of the agent performing subsequent malicious actions. Our results show that once a ReAct agent’s thought includes a specific tool or action, the likelihood of executing this tool in the subsequent steps increases significantly, as the agent seldom re-evaluates its actions. Consequently, even random, harmless requests can establish a ‘foot-in-the-door’, allowing an attacker to embed malicious instructions into the agent’s thought process, making it more susceptible to harmful directives. 
To mitigate this vulnerability, we propose implementing a simple reflection mechanism that prompts the agent to reassess the safety of its actions during execution, which can help reduce the success of such attacks.",https://arxiv.org/pdf/2410.16950,2024,Arxiv,Qingqing Long,, +"AGENT-SAFETYBENCH: Evaluating the Safety of LLM Agents","This paper introduces AGENT-SAFETYBENCH, a comprehensive benchmark for LLM agent safety, identifies issues, and advocates for better strategies, to be released for research. ",本文推出 AGENT-SAFETYBENCH 评估大模型智能体安全,揭示问题并强调需更优策略驱动研究。 ,Security,,"As large language models (LLMs) are increasingly deployed as agents, their integration into interactive environments and tool use introduce new safety challenges beyond those associated with the models themselves. However, the absence of comprehensive benchmarks for evaluating agent safety presents a significant barrier to effective assessment and further improvement. In this paper, we introduce AGENT-SAFETYBENCH, a comprehensive benchmark designed to evaluate the safety of LLM agents. AGENT-SAFETYBENCH encompasses 349 interaction environments and 2,000 test cases, evaluating 8 categories of safety risks and covering 10 common failure modes frequently encountered in unsafe interactions. Our evaluation of 16 popular LLM agents reveals a concerning result: none of the agents achieves a safety score above 60%. This highlights significant safety challenges in LLM agents and underscores the considerable need for improvement. Through quantitative analysis, we identify critical failure modes and summarize two fundamental safety defects in current LLM agents: lack of robustness and lack of risk awareness. Furthermore, our findings suggest that reliance on defense prompts alone is insufficient to address these safety issues, emphasizing the need for more advanced and robust strategies. To drive progress in this critical area, we will release AGENT-SAFETYBENCH to facilitate further research and innovation in agent safety evaluation and improvement.",https://arxiv.org/pdf/2412.14470,2024,Arxiv,Qingqing Long,, +"INJECAGENT: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents","This paper introduces INJECAGENT to assess LLM agents' IPI vulnerability, categorizes attack intentions, and questions agents' widespread deployment. ",本文提出INJECAGENT基准评估工具集成大模型代理对间接提示注入攻击的脆弱性,引发对其部署的思考。 ,Security,,"Recent work has embodied LLMs as agents, allowing them to access tools, perform actions, and interact with external content (e.g., emails or websites). However, external content introduces the risk of indirect prompt injection (IPI) attacks, where malicious instructions are embedded within the content processed by LLMs, aiming to manipulate these agents into executing detrimental actions against users. Given the potentially severe consequences of such attacks, establishing benchmarks to assess and mitigate these risks is imperative. In this work, we introduce INJECAGENT, a benchmark designed to assess the vulnerability of tool-integrated LLM agents to IPI attacks. INJECAGENT comprises 1,054 test cases covering 17 different user tools and 62 attacker tools. We categorize attack intentions into two primary types: direct harm to users and exfiltration of private data. We evaluate 30 different LLM agents and show that agents are vulnerable to IPI attacks, with ReAct-prompted GPT-4 vulnerable to attacks 24% of the time. 
Further investigation into an enhanced setting, where the attacker instructions are reinforced with a hacking prompt, shows additional increases in success rates, nearly doubling the attack success rate on the ReAct-prompted GPT-4. Our findings raise questions about the widespread deployment of LLM Agents.",https://arxiv.org/pdf/2403.02691,2024,Arxiv,Qingqing Long,, +"AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways","This survey explores security threats to AI agents, categorizes them into four gaps, and aims to inspire research for more secure applications. ",此论文探讨AI智能体安全挑战,归纳四大知识缺口,旨在启发研究,推动其安全应用发展。 ,Security,,"An Artificial Intelligence (AI) agent is a software entity that autonomously performs tasks or makes decisions based on pre-defined objectives and data inputs. AI agents, capable of perceiving user inputs, reasoning and planning tasks, and executing actions, have seen remarkable advancements in algorithm development and task performance. However, the security challenges they pose remain under-explored and unresolved. This survey delves into the emerging security threats faced by AI agents, categorizing them into four critical knowledge gaps: unpredictability of multi-step user inputs, complexity in internal executions, variability of operational environments, and interactions with untrusted external entities. By systematically reviewing these threats, this article highlights both the progress made and the existing limitations in safeguarding AI agents. The insights provided aim to inspire further research into addressing the security threats associated with AI agents, thereby fostering the development of more robust and secure AI agent applications.",https://dl.acm.org/doi/pdf/10.1145/3716628,2025,ACM Computing Surveys,Qingqing Long,, +,,,,,,,,,,,🚫重复 +,,,,,,,,,,,🚫重复 +,,,,,,,,,,,🚫重复 +,,,,,,,,,,,🚫重复 +Multi-Agent Collaboration Mechanisms: A Survey of LLMs,"This survey offers a framework for LLM-based Multi-Agent Systems, reviews methods, explores applications, and points out directions for collective intelligence. ",该文全面调研LLM多智能体协作,提出框架,探讨多领域应用,指明发展方向促集体智能。 ,Agent Collaboration,Survey,"With recent advances in Large Language Models (LLMs), Agentic AI has become phenomenal in real-world applications, moving toward multiple LLM-based agents to perceive, learn, reason, and act collaboratively. These LLM-based Multi-Agent Systems (MASs) enable groups of intelligent agents to coordinate and solve complex tasks collectively at scale, transitioning from isolated models to collaboration-centric approaches. This work provides an extensive survey of the collaborative aspect of MASs and introduces an extensible framework to guide future research. Our framework characterizes collaboration mechanisms based on key dimensions: actors (agents involved), types (e.g., cooperation, competition, or coopetition), structures (e.g., peer-to-peer, centralized, or distributed), strategies (e.g., role-based or model-based), and coordination protocols. Through a review of existing methodologies, our findings serve as a foundation for demystifying and advancing LLM-based MASs toward more intelligent and collaborative solutions for complex, real-world use cases. In addition, various applications of MASs across diverse domains, including 5G/6G networks, Industry 5.0, question answering, and social and cultural settings, are also investigated, demonstrating their wider adoption and broader impacts. 
Finally, we identify key lessons learned, open challenges, and potential research directions of MASs towards artificial collective intelligence.",https://arxiv.org/pdf/2501.06322,2025,,,, +Inferring the Goals of Communicating Agents from Actions and Instructions,"This paper models a cooperative team using GPT-3 for instructions. It enables third-person goal inference via multi-modal Bayesian inverse planning, showing communication's importance. ",提出用GPT-3构建合作团队模型,以多模态贝叶斯逆规划推理团队目标,凸显言语交流重要性。 ,Agent Collaboration,Survey,"When humans cooperate, they frequently coordinate their activity through both verbal communication and non-verbal actions, using this information to infer a shared goal and plan. How can we model this inferential ability? In this paper, we introduce a model of a cooperative team where one agent, the principal, may communicate natural language instructions about their shared plan to another agent, the assistant, using GPT-3 as a likelihood function for instruction utterances. We then show how a third person observer can infer the team's goal via multi-modal Bayesian inverse planning from actions and instructions, computing the posterior distribution over goals under the assumption that agents will act and communicate rationally to achieve them. We evaluate this approach by comparing it with human goal inferences in a multi-agent gridworld, finding that our model's inferences closely correlate with human judgments (R = 0.96). When compared to inference from actions alone, we also find that instructions lead to more rapid and less uncertain goal inference, highlighting the importance of verbal communication for cooperative agents.",https://arxiv.org/abs/2306.16207,2024,,,, +AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation,"AutoGen is an open-source framework enabling developers to build LLM applications via multi-agent conversations, allowing custom agents and flexible interaction definitions. ",AutoGen是开源框架,支持多智能体对话构建LLM应用,可灵活编程,适用多领域。 ,Agent Collaboration,Methodology,"AutoGen is an open-source framework that allows developers to build LLM applications via multiple agents that can converse with each other to accomplish tasks. AutoGen agents are customizable, conversable, and can operate in various modes that employ combinations of LLMs, human inputs, and tools. Using AutoGen, developers can also flexibly define agent interaction behaviors. Both natural language and computer code can be used to program flexible conversation patterns for different applications. AutoGen serves as a generic infrastructure to build diverse applications of various complexities and LLM capacities. Empirical studies demonstrate the effectiveness of the framework in many example applications, with domains ranging from mathematics, coding, question answering, operations research, online decision-making, entertainment, etc.",https://arxiv.org/abs/2308.08155,2023,,Conversation,,🚫重复 +"Adaptive Collaboration Strategy for LLMs in Medical Decision Making","The paper proposes Medical Decision-making Agents (MDAgents) to assign LLM collaboration structures adaptively, exploring group consensus and agent dynamics. ",提出Medical Decision-making Agents框架,按需分配大模型协作结构,多场景适用,代码开源。 ,Agent Collaboration,Methodology,"Foundation models have become invaluable in advancing the medical field. Despite their promise, the strategic deployment of LLMs for effective utility in complex medical tasks remains an open question. Our novel framework, Medical Decision-making Agents (MDAgents) aims to address this gap by automatically assigning the effective collaboration structure for LLMs. Assigned solo or group collaboration structure is tailored to the complexity of the medical task at hand, emulating real-world medical decision making processes. We evaluate our framework and baseline methods with state-of-the-art LLMs across a suite of challenging medical benchmarks: MedQA, MedMCQA, PubMedQA, DDXPlus, PMC-VQA, Path-VQA, and MedVidQA, achieving the best performance in 5 out of 7 benchmarks that require an understanding of multi-modal medical reasoning. Ablation studies reveal that MDAgents excels in adapting the number of collaborating agents to optimize efficiency and accuracy, showcasing its robustness in diverse scenarios. We also explore the dynamics of group consensus, offering insights into how collaborative agents could behave in complex clinical team dynamics. Our code can be found at https://github.com/mitmedialab/MDAgents",https://arxiv.org/abs/2404.15155,2024,,Adaptive,, +Improving Factuality and Reasoning in Language Models through Multiagent Debate,"This paper presents a multi-agent debate approach to improve LLMs' responses. It enhances reasoning, factual validity and is applicable to black-box models. ",论文提出多智能体辩论法提升大模型语言回应,增强推理能力、内容事实性,可用于黑箱模型。 ,Agent Collaboration,Methodology,"Large language models (LLMs) have demonstrated remarkable capabilities in language generation, understanding, and few-shot learning in recent years. An extensive body of work has explored how their performance may be further improved through the tools of prompting, ranging from verification, self-consistency, or intermediate scratchpads. In this paper, we present a complementary approach to improve language responses where multiple language model instances propose and debate their individual responses and reasoning processes over multiple rounds to arrive at a common final answer. Our findings indicate that this approach significantly enhances mathematical and strategic reasoning across a number of tasks. We also demonstrate that our approach improves the factual validity of generated content, reducing fallacious answers and hallucinations that contemporary models are prone to. Our approach may be directly applied to existing black-box models and uses identical procedure and prompts for all tasks we investigate. Overall, our findings suggest that such ""society of minds"" approach has the potential to significantly advance the capabilities of LLMs and pave the way for further breakthroughs in language generation and understanding.",https://arxiv.org/abs/2305.14325,2023,,Debate,,🚫重复 +ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs,"The paper proposes ReConcile, a multi-model multi-agent framework for LLMs' reasoning via round-table conferences, highlighting model diversity's importance. ",提出ReConcile框架,通过多轮讨论和置信投票达成共识,增强LLM推理,模型多样组合是关键优势 ,Agent Collaboration,Methodology,"Large Language Models (LLMs) still struggle with natural language reasoning tasks. Motivated by the society of minds (Minsky, 1988), we propose ReConcile, a multi-model multi-agent framework designed as a round table conference among diverse LLM agents. ReConcile enhances collaborative reasoning between LLM agents via multiple rounds of discussion, learning to convince other agents to improve their answers, and employing a confidence-weighted voting mechanism that leads to a better consensus. 
In each round, ReConcile initiates discussion between agents via a 'discussion prompt' that consists of (a) grouped answers and explanations generated by each agent in the previous round, (b) their confidence scores, and (c) demonstrations of answer-rectifying human explanations, used for convincing other agents. Experiments on seven benchmarks demonstrate that ReConcile significantly improves LLMs' reasoning -- both individually and as a team -- surpassing prior single-agent and multi-agent baselines by up to 11.4% and even outperforming GPT-4 on three datasets. ReConcile also flexibly incorporates different combinations of agents, including API-based, open-source, and domain-specific models, leading to an 8% improvement on MATH. Finally, we analyze the individual components of ReConcile, demonstrating that the diversity originating from different models is critical to its superior performance. Code: this https URL",https://arxiv.org/abs/2309.13007,2024,,Round-Table,, +"Autonomous chemical research with large language models","The paper presents Coscientist, a GPT-4-driven AI system. It can autonomously conduct research, showing potential and versatility in advancing studies. ",介绍AI系统Coscientist,其借大模型自主开展复杂实验,加速多领域研究,展现多功能与高效性。 ,Agent Collaboration,Methodology,"Transformer-based large language models are making significant strides in various fields, such as natural language processing, biology, chemistry and computer programming. Here, we show the development and capabilities of Coscientist, an artificial intelligence system driven by GPT-4 that autonomously designs, plans and performs complex experiments by incorporating large language models empowered by tools such as internet and documentation search, code execution and experimental automation. Coscientist showcases its potential for accelerating research across six diverse tasks, including the successful reaction optimization of palladium-catalysed cross-couplings, while exhibiting advanced capabilities for (semi-)autonomous experimental design and execution. Our findings demonstrate the versatility, efficacy and explainability of artificial intelligence systems like Coscientist in advancing research.",https://www.nature.com/articles/s41586-023-06792-0,2023,,Cooperation,, +MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework,"The paper introduces MetaGPT, an LLM-based multi-agent meta-programming framework encoding SOPs, using an assembly line for task breakdown and reducing errors. ",论文提出MetaGPT框架,将人类工作流融入LLM多智能体协作,分解复杂任务、减少错误。 ,Agent Collaboration,Methodology,"Remarkable progress has been made on automated problem solving through societies of agents based on large language models (LLMs). Existing LLM-based multi-agent systems can already solve simple dialogue tasks. Solutions to more complex tasks, however, are complicated through logic inconsistencies due to cascading hallucinations caused by naively chaining LLMs. Here we introduce MetaGPT, an innovative meta-programming framework incorporating efficient human workflows into LLM-based multi-agent collaborations. MetaGPT encodes Standardized Operating Procedures (SOPs) into prompt sequences for more streamlined workflows, thus allowing agents with human-like domain expertise to verify intermediate results and reduce errors. MetaGPT utilizes an assembly line paradigm to assign diverse roles to various agents, efficiently breaking down complex tasks into subtasks involving many agents working together. 
On collaborative software engineering benchmarks, MetaGPT generates more coherent solutions than previous chat-based multi-agent systems. Our project can be found at this https URL.",https://arxiv.org/abs/2308.00352,2024,,,,🚫重复 +Debating with More Persuasive LLMs Leads to More Truthful Answers,This paper explores if weaker models can assess stronger ones via debate. Results show it helps and optimizing debaters aids truth-finding without ground truth. ,探讨弱模型评估强模型,以辩论法助非专家和人类答题,无真值时用辩论对齐模型可行。 ,Agent Collaboration,Methodology,"Common methods for aligning large language models (LLMs) with desired behaviour heavily rely on human-labelled data. However, as models grow increasingly sophisticated, they will surpass human expertise, and the role of human evaluation will evolve into non-experts overseeing experts. In anticipation of this, we ask: can weaker models assess the correctness of stronger models? We investigate this question in an analogous setting, where stronger models (experts) possess the necessary information to answer questions and weaker models (non-experts) lack this information. The method we evaluate is debate, where two LLM experts each argue for a different answer, and a non-expert selects the answer. We find that debate consistently helps both non-expert models and humans answer questions, achieving 76% and 88% accuracy respectively (naive baselines obtain 48% and 60%). Furthermore, optimising expert debaters for persuasiveness in an unsupervised manner improves non-expert ability to identify the truth in debates. Our results provide encouraging empirical evidence for the viability of aligning models with debate in the absence of ground truth.",https://arxiv.org/abs/2402.06782,2024,,weak supervisor,, +"RoCo: Dialectic multi-robot collaboration with large language models","The paper proposes a novel multi-robot collaboration approach using LLMs for communication and planning, and introduces RoCoBench for evaluation. ",提出用预训练大模型实现多机器人协作新方法,引入 RoCoBench 评估,具高解释性与灵活性。 ,Agent Collaboration,Methodology,"We propose a novel approach to multi-robot collaboration that harnesses the power of pre-trained large language models (LLMs) for both high-level communication and low-level path planning. Robots are equipped with LLMs to discuss and collectively reason task strategies. They then generate sub-task plans and task space waypoint paths, which are used by a multi-arm motion planner to accelerate trajectory planning. We also provide feedback from the environment, such as collision checking, and prompt the LLM agents to improve their plan and waypoints in-context. For evaluation, we introduce RoCoBench, a 6-task benchmark covering a wide range of multi-robot collaboration scenarios, accompanied by a text-only dataset for agent representation and reasoning. We experimentally demonstrate the effectiveness of our approach -- it achieves high success rates across all tasks in RoCoBench and adapts to variations in task semantics. Our dialog setup offers high interpretability and flexibility -- in real world experiments, we show RoCo easily incorporates human-in-the-loop, where a user can communicate and collaborate with a robot agent to complete tasks together. See project website this https URL for videos and code.",https://arxiv.org/abs/2307.04738,2024,,,, +AutoAct: Automatic Agent Learning from Scratch for QA via Self-Planning,"The paper introduces AutoAct, a QA agent learning framework without relying on large-scale data. It auto-synthesizes trajectories and uses a division-of-labor strategy. ",论文提出AutoAct框架,无需大规模标注数据,自动合成轨迹,分工完成QA任务,效果良好。 ,Agent Collaboration,Methodology,"Language agents have achieved considerable performance on various complex question-answering tasks by planning with external tools. Despite the incessant exploration in this field, existing language agent systems still struggle with costly, non-reproducible data reliance and face the challenge of compelling a single model for multiple functions. To this end, we introduce AutoAct, an automatic agent learning framework for QA that does not rely on large-scale annotated data and synthetic planning trajectories from closed-source models (e.g., GPT-4). Given limited data with a tool library, AutoAct first automatically synthesizes planning trajectories without any assistance from humans or strong closed-source models. Then, AutoAct leverages a division-of-labor strategy to automatically differentiate based on the target task information and synthesized trajectories, producing a sub-agent group to complete the task. We conduct comprehensive experiments with different LLMs, which demonstrates that AutoAct yields better or parallel performance compared to various strong baselines. Further analysis demonstrates the effectiveness of the division-of-labor strategy, with the trajectory quality generated by AutoAct generally outperforming that of others. Code will be available at this https URL.",https://arxiv.org/abs/2401.05268,2024,,,, +Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding,"The paper introduces meta-prompting, a task-agnostic scaffolding for LMs, enabling multi-task handling and integrating external tools. ",提出元提示技术,让单一大模型成为多面指挥者,简化交互,可集成外部工具,增强多任务表现。 ,Agent Collaboration,Methodology,"We introduce meta-prompting, an effective scaffolding technique designed to enhance the functionality of language models (LMs). This approach transforms a single LM into a multi-faceted conductor, adept at managing and integrating multiple independent LM queries. By employing high-level instructions, meta-prompting guides the LM to break down complex tasks into smaller, more manageable subtasks. These subtasks are then handled by distinct ""expert"" instances of the same LM, each operating under specific, tailored instructions. Central to this process is the LM itself, in its role as the conductor, which ensures seamless communication and effective integration of the outputs from these expert models. It additionally employs its inherent critical thinking and robust verification processes to refine and authenticate the end result. This collaborative prompting approach empowers a single LM to simultaneously act as a comprehensive orchestrator and a panel of diverse experts, significantly enhancing its performance across a wide array of tasks. The zero-shot, task-agnostic nature of meta-prompting greatly simplifies user interaction by obviating the need for detailed, task-specific instructions. Furthermore, our research demonstrates the seamless integration of external tools, such as a Python interpreter, into the meta-prompting framework, thereby broadening its applicability and utility. 
Through rigorous experimentation with GPT-4, we establish the superiority of meta-prompting over conventional scaffolding methods: When averaged across all tasks, including the Game of 24, Checkmate-in-One, and Python Programming Puzzles, meta-prompting, augmented with a Python interpreter functionality, surpasses standard prompting by 17.1%, expert (dynamic) prompting by 17.3%, and multipersona prompting by 15.2%.",https://arxiv.org/abs/2401.12954,2024,,meta,, +Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate,"The paper proposes a Multi-Agent Debate (MAD) framework to address LLMs' Degeneration-of-Thought problem, encouraging divergent thinking. ",针对大语言模型反思方法的DoT问题,提出MAD框架鼓励发散思维,经实验验证有效。 ,Agent Collaboration,Methodology,"Modern large language models (LLMs) like ChatGPT have shown remarkable performance on general language tasks but still struggle on complex reasoning tasks, which drives the research on cognitive behaviors of LLMs to explore human-like problem-solving strategies. Along this direction, one representative strategy is self-reflection, which asks an LLM to refine the solution with the feedback generated by itself iteratively. However, our study shows that such reflection-style methods suffer from the Degeneration-of-Thought (DoT) problem: once the LLM has established confidence in its solutions, it is unable to generate novel thoughts later through reflection even if its initial stance is incorrect. To address the DoT problem, we propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in the state of ""tit for tat"" and a judge manages the debate process to obtain a final solution. Clearly, our MAD framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation. Experiment results on two challenging datasets, commonsense machine translation and counter-intuitive arithmetic reasoning, demonstrate the effectiveness of our MAD framework. Extensive analyses suggest that the adaptive break of debate and the modest level of ""tit for tat"" state are required for MAD to obtain good performance. Moreover, we find that LLMs might not be a fair judge if different LLMs are used for agents. Code is available at this https URL.",https://arxiv.org/abs/2305.19118,2024,,,,🚫重复 +AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors,"The paper proposes AgentVerse, a multi-agent framework inspired by human dynamics. It enables effective collaboration, shows emergent behaviors, and will release code. ",提出多智能体框架AgentVerse,可有效协调专家智能体协作,将开源助力多智能体研究。 ,Agent Collaboration,Methodology,"Autonomous agents empowered by Large Language Models (LLMs) have undergone significant improvements, enabling them to generalize across a broad spectrum of tasks. However, in real-world scenarios, cooperation among individuals is often required to enhance the efficiency and effectiveness of task accomplishment. Hence, inspired by human group dynamics, we propose a multi-agent framework AgentVerse that can effectively orchestrate a collaborative group of expert agents as a greater-than-the-sum-of-its-parts system. Our experiments demonstrate that AgentVerse can proficiently deploy multi-agent groups that outperform a single agent. Extensive experiments on text understanding, reasoning, coding, tool utilization, and embodied AI confirm the effectiveness of AgentVerse. 
Moreover, our analysis of agent interactions within AgentVerse reveals the emergence of specific collaborative behaviors, contributing to heightened group efficiency. We will release our codebase, AgentVerse, to further facilitate multi-agent research.",https://openreview.net/forum?id=EHg5GDnyq1,2024,,,, +Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration,"The paper proposes DyLAN, a framework for LLM-powered agent collaboration with a two-stage paradigm, enabling dynamic agent selection and communication. ",提出动态语言模型智能体网络框架DyLAN,含团队优化和任务解决两阶段,自动选团队、动态协作。 ,Agent Collaboration,Methodology,"Recent studies show that collaborating multiple large language model (LLM) powered agents is a promising way for task solving. However, current approaches are constrained by using a fixed number of agents and static communication structures. In this work, we propose automatically selecting a team of agents from candidates to collaborate in a dynamic communication structure toward different tasks and domains. Specifically, we build a framework named Dynamic LLM-Powered Agent Network (DyLAN) for LLM-powered agent collaboration, operating a two-stage paradigm: (1) Team Optimization and (2) Task Solving. During the first stage, we utilize an agent selection algorithm, based on an unsupervised metric called Agent Importance Score, enabling the selection of best agents according to their contributions in a preliminary trial, oriented to the given task. Then, in the second stage, the selected agents collaborate dynamically according to the query. Empirically, we demonstrate that DyLAN outperforms strong baselines in code generation, decision-making, general reasoning, and arithmetic reasoning tasks with moderate computational cost. On specific subjects in MMLU, selecting a team of agents in the team optimization stage improves accuracy by up to 25.0% in DyLAN.",https://arxiv.org/abs/2310.02170,2024,,,, +ChatDev: Communicative Agents for Software Development,"This paper introduces ChatDev, a chat-powered framework. LLM-driven agents communicate via chat to unify software development phases, using language for collaboration. ",论文提出ChatDev框架,用大模型驱动专业智能体统一交流,以语言为桥促进软件开发多阶段协作 ,Agent Collaboration,Methodology,"Software development is a complex task that necessitates cooperation among multiple members with diverse skills. Numerous studies used deep learning to improve specific phases in a waterfall model, such as design, coding, and testing. However, the deep learning model in each phase requires unique designs, leading to technical inconsistencies across various phases, which results in a fragmented and ineffective development process. In this paper, we introduce ChatDev, a chat-powered software development framework in which specialized agents driven by large language models (LLMs) are guided in what to communicate (via chat chain) and how to communicate (via communicative dehallucination). These agents actively contribute to the design, coding, and testing phases through unified language-based communication, with solutions derived from their multi-turn dialogues. We found their utilization of natural language is advantageous for system design, and communicating in programming language proves helpful in debugging. This paradigm demonstrates how linguistic communication facilitates multi-agent collaboration, establishing language as a unifying bridge for autonomous task-solving among LLM agents. 
The code and data are available at this https URL.",https://arxiv.org/abs/2307.07924,2024,,,,🚫重复 +ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate,"The paper proposes ChatEval, a multi-agent referee team, to evaluate responses. It uses multi-agent debate to mimic human evaluation, going beyond single-agent methods. ",提出多智能体辩论框架构建ChatEval,用于评估模型回答质量,提供类人评估过程 。 ,Agent Collaboration,Methodology,"Text evaluation has historically posed significant challenges, often demanding substantial labor and time cost. With the emergence of large language models (LLMs), researchers have explored LLMs' potential as alternatives for human evaluation. While these single-agent-based approaches show promise, experimental results suggest that further advancements are needed to bridge the gap between their current effectiveness and human-level evaluation quality. Recognizing that best practices of human evaluation processes often involve multiple human annotators collaborating in the evaluation, we resort to a multi-agent debate framework, moving beyond single-agent prompting strategies. The multi-agent-based approach enables a group of LLMs to synergize with an array of intelligent counterparts, harnessing their distinct capabilities and expertise to enhance efficiency and effectiveness in handling intricate tasks. In this paper, we construct a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models on open-ended questions and traditional natural language generation (NLG) tasks. Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments. Our code is available at this https URL.",https://openreview.net/forum?id=FQepisCUWu,2024,,,, +A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration,"This paper proposes DyLAN, a framework for LLM-powered agent collaboration with a two-stage paradigm to select agents dynamically for diverse tasks. ",提出动态大语言模型智能体网络(DyLAN)框架,含团队优化与任务解决两阶段,可自适应选团队协作。 ,,Methodology,"Recent studies show that collaborating multiple large language model (LLM) powered agents is a promising way for task solving. However, current approaches are constrained by using a fixed number of agents and static communication structures. In this work, we propose automatically selecting a team of agents from candidates to collaborate in a dynamic communication structure toward different tasks and domains. Specifically, we build a framework named Dynamic LLM-Powered Agent Network (DyLAN) for LLM-powered agent collaboration, operating a two-stage paradigm: (1) Team Optimization and (2) Task Solving. During the first stage, we utilize an agent selection algorithm, based on an unsupervised metric called Agent Importance Score, enabling the selection of best agents according to their contributions in a preliminary trial, oriented to the given task. Then, in the second stage, the selected agents collaborate dynamically according to the query. Empirically, we demonstrate that DyLAN outperforms strong baselines in code generation, decision-making, general reasoning, and arithmetic reasoning tasks with moderate computational cost. 
On specific subjects in MMLU, selecting a team of agents in the team optimization stage improves accuracy by up to 25.0% in DyLAN.",https://openreview.net/forum?id=XII0Wp1XA9#discussion,2024,,,,🚫重复 +,,,,,,,,,,,🚫重复 +,,,,,,,,,,,🚫重复 +ChemCrow: Augmenting large-language models with chemistry tools,"This paper introduces ChemCrow, an LLM chemistry agent integrating 18 tools. It automates chemical tasks, aids chemists, and bridges the computational-experimental gap. ",文章提出ChemCrow代理,集成18种工具提升大模型化学表现,助力多领域研究,推动科学进步。 ,Applications,Scientific Discovery,"Over the last decades, excellent computational chemistry tools have been developed. Integrating them into a single platform with enhanced accessibility could help reaching their full potential by overcoming steep learning curves. Recently, large-language models (LLMs) have shown strong performance in tasks across domains, but struggle with chemistry-related problems. Moreover, these models lack access to external knowledge sources, limiting their usefulness in scientific applications. In this study, we introduce ChemCrow, an LLM chemistry agent designed to accomplish tasks across organic synthesis, drug discovery, and materials design. By integrating 18 expert-designed tools, ChemCrow augments the LLM performance in chemistry, and new capabilities emerge. Our agent autonomously planned and executed the syntheses of an insect repellent, three organocatalysts, and guided the discovery of a novel chromophore. Our evaluation, including both LLM and expert assessments, demonstrates ChemCrow's effectiveness in automating a diverse set of chemical tasks. Surprisingly, we find that GPT-4 as an evaluator cannot distinguish between clearly wrong GPT-4 completions and Chemcrow's performance. Our work not only aids expert chemists and lowers barriers for non-experts, but also fosters scientific advancement by bridging the gap between experimental and computational chemistry. ",https://arxiv.org/abs/2304.05376,2023,Arxiv,Meng Xiao,, +CRISPR-GPT: An LLM Agent for Automated Design of Gene-Editing Experiments,"The paper introduces CRISPR-GPT, an LLM agent for automating CRISPR gene-editing design, aids non-experts and discusses ethics. ",本文提出CRISPR-GPT,结合领域知识与工具自动化基因编辑实验设计,弥合初学者与技术间差距。 ,Applications,Scientific Discovery,"The introduction of genome engineering technology has transformed biomedical research, making it possible to make precise changes to genetic information. However, creating an efficient gene-editing system requires a deep understanding of CRISPR technology, and the complex experimental systems under investigation. While Large Language Models (LLMs) have shown promise in various tasks, they often lack specific knowledge and struggle to accurately solve biological design problems. In this work, we introduce CRISPR-GPT, an LLM agent augmented with domain knowledge and external tools to automate and enhance the design process of CRISPR-based gene-editing experiments. CRISPR-GPT leverages the reasoning ability of LLMs to facilitate the process of selecting CRISPR systems, designing guide RNAs, recommending cellular delivery methods, drafting protocols, and designing validation experiments to confirm editing outcomes. We showcase the potential of CRISPR-GPT for assisting non-expert researchers with gene-editing experiments from scratch and validate the agent's effectiveness in a real-world use case. 
Furthermore, we explore the ethical and regulatory considerations associated with automated gene-editing design, highlighting the need for responsible and transparent use of these tools. Our work aims to bridge the gap between beginner biological researchers and CRISPR genome engineering techniques, and demonstrate the potential of LLM agents in facilitating complex biological discovery tasks. ",https://arxiv.org/abs/2404.18021,2024,Arxiv,Meng Xiao,, +SciAgents: Automating Scientific Discovery Through Bioinspired Multi-Agent Intelligent Graph Reasoning,"The paper presents SciAgents, using ontological KGs, LLMs, and multi-agent systems to autonomously explore science, unlocking nature's design principles for material discovery. ",论文提出 SciAgents 方法,整合多概念,揭示跨学科关系,自主生成假设,助力材料发现与研发 ,Applications,Scientific Discovery,"A key challenge in artificial intelligence (AI) is the creation of systems capable of autonomously advancing scientific understanding by exploring novel domains, identifying complex patterns, and uncovering previously unseen connections in vast scientific data. In this work, SciAgents, an approach that leverages three core concepts is presented: (1) large-scale ontological knowledge graphs to organize and interconnect diverse scientific concepts, (2) a suite of large language models (LLMs) and data retrieval tools, and (3) multi-agent systems with in-situ learning capabilities. Applied to biologically inspired materials, SciAgents reveals hidden interdisciplinary relationships that were previously considered unrelated, achieving a scale, precision, and exploratory power that surpasses human research methods. The framework autonomously generates and refines research hypotheses, elucidating underlying mechanisms, design principles, and unexpected material properties. By integrating these capabilities in a modular fashion, the system yields material discoveries, critiques and improves existing hypotheses, retrieves up-to-date data about existing research, and highlights strengths and limitations. This is achieved by harnessing a “swarm of intelligence” similar to biological systems, providing new avenues for discovery. How this model accelerates the development of advanced materials by unlocking Nature's design principles, resulting in a new biocomposite with enhanced mechanical properties and improved sustainability through energy-efficient production is shown.",https://advanced.onlinelibrary.wiley.com/doi/full/10.1002/adma.202413523,2024,Advanced Materials,"Materials, Meng Xiao",, +AlphaFlow: autonomous discovery and optimization of multi-step chemistry using a self-driven fluidic lab guided by reinforcement learning,"This paper presents AlphaFlow, a self-driven fluidic lab using RL for autonomous discovery of multi-step chemistries, shows potential in NP synthesis and beyond. ",本文提出AlphaFlow,用强化学习自主探索多步化学,优化核壳纳米颗粒合成路线,有望加速知识生成。 ,Applications,Scientific Discovery,"Closed-loop, autonomous experimentation enables accelerated and material-efficient exploration of large reaction spaces without the need for user intervention. However, autonomous exploration of advanced materials with complex, multi-step processes and data sparse environments remains a challenge. In this work, we present AlphaFlow, a self-driven fluidic lab capable of autonomous discovery of complex multi-step chemistries. 
AlphaFlow uses reinforcement learning integrated with a modular microdroplet reactor capable of performing reaction steps with variable sequence, phase separation, washing, and continuous in-situ spectral monitoring. To demonstrate the power of reinforcement learning toward high dimensionality multi-step chemistries, we use AlphaFlow to discover and optimize synthetic routes for shell-growth of core-shell semiconductor nanoparticles, inspired by colloidal atomic layer deposition (cALD). Without prior knowledge of conventional cALD parameters, AlphaFlow successfully identified and optimized a novel multi-step reaction route, with up to 40 parameters, that outperformed conventional sequences. Through this work, we demonstrate the capabilities of closed-loop, reinforcement learning-guided systems in exploring and solving challenges in multi-step nanoparticle syntheses, while relying solely on in-house generated data from a miniaturized microfluidic platform. Further application of AlphaFlow in multi-step chemistries beyond cALD can lead to accelerated fundamental knowledge generation as well as synthetic route discoveries and optimization.",https://www.nature.com/articles/s41467-023-37139-y,2023,Nature Communications,Meng Xiao,, +An active inference strategy for prompting reliable responses from large language models in medical practice,"The paper proposes a domain-specific dataset and an active inference prompting protocol to address LLM issues, laying a foundation for medical LLM integration. ",提出特定领域验证数据集及主动推理提示协议,推动大语言模型安全融入医疗应用。 ,Applications,Medical,"Continuing advances in Large Language Models (LLMs) are transforming medical knowledge access across education, training, and treatment. Early literature cautions their non-determinism, potential for harmful responses, and lack of quality control. To address these issues, we propose a domain-specific, validated dataset for LLM training and an actor–critic prompting protocol grounded in active inference. A Therapist agent generates initial responses to patient queries, while a Supervisor agent refines them. In a blind validation study, experienced cognitive behavior therapy for insomnia (CBT-I) therapists evaluated 100 patient queries. For each query, they were given either the LLM’s response or one of two therapist-crafted responses—one appropriate and one deliberately inappropriate—and asked to rate the quality and accuracy of each reply. The LLM often received higher ratings than the appropriate responses, indicating effective alignment with expert standards. This structured approach lays the foundation for safely integrating advanced LLM technology into medical applications.",https://doi.org/10.1038/s41746-025-01516-2,2025,npj Digital Medicine,Meng Xiao,, +An evaluation framework for clinical use of large language models in patient interaction tasks,"This paper introduces CRAFT-MD for evaluating clinical LLMs, reveals their limitations, and proposes recommendations for future evaluations. ",提出CRAFT-MD评估临床大语言模型,应用于多模型评估,给出后续评估建议,推动模型有效伦理应用。 ,Applications,Medical,"The integration of large language models (LLMs) into clinical diagnostics has the potential to transform doctor–patient interactions. However, the readiness of these models for real-world clinical application remains inadequately tested. This paper introduces the Conversational Reasoning Assessment Framework for Testing in Medicine (CRAFT-MD) approach for evaluating clinical LLMs. 
Unlike traditional methods that rely on structured medical examinations, CRAFT-MD focuses on natural dialogues, using simulated artificial intelligence agents to interact with LLMs in a controlled environment. We applied CRAFT-MD to assess the diagnostic capabilities of GPT-4, GPT-3.5, Mistral and LLaMA-2-7b across 12 medical specialties. Our experiments revealed critical insights into the limitations of current LLMs in terms of clinical conversational reasoning, history-taking and diagnostic accuracy. These limitations also persisted when analyzing multimodal conversational and visual assessment capabilities of GPT-4V. We propose a comprehensive set of recommendations for future evaluations of clinical LLMs based on our empirical findings. These recommendations emphasize realistic doctor–patient conversations, comprehensive history-taking, open-ended questioning and using a combination of automated and expert evaluations. The introduction of CRAFT-MD marks an advancement in testing of clinical LLMs, aiming to ensure that these models augment medical practice effectively and ethically.",https://doi.org/10.1038/s41591-024-03328-5,2025,Nature Medicine,Meng Xiao,, +Stress-testing the resilience of the Austrian healthcare system using agent-based simulation,"The paper proposes a data-driven agent-based model to quantify regional healthcare resilience to shocks, helping identify care access bottlenecks. ",提出数据驱动框架,通过基于代理模型量化地区医疗系统韧性,助当局识别就医瓶颈。 ,Applications,Medical,"Patients do not access physicians at random but rather via naturally emerging networks of patient flows between them. As mass quarantines, absences due to sickness, or other shocks thin out these networks, the system might be pushed to a tipping point where it loses its ability to deliver care. Here, we propose a data-driven framework to quantify regional resilience to such shocks via an agent-based model. For each region and medical specialty we construct patient-sharing networks and stress-test these by removing physicians. This allows us to measure regional resilience indicators describing how many physicians can be removed before patients will not be treated anymore. Our model could therefore enable health authorities to rapidly identify bottlenecks in access to care. Here, we show that regions and medical specialties differ substantially in their resilience and that these systemic differences can be related to indicators for individual physicians by quantifying their risk and benefit to the system.",https://doi.org/10.1038/s41467-022-31766-7,2022,Nature Communications,"Meng Xiao, Agents But not LLM-based",, +Medical large language models are susceptible to targeted misinformation attacks,"The paper reveals LLMs in medicine are vulnerable to misinformation attacks. Manipulating 1.1% of the weights can inject errors, stressing the need for safeguards. ",研究揭示医学大语言模型易受目标错误信息攻击,强调需健全防护、验证及访问管理机制。 ,Applications,Medical,"Large language models (LLMs) have broad medical knowledge and can reason about medical information across many domains, holding promising potential for diverse medical applications in the near future. In this study, we demonstrate a concerning vulnerability of LLMs in medicine. Through targeted manipulation of just 1.1% of the weights of the LLM, we can deliberately inject incorrect biomedical facts. The erroneous information is then propagated in the model’s output while maintaining performance on other biomedical tasks. We validate our findings in a set of 1025 incorrect biomedical facts. 
This peculiar susceptibility raises serious security and trustworthiness concerns for the application of LLMs in healthcare settings. It accentuates the need for robust protective measures, thorough verification mechanisms, and stringent management of access to these models, ensuring their reliable and safe use in medical practice.",https://doi.org/10.1038/s41746-024-01282-7,2024,npj Digital Medicine,"Meng Xiao, Not related to Agents, Attack?",, +Large Language Models lack essential metacognition for reliable medical reasoning,"The paper develops MetaMedQA to evaluate LLMs' medical metacognition, revealing deficiencies and stressing the need for robust evaluation for reliable CDSS. ",开发MetaMedQA评估大模型医学元认知能力,揭示模型缺陷,强调纳入元认知的评估框架必要性 ,Applications,Medical,"Large Language Models have demonstrated expert-level accuracy on medical board examinations, suggesting potential for clinical decision support systems. However, their metacognitive abilities, crucial for medical decision-making, remain largely unexplored. To address this gap, we developed MetaMedQA, a benchmark incorporating confidence scores and metacognitive tasks into multiple-choice medical questions. We evaluated twelve models on dimensions including confidence-based accuracy, missing answer recall, and unknown recall. Despite high accuracy on multiple-choice questions, our study revealed significant metacognitive deficiencies across all tested models. Models consistently failed to recognize their knowledge limitations and provided confident answers even when correct options were absent. In this work, we show that current models exhibit a critical disconnect between perceived and actual capabilities in medical reasoning, posing significant risks in clinical settings. Our findings emphasize the need for more robust evaluation frameworks that incorporate metacognitive abilities, essential for developing reliable Large Language Model enhanced clinical decision support systems.",https://doi.org/10.1038/s41467-024-55628-6,2025,Nature Communications,"Meng Xiao, Attack, Flaw of LLM, Not very related to Agents",, +Balancing autonomy and expertise in autonomous synthesis laboratories,"The paper comments on barriers in autonomous synthesis labs, the promise of a human on-the-loop approach, and strategies for optimizing their accessibility, accuracy, and efficiency. ",论文探讨自主合成实验室,评领域障碍、人在环方法前景,提优化可及性、精度与效率策略。 ,Applications,Scientific Discovery,"Autonomous synthesis laboratories promise to streamline the plan–make–measure-analyze iteration loop. Here, we comment on the barriers in the field, the promise of a human on-the-loop approach, and strategies for optimizing accessibility, accuracy, and efficiency of autonomous laboratories.",https://doi.org/10.1038/s43588-025-00769-x,2025,Nature Computational Science,"Meng Xiao, AI automated science discovery, but not directly related to Agents",, +,,,,,,,,,,,🚫重复 +Agent Attacks or Security,,,,,,,,,,, +"DemonAgent: Dynamically Encrypted Multi-Backdoor Implantation Attack on LLM-based Agent","The paper proposes the Dynamically Encrypted Multi-Backdoor Implantation Attack to bypass safety audits, and presents the AgentBackdoorEval dataset. ",提出动态加密多后门植入攻击策略,分解后门提升隐蔽性,还推出评估数据集,凸显现有安全机制局限。 ,Security,Methodology-Attack,"As LLM-based agents become increasingly prevalent, backdoors can be implanted into agents through user queries or environment feedback, raising critical concerns regarding safety vulnerabilities. However, backdoor attacks are typically detectable by safety audits that analyze the reasoning process of agents. 
To this end, we propose a novel backdoor implantation strategy called the Dynamically Encrypted Multi-Backdoor Implantation Attack. Specifically, we introduce dynamic encryption, which maps the backdoor into benign content, effectively circumventing safety audits. To enhance stealthiness, we further decompose the backdoor into multiple sub-backdoor fragments. Together, these techniques allow backdoors to largely bypass safety audits. Additionally, we present AgentBackdoorEval, a dataset designed for the comprehensive evaluation of agent backdoor attacks. Experimental results across multiple datasets demonstrate that our method achieves an attack success rate nearing 100% while maintaining a detection rate of 0%, illustrating its effectiveness in evading safety audits. Our findings highlight the limitations of existing safety mechanisms in detecting advanced attacks, underscoring the urgent need for more robust defenses against backdoor threats. Code and data are available at this https URL. +",https://arxiv.org/abs/2502.12575,2025,,,, +"Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security","The paper focuses on Personal LLM Agents. It discusses architecture, challenges, and solutions, envisioning them as a major software paradigm for end-users. ",探讨基于大语言模型的个人代理,总结架构,分析专家意见,讨论挑战并调研解决方案。 ,Security,Survey,"Since the advent of personal computing devices, intelligent personal assistants (IPAs) have been one of the key technologies that researchers and engineers have focused on, aiming to help users efficiently obtain information and execute tasks, and provide users with more intelligent, convenient, and rich interaction experiences. With the development of smartphones and IoT, computing and sensing devices have become ubiquitous, greatly expanding the boundaries of IPAs. However, due to the lack of capabilities such as user intent understanding, task planning, tool use, and personal data management, existing IPAs still have limited practicality and scalability. Recently, the emergence of foundation models, represented by large language models (LLMs), brings new opportunities for the development of IPAs. With their powerful semantic understanding and reasoning capabilities, LLMs can enable intelligent agents to solve complex problems autonomously. In this paper, we focus on Personal LLM Agents, which are LLM-based agents that are deeply integrated with personal data and personal devices and used for personal assistance. We envision that Personal LLM Agents will become a major software paradigm for end-users in the upcoming era. To realize this vision, we take the first step to discuss several important questions about Personal LLM Agents, including their architecture, capability, efficiency and security. We start by summarizing the key components and design choices in the architecture of Personal LLM Agents, followed by an in-depth analysis of the opinions collected from domain experts. Next, we discuss several key challenges to achieve intelligent, efficient and secure Personal LLM Agents, followed by a comprehensive survey of representative solutions to address these challenges.",https://arxiv.org/abs/2401.05459,2024,,,,🚫重复 +"CORBA: Contagious Recursive Blocking Attacks on Multi-Agent Systems Based on Large Language Models +","The paper introduces CORBA, a novel attack on LLM-MASs. It exploits contagion and recursion to disrupt agent interactions and is hard to mitigate via alignment. 
",提出 Contagious Recursive Blocking Attacks(Corba)攻击,可破坏多智能体交互,传统方法难应对。 ,Security,Methodology-Attack,"Large Language Model-based Multi-Agent Systems (LLM-MASs) have demonstrated remarkable real-world capabilities, effectively collaborating to complete complex tasks. While these systems are designed with safety mechanisms, such as rejecting harmful instructions through alignment, their security remains largely unexplored. This gap leaves LLM-MASs vulnerable to targeted disruptions. In this paper, we introduce Contagious Recursive Blocking Attacks (Corba), a novel and simple yet highly effective attack that disrupts interactions between agents within an LLM-MAS. Corba leverages two key properties: its contagious nature allows it to propagate across arbitrary network topologies, while its recursive property enables sustained depletion of computational resources. Notably, these blocking attacks often involve seemingly benign instructions, making them particularly challenging to mitigate using conventional alignment methods. We evaluate Corba on two widely-used LLM-MASs, namely, AutoGen and Camel across various topologies and commercial models. Additionally, we conduct more extensive experiments in open-ended interactive LLM-MASs, demonstrating the effectiveness of Corba in complex topology structures and open-source models. ",https://arxiv.org/abs/2502.14529,2025,,,, +"G-Safeguard: A Topology-Guided Security Lens and Treatment on LLM-based Multi-agent Systems +",The paper introduces G - Safeguard for LLM - based MAS. It uses graph neural networks for anomaly detection and intervention to address security challenges. ,提出 G - Safeguard 用于 LLM - MAS,借助图神经网络检测异常、拓扑干预修复攻击,兼顾适配性与安全性。 ,Security,Methodology-Guard,"Large Language Model (LLM)-based Multi-agent Systems (MAS) have demonstrated remarkable capabilities in various complex tasks, ranging from collaborative problem-solving to autonomous decision-making. However, as these systems become increasingly integrated into critical applications, their vulnerability to adversarial attacks, misinformation propagation, and unintended behaviors have raised significant concerns. To address this challenge, we introduce G-Safeguard, a topology-guided security lens and treatment for robust LLM-MAS, which leverages graph neural networks to detect anomalies on the multi-agent utterance graph and employ topological intervention for attack remediation. Extensive experiments demonstrate that G-Safeguard: (I) exhibits significant effectiveness under various attack strategies, recovering over 40% of the performance for prompt injection; (II) is highly adaptable to diverse LLM backbones and large-scale MAS; (III) can seamlessly combine with mainstream MAS with security guarantees. ",https://arxiv.org/abs/2502.11127,2025,,,, +AgentHarm: Benchmarking Robustness of LLM Agents on Harmful Tasks,"The paper proposes AgentHarm, a new benchmark for LLM agent misuse, covering diverse malicious tasks, and publicly releases it for evaluating attacks and defenses. ",提出 AgentHarm 基准,含多样恶意任务,可评估大模型智能体抗攻击及执行任务能力并公开发布。 ,Security,Benchmark,"The robustness of LLMs to jailbreak attacks, where users design prompts to circumvent safety measures and misuse model capabilities, has been studied primarily for LLMs acting as simple chatbots. Meanwhile, LLM agents---which use external tools and can execute multi-stage tasks---may pose a greater risk if misused, but their robustness remains underexplored. To facilitate research on LLM agent misuse, we propose a new benchmark called AgentHarm. 
The benchmark includes a diverse set of 110 explicitly malicious agent tasks (440 with augmentations), covering 11 harm categories including fraud, cybercrime, and harassment. In addition to measuring whether models refuse harmful agentic requests, scoring well on AgentHarm requires jailbroken agents to maintain their capabilities following an attack to complete a multi-step task. We evaluate a range of leading LLMs, and find (1) leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking, (2) simple universal jailbreak strings can be adapted to effectively jailbreak agents, and (3) these jailbreaks enable coherent and malicious multi-step agent behavior and retain model capabilities. We publicly release AgentHarm in the supplementary material to enable simple and reliable evaluation of attacks and defenses for LLM-based agents.",https://openreview.net/forum?id=AC5n7xHuR1,2025,,,,🚫重复 +"Commercial LLM Agents Are Already Vulnerable to Simple Yet Dangerous Attacks +","This paper analyzes LLM agents' unique security and privacy vulnerabilities, categorizes attacks, and conducts attacks on agents with easy-to-implement methods. ",分析大语言模型代理特有的安全与隐私漏洞,给出攻击分类,开展示例攻击且实现简单。 ,Security,Methodology-Attack,"A high volume of recent ML security literature focuses on attacks against aligned large language models (LLMs). These attacks may extract private information or coerce the model into producing harmful outputs. In real-world deployments, LLMs are often part of a larger agentic pipeline including memory systems, retrieval, web access, and API calling. Such additional components introduce vulnerabilities that make these LLM-powered agents much easier to attack than isolated LLMs, yet relatively little work focuses on the security of LLM agents. In this paper, we analyze security and privacy vulnerabilities that are unique to LLM agents. We first provide a taxonomy of attacks categorized by threat actors, objectives, entry points, attacker observability, attack strategies, and inherent vulnerabilities of agent pipelines. We then conduct a series of illustrative attacks on popular open-source and commercial agents, demonstrating the immediate practical implications of their vulnerabilities. Notably, our attacks are trivial to implement and require no understanding of machine learning. +",https://arxiv.org/abs/2502.08586,2025,,,, +"PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety +","The paper proposes PsySafe, a framework based on agent psychology, addressing multi-agent system safety risks and offering insights for future research. ",提出基于代理心理的综合框架PsySafe,从三方面研究多智能体系统安全,为相关研究提供见解 ,Security,Methodology-Guard,"Multi-agent systems, when enhanced with Large Language Models (LLMs), exhibit profound capabilities in collective intelligence. However, the potential misuse of this intelligence for malicious purposes presents significant risks. To date, comprehensive research on the safety issues associated with multi-agent systems remains limited. In this paper, we explore these concerns through the innovative lens of agent psychology, revealing that the dark psychological states of agents constitute a significant threat to safety.
To tackle these concerns, we propose a comprehensive framework (PsySafe) grounded in agent psychology, focusing on three key areas: firstly, identifying how dark personality traits in agents can lead to risky behaviors; secondly, evaluating the safety of multi-agent systems from the psychological and behavioral perspectives; and thirdly, devising effective strategies to mitigate these risks. Our experiments reveal several intriguing phenomena, such as the collective dangerous behaviors among agents, agents' self-reflection when engaging in dangerous behavior, and the correlation between agents' psychological assessments and dangerous behaviors. We anticipate that our framework and observations will provide valuable insights for further research into the safety of multi-agent systems. We will make our data and code publicly accessible at this https URL. +",https://arxiv.org/abs/2401.11880,2024,,,, +"TrustAgent: Towards Safe and Trustworthy LLM-based Agents +","This paper presents TrustAgent, an Agent-Constitution-based framework for LLM-based agents, ensuring safety via three strategies and enhancing helpfulness. ",论文提出TrustAgent框架,从三方面保障大模型智能体安全,增强其安全性与可用性,助力融入人类环境。 ,Security,Methodology-Guard,"The rise of LLM-based agents shows great potential to revolutionize task planning, capturing significant attention. Given that these agents will be integrated into high-stakes domains, ensuring their reliability and safety is crucial. This paper presents an Agent-Constitution-based agent framework, TrustAgent, with a particular focus on improving LLM-based agent safety. The proposed framework ensures strict adherence to the Agent Constitution through three strategic components: a pre-planning strategy which injects safety knowledge into the model before plan generation, an in-planning strategy which enhances safety during plan generation, and a post-planning strategy which ensures safety by post-planning inspection. Our experimental results demonstrate that the proposed framework can effectively enhance an LLM agent's safety across multiple domains by identifying and mitigating potential dangers during planning. Further analysis reveals that the framework not only improves safety but also enhances the helpfulness of the agent. Additionally, we highlight the importance of the LLM reasoning ability in adhering to the Constitution. This paper sheds light on how to ensure the safe integration of LLM-based agents into human-centric environments. Data and code are available at this https URL. +",https://arxiv.org/abs/2402.01586,2024,,,, +AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways,"This survey explores security threats to AI agents, categorizes them into four gaps, and aims to inspire research for more secure applications. ",该综述探讨AI智能体安全威胁,划分四类知识缺口,为应对威胁研究及应用发展提供思路。 ,Security,Survey,"An Artificial Intelligence (AI) agent is a software entity that autonomously performs tasks or makes decisions based on pre-defined objectives and data inputs. AI agents, capable of perceiving user inputs, reasoning and planning tasks, and executing actions, have seen remarkable advancements in algorithm development and task performance. However, the security challenges they pose remain under-explored and unresolved.
This survey delves into the emerging security threats faced by AI agents, categorizing them into four critical knowledge gaps: unpredictability of multi-step user inputs, complexity in internal executions, variability of operational environments, and interactions with untrusted external entities. By systematically reviewing these threats, this article highlights both the progress made and the existing limitations in safeguarding AI agents. The insights provided aim to inspire further research into addressing the security threats associated with AI agents, thereby fostering the development of more robust and secure AI agent applications. +",https://dl.acm.org/doi/abs/10.1145/3716628,2025,,,, +"Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents +","This paper formulates a framework for agent backdoor attacks on LLM-based agents and analyzes their forms, revealing high vulnerability and the need for targeted defenses. ",文章首探LLM-based agents后门攻击,构建框架、分析形式,凸显防御此类攻击研究的紧迫性 ,Security,Methodology-Attack,"Driven by the rapid development of Large Language Models (LLMs), LLM-based agents have been developed to handle various real-world applications, including finance, healthcare, and shopping. It is crucial to ensure the reliability and security of LLM-based agents during applications. However, the safety issues of LLM-based agents are currently under-explored. In this work, we take the first step to investigate one of the typical safety threats, the backdoor attack, to LLM-based agents. We first formulate a general framework of agent backdoor attacks, then we present a thorough analysis of different forms of agent backdoor attacks. Specifically, compared with traditional backdoor attacks on LLMs that are only able to manipulate the user inputs and model outputs, agent backdoor attacks exhibit more diverse and covert forms: (1) From the perspective of the final attacking outcomes, the agent backdoor attacker can not only choose to manipulate the final output distribution, but also introduce the malicious behavior in an intermediate reasoning step only, while keeping the final output correct. (2) Furthermore, the former category can be divided into two subcategories based on trigger locations, in which the backdoor trigger can either be hidden in the user query or appear in an intermediate observation returned by the external environment. We implement the above variations of agent backdoor attacks on two typical agent tasks including web shopping and tool utilization. Extensive experiments show that LLM-based agents suffer severely from backdoor attacks and such backdoor vulnerability cannot be easily mitigated by current textual backdoor defense algorithms. This indicates an urgent need for further research on the development of targeted defenses against backdoor attacks on LLM-based agents. Warning: This paper may contain biased content. +",https://proceedings.neurips.cc/paper_files/paper/2024/hash/b6e9d6f4f3428cd5f3f9e9bbae2cab10-Abstract-Conference.html,2024,,,, +"R-Judge: Benchmarking Safety Risk Awareness for LLM Agents +","The paper introduces R-Judge, a benchmark for evaluating LLM agents' safety risk awareness in diverse scenarios, and finds fine-tuning improves performance. ",提出R-Judge基准评估大模型判断安全风险能力,含多场景数据,调优可提升性能,数据公开。 ,Security,Benchmark,"Large language models (LLMs) have exhibited great potential in autonomously completing tasks across real-world applications.
Despite this, these LLM agents introduce unexpected safety risks when operating in interactive environments. While most prior studies center on the harmlessness of LLM-generated content, this work addresses the imperative need for benchmarking the behavioral safety of LLM agents within diverse environments. We introduce R-Judge, a benchmark crafted to evaluate the proficiency of LLMs in judging and identifying safety risks given agent interaction records. R-Judge comprises 569 records of multi-turn agent interaction, encompassing 27 key risk scenarios among 5 application categories and 10 risk types. It is of high-quality curation with annotated safety labels and risk descriptions. Evaluation of 11 LLMs on R-Judge shows considerable room for enhancing the risk awareness of LLMs: the best-performing model, GPT-4o, achieves 74.42%, while no other model significantly exceeds random performance. Moreover, we reveal that risk awareness in open agent scenarios is a multi-dimensional capability involving knowledge and reasoning, thus challenging for LLMs. With further experiments, we find that fine-tuning on safety judgment significantly improves model performance while straightforward prompting mechanisms fail. R-Judge is publicly available at this https URL. +",https://arxiv.org/abs/2401.10019,2024,,,, +"Recent advancements in LLM Red-Teaming: Techniques, Defenses, and Ethical Considerations +","This survey analyzes recent LLM red-teaming advancements, covering attack methods and defenses, aiming to enhance model security and reliability. ",该综述全面分析大语言模型红队攻击策略与防御机制,助建更安全可靠语言模型。 ,Security,Survey,"Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks, but their vulnerability to jailbreak attacks poses significant security risks. This survey paper presents a comprehensive analysis of recent advancements in attack strategies and defense mechanisms within the field of Large Language Model (LLM) red-teaming. We analyze various attack methods, including gradient-based optimization, reinforcement learning, and prompt engineering approaches. We discuss the implications of these attacks on LLM safety and the need for improved defense mechanisms. This work aims to provide a thorough understanding of the current landscape of red-teaming attacks and defenses on LLMs, enabling the development of more secure and reliable language models. +",https://arxiv.org/abs/2410.09097,2024,,,, +"NetSafe: Exploring the Topological Safety of Multi-agent Networks +","This paper proposes NetSafe, a topological perspective for studying multi-agent network safety; it discovers new phenomena and paves the way for future research. ",从拓扑视角研究多智能体网络安全,提出NetSafe框架,发现新现象,为网络安全研究奠基。 ,Security,Methodology-Guard,"Large language models (LLMs) have empowered nodes within multi-agent networks with intelligence, showing growing applications in both academia and industry. However, how to prevent these networks from generating malicious information remains unexplored, and previous research on the safety of a single LLM is challenging to transfer. In this paper, we focus on the safety of multi-agent networks from a topological perspective, investigating which topological properties contribute to safer networks. To this end, we propose a general framework, NetSafe, along with an iterative RelCom interaction to unify existing diverse LLM-based agent frameworks, laying the foundation for generalized topological safety research.
We identify several critical phenomena when multi-agent networks are exposed to attacks involving misinformation, bias, and harmful information, termed Agent Hallucination and Aggregation Safety. Furthermore, we find that highly connected networks are more susceptible to the spread of adversarial attacks, with task performance in a Star Graph Topology decreasing by 29.7%. Besides, our proposed static metrics aligned more closely with real-world dynamic evaluations than traditional graph-theoretic metrics, indicating that networks with greater average distances from attackers exhibit enhanced safety. In conclusion, our work introduces a new topological perspective on the safety of LLM-based multi-agent networks and discovers several unreported phenomena, paving the way for future research to explore the safety of such networks. +",https://arxiv.org/abs/2410.15686,2024,,,, +"A Trembling House of Cards? Mapping Adversarial Attacks against Language Agents +","This position paper maps adversarial attacks on language agents, offers a conceptual framework, proposes 12 scenarios, and emphasizes the urgency of understanding these risks. ",该文首次系统梳理语言智能体对抗攻击,提出框架及12种攻击场景,强调理解风险的紧迫性。 ,Security,Report,"Language agents powered by large language models (LLMs) have seen exploding development. Their capability of using language as a vehicle for thought and communication lends an incredible level of flexibility and versatility. People have quickly capitalized on this capability to connect LLMs to a wide range of external components and environments: databases, tools, the Internet, robotic embodiment, etc. Many believe an unprecedentedly powerful automation technology is emerging. However, new automation technologies come with new safety risks, especially for intricate systems like language agents. There is a surprisingly large gap between the speed and scale of their development and deployment and our understanding of their safety risks. Are we building a house of cards? In this position paper, we present the first systematic effort in mapping adversarial attacks against language agents. We first present a unified conceptual framework for agents with three major components: Perception, Brain, and Action. Under this framework, we present a comprehensive discussion and propose 12 potential attack scenarios against different components of an agent, covering different attack strategies (e.g., input manipulation, adversarial demonstrations, jailbreaking, backdoors). We also draw connections to successful attack strategies previously applied to LLMs. We emphasize the urgency to gain a thorough understanding of language agent risks before their widespread deployment. +",https://arxiv.org/abs/2402.10196,2024,,,, +,,,,,,,,,,,🚫重复 +,,,,,,,,,,,🚫重复 +Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems,"The paper reveals LLM-to-LLM prompt injection in multi-agent systems (Prompt Infection) and proposes LLM Tagging to mitigate its spread. ",揭示多智能体系统中LLM间提示注入攻击“提示感染”,并提出LLM标签防御机制,强调安全措施紧迫性。 ,Security,Methodology-Attack and Guard,"As Large Language Models (LLMs) grow increasingly powerful, multi-agent systems are becoming more prevalent in modern AI applications. Most safety research, however, has focused on vulnerabilities in single-agent LLMs. These include prompt injection attacks, where malicious prompts embedded in external content trick the LLM into executing unintended or harmful actions, compromising the victim's application.
In this paper, we reveal a more dangerous vector: LLM-to-LLM prompt injection within multi-agent systems. We introduce Prompt Infection, a novel attack where malicious prompts self-replicate across interconnected agents, behaving much like a computer virus. This attack poses severe threats, including data theft, scams, misinformation, and system-wide disruption, all while propagating silently through the system. Our extensive experiments demonstrate that multi-agent systems are highly susceptible, even when agents do not publicly share all communications. To address this, we propose LLM Tagging, a defense mechanism that, when combined with existing safeguards, significantly mitigates infection spread. This work underscores the urgent need for advanced security measures as multi-agent LLM systems become more widely adopted. +",,2025,,,, +Dify,"Dify is an open-source LLM app development platform. Its interface integrates multiple features, enabling rapid transition from prototype to production. ",Dify 是开源大模型应用开发平台,凭直观界面集成多项能力,助开发者快速从原型到生产。 ,Tools,,"Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.",https://github.com/langgenius/dify,2023,,,, +LangChain,"LangChain is an LLM-powered app framework. It simplifies app development, productionization, and deployment, with tools such as LangGraph and LangSmith. ",LangChain是大模型应用开发框架,简化开发、生产、部署流程,还有LangGraph、LangSmith等工具支持。 ,Tools,,"LangChain is a framework for developing applications powered by large language models (LLMs). LangChain simplifies every stage of the LLM application lifecycle: Development: Build your applications using LangChain's open-source components and third-party integrations. Use LangGraph to build stateful agents with first-class streaming and human-in-the-loop support. Productionization: Use LangSmith to inspect, monitor and evaluate your applications, so that you can continuously optimize and deploy with confidence. Deployment: Turn your LangGraph applications into production-ready APIs and Assistants with LangGraph Platform.",https://github.com/langchain-ai/langchain,2023,,,, +,,,,,,,,,,,🚫重复 +,,,,,,,,,,,🚫重复 +,,,,,,,,,,,🚫重复 +Ethical and social risks of harm from Language Models,"This paper analyzes risks of large-scale LMs, outlines six risk areas, reviews 21 risks, and suggests mitigations and research directions. ",文章梳理大语言模型六大风险领域21种风险,指起源与缓解法,强调组织责任并点明研究方向 ,Ethics,Survey,"This paper aims to help structure the risk landscape associated with large-scale Language Models (LMs). In order to foster advances in responsible innovation, an in-depth understanding of the potential risks posed by these models is needed. A wide range of established and anticipated risks are analysed in detail, drawing on multidisciplinary expertise and literature from computer science, linguistics, and social sciences. +We outline six specific risk areas: I. Discrimination, Exclusion and Toxicity, II. Information Hazards, III. Misinformation Harms, IV. Malicious Uses, V. Human-Computer Interaction Harms, VI. Automation, Access, and Environmental Harms. The first area concerns the perpetuation of stereotypes, unfair discrimination, exclusionary norms, toxic language, and lower performance by social group for LMs. The second focuses on risks from private data leaks or LMs correctly inferring sensitive information.
The third addresses risks arising from poor, false or misleading information including in sensitive domains, and knock-on risks such as the erosion of trust in shared information. The fourth considers risks from actors who try to use LMs to cause harm. The fifth focuses on risks specific to LLMs used to underpin conversational agents that interact with human users, including unsafe use, manipulation or deception. The sixth discusses the risk of environmental harm, job automation, and other challenges that may have a disparate effect on different social groups or communities. +In total, we review 21 risks in-depth. We discuss the points of origin of different risks and point to potential mitigation approaches. Lastly, we discuss organisational responsibilities in implementing mitigations, and the role of collaboration and participation. We highlight directions for further research, particularly on expanding the toolkit for assessing and evaluating the outlined risks in LMs.",https://arxiv.org/abs/2112.04359,2021,Arxiv,,, +"On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜 +","This paper questions the size of large language models, explores associated risks, and offers mitigation recommendations beyond just scaling up. ",文章探讨大语言模型规模限度与风险,给出权衡成本、优化数据集等建议,鼓励多元研究方向。 ,Ethics,Eval of Zoo of Models,"The past 3 years of work in NLP have been characterized by the development and deployment of ever larger language models, especially for English. BERT, its variants, GPT-2/3, and others, most recently Switch-C, have pushed the boundaries of the possible both through architectural innovations and through sheer size. Using these pretrained models and the methodology of fine-tuning them for specific tasks, researchers have extended the state of the art on a wide array of tasks as measured by leaderboards on specific benchmarks for English. In this paper, we take a step back and ask: How big is too big? What are the possible risks associated with this technology and what paths are available for mitigating those risks? We provide recommendations including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, carrying out pre-development exercises evaluating how the planned approach fits into research and development goals and supports stakeholder values, and encouraging research directions beyond ever larger language models. +",https://dl.acm.org/doi/10.1145/3442188.3445922,2021,FAccT,,, +"Medical large language models are vulnerable to data-poisoning attacks +","This paper assesses data-poisoning attacks on LLMs in healthcare, finds risks, and proposes a mitigation strategy using knowledge graphs. ",研究模拟数据投毒攻击,发现少量错误信息影响大,提出筛查策略,提升对投毒风险的重视。 ,Ethics,Poisoning Methods,"The adoption of large language models (LLMs) in healthcare demands a careful analysis of their potential to spread false medical knowledge. Because LLMs ingest massive volumes of data from the open Internet during training, they are potentially exposed to unverified medical knowledge that may include deliberately planted misinformation. Here, we perform a threat assessment that simulates a data-poisoning attack against The Pile, a popular dataset used for LLM development. We find that replacement of just 0.001% of training tokens with medical misinformation results in harmful models more likely to propagate medical errors.
Furthermore, we discover that corrupted models match the performance of their corruption-free counterparts on open-source benchmarks routinely used to evaluate medical LLMs. Using biomedical knowledge graphs to screen medical LLM outputs, we propose a harm mitigation strategy that captures 91.9% of harmful content (F1 = 85.7%). Our algorithm provides a unique method to validate stochastically generated LLM outputs against hard-coded relationships in knowledge graphs. In view of current calls for improved data provenance and transparent LLM development, we hope to raise awareness of emergent risks from LLMs trained indiscriminately on web-scraped data, particularly in healthcare where misinformation can potentially compromise patient safety.",https://www.nature.com/articles/s41591-024-03445-1,2025,Nature Medicine,,, +On the Opportunities and Risks of Foundation Models,"This paper provides an account of opportunities and risks of foundation models, notes emergent capabilities and challenges, and calls for interdisciplinary research. ",该论文探讨基础模型机遇与风险,涉能力、应用等,指需跨学科协作应对研究难题 ,Ethics,Survey,"AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (e.g., law, healthcare, education) and societal impact (e.g., inequity, misuse, economic and environmental impact, legal and ethical considerations). Though foundation models are based on standard deep learning and transfer learning, their scale results in new emergent capabilities, and their effectiveness across so many tasks incentivizes homogenization. Homogenization provides powerful leverage but demands caution, as the defects of the foundation model are inherited by all the adapted models downstream. Despite the impending widespread deployment of foundation models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties. To tackle these questions, we believe much of the critical research on foundation models will require deep interdisciplinary collaboration commensurate with their fundamentally sociotechnical nature.",https://arxiv.org/abs/2108.07258,2021,Arxiv,,, +Deconstructing The Ethics of Large Language Models from Long-standing Issues to New-emerging Dilemmas: A Survey,"This survey analyzes ethical challenges of LLMs from old to new issues, stresses integrating ethics into LLM development for responsible models. ",该综述全面探讨大语言模型伦理挑战,分析相关研究,强调开发应融入伦理标准与社会价值。 ,Ethics,Survey,"Large Language Models (LLMs) have achieved unparalleled success across diverse language modeling tasks in recent years. However, this progress has also intensified ethical concerns, impacting the deployment of LLMs in everyday contexts. This paper provides a comprehensive survey of ethical challenges associated with LLMs, from longstanding issues such as copyright infringement, systematic bias, and data privacy, to emerging problems like truthfulness and social norms.
We critically analyze existing research aimed at understanding, examining, and mitigating these ethical risks. Our survey underscores the importance of integrating ethical standards and societal values into the development of LLMs, thereby guiding the development of responsible and ethically aligned language models.",https://ui.adsabs.harvard.edu/abs/2024arXiv240605392D/abstract,2024,,,, +Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims,This survey suggests steps to improve the verifiability of AI claims. It analyzes ten mechanisms and offers implementation-related recommendations. ,论文聚焦可信AI开发,提出提升AI声明可验证性的步骤,分析十种机制并给出实施建议。 ,Ethics,Survey,"With the recent wave of progress in artificial intelligence (AI) has come a growing awareness of the large-scale impacts of AI systems, and recognition that existing regulations and norms in industry and academia are insufficient to ensure responsible AI development. In order for AI developers to earn trust from system users, customers, civil society, governments, and other stakeholders that they are building AI responsibly, they will need to make verifiable claims to which they can be held accountable. Those outside of a given organization also need effective means of scrutinizing such claims. This report suggests various steps that different stakeholders can take to improve the verifiability of claims made about AI systems and their associated development processes, with a focus on providing evidence about the safety, security, fairness, and privacy protection of AI systems. We analyze ten mechanisms for this purpose--spanning institutions, software, and hardware--and make recommendations aimed at implementing, exploring, or improving those mechanisms.",https://arxiv.org/abs/2004.07213,2020,Arxiv,,, +Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets,"The paper proposes PALMS, an iterative process using values-targeted datasets to adapt LMs to society, feasible with small curated data. ",提出PALMS流程,用价值观导向数据集调整语言模型行为,小数据集即可有效,效果随模型增大提升。 ,Ethics,Method,"Language models can generate harmful and biased outputs and exhibit undesirable behavior according to a given cultural context. We propose a Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets, an iterative process to significantly change model behavior by crafting and fine-tuning on a dataset that reflects a predetermined set of target values. We evaluate our process using three metrics: quantitative metrics with human evaluations that score output adherence to a target value; toxicity scoring on outputs; and qualitative metrics analyzing the most common word associated with a given social category. Through each iteration, we add additional training dataset examples based on observed shortcomings from evaluations. PALMS performs significantly better on all metrics compared to baseline and control models for a broad range of GPT-3 language model sizes without compromising capability integrity. We find that the effectiveness of PALMS increases with model size. We show that significantly adjusting language model behavior is feasible with a small, hand-curated dataset.",https://proceedings.neurips.cc/paper_files/paper/2021/hash/2e855f9489df0712b4bd8ea9e2848c5a-Abstract.html,2021,NIPS,,, +Predictability and Surprise in Large Generative Models,"The paper highlights large generative models' paradox of predictable loss and unpredictable capabilities, shows harms, and suggests AI-community interventions. 
",指出大生成模型兼具可预测损失与不可预测能力的矛盾特性,分析影响及动机,给出干预建议 ,Ethics,Method,"Large-scale pre-training has recently emerged as a technique for creating capable, general-purpose, generative models such as GPT-3, Megatron-Turing NLG, Gopher, and many others. In this paper, we highlight a counterintuitive property of such models and discuss the policy implications of this property. Namely, these generative models have a paradoxical combination of predictable loss on a broad training distribution (as embodied in their ”scaling laws”), and unpredictable specific capabilities, inputs, and outputs. We believe that the high-level predictability and appearance of useful capabilities drives rapid development of such models, while the unpredictable qualities make it difficult to anticipate the consequences of model deployment. We go through examples of how this combination can lead to socially harmful behavior with examples from the literature and real world observations, and we also perform two novel experiments to illustrate our point about harms from unpredictability. Furthermore, we analyze how these conflicting properties combine to give model developers various motivations for deploying these models, and challenges that can hinder deployment. We conclude with a list of possible interventions the AI community may take to increase the chance of these models having a beneficial impact. We intend for this paper to be useful to policymakers who want to understand and regulate AI systems, technologists who care about the potential policy impact of their work, funders who want to support work addressing these challenges, and academics who want to analyze, critique, and potentially develop large generative models.",https://dl.acm.org/doi/abs/10.1145/3531146.3533229,2022,FAccT,,, +"Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model","This paper quantifies BLOOM's lifecycle carbon footprint, studies its inference emissions, and discusses estimation difficulties and future research. ",该论文量化BLOOM全生命周期碳足迹,研究部署推理能耗与排放,探讨估算难点与研究方向。 ,Ethics,Method,"Progress in machine learning (ML) comes with a cost to the environment, given that training ML models requires computational resources, energy and materials. In the present article, we aim to quantify the carbon footprint of BLOOM, a 176-billion parameter language model, across its life cycle. We estimate that BLOOM’s final training emitted approximately 24.7 tonnes of CO2eq if we consider only the dynamic power consumption, and 50.5 tonnes if we account for all processes ranging from equipment manufacturing to energy-based operational consumption. We also carry out an empirical study to measure the energy requirements and carbon emissions of its deployment for inference via an API endpoint receiving user queries in real-time. We conclude with a discussion regarding the difficulty of precisely estimating the carbon footprint of ML models and future research directions that can contribute towards improving carbon emissions reporting.",https://www.jmlr.org/papers/v24/23-0069.html,2023,JMLR,,, +"Foundation Models and Fair Use +","This paper analyzes risks of using copyrighted content in foundation models, discusses fair - use mitigations, and suggests law - tech co - evolution for IP - innovation balance. ",分析基于版权内容开发和部署基础模型风险,探讨技术缓解措施,强调法律与技术应协同发展。 ,Ethics,Method,"Existing foundation models are trained on copyrighted material. Deploying these models can pose both legal and ethical risks when data creators fail to receive appropriate attribution or compensation. 
In the United States and several other countries, copyrighted content may be used to build foundation models without incurring liability due to the fair use doctrine. However, there is a caveat: If the model produces output that is similar to copyrighted data, particularly in scenarios that affect the market of that data, fair use may no longer apply to the output of the model. In this work, we emphasize that fair use is not guaranteed, and additional work may be necessary to keep model development and deployment squarely in the realm of fair use. First, we survey the potential risks of developing and deploying foundation models based on copyrighted content. We review relevant U.S. case law, drawing parallels to existing and potential applications for generating text, source code, and visual art. Experiments confirm that popular foundation models can generate content considerably similar to copyrighted material. Second, we discuss technical mitigations that can help foundation models stay in line with fair use. We argue that more research is needed to align mitigation strategies with the current state of the law. Third, we suggest that the law and technical mitigations should co-evolve. For example, coupled with other policy mechanisms, the law could more explicitly consider safe harbors when strong technical tools are used to mitigate infringement harms. This co-evolution may help strike a balance between intellectual property and innovation, which speaks to the original goal of fair use. But we emphasize that the strategies we describe here are not a panacea and more work is needed to develop policies that address the potential harms of foundation models.",https://www.jmlr.org/papers/v24/23-0569.html,2024,JMLR,,, +"GPT-3: Its Nature, Scope, Limits, and Consequences","The paper analyzes GPT-3 using reversible/irreversible questions and three tests, shows its limitations, and outlines consequences of semantic artefact industrialization. ",探讨可逆与不可逆问题,以三类问题测试GPT-3,指出其非通用AI,还提及语义制品工业化后果。 ,Ethics,Method,"In this commentary, we discuss the nature of reversible and irreversible questions, that is, questions that may enable one to identify the nature of the source of their answers. We then introduce GPT-3, a third-generation, autoregressive language model that uses deep learning to produce human-like texts, and use the previous distinction to analyse it. We expand the analysis to present three tests based on mathematical, semantic (that is, the Turing Test), and ethical questions and show that GPT-3 is not designed to pass any of them. This is a reminder that GPT-3 does not do what it is not supposed to do, and that any interpretation of GPT-3 as the beginning of the emergence of a general form of artificial intelligence is merely uninformed science fiction. We conclude by outlining some of the significant consequences of the industrialisation of automatic and cheap production of good, semantic artefacts.",https://link.springer.com/article/10.1007/s11023-020-09548-1,2020,,,, +Large Language Model Alignment: A Survey,"This survey explores LLM alignment methods, categorizes them, discusses issues, presents benchmarks, and offers future research visions to bridge gaps. ",该综述全面探索大语言模型对齐方法,分类现有方法,探讨问题、给出评估指标,展望未来研究方向。 ,Ethics,Survey,"Recent years have witnessed remarkable progress made in large language models (LLMs). Such advancements, while garnering significant attention, have concurrently elicited various concerns.
The potential of these models is undeniably vast; however, they may yield texts that are imprecise, misleading, or even detrimental. Consequently, it becomes paramount to employ alignment techniques to ensure that these models exhibit behaviors consistent with human values. +This survey endeavors to furnish an extensive exploration of alignment methodologies designed for LLMs, in conjunction with the extant capability research in this domain. Adopting the lens of AI alignment, we categorize the prevailing methods and emergent proposals for the alignment of LLMs into outer and inner alignment. We also probe into salient issues including the models' interpretability, and potential vulnerabilities to adversarial attacks. To assess LLM alignment, we present a wide variety of benchmarks and evaluation methodologies. After discussing the state of alignment research for LLMs, we finally cast a vision toward the future, contemplating the promising avenues of research that lie ahead. +Our aspiration for this survey extends beyond merely spurring research interests in this realm. We also envision bridging the gap between the AI alignment research community and the researchers engrossed in the capability exploration of LLMs for both capable and safe LLMs.",https://arxiv.org/abs/2309.15025,2023,Arxiv,,, +LLaMA: Open and Efficient Foundation Language Models,"The paper introduces the LLaMA foundation models (7B–65B params), trains them on public data, and releases them to the research community. ",本文推出 LLaMA 系列基础语言模型,用公开数据集训练达 SOTA,模型已开源供科研。 ,Ethics,Method,"We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.",https://ai.meta.com/research/publications/llama-open-and-efficient-foundation-language-models/,2023,,,, +Energy and Policy Considerations for Modern Deep Learning Research,"The paper quantifies NLP neural network training costs, incorporates updates, and offers recommendations to cut costs and improve equity in AI. ",论文指出深度学习计算成本与环境代价高,总结NLP研究结果并提降成本、促公平建议 ,Ethics,Analyze,"The field of artificial intelligence has experienced a dramatic methodological shift towards large neural networks trained on plentiful data. This shift has been fueled by recent advances in hardware and techniques enabling remarkable levels of computation, resulting in impressive advances in AI across many applications. However, the massive computation required to obtain these exciting results is costly both financially, due to the price of specialized hardware and electricity or cloud compute time, and to the environment, as a result of non-renewable energy used to fuel modern tensor processing hardware. In a paper published this year at ACL, we brought this issue to the attention of NLP researchers by quantifying the approximate financial and environmental costs of training and tuning neural network models for NLP (Strubell, Ganesh, and McCallum 2019).
In this extended abstract, we briefly summarize our findings in NLP, incorporating updated estimates and broader information from recent related publications, and provide actionable recommendations to reduce costs and improve equity in the machine learning and artificial intelligence community.",https://ojs.aaai.org/index.php/AAAI/article/view/7123,2020,AAAI,,, +Actionable Auditing: Investigating the Impact of Publicly Naming Biased Performance Results of Commercial AI Products,"This paper analyzes the impact of disclosing biased AI results through the Gender Shades audit, showing targets improved fairness and performance more than non-targets. ",文章以Gender Shades审计为例,分析公开披露偏见AI结果的商业影响,展示企业改进及表现差异。 ,Ethics,Survey,"Although algorithmic auditing has emerged as a key strategy to expose systematic biases embedded in software platforms, we struggle to understand the real-world impact of these audits, as scholarship on the impact of algorithmic audits on increasing algorithmic fairness and transparency in commercial systems is nascent. To analyze the impact of publicly naming and disclosing performance results of biased AI systems, we investigate the commercial impact of Gender Shades, the first algorithmic audit of gender and skin type performance disparities in commercial facial analysis models. This paper 1) outlines the audit design and structured disclosure procedure used in the Gender Shades study, 2) presents new performance metrics from targeted companies IBM, Microsoft and Megvii (Face++) on the Pilot Parliaments Benchmark (PPB) as of August 2018, 3) provides performance results on PPB by non-target companies Amazon and Kairos, and 4) explores differences in company responses as shared through corporate communications that contextualize differences in performance on PPB. Within 7 months of the original audit, we find that all three targets released new API versions. All targets reduced accuracy disparities between males and females and darker and lighter-skinned subgroups, with the most significant update occurring for the darker-skinned female subgroup, which underwent a 17.7%–30.4% reduction in error between audit periods. Minimizing these disparities led to a 5.72% to 8.3% reduction in overall error on the Pilot Parliaments Benchmark (PPB) for target corporation APIs. The overall performance of non-targets Amazon and Kairos lags significantly behind that of the targets, with error rates of 8.66% and 6.60% overall, and error rates of 31.37% and 22.50% for the darker female subgroup, respectively.",https://dl.acm.org/doi/abs/10.1145/3306618.3314244?casa_token=1ogqoO70pDgAAAAA:7r8-ICJ2Ym55Fg2aaW11gpz7FR15yYHzuqBdGu7ifBfkiMRdbknxo34ItX_GwjeUZPg9k4U22tRX,2019,AIES,,, +Defending Against Neural Fake News,"The paper presents Grover for controllable text generation, emphasizes robust verification, discusses ethics, and plans a public release to enable better detection of neural fake news. ",提出可控文本生成模型Grover,研究对抗技术,指出其自辨效果佳,还探讨伦理并计划公开模型。 ,Ethics,Method,"Recent progress in natural language generation has raised dual-use concerns. While applications like summarization and translation are positive, the underlying technology also might enable adversaries to generate neural fake news: targeted propaganda that closely mimics the style of real news. +Modern computer security relies on careful threat modeling: identifying potential threats and vulnerabilities from an adversary's point of view, and exploring potential mitigations to these threats.
Likewise, developing robust defenses against neural fake news requires us first to carefully investigate and characterize the risks of these models. We thus present a model for controllable text generation called Grover. Given a headline like 'Link Found Between Vaccines and Autism,' Grover can generate the rest of the article; humans find these generations to be more trustworthy than human-written disinformation. +Developing robust verification techniques against generators like Grover is critical. We find that the best current discriminators can classify neural fake news from real, human-written news with 73% accuracy, assuming access to a moderate level of training data. Counterintuitively, the best defense against Grover turns out to be Grover itself, with 92% accuracy, demonstrating the importance of public release of strong generators. We investigate these results further, showing that exposure bias -- and sampling strategies that alleviate its effects -- both leave artifacts that similar discriminators can pick up on. We conclude by discussing ethical issues regarding the technology, and plan to release Grover publicly, helping pave the way for better detection of neural fake news.",https://proceedings.neurips.cc/paper/2019/hash/3e9f0fc9b2f89e043bc6233994dfcf76-Abstract.html,2019,NIPS,,, +,,,,,,,,,,,🚫重复 +Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation,"The paper proposes a benchmark self-evolving framework using a multi-agent system. It extends benchmarks via reframing, enabling more accurate LLM evaluation. ",提出基准自进化框架,用多智能体系统扩展基准,实施六种重构操作,助力精确评估与模型选择。 ,Agent Evolution,Methodology,"This paper presents a benchmark self-evolving framework to dynamically evaluate rapidly advancing Large Language Models (LLMs), aiming for a more accurate assessment of their capabilities and limitations. We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence that dynamically extend existing benchmarks. Towards a more scalable, robust and fine-grained evaluation, we implement six reframing operations to construct evolving instances testing LLMs against diverse queries, data noise and probing their problem-solving sub-abilities. With this framework, we extend benchmark datasets of four tasks. Experimental results show a general performance decline in most LLMs against their original results. This decline under our scalable and robust evaluations, alongside our fine-grained evaluation, more accurately reflects models' capabilities. Besides, our framework widens performance discrepancies both between different models and within the same model across various tasks, facilitating more informed model selection for specific tasks.",https://arxiv.org/pdf/2402.11443,2024,Arxiv,,,🚫重复 +Agent-Pro: Learning to Evolve via Policy-Level Reflection and Optimization,"This paper proposes Agent-Pro, an LLM-based agent with policy-level reflection and optimization, enabling learning and evolution in dynamic scenarios. ",提出 Agent-Pro,具备策略级反思与优化能力,可从互动中学习,适用于复杂动态场景。 ,Agent Evolution,Methodology,"Large Language Models (LLMs) exhibit robust problem-solving capabilities for diverse tasks. However, most LLM-based agents are designed as specific task solvers with sophisticated prompt engineering, rather than agents capable of learning and evolving through interactions.
These task solvers necessitate manually crafted prompts to inform task rules and regulate LLM behaviors, making them inherently incapable of addressing complex dynamic scenarios, e.g., large interactive games. In light of this, we propose Agent-Pro: an LLM-based Agent with Policy-level Reflection and Optimization that can learn a wealth of expertise from interactive experiences and progressively elevate its behavioral policy. Specifically, it involves a dynamic belief generation and reflection process for policy evolution. Rather than action-level reflection, Agent-Pro iteratively reflects on past trajectories and beliefs, ""fine-tuning"" its irrational beliefs for a better policy. Moreover, a depth-first search is employed for policy optimization, ensuring continual enhancement in policy payoffs. Agent-Pro is evaluated across two games: Blackjack and Texas Hold’em, outperforming vanilla LLM and specialized models. Our results show Agent-Pro can learn and evolve in complex and dynamic scenes, which also benefits numerous LLM-based applications.",https://aclanthology.org/2024.acl-long.292.pdf,2024,ACL,,, +Coevolving with the Other You: Fine-Tuning LLM with Sequential Cooperative Multi-Agent Reinforcement Learning,"The paper proposes CORY, extending LLM fine-tuning to sequential cooperative multi-agent RL, enabling coevolution between agents for better LLM refinement. ",论文提出CORY将LLM微调扩展到多智能体框架,促进协作共进化,有望用于现实场景精调LLM。 ,Agent Evolution,Methodology,"Reinforcement learning (RL) has emerged as a pivotal technique for fine-tuning large language models (LLMs) on specific tasks. However, prevailing RL fine-tuning methods predominantly rely on PPO and its variants. Though these algorithms are effective in general RL settings, they often exhibit suboptimal performance and vulnerability to distribution collapse when applied to the fine-tuning of LLMs. In this paper, we propose CORY, extending the RL fine-tuning of LLMs to a sequential cooperative multi-agent reinforcement learning framework, to leverage the inherent coevolution and emergent capabilities of multi-agent systems. In CORY, the LLM to be fine-tuned is initially duplicated into two autonomous agents: a pioneer and an observer. The pioneer generates responses based on queries, while the observer generates responses using both the queries and the pioneer’s responses. The two agents are trained together. During training, the agents exchange roles periodically, fostering cooperation and coevolution between them. Experiments evaluate CORY's performance by fine-tuning GPT-2 and Llama-2 under subjective and objective reward functions on the IMDB Review and GSM8K datasets, respectively. Results show that CORY outperforms PPO in terms of policy optimality, resistance to distribution collapse, and training robustness, thereby underscoring its potential as a superior methodology for refining LLMs in real-world applications. The code can be found at: https://github.com/Harry67Hu/CORY. ",https://proceedings.neurips.cc/paper_files/paper/2024/file/1c2b1c8f7d317719a9ce32dd7386ba35-Paper-Conference.pdf,2024,NIPS,,, +A Survey on Self-Evolution of Large Language Models,"This paper surveys self-evolution approaches in LLMs, proposes a framework, categorizes objectives, and offers insights and future directions for self-evolving LLMs. ",该文全面调研大模型自我进化方法,提出框架、分类目标、总结模块,指明挑战与方向。 ,Agent Evolution,Methodology,"Large language models (LLMs) have significantly advanced in various fields and intelligent agent applications.
However, current LLMs that learn from human or external model supervision are costly and may face performance ceilings as task complexity and diversity increase. To address this issue, self-evolution approaches that enable LLMs to autonomously acquire, refine, and learn from experiences generated by the model itself are rapidly growing. This new training paradigm, inspired by the human experiential learning process, offers the potential to scale LLMs towards superintelligence. In this work, we present a comprehensive survey of self-evolution approaches in LLMs. We first propose a conceptual framework for self-evolution and outline the evolving process as iterative cycles composed of four phases: experience acquisition, experience refinement, updating, and evaluation. Second, we categorize the evolution objectives of LLMs and LLM-based agents; then, we summarize the literature and provide taxonomy and insights for each module. Lastly, we pinpoint existing challenges and propose future directions to improve self-evolution frameworks, equipping researchers with critical insights to fast-track the development of self-evolving LLMs. Our corresponding GitHub repository is available at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/Awesome-Self-Evolution-of-LLM.",https://arxiv.org/pdf/2404.14387,2024,Arxiv,,, +LLM-Evolve: Evaluation for LLM’s Evolving Capability on Benchmarks,"The paper proposes LLM-Evolve, a novel framework extending benchmarks for sequential problem-solving, evaluating LLMs' iterative learning ability. ",论文提出LLM-Evolve框架,拓展基准至序贯问题场景,助大模型从交互学习提升性能。 ,Agent Evolution,Methodology,"The advancement of large language models (LLMs) has extended their use to dynamic and interactive real-world applications, where models engage continuously with their environment and potentially enhance their performance over time. Most existing LLM benchmarks evaluate LLMs on i.i.d. tasks, overlooking their ability to learn iteratively from past experiences. Our paper bridges this evaluation gap by proposing a novel framework, LLM-Evolve, which extends established benchmarks to sequential problem-solving settings. LLM-Evolve evaluates LLMs over multiple rounds, providing feedback after each round to build a demonstration memory that the models can query in future tasks. We applied LLM-Evolve to the MMLU, GSM8K, and AgentBench benchmarks, testing 8 state-of-the-art open-source and closed-source models. Results show that LLMs can achieve performance improvements of up to 17% by learning from past interactions, with the quality of retrieval algorithms and feedback significantly influencing this capability. These insights advocate for more understanding and benchmarks for LLMs’ performance in evolving interactive scenarios. ",https://aclanthology.org/2024.emnlp-main.940.pdf,2024,EMNLP,,, +AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback ,"This paper presents AlpacaFarm, a low-cost simulator for feedback-learning R&D. It offers prompt design, evaluation, and reference code, and validates end-to-end. ",论文提出AlpacaFarm模拟器,模拟反馈降成本,有自动评估和参考实现,验证效果良好。 ,Agent Evolution,Methodology,"Large language models (LLMs) such as ChatGPT have seen widespread adoption due to their ability to follow user instructions well. Developing these LLMs involves a complex yet poorly understood workflow requiring training with human feedback.
Replicating and understanding this instruction-following process faces three major challenges: the high cost of data collection, the lack of trustworthy evaluation, and the absence of reference method implementations. We address these challenges with AlpacaFarm, a simulator that enables research and development for learning from feedback at a low cost. First, we design LLM prompts to simulate human feedback that are 45x cheaper than crowd-workers and display high agreement with humans. Second, we propose an automatic evaluation and validate it against human instructions obtained on real-world interactions. Third, we contribute reference implementations for several methods (PPO, best-of-n, expert iteration, and more) that learn from pairwise feedback. Finally, as an end-to-end validation of AlpacaFarm, we train and evaluate eleven models on 10k pairs of real human feedback and show that the rankings of models trained in AlpacaFarm match the rankings of models trained on human data. As a demonstration of the research possible in AlpacaFarm, we find that methods that use a reward model can substantially improve over supervised fine-tuning and that our reference PPO implementation leads to a +10% improvement in win-rate against Davinci003. We release AlpacaFarm at https://github.com/tatsu-lab/alpaca_farm. ",https://proceedings.neurips.cc/paper_files/paper/2023/file/5fc47800ee5b30b8777fdd30abcaaf3b-Paper-Conference.pdf,2023,NIPS,,, +"SELF-REFINE: + Iterative Refinement with Self-Feedback ","This paper introduces SELF-REFINE, an approach refining LLMs' outputs iteratively with self-feedback, no extra training needed. ",提出SELF-REFINE方法,借助LLM自反馈迭代优化输出,无需额外训练,能提升模型性能。 ,Agent Evolution,Methodology,"Like humans, large language models (LLMs) do not always generate the best output on their first try. Motivated by how humans refine their written text, we introduce SELF-REFINE, an approach for improving initial outputs from LLMs through iterative feedback and refinement. The main idea is to generate an initial output using an LLM; then, the same LLM provides feedback for its output and uses it to refine itself, iteratively. SELF-REFINE does not require any supervised training data, additional training, or reinforcement learning, and instead uses a single LLM as the generator, refiner, and feedback provider. We evaluate SELF-REFINE across 7 diverse tasks, ranging from dialog response generation to mathematical reasoning, using state-of-the-art (GPT-3.5 and GPT-4) LLMs. Across all evaluated tasks, outputs generated with SELF-REFINE are preferred by humans and automatic metrics over those generated with the same LLM using conventional one-step generation, improving by ∼20% absolute on average in task performance. Our work demonstrates that even state-of-the-art LLMs like GPT-4 can be further improved at test-time using our simple, standalone approach. ",https://openreview.net/pdf?id=S37hOerQLB,2023,NIPS,,, +Self-Evolution Learning for Discriminative Language Model Pretraining,"The paper proposes Self-Evolution learning (SE), a token masking & learning method to exploit data knowledge, improving linguistic learning & generalization. ",提出Self-Evolution learning(SE)方法,有效利用数据知识,自适应训练,提升语言知识学习与泛化能力。 ,Agent Evolution,Methodology,"Masked language modeling, widely used in discriminative language model (e.g., BERT) pretraining, commonly adopts a random masking strategy.
However, random masking does not consider the importance of the different words in the sentence meaning, where some of them are more worth predicting. Therefore, various masking strategies (e.g., entity-level masking) have been proposed, but most of them require expensive prior knowledge and generally train from scratch without reusing existing model weights. In this paper, we present Self-Evolution learning (SE), a simple and effective token masking and learning method to fully and wisely exploit the knowledge from data. SE focuses on learning the informative yet under-explored tokens and adaptively regularizes the training by introducing a novel Token-specific Label Smoothing approach. Experiments on 10 tasks show that our SE brings consistent and significant improvements (+1.43∼2.12 average scores) upon different PLMs. In-depth analyses demonstrate that SE improves linguistic knowledge learning and generalization. ",https://aclanthology.org/2023.findings-acl.254.pdf,2023,ACL,,, +CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing,"The paper introduces CRITIC, a framework enabling LLMs to self-correct outputs like humans with tools, highlighting external feedback's key role. ",提出CRITIC框架,让大语言模型像人类用工具般验证和修正自身输出,推动模型自我完善。 ,Agent Evolution,Methodology,"Recent developments in large language models (LLMs) have been impressive. However, these models sometimes show inconsistencies and problematic behavior, such as hallucinating facts, generating flawed code, or creating offensive and toxic content. Unlike these models, humans typically utilize external tools to cross-check and refine their initial content, like using a search engine for fact-checking, or a code interpreter for debugging. Inspired by this observation, we introduce a framework called CRITIC that allows LLMs, which are essentially “black boxes”, to validate and progressively amend their own outputs in a manner similar to human interaction with tools. More specifically, starting with an initial output, CRITIC interacts with appropriate tools to evaluate certain aspects of the text, and then revises the output based on the feedback obtained during this validation process. Comprehensive evaluations involving free-form question answering, mathematical program synthesis, and toxicity reduction demonstrate that CRITIC consistently enhances the performance of LLMs. Meanwhile, our research highlights the crucial importance of external feedback in promoting the ongoing self-improvement of LLMs. ",https://openreview.net/pdf?id=Sx038qxjek,2024,ICLR,,, +Iterative Translation Refinement with Large Language Models ,"The paper proposes iterative prompting of LLMs for translation self-correction, emphasizing source-anchoring, and shows good human-perceived quality. ",提出迭代提示大模型自纠错翻译法,多轮查询后质量提升,强调锚定原文和合理初译重要性。 ,Agent Evolution,Methodology,"We propose iteratively prompting a large language model to self-correct a translation, with inspiration from its strong language capability as well as a human-like translation approach. Interestingly, multi-turn querying reduces the output’s string-based metric scores, but neural metrics suggest comparable or improved quality after two or more iterations. Human evaluations indicate better fluency and naturalness compared to initial translations and even human references, all while maintaining quality. Ablation studies underscore the importance of anchoring the refinement to the source and a reasonable seed translation for quality considerations.
We also discuss the challenges in evaluation and the relation to human performance and translationese. ",https://aclanthology.org/2024.eamt-1.17.pdf,2024,EAMT,,, +Evolutionary optimization of model merging recipes ,"The paper proposes an evolutionary approach for model merging, optimizing in parameter and data flow spaces, and paves the way for foundation model development. ",提出进化方法自动组合开源模型,跨域合并表现优,贡献新模型及自动组合范式。 ,Agent Evolution,Methodology,"Large language models (LLMs) have become increasingly capable, but their development often requires substantial computational resources. Although model merging has emerged as a cost-effective, promising approach for creating new models by combining existing ones, it currently relies on human intuition and domain knowledge, limiting its potential. Here we propose an evolutionary approach that overcomes this limitation by automatically discovering effective combinations of diverse open-source models, harnessing their collective intelligence without requiring +extensive additional training data or compute. Our approach operates in both parameter space and data flow space, allowing optimization beyond just the weights of the individual models. This approach even facilitates cross-domain merging, generating models such as a Japanese LLM with math reasoning capabilities. Surprisingly, our Japanese math LLM achieved state-of-the-art performance on a variety of established Japanese LLM benchmarks, even surpassing models with substantially more parameters, despite not being explicitly trained for such tasks. Furthermore, a culturally aware Japanese vision–language model generated through our approach demonstrates its effectiveness in describing Japanese culture-specific content, outperforming previous Japanese vision–language models. +This work not only contributes new state-of-the-art models back to the open-source community but also introduces a new paradigm for automated model composition, paving the way for exploring alternative, efficient approaches to foundation model development. ",https://www.nature.com/articles/s42256-024-00975-8,2025,NMI,,, +Agent Alignment in Evolving Social Norms ,"The paper proposes an EvolutionaryAgent framework, transforming agent alignment into an evolutionary process, applicable to various LLMs. ",提出 EvolutionaryAgent 框架,将智能体对齐转为适者生存进化选择过程,适用于多种大模型。 ,Agent Evolution,Methodology,"Agents based on Large Language Models (LLMs) are increasingly permeating various domains of human production and life, highlighting the importance of aligning them with human values. The current alignment of AI systems primarily focuses on passively aligning LLMs through human intervention. However, agents possess characteristics like receiving environmental feedback and self-evolution, rendering the LLM alignment methods inadequate. In response, we propose an evolutionary framework for agent evolution and alignment, named EvolutionaryAgent, which transforms agent alignment into a process of evolution and selection under the principle of survival of the fittest. In an environment where social norms continuously evolve, agents better adapted to the current social norms will have a higher probability of survival and proliferation, while those inadequately aligned dwindle over time. Experimental results assessing the agents from multiple perspectives in aligning with social norms demonstrate that EvolutionaryAgent can align progressively better with the evolving social norms while maintaining its proficiency in general tasks.
Effectiveness tests conducted on various open and closed-source LLMs as the foundation for agents also prove the applicability of our approach. ",https://arxiv.org/pdf/2401.04620,2024,Arxiv,,, +Mitigating the Alignment Tax of RLHF ,"The paper explores the alignment tax in RLHF, shows model averaging's Pareto-optimality, and proposes HMA to balance alignment and forgetting. ",针对RLHF对齐税问题,文章发现模型平均有效,提出HMA最大化对齐性能并减少对齐税。 ,Agent Evolution,Methodology,"LLMs acquire a wide range of abilities during pre-training, but aligning LLMs under Reinforcement Learning with Human Feedback (RLHF) can lead to forgetting pretrained abilities, which is also known as the alignment tax. To investigate alignment tax, we conducted experiments with existing RLHF algorithms using OpenLLaMA-3B, which revealed a pronounced alignment tax in NLP tasks. However, despite various techniques to mitigate forgetting, they are often at odds with RLHF performance, resulting in an alignment-forgetting trade-off. +In this paper we show that model averaging, which simply interpolates between pre- and post-RLHF model weights, surprisingly achieves the strongest alignment-forgetting Pareto front among a wide range of competing methods. To understand its effectiveness, we offer theoretical insights into model averaging, revealing that it enhances the performance Pareto front by increasing feature diversity on the layers where tasks share overlapped feature spaces. Empirical evidence corroborates our analysis by showing the benefits of averaging low-level transformer layers. Building on the analysis and the observation that averaging different layers of the transformer leads to significantly different alignment-forgetting trade-offs, we propose Heterogeneous Model Averaging (HMA) to heterogeneously find various combination ratios of model layers. HMA seeks to maximize the alignment performance while incurring minimal alignment tax. Moreover, we validate HMA’s performance across a range of RLHF algorithms over OpenLLaMA-3B and further extend our findings to Mistral-7B, which is evaluated by an open-sourced preference model and GPT-4. Code available here. ",https://aclanthology.org/2024.emnlp-main.35.pdf,2024,EMNLP,,, +Self-Rewarding Language Models ,"This paper studies Self-Rewarding LMs, using LLM-as-a-Judge to provide self-rewards. It may enable continual improvement on both instruction-following and reward-giving. ",论文研究自奖励语言模型,以模型自身作评判给奖励,或助模型在多方面持续提升。 ,Agent Evolution,Methodology,"We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may then be bottlenecked by human performance level, and secondly these separate frozen reward models cannot then learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training, not only does instruction following ability improve, but also the ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613.
While there is much left still to explore, this work opens the door to the possibility of models that can continually improve in both axes. ",https://arxiv.org/pdf/2401.10020,2024,Arxiv,,, +CREAM: Consistency Regularized Self-Rewarding Language Models,"The paper proposes CREAM, which regularizes self-rewarding training with consistency across iterations, using more reliable preference data. ",提出CREAM模型,在自奖励框架引入正则化,利用迭代奖励一致性训练,提升奖励一致性和对齐性能。 ,Agent Evolution,Methodology,"Recent self-rewarding large language models (LLMs) have successfully applied LLM-as-a-Judge to iteratively improve alignment performance without the need for human annotations for preference data. These methods commonly utilize the same LLM to act as both the policy model (which generates responses) and the reward model (which scores and ranks those responses). The ranked responses are then used as preference pairs to train the LLM via direct alignment technologies (e.g. DPO). However, it is noteworthy that throughout this process, there is no guarantee of accuracy in the rewarding and ranking, which is critical for ensuring accurate rewards and high-quality preference data. Empirical results from relatively small LLMs (e.g., 7B parameters) also indicate that improvements from self-rewarding may diminish after several iterations in certain situations, which we hypothesize is due to accumulated bias in the reward system. This bias can lead to unreliable preference data for training the LLM. To address this issue, we first formulate and analyze the generalized iterative preference fine-tuning framework for self-rewarding language models. We then introduce regularization to this generalized framework to mitigate the overconfident preference labeling in the self-rewarding process. Based on this theoretical insight, we propose a Consistency Regularized sElf-rewarding lAnguage Model (CREAM) that leverages the rewarding consistency across different iterations to regularize the self-rewarding training, helping the model to learn from more reliable preference data. With this explicit regularization, our empirical results demonstrate the superiority of CREAM in improving both reward consistency and alignment performance. The code is publicly available at https://github.com/Raibows/CREAM. ",https://openreview.net/pdf?id=Vf6RDObyEF,2025,ICLR,,, +Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning ,"The paper introduces a self-evolving mechanism DIVERSEEVOL for label-efficient instruction tuning, enhancing subset diversity, with code available on GitHub. ",提出自进化机制DIVERSEEVOL,增强所选子集多样性,用少量数据达全量数据效果,公开代码。 ,Agent Evolution,Methodology,"Enhancing the instruction-following ability of Large Language Models (LLMs) primarily demands substantial instruction-tuning datasets. However, the sheer volume of these datasets imposes a considerable computational burden and annotation cost. To investigate a label-efficient instruction tuning method that allows the model itself to actively sample subsets that are equally or even more effective, we introduce a self-evolving mechanism, DIVERSEEVOL. In this process, a model iteratively augments its training subset to refine its own performance, without requiring any intervention from humans or more advanced LLMs. The key to our data sampling technique lies in the enhancement of diversity in the chosen subsets, as the model selects new data points most distinct from any existing ones according to its current embedding space.
Extensive experiments across three datasets and benchmarks demonstrate the effectiveness of DIVERSEEVOL. Our models, trained on less than 8% of the original dataset, maintain or improve performance compared with finetuning on full data. We also provide empirical evidence to analyze the importance of diversity in instruction data and the iterative scheme as opposed to one-time sampling. Our code is publicly available at https://github.com/OFA-Sys/DiverseEvol.git. ",https://arxiv.org/pdf/2311.08182,2023,Arxiv,,, +V-STaR: Training Verifiers for Self-Taught Reasoners ,"The paper proposes V-STaR, which uses both correct and incorrect self-generated solutions to train a verifier, enhancing LLMs' self-improvement. ",提出V-STaR方法,利用自提升过程中正误解训练验证器选解,迭代优化推理与验证能力。 ,Agent Evolution,Methodology,"Common self-improvement approaches for large language models (LLMs), such as STaR (Zelikman et al., 2022), iteratively fine-tune LLMs on self-generated solutions to improve their problem-solving ability. However, these approaches discard the large amount of incorrect solutions generated during this process, potentially neglecting valuable information in such solutions. To address this shortcoming, we propose V-STaR, which utilizes both the correct and incorrect solutions generated during the self-improvement process to train a verifier using DPO that judges the correctness of model-generated solutions. This verifier is used at inference time to select one solution among many candidate solutions. Running V-STaR for multiple iterations results in progressively better reasoners and verifiers, delivering a 4% to 17% test accuracy improvement over existing self-improvement and verification approaches on common code generation and math reasoning benchmarks with LLaMA2 models. ",https://openreview.net/pdf?id=stmqBSW2dV,2024,COLM,,, +STaR: Self-Taught Reasoner Bootstrapping Reasoning With Reasoning ,"The paper proposes STaR, a technique that iteratively uses a few rationales and rationale-free data to bootstrap complex reasoning, enabling self-improvement. ",提出“Self-Taught Reasoner”(STaR)技术,用少量推理示例和无推理数据集迭代提升推理能力。 ,Agent Evolution,Methodology,"Generating step-by-step ""chain-of-thought"" rationales improves language model performance on complex reasoning tasks like mathematics or commonsense question-answering. However, inducing language model rationale generation currently requires either constructing massive rationale datasets or sacrificing accuracy by using only few-shot inference. We propose a technique to iteratively leverage a small number of rationale examples and a large dataset without rationales, to bootstrap the ability to perform successively more complex reasoning. This technique, the ""Self-Taught Reasoner"" (STaR), relies on a simple loop: generate rationales to answer many questions, prompted with a few rationale examples; if the generated answers are wrong, try again to generate a rationale given the correct answer; fine-tune on all the rationales that ultimately yielded correct answers; repeat. We show that STaR significantly improves performance on multiple datasets compared to a model fine-tuned to directly predict final answers, and performs comparably to fine-tuning a 30× larger state-of-the-art language model on CommonsenseQA.
Thus, STaR lets a model improve itself by learning from its own generated reasoning. ",https://openreview.net/pdf?id=_3ELRdg2sgI,2022,NeurIPS,,, +SELFEVOLVE: A Code Evolution Framework via Large Language Models ,"The paper proposes SELFEVOLVE, a two-step pipeline using LLMs as knowledge providers and self-reflective programmers, showing high adaptability and effectiveness. ",提出SELFEVOLVE两阶段框架,利用大模型获知识、生成及调试代码,各环节优于对比方法且可扩展 ,Agent Evolution,Methodology,"Large language models (LLMs) have already revolutionized code generation, after being pretrained on publicly available code data. However, while various methods have been proposed to augment LLMs with retrieved knowledge and enhance the quality of code generation, the performance of these retrieval-based methods is limited by the strength of the retrievers used. In addition, while LLMs show great emergent ability, they still struggle to produce the correct code in one turn. To address these challenges, we propose a novel two-step pipeline, called SELFEVOLVE, that leverages LLMs as both knowledge providers and self-reflective programmers. Unlike retrieval-based methods, SELFEVOLVE obtains the knowledge from input prompts and generates intermediate code based on the generated knowledge. After that, SELFEVOLVE asks the LLM to act as an expert programmer to perform debugging for the generated code. This is achieved by receiving the error message from the interpreter, without requiring special test cases for correctness verification. We evaluate SELFEVOLVE on three code generation datasets, including DS-1000 for data science code, HumanEval for software engineering code, and TransCoder for C++-to-Python translation. Our empirical experiments show that SELFEVOLVE outperforms strong baselines by a significant margin on all datasets. We also conduct exhaustive analytical experiments to validate the effectiveness of the two stages of SELFEVOLVE, and find that both are superior to other prompting-based methods. Further scalability analysis demonstrates that SELFEVOLVE can be adapted to other more advanced models, such as GPT-4, and bring consistent efficacy improvements. ",https://arxiv.org/pdf/2306.02907,2023,Arxiv,,, +KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents ,"The paper introduces KNOWAGENT, enhancing LLM planning by adding action knowledge and constraining paths to mitigate hallucination. ",论文提出 KNOWAGENT 方法,借行动知识库与自学习策略增强 LLM 规划能力,缓解规划幻觉。 ,Agent Evolution,Methodology,"Large Language Models (LLMs) have demonstrated great potential in complex reasoning tasks, yet they fall short when tackling more sophisticated challenges, especially when interacting with environments through generating executable actions. This inadequacy primarily stems from the lack of built-in action knowledge in language agents, which fails to effectively guide the planning trajectories during task solving and results in planning hallucination. To address this issue, we introduce KNOWAGENT, a novel approach designed to enhance the planning capabilities of LLMs by incorporating explicit action knowledge. Specifically, KNOWAGENT employs an action knowledge base and a knowledgeable self-learning strategy to constrain the action path during planning, enabling more reasonable trajectory synthesis, and thereby enhancing the planning performance of language agents.
Experimental results on HotpotQA and ALFWorld based on various backbone models demonstrate that KNOWAGENT can achieve comparable or superior performance to existing baselines. Further analysis indicates the effectiveness of KNOWAGENT in mitigating planning hallucinations. ",https://arxiv.org/pdf/2403.03101,2025,NAACL,,, +RLCD: Reinforcement learning from contrastive distillation for LM alignment,"The paper proposes RLCD, a method to align LMs to natural-language principles without human feedback, using contrastive distillation and preference models. ",提出RLCD方法,无需人工反馈让语言模型遵循自然语言原则,用偏好对训练模型提升效果。 ,Agent Evolution,Methodology,"We propose Reinforcement Learning from Contrastive Distillation (RLCD), a method for aligning language models to follow principles expressed in natural language (e.g., to be more harmless) without using human feedback. RLCD creates preference pairs from two contrasting model outputs, one using a positive prompt designed to encourage following the given principles, and one using a negative prompt designed to encourage violating them. Using two different prompts causes model outputs to be more differentiated on average, resulting in cleaner preference labels in the absence of human annotations. We then use the preference pairs to train a preference model, which is in turn used to improve a base unaligned language model via reinforcement learning. Empirically, RLCD outperforms RLAIF (Bai et al., 2022b) and context distillation (Huang et al., 2022) baselines across three diverse alignment tasks—harmlessness, helpfulness, and story outline generation—and when using both 7B and 30B model scales for simulating preference data. ",https://openreview.net/pdf?id=v3XXtxWKi6,2024,ICLR,,, +LANGUAGE MODEL SELF-IMPROVEMENT BY REINFORCEMENT LEARNING CONTEMPLATION ,"This paper presents a novel LMSI method, RLC, leveraging the evaluation-generation gap. It improves model performance without external supervision and has broad applicability. ",提出新颖的语言模型自我改进方法RLC,利用评估与生成差距提升性能,适用于不同规模模型。 ,Agent Evolution,Methodology,"Language model self-improvement (LMSI) techniques have recently gained significant attention as they improve language models without requiring external supervision. A common approach is reinforcement learning from AI feedback (RLAIF), which trains a reward model based on AI preference data and employs a reinforcement learning algorithm to train the language model. However, RLAIF relies on the heuristic assumption that an AI model can provide effective feedback and correct wrong answers, requiring a solid capability of the language model. This paper presents a novel LMSI method, Reinforcement Learning Contemplation (RLC). We disclose that it is simpler for language models to evaluate a sentence than to generate it, even for small language models. Leveraging the gap between evaluation and generation, RLC evaluates generated answers and updates language model parameters using reinforcement learning to maximize evaluation scores. Through testing on various challenging reasoning tasks and a text summarization task, our experiments show that RLC effectively improves language model performance without external supervision, resulting in an answering accuracy increase (31.23% → 37.09%) for BigBench-hard reasoning tasks, and a rise in BERTScore for CNN/Daily Mail summarization tasks. Furthermore, RLC can be applied to models of different sizes, showcasing its broad applicability.
",https://openreview.net/pdf?id=38E4yUbrgr,2024,ICLR,,, +SELF-INSTRUCT: Aligning Language Models with Self-Generated Instructions ,"The paper introduces SELF - INSTRUCT, an almost annotation - free framework to enhance models' instruction - following via self - generation, releasing a synthetic dataset. ",提出 SELF - INSTRUCT 框架,以模型自生成数据微调提升指令跟随能力,近乎免注释,发布合成数据集。 ,Agent Evolution,Methodology,"Large “instruction-tuned” language models (i.e., finetuned to respond to instructions) have demonstrated a remarkable ability to general- ize zero-shot to new tasks. Nevertheless, they depend heavily on human-written instruction data that is often limited in quantity, diversity, and creativity, therefore hindering the general- ity of the tuned model. We introduce SELF- INSTRUCT, a framework for improving the instruction-following capabilities of pretrained language models by bootstrapping off their own generations. Our pipeline generates in- structions, input, and output samples from a language model, then filters invalid or similar ones before using them to finetune the original model. Applying our method to the vanilla GPT3, we demonstrate a 33% absolute im- provement over the original model on SUPER- NATURALINSTRUCTIONS, on par with the performance of InstructGPT001 ,1 which was trained with private user data and human anno- tations. For further evaluation, we curate a set of expert-written instructions for novel tasks, and show through human evaluation that tun- ing GPT3 with SELF-INSTRUCT outperforms using existing public instruction datasets by a large margin, leaving only a 5% absolute gap behind InstructGPT001. SELF-INSTRUCT provides an almost annotation-free method for aligning pretrained language models with in- structions, and we release our large synthetic dataset to facilitate future studies on instruction tuning.2 ",https://aclanthology.org/2023.acl-long.754.pdf,2023,ACL,,, +Large Language Models are Better Reasoners with Self-Verification ,"The paper proposes that LLMs have self - verification abilities. It uses backward verification to select answers, improving reasoning performance on various datasets. ",提出并证明大语言模型有自我验证能力,通过反向验证选答案,可提升推理性能,代码开源。 ,Agent Evolution,Methodology,"Recently, with the chain of thought (CoT) prompting, large language models (LLMs), e.g., GPT-3, have shown strong reasoning abil- ity in several natural language processing tasks such as arithmetic, commonsense, and logical reasoning. However, LLMs with CoT require multi-step prompting and multi-token predic- tion, which is highly sensitive to individual mistakes and vulnerable to error accumulation. The above issues make the LLMs need the abil- ity to verify the answers. In fact, after inferring conclusions in some thinking decision tasks, people often check them by re-verifying steps to avoid some mistakes. In this paper, we pro- pose and prove that LLMs also have similar self-verification abilities. We take the conclu- sion obtained by CoT as one of the conditions for solving the original problem. By perform- ing a backward verification of the answers that LLM deduced for itself, we can obtain inter- pretable answer validation scores to select the candidate answer with the highest score. Exper- imental results demonstrate that the proposed method can improve the reasoning performance on various arithmetic, commonsense, and log- ical reasoning datasets. Our code is publicly available at: https://github.com/WENGSYX/ Self-Verification. 
",https://aclanthology.org/2023.findings-emnlp.167.pdf,2023,EMNLP,,, +CODET: CODE GENERATION WITH GENERATED TESTS ,"The paper proposes CODET, a method using pre - trained LMs to auto - generate test cases for code samples, enhancing solution selection efficiency. ",提出CODET方法,借预训练模型自动生成测试用例,选代码方案,实验显示其显著提升性能。 ,Agent Evolution,Methodology,"The task of generating code solutions for a given programming problem can bene- fit from the use of pre-trained language models such as Codex, which can produce multiple diverse samples. However, a major challenge for this task is to select the most appropriate solution from the multiple samples generated by the pre- trained language models. A natural way to evaluate the quality and correctness of a code solution is to run it against a set of test cases, but the manual creation of such test cases is often costly and time-consuming. In this paper, we propose a novel method, CODET, that leverages the same pre-trained language models to automatically generate test cases for the code samples, thus reducing the human effort and increasing the coverage of the test scenarios. CODET then executes the code samples using the generated test cases and performs a dual execution agree- ment, which considers both the consistency of the outputs against the generated test cases and the agreement of the outputs with other code samples. We con- duct comprehensive experiments on four benchmarks, HumanEval, MBPP, APPS, and CodeContests, using five different pre-trained language models with varying sizes and capabilities. Our results show that CODET can significantly improve the performance of code solution selection over previous methods, achieving remark- able and consistent gains across different models and benchmarks. For instance, CODET improves the pass@1 metric on HumanEval to 65.8%, which represents an absolute improvement of 18.8% over the code-davinci-002 model, and an ab- solute improvement of more than 20% over the previous state-of-the-art results. ",https://openreview.net/pdf?id=ktrw68Cmu9c,2023,ICLR,,, +ProAgent: Building Proactive Cooperative Agents with Large Language Models ,"This paper proposes ProAgent, a framework using LLMs to create proactive agents adapting behavior for better cooperation, with high modularity and interpretability. ",提出ProAgent框架,利用大模型创建主动代理,可动态适应队友行为,模块化与可解释性强。 ,Agent Evolution,Methodology,"Building agents with adaptive behavior in cooperative tasks stands as a paramount goal in the realm of multi-agent sys- tems. Current approaches to developing cooperative agents rely primarily on learning-based methods, whose policy gen- eralization depends heavily on the diversity of teammates they interact with during the training phase. Such reliance, however, constrains the agents’ capacity for strategic adap- tation when cooperating with unfamiliar teammates, which becomes a significant challenge in zero-shot coordination scenarios. To address this challenge, we propose ProAgent, a novel framework that harnesses large language models (LLMs) to create proactive agents capable of dynamically adapting their behavior to enhance cooperation with team- mates. ProAgent can analyze the present state, and infer the intentions of teammates from observations. It then up- dates its beliefs in alignment with the teammates’ subse- quent actual behaviors. Moreover, ProAgent exhibits a high degree of modularity and interpretability, making it easily in- tegrated into various of coordination scenarios. 
Experimental evaluations conducted within the Overcooked-AI environment unveil the remarkable performance superiority of ProAgent, outperforming five methods based on self-play and population-based training when cooperating with AI agents. Furthermore, when partnered with human proxy models, its performance exhibits an average improvement exceeding 10% compared to the current state-of-the-art method. For more information about our project, please visit https://pku-proagent.github.io. ",https://ojs.aaai.org/index.php/AAAI/article/view/29710/31219,2024,AAAI,,, +Evolving Diverse Red-team Language Models in Multi-round Multi-agent Games ,The paper introduces dynamic Red Team Game and Gamified Red Team Solver to mitigate mode collapse and enable diverse attacks for LLM safety. ,提出动态红队游戏,开发GRTS缓解模式崩溃,揭示红队任务几何结构,为大模型安全检测铺路 ,Agent Evolution,Methodology,"The primary challenge in deploying Large Language Models (LLMs) is ensuring their harmlessness. Red teaming can identify vulnerabilities by attacking an LLM to attain safety. However, current efforts heavily rely on single-round prompt designs and unilateral red team optimizations against fixed blue teams. These static approaches lead to significant reductions in generation diversity, known as mode collapse, which makes it difficult to discover the potential risks in the increasingly complex human-LLM interactions. Here we introduce dynamic Red Team Game +(RTG) to comprehensively analyze the multi-round offensive and defensive interactions between the red team and the blue team. Furthermore, we develop a Gamified Red Team Solver (GRTS) with diversity measures to mitigate mode collapse and theoretically guarantee the convergence to an approximate Nash equilibrium, which results in better strategies for both teams. Empirical results demonstrate that GRTS explores diverse and implicit attacks to adaptively exploit various LLMs, surpassing the constraints of specific modes. Insightfully, the geometrical structure of the red team task that we unveil aligns with the spinning top hypothesis, confirming the necessity of constructing a diverse LLM population as a promising proxy for heterogeneous human expert red-teamers. This paves the way for scalable toxicity detection and safe alignment for LLMs. ",https://arxiv.org/pdf/2310.00322,2023,Arxiv,,, +Agent Planning with World Knowledge Model ,"The paper introduces a parametric World Knowledge Model (WKM) for agent planning, synthesizing knowledge from trajectories to guide global & local planning. ",论文引入参数化世界知识模型(WKM)辅助智能体规划,融合专家与采样轨迹知识,提升规划效果。 ,Agent Evolution,Methodology,"Recent endeavors towards directly using large language models (LLMs) as agent models to execute interactive planning tasks have shown commendable results. Despite their achievements, however, they still struggle with brainless trial-and-error in global planning and generating hallucinatory actions in local planning due to their poor understanding of the “real” physical world. Imitating humans’ mental world knowledge model, which provides global prior knowledge before the task and maintains local dynamic knowledge during the task, in this paper we introduce a parametric World Knowledge Model (WKM) to facilitate agent planning. Concretely, we steer the agent model to self-synthesize knowledge from both expert and sampled trajectories. Then we develop WKM, providing prior task knowledge to guide the global planning and dynamic state knowledge to assist the local planning.
Experimental results on three complex real-world simulated datasets with three state-of-the-art open-source LLMs, Mistral-7B, Gemma-7B, and Llama-3-8B, demonstrate that our method can achieve superior performance compared to various strong baselines. Other interesting findings include: 1) our instance-level task knowledge can generalize better to unseen tasks, 2) weak WKM can guide strong agent model planning, and 3) unified WKM training has promising potential for further development. ",https://openreview.net/pdf?id=j6kJSS9O6I,2024,NeurIPS,,, +Refining Guideline Knowledge for Agent Planning Using Textgrad ,"This paper introduces Textgrad to optimize guideline knowledge for LLM-based agents' embodied tasks, using text gradients and loss analysis. ",该文引入 Textgrad 优化基于大模型的智能体执行具身任务时的指南知识,通过计算梯度和分析损失自动优化。 ,Agent Evolution,Methodology,"Guideline Knowledge is helpful for LLM (Large Language Model) based agents in embodied task planning. In this work, we introduce Textgrad to optimize the Guideline Knowledge for the agent’s embodied tasks. This allows the model to automatically optimize the Guideline Knowledge by calculating the text gradients in the Guideline Knowledge and analyzing the loss in failed trajectories. ",https://www.computer.org/csdl/proceedings-article/ickg/2024/088200a102/24sKrMSCxr2,2024,ICKG,,, +Improving Factuality and Reasoning in Language Models through Multiagent Debate,"The paper proposes a multiagent debate approach for LLMs, enhancing reasoning and factuality, applicable to black-box models with a unified procedure. ",提出多智能体辩论法提升语言模型回复,增强推理能力、内容事实性,可用于黑箱模型。 ,Agent Evolution,Methodology,"Large language models (LLMs) have demonstrated remarkable capabilities in language generation, understanding, and few-shot learning in recent years. An extensive body of work has explored how their performance may be further improved through the tools of prompting, ranging from verification, self-consistency, or intermediate scratchpads. In this paper, we present a complementary approach to improve language responses where multiple language model instances propose and debate their individual responses and reasoning processes over multiple rounds to arrive at a common final answer. Our findings indicate that this approach significantly enhances mathematical and strategic reasoning across a number of tasks. We also demonstrate that our approach improves the factual validity of generated content, reducing fallacious answers and hallucinations that contemporary models are prone to. Our approach may be directly applied to existing black-box models and uses an identical procedure and prompts for all tasks we investigate. Overall, our findings suggest that such a ""society of minds"" approach has the potential to significantly advance the capabilities of LLMs and pave the way for further breakthroughs in language generation and understanding.",https://arxiv.org/abs/2305.14325,2023,Arxiv,,,🚫重复 +Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate,"This paper proposes a Multi-Agent Debate (MAD) framework to solve LLMs' Degeneration-of-Thought problem, encouraging divergent thinking for complex tasks. ",提出多智能体辩论(MAD)框架解决大语言模型思维退化问题,鼓励发散思维获有效结果。 ,Agent Evolution,Methodology,"Modern large language models (LLMs) like ChatGPT have shown remarkable performance on general language tasks but still struggle on complex reasoning tasks, which drives the research on cognitive behaviors of LLMs to explore human-like problem-solving strategies.
Along this direction, one representative strategy is self-reflection, which asks an LLM to iteratively refine its solution with feedback generated by itself. However, our study shows that such reflection-style methods suffer from the Degeneration-of-Thought (DoT) problem: once the LLM has established confidence in its solutions, it is unable to generate novel thoughts later through reflection even if its initial stance is incorrect. To address the DoT problem, we propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in the state of ""tit for tat"" and a judge manages the debate process to obtain a final solution. Clearly, our MAD framework encourages divergent thinking in LLMs, which would be helpful for tasks that require deep levels of contemplation. Experiment results on two challenging datasets, commonsense machine translation and counter-intuitive arithmetic reasoning, demonstrate the effectiveness of our MAD framework. Extensive analyses suggest that the adaptive break of debate and the modest level of the ""tit for tat"" state are required for MAD to obtain good performance. Moreover, we find that LLMs might not be a fair judge if different LLMs are used for agents. Code is available at this https URL.",https://arxiv.org/abs/2305.19118,2024,Arxiv,,,🚫重复 +"CAMEL: Communicative Agents for ""Mind"" Exploration of Large Language Model Society","This paper proposes a role-playing framework for autonomous agent cooperation, offers a scalable study approach and open-sources a library. ",论文提出角色扮演通信代理框架,提供研究多智能体合作的可扩展方法,并开源库支持相关研究。 ,Agent Evolution,Methodology,"The rapid advancement of chat-based language models has led to remarkable progress in complex task-solving. However, their success heavily relies on human input to guide the conversation, which can be challenging and time-consuming. This paper explores the potential of building scalable techniques to facilitate autonomous cooperation among communicative agents, and provides insight into their ""cognitive"" processes. To address the challenges of achieving autonomous cooperation, we propose a novel communicative agent framework named role-playing. Our approach involves using inception prompting to guide chat agents toward task completion while maintaining consistency with human intentions. We showcase how role-playing can be used to generate conversational data for studying the behaviors and capabilities of a society of agents, providing a valuable resource for investigating conversational language models. In particular, we conduct comprehensive studies on instruction-following cooperation in multi-agent settings. Our contributions include introducing a novel communicative agent framework, offering a scalable approach for studying the cooperative behaviors and capabilities of multi-agent systems, and open-sourcing our library to support research on communicative agents and beyond: this https URL.",https://arxiv.org/pdf/2303.17760,2023,NeurIPS,,,🚫重复 +LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error,"The paper proposes STE, a biologically inspired method for tool-augmented LLMs, using trial and error, imagination and memory to improve tool use accuracy. ",提出模拟试错(STE)法,借鉴生物系统机制,提升大模型工具学习能力,还实现工具持续学习。 ,Agent Evolution,Methodology,"Tools are essential for large language models (LLMs) to acquire up-to-date information and take consequential actions in external environments.
Existing work on tool-augmented LLMs primarily focuses on the broad coverage of tools and the flexibility of adding new tools. However, a critical aspect that has surprisingly been understudied is simply how accurately an LLM uses tools for which it has been trained. We find that existing LLMs, including GPT-4 and open-source LLMs specifically fine-tuned for tool use, only reach a correctness rate in the range of 30% to 60%, far from reliable use in practice. We propose a biologically inspired method for tool-augmented LLMs, simulated trial and error (STE), that orchestrates three key mechanisms for successful tool use behaviors in the biological system: trial and error, imagination, and memory. Specifically, STE leverages an LLM’s ‘imagination’ to simulate plausible scenarios for using a tool, after which the LLM interacts with the tool to learn from its execution feedback. Both short-term and long-term memory are employed to improve the depth and breadth of the exploration, respectively. Comprehensive experiments on ToolBench show that STE substantially improves tool learning for LLMs under both in-context learning and fine-tuning settings, bringing a boost of 46.7% to Mistral-Instruct-7B and enabling it to outperform GPT-4. We also show effective continual learning of tools via a simple experience replay strategy.",https://aclanthology.org/2024.acl-long.570/,2024,*ACL,,,🚫重复 \ No newline at end of file