Kunal Pai committed · commit 1d27ba2
1 Parent(s): c2ad08f
Add HASHIRU paper: Initial implementation of a hierarchical multi-agent system for hybrid intelligent resource utilization
- Introduced the main paper (conference_101719.tex) detailing the HASHIRU framework, its architecture, and core mechanisms.
- Added a PDF version of the paper (conference_101719.pdf).
- Created a references file (references.bib) with citations relevant to the research and framework.
- paper/HASHIRU.pdf +0 -0
- paper/IEEEtran.cls +0 -0
- paper/conference_101719.tex +248 -0
- paper/references.bib +368 -0
paper/HASHIRU.pdf
ADDED
Binary file (74.4 kB)
paper/IEEEtran.cls
ADDED
The diff for this file is too large to render.
paper/conference_101719.tex
ADDED
@@ -0,0 +1,248 @@
\documentclass[conference]{IEEEtran}
\IEEEoverridecommandlockouts
% The preceding line is only needed to identify funding in the first footnote. If that is unneeded, please comment it out.
\usepackage{cite}
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{algorithmic}
\usepackage{graphicx}
\usepackage{textcomp}
\usepackage{xcolor}
\usepackage{hyperref}
\def\BibTeX{{\rm B\kern-.05em{\sc i\kern-.025em b}\kern-.08em
    T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}
\begin{document}

\title{HASHIRU: Hierarchical Agent System for Hybrid Intelligent Resource Utilization}

\author{\IEEEauthorblockN{Kunal Pai}
\IEEEauthorblockA{\textit{UC Davis} \\
kunpai@ucdavis.edu}
\and
\IEEEauthorblockN{Parth Shah}
\IEEEauthorblockA{\textit{Independent Researcher} \\
helloparthshah@gmail.com}
\and
\IEEEauthorblockN{Harshil Patel}
\IEEEauthorblockA{\textit{UC Davis} \\
hpppatel@ucdavis.edu}
\and
\IEEEauthorblockN{Saisha Shetty}
\IEEEauthorblockA{\textit{UC Davis} \\
spshetty@ucdavis.edu}
}

\maketitle

\section{Introduction}\label{sec:introduction}

The landscape of Artificial Intelligence (AI) is being reshaped by the rapid advancements in Large Language Models (LLMs), which exhibit profound capabilities in language understanding, generation, reasoning, and planning \cite{brown2020language, devlin2019bert, raffel2020exploring}. This progress has catalyzed the development of sophisticated AI agents capable of autonomous task execution. Increasingly, the focus is shifting from single-agent systems to Multi-Agent Systems (MAS), where collaborative teams of specialized agents address complex problems beyond the scope of individual agents \cite{dorri2018multi, wooldridge2009introduction}. Such collaborative approaches hold significant potential in diverse domains like scientific discovery \cite{boiko2023emergent}, software engineering \cite{qian2023communicative}, data analysis, and strategic decision-making \cite{wang2023decision}. The growing complexity of tasks, highlighted by benchmarks demanding advanced mathematical reasoning (e.g., GSM8K \cite{cobbe2021gsm8k}, SVAMP \cite{patel2021svamp}), coding (e.g., HumanEval \cite{chen2021codex}, CoDocBench \cite{pai2024codocbench}), and graduate-level technical knowledge and reasoning \cite{phan2025humanitysexam}, further underscores the need for agentic systems capable of effectively coordinating diverse cognitive resources \cite{wen2024benchmarkingcomplexinstructionfollowingmultiple}.

Despite this promise, contemporary agentic frameworks often encounter significant limitations. Many suffer from \textbf{rigidity}, relying on predefined agent roles and static team structures that hinder adaptation to dynamic task requirements \cite{zhang2023building}. Furthermore, \textbf{resource obliviousness} is prevalent; systems frequently lack mechanisms to monitor and optimize computational resources like API costs, memory usage, and CPU load, leading to inefficiency, particularly when scaling or deploying in resource-constrained environments \cite{park2023generative}. This is often exacerbated by a reliance on powerful, proprietary cloud-based LLMs, incurring substantial operational expenses. \textbf{Model homogeneity}, the default use of a single powerful LLM for all sub-tasks, neglects the potential efficiency gains from employing a diverse ecosystem of models, including smaller, specialized, or locally-run alternatives \cite{zhou2023agents}. Lastly, while \textbf{tool use} is fundamental \cite{yao2022react, parisi2022talm}, the ability for agents to autonomously \textbf{create and integrate new tools} during operation remains limited, restricting dynamic functional extension and long-term self-improvement without human intervention \cite{wang2023voyager}.

To address these challenges, we introduce \textbf{HASHIRU (Hierarchical Agent System for Hybrid Intelligent Resource Utilization)}, a novel MAS framework designed for enhanced flexibility, resource efficiency, and adaptability. HASHIRU employs a hierarchical structure led by a central ``CEO'' agent that dynamically manages a team of specialized ``employee'' agents, instantiated on demand for specific sub-tasks. A core tenet of HASHIRU is its \textbf{hybrid intelligence} approach, strategically prioritizing smaller (e.g., 3B--7B parameter), locally-run LLMs, often accessed via frameworks like Ollama \cite{ollama}, to promote cost-effectiveness and computational efficiency. While prioritizing local resources, the system retains the flexibility to integrate external APIs and potentially more powerful models when justified by task complexity and resource availability, under the CEO's management.

The primary contributions of this work are:
\begin{enumerate}
    \item A novel MAS architecture combining a \textbf{hierarchical control structure} with \textbf{dynamic, resource-aware agent lifecycle management} (hiring/firing). This management is explicitly governed by computational budget constraints (cost, memory usage, concurrency) and incorporates an economic model with hiring/firing costs to discourage excessive churn.
    \item A \textbf{hybrid intelligence model} that prioritizes cost-effective, local LLMs while adaptively incorporating external APIs and potentially larger models, optimizing the efficiency-capability trade-off.
    \item An integrated mechanism enabling the \textbf{autonomous creation of API tools}, allowing the system to dynamically extend its functional repertoire in response to task demands.
    \item The application of an \textbf{economic model} (hiring/firing fees) to agent management, promoting efficient resource allocation and team stability.
\end{enumerate}

This paper details the design and rationale behind HASHIRU. Section \ref{sec:background} discusses related work in agent architectures, dynamic management, resource allocation, model heterogeneity, and tool use. Section \ref{sec:architecture} elaborates on the HASHIRU architecture and its core mechanisms, and Section \ref{sec:experiments} describes the experimental setup designed to evaluate the framework.

\section{Background and Related Work} \label{sec:background}

The concept of intelligent agents has evolved significantly from early work in symbolic AI and distributed problem-solving \cite{russell2010artificial, shoham1994agent} to the current era dominated by LLMs. Modern agentic frameworks leverage LLMs as their cognitive core, enabling sophisticated reasoning, planning, and interaction capabilities \cite{wang2023survey, xi2023rise}. HASHIRU builds upon this foundation while addressing specific limitations observed in the current state-of-the-art.

\subsection{Agent Architectures: Hierarchy and Dynamics}
MAS architectures vary widely, including flat, federated, and hierarchical structures \cite{dorri2018multi, horling2004survey}. Hierarchical models offer clear control flow and efficient task decomposition but risk bottlenecks and rigidity \cite{gaston2005agenta,gaston2005agentb}. HASHIRU utilizes a \textbf{CEO-Employee hierarchy} for centralized coordination but distinguishes itself through \textbf{dynamic team composition}. Unlike systems with static hierarchies or predefined roles (e.g., CrewAI \cite{crewai}, ChatDev \cite{qian2023communicative}), HASHIRU's CEO actively manages the employee pool based on runtime needs and resource constraints.

\subsection{Dynamic Agent Lifecycle Management}
The ability of an MAS to adapt its composition dynamically is crucial for complex environments \cite{valckenaers2005trends}. Various triggers for agent creation or deletion have been explored, often tied to task structure or environmental changes. HASHIRU introduces a specific mechanism where the CEO agent makes \textbf{hiring and firing decisions} based on a cost-benefit analysis considering agent performance, operational costs (API fees, inferred compute), memory footprint (tracked explicitly as a percentage of available resources), and agent concurrency limits. Furthermore, HASHIRU incorporates an \textbf{economic model} with explicit ``starting bonus'' (hiring) and ``invocation'' (usage) costs. This introduces economic friction, aiming to prevent excessive agent initialization or usage for marginal gains and promote team stability, a nuance often missing in simpler dynamic composition strategies.

\subsection{Resource Management and Agent Economies}
Resource awareness is critical for scalable and deployable MAS. Prior work on market-based control explores mechanisms such as auctions and contract nets for allocating resources among agents \cite{clearwater1996market}. HASHIRU implements a more \textbf{centralized, budget-constrained resource management model}. The CEO operates within defined limits for financial cost, memory usage (as a percentage of total available memory), and concurrent agent count. This direct management, particularly the focus on memory percentage, suggests an orientation towards practical deployment, potentially on local hardware or edge devices with finite resources, contrasting with cloud-centric systems assuming elastic resources \cite{park2023generative}. Frameworks like AutoGen \cite{wu2023autogen} and LangGraph \cite{langgraph} typically rely on implicit cost tracking via API keys without such explicit multi-dimensional resource budgeting and control.

\subsection{Hybrid Intelligence and Heterogeneous Models}
Leveraging diverse LLMs with varying capabilities, costs, and latencies is an emerging trend \cite{zhou2023agents}. Techniques like model routing aim to select the optimal model for specific sub-tasks. HASHIRU embraces \textbf{model heterogeneity} with a specific strategic focus: \textbf{prioritizing smaller (3B--7B), locally-run models via Ollama integration} \cite{ollama}. This emphasizes cost-efficiency, low latency for simpler tasks, and potential privacy advantages over systems defaulting to large, proprietary cloud APIs (e.g., GPT-4 \cite{openai2023gpt4}, Claude 3 \cite{anthropic2024claude}). While capable of integrating external APIs (potentially invoking larger models), HASHIRU's default operational stance represents a distinct balance point in the capability vs. efficiency trade-off.

\subsection{Tool Use and Autonomous Tool Creation}
The ability to use external tools (APIs, functions, databases) is a cornerstone of modern agents, enabled by frameworks like ReAct \cite{yao2022react} and built-in function calling \cite{openai_func_calling}. Most systems rely on a predefined tool suite. HASHIRU advances this by incorporating a mechanism for \textbf{autonomous API tool creation}. When a required functionality is missing, the CEO can commission the generation (potentially via a specialized agent or code generation process) and deployment of a new API tool within the HASHIRU ecosystem. This capability for self-extension differentiates HASHIRU from systems limited to static toolsets and moves towards greater operational autonomy and adaptability \cite{wang2023voyager, park2023generative}.

In summary, HASHIRU integrates concepts from hierarchical control, dynamic MAS, resource management, and tool use, but its novelty lies in the synergistic combination of: (1) dynamic, resource-aware hierarchical management with (2) an economic model for stability, (3) a local-first hybrid intelligence strategy, and (4) integrated autonomous tool creation capabilities. This combination targets key limitations in current agentic systems concerning efficiency, adaptability, cost, and autonomy.

\section{HASHIRU System Architecture}
\label{sec:architecture}

The architecture of HASHIRU is designed to directly address the challenges of rigidity, resource obliviousness, and limited adaptability outlined in Section~\ref{sec:introduction}. It implements a hierarchical, dynamically managed multi-agent system optimized for hybrid resource utilization. This section details the core components and mechanisms underpinning HASHIRU's operation.

\subsection{Overview}
HASHIRU operates on a hierarchical model conceptually similar to a business organization, featuring a central coordinating agent (``CEO'') and specialized task-executing agents (``Employees''). Key architectural tenets include:
\begin{itemize}
    \item \textbf{Centralized Coordination within a Dynamic Hierarchy:} A CEO agent manages overall strategy, task allocation, and team composition.
    \item \textbf{Dynamic Lifecycle Management:} Employee agents are instantiated (hired) and terminated (fired) based on runtime requirements and resource constraints, governed by an economic model.
    \item \textbf{Hybrid Intelligence:} Strategic preference for local, computationally cheaper LLMs, while retaining access to external APIs and potentially more powerful models.
    \item \textbf{Explicit Resource Management:} Continuous monitoring and control of costs, memory usage, and agent concurrency against defined budgets.
    \item \textbf{Adaptive Tooling:} Utilization of predefined tools alongside the capability for autonomous creation of new API tools.
\end{itemize}
Figure \ref{fig:arch} illustrates the overall structure and interaction flow.

\begin{figure}[ht]
\centering
\includegraphics[width=0.45\textwidth]{HASHIRU.pdf}
\caption{High-level architecture of the HASHIRU system, illustrating the CEO-Employee hierarchy.}
\label{fig:arch}
\end{figure}

\subsection{Hierarchical Structure: CEO and Employee Agents}
The system employs a two-tiered hierarchy:

\begin{itemize}
    \item \textbf{CEO Agent:} This singleton agent serves as the central coordinator and entry point. Its primary responsibilities include:
    \begin{itemize}
        \item Receiving and interpreting the primary user query or task.
        \item Decomposing the main task into smaller, manageable sub-tasks.
        \item Identifying the capabilities required for each sub-task.
        \item Managing the pool of Employee agents (see Section \ref{subsec:dynamic_mgmt}).
        \item Assigning sub-tasks to suitable, active Employee agents.
        \item Monitoring the progress and performance of Employee agents.
        \item Synthesizing the results from Employee agents into a coherent final output or response.
        \item Managing the system's overall resource budget (see Section \ref{subsec:resource_mgmt}).
        \item Initiating the creation of new tools when required (see Section \ref{subsec:tooling}).
    \end{itemize}
    \item \textbf{Employee Agents:} These are specialized agents instantiated by the CEO to perform specific sub-tasks. Each Employee agent typically wraps an LLM (local via Ollama \cite{ollama} or external via API) or provides access to a specific tool/API. Key characteristics include:
    \begin{itemize}
        \item Specialization: Possessing capabilities tailored to certain types of sub-tasks (e.g., code generation, data analysis, information retrieval).
        \item Dynamic Existence: Created and destroyed by the CEO based on need and performance.
        \item Task Execution: Receives a sub-task description and context from the CEO, executes it, and returns the result.
        \item Resource Consumption: Associated with specific resource costs (e.g., API call costs, memory footprint) tracked by the system.
    \end{itemize}
\end{itemize}
This hierarchical structure facilitates organized task decomposition and result aggregation under centralized control, while the dynamic nature of the Employee pool provides flexibility.
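
To make this division of responsibilities concrete, the following minimal sketch (illustrative Python, not the production HASHIRU code; all names are placeholders) shows how a CEO agent could decompose a task, delegate sub-tasks to capability-matched Employee agents, and collect their results:

\begin{verbatim}
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class EmployeeAgent:
    """Wraps one model or tool; executes one sub-task at a time."""
    name: str
    capability: str               # e.g. "retrieval", "code", "review"
    run: Callable[[str], str]     # sub-task description -> result
    memory_pct: float = 0.0      # share of the memory budget it occupies

@dataclass
class CEOAgent:
    """Central coordinator: decomposes, delegates, synthesizes."""
    employees: Dict[str, EmployeeAgent] = field(default_factory=dict)

    def decompose(self, task: str) -> List[Tuple[str, str]]:
        # Placeholder decomposition: (required capability, sub-task).
        return [("retrieval", f"Gather background for: {task}"),
                ("review", f"Assess the methodology of: {task}")]

    def delegate(self, task: str) -> str:
        results = []
        for capability, sub_task in self.decompose(task):
            # Assumes a matching Employee has already been hired.
            worker = next(e for e in self.employees.values()
                          if e.capability == capability)
            results.append(worker.run(sub_task))
        # In the deployed system, synthesis is itself an LLM call.
        return "\n".join(results)
\end{verbatim}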

\subsection{Dynamic Agent Lifecycle Management}
\label{subsec:dynamic_mgmt}
A core innovation in HASHIRU is the CEO's ability to dynamically manage the Employee agent team through ``hiring'' (instantiation) and ``firing'' (termination). This process is driven by a cost-benefit analysis aimed at optimizing task performance within resource constraints.

When a new sub-task requires capabilities not readily available or efficiently provided by the current pool of active Employee agents, the CEO may decide to hire a new agent. Conversely, if an agent is underperforming, consistently idle, excessively costly, or if resource limits are approached, the CEO may decide to fire it. This decision considers multiple factors:
\begin{itemize}
    \item \textbf{Task Requirements:} The specific capabilities needed for pending sub-tasks.
    \item \textbf{Agent Performance Metrics:} Historical success rate, quality of output, or efficiency of existing agents relevant to the task type.
    \item \textbf{Operational Costs:} API costs, estimated computational load, or other costs associated with using the agent's underlying model or tools.
    \item \textbf{Memory Footprint:} The amount of system memory the agent consumes, tracked as a percentage of the total available memory allocated to HASHIRU.
    \item \textbf{Agent Concurrency:} The current number of active Employee agents relative to a predefined limit.
\end{itemize}

Crucially, HASHIRU incorporates an \textbf{economic model} for agent lifecycle events:
\begin{itemize}
    \item \textbf{Hiring Cost (``Starting Bonus''):} A one-time cost incurred when a new agent is instantiated, representing setup overhead.
    \item \textbf{Invocation Cost (``Salary''):} A recurring cost incurred each time an agent is invoked, representing the computational or monetary load of using that agent.
\end{itemize}
These explicit transaction costs discourage excessive agent churn, promoting stability. The CEO must evaluate if the anticipated long-term benefit of replacing an agent outweighs the immediate hiring and firing costs plus any difference in ongoing operational costs. This mechanism directly combats system rigidity and allows adaptation while actively managing computational budgets and preventing wasteful high-frequency agent turnover.
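
As a simplified illustration of this cost-benefit analysis (the scoring inputs are assumed here; the deployed system derives them from performance history and the resource monitor), a hiring decision can be sketched as:

\begin{verbatim}
def should_hire(candidate, budget, active_agents, hiring_cost,
                invocation_cost, expected_invocations, expected_benefit):
    """Sketch of the CEO's check before instantiating a new agent."""
    # Hard caps: memory percentage and concurrency are never exceeded.
    projected_memory = (sum(a.memory_pct for a in active_agents)
                        + candidate.memory_pct)
    if projected_memory > budget.max_memory_pct:
        return False
    if len(active_agents) + 1 > budget.max_concurrent_agents:
        return False
    # Economic friction: one-time starting bonus plus per-use salary.
    projected_cost = hiring_cost + invocation_cost * expected_invocations
    if budget.spent + projected_cost > budget.max_cost:
        return False
    # Hire only if the anticipated benefit outweighs the transaction costs.
    return expected_benefit > projected_cost
\end{verbatim}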

\subsection{Hybrid Intelligence and Model Management}
HASHIRU is designed for \textbf{hybrid intelligence}, leveraging a diverse set of cognitive resources. It strategically prioritizes the use of smaller (e.g., 3B--7B parameter), cost-effective LLMs that can be run locally via frameworks like Ollama \cite{ollama}. This approach enhances efficiency, reduces reliance on expensive external APIs, and potentially improves privacy and latency for certain tasks.

However, the system is not restricted to local models. The CEO agent can integrate and utilize:
\begin{itemize}
    \item \textbf{External LLM APIs:} Access to powerful proprietary models (e.g., GPT-4 \cite{openai2023gpt4}, Claude 3 \cite{anthropic2024claude}) when deemed necessary for complex reasoning or specialized knowledge, subject to cost-benefit analysis.
    \item \textbf{External Tool APIs:} Integration with third-party software or data sources.
    \item \textbf{Self-Created APIs:} Tools generated by HASHIRU itself (see Section \ref{subsec:tooling}).
\end{itemize}
The CEO manages this heterogeneous pool, selecting the most appropriate resource (local model, external API, tool) for a given sub-task based on perceived difficulty, required capabilities, and the current resource budget status. This allows HASHIRU to balance cost-effectiveness and computational efficiency with the need for high capability when required.
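
A minimal local-first routing sketch (the difficulty estimate and thresholds below are illustrative assumptions, not the tuned policy) looks as follows:

\begin{verbatim}
def select_resource(difficulty, budget, local_models, external_models):
    """Prefer small local (Ollama-served) models; escalate to an
    external API only for hard tasks with budget headroom.
    Assumes at least one local model is installed."""
    EASY, HARD = 0.3, 0.7   # illustrative thresholds in [0, 1]
    if difficulty <= EASY and local_models:
        return ("local", local_models[0])       # cheapest viable model
    if difficulty < HARD and local_models:
        return ("local", local_models[-1])      # largest local model
    if external_models and budget.spent < 0.9 * budget.max_cost:
        return ("external", external_models[0])
    # Budget exhausted or no external API available: fall back to local.
    return ("local", local_models[-1])
\end{verbatim}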

\subsection{Resource Monitoring and Control}
\label{subsec:resource_mgmt}
Explicit resource management is central to HASHIRU's design, moving beyond simple API key cost tracking. The system, coordinated by the CEO, actively monitors:
\begin{itemize}
    \item \textbf{Financial Costs:} Accumulating costs from external API calls.
    \item \textbf{Memory Usage:} Tracking the memory footprint of active Employee agents, specifically as a percentage of a predefined total available memory resource.
    \item \textbf{Agent Concurrency:} Maintaining a count of concurrently active Employee agents.
\end{itemize}
These metrics are monitored against predefined \textbf{budget limits} or hard caps. If initiating an action (like hiring a new agent) would exceed a budget limit (e.g., push memory usage over 90\% of allocated, or exceed the maximum concurrent agent count), the action is prevented. This mechanism ensures the system operates within defined operational constraints, crucial for deployment on devices with limited resources or under strict financial budgets.
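
The budget check itself can be as simple as the following sketch (field names are illustrative), applied before any action that would allocate new resources:

\begin{verbatim}
from dataclasses import dataclass

@dataclass
class ResourceBudget:
    """Tracks the three budgeted dimensions; names are illustrative."""
    max_cost: float              # total external API spend allowed
    max_memory_pct: float        # e.g. 90.0 (% of memory given to HASHIRU)
    max_concurrent_agents: int
    spent: float = 0.0
    memory_pct: float = 0.0
    active_agents: int = 0

    def allows(self, extra_cost=0.0, extra_memory_pct=0.0,
               extra_agents=0) -> bool:
        """True only if the action keeps every dimension within budget."""
        return (self.spent + extra_cost <= self.max_cost
                and self.memory_pct + extra_memory_pct <= self.max_memory_pct
                and self.active_agents + extra_agents
                    <= self.max_concurrent_agents)

    def commit(self, cost=0.0, memory_pct=0.0, agents=0) -> None:
        self.spent += cost
        self.memory_pct += memory_pct
        self.active_agents += agents
\end{verbatim}

For example, hiring an agent whose footprint would push memory usage above the cap fails this check and is rejected before instantiation.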

\subsection{Tool Utilization and Autonomous Creation}
\label{subsec:tooling}
Like many modern agent systems, HASHIRU's agents can utilize predefined tools (functions, external APIs, databases) to interact with the environment and perform actions beyond pure text generation \cite{yao2022react, openai_func_calling}.

A distinctive feature of HASHIRU is its capability for \textbf{integrated, autonomous tool creation}. If the CEO agent determines, through task analysis or failure analysis of existing agents, that a specific functional capability is required but not available through existing Employee agents or tools, it can initiate a process to create a new tool. This typically involves:
\begin{enumerate}
    \item Defining the specification for the required tool (inputs, outputs, functionality).
    \item Commissioning the generation of the necessary logic (e.g., code implementing the functionality, potentially involving API calls to external services using provided credentials, possibly generated by a specialized code-generating Employee agent).
    \item Deploying this logic as a new, callable API endpoint accessible within the HASHIRU ecosystem.
    \item Potentially instantiating a new Employee agent dedicated to utilizing this newly created tool.
\end{enumerate}
This mechanism allows HASHIRU to dynamically extend its own functional repertoire over time, tailoring its capabilities to the tasks it encounters without requiring direct manual intervention for every new function, thereby enabling greater autonomy and long-term adaptation.
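
Under the assumption that code generation is delegated to a code-generating Employee agent and that a registry service can deploy the generated logic as a callable endpoint (both are placeholder names here, not HASHIRU's actual services), the flow can be sketched as:

\begin{verbatim}
def create_tool(ceo, spec):
    """Sketch: tool specification -> generated code -> new endpoint.
    `ceo.registry` and the "codegen" Employee are illustrative names;
    EmployeeAgent is the dataclass from the earlier sketch."""
    # 1. Commission the implementation from a code-generation agent.
    code = ceo.employees["codegen"].run(
        "Write a function implementing this API tool spec: " + str(spec))
    # 2. Deploy the generated logic as a callable endpoint.
    endpoint = ceo.registry.deploy(name=spec["name"], source=code)
    # 3. Optionally hire an Employee dedicated to the new tool.
    ceo.employees[spec["name"]] = EmployeeAgent(
        name=spec["name"], capability=spec["name"],
        run=lambda task: endpoint.call(task))
    return endpoint
\end{verbatim}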

\section{Experimental Setup}
\label{sec:experiments}

To evaluate the performance, efficiency, and adaptability of HASHIRU, we designed a set of experiments targeting its core architectural features. Our evaluation focuses on assessing the benefits of dynamic resource-aware management, the hybrid intelligence model, and the autonomous tool creation capability compared to relevant baselines. Specifically, we investigate:
\begin{itemize}
    \item The impact of dynamic agent management with economic constraints on resource utilization (cost, memory) and task performance compared to static configurations.
    \item The effectiveness of the hybrid (local-first) model strategy versus homogeneous (cloud-only or local-only) approaches across tasks of varying complexity.
    \item The system's ability to autonomously create and utilize necessary tools when faced with novel functional requirements within a task.
\end{itemize}

\subsection{Evaluation Tasks}
\label{subsec:tasks}
We selected tasks demanding complex reasoning, multi-perspective analysis, and interaction, suitable for exercising HASHIRU's hierarchical coordination and dynamic capabilities. The tasks fall into two main categories:

\subsubsection{Academic Paper Review}
This task evaluates HASHIRU's capacity to critically assess academic work by simulating the peer-review process. Given one or more scientific papers (e.g., in PDF format), the system must generate a review summary and ultimately recommend acceptance or rejection. This probes HASHIRU's ability to decompose evaluation criteria, delegate tasks to specialized agents (e.g., novelty assessment, methodological rigor, clarity), and manage resources effectively across long and complex documents.

\subsubsection{Reasoning and Problem-Solving Tasks}
To evaluate broader reasoning, knowledge retrieval, and problem-solving capabilities under different constraints, we employ a set of challenging benchmarks and puzzle-like tasks:
\begin{itemize}
    \item \textbf{Humanity's Last Exam \cite{phan2025humanitysexam}:} A benchmark designed to test graduate-level technical knowledge and complex reasoning across multiple domains. Success requires deep understanding and sophisticated problem-solving, likely necessitating access to powerful external LLMs managed effectively within HASHIRU's hybrid framework.
    \item \textbf{NYT Connections:} This popular puzzle requires identifying hidden semantic relationships or themes to categorize 16 words into four distinct groups. Solving this involves associative reasoning, broad world knowledge, and potentially hypothesis testing across different potential groupings, testing knowledge access and combinatorial reasoning coordination.
    \item \textbf{Wordle:} The daily word puzzle requires deductive reasoning to identify a five-letter word within six guesses, using feedback on correct letters and positions. This tests logical deduction, constraint satisfaction, and vocabulary knowledge. It serves as a good test case for comparing the efficiency (speed, cost, number of guesses) of local versus external models for iterative reasoning. We assume interaction via a simulated game environment.
    \item \textbf{Globle:} This geographic deduction game requires identifying a target country based on proximity feedback from guesses. It tests geographic knowledge retrieval, spatial reasoning, and iterative strategy refinement based on feedback (distance, direction). We assume interaction via a simulated game environment.
\end{itemize}
These diverse reasoning tasks challenge the system's ability to leverage appropriate cognitive resources (local vs. external models), potentially create simple tools, and coordinate problem-solving strategies effectively.
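
The game tasks are played against simple simulated environments rather than the live puzzles. As an illustration of the assumed interface (not an official game API), a minimal Wordle feedback function is:

\begin{verbatim}
from collections import Counter

def wordle_feedback(guess: str, answer: str) -> str:
    """Per-letter feedback: G = correct spot, Y = in word elsewhere,
    X = absent. Repeated letters: greens are marked first, then
    yellows are limited by the remaining letter counts."""
    feedback = ["X"] * len(answer)
    remaining = Counter()
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            feedback[i] = "G"
        else:
            remaining[a] += 1
    for i, g in enumerate(guess):
        if feedback[i] == "X" and remaining[g] > 0:
            feedback[i] = "Y"
            remaining[g] -= 1
    return "".join(feedback)
\end{verbatim}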

\subsection{Baselines for Comparison}
\label{subsec:baselines}
To quantify the benefits of HASHIRU's features, we will compare its performance against several baseline configurations:
\begin{itemize}
    \item \textbf{Static-HASHIRU:} A version with a fixed, predefined set of Employee agents (e.g., one generalist agent per potential role identified in paper analysis), disabling dynamic hiring/firing.
    \item \textbf{Cloud-Only HASHIRU:} HASHIRU operating exclusively with a powerful external LLM API and online function-calling for all agents, disabling the use of local models.
    \item \textbf{Local-Only HASHIRU:} HASHIRU operating exclusively with smaller, local LLMs (e.g., selected models via Ollama) for all agents.
    \item \textbf{HASHIRU (No-Economy):} HASHIRU with dynamic hiring/firing enabled but without the explicit costs, to isolate the impact of the economic model on agent churn and stability.
    % \item \textbf{[Optional] Other Frameworks:} If feasible, configure AutoGen \cite{wu2023autogen} or CrewAI \cite{crewai} with a similar hierarchical structure to solve a subset of the tasks for comparison.
\end{itemize}

\subsection{Evaluation Metrics}
\label{subsec:metrics}
We will evaluate performance using a combination of quantitative and qualitative metrics:
\begin{itemize}
    \item \textbf{Task Success Rate / Quality:}
    \begin{itemize}
        \item Percentage of tasks successfully completed (binary for games, potentially graded for analysis based on rubrics).
        \item Quality of output for analysis tasks (human evaluation based on relevance, coherence, accuracy, completeness).
        \item Accuracy for information extraction tasks.
        \item Number of guesses/turns required for game tasks.
    \end{itemize}
    \item \textbf{Resource Consumption:}
    \begin{itemize}
        \item Total cost incurred from external API calls.
        \item Peak and average memory usage (\% of allocated budget).
        \item Wall-clock time per task.
        \item Number and type (local/external) of LLM calls.
    \end{itemize}
    \item \textbf{System Dynamics and Adaptability:}
    \begin{itemize}
        \item Number of Employee agents hired and fired during tasks.
        \item Frequency of agent churn (hires+fires / task duration or steps).
        \item Number and utility of autonomously created tools (if applicable).
    \end{itemize}
\end{itemize}
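
Of these, churn is computed directly from the lifecycle event log. A small sketch, assuming a simple list of per-step hire/fire events (an illustrative format rather than HASHIRU's actual log schema), is:

\begin{verbatim}
def churn_frequency(events, task_steps):
    """Agent churn = (hires + fires) per task step.
    `events` is a list of (step, kind) tuples, kind in {"hire", "fire"}."""
    hires = sum(1 for _, kind in events if kind == "hire")
    fires = sum(1 for _, kind in events if kind == "fire")
    return (hires + fires) / max(task_steps, 1)
\end{verbatim}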

\bibliography{references}
\bibliographystyle{plain}

\end{document}
paper/references.bib
ADDED
@@ -0,0 +1,368 @@
@article{shen2023hugginggpt,
  title = {HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face},
  author = {Shen, Yongliang and Song, Kaitao and Tan, Xu and Li, Dongsheng and Lu, Weiming and Zhuang, Yueting},
  journal = {arXiv preprint arXiv:2303.17580},
  year = {2023}
}

@article{wu2023autogen,
  title = {{AutoGen}: Enabling Next-Gen {LLM} Applications via Multi-Agent Conversation},
  author = {Wu, Qingyun and Bansal, Gagan and Zhang, Jieyu and Wu, Yiran and Li, Beibin and Zhu, Erkang and Jiang, Li and Zhang, Xiaoyun and Zhang, Shaokun and Liu, Jiale and Awadallah, Ahmed H. and White, Ryen W. and Burger, Doug and Wang, Chi},
  journal = {arXiv preprint arXiv:2308.08155},
  year = {2023}
}

@inproceedings{yao2022react,
  title = {{ReAct}: Synergizing Reasoning and Acting in Language Models},
  author = {Yao, Shunyu and Zhao, Jeffrey and Yu, Dian and Du, Nan and Shafran, Izhak and Narasimhan, Karthik and Cao, Yuan},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year = {2023},
  note = {arXiv:2210.03629}
}

@article{schick2023toolformer,
  title = {Toolformer: Language Models Can Teach Themselves to Use Tools},
  author = {Schick, Timo and Dwivedi-Yu, Jane and Bitton, Yonatan and Yuan, Xi and Camburu, Oana-Maria and Houlsby, Neil},
  journal = {arXiv preprint arXiv:2302.04761},
  year = {2023}
}

@article{ong2024routellm,
  title = {{RouteLLM}: Learning to Route {LLMs} with Preference Data},
  author = {Ong, Isaac and Almahairi, Amjad and Wu, Vincent and Chiang, Wei-Lin and Wu, Tianhao and Gonzalez, Joseph E. and Kadous, M. Waleed and Stoica, Ion},
  journal = {arXiv preprint arXiv:2406.18665},
  year = {2024}
}

@article{fourney2024magentic,
  title = {Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks},
  author = {Fourney, Adam and Bansal, Gagan and Mozannar, Hussein and Tan, Cheng and et al.},
  journal = {arXiv preprint arXiv:2411.04468},
  year = {2024}
}

@inproceedings{cobbe2021gsm8k,
  title = {Training Verifiers to Solve Math Word Problems},
  author = {Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year = {2021},
  note = {Dataset introduced: GSM8K (Grade School Math 8K)}
}

@inproceedings{patel2021svamp,
  title = {Are {NLP} Models really able to Solve Simple Math Word Problems?},
  author = {Patel, Arkil and Bhattamishra, Satwik and Goyal, Navin},
  booktitle = {Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
  year = {2021},
  note = {Introduces the SVAMP challenge dataset}
}

@misc{phan2025humanitysexam,
  title = {Humanity's Last Exam},
  author = {Phan, Long and Gatti, Alice and Han, Ziwen and others},
  year = {2025},
  eprint = {2501.14249},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG},
  url = {https://arxiv.org/abs/2501.14249}
}

@article{chen2021codex,
  title = {Evaluating Large Language Models Trained on Code},
  author = {Chen, Mark and Tworek, Jerry and Jun, Heewoo and Yuan, Qiming and Ponde de Oliveira Pinto, Henrique and Kaplan, Jared and Edwards, Harri and Burda, Yuri and Joseph, Nicholas and Brockman, Greg and Ray, Alex and Puri, Raul and Krueger, Gretchen and Petrov, Michael and Khlaaf, Heidy and Sastry, Girish and Mishkin, Pamela and Chan, Brooke and Gray, Scott and Ryder, Nick and Pavlov, Mikhail and Power, Alethea and Kaiser, Lukasz and Bavarian, Mohammad and Winter, Clemens and Tillet, Philippe and Such, Felipe and Cummings, Dave and Plappert, Matthias and Chantzis, Fotios and Barnes, Elizabeth and Herbert-Voss, Ariel and Guss, William and Nichol, Alex and Paino, Alex and Tezak, Nikolas and Tang, Jie and Babuschkin, Igor and Balaji, Suchir and Jain, Shantanu and Saunders, William and Hesse, Christopher and Carr, Andrew N. and Leike, Jan and Achiam, Josh and Misra, Vedant and Morikawa, Evan and Radford, Alec and Knight, Matthew and Brundage, Miles and Murati, Mira and Mayer, Katie and Welinder, Peter and McGrew, Bob and Amodei, Dario and McCandlish, Sam and Sutskever, Ilya and Zaremba, Wojciech},
  journal = {arXiv preprint arXiv:2107.03374},
  year = {2021},
  note = {OpenAI Codex paper; introduced HumanEval benchmark}
}

@article{pai2024codocbench,
  title = {{CoDocBench}: A Dataset for Code-Documentation Alignment in Software Maintenance},
  author = {Pai, Kunal and Devanbu, Premkumar and Ahmed, Toufique},
  journal = {arXiv preprint arXiv:2502.00519},
  year = {2024}
}

@inproceedings{kamienski2021pysstubs,
  title = {{PySStuBs}: Characterizing Single-Statement Bugs in Popular Open-Source Python Projects},
  author = {Kamienski, Arthur V. and Palechor, Luisa and Bezemer, Cor-Paul and Hindle, Abram},
  booktitle = {IEEE/ACM International Conference on Mining Software Repositories (MSR)},
  year = {2021}
}

@article{brown2020language,
  title = {Language models are few-shot learners},
  author = {Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and others},
  journal = {Advances in neural information processing systems},
  volume = {33},
  pages = {1877--1901},
  year = {2020}
}

@inproceedings{devlin2019bert,
  title = {{BERT}: Pre-training of deep bidirectional transformers for language understanding},
  author = {Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  booktitle = {Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers)},
  pages = {4171--4186},
  year = {2019}
}

@article{raffel2020exploring,
  title = {Exploring the limits of transfer learning with a unified text-to-text transformer},
  author = {Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J},
  journal = {Journal of machine learning research},
  volume = {21},
  number = {140},
  pages = {1--67},
  year = {2020}
}

@article{dorri2018multi,
  title = {Multi-agent systems: A survey},
  author = {Dorri, Ali and Kanhere, Salil S and Jurdak, Raja},
  journal = {IEEE Access},
  volume = {6},
  pages = {28573--28593},
  year = {2018},
  publisher = {IEEE}
}

@book{wooldridge2009introduction,
  title = {An introduction to multiagent systems},
  author = {Wooldridge, Michael},
  year = {2009},
  publisher = {John Wiley \& Sons}
}

@article{boiko2023emergent,
  title = {Emergent autonomous scientific research capabilities of large language models},
  author = {Boiko, Daniil A and MacKnight, Robert and Gomes, Gabe},
  journal = {arXiv preprint arXiv:2304.05332},
  year = {2023}
}

@inproceedings{gaston2005agenta,
  title = {Agent-organized networks for dynamic team formation},
  author = {Gaston, Matthew E and DesJardins, Marie},
  booktitle = {Proceedings of the fourth international joint conference on Autonomous agents and multiagent systems},
  pages = {230--237},
  year = {2005}
}

@misc{zhou2023agents,
  title = {Agents: An Open-source Framework for Large Language Model based Autonomous Agents},
  author = {Wangchunshu Zhou and Jianshu Chen and Jialong Wu and Yiheng Xu and Kexin Wang and Jintian Zhang and Yuan Gao and Zhiyong Wu and Kevin Tian and Yubo Feng and Linyi Yang and Bokai Quan and Cong Yu and Yuhang Wang and Shishen Lan and Yan Wang and Hong-Cheng Guo and Chaoyu Chen and Tianxiang Sun and Jin Xiong and Yi Lu and Peng Li and Lichao Sun and Lifan Yuan and Hang Li and Xiangang Li},
  year = {2023},
  eprint = {2309.07870},
  archivePrefix = {arXiv},
  primaryClass = {cs.AI},
  url = {https://arxiv.org/abs/2309.07870}
}

@misc{openai_func_calling,
  title = {Function calling},
  author = {{OpenAI}},
  year = {2023},
  howpublished = {OpenAI API Documentation},
  url = {https://platform.openai.com/docs/guides/function-calling},
  note = {Accessed: 2025-05-01}
}

@misc{wang2023voyager,
  title = {{Voyager}: An Open-Ended Embodied Agent with Large Language Models},
  author = {Guanzhi Wang and Yuqi Xie and Yunfan Jiang and Ajay Mandlekar and Chaowei Xiao and Yuke Zhu and Linxi Fan and Anima Anandkumar},
  year = {2023},
  eprint = {2305.16291},
  archivePrefix = {arXiv},
  primaryClass = {cs.AI},
  url = {https://arxiv.org/abs/2305.16291}
}

@book{russell2010artificial,
  title = {Artificial intelligence: a modern approach},
  author = {Russell, Stuart J. and Norvig, Peter},
  year = {2010},
  edition = {3rd},
  publisher = {Prentice Hall Press},
  address = {Upper Saddle River, NJ, USA}
}

@article{shoham1994agent,
  author = {Yoav Shoham},
  title = {Agent-oriented programming},
  journal = {Artificial Intelligence},
  volume = {60},
  number = {1},
  pages = {51--92},
  year = {1993},
  publisher = {Elsevier}
}

@misc{wang2023survey,
  title = {A Survey on Large Language Model based Autonomous Agents},
  author = {Lei Wang and Chen Ma and Xueyang Feng and Zeyu Zhang and Hao Yang and Jingsen Zhang and Zhiyuan Chen and Jiakai Tang and Xu Chen and Yankai Lin and Wayne Xin Zhao and Zhewei Wei and Ji-Rong Wen},
  year = {2023},
  eprint = {2308.11432},
  archivePrefix = {arXiv},
  primaryClass = {cs.AI}
}

@misc{xi2023rise,
  title = {The Rise and Potential of Large Language Model Based Agents: A Survey},
  author = {Zhiheng Xi and Wenxiang Chen and Xin Guo and Wei He and Yiwen Ding and Boyang Hong and Ming Zhang and Junzhe Wang and Senjie Jin and Enyu Zhou and Rui Zheng and Xiaoran Fan and Xiao Wang and Limao Xiong and Linyi Yang and Ting Ruan and Yongquan Yang and Peng Li and Yitao Chang and Yanlin Wang},
  year = {2023},
  eprint = {2309.07864},
  archivePrefix = {arXiv},
  primaryClass = {cs.AI}
}

@inproceedings{park2023generative,
  author = {Park, Joon Sung and O'Brien, Joseph C. and Cai, Carrie J. and Morris, Meredith Ringel and Liang, Percy and Bernstein, Michael S.},
  title = {Generative Agents: Interactive Simulacra of Human Behavior},
  year = {2023},
  isbn = {9798400701320},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3586183.3606763},
  doi = {10.1145/3586183.3606763},
  booktitle = {The 36th Annual ACM Symposium on User Interface Software and Technology (UIST '23)},
  pages = {1--22},
  numpages = {22},
  location = {San Francisco, CA, USA},
  series = {UIST '23}
}

@misc{ollama,
  title = {Ollama},
  author = {{Ollama Team}},
  howpublished = {\url{https://ollama.com/}},
  year = {2023},
  note = {Accessed: 2025-05-01}
}

@misc{anthropic2024claude,
  title = {The {Claude 3} Model Family: {Opus, Sonnet, Haiku}},
  author = {{Anthropic}},
  year = {2024},
  month = {March},
  howpublished = {Model Card},
  url = {https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf},
  note = {Accessed: 2025-05-01}
}

@misc{openai2023gpt4,
  title = {GPT-4 Technical Report},
  author = {OpenAI},
  year = {2023},
  eprint = {2303.08774},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL},
  url = {https://arxiv.org/abs/2303.08774}
}

@misc{langgraph,
  title = {LangGraph: A Framework for Agentic Workflows},
  author = {LangChain},
  year = {2024},
  howpublished = {\url{https://www.langchain.com/langgraph}},
  note = {Accessed: 2025-05-01}
}

@book{clearwater1996market,
  title = {Market-Based Control: A Paradigm for Distributed Resource Allocation},
  editor = {Scott H. Clearwater},
  publisher = {World Scientific},
  year = {1996}
}

@article{valckenaers2005trends,
  title = {Guest Editors' Introduction: Intelligent Control in the Manufacturing Supply Chain},
  author = {McFarlane, Duncan and Mar{\'\i}k, Vladim{\'\i}r and Valckenaers, Paul},
  journal = {IEEE Intelligent Systems},
  volume = {20},
  number = {1},
  pages = {24--26},
  year = {2005},
  publisher = {IEEE}
}

@article{horling2004survey,
  title = {A survey of multi-agent organizational paradigms},
  author = {Horling, Bryan and Lesser, Victor},
  journal = {The Knowledge Engineering Review},
  volume = {19},
  number = {4},
  pages = {281--316},
  year = {2004},
  publisher = {Cambridge University Press}
}

@inproceedings{gaston2005agentb,
  title = {Agent-organized networks for multi-agent production and exchange},
  author = {Gaston, Matthew E and DesJardins, Marie},
  booktitle = {Proceedings of the 20th national conference on Artificial intelligence-Volume 1},
  pages = {77--82},
  year = {2005}
}

@misc{zhang2023building,
  title = {Building Cooperative Embodied Agents Modularly with Large Language Models},
  author = {Hongxin Zhang and Weihua Du and Jiaming Shan and Qinhong Zhou and Yilun Du and Joshua B. Tenenbaum and Tianmin Shu and Chuang Gan},
  year = {2023},
  eprint = {2307.02485},
  archivePrefix = {arXiv},
  primaryClass = {cs.AI}
}

@misc{parisi2022talm,
  title = {TALM: Tool Augmented Language Models},
  author = {Aaron Parisi and Yao Zhao and Noah Fiedel},
  year = {2022},
  eprint = {2205.12255},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL}
}

@misc{crewai,
  title = {CrewAI},
  author = {{CrewAI Inc.}},
  year = {2025},
  howpublished = {\url{https://www.crewai.com/}},
  note = {Accessed: 2025-05-01}
}

@article{qian2023communicative,
  title = {{ChatDev}: Communicative agents for software development},
  author = {Qian, Chen and Liu, Wei and Liu, Hongzhang and Chen, Nuo and Dang, Yufan and Li, Jiahao and Yang, Cheng and Chen, Weize and Su, Yusheng and Cong, Xin and others},
  journal = {arXiv preprint arXiv:2307.07924},
  year = {2023}
}

@article{wang2023decision,
  title = {Decision-making driven by driver intelligence and environment reasoning for high-level autonomous vehicles: a survey},
  author = {Wang, Yuning and Jiang, Junkai and Li, Shangyi and Li, Ruochen and Xu, Shaobing and Wang, Jianqiang and Li, Keqiang},
  journal = {IEEE Transactions on Intelligent Transportation Systems},
  volume = {24},
  number = {10},
  pages = {10362--10381},
  year = {2023},
  publisher = {IEEE}
}

@misc{wen2024benchmarkingcomplexinstructionfollowingmultiple,
  title = {Benchmarking Complex Instruction-Following with Multiple Constraints Composition},
  author = {Bosi Wen and Pei Ke and Xiaotao Gu and Lindong Wu and Hao Huang and Jinfeng Zhou and Wenchuang Li and Binxin Hu and Wendy Gao and Jiaxin Xu and Yiming Liu and Jie Tang and Hongning Wang and Minlie Huang},
  year = {2024},
  eprint = {2407.03978},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL},
  url = {https://arxiv.org/abs/2407.03978}
}