Kunal Pai committed · Commit adfd14d · 1 parent: 7040aa5
Make paper more concise

Changed files:
- paper/conference_101719.tex (+97 -100)
- paper/references.bib (+10 -0)

paper/conference_101719.tex
CHANGED
@@ -35,58 +35,58 @@ spshetty@ucdavis.edu}
@@ -96,153 +96,150 @@ Figure \ref{fig:arch} illustrates the overall structure and interaction flow.

[The old-version lines removed in these two hunks are truncated by the diff viewer and are not reproduced here; the revised text appears in full below.]

\section{Introduction}\label{sec:introduction}

Rapid advancements in Large Language Models (LLMs) are reshaping Artificial Intelligence (AI) with profound capabilities in language understanding, generation, reasoning, and planning \cite{brown2020language, devlin2019bert, raffel2020exploring}. This progress drives the development of autonomous AI agents, shifting focus from single agents to Multi-Agent Systems (MAS), where collaborative teams tackle complex problems beyond individual scope \cite{dorri2018multi, wooldridge2009introduction}. Collaborative MAS show significant potential in diverse domains like scientific discovery \cite{boiko2023emergent}, software engineering \cite{qian2023communicative}, data analysis, and strategic decision-making \cite{wang2023decision}. The increasing complexity of tasks, demonstrated by benchmarks requiring advanced mathematical reasoning (e.g., GSM8K \cite{cobbe2021gsm8k}, SVAMP \cite{patel2021svamp}), coding (e.g., HumanEval \cite{chen2021codex}, CoDocBench \cite{pai2024codocbench}), and graduate-level technical knowledge \cite{phan2025humanitysexam}, highlights the need for agentic systems to coordinate diverse cognitive resources effectively \cite{wen2024benchmarkingcomplexinstructionfollowingmultiple}.

Despite this potential, contemporary agentic frameworks face significant limitations. Many are \textbf{rigid}, relying on predefined roles and static structures that hinder adaptation to dynamic tasks \cite{zhang2023building}. \textbf{Resource obliviousness} is common: systems often lack mechanisms to monitor and optimize computational resources like API costs, memory, and CPU load, leading to inefficiency, especially when scaling or deploying in resource-constrained environments \cite{park2023generative}. This is often worsened by reliance on powerful but costly proprietary cloud LLMs. \textbf{Model homogeneity}, defaulting to a single powerful LLM for all sub-tasks, misses efficiency gains from a diverse ecosystem including smaller, specialized, or local models \cite{zhou2023agents}. While \textbf{tool use} is fundamental \cite{yao2022react, parisi2022talm}, agents' ability to autonomously \textbf{create and integrate new tools} remains limited, restricting dynamic extension and self-improvement without human intervention \cite{wang2023voyager}.

To address these challenges, we introduce \textbf{HASHIRU (Hierarchical Agent System for Hybrid Intelligent Resource Utilization)}, a novel MAS framework that enhances flexibility, resource efficiency, and adaptability. HASHIRU employs a hierarchical structure led by a central ``CEO'' agent that dynamically manages specialized ``employee'' agents instantiated on demand. A core tenet is its \textbf{hybrid intelligence} approach, strategically prioritizing smaller (e.g., 3B--7B), locally-run LLMs (often via Ollama \cite{ollama}) for cost-effectiveness and efficiency. While prioritizing local resources, the system flexibly integrates external APIs and potentially more powerful models when justified by task complexity and resource availability, under the CEO's management.

The primary contributions are:
\begin{enumerate}
\item A novel MAS architecture combining \textbf{hierarchical control} with \textbf{dynamic, resource-aware agent lifecycle management} (hiring/firing). This management is governed by computational budget constraints (cost, memory, concurrency) and includes an economic model with hiring/firing costs to discourage excessive churn.
\item A \textbf{hybrid intelligence model} prioritizing cost-effective, local LLMs while adaptively incorporating external APIs and larger models, optimizing the efficiency-capability trade-off.
\item An integrated mechanism for \textbf{autonomous API tool creation}, allowing dynamic extension of the functional repertoire.
\item An \textbf{economic model} (hiring/firing fees) for agent management, promoting efficient resource allocation and team stability.
\end{enumerate}

This paper details HASHIRU's design and rationale. Section \ref{sec:background} discusses related work in agent architectures, dynamic management, resource allocation, model heterogeneity, and tool use. Section \ref{sec:architecture} elaborates on the architecture and core mechanisms. Section \ref{sec:experiments} describes the experimental setup and planned evaluation, followed by discussion and conclusion in Sections 5 and 6.

\section{Background and Related Work} \label{sec:background}

Intelligent agent concepts have evolved from early symbolic AI \cite{russell2010artificial, shoham1994agent} to LLM-dominated frameworks leveraging models for reasoning, planning, and interaction \cite{wang2023survey, xi2023rise}. HASHIRU builds on this, addressing current limitations.

\subsection{Agent Architectures: Hierarchy and Dynamics}

MAS architectures vary, including flat, federated, and hierarchical \cite{dorri2018multi, horling2004survey}. Hierarchical models offer clear control and task decomposition but risk bottlenecks and rigidity \cite{gaston2005agenta,gaston2005agentb}. HASHIRU uses a \textbf{CEO-Employee hierarchy} for centralized coordination but distinguishes itself through \textbf{dynamic team composition}. Unlike systems with static hierarchies or predefined roles (e.g., CrewAI \cite{crewai}, ChatDev \cite{qian2023communicative}), HASHIRU's CEO dynamically manages the employee pool based on runtime needs and resource constraints.

\subsection{Dynamic Agent Lifecycle Management}

Dynamic MAS composition is crucial for complex environments \cite{valckenaers2005trends}. Agent creation/deletion triggers often relate to task structure or environmental changes. HASHIRU introduces a specific mechanism where the CEO makes \textbf{hiring and firing decisions} based on a cost-benefit analysis considering agent performance, operational costs (API fees, inferred compute), memory footprint (tracked explicitly as a percentage of available resources), and concurrency limits. HASHIRU also incorporates an \textbf{economic model} with explicit ``starting bonus'' (hiring) and ``invocation'' (usage) costs. This economic friction aims to prevent excessive initialization or usage for marginal gains and promote team stability, a nuance often missing in simpler dynamic strategies.

\subsection{Resource Management and Agent Economies}

Resource awareness is critical for scalable MAS. Economic research explores mechanisms like market-based auctions or contract nets for allocation \cite{clearwater1996market}. HASHIRU implements a more \textbf{centralized, budget-constrained resource management model}. The CEO operates within defined limits for financial cost, memory usage (as a percentage of total allocated), and concurrent agent count. This direct management, particularly focusing on memory percentage, suggests practicality for deployment on local or edge devices with finite resources, contrasting with cloud systems assuming elastic resources \cite{park2023generative}. Frameworks like AutoGen \cite{wu2023autogen} and LangGraph \cite{langgraph} typically rely on implicit cost tracking without explicit multi-dimensional budgeting and control.

\subsection{Hybrid Intelligence and Heterogeneous Models}

Leveraging diverse LLMs with varying capabilities, costs, and latencies is an emerging trend \cite{zhou2023agents}. Techniques like model routing select optimal models for sub-tasks. HASHIRU embraces \textbf{model heterogeneity} with a strategic focus: \textbf{prioritizing smaller (3B--7B), locally-run models via Ollama integration} \cite{ollama}. This emphasizes cost-efficiency, low latency, and potential privacy over systems defaulting to large proprietary cloud APIs (e.g., GPT-4 \cite{openai2023gpt4}, Claude 3 \cite{anthropic2024claude}). While integrating external APIs (potentially larger models), HASHIRU's default stance represents a distinct capability vs. efficiency balance.

\subsection{Tool Use and Autonomous Tool Creation}

Tool use (APIs, functions) is fundamental for modern agents \cite{yao2022react, openai_func_calling}. Most systems use predefined tools. HASHIRU advances this with \textbf{integrated, autonomous API tool creation}. When needed functionality is missing, the CEO can commission the generation (potentially via a specialized agent) and deployment of a new API tool within the HASHIRU ecosystem. This self-extension capability differentiates HASHIRU from systems limited to static toolsets, moving towards greater autonomy and adaptability \cite{wang2023voyager, park2023generative}.

In summary, HASHIRU integrates hierarchical control, dynamic MAS composition, resource management, and tool use. Its novelty lies in the synergistic combination of: (1) dynamic, resource-aware hierarchical management with (2) an economic model for stability, (3) a local-first hybrid intelligence strategy, and (4) integrated autonomous tool creation. This targets key limitations in current systems regarding efficiency, adaptability, cost, and autonomy.

\section{HASHIRU System Architecture}
\label{sec:architecture}

HASHIRU's architecture addresses rigidity, resource obliviousness, and limited adaptability through a hierarchical, dynamically managed MAS optimized for hybrid resource utilization.

\subsection{Overview}

HASHIRU operates with a central ``CEO'' agent coordinating specialized ``Employees''. Key tenets:
\begin{itemize}
\item \textbf{Dynamic Hierarchical Coordination:} CEO manages strategy, task allocation, and dynamic team composition.
\item \textbf{Dynamic Lifecycle Management:} Employees are hired/fired based on runtime needs and resource constraints, governed by an economic model.
\item \textbf{Hybrid Intelligence:} Strategic preference for local, cheaper LLMs, while accessing external APIs/models.
\item \textbf{Explicit Resource Management:} Continuous monitoring and control of costs, memory usage, and concurrency against budgets.
\item \textbf{Adaptive Tooling:} Using predefined tools alongside autonomous creation of new API tools.
\end{itemize}

Figure \ref{fig:arch} illustrates the structure.

\begin{figure}[ht]
\centering
% lines 93--95 unchanged; not shown in this diff
\end{figure}

\subsection{Hierarchical Structure: CEO and Employee Agents}

The system uses a two-tiered hierarchy:

\begin{itemize}
\item \textbf{CEO Agent:} Singleton, central coordinator and entry point. Responsibilities:
\begin{itemize}
\item Interpreting the user query/task.
\item Decomposing the main task into sub-tasks.
\item Identifying required capabilities.
\item Managing the Employee pool (Section \ref{subsec:dynamic_mgmt}).
\item Assigning sub-tasks to active Employees.
\item Monitoring Employee progress/performance.
\item Synthesizing Employee results into the final output.
\item Managing the overall resource budget (Section \ref{subsec:resource_mgmt}).
\item Initiating new tool creation (Section \ref{subsec:tooling}).
\end{itemize}
We use Gemini 2.5 Flash~\cite{gemini25flash} as the CEO agent due to its strong reasoning capabilities, support for tool usage, and cost efficiency, making it a practical and capable choice for our deployment.
\item \textbf{Employee Agents:} Specialized agents instantiated by the CEO for specific sub-tasks. Each typically wraps an LLM (local via Ollama \cite{ollama} or an external API) or provides tool access. Characteristics:
\begin{itemize}
\item Specialization: Capabilities tailored to task types (code, data analysis, information retrieval).
\item Dynamic Existence: Created/destroyed by the CEO based on need/performance.
\item Task Execution: Receive a task, execute it, return the result.
\item Resource Consumption: Associated costs (API, memory) tracked by the system.
\end{itemize}
\end{itemize}

This hierarchy facilitates task decomposition and result aggregation; the dynamic pool provides flexibility.

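To make the two-tiered structure concrete, the following minimal Python sketch shows one way the CEO/Employee relationship could be organized. It is illustrative only: the class names, fields, and the \texttt{delegate} method are our assumptions, not the actual HASHIRU implementation, and each agent's model backend is reduced to a plain string label.

\begin{verbatim}
from dataclasses import dataclass, field

@dataclass
class Employee:
    """A specialized agent wrapping a local or external model."""
    name: str
    backend: str          # e.g. "ollama:mistral:7b" or "api:gemini-2.5-flash"
    memory_mb: int        # tracked against the memory budget
    cost_per_call: float  # "salary" charged on each invocation

    def run(self, subtask: str) -> str:
        # Placeholder: a real system would call the underlying model here.
        return f"[{self.name}] result for: {subtask}"

@dataclass
class CEO:
    """Central coordinator: decomposes tasks and delegates to Employees."""
    employees: dict[str, Employee] = field(default_factory=dict)

    def delegate(self, subtask: str, capability: str) -> str:
        worker = self.employees.get(capability)
        if worker is None:
            raise LookupError(f"no Employee hired for '{capability}'")
        return worker.run(subtask)

ceo = CEO()
ceo.employees["review"] = Employee("reviewer", "ollama:mistral:7b", 4500, 0.0)
print(ceo.delegate("Summarize the methods section", "review"))
\end{verbatim}
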
\subsection{Dynamic Agent Lifecycle Management}
\label{subsec:dynamic_mgmt}

A core innovation is the CEO's dynamic management (hiring/firing) of Employee agents. Driven by cost-benefit analysis, this optimizes task performance within resource constraints.

When a sub-task needs capabilities that are unavailable or inefficiently provided, the CEO may hire a new agent. Conversely, if an agent underperforms, sits idle, becomes too costly, or resource limits are approached, the CEO may fire it. Decision factors:
\begin{itemize}
\item \textbf{Task Requirements:} Needed capabilities for pending sub-tasks.
\item \textbf{Agent Performance:} Historical success, output quality, efficiency.
\item \textbf{Operational Costs:} API, estimated compute, or other costs.
\item \textbf{Memory Footprint:} Agent memory usage (\% of total allocated).
\item \textbf{Agent Concurrency:} Active agents vs. the predefined limit.
\end{itemize}

HASHIRU includes an \textbf{economic model}:
\begin{itemize}
\item \textbf{Hiring Cost (``Starting Bonus''):} One-time cost charged upon instantiation (setup overhead).
\item \textbf{Invocation Cost (``Salary''):} Recurring cost charged on each use (system/payment load).
\end{itemize}

These transaction costs discourage excessive churn, promoting stability. The CEO evaluates whether the benefits of replacing an agent outweigh the hiring/firing costs plus operational differences. This combats rigidity and allows adaptation while managing budgets and preventing wasteful turnover.

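The hiring/firing decision can be read as a simple cost-benefit test. The sketch below illustrates the idea under stated assumptions; the function \texttt{should\_replace}, its inputs, and the way expected gain is estimated are hypothetical, not HASHIRU's actual policy.

\begin{verbatim}
from dataclasses import dataclass

@dataclass
class AgentRecord:
    hiring_cost: float      # one-time "starting bonus"
    invocation_cost: float  # per-use "salary"
    invocations: int
    successes: int

    @property
    def success_rate(self) -> float:
        return self.successes / self.invocations if self.invocations else 0.0

def should_replace(current: AgentRecord, candidate_hiring_cost: float,
                   candidate_invocation_cost: float,
                   expected_calls: int, expected_gain: float) -> bool:
    """Replace only if the expected benefit beats the transaction costs."""
    # Cost of keeping the incumbent for the expected workload.
    keep_cost = current.invocation_cost * expected_calls
    # Cost of switching: pay the candidate's starting bonus plus its salary.
    switch_cost = candidate_hiring_cost + candidate_invocation_cost * expected_calls
    # expected_gain estimates the value of improved success/quality.
    return expected_gain > (switch_cost - keep_cost)

incumbent = AgentRecord(hiring_cost=1.0, invocation_cost=0.05,
                        invocations=20, successes=12)
print(should_replace(incumbent, candidate_hiring_cost=2.0,
                     candidate_invocation_cost=0.02, expected_calls=50,
                     expected_gain=1.5))  # True: gain 1.5 > net switch cost 0.5
\end{verbatim}
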
\subsection{Hybrid Intelligence and Model Management}

HASHIRU is designed for \textbf{hybrid intelligence}, leveraging diverse cognitive resources. It strategically prioritizes smaller (3B--7B), cost-effective local LLMs served via Ollama \cite{ollama}. This enhances efficiency, reduces reliance on external APIs, and can improve privacy and latency.

The system also integrates:
\begin{itemize}
\item \textbf{External LLM APIs:} Access to powerful proprietary models (GPT-4 \cite{openai2023gpt4}, Claude 3 \cite{anthropic2024claude}) when necessary, subject to cost-benefit analysis.
\item \textbf{External Tool APIs:} Integration of third-party software and data sources.
\item \textbf{Self-Created APIs:} Tools generated by HASHIRU itself (Section \ref{subsec:tooling}).
\end{itemize}

The CEO manages this heterogeneous pool, selecting the most appropriate resource based on task difficulty, required capabilities, and budget. This balances cost-effectiveness and efficiency against the need for high capability.

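A minimal sketch of this local-first routing is shown below. The thresholds, model names, and the \texttt{pick\_backend} helper are illustrative assumptions; the real CEO decision also weighs agent performance history and concurrency.

\begin{verbatim}
def pick_backend(difficulty: float, budget_remaining: float,
                 local_models: list[str],
                 api_models: dict[str, float]) -> str:
    """Local-first routing: prefer a small local model, escalate to an
    external API only when the task is hard and the budget allows it."""
    if difficulty < 0.6 or budget_remaining <= 0.0:
        # Cheap, low-latency default: a locally served model (e.g. via Ollama).
        return f"ollama:{local_models[0]}"
    # Otherwise choose the cheapest external model we can still afford.
    affordable = {m: c for m, c in api_models.items() if c <= budget_remaining}
    if not affordable:
        return f"ollama:{local_models[-1]}"  # best local fallback
    return "api:" + min(affordable, key=affordable.get)

apis = {"gemini-2.5-flash": 0.4, "gpt-4": 3.0}
print(pick_backend(0.3, 5.0, ["mistral:7b", "llama3:8b"], apis))  # local model
print(pick_backend(0.9, 5.0, ["mistral:7b", "llama3:8b"], apis))  # cheapest API
\end{verbatim}
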
\subsection{Resource Monitoring and Control}
\label{subsec:resource_mgmt}

Explicit resource management is central to HASHIRU, moving beyond simple API cost tracking. The system, coordinated by the CEO, monitors:
\begin{itemize}
\item \textbf{Financial Costs:} Accumulating external API costs.
\item \textbf{Memory Usage:} Footprint of active Employee agents (\% of the allocated budget).
\item \textbf{Agent Concurrency:} Count of concurrently active agents.
\end{itemize}

These metrics are monitored against predefined \textbf{budget limits}. Actions (such as hiring) that would exceed a limit (e.g., $>$90\% memory use or the maximum concurrency) are prevented. This ensures operation within constraints, which is crucial on limited hardware or under strict budgets.

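The budget check can be expressed as a simple guard evaluated before any hire. The sketch below is an illustration under assumptions (the field names and the 90\% figure mirror the example above); it is not the system's actual accounting code.

\begin{verbatim}
from dataclasses import dataclass

@dataclass
class Budget:
    max_cost_usd: float
    max_memory_pct: float   # e.g. 90.0 means 90% of allocated memory
    max_agents: int

@dataclass
class Usage:
    cost_usd: float = 0.0
    memory_pct: float = 0.0
    active_agents: int = 0

def can_hire(usage: Usage, budget: Budget, new_agent_memory_pct: float) -> bool:
    """Block any hire that would push memory or concurrency past the budget."""
    if usage.active_agents + 1 > budget.max_agents:
        return False
    if usage.memory_pct + new_agent_memory_pct > budget.max_memory_pct:
        return False
    return usage.cost_usd <= budget.max_cost_usd

budget = Budget(max_cost_usd=10.0, max_memory_pct=90.0, max_agents=5)
usage = Usage(cost_usd=2.5, memory_pct=70.0, active_agents=3)
print(can_hire(usage, budget, new_agent_memory_pct=15.0))  # True  (85% <= 90%)
print(can_hire(usage, budget, new_agent_memory_pct=25.0))  # False (95% > 90%)
\end{verbatim}
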
\subsection{Tool Utilization and Autonomous Creation}
\label{subsec:tooling}

HASHIRU agents use predefined tools (functions, APIs, databases) to interact with their environment and perform actions beyond text generation \cite{yao2022react, openai_func_calling}.

A distinctive feature is \textbf{integrated, autonomous tool creation}. If the CEO determines that a required capability is missing, it can initiate the creation of a new tool. This involves:
\begin{enumerate}
\item Defining the tool specification (inputs, outputs, functionality).
\item Commissioning the logic generation (code, potentially using external APIs with provided credentials, possibly via a code-generating agent).
\item Deploying the logic as a new, callable API endpoint within HASHIRU.
\item Potentially instantiating an Employee agent for the new tool.
\end{enumerate}

This allows HASHIRU to dynamically extend its functional repertoire, tailoring capabilities to tasks without manual intervention and enabling greater autonomy and adaptation.

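The creation loop can be sketched as specify, generate, register. In the sketch below the generation step is hard-coded rather than produced by an LLM, and tools are registered as in-process callables instead of deployed API endpoints; all names (\texttt{commission\_tool}, \texttt{deploy\_tool}) are hypothetical.

\begin{verbatim}
from typing import Callable

TOOL_REGISTRY: dict[str, Callable[..., object]] = {}

def commission_tool(name: str, spec: str) -> Callable[..., object]:
    """Stand-in for a code-generating Employee: in HASHIRU this step would
    be produced by an LLM from the specification; here it is hard-coded."""
    def currency_convert(amount: float, rate: float) -> float:
        return round(amount * rate, 2)
    return currency_convert

def deploy_tool(name: str, spec: str) -> None:
    """Generate the tool and register it as a callable 'endpoint'."""
    TOOL_REGISTRY[name] = commission_tool(name, spec)

deploy_tool("currency_convert", "convert an amount using a given exchange rate")
print(TOOL_REGISTRY["currency_convert"](100.0, 0.92))  # 92.0
\end{verbatim}
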
\section{Experimental Setup}
\label{sec:experiments}

We designed experiments to evaluate HASHIRU's performance, efficiency, and adaptability, targeting dynamic resource management, hybrid intelligence, and autonomous tool creation. The evaluation assesses benefits over baselines, focusing on:
\begin{itemize}
\item The impact of dynamic management with economic constraints on resource utilization (cost, memory) and task performance, versus static configurations.
\item The effectiveness of the hybrid (local-first) strategy versus homogeneous (cloud-only or local-only) approaches across levels of task complexity.
\item The system's ability to autonomously create and utilize tools for novel functional requirements.
\end{itemize}

\subsection{Evaluation Tasks}
\label{subsec:tasks}

The tasks demand complex reasoning, multi-perspective analysis, and interaction, making them suitable for HASHIRU's coordination and dynamic capabilities. They fall into two categories:

\subsubsection{Academic Paper Review}

Evaluates HASHIRU's critical assessment capabilities by simulating peer review. Given papers (e.g., as PDF), the system generates a review summary and recommends acceptance or rejection. This probes the ability to decompose review criteria, delegate to specialized agents (e.g., for novelty, rigor, clarity), and manage resources across long, complex documents.

\subsubsection{Reasoning and Problem-Solving Tasks}

Evaluates broader reasoning, knowledge retrieval, and problem-solving under constraints, using challenging benchmarks and puzzles:
\begin{itemize}
\item \textbf{Humanity's Last Exam \cite{phan2025humanitysexam}:} Tests graduate-level technical knowledge and complex reasoning across domains. Requires deep understanding and sophisticated problem-solving, likely needing powerful external LLMs managed within HASHIRU's hybrid framework.
\item \textbf{NYT Connections \cite{lopez2024nyt}:} A puzzle requiring the identification of hidden semantic relationships or themes to categorize 16 words into four groups. It involves associative reasoning, broad knowledge, and hypothesis testing, and probes how the system coordinates knowledge access with combinatorial reasoning.
\item \textbf{Wordle:} A daily word puzzle requiring deductive reasoning to identify a five-letter word within six guesses, using feedback. Tests logical deduction, constraint satisfaction, and vocabulary, and is a good test for comparing the efficiency (speed, cost, guesses) of local versus external models on iterative reasoning. Assumes a simulated game environment (a minimal feedback-scoring sketch follows this list).
\item \textbf{Globle:} A geographic deduction game identifying a target country based on proximity feedback. Tests geographic knowledge, spatial reasoning, and iterative strategy refinement based on feedback. Also assumes a simulated game environment.
\end{itemize}

These tasks challenge the system's ability to leverage appropriate resources (local vs. external), potentially create simple tools, and coordinate problem-solving.

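For the simulated Wordle environment referenced above, the only component needed is a feedback-scoring function. A minimal Python version, with a G/Y/X encoding chosen here purely for illustration, is:

\begin{verbatim}
def wordle_feedback(guess: str, target: str) -> str:
    """Return per-letter feedback: G = correct spot, Y = in word elsewhere,
    X = absent. Handles repeated letters the way the real game does."""
    feedback = ["X"] * 5
    remaining = list(target)
    # First pass: exact matches.
    for i, (g, t) in enumerate(zip(guess, target)):
        if g == t:
            feedback[i] = "G"
            remaining.remove(g)
    # Second pass: right letter, wrong position.
    for i, g in enumerate(guess):
        if feedback[i] == "X" and g in remaining:
            feedback[i] = "Y"
            remaining.remove(g)
    return "".join(feedback)

print(wordle_feedback("crane", "caret"))  # GYYXY
\end{verbatim}
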
\subsection{Baselines for Comparison}
\label{subsec:baselines}

To quantify HASHIRU's benefits, we compare its performance against the following baselines (an illustrative encoding of these configurations follows the list):
\begin{itemize}
\item \textbf{Static-HASHIRU:} Fixed, predefined Employee agents (e.g., one per role), with dynamic hiring/firing disabled.
\item \textbf{Cloud-Only HASHIRU:} Uses exclusively a powerful external LLM API and online function calling for all agents, with local models disabled.
\item \textbf{Local-Only HASHIRU:} Uses exclusively smaller, local LLMs (via Ollama) for all agents.
\item \textbf{HASHIRU (No-Economy):} Dynamic hiring/firing enabled but without explicit costs, isolating the economic model's impact on churn and stability.
\end{itemize}

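One way to encode these baselines is as feature flags over the full system; the configuration class and flag names below are purely illustrative, not HASHIRU's actual experiment harness.

\begin{verbatim}
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentConfig:
    dynamic_lifecycle: bool   # allow hiring/firing at runtime
    allow_local_models: bool  # Ollama-served models
    allow_cloud_models: bool  # external LLM APIs
    economy_enabled: bool     # hiring/invocation costs applied

CONFIGS = {
    "HASHIRU":              ExperimentConfig(True,  True,  True,  True),
    "Static-HASHIRU":       ExperimentConfig(False, True,  True,  True),
    "Cloud-Only HASHIRU":   ExperimentConfig(True,  False, True,  True),
    "Local-Only HASHIRU":   ExperimentConfig(True,  True,  False, True),
    "HASHIRU (No-Economy)": ExperimentConfig(True,  True,  True,  False),
}

for name, cfg in CONFIGS.items():
    print(name, cfg)
\end{verbatim}
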
\subsection{Evaluation Metrics}
\label{subsec:metrics}

We evaluate using quantitative and qualitative metrics:
\begin{itemize}
\item \textbf{Task Success Rate / Quality:}
\begin{itemize}
\item Percentage of tasks completed (binary for games, graded for analysis tasks).
\item Output quality for analysis tasks (human evaluation of relevance, coherence, accuracy, and completeness).
\item Accuracy for information extraction.
\item Number of guesses/turns for game tasks.
\end{itemize}
\item \textbf{Resource Consumption:}
\begin{itemize}
\item Total external API costs.
\item Peak and average memory usage (\% of allocated budget).
\item Wall-clock time per task.
\item Number and type (local/external) of LLM calls.
\end{itemize}
\item \textbf{System Dynamics and Adaptability:}
\begin{itemize}
\item Number of Employee agents hired/fired per task.
\item Agent churn frequency (hires plus fires, normalized by task duration or steps; see the sketch after this list).
\item Number and utility of autonomously created tools (if applicable).
\end{itemize}
\end{itemize}
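
As an illustration, churn frequency and the other per-run counters could be tracked as follows; the \texttt{RunMetrics} container and its fields are assumptions for exposition, not part of the evaluation harness described above.

\begin{verbatim}
from dataclasses import dataclass, field

@dataclass
class RunMetrics:
    api_cost_usd: float = 0.0
    llm_calls: dict[str, int] = field(
        default_factory=lambda: {"local": 0, "external": 0})
    hires: int = 0
    fires: int = 0
    steps: int = 0

    def churn_frequency(self) -> float:
        """(hires + fires) normalized by task length in steps."""
        return (self.hires + self.fires) / self.steps if self.steps else 0.0

m = RunMetrics(api_cost_usd=0.42, hires=3, fires=2, steps=50)
m.llm_calls["local"] += 12
m.llm_calls["external"] += 4
print(m.churn_frequency())  # 0.1
\end{verbatim}
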
\bibliography{references}
\bibliographystyle{plain}

\end{document}

paper/references.bib
CHANGED
@@ -19,6 +19,16 @@
   year={2024}
 }
 
+@misc{gemini25flash,
+  title = {Gemini 2.5 Flash: Model Card, API, and Announcement},
+  author = {{Google DeepMind} and {Google AI}},
+  year = {2025},
+  howpublished = {\url{https://developers.googleblog.com/en/start-building-with-gemini-25-flash/}},
+  note = {See also:
+          \url{https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemini-2.5-flash-preview-04-17?inv=1&invt=AbxICQ},
+          \url{https://ai.google.dev/gemini-api/docs/models}. Accessed: 2025-05-11}
+}
+
 @inproceedings{yao2022react,
   title = {{ReAct}: Synergizing Reasoning and Acting in Language Models},