Kunal Pai committed
Commit 06dc658 · 1 Parent(s): 4f96523
Add citation for NYT Connections benchmark and update references
- paper/conference_101719.tex +1 -1
- paper/references.bib +8 -0
paper/conference_101719.tex CHANGED
@@ -197,7 +197,7 @@ This task evaluates HASHIRU's capacity to critically assess academic work by sim
 To evaluate broader reasoning, knowledge retrieval, and problem-solving capabilities under different constraints, we employ a set of challenging benchmarks and puzzle-like tasks:
 \begin{itemize}
 \item \textbf{Humanity's Last Exam \cite{phan2025humanitysexam}:} A benchmark designed to test graduate-level technical knowledge and complex reasoning across multiple domains. Success requires deep understanding and sophisticated problem-solving, likely necessitating access to powerful external LLMs managed effectively within HASHIRU's hybrid framework.
-\item \textbf{NYT Connections:} This popular puzzle requires identifying hidden semantic relationships or themes to categorize 16 words into four distinct groups. Solving this involves associative reasoning, broad world knowledge, and potentially hypothesis testing across different potential groupings, testing knowledge access and combinatorial reasoning coordination.
+\item \textbf{NYT Connections \cite{lopez2024nyt}:} This popular puzzle requires identifying hidden semantic relationships or themes to categorize 16 words into four distinct groups. Solving this involves associative reasoning, broad world knowledge, and potentially hypothesis testing across different potential groupings, testing knowledge access and combinatorial reasoning coordination.
 \item \textbf{Wordle:} The daily word puzzle requires deductive reasoning to identify a five-letter word within six guesses, using feedback on correct letters and positions. This tests logical deduction, constraint satisfaction, and vocabulary knowledge. It serves as a good test case for comparing the efficiency (speed, cost, number of guesses) of local versus external models for iterative reasoning. We assume interaction via a simulated game environment.
 \item \textbf{Globle:} This geographic deduction game requires identifying a target country based on proximity feedback from guesses. It tests geographic knowledge retrieval, spatial reasoning, and iterative strategy refinement based on feedback (distance, direction). We assume interaction via a simulated game environment.
 \end{itemize}
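Note: the Wordle and Globle items above both assume interaction via a simulated game environment, but the committed text does not specify that harness. Purely as an illustration of what such a harness might look like, here is a minimal Python sketch of Wordle-style feedback; the function name score_guess and the G/Y/- encoding are assumptions for this sketch, not part of the paper.

# Minimal sketch of a simulated Wordle environment (illustrative only).
# Feedback encoding: "G" = correct letter and position, "Y" = letter present
# elsewhere in the secret word, "-" = letter absent.

def score_guess(secret: str, guess: str) -> str:
    assert len(secret) == len(guess) == 5
    feedback = ["-"] * 5
    remaining = []

    # First pass: mark exact (green) matches and collect unmatched secret letters.
    for i, (s, g) in enumerate(zip(secret, guess)):
        if s == g:
            feedback[i] = "G"
        else:
            remaining.append(s)

    # Second pass: mark present-but-misplaced (yellow) letters, consuming
    # each unmatched secret letter at most once.
    for i, g in enumerate(guess):
        if feedback[i] == "-" and g in remaining:
            feedback[i] = "Y"
            remaining.remove(g)

    return "".join(feedback)


if __name__ == "__main__":
    # Example: secret "crane", guess "caner" -> "GYYYY"
    print(score_guess("crane", "caner"))

A Globle harness would be analogous, returning distance and direction feedback for a guessed country instead of per-letter feedback.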
paper/references.bib CHANGED
@@ -12,6 +12,14 @@
   year = {2023}
 }
 
+@article{lopez2024nyt,
+  title={NYT-Connections: A Deceptively Simple Text Classification Task that Stumps System-1 Thinkers},
+  author={Lopez, Angel Yahir Loredo and McDonald, Tyler and Emami, Ali},
+  journal={arXiv preprint arXiv:2412.01621},
+  year={2024}
+}
+
+
 @inproceedings{yao2022react,
   title = {{ReAct}: Synergizing Reasoning and Acting in Language Models},
   author = {Yao, Shunyu and Zhao, Jeffrey and Yu, Dian and Du, Nan and Shafran, Izhak and Narasimhan, Karthik and Cao, Yuan},