Kunal Pai committed
Commit 06dc658 · 1 Parent(s): 4f96523
Add citation for NYT Connections benchmark and update references
- paper/conference_101719.tex +1 -1
- paper/references.bib +8 -0
paper/conference_101719.tex CHANGED
@@ -197,7 +197,7 @@ This task evaluates HASHIRU's capacity to critically assess academic work by sim
 To evaluate broader reasoning, knowledge retrieval, and problem-solving capabilities under different constraints, we employ a set of challenging benchmarks and puzzle-like tasks:
 \begin{itemize}
 \item \textbf{Humanity's Last Exam \cite{phan2025humanitysexam}:} A benchmark designed to test graduate-level technical knowledge and complex reasoning across multiple domains. Success requires deep understanding and sophisticated problem-solving, likely necessitating access to powerful external LLMs managed effectively within HASHIRU's hybrid framework.
-\item \textbf{NYT Connections:} This popular puzzle requires identifying hidden semantic relationships or themes to categorize 16 words into four distinct groups. Solving this involves associative reasoning, broad world knowledge, and potentially hypothesis testing across different potential groupings, testing knowledge access and combinatorial reasoning coordination.
+\item \textbf{NYT Connections \cite{lopez2024nyt}:} This popular puzzle requires identifying hidden semantic relationships or themes to categorize 16 words into four distinct groups. Solving this involves associative reasoning, broad world knowledge, and potentially hypothesis testing across different potential groupings, testing knowledge access and combinatorial reasoning coordination.
 \item \textbf{Wordle:} The daily word puzzle requires deductive reasoning to identify a five-letter word within six guesses, using feedback on correct letters and positions. This tests logical deduction, constraint satisfaction, and vocabulary knowledge. It serves as a good test case for comparing the efficiency (speed, cost, number of guesses) of local versus external models for iterative reasoning. We assume interaction via a simulated game environment.
 \item \textbf{Globle:} This geographic deduction game requires identifying a target country based on proximity feedback from guesses. It tests geographic knowledge retrieval, spatial reasoning, and iterative strategy refinement based on feedback (distance, direction). We assume interaction via a simulated game environment.
 \end{itemize}
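Note: the Wordle and Globle items above both assume interaction via a simulated game environment, but the committed text does not specify that harness. Purely as an illustration of what such a harness might look like, here is a minimal Python sketch of Wordle-style feedback; the function name score_guess and the G/Y/- encoding are assumptions for this sketch, not part of the paper.

# Minimal sketch of a simulated Wordle environment (illustrative only).
# Feedback encoding: "G" = correct letter and position, "Y" = letter present
# elsewhere in the secret word, "-" = letter absent.

def score_guess(secret: str, guess: str) -> str:
    assert len(secret) == len(guess) == 5
    feedback = ["-"] * 5
    remaining = []

    # First pass: mark exact (green) matches and collect unmatched secret letters.
    for i, (s, g) in enumerate(zip(secret, guess)):
        if s == g:
            feedback[i] = "G"
        else:
            remaining.append(s)

    # Second pass: mark present-but-misplaced (yellow) letters, consuming
    # each unmatched secret letter at most once.
    for i, g in enumerate(guess):
        if feedback[i] == "-" and g in remaining:
            feedback[i] = "Y"
            remaining.remove(g)

    return "".join(feedback)


if __name__ == "__main__":
    # Example: secret "crane", guess "caner" -> "GYYYY"
    print(score_guess("crane", "caner"))

A Globle harness would be analogous, returning distance and direction feedback for a guessed country instead of per-letter feedback.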
paper/references.bib CHANGED
@@ -12,6 +12,14 @@
   year = {2023}
 }
 
+@article{lopez2024nyt,
+  title={NYT-Connections: A Deceptively Simple Text Classification Task that Stumps System-1 Thinkers},
+  author={Lopez, Angel Yahir Loredo and McDonald, Tyler and Emami, Ali},
+  journal={arXiv preprint arXiv:2412.01621},
+  year={2024}
+}
+
+
 @inproceedings{yao2022react,
   title = {{ReAct}: Synergizing Reasoning and Acting in Language Models},
   author = {Yao, Shunyu and Zhao, Jeffrey and Yu, Dian and Du, Nan and Shafran, Izhak and Narasimhan, Karthik and Cao, Yuan},