MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search
Abstract
A hierarchical search method generates detailed, experimentally actionable scientific hypotheses with LLMs by defining and optimizing a latent reward landscape, outperforming strong baselines on a new chemistry benchmark.
Large language models (LLMs) have shown promise in automating scientific hypothesis generation, yet existing approaches primarily yield coarse-grained hypotheses lacking critical methodological and experimental details. We introduce and formally define the novel task of fine-grained scientific hypothesis discovery, which entails generating detailed, experimentally actionable hypotheses from coarse initial research directions. We frame this as a combinatorial optimization problem and investigate the upper limits of LLMs' capacity to solve it when maximally leveraged. Specifically, we explore four foundational questions: (1) how to best harness an LLM's internal heuristics to formulate the fine-grained hypothesis it would itself judge as the most promising among all the hypotheses it could generate, based on its own internal scoring, thereby defining a latent reward landscape over the hypothesis space; (2) whether hypotheses the LLM judges as better exhibit stronger alignment with ground-truth hypotheses; (3) whether shaping the reward landscape with an ensemble of diverse LLMs of similar capacity yields better outcomes than defining it with repeated instances of the strongest LLM among them; and (4) whether an ensemble of identical LLMs provides a more reliable reward landscape than a single LLM. To address these questions, we propose a hierarchical search method that incrementally proposes and integrates details into the hypothesis, progressing from general concepts to specific experimental configurations. We show that this hierarchical process smooths the reward landscape and enables more effective optimization. Empirical evaluations on a new benchmark of expert-annotated fine-grained hypotheses from recent chemistry literature show that our method consistently outperforms strong baselines.
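The hierarchical search described in the abstract can be pictured as a beam search that adds one layer of detail per level and ranks candidates with an LLM-defined reward. Below is a minimal Python sketch of that idea, not the paper's actual implementation: `propose` and `score` are hypothetical stand-ins for the LLM refinement and LLM-as-judge calls, and the toy functions at the bottom exist only so the snippet runs without a model backend.

```python
from typing import Callable, List

def hierarchical_search(
    coarse_direction: str,
    levels: int,
    propose: Callable[[str, int], List[str]],  # hypothetical LLM call: refine a hypothesis at a given level
    score: Callable[[str], float],             # hypothetical LLM-as-judge reward over the hypothesis space
    beam_width: int = 3,
) -> str:
    """Beam search from a coarse research direction to a fine-grained hypothesis.

    At each level, candidate refinements add one layer of detail (general
    concept -> mechanism -> experimental configuration); the LLM-defined
    reward decides which candidates survive to the next level.
    """
    beam = [coarse_direction]
    for level in range(levels):
        candidates: List[str] = []
        for hypothesis in beam:
            candidates.extend(propose(hypothesis, level))
        # Rank candidates by the latent reward and keep the top beam_width.
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_width] or beam
    return max(beam, key=score)

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any LLM backend.
    toy_propose = lambda h, lvl: [f"{h}; detail {lvl}.{i}" for i in range(4)]
    toy_score = lambda h: float(len(h))  # placeholder for an LLM-judged score
    print(hierarchical_search("use MOFs for CO2 capture", levels=3,
                              propose=toy_propose, score=toy_score))
```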
Community
We introduce the task of fine-grained scientific hypothesis discovery—automatically generating detailed, experimentally actionable hypotheses from coarse directions using large language models.
By framing the problem as combinatorial optimization over a latent LLM reward landscape, we explore how ensemble scoring and hierarchical search improve hypothesis quality.
Our benchmark-driven evaluation in chemistry shows consistent gains over strong baselines. This work expands the frontier of LLMs for scientific discovery.
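Questions (3) and (4) in the abstract concern how the reward landscape is shaped by multiple judges. A minimal sketch of that ensemble-scoring idea, with each judge as a hypothetical callable standing in for an LLM scoring call (diverse models of similar capacity, or repeated samples from one model):

```python
from statistics import mean
from typing import Callable, Sequence

def ensemble_reward(
    hypothesis: str,
    judges: Sequence[Callable[[str], float]],  # each judge is a hypothetical LLM scoring call
) -> float:
    """Average several judges' scores to define a smoother, more reliable reward."""
    return mean(judge(hypothesis) for judge in judges)
```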
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback (2025)
- HypoBench: Towards Systematic and Principled Benchmarking for Hypothesis Generation (2025)
- LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models (2025)
- Introspective Growth: Automatically Advancing LLM Expertise in Technology Judgment (2025)
- YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering (2025)
- IRIS: Interactive Research Ideation System for Accelerating Scientific Discovery (2025)
- Multi-Agent LLM Judge: automatic personalized LLM judge design for evaluating natural language generation applications (2025)