EXP-Bench: Can AI Conduct AI Research Experiments?
Abstract
EXP-Bench evaluates AI agents' end-to-end research experiment capabilities through curated tasks from top AI papers, highlighting current limitations.
Automating AI research holds immense potential for accelerating scientific progress, yet current AI agents struggle with the complexities of rigorous, end-to-end experimentation. We introduce EXP-Bench, a novel benchmark designed to systematically evaluate AI agents on complete research experiments sourced from influential AI publications. Given a research question and incomplete starter code, EXP-Bench challenges AI agents to formulate hypotheses, design and implement experimental procedures, execute them, and analyze the results. To create such intricate and authentic tasks with high fidelity, we design a semi-autonomous pipeline that extracts and structures crucial experimental details from these research papers and their associated open-source code. Using this pipeline, we curated 461 AI research tasks from 51 top-tier AI research papers for EXP-Bench. Evaluations of leading LLM-based agents, such as OpenHands and IterativeAgent, on EXP-Bench reveal only partial capabilities: while scores on individual experimental aspects such as design or implementation correctness occasionally reach 20-35%, the success rate for complete, executable experiments is a mere 0.5%. By identifying these bottlenecks and providing realistic, step-by-step experiment procedures, EXP-Bench serves as a vital tool for improving future AI agents' ability to conduct AI research experiments. EXP-Bench is open-sourced at https://github.com/Just-Curieous/Curie/tree/main/benchmark/exp_bench.
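To make the task format concrete, here is a minimal Python sketch of what a single EXP-Bench task record and its end-to-end grading could look like. The field and function names below are illustrative assumptions for this sketch, not the benchmark's actual schema; consult the repository for the real task format.

```python
# Hypothetical sketch of an EXP-Bench-style task record and all-or-nothing grading.
# Field names are assumptions for illustration, not the benchmark's actual schema.
from dataclasses import dataclass


@dataclass
class ExpBenchTask:
    paper_id: str                 # source publication (one of the 51 papers)
    research_question: str        # question the agent must answer experimentally
    starter_code_dir: str         # path to the incomplete starter code given to the agent
    reference_design: str         # experimental design distilled from the paper, for grading
    reference_conclusion: str     # expected analysis/conclusion, for grading


def complete_success(design_ok: bool, impl_ok: bool, executed: bool, conclusion_ok: bool) -> bool:
    """A run counts as a complete success only if every aspect passes."""
    return design_ok and impl_ok and executed and conclusion_ok
```

This all-or-nothing grading illustrates why end-to-end success (0.5%) lags far behind per-aspect scores (20-35%): an agent must get the design, implementation, execution, and analysis right within the same run.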
Community
Can AI Agents Conduct AI Research Experiments?
Check out Curie and our benchmark here! https://github.com/Just-Curieous/Curie/tree/main/benchmark/exp_bench
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research (2025)
- AI Idea Bench 2025: AI Research Idea Generation Benchmark (2025)
- The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search (2025)
- AI-Researcher: Autonomous Scientific Innovation (2025)
- ELT-Bench: An End-to-End Benchmark for Evaluating AI Agents on ELT Pipelines (2025)
- MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation (2025)
- Generative to Agentic AI: Survey, Conceptualization, and Challenges (2025)