Recursive-SWE-bench
Evolution Beyond Linear Benchmarking
Recursive-SWE-bench extends the established SWE-bench framework to measure adaptive intelligence in software engineering tasks through recursive evaluation paradigms. While traditional benchmarks capture static, single-pass performance, Recursive-SWE-bench evaluates dynamic problem-solving capabilities across iterative refinement cycles.
Key innovation: Benchmark tasks self-modify as models interact with them, creating a feedback loop that more accurately reflects real-world software engineering challenges.
Why Recursive Benchmarking?
Traditional benchmarks evaluate models using a linear, static framework:
Input → Model → Output → Evaluation → Score
Real-world engineering is inherently recursive:
Problem → Solution → Testing → Feedback → Refinement → New Problem State → ...
Recursive-SWE-bench captures this dynamic process (a code sketch of the loop follows the list below), measuring:
- Adaptive reasoning: How models incorporate feedback into subsequent solution attempts
- Self-correction: The ability to identify and fix errors across iterations
- Learning efficiency: How quickly models converge on optimal solutions
- Meta-problem understanding: Recognition of patterns across related problem states
- Probabilistic optimization: Managing uncertainty in problem specifications and solution spaces
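To make the cycle concrete, here is a minimal sketch of a recursive evaluation loop in Python. All names (`model.solve`, `run_tests`, `task.evolve`, `pass_threshold`) are hypothetical placeholders chosen for illustration, not part of the package's API.

```python
# Conceptual sketch of the recursive evaluation cycle described above.
# `model`, `task`, and `run_tests` are hypothetical placeholders supplied
# by the caller, not objects defined by recursive-swe-bench.

def recursive_evaluate(model, task, run_tests, max_iterations=5):
    trajectory = []   # one score per refinement cycle
    feedback = None
    for _ in range(max_iterations):
        # Problem -> Solution: the model sees the current problem state
        # plus any feedback from the previous iteration.
        solution = model.solve(task.description(), feedback)

        # Solution -> Testing -> Feedback
        score, feedback = run_tests(task, solution)
        trajectory.append(score)

        # Feedback -> Refinement -> New Problem State: the task itself
        # evolves in response to the attempted solution.
        task = task.evolve(solution, feedback)

        if score >= task.pass_threshold:  # converged on this trajectory
            break
    return trajectory
```

The key difference from a single-pass benchmark is that both the feedback and the task state carry over from one iteration to the next.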
Core Innovations
Dynamic Task Evolution: Tasks transform based on model interactions, generating unique problem sequences for each evaluation run
Recursive Evaluation Metrics: Performance measured across solution trajectories rather than single attempts
Self-Modifying Test Harnesses: Evaluation environments that adapt to model capabilities, maintaining consistent challenge levels (one plausible realization is sketched after this list)
Meta-learning Assessment: Explicit measurement of knowledge transfer between related problems
Feedback Integration Protocols: Standardized frameworks for delivering actionable feedback to models
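As one plausible realization of a self-modifying test harness, the sketch below nudges a normalized difficulty parameter toward a target pass rate after each batch of attempts. The target, gain, and bounds are arbitrary illustrative values, not settings defined by the benchmark.

```python
# Illustrative difficulty controller for a self-modifying test harness.
# The target pass rate and gain are arbitrary choices; the benchmark may
# adapt difficulty in an entirely different way.

class DifficultyController:
    def __init__(self, target_pass_rate=0.5, gain=0.2,
                 min_difficulty=0.0, max_difficulty=1.0):
        self.target = target_pass_rate
        self.gain = gain
        self.lo, self.hi = min_difficulty, max_difficulty
        self.difficulty = 0.5  # normalized difficulty of generated tasks

    def update(self, recent_pass_rate):
        """Raise difficulty when the model passes too often, lower it otherwise."""
        error = recent_pass_rate - self.target
        self.difficulty = min(self.hi, max(self.lo, self.difficulty + self.gain * error))
        return self.difficulty

controller = DifficultyController()
print(controller.update(recent_pass_rate=0.8))  # difficulty rises from 0.50 to 0.56
```

A controller along these lines keeps strong models from saturating the benchmark and weak models from failing everything, which is what "maintaining consistent challenge levels" requires.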
Quick Start
```bash
# Install the package
pip install recursive-swe-bench

# Run a basic evaluation
rswe-bench evaluate --model your-model-name --task-set standard --iterations 5

# Generate a performance report
rswe-bench report --results-dir ./results --visualization recursive-trajectory
```
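If you prefer to drive evaluations from Python, a thin wrapper over the CLI works; the sketch below shells out to the exact commands shown above. The helper function names are ours, and the package may expose a richer native API.

```python
# Thin Python wrapper around the documented CLI commands. Only the flags
# shown in the Quick Start above are used; the helper names are ours.
import subprocess

def evaluate(model: str, task_set: str = "standard", iterations: int = 5) -> None:
    subprocess.run(
        ["rswe-bench", "evaluate",
         "--model", model,
         "--task-set", task_set,
         "--iterations", str(iterations)],
        check=True,
    )

def report(results_dir: str = "./results",
           visualization: str = "recursive-trajectory") -> None:
    subprocess.run(
        ["rswe-bench", "report",
         "--results-dir", results_dir,
         "--visualization", visualization],
        check=True,
    )

if __name__ == "__main__":
    evaluate("your-model-name")
    report()
```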
Benchmark Structure
Recursive-SWE-bench organizes tasks into recursive trajectories built from four core components (a rough interface sketch follows this list):
- Task Generators: Dynamically create problem instances based on model interaction history
- Feedback Modules: Provide standardized assessment of solutions with actionable insights
- State Trackers: Maintain the evolving state of problems across solution attempts
- Meta-Pattern Evaluators: Assess model ability to identify patterns across problem sequences
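To make the division of responsibilities concrete, here is a rough sketch of how the four components could be expressed as Python interfaces. The class names, method signatures, and `Feedback` payload are illustrative assumptions; consult the source for the actual definitions.

```python
# Rough sketch of the four components as interfaces. Names and signatures
# are illustrative assumptions, not the package's actual classes.
from dataclasses import dataclass, field
from typing import Any, Protocol

@dataclass
class Feedback:
    score: float                                        # normalized solution quality in [0, 1]
    messages: list[str] = field(default_factory=list)   # actionable hints for the next attempt

class TaskGenerator(Protocol):
    def next_task(self, history: list[Feedback]) -> Any:
        """Create the next problem instance from the interaction history."""

class FeedbackModule(Protocol):
    def assess(self, task: Any, solution: str) -> Feedback:
        """Run tests and checks, then package the result as standardized feedback."""

class StateTracker(Protocol):
    def record(self, task: Any, solution: str, feedback: Feedback) -> None:
        """Persist the evolving problem state across solution attempts."""

class MetaPatternEvaluator(Protocol):
    def score_transfer(self, trajectories: list[list[Feedback]]) -> float:
        """Estimate how well the model exploits patterns across related problems."""
```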
Task Categories
Category | Description | Recursive Elements |
---|---|---|
Bug Fixing | Identify and resolve issues in existing code | Error patterns transform based on fix attempts |
Feature Implementation | Add functionality to existing codebases | Requirements evolve as implementation progresses |
Refactoring | Improve code structure without changing behavior | Complexity dynamically adjusts to refactoring success |
System Design | Create architecture for complex systems | Design constraints adapt to proposed solutions |
Test Generation | Create effective test suites | Test coverage requirements shift with implementation |
Documentation | Create clear technical documentation | Clarity targets adapt to explanation attempts |
Performance Metrics
Recursive-SWE-bench evaluates models using both traditional and recursive metrics:
Traditional Metrics
- Pass@k for varying k (the standard estimator is sketched after this list)
- Execution accuracy
- Code similarity to human solutions
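Pass@k is typically computed with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021): draw n samples per task, count the c that pass, and average 1 - C(n-c, k)/C(n, k) over tasks. A minimal implementation of the per-task term:

```python
# Per-task unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
# where n samples were drawn and c of them passed.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # fewer than k failures: every size-k subset contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```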
Recursive Metrics
- Convergence Rate: How quickly models reach stable solutions
- Adaptation Efficiency: Performance improvements per feedback iteration
- Transfer Learning Factor: Performance gains across related problems
- Learning Curve Area: Integration of performance across all iterations
- Probabilistic Solution Quality: Distribution of solution quality across runs
- Dynamic Complexity Handling: Performance across varying problem complexity (illustrative formalizations of several of these metrics follow below)
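The README does not pin down formulas for the recursive metrics, so the sketch below shows one plausible formalization over a per-task score trajectory (one score in [0, 1] per feedback iteration); the benchmark's exact definitions may differ.

```python
# Illustrative formalizations of three recursive metrics over a score
# trajectory (one score in [0, 1] per iteration). The benchmark's exact
# definitions may differ.

def convergence_rate(scores, threshold=0.95):
    """Reciprocal of the first iteration whose score reaches the threshold
    (0.0 if the trajectory never converges)."""
    for i, s in enumerate(scores, start=1):
        if s >= threshold:
            return 1.0 / i
    return 0.0

def adaptation_efficiency(scores):
    """Average score improvement per feedback iteration."""
    if len(scores) < 2:
        return 0.0
    return (scores[-1] - scores[0]) / (len(scores) - 1)

def learning_curve_area(scores):
    """Normalized area under the score-vs-iteration curve (trapezoidal rule)."""
    if len(scores) < 2:
        return float(scores[0]) if scores else 0.0
    area = sum((a + b) / 2 for a, b in zip(scores, scores[1:]))
    return area / (len(scores) - 1)

# Example: a model that steadily improves across three feedback iterations.
trajectory = [0.4, 0.7, 0.96]
print(convergence_rate(trajectory), adaptation_efficiency(trajectory), learning_curve_area(trajectory))
```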
Sample Results
Note: Preliminary evaluations across a range of models indicate that recursive evaluation reveals capabilities not captured by traditional single-pass benchmarks.
Citation
If you use Recursive-SWE-bench in your research, please cite:
```bibtex
@article{recursive2025swebench,
  title   = {Recursive-SWE-bench: Evaluating Adaptive Programming Intelligence Through Self-Modifying Benchmarks},
  author  = {Recursive Labs Team},
  journal = {arXiv preprint arXiv:2505.12345},
  year    = {2025}
}
```
Contributing
We welcome contributions to Recursive-SWE-bench! See CONTRIBUTING.md for guidelines.
Key Areas for Contribution
- Additional recursive task generators
- Enhanced feedback mechanisms
- New evaluation metrics
- Integration with more models and frameworks
- Documentation and tutorials
License
Recursive-SWE-bench is released under the MIT License.
Acknowledgments
Recursive-SWE-bench builds upon the foundation established by the original SWE-bench, created by the Princeton NLP group. We extend our gratitude for their pioneering work while taking benchmark evaluation in new directions.