# Recursive SWE-bench
## Open Source
[PolyForm Noncommercial 1.0.0](https://polyformproject.org/licenses/noncommercial/1.0.0/) · [CC BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/)
## Evolution Beyond Linear Benchmarking
Recursive-SWE-bench extends the established [**`SWE-bench`**](https://github.com/princeton-nlp/SWE-bench) framework to measure adaptive intelligence in software engineering tasks through recursive evaluation paradigms. While traditional benchmarks measure static, single-pass performance, Recursive-SWE-bench evaluates dynamic problem-solving capabilities across iterative refinement cycles.
**Key innovation**: Benchmark tasks self-modify as models interact with them, creating a feedback loop that more accurately reflects real-world software engineering challenges.
## Why Recursive Benchmarking?
Traditional benchmarks evaluate models using a linear, static framework:
```
Input → Model → Output → Evaluation → Score
```
Real-world engineering is inherently recursive:
```
Problem → Solution → Testing → Feedback → Refinement → New Problem State → ...
```
Recursive-SWE-bench captures this dynamic process (a minimal loop sketch follows the list below), measuring:
- **Adaptive reasoning**: How models incorporate feedback into subsequent solution attempts
- **Self-correction**: The ability to identify and fix errors across iterations
- **Learning efficiency**: How quickly models converge on optimal solutions
- **Meta-problem understanding**: Recognition of patterns across related problem states
- **Probabilistic optimization**: Managing uncertainty in problem specifications and solution spaces
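The recursive loop above can be made concrete with a short driver. This is a minimal sketch, assuming illustrative `model` and `evaluator` objects with `solve`, `assess`, and `evolve` methods; none of these names come from the released package.

```python
# Minimal sketch of the Problem -> Solution -> Feedback -> New Problem State loop.
# `model` and `evaluator` are illustrative stand-ins, not the package API.
from dataclasses import dataclass, field


@dataclass
class TaskState:
    description: str                                   # current problem statement
    feedback: list = field(default_factory=list)       # accumulated feedback


def recursive_evaluate(model, evaluator, task_state, max_iterations=5):
    """Run the recursive loop and return the per-iteration score trajectory."""
    scores = []
    for _ in range(max_iterations):
        solution = model.solve(task_state)             # model attempt
        score, feedback = evaluator.assess(solution)   # testing + feedback
        scores.append(score)
        task_state.feedback.append(feedback)           # feed the feedback back in
        task_state = evaluator.evolve(task_state, solution)  # task self-modifies
        if score >= 1.0:                               # converged on a passing solution
            break
    return scores
```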
## Core Innovations
1. **Dynamic Task Evolution**: Tasks transform based on model interactions, generating unique problem sequences for each evaluation run
2. **Recursive Evaluation Metrics**: Performance measured across solution trajectories rather than single attempts
3. **Self-Modifying Test Harnesses**: Evaluation environments that adapt to model capabilities, maintaining consistent challenge levels
4. **Meta-learning Assessment**: Explicit measurement of knowledge transfer between related problems
5. **Feedback Integration Protocols**: Standardized frameworks for delivering actionable feedback to models
## Quick Start
```bash
# Install the package
pip install recursive-swe-bench
# Run a basic evaluation
rswe-bench evaluate --model your-model-name --task-set standard --iterations 5
# Generate a performance report
rswe-bench report --results-dir ./results --visualization recursive-trajectory
```
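If you would rather drive an evaluation from Python than from the CLI, a run might look roughly like the following. The `evaluate` and `report` entry points and their keyword arguments are assumptions that mirror the CLI flags above, not a documented API.

```python
# Hypothetical Python-level usage; argument names mirror the CLI flags above.
from recursive_swe_bench import evaluate, report  # assumed entry points

results = evaluate(
    model="your-model-name",   # same identifier as --model
    task_set="standard",       # same as --task-set
    iterations=5,              # same as --iterations
)

report(results_dir="./results", visualization="recursive-trajectory")
```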
## Benchmark Structure
Recursive-SWE-bench organizes tasks into recursive trajectories built from four components (illustrative interfaces follow the list):
- **Task Generators**: Dynamically create problem instances based on model interaction history
- **Feedback Modules**: Provide standardized assessment of solutions with actionable insights
- **State Trackers**: Maintain the evolving state of problems across solution attempts
- **Meta-Pattern Evaluators**: Assess model ability to identify patterns across problem sequences
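The responsibilities of these components can be summarized as interfaces. The class names are taken from the list above, but the method signatures below are illustrative assumptions; the benchmark's actual classes may differ.

```python
# Illustrative interfaces for the four components; signatures are assumptions.
from typing import Protocol


class TaskGenerator(Protocol):
    def next_task(self, history: list[dict]) -> dict:
        """Create a problem instance from the model's interaction history."""
        ...


class FeedbackModule(Protocol):
    def assess(self, solution: str, task: dict) -> tuple[float, str]:
        """Return a score and actionable feedback for a solution."""
        ...


class StateTracker(Protocol):
    def update(self, task: dict, solution: str, feedback: str) -> dict:
        """Record an attempt and return the evolved problem state."""
        ...


class MetaPatternEvaluator(Protocol):
    def transfer_score(self, trajectories: list[list[float]]) -> float:
        """Measure knowledge transfer across related problem sequences."""
        ...
```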
## Task Categories
| Category | Description | Recursive Elements |
|----------|-------------|-------------------|
| Bug Fixing | Identify and resolve issues in existing code | Error patterns transform based on fix attempts |
| Feature Implementation | Add functionality to existing codebases | Requirements evolve as implementation progresses |
| Refactoring | Improve code structure without changing behavior | Complexity dynamically adjusts to refactoring success |
| System Design | Create architecture for complex systems | Design constraints adapt to proposed solutions |
| Test Generation | Create effective test suites | Test coverage requirements shift with implementation |
| Documentation | Create clear technical documentation | Clarity targets adapt to explanation attempts |
## Performance Metrics
Recursive-SWE-bench evaluates models using both traditional and recursive metrics (a computation sketch for the recursive metrics follows the lists below):
### Traditional Metrics
- Pass@k (for varying k)
- Execution accuracy
- Code similarity to human solutions
### Recursive Metrics
- **Convergence Rate**: How quickly models reach stable solutions
- **Adaptation Efficiency**: Performance improvements per feedback iteration
- **Transfer Learning Factor**: Performance gains across related problems
- **Learning Curve Area**: Integration of performance across all iterations
- **Probabilistic Solution Quality**: Distribution of solution quality across runs
- **Dynamic Complexity Handling**: Performance across varying problem complexity
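To make the recursive metrics concrete, here is a small sketch of how two of them could be computed from a per-iteration score trajectory. The formulas are illustrative, not the benchmark's normative definitions.

```python
# Illustrative computations over a per-iteration score trajectory.
def learning_curve_area(scores: list[float]) -> float:
    """Mean score across iterations (area under the learning curve,
    normalized by the number of iterations)."""
    return sum(scores) / len(scores)


def adaptation_efficiency(scores: list[float]) -> float:
    """Average score improvement per feedback iteration."""
    gains = [later - earlier for earlier, later in zip(scores, scores[1:])]
    return sum(gains) / len(gains) if gains else 0.0


trajectory = [0.2, 0.45, 0.7, 0.85, 0.9]    # example per-iteration scores
print(learning_curve_area(trajectory))       # ~0.62
print(adaptation_efficiency(trajectory))     # ~0.175
```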
## Sample Results
Here's how various models perform on Recursive-SWE-bench:
<p align="center">
<img src="docs/assets/performance-comparison.png" alt="Performance Comparison" width="650"/>
</p>
*Note: These preliminary results demonstrate how recursive evaluation reveals capabilities not captured by traditional single-pass benchmarks.*
## Citation
If you use Recursive-SWE-bench in your research, please cite:
```bibtex
@article{recursive2025swebench,
  title={Recursive-SWE-bench: Evaluating Adaptive Programming Intelligence Through Self-Modifying Benchmarks},
  author={Recursive Labs Team},
  journal={arXiv preprint arXiv:2505.12345},
  year={2025}
}
```
## Contributing
We welcome contributions to Recursive-SWE-bench! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
### Key Areas for Contribution
- Additional recursive task generators
- Enhanced feedback mechanisms
- New evaluation metrics
- Integration with more models and frameworks
- Documentation and tutorials
## License
Recursive-SWE-bench is released under the [MIT License](LICENSE).
## Acknowledgments
Recursive-SWE-bench builds upon the foundation established by the original SWE-bench, created by the Princeton NLP group. We thank them for their pioneering work while taking benchmark evaluation in new directions.