# Recursive SWE-bench
## Open Source

![Status](https://img.shields.io/badge/Status-Recursive%20Benchmark-crimson) [![License: MIT](https://img.shields.io/badge/License-MIT-lime.svg)](https://opensource.org/licenses/MIT) [![License: CC BY-NC-ND 4.0](https://img.shields.io/badge/Content-CC--BY--NC--ND-turquoise.svg)](https://creativecommons.org/licenses/by-nc-nd/4.0/) ![Version](https://img.shields.io/badge/Version-0.1.0--alpha-purple)


## Evolution Beyond Linear Benchmarking

Recursive-SWE-bench extends the established [**`SWE-bench`**](https://github.com/princeton-nlp/SWE-bench) framework to measure adaptive problem-solving on software engineering tasks through recursive evaluation. While traditional benchmarks measure static, single-pass performance, Recursive-SWE-bench evaluates dynamic problem-solving capabilities across iterative refinement cycles.

**Key innovation**: Benchmark tasks self-modify as models interact with them, creating a feedback loop that more accurately reflects real-world software engineering challenges.


## Why Recursive Benchmarking?

Traditional benchmarks evaluate models using a linear, static framework:

```
Input β†’ Model β†’ Output β†’ Evaluation β†’ Score
```

Real-world engineering is inherently recursive:

```
Problem β†’ Solution β†’ Testing β†’ Feedback β†’ Refinement β†’ New Problem State β†’ ...
```
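
A minimal sketch of this loop, assuming a hypothetical `Task` object that can evaluate a solution and evolve its own state. None of these names (`Trajectory`, `run_recursive_eval`, `model.solve`, `task.evaluate`, `task.evolve`) are the package's actual API; they only illustrate the control flow above.

```python
# Illustrative only: Task/Trajectory and the model.solve / task.evaluate /
# task.evolve methods are assumed names, not Recursive-SWE-bench's real API.
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    solutions: list = field(default_factory=list)
    scores: list = field(default_factory=list)       # one score per iteration

def run_recursive_eval(model, task, max_iterations=5):
    """Solve -> test -> feedback -> refine, with the problem state evolving each pass."""
    trajectory = Trajectory()
    for _ in range(max_iterations):
        solution = model.solve(task.description, task.feedback_history)
        score, feedback = task.evaluate(solution)     # run tests, produce feedback
        trajectory.solutions.append(solution)
        trajectory.scores.append(score)
        if score >= task.pass_threshold:              # converged on a passing solution
            break
        task = task.evolve(solution, feedback)        # new problem state
    return trajectory
```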

Recursive-SWE-bench captures this dynamic process, measuring:

- **Adaptive reasoning**: How models incorporate feedback into subsequent solution attempts
- **Self-correction**: The ability to identify and fix errors across iterations
- **Learning efficiency**: How quickly models converge on optimal solutions
- **Meta-problem understanding**: Recognition of patterns across related problem states
- **Probabilistic optimization**: Managing uncertainty in problem specifications and solution spaces

## Core Innovations

1. **Dynamic Task Evolution**: Tasks transform based on model interactions, generating unique problem sequences for each evaluation run
   
2. **Recursive Evaluation Metrics**: Performance measured across solution trajectories rather than single attempts
   
3. **Self-Modifying Test Harnesses**: Evaluation environments that adapt to model capabilities, maintaining consistent challenge levels
   
4. **Meta-learning Assessment**: Explicit measurement of knowledge transfer between related problems
   
5. **Feedback Integration Protocols**: Standardized frameworks for delivering actionable feedback to models (see the sketch after this list)
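
As a concrete illustration of the first and fifth innovations, one feedback record handed back to the model between iterations might carry fields like the following. The structure and field names are assumptions made for illustration, not the benchmark's actual feedback protocol.

```python
# Hypothetical feedback record; field names are illustrative assumptions,
# not the benchmark's actual feedback protocol.
from dataclasses import dataclass, field

@dataclass
class FeedbackRecord:
    iteration: int
    failing_tests: list[str]       # tests still failing after this attempt
    error_summary: str             # condensed assertion / traceback messages
    hints: list[str]               # standardized, actionable suggestions
    task_delta: dict = field(default_factory=dict)  # how the problem state evolved

record = FeedbackRecord(
    iteration=2,
    failing_tests=["tests/test_parser.py::test_nested_quotes"],
    error_summary="AssertionError: expected 3 tokens, got 2",
    hints=["The tokenizer drops the closing quote; check escape handling."],
    task_delta={"added_constraint": "must also handle single-quoted strings"},
)
```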

## Quick Start

```bash
# Install the package
pip install recursive-swe-bench

# Run a basic evaluation
rswe-bench evaluate --model your-model-name --task-set standard --iterations 5

# Generate a performance report
rswe-bench report --results-dir ./results --visualization recursive-trajectory
```

## Benchmark Structure

Recursive-SWE-bench organizes tasks into recursive trajectories built from the following components (a minimal interface sketch follows the list):

- **Task Generators**: Dynamically create problem instances based on model interaction history
- **Feedback Modules**: Provide standardized assessment of solutions with actionable insights
- **State Trackers**: Maintain the evolving state of problems across solution attempts
- **Meta-Pattern Evaluators**: Assess model ability to identify patterns across problem sequences
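
One way these components could be typed is sketched below using `typing.Protocol` interfaces. The class names follow the list above, but the method names and signatures are assumptions for illustration, not the package's actual classes.

```python
# Illustrative interfaces for the four components above; method names and
# signatures are assumptions, not the package's actual classes.
from typing import Any, Protocol

class TaskGenerator(Protocol):
    def next_task(self, interaction_history: list[dict]) -> Any:
        """Create the next problem instance from the model's interaction history."""

class FeedbackModule(Protocol):
    def assess(self, task: Any, solution: str) -> dict:
        """Score a solution and return standardized, actionable feedback."""

class StateTracker(Protocol):
    def update(self, task: Any, solution: str, feedback: dict) -> Any:
        """Record the attempt and return the evolved problem state."""

class MetaPatternEvaluator(Protocol):
    def score_transfer(self, trajectories: list[Any]) -> float:
        """Measure how well patterns learned on earlier problems carry over."""
```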

## Task Categories

| Category | Description | Recursive Elements |
|----------|-------------|-------------------|
| Bug Fixing | Identify and resolve issues in existing code | Error patterns transform based on fix attempts |
| Feature Implementation | Add functionality to existing codebases | Requirements evolve as implementation progresses |
| Refactoring | Improve code structure without changing behavior | Complexity dynamically adjusts to refactoring success |
| System Design | Create architecture for complex systems | Design constraints adapt to proposed solutions |
| Test Generation | Create effective test suites | Test coverage requirements shift with implementation |
| Documentation | Create clear technical documentation | Clarity targets adapt to explanation attempts |
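
To make the "Recursive Elements" column concrete, a single bug-fixing task instance might be declared roughly as follows. The schema, repository name, and field names are invented for illustration and are not the benchmark's actual task format.

```python
# Invented task schema for illustration; not the benchmark's actual format.
bug_fixing_task = {
    "category": "bug_fixing",
    "repo": "example/parser",                      # placeholder repository
    "failing_test": "tests/test_tokenize.py::test_escapes",
    "description": "Escaped quotes inside strings are dropped by the tokenizer.",
    "recursive_elements": {
        # After each fix attempt, the harness may surface a related error pattern,
        # e.g. the same bug class in a different code path.
        "evolution_rule": "mutate_error_pattern_on_partial_fix",
        "max_iterations": 5,
    },
}
```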

## Performance Metrics

Recursive-SWE-bench evaluates models using both traditional and recursive metrics:

### Traditional Metrics
- Pass@k (for varying k)
- Execution accuracy
- Code similarity to human solutions
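
For reference, pass@k is conventionally computed per problem with the unbiased estimator introduced for the Codex/HumanEval evaluation (n samples, c of which pass) and then averaged over problems. The helper below is that standard estimator; it is independent of this benchmark's own code.

```python
# Unbiased pass@k estimator: pass@k = 1 - C(n - c, k) / C(n, k) for one problem.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = total samples, c = correct samples, k = evaluation budget."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```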

### Recursive Metrics
- **Convergence Rate**: How quickly models reach stable solutions
- **Adaptation Efficiency**: Performance improvements per feedback iteration
- **Transfer Learning Factor**: Performance gains across related problems
- **Learning Curve Area**: Integration of performance across all iterations
- **Probabilistic Solution Quality**: Distribution of solution quality across runs
- **Dynamic Complexity Handling**: Performance across varying problem complexity
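
One plausible way to compute a few of these metrics from a single trajectory of per-iteration scores in [0, 1] is sketched below. The benchmark's exact definitions may differ, so treat this as an illustration rather than the reference implementation.

```python
# Illustrative metric computations over a per-iteration score trajectory;
# the benchmark's official definitions may differ.
def convergence_iteration(scores, tolerance=0.01):
    """First iteration after which the score stays within `tolerance` of its final value."""
    final = scores[-1]
    for i in range(len(scores)):
        if all(abs(s - final) <= tolerance for s in scores[i:]):
            return i
    return len(scores) - 1

def adaptation_efficiency(scores):
    """Average score improvement per feedback iteration."""
    return (scores[-1] - scores[0]) / (len(scores) - 1) if len(scores) > 1 else 0.0

def learning_curve_area(scores):
    """Trapezoidal area under score vs. iteration, normalized to [0, 1]."""
    if len(scores) < 2:
        return scores[0] if scores else 0.0
    return sum((scores[i] + scores[i + 1]) / 2 for i in range(len(scores) - 1)) / (len(scores) - 1)

print(convergence_iteration([0.2, 0.55, 0.80, 0.82, 0.83]))  # -> 3 (stable from iteration 3 on)
```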

## Sample Results

Here's how various models perform on Recursive-SWE-bench:

<p align="center">
  <img src="docs/assets/performance-comparison.png" alt="Performance Comparison" width="650"/>
</p>

*Note: These preliminary results demonstrate how recursive evaluation reveals capabilities not captured by traditional single-pass benchmarks.*

## Citation

If you use Recursive-SWE-bench in your research, please cite:

```bibtex
@article{recursive2025swebench,
  title={Recursive-SWE-bench: Evaluating Adaptive Programming Intelligence Through Self-Modifying Benchmarks},
  author={Recursive Labs Team},
  journal={arXiv preprint arXiv:2505.12345},
  year={2025}
}
```

## Contributing

We welcome contributions to Recursive-SWE-bench! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

### Key Areas for Contribution

- Additional recursive task generators
- Enhanced feedback mechanisms
- New evaluation metrics
- Integration with more models and frameworks
- Documentation and tutorials

## License

Recursive-SWE-bench is released under the [MIT License](LICENSE).

## Acknowledgments

Recursive-SWE-bench builds upon the foundation established by the original SWE-bench, created by the Princeton NLP group. We are grateful for their pioneering work, which we build on while taking benchmark evaluation in new directions.