# Recursive SWE-bench

## Open Source

[PolyForm Noncommercial 1.0.0](https://polyformproject.org/licenses/noncommercial/1.0.0/) · [CC BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/)

## Evolution Beyond Linear Benchmarking

Recursive-SWE-bench extends the established [**`SWE-bench`**](https://github.com/princeton-nlp/SWE-bench) framework to measure adaptive intelligence in software engineering tasks through recursive evaluation. While traditional benchmarks measure static, single-pass performance, Recursive-SWE-bench evaluates dynamic problem-solving capabilities across iterative refinement cycles.

**Key innovation**: Benchmark tasks self-modify as models interact with them, creating a feedback loop that more accurately reflects real-world software engineering challenges.

## Why Recursive Benchmarking?

Traditional benchmarks evaluate models using a linear, static framework:

```
Input → Model → Output → Evaluation → Score
```

Real-world engineering is inherently recursive:

```
Problem → Solution → Testing → Feedback → Refinement → New Problem State → ...
```

Recursive-SWE-bench captures this dynamic process, measuring:

- **Adaptive reasoning**: How models incorporate feedback into subsequent solution attempts
- **Self-correction**: The ability to identify and fix errors across iterations
- **Learning efficiency**: How quickly models converge on optimal solutions
- **Meta-problem understanding**: Recognition of patterns across related problem states
- **Probabilistic optimization**: Managing uncertainty in problem specifications and solution spaces

## Core Innovations

1. **Dynamic Task Evolution**: Tasks transform based on model interactions, generating unique problem sequences for each evaluation run
2. **Recursive Evaluation Metrics**: Performance is measured across solution trajectories rather than single attempts
3. **Self-Modifying Test Harnesses**: Evaluation environments adapt to model capabilities, maintaining a consistent challenge level
4. **Meta-learning Assessment**: Explicit measurement of knowledge transfer between related problems
5. **Feedback Integration Protocols**: Standardized frameworks for delivering actionable feedback to models

## Quick Start

```bash
# Install the package
pip install recursive-swe-bench

# Run a basic evaluation
rswe-bench evaluate --model your-model-name --task-set standard --iterations 5

# Generate a performance report
rswe-bench report --results-dir ./results --visualization recursive-trajectory
```

## Benchmark Structure

Recursive-SWE-bench organizes tasks into recursive trajectories built from four components (see the sketch after this list):

- **Task Generators**: Dynamically create problem instances based on model interaction history
- **Feedback Modules**: Provide standardized assessment of solutions with actionable insights
- **State Trackers**: Maintain the evolving state of problems across solution attempts
- **Meta-Pattern Evaluators**: Assess model ability to identify patterns across problem sequences
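To make the interplay of these components concrete, here is a minimal, self-contained sketch of a single recursive trajectory. It is illustrative only: `TaskState`, `generate_task`, `score_solution`, and `run_trajectory` are hypothetical names rather than the package's actual API, and the length-based scorer stands in for a real test harness.

```python
# Minimal sketch of one recursive trajectory. All names here are
# hypothetical: they mirror the components described above (task
# generator, feedback module, state tracker), not the real package API.

from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TaskState:
    """Evolving problem state carried across solution attempts."""
    description: str
    iteration: int = 0
    history: List[float] = field(default_factory=list)  # per-iteration scores


def generate_task(state: TaskState) -> str:
    # Task generator: a real one would mutate the problem (new failing
    # tests, shifted requirements) based on the interaction history.
    return f"{state.description} (revision {state.iteration})"


def score_solution(solution: str) -> float:
    # Feedback module: a real one would run the test harness; this stub
    # fakes a score from solution length to keep the example runnable.
    return min(1.0, len(solution) / 100)


def run_trajectory(model: Callable[[str], str], state: TaskState,
                   iterations: int = 5, target: float = 0.9) -> TaskState:
    """Recursive loop: task -> solution -> feedback -> new problem state."""
    for _ in range(iterations):
        task = generate_task(state)       # dynamic task evolution
        solution = model(task)            # model attempt
        score = score_solution(solution)  # standardized feedback
        state.history.append(score)       # state tracker records the run
        state.iteration += 1
        if score >= target:               # early exit once converged
            break
    return state


def make_toy_model() -> Callable[[str], str]:
    """Toy 'model' whose answers grow more complete on each attempt."""
    calls = {"n": 0}

    def model(task: str) -> str:
        calls["n"] += 1
        return "patch line\n" * (3 * calls["n"])

    return model


final = run_trajectory(make_toy_model(), TaskState("Fix the failing parser"))
print(final.history)  # [0.33, 0.66, 0.99] -- improves, then converges
```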
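Given a per-iteration score history like the one this loop records, several of the recursive metrics defined under Performance Metrics below can be computed directly. A minimal sketch, assuming scores normalized to [0, 1]; the benchmark's exact definitions may differ:

```python
from typing import List


def convergence_iteration(history: List[float], target: float = 0.9) -> int:
    """Convergence-rate proxy: first iteration whose score meets the
    target, or len(history) if the trajectory never converges."""
    for i, score in enumerate(history):
        if score >= target:
            return i
    return len(history)


def adaptation_efficiency(history: List[float]) -> float:
    """Average score improvement per feedback iteration."""
    if len(history) < 2:
        return 0.0
    return (history[-1] - history[0]) / (len(history) - 1)


def learning_curve_area(history: List[float]) -> float:
    """Area under the score trajectory, normalized to [0, 1]."""
    return sum(history) / len(history) if history else 0.0


print(convergence_iteration([0.33, 0.66, 0.99]))  # 2
print(adaptation_efficiency([0.33, 0.66, 0.99]))  # ≈ 0.33
print(learning_curve_area([0.33, 0.66, 0.99]))    # ≈ 0.66
```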
## Task Categories

| Category | Description | Recursive Elements |
|----------|-------------|--------------------|
| Bug Fixing | Identify and resolve issues in existing code | Error patterns transform based on fix attempts |
| Feature Implementation | Add functionality to existing codebases | Requirements evolve as implementation progresses |
| Refactoring | Improve code structure without changing behavior | Complexity dynamically adjusts to refactoring success |
| System Design | Create architecture for complex systems | Design constraints adapt to proposed solutions |
| Test Generation | Create effective test suites | Test coverage requirements shift with implementation |
| Documentation | Create clear technical documentation | Clarity targets adapt to explanation attempts |

## Performance Metrics

Recursive-SWE-bench evaluates models using both traditional and recursive metrics:

### Traditional Metrics

- Pass@k (for varying k)
- Execution accuracy
- Code similarity to human solutions

### Recursive Metrics

- **Convergence Rate**: How quickly models reach stable solutions
- **Adaptation Efficiency**: Performance improvements per feedback iteration
- **Transfer Learning Factor**: Performance gains across related problems
- **Learning Curve Area**: Integration of performance across all iterations
- **Probabilistic Solution Quality**: Distribution of solution quality across runs
- **Dynamic Complexity Handling**: Performance across varying problem complexity

## Sample Results

Here's how various models perform on Recursive-SWE-bench: