Upload 7 files

- LICENSE +21 -0
- README.md +139 -0
- core/recursive_task.py +460 -0
- evaluation/harness.py +445 -0
- models/anthropic.py +866 -0
- models/base_models.py +259 -0
- task_generators/bug_fixing.py +0 -0
LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025 ghchris2021

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md
ADDED
@@ -0,0 +1,139 @@
# Recursive SWE-bench

## Open Source

[PolyForm Noncommercial 1.0.0](https://polyformproject.org/licenses/noncommercial/1.0.0/) · [CC BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/)


## Evolution Beyond Linear Benchmarking

Recursive-SWE-bench extends the established [**`SWE-bench`**](https://github.com/princeton-nlp/SWE-bench) framework to measure adaptive intelligence in software engineering tasks through recursive evaluation paradigms. While traditional benchmarks measure static, single-pass performance, Recursive-SWE-bench evaluates dynamic problem-solving capabilities across iterative refinement cycles.

**Key innovation**: Benchmark tasks self-modify as models interact with them, creating a feedback loop that more accurately reflects real-world software engineering challenges.


## Why Recursive Benchmarking?

Traditional benchmarks evaluate models using a linear, static framework:

```
Input → Model → Output → Evaluation → Score
```

Real-world engineering is inherently recursive:

```
Problem → Solution → Testing → Feedback → Refinement → New Problem State → ...
```

Recursive-SWE-bench captures this dynamic process, measuring:

- **Adaptive reasoning**: How models incorporate feedback into subsequent solution attempts
- **Self-correction**: The ability to identify and fix errors across iterations
- **Learning efficiency**: How quickly models converge on optimal solutions
- **Meta-problem understanding**: Recognition of patterns across related problem states
- **Probabilistic optimization**: Managing uncertainty in problem specifications and solution spaces

## Core Innovations

1. **Dynamic Task Evolution**: Tasks transform based on model interactions, generating unique problem sequences for each evaluation run

2. **Recursive Evaluation Metrics**: Performance measured across solution trajectories rather than single attempts

3. **Self-Modifying Test Harnesses**: Evaluation environments that adapt to model capabilities, maintaining consistent challenge levels

4. **Meta-learning Assessment**: Explicit measurement of knowledge transfer between related problems

5. **Feedback Integration Protocols**: Standardized frameworks for delivering actionable feedback to models

## Quick Start

```bash
# Install the package
pip install recursive-swe-bench

# Run a basic evaluation
rswe-bench evaluate --model your-model-name --task-set standard --iterations 5

# Generate a performance report
rswe-bench report --results-dir ./results --visualization recursive-trajectory
```

## Benchmark Structure

Recursive-SWE-bench organizes tasks into recursive trajectories; a minimal Python sketch of how the pieces fit together follows this list:

- **Task Generators**: Dynamically create problem instances based on model interaction history
- **Feedback Modules**: Provide standardized assessment of solutions with actionable insights
- **State Trackers**: Maintain the evolving state of problems across solution attempts
- **Meta-Pattern Evaluators**: Assess model ability to identify patterns across problem sequences
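The sketch below wires these pieces together using the `ProblemState` and `RecursiveEvaluator` classes included in this upload. `BugFixingTask` and `MyModelWrapper` are hypothetical placeholders (a concrete `RecursiveTask` subclass and a model adapter exposing `solve()` and `get_meta_information()`); import paths assume the `recursive_swe_bench` package layout named in the file headers below.

```python
# Hypothetical wiring sketch -- BugFixingTask and MyModelWrapper are placeholders,
# not classes defined in this upload.
from recursive_swe_bench.core.recursive_task import ProblemState
from recursive_swe_bench.evaluation.harness import (
    RecursiveEvaluator, ConvergenceRate, AdaptationEfficiency,
)

initial_state = ProblemState(
    problem_id="bugfix-001",
    description="Fix the failing pagination tests.",
    code_context={"code": "def paginate(items, size): ..."},
    requirements=[{"id": "R1", "text": "All provided tests must pass"}],
    difficulty=0.4,
    evolution_stage=0,
    adaptation_vector=[0.0] * 5,
)

task = BugFixingTask(initial_state)        # hypothetical RecursiveTask subclass
model = MyModelWrapper("your-model-name")  # must expose solve() and get_meta_information()

evaluator = RecursiveEvaluator(
    model=model,
    metrics={"convergence_rate": ConvergenceRate(),
             "adaptation_efficiency": AdaptationEfficiency()},
)
trajectory, metrics = evaluator.evaluate_task(task, max_iterations=5)
print(metrics)
```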
## Task Categories

| Category | Description | Recursive Elements |
|----------|-------------|--------------------|
| Bug Fixing | Identify and resolve issues in existing code | Error patterns transform based on fix attempts |
| Feature Implementation | Add functionality to existing codebases | Requirements evolve as implementation progresses |
| Refactoring | Improve code structure without changing behavior | Complexity dynamically adjusts to refactoring success |
| System Design | Create architecture for complex systems | Design constraints adapt to proposed solutions |
| Test Generation | Create effective test suites | Test coverage requirements shift with implementation |
| Documentation | Create clear technical documentation | Clarity targets adapt to explanation attempts |

## Performance Metrics

Recursive-SWE-bench evaluates models using both traditional and recursive metrics; a worked example of the recursive metrics follows the lists below.

### Traditional Metrics
- Pass@k (for varying k)
- Execution accuracy
- Code similarity to human solutions

### Recursive Metrics
- **Convergence Rate**: How quickly models reach stable solutions
- **Adaptation Efficiency**: Performance improvements per feedback iteration
- **Transfer Learning Factor**: Performance gains across related problems
- **Learning Curve Area**: Integral of performance across all iterations
- **Probabilistic Solution Quality**: Distribution of solution quality across runs
- **Dynamic Complexity Handling**: Performance across varying problem complexity
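To illustrate how these metrics consume a solution trajectory, here is a small self-contained sketch using the metric classes defined in `evaluation/harness.py`; the score series and the empty problem/feedback objects are toy values chosen only to exercise the arithmetic.

```python
# Illustrative only: a toy trajectory with placeholder states and feedback, used to
# show how the recursive metrics in evaluation/harness.py read a score series.
from recursive_swe_bench.core.recursive_task import (
    Trajectory, ProblemState, EvaluationResult, Feedback,
)
from recursive_swe_bench.evaluation.harness import (
    ConvergenceRate, AdaptationEfficiency, LearningCurveArea,
)

trajectory = Trajectory(task_id="demo")
for stage, score in enumerate([0.40, 0.65, 0.80, 0.82]):
    state = ProblemState(
        problem_id="demo", description="toy problem", code_context={},
        requirements=[], difficulty=0.5, evolution_stage=stage,
        adaptation_vector=[0.0] * 5,
    )
    result = EvaluationResult(success=score > 0.8, score=score, execution_results={})
    feedback = Feedback(summary="", issues=[], suggestions=[],
                        focus_areas=[], adaptation_hints=[])
    trajectory.add_step(state, solution=f"attempt {stage}", result=result, feedback=feedback)

print(ConvergenceRate().calculate(trajectory))       # mean |Δscore| per step: 0.14
print(AdaptationEfficiency().calculate(trajectory))  # (0.82 - 0.40) / 3 = 0.14
print(LearningCurveArea().calculate(trajectory))     # sum(scores) / 4 ≈ 0.67
```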
## Sample Results

Here's how various models perform on Recursive-SWE-bench:

<p align="center">
  <img src="docs/assets/performance-comparison.png" alt="Performance Comparison" width="650"/>
</p>

*Note: These preliminary results demonstrate how recursive evaluation reveals capabilities not captured by traditional single-pass benchmarks.*

## Citation

If you use Recursive-SWE-bench in your research, please cite:

```bibtex
@article{recursive2025swebench,
  title={Recursive-SWE-bench: Evaluating Adaptive Programming Intelligence Through Self-Modifying Benchmarks},
  author={Recursive Labs Team},
  journal={arXiv preprint arXiv:2505.12345},
  year={2025}
}
```

## Contributing

We welcome contributions to Recursive-SWE-bench! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

### Key Areas for Contribution

- Additional recursive task generators
- Enhanced feedback mechanisms
- New evaluation metrics
- Integration with more models and frameworks
- Documentation and tutorials

## License

Recursive-SWE-bench is released under the [MIT License](LICENSE).

## Acknowledgments

Recursive-SWE-bench builds upon the foundation established by the original SWE-bench, created by the Princeton NLP group. We extend our gratitude to their pioneering work while taking benchmark evaluation in new directions.
core/recursive_task.py
ADDED
@@ -0,0 +1,460 @@
# recursive_swe_bench/core/recursive_task.py

from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple, Union
from enum import Enum
import datetime
import uuid
import json
import copy


class TaskStatus(Enum):
    """Status of a recursive task."""
    INITIALIZED = "initialized"
    IN_PROGRESS = "in_progress"
    CONVERGED = "converged"
    MAX_ITERATIONS = "max_iterations"
    PERFECT_SOLUTION = "perfect_solution"
    ABANDONED = "abandoned"


@dataclass
class ProblemState:
    """Represents the current state of a problem in the recursive task."""
    problem_id: str
    description: str
    code_context: Dict[str, Any]
    requirements: List[Dict[str, Any]]
    difficulty: float  # 0.0 to 1.0
    evolution_stage: int  # How many times the problem has evolved
    adaptation_vector: List[float]  # Directs how the problem should evolve


@dataclass
class EvaluationResult:
    """Results from evaluating a solution."""
    success: bool
    score: float  # 0.0 to 1.0
    execution_results: Dict[str, Any]
    error_details: Optional[Dict[str, Any]] = None
    test_results: Optional[Dict[str, Any]] = None
    metrics: Optional[Dict[str, float]] = None


@dataclass
class Feedback:
    """Structured feedback on a solution."""
    summary: str
    issues: List[Dict[str, Any]]
    suggestions: List[Dict[str, Any]]
    focus_areas: List[str]
    adaptation_hints: List[Dict[str, Any]]


class ConvergenceCriteria:
    """Criteria for determining when a recursive task has converged."""

    def __init__(self, config: Dict[str, Any] = None):
        self.config = config or {}
        self.score_threshold = self.config.get("score_threshold", 0.95)
        self.min_iterations = self.config.get("min_iterations", 1)
        self.max_iterations = self.config.get("max_iterations", 10)
        self.score_delta_threshold = self.config.get("score_delta_threshold", 0.01)
        self.consecutive_plateau_limit = self.config.get("consecutive_plateau_limit", 3)

    def has_converged(self, trajectory: "Trajectory") -> bool:
        """Determine if the task has converged based on the trajectory."""
        if len(trajectory.steps) < self.min_iterations:
            return False

        if len(trajectory.steps) >= self.max_iterations:
            return True

        # Check if we've reached the score threshold
        latest_score = trajectory.steps[-1].result.score
        if latest_score >= self.score_threshold:
            return True

        # Check for plateau (little improvement over consecutive iterations)
        if len(trajectory.steps) >= self.consecutive_plateau_limit + 1:
            recent_scores = [step.result.score for step in
                             trajectory.steps[-self.consecutive_plateau_limit-1:]]
            deltas = [abs(recent_scores[i+1] - recent_scores[i])
                      for i in range(len(recent_scores)-1)]

            if all(delta < self.score_delta_threshold for delta in deltas):
                return True

        return False


@dataclass
class TrajectoryStep:
    """A single step in a solution trajectory."""
    step_id: str
    timestamp: datetime.datetime
    problem_state: ProblemState
    solution: str
    result: EvaluationResult
    feedback: Feedback


class Trajectory:
    """Tracks the evolution of solutions over multiple iterations."""

    def __init__(self, task_id: str):
        self.task_id = task_id
        self.steps: List[TrajectoryStep] = []
        self.metadata: Dict[str, Any] = {
            "start_time": datetime.datetime.now().isoformat(),
            "task_id": task_id
        }

    def add_step(self, problem_state: ProblemState, solution: str,
                 result: EvaluationResult, feedback: Feedback) -> None:
        """Add a step to the trajectory."""
        step = TrajectoryStep(
            step_id=str(uuid.uuid4()),
            timestamp=datetime.datetime.now(),
            problem_state=problem_state,
            solution=solution,
            result=result,
            feedback=feedback
        )
        self.steps.append(step)

    def get_solution_series(self) -> List[str]:
        """Return the series of solutions."""
        return [step.solution for step in self.steps]

    def get_score_series(self) -> List[float]:
        """Return the series of scores."""
        return [step.result.score for step in self.steps]

    def get_latest_step(self) -> Optional[TrajectoryStep]:
        """Get the most recent step in the trajectory."""
        if not self.steps:
            return None
        return self.steps[-1]

    def calculate_improvement_rate(self) -> float:
        """Calculate the rate of improvement across iterations."""
        scores = self.get_score_series()
        if len(scores) < 2:
            return 0.0

        return (scores[-1] - scores[0]) / len(scores)

    def calculate_volatility(self) -> float:
        """Calculate the volatility of scores across iterations."""
        scores = self.get_score_series()
        if len(scores) < 2:
            return 0.0

        deltas = [abs(scores[i+1] - scores[i]) for i in range(len(scores)-1)]
        return sum(deltas) / len(deltas)

    def to_dict(self) -> Dict[str, Any]:
        """Convert the trajectory to a dictionary for serialization."""
        return {
            "task_id": self.task_id,
            "metadata": self.metadata,
            "steps": [
                {
                    "step_id": step.step_id,
                    "timestamp": step.timestamp.isoformat(),
                    "problem_state": {
                        "problem_id": step.problem_state.problem_id,
                        "description": step.problem_state.description,
                        "code_context": step.problem_state.code_context,
                        "requirements": step.problem_state.requirements,
                        "difficulty": step.problem_state.difficulty,
                        "evolution_stage": step.problem_state.evolution_stage,
                        "adaptation_vector": step.problem_state.adaptation_vector
                    },
                    "solution": step.solution,
                    "result": {
                        "success": step.result.success,
                        "score": step.result.score,
                        "execution_results": step.result.execution_results,
                        "error_details": step.result.error_details,
                        "test_results": step.result.test_results,
                        "metrics": step.result.metrics
                    },
                    "feedback": {
                        "summary": step.feedback.summary,
                        "issues": step.feedback.issues,
                        "suggestions": step.feedback.suggestions,
                        "focus_areas": step.feedback.focus_areas,
                        "adaptation_hints": step.feedback.adaptation_hints
                    }
                }
                for step in self.steps
            ]
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "Trajectory":
        """Create a trajectory from a dictionary."""
        trajectory = cls(data["task_id"])
        trajectory.metadata = data["metadata"]

        for step_data in data["steps"]:
            problem_state = ProblemState(
                problem_id=step_data["problem_state"]["problem_id"],
                description=step_data["problem_state"]["description"],
                code_context=step_data["problem_state"]["code_context"],
                requirements=step_data["problem_state"]["requirements"],
                difficulty=step_data["problem_state"]["difficulty"],
                evolution_stage=step_data["problem_state"]["evolution_stage"],
                adaptation_vector=step_data["problem_state"]["adaptation_vector"]
            )

            result = EvaluationResult(
                success=step_data["result"]["success"],
                score=step_data["result"]["score"],
                execution_results=step_data["result"]["execution_results"],
                error_details=step_data["result"]["error_details"],
                test_results=step_data["result"]["test_results"],
                metrics=step_data["result"]["metrics"]
            )

            feedback = Feedback(
                summary=step_data["feedback"]["summary"],
                issues=step_data["feedback"]["issues"],
                suggestions=step_data["feedback"]["suggestions"],
                focus_areas=step_data["feedback"]["focus_areas"],
                adaptation_hints=step_data["feedback"]["adaptation_hints"]
            )

            trajectory.add_step(
                problem_state=problem_state,
                solution=step_data["solution"],
                result=result,
                feedback=feedback
            )

        return trajectory

    def save(self, filepath: str) -> None:
        """Save the trajectory to a file."""
        with open(filepath, "w") as f:
            json.dump(self.to_dict(), f, indent=2)

    @classmethod
    def load(cls, filepath: str) -> "Trajectory":
        """Load a trajectory from a file."""
        with open(filepath, "r") as f:
            data = json.load(f)
        return cls.from_dict(data)


class RecursiveTask:
    """
    Base class for recursive tasks that evolve based on model solutions.

    A recursive task provides a dynamic problem that adapts based on the
    model's attempted solutions, creating a feedback loop that more accurately
    reflects real-world software engineering challenges.
    """

    def __init__(self,
                 initial_state: ProblemState,
                 config: Dict[str, Any] = None):
        """
        Initialize the recursive task with an initial problem state.

        Args:
            initial_state: The initial state of the problem
            config: Configuration options for the task
        """
        self.task_id = str(uuid.uuid4())
        self.state = initial_state
        self.config = config or {}
        self.trajectory = Trajectory(self.task_id)
        self.status = TaskStatus.INITIALIZED
        self.convergence_criteria = ConvergenceCriteria(
            self.config.get("convergence_criteria", {}))

    def get_current_problem(self) -> Dict[str, Any]:
        """
        Return the current problem description and context.

        Returns:
            A dictionary containing the current problem description and context
        """
        return {
            "description": self.state.description,
            "code_context": self.state.code_context,
            "requirements": self.state.requirements,
            "evolution_stage": self.state.evolution_stage
        }

    def evaluate_solution(self, solution: str) -> Tuple[EvaluationResult, Feedback]:
        """
        Evaluate a solution and generate feedback.

        Args:
            solution: The solution to evaluate

        Returns:
            A tuple containing the evaluation result and feedback
        """
        # Run the evaluation logic
        result = self._run_evaluation(solution)

        # Generate feedback based on the evaluation
        feedback = self._generate_feedback(solution, result)

        return result, feedback

    def update_state(self,
                     solution: str,
                     result: EvaluationResult,
                     feedback: Feedback) -> ProblemState:
        """
        Update the problem state based on the solution and feedback.

        This method implements the recursive nature of the benchmark by
        evolving the problem based on the model's solution attempt.

        Args:
            solution: The attempted solution
            result: The evaluation result
            feedback: The feedback provided

        Returns:
            The updated problem state
        """
        # Add the current step to the trajectory
        self.trajectory.add_step(
            problem_state=self.state,
            solution=solution,
            result=result,
            feedback=feedback
        )

        # Check if we've converged
        if self.convergence_criteria.has_converged(self.trajectory):
            if self.trajectory.steps[-1].result.score >= self.convergence_criteria.score_threshold:
                self.status = TaskStatus.PERFECT_SOLUTION
            elif len(self.trajectory.steps) >= self.convergence_criteria.max_iterations:
                self.status = TaskStatus.MAX_ITERATIONS
            else:
                self.status = TaskStatus.CONVERGED
            return self.state

        # Evolve the problem state based on the solution
        self.state = self._evolve_state(solution, result, feedback)

        # Update the status
        self.status = TaskStatus.IN_PROGRESS

        return self.state

    def _run_evaluation(self, solution: str) -> EvaluationResult:
        """
        Run evaluation logic specific to this task.

        Args:
            solution: The solution to evaluate

        Returns:
            The evaluation result
        """
        raise NotImplementedError("Subclasses must implement this method")

    def _generate_feedback(self,
                           solution: str,
                           result: EvaluationResult) -> Feedback:
        """
        Generate structured feedback based on evaluation results.

        Args:
            solution: The solution that was evaluated
            result: The evaluation result

        Returns:
            Structured feedback
        """
        raise NotImplementedError("Subclasses must implement this method")

    def _evolve_state(self,
                      solution: str,
                      result: EvaluationResult,
                      feedback: Feedback) -> ProblemState:
        """
        Evolve the problem state based on the solution and feedback.

        This method implements the recursive nature of the benchmark by
        defining how the problem changes in response to solution attempts.

        Args:
            solution: The attempted solution
            result: The evaluation result
            feedback: The feedback provided

        Returns:
            The evolved problem state
        """
        raise NotImplementedError("Subclasses must implement this method")

    def get_trajectory(self) -> Trajectory:
        """
        Get the complete solution trajectory for this task.

        Returns:
            The solution trajectory
        """
        return self.trajectory

    def to_dict(self) -> Dict[str, Any]:
        """
        Convert the task to a dictionary for serialization.

        Returns:
            A dictionary representation of the task
        """
        return {
            "task_id": self.task_id,
            "status": self.status.value,
            "state": {
                "problem_id": self.state.problem_id,
                "description": self.state.description,
                "code_context": self.state.code_context,
                "requirements": self.state.requirements,
                "difficulty": self.state.difficulty,
                "evolution_stage": self.state.evolution_stage,
                "adaptation_vector": self.state.adaptation_vector
            },
            "config": self.config,
            "trajectory": self.trajectory.to_dict()
        }

    def save(self, filepath: str) -> None:
        """
        Save the task to a file.

        Args:
            filepath: Path to save the task
        """
        with open(filepath, "w") as f:
            json.dump(self.to_dict(), f, indent=2)

    @classmethod
    def load(cls, filepath: str) -> "RecursiveTask":
        """
        Load a task from a file.

        Args:
            filepath: Path to load the task from

        Returns:
            The loaded task
        """
        with open(filepath, "r") as f:
            data = json.load(f)

        # Subclasses must implement this method, since reconstructing a task
        # requires the task-specific evaluation, feedback, and evolution logic.
        raise NotImplementedError("Subclasses must implement this method")
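For orientation, here is a minimal sketch of what a concrete subclass looks like, assuming a toy keyword-matching scoring rule that is not part of this upload: only the three abstract hooks (`_run_evaluation`, `_generate_feedback`, `_evolve_state`) need to be supplied.

```python
# Illustrative sketch: a toy RecursiveTask subclass. The scoring rule (substring
# checks against a "keyword" field on each requirement) is invented for this example.
from recursive_swe_bench.core.recursive_task import (
    RecursiveTask, ProblemState, EvaluationResult, Feedback,
)


class ToyKeywordTask(RecursiveTask):
    """Scores a solution by how many required keywords it contains."""

    def _run_evaluation(self, solution: str) -> EvaluationResult:
        hits = [r for r in self.state.requirements if r["keyword"] in solution]
        score = len(hits) / max(len(self.state.requirements), 1)
        return EvaluationResult(
            success=score == 1.0,
            score=score,
            execution_results={"matched": [r["keyword"] for r in hits]},
        )

    def _generate_feedback(self, solution: str, result: EvaluationResult) -> Feedback:
        missing = [r["keyword"] for r in self.state.requirements
                   if r["keyword"] not in solution]
        return Feedback(
            summary=f"Matched {result.score:.0%} of requirements.",
            issues=[{"type": "missing_keyword", "message": kw} for kw in missing],
            suggestions=[{"message": f"Mention '{kw}' in the solution"} for kw in missing],
            focus_areas=missing,
            adaptation_hints=[],
        )

    def _evolve_state(self, solution: str, result: EvaluationResult,
                      feedback: Feedback) -> ProblemState:
        # Raise the difficulty slightly and record the evolution step.
        return ProblemState(
            problem_id=self.state.problem_id,
            description=self.state.description + " (revised)",
            code_context=self.state.code_context,
            requirements=self.state.requirements,
            difficulty=min(1.0, self.state.difficulty + 0.1),
            evolution_stage=self.state.evolution_stage + 1,
            adaptation_vector=self.state.adaptation_vector,
        )
```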
evaluation/harness.py
ADDED
@@ -0,0 +1,445 @@
# recursive_swe_bench/evaluation/harness.py

from typing import Any, Dict, List, Optional, Tuple, Union, Callable
import datetime
import uuid
import json
import os
import logging
from dataclasses import dataclass, field

from recursive_swe_bench.core.recursive_task import (
    RecursiveTask, Trajectory, TrajectoryStep, ProblemState,
    EvaluationResult, Feedback, TaskStatus
)


class RecursiveEvaluator:
    """
    The core evaluation harness for recursive benchmark tasks.

    This class orchestrates the recursive evaluation process, managing the interactions
    between models and tasks, tracking trajectories, and calculating metrics.
    """

    def __init__(
        self,
        model: Any,  # Model interface
        metrics: Dict[str, Any],  # Metric calculators
        config: Dict[str, Any] = None
    ):
        """
        Initialize the recursive evaluator.

        Args:
            model: The model to evaluate
            metrics: Dictionary of metric calculators
            config: Configuration options
        """
        self.model = model
        self.metrics = metrics
        self.config = config or {}
        self.logger = self._setup_logger()

    def _setup_logger(self) -> logging.Logger:
        """Set up logging for the evaluator."""
        logger = logging.getLogger("RecursiveEvaluator")
        handler = logging.StreamHandler()
        formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
        handler.setFormatter(formatter)
        logger.addHandler(handler)
        logger.setLevel(self.config.get("log_level", logging.INFO))
        return logger

    def evaluate_task(
        self,
        task: RecursiveTask,
        max_iterations: int = 5
    ) -> Tuple[Trajectory, Dict[str, float]]:
        """
        Run a full recursive evaluation on a single task.

        Args:
            task: The task to evaluate
            max_iterations: Maximum number of iterations

        Returns:
            The trajectory and calculated metrics
        """
        self.logger.info(f"Starting evaluation of task {task.task_id}")

        for i in range(max_iterations):
            self.logger.info(f"Starting iteration {i+1}/{max_iterations}")

            # Get the current problem
            problem = task.get_current_problem()
            self.logger.debug(f"Problem state: evolution_stage={problem['evolution_stage']}")

            # Format the problem for the model
            formatted_problem = self._format_problem_for_model(problem, task.trajectory)

            # Get model solution
            self.logger.debug("Requesting solution from model")
            solution = self.model.solve(formatted_problem)

            # Evaluate the solution
            self.logger.debug("Evaluating solution")
            result, feedback = task.evaluate_solution(solution)

            # Log the results
            self.logger.info(f"Solution score: {result.score:.4f}, Success: {result.success}")

            # Update the task state based on the solution
            new_state = task.update_state(solution, result, feedback)

            # Check if we've reached a terminal state
            if task.status != TaskStatus.IN_PROGRESS:
                self.logger.info(f"Task complete with status: {task.status.value}")
                break

        # Calculate metrics across the trajectory
        self.logger.info("Calculating metrics")
        metrics_result = self._calculate_metrics(task.trajectory)

        return task.trajectory, metrics_result

    def evaluate_task_set(
        self,
        tasks: List[RecursiveTask],
        max_iterations: int = 5,
        output_dir: Optional[str] = None
    ) -> Dict[str, Any]:
        """
        Evaluate a set of tasks and aggregate the results.

        Args:
            tasks: List of tasks to evaluate
            max_iterations: Maximum iterations per task
            output_dir: Directory to save results (optional)

        Returns:
            Dictionary of aggregated results
        """
        self.logger.info(f"Evaluating {len(tasks)} tasks")

        results = {}
        trajectories = {}
        all_metrics = {}

        for i, task in enumerate(tasks):
            self.logger.info(f"Evaluating task {i+1}/{len(tasks)}: {task.task_id}")

            # Evaluate the task
            trajectory, metrics = self.evaluate_task(task, max_iterations)

            # Store the results
            trajectories[task.task_id] = trajectory
            all_metrics[task.task_id] = metrics

            # Save the trajectory if output_dir is provided
            if output_dir:
                os.makedirs(output_dir, exist_ok=True)
                task_output_path = os.path.join(output_dir, f"task_{task.task_id}.json")
                task.save(task_output_path)
                self.logger.info(f"Saved task to {task_output_path}")

        # Aggregate metrics across all tasks
        aggregated_metrics = self._aggregate_metrics(all_metrics)

        # Compile results
        results = {
            "aggregated_metrics": aggregated_metrics,
            "task_metrics": all_metrics,
            "timestamp": datetime.datetime.now().isoformat(),
            "model_info": self.model.get_meta_information(),
            "total_tasks": len(tasks),
            "config": self.config
        }

        # Save aggregated results if output_dir is provided
        if output_dir:
            results_path = os.path.join(output_dir, "aggregated_results.json")
            with open(results_path, "w") as f:
                json.dump(results, f, indent=2)
            self.logger.info(f"Saved aggregated results to {results_path}")

        return results

    def _format_problem_for_model(
        self,
        problem: Dict[str, Any],
        trajectory: Trajectory
    ) -> Dict[str, Any]:
        """
        Format the problem in a way the model can understand.

        Args:
            problem: The problem state
            trajectory: The trajectory so far

        Returns:
            Formatted problem for the model
        """
        # Extract the previous steps if they exist
        previous_steps = []
        for step in trajectory.steps:
            previous_steps.append({
                "problem": {
                    "description": step.problem_state.description,
                    "requirements": step.problem_state.requirements,
                    "evolution_stage": step.problem_state.evolution_stage
                },
                "solution": step.solution,
                "feedback": {
                    "summary": step.feedback.summary,
                    "issues": step.feedback.issues,
                    "suggestions": step.feedback.suggestions,
                    "focus_areas": step.feedback.focus_areas
                }
            })

        # Format the problem with the trajectory context
        formatted_problem = {
            "description": problem["description"],
            "code_context": problem["code_context"],
            "requirements": problem["requirements"],
            "iteration": problem["evolution_stage"] + 1,
            "previous_attempts": previous_steps
        }

        return formatted_problem

    def _calculate_metrics(self, trajectory: Trajectory) -> Dict[str, float]:
        """
        Calculate metrics across the trajectory.

        Args:
            trajectory: The solution trajectory

        Returns:
            Dictionary of metric values
        """
        return {name: metric.calculate(trajectory)
                for name, metric in self.metrics.items()}

    def _aggregate_metrics(
        self,
        all_metrics: Dict[str, Dict[str, float]]
    ) -> Dict[str, float]:
        """
        Aggregate metrics across multiple tasks.

        Args:
            all_metrics: Dictionary of metrics per task

        Returns:
            Dictionary of aggregated metrics
        """
        # Initialize aggregated metrics
        if not all_metrics:
            return {}

        sample_metrics = next(iter(all_metrics.values()))
        aggregated = {name: 0.0 for name in sample_metrics.keys()}

        # Sum up metrics
        for task_metrics in all_metrics.values():
            for name, value in task_metrics.items():
                aggregated[name] += value

        # Calculate averages
        for name in aggregated:
            aggregated[name] /= len(all_metrics)

        return aggregated


# recursive_swe_bench/evaluation/metrics/recursive.py

from typing import Any, Dict, List, Optional
import numpy as np
from recursive_swe_bench.core.recursive_task import Trajectory


class RecursiveMetric:
    """Base class for recursive metrics."""

    def __init__(self, config: Dict[str, Any] = None):
        self.config = config or {}

    def calculate(self, trajectory: Trajectory) -> float:
        """
        Calculate the metric value for a trajectory.

        Args:
            trajectory: The solution trajectory

        Returns:
            The metric value
        """
        raise NotImplementedError("Subclasses must implement this method")


class ConvergenceRate(RecursiveMetric):
    """
    Measures how quickly the model reaches a stable solution.

    A lower value indicates faster convergence.
    """

    def calculate(self, trajectory: Trajectory) -> float:
        scores = trajectory.get_score_series()
        if len(scores) < 2:
            return 0.0

        # Calculate changes between consecutive scores
        deltas = [abs(scores[i+1] - scores[i])
                  for i in range(len(scores)-1)]

        # A lower sum indicates faster convergence
        # Normalize by the number of iterations
        return sum(deltas) / len(deltas)


class AdaptationEfficiency(RecursiveMetric):
    """
    Measures improvement per feedback iteration.

    A higher value indicates more efficient adaptation.
    """

    def calculate(self, trajectory: Trajectory) -> float:
        scores = trajectory.get_score_series()
        if len(scores) < 2:
            return 0.0

        # Calculate the improvement from first to last iteration
        total_improvement = max(0.0, scores[-1] - scores[0])

        # Normalize by the number of iterations
        return total_improvement / (len(scores) - 1)


class LearningCurveArea(RecursiveMetric):
    """
    Measures the area under the learning curve.

    A higher value indicates better overall performance across iterations.
    """

    def calculate(self, trajectory: Trajectory) -> float:
        scores = trajectory.get_score_series()
        if not scores:
            return 0.0

        # Calculate the area under the curve
        # Normalize by the maximum possible area (perfect score from the start)
        max_score = self.config.get("max_score", 1.0)
        max_area = max_score * len(scores)

        return sum(scores) / max_area


class ProbabilisticSolutionQuality(RecursiveMetric):
    """
    Measures the distribution of solution quality using non-deterministic assessment.

    This metric captures the robustness of solutions by measuring the variability in quality
    across multiple probabilistic evaluations.
    """

    def calculate(self, trajectory: Trajectory) -> float:
        # For each step, we expect the result.metrics to contain probabilistic assessments
        steps = trajectory.steps
        if not steps:
            return 0.0

        # Extract probabilistic quality distributions if available
        distributions = []
        for step in steps:
            if (step.result.metrics and
                    "probabilistic_quality_distribution" in step.result.metrics):
                distributions.append(
                    step.result.metrics["probabilistic_quality_distribution"])

        if not distributions:
            # Fall back to deterministic scores if no distributions are available
            return trajectory.get_score_series()[-1]

        # Calculate the expected value of the final distribution
        final_distribution = distributions[-1]
        return sum(prob * val for val, prob in final_distribution.items())


class TransferLearningFactor(RecursiveMetric):
    """
    Measures how well learning transfers across related problems.

    This requires multiple trajectories from related tasks.
    """

    def __init__(self, config: Dict[str, Any] = None, related_trajectories: List[Trajectory] = None):
        super().__init__(config)
        self.related_trajectories = related_trajectories or []

    def calculate(self, trajectory: Trajectory) -> float:
        # This metric requires related trajectories
        if not self.related_trajectories:
            return 0.0

        # Get learning rates for the current trajectory and related ones
        current_learning_rate = self._calculate_learning_rate(trajectory)
        if current_learning_rate is None:
            return 0.0

        related_learning_rates = [
            self._calculate_learning_rate(rel_traj)
            for rel_traj in self.related_trajectories
        ]

        # Filter out invalid learning rates
        valid_related_rates = [rate for rate in related_learning_rates if rate is not None]

        if not valid_related_rates:
            return 0.0

        # Calculate the transfer factor as the ratio of the current learning rate
        # to the average of related learning rates
        avg_related_rate = sum(valid_related_rates) / len(valid_related_rates)

        if avg_related_rate == 0:
            return 0.0

        return current_learning_rate / avg_related_rate

    def _calculate_learning_rate(self, trajectory: Trajectory) -> Optional[float]:
        """Calculate the learning rate for a trajectory."""
        scores = trajectory.get_score_series()
        if len(scores) < 2:
            return None

        # Calculate improvement per iteration
        return (scores[-1] - scores[0]) / (len(scores) - 1)


class DynamicComplexityHandling(RecursiveMetric):
    """
    Measures how well the model handles varying problem complexity.

    This metric evaluates performance while accounting for changes in problem difficulty.
    """

    def calculate(self, trajectory: Trajectory) -> float:
        if not trajectory.steps:
            return 0.0

        # Extract scores and difficulties
        scores = trajectory.get_score_series()
        difficulties = [step.problem_state.difficulty for step in trajectory.steps]

        if len(scores) < 2:
            return scores[0]  # Return the single score if only one step

        # Calculate normalized scores (adjusted by difficulty)
        normalized_scores = [scores[i] * (1 + difficulties[i])
                             for i in range(len(scores))]

        # Return the average normalized score
        return sum(normalized_scores) / len(normalized_scores)
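For reference, here is a minimal sketch of the model-side contract the harness relies on: `evaluate_task` calls `model.solve(formatted_problem)` and `evaluate_task_set` calls `model.get_meta_information()`. The stand-in below is illustrative only and is not the `ModelInterface` base class from `models/base_models.py` (whose contents are not shown here); it could be paired with any `RecursiveTask` subclass, such as the toy one sketched after `core/recursive_task.py` above.

```python
# Minimal stand-in satisfying the calls RecursiveEvaluator makes on `model`.
# This illustrates the expected interface; it is not the project's ModelInterface.
from typing import Any, Dict

from recursive_swe_bench.evaluation.harness import RecursiveEvaluator, LearningCurveArea


class EchoModel:
    """Returns a canned answer; a real integration would call an LLM here."""

    def solve(self, problem: Dict[str, Any]) -> str:
        # `problem` is the dict built by _format_problem_for_model: description,
        # code_context, requirements, iteration, previous_attempts.
        return "def fixed(): pass  # placeholder solution"

    def get_meta_information(self) -> Dict[str, Any]:
        return {"name": "echo-model", "version": "0.0"}


evaluator = RecursiveEvaluator(
    model=EchoModel(),
    metrics={"learning_curve_area": LearningCurveArea()},
)
# trajectory, metrics = evaluator.evaluate_task(some_recursive_task, max_iterations=3)
```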
models/anthropic.py
ADDED
@@ -0,0 +1,866 @@
1 |
+
# recursive_swe_bench/models/anthropic.py
|
2 |
+
|
3 |
+
import json
|
4 |
+
import backoff
|
5 |
+
import time
|
6 |
+
import anthropic
|
7 |
+
from typing import Any, Dict, List, Optional, Union, Tuple
|
8 |
+
import re
|
9 |
+
import logging
|
10 |
+
|
11 |
+
from recursive_swe_bench.models.base_model import ModelInterface
|
12 |
+
|
13 |
+
class AnthropicModel(ModelInterface):
|
14 |
+
"""
|
15 |
+
Integration with Anthropic models (Claude).
|
16 |
+
|
17 |
+
This class provides integration with Anthropic's API for evaluating
|
18 |
+
Claude models with Recursive-SWE-bench through recursive evaluation loops.
|
19 |
+
The implementation features dynamic adaptation to feedback through a
|
20 |
+
self-reflective mechanism that traces attribution paths through recursive iterations.
|
21 |
+
"""
|
22 |
+
|
23 |
+
def __init__(
|
24 |
+
self,
|
25 |
+
model_identifier: str,
|
26 |
+
api_key: Optional[str] = None,
|
27 |
+
config: Optional[Dict[str, Any]] = None
|
28 |
+
):
|
29 |
+
"""
|
30 |
+
Initialize the Anthropic model interface.
|
31 |
+
|
32 |
+
Args:
|
33 |
+
model_identifier: Anthropic model identifier (e.g., "claude-3-opus-20240229")
|
34 |
+
api_key: Anthropic API key (optional if set in environment)
|
35 |
+
config: Additional configuration options
|
36 |
+
"""
|
37 |
+
super().__init__(model_identifier, config)
|
38 |
+
|
39 |
+
# Initialize Anthropic client
|
40 |
+
if api_key:
|
41 |
+
self.client = anthropic.Anthropic(api_key=api_key)
|
42 |
+
else:
|
43 |
+
self.client = anthropic.Anthropic()
|
44 |
+
|
45 |
+
# Set up system prompt and templates
|
46 |
+
self.prompts = self.config.get("prompts", {
|
47 |
+
"system": "You are an expert software engineer who specializes in debugging and fixing complex code. Your task is to fix bugs in code based on the description and test requirements provided.",
|
48 |
+
"user_template": "# Bug Fixing Task\n\n{description}\n\n# Code\n```python\n{code}\n```\n\n{tests_description}\n\n# Your task\nFix the bugs in the code above. Focus on making the code pass all tests while maintaining good practices. Provide only the corrected code without additional explanations.",
|
49 |
+
"reflection_template": "# Feedback on Previous Solution\n\nYour previous solution had the following issues:\n{issues}\n\n# Suggested Improvements\n{suggestions}\n\n# Test Results\n{test_results}\n\n# Reflection Prompt\nBefore providing a new solution, analyze what went wrong in your previous attempt and how you'll approach fixing it differently this time."
|
50 |
+
})
|
51 |
+
|
52 |
+
# Configure API parameters
|
53 |
+
self.api_params = self.config.get("api_params", {
|
54 |
+
"temperature": 0.2,
|
55 |
+
"max_tokens": 2000,
|
56 |
+
"top_p": 0.95,
|
57 |
+
"top_k": 50
|
58 |
+
})
|
59 |
+
|
60 |
+
# Set up recursive adaptation configuration
|
61 |
+
self.recursive_config = self.config.get("recursive_config", {
|
62 |
+
"enable_self_reflection": True,
|
63 |
+
"adaptation_threshold": 0.5, # Minimum score to trigger adaptation
|
64 |
+
"max_reflection_depth": 3, # Maximum depth of recursive reflection
|
65 |
+
"attribution_tracking": True, # Track attribution patterns across iterations
|
66 |
+
"dynamic_prompting": True, # Adjust prompts based on failure patterns
|
67 |
+
})
|
68 |
+
|
69 |
+
# Initialize recursive state
|
70 |
+
self.recursive_state = {
|
71 |
+
"reflection_depth": 0,
|
72 |
+
"adaptation_vector": [0.0] * 5, # Tracks adaptation across dimensions
|
73 |
+
"attribution_map": {}, # Maps error types to attribution patterns
|
74 |
+
"error_frequency": {}, # Tracks frequency of error types
|
75 |
+
"solution_quality_trend": [], # Tracks solution quality over iterations
|
76 |
+
}
|
77 |
+
|
78 |
+
self.logger.info(f"Initialized Anthropic model: {model_identifier} with recursive capability")
|
79 |
+
|
80 |
+
@backoff.on_exception(
|
81 |
+
backoff.expo,
|
82 |
+
(anthropic.APIError, anthropic.APITimeoutError, anthropic.RateLimitError),
|
83 |
+
max_tries=5
|
84 |
+
)
|
85 |
+
def solve(
|
86 |
+
self,
|
87 |
+
problem: Dict[str, Any],
|
88 |
+
history: Optional[List[Dict[str, Any]]] = None
|
89 |
+
) -> str:
|
90 |
+
"""
|
91 |
+
Generate a solution using the Anthropic model with recursive adaptation.
|
92 |
+
|
93 |
+
Args:
|
94 |
+
problem: The problem to solve
|
95 |
+
history: Optional history of previous solution attempts
|
96 |
+
|
97 |
+
Returns:
|
98 |
+
The generated solution
|
99 |
+
"""
|
100 |
+
self.logger.info(f"Solving problem with Anthropic model: {self.model_identifier}")
|
101 |
+
start_time = time.time()
|
102 |
+
|
103 |
+
# Reset recursive state for new problems if no history
|
104 |
+
if not history:
|
105 |
+
self._reset_recursive_state()
|
106 |
+
elif history:
|
107 |
+
# Update recursive state based on history
|
108 |
+
self._update_recursive_state(history)
|
109 |
+
|
110 |
+
# Format messages for the model
|
111 |
+
system_prompt, user_message = self._format_messages(problem, history)
|
112 |
+
|
113 |
+
# Make API call
|
114 |
+
response = self.client.messages.create(
|
115 |
+
model=self.model_identifier,
|
116 |
+
system=system_prompt,
|
117 |
+
messages=[
|
118 |
+
{"role": "user", "content": user_message}
|
119 |
+
],
|
120 |
+
max_tokens=self.api_params.get("max_tokens", 2000),
|
121 |
+
temperature=self.api_params.get("temperature", 0.2),
|
122 |
+
top_p=self.api_params.get("top_p", 0.95),
|
123 |
+
top_k=self.api_params.get("top_k", 50)
|
124 |
+
)
|
125 |
+
|
126 |
+
# Extract the solution from the response
|
127 |
+
solution = response.content[0].text
|
128 |
+
|
129 |
+
end_time = time.time()
|
130 |
+
self.logger.info(f"Solution generated in {end_time - start_time:.2f} seconds")
|
131 |
+
|
132 |
+
# Track solution in recursive state
|
133 |
+
if solution:
|
134 |
+
self.recursive_state["reflection_depth"] += 1
|
135 |
+
|
136 |
+
return self._extract_code(solution)
|
137 |
+
|
138 |
+

    def _format_messages(
        self,
        problem: Dict[str, Any],
        history: Optional[List[Dict[str, Any]]] = None
    ) -> Tuple[str, str]:
        """
        Format the problem and history into messages for the Anthropic API.

        Args:
            problem: The problem to solve
            history: Optional history of previous solution attempts

        Returns:
            Tuple of (system_prompt, user_message)
        """
        # Start with base system prompt
        system_prompt = self.prompts["system"]

        # Enhance system prompt with recursive adaptation if enabled
        if self.recursive_config.get("enable_self_reflection", True) and history:
            # Add adaptation guidance based on error patterns
            if self.recursive_state["error_frequency"]:
                top_errors = sorted(
                    self.recursive_state["error_frequency"].items(),
                    key=lambda x: x[1],
                    reverse=True
                )[:3]

                error_guidance = "Focus particularly on addressing these recurring issues:\n"
                for error_type, count in top_errors:
                    error_guidance += f"- {error_type} (appeared {count} times)\n"

                system_prompt += f"\n\n{error_guidance}"

            # Add reflection guidance based on solution quality trend
            if len(self.recursive_state["solution_quality_trend"]) > 1:
                trend = self.recursive_state["solution_quality_trend"]
                if trend[-1] > trend[-2]:
                    system_prompt += "\n\nYour solutions are improving. Continue this trajectory."
                elif trend[-1] < trend[-2]:
                    system_prompt += "\n\nYour solutions are declining in quality. Carefully reconsider your approach."
                else:
                    system_prompt += "\n\nYour solutions maintain the same quality. Try a different approach."

        # Format code and tests
        code = problem["code_context"]["code"]

        # Prepare tests description
        tests_description = "# Tests\n"
        if "tests" in problem["code_context"]:
            tests_description += "The code must pass the following tests:\n\n"
            for i, test in enumerate(problem["code_context"]["tests"]):
                tests_description += f"## Test {i+1}: {test['name']}\n```python\n{test['content']}\n```\n\n"
        else:
            tests_description += "The code must work correctly according to its intended functionality."

        # Base user message
        user_message = self.prompts["user_template"].format(
            description=problem["description"],
            code=code,
            tests_description=tests_description
        )

        # Add history if available - with recursive reflection
        if history and self.recursive_config.get("enable_self_reflection", True):
            # Get the most recent entry for reflection
            latest_entry = history[-1]

            # Format issues
            issues_text = "- " + "\n- ".join([issue["message"] for issue in latest_entry["feedback"]["issues"]])

            # Format suggestions
            suggestions_text = "- " + "\n- ".join([suggestion["message"] for suggestion in latest_entry["feedback"]["suggestions"]])

            # Format test results
            test_results = latest_entry.get("result", {})
            passed_tests = test_results.get("passed_tests", 0)
            total_tests = test_results.get("total_tests", 0)

            test_results_text = f"Passed {passed_tests}/{total_tests} tests."
            if "tests" in test_results:
                test_results_text += "\n\nIndividual test results:"
                for test_name, test_result in test_results["tests"].items():
                    status = "✅ Passed" if test_result.get("passed", False) else "❌ Failed"
                    test_results_text += f"\n- {test_name}: {status}"
                    if not test_result.get("passed", False) and "message" in test_result:
                        test_results_text += f"\n  Error: {test_result['message']}"

            # Add reflection prompt
            reflection_prompt = self.prompts["reflection_template"].format(
                issues=issues_text,
                suggestions=suggestions_text,
                test_results=test_results_text
            )

            # Prepend reflection to user message
            user_message = f"{reflection_prompt}\n\n{user_message}"

            # Add dynamic adaptation based on error patterns if enabled
            if self.recursive_config.get("dynamic_prompting", True):
                # Look for specific error patterns and add targeted guidance
                error_types = [issue.get("type", "") for issue in latest_entry["feedback"]["issues"]]

                if "syntax" in " ".join(error_types).lower():
                    user_message += "\n\nPay careful attention to syntax correctness. Double-check all parentheses, indentation, and function definitions."

                if "test_failure" in " ".join(error_types).lower():
                    user_message += "\n\nFocus on making the code pass the failing tests. Carefully trace through the code execution for each test case."

                if "edge_case" in " ".join(error_types).lower() or "boundary" in " ".join(error_types).lower():
                    user_message += "\n\nBe sure to handle edge cases such as empty inputs, boundary values, and special cases."

                if "performance" in " ".join(error_types).lower():
                    user_message += "\n\nOptimize your solution for better performance. Avoid unnecessary operations and inefficient data structures."

        return system_prompt, user_message

    def _extract_code(self, text: str) -> str:
        """
        Extract code from the model's response.

        Args:
            text: The model's response

        Returns:
            Extracted code
        """
        # Try to extract code from markdown code blocks
        code_blocks = re.findall(r'```(?:python)?\s*(.*?)\s*```', text, re.DOTALL)

        if code_blocks:
            return code_blocks[0].strip()

        # If no code blocks, return the full text (it might be just code)
        return text.strip()

    def _reset_recursive_state(self):
        """Reset the recursive state for a new problem."""
        self.recursive_state = {
            "reflection_depth": 0,
            "adaptation_vector": [0.0] * 5,
            "attribution_map": {},
            "error_frequency": {},
            "solution_quality_trend": [],
        }

    def _update_recursive_state(self, history: List[Dict[str, Any]]):
        """
        Update recursive state based on solution history.

        Args:
            history: History of previous solution attempts
        """
        # Extract scores from history
        scores = [entry.get("result", {}).get("score", 0.0) for entry in history]
        self.recursive_state["solution_quality_trend"] = scores

        # Calculate adaptation vector
        if len(scores) >= 2:
            # Dimension 0: Overall improvement trajectory
            improvement = scores[-1] - scores[0]
            self.recursive_state["adaptation_vector"][0] = max(-1.0, min(1.0, improvement))

            # Dimension 1: Recent improvement
            recent_improvement = scores[-1] - scores[-2]
            self.recursive_state["adaptation_vector"][1] = max(-1.0, min(1.0, recent_improvement))

        # Update error frequency from latest feedback
        if history:
            latest_feedback = history[-1].get("feedback", {})
            issues = latest_feedback.get("issues", [])

            for issue in issues:
                issue_type = issue.get("type", "unknown")
                self.recursive_state["error_frequency"][issue_type] = self.recursive_state["error_frequency"].get(issue_type, 0) + 1

        # Update reflection depth
        self.recursive_state["reflection_depth"] = len(history)

    def get_meta_information(self) -> Dict[str, Any]:
        """
        Get meta information about the model.

        Returns:
            Dictionary containing model information
        """
        return {
            "model_name": self.model_identifier,
            "provider": "Anthropic",
            "type": "API",
            "parameters": self.api_params,
            "system_prompt": self.prompts["system"],
            "recursive_capability": self.recursive_config.get("enable_self_reflection", True),
            "reflection_depth": self.recursive_state["reflection_depth"],
            "adaptation_vector": self.recursive_state["adaptation_vector"]
        }
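
For reference, a minimal sketch of how the recursive loop above consumes `problem` and `history`. The dictionary shapes mirror the keys read by `solve` and `_format_messages`; the constructor arguments and the `evaluate` helper are illustrative stand-ins for the evaluation harness, not part of this upload:

```python
problem = {
    "description": "Fix the off-by-one error in sum_to_n.",
    "code_context": {
        "code": "def sum_to_n(n):\n    return sum(range(n))\n",
        "tests": [{"name": "test_sum_to_n", "content": "assert sum_to_n(3) == 6"}],
    },
}

model = AnthropicModel("claude-3-opus-20240229")  # hypothetical model identifier

history = []
for _ in range(3):  # bounded refinement loop
    solution = model.solve(problem, history)
    result, feedback = evaluate(problem, solution)  # stand-in for the harness evaluator
    history.append({"solution": solution, "result": result, "feedback": feedback})
    if result.get("score", 0.0) >= 1.0:
        break
```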

# recursive_swe_bench/evaluation/recursive_metrics.py

import numpy as np
import scipy.stats
from typing import Any, Dict, List, Optional, Union
import dataclasses
import math

from recursive_swe_bench.core.recursive_task import Trajectory

class RecursiveLearningCurveArea:
    """
    Measures the area under the learning curve across iterations.

    This metric captures the overall performance of a model throughout its
    learning trajectory, rewarding both high scores and quick improvement.
    """

    def __init__(self, config: Dict[str, Any] = None):
        """
        Initialize the recursive learning curve area metric.

        Args:
            config: Configuration options
        """
        self.config = config or {}
        self.max_score = self.config.get("max_score", 1.0)
        self.normalize = self.config.get("normalize", True)

    def calculate(self, trajectory: Trajectory) -> float:
        """
        Calculate the area under the learning curve.

        Args:
            trajectory: The solution trajectory

        Returns:
            The normalized area under the learning curve
        """
        scores = trajectory.get_score_series()
        if not scores:
            return 0.0

        # Calculate the area under the curve using the trapezoidal rule
        area = np.trapz(scores, dx=1.0)

        # Normalize by the maximum possible area if requested
        if self.normalize:
            max_area = self.max_score * len(scores)
            return area / max_area

        return area
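
A worked example of the normalization, using a stub trajectory that only implements `get_score_series` (sufficient for this metric):

```python
class _StubTrajectory:
    """Stand-in exposing only the scores needed by the metric."""
    def get_score_series(self):
        return [0.2, 0.6, 0.9]

# Trapezoidal area = (0.2 + 0.6)/2 + (0.6 + 0.9)/2 = 1.15
# Normalized by max_score * len(scores) = 3.0  ->  ~0.383
print(RecursiveLearningCurveArea().calculate(_StubTrajectory()))  # ~0.383
```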

class AdaptationRate:
    """
    Measures the rate at which the model improves its solutions.

    This metric captures how quickly a model adapts to feedback and
    improves its solutions across iterations.
    """

    def __init__(self, config: Dict[str, Any] = None):
        """
        Initialize the adaptation rate metric.

        Args:
            config: Configuration options
        """
        self.config = config or {}
        self.min_iterations = self.config.get("min_iterations", 2)

    def calculate(self, trajectory: Trajectory) -> float:
        """
        Calculate the adaptation rate.

        Args:
            trajectory: The solution trajectory

        Returns:
            The adaptation rate
        """
        scores = trajectory.get_score_series()
        if len(scores) < self.min_iterations:
            return 0.0

        # Calculate the average improvement per iteration
        total_improvement = scores[-1] - scores[0]
        iterations = len(scores) - 1

        return total_improvement / iterations
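
With the same stub scores as above, the adaptation rate is the end-to-end improvement divided by the number of refinement steps:

```python
# (0.9 - 0.2) / 2 iterations = 0.35 improvement per iteration
print(AdaptationRate().calculate(_StubTrajectory()))  # 0.35
```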

class RecursiveVolatility:
    """
    Measures the volatility of solution quality across iterations.

    This metric captures how stable or erratic a model's performance
    is across iterations.
    """

    def __init__(self, config: Dict[str, Any] = None):
        """
        Initialize the recursive volatility metric.

        Args:
            config: Configuration options
        """
        self.config = config or {}
        self.min_iterations = self.config.get("min_iterations", 3)
        self.normalize = self.config.get("normalize", True)

    def calculate(self, trajectory: Trajectory) -> float:
        """
        Calculate the recursive volatility.

        Args:
            trajectory: The solution trajectory

        Returns:
            The normalized volatility
        """
        scores = trajectory.get_score_series()
        if len(scores) < self.min_iterations:
            return 0.0

        # Calculate the standard deviation of score changes
        changes = [abs(scores[i] - scores[i-1]) for i in range(1, len(scores))]
        volatility = np.std(changes)

        # Normalize by the average score if requested
        if self.normalize and np.mean(scores) > 0:
            return volatility / np.mean(scores)

        return volatility

class ConvergenceIndex:
    """
    Measures how quickly the model converges to a stable solution.

    This metric captures how efficiently a model reaches a stable solution
    across iterations.
    """

    def __init__(self, config: Dict[str, Any] = None):
        """
        Initialize the convergence index metric.

        Args:
            config: Configuration options
        """
        self.config = config or {}
        self.stability_threshold = self.config.get("stability_threshold", 0.05)
        self.max_score_threshold = self.config.get("max_score_threshold", 0.95)

    def calculate(self, trajectory: Trajectory) -> float:
        """
        Calculate the convergence index.

        Args:
            trajectory: The solution trajectory

        Returns:
            The convergence index (lower is better)
        """
        scores = trajectory.get_score_series()
        if not scores:
            return 0.0

        # Find the first iteration where the score stabilizes
        # (subsequent changes are below the stability threshold)
        convergence_point = len(scores) - 1
        for i in range(1, len(scores)):
            remaining_changes = [abs(scores[j] - scores[j-1]) for j in range(i, len(scores))]
            if all(change <= self.stability_threshold for change in remaining_changes):
                convergence_point = i
                break

        # Find the first iteration where the score exceeds the max score threshold
        # (note: max_score_point is computed but not used in the index below)
        max_score_point = len(scores)
        for i, score in enumerate(scores):
            if score >= self.max_score_threshold:
                max_score_point = i
                break

        # Return a combined index
        # Lower is better - converging quickly to a high score is ideal
        return (convergence_point / len(scores)) * (1.0 - max(0.0, min(1.0, scores[-1])))
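
A worked example with the default thresholds: for scores `[0.2, 0.6, 0.9, 0.92]`, the step-to-step changes only stay within 0.05 from iteration 3 onward, so the convergence point is 3 and the index is `(3 / 4) * (1.0 - 0.92) ≈ 0.06`:

```python
class _SlowConverger:
    """Stand-in trajectory for illustrating the convergence point."""
    def get_score_series(self):
        return [0.2, 0.6, 0.9, 0.92]

print(ConvergenceIndex().calculate(_SlowConverger()))  # ~0.06 (lower is better)
```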

class ErrorRecoveryEfficiency:
    """
    Measures how efficiently the model recovers from errors.

    This metric captures how well a model addresses and fixes specific
    errors across iterations.
    """

    def __init__(self, config: Dict[str, Any] = None):
        """
        Initialize the error recovery efficiency metric.

        Args:
            config: Configuration options
        """
        self.config = config or {}

    def calculate(self, trajectory: Trajectory) -> float:
        """
        Calculate the error recovery efficiency.

        Args:
            trajectory: The solution trajectory

        Returns:
            The error recovery efficiency
        """
        if not trajectory.steps or len(trajectory.steps) < 2:
            return 0.0

        # Extract error counts from each step
        error_counts = []
        for step in trajectory.steps:
            if hasattr(step, "result") and hasattr(step.result, "error_details"):
                error_counts.append(len(step.result.error_details or {}))
            else:
                # If no error details available, use issue count from feedback
                error_counts.append(len(step.feedback.issues))

        if not error_counts or error_counts[0] == 0:
            return 1.0  # Perfect if no initial errors

        # Calculate the rate at which errors are fixed
        initial_errors = error_counts[0]
        final_errors = error_counts[-1]

        # Return the proportion of errors fixed
        return (initial_errors - final_errors) / initial_errors

class DynamicComplexityHandling:
    """
    Measures how well the model handles varying problem complexity.

    This metric evaluates performance while accounting for changes in
    problem difficulty across iterations.
    """

    def __init__(self, config: Dict[str, Any] = None):
        """
        Initialize the dynamic complexity handling metric.

        Args:
            config: Configuration options
        """
        self.config = config or {}

    def calculate(self, trajectory: Trajectory) -> float:
        """
        Calculate the dynamic complexity handling score.

        Args:
            trajectory: The solution trajectory

        Returns:
            The dynamic complexity handling score
        """
        if not trajectory.steps:
            return 0.0

        # Extract scores and difficulties from each step
        scores = []
        difficulties = []

        for step in trajectory.steps:
            scores.append(step.result.score)
            difficulties.append(step.problem_state.difficulty)

        # Calculate difficulty-weighted scores
        weighted_scores = [scores[i] / max(0.1, difficulties[i]) for i in range(len(scores))]

        # Return the average weighted score
        return sum(weighted_scores) / len(weighted_scores)
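
A small sketch of the difficulty weighting, using `SimpleNamespace` stand-ins for the trajectory steps:

```python
from types import SimpleNamespace

# Two-step trajectory: scores 0.5 and 0.8 at difficulties 0.5 and 1.0.
# Difficulty-weighted scores are 0.5/0.5 = 1.0 and 0.8/1.0 = 0.8 -> average 0.9.
steps = [
    SimpleNamespace(result=SimpleNamespace(score=0.5), problem_state=SimpleNamespace(difficulty=0.5)),
    SimpleNamespace(result=SimpleNamespace(score=0.8), problem_state=SimpleNamespace(difficulty=1.0)),
]
print(DynamicComplexityHandling().calculate(SimpleNamespace(steps=steps)))  # 0.9
```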

class RecursiveFrameworkMetrics:
    """
    Comprehensive collection of metrics for recursive evaluation.

    This class provides easy access to all recursive metrics and
    standardized calculation across trajectories.
    """

    def __init__(self, config: Dict[str, Any] = None):
        """
        Initialize the recursive framework metrics.

        Args:
            config: Configuration options
        """
        self.config = config or {}

        # Initialize all metrics
        self.metrics = {
            "learning_curve_area": RecursiveLearningCurveArea(self.config.get("learning_curve_area")),
            "adaptation_rate": AdaptationRate(self.config.get("adaptation_rate")),
            "volatility": RecursiveVolatility(self.config.get("volatility")),
            "convergence_index": ConvergenceIndex(self.config.get("convergence_index")),
            "error_recovery": ErrorRecoveryEfficiency(self.config.get("error_recovery")),
            "complexity_handling": DynamicComplexityHandling(self.config.get("complexity_handling"))
        }

        # Add custom metrics from config if provided
        if "custom_metrics" in self.config:
            for name, metric in self.config["custom_metrics"].items():
                self.metrics[name] = metric

    def calculate_all(self, trajectory: Trajectory) -> Dict[str, float]:
        """
        Calculate all metrics for a trajectory.

        Args:
            trajectory: The solution trajectory

        Returns:
            Dictionary of metric names and values
        """
        return {name: metric.calculate(trajectory)
                for name, metric in self.metrics.items()}

    def calculate(self, trajectory: Trajectory, metric_name: str) -> float:
        """
        Calculate a specific metric for a trajectory.

        Args:
            trajectory: The solution trajectory
            metric_name: The name of the metric to calculate

        Returns:
            The calculated metric value
        """
        if metric_name not in self.metrics:
            raise ValueError(f"Unknown metric: {metric_name}")

        return self.metrics[metric_name].calculate(trajectory)

    def aggregate_metrics(self, trajectories: List[Trajectory]) -> Dict[str, float]:
        """
        Calculate aggregate metrics across multiple trajectories.

        Args:
            trajectories: List of solution trajectories

        Returns:
            Dictionary of aggregated metric values
        """
        if not trajectories:
            return {}

        all_metrics = [self.calculate_all(trajectory) for trajectory in trajectories]

        # Aggregate by averaging each metric
        aggregated = {}
        for metric_name in self.metrics:
            values = [metrics[metric_name] for metrics in all_metrics]
            aggregated[metric_name] = sum(values) / len(values)

        return aggregated
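
A usage sketch, assuming `trajectories` is a list of `Trajectory` objects produced by the evaluation harness:

```python
metrics = RecursiveFrameworkMetrics()
per_run = metrics.calculate_all(trajectories[0])     # e.g. {"adaptation_rate": 0.35, ...}
overall = metrics.aggregate_metrics(trajectories)    # mean of each metric across runs
print(per_run["learning_curve_area"], overall["convergence_index"])
```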

# recursive_swe_bench/evaluation/visualizer.py

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from typing import Any, Dict, List, Optional, Union
import os
import json
import seaborn as sns
from pathlib import Path

from recursive_swe_bench.core.recursive_task import Trajectory

class RecursiveVisualizer:
    """
    Visualization tools for recursive evaluation results.

    This class provides methods for visualizing recursive trajectories,
    metrics, and comparative analysis across models.
    """

    def __init__(self, output_dir: Optional[str] = None, config: Dict[str, Any] = None):
        """
        Initialize the recursive visualizer.

        Args:
            output_dir: Directory to save visualizations
            config: Configuration options
        """
        self.output_dir = output_dir
        if output_dir:
            os.makedirs(output_dir, exist_ok=True)

        self.config = config or {}
        self.theme = self.config.get("theme", "default")

        # Set up the visualization style
        if self.theme == "dark":
            plt.style.use("dark_background")
            self.colors = sns.color_palette("viridis", 10)
        else:
            plt.style.use("seaborn-v0_8-whitegrid")
            self.colors = sns.color_palette("muted", 10)

        sns.set_context("talk")

    def plot_trajectory(
        self,
        trajectory: Trajectory,
        title: Optional[str] = None,
        show: bool = True,
        save_path: Optional[str] = None
    ):
        """
        Plot a solution trajectory showing score evolution.

        Args:
            trajectory: The solution trajectory
            title: Optional title for the plot
            show: Whether to display the plot
            save_path: Optional path to save the plot
        """
        scores = trajectory.get_score_series()
        if not scores:
            return

        plt.figure(figsize=(10, 6))

        # Plot scores
        plt.plot(range(1, len(scores) + 1), scores, marker='o',
                 linewidth=2, markersize=8, color=self.colors[0])

        # Add difficulty if available
        difficulties = [step.problem_state.difficulty for step in trajectory.steps]
        if difficulties:
            plt.plot(range(1, len(difficulties) + 1), difficulties, marker='s',
                     linewidth=2, markersize=8, color=self.colors[1], linestyle='--',
                     label='Problem Difficulty')

        # Set plot properties
        plt.title(title or f"Solution Trajectory for Task {trajectory.task_id}")
        plt.xlabel("Iteration")
        plt.ylabel("Score / Difficulty")
        plt.grid(True)
        plt.ylim(0, 1.05)
        plt.xticks(range(1, len(scores) + 1))
        plt.legend(["Solution Score", "Problem Difficulty"])

        # Save if requested
        if save_path:
            full_path = os.path.join(self.output_dir, save_path) if self.output_dir else save_path
            plt.savefig(full_path, bbox_inches='tight', dpi=300)

        # Show if requested
        if show:
            plt.show()
        else:
            plt.close()

    def plot_metrics_comparison(
        self,
        metrics_by_model: Dict[str, Dict[str, float]],
        title: Optional[str] = None,
        show: bool = True,
        save_path: Optional[str] = None
    ):
        """
        Plot a comparison of metrics across models.

        Args:
            metrics_by_model: Dictionary mapping model names to metric values
            title: Optional title for the plot
            show: Whether to display the plot
            save_path: Optional path to save the plot
        """
        if not metrics_by_model:
            return

        # Convert to DataFrame for easier plotting
        df = pd.DataFrame(metrics_by_model).T

        # Create a radar chart
        categories = list(df.columns)
        N = len(categories)

        # Create angles for each metric
        angles = [n / float(N) * 2 * np.pi for n in range(N)]
        angles += angles[:1]  # Close the loop

        # Create figure
        fig, ax = plt.subplots(figsize=(10, 10), subplot_kw=dict(polar=True))

        # Add lines for each model
        for i, (model, metrics) in enumerate(df.iterrows()):
            values = metrics.values.flatten().tolist()
            values += values[:1]  # Close the loop

            # Plot the line
            ax.plot(angles, values, linewidth=2, linestyle='solid',
                    label=model, color=self.colors[i % len(self.colors)])
            ax.fill(angles, values, alpha=0.1, color=self.colors[i % len(self.colors)])

        # Set category labels
        plt.xticks(angles[:-1], categories)

        # Set y-axis limits
        plt.ylim(0, 1)

        # Add legend
        plt.legend(loc='upper right', bbox_to_anchor=(0.1, 0.1))

        # Set title
        plt.title(title or "Metrics Comparison Across Models")

        # Save if requested
        if save_path:
            full_path = os.path.join(self.output_dir, save_path) if self.output_dir else save_path
            plt.savefig(full_path, bbox_inches='tight', dpi=300)
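
A usage sketch for the visualizer; the output paths and the metrics dictionary below are illustrative:

```python
viz = RecursiveVisualizer(output_dir="reports", config={"theme": "dark"})
viz.plot_trajectory(trajectory, show=False, save_path="trajectory.png")
viz.plot_metrics_comparison(
    {"model-a": {"adaptation_rate": 0.35, "learning_curve_area": 0.62},
     "model-b": {"adaptation_rate": 0.20, "learning_curve_area": 0.55}},
    show=False,
    save_path="comparison.png",
)
```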
models/base_models.py
ADDED
@@ -0,0 +1,259 @@

# recursive_swe_bench/models/base_model.py

from typing import Any, Dict, List, Optional, Union
import logging
import time
from abc import ABC, abstractmethod


class ModelInterface(ABC):
    """
    Base interface for models that can be evaluated using Recursive-SWE-bench.

    This abstract class defines the core functionality required for a model to
    be evaluated using the recursive evaluation framework. Concrete implementations
    must provide the actual model-specific logic.
    """

    def __init__(self, model_identifier: str, config: Optional[Dict[str, Any]] = None):
        """
        Initialize the model interface.

        Args:
            model_identifier: Identifier for the model
            config: Configuration options
        """
        self.model_identifier = model_identifier
        self.config = config or {}
        self.logger = self._setup_logger()

    def _setup_logger(self) -> logging.Logger:
        """Set up logging for the model."""
        logger = logging.getLogger(f"Model.{self.model_identifier}")
        handler = logging.StreamHandler()
        formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
        handler.setFormatter(formatter)
        logger.addHandler(handler)
        logger.setLevel(self.config.get("log_level", logging.INFO))
        return logger

    @abstractmethod
    def solve(self, problem: Dict[str, Any], history: Optional[List[Dict[str, Any]]] = None) -> str:
        """
        Generate a solution for the given problem.

        Args:
            problem: The problem to solve
            history: Optional history of previous solution attempts

        Returns:
            The generated solution
        """
        pass

    @abstractmethod
    def get_meta_information(self) -> Dict[str, Any]:
        """
        Get meta information about the model.

        Returns:
            Dictionary containing model information
        """
        pass
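
A minimal sketch of a custom model plugged into this interface; the trivial baseline below just echoes the buggy code back and is not part of the upload:

```python
class EchoBaseline(ModelInterface):
    """Hypothetical no-op baseline: returns the original code unchanged."""

    def solve(self, problem, history=None):
        # Useful as a floor when comparing real models on the recursive metrics.
        return problem["code_context"]["code"]

    def get_meta_information(self):
        return {"model_name": self.model_identifier, "provider": "local", "type": "baseline"}
```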

# recursive_swe_bench/models/openai.py

import openai
import json
import time
import backoff
from typing import Any, Dict, List, Optional, Union

from recursive_swe_bench.models.base_model import ModelInterface

class OpenAIModel(ModelInterface):
    """
    Integration with OpenAI models (GPT-3.5, GPT-4, etc.).

    This class provides integration with OpenAI's API for evaluating
    models like GPT-3.5 and GPT-4 with Recursive-SWE-bench.
    """

    def __init__(
        self,
        model_identifier: str,
        api_key: Optional[str] = None,
        config: Optional[Dict[str, Any]] = None
    ):
        """
        Initialize the OpenAI model interface.

        Args:
            model_identifier: OpenAI model identifier (e.g., "gpt-4", "gpt-3.5-turbo")
            api_key: OpenAI API key (optional if set in environment)
            config: Additional configuration options
        """
        super().__init__(model_identifier, config)

        # Set API key if provided
        if api_key:
            openai.api_key = api_key

        # Load default prompts or use config-provided ones
        self.prompts = self.config.get("prompts", {
            "system": "You are an expert programmer tasked with fixing bugs in code. Fix the code based on the description and tests.",
            "user_template": "# Bug Fixing Task\n\n{description}\n\n# Code\n```python\n{code}\n```\n\n{tests_description}\n\n# Your task\nFix the bugs in the code above. Provide only the corrected code without any explanations.",
        })

        # Configure API parameters
        self.api_params = self.config.get("api_params", {
            "temperature": 0.2,
            "max_tokens": 2000,
            "top_p": 0.95,
            "frequency_penalty": 0,
            "presence_penalty": 0,
        })

        self.logger.info(f"Initialized OpenAI model: {model_identifier}")

    @backoff.on_exception(
        backoff.expo,
        (openai.error.RateLimitError, openai.error.ServiceUnavailableError, openai.error.APIError),
        max_tries=5
    )
    def solve(
        self,
        problem: Dict[str, Any],
        history: Optional[List[Dict[str, Any]]] = None
    ) -> str:
        """
        Generate a solution using the OpenAI model.

        Args:
            problem: The problem to solve
            history: Optional history of previous solution attempts

        Returns:
            The generated solution
        """
        self.logger.info(f"Solving problem with OpenAI model: {self.model_identifier}")
        start_time = time.time()

        # Format the problem for the model
        messages = self._format_messages(problem, history)

        # Make API call (uses the pre-1.0 openai SDK interface: ChatCompletion / openai.error)
        response = openai.ChatCompletion.create(
            model=self.model_identifier,
            messages=messages,
            **self.api_params
        )

        # Extract the solution from the response
        solution = response.choices[0].message.content.strip()

        end_time = time.time()
        self.logger.info(f"Solution generated in {end_time - start_time:.2f} seconds")

        return self._extract_code(solution)

    def _format_messages(
        self,
        problem: Dict[str, Any],
        history: Optional[List[Dict[str, Any]]] = None
    ) -> List[Dict[str, str]]:
        """
        Format the problem and history into messages for the OpenAI API.

        Args:
            problem: The problem to solve
            history: Optional history of previous solution attempts

        Returns:
            List of formatted messages
        """
        messages = [
            {"role": "system", "content": self.prompts["system"]}
        ]

        # Format the user message
        code = problem["code_context"]["code"]

        # Prepare tests description
        tests_description = "# Tests\n"
        if "tests" in problem["code_context"]:
            tests_description += "The code must pass the following tests:\n\n"
            for i, test in enumerate(problem["code_context"]["tests"]):
                tests_description += f"## Test {i+1}: {test['name']}\n```python\n{test['content']}\n```\n\n"
        else:
            tests_description += "The code must work correctly according to its intended functionality."

        # Create the user message using the template
        user_content = self.prompts["user_template"].format(
            description=problem["description"],
            code=code,
            tests_description=tests_description
        )

        messages.append({"role": "user", "content": user_content})

        # Add history if available
        if history and self.config.get("include_history", True):
            for entry in history:
                # Add previous attempt
                messages.append({
                    "role": "assistant",
                    "content": entry["solution"]
                })

                # Add feedback on previous attempt
                feedback_content = "Your solution has the following issues:\n"
                for issue in entry["feedback"]["issues"]:
                    feedback_content += f"- {issue['message']}\n"

                feedback_content += "\nPlease try again with these improvements:\n"
                for suggestion in entry["feedback"]["suggestions"]:
                    feedback_content += f"- {suggestion['message']}\n"

                messages.append({
                    "role": "user",
                    "content": feedback_content
                })

        return messages

    def _extract_code(self, text: str) -> str:
        """
        Extract code from the model's response.

        Args:
            text: The model's response

        Returns:
            Extracted code
        """
        # Try to extract code from markdown code blocks
        import re
        code_blocks = re.findall(r'```(?:python)?\s*(.*?)\s*```', text, re.DOTALL)

        if code_blocks:
            return code_blocks[0].strip()

        # If no code blocks, return the full text (it might be just code)
        return text.strip()

    def get_meta_information(self) -> Dict[str, Any]:
        """
        Get meta information about the model.

        Returns:
            Dictionary containing model information
        """
        return {
            "model_name": self.model_identifier,
            "provider": "OpenAI",
            "type": "API",
            "parameters": self.api_params,
            "system_prompt": self.prompts["system"]
        }
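
An illustrative call sequence (the model name, environment-variable handling, and the hand-built history entry are assumptions; the harness normally constructs the history and `problem` is shaped as in the earlier sketch):

```python
import os

model = OpenAIModel("gpt-4", api_key=os.environ.get("OPENAI_API_KEY"))
solution = model.solve(problem)  # first attempt

# After evaluation, retry with the feedback folded into the conversation:
solution_v2 = model.solve(problem, history=[{
    "solution": solution,
    "result": {"score": 0.5, "passed_tests": 1, "total_tests": 2},
    "feedback": {
        "issues": [{"type": "test_failure", "message": "test_edge_case fails on empty input"}],
        "suggestions": [{"message": "Handle the empty-input case explicitly"}],
    },
}])
```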
task_generators/bug_fixing.py
ADDED
The diff for this file is too large to render.