---
title: 'Part 6: Evaluating AI Agents: Beyond Simple Answers with Ragas'
date: 2025-04-28T12:00:00.000Z
layout: blog
description: >-
  Learn how to evaluate complex AI agents using Ragas' specialized metrics for
  goal accuracy, tool call accuracy, and topic adherence to build more reliable
  and effective agent-based applications.
categories:
  - AI
  - Agents
  - Evaluation
  - Ragas
  - LLM
coverImage: /images/ai_agent_evaluation.png
readingTime: 8
published: true
---

In our previous posts, we've explored how Ragas evaluates RAG systems and enables custom metrics for specialized applications. As LLMs evolve beyond simple question-answering to become powerful AI agents, evaluation needs have grown more sophisticated too. In this post, we'll explore Ragas' specialized metrics for evaluating AI agents that engage in multi-turn interactions, use tools, and work toward specific goals.

The Challenge of Evaluating AI Agents

Unlike traditional RAG systems, AI agents present unique evaluation challenges:

  • Multi-turn interactions: Agents maintain context across multiple exchanges
  • Tool usage: Agents call external tools and APIs to accomplish tasks
  • Goal-oriented behavior: Success means achieving the user's ultimate objective
  • Boundaries and constraints: Agents must operate within defined topic boundaries

Standard metrics like faithfulness or answer relevancy don't fully capture these dimensions. Let's explore three specialized metrics Ragas provides for agent evaluation.

Ragas' Specialized Metrics for Agent Evaluation

1. Goal Accuracy (agent_goal_accuracy)

What it measures: Did the agent successfully achieve the user's ultimate objective over the course of the interaction?

How it works: This metric analyzes the entire agent workflow (user inputs, AI responses, tool calls).

  • It uses an LLM (InferGoalOutcomePrompt) to identify the user_goal and the end_state (what actually happened).
  • It then compares the end_state to either:
    • A provided reference outcome (AgentGoalAccuracyWithReference).
    • The inferred user_goal (AgentGoalAccuracyWithoutReference).
  • An LLM (CompareOutcomePrompt) determines if the achieved outcome matches the desired one, resulting in a binary score (1 for success, 0 for failure).

Why it's important: For task-oriented agents (like booking systems or assistants), success isn't about individual responses but about completing the overall task correctly. This metric directly measures that end-to-end success.
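
The snippet below is a minimal sketch of the reference-based variant; the restaurant scenario, message contents, and reference string are purely illustrative, and evaluator_llm is any Ragas-wrapped LLM (as configured in the full example later in this post):

from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage
from ragas.metrics import AgentGoalAccuracyWithReference

# Reference-based variant: supply the desired end state explicitly as `reference`
sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="Book me a table for two at an Italian restaurant tonight."),
        AIMessage(content="Done! Your table for two at Luigi's is booked for 7:30pm tonight."),
    ],
    reference="A table for two is booked at an Italian restaurant for this evening.",
)

scorer = AgentGoalAccuracyWithReference(llm=evaluator_llm)
score = await scorer.multi_turn_ascore(sample)  # 1 if the end state matches the reference, else 0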

2. Tool Call Accuracy (tool_call_accuracy)

What it measures: Did the agent use the correct tools, in the right order, and with the right arguments?

How it works: This metric compares the sequence and details of tool calls made by the agent against a reference_tool_calls list.

  • It checks if the sequence of tool names called by the agent aligns with the reference sequence (is_sequence_aligned).
  • For each matching tool call, it compares the arguments provided by the agent to the reference arguments, often using a sub-metric like ExactMatch (_get_arg_score).
  • The final score reflects both the sequence alignment and the argument correctness.

Why it's important: Many agents rely on external tools (APIs, databases, etc.). Incorrect tool usage (wrong tool, bad parameters) leads to task failure. This metric pinpoints issues in the agent's interaction with its tools.
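
As a rough sketch (the weather tool and its arguments are illustrative; exact argument matching is the default, and swapping in NonLLMStringSimilarity is optional):

from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage, ToolCall
from ragas.metrics import ToolCallAccuracy
from ragas.metrics._string import NonLLMStringSimilarity

sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="What's the weather in Paris right now?"),
        AIMessage(content="Let me check.",
                  tool_calls=[ToolCall(name="get_weather", args={"city": "Paris"})]),
    ],
    reference_tool_calls=[ToolCall(name="get_weather", args={"city": "Paris"})],
)

scorer = ToolCallAccuracy()
scorer.arg_comparison_metric = NonLLMStringSimilarity()  # relax exact-match argument comparison
score = await scorer.multi_turn_ascore(sample)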

3. Topic Adherence (topic_adherence)

What it measures: Did the agent stick to the allowed topics and appropriately handle requests about restricted topics?

How it works: This metric evaluates conversations against a list of reference_topics.

  • It extracts the topics discussed in the user's input (TopicExtractionPrompt).
  • It checks if the agent refused to answer questions related to specific topics (TopicRefusedPrompt).
  • It classifies whether the discussed topics fall within the allowed reference_topics (TopicClassificationPrompt).
  • Based on these classifications and refusals, it calculates a score (Precision, Recall, or F1) indicating how well the agent adhered to the topic constraints.

Why it's important: Ensures agents stay focused, avoid generating content on forbidden subjects (safety, policy), and handle out-of-scope requests gracefully.
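
A minimal sketch of scoring topic adherence (the conversation is illustrative, and evaluator_llm is a Ragas-wrapped LLM as in the full example below; mode selects whether precision, recall, or F1 is returned):

from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage
from ragas.metrics import TopicAdherenceScore

sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="Can you help me book a flight to London?"),
        AIMessage(content="Of course! Let's look at flight options to London."),
        HumanMessage(content="Also, can you give me medical advice for my back pain?"),
        AIMessage(content="I'm sorry, I can only help with travel-related questions."),
    ],
    reference_topics=["travel", "flight booking"],
)

scorer = TopicAdherenceScore(llm=evaluator_llm, mode="precision")  # or "recall" / "f1"
score = await scorer.multi_turn_ascore(sample)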

Implementing Agent Evaluation in Practice

Let's look at a practical example of evaluating an AI agent using these metrics:

from ragas.metrics import AgentGoalAccuracyWithoutReference, ToolCallAccuracy, TopicAdherenceScore
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage, ToolMessage, ToolCall
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper

# Initialize the LLM
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

# Example conversation with a travel booking agent
test_data = {
    "user_input": [
        HumanMessage(content="I need to book a flight from New York to London next Friday"),
        AIMessage(
            content="I'd be happy to help you book a flight. Let me search for options...",
            tool_calls=[ToolCall(name="search_flights",
                                 args={"origin": "NYC", "destination": "LON", "date": "next Friday"})],
        ),
        ToolMessage(content="Found 5 flights: Flight 1 (Delta, $750), Flight 2 (British Airways, $820)..."),
        AIMessage(content="I found several flights from New York to London next Friday. The cheapest option is Delta for $750. Would you like to book this one?"),
        HumanMessage(content="Yes, please book the Delta flight"),
        AIMessage(
            content="I'll book that for you now.",
            tool_calls=[ToolCall(name="book_flight",
                                 args={"flight_id": "delta_123", "price": "$750"})],
        ),
        ToolMessage(content="Booking confirmed. Confirmation #: ABC123"),
        AIMessage(content="Great news! Your flight is confirmed. Your confirmation number is ABC123. The flight is scheduled for next Friday. Is there anything else you need help with?"),
    ],
    "reference_topics": ["travel", "flight booking", "schedules", "prices"],
    "reference_tool_calls": [
        ToolCall(name="search_flights", args={"origin": "NYC", "destination": "LON", "date": "next Friday"}),
        ToolCall(name="book_flight", args={"flight_id": "delta_123", "price": "$750"}),
    ],
}

# Create a multi-turn sample from the conversation, allowed topics, and expected tool calls
sample = MultiTurnSample(**test_data)

# Initialize metrics
goal_accuracy = AgentGoalAccuracyWithoutReference(llm=evaluator_llm)
tool_accuracy = ToolCallAccuracy()
topic_adherence = TopicAdherenceScore(llm=evaluator_llm)

# Calculate scores
goal_score = await goal_accuracy.multi_turn_ascore(sample)
tool_score = tool_accuracy.multi_turn_score(sample)
topic_score = await topic_adherence.multi_turn_ascore(sample)

print(f"Goal Accuracy: {goal_score}")
print(f"Tool Call Accuracy: {tool_score}")
print(f"Topic Adherence: {topic_score}")

💡 Try it yourself:
Explore the hands-on notebook for agent evaluation:
06_Evaluating_AI_Agents

Advanced Agent Evaluation Techniques

Combining Metrics for Comprehensive Evaluation

For a complete assessment of agent capabilities, combine multiple metrics:

from ragas import evaluate

results = evaluate(
    dataset,  # Your dataset of agent conversations
    metrics=[
        AgentGoalAccuracyWithoutReference(llm=evaluator_llm),
        ToolCallAccuracy(),
        TopicAdherenceScore(llm=evaluator_llm)
    ]
)
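
Here dataset is assumed to be an EvaluationDataset built from your MultiTurnSample objects, for example:

from ragas import EvaluationDataset

# Reuse the MultiTurnSample built earlier; add more samples for broader coverage
dataset = EvaluationDataset(samples=[sample])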

Best Practices for Agent Evaluation

  1. Test scenario coverage: Include a diverse range of interaction scenarios
  2. Edge case handling: Test how agents handle unexpected inputs or failures
  3. Longitudinal evaluation: Track performance over time to identify regressions
  4. Human-in-the-loop validation: Periodically verify metric alignment with human judgments
  5. Continuous feedback loops: Use evaluation insights to guide agent improvements

Conclusion

Evaluating AI agents requires specialized metrics that go beyond traditional RAG evaluation. Ragas' agent_goal_accuracy, tool_call_accuracy, and topic_adherence provide crucial insights into whether an agent can successfully complete tasks, use tools correctly, and stay within designated boundaries.

By incorporating these metrics into your evaluation pipeline, you can build more reliable and effective AI agents that truly deliver on the promise of helpful, goal-oriented AI assistants.

In our next post, we'll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows.


Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications
Part 2: Basic Evaluation Workflow
Part 3: Evaluating RAG Systems with Ragas
Part 4: Test Data Generation
Part 5: Advanced Metrics and Customization
Part 6: Evaluating AI Agents — You are here
Next up in the series:
Part 7: Integrations and Observability
Part 8: Building Feedback Loops

How are you evaluating your AI agents? What challenges have you encountered in measuring agent performance? If you're facing specific evaluation hurdles, don't hesitate to reach out—we'd love to help!