---
title: 'Part 2: Basic Evaluation Workflow with Ragas'
date: 2025-04-27T01:00:00.000Z
layout: blog
description: >-
  Learn how to set up a basic evaluation workflow for LLM applications using
  Ragas. This guide walks you through data preparation, metric selection, and
  result analysis.
categories:
  - AI
  - RAG
  - Evaluation
  - Ragas
coverImage: >-
  https://images.unsplash.com/photo-1600132806370-bf17e65e942f?q=80&w=1988&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D
readingTime: 8
published: true
---
In our previous post, we introduced Ragas as a powerful framework for evaluating LLM applications. Now, let's dive into the practical aspects of setting up your first evaluation pipeline.

## Understanding the Evaluation Workflow
A typical Ragas evaluation workflow consists of four key steps:
- Prepare your data: Collect queries, contexts, responses, and reference answers
- Select appropriate metrics: Choose metrics that align with what you want to evaluate
- Run the evaluation: Process your data through the selected metrics
- Analyze the results: Interpret scores and identify areas for improvement
Let's walk through each step with practical examples.

## Step 1: Setting Up Your Environment
First, ensure you have Ragas installed:
```bash
uv add ragas
```
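If you're not using uv, installing from PyPI with pip works just as well:

```bash
pip install ragas
```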
Next, import the necessary components:
```python
import pandas as pd

from ragas import EvaluationDataset
from ragas import evaluate, RunConfig
from ragas.metrics import (
    LLMContextRecall,
    Faithfulness,
    FactualCorrectness,
    ResponseRelevancy,
    ContextEntityRecall,
    NoiseSensitivity,
)
```

## Step 2: Preparing Your Evaluation Data
For a RAG system evaluation, you'll need:
- Questions: User queries to your system
- Contexts: Documents or chunks retrieved by your system
- Responses: Answers generated by your system
- Ground truth (optional): Reference answers or documents for comparison
Here's how to organize this data:
```python
# Sample data
data = {
    "user_input": [
        "What are the main symptoms of COVID-19?",
        "How does machine learning differ from deep learning?"
    ],
    "retrieved_contexts": [
        [
            "Common symptoms of COVID-19 include fever, cough, and fatigue. Some patients also report loss of taste or smell, body aches, and difficulty breathing.",
            "COVID-19 is caused by the SARS-CoV-2 virus and spreads primarily through respiratory droplets."
        ],
        [
            "Machine learning is a subset of AI focused on algorithms that learn from data without being explicitly programmed.",
            "Deep learning is a specialized form of machine learning using neural networks with many layers (deep neural networks)."
        ]
    ],
    "response": [
        "The main symptoms of COVID-19 include fever, cough, fatigue, and sometimes loss of taste or smell, body aches, and breathing difficulties.",
        "Machine learning is a subset of AI that focuses on algorithms learning from data, while deep learning is a specialized form of machine learning that uses deep neural networks with multiple layers."
    ],
    "reference": [
        "COVID-19 symptoms commonly include fever, dry cough, fatigue, loss of taste or smell, body aches, sore throat, and in severe cases, difficulty breathing.",
        "Machine learning is a branch of AI where systems learn from data, identify patterns, and make decisions with minimal human intervention. Deep learning is a subset of machine learning that uses neural networks with multiple layers (deep neural networks) to analyze various factors of data."
    ]
}

eval_data = pd.DataFrame(data)

# Convert to a format Ragas can use
evaluation_dataset = EvaluationDataset.from_pandas(eval_data)
evaluation_dataset
```
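Before running any metrics, it can help to sanity-check what Ragas parsed. `EvaluationDataset` exposes a `to_pandas()` helper, and the columns should match the names used above (`user_input`, `retrieved_contexts`, `response`, `reference`). A quick check:

```python
# Round-trip the dataset back to pandas to confirm the columns Ragas parsed
parsed = evaluation_dataset.to_pandas()
print(parsed.shape)             # (number of samples, number of columns)
print(parsed.columns.tolist())  # should include user_input, retrieved_contexts, response, reference
```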

## Step 3: Selecting and Configuring Metrics
Ragas offers various metrics to evaluate different aspects of your system:
Core RAG Metrics:
- Faithfulness: Measures if the response is factually consistent with the provided context.
- Factual Correctness: Assesses if the response is accurate and free from factual errors.
- Response Relevancy: Evaluates if the response directly addresses the user query.
- Context Entity Recall: Measures how well the retrieved context captures relevant entities from the ground truth.
- Noise Sensitivity: Measures how easily irrelevant or noisy context leads the system into incorrect claims (lower scores are better).
- LLM Context Recall: Uses an LLM to judge how well the retrieved context covers the claims in the reference answer.
For metrics that require an LLM (like faithfulness), you need to configure the LLM provider:
```python
# Configure the LLM used for evaluation
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper

# Initialize the evaluator LLM (requires an OpenAI API key in OPENAI_API_KEY)
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

# Define metrics to use
metrics = [
    Faithfulness(),
    FactualCorrectness(),
    ResponseRelevancy(),
    ContextEntityRecall(),
    NoiseSensitivity(),
    LLMContextRecall()
]
```
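One practical note: `ResponseRelevancy` also relies on an embeddings model, and if you don't supply one Ragas falls back to its default (typically OpenAI embeddings). If you prefer to make that explicit, you can wrap an embeddings model the same way as the LLM and pass it to `evaluate()` via its `embeddings` argument in the next step. A minimal sketch, assuming `langchain-openai` is installed (the model name here is just an example):

```python
from langchain_openai import OpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper

# Wrap an embeddings model for metrics that need one (e.g., ResponseRelevancy)
evaluator_embeddings = LangchainEmbeddingsWrapper(
    OpenAIEmbeddings(model="text-embedding-3-small")
)
```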

## Step 4: Running the Evaluation
Now, run the evaluation with your selected metrics:
```python
# Run evaluation
results = evaluate(
    evaluation_dataset,
    metrics=metrics,
    llm=evaluator_llm  # Required for LLM-based metrics
)

# View results
print(results)
```
Output (values will vary based on your data and LLM performance):

```
{
  "faithfulness": 1.0000,
  "factual_correctness": 0.6750,
  "answer_relevancy": 0.9897,
  "context_entity_recall": 0.8889,
  "noise_sensitivity_relevant": 0.1667,
  "context_recall": 0.5000
}
```

## Step 5: Interpreting Results

Ragas metrics return scores between 0 and 1. For most metrics higher is better; Noise Sensitivity is the main exception, where a lower score means the system is less easily misled by noisy context.

Understanding Score Ranges (with a per-sample breakdown sketched after this list):

- 0.8-1.0: Excellent performance
- 0.6-0.8: Good performance
- 0.4-0.6: Moderate performance, needs improvement
- Below 0.4: Poor performance, requires significant attention
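Aggregate scores only tell part of the story. To see which individual examples drag a metric down, you can export per-sample scores to a DataFrame; the result object returned by `evaluate()` provides a `to_pandas()` helper. A small sketch (the 0.6 threshold is arbitrary):

```python
# Per-sample breakdown: one row per evaluation sample plus one column per metric
results_df = results.to_pandas()
print(results_df.columns.tolist())

# Flag samples whose faithfulness falls below an (arbitrary) 0.6 threshold
low_faithfulness = results_df[results_df["faithfulness"] < 0.6]
print(f"{len(low_faithfulness)} sample(s) below the faithfulness threshold")
```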

## Advanced Use: Custom Evaluation for Specific Examples
For more detailed analysis of specific examples:
```python
from ragas import SingleTurnSample
from ragas.metrics import AspectCritic

# Define a specific test case
test_data = {
    "user_input": "What are quantum computers?",
    "response": "Quantum computers use quantum bits or qubits that can exist in multiple states simultaneously, unlike classical bits that can only be 0 or 1.",
    "retrieved_contexts": ["Quantum computing is a type of computation that harnesses quantum mechanical phenomena."]
}

# Create a custom evaluation metric
custom_metric = AspectCritic(
    name="quantum_accuracy",
    llm=evaluator_llm,
    definition="Verify if the explanation of quantum computing is accurate and complete."
)

# Score the sample
sample = SingleTurnSample(**test_data)
score = await custom_metric.single_turn_ascore(sample)
print(f"Quantum accuracy score: {score}")
```
💡 Try it yourself: Explore the hands-on notebook for this workflow: `02_Basic_Evaluation_Workflow_with_Ragas`

## Common Evaluation Patterns and Metrics

Below is a matrix mapping evaluation patterns to the metrics used, along with definitions for each metric:

| Metric | Comprehensive RAG Evaluation | Content Quality Evaluation | Retrieval Quality Evaluation |
|---|---|---|---|
| Faithfulness | ✓ | ✓ | |
| Answer Relevancy | ✓ | ✓ | |
| Context Recall | ✓ | | ✓ |
| Context Precision | ✓ | | ✓ |
| Harmfulness | | ✓ | |
| Coherence | | ✓ | |
| Context Relevancy | | | ✓ |

### Metric Definitions
- Faithfulness: Measures if the response is factually consistent with the provided context.
- Answer Relevancy: Assesses if the response addresses the question.
- Context Recall: Measures how well the retrieved context covers the information in the ground truth.
- Context Precision: Evaluates the proportion of relevant information in the retrieved context.
- Harmfulness: Evaluates if the response contains harmful or inappropriate content.
- Coherence: Measures the logical flow and clarity of the response.
- Context Relevancy: Evaluates if the retrieved context is relevant to the question.
This matrix provides a clear overview of which metrics to use for specific evaluation patterns and their respective definitions.
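If you want to turn one of these patterns into code, the built-in metrics cover most of the columns, and the qualitative aspects (harmfulness, coherence) can be approximated with `AspectCritic` as shown earlier. A sketch of a content-quality bundle, reusing `evaluator_llm` and `evaluation_dataset` from the steps above (the aspect definitions below are illustrative, not Ragas' built-in wording):

```python
from ragas.metrics import AspectCritic, Faithfulness, ResponseRelevancy

# Content-quality bundle: factual grounding, relevance, plus two LLM-judged aspects
content_quality_metrics = [
    Faithfulness(),
    ResponseRelevancy(),
    AspectCritic(
        name="harmfulness",
        llm=evaluator_llm,
        definition="Does the response contain harmful, unsafe, or inappropriate content?",
    ),
    AspectCritic(
        name="coherence",
        llm=evaluator_llm,
        definition="Is the response logically structured, clear, and easy to follow?",
    ),
]

content_results = evaluate(evaluation_dataset, metrics=content_quality_metrics, llm=evaluator_llm)
```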

## Best Practices for Ragas Evaluation
- Start simple: Begin with core metrics before adding more specialized ones
- Use diverse test cases: Include a variety of questions, from simple to complex
- Consider edge cases: Test with queries that might challenge your system
- Compare versions: Track metrics across different versions of your application (see the comparison sketch after this list)
- Combine with human evaluation: Use Ragas alongside human feedback for a comprehensive assessment
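For the comparison point above, the simplest approach is to keep the evaluation dataset fixed and diff aggregate scores between runs. A rough sketch, where `results_v1` and `results_v2` are hypothetical result objects from evaluating two versions of your application:

```python
# Hypothetical: results_v1 / results_v2 come from evaluate() runs against two app versions
scores_v1 = results_v1.to_pandas().mean(numeric_only=True)
scores_v2 = results_v2.to_pandas().mean(numeric_only=True)

# Positive deltas mean version 2 improved on that metric
print((scores_v2 - scores_v1).round(4))
```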

## Conclusion
Setting up a basic evaluation workflow with Ragas is straightforward yet powerful. By systematically evaluating your LLM applications, you gain objective insights into their performance and clear directions for improvement.
In our next post, we'll delve deeper into specialized evaluation techniques for RAG systems, exploring advanced metrics and evaluation strategies for retrieval-augmented generation applications.
Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications
Part 2: Basic Evaluation Workflow (you are here)
Next up in the series:
Part 3: Evaluating RAG Systems
Part 4: Test Data Generation
Part 5: Advanced Evaluation Techniques
Part 6: Evaluating AI Agents
Part 7: Integrations and Observability
Part 8: Building Feedback Loops
Have you set up your first Ragas evaluation? What aspects of your LLM application are you most interested in measuring? If you're facing specific evaluation hurdles, don't hesitate to reach out. We'd love to help!