{ "cells": [ { "cell_type": "markdown", "id": "3b368f39", "metadata": {}, "source": [ "# Dealing with the Data\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "95ebfe0e", "metadata": {}, "outputs": [], "source": [ "import os\n", "import getpass\n", "os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"Enter your OpenAI API Key:\")" ] }, { "cell_type": "markdown", "id": "c869f7a5", "metadata": {}, "source": [ "TheDataGuy's blog posts are in markdown in git repo, that why we copied the docs here for initial version\n", "\n", "- 14 posts currently\n", "\n", "TODO:\n", "\n", "[ ] - Develop a pipeline to ingest data as new posts are published\n" ] }, { "cell_type": "code", "execution_count": null, "id": "a523f17c", "metadata": {}, "outputs": [], "source": [ "from langchain_community.document_loaders import DirectoryLoader\n", "\n", "path = \"data/\"\n", "text_loader = DirectoryLoader(path, glob=\"*.md\", show_progress=True,recursive=True)\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "6c0cc8c8", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 14/14 [00:00<00:00, 41.74it/s]\n" ] } ], "source": [ "raw_docs = text_loader.load()" ] }, { "cell_type": "code", "execution_count": 15, "id": "774a9a99", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "14" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(raw_docs)" ] }, { "cell_type": "code", "execution_count": null, "id": "4c7a6cdb", "metadata": {}, "outputs": [], "source": [ "# add url in metadata by replace \"data/\" with \"https://thedataguy.pro/\" and remove \"index.md\" from metadata.source\n", "for doc in raw_docs:\n", " doc.metadata[\"url\"] = doc.metadata[\"source\"].replace(\"data/\", \"https://thedataguy.pro/blog/\").replace(\"index.md\", \"\")\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 129, "id": "1a3f7432", "metadata": {}, "outputs": [], "source": [ "# build list of doc source and text length\n", "doc_info = []\n", "for doc in raw_docs:\n", " doc_info.append(\n", " {\n", " \"url\": doc.metadata[\"url\"],\n", " \"source\": doc.metadata[\"source\"],\n", " \"text_length\": len(doc.page_content),\n", " }\n", " )\n", "\n" ] }, { "cell_type": "code", "execution_count": 130, "id": "9f99da77", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "13468" ] }, "execution_count": 130, "metadata": {}, "output_type": "execute_result" } ], "source": [ "max(len(doc.page_content) for doc in raw_docs)" ] }, { "cell_type": "markdown", "id": "a6aa672d", "metadata": {}, "source": [ "The longest blog post contains 13,468 characters. Ideally, we would like to retrieve full post content based on the query. \n", "\n", "> That's why chunking based on the blog post might be the best way." 
] }, { "cell_type": "code", "execution_count": 131, "id": "6106468f", "metadata": {}, "outputs": [ { "data": { "application/vnd.microsoft.datawrangler.viewer.v0+json": { "columns": [ { "name": "index", "rawType": "int64", "type": "integer" }, { "name": "url", "rawType": "object", "type": "string" }, { "name": "source", "rawType": "object", "type": "string" }, { "name": "text_length", "rawType": "int64", "type": "integer" } ], "conversionMethod": "pd.DataFrame", "ref": "bbd3b888-a367-492d-9809-454ac5ea11ba", "rows": [ [ "0", "https://thedataguy.pro/introduction-to-ragas/", "data/introduction-to-ragas/index.md", "6071" ], [ "1", "https://thedataguy.pro/generating-test-data-with-ragas/", "data/generating-test-data-with-ragas/index.md", "13468" ], [ "2", "https://thedataguy.pro/advanced-metrics-and-customization-with-ragas/", "data/advanced-metrics-and-customization-with-ragas/index.md", "10455" ], [ "3", "https://thedataguy.pro/building-research-agent/", "data/building-research-agent/index.md", "6877" ], [ "4", "https://thedataguy.pro/rss-feed-announcement/", "data/rss-feed-announcement/index.md", "1900" ], [ "5", "https://thedataguy.pro/metric-driven-development/", "data/metric-driven-development/index.md", "11879" ], [ "6", "https://thedataguy.pro/basic-evaluation-workflow-with-ragas/", "data/basic-evaluation-workflow-with-ragas/index.md", "9164" ], [ "7", "https://thedataguy.pro/langchain-experience-csharp-perspective/", "data/langchain-experience-csharp-perspective/index.md", "3070" ], [ "8", "https://thedataguy.pro/evaluating-ai-agents-with-ragas/", "data/evaluating-ai-agents-with-ragas/index.md", "8907" ], [ "9", "https://thedataguy.pro/integrations-and-observability-with-ragas/", "data/integrations-and-observability-with-ragas/index.md", "8221" ], [ "10", "https://thedataguy.pro/building-feedback-loops-with-ragas/", "data/building-feedback-loops-with-ragas/index.md", "6891" ], [ "11", "https://thedataguy.pro/coming-back-to-ai-roots/", "data/coming-back-to-ai-roots/index.md", "5711" ], [ "12", "https://thedataguy.pro/data-is-king/", "data/data-is-king/index.md", "5987" ], [ "13", "https://thedataguy.pro/evaluating-rag-systems-with-ragas/", "data/evaluating-rag-systems-with-ragas/index.md", "7674" ] ], "shape": { "columns": 3, "rows": 14 } }, "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
urlsourcetext_length
0https://thedataguy.pro/introduction-to-ragas/data/introduction-to-ragas/index.md6071
1https://thedataguy.pro/generating-test-data-wi...data/generating-test-data-with-ragas/index.md13468
2https://thedataguy.pro/advanced-metrics-and-cu...data/advanced-metrics-and-customization-with-r...10455
3https://thedataguy.pro/building-research-agent/data/building-research-agent/index.md6877
4https://thedataguy.pro/rss-feed-announcement/data/rss-feed-announcement/index.md1900
5https://thedataguy.pro/metric-driven-development/data/metric-driven-development/index.md11879
6https://thedataguy.pro/basic-evaluation-workfl...data/basic-evaluation-workflow-with-ragas/inde...9164
7https://thedataguy.pro/langchain-experience-cs...data/langchain-experience-csharp-perspective/i...3070
8https://thedataguy.pro/evaluating-ai-agents-wi...data/evaluating-ai-agents-with-ragas/index.md8907
9https://thedataguy.pro/integrations-and-observ...data/integrations-and-observability-with-ragas...8221
10https://thedataguy.pro/building-feedback-loops...data/building-feedback-loops-with-ragas/index.md6891
11https://thedataguy.pro/coming-back-to-ai-roots/data/coming-back-to-ai-roots/index.md5711
12https://thedataguy.pro/data-is-king/data/data-is-king/index.md5987
13https://thedataguy.pro/evaluating-rag-systems-...data/evaluating-rag-systems-with-ragas/index.md7674
\n", "
" ], "text/plain": [ " url \\\n", "0 https://thedataguy.pro/introduction-to-ragas/ \n", "1 https://thedataguy.pro/generating-test-data-wi... \n", "2 https://thedataguy.pro/advanced-metrics-and-cu... \n", "3 https://thedataguy.pro/building-research-agent/ \n", "4 https://thedataguy.pro/rss-feed-announcement/ \n", "5 https://thedataguy.pro/metric-driven-development/ \n", "6 https://thedataguy.pro/basic-evaluation-workfl... \n", "7 https://thedataguy.pro/langchain-experience-cs... \n", "8 https://thedataguy.pro/evaluating-ai-agents-wi... \n", "9 https://thedataguy.pro/integrations-and-observ... \n", "10 https://thedataguy.pro/building-feedback-loops... \n", "11 https://thedataguy.pro/coming-back-to-ai-roots/ \n", "12 https://thedataguy.pro/data-is-king/ \n", "13 https://thedataguy.pro/evaluating-rag-systems-... \n", "\n", " source text_length \n", "0 data/introduction-to-ragas/index.md 6071 \n", "1 data/generating-test-data-with-ragas/index.md 13468 \n", "2 data/advanced-metrics-and-customization-with-r... 10455 \n", "3 data/building-research-agent/index.md 6877 \n", "4 data/rss-feed-announcement/index.md 1900 \n", "5 data/metric-driven-development/index.md 11879 \n", "6 data/basic-evaluation-workflow-with-ragas/inde... 9164 \n", "7 data/langchain-experience-csharp-perspective/i... 3070 \n", "8 data/evaluating-ai-agents-with-ragas/index.md 8907 \n", "9 data/integrations-and-observability-with-ragas... 8221 \n", "10 data/building-feedback-loops-with-ragas/index.md 6891 \n", "11 data/coming-back-to-ai-roots/index.md 5711 \n", "12 data/data-is-king/index.md 5987 \n", "13 data/evaluating-rag-systems-with-ragas/index.md 7674 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# display as markdown table\n", "import pandas as pd\n", "doc_info_df = pd.DataFrame(doc_info)\n", "\n", "from IPython.display import display\n", "display(doc_info_df)\n", " " ] }, { "cell_type": "code", "execution_count": null, "id": "fbb6a7ab", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "title: \"Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications\" date: 2025-04-26T18:00:00-06:00 layout: blog description: \"Explore the essential evaluation framework for LLM applications with Ragas. Learn how to assess performance, ensure accuracy, and improve reliability in Retrieval-Augmented Generation systems.\" categories: [\"AI\", \"RAG\", \"Evaluation\",\"Ragas\"] coverImage: \"https://images.unsplash.com/photo-1593642634367-d91a135587b5?q=80&w=1770&auto=format&fit=crop&ixlib=rb-4.0.3\" readingTime: 7 published: true\n", "\n", "As Large Language Models (LLMs) become fundamental components of modern applications, effectively evaluating their performance becomes increasingly critical. Whether you're building a question-answering system, a document retrieval tool, or a conversational agent, you need reliable metrics to assess how well your application performs. 
This is where Ragas steps in.\n", "\n", "What is Ragas?\n", "{'source': 'data/introduction-to-ragas/index.md'}\n" ] } ], "source": [ "# Split the documents into overlapping chunks (length measured in characters, not tokens)\n", "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", "text_splitter = RecursiveCharacterTextSplitter(\n", "    chunk_size=1000,\n", "    chunk_overlap=200,\n", "    length_function=len,\n", ")\n", "docs = text_splitter.split_documents(raw_docs)\n", "len(docs)\n", "# check the first document\n", "print(docs[0].page_content[:1000])\n", "# check the metadata\n", "print(docs[0].metadata)\n" ] }, { "cell_type": "code", "execution_count": 9, "id": "8049a960", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "139" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(docs)" ] }, { "cell_type": "code", "execution_count": null, "id": "b8d51429", "metadata": {}, "outputs": [], "source": [ "from langchain_huggingface import HuggingFaceEmbeddings\n", "huggingface_embeddings = HuggingFaceEmbeddings(model_name=\"Snowflake/snowflake-arctic-embed-l\")" ] }, { "cell_type": "code", "execution_count": 133, "id": "6052bfc5", "metadata": {}, "outputs": [], "source": [ "from langchain_community.vectorstores import Qdrant" ] }, { "cell_type": "code", "execution_count": null, "id": "b6ad603c", "metadata": {}, "outputs": [], "source": [ "import os\n", "from pathlib import Path\n", "\n", "# Create directory if it doesn't exist\n", "storage_path = Path(\"./db\")\n", "os.makedirs(storage_path, exist_ok=True)\n", "\n", "qdrant_vectorstore = Qdrant.from_documents(\n", "    documents=docs,\n", "    embedding=huggingface_embeddings,\n", "    path=str(storage_path / \"vectorstore\"),\n", "    collection_name=\"thedataguy_documents\",\n", ")" ] }, { "cell_type": "code", "execution_count": 135, "id": "774fddda", "metadata": {}, "outputs": [], "source": [ "# Build a second store from the un-chunked posts (one vector per blog post)\n", "# Create directory if it doesn't exist\n", "storage_path = Path(\"./db\")\n", "os.makedirs(storage_path, exist_ok=True)\n", "\n", "qdrant_vectorstore = Qdrant.from_documents(\n", "    documents=raw_docs,\n", "    embedding=huggingface_embeddings,\n", "    path=str(storage_path / \"vectorstore_v3\"),\n", "    collection_name=\"thedataguy_documents\",\n", ")" ] }, { "cell_type": "code", "execution_count": 145, "id": "c802408f", "metadata": {}, "outputs": [], "source": [ "# Display content as rendered markdown\n", "from IPython.display import Markdown, display\n", "def display_markdown(content):\n", "    display(Markdown(content))" ] }, { "cell_type": "code", "execution_count": 136, "id": "3855ff2f", "metadata": {}, "outputs": [], "source": [ "retriever = qdrant_vectorstore.as_retriever()" ] }, { "cell_type": "code", "execution_count": 137, "id": "51863b67", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Document(metadata={'source': 'data/advanced-metrics-and-customization-with-ragas/index.md', 'url': 'https://thedataguy.pro/advanced-metrics-and-customization-with-ragas/', '_id': 'dea41eaa40a54b3ca558fda681d69ec8', '_collection_name': 'thedataguy_documents'}, page_content='title: \"Part 5: Advanced Metrics and Customization with Ragas\" date: 2025-04-28T05:00:00-06:00 layout: blog description: \"Explore advanced metrics and customization techniques in Ragas for evaluating LLM applications, including creating custom metrics, domain-specific evaluation, composite scoring, and best practices for building a comprehensive evaluation ecosystem.\" categories: [\"AI\", \"RAG\", \"Evaluation\", \"Ragas\",\"Data\"] coverImage: 
\"https://plus.unsplash.com/premium_photo-1661368994107-43200954c524?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D\" readingTime: 9 published: true\\n\\nIn our previous post, we explored how to generate comprehensive test datasets for evaluating LLM applications. Now, let\\'s dive into one of Ragas\\' most powerful capabilities: advanced metrics and custom evaluation approaches that address specialized evaluation needs.\\n\\nBeyond the Basics: Why Advanced Metrics Matter\\n\\nWhile Ragas\\' core metrics cover fundamental evaluation aspects, real-world applications often have unique requirements:\\n\\nDomain-specific quality criteria: Legal, medical, or financial applications have specialized accuracy requirements\\n\\nCustom interaction patterns: Applications with unique conversation flows need tailored evaluation approaches\\n\\nSpecialized capabilities: Features like reasoning, code generation, or structured output demand purpose-built metrics\\n\\nBusiness-specific KPIs: Aligning evaluation with business objectives requires customized metrics\\n\\nLet\\'s explore how to extend Ragas\\' capabilities to meet these specialized needs.\\n\\nUnderstanding Ragas\\' Metric Architecture\\n\\nBefore creating custom metrics, it\\'s helpful to understand Ragas\\' metric architecture:\\n\\n1. Understand the Metric Base Classes\\n\\nAll metrics in Ragas inherit from the abstract Metric class (see metrics/base.py). For most use cases, you’ll extend one of these:\\n\\nSingleTurnMetric: For metrics that evaluate a single question/response pair.\\n\\nMultiTurnMetric: For metrics that evaluate multi-turn conversations.\\n\\nMetricWithLLM: For metrics that require an LLM for evaluation.\\n\\nMetricWithEmbeddings: For metrics that use embeddings.\\n\\nYou can mix these as needed (e.g., MetricWithLLM, SingleTurnMetric).\\n\\nEach metric implements specific scoring methods depending on its type:\\n\\n_single_turn_ascore: For single-turn metrics\\n\\n_multi_turn_ascore: For multi-turn metrics\\n\\nCreating Your First Custom Metric\\n\\nLet\\'s create a custom metric that evaluates technical accuracy in programming explanations:\\n\\n```python from dataclasses import dataclass, field from typing import Dict, Optional, Set import typing as t\\n\\nfrom ragas.metrics.base import MetricWithLLM, SingleTurnMetric from ragas.prompt import PydanticPrompt from ragas.metrics import MetricType, MetricOutputType from pydantic import BaseModel\\n\\nDefine input/output models for the prompt\\n\\nclass TechnicalAccuracyInput(BaseModel): question: str context: str response: str programming_language: str = \"python\"\\n\\nclass TechnicalAccuracyOutput(BaseModel): score: float feedback: str\\n\\nDefine the prompt\\n\\nclass TechnicalAccuracyPrompt(PydanticPrompt[TechnicalAccuracyInput, TechnicalAccuracyOutput]): instruction: str = ( \"Evaluate the technical accuracy of the response to a programming question. 
\" \"Consider syntax correctness, algorithmic accuracy, and best practices.\" ) input_model = TechnicalAccuracyInput output_model = TechnicalAccuracyOutput examples = [ # Add examples here ]\\n\\nCreate the metric\\n\\n@dataclass class TechnicalAccuracy(MetricWithLLM, SingleTurnMetric): name: str = \"technical_accuracy\" _required_columns: Dict[MetricType, Set[str]] = field( default_factory=lambda: { MetricType.SINGLE_TURN: { \"user_input\", \"response\",\\n\\n }\\n }\\n)\\noutput_type: Optional[MetricOutputType] = MetricOutputType.CONTINUOUS\\nevaluation_prompt: PydanticPrompt = field(default_factory=TechnicalAccuracyPrompt)\\n\\nasync def _single_turn_ascore(self, sample, callbacks) -> float:\\n assert self.llm is not None, \"LLM must be set\"\\n\\n question = sample.user_input\\n response = sample.response\\n # Extract programming language from question if possible\\n programming_language = \"python\" # Default\\n languages = [\"python\", \"javascript\", \"java\", \"c++\", \"rust\", \"go\"]\\n for lang in languages:\\n if lang in question.lower():\\n programming_language = lang\\n break\\n\\n # Get the context\\n context = \"\\\\n\".join(sample.retrieved_contexts) if sample.retrieved_contexts else \"\"\\n\\n # Prepare input for prompt\\n prompt_input = TechnicalAccuracyInput(\\n question=question,\\n context=context,\\n response=response,\\n programming_language=programming_language\\n )\\n\\n # Generate evaluation\\n evaluation = await self.evaluation_prompt.generate(\\n data=prompt_input, llm=self.llm, callbacks=callbacks\\n )\\n\\n return evaluation.score\\n\\n```\\n\\nUsing the Custom Metric\\n\\nTo use the custom metric, simply include it in your evaluation pipeline:\\n\\n```python from langchain_openai import ChatOpenAI from ragas import SingleTurnSample from ragas.llms import LangchainLLMWrapper\\n\\nInitialize the LLM, you are going to OPENAI API key\\n\\nevaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=\"gpt-4o\"))\\n\\ntest_data = { \"user_input\": \"Write a function to calculate the factorial of a number in Python.\", \"retrieved_contexts\": [\"Python is a programming language.\", \"A factorial of a number n is the product of all positive integers less than or equal to n.\"], \"response\": \"def factorial(n):\\\\n if n == 0:\\\\n return 1\\\\n else:\\\\n return n * factorial(n-1)\", }\\n\\nCreate a sample\\n\\nsample = SingleTurnSample(**test_data) # Unpack the dictionary into the constructor technical_accuracy = TechnicalAccuracy(llm=evaluator_llm) score = await technical_accuracy.single_turn_ascore(sample) print(f\"Technical Accuracy Score: {score}\")\\n\\nNote: The above code is a simplified example. In a real-world scenario, you would need to handle exceptions,\\n\\n`` You can also use theevaluate` function to evaluate a dataset:\\n\\n```python from ragas import evaluate from ragas import evaluate\\n\\nresults = evaluate( dataset, # Your dataset of samples metrics=[TechnicalAccuracy(), ...], llm=myevaluator_llm_llm ) ```\\n\\n💡 Try it yourself: Explore the hands-on notebook for synthetic data generation: 05_Advanced_Metrics_and_Customization\\n\\nCustomizing Metrics for Your Application\\n\\nYou can further refine your evaluation by customizing existing metrics—such as adjusting thresholds or criteria—to better fit your application\\'s requirements. 
For multi-turn conversations, you might configure metrics like topic adherence to emphasize specific aspects, such as precision or recall, based on your evaluation objectives.\\n\\nIn specialized domains like healthcare or legal, it\\'s crucial to design custom metrics that capture domain-specific accuracy and compliance needs. For complex applications, consider combining several metrics into composite scores to represent multiple quality dimensions.\\n\\nWhen assessing capabilities like code generation or structured outputs, develop metrics that evaluate execution correctness or schema compliance. For advanced scenarios, you can build metric pipelines that orchestrate several metrics and aggregate their results using strategies like weighted averages or minimum scores.\\n\\nBy thoughtfully customizing and combining metrics, you can achieve a comprehensive and meaningful evaluation framework tailored to your unique use case.\\n\\nBest Practices for Custom Metric Development\\n\\nSingle Responsibility: Each metric should evaluate one specific aspect\\n\\nClear Definition: Define precisely what your metric measures\\n\\nBounded Output: Scores should be normalized, typically in [0,1]\\n\\nReproducibility: Minimize randomness in evaluation\\n\\nDocumentation: Document criteria, prompt design, and interpretation guidelines\\n\\nTest with Examples: Verify metric behavior on clear-cut examples\\n\\nHuman Correlation: Validate that metrics correlate with human judgment\\n\\nStandardizing Custom Metrics\\n\\nTo ensure consistency across custom metrics, consider the following best practices:\\n\\nDefine a clear, human-readable description for each metric.\\n\\nProvide interpretation guidelines to help users understand score meanings.\\n\\nInclude metadata such as metric name, required columns, and output type.\\n\\nUse a standardized interface or base class for all custom metrics.\\n\\nImplementation Patterns for Advanced Metrics\\n\\nWhen developing advanced metrics like topic adherence:\\n\\nDesign multi-step evaluation workflows for complex tasks.\\n\\nUse specialized prompts for different sub-tasks within the metric.\\n\\nAllow configurable scoring modes (e.g., precision, recall, F1).\\n\\nSupport conversational context for multi-turn evaluations.\\n\\nDebugging Custom Metrics\\n\\nEffective debugging strategies include:\\n\\nImplementing a debug mode to capture prompt inputs, outputs, and intermediate results.\\n\\nLogging detailed evaluation steps for easier troubleshooting.\\n\\nReviewing final scores alongside intermediate calculations to identify issues.\\n\\nConclusion: Building an Evaluation Ecosystem\\n\\nCustom metrics allow you to build a comprehensive evaluation ecosystem tailored to your application\\'s specific needs:\\n\\nBaseline metrics: Start with Ragas\\' core metrics for fundamental quality aspects\\n\\nDomain adaptation: Add specialized metrics for your application domain\\n\\nFeature-specific metrics: Develop metrics for unique features of your system\\n\\nBusiness alignment: Create metrics that reflect specific business KPIs and requirements\\n\\nBy extending Ragas with custom metrics, you can create evaluation frameworks that precisely measure what matters most for your LLM applications, leading to more meaningful improvements and better user experiences.\\n\\nIn our next post, we\\'ll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows.\\n\\nPart 1: Introduction to Ragas: The Essential Evaluation Framework for 
LLM Applications Part 2: Basic Evaluation Workflow Part 3: Evaluating RAG Systems with Ragas Part 4: Test Data Generation Part 5: Advanced Evaluation Techniques — You are here Next up in the series: Part 6: Evaluating AI Agents Part 7: Integrations and Observability Part 8: Building Feedback Loops\\n\\nHow have you implemented feedback loops in your LLM applications? What improvement strategies have been most effective for your use cases? If you’re facing specific evaluation hurdles, don’t hesitate to reach out—we’d love to help!'),\n", " Document(metadata={'source': 'data/langchain-experience-csharp-perspective/index.md', 'url': 'https://thedataguy.pro/langchain-experience-csharp-perspective/', '_id': 'c7d27ac193ba4897b79df7c6916dd6c8', '_collection_name': 'thedataguy_documents'}, page_content='layout: blog title: A C# Programmer\\'s Perspective on LangChain Expression Language date: 2025-04-16T00:00:00-06:00 description: My experiences transitioning from C# to LangChain Expression Language, exploring the pipe operator abstraction challenges and the surprising simplicity of parallel execution. categories: [\"Technology\", \"AI\", \"Programming\"] coverImage: \"https://images.unsplash.com/photo-1555066931-4365d14bab8c?q=80&w=1770&auto=format&fit=crop&ixlib=rb-4.0.3\" readingTime: 3 published: true\\n\\nAs a C# developer diving into LangChain Expression Language (LCEL), I\\'ve encountered both challenges and pleasant surprises. Here\\'s what stood out most during my transition.\\n\\nThe Pipe Operator Abstraction Challenge\\n\\nIn C#, processing pipelines are explicit:\\n\\ncsharp var result = inputData .Where(item => item.IsValid) .Select(item => TransformItem(item)) .ToList() .ForEach(item => ProcessItem(item));\\n\\nLCEL\\'s pipe operator creates a different flow:\\n\\npython chain = ( ChatPromptTemplate.from_messages([ (\"system\", \"You are a helpful assistant specialized in {topic}.\"), (\"human\", \"{query}\") ]) | ChatOpenAI(temperature=0.7) | (lambda llm_result: llm_result.content) | (lambda content: content.split(\"\\\\n\")) | (lambda lines: [line for line in lines if line.strip()]) | (lambda filtered_lines: \"\\\\n\".join(filtered_lines)) )\\n\\nWith complex chains, questions arise: - What exactly passes through each step? - How can I inspect intermediate results? 
- How do I debug unexpected outcomes?\\n\\nThis becomes more apparent in real-world examples:\\n\\npython retrieval_chain = ( {\"query\": RunnablePassthrough(), \"context\": retriever | format_docs} | prompt | llm | StrOutputParser() )\\n\\nSurprisingly Simple Parallel Execution\\n\\nDespite abstraction challenges, LCEL handles parallel execution elegantly.\\n\\nIn C#: ```csharp var task1 = Task.Run(() => ProcessData(data1)); var task2 = Task.Run(() => ProcessData(data2)); var task3 = Task.Run(() => ProcessData(data3));\\n\\nawait Task.WhenAll(task1, task2, task3); var results = new[] { task1.Result, task2.Result, task3.Result }; ```\\n\\nIn LCEL: ```python parallel_chain = RunnableMap({ \"summary\": prompt_summary | llm | StrOutputParser(), \"translation\": prompt_translate | llm | StrOutputParser(), \"analysis\": prompt_analyze | llm | StrOutputParser() })\\n\\nresult = parallel_chain.invoke({\"input\": user_query}) ```\\n\\nThis approach eliminates manual task management, handling everything behind the scenes.\\n\\nBest Practices I\\'ve Adopted\\n\\nTo balance LCEL\\'s expressiveness with clarity:\\n\\nBreak complex chains into named subcomponents\\n\\nComment non-obvious transformations\\n\\nCreate visualization helpers for debugging\\n\\nEmbrace functional thinking\\n\\nConclusion\\n\\nFor C# developers exploring LCEL, approach it with an open mind. The initial learning curve is worth it, especially for AI workflows where LCEL\\'s parallel execution shines.\\n\\nWant to see these concepts in practice? Check out my Pythonic RAG repository for working examples.\\n\\nIf you found this useful or have questions about transitioning from C# to LCEL, feel free to reach out — we’d love to help!')]" ] }, "execution_count": 137, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \"Who is TheDataGuy?\"\n", "retriever.invoke(query,k=2)" ] }, { "cell_type": "code", "execution_count": 138, "id": "fe41c589", "metadata": {}, "outputs": [], "source": [ "from langchain.prompts import ChatPromptTemplate\n", "\n", "rag_prompt_template = \"\"\"\\\n", "You are a helpful assistant that answers questions based on the context provided. 
\n", "Generate a concise answer to the question in markdown format and include a list of relevant links to the context.\n", "Use links from context to help user to navigate to to find more information.\n", "You have access to the following information:\n", "\n", "Context:\n", "{context}\n", "\n", "Question:\n", "{question}\n", "\n", "If context is unrelated to question, say \"I don't know\".\n", "\"\"\"\n", "\n", "rag_prompt = ChatPromptTemplate.from_template(rag_prompt_template)" ] }, { "cell_type": "code", "execution_count": 140, "id": "4098897b", "metadata": {}, "outputs": [], "source": [ "from langchain_openai.chat_models import ChatOpenAI\n", "\n", "llm = ChatOpenAI(model=\"gpt-4o-mini\", temperature=0.0)" ] }, { "cell_type": "code", "execution_count": 142, "id": "4b903b28", "metadata": {}, "outputs": [], "source": [ "from operator import itemgetter\n", "from langchain.schema.output_parser import StrOutputParser\n", "from langchain.schema.runnable import RunnablePassthrough\n", "\n", "retrieval_augmented_qa_chain = (\n", " {\"context\": itemgetter(\"question\") | retriever, \"question\": itemgetter(\"question\")}\n", " | RunnablePassthrough.assign(context=itemgetter(\"context\"))\n", " | {\"response\": rag_prompt | llm, \"context\": itemgetter(\"context\")}\n", ")" ] }, { "cell_type": "code", "execution_count": 143, "id": "d0ed00da", "metadata": {}, "outputs": [], "source": [ "query = \"Who is TheDataGuy?\"\n", "response = retrieval_augmented_qa_chain.invoke({\"question\" : query})" ] }, { "cell_type": "code", "execution_count": 144, "id": "8d838216", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"I don't know.\"" ] }, "execution_count": 144, "metadata": {}, "output_type": "execute_result" } ], "source": [ "response[\"response\"].content" ] }, { "cell_type": "code", "execution_count": 146, "id": "dd89de8d", "metadata": {}, "outputs": [], "source": [ "query = \"What is RAGAS?\"\n", "response = retrieval_augmented_qa_chain.invoke({\"question\" : query})" ] }, { "cell_type": "code", "execution_count": 147, "id": "4f6bd775", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'response': AIMessage(content='Ragas is an open-source evaluation framework specifically designed for Large Language Model (LLM) applications, particularly focusing on Retrieval-Augmented Generation (RAG) systems. It provides specialized metrics to assess the performance of LLM applications, addressing challenges such as information retrieval accuracy, response consistency, and user query relevance. 
Ragas helps ensure that LLM applications are reliable and effective by offering tools for quality assurance, performance tracking, and continuous improvement.\\n\\nFor more information, you can explore the following links:\\n- [Part 1: Introduction to Ragas](https://thedataguy.pro/introduction-to-ragas/)\\n- [Part 2: Basic Evaluation Workflow with Ragas](https://thedataguy.pro/basic-evaluation-workflow-with-ragas/)\\n- [Part 3: Evaluating RAG Systems with Ragas](https://thedataguy.pro/evaluating-rag-systems-with-ragas/)\\n- [Part 5: Advanced Metrics and Customization with Ragas](https://thedataguy.pro/advanced-metrics-and-customization-with-ragas/)', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 219, 'prompt_tokens': 7761, 'total_tokens': 7980, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_96c46af214', 'id': 'chatcmpl-BVfuNKBpgXYrnV9WdbVKALJ7y5ln1', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None}, id='run--cc6d3507-84b5-483a-ab8c-e4b6084350a0-0', usage_metadata={'input_tokens': 7761, 'output_tokens': 219, 'total_tokens': 7980, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}}),\n", " 'context': [Document(metadata={'source': 'data/introduction-to-ragas/index.md', 'url': 'https://thedataguy.pro/introduction-to-ragas/', '_id': '8c6d25c18cbf4a868a01e3a50cfaf020', '_collection_name': 'thedataguy_documents'}, page_content='title: \"Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications\" date: 2025-04-26T18:00:00-06:00 layout: blog description: \"Explore the essential evaluation framework for LLM applications with Ragas. Learn how to assess performance, ensure accuracy, and improve reliability in Retrieval-Augmented Generation systems.\" categories: [\"AI\", \"RAG\", \"Evaluation\",\"Ragas\"] coverImage: \"https://images.unsplash.com/photo-1593642634367-d91a135587b5?q=80&w=1770&auto=format&fit=crop&ixlib=rb-4.0.3\" readingTime: 7 published: true\\n\\nAs Large Language Models (LLMs) become fundamental components of modern applications, effectively evaluating their performance becomes increasingly critical. Whether you\\'re building a question-answering system, a document retrieval tool, or a conversational agent, you need reliable metrics to assess how well your application performs. This is where Ragas steps in.\\n\\nWhat is Ragas?\\n\\nRagas is an open-source evaluation framework specifically designed for LLM applications, with particular strengths in Retrieval-Augmented Generation (RAG) systems. Unlike traditional NLP evaluation methods, Ragas provides specialized metrics that address the unique challenges of LLM-powered systems.\\n\\nAt its core, Ragas helps answer crucial questions: - Is my application retrieving the right information? - Are the responses factually accurate and consistent with the retrieved context? - Does the system appropriately address the user\\'s query? - How well does my application handle multi-turn conversations?\\n\\nWhy Evaluate LLM Applications?\\n\\nLLMs are powerful but imperfect. They can hallucinate facts, misinterpret queries, or generate convincing but incorrect responses. 
For applications where accuracy and reliability matter—like healthcare, finance, or education—proper evaluation is non-negotiable.\\n\\nEvaluation serves several key purposes: - Quality assurance: Identify and fix issues before they reach users - Performance tracking: Monitor how changes impact system performance - Benchmarking: Compare different approaches objectively - Continuous improvement: Build feedback loops to enhance your application\\n\\nKey Features of Ragas\\n\\n🎯 Specialized Metrics\\n\\nRagas offers both LLM-based and computational metrics tailored to evaluate different aspects of LLM applications:\\n\\nFaithfulness: Measures if the response is factually consistent with the retrieved context\\n\\nContext Relevancy: Evaluates if the retrieved information is relevant to the query\\n\\nAnswer Relevancy: Assesses if the response addresses the user\\'s question\\n\\nTopic Adherence: Gauges how well multi-turn conversations stay on topic\\n\\n🧪 Test Data Generation\\n\\nCreating high-quality test data is often a bottleneck in evaluation. Ragas helps you generate comprehensive test datasets automatically, saving time and ensuring thorough coverage.\\n\\n🔗 Seamless Integrations\\n\\nRagas works with popular LLM frameworks and tools: - LangChain - LlamaIndex - Haystack - OpenAI\\n\\nObservability platforms - Phoenix - LangSmith - Langfuse\\n\\n📊 Comprehensive Analysis\\n\\nBeyond simple scores, Ragas provides detailed insights into your application\\'s strengths and weaknesses, enabling targeted improvements.\\n\\nGetting Started with Ragas\\n\\nInstalling Ragas is straightforward:\\n\\nbash uv init && uv add ragas\\n\\nHere\\'s a simple example of evaluating a response using Ragas:\\n\\n```python from ragas.metrics import Faithfulness from ragas.evaluation import EvaluationDataset from ragas.dataset_schema import SingleTurnSample from langchain_openai import ChatOpenAI from ragas.llms import LangchainLLMWrapper from langchain_openai import ChatOpenAI\\n\\nInitialize the LLM, you are going to new OPENAI API key\\n\\nevaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=\"gpt-4o\"))\\n\\nYour evaluation data\\n\\ntest_data = { \"user_input\": \"What is the capital of France?\", \"retrieved_contexts\": [\"Paris is the capital and most populous city of France.\"], \"response\": \"The capital of France is Paris.\" }\\n\\nCreate a sample\\n\\nsample = SingleTurnSample(**test_data) # Unpack the dictionary into the constructor\\n\\nCreate metric\\n\\nfaithfulness = Faithfulness(llm=evaluator_llm)\\n\\nCalculate the score\\n\\nresult = await faithfulness.single_turn_ascore(sample) print(f\"Faithfulness score: {result}\") ```\\n\\n💡 Try it yourself: Explore the hands-on notebook for this workflow: 01_Introduction_to_Ragas\\n\\nWhat\\'s Coming in This Blog Series\\n\\nThis introduction is just the beginning. 
In the upcoming posts, we\\'ll dive deeper into all aspects of evaluating LLM applications with Ragas:\\n\\nPart 2: Basic Evaluation Workflow We\\'ll explore each metric in detail, explaining when and how to use them effectively.\\n\\nPart 3: Evaluating RAG Systems Learn specialized techniques for evaluating retrieval-augmented generation systems, including context precision, recall, and relevance.\\n\\nPart 4: Test Data Generation Discover how to create high-quality test datasets that thoroughly exercise your application\\'s capabilities.\\n\\nPart 5: Advanced Evaluation Techniques Go beyond basic metrics with custom evaluations, multi-aspect analysis, and domain-specific assessments.\\n\\nPart 6: Evaluating AI Agents Learn how to evaluate complex AI agents that engage in multi-turn interactions, use tools, and work toward specific goals.\\n\\nPart 7: Integrations and Observability Connect Ragas with your existing tools and platforms for streamlined evaluation workflows.\\n\\nPart 8: Building Feedback Loops Learn how to implement feedback loops that drive continuous improvement in your LLM applications. Transform evaluation insights into concrete improvements for your LLM applications.\\n\\nConclusion\\n\\nIn a world increasingly powered by LLMs, robust evaluation is the difference between reliable applications and unpredictable ones. Ragas provides the tools you need to confidently assess and improve your LLM applications.\\n\\nReady to Elevate Your LLM Applications?\\n\\nStart exploring Ragas today by visiting the official documentation. Share your thoughts, challenges, or success stories. If you\\'re facing specific evaluation hurdles, don\\'t hesitate to reach out—we\\'d love to help!'),\n", " Document(metadata={'source': 'data/evaluating-rag-systems-with-ragas/index.md', 'url': 'https://thedataguy.pro/evaluating-rag-systems-with-ragas/', '_id': 'e23d0f12b30d4a9b9ef557bac2a0a745', '_collection_name': 'thedataguy_documents'}, page_content='title: \"Part 3: Evaluating RAG Systems with Ragas\" date: 2025-04-26T20:00:00-06:00 layout: blog description: \"Learn specialized techniques for comprehensive evaluation of Retrieval-Augmented Generation systems using Ragas, including metrics for retrieval quality, generation quality, and end-to-end performance.\" categories: [\"AI\", \"RAG\", \"Evaluation\", \"Ragas\"] coverImage: \"https://images.unsplash.com/photo-1743796055664-3473eedab36e?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D\" readingTime: 14 published: true\\n\\nIn our previous post, we covered the fundamentals of setting up evaluation workflows with Ragas. Now, let\\'s focus specifically on evaluating Retrieval-Augmented Generation (RAG) systems, which present unique evaluation challenges due to their multi-component nature.\\n\\nUnderstanding RAG Systems: More Than the Sum of Their Parts\\n\\nRAG systems combine two critical capabilities: 1. Retrieval: Finding relevant information from a knowledge base 2. Generation: Creating coherent, accurate responses based on retrieved information\\n\\nThis dual nature means evaluation must address both components while also assessing their interaction. 
A system might retrieve perfect information but generate poor responses, or generate excellent prose from irrelevant retrieved content.\\n\\nThe RAG Evaluation Triad\\n\\nEffective RAG evaluation requires examining three key dimensions:\\n\\nRetrieval Quality: How well does the system find relevant information?\\n\\nGeneration Quality: How well does the system produce responses from retrieved information?\\n\\nEnd-to-End Performance: How well does the complete system satisfy user needs?\\n\\nLet\\'s explore how Ragas helps evaluate each dimension of RAG systems.\\n\\nCore RAG Metrics in Ragas\\n\\nRagas provides specialized metrics to assess RAG systems across retrieval, generation, and end-to-end performance.\\n\\nRetrieval Quality Metrics\\n\\n1. Context Relevancy\\n\\nMeasures how relevant the retrieved documents are to the user\\'s question.\\n\\nHow it works:\\n\\nTakes the user\\'s question (user_input) and the retrieved documents (retrieved_contexts).\\n\\nUses an LLM to score relevance with two different prompts, averaging the results for robustness.\\n\\nScores are normalized between 0.0 (irrelevant) and 1.0 (fully relevant).\\n\\nWhy it matters: Low scores indicate your retriever is pulling in unrelated or noisy documents. Monitoring this helps you improve the retrieval step.\\n\\n2. Context Precision\\n\\nAssesses how much of the retrieved context is actually useful for generating the answer.\\n\\nHow it works:\\n\\nFor each retrieved chunk, an LLM judges if it was necessary for the answer, using the ground truth (reference) or the generated response.\\n\\nCalculates Average Precision, rewarding systems that rank useful chunks higher.\\n\\nVariants:\\n\\nContextUtilization: Uses the generated response instead of ground truth.\\n\\nNon-LLM version: Compares retrieved chunks to ideal reference contexts using string similarity.\\n\\nWhy it matters: High precision means your retriever is efficient; low precision means too much irrelevant information is included.\\n\\n3. Context Recall\\n\\nEvaluates whether all necessary information from the ground truth answer is present in the retrieved context.\\n\\nHow it works:\\n\\nBreaks down the reference answer into sentences.\\n\\nFor each sentence, an LLM checks if it can be supported by the retrieved context.\\n\\nThe score is the proportion of reference sentences attributed to the retrieved context.\\n\\nVariants:\\n\\nNon-LLM version: Compares reference and retrieved contexts using similarity and thresholds.\\n\\nWhy it matters: High recall means your retriever finds all needed information; low recall means critical information is missing.\\n\\nSummary: - Low context relevancy: Retriever needs better query understanding or semantic matching. - Low context precision: Retriever includes unnecessary information. - Low context recall: Retriever misses critical information.\\n\\nGeneration Quality Metrics\\n\\n1. Faithfulness\\n\\nChecks if the generated answer is factually consistent with the retrieved context, addressing hallucination.\\n\\nHow it works:\\n\\nBreaks the answer into simple statements.\\n\\nFor each, an LLM checks if it can be inferred from the retrieved context.\\n\\nThe score is the proportion of faithful statements.\\n\\nAlternative:\\n\\nFaithfulnesswithHHEM: Uses a specialized NLI model for verification.\\n\\nWhy it matters: High faithfulness means answers are grounded in context; low faithfulness signals hallucination.\\n\\n2. 
Answer Relevancy\\n\\nMeasures if the generated answer directly addresses the user\\'s question.\\n\\nHow it works:\\n\\nAsks an LLM to generate possible questions for the answer.\\n\\nCompares these to the original question using embedding similarity.\\n\\nPenalizes noncommittal answers.\\n\\nWhy it matters: High relevancy means answers are on-topic; low relevancy means answers are off-topic or incomplete.\\n\\nSummary: - Low faithfulness: Generator adds facts not supported by context. - Low answer relevancy: Generator doesn\\'t focus on the specific question.\\n\\nEnd-to-End Metrics\\n\\n1. Correctness\\n\\nAssesses factual alignment between the generated answer and a ground truth reference.\\n\\nHow it works:\\n\\nBreaks both the answer and reference into claims.\\n\\nUses NLI to verify claims in both directions.\\n\\nCalculates precision, recall, or F1-score.\\n\\nWhy it matters: High correctness means answers match the ground truth; low correctness signals factual errors.\\n\\nKey distinction: - Faithfulness: Compares answer to retrieved context. - FactualCorrectness: Compares answer to ground truth.\\n\\nCommon RAG Evaluation Patterns\\n\\n1. High Retrieval, Low Generation Scores\\n\\nDiagnosis: Good retrieval, poor use of information.\\n\\nFixes: Improve prompts, use better generation models, or verify responses post-generation.\\n\\n2. Low Retrieval, High Generation Scores\\n\\nDiagnosis: Good generation, inadequate information.\\n\\nFixes: Enhance indexing, retrieval algorithms, or expand the knowledge base.\\n\\n3. Low Context Precision, High Faithfulness\\n\\nDiagnosis: Retrieves too much, but generates reliably.\\n\\nFixes: Filter passages, optimize chunk size, or use re-ranking.\\n\\nBest Practices for RAG Evaluation\\n\\nEvaluate components independently: Assess retrieval and generation separately.\\n\\nUse diverse queries: Include factoid, explanatory, and complex questions.\\n\\nCompare against baselines: Test against simpler systems.\\n\\nPerform ablation studies: Try variations like different chunk sizes or retrieval models.\\n\\nCombine with human evaluation: Use Ragas with human judgment for a complete view.\\n\\nConclusion: The Iterative RAG Evaluation Cycle\\n\\nEffective RAG development is iterative:\\n\\nEvaluate: Measure performance.\\n\\nAnalyze: Identify weaknesses.\\n\\nImprove: Apply targeted enhancements.\\n\\nRe-evaluate: Measure the impact of changes.\\n\\nThe Iterative RAG Evaluation Cycle\\n\\nBy using Ragas to implement this cycle, you can systematically improve your RAG system\\'s performance across all dimensions.\\n\\nIn our next post, we\\'ll explore how to generate high-quality test datasets for comprehensive RAG evaluation, addressing the common challenge of limited test data.\\n\\nPart 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications Part 2: Basic Evaluation Workflow Part 3: Evaluating RAG Systems with Ragas — You are here Next up in the series: Part 4: Test Data Generation Part 5: Advanced Evaluation Techniques Part 6: Evaluating AI Agents Part 7: Integrations and Observability Part 8: Building Feedback Loops\\n\\nHow have you implemented feedback loops in your LLM applications? What improvement strategies have been most effective for your use cases? 
If you’re facing specific evaluation hurdles, don’t hesitate to reach out—we’d love to help!'),\n", " Document(metadata={'source': 'data/advanced-metrics-and-customization-with-ragas/index.md', 'url': 'https://thedataguy.pro/advanced-metrics-and-customization-with-ragas/', '_id': 'dea41eaa40a54b3ca558fda681d69ec8', '_collection_name': 'thedataguy_documents'}, page_content='title: \"Part 5: Advanced Metrics and Customization with Ragas\" date: 2025-04-28T05:00:00-06:00 layout: blog description: \"Explore advanced metrics and customization techniques in Ragas for evaluating LLM applications, including creating custom metrics, domain-specific evaluation, composite scoring, and best practices for building a comprehensive evaluation ecosystem.\" categories: [\"AI\", \"RAG\", \"Evaluation\", \"Ragas\",\"Data\"] coverImage: \"https://plus.unsplash.com/premium_photo-1661368994107-43200954c524?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D\" readingTime: 9 published: true\\n\\nIn our previous post, we explored how to generate comprehensive test datasets for evaluating LLM applications. Now, let\\'s dive into one of Ragas\\' most powerful capabilities: advanced metrics and custom evaluation approaches that address specialized evaluation needs.\\n\\nBeyond the Basics: Why Advanced Metrics Matter\\n\\nWhile Ragas\\' core metrics cover fundamental evaluation aspects, real-world applications often have unique requirements:\\n\\nDomain-specific quality criteria: Legal, medical, or financial applications have specialized accuracy requirements\\n\\nCustom interaction patterns: Applications with unique conversation flows need tailored evaluation approaches\\n\\nSpecialized capabilities: Features like reasoning, code generation, or structured output demand purpose-built metrics\\n\\nBusiness-specific KPIs: Aligning evaluation with business objectives requires customized metrics\\n\\nLet\\'s explore how to extend Ragas\\' capabilities to meet these specialized needs.\\n\\nUnderstanding Ragas\\' Metric Architecture\\n\\nBefore creating custom metrics, it\\'s helpful to understand Ragas\\' metric architecture:\\n\\n1. Understand the Metric Base Classes\\n\\nAll metrics in Ragas inherit from the abstract Metric class (see metrics/base.py). 
For most use cases, you’ll extend one of these:\\n\\nSingleTurnMetric: For metrics that evaluate a single question/response pair.\\n\\nMultiTurnMetric: For metrics that evaluate multi-turn conversations.\\n\\nMetricWithLLM: For metrics that require an LLM for evaluation.\\n\\nMetricWithEmbeddings: For metrics that use embeddings.\\n\\nYou can mix these as needed (e.g., MetricWithLLM, SingleTurnMetric).\\n\\nEach metric implements specific scoring methods depending on its type:\\n\\n_single_turn_ascore: For single-turn metrics\\n\\n_multi_turn_ascore: For multi-turn metrics\\n\\nCreating Your First Custom Metric\\n\\nLet\\'s create a custom metric that evaluates technical accuracy in programming explanations:\\n\\n```python from dataclasses import dataclass, field from typing import Dict, Optional, Set import typing as t\\n\\nfrom ragas.metrics.base import MetricWithLLM, SingleTurnMetric from ragas.prompt import PydanticPrompt from ragas.metrics import MetricType, MetricOutputType from pydantic import BaseModel\\n\\nDefine input/output models for the prompt\\n\\nclass TechnicalAccuracyInput(BaseModel): question: str context: str response: str programming_language: str = \"python\"\\n\\nclass TechnicalAccuracyOutput(BaseModel): score: float feedback: str\\n\\nDefine the prompt\\n\\nclass TechnicalAccuracyPrompt(PydanticPrompt[TechnicalAccuracyInput, TechnicalAccuracyOutput]): instruction: str = ( \"Evaluate the technical accuracy of the response to a programming question. \" \"Consider syntax correctness, algorithmic accuracy, and best practices.\" ) input_model = TechnicalAccuracyInput output_model = TechnicalAccuracyOutput examples = [ # Add examples here ]\\n\\nCreate the metric\\n\\n@dataclass class TechnicalAccuracy(MetricWithLLM, SingleTurnMetric): name: str = \"technical_accuracy\" _required_columns: Dict[MetricType, Set[str]] = field( default_factory=lambda: { MetricType.SINGLE_TURN: { \"user_input\", \"response\",\\n\\n }\\n }\\n)\\noutput_type: Optional[MetricOutputType] = MetricOutputType.CONTINUOUS\\nevaluation_prompt: PydanticPrompt = field(default_factory=TechnicalAccuracyPrompt)\\n\\nasync def _single_turn_ascore(self, sample, callbacks) -> float:\\n assert self.llm is not None, \"LLM must be set\"\\n\\n question = sample.user_input\\n response = sample.response\\n # Extract programming language from question if possible\\n programming_language = \"python\" # Default\\n languages = [\"python\", \"javascript\", \"java\", \"c++\", \"rust\", \"go\"]\\n for lang in languages:\\n if lang in question.lower():\\n programming_language = lang\\n break\\n\\n # Get the context\\n context = \"\\\\n\".join(sample.retrieved_contexts) if sample.retrieved_contexts else \"\"\\n\\n # Prepare input for prompt\\n prompt_input = TechnicalAccuracyInput(\\n question=question,\\n context=context,\\n response=response,\\n programming_language=programming_language\\n )\\n\\n # Generate evaluation\\n evaluation = await self.evaluation_prompt.generate(\\n data=prompt_input, llm=self.llm, callbacks=callbacks\\n )\\n\\n return evaluation.score\\n\\n```\\n\\nUsing the Custom Metric\\n\\nTo use the custom metric, simply include it in your evaluation pipeline:\\n\\n```python from langchain_openai import ChatOpenAI from ragas import SingleTurnSample from ragas.llms import LangchainLLMWrapper\\n\\nInitialize the LLM, you are going to OPENAI API key\\n\\nevaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=\"gpt-4o\"))\\n\\ntest_data = { \"user_input\": \"Write a function to calculate the factorial of a number 
in Python.\", \"retrieved_contexts\": [\"Python is a programming language.\", \"A factorial of a number n is the product of all positive integers less than or equal to n.\"], \"response\": \"def factorial(n):\\\\n if n == 0:\\\\n return 1\\\\n else:\\\\n return n * factorial(n-1)\", }\\n\\nCreate a sample\\n\\nsample = SingleTurnSample(**test_data) # Unpack the dictionary into the constructor technical_accuracy = TechnicalAccuracy(llm=evaluator_llm) score = await technical_accuracy.single_turn_ascore(sample) print(f\"Technical Accuracy Score: {score}\")\\n\\nNote: The above code is a simplified example. In a real-world scenario, you would need to handle exceptions,\\n\\n`` You can also use theevaluate` function to evaluate a dataset:\\n\\n```python from ragas import evaluate from ragas import evaluate\\n\\nresults = evaluate( dataset, # Your dataset of samples metrics=[TechnicalAccuracy(), ...], llm=myevaluator_llm_llm ) ```\\n\\n💡 Try it yourself: Explore the hands-on notebook for synthetic data generation: 05_Advanced_Metrics_and_Customization\\n\\nCustomizing Metrics for Your Application\\n\\nYou can further refine your evaluation by customizing existing metrics—such as adjusting thresholds or criteria—to better fit your application\\'s requirements. For multi-turn conversations, you might configure metrics like topic adherence to emphasize specific aspects, such as precision or recall, based on your evaluation objectives.\\n\\nIn specialized domains like healthcare or legal, it\\'s crucial to design custom metrics that capture domain-specific accuracy and compliance needs. For complex applications, consider combining several metrics into composite scores to represent multiple quality dimensions.\\n\\nWhen assessing capabilities like code generation or structured outputs, develop metrics that evaluate execution correctness or schema compliance. 
For advanced scenarios, you can build metric pipelines that orchestrate several metrics and aggregate their results using strategies like weighted averages or minimum scores.\\n\\nBy thoughtfully customizing and combining metrics, you can achieve a comprehensive and meaningful evaluation framework tailored to your unique use case.\\n\\nBest Practices for Custom Metric Development\\n\\nSingle Responsibility: Each metric should evaluate one specific aspect\\n\\nClear Definition: Define precisely what your metric measures\\n\\nBounded Output: Scores should be normalized, typically in [0,1]\\n\\nReproducibility: Minimize randomness in evaluation\\n\\nDocumentation: Document criteria, prompt design, and interpretation guidelines\\n\\nTest with Examples: Verify metric behavior on clear-cut examples\\n\\nHuman Correlation: Validate that metrics correlate with human judgment\\n\\nStandardizing Custom Metrics\\n\\nTo ensure consistency across custom metrics, consider the following best practices:\\n\\nDefine a clear, human-readable description for each metric.\\n\\nProvide interpretation guidelines to help users understand score meanings.\\n\\nInclude metadata such as metric name, required columns, and output type.\\n\\nUse a standardized interface or base class for all custom metrics.\\n\\nImplementation Patterns for Advanced Metrics\\n\\nWhen developing advanced metrics like topic adherence:\\n\\nDesign multi-step evaluation workflows for complex tasks.\\n\\nUse specialized prompts for different sub-tasks within the metric.\\n\\nAllow configurable scoring modes (e.g., precision, recall, F1).\\n\\nSupport conversational context for multi-turn evaluations.\\n\\nDebugging Custom Metrics\\n\\nEffective debugging strategies include:\\n\\nImplementing a debug mode to capture prompt inputs, outputs, and intermediate results.\\n\\nLogging detailed evaluation steps for easier troubleshooting.\\n\\nReviewing final scores alongside intermediate calculations to identify issues.\\n\\nConclusion: Building an Evaluation Ecosystem\\n\\nCustom metrics allow you to build a comprehensive evaluation ecosystem tailored to your application\\'s specific needs:\\n\\nBaseline metrics: Start with Ragas\\' core metrics for fundamental quality aspects\\n\\nDomain adaptation: Add specialized metrics for your application domain\\n\\nFeature-specific metrics: Develop metrics for unique features of your system\\n\\nBusiness alignment: Create metrics that reflect specific business KPIs and requirements\\n\\nBy extending Ragas with custom metrics, you can create evaluation frameworks that precisely measure what matters most for your LLM applications, leading to more meaningful improvements and better user experiences.\\n\\nIn our next post, we\\'ll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows.\\n\\nPart 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications Part 2: Basic Evaluation Workflow Part 3: Evaluating RAG Systems with Ragas Part 4: Test Data Generation Part 5: Advanced Evaluation Techniques — You are here Next up in the series: Part 6: Evaluating AI Agents Part 7: Integrations and Observability Part 8: Building Feedback Loops\\n\\nHow have you implemented feedback loops in your LLM applications? What improvement strategies have been most effective for your use cases? 
If you’re facing specific evaluation hurdles, don’t hesitate to reach out—we’d love to help!'),\n", " Document(metadata={'source': 'data/basic-evaluation-workflow-with-ragas/index.md', 'url': 'https://thedataguy.pro/basic-evaluation-workflow-with-ragas/', '_id': '57555960a06b4741adf1aa5655267097', '_collection_name': 'thedataguy_documents'}, page_content='title: \"Part 2: Basic Evaluation Workflow with Ragas\" date: 2025-04-26T19:00:00-06:00 layout: blog description: \"Learn how to set up a basic evaluation workflow for LLM applications using Ragas. This guide walks you through data preparation, metric selection, and result analysis.\" categories: [\"AI\", \"RAG\", \"Evaluation\", \"Ragas\"] coverImage: \"https://images.unsplash.com/photo-1600132806370-bf17e65e942f?q=80&w=1988&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D\" readingTime: 8 published: true\\n\\nIn our previous post, we introduced Ragas as a powerful framework for evaluating LLM applications. Now, let\\'s dive into the practical aspects of setting up your first evaluation pipeline.\\n\\nUnderstanding the Evaluation Workflow\\n\\nA typical Ragas evaluation workflow consists of four key steps:\\n\\nPrepare your data: Collect queries, contexts, responses, and reference answers\\n\\nSelect appropriate metrics: Choose metrics that align with what you want to evaluate\\n\\nRun the evaluation: Process your data through the selected metrics\\n\\nAnalyze the results: Interpret scores and identify areas for improvement\\n\\nLet\\'s walk through each step with practical examples.\\n\\nStep 1: Setting Up Your Environment\\n\\nFirst, ensure you have Ragas installed:\\n\\nbash uv add ragas\\n\\nNext, import the necessary components:\\n\\npython import pandas as pd from ragas import EvaluationDataset from ragas import evaluate, RunConfig from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity\\n\\nStep 2: Preparing Your Evaluation Data\\n\\nFor a RAG system evaluation, you\\'ll need:\\n\\nQuestions: User queries to your system\\n\\nContexts: Documents or chunks retrieved by your system\\n\\nResponses: Answers generated by your system\\n\\nGround truth (optional): Reference answers or documents for comparison\\n\\nHere\\'s how to organize this data:\\n\\n```python\\n\\nSample data\\n\\ndata = { \"user_input\": [ \"What are the main symptoms of COVID-19?\", \"How does machine learning differ from deep learning?\" ], \"retrieved_contexts\": [ [ \"Common symptoms of COVID-19 include fever, cough, and fatigue. 
Some patients also report loss of taste or smell, body aches, and difficulty breathing.\", \"COVID-19 is caused by the SARS-CoV-2 virus and spreads primarily through respiratory droplets.\" ], [ \"Machine learning is a subset of AI focused on algorithms that learn from data without being explicitly programmed.\", \"Deep learning is a specialized form of machine learning using neural networks with many layers (deep neural networks).\" ] ], \"response\": [ \"The main symptoms of COVID-19 include fever, cough, fatigue, and sometimes loss of taste or smell, body aches, and breathing difficulties.\", \"Machine learning is a subset of AI that focuses on algorithms learning from data, while deep learning is a specialized form of machine learning that uses deep neural networks with multiple layers.\" ], \"reference\": [ \"COVID-19 symptoms commonly include fever, dry cough, fatigue, loss of taste or smell, body aches, sore throat, and in severe cases, difficulty breathing.\", \"Machine learning is a branch of AI where systems learn from data, identify patterns, and make decisions with minimal human intervention. Deep learning is a subset of machine learning that uses neural networks with multiple layers (deep neural networks) to analyze various factors of data.\" ] }\\n\\neval_data = pd.DataFrame(data)\\n\\nConvert to a format Ragas can use\\n\\nevaluation_dataset = EvaluationDataset.from_pandas(eval_data) evaluation_dataset\\n\\n```\\n\\nStep 3: Selecting and Configuring Metrics\\n\\nRagas offers various metrics to evaluate different aspects of your system:\\n\\nCore RAG Metrics:\\n\\nFaithfulness: Measures if the response is factually consistent with the provided context.\\n\\nFactual Correctness: Assesses if the response is accurate and free from factual errors.\\n\\nResponse Relevancy: Evaluates if the response directly addresses the user query.\\n\\nContext Entity Recall: Measures how well the retrieved context captures relevant entities from the ground truth.\\n\\nNoise Sensitivity: Assesses the robustness of the response to irrelevant or noisy context.\\n\\nLLM Context Recall: Evaluates how effectively the LLM utilizes the provided context to generate the response.\\n\\nFor metrics that require an LLM (like faithfulness), you need to configure the LLM provider:\\n\\n```python\\n\\nConfigure LLM for evaluation\\n\\nfrom langchain_openai import ChatOpenAI from ragas.llms import LangchainLLMWrapper\\n\\nInitialize the LLM, you are going to OPENAI API key\\n\\nevaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=\"gpt-4o\"))\\n\\nDefine metrics to use\\n\\nmetrics = [ Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity(), LLMContextRecall() ] ```\\n\\nStep 4: Running the Evaluation\\n\\nNow, run the evaluation with your selected metrics:\\n\\n```python\\n\\nRun evaluation\\n\\nresults = evaluate( evaluation_dataset, metrics=metrics, llm=evaluator_llm # Required for LLM-based metrics )\\n\\nView results\\n\\nprint(results) ```\\n\\nOutput:\\n\\nValues will vary based on your data and LLM performance.\\n\\npython { \"faithfulness\": 1.0000, \"factual_correctness\": 0.6750, \"answer_relevancy\": 0.9897, \"context_entity_recall\": 0.8889, \"noise_sensitivity_relevant\": 0.1667, \"context_recall\": 0.5000 }\\n\\nStep 5: Interpreting Results\\n\\nRagas metrics typically return scores between 0 and 1, where higher is better:\\n\\nUnderstanding Score Ranges:\\n\\n0.8-1.0: Excellent performance\\n\\n0.6-0.8: Good performance\\n\\n0.4-0.6: Moderate 
performance, needs improvement\\n\\n0.4 or lower: Poor performance, requires significant attention\\n\\nAdvanced Use: Custom Evaluation for Specific Examples\\n\\nFor more detailed analysis of specific examples:\\n\\n```python from ragas import SingleTurnSample from ragas.metrics import AspectCritic\\n\\nDefine a specific test case\\n\\ntest_data = { \"user_input\": \"What are quantum computers?\", \"response\": \"Quantum computers use quantum bits or qubits that can exist in multiple states simultaneously, unlike classical bits that can only be 0 or 1.\", \"retrieved_contexts\": [\"Quantum computing is a type of computation that harnesses quantum mechanical phenomena.\"] }\\n\\nCreate a custom evaluation metric\\n\\ncustom_metric = AspectCritic( name=\"quantum_accuracy\", llm=llm, definition=\"Verify if the explanation of quantum computing is accurate and complete.\" )\\n\\nScore the sample\\n\\nsample = SingleTurnSample(**test_data) score = await custom_metric.single_turn_ascore(sample) print(f\"Quantum accuracy score: {score}\") ```\\n\\n💡 Try it yourself: Explore the hands-on notebook for this workflow: 02_Basic_Evaluation_Workflow_with_Ragas\\n\\nCommon Evaluation Patterns and Metrics\\n\\nBelow is a matrix mapping evaluation patterns to the metrics used, along with definitions for each metric:\\n\\nMetric Comprehensive RAG Evaluation Content Quality Evaluation Retrieval Quality Evaluation Faithfulness ✓ ✓ Answer Relevancy ✓ ✓ Context Recall ✓ ✓ Context Precision ✓ ✓ Harmfulness ✓ Coherence ✓ Context Relevancy ✓\\n\\nMetric Definitions\\n\\nFaithfulness: Measures if the response is factually consistent with the provided context.\\n\\nAnswer Relevancy: Assesses if the response addresses the question.\\n\\nContext Recall: Measures how well the retrieved context covers the information in the ground truth.\\n\\nContext Precision: Evaluates the proportion of relevant information in the retrieved context.\\n\\nHarmfulness: Evaluates if the response contains harmful or inappropriate content.\\n\\nCoherence: Measures the logical flow and clarity of the response.\\n\\nContext Relevancy: Evaluates if the retrieved context is relevant to the question.\\n\\nThis matrix provides a clear overview of which metrics to use for specific evaluation patterns and their respective definitions.\\n\\nBest Practices for Ragas Evaluation\\n\\nStart simple: Begin with core metrics before adding more specialized ones\\n\\nUse diverse test cases: Include a variety of questions, from simple to complex\\n\\nConsider edge cases: Test with queries that might challenge your system\\n\\nCompare versions: Track metrics across different versions of your application\\n\\nCombine with human evaluation: Use Ragas alongside human feedback for a comprehensive assessment\\n\\nConclusion\\n\\nSetting up a basic evaluation workflow with Ragas is straightforward yet powerful. 
By systematically evaluating your LLM applications, you gain objective insights into their performance and clear directions for improvement.\\n\\nIn our next post, we\\'ll delve deeper into specialized evaluation techniques for RAG systems, exploring advanced metrics and evaluation strategies for retrieval-augmented generation applications.\\n\\nPart 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications Part 2: Basic Evaluation Workflow — You are here Next up in the series: Part 3: Evaluating RAG Systems Part 4: Test Data Generation Part 5: Advanced Evaluation Techniques Part 6: Evaluating AI Agents Part 7: Integrations and Observability Part 8: Building Feedback Loops\\n\\nHave you set up your first Ragas evaluation? What aspects of your LLM application are you most interested in measuring? If you’re facing specific evaluation hurdles, don’t hesitate to reach out—we’d love to help!')]}" ] }, "execution_count": 147, "metadata": {}, "output_type": "execute_result" } ], "source": [ "response" ] }, { "cell_type": "code", "execution_count": 148, "id": "428d5070", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "Ragas is an open-source evaluation framework specifically designed for Large Language Model (LLM) applications, particularly focusing on Retrieval-Augmented Generation (RAG) systems. It provides specialized metrics to assess the performance of LLM applications, addressing challenges such as information retrieval accuracy, response consistency, and user query relevance. Ragas helps ensure that LLM applications are reliable and effective by offering tools for quality assurance, performance tracking, and continuous improvement.\n", "\n", "For more information, you can explore the following links:\n", "- [Part 1: Introduction to Ragas](https://thedataguy.pro/introduction-to-ragas/)\n", "- [Part 2: Basic Evaluation Workflow with Ragas](https://thedataguy.pro/basic-evaluation-workflow-with-ragas/)\n", "- [Part 3: Evaluating RAG Systems with Ragas](https://thedataguy.pro/evaluating-rag-systems-with-ragas/)\n", "- [Part 5: Advanced Metrics and Customization with Ragas](https://thedataguy.pro/advanced-metrics-and-customization-with-ragas/)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display_markdown(response[\"response\"].content)" ] }, { "cell_type": "code", "execution_count": 149, "id": "bae2eb92", "metadata": {}, "outputs": [], "source": [ "query = \"How to build research agents?\"\n", "response = retrieval_augmented_qa_chain.invoke({\"question\" : query})" ] }, { "cell_type": "code", "execution_count": 150, "id": "dabb1280", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'response': AIMessage(content=\"To build a research agent, you can follow these key steps:\\n\\n1. **Define the Purpose**: Identify the specific research needs, such as searching across multiple sources, analyzing documents, and providing structured reports.\\n\\n2. **Integrate Information Sources**:\\n - **Web Search**: Use APIs like Tavily and DuckDuckGo for real-time information.\\n - **Academic Research**: Connect to databases like arXiv for scholarly articles.\\n - **RSS Feeds**: Implement an RSS feed reader to aggregate content from relevant publications.\\n\\n3. **Document Analysis**: Incorporate a document analysis engine that uses techniques like Retrieval Augmented Generation (RAG) to process and analyze uploaded documents.\\n\\n4. 
**Workflow Architecture**: Utilize a structured framework like LangGraph for orchestrating the research process, maintaining context, and ensuring transparent reasoning.\\n\\n5. **Technology Stack**: Leverage tools such as LangChain, Chainlit, Qdrant, and OpenAI's GPT-4o for building the agent.\\n\\n6. **Testing and Iteration**: Continuously test the agent's performance and refine its capabilities based on user feedback and research outcomes.\\n\\nFor more detailed guidance, you can explore the following resources:\\n- [Building a Research Agent with RSS Feed Support](https://thedataguy.pro/building-research-agent/)\\n- [Advanced Metrics and Customization with Ragas](https://thedataguy.pro/advanced-metrics-and-customization-with-ragas/)\\n- [Evaluating RAG Systems with Ragas](https://thedataguy.pro/evaluating-rag-systems-with-ragas/)\\n- [Evaluating AI Agents with Ragas](https://thedataguy.pro/evaluating-ai-agents-with-ragas/)\", additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 356, 'prompt_tokens': 7690, 'total_tokens': 8046, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_96c46af214', 'id': 'chatcmpl-BVfxh1fYAlx76l7OMgPsovziPbeF3', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None}, id='run--2f808f3a-534c-4773-8bde-54ad9397f40d-0', usage_metadata={'input_tokens': 7690, 'output_tokens': 356, 'total_tokens': 8046, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}}),\n", " 'context': [Document(metadata={'source': 'data/building-research-agent/index.md', 'url': 'https://thedataguy.pro/building-research-agent/', '_id': '818ba3edf3484becaeb57c363f6cf8db', '_collection_name': 'thedataguy_documents'}, page_content='layout: blog title: Building a Research Agent with RSS Feed Support date: 2025-04-20T00:00:00-06:00 description: How I created a comprehensive research assistant that combines web search, academic papers, RSS feeds, and document analysis to revolutionize information discovery. categories: [\"AI\", \"LLM\", \"Research\", \"Technology\", \"Agents\"] coverImage: \"https://images.unsplash.com/photo-1507842217343-583bb7270b66?q=80&w=2290&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D\" readingTime: 5 published: true\\n\\nIn the age of information overload, finding the right data efficiently has become increasingly challenging. Whether you\\'re conducting academic research, staying updated on industry trends, or investigating specific topics, the process often involves juggling multiple tools and platforms. This fragmentation inspired me to create a comprehensive solution: a research agent with RSS feed support that brings together multiple information sources in one unified interface.\\n\\nWhy Build a Research Agent?\\n\\nAs someone who regularly conducts research across different domains, I\\'ve experienced the frustration of switching between search engines, academic databases, news aggregators, and document analysis tools. Each context switch breaks concentration and slows down the discovery process. 
I wanted a tool that could:\\n\\nSearch across multiple information sources simultaneously\\n\\nAnalyze uploaded documents in the context of web information\\n\\nProvide transparent reasoning about its research process\\n\\nDeliver structured, well-cited reports\\n\\nThe result is the Research Agent - an LLM-powered assistant that brings together web search, academic papers, RSS feeds, and document analysis into a single, coherent workflow.\\n\\nMulti-Source Research Architecture\\n\\nThe agent\\'s strength comes from its ability to tap into various information streams:\\n\\nWeb Search Integration\\n\\nFor real-time information and general knowledge, the agent leverages both Tavily and DuckDuckGo APIs to perform semantic searches across the web. This provides access to current events, recent developments, and general information that might not be available in academic sources.\\n\\nAcademic Research Pipeline\\n\\nResearch often requires scholarly sources. The agent connects to arXiv\\'s extensive database of scientific papers, allowing it to retrieve relevant academic articles complete with titles, authors, and abstracts. This is particularly valuable for technical topics that require peer-reviewed information.\\n\\nRSS Feed Aggregation\\n\\nFor targeted news monitoring and industry updates, the RSS feed reader component allows the agent to retrieve content from specific publications and blogs. This is ideal for tracking industry trends or following particular news sources relevant to your research topic.\\n\\nDocument Analysis Engine\\n\\nPerhaps the most powerful feature is the document analysis capability, which uses Retrieval Augmented Generation (RAG) to process uploaded PDFs or text files. By breaking documents into semantic chunks and creating vector embeddings, the agent can answer questions specifically about your documents while incorporating relevant information from other sources.\\n\\nBehind the Scenes: LangGraph Workflow\\n\\nWhat makes this agent particularly powerful is its LangGraph-based architecture, which provides a structured framework for reasoning and tool orchestration:\\n\\nResearch Agent Graph\\n\\nThis workflow provides several key advantages:\\n\\nContextual Awareness: The agent maintains context throughout the research process\\n\\nDynamic Tool Selection: It intelligently chooses which information sources to query based on your question\\n\\nTransparent Reasoning: You can see each step of the research process\\n\\nConsistent Output Structure: Results are formatted into comprehensive reports with proper citations\\n\\nThe Technology Stack\\n\\nBuilding the Research Agent required integrating several cutting-edge technologies:\\n\\nLangChain: Provides the foundation for LLM application development\\n\\nLangGraph: Enables sophisticated workflow orchestration\\n\\nChainlit: Powers the interactive chat interface\\n\\nQdrant: Serves as the vector database for document embeddings\\n\\nOpenAI: Supplies the GPT-4o language model and embeddings\\n\\nTavily/DuckDuckGo: Delivers web search capabilities\\n\\narXiv API: Connects to academic paper repositories\\n\\nFeedparser: Handles RSS feed processing\\n\\nThe Research Process in Action\\n\\nWhen you ask the Research Agent a question, it follows a systematic process:\\n\\nQuery Analysis: It first analyzes your question to determine which information sources would be most relevant\\n\\nMulti-Tool Research: Depending on the query, it executes searches across selected tools\\n\\nContext Retrieval: If you\\'ve uploaded 
documents, it retrieves relevant passages from them\\n\\nResearch Transparency: It shows each step of its research process for full transparency\\n\\nInformation Synthesis: It analyzes and combines information from all sources\\n\\nStructured Reporting: It delivers a comprehensive response with proper citations\\n\\nReal-World Applications\\n\\nThe Research Agent has proven valuable across various use cases:\\n\\nAcademic Research: Gathering information across multiple scholarly sources\\n\\nCompetitive Analysis: Staying updated on industry competitors\\n\\nTechnical Deep Dives: Understanding complex technical topics\\n\\nNews Monitoring: Tracking specific events across multiple sources\\n\\nDocument Q&A: Asking questions about specific documents in broader context\\n\\nLessons Learned and Future Directions\\n\\nBuilding this agent taught me several valuable lessons about LLM application development:\\n\\nTool Integration Complexity: Combining multiple data sources requires careful consideration of data formats and query patterns\\n\\nContext Management: Maintaining context across different research steps is critical for coherent outputs\\n\\nTransparency Matters: Users trust AI more when they can see how it reached its conclusions\\n\\nLangGraph Power: The graph-based approach to LLM workflows provides significant advantages over simpler chains\\n\\nLooking ahead, I\\'m exploring several enhancements:\\n\\nExpanded academic database integration beyond arXiv\\n\\nMore sophisticated document analysis with multi-document reasoning\\n\\nImproved citation formats and bibliographic support\\n\\nEnhanced visualization of research findings\\n\\nTry It Yourself\\n\\nThe Research Agent is available as an open-source project, and you can try it directly on Hugging Face Spaces:\\n\\nLive Demo: Hugging Face Space\\n\\nSource Code: GitHub Repository\\n\\nIf you\\'re interested in deploying your own instance, the GitHub repository includes detailed setup instructions for both local development and Docker deployment.\\n\\nHave you used the Research Agent or built similar tools? I\\'d love to hear about your experiences and any suggestions for improvements. Feel free to reach out through the contact form or connect with me on social media!'),\n", " Document(metadata={'source': 'data/advanced-metrics-and-customization-with-ragas/index.md', 'url': 'https://thedataguy.pro/advanced-metrics-and-customization-with-ragas/', '_id': 'dea41eaa40a54b3ca558fda681d69ec8', '_collection_name': 'thedataguy_documents'}, page_content='title: \"Part 5: Advanced Metrics and Customization with Ragas\" date: 2025-04-28T05:00:00-06:00 layout: blog description: \"Explore advanced metrics and customization techniques in Ragas for evaluating LLM applications, including creating custom metrics, domain-specific evaluation, composite scoring, and best practices for building a comprehensive evaluation ecosystem.\" categories: [\"AI\", \"RAG\", \"Evaluation\", \"Ragas\",\"Data\"] coverImage: \"https://plus.unsplash.com/premium_photo-1661368994107-43200954c524?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D\" readingTime: 9 published: true\\n\\nIn our previous post, we explored how to generate comprehensive test datasets for evaluating LLM applications. 
Now, let\\'s dive into one of Ragas\\' most powerful capabilities: advanced metrics and custom evaluation approaches that address specialized evaluation needs.\\n\\nBeyond the Basics: Why Advanced Metrics Matter\\n\\nWhile Ragas\\' core metrics cover fundamental evaluation aspects, real-world applications often have unique requirements:\\n\\nDomain-specific quality criteria: Legal, medical, or financial applications have specialized accuracy requirements\\n\\nCustom interaction patterns: Applications with unique conversation flows need tailored evaluation approaches\\n\\nSpecialized capabilities: Features like reasoning, code generation, or structured output demand purpose-built metrics\\n\\nBusiness-specific KPIs: Aligning evaluation with business objectives requires customized metrics\\n\\nLet\\'s explore how to extend Ragas\\' capabilities to meet these specialized needs.\\n\\nUnderstanding Ragas\\' Metric Architecture\\n\\nBefore creating custom metrics, it\\'s helpful to understand Ragas\\' metric architecture:\\n\\n1. Understand the Metric Base Classes\\n\\nAll metrics in Ragas inherit from the abstract Metric class (see metrics/base.py). For most use cases, you’ll extend one of these:\\n\\nSingleTurnMetric: For metrics that evaluate a single question/response pair.\\n\\nMultiTurnMetric: For metrics that evaluate multi-turn conversations.\\n\\nMetricWithLLM: For metrics that require an LLM for evaluation.\\n\\nMetricWithEmbeddings: For metrics that use embeddings.\\n\\nYou can mix these as needed (e.g., MetricWithLLM, SingleTurnMetric).\\n\\nEach metric implements specific scoring methods depending on its type:\\n\\n_single_turn_ascore: For single-turn metrics\\n\\n_multi_turn_ascore: For multi-turn metrics\\n\\nCreating Your First Custom Metric\\n\\nLet\\'s create a custom metric that evaluates technical accuracy in programming explanations:\\n\\n```python from dataclasses import dataclass, field from typing import Dict, Optional, Set import typing as t\\n\\nfrom ragas.metrics.base import MetricWithLLM, SingleTurnMetric from ragas.prompt import PydanticPrompt from ragas.metrics import MetricType, MetricOutputType from pydantic import BaseModel\\n\\nDefine input/output models for the prompt\\n\\nclass TechnicalAccuracyInput(BaseModel): question: str context: str response: str programming_language: str = \"python\"\\n\\nclass TechnicalAccuracyOutput(BaseModel): score: float feedback: str\\n\\nDefine the prompt\\n\\nclass TechnicalAccuracyPrompt(PydanticPrompt[TechnicalAccuracyInput, TechnicalAccuracyOutput]): instruction: str = ( \"Evaluate the technical accuracy of the response to a programming question. 
\" \"Consider syntax correctness, algorithmic accuracy, and best practices.\" ) input_model = TechnicalAccuracyInput output_model = TechnicalAccuracyOutput examples = [ # Add examples here ]\\n\\nCreate the metric\\n\\n@dataclass class TechnicalAccuracy(MetricWithLLM, SingleTurnMetric): name: str = \"technical_accuracy\" _required_columns: Dict[MetricType, Set[str]] = field( default_factory=lambda: { MetricType.SINGLE_TURN: { \"user_input\", \"response\",\\n\\n }\\n }\\n)\\noutput_type: Optional[MetricOutputType] = MetricOutputType.CONTINUOUS\\nevaluation_prompt: PydanticPrompt = field(default_factory=TechnicalAccuracyPrompt)\\n\\nasync def _single_turn_ascore(self, sample, callbacks) -> float:\\n assert self.llm is not None, \"LLM must be set\"\\n\\n question = sample.user_input\\n response = sample.response\\n # Extract programming language from question if possible\\n programming_language = \"python\" # Default\\n languages = [\"python\", \"javascript\", \"java\", \"c++\", \"rust\", \"go\"]\\n for lang in languages:\\n if lang in question.lower():\\n programming_language = lang\\n break\\n\\n # Get the context\\n context = \"\\\\n\".join(sample.retrieved_contexts) if sample.retrieved_contexts else \"\"\\n\\n # Prepare input for prompt\\n prompt_input = TechnicalAccuracyInput(\\n question=question,\\n context=context,\\n response=response,\\n programming_language=programming_language\\n )\\n\\n # Generate evaluation\\n evaluation = await self.evaluation_prompt.generate(\\n data=prompt_input, llm=self.llm, callbacks=callbacks\\n )\\n\\n return evaluation.score\\n\\n```\\n\\nUsing the Custom Metric\\n\\nTo use the custom metric, simply include it in your evaluation pipeline:\\n\\n```python from langchain_openai import ChatOpenAI from ragas import SingleTurnSample from ragas.llms import LangchainLLMWrapper\\n\\nInitialize the LLM, you are going to OPENAI API key\\n\\nevaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=\"gpt-4o\"))\\n\\ntest_data = { \"user_input\": \"Write a function to calculate the factorial of a number in Python.\", \"retrieved_contexts\": [\"Python is a programming language.\", \"A factorial of a number n is the product of all positive integers less than or equal to n.\"], \"response\": \"def factorial(n):\\\\n if n == 0:\\\\n return 1\\\\n else:\\\\n return n * factorial(n-1)\", }\\n\\nCreate a sample\\n\\nsample = SingleTurnSample(**test_data) # Unpack the dictionary into the constructor technical_accuracy = TechnicalAccuracy(llm=evaluator_llm) score = await technical_accuracy.single_turn_ascore(sample) print(f\"Technical Accuracy Score: {score}\")\\n\\nNote: The above code is a simplified example. In a real-world scenario, you would need to handle exceptions,\\n\\n`` You can also use theevaluate` function to evaluate a dataset:\\n\\n```python from ragas import evaluate from ragas import evaluate\\n\\nresults = evaluate( dataset, # Your dataset of samples metrics=[TechnicalAccuracy(), ...], llm=myevaluator_llm_llm ) ```\\n\\n💡 Try it yourself: Explore the hands-on notebook for synthetic data generation: 05_Advanced_Metrics_and_Customization\\n\\nCustomizing Metrics for Your Application\\n\\nYou can further refine your evaluation by customizing existing metrics—such as adjusting thresholds or criteria—to better fit your application\\'s requirements. 
For multi-turn conversations, you might configure metrics like topic adherence to emphasize specific aspects, such as precision or recall, based on your evaluation objectives.\\n\\nIn specialized domains like healthcare or legal, it\\'s crucial to design custom metrics that capture domain-specific accuracy and compliance needs. For complex applications, consider combining several metrics into composite scores to represent multiple quality dimensions.\\n\\nWhen assessing capabilities like code generation or structured outputs, develop metrics that evaluate execution correctness or schema compliance. For advanced scenarios, you can build metric pipelines that orchestrate several metrics and aggregate their results using strategies like weighted averages or minimum scores.\\n\\nBy thoughtfully customizing and combining metrics, you can achieve a comprehensive and meaningful evaluation framework tailored to your unique use case.\\n\\nBest Practices for Custom Metric Development\\n\\nSingle Responsibility: Each metric should evaluate one specific aspect\\n\\nClear Definition: Define precisely what your metric measures\\n\\nBounded Output: Scores should be normalized, typically in [0,1]\\n\\nReproducibility: Minimize randomness in evaluation\\n\\nDocumentation: Document criteria, prompt design, and interpretation guidelines\\n\\nTest with Examples: Verify metric behavior on clear-cut examples\\n\\nHuman Correlation: Validate that metrics correlate with human judgment\\n\\nStandardizing Custom Metrics\\n\\nTo ensure consistency across custom metrics, consider the following best practices:\\n\\nDefine a clear, human-readable description for each metric.\\n\\nProvide interpretation guidelines to help users understand score meanings.\\n\\nInclude metadata such as metric name, required columns, and output type.\\n\\nUse a standardized interface or base class for all custom metrics.\\n\\nImplementation Patterns for Advanced Metrics\\n\\nWhen developing advanced metrics like topic adherence:\\n\\nDesign multi-step evaluation workflows for complex tasks.\\n\\nUse specialized prompts for different sub-tasks within the metric.\\n\\nAllow configurable scoring modes (e.g., precision, recall, F1).\\n\\nSupport conversational context for multi-turn evaluations.\\n\\nDebugging Custom Metrics\\n\\nEffective debugging strategies include:\\n\\nImplementing a debug mode to capture prompt inputs, outputs, and intermediate results.\\n\\nLogging detailed evaluation steps for easier troubleshooting.\\n\\nReviewing final scores alongside intermediate calculations to identify issues.\\n\\nConclusion: Building an Evaluation Ecosystem\\n\\nCustom metrics allow you to build a comprehensive evaluation ecosystem tailored to your application\\'s specific needs:\\n\\nBaseline metrics: Start with Ragas\\' core metrics for fundamental quality aspects\\n\\nDomain adaptation: Add specialized metrics for your application domain\\n\\nFeature-specific metrics: Develop metrics for unique features of your system\\n\\nBusiness alignment: Create metrics that reflect specific business KPIs and requirements\\n\\nBy extending Ragas with custom metrics, you can create evaluation frameworks that precisely measure what matters most for your LLM applications, leading to more meaningful improvements and better user experiences.\\n\\nIn our next post, we\\'ll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows.\\n\\nPart 1: Introduction to Ragas: The Essential Evaluation Framework for 
LLM Applications Part 2: Basic Evaluation Workflow Part 3: Evaluating RAG Systems with Ragas Part 4: Test Data Generation Part 5: Advanced Evaluation Techniques — You are here Next up in the series: Part 6: Evaluating AI Agents Part 7: Integrations and Observability Part 8: Building Feedback Loops\\n\\nHow have you implemented feedback loops in your LLM applications? What improvement strategies have been most effective for your use cases? If you’re facing specific evaluation hurdles, don’t hesitate to reach out—we’d love to help!'),\n", " Document(metadata={'source': 'data/evaluating-rag-systems-with-ragas/index.md', 'url': 'https://thedataguy.pro/evaluating-rag-systems-with-ragas/', '_id': 'e23d0f12b30d4a9b9ef557bac2a0a745', '_collection_name': 'thedataguy_documents'}, page_content='title: \"Part 3: Evaluating RAG Systems with Ragas\" date: 2025-04-26T20:00:00-06:00 layout: blog description: \"Learn specialized techniques for comprehensive evaluation of Retrieval-Augmented Generation systems using Ragas, including metrics for retrieval quality, generation quality, and end-to-end performance.\" categories: [\"AI\", \"RAG\", \"Evaluation\", \"Ragas\"] coverImage: \"https://images.unsplash.com/photo-1743796055664-3473eedab36e?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D\" readingTime: 14 published: true\\n\\nIn our previous post, we covered the fundamentals of setting up evaluation workflows with Ragas. Now, let\\'s focus specifically on evaluating Retrieval-Augmented Generation (RAG) systems, which present unique evaluation challenges due to their multi-component nature.\\n\\nUnderstanding RAG Systems: More Than the Sum of Their Parts\\n\\nRAG systems combine two critical capabilities: 1. Retrieval: Finding relevant information from a knowledge base 2. Generation: Creating coherent, accurate responses based on retrieved information\\n\\nThis dual nature means evaluation must address both components while also assessing their interaction. A system might retrieve perfect information but generate poor responses, or generate excellent prose from irrelevant retrieved content.\\n\\nThe RAG Evaluation Triad\\n\\nEffective RAG evaluation requires examining three key dimensions:\\n\\nRetrieval Quality: How well does the system find relevant information?\\n\\nGeneration Quality: How well does the system produce responses from retrieved information?\\n\\nEnd-to-End Performance: How well does the complete system satisfy user needs?\\n\\nLet\\'s explore how Ragas helps evaluate each dimension of RAG systems.\\n\\nCore RAG Metrics in Ragas\\n\\nRagas provides specialized metrics to assess RAG systems across retrieval, generation, and end-to-end performance.\\n\\nRetrieval Quality Metrics\\n\\n1. Context Relevancy\\n\\nMeasures how relevant the retrieved documents are to the user\\'s question.\\n\\nHow it works:\\n\\nTakes the user\\'s question (user_input) and the retrieved documents (retrieved_contexts).\\n\\nUses an LLM to score relevance with two different prompts, averaging the results for robustness.\\n\\nScores are normalized between 0.0 (irrelevant) and 1.0 (fully relevant).\\n\\nWhy it matters: Low scores indicate your retriever is pulling in unrelated or noisy documents. Monitoring this helps you improve the retrieval step.\\n\\n2. 
Context Precision\\n\\nAssesses how much of the retrieved context is actually useful for generating the answer.\\n\\nHow it works:\\n\\nFor each retrieved chunk, an LLM judges if it was necessary for the answer, using the ground truth (reference) or the generated response.\\n\\nCalculates Average Precision, rewarding systems that rank useful chunks higher.\\n\\nVariants:\\n\\nContextUtilization: Uses the generated response instead of ground truth.\\n\\nNon-LLM version: Compares retrieved chunks to ideal reference contexts using string similarity.\\n\\nWhy it matters: High precision means your retriever is efficient; low precision means too much irrelevant information is included.\\n\\n3. Context Recall\\n\\nEvaluates whether all necessary information from the ground truth answer is present in the retrieved context.\\n\\nHow it works:\\n\\nBreaks down the reference answer into sentences.\\n\\nFor each sentence, an LLM checks if it can be supported by the retrieved context.\\n\\nThe score is the proportion of reference sentences attributed to the retrieved context.\\n\\nVariants:\\n\\nNon-LLM version: Compares reference and retrieved contexts using similarity and thresholds.\\n\\nWhy it matters: High recall means your retriever finds all needed information; low recall means critical information is missing.\\n\\nSummary: - Low context relevancy: Retriever needs better query understanding or semantic matching. - Low context precision: Retriever includes unnecessary information. - Low context recall: Retriever misses critical information.\\n\\nGeneration Quality Metrics\\n\\n1. Faithfulness\\n\\nChecks if the generated answer is factually consistent with the retrieved context, addressing hallucination.\\n\\nHow it works:\\n\\nBreaks the answer into simple statements.\\n\\nFor each, an LLM checks if it can be inferred from the retrieved context.\\n\\nThe score is the proportion of faithful statements.\\n\\nAlternative:\\n\\nFaithfulnesswithHHEM: Uses a specialized NLI model for verification.\\n\\nWhy it matters: High faithfulness means answers are grounded in context; low faithfulness signals hallucination.\\n\\n2. Answer Relevancy\\n\\nMeasures if the generated answer directly addresses the user\\'s question.\\n\\nHow it works:\\n\\nAsks an LLM to generate possible questions for the answer.\\n\\nCompares these to the original question using embedding similarity.\\n\\nPenalizes noncommittal answers.\\n\\nWhy it matters: High relevancy means answers are on-topic; low relevancy means answers are off-topic or incomplete.\\n\\nSummary: - Low faithfulness: Generator adds facts not supported by context. - Low answer relevancy: Generator doesn\\'t focus on the specific question.\\n\\nEnd-to-End Metrics\\n\\n1. Correctness\\n\\nAssesses factual alignment between the generated answer and a ground truth reference.\\n\\nHow it works:\\n\\nBreaks both the answer and reference into claims.\\n\\nUses NLI to verify claims in both directions.\\n\\nCalculates precision, recall, or F1-score.\\n\\nWhy it matters: High correctness means answers match the ground truth; low correctness signals factual errors.\\n\\nKey distinction: - Faithfulness: Compares answer to retrieved context. - FactualCorrectness: Compares answer to ground truth.\\n\\nCommon RAG Evaluation Patterns\\n\\n1. High Retrieval, Low Generation Scores\\n\\nDiagnosis: Good retrieval, poor use of information.\\n\\nFixes: Improve prompts, use better generation models, or verify responses post-generation.\\n\\n2. 
Low Retrieval, High Generation Scores\\n\\nDiagnosis: Good generation, inadequate information.\\n\\nFixes: Enhance indexing, retrieval algorithms, or expand the knowledge base.\\n\\n3. Low Context Precision, High Faithfulness\\n\\nDiagnosis: Retrieves too much, but generates reliably.\\n\\nFixes: Filter passages, optimize chunk size, or use re-ranking.\\n\\nBest Practices for RAG Evaluation\\n\\nEvaluate components independently: Assess retrieval and generation separately.\\n\\nUse diverse queries: Include factoid, explanatory, and complex questions.\\n\\nCompare against baselines: Test against simpler systems.\\n\\nPerform ablation studies: Try variations like different chunk sizes or retrieval models.\\n\\nCombine with human evaluation: Use Ragas with human judgment for a complete view.\\n\\nConclusion: The Iterative RAG Evaluation Cycle\\n\\nEffective RAG development is iterative:\\n\\nEvaluate: Measure performance.\\n\\nAnalyze: Identify weaknesses.\\n\\nImprove: Apply targeted enhancements.\\n\\nRe-evaluate: Measure the impact of changes.\\n\\nThe Iterative RAG Evaluation Cycle\\n\\nBy using Ragas to implement this cycle, you can systematically improve your RAG system\\'s performance across all dimensions.\\n\\nIn our next post, we\\'ll explore how to generate high-quality test datasets for comprehensive RAG evaluation, addressing the common challenge of limited test data.\\n\\nPart 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications Part 2: Basic Evaluation Workflow Part 3: Evaluating RAG Systems with Ragas — You are here Next up in the series: Part 4: Test Data Generation Part 5: Advanced Evaluation Techniques Part 6: Evaluating AI Agents Part 7: Integrations and Observability Part 8: Building Feedback Loops\\n\\nHow have you implemented feedback loops in your LLM applications? What improvement strategies have been most effective for your use cases? If you’re facing specific evaluation hurdles, don’t hesitate to reach out—we’d love to help!'),\n", " Document(metadata={'source': 'data/evaluating-ai-agents-with-ragas/index.md', 'url': 'https://thedataguy.pro/evaluating-ai-agents-with-ragas/', '_id': '5b107b3376b0488ebf1de07a90eef585', '_collection_name': 'thedataguy_documents'}, page_content='title: \"Part 6: Evaluating AI Agents: Beyond Simple Answers with Ragas\" date: 2025-04-28T06:00:00-06:00 layout: blog description: \"Learn how to evaluate complex AI agents using Ragas\\' specialized metrics for goal accuracy, tool call accuracy, and topic adherence to build more reliable and effective agent-based applications.\" categories: [\"AI\", \"Agents\", \"Evaluation\", \"Ragas\", \"LLM\"] coverImage: \"/images/ai_agent_evaluation.png\" readingTime: 8 published: true\\n\\nIn our previous posts, we\\'ve explored how Ragas evaluates RAG systems and enables custom metrics for specialized applications. As LLMs evolve beyond simple question-answering to become powerful AI agents, evaluation needs have grown more sophisticated too. 
In this post, we\\'ll explore Ragas\\' specialized metrics for evaluating AI agents that engage in multi-turn interactions, use tools, and work toward specific goals.\\n\\nThe Challenge of Evaluating AI Agents\\n\\nUnlike traditional RAG systems, AI agents present unique evaluation challenges:\\n\\nMulti-turn interactions: Agents maintain context across multiple exchanges\\n\\nTool usage: Agents call external tools and APIs to accomplish tasks\\n\\nGoal-oriented behavior: Success means achieving the user\\'s ultimate objective\\n\\nBoundaries and constraints: Agents must operate within defined topic boundaries\\n\\nStandard metrics like faithfulness or answer relevancy don\\'t fully capture these dimensions. Let\\'s explore three specialized metrics Ragas provides for agent evaluation.\\n\\nEvaluating AI Agents: Beyond Simple Answers with Ragas\\n\\n1. Goal Accuracy (agent_goal_accuracy)\\n\\nWhat it measures: Did the agent successfully achieve the user\\'s ultimate objective over the course of the interaction?\\n\\nHow it works: This metric analyzes the entire agent workflow (user inputs, AI responses, tool calls). * It uses an LLM (InferGoalOutcomePrompt) to identify the user_goal and the end_state (what actually happened). * It then compares the end_state to either: * A provided reference outcome (AgentGoalAccuracyWithReference). * The inferred user_goal (AgentGoalAccuracyWithoutReference). * An LLM (CompareOutcomePrompt) determines if the achieved outcome matches the desired one, resulting in a binary score (1 for success, 0 for failure).\\n\\nWhy it\\'s important: For task-oriented agents (like booking systems or assistants), success isn\\'t about individual responses but about completing the overall task correctly. This metric directly measures that end-to-end success.\\n\\n2. Tool Call Accuracy (tool_call_accuracy)\\n\\nWhat it measures: Did the agent use the correct tools, in the right order, and with the right arguments?\\n\\nHow it works: This metric compares the sequence and details of tool calls made by the agent against a reference_tool_calls list. * It checks if the sequence of tool names called by the agent aligns with the reference sequence (is_sequence_aligned). * For each matching tool call, it compares the arguments provided by the agent to the reference arguments, often using a sub-metric like ExactMatch (_get_arg_score). * The final score reflects both the sequence alignment and the argument correctness.\\n\\nWhy it\\'s important: Many agents rely on external tools (APIs, databases, etc.). Incorrect tool usage (wrong tool, bad parameters) leads to task failure. This metric pinpoints issues in the agent\\'s interaction with its tools.\\n\\n3. Topic Adherence (topic_adherence)\\n\\nWhat it measures: Did the agent stick to the allowed topics and appropriately handle requests about restricted topics?\\n\\nHow it works: This metric evaluates conversations against a list of reference_topics. * It extracts the topics discussed in the user\\'s input (TopicExtractionPrompt). * It checks if the agent refused to answer questions related to specific topics (TopicRefusedPrompt). * It classifies whether the discussed topics fall within the allowed reference_topics (TopicClassificationPrompt). 
* Based on these classifications and refusals, it calculates a score (Precision, Recall, or F1) indicating how well the agent adhered to the topic constraints.\\n\\nWhy it\\'s important: Ensures agents stay focused, avoid generating content on forbidden subjects (safety, policy), and handle out-of-scope requests gracefully.\\n\\nImplementing Agent Evaluation in Practice\\n\\nLet\\'s look at a practical example of evaluating an AI agent using these metrics:\\n\\n```python from ragas.metrics import AgentGoalAccuracyWithoutReference, ToolCallAccuracy, TopicAdherenceScore from ragas.evaluation import EvaluationDataset from ragas.dataset_schema import MultiTurnSample from langchain_openai import ChatOpenAI from ragas.llms import LangchainLLMWrapper\\n\\nInitialize the LLM\\n\\nevaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=\"gpt-4o\"))\\n\\nExample conversation with a travel booking agent\\n\\ntest_data = { \"user_input\": [ {\"role\": \"user\", \"content\": \"I need to book a flight from New York to London next Friday\"}, {\"role\": \"assistant\", \"content\": \"I\\'d be happy to help you book a flight. Let me search for options...\", \"tool_calls\": [{\"name\": \"search_flights\", \"arguments\": {\"origin\": \"NYC\", \"destination\": \"LON\", \"date\": \"next Friday\"}}]}, {\"role\": \"tool\", \"name\": \"search_flights\", \"content\": \"Found 5 flights: Flight 1 (Delta, $750), Flight 2 (British Airways, $820)...\"}, {\"role\": \"assistant\", \"content\": \"I found several flights from New York to London next Friday. The cheapest option is Delta for $750. Would you like to book this one?\"}, {\"role\": \"user\", \"content\": \"Yes, please book the Delta flight\"}, {\"role\": \"assistant\", \"content\": \"I\\'ll book that for you now.\", \"tool_calls\": [{\"name\": \"book_flight\", \"arguments\": {\"flight_id\": \"delta_123\", \"price\": \"$750\"}}]}, {\"role\": \"tool\", \"name\": \"book_flight\", \"content\": \"Booking confirmed. Confirmation #: ABC123\"}, {\"role\": \"assistant\", \"content\": \"Great news! Your flight is confirmed. Your confirmation number is ABC123. The flight is scheduled for next Friday. 
Is there anything else you need help with?\"} ], \"reference_topics\": [\"travel\", \"flight booking\", \"schedules\", \"prices\"], \"reference_tool_calls\": [ {\"name\": \"search_flights\", \"args\": {\"origin\": \"NYC\", \"destination\": \"LON\", \"date\": \"next Friday\"}}, {\"name\": \"book_flight\", \"args\": {\"flight_id\": \"delta_123\", \"price\": \"$750\"}} ] }\\n\\nCreate a sample\\n\\nsample = MultiTurnSample(**test_data)\\n\\nInitialize metrics\\n\\ngoal_accuracy = AgentGoalAccuracyWithoutReference(llm=evaluator_llm) tool_accuracy = ToolCallAccuracy() topic_adherence = TopicAdherenceScore(llm=evaluator_llm)\\n\\nCalculate scores\\n\\ngoal_score = await goal_accuracy.multi_turn_ascore(sample) tool_score = tool_accuracy.multi_turn_score(sample) topic_score = await topic_adherence.multi_turn_ascore(sample)\\n\\nprint(f\"Goal Accuracy: {goal_score}\") print(f\"Tool Call Accuracy: {tool_score}\") print(f\"Topic Adherence: {topic_score}\") ```\\n\\n💡 Try it yourself: Explore the hands-on notebook for agent evaluation: 06_Evaluating_AI_Agents\\n\\nAdvanced Agent Evaluation Techniques\\n\\nCombining Metrics for Comprehensive Evaluation\\n\\nFor a complete assessment of agent capabilities, combine multiple metrics:\\n\\n```python from ragas import evaluate\\n\\nresults = evaluate( dataset, # Your dataset of agent conversations metrics=[ AgentGoalAccuracyWithoutReference(llm=evaluator_llm), ToolCallAccuracy(), TopicAdherence(llm=evaluator_llm) ] ) ```\\n\\nBest Practices for Agent Evaluation\\n\\nTest scenario coverage: Include a diverse range of interaction scenarios\\n\\nEdge case handling: Test how agents handle unexpected inputs or failures\\n\\nLongitudinal evaluation: Track performance over time to identify regressions\\n\\nHuman-in-the-loop validation: Periodically verify metric alignment with human judgments\\n\\nContinuous feedback loops: Use evaluation insights to guide agent improvements\\n\\nConclusion\\n\\nEvaluating AI agents requires specialized metrics that go beyond traditional RAG evaluation. Ragas\\' agent_goal_accuracy, tool_call_accuracy, and topic_adherence provide crucial insights into whether an agent can successfully complete tasks, use tools correctly, and stay within designated boundaries.\\n\\nBy incorporating these metrics into your evaluation pipeline, you can build more reliable and effective AI agents that truly deliver on the promise of helpful, goal-oriented AI assistants.\\n\\nIn our next post, we\\'ll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows.\\n\\nPart 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications Part 2: Basic Evaluation Workflow Part 3: Evaluating RAG Systems with Ragas Part 4: Test Data Generation Part 5: Advanced Metrics and Customization Part 6: Evaluating AI Agents — You are here Next up in the series: Part 7: Integrations and Observability Part 8: Building Feedback Loops\\n\\nHow are you evaluating your AI agents? What challenges have you encountered in measuring agent performance? If you\\'re facing specific evaluation hurdles, don\\'t hesitate to reach out—we\\'d love to help!')]}" ] }, "execution_count": 150, "metadata": {}, "output_type": "execute_result" } ], "source": [ "response" ] }, { "cell_type": "code", "execution_count": null, "id": "4d5121ca", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "To build a research agent, you can follow these key steps:\n", "\n", "1. 
**Define the Purpose**: Identify the specific research needs, such as searching across multiple sources, analyzing documents, and providing structured reports.\n", "\n", "2. **Integrate Information Sources**:\n", " - **Web Search**: Use APIs like Tavily and DuckDuckGo for real-time information.\n", " - **Academic Research**: Connect to databases like arXiv for scholarly articles.\n", " - **RSS Feeds**: Implement an RSS feed reader to aggregate content from relevant publications.\n", "\n", "3. **Document Analysis**: Incorporate a document analysis engine that uses techniques like Retrieval Augmented Generation (RAG) to process and analyze uploaded documents.\n", "\n", "4. **Workflow Architecture**: Utilize a structured framework like LangGraph for orchestrating the research process, maintaining context, and ensuring transparent reasoning.\n", "\n", "5. **Technology Stack**: Leverage tools such as LangChain, Chainlit, Qdrant, and OpenAI's GPT-4o for building the agent.\n", "\n", "6. **Testing and Iteration**: Continuously test the agent's performance and refine its capabilities based on user feedback and research outcomes.\n", "\n", "For more detailed guidance, you can explore the following resources:\n", "- [Building a Research Agent with RSS Feed Support](https://thedataguy.pro/building-research-agent/)\n", "- [Advanced Metrics and Customization with Ragas](https://thedataguy.pro/advanced-metrics-and-customization-with-ragas/)\n", "- [Evaluating RAG Systems with Ragas](https://thedataguy.pro/evaluating-rag-systems-with-ragas/)\n", "- [Evaluating AI Agents with Ragas](https://thedataguy.pro/evaluating-ai-agents-with-ragas/)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display_markdown(response[\"response\"].content)" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.2" } }, "nbformat": 4, "nbformat_minor": 5 }