mafzaal committed on
Commit
9ffcda2
·
1 Parent(s): c962643

Refactor code structure for improved readability and maintainability

DESIGN.md CHANGED
@@ -35,38 +35,154 @@ For our first implementation, we will focus on content from [TheDataGuy](https:/
35
 
36
  These questions align with the main themes in the blog posts, which cover LLM evaluation, RAG systems, custom development of AI tools, and data strategy.
37
 
38
- # AI-Driven Chat for [TheDataGuy](https://thedataguy.pro): Solution Design
39
 
40
- ## Proposed Solution
41
 
42
- We propose an AI-driven chat assistant for [TheDataGuy](https://thedataguy.pro)'s blog that enables visitors to have interactive conversations about the technical content. This chat component will understand both the blog's content and the user's intent, providing contextually relevant responses about RAG systems, evaluation metrics, data strategy, and other specialized topics covered in the blog. Users will be able to ask clarifying questions about complex concepts, request code examples, or dive deeper into specific aspects of a post without having to search through multiple articles. The experience will feel like having a conversation with a knowledgeable data professional who has internalized all of [TheDataGuy](https://thedataguy.pro)'s expertise, creating a personalized learning experience that adapts to each visitor's technical background and specific interests.
43
 
44
  ## Technology Stack
45
 
46
- 1. **LLM**: OpenAI GPT-4.1/o4-mini - Delivers superior reasoning capabilities for technical content interpretation with an optimal balance of context handling and cost-efficiency, ensuring accurate responses to complex AI engineering questions.
47
 
48
- 2. **Embedding Model**: Snowflake Arctic Embed L v2.0 (Snowflake/snowflake-arctic-embed-l-v2.0) - Optimized for technical content retrieval with strong semantic understanding of AI concepts while maintaining efficient inference for blog content vectorization.
49
 
50
  3. **Orchestration**: LangChain - Provides flexible components for building LLM applications with robust RAG pipelines and context management tailored for technical blog content.
51
 
52
- 4. **Vector Database**: Qdrant - Self-hostable, offers filtering capabilities ideal for blog content categorization, and provides high performance with reasonable operational complexity.
53
 
54
  5. **Monitoring**: LangSmith - Integrates seamlessly with LangChain while providing comprehensive tracing, debugging, and performance monitoring specific to LLM applications.
55
 
56
- 6. **Evaluation**: Ragas - Aligns perfectly with [TheDataGuy](https://thedataguy.pro)'s expertise and blog content, enabling evaluation of the chat system on metrics like faithfulness and relevance.
57
-
58
- 7. **User Interface**: Custom Svelte component - Matches the existing blog's aesthetics and provides a lightweight, responsive chat interface with minimal impact on page load times.
59
-
60
- 8. **Serving & Inference**: Azure Container Apps - Offers scalability and seamless deployment for the chat service while aligning with [TheDataGuy](https://thedataguy.pro)'s technical environment as indicated by the Azure development focus.
61
 
62
  ## Agentic Reasoning
63
 
64
- The solution will implement agentic reasoning in two key areas. First, a research agent will dynamically determine when additional context is needed beyond what's in the vector database, intelligently pulling supplementary information from recent blog posts or external technical documentation when appropriate. Second, a reasoning agent will manage complex technical questions that require multi-step thinking, breaking down queries about implementation details, evaluation methodologies, or architectural decisions into logical components before synthesizing comprehensive answers. This approach will be particularly valuable when users ask about applying concepts across multiple blog posts (e.g., "How would I evaluate a research agent using Ragas?") where simple retrieval wouldn't provide a satisfactory response.
65
 
66
 
67
  # Task 3: Dealing with the Data
68
 
69
- We will be using data source
70
 
71
- 1. Markdown blog post files from [TheDataGuy](https://thedataguy.pro)' from github at [mafzaal](https://github.com/mafzaal/mafzaal.github.io)
72
- 2. Python code files from [TheDataGuy](https://thedataguy.pro)'s from github at [intro-to-ragas](https://github.com/mafzaal/intro-to-ragas) and [AIE6-ResearchAgent](https://github.com/mafzaal/AIE6-ResearchAgent)
 
35
 
36
  These questions align with the main themes in the blog posts, which cover LLM evaluation, RAG systems, custom development of AI tools, and data strategy.
37
 
 
38
 
39
+ # Task 2: Proposed Solution
40
 
41
+ We propose an AI-driven chat assistant for [TheDataGuy](https://thedataguy.pro)'s blog that enables readers to interactively explore technical content. This component will:
42
+
43
+ - Provide contextually relevant responses about RAG systems, evaluation metrics, and data strategy
44
+ - Allow users to ask clarifying questions about complex concepts without searching across multiple articles
45
+ - Deliver code examples and detailed explanations tailored to the user's technical background
46
+ - Function as a knowledge companion that has internalized the blog's expertise
47
+ - Create a personalized learning experience that adapts to each visitor's specific interests
48
+
49
+ The solution transforms passive reading into an interactive dialogue, significantly enhancing information discovery and retention.
50
 
51
  ## Technology Stack
52
 
53
+ 1. **LLM Architecture**:
54
+ - **Primary Model**: OpenAI `gpt-4.1` - Handles complex tasks including synthetic data generation, sophisticated evaluation workflows, and the development of fine-tuning questions
55
+ - **Inference Model**: OpenAI `gpt-4o-mini` - Powers the user-facing chat application, offering an optimal balance between performance and cost-efficiency for real-time responses
56
 
57
+ 2. **Embedding Model**:
58
+ - **Base Model**: `Snowflake/snowflake-arctic-embed-l` - Provides foundation embedding capabilities optimized for technical content with robust semantic understanding
59
+ - **Fine-tuned Model**: `mafzaal/thedataguy_arctic_ft` - Custom-tuned embedding model using query-context pairs extracted from blog content, enhancing retrieval accuracy for domain-specific AI terminology and concepts
60
 
61
  3. **Orchestration**: LangChain - Provides flexible components for building LLM applications with robust RAG pipelines and context management tailored for technical blog content.
62
 
63
+ 4. **Vector Database**: Qdrant - Stores embeddings generated through a `pipeline.py` script, which also runs as a GitHub workflow to automatically incorporate new blog posts. Provides robust filtering capabilities for content categorization and delivers high performance with manageable operational complexity (a minimal indexing sketch appears after this list).
64
 
65
  5. **Monitoring**: LangSmith - Integrates seamlessly with LangChain while providing comprehensive tracing, debugging, and performance monitoring specific to LLM applications.
66
 
67
+ 6. **Evaluation**: Ragas - Aligns perfectly with [TheDataGuy](https://thedataguy.pro)'s expertise and blog content, enabling evaluation of the chat system on metrics like faithfulness and relevance. See the following notebooks:
68
+ - [05_SDG_Eval](/py-src/notebooks/05_SDG_Eval.ipynb)
69
+ - [07_Fine_Tuning_Dataset](/py-src/notebooks/07_Fine_Tuning_Dataset.ipynb)
70
+ - [07_Fine_Tune_Embeddings](/py-src/notebooks/07_Fine_Tune_Embeddings.ipynb)
71
+ - [07_Fine_Tune_Eval](/py-src/notebooks/07_Fine_Tune_Eval.ipynb)
72
+
73
+ 7. **User Interface**:
74
+ - **Current Implementation**: Chainlit - Provides rapid prototyping capabilities with built-in chat UI components (a minimal stub appears after this list)
75
+ - **Production Version**: Custom Svelte component - Delivers a lightweight, responsive interface that seamlessly integrates with the blog's existing design language while minimizing impact on page load performance
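+ To illustrate the current Chainlit prototype, a minimal stub is sketched below; the file name `app.py` and the placeholder reply are assumptions, and the real handler invokes the LangChain RAG chain described above.
+
+ ```python
+ import chainlit as cl
+
+ @cl.on_message
+ async def on_message(message: cl.Message):
+     # Placeholder reply: the production handler calls the RAG chain
+     # (retriever over blog posts + gpt-4o-mini) and returns the generated answer.
+     answer = f"(stub) You asked: {message.content}"
+     await cl.Message(content=answer).send()
+ ```
+
+ Run locally with `chainlit run app.py`.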
76
+
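+ To make the embedding and vector database choices (items 2 and 4) concrete, below is a minimal indexing sketch using LangChain's Qdrant integration. It assumes a Qdrant instance at `http://localhost:6333`, a collection named `blog_posts`, and that `mafzaal/thedataguy_arctic_ft` loads as a sentence-transformers checkpoint; the actual `pipeline.py` may differ in its details.
+
+ ```python
+ from langchain_core.documents import Document
+ from langchain_huggingface import HuggingFaceEmbeddings
+ from langchain_qdrant import QdrantVectorStore
+
+ # Fine-tuned Arctic embeddings (swap in the base model if the fine-tuned one is unavailable)
+ embeddings = HuggingFaceEmbeddings(model_name="mafzaal/thedataguy_arctic_ft")
+
+ # One Document per blog post (whole-document chunking, see Task 3)
+ posts = [
+     Document(
+         page_content="...full markdown of the post...",
+         metadata={"title": "Introduction to Ragas",
+                   "url": "https://thedataguy.pro/blog/introduction-to-ragas/"},
+     ),
+ ]
+
+ # Embed the posts and upsert them into the Qdrant collection
+ vector_store = QdrantVectorStore.from_documents(
+     posts,
+     embedding=embeddings,
+     url="http://localhost:6333",
+     collection_name="blog_posts",
+ )
+
+ # Retrieval side used by the chat chain: top-k similarity search over whole posts
+ retriever = vector_store.as_retriever(search_kwargs={"k": 2})
+ ```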
77
+ ## Serving & Inference
78
+ - **Development Environment**: Prototype deployed on Hugging Face Spaces for rapid testing and validation; try it at [Let's Talk](https://huggingface.co/spaces/mafzaal/lets_talk)
79
+ ### Future Work
80
+ - **Production Infrastructure**: Azure Container Apps - Provides event-driven autoscaling and enterprise-grade security while integrating with [TheDataGuy](https://thedataguy.pro)'s existing Azure-based technical ecosystem
81
+ - **API Layer**: FastAPI - Delivers high-performance endpoints with automatic OpenAPI documentation, facilitating seamless integration with the blog's frontend (a sketch of such an endpoint follows this list)
82
+ - **Deployment Strategy**: CI/CD pipeline using GitHub Actions - Ensures consistent testing and deployment with automated content indexing whenever new blog posts are published
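+ A minimal sketch of what the planned FastAPI layer could look like is shown below; the `/chat` route and the request/response shapes are assumptions rather than a finalized API.
+
+ ```python
+ from fastapi import FastAPI
+ from pydantic import BaseModel
+
+ app = FastAPI(title="Let's Talk - TheDataGuy Chat")
+
+ class ChatRequest(BaseModel):
+     question: str
+
+ class ChatResponse(BaseModel):
+     answer: str
+     sources: list[str] = []
+
+ @app.post("/chat", response_model=ChatResponse)
+ async def chat(req: ChatRequest) -> ChatResponse:
+     # Placeholder: the production service invokes the LangChain RAG chain
+     # (Qdrant retriever + gpt-4o-mini) and returns the answer with source URLs.
+     return ChatResponse(answer=f"(stub) You asked: {req.question}", sources=[])
+ ```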
83
 
84
  ## Agentic Reasoning
85
 
86
+ Agentic reasoning will be added in a future version.
87
 
88
 
89
  # Task 3: Dealing with the Data
90
 
91
+ ## Data Collection
92
+
93
+ The blog data from [TheDataGuy](https://thedataguy.pro) was collected and processed for our AI-driven chat component. Below is a summary of the collected blog posts:
94
+
95
+ | Title | Date | Text Length | URL |
96
+ |-------|------|-------------|-----|
97
+ | "Coming Back to AI Roots - My Professional Journey" | 2025-04-14 | 5,827 | [Link](https://thedataguy.pro/blog/coming-back-to-ai-roots/) |
98
+ | "Data is King: Why Your Data Strategy IS Your Business Strategy" | 2025-04-15 | 6,197 | [Link](https://thedataguy.pro/blog/data-is-king/) |
99
+ | "A C# Programmer's Perspective on LangChain Expression Language" | 2025-04-16 | 3,361 | [Link](https://thedataguy.pro/blog/langchain-experience-csharp-perspective/) |
100
+ | "Building a Research Agent with RSS Feed Support" | 2025-04-20 | 7,320 | [Link](https://thedataguy.pro/blog/building-research-agent/) |
101
+ | "Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications" | 2025-04-26 | 6,999 | [Link](https://thedataguy.pro/blog/introduction-to-ragas/) |
102
+ | "Part 2: Basic Evaluation Workflow with Ragas" | 2025-04-26 | 11,223 | [Link](https://thedataguy.pro/blog/basic-evaluation-workflow-with-ragas/) |
103
+ | "Part 3: Evaluating RAG Systems with Ragas" | 2025-04-26 | 8,811 | [Link](https://thedataguy.pro/blog/evaluating-rag-systems-with-ragas/) |
104
+ | "Part 4: Generating Test Data with Ragas" | 2025-04-27 | 14,682 | [Link](https://thedataguy.pro/blog/generating-test-data-with-ragas/) |
105
+ | "Part 5: Advanced Metrics and Customization with Ragas" | 2025-04-28 | 11,531 | [Link](https://thedataguy.pro/blog/advanced-metrics-and-customization-with-ragas/) |
106
+ | "Part 6: Evaluating AI Agents: Beyond Simple Answers with Ragas" | 2025-04-28 | 9,822 | [Link](https://thedataguy.pro/blog/evaluating-ai-agents-with-ragas/) |
107
+ | "Part 7: Integrations and Observability with Ragas" | 2025-04-30 | 9,100 | [Link](https://thedataguy.pro/blog/integrations-and-observability-with-ragas/) |
108
+ | "Subscribe to Our Blog via RSS" | 2025-05-03 | 2,139 | [Link](https://thedataguy.pro/blog/rss-feed-announcement/) |
109
+ | "Part 8: Building Feedback Loops with Ragas" | 2025-05-04 | 8,160 | [Link](https://thedataguy.pro/blog/building-feedback-loops-with-ragas/) |
110
+ | "Metric-Driven Development: Make Smarter Decisions, Faster" | 2025-05-05 | 12,450 | [Link](https://thedataguy.pro/blog/metric-driven-development/) |
111
+
112
+
113
+ ## Chunking Strategy
114
+
115
+ For our blog chat component, we evaluated multiple chunking approaches to optimize retrieval performance:
116
+
117
+ 1. **Initial Experimental Approach**:
118
+ - Used `RecursiveCharacterTextSplitter` with a chunk size of 1,000 characters and a 200-character overlap
119
+ - This approach provided granular context chunks for both baseline testing and embedding fine-tuning
120
+
121
+ 2. **Final Implementation Decision**:
122
+ - Opted to use whole blog posts as individual chunks rather than smaller text segments
123
+ - This approach ensures complete retention of article context and coherence
124
+ - Each blog post is treated as a distinct retrievable unit in the vector database
125
+
126
+ The whole-document chunking strategy was selected to retrieve relevant full blog posts for response generation. This approach preserves article narrative integrity while providing sufficient context for accurate responses.
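+ The sketch below contrasts the initial splitter experiment with the whole-post approach we adopted; the post file path is illustrative only.
+
+ ```python
+ from langchain_core.documents import Document
+ from langchain_text_splitters import RecursiveCharacterTextSplitter
+
+ # Illustrative path: load one blog post as a single Document
+ with open("posts/introduction-to-ragas.md", encoding="utf-8") as f:
+     post = Document(page_content=f.read(),
+                     metadata={"url": "https://thedataguy.pro/blog/introduction-to-ragas/"})
+
+ # Initial experiment: 1,000-character chunks with 200-character overlap
+ splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
+ small_chunks = splitter.split_documents([post])
+
+ # Final approach: the whole post is one retrievable unit
+ whole_post_chunks = [post]
+
+ print(f"{len(small_chunks)} small chunks vs {len(whole_post_chunks)} whole-post chunk")
+ ```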
127
+
128
+ ## Data Statistics Summary
129
+
130
+ | Statistic | Value |
131
+ |-----------|-------|
132
+ | Total Blog Posts | 14 |
133
+ | Total Characters | 106,275 |
134
+ | Minimum Post Length | 1,900 characters |
135
+ | Maximum Post Length | 13,468 characters |
136
+ | Average Post Length | 7,591 characters |
137
+
138
+ With average post length under 8,000 characters, whole-document retrieval remains efficient while eliminating contextual fragmentation that occurs with smaller chunks. This approach optimizes for content coherence over granularity, supporting comprehensive responses to technical queries about RAG systems, evaluation frameworks, and data strategy.
139
+
140
+
141
+ ## Tools and APIs
142
+ We will also use the [RSS Feed](https://thedataguy.pro/rss.xml) via a tool call to fetch the latest posts that have not yet been vectorized.
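+ A sketch of how such a tool could be exposed to the chat agent, assuming the `feedparser` package and LangChain's `@tool` decorator:
+
+ ```python
+ import feedparser
+ from langchain_core.tools import tool
+
+ @tool
+ def get_latest_posts(limit: int = 5) -> str:
+     """Fetch the newest posts from TheDataGuy RSS feed, including ones not yet vectorized."""
+     feed = feedparser.parse("https://thedataguy.pro/rss.xml")
+     return "\n".join(f"{entry.title} - {entry.link}" for entry in feed.entries[:limit])
+
+ # The chat chain can bind this tool so "what's new?" queries bypass the vector store
+ print(get_latest_posts.invoke({"limit": 3}))
+ ```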
143
+
144
+
145
+ # Task 4: Building a Quick End-to-End Prototype
146
+
147
+ Live demo at [Let's Talk](https://huggingface.co/spaces/mafzaal/lets_talk)
148
+
149
+
150
+ # Task 5: Creating a Golden Test Data Set
151
+
152
+ Synthetic data is available at [Testset](/evals/testset_2.csv), [Eval set](/evals/rag_eval_2.csv), and [Results](/evals/rag_eval_result_2.csv); a summary follows.
153
+
154
+ ## Evaluation Results Summary
155
+
156
+ | Metric | Score |
157
+ |--------|-------|
158
+ | Faithfulness | 0.8545 |
159
+ | Answer Relevancy | 0.3892 |
160
+ | Factual Correctness (F1) | 0.2490 |
161
+ | Noise Sensitivity (Relevant) | 0.2540 |
162
+ | Context Recall | 0.1905 |
163
+ | Context Entity Recall | 0.1503 |
164
+
165
+ These results indicate strong faithfulness but opportunities for improvement in contextual relevance and factual accuracy. The relatively low context recall and entity recall scores suggest that the retrieval component may need refinement to better surface relevant information from the blog content.
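+ For reference, a minimal sketch of the kind of Ragas run that produces such scores; the sample below is illustrative, and the full dataset construction lives in the linked notebooks (requires `OPENAI_API_KEY`).
+
+ ```python
+ from langchain_openai import ChatOpenAI
+ from ragas import evaluate
+ from ragas.dataset_schema import EvaluationDataset, SingleTurnSample
+ from ragas.llms import LangchainLLMWrapper
+ from ragas.metrics import Faithfulness, FactualCorrectness
+
+ evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
+
+ # One illustrative sample; the real test set comes from the synthetic data generation step
+ sample = SingleTurnSample(
+     user_input="What is Ragas?",
+     retrieved_contexts=["Ragas is an open-source evaluation framework for LLM applications."],
+     response="Ragas is an evaluation framework for LLM applications.",
+     reference="Ragas is an open-source evaluation framework for LLM applications.",
+ )
+ dataset = EvaluationDataset(samples=[sample])
+
+ result = evaluate(dataset, metrics=[Faithfulness(), FactualCorrectness()], llm=evaluator_llm)
+ print(result)
+ ```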
166
+
167
+ # Task 6: Fine-Tuning Open-Source Embeddings
168
+
169
+ The fine-tuning dataset is available at [Link](/evals/ft_questions.csv) and on Hugging Face at [thedataguy_embed_ft](https://huggingface.co/datasets/mafzaal/thedataguy_embed_ft); the accompanying notebook is [07_Fine_Tune_Embeddings](/py-src/notebooks/07_Fine_Tune_Embeddings.ipynb).
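+ A condensed sketch of the fine-tuning approach, assuming the dataset exposes `question` and `context` columns (column names may differ in the actual notebook):
+
+ ```python
+ import pandas as pd
+ from torch.utils.data import DataLoader
+ from sentence_transformers import InputExample, SentenceTransformer, losses
+
+ # Query/context pairs extracted from the blog posts (assumed column names)
+ pairs = pd.read_csv("evals/ft_questions.csv")
+ examples = [InputExample(texts=[row["question"], row["context"]]) for _, row in pairs.iterrows()]
+ loader = DataLoader(examples, batch_size=16, shuffle=True)
+
+ # In-batch negatives suit (query, positive-context) pairs
+ model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l")
+ loss = losses.MultipleNegativesRankingLoss(model)
+
+ model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=50)
+ model.save("thedataguy_arctic_ft")
+ ```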
170
+
171
+
172
+ # Task 7: Assessing Performance
173
+
174
+ The following evaluation uses the fine-tuned embedding model.
175
+
176
+ ## Fine-Tuned Embedding Model Evaluation Results
177
+
178
+ | Metric | Score |
179
+ |--------|-------|
180
+ | Faithfulness | 0.4432 |
181
+ | Answer Relevancy | 0.6849 |
182
+ | Factual Correctness (F1) | 0.2000 |
183
+ | Noise Sensitivity (Relevant) | 0.2033 |
184
+ | Context Recall | 0.2500 |
185
+ | Context Entity Recall | 0.2175 |
186
+
187
+ The fine-tuned embedding model shows improved answer relevancy and context recall compared to the base model. While faithfulness decreased, the system demonstrates better ability to retrieve relevant information. These results suggest that the fine-tuning process has shifted the model's strengths toward delivering more contextually appropriate responses, though further optimization is needed to improve faithfulness and factual accuracy.
188
 
 
 
evals/rag_eval_ft.csv ADDED
@@ -0,0 +1,83 @@
1
+ user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name
2
+ "How are Large Language Models integrated into modern applications, and why is their performance evaluation considered critical according to the context?","[""Reward functions embody the core MDD idea: set a clear, measurable goal, and let it guide actions towards success.\n\n## Metric-Driven Development Across Industries: Real-World Examples\n\nMDD isn't just for software. Here's how different fields use it:"", ""By extending Ragas with custom metrics, you can create evaluation frameworks that precisely measure what matters most for your LLM applications, leading to more meaningful improvements and better user experiences.\n\nIn our next post, we'll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows."", '3. **Organization & Governance**: Establishing metadata frameworks, quality control processes, and governance structures that make data discoverable and trustworthy.\n\n4. **Insight Extraction**: Applying analytics techniques from basic reporting to advanced machine learning that convert data into actionable business intelligence.', ""* **Faithfulness**: Measures how well the generated answer aligns with the retrieved context\n* **Answer Relevancy**: Evaluates whether the response correctly addresses the user's query\n* **Context Relevancy**: Assesses if the system retrieves information that's actually needed\n* **Context Precision**: Quantifies how efficiently the system uses retrieved information\n\nThese targeted metrics provide clearer direction than general-purpose evaluations, allowing teams to make precise improvements where they matter most.\nImagine two teams building a new feature for a streaming platform:""]","['---\ntitle: ""Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications""\ndate: 2025-04-26T18:00:00-06:00\nlayout: blog\ndescription: ""Explore the essential evaluation framework for LLM applications with Ragas. Learn how to assess performance, ensure accuracy, and improve reliability in Retrieval-Augmented Generation systems.""\ncategories: [""AI"", ""RAG"", ""Evaluation"",""Ragas""]\ncoverImage: ""https://images.unsplash.com/photo-1593642634367-d91a135587b5?q=80&w=1770&auto=format&fit=crop&ixlib=rb-4.0.3""\nreadingTime: 7\npublished: true\n---\n\nAs Large Language Models (LLMs) become fundamental components of modern applications, effectively evaluating their performance becomes increasingly critical. Whether you\'re building a question-answering system, a document retrieval tool, or a conversational agent, you need reliable metrics to assess how well your application performs. This is where Ragas steps in.\n\n## What is Ragas?']","Large Language Models (LLMs) are integrated into modern applications by extending frameworks like Ragas with custom metrics, which allows for the creation of evaluation frameworks that measure what matters most for specific applications. This leads to more meaningful improvements and better user experiences.
3
+
4
+ Performance evaluation of LLMs is considered critical because it ensures that the models generate relevant and accurate responses, aligning with user queries and the context provided. Targeted metrics such as faithfulness, answer relevancy, context relevancy, and context precision help teams make precise improvements, enhancing the overall effectiveness of the applications.
5
+
6
+ For more information, you can explore the following links:
7
+
8
+ - [Metric-Driven Development: Make Smarter Decisions, Faster](https://thedataguy.pro/blog/metric-driven-development/)
9
+ - [Advanced Metrics and Customization with Ragas](https://thedataguy.pro/blog/advanced-metrics-and-customization-with-ragas/)
10
+ - [Data is King: Why Your Data Strategy IS Your Business Strategy](https://thedataguy.pro/blog/data-is-king/)","Large Language Models (LLMs) are becoming fundamental components of modern applications, such as question-answering systems, document retrieval tools, and conversational agents. Effectively evaluating their performance is considered increasingly critical to ensure reliable metrics for assessing how well these applications perform.",single_hop_specifc_query_synthesizer
11
+ "Howw does Ragas help evalute LLM applikations in the helthcare industree, and why is this importent for ensuring akuracy and reliabilitee?","[""Reward functions embody the core MDD idea: set a clear, measurable goal, and let it guide actions towards success.\n\n## Metric-Driven Development Across Industries: Real-World Examples\n\nMDD isn't just for software. Here's how different fields use it:"", ""By extending Ragas with custom metrics, you can create evaluation frameworks that precisely measure what matters most for your LLM applications, leading to more meaningful improvements and better user experiences.\n\nIn our next post, we'll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows."", '3. **Organization & Governance**: Establishing metadata frameworks, quality control processes, and governance structures that make data discoverable and trustworthy.\n\n4. **Insight Extraction**: Applying analytics techniques from basic reporting to advanced machine learning that convert data into actionable business intelligence.', ""These columns enable comprehensive evaluation by linking each question to its supporting evidence and expected answer, while also providing insight into the diversity and difficulty of the generated queries.\n\n\n## Deep Dive into Test Data Generation\n\nSo you have a collection of documents and want to create a robust evaluation dataset for your RAG system using Ragas. The `TestsetGenerator`'s `generate_with_langchain_docs` method is your starting point. But what exactly happens when you call it? Let's peek under the hood.\n\n**The Goal:** To take raw Langchain `Document` objects and transform them into a structured Ragas `Testset` containing diverse question-answer pairs grounded in those documents.\n\n**The Workflow:**""]","[""## What is Ragas?\n\n[Ragas](https://docs.ragas.io/en/stable/) is an open-source evaluation framework specifically designed for LLM applications, with particular strengths in Retrieval-Augmented Generation (RAG) systems. Unlike traditional NLP evaluation methods, Ragas provides specialized metrics that address the unique challenges of LLM-powered systems.\n\nAt its core, Ragas helps answer crucial questions:\n- Is my application retrieving the right information?\n- Are the responses factually accurate and consistent with the retrieved context?\n- Does the system appropriately address the user's query?\n- How well does my application handle multi-turn conversations?\n\n## Why Evaluate LLM Applications?\n\nLLMs are powerful but imperfect. They can hallucinate facts, misinterpret queries, or generate convincing but incorrect responses. For applications where accuracy and reliability matter—like healthcare, finance, or education—proper evaluation is non-negotiable.""]",I don't know.,"Ragas is an open-source evaluation framework designed for LLM applications, with strengths in Retrieval-Augmented Generation systems. It provides specialized metrics to address challenges unique to LLM-powered systems, such as determining if the application retrieves the right information and if responses are factually accurate and consistent with the retrieved context. Evaluating LLM applications is especially important in healthcare, where accuracy and reliability are critical, because LLMs can hallucinate facts, misinterpret queries, or generate convincing but incorrect responses. 
Proper evaluation with frameworks like Ragas is non-negotiable in healthcare to ensure the system's accuracy and reliability.",single_hop_specifc_query_synthesizer
12
+ What specialized metrics does Ragas provide for evaluating LLM applications?,"[""By extending Ragas with custom metrics, you can create evaluation frameworks that precisely measure what matters most for your LLM applications, leading to more meaningful improvements and better user experiences.\n\nIn our next post, we'll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows."", ""Reward functions embody the core MDD idea: set a clear, measurable goal, and let it guide actions towards success.\n\n## Metric-Driven Development Across Industries: Real-World Examples\n\nMDD isn't just for software. Here's how different fields use it:"", '3. **Organization & Governance**: Establishing metadata frameworks, quality control processes, and governance structures that make data discoverable and trustworthy.\n\n4. **Insight Extraction**: Applying analytics techniques from basic reporting to advanced machine learning that convert data into actionable business intelligence.', ""## Conclusion\n\nEvaluating AI agents requires specialized metrics that go beyond traditional RAG evaluation. Ragas' `agent_goal_accuracy`, `tool_call_accuracy`, and `topic_adherence` provide crucial insights into whether an agent can successfully complete tasks, use tools correctly, and stay within designated boundaries.\n\nBy incorporating these metrics into your evaluation pipeline, you can build more reliable and effective AI agents that truly deliver on the promise of helpful, goal-oriented AI assistants.\n\nIn our next post, we'll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows.""]","[""Evaluation serves several key purposes:\n- **Quality assurance**: Identify and fix issues before they reach users\n- **Performance tracking**: Monitor how changes impact system performance\n- **Benchmarking**: Compare different approaches objectively\n- **Continuous improvement**: Build feedback loops to enhance your application\n\n## Key Features of Ragas\n\n### 🎯 Specialized Metrics\nRagas offers both LLM-based and computational metrics tailored to evaluate different aspects of LLM applications:\n\n- **Faithfulness**: Measures if the response is factually consistent with the retrieved context\n- **Context Relevancy**: Evaluates if the retrieved information is relevant to the query\n- **Answer Relevancy**: Assesses if the response addresses the user's question\n- **Topic Adherence**: Gauges how well multi-turn conversations stay on topic""]","Ragas provides specialized metrics for evaluating LLM applications, including:
13
+
14
+ - **Agent Goal Accuracy**: Measures how well the AI agent achieves its intended goals.
15
+ - **Tool Call Accuracy**: Assesses the correctness of the tools used by the agent.
16
+ - **Topic Adherence**: Evaluates whether the agent stays within the designated topics during interactions.
17
+
18
+ These metrics help in building more reliable and effective AI agents.
19
+
20
+ For more information, you can explore the following links:
21
+ - [Evaluating AI Agents: Beyond Simple Answers with Ragas](https://thedataguy.pro/blog/evaluating-ai-agents-with-ragas/)
22
+ - [Advanced Metrics and Customization with Ragas](https://thedataguy.pro/blog/advanced-metrics-and-customization-with-ragas/)","Ragas offers both LLM-based and computational metrics tailored to evaluate different aspects of LLM applications, including faithfulness, context relevancy, answer relevancy, and topic adherence.",single_hop_specifc_query_synthesizer
23
+ "me wanna know how LangSmith work with Ragas, like is it for test data or what, and how it help me as LLM app builder, can you tell me all about LangSmith from this info?","[""Reward functions embody the core MDD idea: set a clear, measurable goal, and let it guide actions towards success.\n\n## Metric-Driven Development Across Industries: Real-World Examples\n\nMDD isn't just for software. Here's how different fields use it:"", ""### Ready to Elevate Your LLM Applications?\n\nStart exploring Ragas today by visiting the [official documentation](https://docs.ragas.io/en/stable/). Share your thoughts, challenges, or success stories. If you're facing specific evaluation hurdles, don't hesitate to [reach out](https://www.linkedin.com/in/muhammadafzaal/)—we'd love to help!"", ""*How are you evaluating your AI agents? What challenges have you encountered in measuring agent performance? If you're facing specific evaluation hurdles, don't hesitate to [reach out](https://www.linkedin.com/in/muhammadafzaal/)—we'd love to help!*"", ""*How are you evaluating your AI agents? What challenges have you encountered in measuring agent performance? If you're facing specific evaluation hurdles, don't hesitate to [reach out](https://www.linkedin.com/in/muhammadafzaal/)—we'd love to help!*""]","[""### 🧪 Test Data Generation\nCreating high-quality test data is often a bottleneck in evaluation. Ragas helps you generate comprehensive test datasets automatically, saving time and ensuring thorough coverage.\n\n### 🔗 Seamless Integrations\nRagas works with popular LLM frameworks and tools:\n- [LangChain](https://www.langchain.com/)\n- [LlamaIndex](https://www.llamaindex.ai/)\n- [Haystack](https://haystack.deepset.ai/)\n- [OpenAI](https://openai.com/)\n\nObservability platforms \n- [Phoenix](https://phoenix.arize.com/)\n- [LangSmith](https://python.langchain.com/docs/introduction/)\n- [Langfuse](https://www.langfuse.com/)\n\n### 📊 Comprehensive Analysis\nBeyond simple scores, Ragas provides detailed insights into your application's strengths and weaknesses, enabling targeted improvements.\n\n## Getting Started with Ragas\n\nInstalling Ragas is straightforward:\n\n```bash\nuv init && uv add ragas\n```\n\nHere's a simple example of evaluating a response using Ragas:""]",I don't know.,"LangSmith is listed as one of the observability platforms that Ragas works with. The context does not provide details about how LangSmith specifically functions, but it shows that Ragas integrates with LangSmith to support observability in LLM-powered systems.",single_hop_specifc_query_synthesizer
24
+ How do I use the OPENAI API key when initializing an LLM for evaluation with Ragas?,"[""By extending Ragas with custom metrics, you can create evaluation frameworks that precisely measure what matters most for your LLM applications, leading to more meaningful improvements and better user experiences.\n\nIn our next post, we'll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows."", ""Reward functions embody the core MDD idea: set a clear, measurable goal, and let it guide actions towards success.\n\n## Metric-Driven Development Across Industries: Real-World Examples\n\nMDD isn't just for software. Here's how different fields use it:"", '## Implementing Agent Evaluation in Practice\n\nLet\'s look at a practical example of evaluating an AI agent using these metrics:\n\n```python\nfrom ragas.metrics import AgentGoalAccuracyWithoutReference, ToolCallAccuracy, TopicAdherenceScore\nfrom ragas.evaluation import EvaluationDataset\nfrom ragas.dataset_schema import MultiTurnSample\nfrom langchain_openai import ChatOpenAI\nfrom ragas.llms import LangchainLLMWrapper\n\n# Initialize the LLM\nevaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=""gpt-4o""))', '* **`conditional`:** Creates questions with ""if/then"" clauses based on information in the graph.\n* **Generation Process:** `self.generate()` calculates how many questions of each type to create based on the `testset_size` and the distribution probabilities. It then uses an `Executor` to run the appropriate synthesizers, generating the final `TestsetSample` objects that make up your evaluation dataset.']","['## Getting Started with Ragas\n\nInstalling Ragas is straightforward:\n\n```bash\nuv init && uv add ragas\n```\n\nHere\'s a simple example of evaluating a response using Ragas:\n\n```python\nfrom ragas.metrics import Faithfulness\nfrom ragas.evaluation import EvaluationDataset\nfrom ragas.dataset_schema import SingleTurnSample\nfrom langchain_openai import ChatOpenAI\nfrom ragas.llms import LangchainLLMWrapper\nfrom langchain_openai import ChatOpenAI\n\n# Initialize the LLM, you are going to new OPENAI API key\nevaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=""gpt-4o"")) \n\n# Your evaluation data\ntest_data = {\n ""user_input"": ""What is the capital of France?"",\n ""retrieved_contexts"": [""Paris is the capital and most populous city of France.""],\n ""response"": ""The capital of France is Paris.""\n}\n\n# Create a sample\nsample = SingleTurnSample(**test_data) # Unpack the dictionary into the constructor']","To use the OPENAI API key when initializing an LLM for evaluation with Ragas, you typically set the API key in your environment or directly in your code before initializing the LLM. Here's a basic example of how to do this:
25
+
26
+ ```python
27
+ import os
28
+ from langchain_openai import ChatOpenAI
29
+
30
+ # Set your OpenAI API key
31
+ os.environ[""OPENAI_API_KEY""] = ""your_api_key_here""
32
+
33
+ # Initialize the LLM
34
+ evaluator_llm = ChatOpenAI(model=""gpt-4o"")
35
+ ```
36
+
37
+ Make sure to replace `""your_api_key_here""` with your actual OpenAI API key.
38
+
39
+ For more detailed information on evaluating AI agents with Ragas, you can refer to the following link:
40
+
41
+ - [Evaluating AI Agents: Beyond Simple Answers with Ragas](https://thedataguy.pro/blog/evaluating-ai-agents-with-ragas/)","To use the OPENAI API key when initializing an LLM for evaluation with Ragas, you need to provide your OPENAI API key during the initialization of the ChatOpenAI model, as shown in the example: evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=""gpt-4o"")).",single_hop_specifc_query_synthesizer
42
+ "How does synthetic data generation contribute to the importance of data quantity in improving RAG evaluation, and what paradigm shift does this represent according to Peter Norvig's perspective?","[""Reward functions embody the core MDD idea: set a clear, measurable goal, and let it guide actions towards success.\n\n## Metric-Driven Development Across Industries: Real-World Examples\n\nMDD isn't just for software. Here's how different fields use it:"", '3. **Organization & Governance**: Establishing metadata frameworks, quality control processes, and governance structures that make data discoverable and trustworthy.\n\n4. **Insight Extraction**: Applying analytics techniques from basic reporting to advanced machine learning that convert data into actionable business intelligence.', ""* **Faithfulness**: Measures how well the generated answer aligns with the retrieved context\n* **Answer Relevancy**: Evaluates whether the response correctly addresses the user's query\n* **Context Relevancy**: Assesses if the system retrieves information that's actually needed\n* **Context Precision**: Quantifies how efficiently the system uses retrieved information\n\nThese targeted metrics provide clearer direction than general-purpose evaluations, allowing teams to make precise improvements where they matter most.\nImagine two teams building a new feature for a streaming platform:"", 'Essentially, the default transformations build a knowledge graph populated with embedded, filtered document chunks and corresponding simple, extractive question-answer pairs.\n\n**Spotlight: Query Synthesizers (via `self.generate()` and `default_query_distribution`)**\n\nThe `self.generate()` method, called by `generate_with_langchain_docs`, is responsible for taking the foundational graph and creating the final, potentially complex, test questions using **Query Synthesizers** (also referred to as ""evolutions"" or ""scenarios"").']","['<1-hop>\n\n## Why and How to Generate Synthetic Data for RAG Evaluation\n\nIn the world of Retrieval-Augmented Generation (RAG) and LLM-powered applications, **synthetic data generation** is a game-changer for rapid iteration and robust evaluation. This blog post explains why synthetic data is essential, and how you can generate it for your own RAG pipelines—using modern tools like [RAGAS](https://github.com/explodinggradients/ragas) and [LangSmith](https://smith.langchain.com/).\n\n---\n\n### Why Generate Synthetic Data?\n\n1. **Early Signal, Fast Iteration** \n Real-world data is often scarce or expensive to label. Synthetic data lets you quickly create test sets that mimic real user queries and contexts, so you can evaluate your system’s performance before deploying to production.\n\n2. **Controlled Complexity** \n You can design synthetic datasets to cover edge cases, multi-hop reasoning, or specific knowledge domains—ensuring your RAG system is robust, not just good at the “easy” cases.', '<2-hop>\n\n## The Origin of ""Data is King""\n\nPeter Norvig famously stated, ""We don\'t have better algorithms. We just have more data."" This statement emerged during a time when Google\'s approach to machine translation was yielding surprisingly effective results not through algorithmic innovations, but through the sheer volume of multilingual data they had amassed. \n\nThis perspective represented a paradigm shift. Prior to this, the field had largely focused on crafting ever more sophisticated algorithms, with the assumption that smarter code would yield better results. 
Norvig\'s insight suggested something different: even relatively simple algorithms could outperform more sophisticated ones when trained on sufficiently large datasets.\n\n## The Business Imperative of Data Ownership']",I don't know.,"Synthetic data generation enables rapid creation of test sets that mimic real user queries and contexts, allowing for early evaluation and iteration of RAG systems even when real-world data is scarce or expensive to label. This approach increases the quantity of data available for evaluation, supporting robust system development. According to Peter Norvig's perspective, as described in the context, the importance of data quantity represents a paradigm shift: rather than relying solely on more sophisticated algorithms, having more data—even with simpler algorithms—can lead to better results. Thus, synthetic data generation aligns with this shift by providing the large datasets necessary to improve system performance.",multi_hop_abstract_query_synthesizer
43
+ "How does Ragas support the evaluation of both LLM applications and AI agents, and what specialized metrics are introduced for evaluating AI agents?","[""By extending Ragas with custom metrics, you can create evaluation frameworks that precisely measure what matters most for your LLM applications, leading to more meaningful improvements and better user experiences.\n\nIn our next post, we'll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows."", ""Reward functions embody the core MDD idea: set a clear, measurable goal, and let it guide actions towards success.\n\n## Metric-Driven Development Across Industries: Real-World Examples\n\nMDD isn't just for software. Here's how different fields use it:"", '3. **Organization & Governance**: Establishing metadata frameworks, quality control processes, and governance structures that make data discoverable and trustworthy.\n\n4. **Insight Extraction**: Applying analytics techniques from basic reporting to advanced machine learning that convert data into actionable business intelligence.', ""## Conclusion\n\nEvaluating AI agents requires specialized metrics that go beyond traditional RAG evaluation. Ragas' `agent_goal_accuracy`, `tool_call_accuracy`, and `topic_adherence` provide crucial insights into whether an agent can successfully complete tasks, use tools correctly, and stay within designated boundaries.\n\nBy incorporating these metrics into your evaluation pipeline, you can build more reliable and effective AI agents that truly deliver on the promise of helpful, goal-oriented AI assistants.\n\nIn our next post, we'll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows.""]","['<1-hop>\n\n# Create a sample\nsample = SingleTurnSample(**test_data) # Unpack the dictionary into the constructor\n\n# Create metric\nfaithfulness = Faithfulness(llm=evaluator_llm)\n# Calculate the score\nresult = await faithfulness.single_turn_ascore(sample)\nprint(f""Faithfulness score: {result}"")\n```\n\n> 💡 **Try it yourself:** \n> Explore the hands-on notebook for this workflow: \n> [01_Introduction_to_Ragas](https://github.com/mafzaal/intro-to-ragas/blob/master/01_Introduction_to_Ragas.ipynb)\n\n## What\'s Coming in This Blog Series\n\nThis introduction is just the beginning. 
In the upcoming posts, we\'ll dive deeper into all aspects of evaluating LLM applications with Ragas:\n\n**[Part 2: Basic Evaluation Workflow](/blog/basic-evaluation-workflow-with-ragas/)** \nWe\'ll explore each metric in detail, explaining when and how to use them effectively.', ""<2-hop>\n\nIn our next post, we'll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows.\n\n---\n \n**[Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications](/blog/introduction-to-ragas/)** \n**[Part 2: Basic Evaluation Workflow](/blog/basic-evaluation-workflow-with-ragas/)** \n**[Part 3: Evaluating RAG Systems with Ragas](/blog/evaluating-rag-systems-with-ragas/)** \n**[Part 4: Test Data Generation](/blog/generating-test-data-with-ragas)** \n**Part 5: Advanced Evaluation Techniques — _You are here_** \n*Next up in the series:* \n**[Part 6: Evaluating AI Agents](/blog/evaluating-ai-agents-with-ragas/)** \n**[Part 7: Integrations and Observability](/blog/integrations-and-observability-with-ragas/)** \n**[Part 8: Building Feedback Loops](/blog/building-feedback-loops-with-ragas/)**"", '<3-hop>\n\n---\ntitle: ""Part 6: Evaluating AI Agents: Beyond Simple Answers with Ragas""\ndate: 2025-04-28T06:00:00-06:00\nlayout: blog\ndescription: ""Learn how to evaluate complex AI agents using Ragas\' specialized metrics for goal accuracy, tool call accuracy, and topic adherence to build more reliable and effective agent-based applications.""\ncategories: [""AI"", ""Agents"", ""Evaluation"", ""Ragas"", ""LLM""]\ncoverImage: ""/images/ai_agent_evaluation.png"" \nreadingTime: 8\npublished: true\n---\n\nIn our previous posts, we\'ve explored how Ragas evaluates RAG systems and enables custom metrics for specialized applications. As LLMs evolve beyond simple question-answering to become powerful AI agents, evaluation needs have grown more sophisticated too. In this post, we\'ll explore Ragas\' specialized metrics for evaluating AI agents that engage in multi-turn interactions, use tools, and work toward specific goals.\n\n## The Challenge of Evaluating AI Agents']","Ragas supports the evaluation of LLM applications by allowing the extension of custom metrics, which helps create evaluation frameworks that measure what matters most for these applications. This leads to more meaningful improvements and better user experiences.
44
+
45
+ For evaluating AI agents, Ragas introduces specialized metrics such as `agent_goal_accuracy`, `tool_call_accuracy`, and `topic_adherence`. These metrics provide insights into an agent's ability to complete tasks, use tools correctly, and stay within designated boundaries, enhancing the reliability and effectiveness of AI agents.
46
+
47
+ For more information, you can explore the following links:
48
+ - [Advanced Metrics and Customization with Ragas](https://thedataguy.pro/blog/advanced-metrics-and-customization-with-ragas/)
49
+ - [Evaluating AI Agents: Beyond Simple Answers with Ragas](https://thedataguy.pro/blog/evaluating-ai-agents-with-ragas/)","Ragas supports the evaluation of LLM applications by providing metrics such as faithfulness, which can be applied to single-turn samples as shown in the introductory workflow. For AI agents, Ragas introduces specialized metrics designed to assess more complex behaviors, including goal accuracy, tool call accuracy, and topic adherence. These metrics enable the evaluation of AI agents that perform multi-turn interactions, utilize tools, and work toward specific goals, thus addressing the advanced requirements of evaluating agent-based applications.",multi_hop_abstract_query_synthesizer
50
+ "How does Metric-Driven Development (MDD) utilize combined performance metrics to guide project outcomes, and what are some examples of such metrics in practice?","[""Reward functions embody the core MDD idea: set a clear, measurable goal, and let it guide actions towards success.\n\n## Metric-Driven Development Across Industries: Real-World Examples\n\nMDD isn't just for software. Here's how different fields use it:"", ""By extending Ragas with custom metrics, you can create evaluation frameworks that precisely measure what matters most for your LLM applications, leading to more meaningful improvements and better user experiences.\n\nIn our next post, we'll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows."", '3. **Organization & Governance**: Establishing metadata frameworks, quality control processes, and governance structures that make data discoverable and trustworthy.\n\n4. **Insight Extraction**: Applying analytics techniques from basic reporting to advanced machine learning that convert data into actionable business intelligence.', ""* **Faithfulness**: Measures how well the generated answer aligns with the retrieved context\n* **Answer Relevancy**: Evaluates whether the response correctly addresses the user's query\n* **Context Relevancy**: Assesses if the system retrieves information that's actually needed\n* **Context Precision**: Quantifies how efficiently the system uses retrieved information\n\nThese targeted metrics provide clearer direction than general-purpose evaluations, allowing teams to make precise improvements where they matter most.\nImagine two teams building a new feature for a streaming platform:""]","['<1-hop>\n\n## What Exactly is Metric-Driven Development?\n\nMetric-Driven Development (MDD) is a simple but effective framework where teams:\n\n1. **Define Clear, Measurable Goals:** Set specific numerical targets (e.g., ""Increase user sign-ups by 20% this quarter"").\n2. **Base Decisions on Data:** Rely on evidence and measurements, not just opinions or assumptions.\n3. **Iterate and Learn Quickly:** Continuously measure the impact of changes to see what works and what doesn\'t.\n\nThink of MDD as a **GPS for your project**. Without clear metrics, you\'re driving in the fog, hoping you\'re heading in the right direction. With MDD, you get real-time feedback, ensuring you\'re moving towards your destination efficiently.\n\n## Why Teams Struggle Without Clear Metrics\n\nWithout a metric-driven approach, teams often fall into common traps:', '<2-hop>\n\n* **Metric:** A combined score, e.g., `Points Scored - (Time Taken * Penalty Factor)`.\n * **Impact:** Trains AI opponents that are challenging but fair, balancing speed and skill.\n* **Autonomous Vehicles: Safety & Comfort Score**\n * **Metric:** Combination of factors like smooth acceleration/braking, lane adherence, and deductions for interventions or near-misses.\n * **Impact:** Guides development towards vehicles that are not only safe but also provide a comfortable ride.']","Metric-Driven Development (MDD) utilizes combined performance metrics to set clear, measurable goals that guide actions towards successful project outcomes. By focusing on specific metrics, teams can make informed decisions and improvements based on precise evaluations rather than general assessments.
51
+
52
+ ### Examples of Metrics in MDD:
53
+ - **Faithfulness**: Measures alignment of generated answers with retrieved context.
54
+ - **Answer Relevancy**: Evaluates if the response addresses the user's query correctly.
55
+ - **Context Relevancy**: Assesses if the retrieved information is necessary.
56
+ - **Context Precision**: Quantifies the efficiency of using retrieved information.
57
+
58
+ These metrics help teams identify areas for improvement and ensure that development efforts are aligned with user needs and project goals.
59
+
60
+ For more information, you can explore the following links:
61
+ - [Metric-Driven Development: Make Smarter Decisions, Faster](https://thedataguy.pro/blog/metric-driven-development/)
62
+ - [Advanced Metrics and Customization with Ragas](https://thedataguy.pro/blog/advanced-metrics-and-customization-with-ragas/)
63
+ - [Data is King: Why Your Data Strategy IS Your Business Strategy](https://thedataguy.pro/blog/data-is-king/)","Metric-Driven Development (MDD) utilizes combined performance metrics by defining clear, measurable goals and basing decisions on data rather than assumptions. This approach ensures that teams receive real-time feedback and can iterate quickly to improve outcomes. Examples of combined performance metrics in practice include a score such as 'Points Scored - (Time Taken * Penalty Factor)' to train AI opponents that balance speed and skill, and a 'Safety & Comfort Score' for autonomous vehicles, which combines factors like smooth acceleration, lane adherence, and deductions for interventions or near-misses. These combined metrics help guide development towards achieving specific, balanced objectives.",multi_hop_abstract_query_synthesizer
64
+ "How does Ragas provide specialized evaluation metrics for LLMs, and what steps are involved in creating a custom metric to assess technical accuracy in programming explanations?","[""Reward functions embody the core MDD idea: set a clear, measurable goal, and let it guide actions towards success.\n\n## Metric-Driven Development Across Industries: Real-World Examples\n\nMDD isn't just for software. Here's how different fields use it:"", ""By extending Ragas with custom metrics, you can create evaluation frameworks that precisely measure what matters most for your LLM applications, leading to more meaningful improvements and better user experiences.\n\nIn our next post, we'll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows."", '3. **Organization & Governance**: Establishing metadata frameworks, quality control processes, and governance structures that make data discoverable and trustworthy.\n\n4. **Insight Extraction**: Applying analytics techniques from basic reporting to advanced machine learning that convert data into actionable business intelligence.', '*Have you set up your first Ragas evaluation? What aspects of your LLM application are you most interested in measuring? If you’re facing specific evaluation hurdles, don’t hesitate to [reach out](https://www.linkedin.com/in/muhammadafzaal/)—we’d love to help!*']","[""<1-hop>\n\n## What is Ragas?\n\n[Ragas](https://docs.ragas.io/en/stable/) is an open-source evaluation framework specifically designed for LLM applications, with particular strengths in Retrieval-Augmented Generation (RAG) systems. Unlike traditional NLP evaluation methods, Ragas provides specialized metrics that address the unique challenges of LLM-powered systems.\n\nAt its core, Ragas helps answer crucial questions:\n- Is my application retrieving the right information?\n- Are the responses factually accurate and consistent with the retrieved context?\n- Does the system appropriately address the user's query?\n- How well does my application handle multi-turn conversations?\n\n## Why Evaluate LLM Applications?\n\nLLMs are powerful but imperfect. They can hallucinate facts, misinterpret queries, or generate convincing but incorrect responses. For applications where accuracy and reliability matter—like healthcare, finance, or education—proper evaluation is non-negotiable."", '<2-hop>\n\n## Creating Your First Custom Metric\n\nLet\'s create a custom metric that evaluates technical accuracy in programming explanations:\n\n```python\nfrom dataclasses import dataclass, field\nfrom typing import Dict, Optional, Set\nimport typing as t\n\nfrom ragas.metrics.base import MetricWithLLM, SingleTurnMetric\nfrom ragas.prompt import PydanticPrompt\nfrom ragas.metrics import MetricType, MetricOutputType\nfrom pydantic import BaseModel\n\n# Define input/output models for the prompt\nclass TechnicalAccuracyInput(BaseModel):\n question: str\n context: str\n response: str\n programming_language: str = ""python""\n\nclass TechnicalAccuracyOutput(BaseModel):\n score: float\n feedback: str']","Ragas provides specialized evaluation metrics for LLMs (Large Language Models) by allowing users to extend its framework with custom metrics tailored to specific needs. This customization enables the creation of evaluation frameworks that can measure what matters most for LLM applications, leading to more meaningful improvements and better user experiences.
65
+
66
+ To create a custom metric to assess technical accuracy in programming explanations, the following steps are typically involved:
67
+
68
+ 1. **Define the Metric**: Clearly outline what constitutes technical accuracy in the context of programming explanations.
69
+ 2. **Data Collection**: Gather relevant data that will be used to evaluate the programming explanations.
70
+ 3. **Metric Implementation**: Use Ragas to implement the custom metric, ensuring it aligns with the defined criteria for technical accuracy.
71
+ 4. **Testing and Validation**: Test the metric on sample data to validate its effectiveness and make adjustments as necessary.
72
+ 5. **Integration**: Integrate the custom metric into the evaluation workflow for ongoing assessments.
73
+
74
+ For more detailed information, you can explore the following links:
75
+
76
+ - [Advanced Metrics and Customization with Ragas](https://thedataguy.pro/blog/advanced-metrics-and-customization-with-ragas/)
77
+ - [Basic Evaluation Workflow with Ragas](https://thedataguy.pro/blog/basic-evaluation-workflow-with-ragas/)","Ragas is an open-source evaluation framework specifically designed for LLM applications, offering specialized metrics that address challenges unique to LLM-powered systems, such as ensuring factual accuracy, consistency with retrieved context, and appropriate query handling. To create a custom metric for evaluating technical accuracy in programming explanations, Ragas allows developers to define input and output models (for example, using Pydantic BaseModel classes for technical accuracy input and output), and implement the metric logic using its extensible metric classes. This enables tailored evaluation beyond traditional NLP metrics, supporting the needs of high-stakes LLM applications.",multi_hop_abstract_query_synthesizer
78
+ How do observability best practices contribute to building production-ready AI systems?,"[""By extending Ragas with custom metrics, you can create evaluation frameworks that precisely measure what matters most for your LLM applications, leading to more meaningful improvements and better user experiences.\n\nIn our next post, we'll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows."", ""Reward functions embody the core MDD idea: set a clear, measurable goal, and let it guide actions towards success.\n\n## Metric-Driven Development Across Industries: Real-World Examples\n\nMDD isn't just for software. Here's how different fields use it:"", '3. **Organization & Governance**: Establishing metadata frameworks, quality control processes, and governance structures that make data discoverable and trustworthy.\n\n4. **Insight Extraction**: Applying analytics techniques from basic reporting to advanced machine learning that convert data into actionable business intelligence.', '## Creating Custom Dashboards\n\nBuilding custom dashboards gives you a comprehensive view of your evaluation results. Dashboards can display current performance, trends, and detailed breakdowns of recent evaluations, making it easier to monitor your system and identify areas for improvement.\n\nWith these practices, you can make evaluation an ongoing, automated, and visible part of your development workflow, leading to more reliable and robust RAG systems.\n\n## Best Practices for Observability']","['<1-hop>\n\n## Best Practices for Observability\n\n1. **Define clear thresholds**: Establish performance baselines and alert thresholds for each metric\n2. **Segment evaluations**: Break down results by query type, data source, or other relevant factors\n3. **Historical tracking**: Maintain historical evaluation data to identify trends and regressions\n4. **Correlation analysis**: Link evaluation metrics to user feedback and business outcomes\n5. **Regular benchmarking**: Periodically evaluate against fixed test sets to ensure consistency\n6. **Alert on regressions**: Implement automated alerts when metrics drop below thresholds\n7. **Contextualize metrics**: Include example failures alongside aggregate metrics for better understanding\n\n## Building a Feedback Loop\n\nThe ultimate goal of evaluation is to drive improvements. Establish a feedback loop:', ""<2-hop>\n\n- **Production-ready**: Understanding software engineering best practices helps create AI systems that can operate reliably at scale.\n- **User-focused**: Experience with UX principles ensures AI solutions are designed with actual human users in mind.\n- **Integrated**: Knowledge of enterprise systems makes it easier to connect AI capabilities with existing business processes.\n- **Simplified**: My experience in streamlining complex business processes helps me identify where AI can have the greatest impact through intelligent automation.\n- **Business-oriented**: I understand that AI isn't just about the technology—it's about solving real business problems and creating measurable value.\n- **Practical**: I focus on practical applications that deliver immediate benefits rather than getting caught up in theoretical possibilities.\n\n## What's Next\n\nAs I return to my AI roots, I'm excited to share this journey with you through this blog. 
In the coming months, I plan to write about:""]","Observability best practices contribute to building production-ready AI systems by enabling continuous monitoring and evaluation of system performance. They help in identifying issues, understanding system behavior, and ensuring that the AI models are functioning as intended. By implementing custom dashboards and automated evaluation workflows, developers can gain insights into performance trends and areas for improvement, leading to more reliable and robust AI systems.
79
+
80
+ For more information, you can explore the following resources:
81
+
82
+ - [Integrations and Observability with Ragas](https://thedataguy.pro/blog/integrations-and-observability-with-ragas/)
83
+ - [Advanced Metrics and Customization with Ragas](https://thedataguy.pro/blog/advanced-metrics-and-customization-with-ragas/)","Observability best practices, such as defining clear thresholds, segmenting evaluations, maintaining historical tracking, and alerting on regressions, ensure that AI systems are continuously monitored and improved. These practices are essential for creating production-ready AI systems that can operate reliably at scale, as they enable consistent performance evaluation and rapid response to issues.",multi_hop_abstract_query_synthesizer
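The final rows above describe building a custom technical-accuracy metric from the Pydantic input/output models quoted in their reference contexts. As a rough illustration of how those pieces fit together, here is a minimal sketch that follows the custom-metric pattern from the Ragas documentation; the prompt instruction, required columns, and every class name other than `TechnicalAccuracyInput`/`TechnicalAccuracyOutput` are illustrative assumptions, not the blog's exact implementation.

```python
from dataclasses import dataclass, field
import typing as t

from pydantic import BaseModel
from ragas.metrics import MetricType
from ragas.metrics.base import MetricWithLLM, SingleTurnMetric
from ragas.prompt import PydanticPrompt
from ragas.dataset_schema import SingleTurnSample


# Input/output models for the judge prompt, as quoted in the reference contexts above.
class TechnicalAccuracyInput(BaseModel):
    question: str
    context: str
    response: str
    programming_language: str = "python"


class TechnicalAccuracyOutput(BaseModel):
    score: float
    feedback: str


# Prompt wording is illustrative; the blog's exact instruction text is not shown in this diff.
class TechnicalAccuracyPrompt(PydanticPrompt[TechnicalAccuracyInput, TechnicalAccuracyOutput]):
    instruction = (
        "Rate the technical accuracy of the response to the programming question "
        "on a 0-1 scale and briefly justify the score."
    )
    input_model = TechnicalAccuracyInput
    output_model = TechnicalAccuracyOutput


@dataclass
class TechnicalAccuracy(MetricWithLLM, SingleTurnMetric):
    name: str = "technical_accuracy"
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {
            MetricType.SINGLE_TURN: {"user_input", "retrieved_contexts", "response"}
        }
    )
    evaluation_prompt: PydanticPrompt = field(default_factory=TechnicalAccuracyPrompt)

    async def _single_turn_ascore(self, sample: SingleTurnSample, callbacks) -> float:
        assert self.llm is not None, "A judge LLM must be attached to the metric"
        prompt_input = TechnicalAccuracyInput(
            question=sample.user_input,
            context="\n".join(sample.retrieved_contexts or []),
            response=sample.response,
        )
        output = await self.evaluation_prompt.generate(
            data=prompt_input, llm=self.llm, callbacks=callbacks
        )
        return output.score
```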
evals/rag_eval_result_ft.csv ADDED
@@ -0,0 +1,83 @@
1
+ user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,faithfulness,factual_correctness(mode=f1),answer_relevancy,context_entity_recall,noise_sensitivity(mode=relevant)
2
+ "How are Large Language Models integrated into modern applications, and why is their performance evaluation considered critical according to the context?","[""Reward functions embody the core MDD idea: set a clear, measurable goal, and let it guide actions towards success.\n\n## Metric-Driven Development Across Industries: Real-World Examples\n\nMDD isn't just for software. Here's how different fields use it:"", ""By extending Ragas with custom metrics, you can create evaluation frameworks that precisely measure what matters most for your LLM applications, leading to more meaningful improvements and better user experiences.\n\nIn our next post, we'll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows."", '3. **Organization & Governance**: Establishing metadata frameworks, quality control processes, and governance structures that make data discoverable and trustworthy.\n\n4. **Insight Extraction**: Applying analytics techniques from basic reporting to advanced machine learning that convert data into actionable business intelligence.', ""* **Faithfulness**: Measures how well the generated answer aligns with the retrieved context\n* **Answer Relevancy**: Evaluates whether the response correctly addresses the user's query\n* **Context Relevancy**: Assesses if the system retrieves information that's actually needed\n* **Context Precision**: Quantifies how efficiently the system uses retrieved information\n\nThese targeted metrics provide clearer direction than general-purpose evaluations, allowing teams to make precise improvements where they matter most.\nImagine two teams building a new feature for a streaming platform:""]","['---\ntitle: ""Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications""\ndate: 2025-04-26T18:00:00-06:00\nlayout: blog\ndescription: ""Explore the essential evaluation framework for LLM applications with Ragas. Learn how to assess performance, ensure accuracy, and improve reliability in Retrieval-Augmented Generation systems.""\ncategories: [""AI"", ""RAG"", ""Evaluation"",""Ragas""]\ncoverImage: ""https://images.unsplash.com/photo-1593642634367-d91a135587b5?q=80&w=1770&auto=format&fit=crop&ixlib=rb-4.0.3""\nreadingTime: 7\npublished: true\n---\n\nAs Large Language Models (LLMs) become fundamental components of modern applications, effectively evaluating their performance becomes increasingly critical. Whether you\'re building a question-answering system, a document retrieval tool, or a conversational agent, you need reliable metrics to assess how well your application performs. This is where Ragas steps in.\n\n## What is Ragas?']","Large Language Models (LLMs) are integrated into modern applications by extending frameworks like Ragas with custom metrics, which allows for the creation of evaluation frameworks that measure what matters most for specific applications. This leads to more meaningful improvements and better user experiences.
3
+
4
+ Performance evaluation of LLMs is considered critical because it ensures that the models generate relevant and accurate responses, aligning with user queries and the context provided. Targeted metrics such as faithfulness, answer relevancy, context relevancy, and context precision help teams make precise improvements, enhancing the overall effectiveness of the applications.
5
+
6
+ For more information, you can explore the following links:
7
+
8
+ - [Metric-Driven Development: Make Smarter Decisions, Faster](https://thedataguy.pro/blog/metric-driven-development/)
9
+ - [Advanced Metrics and Customization with Ragas](https://thedataguy.pro/blog/advanced-metrics-and-customization-with-ragas/)
10
+ - [Data is King: Why Your Data Strategy IS Your Business Strategy](https://thedataguy.pro/blog/data-is-king/)","Large Language Models (LLMs) are becoming fundamental components of modern applications, such as question-answering systems, document retrieval tools, and conversational agents. Effectively evaluating their performance is considered increasingly critical to ensure reliable metrics for assessing how well these applications perform.",0.5,0.36363636363636365,,0.9691065259686998,0.0,
11
+ "Howw does Ragas help evalute LLM applikations in the helthcare industree, and why is this importent for ensuring akuracy and reliabilitee?","[""Reward functions embody the core MDD idea: set a clear, measurable goal, and let it guide actions towards success.\n\n## Metric-Driven Development Across Industries: Real-World Examples\n\nMDD isn't just for software. Here's how different fields use it:"", ""By extending Ragas with custom metrics, you can create evaluation frameworks that precisely measure what matters most for your LLM applications, leading to more meaningful improvements and better user experiences.\n\nIn our next post, we'll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows."", '3. **Organization & Governance**: Establishing metadata frameworks, quality control processes, and governance structures that make data discoverable and trustworthy.\n\n4. **Insight Extraction**: Applying analytics techniques from basic reporting to advanced machine learning that convert data into actionable business intelligence.', ""These columns enable comprehensive evaluation by linking each question to its supporting evidence and expected answer, while also providing insight into the diversity and difficulty of the generated queries.\n\n\n## Deep Dive into Test Data Generation\n\nSo you have a collection of documents and want to create a robust evaluation dataset for your RAG system using Ragas. The `TestsetGenerator`'s `generate_with_langchain_docs` method is your starting point. But what exactly happens when you call it? Let's peek under the hood.\n\n**The Goal:** To take raw Langchain `Document` objects and transform them into a structured Ragas `Testset` containing diverse question-answer pairs grounded in those documents.\n\n**The Workflow:**""]","[""## What is Ragas?\n\n[Ragas](https://docs.ragas.io/en/stable/) is an open-source evaluation framework specifically designed for LLM applications, with particular strengths in Retrieval-Augmented Generation (RAG) systems. Unlike traditional NLP evaluation methods, Ragas provides specialized metrics that address the unique challenges of LLM-powered systems.\n\nAt its core, Ragas helps answer crucial questions:\n- Is my application retrieving the right information?\n- Are the responses factually accurate and consistent with the retrieved context?\n- Does the system appropriately address the user's query?\n- How well does my application handle multi-turn conversations?\n\n## Why Evaluate LLM Applications?\n\nLLMs are powerful but imperfect. They can hallucinate facts, misinterpret queries, or generate convincing but incorrect responses. For applications where accuracy and reliability matter—like healthcare, finance, or education—proper evaluation is non-negotiable.""]",I don't know.,"Ragas is an open-source evaluation framework designed for LLM applications, with strengths in Retrieval-Augmented Generation systems. It provides specialized metrics to address challenges unique to LLM-powered systems, such as determining if the application retrieves the right information and if responses are factually accurate and consistent with the retrieved context. Evaluating LLM applications is especially important in healthcare, where accuracy and reliability are critical, because LLMs can hallucinate facts, misinterpret queries, or generate convincing but incorrect responses. 
Proper evaluation with frameworks like Ragas is non-negotiable in healthcare to ensure the system's accuracy and reliability.",0.5,0.0,0.0,0.0,0.33333333277777777,0.0
12
+ What specialized metrics does Ragas provide for evaluating LLM applications?,"[""By extending Ragas with custom metrics, you can create evaluation frameworks that precisely measure what matters most for your LLM applications, leading to more meaningful improvements and better user experiences.\n\nIn our next post, we'll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows."", ""Reward functions embody the core MDD idea: set a clear, measurable goal, and let it guide actions towards success.\n\n## Metric-Driven Development Across Industries: Real-World Examples\n\nMDD isn't just for software. Here's how different fields use it:"", '3. **Organization & Governance**: Establishing metadata frameworks, quality control processes, and governance structures that make data discoverable and trustworthy.\n\n4. **Insight Extraction**: Applying analytics techniques from basic reporting to advanced machine learning that convert data into actionable business intelligence.', ""## Conclusion\n\nEvaluating AI agents requires specialized metrics that go beyond traditional RAG evaluation. Ragas' `agent_goal_accuracy`, `tool_call_accuracy`, and `topic_adherence` provide crucial insights into whether an agent can successfully complete tasks, use tools correctly, and stay within designated boundaries.\n\nBy incorporating these metrics into your evaluation pipeline, you can build more reliable and effective AI agents that truly deliver on the promise of helpful, goal-oriented AI assistants.\n\nIn our next post, we'll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows.""]","[""Evaluation serves several key purposes:\n- **Quality assurance**: Identify and fix issues before they reach users\n- **Performance tracking**: Monitor how changes impact system performance\n- **Benchmarking**: Compare different approaches objectively\n- **Continuous improvement**: Build feedback loops to enhance your application\n\n## Key Features of Ragas\n\n### 🎯 Specialized Metrics\nRagas offers both LLM-based and computational metrics tailored to evaluate different aspects of LLM applications:\n\n- **Faithfulness**: Measures if the response is factually consistent with the retrieved context\n- **Context Relevancy**: Evaluates if the retrieved information is relevant to the query\n- **Answer Relevancy**: Assesses if the response addresses the user's question\n- **Topic Adherence**: Gauges how well multi-turn conversations stay on topic""]","Ragas provides specialized metrics for evaluating LLM applications, including:
13
+
14
+ - **Agent Goal Accuracy**: Measures how well the AI agent achieves its intended goals.
15
+ - **Tool Call Accuracy**: Assesses the correctness of the tools used by the agent.
16
+ - **Topic Adherence**: Evaluates whether the agent stays within the designated topics during interactions.
17
+
18
+ These metrics help in building more reliable and effective AI agents.
19
+
20
+ For more information, you can explore the following links:
21
+ - [Evaluating AI Agents: Beyond Simple Answers with Ragas](https://thedataguy.pro/blog/evaluating-ai-agents-with-ragas/)
22
+ - [Advanced Metrics and Customization with Ragas](https://thedataguy.pro/blog/advanced-metrics-and-customization-with-ragas/)","Ragas offers both LLM-based and computational metrics tailored to evaluate different aspects of LLM applications, including faithfulness, context relevancy, answer relevancy, and topic adherence.",0.0,0.75,0.36,1.0,0.24999999968749997,0.625
23
+ "me wanna know how LangSmith work with Ragas, like is it for test data or what, and how it help me as LLM app builder, can you tell me all about LangSmith from this info?","[""Reward functions embody the core MDD idea: set a clear, measurable goal, and let it guide actions towards success.\n\n## Metric-Driven Development Across Industries: Real-World Examples\n\nMDD isn't just for software. Here's how different fields use it:"", ""### Ready to Elevate Your LLM Applications?\n\nStart exploring Ragas today by visiting the [official documentation](https://docs.ragas.io/en/stable/). Share your thoughts, challenges, or success stories. If you're facing specific evaluation hurdles, don't hesitate to [reach out](https://www.linkedin.com/in/muhammadafzaal/)—we'd love to help!"", ""*How are you evaluating your AI agents? What challenges have you encountered in measuring agent performance? If you're facing specific evaluation hurdles, don't hesitate to [reach out](https://www.linkedin.com/in/muhammadafzaal/)—we'd love to help!*"", ""*How are you evaluating your AI agents? What challenges have you encountered in measuring agent performance? If you're facing specific evaluation hurdles, don't hesitate to [reach out](https://www.linkedin.com/in/muhammadafzaal/)—we'd love to help!*""]","[""### 🧪 Test Data Generation\nCreating high-quality test data is often a bottleneck in evaluation. Ragas helps you generate comprehensive test datasets automatically, saving time and ensuring thorough coverage.\n\n### 🔗 Seamless Integrations\nRagas works with popular LLM frameworks and tools:\n- [LangChain](https://www.langchain.com/)\n- [LlamaIndex](https://www.llamaindex.ai/)\n- [Haystack](https://haystack.deepset.ai/)\n- [OpenAI](https://openai.com/)\n\nObservability platforms \n- [Phoenix](https://phoenix.arize.com/)\n- [LangSmith](https://python.langchain.com/docs/introduction/)\n- [Langfuse](https://www.langfuse.com/)\n\n### 📊 Comprehensive Analysis\nBeyond simple scores, Ragas provides detailed insights into your application's strengths and weaknesses, enabling targeted improvements.\n\n## Getting Started with Ragas\n\nInstalling Ragas is straightforward:\n\n```bash\nuv init && uv add ragas\n```\n\nHere's a simple example of evaluating a response using Ragas:""]",I don't know.,"LangSmith is listed as one of the observability platforms that Ragas works with. The context does not provide details about how LangSmith specifically functions, but it shows that Ragas integrates with LangSmith to support observability in LLM-powered systems.",0.0,1.0,0.0,0.0,0.249999999375,0.25
24
+ How do I use the OPENAI API key when initializing an LLM for evaluation with Ragas?,"[""By extending Ragas with custom metrics, you can create evaluation frameworks that precisely measure what matters most for your LLM applications, leading to more meaningful improvements and better user experiences.\n\nIn our next post, we'll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows."", ""Reward functions embody the core MDD idea: set a clear, measurable goal, and let it guide actions towards success.\n\n## Metric-Driven Development Across Industries: Real-World Examples\n\nMDD isn't just for software. Here's how different fields use it:"", '## Implementing Agent Evaluation in Practice\n\nLet\'s look at a practical example of evaluating an AI agent using these metrics:\n\n```python\nfrom ragas.metrics import AgentGoalAccuracyWithoutReference, ToolCallAccuracy, TopicAdherenceScore\nfrom ragas.evaluation import EvaluationDataset\nfrom ragas.dataset_schema import MultiTurnSample\nfrom langchain_openai import ChatOpenAI\nfrom ragas.llms import LangchainLLMWrapper\n\n# Initialize the LLM\nevaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=""gpt-4o""))', '* **`conditional`:** Creates questions with ""if/then"" clauses based on information in the graph.\n* **Generation Process:** `self.generate()` calculates how many questions of each type to create based on the `testset_size` and the distribution probabilities. It then uses an `Executor` to run the appropriate synthesizers, generating the final `TestsetSample` objects that make up your evaluation dataset.']","['## Getting Started with Ragas\n\nInstalling Ragas is straightforward:\n\n```bash\nuv init && uv add ragas\n```\n\nHere\'s a simple example of evaluating a response using Ragas:\n\n```python\nfrom ragas.metrics import Faithfulness\nfrom ragas.evaluation import EvaluationDataset\nfrom ragas.dataset_schema import SingleTurnSample\nfrom langchain_openai import ChatOpenAI\nfrom ragas.llms import LangchainLLMWrapper\nfrom langchain_openai import ChatOpenAI\n\n# Initialize the LLM, you are going to new OPENAI API key\nevaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=""gpt-4o"")) \n\n# Your evaluation data\ntest_data = {\n ""user_input"": ""What is the capital of France?"",\n ""retrieved_contexts"": [""Paris is the capital and most populous city of France.""],\n ""response"": ""The capital of France is Paris.""\n}\n\n# Create a sample\nsample = SingleTurnSample(**test_data) # Unpack the dictionary into the constructor']","To use the OPENAI API key when initializing an LLM for evaluation with Ragas, you typically set the API key in your environment or directly in your code before initializing the LLM. Here's a basic example of how to do this:
25
+
26
+ ```python
27
+ import os
28
+ from langchain_openai import ChatOpenAI
29
+
30
+ # Set your OpenAI API key
31
+ os.environ[""OPENAI_API_KEY""] = ""your_api_key_here""
32
+
33
+ # Initialize the LLM
34
+ evaluator_llm = ChatOpenAI(model=""gpt-4o"")
35
+ ```
36
+
37
+ Make sure to replace `""your_api_key_here""` with your actual OpenAI API key.
38
+
39
+ For more detailed information on evaluating AI agents with Ragas, you can refer to the following link:
40
+
41
+ - [Evaluating AI Agents: Beyond Simple Answers with Ragas](https://thedataguy.pro/blog/evaluating-ai-agents-with-ragas/)","To use the OPENAI API key when initializing an LLM for evaluation with Ragas, you need to provide your OPENAI API key during the initialization of the ChatOpenAI model, as shown in the example: evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=""gpt-4o"")).",0.0,,0.25,0.9947663251451272,0.6666666655555555,0.0
42
+ "How does synthetic data generation contribute to the importance of data quantity in improving RAG evaluation, and what paradigm shift does this represent according to Peter Norvig's perspective?","[""Reward functions embody the core MDD idea: set a clear, measurable goal, and let it guide actions towards success.\n\n## Metric-Driven Development Across Industries: Real-World Examples\n\nMDD isn't just for software. Here's how different fields use it:"", '3. **Organization & Governance**: Establishing metadata frameworks, quality control processes, and governance structures that make data discoverable and trustworthy.\n\n4. **Insight Extraction**: Applying analytics techniques from basic reporting to advanced machine learning that convert data into actionable business intelligence.', ""* **Faithfulness**: Measures how well the generated answer aligns with the retrieved context\n* **Answer Relevancy**: Evaluates whether the response correctly addresses the user's query\n* **Context Relevancy**: Assesses if the system retrieves information that's actually needed\n* **Context Precision**: Quantifies how efficiently the system uses retrieved information\n\nThese targeted metrics provide clearer direction than general-purpose evaluations, allowing teams to make precise improvements where they matter most.\nImagine two teams building a new feature for a streaming platform:"", 'Essentially, the default transformations build a knowledge graph populated with embedded, filtered document chunks and corresponding simple, extractive question-answer pairs.\n\n**Spotlight: Query Synthesizers (via `self.generate()` and `default_query_distribution`)**\n\nThe `self.generate()` method, called by `generate_with_langchain_docs`, is responsible for taking the foundational graph and creating the final, potentially complex, test questions using **Query Synthesizers** (also referred to as ""evolutions"" or ""scenarios"").']","['<1-hop>\n\n## Why and How to Generate Synthetic Data for RAG Evaluation\n\nIn the world of Retrieval-Augmented Generation (RAG) and LLM-powered applications, **synthetic data generation** is a game-changer for rapid iteration and robust evaluation. This blog post explains why synthetic data is essential, and how you can generate it for your own RAG pipelines—using modern tools like [RAGAS](https://github.com/explodinggradients/ragas) and [LangSmith](https://smith.langchain.com/).\n\n---\n\n### Why Generate Synthetic Data?\n\n1. **Early Signal, Fast Iteration** \n Real-world data is often scarce or expensive to label. Synthetic data lets you quickly create test sets that mimic real user queries and contexts, so you can evaluate your system’s performance before deploying to production.\n\n2. **Controlled Complexity** \n You can design synthetic datasets to cover edge cases, multi-hop reasoning, or specific knowledge domains—ensuring your RAG system is robust, not just good at the “easy” cases.', '<2-hop>\n\n## The Origin of ""Data is King""\n\nPeter Norvig famously stated, ""We don\'t have better algorithms. We just have more data."" This statement emerged during a time when Google\'s approach to machine translation was yielding surprisingly effective results not through algorithmic innovations, but through the sheer volume of multilingual data they had amassed. \n\nThis perspective represented a paradigm shift. Prior to this, the field had largely focused on crafting ever more sophisticated algorithms, with the assumption that smarter code would yield better results. 
Norvig\'s insight suggested something different: even relatively simple algorithms could outperform more sophisticated ones when trained on sufficiently large datasets.\n\n## The Business Imperative of Data Ownership']",I don't know.,"Synthetic data generation enables rapid creation of test sets that mimic real user queries and contexts, allowing for early evaluation and iteration of RAG systems even when real-world data is scarce or expensive to label. This approach increases the quantity of data available for evaluation, supporting robust system development. According to Peter Norvig's perspective, as described in the context, the importance of data quantity represents a paradigm shift: rather than relying solely on more sophisticated algorithms, having more data—even with simpler algorithms—can lead to better results. Thus, synthetic data generation aligns with this shift by providing the large datasets necessary to improve system performance.",,0.0,0.0,0.0,0.0,0.0
43
+ "How does Ragas support the evaluation of both LLM applications and AI agents, and what specialized metrics are introduced for evaluating AI agents?","[""By extending Ragas with custom metrics, you can create evaluation frameworks that precisely measure what matters most for your LLM applications, leading to more meaningful improvements and better user experiences.\n\nIn our next post, we'll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows."", ""Reward functions embody the core MDD idea: set a clear, measurable goal, and let it guide actions towards success.\n\n## Metric-Driven Development Across Industries: Real-World Examples\n\nMDD isn't just for software. Here's how different fields use it:"", '3. **Organization & Governance**: Establishing metadata frameworks, quality control processes, and governance structures that make data discoverable and trustworthy.\n\n4. **Insight Extraction**: Applying analytics techniques from basic reporting to advanced machine learning that convert data into actionable business intelligence.', ""## Conclusion\n\nEvaluating AI agents requires specialized metrics that go beyond traditional RAG evaluation. Ragas' `agent_goal_accuracy`, `tool_call_accuracy`, and `topic_adherence` provide crucial insights into whether an agent can successfully complete tasks, use tools correctly, and stay within designated boundaries.\n\nBy incorporating these metrics into your evaluation pipeline, you can build more reliable and effective AI agents that truly deliver on the promise of helpful, goal-oriented AI assistants.\n\nIn our next post, we'll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows.""]","['<1-hop>\n\n# Create a sample\nsample = SingleTurnSample(**test_data) # Unpack the dictionary into the constructor\n\n# Create metric\nfaithfulness = Faithfulness(llm=evaluator_llm)\n# Calculate the score\nresult = await faithfulness.single_turn_ascore(sample)\nprint(f""Faithfulness score: {result}"")\n```\n\n> 💡 **Try it yourself:** \n> Explore the hands-on notebook for this workflow: \n> [01_Introduction_to_Ragas](https://github.com/mafzaal/intro-to-ragas/blob/master/01_Introduction_to_Ragas.ipynb)\n\n## What\'s Coming in This Blog Series\n\nThis introduction is just the beginning. 
In the upcoming posts, we\'ll dive deeper into all aspects of evaluating LLM applications with Ragas:\n\n**[Part 2: Basic Evaluation Workflow](/blog/basic-evaluation-workflow-with-ragas/)** \nWe\'ll explore each metric in detail, explaining when and how to use them effectively.', ""<2-hop>\n\nIn our next post, we'll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows.\n\n---\n \n**[Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications](/blog/introduction-to-ragas/)** \n**[Part 2: Basic Evaluation Workflow](/blog/basic-evaluation-workflow-with-ragas/)** \n**[Part 3: Evaluating RAG Systems with Ragas](/blog/evaluating-rag-systems-with-ragas/)** \n**[Part 4: Test Data Generation](/blog/generating-test-data-with-ragas)** \n**Part 5: Advanced Evaluation Techniques — _You are here_** \n*Next up in the series:* \n**[Part 6: Evaluating AI Agents](/blog/evaluating-ai-agents-with-ragas/)** \n**[Part 7: Integrations and Observability](/blog/integrations-and-observability-with-ragas/)** \n**[Part 8: Building Feedback Loops](/blog/building-feedback-loops-with-ragas/)**"", '<3-hop>\n\n---\ntitle: ""Part 6: Evaluating AI Agents: Beyond Simple Answers with Ragas""\ndate: 2025-04-28T06:00:00-06:00\nlayout: blog\ndescription: ""Learn how to evaluate complex AI agents using Ragas\' specialized metrics for goal accuracy, tool call accuracy, and topic adherence to build more reliable and effective agent-based applications.""\ncategories: [""AI"", ""Agents"", ""Evaluation"", ""Ragas"", ""LLM""]\ncoverImage: ""/images/ai_agent_evaluation.png"" \nreadingTime: 8\npublished: true\n---\n\nIn our previous posts, we\'ve explored how Ragas evaluates RAG systems and enables custom metrics for specialized applications. As LLMs evolve beyond simple question-answering to become powerful AI agents, evaluation needs have grown more sophisticated too. In this post, we\'ll explore Ragas\' specialized metrics for evaluating AI agents that engage in multi-turn interactions, use tools, and work toward specific goals.\n\n## The Challenge of Evaluating AI Agents']","Ragas supports the evaluation of LLM applications by allowing the extension of custom metrics, which helps create evaluation frameworks that measure what matters most for these applications. This leads to more meaningful improvements and better user experiences.
44
+
45
+ For evaluating AI agents, Ragas introduces specialized metrics such as `agent_goal_accuracy`, `tool_call_accuracy`, and `topic_adherence`. These metrics provide insights into an agent's ability to complete tasks, use tools correctly, and stay within designated boundaries, enhancing the reliability and effectiveness of AI agents.
46
+
47
+ For more information, you can explore the following links:
48
+ - [Advanced Metrics and Customization with Ragas](https://thedataguy.pro/blog/advanced-metrics-and-customization-with-ragas/)
49
+ - [Evaluating AI Agents: Beyond Simple Answers with Ragas](https://thedataguy.pro/blog/evaluating-ai-agents-with-ragas/)","Ragas supports the evaluation of LLM applications by providing metrics such as faithfulness, which can be applied to single-turn samples as shown in the introductory workflow. For AI agents, Ragas introduces specialized metrics designed to assess more complex behaviors, including goal accuracy, tool call accuracy, and topic adherence. These metrics enable the evaluation of AI agents that perform multi-turn interactions, utilize tools, and work toward specific goals, thus addressing the advanced requirements of evaluating agent-based applications.",,,,0.9734328051066976,0.29999999969999996,0.36363636363636365
50
+ "How does Metric-Driven Development (MDD) utilize combined performance metrics to guide project outcomes, and what are some examples of such metrics in practice?","[""Reward functions embody the core MDD idea: set a clear, measurable goal, and let it guide actions towards success.\n\n## Metric-Driven Development Across Industries: Real-World Examples\n\nMDD isn't just for software. Here's how different fields use it:"", ""By extending Ragas with custom metrics, you can create evaluation frameworks that precisely measure what matters most for your LLM applications, leading to more meaningful improvements and better user experiences.\n\nIn our next post, we'll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows."", '3. **Organization & Governance**: Establishing metadata frameworks, quality control processes, and governance structures that make data discoverable and trustworthy.\n\n4. **Insight Extraction**: Applying analytics techniques from basic reporting to advanced machine learning that convert data into actionable business intelligence.', ""* **Faithfulness**: Measures how well the generated answer aligns with the retrieved context\n* **Answer Relevancy**: Evaluates whether the response correctly addresses the user's query\n* **Context Relevancy**: Assesses if the system retrieves information that's actually needed\n* **Context Precision**: Quantifies how efficiently the system uses retrieved information\n\nThese targeted metrics provide clearer direction than general-purpose evaluations, allowing teams to make precise improvements where they matter most.\nImagine two teams building a new feature for a streaming platform:""]","['<1-hop>\n\n## What Exactly is Metric-Driven Development?\n\nMetric-Driven Development (MDD) is a simple but effective framework where teams:\n\n1. **Define Clear, Measurable Goals:** Set specific numerical targets (e.g., ""Increase user sign-ups by 20% this quarter"").\n2. **Base Decisions on Data:** Rely on evidence and measurements, not just opinions or assumptions.\n3. **Iterate and Learn Quickly:** Continuously measure the impact of changes to see what works and what doesn\'t.\n\nThink of MDD as a **GPS for your project**. Without clear metrics, you\'re driving in the fog, hoping you\'re heading in the right direction. With MDD, you get real-time feedback, ensuring you\'re moving towards your destination efficiently.\n\n## Why Teams Struggle Without Clear Metrics\n\nWithout a metric-driven approach, teams often fall into common traps:', '<2-hop>\n\n* **Metric:** A combined score, e.g., `Points Scored - (Time Taken * Penalty Factor)`.\n * **Impact:** Trains AI opponents that are challenging but fair, balancing speed and skill.\n* **Autonomous Vehicles: Safety & Comfort Score**\n * **Metric:** Combination of factors like smooth acceleration/braking, lane adherence, and deductions for interventions or near-misses.\n * **Impact:** Guides development towards vehicles that are not only safe but also provide a comfortable ride.']","Metric-Driven Development (MDD) utilizes combined performance metrics to set clear, measurable goals that guide actions towards successful project outcomes. By focusing on specific metrics, teams can make informed decisions and improvements based on precise evaluations rather than general assessments.
51
+
52
+ ### Examples of Metrics in MDD:
53
+ - **Faithfulness**: Measures alignment of generated answers with retrieved context.
54
+ - **Answer Relevancy**: Evaluates if the response addresses the user's query correctly.
55
+ - **Context Relevancy**: Assesses if the retrieved information is necessary.
56
+ - **Context Precision**: Quantifies the efficiency of using retrieved information.
57
+
58
+ These metrics help teams identify areas for improvement and ensure that development efforts are aligned with user needs and project goals.
59
+
60
+ For more information, you can explore the following links:
61
+ - [Metric-Driven Development: Make Smarter Decisions, Faster](https://thedataguy.pro/blog/metric-driven-development/)
62
+ - [Advanced Metrics and Customization with Ragas](https://thedataguy.pro/blog/advanced-metrics-and-customization-with-ragas/)
63
+ - [Data is King: Why Your Data Strategy IS Your Business Strategy](https://thedataguy.pro/blog/data-is-king/)","Metric-Driven Development (MDD) utilizes combined performance metrics by defining clear, measurable goals and basing decisions on data rather than assumptions. This approach ensures that teams receive real-time feedback and can iterate quickly to improve outcomes. Examples of combined performance metrics in practice include a score such as 'Points Scored - (Time Taken * Penalty Factor)' to train AI opponents that balance speed and skill, and a 'Safety & Comfort Score' for autonomous vehicles, which combines factors like smooth acceleration, lane adherence, and deductions for interventions or near-misses. These combined metrics help guide development towards achieving specific, balanced objectives.",,,0.33,0.9642023245827662,0.24999999968749997,0.5
64
+ "How does Ragas provide specialized evaluation metrics for LLMs, and what steps are involved in creating a custom metric to assess technical accuracy in programming explanations?","[""Reward functions embody the core MDD idea: set a clear, measurable goal, and let it guide actions towards success.\n\n## Metric-Driven Development Across Industries: Real-World Examples\n\nMDD isn't just for software. Here's how different fields use it:"", ""By extending Ragas with custom metrics, you can create evaluation frameworks that precisely measure what matters most for your LLM applications, leading to more meaningful improvements and better user experiences.\n\nIn our next post, we'll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows."", '3. **Organization & Governance**: Establishing metadata frameworks, quality control processes, and governance structures that make data discoverable and trustworthy.\n\n4. **Insight Extraction**: Applying analytics techniques from basic reporting to advanced machine learning that convert data into actionable business intelligence.', '*Have you set up your first Ragas evaluation? What aspects of your LLM application are you most interested in measuring? If you’re facing specific evaluation hurdles, don’t hesitate to [reach out](https://www.linkedin.com/in/muhammadafzaal/)—we’d love to help!*']","[""<1-hop>\n\n## What is Ragas?\n\n[Ragas](https://docs.ragas.io/en/stable/) is an open-source evaluation framework specifically designed for LLM applications, with particular strengths in Retrieval-Augmented Generation (RAG) systems. Unlike traditional NLP evaluation methods, Ragas provides specialized metrics that address the unique challenges of LLM-powered systems.\n\nAt its core, Ragas helps answer crucial questions:\n- Is my application retrieving the right information?\n- Are the responses factually accurate and consistent with the retrieved context?\n- Does the system appropriately address the user's query?\n- How well does my application handle multi-turn conversations?\n\n## Why Evaluate LLM Applications?\n\nLLMs are powerful but imperfect. They can hallucinate facts, misinterpret queries, or generate convincing but incorrect responses. For applications where accuracy and reliability matter—like healthcare, finance, or education—proper evaluation is non-negotiable."", '<2-hop>\n\n## Creating Your First Custom Metric\n\nLet\'s create a custom metric that evaluates technical accuracy in programming explanations:\n\n```python\nfrom dataclasses import dataclass, field\nfrom typing import Dict, Optional, Set\nimport typing as t\n\nfrom ragas.metrics.base import MetricWithLLM, SingleTurnMetric\nfrom ragas.prompt import PydanticPrompt\nfrom ragas.metrics import MetricType, MetricOutputType\nfrom pydantic import BaseModel\n\n# Define input/output models for the prompt\nclass TechnicalAccuracyInput(BaseModel):\n question: str\n context: str\n response: str\n programming_language: str = ""python""\n\nclass TechnicalAccuracyOutput(BaseModel):\n score: float\n feedback: str']","Ragas provides specialized evaluation metrics for LLMs (Large Language Models) by allowing users to extend its framework with custom metrics tailored to specific needs. This customization enables the creation of evaluation frameworks that can measure what matters most for LLM applications, leading to more meaningful improvements and better user experiences.
65
+
66
+ To create a custom metric to assess technical accuracy in programming explanations, the following steps are typically involved:
67
+
68
+ 1. **Define the Metric**: Clearly outline what constitutes technical accuracy in the context of programming explanations.
69
+ 2. **Data Collection**: Gather relevant data that will be used to evaluate the programming explanations.
70
+ 3. **Metric Implementation**: Use Ragas to implement the custom metric, ensuring it aligns with the defined criteria for technical accuracy.
71
+ 4. **Testing and Validation**: Test the metric on sample data to validate its effectiveness and make adjustments as necessary.
72
+ 5. **Integration**: Integrate the custom metric into the evaluation workflow for ongoing assessments.
73
+
74
+ For more detailed information, you can explore the following links:
75
+
76
+ - [Advanced Metrics and Customization with Ragas](https://thedataguy.pro/blog/advanced-metrics-and-customization-with-ragas/)
77
+ - [Basic Evaluation Workflow with Ragas](https://thedataguy.pro/blog/basic-evaluation-workflow-with-ragas/)","Ragas is an open-source evaluation framework specifically designed for LLM applications, offering specialized metrics that address challenges unique to LLM-powered systems, such as ensuring factual accuracy, consistency with retrieved context, and appropriate query handling. To create a custom metric for evaluating technical accuracy in programming explanations, Ragas allows developers to define input and output models (for example, using Pydantic BaseModel classes for technical accuracy input and output), and implement the metric logic using its extensible metric classes. This enables tailored evaluation beyond traditional NLP metrics, supporting the needs of high-stakes LLM applications.",,,,0.9471243648744894,0.124999999921875,0.09090909090909091
78
+ How do observability best practices contribute to building production-ready AI systems?,"[""By extending Ragas with custom metrics, you can create evaluation frameworks that precisely measure what matters most for your LLM applications, leading to more meaningful improvements and better user experiences.\n\nIn our next post, we'll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows."", ""Reward functions embody the core MDD idea: set a clear, measurable goal, and let it guide actions towards success.\n\n## Metric-Driven Development Across Industries: Real-World Examples\n\nMDD isn't just for software. Here's how different fields use it:"", '3. **Organization & Governance**: Establishing metadata frameworks, quality control processes, and governance structures that make data discoverable and trustworthy.\n\n4. **Insight Extraction**: Applying analytics techniques from basic reporting to advanced machine learning that convert data into actionable business intelligence.', '## Creating Custom Dashboards\n\nBuilding custom dashboards gives you a comprehensive view of your evaluation results. Dashboards can display current performance, trends, and detailed breakdowns of recent evaluations, making it easier to monitor your system and identify areas for improvement.\n\nWith these practices, you can make evaluation an ongoing, automated, and visible part of your development workflow, leading to more reliable and robust RAG systems.\n\n## Best Practices for Observability']","['<1-hop>\n\n## Best Practices for Observability\n\n1. **Define clear thresholds**: Establish performance baselines and alert thresholds for each metric\n2. **Segment evaluations**: Break down results by query type, data source, or other relevant factors\n3. **Historical tracking**: Maintain historical evaluation data to identify trends and regressions\n4. **Correlation analysis**: Link evaluation metrics to user feedback and business outcomes\n5. **Regular benchmarking**: Periodically evaluate against fixed test sets to ensure consistency\n6. **Alert on regressions**: Implement automated alerts when metrics drop below thresholds\n7. **Contextualize metrics**: Include example failures alongside aggregate metrics for better understanding\n\n## Building a Feedback Loop\n\nThe ultimate goal of evaluation is to drive improvements. Establish a feedback loop:', ""<2-hop>\n\n- **Production-ready**: Understanding software engineering best practices helps create AI systems that can operate reliably at scale.\n- **User-focused**: Experience with UX principles ensures AI solutions are designed with actual human users in mind.\n- **Integrated**: Knowledge of enterprise systems makes it easier to connect AI capabilities with existing business processes.\n- **Simplified**: My experience in streamlining complex business processes helps me identify where AI can have the greatest impact through intelligent automation.\n- **Business-oriented**: I understand that AI isn't just about the technology—it's about solving real business problems and creating measurable value.\n- **Practical**: I focus on practical applications that deliver immediate benefits rather than getting caught up in theoretical possibilities.\n\n## What's Next\n\nAs I return to my AI roots, I'm excited to share this journey with you through this blog. 
In the coming months, I plan to write about:""]","Observability best practices contribute to building production-ready AI systems by enabling continuous monitoring and evaluation of system performance. They help in identifying issues, understanding system behavior, and ensuring that the AI models are functioning as intended. By implementing custom dashboards and automated evaluation workflows, developers can gain insights into performance trends and areas for improvement, leading to more reliable and robust AI systems.
79
+
80
+ For more information, you can explore the following resources:
81
+
82
+ - [Integrations and Observability with Ragas](https://thedataguy.pro/blog/integrations-and-observability-with-ragas/)
83
+ - [Advanced Metrics and Customization with Ragas](https://thedataguy.pro/blog/advanced-metrics-and-customization-with-ragas/)","Observability best practices, such as defining clear thresholds, segmenting evaluations, maintaining historical tracking, and alerting on regressions, ensure that AI systems are continuously monitored and improved. These practices are essential for creating production-ready AI systems that can operate reliably at scale, as they enable consistent performance evaluation and rapid response to issues.",0.5,0.5454545454545454,0.46,0.9999999999999997,0.0,0.0
py-src/app.py CHANGED
@@ -9,26 +9,13 @@ from dotenv import load_dotenv
 load_dotenv()
 import pipeline
 #build vector store
-print("=== Blog Data Update ===")
-pipeline.create_vector_database(force_recreate=True)
+print("=== create vector db ===")
+pipeline.create_vector_database(force_recreate=True,save_stats=False,use_chunking=True)
 print("========================")

 import chainlit as cl
-from langchain.prompts import ChatPromptTemplate
-from langchain.schema.runnable import RunnablePassthrough
-from langchain_openai.chat_models import ChatOpenAI
-from langchain_huggingface import HuggingFaceEmbeddings
-from langchain_qdrant import QdrantVectorStore
-from qdrant_client import QdrantClient
-from qdrant_client.http.models import Distance, VectorParams
-from lets_talk.config import LLM_MODEL, LLM_TEMPERATURE
-import lets_talk.utils.blog as blog
 from lets_talk.agent import build_agent,parse_output

-
-
-
-
 tdg_agent = build_agent()

py-src/notebooks/07_Fine_Tune_Eval.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
py-src/pipeline.py CHANGED
@@ -94,7 +94,7 @@ def save_stats(stats, output_dir="./stats", ci_mode=False):
     return filename, basic_stats

 def create_vector_database(data_dir=DATA_DIR, storage_path=VECTOR_STORAGE_PATH,
-                            force_recreate=False, output_dir="./stats", ci_mode=False):
+                            force_recreate=False, output_dir="./stats", ci_mode=False, use_chunking=True, save_stats=True):
     """
     Create or update the vector database with blog documents.

@@ -114,12 +114,22 @@ def create_vector_database(data_dir=DATA_DIR, storage_path=VECTOR_STORAGE_PATH,
     documents = blog.load_blog_posts(data_dir)
     documents = blog.update_document_metadata(documents)

+
     # Get stats
     stats = blog.get_document_stats(documents)
     blog.display_document_stats(stats)
-
+
     # Save stats for tracking
-    stats_file, stats_content = save_stats(stats, output_dir=output_dir, ci_mode=ci_mode)
+    stats_file = None
+    stats_content = None
+    if save_stats:
+        stats_file, stats_content = save_stats(stats, output_dir=output_dir, ci_mode=ci_mode)
+
+    if use_chunking:
+        logger.info("Chunking documents...")
+        documents = blog.split_documents(documents)
+
+

     create_vector_store = (not Path.exists(Path(storage_path))) or force_recreate

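The pipeline.py change threads two new flags through `create_vector_database`: `use_chunking`, which splits the loaded posts with `blog.split_documents` before indexing, and `save_stats`, which makes the stats snapshot optional. Below is a minimal usage sketch mirroring the call now made in py-src/app.py. One caveat worth flagging: inside the function the boolean `save_stats` parameter shadows the module-level `save_stats()` helper, so the `if save_stats:` branch presumably relies on a rename or alias not shown in this hunk.

```python
# Minimal sketch of driving the updated pipeline entry point; mirrors py-src/app.py.
import pipeline

# Rebuild the vector store from scratch, chunking posts and skipping the stats snapshot.
pipeline.create_vector_database(
    force_recreate=True,
    use_chunking=True,
    save_stats=False,
)
```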