mafzaal committed on
Commit
8825f6e
1 Parent(s): 25160b5

Refactor code structure for improved readability and maintainability

evals/rag_eval.csv ADDED
The diff for this file is too large to render. See raw diff
 
evals/rag_eval_result.csv ADDED
The diff for this file is too large to render. See raw diff
 
evals/testset.csv ADDED
@@ -0,0 +1,13 @@
+ user_input,reference_contexts,reference,synthesizer_name
+ Does Ragas support integration with Langfuse?,"['title: ""Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications"" date: 2025-04-26T18:00:00-06:00 layout: blog description: ""Explore the essential evaluation framework for LLM applications with Ragas. Learn how to assess performance, ensure accuracy, and improve reliability in Retrieval-Augmented Generation systems."" categories: [""AI"", ""RAG"", ""Evaluation"",""Ragas""] coverImage: ""https://images.unsplash.com/photo-1593642634367-d91a135587b5?q=80&w=1770&auto=format&fit=crop&ixlib=rb-4.0.3"" readingTime: 7 published: true As Large Language Models (LLMs) become fundamental components of modern applications, effectively evaluating their performance becomes increasingly critical. Whether you\'re building a question-answering system, a document retrieval tool, or a conversational agent, you need reliable metrics to assess how well your application performs. This is where Ragas steps in. What is Ragas? Ragas is an open-source evaluation framework specifically designed for LLM applications, with particular strengths in Retrieval-Augmented Generation (RAG) systems. Unlike traditional NLP evaluation methods, Ragas provides specialized metrics that address the unique challenges of LLM-powered systems. At its core, Ragas helps answer crucial questions: - Is my application retrieving the right information? - Are the responses factually accurate and consistent with the retrieved context? - Does the system appropriately address the user\'s query? - How well does my application handle multi-turn conversations? Why Evaluate LLM Applications? LLMs are powerful but imperfect. They can hallucinate facts, misinterpret queries, or generate convincing but incorrect responses. For applications where accuracy and reliability matter—like healthcare, finance, or education—proper evaluation is non-negotiable. Evaluation serves several key purposes: - Quality assurance: Identify and fix issues before they reach users - Performance tracking: Monitor how changes impact system performance - Benchmarking: Compare different approaches objectively - Continuous improvement: Build feedback loops to enhance your application Key Features of Ragas 🎯 Specialized Metrics Ragas offers both LLM-based and computational metrics tailored to evaluate different aspects of LLM applications: Faithfulness: Measures if the response is factually consistent with the retrieved context Context Relevancy: Evaluates if the retrieved information is relevant to the query Answer Relevancy: Assesses if the response addresses the user\'s question Topic Adherence: Gauges how well multi-turn conversations stay on topic 🧪 Test Data Generation Creating high-quality test data is often a bottleneck in evaluation. Ragas helps you generate comprehensive test datasets automatically, saving time and ensuring thorough coverage. 🔗 Seamless Integrations Ragas works with popular LLM frameworks and tools: - LangChain - LlamaIndex - Haystack - OpenAI Observability platforms - Phoenix - LangSmith - Langfuse 📊 Comprehensive Analysis Beyond simple scores, Ragas provides detailed insights into your application\'s strengths and weaknesses, enabling targeted improvements. 
Getting Started with Ragas Installing Ragas is straightforward: bash uv init && uv add ragas Here\'s a simple example of evaluating a response using Ragas: ```python from ragas.metrics import Faithfulness from ragas.evaluation import EvaluationDataset from ragas.dataset_schema import SingleTurnSample from langchain_openai import ChatOpenAI from ragas.llms import LangchainLLMWrapper from langchain_openai import ChatOpenAI Initialize the LLM, you are going to new OPENAI API key evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=""gpt-4o"")) Your evaluation data test_data = { ""user_input"": ""What is the capital of France?"", ""retrieved_contexts"": [""Paris is the capital and most populous city of France.""], ""response"": ""The capital of France is Paris."" } Create a sample sample = SingleTurnSample(**test_data) # Unpack the dictionary into the constructor Create metric faithfulness = Faithfulness(llm=evaluator_llm) Calculate the score result = await faithfulness.single_turn_ascore(sample) print(f""Faithfulness score: {result}"") ``` 💡 Try it yourself: Explore the hands-on notebook for this workflow: 01_Introduction_to_Ragas']","Yes, Ragas works with observability platforms such as Langfuse.",single_hop_specifc_query_synthesizer
+ what Part 8: Building Feedback Loops do for LLM app devs?,"[""What's Coming in This Blog Series This introduction is just the beginning. In the upcoming posts, we'll dive deeper into all aspects of evaluating LLM applications with Ragas: Part 2: Basic Evaluation Workflow We'll explore each metric in detail, explaining when and how to use them effectively. Part 3: Evaluating RAG Systems Learn specialized techniques for evaluating retrieval-augmented generation systems, including context precision, recall, and relevance. Part 4: Test Data Generation Discover how to create high-quality test datasets that thoroughly exercise your application's capabilities. Part 5: Advanced Evaluation Techniques Go beyond basic metrics with custom evaluations, multi-aspect analysis, and domain-specific assessments. Part 6: Evaluating AI Agents Learn how to evaluate complex AI agents that engage in multi-turn interactions, use tools, and work toward specific goals. Part 7: Integrations and Observability Connect Ragas with your existing tools and platforms for streamlined evaluation workflows. Part 8: Building Feedback Loops Learn how to implement feedback loops that drive continuous improvement in your LLM applications. Transform evaluation insights into concrete improvements for your LLM applications. Conclusion In a world increasingly powered by LLMs, robust evaluation is the difference between reliable applications and unpredictable ones. Ragas provides the tools you need to confidently assess and improve your LLM applications. Ready to Elevate Your LLM Applications? Start exploring Ragas today by visiting the official documentation. Share your thoughts, challenges, or success stories. If you're facing specific evaluation hurdles, don't hesitate to reach out—we'd love to help!""]",Part 8: Building Feedback Loops show how to implement feedback loops that drive continuous improvement in LLM applications and how to turn evaluation insights into concrete improvements for LLM applications.,single_hop_specifc_query_synthesizer
+ How does Ragas assist with Evaluation of RAG systems?,"['title: ""Part 4: Generating Test Data with Ragas"" date: 2025-04-27T16:00:00-06:00 layout: blog description: ""Discover how to generate robust test datasets for evaluating Retrieval-Augmented Generation systems using Ragas, including document-based, domain-specific, and adversarial test generation techniques."" categories: [""AI"", ""RAG"", ""Evaluation"", ""Ragas"",""Data""] coverImage: ""/images/generating_test_data.png"" readingTime: 14 published: true In our previous post, we explored how to comprehensively evaluate RAG systems using specialized metrics. However, even the best evaluation framework requires high-quality test data to yield meaningful insights. In this post, we\'ll dive into how Ragas helps you generate robust test datasets for evaluating your LLM applications. Why and']","Ragas helps generate robust test datasets for evaluating Retrieval-Augmented Generation systems, including document-based, domain-specific, and adversarial test generation techniques.",single_hop_specifc_query_synthesizer
+ Wut is OpenAIEmbeddings used for?,"['How to Generate Synthetic Data for RAG Evaluation In the world of Retrieval-Augmented Generation (RAG) and LLM-powered applications, synthetic data generation is a game-changer for rapid iteration and robust evaluation. This blog post explains why synthetic data is essential, and how you can generate it for your own RAG pipelines—using modern tools like RAGAS and LangSmith. Why Generate Synthetic Data? Early Signal, Fast Iteration Real-world data is often scarce or expensive to label. Synthetic data lets you quickly create test sets that mimic real user queries and contexts, so you can evaluate your system’s performance before deploying to production. Controlled Complexity You can design synthetic datasets to cover edge cases, multi-hop reasoning, or specific knowledge domains—ensuring your RAG system is robust, not just good at the “easy” cases. Benchmarking and Comparison Synthetic test sets provide a repeatable, comparable way to measure improvements as you tweak your pipeline (e.g., changing chunk size, embeddings, or prompts). How to Generate Synthetic Data 1. Prepare Your Source Data Start with a set of documents relevant to your domain. For example, you might download and load HTML blog posts into a document format using tools like LangChain’s DirectoryLoader. 2. Build a Knowledge Graph Use RAGAS to convert your documents into a knowledge graph. This graph captures entities, relationships, and summaries, forming the backbone for generating meaningful queries. RAGAS applies default transformations are dependent on the corpus length, here are some examples: Producing Summaries -> produces summaries of the documents Extracting Headlines -> finding the overall headline for the document Theme Extractor -> extracts broad themes about the documents It then uses cosine-similarity and heuristics between the embeddings of the above transformations to construct relationships between the nodes. This is a crucial step, as the quality of your knowledge graph directly impacts the relevance and accuracy of the generated queries. 3. Configure Query Synthesizers RAGAS provides several query synthesizers: - SingleHopSpecificQuerySynthesizer: Generates direct, fact-based questions. - MultiHopAbstractQuerySynthesizer: Creates broader, multi-step reasoning questions. - MultiHopSpecificQuerySynthesizer: Focuses on questions that require connecting specific entities across documents. By mixing these, you get a diverse and challenging test set. 4. Generate the Test Set With your knowledge graph and query synthesizers, use RAGAS’s TestsetGenerator to create a synthetic dataset. This dataset will include questions, reference answers, and supporting contexts. 5. Evaluate and Iterate Load your synthetic dataset into an evaluation platform like LangSmith. Run your RAG pipeline against the test set, and use automated evaluators (for accuracy, helpfulness, style, etc.) to identify strengths and weaknesses. Tweak your pipeline and re-evaluate to drive improvements. Minimal Example Here’s a high-level pseudocode outline (see the notebook for full details): ````python 1. Load documents from langchain_community.document_loaders import DirectoryLoader path = ""data/"" loader = DirectoryLoader(path, glob=""*.md"") docs = loader.load() 2. 
Generate data from ragas.testset import TestsetGenerator from ragas.llms import LangchainLLMWrapper from ragas.embeddings import LangchainEmbeddingsWrapper from langchain_openai import ChatOpenAI from langchain_openai import OpenAIEmbeddings Initialize the generator with the LLM and embedding model generator_llm = LangchainLLMWrapper(ChatOpenAI(model=""gpt-4.1"")) generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings()) Create the test set generator generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings) dataset = generator.generate_with_langchain_docs(docs, testset_size=10) ```` dataset will now contain a set of questions, answers, and contexts that you can use to evaluate your RAG system. 💡 Try it yourself: Explore the hands-on notebook for synthetic data generation: 💡 Try it yourself: Explore the hands-on notebook for synthetic data generation: 04_Synthetic_Data_Generation']","OpenAIEmbeddings is used as an embedding model in the synthetic data generation process for RAG evaluation, as shown when initializing the generator with LangchainEmbeddingsWrapper(OpenAIEmbeddings()).",single_hop_specifc_query_synthesizer
+ "Wht are the key steps in the Ragas evalution workflow for RAG systems, and wich specialized evalution metrics can be selected to asess system performance?","['<1-hop>\n\ntitle: ""Part 2: Basic Evaluation Workflow with Ragas"" date: 2025-04-26T19:00:00-06:00 layout: blog description: ""Learn how to set up a basic evaluation workflow for LLM applications using Ragas. This guide walks you through data preparation, metric selection, and result analysis."" categories: [""AI"", ""RAG"", ""Evaluation"", ""Ragas""] coverImage: ""https://images.unsplash.com/photo-1600132806370-bf17e65e942f?q=80&w=1988&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"" readingTime: 8 published: true In our previous post, we introduced Ragas as a powerful framework for evaluating LLM applications. Now, let\'s dive into the practical aspects of setting up your first evaluation pipeline. Understanding the Evaluation Workflow A typical Ragas evaluation workflow consists of four key steps: Prepare your data: Collect queries, contexts, responses, and reference answers Select appropriate metrics: Choose metrics that align with what you want to evaluate Run the evaluation: Process your data through the selected metrics Analyze the results: Interpret scores and identify areas for improvement Let\'s walk through each step with practical examples. Step 1: Setting Up Your Environment First, ensure you have Ragas installed: bash uv add ragas Next, import the necessary components: python import pandas as pd from ragas import EvaluationDataset from ragas import evaluate, RunConfig from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity Step 2: Preparing Your Evaluation Data For a RAG system evaluation, you\'ll need: Questions: User queries to your system Contexts: Documents or chunks retrieved by your system Responses: Answers generated by your system Ground truth (optional): Reference answers or documents for comparison Here\'s how to organize this data: ```python Sample data data = { ""user_input"": [ ""What are the main symptoms of COVID-19?"", ""How does machine learning differ from deep learning?"" ], ""retrieved_contexts"": [ [ ""Common symptoms of COVID-19 include fever, cough, and fatigue. Some patients also report loss of taste or smell, body aches, and difficulty breathing."", ""COVID-19 is caused by the SARS-CoV-2 virus and spreads primarily through respiratory droplets."" ], [ ""Machine learning is a subset of AI focused on algorithms that learn from data without being explicitly programmed."", ""Deep learning is a specialized form of machine learning using neural networks with many layers (deep neural networks)."" ] ], ""response"": [ ""The main symptoms of COVID-19 include fever, cough, fatigue, and sometimes loss of taste or smell, body aches, and breathing difficulties."", ""Machine learning is a subset of AI that focuses on algorithms learning from data, while deep learning is a specialized form of machine learning that uses deep neural networks with multiple layers."" ], ""reference"": [ ""COVID-19 symptoms commonly include fever, dry cough, fatigue, loss of taste or smell, body aches, sore throat, and in severe cases, difficulty breathing."", ""Machine learning is a branch of AI where systems learn from data, identify patterns, and make decisions with minimal human intervention. 
Deep learning is a subset of machine learning that uses neural networks with multiple layers (deep neural networks) to analyze various factors of data."" ] } eval_data = pd.DataFrame(data) Convert to a format Ragas can use evaluation_dataset = EvaluationDataset.from_pandas(eval_data) evaluation_dataset ``` Step 3: Selecting and Configuring Metrics Ragas offers various metrics to evaluate different aspects of your system: Core RAG Metrics: Faithfulness: Measures if the response is factually consistent with the provided context. Factual Correctness: Assesses if the response is accurate and free from factual errors. Response Relevancy: Evaluates if the response directly addresses the user query. Context Entity Recall: Measures how well the retrieved context captures relevant entities from the ground truth. Noise Sensitivity: Assesses the robustness of the response to irrelevant or noisy context. LLM Context Recall: Evaluates how effectively the LLM utilizes the provided context to generate the response. For metrics that require an LLM (like faithfulness), you need to configure the LLM provider: ```python Configure LLM for evaluation from langchain_openai import ChatOpenAI from ragas.llms import LangchainLLMWrapper Initialize the LLM, you are going to OPENAI API key evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=""gpt-4o"")) Define metrics to use metrics = [ Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity(), LLMContextRecall() ] ```', '<2-hop>\n\ntitle: ""Part 4: Generating Test Data with Ragas"" date: 2025-04-27T16:00:00-06:00 layout: blog description: ""Discover how to generate robust test datasets for evaluating Retrieval-Augmented Generation systems using Ragas, including document-based, domain-specific, and adversarial test generation techniques."" categories: [""AI"", ""RAG"", ""Evaluation"", ""Ragas"",""Data""] coverImage: ""/images/generating_test_data.png"" readingTime: 14 published: true In our previous post, we explored how to comprehensively evaluate RAG systems using specialized metrics. However, even the best evaluation framework requires high-quality test data to yield meaningful insights. In this post, we\'ll dive into how Ragas helps you generate robust test datasets for evaluating your LLM applications. Why and']","The key steps in the Ragas evaluation workflow for RAG systems include preparing your data (collecting queries, contexts, responses, and reference answers), selecting appropriate metrics that align with your evaluation goals, running the evaluation by processing your data through the selected metrics, and analyzing the results to interpret scores and identify areas for improvement. Specialized evaluation metrics offered by Ragas include Faithfulness (measuring factual consistency with context), Factual Correctness (assessing accuracy and freedom from factual errors), Response Relevancy (evaluating if the response addresses the user query), Context Entity Recall (measuring how well the retrieved context captures relevant entities), Noise Sensitivity (assessing robustness to irrelevant context), and LLM Context Recall (evaluating how effectively the LLM uses the provided context to generate the response).",multi_hop_abstract_query_synthesizer
+ "How does Ragas facilitate both test data generation and synthetic data generation for evaluating Retrieval-Augmented Generation (RAG) systems, and what are the key steps and tools involved in creating robust synthetic test datasets as described in the blog series?","['<1-hop>\n\ntitle: ""Part 4: Generating Test Data with Ragas"" date: 2025-04-27T16:00:00-06:00 layout: blog description: ""Discover how to generate robust test datasets for evaluating Retrieval-Augmented Generation systems using Ragas, including document-based, domain-specific, and adversarial test generation techniques."" categories: [""AI"", ""RAG"", ""Evaluation"", ""Ragas"",""Data""] coverImage: ""/images/generating_test_data.png"" readingTime: 14 published: true In our previous post, we explored how to comprehensively evaluate RAG systems using specialized metrics. However, even the best evaluation framework requires high-quality test data to yield meaningful insights. In this post, we\'ll dive into how Ragas helps you generate robust test datasets for evaluating your LLM applications. Why and', '<2-hop>\n\nHow to Generate Synthetic Data for RAG Evaluation In the world of Retrieval-Augmented Generation (RAG) and LLM-powered applications, synthetic data generation is a game-changer for rapid iteration and robust evaluation. This blog post explains why synthetic data is essential, and how you can generate it for your own RAG pipelines—using modern tools like RAGAS and LangSmith. Why Generate Synthetic Data? Early Signal, Fast Iteration Real-world data is often scarce or expensive to label. Synthetic data lets you quickly create test sets that mimic real user queries and contexts, so you can evaluate your system’s performance before deploying to production. Controlled Complexity You can design synthetic datasets to cover edge cases, multi-hop reasoning, or specific knowledge domains—ensuring your RAG system is robust, not just good at the “easy” cases. Benchmarking and Comparison Synthetic test sets provide a repeatable, comparable way to measure improvements as you tweak your pipeline (e.g., changing chunk size, embeddings, or prompts). How to Generate Synthetic Data 1. Prepare Your Source Data Start with a set of documents relevant to your domain. For example, you might download and load HTML blog posts into a document format using tools like LangChain’s DirectoryLoader. 2. Build a Knowledge Graph Use RAGAS to convert your documents into a knowledge graph. This graph captures entities, relationships, and summaries, forming the backbone for generating meaningful queries. RAGAS applies default transformations are dependent on the corpus length, here are some examples: Producing Summaries -> produces summaries of the documents Extracting Headlines -> finding the overall headline for the document Theme Extractor -> extracts broad themes about the documents It then uses cosine-similarity and heuristics between the embeddings of the above transformations to construct relationships between the nodes. This is a crucial step, as the quality of your knowledge graph directly impacts the relevance and accuracy of the generated queries. 3. Configure Query Synthesizers RAGAS provides several query synthesizers: - SingleHopSpecificQuerySynthesizer: Generates direct, fact-based questions. - MultiHopAbstractQuerySynthesizer: Creates broader, multi-step reasoning questions. - MultiHopSpecificQuerySynthesizer: Focuses on questions that require connecting specific entities across documents. 
By mixing these, you get a diverse and challenging test set. 4. Generate the Test Set With your knowledge graph and query synthesizers, use RAGAS’s TestsetGenerator to create a synthetic dataset. This dataset will include questions, reference answers, and supporting contexts. 5. Evaluate and Iterate Load your synthetic dataset into an evaluation platform like LangSmith. Run your RAG pipeline against the test set, and use automated evaluators (for accuracy, helpfulness, style, etc.) to identify strengths and weaknesses. Tweak your pipeline and re-evaluate to drive improvements. Minimal Example Here’s a high-level pseudocode outline (see the notebook for full details): ````python 1. Load documents from langchain_community.document_loaders import DirectoryLoader path = ""data/"" loader = DirectoryLoader(path, glob=""*.md"") docs = loader.load() 2. Generate data from ragas.testset import TestsetGenerator from ragas.llms import LangchainLLMWrapper from ragas.embeddings import LangchainEmbeddingsWrapper from langchain_openai import ChatOpenAI from langchain_openai import OpenAIEmbeddings Initialize the generator with the LLM and embedding model generator_llm = LangchainLLMWrapper(ChatOpenAI(model=""gpt-4.1"")) generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings()) Create the test set generator generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings) dataset = generator.generate_with_langchain_docs(docs, testset_size=10) ```` dataset will now contain a set of questions, answers, and contexts that you can use to evaluate your RAG system. 💡 Try it yourself: Explore the hands-on notebook for synthetic data generation: 💡 Try it yourself: Explore the hands-on notebook for synthetic data generation: 04_Synthetic_Data_Generation', ""<3-hop>\n\nWhat's Coming in This Blog Series This introduction is just the beginning. In the upcoming posts, we'll dive deeper into all aspects of evaluating LLM applications with Ragas: Part 2: Basic Evaluation Workflow We'll explore each metric in detail, explaining when and how to use them effectively. Part 3: Evaluating RAG Systems Learn specialized techniques for evaluating retrieval-augmented generation systems, including context precision, recall, and relevance. Part 4: Test Data Generation Discover how to create high-quality test datasets that thoroughly exercise your application's capabilities. Part 5: Advanced Evaluation Techniques Go beyond basic metrics with custom evaluations, multi-aspect analysis, and domain-specific assessments. Part 6: Evaluating AI Agents Learn how to evaluate complex AI agents that engage in multi-turn interactions, use tools, and work toward specific goals. Part 7: Integrations and Observability Connect Ragas with your existing tools and platforms for streamlined evaluation workflows. Part 8: Building Feedback Loops Learn how to implement feedback loops that drive continuous improvement in your LLM applications. Transform evaluation insights into concrete improvements for your LLM applications. Conclusion In a world increasingly powered by LLMs, robust evaluation is the difference between reliable applications and unpredictable ones. Ragas provides the tools you need to confidently assess and improve your LLM applications. Ready to Elevate Your LLM Applications? Start exploring Ragas today by visiting the official documentation. Share your thoughts, challenges, or success stories. 
If you're facing specific evaluation hurdles, don't hesitate to reach out—we'd love to help!""]","Ragas facilitates test data generation and synthetic data generation for evaluating Retrieval-Augmented Generation (RAG) systems by providing a structured workflow and specialized tools. According to the blog series, high-quality test datasets are essential for meaningful evaluation of LLM applications. Ragas enables the creation of robust test datasets by supporting document-based, domain-specific, and adversarial test generation techniques (<1-hop>). For synthetic data generation, Ragas allows developers to quickly create test sets that mimic real user queries and contexts, which is especially useful when real-world data is scarce or expensive to label. The process involves several key steps: preparing source documents, building a knowledge graph using Ragas (which captures entities, relationships, and summaries), and configuring query synthesizers such as SingleHopSpecificQuerySynthesizer, MultiHopAbstractQuerySynthesizer, and MultiHopSpecificQuerySynthesizer to generate diverse and challenging questions. The TestsetGenerator in Ragas then creates a synthetic dataset containing questions, reference answers, and supporting contexts. This synthetic dataset can be loaded into evaluation platforms like LangSmith for automated assessment and iterative improvement of the RAG pipeline (<2-hop>). The blog series further outlines that these practices are part of a comprehensive approach to evaluating LLM applications, with future posts covering advanced evaluation techniques and feedback loops for continuous improvement (<3-hop>).",multi_hop_abstract_query_synthesizer
+ "Wht speshulized evalushun metrix does Ragas provied for LLMs, and how do you selekt and configure these metrix in a basic evalushun workflow?","['<1-hop>\n\ntitle: ""Part 2: Basic Evaluation Workflow with Ragas"" date: 2025-04-26T19:00:00-06:00 layout: blog description: ""Learn how to set up a basic evaluation workflow for LLM applications using Ragas. This guide walks you through data preparation, metric selection, and result analysis."" categories: [""AI"", ""RAG"", ""Evaluation"", ""Ragas""] coverImage: ""https://images.unsplash.com/photo-1600132806370-bf17e65e942f?q=80&w=1988&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"" readingTime: 8 published: true In our previous post, we introduced Ragas as a powerful framework for evaluating LLM applications. Now, let\'s dive into the practical aspects of setting up your first evaluation pipeline. Understanding the Evaluation Workflow A typical Ragas evaluation workflow consists of four key steps: Prepare your data: Collect queries, contexts, responses, and reference answers Select appropriate metrics: Choose metrics that align with what you want to evaluate Run the evaluation: Process your data through the selected metrics Analyze the results: Interpret scores and identify areas for improvement Let\'s walk through each step with practical examples. Step 1: Setting Up Your Environment First, ensure you have Ragas installed: bash uv add ragas Next, import the necessary components: python import pandas as pd from ragas import EvaluationDataset from ragas import evaluate, RunConfig from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity Step 2: Preparing Your Evaluation Data For a RAG system evaluation, you\'ll need: Questions: User queries to your system Contexts: Documents or chunks retrieved by your system Responses: Answers generated by your system Ground truth (optional): Reference answers or documents for comparison Here\'s how to organize this data: ```python Sample data data = { ""user_input"": [ ""What are the main symptoms of COVID-19?"", ""How does machine learning differ from deep learning?"" ], ""retrieved_contexts"": [ [ ""Common symptoms of COVID-19 include fever, cough, and fatigue. Some patients also report loss of taste or smell, body aches, and difficulty breathing."", ""COVID-19 is caused by the SARS-CoV-2 virus and spreads primarily through respiratory droplets."" ], [ ""Machine learning is a subset of AI focused on algorithms that learn from data without being explicitly programmed."", ""Deep learning is a specialized form of machine learning using neural networks with many layers (deep neural networks)."" ] ], ""response"": [ ""The main symptoms of COVID-19 include fever, cough, fatigue, and sometimes loss of taste or smell, body aches, and breathing difficulties."", ""Machine learning is a subset of AI that focuses on algorithms learning from data, while deep learning is a specialized form of machine learning that uses deep neural networks with multiple layers."" ], ""reference"": [ ""COVID-19 symptoms commonly include fever, dry cough, fatigue, loss of taste or smell, body aches, sore throat, and in severe cases, difficulty breathing."", ""Machine learning is a branch of AI where systems learn from data, identify patterns, and make decisions with minimal human intervention. 
Deep learning is a subset of machine learning that uses neural networks with multiple layers (deep neural networks) to analyze various factors of data."" ] } eval_data = pd.DataFrame(data) Convert to a format Ragas can use evaluation_dataset = EvaluationDataset.from_pandas(eval_data) evaluation_dataset ``` Step 3: Selecting and Configuring Metrics Ragas offers various metrics to evaluate different aspects of your system: Core RAG Metrics: Faithfulness: Measures if the response is factually consistent with the provided context. Factual Correctness: Assesses if the response is accurate and free from factual errors. Response Relevancy: Evaluates if the response directly addresses the user query. Context Entity Recall: Measures how well the retrieved context captures relevant entities from the ground truth. Noise Sensitivity: Assesses the robustness of the response to irrelevant or noisy context. LLM Context Recall: Evaluates how effectively the LLM utilizes the provided context to generate the response. For metrics that require an LLM (like faithfulness), you need to configure the LLM provider: ```python Configure LLM for evaluation from langchain_openai import ChatOpenAI from ragas.llms import LangchainLLMWrapper Initialize the LLM, you are going to OPENAI API key evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=""gpt-4o"")) Define metrics to use metrics = [ Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity(), LLMContextRecall() ] ```', '<2-hop>\n\ntitle: ""Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications"" date: 2025-04-26T18:00:00-06:00 layout: blog description: ""Explore the essential evaluation framework for LLM applications with Ragas. Learn how to assess performance, ensure accuracy, and improve reliability in Retrieval-Augmented Generation systems."" categories: [""AI"", ""RAG"", ""Evaluation"",""Ragas""] coverImage: ""https://images.unsplash.com/photo-1593642634367-d91a135587b5?q=80&w=1770&auto=format&fit=crop&ixlib=rb-4.0.3"" readingTime: 7 published: true As Large Language Models (LLMs) become fundamental components of modern applications, effectively evaluating their performance becomes increasingly critical. Whether you\'re building a question-answering system, a document retrieval tool, or a conversational agent, you need reliable metrics to assess how well your application performs. This is where Ragas steps in. What is Ragas? Ragas is an open-source evaluation framework specifically designed for LLM applications, with particular strengths in Retrieval-Augmented Generation (RAG) systems. Unlike traditional NLP evaluation methods, Ragas provides specialized metrics that address the unique challenges of LLM-powered systems. At its core, Ragas helps answer crucial questions: - Is my application retrieving the right information? - Are the responses factually accurate and consistent with the retrieved context? - Does the system appropriately address the user\'s query? - How well does my application handle multi-turn conversations? Why Evaluate LLM Applications? LLMs are powerful but imperfect. They can hallucinate facts, misinterpret queries, or generate convincing but incorrect responses. For applications where accuracy and reliability matter—like healthcare, finance, or education—proper evaluation is non-negotiable. 
Evaluation serves several key purposes: - Quality assurance: Identify and fix issues before they reach users - Performance tracking: Monitor how changes impact system performance - Benchmarking: Compare different approaches objectively - Continuous improvement: Build feedback loops to enhance your application Key Features of Ragas 🎯 Specialized Metrics Ragas offers both LLM-based and computational metrics tailored to evaluate different aspects of LLM applications: Faithfulness: Measures if the response is factually consistent with the retrieved context Context Relevancy: Evaluates if the retrieved information is relevant to the query Answer Relevancy: Assesses if the response addresses the user\'s question Topic Adherence: Gauges how well multi-turn conversations stay on topic 🧪 Test Data Generation Creating high-quality test data is often a bottleneck in evaluation. Ragas helps you generate comprehensive test datasets automatically, saving time and ensuring thorough coverage. 🔗 Seamless Integrations Ragas works with popular LLM frameworks and tools: - LangChain - LlamaIndex - Haystack - OpenAI Observability platforms - Phoenix - LangSmith - Langfuse 📊 Comprehensive Analysis Beyond simple scores, Ragas provides detailed insights into your application\'s strengths and weaknesses, enabling targeted improvements. Getting Started with Ragas Installing Ragas is straightforward: bash uv init && uv add ragas Here\'s a simple example of evaluating a response using Ragas: ```python from ragas.metrics import Faithfulness from ragas.evaluation import EvaluationDataset from ragas.dataset_schema import SingleTurnSample from langchain_openai import ChatOpenAI from ragas.llms import LangchainLLMWrapper from langchain_openai import ChatOpenAI Initialize the LLM, you are going to new OPENAI API key evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=""gpt-4o"")) Your evaluation data test_data = { ""user_input"": ""What is the capital of France?"", ""retrieved_contexts"": [""Paris is the capital and most populous city of France.""], ""response"": ""The capital of France is Paris."" } Create a sample sample = SingleTurnSample(**test_data) # Unpack the dictionary into the constructor Create metric faithfulness = Faithfulness(llm=evaluator_llm) Calculate the score result = await faithfulness.single_turn_ascore(sample) print(f""Faithfulness score: {result}"") ``` 💡 Try it yourself: Explore the hands-on notebook for this workflow: 01_Introduction_to_Ragas']","Ragas provieds speshulized evalushun metrix for LLMs, such as Faithfulness, Factual Correctness, Response Relevancy, Context Entity Recall, Noise Sensitivity, and LLM Context Recall. These metrix are taylored to address the unique challeenges of LLM-powred systems, like ensuring responses are factually consistant with the context and relevunt to the user query. In a basic evalushun workflow, you selekt metrix that align with your evalushun goals, then configure them—sum metrix, like Faithfulness, require setting up an LLM provider (for example, using LangchainLLMWrapper with a model like gpt-4o). You then run your evalushun by processing your data through the selekted metrix to analyze results and identify improvemint areas.",multi_hop_abstract_query_synthesizer
+ "Which specialized metrics does Ragas provide for evaluating Retrieval-Augmented Generation (RAG) systems, and how do these metrics address the unique evaluation challenges posed by the multi-component nature of RAG systems?","['<1-hop>\n\ntitle: ""Part 3: Evaluating RAG Systems with Ragas"" date: 2025-04-26T20:00:00-06:00 layout: blog description: ""Learn specialized techniques for comprehensive evaluation of Retrieval-Augmented Generation systems using Ragas, including metrics for retrieval quality, generation quality, and end-to-end performance."" categories: [""AI"", ""RAG"", ""Evaluation"", ""Ragas""] coverImage: ""https://images.unsplash.com/photo-1743796055664-3473eedab36e?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"" readingTime: 14 published: true In our previous post, we covered the fundamentals of setting up evaluation workflows with Ragas. Now, let\'s focus specifically on evaluating Retrieval-Augmented Generation (RAG) systems, which present unique evaluation challenges due to their multi-component nature. Understanding RAG Systems: More Than the Sum of Their Parts RAG systems combine two critical capabilities: 1. Retrieval: Finding relevant information from a knowledge base 2. Generation: Creating coherent, accurate responses based on retrieved information This dual nature means evaluation must address both components while also assessing their interaction. A system might retrieve perfect information but generate poor responses, or generate excellent prose from irrelevant retrieved content. The RAG Evaluation Triad Effective RAG evaluation requires examining three key dimensions: Retrieval Quality: How well does the system find relevant information? Generation Quality: How well does the system produce responses from retrieved information? End-to-End Performance: How well does the complete system satisfy user needs? Let\'s explore how Ragas helps evaluate each dimension of RAG systems.', ""<2-hop>\n\nCore RAG Metrics in Ragas Ragas provides specialized metrics to assess RAG systems across retrieval, generation, and end-to-end performance. Retrieval Quality Metrics 1. Context Relevancy Measures how relevant the retrieved documents are to the user's question. How it works: Takes the user's question (user_input) and the retrieved documents (retrieved_contexts). Uses an LLM to score relevance with two different prompts, averaging the results for robustness. Scores are normalized between 0.0 (irrelevant) and 1.0 (fully relevant). Why it matters: Low scores indicate your retriever is pulling in unrelated or noisy documents. Monitoring this helps you improve the retrieval step. 2. Context Precision Assesses how much of the retrieved context is actually useful for generating the answer. How it works: For each retrieved chunk, an LLM judges if it was necessary for the answer, using the ground truth (reference) or the generated response. Calculates Average Precision, rewarding systems that rank useful chunks higher. Variants: ContextUtilization: Uses the generated response instead of ground truth. Non-LLM version: Compares retrieved chunks to ideal reference contexts using string similarity. Why it matters: High precision means your retriever is efficient; low precision means too much irrelevant information is included. 3. Context Recall Evaluates whether all necessary information from the ground truth answer is present in the retrieved context. How it works: Breaks down the reference answer into sentences. 
For each sentence, an LLM checks if it can be supported by the retrieved context. The score is the proportion of reference sentences attributed to the retrieved context. Variants: Non-LLM version: Compares reference and retrieved contexts using similarity and thresholds. Why it matters: High recall means your retriever finds all needed information; low recall means critical information is missing. Summary: - Low context relevancy: Retriever needs better query understanding or semantic matching. - Low context precision: Retriever includes unnecessary information. - Low context recall: Retriever misses critical information. Generation Quality Metrics 1. Faithfulness Checks if the generated answer is factually consistent with the retrieved context, addressing hallucination. How it works: Breaks the answer into simple statements. For each, an LLM checks if it can be inferred from the retrieved context. The score is the proportion of faithful statements. Alternative: FaithfulnesswithHHEM: Uses a specialized NLI model for verification. Why it matters: High faithfulness means answers are grounded in context; low faithfulness signals hallucination. 2. Answer Relevancy Measures if the generated answer directly addresses the user's question. How it works: Asks an LLM to generate possible questions for the answer. Compares these to the original question using embedding similarity. Penalizes noncommittal answers. Why it matters: High relevancy means answers are on-topic; low relevancy means answers are off-topic or incomplete. Summary: - Low faithfulness: Generator adds facts not supported by context. - Low answer relevancy: Generator doesn't focus on the specific question. End-to-End Metrics 1. Correctness Assesses factual alignment between the generated answer and a ground truth reference. How it works: Breaks both the answer and reference into claims. Uses NLI to verify claims in both directions. Calculates precision, recall, or F1-score. Why it matters: High correctness means answers match the ground truth; low correctness signals factual errors. Key distinction: - Faithfulness: Compares answer to retrieved context. - FactualCorrectness: Compares answer to ground truth."", '<3-hop>\n\ntitle: ""Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications"" date: 2025-04-26T18:00:00-06:00 layout: blog description: ""Explore the essential evaluation framework for LLM applications with Ragas. Learn how to assess performance, ensure accuracy, and improve reliability in Retrieval-Augmented Generation systems."" categories: [""AI"", ""RAG"", ""Evaluation"",""Ragas""] coverImage: ""https://images.unsplash.com/photo-1593642634367-d91a135587b5?q=80&w=1770&auto=format&fit=crop&ixlib=rb-4.0.3"" readingTime: 7 published: true As Large Language Models (LLMs) become fundamental components of modern applications, effectively evaluating their performance becomes increasingly critical. Whether you\'re building a question-answering system, a document retrieval tool, or a conversational agent, you need reliable metrics to assess how well your application performs. This is where Ragas steps in. What is Ragas? Ragas is an open-source evaluation framework specifically designed for LLM applications, with particular strengths in Retrieval-Augmented Generation (RAG) systems. Unlike traditional NLP evaluation methods, Ragas provides specialized metrics that address the unique challenges of LLM-powered systems. 
At its core, Ragas helps answer crucial questions: - Is my application retrieving the right information? - Are the responses factually accurate and consistent with the retrieved context? - Does the system appropriately address the user\'s query? - How well does my application handle multi-turn conversations? Why Evaluate LLM Applications? LLMs are powerful but imperfect. They can hallucinate facts, misinterpret queries, or generate convincing but incorrect responses. For applications where accuracy and reliability matter—like healthcare, finance, or education—proper evaluation is non-negotiable. Evaluation serves several key purposes: - Quality assurance: Identify and fix issues before they reach users - Performance tracking: Monitor how changes impact system performance - Benchmarking: Compare different approaches objectively - Continuous improvement: Build feedback loops to enhance your application Key Features of Ragas 🎯 Specialized Metrics Ragas offers both LLM-based and computational metrics tailored to evaluate different aspects of LLM applications: Faithfulness: Measures if the response is factually consistent with the retrieved context Context Relevancy: Evaluates if the retrieved information is relevant to the query Answer Relevancy: Assesses if the response addresses the user\'s question Topic Adherence: Gauges how well multi-turn conversations stay on topic 🧪 Test Data Generation Creating high-quality test data is often a bottleneck in evaluation. Ragas helps you generate comprehensive test datasets automatically, saving time and ensuring thorough coverage. 🔗 Seamless Integrations Ragas works with popular LLM frameworks and tools: - LangChain - LlamaIndex - Haystack - OpenAI Observability platforms - Phoenix - LangSmith - Langfuse 📊 Comprehensive Analysis Beyond simple scores, Ragas provides detailed insights into your application\'s strengths and weaknesses, enabling targeted improvements. Getting Started with Ragas Installing Ragas is straightforward: bash uv init && uv add ragas Here\'s a simple example of evaluating a response using Ragas: ```python from ragas.metrics import Faithfulness from ragas.evaluation import EvaluationDataset from ragas.dataset_schema import SingleTurnSample from langchain_openai import ChatOpenAI from ragas.llms import LangchainLLMWrapper from langchain_openai import ChatOpenAI Initialize the LLM, you are going to new OPENAI API key evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=""gpt-4o"")) Your evaluation data test_data = { ""user_input"": ""What is the capital of France?"", ""retrieved_contexts"": [""Paris is the capital and most populous city of France.""], ""response"": ""The capital of France is Paris."" } Create a sample sample = SingleTurnSample(**test_data) # Unpack the dictionary into the constructor Create metric faithfulness = Faithfulness(llm=evaluator_llm) Calculate the score result = await faithfulness.single_turn_ascore(sample) print(f""Faithfulness score: {result}"") ``` 💡 Try it yourself: Explore the hands-on notebook for this workflow: 01_Introduction_to_Ragas', ""<4-hop>\n\nWhat's Coming in This Blog Series This introduction is just the beginning. In the upcoming posts, we'll dive deeper into all aspects of evaluating LLM applications with Ragas: Part 2: Basic Evaluation Workflow We'll explore each metric in detail, explaining when and how to use them effectively. 
Part 3: Evaluating RAG Systems Learn specialized techniques for evaluating retrieval-augmented generation systems, including context precision, recall, and relevance. Part 4: Test Data Generation Discover how to create high-quality test datasets that thoroughly exercise your application's capabilities. Part 5: Advanced Evaluation Techniques Go beyond basic metrics with custom evaluations, multi-aspect analysis, and domain-specific assessments. Part 6: Evaluating AI Agents Learn how to evaluate complex AI agents that engage in multi-turn interactions, use tools, and work toward specific goals. Part 7: Integrations and Observability Connect Ragas with your existing tools and platforms for streamlined evaluation workflows. Part 8: Building Feedback Loops Learn how to implement feedback loops that drive continuous improvement in your LLM applications. Transform evaluation insights into concrete improvements for your LLM applications. Conclusion In a world increasingly powered by LLMs, robust evaluation is the difference between reliable applications and unpredictable ones. Ragas provides the tools you need to confidently assess and improve your LLM applications. Ready to Elevate Your LLM Applications? Start exploring Ragas today by visiting the official documentation. Share your thoughts, challenges, or success stories. If you're facing specific evaluation hurdles, don't hesitate to reach out—we'd love to help!""]","Ragas provides specialized metrics for evaluating Retrieval-Augmented Generation (RAG) systems that address the unique challenges arising from their multi-component structure, which includes both retrieval and generation. The core metrics are divided into three key dimensions: Retrieval Quality, Generation Quality, and End-to-End Performance. For Retrieval Quality, Ragas offers Context Relevancy (measuring how relevant retrieved documents are to the user's question), Context Precision (assessing how much of the retrieved context is actually useful for generating the answer), and Context Recall (evaluating whether all necessary information from the ground truth answer is present in the retrieved context). For Generation Quality, Ragas includes Faithfulness (checking if the generated answer is factually consistent with the retrieved context) and Answer Relevancy (measuring if the generated answer directly addresses the user's question). For End-to-End Performance, the Correctness metric assesses factual alignment between the generated answer and a ground truth reference. These metrics collectively ensure that both the retrieval and generation components are evaluated individually and in combination, addressing the unique evaluation challenges of RAG systems.",multi_hop_abstract_query_synthesizer
+ "How does RAGAS facilitate metric-driven development in RAG system evaluation, and what specific metrics does it introduce to improve the assessment process?","['<1-hop>\n\nHow to Generate Synthetic Data for RAG Evaluation In the world of Retrieval-Augmented Generation (RAG) and LLM-powered applications, synthetic data generation is a game-changer for rapid iteration and robust evaluation. This blog post explains why synthetic data is essential, and how you can generate it for your own RAG pipelines—using modern tools like RAGAS and LangSmith. Why Generate Synthetic Data? Early Signal, Fast Iteration Real-world data is often scarce or expensive to label. Synthetic data lets you quickly create test sets that mimic real user queries and contexts, so you can evaluate your system’s performance before deploying to production. Controlled Complexity You can design synthetic datasets to cover edge cases, multi-hop reasoning, or specific knowledge domains—ensuring your RAG system is robust, not just good at the “easy” cases. Benchmarking and Comparison Synthetic test sets provide a repeatable, comparable way to measure improvements as you tweak your pipeline (e.g., changing chunk size, embeddings, or prompts). How to Generate Synthetic Data 1. Prepare Your Source Data Start with a set of documents relevant to your domain. For example, you might download and load HTML blog posts into a document format using tools like LangChain’s DirectoryLoader. 2. Build a Knowledge Graph Use RAGAS to convert your documents into a knowledge graph. This graph captures entities, relationships, and summaries, forming the backbone for generating meaningful queries. RAGAS applies default transformations are dependent on the corpus length, here are some examples: Producing Summaries -> produces summaries of the documents Extracting Headlines -> finding the overall headline for the document Theme Extractor -> extracts broad themes about the documents It then uses cosine-similarity and heuristics between the embeddings of the above transformations to construct relationships between the nodes. This is a crucial step, as the quality of your knowledge graph directly impacts the relevance and accuracy of the generated queries. 3. Configure Query Synthesizers RAGAS provides several query synthesizers: - SingleHopSpecificQuerySynthesizer: Generates direct, fact-based questions. - MultiHopAbstractQuerySynthesizer: Creates broader, multi-step reasoning questions. - MultiHopSpecificQuerySynthesizer: Focuses on questions that require connecting specific entities across documents. By mixing these, you get a diverse and challenging test set. 4. Generate the Test Set With your knowledge graph and query synthesizers, use RAGAS’s TestsetGenerator to create a synthetic dataset. This dataset will include questions, reference answers, and supporting contexts. 5. Evaluate and Iterate Load your synthetic dataset into an evaluation platform like LangSmith. Run your RAG pipeline against the test set, and use automated evaluators (for accuracy, helpfulness, style, etc.) to identify strengths and weaknesses. Tweak your pipeline and re-evaluate to drive improvements. Minimal Example Here’s a high-level pseudocode outline (see the notebook for full details): ````python 1. Load documents from langchain_community.document_loaders import DirectoryLoader path = ""data/"" loader = DirectoryLoader(path, glob=""*.md"") docs = loader.load() 2. 
Generate data from ragas.testset import TestsetGenerator from ragas.llms import LangchainLLMWrapper from ragas.embeddings import LangchainEmbeddingsWrapper from langchain_openai import ChatOpenAI from langchain_openai import OpenAIEmbeddings Initialize the generator with the LLM and embedding model generator_llm = LangchainLLMWrapper(ChatOpenAI(model=""gpt-4.1"")) generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings()) Create the test set generator generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings) dataset = generator.generate_with_langchain_docs(docs, testset_size=10) ```` dataset will now contain a set of questions, answers, and contexts that you can use to evaluate your RAG system. 💡 Try it yourself: Explore the hands-on notebook for synthetic data generation: 💡 Try it yourself: Explore the hands-on notebook for synthetic data generation: 04_Synthetic_Data_Generation', '<2-hop>\n\ntitle: ""Metric-Driven Development: Make Smarter Decisions, Faster"" date: 2025-05-05T00:00:00-06:00 layout: blog description: ""Your Team\'s Secret Weapon for Cutting Through Noise and Driving Real Progress. Learn how to use clear metrics to eliminate guesswork and make faster, smarter progress in your projects."" categories: [""Development"", ""Productivity"", ""AI"", ""Management""] coverImage: ""/images/metric-driven-development.png"" readingTime: 9 published: true In today\'s data-driven world, success depends increasingly on our ability to measure the right things at the right time. Whether you\'re developing AI systems, building web applications, or managing projects, having clear metrics guides your team toward meaningful progress while eliminating subjective debates. The Power of Metrics in AI Evaluation Recent advances in generative AI and large language models (LLMs) highlight the critical importance of proper evaluation frameworks. Projects like RAGAS (Retrieval Augmented Generation Assessment System) demonstrate how specialized metrics can transform vague goals into actionable insights. For example, when evaluating retrieval-augmented generation systems, generic metrics like BLEU or ROUGE scores often fail to capture what truly matters - the accuracy, relevance, and contextual understanding of the generated responses. RAGAS instead introduces metrics specifically designed for RAG systems: Faithfulness: Measures how well the generated answer aligns with the retrieved context Answer Relevancy: Evaluates whether the response correctly addresses the user\'s query Context Relevancy: Assesses if the system retrieves information that\'s actually needed Context Precision: Quantifies how efficiently the system uses retrieved information These targeted metrics provide clearer direction than general-purpose evaluations, allowing teams to make precise improvements where they matter most. Imagine two teams building a new feature for a streaming platform: Team A is stuck in debates. Should they focus on improving video load speed or making the recommendation engine more accurate? One engineer insists, ""Faster videos keep users from leaving!"" Another counters, ""But better recommendations are what make them subscribe!"" They argue based on gut feelings. Team B operates differently. They have a clear, agreed-upon goal: Improve the average ""Watch Time per User"" metric, while ensuring video buffering times stay below 2 seconds. They rapidly test ideas, measuring the impact of each change against this specific target. 
Which team do you think will make faster, smarter progress? Team B has the edge because they\'re using Metric-Driven Development (MDD). This is a powerful strategy where teams unite around measurable goals to eliminate guesswork and make real strides. Let\'s break down how it works, what makes a metric truly useful, and see how industries from healthcare to e-commerce use it to succeed. What Exactly is Metric-Driven Development? Metric-Driven Development (MDD) is a simple but effective framework where teams: Define Clear, Measurable Goals: Set specific numerical targets (e.g., ""Increase user sign-ups by 20% this quarter""). Base Decisions on Data: Rely on evidence and measurements, not just opinions or assumptions. Iterate and Learn Quickly: Continuously measure the impact of changes to see what works and what doesn\'t. Think of MDD as a GPS for your project. Without clear metrics, you\'re driving in the fog, hoping you\'re heading in the right direction. With MDD, you get real-time feedback, ensuring you\'re moving towards your destination efficiently. Why Teams Struggle Without Clear Metrics Without a metric-driven approach, teams often fall into common traps: Chasing Too Many Goals: Trying to improve everything at once (""We need higher accuracy and faster speed and lower costs!"") leads to scattered effort and slow progress. Endless Subjective Debates: Arguments arise that are hard to resolve with data (""Is Model A\'s slightly better performance worth the extra complexity?""). Difficulty Measuring Progress: It\'s hard to know if you\'re actually improving (""Are we doing better than last quarter? How can we be sure?""). In machine learning (ML), this often happens when teams track various technical scores (like precision, recall, or F1 score – measures of model accuracy) without a single, unifying metric tied to the actual business outcome they want to achieve.']","RAGAS facilitates metric-driven development in RAG system evaluation by providing tools to generate synthetic datasets and by introducing specialized metrics tailored for RAG systems. According to the context, RAGAS enables the creation of synthetic test sets that mimic real user queries and contexts, allowing teams to benchmark and compare system performance in a controlled and repeatable way. This supports rapid iteration and targeted improvements. In addition, RAGAS introduces specific metrics designed for RAG evaluation, such as Faithfulness (measuring alignment of generated answers with retrieved context), Answer Relevancy (evaluating if the response addresses the user’s query), Context Relevancy (assessing if the retrieved information is needed), and Context Precision (quantifying efficient use of retrieved information). These targeted metrics provide clearer direction than generic metrics, enabling teams to make precise, data-driven improvements and embodying the principles of metric-driven development.",multi_hop_specific_query_synthesizer
11
+ "How does the use of Ragas facilitate the evaluation of Retrieval-Augmented Generation (RAG) systems by generating robust EvaluationDatasets, and what are some best practices for ensuring comprehensive evaluation of AI agents according to the provided context?","['<1-hop>\n\ntitle: ""Part 4: Generating Test Data with Ragas"" date: 2025-04-27T16:00:00-06:00 layout: blog description: ""Discover how to generate robust test datasets for evaluating Retrieval-Augmented Generation systems using Ragas, including document-based, domain-specific, and adversarial test generation techniques."" categories: [""AI"", ""RAG"", ""Evaluation"", ""Ragas"",""Data""] coverImage: ""/images/generating_test_data.png"" readingTime: 14 published: true In our previous post, we explored how to comprehensively evaluate RAG systems using specialized metrics. However, even the best evaluation framework requires high-quality test data to yield meaningful insights. In this post, we\'ll dive into how Ragas helps you generate robust test datasets for evaluating your LLM applications. Why and', '<2-hop>\n\nImplementing Agent Evaluation in Practice Let\'s look at a practical example of evaluating an AI agent using these metrics: ```python from ragas.metrics import AgentGoalAccuracyWithoutReference, ToolCallAccuracy, TopicAdherenceScore from ragas.evaluation import EvaluationDataset from ragas.dataset_schema import MultiTurnSample from langchain_openai import ChatOpenAI from ragas.llms import LangchainLLMWrapper Initialize the LLM evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=""gpt-4o"")) Example conversation with a travel booking agent test_data = { ""user_input"": [ {""role"": ""user"", ""content"": ""I need to book a flight from New York to London next Friday""}, {""role"": ""assistant"", ""content"": ""I\'d be happy to help you book a flight. Let me search for options..."", ""tool_calls"": [{""name"": ""search_flights"", ""arguments"": {""origin"": ""NYC"", ""destination"": ""LON"", ""date"": ""next Friday""}}]}, {""role"": ""tool"", ""name"": ""search_flights"", ""content"": ""Found 5 flights: Flight 1 (Delta, $750), Flight 2 (British Airways, $820)...""}, {""role"": ""assistant"", ""content"": ""I found several flights from New York to London next Friday. The cheapest option is Delta for $750. Would you like to book this one?""}, {""role"": ""user"", ""content"": ""Yes, please book the Delta flight""}, {""role"": ""assistant"", ""content"": ""I\'ll book that for you now."", ""tool_calls"": [{""name"": ""book_flight"", ""arguments"": {""flight_id"": ""delta_123"", ""price"": ""$750""}}]}, {""role"": ""tool"", ""name"": ""book_flight"", ""content"": ""Booking confirmed. Confirmation #: ABC123""}, {""role"": ""assistant"", ""content"": ""Great news! Your flight is confirmed. Your confirmation number is ABC123. The flight is scheduled for next Friday. 
Is there anything else you need help with?""} ], ""reference_topics"": [""travel"", ""flight booking"", ""schedules"", ""prices""], ""reference_tool_calls"": [ {""name"": ""search_flights"", ""args"": {""origin"": ""NYC"", ""destination"": ""LON"", ""date"": ""next Friday""}}, {""name"": ""book_flight"", ""args"": {""flight_id"": ""delta_123"", ""price"": ""$750""}} ] } Create a sample sample = MultiTurnSample(**test_data) Initialize metrics goal_accuracy = AgentGoalAccuracyWithoutReference(llm=evaluator_llm) tool_accuracy = ToolCallAccuracy() topic_adherence = TopicAdherenceScore(llm=evaluator_llm) Calculate scores goal_score = await goal_accuracy.multi_turn_ascore(sample) tool_score = tool_accuracy.multi_turn_score(sample) topic_score = await topic_adherence.multi_turn_ascore(sample) print(f""Goal Accuracy: {goal_score}"") print(f""Tool Call Accuracy: {tool_score}"") print(f""Topic Adherence: {topic_score}"") ``` 💡 Try it yourself: Explore the hands-on notebook for agent evaluation: 06_Evaluating_AI_Agents Advanced Agent Evaluation Techniques Combining Metrics for Comprehensive Evaluation For a complete assessment of agent capabilities, combine multiple metrics: ```python from ragas import evaluate results = evaluate( dataset, # Your dataset of agent conversations metrics=[ AgentGoalAccuracyWithoutReference(llm=evaluator_llm), ToolCallAccuracy(), TopicAdherence(llm=evaluator_llm) ] ) ``` Best Practices for Agent Evaluation Test scenario coverage: Include a diverse range of interaction scenarios Edge case handling: Test how agents handle unexpected inputs or failures Longitudinal evaluation: Track performance over time to identify regressions Human-in-the-loop validation: Periodically verify metric alignment with human judgments Continuous feedback loops: Use evaluation insights to guide agent improvements Conclusion Evaluating AI agents requires specialized metrics that go beyond traditional RAG evaluation. Ragas\' agent_goal_accuracy, tool_call_accuracy, and topic_adherence provide crucial insights into whether an agent can successfully complete tasks, use tools correctly, and stay within designated boundaries. By incorporating these metrics into your evaluation pipeline, you can build more reliable and effective AI agents that truly deliver on the promise of helpful, goal-oriented AI assistants. In our next post, we\'ll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows. Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications Part 2: Basic Evaluation Workflow Part 3: Evaluating RAG Systems with Ragas Part 4: Test Data Generation Part 5: Advanced Metrics and Customization Part 6: Evaluating AI Agents — You are here Next up in the series: Part 7: Integrations and Observability Part 8: Building Feedback Loops How are you evaluating your AI agents? What challenges have you encountered in measuring agent performance? If you\'re facing specific evaluation hurdles, don\'t hesitate to reach out—we\'d love to help!']","Ragas facilitates the evaluation of Retrieval-Augmented Generation (RAG) systems by enabling the generation of robust test datasets, which are essential for meaningful evaluation. According to the context, Ragas supports the creation of document-based, domain-specific, and adversarial test datasets, ensuring that LLM applications are tested under diverse and challenging scenarios. 
The EvaluationDataset class in Ragas allows for the structuring of multi-turn agent conversations, which can then be assessed using specialized metrics such as AgentGoalAccuracyWithoutReference, ToolCallAccuracy, and TopicAdherenceScore. These metrics provide insights into an agent's ability to complete tasks, use tools correctly, and adhere to designated topics. Best practices for comprehensive evaluation include covering a wide range of interaction scenarios, testing edge case handling, conducting longitudinal evaluations to track performance over time, incorporating human-in-the-loop validation to align metrics with human judgment, and establishing continuous feedback loops to guide agent improvements. By combining robust EvaluationDatasets with these best practices and metrics, Ragas enables a thorough and reliable evaluation process for AI agents in RAG systems.",multi_hop_specific_query_synthesizer
12
+ "How can ChatOpenAI be integrated into a RAG evaluation pipeline for both synthetic data generation and advanced metric evaluation using Ragas, and what are the key steps involved in this process?","['<1-hop>\n\nHow to Generate Synthetic Data for RAG Evaluation In the world of Retrieval-Augmented Generation (RAG) and LLM-powered applications, synthetic data generation is a game-changer for rapid iteration and robust evaluation. This blog post explains why synthetic data is essential, and how you can generate it for your own RAG pipelines—using modern tools like RAGAS and LangSmith. Why Generate Synthetic Data? Early Signal, Fast Iteration Real-world data is often scarce or expensive to label. Synthetic data lets you quickly create test sets that mimic real user queries and contexts, so you can evaluate your system’s performance before deploying to production. Controlled Complexity You can design synthetic datasets to cover edge cases, multi-hop reasoning, or specific knowledge domains—ensuring your RAG system is robust, not just good at the “easy” cases. Benchmarking and Comparison Synthetic test sets provide a repeatable, comparable way to measure improvements as you tweak your pipeline (e.g., changing chunk size, embeddings, or prompts). How to Generate Synthetic Data 1. Prepare Your Source Data Start with a set of documents relevant to your domain. For example, you might download and load HTML blog posts into a document format using tools like LangChain’s DirectoryLoader. 2. Build a Knowledge Graph Use RAGAS to convert your documents into a knowledge graph. This graph captures entities, relationships, and summaries, forming the backbone for generating meaningful queries. RAGAS applies default transformations are dependent on the corpus length, here are some examples: Producing Summaries -> produces summaries of the documents Extracting Headlines -> finding the overall headline for the document Theme Extractor -> extracts broad themes about the documents It then uses cosine-similarity and heuristics between the embeddings of the above transformations to construct relationships between the nodes. This is a crucial step, as the quality of your knowledge graph directly impacts the relevance and accuracy of the generated queries. 3. Configure Query Synthesizers RAGAS provides several query synthesizers: - SingleHopSpecificQuerySynthesizer: Generates direct, fact-based questions. - MultiHopAbstractQuerySynthesizer: Creates broader, multi-step reasoning questions. - MultiHopSpecificQuerySynthesizer: Focuses on questions that require connecting specific entities across documents. By mixing these, you get a diverse and challenging test set. 4. Generate the Test Set With your knowledge graph and query synthesizers, use RAGAS’s TestsetGenerator to create a synthetic dataset. This dataset will include questions, reference answers, and supporting contexts. 5. Evaluate and Iterate Load your synthetic dataset into an evaluation platform like LangSmith. Run your RAG pipeline against the test set, and use automated evaluators (for accuracy, helpfulness, style, etc.) to identify strengths and weaknesses. Tweak your pipeline and re-evaluate to drive improvements. Minimal Example Here’s a high-level pseudocode outline (see the notebook for full details): ````python 1. Load documents from langchain_community.document_loaders import DirectoryLoader path = ""data/"" loader = DirectoryLoader(path, glob=""*.md"") docs = loader.load() 2. 
Generate data from ragas.testset import TestsetGenerator from ragas.llms import LangchainLLMWrapper from ragas.embeddings import LangchainEmbeddingsWrapper from langchain_openai import ChatOpenAI from langchain_openai import OpenAIEmbeddings Initialize the generator with the LLM and embedding model generator_llm = LangchainLLMWrapper(ChatOpenAI(model=""gpt-4.1"")) generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings()) Create the test set generator generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings) dataset = generator.generate_with_langchain_docs(docs, testset_size=10) ```` dataset will now contain a set of questions, answers, and contexts that you can use to evaluate your RAG system. 💡 Try it yourself: Explore the hands-on notebook for synthetic data generation: 💡 Try it yourself: Explore the hands-on notebook for synthetic data generation: 04_Synthetic_Data_Generation', '<2-hop>\n\ntitle: ""Part 5: Advanced Metrics and Customization with Ragas"" date: 2025-04-28T05:00:00-06:00 layout: blog description: ""Explore advanced metrics and customization techniques in Ragas for evaluating LLM applications, including creating custom metrics, domain-specific evaluation, composite scoring, and best practices for building a comprehensive evaluation ecosystem."" categories: [""AI"", ""RAG"", ""Evaluation"", ""Ragas"",""Data""] coverImage: ""https://plus.unsplash.com/premium_photo-1661368994107-43200954c524?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"" readingTime: 9 published: true In our previous post, we explored how to generate comprehensive test datasets for evaluating LLM applications. Now, let\'s dive into one of Ragas\' most powerful capabilities: advanced metrics and custom evaluation approaches that address specialized evaluation needs. Beyond the Basics: Why Advanced Metrics Matter While Ragas\' core metrics cover fundamental evaluation aspects, real-world applications often have unique requirements: Domain-specific quality criteria: Legal, medical, or financial applications have specialized accuracy requirements Custom interaction patterns: Applications with unique conversation flows need tailored evaluation approaches Specialized capabilities: Features like reasoning, code generation, or structured output demand purpose-built metrics Business-specific KPIs: Aligning evaluation with business objectives requires customized metrics Let\'s explore how to extend Ragas\' capabilities to meet these specialized needs. Understanding Ragas\' Metric Architecture Before creating custom metrics, it\'s helpful to understand Ragas\' metric architecture: 1. Understand the Metric Base Classes All metrics in Ragas inherit from the abstract Metric class (see metrics/base.py). For most use cases, you’ll extend one of these: SingleTurnMetric: For metrics that evaluate a single question/response pair. MultiTurnMetric: For metrics that evaluate multi-turn conversations. MetricWithLLM: For metrics that require an LLM for evaluation. MetricWithEmbeddings: For metrics that use embeddings. You can mix these as needed (e.g., MetricWithLLM, SingleTurnMetric). 
Each metric implements specific scoring methods depending on its type: _single_turn_ascore: For single-turn metrics _multi_turn_ascore: For multi-turn metrics Creating Your First Custom Metric Let\'s create a custom metric that evaluates technical accuracy in programming explanations: ```python from dataclasses import dataclass, field from typing import Dict, Optional, Set import typing as t from ragas.metrics.base import MetricWithLLM, SingleTurnMetric from ragas.prompt import PydanticPrompt from ragas.metrics import MetricType, MetricOutputType from pydantic import BaseModel Define input/output models for the prompt class TechnicalAccuracyInput(BaseModel): question: str context: str response: str programming_language: str = ""python"" class TechnicalAccuracyOutput(BaseModel): score: float feedback: str Define the prompt class TechnicalAccuracyPrompt(PydanticPrompt[TechnicalAccuracyInput, TechnicalAccuracyOutput]): instruction: str = ( ""Evaluate the technical accuracy of the response to a programming question. "" ""Consider syntax correctness, algorithmic accuracy, and best practices."" ) input_model = TechnicalAccuracyInput output_model = TechnicalAccuracyOutput examples = [ # Add examples here ] Create the metric @dataclass class TechnicalAccuracy(MetricWithLLM, SingleTurnMetric): name: str = ""technical_accuracy"" _required_columns: Dict[MetricType, Set[str]] = field( default_factory=lambda: { MetricType.SINGLE_TURN: { ""user_input"", ""response"", } } ) output_type: Optional[MetricOutputType] = MetricOutputType.CONTINUOUS evaluation_prompt: PydanticPrompt = field(default_factory=TechnicalAccuracyPrompt) async def _single_turn_ascore(self, sample, callbacks) -> float: assert self.llm is not None, ""LLM must be set"" question = sample.user_input response = sample.response # Extract programming language from question if possible programming_language = ""python"" # Default languages = [""python"", ""javascript"", ""java"", ""c++"", ""rust"", ""go""] for lang in languages: if lang in question.lower(): programming_language = lang break # Get the context context = ""\\n"".join(sample.retrieved_contexts) if sample.retrieved_contexts else """" # Prepare input for prompt prompt_input = TechnicalAccuracyInput( question=question, context=context, response=response, programming_language=programming_language ) # Generate evaluation evaluation = await self.evaluation_prompt.generate( data=prompt_input, llm=self.llm, callbacks=callbacks ) return evaluation.score ``` Using the Custom Metric To use the custom metric, simply include it in your evaluation pipeline: ```python from langchain_openai import ChatOpenAI from ragas import SingleTurnSample from ragas.llms import LangchainLLMWrapper Initialize the LLM, you are going to OPENAI API key evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=""gpt-4o"")) test_data = { ""user_input"": ""Write a function to calculate the factorial of a number in Python."", ""retrieved_contexts"": [""Python is a programming language."", ""A factorial of a number n is the product of all positive integers less than or equal to n.""], ""response"": ""def factorial(n):\\n if n == 0:\\n return 1\\n else:\\n return n * factorial(n-1)"", } Create a sample sample = SingleTurnSample(**test_data) # Unpack the dictionary into the constructor technical_accuracy = TechnicalAccuracy(llm=evaluator_llm) score = await technical_accuracy.single_turn_ascore(sample) print(f""Technical Accuracy Score: {score}"") Note: The above code is a simplified example. 
In a real-world scenario, you would need to handle exceptions, `` You can also use theevaluate` function to evaluate a dataset: ```python from ragas import evaluate from ragas import evaluate results = evaluate( dataset, # Your dataset of samples metrics=[TechnicalAccuracy(), ...], llm=myevaluator_llm_llm ) ``` 💡 Try it yourself: Explore the hands-on notebook for synthetic data generation: 05_Advanced_Metrics_and_Customization']","ChatOpenAI can be integrated into a RAG evaluation pipeline for both synthetic data generation and advanced metric evaluation using Ragas by following a series of steps outlined in the provided context. First, for synthetic data generation, you initialize the generator with the LLM and embedding model by wrapping ChatOpenAI (for example, with LangchainLLMWrapper using ChatOpenAI(model=""gpt-4.1"")) and OpenAIEmbeddings. This setup is used with Ragas’s TestsetGenerator to create a synthetic dataset containing questions, answers, and contexts for RAG evaluation. Second, for advanced metric evaluation, ChatOpenAI can be used as the underlying LLM in custom metrics within Ragas. For instance, when creating a custom metric such as TechnicalAccuracy, you initialize the evaluator LLM with LangchainLLMWrapper(ChatOpenAI(model=""gpt-4o"")), and use it to score responses based on criteria like syntax correctness and algorithmic accuracy. The key steps are: (1) loading and preparing source documents, (2) initializing the LLM and embedding models with ChatOpenAI, (3) generating synthetic datasets with Ragas, and (4) evaluating the datasets using both built-in and custom metrics powered by ChatOpenAI as the LLM. This approach enables robust, repeatable evaluation and supports both standard and domain-specific assessment needs.",multi_hop_specific_query_synthesizer
13
+ "How does Ragas facilitate the comprehensive evaluation of RAG systems by addressing both their retrieval and generation components, and how does this approach differ when evaluating more complex AI agents that use tools and pursue specific goals?","['<1-hop>\n\ntitle: ""Part 6: Evaluating AI Agents: Beyond Simple Answers with Ragas"" date: 2025-04-28T06:00:00-06:00 layout: blog description: ""Learn how to evaluate complex AI agents using Ragas\' specialized metrics for goal accuracy, tool call accuracy, and topic adherence to build more reliable and effective agent-based applications."" categories: [""AI"", ""Agents"", ""Evaluation"", ""Ragas"", ""LLM""] coverImage: ""/images/ai_agent_evaluation.png"" readingTime: 8 published: true In our previous posts, we\'ve explored how Ragas evaluates RAG systems and enables custom metrics for specialized applications. As LLMs evolve beyond simple question-answering to become powerful AI agents, evaluation needs have grown more sophisticated too. In this post, we\'ll explore Ragas\' specialized metrics for evaluating AI agents that engage in multi-turn interactions, use tools, and work toward specific goals. ', '<2-hop>\n\ntitle: ""Part 3: Evaluating RAG Systems with Ragas"" date: 2025-04-26T20:00:00-06:00 layout: blog description: ""Learn specialized techniques for comprehensive evaluation of Retrieval-Augmented Generation systems using Ragas, including metrics for retrieval quality, generation quality, and end-to-end performance."" categories: [""AI"", ""RAG"", ""Evaluation"", ""Ragas""] coverImage: ""https://images.unsplash.com/photo-1743796055664-3473eedab36e?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"" readingTime: 14 published: true In our previous post, we covered the fundamentals of setting up evaluation workflows with Ragas. Now, let\'s focus specifically on evaluating Retrieval-Augmented Generation (RAG) systems, which present unique evaluation challenges due to their multi-component nature. Understanding RAG Systems: More Than the Sum of Their Parts RAG systems combine two critical capabilities: 1. Retrieval: Finding relevant information from a knowledge base 2. Generation: Creating coherent, accurate responses based on retrieved information This dual nature means evaluation must address both components while also assessing their interaction. A system might retrieve perfect information but generate poor responses, or generate excellent prose from irrelevant retrieved content. The RAG Evaluation Triad Effective RAG evaluation requires examining three key dimensions: Retrieval Quality: How well does the system find relevant information? Generation Quality: How well does the system produce responses from retrieved information? End-to-End Performance: How well does the complete system satisfy user needs? Let\'s explore how Ragas helps evaluate each dimension of RAG systems.']","Ragas facilitates the comprehensive evaluation of RAG systems by providing specialized metrics that assess both the retrieval and generation components. For RAG systems, Ragas evaluates retrieval quality (how well relevant information is found), generation quality (how accurately and coherently responses are produced from retrieved information), and end-to-end performance (how well the system satisfies user needs). This approach ensures that both the individual components and their interaction are thoroughly assessed. 
When evaluating more complex AI agents, Ragas extends its evaluation with additional specialized metrics for goal accuracy, tool call accuracy, and topic adherence, reflecting the increased sophistication required for agents that engage in multi-turn interactions, use tools, and work toward specific goals.",multi_hop_specific_query_synthesizer
py-src/lets_talk/config.py CHANGED
@@ -12,6 +12,8 @@ QDRANT_COLLECTION = os.environ.get("QDRANT_COLLECTION", "thedataguy_documents")
12
  BLOG_BASE_URL = os.environ.get("BLOG_BASE_URL", "https://thedataguy.pro/blog/")
13
  LLM_MODEL = os.environ.get("LLM_MODEL", "gpt-4o-mini")
14
  LLM_TEMPERATURE = float(os.environ.get("TEMPERATURE", "0"))
 
 
15
  MAX_SEARCH_RESULTS = int(os.environ.get("MAX_SEARCH_RESULTS", "5"))
16
 
17
 
 
12
  BLOG_BASE_URL = os.environ.get("BLOG_BASE_URL", "https://thedataguy.pro/blog/")
13
  LLM_MODEL = os.environ.get("LLM_MODEL", "gpt-4o-mini")
14
  LLM_TEMPERATURE = float(os.environ.get("TEMPERATURE", "0"))
15
+ SDG_LLM_MODEL = os.environ.get("SDG_LLM_MODEL", "gpt-4.1")
16
+ EVAL_LLM_MODEL = os.environ.get("EVAL_LLM_MODEL", "gpt-4.1")
17
  MAX_SEARCH_RESULTS = int(os.environ.get("MAX_SEARCH_RESULTS", "5"))
18
 
19
 
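The two new settings choose which models drive synthetic data generation and Ragas evaluation. A minimal sketch (hypothetical model values) of overriding them via the environment before `lets_talk.config` is imported:

```python
# Hypothetical overrides: set these before importing lets_talk.config,
# otherwise both settings fall back to the "gpt-4.1" defaults above.
import os

os.environ["SDG_LLM_MODEL"] = "gpt-4.1-mini"  # model used to generate the synthetic test set
os.environ["EVAL_LLM_MODEL"] = "gpt-4o"       # model used as the Ragas evaluator

from lets_talk.config import SDG_LLM_MODEL, EVAL_LLM_MODEL
print(SDG_LLM_MODEL, EVAL_LLM_MODEL)
```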
py-src/lets_talk/utils/eval.py ADDED
@@ -0,0 +1,53 @@
1
+ from langchain_huggingface import HuggingFaceEmbeddings
2
+ from ragas import EvaluationDataset, RunConfig, evaluate
3
+ from ragas.testset import TestsetGenerator
4
+ from ragas.llms import LangchainLLMWrapper
5
+ from ragas.embeddings import LangchainEmbeddingsWrapper
6
+ from langchain_openai import ChatOpenAI
7
+ from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
8
+ from lets_talk.config import EMBEDDING_MODEL, SDG_LLM_MODEL, EVAL_LLM_MODEL
9
+
10
+ # TODO: make these evaluation helpers more generic
11
+
12
+ def generate_testset(docs, llm_model=SDG_LLM_MODEL, embedding_model=EMBEDDING_MODEL, testset_size=100):
13
+ """
14
+ Generate a test set from the provided documents using the TestsetGenerator.
15
+
16
+ Args:
17
+ docs (list): A list of documents to generate the test set from.
+ llm_model (str): Chat model used for generation (defaults to SDG_LLM_MODEL).
+ embedding_model (str): HuggingFace embedding model name (defaults to EMBEDDING_MODEL).
+ testset_size (int): Number of samples to generate (defaults to 100).
18
+
19
+ Returns:
20
+ dataset: The generated test set.
21
+ """
22
+ # Initialize the generator with the LLM and embedding model
23
+ generator_llm = LangchainLLMWrapper(ChatOpenAI(model=llm_model))
24
+ generator_embeddings = LangchainEmbeddingsWrapper(HuggingFaceEmbeddings(model_name=embedding_model))
25
+
26
+ generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
27
+ dataset = generator.generate_with_langchain_docs(docs, testset_size=testset_size)
28
+
29
+ return dataset
30
+
31
+
32
+ def run_rag_chain(dataset, chain):
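+ """Invoke the RAG chain for each test row and attach the generated response and retrieved contexts to the sample."""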
33
+ from tqdm import tqdm
34
+
35
+ for test_row in tqdm(dataset):
36
+ response = chain.invoke({"question" : test_row.eval_sample.user_input})
37
+ test_row.eval_sample.response = response["response"].content
38
+ test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
39
+
40
+ return dataset
41
+
42
+ def run_ragas_evaluation(dataset, llm=EVAL_LLM_MODEL):
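+ """Score the dataset with Ragas metrics (context recall, faithfulness, factual correctness, response relevancy, context entity recall, noise sensitivity) using an OpenAI evaluator model."""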
43
+ evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())
44
+ evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=llm))
45
+ custom_run_config = RunConfig(timeout=360)
46
+ result = evaluate(
47
+ dataset=evaluation_dataset,
48
+ metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
49
+ llm=evaluator_llm,
50
+ run_config=custom_run_config
51
+ )
52
+
53
+ return result
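Taken together, the new helpers cover the full evaluation loop: generate a synthetic test set, run it through a RAG chain, then score it with Ragas. A minimal usage sketch is below; it assumes `docs` is a list of LangChain documents, `rag_chain` is a chain whose output dict exposes `response` (a message with `.content`) and `context` (retrieved documents) as `run_rag_chain` expects, and that an OpenAI API key is configured.

```python
# Minimal sketch: end-to-end evaluation with the helpers in py-src/lets_talk/utils/eval.py.
# `docs` and `rag_chain` are assumed to exist already.
from lets_talk.utils.eval import generate_testset, run_rag_chain, run_ragas_evaluation

testset = generate_testset(docs, testset_size=10)  # synthetic questions, references, contexts
testset = run_rag_chain(testset, rag_chain)        # fill in responses and retrieved contexts
result = run_ragas_evaluation(testset)             # Ragas metric scores
print(result)
```

The returned result can be converted to a DataFrame with `to_pandas()` for inspection or export.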
py-src/notebooks/05_SGD_Eval.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
py-src/notebooks/update_blog_data.ipynb CHANGED
@@ -12,7 +12,7 @@
12
  },
13
  {
14
  "cell_type": "code",
15
- "execution_count": 1,
16
  "id": "6ec048b4",
17
  "metadata": {},
18
  "outputs": [],
@@ -21,7 +21,16 @@
21
  "import os\n",
22
  "from pathlib import Path\n",
23
  "from dotenv import load_dotenv\n",
24
- "import importlib.util\n"
 
 
 
 
 
 
 
 
 
25
  ]
26
  },
27
  {
@@ -39,7 +48,7 @@
39
  },
40
  {
41
  "cell_type": "code",
42
- "execution_count": 7,
43
  "id": "3d56f688",
44
  "metadata": {},
45
  "outputs": [
@@ -66,7 +75,7 @@
66
  }
67
  ],
68
  "source": [
69
- "import blog_utils\n",
70
  "\n",
71
  "docs = blog_utils.load_blog_posts()\n",
72
  "docs = blog_utils.update_document_metadata(docs)\n",
 
12
  },
13
  {
14
  "cell_type": "code",
15
+ "execution_count": null,
16
  "id": "6ec048b4",
17
  "metadata": {},
18
  "outputs": [],
 
21
  "import os\n",
22
  "from pathlib import Path\n",
23
  "from dotenv import load_dotenv\n",
24
+ "import importlib.util\n",
25
+ "\n",
26
+ "import sys\n",
27
+ "import os\n",
28
+ "\n",
29
+ "# Add the project root to the Python path\n",
30
+ "project_root = os.path.abspath(os.path.join(os.getcwd(), \"../\"))\n",
31
+ "print(f\"Adding project root to sys.path: {project_root}\")\n",
32
+ "if project_root not in sys.path:\n",
33
+ "\tsys.path.append(project_root)\n"
34
  ]
35
  },
36
  {
 
48
  },
49
  {
50
  "cell_type": "code",
51
+ "execution_count": null,
52
  "id": "3d56f688",
53
  "metadata": {},
54
  "outputs": [
 
75
  }
76
  ],
77
  "source": [
78
+ "import lets_talk.utils.blog as blog_utils\n",
79
  "\n",
80
  "docs = blog_utils.load_blog_posts()\n",
81
  "docs = blog_utils.update_document_metadata(docs)\n",
pyproject.toml CHANGED
@@ -10,6 +10,7 @@ dependencies = [
10
  "feedparser>=6.0.11",
11
  "ipykernel>=6.29.5",
12
  "ipython>=9.2.0",
 
13
  "langchain>=0.3.25",
14
  "langchain-community>=0.3.23",
15
  "langchain-core>=0.3.59",
@@ -22,6 +23,8 @@ dependencies = [
22
  "pandas>=2.2.3",
23
  "python-dotenv>=1.1.0",
24
  "qdrant-client>=1.14.2",
 
 
25
  "unstructured[md]>=0.17.2",
26
  "websockets>=15.0.1",
27
  ]
 
10
  "feedparser>=6.0.11",
11
  "ipykernel>=6.29.5",
12
  "ipython>=9.2.0",
13
+ "ipywidgets>=8.1.7",
14
  "langchain>=0.3.25",
15
  "langchain-community>=0.3.23",
16
  "langchain-core>=0.3.59",
 
23
  "pandas>=2.2.3",
24
  "python-dotenv>=1.1.0",
25
  "qdrant-client>=1.14.2",
26
+ "ragas>=0.2.15",
27
+ "tqdm>=4.67.1",
28
  "unstructured[md]>=0.17.2",
29
  "websockets>=15.0.1",
30
  ]
uv.lock CHANGED
@@ -105,6 +105,15 @@ wheels = [
105
  { url = "https://files.pythonhosted.org/packages/a1/ee/48ca1a7c89ffec8b6a0c5d02b89c305671d5ffd8d3c94acf8b8c408575bb/anyio-4.9.0-py3-none-any.whl", hash = "sha256:9f76d541cad6e36af7beb62e978876f3b41e3e04f2c1fbf0884604c0a9c4d93c", size = 100916 },
106
  ]
107
 
 
 
 
 
 
 
 
 
 
108
  [[package]]
109
  name = "appnope"
110
  version = "0.1.4"
@@ -374,6 +383,30 @@ wheels = [
374
  { url = "https://files.pythonhosted.org/packages/c3/be/d0d44e092656fe7a06b55e6103cbce807cdbdee17884a5367c68c9860853/dataclasses_json-0.6.7-py3-none-any.whl", hash = "sha256:0dbf33f26c8d5305befd61b39d2b3414e8a407bedc2834dea9b8d642666fb40a", size = 28686 },
375
  ]
376
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
377
  [[package]]
378
  name = "debugpy"
379
  version = "1.8.14"
@@ -408,6 +441,24 @@ wheels = [
408
  { url = "https://files.pythonhosted.org/packages/6e/c6/ac0b6c1e2d138f1002bcf799d330bd6d85084fece321e662a14223794041/Deprecated-1.2.18-py2.py3-none-any.whl", hash = "sha256:bd5011788200372a32418f888e326a09ff80d0214bd961147cfed01b5c018eec", size = 9998 },
409
  ]
410
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
411
  [[package]]
412
  name = "distro"
413
  version = "1.9.0"
@@ -533,11 +584,16 @@ wheels = [
533
 
534
  [[package]]
535
  name = "fsspec"
536
- version = "2025.3.2"
537
  source = { registry = "https://pypi.org/simple" }
538
- sdist = { url = "https://files.pythonhosted.org/packages/45/d8/8425e6ba5fcec61a1d16e41b1b71d2bf9344f1fe48012c2b48b9620feae5/fsspec-2025.3.2.tar.gz", hash = "sha256:e52c77ef398680bbd6a98c0e628fbc469491282981209907bbc8aea76a04fdc6", size = 299281 }
539
  wheels = [
540
- { url = "https://files.pythonhosted.org/packages/44/4b/e0cfc1a6f17e990f3e64b7d941ddc4acdc7b19d6edd51abf495f32b1a9e4/fsspec-2025.3.2-py3-none-any.whl", hash = "sha256:2daf8dc3d1dfa65b6aa37748d112773a7a08416f6c70d96b264c96476ecaf711", size = 194435 },
 
 
 
 
 
541
  ]
542
 
543
  [[package]]
@@ -811,6 +867,22 @@ wheels = [
811
  { url = "https://files.pythonhosted.org/packages/d9/33/1f075bf72b0b747cb3288d011319aaf64083cf2efef8354174e3ed4540e2/ipython_pygments_lexers-1.1.1-py3-none-any.whl", hash = "sha256:a9462224a505ade19a605f71f8fa63c2048833ce50abc86768a0d81d876dc81c", size = 8074 },
812
  ]
813
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
814
  [[package]]
815
  name = "jedi"
816
  version = "0.19.2"
@@ -918,6 +990,15 @@ wheels = [
918
  { url = "https://files.pythonhosted.org/packages/c9/fb/108ecd1fe961941959ad0ee4e12ee7b8b1477247f30b1fdfd83ceaf017f0/jupyter_core-5.7.2-py3-none-any.whl", hash = "sha256:4f7315d2f6b4bcf2e3e7cb6e46772eba760ae459cd1f59d29eb57b0a01bd7409", size = 28965 },
919
  ]
920
 
 
 
 
 
 
 
 
 
 
921
  [[package]]
922
  name = "langchain"
923
  version = "0.3.25"
@@ -1135,6 +1216,7 @@ dependencies = [
1135
  { name = "feedparser" },
1136
  { name = "ipykernel" },
1137
  { name = "ipython" },
 
1138
  { name = "langchain" },
1139
  { name = "langchain-community" },
1140
  { name = "langchain-core" },
@@ -1147,6 +1229,8 @@ dependencies = [
1147
  { name = "pandas" },
1148
  { name = "python-dotenv" },
1149
  { name = "qdrant-client" },
 
 
1150
  { name = "unstructured", extra = ["md"] },
1151
  { name = "websockets" },
1152
  ]
@@ -1158,6 +1242,7 @@ requires-dist = [
1158
  { name = "feedparser", specifier = ">=6.0.11" },
1159
  { name = "ipykernel", specifier = ">=6.29.5" },
1160
  { name = "ipython", specifier = ">=9.2.0" },
 
1161
  { name = "langchain", specifier = ">=0.3.25" },
1162
  { name = "langchain-community", specifier = ">=0.3.23" },
1163
  { name = "langchain-core", specifier = ">=0.3.59" },
@@ -1170,6 +1255,8 @@ requires-dist = [
1170
  { name = "pandas", specifier = ">=2.2.3" },
1171
  { name = "python-dotenv", specifier = ">=1.1.0" },
1172
  { name = "qdrant-client", specifier = ">=1.14.2" },
 
 
1173
  { name = "unstructured", extras = ["md"], specifier = ">=0.17.2" },
1174
  { name = "websockets", specifier = ">=15.0.1" },
1175
  ]
@@ -1360,6 +1447,22 @@ wheels = [
1360
  { url = "https://files.pythonhosted.org/packages/96/10/7d526c8974f017f1e7ca584c71ee62a638e9334d8d33f27d7cdfc9ae79e4/multidict-6.4.3-py3-none-any.whl", hash = "sha256:59fe01ee8e2a1e8ceb3f6dbb216b09c8d9f4ef1c22c4fc825d045a147fa2ebc9", size = 10400 },
1361
  ]
1362
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1363
  [[package]]
1364
  name = "mypy-extensions"
1365
  version = "1.1.0"
@@ -2484,6 +2587,32 @@ wheels = [
2484
  { url = "https://files.pythonhosted.org/packages/8e/37/efad0257dc6e593a18957422533ff0f87ede7c9c6ea010a2177d738fb82f/pure_eval-0.2.3-py3-none-any.whl", hash = "sha256:1db8e35b67b3d218d818ae653e27f06c3aa420901fa7b081ca98cbedc874e0d0", size = 11842 },
2485
  ]
2486
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
2487
  [[package]]
2488
  name = "pycparser"
2489
  version = "2.22"
@@ -2748,6 +2877,29 @@ wheels = [
2748
  { url = "https://files.pythonhosted.org/packages/e4/52/f49b0aa96253010f57cf80315edecec4f469e7a39c1ed92bf727fa290e57/qdrant_client-1.14.2-py3-none-any.whl", hash = "sha256:7c283b1f0e71db9c21b85d898fb395791caca2a6d56ee751da96d797b001410c", size = 327691 },
2749
  ]
2750
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
2751
  [[package]]
2752
  name = "rapidfuzz"
2753
  version = "3.13.0"
@@ -3491,6 +3643,15 @@ wheels = [
3491
  { url = "https://files.pythonhosted.org/packages/fa/a8/5b41e0da817d64113292ab1f8247140aac61cbf6cfd085d6a0fa77f4984f/websockets-15.0.1-py3-none-any.whl", hash = "sha256:f7a866fbc1e97b5c617ee4116daaa09b722101d4a3c170c787450ba409f9736f", size = 169743 },
3492
  ]
3493
 
 
 
 
 
 
 
 
 
 
3494
  [[package]]
3495
  name = "wrapt"
3496
  version = "1.17.2"
 
105
  { url = "https://files.pythonhosted.org/packages/a1/ee/48ca1a7c89ffec8b6a0c5d02b89c305671d5ffd8d3c94acf8b8c408575bb/anyio-4.9.0-py3-none-any.whl", hash = "sha256:9f76d541cad6e36af7beb62e978876f3b41e3e04f2c1fbf0884604c0a9c4d93c", size = 100916 },
106
  ]
107
 
108
+ [[package]]
109
+ name = "appdirs"
110
+ version = "1.4.4"
111
+ source = { registry = "https://pypi.org/simple" }
112
+ sdist = { url = "https://files.pythonhosted.org/packages/d7/d8/05696357e0311f5b5c316d7b95f46c669dd9c15aaeecbb48c7d0aeb88c40/appdirs-1.4.4.tar.gz", hash = "sha256:7d5d0167b2b1ba821647616af46a749d1c653740dd0d2415100fe26e27afdf41", size = 13470 }
113
+ wheels = [
114
+ { url = "https://files.pythonhosted.org/packages/3b/00/2344469e2084fb287c2e0b57b72910309874c3245463acd6cf5e3db69324/appdirs-1.4.4-py2.py3-none-any.whl", hash = "sha256:a841dacd6b99318a741b166adb07e19ee71a274450e68237b4650ca1055ab128", size = 9566 },
115
+ ]
116
+
117
  [[package]]
118
  name = "appnope"
119
  version = "0.1.4"
 
383
  { url = "https://files.pythonhosted.org/packages/c3/be/d0d44e092656fe7a06b55e6103cbce807cdbdee17884a5367c68c9860853/dataclasses_json-0.6.7-py3-none-any.whl", hash = "sha256:0dbf33f26c8d5305befd61b39d2b3414e8a407bedc2834dea9b8d642666fb40a", size = 28686 },
384
  ]
385
 
386
+ [[package]]
387
+ name = "datasets"
388
+ version = "3.6.0"
389
+ source = { registry = "https://pypi.org/simple" }
390
+ dependencies = [
391
+ { name = "dill" },
392
+ { name = "filelock" },
393
+ { name = "fsspec", extra = ["http"] },
394
+ { name = "huggingface-hub" },
395
+ { name = "multiprocess" },
396
+ { name = "numpy" },
397
+ { name = "packaging" },
398
+ { name = "pandas" },
399
+ { name = "pyarrow" },
400
+ { name = "pyyaml" },
401
+ { name = "requests" },
402
+ { name = "tqdm" },
403
+ { name = "xxhash" },
404
+ ]
405
+ sdist = { url = "https://files.pythonhosted.org/packages/1a/89/d3d6fef58a488f8569c82fd293ab7cbd4250244d67f425dcae64c63800ea/datasets-3.6.0.tar.gz", hash = "sha256:1b2bf43b19776e2787e181cfd329cb0ca1a358ea014780c3581e0f276375e041", size = 569336 }
406
+ wheels = [
407
+ { url = "https://files.pythonhosted.org/packages/20/34/a08b0ee99715eaba118cbe19a71f7b5e2425c2718ef96007c325944a1152/datasets-3.6.0-py3-none-any.whl", hash = "sha256:25000c4a2c0873a710df127d08a202a06eab7bf42441a6bc278b499c2f72cd1b", size = 491546 },
408
+ ]
409
+
410
  [[package]]
411
  name = "debugpy"
412
  version = "1.8.14"
 
441
  { url = "https://files.pythonhosted.org/packages/6e/c6/ac0b6c1e2d138f1002bcf799d330bd6d85084fece321e662a14223794041/Deprecated-1.2.18-py2.py3-none-any.whl", hash = "sha256:bd5011788200372a32418f888e326a09ff80d0214bd961147cfed01b5c018eec", size = 9998 },
442
  ]
443
 
444
+ [[package]]
445
+ name = "dill"
446
+ version = "0.3.8"
447
+ source = { registry = "https://pypi.org/simple" }
448
+ sdist = { url = "https://files.pythonhosted.org/packages/17/4d/ac7ffa80c69ea1df30a8aa11b3578692a5118e7cd1aa157e3ef73b092d15/dill-0.3.8.tar.gz", hash = "sha256:3ebe3c479ad625c4553aca177444d89b486b1d84982eeacded644afc0cf797ca", size = 184847 }
449
+ wheels = [
450
+ { url = "https://files.pythonhosted.org/packages/c9/7a/cef76fd8438a42f96db64ddaa85280485a9c395e7df3db8158cfec1eee34/dill-0.3.8-py3-none-any.whl", hash = "sha256:c36ca9ffb54365bdd2f8eb3eff7d2a21237f8452b57ace88b1ac615b7e815bd7", size = 116252 },
451
+ ]
452
+
453
+ [[package]]
454
+ name = "diskcache"
455
+ version = "5.6.3"
456
+ source = { registry = "https://pypi.org/simple" }
457
+ sdist = { url = "https://files.pythonhosted.org/packages/3f/21/1c1ffc1a039ddcc459db43cc108658f32c57d271d7289a2794e401d0fdb6/diskcache-5.6.3.tar.gz", hash = "sha256:2c3a3fa2743d8535d832ec61c2054a1641f41775aa7c556758a109941e33e4fc", size = 67916 }
458
+ wheels = [
459
+ { url = "https://files.pythonhosted.org/packages/3f/27/4570e78fc0bf5ea0ca45eb1de3818a23787af9b390c0b0a0033a1b8236f9/diskcache-5.6.3-py3-none-any.whl", hash = "sha256:5e31b2d5fbad117cc363ebaf6b689474db18a1f6438bc82358b024abd4c2ca19", size = 45550 },
460
+ ]
461
+
462
  [[package]]
463
  name = "distro"
464
  version = "1.9.0"
 
584
 
585
  [[package]]
586
  name = "fsspec"
587
+ version = "2025.3.0"
588
  source = { registry = "https://pypi.org/simple" }
589
+ sdist = { url = "https://files.pythonhosted.org/packages/34/f4/5721faf47b8c499e776bc34c6a8fc17efdf7fdef0b00f398128bc5dcb4ac/fsspec-2025.3.0.tar.gz", hash = "sha256:a935fd1ea872591f2b5148907d103488fc523295e6c64b835cfad8c3eca44972", size = 298491 }
590
  wheels = [
591
+ { url = "https://files.pythonhosted.org/packages/56/53/eb690efa8513166adef3e0669afd31e95ffde69fb3c52ec2ac7223ed6018/fsspec-2025.3.0-py3-none-any.whl", hash = "sha256:efb87af3efa9103f94ca91a7f8cb7a4df91af9f74fc106c9c7ea0efd7277c1b3", size = 193615 },
592
+ ]
593
+
594
+ [package.optional-dependencies]
595
+ http = [
596
+ { name = "aiohttp" },
597
  ]
598
 
599
  [[package]]
 
867
  { url = "https://files.pythonhosted.org/packages/d9/33/1f075bf72b0b747cb3288d011319aaf64083cf2efef8354174e3ed4540e2/ipython_pygments_lexers-1.1.1-py3-none-any.whl", hash = "sha256:a9462224a505ade19a605f71f8fa63c2048833ce50abc86768a0d81d876dc81c", size = 8074 },
868
  ]
869
 
870
+ [[package]]
871
+ name = "ipywidgets"
872
+ version = "8.1.7"
873
+ source = { registry = "https://pypi.org/simple" }
874
+ dependencies = [
875
+ { name = "comm" },
876
+ { name = "ipython" },
877
+ { name = "jupyterlab-widgets" },
878
+ { name = "traitlets" },
879
+ { name = "widgetsnbextension" },
880
+ ]
881
+ sdist = { url = "https://files.pythonhosted.org/packages/3e/48/d3dbac45c2814cb73812f98dd6b38bbcc957a4e7bb31d6ea9c03bf94ed87/ipywidgets-8.1.7.tar.gz", hash = "sha256:15f1ac050b9ccbefd45dccfbb2ef6bed0029d8278682d569d71b8dd96bee0376", size = 116721 }
882
+ wheels = [
883
+ { url = "https://files.pythonhosted.org/packages/58/6a/9166369a2f092bd286d24e6307de555d63616e8ddb373ebad2b5635ca4cd/ipywidgets-8.1.7-py3-none-any.whl", hash = "sha256:764f2602d25471c213919b8a1997df04bef869251db4ca8efba1b76b1bd9f7bb", size = 139806 },
884
+ ]
885
+
886
  [[package]]
887
  name = "jedi"
888
  version = "0.19.2"
 
990
  { url = "https://files.pythonhosted.org/packages/c9/fb/108ecd1fe961941959ad0ee4e12ee7b8b1477247f30b1fdfd83ceaf017f0/jupyter_core-5.7.2-py3-none-any.whl", hash = "sha256:4f7315d2f6b4bcf2e3e7cb6e46772eba760ae459cd1f59d29eb57b0a01bd7409", size = 28965 },
991
  ]
992
 
993
+ [[package]]
994
+ name = "jupyterlab-widgets"
995
+ version = "3.0.15"
996
+ source = { registry = "https://pypi.org/simple" }
997
+ sdist = { url = "https://files.pythonhosted.org/packages/b9/7d/160595ca88ee87ac6ba95d82177d29ec60aaa63821d3077babb22ce031a5/jupyterlab_widgets-3.0.15.tar.gz", hash = "sha256:2920888a0c2922351a9202817957a68c07d99673504d6cd37345299e971bb08b", size = 213149 }
998
+ wheels = [
999
+ { url = "https://files.pythonhosted.org/packages/43/6a/ca128561b22b60bd5a0c4ea26649e68c8556b82bc70a0c396eebc977fe86/jupyterlab_widgets-3.0.15-py3-none-any.whl", hash = "sha256:d59023d7d7ef71400d51e6fee9a88867f6e65e10a4201605d2d7f3e8f012a31c", size = 216571 },
1000
+ ]
1001
+
1002
  [[package]]
1003
  name = "langchain"
1004
  version = "0.3.25"
 
1216
  { name = "feedparser" },
1217
  { name = "ipykernel" },
1218
  { name = "ipython" },
1219
+ { name = "ipywidgets" },
1220
  { name = "langchain" },
1221
  { name = "langchain-community" },
1222
  { name = "langchain-core" },
 
1229
  { name = "pandas" },
1230
  { name = "python-dotenv" },
1231
  { name = "qdrant-client" },
1232
+ { name = "ragas" },
1233
+ { name = "tqdm" },
1234
  { name = "unstructured", extra = ["md"] },
1235
  { name = "websockets" },
1236
  ]
 
1242
  { name = "feedparser", specifier = ">=6.0.11" },
1243
  { name = "ipykernel", specifier = ">=6.29.5" },
1244
  { name = "ipython", specifier = ">=9.2.0" },
1245
+ { name = "ipywidgets", specifier = ">=8.1.7" },
1246
  { name = "langchain", specifier = ">=0.3.25" },
1247
  { name = "langchain-community", specifier = ">=0.3.23" },
1248
  { name = "langchain-core", specifier = ">=0.3.59" },
 
1255
  { name = "pandas", specifier = ">=2.2.3" },
1256
  { name = "python-dotenv", specifier = ">=1.1.0" },
1257
  { name = "qdrant-client", specifier = ">=1.14.2" },
1258
+ { name = "ragas", specifier = ">=0.2.15" },
1259
+ { name = "tqdm", specifier = ">=4.67.1" },
1260
  { name = "unstructured", extras = ["md"], specifier = ">=0.17.2" },
1261
  { name = "websockets", specifier = ">=15.0.1" },
1262
  ]
 
1447
  { url = "https://files.pythonhosted.org/packages/96/10/7d526c8974f017f1e7ca584c71ee62a638e9334d8d33f27d7cdfc9ae79e4/multidict-6.4.3-py3-none-any.whl", hash = "sha256:59fe01ee8e2a1e8ceb3f6dbb216b09c8d9f4ef1c22c4fc825d045a147fa2ebc9", size = 10400 },
1448
  ]
1449
 
1450
+ [[package]]
1451
+ name = "multiprocess"
1452
+ version = "0.70.16"
1453
+ source = { registry = "https://pypi.org/simple" }
1454
+ dependencies = [
1455
+ { name = "dill" },
1456
+ ]
1457
+ sdist = { url = "https://files.pythonhosted.org/packages/b5/ae/04f39c5d0d0def03247c2893d6f2b83c136bf3320a2154d7b8858f2ba72d/multiprocess-0.70.16.tar.gz", hash = "sha256:161af703d4652a0e1410be6abccecde4a7ddffd19341be0a7011b94aeb171ac1", size = 1772603 }
1458
+ wheels = [
1459
+ { url = "https://files.pythonhosted.org/packages/bc/f7/7ec7fddc92e50714ea3745631f79bd9c96424cb2702632521028e57d3a36/multiprocess-0.70.16-py310-none-any.whl", hash = "sha256:c4a9944c67bd49f823687463660a2d6daae94c289adff97e0f9d696ba6371d02", size = 134824 },
1460
+ { url = "https://files.pythonhosted.org/packages/50/15/b56e50e8debaf439f44befec5b2af11db85f6e0f344c3113ae0be0593a91/multiprocess-0.70.16-py311-none-any.whl", hash = "sha256:af4cabb0dac72abfb1e794fa7855c325fd2b55a10a44628a3c1ad3311c04127a", size = 143519 },
1461
+ { url = "https://files.pythonhosted.org/packages/0a/7d/a988f258104dcd2ccf1ed40fdc97e26c4ac351eeaf81d76e266c52d84e2f/multiprocess-0.70.16-py312-none-any.whl", hash = "sha256:fc0544c531920dde3b00c29863377f87e1632601092ea2daca74e4beb40faa2e", size = 146741 },
1462
+ { url = "https://files.pythonhosted.org/packages/ea/89/38df130f2c799090c978b366cfdf5b96d08de5b29a4a293df7f7429fa50b/multiprocess-0.70.16-py38-none-any.whl", hash = "sha256:a71d82033454891091a226dfc319d0cfa8019a4e888ef9ca910372a446de4435", size = 132628 },
1463
+ { url = "https://files.pythonhosted.org/packages/da/d9/f7f9379981e39b8c2511c9e0326d212accacb82f12fbfdc1aa2ce2a7b2b6/multiprocess-0.70.16-py39-none-any.whl", hash = "sha256:a0bafd3ae1b732eac64be2e72038231c1ba97724b60b09400d68f229fcc2fbf3", size = 133351 },
1464
+ ]
1465
+
1466
  [[package]]
1467
  name = "mypy-extensions"
1468
  version = "1.1.0"
 
2587
  { url = "https://files.pythonhosted.org/packages/8e/37/efad0257dc6e593a18957422533ff0f87ede7c9c6ea010a2177d738fb82f/pure_eval-0.2.3-py3-none-any.whl", hash = "sha256:1db8e35b67b3d218d818ae653e27f06c3aa420901fa7b081ca98cbedc874e0d0", size = 11842 },
2588
  ]
2589
 
2590
+ [[package]]
2591
+ name = "pyarrow"
2592
+ version = "20.0.0"
2593
+ source = { registry = "https://pypi.org/simple" }
2594
+ sdist = { url = "https://files.pythonhosted.org/packages/a2/ee/a7810cb9f3d6e9238e61d312076a9859bf3668fd21c69744de9532383912/pyarrow-20.0.0.tar.gz", hash = "sha256:febc4a913592573c8d5805091a6c2b5064c8bd6e002131f01061797d91c783c1", size = 1125187 }
2595
+ wheels = [
2596
+ { url = "https://files.pythonhosted.org/packages/9b/aa/daa413b81446d20d4dad2944110dcf4cf4f4179ef7f685dd5a6d7570dc8e/pyarrow-20.0.0-cp313-cp313-macosx_12_0_arm64.whl", hash = "sha256:a15532e77b94c61efadde86d10957950392999503b3616b2ffcef7621a002893", size = 30798501 },
2597
+ { url = "https://files.pythonhosted.org/packages/ff/75/2303d1caa410925de902d32ac215dc80a7ce7dd8dfe95358c165f2adf107/pyarrow-20.0.0-cp313-cp313-macosx_12_0_x86_64.whl", hash = "sha256:dd43f58037443af715f34f1322c782ec463a3c8a94a85fdb2d987ceb5658e061", size = 32277895 },
2598
+ { url = "https://files.pythonhosted.org/packages/92/41/fe18c7c0b38b20811b73d1bdd54b1fccba0dab0e51d2048878042d84afa8/pyarrow-20.0.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:aa0d288143a8585806e3cc7c39566407aab646fb9ece164609dac1cfff45f6ae", size = 41327322 },
2599
+ { url = "https://files.pythonhosted.org/packages/da/ab/7dbf3d11db67c72dbf36ae63dcbc9f30b866c153b3a22ef728523943eee6/pyarrow-20.0.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:b6953f0114f8d6f3d905d98e987d0924dabce59c3cda380bdfaa25a6201563b4", size = 42411441 },
2600
+ { url = "https://files.pythonhosted.org/packages/90/c3/0c7da7b6dac863af75b64e2f827e4742161128c350bfe7955b426484e226/pyarrow-20.0.0-cp313-cp313-manylinux_2_28_aarch64.whl", hash = "sha256:991f85b48a8a5e839b2128590ce07611fae48a904cae6cab1f089c5955b57eb5", size = 40677027 },
2601
+ { url = "https://files.pythonhosted.org/packages/be/27/43a47fa0ff9053ab5203bb3faeec435d43c0d8bfa40179bfd076cdbd4e1c/pyarrow-20.0.0-cp313-cp313-manylinux_2_28_x86_64.whl", hash = "sha256:97c8dc984ed09cb07d618d57d8d4b67a5100a30c3818c2fb0b04599f0da2de7b", size = 42281473 },
2602
+ { url = "https://files.pythonhosted.org/packages/bc/0b/d56c63b078876da81bbb9ba695a596eabee9b085555ed12bf6eb3b7cab0e/pyarrow-20.0.0-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:9b71daf534f4745818f96c214dbc1e6124d7daf059167330b610fc69b6f3d3e3", size = 42893897 },
2603
+ { url = "https://files.pythonhosted.org/packages/92/ac/7d4bd020ba9145f354012838692d48300c1b8fe5634bfda886abcada67ed/pyarrow-20.0.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:e8b88758f9303fa5a83d6c90e176714b2fd3852e776fc2d7e42a22dd6c2fb368", size = 44543847 },
2604
+ { url = "https://files.pythonhosted.org/packages/9d/07/290f4abf9ca702c5df7b47739c1b2c83588641ddfa2cc75e34a301d42e55/pyarrow-20.0.0-cp313-cp313-win_amd64.whl", hash = "sha256:30b3051b7975801c1e1d387e17c588d8ab05ced9b1e14eec57915f79869b5031", size = 25653219 },
2605
+ { url = "https://files.pythonhosted.org/packages/95/df/720bb17704b10bd69dde086e1400b8eefb8f58df3f8ac9cff6c425bf57f1/pyarrow-20.0.0-cp313-cp313t-macosx_12_0_arm64.whl", hash = "sha256:ca151afa4f9b7bc45bcc791eb9a89e90a9eb2772767d0b1e5389609c7d03db63", size = 30853957 },
2606
+ { url = "https://files.pythonhosted.org/packages/d9/72/0d5f875efc31baef742ba55a00a25213a19ea64d7176e0fe001c5d8b6e9a/pyarrow-20.0.0-cp313-cp313t-macosx_12_0_x86_64.whl", hash = "sha256:4680f01ecd86e0dd63e39eb5cd59ef9ff24a9d166db328679e36c108dc993d4c", size = 32247972 },
2607
+ { url = "https://files.pythonhosted.org/packages/d5/bc/e48b4fa544d2eea72f7844180eb77f83f2030b84c8dad860f199f94307ed/pyarrow-20.0.0-cp313-cp313t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:7f4c8534e2ff059765647aa69b75d6543f9fef59e2cd4c6d18015192565d2b70", size = 41256434 },
2608
+ { url = "https://files.pythonhosted.org/packages/c3/01/974043a29874aa2cf4f87fb07fd108828fc7362300265a2a64a94965e35b/pyarrow-20.0.0-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:3e1f8a47f4b4ae4c69c4d702cfbdfe4d41e18e5c7ef6f1bb1c50918c1e81c57b", size = 42353648 },
2609
+ { url = "https://files.pythonhosted.org/packages/68/95/cc0d3634cde9ca69b0e51cbe830d8915ea32dda2157560dda27ff3b3337b/pyarrow-20.0.0-cp313-cp313t-manylinux_2_28_aarch64.whl", hash = "sha256:a1f60dc14658efaa927f8214734f6a01a806d7690be4b3232ba526836d216122", size = 40619853 },
2610
+ { url = "https://files.pythonhosted.org/packages/29/c2/3ad40e07e96a3e74e7ed7cc8285aadfa84eb848a798c98ec0ad009eb6bcc/pyarrow-20.0.0-cp313-cp313t-manylinux_2_28_x86_64.whl", hash = "sha256:204a846dca751428991346976b914d6d2a82ae5b8316a6ed99789ebf976551e6", size = 42241743 },
2611
+ { url = "https://files.pythonhosted.org/packages/eb/cb/65fa110b483339add6a9bc7b6373614166b14e20375d4daa73483755f830/pyarrow-20.0.0-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:f3b117b922af5e4c6b9a9115825726cac7d8b1421c37c2b5e24fbacc8930612c", size = 42839441 },
2612
+ { url = "https://files.pythonhosted.org/packages/98/7b/f30b1954589243207d7a0fbc9997401044bf9a033eec78f6cb50da3f304a/pyarrow-20.0.0-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:e724a3fd23ae5b9c010e7be857f4405ed5e679db5c93e66204db1a69f733936a", size = 44503279 },
2613
+ { url = "https://files.pythonhosted.org/packages/37/40/ad395740cd641869a13bcf60851296c89624662575621968dcfafabaa7f6/pyarrow-20.0.0-cp313-cp313t-win_amd64.whl", hash = "sha256:82f1ee5133bd8f49d31be1299dc07f585136679666b502540db854968576faf9", size = 25944982 },
2614
+ ]
2615
+
2616
  [[package]]
2617
  name = "pycparser"
2618
  version = "2.22"
 
2877
  { url = "https://files.pythonhosted.org/packages/e4/52/f49b0aa96253010f57cf80315edecec4f469e7a39c1ed92bf727fa290e57/qdrant_client-1.14.2-py3-none-any.whl", hash = "sha256:7c283b1f0e71db9c21b85d898fb395791caca2a6d56ee751da96d797b001410c", size = 327691 },
2878
  ]
2879
 
2880
+ [[package]]
2881
+ name = "ragas"
2882
+ version = "0.2.15"
2883
+ source = { registry = "https://pypi.org/simple" }
2884
+ dependencies = [
2885
+ { name = "appdirs" },
2886
+ { name = "datasets" },
2887
+ { name = "diskcache" },
2888
+ { name = "langchain" },
2889
+ { name = "langchain-community" },
2890
+ { name = "langchain-core" },
2891
+ { name = "langchain-openai" },
2892
+ { name = "nest-asyncio" },
2893
+ { name = "numpy" },
2894
+ { name = "openai" },
2895
+ { name = "pydantic" },
2896
+ { name = "tiktoken" },
2897
+ ]
2898
+ sdist = { url = "https://files.pythonhosted.org/packages/6c/0f/04fddfa94744b1c3d8901aed8832a6b4193cc8e4886881f1bb88ff055350/ragas-0.2.15.tar.gz", hash = "sha256:2d0cd77b315a9c9c02ceb0a19ca8a48e82e1d02416587a2944ea51e6e327cd7b", size = 40867766 }
2899
+ wheels = [
2900
+ { url = "https://files.pythonhosted.org/packages/f2/9b/a5641da8aab06e069885a9ffa1b4897878f14c5b9807a9e3c5f1f532a6a9/ragas-0.2.15-py3-none-any.whl", hash = "sha256:298cd3d1fe3bd21ca4d31023a55079740d7bdd27a8c915bb371cec3c50cde608", size = 190947 },
2901
+ ]
2902
+
2903
  [[package]]
2904
  name = "rapidfuzz"
2905
  version = "3.13.0"
 
3643
  { url = "https://files.pythonhosted.org/packages/fa/a8/5b41e0da817d64113292ab1f8247140aac61cbf6cfd085d6a0fa77f4984f/websockets-15.0.1-py3-none-any.whl", hash = "sha256:f7a866fbc1e97b5c617ee4116daaa09b722101d4a3c170c787450ba409f9736f", size = 169743 },
3644
  ]
3645
 
3646
+ [[package]]
3647
+ name = "widgetsnbextension"
3648
+ version = "4.0.14"
3649
+ source = { registry = "https://pypi.org/simple" }
3650
+ sdist = { url = "https://files.pythonhosted.org/packages/41/53/2e0253c5efd69c9656b1843892052a31c36d37ad42812b5da45c62191f7e/widgetsnbextension-4.0.14.tar.gz", hash = "sha256:a3629b04e3edb893212df862038c7232f62973373869db5084aed739b437b5af", size = 1097428 }
3651
+ wheels = [
3652
+ { url = "https://files.pythonhosted.org/packages/ca/51/5447876806d1088a0f8f71e16542bf350918128d0a69437df26047c8e46f/widgetsnbextension-4.0.14-py3-none-any.whl", hash = "sha256:4875a9eaf72fbf5079dc372a51a9f268fc38d46f767cbf85c43a36da5cb9b575", size = 2196503 },
3653
+ ]
3654
+
3655
  [[package]]
3656
  name = "wrapt"
3657
  version = "1.17.2"