Add multiple blog posts on the Ragas evaluation framework and metric-driven development
- Introduced "Part 1: Introduction to Ragas" covering the evaluation framework for LLM applications.
- Added "Part 4: Generating Test Data with Ragas" detailing synthetic data generation techniques.
- Included "Part 6: Integrations and Observability with Ragas" discussing integration with various frameworks and observability platforms.
- Published "Metric-Driven Development: Make Smarter Decisions, Faster" emphasizing the importance of metrics in project management.
- Announced the availability of an RSS feed for the blog to enhance content accessibility.
- Added a blog post titled "A C# Programmer's Perspective on LangChain Expression Language" sharing insights on transitioning from C# to LCEL.
- .gitignore +1 -1
- Dockerfile +4 -2
- data/advanced-metrics-and-customization-with-ragas/index.md +245 -0
- data/basic-evaluation-workflow-with-ragas/index.md +248 -0
- data/building-feedback-loops-with-ragas/index.md +161 -0
- data/building-research-agent/index.md +119 -0
- data/coming-back-to-ai-roots/index.md +64 -0
- data/data-is-king/index.md +78 -0
- data/evaluating-ai-agents-with-ragas/index.md +170 -0
- data/evaluating-rag-systems-with-ragas/index.md +197 -0
- data/generating-test-data-with-ragas/index.md +206 -0
- data/integrations-and-observability-with-ragas/index.md +137 -0
- data/introduction-to-ragas/index.md +136 -0
- data/langchain-experience-csharp-perspective/index.md +103 -0
- data/metric-driven-development/index.md +155 -0
- data/rss-feed-announcement/index.md +55 -0
- py-src/app.py +3 -1
.gitignore
CHANGED
@@ -1,4 +1,4 @@
-
+
 db/
 
 
Dockerfile
CHANGED
@@ -25,8 +25,10 @@ COPY --chown=user ./uv.lock $HOME/app
 # Install the dependencies
 # RUN uv sync --frozen
 RUN uv sync
-
-
+
+#TODO: Fix this to download
+#copy posts to container
+COPY --chown=user ./data/ $HOME/app/data
 # Expose the port
 EXPOSE 7860
 
data/advanced-metrics-and-customization-with-ragas/index.md
ADDED
@@ -0,0 +1,245 @@
1 |
+
---
|
2 |
+
title: "Part 5: Advanced Metrics and Customization with Ragas"
|
3 |
+
date: 2025-04-28T05:00:00-06:00
|
4 |
+
layout: blog
|
5 |
+
description: "Explore advanced metrics and customization techniques in Ragas for evaluating LLM applications, including creating custom metrics, domain-specific evaluation, composite scoring, and best practices for building a comprehensive evaluation ecosystem."
|
6 |
+
categories: ["AI", "RAG", "Evaluation", "Ragas","Data"]
|
7 |
+
coverImage: "https://plus.unsplash.com/premium_photo-1661368994107-43200954c524?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
|
8 |
+
readingTime: 9
|
9 |
+
published: true
|
10 |
+
---
|
11 |
+
|
12 |
+
In our previous post, we explored how to generate comprehensive test datasets for evaluating LLM applications. Now, let's dive into one of Ragas' most powerful capabilities: advanced metrics and custom evaluation approaches that address specialized evaluation needs.
|
13 |
+
|
14 |
+
## Beyond the Basics: Why Advanced Metrics Matter
|
15 |
+
|
16 |
+
While Ragas' core metrics cover fundamental evaluation aspects, real-world applications often have unique requirements:
|
17 |
+
|
18 |
+
- **Domain-specific quality criteria**: Legal, medical, or financial applications have specialized accuracy requirements
|
19 |
+
- **Custom interaction patterns**: Applications with unique conversation flows need tailored evaluation approaches
|
20 |
+
- **Specialized capabilities**: Features like reasoning, code generation, or structured output demand purpose-built metrics
|
21 |
+
- **Business-specific KPIs**: Aligning evaluation with business objectives requires customized metrics
|
22 |
+
|
23 |
+
Let's explore how to extend Ragas' capabilities to meet these specialized needs.
|
24 |
+
|
25 |
+
## Understanding Ragas' Metric Architecture
|
26 |
+
|
27 |
+
Before creating custom metrics, it's helpful to understand Ragas' metric architecture:
|
28 |
+
|
29 |
+
### 1. Understand the Metric Base Classes
|
30 |
+
|
31 |
+
All metrics in Ragas inherit from the abstract `Metric` class (see `metrics/base.py`). For most use cases, you’ll extend one of these:
|
32 |
+
|
33 |
+
- **SingleTurnMetric**: For metrics that evaluate a single question/response pair.
|
34 |
+
- **MultiTurnMetric**: For metrics that evaluate multi-turn conversations.
|
35 |
+
- **MetricWithLLM**: For metrics that require an LLM for evaluation.
|
36 |
+
- **MetricWithEmbeddings**: For metrics that use embeddings.
|
37 |
+
|
38 |
+
You can mix these as needed (e.g., `MetricWithLLM, SingleTurnMetric`).
|
39 |
+
|
40 |
+
Each metric implements specific scoring methods depending on its type:
|
41 |
+
|
42 |
+
- `_single_turn_ascore`: For single-turn metrics
|
43 |
+
- `_multi_turn_ascore`: For multi-turn metrics
|
44 |
+
|
45 |
+
|
46 |
+
## Creating Your First Custom Metric
|
47 |
+
|
48 |
+
Let's create a custom metric that evaluates technical accuracy in programming explanations:
|
49 |
+
|
50 |
+
```python
|
51 |
+
from dataclasses import dataclass, field
|
52 |
+
from typing import Dict, Optional, Set
|
53 |
+
import typing as t
|
54 |
+
|
55 |
+
from ragas.metrics.base import MetricWithLLM, SingleTurnMetric
|
56 |
+
from ragas.prompt import PydanticPrompt
|
57 |
+
from ragas.metrics import MetricType, MetricOutputType
|
58 |
+
from pydantic import BaseModel
|
59 |
+
|
60 |
+
# Define input/output models for the prompt
|
61 |
+
class TechnicalAccuracyInput(BaseModel):
|
62 |
+
question: str
|
63 |
+
context: str
|
64 |
+
response: str
|
65 |
+
programming_language: str = "python"
|
66 |
+
|
67 |
+
class TechnicalAccuracyOutput(BaseModel):
|
68 |
+
score: float
|
69 |
+
feedback: str
|
70 |
+
|
71 |
+
|
72 |
+
# Define the prompt
|
73 |
+
class TechnicalAccuracyPrompt(PydanticPrompt[TechnicalAccuracyInput, TechnicalAccuracyOutput]):
|
74 |
+
instruction: str = (
|
75 |
+
"Evaluate the technical accuracy of the response to a programming question. "
|
76 |
+
"Consider syntax correctness, algorithmic accuracy, and best practices."
|
77 |
+
)
|
78 |
+
input_model = TechnicalAccuracyInput
|
79 |
+
output_model = TechnicalAccuracyOutput
|
80 |
+
examples = [
|
81 |
+
# Add examples here
|
82 |
+
]
|
83 |
+
|
84 |
+
# Create the metric
|
85 |
+
@dataclass
|
86 |
+
class TechnicalAccuracy(MetricWithLLM, SingleTurnMetric):
|
87 |
+
name: str = "technical_accuracy"
|
88 |
+
_required_columns: Dict[MetricType, Set[str]] = field(
|
89 |
+
default_factory=lambda: {
|
90 |
+
MetricType.SINGLE_TURN: {
|
91 |
+
"user_input",
|
92 |
+
"response",
|
93 |
+
|
94 |
+
}
|
95 |
+
}
|
96 |
+
)
|
97 |
+
output_type: Optional[MetricOutputType] = MetricOutputType.CONTINUOUS
|
98 |
+
evaluation_prompt: PydanticPrompt = field(default_factory=TechnicalAccuracyPrompt)
|
99 |
+
|
100 |
+
async def _single_turn_ascore(self, sample, callbacks) -> float:
|
101 |
+
assert self.llm is not None, "LLM must be set"
|
102 |
+
|
103 |
+
question = sample.user_input
|
104 |
+
response = sample.response
|
105 |
+
# Extract programming language from question if possible
|
106 |
+
programming_language = "python" # Default
|
107 |
+
languages = ["python", "javascript", "java", "c++", "rust", "go"]
|
108 |
+
for lang in languages:
|
109 |
+
if lang in question.lower():
|
110 |
+
programming_language = lang
|
111 |
+
break
|
112 |
+
|
113 |
+
# Get the context
|
114 |
+
context = "\n".join(sample.retrieved_contexts) if sample.retrieved_contexts else ""
|
115 |
+
|
116 |
+
# Prepare input for prompt
|
117 |
+
prompt_input = TechnicalAccuracyInput(
|
118 |
+
question=question,
|
119 |
+
context=context,
|
120 |
+
response=response,
|
121 |
+
programming_language=programming_language
|
122 |
+
)
|
123 |
+
|
124 |
+
# Generate evaluation
|
125 |
+
evaluation = await self.evaluation_prompt.generate(
|
126 |
+
data=prompt_input, llm=self.llm, callbacks=callbacks
|
127 |
+
)
|
128 |
+
|
129 |
+
return evaluation.score
|
130 |
+
```
|
131 |
+
## Using the Custom Metric
|
132 |
+
To use the custom metric, simply include it in your evaluation pipeline:
|
133 |
+
|
134 |
+
```python
|
135 |
+
from langchain_openai import ChatOpenAI
|
136 |
+
from ragas import SingleTurnSample
|
137 |
+
from ragas.llms import LangchainLLMWrapper
|
138 |
+
|
139 |
+
# Initialize the evaluator LLM (you will need an OpenAI API key set in your environment)
|
140 |
+
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
|
141 |
+
|
142 |
+
test_data = {
|
143 |
+
"user_input": "Write a function to calculate the factorial of a number in Python.",
|
144 |
+
"retrieved_contexts": ["Python is a programming language.", "A factorial of a number n is the product of all positive integers less than or equal to n."],
|
145 |
+
"response": "def factorial(n):\n if n == 0:\n return 1\n else:\n return n * factorial(n-1)",
|
146 |
+
}
|
147 |
+
|
148 |
+
# Create a sample
|
149 |
+
sample = SingleTurnSample(**test_data) # Unpack the dictionary into the constructor
|
150 |
+
technical_accuracy = TechnicalAccuracy(llm=evaluator_llm)
|
151 |
+
score = await technical_accuracy.single_turn_ascore(sample)
|
152 |
+
print(f"Technical Accuracy Score: {score}")
|
153 |
+
# Note: The above code is a simplified example. In a real-world scenario, you would also handle exceptions and edge cases.
|
154 |
+
```
|
155 |
+
You can also use the `evaluate` function to evaluate a dataset:
|
156 |
+
|
157 |
+
```python
|
158 |
+
from ragas import evaluate
|
160 |
+
|
161 |
+
results = evaluate(
|
162 |
+
dataset, # Your dataset of samples
|
163 |
+
metrics=[TechnicalAccuracy(), ...],
|
164 |
+
llm=evaluator_llm
|
165 |
+
)
|
166 |
+
```
|
167 |
+
|
168 |
+
> 💡 **Try it yourself:**
|
169 |
+
> Explore the hands-on notebook for advanced metrics and customization:
|
170 |
+
> [05_Advanced_Metrics_and_Customization](https://github.com/mafzaal/intro-to-ragas/blob/master/05_Advanced_Metrics_and_Customization.ipynb)
|
171 |
+
|
172 |
+
## Customizing Metrics for Your Application
|
173 |
+
|
174 |
+
You can further refine your evaluation by customizing existing metrics—such as adjusting thresholds or criteria—to better fit your application's requirements. For multi-turn conversations, you might configure metrics like topic adherence to emphasize specific aspects, such as precision or recall, based on your evaluation objectives.
|
175 |
+
|
176 |
+
In specialized domains like healthcare or legal, it's crucial to design custom metrics that capture domain-specific accuracy and compliance needs. For complex applications, consider combining several metrics into composite scores to represent multiple quality dimensions.
|
177 |
+
|
178 |
+
When assessing capabilities like code generation or structured outputs, develop metrics that evaluate execution correctness or schema compliance. For advanced scenarios, you can build metric pipelines that orchestrate several metrics and aggregate their results using strategies like weighted averages or minimum scores.
|
179 |
+
|
180 |
+
By thoughtfully customizing and combining metrics, you can achieve a comprehensive and meaningful evaluation framework tailored to your unique use case.
|
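As a concrete illustration of the aggregation idea, here is a minimal sketch of a composite scorer that combines per-metric results using either a weighted average or a minimum-score strategy. The metric names and weights below are assumptions chosen for the example, not values prescribed by Ragas.

```python
from typing import Dict


def composite_score(
    scores: Dict[str, float],
    weights: Dict[str, float],
    strategy: str = "weighted_average",
) -> float:
    """Combine per-metric scores (each in [0, 1]) into a single composite score."""
    if strategy == "weighted_average":
        total_weight = sum(weights.get(name, 0.0) for name in scores)
        return sum(s * weights.get(name, 0.0) for name, s in scores.items()) / total_weight
    if strategy == "minimum":
        # Conservative view: the composite is only as good as the weakest metric.
        return min(scores.values())
    raise ValueError(f"Unknown aggregation strategy: {strategy}")


# Hypothetical per-metric results for one sample
scores = {"technical_accuracy": 0.85, "faithfulness": 0.92, "answer_relevancy": 0.78}
weights = {"technical_accuracy": 0.5, "faithfulness": 0.3, "answer_relevancy": 0.2}

print(composite_score(scores, weights))              # weighted average
print(composite_score(scores, weights, "minimum"))   # worst-case aggregation
```

The weighted average rewards balanced performance, while the minimum strategy is useful when a single failing dimension should sink the overall score.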
181 |
+
|
182 |
+
## Best Practices for Custom Metric Development
|
183 |
+
|
184 |
+
1. **Single Responsibility**: Each metric should evaluate one specific aspect
|
185 |
+
2. **Clear Definition**: Define precisely what your metric measures
|
186 |
+
3. **Bounded Output**: Scores should be normalized, typically in [0,1]
|
187 |
+
4. **Reproducibility**: Minimize randomness in evaluation
|
188 |
+
5. **Documentation**: Document criteria, prompt design, and interpretation guidelines
|
189 |
+
6. **Test with Examples**: Verify metric behavior on clear-cut examples
|
190 |
+
7. **Human Correlation**: Validate that metrics correlate with human judgment
|
191 |
+
|
192 |
+
## Standardizing Custom Metrics
|
193 |
+
|
194 |
+
To ensure consistency across custom metrics, consider the following best practices:
|
195 |
+
|
196 |
+
- Define a clear, human-readable description for each metric.
|
197 |
+
- Provide interpretation guidelines to help users understand score meanings.
|
198 |
+
- Include metadata such as metric name, required columns, and output type.
|
199 |
+
- Use a standardized interface or base class for all custom metrics.
|
200 |
+
|
201 |
+
## Implementation Patterns for Advanced Metrics
|
202 |
+
|
203 |
+
When developing advanced metrics like topic adherence:
|
204 |
+
|
205 |
+
- Design multi-step evaluation workflows for complex tasks.
|
206 |
+
- Use specialized prompts for different sub-tasks within the metric.
|
207 |
+
- Allow configurable scoring modes (e.g., precision, recall, F1).
|
208 |
+
- Support conversational context for multi-turn evaluations.
|
209 |
+
|
210 |
+
## Debugging Custom Metrics
|
211 |
+
|
212 |
+
Effective debugging strategies include (a minimal logging sketch follows this list):
|
213 |
+
|
214 |
+
- Implementing a debug mode to capture prompt inputs, outputs, and intermediate results.
|
215 |
+
- Logging detailed evaluation steps for easier troubleshooting.
|
216 |
+
- Reviewing final scores alongside intermediate calculations to identify issues.
|
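One lightweight way to get that visibility without modifying Ragas internals is to wrap the scoring call and log the sample fields alongside the returned score. A minimal sketch, assuming the `TechnicalAccuracy` metric and `SingleTurnSample` from the example above:

```python
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("metric_debug")


async def debug_score(metric, sample):
    """Score a single-turn sample while logging inputs and the resulting score."""
    logger.debug("Scoring with metric: %s", metric.name)
    logger.debug("user_input: %s", sample.user_input)
    logger.debug("response: %s", sample.response)
    logger.debug("retrieved_contexts: %s", sample.retrieved_contexts)
    score = await metric.single_turn_ascore(sample)
    logger.debug("score: %s", score)
    return score


# Usage inside an async context:
# score = await debug_score(technical_accuracy, sample)
```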
217 |
+
|
218 |
+
|
219 |
+
## Conclusion: Building an Evaluation Ecosystem
|
220 |
+
|
221 |
+
Custom metrics allow you to build a comprehensive evaluation ecosystem tailored to your application's specific needs:
|
222 |
+
|
223 |
+
1. **Baseline metrics**: Start with Ragas' core metrics for fundamental quality aspects
|
224 |
+
2. **Domain adaptation**: Add specialized metrics for your application domain
|
225 |
+
3. **Feature-specific metrics**: Develop metrics for unique features of your system
|
226 |
+
4. **Business alignment**: Create metrics that reflect specific business KPIs and requirements
|
227 |
+
|
228 |
+
By extending Ragas with custom metrics, you can create evaluation frameworks that precisely measure what matters most for your LLM applications, leading to more meaningful improvements and better user experiences.
|
229 |
+
|
230 |
+
In our next post, we'll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows.
|
231 |
+
|
232 |
+
---
|
233 |
+
|
234 |
+
**[Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications](/blog/introduction-to-ragas/)**
|
235 |
+
**[Part 2: Basic Evaluation Workflow](/blog/basic-evaluation-workflow-with-ragas/)**
|
236 |
+
**[Part 3: Evaluating RAG Systems with Ragas](/blog/evaluating-rag-systems-with-ragas/)**
|
237 |
+
**[Part 4: Test Data Generation](/blog/generating-test-data-with-ragas)**
|
238 |
+
**Part 5: Advanced Evaluation Techniques — _You are here_**
|
239 |
+
*Next up in the series:*
|
240 |
+
**[Part 6: Evaluating AI Agents](/blog/evaluating-ai-agents-with-ragas/)**
|
241 |
+
**[Part 7: Integrations and Observability](/blog/integrations-and-observability-with-ragas/)**
|
242 |
+
**[Part 8: Building Feedback Loops](/blog/building-feedback-loops-with-ragas/)**
|
243 |
+
|
244 |
+
|
245 |
+
*Have you built custom metrics for your LLM applications? What specialized evaluation needs have they addressed? If you’re facing specific evaluation hurdles, don’t hesitate to [reach out](https://www.linkedin.com/in/muhammadafzaal/)—we’d love to help!*
|
data/basic-evaluation-workflow-with-ragas/index.md
ADDED
@@ -0,0 +1,248 @@
1 |
+
---
|
2 |
+
title: "Part 2: Basic Evaluation Workflow with Ragas"
|
3 |
+
date: 2025-04-26T19:00:00-06:00
|
4 |
+
layout: blog
|
5 |
+
description: "Learn how to set up a basic evaluation workflow for LLM applications using Ragas. This guide walks you through data preparation, metric selection, and result analysis."
|
6 |
+
categories: ["AI", "RAG", "Evaluation", "Ragas"]
|
7 |
+
coverImage: "https://images.unsplash.com/photo-1600132806370-bf17e65e942f?q=80&w=1988&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
|
8 |
+
readingTime: 8
|
9 |
+
published: true
|
10 |
+
---
|
11 |
+
|
12 |
+
In our previous post, we introduced Ragas as a powerful framework for evaluating LLM applications. Now, let's dive into the practical aspects of setting up your first evaluation pipeline.
|
13 |
+
|
14 |
+
## Understanding the Evaluation Workflow
|
15 |
+
|
16 |
+
A typical Ragas evaluation workflow consists of four key steps:
|
17 |
+
|
18 |
+
1. **Prepare your data**: Collect queries, contexts, responses, and reference answers
|
19 |
+
2. **Select appropriate metrics**: Choose metrics that align with what you want to evaluate
|
20 |
+
3. **Run the evaluation**: Process your data through the selected metrics
|
21 |
+
4. **Analyze the results**: Interpret scores and identify areas for improvement
|
22 |
+
|
23 |
+
Let's walk through each step with practical examples.
|
24 |
+
|
25 |
+
## Step 1: Setting Up Your Environment
|
26 |
+
|
27 |
+
First, ensure you have Ragas installed:
|
28 |
+
|
29 |
+
```bash
|
30 |
+
uv add ragas
|
31 |
+
```
|
32 |
+
|
33 |
+
Next, import the necessary components:
|
34 |
+
|
35 |
+
```python
|
36 |
+
import pandas as pd
|
37 |
+
from ragas import EvaluationDataset
|
38 |
+
from ragas import evaluate, RunConfig
|
39 |
+
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
|
40 |
+
```
|
41 |
+
|
42 |
+
## Step 2: Preparing Your Evaluation Data
|
43 |
+
|
44 |
+
For a RAG system evaluation, you'll need:
|
45 |
+
|
46 |
+
- **Questions**: User queries to your system
|
47 |
+
- **Contexts**: Documents or chunks retrieved by your system
|
48 |
+
- **Responses**: Answers generated by your system
|
49 |
+
- **Ground truth** (optional): Reference answers or documents for comparison
|
50 |
+
|
51 |
+
Here's how to organize this data:
|
52 |
+
|
53 |
+
```python
|
54 |
+
# Sample data
|
55 |
+
data = {
|
56 |
+
"user_input": [
|
57 |
+
"What are the main symptoms of COVID-19?",
|
58 |
+
"How does machine learning differ from deep learning?"
|
59 |
+
],
|
60 |
+
"retrieved_contexts": [
|
61 |
+
[
|
62 |
+
"Common symptoms of COVID-19 include fever, cough, and fatigue. Some patients also report loss of taste or smell, body aches, and difficulty breathing.",
|
63 |
+
"COVID-19 is caused by the SARS-CoV-2 virus and spreads primarily through respiratory droplets."
|
64 |
+
],
|
65 |
+
[
|
66 |
+
"Machine learning is a subset of AI focused on algorithms that learn from data without being explicitly programmed.",
|
67 |
+
"Deep learning is a specialized form of machine learning using neural networks with many layers (deep neural networks)."
|
68 |
+
]
|
69 |
+
],
|
70 |
+
"response": [
|
71 |
+
"The main symptoms of COVID-19 include fever, cough, fatigue, and sometimes loss of taste or smell, body aches, and breathing difficulties.",
|
72 |
+
"Machine learning is a subset of AI that focuses on algorithms learning from data, while deep learning is a specialized form of machine learning that uses deep neural networks with multiple layers."
|
73 |
+
],
|
74 |
+
"reference": [
|
75 |
+
"COVID-19 symptoms commonly include fever, dry cough, fatigue, loss of taste or smell, body aches, sore throat, and in severe cases, difficulty breathing.",
|
76 |
+
"Machine learning is a branch of AI where systems learn from data, identify patterns, and make decisions with minimal human intervention. Deep learning is a subset of machine learning that uses neural networks with multiple layers (deep neural networks) to analyze various factors of data."
|
77 |
+
]
|
78 |
+
}
|
79 |
+
|
80 |
+
eval_data = pd.DataFrame(data)
|
81 |
+
|
82 |
+
# Convert to a format Ragas can use
|
83 |
+
evaluation_dataset = EvaluationDataset.from_pandas(eval_data)
|
84 |
+
evaluation_dataset
|
85 |
+
|
86 |
+
```
|
87 |
+
|
88 |
+
## Step 3: Selecting and Configuring Metrics
|
89 |
+
|
90 |
+
Ragas offers various metrics to evaluate different aspects of your system:
|
91 |
+
|
92 |
+
### Core RAG Metrics:
|
93 |
+
|
94 |
+
- **Faithfulness**: Measures if the response is factually consistent with the provided context.
|
95 |
+
- **Factual Correctness**: Assesses if the response is accurate and free from factual errors.
|
96 |
+
- **Response Relevancy**: Evaluates if the response directly addresses the user query.
|
97 |
+
- **Context Entity Recall**: Measures how well the retrieved context captures relevant entities from the ground truth.
|
98 |
+
- **Noise Sensitivity**: Assesses the robustness of the response to irrelevant or noisy context.
|
99 |
+
- **LLM Context Recall**: Uses an LLM to measure how well the retrieved context covers the information in the reference answer.
|
100 |
+
|
101 |
+
For metrics that require an LLM (like faithfulness), you need to configure the LLM provider:
|
102 |
+
|
103 |
+
```python
|
104 |
+
# Configure LLM for evaluation
|
105 |
+
from langchain_openai import ChatOpenAI
|
106 |
+
from ragas.llms import LangchainLLMWrapper
|
107 |
+
|
108 |
+
# Initialize the evaluator LLM (you will need an OpenAI API key set in your environment)
|
109 |
+
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
|
110 |
+
|
111 |
+
# Define metrics to use
|
112 |
+
metrics = [
|
113 |
+
Faithfulness(),
|
114 |
+
FactualCorrectness(),
|
115 |
+
ResponseRelevancy(),
|
116 |
+
ContextEntityRecall(),
|
117 |
+
NoiseSensitivity(),
|
118 |
+
LLMContextRecall()
|
119 |
+
]
|
120 |
+
```
|
121 |
+
|
122 |
+
## Step 4: Running the Evaluation
|
123 |
+
|
124 |
+
Now, run the evaluation with your selected metrics:
|
125 |
+
|
126 |
+
```python
|
127 |
+
# Run evaluation
|
128 |
+
results = evaluate(
|
129 |
+
evaluation_dataset,
|
130 |
+
metrics=metrics,
|
131 |
+
llm=evaluator_llm # Required for LLM-based metrics
|
132 |
+
)
|
133 |
+
|
134 |
+
# View results
|
135 |
+
print(results)
|
136 |
+
```
|
137 |
+
### Output:
|
138 |
+
|
139 |
+
*Values will vary based on your data and LLM performance.*
|
140 |
+
|
141 |
+
```python
|
142 |
+
{
|
143 |
+
"faithfulness": 1.0000,
|
144 |
+
"factual_correctness": 0.6750,
|
145 |
+
"answer_relevancy": 0.9897,
|
146 |
+
"context_entity_recall": 0.8889,
|
147 |
+
"noise_sensitivity_relevant": 0.1667,
|
148 |
+
"context_recall": 0.5000
|
149 |
+
}
|
150 |
+
```
|
151 |
+
|
152 |
+
|
153 |
+
## Step 5: Interpreting Results
|
154 |
+
|
155 |
+
Ragas metrics typically return scores between 0 and 1, where higher is better:
|
156 |
+
|
157 |
+
### Understanding Score Ranges:
|
158 |
+
|
159 |
+
- **0.8-1.0**: Excellent performance
|
160 |
+
- **0.6-0.8**: Good performance
|
161 |
+
- **0.4-0.6**: Moderate performance, needs improvement
|
162 |
+
- **Below 0.4**: Poor performance, requires significant attention (see the sketch after this list for one way to flag such samples)
|
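To act on these ranges programmatically, it helps to inspect scores per sample rather than only the aggregate. A minimal sketch, assuming the `results` object from Step 4 exposes a `to_pandas()` method (available in recent Ragas versions) and using an illustrative 0.6 threshold; the column names follow the score keys shown in the output above:

```python
# Convert per-sample results to a DataFrame for closer inspection
df = results.to_pandas()

threshold = 0.6
metric_columns = ["faithfulness", "factual_correctness", "answer_relevancy"]

# Flag any sample where one of the selected metrics falls below the threshold
low_scoring = df[(df[metric_columns] < threshold).any(axis=1)]
print(f"{len(low_scoring)} of {len(df)} samples need attention")
print(low_scoring[["user_input"] + metric_columns])
```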
163 |
+
|
164 |
+
## Advanced Use: Custom Evaluation for Specific Examples
|
165 |
+
|
166 |
+
For more detailed analysis of specific examples:
|
167 |
+
|
168 |
+
```python
|
169 |
+
from ragas import SingleTurnSample
|
170 |
+
from ragas.metrics import AspectCritic
|
171 |
+
|
172 |
+
# Define a specific test case
|
173 |
+
test_data = {
|
174 |
+
"user_input": "What are quantum computers?",
|
175 |
+
"response": "Quantum computers use quantum bits or qubits that can exist in multiple states simultaneously, unlike classical bits that can only be 0 or 1.",
|
176 |
+
"retrieved_contexts": ["Quantum computing is a type of computation that harnesses quantum mechanical phenomena."]
|
177 |
+
}
|
178 |
+
|
179 |
+
# Create a custom evaluation metric
|
180 |
+
custom_metric = AspectCritic(
|
181 |
+
name="quantum_accuracy",
|
182 |
+
llm=evaluator_llm,
|
183 |
+
definition="Verify if the explanation of quantum computing is accurate and complete."
|
184 |
+
)
|
185 |
+
|
186 |
+
# Score the sample
|
187 |
+
sample = SingleTurnSample(**test_data)
|
188 |
+
score = await custom_metric.single_turn_ascore(sample)
|
189 |
+
print(f"Quantum accuracy score: {score}")
|
190 |
+
```
|
191 |
+
> 💡 **Try it yourself:**
|
192 |
+
> Explore the hands-on notebook for this workflow:
|
193 |
+
> [02_Basic_Evaluation_Workflow_with_Ragas](https://github.com/mafzaal/intro-to-ragas/blob/master/02_Basic_Evaluation_Workflow_with_Ragas.ipynb)
|
194 |
+
|
195 |
+
## Common Evaluation Patterns and Metrics
|
196 |
+
|
197 |
+
Below is a matrix mapping evaluation patterns to the metrics used, along with definitions for each metric:
|
198 |
+
|
199 |
+
| **Metric** | **Comprehensive RAG Evaluation** | **Content Quality Evaluation** | **Retrieval Quality Evaluation** |
|
200 |
+
|-----------------------------|----------------------------------|---------------------------------|-----------------------------------|
|
201 |
+
| **Faithfulness** | ✓ | ✓ | |
|
202 |
+
| **Answer Relevancy** | ✓ | ✓ | |
|
203 |
+
| **Context Recall** | ✓ | | ✓ |
|
204 |
+
| **Context Precision** | ✓ | | ✓ |
|
205 |
+
| **Harmfulness** | | ✓ | |
|
206 |
+
| **Coherence** | | ✓ | |
|
207 |
+
| **Context Relevancy** | | | ✓ |
|
208 |
+
|
209 |
+
### Metric Definitions
|
210 |
+
|
211 |
+
- **Faithfulness**: Measures if the response is factually consistent with the provided context.
|
212 |
+
- **Answer Relevancy**: Assesses if the response addresses the question.
|
213 |
+
- **Context Recall**: Measures how well the retrieved context covers the information in the ground truth.
|
214 |
+
- **Context Precision**: Evaluates the proportion of relevant information in the retrieved context.
|
215 |
+
- **Harmfulness**: Evaluates if the response contains harmful or inappropriate content.
|
216 |
+
- **Coherence**: Measures the logical flow and clarity of the response.
|
217 |
+
- **Context Relevancy**: Evaluates if the retrieved context is relevant to the question.
|
218 |
+
|
219 |
+
This matrix provides a clear overview of which metrics to use for specific evaluation patterns and their respective definitions.
|
220 |
+
|
221 |
+
## Best Practices for Ragas Evaluation
|
222 |
+
|
223 |
+
1. **Start simple**: Begin with core metrics before adding more specialized ones
|
224 |
+
2. **Use diverse test cases**: Include a variety of questions, from simple to complex
|
225 |
+
3. **Consider edge cases**: Test with queries that might challenge your system
|
226 |
+
4. **Compare versions**: Track metrics across different versions of your application
|
227 |
+
5. **Combine with human evaluation**: Use Ragas alongside human feedback for a comprehensive assessment
|
228 |
+
|
229 |
+
## Conclusion
|
230 |
+
|
231 |
+
Setting up a basic evaluation workflow with Ragas is straightforward yet powerful. By systematically evaluating your LLM applications, you gain objective insights into their performance and clear directions for improvement.
|
232 |
+
|
233 |
+
In our next post, we'll delve deeper into specialized evaluation techniques for RAG systems, exploring advanced metrics and evaluation strategies for retrieval-augmented generation applications.
|
234 |
+
|
235 |
+
---
|
236 |
+
|
237 |
+
**[Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications](/blog/introduction-to-ragas/)**
|
238 |
+
**Part 2: Basic Evaluation Workflow — _You are here_**
|
239 |
+
*Next up in the series:*
|
240 |
+
**[Part 3: Evaluating RAG Systems](/blog/evaluating-rag-systems-with-ragas/)**
|
241 |
+
**[Part 4: Test Data Generation](/blog/generating-test-data-with-ragas/)**
|
242 |
+
**[Part 5: Advanced Evaluation Techniques](/blog/advanced-metrics-and-customization-with-ragas)**
|
243 |
+
**[Part 6: Evaluating AI Agents](/blog/evaluating-ai-agents-with-ragas/)**
|
244 |
+
**[Part 7: Integrations and Observability](/blog/integrations-and-observability-with-ragas/)**
|
245 |
+
**[Part 8: Building Feedback Loops](/blog/building-feedback-loops-with-ragas/)**
|
246 |
+
|
247 |
+
|
248 |
+
*Have you set up your first Ragas evaluation? What aspects of your LLM application are you most interested in measuring? If you’re facing specific evaluation hurdles, don’t hesitate to [reach out](https://www.linkedin.com/in/muhammadafzaal/)—we’d love to help!*
|
data/building-feedback-loops-with-ragas/index.md
ADDED
@@ -0,0 +1,161 @@
1 |
+
---
|
2 |
+
title: "Part 8: Building Feedback Loops with Ragas"
|
3 |
+
date: 2025-05-04T00:00:00-06:00
|
4 |
+
layout: blog
|
5 |
+
description: "A research-driven guide to designing robust, actionable feedback loops for LLM and RAG systems using Ragas. Learn how to select metrics, set baselines, define thresholds, and incorporate user and human feedback for continuous improvement."
|
6 |
+
categories: ["AI", "RAG", "Evaluation", "Ragas", "Data"]
|
7 |
+
coverImage: "/images/building-feedback-loops.png"
|
8 |
+
readingTime: 10
|
9 |
+
published: true
|
10 |
+
---
|
11 |
+
|
12 |
+
|
13 |
+
A high-performing LLM or RAG system is never static. The most successful teams treat evaluation as a continuous, iterative process—one that closes the loop between measurement, analysis, and improvement. In this post, we’ll design a research-backed feedback loop process using Ragas, focusing on actionable activities at each stage and strategies for integrating user and human feedback.
|
14 |
+
|
15 |
+
|
16 |
+
## Designing the Feedback Loop: A Stepwise Process
|
17 |
+
|
18 |
+
The feedback loop process is a systematic approach to continuously improve your LLM or RAG system. It consists of seven key steps, each building on the previous one to create a sustainable cycle of evidence-driven progress.
|
19 |
+
|
20 |
+

|
21 |
+
|
22 |
+
### 1. Select the Right Metric
|
23 |
+
|
24 |
+
**Purpose:**
|
25 |
+
Identify metrics that best reflect your application’s goals and user needs.
|
26 |
+
|
27 |
+
**Activities:**
|
28 |
+
- Map business objectives to measurable outcomes (e.g., accuracy, faithfulness, relevancy).
|
29 |
+
- Review available Ragas metrics and select those most aligned with your use case.
|
30 |
+
- Periodically revisit metric selection as your product or user base evolves.
|
31 |
+
|
32 |
+
### 2. Develop and Measure Baseline Metrics
|
33 |
+
|
34 |
+
**Purpose:**
|
35 |
+
Establish a reference point for current system performance.
|
36 |
+
|
37 |
+
**Activities:**
|
38 |
+
- Assemble a representative evaluation dataset.
|
39 |
+
- Run your system and record metric scores for each example.
|
40 |
+
- Document baseline results for all selected metrics.
|
41 |
+
- Ensure the baseline dataset remains stable for future comparisons.
|
42 |
+
|
43 |
+
### 3. Analyze and Define Acceptable Threshold Values
|
44 |
+
|
45 |
+
**Purpose:**
|
46 |
+
Set clear, actionable standards for what constitutes “good enough” performance.
|
47 |
+
|
48 |
+
**Activities:**
|
49 |
+
- Analyze baseline metric distributions (mean, variance, outliers).
|
50 |
+
- Consult stakeholders to define minimum acceptable values for each metric.
|
51 |
+
- Document thresholds and rationale for transparency.
|
52 |
+
- Consider different thresholds for different segments (e.g., critical vs. non-critical queries).
|
53 |
+
|
54 |
+
### 4. Evaluate and Select Improvement Areas
|
55 |
+
|
56 |
+
**Purpose:**
|
57 |
+
Identify where your system most often fails to meet thresholds and prioritize improvements.
|
58 |
+
|
59 |
+
**Activities:**
|
60 |
+
- Segment evaluation results by metric, query type, or user group.
|
61 |
+
- Identify patterns or clusters of failure (e.g., certain topics, long queries).
|
62 |
+
- Prioritize areas with the greatest impact on user experience or business goals.
|
63 |
+
- Formulate hypotheses about root causes.
|
64 |
+
|
65 |
+
### 5. Implement Improvements
|
66 |
+
|
67 |
+
**Purpose:**
|
68 |
+
Take targeted actions to address identified weaknesses.
|
69 |
+
|
70 |
+
**Activities:**
|
71 |
+
- Design and implement changes (e.g., prompt tuning, retrieval upgrades, model fine-tuning).
|
72 |
+
- Document all interventions and their intended effects.
|
73 |
+
- Ensure changes are isolated for clear attribution of impact.
|
74 |
+
|
75 |
+
|
76 |
+
### 6. Record Metrics for History
|
77 |
+
|
78 |
+
**Purpose:**
|
79 |
+
Build a longitudinal record to track progress and avoid regressions.
|
80 |
+
|
81 |
+
**Activities:**
|
82 |
+
- After each improvement, re-evaluate on the same baseline dataset.
|
83 |
+
- Log metric scores, system version, date, and description of changes.
|
84 |
+
- Visualize trends over time to inform future decisions.
|
85 |
+
|
86 |
+
**Metric Record Log Schema Example:**
|
87 |
+
|
88 |
+
| Timestamp | System Version | Metric Name | Value | Dataset Name | Change Description |
|
89 |
+
|---------------------|---------------|-------------------|--------|--------------|---------------------------|
|
90 |
+
| 2025-05-04T12:00:00 | v1.2.0 | faithfulness | 0.78 | baseline_v1 | Added re-ranking to retriever |
|
91 |
+
| 2025-05-04T12:00:00 | v1.2.0 | answer_relevancy | 0.81 | baseline_v1 | Added re-ranking to retriever |
|
92 |
+
| ... | ... | ... | ... | ... | ... |
|
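To make this log concrete, here is a minimal sketch that appends one measurement to a CSV file with pandas; the file name and column names are assumptions taken directly from the schema above:

```python
import os
from datetime import datetime, timezone

import pandas as pd


def record_metric(
    metric_name: str,
    value: float,
    system_version: str,
    dataset_name: str,
    change_description: str,
    log_path: str = "metric_history.csv",
) -> None:
    """Append one metric measurement to a longitudinal CSV log."""
    row = pd.DataFrame([{
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "system_version": system_version,
        "metric_name": metric_name,
        "value": value,
        "dataset_name": dataset_name,
        "change_description": change_description,
    }])
    # Write the header only when creating the file for the first time
    row.to_csv(log_path, mode="a", header=not os.path.exists(log_path), index=False)


record_metric("faithfulness", 0.78, "v1.2.0", "baseline_v1", "Added re-ranking to retriever")
```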
93 |
+
|
94 |
+
|
95 |
+
### 7. Repeat: Analyze, Evaluate, Implement, Record
|
96 |
+
|
97 |
+
**Purpose:**
|
98 |
+
Establish a sustainable, iterative cycle of improvement.
|
99 |
+
|
100 |
+
**Activities:**
|
101 |
+
- Regularly revisit analysis as new data or feedback emerges.
|
102 |
+
- Continuously refine thresholds and priorities.
|
103 |
+
- Maintain a culture of evidence-based iteration.
|
104 |
+
|
105 |
+
|
106 |
+
## Integrating User Feedback in Production
|
107 |
+
|
108 |
+
### Purpose
|
109 |
+
|
110 |
+
User feedback provides real-world validation and uncovers blind spots in automated metrics. Incorporating it closes the gap between technical evaluation and actual user satisfaction.
|
111 |
+
|
112 |
+
### Strategies
|
113 |
+
|
114 |
+
- **In-Product Feedback Widgets:** Allow users to rate answers or flag issues directly in the interface.
|
115 |
+
- **Passive Signals:** Analyze user behavior (e.g., follow-up queries, abandonment) as implicit feedback.
|
116 |
+
- **Feedback Sampling:** Periodically sample user sessions for manual review.
|
117 |
+
- **Feedback Aggregation:** Aggregate and categorize feedback to identify recurring pain points.
|
118 |
+
- **Metric Correlation:** Analyze how user feedback correlates with automated metrics to calibrate thresholds.
|
119 |
+
|
120 |
+
### Recording User Feedback
|
121 |
+
|
122 |
+
**User Feedback Log Schema Example:**
|
123 |
+
|
124 |
+
| Timestamp | User ID | Query ID | User Rating | Feedback Text | Metric Scores | System Version |
|
125 |
+
|---------------------|---------|----------|-------------|----------------------|--------------|---------------|
|
126 |
+
| 2025-05-04T13:00:00 | 12345 | q_987 | 2 | "Answer was off-topic" | `{faithfulness: 0.6, answer_relevancy: 0.5}` | v1.2.0 |
|
127 |
+
| 2025-05-04T13:00:00 | 67890 | q_654 | 4 | "Good answer, but could be more concise" | `{faithfulness: 0.8, answer_relevancy: 0.9}` | v1.2.0 |
|
128 |
+
| ... | ... | ... | ... | ... | ... | ... |
|
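Because each feedback record stores both the user rating and the automated metric scores, a quick correlation check shows how well they agree. A minimal sketch with pandas, assuming a CSV export of the log above in which the metric scores have been flattened into their own columns:

```python
import pandas as pd

# Hypothetical export of the user feedback log, one column per metric score
feedback = pd.read_csv("user_feedback_log.csv")  # columns include: user_rating, faithfulness, answer_relevancy

# Correlate the human rating with each automated metric
for metric in ["faithfulness", "answer_relevancy"]:
    corr = feedback["user_rating"].corr(feedback[metric])
    print(f"{metric}: correlation with user rating = {corr:.2f}")

# A weak or negative correlation suggests the metric (or its threshold) needs recalibration.
```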
129 |
+
|
130 |
+
## Including Human Labelers in Evaluation
|
131 |
+
|
132 |
+
### Purpose
|
133 |
+
|
134 |
+
Human labelers provide high-quality, nuanced judgments that automated metrics may miss, especially for ambiguous or complex queries.
|
135 |
+
|
136 |
+
### Strategies
|
137 |
+
|
138 |
+
- **Periodic Human Review:** Regularly sample evaluation outputs for human annotation.
|
139 |
+
- **Disagreement Analysis:** Focus human review on cases where user feedback and metrics disagree.
|
140 |
+
- **Labeler Training:** Provide clear guidelines and calibration sessions to ensure consistency.
|
141 |
+
- **Hybrid Scoring:** Combine human and automated scores for a more holistic evaluation.
|
142 |
+
- **Continuous Calibration:** Use human labels to refine and validate automated metric thresholds.
|
143 |
+
|
144 |
+
|
145 |
+
## Conclusion
|
146 |
+
|
147 |
+
A robust feedback loop is the foundation of sustainable improvement for LLM and RAG systems. By systematically selecting metrics, measuring baselines, setting thresholds, and integrating both user and human feedback, you create a virtuous cycle of evidence-driven progress. The most effective teams treat evaluation as an ongoing process—one that is deeply connected to real user outcomes and grounded in transparent, repeatable measurement.
|
148 |
+
|
149 |
+
---
|
150 |
+
*This is the eighth part of a series on Ragas, a research-driven evaluation framework for LLM and RAG systems. If you missed the previous parts, check them out below:*
|
151 |
+
|
152 |
+
**[Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications](/blog/introduction-to-ragas/)**
|
153 |
+
**[Part 2: Basic Evaluation Workflow](/blog/basic-evaluation-workflow-with-ragas/)**
|
154 |
+
**[Part 3: Evaluating RAG Systems with Ragas](/blog/evaluating-rag-systems-with-ragas/)**
|
155 |
+
**[Part 4: Test Data Generation](/blog/generating-test-data-with-ragas/)**
|
156 |
+
**[Part 5: Advanced Metrics and Customization](/blog/advanced-metrics-and-customization-with-ragas/)**
|
157 |
+
**[Part 6: Evaluating AI Agents](/blog/evaluating-ai-agents-with-ragas/)**
|
158 |
+
**[Part 7: Integrations and Observability](/blog/integrations-and-observability-with-ragas/)**
|
159 |
+
**Part 8: Building Feedback Loops — _You are here_**
|
160 |
+
|
161 |
+
*Have questions or want to share your feedback loop strategies? [Connect with me on LinkedIn](https://www.linkedin.com/in/muhammadafzaal/) for discussion or collaboration!*
|
data/building-research-agent/index.md
ADDED
@@ -0,0 +1,119 @@
1 |
+
---
|
2 |
+
layout: blog
|
3 |
+
title: Building a Research Agent with RSS Feed Support
|
4 |
+
date: 2025-04-20T00:00:00-06:00
|
5 |
+
description: How I created a comprehensive research assistant that combines web search, academic papers, RSS feeds, and document analysis to revolutionize information discovery.
|
6 |
+
categories: ["AI", "LLM", "Research", "Technology", "Agents"]
|
7 |
+
coverImage: "https://images.unsplash.com/photo-1507842217343-583bb7270b66?q=80&w=2290&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
|
8 |
+
readingTime: 5
|
9 |
+
published: true
|
10 |
+
---
|
11 |
+
|
12 |
+
In the age of information overload, finding the right data efficiently has become increasingly challenging. Whether you're conducting academic research, staying updated on industry trends, or investigating specific topics, the process often involves juggling multiple tools and platforms. This fragmentation inspired me to create a comprehensive solution: a research agent with RSS feed support that brings together multiple information sources in one unified interface.
|
13 |
+
|
14 |
+
## Why Build a Research Agent?
|
15 |
+
|
16 |
+
As someone who regularly conducts research across different domains, I've experienced the frustration of switching between search engines, academic databases, news aggregators, and document analysis tools. Each context switch breaks concentration and slows down the discovery process. I wanted a tool that could:
|
17 |
+
|
18 |
+
- Search across multiple information sources simultaneously
|
19 |
+
- Analyze uploaded documents in the context of web information
|
20 |
+
- Provide transparent reasoning about its research process
|
21 |
+
- Deliver structured, well-cited reports
|
22 |
+
|
23 |
+
The result is the [Research Agent](https://huggingface.co/spaces/mafzaal/AIE6-ResearchAgent) - an LLM-powered assistant that brings together web search, academic papers, RSS feeds, and document analysis into a single, coherent workflow.
|
24 |
+
|
25 |
+
## Multi-Source Research Architecture
|
26 |
+
|
27 |
+
The agent's strength comes from its ability to tap into various information streams:
|
28 |
+
|
29 |
+
### Web Search Integration
|
30 |
+
|
31 |
+
For real-time information and general knowledge, the agent leverages both Tavily and DuckDuckGo APIs to perform semantic searches across the web. This provides access to current events, recent developments, and general information that might not be available in academic sources.
|
32 |
+
|
33 |
+
### Academic Research Pipeline
|
34 |
+
|
35 |
+
Research often requires scholarly sources. The agent connects to arXiv's extensive database of scientific papers, allowing it to retrieve relevant academic articles complete with titles, authors, and abstracts. This is particularly valuable for technical topics that require peer-reviewed information.
|
36 |
+
|
37 |
+
### RSS Feed Aggregation
|
38 |
+
|
39 |
+
For targeted news monitoring and industry updates, the RSS feed reader component allows the agent to retrieve content from specific publications and blogs. This is ideal for tracking industry trends or following particular news sources relevant to your research topic.
|
40 |
+
|
41 |
+
### Document Analysis Engine
|
42 |
+
|
43 |
+
Perhaps the most powerful feature is the document analysis capability, which uses Retrieval Augmented Generation (RAG) to process uploaded PDFs or text files. By breaking documents into semantic chunks and creating vector embeddings, the agent can answer questions specifically about your documents while incorporating relevant information from other sources.
|
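In outline, that chunk-and-embed step looks like the sketch below. This is an illustration of the general LangChain-plus-Qdrant pattern, not the project's actual code; the file name, chunk sizes, and collection name are placeholders (see the GitHub repository linked below for the real implementation).

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load an uploaded document and split it into overlapping semantic chunks
docs = PyPDFLoader("uploaded_paper.pdf").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# Embed the chunks and index them in an in-memory Qdrant collection
vector_store = Qdrant.from_documents(
    chunks,
    OpenAIEmbeddings(),
    location=":memory:",
    collection_name="uploaded_docs",
)

# Retrieve the passages most relevant to a question to ground the agent's answer
retriever = vector_store.as_retriever(search_kwargs={"k": 4})
relevant_chunks = retriever.invoke("What methodology does the paper use?")
```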
44 |
+
|
45 |
+
## Behind the Scenes: LangGraph Workflow
|
46 |
+
|
47 |
+
What makes this agent particularly powerful is its LangGraph-based architecture, which provides a structured framework for reasoning and tool orchestration:
|
48 |
+
|
49 |
+

|
50 |
+
|
51 |
+
This workflow provides several key advantages:
|
52 |
+
|
53 |
+
1. **Contextual Awareness**: The agent maintains context throughout the research process
|
54 |
+
2. **Dynamic Tool Selection**: It intelligently chooses which information sources to query based on your question
|
55 |
+
3. **Transparent Reasoning**: You can see each step of the research process
|
56 |
+
4. **Consistent Output Structure**: Results are formatted into comprehensive reports with proper citations
|
57 |
+
|
58 |
+
## The Technology Stack
|
59 |
+
|
60 |
+
Building the Research Agent required integrating several cutting-edge technologies:
|
61 |
+
|
62 |
+
- **LangChain**: Provides the foundation for LLM application development
|
63 |
+
- **LangGraph**: Enables sophisticated workflow orchestration
|
64 |
+
- **Chainlit**: Powers the interactive chat interface
|
65 |
+
- **Qdrant**: Serves as the vector database for document embeddings
|
66 |
+
- **OpenAI**: Supplies the GPT-4o language model and embeddings
|
67 |
+
- **Tavily/DuckDuckGo**: Delivers web search capabilities
|
68 |
+
- **arXiv API**: Connects to academic paper repositories
|
69 |
+
- **Feedparser**: Handles RSS feed processing
|
70 |
+
|
71 |
+
## The Research Process in Action
|
72 |
+
|
73 |
+
When you ask the Research Agent a question, it follows a systematic process:
|
74 |
+
|
75 |
+
1. **Query Analysis**: It first analyzes your question to determine which information sources would be most relevant
|
76 |
+
2. **Multi-Tool Research**: Depending on the query, it executes searches across selected tools
|
77 |
+
3. **Context Retrieval**: If you've uploaded documents, it retrieves relevant passages from them
|
78 |
+
4. **Research Transparency**: It shows each step of its research process for full transparency
|
79 |
+
5. **Information Synthesis**: It analyzes and combines information from all sources
|
80 |
+
6. **Structured Reporting**: It delivers a comprehensive response with proper citations
|
81 |
+
|
82 |
+
## Real-World Applications
|
83 |
+
|
84 |
+
The Research Agent has proven valuable across various use cases:
|
85 |
+
|
86 |
+
- **Academic Research**: Gathering information across multiple scholarly sources
|
87 |
+
- **Competitive Analysis**: Staying updated on industry competitors
|
88 |
+
- **Technical Deep Dives**: Understanding complex technical topics
|
89 |
+
- **News Monitoring**: Tracking specific events across multiple sources
|
90 |
+
- **Document Q&A**: Asking questions about specific documents in broader context
|
91 |
+
|
92 |
+
## Lessons Learned and Future Directions
|
93 |
+
|
94 |
+
Building this agent taught me several valuable lessons about LLM application development:
|
95 |
+
|
96 |
+
1. **Tool Integration Complexity**: Combining multiple data sources requires careful consideration of data formats and query patterns
|
97 |
+
2. **Context Management**: Maintaining context across different research steps is critical for coherent outputs
|
98 |
+
3. **Transparency Matters**: Users trust AI more when they can see how it reached its conclusions
|
99 |
+
4. **LangGraph Power**: The graph-based approach to LLM workflows provides significant advantages over simpler chains
|
100 |
+
|
101 |
+
Looking ahead, I'm exploring several enhancements:
|
102 |
+
|
103 |
+
- Expanded academic database integration beyond arXiv
|
104 |
+
- More sophisticated document analysis with multi-document reasoning
|
105 |
+
- Improved citation formats and bibliographic support
|
106 |
+
- Enhanced visualization of research findings
|
107 |
+
|
108 |
+
## Try It Yourself
|
109 |
+
|
110 |
+
The Research Agent is available as an open-source project, and you can try it directly on Hugging Face Spaces:
|
111 |
+
|
112 |
+
- **Live Demo**: [Hugging Face Space](https://huggingface.co/spaces/mafzaal/AIE6-ResearchAgent)
|
113 |
+
- **Source Code**: [GitHub Repository](https://github.com/mafzaal/AIE6-ResearchAgent)
|
114 |
+
|
115 |
+
If you're interested in deploying your own instance, the GitHub repository includes detailed setup instructions for both local development and Docker deployment.
|
116 |
+
|
117 |
+
---
|
118 |
+
|
119 |
+
*Have you used the Research Agent or built similar tools? I'd love to hear about your experiences and any suggestions for improvements. Feel free to reach out through the contact form or connect with me on social media!*
|
data/coming-back-to-ai-roots/index.md
ADDED
@@ -0,0 +1,64 @@
1 |
+
---
|
2 |
+
layout: blog
|
3 |
+
title: Coming Back to AI Roots - My Professional Journey
|
4 |
+
date: 2025-04-14T00:00:00-06:00
|
5 |
+
description: A personal reflection on my career journey from AI to web and enterprise software development, and why I'm returning to my original passion for artificial intelligence.
|
6 |
+
categories: ["AI", "Personal Journey", "Technology"]
|
7 |
+
coverVideo: "/videos/back_to_future.mp4"
|
8 |
+
readingTime: 4
|
9 |
+
published: true
|
10 |
+
---
|
11 |
+
|
12 |
+
|
13 |
+
Have you ever felt that life has a way of bringing you full circle? That's exactly how I feel about my career trajectory. My name is Muhammad Afzaal, and I'd like to share the story of my professional journey - from my early fascination with artificial intelligence, through years of web and enterprise software development, and now back to where it all began.
|
14 |
+
|
15 |
+
## The Early AI Days
|
16 |
+
|
17 |
+
My professional journey began with a deep fascination for artificial intelligence. As a student, I was captivated by the potential of machines that could learn and make decisions. This was well before the current AI boom - back when neural networks were still considered somewhat niche and the term "deep learning" wasn't yet a household phrase.
|
18 |
+
|
19 |
+
I spent countless hours immersed in neural networks, image processing, and computer vision. My early career was defined by research projects and small-scale AI implementations - including Urdu OCR systems and data extraction from paper-based forms in 2003-2004. I still have vivid memories of recruiting fellow students to handwrite text samples, then meticulously scanning, labeling, and training neural networks with this data. While modest by today's standards, these projects represented glimpses into a future where machines could meaningfully augment human capabilities in ways that seemed almost magical at the time.
|
20 |
+
|
21 |
+
## The Pivot to Web and Enterprise Development
|
22 |
+
|
23 |
+
As often happens in technology careers, opportunities led me in a different direction. The explosive growth of web technologies and enterprise systems created a high demand for developers with these skills, and I found myself gradually pivoting away from AI.
|
24 |
+
|
25 |
+
For several years, I immersed myself in the world of web and enterprise software development. I worked with various frameworks and technologies, built scalable systems, and helped businesses solve complex problems through software. This journey taught me invaluable lessons about software architecture, user experience, and delivering production-quality code that serves real business needs.
|
26 |
+
|
27 |
+
Working in enterprise software development exposed me to the challenges of building systems that not only function correctly but can also scale, evolve, and adapt to changing requirements. I learned the importance of clean code, thoughtful architecture, and considering the entire lifecycle of software products.
|
28 |
+
|
29 |
+
## Why I'm Returning to AI
|
30 |
+
|
31 |
+
While my time in web and enterprise development was rewarding, I've always felt a pull back toward artificial intelligence. The recent AI renaissance - with breakthroughs in large language models, generative AI, and machine learning at scale - has reignited my original passion.
|
32 |
+
|
33 |
+
We're living in what may be the most exciting time in AI history. Models like GPT-4, Claude, and open-source alternatives are demonstrating capabilities that seemed like science fiction just a few years ago. The tools and frameworks available today make AI more accessible than ever before, and the potential applications span virtually every domain of human endeavor.
|
34 |
+
|
35 |
+
What excites me most is that my experience in enterprise software development gives me a unique perspective on AI implementation. I understand not just the algorithms and models, but also how to integrate them into robust, production-ready systems that deliver real value.
|
36 |
+
|
37 |
+
## The Best of Both Worlds
|
38 |
+
|
39 |
+
Coming back to AI doesn't mean leaving behind everything I learned in web and enterprise development. Quite the opposite - I believe my background gives me a particular advantage in building AI systems that are:
|
40 |
+
|
41 |
+
- **Production-ready**: Understanding software engineering best practices helps create AI systems that can operate reliably at scale.
|
42 |
+
- **User-focused**: Experience with UX principles ensures AI solutions are designed with actual human users in mind.
|
43 |
+
- **Integrated**: Knowledge of enterprise systems makes it easier to connect AI capabilities with existing business processes.
|
44 |
+
- **Simplified**: My experience in streamlining complex business processes helps me identify where AI can have the greatest impact through intelligent automation.
|
45 |
+
- **Business-oriented**: I understand that AI isn't just about the technology—it's about solving real business problems and creating measurable value.
|
46 |
+
- **Practical**: I focus on practical applications that deliver immediate benefits rather than getting caught up in theoretical possibilities.
|
47 |
+
|
48 |
+
## What's Next
|
49 |
+
|
50 |
+
As I return to my AI roots, I'm excited to share this journey with you through this blog. In the coming months, I plan to write about:
|
51 |
+
|
52 |
+
- Practical applications of modern AI technologies
|
53 |
+
- How to bridge the gap between AI research and production systems
|
54 |
+
- The intersection of web technologies and AI
|
55 |
+
- Ethical considerations in AI implementation
|
56 |
+
- Tutorials and guides for developers looking to incorporate AI into their projects
|
57 |
+
|
58 |
+
If you're interested in AI, software development, or the intersection of these fields, I hope you'll join me on this journey. Whether you're a seasoned AI practitioner, a web developer curious about machine learning, or simply interested in how technology is evolving, I believe there's something here for you.
|
59 |
+
|
60 |
+
Here's to coming full circle, building on past experiences, and embracing the exciting future of AI!
|
61 |
+
|
62 |
+
---
|
63 |
+
|
64 |
+
*Have questions or topics you’d like me to cover? Feel free to [reach out](https://www.linkedin.com/in/muhammadafzaal/) — I’d love to hear from you!*
|
data/data-is-king/index.md
ADDED
@@ -0,0 +1,78 @@
1 |
+
---
|
2 |
+
layout: blog
|
3 |
+
title: "Data is King: Why Your Data Strategy IS Your Business Strategy"
|
4 |
+
date: 2025-04-15T00:00:00-06:00
|
5 |
+
categories: ["AI", "Strategy","Data"]
|
6 |
+
description: "Discover why controlling unique, high-quality data is your organization's most valuable competitive advantage in the AI era, and how a strategic approach to data ownership is becoming essential to business success."
|
7 |
+
coverImage: "https://images.unsplash.com/photo-1705484229341-4f7f7519b718?q=80&w=1740&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
|
8 |
+
readingTime: 3
|
9 |
+
published: true
|
10 |
+
---
|
11 |
+
|
12 |
+
In the rapidly evolving world of artificial intelligence and machine learning, there's a phrase that has become something of a mantra among practitioners: "Data is king." This concept, often attributed to Peter Norvig, the Research Director at Google, challenges the conventional wisdom that sophisticated algorithms are the primary drivers of AI advancement.
|
13 |
+
|
14 |
+
## The Origin of "Data is King"
|
15 |
+
|
16 |
+
Peter Norvig famously stated, "We don't have better algorithms. We just have more data." This statement emerged during a time when Google's approach to machine translation was yielding surprisingly effective results not through algorithmic innovations, but through the sheer volume of multilingual data they had amassed.
|
17 |
+
|
18 |
+
This perspective represented a paradigm shift. Prior to this, the field had largely focused on crafting ever more sophisticated algorithms, with the assumption that smarter code would yield better results. Norvig's insight suggested something different: even relatively simple algorithms could outperform more sophisticated ones when trained on sufficiently large datasets.
|
19 |
+
|
20 |
+
## The Business Imperative of Data Ownership
|
21 |
+
|
22 |
+
In today's AI-driven economy, Norvig's insight has profound implications for businesses. Companies that control unique, high-quality datasets possess an increasingly valuable competitive advantage that can't be easily replicated—even by competitors with superior engineering talent.
|
23 |
+
|
24 |
+
### Why Data Ownership Matters
|
25 |
+
|
26 |
+
1. **Sustainable Competitive Advantage**: While algorithms can be replicated or even improved upon by competitors, proprietary data is uniquely yours. A company with exclusive access to valuable data can maintain market leadership even when algorithmic approaches become standardized.
|
27 |
+
|
28 |
+
2. **Diminishing Returns on Algorithmic Innovation**: As machine learning techniques mature, algorithmic improvements often yield smaller incremental gains compared to expanding or improving your data assets.
|
29 |
+
|
30 |
+
3. **Model Defensibility**: Proprietary data creates a moat around your AI systems that competitors cannot easily cross, regardless of their technical capabilities.
|
31 |
+
|
32 |
+
4. **Value Appreciation**: Unlike physical assets that depreciate, data assets often appreciate in value over time as more patterns and insights can be extracted with evolving technology.
|
33 |
+
|
34 |
+
### The Risks of Data Dependency
|
35 |
+
|
36 |
+
Organizations that rely on third-party data sources or lack clear data ownership strategies face significant risks:
|
37 |
+
|
38 |
+
- **Vulnerability to supply disruptions** when data providers change terms or access
|
39 |
+
- **Limited ability to differentiate** their AI applications from competitors
|
40 |
+
- **Reduced capacity for innovation** as they lack the raw material for new insights
|
41 |
+
- **Potential lock-in** to specific vendors or platforms that control their data access
|
42 |
+
|
43 |
+
For forward-thinking enterprises, data strategy should be elevated to the same level of importance as product, technology, and market strategies. This means investing in data acquisition, management, and governance with the same rigor applied to other mission-critical functions.
|
44 |
+
|
45 |
+
## How "TheDataGuy" Can Transform Your Data Strategy
|
46 |
+
|
47 |
+
As "TheDataGuy," I help businesses transform their approach to data assets through a comprehensive framework that turns raw information into strategic advantage:
|
48 |
+
|
49 |
+
### My Data Value Chain Approach
|
50 |
+
|
51 |
+
1. **Data Collection & Acquisition**: Designing efficient pipelines to gather relevant, high-quality data while ensuring compliance with regulatory requirements.
|
52 |
+
|
53 |
+
2. **Storage Architecture**: Implementing scalable, secure storage solutions that balance accessibility with cost-effectiveness.
|
54 |
+
|
55 |
+
3. **Organization & Governance**: Establishing metadata frameworks, quality control processes, and governance structures that make data discoverable and trustworthy.
|
56 |
+
|
57 |
+
4. **Insight Extraction**: Applying analytics techniques from basic reporting to advanced machine learning that convert data into actionable business intelligence.
|
58 |
+
|
59 |
+
5. **LLM Specialization**: Creating specialized AI capabilities tailored to your business context:
|
60 |
+
|
61 |
+
a. **Retrieval-Augmented Generation (RAG)**: Implementing systems that combine your proprietary data with foundation models, enabling AI to access your business knowledge while reducing hallucinations and improving accuracy.
|
62 |
+
|
63 |
+
b. **Domain-Specific Fine-Tuning**: Adapting pre-trained models to your industry's terminology, workflows, and requirements through targeted training on curated datasets.
|
64 |
+
|
65 |
+
c. **Hybrid Approaches**: Developing systems that intelligently combine RAG and fine-tuning to maximize performance while minimizing computational costs and training time.
|
66 |
+
|
67 |
+
d. **Knowledge Distillation**: Creating smaller, more efficient specialized models that capture the essential capabilities needed for your specific business applications.
|
68 |
+
|
69 |
+
By working across this entire spectrum, organizations can develop truly proprietary AI capabilities that competitors cannot easily replicate, regardless of their technical talent or computational resources.
|
70 |
+
|
71 |
+
Remember: in the age of AI, your data strategy isn't just supporting your business strategy—increasingly, it *is* your business strategy.
|
72 |
+
## Ready to Make Data Your Competitive Advantage?
|
73 |
+
|
74 |
+
Don't let valuable data opportunities slip away. Whether you're just beginning your data journey or looking to enhance your existing strategy, I can help transform your approach to this critical business asset.
|
75 |
+
|
76 |
+
### Let's Connect
|
77 |
+
Connect with me on [LinkedIn](https://www.linkedin.com/in/muhammadafzaal/) to discuss how I can help your organization harness the power of data.
|
78 |
+
|
data/evaluating-ai-agents-with-ragas/index.md
ADDED
@@ -0,0 +1,170 @@
1 |
+
---
|
2 |
+
title: "Part 6: Evaluating AI Agents: Beyond Simple Answers with Ragas"
|
3 |
+
date: 2025-04-28T06:00:00-06:00
|
4 |
+
layout: blog
|
5 |
+
description: "Learn how to evaluate complex AI agents using Ragas' specialized metrics for goal accuracy, tool call accuracy, and topic adherence to build more reliable and effective agent-based applications."
|
6 |
+
categories: ["AI", "Agents", "Evaluation", "Ragas", "LLM"]
|
7 |
+
coverImage: "/images/ai_agent_evaluation.png"
|
8 |
+
readingTime: 8
|
9 |
+
published: true
|
10 |
+
---
|
11 |
+
|
12 |
+
In our previous posts, we've explored how Ragas evaluates RAG systems and enables custom metrics for specialized applications. As LLMs evolve beyond simple question-answering to become powerful AI agents, evaluation needs have grown more sophisticated too. In this post, we'll explore Ragas' specialized metrics for evaluating AI agents that engage in multi-turn interactions, use tools, and work toward specific goals.
|
13 |
+
|
14 |
+
## The Challenge of Evaluating AI Agents
|
15 |
+
|
16 |
+
Unlike traditional RAG systems, AI agents present unique evaluation challenges:
|
17 |
+
|
18 |
+
- **Multi-turn interactions**: Agents maintain context across multiple exchanges
|
19 |
+
- **Tool usage**: Agents call external tools and APIs to accomplish tasks
|
20 |
+
- **Goal-oriented behavior**: Success means achieving the user's ultimate objective
|
21 |
+
- **Boundaries and constraints**: Agents must operate within defined topic boundaries
|
22 |
+
|
23 |
+
Standard metrics like faithfulness or answer relevancy don't fully capture these dimensions. Let's explore three specialized metrics Ragas provides for agent evaluation.
|
24 |
+
|
25 |
+
## Evaluating AI Agents: Beyond Simple Answers with Ragas
|
26 |
+
|
27 |
+
### 1. Goal Accuracy (`agent_goal_accuracy`)
|
28 |
+
|
29 |
+
**What it measures:** Did the agent successfully achieve the user's ultimate objective over the course of the interaction?
|
30 |
+
|
31 |
+
**How it works:**
|
32 |
+
This metric analyzes the entire agent workflow (user inputs, AI responses, tool calls).
|
33 |
+
* It uses an LLM (`InferGoalOutcomePrompt`) to identify the `user_goal` and the `end_state` (what actually happened).
|
34 |
+
* It then compares the `end_state` to either:
|
35 |
+
* A provided `reference` outcome (**`AgentGoalAccuracyWithReference`**).
|
36 |
+
* The inferred `user_goal` (**`AgentGoalAccuracyWithoutReference`**).
|
37 |
+
* An LLM (`CompareOutcomePrompt`) determines if the achieved outcome matches the desired one, resulting in a binary score (1 for success, 0 for failure).
|
38 |
+
|
39 |
+
**Why it's important:** For task-oriented agents (like booking systems or assistants), success isn't about individual responses but about completing the overall task correctly. This metric directly measures that end-to-end success.
|
40 |
+
|
41 |
+
### 2. Tool Call Accuracy (`tool_call_accuracy`)
|
42 |
+
|
43 |
+
**What it measures:** Did the agent use the correct tools, in the right order, and with the right arguments?
|
44 |
+
|
45 |
+
**How it works:**
|
46 |
+
This metric compares the sequence and details of tool calls made by the agent against a `reference_tool_calls` list.
|
47 |
+
* It checks if the *sequence* of tool names called by the agent aligns with the reference sequence (`is_sequence_aligned`).
|
48 |
+
* For each matching tool call, it compares the arguments provided by the agent to the reference arguments, often using a sub-metric like `ExactMatch` (`_get_arg_score`).
|
49 |
+
* The final score reflects both the sequence alignment and the argument correctness.
|
50 |
+
|
51 |
+
**Why it's important:** Many agents rely on external tools (APIs, databases, etc.). Incorrect tool usage (wrong tool, bad parameters) leads to task failure. This metric pinpoints issues in the agent's interaction with its tools.
|
52 |
+
|
53 |
+
### 3. Topic Adherence (`topic_adherence`)
|
54 |
+
|
55 |
+
**What it measures:** Did the agent stick to the allowed topics and appropriately handle requests about restricted topics?
|
56 |
+
|
57 |
+
**How it works:**
|
58 |
+
This metric evaluates conversations against a list of `reference_topics`.
|
59 |
+
* It extracts the topics discussed in the user's input (`TopicExtractionPrompt`).
|
60 |
+
* It checks if the agent refused to answer questions related to specific topics (`TopicRefusedPrompt`).
|
61 |
+
* It classifies whether the discussed topics fall within the allowed `reference_topics` (`TopicClassificationPrompt`).
|
62 |
+
* Based on these classifications and refusals, it calculates a score (Precision, Recall, or F1) indicating how well the agent adhered to the topic constraints.
|
63 |
+
|
64 |
+
**Why it's important:** Ensures agents stay focused, avoid generating content on forbidden subjects (safety, policy), and handle out-of-scope requests gracefully.
|
65 |
+
|
66 |
+
## Implementing Agent Evaluation in Practice
|
67 |
+
|
68 |
+
Let's look at a practical example of evaluating an AI agent using these metrics:
|
69 |
+
|
70 |
+
```python
|
71 |
+
from ragas.metrics import AgentGoalAccuracyWithoutReference, ToolCallAccuracy, TopicAdherenceScore
|
72 |
+
from ragas.evaluation import EvaluationDataset
|
73 |
+
from ragas.dataset_schema import MultiTurnSample
|
74 |
+
from langchain_openai import ChatOpenAI
|
75 |
+
from ragas.llms import LangchainLLMWrapper
|
76 |
+
|
77 |
+
# Initialize the LLM
|
78 |
+
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
|
79 |
+
|
80 |
+
# Example conversation with a travel booking agent
|
81 |
+
test_data = {
|
82 |
+
"user_input": [
|
83 |
+
{"role": "user", "content": "I need to book a flight from New York to London next Friday"},
|
84 |
+
{"role": "assistant", "content": "I'd be happy to help you book a flight. Let me search for options...",
|
85 |
+
"tool_calls": [{"name": "search_flights", "arguments": {"origin": "NYC", "destination": "LON", "date": "next Friday"}}]},
|
86 |
+
{"role": "tool", "name": "search_flights", "content": "Found 5 flights: Flight 1 (Delta, $750), Flight 2 (British Airways, $820)..."},
|
87 |
+
{"role": "assistant", "content": "I found several flights from New York to London next Friday. The cheapest option is Delta for $750. Would you like to book this one?"},
|
88 |
+
{"role": "user", "content": "Yes, please book the Delta flight"},
|
89 |
+
{"role": "assistant", "content": "I'll book that for you now.",
|
90 |
+
"tool_calls": [{"name": "book_flight", "arguments": {"flight_id": "delta_123", "price": "$750"}}]},
|
91 |
+
{"role": "tool", "name": "book_flight", "content": "Booking confirmed. Confirmation #: ABC123"},
|
92 |
+
{"role": "assistant", "content": "Great news! Your flight is confirmed. Your confirmation number is ABC123. The flight is scheduled for next Friday. Is there anything else you need help with?"}
|
93 |
+
],
|
94 |
+
"reference_topics": ["travel", "flight booking", "schedules", "prices"],
|
95 |
+
"reference_tool_calls": [
|
96 |
+
{"name": "search_flights", "args": {"origin": "NYC", "destination": "LON", "date": "next Friday"}},
|
97 |
+
{"name": "book_flight", "args": {"flight_id": "delta_123", "price": "$750"}}
|
98 |
+
]
|
99 |
+
}
|
100 |
+
|
101 |
+
# Create a sample
|
102 |
+
sample = MultiTurnSample(**test_data)
|
103 |
+
|
104 |
+
# Initialize metrics
|
105 |
+
goal_accuracy = AgentGoalAccuracyWithoutReference(llm=evaluator_llm)
|
106 |
+
tool_accuracy = ToolCallAccuracy()
|
107 |
+
topic_adherence = TopicAdherenceScore(llm=evaluator_llm)
|
108 |
+
|
109 |
+
# Calculate scores
|
110 |
+
goal_score = await goal_accuracy.multi_turn_ascore(sample)
|
111 |
+
tool_score = tool_accuracy.multi_turn_score(sample)
|
112 |
+
topic_score = await topic_adherence.multi_turn_ascore(sample)
|
113 |
+
|
114 |
+
print(f"Goal Accuracy: {goal_score}")
|
115 |
+
print(f"Tool Call Accuracy: {tool_score}")
|
116 |
+
print(f"Topic Adherence: {topic_score}")
|
117 |
+
```
|
118 |
+
|
119 |
+
> 💡 **Try it yourself:**
|
120 |
+
> Explore the hands-on notebook for agent evaluation:
|
121 |
+
> [06_Evaluating_AI_Agents](https://github.com/mafzaal/intro-to-ragas/blob/master/06_Evaluating_AI_Agents.ipynb)
|
122 |
+
|
123 |
+
## Advanced Agent Evaluation Techniques
|
124 |
+
|
125 |
+
### Combining Metrics for Comprehensive Evaluation
|
126 |
+
|
127 |
+
For a complete assessment of agent capabilities, combine multiple metrics:
|
128 |
+
|
129 |
+
```python
|
130 |
+
from ragas import evaluate
|
131 |
+
|
132 |
+
results = evaluate(
|
133 |
+
dataset, # Your dataset of agent conversations
|
134 |
+
metrics=[
|
135 |
+
AgentGoalAccuracyWithoutReference(llm=evaluator_llm),
|
136 |
+
ToolCallAccuracy(),
|
137 |
+
TopicAdherenceScore(llm=evaluator_llm)
|
138 |
+
]
|
139 |
+
)
|
140 |
+
```
|
141 |
+
|
142 |
+
## Best Practices for Agent Evaluation
|
143 |
+
|
144 |
+
1. **Test scenario coverage:** Include a diverse range of interaction scenarios
|
145 |
+
2. **Edge case handling:** Test how agents handle unexpected inputs or failures
|
146 |
+
3. **Longitudinal evaluation:** Track performance over time to identify regressions
|
147 |
+
4. **Human-in-the-loop validation:** Periodically verify metric alignment with human judgments
|
148 |
+
5. **Continuous feedback loops:** Use evaluation insights to guide agent improvements
|
149 |
+
|
150 |
+
## Conclusion
|
151 |
+
|
152 |
+
Evaluating AI agents requires specialized metrics that go beyond traditional RAG evaluation. Ragas' `agent_goal_accuracy`, `tool_call_accuracy`, and `topic_adherence` provide crucial insights into whether an agent can successfully complete tasks, use tools correctly, and stay within designated boundaries.
|
153 |
+
|
154 |
+
By incorporating these metrics into your evaluation pipeline, you can build more reliable and effective AI agents that truly deliver on the promise of helpful, goal-oriented AI assistants.
|
155 |
+
|
156 |
+
In our next post, we'll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows.
|
157 |
+
|
158 |
+
---
|
159 |
+
|
160 |
+
**[Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications](/blog/introduction-to-ragas/)**
|
161 |
+
**[Part 2: Basic Evaluation Workflow](/blog/basic-evaluation-workflow-with-ragas/)**
|
162 |
+
**[Part 3: Evaluating RAG Systems with Ragas](/blog/evaluating-rag-systems-with-ragas/)**
|
163 |
+
**[Part 4: Test Data Generation](/blog/generating-test-data-with-ragas/)**
|
164 |
+
**[Part 5: Advanced Metrics and Customization](/blog/advanced-metrics-and-customization-with-ragas/)**
|
165 |
+
**Part 6: Evaluating AI Agents — _You are here_**
|
166 |
+
*Next up in the series:*
|
167 |
+
**[Part 7: Integrations and Observability](/blog/integrations-and-observability-with-ragas/)**
|
168 |
+
**[Part 8: Building Feedback Loops](/blog/building-feedback-loops-with-ragas/)**
|
169 |
+
|
170 |
+
*How are you evaluating your AI agents? What challenges have you encountered in measuring agent performance? If you're facing specific evaluation hurdles, don't hesitate to [reach out](https://www.linkedin.com/in/muhammadafzaal/)—we'd love to help!*
|
data/evaluating-rag-systems-with-ragas/index.md
ADDED
@@ -0,0 +1,197 @@
1 |
+
---
|
2 |
+
title: "Part 3: Evaluating RAG Systems with Ragas"
|
3 |
+
date: 2025-04-26T20:00:00-06:00
|
4 |
+
layout: blog
|
5 |
+
description: "Learn specialized techniques for comprehensive evaluation of Retrieval-Augmented Generation systems using Ragas, including metrics for retrieval quality, generation quality, and end-to-end performance."
|
6 |
+
categories: ["AI", "RAG", "Evaluation", "Ragas"]
|
7 |
+
coverImage: "https://images.unsplash.com/photo-1743796055664-3473eedab36e?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
|
8 |
+
readingTime: 14
|
9 |
+
published: true
|
10 |
+
---
|
11 |
+
|
12 |
+
In our previous post, we covered the fundamentals of setting up evaluation workflows with Ragas. Now, let's focus specifically on evaluating Retrieval-Augmented Generation (RAG) systems, which present unique evaluation challenges due to their multi-component nature.
|
13 |
+
|
14 |
+
## Understanding RAG Systems: More Than the Sum of Their Parts
|
15 |
+
|
16 |
+
RAG systems combine two critical capabilities:
|
17 |
+
1. **Retrieval**: Finding relevant information from a knowledge base
|
18 |
+
2. **Generation**: Creating coherent, accurate responses based on retrieved information
|
19 |
+
|
20 |
+
This dual nature means evaluation must address both components while also assessing their interaction. A system might retrieve perfect information but generate poor responses, or generate excellent prose from irrelevant retrieved content.
|
21 |
+
|
22 |
+
## The RAG Evaluation Triad
|
23 |
+
|
24 |
+
Effective RAG evaluation requires examining three key dimensions:
|
25 |
+
|
26 |
+
1. **Retrieval Quality**: How well does the system find relevant information?
|
27 |
+
2. **Generation Quality**: How well does the system produce responses from retrieved information?
|
28 |
+
3. **End-to-End Performance**: How well does the complete system satisfy user needs?
|
29 |
+
|
30 |
+
Let's explore how Ragas helps evaluate each dimension of RAG systems.
|
31 |
+
|
32 |
+
## Core RAG Metrics in Ragas
|
33 |
+
|
34 |
+
Ragas provides specialized metrics to assess RAG systems across retrieval, generation, and end-to-end performance.
|
35 |
+
|
36 |
+
### Retrieval Quality Metrics
|
37 |
+
|
38 |
+
#### 1. Context Relevancy
|
39 |
+
|
40 |
+
Measures how relevant the retrieved documents are to the user's question.
|
41 |
+
|
42 |
+
- **How it works:**
|
43 |
+
- Takes the user's question (`user_input`) and the retrieved documents (`retrieved_contexts`).
|
44 |
+
- Uses an LLM to score relevance with two different prompts, averaging the results for robustness.
|
45 |
+
- Scores are normalized between 0.0 (irrelevant) and 1.0 (fully relevant).
|
46 |
+
|
47 |
+
- **Why it matters:**
|
48 |
+
Low scores indicate your retriever is pulling in unrelated or noisy documents. Monitoring this helps you improve the retrieval step.
|
49 |
+
|
50 |
+
#### 2. Context Precision
|
51 |
+
|
52 |
+
Assesses how much of the retrieved context is actually useful for generating the answer.
|
53 |
+
|
54 |
+
- **How it works:**
|
55 |
+
- For each retrieved chunk, an LLM judges if it was necessary for the answer, using the ground truth (`reference`) or the generated response.
|
56 |
+
- Calculates [Average Precision](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Average_precision), rewarding systems that rank useful chunks higher.
|
57 |
+
|
58 |
+
- **Variants:**
|
59 |
+
- `ContextUtilization`: Uses the generated response instead of ground truth.
|
60 |
+
- Non-LLM version: Compares retrieved chunks to ideal reference contexts using string similarity.
|
61 |
+
|
62 |
+
- **Why it matters:**
|
63 |
+
High precision means your retriever is efficient; low precision means too much irrelevant information is included.
|
64 |
+
|
65 |
+
#### 3. Context Recall
|
66 |
+
|
67 |
+
Evaluates whether all necessary information from the ground truth answer is present in the retrieved context.
|
68 |
+
|
69 |
+
- **How it works:**
|
70 |
+
- Breaks down the reference answer into sentences.
|
71 |
+
- For each sentence, an LLM checks if it can be supported by the retrieved context.
|
72 |
+
- The score is the proportion of reference sentences attributed to the retrieved context.
|
73 |
+
|
74 |
+
- **Variants:**
|
75 |
+
- Non-LLM version: Compares reference and retrieved contexts using similarity and thresholds.
|
76 |
+
|
77 |
+
- **Why it matters:**
|
78 |
+
High recall means your retriever finds all needed information; low recall means critical information is missing.
|
79 |
+
|
80 |
+
**Summary:**
|
81 |
+
- **Low context relevancy:** Retriever needs better query understanding or semantic matching.
|
82 |
+
- **Low context precision:** Retriever includes unnecessary information.
|
83 |
+
- **Low context recall:** Retriever misses critical information.
|
84 |
+
|
85 |
+
### Generation Quality Metrics
|
86 |
+
|
87 |
+
#### 1. Faithfulness
|
88 |
+
|
89 |
+
Checks if the generated answer is factually consistent with the retrieved context, addressing hallucination.
|
90 |
+
|
91 |
+
- **How it works:**
|
92 |
+
- Breaks the answer into simple statements.
|
93 |
+
- For each, an LLM checks if it can be inferred from the retrieved context.
|
94 |
+
- The score is the proportion of faithful statements.
|
95 |
+
|
96 |
+
- **Alternative:**
|
97 |
+
- `FaithfulnesswithHHEM`: Uses a specialized NLI model for verification.
|
98 |
+
|
99 |
+
- **Why it matters:**
|
100 |
+
High faithfulness means answers are grounded in context; low faithfulness signals hallucination.
|
101 |
+
|
102 |
+
#### 2. Answer Relevancy
|
103 |
+
|
104 |
+
Measures if the generated answer directly addresses the user's question.
|
105 |
+
|
106 |
+
- **How it works:**
|
107 |
+
- Asks an LLM to generate possible questions for the answer.
|
108 |
+
- Compares these to the original question using embedding similarity.
|
109 |
+
- Penalizes noncommittal answers.
|
110 |
+
|
111 |
+
- **Why it matters:**
|
112 |
+
High relevancy means answers are on-topic; low relevancy means answers are off-topic or incomplete.
|
113 |
+
|
114 |
+
**Summary:**
|
115 |
+
- **Low faithfulness:** Generator adds facts not supported by context.
|
116 |
+
- **Low answer relevancy:** Generator doesn't focus on the specific question.
|
117 |
+
|
118 |
+
### End-to-End Metrics
|
119 |
+
|
120 |
+
#### 1. Correctness
|
121 |
+
|
122 |
+
Assesses factual alignment between the generated answer and a ground truth reference.
|
123 |
+
|
124 |
+
- **How it works:**
|
125 |
+
- Breaks both the answer and reference into claims.
|
126 |
+
- Uses NLI to verify claims in both directions.
|
127 |
+
- Calculates precision, recall, or F1-score.
|
128 |
+
|
129 |
+
- **Why it matters:**
|
130 |
+
High correctness means answers match the ground truth; low correctness signals factual errors.
|
131 |
+
|
132 |
+
**Key distinction:**
|
133 |
+
- `Faithfulness`: Compares answer to retrieved context.
|
134 |
+
- `FactualCorrectness`: Compares answer to ground truth.
|
135 |
+
|
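To make these metric descriptions concrete, here is a minimal sketch of scoring a single RAG output across the retrieval, generation, and end-to-end dimensions. The sample content and model names are illustrative assumptions, and the exact metric class names should be checked against the Ragas version you have installed.

```python
from ragas import evaluate
from ragas.evaluation import EvaluationDataset
from ragas.metrics import (
    LLMContextPrecisionWithReference,  # retrieval: are useful chunks ranked highly?
    LLMContextRecall,                  # retrieval: does the context cover the reference answer?
    Faithfulness,                      # generation: is the answer grounded in the context?
    ResponseRelevancy,                 # generation: does the answer address the question?
    FactualCorrectness,                # end-to-end: does the answer match the ground truth?
)
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

# One hand-written sample; in practice these rows come from your RAG pipeline and test set.
dataset = EvaluationDataset.from_list([
    {
        "user_input": "When did the first moon landing happen?",
        "retrieved_contexts": ["Apollo 11 landed on the Moon on July 20, 1969."],
        "response": "The first moon landing took place on July 20, 1969.",
        "reference": "The first crewed moon landing occurred on July 20, 1969.",
    }
])

results = evaluate(
    dataset=dataset,
    metrics=[
        LLMContextPrecisionWithReference(),
        LLMContextRecall(),
        Faithfulness(),
        ResponseRelevancy(),
        FactualCorrectness(),
    ],
    llm=evaluator_llm,
    embeddings=evaluator_embeddings,  # used by embedding-based metrics such as ResponseRelevancy
)
print(results)
```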
136 |
+
---
|
137 |
+
|
138 |
+
## Common RAG Evaluation Patterns
|
139 |
+
|
140 |
+
### 1. High Retrieval, Low Generation Scores
|
141 |
+
|
142 |
+
- **Diagnosis:** Good retrieval, poor use of information.
|
143 |
+
- **Fixes:** Improve prompts, use better generation models, or verify responses post-generation.
|
144 |
+
|
145 |
+
### 2. Low Retrieval, High Generation Scores
|
146 |
+
|
147 |
+
- **Diagnosis:** Good generation, inadequate information.
|
148 |
+
- **Fixes:** Enhance indexing, retrieval algorithms, or expand the knowledge base.
|
149 |
+
|
150 |
+
### 3. Low Context Precision, High Faithfulness
|
151 |
+
|
152 |
+
- **Diagnosis:** Retrieves too much, but generates reliably.
|
153 |
+
- **Fixes:** Filter passages, optimize chunk size, or use re-ranking.
|
154 |
+
|
155 |
+
---
|
156 |
+
|
157 |
+
## Best Practices for RAG Evaluation
|
158 |
+
|
159 |
+
1. **Evaluate components independently:** Assess retrieval and generation separately.
|
160 |
+
2. **Use diverse queries:** Include factoid, explanatory, and complex questions.
|
161 |
+
3. **Compare against baselines:** Test against simpler systems.
|
162 |
+
4. **Perform ablation studies:** Try variations like different chunk sizes or retrieval models.
|
163 |
+
5. **Combine with human evaluation:** Use Ragas with human judgment for a complete view.
|
164 |
+
|
165 |
+
---
|
166 |
+
|
167 |
+
## Conclusion: The Iterative RAG Evaluation Cycle
|
168 |
+
|
169 |
+
Effective RAG development is iterative:
|
170 |
+
|
171 |
+
1. **Evaluate:** Measure performance.
|
172 |
+
2. **Analyze:** Identify weaknesses.
|
173 |
+
3. **Improve:** Apply targeted enhancements.
|
174 |
+
4. **Re-evaluate:** Measure the impact of changes.
|
175 |
+
|
176 |
+
<p align="center">
|
177 |
+
<img src="/images/the-iterative-rag-evaluation-cycle.png" alt="The Iterative RAG Evaluation Cycle" width="50%">
|
178 |
+
</p>
|
179 |
+
|
180 |
+
By using Ragas to implement this cycle, you can systematically improve your RAG system's performance across all dimensions.
|
181 |
+
|
182 |
+
In our next post, we'll explore how to generate high-quality test datasets for comprehensive RAG evaluation, addressing the common challenge of limited test data.
|
183 |
+
|
184 |
+
---
|
185 |
+
|
186 |
+
**[Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications](/blog/introduction-to-ragas/)**
|
187 |
+
**[Part 2: Basic Evaluation Workflow](/blog/basic-evaluation-workflow-with-ragas/)**
|
188 |
+
**Part 3: Evaluating RAG Systems with Ragas — _You are here_**
|
189 |
+
*Next up in the series:*
|
190 |
+
**[Part 4: Test Data Generation](/blog/generating-test-data-with-ragas/)**
|
191 |
+
**[Part 5: Advanced Evaluation Techniques](/blog/advanced-metrics-and-customization-with-ragas)**
|
192 |
+
**[Part 6: Evaluating AI Agents](/blog/evaluating-ai-agents-with-ragas/)**
|
193 |
+
**[Part 7: Integrations and Observability](/blog/integrations-and-observability-with-ragas/)**
|
194 |
+
**[Part 8: Building Feedback Loops](/blog/building-feedback-loops-with-ragas/)**
|
195 |
+
|
196 |
+
|
197 |
+
*How do you evaluate your RAG systems? Which metrics have been most useful for your use cases? If you’re facing specific evaluation hurdles, don’t hesitate to [reach out](https://www.linkedin.com/in/muhammadafzaal/)—we’d love to help!*
|
data/generating-test-data-with-ragas/index.md
ADDED
@@ -0,0 +1,206 @@
1 |
+
---
|
2 |
+
title: "Part 4: Generating Test Data with Ragas"
|
3 |
+
date: 2025-04-27T16:00:00-06:00
|
4 |
+
layout: blog
|
5 |
+
description: "Discover how to generate robust test datasets for evaluating Retrieval-Augmented Generation systems using Ragas, including document-based, domain-specific, and adversarial test generation techniques."
|
6 |
+
categories: ["AI", "RAG", "Evaluation", "Ragas","Data"]
|
7 |
+
coverImage: "/images/generating_test_data.png"
|
8 |
+
readingTime: 14
|
9 |
+
published: true
|
10 |
+
---
|
11 |
+
|
12 |
+
|
13 |
+
In our previous post, we explored how to comprehensively evaluate RAG systems using specialized metrics. However, even the best evaluation framework requires high-quality test data to yield meaningful insights. In this post, we'll dive into how Ragas helps you generate robust test datasets for evaluating your LLM applications.
|
14 |
+
|
15 |
+
|
16 |
+
## Why and How to Generate Synthetic Data for RAG Evaluation
|
17 |
+
|
18 |
+
In the world of Retrieval-Augmented Generation (RAG) and LLM-powered applications, **synthetic data generation** is a game-changer for rapid iteration and robust evaluation. This blog post explains why synthetic data is essential, and how you can generate it for your own RAG pipelines—using modern tools like [RAGAS](https://github.com/explodinggradients/ragas) and [LangSmith](https://smith.langchain.com/).
|
19 |
+
|
20 |
+
---
|
21 |
+
|
22 |
+
### Why Generate Synthetic Data?
|
23 |
+
|
24 |
+
1. **Early Signal, Fast Iteration**
|
25 |
+
Real-world data is often scarce or expensive to label. Synthetic data lets you quickly create test sets that mimic real user queries and contexts, so you can evaluate your system’s performance before deploying to production.
|
26 |
+
|
27 |
+
2. **Controlled Complexity**
|
28 |
+
You can design synthetic datasets to cover edge cases, multi-hop reasoning, or specific knowledge domains—ensuring your RAG system is robust, not just good at the “easy” cases.
|
29 |
+
|
30 |
+
3. **Benchmarking and Comparison**
|
31 |
+
Synthetic test sets provide a repeatable, comparable way to measure improvements as you tweak your pipeline (e.g., changing chunk size, embeddings, or prompts).
|
32 |
+
|
33 |
+
---
|
34 |
+
|
35 |
+
### How to Generate Synthetic Data
|
36 |
+
|
37 |
+
#### 1. **Prepare Your Source Data**
|
38 |
+
Start with a set of documents relevant to your domain. For example, you might download and load HTML blog posts into a document format using tools like LangChain’s `DirectoryLoader`.
|
39 |
+
|
40 |
+
#### 2. **Build a Knowledge Graph**
|
41 |
+
Use RAGAS to convert your documents into a knowledge graph. This graph captures entities, relationships, and summaries, forming the backbone for generating meaningful queries. RAGAS applies a set of default transformations that depend on the corpus length; here are some examples:
|
42 |
+
|
43 |
+
- Summary extraction -> produces a summary of each document
|
44 |
+
- Headline extraction -> finds the overall headline for each document
|
45 |
+
- Theme extraction -> extracts broad themes across the documents
|
46 |
+
|
47 |
+
It then uses cosine similarity between the embeddings of these extracted elements, along with heuristics, to construct relationships between the nodes. This is a crucial step: the quality of your knowledge graph directly impacts the relevance and accuracy of the generated queries.
|
48 |
+
|
49 |
+
#### 3. **Configure Query Synthesizers**
|
50 |
+
RAGAS provides several query synthesizers:
|
51 |
+
- **SingleHopSpecificQuerySynthesizer**: Generates direct, fact-based questions.
|
52 |
+
- **MultiHopAbstractQuerySynthesizer**: Creates broader, multi-step reasoning questions.
|
53 |
+
- **MultiHopSpecificQuerySynthesizer**: Focuses on questions that require connecting specific entities across documents.
|
54 |
+
|
55 |
+
By mixing these, you get a diverse and challenging test set.
|
56 |
+
|
57 |
+
#### 4. **Generate the Test Set**
|
58 |
+
With your knowledge graph and query synthesizers, use RAGAS’s `TestsetGenerator` to create a synthetic dataset. This dataset will include questions, reference answers, and supporting contexts.
|
59 |
+
|
60 |
+
#### 5. **Evaluate and Iterate**
|
61 |
+
Load your synthetic dataset into an evaluation platform like LangSmith. Run your RAG pipeline against the test set, and use automated evaluators (for accuracy, helpfulness, style, etc.) to identify strengths and weaknesses. Tweak your pipeline and re-evaluate to drive improvements.
|
62 |
+
|
63 |
+
---
|
64 |
+
|
65 |
+
### Minimal Example
|
66 |
+
|
67 |
+
Here’s a high-level pseudocode outline (see the notebook for full details):
|
68 |
+
|
69 |
+
````python
|
70 |
+
# 1. Load documents
|
71 |
+
from langchain_community.document_loaders import DirectoryLoader
|
72 |
+
path = "data/"
|
73 |
+
loader = DirectoryLoader(path, glob="*.md")
|
74 |
+
docs = loader.load()
|
75 |
+
|
76 |
+
# 2. Generate data
|
77 |
+
from ragas.testset import TestsetGenerator
|
78 |
+
from ragas.llms import LangchainLLMWrapper
|
79 |
+
from ragas.embeddings import LangchainEmbeddingsWrapper
|
80 |
+
from langchain_openai import ChatOpenAI
|
81 |
+
from langchain_openai import OpenAIEmbeddings
|
82 |
+
# Initialize the generator with the LLM and embedding model
|
83 |
+
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1"))
|
84 |
+
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
|
85 |
+
|
86 |
+
# Create the test set generator
|
87 |
+
generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
|
88 |
+
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)
|
89 |
+
````
|
90 |
+
|
91 |
+
`dataset` will now contain a set of questions, answers, and contexts that you can use to evaluate your RAG system.
|
92 |
+
|
93 |
+
> 💡 **Try it yourself:**
|
94 |
+
> Explore the hands-on notebook for synthetic data generation:
|
97 |
+
> [04_Synthetic_Data_Generation](https://github.com/mafzaal/intro-to-ragas/blob/master/04_Synthetic_Data_Generation.ipynb)
|
98 |
+
|
99 |
+
### Understanding the Generated Dataset Columns
|
100 |
+
|
101 |
+
The synthetic dataset generated by Ragas typically includes the following columns:
|
102 |
+
|
103 |
+
- **`user_input`**: The generated question or query that simulates what a real user might ask. This is the prompt your RAG system will attempt to answer.
|
104 |
+
- **`reference_contexts`**: A list of document snippets or passages that contain the information needed to answer the `user_input`. These serve as the ground truth retrieval targets.
|
105 |
+
- **`reference`**: The ideal answer to the `user_input`, based strictly on the `reference_contexts`. This is used as the gold standard for evaluating answer accuracy.
|
106 |
+
- **`synthesizer_name`**: The name of the query synthesizer (e.g., `SingleHopSpecificQuerySynthesizer`, `MultiHopAbstractQuerySynthesizer`) that generated the question. This helps track the type and complexity of each test case.
|
107 |
+
|
108 |
+
These columns enable comprehensive evaluation by linking each question to its supporting evidence and expected answer, while also providing insight into the diversity and difficulty of the generated queries.
|
109 |
+
|
110 |
+
|
111 |
+
## Deep Dive into Test Data Generation
|
112 |
+
|
113 |
+
So you have a collection of documents and want to create a robust evaluation dataset for your RAG system using Ragas. The `TestsetGenerator`'s `generate_with_langchain_docs` method is your starting point. But what exactly happens when you call it? Let's peek under the hood.
|
114 |
+
|
115 |
+
**The Goal:** To take raw Langchain `Document` objects and transform them into a structured Ragas `Testset` containing diverse question-answer pairs grounded in those documents.
|
116 |
+
|
117 |
+
**The Workflow:**
|
118 |
+
|
119 |
+
1. **Input & Validation:** The function receives your Langchain `documents`, the desired `testset_size`, and optional configurations for transformations and query types. It first checks if it has the necessary LLM and embedding models to proceed (either provided during `TestsetGenerator` initialization or passed directly to this method).
|
120 |
+
|
121 |
+
2. **Setting Up Transformations:** This is a crucial step.
|
122 |
+
* **User-Provided:** If you pass a specific `transforms` configuration, the generator uses that.
|
123 |
+
* **Default Transformations:** If you *don't* provide `transforms`, the generator calls `ragas.testset.transforms.default_transforms`. This sets up a standard pipeline to process your raw documents into a usable knowledge graph foundation. We'll detail this below.
|
124 |
+
|
125 |
+
3. **Document Conversion:** Your Langchain `Document` objects are converted into Ragas' internal `Node` representation, specifically `NodeType.DOCUMENT`. Each node holds the `page_content` and `metadata`.
|
126 |
+
|
127 |
+
4. **Initial Knowledge Graph:** A `KnowledgeGraph` object is created, initially containing just these document nodes.
|
128 |
+
|
129 |
+
5. **Applying Transformations:** The core processing happens here using `ragas.testset.transforms.apply_transforms`. The chosen `transforms` (default or custom) are executed sequentially on the `KnowledgeGraph`. This modifies the graph by:
|
130 |
+
* Adding new nodes (e.g., chunks, questions, answers).
|
131 |
+
* Adding relationships between nodes (e.g., linking a question to the chunk it came from).
|
132 |
+
The generator's internal `knowledge_graph` attribute is updated with this processed graph.
|
133 |
+
|
134 |
+
6. **Delegation to `generate()`:** Now that the foundational knowledge graph with basic Q&A pairs is built (thanks to transformations), `generate_with_langchain_docs` calls the main `self.generate()` method. This method handles the final step of creating the diverse test samples.
|
135 |
+
|
136 |
+
**Spotlight: Default Transformations (`default_transforms`)**
|
137 |
+
|
138 |
+
When you don't specify custom transformations, Ragas applies a sensible default pipeline to prepare your documents:
|
139 |
+
|
140 |
+
1. **Chunking (`SentenceChunker`):** Breaks down your large documents into smaller, more manageable chunks (often sentences or groups of sentences). This is essential for focused retrieval and question generation.
|
141 |
+
2. **Embedding:** Generates vector embeddings for each chunk using the provided embedding model. These are needed for similarity-based operations.
|
142 |
+
3. **Filtering (`SimilarityFilter`, `InformationFilter`):** Removes redundant chunks (those too similar to others) and potentially low-information chunks to clean up the knowledge base.
|
143 |
+
4. **Base Q&A Generation (`QAGenerator`):** This is where the initial, simple question-answer pairs are created. The generator looks at individual (filtered) chunks and uses an LLM to formulate straightforward questions whose answers are directly present in that chunk.
|
144 |
+
|
145 |
+
Essentially, the default transformations build a knowledge graph populated with embedded, filtered document chunks and corresponding simple, extractive question-answer pairs.
|
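If you ever want to run this transformation phase yourself rather than letting `generate_with_langchain_docs` handle it, the sketch below shows the general shape. It reuses `docs`, `generator_llm`, and `generator_embeddings` from the minimal example earlier in this post; treat the helper names as assumptions to verify against your Ragas version.

```python
from ragas.testset.graph import KnowledgeGraph, Node, NodeType
from ragas.testset.transforms import default_transforms, apply_transforms

# Wrap each Langchain document as a DOCUMENT node in a fresh knowledge graph
kg = KnowledgeGraph()
for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata},
        )
    )

# Build and apply the default transformation pipeline (chunking, embedding,
# filtering, base Q&A extraction) to enrich the graph in place
transforms = default_transforms(
    documents=docs, llm=generator_llm, embedding_model=generator_embeddings
)
apply_transforms(kg, transforms)

print(kg)  # the graph should now hold far more nodes and relationships than the raw documents
```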
146 |
+
|
147 |
+
**Spotlight: Query Synthesizers (via `self.generate()` and `default_query_distribution`)**
|
148 |
+
|
149 |
+
The `self.generate()` method, called by `generate_with_langchain_docs`, is responsible for taking the foundational graph and creating the final, potentially complex, test questions using **Query Synthesizers** (also referred to as "evolutions" or "scenarios").
|
150 |
+
|
151 |
+
* **Query Distribution:** `self.generate()` uses a `query_distribution` parameter. If you don't provide one, it calls `ragas.testset.synthesizers.default_query_distribution`.
|
152 |
+
* **Default Synthesizers:** This default distribution defines a mix of different synthesizer types and the probability of using each one. Common defaults include:
|
153 |
+
* **`simple`:** Takes the base Q&A pairs generated during transformation and potentially rephrases them slightly.
|
154 |
+
* **`reasoning`:** Creates questions requiring logical inference based on the context in the graph.
|
155 |
+
* **`multi_context`:** Generates questions needing information synthesized from multiple different chunks/nodes in the graph.
|
156 |
+
* **`conditional`:** Creates questions with "if/then" clauses based on information in the graph.
|
157 |
+
* **Generation Process:** `self.generate()` calculates how many questions of each type to create based on the `testset_size` and the distribution probabilities. It then uses an `Executor` to run the appropriate synthesizers, generating the final `TestsetSample` objects that make up your evaluation dataset.
|
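If the default mix doesn't suit your corpus, you can pass your own `query_distribution` to the generator. Here is a hedged sketch using the synthesizers introduced earlier in this post; the weights are illustrative and should sum to 1.0.

```python
from ragas.testset.synthesizers import (
    SingleHopSpecificQuerySynthesizer,
    MultiHopAbstractQuerySynthesizer,
    MultiHopSpecificQuerySynthesizer,
)

# Half direct fact-based questions, half multi-hop reasoning questions
query_distribution = [
    (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
    (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]

dataset = generator.generate_with_langchain_docs(
    docs,
    testset_size=10,
    query_distribution=query_distribution,
)
```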
158 |
+
|
159 |
+
**In Summary:**
|
160 |
+
|
161 |
+
`generate_with_langchain_docs` orchestrates a two-phase process:
|
162 |
+
|
163 |
+
1. **Transformation Phase:** Uses (typically default) transformations like chunking, filtering, and base Q&A generation to build a foundational knowledge graph from your documents.
|
164 |
+
2. **Synthesis Phase (via `self.generate`):** Uses (typically default) query synthesizers/evolutions (`simple`, `reasoning`, `multi_context`, etc.) to create diverse and complex questions based on the information stored in the transformed knowledge graph.
|
165 |
+
|
166 |
+
This automated pipeline allows you to go from raw documents to a rich, multi-faceted evaluation dataset with minimal configuration.
|
167 |
+
|
168 |
+
|
169 |
+
## Best Practices for Test Data Generation
|
170 |
+
|
171 |
+
1. **Start small and iterate**: Begin with a small test set to verify quality before scaling up
|
172 |
+
2. **Diversify document sources**: Include different document types, styles, and domains
|
173 |
+
3. **Balance question types**: Ensure coverage of simple, complex, and edge-case scenarios
|
174 |
+
4. **Manual review**: Sample-check generated questions for quality and relevance
|
175 |
+
5. **Progressive difficulty**: Include both easy and challenging questions to identify performance thresholds
|
176 |
+
6. **Document metadata**: Retain information about test case generation for later analysis
|
177 |
+
7. **Version control**: Track test set versions alongside your application versions
|
178 |
+
|
179 |
+
## Conclusion: Building a Test Data Generation Strategy
|
180 |
+
|
181 |
+
Test data generation should be an integral part of your LLM application development cycle:
|
182 |
+
|
183 |
+
1. **Initial development**: Generate broad test sets to identify general capabilities and limitations
|
184 |
+
2. **Refinement**: Create targeted test sets for specific features or improvements
|
185 |
+
3. **Regression testing**: Maintain benchmark test sets to ensure changes don't break existing functionality
|
186 |
+
4. **Continuous improvement**: Generate new test cases as your application evolves
|
187 |
+
|
188 |
+
By leveraging Ragas for automated test data generation, you can build comprehensive evaluation datasets that thoroughly exercise your LLM applications, leading to more robust, reliable systems.
|
189 |
+
|
190 |
+
In our next post, we'll explore advanced metrics and customization techniques for specialized evaluation needs.
|
191 |
+
|
192 |
+
---
|
193 |
+
|
194 |
+
|
195 |
+
**[Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications](/blog/introduction-to-ragas/)**
|
196 |
+
**[Part 2: Basic Evaluation Workflow](/blog/basic-evaluation-workflow-with-ragas/)**
|
197 |
+
**[Part 3: Evaluating RAG Systems with Ragas](/blog/evaluating-rag-systems-with-ragas/)**
|
198 |
+
**Part 4: Test Data Generation — _You are here_**
|
199 |
+
*Next up in the series:*
|
200 |
+
**[Part 5: Advanced Evaluation Techniques](/blog/advanced-metrics-and-customization-with-ragas)**
|
201 |
+
**[Part 6: Evaluating AI Agents](/blog/evaluating-ai-agents-with-ragas/)**
|
202 |
+
**[Part 7: Integrations and Observability](/blog/integrations-and-observability-with-ragas/)**
|
203 |
+
**[Part 8: Building Feedback Loops](/blog/building-feedback-loops-with-ragas/)**
|
204 |
+
|
205 |
+
|
206 |
+
*How do you generate test data for your LLM applications? Which generation strategies have been most effective for your use cases? If you’re facing specific evaluation hurdles, don’t hesitate to [reach out](https://www.linkedin.com/in/muhammadafzaal/)—we’d love to help!*
|
data/integrations-and-observability-with-ragas/index.md
ADDED
@@ -0,0 +1,137 @@
1 |
+
---
|
2 |
+
title: "Part 7: Integrations and Observability with Ragas"
|
3 |
+
date: 2025-04-30T07:00:00-06:00
|
4 |
+
layout: blog
|
5 |
+
description: "Learn how to integrate Ragas with popular LLM frameworks and observability platforms, and how to build automated evaluation pipelines that make evaluation a continuous part of your development workflow."
|
6 |
+
categories: ["AI", "RAG", "Evaluation", "Ragas", "Observability"]
|
7 |
+
coverImage: "/images/integrations-and-observability.png"
|
8 |
+
readingTime: 12
|
9 |
+
published: true
|
10 |
+
---
|
11 |
+
|
12 |
+
# Part 7: Integrations and Observability with Ragas
|
13 |
+
|
14 |
+
In our previous post, we explored how to evaluate complex AI agents using Ragas' specialized metrics for goal accuracy, tool call accuracy, and topic adherence to build more reliable and effective agent-based applications. Now, let's discuss how to integrate Ragas into your broader LLM development ecosystem and establish observability practices that transform evaluation from a one-time exercise into a continuous improvement cycle.
|
15 |
+
|
16 |
+
## Why Integrations and Observability Matter
|
17 |
+
|
18 |
+
Evaluation is most powerful when it's:
|
19 |
+
|
20 |
+
- **Integrated** into your existing workflow and tools
|
21 |
+
- **Automated** to run consistently with minimal friction
|
22 |
+
- **Observable** so insights are easily accessible and actionable
|
23 |
+
- **Continuous** rather than a one-time or sporadic effort
|
24 |
+
|
25 |
+
Let's explore how Ragas helps you achieve these goals through its extensive integration capabilities.
|
26 |
+
|
27 |
+
## Framework Integrations
|
28 |
+
|
29 |
+
Ragas seamlessly connects with popular LLM application frameworks, allowing you to evaluate systems built with your preferred tools.
|
30 |
+
|
31 |
+
### LangChain Integration
|
32 |
+
For LangChain-based applications, Ragas provides dedicated integration support. Here’s how you can integrate Ragas step by step:
|
33 |
+
|
34 |
+
1. **Prepare your documents**: Load your source documents and split them into manageable chunks for retrieval.
|
35 |
+
2. **Set up vector storage**: Embed the document chunks and store them in a vector database to enable efficient retrieval.
|
36 |
+
3. **Configure the retriever and QA chain**: Use LangChain components to create a retriever and a question-answering (QA) chain powered by your chosen language model.
|
37 |
+
4. **Generate a test set**: Use Ragas to automatically generate a set of test questions and answers from your documents, or supply your own.
|
38 |
+
5. **Evaluate retrieval and QA performance**: Apply Ragas metrics to assess both the retriever and the full QA chain, measuring aspects like context relevancy, faithfulness, and answer quality.
|
39 |
+
6. **Review results**: Analyze the evaluation outputs to identify strengths and areas for improvement in your RAG pipeline.
|
40 |
+
|
41 |
+
This integration allows you to continuously measure and improve the effectiveness of your retrieval and generation components within the LangChain framework.
|
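A compact sketch of those steps is shown below. The file path, chunk sizes, and model names are illustrative assumptions, and the QA step can be any chain or function that turns a question plus retrieved context into an answer.

```python
from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# 1. Prepare your documents and 2. set up vector storage
docs = DirectoryLoader("data/", glob="*.md").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())

# 3. Configure the retriever and a simple QA step
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
llm = ChatOpenAI(model="gpt-4o-mini")

def answer(question: str) -> dict:
    """Return the fields Ragas expects for a single-turn RAG sample."""
    contexts = [doc.page_content for doc in retriever.invoke(question)]
    response = llm.invoke(
        f"Answer using only this context:\n\n{contexts}\n\nQuestion: {question}"
    ).content
    return {"user_input": question, "retrieved_contexts": contexts, "response": response}

# 4-6. Run your test questions (generated with Ragas, see Part 4) through `answer`,
# collect the results into an evaluation dataset, and score them with the
# retrieval and generation metrics covered in Part 3.
```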
42 |
+
|
43 |
+
> 💡 **Try it yourself:**
|
44 |
+
> Explore the hands-on notebook for integrations and observability:
|
45 |
+
> [07_Integrations_and_Observability](https://github.com/mafzaal/intro-to-ragas/blob/master/07_Integrations_and_Observability.ipynb)
|
46 |
+
|
47 |
+
Ragas also supports integration with other popular LLM and RAG frameworks beyond LangChain, including LlamaIndex and Haystack, enabling seamless evaluation of retrieval and generation components within your preferred stack. If you need guidance or code examples for integrating Ragas with LlamaIndex, Haystack, or another framework, tailored examples can be provided for your specific workflow.
|
48 |
+
|
49 |
+
## Observability Platform Integrations
|
50 |
+
|
51 |
+
Beyond framework integrations, Ragas connects with leading observability platforms to help you monitor, track, and analyze evaluation results over time.
|
52 |
+
|
53 |
+
### LangSmith Integration
|
54 |
+
For LangChain users, LangSmith provides comprehensive tracing and evaluation. To integrate Ragas evaluation with LangSmith, follow these steps:
|
55 |
+
|
56 |
+
1. **Set up your environment**
|
57 |
+
2. **Upload dataset to LangSmith**
|
58 |
+
3. **Define your LLM or chain**
|
59 |
+
4. **Select Ragas metrics**
|
60 |
+
5. **Run evaluation with LangSmith**
|
61 |
+
|
62 |
+
You can now view detailed experiment results in your LangSmith project dashboard. This integration enables you to trace, evaluate, and monitor your RAG pipeline performance directly within LangSmith, leveraging Ragas metrics for deeper insights.
|
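For step 1, enabling LangSmith tracing is mostly a matter of environment variables; the project name below is an arbitrary example.

```python
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"                   # turn on tracing
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"  # from your LangSmith settings
os.environ["LANGCHAIN_PROJECT"] = "ragas-evaluation"          # runs appear under this project
```

With tracing enabled, subsequent chain runs and evaluation experiments show up in LangSmith, where you can inspect them alongside the Ragas metric scores.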
63 |
+
|
64 |
+
> 💡 **Try it yourself:**
|
65 |
+
> Explore the hands-on notebook for integrations and observability:
|
66 |
+
> [07_Integrations_and_Observability](https://github.com/mafzaal/intro-to-ragas/blob/master/07_Integrations_and_Observability.ipynb)
|
67 |
+
|
68 |
+
|
69 |
+
### Other Platform Integrations
|
70 |
+
|
71 |
+
Ragas can also be integrated with observability and monitoring platforms beyond LangSmith, such as Langfuse. If you need help connecting Ragas to one of these platforms or have specific requirements for your observability stack, tailored examples can be provided for your workflow.
|
72 |
+
|
73 |
+
## Building Automated Evaluation Pipelines
|
74 |
+
|
75 |
+
To ensure evaluation is a continuous part of your development process, set up automated pipelines that run evaluations regularly and automatically.
|
76 |
+
|
77 |
+
### CI/CD Integration
|
78 |
+
|
79 |
+
You can incorporate Ragas into your CI/CD pipeline so that every code change is automatically evaluated. This helps catch regressions early and ensures your RAG system maintains high performance before merging new changes.
|
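One common pattern is a small pytest gate that runs a fixed benchmark set through Ragas and fails the build when a key metric dips below a chosen threshold. The threshold, file path, and `load_benchmark_samples` helper below are illustrative assumptions, not part of Ragas.

```python
import json

from ragas import evaluate
from ragas.evaluation import EvaluationDataset
from ragas.metrics import Faithfulness
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

FAITHFULNESS_THRESHOLD = 0.85  # baseline chosen for this project; tune it for yours

def load_benchmark_samples() -> list[dict]:
    """Hypothetical helper: load a fixed set of pre-computed RAG samples."""
    with open("benchmarks/rag_samples.json") as f:
        return json.load(f)

def test_faithfulness_does_not_regress():
    evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
    dataset = EvaluationDataset.from_list(load_benchmark_samples())
    result = evaluate(dataset=dataset, metrics=[Faithfulness()], llm=evaluator_llm)
    # Average the per-sample scores; the exact accessor may vary between Ragas versions
    score = result.to_pandas()["faithfulness"].mean()
    assert score >= FAITHFULNESS_THRESHOLD, f"Faithfulness regressed to {score:.2f}"
```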
80 |
+
|
81 |
+
### Scheduled Evaluations
|
82 |
+
|
83 |
+
Regularly scheduled evaluations allow you to monitor your system’s performance over time. By running evaluations at set intervals, you can track trends, spot regressions, and ensure your system continues to meet quality standards.
|
84 |
+
|
85 |
+
## Monitoring Evaluation Metrics Over Time
|
86 |
+
|
87 |
+
Tracking evaluation metrics over time helps you identify performance trends and quickly detect any drops in quality. By visualizing these metrics, you can better understand how changes to your system impact its effectiveness.
|
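A lightweight way to start is to append each run's aggregate scores to a file that a notebook or dashboard can chart later. This is a generic sketch rather than a Ragas feature; the file name and fields are arbitrary choices.

```python
import csv
from datetime import datetime, timezone

def log_scores(scores: dict, path: str = "evaluation_history.csv") -> None:
    """Append one row of aggregate metric scores with a timestamp."""
    row = {"timestamp": datetime.now(timezone.utc).isoformat(), **scores}
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row.keys()))
        if f.tell() == 0:  # write a header only for a brand-new file
            writer.writeheader()
        writer.writerow(row)

# Example: log_scores({"faithfulness": 0.91, "answer_relevancy": 0.87})
```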
88 |
+
|
89 |
+
## Creating Custom Dashboards
|
90 |
+
|
91 |
+
Building custom dashboards gives you a comprehensive view of your evaluation results. Dashboards can display current performance, trends, and detailed breakdowns of recent evaluations, making it easier to monitor your system and identify areas for improvement.
|
92 |
+
|
93 |
+
With these practices, you can make evaluation an ongoing, automated, and visible part of your development workflow, leading to more reliable and robust RAG systems.
|
94 |
+
|
95 |
+
## Best Practices for Observability
|
96 |
+
|
97 |
+
1. **Define clear thresholds**: Establish performance baselines and alert thresholds for each metric
|
98 |
+
2. **Segment evaluations**: Break down results by query type, data source, or other relevant factors
|
99 |
+
3. **Historical tracking**: Maintain historical evaluation data to identify trends and regressions
|
100 |
+
4. **Correlation analysis**: Link evaluation metrics to user feedback and business outcomes
|
101 |
+
5. **Regular benchmarking**: Periodically evaluate against fixed test sets to ensure consistency
|
102 |
+
6. **Alert on regressions**: Implement automated alerts when metrics drop below thresholds
|
103 |
+
7. **Contextualize metrics**: Include example failures alongside aggregate metrics for better understanding
|
104 |
+
|
105 |
+
## Building a Feedback Loop
|
106 |
+
|
107 |
+
The ultimate goal of evaluation is to drive improvements. Establish a feedback loop:
|
108 |
+
|
109 |
+
1. **Capture evaluation results** with Ragas
|
110 |
+
2. **Identify patterns** in failures and underperforming areas
|
111 |
+
3. **Prioritize improvements** based on impact and effort
|
112 |
+
4. **Implement changes** to your RAG components
|
113 |
+
5. **Validate improvements** with focused re-evaluation
|
114 |
+
6. **Monitor continuously** to catch regressions
|
115 |
+
|
116 |
+
## Conclusion: From Evaluation to Action
|
117 |
+
|
118 |
+
Integrating Ragas with your frameworks and observability tools transforms evaluation from a point-in-time activity to a continuous improvement cycle. By making evaluation metrics visible, actionable, and integrated into your workflows, you create a foundation for systematic improvement of your LLM applications.
|
119 |
+
|
120 |
+
The most successful teams don't just evaluate occasionally — they build evaluation into their development culture, making data-driven decisions based on objective metrics rather than subjective impressions.
|
121 |
+
|
122 |
+
In our final post, we'll explore how to build effective feedback loops that translate evaluation insights into concrete improvements for your LLM applications.
|
123 |
+
|
124 |
+
---
|
125 |
+
|
126 |
+
|
127 |
+
**[Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications](/blog/introduction-to-ragas/)**
|
128 |
+
**[Part 2: Basic Evaluation Workflow](/blog/basic-evaluation-workflow-with-ragas/)**
|
129 |
+
**[Part 3: Evaluating RAG Systems with Ragas](/blog/evaluating-rag-systems-with-ragas/)**
|
130 |
+
**[Part 4: Test Data Generation](/blog/generating-test-data-with-ragas/)**
|
131 |
+
**[Part 5: Advanced Metrics and Customization](/blog/advanced-metrics-and-customization-with-ragas/)**
|
132 |
+
**[Part 6: Evaluating AI Agents](/blog/evaluating-ai-agents-with-ragas/)**
|
133 |
+
**Part 7: Integrations and Observability with Ragas — _You are here_**
|
134 |
+
*Next up in the series:*
|
135 |
+
**[Part 8: Building Feedback Loops](/blog/building-feedback-loops-with-ragas/)**
|
136 |
+
|
137 |
+
*How are you integrating evaluation into your workflows? What observability challenges have you encountered with your LLM applications? If you're facing specific evaluation hurdles, don't hesitate to [reach out](https://www.linkedin.com/in/muhammadafzaal/)—we'd love to help!*
|
data/introduction-to-ragas/index.md
ADDED
@@ -0,0 +1,136 @@
1 |
+
---
|
2 |
+
title: "Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications"
|
3 |
+
date: 2025-04-26T18:00:00-06:00
|
4 |
+
layout: blog
|
5 |
+
description: "Explore the essential evaluation framework for LLM applications with Ragas. Learn how to assess performance, ensure accuracy, and improve reliability in Retrieval-Augmented Generation systems."
|
6 |
+
categories: ["AI", "RAG", "Evaluation","Ragas"]
|
7 |
+
coverImage: "https://images.unsplash.com/photo-1593642634367-d91a135587b5?q=80&w=1770&auto=format&fit=crop&ixlib=rb-4.0.3"
|
8 |
+
readingTime: 7
|
9 |
+
published: true
|
10 |
+
---
|
11 |
+
|
12 |
+
As Large Language Models (LLMs) become fundamental components of modern applications, effectively evaluating their performance becomes increasingly critical. Whether you're building a question-answering system, a document retrieval tool, or a conversational agent, you need reliable metrics to assess how well your application performs. This is where Ragas steps in.
|
13 |
+
|
14 |
+
## What is Ragas?
|
15 |
+
|
16 |
+
[Ragas](https://docs.ragas.io/en/stable/) is an open-source evaluation framework specifically designed for LLM applications, with particular strengths in Retrieval-Augmented Generation (RAG) systems. Unlike traditional NLP evaluation methods, Ragas provides specialized metrics that address the unique challenges of LLM-powered systems.
|
17 |
+
|
18 |
+
At its core, Ragas helps answer crucial questions:
|
19 |
+
- Is my application retrieving the right information?
|
20 |
+
- Are the responses factually accurate and consistent with the retrieved context?
|
21 |
+
- Does the system appropriately address the user's query?
|
22 |
+
- How well does my application handle multi-turn conversations?
|
23 |
+
|
24 |
+
## Why Evaluate LLM Applications?
|
25 |
+
|
26 |
+
LLMs are powerful but imperfect. They can hallucinate facts, misinterpret queries, or generate convincing but incorrect responses. For applications where accuracy and reliability matter—like healthcare, finance, or education—proper evaluation is non-negotiable.
|
27 |
+
|
28 |
+
Evaluation serves several key purposes:
|
29 |
+
- **Quality assurance**: Identify and fix issues before they reach users
|
30 |
+
- **Performance tracking**: Monitor how changes impact system performance
|
31 |
+
- **Benchmarking**: Compare different approaches objectively
|
32 |
+
- **Continuous improvement**: Build feedback loops to enhance your application
|
33 |
+
|
34 |
+
## Key Features of Ragas
|
35 |
+
|
36 |
+
### 🎯 Specialized Metrics
|
37 |
+
Ragas offers both LLM-based and computational metrics tailored to evaluate different aspects of LLM applications:
|
38 |
+
|
39 |
+
- **Faithfulness**: Measures if the response is factually consistent with the retrieved context
|
40 |
+
- **Context Relevancy**: Evaluates if the retrieved information is relevant to the query
|
41 |
+
- **Answer Relevancy**: Assesses if the response addresses the user's question
|
42 |
+
- **Topic Adherence**: Gauges how well multi-turn conversations stay on topic
|
43 |
+
|
44 |
+
### 🧪 Test Data Generation
|
45 |
+
Creating high-quality test data is often a bottleneck in evaluation. Ragas helps you generate comprehensive test datasets automatically, saving time and ensuring thorough coverage.
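As a quick preview (Part 4 covers this in depth), generating a synthetic test set looks roughly like the sketch below. Treat it as a sketch rather than a recipe: the constructor and argument names vary between Ragas versions, and the single `Document` stands in for your own corpus.

```python
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.testset import TestsetGenerator

# Stand-in corpus; in practice, load your own documents
docs = [Document(page_content="Paris is the capital of France and its most populous city. ...")]

generator = TestsetGenerator(
    llm=LangchainLLMWrapper(ChatOpenAI(model="gpt-4o")),
    embedding_model=LangchainEmbeddingsWrapper(OpenAIEmbeddings()),
)
testset = generator.generate_with_langchain_docs(docs, testset_size=10)
print(testset.to_pandas().head())
```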
|
46 |
+
|
47 |
+
### 🔗 Seamless Integrations
|
48 |
+
Ragas works with popular LLM frameworks and tools:
|
49 |
+
- [LangChain](https://www.langchain.com/)
|
50 |
+
- [LlamaIndex](https://www.llamaindex.ai/)
|
51 |
+
- [Haystack](https://haystack.deepset.ai/)
|
52 |
+
- [OpenAI](https://openai.com/)
|
53 |
+
|
54 |
+
It also integrates with popular observability platforms:
|
55 |
+
- [Phoenix](https://phoenix.arize.com/)
|
56 |
+
- [LangSmith](https://python.langchain.com/docs/introduction/)
|
57 |
+
- [Langfuse](https://www.langfuse.com/)
|
58 |
+
|
59 |
+
### 📊 Comprehensive Analysis
|
60 |
+
Beyond simple scores, Ragas provides detailed insights into your application's strengths and weaknesses, enabling targeted improvements.
|
61 |
+
|
62 |
+
## Getting Started with Ragas
|
63 |
+
|
64 |
+
Installing Ragas is straightforward:
|
65 |
+
|
66 |
+
```bash
|
67 |
+
uv init && uv add ragas
|
68 |
+
```
|
69 |
+
|
70 |
+
Here's a simple example of evaluating a response using Ragas:
|
71 |
+
|
72 |
+
```python
|
73 |
+
from ragas.metrics import Faithfulness
|
74 |
+
from ragas.evaluation import EvaluationDataset
|
75 |
+
from ragas.dataset_schema import SingleTurnSample
|
76 |
+
from langchain_openai import ChatOpenAI
|
77 |
+
from ragas.llms import LangchainLLMWrapper
|
79 |
+
|
80 |
+
# Initialize the evaluator LLM (you will need an OpenAI API key in your environment)
|
81 |
+
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
|
82 |
+
|
83 |
+
# Your evaluation data
|
84 |
+
test_data = {
|
85 |
+
"user_input": "What is the capital of France?",
|
86 |
+
"retrieved_contexts": ["Paris is the capital and most populous city of France."],
|
87 |
+
"response": "The capital of France is Paris."
|
88 |
+
}
|
89 |
+
|
90 |
+
# Create a sample
|
91 |
+
sample = SingleTurnSample(**test_data) # Unpack the dictionary into the constructor
|
92 |
+
|
93 |
+
# Create metric
|
94 |
+
faithfulness = Faithfulness(llm=evaluator_llm)
|
95 |
+
# Calculate the score
|
96 |
+
result = await faithfulness.single_turn_ascore(sample)
|
97 |
+
print(f"Faithfulness score: {result}")
|
98 |
+
```
|
99 |
+
|
100 |
+
> 💡 **Try it yourself:**
|
101 |
+
> Explore the hands-on notebook for this workflow:
|
102 |
+
> [01_Introduction_to_Ragas](https://github.com/mafzaal/intro-to-ragas/blob/master/01_Introduction_to_Ragas.ipynb)
|
103 |
+
|
104 |
+
## What's Coming in This Blog Series
|
105 |
+
|
106 |
+
This introduction is just the beginning. In the upcoming posts, we'll dive deeper into all aspects of evaluating LLM applications with Ragas:
|
107 |
+
|
108 |
+
**[Part 2: Basic Evaluation Workflow](/blog/basic-evaluation-workflow-with-ragas/)**
|
109 |
+
We'll explore each metric in detail, explaining when and how to use them effectively.
|
110 |
+
|
111 |
+
**[Part 3: Evaluating RAG Systems](/blog/evaluating-rag-systems-with-ragas/)**
|
112 |
+
Learn specialized techniques for evaluating retrieval-augmented generation systems, including context precision, recall, and relevance.
|
113 |
+
|
114 |
+
**[Part 4: Test Data Generation](/blog/generating-test-data-with-ragas/)**
|
115 |
+
Discover how to create high-quality test datasets that thoroughly exercise your application's capabilities.
|
116 |
+
|
117 |
+
**[Part 5: Advanced Evaluation Techniques](/blog/advanced-metrics-and-customization-with-ragas)**
|
118 |
+
Go beyond basic metrics with custom evaluations, multi-aspect analysis, and domain-specific assessments.
|
119 |
+
|
120 |
+
**[Part 6: Evaluating AI Agents](/blog/evaluating-ai-agents-with-ragas/)**
|
121 |
+
Learn how to evaluate complex AI agents that engage in multi-turn interactions, use tools, and work toward specific goals.
|
122 |
+
|
123 |
+
**[Part 7: Integrations and Observability](/blog/integrations-and-observability-with-ragas/)**
|
124 |
+
Connect Ragas with your existing tools and platforms for streamlined evaluation workflows.
|
125 |
+
|
126 |
+
**[Part 8: Building Feedback Loops](/blog/building-feedback-loops-with-ragas/)**
|
127 |
+
Learn how to implement feedback loops that translate evaluation insights into concrete improvements for your LLM applications.
|
129 |
+
|
130 |
+
## Conclusion
|
131 |
+
|
132 |
+
In a world increasingly powered by LLMs, robust evaluation is the difference between reliable applications and unpredictable ones. Ragas provides the tools you need to confidently assess and improve your LLM applications.
|
133 |
+
|
134 |
+
### Ready to Elevate Your LLM Applications?
|
135 |
+
|
136 |
+
Start exploring Ragas today by visiting the [official documentation](https://docs.ragas.io/en/stable/). Share your thoughts, challenges, or success stories. If you're facing specific evaluation hurdles, don't hesitate to [reach out](https://www.linkedin.com/in/muhammadafzaal/)—we'd love to help!
|
data/langchain-experience-csharp-perspective/index.md
ADDED
@@ -0,0 +1,103 @@
1 |
+
---
|
2 |
+
layout: blog
|
3 |
+
title: A C# Programmer's Perspective on LangChain Expression Language
|
4 |
+
date: 2025-04-16T00:00:00-06:00
|
5 |
+
description: My experiences transitioning from C# to LangChain Expression Language, exploring the pipe operator abstraction challenges and the surprising simplicity of parallel execution.
|
6 |
+
categories: ["Technology", "AI", "Programming"]
|
7 |
+
coverImage: "https://images.unsplash.com/photo-1555066931-4365d14bab8c?q=80&w=1770&auto=format&fit=crop&ixlib=rb-4.0.3"
|
8 |
+
readingTime: 3
|
9 |
+
published: true
|
10 |
+
---
|
11 |
+
|
12 |
+
|
13 |
+
As a C# developer diving into [LangChain Expression Language (LCEL)](https://langchain-ai.github.io/langgraph/), I've encountered both challenges and pleasant surprises. Here's what stood out most during my transition.
|
14 |
+
|
15 |
+
## The Pipe Operator Abstraction Challenge
|
16 |
+
|
17 |
+
In C#, processing pipelines are explicit:
|
18 |
+
|
19 |
+
```csharp
|
20 |
+
var result = inputData
|
21 |
+
.Where(item => item.IsValid)
|
22 |
+
.Select(item => TransformItem(item))
|
23 |
+
.ToList();
|
24 |
+
result.ForEach(item => ProcessItem(item));
|
25 |
+
```
|
26 |
+
|
27 |
+
LCEL's pipe operator creates a different flow:
|
28 |
+
|
29 |
+
```python
|
30 |
+
chain = (
|
31 |
+
ChatPromptTemplate.from_messages([
|
32 |
+
("system", "You are a helpful assistant specialized in {topic}."),
|
33 |
+
("human", "{query}")
|
34 |
+
])
|
35 |
+
| ChatOpenAI(temperature=0.7)
|
36 |
+
| (lambda llm_result: llm_result.content)
|
37 |
+
| (lambda content: content.split("\n"))
|
38 |
+
| (lambda lines: [line for line in lines if line.strip()])
|
39 |
+
| (lambda filtered_lines: "\n".join(filtered_lines))
|
40 |
+
)
|
41 |
+
```
|
42 |
+
|
43 |
+
With complex chains, questions arise:
|
44 |
+
- What exactly passes through each step?
|
45 |
+
- How can I inspect intermediate results?
|
46 |
+
- How do I debug unexpected outcomes?
|
47 |
+
|
48 |
+
This becomes more apparent in real-world examples:
|
49 |
+
|
50 |
+
```python
|
51 |
+
retrieval_chain = (
|
52 |
+
{"query": RunnablePassthrough(), "context": retriever | format_docs}
|
53 |
+
| prompt
|
54 |
+
| llm
|
55 |
+
| StrOutputParser()
|
56 |
+
)
|
57 |
+
```
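One habit that helps answer those questions is inserting a small pass-through "tap" step that prints whatever flows between stages. This isn't a built-in LCEL feature, just a sketch that reuses the `retriever`, `format_docs`, `prompt`, and `llm` objects assumed by the snippet above:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough

def tap(label: str) -> RunnableLambda:
    """Print the value flowing through the chain, then pass it along unchanged."""
    def _inspect(value):
        print(f"[{label}] {value!r}")
        return value
    return RunnableLambda(_inspect)

debuggable_chain = (
    {"query": RunnablePassthrough(), "context": retriever | format_docs}
    | tap("prompt inputs")
    | prompt
    | tap("rendered prompt")
    | llm
    | StrOutputParser()
)
```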
|
58 |
+
|
59 |
+
## Surprisingly Simple Parallel Execution
|
60 |
+
|
61 |
+
Despite abstraction challenges, LCEL handles parallel execution elegantly.
|
62 |
+
|
63 |
+
In C#:
|
64 |
+
```csharp
|
65 |
+
var task1 = Task.Run(() => ProcessData(data1));
|
66 |
+
var task2 = Task.Run(() => ProcessData(data2));
|
67 |
+
var task3 = Task.Run(() => ProcessData(data3));
|
68 |
+
|
69 |
+
await Task.WhenAll(task1, task2, task3);
|
70 |
+
var results = new[] { task1.Result, task2.Result, task3.Result };
|
71 |
+
```
|
72 |
+
|
73 |
+
In LCEL:
|
74 |
+
```python
|
75 |
+
parallel_chain = RunnableMap({
|
76 |
+
"summary": prompt_summary | llm | StrOutputParser(),
|
77 |
+
"translation": prompt_translate | llm | StrOutputParser(),
|
78 |
+
"analysis": prompt_analyze | llm | StrOutputParser()
|
79 |
+
})
|
80 |
+
|
81 |
+
result = parallel_chain.invoke({"input": user_query})
|
82 |
+
```
|
83 |
+
|
84 |
+
This approach eliminates manual task management, handling everything behind the scenes.
|
85 |
+
|
86 |
+
## Best Practices I've Adopted
|
87 |
+
|
88 |
+
To balance LCEL's expressiveness with clarity:
|
89 |
+
|
90 |
+
1. Break complex chains into named subcomponents (see the sketch below)
|
91 |
+
2. Comment non-obvious transformations
|
92 |
+
3. Create visualization helpers for debugging
|
93 |
+
4. Embrace functional thinking
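For example, the first practice might look like this. It's a sketch that reuses the `prompt` and `llm` objects from the earlier snippets:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda

# Named subcomponents instead of one long anonymous pipeline
generate_answer = prompt | llm | StrOutputParser()
drop_empty_lines = RunnableLambda(
    lambda text: "\n".join(line for line in text.split("\n") if line.strip())
)

answer_chain = generate_answer | drop_empty_lines
result = answer_chain.invoke({"topic": "databases", "query": "What is an index?"})
```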
|
94 |
+
|
95 |
+
## Conclusion
|
96 |
+
|
97 |
+
For C# developers exploring LCEL, approach it with an open mind. The initial learning curve is worth it, especially for AI workflows where LCEL's parallel execution shines.
|
98 |
+
|
99 |
+
Want to see these concepts in practice? Check out my [Pythonic RAG repository](https://github.com/mafzaal/AIE6-DeployPythonicRAG) for working examples.
|
100 |
+
|
101 |
+
---
|
102 |
+
|
103 |
+
*If you found this useful or have questions about transitioning from C# to LCEL, feel free to [reach out](https://www.linkedin.com/in/muhammadafzaal/) — we’d love to help!*
|
data/metric-driven-development/index.md
ADDED
@@ -0,0 +1,155 @@
1 |
+
---
|
2 |
+
title: "Metric-Driven Development: Make Smarter Decisions, Faster"
|
3 |
+
date: 2025-05-05T00:00:00-06:00
|
4 |
+
layout: blog
|
5 |
+
description: "Your Team's Secret Weapon for Cutting Through Noise and Driving Real Progress. Learn how to use clear metrics to eliminate guesswork and make faster, smarter progress in your projects."
|
6 |
+
categories: ["Development", "Productivity", "AI", "Management"]
|
7 |
+
coverImage: "/images/metric-driven-development.png"
|
8 |
+
readingTime: 9
|
9 |
+
published: true
|
10 |
+
---
|
11 |
+
|
12 |
+
In today's data-driven world, success depends increasingly on our ability to measure the right things at the right time. Whether you're developing AI systems, building web applications, or managing projects, having clear metrics guides your team toward meaningful progress while eliminating subjective debates.
|
13 |
+
|
14 |
+
## The Power of Metrics in AI Evaluation
|
15 |
+
|
16 |
+
Recent advances in generative AI and large language models (LLMs) highlight the critical importance of proper evaluation frameworks. Projects like RAGAS (Retrieval Augmented Generation Assessment System) demonstrate how specialized metrics can transform vague goals into actionable insights.
|
17 |
+
|
18 |
+
For example, when evaluating retrieval-augmented generation systems, generic metrics like BLEU or ROUGE scores often fail to capture what truly matters - the accuracy, relevance, and contextual understanding of the generated responses. RAGAS instead introduces metrics specifically designed for RAG systems:
|
19 |
+
|
20 |
+
* **Faithfulness**: Measures how well the generated answer aligns with the retrieved context
|
21 |
+
* **Answer Relevancy**: Evaluates whether the response correctly addresses the user's query
|
22 |
+
* **Context Relevancy**: Assesses if the system retrieves information that's actually needed
|
23 |
+
* **Context Precision**: Quantifies how efficiently the system uses retrieved information
|
24 |
+
|
25 |
+
These targeted metrics provide clearer direction than general-purpose evaluations, allowing teams to make precise improvements where they matter most.
|
26 |
+
To see why this matters in practice, imagine two teams building a new feature for a streaming platform:
|
27 |
+
|
28 |
+
* **Team A** is stuck in debates. Should they focus on improving video load speed or making the recommendation engine more accurate? One engineer insists, "Faster videos keep users from leaving!" Another counters, "But better recommendations are what make them subscribe!" They argue based on gut feelings.
|
29 |
+
* **Team B** operates differently. They have a clear, agreed-upon goal: ***Improve the average "Watch Time per User" metric, while ensuring video buffering times stay below 2 seconds.*** They rapidly test ideas, measuring the impact of each change against this specific target.
|
30 |
+
|
31 |
+
Which team do you think will make faster, smarter progress?
|
32 |
+
|
33 |
+
|
34 |
+
Team B has the edge because they're using **Metric-Driven Development (MDD)**. This is a powerful strategy where teams unite around measurable goals to eliminate guesswork and make real strides. Let's break down how it works, what makes a metric truly useful, and see how industries from healthcare to e-commerce use it to succeed.
|
35 |
+
|
36 |
+
## What Exactly is Metric-Driven Development?
|
37 |
+
|
38 |
+
Metric-Driven Development (MDD) is a simple but effective framework where teams:
|
39 |
+
|
40 |
+
1. **Define Clear, Measurable Goals:** Set specific numerical targets (e.g., "Increase user sign-ups by 20% this quarter").
|
41 |
+
2. **Base Decisions on Data:** Rely on evidence and measurements, not just opinions or assumptions.
|
42 |
+
3. **Iterate and Learn Quickly:** Continuously measure the impact of changes to see what works and what doesn't.
|
43 |
+
|
44 |
+
Think of MDD as a **GPS for your project**. Without clear metrics, you're driving in the fog, hoping you're heading in the right direction. With MDD, you get real-time feedback, ensuring you're moving towards your destination efficiently.
|
45 |
+
|
46 |
+
## Why Teams Struggle Without Clear Metrics
|
47 |
+
|
48 |
+
Without a metric-driven approach, teams often fall into common traps:
|
49 |
+
|
50 |
+
* **Chasing Too Many Goals:** Trying to improve everything at once ("We need higher accuracy *and* faster speed *and* lower costs!") leads to scattered effort and slow progress.
|
51 |
+
* **Endless Subjective Debates:** Arguments arise that are hard to resolve with data ("Is Model A's slightly better performance worth the extra complexity?").
|
52 |
+
* **Difficulty Measuring Progress:** It's hard to know if you're actually improving ("Are we doing better than last quarter? How can we be sure?").
|
53 |
+
|
54 |
+
In **machine learning (ML)**, this often happens when teams track various technical scores (like precision, recall, or F1 score – measures of model accuracy) without a single, unifying metric tied to the *actual business outcome* they want to achieve.
|
55 |
+
|
56 |
+
## What Makes a Metric Great? The Key Ingredients
|
57 |
+
|
58 |
+
Not all numbers are helpful. A truly effective metric has these essential traits:
|
59 |
+
|
60 |
+
1. **Measurable:** It must be quantifiable and objective. *"95% accuracy"* is measurable; *"a better user experience"* is not, unless defined by specific, measurable indicators.
|
61 |
+
2. **Actionable:** Your team must be able to influence the metric through their work. For example, changing a website's design *can* affect the "click-through rate."
|
62 |
+
3. **Aligned with Business Goals:** The metric should directly contribute to the overall success of the product or business. If user retention is key, optimizing for ad clicks might be counterproductive.
|
63 |
+
4. **Simple & Understandable:** It should be easy for everyone on the team (and stakeholders) to grasp and track. *"Monthly Active Users"* is usually simpler than a complex, weighted formula.
|
64 |
+
5. **Robust (Hard to Game):** The metric shouldn't be easily manipulated in ways that don't reflect real progress. *Example:* A ride-sharing app tracking only "rides booked" could be fooled by drivers booking and immediately canceling rides. A better metric might be "completed rides lasting over 1 minute."
|
65 |
+
6. **Directional:** The desired direction of the metric should be clear – whether you're trying to maximize it (like conversion rate or user retention) or minimize it (like error rate or load time). This clarity helps teams understand exactly what success looks like without ambiguity.
|
66 |
+
|
67 |
+
|
68 |
+
## Deep Dive: Reward Functions in AI – Metrics in Action
|
69 |
+
|
70 |
+
A fascinating application of MDD principles comes from **Reinforcement Learning (RL)**, a type of AI where agents learn through trial and error. In RL, learning is guided by a **reward function**: a numerical score that tells the AI how well it's doing.
|
71 |
+
|
72 |
+
Think of it like training a dog:
|
73 |
+
* Good behavior (sitting on command) gets a treat (positive reward).
|
74 |
+
* Bad behavior (chewing shoes) gets a scold (negative reward or penalty).
|
75 |
+
|
76 |
+
Examples in AI:
|
77 |
+
* A chess-playing AI might get +1 for winning, -1 for losing, and 0 for a draw.
|
78 |
+
* A self-driving car simulation might receive rewards for smooth driving and staying in its lane, and penalties for sudden braking or collisions.
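To make the self-driving example concrete, a toy reward function might look like the sketch below. The structure and weights are purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class Step:
    in_lane: bool
    acceleration: float  # m/s^2; large magnitudes mean harsh braking or acceleration
    collision: bool

def driving_reward(step: Step) -> float:
    """Toy reward: encourage lane keeping and smoothness, heavily penalize collisions."""
    reward = 1.0 if step.in_lane else -2.0
    reward -= 0.5 * abs(step.acceleration)  # discourage jerky driving
    if step.collision:
        reward -= 100.0  # collisions dominate everything else
    return reward

print(driving_reward(Step(in_lane=True, acceleration=-0.3, collision=False)))  # 1.0 - 0.15 = 0.85
```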
|
79 |
+
|
80 |
+
**Why Reward Functions Showcase MDD:**
|
81 |
+
|
82 |
+
Reward functions are essentially highly specialized metrics that:
|
83 |
+
|
84 |
+
* **Define Priorities Clearly:** A robot arm designed to pack boxes might get rewards for speed and gentle handling, but penalties for crushing items. The reward function dictates the trade-offs.
|
85 |
+
* **Guide Behavior in Real-Time:** Unlike metrics evaluated after a project phase, reward functions shape the AI's learning process continuously.
|
86 |
+
* **Require Careful Design to Avoid "Gaming":** Just like business metrics, a poorly designed reward can lead to unintended shortcuts. An RL agent in a game might discover a way to rack up points by repeatedly performing a trivial action, instead of actually trying to win the level. This highlights the importance of the "Robust" trait we discussed earlier.
|
87 |
+
|
88 |
+
Reward functions embody the core MDD idea: set a clear, measurable goal, and let it guide actions towards success.
|
89 |
+
|
90 |
+
## Metric-Driven Development Across Industries: Real-World Examples
|
91 |
+
|
92 |
+
MDD isn't just for software. Here's how different fields use it:
|
93 |
+
|
94 |
+
* **E-Commerce: Conversion Rate**
|
95 |
+
* **Metric:** Percentage of website visitors who make a purchase.
|
96 |
+
* **Impact:** Directly ties development efforts (like A/B testing checkout flows) to revenue growth.
|
97 |
+
* **Healthcare: Patient Readmission Rate**
|
98 |
+
* **Metric:** Percentage of patients readmitted to the hospital within 30 days of discharge.
|
99 |
+
* **Impact:** Focuses efforts on improving care quality and follow-up, leading to better patient outcomes and lower costs.
|
100 |
+
* **Manufacturing: Defect Rate**
|
101 |
+
* **Metric:** Percentage of products produced with flaws.
|
102 |
+
* **Impact:** Drives process improvements on the factory floor, saving costs and enhancing brand reputation.
|
103 |
+
* **Gaming (AI Development): Player Performance Score**
|
104 |
+
* **Metric:** A combined score, e.g., `Points Scored - (Time Taken * Penalty Factor)`.
|
105 |
+
* **Impact:** Trains AI opponents that are challenging but fair, balancing speed and skill.
|
106 |
+
* **Autonomous Vehicles: Safety & Comfort Score**
|
107 |
+
* **Metric:** Combination of factors like smooth acceleration/braking, lane adherence, and deductions for interventions or near-misses.
|
108 |
+
* **Impact:** Guides development towards vehicles that are not only safe but also provide a comfortable ride.
|
109 |
+
|
110 |
+
## Smart Tactics: Optimizing vs. Satisficing Metrics
|
111 |
+
|
112 |
+
Sometimes, you have competing priorities. MDD offers a smart way to handle this using two types of metrics:
|
113 |
+
|
114 |
+
* **Optimizing Metric:** The main goal you want to maximize or minimize (your "North Star").
|
115 |
+
* **Satisficing Metrics:** Other important factors that just need to meet a minimum acceptable level ("good enough").
|
116 |
+
|
117 |
+
*Example: Developing a voice assistant like Alexa or Google Assistant:*
|
118 |
+
|
119 |
+
* **Optimizing Metric:** *Minimize missed commands (false negatives)* – You want it to respond reliably when you speak the wake-word.
|
120 |
+
* **Satisficing Metric:** *Keep false activations below 1 per day (false positives)* – You don't want it waking up constantly when you haven't addressed it, but perfect prevention might hurt its responsiveness.
|
121 |
+
|
122 |
+
This approach prevents teams from sacrificing critical aspects (like basic usability) in the pursuit of perfecting a single metric.
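In code, the selection logic is refreshingly simple. The numbers below are made up purely to illustrate the idea:

```python
# Candidate wake-word models with made-up evaluation results
candidates = [
    {"name": "model_a", "missed_commands_pct": 4.0, "false_activations_per_day": 0.6},
    {"name": "model_b", "missed_commands_pct": 2.5, "false_activations_per_day": 1.4},
    {"name": "model_c", "missed_commands_pct": 3.1, "false_activations_per_day": 0.9},
]

# Keep only models that meet the satisficing constraint...
acceptable = [m for m in candidates if m["false_activations_per_day"] <= 1.0]

# ...then optimize among them: fewest missed commands wins
best = min(acceptable, key=lambda m: m["missed_commands_pct"])
print(best["name"])  # model_c: model_b misses fewer commands but is too noisy
```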
|
123 |
+
|
124 |
+
## Don't Forget Early Signals: The Role of Leading Indicators
|
125 |
+
|
126 |
+
In machine learning projects, **training loss** is a common metric monitored during development. Think of it as a **"practice test score"** for the model – it shows how well the model is learning the patterns in the training data *before* it faces the real world.
|
127 |
+
|
128 |
+
While a low training loss is good (it means the model is learning *something*), it's a **leading indicator**. It doesn't guarantee success on its own. You still need **lagging indicators** – metrics that measure real-world performance, like user satisfaction, task completion rates, or the ultimate business goal (e.g., user retention).
|
129 |
+
|
130 |
+
MDD reminds us to track both:
|
131 |
+
* **Leading indicators** (like training loss, code coverage) to monitor progress during development.
|
132 |
+
* **Lagging indicators** (like user engagement, revenue, customer support tickets) to measure the actual impact.
|
133 |
+
|
134 |
+
## The Takeaway: Use Metrics as Your Compass
|
135 |
+
Metric-Driven Development isn't a complex theory reserved for tech giants. It's a fundamental mindset applicable everywhere:
|
136 |
+
|
137 |
+
* A local bakery might track *"Daily Units Sold per Pastry Type"* to optimize baking schedules.
|
138 |
+
* A city planner could use *"Average Commute Time Reduction"* to evaluate the success of new traffic light patterns.
|
139 |
+
* A project manager might measure progress through *"Sprint Velocity"* or *"Percentage of On-Time Task Completions"* rather than subjective assessments of how "busy" the team appears.
|
140 |
+
|
141 |
+
|
142 |
+
By choosing metrics that are **measurable, actionable, aligned, simple, and robust**, you transform ambiguity into clarity and opinion into evidence.
|
143 |
+
|
144 |
+
Whether you're building sophisticated AI or launching a simple website feature, MDD empowers your team to:
|
145 |
+
|
146 |
+
1. **Move Faster:** Make decisions quickly based on clear success criteria.
|
147 |
+
2. **Collaborate Effectively:** Unite everyone around shared, objective goals.
|
148 |
+
3. **Know When You've Won:** Celebrate real, measurable progress.
|
149 |
+
|
150 |
+
So, the next time your team feels stuck or unsure about the path forward, ask the crucial question: ***What's our metric?***
|
151 |
+
|
152 |
+
Finding that answer might just be the compass you need to navigate towards success.
|
153 |
+
|
154 |
+
---
|
155 |
+
*Inspired by insights from Andrew Ng's [Machine Learning Yearning](https://info.deeplearning.ai/machine-learning-yearning-book). Remember: A great metric doesn't just measure success—it actively helps create it.*
|
data/rss-feed-announcement/index.md
ADDED
@@ -0,0 +1,55 @@
1 |
+
---
|
2 |
+
title: "Subscribe to Our Blog via RSS"
|
3 |
+
date: 2025-05-03T00:00:00-06:00
|
4 |
+
description: "Stay updated with our latest content by subscribing to our new RSS feed"
|
5 |
+
categories: ["Announcements", "Blog"]
|
6 |
+
published: true
|
7 |
+
layout: blog
|
8 |
+
coverImage: "/images/rss-announcement.png"
|
9 |
+
readingTime: 2
|
10 |
+
---
|
11 |
+
|
12 |
+
# Subscribe to Our Blog via RSS
|
13 |
+
|
14 |
+
I'm excited to announce that TheDataGuy blog now supports RSS feeds! This means you can now easily stay updated with all the latest posts without having to manually check the website.
|
15 |
+
|
16 |
+
## What is RSS?
|
17 |
+
|
18 |
+
RSS (Really Simple Syndication) is a web feed that allows you to subscribe to updates from websites you follow. When new content is published, your RSS reader will automatically notify you and display the latest posts.
|
19 |
+
|
20 |
+
## Why Use RSS?
|
21 |
+
|
22 |
+
There are several benefits to using RSS:
|
23 |
+
|
24 |
+
- **No algorithms**: Unlike social media, RSS feeds show you everything from the sources you subscribe to, in chronological order.
|
25 |
+
- **No ads or distractions**: Get pure content without the clutter.
|
26 |
+
- **Privacy**: RSS readers don't track you like social media platforms do.
|
27 |
+
- **Efficiency**: Check all your favorite sites in one place instead of visiting each individually.
|
28 |
+
|
29 |
+
## How to Subscribe
|
30 |
+
|
31 |
+
You can subscribe to our RSS feed in a few easy steps:
|
32 |
+
|
33 |
+
1. Copy this link: `https://thedataguy.pro/rss.xml`
|
34 |
+
2. Open your favorite RSS reader (like Feedly, Inoreader, NewsBlur, or even built-in RSS features in browsers like Vivaldi)
|
35 |
+
3. Add a new subscription and paste the link
|
36 |
+
|
37 |
+
Alternatively, just click the RSS button in the navigation bar of our blog.
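And if you prefer to script it, here's a minimal sketch using the third-party `feedparser` package:

```python
# pip install feedparser
import feedparser

feed = feedparser.parse("https://thedataguy.pro/rss.xml")
for entry in feed.entries[:5]:
    print(entry.title, "->", entry.link)
```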
|
38 |
+
|
39 |
+
## Popular RSS Readers
|
40 |
+
|
41 |
+
If you don't have an RSS reader yet, here are some popular options:
|
42 |
+
|
43 |
+
- [Feedly](https://feedly.com/)
|
44 |
+
- [Inoreader](https://www.inoreader.com/)
|
45 |
+
- [NewsBlur](https://newsblur.com/)
|
46 |
+
- [Feedbin](https://feedbin.com/)
|
47 |
+
- [The Old Reader](https://theoldreader.com/)
|
48 |
+
|
49 |
+
Browsers like Vivaldi also include built-in RSS readers, and you can add the same capability to Firefox or Chrome with an extension.
|
50 |
+
|
51 |
+
## What's Next?
|
52 |
+
|
53 |
+
I'll continue to improve the blog experience based on your feedback. If you have any suggestions or feature requests, feel free to [reach out](https://www.linkedin.com/in/muhammadafzaal/).
|
54 |
+
|
55 |
+
Happy reading!
|
py-src/app.py
CHANGED
@@ -19,9 +19,11 @@ from qdrant_client.http.models import Distance, VectorParams
|
|
19 |
from lets_talk.config import LLM_MODEL, LLM_TEMPERATURE
|
20 |
import lets_talk.utils.blog as blog
|
21 |
from lets_talk.agent import build_agent,parse_output
|
|
|
22 |
|
23 |
|
24 |
-
|
|
|
25 |
|
26 |
tdg_agent = build_agent()
|
27 |
|
|
|
19 |
from lets_talk.config import LLM_MODEL, LLM_TEMPERATURE
|
20 |
import lets_talk.utils.blog as blog
|
21 |
from lets_talk.agent import build_agent,parse_output
|
22 |
+
import pipeline
|
23 |
|
24 |
|
25 |
+
#build vector store
|
26 |
+
pipeline.main()
|
27 |
|
28 |
tdg_agent = build_agent()
|
29 |
|