mafzaal commited on
Commit
4ba7a19
·
1 Parent(s): f6edec6

Add multiple blog posts on Ragas evaluation framework and metric-driven development

Browse files

- Introduced "Part 1: Introduction to Ragas" covering the evaluation framework for LLM applications.
- Added "Part 4: Generating Test Data with Ragas" detailing synthetic data generation techniques.
- Included "Part 6: Integrations and Observability with Ragas" discussing integration with various frameworks and observability platforms.
- Published "Metric-Driven Development: Make Smarter Decisions, Faster" emphasizing the importance of metrics in project management.
- Announced the availability of an RSS feed for the blog to enhance content accessibility.
- Added a blog post titled "A C# Programmer's Perspective on LangChain Expression Language" sharing insights on transitioning from C# to LCEL.

.gitignore CHANGED
@@ -1,4 +1,4 @@
1
- data/
2
  db/
3
 
4
 
 
1
+
2
  db/
3
 
4
 
Dockerfile CHANGED
@@ -25,8 +25,10 @@ COPY --chown=user ./uv.lock $HOME/app
25
  # Install the dependencies
26
  # RUN uv sync --frozen
27
  RUN uv sync
28
- #copdy vectorstore
29
- COPY --chown=user ./db/ $HOME/app/db
 
 
30
  # Expose the port
31
  EXPOSE 7860
32
 
 
25
  # Install the dependencies
26
  # RUN uv sync --frozen
27
  RUN uv sync
28
+
29
+ #TODO: Fix this to download
30
+ #copy posts to container
31
+ COPY --chown=user ./data/ $HOME/app/data
32
  # Expose the port
33
  EXPOSE 7860
34
 
data/advanced-metrics-and-customization-with-ragas/index.md ADDED
@@ -0,0 +1,245 @@
1
+ ---
2
+ title: "Part 5: Advanced Metrics and Customization with Ragas"
3
+ date: 2025-04-28T05:00:00-06:00
4
+ layout: blog
5
+ description: "Explore advanced metrics and customization techniques in Ragas for evaluating LLM applications, including creating custom metrics, domain-specific evaluation, composite scoring, and best practices for building a comprehensive evaluation ecosystem."
6
+ categories: ["AI", "RAG", "Evaluation", "Ragas","Data"]
7
+ coverImage: "https://plus.unsplash.com/premium_photo-1661368994107-43200954c524?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
8
+ readingTime: 9
9
+ published: true
10
+ ---
11
+
12
+ In our previous post, we explored how to generate comprehensive test datasets for evaluating LLM applications. Now, let's dive into one of Ragas' most powerful capabilities: advanced metrics and custom evaluation approaches that address specialized evaluation needs.
13
+
14
+ ## Beyond the Basics: Why Advanced Metrics Matter
15
+
16
+ While Ragas' core metrics cover fundamental evaluation aspects, real-world applications often have unique requirements:
17
+
18
+ - **Domain-specific quality criteria**: Legal, medical, or financial applications have specialized accuracy requirements
19
+ - **Custom interaction patterns**: Applications with unique conversation flows need tailored evaluation approaches
20
+ - **Specialized capabilities**: Features like reasoning, code generation, or structured output demand purpose-built metrics
21
+ - **Business-specific KPIs**: Aligning evaluation with business objectives requires customized metrics
22
+
23
+ Let's explore how to extend Ragas' capabilities to meet these specialized needs.
24
+
25
+ ## Understanding Ragas' Metric Architecture
26
+
27
+ Before creating custom metrics, it's helpful to understand Ragas' metric architecture:
28
+
29
+ ### 1. Understand the Metric Base Classes
30
+
31
+ All metrics in Ragas inherit from the abstract `Metric` class (see `metrics/base.py`). For most use cases, you’ll extend one of these:
32
+
33
+ - **SingleTurnMetric**: For metrics that evaluate a single question/response pair.
34
+ - **MultiTurnMetric**: For metrics that evaluate multi-turn conversations.
35
+ - **MetricWithLLM**: For metrics that require an LLM for evaluation.
36
+ - **MetricWithEmbeddings**: For metrics that use embeddings.
37
+
38
+ You can mix these as needed (e.g., `MetricWithLLM, SingleTurnMetric`).
39
+
40
+ Each metric implements specific scoring methods depending on its type:
41
+
42
+ - `_single_turn_ascore`: For single-turn metrics
43
+ - `_multi_turn_ascore`: For multi-turn metrics
44
+
45
+
46
+ ## Creating Your First Custom Metric
47
+
48
+ Let's create a custom metric that evaluates technical accuracy in programming explanations:
49
+
50
+ ```python
51
+ from dataclasses import dataclass, field
52
+ from typing import Dict, Optional, Set
53
+ import typing as t
54
+
55
+ from ragas.metrics.base import MetricWithLLM, SingleTurnMetric
56
+ from ragas.prompt import PydanticPrompt
57
+ from ragas.metrics import MetricType, MetricOutputType
58
+ from pydantic import BaseModel
59
+
60
+ # Define input/output models for the prompt
61
+ class TechnicalAccuracyInput(BaseModel):
62
+ question: str
63
+ context: str
64
+ response: str
65
+ programming_language: str = "python"
66
+
67
+ class TechnicalAccuracyOutput(BaseModel):
68
+ score: float
69
+ feedback: str
70
+
71
+
72
+ # Define the prompt
73
+ class TechnicalAccuracyPrompt(PydanticPrompt[TechnicalAccuracyInput, TechnicalAccuracyOutput]):
74
+ instruction: str = (
75
+ "Evaluate the technical accuracy of the response to a programming question. "
76
+ "Consider syntax correctness, algorithmic accuracy, and best practices."
77
+ )
78
+ input_model = TechnicalAccuracyInput
79
+ output_model = TechnicalAccuracyOutput
80
+ examples = [
81
+ # Add examples here
82
+ ]
83
+
84
+ # Create the metric
85
+ @dataclass
86
+ class TechnicalAccuracy(MetricWithLLM, SingleTurnMetric):
87
+ name: str = "technical_accuracy"
88
+ _required_columns: Dict[MetricType, Set[str]] = field(
89
+ default_factory=lambda: {
90
+ MetricType.SINGLE_TURN: {
91
+ "user_input",
92
+ "response",
93
+
94
+ }
95
+ }
96
+ )
97
+ output_type: Optional[MetricOutputType] = MetricOutputType.CONTINUOUS
98
+ evaluation_prompt: PydanticPrompt = field(default_factory=TechnicalAccuracyPrompt)
99
+
100
+ async def _single_turn_ascore(self, sample, callbacks) -> float:
101
+ assert self.llm is not None, "LLM must be set"
102
+
103
+ question = sample.user_input
104
+ response = sample.response
105
+ # Extract programming language from question if possible
106
+ programming_language = "python" # Default
107
+ languages = ["python", "javascript", "java", "c++", "rust", "go"]
108
+ for lang in languages:
109
+ if lang in question.lower():
110
+ programming_language = lang
111
+ break
112
+
113
+ # Get the context
114
+ context = "\n".join(sample.retrieved_contexts) if sample.retrieved_contexts else ""
115
+
116
+ # Prepare input for prompt
117
+ prompt_input = TechnicalAccuracyInput(
118
+ question=question,
119
+ context=context,
120
+ response=response,
121
+ programming_language=programming_language
122
+ )
123
+
124
+ # Generate evaluation
125
+ evaluation = await self.evaluation_prompt.generate(
126
+ data=prompt_input, llm=self.llm, callbacks=callbacks
127
+ )
128
+
129
+ return evaluation.score
130
+ ```
131
+ ## Using the Custom Metric
132
+ To use the custom metric, simply include it in your evaluation pipeline:
133
+
134
+ ```python
135
+ from langchain_openai import ChatOpenAI
136
+ from ragas import SingleTurnSample
137
+ from ragas.llms import LangchainLLMWrapper
138
+
139
+ # Initialize the LLM; you will need an OpenAI API key set in your environment
140
+ evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
141
+
142
+ test_data = {
143
+ "user_input": "Write a function to calculate the factorial of a number in Python.",
144
+ "retrieved_contexts": ["Python is a programming language.", "A factorial of a number n is the product of all positive integers less than or equal to n."],
145
+ "response": "def factorial(n):\n if n == 0:\n return 1\n else:\n return n * factorial(n-1)",
146
+ }
147
+
148
+ # Create a sample
149
+ sample = SingleTurnSample(**test_data) # Unpack the dictionary into the constructor
150
+ technical_accuracy = TechnicalAccuracy(llm=evaluator_llm)
151
+ score = await technical_accuracy.single_turn_ascore(sample)
152
+ print(f"Technical Accuracy Score: {score}")
153
+ # Note: The above code is a simplified example. In a real-world scenario, you would also handle exceptions and edge cases.
154
+ ```
155
+ You can also use the `evaluate` function to evaluate a dataset:
156
+
157
+ ```python
158
+ from ragas import evaluate
160
+
161
+ results = evaluate(
162
+ dataset, # Your dataset of samples
163
+ metrics=[TechnicalAccuracy(), ...],
164
+ llm=evaluator_llm  # the evaluator LLM wrapper defined above
165
+ )
166
+ ```
167
+
168
+ > 💡 **Try it yourself:**
169
+ > Explore the hands-on notebook for advanced metrics and customization:
170
+ > [05_Advanced_Metrics_and_Customization](https://github.com/mafzaal/intro-to-ragas/blob/master/05_Advanced_Metrics_and_Customization.ipynb)
171
+
172
+ ## Customizing Metrics for Your Application
173
+
174
+ You can further refine your evaluation by customizing existing metrics—such as adjusting thresholds or criteria—to better fit your application's requirements. For multi-turn conversations, you might configure metrics like topic adherence to emphasize specific aspects, such as precision or recall, based on your evaluation objectives.
175
+
176
+ In specialized domains like healthcare or legal, it's crucial to design custom metrics that capture domain-specific accuracy and compliance needs. For complex applications, consider combining several metrics into composite scores to represent multiple quality dimensions.
177
+
178
+ When assessing capabilities like code generation or structured outputs, develop metrics that evaluate execution correctness or schema compliance. For advanced scenarios, you can build metric pipelines that orchestrate several metrics and aggregate their results using strategies like weighted averages or minimum scores.
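+
+ As a small illustration of the aggregation idea, here is a minimal sketch that folds individual metric scores into a composite value. The metric names and weights are assumptions for illustration; substitute whatever your evaluation actually produces:
+
+ ```python
+ # Hypothetical per-metric scores, e.g. pulled from a Ragas evaluation result
+ scores = {
+     "faithfulness": 0.92,
+     "technical_accuracy": 0.78,
+     "answer_relevancy": 0.88,
+ }
+
+ # Business-specific weights (assumed); they should sum to 1
+ weights = {
+     "faithfulness": 0.5,
+     "technical_accuracy": 0.3,
+     "answer_relevancy": 0.2,
+ }
+
+ # Weighted-average composite score
+ composite = sum(scores[name] * weights[name] for name in weights)
+
+ # A stricter alternative: gate overall quality on the weakest dimension
+ composite_min = min(scores.values())
+
+ print(f"Weighted composite: {composite:.2f}, minimum-score composite: {composite_min:.2f}")
+ ```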
179
+
180
+ By thoughtfully customizing and combining metrics, you can achieve a comprehensive and meaningful evaluation framework tailored to your unique use case.
181
+
182
+ ## Best Practices for Custom Metric Development
183
+
184
+ 1. **Single Responsibility**: Each metric should evaluate one specific aspect
185
+ 2. **Clear Definition**: Define precisely what your metric measures
186
+ 3. **Bounded Output**: Scores should be normalized, typically in [0,1]
187
+ 4. **Reproducibility**: Minimize randomness in evaluation
188
+ 5. **Documentation**: Document criteria, prompt design, and interpretation guidelines
189
+ 6. **Test with Examples**: Verify metric behavior on clear-cut examples
190
+ 7. **Human Correlation**: Validate that metrics correlate with human judgment
191
+
192
+ ## Standardizing Custom Metrics
193
+
194
+ To ensure consistency across custom metrics, consider the following best practices:
195
+
196
+ - Define a clear, human-readable description for each metric.
197
+ - Provide interpretation guidelines to help users understand score meanings.
198
+ - Include metadata such as metric name, required columns, and output type.
199
+ - Use a standardized interface or base class for all custom metrics.
200
+
201
+ ## Implementation Patterns for Advanced Metrics
202
+
203
+ When developing advanced metrics like topic adherence:
204
+
205
+ - Design multi-step evaluation workflows for complex tasks.
206
+ - Use specialized prompts for different sub-tasks within the metric.
207
+ - Allow configurable scoring modes (e.g., precision, recall, F1).
208
+ - Support conversational context for multi-turn evaluations.
209
+
210
+ ## Debugging Custom Metrics
211
+
212
+ Effective debugging strategies include (a minimal logging sketch follows this list):
213
+
214
+ - Implementing a debug mode to capture prompt inputs, outputs, and intermediate results.
215
+ - Logging detailed evaluation steps for easier troubleshooting.
216
+ - Reviewing final scores alongside intermediate calculations to identify issues.
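+
+ Here is a minimal sketch of that logging idea, wrapping a metric call so its inputs and resulting score are captured for inspection (the `technical_accuracy` metric and `sample` from the earlier example are assumed):
+
+ ```python
+ import logging
+
+ logging.basicConfig(level=logging.DEBUG)
+ logger = logging.getLogger("metric_debug")
+
+ async def debug_single_turn_score(metric, sample):
+     """Score a single-turn sample while logging inputs and the result for troubleshooting."""
+     logger.debug("Metric: %s", metric.name)
+     logger.debug("User input: %s", sample.user_input)
+     logger.debug("Response: %s", sample.response)
+     score = await metric.single_turn_ascore(sample)
+     logger.debug("Score: %s", score)
+     return score
+
+ # Usage, with the metric and sample defined earlier:
+ # score = await debug_single_turn_score(technical_accuracy, sample)
+ ```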
217
+
218
+
219
+ ## Conclusion: Building an Evaluation Ecosystem
220
+
221
+ Custom metrics allow you to build a comprehensive evaluation ecosystem tailored to your application's specific needs:
222
+
223
+ 1. **Baseline metrics**: Start with Ragas' core metrics for fundamental quality aspects
224
+ 2. **Domain adaptation**: Add specialized metrics for your application domain
225
+ 3. **Feature-specific metrics**: Develop metrics for unique features of your system
226
+ 4. **Business alignment**: Create metrics that reflect specific business KPIs and requirements
227
+
228
+ By extending Ragas with custom metrics, you can create evaluation frameworks that precisely measure what matters most for your LLM applications, leading to more meaningful improvements and better user experiences.
229
+
230
+ In our next post, we'll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows.
231
+
232
+ ---
233
+
234
+ **[Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications](/blog/introduction-to-ragas/)**
235
+ **[Part 2: Basic Evaluation Workflow](/blog/basic-evaluation-workflow-with-ragas/)**
236
+ **[Part 3: Evaluating RAG Systems with Ragas](/blog/evaluating-rag-systems-with-ragas/)**
237
+ **[Part 4: Test Data Generation](/blog/generating-test-data-with-ragas)**
238
+ **Part 5: Advanced Evaluation Techniques — _You are here_**
239
+ *Next up in the series:*
240
+ **[Part 6: Evaluating AI Agents](/blog/evaluating-ai-agents-with-ragas/)**
241
+ **[Part 7: Integrations and Observability](/blog/integrations-and-observability-with-ragas/)**
242
+ **[Part 8: Building Feedback Loops](/blog/building-feedback-loops-with-ragas/)**
243
+
244
+
245
+ *How have you extended Ragas with custom metrics in your LLM applications? Which specialized evaluation needs have they addressed? If you’re facing specific evaluation hurdles, don’t hesitate to [reach out](https://www.linkedin.com/in/muhammadafzaal/)—we’d love to help!*
data/basic-evaluation-workflow-with-ragas/index.md ADDED
@@ -0,0 +1,248 @@
1
+ ---
2
+ title: "Part 2: Basic Evaluation Workflow with Ragas"
3
+ date: 2025-04-26T19:00:00-06:00
4
+ layout: blog
5
+ description: "Learn how to set up a basic evaluation workflow for LLM applications using Ragas. This guide walks you through data preparation, metric selection, and result analysis."
6
+ categories: ["AI", "RAG", "Evaluation", "Ragas"]
7
+ coverImage: "https://images.unsplash.com/photo-1600132806370-bf17e65e942f?q=80&w=1988&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
8
+ readingTime: 8
9
+ published: true
10
+ ---
11
+
12
+ In our previous post, we introduced Ragas as a powerful framework for evaluating LLM applications. Now, let's dive into the practical aspects of setting up your first evaluation pipeline.
13
+
14
+ ## Understanding the Evaluation Workflow
15
+
16
+ A typical Ragas evaluation workflow consists of four key steps:
17
+
18
+ 1. **Prepare your data**: Collect queries, contexts, responses, and reference answers
19
+ 2. **Select appropriate metrics**: Choose metrics that align with what you want to evaluate
20
+ 3. **Run the evaluation**: Process your data through the selected metrics
21
+ 4. **Analyze the results**: Interpret scores and identify areas for improvement
22
+
23
+ Let's walk through each step with practical examples.
24
+
25
+ ## Step 1: Setting Up Your Environment
26
+
27
+ First, ensure you have Ragas installed:
28
+
29
+ ```bash
30
+ uv add ragas
31
+ ```
32
+
33
+ Next, import the necessary components:
34
+
35
+ ```python
36
+ import pandas as pd
37
+ from ragas import EvaluationDataset
38
+ from ragas import evaluate, RunConfig
39
+ from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
40
+ ```
41
+
42
+ ## Step 2: Preparing Your Evaluation Data
43
+
44
+ For a RAG system evaluation, you'll need:
45
+
46
+ - **Questions**: User queries to your system
47
+ - **Contexts**: Documents or chunks retrieved by your system
48
+ - **Responses**: Answers generated by your system
49
+ - **Ground truth** (optional): Reference answers or documents for comparison
50
+
51
+ Here's how to organize this data:
52
+
53
+ ```python
54
+ # Sample data
55
+ data = {
56
+ "user_input": [
57
+ "What are the main symptoms of COVID-19?",
58
+ "How does machine learning differ from deep learning?"
59
+ ],
60
+ "retrieved_contexts": [
61
+ [
62
+ "Common symptoms of COVID-19 include fever, cough, and fatigue. Some patients also report loss of taste or smell, body aches, and difficulty breathing.",
63
+ "COVID-19 is caused by the SARS-CoV-2 virus and spreads primarily through respiratory droplets."
64
+ ],
65
+ [
66
+ "Machine learning is a subset of AI focused on algorithms that learn from data without being explicitly programmed.",
67
+ "Deep learning is a specialized form of machine learning using neural networks with many layers (deep neural networks)."
68
+ ]
69
+ ],
70
+ "response": [
71
+ "The main symptoms of COVID-19 include fever, cough, fatigue, and sometimes loss of taste or smell, body aches, and breathing difficulties.",
72
+ "Machine learning is a subset of AI that focuses on algorithms learning from data, while deep learning is a specialized form of machine learning that uses deep neural networks with multiple layers."
73
+ ],
74
+ "reference": [
75
+ "COVID-19 symptoms commonly include fever, dry cough, fatigue, loss of taste or smell, body aches, sore throat, and in severe cases, difficulty breathing.",
76
+ "Machine learning is a branch of AI where systems learn from data, identify patterns, and make decisions with minimal human intervention. Deep learning is a subset of machine learning that uses neural networks with multiple layers (deep neural networks) to analyze various factors of data."
77
+ ]
78
+ }
79
+
80
+ eval_data = pd.DataFrame(data)
81
+
82
+ # Convert to a format Ragas can use
83
+ evaluation_dataset = EvaluationDataset.from_pandas(eval_data)
84
+ evaluation_dataset
85
+
86
+ ```
87
+
88
+ ## Step 3: Selecting and Configuring Metrics
89
+
90
+ Ragas offers various metrics to evaluate different aspects of your system:
91
+
92
+ ### Core RAG Metrics:
93
+
94
+ - **Faithfulness**: Measures if the response is factually consistent with the provided context.
95
+ - **Factual Correctness**: Assesses if the response is accurate and free from factual errors.
96
+ - **Response Relevancy**: Evaluates if the response directly addresses the user query.
97
+ - **Context Entity Recall**: Measures how well the retrieved context captures relevant entities from the ground truth.
98
+ - **Noise Sensitivity**: Assesses the robustness of the response to irrelevant or noisy context.
99
+ - **LLM Context Recall**: Evaluates how effectively the LLM utilizes the provided context to generate the response.
100
+
101
+ For metrics that require an LLM (like faithfulness), you need to configure the LLM provider:
102
+
103
+ ```python
104
+ # Configure LLM for evaluation
105
+ from langchain_openai import ChatOpenAI
106
+ from ragas.llms import LangchainLLMWrapper
107
+
108
+ # Initialize the LLM; you will need an OpenAI API key set in your environment
109
+ evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
110
+
111
+ # Define metrics to use
112
+ metrics = [
113
+ Faithfulness(),
114
+ FactualCorrectness(),
115
+ ResponseRelevancy(),
116
+ ContextEntityRecall(),
117
+ NoiseSensitivity(),
118
+ LLMContextRecall()
119
+ ]
120
+ ```
121
+
122
+ ## Step 4: Running the Evaluation
123
+
124
+ Now, run the evaluation with your selected metrics:
125
+
126
+ ```python
127
+ # Run evaluation
128
+ results = evaluate(
129
+ evaluation_dataset,
130
+ metrics=metrics,
131
+ llm=evaluator_llm # Required for LLM-based metrics
132
+ )
133
+
134
+ # View results
135
+ print(results)
136
+ ```
137
+ ### Output:
138
+
139
+ *Values will vary based on your data and LLM performance.*
140
+
141
+ ```python
142
+ {
143
+ "faithfulness": 1.0000,
144
+ "factual_correctness": 0.6750,
145
+ "answer_relevancy": 0.9897,
146
+ "context_entity_recall": 0.8889,
147
+ "noise_sensitivity_relevant": 0.1667,
148
+ "context_recall": 0.5000
149
+ }
150
+ ```
151
+
152
+
153
+ ## Step 5: Interpreting Results
154
+
155
+ Ragas metrics typically return scores between 0 and 1, where higher is better. A small helper for flagging low-scoring metrics follows the ranges below:
156
+
157
+ ### Understanding Score Ranges:
158
+
159
+ - **0.8-1.0**: Excellent performance
160
+ - **0.6-0.8**: Good performance
161
+ - **0.4-0.6**: Moderate performance, needs improvement
162
+ - **Below 0.4**: Poor performance, requires significant attention
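+
+ As a quick illustration, the snippet below flags metrics that fall under a chosen threshold, using the shape of the results dictionary shown in Step 4. The threshold is an assumption to tune for your application, and for metrics where lower is better (such as noise sensitivity) the check should be inverted:
+
+ ```python
+ # Scores in the shape printed in Step 4
+ results = {
+     "faithfulness": 1.0000,
+     "factual_correctness": 0.6750,
+     "answer_relevancy": 0.9897,
+     "context_entity_recall": 0.8889,
+     "noise_sensitivity_relevant": 0.1667,
+     "context_recall": 0.5000,
+ }
+
+ THRESHOLD = 0.6  # assumed minimum acceptable score
+
+ needs_attention = {name: score for name, score in results.items() if score < THRESHOLD}
+ print("Metrics needing improvement:", needs_attention)
+ ```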
163
+
164
+ ## Advanced Use: Custom Evaluation for Specific Examples
165
+
166
+ For more detailed analysis of specific examples:
167
+
168
+ ```python
169
+ from ragas import SingleTurnSample
170
+ from ragas.metrics import AspectCritic
171
+
172
+ # Define a specific test case
173
+ test_data = {
174
+ "user_input": "What are quantum computers?",
175
+ "response": "Quantum computers use quantum bits or qubits that can exist in multiple states simultaneously, unlike classical bits that can only be 0 or 1.",
176
+ "retrieved_contexts": ["Quantum computing is a type of computation that harnesses quantum mechanical phenomena."]
177
+ }
178
+
179
+ # Create a custom evaluation metric
180
+ custom_metric = AspectCritic(
181
+ name="quantum_accuracy",
182
+ llm=evaluator_llm,  # the evaluator LLM configured in Step 3
183
+ definition="Verify if the explanation of quantum computing is accurate and complete."
184
+ )
185
+
186
+ # Score the sample
187
+ sample = SingleTurnSample(**test_data)
188
+ score = await custom_metric.single_turn_ascore(sample)
189
+ print(f"Quantum accuracy score: {score}")
190
+ ```
191
+ > 💡 **Try it yourself:**
192
+ > Explore the hands-on notebook for this workflow:
193
+ > [02_Basic_Evaluation_Workflow_with_Ragas](https://github.com/mafzaal/intro-to-ragas/blob/master/02_Basic_Evaluation_Workflow_with_Ragas.ipynb)
194
+
195
+ ## Common Evaluation Patterns and Metrics
196
+
197
+ Below is a matrix mapping evaluation patterns to the metrics used, along with definitions for each metric:
198
+
199
+ | **Metric** | **Comprehensive RAG Evaluation** | **Content Quality Evaluation** | **Retrieval Quality Evaluation** |
200
+ |-----------------------------|----------------------------------|---------------------------------|-----------------------------------|
201
+ | **Faithfulness** | ✓ | ✓ | |
202
+ | **Answer Relevancy** | ✓ | ✓ | |
203
+ | **Context Recall** | ✓ | | ✓ |
204
+ | **Context Precision** | ✓ | | ✓ |
205
+ | **Harmfulness** | | ✓ | |
206
+ | **Coherence** | | ✓ | |
207
+ | **Context Relevancy** | | | ✓ |
208
+
209
+ ### Metric Definitions
210
+
211
+ - **Faithfulness**: Measures if the response is factually consistent with the provided context.
212
+ - **Answer Relevancy**: Assesses if the response addresses the question.
213
+ - **Context Recall**: Measures how well the retrieved context covers the information in the ground truth.
214
+ - **Context Precision**: Evaluates the proportion of relevant information in the retrieved context.
215
+ - **Harmfulness**: Evaluates if the response contains harmful or inappropriate content.
216
+ - **Coherence**: Measures the logical flow and clarity of the response.
217
+ - **Context Relevancy**: Evaluates if the retrieved context is relevant to the question.
218
+
219
+ This matrix provides a clear overview of which metrics to use for specific evaluation patterns and their respective definitions.
220
+
221
+ ## Best Practices for Ragas Evaluation
222
+
223
+ 1. **Start simple**: Begin with core metrics before adding more specialized ones
224
+ 2. **Use diverse test cases**: Include a variety of questions, from simple to complex
225
+ 3. **Consider edge cases**: Test with queries that might challenge your system
226
+ 4. **Compare versions**: Track metrics across different versions of your application
227
+ 5. **Combine with human evaluation**: Use Ragas alongside human feedback for a comprehensive assessment
228
+
229
+ ## Conclusion
230
+
231
+ Setting up a basic evaluation workflow with Ragas is straightforward yet powerful. By systematically evaluating your LLM applications, you gain objective insights into their performance and clear directions for improvement.
232
+
233
+ In our next post, we'll delve deeper into specialized evaluation techniques for RAG systems, exploring advanced metrics and evaluation strategies for retrieval-augmented generation applications.
234
+
235
+ ---
236
+
237
+ **[Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications](/blog/introduction-to-ragas/)**
238
+ **Part 2: Basic Evaluation Workflow — _You are here_**
239
+ *Next up in the series:*
240
+ **[Part 3: Evaluating RAG Systems](/blog/evaluating-rag-systems-with-ragas/)**
241
+ **[Part 4: Test Data Generation](/blog/generating-test-data-with-ragas/)**
242
+ **[Part 5: Advanced Evaluation Techniques](/blog/advanced-metrics-and-customization-with-ragas)**
243
+ **[Part 6: Evaluating AI Agents](/blog/evaluating-ai-agents-with-ragas/)**
244
+ **[Part 7: Integrations and Observability](/blog/integrations-and-observability-with-ragas/)**
245
+ **[Part 8: Building Feedback Loops](/blog/building-feedback-loops-with-ragas/)**
246
+
247
+
248
+ *Have you set up your first Ragas evaluation? What aspects of your LLM application are you most interested in measuring? If you’re facing specific evaluation hurdles, don’t hesitate to [reach out](https://www.linkedin.com/in/muhammadafzaal/)—we’d love to help!*
data/building-feedback-loops-with-ragas/index.md ADDED
@@ -0,0 +1,161 @@
1
+ ---
2
+ title: "Part 8: Building Feedback Loops with Ragas"
3
+ date: 2025-05-04T00:00:00-06:00
4
+ layout: blog
5
+ description: "A research-driven guide to designing robust, actionable feedback loops for LLM and RAG systems using Ragas. Learn how to select metrics, set baselines, define thresholds, and incorporate user and human feedback for continuous improvement."
6
+ categories: ["AI", "RAG", "Evaluation", "Ragas", "Data"]
7
+ coverImage: "/images/building-feedback-loops.png"
8
+ readingTime: 10
9
+ published: true
10
+ ---
11
+
12
+
13
+ A high-performing LLM or RAG system is never static. The most successful teams treat evaluation as a continuous, iterative process—one that closes the loop between measurement, analysis, and improvement. In this post, we’ll design a research-backed feedback loop process using Ragas, focusing on actionable activities at each stage and strategies for integrating user and human feedback.
14
+
15
+
16
+ ## Designing the Feedback Loop: A Stepwise Process
17
+
18
+ The feedback loop process is a systematic approach to continuously improve your LLM or RAG system. It consists of seven key steps, each building on the previous one to create a sustainable cycle of evidence-driven progress.
19
+
20
+ ![Feedback Loop Process](/images/feedback-loop-process.png)
21
+
22
+ ### 1. Select the Right Metric
23
+
24
+ **Purpose:**
25
+ Identify metrics that best reflect your application’s goals and user needs.
26
+
27
+ **Activities:**
28
+ - Map business objectives to measurable outcomes (e.g., accuracy, faithfulness, relevancy).
29
+ - Review available Ragas metrics and select those most aligned with your use case.
30
+ - Periodically revisit metric selection as your product or user base evolves.
31
+
32
+ ### 2. Develop and Measure Baseline Metrics
33
+
34
+ **Purpose:**
35
+ Establish a reference point for current system performance.
36
+
37
+ **Activities:**
38
+ - Assemble a representative evaluation dataset.
39
+ - Run your system and record metric scores for each example.
40
+ - Document baseline results for all selected metrics.
41
+ - Ensure the baseline dataset remains stable for future comparisons.
42
+
43
+ ### 3. Analyze and Define Acceptable Threshold Values
44
+
45
+ **Purpose:**
46
+ Set clear, actionable standards for what constitutes “good enough” performance.
47
+
48
+ **Activities:**
49
+ - Analyze baseline metric distributions (mean, variance, outliers); a small sketch follows this list.
50
+ - Consult stakeholders to define minimum acceptable values for each metric.
51
+ - Document thresholds and rationale for transparency.
52
+ - Consider different thresholds for different segments (e.g., critical vs. non-critical queries).
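+
+ Here is the small sketch mentioned above: given per-example scores from the baseline run, compute the mean and standard deviation and derive a candidate threshold. The example scores and the "one standard deviation below the mean" convention are assumptions, not a rule:
+
+ ```python
+ import statistics
+
+ # Hypothetical per-example faithfulness scores from the baseline run
+ baseline_faithfulness = [0.82, 0.75, 0.91, 0.68, 0.88, 0.79]
+
+ mean = statistics.mean(baseline_faithfulness)
+ stdev = statistics.stdev(baseline_faithfulness)
+
+ # Candidate threshold: one standard deviation below the baseline mean
+ threshold = mean - stdev
+ print(f"mean={mean:.2f}, stdev={stdev:.2f}, proposed threshold={threshold:.2f}")
+ ```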
53
+
54
+ ### 4. Evaluate and Select Improvement Areas
55
+
56
+ **Purpose:**
57
+ Identify where your system most often fails to meet thresholds and prioritize improvements.
58
+
59
+ **Activities:**
60
+ - Segment evaluation results by metric, query type, or user group.
61
+ - Identify patterns or clusters of failure (e.g., certain topics, long queries).
62
+ - Prioritize areas with the greatest impact on user experience or business goals.
63
+ - Formulate hypotheses about root causes.
64
+
65
+ ### 5. Implement Improvements
66
+
67
+ **Purpose:**
68
+ Take targeted actions to address identified weaknesses.
69
+
70
+ **Activities:**
71
+ - Design and implement changes (e.g., prompt tuning, retrieval upgrades, model fine-tuning).
72
+ - Document all interventions and their intended effects.
73
+ - Ensure changes are isolated for clear attribution of impact.
74
+
75
+
76
+ ### 6. Record Metrics for History
77
+
78
+ **Purpose:**
79
+ Build a longitudinal record to track progress and avoid regressions.
80
+
81
+ **Activities:**
82
+ - After each improvement, re-evaluate on the same baseline dataset.
83
+ - Log metric scores, system version, date, and description of changes.
84
+ - Visualize trends over time to inform future decisions.
85
+
86
+ **Metric Record Log Schema Example:**
87
+
88
+ | Timestamp | System Version | Metric Name | Value | Dataset Name | Change Description |
89
+ |---------------------|---------------|-------------------|--------|--------------|---------------------------|
90
+ | 2025-05-04T12:00:00 | v1.2.0 | faithfulness | 0.78 | baseline_v1 | Added re-ranking to retriever |
91
+ | 2025-05-04T12:00:00 | v1.2.0 | answer_relevancy | 0.81 | baseline_v1 | Added re-ranking to retriever |
92
+ | ... | ... | ... | ... | ... | ... |
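+
+ A minimal sketch of appending rows in this schema to a CSV history file (the file name, field names, and example values are assumptions; adapt them to your own tracking store):
+
+ ```python
+ import csv
+ import os
+ from datetime import datetime, timezone
+
+ LOG_PATH = "metric_history.csv"  # assumed location of the longitudinal log
+ FIELDS = ["timestamp", "system_version", "metric_name", "value", "dataset_name", "change_description"]
+
+ def record_metric(system_version: str, metric_name: str, value: float,
+                   dataset_name: str, change_description: str) -> None:
+     """Append one metric measurement to the history log, writing a header for a new file."""
+     write_header = not os.path.exists(LOG_PATH) or os.path.getsize(LOG_PATH) == 0
+     with open(LOG_PATH, "a", newline="") as f:
+         writer = csv.DictWriter(f, fieldnames=FIELDS)
+         if write_header:
+             writer.writeheader()
+         writer.writerow({
+             "timestamp": datetime.now(timezone.utc).isoformat(),
+             "system_version": system_version,
+             "metric_name": metric_name,
+             "value": value,
+             "dataset_name": dataset_name,
+             "change_description": change_description,
+         })
+
+ record_metric("v1.2.0", "faithfulness", 0.78, "baseline_v1", "Added re-ranking to retriever")
+ ```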
93
+
94
+
95
+ ### 7. Repeat: Analyze, Evaluate, Implement, Record
96
+
97
+ **Purpose:**
98
+ Establish a sustainable, iterative cycle of improvement.
99
+
100
+ **Activities:**
101
+ - Regularly revisit analysis as new data or feedback emerges.
102
+ - Continuously refine thresholds and priorities.
103
+ - Maintain a culture of evidence-based iteration.
104
+
105
+
106
+ ## Integrating User Feedback in Production
107
+
108
+ ### Purpose
109
+
110
+ User feedback provides real-world validation and uncovers blind spots in automated metrics. Incorporating it closes the gap between technical evaluation and actual user satisfaction.
111
+
112
+ ### Strategies
113
+
114
+ - **In-Product Feedback Widgets:** Allow users to rate answers or flag issues directly in the interface.
115
+ - **Passive Signals:** Analyze user behavior (e.g., follow-up queries, abandonment) as implicit feedback.
116
+ - **Feedback Sampling:** Periodically sample user sessions for manual review.
117
+ - **Feedback Aggregation:** Aggregate and categorize feedback to identify recurring pain points.
118
+ - **Metric Correlation:** Analyze how user feedback correlates with automated metrics to calibrate thresholds (a small sketch follows this list).
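+
+ A small sketch of that correlation analysis, assuming you have joined user ratings with automated metric scores into a single table (the column names and values are illustrative):
+
+ ```python
+ import pandas as pd
+
+ # Hypothetical joined log of user ratings and automated metric scores
+ feedback = pd.DataFrame({
+     "user_rating":      [2,    4,   5,    1,   3],
+     "faithfulness":     [0.6,  0.8, 0.9,  0.5, 0.7],
+     "answer_relevancy": [0.5,  0.9, 0.95, 0.4, 0.7],
+ })
+
+ # Correlation of each automated metric with the user rating
+ correlations = feedback.corr(numeric_only=True)["user_rating"].drop("user_rating")
+ print(correlations.sort_values(ascending=False))
+ ```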
119
+
120
+ ### Recording User Feedback
121
+
122
+ **User Feedback Log Schema Example:**
123
+
124
+ | Timestamp | User ID | Query ID | User Rating | Feedback Text | Metric Scores | System Version |
125
+ |---------------------|---------|----------|-------------|----------------------|--------------|---------------|
126
+ | 2025-05-04T13:00:00 | 12345 | q_987 | 2 | "Answer was off-topic" | `{faithfulness: 0.6, answer_relevancy: 0.5}` | v1.2.0 |
127
+ | 2025-05-04T13:00:00 | 67890 | q_654 | 4 | "Good answer, but could be more concise" | `{faithfulness: 0.8, answer_relevancy: 0.9}` | v1.2.0 |
128
+ | ... | ... | ... | ... | ... | ... | ... |
129
+
130
+ ## Including Human Labelers in Evaluation
131
+
132
+ ### Purpose
133
+
134
+ Human labelers provide high-quality, nuanced judgments that automated metrics may miss, especially for ambiguous or complex queries.
135
+
136
+ ### Strategies
137
+
138
+ - **Periodic Human Review:** Regularly sample evaluation outputs for human annotation.
139
+ - **Disagreement Analysis:** Focus human review on cases where user feedback and metrics disagree.
140
+ - **Labeler Training:** Provide clear guidelines and calibration sessions to ensure consistency.
141
+ - **Hybrid Scoring:** Combine human and automated scores for a more holistic evaluation.
142
+ - **Continuous Calibration:** Use human labels to refine and validate automated metric thresholds.
143
+
144
+
145
+ ## Conclusion
146
+
147
+ A robust feedback loop is the foundation of sustainable improvement for LLM and RAG systems. By systematically selecting metrics, measuring baselines, setting thresholds, and integrating both user and human feedback, you create a virtuous cycle of evidence-driven progress. The most effective teams treat evaluation as an ongoing process—one that is deeply connected to real user outcomes and grounded in transparent, repeatable measurement.
148
+
149
+ ---
150
+ *This is the eighth part of a series on Ragas, a research-driven evaluation framework for LLM and RAG systems. If you missed the previous parts, check them out below:*
151
+
152
+ **[Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications](/blog/introduction-to-ragas/)**
153
+ **[Part 2: Basic Evaluation Workflow](/blog/basic-evaluation-workflow-with-ragas/)**
154
+ **[Part 3: Evaluating RAG Systems with Ragas](/blog/evaluating-rag-systems-with-ragas/)**
155
+ **[Part 4: Test Data Generation](/blog/generating-test-data-with-ragas/)**
156
+ **[Part 5: Advanced Metrics and Customization](/blog/advanced-metrics-and-customization-with-ragas/)**
157
+ **[Part 6: Evaluating AI Agents](/blog/evaluating-ai-agents-with-ragas/)**
158
+ **[Part 7: Integrations and Observability](/blog/integrations-and-observability-with-ragas/)**
159
+ **Part 8: Building Feedback Loops — _You are here_**
160
+
161
+ *Have questions or want to share your feedback loop strategies? [Connect with me on LinkedIn](https://www.linkedin.com/in/muhammadafzaal/) for discussion or collaboration!*
data/building-research-agent/index.md ADDED
@@ -0,0 +1,119 @@
1
+ ---
2
+ layout: blog
3
+ title: Building a Research Agent with RSS Feed Support
4
+ date: 2025-04-20T00:00:00-06:00
5
+ description: How I created a comprehensive research assistant that combines web search, academic papers, RSS feeds, and document analysis to revolutionize information discovery.
6
+ categories: ["AI", "LLM", "Research", "Technology", "Agents"]
7
+ coverImage: "https://images.unsplash.com/photo-1507842217343-583bb7270b66?q=80&w=2290&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
8
+ readingTime: 5
9
+ published: true
10
+ ---
11
+
12
+ In the age of information overload, finding the right data efficiently has become increasingly challenging. Whether you're conducting academic research, staying updated on industry trends, or investigating specific topics, the process often involves juggling multiple tools and platforms. This fragmentation inspired me to create a comprehensive solution: a research agent with RSS feed support that brings together multiple information sources in one unified interface.
13
+
14
+ ## Why Build a Research Agent?
15
+
16
+ As someone who regularly conducts research across different domains, I've experienced the frustration of switching between search engines, academic databases, news aggregators, and document analysis tools. Each context switch breaks concentration and slows down the discovery process. I wanted a tool that could:
17
+
18
+ - Search across multiple information sources simultaneously
19
+ - Analyze uploaded documents in the context of web information
20
+ - Provide transparent reasoning about its research process
21
+ - Deliver structured, well-cited reports
22
+
23
+ The result is the [Research Agent](https://huggingface.co/spaces/mafzaal/AIE6-ResearchAgent) - an LLM-powered assistant that brings together web search, academic papers, RSS feeds, and document analysis into a single, coherent workflow.
24
+
25
+ ## Multi-Source Research Architecture
26
+
27
+ The agent's strength comes from its ability to tap into various information streams:
28
+
29
+ ### Web Search Integration
30
+
31
+ For real-time information and general knowledge, the agent leverages both Tavily and DuckDuckGo APIs to perform semantic searches across the web. This provides access to current events, recent developments, and general information that might not be available in academic sources.
32
+
33
+ ### Academic Research Pipeline
34
+
35
+ Research often requires scholarly sources. The agent connects to arXiv's extensive database of scientific papers, allowing it to retrieve relevant academic articles complete with titles, authors, and abstracts. This is particularly valuable for technical topics that require peer-reviewed information.
36
+
37
+ ### RSS Feed Aggregation
38
+
39
+ For targeted news monitoring and industry updates, the RSS feed reader component allows the agent to retrieve content from specific publications and blogs. This is ideal for tracking industry trends or following particular news sources relevant to your research topic.
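+
+ As a rough illustration of this component, a few lines of `feedparser` are enough to pull recent entries from a feed; the feed URL below is just a placeholder for whichever publications you want to monitor:
+
+ ```python
+ import feedparser
+
+ # Example feed URL; swap in the sources relevant to your research topic
+ feed = feedparser.parse("https://hnrss.org/frontpage")
+
+ # Keep a lightweight summary of the latest entries for the agent to reason over
+ for entry in feed.entries[:5]:
+     print(entry.title)
+     print(entry.link)
+     print(entry.get("summary", "")[:200])
+     print("---")
+ ```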
40
+
41
+ ### Document Analysis Engine
42
+
43
+ Perhaps the most powerful feature is the document analysis capability, which uses Retrieval Augmented Generation (RAG) to process uploaded PDFs or text files. By breaking documents into semantic chunks and creating vector embeddings, the agent can answer questions specifically about your documents while incorporating relevant information from other sources.
44
+
45
+ ## Behind the Scenes: LangGraph Workflow
46
+
47
+ What makes this agent particularly powerful is its LangGraph-based architecture, which provides a structured framework for reasoning and tool orchestration:
48
+
49
+ ![Research Agent Graph](/images/building-research-agent-01.png)
50
+
51
+ This workflow provides several key advantages:
52
+
53
+ 1. **Contextual Awareness**: The agent maintains context throughout the research process
54
+ 2. **Dynamic Tool Selection**: It intelligently chooses which information sources to query based on your question
55
+ 3. **Transparent Reasoning**: You can see each step of the research process
56
+ 4. **Consistent Output Structure**: Results are formatted into comprehensive reports with proper citations
57
+
58
+ ## The Technology Stack
59
+
60
+ Building the Research Agent required integrating several cutting-edge technologies:
61
+
62
+ - **LangChain**: Provides the foundation for LLM application development
63
+ - **LangGraph**: Enables sophisticated workflow orchestration
64
+ - **Chainlit**: Powers the interactive chat interface
65
+ - **Qdrant**: Serves as the vector database for document embeddings
66
+ - **OpenAI**: Supplies the GPT-4o language model and embeddings
67
+ - **Tavily/DuckDuckGo**: Delivers web search capabilities
68
+ - **arXiv API**: Connects to academic paper repositories
69
+ - **Feedparser**: Handles RSS feed processing
70
+
71
+ ## The Research Process in Action
72
+
73
+ When you ask the Research Agent a question, it follows a systematic process:
74
+
75
+ 1. **Query Analysis**: It first analyzes your question to determine which information sources would be most relevant
76
+ 2. **Multi-Tool Research**: Depending on the query, it executes searches across selected tools
77
+ 3. **Context Retrieval**: If you've uploaded documents, it retrieves relevant passages from them
78
+ 4. **Research Transparency**: It shows each step of its research process for full transparency
79
+ 5. **Information Synthesis**: It analyzes and combines information from all sources
80
+ 6. **Structured Reporting**: It delivers a comprehensive response with proper citations
81
+
82
+ ## Real-World Applications
83
+
84
+ The Research Agent has proven valuable across various use cases:
85
+
86
+ - **Academic Research**: Gathering information across multiple scholarly sources
87
+ - **Competitive Analysis**: Staying updated on industry competitors
88
+ - **Technical Deep Dives**: Understanding complex technical topics
89
+ - **News Monitoring**: Tracking specific events across multiple sources
90
+ - **Document Q&A**: Asking questions about specific documents in broader context
91
+
92
+ ## Lessons Learned and Future Directions
93
+
94
+ Building this agent taught me several valuable lessons about LLM application development:
95
+
96
+ 1. **Tool Integration Complexity**: Combining multiple data sources requires careful consideration of data formats and query patterns
97
+ 2. **Context Management**: Maintaining context across different research steps is critical for coherent outputs
98
+ 3. **Transparency Matters**: Users trust AI more when they can see how it reached its conclusions
99
+ 4. **LangGraph Power**: The graph-based approach to LLM workflows provides significant advantages over simpler chains
100
+
101
+ Looking ahead, I'm exploring several enhancements:
102
+
103
+ - Expanded academic database integration beyond arXiv
104
+ - More sophisticated document analysis with multi-document reasoning
105
+ - Improved citation formats and bibliographic support
106
+ - Enhanced visualization of research findings
107
+
108
+ ## Try It Yourself
109
+
110
+ The Research Agent is available as an open-source project, and you can try it directly on Hugging Face Spaces:
111
+
112
+ - **Live Demo**: [Hugging Face Space](https://huggingface.co/spaces/mafzaal/AIE6-ResearchAgent)
113
+ - **Source Code**: [GitHub Repository](https://github.com/mafzaal/AIE6-ResearchAgent)
114
+
115
+ If you're interested in deploying your own instance, the GitHub repository includes detailed setup instructions for both local development and Docker deployment.
116
+
117
+ ---
118
+
119
+ *Have you used the Research Agent or built similar tools? I'd love to hear about your experiences and any suggestions for improvements. Feel free to reach out through the contact form or connect with me on social media!*
data/coming-back-to-ai-roots/index.md ADDED
@@ -0,0 +1,64 @@
1
+ ---
2
+ layout: blog
3
+ title: Coming Back to AI Roots - My Professional Journey
4
+ date: 2025-04-14T00:00:00-06:00
5
+ description: A personal reflection on my career journey from AI to web and enterprise software development, and why I'm returning to my original passion for artificial intelligence.
6
+ categories: ["AI", "Personal Journey", "Technology"]
7
+ coverVideo: "/videos/back_to_future.mp4"
8
+ readingTime: 4
9
+ published: true
10
+ ---
11
+
12
+
13
+ Have you ever felt that life has a way of bringing you full circle? That's exactly how I feel about my career trajectory. My name is Muhammad Afzaal, and I'd like to share the story of my professional journey - from my early fascination with artificial intelligence, through years of web and enterprise software development, and now back to where it all began.
14
+
15
+ ## The Early AI Days
16
+
17
+ My professional journey began with a deep fascination for artificial intelligence. As a student, I was captivated by the potential of machines that could learn and make decisions. This was well before the current AI boom - back when neural networks were still considered somewhat niche and the term "deep learning" wasn't yet a household phrase.
18
+
19
+ I spent countless hours immersed in neural networks, image processing, and computer vision. My early career was defined by research projects and small-scale AI implementations - including Urdu OCR systems and data extraction from paper-based forms in 2003-2004. I still have vivid memories of recruiting fellow students to handwrite text samples, then meticulously scanning, labeling, and training neural networks with this data. While modest by today's standards, these projects represented glimpses into a future where machines could meaningfully augment human capabilities in ways that seemed almost magical at the time.
20
+
21
+ ## The Pivot to Web and Enterprise Development
22
+
23
+ As often happens in technology careers, opportunities led me in a different direction. The explosive growth of web technologies and enterprise systems created a high demand for developers with these skills, and I found myself gradually pivoting away from AI.
24
+
25
+ For several years, I immersed myself in the world of web and enterprise software development. I worked with various frameworks and technologies, built scalable systems, and helped businesses solve complex problems through software. This journey taught me invaluable lessons about software architecture, user experience, and delivering production-quality code that serves real business needs.
26
+
27
+ Working in enterprise software development exposed me to the challenges of building systems that not only function correctly but can also scale, evolve, and adapt to changing requirements. I learned the importance of clean code, thoughtful architecture, and considering the entire lifecycle of software products.
28
+
29
+ ## Why I'm Returning to AI
30
+
31
+ While my time in web and enterprise development was rewarding, I've always felt a pull back toward artificial intelligence. The recent AI renaissance - with breakthroughs in large language models, generative AI, and machine learning at scale - has reignited my original passion.
32
+
33
+ We're living in what may be the most exciting time in AI history. Models like GPT-4, Claude, and open-source alternatives are demonstrating capabilities that seemed like science fiction just a few years ago. The tools and frameworks available today make AI more accessible than ever before, and the potential applications span virtually every domain of human endeavor.
34
+
35
+ What excites me most is that my experience in enterprise software development gives me a unique perspective on AI implementation. I understand not just the algorithms and models, but also how to integrate them into robust, production-ready systems that deliver real value.
36
+
37
+ ## The Best of Both Worlds
38
+
39
+ Coming back to AI doesn't mean leaving behind everything I learned in web and enterprise development. Quite the opposite - I believe my background gives me a particular advantage in building AI systems that are:
40
+
41
+ - **Production-ready**: Understanding software engineering best practices helps create AI systems that can operate reliably at scale.
42
+ - **User-focused**: Experience with UX principles ensures AI solutions are designed with actual human users in mind.
43
+ - **Integrated**: Knowledge of enterprise systems makes it easier to connect AI capabilities with existing business processes.
44
+ - **Simplified**: My experience in streamlining complex business processes helps me identify where AI can have the greatest impact through intelligent automation.
45
+ - **Business-oriented**: I understand that AI isn't just about the technology—it's about solving real business problems and creating measurable value.
46
+ - **Practical**: I focus on practical applications that deliver immediate benefits rather than getting caught up in theoretical possibilities.
47
+
48
+ ## What's Next
49
+
50
+ As I return to my AI roots, I'm excited to share this journey with you through this blog. In the coming months, I plan to write about:
51
+
52
+ - Practical applications of modern AI technologies
53
+ - How to bridge the gap between AI research and production systems
54
+ - The intersection of web technologies and AI
55
+ - Ethical considerations in AI implementation
56
+ - Tutorials and guides for developers looking to incorporate AI into their projects
57
+
58
+ If you're interested in AI, software development, or the intersection of these fields, I hope you'll join me on this journey. Whether you're a seasoned AI practitioner, a web developer curious about machine learning, or simply interested in how technology is evolving, I believe there's something here for you.
59
+
60
+ Here's to coming full circle, building on past experiences, and embracing the exciting future of AI!
61
+
62
+ ---
63
+
64
+ *Have questions or topics you'd like me to cover? Feel free to [reach out](https://www.linkedin.com/in/muhammadafzaal/) — we’d love to help!*
data/data-is-king/index.md ADDED
@@ -0,0 +1,78 @@
1
+ ---
2
+ layout: blog
3
+ title: "Data is King: Why Your Data Strategy IS Your Business Strategy"
4
+ date: 2025-04-15T00:00:00-06:00
5
+ categories: ["AI", "Strategy","Data"]
6
+ description: "Discover why controlling unique, high-quality data is your organization's most valuable competitive advantage in the AI era, and how a strategic approach to data ownership is becoming essential to business success."
7
+ coverImage: "https://images.unsplash.com/photo-1705484229341-4f7f7519b718?q=80&w=1740&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
8
+ readingTime: 3
9
+ published: true
10
+ ---
11
+
12
+ In the rapidly evolving world of artificial intelligence and machine learning, there's a phrase that has become something of a mantra among practitioners: "Data is king." This concept, often attributed to Peter Norvig, the Research Director at Google, challenges the conventional wisdom that sophisticated algorithms are the primary drivers of AI advancement.
13
+
14
+ ## The Origin of "Data is King"
15
+
16
+ Peter Norvig famously stated, "We don't have better algorithms. We just have more data." This statement emerged during a time when Google's approach to machine translation was yielding surprisingly effective results not through algorithmic innovations, but through the sheer volume of multilingual data they had amassed.
17
+
18
+ This perspective represented a paradigm shift. Prior to this, the field had largely focused on crafting ever more sophisticated algorithms, with the assumption that smarter code would yield better results. Norvig's insight suggested something different: even relatively simple algorithms could outperform more sophisticated ones when trained on sufficiently large datasets.
19
+
20
+ ## The Business Imperative of Data Ownership
21
+
22
+ In today's AI-driven economy, Norvig's insight has profound implications for businesses. Companies that control unique, high-quality datasets possess an increasingly valuable competitive advantage that can't be easily replicated—even by competitors with superior engineering talent.
23
+
24
+ ### Why Data Ownership Matters
25
+
26
+ 1. **Sustainable Competitive Advantage**: While algorithms can be replicated or even improved upon by competitors, proprietary data is uniquely yours. A company with exclusive access to valuable data can maintain market leadership even when algorithmic approaches become standardized.
27
+
28
+ 2. **Diminishing Returns on Algorithmic Innovation**: As machine learning techniques mature, algorithmic improvements often yield smaller incremental gains compared to expanding or improving your data assets.
29
+
30
+ 3. **Model Defensibility**: Proprietary data creates a moat around your AI systems that competitors cannot easily cross, regardless of their technical capabilities.
31
+
32
+ 4. **Value Appreciation**: Unlike physical assets that depreciate, data assets often appreciate in value over time as more patterns and insights can be extracted with evolving technology.
33
+
34
+ ### The Risks of Data Dependency
35
+
36
+ Organizations that rely on third-party data sources or lack clear data ownership strategies face significant risks:
37
+
38
+ - **Vulnerability to supply disruptions** when data providers change terms or access
39
+ - **Limited ability to differentiate** their AI applications from competitors
40
+ - **Reduced capacity for innovation** as they lack the raw material for new insights
41
+ - **Potential lock-in** to specific vendors or platforms that control their data access
42
+
43
+ For forward-thinking enterprises, data strategy should be elevated to the same level of importance as product, technology, and market strategies. This means investing in data acquisition, management, and governance with the same rigor applied to other mission-critical functions.
44
+
45
+ ## How "TheDataGuy" Can Transform Your Data Strategy
46
+
47
+ As "TheDataGuy," I help businesses transform their approach to data assets through a comprehensive framework that turns raw information into strategic advantage:
48
+
49
+ ### My Data Value Chain Approach
50
+
51
+ 1. **Data Collection & Acquisition**: Designing efficient pipelines to gather relevant, high-quality data while ensuring compliance with regulatory requirements.
52
+
53
+ 2. **Storage Architecture**: Implementing scalable, secure storage solutions that balance accessibility with cost-effectiveness.
54
+
55
+ 3. **Organization & Governance**: Establishing metadata frameworks, quality control processes, and governance structures that make data discoverable and trustworthy.
56
+
57
+ 4. **Insight Extraction**: Applying analytics techniques from basic reporting to advanced machine learning that convert data into actionable business intelligence.
58
+
59
+ 5. **LLM Specialization**: Creating specialized AI capabilities tailored to your business context:
60
+
61
+ a. **Retrieval-Augmented Generation (RAG)**: Implementing systems that combine your proprietary data with foundation models, enabling AI to access your business knowledge while reducing hallucinations and improving accuracy.
62
+
63
+ b. **Domain-Specific Fine-Tuning**: Adapting pre-trained models to your industry's terminology, workflows, and requirements through targeted training on curated datasets.
64
+
65
+ c. **Hybrid Approaches**: Developing systems that intelligently combine RAG and fine-tuning to maximize performance while minimizing computational costs and training time.
66
+
67
+ d. **Knowledge Distillation**: Creating smaller, more efficient specialized models that capture the essential capabilities needed for your specific business applications.
68
+
69
+ By working across this entire spectrum, organizations can develop truly proprietary AI capabilities that competitors cannot easily replicate, regardless of their technical talent or computational resources.
70
+
71
+ Remember: in the age of AI, your data strategy isn't just supporting your business strategy—increasingly, it *is* your business strategy.
72
+ ## Ready to Make Data Your Competitive Advantage?
73
+
74
+ Don't let valuable data opportunities slip away. Whether you're just beginning your data journey or looking to enhance your existing strategy, I can help transform your approach to this critical business asset.
75
+
76
+ ### Let's Connect
77
+ Connect with me on [LinkedIn](https://www.linkedin.com/in/muhammadafzaal/) to discuss how I can help your organization harness the power of data.
78
+
data/evaluating-ai-agents-with-ragas/index.md ADDED
@@ -0,0 +1,170 @@
1
+ ---
2
+ title: "Part 6: Evaluating AI Agents: Beyond Simple Answers with Ragas"
3
+ date: 2025-04-28T06:00:00-06:00
4
+ layout: blog
5
+ description: "Learn how to evaluate complex AI agents using Ragas' specialized metrics for goal accuracy, tool call accuracy, and topic adherence to build more reliable and effective agent-based applications."
6
+ categories: ["AI", "Agents", "Evaluation", "Ragas", "LLM"]
7
+ coverImage: "/images/ai_agent_evaluation.png"
8
+ readingTime: 8
9
+ published: true
10
+ ---
11
+
12
+ In our previous posts, we've explored how Ragas evaluates RAG systems and enables custom metrics for specialized applications. As LLMs evolve beyond simple question-answering to become powerful AI agents, evaluation needs have grown more sophisticated too. In this post, we'll explore Ragas' specialized metrics for evaluating AI agents that engage in multi-turn interactions, use tools, and work toward specific goals.
13
+
14
+ ## The Challenge of Evaluating AI Agents
15
+
16
+ Unlike traditional RAG systems, AI agents present unique evaluation challenges:
17
+
18
+ - **Multi-turn interactions**: Agents maintain context across multiple exchanges
19
+ - **Tool usage**: Agents call external tools and APIs to accomplish tasks
20
+ - **Goal-oriented behavior**: Success means achieving the user's ultimate objective
21
+ - **Boundaries and constraints**: Agents must operate within defined topic boundaries
22
+
23
+ Standard metrics like faithfulness or answer relevancy don't fully capture these dimensions. Let's explore three specialized metrics Ragas provides for agent evaluation.
24
+
25
+ ## Core Agent Evaluation Metrics in Ragas
26
+
27
+ ### 1. Goal Accuracy (`agent_goal_accuracy`)
28
+
29
+ **What it measures:** Did the agent successfully achieve the user's ultimate objective over the course of the interaction?
30
+
31
+ **How it works:**
32
+ This metric analyzes the entire agent workflow (user inputs, AI responses, tool calls).
33
+ * It uses an LLM (`InferGoalOutcomePrompt`) to identify the `user_goal` and the `end_state` (what actually happened).
34
+ * It then compares the `end_state` to either:
35
+ * A provided `reference` outcome (**`AgentGoalAccuracyWithReference`**).
36
+ * The inferred `user_goal` (**`AgentGoalAccuracyWithoutReference`**).
37
+ * An LLM (`CompareOutcomePrompt`) determines if the achieved outcome matches the desired one, resulting in a binary score (1 for success, 0 for failure).
38
+
39
+ **Why it's important:** For task-oriented agents (like booking systems or assistants), success isn't about individual responses but about completing the overall task correctly. This metric directly measures that end-to-end success.
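+ When you do have an explicit expected outcome, the reference-based variant scores the conversation against it directly. Here is a minimal sketch; `conversation_messages` is a hypothetical placeholder for the same multi-turn history used in the full example below, and `evaluator_llm` is the wrapped evaluator LLM defined there:
+
+ ```python
+ from ragas.metrics import AgentGoalAccuracyWithReference
+ from ragas.dataset_schema import MultiTurnSample
+
+ # `conversation_messages` is a placeholder: reuse the message history from the
+ # full example later in this post.
+ sample = MultiTurnSample(
+     user_input=conversation_messages,
+     reference="A Delta flight from New York to London is booked and confirmed for next Friday.",
+ )
+
+ goal_accuracy_ref = AgentGoalAccuracyWithReference(llm=evaluator_llm)
+ score = await goal_accuracy_ref.multi_turn_ascore(sample)
+ ```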
40
+
41
+ ### 2. Tool Call Accuracy (`tool_call_accuracy`)
42
+
43
+ **What it measures:** Did the agent use the correct tools, in the right order, and with the right arguments?
44
+
45
+ **How it works:**
46
+ This metric compares the sequence and details of tool calls made by the agent against a `reference_tool_calls` list.
47
+ * It checks if the *sequence* of tool names called by the agent aligns with the reference sequence (`is_sequence_aligned`).
48
+ * For each matching tool call, it compares the arguments provided by the agent to the reference arguments, often using a sub-metric like `ExactMatch` (`_get_arg_score`).
49
+ * The final score reflects both the sequence alignment and the argument correctness.
50
+
51
+ **Why it's important:** Many agents rely on external tools (APIs, databases, etc.). Incorrect tool usage (wrong tool, bad parameters) leads to task failure. This metric pinpoints issues in the agent's interaction with its tools.
52
+
53
+ ### 3. Topic Adherence (`topic_adherence`)
54
+
55
+ **What it measures:** Did the agent stick to the allowed topics and appropriately handle requests about restricted topics?
56
+
57
+ **How it works:**
58
+ This metric evaluates conversations against a list of `reference_topics`.
59
+ * It extracts the topics discussed in the user's input (`TopicExtractionPrompt`).
60
+ * It checks if the agent refused to answer questions related to specific topics (`TopicRefusedPrompt`).
61
+ * It classifies whether the discussed topics fall within the allowed `reference_topics` (`TopicClassificationPrompt`).
62
+ * Based on these classifications and refusals, it calculates a score (Precision, Recall, or F1) indicating how well the agent adhered to the topic constraints.
63
+
64
+ **Why it's important:** Ensures agents stay focused, avoid generating content on forbidden subjects (safety, policy), and handle out-of-scope requests gracefully.
65
+
66
+ ## Implementing Agent Evaluation in Practice
67
+
68
+ Let's look at a practical example of evaluating an AI agent using these metrics:
69
+
70
+ ```python
71
+ from ragas.metrics import AgentGoalAccuracyWithoutReference, ToolCallAccuracy, TopicAdherenceScore
72
+ from ragas.evaluation import EvaluationDataset
73
+ from ragas.dataset_schema import MultiTurnSample
74
+ from langchain_openai import ChatOpenAI
75
+ from ragas.llms import LangchainLLMWrapper
76
+
77
+ # Initialize the LLM
78
+ evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
79
+
80
+ # Example conversation with a travel booking agent
81
+ test_data = {
82
+ "user_input": [
83
+ {"role": "user", "content": "I need to book a flight from New York to London next Friday"},
84
+ {"role": "assistant", "content": "I'd be happy to help you book a flight. Let me search for options...",
85
+ "tool_calls": [{"name": "search_flights", "arguments": {"origin": "NYC", "destination": "LON", "date": "next Friday"}}]},
86
+ {"role": "tool", "name": "search_flights", "content": "Found 5 flights: Flight 1 (Delta, $750), Flight 2 (British Airways, $820)..."},
87
+ {"role": "assistant", "content": "I found several flights from New York to London next Friday. The cheapest option is Delta for $750. Would you like to book this one?"},
88
+ {"role": "user", "content": "Yes, please book the Delta flight"},
89
+ {"role": "assistant", "content": "I'll book that for you now.",
90
+ "tool_calls": [{"name": "book_flight", "arguments": {"flight_id": "delta_123", "price": "$750"}}]},
91
+ {"role": "tool", "name": "book_flight", "content": "Booking confirmed. Confirmation #: ABC123"},
92
+ {"role": "assistant", "content": "Great news! Your flight is confirmed. Your confirmation number is ABC123. The flight is scheduled for next Friday. Is there anything else you need help with?"}
93
+ ],
94
+ "reference_topics": ["travel", "flight booking", "schedules", "prices"],
95
+ "reference_tool_calls": [
96
+ {"name": "search_flights", "args": {"origin": "NYC", "destination": "LON", "date": "next Friday"}},
97
+ {"name": "book_flight", "args": {"flight_id": "delta_123", "price": "$750"}}
98
+ ]
99
+ }
100
+
101
+ # Create a sample
102
+ sample = MultiTurnSample(**test_data)
103
+
104
+ # Initialize metrics
105
+ goal_accuracy = AgentGoalAccuracyWithoutReference(llm=evaluator_llm)
106
+ tool_accuracy = ToolCallAccuracy()
107
+ topic_adherence = TopicAdherenceScore(llm=evaluator_llm)
108
+
109
+ # Calculate scores
110
+ goal_score = await goal_accuracy.multi_turn_ascore(sample)
111
+ tool_score = tool_accuracy.multi_turn_score(sample)
112
+ topic_score = await topic_adherence.multi_turn_ascore(sample)
113
+
114
+ print(f"Goal Accuracy: {goal_score}")
115
+ print(f"Tool Call Accuracy: {tool_score}")
116
+ print(f"Topic Adherence: {topic_score}")
117
+ ```
118
+
119
+ > 💡 **Try it yourself:**
120
+ > Explore the hands-on notebook for agent evaluation:
121
+ > [06_Evaluating_AI_Agents](https://github.com/mafzaal/intro-to-ragas/blob/master/06_Evaluating_AI_Agents.ipynb)
122
+
123
+ ## Advanced Agent Evaluation Techniques
124
+
125
+ ### Combining Metrics for Comprehensive Evaluation
126
+
127
+ For a complete assessment of agent capabilities, combine multiple metrics:
128
+
129
+ ```python
130
+ from ragas import evaluate
131
+
132
+ results = evaluate(
133
+ dataset, # Your dataset of agent conversations
134
+ metrics=[
135
+ AgentGoalAccuracyWithoutReference(llm=evaluator_llm),
136
+ ToolCallAccuracy(),
137
+ TopicAdherenceScore(llm=evaluator_llm)
138
+ ]
139
+ )
140
+ ```
141
+
142
+ ## Best Practices for Agent Evaluation
143
+
144
+ 1. **Test scenario coverage:** Include a diverse range of interaction scenarios
145
+ 2. **Edge case handling:** Test how agents handle unexpected inputs or failures
146
+ 3. **Longitudinal evaluation:** Track performance over time to identify regressions
147
+ 4. **Human-in-the-loop validation:** Periodically verify metric alignment with human judgments
148
+ 5. **Continuous feedback loops:** Use evaluation insights to guide agent improvements
149
+
150
+ ## Conclusion
151
+
152
+ Evaluating AI agents requires specialized metrics that go beyond traditional RAG evaluation. Ragas' `agent_goal_accuracy`, `tool_call_accuracy`, and `topic_adherence` provide crucial insights into whether an agent can successfully complete tasks, use tools correctly, and stay within designated boundaries.
153
+
154
+ By incorporating these metrics into your evaluation pipeline, you can build more reliable and effective AI agents that truly deliver on the promise of helpful, goal-oriented AI assistants.
155
+
156
+ In our next post, we'll explore how to integrate Ragas with popular frameworks and observability tools for seamless evaluation workflows.
157
+
158
+ ---
159
+
160
+ **[Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications](/blog/introduction-to-ragas/)**
161
+ **[Part 2: Basic Evaluation Workflow](/blog/basic-evaluation-workflow-with-ragas/)**
162
+ **[Part 3: Evaluating RAG Systems with Ragas](/blog/evaluating-rag-systems-with-ragas/)**
163
+ **[Part 4: Test Data Generation](/blog/generating-test-data-with-ragas/)**
164
+ **[Part 5: Advanced Metrics and Customization](/blog/advanced-metrics-and-customization-with-ragas/)**
165
+ **Part 6: Evaluating AI Agents — _You are here_**
166
+ *Next up in the series:*
167
+ **[Part 7: Integrations and Observability](/blog/integrations-and-observability-with-ragas/)**
168
+ **[Part 8: Building Feedback Loops](/blog/building-feedback-loops-with-ragas/)**
169
+
170
+ *How are you evaluating your AI agents? What challenges have you encountered in measuring agent performance? If you're facing specific evaluation hurdles, don't hesitate to [reach out](https://www.linkedin.com/in/muhammadafzaal/)—we'd love to help!*
data/evaluating-rag-systems-with-ragas/index.md ADDED
@@ -0,0 +1,197 @@
1
+ ---
2
+ title: "Part 3: Evaluating RAG Systems with Ragas"
3
+ date: 2025-04-26T20:00:00-06:00
4
+ layout: blog
5
+ description: "Learn specialized techniques for comprehensive evaluation of Retrieval-Augmented Generation systems using Ragas, including metrics for retrieval quality, generation quality, and end-to-end performance."
6
+ categories: ["AI", "RAG", "Evaluation", "Ragas"]
7
+ coverImage: "https://images.unsplash.com/photo-1743796055664-3473eedab36e?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
8
+ readingTime: 14
9
+ published: true
10
+ ---
11
+
12
+ In our previous post, we covered the fundamentals of setting up evaluation workflows with Ragas. Now, let's focus specifically on evaluating Retrieval-Augmented Generation (RAG) systems, which present unique evaluation challenges due to their multi-component nature.
13
+
14
+ ## Understanding RAG Systems: More Than the Sum of Their Parts
15
+
16
+ RAG systems combine two critical capabilities:
17
+ 1. **Retrieval**: Finding relevant information from a knowledge base
18
+ 2. **Generation**: Creating coherent, accurate responses based on retrieved information
19
+
20
+ This dual nature means evaluation must address both components while also assessing their interaction. A system might retrieve perfect information but generate poor responses, or generate excellent prose from irrelevant retrieved content.
21
+
22
+ ## The RAG Evaluation Triad
23
+
24
+ Effective RAG evaluation requires examining three key dimensions:
25
+
26
+ 1. **Retrieval Quality**: How well does the system find relevant information?
27
+ 2. **Generation Quality**: How well does the system produce responses from retrieved information?
28
+ 3. **End-to-End Performance**: How well does the complete system satisfy user needs?
29
+
30
+ Let's explore how Ragas helps evaluate each dimension of RAG systems.
31
+
32
+ ## Core RAG Metrics in Ragas
33
+
34
+ Ragas provides specialized metrics to assess RAG systems across retrieval, generation, and end-to-end performance.
35
+
36
+ ### Retrieval Quality Metrics
37
+
38
+ #### 1. Context Relevancy
39
+
40
+ Measures how relevant the retrieved documents are to the user's question.
41
+
42
+ - **How it works:**
43
+ - Takes the user's question (`user_input`) and the retrieved documents (`retrieved_contexts`).
44
+ - Uses an LLM to score relevance with two different prompts, averaging the results for robustness.
45
+ - Scores are normalized between 0.0 (irrelevant) and 1.0 (fully relevant).
46
+
47
+ - **Why it matters:**
48
+ Low scores indicate your retriever is pulling in unrelated or noisy documents. Monitoring this helps you improve the retrieval step.
49
+
50
+ #### 2. Context Precision
51
+
52
+ Assesses how much of the retrieved context is actually useful for generating the answer.
53
+
54
+ - **How it works:**
55
+ - For each retrieved chunk, an LLM judges if it was necessary for the answer, using the ground truth (`reference`) or the generated response.
56
+ - Calculates [Average Precision](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Average_precision), rewarding systems that rank useful chunks higher.
57
+
58
+ - **Variants:**
59
+ - `ContextUtilization`: Uses the generated response instead of ground truth.
60
+ - Non-LLM version: Compares retrieved chunks to ideal reference contexts using string similarity.
61
+
62
+ - **Why it matters:**
63
+ High precision means your retriever is efficient; low precision means too much irrelevant information is included.
64
+
65
+ #### 3. Context Recall
66
+
67
+ Evaluates whether all necessary information from the ground truth answer is present in the retrieved context.
68
+
69
+ - **How it works:**
70
+ - Breaks down the reference answer into sentences.
71
+ - For each sentence, an LLM checks if it can be supported by the retrieved context.
72
+ - The score is the proportion of reference sentences attributed to the retrieved context.
73
+
74
+ - **Variants:**
75
+ - Non-LLM version: Compares reference and retrieved contexts using similarity and thresholds.
76
+
77
+ - **Why it matters:**
78
+ High recall means your retriever finds all needed information; low recall means critical information is missing.
79
+
80
+ **Summary:**
81
+ - **Low context relevancy:** Retriever needs better query understanding or semantic matching.
82
+ - **Low context precision:** Retriever includes unnecessary information.
83
+ - **Low context recall:** Retriever misses critical information.
84
+
85
+ ### Generation Quality Metrics
86
+
87
+ #### 1. Faithfulness
88
+
89
+ Checks if the generated answer is factually consistent with the retrieved context, addressing hallucination.
90
+
91
+ - **How it works:**
92
+ - Breaks the answer into simple statements.
93
+ - For each, an LLM checks if it can be inferred from the retrieved context.
94
+ - The score is the proportion of faithful statements.
95
+
96
+ - **Alternative:**
97
+ - `FaithfulnesswithHHEM`: Uses a specialized NLI model for verification.
98
+
99
+ - **Why it matters:**
100
+ High faithfulness means answers are grounded in context; low faithfulness signals hallucination.
101
+
102
+ #### 2. Answer Relevancy
103
+
104
+ Measures if the generated answer directly addresses the user's question.
105
+
106
+ - **How it works:**
107
+ - Asks an LLM to generate possible questions for the answer.
108
+ - Compares these to the original question using embedding similarity.
109
+ - Penalizes noncommittal answers.
110
+
111
+ - **Why it matters:**
112
+ High relevancy means answers are on-topic; low relevancy means answers are off-topic or incomplete.
113
+
114
+ **Summary:**
115
+ - **Low faithfulness:** Generator adds facts not supported by context.
116
+ - **Low answer relevancy:** Generator doesn't focus on the specific question.
117
+
118
+ ### End-to-End Metrics
119
+
120
+ #### 1. Factual Correctness
121
+
122
+ Assesses factual alignment between the generated answer and a ground truth reference.
123
+
124
+ - **How it works:**
125
+ - Breaks both the answer and reference into claims.
126
+ - Uses NLI to verify claims in both directions.
127
+ - Calculates precision, recall, or F1-score.
128
+
129
+ - **Why it matters:**
130
+ High correctness means answers match the ground truth; low correctness signals factual errors.
131
+
132
+ **Key distinction:**
133
+ - `Faithfulness`: Compares answer to retrieved context.
134
+ - `FactualCorrectness`: Compares answer to ground truth.
135
+
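+ To make these metrics concrete, here is a minimal sketch that scores a single sample with a few of them. The class names follow current Ragas releases and may vary slightly between versions, and the sample values are toy placeholders:
+
+ ```python
+ from ragas.dataset_schema import SingleTurnSample
+ from ragas.metrics import Faithfulness, LLMContextRecall, FactualCorrectness
+ from ragas.llms import LangchainLLMWrapper
+ from langchain_openai import ChatOpenAI
+
+ evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
+
+ sample = SingleTurnSample(
+     user_input="When was the Eiffel Tower completed?",
+     retrieved_contexts=["The Eiffel Tower was completed in 1889 for the Exposition Universelle in Paris."],
+     response="The Eiffel Tower was completed in 1889.",
+     reference="The Eiffel Tower was completed in 1889.",
+ )
+
+ faithfulness = Faithfulness(llm=evaluator_llm)                # response vs. retrieved context
+ context_recall = LLMContextRecall(llm=evaluator_llm)          # retrieved context vs. reference
+ factual_correctness = FactualCorrectness(llm=evaluator_llm)   # response vs. reference
+
+ print(await faithfulness.single_turn_ascore(sample))
+ print(await context_recall.single_turn_ascore(sample))
+ print(await factual_correctness.single_turn_ascore(sample))
+ ```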
136
+ ---
137
+
138
+ ## Common RAG Evaluation Patterns
139
+
140
+ ### 1. High Retrieval, Low Generation Scores
141
+
142
+ - **Diagnosis:** Good retrieval, poor use of information.
143
+ - **Fixes:** Improve prompts, use better generation models, or verify responses post-generation.
144
+
145
+ ### 2. Low Retrieval, High Generation Scores
146
+
147
+ - **Diagnosis:** Good generation, inadequate information.
148
+ - **Fixes:** Enhance indexing, retrieval algorithms, or expand the knowledge base.
149
+
150
+ ### 3. Low Context Precision, High Faithfulness
151
+
152
+ - **Diagnosis:** Retrieves too much, but generates reliably.
153
+ - **Fixes:** Filter passages, optimize chunk size, or use re-ranking.
154
+
155
+ ---
156
+
157
+ ## Best Practices for RAG Evaluation
158
+
159
+ 1. **Evaluate components independently:** Assess retrieval and generation separately.
160
+ 2. **Use diverse queries:** Include factoid, explanatory, and complex questions.
161
+ 3. **Compare against baselines:** Test against simpler systems.
162
+ 4. **Perform ablation studies:** Try variations like different chunk sizes or retrieval models.
163
+ 5. **Combine with human evaluation:** Use Ragas with human judgment for a complete view.
164
+
165
+ ---
166
+
167
+ ## Conclusion: The Iterative RAG Evaluation Cycle
168
+
169
+ Effective RAG development is iterative:
170
+
171
+ 1. **Evaluate:** Measure performance.
172
+ 2. **Analyze:** Identify weaknesses.
173
+ 3. **Improve:** Apply targeted enhancements.
174
+ 4. **Re-evaluate:** Measure the impact of changes.
175
+
176
+ <p align="center">
177
+ <img src="/images/the-iterative-rag-evaluation-cycle.png" alt="The Iterative RAG Evaluation Cycle" width="50%">
178
+ </p>
179
+
180
+ By using Ragas to implement this cycle, you can systematically improve your RAG system's performance across all dimensions.
181
+
182
+ In our next post, we'll explore how to generate high-quality test datasets for comprehensive RAG evaluation, addressing the common challenge of limited test data.
183
+
184
+ ---
185
+
186
+ **[Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications](/blog/introduction-to-ragas/)**
187
+ **[Part 2: Basic Evaluation Workflow](/blog/basic-evaluation-workflow-with-ragas/)**
188
+ **Part 3: Evaluating RAG Systems with Ragas — _You are here_**
189
+ *Next up in the series:*
190
+ **[Part 4: Test Data Generation](/blog/generating-test-data-with-ragas/)**
191
+ **[Part 5: Advanced Evaluation Techniques](/blog/advanced-metrics-and-customization-with-ragas)**
192
+ **[Part 6: Evaluating AI Agents](/blog/evaluating-ai-agents-with-ragas/)**
193
+ **[Part 7: Integrations and Observability](/blog/integrations-and-observability-with-ragas/)**
194
+ **[Part 8: Building Feedback Loops](/blog/building-feedback-loops-with-ragas/)**
195
+
196
+
197
+ *How have you implemented feedback loops in your LLM applications? What improvement strategies have been most effective for your use cases? If you’re facing specific evaluation hurdles, don’t hesitate to [reach out](https://www.linkedin.com/in/muhammadafzaal/)—we’d love to help!*
data/generating-test-data-with-ragas/index.md ADDED
@@ -0,0 +1,206 @@
1
+ ---
2
+ title: "Part 4: Generating Test Data with Ragas"
3
+ date: 2025-04-27T16:00:00-06:00
4
+ layout: blog
5
+ description: "Discover how to generate robust test datasets for evaluating Retrieval-Augmented Generation systems using Ragas, including document-based, domain-specific, and adversarial test generation techniques."
6
+ categories: ["AI", "RAG", "Evaluation", "Ragas","Data"]
7
+ coverImage: "/images/generating_test_data.png"
8
+ readingTime: 14
9
+ published: true
10
+ ---
11
+
12
+
13
+ In our previous post, we explored how to comprehensively evaluate RAG systems using specialized metrics. However, even the best evaluation framework requires high-quality test data to yield meaningful insights. In this post, we'll dive into how Ragas helps you generate robust test datasets for evaluating your LLM applications.
14
+
15
+
16
+ ## Why and How to Generate Synthetic Data for RAG Evaluation
17
+
18
+ In the world of Retrieval-Augmented Generation (RAG) and LLM-powered applications, **synthetic data generation** is a game-changer for rapid iteration and robust evaluation. This blog post explains why synthetic data is essential, and how you can generate it for your own RAG pipelines—using modern tools like [RAGAS](https://github.com/explodinggradients/ragas) and [LangSmith](https://smith.langchain.com/).
19
+
20
+ ---
21
+
22
+ ### Why Generate Synthetic Data?
23
+
24
+ 1. **Early Signal, Fast Iteration**
25
+ Real-world data is often scarce or expensive to label. Synthetic data lets you quickly create test sets that mimic real user queries and contexts, so you can evaluate your system’s performance before deploying to production.
26
+
27
+ 2. **Controlled Complexity**
28
+ You can design synthetic datasets to cover edge cases, multi-hop reasoning, or specific knowledge domains—ensuring your RAG system is robust, not just good at the “easy” cases.
29
+
30
+ 3. **Benchmarking and Comparison**
31
+ Synthetic test sets provide a repeatable, comparable way to measure improvements as you tweak your pipeline (e.g., changing chunk size, embeddings, or prompts).
32
+
33
+ ---
34
+
35
+ ### How to Generate Synthetic Data
36
+
37
+ #### 1. **Prepare Your Source Data**
38
+ Start with a set of documents relevant to your domain. For example, you might download and load HTML blog posts into a document format using tools like LangChain’s `DirectoryLoader`.
39
+
40
+ #### 2. **Build a Knowledge Graph**
41
+ Use RAGAS to convert your documents into a knowledge graph. This graph captures entities, relationships, and summaries, forming the backbone for generating meaningful queries. RAGAS applies a set of default transformations (which depend on the corpus length); here are some examples:
42
+
43
+ - Summary extraction -> produces a summary of each document
44
+ - Headline extraction -> finds the overall headline for each document
45
+ - Theme extraction -> extracts broad themes across the documents
46
+
47
+ It then uses cosine similarity and heuristics over the embeddings produced by these transformations to construct relationships between the nodes. This is a crucial step: the quality of your knowledge graph directly impacts the relevance and accuracy of the generated queries.
48
+
49
+ #### 3. **Configure Query Synthesizers**
50
+ RAGAS provides several query synthesizers:
51
+ - **SingleHopSpecificQuerySynthesizer**: Generates direct, fact-based questions.
52
+ - **MultiHopAbstractQuerySynthesizer**: Creates broader, multi-step reasoning questions.
53
+ - **MultiHopSpecificQuerySynthesizer**: Focuses on questions that require connecting specific entities across documents.
54
+
55
+ By mixing these, you get a diverse and challenging test set.
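+ If you want to control the mix yourself rather than rely on the default distribution, you can weight the synthesizers explicitly. The sketch below assumes `generator`, `generator_llm`, and `docs` are set up as in the minimal example later in this post, and that your Ragas version exports these synthesizer classes from `ragas.testset.synthesizers` (import paths can shift between releases):
+
+ ```python
+ from ragas.testset.synthesizers import (
+     SingleHopSpecificQuerySynthesizer,
+     MultiHopAbstractQuerySynthesizer,
+     MultiHopSpecificQuerySynthesizer,
+ )
+
+ # Weight the mix of question types: half direct, half multi-hop.
+ query_distribution = [
+     (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
+     (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
+     (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
+ ]
+
+ dataset = generator.generate_with_langchain_docs(
+     docs,
+     testset_size=10,
+     query_distribution=query_distribution,
+ )
+ ```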
56
+
57
+ #### 4. **Generate the Test Set**
58
+ With your knowledge graph and query synthesizers, use RAGAS’s `TestsetGenerator` to create a synthetic dataset. This dataset will include questions, reference answers, and supporting contexts.
59
+
60
+ #### 5. **Evaluate and Iterate**
61
+ Load your synthetic dataset into an evaluation platform like LangSmith. Run your RAG pipeline against the test set, and use automated evaluators (for accuracy, helpfulness, style, etc.) to identify strengths and weaknesses. Tweak your pipeline and re-evaluate to drive improvements.
62
+
63
+ ---
64
+
65
+ ### Minimal Example
66
+
67
+ Here’s a high-level pseudocode outline (see the notebook for full details):
68
+
69
+ ````python
70
+ # 1. Load documents
71
+ from langchain_community.document_loaders import DirectoryLoader
72
+ path = "data/"
73
+ loader = DirectoryLoader(path, glob="*.md")
74
+ docs = loader.load()
75
+
76
+ # 2. Generate data
77
+ from ragas.testset import TestsetGenerator
78
+ from ragas.llms import LangchainLLMWrapper
79
+ from ragas.embeddings import LangchainEmbeddingsWrapper
80
+ from langchain_openai import ChatOpenAI
81
+ from langchain_openai import OpenAIEmbeddings
82
+ # Initialize the generator with the LLM and embedding model
83
+ generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1"))
84
+ generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
85
+
86
+ # Create the test set generator
87
+ generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
88
+ dataset = generator.generate_with_langchain_docs(docs, testset_size=10)
89
+ ````
90
+
91
+ `dataset` will now contain a set of questions, answers, and contexts that you can use to evaluate your RAG system.
92
+
93
+ > 💡 **Try it yourself:**
94
+ > Explore the hands-on notebook for synthetic data generation:
95
+ > 💡 **Try it yourself:**
96
+ > Explore the hands-on notebook for synthetic data generation:
97
+ > [04_Synthetic_Data_Generation](https://github.com/mafzaal/intro-to-ragas/blob/master/04_Synthetic_Data_Generation.ipynb)
98
+
99
+ ### Understanding the Generated Dataset Columns
100
+
101
+ The synthetic dataset generated by Ragas typically includes the following columns:
102
+
103
+ - **`user_input`**: The generated question or query that simulates what a real user might ask. This is the prompt your RAG system will attempt to answer.
104
+ - **`reference_contexts`**: A list of document snippets or passages that contain the information needed to answer the `user_input`. These serve as the ground truth retrieval targets.
105
+ - **`reference`**: The ideal answer to the `user_input`, based strictly on the `reference_contexts`. This is used as the gold standard for evaluating answer accuracy.
106
+ - **`synthesizer_name`**: The name of the query synthesizer (e.g., `SingleHopSpecificQuerySynthesizer`, `MultiHopAbstractQuerySynthesizer`) that generated the question. This helps track the type and complexity of each test case.
107
+
108
+ These columns enable comprehensive evaluation by linking each question to its supporting evidence and expected answer, while also providing insight into the diversity and difficulty of the generated queries.
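+ A quick way to sanity-check these columns is to load the generated test set into a DataFrame. This sketch assumes `dataset` is the test set produced in the minimal example above and that your Ragas version exposes `to_pandas()` on it:
+
+ ```python
+ # Inspect the generated test set: one row per generated sample.
+ df = dataset.to_pandas()
+ print(df.columns.tolist())   # expect user_input, reference_contexts, reference, synthesizer_name
+ print(df.head())
+ ```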
109
+
110
+
111
+ ## Deep Dive into Test Data Generation
112
+
113
+ So you have a collection of documents and want to create a robust evaluation dataset for your RAG system using Ragas. The `TestsetGenerator`'s `generate_with_langchain_docs` method is your starting point. But what exactly happens when you call it? Let's peek under the hood.
114
+
115
+ **The Goal:** To take raw Langchain `Document` objects and transform them into a structured Ragas `Testset` containing diverse question-answer pairs grounded in those documents.
116
+
117
+ **The Workflow:**
118
+
119
+ 1. **Input & Validation:** The function receives your Langchain `documents`, the desired `testset_size`, and optional configurations for transformations and query types. It first checks if it has the necessary LLM and embedding models to proceed (either provided during `TestsetGenerator` initialization or passed directly to this method).
120
+
121
+ 2. **Setting Up Transformations:** This is a crucial step.
122
+ * **User-Provided:** If you pass a specific `transforms` configuration, the generator uses that.
123
+ * **Default Transformations:** If you *don't* provide `transforms`, the generator calls `ragas.testset.transforms.default_transforms`. This sets up a standard pipeline to process your raw documents into a usable knowledge graph foundation. We'll detail this below.
124
+
125
+ 3. **Document Conversion:** Your Langchain `Document` objects are converted into Ragas' internal `Node` representation, specifically `NodeType.DOCUMENT`. Each node holds the `page_content` and `metadata`.
126
+
127
+ 4. **Initial Knowledge Graph:** A `KnowledgeGraph` object is created, initially containing just these document nodes.
128
+
129
+ 5. **Applying Transformations:** The core processing happens here using `ragas.testset.transforms.apply_transforms`. The chosen `transforms` (default or custom) are executed sequentially on the `KnowledgeGraph`. This modifies the graph by:
130
+ * Adding new nodes (e.g., chunks, questions, answers).
131
+ * Adding relationships between nodes (e.g., linking a question to the chunk it came from).
132
+ The generator's internal `knowledge_graph` attribute is updated with this processed graph.
133
+
134
+ 6. **Delegation to `generate()`:** Now that the foundational knowledge graph with basic Q&A pairs is built (thanks to transformations), `generate_with_langchain_docs` calls the main `self.generate()` method. This method handles the final step of creating the diverse test samples.
135
+
136
+ **Spotlight: Default Transformations (`default_transforms`)**
137
+
138
+ When you don't specify custom transformations, Ragas applies a sensible default pipeline to prepare your documents:
139
+
140
+ 1. **Chunking (`SentenceChunker`):** Breaks down your large documents into smaller, more manageable chunks (often sentences or groups of sentences). This is essential for focused retrieval and question generation.
141
+ 2. **Embedding:** Generates vector embeddings for each chunk using the provided embedding model. These are needed for similarity-based operations.
142
+ 3. **Filtering (`SimilarityFilter`, `InformationFilter`):** Removes redundant chunks (those too similar to others) and potentially low-information chunks to clean up the knowledge base.
143
+ 4. **Base Q&A Generation (`QAGenerator`):** This is where the initial, simple question-answer pairs are created. The generator looks at individual (filtered) chunks and uses an LLM to formulate straightforward questions whose answers are directly present in that chunk.
144
+
145
+ Essentially, the default transformations build a knowledge graph populated with embedded, filtered document chunks and corresponding simple, extractive question-answer pairs.
146
+
147
+ **Spotlight: Query Synthesizers (via `self.generate()` and `default_query_distribution`)**
148
+
149
+ The `self.generate()` method, called by `generate_with_langchain_docs`, is responsible for taking the foundational graph and creating the final, potentially complex, test questions using **Query Synthesizers** (also referred to as "evolutions" or "scenarios").
150
+
151
+ * **Query Distribution:** `self.generate()` uses a `query_distribution` parameter. If you don't provide one, it calls `ragas.testset.synthesizers.default_query_distribution`.
152
+ * **Default Synthesizers:** This default distribution defines a mix of different synthesizer types and the probability of using each one. Common defaults include:
153
+ * **`simple`:** Takes the base Q&A pairs generated during transformation and potentially rephrases them slightly.
154
+ * **`reasoning`:** Creates questions requiring logical inference based on the context in the graph.
155
+ * **`multi_context`:** Generates questions needing information synthesized from multiple different chunks/nodes in the graph.
156
+ * **`conditional`:** Creates questions with "if/then" clauses based on information in the graph.
157
+ * **Generation Process:** `self.generate()` calculates how many questions of each type to create based on the `testset_size` and the distribution probabilities. It then uses an `Executor` to run the appropriate synthesizers, generating the final `TestsetSample` objects that make up your evaluation dataset.
158
+
159
+ **In Summary:**
160
+
161
+ `generate_with_langchain_docs` orchestrates a two-phase process:
162
+
163
+ 1. **Transformation Phase:** Uses (typically default) transformations like chunking, filtering, and base Q&A generation to build a foundational knowledge graph from your documents.
164
+ 2. **Synthesis Phase (via `self.generate`):** Uses (typically default) query synthesizers/evolutions (`simple`, `reasoning`, `multi_context`, etc.) to create diverse and complex questions based on the information stored in the transformed knowledge graph.
165
+
166
+ This automated pipeline allows you to go from raw documents to a rich, multi-faceted evaluation dataset with minimal configuration.
167
+
168
+
169
+ ## Best Practices for Test Data Generation
170
+
171
+ 1. **Start small and iterate**: Begin with a small test set to verify quality before scaling up
172
+ 2. **Diversify document sources**: Include different document types, styles, and domains
173
+ 3. **Balance question types**: Ensure coverage of simple, complex, and edge-case scenarios
174
+ 4. **Manual review**: Sample-check generated questions for quality and relevance
175
+ 5. **Progressive difficulty**: Include both easy and challenging questions to identify performance thresholds
176
+ 6. **Document metadata**: Retain information about test case generation for later analysis
177
+ 7. **Version control**: Track test set versions alongside your application versions
178
+
179
+ ## Conclusion: Building a Test Data Generation Strategy
180
+
181
+ Test data generation should be an integral part of your LLM application development cycle:
182
+
183
+ 1. **Initial development**: Generate broad test sets to identify general capabilities and limitations
184
+ 2. **Refinement**: Create targeted test sets for specific features or improvements
185
+ 3. **Regression testing**: Maintain benchmark test sets to ensure changes don't break existing functionality
186
+ 4. **Continuous improvement**: Generate new test cases as your application evolves
187
+
188
+ By leveraging Ragas for automated test data generation, you can build comprehensive evaluation datasets that thoroughly exercise your LLM applications, leading to more robust, reliable systems.
189
+
190
+ In our next post, we'll explore advanced metrics and customization techniques for specialized evaluation needs.
191
+
192
+ ---
193
+
194
+
195
+ **[Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications](/blog/introduction-to-ragas/)**
196
+ **[Part 2: Basic Evaluation Workflow](/blog/basic-evaluation-workflow-with-ragas/)**
197
+ **[Part 3: Evaluating RAG Systems with Ragas](/blog/evaluating-rag-systems-with-ragas/)**
198
+ **Part 4: Test Data Generation — _You are here_**
199
+ *Next up in the series:*
200
+ **[Part 5: Advanced Evaluation Techniques](/blog/advanced-metrics-and-customization-with-ragas)**
201
+ **[Part 6: Evaluating AI Agents](/blog/evaluating-ai-agents-with-ragas/)**
202
+ **[Part 7: Integrations and Observability](/blog/integrations-and-observability-with-ragas/)**
203
+ **[Part 8: Building Feedback Loops](/blog/building-feedback-loops-with-ragas/)**
204
+
205
+
206
+ *How have you implemented feedback loops in your LLM applications? What improvement strategies have been most effective for your use cases? If you’re facing specific evaluation hurdles, don’t hesitate to [reach out](https://www.linkedin.com/in/muhammadafzaal/)—we’d love to help!*
data/integrations-and-observability-with-ragas/index.md ADDED
@@ -0,0 +1,137 @@
1
+ ---
2
+ title: "Part 7: Integrations and Observability with Ragas"
3
+ date: 2025-04-30T07:00:00-06:00
4
+ layout: blog
5
+ description: "Discover how to generate robust test datasets for evaluating Retrieval-Augmented Generation systems using Ragas, including document-based, domain-specific, and adversarial test generation techniques."
6
+ categories: ["AI", "RAG", "Evaluation", "Ragas","Data"]
7
+ coverImage: "/images/integrations-and-observability.png"
8
+ readingTime: 12
9
+ published: true
10
+ ---
11
+
12
+ # Part 7: Integrations and Observability with Ragas
13
+
14
+ In our previous post, we explored how to evaluate complex AI agents using Ragas' specialized metrics for goal accuracy, tool call accuracy, and topic adherence to build more reliable and effective agent-based applications. Now, let's discuss how to integrate Ragas into your broader LLM development ecosystem and establish observability practices that transform evaluation from a one-time exercise into a continuous improvement cycle.
15
+
16
+ ## Why Integrations and Observability Matter
17
+
18
+ Evaluation is most powerful when it's:
19
+
20
+ - **Integrated** into your existing workflow and tools
21
+ - **Automated** to run consistently with minimal friction
22
+ - **Observable** so insights are easily accessible and actionable
23
+ - **Continuous** rather than a one-time or sporadic effort
24
+
25
+ Let's explore how Ragas helps you achieve these goals through its extensive integration capabilities.
26
+
27
+ ## Framework Integrations
28
+
29
+ Ragas seamlessly connects with popular LLM application frameworks, allowing you to evaluate systems built with your preferred tools.
30
+
31
+ ### LangChain Integration
32
+ For LangChain-based applications, Ragas provides dedicated integration support. Here’s how you can integrate Ragas step by step:
33
+
34
+ 1. **Prepare your documents**: Load your source documents and split them into manageable chunks for retrieval.
35
+ 2. **Set up vector storage**: Embed the document chunks and store them in a vector database to enable efficient retrieval.
36
+ 3. **Configure the retriever and QA chain**: Use LangChain components to create a retriever and a question-answering (QA) chain powered by your chosen language model.
37
+ 4. **Generate a test set**: Use Ragas to automatically generate a set of test questions and answers from your documents, or supply your own.
38
+ 5. **Evaluate retrieval and QA performance**: Apply Ragas metrics to assess both the retriever and the full QA chain, measuring aspects like context relevancy, faithfulness, and answer quality.
39
+ 6. **Review results**: Analyze the evaluation outputs to identify strengths and areas for improvement in your RAG pipeline.
40
+
41
+ This integration allows you to continuously measure and improve the effectiveness of your retrieval and generation components within the LangChain framework.
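+ As a rough sketch of steps 4 and 5, the pattern below runs a LangChain retriever and QA chain over a set of test questions and scores the results with Ragas. The names `test_questions`, `your_retriever`, and `your_qa_chain` are placeholders for the components you built in the earlier steps:
+
+ ```python
+ from ragas import evaluate
+ from ragas.dataset_schema import SingleTurnSample, EvaluationDataset
+ from ragas.metrics import Faithfulness, LLMContextRecall
+ from ragas.llms import LangchainLLMWrapper
+ from langchain_openai import ChatOpenAI
+
+ evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
+
+ samples = []
+ for item in test_questions:                            # e.g. rows from a Ragas-generated test set
+     docs = your_retriever.invoke(item["question"])     # step 3: retrieve
+     answer = your_qa_chain.invoke(item["question"])    # step 3: generate
+     samples.append(SingleTurnSample(
+         user_input=item["question"],
+         retrieved_contexts=[d.page_content for d in docs],
+         response=answer,
+         reference=item["reference"],                   # ground truth, needed for context recall
+     ))
+
+ results = evaluate(
+     dataset=EvaluationDataset(samples=samples),
+     metrics=[Faithfulness(llm=evaluator_llm), LLMContextRecall(llm=evaluator_llm)],
+ )
+ print(results)
+ ```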
42
+
43
+ > 💡 **Try it yourself:**
44
+ > Explore the hands-on notebook for integrations and observability:
45
+ > [07_Integrations_and_Observability](https://github.com/mafzaal/intro-to-ragas/blob/master/07_Integrations_and_Observability.ipynb)
46
+
47
+ Ragas also integrates with other popular LLM and RAG frameworks, including LlamaIndex and Haystack, so you can evaluate retrieval and generation components within your preferred stack. If you need guidance or code examples for one of these platforms, reach out and I can provide tailored examples for your workflow.
48
+
49
+ ## Observability Platform Integrations
50
+
51
+ Beyond framework integrations, Ragas connects with leading observability platforms to help you monitor, track, and analyze evaluation results over time.
52
+
53
+ ### LangSmith Integration
54
+ For LangChain users, LangSmith provides comprehensive tracing and evaluation. To integrate Ragas evaluation with LangSmith, follow these steps:
55
+
56
+ 1. **Set up your environment**
57
+ 2. **Upload dataset to LangSmith**
58
+ 3. **Define your LLM or chain**
59
+ 4. **Select Ragas metrics**
60
+ 5. **Run evaluation with LangSmith**
61
+
62
+ You can now view detailed experiment results in your LangSmith project dashboard. This integration enables you to trace, evaluate, and monitor your RAG pipeline performance directly within LangSmith, leveraging Ragas metrics for deeper insights.
63
+
64
+ > 💡 **Try it yourself:**
65
+ > Explore the hands-on notebook for integrations and observability:
66
+ > [07_Integrations_and_Observability](https://github.com/mafzaal/intro-to-ragas/blob/master/07_Integrations_and_Observability.ipynb)
67
+
68
+
69
+ ### Other Platform Integrations
70
+
71
+ Ragas can also be integrated with other observability and monitoring platforms, such as Langfuse. If you need help connecting Ragas to your observability stack, reach out and I can provide tailored examples for your workflow.
72
+
73
+ ## Building Automated Evaluation Pipelines
74
+
75
+ To ensure evaluation is a continuous part of your development process, set up automated pipelines that run evaluations regularly and automatically.
76
+
77
+ ### CI/CD Integration
78
+
79
+ You can incorporate Ragas into your CI/CD pipeline so that every code change is automatically evaluated. This helps catch regressions early and ensures your RAG system maintains high performance before merging new changes.
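+ One lightweight way to wire this up, sketched under the assumption that you already have an evaluation script producing a metric-to-score mapping, is a small gate script that your CI job runs and fails on:
+
+ ```python
+ import sys
+
+ THRESHOLDS = {"faithfulness": 0.85, "answer_relevancy": 0.80}  # tune to your own baselines
+
+ def check_scores(scores: dict) -> int:
+     """Return a non-zero exit code if any metric falls below its threshold."""
+     failures = {name: s for name, s in scores.items() if s < THRESHOLDS.get(name, 0.0)}
+     if failures:
+         print(f"Evaluation gate failed: {failures}")
+         return 1
+     print(f"Evaluation gate passed: {scores}")
+     return 0
+
+ if __name__ == "__main__":
+     # Replace this dict with the scores returned by your Ragas evaluation run.
+     sys.exit(check_scores({"faithfulness": 0.91, "answer_relevancy": 0.87}))
+ ```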
80
+
81
+ ### Scheduled Evaluations
82
+
83
+ Regularly scheduled evaluations allow you to monitor your system’s performance over time. By running evaluations at set intervals, you can track trends, spot regressions, and ensure your system continues to meet quality standards.
84
+
85
+ ## Monitoring Evaluation Metrics Over Time
86
+
87
+ Tracking evaluation metrics over time helps you identify performance trends and quickly detect any drops in quality. By visualizing these metrics, you can better understand how changes to your system impact its effectiveness.
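+ Even before adopting a dedicated platform, a simple starting point is to append every evaluation run to a local history file that you can chart later. A minimal sketch, with placeholder metric names in the example call:
+
+ ```python
+ import csv
+ import os
+ from datetime import datetime, timezone
+
+ def log_run(scores: dict, path: str = "eval_history.csv") -> None:
+     """Append one evaluation run (metric name -> score) to a CSV history file."""
+     row = {"timestamp": datetime.now(timezone.utc).isoformat(), **scores}
+     write_header = not os.path.exists(path)
+     with open(path, "a", newline="") as f:
+         writer = csv.DictWriter(f, fieldnames=list(row))
+         if write_header:
+             writer.writeheader()
+         writer.writerow(row)
+
+ # Example usage with placeholder scores:
+ log_run({"faithfulness": 0.92, "answer_relevancy": 0.88})
+ ```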
88
+
89
+ ## Creating Custom Dashboards
90
+
91
+ Building custom dashboards gives you a comprehensive view of your evaluation results. Dashboards can display current performance, trends, and detailed breakdowns of recent evaluations, making it easier to monitor your system and identify areas for improvement.
92
+
93
+ With these practices, you can make evaluation an ongoing, automated, and visible part of your development workflow, leading to more reliable and robust RAG systems.
94
+
95
+ ## Best Practices for Observability
96
+
97
+ 1. **Define clear thresholds**: Establish performance baselines and alert thresholds for each metric
98
+ 2. **Segment evaluations**: Break down results by query type, data source, or other relevant factors
99
+ 3. **Historical tracking**: Maintain historical evaluation data to identify trends and regressions
100
+ 4. **Correlation analysis**: Link evaluation metrics to user feedback and business outcomes
101
+ 5. **Regular benchmarking**: Periodically evaluate against fixed test sets to ensure consistency
102
+ 6. **Alert on regressions**: Implement automated alerts when metrics drop below thresholds
103
+ 7. **Contextualize metrics**: Include example failures alongside aggregate metrics for better understanding
104
+
105
+ ## Building a Feedback Loop
106
+
107
+ The ultimate goal of evaluation is to drive improvements. Establish a feedback loop:
108
+
109
+ 1. **Capture evaluation results** with Ragas
110
+ 2. **Identify patterns** in failures and underperforming areas
111
+ 3. **Prioritize improvements** based on impact and effort
112
+ 4. **Implement changes** to your RAG components
113
+ 5. **Validate improvements** with focused re-evaluation
114
+ 6. **Monitor continuously** to catch regressions
115
+
116
+ ## Conclusion: From Evaluation to Action
117
+
118
+ Integrating Ragas with your frameworks and observability tools transforms evaluation from a point-in-time activity to a continuous improvement cycle. By making evaluation metrics visible, actionable, and integrated into your workflows, you create a foundation for systematic improvement of your LLM applications.
119
+
120
+ The most successful teams don't just evaluate occasionally — they build evaluation into their development culture, making data-driven decisions based on objective metrics rather than subjective impressions.
121
+
122
+ In our final post, we'll explore how to build effective feedback loops that translate evaluation insights into concrete improvements for your LLM applications.
123
+
124
+ ---
125
+
126
+
127
+ **[Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications](/blog/introduction-to-ragas/)**
128
+ **[Part 2: Basic Evaluation Workflow](/blog/basic-evaluation-workflow-with-ragas/)**
129
+ **[Part 3: Evaluating RAG Systems with Ragas](/blog/evaluating-rag-systems-with-ragas/)**
130
+ **[Part 4: Test Data Generation](/blog/generating-test-data-with-ragas/)**
131
+ **[Part 5: Advanced Metrics and Customization](/blog/advanced-metrics-and-customization-with-ragas/)**
132
+ **[Part 6: Evaluating AI Agents](/blog/evaluating-ai-agents-with-ragas/)**
133
+ **Part 7: Integrations and Observability with Ragas — _You are here_**
134
+ *Next up in the series:*
135
+ **[Part 8: Building Feedback Loops](/blog/building-feedback-loops-with-ragas/)**
136
+
137
+ *How are you evaluating your AI agents? What challenges have you encountered in measuring agent performance? If you're facing specific evaluation hurdles, don't hesitate to [reach out](https://www.linkedin.com/in/muhammadafzaal/)—we'd love to help!*
data/introduction-to-ragas/index.md ADDED
@@ -0,0 +1,136 @@
1
+ ---
2
+ title: "Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications"
3
+ date: 2025-04-26T18:00:00-06:00
4
+ layout: blog
5
+ description: "Explore the essential evaluation framework for LLM applications with Ragas. Learn how to assess performance, ensure accuracy, and improve reliability in Retrieval-Augmented Generation systems."
6
+ categories: ["AI", "RAG", "Evaluation","Ragas"]
7
+ coverImage: "https://images.unsplash.com/photo-1593642634367-d91a135587b5?q=80&w=1770&auto=format&fit=crop&ixlib=rb-4.0.3"
8
+ readingTime: 7
9
+ published: true
10
+ ---
11
+
12
+ As Large Language Models (LLMs) become fundamental components of modern applications, effectively evaluating their performance becomes increasingly critical. Whether you're building a question-answering system, a document retrieval tool, or a conversational agent, you need reliable metrics to assess how well your application performs. This is where Ragas steps in.
13
+
14
+ ## What is Ragas?
15
+
16
+ [Ragas](https://docs.ragas.io/en/stable/) is an open-source evaluation framework specifically designed for LLM applications, with particular strengths in Retrieval-Augmented Generation (RAG) systems. Unlike traditional NLP evaluation methods, Ragas provides specialized metrics that address the unique challenges of LLM-powered systems.
17
+
18
+ At its core, Ragas helps answer crucial questions:
19
+ - Is my application retrieving the right information?
20
+ - Are the responses factually accurate and consistent with the retrieved context?
21
+ - Does the system appropriately address the user's query?
22
+ - How well does my application handle multi-turn conversations?
23
+
24
+ ## Why Evaluate LLM Applications?
25
+
26
+ LLMs are powerful but imperfect. They can hallucinate facts, misinterpret queries, or generate convincing but incorrect responses. For applications where accuracy and reliability matter—like healthcare, finance, or education—proper evaluation is non-negotiable.
27
+
28
+ Evaluation serves several key purposes:
29
+ - **Quality assurance**: Identify and fix issues before they reach users
30
+ - **Performance tracking**: Monitor how changes impact system performance
31
+ - **Benchmarking**: Compare different approaches objectively
32
+ - **Continuous improvement**: Build feedback loops to enhance your application
33
+
34
+ ## Key Features of Ragas
35
+
36
+ ### 🎯 Specialized Metrics
37
+ Ragas offers both LLM-based and computational metrics tailored to evaluate different aspects of LLM applications:
38
+
39
+ - **Faithfulness**: Measures if the response is factually consistent with the retrieved context
40
+ - **Context Relevancy**: Evaluates if the retrieved information is relevant to the query
41
+ - **Answer Relevancy**: Assesses if the response addresses the user's question
42
+ - **Topic Adherence**: Gauges how well multi-turn conversations stay on topic
43
+
44
+ ### 🧪 Test Data Generation
45
+ Creating high-quality test data is often a bottleneck in evaluation. Ragas helps you generate comprehensive test datasets automatically, saving time and ensuring thorough coverage.
46
+
47
+ ### 🔗 Seamless Integrations
48
+ Ragas works with popular LLM frameworks and tools:
49
+ - [LangChain](https://www.langchain.com/)
50
+ - [LlamaIndex](https://www.llamaindex.ai/)
51
+ - [Haystack](https://haystack.deepset.ai/)
52
+ - [OpenAI](https://openai.com/)
53
+
54
+ It also connects with observability platforms:
55
+ - [Phoenix](https://phoenix.arize.com/)
56
+ - [LangSmith](https://smith.langchain.com/)
57
+ - [Langfuse](https://www.langfuse.com/)
58
+
59
+ ### 📊 Comprehensive Analysis
60
+ Beyond simple scores, Ragas provides detailed insights into your application's strengths and weaknesses, enabling targeted improvements.
61
+
62
+ ## Getting Started with Ragas
63
+
64
+ Installing Ragas is straightforward:
65
+
66
+ ```bash
67
+ uv init && uv add ragas
68
+ ```
69
+
70
+ Here's a simple example of evaluating a response using Ragas:
71
+
72
+ ```python
73
+ from ragas.metrics import Faithfulness
74
+ from ragas.evaluation import EvaluationDataset
75
+ from ragas.dataset_schema import SingleTurnSample
76
+ from langchain_openai import ChatOpenAI
77
+ from ragas.llms import LangchainLLMWrapper
79
+
80
+ # Initialize the evaluator LLM (requires an OpenAI API key in your environment)
81
+ evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
82
+
83
+ # Your evaluation data
84
+ test_data = {
85
+ "user_input": "What is the capital of France?",
86
+ "retrieved_contexts": ["Paris is the capital and most populous city of France."],
87
+ "response": "The capital of France is Paris."
88
+ }
89
+
90
+ # Create a sample
91
+ sample = SingleTurnSample(**test_data) # Unpack the dictionary into the constructor
92
+
93
+ # Create metric
94
+ faithfulness = Faithfulness(llm=evaluator_llm)
95
+ # Calculate the score
96
+ result = await faithfulness.single_turn_ascore(sample)
97
+ print(f"Faithfulness score: {result}")
98
+ ```
99
+
100
+ > 💡 **Try it yourself:**
101
+ > Explore the hands-on notebook for this workflow:
102
+ > [01_Introduction_to_Ragas](https://github.com/mafzaal/intro-to-ragas/blob/master/01_Introduction_to_Ragas.ipynb)
103
+
104
+ ## What's Coming in This Blog Series
105
+
106
+ This introduction is just the beginning. In the upcoming posts, we'll dive deeper into all aspects of evaluating LLM applications with Ragas:
107
+
108
+ **[Part 2: Basic Evaluation Workflow](/blog/basic-evaluation-workflow-with-ragas/)**
109
+ We'll explore each metric in detail, explaining when and how to use them effectively.
110
+
111
+ **[Part 3: Evaluating RAG Systems](/blog/evaluating-rag-systems-with-ragas/)**
112
+ Learn specialized techniques for evaluating retrieval-augmented generation systems, including context precision, recall, and relevance.
113
+
114
+ **[Part 4: Test Data Generation](/blog/generating-test-data-with-ragas/)**
115
+ Discover how to create high-quality test datasets that thoroughly exercise your application's capabilities.
116
+
117
+ **[Part 5: Advanced Evaluation Techniques](/blog/advanced-metrics-and-customization-with-ragas)**
118
+ Go beyond basic metrics with custom evaluations, multi-aspect analysis, and domain-specific assessments.
119
+
120
+ **[Part 6: Evaluating AI Agents](/blog/evaluating-ai-agents-with-ragas/)**
121
+ Learn how to evaluate complex AI agents that engage in multi-turn interactions, use tools, and work toward specific goals.
122
+
123
+ **[Part 7: Integrations and Observability](/blog/integrations-and-observability-with-ragas/)**
124
+ Connect Ragas with your existing tools and platforms for streamlined evaluation workflows.
125
+
126
+ **[Part 8: Building Feedback Loops](/blog/building-feedback-loops-with-ragas/)**
127
+ Learn how to implement feedback loops that transform evaluation insights into concrete improvements for your LLM applications.
129
+
130
+ ## Conclusion
131
+
132
+ In a world increasingly powered by LLMs, robust evaluation is the difference between reliable applications and unpredictable ones. Ragas provides the tools you need to confidently assess and improve your LLM applications.
133
+
134
+ ### Ready to Elevate Your LLM Applications?
135
+
136
+ Start exploring Ragas today by visiting the [official documentation](https://docs.ragas.io/en/stable/). Share your thoughts, challenges, or success stories. If you're facing specific evaluation hurdles, don't hesitate to [reach out](https://www.linkedin.com/in/muhammadafzaal/)—we'd love to help!
data/langchain-experience-csharp-perspective/index.md ADDED
@@ -0,0 +1,103 @@
1
+ ---
2
+ layout: blog
3
+ title: A C# Programmer's Perspective on LangChain Expression Language
4
+ date: 2025-04-16T00:00:00-06:00
5
+ description: My experiences transitioning from C# to LangChain Expression Language, exploring the pipe operator abstraction challenges and the surprising simplicity of parallel execution.
6
+ categories: ["Technology", "AI", "Programming"]
7
+ coverImage: "https://images.unsplash.com/photo-1555066931-4365d14bab8c?q=80&w=1770&auto=format&fit=crop&ixlib=rb-4.0.3"
8
+ readingTime: 3
9
+ published: true
10
+ ---
11
+
12
+
13
+ As a C# developer diving into [LangChain Expression Language (LCEL)](https://langchain-ai.github.io/langgraph/), I've encountered both challenges and pleasant surprises. Here's what stood out most during my transition.
14
+
15
+ ## The Pipe Operator Abstraction Challenge
16
+
17
+ In C#, processing pipelines are explicit:
18
+
19
+ ```csharp
20
+ var result = inputData
21
+ .Where(item => item.IsValid)
22
+ .Select(item => TransformItem(item))
23
+ .ToList();
24
+ result.ForEach(item => ProcessItem(item));
25
+ ```
26
+
27
+ LCEL's pipe operator creates a different flow:
28
+
29
+ ```python
30
+ chain = (
31
+ ChatPromptTemplate.from_messages([
32
+ ("system", "You are a helpful assistant specialized in {topic}."),
33
+ ("human", "{query}")
34
+ ])
35
+ | ChatOpenAI(temperature=0.7)
36
+ | (lambda llm_result: llm_result.content)
37
+ | (lambda content: content.split("\n"))
38
+ | (lambda lines: [line for line in lines if line.strip()])
39
+ | (lambda filtered_lines: "\n".join(filtered_lines))
40
+ )
41
+ ```
42
+
43
+ With complex chains, questions arise:
44
+ - What exactly passes through each step?
45
+ - How can I inspect intermediate results?
46
+ - How do I debug unexpected outcomes?
47
+
48
+ This becomes more apparent in real-world examples:
49
+
50
+ ```python
51
+ retrieval_chain = (
52
+ {"query": RunnablePassthrough(), "context": retriever | format_docs}
53
+ | prompt
54
+ | llm
55
+ | StrOutputParser()
56
+ )
57
+ ```
58
+
59
+ ## Surprisingly Simple Parallel Execution
60
+
61
+ Despite abstraction challenges, LCEL handles parallel execution elegantly.
62
+
63
+ In C#:
64
+ ```csharp
65
+ var task1 = Task.Run(() => ProcessData(data1));
66
+ var task2 = Task.Run(() => ProcessData(data2));
67
+ var task3 = Task.Run(() => ProcessData(data3));
68
+
69
+ await Task.WhenAll(task1, task2, task3);
70
+ var results = new[] { task1.Result, task2.Result, task3.Result };
71
+ ```
72
+
73
+ In LCEL:
74
+ ```python
75
+ parallel_chain = RunnableMap({
76
+ "summary": prompt_summary | llm | StrOutputParser(),
77
+ "translation": prompt_translate | llm | StrOutputParser(),
78
+ "analysis": prompt_analyze | llm | StrOutputParser()
79
+ })
80
+
81
+ result = parallel_chain.invoke({"input": user_query})
82
+ ```
83
+
84
+ This approach eliminates manual task management, handling everything behind the scenes.
85
+
86
+ ## Best Practices I've Adopted
87
+
88
+ To balance LCEL's expressiveness with clarity:
89
+
90
+ 1. Break complex chains into named subcomponents (see the sketch after this list)
91
+ 2. Comment non-obvious transformations
92
+ 3. Create visualization helpers for debugging
93
+ 4. Embrace functional thinking
94
+
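+ To make the first and third practices concrete, here is a minimal sketch: a named subcomponent plus a pass-through debug tap. The stage name and prompt are illustrative rather than from a real project, and the import paths assume the current `langchain-core`/`langchain-openai` packages.
+
+ ```python
+ from langchain_core.output_parsers import StrOutputParser
+ from langchain_core.prompts import ChatPromptTemplate
+ from langchain_core.runnables import RunnableLambda
+ from langchain_openai import ChatOpenAI
+
+ llm = ChatOpenAI(temperature=0.7)
+
+ # Practice 1: give each stage a name instead of one long anonymous pipe.
+ summarize_prompt = ChatPromptTemplate.from_template("Summarize this in one sentence: {text}")
+ summarize = summarize_prompt | llm | StrOutputParser()
+
+ # Practice 3: a tiny debug helper that prints an intermediate value and passes it through.
+ def debug_tap(value):
+     print(f"[debug] {value!r}")
+     return value
+
+ chain = summarize | RunnableLambda(debug_tap)
+
+ # chain.invoke({"text": "LCEL composes runnables with the | operator."})
+ ```
+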
95
+ ## Conclusion
96
+
97
+ For C# developers exploring LCEL, approach it with an open mind. The initial learning curve is worth it, especially for AI workflows where LCEL's parallel execution shines.
98
+
99
+ Want to see these concepts in practice? Check out my [Pythonic RAG repository](https://github.com/mafzaal/AIE6-DeployPythonicRAG) for working examples.
100
+
101
+ ---
102
+
103
+ *If you found this useful or have questions about transitioning from C# to LCEL, feel free to [reach out](https://www.linkedin.com/in/muhammadafzaal/) — we’d love to help!*
data/metric-driven-development/index.md ADDED
@@ -0,0 +1,155 @@
1
+ ---
2
+ title: "Metric-Driven Development: Make Smarter Decisions, Faster"
3
+ date: 2025-05-05T00:00:00-06:00
4
+ layout: blog
5
+ description: "Your Team's Secret Weapon for Cutting Through Noise and Driving Real Progress. Learn how to use clear metrics to eliminate guesswork and make faster, smarter progress in your projects."
6
+ categories: ["Development", "Productivity", "AI", "Management"]
7
+ coverImage: "/images/metric-driven-development.png"
8
+ readingTime: 9
9
+ published: true
10
+ ---
11
+
12
+ In today's data-driven world, success depends increasingly on our ability to measure the right things at the right time. Whether you're developing AI systems, building web applications, or managing projects, having clear metrics guides your team toward meaningful progress while eliminating subjective debates.
13
+
14
+ ## The Power of Metrics in AI Evaluation
15
+
16
+ Recent advances in generative AI and large language models (LLMs) highlight the critical importance of proper evaluation frameworks. Projects like RAGAS (Retrieval Augmented Generation Assessment System) demonstrate how specialized metrics can transform vague goals into actionable insights.
17
+
18
+ For example, when evaluating retrieval-augmented generation systems, generic metrics like BLEU or ROUGE scores often fail to capture what truly matters - the accuracy, relevance, and contextual understanding of the generated responses. RAGAS instead introduces metrics specifically designed for RAG systems:
19
+
20
+ * **Faithfulness**: Measures how well the generated answer aligns with the retrieved context
21
+ * **Answer Relevancy**: Evaluates whether the response correctly addresses the user's query
22
+ * **Context Relevancy**: Assesses if the system retrieves information that's actually needed
23
+ * **Context Precision**: Quantifies how efficiently the system uses retrieved information
24
+
25
+ These targeted metrics provide clearer direction than general-purpose evaluations, allowing teams to make precise improvements where they matter most.
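+
+ As a rough sketch of what this looks like in code (assuming a Ragas 0.1-style `evaluate` API and an LLM key configured for the judge model; the sample record below is a placeholder):
+
+ ```python
+ from datasets import Dataset
+ from ragas import evaluate
+ from ragas.metrics import answer_relevancy, context_precision, faithfulness
+
+ # One placeholder evaluation record; a real run would use many.
+ rows = {
+     "question": ["What does the returns policy cover?"],
+     "answer": ["Purchases can be returned within 30 days with a receipt."],
+     "contexts": [["Items may be returned within 30 days of purchase with proof of payment."]],
+     "ground_truth": ["Returns are accepted within 30 days with a receipt."],
+ }
+
+ result = evaluate(
+     Dataset.from_dict(rows),
+     metrics=[faithfulness, answer_relevancy, context_precision],
+ )
+ print(result)  # per-metric scores for the dataset
+ ```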
26
+
+ To see how this plays out beyond AI evaluation, imagine two teams building a new feature for a streaming platform:
27
+
28
+ * **Team A** is stuck in debates. Should they focus on improving video load speed or making the recommendation engine more accurate? One engineer insists, "Faster videos keep users from leaving!" Another counters, "But better recommendations are what make them subscribe!" They argue based on gut feelings.
29
+ * **Team B** operates differently. They have a clear, agreed-upon goal: ***Improve the average "Watch Time per User" metric, while ensuring video buffering times stay below 2 seconds.*** They rapidly test ideas, measuring the impact of each change against this specific target.
30
+
31
+ Which team do you think will make faster, smarter progress?
32
+
33
+
34
+ Team B has the edge because they're using **Metric-Driven Development (MDD)**. This is a powerful strategy where teams unite around measurable goals to eliminate guesswork and make real strides. Let's break down how it works, what makes a metric truly useful, and see how industries from healthcare to e-commerce use it to succeed.
35
+
36
+ ## What Exactly is Metric-Driven Development?
37
+
38
+ Metric-Driven Development (MDD) is a simple but effective framework where teams:
39
+
40
+ 1. **Define Clear, Measurable Goals:** Set specific numerical targets (e.g., "Increase user sign-ups by 20% this quarter").
41
+ 2. **Base Decisions on Data:** Rely on evidence and measurements, not just opinions or assumptions.
42
+ 3. **Iterate and Learn Quickly:** Continuously measure the impact of changes to see what works and what doesn't.
43
+
44
+ Think of MDD as a **GPS for your project**. Without clear metrics, you're driving in the fog, hoping you're heading in the right direction. With MDD, you get real-time feedback, ensuring you're moving towards your destination efficiently.
45
+
46
+ ## Why Teams Struggle Without Clear Metrics
47
+
48
+ Without a metric-driven approach, teams often fall into common traps:
49
+
50
+ * **Chasing Too Many Goals:** Trying to improve everything at once ("We need higher accuracy *and* faster speed *and* lower costs!") leads to scattered effort and slow progress.
51
+ * **Endless Subjective Debates:** Arguments arise that are hard to resolve with data ("Is Model A's slightly better performance worth the extra complexity?").
52
+ * **Difficulty Measuring Progress:** It's hard to know if you're actually improving ("Are we doing better than last quarter? How can we be sure?").
53
+
54
+ In **machine learning (ML)**, this often happens when teams track various technical scores (like precision, recall, or F1 score – measures of model accuracy) without a single, unifying metric tied to the *actual business outcome* they want to achieve.
55
+
56
+ ## What Makes a Metric Great? The Key Ingredients
57
+
58
+ Not all numbers are helpful. A truly effective metric has these essential traits:
59
+
60
+ 1. **Measurable:** It must be quantifiable and objective. *"95% accuracy"* is measurable; *"a better user experience"* is not, unless defined by specific, measurable indicators.
61
+ 2. **Actionable:** Your team must be able to influence the metric through their work. For example, changing a website's design *can* affect the "click-through rate."
62
+ 3. **Aligned with Business Goals:** The metric should directly contribute to the overall success of the product or business. If user retention is key, optimizing for ad clicks might be counterproductive.
63
+ 4. **Simple & Understandable:** It should be easy for everyone on the team (and stakeholders) to grasp and track. *"Monthly Active Users"* is usually simpler than a complex, weighted formula.
64
+ 5. **Robust (Hard to Game):** The metric shouldn't be easily manipulated in ways that don't reflect real progress. *Example:* A ride-sharing app tracking only "rides booked" could be fooled by drivers booking and immediately canceling rides. A better metric might be "completed rides lasting over 1 minute." (See the toy sketch after this list.)
65
+ 6. **Directional:** The desired direction of the metric should be clear – whether you're trying to maximize it (like conversion rate or user retention) or minimize it (like error rate or load time). This clarity helps teams understand exactly what success looks like without ambiguity.
66
+
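+ To make the "robust" trait concrete, here is a toy sketch with invented ride records, comparing the easily gamed count of booked rides to the harder-to-game count of completed rides lasting over a minute:
+
+ ```python
+ # Hypothetical ride records, for illustration only.
+ rides = [
+     {"completed": True,  "duration_min": 14.0},
+     {"completed": False, "duration_min": 0.2},  # booked, then immediately canceled
+     {"completed": True,  "duration_min": 0.4},  # too short to be a genuine trip
+     {"completed": True,  "duration_min": 9.5},
+ ]
+
+ rides_booked = len(rides)  # easy to inflate without creating real value
+ real_rides = sum(1 for r in rides if r["completed"] and r["duration_min"] > 1)
+
+ print(f"Rides booked: {rides_booked}, completed rides over 1 minute: {real_rides}")
+ ```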
67
+
68
+ ## Deep Dive: Reward Functions in AI – Metrics in Action
69
+
70
+ A fascinating application of MDD principles comes from **Reinforcement Learning (RL)**, a type of AI where agents learn through trial and error. In RL, learning is guided by a **reward function**: a numerical score that tells the AI how well it's doing.
71
+
72
+ Think of it like training a dog:
73
+ * Good behavior (sitting on command) gets a treat (positive reward).
74
+ * Bad behavior (chewing shoes) gets a scold (negative reward or penalty).
75
+
76
+ Examples in AI:
77
+ * A chess-playing AI might get +1 for winning, -1 for losing, and 0 for a draw.
78
+ * A self-driving car simulation might receive rewards for smooth driving and staying in its lane, and penalties for sudden braking or collisions.
79
+
80
+ **Why Reward Functions Showcase MDD:**
81
+
82
+ Reward functions are essentially highly specialized metrics that:
83
+
84
+ * **Define Priorities Clearly:** A robot arm designed to pack boxes might get rewards for speed and gentle handling, but penalties for crushing items. The reward function dictates the trade-offs.
85
+ * **Guide Behavior in Real-Time:** Unlike metrics evaluated after a project phase, reward functions shape the AI's learning process continuously.
86
+ * **Require Careful Design to Avoid "Gaming":** Just like business metrics, a poorly designed reward can lead to unintended shortcuts. An RL agent in a game might discover a way to rack up points by repeatedly performing a trivial action, instead of actually trying to win the level. This highlights the importance of the "Robust" trait we discussed earlier.
87
+
88
+ Reward functions embody the core MDD idea: set a clear, measurable goal, and let it guide actions towards success.
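+
+ As an illustrative sketch (the events and weights below are invented, not taken from a real system), a reward function for the box-packing robot described above might look like this:
+
+ ```python
+ def packing_reward(boxes_packed: int, seconds_elapsed: float,
+                    items_crushed: int, gentle_handles: int) -> float:
+     """Toy reward: encourage speed and gentle handling, penalize damage."""
+     reward = 10.0 * boxes_packed      # primary objective
+     reward += 0.5 * gentle_handles    # shaping term for careful handling
+     reward -= 0.1 * seconds_elapsed   # mild time pressure
+     reward -= 25.0 * items_crushed    # damage costs far more than slowness
+     return reward
+
+ # One simulated step: three boxes packed in 40 seconds, nothing crushed.
+ print(packing_reward(boxes_packed=3, seconds_elapsed=40.0,
+                      items_crushed=0, gentle_handles=3))
+ ```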
89
+
90
+ ## Metric-Driven Development Across Industries: Real-World Examples
91
+
92
+ MDD isn't just for software. Here's how different fields use it:
93
+
94
+ * **E-Commerce: Conversion Rate**
95
+ * **Metric:** Percentage of website visitors who make a purchase.
96
+ * **Impact:** Directly ties development efforts (like A/B testing checkout flows) to revenue growth.
97
+ * **Healthcare: Patient Readmission Rate**
98
+ * **Metric:** Percentage of patients readmitted to the hospital within 30 days of discharge.
99
+ * **Impact:** Focuses efforts on improving care quality and follow-up, leading to better patient outcomes and lower costs.
100
+ * **Manufacturing: Defect Rate**
101
+ * **Metric:** Percentage of products produced with flaws.
102
+ * **Impact:** Drives process improvements on the factory floor, saving costs and enhancing brand reputation.
103
+ * **Gaming (AI Development): Player Performance Score**
104
+ * **Metric:** A combined score, e.g., `Points Scored - (Time Taken * Penalty Factor)`.
105
+ * **Impact:** Trains AI opponents that are challenging but fair, balancing speed and skill.
106
+ * **Autonomous Vehicles: Safety & Comfort Score**
107
+ * **Metric:** Combination of factors like smooth acceleration/braking, lane adherence, and deductions for interventions or near-misses.
108
+ * **Impact:** Guides development towards vehicles that are not only safe but also provide a comfortable ride.
109
+
110
+ ## Smart Tactics: Optimizing vs. Satisficing Metrics
111
+
112
+ Sometimes, you have competing priorities. MDD offers a smart way to handle this using two types of metrics:
113
+
114
+ * **Optimizing Metric:** The main goal you want to maximize or minimize (your "North Star").
115
+ * **Satisficing Metrics:** Other important factors that just need to meet a minimum acceptable level ("good enough").
116
+
117
+ *Example: Developing a voice assistant like Alexa or Google Assistant:*
118
+
119
+ * **Optimizing Metric:** *Minimize missed commands (false negatives)* – You want it to respond reliably when you speak the wake-word.
120
+ * **Satisficing Metric:** *Keep false activations below 1 per day (false positives)* – You don't want it waking up constantly when you haven't addressed it, but perfect prevention might hurt its responsiveness.
121
+
122
+ This approach prevents teams from sacrificing critical aspects (like basic usability) in the pursuit of perfecting a single metric.
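+
+ A small sketch of how this plays out when choosing between candidate models (the numbers are invented): optimize the primary metric only among candidates that meet the satisficing threshold.
+
+ ```python
+ # Hypothetical wake-word model candidates.
+ candidates = [
+     {"name": "model_a", "missed_rate": 0.04, "false_activations_per_day": 0.6},
+     {"name": "model_b", "missed_rate": 0.02, "false_activations_per_day": 1.8},
+     {"name": "model_c", "missed_rate": 0.03, "false_activations_per_day": 0.9},
+ ]
+
+ MAX_FALSE_ACTIVATIONS_PER_DAY = 1.0  # satisficing threshold: "good enough"
+
+ acceptable = [c for c in candidates
+               if c["false_activations_per_day"] <= MAX_FALSE_ACTIVATIONS_PER_DAY]
+ best = min(acceptable, key=lambda c: c["missed_rate"])  # optimizing metric
+
+ print(best["name"])  # "model_c": model_b misses less but fails the false-activation limit
+ ```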
123
+
124
+ ## Don't Forget Early Signals: The Role of Leading Indicators
125
+
126
+ In machine learning projects, **training loss** is a common metric monitored during development. Think of it as a **"practice test score"** for the model – it shows how well the model is learning the patterns in the training data *before* it faces the real world.
127
+
128
+ While a low training loss is good (it means the model is learning *something*), it's a **leading indicator**. It doesn't guarantee success on its own. You still need **lagging indicators** – metrics that measure real-world performance, like user satisfaction, task completion rates, or the ultimate business goal (e.g., user retention).
129
+
130
+ MDD reminds us to track both:
131
+ * **Leading indicators** (like training loss, code coverage) to monitor progress during development.
132
+ * **Lagging indicators** (like user engagement, revenue, customer support tickets) to measure the actual impact.
133
+
134
+ ## The Takeaway: Use Metrics as Your Compass
135
+ Metric-Driven Development isn't a complex theory reserved for tech giants. It's a fundamental mindset applicable everywhere:
136
+
137
+ * A local bakery might track *"Daily Units Sold per Pastry Type"* to optimize baking schedules.
138
+ * A city planner could use *"Average Commute Time Reduction"* to evaluate the success of new traffic light patterns.
139
+ * A project manager might measure progress through *"Sprint Velocity"* or *"Percentage of On-Time Task Completions"* rather than subjective assessments of how "busy" the team appears.
140
+
141
+
142
+ By choosing metrics that are **measurable, actionable, aligned, simple, and robust**, you transform ambiguity into clarity and opinion into evidence.
143
+
144
+ Whether you're building sophisticated AI or launching a simple website feature, MDD empowers your team to:
145
+
146
+ 1. **Move Faster:** Make decisions quickly based on clear success criteria.
147
+ 2. **Collaborate Effectively:** Unite everyone around shared, objective goals.
148
+ 3. **Know When You've Won:** Celebrate real, measurable progress.
149
+
150
+ So, the next time your team feels stuck or unsure about the path forward, ask the crucial question: ***What's our metric?***
151
+
152
+ Finding that answer might just be the compass you need to navigate towards success.
153
+
154
+ ---
155
+ *Inspired by insights from Andrew Ng's [Machine Learning Yearning](https://info.deeplearning.ai/machine-learning-yearning-book). Remember: A great metric doesn't just measure success—it actively helps create it.*
data/rss-feed-announcement/index.md ADDED
@@ -0,0 +1,55 @@
1
+ ---
2
+ title: "Subscribe to Our Blog via RSS"
3
+ date: 2025-05-03T00:00:00-06:00
4
+ description: "Stay updated with our latest content by subscribing to our new RSS feed"
5
+ categories: ["Announcements", "Blog"]
6
+ published: true
7
+ layout: blog
8
+ coverImage: "/images/rss-announcement.png"
9
+ readingTime: 2
10
+ ---
11
+
12
+ # Subscribe to Our Blog via RSS
13
+
14
+ I'm excited to announce that TheDataGuy blog now supports RSS feeds! This means you can now easily stay updated with all the latest posts without having to manually check the website.
15
+
16
+ ## What is RSS?
17
+
18
+ RSS (Really Simple Syndication) is a web feed that allows you to subscribe to updates from websites you follow. When new content is published, your RSS reader will automatically notify you and display the latest posts.
19
+
20
+ ## Why Use RSS?
21
+
22
+ There are several benefits to using RSS:
23
+
24
+ - **No algorithms**: Unlike social media, RSS feeds show you everything from the sources you subscribe to, in chronological order.
25
+ - **No ads or distractions**: Get pure content without the clutter.
26
+ - **Privacy**: RSS readers don't track you like social media platforms do.
27
+ - **Efficiency**: Check all your favorite sites in one place instead of visiting each individually.
28
+
29
+ ## How to Subscribe
30
+
31
+ You can subscribe to our RSS feed in a few easy steps:
32
+
33
+ 1. Copy this link: `https://thedataguy.pro/rss.xml`
34
+ 2. Open your favorite RSS reader (like Feedly, Inoreader, NewsBlur, or even built-in RSS features in browsers like Vivaldi)
35
+ 3. Add a new subscription and paste the link
36
+
37
+ Alternatively, just click the RSS button in the navigation bar of our blog.
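+
+ If you'd rather script it, here is a minimal sketch using the third-party `feedparser` package (an assumption; any RSS library will do):
+
+ ```python
+ # pip install feedparser
+ import feedparser
+
+ feed = feedparser.parse("https://thedataguy.pro/rss.xml")
+
+ print(feed.feed.get("title", "TheDataGuy"))
+ for entry in feed.entries[:5]:
+     # Standard RSS fields: title, link, published, summary.
+     print(f"- {entry.title}: {entry.link}")
+ ```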
38
+
39
+ ## Popular RSS Readers
40
+
41
+ If you don't have an RSS reader yet, here are some popular options:
42
+
43
+ - [Feedly](https://feedly.com/)
44
+ - [Inoreader](https://www.inoreader.com/)
45
+ - [NewsBlur](https://newsblur.com/)
46
+ - [Feedbin](https://feedbin.com/)
47
+ - [The Old Reader](https://theoldreader.com/)
48
+
49
+ Browsers like Vivaldi also ship a built-in feed reader, while others, such as Firefox, support RSS through extensions.
50
+
51
+ ## What's Next?
52
+
53
+ I'll continue to improve the blog experience based on your feedback. If you have any suggestions or feature requests, feel free to [reach out](https://www.linkedin.com/in/muhammadafzaal/).
54
+
55
+ Happy reading!
py-src/app.py CHANGED
@@ -19,9 +19,11 @@ from qdrant_client.http.models import Distance, VectorParams
19
  from lets_talk.config import LLM_MODEL, LLM_TEMPERATURE
20
  import lets_talk.utils.blog as blog
21
  from lets_talk.agent import build_agent,parse_output
 
22
 
23
 
24
-
 
25
 
26
  tdg_agent = build_agent()
27
 
 
19
  from lets_talk.config import LLM_MODEL, LLM_TEMPERATURE
20
  import lets_talk.utils.blog as blog
21
  from lets_talk.agent import build_agent,parse_output
22
+ import pipeline
23
 
24
 
25
+ # Build the vector store before constructing the agent
26
+ pipeline.main()
27
 
28
  tdg_agent = build_agent()
29