## Ragas demonstration

This notebook demonstrates how to evaluate a RAG system using the [Ragas evaluation framework](https://github.com/explodinggradients/ragas?tab=readme-ov-file) and import the results into InspectorRAGet for analysis.

### Installation

In [1]:
!pip install git+https://github.com/explodinggradients/ragas

Collecting git+https://github.com/explodinggradients/ragas
  Cloning https://github.com/explodinggradients/ragas to /private/var/folders/l5/fj0t2qmn44x0r042xw6cjr8w0000gn/T/pip-req-build-k70llt1h
  Running command git clone --filter=blob:none --quiet https://github.com/explodinggradients/ragas /private/var/folders/l5/fj0t2qmn44x0r042xw6cjr8w0000gn/T/pip-req-build-k70llt1h
  Resolved https://github.com/explodinggradients/ragas to commit d2486f117fd827dcfc3e196d4cf7798573c55b09
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [2]:
!pip install ipywidgets



In [3]:
from datasets import Dataset 
import os
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

In [4]:
os.environ["RAGAS_DO_NOT_TRACK"] = "true"

### Import dataset

For this example, we will be using the sample dataset provided in Ragas' quick start instructions.

In [5]:
data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'contexts' : [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'], 
    ['The Green Bay Packers...Green Bay, Wisconsin.','The Packers compete...Football Conference']],
    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
}

dataset = Dataset.from_dict(data_samples)
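
Before building the `Dataset`, it can help to confirm that every column has one entry per question, since mismatched list lengths are an easy mistake when samples are assembled by hand. The sketch below is a hypothetical helper (`check_columns` is not part of Ragas or `datasets`) that works on a plain dictionary shaped like `data_samples`; the toy values are placeholders.

```python
def check_columns(samples):
    """Return the number of rows if all columns have equal length, else raise ValueError."""
    lengths = {column: len(values) for column, values in samples.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    # Each entry of 'contexts' is expected to be a list of passages, not a single string.
    for i, ctx in enumerate(samples.get("contexts", [])):
        if not isinstance(ctx, list):
            raise ValueError(f"contexts[{i}] should be a list of strings")
    return next(iter(lengths.values()), 0)


# Toy input with the same column names as `data_samples`
toy_samples = {
    "question": ["q1", "q2"],
    "answer": ["a1", "a2"],
    "contexts": [["c1"], ["c2", "c3"]],
    "ground_truth": ["g1", "g2"],
}
n_rows = check_columns(toy_samples)
print(n_rows)
```

In the notebook, the same check could be run as `check_columns(data_samples)` just before `Dataset.from_dict`.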

### Run evaluation

In this example, we run the evaluation using `gpt-4o-mini` and the following Ragas metrics: faithfulness, answer relevancy, context precision, and context recall. Consult the Ragas documentation for how to use different models or metrics.

**Note: To use an OpenAI model for evaluation, you must provide your `OPENAI_API_KEY` below.**

In [6]:
os.environ["OPENAI_API_KEY"] = "provide-your-openai-api-key"
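
Typing the key directly into a cell makes it easy to commit it by accident. As one possible alternative, the sketch below reads the key from the environment and only prompts when it is missing. `resolve_api_key` is a hypothetical helper written for this illustration; the `env` and `prompt` parameters exist so it can be exercised without a terminal.

```python
import os
from getpass import getpass


def resolve_api_key(env=os.environ, prompt=getpass):
    """Return the key from the environment if set, otherwise ask for it interactively."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # getpass hides the typed value, so the key does not end up in the cell output
        key = prompt("OpenAI API key: ").strip()
    if not key:
        raise ValueError("No OpenAI API key provided")
    return key
```

In the notebook, this would replace the hard-coded string with `os.environ["OPENAI_API_KEY"] = resolve_api_key()`.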

In [7]:
from langchain_openai.chat_models import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini")

score = evaluate(
    dataset,
    llm=llm,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
df_score = score.to_pandas()

display(df_score)


| | question | answer | contexts | ground_truth | faithfulness | answer_relevancy | context_precision | context_recall |
|---|---|---|---|---|---|---|---|---|
| 0 | When was the first super bowl? | The first superbowl was held on Jan 15, 1967 | [The First AFL–NFL World Championship Game was... | The first superbowl was held on January 15, 1967 | 1.0 | 0.980714 | 1.0 | 1.0 |
| 1 | Who won the most super bowls? | The most super bowls have been won by The New ... | [The Green Bay Packers...Green Bay, Wisconsin.... | The New England Patriots have won the Super Bo... | 0.0 | 0.943043 | 0.0 | 0.0 |


### Create InspectorRAGet file

We next generate the file for InspectorRAGet. We start by specifying the experiment metadata (experiment name, model, metrics).

**Specify name of experiment**

In [8]:
name = "Ragas Demo"

**Specify name of model that produced the answers:** Since we do not know where the answers in the `data_samples` came from, we use a dummy name. In your experiments, replace this with the name of the model that produced the answers.

In [9]:
# models -> List[dict]
models = [
    {
        "model_id": "model_a",   # e.g., "OpenAI/gpt-3.5-turbo"
        "name": "Model A",       # e.g., "GPT-3.5-Turbo"
        "owner": "Owner A"       # e.g., "OpenAI"
    }
]

**Specify metrics used:** Create a record for each metric used in the Ragas evaluation. Note that the `name` of each metric below must match the name of the corresponding column in the data frame `df_score` output by Ragas.

In [10]:
# metrics -> List[dict]
all_metrics = [
    {
        "name": "faithfulness",
        "display_name": "Faithfulness",
        "description": "Faithfulness",
        "author": "algorithm",
        "type": "numerical",
        "aggregator": "average",
        "range": [0, 1.0, 0.1]
    },
    {
        "name": "answer_relevancy",
        "display_name": "Answer Relevancy",
        "description": "Answer Relevancy",
        "author": "algorithm",
        "type": "numerical",
        "aggregator": "average",
        "range": [0, 1.0, 0.1]
    },
    {
        "name": "context_precision",
        "display_name": "Context Precision",
        "description": "Context Precision",
        "author": "algorithm",
        "type": "numerical",
        "aggregator": "average",
        "range": [0, 1.0, 0.1]
    },
    {
        "name": "context_recall",
        "display_name": "Context Recall",
        "description": "Context Recall",
        "author": "algorithm",
        "type": "numerical",
        "aggregator": "average",
        "range": [0, 1.0, 0.1]
    }
]
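
Because the export step later looks up each score by metric `name`, a misspelled name would only surface as a `KeyError` at that point. The sketch below shows one way to catch such a mismatch early. `missing_metric_columns` is a hypothetical helper, and the toy inputs (including the deliberately misspelled metric) are illustrative only.

```python
def missing_metric_columns(metrics, columns):
    """Return the metric names that have no column of the same name."""
    available = set(columns)
    return [m["name"] for m in metrics if m["name"] not in available]


# The second name is deliberately misspelled to show what a mismatch looks like
toy_metrics = [{"name": "faithfulness"}, {"name": "answer_relevance"}]
toy_columns = ["question", "answer", "faithfulness", "answer_relevancy"]

missing = missing_metric_columns(toy_metrics, toy_columns)
print(missing)  # ['answer_relevance']
```

In the notebook, the equivalent call would be `missing_metric_columns(all_metrics, df_score.columns)`, which should return an empty list.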

##### Compute document IDs

In [11]:
doc_id_counter = 0

doc_text_to_id = {}
for index, row in df_score.iterrows():
    for c in row["contexts"]:
        if c not in doc_text_to_id:
            doc_text_to_id[c] = doc_id_counter
            doc_id_counter += 1
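
The loop above assigns one integer ID per distinct passage, so a passage retrieved for several questions is stored only once. The sketch below restates that idea on toy data as a small function. `assign_document_ids` and the passage strings are made up for illustration, and `setdefault` is used here in place of the explicit counter.

```python
def assign_document_ids(contexts_per_task):
    """Map each distinct passage to an integer ID in order of first appearance."""
    text_to_id = {}
    for contexts in contexts_per_task:
        for text in contexts:
            # setdefault only stores a new ID the first time a passage is seen
            text_to_id.setdefault(text, len(text_to_id))
    return text_to_id


# "passage B" is shared by both tasks and should receive a single ID
toy_contexts = [["passage A", "passage B"], ["passage B", "passage C"]]
ids = assign_document_ids(toy_contexts)
print(ids)  # {'passage A': 0, 'passage B': 1, 'passage C': 2}
```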

##### Populate documents, tasks, and evaluations

In [12]:
all_documents = []
all_tasks = []
all_evaluations = []

# Populate documents
for doc_text, doc_id in doc_text_to_id.items():
    document = {
        "document_id": f"{doc_id}",
        "text": f"{doc_text}"
    }
    all_documents.append(document)

# Populate tasks and evaluations
for index, row in df_score.iterrows():
    instance = {
        "task_id": f"{index}",
        "task_type": "rag",
        "contexts": [ {"document_id": f"{doc_text_to_id[c]}"} for c in row["contexts"] ],
        "input": [{"speaker": "user", "text": f"{row['question']}"}],
        "targets": [{"text": f"{row['ground_truth']}"}]
    }
    all_tasks.append(instance)

    evaluation = {
        "task_id": f"{index}",
        "model_id": f"{models[0]['model_id']}",
        "model_response": f"{row['answer']}",
        "annotations": {}
    }
    for metric in all_metrics:
        metric_name = metric["name"]
        evaluation["annotations"][metric_name] = {
            "system": {
                "value": row[metric_name],
                "duration": 0
            }
        }
    all_evaluations.append(evaluation)
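
Tasks refer to documents by `document_id` and evaluations refer to tasks by `task_id`, so it may be worth checking that every reference resolves before writing the file. The following is a minimal sketch of such a check. `find_dangling_references` is a hypothetical helper, the toy records only carry the keys used in this notebook, and nothing here reflects validation that InspectorRAGet itself performs.

```python
def find_dangling_references(documents, tasks, evaluations):
    """Return a list of messages describing IDs that are referenced but never defined."""
    known_docs = {d["document_id"] for d in documents}
    known_tasks = {t["task_id"] for t in tasks}
    problems = []
    for task in tasks:
        for ctx in task.get("contexts", []):
            if ctx["document_id"] not in known_docs:
                problems.append(
                    f"task {task['task_id']}: unknown document {ctx['document_id']}"
                )
    for evaluation in evaluations:
        if evaluation["task_id"] not in known_tasks:
            problems.append(f"evaluation: unknown task {evaluation['task_id']}")
    return problems


# Toy records: document "1" is referenced by the task but never defined
toy_documents = [{"document_id": "0", "text": "passage A"}]
toy_tasks = [{"task_id": "0", "contexts": [{"document_id": "0"}, {"document_id": "1"}]}]
toy_evaluations = [{"task_id": "0", "model_id": "model_a"}]

problems = find_dangling_references(toy_documents, toy_tasks, toy_evaluations)
print(problems)  # ['task 0: unknown document 1']
```

In the notebook, `find_dangling_references(all_documents, all_tasks, all_evaluations)` should return an empty list.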

##### Write the final JSON file to `ragas-inspectorraget-demo.json` in the working directory

In [13]:
import json

output = {
    "name": name,
    "models": models,
    "metrics": all_metrics,
    "documents": all_documents,
    "tasks": all_tasks,
    "evaluations": all_evaluations,
}

with open(
    file="ragas-inspectorraget-demo.json", mode="w", encoding="utf-8"
) as fp:
    json.dump(output, fp, indent=4)
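
As a quick sanity check on the exported structure, the per-metric scores can be averaged back out of the `evaluations` list. The sketch below does this on a toy payload that mirrors the `annotations` layout used above. `average_scores` is a hypothetical helper that takes a plain arithmetic mean, which is an assumption about what the `"average"` aggregator means rather than a statement of how InspectorRAGet computes it.

```python
import json
from statistics import mean


def average_scores(payload):
    """Average the 'system' value of each metric over all evaluations."""
    scores = {}
    for evaluation in payload["evaluations"]:
        for metric_name, annotation in evaluation["annotations"].items():
            scores.setdefault(metric_name, []).append(annotation["system"]["value"])
    return {name: mean(values) for name, values in scores.items()}


toy_payload = {
    "evaluations": [
        {"annotations": {"faithfulness": {"system": {"value": 1.0, "duration": 0}}}},
        {"annotations": {"faithfulness": {"system": {"value": 0.0, "duration": 0}}}},
    ]
}

# Round-trip through JSON text to mimic reading the exported file
reloaded = json.loads(json.dumps(toy_payload))
averages = average_scores(reloaded)
print(averages)  # {'faithfulness': 0.5}
```

For the file written above, the payload could be loaded with `json.load` from `ragas-inspectorraget-demo.json` and passed to `average_scores` in the same way.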