**DataMorgana** is a powerful tool for generating synthetic question-answering data, useful for both evaluating and training question-answering systems.

If you're using DataMorgana for the first time, it's recommended to start with the [DataMorgana Sandbox](https://platform.ai71.ai/playground). The Sandbox provides an intuitive UI for generating individual question-answer pairs interactively.

In this notebook, we'll explore how to use the DataMorgana API to generate large-scale synthetic question-answering data on FineWeb.

For the full API documentation, refer to [this link](https://api.ai71.ai/redoc#tag/Synthetic-Conversations).

In [11]:
import json
import time
from typing import Dict, List

import requests

BASE_URL = "https://api.ai71.ai/v1/"

First, ensure that you have an API key for the AI71 platform.

In [12]:
API_KEY = "YOUR_API_KEY"  # Replace with your AI71 API key

Data generation relies on LLMs, which is costly, so you have a limited number of credits. Each credit corresponds to a single generated question.

You can use the `check_budget` endpoint to see the remaining credits for your organization.

In [13]:
def check_budget():
    resp = requests.get(
        f"{BASE_URL}check_budget",
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    resp.raise_for_status()
    print(json.dumps(resp.json(), indent=4))

In [14]:
check_budget()

{
    "remaining_budget": 9987
}
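
Since each credit corresponds to one generated question, it can help to compare the remaining budget with the number of questions you plan to request. The sketch below assumes you already hold the parsed JSON response in the shape printed above; the helper name `has_enough_credits` is hypothetical and not part of the API.

```python
def has_enough_credits(budget_response: dict, n_questions: int) -> bool:
    """Return True if the remaining budget covers n_questions (one credit each)."""
    return budget_response["remaining_budget"] >= n_questions


# Example using the response shape printed above
print(has_enough_credits({"remaining_budget": 9987}, 5))
```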


Now, let's see how to generate questions using the `bulk_generation` endpoint.

This endpoint accepts three arguments: `n_questions`, `question_categorizations`, and `user_categorizations`.

Since the endpoint is **asynchronous**, it returns only a `request_id`. To retrieve the generated questions once they are ready, we need to use the `fetch_generation_results` endpoint with the corresponding `request_id`.

In [31]:
def bulk_generate(n_questions: int, question_categorizations: List[Dict], user_categorizations: List[Dict]):
    resp = requests.post(
        f"{BASE_URL}bulk_generation",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
                "n_questions": n_questions,
                "question_categorizations": question_categorizations,
                "user_categorizations": user_categorizations
            }
    )
    resp.raise_for_status()
    request_id = resp.json()["request_id"]
    print(json.dumps(resp.json(), indent=4))

    result = wait_for_generation_to_finish(request_id)
    return result


def wait_for_generation_to_finish(request_id: str):
    while True:
        resp = requests.get(
            f"{BASE_URL}fetch_generation_results",
            headers={"Authorization": f"Bearer {API_KEY}"},
            params={"request_id": request_id},
        )
        resp.raise_for_status()
        payload = resp.json()  # Parse the response body once
        if payload["status"] == "completed":
            print(json.dumps(payload, indent=4))
            return payload
        print("Waiting for generation to finish...")
        time.sleep(5)
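
The loop above polls until the status becomes `completed`, with no upper bound on the wait. For long-running jobs you may prefer a variant with a timeout. The following is a minimal sketch, not part of the DataMorgana API: `poll_until_completed`, `fetch_status`, and the default interval and timeout values are all assumptions made for illustration. `fetch_status` stands for any zero-argument callable that returns the parsed JSON of a `fetch_generation_results` call.

```python
import time
from typing import Callable, Dict


def poll_until_completed(
    fetch_status: Callable[[], Dict],
    interval_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
) -> Dict:
    """Call fetch_status repeatedly until it reports "completed" or the timeout passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        payload = fetch_status()
        if payload.get("status") == "completed":
            return payload
        # Give up once the deadline has passed instead of waiting forever
        if time.monotonic() >= deadline:
            raise TimeoutError("Generation did not complete within the timeout.")
        time.sleep(interval_seconds)
```

Passing the fetch step as a callable keeps the waiting logic separate from the HTTP request, so the loop can be exercised without network access.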

Let's call the `bulk_generation` endpoint. In this example, we will define two question categorizations and one user categorization.  

When defining categorizations, keep in mind:  

- You can create your own categorizations—these are just examples.  
- Each categorization can include as many categories as you like, as long as their probabilities sum to 1.  
- The **descriptions** of the categories are injected into the LLM prompt during question generation. To ensure high-quality outputs, it’s important to write them clearly and thoughtfully.  
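
Because the probabilities within each categorization must sum to 1, a quick local check before sending a request can catch mistakes early. This is a minimal sketch: the helper `validate_categorization`, its tolerance, and the example categorization are hypothetical, and how the API itself reacts to invalid probabilities is not covered here.

```python
import math
from typing import Dict


def validate_categorization(categorization: Dict, tolerance: float = 1e-9) -> None:
    """Raise ValueError if category probabilities are out of range or do not sum to 1."""
    probabilities = [c["probability"] for c in categorization["categories"]]
    if any(p < 0 or p > 1 for p in probabilities):
        raise ValueError("Each probability must be between 0 and 1.")
    total = sum(probabilities)
    if not math.isclose(total, 1.0, abs_tol=tolerance):
        name = categorization["categorization_name"]
        raise ValueError(f"Probabilities in '{name}' sum to {total}, expected 1.")


# Hypothetical categorization, used only to exercise the check
example_categorization = {
    "categorization_name": "example",
    "categories": [
        {"name": "a", "description": "first category", "probability": 0.4},
        {"name": "b", "description": "second category", "probability": 0.6},
    ],
}
validate_categorization(example_categorization)  # Passes silently
```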

For the competition, you’ll want to evaluate and train your system on a diverse set of questions, since you won’t know in advance what types of questions will appear in the test. Keep in mind that the categorizations used in this notebook are just examples and will not correspond to those used to generate the actual test set.

In [28]:
question_length_categorization = {
    "categorization_name": "question_length",
    "categories": [
        {
            "name": "short",
            "description": "a short question with no more than 6 words.",
            "probability": 0.4
        },
        {
            "name": "long",
            "description": "a long question with at least 7 words.",
            "probability": 0.6
        }
    ]
}

question_formulation_categorization = {
    "categorization_name": "question_formulation",
    "categories": [
        {
            "name": "natural",
            "description": "phrased in the way people typically speak, reflecting everyday language use, without formal or artificial structure.",
            "probability": 0.8
        },
        {
            "name": "search query",
            "description": "phrased as a typed web query for search engines (only keywords, without punctuation and without a natural-sounding structure).",
            "probability": 0.2
        }
    ]
}

user_expertise_categorization = {
    "categorization_name": "user_expertise",
    "categories": [
        {
            "name": "expert",
            "description": "an expert of the subject discussed in the document, therefore he asks complex questions.",
            "probability": 0.8
        },
        {
            "name": "common person",
            "description": "a common person who is not expert of the subject discussed in the document, therefore he asks basic questions.",
            "probability": 0.2
        }
    ]
}

For example, let's use these categorizations to generate 5 questions.

In [37]:
results = bulk_generate(n_questions=5,
                         question_categorizations=[question_length_categorization, question_formulation_categorization],
                         user_categorizations=[user_expertise_categorization]
                         )

{
    "request_id": "5d5f6002-2395-4ff9-95a2-83c37947f9ee",
    "type": "async"
}
Waiting for generation to finish...
Waiting for generation to finish...
Waiting for generation to finish...
Waiting for generation to finish...
Waiting for generation to finish...
Waiting for generation to finish...
Waiting for generation to finish...
Waiting for generation to finish...
{
    "status": "completed",
    "file": "https://s3.amazonaws.com/data.aiir/data_morgana/web_api/results_id_d01f3d5e-9cef-4be8-a311-8b07d3a22c6f_user_id_ded43e4d-7723-49d3-ad51-6636cf5aefd2.jsonl?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIA2UC3AHBFWKQ2GES5%2F20250204%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250204T091912Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEA4aCXVzLWVhc3QtMSJIMEYCIQDtXc99Td%2BZFZ5JRPTjV9GHEB7zKrsxhjodL5WmngqN7AIhAM7Cp7PiP%2FvHEfZ0LYKeps6T7nTAKmBZJqy2wNJmbfW%2FKrsFCCcQABoMNzMwMzM1Mjk1NTYzIgy2SkG3NP09ZldZskwqmAVLZLbPkvHnGbkiGqacmLPE5%2FpMud8ZCJUcK

The API response includes a link to the file where the generated results are stored. 
This link is valid for 12 hours. If you need to access the file after that time, you can call `fetch_generation_results` again with the original `request_id` to receive a new link.

Let's retrieve the data from the file and print the results.

In [39]:
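response = requests.get(results["file"])
response.raise_for_status()  # Fail early if the download link has expired or is invalid
# The file is in JSON Lines format: one JSON object per non-empty line
qas = [json.loads(line) for line in response.text.splitlines() if line.strip()]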

In [41]:
qas[0]

{'question': 'How does ADMS handle network faults?',
 'answer': 'The system assists grid operators by suggesting remedial actions and restoration processes to isolate and restore affected network sections quickly after a fault. It also uses smart meter data for outage prediction, fault detection and clearance.',
 'context': ["Siemens Smart Grid has introduced a comprehensive distribution management system specifically developed for enhancing and expanding smarter grids. Spectrum Power™ ADMS (Advanced Distribution Management System) combines SCADA (Supervisory Control and Data Acquisition), outage management, and fault and network analysis functions for the first time on a software platform within a consolidated user environment, the first of its kind in North America. This simplifies workflows and facilitates efficient data management. The system also allows network operators to not only control and monitor their distribution network more reliably, but also track and restore outages an

Each generated result includes:  

- The generated **question**  
- The generated **answer**  
- The **context** (FineWeb documents) the question is based on  
- The **IDs** of those documents  
- The **question categories** used during generation  
- The **user categories** used during generation  
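
As a quick sanity check on the generated data, you can compare the questions against the length definitions used earlier (short: no more than 6 words, long: at least 7 words). The sketch below relies only on the `question` field; the helper name `count_by_length` is hypothetical, and the second sample question is made up for illustration.

```python
from collections import Counter
from typing import Dict, List


def count_by_length(qas: List[Dict], short_max_words: int = 6) -> Counter:
    """Count questions as "short" or "long" by whitespace-separated word count."""
    counts = Counter()
    for qa in qas:
        n_words = len(qa["question"].split())
        counts["short" if n_words <= short_max_words else "long"] += 1
    return counts


sample_qas = [
    {"question": "How does ADMS handle network faults?"},
    # Made-up question, for illustration only
    {"question": "What functions does an advanced distribution management system combine on one platform?"},
]
print(count_by_length(sample_qas))
```

With a larger batch, the resulting proportions can be compared against the probabilities set in the corresponding categorization.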

You can track all your requests using the `get_all_requests` endpoint.

In [42]:
def get_all_requests():
    resp = requests.get(
        f"{BASE_URL}get_all_requests",
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    resp.raise_for_status()
    print(json.dumps(resp.json(), indent=4))

In [None]:
get_all_requests()