{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "**DataMorgana** is a powerful tool for generating synthetic question-answering data, useful for both evaluating and training question-answering systems.\n", "\n", "If you're using DataMorgana for the first time, we recommend starting with the [DataMorgana Sandbox](https://platform.ai71.ai/playground). The Sandbox provides an intuitive UI for generating individual question-answer pairs interactively.\n", "\n", "In this notebook, we'll explore how to use the DataMorgana API to generate large-scale synthetic question-answering data on FineWeb.\n", "\n", "For the full API documentation, see the [API reference](https://api.ai71.ai/redoc#tag/Synthetic-Conversations)." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "import json\n", "import time\n", "from typing import Dict, List\n", "\n", "import requests\n", "\n", "BASE_URL = \"https://api.ai71.ai/v1/\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, ensure that you have an API key for the AI71 platform." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "API_KEY = \"\"  # Replace with your AI71 API key" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data is generated using LLMs, which is costly, so you have a limited number of credits. Each credit corresponds to a single generated question.\n", "\n", "You can use the `check_budget` endpoint to see the remaining credits for your organization."
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "def check_budget():\n", "    \"\"\"Print the remaining credits for your organization.\"\"\"\n", "    resp = requests.get(\n", "        f\"{BASE_URL}check_budget\",\n", "        headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n", "    )\n", "    resp.raise_for_status()\n", "    print(json.dumps(resp.json(), indent=4))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", "    \"remaining_budget\": 9987\n", "}\n" ] } ], "source": [ "check_budget()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's see how to generate questions using the `bulk_generation` endpoint.\n", "\n", "This endpoint accepts three arguments: `n_questions`, `question_categorizations`, and `user_categorizations`.\n", "\n", "Since the endpoint is **asynchronous**, it returns only a `request_id`. To retrieve the generated questions once they are ready, we need to use the `fetch_generation_results` endpoint with the corresponding `request_id`."
] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "def bulk_generate(n_questions: int, question_categorizations: List[Dict], user_categorizations: List[Dict]):\n", "    \"\"\"Submit a bulk generation request and block until the results are ready.\"\"\"\n", "    resp = requests.post(\n", "        f\"{BASE_URL}bulk_generation\",\n", "        headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n", "        json={\n", "            \"n_questions\": n_questions,\n", "            \"question_categorizations\": question_categorizations,\n", "            \"user_categorizations\": user_categorizations\n", "        }\n", "    )\n", "    resp.raise_for_status()\n", "    request_id = resp.json()[\"request_id\"]\n", "    print(json.dumps(resp.json(), indent=4))\n", "\n", "    result = wait_for_generation_to_finish(request_id)\n", "    return result\n", "\n", "\n", "def wait_for_generation_to_finish(request_id: str):\n", "    \"\"\"Poll fetch_generation_results every 5 seconds until the request is completed.\"\"\"\n", "    while True:\n", "        resp = requests.get(\n", "            f\"{BASE_URL}fetch_generation_results\",\n", "            headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n", "            params={\"request_id\": request_id},\n", "        )\n", "        resp.raise_for_status()\n", "        if resp.json()[\"status\"] == \"completed\":\n", "            print(json.dumps(resp.json(), indent=4))\n", "            return resp.json()\n", "        else:\n", "            print(\"Waiting for generation to finish...\")\n", "            time.sleep(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's call the `bulk_generation` endpoint. In this example, we will define two question categorizations and one user categorization.\n", "\n", "When defining categorizations, keep in mind:\n", "\n", "- You can create your own categorizations; these are just examples.\n", "- Each categorization can include as many categories as you like, as long as their probabilities sum to 1.\n", "- The **descriptions** of the categories are injected into the LLM prompt during question generation. To ensure high-quality outputs, it’s important to write them clearly and thoughtfully.
\n", "\n", "For the competition, you’ll want to evaluate and train your system on a diverse set of questions, since you won’t know in advance what types of questions will appear in the test. Keep in mind that the categorizations used in this notebook are just examples and will not correspond to those used to generate the actual test set." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "question_length_categorization = {\n", " \"categorization_name\": \"question_length\",\n", " \"categories\": [\n", " {\n", " \"name\": \"short\",\n", " \"description\": \"a short question with no more than 6 words.\",\n", " \"probability\": 0.4\n", " },\n", " {\n", " \"name\": \"long\",\n", " \"description\": \"a long question with at least 7 words.\",\n", " \"probability\": 0.6\n", " }\n", " ]\n", "}\n", "\n", "question_formulation_categorization = {\n", " \"categorization_name\": \"question_formulation\",\n", " \"categories\": [\n", " {\n", " \"name\": \"natural\",\n", " \"description\": \"phrased in the way people typically speak, reflecting everyday language use, without formal or artificial structure.\",\n", " \"probability\": 0.8\n", " },\n", " {\n", " \"name\": \"search query\",\n", " \"description\": \"phrased as a typed web query for search engines (only keywords, without punctuation and without a natural-sounding structure).\",\n", " \"probability\": 0.2\n", " }\n", " ]\n", "}\n", "\n", "user_expertise_categorization = {\n", " \"categorization_name\": \"user_expertise\",\n", " \"categories\": [\n", " {\n", " \"name\": \"expert\",\n", " \"description\": \"an expert of the subject discussed in the document, therefore he asks complex questions.\",\n", " \"probability\": 0.8\n", " },\n", " {\n", " \"name\": \"common person\",\n", " \"description\": \"a common person who is not expert of the subject discussed in the document, therefore he asks basic questions.\",\n", " \"probability\": 0.2\n", " }\n", " ]\n", "}" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "For example, let's use these categorizations to generate 5 questions." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"request_id\": \"5d5f6002-2395-4ff9-95a2-83c37947f9ee\",\n", " \"type\": \"async\"\n", "}\n", "Waiting for generation to finish...\n", "Waiting for generation to finish...\n", "Waiting for generation to finish...\n", "Waiting for generation to finish...\n", "Waiting for generation to finish...\n", "Waiting for generation to finish...\n", "Waiting for generation to finish...\n", "Waiting for generation to finish...\n", "{\n", " \"status\": \"completed\",\n", " \"file\": \"https://s3.amazonaws.com/data.aiir/data_morgana/web_api/results_id_d01f3d5e-9cef-4be8-a311-8b07d3a22c6f_user_id_ded43e4d-7723-49d3-ad51-6636cf5aefd2.jsonl?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIA2UC3AHBFWKQ2GES5%2F20250204%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250204T091912Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEA4aCXVzLWVhc3QtMSJIMEYCIQDtXc99Td%2BZFZ5JRPTjV9GHEB7zKrsxhjodL5WmngqN7AIhAM7Cp7PiP%2FvHEfZ0LYKeps6T7nTAKmBZJqy2wNJmbfW%2FKrsFCCcQABoMNzMwMzM1Mjk1NTYzIgy2SkG3NP09ZldZskwqmAVLZLbPkvHnGbkiGqacmLPE5%2FpMud8ZCJUcKedwpv4uu0R5FhbxEhWGWg2pXp%2Fsgz5mYqQkowG%2BdId0PcrhDwW%2FLD0SCVmd0P6Bp3ha5FCNp4ssvl4q6Mozbfq7U%2FBeOAvrDJQg1Z3ofOp%2FxVS%2BgVkRleRwhS36cOWaTeDaAMyIYJNmEnYkkBQ%2FzRTwWkaw0uID2sCVgHRELPE1k9U99PaQ0xWizWX3HnoLjBIsILRSnDYhqm%2B8Klttf4keD093tqLl3U6Clcd0nCEsLpHlpK8ScytwM8RrMbMdiydDtXcM2FsgLQtUA5Ks28qjBxO47C7s8CR4Oop3TS%2BcpWacHDlYEjCX6pTyyu5hhcTTzpl4SkCZ%2FxmQSIKoPy0GRjudzpeAficf9dSmdoCWH3kkeLPa1j6rWpGzTtqldm3lHfWZKOj191blFSF4r2hy95y64hu8EomFf2r3vkQAXC2ZN4DtbFA5MgTRo37tr0nlu%2Bnfxer72dJA9V8SAInQYP9nnFZdJilgTNUdD2E4jdhVz7oZGtWpJhiuYOTDD7%2B4UezNf3idDgBdYZ8bbt2ENCcMKgvpq89TezsPr3BOPux2sh4Y9JvN0rqsvoYu28eVfrGJC6JNL14s9SUy7FnTAY8lCIvHTGjXlVG0nQsSiEJwRG56C4TY77zO4HsReMe9%2B6Rl7s3siB
%2B8atqhPhwZrdypOUbmltv8uWjtxHDwuDnaM5yJLTbFKoOK6CpKv8EC16TjszENlAYTAkUlUdvVlseZx80cAVSa7mtQvEEclR9namYo3Wkv6UlKXSYK5ir%2BWavTIE1tnyNBtKp37aTZJr9nmQNund1L6G4bj855NzeFvga0RA58UEQ7dFw0gRysOP7mgU1zsPP%2FZFJzMKHThr0GOrABQuCeKIWKait%2F0Q1YIjaq1jmibSqUI7pLveuxH3Nl3VXeQorntgj3Ucq7GBnZ9Y5rGCrOkDVsepOWAP9piZEBIiwGo7Gp%2BZfmUg%2B0qf9lVGsKVeJtbvpJTFyhHLrPdGMllIg7aq4fqjnXUyyHHlplwphe3ezKYYHTcAbJINGt95Ed5K3zBSe0DqNdiY9bpVrveIbcWU829IJXp4r939aH6pNuwZ3jlFXMAw2%2FY4BDk6g%3D&X-Amz-Signature=6e135cdd7cf2751d2e9231d04ad9a049ef9f1ac8c4fc23a83c42043eca8f870b\",\n", " \"credits\": 5\n", "}\n" ] } ], "source": [ "results = bulk_generate(n_questions=5,\n", " question_categorizations=[question_length_categorization, question_formulation_categorization],\n", " user_categorizations=[user_expertise_categorization]\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The API response includes a link to the file where the generated results are stored. \n", "This link is valid for 12 hours. If you need to access the file after that time, you can call `fetch_generation_results` again with the original `request_id` to receive a new link.\n", "\n", "Let's retrieve the data from the file and print the results." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "response = requests.get(results[\"file\"])\n", "qas = [json.loads(line) for line in response.text.splitlines()]" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'question': 'How does ADMS handle network faults?',\n", " 'answer': 'The system assists grid operators by suggesting remedial actions and restoration processes to isolate and restore affected network sections quickly after a fault. 
It also uses smart meter data for outage prediction, fault detection and clearance.',\n", " 'context': [\"Siemens Smart Grid has introduced a comprehensive distribution management system specifically developed for enhancing and expanding smarter grids. Spectrum Power™ ADMS (Advanced Distribution Management System) combines SCADA (Supervisory Control and Data Acquisition), outage management, and fault and network analysis functions for the first time on a software platform within a consolidated user environment, the first of its kind in North America. This simplifies workflows and facilitates efficient data management. The system also allows network operators to not only control and monitor their distribution network more reliably, but also track and restore outages and carry out maintenance and repair work more efficiently.\\n“Siemens has developed its ADMS solution for North America to specifically address these challenges and achieve wide-spread build out of the smart grid. Utilities will now be able to make data more actionable, increasing operational efficiency for the utility itself and improving power reliability for end users.”\\nBy suggesting remedial actions and restoration processes, the system assists grid operators in isolating and restoring the affected network sections as quickly as possible after a fault. The system also leverages the intelligent use of smart meter data for outage prediction, fault detection and clearance, and for managing distributed energy sources making Spectrum Power ADMS a key component for enhancing and expanding smart grids.\\n“Energy systems worldwide are facing a growing range of challenges, especially in power distribution management,” said Dr. Jan Mrosik, CEO of Siemens’ Smart Grid Division. “Siemens has developed its ADMS solution for North America to specifically address these challenges and achieve wide-spread build out of the smart grid. 
Utilities will now be able to make data more actionable, increasing operational efficiency for the utility itself and improving power reliability for end users.”\\nSince it was developed for integration in a service-oriented architecture (SOA), the system can make use of services and data from other IT systems, such as network data from geographic information systems or load profiles from meter data management systems. In the same way, other IT systems can access services and data in the distribution network management system, like information for customer information systems about downtime in case of a malfunction, or work orders or switching jobs for the workforce management system. The SOA focused design concept allows Spectrum Power ADMS to be integrated efficiently in the user's IT environment, promoting business process optimization and work process automation.\\nThe system provides the grid operator with the right tools to comply with the requirements of international reliability standards such as NERC CIP (North American Electric Reliability Corporation – Critical Infrastructure Protection) or those of the US federal agency NIST (National Institute of Standards and Technology). Interoperability standards such as IEC 61970 (CIM, Common Information Model) and IEC 61968 (system interfaces for distribution networks) have also been taken into account in the development of Spectrum Power ADMS in order to facilitate the IT integration of the system in an existing infrastructure.\\nThe introduction of Siemens Spectrum Power™ ADMS solution is the culmination of many successfully deployed distribution management solutions as well as Siemens ability to integrate these solutions with the utility’s existing infrastructure. 
The fully integrated operational environment available from Siemens enables improved monitoring, control and maintenance for a reliable and efficient distribution network.\"],\n", " 'question_categories': [{'categorization_name': 'question_length',\n", " 'category_name': 'short'},\n", " {'categorization_name': 'question_formulation', 'category_name': 'natural'}],\n", " 'user_categories': [{'categorization_name': 'user_expertise',\n", " 'category_name': 'expert'}],\n", " 'document_ids': ['']}" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "qas[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each generated result includes: \n", "\n", "- The generated **question** \n", "- The generated **answer** \n", "- The **context** (FineWeb documents) the question is based on \n", "- The **IDs** of those documents \n", "- The **question categories** used during generation \n", "- The **user categories** used during generation " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can track all your requests using the `get_all_requests` endpoint." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "def get_all_requests():\n", "    \"\"\"Print all of your generation requests.\"\"\"\n", "    resp = requests.get(\n", "        f\"{BASE_URL}get_all_requests\",\n", "        headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n", "    )\n", "    resp.raise_for_status()\n", "    print(json.dumps(resp.json(), indent=4))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "get_all_requests()" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.8" } }, "nbformat": 4, "nbformat_minor": 2 }