{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**DataMorgana** is a powerful tool for generating synthetic question-answering data, useful for both evaluating and training question-answering systems.\n",
"\n",
"If you're using DataMorgana for the first time, it's recommended to start with the [DataMorgana Sandbox](https://platform.ai71.ai/playground). The Sandbox provides an intuitive UI for generating individual question-answer pairs interactively.\n",
"\n",
"In this notebook, we'll explore how to use the DataMorgana API to generate large-scale synthetic question-answering data on FineWeb.\n",
"\n",
"For the full API documentation, refer to [this link](https://api.ai71.ai/redoc#tag/Synthetic-Conversations)."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import time\n",
"from typing import Dict, List\n",
"\n",
"import requests\n",
"\n",
"BASE_URL = \"https://api.ai71.ai/v1/\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, ensure that you have an API key for the AI71 platform."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"API_KEY = # Your API key"
]
},
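{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, you can avoid hard-coding the key by reading it from an environment variable. The cell below is a minimal sketch of this; the variable name `AI71_API_KEY` is just an assumption for this notebook, so use whatever name you set in your environment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# Optional: read the key from an environment variable instead of hard-coding it.\n",
"# The variable name AI71_API_KEY is an assumption; adjust it to your setup.\n",
"env_key = os.environ.get(\"AI71_API_KEY\")\n",
"if env_key:\n",
"    API_KEY = env_key"
]
},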
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The generation of the data is done using LLMs, which is costly. Therefore, you will have a limited amount of credits - each credit corresponds to a single generated question. \n",
"\n",
"You can use the `check_budget` endpoint to see the remaining credits for your organization."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"def check_budget():\n",
" resp = requests.get(\n",
" f\"{BASE_URL}check_budget\",\n",
" headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n",
" )\n",
" resp.raise_for_status()\n",
" print(json.dumps(resp.json(), indent=4))"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"remaining_budget\": 9987\n",
"}\n"
]
}
],
"source": [
"check_budget()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's see how to generate questions using the `bulk_generation endpoint`.\n",
"\n",
"This endpoint accepts three arguments: `n_questions`, `question_categorizations`, and `user_categorizations`.\n",
"\n",
"Since the endpoint is **asynchronous**, it returns only a `request_id`. To retrieve the generated questions once they are ready, we need to use the `fetch_generation_results` endpoint with the corresponding `request_id`."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"def bulk_generate(n_questions: int, question_categorizations: List[Dict], user_categorizations: List[Dict]):\n",
" resp = requests.post(\n",
" f\"{BASE_URL}bulk_generation\",\n",
" headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n",
" json={\n",
" \"n_questions\": n_questions,\n",
" \"question_categorizations\": question_categorizations,\n",
" \"user_categorizations\": user_categorizations\n",
" }\n",
" )\n",
" resp.raise_for_status()\n",
" request_id = resp.json()[\"request_id\"]\n",
" print(json.dumps(resp.json(), indent=4))\n",
"\n",
" result = wait_for_generation_to_finish(request_id)\n",
" return result\n",
"\n",
"\n",
"def wait_for_generation_to_finish(request_id: str):\n",
" while True:\n",
" resp = requests.get(\n",
" f\"{BASE_URL}fetch_generation_results\",\n",
" headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n",
" params={\"request_id\": request_id},\n",
" )\n",
" resp.raise_for_status()\n",
" if resp.json()[\"status\"] == \"completed\":\n",
" print(json.dumps(resp.json(), indent=4))\n",
" return resp.json()\n",
" else:\n",
" print(\"Waiting for generation to finish...\")\n",
" time.sleep(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's call the `bulk_generation` endpoint. In this example, we will define two question categorizations and one user categorization. \n",
"\n",
"When defining categorizations, keep in mind: \n",
"\n",
"- You can create your own categorizations—these are just examples. \n",
"- Each categorization can include as many categories as you like, as long as their probabilities sum to 1. \n",
"- The **descriptions** of the categories are injected into the LLM prompt during question generation. To ensure high-quality outputs, it’s important to write them clearly and thoughtfully. \n",
"\n",
"For the competition, you’ll want to evaluate and train your system on a diverse set of questions, since you won’t know in advance what types of questions will appear in the test. Keep in mind that the categorizations used in this notebook are just examples and will not correspond to those used to generate the actual test set."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"question_length_categorization = {\n",
" \"categorization_name\": \"question_length\",\n",
" \"categories\": [\n",
" {\n",
" \"name\": \"short\",\n",
" \"description\": \"a short question with no more than 6 words.\",\n",
" \"probability\": 0.4\n",
" },\n",
" {\n",
" \"name\": \"long\",\n",
" \"description\": \"a long question with at least 7 words.\",\n",
" \"probability\": 0.6\n",
" }\n",
" ]\n",
"}\n",
"\n",
"question_formulation_categorization = {\n",
" \"categorization_name\": \"question_formulation\",\n",
" \"categories\": [\n",
" {\n",
" \"name\": \"natural\",\n",
" \"description\": \"phrased in the way people typically speak, reflecting everyday language use, without formal or artificial structure.\",\n",
" \"probability\": 0.8\n",
" },\n",
" {\n",
" \"name\": \"search query\",\n",
" \"description\": \"phrased as a typed web query for search engines (only keywords, without punctuation and without a natural-sounding structure).\",\n",
" \"probability\": 0.2\n",
" }\n",
" ]\n",
"}\n",
"\n",
"user_expertise_categorization = {\n",
" \"categorization_name\": \"user_expertise\",\n",
" \"categories\": [\n",
" {\n",
" \"name\": \"expert\",\n",
" \"description\": \"an expert of the subject discussed in the document, therefore he asks complex questions.\",\n",
" \"probability\": 0.8\n",
" },\n",
" {\n",
" \"name\": \"common person\",\n",
" \"description\": \"a common person who is not expert of the subject discussed in the document, therefore he asks basic questions.\",\n",
" \"probability\": 0.2\n",
" }\n",
" ]\n",
"}"
]
},
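{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, the sketch below verifies that the category probabilities of each categorization sum to 1 before sending them to the API. This helper is not part of the DataMorgana API; it is just a local check."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def validate_categorization(categorization: Dict):\n",
"    # Local sanity check (not part of the DataMorgana API):\n",
"    # the category probabilities are expected to sum to 1.\n",
"    total = sum(category[\"probability\"] for category in categorization[\"categories\"])\n",
"    assert abs(total - 1.0) < 1e-9, (\n",
"        f\"{categorization['categorization_name']}: probabilities sum to {total}, expected 1\"\n",
"    )\n",
"\n",
"\n",
"for categorization in [question_length_categorization,\n",
"                       question_formulation_categorization,\n",
"                       user_expertise_categorization]:\n",
"    validate_categorization(categorization)"
]
},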
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, let's use these categorizations to generate 5 questions."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"request_id\": \"5d5f6002-2395-4ff9-95a2-83c37947f9ee\",\n",
" \"type\": \"async\"\n",
"}\n",
"Waiting for generation to finish...\n",
"Waiting for generation to finish...\n",
"Waiting for generation to finish...\n",
"Waiting for generation to finish...\n",
"Waiting for generation to finish...\n",
"Waiting for generation to finish...\n",
"Waiting for generation to finish...\n",
"Waiting for generation to finish...\n",
"{\n",
" \"status\": \"completed\",\n",
" \"file\": \"https://s3.amazonaws.com/data.aiir/data_morgana/web_api/results_id_d01f3d5e-9cef-4be8-a311-8b07d3a22c6f_user_id_ded43e4d-7723-49d3-ad51-6636cf5aefd2.jsonl?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIA2UC3AHBFWKQ2GES5%2F20250204%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250204T091912Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEA4aCXVzLWVhc3QtMSJIMEYCIQDtXc99Td%2BZFZ5JRPTjV9GHEB7zKrsxhjodL5WmngqN7AIhAM7Cp7PiP%2FvHEfZ0LYKeps6T7nTAKmBZJqy2wNJmbfW%2FKrsFCCcQABoMNzMwMzM1Mjk1NTYzIgy2SkG3NP09ZldZskwqmAVLZLbPkvHnGbkiGqacmLPE5%2FpMud8ZCJUcKedwpv4uu0R5FhbxEhWGWg2pXp%2Fsgz5mYqQkowG%2BdId0PcrhDwW%2FLD0SCVmd0P6Bp3ha5FCNp4ssvl4q6Mozbfq7U%2FBeOAvrDJQg1Z3ofOp%2FxVS%2BgVkRleRwhS36cOWaTeDaAMyIYJNmEnYkkBQ%2FzRTwWkaw0uID2sCVgHRELPE1k9U99PaQ0xWizWX3HnoLjBIsILRSnDYhqm%2B8Klttf4keD093tqLl3U6Clcd0nCEsLpHlpK8ScytwM8RrMbMdiydDtXcM2FsgLQtUA5Ks28qjBxO47C7s8CR4Oop3TS%2BcpWacHDlYEjCX6pTyyu5hhcTTzpl4SkCZ%2FxmQSIKoPy0GRjudzpeAficf9dSmdoCWH3kkeLPa1j6rWpGzTtqldm3lHfWZKOj191blFSF4r2hy95y64hu8EomFf2r3vkQAXC2ZN4DtbFA5MgTRo37tr0nlu%2Bnfxer72dJA9V8SAInQYP9nnFZdJilgTNUdD2E4jdhVz7oZGtWpJhiuYOTDD7%2B4UezNf3idDgBdYZ8bbt2ENCcMKgvpq89TezsPr3BOPux2sh4Y9JvN0rqsvoYu28eVfrGJC6JNL14s9SUy7FnTAY8lCIvHTGjXlVG0nQsSiEJwRG56C4TY77zO4HsReMe9%2B6Rl7s3siB%2B8atqhPhwZrdypOUbmltv8uWjtxHDwuDnaM5yJLTbFKoOK6CpKv8EC16TjszENlAYTAkUlUdvVlseZx80cAVSa7mtQvEEclR9namYo3Wkv6UlKXSYK5ir%2BWavTIE1tnyNBtKp37aTZJr9nmQNund1L6G4bj855NzeFvga0RA58UEQ7dFw0gRysOP7mgU1zsPP%2FZFJzMKHThr0GOrABQuCeKIWKait%2F0Q1YIjaq1jmibSqUI7pLveuxH3Nl3VXeQorntgj3Ucq7GBnZ9Y5rGCrOkDVsepOWAP9piZEBIiwGo7Gp%2BZfmUg%2B0qf9lVGsKVeJtbvpJTFyhHLrPdGMllIg7aq4fqjnXUyyHHlplwphe3ezKYYHTcAbJINGt95Ed5K3zBSe0DqNdiY9bpVrveIbcWU829IJXp4r939aH6pNuwZ3jlFXMAw2%2FY4BDk6g%3D&X-Amz-Signature=6e135cdd7cf2751d2e9231d04ad9a049ef9f1ac8c4fc23a83c42043eca8f870b\",\n",
" \"credits\": 5\n",
"}\n"
]
}
],
"source": [
"results = bulk_generate(n_questions=5,\n",
" question_categorizations=[question_length_categorization, question_formulation_categorization],\n",
" user_categorizations=[user_expertise_categorization]\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The API response includes a link to the file where the generated results are stored. \n",
"This link is valid for 12 hours. If you need to access the file after that time, you can call `fetch_generation_results` again with the original `request_id` to receive a new link.\n",
"\n",
"Let's retrieve the data from the file and print the results."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"response = requests.get(results[\"file\"])\n",
"qas = [json.loads(line) for line in response.text.splitlines()]"
]
},
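{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the download link expires, you may want to keep a local copy of the generated data. The cell below is a small sketch that writes it to a local JSONL file; the filename `datamorgana_qas.jsonl` is just an example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Persist the generated question-answer pairs locally so they remain available\n",
"# after the download link expires. The filename is an arbitrary example.\n",
"with open(\"datamorgana_qas.jsonl\", \"w\", encoding=\"utf-8\") as f:\n",
"    for qa in qas:\n",
"        f.write(json.dumps(qa) + \"\\n\")"
]
},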
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'question': 'How does ADMS handle network faults?',\n",
" 'answer': 'The system assists grid operators by suggesting remedial actions and restoration processes to isolate and restore affected network sections quickly after a fault. It also uses smart meter data for outage prediction, fault detection and clearance.',\n",
" 'context': [\"Siemens Smart Grid has introduced a comprehensive distribution management system specifically developed for enhancing and expanding smarter grids. Spectrum Power™ ADMS (Advanced Distribution Management System) combines SCADA (Supervisory Control and Data Acquisition), outage management, and fault and network analysis functions for the first time on a software platform within a consolidated user environment, the first of its kind in North America. This simplifies workflows and facilitates efficient data management. The system also allows network operators to not only control and monitor their distribution network more reliably, but also track and restore outages and carry out maintenance and repair work more efficiently.\\n“Siemens has developed its ADMS solution for North America to specifically address these challenges and achieve wide-spread build out of the smart grid. Utilities will now be able to make data more actionable, increasing operational efficiency for the utility itself and improving power reliability for end users.”\\nBy suggesting remedial actions and restoration processes, the system assists grid operators in isolating and restoring the affected network sections as quickly as possible after a fault. The system also leverages the intelligent use of smart meter data for outage prediction, fault detection and clearance, and for managing distributed energy sources making Spectrum Power ADMS a key component for enhancing and expanding smart grids.\\n“Energy systems worldwide are facing a growing range of challenges, especially in power distribution management,” said Dr. Jan Mrosik, CEO of Siemens’ Smart Grid Division. “Siemens has developed its ADMS solution for North America to specifically address these challenges and achieve wide-spread build out of the smart grid. Utilities will now be able to make data more actionable, increasing operational efficiency for the utility itself and improving power reliability for end users.”\\nSince it was developed for integration in a service-oriented architecture (SOA), the system can make use of services and data from other IT systems, such as network data from geographic information systems or load profiles from meter data management systems. In the same way, other IT systems can access services and data in the distribution network management system, like information for customer information systems about downtime in case of a malfunction, or work orders or switching jobs for the workforce management system. The SOA focused design concept allows Spectrum Power ADMS to be integrated efficiently in the user's IT environment, promoting business process optimization and work process automation.\\nThe system provides the grid operator with the right tools to comply with the requirements of international reliability standards such as NERC CIP (North American Electric Reliability Corporation – Critical Infrastructure Protection) or those of the US federal agency NIST (National Institute of Standards and Technology). 
Interoperability standards such as IEC 61970 (CIM, Common Information Model) and IEC 61968 (system interfaces for distribution networks) have also been taken into account in the development of Spectrum Power ADMS in order to facilitate the IT integration of the system in an existing infrastructure.\\nThe introduction of Siemens Spectrum Power™ ADMS solution is the culmination of many successfully deployed distribution management solutions as well as Siemens ability to integrate these solutions with the utility’s existing infrastructure. The fully integrated operational environment available from Siemens enables improved monitoring, control and maintenance for a reliable and efficient distribution network.\"],\n",
" 'question_categories': [{'categorization_name': 'question_length',\n",
" 'category_name': 'short'},\n",
" {'categorization_name': 'question_formulation', 'category_name': 'natural'}],\n",
" 'user_categories': [{'categorization_name': 'user_expertise',\n",
" 'category_name': 'expert'}],\n",
" 'document_ids': ['<urn:uuid:db2e6b90-78a6-418b-9c12-f19144174c74>']}"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"qas[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each generated result includes: \n",
"\n",
"- The generated **question** \n",
"- The generated **answer** \n",
"- The **context** (FineWeb documents) the question is based on \n",
"- The **IDs** of those documents \n",
"- The **question categories** used during generation \n",
"- The **user categories** used during generation "
]
},
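{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small illustration of this structure, the sketch below counts how many of the generated questions fall into each question category. It only relies on the `question_categories` field listed above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"# Count generated questions per (categorization, category) pair,\n",
"# using the question_categories field of each result.\n",
"category_counts = Counter(\n",
"    (qc[\"categorization_name\"], qc[\"category_name\"])\n",
"    for qa in qas\n",
"    for qc in qa[\"question_categories\"]\n",
")\n",
"for (categorization_name, category_name), count in category_counts.items():\n",
"    print(f\"{categorization_name} / {category_name}: {count}\")"
]
},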
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can track all your requests using the get_all_requests endpoint."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"def get_all_requests():\n",
" resp = requests.get(\n",
" f\"{BASE_URL}get_all_requests\",\n",
" headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n",
" )\n",
" resp.raise_for_status()\n",
" print(json.dumps(resp.json(), indent=4))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"get_all_requests()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}