File size: 17,914 Bytes
1232fcb
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**DataMorgana** is a powerful tool for generating synthetic question-answering data, useful for both evaluating and training question-answering systems.\n",
    "\n",
    "If you're using DataMorgana for the first time, it's recommended to start with the [DataMorgana Sandbox](https://platform.ai71.ai/playground). The Sandbox provides an intuitive UI for generating individual question-answer pairs interactively.\n",
    "\n",
    "In this notebook, we'll explore how to use the DataMorgana API to generate large-scale synthetic question-answering data on FineWeb.\n",
    "\n",
    "For the full API documentation, refer to [this link](https://api.ai71.ai/redoc#tag/Synthetic-Conversations)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import time\n",
    "from typing import Dict, List\n",
    "\n",
    "import requests\n",
    "\n",
    "BASE_URL = \"https://api.ai71.ai/v1/\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, ensure that you have an API key for the AI71 platform."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "API_KEY = # Your API key"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The generation of the data is done using LLMs, which is costly. Therefore, you will have a limited amount of credits - each credit corresponds to a single generated question. \n",
    "\n",
    "You can use the `check_budget` endpoint to see the remaining credits for your organization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "def check_budget():\n",
    "    resp = requests.get(\n",
    "        f\"{BASE_URL}check_budget\",\n",
    "        headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n",
    "    )\n",
    "    resp.raise_for_status()\n",
    "    print(json.dumps(resp.json(), indent=4))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "    \"remaining_budget\": 9987\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "check_budget()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's see how to generate questions using the `bulk_generation endpoint`.\n",
    "\n",
    "This endpoint accepts three arguments: `n_questions`, `question_categorizations`, and `user_categorizations`.\n",
    "\n",
    "Since the endpoint is **asynchronous**, it returns only a `request_id`. To retrieve the generated questions once they are ready, we need to use the `fetch_generation_results` endpoint with the corresponding `request_id`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "def bulk_generate(n_questions: int, question_categorizations: List[Dict], user_categorizations: List[Dict]):\n",
    "    resp = requests.post(\n",
    "        f\"{BASE_URL}bulk_generation\",\n",
    "        headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n",
    "        json={\n",
    "                \"n_questions\": n_questions,\n",
    "                \"question_categorizations\": question_categorizations,\n",
    "                \"user_categorizations\": user_categorizations\n",
    "            }\n",
    "    )\n",
    "    resp.raise_for_status()\n",
    "    request_id = resp.json()[\"request_id\"]\n",
    "    print(json.dumps(resp.json(), indent=4))\n",
    "\n",
    "    result = wait_for_generation_to_finish(request_id)\n",
    "    return result\n",
    "\n",
    "\n",
    "def wait_for_generation_to_finish(request_id: str):\n",
    "    while True:\n",
    "        resp = requests.get(\n",
    "            f\"{BASE_URL}fetch_generation_results\",\n",
    "            headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n",
    "            params={\"request_id\": request_id},\n",
    "        )\n",
    "        resp.raise_for_status()\n",
    "        if resp.json()[\"status\"] == \"completed\":\n",
    "            print(json.dumps(resp.json(), indent=4))\n",
    "            return resp.json()\n",
    "        else:\n",
    "            print(\"Waiting for generation to finish...\")\n",
    "            time.sleep(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's call the `bulk_generation` endpoint. In this example, we will define two question categorizations and one user categorization.  \n",
    "\n",
    "When defining categorizations, keep in mind:  \n",
    "\n",
    "- You can create your own categorizations—these are just examples.  \n",
    "- Each categorization can include as many categories as you like, as long as their probabilities sum to 1.  \n",
    "- The **descriptions** of the categories are injected into the LLM prompt during question generation. To ensure high-quality outputs, it’s important to write them clearly and thoughtfully.  \n",
    "\n",
    "For the competition, you’ll want to evaluate and train your system on a diverse set of questions, since you won’t know in advance what types of questions will appear in the test. Keep in mind that the categorizations used in this notebook are just examples and will not correspond to those used to generate the actual test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "question_length_categorization = {\n",
    "    \"categorization_name\": \"question_length\",\n",
    "    \"categories\": [\n",
    "        {\n",
    "            \"name\": \"short\",\n",
    "            \"description\": \"a short question with no more than 6 words.\",\n",
    "            \"probability\": 0.4\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"long\",\n",
    "            \"description\": \"a long question with at least 7 words.\",\n",
    "            \"probability\": 0.6\n",
    "        }\n",
    "    ]\n",
    "}\n",
    "\n",
    "question_formulation_categorization = {\n",
    "    \"categorization_name\": \"question_formulation\",\n",
    "    \"categories\": [\n",
    "        {\n",
    "            \"name\": \"natural\",\n",
    "            \"description\": \"phrased in the way people typically speak, reflecting everyday language use, without formal or artificial structure.\",\n",
    "            \"probability\": 0.8\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"search query\",\n",
    "            \"description\": \"phrased as a typed web query for search engines (only keywords, without punctuation and without a natural-sounding structure).\",\n",
    "            \"probability\": 0.2\n",
    "        }\n",
    "    ]\n",
    "}\n",
    "\n",
    "user_expertise_categorization = {\n",
    "    \"categorization_name\": \"user_expertise\",\n",
    "    \"categories\": [\n",
    "        {\n",
    "            \"name\": \"expert\",\n",
    "            \"description\": \"an expert of the subject discussed in the document, therefore he asks complex questions.\",\n",
    "            \"probability\": 0.8\n",
    "        },\n",
    "        {\n",
    "        \"name\": \"common person\",\n",
    "            \"description\": \"a common person who is not expert of the subject discussed in the document, therefore he asks basic questions.\",\n",
    "            \"probability\": 0.2\n",
    "        }\n",
    "    ]\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For example, let's use these categorizations to generate 5 questions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "    \"request_id\": \"5d5f6002-2395-4ff9-95a2-83c37947f9ee\",\n",
      "    \"type\": \"async\"\n",
      "}\n",
      "Waiting for generation to finish...\n",
      "Waiting for generation to finish...\n",
      "Waiting for generation to finish...\n",
      "Waiting for generation to finish...\n",
      "Waiting for generation to finish...\n",
      "Waiting for generation to finish...\n",
      "Waiting for generation to finish...\n",
      "Waiting for generation to finish...\n",
      "{\n",
      "    \"status\": \"completed\",\n",
      "    \"file\": \"https://s3.amazonaws.com/data.aiir/data_morgana/web_api/results_id_d01f3d5e-9cef-4be8-a311-8b07d3a22c6f_user_id_ded43e4d-7723-49d3-ad51-6636cf5aefd2.jsonl?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIA2UC3AHBFWKQ2GES5%2F20250204%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250204T091912Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEA4aCXVzLWVhc3QtMSJIMEYCIQDtXc99Td%2BZFZ5JRPTjV9GHEB7zKrsxhjodL5WmngqN7AIhAM7Cp7PiP%2FvHEfZ0LYKeps6T7nTAKmBZJqy2wNJmbfW%2FKrsFCCcQABoMNzMwMzM1Mjk1NTYzIgy2SkG3NP09ZldZskwqmAVLZLbPkvHnGbkiGqacmLPE5%2FpMud8ZCJUcKedwpv4uu0R5FhbxEhWGWg2pXp%2Fsgz5mYqQkowG%2BdId0PcrhDwW%2FLD0SCVmd0P6Bp3ha5FCNp4ssvl4q6Mozbfq7U%2FBeOAvrDJQg1Z3ofOp%2FxVS%2BgVkRleRwhS36cOWaTeDaAMyIYJNmEnYkkBQ%2FzRTwWkaw0uID2sCVgHRELPE1k9U99PaQ0xWizWX3HnoLjBIsILRSnDYhqm%2B8Klttf4keD093tqLl3U6Clcd0nCEsLpHlpK8ScytwM8RrMbMdiydDtXcM2FsgLQtUA5Ks28qjBxO47C7s8CR4Oop3TS%2BcpWacHDlYEjCX6pTyyu5hhcTTzpl4SkCZ%2FxmQSIKoPy0GRjudzpeAficf9dSmdoCWH3kkeLPa1j6rWpGzTtqldm3lHfWZKOj191blFSF4r2hy95y64hu8EomFf2r3vkQAXC2ZN4DtbFA5MgTRo37tr0nlu%2Bnfxer72dJA9V8SAInQYP9nnFZdJilgTNUdD2E4jdhVz7oZGtWpJhiuYOTDD7%2B4UezNf3idDgBdYZ8bbt2ENCcMKgvpq89TezsPr3BOPux2sh4Y9JvN0rqsvoYu28eVfrGJC6JNL14s9SUy7FnTAY8lCIvHTGjXlVG0nQsSiEJwRG56C4TY77zO4HsReMe9%2B6Rl7s3siB%2B8atqhPhwZrdypOUbmltv8uWjtxHDwuDnaM5yJLTbFKoOK6CpKv8EC16TjszENlAYTAkUlUdvVlseZx80cAVSa7mtQvEEclR9namYo3Wkv6UlKXSYK5ir%2BWavTIE1tnyNBtKp37aTZJr9nmQNund1L6G4bj855NzeFvga0RA58UEQ7dFw0gRysOP7mgU1zsPP%2FZFJzMKHThr0GOrABQuCeKIWKait%2F0Q1YIjaq1jmibSqUI7pLveuxH3Nl3VXeQorntgj3Ucq7GBnZ9Y5rGCrOkDVsepOWAP9piZEBIiwGo7Gp%2BZfmUg%2B0qf9lVGsKVeJtbvpJTFyhHLrPdGMllIg7aq4fqjnXUyyHHlplwphe3ezKYYHTcAbJINGt95Ed5K3zBSe0DqNdiY9bpVrveIbcWU829IJXp4r939aH6pNuwZ3jlFXMAw2%2FY4BDk6g%3D&X-Amz-Signature=6e135cdd7cf2751d2e9231d04ad9a049ef9f1ac8c4fc23a83c42043eca8f870b\",\n",
      "    \"credits\": 5\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "results = bulk_generate(n_questions=5,\n",
    "                         question_categorizations=[question_length_categorization, question_formulation_categorization],\n",
    "                         user_categorizations=[user_expertise_categorization]\n",
    "                         )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The API response includes a link to the file where the generated results are stored. \n",
    "This link is valid for 12 hours. If you need to access the file after that time, you can call `fetch_generation_results` again with the original `request_id` to receive a new link.\n",
    "\n",
    "Let's retrieve the data from the file and print the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "response = requests.get(results[\"file\"])\n",
    "qas = [json.loads(line) for line in response.text.splitlines()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'question': 'How does ADMS handle network faults?',\n",
       " 'answer': 'The system assists grid operators by suggesting remedial actions and restoration processes to isolate and restore affected network sections quickly after a fault. It also uses smart meter data for outage prediction, fault detection and clearance.',\n",
       " 'context': [\"Siemens Smart Grid has introduced a comprehensive distribution management system specifically developed for enhancing and expanding smarter grids. Spectrum PowerADMS (Advanced Distribution Management System) combines SCADA (Supervisory Control and Data Acquisition), outage management, and fault and network analysis functions for the first time on a software platform within a consolidated user environment, the first of its kind in North America. This simplifies workflows and facilitates efficient data management. The system also allows network operators to not only control and monitor their distribution network more reliably, but also track and restore outages and carry out maintenance and repair work more efficiently.\\n“Siemens has developed its ADMS solution for North America to specifically address these challenges and achieve wide-spread build out of the smart grid. Utilities will now be able to make data more actionable, increasing operational efficiency for the utility itself and improving power reliability for end users.”\\nBy suggesting remedial actions and restoration processes, the system assists grid operators in isolating and restoring the affected network sections as quickly as possible after a fault. The system also leverages the intelligent use of smart meter data for outage prediction, fault detection and clearance, and for managing distributed energy sources making Spectrum Power ADMS a key component for enhancing and expanding smart grids.\\n“Energy systems worldwide are facing a growing range of challenges, especially in power distribution management,” said Dr. Jan Mrosik, CEO of SiemensSmart Grid Division. “Siemens has developed its ADMS solution for North America to specifically address these challenges and achieve wide-spread build out of the smart grid. Utilities will now be able to make data more actionable, increasing operational efficiency for the utility itself and improving power reliability for end users.”\\nSince it was developed for integration in a service-oriented architecture (SOA), the system can make use of services and data from other IT systems, such as network data from geographic information systems or load profiles from meter data management systems. In the same way, other IT systems can access services and data in the distribution network management system, like information for customer information systems about downtime in case of a malfunction, or work orders or switching jobs for the workforce management system. The SOA focused design concept allows Spectrum Power ADMS to be integrated efficiently in the user's IT environment, promoting business process optimization and work process automation.\\nThe system provides the grid operator with the right tools to comply with the requirements of international reliability standards such as NERC CIP (North American Electric Reliability Corporation – Critical Infrastructure Protection) or those of the US federal agency NIST (National Institute of Standards and Technology). Interoperability standards such as IEC 61970 (CIM, Common Information Model) and IEC 61968 (system interfaces for distribution networks) have also been taken into account in the development of Spectrum Power ADMS in order to facilitate the IT integration of the system in an existing infrastructure.\\nThe introduction of Siemens Spectrum Power™ ADMS solution is the culmination of many successfully deployed distribution management solutions as well as Siemens ability to integrate these solutions with the utility’s existing infrastructure. The fully integrated operational environment available from Siemens enables improved monitoring, control and maintenance for a reliable and efficient distribution network.\"],\n",
       " 'question_categories': [{'categorization_name': 'question_length',\n",
       "   'category_name': 'short'},\n",
       "  {'categorization_name': 'question_formulation', 'category_name': 'natural'}],\n",
       " 'user_categories': [{'categorization_name': 'user_expertise',\n",
       "   'category_name': 'expert'}],\n",
       " 'document_ids': ['<urn:uuid:db2e6b90-78a6-418b-9c12-f19144174c74>']}"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "qas[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each generated result includes:  \n",
    "\n",
    "- The generated **question**  \n",
    "- The generated **answer**  \n",
    "- The **context** (FineWeb documents) the question is based on  \n",
    "- The **IDs** of those documents  \n",
    "- The **question categories** used during generation  \n",
    "- The **user categories** used during generation  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can track all your requests using the get_all_requests endpoint."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_all_requests():\n",
    "    resp = requests.get(\n",
    "        f\"{BASE_URL}get_all_requests\",\n",
    "        headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n",
    "    )\n",
    "    resp.raise_for_status()\n",
    "    print(json.dumps(resp.json(), indent=4))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "get_all_requests()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}