katanaml committed
Commit 42cd5f6 · 1 Parent(s): 77b5374

Sparrow Parse

Files changed (47)
  1. .gitignore +2 -0
  2. Dockerfile +10 -0
  3. LICENSE +674 -0
  4. README.md +1 -1
  5. api.py +108 -0
  6. assistant.py +23 -0
  7. config.yml +90 -0
  8. data/inout-20211211_001.jpg +0 -0
  9. data/invoice_1.jpg +0 -0
  10. data/invoice_1.pdf +0 -0
  11. data/ross-20211211_010.jpg +0 -0
  12. docker-compose.yml +28 -0
  13. embeddings/__init__.py +0 -0
  14. embeddings/agents/__init__.py +0 -0
  15. embeddings/agents/haystack.py +68 -0
  16. embeddings/agents/interface.py +29 -0
  17. embeddings/agents/llamaindex.py +85 -0
  18. engine.py +82 -0
  19. ingest.py +42 -0
  20. rag/__init__.py +0 -0
  21. rag/agents/__init__.py +0 -0
  22. rag/agents/haystack/__init__.py +0 -0
  23. rag/agents/haystack/haystack.py +227 -0
  24. rag/agents/instructor/__init__.py +0 -0
  25. rag/agents/instructor/fcall.py +77 -0
  26. rag/agents/instructor/helpers/__init__.py +0 -0
  27. rag/agents/instructor/helpers/instructor_helper.py +60 -0
  28. rag/agents/instructor/instructor.py +254 -0
  29. rag/agents/interface.py +61 -0
  30. rag/agents/llamaindex/__init__.py +0 -0
  31. rag/agents/llamaindex/llamaindex.py +209 -0
  32. rag/agents/llamaindex/vllamaindex.py +139 -0
  33. rag/agents/llamaindex/vprocessor.py +183 -0
  34. rag/agents/sparrow_parse/__init__.py +0 -0
  35. rag/agents/sparrow_parse/sparrow_parse.py +137 -0
  36. rag/agents/sparrow_parse/sparrow_utils.py +54 -0
  37. rag/agents/sparrow_parse/sparrow_validator.py +26 -0
  38. rag/agents/unstructured/__init__.py +0 -0
  39. rag/agents/unstructured/unstructured.py +372 -0
  40. rag/agents/unstructured/unstructured_light.py +293 -0
  41. requirements_haystack.txt +14 -0
  42. requirements_instructor.txt +16 -0
  43. requirements_llamaindex.txt +27 -0
  44. requirements_sparrow_parse.txt +13 -0
  45. requirements_unstructured.txt +19 -0
  46. sample_prompts.txt +390 -0
  47. sparrow.sh +28 -0
.gitignore ADDED
@@ -0,0 +1,2 @@
+
+ .DS_Store
Dockerfile ADDED
@@ -0,0 +1,10 @@
+ FROM python:3.10
+
+ RUN useradd -m -u 1000 user
+ WORKDIR /app
+
+ COPY --chown=user ./requirements_sparrow_parse.txt requirements_sparrow_parse.txt
+ RUN pip install --no-cache-dir --upgrade -r requirements_sparrow_parse.txt
+
+ COPY --chown=user . /app
+ CMD ["python", "api.py", "--port", "7860"]
LICENSE ADDED
@@ -0,0 +1,674 @@
+ GNU GENERAL PUBLIC LICENSE
+ Version 3, 29 June 2007
+
+ Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+ Preamble
+
+ The GNU General Public License is a free, copyleft license for
+ software and other kinds of works.
+
+ The licenses for most software and other practical works are designed
+ to take away your freedom to share and change the works. By contrast,
+ the GNU General Public License is intended to guarantee your freedom to
+ share and change all versions of a program--to make sure it remains free
+ software for all its users. We, the Free Software Foundation, use the
+ GNU General Public License for most of our software; it applies also to
+ any other work released this way by its authors. You can apply it to
+ your programs, too.
+
+ When we speak of free software, we are referring to freedom, not
+ price. Our General Public Licenses are designed to make sure that you
+ have the freedom to distribute copies of free software (and charge for
+ them if you wish), that you receive source code or can get it if you
+ want it, that you can change the software or use pieces of it in new
+ free programs, and that you know you can do these things.
+
+ To protect your rights, we need to prevent others from denying you
+ these rights or asking you to surrender the rights. Therefore, you have
+ certain responsibilities if you distribute copies of the software, or if
+ you modify it: responsibilities to respect the freedom of others.
+
+ For example, if you distribute copies of such a program, whether
+ gratis or for a fee, you must pass on to the recipients the same
+ freedoms that you received. You must make sure that they, too, receive
+ or can get the source code. And you must show them these terms so they
+ know their rights.
+
+ Developers that use the GNU GPL protect your rights with two steps:
+ (1) assert copyright on the software, and (2) offer you this License
+ giving you legal permission to copy, distribute and/or modify it.
+
+ For the developers' and authors' protection, the GPL clearly explains
+ that there is no warranty for this free software. For both users' and
+ authors' sake, the GPL requires that modified versions be marked as
+ changed, so that their problems will not be attributed erroneously to
+ authors of previous versions.
+
+ Some devices are designed to deny users access to install or run
+ modified versions of the software inside them, although the manufacturer
+ can do so. This is fundamentally incompatible with the aim of
+ protecting users' freedom to change the software. The systematic
+ pattern of such abuse occurs in the area of products for individuals to
+ use, which is precisely where it is most unacceptable. Therefore, we
+ have designed this version of the GPL to prohibit the practice for those
+ products. If such problems arise substantially in other domains, we
+ stand ready to extend this provision to those domains in future versions
+ of the GPL, as needed to protect the freedom of users.
+
+ Finally, every program is threatened constantly by software patents.
+ States should not allow patents to restrict development and use of
+ software on general-purpose computers, but in those that do, we wish to
+ avoid the special danger that patents applied to a free program could
+ make it effectively proprietary. To prevent this, the GPL assures that
+ patents cannot be used to render the program non-free.
+
+ The precise terms and conditions for copying, distribution and
+ modification follow.
+
+ TERMS AND CONDITIONS
+
+ 0. Definitions.
+
+ "This License" refers to version 3 of the GNU General Public License.
+
+ "Copyright" also means copyright-like laws that apply to other kinds of
+ works, such as semiconductor masks.
+
+ "The Program" refers to any copyrightable work licensed under this
+ License. Each licensee is addressed as "you". "Licensees" and
+ "recipients" may be individuals or organizations.
+
+ To "modify" a work means to copy from or adapt all or part of the work
+ in a fashion requiring copyright permission, other than the making of an
+ exact copy. The resulting work is called a "modified version" of the
+ earlier work or a work "based on" the earlier work.
+
+ A "covered work" means either the unmodified Program or a work based
+ on the Program.
+
+ To "propagate" a work means to do anything with it that, without
+ permission, would make you directly or secondarily liable for
+ infringement under applicable copyright law, except executing it on a
+ computer or modifying a private copy. Propagation includes copying,
+ distribution (with or without modification), making available to the
+ public, and in some countries other activities as well.
+
+ To "convey" a work means any kind of propagation that enables other
+ parties to make or receive copies. Mere interaction with a user through
+ a computer network, with no transfer of a copy, is not conveying.
+
+ An interactive user interface displays "Appropriate Legal Notices"
+ to the extent that it includes a convenient and prominently visible
+ feature that (1) displays an appropriate copyright notice, and (2)
+ tells the user that there is no warranty for the work (except to the
+ extent that warranties are provided), that licensees may convey the
+ work under this License, and how to view a copy of this License. If
+ the interface presents a list of user commands or options, such as a
+ menu, a prominent item in the list meets this criterion.
+
+ 1. Source Code.
+
+ The "source code" for a work means the preferred form of the work
+ for making modifications to it. "Object code" means any non-source
+ form of a work.
+
+ A "Standard Interface" means an interface that either is an official
+ standard defined by a recognized standards body, or, in the case of
+ interfaces specified for a particular programming language, one that
+ is widely used among developers working in that language.
+
+ The "System Libraries" of an executable work include anything, other
+ than the work as a whole, that (a) is included in the normal form of
+ packaging a Major Component, but which is not part of that Major
+ Component, and (b) serves only to enable use of the work with that
+ Major Component, or to implement a Standard Interface for which an
+ implementation is available to the public in source code form. A
+ "Major Component", in this context, means a major essential component
+ (kernel, window system, and so on) of the specific operating system
+ (if any) on which the executable work runs, or a compiler used to
+ produce the work, or an object code interpreter used to run it.
+
+ The "Corresponding Source" for a work in object code form means all
+ the source code needed to generate, install, and (for an executable
+ work) run the object code and to modify the work, including scripts to
+ control those activities. However, it does not include the work's
+ System Libraries, or general-purpose tools or generally available free
+ programs which are used unmodified in performing those activities but
+ which are not part of the work. For example, Corresponding Source
+ includes interface definition files associated with source files for
+ the work, and the source code for shared libraries and dynamically
+ linked subprograms that the work is specifically designed to require,
+ such as by intimate data communication or control flow between those
+ subprograms and other parts of the work.
+
+ The Corresponding Source need not include anything that users
+ can regenerate automatically from other parts of the Corresponding
+ Source.
+
+ The Corresponding Source for a work in source code form is that
+ same work.
+
+ 2. Basic Permissions.
+
+ All rights granted under this License are granted for the term of
+ copyright on the Program, and are irrevocable provided the stated
+ conditions are met. This License explicitly affirms your unlimited
+ permission to run the unmodified Program. The output from running a
+ covered work is covered by this License only if the output, given its
+ content, constitutes a covered work. This License acknowledges your
+ rights of fair use or other equivalent, as provided by copyright law.
+
+ You may make, run and propagate covered works that you do not
+ convey, without conditions so long as your license otherwise remains
+ in force. You may convey covered works to others for the sole purpose
+ of having them make modifications exclusively for you, or provide you
+ with facilities for running those works, provided that you comply with
+ the terms of this License in conveying all material for which you do
+ not control copyright. Those thus making or running the covered works
+ for you must do so exclusively on your behalf, under your direction
+ and control, on terms that prohibit them from making any copies of
+ your copyrighted material outside their relationship with you.
+
+ Conveying under any other circumstances is permitted solely under
+ the conditions stated below. Sublicensing is not allowed; section 10
+ makes it unnecessary.
+
+ 3. Protecting Users' Legal Rights From Anti-Circumvention Law.
+
+ No covered work shall be deemed part of an effective technological
+ measure under any applicable law fulfilling obligations under article
+ 11 of the WIPO copyright treaty adopted on 20 December 1996, or
+ similar laws prohibiting or restricting circumvention of such
+ measures.
+
+ When you convey a covered work, you waive any legal power to forbid
+ circumvention of technological measures to the extent such circumvention
+ is effected by exercising rights under this License with respect to
+ the covered work, and you disclaim any intention to limit operation or
+ modification of the work as a means of enforcing, against the work's
+ users, your or third parties' legal rights to forbid circumvention of
+ technological measures.
+
+ 4. Conveying Verbatim Copies.
+
+ You may convey verbatim copies of the Program's source code as you
+ receive it, in any medium, provided that you conspicuously and
+ appropriately publish on each copy an appropriate copyright notice;
+ keep intact all notices stating that this License and any
+ non-permissive terms added in accord with section 7 apply to the code;
+ keep intact all notices of the absence of any warranty; and give all
+ recipients a copy of this License along with the Program.
+
+ You may charge any price or no price for each copy that you convey,
+ and you may offer support or warranty protection for a fee.
+
+ 5. Conveying Modified Source Versions.
+
+ You may convey a work based on the Program, or the modifications to
+ produce it from the Program, in the form of source code under the
+ terms of section 4, provided that you also meet all of these conditions:
+
+ a) The work must carry prominent notices stating that you modified
+ it, and giving a relevant date.
+
+ b) The work must carry prominent notices stating that it is
+ released under this License and any conditions added under section
+ 7. This requirement modifies the requirement in section 4 to
+ "keep intact all notices".
+
+ c) You must license the entire work, as a whole, under this
+ License to anyone who comes into possession of a copy. This
+ License will therefore apply, along with any applicable section 7
+ additional terms, to the whole of the work, and all its parts,
+ regardless of how they are packaged. This License gives no
+ permission to license the work in any other way, but it does not
+ invalidate such permission if you have separately received it.
+
+ d) If the work has interactive user interfaces, each must display
+ Appropriate Legal Notices; however, if the Program has interactive
+ interfaces that do not display Appropriate Legal Notices, your
+ work need not make them do so.
+
+ A compilation of a covered work with other separate and independent
+ works, which are not by their nature extensions of the covered work,
+ and which are not combined with it such as to form a larger program,
+ in or on a volume of a storage or distribution medium, is called an
+ "aggregate" if the compilation and its resulting copyright are not
+ used to limit the access or legal rights of the compilation's users
+ beyond what the individual works permit. Inclusion of a covered work
+ in an aggregate does not cause this License to apply to the other
+ parts of the aggregate.
+
+ 6. Conveying Non-Source Forms.
+
+ You may convey a covered work in object code form under the terms
+ of sections 4 and 5, provided that you also convey the
+ machine-readable Corresponding Source under the terms of this License,
+ in one of these ways:
+
+ a) Convey the object code in, or embodied in, a physical product
+ (including a physical distribution medium), accompanied by the
+ Corresponding Source fixed on a durable physical medium
+ customarily used for software interchange.
+
+ b) Convey the object code in, or embodied in, a physical product
+ (including a physical distribution medium), accompanied by a
+ written offer, valid for at least three years and valid for as
+ long as you offer spare parts or customer support for that product
+ model, to give anyone who possesses the object code either (1) a
+ copy of the Corresponding Source for all the software in the
+ product that is covered by this License, on a durable physical
+ medium customarily used for software interchange, for a price no
+ more than your reasonable cost of physically performing this
+ conveying of source, or (2) access to copy the
+ Corresponding Source from a network server at no charge.
+
+ c) Convey individual copies of the object code with a copy of the
+ written offer to provide the Corresponding Source. This
+ alternative is allowed only occasionally and noncommercially, and
+ only if you received the object code with such an offer, in accord
+ with subsection 6b.
+
+ d) Convey the object code by offering access from a designated
+ place (gratis or for a charge), and offer equivalent access to the
+ Corresponding Source in the same way through the same place at no
+ further charge. You need not require recipients to copy the
+ Corresponding Source along with the object code. If the place to
+ copy the object code is a network server, the Corresponding Source
+ may be on a different server (operated by you or a third party)
+ that supports equivalent copying facilities, provided you maintain
+ clear directions next to the object code saying where to find the
+ Corresponding Source. Regardless of what server hosts the
+ Corresponding Source, you remain obligated to ensure that it is
+ available for as long as needed to satisfy these requirements.
+
+ e) Convey the object code using peer-to-peer transmission, provided
+ you inform other peers where the object code and Corresponding
+ Source of the work are being offered to the general public at no
+ charge under subsection 6d.
+
+ A separable portion of the object code, whose source code is excluded
+ from the Corresponding Source as a System Library, need not be
+ included in conveying the object code work.
+
+ A "User Product" is either (1) a "consumer product", which means any
+ tangible personal property which is normally used for personal, family,
+ or household purposes, or (2) anything designed or sold for incorporation
+ into a dwelling. In determining whether a product is a consumer product,
+ doubtful cases shall be resolved in favor of coverage. For a particular
+ product received by a particular user, "normally used" refers to a
+ typical or common use of that class of product, regardless of the status
+ of the particular user or of the way in which the particular user
+ actually uses, or expects or is expected to use, the product. A product
+ is a consumer product regardless of whether the product has substantial
+ commercial, industrial or non-consumer uses, unless such uses represent
+ the only significant mode of use of the product.
+
+ "Installation Information" for a User Product means any methods,
+ procedures, authorization keys, or other information required to install
+ and execute modified versions of a covered work in that User Product from
+ a modified version of its Corresponding Source. The information must
+ suffice to ensure that the continued functioning of the modified object
+ code is in no case prevented or interfered with solely because
+ modification has been made.
+
+ If you convey an object code work under this section in, or with, or
+ specifically for use in, a User Product, and the conveying occurs as
+ part of a transaction in which the right of possession and use of the
+ User Product is transferred to the recipient in perpetuity or for a
+ fixed term (regardless of how the transaction is characterized), the
+ Corresponding Source conveyed under this section must be accompanied
+ by the Installation Information. But this requirement does not apply
+ if neither you nor any third party retains the ability to install
+ modified object code on the User Product (for example, the work has
+ been installed in ROM).
+
+ The requirement to provide Installation Information does not include a
+ requirement to continue to provide support service, warranty, or updates
+ for a work that has been modified or installed by the recipient, or for
+ the User Product in which it has been modified or installed. Access to a
+ network may be denied when the modification itself materially and
+ adversely affects the operation of the network or violates the rules and
+ protocols for communication across the network.
+
+ Corresponding Source conveyed, and Installation Information provided,
+ in accord with this section must be in a format that is publicly
+ documented (and with an implementation available to the public in
+ source code form), and must require no special password or key for
+ unpacking, reading or copying.
+
+ 7. Additional Terms.
+
+ "Additional permissions" are terms that supplement the terms of this
+ License by making exceptions from one or more of its conditions.
+ Additional permissions that are applicable to the entire Program shall
+ be treated as though they were included in this License, to the extent
+ that they are valid under applicable law. If additional permissions
+ apply only to part of the Program, that part may be used separately
+ under those permissions, but the entire Program remains governed by
+ this License without regard to the additional permissions.
+
+ When you convey a copy of a covered work, you may at your option
+ remove any additional permissions from that copy, or from any part of
+ it. (Additional permissions may be written to require their own
+ removal in certain cases when you modify the work.) You may place
+ additional permissions on material, added by you to a covered work,
+ for which you have or can give appropriate copyright permission.
+
+ Notwithstanding any other provision of this License, for material you
+ add to a covered work, you may (if authorized by the copyright holders of
+ that material) supplement the terms of this License with terms:
+
+ a) Disclaiming warranty or limiting liability differently from the
+ terms of sections 15 and 16 of this License; or
+
+ b) Requiring preservation of specified reasonable legal notices or
+ author attributions in that material or in the Appropriate Legal
+ Notices displayed by works containing it; or
+
+ c) Prohibiting misrepresentation of the origin of that material, or
+ requiring that modified versions of such material be marked in
+ reasonable ways as different from the original version; or
+
+ d) Limiting the use for publicity purposes of names of licensors or
+ authors of the material; or
+
+ e) Declining to grant rights under trademark law for use of some
+ trade names, trademarks, or service marks; or
+
+ f) Requiring indemnification of licensors and authors of that
+ material by anyone who conveys the material (or modified versions of
+ it) with contractual assumptions of liability to the recipient, for
+ any liability that these contractual assumptions directly impose on
+ those licensors and authors.
+
+ All other non-permissive additional terms are considered "further
+ restrictions" within the meaning of section 10. If the Program as you
+ received it, or any part of it, contains a notice stating that it is
+ governed by this License along with a term that is a further
+ restriction, you may remove that term. If a license document contains
+ a further restriction but permits relicensing or conveying under this
+ License, you may add to a covered work material governed by the terms
+ of that license document, provided that the further restriction does
+ not survive such relicensing or conveying.
+
+ If you add terms to a covered work in accord with this section, you
+ must place, in the relevant source files, a statement of the
+ additional terms that apply to those files, or a notice indicating
+ where to find the applicable terms.
+
+ Additional terms, permissive or non-permissive, may be stated in the
+ form of a separately written license, or stated as exceptions;
+ the above requirements apply either way.
+
+ 8. Termination.
+
+ You may not propagate or modify a covered work except as expressly
+ provided under this License. Any attempt otherwise to propagate or
+ modify it is void, and will automatically terminate your rights under
+ this License (including any patent licenses granted under the third
+ paragraph of section 11).
+
+ However, if you cease all violation of this License, then your
+ license from a particular copyright holder is reinstated (a)
+ provisionally, unless and until the copyright holder explicitly and
+ finally terminates your license, and (b) permanently, if the copyright
+ holder fails to notify you of the violation by some reasonable means
+ prior to 60 days after the cessation.
+
+ Moreover, your license from a particular copyright holder is
+ reinstated permanently if the copyright holder notifies you of the
+ violation by some reasonable means, this is the first time you have
+ received notice of violation of this License (for any work) from that
+ copyright holder, and you cure the violation prior to 30 days after
+ your receipt of the notice.
+
+ Termination of your rights under this section does not terminate the
+ licenses of parties who have received copies or rights from you under
+ this License. If your rights have been terminated and not permanently
+ reinstated, you do not qualify to receive new licenses for the same
+ material under section 10.
+
+ 9. Acceptance Not Required for Having Copies.
+
+ You are not required to accept this License in order to receive or
+ run a copy of the Program. Ancillary propagation of a covered work
+ occurring solely as a consequence of using peer-to-peer transmission
+ to receive a copy likewise does not require acceptance. However,
+ nothing other than this License grants you permission to propagate or
+ modify any covered work. These actions infringe copyright if you do
+ not accept this License. Therefore, by modifying or propagating a
+ covered work, you indicate your acceptance of this License to do so.
+
+ 10. Automatic Licensing of Downstream Recipients.
+
+ Each time you convey a covered work, the recipient automatically
+ receives a license from the original licensors, to run, modify and
+ propagate that work, subject to this License. You are not responsible
+ for enforcing compliance by third parties with this License.
+
+ An "entity transaction" is a transaction transferring control of an
+ organization, or substantially all assets of one, or subdividing an
+ organization, or merging organizations. If propagation of a covered
+ work results from an entity transaction, each party to that
+ transaction who receives a copy of the work also receives whatever
+ licenses to the work the party's predecessor in interest had or could
+ give under the previous paragraph, plus a right to possession of the
+ Corresponding Source of the work from the predecessor in interest, if
+ the predecessor has it or can get it with reasonable efforts.
+
+ You may not impose any further restrictions on the exercise of the
+ rights granted or affirmed under this License. For example, you may
+ not impose a license fee, royalty, or other charge for exercise of
+ rights granted under this License, and you may not initiate litigation
+ (including a cross-claim or counterclaim in a lawsuit) alleging that
+ any patent claim is infringed by making, using, selling, offering for
+ sale, or importing the Program or any portion of it.
+
+ 11. Patents.
+
+ A "contributor" is a copyright holder who authorizes use under this
+ License of the Program or a work on which the Program is based. The
+ work thus licensed is called the contributor's "contributor version".
+
+ A contributor's "essential patent claims" are all patent claims
+ owned or controlled by the contributor, whether already acquired or
+ hereafter acquired, that would be infringed by some manner, permitted
+ by this License, of making, using, or selling its contributor version,
+ but do not include claims that would be infringed only as a
+ consequence of further modification of the contributor version. For
+ purposes of this definition, "control" includes the right to grant
+ patent sublicenses in a manner consistent with the requirements of
+ this License.
+
+ Each contributor grants you a non-exclusive, worldwide, royalty-free
+ patent license under the contributor's essential patent claims, to
+ make, use, sell, offer for sale, import and otherwise run, modify and
+ propagate the contents of its contributor version.
+
+ In the following three paragraphs, a "patent license" is any express
+ agreement or commitment, however denominated, not to enforce a patent
+ (such as an express permission to practice a patent or covenant not to
+ sue for patent infringement). To "grant" such a patent license to a
+ party means to make such an agreement or commitment not to enforce a
+ patent against the party.
+
+ If you convey a covered work, knowingly relying on a patent license,
+ and the Corresponding Source of the work is not available for anyone
+ to copy, free of charge and under the terms of this License, through a
+ publicly available network server or other readily accessible means,
+ then you must either (1) cause the Corresponding Source to be so
+ available, or (2) arrange to deprive yourself of the benefit of the
+ patent license for this particular work, or (3) arrange, in a manner
+ consistent with the requirements of this License, to extend the patent
+ license to downstream recipients. "Knowingly relying" means you have
+ actual knowledge that, but for the patent license, your conveying the
+ covered work in a country, or your recipient's use of the covered work
+ in a country, would infringe one or more identifiable patents in that
+ country that you have reason to believe are valid.
+
+ If, pursuant to or in connection with a single transaction or
+ arrangement, you convey, or propagate by procuring conveyance of, a
+ covered work, and grant a patent license to some of the parties
+ receiving the covered work authorizing them to use, propagate, modify
+ or convey a specific copy of the covered work, then the patent license
+ you grant is automatically extended to all recipients of the covered
+ work and works based on it.
+
+ A patent license is "discriminatory" if it does not include within
+ the scope of its coverage, prohibits the exercise of, or is
+ conditioned on the non-exercise of one or more of the rights that are
+ specifically granted under this License. You may not convey a covered
+ work if you are a party to an arrangement with a third party that is
+ in the business of distributing software, under which you make payment
+ to the third party based on the extent of your activity of conveying
+ the work, and under which the third party grants, to any of the
+ parties who would receive the covered work from you, a discriminatory
+ patent license (a) in connection with copies of the covered work
+ conveyed by you (or copies made from those copies), or (b) primarily
+ for and in connection with specific products or compilations that
+ contain the covered work, unless you entered into that arrangement,
+ or that patent license was granted, prior to 28 March 2007.
+
+ Nothing in this License shall be construed as excluding or limiting
+ any implied license or other defenses to infringement that may
+ otherwise be available to you under applicable patent law.
+
+ 12. No Surrender of Others' Freedom.
+
+ If conditions are imposed on you (whether by court order, agreement or
+ otherwise) that contradict the conditions of this License, they do not
+ excuse you from the conditions of this License. If you cannot convey a
+ covered work so as to satisfy simultaneously your obligations under this
+ License and any other pertinent obligations, then as a consequence you may
+ not convey it at all. For example, if you agree to terms that obligate you
+ to collect a royalty for further conveying from those to whom you convey
+ the Program, the only way you could satisfy both those terms and this
+ License would be to refrain entirely from conveying the Program.
+
+ 13. Use with the GNU Affero General Public License.
+
+ Notwithstanding any other provision of this License, you have
+ permission to link or combine any covered work with a work licensed
+ under version 3 of the GNU Affero General Public License into a single
+ combined work, and to convey the resulting work. The terms of this
+ License will continue to apply to the part which is the covered work,
+ but the special requirements of the GNU Affero General Public License,
+ section 13, concerning interaction through a network will apply to the
+ combination as such.
+
+ 14. Revised Versions of this License.
+
+ The Free Software Foundation may publish revised and/or new versions of
+ the GNU General Public License from time to time. Such new versions will
+ be similar in spirit to the present version, but may differ in detail to
+ address new problems or concerns.
+
+ Each version is given a distinguishing version number. If the
+ Program specifies that a certain numbered version of the GNU General
+ Public License "or any later version" applies to it, you have the
+ option of following the terms and conditions either of that numbered
+ version or of any later version published by the Free Software
+ Foundation. If the Program does not specify a version number of the
+ GNU General Public License, you may choose any version ever published
+ by the Free Software Foundation.
+
+ If the Program specifies that a proxy can decide which future
+ versions of the GNU General Public License can be used, that proxy's
+ public statement of acceptance of a version permanently authorizes you
+ to choose that version for the Program.
+
+ Later license versions may give you additional or different
+ permissions. However, no additional obligations are imposed on any
+ author or copyright holder as a result of your choosing to follow a
+ later version.
+
+ 15. Disclaimer of Warranty.
+
+ THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
+ APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
+ HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
+ OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
+ THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
+ IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
+ ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
+
+ 16. Limitation of Liability.
+
+ IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
+ WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
+ THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
+ GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
+ USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
+ DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
+ PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
+ EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
+ SUCH DAMAGES.
+
+ 17. Interpretation of Sections 15 and 16.
+
+ If the disclaimer of warranty and limitation of liability provided
+ above cannot be given local legal effect according to their terms,
+ reviewing courts shall apply local law that most closely approximates
+ an absolute waiver of all civil liability in connection with the
+ Program, unless a warranty or assumption of liability accompanies a
+ copy of the Program in return for a fee.
+
+ END OF TERMS AND CONDITIONS
+
+ How to Apply These Terms to Your New Programs
+
+ If you develop a new program, and you want it to be of the greatest
+ possible use to the public, the best way to achieve this is to make it
+ free software which everyone can redistribute and change under these terms.
+
+ To do so, attach the following notices to the program. It is safest
+ to attach them to the start of each source file to most effectively
+ state the exclusion of warranty; and each file should have at least
+ the "copyright" line and a pointer to where the full notice is found.
+
+ <one line to give the program's name and a brief idea of what it does.>
+ Copyright (C) <year> <name of author>
+
+ This program is free software: you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation, either version 3 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program. If not, see <https://www.gnu.org/licenses/>.
+
+ Also add information on how to contact you by electronic and paper mail.
+
+ If the program does terminal interaction, make it output a short
+ notice like this when it starts in an interactive mode:
+
+ <program> Copyright (C) <year> <name of author>
+ This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
+ This is free software, and you are welcome to redistribute it
+ under certain conditions; type `show c' for details.
+
+ The hypothetical commands `show w' and `show c' should show the appropriate
+ parts of the General Public License. Of course, your program's commands
+ might be different; for a GUI interface, you would use an "about box".
+
+ You should also get your employer (if you work as a programmer) or school,
+ if any, to sign a "copyright disclaimer" for the program, if necessary.
+ For more information on this, and how to apply and follow the GNU GPL, see
+ <https://www.gnu.org/licenses/>.
+
+ The GNU General Public License does not permit incorporating your program
+ into proprietary programs. If your program is a subroutine library, you
+ may consider it more useful to permit linking proprietary applications with
+ the library. If this is what you want to do, use the GNU Lesser General
+ Public License instead of this License. But first, please read
+ <https://www.gnu.org/licenses/why-not-lgpl.html>.
README.md CHANGED
@@ -1,5 +1,5 @@
  ---
- title: Sparrow Ml
+ title: Sparrow ML
  emoji: 😻
  colorFrom: green
  colorTo: red
api.py ADDED
@@ -0,0 +1,108 @@
+ from fastapi import FastAPI, File, UploadFile, Form, HTTPException
+ from fastapi.middleware.cors import CORSMiddleware
+ from engine import run_from_api_engine
+ from ingest import run_from_api_ingest
+ import uvicorn
+ import warnings
+ from typing import Annotated
+ import json
+ import argparse
+ from dotenv import load_dotenv
+ import os
+ from rich import print
+
+
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
+
+
+ # Load environment variables from .env file
+ load_dotenv()
+
+
+ # add asyncio to the pipeline
+
+ app = FastAPI(openapi_url="/api/v1/sparrow-llm/openapi.json", docs_url="/api/v1/sparrow-llm/docs")
+
+ app.add_middleware(
+     CORSMiddleware,
+     allow_origins=["*"],
+     allow_methods=["*"],
+     allow_headers=["*"],
+     allow_credentials=True
+ )
+
+
+ @app.get("/")
+ def root():
+     return {"message": "Sparrow LLM API"}
+
+
+ @app.post("/api/v1/sparrow-llm/inference", tags=["LLM Inference"])
+ async def inference(
+         fields: Annotated[str, Form()],
+         agent: Annotated[str, Form()],
+         types: Annotated[str, Form()] = None,
+         keywords: Annotated[str, Form()] = None,
+         index_name: Annotated[str, Form()] = None,
+         options: Annotated[str, Form()] = None,
+         group_by_rows: Annotated[bool, Form()] = True,
+         update_targets: Annotated[bool, Form()] = True,
+         debug: Annotated[bool, Form()] = False,
+         file: UploadFile = File(None)
+ ):
+     query = 'retrieve ' + fields
+     query_types = types
+
+     query_inputs_arr = [param.strip() for param in fields.split(',')] if query_types else []
+     query_types_arr = [param.strip() for param in query_types.split(',')] if query_types else []
+     keywords_arr = [param.strip() for param in keywords.split(',')] if keywords is not None else None
+     options_arr = [param.strip() for param in options.split(',')] if options is not None else None
+
+     if not query_types:
+         query = fields
+
+     try:
+         answer = await run_from_api_engine(agent, query_inputs_arr, query_types_arr, keywords_arr, query, index_name,
+                                            options_arr, file, group_by_rows, update_targets, debug)
+     except ValueError as e:
+         raise HTTPException(status_code=418, detail=str(e))
+
+     try:
+         if isinstance(answer, (str, bytes, bytearray)):
+             answer = json.loads(answer)
+     except json.JSONDecodeError as e:
+         raise HTTPException(status_code=418, detail=answer)
+
+     if debug:
+         print(f"\nJSON response:\n")
+         print(answer)
+
+     return {"message": answer}
+
+
+ @app.post("/api/v1/sparrow-llm/ingest", tags=["LLM Ingest"])
+ async def ingest(
+         agent: Annotated[str, Form()],
+         index_name: Annotated[str, Form()],
+         file: UploadFile = File()
+ ):
+     try:
+         answer = await run_from_api_ingest(agent, index_name, file, False)
+     except ValueError as e:
+         raise HTTPException(status_code=418, detail=str(e))
+
+     if isinstance(answer, (str, bytes, bytearray)):
+         answer = json.loads(answer)
+
+     return {"message": answer}
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser(description="Run FastAPI App")
+     parser.add_argument("-p", "--port", type=int, default=8000, help="Port to run the FastAPI app on")
+     args = parser.parse_args()
+
+     uvicorn.run("api:app", host="0.0.0.0", port=args.port, reload=True)
+
+ # run the app with: python api.py --port 8000
+ # go to http://127.0.0.1:8000/api/v1/sparrow-llm/docs to see the Swagger UI
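
For reference, a minimal client for the inference endpoint above could look like the sketch below. The host and port, the field values, and the sparrow_parse agent choice are illustrative assumptions; any agent under rag/agents and any PDF from data/ should work the same way.

    # Hypothetical client sketch for /api/v1/sparrow-llm/inference.
    # Assumes the API was started locally with: python api.py --port 8000
    import requests

    url = "http://127.0.0.1:8000/api/v1/sparrow-llm/inference"
    with open("data/invoice_1.pdf", "rb") as f:
        response = requests.post(
            url,
            data={
                "fields": "invoice_number, total",  # comma-separated fields to fetch
                "types": "int, str",                # matching list of field types
                "agent": "sparrow_parse",           # illustrative agent name
            },
            files={"file": f},
        )

    print(response.json()["message"])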
assistant.py ADDED
@@ -0,0 +1,23 @@
+ import warnings
+ import typer
+ from typing_extensions import Annotated
+ from rag.agents.interface import get_pipeline
+
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
+ warnings.filterwarnings("ignore", category=UserWarning)
+
+
+ def run(agent: Annotated[str, typer.Option(help="Ingest agent")] = "fcall",
+         query: Annotated[str, typer.Option(help="The query to run")] = "retrieve",
+         debug: Annotated[bool, typer.Option(help="Enable debug mode")] = False):
+     user_selected_agent = agent  # Modify this as needed
+
+     try:
+         rag = get_pipeline(user_selected_agent)
+         rag.run_pipeline(user_selected_agent, None, None, query, None, None, debug)
+     except ValueError as e:
+         print(f"Caught an exception: {e}")
+
+
+ if __name__ == "__main__":
+     typer.run(run)
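
The same pipeline can also be driven without Typer; a minimal programmatic sketch, assuming the query string is illustrative:

    # Programmatic equivalent of the CLI above, calling the pipeline factory directly.
    from rag.agents.interface import get_pipeline

    rag = get_pipeline("fcall")  # same default agent as the --agent option
    # Positional arguments mirror the run_pipeline call in assistant.py.
    rag.run_pipeline("fcall", None, None, "retrieve invoice_number", None, None, False)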
config.yml ADDED
@@ -0,0 +1,90 @@
+ # AGENT FOR LLAMAINDEX
+ # Tested with these LLMs
+ #LLM: 'starling-lm:7b-alpha-q4_K_M'
+ #LLM: 'starling-lm:7b-alpha-q5_K_M'
+ LLM: 'adrienbrault/nous-hermes2theta-llama3-8b:q5_K_M'
+ #LLM: 'llama3:8b-instruct-q5_K_M'
+ EMBEDDINGS: 'sentence-transformers/all-mpnet-base-v2'
+ WEAVIATE_URL: 'http://localhost:8080'
+ CHUNK_SIZE: 3000
+ OLLAMA_BASE_URL: 'http://127.0.0.1:11434'
+ #OLLAMA_BASE_URL: 'http://192.168.68.107:11434'
+
+
+ # AGENT FOR HAYSTACK
+ SPLIT_BY_HAYSTACK: 'sentence'
+ SPLIT_LENGTH_HAYSTACK: 3000
+ SPLIT_OVERLAP_HAYSTACK: 100
+ EMBEDDINGS_HAYSTACK: 'sentence-transformers/all-MiniLM-L6-v2'
+ # Tested with these LLMs
+ #LLM_HAYSTACK: 'starling-lm:7b-alpha-q4_K_M'
+ #LLM_HAYSTACK: 'starling-lm:7b-alpha-q5_K_M'
+ LLM_HAYSTACK: 'adrienbrault/nous-hermes2theta-llama3-8b:q5_K_M'
+ #LLM_HAYSTACK: 'llama3:8b-instruct-q5_K_M'
+ OLLAMA_BASE_URL_HAYSTACK: 'http://127.0.0.1:11434'
+ #OLLAMA_BASE_URL_HAYSTACK: 'http://192.168.68.107:11434'
+ MAX_LOOPS_ALLOWED_HAYSTACK: 3
+
+
+ # AGENT FOR VLLAMAINDEX
+ # Tested with these LLMs
+ LLM_VLLAMAINDEX: 'llava:13b'
+
+
+ # AGENT FOR VPROCESSOR
+ OCR_ENDPOINT_VPROCESSOR: 'http://127.0.0.1:8001/api/v1/sparrow-ocr/inference'
+ # Tested with these LLMs
+ #LLM_VPROCESSOR: 'starling-lm:7b-alpha-q5_K_M'
+ #LLM_VPROCESSOR: 'adrienbrault/nous-hermes2pro:Q5_K_M-json'
+ LLM_VPROCESSOR: 'llama3:8b-instruct-q5_K_M'
+ OLLAMA_BASE_URL_VPROCESSOR: 'http://127.0.0.1:11434'
+
+
+ # AGENT FOR FUNCTION CALL
+ OLLAMA_BASE_URL_FUNCTION: 'http://127.0.0.1:11434/v1'
+ # Tested with these LLMs
+ LLM_FUNCTION: 'adrienbrault/nous-hermes2theta-llama3-8b:q5_K_M'
+
+
+ # AGENT FOR UNSTRUCTURED LIGHT
+ # Tested with these LLMs
+ LLM_UNSTRUCTURED_LIGHT: 'adrienbrault/nous-hermes2pro:Q5_K_M-json'
+ # Strategy for analyzing PDFs and extracting table structure
+ STRATEGY_UNSTRUCTURED_LIGHT: 'hi_res'
+ # Best model for table extraction. Other options are detectron2_onnx and chipper, depending on file layout
+ MODEL_UNSTRUCTURED_LIGHT: 'yolox'
+ CHUNK_SIZE_UNSTRUCTURED_LIGHT: 1000
+ OVERLAP_UNSTRUCTURED_LIGHT: 200
+ # ollama pull nomic-embed-text
+ EMBEDDINGS_UNSTRUCTURED_LIGHT: 'nomic-embed-text'
+ BASE_URL_UNSTRUCTURED_LIGHT: 'http://127.0.0.1:11434'
+
+
+ # AGENT FOR UNSTRUCTURED
+ # Tested with these LLMs
+ LLM_UNSTRUCTURED: 'adrienbrault/nous-hermes2pro:Q5_K_M-json'
+ OUTPUT_DIR_UNSTRUCTURED: 'data/json'
+ INPUT_DIR_UNSTRUCTURED: 'data/pdf'
+ WEAVIATE_URL_UNSTRUCTURED: 'http://localhost:8080'
+ EMBEDDINGS_UNSTRUCTURED: 'all-MiniLM-L6-v2'
+ DEVICE_UNSTRUCTURED: 'cpu'
+ CHUNK_UNDER_N_CHARS_UNSTRUCTURED: 250
+ CHUNK_NEW_AFTER_N_CHARS_UNSTRUCTURED: 500
+ BASE_URL_UNSTRUCTURED: 'http://127.0.0.1:11434'
+
+
+ # AGENT FOR INSTRUCTOR
+ OLLAMA_BASE_URL_INSTRUCTOR: 'http://127.0.0.1:11434/v1'
+ #OLLAMA_BASE_URL_INSTRUCTOR: 'http://192.168.68.107:11434/v1'
+ # Tested with these LLMs
+ LLM_INSTRUCTOR: 'adrienbrault/nous-hermes2theta-llama3-8b:q5_K_M'
+ #LLM_INSTRUCTOR: 'adrienbrault/nous-hermes2pro:Q5_K_M-json'
+ #LLM_INSTRUCTOR: 'wizardlm2:7b-q5_K_M'
+ # Strategy for analyzing PDFs and extracting table structure
+ STRATEGY_INSTRUCTOR: 'hi_res'
+ # Using yolox model by default. Other option is detectron2_onnx, depending on file layout
+ MODEL_INSTRUCTOR: 'yolox'
+ SIMILARITY_THRESHOLD_JUNK_COLUMNS_INSTRUCTOR: 0.5
+ SIMILARITY_THRESHOLD_COLUMN_ID_INSTRUCTOR: 0.3
+ PDF_SPLIT_OUTPUT_DIR_INSTRUCTOR: ""
+ PDF_CONVERT_TO_IMAGES_INSTRUCTOR: False
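
The agents read this file with python-box on top of PyYAML (see embeddings/agents/haystack.py below); a minimal sketch of that loading pattern:

    # Sketch of the config-loading pattern used throughout this commit.
    import box
    import yaml

    with open('config.yml', 'r', encoding='utf8') as ymlfile:
        cfg = box.Box(yaml.safe_load(ymlfile))  # attribute-style access to YAML keys

    print(cfg.LLM)         # 'adrienbrault/nous-hermes2theta-llama3-8b:q5_K_M'
    print(cfg.CHUNK_SIZE)  # 3000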
data/inout-20211211_001.jpg ADDED
data/invoice_1.jpg ADDED
data/invoice_1.pdf ADDED
Binary file (45.3 kB)
data/ross-20211211_010.jpg ADDED
docker-compose.yml ADDED
@@ -0,0 +1,28 @@
+ ---
+ services:
+   weaviate:
+     container_name: weaviate-db
+     command:
+       - --host
+       - 0.0.0.0
+       - --port
+       - '8080'
+       - --scheme
+       - http
+     image: semitechnologies/weaviate:1.24.2
+     ports:
+       - 8080:8080
+       - 50051:50051
+     volumes:
+       - weaviate_data:/var/lib/weaviate
+     restart: on-failure:0
+     environment:
+       QUERY_DEFAULTS_LIMIT: 25
+       AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
+       PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
+       DEFAULT_VECTORIZER_MODULE: 'none'
+       ENABLE_MODULES: ''
+       CLUSTER_HOSTNAME: 'node1'
+ volumes:
+   weaviate_data:
+ ...
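
Before running ingestion against this container, it can help to confirm Weaviate is reachable; a minimal sketch using the v3 weaviate-client API that the LlamaIndex agent below also imports:

    # Minimal readiness check for the Weaviate container defined above.
    import weaviate

    client = weaviate.Client("http://localhost:8080")  # matches WEAVIATE_URL in config.yml
    print(client.is_ready())  # True once the container reports healthy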
embeddings/__init__.py ADDED
File without changes
embeddings/agents/__init__.py ADDED
File without changes
embeddings/agents/haystack.py ADDED
@@ -0,0 +1,68 @@
+ from embeddings.agents.interface import Ingest
+ from haystack.components.converters import PyPDFToDocument
+ from haystack.components.routers import FileTypeRouter
+ from haystack.components.preprocessors import DocumentSplitter, DocumentCleaner
+ from haystack.components.embedders import SentenceTransformersDocumentEmbedder
+ from haystack import Pipeline
+ from haystack_integrations.document_stores.weaviate.document_store import WeaviateDocumentStore
+ from haystack.components.writers import DocumentWriter
+ import timeit
+ import box
+ import yaml
+ from rich import print
+
+
+ # Import config vars
+ with open('config.yml', 'r', encoding='utf8') as ymlfile:
+     cfg = box.Box(yaml.safe_load(ymlfile))
+
+
+ class HaystackIngest(Ingest):
+     def run_ingest(self,
+                    payload: str,
+                    file_path: str,
+                    index_name: str) -> None:
+         print(f"\nRunning embeddings with {payload}\n")
+
+         file_list = [file_path]
+
+         start = timeit.default_timer()
+
+         document_store = WeaviateDocumentStore(url=cfg.WEAVIATE_URL, collection_settings={"class": index_name})
+         file_type_router = FileTypeRouter(mime_types=["application/pdf"])
+         pdf_converter = PyPDFToDocument()
+
+         document_cleaner = DocumentCleaner()
+         document_splitter = DocumentSplitter(
+             split_by="word",
+             split_length=cfg.SPLIT_LENGTH_HAYSTACK,
+             split_overlap=cfg.SPLIT_OVERLAP_HAYSTACK
+         )
+
+         document_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
+         document_writer = DocumentWriter(document_store)
+
+         preprocessing_pipeline = Pipeline()
+         preprocessing_pipeline.add_component(instance=file_type_router, name="file_type_router")
+         preprocessing_pipeline.add_component(instance=pdf_converter, name="pypdf_converter")
+         preprocessing_pipeline.add_component(instance=document_cleaner, name="document_cleaner")
+         preprocessing_pipeline.add_component(instance=document_splitter, name="document_splitter")
+         preprocessing_pipeline.add_component(instance=document_embedder, name="document_embedder")
+         preprocessing_pipeline.add_component(instance=document_writer, name="document_writer")
+
+         preprocessing_pipeline.connect("file_type_router.application/pdf", "pypdf_converter.sources")
+         preprocessing_pipeline.connect("pypdf_converter", "document_cleaner")
+         preprocessing_pipeline.connect("document_cleaner", "document_splitter")
+         preprocessing_pipeline.connect("document_splitter", "document_embedder")
+         preprocessing_pipeline.connect("document_embedder", "document_writer")
+
+         # preprocessing_pipeline.draw("pipeline.png")
+
+         preprocessing_pipeline.run({
+             "file_type_router": {"sources": file_list}
+         })
+
+         print(f"Number of documents in document store: {document_store.count_documents()}")
+
+         end = timeit.default_timer()
+         print(f"Time to embeddings data: {end - start}")
embeddings/agents/interface.py ADDED
@@ -0,0 +1,29 @@
+ from abc import ABC, abstractmethod
+ import warnings
+
+
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
+ warnings.filterwarnings("ignore", category=UserWarning)
+
+
+ # Abstract Interface
+ class Ingest(ABC):
+     @abstractmethod
+     def run_ingest(self,
+                    payload: str,
+                    file_path: str,
+                    index_name: str) -> None:
+         pass
+
+
+ # Factory Method
+ def get_ingest(agent_name: str) -> Ingest:
+     if agent_name == "llamaindex":
+         from .llamaindex import LlamaIndexIngest
+         return LlamaIndexIngest()
+     elif agent_name == "haystack":
+         from .haystack import HaystackIngest
+         return HaystackIngest()
+     else:
+         raise ValueError(f"Unknown agent: {agent_name}")
+
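
Using the factory is then a two-step lookup-and-run; a sketch, where the 'Invoice' index name is hypothetical:

    # Hypothetical usage of the ingest factory above.
    from embeddings.agents.interface import get_ingest

    ingest = get_ingest("llamaindex")  # or "haystack"
    ingest.run_ingest("llamaindex", "data/invoice_1.pdf", "Invoice")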
embeddings/agents/llamaindex.py ADDED
@@ -0,0 +1,85 @@
+ from .interface import Ingest
+ import weaviate
+ from llama_index.core import StorageContext, SimpleDirectoryReader, Settings, VectorStoreIndex
+ from llama_index.vector_stores.weaviate import WeaviateVectorStore
+ from llama_index.embeddings.huggingface import HuggingFaceEmbedding
+ import box
+ import yaml
+ from rich.progress import Progress, SpinnerColumn, TextColumn
+ import timeit
+ from rich import print
+ import warnings
+
+
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
+ warnings.filterwarnings("ignore", category=UserWarning)
+
+
+ class LlamaIndexIngest(Ingest):
+     def run_ingest(self,
+                    payload: str,
+                    file_path: str,
+                    index_name: str) -> None:
+         print(f"\nRunning ingest with {payload}\n")
+
+         # Import config vars
+         with open('config.yml', 'r', encoding='utf8') as ymlfile:
+             cfg = box.Box(yaml.safe_load(ymlfile))
+
+         start = timeit.default_timer()
+
+         client = self.invoke_pipeline_step(lambda: weaviate.Client(cfg.WEAVIATE_URL),
+                                            "Connecting to Weaviate...")
+
+         documents = self.invoke_pipeline_step(lambda: self.load_documents(file_path),
+                                               "Loading documents...")
+
+         embeddings = self.invoke_pipeline_step(lambda: self.load_embedding_model(cfg.EMBEDDINGS),
+                                                "Loading embedding model...")
+
+         index = self.invoke_pipeline_step(lambda: self.build_index(client, embeddings, documents, index_name,
+                                                                    cfg.CHUNK_SIZE),
+                                           "Building index...")
+
+         end = timeit.default_timer()
+         print(f"\nTime to ingest data: {end - start}\n")
+
+     def load_documents(self, file_path):
+         documents = SimpleDirectoryReader(input_files=[file_path], required_exts=[".pdf", ".PDF"]).load_data()
+         print(f"\nLoaded {len(documents)} documents")
+         print(f"\nFirst document: {documents[0]}")
+         print("\nFirst document content:\n")
+         print(documents[0])
+         print()
+         return documents
+
+     def load_embedding_model(self, model_name):
+         return HuggingFaceEmbedding(model_name=model_name)
+
+     def build_index(self, weaviate_client, embed_model, documents, index_name, chunk_size):
+         # Delete index if it already exists, to avoid data corruption
+         weaviate_client.schema.delete_class(index_name)
+
+         Settings.chunk_size = chunk_size
+         Settings.llm = None
+         Settings.embed_model = embed_model
+
+         vector_store = WeaviateVectorStore(weaviate_client=weaviate_client, index_name=index_name)
+         storage_context = StorageContext.from_defaults(vector_store=vector_store)
+
+         index = VectorStoreIndex.from_documents(
+             documents,
+             storage_context=storage_context
+         )
+
+         return index
+
+     def invoke_pipeline_step(self, task_call, task_description):
+         with Progress(
+             SpinnerColumn(),
+             TextColumn("[progress.description]{task.description}"),
+             transient=False,
+         ) as progress:
+             progress.add_task(description=task_description, total=None)
+             ret = task_call()
+             return ret
engine.py ADDED
@@ -0,0 +1,82 @@
+ import warnings
+ import typer
+ from typing_extensions import Annotated, List
+ from rag.agents.interface import get_pipeline
+ import tempfile
+ import os
+ from rich import print
+
+
+ # Disable parallelism in the Huggingface tokenizers library to prevent potential deadlocks and ensure consistent behavior.
+ # This is especially important in environments where multiprocessing is used, as forking after parallelism can lead to issues.
+ # Note: Disabling parallelism may impact performance, but it ensures safer and more predictable execution.
+ os.environ['TOKENIZERS_PARALLELISM'] = 'false'
+
+
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
+ warnings.filterwarnings("ignore", category=UserWarning)
+
+
+ def run(inputs: Annotated[str, typer.Argument(help="The list of fields to fetch")],
+         types: Annotated[str, typer.Argument(help="The list of types of the fields")] = None,
+         keywords: Annotated[str, typer.Argument(help="The list of table column keywords")] = None,
+         file_path: Annotated[str, typer.Option(help="The file to process")] = None,
+         agent: Annotated[str, typer.Option(help="Selected agent")] = "llamaindex",
+         index_name: Annotated[str, typer.Option(help="Index to identify embeddings")] = None,
+         options: Annotated[List[str], typer.Option(help="Options to pass to the agent")] = None,
+         group_by_rows: Annotated[bool, typer.Option(help="Group JSON collection by rows")] = True,
+         update_targets: Annotated[bool, typer.Option(help="Update targets")] = True,
+         debug: Annotated[bool, typer.Option(help="Enable debug mode")] = False):
+
+     query = 'retrieve ' + inputs
+     query_types = types
+
+     query_inputs_arr = [param.strip() for param in inputs.split(',')] if query_types else []
+     query_types_arr = [param.strip() for param in query_types.split(',')] if query_types else []
+     keywords_arr = [param.strip() for param in keywords.split(',')] if keywords is not None else None
+
+     if not query_types:
+         query = inputs
+
+     user_selected_agent = agent  # Modify this as needed
+
+     try:
+         rag = get_pipeline(user_selected_agent)
+         answer = rag.run_pipeline(user_selected_agent, query_inputs_arr, query_types_arr, keywords_arr, query, file_path,
+                                   index_name, options, group_by_rows, update_targets, debug)
+
+         print(f"\nJSON response:\n")
+         print(answer)
+     except ValueError as e:
+         print(f"Caught an exception: {e}")
+
+
+ async def run_from_api_engine(user_selected_agent, query_inputs_arr, query_types_arr, keywords_arr, query, index_name,
+                               options_arr, file, group_by_rows, update_targets, debug):
+     try:
+         rag = get_pipeline(user_selected_agent)
+
+         if file is not None:
+             with tempfile.TemporaryDirectory() as temp_dir:
+                 temp_file_path = os.path.join(temp_dir, file.filename)
+
+                 # Save the uploaded file to the temporary directory
+                 with open(temp_file_path, 'wb') as temp_file:
+                     content = await file.read()
+                     temp_file.write(content)
+
+                 answer = rag.run_pipeline(user_selected_agent, query_inputs_arr, query_types_arr, keywords_arr, query,
+                                           temp_file_path, index_name, options_arr, group_by_rows, update_targets,
+                                           debug, False)
+         else:
+             answer = rag.run_pipeline(user_selected_agent, query_inputs_arr, query_types_arr, keywords_arr, query,
+                                       None, index_name, options_arr, group_by_rows, update_targets,
+                                       debug, False)
+     except ValueError as e:
+         raise e
+
+     return answer
+
+
+ if __name__ == "__main__":
+     typer.run(run)
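A hedged usage sketch of this CLI entry point (the field names, types, file path and index name are illustrative values, and a configured agent environment is assumed):

    # Shell equivalent (illustrative values):
    #   python engine.py "invoice_number, total" "str, float" --file-path data/invoice_1.pdf --agent llamaindex --index-name DemoIndex
    from engine import run

    run(inputs="invoice_number, total",
        types="str, float",
        file_path="data/invoice_1.pdf",
        agent="llamaindex",
        index_name="DemoIndex")

Without the types argument, the inputs string is passed through verbatim as the query instead of being split into typed fields.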
ingest.py ADDED
@@ -0,0 +1,42 @@
+ import warnings
+ from embeddings.agents.interface import get_ingest
+ import typer
+ from typing_extensions import Annotated
+ import tempfile
+ import os
+
+
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
+ warnings.filterwarnings("ignore", category=UserWarning)
+
+
+ def run(file_path: Annotated[str, typer.Option(help="The file to process")],
+         agent: Annotated[str, typer.Option(help="Ingest agent")] = "llamaindex",
+         index_name: Annotated[str, typer.Option(help="Index to identify embeddings")] = None):
+     user_selected_agent = agent  # Modify this as needed
+     ingest = get_ingest(user_selected_agent)
+     ingest.run_ingest(user_selected_agent, file_path, index_name)
+
+
+ async def run_from_api_ingest(agent, index_name, file, debug):
+     try:
+         user_selected_agent = agent  # Modify this as needed
+         ingest = get_ingest(user_selected_agent)
+
+         with tempfile.TemporaryDirectory() as temp_dir:
+             temp_file_path = os.path.join(temp_dir, file.filename)
+
+             # Save the uploaded file to the temporary directory
+             with open(temp_file_path, 'wb') as temp_file:
+                 content = await file.read()
+                 temp_file.write(content)
+
+             ingest.run_ingest(user_selected_agent, temp_file_path, index_name)
+     except ValueError as e:
+         raise e
+
+     return {"message": "Ingested successfully"}
+
+
+ if __name__ == "__main__":
+     typer.run(run)
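A corresponding usage sketch for the ingest entry point (the file path and index name are illustrative; a Weaviate instance reachable at the configured WEAVIATE_URL is assumed):

    # Shell equivalent (illustrative values):
    #   python ingest.py --file-path data/invoice_1.pdf --agent llamaindex --index-name DemoIndex
    from ingest import run

    run(file_path="data/invoice_1.pdf", agent="llamaindex", index_name="DemoIndex")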
rag/__init__.py ADDED
File without changes
rag/agents/__init__.py ADDED
File without changes
rag/agents/haystack/__init__.py ADDED
File without changes
rag/agents/haystack/haystack.py ADDED
@@ -0,0 +1,227 @@
+ from rag.agents.interface import Pipeline as PipelineInterface
+ from typing import Any
+ from haystack import Pipeline
+ from haystack_integrations.document_stores.weaviate.document_store import WeaviateDocumentStore
+ from haystack.components.embedders import SentenceTransformersTextEmbedder
+ from haystack_integrations.components.retrievers.weaviate.embedding_retriever import WeaviateEmbeddingRetriever
+ from haystack.components.builders import PromptBuilder
+ from haystack_integrations.components.generators.ollama import OllamaGenerator
+ from pydantic import create_model
+ import json
+ from haystack import component
+ import pydantic
+ from typing import Optional, List
+ from pydantic import ValidationError
+ import timeit
+ import box
+ import yaml
+ from rich import print
+ from rich.progress import Progress, SpinnerColumn, TextColumn
+ import warnings
+
+
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
+ warnings.filterwarnings("ignore", category=UserWarning)
+
+
+ # Import config vars
+ with open('config.yml', 'r', encoding='utf8') as ymlfile:
+     cfg = box.Box(yaml.safe_load(ymlfile))
+
+
+ class HaystackPipeline(PipelineInterface):
+     def run_pipeline(self,
+                      payload: str,
+                      query_inputs: [str],
+                      query_types: [str],
+                      keywords: [str],
+                      query: str,
+                      file_path: str,
+                      index_name: str,
+                      options: List[str] = None,
+                      group_by_rows: bool = True,
+                      update_targets: bool = True,
+                      debug: bool = False,
+                      local: bool = True) -> Any:
+         print(f"\nRunning pipeline with {payload}\n")
+
+         ResponseModel, json_schema = self.invoke_pipeline_step(lambda: self.build_response_class(query_inputs, query_types),
+                                                                "Building dynamic response class...",
+                                                                local)
+
+         output_validator = self.invoke_pipeline_step(lambda: self.build_validator(ResponseModel),
+                                                      "Building output validator...",
+                                                      local)
+
+         document_store = self.run_preprocessing_pipeline(index_name, local)
+
+         answer = self.run_inference_pipeline(document_store, json_schema, output_validator, query, local)
+
+         return answer
+
+     # Function to safely evaluate type strings
+     def safe_eval_type(self, type_str, context):
+         try:
+             return eval(type_str, {}, context)
+         except NameError:
+             raise ValueError(f"Type '{type_str}' is not recognized")
+
+     def build_response_class(self, query_inputs, query_types_as_strings):
+         # Controlled context for eval
+         context = {
+             'List': List,
+             'str': str,
+             'int': int,
+             'float': float
+             # Include other necessary types or typing constructs here
+         }
+
+         # Convert string representations to actual types
+         query_types = [self.safe_eval_type(type_str, context) for type_str in query_types_as_strings]
+
+         # Create fields dictionary
+         fields = {name: (type_, ...) for name, type_ in zip(query_inputs, query_types)}
+
+         DynamicModel = create_model('DynamicModel', **fields)
+
+         json_schema = DynamicModel.schema_json(indent=2)
+
+         return DynamicModel, json_schema
+
+     def build_validator(self, Invoice):
+         @component
+         class OutputValidator:
+             def __init__(self, pydantic_model: pydantic.BaseModel):
+                 self.pydantic_model = pydantic_model
+                 self.iteration_counter = 0
+
+             # Define the component output
+             @component.output_types(valid_replies=List[str], invalid_replies=Optional[List[str]],
+                                     error_message=Optional[str])
+             def run(self, replies: List[str]):
+
+                 self.iteration_counter += 1
+
+                 ## Try to parse the LLM's reply ##
+                 # If the LLM's reply is a valid object, return `"valid_replies"`
+                 try:
+                     output_dict = json.loads(replies[0].strip())
+                     # Disable data validation for now
+                     # self.pydantic_model.model_validate(output_dict)
+                     print(
+                         f"OutputValidator at Iteration {self.iteration_counter}: Valid JSON from LLM - No need for looping."
+                     )
+                     return {"valid_replies": replies}
+
+                 # If the LLM's reply is corrupted or not valid, return "invalid_replies" and the "error_message" for LLM to try again
+                 except (ValueError, ValidationError) as e:
+                     print(
+                         f"\nOutputValidator at Iteration {self.iteration_counter}: Invalid JSON from LLM - Let's try again.\n"
+                         f"Output from LLM:\n {replies[0]} \n"
+                         f"Error from OutputValidator: {e}"
+                     )
+                     return {"invalid_replies": replies, "error_message": str(e)}
+
+         output_validator = OutputValidator(pydantic_model=Invoice)
+
+         return output_validator
+
+     def run_preprocessing_pipeline(self, index_name, local):
+         document_store = WeaviateDocumentStore(url=cfg.WEAVIATE_URL, collection_settings={"class": index_name})
+
+         print(f"\nNumber of documents in document store: {document_store.count_documents()}\n")
+
+         if document_store.count_documents() == 0:
+             raise ValueError("Document store is empty. Please check your data source.")
+
+         return document_store
+
+     def run_inference_pipeline(self, document_store, json_schema, output_validator, query, local):
+         start = timeit.default_timer()
+
+         generator = OllamaGenerator(model=cfg.LLM_HAYSTACK,
+                                     url=cfg.OLLAMA_BASE_URL_HAYSTACK + "/api/generate",
+                                     timeout=900)
+
+         template = """
+         Given only the following document information, retrieve answer.
+         Ignore your own knowledge. Format response with the following JSON schema:
+         {{schema}}
+         Make sure your response is a dict and not a list. Return only JSON, no additional text.
+
+         Context:
+         {% for document in documents %}
+             {{ document.content }}
+         {% endfor %}
+
+         Question: {{ question }}?
+
+         {% if invalid_replies and error_message %}
+           You already created the following output in a previous attempt: {{invalid_replies}}
+           However, this doesn't comply with the format requirements from above and triggered this Python exception: {{error_message}}
+           Correct the output and try again. Just return the corrected output without any extra explanations.
+         {% endif %}
+         """
+
+         text_embedder = SentenceTransformersTextEmbedder(model=cfg.EMBEDDINGS_HAYSTACK,
+                                                          progress_bar=False)
+
+         retriever = WeaviateEmbeddingRetriever(document_store=document_store, top_k=3)
+
+         prompt_builder = PromptBuilder(template=template)
+
+         pipe = Pipeline(max_loops_allowed=cfg.MAX_LOOPS_ALLOWED_HAYSTACK)
+         pipe.add_component("embedder", text_embedder)
+         pipe.add_component("retriever", retriever)
+         pipe.add_component("prompt_builder", prompt_builder)
+         pipe.add_component("llm", generator)
+         pipe.add_component("output_validator", output_validator)
+
+         pipe.connect("embedder.embedding", "retriever.query_embedding")
+         pipe.connect("retriever", "prompt_builder.documents")
+         pipe.connect("prompt_builder", "llm")
+         pipe.connect("llm", "output_validator")
+         # If a component has more than one output or input, explicitly specify the connections:
+         pipe.connect("output_validator.invalid_replies", "prompt_builder.invalid_replies")
+         pipe.connect("output_validator.error_message", "prompt_builder.error_message")
+
+         question = (
+             query
+         )
+
+         response = self.invoke_pipeline_step(
+             lambda: pipe.run(
+                 {
+                     "embedder": {"text": question},
+                     "prompt_builder": {"question": question, "schema": json_schema}
+                 }
+             ),
+             "Running inference pipeline...",
+             local)
+
+         end = timeit.default_timer()
+
+         valid_reply = response["output_validator"]["valid_replies"][0]
+         valid_json = json.loads(valid_reply)
+         print(f"\nJSON response:\n")
+         print(valid_json)
+         print('\n' + ('=' * 50))
+
+         print(f"Time to retrieve answer: {end - start}")
+
+         return valid_json
+
+     def invoke_pipeline_step(self, task_call, task_description, local):
+         if local:
+             with Progress(
+                     SpinnerColumn(),
+                     TextColumn("[progress.description]{task.description}"),
+                     transient=False,
+             ) as progress:
+                 progress.add_task(description=task_description, total=None)
+                 ret = task_call()
+         else:
+             print(task_description)
+             ret = task_call()
+
+         return ret
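The build_response_class/safe_eval_type pair above is the pattern shared by several agents in this commit: type strings from the command line are evaluated inside a restricted namespace and turned into a Pydantic model on the fly. A condensed sketch of what it produces (field names and types are illustrative):

    from typing import List
    from pydantic import create_model

    # "invoice_number" -> str, "line_items" -> List[str], built from the
    # comma-separated field and type strings supplied on the command line
    fields = {"invoice_number": (str, ...), "line_items": (List[str], ...)}
    DynamicModel = create_model("DynamicModel", **fields)

    # This JSON schema is what gets handed to the prompt template as {{schema}}
    print(DynamicModel.schema_json(indent=2))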
rag/agents/instructor/__init__.py ADDED
File without changes
rag/agents/instructor/fcall.py ADDED
@@ -0,0 +1,77 @@
+ from rag.agents.interface import Pipeline
+ from openai import OpenAI
+ from pydantic import BaseModel, Field
+ import yfinance as yf
+ import instructor
+ import timeit
+ import box
+ import yaml
+ from rich import print
+ from typing import Any, List
+ import warnings
+
+
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
+ warnings.filterwarnings("ignore", category=UserWarning)
+
+
+ class FCall(Pipeline):
+     def run_pipeline(self,
+                      payload: str,
+                      query_inputs: [str],
+                      query_types: [str],
+                      keywords: [str],
+                      query: str,
+                      file_path: str,
+                      index_name: str,
+                      options: List[str] = None,
+                      group_by_rows: bool = True,
+                      update_targets: bool = True,
+                      debug: bool = False,
+                      local: bool = True) -> Any:
+         print(f"\nRunning pipeline with {payload}\n")
+
+         # Import config vars
+         with open('config.yml', 'r', encoding='utf8') as ymlfile:
+             cfg = box.Box(yaml.safe_load(ymlfile))
+
+         start = timeit.default_timer()
+
+         company = query
+
+         class StockInfo(BaseModel):
+             company: str = Field(..., description="Name of the company")
+             ticker: str = Field(..., description="Ticker symbol of the company")
+
+         # enables `response_model` in create call
+         client = instructor.patch(
+             OpenAI(
+                 base_url=cfg.OLLAMA_BASE_URL_FUNCTION,
+                 api_key="ollama",
+             ),
+             mode=instructor.Mode.JSON,
+         )
+
+         resp = client.chat.completions.create(
+             model=cfg.LLM_FUNCTION,
+             messages=[
+                 {
+                     "role": "user",
+                     "content": f"Return the company name and the ticker symbol of the {company}."
+                 }
+             ],
+             response_model=StockInfo,
+             max_retries=10
+         )
+
+         print(resp.model_dump_json(indent=2))
+         stock = yf.Ticker(resp.ticker)
+         hist = stock.history(period="1d")
+         stock_price = hist['Close'].iloc[-1]
+         print(f"The stock price of the {resp.company} is {stock_price} USD.")
+
+         end = timeit.default_timer()
+
+         print('=' * 50)
+
+         print(f"Time to retrieve answer: {end - start}")
rag/agents/instructor/helpers/__init__.py ADDED
File without changes
rag/agents/instructor/helpers/instructor_helper.py ADDED
@@ -0,0 +1,60 @@
+ from sparrow_parse.extractor.unstructured_processor import UnstructuredProcessor
+ from sparrow_parse.extractor.markdown_processor import MarkdownProcessor
+ import json
+
+
+ def execute_sparrow_processor(options, file_path, strategy, model_name, local, debug):
+     content, table_content = None, None
+     if "unstructured" in options:
+         processor = UnstructuredProcessor()
+         content, table_content = processor.extract_data(file_path, strategy, model_name,
+                                                         ['tables', 'unstructured'], local, debug)
+     elif "markdown" in options:
+         processor = MarkdownProcessor()
+         content, table_content = processor.extract_data(file_path, ['tables', 'markdown'], local, debug)
+
+     return content, table_content
+
+
+ def merge_dicts(json_str1, json_str2):
+     # Accept JSON strings or already-parsed dicts, since callers pass both
+     dict1 = json.loads(json_str1) if isinstance(json_str1, str) else json_str1
+     dict2 = json.loads(json_str2) if isinstance(json_str2, str) else json_str2
+
+     merged_dict = dict1.copy()
+     for key, value in dict2.items():
+         if key in merged_dict and isinstance(merged_dict[key], list) and isinstance(value, list):
+             merged_dict[key].extend(value)
+         else:
+             merged_dict[key] = value
+     return merged_dict
+
+
+ def track_query_output(keys, json_data, types):
+     # Convert JSON string to dictionary
+     data = json.loads(json_data)
+
+     # Initialize the result lists
+     result = []
+     result_types = []
+
+     # Iterate through each key in the keys array
+     for i, key in enumerate(keys):
+         # Check if the key is missing from the JSON or holds an empty/None value
+         if key not in data or data[key] is None or (isinstance(data[key], str) and not data[key].strip()):
+             result.append(key)
+             result_types.append(types[i])
+
+     return result, result_types
+
+
+ def add_answer_page(answer, page_name, answer_page):
+     if not isinstance(answer, dict):
+         raise ValueError("The answer should be a dictionary.")
+
+     # Parse answer_page if it is a JSON string
+     if isinstance(answer_page, str):
+         answer_page = json.loads(answer_page)
+
+     answer[page_name] = answer_page
+     return answer
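A quick illustration of the merge semantics above: list values under the same key are concatenated, everything else is overwritten (the values are illustrative):

    merged = merge_dicts('{"items": ["row1"]}', '{"items": ["row2"], "total": "10.00"}')
    # -> {"items": ["row1", "row2"], "total": "10.00"}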
rag/agents/instructor/instructor.py ADDED
@@ -0,0 +1,254 @@
+ from rag.agents.interface import Pipeline
+ from openai import OpenAI
+ import instructor
+ from .helpers.instructor_helper import execute_sparrow_processor, merge_dicts, track_query_output
+ from .helpers.instructor_helper import add_answer_page
+ from sparrow_parse.extractor.html_extractor import HTMLExtractor
+ from sparrow_parse.extractor.unstructured_processor import UnstructuredProcessor
+ from sparrow_parse.extractor.pdf_optimizer import PDFOptimizer
+ from pydantic import create_model
+ from typing import List
+ from rich.progress import Progress, SpinnerColumn, TextColumn
+ import timeit
+ from rich import print
+ from typing import Any
+ import shutil
+ import json
+ import box
+ import yaml
+ import warnings
+
+
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
+ warnings.filterwarnings("ignore", category=UserWarning)
+
+
+ # Import config vars
+ with open('config.yml', 'r', encoding='utf8') as ymlfile:
+     cfg = box.Box(yaml.safe_load(ymlfile))
+
+
+ class InstructorPipeline(Pipeline):
+     def run_pipeline(self,
+                      payload: str,
+                      query_inputs: [str],
+                      query_types: [str],
+                      keywords: [str],
+                      query: str,
+                      file_path: str,
+                      index_name: str,
+                      options: List[str] = None,
+                      group_by_rows: bool = True,
+                      update_targets: bool = True,
+                      debug: bool = False,
+                      local: bool = True) -> Any:
+         print(f"\nRunning pipeline with {payload}\n")
+
+         # Import config vars
+         with open('config.yml', 'r', encoding='utf8') as ymlfile:
+             cfg = box.Box(yaml.safe_load(ymlfile))
+
+         start = timeit.default_timer()
+
+         strategy = cfg.STRATEGY_INSTRUCTOR
+         model_name = cfg.MODEL_INSTRUCTOR
+         similarity_threshold_junk = cfg.SIMILARITY_THRESHOLD_JUNK_COLUMNS_INSTRUCTOR
+         similarity_threshold_column_id = cfg.SIMILARITY_THRESHOLD_COLUMN_ID_INSTRUCTOR
+         pdf_split_output_dir = None if cfg.PDF_SPLIT_OUTPUT_DIR_INSTRUCTOR == "" else cfg.PDF_SPLIT_OUTPUT_DIR_INSTRUCTOR
+         pdf_convert_to_images = cfg.PDF_CONVERT_TO_IMAGES_INSTRUCTOR
+
+         answer = '{}'
+         answer_form = '{}'
+
+         validate_options = self.validate_options(options)
+         if validate_options:
+             if options and "tables" in options:
+                 pdf_optimizer = PDFOptimizer()
+                 num_pages, output_files, temp_dir = pdf_optimizer.split_pdf_to_pages(file_path,
+                                                                                      pdf_split_output_dir,
+                                                                                      pdf_convert_to_images)
+
+                 if debug:
+                     print(f'The PDF file has {num_pages} pages.')
+                     print('The pages are stored in the following files:')
+                     for file in output_files:
+                         print(file)
+
+                 # support for multipage docs
+                 query_inputs_form, query_types_form = self.filter_fields_query(query_inputs, query_types, "form")
+
+                 for i, page in enumerate(output_files):
+                     content, table_contents = execute_sparrow_processor(options, page, strategy, model_name, local, debug)
+
+                     if debug:
+                         print(f"Query form inputs: {query_inputs_form}")
+                         print(f"Query form types: {query_types_form}")
+                     if len(query_inputs_form) > 0:
+                         query_form = "retrieve " + ", ".join(query_inputs_form)
+                         answer_form = self.execute(query_inputs_form, query_types_form, content, query_form, 'form', debug, local)
+                         query_inputs_form, query_types_form = track_query_output(query_inputs_form, answer_form, query_types_form)
+                         if debug:
+                             print(f"Answer from LLM: {answer_form}")
+                             print(f"Unprocessed query targets: {query_inputs_form}")
+
+                     answer_table = {}
+                     if table_contents is not None:
+                         query_targets, query_targets_types = self.filter_fields_query(query_inputs, query_types, "table")
+                         extractor = HTMLExtractor()
+
+                         answer_table, targets_unprocessed = extractor.read_data(query_targets, table_contents,
+                                                                                 similarity_threshold_junk,
+                                                                                 similarity_threshold_column_id,
+                                                                                 keywords, group_by_rows, update_targets,
+                                                                                 local, debug)
+
+                     if num_pages > 1:
+                         answer_current = merge_dicts(answer_form, answer_table)
+                         answer_current_page = add_answer_page({}, "page" + str(i + 1), answer_current)
+                         answer = merge_dicts(answer, json.dumps(answer_current_page))
+                         answer_form = '{}'
+                     else:
+                         answer = merge_dicts(answer_form, answer_table)
+
+                 answer = self.format_json_output(answer)
+
+                 shutil.rmtree(temp_dir, ignore_errors=True)
+             else:
+                 # No options provided
+                 processor = UnstructuredProcessor()
+                 content, table_content = processor.extract_data(file_path, strategy, model_name, None, local, debug)
+                 answer = self.execute(query_inputs, query_types, content, query, 'all', debug, local)
+         else:
+             raise ValueError(
+                 "Invalid combination of options provided. Only 'tables and unstructured' or 'tables and markdown' are allowed.")
+
+         end = timeit.default_timer()
+
+         print(f"\nJSON response:\n")
+         print(answer)
+         print('\n')
+         print('=' * 50)
+
+         print(f"Time to retrieve answer: {end - start}")
+
+         return answer
+
+     def execute(self, query_inputs, query_types, content, query, mode, debug, local):
+         if mode == 'form' or mode == 'all':
+             ResponseModel = self.invoke_pipeline_step(lambda: self.build_response_class(query_inputs, query_types),
+                                                       "Building dynamic response class for " + mode + " data...",
+                                                       local)
+
+             answer = self.invoke_pipeline_step(
+                 lambda: self.execute_query(query, content, ResponseModel, mode),
+                 "Executing query for " + mode + " data...",
+                 local
+             )
+
+             return answer
+
+     def execute_query(self, query, content, ResponseModel, mode):
+         client = instructor.from_openai(
+             OpenAI(
+                 base_url=cfg.OLLAMA_BASE_URL_INSTRUCTOR,
+                 api_key="ollama",
+             ),
+             mode=instructor.Mode.JSON,
+         )
+
+         resp = []
+         if mode == 'form' or mode == 'all':
+             resp = client.chat.completions.create(
+                 model=cfg.LLM_INSTRUCTOR,
+                 messages=[
+                     {
+                         "role": "user",
+                         "content": f"{query} from the following content {content}. if query field value is missing, return None."
+                     }
+                 ],
+                 response_model=ResponseModel,
+                 max_retries=3
+             )
+
+         answer = resp.model_dump_json(indent=4)
+
+         return answer
+
+     def filter_fields_query(self, query_inputs, query_types, mode):
+         fields = []
+
+         for query_input, query_type in zip(query_inputs, query_types):
+             if mode == "form" and query_type.startswith("List") is False:
+                 fields.append((query_input, query_type))
+             elif mode == "table" and query_type.startswith("List") is True:
+                 fields.append((query_input, query_type))
+
+         # return filtered query_inputs and query_types as two arrays of strings
+         query_inputs = [field[0] for field in fields]
+         query_types = [field[1] for field in fields]
+
+         return query_inputs, query_types
+
+     # Function to safely evaluate type strings
+     def safe_eval_type(self, type_str, context):
+         try:
+             return eval(type_str, {}, context)
+         except NameError:
+             raise ValueError(f"Type '{type_str}' is not recognized")
+
+     def build_response_class(self, query_inputs, query_types_as_strings):
+         # Controlled context for eval
+         context = {
+             'List': List,
+             'str': str,
+             'int': int,
+             'float': float
+             # Include other necessary types or typing constructs here
+         }
+
+         query_types_as_strings = [s.replace('Array', 'List') for s in query_types_as_strings]
+
+         # Convert string representations to actual types
+         query_types = [self.safe_eval_type(type_str, context) for type_str in query_types_as_strings]
+
+         # Create fields dictionary
+         fields = {name: (type_, ...) for name, type_ in zip(query_inputs, query_types)}
+
+         DynamicModel = create_model('DynamicModel', **fields)
+
+         return DynamicModel
+
+     def validate_options(self, options: List[str]) -> bool:
+         # Define valid combinations
+         valid_combinations = [
+             ["tables", "unstructured"],
+             ["tables", "markdown"]
+         ]
+
+         # Check for valid combinations or empty list
+         if not options:  # Valid if no options are provided
+             return True
+         if sorted(options) in (sorted(combination) for combination in valid_combinations):
+             return True
+         return False
+
+     def format_json_output(self, answer):
+         formatted_json = json.dumps(answer, indent=4)
+         formatted_json = formatted_json.replace('", "', '",\n"')
+         formatted_json = formatted_json.replace('}, {', '},\n{')
+         return formatted_json
+
+     def invoke_pipeline_step(self, task_call, task_description, local):
+         if local:
+             with Progress(
+                     SpinnerColumn(),
+                     TextColumn("[progress.description]{task.description}"),
+                     transient=False,
+             ) as progress:
+                 progress.add_task(description=task_description, total=None)
+                 ret = task_call()
+         else:
+             print(task_description)
+             ret = task_call()
+
+         return ret
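To make the form/table split above concrete, a small sketch (field names and types are illustrative; config.yml is assumed to be present, since this module loads it at import time). List[...] types are routed to table extraction, scalar types to form extraction:

    inputs = ["invoice_number", "total", "line_items"]
    types = ["str", "float", "List[str]"]

    pipeline = InstructorPipeline()
    pipeline.filter_fields_query(inputs, types, "form")   # (["invoice_number", "total"], ["str", "float"])
    pipeline.filter_fields_query(inputs, types, "table")  # (["line_items"], ["List[str]"])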
rag/agents/interface.py ADDED
@@ -0,0 +1,61 @@
+ from abc import ABC, abstractmethod
+ from typing import Any
+ from typing import List
+ import warnings
+
+
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
+ warnings.filterwarnings("ignore", category=UserWarning)
+
+
+ # Abstract Interface
+ class Pipeline(ABC):
+     @abstractmethod
+     def run_pipeline(self,
+                      payload: str,
+                      query_inputs: [str],
+                      query_types: [str],
+                      keywords: [str],
+                      query: str,
+                      file_path: str,
+                      index_name: str,
+                      options: List[str] = None,
+                      group_by_rows: bool = True,
+                      update_targets: bool = True,
+                      debug: bool = False,
+                      local: bool = True) -> Any:
+         pass
+
+
+ # Factory Method
+ def get_pipeline(agent_name: str) -> Pipeline:
+     if agent_name == "llamaindex":
+         from rag.agents.llamaindex.llamaindex import LlamaIndexPipeline
+         return LlamaIndexPipeline()
+     elif agent_name == "haystack":
+         from rag.agents.haystack.haystack import HaystackPipeline
+         return HaystackPipeline()
+     elif agent_name == "vllamaindex":
+         from rag.agents.llamaindex.vllamaindex import VLlamaIndexPipeline
+         return VLlamaIndexPipeline()
+     elif agent_name == "vprocessor":
+         from rag.agents.llamaindex.vprocessor import VProcessorPipeline
+         return VProcessorPipeline()
+     elif agent_name == "fcall":
+         from rag.agents.instructor.fcall import FCall
+         return FCall()
+     elif agent_name == "instructor":
+         from rag.agents.instructor.instructor import InstructorPipeline
+         return InstructorPipeline()
+     elif agent_name == "unstructured-light":
+         from rag.agents.unstructured.unstructured_light import UnstructuredLightPipeline
+         return UnstructuredLightPipeline()
+     elif agent_name == "unstructured":
+         from rag.agents.unstructured.unstructured import UnstructuredPipeline
+         return UnstructuredPipeline()
+     elif agent_name == "sparrow-parse":
+         from rag.agents.sparrow_parse.sparrow_parse import SparrowParsePipeline
+         return SparrowParsePipeline()
+     else:
+         raise ValueError(f"Unknown agent: {agent_name}")
+
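A minimal sketch of driving an agent through this factory, the same path engine.py takes (the Hugging Face space name is an illustrative placeholder and options handling is agent-specific; the sample image is from this commit's data directory):

    from rag.agents.interface import get_pipeline

    pipeline = get_pipeline("sparrow-parse")
    answer = pipeline.run_pipeline("sparrow-parse",
                                   [], [], None,
                                   '{"invoice_number": "str", "total": "str"}',
                                   "data/invoice_1.jpg",
                                   None,
                                   options=["huggingface", "user/space-name"])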
rag/agents/llamaindex/__init__.py ADDED
File without changes
rag/agents/llamaindex/llamaindex.py ADDED
@@ -0,0 +1,209 @@
+ from rag.agents.interface import Pipeline
+ from llama_index.core import VectorStoreIndex, Settings
+ from llama_index.embeddings.huggingface import HuggingFaceEmbedding
+ from llama_index.llms.ollama import Ollama
+ from llama_index.vector_stores.weaviate import WeaviateVectorStore
+ import weaviate
+ from pydantic.v1 import create_model
+ from typing import List
+ import box
+ import yaml
+ from rich.progress import Progress, SpinnerColumn, TextColumn
+ import warnings
+ import timeit
+ import time
+ import json
+ from rich import print
+ from typing import Any
+
+
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
+ warnings.filterwarnings("ignore", category=UserWarning)
+
+
+ class LlamaIndexPipeline(Pipeline):
+     def run_pipeline(self,
+                      payload: str,
+                      query_inputs: [str],
+                      query_types: [str],
+                      keywords: [str],
+                      query: str,
+                      file_path: str,
+                      index_name: str,
+                      options: List[str] = None,
+                      group_by_rows: bool = True,
+                      update_targets: bool = True,
+                      debug: bool = False,
+                      local: bool = True) -> Any:
+         print(f"\nRunning pipeline with {payload}\n")
+
+         if len(query_inputs) == 1:
+             raise ValueError("Please provide more than one query input")
+
+         start = timeit.default_timer()
+
+         rag_chain = self.build_rag_pipeline(query_inputs, query_types, index_name, debug, local)
+
+         end = timeit.default_timer()
+         print(f"Time to prepare RAG pipeline: {end - start}")
+
+         answer = self.process_query(query, rag_chain, debug, local)
+         return answer
+
+     def build_rag_pipeline(self, query_inputs, query_types, index_name, debug, local):
+         # Import config vars
+         with open('config.yml', 'r', encoding='utf8') as ymlfile:
+             cfg = box.Box(yaml.safe_load(ymlfile))
+
+         client = self.invoke_pipeline_step(lambda: weaviate.Client(cfg.WEAVIATE_URL),
+                                            "Connecting to Weaviate...",
+                                            local)
+
+         llm = self.invoke_pipeline_step(lambda: Ollama(model=cfg.LLM, base_url=cfg.OLLAMA_BASE_URL, temperature=0,
+                                                        request_timeout=900),
+                                         "Loading Ollama...",
+                                         local)
+
+         embeddings = self.invoke_pipeline_step(lambda: self.load_embedding_model(model_name=cfg.EMBEDDINGS),
+                                                "Loading embedding model...",
+                                                local)
+
+         index = self.invoke_pipeline_step(
+             lambda: self.build_index(cfg.CHUNK_SIZE, llm, embeddings, client, index_name),
+             "Building index...",
+             local)
+
+         ResponseModel = self.invoke_pipeline_step(lambda: self.build_response_class(query_inputs, query_types),
+                                                   "Building dynamic response class...",
+                                                   local)
+
+         # may want to try with similarity_top_k=5, default is 2
+         query_engine = self.invoke_pipeline_step(lambda: index.as_query_engine(
+             streaming=False,
+             output_cls=ResponseModel,
+             response_mode="compact"
+         ),
+             "Constructing query engine...",
+             local)
+
+         return query_engine
+
+     # Function to safely evaluate type strings
+     def safe_eval_type(self, type_str, context):
+         try:
+             return eval(type_str, {}, context)
+         except NameError:
+             raise ValueError(f"Type '{type_str}' is not recognized")
+
+     def build_response_class(self, query_inputs, query_types_as_strings):
+         # Controlled context for eval
+         context = {
+             'List': List,
+             'str': str,
+             'int': int,
+             'float': float
+             # Include other necessary types or typing constructs here
+         }
+
+         # Convert string representations to actual types
+         query_types = [self.safe_eval_type(type_str, context) for type_str in query_types_as_strings]
+
+         # Create fields dictionary
+         fields = {name: (type_, ...) for name, type_ in zip(query_inputs, query_types)}
+
+         DynamicModel = create_model('DynamicModel', **fields)
+
+         return DynamicModel
+
+     def load_embedding_model(self, model_name):
+         return HuggingFaceEmbedding(model_name=model_name)
+
+     def build_index(self, chunk_size, llm, embed_model, weaviate_client, index_name):
+         Settings.chunk_size = chunk_size
+         Settings.llm = llm
+         Settings.embed_model = embed_model
+
+         vector_store = WeaviateVectorStore(weaviate_client=weaviate_client, index_name=index_name)
+
+         index = VectorStoreIndex.from_vector_store(
+             vector_store=vector_store
+         )
+
+         return index
+
+     def process_query(self, query, rag_chain, debug=False, local=True) -> str:
+         start = timeit.default_timer()
+
+         step = 0
+         answer = None
+         while answer is None:
+             step += 1
+             if step > 1:
+                 print('Refining answer...')
+                 # add wait time, before refining to avoid spamming the server
+                 time.sleep(5)
+             if step > 3:
+                 # if we have refined 3 times, and still no answer, break
+                 answer = 'No answer found.'
+                 break
+
+             if local:
+                 with Progress(
+                         SpinnerColumn(),
+                         TextColumn("[progress.description]{task.description}"),
+                         transient=False,
+                 ) as progress:
+                     progress.add_task(description="Retrieving answer...", total=None)
+                     answer = self.get_rag_response(query, rag_chain, debug)
+             else:
+                 print('Retrieving answer...')
+                 answer = self.get_rag_response(query, rag_chain, debug)
+
+         end = timeit.default_timer()
+
+         print(f"\nJSON response:\n")
+         print(answer + '\n')
+         print('=' * 50)
+
+         print(f"Time to retrieve answer: {end - start}")
+
+         return answer
+
+     def get_rag_response(self, query, chain, debug=False) -> str | None:
+         try:
+             result = chain.query(query)
+         except ValueError as error:
+             text = error.args[0]
+             starting_str = "Could not extract json string from output: \n"
+             if (index := text.find(starting_str)) != -1:
+                 json_str = text[index + len(starting_str):]
+                 result = json_str + "}"
+             else:
+                 return
+
+         try:
+             # Convert and pretty print
+             data = json.loads(str(result))
+             data = json.dumps(data, indent=4)
+             return data
+         except (json.decoder.JSONDecodeError, TypeError):
+             print("The response is not in JSON format:\n")
+             print(result)
+
+     def invoke_pipeline_step(self, task_call, task_description, local):
+         if local:
+             with Progress(
+                     SpinnerColumn(),
+                     TextColumn("[progress.description]{task.description}"),
+                     transient=False,
+             ) as progress:
+                 progress.add_task(description=task_description, total=None)
+                 ret = task_call()
+         else:
+             print(task_description)
+             ret = task_call()
+
+         return ret
rag/agents/llamaindex/vllamaindex.py ADDED
@@ -0,0 +1,139 @@
+ from rag.agents.interface import Pipeline
+ from rich.progress import Progress, SpinnerColumn, TextColumn
+ from typing import Any
+ from pydantic import create_model
+ from typing import List
+ import warnings
+ import box
+ import yaml
+ import timeit
+ from rich import print
+ from llama_index.core import SimpleDirectoryReader
+ from llama_index.multi_modal_llms.ollama import OllamaMultiModal
+ from llama_index.core.program import MultiModalLLMCompletionProgram
+ from llama_index.core.output_parsers import PydanticOutputParser
+
+
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
+ warnings.filterwarnings("ignore", category=UserWarning)
+
+
+ # Import config vars
+ with open('config.yml', 'r', encoding='utf8') as ymlfile:
+     cfg = box.Box(yaml.safe_load(ymlfile))
+
+
+ class VLlamaIndexPipeline(Pipeline):
+     def run_pipeline(self,
+                      payload: str,
+                      query_inputs: [str],
+                      query_types: [str],
+                      keywords: [str],
+                      query: str,
+                      file_path: str,
+                      index_name: str,
+                      options: List[str] = None,
+                      group_by_rows: bool = True,
+                      update_targets: bool = True,
+                      debug: bool = False,
+                      local: bool = True) -> Any:
+         print(f"\nRunning pipeline with {payload}\n")
+
+         start = timeit.default_timer()
+
+         if file_path is None:
+             raise ValueError("File path is required for vllamaindex pipeline")
+
+         mm_model = self.invoke_pipeline_step(lambda: OllamaMultiModal(model=cfg.LLM_VLLAMAINDEX),
+                                              "Loading Ollama MultiModal...",
+                                              local)
+
+         # load as image documents
+         image_documents = self.invoke_pipeline_step(lambda: SimpleDirectoryReader(input_files=[file_path],
+                                                                                   required_exts=[".jpg", ".JPG",
+                                                                                                  ".JPEG"]).load_data(),
+                                                     "Loading image documents...",
+                                                     local)
+
+         ResponseModel = self.invoke_pipeline_step(lambda: self.build_response_class(query_inputs, query_types),
+                                                   "Building dynamic response class...",
+                                                   local)
+
+         prompt_template_str = """\
+         {query_str}
+
+         Return the answer as a Pydantic object. The Pydantic schema is given below:
+
+         """
+         mm_program = MultiModalLLMCompletionProgram.from_defaults(
+             output_parser=PydanticOutputParser(ResponseModel),
+             image_documents=image_documents,
+             prompt_template_str=prompt_template_str,
+             multi_modal_llm=mm_model,
+             verbose=True,
+         )
+
+         try:
+             response = self.invoke_pipeline_step(lambda: mm_program(query_str=query),
+                                                  "Running inference...",
+                                                  local)
+         except ValueError as e:
+             print(f"Error: {e}")
+             msg = 'Inference failed'
+             return '{"answer": "' + msg + '"}'
+
+         end = timeit.default_timer()
+
+         print(f"\nJSON response:\n")
+         for res in response:
+             print(res)
+         print('=' * 50)
+
+         print(f"Time to retrieve answer: {end - start}")
+
+         return response
+
+     # Function to safely evaluate type strings
+     def safe_eval_type(self, type_str, context):
+         try:
+             return eval(type_str, {}, context)
+         except NameError:
+             raise ValueError(f"Type '{type_str}' is not recognized")
+
+     def build_response_class(self, query_inputs, query_types_as_strings):
+         # Controlled context for eval
+         context = {
+             'List': List,
+             'str': str,
+             'int': int,
+             'float': float
+             # Include other necessary types or typing constructs here
+         }
+
+         # Convert string representations to actual types
+         query_types = [self.safe_eval_type(type_str, context) for type_str in query_types_as_strings]
+
+         # Create fields dictionary
+         fields = {name: (type_, ...) for name, type_ in zip(query_inputs, query_types)}
+
+         DynamicModel = create_model('DynamicModel', **fields)
+
+         return DynamicModel
+
+     def invoke_pipeline_step(self, task_call, task_description, local):
+         if local:
+             with Progress(
+                     SpinnerColumn(),
+                     TextColumn("[progress.description]{task.description}"),
+                     transient=False,
+             ) as progress:
+                 progress.add_task(description=task_description, total=None)
+                 ret = task_call()
+         else:
+             print(task_description)
+             ret = task_call()
+
+         return ret
rag/agents/llamaindex/vprocessor.py ADDED
@@ -0,0 +1,183 @@
+ from rag.agents.interface import Pipeline
+ from llama_index.core.program import LLMTextCompletionProgram
+ import json
+ from llama_index.llms.ollama import Ollama
+ from typing import List
+ from pydantic import create_model
+ from rich.progress import Progress, SpinnerColumn, TextColumn
+ import requests
+ import warnings
+ import box
+ import yaml
+ import timeit
+ from rich import print
+ from typing import Any
+
+
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
+ warnings.filterwarnings("ignore", category=UserWarning)
+
+
+ # Import config vars
+ with open('config.yml', 'r', encoding='utf8') as ymlfile:
+     cfg = box.Box(yaml.safe_load(ymlfile))
+
+
+ class VProcessorPipeline(Pipeline):
+     def run_pipeline(self,
+                      payload: str,
+                      query_inputs: [str],
+                      query_types: [str],
+                      keywords: [str],
+                      query: str,
+                      file_path: str,
+                      index_name: str,
+                      options: List[str] = None,
+                      group_by_rows: bool = True,
+                      update_targets: bool = True,
+                      debug: bool = False,
+                      local: bool = True) -> Any:
+         print(f"\nRunning pipeline with {payload}\n")
+
+         start = timeit.default_timer()
+
+         if file_path is None:
+             raise ValueError("File path is required for vprocessor pipeline")
+
+         with open(file_path, "rb") as file:
+             files = {'file': (file_path, file, 'image/jpeg')}
+
+             data = {
+                 'image_url': ''
+             }
+
+             response = self.invoke_pipeline_step(lambda: requests.post(cfg.OCR_ENDPOINT_VPROCESSOR,
+                                                                        data=data,
+                                                                        files=files,
+                                                                        timeout=180),
+                                                  "Running OCR...",
+                                                  local)
+
+         if response.status_code != 200:
+             print('Request failed with status code:', response.status_code)
+             print('Response:', response.text)
+
+             return "Failed to process file. Please try again."
+
+         end = timeit.default_timer()
+         print(f"Time to run OCR: {end - start}")
+
+         start = timeit.default_timer()
+
+         data = response.json()
+
+         ResponseModel = self.invoke_pipeline_step(lambda: self.build_response_class(query_inputs, query_types),
+                                                   "Building dynamic response class...",
+                                                   local)
+
+         prompt_template_str = """\
+         """ + query + """\
+         using this structured data, coming from OCR {document_data}.\
+         """
+
+         llm_ollama = self.invoke_pipeline_step(lambda: Ollama(model=cfg.LLM_VPROCESSOR,
+                                                               base_url=cfg.OLLAMA_BASE_URL_VPROCESSOR,
+                                                               temperature=0,
+                                                               request_timeout=900),
+                                                "Loading Ollama...",
+                                                local)
+
+         program = LLMTextCompletionProgram.from_defaults(
+             output_cls=ResponseModel,
+             prompt_template_str=prompt_template_str,
+             llm=llm_ollama,
+             verbose=True,
+         )
+
+         output = self.invoke_pipeline_step(lambda: program(document_data=data),
+                                            "Running inference...",
+                                            local)
+
+         answer = self.beautify_json(output.model_dump_json())
+
+         end = timeit.default_timer()
+
+         print(f"\nJSON response:\n")
+         print(answer + '\n')
+         print('=' * 50)
+
+         print(f"Time to retrieve answer: {end - start}")
+
+         return answer
+
+     def prepare_files(self, file_path, file):
+         if file_path is not None:
+             with open(file_path, "rb") as file:
+                 files = {'file': (file_path, file, 'image/jpeg')}
+
+             data = {
+                 'image_url': ''
+             }
+         else:
+             files = {'file': (file.filename, file.file, file.content_type)}
+
+             data = {
+                 'image_url': ''
+             }
+         return data, files
+
+     # Function to safely evaluate type strings
+     def safe_eval_type(self, type_str, context):
+         try:
+             return eval(type_str, {}, context)
+         except NameError:
+             raise ValueError(f"Type '{type_str}' is not recognized")
+
+     def build_response_class(self, query_inputs, query_types_as_strings):
+         # Controlled context for eval
+         context = {
+             'List': List,
+             'str': str,
+             'int': int,
+             'float': float
+             # Include other necessary types or typing constructs here
+         }
+
+         # Convert string representations to actual types
+         query_types = [self.safe_eval_type(type_str, context) for type_str in query_types_as_strings]
+
+         # Create fields dictionary
+         fields = {name: (type_, ...) for name, type_ in zip(query_inputs, query_types)}
+
+         DynamicModel = create_model('DynamicModel', **fields)
+
+         return DynamicModel
+
+     def invoke_pipeline_step(self, task_call, task_description, local):
+         if local:
+             with Progress(
+                     SpinnerColumn(),
+                     TextColumn("[progress.description]{task.description}"),
+                     transient=False,
+             ) as progress:
+                 progress.add_task(description=task_description, total=None)
+                 ret = task_call()
+         else:
+             print(task_description)
+             ret = task_call()
+
+         return ret
+
+     def beautify_json(self, result):
+         try:
+             # Convert and pretty print
+             data = json.loads(str(result))
+             data = json.dumps(data, indent=4)
+             return data
+         except (json.decoder.JSONDecodeError, TypeError):
+             print("The response is not in JSON format:\n")
+             print(result)
+
+         return {}
rag/agents/sparrow_parse/__init__.py ADDED
File without changes
rag/agents/sparrow_parse/sparrow_parse.py ADDED
@@ -0,0 +1,137 @@
+ from rag.agents.interface import Pipeline
+ from sparrow_parse.vllm.inference_factory import InferenceFactory
+ from sparrow_parse.extractors.vllm_extractor import VLLMExtractor
+ import timeit
+ from rich import print
+ from rich.progress import Progress, SpinnerColumn, TextColumn
+ from typing import Any, List
+ from .sparrow_validator import Validator
+ from .sparrow_utils import is_valid_json, get_json_keys_as_string
+ import warnings
+ import os
+
+
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
+ warnings.filterwarnings("ignore", category=UserWarning)
+
+
+ class SparrowParsePipeline(Pipeline):
+
+     def __init__(self):
+         pass
+
+     def run_pipeline(self,
+                      payload: str,
+                      query_inputs: [str],
+                      query_types: [str],
+                      keywords: [str],
+                      query: str,
+                      file_path: str,
+                      index_name: str,
+                      options: List[str] = None,
+                      group_by_rows: bool = True,
+                      update_targets: bool = True,
+                      debug: bool = False,
+                      local: bool = True) -> Any:
+         print(f"\nRunning pipeline with {payload}\n")
+
+         start = timeit.default_timer()
+
+         query_all_data = False
+         if query == "*":
+             query_all_data = True
+             query = None
+         else:
+             try:
+                 query, query_schema = self.invoke_pipeline_step(lambda: self.prepare_query_and_schema(query, debug),
+                                                                 "Preparing query and schema", local)
+             except ValueError as e:
+                 raise e
+
+         llm_output = self.invoke_pipeline_step(lambda: self.execute_query(options, query_all_data, query, file_path, debug),
+                                                "Executing query", local)
+
+         validation_result = None
+         if query_all_data is False:
+             validation_result = self.invoke_pipeline_step(lambda: self.validate_result(llm_output, query_all_data, query_schema, debug),
+                                                           "Validating result", local)
+
+         end = timeit.default_timer()
+
+         print(f"Time to retrieve answer: {end - start}")
+
+         return validation_result if validation_result is not None else llm_output
+
+     def prepare_query_and_schema(self, query, debug):
+         is_query_valid = is_valid_json(query)
+         if not is_query_valid:
+             raise ValueError("Invalid query. Please provide a valid JSON query.")
+
+         query_keys = get_json_keys_as_string(query)
+         query_schema = query
+         query = "retrieve " + query_keys
+
+         query = query + ". return response in JSON format, by strictly following this JSON schema: " + query_schema
+
+         return query, query_schema
+
+     def execute_query(self, options, query_all_data, query, file_path, debug):
+         extractor = VLLMExtractor()
+
+         # export HF_TOKEN="hf_"
+         config = {}
+         # Guard against options being None before inspecting the first element
+         if options and options[0] == 'huggingface':
+             config = {
+                 "method": options[0],  # Could be 'huggingface' or 'local_gpu'
+                 "hf_space": options[1],
+                 "hf_token": os.getenv('HF_TOKEN')
+             }
+         else:
+             # Handle other cases if needed
+             return "First element is not 'huggingface'"
+
+         # Use the factory to get the correct instance
+         factory = InferenceFactory(config)
+         model_inference_instance = factory.get_inference_instance()
+
+         input_data = [
+             {
+                 "image": file_path,
+                 "text_input": query
+             }
+         ]
+
+         # Now you can run inference without knowing which implementation is used
+         llm_output = extractor.run_inference(model_inference_instance, input_data, generic_query=query_all_data,
+                                              debug=debug)
+
+         return llm_output
+
+     def validate_result(self, llm_output, query_all_data, query_schema, debug):
+         validator = Validator(query_schema)
+
+         validation_result = validator.validate_json_against_schema(llm_output, validator.generated_schema)
+         if validation_result is not None:
+             return validation_result
+         else:
+             if debug:
+                 print("LLM output is valid according to the schema.")
+
+     def invoke_pipeline_step(self, task_call, task_description, local):
+         if local:
+             with Progress(
+                     SpinnerColumn(),
+                     TextColumn("[progress.description]{task.description}"),
+                     transient=False,
+             ) as progress:
+                 progress.add_task(description=task_description, total=None)
+                 ret = task_call()
+         else:
+             print(task_description)
+             ret = task_call()
+
+         return ret
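To trace the query construction above, a short sketch of prepare_query_and_schema in action (the field names are illustrative): the schema-style JSON query doubles as the validation schema and as the formatting instruction in the prompt.

    pipeline = SparrowParsePipeline()
    query, schema = pipeline.prepare_query_and_schema('{"invoice_number": "str", "total": "str"}', debug=False)
    print(query)
    # retrieve invoice_number, total. return response in JSON format,
    # by strictly following this JSON schema: {"invoice_number": "str", "total": "str"}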
rag/agents/sparrow_parse/sparrow_utils.py ADDED
@@ -0,0 +1,54 @@
1
+ import json
2
+
3
+
4
+ def is_valid_json(json_string):
5
+ try:
6
+ json.loads(json_string)
7
+ return True
8
+ except json.JSONDecodeError as e:
9
+ print("JSONDecodeError:", e)
10
+ return False
11
+
12
+
13
+ def get_json_keys_as_string(json_string):
14
+ try:
15
+ # Load the JSON string into a Python object
16
+ json_data = json.loads(json_string)
17
+
18
+ # If the input is a list, treat it like a dictionary by merging all the keys
19
+ if isinstance(json_data, list):
20
+ merged_dict = {}
21
+ for item in json_data:
22
+ if isinstance(item, dict):
23
+ merged_dict.update(item)
24
+ json_data = merged_dict # Now json_data is a dictionary
25
+
26
+ # A helper function to recursively gather keys while preserving order
27
+ def extract_keys(data, keys):
28
+ if isinstance(data, dict):
29
+ for key, value in data.items():
30
+ if isinstance(value, dict):
31
+ # Recursively extract from nested dictionaries
32
+ extract_keys(value, keys)
33
+ elif isinstance(value, list):
34
+ # Process each dictionary inside the list
35
+ for item in value:
36
+ if isinstance(item, dict):
37
+ extract_keys(item, keys)
38
+ else:
39
+ if key not in keys:
40
+ keys.append(key)
41
+ return keys
42
+
43
+ # List to hold the keys in order
44
+ keys = []
45
+
46
+ # Process the top-level dictionary first
47
+ extract_keys(json_data, keys)
48
+
49
+ # Join and return the keys as a comma-separated string
50
+ return ', '.join(keys)
51
+
52
+ except json.JSONDecodeError:
53
+ print("Invalid JSON string.")
54
+ return ''
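A quick worked example (not part of this commit) of the traversal above: container keys are skipped and only leaf keys are collected, in order of first appearance, so nested invoice items contribute their field names rather than "invoice_items" itself:

    import json
    from rag.agents.sparrow_parse.sparrow_utils import get_json_keys_as_string

    example = json.dumps({
        "invoice_no": "example",
        "invoice_items": [{"description": "example", "quantity": 0.0}]
    })
    print(get_json_keys_as_string(example))
    # -> invoice_no, description, quantity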
rag/agents/sparrow_parse/sparrow_validator.py ADDED
@@ -0,0 +1,26 @@
1
+ from genson import SchemaBuilder
2
+ from jsonschema import validate, ValidationError
3
+ import json
4
+
5
+
6
+ class Validator:
7
+ def __init__(self, example_json):
8
+ self.generated_schema = self.generate_schema_from_example(example_json)
9
+
10
+ def generate_schema_from_example(self, example_json):
11
+ # Parse the example JSON into a Python object
12
+ example_data = json.loads(example_json)
13
+
14
+ # Generate the schema using Genson
15
+ builder = SchemaBuilder()
16
+ builder.add_object(example_data)
17
+
18
+ return builder.to_schema()
19
+
20
+ def validate_json_against_schema(self, json_string, schema):
21
+ try:
22
+ json_data = json.loads(json_string) # Parse LLM JSON output
23
+ validate(instance=json_data, schema=schema) # Validate against schema
24
+ return None # Return None if valid
25
+ except (json.JSONDecodeError, ValidationError) as e:
26
+ return str(e) # Return error message if invalid
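Usage sketch (not part of this commit): genson infers a schema from the example payload given to the constructor, and validate_json_against_schema returns None for conforming LLM output or an error string otherwise:

    from rag.agents.sparrow_parse.sparrow_validator import Validator

    validator = Validator('{"invoice_no": "example", "total_gross_worth": 0.0}')

    print(validator.validate_json_against_schema(
        '{"invoice_no": "61356291", "total_gross_worth": 212.09}',
        validator.generated_schema))  # -> None (valid)

    print(validator.validate_json_against_schema(
        '{"invoice_no": 61356291, "total_gross_worth": 212.09}',
        validator.generated_schema))  # -> error text such as "61356291 is not of type 'string'"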
rag/agents/unstructured/__init__.py ADDED
File without changes
rag/agents/unstructured/unstructured.py ADDED
@@ -0,0 +1,372 @@
1
+ from rag.agents.interface import Pipeline
2
+ import uuid
3
+ import weaviate
4
+ from weaviate.util import get_valid_uuid
5
+ from unstructured.chunking.title import chunk_by_title
6
+ from unstructured.documents.elements import DataSourceMetadata
7
+ from unstructured.partition.json import partition_json
8
+ from sentence_transformers import SentenceTransformer
9
+ from langchain.vectorstores.weaviate import Weaviate
10
+ from langchain.prompts import PromptTemplate
11
+ from langchain_community.llms import Ollama
12
+ import tempfile
13
+ import subprocess
14
+ import os
15
+ from typing import List, Dict
16
+ import warnings
17
+ import box
18
+ import yaml
19
+ import timeit
20
+ import json
21
+ from rich import print
22
+ from typing import Any
23
+ from rich.progress import Progress, SpinnerColumn, TextColumn
24
+ from pydantic.v1 import create_model
25
+
26
+
27
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
28
+ warnings.filterwarnings("ignore", category=UserWarning)
29
+
30
+
31
+ # Import config vars
32
+ with open('config.yml', 'r', encoding='utf8') as ymlfile:
33
+ cfg = box.Box(yaml.safe_load(ymlfile))
34
+
35
+
36
+ class UnstructuredPipeline(Pipeline):
37
+ def run_pipeline(self,
38
+ payload: str,
39
+ query_inputs: [str],
40
+ query_types: [str],
41
+ keywords: [str],
42
+ query: str,
43
+ file_path: str,
44
+ index_name: str,
45
+ options: List[str] = None,
46
+ group_by_rows: bool = True,
47
+ update_targets: bool = True,
48
+ debug: bool = False,
49
+ local: bool = True) -> Any:
50
+ print(f"\nRunning pipeline with {payload}\n")
51
+
52
+ if len(query_inputs) == 1:
53
+ raise ValueError("Please provide more than one query input")
54
+
55
+ start = timeit.default_timer()
56
+
57
+ output_dir = cfg.OUTPUT_DIR_UNSTRUCTURED
58
+ input_dir = cfg.INPUT_DIR_UNSTRUCTURED
59
+ weaviate_url = cfg.WEAVIATE_URL_UNSTRUCTURED
60
+ embedding_model_name = cfg.EMBEDDINGS_UNSTRUCTURED
61
+ device = cfg.DEVICE_UNSTRUCTURED
62
+
63
+ with tempfile.TemporaryDirectory() as temp_dir:
64
+ temp_input_dir = os.path.join(temp_dir, input_dir)
65
+ temp_output_dir = os.path.join(temp_dir, output_dir) if debug is False else output_dir
66
+
67
+ if debug:
68
+ print(f"Copying {file_path} to {temp_input_dir}")
69
+ os.makedirs(temp_input_dir, exist_ok=True)
70
+ os.system(f"cp {file_path} {temp_input_dir}")
71
+
72
+ os.makedirs(temp_output_dir, exist_ok=True)
73
+
74
+ files = self.invoke_pipeline_step(
75
+ lambda: self.process_files(temp_output_dir, temp_input_dir),
76
+ "Processing file with unstructured...",
77
+ local
78
+ )
79
+
80
+ vectorstore, embedding_model = self.invoke_pipeline_step(
81
+ lambda: self.build_vector_store(weaviate_url, embedding_model_name, device, files, debug),
82
+ "Building vector store...",
83
+ local
84
+ )
85
+
86
+ llm = self.invoke_pipeline_step(
87
+ lambda: Ollama(model=cfg.LLM_UNSTRUCTURED,
88
+ base_url=cfg.BASE_URL_UNSTRUCTURED),
89
+ "Initializing Ollama...",
90
+ local
91
+ )
92
+
93
+ raw_result, similar_docs = self.invoke_pipeline_step(
94
+ lambda: self.question_answer(query, vectorstore, embedding_model, device, llm),
95
+ "Answering question...",
96
+ local
97
+ )
98
+
99
+ answer = self.invoke_pipeline_step(
100
+ lambda: self.validate_output(raw_result, query_inputs, query_types),
101
+ "Validating output...",
102
+ local
103
+ )
104
+
105
+ if debug:
106
+ print("\n\n\n-------------------------")
107
+ print(f"QUERY: {query}")
108
+ print("\n\n\n-------------------------")
109
+ print(f"Answer: {answer}")
110
+ print("\n\n\n-------------------------")
111
+ for index, result in enumerate(similar_docs):
112
+ print(f"\n\n-- RESULT {index + 1}:\n")
113
+ print(result)
114
+
115
+ end = timeit.default_timer()
116
+
117
+ print(f"\nJSON response:\n")
118
+ print(answer + '\n')
119
+ print('=' * 50)
120
+
121
+ print(f"Time to retrieve answer: {end - start}")
122
+
123
+ return answer
124
+
125
+ def process_files(self, temp_output_dir, temp_input_dir):
126
+ self.process_local(output_dir=temp_output_dir, num_processes=2, input_path=temp_input_dir)
127
+ files = self.get_result_files(temp_output_dir)
128
+ return files
129
+
130
+ def build_vector_store(self, weaviate_url, embedding_model_name, device, files, debug):
131
+ client = self.create_local_weaviate_client(db_url=weaviate_url)
132
+ my_schema = self.get_schema()
133
+ self.upload_schema(my_schema, weaviate=client)
134
+
135
+ vectorstore = Weaviate(client, "Doc", "text")
136
+ embedding_model = SentenceTransformer(embedding_model_name, device=device)
137
+
138
+ self.add_data_to_weaviate(
139
+ debug,
140
+ files=files,
141
+ client=client,
142
+ embedding_model=embedding_model,
143
+ device=device,
144
+ chunk_under_n_chars=cfg.CHUNK_UNDER_N_CHARS_UNSTRUCTURED,
145
+ chunk_new_after_n_chars=cfg.CHUNK_NEW_AFTER_N_CHARS_UNSTRUCTURED
146
+ )
147
+
148
+ if debug:
149
+ print(self.count_documents(client=client)['data']['Aggregate']['Doc'])
150
+
151
+ return vectorstore, embedding_model
152
+
153
+ def process_local(self, output_dir: str, num_processes: int, input_path: str):
154
+ command = [
155
+ "unstructured-ingest",
156
+ "local",
157
+ "--input-path", input_path,
158
+ "--output-dir", output_dir,
159
+ "--num-processes", str(num_processes),
160
+ "--recursive",
161
+ "--verbose",
162
+ ]
163
+
164
+ # Run the command
165
+ process = subprocess.Popen(command, stdout=subprocess.PIPE)
166
+ output, error = process.communicate()
167
+
168
+ # Print output
169
+ if process.returncode == 0:
170
+ print('Command executed successfully. Output:')
171
+ print(output.decode())
172
+ else:
173
+ print('Command failed. Error:')
174
+ print(error.decode())
175
+
176
+ def get_result_files(self, folder_path) -> List[Dict]:
177
+ file_list = []
178
+ for root, dirs, files in os.walk(folder_path):
179
+ for file in files:
180
+ if file.endswith('.json'):
181
+ file_path = os.path.join(root, file)
182
+ file_list.append(file_path)
183
+ return file_list
184
+
185
+
186
+ def create_local_weaviate_client(self, db_url: str):
187
+ return weaviate.Client(
188
+ url=db_url,
189
+ )
190
+
191
+ def get_schema(self, vectorizer: str = "none"):
192
+ return {
193
+ "classes": [
194
+ {
195
+ "class": "Doc",
196
+ "description": "A generic document class",
197
+ "vectorizer": vectorizer,
198
+ "properties": [
199
+ {
200
+ "name": "last_modified",
201
+ "dataType": ["text"],
202
+ "description": "Last modified date for the document",
203
+ },
204
+ {
205
+ "name": "player",
206
+ "dataType": ["text"],
207
+ "description": "Player related to the document",
208
+ },
209
+ {
210
+ "name": "position",
211
+ "dataType": ["text"],
212
+ "description": "Player Position related to the document",
213
+ },
214
+ {
215
+ "name": "text",
216
+ "dataType": ["text"],
217
+ "description": "Text content for the document",
218
+ },
219
+ ],
220
+ },
221
+ ],
222
+ }
223
+
224
+ def upload_schema(self, my_schema, weaviate):
225
+ weaviate.schema.delete_all()
226
+ weaviate.schema.create(my_schema)
227
+
228
+ def count_documents(self, client: weaviate.Client) -> Dict:
229
+ response = (
230
+ client.query
231
+ .aggregate("Doc")
232
+ .with_meta_count()
233
+ .do()
234
+ )
235
+ count = response
236
+ return count
237
+
238
+ def compute_embedding(self, chunk_text: List[str], embedding_model, device):
239
+ embeddings = embedding_model.encode(chunk_text, device=device)
240
+ return embeddings
241
+
242
+ def get_chunks(self, elements, embedding_model, device, chunk_under_n_chars=500, chunk_new_after_n_chars=1500):
243
+ for element in elements:
244
+ if not isinstance(element.metadata.data_source, DataSourceMetadata):
245
+ delattr(element.metadata, "data_source")
246
+
247
+ if hasattr(element.metadata, "coordinates"):
248
+ delattr(element.metadata, "coordinates")
249
+
250
+ chunks = chunk_by_title(
251
+ elements,
252
+ combine_text_under_n_chars=chunk_under_n_chars,
253
+ new_after_n_chars=chunk_new_after_n_chars
254
+ )
255
+
256
+ for i in range(len(chunks)):
257
+ chunks[i] = {"last_modified": chunks[i].metadata.last_modified, "text": chunks[i].text}
258
+
259
+ chunk_texts = [x['text'] for x in chunks]
260
+ embeddings = self.compute_embedding(chunk_texts, embedding_model, device)
261
+ return chunks, embeddings
262
+
263
+ def add_data_to_weaviate(self, debug, files, client, embedding_model, device, chunk_under_n_chars=500, chunk_new_after_n_chars=1500):
264
+ for filename in files:
265
+ try:
266
+ elements = partition_json(filename=filename)
267
+ chunks, embeddings = self.get_chunks(elements, embedding_model, device, chunk_under_n_chars, chunk_new_after_n_chars)
268
+ except IndexError as e:
269
+ print(e)
270
+ continue
271
+
272
+ if debug:
273
+ print(f"Uploading {len(chunks)} chunks for {str(filename)}.")
274
+
275
+ for i, chunk in enumerate(chunks):
276
+ client.batch.add_data_object(
277
+ data_object=chunk,
278
+ class_name="doc",
279
+ uuid=get_valid_uuid(uuid.uuid4()),
280
+ vector=embeddings[i]
281
+ )
282
+
283
+ client.batch.flush()
284
+
285
+ def question_answer(self, question: str, vectorstore: Weaviate, embedding_model, device, llm):
286
+ embedding = self.compute_embedding(question, embedding_model, device)
287
+ similar_docs = vectorstore.max_marginal_relevance_search_by_vector(embedding)
288
+ content = [x.page_content for x in similar_docs]
289
+ prompt_template = PromptTemplate.from_template(
290
+ """\
291
+ Given context about the subject, answer the question based on the context provided to the best of your ability.
292
+ Context: {context}
293
+ Question:
294
+ {question}
295
+ Answer:
296
+ """
297
+ )
298
+ prompt = prompt_template.format(context=content, question=question)
299
+ answer = llm(prompt)
300
+ return answer, similar_docs
301
+
302
+ def validate_output(self, raw_result, query_inputs, query_types):
303
+ if raw_result is None:
304
+ return {}
305
+
306
+ clean_str = raw_result.replace('<|im_end|>', '')
307
+
308
+ # Convert the cleaned string to a dictionary
309
+ response_dict = json.loads(clean_str)
310
+
311
+ ResponseModel = self.build_response_class(query_inputs, query_types)
312
+
313
+ # Validate and create a Pydantic model instance
314
+ validated_response = ResponseModel(**response_dict)
315
+
316
+ # Convert the model instance to JSON
317
+ answer = self.beautify_json(validated_response.json())
318
+
319
+ return answer
320
+
321
+ def safe_eval_type(self, type_str, context):
322
+ try:
323
+ return eval(type_str, {}, context)
324
+ except NameError:
325
+ raise ValueError(f"Type '{type_str}' is not recognized")
326
+
327
+ def build_response_class(self, query_inputs, query_types_as_strings):
328
+ # Controlled context for eval
329
+ context = {
330
+ 'List': List,
331
+ 'str': str,
332
+ 'int': int,
333
+ 'float': float
334
+ # Include other necessary types or typing constructs here
335
+ }
336
+
337
+ # Convert string representations to actual types
338
+ query_types = [self.safe_eval_type(type_str, context) for type_str in query_types_as_strings]
339
+
340
+ # Create fields dictionary
341
+ fields = {name: (type_, ...) for name, type_ in zip(query_inputs, query_types)}
342
+
343
+ DynamicModel = create_model('DynamicModel', **fields)
344
+
345
+ return DynamicModel
346
+
347
+ def beautify_json(self, result):
348
+ try:
349
+ # Convert and pretty print
350
+ data = json.loads(str(result))
351
+ data = json.dumps(data, indent=4)
352
+ return data
353
+ except (json.decoder.JSONDecodeError, TypeError):
354
+ print("The response is not in JSON format:\n")
355
+ print(result)
356
+
357
+ return {}
358
+
359
+ def invoke_pipeline_step(self, task_call, task_description, local):
360
+ if local:
361
+ with Progress(
362
+ SpinnerColumn(),
363
+ TextColumn("[progress.description]{task.description}"),
364
+ transient=False,
365
+ ) as progress:
366
+ progress.add_task(description=task_description, total=None)
367
+ ret = task_call()
368
+ else:
369
+ print(task_description)
370
+ ret = task_call()
371
+
372
+ return ret
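For reference, a sketch (not part of this commit) of what build_response_class above produces: the comma-separated query inputs and type strings passed on the command line become a dynamic Pydantic model, which is what actually enforces the response types:

    from typing import List
    from pydantic.v1 import create_model

    # equivalent to build_response_class(["invoice_number", "names_of_invoice_items"],
    #                                    ["int", "List[str]"])
    DynamicModel = create_model('DynamicModel',
                                invoice_number=(int, ...),
                                names_of_invoice_items=(List[str], ...))

    validated = DynamicModel(invoice_number=61356291,
                             names_of_invoice_items=["Wine Glasses Goblets Pair Clear Glass"])
    print(validated.json())
    # {"invoice_number": 61356291, "names_of_invoice_items": ["Wine Glasses Goblets Pair Clear Glass"]}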
rag/agents/unstructured/unstructured_light.py ADDED
@@ -0,0 +1,293 @@
1
+ from rag.agents.interface import Pipeline
2
+ from unstructured.partition.pdf import partition_pdf
3
+ from unstructured.partition.image import partition_image
4
+ from unstructured.staging.base import elements_to_json
5
+ from langchain_community.document_loaders import TextLoader
6
+ from langchain.text_splitter import CharacterTextSplitter
7
+ from langchain_community.embeddings import OllamaEmbeddings
8
+ from langchain.chains import RetrievalQA
9
+ from langchain_community.vectorstores import Chroma
10
+ from langchain_community.llms import Ollama
11
+ from pydantic.v1 import create_model
12
+ from typing import List
13
+ from rich.progress import Progress, SpinnerColumn, TextColumn
14
+ import tempfile
15
+ import json
16
+ import warnings
17
+ import box
18
+ import yaml
19
+ import timeit
20
+ from rich import print
21
+ from typing import Any
22
+ import os
23
+
24
+
25
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
26
+ warnings.filterwarnings("ignore", category=UserWarning)
27
+
28
+
29
+ # Import config vars
30
+ with open('config.yml', 'r', encoding='utf8') as ymlfile:
31
+ cfg = box.Box(yaml.safe_load(ymlfile))
32
+
33
+
34
+ class UnstructuredLightPipeline(Pipeline):
35
+ def run_pipeline(self,
36
+ payload: str,
37
+ query_inputs: [str],
38
+ query_types: [str],
39
+ keywords: [str],
40
+ query: str,
41
+ file_path: str,
42
+ index_name: str,
43
+ options: List[str] = None,
44
+ group_by_rows: bool = True,
45
+ update_targets: bool = True,
46
+ debug: bool = False,
47
+ local: bool = True) -> Any:
48
+ print(f"\nRunning pipeline with {payload}\n")
49
+
50
+ if len(query_inputs) == 1:
51
+ raise ValueError("Please provide more than one query input")
52
+
53
+ start = timeit.default_timer()
54
+
55
+ strategy = cfg.STRATEGY_UNSTRUCTURED_LIGHT
56
+ model_name = cfg.MODEL_UNSTRUCTURED_LIGHT
57
+
58
+ extract_tables = False
59
+ # Initialize options as an empty list if it is None
60
+ options = options or []
61
+ if "tables" in options:
62
+ extract_tables = True
63
+
64
+ # Extracts the elements from the PDF
65
+ elements = self.invoke_pipeline_step(
66
+ lambda: self.process_file(file_path, strategy, model_name),
67
+ "Extracting elements from the document...",
68
+ local
69
+ )
70
+
71
+ if debug:
72
+ new_extension = 'json' # the elements are serialized as JSON, so keep the .json extension
73
+ new_file_path = self.change_file_extension(file_path, new_extension)
74
+
75
+ documents = self.invoke_pipeline_step(
76
+ lambda: self.load_text_data(elements, new_file_path, extract_tables),
77
+ "Loading text data...",
78
+ local
79
+ )
80
+ else:
81
+ with tempfile.TemporaryDirectory() as temp_dir:
82
+ temp_file_path = os.path.join(temp_dir, "file_data.json")
83
+
84
+ documents = self.invoke_pipeline_step(
85
+ lambda: self.load_text_data(elements, temp_file_path, extract_tables),
86
+ "Loading text data...",
87
+ local
88
+ )
89
+
90
+ docs = self.invoke_pipeline_step(
91
+ lambda: self.split_text(documents, cfg.CHUNK_SIZE_UNSTRUCTURED_LIGHT, cfg.OVERLAP_UNSTRUCTURED_LIGHT),
92
+ "Splitting text...",
93
+ local
94
+ )
95
+
96
+ db = self.invoke_pipeline_step(
97
+ lambda: self.prepare_vector_store(docs, cfg.EMBEDDINGS_UNSTRUCTURED_LIGHT),
98
+ "Preparing vector store...",
99
+ local
100
+ )
101
+
102
+ llm = self.invoke_pipeline_step(
103
+ lambda: Ollama(model=cfg.LLM_UNSTRUCTURED_LIGHT,
104
+ base_url=cfg.BASE_URL_UNSTRUCTURED_LIGHT),
105
+ "Initializing Ollama...",
106
+ local
107
+ )
108
+
109
+ raw_result = self.invoke_pipeline_step(
110
+ lambda: self.execute_langchain_query(llm, db, query),
111
+ "Executing query...",
112
+ local
113
+ )
114
+
115
+ answer = self.invoke_pipeline_step(
116
+ lambda: self.validate_output(raw_result, query_inputs, query_types),
117
+ "Validating output...",
118
+ local
119
+ )
120
+
121
+ end = timeit.default_timer()
122
+
123
+ print(f"\nJSON response:\n")
124
+ print(answer + '\n')
125
+ print('=' * 50)
126
+
127
+ print(f"Time to retrieve answer: {end - start}")
128
+
129
+ return answer
130
+
131
+ def process_file(self, file_path, strategy, model_name):
132
+ elements = None
133
+
134
+ if file_path.lower().endswith('.pdf'):
135
+ elements = partition_pdf(
136
+ filename=file_path,
137
+ strategy=strategy,
138
+ infer_table_structure=True,
139
+ model_name=model_name
140
+ )
141
+ elif file_path.lower().endswith(('.jpg', '.jpeg', '.png')):
142
+ elements = partition_image(
143
+ filename=file_path,
144
+ strategy=strategy,
145
+ infer_table_structure=True,
146
+ model_name=model_name
147
+ )
148
+
149
+ return elements
150
+
151
+ def load_text_data(self, elements, file_path, extract_tables):
152
+ elements_to_json(elements, filename=file_path)
153
+ text_file = self.process_json_file(file_path, extract_tables)
154
+
155
+ loader = TextLoader(text_file)
156
+ documents = loader.load()
157
+
158
+ return documents
159
+
160
+ def split_text(self, text, chunk_size, overlap):
161
+ text_splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=overlap)
162
+ docs = text_splitter.split_documents(text)
163
+
164
+ return docs
165
+
166
+ def prepare_vector_store(self, docs, model_name):
167
+ db = Chroma.from_documents(
168
+ documents=docs,
169
+ collection_name="sparrow-rag",
170
+ embedding=OllamaEmbeddings(model=model_name)
171
+ )
172
+
173
+ return db
174
+
175
+ def execute_langchain_query(self, llm, db, query):
176
+ qa_chain = RetrievalQA.from_chain_type(llm, retriever=db.as_retriever())
177
+ response = qa_chain({"query": query})
178
+ raw_result = response['result']
179
+
180
+ return raw_result
181
+
182
+ def validate_output(self, raw_result, query_inputs, query_types):
183
+ if raw_result is None:
184
+ return {}
185
+
186
+ clean_str = raw_result.replace('<|im_end|>', '')
187
+
188
+ # Convert the cleaned string to a dictionary
189
+ response_dict = json.loads(clean_str)
190
+
191
+ ResponseModel = self.build_response_class(query_inputs, query_types)
192
+
193
+ # Validate and create a Pydantic model instance
194
+ validated_response = ResponseModel(**response_dict)
195
+
196
+ # Convert the model instance to JSON
197
+ answer = self.beautify_json(validated_response.json())
198
+
199
+ return answer
200
+
201
+ def process_json_file(self, input_data, extract_tables):
202
+ # Read the JSON file
203
+ with open(input_data, 'r') as file:
204
+ data = json.load(file)
205
+
206
+ # Iterate over the JSON data and extract required table elements
207
+ extracted_elements = []
208
+ for entry in data:
209
+ if entry["type"] == "Table":
210
+ extracted_elements.append(entry["metadata"]["text_as_html"])
211
+ elif entry["type"] == "Title" and extract_tables is False:
212
+ extracted_elements.append(entry["text"])
213
+ elif entry["type"] == "NarrativeText" and extract_tables is False:
214
+ extracted_elements.append(entry["text"])
215
+ elif entry["type"] == "UncategorizedText" and extract_tables is False:
216
+ extracted_elements.append(entry["text"])
217
+
218
+ # Write the extracted elements to the output file
219
+ new_extension = 'txt' # the extracted text is written alongside the input JSON file
220
+ new_file_path = self.change_file_extension(input_data, new_extension)
221
+ with open(new_file_path, 'w') as output_file:
222
+ for element in extracted_elements:
223
+ output_file.write(element + "\n\n") # Adding two newlines for separation
224
+
225
+ return new_file_path
226
+
227
+ # Function to safely evaluate type strings
228
+ def safe_eval_type(self, type_str, context):
229
+ try:
230
+ return eval(type_str, {}, context)
231
+ except NameError:
232
+ raise ValueError(f"Type '{type_str}' is not recognized")
233
+
234
+ def build_response_class(self, query_inputs, query_types_as_strings):
235
+ # Controlled context for eval
236
+ context = {
237
+ 'List': List,
238
+ 'str': str,
239
+ 'int': int,
240
+ 'float': float
241
+ # Include other necessary types or typing constructs here
242
+ }
243
+
244
+ # Convert string representations to actual types
245
+ query_types = [self.safe_eval_type(type_str, context) for type_str in query_types_as_strings]
246
+
247
+ # Create fields dictionary
248
+ fields = {name: (type_, ...) for name, type_ in zip(query_inputs, query_types)}
249
+
250
+ DynamicModel = create_model('DynamicModel', **fields)
251
+
252
+ return DynamicModel
253
+
254
+ def change_file_extension(self, file_path, new_extension):
255
+ # Check if the new extension starts with a dot and add one if not
256
+ if not new_extension.startswith('.'):
257
+ new_extension = '.' + new_extension
258
+
259
+ # Split the file path into two parts: the base (everything before the last dot) and the extension
260
+ # If there's no dot in the filename, it'll just return the original filename without an extension
261
+ base = file_path.rsplit('.', 1)[0]
262
+
263
+ # Concatenate the base with the new extension
264
+ new_file_path = base + new_extension
265
+
266
+ return new_file_path
267
+
268
+ def beautify_json(self, result):
269
+ try:
270
+ # Convert and pretty print
271
+ data = json.loads(str(result))
272
+ data = json.dumps(data, indent=4)
273
+ return data
274
+ except (json.decoder.JSONDecodeError, TypeError):
275
+ print("The response is not in JSON format:\n")
276
+ print(result)
277
+
278
+ return {}
279
+
280
+ def invoke_pipeline_step(self, task_call, task_description, local):
281
+ if local:
282
+ with Progress(
283
+ SpinnerColumn(),
284
+ TextColumn("[progress.description]{task.description}"),
285
+ transient=False,
286
+ ) as progress:
287
+ progress.add_task(description=task_description, total=None)
288
+ ret = task_call()
289
+ else:
290
+ print(task_description)
291
+ ret = task_call()
292
+
293
+ return ret
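For orientation, a sketch (field values invented) of the element JSON that process_json_file above consumes; the field names follow what unstructured's elements_to_json writes:

    [
      {"type": "Title", "text": "Invoice no: 61356291", "metadata": {}},
      {"type": "NarrativeText", "text": "Date of issue: 09/06/2012", "metadata": {}},
      {"type": "Table", "text": "Description Qty",
       "metadata": {"text_as_html": "<table><tr><td>Description</td><td>Qty</td></tr></table>"}}
    ]

With --options tables only the text_as_html of Table entries is written to the intermediate .txt file; otherwise Title, NarrativeText and UncategorizedText entries are kept as plain text.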
requirements_haystack.txt ADDED
@@ -0,0 +1,14 @@
1
+ pypdf
2
+ python-box
3
+ typer[all]
4
+ fastapi==0.110.0
5
+ uvicorn[standard]
6
+ ollama-haystack==0.0.5
7
+ haystack-ai==2.0.0
8
+ weaviate-haystack==1.0.2
9
+ ollama==0.1.7
10
+ python-multipart
11
+ sentence-transformers
12
+
13
+ # Force reinstall:
14
+ # pip install --force-reinstall -r requirements_haystack.txt
requirements_instructor.txt ADDED
@@ -0,0 +1,16 @@
1
+ ollama==0.2.1
2
+ python-multipart
3
+ yfinance==0.2.40
4
+ instructor==1.3.5
5
+ python-box
6
+ PyYAML
7
+ rich
8
+ typer[all]
9
+ fastapi==0.111.1
10
+ uvicorn[standard]
11
+ sparrow-parse==0.3.2
12
+ numpy==1.26.4
13
+
14
+
15
+ # Force reinstall:
16
+ # pip install --force-reinstall -r requirements_instructor.txt
requirements_llamaindex.txt ADDED
@@ -0,0 +1,27 @@
1
+ llama-index==0.10.23
2
+ llama-index-core==0.10.23.post1
3
+ llama-index-embeddings-langchain==0.1.2
4
+ llama-index-llms-ollama==0.1.2
5
+ llama-index-vector-stores-weaviate==0.1.4
6
+ llama-index-multi-modal-llms-ollama==0.1.3
7
+ llama-index-readers-file==0.1.12
8
+ llama-index-embeddings-huggingface==0.1.4
9
+ llama-index-vector-stores-qdrant==0.1.4
10
+ llama-index-embeddings-clip==0.1.4
11
+ sentence-transformers
12
+ weaviate-client==3.26.2
13
+ pypdf
14
+ python-box
15
+ typer[all]
16
+ fastapi==0.110.0
17
+ uvicorn[standard]
18
+ ollama==0.1.7
19
+ python-multipart
20
+
21
+
22
+ # LlamaIndex upgrade:
23
+ # pip uninstall llama-index
24
+ # pip install llama-index --upgrade --no-cache-dir --force-reinstall
25
+
26
+ # Force reinstall:
27
+ # pip install --force-reinstall -r requirements_llamaindex.txt
requirements_sparrow_parse.txt ADDED
@@ -0,0 +1,13 @@
1
+ python-multipart
2
+ rich
3
+ typer[all]
4
+ fastapi==0.115.0
5
+ uvicorn[standard]
6
+ sparrow-parse==0.3.4
7
+ genson==1.3.0
8
+ jsonschema==4.23.0
9
+ python-dotenv
10
+
11
+
12
+ # Force reinstall:
13
+ # pip install --force-reinstall -r requirements_sparrow_parse.txt
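Since the sparrow-parse agent reads the token via os.getenv('HF_TOKEN') and python-dotenv is listed above, the token can be exported in the shell or kept in a local .env file; a sketch with a placeholder value:

    # .env (never commit a real token)
    # HF_TOKEN=hf_xxx

    from dotenv import load_dotenv
    import os

    load_dotenv()  # loads .env entries into the process environment
    assert os.getenv('HF_TOKEN'), "HF_TOKEN is required for the Hugging Face backend"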
requirements_unstructured.txt ADDED
@@ -0,0 +1,19 @@
1
+ unstructured[all-docs]==0.13.3
2
+ unstructured-inference==0.7.27
3
+ langchain==0.1.16
4
+ langchain-community==0.0.34
5
+ langchain-core==0.1.45
6
+ chromadb
7
+ sentence_transformers
8
+ python-box
9
+ rich
10
+ typer[all]
11
+ fastapi==0.110.2
12
+ uvicorn[standard]
13
+ ollama==0.1.8
14
+ python-multipart
15
+ weaviate-client==4.5.5
16
+
17
+
18
+ # Force reinstall:
19
+ # pip install --force-reinstall -r requirements_unstructured.txt
sample_prompts.txt ADDED
@@ -0,0 +1,390 @@
1
+ ./sparrow.sh "invoice_number, invoice_date, client_name, client_address, client_tax_id, seller_name, seller_address,
2
+ seller_tax_id, iban, names_of_invoice_items, gross_worth_of_invoice_items, total_gross_worth" "int, str, str, str, str,
3
+ str, str, str, str, List[str], List[str], str" --agent llamaindex --index-name Sparrow_llamaindex_doc1
4
+
5
+
6
+ {
7
+ "invoice_number": 61356291,
8
+ "invoice_date": "09/06/2012",
9
+ "client_name": "Rodriguez-Stevens",
10
+ "client_address": "2280 Angela Plain, Hortonshire, MS 93248",
11
+ "client_tax_id": "939-98-8477",
12
+ "seller_name": "Chapman, Kim and Green",
13
+ "seller_address": "64731 James Branch, Smithmouth, NC 26872",
14
+ "seller_tax_id": "949-84-9105",
15
+ "iban": "GB50ACIE59715038217063",
16
+ "names_of_invoice_items": [
17
+ "Wine Glasses Goblets Pair Clear Glass",
18
+ "With Hooks Stemware Storage Multiple Uses Iron Wine Rack Hanging Glass",
19
+ "Replacement Corkscrew Parts Spiral Worm Wine Opener Bottle Houdini",
20
+ "HOME ESSENTIALS GRADIENT STEMLESS WINE GLASSES SET OF 4 20 FL OZ (591 ml) NEW"
21
+ ],
22
+ "gross_worth_of_invoice_items": [
23
+ 66.0,
24
+ 123.55,
25
+ 8.25,
26
+ 14.29
27
+ ],
28
+ "total_gross_worth": "$212,09"
29
+ }
30
+ ==================================================
31
+ Time to retrieve answer: 63.74948522399791
32
+
33
+
34
+ ./sparrow.sh "invoice_number, invoice_date" "int, str" --agent llamaindex --index-name Sparrow_llamaindex_doc1
35
+
36
+ {
37
+ "invoice_number": 61356291,
38
+ "invoice_date": "09/06/2012"
39
+ }
40
+ ==================================================
41
+ Time to retrieve answer: 15.325319556002796
42
+
43
+
44
+ ./sparrow.sh "gross_worth_of_invoice_items" "List[float]" --agent llamaindex --index-name Sparrow_llamaindex_doc1
45
+
46
+ {
47
+ "gross_worth_of_invoice_items": [
48
+ 66.0,
49
+ 123.55,
50
+ 8.25,
51
+ 14.29
52
+ ]
53
+ }
54
+ ==================================================
55
+ Time to retrieve answer: 17.55766561099881
56
+
57
+
58
+ ./sparrow.sh "guest_no, cashier_name" "int, str" --agent vllamaindex --file-path
59
+ /Users/andrejb/infra/shared/katana-git/sparrow/sparrow-ml/llm/data/inout-20211211_001.jpg
60
+
61
+ {
62
+ "guest_no": 49,
63
+ "cashier_name": "Cashier Name"
64
+ }
65
+
66
+
67
+
68
+ ./sparrow.sh "store_name, receipt_id, receipt_item_names, receipt_item_prices, receipt_date, receipt_store_id,
69
+ receipt_sold, receipt_returned, receipt_total" "str, str, List[str], List[str], str, int, int,
70
+ int, str" --agent vprocessor --file-path /Users/andrejb/infra/shared/katana-git/sparrow/sparrow-ml/llm/data/ross-20211211_010.jpg
71
+
72
+ {
73
+ "store_name": "Ross",
74
+ "receipt_id": "Receipt # 0421-01-1602-1330-0",
75
+ "receipt_item_names": [
76
+ "400226513665 x hanes b1ue 4pk",
77
+ "400239602790 fruit premium 4pk"
78
+ ],
79
+ "receipt_item_prices": [
80
+ "$9.99R",
81
+ "$12.99R"
82
+ ],
83
+ "receipt_date": "11/26/21 10:35:05 AM",
84
+ "receipt_store_id": 421,
85
+ "receipt_sold": 2,
86
+ "receipt_returned": 0,
87
+ "receipt_total": "$25.33"
88
+ }
89
+ ==================================================
90
+ Time to retrieve answer: 106.27733000399894
91
+
92
+
93
+ ./sparrow.sh assistant --agent "fcall" --query "Exxon"
94
+
95
+ {
96
+ "company": "ExxonMobil",
97
+ "ticker": "XOM"
98
+ }
99
+ The stock price of the ExxonMobil is 113.48999786376953. USD
100
+ ==================================================
101
+ Time to retrieve answer: 16.426633964991197
102
+
103
+
104
+ ./sparrow.sh "invoice_number, invoice_date, total_gross_worth" "int, str, str" --agent unstructured-light
105
+ --file-path /Users/andrejb/infra/shared/katana-git/sparrow/sparrow-ml/llm/data/invoice_1.pdf
106
+
107
+ {
108
+ "invoice_number": 61356291,
109
+ "invoice_date": "09/06/2012",
110
+ "total_gross_worth": "$ 212,09"
111
+ }
112
+ ==================================================
113
+ Time to retrieve answer: 93.95840702600253
114
+
115
+
116
+ ./sparrow.sh "names_of_invoice_items, gross_worth_of_invoice_items, total_gross_worth" "List[str], List[str], str"
117
+ --agent unstructured-light --file-path /Users/andrejb/infra/shared/katana-git/sparrow/sparrow-ml/llm/data/invoice_1.pdf
118
+ --options tables
119
+
120
+ {
121
+ "names_of_invoice_items": [
122
+ "Wine Glasses Goblets Pair Clear Glass",
123
+ "With Hooks Stemware Storage Multiple Uses Iron Wine Rack Hanging Glass",
124
+ "Replacement Corkscrew Parts Spiral Worm Wine Opener Bottle Houdini",
125
+ "HOME ESSENTIALS GRADIENT STEMLESS WINE GLASSES SET OF 4 20 FL OZ (591 ml) NEW"
126
+ ],
127
+ "gross_worth_of_invoice_items": [
128
+ "$66.00",
129
+ "$123.55",
130
+ "$8.25",
131
+ "$14.29"
132
+ ],
133
+ "total_gross_worth": "$212.09"
134
+ }
135
+ ==================================================
136
+ Time to retrieve answer: 109.55890596199606
137
+
138
+
139
+ ./sparrow.sh "invoice_number, invoice_date, client_name, client_address, client_tax_id, seller_name, seller_address,
140
+ seller_tax_id, iban, names_of_invoice_items, gross_worth_of_invoice_items, total_gross_worth" "int, str, str, str, str,
141
+ str, str, str, str, List[str], List[str], str" --agent unstructured --file-path /Users/andrejb/infra/shared/katana-git/sparrow/sparrow-ml/llm/data/invoice_1.pdf
142
+
143
+ {
144
+ "invoice_number": 61356291,
145
+ "invoice_date": "09/06/2012",
146
+ "client_name": "Rodriguez-Stevens",
147
+ "client_address": "2280 Angela Plain Hortonshire, MS 93248",
148
+ "client_tax_id": "939-98-8477",
149
+ "seller_name": "Chapman, Kim and Green",
150
+ "seller_address": "64731 James Branch Smithmouth, NC 26872",
151
+ "seller_tax_id": "949-84-9105",
152
+ "iban": "GB50ACIE59715038217063",
153
+ "names_of_invoice_items": [
154
+ "Wine Glasses Goblets Pair Clear Glass",
155
+ "With Hooks Stemware Storage Multiple Uses Iron Wine Rack Hanging Glass",
156
+ "Replacement Corkscrew Parts Spiral Worm Wine Opener Bottle Houdini",
157
+ "HOME ESSENTIALS GRADIENT STEMLESS WINE GLASSES SET OF 4 20 FL OZ (591 ml) NEW"
158
+ ],
159
+ "gross_worth_of_invoice_items": [
160
+ "6,00",
161
+ "123,55",
162
+ "8,25",
163
+ "14,29"
164
+ ],
165
+ "total_gross_worth": "$ 192,81"
166
+ }
167
+ ==================================================
168
+ Time to retrieve answer: 85.94320003400207
169
+
170
+ ./sparrow.sh "invoice_number, invoice_date, total_gross_worth" "int, str, str" --agent unstructured --file-path
171
+ /Users/andrejb/infra/shared/katana-git/sparrow/sparrow-ml/llm/data/invoice_1.pdf
172
+
173
+ {
174
+ "invoice_number": 61356291,
175
+ "invoice_date": "09/06/2012",
176
+ "total_gross_worth": "$ 212,09"
177
+ }
178
+ ==================================================
179
+ Time to retrieve answer: 24.074920559010934
180
+
181
+
182
+ ./sparrow.sh "store_name, receipt_id, receipt_item_names, receipt_item_prices, receipt_date, receipt_store_id, receipt_sold,
183
+ receipt_returned, receipt_total" "str, str, List[str], List[str], str, int, int,
184
+ int, str" --agent unstructured --file-path /Users/andrejb/infra/shared/katana-git/sparrow/sparrow-ml/llm/data/ross-20211211_010.jpg
185
+
186
+ {
187
+ "store_name": "IT OSS DRESS FOR LESS PASADENA, CA 91107 626-351-5334 # 0421-01-1602-1330-",
188
+ "receipt_id": "0421-01-1602-1330-",
189
+ "receipt_item_names": [
190
+ "A iain an 6513665 x hanes blue 4pk 9.99R 4nbes9e05500",
191
+ "fruit premium 4pk 12:98"
192
+ ],
193
+ "receipt_item_prices": [
194
+ "$9.99",
195
+ "$12.98"
196
+ ],
197
+ "receipt_date": "11/26/21 10:35:05 AM",
198
+ "receipt_store_id": 421,
199
+ "receipt_sold": 2,
200
+ "receipt_returned": 0,
201
+ "receipt_total": "$25.00"
202
+ }
203
+
204
+ ==================================================
205
+ Time to retrieve answer: 76.49691557901679
206
+
207
+
208
+ ./sparrow.sh "store_name, receipt_id, receipt_item_names, receipt_item_prices, receipt_date, receipt_store_id, receipt_sold,
209
+ receipt_returned, receipt_total" "str, str, List[str], List[str], str, int, int, int,
210
+ str" --agent unstructured-light --file-path /Users/andrejb/infra/shared/katana-git/sparrow/sparrow-ml/llm/data/ross-20211211_010.jpg
211
+
212
+ {
213
+ "store_name": "Ross Dress for Less",
214
+ "receipt_id": "0421-01-1602-1330-0",
215
+ "receipt_item_names": [
216
+ "A iain an 6513665 x hanes blue 4pk",
217
+ "9.99R 4nbes9e05500 fruit premium 4pk"
218
+ ],
219
+ "receipt_item_prices": [
220
+ "$22.98",
221
+ "$22.98"
222
+ ],
223
+ "receipt_date": "11/26/21",
224
+ "receipt_store_id": 421,
225
+ "receipt_sold": 2,
226
+ "receipt_returned": 0,
227
+ "receipt_total": "$25"
228
+ }
229
+ ==================================================
230
+ Time to retrieve answer: 80.8209542609984
231
+
232
+
233
+ ./sparrow.sh "names_of_invoice_items, gross_worth_of_invoice_items, total_gross_worth" "List[str], List[str], str" --agent instructor --file-path /Users/andrejb/infra/s
234
+ hared/katana-git/sparrow/sparrow-ml/llm/data/invoice_1.pdf
235
+
236
+ {
237
+ "names_of_invoice_items": [
238
+ "Wine Glasses Goblets Pair Clear Glass",
239
+ "With Hooks Stemware Storage Multiple Uses Iron Wine Rack Hanging Glass",
240
+ "Replacement Corkscrew Parts Spiral Worm Wine Opener Bottle Houdini",
241
+ "HOME ESSENTIALS GRADIENT STEMLESS WINE GLASSES SET OF 4 20 FL OZ (591 ml) NEW"
242
+ ],
243
+ "gross_worth_of_invoice_items": [
244
+ "66,00",
245
+ "123,55",
246
+ "8,25",
247
+ "14,29"
248
+ ],
249
+ "total_gross_worth": "212,09"
250
+ }
251
+ ==================================================
252
+ Time to retrieve answer: 97.52105149999261
253
+
254
+
255
+ ./sparrow.sh "invoice_number, invoice_date, description, quantity, net_price, net_worth, vat, gross_worth, total_gross_worth" "str, str, List[str], List[str],
256
+ List[str], List[str], List[str], List[str], str" --agent instructor --file-path /Users/andrejb/infra/shared/katana-git/sparrow/sparrow-ml/llm/data/invoice_1.pdf
257
+ --options tables --options unstructured --group-by-rows --update-targets --debug
258
+
259
+ {
260
+ "invoice_number": "61356291",
261
+ "invoice_date": "09/06/2012",
262
+ "total_gross_worth": "212.09",
263
+ "items1": [
264
+ {
265
+ "description": "Wine Glasses Goblets Pair Clear Glass",
266
+ "quantity": "5,00",
267
+ "net_price": "12,00",
268
+ "net_worth": "60,00",
269
+ "vat": "10%",
270
+ "gross_worth": "66,00"
271
+ },
272
+ {
273
+ "description": "With Hooks Stemware Storage Multiple Uses Iron Wine Rack Hanging Glass",
274
+ "quantity": "4,00",
275
+ "net_price": "28,08",
276
+ "net_worth": "112,32",
277
+ "vat": "10%",
278
+ "gross_worth": "123,55"
279
+ },
280
+ {
281
+ "description": "Replacement Corkscrew Parts Spiral Worm Wine Opener Bottle Houdini",
282
+ "quantity": "1,00",
283
+ "net_price": "7,50",
284
+ "net_worth": "7,50",
285
+ "vat": "10%",
286
+ "gross_worth": "8,25"
287
+ },
288
+ {
289
+ "description": "HOME ESSENTIALS GRADIENT STEMLESS WINE GLASSES SET OF 4 20 FL OZ (591 ml) NEW",
290
+ "quantity": "1,00",
291
+ "net_price": "12,99",
292
+ "net_worth": "12,99",
293
+ "vat": "10%",
294
+ "gross_worth": "14,29"
295
+ }
296
+ ]
297
+ }
298
+ ==================================================
299
+ Time to retrieve answer: 24.45439903100487
300
+
301
+
302
+ ./sparrow.sh "{\"invoice_no\":\"example\", \"invoice_date\":\"example\", \"seller_name\":\"example\", \"seller_address\":\"example\",
303
+ \"seller_taxid\":\"example\", \"seller_iban\":\"example\", \"client_name\":\"example\", \"client_address\":\"example\",
304
+ \"client_taxid\":\"example\", \"invoice_items\":[{\"description\":\"example\", \"quantity\":0.00, \"net_price\":0.00,
305
+ \"net_worth\":0.00, \"vat\":\"example\", \"gross_worth\":0.00}], \"invoice_summary\":[{\"net_worth\":0.00, \"vat\":0.00, \"gross_worth\":0.00}]}"
306
+ --agent "sparrow-parse" --debug --options huggingface --options katanaml/sparrow-qwen2-vl-7b --file-path "/Users/andrejb/infra/shared/katana-git/sparrow/sparrow-ml/llm/data/invoice_1.jpg"
307
+
308
+ {
309
+ "invoice_no": "61356291",
310
+ "invoice_date": "09/06/2012",
311
+ "seller_name": "Chapman, Kim and Green",
312
+ "seller_address": "64731 James Branch, Smithmouth, NC 26872",
313
+ "seller_taxid": "949-84-9105",
314
+ "seller_iban": "GB50ACIE59715038217063",
315
+ "client_name": "Rodriguez-Stevens",
316
+ "client_address": "2280 Angela Plain, Hortonshire, MS 93248",
317
+ "client_taxid": "939-98-8477",
318
+ "invoice_items": [
319
+ {
320
+ "description": "Wine Glasses Goblets Pair Clear Glass",
321
+ "quantity": 5.0,
322
+ "net_price": 12.0,
323
+ "net_worth": 60.0,
324
+ "vat": "10%",
325
+ "gross_worth": 66.0
326
+ },
327
+ {
328
+ "description": "With Hooks Stemware Storage Multiple Uses Iron Wine Rack Hanging Glass",
329
+ "quantity": 4.0,
330
+ "net_price": 28.08,
331
+ "net_worth": 112.32,
332
+ "vat": "10%",
333
+ "gross_worth": 123.55
334
+ },
335
+ {
336
+ "description": "Replacement Corkscrew Parts Spiral Worm Wine Opener Bottle Houdini",
337
+ "quantity": 1.0,
338
+ "net_price": 7.5,
339
+ "net_worth": 7.5,
340
+ "vat": "10%",
341
+ "gross_worth": 8.25
342
+ },
343
+ {
344
+ "description": "HOME ESSENTIALS GRADIENT STEMLESS WINE GLASSES SET OF 4 20 FL OZ (591 ml) NEW",
345
+ "quantity": 1.0,
346
+ "net_price": 12.99,
347
+ "net_worth": 12.99,
348
+ "vat": "10%",
349
+ "gross_worth": 14.29
350
+ }
351
+ ],
352
+ "invoice_summary": [
353
+ {
354
+ "net_worth": 192.81,
355
+ "vat": 19.28,
356
+ "gross_worth": 212.09
357
+ }
358
+ ]
359
+ }
360
+
361
+ Time to retrieve answer: 47.84319644900097
362
+
363
+
364
+ ./sparrow.sh "[{\"instrument_name\":\"example\", \"valuation\":0}]" --agent "sparrow-parse" --debug --options huggingface
365
+ --options katanaml/sparrow-qwen2-vl-7b --file-path "/Users/andrejb/Documents/work/epik/bankstatement/bonds_table.png"
366
+
367
+ [
368
+ {
369
+ "instrument_name": "UNITS BLACKROCK FIX INC DUB FDS PLC ISHS EUR INV GRD CP BD IDX/INST/E",
370
+ "valuation": 19049
371
+ },
372
+ {
373
+ "instrument_name": "UNITS ISHARES III PLC CORE EUR GOVT BOND UCITS ETF/EUR",
374
+ "valuation": 83488
375
+ },
376
+ {
377
+ "instrument_name": "UNITS ISHARES III PLC EUR CORP BOND 1-5YR UCITS ETF/EUR",
378
+ "valuation": 213030
379
+ },
380
+ {
381
+ "instrument_name": "UNIT ISHARES VI PLC/JP MORGAN USD E BOND EUR HED UCITS ETF DIST/HDGD/",
382
+ "valuation": 32774
383
+ },
384
+ {
385
+ "instrument_name": "UNITS XTRACKERS II SICAV/EUR HY CORP BOND UCITS ETF/-1D-/DISTR.",
386
+ "valuation": 23643
387
+ }
388
+ ]
389
+
390
+ Time to retrieve answer: 22.78700271800335
sparrow.sh ADDED
@@ -0,0 +1,28 @@
1
+ #!/bin/bash
2
+
3
+ command -v python >/dev/null 2>&1 || { echo >&2 "Python is required but it's not installed. Aborting."; exit 1; }
4
+
5
+ # Check Python version
6
+ PYTHON_VERSION=$(python --version 2>&1) # Capture both stdout and stderr
7
+ echo "Detected Python version: $PYTHON_VERSION"
8
+ if [[ ! "$PYTHON_VERSION" == *"3.10.4"* ]]; then
9
+ echo "Python version 3.10.4 is required. Current version is $PYTHON_VERSION. Aborting."
10
+ exit 1
11
+ fi
12
+
13
+ PYTHON_SCRIPT_PATH="engine.py"
14
+
15
+ # Check if the "ingest" flag is passed
16
+ if [ "$1" == "ingest" ]; then
17
+ PYTHON_SCRIPT_PATH="ingest.py"
18
+ shift # Shift the arguments to exclude the first one
19
+ fi
20
+
21
+ if [ "$1" == "assistant" ]; then
22
+ PYTHON_SCRIPT_PATH="assistant.py"
23
+ shift # Shift the arguments to exclude the first one
24
+ fi
25
+
26
+ python "${PYTHON_SCRIPT_PATH}" "$@"
27
+
28
+ # make script executable with: chmod +x sparrow.sh