katanaml committed
Commit 42cd5f6 · 1 Parent(s): 77b5374

Sparrow Parse

Files changed (47)
  1. .gitignore +2 -0
  2. Dockerfile +10 -0
  3. LICENSE +674 -0
  4. README.md +1 -1
  5. api.py +108 -0
  6. assistant.py +23 -0
  7. config.yml +90 -0
  8. data/inout-20211211_001.jpg +0 -0
  9. data/invoice_1.jpg +0 -0
  10. data/invoice_1.pdf +0 -0
  11. data/ross-20211211_010.jpg +0 -0
  12. docker-compose.yml +28 -0
  13. embeddings/__init__.py +0 -0
  14. embeddings/agents/__init__.py +0 -0
  15. embeddings/agents/haystack.py +68 -0
  16. embeddings/agents/interface.py +29 -0
  17. embeddings/agents/llamaindex.py +85 -0
  18. engine.py +82 -0
  19. ingest.py +42 -0
  20. rag/__init__.py +0 -0
  21. rag/agents/__init__.py +0 -0
  22. rag/agents/haystack/__init__.py +0 -0
  23. rag/agents/haystack/haystack.py +227 -0
  24. rag/agents/instructor/__init__.py +0 -0
  25. rag/agents/instructor/fcall.py +77 -0
  26. rag/agents/instructor/helpers/__init__.py +0 -0
  27. rag/agents/instructor/helpers/instructor_helper.py +60 -0
  28. rag/agents/instructor/instructor.py +254 -0
  29. rag/agents/interface.py +61 -0
  30. rag/agents/llamaindex/__init__.py +0 -0
  31. rag/agents/llamaindex/llamaindex.py +209 -0
  32. rag/agents/llamaindex/vllamaindex.py +139 -0
  33. rag/agents/llamaindex/vprocessor.py +183 -0
  34. rag/agents/sparrow_parse/__init__.py +0 -0
  35. rag/agents/sparrow_parse/sparrow_parse.py +137 -0
  36. rag/agents/sparrow_parse/sparrow_utils.py +54 -0
  37. rag/agents/sparrow_parse/sparrow_validator.py +26 -0
  38. rag/agents/unstructured/__init__.py +0 -0
  39. rag/agents/unstructured/unstructured.py +372 -0
  40. rag/agents/unstructured/unstructured_light.py +293 -0
  41. requirements_haystack.txt +14 -0
  42. requirements_instructor.txt +16 -0
  43. requirements_llamaindex.txt +27 -0
  44. requirements_sparrow_parse.txt +13 -0
  45. requirements_unstructured.txt +19 -0
  46. sample_prompts.txt +390 -0
  47. sparrow.sh +28 -0
.gitignore ADDED
@@ -0,0 +1,2 @@
+
+ .DS_Store
Dockerfile ADDED
@@ -0,0 +1,10 @@
+ FROM python:3.10
+
+ RUN useradd -m -u 1000 user
+ WORKDIR /app
+
+ COPY --chown=user ./requirements_sparrow_parse.txt requirements_sparrow_parse.txt
+ RUN pip install --no-cache-dir --upgrade -r requirements_sparrow_parse.txt
+
+ COPY --chown=user . /app
+ CMD ["python", "api.py", "--port", "7860"]
LICENSE ADDED
@@ -0,0 +1,674 @@
+ GNU GENERAL PUBLIC LICENSE
+ Version 3, 29 June 2007
+
+ Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+ Preamble
+
+ The GNU General Public License is a free, copyleft license for
+ software and other kinds of works.
+
+ The licenses for most software and other practical works are designed
+ to take away your freedom to share and change the works. By contrast,
+ the GNU General Public License is intended to guarantee your freedom to
+ share and change all versions of a program--to make sure it remains free
+ software for all its users. We, the Free Software Foundation, use the
+ GNU General Public License for most of our software; it applies also to
+ any other work released this way by its authors. You can apply it to
+ your programs, too.
+
+ When we speak of free software, we are referring to freedom, not
+ price. Our General Public Licenses are designed to make sure that you
+ have the freedom to distribute copies of free software (and charge for
+ them if you wish), that you receive source code or can get it if you
+ want it, that you can change the software or use pieces of it in new
+ free programs, and that you know you can do these things.
+
+ To protect your rights, we need to prevent others from denying you
+ these rights or asking you to surrender the rights. Therefore, you have
+ certain responsibilities if you distribute copies of the software, or if
+ you modify it: responsibilities to respect the freedom of others.
+
+ For example, if you distribute copies of such a program, whether
+ gratis or for a fee, you must pass on to the recipients the same
+ freedoms that you received. You must make sure that they, too, receive
+ or can get the source code. And you must show them these terms so they
+ know their rights.
+
+ Developers that use the GNU GPL protect your rights with two steps:
+ (1) assert copyright on the software, and (2) offer you this License
+ giving you legal permission to copy, distribute and/or modify it.
+
+ For the developers' and authors' protection, the GPL clearly explains
+ that there is no warranty for this free software. For both users' and
+ authors' sake, the GPL requires that modified versions be marked as
+ changed, so that their problems will not be attributed erroneously to
+ authors of previous versions.
+
+ Some devices are designed to deny users access to install or run
+ modified versions of the software inside them, although the manufacturer
+ can do so. This is fundamentally incompatible with the aim of
+ protecting users' freedom to change the software. The systematic
+ pattern of such abuse occurs in the area of products for individuals to
+ use, which is precisely where it is most unacceptable. Therefore, we
+ have designed this version of the GPL to prohibit the practice for those
+ products. If such problems arise substantially in other domains, we
+ stand ready to extend this provision to those domains in future versions
+ of the GPL, as needed to protect the freedom of users.
+
+ Finally, every program is threatened constantly by software patents.
+ States should not allow patents to restrict development and use of
+ software on general-purpose computers, but in those that do, we wish to
+ avoid the special danger that patents applied to a free program could
+ make it effectively proprietary. To prevent this, the GPL assures that
+ patents cannot be used to render the program non-free.
+
+ The precise terms and conditions for copying, distribution and
+ modification follow.
+
+ TERMS AND CONDITIONS
+
+ 0. Definitions.
+
+ "This License" refers to version 3 of the GNU General Public License.
+
+ "Copyright" also means copyright-like laws that apply to other kinds of
+ works, such as semiconductor masks.
+
+ "The Program" refers to any copyrightable work licensed under this
+ License. Each licensee is addressed as "you". "Licensees" and
+ "recipients" may be individuals or organizations.
+
+ To "modify" a work means to copy from or adapt all or part of the work
+ in a fashion requiring copyright permission, other than the making of an
+ exact copy. The resulting work is called a "modified version" of the
+ earlier work or a work "based on" the earlier work.
+
+ A "covered work" means either the unmodified Program or a work based
+ on the Program.
+
+ To "propagate" a work means to do anything with it that, without
+ permission, would make you directly or secondarily liable for
+ infringement under applicable copyright law, except executing it on a
+ computer or modifying a private copy. Propagation includes copying,
+ distribution (with or without modification), making available to the
+ public, and in some countries other activities as well.
+
+ To "convey" a work means any kind of propagation that enables other
+ parties to make or receive copies. Mere interaction with a user through
+ a computer network, with no transfer of a copy, is not conveying.
+
+ An interactive user interface displays "Appropriate Legal Notices"
+ to the extent that it includes a convenient and prominently visible
+ feature that (1) displays an appropriate copyright notice, and (2)
+ tells the user that there is no warranty for the work (except to the
+ extent that warranties are provided), that licensees may convey the
+ work under this License, and how to view a copy of this License. If
+ the interface presents a list of user commands or options, such as a
+ menu, a prominent item in the list meets this criterion.
+
+ 1. Source Code.
+
+ The "source code" for a work means the preferred form of the work
+ for making modifications to it. "Object code" means any non-source
+ form of a work.
+
+ A "Standard Interface" means an interface that either is an official
+ standard defined by a recognized standards body, or, in the case of
+ interfaces specified for a particular programming language, one that
+ is widely used among developers working in that language.
+
+ The "System Libraries" of an executable work include anything, other
+ than the work as a whole, that (a) is included in the normal form of
+ packaging a Major Component, but which is not part of that Major
+ Component, and (b) serves only to enable use of the work with that
+ Major Component, or to implement a Standard Interface for which an
+ implementation is available to the public in source code form. A
+ "Major Component", in this context, means a major essential component
+ (kernel, window system, and so on) of the specific operating system
+ (if any) on which the executable work runs, or a compiler used to
+ produce the work, or an object code interpreter used to run it.
+
+ The "Corresponding Source" for a work in object code form means all
+ the source code needed to generate, install, and (for an executable
+ work) run the object code and to modify the work, including scripts to
+ control those activities. However, it does not include the work's
+ System Libraries, or general-purpose tools or generally available free
+ programs which are used unmodified in performing those activities but
+ which are not part of the work. For example, Corresponding Source
+ includes interface definition files associated with source files for
+ the work, and the source code for shared libraries and dynamically
+ linked subprograms that the work is specifically designed to require,
+ such as by intimate data communication or control flow between those
+ subprograms and other parts of the work.
+
+ The Corresponding Source need not include anything that users
+ can regenerate automatically from other parts of the Corresponding
+ Source.
+
+ The Corresponding Source for a work in source code form is that
+ same work.
+
+ 2. Basic Permissions.
+
+ All rights granted under this License are granted for the term of
+ copyright on the Program, and are irrevocable provided the stated
+ conditions are met. This License explicitly affirms your unlimited
+ permission to run the unmodified Program. The output from running a
+ covered work is covered by this License only if the output, given its
+ content, constitutes a covered work. This License acknowledges your
+ rights of fair use or other equivalent, as provided by copyright law.
+
+ You may make, run and propagate covered works that you do not
+ convey, without conditions so long as your license otherwise remains
+ in force. You may convey covered works to others for the sole purpose
+ of having them make modifications exclusively for you, or provide you
+ with facilities for running those works, provided that you comply with
+ the terms of this License in conveying all material for which you do
+ not control copyright. Those thus making or running the covered works
+ for you must do so exclusively on your behalf, under your direction
+ and control, on terms that prohibit them from making any copies of
+ your copyrighted material outside their relationship with you.
+
+ Conveying under any other circumstances is permitted solely under
+ the conditions stated below. Sublicensing is not allowed; section 10
+ makes it unnecessary.
+
+ 3. Protecting Users' Legal Rights From Anti-Circumvention Law.
+
+ No covered work shall be deemed part of an effective technological
+ measure under any applicable law fulfilling obligations under article
+ 11 of the WIPO copyright treaty adopted on 20 December 1996, or
+ similar laws prohibiting or restricting circumvention of such
+ measures.
+
+ When you convey a covered work, you waive any legal power to forbid
+ circumvention of technological measures to the extent such circumvention
+ is effected by exercising rights under this License with respect to
+ the covered work, and you disclaim any intention to limit operation or
+ modification of the work as a means of enforcing, against the work's
+ users, your or third parties' legal rights to forbid circumvention of
+ technological measures.
+
+ 4. Conveying Verbatim Copies.
+
+ You may convey verbatim copies of the Program's source code as you
+ receive it, in any medium, provided that you conspicuously and
+ appropriately publish on each copy an appropriate copyright notice;
+ keep intact all notices stating that this License and any
+ non-permissive terms added in accord with section 7 apply to the code;
+ keep intact all notices of the absence of any warranty; and give all
+ recipients a copy of this License along with the Program.
+
+ You may charge any price or no price for each copy that you convey,
+ and you may offer support or warranty protection for a fee.
+
+ 5. Conveying Modified Source Versions.
+
+ You may convey a work based on the Program, or the modifications to
+ produce it from the Program, in the form of source code under the
+ terms of section 4, provided that you also meet all of these conditions:
+
+ a) The work must carry prominent notices stating that you modified
+ it, and giving a relevant date.
+
+ b) The work must carry prominent notices stating that it is
+ released under this License and any conditions added under section
+ 7. This requirement modifies the requirement in section 4 to
+ "keep intact all notices".
+
+ c) You must license the entire work, as a whole, under this
+ License to anyone who comes into possession of a copy. This
+ License will therefore apply, along with any applicable section 7
+ additional terms, to the whole of the work, and all its parts,
+ regardless of how they are packaged. This License gives no
+ permission to license the work in any other way, but it does not
+ invalidate such permission if you have separately received it.
+
+ d) If the work has interactive user interfaces, each must display
+ Appropriate Legal Notices; however, if the Program has interactive
+ interfaces that do not display Appropriate Legal Notices, your
+ work need not make them do so.
+
+ A compilation of a covered work with other separate and independent
+ works, which are not by their nature extensions of the covered work,
+ and which are not combined with it such as to form a larger program,
+ in or on a volume of a storage or distribution medium, is called an
+ "aggregate" if the compilation and its resulting copyright are not
+ used to limit the access or legal rights of the compilation's users
+ beyond what the individual works permit. Inclusion of a covered work
+ in an aggregate does not cause this License to apply to the other
+ parts of the aggregate.
+
+ 6. Conveying Non-Source Forms.
+
+ You may convey a covered work in object code form under the terms
+ of sections 4 and 5, provided that you also convey the
+ machine-readable Corresponding Source under the terms of this License,
+ in one of these ways:
+
+ a) Convey the object code in, or embodied in, a physical product
+ (including a physical distribution medium), accompanied by the
+ Corresponding Source fixed on a durable physical medium
+ customarily used for software interchange.
+
+ b) Convey the object code in, or embodied in, a physical product
+ (including a physical distribution medium), accompanied by a
+ written offer, valid for at least three years and valid for as
+ long as you offer spare parts or customer support for that product
+ model, to give anyone who possesses the object code either (1) a
+ copy of the Corresponding Source for all the software in the
+ product that is covered by this License, on a durable physical
+ medium customarily used for software interchange, for a price no
+ more than your reasonable cost of physically performing this
+ conveying of source, or (2) access to copy the
+ Corresponding Source from a network server at no charge.
+
+ c) Convey individual copies of the object code with a copy of the
+ written offer to provide the Corresponding Source. This
+ alternative is allowed only occasionally and noncommercially, and
+ only if you received the object code with such an offer, in accord
+ with subsection 6b.
+
+ d) Convey the object code by offering access from a designated
+ place (gratis or for a charge), and offer equivalent access to the
+ Corresponding Source in the same way through the same place at no
+ further charge. You need not require recipients to copy the
+ Corresponding Source along with the object code. If the place to
+ copy the object code is a network server, the Corresponding Source
+ may be on a different server (operated by you or a third party)
+ that supports equivalent copying facilities, provided you maintain
+ clear directions next to the object code saying where to find the
+ Corresponding Source. Regardless of what server hosts the
+ Corresponding Source, you remain obligated to ensure that it is
+ available for as long as needed to satisfy these requirements.
+
+ e) Convey the object code using peer-to-peer transmission, provided
+ you inform other peers where the object code and Corresponding
+ Source of the work are being offered to the general public at no
+ charge under subsection 6d.
+
+ A separable portion of the object code, whose source code is excluded
+ from the Corresponding Source as a System Library, need not be
+ included in conveying the object code work.
+
+ A "User Product" is either (1) a "consumer product", which means any
+ tangible personal property which is normally used for personal, family,
+ or household purposes, or (2) anything designed or sold for incorporation
+ into a dwelling. In determining whether a product is a consumer product,
+ doubtful cases shall be resolved in favor of coverage. For a particular
+ product received by a particular user, "normally used" refers to a
+ typical or common use of that class of product, regardless of the status
+ of the particular user or of the way in which the particular user
+ actually uses, or expects or is expected to use, the product. A product
+ is a consumer product regardless of whether the product has substantial
+ commercial, industrial or non-consumer uses, unless such uses represent
+ the only significant mode of use of the product.
+
+ "Installation Information" for a User Product means any methods,
+ procedures, authorization keys, or other information required to install
+ and execute modified versions of a covered work in that User Product from
+ a modified version of its Corresponding Source. The information must
+ suffice to ensure that the continued functioning of the modified object
+ code is in no case prevented or interfered with solely because
+ modification has been made.
+
+ If you convey an object code work under this section in, or with, or
+ specifically for use in, a User Product, and the conveying occurs as
+ part of a transaction in which the right of possession and use of the
+ User Product is transferred to the recipient in perpetuity or for a
+ fixed term (regardless of how the transaction is characterized), the
+ Corresponding Source conveyed under this section must be accompanied
+ by the Installation Information. But this requirement does not apply
+ if neither you nor any third party retains the ability to install
+ modified object code on the User Product (for example, the work has
+ been installed in ROM).
+
+ The requirement to provide Installation Information does not include a
+ requirement to continue to provide support service, warranty, or updates
+ for a work that has been modified or installed by the recipient, or for
+ the User Product in which it has been modified or installed. Access to a
+ network may be denied when the modification itself materially and
+ adversely affects the operation of the network or violates the rules and
+ protocols for communication across the network.
+
+ Corresponding Source conveyed, and Installation Information provided,
+ in accord with this section must be in a format that is publicly
+ documented (and with an implementation available to the public in
+ source code form), and must require no special password or key for
+ unpacking, reading or copying.
+
+ 7. Additional Terms.
+
+ "Additional permissions" are terms that supplement the terms of this
+ License by making exceptions from one or more of its conditions.
+ Additional permissions that are applicable to the entire Program shall
+ be treated as though they were included in this License, to the extent
+ that they are valid under applicable law. If additional permissions
+ apply only to part of the Program, that part may be used separately
+ under those permissions, but the entire Program remains governed by
+ this License without regard to the additional permissions.
+
+ When you convey a copy of a covered work, you may at your option
+ remove any additional permissions from that copy, or from any part of
+ it. (Additional permissions may be written to require their own
+ removal in certain cases when you modify the work.) You may place
+ additional permissions on material, added by you to a covered work,
+ for which you have or can give appropriate copyright permission.
+
+ Notwithstanding any other provision of this License, for material you
+ add to a covered work, you may (if authorized by the copyright holders of
+ that material) supplement the terms of this License with terms:
+
+ a) Disclaiming warranty or limiting liability differently from the
+ terms of sections 15 and 16 of this License; or
+
+ b) Requiring preservation of specified reasonable legal notices or
+ author attributions in that material or in the Appropriate Legal
+ Notices displayed by works containing it; or
+
+ c) Prohibiting misrepresentation of the origin of that material, or
+ requiring that modified versions of such material be marked in
+ reasonable ways as different from the original version; or
+
+ d) Limiting the use for publicity purposes of names of licensors or
+ authors of the material; or
+
+ e) Declining to grant rights under trademark law for use of some
+ trade names, trademarks, or service marks; or
+
+ f) Requiring indemnification of licensors and authors of that
+ material by anyone who conveys the material (or modified versions of
+ it) with contractual assumptions of liability to the recipient, for
+ any liability that these contractual assumptions directly impose on
+ those licensors and authors.
+
+ All other non-permissive additional terms are considered "further
+ restrictions" within the meaning of section 10. If the Program as you
+ received it, or any part of it, contains a notice stating that it is
+ governed by this License along with a term that is a further
+ restriction, you may remove that term. If a license document contains
+ a further restriction but permits relicensing or conveying under this
+ License, you may add to a covered work material governed by the terms
+ of that license document, provided that the further restriction does
+ not survive such relicensing or conveying.
+
+ If you add terms to a covered work in accord with this section, you
+ must place, in the relevant source files, a statement of the
+ additional terms that apply to those files, or a notice indicating
+ where to find the applicable terms.
+
+ Additional terms, permissive or non-permissive, may be stated in the
+ form of a separately written license, or stated as exceptions;
+ the above requirements apply either way.
+
+ 8. Termination.
+
+ You may not propagate or modify a covered work except as expressly
+ provided under this License. Any attempt otherwise to propagate or
+ modify it is void, and will automatically terminate your rights under
+ this License (including any patent licenses granted under the third
+ paragraph of section 11).
+
+ However, if you cease all violation of this License, then your
+ license from a particular copyright holder is reinstated (a)
+ provisionally, unless and until the copyright holder explicitly and
+ finally terminates your license, and (b) permanently, if the copyright
+ holder fails to notify you of the violation by some reasonable means
+ prior to 60 days after the cessation.
+
+ Moreover, your license from a particular copyright holder is
+ reinstated permanently if the copyright holder notifies you of the
+ violation by some reasonable means, this is the first time you have
+ received notice of violation of this License (for any work) from that
+ copyright holder, and you cure the violation prior to 30 days after
+ your receipt of the notice.
+
+ Termination of your rights under this section does not terminate the
+ licenses of parties who have received copies or rights from you under
+ this License. If your rights have been terminated and not permanently
+ reinstated, you do not qualify to receive new licenses for the same
+ material under section 10.
+
+ 9. Acceptance Not Required for Having Copies.
+
+ You are not required to accept this License in order to receive or
+ run a copy of the Program. Ancillary propagation of a covered work
+ occurring solely as a consequence of using peer-to-peer transmission
+ to receive a copy likewise does not require acceptance. However,
+ nothing other than this License grants you permission to propagate or
+ modify any covered work. These actions infringe copyright if you do
+ not accept this License. Therefore, by modifying or propagating a
+ covered work, you indicate your acceptance of this License to do so.
+
+ 10. Automatic Licensing of Downstream Recipients.
+
+ Each time you convey a covered work, the recipient automatically
+ receives a license from the original licensors, to run, modify and
+ propagate that work, subject to this License. You are not responsible
+ for enforcing compliance by third parties with this License.
+
+ An "entity transaction" is a transaction transferring control of an
+ organization, or substantially all assets of one, or subdividing an
+ organization, or merging organizations. If propagation of a covered
+ work results from an entity transaction, each party to that
+ transaction who receives a copy of the work also receives whatever
+ licenses to the work the party's predecessor in interest had or could
+ give under the previous paragraph, plus a right to possession of the
+ Corresponding Source of the work from the predecessor in interest, if
+ the predecessor has it or can get it with reasonable efforts.
+
+ You may not impose any further restrictions on the exercise of the
+ rights granted or affirmed under this License. For example, you may
+ not impose a license fee, royalty, or other charge for exercise of
+ rights granted under this License, and you may not initiate litigation
+ (including a cross-claim or counterclaim in a lawsuit) alleging that
+ any patent claim is infringed by making, using, selling, offering for
+ sale, or importing the Program or any portion of it.
+
+ 11. Patents.
+
+ A "contributor" is a copyright holder who authorizes use under this
+ License of the Program or a work on which the Program is based. The
+ work thus licensed is called the contributor's "contributor version".
+
+ A contributor's "essential patent claims" are all patent claims
+ owned or controlled by the contributor, whether already acquired or
+ hereafter acquired, that would be infringed by some manner, permitted
+ by this License, of making, using, or selling its contributor version,
+ but do not include claims that would be infringed only as a
+ consequence of further modification of the contributor version. For
+ purposes of this definition, "control" includes the right to grant
+ patent sublicenses in a manner consistent with the requirements of
+ this License.
+
+ Each contributor grants you a non-exclusive, worldwide, royalty-free
+ patent license under the contributor's essential patent claims, to
+ make, use, sell, offer for sale, import and otherwise run, modify and
+ propagate the contents of its contributor version.
+
+ In the following three paragraphs, a "patent license" is any express
+ agreement or commitment, however denominated, not to enforce a patent
+ (such as an express permission to practice a patent or covenant not to
+ sue for patent infringement). To "grant" such a patent license to a
+ party means to make such an agreement or commitment not to enforce a
+ patent against the party.
+
+ If you convey a covered work, knowingly relying on a patent license,
+ and the Corresponding Source of the work is not available for anyone
+ to copy, free of charge and under the terms of this License, through a
+ publicly available network server or other readily accessible means,
+ then you must either (1) cause the Corresponding Source to be so
+ available, or (2) arrange to deprive yourself of the benefit of the
+ patent license for this particular work, or (3) arrange, in a manner
+ consistent with the requirements of this License, to extend the patent
+ license to downstream recipients. "Knowingly relying" means you have
+ actual knowledge that, but for the patent license, your conveying the
+ covered work in a country, or your recipient's use of the covered work
+ in a country, would infringe one or more identifiable patents in that
+ country that you have reason to believe are valid.
+
+ If, pursuant to or in connection with a single transaction or
+ arrangement, you convey, or propagate by procuring conveyance of, a
+ covered work, and grant a patent license to some of the parties
+ receiving the covered work authorizing them to use, propagate, modify
+ or convey a specific copy of the covered work, then the patent license
+ you grant is automatically extended to all recipients of the covered
+ work and works based on it.
+
+ A patent license is "discriminatory" if it does not include within
+ the scope of its coverage, prohibits the exercise of, or is
+ conditioned on the non-exercise of one or more of the rights that are
+ specifically granted under this License. You may not convey a covered
+ work if you are a party to an arrangement with a third party that is
+ in the business of distributing software, under which you make payment
+ to the third party based on the extent of your activity of conveying
+ the work, and under which the third party grants, to any of the
+ parties who would receive the covered work from you, a discriminatory
+ patent license (a) in connection with copies of the covered work
+ conveyed by you (or copies made from those copies), or (b) primarily
+ for and in connection with specific products or compilations that
+ contain the covered work, unless you entered into that arrangement,
+ or that patent license was granted, prior to 28 March 2007.
+
+ Nothing in this License shall be construed as excluding or limiting
+ any implied license or other defenses to infringement that may
+ otherwise be available to you under applicable patent law.
+
+ 12. No Surrender of Others' Freedom.
+
+ If conditions are imposed on you (whether by court order, agreement or
+ otherwise) that contradict the conditions of this License, they do not
+ excuse you from the conditions of this License. If you cannot convey a
+ covered work so as to satisfy simultaneously your obligations under this
+ License and any other pertinent obligations, then as a consequence you may
+ not convey it at all. For example, if you agree to terms that obligate you
+ to collect a royalty for further conveying from those to whom you convey
+ the Program, the only way you could satisfy both those terms and this
+ License would be to refrain entirely from conveying the Program.
+
+ 13. Use with the GNU Affero General Public License.
+
+ Notwithstanding any other provision of this License, you have
+ permission to link or combine any covered work with a work licensed
+ under version 3 of the GNU Affero General Public License into a single
+ combined work, and to convey the resulting work. The terms of this
+ License will continue to apply to the part which is the covered work,
+ but the special requirements of the GNU Affero General Public License,
+ section 13, concerning interaction through a network will apply to the
+ combination as such.
+
+ 14. Revised Versions of this License.
+
+ The Free Software Foundation may publish revised and/or new versions of
+ the GNU General Public License from time to time. Such new versions will
+ be similar in spirit to the present version, but may differ in detail to
+ address new problems or concerns.
+
+ Each version is given a distinguishing version number. If the
+ Program specifies that a certain numbered version of the GNU General
+ Public License "or any later version" applies to it, you have the
+ option of following the terms and conditions either of that numbered
+ version or of any later version published by the Free Software
+ Foundation. If the Program does not specify a version number of the
+ GNU General Public License, you may choose any version ever published
+ by the Free Software Foundation.
+
+ If the Program specifies that a proxy can decide which future
+ versions of the GNU General Public License can be used, that proxy's
+ public statement of acceptance of a version permanently authorizes you
+ to choose that version for the Program.
+
+ Later license versions may give you additional or different
+ permissions. However, no additional obligations are imposed on any
+ author or copyright holder as a result of your choosing to follow a
+ later version.
+
+ 15. Disclaimer of Warranty.
+
+ THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
+ APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
+ HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
+ OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
+ THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
+ IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
+ ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
+
+ 16. Limitation of Liability.
+
+ IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
+ WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
+ THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
+ GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
+ USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
+ DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
+ PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
+ EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
+ SUCH DAMAGES.
+
+ 17. Interpretation of Sections 15 and 16.
+
+ If the disclaimer of warranty and limitation of liability provided
+ above cannot be given local legal effect according to their terms,
+ reviewing courts shall apply local law that most closely approximates
+ an absolute waiver of all civil liability in connection with the
+ Program, unless a warranty or assumption of liability accompanies a
+ copy of the Program in return for a fee.
+
+ END OF TERMS AND CONDITIONS
+
+ How to Apply These Terms to Your New Programs
+
+ If you develop a new program, and you want it to be of the greatest
+ possible use to the public, the best way to achieve this is to make it
+ free software which everyone can redistribute and change under these terms.
+
+ To do so, attach the following notices to the program. It is safest
+ to attach them to the start of each source file to most effectively
+ state the exclusion of warranty; and each file should have at least
+ the "copyright" line and a pointer to where the full notice is found.
+
+ <one line to give the program's name and a brief idea of what it does.>
+ Copyright (C) <year> <name of author>
+
+ This program is free software: you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation, either version 3 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program. If not, see <https://www.gnu.org/licenses/>.
+
+ Also add information on how to contact you by electronic and paper mail.
+
+ If the program does terminal interaction, make it output a short
+ notice like this when it starts in an interactive mode:
+
+ <program> Copyright (C) <year> <name of author>
+ This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
+ This is free software, and you are welcome to redistribute it
+ under certain conditions; type `show c' for details.
+
+ The hypothetical commands `show w' and `show c' should show the appropriate
+ parts of the General Public License. Of course, your program's commands
+ might be different; for a GUI interface, you would use an "about box".
+
+ You should also get your employer (if you work as a programmer) or school,
+ if any, to sign a "copyright disclaimer" for the program, if necessary.
+ For more information on this, and how to apply and follow the GNU GPL, see
+ <https://www.gnu.org/licenses/>.
+
+ The GNU General Public License does not permit incorporating your program
+ into proprietary programs. If your program is a subroutine library, you
+ may consider it more useful to permit linking proprietary applications with
+ the library. If this is what you want to do, use the GNU Lesser General
+ Public License instead of this License. But first, please read
+ <https://www.gnu.org/licenses/why-not-lgpl.html>.
README.md CHANGED
@@ -1,5 +1,5 @@
  ---
- title: Sparrow Ml
+ title: Sparrow ML
  emoji: 😻
  colorFrom: green
  colorTo: red
api.py ADDED
@@ -0,0 +1,108 @@
+ from fastapi import FastAPI, File, UploadFile, Form, HTTPException
+ from fastapi.middleware.cors import CORSMiddleware
+ from engine import run_from_api_engine
+ from ingest import run_from_api_ingest
+ import uvicorn
+ import warnings
+ from typing import Annotated
+ import json
+ import argparse
+ from dotenv import load_dotenv
+ import os
+ from rich import print
+
+
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
+
+
+ # Load environment variables from .env file
+ load_dotenv()
+
+
+ # add asyncio to the pipeline
+
+ app = FastAPI(openapi_url="/api/v1/sparrow-llm/openapi.json", docs_url="/api/v1/sparrow-llm/docs")
+
+ app.add_middleware(
+     CORSMiddleware,
+     allow_origins=["*"],
+     allow_methods=["*"],
+     allow_headers=["*"],
+     allow_credentials=True
+ )
+
+
+ @app.get("/")
+ def root():
+     return {"message": "Sparrow LLM API"}
+
+
+ @app.post("/api/v1/sparrow-llm/inference", tags=["LLM Inference"])
+ async def inference(
+         fields: Annotated[str, Form()],
+         agent: Annotated[str, Form()],
+         types: Annotated[str, Form()] = None,
+         keywords: Annotated[str, Form()] = None,
+         index_name: Annotated[str, Form()] = None,
+         options: Annotated[str, Form()] = None,
+         group_by_rows: Annotated[bool, Form()] = True,
+         update_targets: Annotated[bool, Form()] = True,
+         debug: Annotated[bool, Form()] = False,
+         file: UploadFile = File(None)
+ ):
+     query = 'retrieve ' + fields
+     query_types = types
+
+     query_inputs_arr = [param.strip() for param in fields.split(',')] if query_types else []
+     query_types_arr = [param.strip() for param in query_types.split(',')] if query_types else []
+     keywords_arr = [param.strip() for param in keywords.split(',')] if keywords is not None else None
+     options_arr = [param.strip() for param in options.split(',')] if options is not None else None
+
+     if not query_types:
+         query = fields
+
+     try:
+         answer = await run_from_api_engine(agent, query_inputs_arr, query_types_arr, keywords_arr, query, index_name,
+                                            options_arr, file, group_by_rows, update_targets, debug)
+     except ValueError as e:
+         raise HTTPException(status_code=418, detail=str(e))
+
+     try:
+         if isinstance(answer, (str, bytes, bytearray)):
+             answer = json.loads(answer)
+     except json.JSONDecodeError as e:
+         raise HTTPException(status_code=418, detail=answer)
+
+     if debug:
+         print(f"\nJSON response:\n")
+         print(answer)
+
+     return {"message": answer}
+
+
+ @app.post("/api/v1/sparrow-llm/ingest", tags=["LLM Ingest"])
+ async def ingest(
+         agent: Annotated[str, Form()],
+         index_name: Annotated[str, Form()],
+         file: UploadFile = File()
+ ):
+     try:
+         answer = await run_from_api_ingest(agent, index_name, file, False)
+     except ValueError as e:
+         raise HTTPException(status_code=418, detail=str(e))
+
+     if isinstance(answer, (str, bytes, bytearray)):
+         answer = json.loads(answer)
+
+     return {"message": answer}
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser(description="Run FastAPI App")
+     parser.add_argument("-p", "--port", type=int, default=8000, help="Port to run the FastAPI app on")
+     args = parser.parse_args()
+
+     uvicorn.run("api:app", host="0.0.0.0", port=args.port, reload=True)
+
+ # run the app with: python api.py --port 8000
+ # go to http://127.0.0.1:8000/api/v1/sparrow-llm/docs to see the Swagger UI
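
For reference, a minimal client for the inference endpoint above could look like the sketch below. The host and port, the field values, and the sparrow_parse agent choice are illustrative assumptions; any agent under rag/agents and any PDF from data/ should work the same way.

    # Hypothetical client sketch for /api/v1/sparrow-llm/inference.
    # Assumes the API was started locally with: python api.py --port 8000
    import requests

    url = "http://127.0.0.1:8000/api/v1/sparrow-llm/inference"
    with open("data/invoice_1.pdf", "rb") as f:
        response = requests.post(
            url,
            data={
                "fields": "invoice_number, total",  # comma-separated fields to fetch
                "types": "int, str",                # matching list of field types
                "agent": "sparrow_parse",           # illustrative agent name
            },
            files={"file": f},
        )

    print(response.json()["message"])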
assistant.py ADDED
@@ -0,0 +1,23 @@
+ import warnings
+ import typer
+ from typing_extensions import Annotated
+ from rag.agents.interface import get_pipeline
+
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
+ warnings.filterwarnings("ignore", category=UserWarning)
+
+
+ def run(agent: Annotated[str, typer.Option(help="Ingest agent")] = "fcall",
+         query: Annotated[str, typer.Option(help="The query to run")] = "retrieve",
+         debug: Annotated[bool, typer.Option(help="Enable debug mode")] = False):
+     user_selected_agent = agent  # Modify this as needed
+
+     try:
+         rag = get_pipeline(user_selected_agent)
+         rag.run_pipeline(user_selected_agent, None, None, query, None, None, debug)
+     except ValueError as e:
+         print(f"Caught an exception: {e}")
+
+
+ if __name__ == "__main__":
+     typer.run(run)
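
The same pipeline can also be driven without Typer; a minimal programmatic sketch, assuming the query string is illustrative:

    # Programmatic equivalent of the CLI above, calling the pipeline factory directly.
    from rag.agents.interface import get_pipeline

    rag = get_pipeline("fcall")  # same default agent as the --agent option
    # Positional arguments mirror the run_pipeline call in assistant.py.
    rag.run_pipeline("fcall", None, None, "retrieve invoice_number", None, None, False)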
config.yml ADDED
@@ -0,0 +1,90 @@
+ # AGENT FOR LLAMAINDEX
+ # Tested with these LLMs
+ #LLM: 'starling-lm:7b-alpha-q4_K_M'
+ #LLM: 'starling-lm:7b-alpha-q5_K_M'
+ LLM: 'adrienbrault/nous-hermes2theta-llama3-8b:q5_K_M'
+ #LLM: 'llama3:8b-instruct-q5_K_M'
+ EMBEDDINGS: 'sentence-transformers/all-mpnet-base-v2'
+ WEAVIATE_URL: 'http://localhost:8080'
+ CHUNK_SIZE: 3000
+ OLLAMA_BASE_URL: 'http://127.0.0.1:11434'
+ #OLLAMA_BASE_URL: 'http://192.168.68.107:11434'
+
+
+ # AGENT FOR HAYSTACK
+ SPLIT_BY_HAYSTACK: 'sentence'
+ SPLIT_LENGTH_HAYSTACK: 3000
+ SPLIT_OVERLAP_HAYSTACK: 100
+ EMBEDDINGS_HAYSTACK: 'sentence-transformers/all-MiniLM-L6-v2'
+ # Tested with these LLMs
+ #LLM_HAYSTACK: 'starling-lm:7b-alpha-q4_K_M'
+ #LLM_HAYSTACK: 'starling-lm:7b-alpha-q5_K_M'
+ LLM_HAYSTACK: 'adrienbrault/nous-hermes2theta-llama3-8b:q5_K_M'
+ #LLM_HAYSTACK: 'llama3:8b-instruct-q5_K_M'
+ OLLAMA_BASE_URL_HAYSTACK: 'http://127.0.0.1:11434'
+ #OLLAMA_BASE_URL_HAYSTACK: 'http://192.168.68.107:11434'
+ MAX_LOOPS_ALLOWED_HAYSTACK: 3
+
+
+ # AGENT FOR VLLAMAINDEX
+ # Tested with these LLMs
+ LLM_VLLAMAINDEX: 'llava:13b'
+
+
+ # AGENT FOR VPROCESSOR
+ OCR_ENDPOINT_VPROCESSOR: 'http://127.0.0.1:8001/api/v1/sparrow-ocr/inference'
+ # Tested with these LLMs
+ #LLM_VPROCESSOR: 'starling-lm:7b-alpha-q5_K_M'
+ #LLM_VPROCESSOR: 'adrienbrault/nous-hermes2pro:Q5_K_M-json'
+ LLM_VPROCESSOR: 'llama3:8b-instruct-q5_K_M'
+ OLLAMA_BASE_URL_VPROCESSOR: 'http://127.0.0.1:11434'
+
+
+ # AGENT FOR FUNCTION CALL
+ OLLAMA_BASE_URL_FUNCTION: 'http://127.0.0.1:11434/v1'
+ # Tested with these LLMs
+ LLM_FUNCTION: 'adrienbrault/nous-hermes2theta-llama3-8b:q5_K_M'
+
+
+ # AGENT FOR UNSTRUCTURED LIGHT
+ # Tested with these LLMs
+ LLM_UNSTRUCTURED_LIGHT: 'adrienbrault/nous-hermes2pro:Q5_K_M-json'
+ # Strategy for analyzing PDFs and extracting table structure
+ STRATEGY_UNSTRUCTURED_LIGHT: 'hi_res'
+ # Best model for table extraction. Other options are detectron2_onnx and chipper, depending on file layout
+ MODEL_UNSTRUCTURED_LIGHT: 'yolox'
+ CHUNK_SIZE_UNSTRUCTURED_LIGHT: 1000
+ OVERLAP_UNSTRUCTURED_LIGHT: 200
+ # ollama pull nomic-embed-text
+ EMBEDDINGS_UNSTRUCTURED_LIGHT: 'nomic-embed-text'
+ BASE_URL_UNSTRUCTURED_LIGHT: 'http://127.0.0.1:11434'
+
+
+ # AGENT FOR UNSTRUCTURED
+ # Tested with these LLMs
+ LLM_UNSTRUCTURED: 'adrienbrault/nous-hermes2pro:Q5_K_M-json'
+ OUTPUT_DIR_UNSTRUCTURED: 'data/json'
+ INPUT_DIR_UNSTRUCTURED: 'data/pdf'
+ WEAVIATE_URL_UNSTRUCTURED: 'http://localhost:8080'
+ EMBEDDINGS_UNSTRUCTURED: 'all-MiniLM-L6-v2'
+ DEVICE_UNSTRUCTURED: 'cpu'
+ CHUNK_UNDER_N_CHARS_UNSTRUCTURED: 250
+ CHUNK_NEW_AFTER_N_CHARS_UNSTRUCTURED: 500
+ BASE_URL_UNSTRUCTURED: 'http://127.0.0.1:11434'
+
+
+ # AGENT FOR INSTRUCTOR
+ OLLAMA_BASE_URL_INSTRUCTOR: 'http://127.0.0.1:11434/v1'
+ #OLLAMA_BASE_URL_INSTRUCTOR: 'http://192.168.68.107:11434/v1'
+ # Tested with these LLMs
+ LLM_INSTRUCTOR: 'adrienbrault/nous-hermes2theta-llama3-8b:q5_K_M'
+ #LLM_INSTRUCTOR: 'adrienbrault/nous-hermes2pro:Q5_K_M-json'
+ #LLM_INSTRUCTOR: 'wizardlm2:7b-q5_K_M'
+ # Strategy for analyzing PDFs and extracting table structure
+ STRATEGY_INSTRUCTOR: 'hi_res'
+ # Using yolox model by default. Other option is detectron2_onnx, depending on file layout
+ MODEL_INSTRUCTOR: 'yolox'
+ SIMILARITY_THRESHOLD_JUNK_COLUMNS_INSTRUCTOR: 0.5
+ SIMILARITY_THRESHOLD_COLUMN_ID_INSTRUCTOR: 0.3
+ PDF_SPLIT_OUTPUT_DIR_INSTRUCTOR: ""
+ PDF_CONVERT_TO_IMAGES_INSTRUCTOR: False
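
The agents read this file with python-box on top of PyYAML (see embeddings/agents/haystack.py below); a minimal sketch of that loading pattern:

    # Sketch of the config-loading pattern used throughout this commit.
    import box
    import yaml

    with open('config.yml', 'r', encoding='utf8') as ymlfile:
        cfg = box.Box(yaml.safe_load(ymlfile))  # attribute-style access to YAML keys

    print(cfg.LLM)         # 'adrienbrault/nous-hermes2theta-llama3-8b:q5_K_M'
    print(cfg.CHUNK_SIZE)  # 3000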
data/inout-20211211_001.jpg ADDED
data/invoice_1.jpg ADDED
data/invoice_1.pdf ADDED
Binary file (45.3 kB)
data/ross-20211211_010.jpg ADDED
docker-compose.yml ADDED
@@ -0,0 +1,28 @@
+ ---
+ services:
+   weaviate:
+     container_name: weaviate-db
+     command:
+       - --host
+       - 0.0.0.0
+       - --port
+       - '8080'
+       - --scheme
+       - http
+     image: semitechnologies/weaviate:1.24.2
+     ports:
+       - 8080:8080
+       - 50051:50051
+     volumes:
+       - weaviate_data:/var/lib/weaviate
+     restart: on-failure:0
+     environment:
+       QUERY_DEFAULTS_LIMIT: 25
+       AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
+       PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
+       DEFAULT_VECTORIZER_MODULE: 'none'
+       ENABLE_MODULES: ''
+       CLUSTER_HOSTNAME: 'node1'
+ volumes:
+   weaviate_data:
+ ...
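
Before running ingestion against this container, it can help to confirm Weaviate is reachable; a minimal sketch using the v3 weaviate-client API that the LlamaIndex agent below also imports:

    # Minimal readiness check for the Weaviate container defined above.
    import weaviate

    client = weaviate.Client("http://localhost:8080")  # matches WEAVIATE_URL in config.yml
    print(client.is_ready())  # True once the container reports healthy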
embeddings/__init__.py ADDED
File without changes
embeddings/agents/__init__.py ADDED
File without changes
embeddings/agents/haystack.py ADDED
@@ -0,0 +1,68 @@
+ from embeddings.agents.interface import Ingest
+ from haystack.components.converters import PyPDFToDocument
+ from haystack.components.routers import FileTypeRouter
+ from haystack.components.preprocessors import DocumentSplitter, DocumentCleaner
+ from haystack.components.embedders import SentenceTransformersDocumentEmbedder
+ from haystack import Pipeline
+ from haystack_integrations.document_stores.weaviate.document_store import WeaviateDocumentStore
+ from haystack.components.writers import DocumentWriter
+ import timeit
+ import box
+ import yaml
+ from rich import print
+
+
+ # Import config vars
+ with open('config.yml', 'r', encoding='utf8') as ymlfile:
+     cfg = box.Box(yaml.safe_load(ymlfile))
+
+
+ class HaystackIngest(Ingest):
+     def run_ingest(self,
+                    payload: str,
+                    file_path: str,
+                    index_name: str) -> None:
+         print(f"\nRunning embeddings with {payload}\n")
+
+         file_list = [file_path]
+
+         start = timeit.default_timer()
+
+         document_store = WeaviateDocumentStore(url=cfg.WEAVIATE_URL, collection_settings={"class": index_name})
+         file_type_router = FileTypeRouter(mime_types=["application/pdf"])
+         pdf_converter = PyPDFToDocument()
+
+         document_cleaner = DocumentCleaner()
+         document_splitter = DocumentSplitter(
+             split_by="word",
+             split_length=cfg.SPLIT_LENGTH_HAYSTACK,
+             split_overlap=cfg.SPLIT_OVERLAP_HAYSTACK
+         )
+
+         document_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
+         document_writer = DocumentWriter(document_store)
+
+         preprocessing_pipeline = Pipeline()
+         preprocessing_pipeline.add_component(instance=file_type_router, name="file_type_router")
+         preprocessing_pipeline.add_component(instance=pdf_converter, name="pypdf_converter")
+         preprocessing_pipeline.add_component(instance=document_cleaner, name="document_cleaner")
+         preprocessing_pipeline.add_component(instance=document_splitter, name="document_splitter")
+         preprocessing_pipeline.add_component(instance=document_embedder, name="document_embedder")
+         preprocessing_pipeline.add_component(instance=document_writer, name="document_writer")
+
+         preprocessing_pipeline.connect("file_type_router.application/pdf", "pypdf_converter.sources")
+         preprocessing_pipeline.connect("pypdf_converter", "document_cleaner")
+         preprocessing_pipeline.connect("document_cleaner", "document_splitter")
+         preprocessing_pipeline.connect("document_splitter", "document_embedder")
+         preprocessing_pipeline.connect("document_embedder", "document_writer")
+
+         # preprocessing_pipeline.draw("pipeline.png")
+
+         preprocessing_pipeline.run({
+             "file_type_router": {"sources": file_list}
+         })
+
+         print(f"Number of documents in document store: {document_store.count_documents()}")
+
+         end = timeit.default_timer()
+         print(f"Time to embeddings data: {end - start}")
embeddings/agents/interface.py ADDED
@@ -0,0 +1,29 @@
+ from abc import ABC, abstractmethod
+ import warnings
+
+
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
+ warnings.filterwarnings("ignore", category=UserWarning)
+
+
+ # Abstract Interface
+ class Ingest(ABC):
+     @abstractmethod
+     def run_ingest(self,
+                    payload: str,
+                    file_path: str,
+                    index_name: str) -> None:
+         pass
+
+
+ # Factory Method
+ def get_ingest(agent_name: str) -> Ingest:
+     if agent_name == "llamaindex":
+         from .llamaindex import LlamaIndexIngest
+         return LlamaIndexIngest()
+     elif agent_name == "haystack":
+         from .haystack import HaystackIngest
+         return HaystackIngest()
+     else:
+         raise ValueError(f"Unknown agent: {agent_name}")
+
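
Using the factory is then a two-step lookup-and-run; a sketch, where the 'Invoice' index name is hypothetical:

    # Hypothetical usage of the ingest factory above.
    from embeddings.agents.interface import get_ingest

    ingest = get_ingest("llamaindex")  # or "haystack"
    ingest.run_ingest("llamaindex", "data/invoice_1.pdf", "Invoice")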
embeddings/agents/llamaindex.py ADDED
@@ -0,0 +1,85 @@
+ from .interface import Ingest
+ import weaviate
+ from llama_index.core import StorageContext, SimpleDirectoryReader, Settings, VectorStoreIndex
+ from llama_index.vector_stores.weaviate import WeaviateVectorStore
+ from llama_index.embeddings.huggingface import HuggingFaceEmbedding
+ import box
+ import yaml
+ from rich.progress import Progress, SpinnerColumn, TextColumn
+ import timeit
+ from rich import print
+ import warnings
+
+
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
+ warnings.filterwarnings("ignore", category=UserWarning)
+
+
+ class LlamaIndexIngest(Ingest):
+     def run_ingest(self,
+                    payload: str,
+                    file_path: str,
+                    index_name: str) -> None:
+         print(f"\nRunning ingest with {payload}\n")
+
+         # Import config vars
+         with open('config.yml', 'r', encoding='utf8') as ymlfile:
+             cfg = box.Box(yaml.safe_load(ymlfile))
+
+         start = timeit.default_timer()
+
+         client = self.invoke_pipeline_step(lambda: weaviate.Client(cfg.WEAVIATE_URL),
+                                            "Connecting to Weaviate...")
+
+         documents = self.invoke_pipeline_step(lambda: self.load_documents(file_path),
+                                               "Loading documents...")
+
+         embeddings = self.invoke_pipeline_step(lambda: self.load_embedding_model(cfg.EMBEDDINGS),
+                                                "Loading embedding model...")
+
+         index = self.invoke_pipeline_step(lambda: self.build_index(client, embeddings, documents, index_name,
+                                                                    cfg.CHUNK_SIZE),
+                                           "Building index...")
+
+         end = timeit.default_timer()
+         print(f"\nTime to ingest data: {end - start}\n")
+
+     def load_documents(self, file_path):
+         documents = SimpleDirectoryReader(input_files=[file_path], required_exts=[".pdf", ".PDF"]).load_data()
+         print(f"\nLoaded {len(documents)} documents")
+         print(f"\nFirst document: {documents[0]}")
+         print("\nFirst document content:\n")
+         print(documents[0])
+         print()
+         return documents
+
+     def load_embedding_model(self, model_name):
+         return HuggingFaceEmbedding(model_name=model_name)
+
+     def build_index(self, weaviate_client, embed_model, documents, index_name, chunk_size):
+         # Delete index if it already exists, to avoid data corruption
+         weaviate_client.schema.delete_class(index_name)
+
+         Settings.chunk_size = chunk_size
+         Settings.llm = None
+         Settings.embed_model = embed_model
+
+         vector_store = WeaviateVectorStore(weaviate_client=weaviate_client, index_name=index_name)
+         storage_context = StorageContext.from_defaults(vector_store=vector_store)
+
+         index = VectorStoreIndex.from_documents(
+             documents,
+             storage_context=storage_context
+         )
+
+         return index
+
+     def invoke_pipeline_step(self, task_call, task_description):
+         with Progress(
+             SpinnerColumn(),
+             TextColumn("[progress.description]{task.description}"),
+             transient=False,
+         ) as progress:
+             progress.add_task(description=task_description, total=None)
+             ret = task_call()
+             return ret
engine.py ADDED
@@ -0,0 +1,82 @@
+ import warnings
+ import typer
+ from typing_extensions import Annotated, List
+ from rag.agents.interface import get_pipeline
+ import tempfile
+ import os
+ from rich import print
+
+
+ # Disable parallelism in the Huggingface tokenizers library to prevent potential deadlocks and ensure consistent behavior.
+ # This is especially important in environments where multiprocessing is used, as forking after parallelism can lead to issues.
+ # Note: Disabling parallelism may impact performance, but it ensures safer and more predictable execution.
+ os.environ['TOKENIZERS_PARALLELISM'] = 'false'
+
+
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
+ warnings.filterwarnings("ignore", category=UserWarning)
+
+
+ def run(inputs: Annotated[str, typer.Argument(help="The list of fields to fetch")],
+         types: Annotated[str, typer.Argument(help="The list of types of the fields")] = None,
+         keywords: Annotated[str, typer.Argument(help="The list of table column keywords")] = None,
+         file_path: Annotated[str, typer.Option(help="The file to process")] = None,
+         agent: Annotated[str, typer.Option(help="Selected agent")] = "llamaindex",
+         index_name: Annotated[str, typer.Option(help="Index to identify embeddings")] = None,
+         options: Annotated[List[str], typer.Option(help="Options to pass to the agent")] = None,
+         group_by_rows: Annotated[bool, typer.Option(help="Group JSON collection by rows")] = True,
+         update_targets: Annotated[bool, typer.Option(help="Update targets")] = True,
+         debug: Annotated[bool, typer.Option(help="Enable debug mode")] = False):
+
+     query = 'retrieve ' + inputs
+     query_types = types
+
+     query_inputs_arr = [param.strip() for param in inputs.split(',')] if query_types else []
+     query_types_arr = [param.strip() for param in query_types.split(',')] if query_types else []
+     keywords_arr = [param.strip() for param in keywords.split(',')] if keywords is not None else None
+
+     if not query_types:
+         query = inputs
+
+     user_selected_agent = agent  # Modify this as needed
+
+     try:
+         rag = get_pipeline(user_selected_agent)
+         answer = rag.run_pipeline(user_selected_agent, query_inputs_arr, query_types_arr, keywords_arr, query, file_path,
+                                   index_name, options, group_by_rows, update_targets, debug)
+
+         print(f"\nJSON response:\n")
+         print(answer)
+     except ValueError as e:
+         print(f"Caught an exception: {e}")
+
+
+ async def run_from_api_engine(user_selected_agent, query_inputs_arr, query_types_arr, keywords_arr, query, index_name,
+                               options_arr, file, group_by_rows, update_targets, debug):
+     try:
+         rag = get_pipeline(user_selected_agent)
+
+         if file is not None:
+             with tempfile.TemporaryDirectory() as temp_dir:
+                 temp_file_path = os.path.join(temp_dir, file.filename)
+
+                 # Save the uploaded file to the temporary directory
+                 with open(temp_file_path, 'wb') as temp_file:
+                     content = await file.read()
+                     temp_file.write(content)
+
+                 answer = rag.run_pipeline(user_selected_agent, query_inputs_arr, query_types_arr, keywords_arr, query,
+                                           temp_file_path, index_name, options_arr, group_by_rows, update_targets,
+                                           debug, False)
+         else:
+             answer = rag.run_pipeline(user_selected_agent, query_inputs_arr, query_types_arr, keywords_arr, query,
+                                       None, index_name, options_arr, group_by_rows, update_targets,
+                                       debug, False)
+     except ValueError as e:
+         raise e
+
+     return answer
+
+
+ if __name__ == "__main__":
+     typer.run(run)
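A hedged usage sketch of this CLI entry point (the field names, types, file path and index name are illustrative values, and a configured agent environment is assumed):

    # Shell equivalent (illustrative values):
    #   python engine.py "invoice_number, total" "str, float" --file-path data/invoice_1.pdf --agent llamaindex --index-name DemoIndex
    from engine import run

    run(inputs="invoice_number, total",
        types="str, float",
        file_path="data/invoice_1.pdf",
        agent="llamaindex",
        index_name="DemoIndex")

Without the types argument, the inputs string is passed through verbatim as the query instead of being split into typed fields.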
ingest.py ADDED
@@ -0,0 +1,42 @@
+ import warnings
+ from embeddings.agents.interface import get_ingest
+ import typer
+ from typing_extensions import Annotated
+ import tempfile
+ import os
+
+
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
+ warnings.filterwarnings("ignore", category=UserWarning)
+
+
+ def run(file_path: Annotated[str, typer.Option(help="The file to process")],
+         agent: Annotated[str, typer.Option(help="Ingest agent")] = "llamaindex",
+         index_name: Annotated[str, typer.Option(help="Index to identify embeddings")] = None):
+     user_selected_agent = agent  # Modify this as needed
+     ingest = get_ingest(user_selected_agent)
+     ingest.run_ingest(user_selected_agent, file_path, index_name)
+
+
+ async def run_from_api_ingest(agent, index_name, file, debug):
+     try:
+         user_selected_agent = agent  # Modify this as needed
+         ingest = get_ingest(user_selected_agent)
+
+         with tempfile.TemporaryDirectory() as temp_dir:
+             temp_file_path = os.path.join(temp_dir, file.filename)
+
+             # Save the uploaded file to the temporary directory
+             with open(temp_file_path, 'wb') as temp_file:
+                 content = await file.read()
+                 temp_file.write(content)
+
+             ingest.run_ingest(user_selected_agent, temp_file_path, index_name)
+     except ValueError as e:
+         raise e
+
+     return {"message": "Ingested successfully"}
+
+
+ if __name__ == "__main__":
+     typer.run(run)
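A corresponding usage sketch for the ingest entry point (the file path and index name are illustrative; a Weaviate instance reachable at the configured WEAVIATE_URL is assumed):

    # Shell equivalent (illustrative values):
    #   python ingest.py --file-path data/invoice_1.pdf --agent llamaindex --index-name DemoIndex
    from ingest import run

    run(file_path="data/invoice_1.pdf", agent="llamaindex", index_name="DemoIndex")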
rag/__init__.py ADDED
File without changes
rag/agents/__init__.py ADDED
File without changes
rag/agents/haystack/__init__.py ADDED
File without changes
rag/agents/haystack/haystack.py ADDED
@@ -0,0 +1,227 @@
+ from rag.agents.interface import Pipeline as PipelineInterface
+ from typing import Any
+ from haystack import Pipeline
+ from haystack_integrations.document_stores.weaviate.document_store import WeaviateDocumentStore
+ from haystack.components.embedders import SentenceTransformersTextEmbedder
+ from haystack_integrations.components.retrievers.weaviate.embedding_retriever import WeaviateEmbeddingRetriever
+ from haystack.components.builders import PromptBuilder
+ from haystack_integrations.components.generators.ollama import OllamaGenerator
+ from pydantic import create_model
+ import json
+ from haystack import component
+ import pydantic
+ from typing import Optional, List
+ from pydantic import ValidationError
+ import timeit
+ import box
+ import yaml
+ from rich import print
+ from rich.progress import Progress, SpinnerColumn, TextColumn
+ import warnings
+
+
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
+ warnings.filterwarnings("ignore", category=UserWarning)
+
+
+ # Import config vars
+ with open('config.yml', 'r', encoding='utf8') as ymlfile:
+     cfg = box.Box(yaml.safe_load(ymlfile))
+
+
+ class HaystackPipeline(PipelineInterface):
+     def run_pipeline(self,
+                      payload: str,
+                      query_inputs: [str],
+                      query_types: [str],
+                      keywords: [str],
+                      query: str,
+                      file_path: str,
+                      index_name: str,
+                      options: List[str] = None,
+                      group_by_rows: bool = True,
+                      update_targets: bool = True,
+                      debug: bool = False,
+                      local: bool = True) -> Any:
+         print(f"\nRunning pipeline with {payload}\n")
+
+         ResponseModel, json_schema = self.invoke_pipeline_step(lambda: self.build_response_class(query_inputs, query_types),
+                                                                "Building dynamic response class...",
+                                                                local)
+
+         output_validator = self.invoke_pipeline_step(lambda: self.build_validator(ResponseModel),
+                                                      "Building output validator...",
+                                                      local)
+
+         document_store = self.run_preprocessing_pipeline(index_name, local)
+
+         answer = self.run_inference_pipeline(document_store, json_schema, output_validator, query, local)
+
+         return answer
+
+     # Function to safely evaluate type strings
+     def safe_eval_type(self, type_str, context):
+         try:
+             return eval(type_str, {}, context)
+         except NameError:
+             raise ValueError(f"Type '{type_str}' is not recognized")
+
+     def build_response_class(self, query_inputs, query_types_as_strings):
+         # Controlled context for eval
+         context = {
+             'List': List,
+             'str': str,
+             'int': int,
+             'float': float
+             # Include other necessary types or typing constructs here
+         }
+
+         # Convert string representations to actual types
+         query_types = [self.safe_eval_type(type_str, context) for type_str in query_types_as_strings]
+
+         # Create fields dictionary
+         fields = {name: (type_, ...) for name, type_ in zip(query_inputs, query_types)}
+
+         DynamicModel = create_model('DynamicModel', **fields)
+
+         json_schema = DynamicModel.schema_json(indent=2)
+
+         return DynamicModel, json_schema
+
+     def build_validator(self, Invoice):
+         @component
+         class OutputValidator:
+             def __init__(self, pydantic_model: pydantic.BaseModel):
+                 self.pydantic_model = pydantic_model
+                 self.iteration_counter = 0
+
+             # Define the component output
+             @component.output_types(valid_replies=List[str], invalid_replies=Optional[List[str]],
+                                     error_message=Optional[str])
+             def run(self, replies: List[str]):
+
+                 self.iteration_counter += 1
+
+                 ## Try to parse the LLM's reply ##
+                 # If the LLM's reply is a valid object, return `"valid_replies"`
+                 try:
+                     output_dict = json.loads(replies[0].strip())
+                     # Disable data validation for now
+                     # self.pydantic_model.model_validate(output_dict)
+                     print(
+                         f"OutputValidator at Iteration {self.iteration_counter}: Valid JSON from LLM - No need for looping."
+                     )
+                     return {"valid_replies": replies}
+
+                 # If the LLM's reply is corrupted or not valid, return "invalid_replies" and the "error_message" for LLM to try again
+                 except (ValueError, ValidationError) as e:
+                     print(
+                         f"\nOutputValidator at Iteration {self.iteration_counter}: Invalid JSON from LLM - Let's try again.\n"
+                         f"Output from LLM:\n {replies[0]} \n"
+                         f"Error from OutputValidator: {e}"
+                     )
+                     return {"invalid_replies": replies, "error_message": str(e)}
+
+         output_validator = OutputValidator(pydantic_model=Invoice)
+
+         return output_validator
+
+     def run_preprocessing_pipeline(self, index_name, local):
+         document_store = WeaviateDocumentStore(url=cfg.WEAVIATE_URL, collection_settings={"class": index_name})
+
+         print(f"\nNumber of documents in document store: {document_store.count_documents()}\n")
+
+         if document_store.count_documents() == 0:
+             raise ValueError("Document store is empty. Please check your data source.")
+
+         return document_store
+
+     def run_inference_pipeline(self, document_store, json_schema, output_validator, query, local):
+         start = timeit.default_timer()
+
+         generator = OllamaGenerator(model=cfg.LLM_HAYSTACK,
+                                     url=cfg.OLLAMA_BASE_URL_HAYSTACK + "/api/generate",
+                                     timeout=900)
+
+         template = """
+         Given only the following document information, retrieve answer.
+         Ignore your own knowledge. Format response with the following JSON schema:
+         {{schema}}
+         Make sure your response is a dict and not a list. Return only JSON, no additional text.
+
+         Context:
+         {% for document in documents %}
+             {{ document.content }}
+         {% endfor %}
+
+         Question: {{ question }}?
+
+         {% if invalid_replies and error_message %}
+           You already created the following output in a previous attempt: {{invalid_replies}}
+           However, this doesn't comply with the format requirements from above and triggered this Python exception: {{error_message}}
+           Correct the output and try again. Just return the corrected output without any extra explanations.
+         {% endif %}
+         """
+
+         text_embedder = SentenceTransformersTextEmbedder(model=cfg.EMBEDDINGS_HAYSTACK,
+                                                          progress_bar=False)
+
+         retriever = WeaviateEmbeddingRetriever(document_store=document_store, top_k=3)
+
+         prompt_builder = PromptBuilder(template=template)
+
+         pipe = Pipeline(max_loops_allowed=cfg.MAX_LOOPS_ALLOWED_HAYSTACK)
+         pipe.add_component("embedder", text_embedder)
+         pipe.add_component("retriever", retriever)
+         pipe.add_component("prompt_builder", prompt_builder)
+         pipe.add_component("llm", generator)
+         pipe.add_component("output_validator", output_validator)
+
+         pipe.connect("embedder.embedding", "retriever.query_embedding")
+         pipe.connect("retriever", "prompt_builder.documents")
+         pipe.connect("prompt_builder", "llm")
+         pipe.connect("llm", "output_validator")
+         # If a component has more than one output or input, explicitly specify the connections:
+         pipe.connect("output_validator.invalid_replies", "prompt_builder.invalid_replies")
+         pipe.connect("output_validator.error_message", "prompt_builder.error_message")
+
+         question = (
+             query
+         )
+
+         response = self.invoke_pipeline_step(
+             lambda: pipe.run(
+                 {
+                     "embedder": {"text": question},
+                     "prompt_builder": {"question": question, "schema": json_schema}
+                 }
+             ),
+             "Running inference pipeline...",
+             local)
+
+         end = timeit.default_timer()
+
+         valid_reply = response["output_validator"]["valid_replies"][0]
+         valid_json = json.loads(valid_reply)
+         print(f"\nJSON response:\n")
+         print(valid_json)
+         print('\n' + ('=' * 50))
+
+         print(f"Time to retrieve answer: {end - start}")
+
+         return valid_json
+
+     def invoke_pipeline_step(self, task_call, task_description, local):
+         if local:
+             with Progress(
+                     SpinnerColumn(),
+                     TextColumn("[progress.description]{task.description}"),
+                     transient=False,
+             ) as progress:
+                 progress.add_task(description=task_description, total=None)
+                 ret = task_call()
+         else:
+             print(task_description)
+             ret = task_call()
+
+         return ret
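The build_response_class/safe_eval_type pair above is the pattern shared by several agents in this commit: type strings from the command line are evaluated inside a restricted namespace and turned into a Pydantic model on the fly. A condensed sketch of what it produces (field names and types are illustrative):

    from typing import List
    from pydantic import create_model

    # "invoice_number" -> str, "line_items" -> List[str], built from the
    # comma-separated field and type strings supplied on the command line
    fields = {"invoice_number": (str, ...), "line_items": (List[str], ...)}
    DynamicModel = create_model("DynamicModel", **fields)

    # This JSON schema is what gets handed to the prompt template as {{schema}}
    print(DynamicModel.schema_json(indent=2))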
rag/agents/instructor/__init__.py ADDED
File without changes
rag/agents/instructor/fcall.py ADDED
@@ -0,0 +1,77 @@
+ from rag.agents.interface import Pipeline
+ from openai import OpenAI
+ from pydantic import BaseModel, Field
+ import yfinance as yf
+ import instructor
+ import timeit
+ import box
+ import yaml
+ from rich import print
+ from typing import Any, List
+ import warnings
+
+
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
+ warnings.filterwarnings("ignore", category=UserWarning)
+
+
+ class FCall(Pipeline):
+     def run_pipeline(self,
+                      payload: str,
+                      query_inputs: [str],
+                      query_types: [str],
+                      keywords: [str],
+                      query: str,
+                      file_path: str,
+                      index_name: str,
+                      options: List[str] = None,
+                      group_by_rows: bool = True,
+                      update_targets: bool = True,
+                      debug: bool = False,
+                      local: bool = True) -> Any:
+         print(f"\nRunning pipeline with {payload}\n")
+
+         # Import config vars
+         with open('config.yml', 'r', encoding='utf8') as ymlfile:
+             cfg = box.Box(yaml.safe_load(ymlfile))
+
+         start = timeit.default_timer()
+
+         company = query
+
+         class StockInfo(BaseModel):
+             company: str = Field(..., description="Name of the company")
+             ticker: str = Field(..., description="Ticker symbol of the company")
+
+         # enables `response_model` in create call
+         client = instructor.patch(
+             OpenAI(
+                 base_url=cfg.OLLAMA_BASE_URL_FUNCTION,
+                 api_key="ollama",
+             ),
+             mode=instructor.Mode.JSON,
+         )
+
+         resp = client.chat.completions.create(
+             model=cfg.LLM_FUNCTION,
+             messages=[
+                 {
+                     "role": "user",
+                     "content": f"Return the company name and the ticker symbol of the {company}."
+                 }
+             ],
+             response_model=StockInfo,
+             max_retries=10
+         )
+
+         print(resp.model_dump_json(indent=2))
+         stock = yf.Ticker(resp.ticker)
+         hist = stock.history(period="1d")
+         stock_price = hist['Close'].iloc[-1]
+         print(f"The stock price of the {resp.company} is {stock_price} USD.")
+
+         end = timeit.default_timer()
+
+         print('=' * 50)
+
+         print(f"Time to retrieve answer: {end - start}")
rag/agents/instructor/helpers/__init__.py ADDED
File without changes
rag/agents/instructor/helpers/instructor_helper.py ADDED
@@ -0,0 +1,60 @@
+ from sparrow_parse.extractor.unstructured_processor import UnstructuredProcessor
+ from sparrow_parse.extractor.markdown_processor import MarkdownProcessor
+ import json
+
+
+ def execute_sparrow_processor(options, file_path, strategy, model_name, local, debug):
+     content, table_content = None, None
+     if "unstructured" in options:
+         processor = UnstructuredProcessor()
+         content, table_content = processor.extract_data(file_path, strategy, model_name,
+                                                         ['tables', 'unstructured'], local, debug)
+     elif "markdown" in options:
+         processor = MarkdownProcessor()
+         content, table_content = processor.extract_data(file_path, ['tables', 'markdown'], local, debug)
+
+     return content, table_content
+
+
+ def merge_dicts(json_str1, json_str2):
+     # Accept JSON strings or already-parsed dicts, since callers pass both
+     dict1 = json.loads(json_str1) if isinstance(json_str1, str) else json_str1
+     dict2 = json.loads(json_str2) if isinstance(json_str2, str) else json_str2
+
+     merged_dict = dict1.copy()
+     for key, value in dict2.items():
+         if key in merged_dict and isinstance(merged_dict[key], list) and isinstance(value, list):
+             merged_dict[key].extend(value)
+         else:
+             merged_dict[key] = value
+     return merged_dict
+
+
+ def track_query_output(keys, json_data, types):
+     # Convert JSON string to dictionary
+     data = json.loads(json_data)
+
+     # Initialize the result lists
+     result = []
+     result_types = []
+
+     # Iterate through each key in the keys array
+     for i, key in enumerate(keys):
+         # Check if the key is missing from the JSON or holds an empty/None value
+         if key not in data or data[key] is None or (isinstance(data[key], str) and not data[key].strip()):
+             result.append(key)
+             result_types.append(types[i])
+
+     return result, result_types
+
+
+ def add_answer_page(answer, page_name, answer_page):
+     if not isinstance(answer, dict):
+         raise ValueError("The answer should be a dictionary.")
+
+     # Parse answer_page if it is a JSON string
+     if isinstance(answer_page, str):
+         answer_page = json.loads(answer_page)
+
+     answer[page_name] = answer_page
+     return answer
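A quick illustration of the merge semantics above: list values under the same key are concatenated, everything else is overwritten (the values are illustrative):

    merged = merge_dicts('{"items": ["row1"]}', '{"items": ["row2"], "total": "10.00"}')
    # -> {"items": ["row1", "row2"], "total": "10.00"}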
rag/agents/instructor/instructor.py ADDED
@@ -0,0 +1,254 @@
+ from rag.agents.interface import Pipeline
+ from openai import OpenAI
+ import instructor
+ from .helpers.instructor_helper import execute_sparrow_processor, merge_dicts, track_query_output
+ from .helpers.instructor_helper import add_answer_page
+ from sparrow_parse.extractor.html_extractor import HTMLExtractor
+ from sparrow_parse.extractor.unstructured_processor import UnstructuredProcessor
+ from sparrow_parse.extractor.pdf_optimizer import PDFOptimizer
+ from pydantic import create_model
+ from typing import List
+ from rich.progress import Progress, SpinnerColumn, TextColumn
+ import timeit
+ from rich import print
+ from typing import Any
+ import shutil
+ import json
+ import box
+ import yaml
+ import warnings
+
+
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
+ warnings.filterwarnings("ignore", category=UserWarning)
+
+
+ # Import config vars
+ with open('config.yml', 'r', encoding='utf8') as ymlfile:
+     cfg = box.Box(yaml.safe_load(ymlfile))
+
+
+ class InstructorPipeline(Pipeline):
+     def run_pipeline(self,
+                      payload: str,
+                      query_inputs: [str],
+                      query_types: [str],
+                      keywords: [str],
+                      query: str,
+                      file_path: str,
+                      index_name: str,
+                      options: List[str] = None,
+                      group_by_rows: bool = True,
+                      update_targets: bool = True,
+                      debug: bool = False,
+                      local: bool = True) -> Any:
+         print(f"\nRunning pipeline with {payload}\n")
+
+         # Import config vars
+         with open('config.yml', 'r', encoding='utf8') as ymlfile:
+             cfg = box.Box(yaml.safe_load(ymlfile))
+
+         start = timeit.default_timer()
+
+         strategy = cfg.STRATEGY_INSTRUCTOR
+         model_name = cfg.MODEL_INSTRUCTOR
+         similarity_threshold_junk = cfg.SIMILARITY_THRESHOLD_JUNK_COLUMNS_INSTRUCTOR
+         similarity_threshold_column_id = cfg.SIMILARITY_THRESHOLD_COLUMN_ID_INSTRUCTOR
+         pdf_split_output_dir = None if cfg.PDF_SPLIT_OUTPUT_DIR_INSTRUCTOR == "" else cfg.PDF_SPLIT_OUTPUT_DIR_INSTRUCTOR
+         pdf_convert_to_images = cfg.PDF_CONVERT_TO_IMAGES_INSTRUCTOR
+
+         answer = '{}'
+         answer_form = '{}'
+
+         validate_options = self.validate_options(options)
+         if validate_options:
+             if options and "tables" in options:
+                 pdf_optimizer = PDFOptimizer()
+                 num_pages, output_files, temp_dir = pdf_optimizer.split_pdf_to_pages(file_path,
+                                                                                      pdf_split_output_dir,
+                                                                                      pdf_convert_to_images)
+
+                 if debug:
+                     print(f'The PDF file has {num_pages} pages.')
+                     print('The pages are stored in the following files:')
+                     for file in output_files:
+                         print(file)
+
+                 # support for multipage docs
+                 query_inputs_form, query_types_form = self.filter_fields_query(query_inputs, query_types, "form")
+
+                 for i, page in enumerate(output_files):
+                     content, table_contents = execute_sparrow_processor(options, page, strategy, model_name, local, debug)
+
+                     if debug:
+                         print(f"Query form inputs: {query_inputs_form}")
+                         print(f"Query form types: {query_types_form}")
+                     if len(query_inputs_form) > 0:
+                         query_form = "retrieve " + ", ".join(query_inputs_form)
+                         answer_form = self.execute(query_inputs_form, query_types_form, content, query_form, 'form', debug, local)
+                         query_inputs_form, query_types_form = track_query_output(query_inputs_form, answer_form, query_types_form)
+                         if debug:
+                             print(f"Answer from LLM: {answer_form}")
+                             print(f"Unprocessed query targets: {query_inputs_form}")
+
+                     answer_table = {}
+                     if table_contents is not None:
+                         query_targets, query_targets_types = self.filter_fields_query(query_inputs, query_types, "table")
+                         extractor = HTMLExtractor()
+
+                         answer_table, targets_unprocessed = extractor.read_data(query_targets, table_contents,
+                                                                                 similarity_threshold_junk,
+                                                                                 similarity_threshold_column_id,
+                                                                                 keywords, group_by_rows, update_targets,
+                                                                                 local, debug)
+
+                     if num_pages > 1:
+                         answer_current = merge_dicts(answer_form, answer_table)
+                         answer_current_page = add_answer_page({}, "page" + str(i + 1), answer_current)
+                         answer = merge_dicts(answer, json.dumps(answer_current_page))
+                         answer_form = '{}'
+                     else:
+                         answer = merge_dicts(answer_form, answer_table)
+
+                 answer = self.format_json_output(answer)
+
+                 shutil.rmtree(temp_dir, ignore_errors=True)
+             else:
+                 # No options provided
+                 processor = UnstructuredProcessor()
+                 content, table_content = processor.extract_data(file_path, strategy, model_name, None, local, debug)
+                 answer = self.execute(query_inputs, query_types, content, query, 'all', debug, local)
+         else:
+             raise ValueError(
+                 "Invalid combination of options provided. Only 'tables and unstructured' or 'tables and markdown' are allowed.")
+
+         end = timeit.default_timer()
+
+         print(f"\nJSON response:\n")
+         print(answer)
+         print('\n')
+         print('=' * 50)
+
+         print(f"Time to retrieve answer: {end - start}")
+
+         return answer
+
+     def execute(self, query_inputs, query_types, content, query, mode, debug, local):
+         if mode == 'form' or mode == 'all':
+             ResponseModel = self.invoke_pipeline_step(lambda: self.build_response_class(query_inputs, query_types),
+                                                       "Building dynamic response class for " + mode + " data...",
+                                                       local)
+
+             answer = self.invoke_pipeline_step(
+                 lambda: self.execute_query(query, content, ResponseModel, mode),
+                 "Executing query for " + mode + " data...",
+                 local
+             )
+
+             return answer
+
+     def execute_query(self, query, content, ResponseModel, mode):
+         client = instructor.from_openai(
+             OpenAI(
+                 base_url=cfg.OLLAMA_BASE_URL_INSTRUCTOR,
+                 api_key="ollama",
+             ),
+             mode=instructor.Mode.JSON,
+         )
+
+         resp = []
+         if mode == 'form' or mode == 'all':
+             resp = client.chat.completions.create(
+                 model=cfg.LLM_INSTRUCTOR,
+                 messages=[
+                     {
+                         "role": "user",
+                         "content": f"{query} from the following content {content}. if query field value is missing, return None."
+                     }
+                 ],
+                 response_model=ResponseModel,
+                 max_retries=3
+             )
+
+         answer = resp.model_dump_json(indent=4)
+
+         return answer
+
+     def filter_fields_query(self, query_inputs, query_types, mode):
+         fields = []
+
+         for query_input, query_type in zip(query_inputs, query_types):
+             if mode == "form" and query_type.startswith("List") is False:
+                 fields.append((query_input, query_type))
+             elif mode == "table" and query_type.startswith("List") is True:
+                 fields.append((query_input, query_type))
+
+         # return filtered query_inputs and query_types as two arrays of strings
+         query_inputs = [field[0] for field in fields]
+         query_types = [field[1] for field in fields]
+
+         return query_inputs, query_types
+
+     # Function to safely evaluate type strings
+     def safe_eval_type(self, type_str, context):
+         try:
+             return eval(type_str, {}, context)
+         except NameError:
+             raise ValueError(f"Type '{type_str}' is not recognized")
+
+     def build_response_class(self, query_inputs, query_types_as_strings):
+         # Controlled context for eval
+         context = {
+             'List': List,
+             'str': str,
+             'int': int,
+             'float': float
+             # Include other necessary types or typing constructs here
+         }
+
+         query_types_as_strings = [s.replace('Array', 'List') for s in query_types_as_strings]
+
+         # Convert string representations to actual types
+         query_types = [self.safe_eval_type(type_str, context) for type_str in query_types_as_strings]
+
+         # Create fields dictionary
+         fields = {name: (type_, ...) for name, type_ in zip(query_inputs, query_types)}
+
+         DynamicModel = create_model('DynamicModel', **fields)
+
+         return DynamicModel
+
+     def validate_options(self, options: List[str]) -> bool:
+         # Define valid combinations
+         valid_combinations = [
+             ["tables", "unstructured"],
+             ["tables", "markdown"]
+         ]
+
+         # Check for valid combinations or empty list
+         if not options:  # Valid if no options are provided
+             return True
+         if sorted(options) in (sorted(combination) for combination in valid_combinations):
+             return True
+         return False
+
+     def format_json_output(self, answer):
+         formatted_json = json.dumps(answer, indent=4)
+         formatted_json = formatted_json.replace('", "', '",\n"')
+         formatted_json = formatted_json.replace('}, {', '},\n{')
+         return formatted_json
+
+     def invoke_pipeline_step(self, task_call, task_description, local):
+         if local:
+             with Progress(
+                     SpinnerColumn(),
+                     TextColumn("[progress.description]{task.description}"),
+                     transient=False,
+             ) as progress:
+                 progress.add_task(description=task_description, total=None)
+                 ret = task_call()
+         else:
+             print(task_description)
+             ret = task_call()
+
+         return ret
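To make the form/table split above concrete, a small sketch (field names and types are illustrative; config.yml is assumed to be present, since this module loads it at import time). List[...] types are routed to table extraction, scalar types to form extraction:

    inputs = ["invoice_number", "total", "line_items"]
    types = ["str", "float", "List[str]"]

    pipeline = InstructorPipeline()
    pipeline.filter_fields_query(inputs, types, "form")   # (["invoice_number", "total"], ["str", "float"])
    pipeline.filter_fields_query(inputs, types, "table")  # (["line_items"], ["List[str]"])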
rag/agents/interface.py ADDED
@@ -0,0 +1,61 @@
+ from abc import ABC, abstractmethod
+ from typing import Any
+ from typing import List
+ import warnings
+
+
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
+ warnings.filterwarnings("ignore", category=UserWarning)
+
+
+ # Abstract Interface
+ class Pipeline(ABC):
+     @abstractmethod
+     def run_pipeline(self,
+                      payload: str,
+                      query_inputs: [str],
+                      query_types: [str],
+                      keywords: [str],
+                      query: str,
+                      file_path: str,
+                      index_name: str,
+                      options: List[str] = None,
+                      group_by_rows: bool = True,
+                      update_targets: bool = True,
+                      debug: bool = False,
+                      local: bool = True) -> Any:
+         pass
+
+
+ # Factory Method
+ def get_pipeline(agent_name: str) -> Pipeline:
+     if agent_name == "llamaindex":
+         from rag.agents.llamaindex.llamaindex import LlamaIndexPipeline
+         return LlamaIndexPipeline()
+     elif agent_name == "haystack":
+         from rag.agents.haystack.haystack import HaystackPipeline
+         return HaystackPipeline()
+     elif agent_name == "vllamaindex":
+         from rag.agents.llamaindex.vllamaindex import VLlamaIndexPipeline
+         return VLlamaIndexPipeline()
+     elif agent_name == "vprocessor":
+         from rag.agents.llamaindex.vprocessor import VProcessorPipeline
+         return VProcessorPipeline()
+     elif agent_name == "fcall":
+         from rag.agents.instructor.fcall import FCall
+         return FCall()
+     elif agent_name == "instructor":
+         from rag.agents.instructor.instructor import InstructorPipeline
+         return InstructorPipeline()
+     elif agent_name == "unstructured-light":
+         from rag.agents.unstructured.unstructured_light import UnstructuredLightPipeline
+         return UnstructuredLightPipeline()
+     elif agent_name == "unstructured":
+         from rag.agents.unstructured.unstructured import UnstructuredPipeline
+         return UnstructuredPipeline()
+     elif agent_name == "sparrow-parse":
+         from rag.agents.sparrow_parse.sparrow_parse import SparrowParsePipeline
+         return SparrowParsePipeline()
+     else:
+         raise ValueError(f"Unknown agent: {agent_name}")
+
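A minimal sketch of driving an agent through this factory, the same path engine.py takes (the Hugging Face space name is an illustrative placeholder and options handling is agent-specific; the sample image is from this commit's data directory):

    from rag.agents.interface import get_pipeline

    pipeline = get_pipeline("sparrow-parse")
    answer = pipeline.run_pipeline("sparrow-parse",
                                   [], [], None,
                                   '{"invoice_number": "str", "total": "str"}',
                                   "data/invoice_1.jpg",
                                   None,
                                   options=["huggingface", "user/space-name"])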
rag/agents/llamaindex/__init__.py ADDED
File without changes
rag/agents/llamaindex/llamaindex.py ADDED
@@ -0,0 +1,209 @@
+ from rag.agents.interface import Pipeline
+ from llama_index.core import VectorStoreIndex, Settings
+ from llama_index.embeddings.huggingface import HuggingFaceEmbedding
+ from llama_index.llms.ollama import Ollama
+ from llama_index.vector_stores.weaviate import WeaviateVectorStore
+ import weaviate
+ from pydantic.v1 import create_model
+ from typing import List
+ import box
+ import yaml
+ from rich.progress import Progress, SpinnerColumn, TextColumn
+ import warnings
+ import timeit
+ import time
+ import json
+ from rich import print
+ from typing import Any
+
+
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
+ warnings.filterwarnings("ignore", category=UserWarning)
+
+
+ class LlamaIndexPipeline(Pipeline):
+     def run_pipeline(self,
+                      payload: str,
+                      query_inputs: [str],
+                      query_types: [str],
+                      keywords: [str],
+                      query: str,
+                      file_path: str,
+                      index_name: str,
+                      options: List[str] = None,
+                      group_by_rows: bool = True,
+                      update_targets: bool = True,
+                      debug: bool = False,
+                      local: bool = True) -> Any:
+         print(f"\nRunning pipeline with {payload}\n")
+
+         if len(query_inputs) == 1:
+             raise ValueError("Please provide more than one query input")
+
+         start = timeit.default_timer()
+
+         rag_chain = self.build_rag_pipeline(query_inputs, query_types, index_name, debug, local)
+
+         end = timeit.default_timer()
+         print(f"Time to prepare RAG pipeline: {end - start}")
+
+         answer = self.process_query(query, rag_chain, debug, local)
+         return answer
+
+     def build_rag_pipeline(self, query_inputs, query_types, index_name, debug, local):
+         # Import config vars
+         with open('config.yml', 'r', encoding='utf8') as ymlfile:
+             cfg = box.Box(yaml.safe_load(ymlfile))
+
+         client = self.invoke_pipeline_step(lambda: weaviate.Client(cfg.WEAVIATE_URL),
+                                            "Connecting to Weaviate...",
+                                            local)
+
+         llm = self.invoke_pipeline_step(lambda: Ollama(model=cfg.LLM, base_url=cfg.OLLAMA_BASE_URL, temperature=0,
+                                                        request_timeout=900),
+                                         "Loading Ollama...",
+                                         local)
+
+         embeddings = self.invoke_pipeline_step(lambda: self.load_embedding_model(model_name=cfg.EMBEDDINGS),
+                                                "Loading embedding model...",
+                                                local)
+
+         index = self.invoke_pipeline_step(
+             lambda: self.build_index(cfg.CHUNK_SIZE, llm, embeddings, client, index_name),
+             "Building index...",
+             local)
+
+         ResponseModel = self.invoke_pipeline_step(lambda: self.build_response_class(query_inputs, query_types),
+                                                   "Building dynamic response class...",
+                                                   local)
+
+         # may want to try with similarity_top_k=5, default is 2
+         query_engine = self.invoke_pipeline_step(lambda: index.as_query_engine(
+             streaming=False,
+             output_cls=ResponseModel,
+             response_mode="compact"
+         ),
+             "Constructing query engine...",
+             local)
+
+         return query_engine
+
+     # Function to safely evaluate type strings
+     def safe_eval_type(self, type_str, context):
+         try:
+             return eval(type_str, {}, context)
+         except NameError:
+             raise ValueError(f"Type '{type_str}' is not recognized")
+
+     def build_response_class(self, query_inputs, query_types_as_strings):
+         # Controlled context for eval
+         context = {
+             'List': List,
+             'str': str,
+             'int': int,
+             'float': float
+             # Include other necessary types or typing constructs here
+         }
+
+         # Convert string representations to actual types
+         query_types = [self.safe_eval_type(type_str, context) for type_str in query_types_as_strings]
+
+         # Create fields dictionary
+         fields = {name: (type_, ...) for name, type_ in zip(query_inputs, query_types)}
+
+         DynamicModel = create_model('DynamicModel', **fields)
+
+         return DynamicModel
+
+     def load_embedding_model(self, model_name):
+         return HuggingFaceEmbedding(model_name=model_name)
+
+     def build_index(self, chunk_size, llm, embed_model, weaviate_client, index_name):
+         Settings.chunk_size = chunk_size
+         Settings.llm = llm
+         Settings.embed_model = embed_model
+
+         vector_store = WeaviateVectorStore(weaviate_client=weaviate_client, index_name=index_name)
+
+         index = VectorStoreIndex.from_vector_store(
+             vector_store=vector_store
+         )
+
+         return index
+
+     def process_query(self, query, rag_chain, debug=False, local=True) -> str:
+         start = timeit.default_timer()
+
+         step = 0
+         answer = None
+         while answer is None:
+             step += 1
+             if step > 1:
+                 print('Refining answer...')
+                 # add wait time, before refining to avoid spamming the server
+                 time.sleep(5)
+             if step > 3:
+                 # if we have refined 3 times, and still no answer, break
+                 answer = 'No answer found.'
+                 break
+
+             if local:
+                 with Progress(
+                         SpinnerColumn(),
+                         TextColumn("[progress.description]{task.description}"),
+                         transient=False,
+                 ) as progress:
+                     progress.add_task(description="Retrieving answer...", total=None)
+                     answer = self.get_rag_response(query, rag_chain, debug)
+             else:
+                 print('Retrieving answer...')
+                 answer = self.get_rag_response(query, rag_chain, debug)
+
+         end = timeit.default_timer()
+
+         print(f"\nJSON response:\n")
+         print(answer + '\n')
+         print('=' * 50)
+
+         print(f"Time to retrieve answer: {end - start}")
+
+         return answer
+
+     def get_rag_response(self, query, chain, debug=False) -> str | None:
+         try:
+             result = chain.query(query)
+         except ValueError as error:
+             text = error.args[0]
+             starting_str = "Could not extract json string from output: \n"
+             if (index := text.find(starting_str)) != -1:
+                 json_str = text[index + len(starting_str):]
+                 result = json_str + "}"
+             else:
+                 return
+
+         try:
+             # Convert and pretty print
+             data = json.loads(str(result))
+             data = json.dumps(data, indent=4)
+             return data
+         except (json.decoder.JSONDecodeError, TypeError):
+             print("The response is not in JSON format:\n")
+             print(result)
+
+     def invoke_pipeline_step(self, task_call, task_description, local):
+         if local:
+             with Progress(
+                     SpinnerColumn(),
+                     TextColumn("[progress.description]{task.description}"),
+                     transient=False,
+             ) as progress:
+                 progress.add_task(description=task_description, total=None)
+                 ret = task_call()
+         else:
+             print(task_description)
+             ret = task_call()
+
+         return ret
rag/agents/llamaindex/vllamaindex.py ADDED
@@ -0,0 +1,139 @@
+ from rag.agents.interface import Pipeline
+ from rich.progress import Progress, SpinnerColumn, TextColumn
+ from typing import Any
+ from pydantic import create_model
+ from typing import List
+ import warnings
+ import box
+ import yaml
+ import timeit
+ from rich import print
+ from llama_index.core import SimpleDirectoryReader
+ from llama_index.multi_modal_llms.ollama import OllamaMultiModal
+ from llama_index.core.program import MultiModalLLMCompletionProgram
+ from llama_index.core.output_parsers import PydanticOutputParser
+
+
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
+ warnings.filterwarnings("ignore", category=UserWarning)
+
+
+ # Import config vars
+ with open('config.yml', 'r', encoding='utf8') as ymlfile:
+     cfg = box.Box(yaml.safe_load(ymlfile))
+
+
+ class VLlamaIndexPipeline(Pipeline):
+     def run_pipeline(self,
+                      payload: str,
+                      query_inputs: [str],
+                      query_types: [str],
+                      keywords: [str],
+                      query: str,
+                      file_path: str,
+                      index_name: str,
+                      options: List[str] = None,
+                      group_by_rows: bool = True,
+                      update_targets: bool = True,
+                      debug: bool = False,
+                      local: bool = True) -> Any:
+         print(f"\nRunning pipeline with {payload}\n")
+
+         start = timeit.default_timer()
+
+         if file_path is None:
+             raise ValueError("File path is required for vllamaindex pipeline")
+
+         mm_model = self.invoke_pipeline_step(lambda: OllamaMultiModal(model=cfg.LLM_VLLAMAINDEX),
+                                              "Loading Ollama MultiModal...",
+                                              local)
+
+         # load as image documents
+         image_documents = self.invoke_pipeline_step(lambda: SimpleDirectoryReader(input_files=[file_path],
+                                                                                   required_exts=[".jpg", ".JPG",
+                                                                                                  ".JPEG"]).load_data(),
+                                                     "Loading image documents...",
+                                                     local)
+
+         ResponseModel = self.invoke_pipeline_step(lambda: self.build_response_class(query_inputs, query_types),
+                                                   "Building dynamic response class...",
+                                                   local)
+
+         prompt_template_str = """\
+         {query_str}
+
+         Return the answer as a Pydantic object. The Pydantic schema is given below:
+
+         """
+         mm_program = MultiModalLLMCompletionProgram.from_defaults(
+             output_parser=PydanticOutputParser(ResponseModel),
+             image_documents=image_documents,
+             prompt_template_str=prompt_template_str,
+             multi_modal_llm=mm_model,
+             verbose=True,
+         )
+
+         try:
+             response = self.invoke_pipeline_step(lambda: mm_program(query_str=query),
+                                                  "Running inference...",
+                                                  local)
+         except ValueError as e:
+             print(f"Error: {e}")
+             msg = 'Inference failed'
+             return '{"answer": "' + msg + '"}'
+
+         end = timeit.default_timer()
+
+         print(f"\nJSON response:\n")
+         for res in response:
+             print(res)
+         print('=' * 50)
+
+         print(f"Time to retrieve answer: {end - start}")
+
+         return response
+
+     # Function to safely evaluate type strings
+     def safe_eval_type(self, type_str, context):
+         try:
+             return eval(type_str, {}, context)
+         except NameError:
+             raise ValueError(f"Type '{type_str}' is not recognized")
+
+     def build_response_class(self, query_inputs, query_types_as_strings):
+         # Controlled context for eval
+         context = {
+             'List': List,
+             'str': str,
+             'int': int,
+             'float': float
+             # Include other necessary types or typing constructs here
+         }
+
+         # Convert string representations to actual types
+         query_types = [self.safe_eval_type(type_str, context) for type_str in query_types_as_strings]
+
+         # Create fields dictionary
+         fields = {name: (type_, ...) for name, type_ in zip(query_inputs, query_types)}
+
+         DynamicModel = create_model('DynamicModel', **fields)
+
+         return DynamicModel
+
+     def invoke_pipeline_step(self, task_call, task_description, local):
+         if local:
+             with Progress(
+                     SpinnerColumn(),
+                     TextColumn("[progress.description]{task.description}"),
+                     transient=False,
+             ) as progress:
+                 progress.add_task(description=task_description, total=None)
+                 ret = task_call()
+         else:
+             print(task_description)
+             ret = task_call()
+
+         return ret
rag/agents/llamaindex/vprocessor.py ADDED
@@ -0,0 +1,183 @@
+ from rag.agents.interface import Pipeline
+ from llama_index.core.program import LLMTextCompletionProgram
+ import json
+ from llama_index.llms.ollama import Ollama
+ from typing import List
+ from pydantic import create_model
+ from rich.progress import Progress, SpinnerColumn, TextColumn
+ import requests
+ import warnings
+ import box
+ import yaml
+ import timeit
+ from rich import print
+ from typing import Any
+
+
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
+ warnings.filterwarnings("ignore", category=UserWarning)
+
+
+ # Import config vars
+ with open('config.yml', 'r', encoding='utf8') as ymlfile:
+     cfg = box.Box(yaml.safe_load(ymlfile))
+
+
+ class VProcessorPipeline(Pipeline):
+     def run_pipeline(self,
+                      payload: str,
+                      query_inputs: [str],
+                      query_types: [str],
+                      keywords: [str],
+                      query: str,
+                      file_path: str,
+                      index_name: str,
+                      options: List[str] = None,
+                      group_by_rows: bool = True,
+                      update_targets: bool = True,
+                      debug: bool = False,
+                      local: bool = True) -> Any:
+         print(f"\nRunning pipeline with {payload}\n")
+
+         start = timeit.default_timer()
+
+         if file_path is None:
+             raise ValueError("File path is required for vprocessor pipeline")
+
+         with open(file_path, "rb") as file:
+             files = {'file': (file_path, file, 'image/jpeg')}
+
+             data = {
+                 'image_url': ''
+             }
+
+             response = self.invoke_pipeline_step(lambda: requests.post(cfg.OCR_ENDPOINT_VPROCESSOR,
+                                                                        data=data,
+                                                                        files=files,
+                                                                        timeout=180),
+                                                  "Running OCR...",
+                                                  local)
+
+         if response.status_code != 200:
+             print('Request failed with status code:', response.status_code)
+             print('Response:', response.text)
+
+             return "Failed to process file. Please try again."
+
+         end = timeit.default_timer()
+         print(f"Time to run OCR: {end - start}")
+
+         start = timeit.default_timer()
+
+         data = response.json()
+
+         ResponseModel = self.invoke_pipeline_step(lambda: self.build_response_class(query_inputs, query_types),
+                                                   "Building dynamic response class...",
+                                                   local)
+
+         prompt_template_str = """\
+         """ + query + """\
+         using this structured data, coming from OCR {document_data}.\
+         """
+
+         llm_ollama = self.invoke_pipeline_step(lambda: Ollama(model=cfg.LLM_VPROCESSOR,
+                                                               base_url=cfg.OLLAMA_BASE_URL_VPROCESSOR,
+                                                               temperature=0,
+                                                               request_timeout=900),
+                                                "Loading Ollama...",
+                                                local)
+
+         program = LLMTextCompletionProgram.from_defaults(
+             output_cls=ResponseModel,
+             prompt_template_str=prompt_template_str,
+             llm=llm_ollama,
+             verbose=True,
+         )
+
+         output = self.invoke_pipeline_step(lambda: program(document_data=data),
+                                            "Running inference...",
+                                            local)
+
+         answer = self.beautify_json(output.model_dump_json())
+
+         end = timeit.default_timer()
+
+         print(f"\nJSON response:\n")
+         print(answer + '\n')
+         print('=' * 50)
+
+         print(f"Time to retrieve answer: {end - start}")
+
+         return answer
+
+     def prepare_files(self, file_path, file):
+         if file_path is not None:
+             with open(file_path, "rb") as file:
+                 files = {'file': (file_path, file, 'image/jpeg')}
+
+             data = {
+                 'image_url': ''
+             }
+         else:
+             files = {'file': (file.filename, file.file, file.content_type)}
+
+             data = {
+                 'image_url': ''
+             }
+         return data, files
+
+     # Function to safely evaluate type strings
+     def safe_eval_type(self, type_str, context):
+         try:
+             return eval(type_str, {}, context)
+         except NameError:
+             raise ValueError(f"Type '{type_str}' is not recognized")
+
+     def build_response_class(self, query_inputs, query_types_as_strings):
+         # Controlled context for eval
+         context = {
+             'List': List,
+             'str': str,
+             'int': int,
+             'float': float
+             # Include other necessary types or typing constructs here
+         }
+
+         # Convert string representations to actual types
+         query_types = [self.safe_eval_type(type_str, context) for type_str in query_types_as_strings]
+
+         # Create fields dictionary
+         fields = {name: (type_, ...) for name, type_ in zip(query_inputs, query_types)}
+
+         DynamicModel = create_model('DynamicModel', **fields)
+
+         return DynamicModel
+
+     def invoke_pipeline_step(self, task_call, task_description, local):
+         if local:
+             with Progress(
+                     SpinnerColumn(),
+                     TextColumn("[progress.description]{task.description}"),
+                     transient=False,
+             ) as progress:
+                 progress.add_task(description=task_description, total=None)
+                 ret = task_call()
+         else:
+             print(task_description)
+             ret = task_call()
+
+         return ret
+
+     def beautify_json(self, result):
+         try:
+             # Convert and pretty print
+             data = json.loads(str(result))
+             data = json.dumps(data, indent=4)
+             return data
+         except (json.decoder.JSONDecodeError, TypeError):
+             print("The response is not in JSON format:\n")
+             print(result)
+
+         return {}
rag/agents/sparrow_parse/__init__.py ADDED
File without changes
rag/agents/sparrow_parse/sparrow_parse.py ADDED
@@ -0,0 +1,137 @@
+ from rag.agents.interface import Pipeline
+ from sparrow_parse.vllm.inference_factory import InferenceFactory
+ from sparrow_parse.extractors.vllm_extractor import VLLMExtractor
+ import timeit
+ from rich import print
+ from rich.progress import Progress, SpinnerColumn, TextColumn
+ from typing import Any, List
+ from .sparrow_validator import Validator
+ from .sparrow_utils import is_valid_json, get_json_keys_as_string
+ import warnings
+ import os
+
+
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
+ warnings.filterwarnings("ignore", category=UserWarning)
+
+
+ class SparrowParsePipeline(Pipeline):
+
+     def __init__(self):
+         pass
+
+     def run_pipeline(self,
+                      payload: str,
+                      query_inputs: [str],
+                      query_types: [str],
+                      keywords: [str],
+                      query: str,
+                      file_path: str,
+                      index_name: str,
+                      options: List[str] = None,
+                      group_by_rows: bool = True,
+                      update_targets: bool = True,
+                      debug: bool = False,
+                      local: bool = True) -> Any:
+         print(f"\nRunning pipeline with {payload}\n")
+
+         start = timeit.default_timer()
+
+         query_all_data = False
+         if query == "*":
+             query_all_data = True
+             query = None
+         else:
+             try:
+                 query, query_schema = self.invoke_pipeline_step(lambda: self.prepare_query_and_schema(query, debug),
+                                                                 "Preparing query and schema", local)
+             except ValueError as e:
+                 raise e
+
+         llm_output = self.invoke_pipeline_step(lambda: self.execute_query(options, query_all_data, query, file_path, debug),
+                                                "Executing query", local)
+
+         validation_result = None
+         if query_all_data is False:
+             validation_result = self.invoke_pipeline_step(lambda: self.validate_result(llm_output, query_all_data, query_schema, debug),
+                                                           "Validating result", local)
+
+         end = timeit.default_timer()
+
+         print(f"Time to retrieve answer: {end - start}")
+
+         return validation_result if validation_result is not None else llm_output
+
+     def prepare_query_and_schema(self, query, debug):
+         is_query_valid = is_valid_json(query)
+         if not is_query_valid:
+             raise ValueError("Invalid query. Please provide a valid JSON query.")
+
+         query_keys = get_json_keys_as_string(query)
+         query_schema = query
+         query = "retrieve " + query_keys
+
+         query = query + ". return response in JSON format, by strictly following this JSON schema: " + query_schema
+
+         return query, query_schema
+
+     def execute_query(self, options, query_all_data, query, file_path, debug):
+         extractor = VLLMExtractor()
+
+         # export HF_TOKEN="hf_"
+         config = {}
+         # Guard against options being None before inspecting the first element
+         if options and options[0] == 'huggingface':
+             config = {
+                 "method": options[0],  # Could be 'huggingface' or 'local_gpu'
+                 "hf_space": options[1],
+                 "hf_token": os.getenv('HF_TOKEN')
+             }
+         else:
+             # Handle other cases if needed
+             return "First element is not 'huggingface'"
+
+         # Use the factory to get the correct instance
+         factory = InferenceFactory(config)
+         model_inference_instance = factory.get_inference_instance()
+
+         input_data = [
+             {
+                 "image": file_path,
+                 "text_input": query
+             }
+         ]
+
+         # Now you can run inference without knowing which implementation is used
+         llm_output = extractor.run_inference(model_inference_instance, input_data, generic_query=query_all_data,
+                                              debug=debug)
+
+         return llm_output
+
+     def validate_result(self, llm_output, query_all_data, query_schema, debug):
+         validator = Validator(query_schema)
+
+         validation_result = validator.validate_json_against_schema(llm_output, validator.generated_schema)
+         if validation_result is not None:
+             return validation_result
+         else:
+             if debug:
+                 print("LLM output is valid according to the schema.")
+
+     def invoke_pipeline_step(self, task_call, task_description, local):
+         if local:
+             with Progress(
+                     SpinnerColumn(),
+                     TextColumn("[progress.description]{task.description}"),
+                     transient=False,
+             ) as progress:
+                 progress.add_task(description=task_description, total=None)
+                 ret = task_call()
+         else:
+             print(task_description)
+             ret = task_call()
+
+         return ret
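To trace the query construction above, a short sketch of prepare_query_and_schema in action (the field names are illustrative): the schema-style JSON query doubles as the validation schema and as the formatting instruction in the prompt.

    pipeline = SparrowParsePipeline()
    query, schema = pipeline.prepare_query_and_schema('{"invoice_number": "str", "total": "str"}', debug=False)
    print(query)
    # retrieve invoice_number, total. return response in JSON format,
    # by strictly following this JSON schema: {"invoice_number": "str", "total": "str"}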
rag/agents/sparrow_parse/sparrow_utils.py ADDED
@@ -0,0 +1,54 @@
1
+ import json
2
+
3
+
4
+ def is_valid_json(json_string):
5
+ try:
6
+ json.loads(json_string)
7
+ return True
8
+ except json.JSONDecodeError as e:
9
+ print("JSONDecodeError:", e)
10
+ return False
11
+
12
+
13
+ def get_json_keys_as_string(json_string):
14
+ try:
15
+ # Load the JSON string into a Python object
16
+ json_data = json.loads(json_string)
17
+
18
+ # If the input is a list, treat it like a dictionary by merging all the keys
19
+ if isinstance(json_data, list):
20
+ merged_dict = {}
21
+ for item in json_data:
22
+ if isinstance(item, dict):
23
+ merged_dict.update(item)
24
+ json_data = merged_dict # Now json_data is a dictionary
25
+
26
+ # A helper function to recursively gather keys while preserving order
27
+ def extract_keys(data, keys):
28
+ if isinstance(data, dict):
29
+ for key, value in data.items():
30
+ if isinstance(value, dict):
31
+ # Recursively extract from nested dictionaries
32
+ extract_keys(value, keys)
33
+ elif isinstance(value, list):
34
+ # Process each dictionary inside the list
35
+ for item in value:
36
+ if isinstance(item, dict):
37
+ extract_keys(item, keys)
38
+ else:
39
+ if key not in keys:
40
+ keys.append(key)
41
+ return keys
42
+
43
+ # List to hold the keys in order
44
+ keys = []
45
+
46
+ # Process the top-level dictionary first
47
+ extract_keys(json_data, keys)
48
+
49
+ # Join and return the keys as a comma-separated string
50
+ return ', '.join(keys)
51
+
52
+ except json.JSONDecodeError:
53
+ print("Invalid JSON string.")
54
+ return ''
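A quick worked example (not part of this commit) of the traversal above: container keys are skipped and only leaf keys are collected, in order of first appearance, so nested invoice items contribute their field names rather than "invoice_items" itself:

    import json
    from rag.agents.sparrow_parse.sparrow_utils import get_json_keys_as_string

    example = json.dumps({
        "invoice_no": "example",
        "invoice_items": [{"description": "example", "quantity": 0.0}]
    })
    print(get_json_keys_as_string(example))
    # -> invoice_no, description, quantity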
rag/agents/sparrow_parse/sparrow_validator.py ADDED
@@ -0,0 +1,26 @@
1
+ from genson import SchemaBuilder
2
+ from jsonschema import validate, ValidationError
3
+ import json
4
+
5
+
6
+ class Validator:
7
+ def __init__(self, example_json):
8
+ self.generated_schema = self.generate_schema_from_example(example_json)
9
+
10
+ def generate_schema_from_example(self, example_json):
11
+ # Parse the example JSON into a Python object
12
+ example_data = json.loads(example_json)
13
+
14
+ # Generate the schema using Genson
15
+ builder = SchemaBuilder()
16
+ builder.add_object(example_data)
17
+
18
+ return builder.to_schema()
19
+
20
+ def validate_json_against_schema(self, json_string, schema):
21
+ try:
22
+ json_data = json.loads(json_string) # Parse LLM JSON output
23
+ validate(instance=json_data, schema=schema) # Validate against schema
24
+ return None # Return None if valid
25
+ except (json.JSONDecodeError, ValidationError) as e:
26
+ return str(e) # Return error message if invalid
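Usage sketch (not part of this commit): genson infers a schema from the example payload given to the constructor, and validate_json_against_schema returns None for conforming LLM output or an error string otherwise:

    from rag.agents.sparrow_parse.sparrow_validator import Validator

    validator = Validator('{"invoice_no": "example", "total_gross_worth": 0.0}')

    print(validator.validate_json_against_schema(
        '{"invoice_no": "61356291", "total_gross_worth": 212.09}',
        validator.generated_schema))  # -> None (valid)

    print(validator.validate_json_against_schema(
        '{"invoice_no": 61356291, "total_gross_worth": 212.09}',
        validator.generated_schema))  # -> error text such as "61356291 is not of type 'string'"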
rag/agents/unstructured/__init__.py ADDED
File without changes
rag/agents/unstructured/unstructured.py ADDED
@@ -0,0 +1,372 @@
1
+ from rag.agents.interface import Pipeline
2
+ import uuid
3
+ import weaviate
4
+ from weaviate.util import get_valid_uuid
5
+ from unstructured.chunking.title import chunk_by_title
6
+ from unstructured.documents.elements import DataSourceMetadata
7
+ from unstructured.partition.json import partition_json
8
+ from sentence_transformers import SentenceTransformer
9
+ from langchain.vectorstores.weaviate import Weaviate
10
+ from langchain.prompts import PromptTemplate
11
+ from langchain_community.llms import Ollama
12
+ import tempfile
13
+ import subprocess
14
+ import os
15
+ from typing import List, Dict
16
+ import warnings
17
+ import box
18
+ import yaml
19
+ import timeit
20
+ import json
21
+ from rich import print
22
+ from typing import Any
23
+ from rich.progress import Progress, SpinnerColumn, TextColumn
24
+ from pydantic.v1 import create_model
25
+
26
+
27
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
28
+ warnings.filterwarnings("ignore", category=UserWarning)
29
+
30
+
31
+ # Import config vars
32
+ with open('config.yml', 'r', encoding='utf8') as ymlfile:
33
+ cfg = box.Box(yaml.safe_load(ymlfile))
34
+
35
+
36
+ class UnstructuredPipeline(Pipeline):
37
+ def run_pipeline(self,
38
+ payload: str,
39
+ query_inputs: [str],
40
+ query_types: [str],
41
+ keywords: [str],
42
+ query: str,
43
+ file_path: str,
44
+ index_name: str,
45
+ options: List[str] = None,
46
+ group_by_rows: bool = True,
47
+ update_targets: bool = True,
48
+ debug: bool = False,
49
+ local: bool = True) -> Any:
50
+ print(f"\nRunning pipeline with {payload}\n")
51
+
52
+ if len(query_inputs) == 1:
53
+ raise ValueError("Please provide more than one query input")
54
+
55
+ start = timeit.default_timer()
56
+
57
+ output_dir = cfg.OUTPUT_DIR_UNSTRUCTURED
58
+ input_dir = cfg.INPUT_DIR_UNSTRUCTURED
59
+ weaviate_url = cfg.WEAVIATE_URL_UNSTRUCTURED
60
+ embedding_model_name = cfg.EMBEDDINGS_UNSTRUCTURED
61
+ device = cfg.DEVICE_UNSTRUCTURED
62
+
63
+ with tempfile.TemporaryDirectory() as temp_dir:
64
+ temp_input_dir = os.path.join(temp_dir, input_dir)
65
+ temp_output_dir = os.path.join(temp_dir, output_dir) if debug is False else output_dir
66
+
67
+ if debug:
68
+ print(f"Copying {file_path} to {temp_input_dir}")
69
+ os.makedirs(temp_input_dir, exist_ok=True)
70
+ os.system(f"cp {file_path} {temp_input_dir}")
71
+
72
+ os.makedirs(temp_output_dir, exist_ok=True)
73
+
74
+ files = self.invoke_pipeline_step(
75
+ lambda: self.process_files(temp_output_dir, temp_input_dir),
76
+ "Processing file with unstructured...",
77
+ local
78
+ )
79
+
80
+ vectorstore, embedding_model = self.invoke_pipeline_step(
81
+ lambda: self.build_vector_store(weaviate_url, embedding_model_name, device, files, debug),
82
+ "Building vector store...",
83
+ local
84
+ )
85
+
86
+ llm = self.invoke_pipeline_step(
87
+ lambda: Ollama(model=cfg.LLM_UNSTRUCTURED,
88
+ base_url=cfg.BASE_URL_UNSTRUCTURED),
89
+ "Initializing Ollama...",
90
+ local
91
+ )
92
+
93
+ raw_result, similar_docs = self.invoke_pipeline_step(
94
+ lambda: self.question_answer(query, vectorstore, embedding_model, device, llm),
95
+ "Answering question...",
96
+ local
97
+ )
98
+
99
+ answer = self.invoke_pipeline_step(
100
+ lambda: self.validate_output(raw_result, query_inputs, query_types),
101
+ "Validating output...",
102
+ local
103
+ )
104
+
105
+ if debug:
106
+ print("\n\n\n-------------------------")
107
+ print(f"QUERY: {query}")
108
+ print("\n\n\n-------------------------")
109
+ print(f"Answer: {answer}")
110
+ print("\n\n\n-------------------------")
111
+ for index, result in enumerate(similar_docs):
112
+ print(f"\n\n-- RESULT {index + 1}:\n")
113
+ print(result)
114
+
115
+ end = timeit.default_timer()
116
+
117
+ print(f"\nJSON response:\n")
118
+ print(answer + '\n')
119
+ print('=' * 50)
120
+
121
+ print(f"Time to retrieve answer: {end - start}")
122
+
123
+ return answer
124
+
125
+ def process_files(self, temp_output_dir, temp_input_dir):
126
+ self.process_local(output_dir=temp_output_dir, num_processes=2, input_path=temp_input_dir)
127
+ files = self.get_result_files(temp_output_dir)
128
+ return files
129
+
130
+ def build_vector_store(self, weaviate_url, embedding_model_name, device, files, debug):
131
+ client = self.create_local_weaviate_client(db_url=weaviate_url)
132
+ my_schema = self.get_schema()
133
+ self.upload_schema(my_schema, weaviate=client)
134
+
135
+ vectorstore = Weaviate(client, "Doc", "text")
136
+ embedding_model = SentenceTransformer(embedding_model_name, device=device)
137
+
138
+ self.add_data_to_weaviate(
139
+ debug,
140
+ files=files,
141
+ client=client,
142
+ embedding_model=embedding_model,
143
+ device=device,
144
+ chunk_under_n_chars=cfg.CHUNK_UNDER_N_CHARS_UNSTRUCTURED,
145
+ chunk_new_after_n_chars=cfg.CHUNK_NEW_AFTER_N_CHARS_UNSTRUCTURED
146
+ )
147
+
148
+ if debug:
149
+ print(self.count_documents(client=client)['data']['Aggregate']['Doc'])
150
+
151
+ return vectorstore, embedding_model
152
+
153
+ def process_local(self, output_dir: str, num_processes: int, input_path: str):
154
+ command = [
155
+ "unstructured-ingest",
156
+ "local",
157
+ "--input-path", input_path,
158
+ "--output-dir", output_dir,
159
+ "--num-processes", str(num_processes),
160
+ "--recursive",
161
+ "--verbose",
162
+ ]
163
+
164
+ # Run the command
165
+ process = subprocess.Popen(command, stdout=subprocess.PIPE)
166
+ output, error = process.communicate()
167
+
168
+ # Print output
169
+ if process.returncode == 0:
170
+ print('Command executed successfully. Output:')
171
+ print(output.decode())
172
+ else:
173
+ print('Command failed. Error:')
174
+ print(error.decode())
175
+
176
+ def get_result_files(self, folder_path) -> List[Dict]:
177
+ file_list = []
178
+ for root, dirs, files in os.walk(folder_path):
179
+ for file in files:
180
+ if file.endswith('.json'):
181
+ file_path = os.path.join(root, file)
182
+ file_list.append(file_path)
183
+ return file_list
184
+
185
+
186
+ def create_local_weaviate_client(self, db_url: str):
187
+ return weaviate.Client(
188
+ url=db_url,
189
+ )
190
+
191
+ def get_schema(self, vectorizer: str = "none"):
192
+ return {
193
+ "classes": [
194
+ {
195
+ "class": "Doc",
196
+ "description": "A generic document class",
197
+ "vectorizer": vectorizer,
198
+ "properties": [
199
+ {
200
+ "name": "last_modified",
201
+ "dataType": ["text"],
202
+ "description": "Last modified date for the document",
203
+ },
204
+ {
205
+ "name": "player",
206
+ "dataType": ["text"],
207
+ "description": "Player related to the document",
208
+ },
209
+ {
210
+ "name": "position",
211
+ "dataType": ["text"],
212
+ "description": "Player Position related to the document",
213
+ },
214
+ {
215
+ "name": "text",
216
+ "dataType": ["text"],
217
+ "description": "Text content for the document",
218
+ },
219
+ ],
220
+ },
221
+ ],
222
+ }
223
+
224
+ def upload_schema(self, my_schema, weaviate):
225
+ weaviate.schema.delete_all()
226
+ weaviate.schema.create(my_schema)
227
+
228
+ def count_documents(self, client: weaviate.Client) -> Dict:
229
+ response = (
230
+ client.query
231
+ .aggregate("Doc")
232
+ .with_meta_count()
233
+ .do()
234
+ )
235
+ count = response
236
+ return count
237
+
238
+ def compute_embedding(self, chunk_text: List[str], embedding_model, device):
239
+ embeddings = embedding_model.encode(chunk_text, device=device)
240
+ return embeddings
241
+
242
+ def get_chunks(self, elements, embedding_model, device, chunk_under_n_chars=500, chunk_new_after_n_chars=1500):
243
+ for element in elements:
244
+ if not isinstance(element.metadata.data_source, DataSourceMetadata):
245
+ delattr(element.metadata, "data_source")
246
+
247
+ if hasattr(element.metadata, "coordinates"):
248
+ delattr(element.metadata, "coordinates")
249
+
250
+ chunks = chunk_by_title(
251
+ elements,
252
+ combine_text_under_n_chars=chunk_under_n_chars,
253
+ new_after_n_chars=chunk_new_after_n_chars
254
+ )
255
+
256
+ for i in range(len(chunks)):
257
+ chunks[i] = {"last_modified": chunks[i].metadata.last_modified, "text": chunks[i].text}
258
+
259
+ chunk_texts = [x['text'] for x in chunks]
260
+ embeddings = self.compute_embedding(chunk_texts, embedding_model, device)
261
+ return chunks, embeddings
262
+
263
+ def add_data_to_weaviate(self, debug, files, client, embedding_model, device, chunk_under_n_chars=500, chunk_new_after_n_chars=1500):
264
+ for filename in files:
265
+ try:
266
+ elements = partition_json(filename=filename)
267
+ chunks, embeddings = self.get_chunks(elements, embedding_model, device, chunk_under_n_chars, chunk_new_after_n_chars)
268
+ except IndexError as e:
269
+ print(e)
270
+ continue
271
+
272
+ if debug:
273
+ print(f"Uploading {len(chunks)} chunks for {str(filename)}.")
274
+
275
+ for i, chunk in enumerate(chunks):
276
+ client.batch.add_data_object(
277
+ data_object=chunk,
278
+ class_name="doc",
279
+ uuid=get_valid_uuid(uuid.uuid4()),
280
+ vector=embeddings[i]
281
+ )
282
+
283
+ client.batch.flush()
284
+
285
+ def question_answer(self, question: str, vectorstore: Weaviate, embedding_model, device, llm):
286
+ embedding = self.compute_embedding(question, embedding_model, device)
287
+ similar_docs = vectorstore.max_marginal_relevance_search_by_vector(embedding)
288
+ content = [x.page_content for x in similar_docs]
289
+ prompt_template = PromptTemplate.from_template(
290
+ """\
291
+ Given context about the subject, answer the question based on the context provided to the best of your ability.
292
+ Context: {context}
293
+ Question:
294
+ {question}
295
+ Answer:
296
+ """
297
+ )
298
+ prompt = prompt_template.format(context=content, question=question)
299
+ answer = llm(prompt)
300
+ return answer, similar_docs
301
+
302
+ def validate_output(self, raw_result, query_inputs, query_types):
303
+ if raw_result is None:
304
+ return {}
305
+
306
+ clean_str = raw_result.replace('<|im_end|>', '')
307
+
308
+ # Convert the cleaned string to a dictionary
309
+ response_dict = json.loads(clean_str)
310
+
311
+ ResponseModel = self.build_response_class(query_inputs, query_types)
312
+
313
+ # Validate and create a Pydantic model instance
314
+ validated_response = ResponseModel(**response_dict)
315
+
316
+ # Convert the model instance to JSON
317
+ answer = self.beautify_json(validated_response.json())
318
+
319
+ return answer
320
+
321
+ def safe_eval_type(self, type_str, context):
322
+ try:
323
+ return eval(type_str, {}, context)
324
+ except NameError:
325
+ raise ValueError(f"Type '{type_str}' is not recognized")
326
+
327
+ def build_response_class(self, query_inputs, query_types_as_strings):
328
+ # Controlled context for eval
329
+ context = {
330
+ 'List': List,
331
+ 'str': str,
332
+ 'int': int,
333
+ 'float': float
334
+ # Include other necessary types or typing constructs here
335
+ }
336
+
337
+ # Convert string representations to actual types
338
+ query_types = [self.safe_eval_type(type_str, context) for type_str in query_types_as_strings]
339
+
340
+ # Create fields dictionary
341
+ fields = {name: (type_, ...) for name, type_ in zip(query_inputs, query_types)}
342
+
343
+ DynamicModel = create_model('DynamicModel', **fields)
344
+
345
+ return DynamicModel
346
+
347
+ def beautify_json(self, result):
348
+ try:
349
+ # Convert and pretty print
350
+ data = json.loads(str(result))
351
+ data = json.dumps(data, indent=4)
352
+ return data
353
+ except (json.decoder.JSONDecodeError, TypeError):
354
+ print("The response is not in JSON format:\n")
355
+ print(result)
356
+
357
+ return {}
358
+
359
+ def invoke_pipeline_step(self, task_call, task_description, local):
360
+ if local:
361
+ with Progress(
362
+ SpinnerColumn(),
363
+ TextColumn("[progress.description]{task.description}"),
364
+ transient=False,
365
+ ) as progress:
366
+ progress.add_task(description=task_description, total=None)
367
+ ret = task_call()
368
+ else:
369
+ print(task_description)
370
+ ret = task_call()
371
+
372
+ return ret
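For reference, a sketch (not part of this commit) of what build_response_class above produces: the comma-separated query inputs and type strings passed on the command line become a dynamic Pydantic model, which is what actually enforces the response types:

    from typing import List
    from pydantic.v1 import create_model

    # equivalent to build_response_class(["invoice_number", "names_of_invoice_items"],
    #                                    ["int", "List[str]"])
    DynamicModel = create_model('DynamicModel',
                                invoice_number=(int, ...),
                                names_of_invoice_items=(List[str], ...))

    validated = DynamicModel(invoice_number=61356291,
                             names_of_invoice_items=["Wine Glasses Goblets Pair Clear Glass"])
    print(validated.json())
    # {"invoice_number": 61356291, "names_of_invoice_items": ["Wine Glasses Goblets Pair Clear Glass"]}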
rag/agents/unstructured/unstructured_light.py ADDED
@@ -0,0 +1,293 @@
1
+ from rag.agents.interface import Pipeline
2
+ from unstructured.partition.pdf import partition_pdf
3
+ from unstructured.partition.image import partition_image
4
+ from unstructured.staging.base import elements_to_json
5
+ from langchain_community.document_loaders import TextLoader
6
+ from langchain.text_splitter import CharacterTextSplitter
7
+ from langchain_community.embeddings import OllamaEmbeddings
8
+ from langchain.chains import RetrievalQA
9
+ from langchain_community.vectorstores import Chroma
10
+ from langchain_community.llms import Ollama
11
+ from pydantic.v1 import create_model
12
+ from typing import List
13
+ from rich.progress import Progress, SpinnerColumn, TextColumn
14
+ import tempfile
15
+ import json
16
+ import warnings
17
+ import box
18
+ import yaml
19
+ import timeit
20
+ from rich import print
21
+ from typing import Any
22
+ import os
23
+
24
+
25
+ warnings.filterwarnings("ignore", category=DeprecationWarning)
26
+ warnings.filterwarnings("ignore", category=UserWarning)
27
+
28
+
29
+ # Import config vars
30
+ with open('config.yml', 'r', encoding='utf8') as ymlfile:
31
+ cfg = box.Box(yaml.safe_load(ymlfile))
32
+
33
+
34
+ class UnstructuredLightPipeline(Pipeline):
35
+ def run_pipeline(self,
36
+ payload: str,
37
+ query_inputs: [str],
38
+ query_types: [str],
39
+ keywords: [str],
40
+ query: str,
41
+ file_path: str,
42
+ index_name: str,
43
+ options: List[str] = None,
44
+ group_by_rows: bool = True,
45
+ update_targets: bool = True,
46
+ debug: bool = False,
47
+ local: bool = True) -> Any:
48
+ print(f"\nRunning pipeline with {payload}\n")
49
+
50
+ if len(query_inputs) == 1:
51
+ raise ValueError("Please provide more than one query input")
52
+
53
+ start = timeit.default_timer()
54
+
55
+ strategy = cfg.STRATEGY_UNSTRUCTURED_LIGHT
56
+ model_name = cfg.MODEL_UNSTRUCTURED_LIGHT
57
+
58
+ extract_tables = False
59
+ # Initialize options as an empty list if it is None
60
+ options = options or []
61
+ if "tables" in options:
62
+ extract_tables = True
63
+
64
+ # Extracts the elements from the PDF
65
+ elements = self.invoke_pipeline_step(
66
+ lambda: self.process_file(file_path, strategy, model_name),
67
+ "Extracting elements from the document...",
68
+ local
69
+ )
70
+
71
+ if debug:
72
+ new_extension = 'json' # the elements are serialized as JSON, so keep the .json extension
73
+ new_file_path = self.change_file_extension(file_path, new_extension)
74
+
75
+ documents = self.invoke_pipeline_step(
76
+ lambda: self.load_text_data(elements, new_file_path, extract_tables),
77
+ "Loading text data...",
78
+ local
79
+ )
80
+ else:
81
+ with tempfile.TemporaryDirectory() as temp_dir:
82
+ temp_file_path = os.path.join(temp_dir, "file_data.json")
83
+
84
+ documents = self.invoke_pipeline_step(
85
+ lambda: self.load_text_data(elements, temp_file_path, extract_tables),
86
+ "Loading text data...",
87
+ local
88
+ )
89
+
90
+ docs = self.invoke_pipeline_step(
91
+ lambda: self.split_text(documents, cfg.CHUNK_SIZE_UNSTRUCTURED_LIGHT, cfg.OVERLAP_UNSTRUCTURED_LIGHT),
92
+ "Splitting text...",
93
+ local
94
+ )
95
+
96
+ db = self.invoke_pipeline_step(
97
+ lambda: self.prepare_vector_store(docs, cfg.EMBEDDINGS_UNSTRUCTURED_LIGHT),
98
+ "Preparing vector store...",
99
+ local
100
+ )
101
+
102
+ llm = self.invoke_pipeline_step(
103
+ lambda: Ollama(model=cfg.LLM_UNSTRUCTURED_LIGHT,
104
+ base_url=cfg.BASE_URL_UNSTRUCTURED_LIGHT),
105
+ "Initializing Ollama...",
106
+ local
107
+ )
108
+
109
+ raw_result = self.invoke_pipeline_step(
110
+ lambda: self.execute_langchain_query(llm, db, query),
111
+ "Executing query...",
112
+ local
113
+ )
114
+
115
+ answer = self.invoke_pipeline_step(
116
+ lambda: self.validate_output(raw_result, query_inputs, query_types),
117
+ "Validating output...",
118
+ local
119
+ )
120
+
121
+ end = timeit.default_timer()
122
+
123
+ print(f"\nJSON response:\n")
124
+ print(answer + '\n')
125
+ print('=' * 50)
126
+
127
+ print(f"Time to retrieve answer: {end - start}")
128
+
129
+ return answer
130
+
131
+ def process_file(self, file_path, strategy, model_name):
132
+ elements = None
133
+
134
+ if file_path.lower().endswith('.pdf'):
135
+ elements = partition_pdf(
136
+ filename=file_path,
137
+ strategy=strategy,
138
+ infer_table_structure=True,
139
+ model_name=model_name
140
+ )
141
+ elif file_path.lower().endswith(('.jpg', '.jpeg', '.png')):
142
+ elements = partition_image(
143
+ filename=file_path,
144
+ strategy=strategy,
145
+ infer_table_structure=True,
146
+ model_name=model_name
147
+ )
148
+
149
+ return elements
150
+
151
+ def load_text_data(self, elements, file_path, extract_tables):
152
+ elements_to_json(elements, filename=file_path)
153
+ text_file = self.process_json_file(file_path, extract_tables)
154
+
155
+ loader = TextLoader(text_file)
156
+ documents = loader.load()
157
+
158
+ return documents
159
+
160
+ def split_text(self, text, chunk_size, overlap):
161
+ text_splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=overlap)
162
+ docs = text_splitter.split_documents(text)
163
+
164
+ return docs
165
+
166
+ def prepare_vector_store(self, docs, model_name):
167
+ db = Chroma.from_documents(
168
+ documents=docs,
169
+ collection_name="sparrow-rag",
170
+ embedding=OllamaEmbeddings(model=model_name)
171
+ )
172
+
173
+ return db
174
+
175
+ def execute_langchain_query(self, llm, db, query):
176
+ qa_chain = RetrievalQA.from_chain_type(llm, retriever=db.as_retriever())
177
+ response = qa_chain({"query": query})
178
+ raw_result = response['result']
179
+
180
+ return raw_result
181
+
182
+ def validate_output(self, raw_result, query_inputs, query_types):
183
+ if raw_result is None:
184
+ return {}
185
+
186
+ clean_str = raw_result.replace('<|im_end|>', '')
187
+
188
+ # Convert the cleaned string to a dictionary
189
+ response_dict = json.loads(clean_str)
190
+
191
+ ResponseModel = self.build_response_class(query_inputs, query_types)
192
+
193
+ # Validate and create a Pydantic model instance
194
+ validated_response = ResponseModel(**response_dict)
195
+
196
+ # Convert the model instance to JSON
197
+ answer = self.beautify_json(validated_response.json())
198
+
199
+ return answer
200
+
201
+ def process_json_file(self, input_data, extract_tables):
202
+ # Read the JSON file
203
+ with open(input_data, 'r') as file:
204
+ data = json.load(file)
205
+
206
+ # Iterate over the JSON data and extract required table elements
207
+ extracted_elements = []
208
+ for entry in data:
209
+ if entry["type"] == "Table":
210
+ extracted_elements.append(entry["metadata"]["text_as_html"])
211
+ elif entry["type"] == "Title" and extract_tables is False:
212
+ extracted_elements.append(entry["text"])
213
+ elif entry["type"] == "NarrativeText" and extract_tables is False:
214
+ extracted_elements.append(entry["text"])
215
+ elif entry["type"] == "UncategorizedText" and extract_tables is False:
216
+ extracted_elements.append(entry["text"])
217
+
218
+ # Write the extracted elements to the output file
219
+ new_extension = 'txt' # the extracted text is written alongside the input JSON file
220
+ new_file_path = self.change_file_extension(input_data, new_extension)
221
+ with open(new_file_path, 'w') as output_file:
222
+ for element in extracted_elements:
223
+ output_file.write(element + "\n\n") # Adding two newlines for separation
224
+
225
+ return new_file_path
226
+
227
+ # Function to safely evaluate type strings
228
+ def safe_eval_type(self, type_str, context):
229
+ try:
230
+ return eval(type_str, {}, context)
231
+ except NameError:
232
+ raise ValueError(f"Type '{type_str}' is not recognized")
233
+
234
+ def build_response_class(self, query_inputs, query_types_as_strings):
235
+ # Controlled context for eval
236
+ context = {
237
+ 'List': List,
238
+ 'str': str,
239
+ 'int': int,
240
+ 'float': float
241
+ # Include other necessary types or typing constructs here
242
+ }
243
+
244
+ # Convert string representations to actual types
245
+ query_types = [self.safe_eval_type(type_str, context) for type_str in query_types_as_strings]
246
+
247
+ # Create fields dictionary
248
+ fields = {name: (type_, ...) for name, type_ in zip(query_inputs, query_types)}
249
+
250
+ DynamicModel = create_model('DynamicModel', **fields)
251
+
252
+ return DynamicModel
253
+
254
+ def change_file_extension(self, file_path, new_extension):
255
+ # Check if the new extension starts with a dot and add one if not
256
+ if not new_extension.startswith('.'):
257
+ new_extension = '.' + new_extension
258
+
259
+ # Split the file path into two parts: the base (everything before the last dot) and the extension
260
+ # If there's no dot in the filename, it'll just return the original filename without an extension
261
+ base = file_path.rsplit('.', 1)[0]
262
+
263
+ # Concatenate the base with the new extension
264
+ new_file_path = base + new_extension
265
+
266
+ return new_file_path
267
+
268
+ def beautify_json(self, result):
269
+ try:
270
+ # Convert and pretty print
271
+ data = json.loads(str(result))
272
+ data = json.dumps(data, indent=4)
273
+ return data
274
+ except (json.decoder.JSONDecodeError, TypeError):
275
+ print("The response is not in JSON format:\n")
276
+ print(result)
277
+
278
+ return {}
279
+
280
+ def invoke_pipeline_step(self, task_call, task_description, local):
281
+ if local:
282
+ with Progress(
283
+ SpinnerColumn(),
284
+ TextColumn("[progress.description]{task.description}"),
285
+ transient=False,
286
+ ) as progress:
287
+ progress.add_task(description=task_description, total=None)
288
+ ret = task_call()
289
+ else:
290
+ print(task_description)
291
+ ret = task_call()
292
+
293
+ return ret
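For orientation, a sketch (field values invented) of the element JSON that process_json_file above consumes; the field names follow what unstructured's elements_to_json writes:

    [
      {"type": "Title", "text": "Invoice no: 61356291", "metadata": {}},
      {"type": "NarrativeText", "text": "Date of issue: 09/06/2012", "metadata": {}},
      {"type": "Table", "text": "Description Qty",
       "metadata": {"text_as_html": "<table><tr><td>Description</td><td>Qty</td></tr></table>"}}
    ]

With --options tables only the text_as_html of Table entries is written to the intermediate .txt file; otherwise Title, NarrativeText and UncategorizedText entries are kept as plain text.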
requirements_haystack.txt ADDED
@@ -0,0 +1,14 @@
1
+ pypdf
2
+ python-box
3
+ typer[all]
4
+ fastapi==0.110.0
5
+ uvicorn[standard]
6
+ ollama-haystack==0.0.5
7
+ haystack-ai==2.0.0
8
+ weaviate-haystack==1.0.2
9
+ ollama==0.1.7
10
+ python-multipart
11
+ sentence-transformers
12
+
13
+ # Force reinstall:
14
+ # pip install --force-reinstall -r requirements_haystack.txt
requirements_instructor.txt ADDED
@@ -0,0 +1,16 @@
1
+ ollama==0.2.1
2
+ python-multipart
3
+ yfinance==0.2.40
4
+ instructor==1.3.5
5
+ python-box
6
+ PyYAML
7
+ rich
8
+ typer[all]
9
+ fastapi==0.111.1
10
+ uvicorn[standard]
11
+ sparrow-parse==0.3.2
12
+ numpy==1.26.4
13
+
14
+
15
+ # Force reinstall:
16
+ # pip install --force-reinstall -r requirements_instructor.txt
requirements_llamaindex.txt ADDED
@@ -0,0 +1,27 @@
1
+ llama-index==0.10.23
2
+ llama-index-core==0.10.23.post1
3
+ llama-index-embeddings-langchain==0.1.2
4
+ llama-index-llms-ollama==0.1.2
5
+ llama-index-vector-stores-weaviate==0.1.4
6
+ llama-index-multi-modal-llms-ollama==0.1.3
7
+ llama-index-readers-file==0.1.12
8
+ llama-index-embeddings-huggingface==0.1.4
9
+ llama-index-vector-stores-qdrant==0.1.4
10
+ llama-index-embeddings-clip==0.1.4
11
+ sentence-transformers
12
+ weaviate-client==3.26.2
13
+ pypdf
14
+ python-box
15
+ typer[all]
16
+ fastapi==0.110.0
17
+ uvicorn[standard]
18
+ ollama==0.1.7
19
+ python-multipart
20
+
21
+
22
+ # LlamaIndex upgrade:
23
+ # pip uninstall llama-index
24
+ # pip install llama-index --upgrade --no-cache-dir --force-reinstall
25
+
26
+ # Force reinstall:
27
+ # pip install --force-reinstall -r requirements_llamaindex.txt
requirements_sparrow_parse.txt ADDED
@@ -0,0 +1,13 @@
1
+ python-multipart
2
+ rich
3
+ typer[all]
4
+ fastapi==0.115.0
5
+ uvicorn[standard]
6
+ sparrow-parse==0.3.4
7
+ genson==1.3.0
8
+ jsonschema==4.23.0
9
+ python-dotenv
10
+
11
+
12
+ # Force reinstall:
13
+ # pip install --force-reinstall -r requirements_sparrow_parse.txt
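Since the sparrow-parse agent reads the token via os.getenv('HF_TOKEN') and python-dotenv is listed above, the token can be exported in the shell or kept in a local .env file; a sketch with a placeholder value:

    # .env (never commit a real token)
    # HF_TOKEN=hf_xxx

    from dotenv import load_dotenv
    import os

    load_dotenv()  # loads .env entries into the process environment
    assert os.getenv('HF_TOKEN'), "HF_TOKEN is required for the Hugging Face backend"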
requirements_unstructured.txt ADDED
@@ -0,0 +1,19 @@
1
+ unstructured[all-docs]==0.13.3
2
+ unstructured-inference==0.7.27
3
+ langchain==0.1.16
4
+ langchain-community==0.0.34
5
+ langchain-core==0.1.45
6
+ chromadb
7
+ sentence_transformers
8
+ python-box
9
+ rich
10
+ typer[all]
11
+ fastapi==0.110.2
12
+ uvicorn[standard]
13
+ ollama==0.1.8
14
+ python-multipart
15
+ weaviate-client==4.5.5
16
+
17
+
18
+ # Force reinstall:
19
+ # pip install --force-reinstall -r requirements_unstructured.txt
sample_prompts.txt ADDED
@@ -0,0 +1,390 @@
1
+ ./sparrow.sh "invoice_number, invoice_date, client_name, client_address, client_tax_id, seller_name, seller_address,
2
+ seller_tax_id, iban, names_of_invoice_items, gross_worth_of_invoice_items, total_gross_worth" "int, str, str, str, str,
3
+ str, str, str, str, List[str], List[str], str" --agent llamaindex --index-name Sparrow_llamaindex_doc1
4
+
5
+
6
+ {
7
+ "invoice_number": 61356291,
8
+ "invoice_date": "09/06/2012",
9
+ "client_name": "Rodriguez-Stevens",
10
+ "client_address": "2280 Angela Plain, Hortonshire, MS 93248",
11
+ "client_tax_id": "939-98-8477",
12
+ "seller_name": "Chapman, Kim and Green",
13
+ "seller_address": "64731 James Branch, Smithmouth, NC 26872",
14
+ "seller_tax_id": "949-84-9105",
15
+ "iban": "GB50ACIE59715038217063",
16
+ "names_of_invoice_items": [
17
+ "Wine Glasses Goblets Pair Clear Glass",
18
+ "With Hooks Stemware Storage Multiple Uses Iron Wine Rack Hanging Glass",
19
+ "Replacement Corkscrew Parts Spiral Worm Wine Opener Bottle Houdini",
20
+ "HOME ESSENTIALS GRADIENT STEMLESS WINE GLASSES SET OF 4 20 FL OZ (591 ml) NEW"
21
+ ],
22
+ "gross_worth_of_invoice_items": [
23
+ 66.0,
24
+ 123.55,
25
+ 8.25,
26
+ 14.29
27
+ ],
28
+ "total_gross_worth": "$212,09"
29
+ }
30
+ ==================================================
31
+ Time to retrieve answer: 63.74948522399791
32
+
33
+
34
+ ./sparrow.sh "invoice_number, invoice_date" "int, str" --agent llamaindex --index-name Sparrow_llamaindex_doc1
35
+
36
+ {
37
+ "invoice_number": 61356291,
38
+ "invoice_date": "09/06/2012"
39
+ }
40
+ ==================================================
41
+ Time to retrieve answer: 15.325319556002796
42
+
43
+
44
+ ./sparrow.sh "gross_worth_of_invoice_items" "List[float]" --agent llamaindex --index-name Sparrow_llamaindex_doc1
45
+
46
+ {
47
+ "gross_worth_of_invoice_items": [
48
+ 66.0,
49
+ 123.55,
50
+ 8.25,
51
+ 14.29
52
+ ]
53
+ }
54
+ ==================================================
55
+ Time to retrieve answer: 17.55766561099881
56
+
57
+
58
+ ./sparrow.sh "guest_no, cashier_name" "int, str" --agent vllamaindex --file-path
59
+ /Users/andrejb/infra/shared/katana-git/sparrow/sparrow-ml/llm/data/inout-20211211_001.jpg
60
+
61
+ {
62
+ "guest_no": 49,
63
+ "cashier_name": "Cashier Name"
64
+ }
65
+
66
+
67
+
68
+ ./sparrow.sh "store_name, receipt_id, receipt_item_names, receipt_item_prices, receipt_date, receipt_store_id,
69
+ receipt_sold, receipt_returned, receipt_total" "str, str, List[str], List[str], str, int, int,
70
+ int, str" --agent vprocessor --file-path /Users/andrejb/infra/shared/katana-git/sparrow/sparrow-ml/llm/data/ross-20211211_010.jpg
71
+
72
+ {
73
+ "store_name": "Ross",
74
+ "receipt_id": "Receipt # 0421-01-1602-1330-0",
75
+ "receipt_item_names": [
76
+ "400226513665 x hanes b1ue 4pk",
77
+ "400239602790 fruit premium 4pk"
78
+ ],
79
+ "receipt_item_prices": [
80
+ "$9.99R",
81
+ "$12.99R"
82
+ ],
83
+ "receipt_date": "11/26/21 10:35:05 AM",
84
+ "receipt_store_id": 421,
85
+ "receipt_sold": 2,
86
+ "receipt_returned": 0,
87
+ "receipt_total": "$25.33"
88
+ }
89
+ ==================================================
90
+ Time to retrieve answer: 106.27733000399894
91
+
92
+
93
+ ./sparrow.sh assistant --agent "fcall" --query "Exxon"
94
+
95
+ {
96
+ "company": "ExxonMobil",
97
+ "ticker": "XOM"
98
+ }
99
+ The stock price of the ExxonMobil is 113.48999786376953. USD
100
+ ==================================================
101
+ Time to retrieve answer: 16.426633964991197
102
+
103
+
104
+ ./sparrow.sh "invoice_number, invoice_date, total_gross_worth" "int, str, str" --agent unstructured-light
105
+ --file-path /Users/andrejb/infra/shared/katana-git/sparrow/sparrow-ml/llm/data/invoice_1.pdf
106
+
107
+ {
108
+ "invoice_number": 61356291,
109
+ "invoice_date": "09/06/2012",
110
+ "total_gross_worth": "$ 212,09"
111
+ }
112
+ ==================================================
113
+ Time to retrieve answer: 93.95840702600253
114
+
115
+
116
+ ./sparrow.sh "names_of_invoice_items, gross_worth_of_invoice_items, total_gross_worth" "List[str], List[str], str"
117
+ --agent unstructured-light --file-path /Users/andrejb/infra/shared/katana-git/sparrow/sparrow-ml/llm/data/invoice_1.pdf
118
+ --options tables
119
+
120
+ {
121
+ "names_of_invoice_items": [
122
+ "Wine Glasses Goblets Pair Clear Glass",
123
+ "With Hooks Stemware Storage Multiple Uses Iron Wine Rack Hanging Glass",
124
+ "Replacement Corkscrew Parts Spiral Worm Wine Opener Bottle Houdini",
125
+ "HOME ESSENTIALS GRADIENT STEMLESS WINE GLASSES SET OF 4 20 FL OZ (591 ml) NEW"
126
+ ],
127
+ "gross_worth_of_invoice_items": [
128
+ "$66.00",
129
+ "$123.55",
130
+ "$8.25",
131
+ "$14.29"
132
+ ],
133
+ "total_gross_worth": "$212.09"
134
+ }
135
+ ==================================================
136
+ Time to retrieve answer: 109.55890596199606
137
+
138
+
139
+ ./sparrow.sh "invoice_number, invoice_date, client_name, client_address, client_tax_id, seller_name, seller_address,
140
+ seller_tax_id, iban, names_of_invoice_items, gross_worth_of_invoice_items, total_gross_worth" "int, str, str, str, str,
141
+ str, str, str, str, List[str], List[str], str" --agent unstructured --file-path /Users/andrejb/infra/shared/katana-git/sparrow/sparrow-ml/llm/data/invoice_1.pdf
142
+
143
+ {
144
+ "invoice_number": 61356291,
145
+ "invoice_date": "09/06/2012",
146
+ "client_name": "Rodriguez-Stevens",
147
+ "client_address": "2280 Angela Plain Hortonshire, MS 93248",
148
+ "client_tax_id": "939-98-8477",
149
+ "seller_name": "Chapman, Kim and Green",
150
+ "seller_address": "64731 James Branch Smithmouth, NC 26872",
151
+ "seller_tax_id": "949-84-9105",
152
+ "iban": "GB50ACIE59715038217063",
153
+ "names_of_invoice_items": [
154
+ "Wine Glasses Goblets Pair Clear Glass",
155
+ "With Hooks Stemware Storage Multiple Uses Iron Wine Rack Hanging Glass",
156
+ "Replacement Corkscrew Parts Spiral Worm Wine Opener Bottle Houdini",
157
+ "HOME ESSENTIALS GRADIENT STEMLESS WINE GLASSES SET OF 4 20 FL OZ (591 ml) NEW"
158
+ ],
159
+ "gross_worth_of_invoice_items": [
160
+ "6,00",
161
+ "123,55",
162
+ "8,25",
163
+ "14,29"
164
+ ],
165
+ "total_gross_worth": "$ 192,81"
166
+ }
167
+ ==================================================
168
+ Time to retrieve answer: 85.94320003400207
169
+
170
+ ./sparrow.sh "invoice_number, invoice_date, total_gross_worth" "int, str, str" --agent unstructured --file-path
171
+ /Users/andrejb/infra/shared/katana-git/sparrow/sparrow-ml/llm/data/invoice_1.pdf
172
+
173
+ {
174
+ "invoice_number": 61356291,
175
+ "invoice_date": "09/06/2012",
176
+ "total_gross_worth": "$ 212,09"
177
+ }
178
+ ==================================================
179
+ Time to retrieve answer: 24.074920559010934
180
+
181
+
182
+ ./sparrow.sh "store_name, receipt_id, receipt_item_names, receipt_item_prices, receipt_date, receipt_store_id, receipt_sold,
183
+ receipt_returned, receipt_total" "str, str, List[str], List[str], str, int, int,
184
+ int, str" --agent unstructured --file-path /Users/andrejb/infra/shared/katana-git/sparrow/sparrow-ml/llm/data/ross-20211211_010.jpg
185
+
186
+ {
187
+ "store_name": "IT OSS DRESS FOR LESS PASADENA, CA 91107 626-351-5334 # 0421-01-1602-1330-",
188
+ "receipt_id": "0421-01-1602-1330-",
189
+ "receipt_item_names": [
190
+ "A iain an 6513665 x hanes blue 4pk 9.99R 4nbes9e05500",
191
+ "fruit premium 4pk 12:98"
192
+ ],
193
+ "receipt_item_prices": [
194
+ "$9.99",
195
+ "$12.98"
196
+ ],
197
+ "receipt_date": "11/26/21 10:35:05 AM",
198
+ "receipt_store_id": 421,
199
+ "receipt_sold": 2,
200
+ "receipt_returned": 0,
201
+ "receipt_total": "$25.00"
202
+ }
203
+
204
+ ==================================================
205
+ Time to retrieve answer: 76.49691557901679
206
+
207
+
208
+ ./sparrow.sh "store_name, receipt_id, receipt_item_names, receipt_item_prices, receipt_date, receipt_store_id, receipt_sold,
209
+ receipt_returned, receipt_total" "str, str, List[str], List[str], str, int, int, int,
210
+ str" --agent unstructured-light --file-path /Users/andrejb/infra/shared/katana-git/sparrow/sparrow-ml/llm/data/ross-20211211_010.jpg
211
+
212
+ {
213
+ "store_name": "Ross Dress for Less",
214
+ "receipt_id": "0421-01-1602-1330-0",
215
+ "receipt_item_names": [
216
+ "A iain an 6513665 x hanes blue 4pk",
217
+ "9.99R 4nbes9e05500 fruit premium 4pk"
218
+ ],
219
+ "receipt_item_prices": [
220
+ "$22.98",
221
+ "$22.98"
222
+ ],
223
+ "receipt_date": "11/26/21",
224
+ "receipt_store_id": 421,
225
+ "receipt_sold": 2,
226
+ "receipt_returned": 0,
227
+ "receipt_total": "$25"
228
+ }
229
+ ==================================================
230
+ Time to retrieve answer: 80.8209542609984
231
+
232
+
233
+ ./sparrow.sh "names_of_invoice_items, gross_worth_of_invoice_items, total_gross_worth" "List[str], List[str], str" --agent instructor --file-path /Users/andrejb/infra/s
234
+ hared/katana-git/sparrow/sparrow-ml/llm/data/invoice_1.pdf
235
+
236
+ {
237
+ "names_of_invoice_items": [
238
+ "Wine Glasses Goblets Pair Clear Glass",
239
+ "With Hooks Stemware Storage Multiple Uses Iron Wine Rack Hanging Glass",
240
+ "Replacement Corkscrew Parts Spiral Worm Wine Opener Bottle Houdini",
241
+ "HOME ESSENTIALS GRADIENT STEMLESS WINE GLASSES SET OF 4 20 FL OZ (591 ml) NEW"
242
+ ],
243
+ "gross_worth_of_invoice_items": [
244
+ "66,00",
245
+ "123,55",
246
+ "8,25",
247
+ "14,29"
248
+ ],
249
+ "total_gross_worth": "212,09"
250
+ }
251
+ ==================================================
252
+ Time to retrieve answer: 97.52105149999261
253
+
254
+
255
+ ./sparrow.sh "invoice_number, invoice_date, description, quantity, net_price, net_worth, vat, gross_worth, total_gross_worth" "str, str, List[str], List[str],
256
+ List[str], List[str], List[str], List[str], str" --agent instructor --file-path /Users/andrejb/infra/shared/katana-git/sparrow/sparrow-ml/llm/data/invoice_1.pdf
257
+ --options tables --options unstructured --group-by-rows --update-targets --debug
258
+
259
+ {
260
+ "invoice_number": "61356291",
261
+ "invoice_date": "09/06/2012",
262
+ "total_gross_worth": "212.09",
263
+ "items1": [
264
+ {
265
+ "description": "Wine Glasses Goblets Pair Clear Glass",
266
+ "quantity": "5,00",
267
+ "net_price": "12,00",
268
+ "net_worth": "60,00",
269
+ "vat": "10%",
270
+ "gross_worth": "66,00"
271
+ },
272
+ {
273
+ "description": "With Hooks Stemware Storage Multiple Uses Iron Wine Rack Hanging Glass",
274
+ "quantity": "4,00",
275
+ "net_price": "28,08",
276
+ "net_worth": "112,32",
277
+ "vat": "10%",
278
+ "gross_worth": "123,55"
279
+ },
280
+ {
281
+ "description": "Replacement Corkscrew Parts Spiral Worm Wine Opener Bottle Houdini",
282
+ "quantity": "1,00",
283
+ "net_price": "7,50",
284
+ "net_worth": "7,50",
285
+ "vat": "10%",
286
+ "gross_worth": "8,25"
287
+ },
288
+ {
289
+ "description": "HOME ESSENTIALS GRADIENT STEMLESS WINE GLASSES SET OF 4 20 FL OZ (591 ml) NEW",
290
+ "quantity": "1,00",
291
+ "net_price": "12,99",
292
+ "net_worth": "12,99",
293
+ "vat": "10%",
294
+ "gross_worth": "14,29"
295
+ }
296
+ ]
297
+ }
298
+ ==================================================
299
+ Time to retrieve answer: 24.45439903100487
300
+
301
+
302
+ ./sparrow.sh "{\"invoice_no\":\"example\", \"invoice_date\":\"example\", \"seller_name\":\"example\", \"seller_address\":\"example\",
303
+ \"seller_taxid\":\"example\", \"seller_iban\":\"example\", \"client_name\":\"example\", \"client_address\":\"example\",
304
+ \"client_taxid\":\"example\", \"invoice_items\":[{\"description\":\"example\", \"quantity\":0.00, \"net_price\":0.00,
305
+ \"net_worth\":0.00, \"vat\":\"example\", \"gross_worth\":0.00}], \"invoice_summary\":[{\"net_worth\":0.00, \"vat\":0.00, \"gross_worth\":0.00}]}"
306
+ --agent "sparrow-parse" --debug --options huggingface --options katanaml/sparrow-qwen2-vl-7b --file-path "/Users/andrejb/infra/shared/katana-git/sparrow/sparrow-ml/llm/data/invoice_1.jpg"
307
+
308
+ {
309
+ "invoice_no": "61356291",
310
+ "invoice_date": "09/06/2012",
311
+ "seller_name": "Chapman, Kim and Green",
312
+ "seller_address": "64731 James Branch, Smithmouth, NC 26872",
313
+ "seller_taxid": "949-84-9105",
314
+ "seller_iban": "GB50ACIE59715038217063",
315
+ "client_name": "Rodriguez-Stevens",
316
+ "client_address": "2280 Angela Plain, Hortonshire, MS 93248",
317
+ "client_taxid": "939-98-8477",
318
+ "invoice_items": [
319
+ {
320
+ "description": "Wine Glasses Goblets Pair Clear Glass",
321
+ "quantity": 5.0,
322
+ "net_price": 12.0,
323
+ "net_worth": 60.0,
324
+ "vat": "10%",
325
+ "gross_worth": 66.0
326
+ },
327
+ {
328
+ "description": "With Hooks Stemware Storage Multiple Uses Iron Wine Rack Hanging Glass",
329
+ "quantity": 4.0,
330
+ "net_price": 28.08,
331
+ "net_worth": 112.32,
332
+ "vat": "10%",
333
+ "gross_worth": 123.55
334
+ },
335
+ {
336
+ "description": "Replacement Corkscrew Parts Spiral Worm Wine Opener Bottle Houdini",
337
+ "quantity": 1.0,
338
+ "net_price": 7.5,
339
+ "net_worth": 7.5,
340
+ "vat": "10%",
341
+ "gross_worth": 8.25
342
+ },
343
+ {
344
+ "description": "HOME ESSENTIALS GRADIENT STEMLESS WINE GLASSES SET OF 4 20 FL OZ (591 ml) NEW",
345
+ "quantity": 1.0,
346
+ "net_price": 12.99,
347
+ "net_worth": 12.99,
348
+ "vat": "10%",
349
+ "gross_worth": 14.29
350
+ }
351
+ ],
352
+ "invoice_summary": [
353
+ {
354
+ "net_worth": 192.81,
355
+ "vat": 19.28,
356
+ "gross_worth": 212.09
357
+ }
358
+ ]
359
+ }
360
+
361
+ Time to retrieve answer: 47.84319644900097
362
+
363
+
364
+ ./sparrow.sh "[{\"instrument_name\":\"example\", \"valuation\":0}]" --agent "sparrow-parse" --debug --options huggingface
365
+ --options katanaml/sparrow-qwen2-vl-7b --file-path "/Users/andrejb/Documents/work/epik/bankstatement/bonds_table.png"
366
+
367
+ [
368
+ {
369
+ "instrument_name": "UNITS BLACKROCK FIX INC DUB FDS PLC ISHS EUR INV GRD CP BD IDX/INST/E",
370
+ "valuation": 19049
371
+ },
372
+ {
373
+ "instrument_name": "UNITS ISHARES III PLC CORE EUR GOVT BOND UCITS ETF/EUR",
374
+ "valuation": 83488
375
+ },
376
+ {
377
+ "instrument_name": "UNITS ISHARES III PLC EUR CORP BOND 1-5YR UCITS ETF/EUR",
378
+ "valuation": 213030
379
+ },
380
+ {
381
+ "instrument_name": "UNIT ISHARES VI PLC/JP MORGAN USD E BOND EUR HED UCITS ETF DIST/HDGD/",
382
+ "valuation": 32774
383
+ },
384
+ {
385
+ "instrument_name": "UNITS XTRACKERS II SICAV/EUR HY CORP BOND UCITS ETF/-1D-/DISTR.",
386
+ "valuation": 23643
387
+ }
388
+ ]
389
+
390
+ Time to retrieve answer: 22.78700271800335
sparrow.sh ADDED
@@ -0,0 +1,28 @@
1
+ #!/bin/bash
2
+
3
+ command -v python >/dev/null 2>&1 || { echo >&2 "Python is required but it's not installed. Aborting."; exit 1; }
4
+
5
+ # Check Python version
6
+ PYTHON_VERSION=$(python --version 2>&1) # Capture both stdout and stderr
7
+ echo "Detected Python version: $PYTHON_VERSION"
8
+ if [[ ! "$PYTHON_VERSION" == *"3.10.4"* ]]; then
9
+ echo "Python version 3.10.4 is required. Current version is $PYTHON_VERSION. Aborting."
10
+ exit 1
11
+ fi
12
+
13
+ PYTHON_SCRIPT_PATH="engine.py"
14
+
15
+ # Check if the "ingest" flag is passed
16
+ if [ "$1" == "ingest" ]; then
17
+ PYTHON_SCRIPT_PATH="ingest.py"
18
+ shift # Shift the arguments to exclude the first one
19
+ fi
20
+
21
+ if [ "$1" == "assistant" ]; then
22
+ PYTHON_SCRIPT_PATH="assistant.py"
23
+ shift # Shift the arguments to exclude the first one
24
+ fi
25
+
26
+ python "${PYTHON_SCRIPT_PATH}" "$@"
27
+
28
+ # make script executable with: chmod +x sparrow.sh