Spaces: levalencia/doctorecord

levalencia committed on Jun 3

Commit 2eac01a (1 parent: 1c43e67)

Revise documentation in app.py for Deep‑Research PDF Field Extractor, enhancing clarity on system architecture, core components, processing pipeline, and key features. Update usage instructions and support resources for better user guidance.

Files changed (3)

src/app.py +94 -47
src/config/__pycache__/settings.cpython-312.pyc +0 -0
src/services/__pycache__/llm_client.cpython-312.pyc +0 -0

src/app.py

@@ -62,57 +62,104 @@ page = st.sidebar.radio("Go to", ["Documentation", "Traces", "Execution"])
 # Documentation Page
 if page == "Documentation":
-    st.title("Deep‑Research PDF Field Extractor (POC)")
     st.markdown("""
-    This system uses a multi-step pipeline to extract fields from PDFs:
-    1. **Document Intelligence**: Extracts text and tables from PDFs using Azure Document Intelligence
-    2. **Semantic Indexing**: Creates searchable chunks with embeddings for semantic search
-    3. **Field Extraction**: Uses a two-step approach:
-       - First attempts page-by-page scanning for precise extraction
-       - Falls back to semantic search if no value is found
-    4. **Validation**: Ensures extracted values are correct and properly formatted
-    5. **Confidence Scoring**: Identifies which extractions need review
-    """)
-    st.markdown("""
-    ### Agent Descriptions
-    #### DocumentIntelligenceAgent
-    - Uses Azure Document Intelligence to extract text and tables
-    - Preserves document layout and structure
-    - Outputs both raw text and formatted tables as HTML
-    - Handles complex document layouts and tables
-    #### IndexAgent
-    - Creates semantic search index from extracted content
-    - Splits document into manageable chunks with metadata
-    - Generates embeddings for semantic search
-    - Provides both chunk-based and full-text search capabilities
-    - Includes chunk statistics and visualization
-    #### FieldMapper
-    - Implements a two-step field extraction strategy:
       1. Page-by-page scanning for precise extraction
       2. Semantic search fallback if no value found
-    - Uses document context to improve extraction accuracy
-    - Handles multiple potential values with confidence scoring
-    - Returns structured JSON responses with value details
-    #### SemanticReasoner
-    - Validates and cleans up candidate values
-    - Uses domain knowledge to ensure values make sense
-    - Can reformat values to standard format
-    - Returns `<unresolved>` if value is wrong/missing
-    #### ConfidenceScorer
-    - Assigns confidence score (0-1) to each extraction
-    - Helps identify which extractions need review
-    - Can trigger follow-up queries when confidence is low
-    #### QueryGenerator
-    - Generates follow-up questions when confidence is low
-    - Creates concise questions (≤12 words)
-    - Helps guide system to better extractions
     """)
 # Traces Page

 # Documentation Page
 if page == "Documentation":
+    st.title("Deep‑Research PDF Field Extractor")
     st.markdown("""
+    ## Overview
+    This system uses a multi-agent architecture to extract fields from PDFs with high accuracy and reliability.
+    ### Core Components
+    1. **Planner**
+       - Generates execution plans using Azure OpenAI
+       - Determines optimal extraction strategy
+       - Manages task dependencies
+    2. **Executor**
+       - Executes the generated plan
+       - Manages agent execution flow
+       - Handles context and result management
+    3. **Agents**
+       - `TableAgent`: Extracts text and tables using Azure Document Intelligence
+       - `FieldMapper`: Maps fields to values using extracted content
+       - `ForEachField`: Controls field iteration flow
+    ### Processing Pipeline
+    1. **Document Processing**
+       - Text and table extraction using Azure Document Intelligence
+       - Layout and structure preservation
+       - Support for complex document formats
+    2. **Field Extraction**
+       - Document type inference
+       - User profile determination
+       - Page-by-page scanning
+       - Value extraction and validation
+    3. **Context Building**
+       - Document metadata
+       - Field descriptions
+       - User context
+       - Execution history
+    ### Key Features
+    #### Smart Field Extraction
+    - Two-step extraction strategy:
       1. Page-by-page scanning for precise extraction
       2. Semantic search fallback if no value found
+    - Basic context awareness for improved extraction
+    - Support for tabular data extraction
+    #### Document Intelligence
+    - Azure Document Intelligence integration
+    - Layout and structure preservation
+    - Table extraction and formatting
+    - Complex document handling
+    #### Execution Monitoring
+    - Detailed execution traces
+    - Success/failure status
+    - Comprehensive logging
+    - Result storage and retrieval
+    ### Technical Requirements
+    - Azure OpenAI API key
+    - Azure Document Intelligence endpoint
+    - Python 3.9 or higher
+    - Required Python packages (see requirements.txt)
+    ### Getting Started
+    1. **Upload Your PDF**
+       - Click the "Upload PDF" button
+       - Select your PDF file
+    2. **Specify Fields**
+       - Enter comma-separated field names
+       - Example: `Date, Name, Value, Location`
+    3. **Optional: Add Field Descriptions**
+       - Provide YAML-formatted field descriptions
+       - Helps improve extraction accuracy
+    4. **Run Extraction**
+       - Click "Run extraction"
+       - Monitor progress in execution trace
+       - View results in table format
+    5. **Download Results**
+       - Export as CSV
+       - View detailed execution logs
+    ### Support
+    For detailed technical documentation, please refer to:
+    - [Architecture Overview](ARCHITECTURE.md)
+    - [Developer Documentation](DEVELOPER.md)
     """)
 # Traces Page

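Both the old and new documentation strings in this diff describe a two-step extraction strategy: page-by-page scanning first, then a semantic search fallback if no value is found. The control flow can be sketched roughly as below. This is an illustration only, not the code in this repository: `extract_field`, `toy_scan`, and `toy_search` are hypothetical names, and the scanner and search function are stand-in callables rather than the real agents.

```python
from typing import Callable, Optional, Sequence


def extract_field(
    field: str,
    pages: Sequence[str],
    scan_page: Callable[[str, str], Optional[str]],
    semantic_search: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the first value found page by page, else the fallback result."""
    # Step 1: page-by-page scanning for precise extraction.
    for page in pages:
        value = scan_page(field, page)
        if value is not None:
            return value
    # Step 2: semantic search fallback if no page yielded a value.
    return semantic_search(field)


# Toy stand-ins (assumptions for illustration, not the project's agents).
def toy_scan(field: str, page: str) -> Optional[str]:
    # Look for a line of the form "<field>: <value>" on the page.
    prefix = f"{field}:"
    for line in page.splitlines():
        if line.strip().startswith(prefix):
            return line.split(":", 1)[1].strip()
    return None


def toy_search(field: str) -> Optional[str]:
    # Pretend index that only knows a single field.
    return {"Location": "Site A"}.get(field)
```

Passing the scanner and the search function in as parameters keeps the fallback order explicit and lets each step be swapped or tested separately; whether the actual `FieldMapper` is structured this way is not shown in the diff.
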
src/config/__pycache__/settings.cpython-312.pyc

Binary files a/src/config/__pycache__/settings.cpython-312.pyc and b/src/config/__pycache__/settings.cpython-312.pyc differ

src/services/__pycache__/llm_client.cpython-312.pyc

Binary files a/src/services/__pycache__/llm_client.cpython-312.pyc and b/src/services/__pycache__/llm_client.cpython-312.pyc differ
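
The "Specify Fields" step added to src/app.py asks for comma-separated field names (for example `Date, Name, Value, Location`). One way such input might be normalized before extraction is sketched below. The function name `parse_field_names` is hypothetical, and trimming whitespace, skipping empty entries, and dropping duplicates are assumptions; the diff does not show how the app actually parses this input.

```python
def parse_field_names(raw: str) -> list[str]:
    """Split a comma-separated string into unique, trimmed field names."""
    seen = set()
    fields = []
    for part in raw.split(","):
        name = part.strip()
        # Skip empty entries (e.g. from ",,") and repeated names.
        if name and name not in seen:
            seen.add(name)
            fields.append(name)
    return fields
```

The `list[str]` annotation relies on Python 3.9 or higher, which matches the technical requirements listed in the new documentation text.
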