levalencia committed
Commit 2eac01a · 1 Parent(s): 1c43e67

Revise documentation in app.py for Deep‑Research PDF Field Extractor, enhancing clarity on system architecture, core components, processing pipeline, and key features. Update usage instructions and support resources for better user guidance.

src/app.py CHANGED
@@ -62,57 +62,104 @@ page = st.sidebar.radio("Go to", ["Documentation", "Traces", "Execution"])
 
 # Documentation Page
 if page == "Documentation":
-    st.title("Deep‑Research PDF Field Extractor (POC)")
+    st.title("Deep‑Research PDF Field Extractor")
 
     st.markdown("""
-    This system uses a multi-step pipeline to extract fields from PDFs:
-    1. **Document Intelligence**: Extracts text and tables from PDFs using Azure Document Intelligence
-    2. **Semantic Indexing**: Creates searchable chunks with embeddings for semantic search
-    3. **Field Extraction**: Uses a two-step approach:
-    - First attempts page-by-page scanning for precise extraction
-    - Falls back to semantic search if no value is found
-    4. **Validation**: Ensures extracted values are correct and properly formatted
-    5. **Confidence Scoring**: Identifies which extractions need review
-    """)
-
-    st.markdown("""
-    ### Agent Descriptions
-    #### DocumentIntelligenceAgent
-    - Uses Azure Document Intelligence to extract text and tables
-    - Preserves document layout and structure
-    - Outputs both raw text and formatted tables as HTML
-    - Handles complex document layouts and tables
-
-    #### IndexAgent
-    - Creates semantic search index from extracted content
-    - Splits document into manageable chunks with metadata
-    - Generates embeddings for semantic search
-    - Provides both chunk-based and full-text search capabilities
-    - Includes chunk statistics and visualization
-
-    #### FieldMapper
-    - Implements a two-step field extraction strategy:
+    ## Overview
+    This system uses a multi-agent architecture to extract fields from PDFs with high accuracy and reliability.
+
+    ### Core Components
+
+    1. **Planner**
+    - Generates execution plans using Azure OpenAI
+    - Determines optimal extraction strategy
+    - Manages task dependencies
+
+    2. **Executor**
+    - Executes the generated plan
+    - Manages agent execution flow
+    - Handles context and result management
+
+    3. **Agents**
+    - `TableAgent`: Extracts text and tables using Azure Document Intelligence
+    - `FieldMapper`: Maps fields to values using extracted content
+    - `ForEachField`: Controls field iteration flow
+
+    ### Processing Pipeline
+
+    1. **Document Processing**
+    - Text and table extraction using Azure Document Intelligence
+    - Layout and structure preservation
+    - Support for complex document formats
+
+    2. **Field Extraction**
+    - Document type inference
+    - User profile determination
+    - Page-by-page scanning
+    - Value extraction and validation
+
+    3. **Context Building**
+    - Document metadata
+    - Field descriptions
+    - User context
+    - Execution history
+
+    ### Key Features
+
+    #### Smart Field Extraction
+    - Two-step extraction strategy:
     1. Page-by-page scanning for precise extraction
     2. Semantic search fallback if no value found
-    - Uses document context to improve extraction accuracy
-    - Handles multiple potential values with confidence scoring
-    - Returns structured JSON responses with value details
-
-    #### SemanticReasoner
-    - Validates and cleans up candidate values
-    - Uses domain knowledge to ensure values make sense
-    - Can reformat values to standard format
-    - Returns `<unresolved>` if value is wrong/missing
-
-    #### ConfidenceScorer
-    - Assigns confidence score (0-1) to each extraction
-    - Helps identify which extractions need review
-    - Can trigger follow-up queries when confidence is low
-
-    #### QueryGenerator
-    - Generates follow-up questions when confidence is low
-    - Creates concise questions (≤12 words)
-    - Helps guide system to better extractions
+    - Basic context awareness for improved extraction
+    - Support for tabular data extraction
+
+    #### Document Intelligence
+    - Azure Document Intelligence integration
+    - Layout and structure preservation
+    - Table extraction and formatting
+    - Complex document handling
+
+    #### Execution Monitoring
+    - Detailed execution traces
+    - Success/failure status
+    - Comprehensive logging
+    - Result storage and retrieval
+
+    ### Technical Requirements
+
+    - Azure OpenAI API key
+    - Azure Document Intelligence endpoint
+    - Python 3.9 or higher
+    - Required Python packages (see requirements.txt)
+
+    ### Getting Started
+
+    1. **Upload Your PDF**
+    - Click the "Upload PDF" button
+    - Select your PDF file
+
+    2. **Specify Fields**
+    - Enter comma-separated field names
+    - Example: `Date, Name, Value, Location`
+
+    3. **Optional: Add Field Descriptions**
+    - Provide YAML-formatted field descriptions
+    - Helps improve extraction accuracy
+
+    4. **Run Extraction**
+    - Click "Run extraction"
+    - Monitor progress in execution trace
+    - View results in table format
+
+    5. **Download Results**
+    - Export as CSV
+    - View detailed execution logs
+
+    ### Support
+
+    For detailed technical documentation, please refer to:
+    - [Architecture Overview](ARCHITECTURE.md)
+    - [Developer Documentation](DEVELOPER.md)
     """)
 
  # Traces Page
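
The two-step extraction strategy that the revised documentation keeps (page-by-page scanning first, semantic search only as a fallback) can be illustrated with a small sketch. This is not the repository's `FieldMapper` code: `extract_field`, `label_scan`, and the dictionary-backed `fallback` are hypothetical names introduced here, and the regex scanner and lookup table merely stand in for the LLM-based page scan and the embedding search.

```python
import re
from typing import Callable, Optional, Sequence, Tuple


def extract_field(
    field: str,
    pages: Sequence[str],
    scan_page: Callable[[str, str], Optional[str]],
    semantic_search: Callable[[str], Optional[str]],
) -> Tuple[Optional[str], str]:
    """Return (value, strategy), trying the page scan before the fallback."""
    # Step 1: scan pages in order and stop at the first hit.
    for page_text in pages:
        value = scan_page(field, page_text)
        if value is not None:
            return value, "page_scan"
    # Step 2: only reached when no page produced a value.
    value = semantic_search(field)
    if value is not None:
        return value, "semantic_search"
    return None, "unresolved"


def label_scan(field: str, page_text: str) -> Optional[str]:
    """Hypothetical scanner: matches lines shaped like 'Field: value'."""
    pattern = rf"^{re.escape(field)}\s*:\s*(.+)$"
    match = re.search(pattern, page_text, flags=re.IGNORECASE | re.MULTILINE)
    return match.group(1).strip() if match else None


# Toy inputs; a real run would use text returned by document analysis.
pages = ["Invoice\nDate: 2024-01-31", "Name: Ada Lovelace"]
fallback = {"Location": "London"}.get  # stand-in for a semantic index lookup

print(extract_field("Name", pages, label_scan, fallback))
print(extract_field("Location", pages, label_scan, fallback))
print(extract_field("Total", pages, label_scan, fallback))
```

Returning the strategy name alongside the value is one way to make the execution trace show which step produced each result; whether the actual application records this is not stated in the diff.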
src/config/__pycache__/settings.cpython-312.pyc CHANGED
Binary files a/src/config/__pycache__/settings.cpython-312.pyc and b/src/config/__pycache__/settings.cpython-312.pyc differ
 
src/services/__pycache__/llm_client.cpython-312.pyc CHANGED
Binary files a/src/services/__pycache__/llm_client.cpython-312.pyc and b/src/services/__pycache__/llm_client.cpython-312.pyc differ
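
The Planner/Executor split described in the new app.py documentation can be pictured with a minimal executor loop. The sketch below is an assumption-laden illustration, not the project's implementation: the plan format (a list of dicts with a `tool` key and an optional `loop`), the `run_plan` function, and both stand-in agents are hypothetical, and the plan is hard-coded here rather than generated by Azure OpenAI.

```python
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]
Agent = Callable[[Context], Any]


def run_plan(plan: List[Dict[str, Any]], agents: Dict[str, Agent], ctx: Context) -> List[Dict[str, Any]]:
    """Execute plan steps in order and return a trace with success/failure status."""
    trace: List[Dict[str, Any]] = []
    for step in plan:
        tool = step["tool"]
        if tool == "ForEachField":
            # Control-flow step: run the nested steps once per requested field.
            for field in ctx["fields"]:
                ctx["current_field"] = field
                trace.extend(run_plan(step["loop"], agents, ctx))
            continue
        entry = {"tool": tool, "field": ctx.get("current_field")}
        try:
            entry.update(ok=True, result=agents[tool](ctx))
        except Exception as exc:  # record the failure and keep going
            entry.update(ok=False, error=str(exc))
        trace.append(entry)
    return trace


def table_agent(ctx: Context) -> str:
    """Stand-in for document analysis: copies pre-extracted text into the context."""
    ctx["text"] = ctx["pdf_text"]
    return f"{len(ctx['text'])} characters"


def field_mapper(ctx: Context) -> str:
    """Stand-in mapper: looks for a 'Field: value' line in the extracted text."""
    field = ctx["current_field"]
    for line in ctx["text"].splitlines():
        label, sep, value = line.partition(":")
        if sep and label.strip().lower() == field.lower():
            ctx["results"][field] = value.strip()
            return value.strip()
    raise LookupError(f"no value found for {field!r}")


agents = {"TableAgent": table_agent, "FieldMapper": field_mapper}
plan = [
    {"tool": "TableAgent"},
    {"tool": "ForEachField", "loop": [{"tool": "FieldMapper"}]},
]
ctx = {
    "pdf_text": "Date: 2024-01-31\nName: Ada Lovelace",
    "fields": ["Date", "Location"],
    "results": {},
}
trace = run_plan(plan, agents, ctx)
print(ctx["results"])
print([(t["tool"], t["field"], t["ok"]) for t in trace])
```

Keeping the shared context as a plain dictionary and appending one trace entry per agent call mirrors the "context and result management" and "success/failure status" bullets, but the real data structures may differ.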