
Deep-Research PDF Field Extractor

A tool for extracting structured data from PDF documents, designed to handle a variety of document types and pull out the specific fields you ask for.

For End Users

Overview

The PDF Field Extractor pulls specific information out of PDF documents. It extracts whichever fields you specify, such as dates, names, values, or locations, and it is particularly useful for converting unstructured PDF content into structured, analyzable formats.

How to Use

  1. Upload Your PDF

    • Click the "Upload PDF" button
    • Select your PDF file from your computer
  2. Specify Fields to Extract

    • Enter the fields you want to extract, separated by commas
    • Example: Date, Name, Value, Location, Page, FileName
  3. Optional: Add Field Descriptions

    • You can provide additional context about the fields
    • This helps the system better understand what to look for
  4. Run Extraction

    • Click the "Run extraction" button
    • Wait for the process to complete
    • View your results in a table format
  5. Download Results

    • Download your extracted data as a CSV file
    • View execution traces and logs if needed

Features

  • Automatic document type detection
  • Smart field extraction
  • Support for tables and text
  • Detailed execution traces
  • Downloadable results and logs

For Developers

Architecture Overview

The application is built using a multi-agent architecture with the following components:

Core Components

  1. Planner (orchestrator/planner.py)

    • Generates execution plans using Azure OpenAI
  2. Executor (orchestrator/executor.py)

    • Executes the generated plan
    • Manages agent execution flow
    • Handles context and result management
  3. Agents

    • PDFAgent: Extracts text from PDFs
    • TableAgent: Extracts tables from PDFs
    • FieldMapper: Maps fields to values
    • ForEachField: Control flow for field iteration
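
To make the relationship between these components concrete, here is a minimal sketch of a plan format and executor loop. The step schema, the agent registry, and the `run` signature are assumptions for illustration; the real interfaces live in orchestrator/planner.py and orchestrator/executor.py.

```python
# Hypothetical plan: an ordered list of steps, each naming an agent and its
# arguments. The schema produced by orchestrator/planner.py may differ.
plan = [
    {"agent": "PDFAgent", "args": {"pdf_path": "input.pdf"}},
    {"agent": "TableAgent", "args": {"pdf_path": "input.pdf"}},
    {"agent": "FieldMapper", "args": {"fields": ["Date", "Name", "Value"]}},
]

def run_plan(plan: list[dict], registry: dict, context: dict) -> dict:
    """Minimal executor loop: run each step and store its result in a shared context."""
    for step in plan:
        agent = registry[step["agent"]]          # map agent name -> agent instance
        result = agent.run(context=context, **step["args"])
        context[step["agent"]] = result          # later steps can read earlier results
    return context
```

Keeping every intermediate result in a single context dict is what lets later agents (for example the FieldMapper) reuse the text and tables extracted earlier.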

Agent Pipeline

  1. Document Processing (a PyMuPDF sketch follows this list)

    # Document is processed in stages:
    1. PDF text extraction
    2. Table extraction
    3. Field mapping
    
  2. Field Extraction Process

    • Document type inference
    • User profile determination
    • Page-by-page scanning
    • Value extraction and validation
  3. Context Building

    • Document metadata
    • Field descriptions
    • User context
    • Execution history
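
To ground the document-processing stage, below is a minimal page-level text-extraction sketch using PyMuPDF (the `fitz` module listed under dependencies). The function name and the per-page dict shape are illustrative assumptions; the PDFAgent in this repository may organize its output differently.

```python
import fitz  # PyMuPDF

def extract_pages(pdf_path: str) -> list[dict]:
    """Return the plain text of each page, keeping page numbers for later field mapping."""
    pages = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            pages.append({"page": page_number, "text": page.get_text()})
    return pages
```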

Key Features

Document Type Inference

The system automatically infers the document type and the intended user profile:

# Example inference:
"Document type: Analytical report
User profile: Data analysts or researchers working with document analysis"
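
An inference like the one above could come from a single chat-completion call. The sketch below uses the `openai` package's `AzureOpenAI` client; the deployment name, API version, and environment-variable names are placeholders, and the real prompt is defined in config/prompts.yaml.

```python
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],      # placeholder variable names
    api_version="2024-02-01",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)

def infer_document_type(first_page_text: str) -> str:
    """Ask the model to describe the document type and the likely user profile."""
    response = client.chat.completions.create(
        model="gpt-4o",  # deployment name is an assumption
        messages=[
            {"role": "system", "content": "State the document type and the likely user profile."},
            {"role": "user", "content": first_page_text[:4000]},
        ],
    )
    return response.choices[0].message.content
```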

Field Mapping

The FieldMapper agent resolves each field in four steps:

  1. Document context analysis
  2. Page-by-page scanning
  3. Value extraction using LLM
  4. Result validation
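
Steps 2 through 4 amount to a scan-until-found loop over the extracted pages. The sketch below assumes a caller-supplied `ask_llm_for_field` callable and the per-page structure from the PyMuPDF sketch above; it is not the FieldMapper's actual code.

```python
def map_field(field: str, description: str, pages: list[dict], ask_llm_for_field) -> dict | None:
    """Scan pages in order and return the first extracted value along with its provenance."""
    for page in pages:
        value = ask_llm_for_field(field=field, description=description, page_text=page["text"])
        if value:  # simplistic validation: any non-empty answer counts as a hit
            return {"field": field, "value": value, "page": page["page"]}
    return None  # field not found on any page
```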

Execution Traces

The system maintains detailed execution traces:

  • Tool execution history
  • Success/failure status
  • Detailed logs
  • Result storage
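
As an illustration, a single trace entry could be modeled with a small dataclass like the one below; the structure actually stored by the executor may differ.

```python
from dataclasses import dataclass, field

@dataclass
class TraceEntry:
    """One tool/agent execution, as shown in the UI's trace view."""
    tool: str                                  # e.g. "FieldMapper"
    success: bool                              # success/failure status
    logs: list[str] = field(default_factory=list)
    result: object = None                      # whatever the agent returned
```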

Technical Setup

  1. Dependencies

    # Key dependencies:
    - streamlit
    - pandas
    - PyMuPDF (fitz)
    - Azure OpenAI
    - Azure Document Intelligence
    
  2. Configuration

    • Environment variables for API keys
    • Prompt templates in config/prompts.yaml
    • Settings in config/settings.py
  3. Logging System

    # Custom logging setup:
    - LogCaptureHandler for UI display
    - Structured logging format
    - Execution history storage
    
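As a rough sketch, a `LogCaptureHandler` of this kind can be built on the standard `logging.Handler`, buffering formatted records in memory so the Streamlit UI can render them; the handler in this repository may differ in its details.

```python
import logging

class LogCaptureHandler(logging.Handler):
    """Collects formatted log records in memory for display in the UI."""

    def __init__(self) -> None:
        super().__init__()
        self.records: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(self.format(record))

# Usage sketch: attach the handler, then render handler.records in Streamlit.
handler = LogCaptureHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
logging.getLogger().addHandler(handler)
```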

Development Guidelines

  1. Adding New Agents

    • Inherit from the base agent class
    • Implement the required methods
    • Register the agent in the planner configuration (see the sketch after this list)
  2. Modifying Extraction Logic

    • Update prompt templates
    • Modify field mapping logic
    • Adjust validation rules
  3. Extending Functionality

    • Add new field types
    • Implement custom validators
    • Create new output formats
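
A minimal sketch of guideline 1, assuming a base class with a `run(context, **kwargs)` method and using a hypothetical `KeywordAgent` as the new agent; the repository's actual base class and registration mechanism may differ.

```python
class BaseAgent:
    """Stand-in for the repository's base agent class."""
    name: str = "BaseAgent"

    def run(self, context: dict, **kwargs) -> object:
        raise NotImplementedError


class KeywordAgent(BaseAgent):
    """Hypothetical agent that reports which pages mention a given keyword."""
    name = "KeywordAgent"

    def run(self, context: dict, keyword: str, **kwargs) -> list[int]:
        pages = context.get("PDFAgent", [])      # reuse the earlier extraction results
        return [p["page"] for p in pages if keyword.lower() in p["text"].lower()]
```

The remaining step would be registering the new agent wherever the planner resolves agent names, so that generated plans can refer to it.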

Testing

  • Unit tests for agents (see the example after this list)
  • Integration tests for pipeline
  • End-to-end testing with sample PDFs
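
As an example of an agent-level unit test, the pytest sketch below exercises the `map_field` function from the Field Mapping sketch above with a stubbed LLM call; the repository's real tests and fixtures may look different.

```python
# test_field_mapper_sketch.py -- assumes the map_field sketch above is importable
def fake_llm(field, description, page_text):
    return "2024-01-01" if "Date:" in page_text else ""

def test_map_field_returns_first_match():
    pages = [
        {"page": 1, "text": "Cover page"},
        {"page": 2, "text": "Date: 2024-01-01"},
    ]
    result = map_field("Date", "Report date", pages, fake_llm)
    assert result == {"field": "Date", "value": "2024-01-01", "page": 2}
```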

Deployment

  • Streamlit app deployment
  • Environment configuration
  • API key management
  • Logging setup

Future Improvements

  • Enhanced error handling
  • Additional field types
  • Improved validation
  • Performance optimization
  • Extended documentation