Deep-Research PDF Field Extractor
A powerful tool for extracting structured data from PDF documents, designed to handle various document types and extract specific fields of interest.
For End Users
Overview
The PDF Field Extractor helps you extract specific information from PDF documents. It can extract any fields you specify, such as dates, names, values, locations, and more. The tool is particularly useful for converting unstructured PDF data into structured, analyzable formats.
How to Use
Upload Your PDF
- Click the "Upload PDF" button
- Select your PDF file from your computer
Specify Fields to Extract
- Enter the fields you want to extract, separated by commas
- Example:
Date, Name, Value, Location, Page, FileName
Optional: Add Field Descriptions
- You can provide additional context about the fields
- This helps the system better understand what to look for
Run Extraction
- Click the "Run extraction" button
- Wait for the process to complete
- View your results in a table format
Download Results
- Download your extracted data as a CSV file
- View execution traces and logs if needed
Features
- Automatic document type detection
- Smart field extraction
- Support for tables and text
- Detailed execution traces
- Downloadable results and logs
For Developers
Architecture Overview
The application is built using a multi-agent architecture with the following components:
Core Components
Planner (orchestrator/planner.py)
- Generates execution plans using Azure OpenAI
Executor (orchestrator/executor.py)
- Executes the generated plan
- Manages agent execution flow
- Handles context and result management
Agents
PDFAgent: Extracts text from PDFs
TableAgent: Extracts tables from PDFs
FieldMapper: Maps fields to values
ForEachField: Control flow for field iteration
Agent Pipeline
Document Processing
The document is processed in stages:
1. PDF text extraction
2. Table extraction
3. Field mapping
Field Extraction Process
- Document type inference
- User profile determination
- Page-by-page scanning
- Value extraction and validation
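The page-by-page scanning step above can be illustrated with a simple regex-based scan. This is a stand-in sketch: the real FieldMapper delegates value extraction to an LLM rather than a pattern match, and the function name and page format here are assumptions.

```python
# Illustrative page-by-page scan over a list of page texts.
import re

def scan_pages_for_field(pages, pattern):
    """Return (page_number, value) for the first page matching the pattern."""
    for page_no, text in enumerate(pages, start=1):
        m = re.search(pattern, text)
        if m:
            return page_no, m.group(0)
    return None, None

pages = ["Cover page", "Report date: 2024-05-01", "Appendix"]
page_no, value = scan_pages_for_field(pages, r"\d{4}-\d{2}-\d{2}")
```

Scanning page by page keeps each extraction prompt small and lets the result carry provenance (the page number) alongside the value.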
Context Building
- Document metadata
- Field descriptions
- User context
- Execution history
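The four context ingredients listed above might be assembled into a single payload like the following. The key names are illustrative assumptions, not the exact keys used in the codebase.

```python
# Hypothetical context payload assembled before each agent call.
def build_context(metadata, field_descriptions, user_context, history):
    return {
        "metadata": metadata,        # document metadata, e.g. page count
        "fields": field_descriptions,
        "user": user_context,        # inferred user profile
        "history": history,          # prior tool executions
    }

ctx = build_context({"pages": 12}, {"Date": "report issue date"}, "analyst", [])
```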
Key Features
Document Type Inference
The system automatically infers document type and user profile:
# Example inference:
"Document type: Analytical report
User profile: Data analysts or researchers working with document analysis"
Field Mapping
The FieldMapper agent follows a multi-step approach:
- Document context analysis
- Page-by-page scanning
- Value extraction using LLM
- Result validation
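The result-validation step could look like the toy rule check below. The rule table and function are assumptions for illustration; the actual validation logic may be richer.

```python
# Toy result validation: extracted values are checked against a simple
# per-field rule before being accepted.
import re

RULES = {"Date": r"^\d{4}-\d{2}-\d{2}$"}  # illustrative rule set

def validate(field, value):
    rule = RULES.get(field)
    return bool(re.match(rule, value)) if rule else True

ok = validate("Date", "2024-05-01")
bad = validate("Date", "May 1st")
```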
Execution Traces
The system maintains detailed execution traces:
- Tool execution history
- Success/failure status
- Detailed logs
- Result storage
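A trace entry covering the items above (tool, status, detail, timestamp) might look like this minimal sketch; the real trace format may carry more fields, and `record_trace` is a hypothetical helper.

```python
# Minimal execution-trace record: one dict per tool call.
import time

def record_trace(traces, tool, ok, detail=""):
    traces.append({
        "tool": tool,
        "status": "success" if ok else "failure",
        "detail": detail,
        "ts": time.time(),
    })

traces = []
record_trace(traces, "PDFAgent", True)
record_trace(traces, "FieldMapper", False, "no value found on any page")
```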
Technical Setup
Dependencies
Key dependencies:
- streamlit
- pandas
- PyMuPDF (fitz)
- Azure OpenAI
- Azure Document Intelligence
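These might translate into a requirements file like the following; the exact package names and any version pins are assumptions and should be checked against the project's actual requirements.txt.

```
# illustrative, unpinned requirements.txt
streamlit
pandas
PyMuPDF
openai
azure-ai-documentintelligence
```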
Configuration
- Environment variables for API keys
- Prompt templates in config/prompts.yaml
- Settings in config/settings.py
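Reading API keys from the environment could look like the sketch below. The environment-variable names are common conventions for Azure OpenAI but are assumptions here; the actual names used in config/settings.py may differ.

```python
# Sketch of loading settings from environment variables.
import os

def load_settings():
    return {
        "azure_openai_key": os.environ.get("AZURE_OPENAI_API_KEY", ""),
        "azure_openai_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT", ""),
    }

settings = load_settings()
```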
Logging System
Custom logging setup:
- LogCaptureHandler for UI display
- Structured logging format
- Execution history storage
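A handler that captures log records for display in the UI can be sketched with the standard library alone. This is written in the spirit of the LogCaptureHandler mentioned above, not a copy of the project's implementation.

```python
# Minimal sketch: a logging.Handler that buffers formatted records in memory
# so a UI layer (e.g. Streamlit) can render them later.
import logging

class LogCaptureHandler(logging.Handler):
    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(self.format(record))

logger = logging.getLogger("extractor")
handler = LogCaptureHandler()
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
logger.addHandler(handler)
logger.warning("no table found on page 3")
```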
Development Guidelines
Adding New Agents
- Inherit from base agent class
- Implement required methods
- Add to planner configuration
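Following those steps, a new agent could look like the sketch below. The base-class interface shown here (`name`, `run(context)`) is a hypothetical shape; check the actual base agent class for the required methods.

```python
# Hypothetical base-class shape and a toy agent built on it.
class BaseAgent:
    name = "base"

    def run(self, context):
        raise NotImplementedError

class KeywordAgent(BaseAgent):
    """Example agent that returns the page numbers containing a keyword."""
    name = "KeywordAgent"

    def run(self, context):
        keyword = context["keyword"]
        return [i for i, page in enumerate(context["pages"], 1) if keyword in page]

agent = KeywordAgent()
hits = agent.run({"keyword": "Total", "pages": ["Intro", "Total: 42"]})
```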
Modifying Extraction Logic
- Update prompt templates
- Modify field mapping logic
- Adjust validation rules
Extending Functionality
- Add new field types
- Implement custom validators
- Create new output formats
Testing
- Unit tests for agents
- Integration tests for pipeline
- End-to-end testing with sample PDFs
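A unit test for an agent-level helper might look like this; the helper under test is a stand-in, and plain asserts are used so it runs under pytest or as a script.

```python
# Illustrative unit test for a small extraction helper (stand-in function).
def extract_filename(path):
    """Return the final path component, used for the FileName field."""
    return path.rsplit("/", 1)[-1]

def test_extract_filename():
    assert extract_filename("docs/report.pdf") == "report.pdf"
    assert extract_filename("report.pdf") == "report.pdf"

test_extract_filename()
```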
Deployment
- Streamlit app deployment
- Environment configuration
- API key management
- Logging setup
Future Improvements
- Enhanced error handling
- Additional field types
- Improved validation
- Performance optimization
- Extended documentation