---
title: Doctorecord
emoji: π
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
  - streamlit
pinned: false
short_description: Streamlit template space
---
# Deep-Research PDF Field Extractor

A multi-agent system for extracting structured data from biotech-related PDFs using Azure Document Intelligence and Azure OpenAI.
## Features

- **Multi-Agent Architecture**: Uses specialized agents for different extraction tasks
- **Azure Integration**: Leverages Azure Document Intelligence and Azure OpenAI
- **Flexible Extraction Strategies**: Supports both the original and unique-indices strategies
- **Robust Error Handling**: Implements retry logic with exponential backoff
- **Comprehensive Cost Tracking**: Monitors API usage and costs for all LLM calls
- **Streamlit UI**: User-friendly interface for document processing
- **Graceful Degradation**: Continues processing even with partial failures
## Installation

1. Clone the repository
2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up environment variables (see the Configuration section)
4. Run the application:

   ```bash
   streamlit run src/app.py
   ```
## Configuration

### Environment Variables

Create a `.env` file with the following variables:

```env
# Azure OpenAI
AZURE_OPENAI_ENDPOINT=your_endpoint
AZURE_OPENAI_API_KEY=your_api_key
AZURE_OPENAI_DEPLOYMENT=your_deployment_name
AZURE_OPENAI_API_VERSION=2025-03-01-preview

# Azure Document Intelligence
AZURE_DI_ENDPOINT=your_di_endpoint
AZURE_DI_KEY=your_di_key

# Retry Configuration (Optional)
LLM_MAX_RETRIES=5
LLM_BASE_DELAY=1.0
LLM_MAX_DELAY=60.0
```
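The optional retry settings can be read with standard-library tooling alone. A minimal sketch, assuming the documented defaults (the `load_retry_config` helper is hypothetical, not part of the project's API):

```python
import os

def load_retry_config() -> dict:
    """Read the optional retry settings, falling back to the documented defaults."""
    return {
        "max_retries": int(os.getenv("LLM_MAX_RETRIES", "5")),
        "base_delay": float(os.getenv("LLM_BASE_DELAY", "1.0")),
        "max_delay": float(os.getenv("LLM_MAX_DELAY", "60.0")),
    }
```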
### Retry Configuration

The system implements robust retry logic to handle transient service errors:

- **`LLM_MAX_RETRIES`**: Maximum number of retry attempts (default: 5)
- **`LLM_BASE_DELAY`**: Base delay in seconds for exponential backoff (default: 1.0)
- **`LLM_MAX_DELAY`**: Maximum delay in seconds (default: 60.0)
The retry logic automatically handles:
- 503 Service Unavailable errors
- 500 Internal Server Error
- Connection timeouts
- Network errors
Retries use exponential backoff with jitter to prevent thundering herd problems.
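The backoff-with-jitter scheme can be sketched as below. This is a simplified illustration, not the project's actual `LLMClient` code: `backoff_delay` and `call_with_retries` are hypothetical names, and stdlib exceptions stand in for the Azure SDK's transient-error types.

```python
import random
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def call_with_retries(fn, max_retries: int = 5, base: float = 1.0, cap: float = 60.0):
    """Retry fn on transient errors, sleeping a jittered exponential backoff between attempts."""
    for attempt in range(max_retries):
        try:
            return fn()
        except (ConnectionError, TimeoutError):  # stand-ins for 503/500/timeout failures
            if attempt == max_retries - 1:
                raise  # retries exhausted; surface the error
            time.sleep(backoff_delay(attempt, base, cap))
```

Because each client draws a random delay from the full range, simultaneous retries spread out instead of hammering the service at the same instant.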
## Usage

### Original Strategy

Processes documents page by page, extracting fields individually using semantic search and LLM-based extraction.

**Workflow:**

```
PDFAgent → TableAgent → ForEachField → FieldMapperAgent
```
### Unique Indices Strategy

Extracts data based on unique combinations of specified indices, then loops through each combination to extract additional fields.

**Workflow:**

```
PDFAgent → TableAgent → UniqueIndicesCombinator → UniqueIndicesLoopAgent
```
**Step-by-step process:**

1. **PDFAgent**: Extracts text from PDF files
2. **TableAgent**: Processes tables using Azure Document Intelligence
3. **UniqueIndicesCombinator**: Extracts unique combinations of the specified indices (e.g., Protein Lot, Peptide, Timepoint, Modification)
4. **UniqueIndicesLoopAgent**: Loops through each combination to extract additional fields (e.g., Chain, Percentage, Seq Loc)
**Example Output:**

```json
[
  {
    "Protein Lot": "P066_L14_H31_0-hulgG-LALAPG-FJB",
    "Peptide": "PLTFGAGTK",
    "Timepoint": "0w",
    "Modification": "Clipping",
    "Chain": "Heavy",
    "Percentage": "90.0",
    "Seq Loc": "HC(1-31)"
  },
  {
    "Protein Lot": "P066_L14_H31_0-hulgG-LALAPG-FJB",
    "Peptide": "PLTFGAGTK",
    "Timepoint": "4w",
    "Modification": "Clipping",
    "Chain": "Heavy",
    "Percentage": "85.0",
    "Seq Loc": "HC(1-31)"
  }
]
```
## Architecture

### Agents

- **PDFAgent**: Extracts text from PDF files using PyMuPDF
- **TableAgent**: Processes tables using Azure Document Intelligence with layout analysis
- **UniqueIndicesCombinator**: Extracts unique combinations of specified indices from documents
- **UniqueIndicesLoopAgent**: Loops through combinations to extract additional field values
- **FieldMapperAgent**: Maps individual fields to values using LLM-based extraction
- **IndexAgent**: Creates semantic search indices for improved field extraction
### Services

- **LLMClient**: Azure OpenAI wrapper with retry logic and cost tracking
- **AzureDIService**: Azure Document Intelligence integration with table processing
- **CostTracker**: Comprehensive API usage and cost monitoring
- **EmbeddingClient**: Semantic search capabilities
### Data Flow

1. **Document Processing**: PDF text and table extraction
2. **Strategy Selection**: Choose between the original or unique-indices approach
3. **Field Extraction**: LLM-based extraction with detailed field descriptions
4. **Cost Tracking**: Monitor all API usage and calculate costs
5. **Result Processing**: Convert to a structured format (DataFrame/CSV)
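The final conversion step can be done with the standard library alone. A minimal sketch, not the project's actual code (the `records_to_csv` helper is illustrative, with rows shaped like the unique-indices example output; the app itself may use pandas for the DataFrame view):

```python
import csv

# Rows shaped like the unique-indices example output; a failed extraction is an empty cell.
records = [
    {"Protein Lot": "P066_L14_H31_0-hulgG-LALAPG-FJB", "Timepoint": "0w", "Percentage": "90.0"},
    {"Protein Lot": "P066_L14_H31_0-hulgG-LALAPG-FJB", "Timepoint": "4w", "Percentage": ""},
]

def records_to_csv(rows: list[dict], path: str) -> None:
    """Write extraction records to CSV, one column per field."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

records_to_csv(records, "extracted_fields.csv")
```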
## Cost Tracking

The system provides comprehensive cost tracking for all operations:

### LLM Costs

- **Input Tokens**: Tracked for each LLM call with descriptions
- **Output Tokens**: Tracked for each LLM call with descriptions
- **Cost Calculation**: Based on Azure OpenAI pricing
- **Detailed Breakdown**: Individual call costs in the UI

### Document Intelligence Costs

- **Pages Processed**: Tracked per operation
- **Operation Types**: Layout analysis, custom models, etc.
- **Cost Calculation**: Based on Azure DI pricing

### Cost Display

- **Real-time Updates**: Costs shown during execution
- **Detailed Table**: Breakdown of all LLM calls
- **Total Summary**: Combined costs for the entire operation
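The token-based cost accounting works roughly as sketched below. This is an illustration, not the project's actual `CostTracker` service, and the per-1K-token prices are assumptions, not official Azure rates:

```python
# Illustrative per-1K-token prices; substitute your deployment's actual Azure OpenAI rates.
PRICES = {"input": 0.005, "output": 0.015}  # USD per 1K tokens (assumed, not official)

class CostTracker:
    """Accumulate token counts per LLM call and report per-call and total costs."""

    def __init__(self):
        self.calls = []

    def record(self, description: str, input_tokens: int, output_tokens: int) -> None:
        cost = (input_tokens / 1000) * PRICES["input"] + (output_tokens / 1000) * PRICES["output"]
        self.calls.append({"description": description, "cost": cost})

    def total(self) -> float:
        return sum(call["cost"] for call in self.calls)
```

Keeping a per-call list (rather than a single running sum) is what makes the detailed breakdown table in the UI possible.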
## Error Handling

The system implements comprehensive error handling:

- **Retry Logic**: Automatic retries for transient errors with exponential backoff
- **Graceful Degradation**: Continues processing even if some combinations fail
- **Partial Results**: Returns data for successful extractions, with null values for failures
- **Detailed Logging**: Comprehensive logging for debugging and monitoring
- **Cost Tracking**: Monitors API usage even during failures
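The graceful-degradation pattern amounts to catching per-combination failures and nulling out the unextracted fields. A sketch under those assumptions (the `extract_all_combinations` helper and its arguments are hypothetical, not the project's API):

```python
def extract_all_combinations(combinations, extract_fn, extra_fields):
    """Try each index combination; on failure keep the row with None for the extra fields."""
    results = []
    for combo in combinations:
        row = dict(combo)
        try:
            row.update(extract_fn(combo))
        except Exception:
            # Partial result: keep the index values, null out the fields we could not extract.
            row.update({field: None for field in extra_fields})
        results.append(row)
    return results
```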
### Error Types Handled

- ✅ 503 Service Unavailable (Azure service overload)
- ✅ 500 Internal Server Error (server-side issues)
- ✅ Connection timeouts (network issues)
- ✅ Network errors (infrastructure problems)
- ❌ 400 Bad Request (client errors, not retried)
- ❌ 401 Unauthorized (authentication errors, not retried)
## Field Descriptions

The system supports detailed field descriptions to improve extraction accuracy:

### Field Description Format

```json
{
  "field_name": {
    "description": "Detailed description of the field",
    "format": "Expected format (String, Float, etc.)",
    "examples": "Example values",
    "possible_values": "Comma-separated list of possible values"
  }
}
```
### UI Support

- **Editable Tables**: Add, edit, and remove field descriptions
- **Session State**: Persists descriptions during the session
- **Validation**: Ensures proper format and structure
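Validating the structure against the format above reduces to checking that each field maps to an object carrying the four documented keys. A sketch of one way such a check could look (the `validate_field_descriptions` function is hypothetical, not the UI's actual validator):

```python
REQUIRED_KEYS = {"description", "format", "examples", "possible_values"}

def validate_field_descriptions(descriptions: dict) -> list[str]:
    """Return a list of problems; an empty list means the structure is valid."""
    problems = []
    for name, spec in descriptions.items():
        if not isinstance(spec, dict):
            problems.append(f"{name}: expected an object, got {type(spec).__name__}")
            continue
        missing = REQUIRED_KEYS - spec.keys()
        if missing:
            problems.append(f"{name}: missing keys {sorted(missing)}")
    return problems
```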
## Testing

The system includes comprehensive test suites:

### Test Scripts

- **`test_retry.py`**: Verifies retry logic with simulated failures
- **`test_cost_tracking.py`**: Validates cost-tracking functionality

### Running Tests

```bash
python test_retry.py
python test_cost_tracking.py
```
## Performance

### Optimization Features

- **Retry Logic**: Handles transient failures automatically
- **Cost Optimization**: Detailed tracking to monitor usage
- **Graceful Degradation**: Continues with partial results
- **Caching**: Session state for field descriptions

### Expected Performance

- **Small Documents**: 30-60 seconds
- **Large Documents**: 2-5 minutes
- **Cost Efficiency**: ~$0.01-0.10 per document, depending on size
## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## Troubleshooting

### Common Issues

**503 Service Unavailable Errors**

- The system automatically retries with exponential backoff
- Check Azure service status if the errors persist
- Adjust the retry configuration if needed

**Cost Tracking Shows Zero**

- Ensure the cost tracker is properly initialized
- Check that agents are passing context correctly
- Verify that LLM calls are being made

**Partial Results**

- Some combinations may fail due to document structure
- Check the execution logs for specific failures
- Results include null values for failed extractions

### Debug Mode

Enable detailed logging by setting the log level to DEBUG in the application.
## License

[Add your license information here]
## Overview

The PDF Field Extractor helps you extract specific information from PDF documents. It can extract any fields you specify, such as dates, names, values, locations, and more. The tool is particularly useful for converting unstructured PDF data into structured, analyzable formats.
## How to Use

1. **Upload Your PDF**
   - Click the "Upload PDF" button
   - Select your PDF file from your computer
2. **Specify Fields to Extract**
   - Enter the fields you want to extract, separated by commas
   - Example: `Date, Name, Value, Location, Page, FileName`
3. **Optional: Add Field Descriptions**
   - You can provide additional context about the fields
   - This helps the system better understand what to look for
4. **Run Extraction**
   - Click the "Run extraction" button
   - Wait for the process to complete
   - View your results in a table format
5. **Download Results**
   - Download your extracted data as a CSV file
   - View execution traces and logs if needed
## Features

- Automatic document type detection
- Smart field extraction
- Support for tables and text
- Detailed execution traces
- Downloadable results and logs
## Support

For technical documentation and architecture details, please refer to: