---
title: Doctorecord
emoji: 🚀
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
  - streamlit
pinned: false
short_description: Deep-Research PDF field extractor
---


# Deep-Research PDF Field Extractor

A multi-agent system for extracting structured data from biotech-related PDFs using Azure Document Intelligence and Azure OpenAI.

## Features

- **Multi-Agent Architecture**: Uses specialized agents for different extraction tasks
- **Azure Integration**: Leverages Azure Document Intelligence and Azure OpenAI
- **Flexible Extraction Strategies**: Supports both original and unique indices strategies
- **Robust Error Handling**: Implements retry logic with exponential backoff
- **Comprehensive Cost Tracking**: Monitors API usage and costs for all LLM calls
- **Streamlit UI**: User-friendly interface for document processing
- **Graceful Degradation**: Continues processing even with partial failures

## Installation

1. Clone the repository
2. Install dependencies: `pip install -r requirements.txt`
3. Set up environment variables (see the Configuration section)
4. Run the application: `streamlit run src/app.py`

## Configuration

### Environment Variables

Create a `.env` file with the following variables:

```bash
# Azure OpenAI
AZURE_OPENAI_ENDPOINT=your_endpoint
AZURE_OPENAI_API_KEY=your_api_key
AZURE_OPENAI_DEPLOYMENT=your_deployment_name
AZURE_OPENAI_API_VERSION=2025-03-01-preview

# Azure Document Intelligence
AZURE_DI_ENDPOINT=your_di_endpoint
AZURE_DI_KEY=your_di_key

# Retry Configuration (Optional)
LLM_MAX_RETRIES=5
LLM_BASE_DELAY=1.0
LLM_MAX_DELAY=60.0
```

### Retry Configuration

The system implements robust retry logic to handle transient service errors:

- `LLM_MAX_RETRIES`: Maximum number of retry attempts (default: 5)
- `LLM_BASE_DELAY`: Base delay in seconds for exponential backoff (default: 1.0)
- `LLM_MAX_DELAY`: Maximum delay in seconds (default: 60.0)
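
As a rough sketch, these settings could be read with standard `os.getenv` lookups. The helper below is illustrative rather than the codebase's actual loader, but the variable names and defaults match the list above:

```python
import os

def load_retry_config() -> dict:
    """Read retry settings from the environment, falling back to the
    documented defaults. Illustrative helper, not the actual loader."""
    return {
        "max_retries": int(os.getenv("LLM_MAX_RETRIES", "5")),
        "base_delay": float(os.getenv("LLM_BASE_DELAY", "1.0")),
        "max_delay": float(os.getenv("LLM_MAX_DELAY", "60.0")),
    }
```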

The retry logic automatically handles:

- 503 Service Unavailable errors
- 500 Internal Server Error responses
- Connection timeouts
- Network errors

Retries use exponential backoff with jitter to prevent thundering herd problems.
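
A minimal sketch of that policy, assuming a generic callable and a stand-in exception type (the real `LLMClient` wraps Azure OpenAI calls, and its internals may differ):

```python
import random
import time

class TransientError(Exception):
    """Stand-in for 503/500/timeout-style failures from the service client."""

def call_with_retries(fn, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Retry `fn` on transient errors using exponential backoff with jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries:
                raise
            # Exponential backoff: base, 2x, 4x, ... capped at max_delay.
            delay = min(base_delay * (2 ** attempt), max_delay)
            # Jitter spreads out concurrent retries (thundering herd).
            time.sleep(delay + random.uniform(0, delay / 2))
```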

## Usage

### Original Strategy

Processes documents page by page, extracting fields individually using semantic search and LLM-based extraction.

Workflow:

```
PDFAgent → TableAgent → ForEachField → FieldMapperAgent
```
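
Conceptually, this is a sequential pipeline in which each agent reads from and writes to a shared context. A minimal runner might look like the following (the `run` interface is an assumption for illustration, not the actual agent API):

```python
def run_pipeline(agents, context):
    """Pass a shared context dict through each agent in order."""
    for agent in agents:
        context = agent.run(context)
    return context
```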

### Unique Indices Strategy

Extracts data based on unique combinations of specified indices, then loops through each combination to extract additional fields.

Workflow:

```
PDFAgent → TableAgent → UniqueIndicesCombinator → UniqueIndicesLoopAgent
```

Step-by-step process:

  1. PDFAgent: Extracts text from PDF files
  2. TableAgent: Processes tables using Azure Document Intelligence
  3. UniqueIndicesCombinator: Extracts unique combinations of specified indices (e.g., Protein Lot, Peptide, Timepoint, Modification)
  4. UniqueIndicesLoopAgent: Loops through each combination to extract additional fields (e.g., Chain, Percentage, Seq Loc)
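
The loop in steps 3-4 can be sketched as follows. The two extraction steps are injected as callables because the actual agent APIs are internal; all names here are illustrative:

```python
from typing import Callable, Dict, List, Optional

def run_unique_indices(
    extract_combinations: Callable[[], List[Dict[str, str]]],
    extract_extra_fields: Callable[[Dict[str, str]], Dict[str, str]],
    extra_fields: List[str],
) -> List[Dict[str, Optional[str]]]:
    rows = []
    for combo in extract_combinations():         # step 3: unique index tuples
        try:
            extra = extract_extra_fields(combo)  # step 4: one call per combination
        except Exception:
            # Graceful degradation: keep the row, null out the failed fields.
            extra = {field: None for field in extra_fields}
        rows.append({**combo, **extra})
    return rows
```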

Example Output:

```json
[
  {
    "Protein Lot": "P066_L14_H31_0-hulgG-LALAPG-FJB",
    "Peptide": "PLTFGAGTK",
    "Timepoint": "0w",
    "Modification": "Clipping",
    "Chain": "Heavy",
    "Percentage": "90.0",
    "Seq Loc": "HC(1-31)"
  },
  {
    "Protein Lot": "P066_L14_H31_0-hulgG-LALAPG-FJB",
    "Peptide": "PLTFGAGTK",
    "Timepoint": "4w",
    "Modification": "Clipping",
    "Chain": "Heavy",
    "Percentage": "85.0",
    "Seq Loc": "HC(1-31)"
  }
]
```

## Architecture

### Agents

- **PDFAgent**: Extracts text from PDF files using PyMuPDF
- **TableAgent**: Processes tables using Azure Document Intelligence with layout analysis
- **UniqueIndicesCombinator**: Extracts unique combinations of specified indices from documents
- **UniqueIndicesLoopAgent**: Loops through combinations to extract additional field values
- **FieldMapperAgent**: Maps individual fields to values using LLM-based extraction
- **IndexAgent**: Creates semantic search indices for improved field extraction
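
All agents share the same shape: they consume and enrich a pipeline context. A hypothetical common interface (the actual base class may differ) could look like:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict

class Agent(ABC):
    """Hypothetical common interface for pipeline agents."""

    @abstractmethod
    def run(self, context: Dict[str, Any]) -> Dict[str, Any]:
        """Read inputs from the shared context and return it enriched."""
```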

### Services

- **LLMClient**: Azure OpenAI wrapper with retry logic and cost tracking
- **AzureDIService**: Azure Document Intelligence integration with table processing
- **CostTracker**: Comprehensive API usage and cost monitoring
- **EmbeddingClient**: Semantic search capabilities

### Data Flow

  1. Document Processing: PDF text and table extraction
  2. Strategy Selection: Choose between original or unique indices approach
  3. Field Extraction: LLM-based extraction with detailed field descriptions
  4. Cost Tracking: Monitor all API usage and calculate costs
  5. Result Processing: Convert to structured format (DataFrame/CSV)
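
Step 5 is plain pandas. Assuming the extraction output is a list of dicts like the example above, the conversion might be:

```python
import pandas as pd

# List-of-dicts output from an extraction run (abbreviated example).
records = [
    {"Protein Lot": "P066_L14_H31_0-hulgG-LALAPG-FJB", "Peptide": "PLTFGAGTK",
     "Timepoint": "0w", "Percentage": "90.0"},
]

df = pd.DataFrame(records)
df.to_csv("extraction_results.csv", index=False)
```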

## Cost Tracking

The system provides comprehensive cost tracking for all operations:

### LLM Costs

- **Input Tokens**: Tracked for each LLM call with descriptions
- **Output Tokens**: Tracked for each LLM call with descriptions
- **Cost Calculation**: Based on Azure OpenAI pricing
- **Detailed Breakdown**: Individual call costs in the UI
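
The cost arithmetic itself is simple token counting. The per-1K-token prices below are placeholders, not actual Azure OpenAI rates:

```python
PRICE_PER_1K_INPUT = 0.005   # USD per 1K input tokens (placeholder)
PRICE_PER_1K_OUTPUT = 0.015  # USD per 1K output tokens (placeholder)

def llm_call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one LLM call given token counts and per-1K pricing."""
    return (
        (input_tokens / 1000) * PRICE_PER_1K_INPUT
        + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT
    )

print(f"${llm_call_cost(12_000, 800):.4f}")  # -> $0.0720
```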

### Document Intelligence Costs

- **Pages Processed**: Tracked per operation
- **Operation Types**: Layout analysis, custom models, etc.
- **Cost Calculation**: Based on Azure DI pricing

### Cost Display

- **Real-time Updates**: Costs shown during execution
- **Detailed Table**: Breakdown of all LLM calls
- **Total Summary**: Combined costs for the entire operation

## Error Handling

The system implements comprehensive error handling:

  1. Retry Logic: Automatic retries for transient errors with exponential backoff
  2. Graceful Degradation: Continues processing even if some combinations fail
  3. Partial Results: Returns data for successful extractions with null values for failures
  4. Detailed Logging: Comprehensive logging for debugging and monitoring
  5. Cost Tracking: Monitors API usage even during failures

### Error Types Handled

- ✅ 503 Service Unavailable (Azure service overload)
- ✅ 500 Internal Server Error (server-side issues)
- ✅ Connection timeouts (network issues)
- ✅ Network errors (infrastructure problems)
- ❌ 400 Bad Request (client errors, not retried)
- ❌ 401 Unauthorized (authentication errors, not retried)
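
This table implies a simple retry predicate; a sketch follows (the real client's status handling may differ):

```python
from typing import Optional

RETRIABLE = {500, 503}      # server-side errors: retry
NON_RETRIABLE = {400, 401}  # client-side errors: fail fast

def should_retry(status_code: Optional[int], is_network_error: bool) -> bool:
    """Decide whether a failed call is worth retrying."""
    if is_network_error:              # timeouts, connection resets
        return True
    if status_code in NON_RETRIABLE:  # bad request / auth: retrying won't help
        return False
    return status_code in RETRIABLE
```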

## Field Descriptions

The system supports detailed field descriptions to improve extraction accuracy:

### Field Description Format

```json
{
  "field_name": {
    "description": "Detailed description of the field",
    "format": "Expected format (String, Float, etc.)",
    "examples": "Example values",
    "possible_values": "Comma-separated list of possible values"
  }
}
```
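
For illustration, here is one way such a description could be rendered into prompt text. The actual prompt templates are internal to the agents, so treat this as a sketch with made-up field metadata:

```python
field_descriptions = {
    "Percentage": {
        "description": "Relative abundance of the modified peptide",
        "format": "Float",
        "examples": "90.0, 85.0",
        "possible_values": "",
    }
}

def describe_field(name: str) -> str:
    """Render one field's metadata as plain text for an extraction prompt."""
    meta = field_descriptions.get(name, {})
    lines = [f"Field: {name}"]
    for key in ("description", "format", "examples", "possible_values"):
        if meta.get(key):  # skip empty entries
            lines.append(f"  {key}: {meta[key]}")
    return "\n".join(lines)

print(describe_field("Percentage"))
```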

### UI Support

- **Editable Tables**: Add, edit, and remove field descriptions
- **Session State**: Persists descriptions during the session
- **Validation**: Ensures proper format and structure

## Testing

The system includes comprehensive test suites:

### Test Scripts

- `test_retry.py`: Verifies retry logic with simulated failures
- `test_cost_tracking.py`: Validates cost tracking functionality

### Running Tests

```bash
python test_retry.py
python test_cost_tracking.py
```

## Performance

### Optimization Features

- **Retry Logic**: Handles transient failures automatically
- **Cost Optimization**: Detailed tracking to monitor usage
- **Graceful Degradation**: Continues with partial results
- **Caching**: Session state for field descriptions

### Expected Performance

- **Small Documents**: 30-60 seconds
- **Large Documents**: 2-5 minutes
- **Cost Efficiency**: ~$0.01-0.10 per document (depending on size)

## Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

## Troubleshooting

### Common Issues

#### 503 Service Unavailable Errors

- The system automatically retries with exponential backoff
- Check Azure service status if the errors persist
- Adjust the retry configuration if needed

#### Cost Tracking Shows Zero

- Ensure the cost tracker is properly initialized
- Check that agents are passing context correctly
- Verify that LLM calls are being made

#### Partial Results

- Some combinations may fail due to document structure
- Check the execution logs for specific failures
- Results include null values for failed extractions

### Debug Mode

Enable detailed logging by setting the log level to DEBUG in the application.
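
For example, with Python's standard `logging` module (adjust to however the app configures its loggers):

```python
import logging

logging.basicConfig(
    level=logging.DEBUG,  # surface detailed per-agent and per-call logs
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
```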

## License

[Add your license information here]

## Overview

The PDF Field Extractor helps you extract specific information from PDF documents. It can extract any fields you specify, such as dates, names, values, locations, and more. The tool is particularly useful for converting unstructured PDF data into structured, analyzable formats.

## How to Use

1. **Upload Your PDF**
   - Click the "Upload PDF" button
   - Select your PDF file from your computer
2. **Specify Fields to Extract**
   - Enter the fields you want to extract, separated by commas
   - Example: `Date, Name, Value, Location, Page, FileName`
3. **Optional: Add Field Descriptions**
   - You can provide additional context about the fields
   - This helps the system better understand what to look for
4. **Run Extraction**
   - Click the "Run extraction" button
   - Wait for the process to complete
   - View your results in a table format
5. **Download Results**
   - Download your extracted data as a CSV file
   - View execution traces and logs if needed

## Features

- Automatic document type detection
- Smart field extraction
- Support for tables and text
- Detailed execution traces
- Downloadable results and logs

## Support

For technical documentation and architecture details, please refer to: