---
title: Doctorecord
emoji: 🚀
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
  - streamlit
pinned: false
short_description: Deep-Research PDF field extractor
---


# Deep-Research PDF Field Extractor

A multi-agent system for extracting structured data from biotech-related PDFs using Azure Document Intelligence and Azure OpenAI.

## Features

- **Multi-Agent Architecture**: Uses specialized agents for different extraction tasks
- **Azure Integration**: Leverages Azure Document Intelligence and Azure OpenAI
- **Flexible Extraction Strategies**: Supports both original and unique indices strategies
- **Robust Error Handling**: Implements retry logic with exponential backoff
- **Comprehensive Cost Tracking**: Monitors API usage and costs for all LLM calls
- **Streamlit UI**: User-friendly interface for document processing
- **Graceful Degradation**: Continues processing even with partial failures

## Installation

1. Clone the repository
2. Install dependencies: `pip install -r requirements.txt`
3. Set up environment variables (see the Configuration section)
4. Run the application: `streamlit run src/app.py`

## Configuration

### Environment Variables

Create a `.env` file with the following variables:

```bash
# Azure OpenAI
AZURE_OPENAI_ENDPOINT=your_endpoint
AZURE_OPENAI_API_KEY=your_api_key
AZURE_OPENAI_DEPLOYMENT=your_deployment_name
AZURE_OPENAI_API_VERSION=2025-03-01-preview

# Azure Document Intelligence
AZURE_DI_ENDPOINT=your_di_endpoint
AZURE_DI_KEY=your_di_key

# Retry Configuration (Optional)
LLM_MAX_RETRIES=5
LLM_BASE_DELAY=1.0
LLM_MAX_DELAY=60.0
```

### Retry Configuration

The system implements robust retry logic to handle transient service errors:

- `LLM_MAX_RETRIES`: Maximum number of retry attempts (default: 5)
- `LLM_BASE_DELAY`: Base delay in seconds for exponential backoff (default: 1.0)
- `LLM_MAX_DELAY`: Maximum delay in seconds (default: 60.0)
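
As a rough sketch, these settings could be read with standard `os.getenv` lookups. The helper below is illustrative rather than the codebase's actual loader, but the variable names and defaults match the list above:

```python
import os

def load_retry_config() -> dict:
    """Read retry settings from the environment, falling back to the
    documented defaults. Illustrative helper, not the actual loader."""
    return {
        "max_retries": int(os.getenv("LLM_MAX_RETRIES", "5")),
        "base_delay": float(os.getenv("LLM_BASE_DELAY", "1.0")),
        "max_delay": float(os.getenv("LLM_MAX_DELAY", "60.0")),
    }
```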

The retry logic automatically handles:

- 503 Service Unavailable errors
- 500 Internal Server Error responses
- Connection timeouts
- Network errors

Retries use exponential backoff with jitter to prevent thundering herd problems.
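
A minimal sketch of that policy, assuming a generic callable and a stand-in exception type (the real `LLMClient` wraps Azure OpenAI calls, and its internals may differ):

```python
import random
import time

class TransientError(Exception):
    """Stand-in for 503/500/timeout-style failures from the service client."""

def call_with_retries(fn, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Retry `fn` on transient errors using exponential backoff with jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries:
                raise
            # Exponential backoff: base, 2x, 4x, ... capped at max_delay.
            delay = min(base_delay * (2 ** attempt), max_delay)
            # Jitter spreads out concurrent retries (thundering herd).
            time.sleep(delay + random.uniform(0, delay / 2))
```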

## Usage

### Original Strategy

Processes documents page by page, extracting fields individually using semantic search and LLM-based extraction.

Workflow:

```
PDFAgent → TableAgent → ForEachField → FieldMapperAgent
```
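
Conceptually, this is a sequential pipeline in which each agent reads from and writes to a shared context. A minimal runner might look like the following (the `run` interface is an assumption for illustration, not the actual agent API):

```python
def run_pipeline(agents, context):
    """Pass a shared context dict through each agent in order."""
    for agent in agents:
        context = agent.run(context)
    return context
```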

### Unique Indices Strategy

Extracts data based on unique combinations of specified indices, then loops through each combination to extract additional fields.

Workflow:

```
PDFAgent → TableAgent → UniqueIndicesCombinator → UniqueIndicesLoopAgent
```

Step-by-step process:

  1. PDFAgent: Extracts text from PDF files
  2. TableAgent: Processes tables using Azure Document Intelligence
  3. UniqueIndicesCombinator: Extracts unique combinations of specified indices (e.g., Protein Lot, Peptide, Timepoint, Modification)
  4. UniqueIndicesLoopAgent: Loops through each combination to extract additional fields (e.g., Chain, Percentage, Seq Loc)
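
The loop in steps 3-4 can be sketched as follows. The two extraction steps are injected as callables because the actual agent APIs are internal; all names here are illustrative:

```python
from typing import Callable, Dict, List, Optional

def run_unique_indices(
    extract_combinations: Callable[[], List[Dict[str, str]]],
    extract_extra_fields: Callable[[Dict[str, str]], Dict[str, str]],
    extra_fields: List[str],
) -> List[Dict[str, Optional[str]]]:
    rows = []
    for combo in extract_combinations():         # step 3: unique index tuples
        try:
            extra = extract_extra_fields(combo)  # step 4: one call per combination
        except Exception:
            # Graceful degradation: keep the row, null out the failed fields.
            extra = {field: None for field in extra_fields}
        rows.append({**combo, **extra})
    return rows
```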

Example Output:

```json
[
  {
    "Protein Lot": "P066_L14_H31_0-hulgG-LALAPG-FJB",
    "Peptide": "PLTFGAGTK",
    "Timepoint": "0w",
    "Modification": "Clipping",
    "Chain": "Heavy",
    "Percentage": "90.0",
    "Seq Loc": "HC(1-31)"
  },
  {
    "Protein Lot": "P066_L14_H31_0-hulgG-LALAPG-FJB",
    "Peptide": "PLTFGAGTK",
    "Timepoint": "4w",
    "Modification": "Clipping",
    "Chain": "Heavy",
    "Percentage": "85.0",
    "Seq Loc": "HC(1-31)"
  }
]
```

## Architecture

### Agents

- **PDFAgent**: Extracts text from PDF files using PyMuPDF
- **TableAgent**: Processes tables using Azure Document Intelligence with layout analysis
- **UniqueIndicesCombinator**: Extracts unique combinations of specified indices from documents
- **UniqueIndicesLoopAgent**: Loops through combinations to extract additional field values
- **FieldMapperAgent**: Maps individual fields to values using LLM-based extraction
- **IndexAgent**: Creates semantic search indices for improved field extraction
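
All agents share the same shape: they consume and enrich a pipeline context. A hypothetical common interface (the actual base class may differ) could look like:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict

class Agent(ABC):
    """Hypothetical common interface for pipeline agents."""

    @abstractmethod
    def run(self, context: Dict[str, Any]) -> Dict[str, Any]:
        """Read inputs from the shared context and return it enriched."""
```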

### Services

- **LLMClient**: Azure OpenAI wrapper with retry logic and cost tracking
- **AzureDIService**: Azure Document Intelligence integration with table processing
- **CostTracker**: Comprehensive API usage and cost monitoring
- **EmbeddingClient**: Semantic search capabilities

### Data Flow

  1. Document Processing: PDF text and table extraction
  2. Strategy Selection: Choose between original or unique indices approach
  3. Field Extraction: LLM-based extraction with detailed field descriptions
  4. Cost Tracking: Monitor all API usage and calculate costs
  5. Result Processing: Convert to structured format (DataFrame/CSV)
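
Step 5 is plain pandas. Assuming the extraction output is a list of dicts like the example above, the conversion might be:

```python
import pandas as pd

# List-of-dicts output from an extraction run (abbreviated example).
records = [
    {"Protein Lot": "P066_L14_H31_0-hulgG-LALAPG-FJB", "Peptide": "PLTFGAGTK",
     "Timepoint": "0w", "Percentage": "90.0"},
]

df = pd.DataFrame(records)
df.to_csv("extraction_results.csv", index=False)
```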

## Cost Tracking

The system provides comprehensive cost tracking for all operations:

### LLM Costs

- **Input Tokens**: Tracked for each LLM call with descriptions
- **Output Tokens**: Tracked for each LLM call with descriptions
- **Cost Calculation**: Based on Azure OpenAI pricing
- **Detailed Breakdown**: Individual call costs in the UI
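
The cost arithmetic itself is simple token counting. The per-1K-token prices below are placeholders, not actual Azure OpenAI rates:

```python
PRICE_PER_1K_INPUT = 0.005   # USD per 1K input tokens (placeholder)
PRICE_PER_1K_OUTPUT = 0.015  # USD per 1K output tokens (placeholder)

def llm_call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one LLM call given token counts and per-1K pricing."""
    return (
        (input_tokens / 1000) * PRICE_PER_1K_INPUT
        + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT
    )

print(f"${llm_call_cost(12_000, 800):.4f}")  # -> $0.0720
```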

### Document Intelligence Costs

- **Pages Processed**: Tracked per operation
- **Operation Types**: Layout analysis, custom models, etc.
- **Cost Calculation**: Based on Azure DI pricing

### Cost Display

- **Real-time Updates**: Costs shown during execution
- **Detailed Table**: Breakdown of all LLM calls
- **Total Summary**: Combined costs for the entire operation

## Error Handling

The system implements comprehensive error handling:

  1. Retry Logic: Automatic retries for transient errors with exponential backoff
  2. Graceful Degradation: Continues processing even if some combinations fail
  3. Partial Results: Returns data for successful extractions with null values for failures
  4. Detailed Logging: Comprehensive logging for debugging and monitoring
  5. Cost Tracking: Monitors API usage even during failures

### Error Types Handled

- ✅ 503 Service Unavailable (Azure service overload)
- ✅ 500 Internal Server Error (server-side issues)
- ✅ Connection timeouts (network issues)
- ✅ Network errors (infrastructure problems)
- ❌ 400 Bad Request (client errors, not retried)
- ❌ 401 Unauthorized (authentication errors, not retried)
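
This table implies a simple retry predicate; a sketch follows (the real client's status handling may differ):

```python
from typing import Optional

RETRIABLE = {500, 503}      # server-side errors: retry
NON_RETRIABLE = {400, 401}  # client-side errors: fail fast

def should_retry(status_code: Optional[int], is_network_error: bool) -> bool:
    """Decide whether a failed call is worth retrying."""
    if is_network_error:              # timeouts, connection resets
        return True
    if status_code in NON_RETRIABLE:  # bad request / auth: retrying won't help
        return False
    return status_code in RETRIABLE
```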

## Field Descriptions

The system supports detailed field descriptions to improve extraction accuracy:

### Field Description Format

```json
{
  "field_name": {
    "description": "Detailed description of the field",
    "format": "Expected format (String, Float, etc.)",
    "examples": "Example values",
    "possible_values": "Comma-separated list of possible values"
  }
}
```
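
For illustration, here is one way such a description could be rendered into prompt text. The actual prompt templates are internal to the agents, so treat this as a sketch with made-up field metadata:

```python
field_descriptions = {
    "Percentage": {
        "description": "Relative abundance of the modified peptide",
        "format": "Float",
        "examples": "90.0, 85.0",
        "possible_values": "",
    }
}

def describe_field(name: str) -> str:
    """Render one field's metadata as plain text for an extraction prompt."""
    meta = field_descriptions.get(name, {})
    lines = [f"Field: {name}"]
    for key in ("description", "format", "examples", "possible_values"):
        if meta.get(key):  # skip empty entries
            lines.append(f"  {key}: {meta[key]}")
    return "\n".join(lines)

print(describe_field("Percentage"))
```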

### UI Support

- **Editable Tables**: Add, edit, and remove field descriptions
- **Session State**: Persists descriptions during the session
- **Validation**: Ensures proper format and structure

## Testing

The system includes comprehensive test suites:

### Test Scripts

- `test_retry.py`: Verifies retry logic with simulated failures
- `test_cost_tracking.py`: Validates cost tracking functionality

### Running Tests

```bash
python test_retry.py
python test_cost_tracking.py
```

## Performance

### Optimization Features

- **Retry Logic**: Handles transient failures automatically
- **Cost Optimization**: Detailed tracking to monitor usage
- **Graceful Degradation**: Continues with partial results
- **Caching**: Session state for field descriptions

### Expected Performance

- **Small Documents**: 30-60 seconds
- **Large Documents**: 2-5 minutes
- **Cost Efficiency**: ~$0.01-0.10 per document (depending on size)

## Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

## Troubleshooting

### Common Issues

#### 503 Service Unavailable Errors

- The system automatically retries with exponential backoff
- Check Azure service status if the errors persist
- Adjust the retry configuration if needed

#### Cost Tracking Shows Zero

- Ensure the cost tracker is properly initialized
- Check that agents are passing context correctly
- Verify that LLM calls are being made

#### Partial Results

- Some combinations may fail due to document structure
- Check the execution logs for specific failures
- Results include null values for failed extractions

### Debug Mode

Enable detailed logging by setting the log level to DEBUG in the application.
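
For example, with Python's standard `logging` module (adjust to however the app configures its loggers):

```python
import logging

logging.basicConfig(
    level=logging.DEBUG,  # surface detailed per-agent and per-call logs
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
```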

## License

[Add your license information here]

## Overview

The PDF Field Extractor helps you extract specific information from PDF documents. It can extract any fields you specify, such as dates, names, values, locations, and more. The tool is particularly useful for converting unstructured PDF data into structured, analyzable formats.

## How to Use

1. **Upload Your PDF**
   - Click the "Upload PDF" button
   - Select your PDF file from your computer
2. **Specify Fields to Extract**
   - Enter the fields you want to extract, separated by commas
   - Example: `Date, Name, Value, Location, Page, FileName`
3. **Optional: Add Field Descriptions**
   - You can provide additional context about the fields
   - This helps the system better understand what to look for
4. **Run Extraction**
   - Click the "Run extraction" button
   - Wait for the process to complete
   - View your results in a table format
5. **Download Results**
   - Download your extracted data as a CSV file
   - View execution traces and logs if needed

## Features

- Automatic document type detection
- Smart field extraction
- Support for tables and text
- Detailed execution traces
- Downloadable results and logs

## Support

For technical documentation and architecture details, please refer to: