
Deep-Research PDF Field Extractor

A tool for extracting structured data from PDF documents, designed to handle a variety of document types and pull out the specific fields you ask for.

For End Users

Overview

The PDF Field Extractor pulls specific information out of PDF documents. It extracts whichever fields you specify, such as dates, names, values, or locations, and it is particularly useful for converting unstructured PDF content into structured, analyzable formats.

How to Use

  1. Upload Your PDF

    • Click the "Upload PDF" button
    • Select your PDF file from your computer
  2. Specify Fields to Extract

    • Enter the fields you want to extract, separated by commas
    • Example: Date, Name, Value, Location, Page, FileName
  3. Optional: Add Field Descriptions

    • You can provide additional context about the fields
    • This helps the system better understand what to look for
  4. Run Extraction

    • Click the "Run extraction" button
    • Wait for the process to complete
    • View your results in a table format
  5. Download Results

    • Download your extracted data as a CSV file
    • View execution traces and logs if needed

Features

  • Automatic document type detection
  • Smart field extraction
  • Support for tables and text
  • Detailed execution traces
  • Downloadable results and logs

For Developers

Architecture Overview

The application is built using a multi-agent architecture with the following components:

Core Components

  1. Planner (orchestrator/planner.py)

    • Generates execution plans using Azure OpenAI
  2. Executor (orchestrator/executor.py)

    • Executes the generated plan
    • Manages agent execution flow
    • Handles context and result management
  3. Agents

    • PDFAgent: Extracts text from PDFs
    • TableAgent: Extracts tables from PDFs
    • FieldMapper: Maps fields to values
    • ForEachField: Control flow for field iteration
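
To make the relationship between these components concrete, here is a minimal sketch of a plan format and executor loop. The step schema, the agent registry, and the `run` signature are assumptions for illustration; the real interfaces live in orchestrator/planner.py and orchestrator/executor.py.

```python
# Hypothetical plan: an ordered list of steps, each naming an agent and its
# arguments. The schema produced by orchestrator/planner.py may differ.
plan = [
    {"agent": "PDFAgent", "args": {"pdf_path": "input.pdf"}},
    {"agent": "TableAgent", "args": {"pdf_path": "input.pdf"}},
    {"agent": "FieldMapper", "args": {"fields": ["Date", "Name", "Value"]}},
]

def run_plan(plan: list[dict], registry: dict, context: dict) -> dict:
    """Minimal executor loop: run each step and store its result in a shared context."""
    for step in plan:
        agent = registry[step["agent"]]          # map agent name -> agent instance
        result = agent.run(context=context, **step["args"])
        context[step["agent"]] = result          # later steps can read earlier results
    return context
```

Keeping every intermediate result in a single context dict is what lets later agents (for example the FieldMapper) reuse the text and tables extracted earlier.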

Agent Pipeline

  1. Document Processing (a PyMuPDF sketch follows this list)

    # Document is processed in stages:
    1. PDF text extraction
    2. Table extraction
    3. Field mapping
    
  2. Field Extraction Process

    • Document type inference
    • User profile determination
    • Page-by-page scanning
    • Value extraction and validation
  3. Context Building

    • Document metadata
    • Field descriptions
    • User context
    • Execution history
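
To ground the document-processing stage, below is a minimal page-level text-extraction sketch using PyMuPDF (the `fitz` module listed under dependencies). The function name and the per-page dict shape are illustrative assumptions; the PDFAgent in this repository may organize its output differently.

```python
import fitz  # PyMuPDF

def extract_pages(pdf_path: str) -> list[dict]:
    """Return the plain text of each page, keeping page numbers for later field mapping."""
    pages = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            pages.append({"page": page_number, "text": page.get_text()})
    return pages
```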

Key Features

Document Type Inference

The system automatically infers the document type and the intended user profile:

# Example inference:
"Document type: Analytical report
User profile: Data analysts or researchers working with document analysis"
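
An inference like the one above could come from a single chat-completion call. The sketch below uses the `openai` package's `AzureOpenAI` client; the deployment name, API version, and environment-variable names are placeholders, and the real prompt is defined in config/prompts.yaml.

```python
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],      # placeholder variable names
    api_version="2024-02-01",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)

def infer_document_type(first_page_text: str) -> str:
    """Ask the model to describe the document type and the likely user profile."""
    response = client.chat.completions.create(
        model="gpt-4o",  # deployment name is an assumption
        messages=[
            {"role": "system", "content": "State the document type and the likely user profile."},
            {"role": "user", "content": first_page_text[:4000]},
        ],
    )
    return response.choices[0].message.content
```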

Field Mapping

The FieldMapper agent resolves each field in four steps:

  1. Document context analysis
  2. Page-by-page scanning
  3. Value extraction using LLM
  4. Result validation
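
Steps 2 through 4 amount to a scan-until-found loop over the extracted pages. The sketch below assumes a caller-supplied `ask_llm_for_field` callable and the per-page structure from the PyMuPDF sketch above; it is not the FieldMapper's actual code.

```python
def map_field(field: str, description: str, pages: list[dict], ask_llm_for_field) -> dict | None:
    """Scan pages in order and return the first extracted value along with its provenance."""
    for page in pages:
        value = ask_llm_for_field(field=field, description=description, page_text=page["text"])
        if value:  # simplistic validation: any non-empty answer counts as a hit
            return {"field": field, "value": value, "page": page["page"]}
    return None  # field not found on any page
```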

Execution Traces

The system maintains detailed execution traces:

  • Tool execution history
  • Success/failure status
  • Detailed logs
  • Result storage
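
As an illustration, a single trace entry could be modeled with a small dataclass like the one below; the structure actually stored by the executor may differ.

```python
from dataclasses import dataclass, field

@dataclass
class TraceEntry:
    """One tool/agent execution, as shown in the UI's trace view."""
    tool: str                                  # e.g. "FieldMapper"
    success: bool                              # success/failure status
    logs: list[str] = field(default_factory=list)
    result: object = None                      # whatever the agent returned
```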

Technical Setup

  1. Dependencies

    # Key dependencies:
    - streamlit
    - pandas
    - PyMuPDF (fitz)
    - Azure OpenAI
    - Azure Document Intelligence
    
  2. Configuration

    • Environment variables for API keys
    • Prompt templates in config/prompts.yaml
    • Settings in config/settings.py
  3. Logging System

    # Custom logging setup:
    - LogCaptureHandler for UI display
    - Structured logging format
    - Execution history storage
    
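As a rough sketch, a `LogCaptureHandler` of this kind can be built on the standard `logging.Handler`, buffering formatted records in memory so the Streamlit UI can render them; the handler in this repository may differ in its details.

```python
import logging

class LogCaptureHandler(logging.Handler):
    """Collects formatted log records in memory for display in the UI."""

    def __init__(self) -> None:
        super().__init__()
        self.records: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(self.format(record))

# Usage sketch: attach the handler, then render handler.records in Streamlit.
handler = LogCaptureHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
logging.getLogger().addHandler(handler)
```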

Development Guidelines

  1. Adding New Agents

    • Inherit from the base agent class
    • Implement the required methods
    • Register the agent in the planner configuration (see the sketch after this list)
  2. Modifying Extraction Logic

    • Update prompt templates
    • Modify field mapping logic
    • Adjust validation rules
  3. Extending Functionality

    • Add new field types
    • Implement custom validators
    • Create new output formats
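
A minimal sketch of guideline 1, assuming a base class with a `run(context, **kwargs)` method and using a hypothetical `KeywordAgent` as the new agent; the repository's actual base class and registration mechanism may differ.

```python
class BaseAgent:
    """Stand-in for the repository's base agent class."""
    name: str = "BaseAgent"

    def run(self, context: dict, **kwargs) -> object:
        raise NotImplementedError


class KeywordAgent(BaseAgent):
    """Hypothetical agent that reports which pages mention a given keyword."""
    name = "KeywordAgent"

    def run(self, context: dict, keyword: str, **kwargs) -> list[int]:
        pages = context.get("PDFAgent", [])      # reuse the earlier extraction results
        return [p["page"] for p in pages if keyword.lower() in p["text"].lower()]
```

The remaining step would be registering the new agent wherever the planner resolves agent names, so that generated plans can refer to it.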

Testing

  • Unit tests for agents (see the example after this list)
  • Integration tests for pipeline
  • End-to-end testing with sample PDFs
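
As an example of an agent-level unit test, the pytest sketch below exercises the `map_field` function from the Field Mapping sketch above with a stubbed LLM call; the repository's real tests and fixtures may look different.

```python
# test_field_mapper_sketch.py -- assumes the map_field sketch above is importable
def fake_llm(field, description, page_text):
    return "2024-01-01" if "Date:" in page_text else ""

def test_map_field_returns_first_match():
    pages = [
        {"page": 1, "text": "Cover page"},
        {"page": 2, "text": "Date: 2024-01-01"},
    ]
    result = map_field("Date", "Report date", pages, fake_llm)
    assert result == {"field": "Date", "value": "2024-01-01", "page": 2}
```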

Deployment

  • Streamlit app deployment
  • Environment configuration
  • API key management
  • Logging setup

Future Improvements

  • Enhanced error handling
  • Additional field types
  • Improved validation
  • Performance optimization
  • Extended documentation