# Architecture Documentation

## System Overview

The Deep-Research PDF Field Extractor is a multi-agent system designed to extract structured data from biotech-related PDFs. The system uses Azure Document Intelligence for document processing and Azure OpenAI for intelligent field extraction.

## Core Architecture

### Multi-Agent Design

The system follows a multi-agent architecture where each agent has a specific responsibility:

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    PDFAgent     │    │   TableAgent    │    │   IndexAgent    │
│                 │    │                 │    │                 │
│ • PDF Text      │───▶│ • Table         │───▶│ • Semantic      │
│   Extraction    │    │   Processing    │    │   Indexing      │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                                       │
                                                       ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│ UniqueIndices   │    │ UniqueIndices   │    │  FieldMapper    │
│  Combinator     │    │   LoopAgent     │    │     Agent       │
│                 │    │                 │    │                 │
│ • Extract       │───▶│ • Loop through  │    │ • Extract       │
│   combinations  │    │   combinations  │    │   individual    │
│                 │    │ • Add fields    │    │   fields        │
└─────────────────┘    └─────────────────┘    └─────────────────┘
```

### Execution Flow

#### Original Strategy Flow

```
1. PDFAgent         → Extract text from PDF
2. TableAgent       → Process tables with Azure DI
3. IndexAgent       → Create semantic search index
4. ForEachField     → Iterate through fields
5. FieldMapperAgent → Extract each field value
```

#### Unique Indices Strategy Flow

```
1. PDFAgent                → Extract text from PDF
2. TableAgent              → Process tables with Azure DI
3. UniqueIndicesCombinator → Extract unique combinations
4. UniqueIndicesLoopAgent  → Extract additional fields for each combination
```

## Agent Details

### PDFAgent
- **Purpose**: Extract text content from PDF files
- **Technology**: PyMuPDF (fitz)
- **Output**: Raw text content
- **Error Handling**: Graceful handling of corrupted PDFs

### TableAgent
- **Purpose**: Process tables using Azure Document Intelligence
- **Technology**: Azure DI Layout Analysis
- **Features**:
  - Table structure preservation
  - Rowspan/colspan handling
  - HTML table generation for debugging
- **Output**: Processed table data

### UniqueIndicesCombinator
- **Purpose**: Extract unique combinations of the specified indices
- **Input**: Document text, unique indices descriptions
- **LLM Prompt**: Structured prompt for combination extraction
- **Output**: JSON array of unique combinations
- **Cost Tracking**: Tracks input/output tokens

### UniqueIndicesLoopAgent
- **Purpose**: Extract additional fields for each unique combination
- **Input**: Unique combinations, field descriptions
- **Process**: Loops through each combination
- **LLM Calls**: One call per combination
- **Error Handling**: Continues despite partial failures
- **Output**: Complete data with all fields

### FieldMapperAgent
- **Purpose**: Extract individual field values
- **Strategies**:
  - Page-by-page analysis
  - Semantic search fallback
  - Unique indices strategy
- **Features**: Context-aware extraction
- **Output**: Field values with confidence scores

### IndexAgent
- **Purpose**: Create semantic search indices
- **Technology**: Azure OpenAI Embeddings
- **Features**: Chunk-based indexing
- **Output**: Searchable document index

## Services

### LLMClient

```python
class LLMClient:
    def __init__(self, settings):
        # Azure OpenAI configuration
        self._deployment = settings.AZURE_OPENAI_DEPLOYMENT
        self._max_retries = settings.LLM_MAX_RETRIES
        self._base_delay = settings.LLM_BASE_DELAY

    def responses(self, prompt, **kwargs):
        # Retry logic with exponential backoff
        # Cost tracking integration
        # Error handling
        ...
```

**Key Features:**
- Retry logic with exponential backoff (see the sketch below)
- Cost tracking integration
- Error classification (retryable vs. non-retryable)
- Jitter to avoid the thundering-herd problem
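The retry behaviour can be pictured as follows. This is a minimal sketch, not the project's actual code: `call_with_retry` and its `call_api` argument are hypothetical names, and the defaults simply mirror the retry settings shown later under Configuration Management.

```python
import random
import time

RETRYABLE_MARKERS = ("Timeout", "Connection")

def call_with_retry(call_api, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Retry a flaky API call with exponential backoff and jitter.

    Sketch only: `call_api` stands in for the real Azure OpenAI request.
    """
    for attempt in range(max_retries):
        try:
            return call_api()
        except Exception as exc:
            # Retry 5xx responses and connection-level failures; surface
            # everything else (and the final attempt) to the caller.
            retryable = (getattr(exc, "status_code", 0) >= 500
                         or any(m in str(exc) for m in RETRYABLE_MARKERS))
            if not retryable or attempt == max_retries - 1:
                raise
            # Double the delay on each attempt, cap it at max_delay, and
            # add jitter so concurrent clients do not retry in lockstep.
            delay = min(base_delay * 2 ** attempt, max_delay)
            time.sleep(delay + random.uniform(0, delay))
```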
### CostTracker

```python
from typing import List

class CostTracker:
    def __init__(self):
        # LLMCall is a project-internal record type for one LLM invocation
        self.llm_calls: List[LLMCall] = []
        self.current_file_costs = {}
        self.total_costs = {}

    def add_llm_tokens(self, input_tokens, output_tokens, description):
        # Track individual LLM calls
        # Calculate costs
        # Store detailed information
        ...
```

**Key Features:**
- Individual call tracking
- Cost calculation based on Azure pricing
- Detailed breakdown by operation
- Session and total cost tracking

### AzureDIService

```python
class AzureDIService:
    def extract_tables(self, pdf_bytes):
        # Azure DI Layout Analysis
        # Table structure preservation
        # HTML debugging output
        ...
```

**Key Features:**
- Layout analysis for complex documents
- Table structure preservation
- Debug output generation
- Error handling for DI operations

## Data Flow

### Context Management

The system uses a context dictionary to pass data between agents:

```python
ctx = {
    "pdf_file": pdf_file,
    "text": extracted_text,
    "fields": field_list,
    "unique_indices": unique_indices,
    "field_descriptions": field_descriptions,
    "cost_tracker": cost_tracker,
    "results": [],
    "strategy": strategy,
}
```

### Result Processing

Results are processed through multiple stages (see the sketch below):

1. **Raw Extraction**: LLM responses in JSON format
2. **Validation**: JSON parsing and structure validation
3. **Flattening**: Conversion to tabular format
4. **DataFrame**: Final structured output
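As an illustration of stages 2-4, here is a minimal sketch of turning a raw LLM response into a DataFrame. The function name and the assumed JSON shape (a flat array of `{field: value}` objects) are illustrative, not the pipeline's actual schema:

```python
import json
import pandas as pd

def results_to_dataframe(raw_response: str) -> pd.DataFrame:
    """Validate a raw LLM JSON response and flatten it into a DataFrame.

    Sketch only: assumes the LLM returns a JSON array of flat
    {field: value} objects, one per unique-index combination.
    """
    try:
        records = json.loads(raw_response)   # stage 2: validation
    except json.JSONDecodeError:
        return pd.DataFrame()                # degrade gracefully on bad JSON
    if not isinstance(records, list):
        records = [records]
    # Stages 3-4: flatten to a table; missing fields become NaN columns.
    return pd.DataFrame.from_records(records)
```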
## Error Handling Strategy

### Retry Logic

```python
def _should_retry(self, exception) -> bool:
    # Retry on 5xx errors
    if hasattr(exception, 'status_code'):
        return exception.status_code >= 500
    # Retry on connection errors
    return any(error in str(exception) for error in ['Timeout', 'Connection'])
```

### Graceful Degradation
- Continue processing despite partial failures
- Return null values for failed extractions
- Log detailed error information
- Maintain cost tracking during failures

### Error Classification
- **Retryable**: 503, 500, connection timeouts
- **Non-retryable**: 400, 401, validation errors
- **Fatal**: Configuration errors, missing dependencies

## Performance Considerations

### Optimization Strategies
1. **Parallel Processing**: Independent field extraction
2. **Caching**: Session state for field descriptions
3. **Batching**: Group similar operations
4. **Early Termination**: Stop on critical failures

### Resource Management
- **Memory**: Efficient text processing
- **API Limits**: Respect Azure rate limits
- **Cost Control**: Detailed tracking and alerts
- **Timeout Handling**: Configurable timeouts

## Security

### Data Protection
- No persistent storage of sensitive data
- Secure API key management
- Session-based data handling
- Log sanitization

### Access Control
- Environment variable configuration
- API key validation
- Error message sanitization

## Monitoring and Observability

### Logging Strategy

```python
# Structured logging with levels
logger.info(f"Processing {len(combinations)} combinations")
logger.debug(f"LLM response: {response[:200]}...")
logger.error(f"Failed to extract field: {field}")
```

### Metrics Collection
- LLM call counts and durations
- Token usage and costs
- Success/failure rates
- Processing times

### Debug Information
- Detailed execution traces
- Cost breakdown tables
- Error context and stack traces
- Performance metrics

## Configuration Management

### Settings Structure

```python
class Settings(BaseSettings):
    # Azure OpenAI
    AZURE_OPENAI_ENDPOINT: str
    AZURE_OPENAI_API_KEY: str
    AZURE_OPENAI_DEPLOYMENT: str

    # Azure Document Intelligence
    AZURE_DI_ENDPOINT: str
    AZURE_DI_KEY: str

    # Retry Configuration
    LLM_MAX_RETRIES: int = 5
    LLM_BASE_DELAY: float = 1.0
    LLM_MAX_DELAY: float = 60.0
```

### Environment Variables
- `.env` file support
- Environment variable overrides
- Validation and defaults
- Secure key management

## Testing Strategy

### Unit Tests
- Individual agent testing
- Service layer testing
- Mocked external dependencies
- Cost tracking validation

### Integration Tests
- End-to-end workflows
- Error scenario testing
- Performance benchmarking
- Cost accuracy validation

### Test Coverage
- Core functionality: 90%+
- Error handling: 100%
- Cost tracking: 100%
- Retry logic: 100%

## Deployment

### Requirements
- Python 3.9+
- Azure OpenAI access
- Azure Document Intelligence access
- Streamlit for the UI

### Dependencies

```
azure-ai-documentintelligence
openai
streamlit
pandas
pymupdf
pydantic-settings
```

### Environment Setup
1. Install dependencies
2. Configure environment variables
3. Set up Azure resources
4. Test connectivity
5. Deploy the application

## Future Enhancements

### Planned Features
- **Batch Processing**: Multiple-document processing
- **Custom Models**: Domain-specific extraction
- **Advanced Caching**: Redis-based caching
- **API Endpoints**: REST API for integration
- **Real-time Processing**: Streaming document processing

### Scalability Improvements
- **Microservices**: Agent separation
- **Queue System**: Asynchronous processing
- **Load Balancing**: Multiple instances
- **Database Integration**: Persistent storage

### Performance Optimizations
- **Vector Search**: Enhanced semantic search
- **Model Optimization**: Smaller, faster models
- **Parallel Processing**: Multi-threaded extraction (see the sketch below)
- **Memory Optimization**: Efficient data structures
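To make the multi-threaded extraction idea above concrete, here is a minimal sketch of running independent field extractions in parallel. The `extract_field` callable and the four-worker limit are assumptions for illustration, not the current implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def extract_fields_parallel(fields: list, ctx: dict, extract_field) -> dict:
    """Run independent field extractions concurrently.

    Sketch only: `extract_field(field, ctx)` stands in for whatever
    per-field extraction the FieldMapperAgent performs, and the worker
    count would need tuning against Azure rate limits.
    """
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {field: pool.submit(extract_field, field, ctx) for field in fields}
        # Failed extractions degrade to None rather than aborting the batch,
        # matching the graceful-degradation strategy described earlier.
        results = {}
        for field, future in futures.items():
            try:
                results[field] = future.result()
            except Exception:
                results[field] = None
        return results
```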