---
title: Doctorecord
emoji: 🚀
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
- streamlit
pinned: false
short_description: Multi-agent PDF field extraction
---

# Deep-Research PDF Field Extractor

A multi-agent system for extracting structured data from biotech-related PDFs using Azure Document Intelligence and Azure OpenAI.

## Features

- **Multi-Agent Architecture**: Uses specialized agents for different extraction tasks
- **Azure Integration**: Leverages Azure Document Intelligence and Azure OpenAI
- **Flexible Extraction Strategies**: Supports both the original and unique indices strategies
- **Robust Error Handling**: Implements retry logic with exponential backoff
- **Comprehensive Cost Tracking**: Monitors API usage and costs for all LLM calls
- **Streamlit UI**: User-friendly interface for document processing
- **Graceful Degradation**: Continues processing even with partial failures

## Installation

1. Clone the repository
2. Install dependencies: `pip install -r requirements.txt`
3. Set up environment variables (see the Configuration section)
4. Run the application: `streamlit run src/app.py`

## Configuration

### Environment Variables

Create a `.env` file with the following variables:

```env
# Azure OpenAI
AZURE_OPENAI_ENDPOINT=your_endpoint
AZURE_OPENAI_API_KEY=your_api_key
AZURE_OPENAI_DEPLOYMENT=your_deployment_name
AZURE_OPENAI_API_VERSION=2025-03-01-preview

# Azure Document Intelligence
AZURE_DI_ENDPOINT=your_di_endpoint
AZURE_DI_KEY=your_di_key

# Retry Configuration (Optional)
LLM_MAX_RETRIES=5
LLM_BASE_DELAY=1.0
LLM_MAX_DELAY=60.0
```

### Retry Configuration

The system implements robust retry logic to handle transient service errors:

- **LLM_MAX_RETRIES**: Maximum number of retry attempts (default: 5)
- **LLM_BASE_DELAY**: Base delay in seconds for exponential backoff (default: 1.0)
- **LLM_MAX_DELAY**: Maximum delay in seconds (default: 60.0)

The retry logic automatically handles:

- 503 Service Unavailable errors
- 500 Internal Server Error
- Connection timeouts
- Network errors

Retries use exponential backoff with jitter to prevent thundering herd problems (a code sketch appears in the Error Handling section).

## Usage

### Original Strategy

Processes documents page by page, extracting fields individually using semantic search and LLM-based extraction.

**Workflow:**

```
PDFAgent → TableAgent → ForEachField → FieldMapperAgent
```

### Unique Indices Strategy

Extracts data based on unique combinations of specified indices, then loops through each combination to extract additional fields.

**Workflow:**

```
PDFAgent → TableAgent → UniqueIndicesCombinator → UniqueIndicesLoopAgent
```

**Step-by-step process** (see the sketch after this list):

1. **PDFAgent**: Extracts text from PDF files
2. **TableAgent**: Processes tables using Azure Document Intelligence
3. **UniqueIndicesCombinator**: Extracts unique combinations of specified indices (e.g., Protein Lot, Peptide, Timepoint, Modification)
4. **UniqueIndicesLoopAgent**: Loops through each combination to extract additional fields (e.g., Chain, Percentage, Seq Loc)
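To make the looping step concrete, here is a minimal sketch of the unique-indices pattern. The helper `extract_fields_for_combo` is a hypothetical stand-in for the LLM-backed extraction call, not the project's actual API:

```python
def extract_fields_for_combo(document_text: str, combo: dict, fields: list) -> dict:
    """Hypothetical stand-in for the LLM call that fills in per-combination fields."""
    raise NotImplementedError("replace with a real LLM-backed extraction call")


def loop_over_combinations(document_text: str, combos: list, extra_fields: list) -> list:
    """For each unique index combination, extract the remaining fields.

    A failed extraction yields None for every extra field, so one bad
    combination does not abort the whole run (graceful degradation).
    """
    rows = []
    for combo in combos:
        try:
            values = extract_fields_for_combo(document_text, combo, extra_fields)
        except Exception:
            values = {field: None for field in extra_fields}
        # Merge the index combination with the newly extracted values.
        rows.append({**combo, **values})
    return rows
```

Each output row pairs an index combination with the additional fields, as in the example below.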
**Example Output:**

```json
[
  {
    "Protein Lot": "P066_L14_H31_0-hulgG-LALAPG-FJB",
    "Peptide": "PLTFGAGTK",
    "Timepoint": "0w",
    "Modification": "Clipping",
    "Chain": "Heavy",
    "Percentage": "90.0",
    "Seq Loc": "HC(1-31)"
  },
  {
    "Protein Lot": "P066_L14_H31_0-hulgG-LALAPG-FJB",
    "Peptide": "PLTFGAGTK",
    "Timepoint": "4w",
    "Modification": "Clipping",
    "Chain": "Heavy",
    "Percentage": "85.0",
    "Seq Loc": "HC(1-31)"
  }
]
```

## Architecture

### Agents

- **PDFAgent**: Extracts text from PDF files using PyMuPDF
- **TableAgent**: Processes tables using Azure Document Intelligence with layout analysis
- **UniqueIndicesCombinator**: Extracts unique combinations of specified indices from documents
- **UniqueIndicesLoopAgent**: Loops through combinations to extract additional field values
- **FieldMapperAgent**: Maps individual fields to values using LLM-based extraction
- **IndexAgent**: Creates semantic search indices for improved field extraction

### Services

- **LLMClient**: Azure OpenAI wrapper with retry logic and cost tracking
- **AzureDIService**: Azure Document Intelligence integration with table processing
- **CostTracker**: Comprehensive API usage and cost monitoring
- **EmbeddingClient**: Semantic search capabilities

### Data Flow

1. **Document Processing**: PDF text and table extraction
2. **Strategy Selection**: Choose between the original or unique indices approach
3. **Field Extraction**: LLM-based extraction with detailed field descriptions
4. **Cost Tracking**: Monitor all API usage and calculate costs
5. **Result Processing**: Convert to structured format (DataFrame/CSV)

## Cost Tracking

The system provides comprehensive cost tracking for all operations:

### LLM Costs

- **Input Tokens**: Tracked for each LLM call with descriptions
- **Output Tokens**: Tracked for each LLM call with descriptions
- **Cost Calculation**: Based on Azure OpenAI pricing
- **Detailed Breakdown**: Individual call costs in the UI

### Document Intelligence Costs

- **Pages Processed**: Tracked per operation
- **Operation Types**: Layout analysis, custom models, etc.
- **Cost Calculation**: Based on Azure DI pricing

### Cost Display

- **Real-time Updates**: Costs shown during execution
- **Detailed Table**: Breakdown of all LLM calls
- **Total Summary**: Combined costs for the entire operation

## Error Handling

The system implements comprehensive error handling:

1. **Retry Logic**: Automatic retries for transient errors with exponential backoff (a sketch follows this list)
2. **Graceful Degradation**: Continues processing even if some combinations fail
3. **Partial Results**: Returns data for successful extractions with null values for failures
4. **Detailed Logging**: Comprehensive logging for debugging and monitoring
5. **Cost Tracking**: Monitors API usage even during failures
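As a rough illustration of the retry behavior in item 1, here is a minimal sketch of exponential backoff with jitter driven by the `LLM_*` environment variables. The name `with_retries` is illustrative, not the codebase's actual API:

```python
import os
import random
import time

MAX_RETRIES = int(os.getenv("LLM_MAX_RETRIES", "5"))
BASE_DELAY = float(os.getenv("LLM_BASE_DELAY", "1.0"))
MAX_DELAY = float(os.getenv("LLM_MAX_DELAY", "60.0"))


def with_retries(call, *args, **kwargs):
    """Run `call`, retrying transient failures with exponential backoff and jitter."""
    for attempt in range(MAX_RETRIES):
        try:
            return call(*args, **kwargs)
        # A real client would also catch the SDK's 500/503 errors here.
        except (ConnectionError, TimeoutError):
            if attempt == MAX_RETRIES - 1:
                raise  # out of attempts; surface the error to the caller
            # Double the delay each attempt, cap it, and add random jitter
            # so concurrent clients do not retry in lockstep.
            delay = min(BASE_DELAY * (2 ** attempt), MAX_DELAY)
            time.sleep(delay + random.uniform(0, delay))
```

Client errors such as 400 or 401 are raised immediately rather than retried, matching the breakdown below.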
### Error Types Handled

- ✅ **503 Service Unavailable** (Azure service overload)
- ✅ **500 Internal Server Error** (server-side issues)
- ✅ **Connection timeouts** (network issues)
- ✅ **Network errors** (infrastructure problems)
- ❌ **400 Bad Request** (client errors, not retried)
- ❌ **401 Unauthorized** (authentication errors, not retried)

## Field Descriptions

The system supports detailed field descriptions to improve extraction accuracy:

### Field Description Format

```json
{
  "field_name": {
    "description": "Detailed description of the field",
    "format": "Expected format (String, Float, etc.)",
    "examples": "Example values",
    "possible_values": "Comma-separated list of possible values"
  }
}
```

### UI Support

- **Editable Tables**: Add, edit, and remove field descriptions
- **Session State**: Persists descriptions during the session
- **Validation**: Ensures proper format and structure

## Testing

The system includes test suites for its core reliability features:

### Test Scripts

- **test_retry.py**: Verifies retry logic with simulated failures
- **test_cost_tracking.py**: Validates cost tracking functionality

### Running Tests

```bash
python test_retry.py
python test_cost_tracking.py
```

## Performance

### Optimization Features

- **Retry Logic**: Handles transient failures automatically
- **Cost Optimization**: Detailed tracking to monitor usage
- **Graceful Degradation**: Continues with partial results
- **Caching**: Session state for field descriptions

### Expected Performance

- **Small Documents**: 30-60 seconds
- **Large Documents**: 2-5 minutes
- **Cost Efficiency**: ~$0.01-0.10 per document (depending on size)

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

## Troubleshooting

### Common Issues

**503 Service Unavailable Errors**
- The system automatically retries with exponential backoff
- Check Azure service status if the errors persist
- Adjust the retry configuration if needed

**Cost Tracking Shows Zero**
- Ensure the cost tracker is properly initialized
- Check that agents are passing context correctly
- Verify that LLM calls are actually being made

**Partial Results**
- Some combinations may fail due to document structure
- Check the execution logs for specific failures
- Results include null values for failed extractions

### Debug Mode

Enable detailed logging by setting the log level to DEBUG in the application.

## License

[Add your license information here]

## Overview

The PDF Field Extractor helps you extract specific information from PDF documents. It can extract any fields you specify, such as dates, names, values, locations, and more. The tool is particularly useful for converting unstructured PDF data into structured, analyzable formats.

## How to Use

Follow these steps in the app; a minimal code sketch of the same flow appears after the list.

1. **Upload Your PDF**
   - Click the "Upload PDF" button
   - Select your PDF file from your computer
2. **Specify Fields to Extract**
   - Enter the fields you want to extract, separated by commas
   - Example: `Date, Name, Value, Location, Page, FileName`
3. **Optional: Add Field Descriptions**
   - You can provide additional context about the fields
   - This helps the system better understand what to look for
4. **Run Extraction**
   - Click the "Run extraction" button
   - Wait for the process to complete
   - View your results in a table format
5. **Download Results**
   - Download your extracted data as a CSV file
   - View execution traces and logs if needed
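For orientation, here is a minimal, self-contained sketch of that flow using standard Streamlit widgets; `run_extraction` is a hypothetical placeholder for the real multi-agent pipeline:

```python
import pandas as pd
import streamlit as st


def run_extraction(pdf_bytes: bytes, fields: list) -> list:
    """Hypothetical placeholder for the multi-agent extraction pipeline."""
    return [{field: None for field in fields}]


st.title("PDF Field Extractor")

# Step 1: upload a PDF.
uploaded = st.file_uploader("Upload PDF", type="pdf")

# Step 2: comma-separated field names to extract.
fields_raw = st.text_input("Fields to extract", "Date, Name, Value, Location")
fields = [f.strip() for f in fields_raw.split(",") if f.strip()]

# Steps 4-5: run the extraction, show the table, offer a CSV download.
if uploaded is not None and st.button("Run extraction"):
    records = run_extraction(uploaded.read(), fields)
    df = pd.DataFrame(records, columns=fields)
    st.dataframe(df)
    st.download_button("Download CSV", df.to_csv(index=False), file_name="results.csv")
```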
## Features

- Automatic document type detection
- Smart field extraction
- Support for tables and text
- Detailed execution traces
- Downloadable results and logs

## Support

For technical documentation and architecture details, please refer to:

- [Architecture Overview](ARCHITECTURE.md)
- [Developer Documentation](DEVELOPER.md)