# Changelog

All notable changes to the Deep-Research PDF Field Extractor project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

### Added
- Comprehensive cost tracking for all LLM calls
- Retry logic with exponential backoff for Azure OpenAI API calls
- Unique Indices Strategy for document processing
- UniqueIndicesCombinator agent for extracting unique combinations
- UniqueIndicesLoopAgent for processing combinations with additional fields
- Enhanced field descriptions with format, examples, and possible values
- Detailed cost breakdown tables in the UI
- Graceful degradation for partial failures
- Test scripts for retry logic and cost tracking validation

### Changed
- Updated FieldMapperAgent to support multiple extraction strategies
- Enhanced LLM prompts with detailed field descriptions
- Improved error handling by classifying errors as retryable or non-retryable
- Updated UI to support strategy selection and field description tables
- Modified executor to handle unique indices strategy workflow
- Enhanced logging with cost tracking information

### Fixed
- Cost tracking context not being passed to agents
- UI issues with redundant labels and spacing
- Peptide sequence extraction errors
- 503 Service Unavailable error handling
- Context management in agents

## [1.0.0] - 2024-01-XX

### Added
- Multi-agent architecture for PDF field extraction
- Azure Document Intelligence integration
- Azure OpenAI integration for intelligent extraction
- Streamlit-based user interface
- PDF text and table extraction capabilities
- Semantic search for improved field mapping
- Document context inference
- Field description support
- Execution trace monitoring
- Result export functionality

### Features
- **PDFAgent**: Extracts text from PDF files using PyMuPDF
- **TableAgent**: Processes tables using Azure Document Intelligence
- **FieldMapperAgent**: Maps fields to values using LLM-based extraction
- **IndexAgent**: Creates semantic search indices
- **Planner**: Generates execution plans using Azure OpenAI
- **Executor**: Orchestrates agent execution and manages context

### Technical Implementation
- Base agent class for consistent agent implementation
- Context management system for data passing between agents
- Error handling and logging infrastructure
- Configuration management with environment variables
- Session state management for UI persistence

## [0.9.0] - 2024-01-XX

### Added
- Initial project structure
- Basic PDF processing capabilities
- Azure service integrations
- Streamlit application framework

### Features
- PDF text extraction
- Basic field mapping
- Simple UI
- Azure OpenAI integration

## [0.8.0] - 2024-01-XX

### Added
- Project initialization
- Basic architecture design
- Development environment setup
- Documentation structure

---

## Detailed Changes

### Cost Tracking Implementation

**Added:**
- `CostTracker` class for comprehensive API usage monitoring
- Individual LLM call tracking with descriptions
- Token usage monitoring (input and output)
- Cost calculation based on Azure OpenAI pricing
- Detailed cost breakdown tables
- Session and total cost tracking
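
The tracking described above can be pictured with a minimal sketch. Only the `CostTracker` class name comes from this changelog; the `LLMCall` record, the method names, and the per-token prices are assumptions made for illustration and do not reflect actual Azure OpenAI rates or the code in `src/services/cost_tracker.py`.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed placeholder prices in USD per 1,000 tokens (not real Azure OpenAI rates).
INPUT_PRICE_PER_1K = 0.01
OUTPUT_PRICE_PER_1K = 0.03


@dataclass
class LLMCall:
    """One tracked LLM call (hypothetical record shape)."""
    description: str
    input_tokens: int
    output_tokens: int

    @property
    def cost(self) -> float:
        # Input and output tokens are priced separately.
        return (self.input_tokens / 1000) * INPUT_PRICE_PER_1K + (
            self.output_tokens / 1000
        ) * OUTPUT_PRICE_PER_1K


@dataclass
class CostTracker:
    """Collects calls and sums their cost (method names are assumptions)."""
    calls: List[LLMCall] = field(default_factory=list)

    def add_call(self, description: str, input_tokens: int, output_tokens: int) -> LLMCall:
        call = LLMCall(description, input_tokens, output_tokens)
        self.calls.append(call)
        return call

    def total_cost(self) -> float:
        return sum(call.cost for call in self.calls)

    def reset(self) -> None:
        # For example, when the executor starts processing a new file.
        self.calls.clear()


tracker = CostTracker()
tracker.add_call("field mapping", input_tokens=1000, output_tokens=500)
total = tracker.total_cost()
```

Keeping one record per call is what makes a per-call breakdown table possible; the total is derived from the records rather than stored separately.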

**Files Added:**
- `src/services/cost_tracker.py` - New cost tracking service

**Files Modified:**
- `src/services/llm_client.py` - Integrated cost tracking
- `src/agents/field_mapper_agent.py` - Added cost tracking context
- `src/agents/unique_indices_combinator.py` - Added cost tracking context
- `src/agents/unique_indices_loop_agent.py` - Added cost tracking context
- `src/orchestrator/executor.py` - Reset cost tracker for new files
- `src/app.py` - Display cost information in UI

### Retry Logic Implementation

**Added:**
- Exponential backoff with jitter for retry attempts
- Configurable retry parameters via environment variables
- Error classification (retryable vs non-retryable)
- Graceful handling of transient failures

**Configuration:**
- `LLM_MAX_RETRIES`: Maximum retry attempts (default: 5)
- `LLM_BASE_DELAY`: Base delay for exponential backoff (default: 1.0s)
- `LLM_MAX_DELAY`: Maximum delay cap (default: 60.0s)
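
As an illustration of how these three settings might interact, here is a minimal sketch of capped exponential backoff with jitter. The environment variable names and defaults come from the list above; `RetryableError`, `backoff_delay`, `call_with_retry`, the choice of "full jitter", and the reading of `LLM_MAX_RETRIES` as retries after the first attempt are assumptions, not the actual code in `src/services/llm_client.py`.

```python
import os
import random
import time

MAX_RETRIES = int(os.getenv("LLM_MAX_RETRIES", "5"))
BASE_DELAY = float(os.getenv("LLM_BASE_DELAY", "1.0"))
MAX_DELAY = float(os.getenv("LLM_MAX_DELAY", "60.0"))


class RetryableError(Exception):
    """Hypothetical marker for transient failures such as HTTP 503."""


def backoff_delay(attempt: int, base: float = BASE_DELAY, cap: float = MAX_DELAY) -> float:
    """Delay before retry `attempt` (0-based): random value up to the capped exponential."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retry(fn, max_retries: int = MAX_RETRIES, sleep=time.sleep):
    """Call fn(); on RetryableError wait and try again, up to max_retries retries."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RetryableError:
            if attempt == max_retries:
                raise  # retries exhausted, surface the last error
            sleep(backoff_delay(attempt))


# Usage: a call that fails twice, then succeeds. Delays are recorded, not slept.
attempts = []


def flaky_call():
    attempts.append(1)
    if len(attempts) < 3:
        raise RetryableError("503 Service Unavailable")
    return "ok"


delays = []
result = call_with_retry(flaky_call, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; jitter spreads retries out so that concurrent clients do not all retry at the same moment.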

**Files Added:**
- `test_retry.py` - Test script for retry functionality

**Files Modified:**
- `src/services/llm_client.py` - Added retry logic

### Unique Indices Strategy

**Added:**
- `UniqueIndicesCombinator` agent for extracting unique combinations
- `UniqueIndicesLoopAgent` agent for processing combinations
- Strategy selection in UI
- Enhanced field descriptions for unique indices

**Workflow:**
1. Extract unique combinations of specified indices
2. Loop through each combination to extract additional fields
3. Return complete data structure
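
The three steps above can be sketched in plain Python. In the project both steps are performed by LLM-backed agents; here they are replaced by simple stand-ins, and the function names (`unique_combinations`, `run_strategy`) and the sample field names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]


def unique_combinations(rows: List[Row], indices: List[str]) -> List[Tuple[str, ...]]:
    """Step 1: collect unique index combinations, keeping first-seen order."""
    seen = dict.fromkeys(tuple(row[i] for i in indices) for row in rows)
    return list(seen)


def run_strategy(
    rows: List[Row],
    indices: List[str],
    extract_fields: Callable[[Tuple[str, ...]], Row],
) -> List[Row]:
    """Steps 2 and 3: extract additional fields per combination and merge the results."""
    results = []
    for combo in unique_combinations(rows, indices):
        record = dict(zip(indices, combo))
        record.update(extract_fields(combo))  # stand-in for a per-combination LLM call
        results.append(record)
    return results


rows = [
    {"peptide": "A", "timepoint": "T0"},
    {"peptide": "A", "timepoint": "T0"},  # duplicate combination, processed once
    {"peptide": "B", "timepoint": "T1"},
]
results = run_strategy(rows, ["peptide", "timepoint"], lambda combo: {"purity": "unknown"})
```

Deduplicating before the loop means each combination triggers one extraction call, which matters when every call has a token cost.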

**Files Added:**
- `src/agents/unique_indices_combinator.py`
- `src/agents/unique_indices_loop_agent.py`

**Files Modified:**
- `src/orchestrator/planner.py` - Added unique indices strategy
- `src/orchestrator/executor.py` - Added new agents to tools
- `src/app.py` - Added strategy selection and UI components

### Enhanced Field Descriptions

**Added:**
- Table-based field description interface
- Support for format, examples, and possible values
- Session state management for descriptions
- Enhanced LLM prompts with detailed field information
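
One way such descriptions could be turned into prompt text is sketched below. The `FieldDescription` dataclass, `render_field`, `build_prompt`, and the prompt wording are all assumptions for illustration; the real prompts in the agents may look quite different.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FieldDescription:
    """Hypothetical container for one row of the field description table."""
    name: str
    description: str = ""
    format: str = ""
    examples: List[str] = field(default_factory=list)
    possible_values: List[str] = field(default_factory=list)


def render_field(fd: FieldDescription) -> str:
    """Render one field, skipping any attribute the user left empty."""
    lines = [f"Field: {fd.name}"]
    if fd.description:
        lines.append(f"  Description: {fd.description}")
    if fd.format:
        lines.append(f"  Format: {fd.format}")
    if fd.examples:
        lines.append(f"  Examples: {', '.join(fd.examples)}")
    if fd.possible_values:
        lines.append(f"  Possible values: {', '.join(fd.possible_values)}")
    return "\n".join(lines)


def build_prompt(fields: List[FieldDescription]) -> str:
    """Join all rendered fields under a short instruction line."""
    header = "Extract the following fields from the document."
    return header + "\n\n" + "\n\n".join(render_field(f) for f in fields)


prompt = build_prompt(
    [FieldDescription("timepoint", format="T<number>", examples=["T0", "T12"])]
)
```

Omitting empty attributes keeps the prompt short when only some columns of the table are filled in.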

**UI Improvements:**
- Editable tables for field descriptions
- Add/remove row functionality
- Persistent session state
- Better layout and spacing

**Files Modified:**
- `src/app.py` - Enhanced UI with field description tables
- `src/agents/field_mapper_agent.py` - Enhanced prompts
- `src/agents/unique_indices_combinator.py` - Enhanced prompts
- `src/agents/unique_indices_loop_agent.py` - Enhanced prompts

### Error Handling Improvements

**Added:**
- Graceful degradation for partial failures
- Error classification into retryable and non-retryable types
- Detailed error logging and reporting
- Cost tracking during failures

**Error Types Handled:**
- 503 Service Unavailable (retryable)
- 500 Internal Server Error (retryable)
- Connection timeouts (retryable)
- Network errors (retryable)
- 400 Bad Request (non-retryable)
- 401 Unauthorized (non-retryable)
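
The classification above could be expressed as a small predicate. This is a sketch under assumptions: `HTTPError` and `is_retryable` are hypothetical names, and treating every error not listed above as non-retryable is a choice made here, not something this changelog specifies.

```python
RETRYABLE_STATUS = {500, 503}
NON_RETRYABLE_STATUS = {400, 401}


class HTTPError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def is_retryable(error: Exception) -> bool:
    """Return True for transient failures that are worth retrying."""
    status = getattr(error, "status_code", None)
    if status in RETRYABLE_STATUS:
        return True
    if status in NON_RETRYABLE_STATUS:
        return False
    # Built-in timeout and connection errors stand in for network failures.
    return isinstance(error, (TimeoutError, ConnectionError))


checks = {
    "503": is_retryable(HTTPError(503)),
    "401": is_retryable(HTTPError(401)),
    "timeout": is_retryable(TimeoutError("timed out")),
}
```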

### Testing Infrastructure

**Added:**
- `test_cost_tracking.py` - Validates cost tracking functionality
- `test_retry.py` - Tests retry logic with simulated failures
- Mock-based testing for external dependencies
- Comprehensive test coverage for new features
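
Mock-based testing of an external dependency might look like the following sketch, which uses the standard library's `unittest.mock`. The `extract_field` function and the client's `complete` method are hypothetical and are not taken from `test_cost_tracking.py` or `test_retry.py`.

```python
from unittest.mock import Mock


def extract_field(client, prompt: str) -> str:
    """Hypothetical function under test: asks an LLM client and tidies the reply."""
    return client.complete(prompt).strip()


# The mock replaces the real LLM client, so no network call is made.
fake_client = Mock()
fake_client.complete.return_value = "  ACDEFG \n"

result = extract_field(fake_client, "Extract the peptide sequence.")

# Verify that the dependency was called exactly once with the expected prompt.
fake_client.complete.assert_called_once_with("Extract the peptide sequence.")
```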

### Documentation Updates

**Added:**
- Comprehensive README.md with all new features
- Detailed ARCHITECTURE.md with technical implementation
- Developer-focused DEVELOPER.md with coding standards
- CHANGELOG.md for version tracking

**Documentation Coverage:**
- Installation and setup instructions
- Configuration management
- API usage and examples
- Troubleshooting guides
- Development guidelines

### Performance Optimizations

**Added:**
- Efficient context management
- Optimized LLM prompt design
- Better memory usage patterns
- Improved error recovery

**Monitoring:**
- Real-time cost tracking
- Performance metrics
- Detailed logging
- Debug information

---

## Migration Guide

### From Version 1.0.0 to Unreleased

1. **Update Environment Variables:**
   ```bash
   # Add new retry configuration
   LLM_MAX_RETRIES=5
   LLM_BASE_DELAY=1.0
   LLM_MAX_DELAY=60.0
   ```

2. **Update Dependencies:**
   ```bash
   pip install -r requirements.txt
   ```

3. **New Features:**
   - Cost tracking is now enabled by default
   - Retry logic handles transient failures automatically
   - Unique indices strategy available in UI
   - Enhanced field descriptions with tables

### Breaking Changes

- None in this release. All changes are backward compatible.

### Deprecations

- None in this release

---

## Future Roadmap

### Planned Features (Next Release)
- Batch processing for multiple documents
- Custom model support for domain-specific extraction
- Advanced caching with Redis
- REST API endpoints for integration
- Real-time streaming document processing

### Performance Improvements
- Vector search enhancements
- Model optimization for faster processing
- Parallel processing capabilities
- Memory optimization improvements

### Scalability Enhancements
- Microservices architecture
- Queue-based asynchronous processing
- Load balancing support
- Database integration for persistent storage

---

## Contributing

For information on contributing to this project, please see the [Contributing Guide](CONTRIBUTING.md) and [Developer Documentation](DEVELOPER.md).

## Support

For support and questions, please refer to:
- [README.md](README.md) - General documentation
- [ARCHITECTURE.md](ARCHITECTURE.md) - Technical architecture
- [DEVELOPER.md](DEVELOPER.md) - Development guidelines
- [Issues](https://github.com/your-repo/issues) - Bug reports and feature requests