Commit 15fdcff: Initial commit with document parser

Files changed:
- .cursorrules +46 -0
- .github/workflows/sync-to-hub.yml +20 -0
- README-HF.md +46 -0
- README.md +81 -0
- app.py +160 -0
- config.json +12 -0
- dockling_parser/__init__.py +11 -0
- dockling_parser/exceptions.py +19 -0
- dockling_parser/parser.py +94 -0
- dockling_parser/types.py +25 -0
- requirements.txt +10 -0
.cursorrules
ADDED
@@ -0,0 +1,46 @@
```text
You are an expert in Python, FastAPI, microservices architecture, and serverless environments.

Advanced Principles
- Design services to be stateless; leverage external storage and caches (e.g., Redis) for state persistence.
- Implement API gateways and reverse proxies (e.g., NGINX, Traefik) for handling traffic to microservices.
- Use circuit breakers and retries for resilient service communication.
- Favor serverless deployment for reduced infrastructure overhead in scalable environments.
- Use asynchronous workers (e.g., Celery, RQ) for handling background tasks efficiently.

Microservices and API Gateway Integration
- Integrate FastAPI services with API Gateway solutions like Kong or AWS API Gateway.
- Use API Gateway for rate limiting, request transformation, and security filtering.
- Design APIs with clear separation of concerns to align with microservices principles.
- Implement inter-service communication using message brokers (e.g., RabbitMQ, Kafka) for event-driven architectures.

Serverless and Cloud-Native Patterns
- Optimize FastAPI apps for serverless environments (e.g., AWS Lambda, Azure Functions) by minimizing cold start times.
- Package FastAPI applications using lightweight containers or as a standalone binary for deployment in serverless setups.
- Use managed services (e.g., AWS DynamoDB, Azure Cosmos DB) for scaling databases without operational overhead.
- Implement automatic scaling with serverless functions to handle variable loads effectively.

Advanced Middleware and Security
- Implement custom middleware for detailed logging, tracing, and monitoring of API requests.
- Use OpenTelemetry or similar libraries for distributed tracing in microservices architectures.
- Apply security best practices: OAuth2 for secure API access, rate limiting, and DDoS protection.
- Use security headers (e.g., CORS, CSP) and implement content validation using tools like OWASP ZAP.

Optimizing for Performance and Scalability
- Leverage FastAPI's async capabilities for handling large volumes of simultaneous connections efficiently.
- Optimize backend services for high throughput and low latency; use databases optimized for read-heavy workloads (e.g., Elasticsearch).
- Use caching layers (e.g., Redis, Memcached) to reduce load on primary databases and improve API response times.
- Apply load balancing and service mesh technologies (e.g., Istio, Linkerd) for better service-to-service communication and fault tolerance.

Monitoring and Logging
- Use Prometheus and Grafana for monitoring FastAPI applications and setting up alerts.
- Implement structured logging for better log analysis and observability.
- Integrate with centralized logging systems (e.g., ELK Stack, AWS CloudWatch) for aggregated logging and monitoring.

Key Conventions
1. Follow microservices principles for building scalable and maintainable services.
2. Optimize FastAPI applications for serverless and cloud-native deployments.
3. Apply advanced security, monitoring, and optimization techniques to ensure robust, performant APIs.

Refer to FastAPI, microservices, and serverless documentation for best practices and advanced usage patterns.
```
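The "Monitoring and Logging" rules above call for structured logging. A minimal sketch of what that can look like with only the standard library (the `JsonFormatter` class and its field choices are illustrative, not part of this repository):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object, one per line."""
    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)

def make_logger(name):
    """Build a logger whose output is machine-parseable JSON lines."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

One JSON object per line is the shape that centralized systems like the ELK Stack or CloudWatch expect to ingest.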
.github/workflows/sync-to-hub.yml
ADDED
@@ -0,0 +1,20 @@
```yaml
name: Sync to Hugging Face Hub
on:
  push:
    branches: [main]

  # to run this workflow manually from the Actions tab
  workflow_dispatch:

jobs:
  sync-to-hub:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0
          lfs: true
      - name: Push to hub
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
        run: git push https://rahulkumar:$HF_TOKEN@huggingface.co/spaces/rahulkumar/dockling-parser main
```
README-HF.md
ADDED
@@ -0,0 +1,46 @@
# 📄 Smart Document Parser

A powerful document parsing application that automatically extracts structured information from various document formats.

## 🚀 Features

- **Multiple Format Support**: PDF, DOCX, TXT, HTML, and Markdown
- **Rich Information Extraction**:
  - Document content with preserved formatting
  - Comprehensive metadata
  - Section breakdown
  - Named entity recognition
- **Smart Processing**:
  - Automatic format detection
  - Confidence scoring
  - Error handling

## 🎯 How to Use

1. **Upload Document**: Click the upload button or drag & drop your document
2. **Process**: Click "Process Document"
3. **View Results**: Explore the extracted information in different tabs:
   - 📝 Content: Main document text
   - 📊 Metadata: Document properties
   - 📑 Sections: Document structure
   - 🏷️ Entities: Named entities

## 📋 Supported Formats

- PDF Documents (*.pdf)
- Word Documents (*.docx)
- Text Files (*.txt)
- HTML Files (*.html)
- Markdown Files (*.md)

## 🛠️ Technical Details

Built with:
- Docling: Advanced document processing
- Gradio: Interactive web interface
- Pydantic: Type-safe data handling
- Hugging Face Spaces: Cloud deployment

## 📝 License

MIT License
README.md
ADDED
@@ -0,0 +1,81 @@
# Dockling Parser

A powerful multiformat document parsing module built on top of Docling. This module provides a unified interface for parsing various document formats including PDF, DOCX, TXT, HTML, and Markdown.

## Features

- Unified interface for multiple document formats
- Rich metadata extraction
- Structured content parsing
- Format detection using MIME types
- Error handling and validation
- Type-safe using Pydantic models
- Web interface using Gradio

## Installation

```bash
pip install -r requirements.txt
```

## Usage

### Python API

```python
from dockling_parser import DocumentParser

# Initialize parser
parser = DocumentParser()

# Parse a document
result = parser.parse("path/to/document.pdf")

# Access parsed content
print(result.content)             # Get main text content
print(result.metadata)            # Get document metadata
print(result.structured_content)  # Get structured content (sections, paragraphs, etc.)

# Check format support
is_supported = parser.supports_format("application/pdf")
```

### Web Interface

The package includes a Gradio-based web interface for easy document parsing:

```bash
python app.py
```

This will launch a web interface with the following features:
- Drag-and-drop document upload
- Support for multiple document formats
- Automatic format detection
- Structured output display:
  - Document content
  - Metadata table
  - Section breakdown
  - Named entity recognition
- Confidence scoring

## Supported Formats

- PDF (application/pdf)
- DOCX (application/vnd.openxmlformats-officedocument.wordprocessingml.document)
- Plain Text (text/plain)
- HTML (text/html)
- Markdown (text/markdown)

## Error Handling

The module provides specific exceptions for different error cases:

- `UnsupportedFormatError`: When the document format is not supported
- `ParseError`: When document parsing fails
- `ValidationError`: When document validation fails
- `EncodingError`: When document encoding issues occur

## License

MIT License
app.py
ADDED
@@ -0,0 +1,160 @@
```python
import gradio as gr
import pandas as pd
from dockling_parser import DocumentParser
from dockling_parser.exceptions import ParserError

TITLE = "📄 Smart Document Parser"
DESCRIPTION = """
A powerful document parsing application that automatically extracts structured information from various document formats.
Upload any document (PDF, DOCX, TXT, HTML, Markdown) and get structured information extracted automatically.
"""

ARTICLE = """
## 🚀 Features

- Multiple Format Support: PDF, DOCX, TXT, HTML, and Markdown
- Rich Information Extraction
- Smart Processing with Confidence Scoring
- Automatic Format Detection

Made with ❤️ using Docling and Gradio
"""

# Initialize the document parser
parser = DocumentParser()

def process_document(file_path):
    """Process an uploaded document and return structured information"""
    try:
        # gr.File(type="filepath") hands us the upload as a path on disk
        result = parser.parse(file_path)

        # Metadata as a two-column table (pydantic v2 uses model_dump())
        metadata_df = pd.DataFrame([
            {"Property": k, "Value": str(v)}
            for k, v in result.metadata.model_dump().items()
        ])

        structured = result.structured_content or {}

        # Extract structured sections
        sections = structured.get('sections', [])
        sections_text = "\n\n".join(
            f"Section {i + 1}:\n{section}" for i, section in enumerate(sections)
        )

        # Format entities if available
        entities = structured.get('entities', {})
        entities_text = "\n".join(
            f"{entity_type}: {', '.join(entity_list)}"
            for entity_type, entity_list in entities.items()
        ) if entities else "No entities detected"

        return (
            result.content,                                      # Main content
            metadata_df,                                         # Metadata as table
            sections_text,                                       # Structured sections
            entities_text,                                       # Named entities
            f"Confidence Score: {result.confidence_score:.2f}"   # Confidence score
        )

    except ParserError as e:
        return (
            f"Error parsing document: {e}",
            pd.DataFrame(),
            "No sections available",
            "No entities available",
            "Confidence Score: 0.0"
        )
    except Exception as e:
        return (
            f"Unexpected error: {e}",
            pd.DataFrame(),
            "No sections available",
            "No entities available",
            "Confidence Score: 0.0"
        )

# Create Gradio interface
with gr.Blocks(title=TITLE, theme=gr.themes.Soft()) as iface:
    gr.Markdown(f"# {TITLE}")
    gr.Markdown(DESCRIPTION)

    with gr.Row():
        with gr.Column():
            file_input = gr.File(
                label="Upload Document",
                file_types=[".pdf", ".docx", ".txt", ".html", ".md"],
                type="filepath"  # Gradio 4 removed type="file"
            )
            submit_btn = gr.Button("Process Document", variant="primary")

        with gr.Column():
            confidence = gr.Textbox(label="Processing Confidence")

    with gr.Tabs():
        with gr.TabItem("📝 Content"):
            content_output = gr.Textbox(
                label="Extracted Content",
                lines=10,
                max_lines=30
            )

        with gr.TabItem("📊 Metadata"):
            metadata_output = gr.Dataframe(
                label="Document Metadata",
                headers=["Property", "Value"]
            )

        with gr.TabItem("📑 Sections"):
            sections_output = gr.Textbox(
                label="Document Sections",
                lines=10,
                max_lines=30
            )

        with gr.TabItem("🏷️ Entities"):
            entities_output = gr.Textbox(
                label="Named Entities",
                lines=5,
                max_lines=15
            )

    # Handle file submission
    submit_btn.click(
        fn=process_document,
        inputs=[file_input],
        outputs=[
            content_output,
            metadata_output,
            sections_output,
            entities_output,
            confidence
        ]
    )

    gr.Markdown("""
    ### 📌 Supported Formats
    - PDF Documents (*.pdf)
    - Word Documents (*.docx)
    - Text Files (*.txt)
    - HTML Files (*.html)
    - Markdown Files (*.md)
    """)

    gr.Markdown(ARTICLE)

# Launch the app
if __name__ == "__main__":
    iface.launch()
```
config.json
ADDED
@@ -0,0 +1,12 @@
```json
{
  "title": "Smart Document Parser",
  "emoji": "📄",
  "colorFrom": "blue",
  "colorTo": "indigo",
  "sdk": "gradio",
  "sdk_version": "4.0.0",
  "python_version": "3.10",
  "app_file": "app.py",
  "pinned": false,
  "license": "mit"
}
```
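Since `config.json` drives how the Space is built, a small sanity check can catch a missing key before push. A hypothetical validator sketch (the required-key set below is an assumption based on the fields this config actually uses, not an official schema):

```python
import json

# Keys this Space's config relies on (assumed, not an official schema)
REQUIRED_KEYS = {"title", "sdk", "sdk_version", "app_file"}

def validate_space_config(text):
    """Parse a Space config and return a sorted list of missing required keys."""
    config = json.loads(text)
    missing = REQUIRED_KEYS - config.keys()
    return sorted(missing)

sample = '{"title": "Smart Document Parser", "sdk": "gradio", "sdk_version": "4.0.0", "app_file": "app.py"}'
```

An empty result means every required key is present; anything else names what to add.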
dockling_parser/__init__.py
ADDED
@@ -0,0 +1,11 @@
```python
"""
Dockling Parser - A multiformat document parsing module using Docling
"""

__version__ = "0.1.0"

from .parser import DocumentParser
from .types import ParsedDocument, DocumentMetadata
from .exceptions import ParserError

__all__ = ["DocumentParser", "ParsedDocument", "DocumentMetadata", "ParserError"]
```
dockling_parser/exceptions.py
ADDED
@@ -0,0 +1,19 @@
```python
class ParserError(Exception):
    """Base exception for parser errors"""
    pass

class UnsupportedFormatError(ParserError):
    """Raised when document format is not supported"""
    pass

class ParseError(ParserError):
    """Raised when document parsing fails"""
    pass

class ValidationError(ParserError):
    """Raised when document validation fails"""
    pass

class EncodingError(ParserError):
    """Raised when document encoding cannot be determined or is not supported"""
    pass
```
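Because every specific error above subclasses `ParserError`, callers can choose their granularity: catch `UnsupportedFormatError` to react to one case, or `ParserError` to handle them all. A self-contained sketch of that pattern (the two classes and the `describe_failure` helper are redeclared here so the snippet runs on its own; the helper is illustrative, not part of the package):

```python
class ParserError(Exception):
    """Base exception for parser errors"""

class UnsupportedFormatError(ParserError):
    """Raised when document format is not supported"""

def describe_failure(exc):
    """Map a parser exception to a user-facing message, most specific first."""
    if isinstance(exc, UnsupportedFormatError):
        return "unsupported format"
    if isinstance(exc, ParserError):
        return "parsing failed"
    return "unexpected error"
```

Ordering matters: the subclass check must come before the base-class check, or the specific case is never reached.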
dockling_parser/parser.py
ADDED
@@ -0,0 +1,94 @@
```python
import os
from pathlib import Path
from typing import Optional, Dict, Any, Union
import magic
import docling as dl
from datetime import datetime

from .types import ParsedDocument, DocumentMetadata
from .exceptions import UnsupportedFormatError, ParseError

class DocumentParser:
    """
    A multiformat document parser using Docling
    """

    SUPPORTED_FORMATS = {
        'application/pdf': 'pdf',
        'application/vnd.openxmlformats-officedocument.wordprocessingml.document': 'docx',
        'text/plain': 'txt',
        'text/html': 'html',
        'text/markdown': 'md'
    }

    def __init__(self, config: Optional[Dict[str, Any]] = None):
        self.config = config or {}
        self.docling = dl.Docling()

    def parse(self, file_path: Union[str, Path]) -> ParsedDocument:
        """
        Parse a document file and return structured content

        Args:
            file_path: Path to the document file

        Returns:
            ParsedDocument object containing parsed content and metadata

        Raises:
            UnsupportedFormatError: If the file format is not supported
            ParseError: If parsing fails
        """
        file_path = Path(file_path)
        if not file_path.exists():
            raise FileNotFoundError(f"File not found: {file_path}")

        mime_type = magic.from_file(str(file_path), mime=True)
        if mime_type not in self.SUPPORTED_FORMATS:
            raise UnsupportedFormatError(f"Unsupported file format: {mime_type}")

        try:
            # Get file metadata
            stats = file_path.stat()
            metadata = DocumentMetadata(
                filename=file_path.name,
                file_type=self.SUPPORTED_FORMATS[mime_type],
                size_bytes=stats.st_size,
                created_at=datetime.fromtimestamp(stats.st_ctime),
                modified_at=datetime.fromtimestamp(stats.st_mtime),
                mime_type=mime_type
            )

            # Parse document using Docling
            doc = self.docling.parse(str(file_path))

            # Extract content and structure
            content = doc.text
            structured_content = {
                'sections': doc.sections,
                'paragraphs': doc.paragraphs,
                'entities': doc.entities,
                'metadata': doc.metadata
            }

            # Update metadata with document-specific information
            if doc.metadata:
                metadata.title = doc.metadata.get('title')
                metadata.author = doc.metadata.get('author')
                metadata.pages = doc.metadata.get('pages')
                metadata.extra.update(doc.metadata)

            return ParsedDocument(
                content=content,
                metadata=metadata,
                raw_text=doc.raw_text,
                structured_content=structured_content,
                confidence_score=doc.confidence if hasattr(doc, 'confidence') else 1.0
            )

        except Exception as e:
            raise ParseError(f"Failed to parse document: {str(e)}") from e

    def supports_format(self, mime_type: str) -> bool:
        """Check if a given MIME type is supported"""
        return mime_type in self.SUPPORTED_FORMATS
```
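`DocumentParser` sniffs the MIME type from file contents via `python-magic`, which needs the native libmagic library. Where that dependency is unavailable, extension-based guessing with the standard library's `mimetypes` module is a rougher fallback; a sketch of that idea (the `guess_format` helper is illustrative and not part of the module):

```python
import mimetypes

def guess_format(filename, supported):
    """Guess a file's MIME type from its extension and map it to a format key."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    if mime_type is None or mime_type not in supported:
        return None  # unknown extension or unsupported type
    return supported[mime_type]

# Subset of the SUPPORTED_FORMATS mapping above, for illustration
SUPPORTED = {"application/pdf": "pdf", "text/html": "html"}
```

The trade-off: extension guessing cannot catch a mislabeled file (a `.txt` that is really a PDF), which is exactly why the module prefers content sniffing.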
dockling_parser/types.py
ADDED
@@ -0,0 +1,25 @@
```python
from datetime import datetime
from typing import Optional, Dict, Any
from pydantic import BaseModel, Field

class DocumentMetadata(BaseModel):
    """Metadata for parsed documents"""
    filename: str
    file_type: str
    size_bytes: int
    created_at: Optional[datetime] = None
    modified_at: Optional[datetime] = None
    author: Optional[str] = None
    title: Optional[str] = None
    pages: Optional[int] = None
    encoding: Optional[str] = None
    mime_type: str
    extra: Dict[str, Any] = Field(default_factory=dict)

class ParsedDocument(BaseModel):
    """Represents a parsed document with its content and metadata"""
    content: str
    metadata: DocumentMetadata
    raw_text: Optional[str] = None
    structured_content: Optional[Dict[str, Any]] = None
    confidence_score: float = Field(ge=0.0, le=1.0, default=1.0)
```
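The `Field(ge=0.0, le=1.0)` constraint means pydantic rejects any `confidence_score` outside [0, 1] at construction time, so downstream code never has to re-check the range. For readers unfamiliar with pydantic, the same guarantee sketched with only the standard library (this `Scored` class is illustrative, not part of the package):

```python
from dataclasses import dataclass

@dataclass
class Scored:
    """Mimic pydantic's ge/le bounds with a post-init range check."""
    confidence_score: float = 1.0

    def __post_init__(self):
        # Reject out-of-range scores at construction, like Field(ge=0.0, le=1.0)
        if not 0.0 <= self.confidence_score <= 1.0:
            raise ValueError("confidence_score must be within [0.0, 1.0]")
```

Validating at the boundary keeps invalid states unrepresentable, which is the design idea behind the pydantic models above.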
requirements.txt
ADDED
@@ -0,0 +1,10 @@
```text
docling>=0.1.0
pydantic>=2.0.0
python-magic-bin>=0.4.14
python-docx>=0.8.11
PyPDF2>=3.0.0
beautifulsoup4>=4.12.0
lxml>=4.9.0
gradio>=4.0.0
pandas>=1.5.0
huggingface-hub>=0.19.0
```