hellorahulk committed on
Commit 15fdcff · 0 Parent(s)

Initial commit with document parser
.cursorrules ADDED
@@ -0,0 +1,46 @@
+
+ You are an expert in Python, FastAPI, microservices architecture, and serverless environments.
+
+ Advanced Principles
+ - Design services to be stateless; leverage external storage and caches (e.g., Redis) for state persistence.
+ - Implement API gateways and reverse proxies (e.g., NGINX, Traefik) for handling traffic to microservices.
+ - Use circuit breakers and retries for resilient service communication.
+ - Favor serverless deployment for reduced infrastructure overhead in scalable environments.
+ - Use asynchronous workers (e.g., Celery, RQ) for handling background tasks efficiently.
+
+ Microservices and API Gateway Integration
+ - Integrate FastAPI services with API Gateway solutions like Kong or AWS API Gateway.
+ - Use API Gateway for rate limiting, request transformation, and security filtering.
+ - Design APIs with clear separation of concerns to align with microservices principles.
+ - Implement inter-service communication using message brokers (e.g., RabbitMQ, Kafka) for event-driven architectures.
+
+ Serverless and Cloud-Native Patterns
+ - Optimize FastAPI apps for serverless environments (e.g., AWS Lambda, Azure Functions) by minimizing cold start times.
+ - Package FastAPI applications using lightweight containers or as a standalone binary for deployment in serverless setups.
+ - Use managed services (e.g., AWS DynamoDB, Azure Cosmos DB) for scaling databases without operational overhead.
+ - Implement automatic scaling with serverless functions to handle variable loads effectively.
+
+ Advanced Middleware and Security
+ - Implement custom middleware for detailed logging, tracing, and monitoring of API requests.
+ - Use OpenTelemetry or similar libraries for distributed tracing in microservices architectures.
+ - Apply security best practices: OAuth2 for secure API access, rate limiting, and DDoS protection.
+ - Use security headers (e.g., CORS policies, CSP) and implement content validation using tools like OWASP ZAP.
+
+ Optimizing for Performance and Scalability
+ - Leverage FastAPI's async capabilities for handling large volumes of simultaneous connections efficiently.
+ - Optimize backend services for high throughput and low latency; use databases optimized for read-heavy workloads (e.g., Elasticsearch).
+ - Use caching layers (e.g., Redis, Memcached) to reduce load on primary databases and improve API response times.
+ - Apply load balancing and service mesh technologies (e.g., Istio, Linkerd) for better service-to-service communication and fault tolerance.
+
+ Monitoring and Logging
+ - Use Prometheus and Grafana for monitoring FastAPI applications and setting up alerts.
+ - Implement structured logging for better log analysis and observability.
+ - Integrate with centralized logging systems (e.g., ELK Stack, AWS CloudWatch) for aggregated logging and monitoring.
+
+ Key Conventions
+ 1. Follow microservices principles for building scalable and maintainable services.
+ 2. Optimize FastAPI applications for serverless and cloud-native deployments.
+ 3. Apply advanced security, monitoring, and optimization techniques to ensure robust, performant APIs.
+
+ Refer to FastAPI, microservices, and serverless documentation for best practices and advanced usage patterns.
+
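
The circuit-breaker/retry bullet above can be illustrated with a minimal stdlib sketch (a hypothetical `retry` decorator, not part of this repo; production services would typically use a library such as tenacity and add exponential backoff with jitter):

```python
import time

def retry(attempts=3, delay=0.0):
    """Minimal retry decorator: re-invoke the wrapped call up to `attempts` times."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            last_exc = None
            for _ in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    last_exc = exc
                    time.sleep(delay)
            raise last_exc
        return wrapper
    return decorator

calls = {"n": 0}

@retry(attempts=3)
def flaky():
    # Fails on the first two calls, succeeds on the third
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(flaky())  # succeeds on the third attempt
```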
.github/workflows/sync-to-hub.yml ADDED
@@ -0,0 +1,20 @@
+ name: Sync to Hugging Face Hub
+ on:
+   push:
+     branches: [main]
+
+   # to run this workflow manually from the Actions tab
+   workflow_dispatch:
+
+ jobs:
+   sync-to-hub:
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v3
+         with:
+           fetch-depth: 0
+           lfs: true
+       - name: Push to hub
+         env:
+           HF_TOKEN: ${{ secrets.HF_TOKEN }}
+         run: git push https://rahulkumar:$HF_TOKEN@huggingface.co/spaces/rahulkumar/dockling-parser main
README-HF.md ADDED
@@ -0,0 +1,46 @@
+ # 📄 Smart Document Parser
+
+ A powerful document parsing application that automatically extracts structured information from various document formats.
+
+ ## 🚀 Features
+
+ - **Multiple Format Support**: PDF, DOCX, TXT, HTML, and Markdown
+ - **Rich Information Extraction**:
+   - Document content with preserved formatting
+   - Comprehensive metadata
+   - Section breakdown
+   - Named entity recognition
+ - **Smart Processing**:
+   - Automatic format detection
+   - Confidence scoring
+   - Error handling
+
+ ## 🎯 How to Use
+
+ 1. **Upload Document**: Click the upload button or drag & drop your document
+ 2. **Process**: Click "Process Document"
+ 3. **View Results**: Explore the extracted information in different tabs:
+    - 📝 Content: Main document text
+    - 📊 Metadata: Document properties
+    - 📑 Sections: Document structure
+    - 🏷️ Entities: Named entities
+
+ ## 📋 Supported Formats
+
+ - PDF Documents (*.pdf)
+ - Word Documents (*.docx)
+ - Text Files (*.txt)
+ - HTML Files (*.html)
+ - Markdown Files (*.md)
+
+ ## 🛠️ Technical Details
+
+ Built with:
+ - Docling: Advanced document processing
+ - Gradio: Interactive web interface
+ - Pydantic: Type-safe data handling
+ - Hugging Face Spaces: Cloud deployment
+
+ ## 📝 License
+
+ MIT License
README.md ADDED
@@ -0,0 +1,81 @@
+ # Dockling Parser
+
+ A powerful multiformat document parsing module built on top of Docling. This module provides a unified interface for parsing various document formats including PDF, DOCX, TXT, HTML, and Markdown.
+
+ ## Features
+
+ - Unified interface for multiple document formats
+ - Rich metadata extraction
+ - Structured content parsing
+ - Format detection using MIME types
+ - Error handling and validation
+ - Type-safe using Pydantic models
+ - Web interface using Gradio
+
+ ## Installation
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ ## Usage
+
+ ### Python API
+
+ ```python
+ from dockling_parser import DocumentParser
+
+ # Initialize parser
+ parser = DocumentParser()
+
+ # Parse a document
+ result = parser.parse("path/to/document.pdf")
+
+ # Access parsed content
+ print(result.content)             # Main text content
+ print(result.metadata)            # Document metadata
+ print(result.structured_content)  # Structured content (sections, paragraphs, etc.)
+
+ # Check format support
+ is_supported = parser.supports_format("application/pdf")
+ ```
+
+ ### Web Interface
+
+ The package includes a Gradio-based web interface for easy document parsing:
+
+ ```bash
+ python app.py
+ ```
+
+ This will launch a web interface with the following features:
+ - Drag-and-drop document upload
+ - Support for multiple document formats
+ - Automatic format detection
+ - Structured output display:
+   - Document content
+   - Metadata table
+   - Section breakdown
+   - Named entity recognition
+   - Confidence scoring
+
+ ## Supported Formats
+
+ - PDF (application/pdf)
+ - DOCX (application/vnd.openxmlformats-officedocument.wordprocessingml.document)
+ - Plain Text (text/plain)
+ - HTML (text/html)
+ - Markdown (text/markdown)
+
+ ## Error Handling
+
+ The module provides specific exceptions for different error cases:
+
+ - `UnsupportedFormatError`: When the document format is not supported
+ - `ParseError`: When document parsing fails
+ - `ValidationError`: When document validation fails
+ - `EncodingError`: When document encoding issues occur
+
+ ## License
+
+ MIT License
app.py ADDED
@@ -0,0 +1,160 @@
+ import gradio as gr
+ import pandas as pd
+ from dockling_parser import DocumentParser
+ from dockling_parser.exceptions import ParserError
+
+ TITLE = "📄 Smart Document Parser"
+ DESCRIPTION = """
+ A powerful document parsing application that automatically extracts structured information from various document formats.
+ Upload any document (PDF, DOCX, TXT, HTML, Markdown) and get structured information extracted automatically.
+ """
+
+ ARTICLE = """
+ ## 🚀 Features
+
+ - Multiple Format Support: PDF, DOCX, TXT, HTML, and Markdown
+ - Rich Information Extraction
+ - Smart Processing with Confidence Scoring
+ - Automatic Format Detection
+
+ Made with ❤️ using Docling and Gradio
+ """
+
+ # Initialize the document parser
+ parser = DocumentParser()
+
+ def process_document(file_path):
+     """Process an uploaded document and return structured information"""
+     try:
+         # gr.File(type="filepath") passes the path of the uploaded temp file,
+         # so the document can be parsed in place without copying it first.
+         result = parser.parse(file_path)
+
+         # Metadata as a two-column table (Pydantic v2 uses model_dump(), not dict())
+         metadata_df = pd.DataFrame([{
+             "Property": k,
+             "Value": str(v)
+         } for k, v in result.metadata.model_dump().items()])
+
+         # Extract structured content
+         sections = result.structured_content.get('sections', [])
+         sections_text = "\n\n".join(f"Section {i+1}:\n{section}" for i, section in enumerate(sections))
+
+         # Format entities if available
+         entities = result.structured_content.get('entities', {})
+         entities_text = "\n".join(f"{entity_type}: {', '.join(entities_list)}"
+                                   for entity_type, entities_list in entities.items()) if entities else "No entities detected"
+
+         return (
+             result.content,                                      # Main content
+             metadata_df,                                         # Metadata as table
+             sections_text,                                       # Structured sections
+             entities_text,                                       # Named entities
+             f"Confidence Score: {result.confidence_score:.2f}"   # Confidence score
+         )
+
+     except ParserError as e:
+         return (
+             f"Error parsing document: {e}",
+             pd.DataFrame(),
+             "No sections available",
+             "No entities available",
+             "Confidence Score: 0.0"
+         )
+     except Exception as e:
+         return (
+             f"Unexpected error: {e}",
+             pd.DataFrame(),
+             "No sections available",
+             "No entities available",
+             "Confidence Score: 0.0"
+         )
+
+ # Create Gradio interface
+ with gr.Blocks(title=TITLE, theme=gr.themes.Soft()) as iface:
+     gr.Markdown(f"# {TITLE}")
+     gr.Markdown(DESCRIPTION)
+
+     with gr.Row():
+         with gr.Column():
+             file_input = gr.File(
+                 label="Upload Document",
+                 file_types=[".pdf", ".docx", ".txt", ".html", ".md"],
+                 type="filepath"  # Gradio 4.x accepts "filepath" or "binary"; "file" was removed
+             )
+             submit_btn = gr.Button("Process Document", variant="primary")
+
+         with gr.Column():
+             confidence = gr.Textbox(label="Processing Confidence")
+
+     with gr.Tabs():
+         with gr.TabItem("📝 Content"):
+             content_output = gr.Textbox(
+                 label="Extracted Content",
+                 lines=10,
+                 max_lines=30
+             )
+
+         with gr.TabItem("📊 Metadata"):
+             metadata_output = gr.Dataframe(
+                 label="Document Metadata",
+                 headers=["Property", "Value"]
+             )
+
+         with gr.TabItem("📑 Sections"):
+             sections_output = gr.Textbox(
+                 label="Document Sections",
+                 lines=10,
+                 max_lines=30
+             )
+
+         with gr.TabItem("🏷️ Entities"):
+             entities_output = gr.Textbox(
+                 label="Named Entities",
+                 lines=5,
+                 max_lines=15
+             )
+
+     # Handle file submission
+     submit_btn.click(
+         fn=process_document,
+         inputs=[file_input],
+         outputs=[
+             content_output,
+             metadata_output,
+             sections_output,
+             entities_output,
+             confidence
+         ]
+     )
+
+     gr.Markdown("""
+     ### 📌 Supported Formats
+     - PDF Documents (*.pdf)
+     - Word Documents (*.docx)
+     - Text Files (*.txt)
+     - HTML Files (*.html)
+     - Markdown Files (*.md)
+     """)
+
+     gr.Markdown(ARTICLE)
+
+ # Launch the app
+ if __name__ == "__main__":
+     iface.launch()
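
The entity formatting inside `process_document` reduces to a small pure function. A standalone sketch (a hypothetical `format_entities` helper, mirroring the app's join logic):

```python
def format_entities(entities):
    """Render a {entity_type: [names]} mapping as one line per entity type."""
    if not entities:
        return "No entities detected"
    return "\n".join(
        f"{entity_type}: {', '.join(names)}"
        for entity_type, names in entities.items()
    )

print(format_entities({"PERSON": ["Ada Lovelace"], "ORG": ["ACME Corp"]}))
```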
config.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "title": "Smart Document Parser",
+   "emoji": "📄",
+   "colorFrom": "blue",
+   "colorTo": "indigo",
+   "sdk": "gradio",
+   "sdk_version": "4.0.0",
+   "python_version": "3.10",
+   "app_file": "app.py",
+   "pinned": false,
+   "license": "mit"
+ }
dockling_parser/__init__.py ADDED
@@ -0,0 +1,11 @@
+ """
+ Dockling Parser - A multiformat document parsing module using Docling
+ """
+
+ __version__ = "0.1.0"
+
+ from .parser import DocumentParser
+ from .types import ParsedDocument, DocumentMetadata
+ from .exceptions import ParserError
+
+ __all__ = ["DocumentParser", "ParsedDocument", "DocumentMetadata", "ParserError"]
dockling_parser/exceptions.py ADDED
@@ -0,0 +1,19 @@
+ class ParserError(Exception):
+     """Base exception for parser errors"""
+     pass
+
+ class UnsupportedFormatError(ParserError):
+     """Raised when document format is not supported"""
+     pass
+
+ class ParseError(ParserError):
+     """Raised when document parsing fails"""
+     pass
+
+ class ValidationError(ParserError):
+     """Raised when document validation fails"""
+     pass
+
+ class EncodingError(ParserError):
+     """Raised when document encoding cannot be determined or is not supported"""
+     pass
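
Because every error type subclasses `ParserError`, callers can catch the base class alone and still see the specific failures. A self-contained sketch (re-declaring the two relevant classes, with a hypothetical `detect` helper for illustration):

```python
class ParserError(Exception):
    """Base exception for parser errors"""

class UnsupportedFormatError(ParserError):
    """Raised when document format is not supported"""

def detect(mime_type, supported):
    """Map a MIME type to a short format name, or raise the specific subclass."""
    if mime_type not in supported:
        raise UnsupportedFormatError(f"Unsupported file format: {mime_type}")
    return supported[mime_type]

try:
    detect("image/png", {"application/pdf": "pdf"})
except ParserError as e:
    # Catching the base class also catches UnsupportedFormatError
    print(f"caught: {e}")
```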
dockling_parser/parser.py ADDED
@@ -0,0 +1,94 @@
+ from pathlib import Path
+ from typing import Optional, Dict, Any, Union
+ from datetime import datetime
+
+ import magic
+ import docling as dl
+
+ from .types import ParsedDocument, DocumentMetadata
+ from .exceptions import UnsupportedFormatError, ParseError
+
+ class DocumentParser:
+     """
+     A multiformat document parser using Docling
+     """
+
+     SUPPORTED_FORMATS = {
+         'application/pdf': 'pdf',
+         'application/vnd.openxmlformats-officedocument.wordprocessingml.document': 'docx',
+         'text/plain': 'txt',
+         'text/html': 'html',
+         'text/markdown': 'md'
+     }
+
+     def __init__(self, config: Optional[Dict[str, Any]] = None):
+         self.config = config or {}
+         self.docling = dl.Docling()
+
+     def parse(self, file_path: Union[str, Path]) -> ParsedDocument:
+         """
+         Parse a document file and return structured content
+
+         Args:
+             file_path: Path to the document file
+
+         Returns:
+             ParsedDocument object containing parsed content and metadata
+
+         Raises:
+             FileNotFoundError: If the file does not exist
+             UnsupportedFormatError: If the file format is not supported
+             ParseError: If parsing fails
+         """
+         file_path = Path(file_path)
+         if not file_path.exists():
+             raise FileNotFoundError(f"File not found: {file_path}")
+
+         mime_type = magic.from_file(str(file_path), mime=True)
+         if mime_type not in self.SUPPORTED_FORMATS:
+             raise UnsupportedFormatError(f"Unsupported file format: {mime_type}")
+
+         try:
+             # Get file metadata
+             stats = file_path.stat()
+             metadata = DocumentMetadata(
+                 filename=file_path.name,
+                 file_type=self.SUPPORTED_FORMATS[mime_type],
+                 size_bytes=stats.st_size,
+                 created_at=datetime.fromtimestamp(stats.st_ctime),
+                 modified_at=datetime.fromtimestamp(stats.st_mtime),
+                 mime_type=mime_type
+             )
+
+             # Parse document using Docling
+             doc = self.docling.parse(str(file_path))
+
+             # Extract content and structure
+             content = doc.text
+             structured_content = {
+                 'sections': doc.sections,
+                 'paragraphs': doc.paragraphs,
+                 'entities': doc.entities,
+                 'metadata': doc.metadata
+             }
+
+             # Update metadata with document-specific information
+             if doc.metadata:
+                 metadata.title = doc.metadata.get('title')
+                 metadata.author = doc.metadata.get('author')
+                 metadata.pages = doc.metadata.get('pages')
+                 metadata.extra.update(doc.metadata)
+
+             return ParsedDocument(
+                 content=content,
+                 metadata=metadata,
+                 raw_text=doc.raw_text,
+                 structured_content=structured_content,
+                 confidence_score=doc.confidence if hasattr(doc, 'confidence') else 1.0
+             )
+
+         except Exception as e:
+             raise ParseError(f"Failed to parse document: {str(e)}") from e
+
+     def supports_format(self, mime_type: str) -> bool:
+         """Check if a given MIME type is supported"""
+         return mime_type in self.SUPPORTED_FORMATS
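
The parser sniffs file contents with python-magic before consulting `SUPPORTED_FORMATS`. A rough stdlib approximation of the same lookup can be built on the extension-based `mimetypes` module (weaker than content sniffing, shown only to illustrate the MIME-to-format mapping; `guess_format` is a hypothetical helper, not part of the module):

```python
import mimetypes

SUPPORTED_FORMATS = {
    'application/pdf': 'pdf',
    'application/vnd.openxmlformats-officedocument.wordprocessingml.document': 'docx',
    'text/plain': 'txt',
    'text/html': 'html',
    'text/markdown': 'md',
}

def guess_format(path):
    """Guess the short format name from the filename extension, or None."""
    mime, _ = mimetypes.guess_type(path)
    return SUPPORTED_FORMATS.get(mime)

print(guess_format("report.pdf"))  # pdf
print(guess_format("notes.xyz"))   # None
```

Extension-based guessing is easily fooled by renamed files, which is why the module prefers content sniffing.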
dockling_parser/types.py ADDED
@@ -0,0 +1,25 @@
+ from datetime import datetime
+ from typing import Optional, Dict, Any
+ from pydantic import BaseModel, Field
+
+ class DocumentMetadata(BaseModel):
+     """Metadata for parsed documents"""
+     filename: str
+     file_type: str
+     size_bytes: int
+     created_at: Optional[datetime] = None
+     modified_at: Optional[datetime] = None
+     author: Optional[str] = None
+     title: Optional[str] = None
+     pages: Optional[int] = None
+     encoding: Optional[str] = None
+     mime_type: str
+     extra: Dict[str, Any] = Field(default_factory=dict)
+
+ class ParsedDocument(BaseModel):
+     """Represents a parsed document with its content and metadata"""
+     content: str
+     metadata: DocumentMetadata
+     raw_text: Optional[str] = None
+     structured_content: Optional[Dict[str, Any]] = None
+     confidence_score: float = Field(default=1.0, ge=0.0, le=1.0)
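
The `confidence_score` field uses `Field(ge=0.0, le=1.0)`, so Pydantic rejects out-of-range values at model construction time. The same guard in a stdlib-only sketch (a hypothetical `ConfidenceScored` dataclass, for illustration; the real model relies on Pydantic's validation):

```python
from dataclasses import dataclass

@dataclass
class ConfidenceScored:
    """Mimics Field(default=1.0, ge=0.0, le=1.0) with a post-init check."""
    confidence_score: float = 1.0

    def __post_init__(self):
        if not 0.0 <= self.confidence_score <= 1.0:
            raise ValueError("confidence_score must be between 0.0 and 1.0")

print(ConfidenceScored(0.87).confidence_score)  # 0.87
```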
requirements.txt ADDED
@@ -0,0 +1,10 @@
+ docling>=0.1.0
+ pydantic>=2.0.0
+ python-magic-bin>=0.4.14
+ python-docx>=0.8.11
+ PyPDF2>=3.0.0
+ beautifulsoup4>=4.12.0
+ lxml>=4.9.0
+ gradio>=4.0.0
+ pandas>=1.5.0
+ huggingface-hub>=0.19.0