hellorahulk committed on
Commit
8b5c234
·
1 Parent(s): c26416f

[Cursor] Simplify Python handler for Vercel

Files changed (8)
  1. .cursorrules +117 -45
  2. .gitignore +43 -0
  3. README.md +155 -1
  4. api.py +230 -0
  5. api/index.py +122 -221
  6. index.py +0 -290
  7. requirements-prod.txt +11 -0
  8. vercel.json +0 -1
.cursorrules CHANGED
@@ -1,46 +1,118 @@
 
1
 
2
- You are an expert in Python, FastAPI, microservices architecture, and serverless environments.
3
-
4
- Advanced Principles
5
- - Design services to be stateless; leverage external storage and caches (e.g., Redis) for state persistence.
6
- - Implement API gateways and reverse proxies (e.g., NGINX, Traefik) for handling traffic to microservices.
7
- - Use circuit breakers and retries for resilient service communication.
8
- - Favor serverless deployment for reduced infrastructure overhead in scalable environments.
9
- - Use asynchronous workers (e.g., Celery, RQ) for handling background tasks efficiently.
10
-
11
- Microservices and API Gateway Integration
12
- - Integrate FastAPI services with API Gateway solutions like Kong or AWS API Gateway.
13
- - Use API Gateway for rate limiting, request transformation, and security filtering.
14
- - Design APIs with clear separation of concerns to align with microservices principles.
15
- - Implement inter-service communication using message brokers (e.g., RabbitMQ, Kafka) for event-driven architectures.
16
-
17
- Serverless and Cloud-Native Patterns
18
- - Optimize FastAPI apps for serverless environments (e.g., AWS Lambda, Azure Functions) by minimizing cold start times.
19
- - Package FastAPI applications using lightweight containers or as a standalone binary for deployment in serverless setups.
20
- - Use managed services (e.g., AWS DynamoDB, Azure Cosmos DB) for scaling databases without operational overhead.
21
- - Implement automatic scaling with serverless functions to handle variable loads effectively.
22
-
23
- Advanced Middleware and Security
24
- - Implement custom middleware for detailed logging, tracing, and monitoring of API requests.
25
- - Use OpenTelemetry or similar libraries for distributed tracing in microservices architectures.
26
- - Apply security best practices: OAuth2 for secure API access, rate limiting, and DDoS protection.
27
- - Use security headers (e.g., CORS, CSP) and implement content validation using tools like OWASP Zap.
28
-
29
- Optimizing for Performance and Scalability
30
- - Leverage FastAPI’s async capabilities for handling large volumes of simultaneous connections efficiently.
31
- - Optimize backend services for high throughput and low latency; use databases optimized for read-heavy workloads (e.g., Elasticsearch).
32
- - Use caching layers (e.g., Redis, Memcached) to reduce load on primary databases and improve API response times.
33
- - Apply load balancing and service mesh technologies (e.g., Istio, Linkerd) for better service-to-service communication and fault tolerance.
34
-
35
- Monitoring and Logging
36
- - Use Prometheus and Grafana for monitoring FastAPI applications and setting up alerts.
37
- - Implement structured logging for better log analysis and observability.
38
- - Integrate with centralized logging systems (e.g., ELK Stack, AWS CloudWatch) for aggregated logging and monitoring.
39
-
40
- Key Conventions
41
- 1. Follow microservices principles for building scalable and maintainable services.
42
- 2. Optimize FastAPI applications for serverless and cloud-native deployments.
43
- 3. Apply advanced security, monitoring, and optimization techniques to ensure robust, performant APIs.
44
-
45
- Refer to FastAPI, microservices, and serverless documentation for best practices and advanced usage patterns.
46
-
1
+ # Instructions
2
 
3
+ During your interaction with the user, if you find anything reusable in this project (e.g. the version of a library or a model name), especially a fix to a mistake you made or a correction you received, take note of it in the `Lessons` section of the `.cursorrules` file so you will not make the same mistake again.
4
+
5
+ You should also use the `.cursorrules` file as a scratchpad to organize your thoughts. When you receive a new task, first review the content of the scratchpad, clear out old tasks if necessary, explain the new task, and plan the steps you need to take to complete it. You can use todo markers to indicate progress, e.g.
6
+ [X] Task 1
7
+ [ ] Task 2
8
+
9
+ Also update the progress of the task in the Scratchpad when you finish a subtask.
10
+ Especially when you finish a milestone, use the scratchpad to reflect and plan; this improves the depth of your task accomplishment.
11
+ The goal is to help you maintain a big picture as well as the progress of the task. Always refer to the Scratchpad when you plan the next step.
12
+
13
+ # Tools
14
+
15
+ Note that all the tools are written in Python. If you need to do batch processing, you can consult the Python files and write your own script.
16
+
17
+ ## Screenshot Verification
18
+ The screenshot verification workflow allows you to capture screenshots of web pages and verify their appearance using LLMs. The following tools are available:
19
+
20
+ 1. Screenshot Capture:
21
+ ```bash
22
+ venv/bin/python tools/screenshot_utils.py URL [--output OUTPUT] [--width WIDTH] [--height HEIGHT]
23
+ ```
24
+
25
+ 2. LLM Verification with Images:
26
+ ```bash
27
+ venv/bin/python tools/llm_api.py --prompt "Your verification question" --provider {openai|anthropic} --image path/to/screenshot.png
28
+ ```
29
+
30
+ Example workflow:
31
+ ```python
32
+ from screenshot_utils import take_screenshot_sync
33
+ from llm_api import query_llm
34
+
35
+ # Take a screenshot
36
+ screenshot_path = take_screenshot_sync('https://example.com', 'screenshot.png')
37
+
38
+ # Verify with LLM
39
+ response = query_llm(
40
+ "What is the background color and title of this webpage?",
41
+ provider="openai", # or "anthropic"
42
+ image_path=screenshot_path
43
+ )
44
+ print(response)
45
+ ```
46
+
47
+ ## LLM
48
+
49
+ You always have an LLM at your side to help you with the task. For simple tasks, you could invoke the LLM by running the following command:
50
+ ```
51
+ venv/bin/python ./tools/llm_api.py --prompt "What is the capital of France?" --provider "anthropic"
52
+ ```
53
+
54
+ The LLM API supports multiple providers:
55
+ - OpenAI (default, model: gpt-4o)
56
+ - Azure OpenAI (model: configured via AZURE_OPENAI_MODEL_DEPLOYMENT in .env file, defaults to gpt-4o-ms)
57
+ - DeepSeek (model: deepseek-chat)
58
+ - Anthropic (model: claude-3-sonnet-20240229)
59
+ - Gemini (model: gemini-pro)
60
+ - Local LLM (model: Qwen/Qwen2.5-32B-Instruct-AWQ)
61
+
62
+ Usually, though, it is better to check the content of the file and use the APIs in `tools/llm_api.py` to invoke the LLM if needed.
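A minimal sketch of that programmatic route, assuming the same import path and `query_llm` signature used in the screenshot-verification example above:

```python
from llm_api import query_llm  # same import as the screenshot example; adjust the path if needed

# Text-only query; the provider falls back to the default (OpenAI) when omitted.
answer = query_llm("What is the capital of France?", provider="anthropic")
print(answer)
```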
63
+
64
+ ## Web browser
65
+
66
+ You could use the `tools/web_scraper.py` file to scrape the web.
67
+ ```
68
+ venv/bin/python ./tools/web_scraper.py --max-concurrent 3 URL1 URL2 URL3
69
+ ```
70
+ This will output the content of the web pages.
71
+
72
+ ## Search engine
73
+
74
+ You could use the `tools/search_engine.py` file to search the web.
75
+ ```
76
+ venv/bin/python ./tools/search_engine.py "your search keywords"
77
+ ```
78
+ This will output the search results in the following format:
79
+ ```
80
+ URL: https://example.com
81
+ Title: This is the title of the search result
82
+ Snippet: This is a snippet of the search result
83
+ ```
84
+ If needed, you can further use the `web_scraper.py` file to scrape the web page content.
85
+
86
+ # Lessons
87
+
88
+ ## User Specified Lessons
89
+
90
+ - You have a python venv in ./venv. Use it.
91
+ - Include info useful for debugging in the program output.
92
+ - Read the file before you try to edit it.
93
+ - Due to Cursor's limit, when you use `git` and `gh` and need to submit a multiline commit message, first write the message in a file, and then use `git commit -F <filename>` or similar command to commit. And then remove the file. Include "[Cursor] " in the commit message and PR title.
94
+
95
+ ## Cursor learned
96
+
97
+ - For search results, ensure proper handling of different character encodings (UTF-8) for international queries
98
+ - Add debug information to stderr while keeping the main output clean in stdout for better pipeline integration
99
+ - When using seaborn styles in matplotlib, use 'seaborn-v0_8' instead of 'seaborn' as the style name due to recent seaborn version changes
100
+ - Use 'gpt-4o' as the model name for OpenAI's GPT-4 with vision capabilities
101
+ - For Vercel deployments with FastAPI and Mangum, use older stable versions (FastAPI 0.88.0, Mangum 0.15.0, Pydantic 1.10.2) to avoid compatibility issues
102
+ - Keep Vercel configuration simple and avoid unnecessary configuration options that might cause conflicts
103
+
104
+ # Scratchpad
105
+
106
+ Current Task: Fix Vercel deployment issues with FastAPI and Mangum
107
+
108
+ Progress:
109
+ [X] Identified issue with newer versions of FastAPI and Mangum
110
+ [X] Updated dependencies to use older, stable versions
111
+ [X] Simplified FastAPI configuration
112
+ [X] Simplified Vercel configuration
113
+ [X] Successfully deployed to production
114
+
115
+ Next Steps:
116
+ [ ] Test all API endpoints
117
+ [ ] Add more functionality if needed
118
+ [ ] Consider adding monitoring and logging
.gitignore ADDED
@@ -0,0 +1,43 @@
1
+ # Python
2
+ __pycache__/
3
+ *.py[cod]
4
+ *$py.class
5
+ *.so
6
+ .Python
7
+ build/
8
+ develop-eggs/
9
+ dist/
10
+ downloads/
11
+ eggs/
12
+ .eggs/
13
+ lib/
14
+ lib64/
15
+ parts/
16
+ sdist/
17
+ var/
18
+ wheels/
19
+ *.egg-info/
20
+ .installed.cfg
21
+ *.egg
22
+
23
+ # Virtual Environment
24
+ venv/
25
+ env/
26
+ ENV/
27
+
28
+ # IDE
29
+ .idea/
30
+ .vscode/
31
+ *.swp
32
+ *.swo
33
+
34
+ # Vercel
35
+ .vercel/
36
+ .env
37
+ .env.local
38
+
39
+ # Temporary files
40
+ *.tmp
41
+ tmp/
42
+ temp/
43
+ .vercel
README.md CHANGED
@@ -54,4 +54,158 @@ Built with:
54
 
55
  ## 📝 License
56
 
57
- MIT License
 
54
 
55
  ## 📝 License
56
 
57
+ MIT License
58
+
59
+ # Document Parser API
60
+
61
+ A scalable FastAPI service for parsing various document formats (PDF, DOCX, TXT, HTML, Markdown) with automatic information extraction.
62
+
63
+ ## Features
64
+
65
+ - 📄 Multi-format support (PDF, DOCX, TXT, HTML, Markdown)
66
+ - 🔄 Asynchronous processing with background tasks
67
+ - 🌐 Support for both file uploads and URL inputs
68
+ - 📊 Structured information extraction
69
+ - 🔗 Webhook support for processing notifications
70
+ - 🚀 Highly scalable architecture
71
+ - 🛡️ Comprehensive error handling
72
+ - 📝 Detailed logging
73
+
74
+ ## Quick Start
75
+
76
+ ### Prerequisites
77
+
78
+ - Python 3.8+
79
+ - pip (Python package manager)
80
+
81
+ ### Installation
82
+
83
+ 1. Clone the repository:
84
+ ```bash
85
+ git clone https://github.com/yourusername/document-parser-api.git
86
+ cd document-parser-api
87
+ ```
88
+
89
+ 2. Install dependencies:
90
+ ```bash
91
+ pip install -r requirements.txt
92
+ ```
93
+
94
+ 3. Run the API server:
95
+ ```bash
96
+ python api.py
97
+ ```
98
+
99
+ The API will be available at `http://localhost:8000`
100
+
101
+ ## API Documentation
102
+
103
+ ### Endpoints
104
+
105
+ #### 1. Parse Document from File Upload
106
+ ```http
107
+ POST /parse/file
108
+ ```
109
+ - Upload a document file for parsing
110
+ - Optional callback URL for processing notification
111
+ - Returns a job ID for status tracking
112
+
113
+ #### 2. Parse Document from URL
114
+ ```http
115
+ POST /parse/url
116
+ ```
117
+ - Submit a document URL for parsing
118
+ - Optional callback URL for processing notification
119
+ - Returns a job ID for status tracking
120
+
121
+ #### 3. Check Processing Status
122
+ ```http
123
+ GET /status/{job_id}
124
+ ```
125
+ - Get the current status of a parsing job
126
+ - Returns processing status and results if completed
127
+
128
+ #### 4. Health Check
129
+ ```http
130
+ GET /health
131
+ ```
132
+ - Check if the API is running and healthy
133
+
134
+ ### Example Usage
135
+
136
+ #### Parse File
137
+ ```python
138
+ import requests
139
+
140
+ url = "http://localhost:8000/parse/file"
141
+ files = {"file": open("document.pdf", "rb")}
142
+ response = requests.post(url, files=files)
143
+ print(response.json())
144
+ ```
145
+
146
+ #### Parse URL
147
+ ```python
148
+ import requests
149
+
150
+ url = "http://localhost:8000/parse/url"
151
+ data = {
152
+ "url": "https://example.com/document.pdf",
153
+ "callback_url": "https://your-callback-url.com/webhook"
154
+ }
155
+ response = requests.post(url, json=data)
156
+ print(response.json())
157
+ ```
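
Both parse endpoints return a `job_id`, so a client can poll the status endpoint described above until processing finishes. A minimal sketch (host, polling interval, and timeout are illustrative):

```python
import time
import requests

def wait_for_result(job_id, base_url="http://localhost:8000", interval=2, timeout=60):
    """Poll GET /status/{job_id} until the job is completed or failed."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = requests.get(f"{base_url}/status/{job_id}").json()
        if status["status"] in ("completed", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError(f"Job {job_id} did not finish within {timeout} seconds")

job = requests.post("http://localhost:8000/parse/url",
                    json={"url": "https://example.com/document.pdf"}).json()
print(wait_for_result(job["job_id"]))
```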
158
+
159
+ ## Error Handling
160
+
161
+ The API implements comprehensive error handling:
162
+
163
+ - Invalid file formats
164
+ - Failed URL downloads
165
+ - Processing errors
166
+ - Invalid requests
167
+ - Server errors
168
+
169
+ All errors return appropriate HTTP status codes and detailed error messages.
170
+
171
+ ## Scaling Considerations
172
+
173
+ For production deployment, consider:
174
+
175
+ 1. **Job Storage**: Replace in-memory storage with Redis or a database (see the sketch after this list)
176
+ 2. **Load Balancing**: Deploy behind a load balancer
177
+ 3. **Worker Processes**: Adjust number of workers based on load
178
+ 4. **Rate Limiting**: Implement rate limiting for API endpoints
179
+ 5. **Monitoring**: Add metrics collection and monitoring
180
+
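For the job-storage point above, a possible sketch of replacing the in-memory `jobs` dict with Redis, assuming a local Redis instance and the `redis` package (key names are illustrative):

```python
import json
from typing import Optional

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def set_job(job_id: str, data: dict) -> None:
    # Persist job state as JSON under a namespaced key so any worker can read it.
    r.set(f"job:{job_id}", json.dumps(data))

def get_job(job_id: str) -> Optional[dict]:
    raw = r.get(f"job:{job_id}")
    return json.loads(raw) if raw else None

# Drop-in replacements for the dictionary operations in api.py:
#   jobs[job_id] = {"status": "processing"}  ->  set_job(job_id, {"status": "processing"})
#   jobs[job_id]                             ->  get_job(job_id)
```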
181
+ ## Development
182
+
183
+ ### Running Tests
184
+ ```bash
185
+ pytest tests/
186
+ ```
187
+
188
+ ### Local Development
189
+ ```bash
190
+ uvicorn api:app --reload --port 8000
191
+ ```
192
+
193
+ ### API Documentation
194
+ - Swagger UI: `http://localhost:8000/docs`
195
+ - ReDoc: `http://localhost:8000/redoc`
196
+
197
+ ## Contributing
198
+
199
+ 1. Fork the repository
200
+ 2. Create a feature branch
201
+ 3. Commit your changes
202
+ 4. Push to the branch
203
+ 5. Create a Pull Request
204
+
205
+ ## License
206
+
207
+ This project is licensed under the MIT License - see the LICENSE file for details.
208
+
209
+ ## Support
210
+
211
+ For support, please open an issue in the GitHub repository or contact the maintainers.
api.py ADDED
@@ -0,0 +1,230 @@
1
+ import os
2
+ from fastapi import FastAPI, HTTPException, UploadFile, File, BackgroundTasks
3
+ from fastapi.middleware.cors import CORSMiddleware
4
+ from pydantic import BaseModel, HttpUrl
5
+ import tempfile
6
+ import requests
7
+ from typing import Optional, List, Dict, Any
8
+ from dockling_parser import DocumentParser
9
+ from dockling_parser.exceptions import ParserError, UnsupportedFormatError
10
+ from dockling_parser.types import ParsedDocument
11
+ import logging
12
+ import aiofiles
13
+ import asyncio
14
+ from urllib.parse import urlparse
15
+ from mangum import Mangum
16
+ import httpx
17
+
18
+ # Configure logging
19
+ logging.basicConfig(level=logging.INFO)
20
+ logger = logging.getLogger(__name__)
21
+
22
+ # Initialize FastAPI app
23
+ app = FastAPI(
24
+ title="Document Parser API",
25
+ description="A scalable API for parsing various document formats",
26
+ version="1.0.0"
27
+ )
28
+
29
+ # Add CORS middleware
30
+ app.add_middleware(
31
+ CORSMiddleware,
32
+ allow_origins=["*"],
33
+ allow_credentials=True,
34
+ allow_methods=["*"],
35
+ allow_headers=["*"],
36
+ )
37
+
38
+ # Initialize document parser
39
+ parser = DocumentParser()
40
+
41
+ class URLInput(BaseModel):
42
+ url: HttpUrl
43
+ callback_url: Optional[HttpUrl] = None
44
+
45
+ class ErrorResponse(BaseModel):
46
+ error: str
47
+ detail: Optional[str] = None
48
+ code: str
49
+
50
+ class ParseResponse(BaseModel):
51
+ job_id: str
52
+ status: str
53
+ result: Optional[ParsedDocument] = None
54
+ error: Optional[str] = None
55
+
56
+ # In-memory job storage (replace with Redis/DB in production)
57
+ jobs = {}
58
+
59
+ async def process_document_async(job_id: str, file_path: str, callback_url: Optional[str] = None):
60
+ """Process document asynchronously"""
61
+ try:
62
+ # Update job status
63
+ jobs[job_id] = {"status": "processing"}
64
+
65
+ # Parse document
66
+ result = parser.parse(file_path)
67
+
68
+ # Update job with result
69
+ jobs[job_id] = {
70
+ "status": "completed",
71
+ "result": result
72
+ }
73
+
74
+ # Call callback URL if provided
75
+ if callback_url:
76
+ try:
77
+ await notify_callback(callback_url, job_id, result)
78
+ except Exception as e:
79
+ logger.error(f"Failed to notify callback URL: {str(e)}")
80
+
81
+ except Exception as e:
82
+ logger.error(f"Error processing document: {str(e)}")
83
+ jobs[job_id] = {
84
+ "status": "failed",
85
+ "error": str(e)
86
+ }
87
+ finally:
88
+ # Cleanup temporary file
89
+ try:
90
+ if os.path.exists(file_path):
91
+ os.unlink(file_path)
92
+ except Exception as e:
93
+ logger.error(f"Error cleaning up file: {str(e)}")
94
+
95
+ async def notify_callback(callback_url: str, job_id: str, result: ParsedDocument):
96
+ """Notify callback URL with results"""
97
+ try:
98
+ async with httpx.AsyncClient() as client:
99
+ await client.post(
100
+ callback_url,
101
+ json={
102
+ "job_id": job_id,
103
+ "result": result.dict()
104
+ }
105
+ )
106
+ except Exception as e:
107
+ logger.error(f"Failed to send callback: {str(e)}")
108
+
109
+ @app.post("/parse/file", response_model=ParseResponse)
110
+ async def parse_file(
111
+ background_tasks: BackgroundTasks,
112
+ file: UploadFile = File(...),
113
+ callback_url: Optional[HttpUrl] = None
114
+ ):
115
+ """
116
+ Parse a document from file upload
117
+ """
118
+ try:
119
+ # Create temporary file in /tmp for Vercel
120
+ suffix = os.path.splitext(file.filename)[1]
121
+ tmp_dir = "/tmp" if os.path.exists("/tmp") else tempfile.gettempdir()
122
+ tmp_path = os.path.join(tmp_dir, f"upload_{os.urandom(8).hex()}{suffix}")
123
+
124
+ content = await file.read()
125
+ with open(tmp_path, "wb") as f:
126
+ f.write(content)
127
+
128
+ # Generate job ID
129
+ job_id = f"job_{len(jobs) + 1}"
130
+
131
+ # Start background processing
132
+ background_tasks.add_task(
133
+ process_document_async,
134
+ job_id,
135
+ tmp_path,
136
+ str(callback_url) if callback_url else None
137
+ )
138
+
139
+ return ParseResponse(
140
+ job_id=job_id,
141
+ status="queued"
142
+ )
143
+
144
+ except Exception as e:
145
+ logger.error(f"Error handling file upload: {str(e)}")
146
+ raise HTTPException(
147
+ status_code=500,
148
+ detail=str(e)
149
+ )
150
+
151
+ @app.post("/parse/url", response_model=ParseResponse)
152
+ async def parse_url(input_data: URLInput, background_tasks: BackgroundTasks):
153
+ """
154
+ Parse a document from URL
155
+ """
156
+ try:
157
+ # Download file
158
+ async with httpx.AsyncClient() as client:
159
+ response = await client.get(str(input_data.url), follow_redirects=True)
160
+ response.raise_for_status()
161
+
162
+ # Get filename from URL or use default
163
+ filename = os.path.basename(urlparse(str(input_data.url)).path)
164
+ if not filename:
165
+ filename = "document.pdf"
166
+
167
+ # Save to temporary file in /tmp for Vercel
168
+ tmp_dir = "/tmp" if os.path.exists("/tmp") else tempfile.gettempdir()
169
+ tmp_path = os.path.join(tmp_dir, f"download_{os.urandom(8).hex()}{os.path.splitext(filename)[1]}")
170
+
171
+ with open(tmp_path, "wb") as f:
172
+ f.write(response.content)
173
+
174
+ # Generate job ID
175
+ job_id = f"job_{len(jobs) + 1}"
176
+
177
+ # Start background processing
178
+ background_tasks.add_task(
179
+ process_document_async,
180
+ job_id,
181
+ tmp_path,
182
+ str(input_data.callback_url) if input_data.callback_url else None
183
+ )
184
+
185
+ return ParseResponse(
186
+ job_id=job_id,
187
+ status="queued"
188
+ )
189
+
190
+ except httpx.RequestError as e:
191
+ logger.error(f"Error downloading file: {str(e)}")
192
+ raise HTTPException(
193
+ status_code=400,
194
+ detail=f"Error downloading file: {str(e)}"
195
+ )
196
+ except Exception as e:
197
+ logger.error(f"Error processing URL: {str(e)}")
198
+ raise HTTPException(
199
+ status_code=500,
200
+ detail=str(e)
201
+ )
202
+
203
+ @app.get("/status/{job_id}", response_model=ParseResponse)
204
+ async def get_status(job_id: str):
205
+ """
206
+ Get the status of a parsing job
207
+ """
208
+ if job_id not in jobs:
209
+ raise HTTPException(
210
+ status_code=404,
211
+ detail="Job not found"
212
+ )
213
+
214
+ job = jobs[job_id]
215
+ return ParseResponse(
216
+ job_id=job_id,
217
+ status=job["status"],
218
+ result=job.get("result"),
219
+ error=job.get("error")
220
+ )
221
+
222
+ @app.get("/health")
223
+ async def health_check():
224
+ """
225
+ Health check endpoint
226
+ """
227
+ return {"status": "healthy"}
228
+
229
+ # Handler for Vercel
230
+ handler = Mangum(app, lifespan="off")
api/index.py CHANGED
@@ -1,3 +1,4 @@
 
1
  import json
2
  import os
3
  import tempfile
@@ -59,232 +60,132 @@ def download_file(url):
59
  except Exception as e:
60
  raise ValueError(f"Failed to download file: {str(e)}")
61
 
62
- def handle_root():
63
- return {
64
- "statusCode": 200,
65
- "headers": {
66
- "Content-Type": "application/json"
67
- },
68
- "body": json.dumps({
69
- "status": "ok",
70
- "message": "Document Processing API",
71
- "version": "1.0.0"
72
- })
73
- }
74
-
75
- def handle_health():
76
- return {
77
- "statusCode": 200,
78
- "headers": {
79
- "Content-Type": "application/json"
80
- },
81
- "body": json.dumps({
82
- "status": "healthy",
83
- "timestamp": str(datetime.datetime.now(datetime.UTC))
84
- })
85
- }
86
-
87
- def handle_parse_file(event):
88
- try:
89
- # Check if there's a file in the request
90
- if 'body' not in event or not event['body']:
91
- return {
92
- "statusCode": 400,
93
- "headers": {"Content-Type": "application/json"},
94
- "body": json.dumps({
95
- "error": "No file provided",
96
- "details": "Please provide a file in the request body"
97
- })
98
  }
99
-
100
- # Get file data
101
- file_data = event['body']
102
- if event.get('isBase64Encoded', False):
103
- import base64
104
- file_data = base64.b64decode(file_data)
105
-
106
- # Validate file type
107
- if not is_valid_file(file_data):
108
- return {
109
- "statusCode": 400,
110
- "headers": {"Content-Type": "application/json"},
111
- "body": json.dumps({
112
- "error": "Invalid file type",
113
- "details": "Supported formats: PDF, TXT, HTML, MD, DOCX"
114
- })
115
  }
116
-
117
- # Save to temp file
118
- fd, temp_path = tempfile.mkstemp()
119
- os.close(fd)
120
-
121
- try:
122
- with open(temp_path, 'wb') as f:
123
- f.write(file_data)
124
-
125
- # Process file here
126
- return {
127
- "statusCode": 200,
128
- "headers": {"Content-Type": "application/json"},
129
- "body": json.dumps({
130
- "status": "success",
131
- "message": "File processed successfully",
132
- "metadata": {
133
- "size": os.path.getsize(temp_path),
134
- "mime_type": magic.from_file(temp_path, mime=True)
135
- }
136
- })
137
  }
138
- finally:
139
- # Clean up temp file
140
  try:
141
- os.unlink(temp_path)
142
- except:
143
- pass
144
-
145
- except Exception as e:
146
- return {
147
- "statusCode": 500,
148
- "headers": {"Content-Type": "application/json"},
149
- "body": json.dumps({
150
- "error": "Processing failed",
151
- "details": str(e)
152
- })
153
- }
154
 
155
- def handle_parse_url(event):
156
- try:
157
- # Get request body
158
- if 'body' not in event or not event['body']:
159
- return {
160
- "statusCode": 400,
161
- "headers": {"Content-Type": "application/json"},
162
- "body": json.dumps({
163
- "error": "No request body",
164
- "details": "Please provide a URL in the request body"
165
- })
166
- }
167
-
168
- # Parse JSON body
169
- try:
170
- body = json.loads(event['body'])
171
- except:
172
- return {
173
- "statusCode": 400,
174
- "headers": {"Content-Type": "application/json"},
175
- "body": json.dumps({
176
- "error": "Invalid JSON",
177
- "details": "Request body must be valid JSON"
178
- })
179
- }
180
-
181
- # Get URL from body
182
- url = body.get('url')
183
- if not url:
184
- return {
185
- "statusCode": 400,
186
- "headers": {"Content-Type": "application/json"},
187
- "body": json.dumps({
188
- "error": "No URL provided",
189
- "details": "Please provide a URL in the request body"
190
- })
191
- }
192
-
193
- if not url.startswith(('http://', 'https://')):
194
- return {
195
- "statusCode": 400,
196
- "headers": {"Content-Type": "application/json"},
197
- "body": json.dumps({
198
- "error": "Invalid URL",
199
- "details": "URL must start with http:// or https://"
200
- })
201
- }
202
-
203
- # Download and process file
204
- temp_path, content_type = download_file(url)
205
- try:
206
- # Process file here
207
- return {
208
- "statusCode": 200,
209
- "headers": {"Content-Type": "application/json"},
210
- "body": json.dumps({
211
- "status": "success",
212
- "message": "URL processed successfully",
213
- "metadata": {
214
- "url": url,
215
- "content_type": content_type,
216
- "size": os.path.getsize(temp_path)
217
  }
218
- })
219
- }
220
- finally:
221
- # Clean up temp file
222
  try:
223
- os.unlink(temp_path)
224
- except:
225
- pass
226
-
227
- except ValueError as e:
228
- return {
229
- "statusCode": 400,
230
- "headers": {"Content-Type": "application/json"},
231
- "body": json.dumps({
232
- "error": "Invalid request",
233
- "details": str(e)
234
- })
235
- }
236
- except Exception as e:
237
- return {
238
- "statusCode": 500,
239
- "headers": {"Content-Type": "application/json"},
240
- "body": json.dumps({
241
- "error": "Processing failed",
242
- "details": str(e)
243
- })
244
- }
245
 
246
- def handler(event, context):
247
- """Serverless handler for Vercel"""
248
-
249
- # Add CORS headers to all responses
250
- cors_headers = {
251
- "Access-Control-Allow-Origin": "*",
252
- "Access-Control-Allow-Headers": "Content-Type, Authorization",
253
- "Access-Control-Allow-Methods": "GET, POST, OPTIONS"
254
- }
255
-
256
- # Handle OPTIONS request for CORS
257
- if event.get('httpMethod') == 'OPTIONS':
258
- return {
259
- "statusCode": 204,
260
- "headers": cors_headers,
261
- "body": ""
262
- }
263
-
264
- # Get path and method
265
- path = event.get('path', '/').rstrip('/')
266
- method = event.get('httpMethod', 'GET')
267
-
268
- # Route request
269
- response = None
270
- if path == '' or path == '/':
271
- response = handle_root()
272
- elif path == '/health':
273
- response = handle_health()
274
- elif path == '/parse/file' and method == 'POST':
275
- response = handle_parse_file(event)
276
- elif path == '/parse/url' and method == 'POST':
277
- response = handle_parse_url(event)
278
- else:
279
- response = {
280
- "statusCode": 404,
281
- "headers": {"Content-Type": "application/json"},
282
- "body": json.dumps({
283
- "error": "Not Found",
284
- "details": f"Path {path} not found"
285
- })
286
- }
287
-
288
- # Add CORS headers to response
289
- response['headers'].update(cors_headers)
290
- return response
1
+ from http.server import BaseHTTPRequestHandler
2
  import json
3
  import os
4
  import tempfile
 
60
  except Exception as e:
61
  raise ValueError(f"Failed to download file: {str(e)}")
62
 
63
+ class Handler(BaseHTTPRequestHandler):
64
+ def do_GET(self):
65
+ if self.path == '/' or self.path == '':
66
+ self.send_response(200)
67
+ self.send_header('Content-type', 'application/json')
68
+ self.end_headers()
69
+ response = {
70
+ "status": "ok",
71
+ "message": "Document Processing API",
72
+ "version": "1.0.0"
73
  }
74
+ self.wfile.write(json.dumps(response).encode())
75
+ elif self.path == '/health':
76
+ self.send_response(200)
77
+ self.send_header('Content-type', 'application/json')
78
+ self.end_headers()
79
+ response = {
80
+ "status": "healthy",
81
+ "timestamp": str(datetime.datetime.now(datetime.UTC))
82
  }
83
+ self.wfile.write(json.dumps(response).encode())
84
+ else:
85
+ self.send_response(404)
86
+ self.send_header('Content-type', 'application/json')
87
+ self.end_headers()
88
+ response = {
89
+ "error": "Not Found",
90
+ "details": f"Path {self.path} not found"
91
  }
92
+ self.wfile.write(json.dumps(response).encode())
93
+
94
+ def do_POST(self):
95
+ content_length = int(self.headers.get('Content-Length', 0))
96
+ post_data = self.rfile.read(content_length)
97
+
98
+ if self.path == '/parse/file':
99
  try:
100
+ if not post_data:
101
+ self.send_error(400, "No file provided")
102
+ return
103
 
104
+ if not is_valid_file(post_data):
105
+ self.send_error(400, "Invalid file type")
106
+ return
107
+
108
+ fd, temp_path = tempfile.mkstemp()
109
+ os.close(fd)
110
+
111
+ try:
112
+ with open(temp_path, 'wb') as f:
113
+ f.write(post_data)
114
+
115
+ self.send_response(200)
116
+ self.send_header('Content-type', 'application/json')
117
+ self.end_headers()
118
+ response = {
119
+ "status": "success",
120
+ "message": "File processed successfully",
121
+ "metadata": {
122
+ "size": os.path.getsize(temp_path),
123
+ "mime_type": magic.from_file(temp_path, mime=True)
124
+ }
125
  }
126
+ self.wfile.write(json.dumps(response).encode())
127
+ finally:
128
+ try:
129
+ os.unlink(temp_path)
130
+ except:
131
+ pass
132
+
133
+ except Exception as e:
134
+ self.send_error(500, str(e))
135
+
136
+ elif self.path == '/parse/url':
137
  try:
138
+ try:
139
+ body = json.loads(post_data)
140
+ except:
141
+ self.send_error(400, "Invalid JSON")
142
+ return
143
 
144
+ url = body.get('url')
145
+ if not url:
146
+ self.send_error(400, "No URL provided")
147
+ return
148
+
149
+ if not url.startswith(('http://', 'https://')):
150
+ self.send_error(400, "Invalid URL")
151
+ return
152
+
153
+ temp_path, content_type = download_file(url)
154
+ try:
155
+ self.send_response(200)
156
+ self.send_header('Content-type', 'application/json')
157
+ self.end_headers()
158
+ response = {
159
+ "status": "success",
160
+ "message": "URL processed successfully",
161
+ "metadata": {
162
+ "url": url,
163
+ "content_type": content_type,
164
+ "size": os.path.getsize(temp_path)
165
+ }
166
+ }
167
+ self.wfile.write(json.dumps(response).encode())
168
+ finally:
169
+ try:
170
+ os.unlink(temp_path)
171
+ except:
172
+ pass
173
+
174
+ except ValueError as e:
175
+ self.send_error(400, str(e))
176
+ except Exception as e:
177
+ self.send_error(500, str(e))
178
+
179
+ else:
180
+ self.send_error(404, f"Path {self.path} not found")
181
+
182
+ def do_OPTIONS(self):
183
+ self.send_response(204)
184
+ self.send_header('Access-Control-Allow-Origin', '*')
185
+ self.send_header('Access-Control-Allow-Methods', 'GET, POST, OPTIONS')
186
+ self.send_header('Access-Control-Allow-Headers', 'Content-Type, Authorization')
187
+ self.end_headers()
188
+
189
+ def handler(request, context):
190
+ """Vercel serverless handler"""
191
+ return Handler(request, None, None).handle_request()
index.py DELETED
@@ -1,290 +0,0 @@
1
- import json
2
- import os
3
- import tempfile
4
- import magic
5
- import requests
6
- from werkzeug.utils import secure_filename
7
- import datetime
8
-
9
- def is_valid_file(file_data):
10
- """Check if file type is allowed using python-magic"""
11
- try:
12
- mime = magic.from_buffer(file_data, mime=True)
13
- allowed_mimes = [
14
- 'application/pdf',
15
- 'text/plain',
16
- 'text/html',
17
- 'text/markdown',
18
- 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
19
- ]
20
- return mime in allowed_mimes
21
- except Exception:
22
- return False
23
-
24
- def download_file(url):
25
- """Download file from URL and save to temp file"""
26
- try:
27
- response = requests.get(url, stream=True, timeout=10)
28
- response.raise_for_status()
29
-
30
- # Get content type
31
- content_type = response.headers.get('content-type', '').split(';')[0]
32
- if content_type not in [
33
- 'application/pdf',
34
- 'text/plain',
35
- 'text/html',
36
- 'text/markdown',
37
- 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
38
- ]:
39
- raise ValueError(f"Unsupported content type: {content_type}")
40
-
41
- # Create temp file with proper extension
42
- ext = {
43
- 'application/pdf': '.pdf',
44
- 'text/plain': '.txt',
45
- 'text/html': '.html',
46
- 'text/markdown': '.md',
47
- 'application/vnd.openxmlformats-officedocument.wordprocessingml.document': '.docx'
48
- }.get(content_type, '')
49
-
50
- fd, temp_path = tempfile.mkstemp(suffix=ext)
51
- os.close(fd)
52
-
53
- # Download file
54
- with open(temp_path, 'wb') as f:
55
- for chunk in response.iter_content(chunk_size=8192):
56
- f.write(chunk)
57
-
58
- return temp_path, content_type
59
- except Exception as e:
60
- raise ValueError(f"Failed to download file: {str(e)}")
61
-
62
- def handle_root():
63
- return {
64
- "statusCode": 200,
65
- "headers": {
66
- "Content-Type": "application/json"
67
- },
68
- "body": json.dumps({
69
- "status": "ok",
70
- "message": "Document Processing API",
71
- "version": "1.0.0"
72
- })
73
- }
74
-
75
- def handle_health():
76
- return {
77
- "statusCode": 200,
78
- "headers": {
79
- "Content-Type": "application/json"
80
- },
81
- "body": json.dumps({
82
- "status": "healthy",
83
- "timestamp": str(datetime.datetime.now(datetime.UTC))
84
- })
85
- }
86
-
87
- def handle_parse_file(event):
88
- try:
89
- # Check if there's a file in the request
90
- if 'body' not in event or not event['body']:
91
- return {
92
- "statusCode": 400,
93
- "headers": {"Content-Type": "application/json"},
94
- "body": json.dumps({
95
- "error": "No file provided",
96
- "details": "Please provide a file in the request body"
97
- })
98
- }
99
-
100
- # Get file data
101
- file_data = event['body']
102
- if event.get('isBase64Encoded', False):
103
- import base64
104
- file_data = base64.b64decode(file_data)
105
-
106
- # Validate file type
107
- if not is_valid_file(file_data):
108
- return {
109
- "statusCode": 400,
110
- "headers": {"Content-Type": "application/json"},
111
- "body": json.dumps({
112
- "error": "Invalid file type",
113
- "details": "Supported formats: PDF, TXT, HTML, MD, DOCX"
114
- })
115
- }
116
-
117
- # Save to temp file
118
- fd, temp_path = tempfile.mkstemp()
119
- os.close(fd)
120
-
121
- try:
122
- with open(temp_path, 'wb') as f:
123
- f.write(file_data)
124
-
125
- # Process file here
126
- return {
127
- "statusCode": 200,
128
- "headers": {"Content-Type": "application/json"},
129
- "body": json.dumps({
130
- "status": "success",
131
- "message": "File processed successfully",
132
- "metadata": {
133
- "size": os.path.getsize(temp_path),
134
- "mime_type": magic.from_file(temp_path, mime=True)
135
- }
136
- })
137
- }
138
- finally:
139
- # Clean up temp file
140
- try:
141
- os.unlink(temp_path)
142
- except:
143
- pass
144
-
145
- except Exception as e:
146
- return {
147
- "statusCode": 500,
148
- "headers": {"Content-Type": "application/json"},
149
- "body": json.dumps({
150
- "error": "Processing failed",
151
- "details": str(e)
152
- })
153
- }
154
-
155
- def handle_parse_url(event):
156
- try:
157
- # Get request body
158
- if 'body' not in event or not event['body']:
159
- return {
160
- "statusCode": 400,
161
- "headers": {"Content-Type": "application/json"},
162
- "body": json.dumps({
163
- "error": "No request body",
164
- "details": "Please provide a URL in the request body"
165
- })
166
- }
167
-
168
- # Parse JSON body
169
- try:
170
- body = json.loads(event['body'])
171
- except:
172
- return {
173
- "statusCode": 400,
174
- "headers": {"Content-Type": "application/json"},
175
- "body": json.dumps({
176
- "error": "Invalid JSON",
177
- "details": "Request body must be valid JSON"
178
- })
179
- }
180
-
181
- # Get URL from body
182
- url = body.get('url')
183
- if not url:
184
- return {
185
- "statusCode": 400,
186
- "headers": {"Content-Type": "application/json"},
187
- "body": json.dumps({
188
- "error": "No URL provided",
189
- "details": "Please provide a URL in the request body"
190
- })
191
- }
192
-
193
- if not url.startswith(('http://', 'https://')):
194
- return {
195
- "statusCode": 400,
196
- "headers": {"Content-Type": "application/json"},
197
- "body": json.dumps({
198
- "error": "Invalid URL",
199
- "details": "URL must start with http:// or https://"
200
- })
201
- }
202
-
203
- # Download and process file
204
- temp_path, content_type = download_file(url)
205
- try:
206
- # Process file here
207
- return {
208
- "statusCode": 200,
209
- "headers": {"Content-Type": "application/json"},
210
- "body": json.dumps({
211
- "status": "success",
212
- "message": "URL processed successfully",
213
- "metadata": {
214
- "url": url,
215
- "content_type": content_type,
216
- "size": os.path.getsize(temp_path)
217
- }
218
- })
219
- }
220
- finally:
221
- # Clean up temp file
222
- try:
223
- os.unlink(temp_path)
224
- except:
225
- pass
226
-
227
- except ValueError as e:
228
- return {
229
- "statusCode": 400,
230
- "headers": {"Content-Type": "application/json"},
231
- "body": json.dumps({
232
- "error": "Invalid request",
233
- "details": str(e)
234
- })
235
- }
236
- except Exception as e:
237
- return {
238
- "statusCode": 500,
239
- "headers": {"Content-Type": "application/json"},
240
- "body": json.dumps({
241
- "error": "Processing failed",
242
- "details": str(e)
243
- })
244
- }
245
-
246
- def handler(event, context):
247
- """Serverless handler for Vercel"""
248
-
249
- # Add CORS headers to all responses
250
- cors_headers = {
251
- "Access-Control-Allow-Origin": "*",
252
- "Access-Control-Allow-Headers": "Content-Type, Authorization",
253
- "Access-Control-Allow-Methods": "GET, POST, OPTIONS"
254
- }
255
-
256
- # Handle OPTIONS request for CORS
257
- if event.get('httpMethod') == 'OPTIONS':
258
- return {
259
- "statusCode": 204,
260
- "headers": cors_headers,
261
- "body": ""
262
- }
263
-
264
- # Get path and method
265
- path = event.get('path', '/').rstrip('/')
266
- method = event.get('httpMethod', 'GET')
267
-
268
- # Route request
269
- response = None
270
- if path == '' or path == '/':
271
- response = handle_root()
272
- elif path == '/health':
273
- response = handle_health()
274
- elif path == '/parse/file' and method == 'POST':
275
- response = handle_parse_file(event)
276
- elif path == '/parse/url' and method == 'POST':
277
- response = handle_parse_url(event)
278
- else:
279
- response = {
280
- "statusCode": 404,
281
- "headers": {"Content-Type": "application/json"},
282
- "body": json.dumps({
283
- "error": "Not Found",
284
- "details": f"Path {path} not found"
285
- })
286
- }
287
-
288
- # Add CORS headers to response
289
- response['headers'].update(cors_headers)
290
- return response
requirements-prod.txt ADDED
@@ -0,0 +1,11 @@
1
+ docling>=0.2.0
2
+ pydantic>=2.0.0
3
+ python-magic>=0.4.27
4
+ PyPDF2>=3.0.0
5
+ beautifulsoup4>=4.12.0
6
+ lxml>=4.9.0
7
+ requests>=2.31.0
8
+ fastapi>=0.104.0
9
+ python-multipart>=0.0.6
10
+ httpx>=0.25.0
11
+ mangum>=0.17.0
vercel.json CHANGED
@@ -14,7 +14,6 @@
14
  {
15
  "src": "/(.*)",
16
  "dest": "api/index.py",
17
- "continue": true,
18
  "headers": {
19
  "Access-Control-Allow-Origin": "*",
20
  "Access-Control-Allow-Headers": "Content-Type, Authorization",
 
14
  {
15
  "src": "/(.*)",
16
  "dest": "api/index.py",
 
17
  "headers": {
18
  "Access-Control-Allow-Origin": "*",
19
  "Access-Control-Allow-Headers": "Content-Type, Authorization",