Spaces: Running

Commit 8b5c234 · Parent: c26416f

[Cursor] Simplify Python handler for Vercel

Files changed:

- .cursorrules           +117 -45
- .gitignore              +43 -0
- README.md              +155 -1
- api.py                 +230 -0
- api/index.py           +122 -221
- index.py                 +0 -290
- requirements-prod.txt   +11 -0
- vercel.json              +0 -1
.cursorrules — CHANGED (`@@ -1,46 +1,118 @@`; the 45 removed lines are not legible in this view, so only the new file content is reproduced):

````markdown
# Instructions

During you interaction with the user, if you find anything reusable in this project (e.g. version of a library, model name), especially about a fix to a mistake you made or a correction you received, you should take note in the `Lessons` section in the `.cursorrules` file so you will not make the same mistake again.

You should also use the `.cursorrules` file as a scratchpad to organize your thoughts. Especially when you receive a new task, you should first review the content of the scratchpad, clear old different task if necessary, first explain the task, and plan the steps you need to take to complete the task. You can use todo markers to indicate the progress, e.g.
[X] Task 1
[ ] Task 2

Also update the progress of the task in the Scratchpad when you finish a subtask.
Especially when you finished a milestone, it will help to improve your depth of task accomplishment to use the scratchpad to reflect and plan.
The goal is to help you maintain a big picture as well as the progress of the task. Always refer to the Scratchpad when you plan the next step.

# Tools

Note all the tools are in python. So in the case you need to do batch processing, you can always consult the python files and write your own script.

## Screenshot Verification

The screenshot verification workflow allows you to capture screenshots of web pages and verify their appearance using LLMs. The following tools are available:

1. Screenshot Capture:
```bash
venv/bin/python tools/screenshot_utils.py URL [--output OUTPUT] [--width WIDTH] [--height HEIGHT]
```

2. LLM Verification with Images:
```bash
venv/bin/python tools/llm_api.py --prompt "Your verification question" --provider {openai|anthropic} --image path/to/screenshot.png
```

Example workflow:
```python
from screenshot_utils import take_screenshot_sync
from llm_api import query_llm

# Take a screenshot
screenshot_path = take_screenshot_sync('https://example.com', 'screenshot.png')

# Verify with LLM
response = query_llm(
    "What is the background color and title of this webpage?",
    provider="openai",  # or "anthropic"
    image_path=screenshot_path
)
print(response)
```

## LLM

You always have an LLM at your side to help you with the task. For simple tasks, you could invoke the LLM by running the following command:
```
venv/bin/python ./tools/llm_api.py --prompt "What is the capital of France?" --provider "anthropic"
```

The LLM API supports multiple providers:
- OpenAI (default, model: gpt-4o)
- Azure OpenAI (model: configured via AZURE_OPENAI_MODEL_DEPLOYMENT in .env file, defaults to gpt-4o-ms)
- DeepSeek (model: deepseek-chat)
- Anthropic (model: claude-3-sonnet-20240229)
- Gemini (model: gemini-pro)
- Local LLM (model: Qwen/Qwen2.5-32B-Instruct-AWQ)

But usually it's a better idea to check the content of the file and use the APIs in the `tools/llm_api.py` file to invoke the LLM if needed.

## Web browser

You could use the `tools/web_scraper.py` file to scrape the web.
```
venv/bin/python ./tools/web_scraper.py --max-concurrent 3 URL1 URL2 URL3
```
This will output the content of the web pages.

## Search engine

You could use the `tools/search_engine.py` file to search the web.
```
venv/bin/python ./tools/search_engine.py "your search keywords"
```
This will output the search results in the following format:
```
URL: https://example.com
Title: This is the title of the search result
Snippet: This is a snippet of the search result
```
If needed, you can further use the `web_scraper.py` file to scrape the web page content.

# Lessons

## User Specified Lessons

- You have a python venv in ./venv. Use it.
- Include info useful for debugging in the program output.
- Read the file before you try to edit it.
- Due to Cursor's limit, when you use `git` and `gh` and need to submit a multiline commit message, first write the message in a file, and then use `git commit -F <filename>` or similar command to commit. And then remove the file. Include "[Cursor] " in the commit message and PR title.

## Cursor learned

- For search results, ensure proper handling of different character encodings (UTF-8) for international queries
- Add debug information to stderr while keeping the main output clean in stdout for better pipeline integration
- When using seaborn styles in matplotlib, use 'seaborn-v0_8' instead of 'seaborn' as the style name due to recent seaborn version changes
- Use 'gpt-4o' as the model name for OpenAI's GPT-4 with vision capabilities
- For Vercel deployments with FastAPI and Mangum, use older stable versions (FastAPI 0.88.0, Mangum 0.15.0, Pydantic 1.10.2) to avoid compatibility issues
- Keep Vercel configuration simple and avoid unnecessary configuration options that might cause conflicts

# Scratchpad

Current Task: Fix Vercel deployment issues with FastAPI and Mangum

Progress:
[X] Identified issue with newer versions of FastAPI and Mangum
[X] Updated dependencies to use older, stable versions
[X] Simplified FastAPI configuration
[X] Simplified Vercel configuration
[X] Successfully deployed to production

Next Steps:
[ ] Test all API endpoints
[ ] Add more functionality if needed
[ ] Consider adding monitoring and logging
````
.gitignore — ADDED (`@@ -0,0 +1,43 @@`):

```gitignore
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# Virtual Environment
venv/
env/
ENV/

# IDE
.idea/
.vscode/
*.swp
*.swo

# Vercel
.vercel/
.env
.env.local

# Temporary files
*.tmp
tmp/
temp/
.vercel
```
README.md — CHANGED (`@@ -54,4 +54,158 @@ Built with:`; the Document Parser API documentation was appended after the license section):

````markdown
## 📝 License

MIT License

# Document Parser API

A scalable FastAPI service for parsing various document formats (PDF, DOCX, TXT, HTML, Markdown) with automatic information extraction.

## Features

- 📄 Multi-format support (PDF, DOCX, TXT, HTML, Markdown)
- 🔄 Asynchronous processing with background tasks
- 🌐 Support for both file uploads and URL inputs
- 📊 Structured information extraction
- 🔗 Webhook support for processing notifications
- 🚀 Highly scalable architecture
- 🛡️ Comprehensive error handling
- 📝 Detailed logging

## Quick Start

### Prerequisites

- Python 3.8+
- pip (Python package manager)

### Installation

1. Clone the repository:
```bash
git clone https://github.com/yourusername/document-parser-api.git
cd document-parser-api
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

3. Run the API server:
```bash
python api.py
```

The API will be available at `http://localhost:8000`

## API Documentation

### Endpoints

#### 1. Parse Document from File Upload
```http
POST /parse/file
```
- Upload a document file for parsing
- Optional callback URL for processing notification
- Returns a job ID for status tracking

#### 2. Parse Document from URL
```http
POST /parse/url
```
- Submit a document URL for parsing
- Optional callback URL for processing notification
- Returns a job ID for status tracking

#### 3. Check Processing Status
```http
GET /status/{job_id}
```
- Get the current status of a parsing job
- Returns processing status and results if completed

#### 4. Health Check
```http
GET /health
```
- Check if the API is running and healthy

### Example Usage

#### Parse File
```python
import requests

url = "http://localhost:8000/parse/file"
files = {"file": open("document.pdf", "rb")}
response = requests.post(url, files=files)
print(response.json())
```

#### Parse URL
```python
import requests

url = "http://localhost:8000/parse/url"
data = {
    "url": "https://example.com/document.pdf",
    "callback_url": "https://your-callback-url.com/webhook"
}
response = requests.post(url, json=data)
print(response.json())
```

## Error Handling

The API implements comprehensive error handling:

- Invalid file formats
- Failed URL downloads
- Processing errors
- Invalid requests
- Server errors

All errors return appropriate HTTP status codes and detailed error messages.

## Scaling Considerations

For production deployment, consider:

1. **Job Storage**: Replace in-memory storage with Redis or a database
2. **Load Balancing**: Deploy behind a load balancer
3. **Worker Processes**: Adjust number of workers based on load
4. **Rate Limiting**: Implement rate limiting for API endpoints
5. **Monitoring**: Add metrics collection and monitoring

## Development

### Running Tests
```bash
pytest tests/
```

### Local Development
```bash
uvicorn api:app --reload --port 8000
```

### API Documentation
- Swagger UI: `http://localhost:8000/docs`
- ReDoc: `http://localhost:8000/redoc`

## Contributing

1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Create a Pull Request

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Support

For support, please open an issue in the GitHub repository or contact the maintainers.
````
api.py — ADDED (`@@ -0,0 +1,230 @@`):

```python
import os
from fastapi import FastAPI, HTTPException, UploadFile, File, BackgroundTasks
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, HttpUrl
import tempfile
import requests
from typing import Optional, List, Dict, Any
from dockling_parser import DocumentParser
from dockling_parser.exceptions import ParserError, UnsupportedFormatError
from dockling_parser.types import ParsedDocument
import logging
import aiofiles
import asyncio
from urllib.parse import urlparse
from mangum import Mangum
import httpx

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Initialize FastAPI app
app = FastAPI(
    title="Document Parser API",
    description="A scalable API for parsing various document formats",
    version="1.0.0"
)

# Add CORS middleware
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Initialize document parser
parser = DocumentParser()

class URLInput(BaseModel):
    url: HttpUrl
    callback_url: Optional[HttpUrl] = None

class ErrorResponse(BaseModel):
    error: str
    detail: Optional[str] = None
    code: str

class ParseResponse(BaseModel):
    job_id: str
    status: str
    result: Optional[ParsedDocument] = None
    error: Optional[str] = None

# In-memory job storage (replace with Redis/DB in production)
jobs = {}

async def process_document_async(job_id: str, file_path: str, callback_url: Optional[str] = None):
    """Process document asynchronously"""
    try:
        # Update job status
        jobs[job_id] = {"status": "processing"}

        # Parse document
        result = parser.parse(file_path)

        # Update job with result
        jobs[job_id] = {
            "status": "completed",
            "result": result
        }

        # Call callback URL if provided
        if callback_url:
            try:
                await notify_callback(callback_url, job_id, result)
            except Exception as e:
                logger.error(f"Failed to notify callback URL: {str(e)}")

    except Exception as e:
        logger.error(f"Error processing document: {str(e)}")
        jobs[job_id] = {
            "status": "failed",
            "error": str(e)
        }
    finally:
        # Cleanup temporary file
        try:
            if os.path.exists(file_path):
                os.unlink(file_path)
        except Exception as e:
            logger.error(f"Error cleaning up file: {str(e)}")

async def notify_callback(callback_url: str, job_id: str, result: ParsedDocument):
    """Notify callback URL with results"""
    try:
        async with httpx.AsyncClient() as client:
            await client.post(
                callback_url,
                json={
                    "job_id": job_id,
                    "result": result.dict()
                }
            )
    except Exception as e:
        logger.error(f"Failed to send callback: {str(e)}")

@app.post("/parse/file", response_model=ParseResponse)
async def parse_file(
    background_tasks: BackgroundTasks,
    file: UploadFile = File(...),
    callback_url: Optional[HttpUrl] = None
):
    """
    Parse a document from file upload
    """
    try:
        # Create temporary file in /tmp for Vercel
        suffix = os.path.splitext(file.filename)[1]
        tmp_dir = "/tmp" if os.path.exists("/tmp") else tempfile.gettempdir()
        tmp_path = os.path.join(tmp_dir, f"upload_{os.urandom(8).hex()}{suffix}")

        content = await file.read()
        with open(tmp_path, "wb") as f:
            f.write(content)

        # Generate job ID
        job_id = f"job_{len(jobs) + 1}"

        # Start background processing
        background_tasks.add_task(
            process_document_async,
            job_id,
            tmp_path,
            str(callback_url) if callback_url else None
        )

        return ParseResponse(
            job_id=job_id,
            status="queued"
        )

    except Exception as e:
        logger.error(f"Error handling file upload: {str(e)}")
        raise HTTPException(
            status_code=500,
            detail=str(e)
        )

@app.post("/parse/url", response_model=ParseResponse)
async def parse_url(input_data: URLInput, background_tasks: BackgroundTasks):
    """
    Parse a document from URL
    """
    try:
        # Download file
        async with httpx.AsyncClient() as client:
            response = await client.get(str(input_data.url), follow_redirects=True)
            response.raise_for_status()

        # Get filename from URL or use default
        filename = os.path.basename(urlparse(str(input_data.url)).path)
        if not filename:
            filename = "document.pdf"

        # Save to temporary file in /tmp for Vercel
        tmp_dir = "/tmp" if os.path.exists("/tmp") else tempfile.gettempdir()
        tmp_path = os.path.join(tmp_dir, f"download_{os.urandom(8).hex()}{os.path.splitext(filename)[1]}")

        with open(tmp_path, "wb") as f:
            f.write(response.content)

        # Generate job ID
        job_id = f"job_{len(jobs) + 1}"

        # Start background processing
        background_tasks.add_task(
            process_document_async,
            job_id,
            tmp_path,
            str(input_data.callback_url) if input_data.callback_url else None
        )

        return ParseResponse(
            job_id=job_id,
            status="queued"
        )

    except httpx.RequestError as e:
        logger.error(f"Error downloading file: {str(e)}")
        raise HTTPException(
            status_code=400,
            detail=f"Error downloading file: {str(e)}"
        )
    except Exception as e:
        logger.error(f"Error processing URL: {str(e)}")
        raise HTTPException(
            status_code=500,
            detail=str(e)
        )

@app.get("/status/{job_id}", response_model=ParseResponse)
async def get_status(job_id: str):
    """
    Get the status of a parsing job
    """
    if job_id not in jobs:
        raise HTTPException(
            status_code=404,
            detail="Job not found"
        )

    job = jobs[job_id]
    return ParseResponse(
        job_id=job_id,
        status=job["status"],
        result=job.get("result"),
        error=job.get("error")
    )

@app.get("/health")
async def health_check():
    """
    Health check endpoint
    """
    return {"status": "healthy"}

# Handler for Vercel
handler = Mangum(app, lifespan="off")
```
api/index.py — CHANGED (`@@ -1,3 +1,4 @@` adds the `http.server` import; `@@ -59,232 +60,132 @@ def download_file(url):` replaces the old Lambda-style `handle_health` / `handle_parse_file` and URL-handling functions, which returned `{"statusCode": ..., "body": json.dumps(...)}` dicts, with a `BaseHTTPRequestHandler` subclass; the removed side is only partially legible in this view). The new content:

```python
from http.server import BaseHTTPRequestHandler
import json
import os
import tempfile

# ... is_valid_file() and download_file() are unchanged (through line 62) ...

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/' or self.path == '':
            self.send_response(200)
            self.send_header('Content-type', 'application/json')
            self.end_headers()
            response = {
                "status": "ok",
                "message": "Document Processing API",
                "version": "1.0.0"
            }
            self.wfile.write(json.dumps(response).encode())
        elif self.path == '/health':
            self.send_response(200)
            self.send_header('Content-type', 'application/json')
            self.end_headers()
            response = {
                "status": "healthy",
                "timestamp": str(datetime.datetime.now(datetime.UTC))
            }
            self.wfile.write(json.dumps(response).encode())
        else:
            self.send_response(404)
            self.send_header('Content-type', 'application/json')
            self.end_headers()
            response = {
                "error": "Not Found",
                "details": f"Path {self.path} not found"
            }
            self.wfile.write(json.dumps(response).encode())

    def do_POST(self):
        content_length = int(self.headers.get('Content-Length', 0))
        post_data = self.rfile.read(content_length)

        if self.path == '/parse/file':
            try:
                if not post_data:
                    self.send_error(400, "No file provided")
                    return

                if not is_valid_file(post_data):
                    self.send_error(400, "Invalid file type")
                    return

                fd, temp_path = tempfile.mkstemp()
                os.close(fd)

                try:
                    with open(temp_path, 'wb') as f:
                        f.write(post_data)

                    self.send_response(200)
                    self.send_header('Content-type', 'application/json')
                    self.end_headers()
                    response = {
                        "status": "success",
                        "message": "File processed successfully",
                        "metadata": {
                            "size": os.path.getsize(temp_path),
                            "mime_type": magic.from_file(temp_path, mime=True)
                        }
                    }
                    self.wfile.write(json.dumps(response).encode())
                finally:
                    try:
                        os.unlink(temp_path)
                    except:
                        pass

            except Exception as e:
                self.send_error(500, str(e))

        elif self.path == '/parse/url':
            try:
                try:
                    body = json.loads(post_data)
                except:
                    self.send_error(400, "Invalid JSON")
                    return

                url = body.get('url')
                if not url:
                    self.send_error(400, "No URL provided")
                    return

                if not url.startswith(('http://', 'https://')):
                    self.send_error(400, "Invalid URL")
                    return

                temp_path, content_type = download_file(url)
                try:
                    self.send_response(200)
                    self.send_header('Content-type', 'application/json')
                    self.end_headers()
                    response = {
                        "status": "success",
                        "message": "URL processed successfully",
                        "metadata": {
                            "url": url,
                            "content_type": content_type,
                            "size": os.path.getsize(temp_path)
                        }
                    }
                    self.wfile.write(json.dumps(response).encode())
                finally:
                    try:
                        os.unlink(temp_path)
                    except:
                        pass

            except ValueError as e:
                self.send_error(400, str(e))
            except Exception as e:
                self.send_error(500, str(e))

        else:
            self.send_error(404, f"Path {self.path} not found")

    def do_OPTIONS(self):
        self.send_response(204)
        self.send_header('Access-Control-Allow-Origin', '*')
        self.send_header('Access-Control-Allow-Methods', 'GET, POST, OPTIONS')
        self.send_header('Access-Control-Allow-Headers', 'Content-Type, Authorization')
        self.end_headers()

def handler(request, context):
    """Vercel serverless handler"""
    return Handler(request, None, None).handle_request()
```
index.py
DELETED
@@ -1,290 +0,0 @@
1   - import json
2   - import os
3   - import tempfile
4   - import magic
5   - import requests
6   - from werkzeug.utils import secure_filename
7   - import datetime
8   -
9   - def is_valid_file(file_data):
10  -     """Check if file type is allowed using python-magic"""
11  -     try:
12  -         mime = magic.from_buffer(file_data, mime=True)
13  -         allowed_mimes = [
14  -             'application/pdf',
15  -             'text/plain',
16  -             'text/html',
17  -             'text/markdown',
18  -             'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
19  -         ]
20  -         return mime in allowed_mimes
21  -     except Exception:
22  -         return False
23  -
24  - def download_file(url):
25  -     """Download file from URL and save to temp file"""
26  -     try:
27  -         response = requests.get(url, stream=True, timeout=10)
28  -         response.raise_for_status()
29  -
30  -         # Get content type
31  -         content_type = response.headers.get('content-type', '').split(';')[0]
32  -         if content_type not in [
33  -             'application/pdf',
34  -             'text/plain',
35  -             'text/html',
36  -             'text/markdown',
37  -             'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
38  -         ]:
39  -             raise ValueError(f"Unsupported content type: {content_type}")
40  -
41  -         # Create temp file with proper extension
42  -         ext = {
43  -             'application/pdf': '.pdf',
44  -             'text/plain': '.txt',
45  -             'text/html': '.html',
46  -             'text/markdown': '.md',
47  -             'application/vnd.openxmlformats-officedocument.wordprocessingml.document': '.docx'
48  -         }.get(content_type, '')
49  -
50  -         fd, temp_path = tempfile.mkstemp(suffix=ext)
51  -         os.close(fd)
52  -
53  -         # Download file
54  -         with open(temp_path, 'wb') as f:
55  -             for chunk in response.iter_content(chunk_size=8192):
56  -                 f.write(chunk)
57  -
58  -         return temp_path, content_type
59  -     except Exception as e:
60  -         raise ValueError(f"Failed to download file: {str(e)}")
61  -
62  - def handle_root():
63  -     return {
64  -         "statusCode": 200,
65  -         "headers": {
66  -             "Content-Type": "application/json"
67  -         },
68  -         "body": json.dumps({
69  -             "status": "ok",
70  -             "message": "Document Processing API",
71  -             "version": "1.0.0"
72  -         })
73  -     }
74  -
75  - def handle_health():
76  -     return {
77  -         "statusCode": 200,
78  -         "headers": {
79  -             "Content-Type": "application/json"
80  -         },
81  -         "body": json.dumps({
82  -             "status": "healthy",
83  -             "timestamp": str(datetime.datetime.now(datetime.UTC))
84  -         })
85  -     }
86  -
87  - def handle_parse_file(event):
88  -     try:
89  -         # Check if there's a file in the request
90  -         if 'body' not in event or not event['body']:
91  -             return {
92  -                 "statusCode": 400,
93  -                 "headers": {"Content-Type": "application/json"},
94  -                 "body": json.dumps({
95  -                     "error": "No file provided",
96  -                     "details": "Please provide a file in the request body"
97  -                 })
98  -             }
99  -
100 -         # Get file data
101 -         file_data = event['body']
102 -         if event.get('isBase64Encoded', False):
103 -             import base64
104 -             file_data = base64.b64decode(file_data)
105 -
106 -         # Validate file type
107 -         if not is_valid_file(file_data):
108 -             return {
109 -                 "statusCode": 400,
110 -                 "headers": {"Content-Type": "application/json"},
111 -                 "body": json.dumps({
112 -                     "error": "Invalid file type",
113 -                     "details": "Supported formats: PDF, TXT, HTML, MD, DOCX"
114 -                 })
115 -             }
116 -
117 -         # Save to temp file
118 -         fd, temp_path = tempfile.mkstemp()
119 -         os.close(fd)
120 -
121 -         try:
122 -             with open(temp_path, 'wb') as f:
123 -                 f.write(file_data)
124 -
125 -             # Process file here
126 -             return {
127 -                 "statusCode": 200,
128 -                 "headers": {"Content-Type": "application/json"},
129 -                 "body": json.dumps({
130 -                     "status": "success",
131 -                     "message": "File processed successfully",
132 -                     "metadata": {
133 -                         "size": os.path.getsize(temp_path),
134 -                         "mime_type": magic.from_file(temp_path, mime=True)
135 -                     }
136 -                 })
137 -             }
138 -         finally:
139 -             # Clean up temp file
140 -             try:
141 -                 os.unlink(temp_path)
142 -             except:
143 -                 pass
144 -
145 -     except Exception as e:
146 -         return {
147 -             "statusCode": 500,
148 -             "headers": {"Content-Type": "application/json"},
149 -             "body": json.dumps({
150 -                 "error": "Processing failed",
151 -                 "details": str(e)
152 -             })
153 -         }
154 -
155 - def handle_parse_url(event):
156 -     try:
157 -         # Get request body
158 -         if 'body' not in event or not event['body']:
159 -             return {
160 -                 "statusCode": 400,
161 -                 "headers": {"Content-Type": "application/json"},
162 -                 "body": json.dumps({
163 -                     "error": "No request body",
164 -                     "details": "Please provide a URL in the request body"
165 -                 })
166 -             }
167 -
168 -         # Parse JSON body
169 -         try:
170 -             body = json.loads(event['body'])
171 -         except:
172 -             return {
173 -                 "statusCode": 400,
174 -                 "headers": {"Content-Type": "application/json"},
175 -                 "body": json.dumps({
176 -                     "error": "Invalid JSON",
177 -                     "details": "Request body must be valid JSON"
178 -                 })
179 -             }
180 -
181 -         # Get URL from body
182 -         url = body.get('url')
183 -         if not url:
184 -             return {
185 -                 "statusCode": 400,
186 -                 "headers": {"Content-Type": "application/json"},
187 -                 "body": json.dumps({
188 -                     "error": "No URL provided",
189 -                     "details": "Please provide a URL in the request body"
190 -                 })
191 -             }
192 -
193 -         if not url.startswith(('http://', 'https://')):
194 -             return {
195 -                 "statusCode": 400,
196 -                 "headers": {"Content-Type": "application/json"},
197 -                 "body": json.dumps({
198 -                     "error": "Invalid URL",
199 -                     "details": "URL must start with http:// or https://"
200 -                 })
201 -             }
202 -
203 -         # Download and process file
204 -         temp_path, content_type = download_file(url)
205 -         try:
206 -             # Process file here
207 -             return {
208 -                 "statusCode": 200,
209 -                 "headers": {"Content-Type": "application/json"},
210 -                 "body": json.dumps({
211 -                     "status": "success",
212 -                     "message": "URL processed successfully",
213 -                     "metadata": {
214 -                         "url": url,
215 -                         "content_type": content_type,
216 -                         "size": os.path.getsize(temp_path)
217 -                     }
218 -                 })
219 -             }
220 -         finally:
221 -             # Clean up temp file
222 -             try:
223 -                 os.unlink(temp_path)
224 -             except:
225 -                 pass
226 -
227 -     except ValueError as e:
228 -         return {
229 -             "statusCode": 400,
230 -             "headers": {"Content-Type": "application/json"},
231 -             "body": json.dumps({
232 -                 "error": "Invalid request",
233 -                 "details": str(e)
234 -             })
235 -         }
236 -     except Exception as e:
237 -         return {
238 -             "statusCode": 500,
239 -             "headers": {"Content-Type": "application/json"},
240 -             "body": json.dumps({
241 -                 "error": "Processing failed",
242 -                 "details": str(e)
243 -             })
244 -         }
245 -
246 - def handler(event, context):
247 -     """Serverless handler for Vercel"""
248 -
249 -     # Add CORS headers to all responses
250 -     cors_headers = {
251 -         "Access-Control-Allow-Origin": "*",
252 -         "Access-Control-Allow-Headers": "Content-Type, Authorization",
253 -         "Access-Control-Allow-Methods": "GET, POST, OPTIONS"
254 -     }
255 -
256 -     # Handle OPTIONS request for CORS
257 -     if event.get('httpMethod') == 'OPTIONS':
258 -         return {
259 -             "statusCode": 204,
260 -             "headers": cors_headers,
261 -             "body": ""
262 -         }
263 -
264 -     # Get path and method
265 -     path = event.get('path', '/').rstrip('/')
266 -     method = event.get('httpMethod', 'GET')
267 -
268 -     # Route request
269 -     response = None
270 -     if path == '' or path == '/':
271 -         response = handle_root()
272 -     elif path == '/health':
273 -         response = handle_health()
274 -     elif path == '/parse/file' and method == 'POST':
275 -         response = handle_parse_file(event)
276 -     elif path == '/parse/url' and method == 'POST':
277 -         response = handle_parse_url(event)
278 -     else:
279 -         response = {
280 -             "statusCode": 404,
281 -             "headers": {"Content-Type": "application/json"},
282 -             "body": json.dumps({
283 -                 "error": "Not Found",
284 -                 "details": f"Path {path} not found"
285 -             })
286 -         }
287 -
288 -     # Add CORS headers to response
289 -     response['headers'].update(cors_headers)
290 -     return response
requirements-prod.txt
ADDED
@@ -0,0 +1,11 @@
1  + docling>=0.2.0
2  + pydantic>=2.0.0
3  + python-magic>=0.4.27
4  + PyPDF2>=3.0.0
5  + beautifulsoup4>=4.12.0
6  + lxml>=4.9.0
7  + requests>=2.31.0
8  + fastapi>=0.104.0
9  + python-multipart>=0.0.6
10 + httpx>=0.25.0
11 + mangum>=0.17.0
vercel.json
CHANGED
@@ -14,7 +14,6 @@
14       {
15         "src": "/(.*)",
16         "dest": "api/index.py",
17 -       "continue": true,
18         "headers": {
19           "Access-Control-Allow-Origin": "*",
20           "Access-Control-Allow-Headers": "Content-Type, Authorization",
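Dropping `"continue": true` makes the catch-all route terminate matching instead of letting later routes also apply. The CORS headers it attaches duplicate those the Python handler sends; merging them programmatically looks roughly like this (a sketch; `with_cors` is a hypothetical name, the header values come from the diff):

```python
# CORS headers as declared in vercel.json and sent by the handler
CORS_HEADERS = {
    "Access-Control-Allow-Origin": "*",
    "Access-Control-Allow-Headers": "Content-Type, Authorization",
    "Access-Control-Allow-Methods": "GET, POST, OPTIONS",
}

def with_cors(headers):
    """Return a copy of a response header dict with CORS headers merged in.
    CORS entries overwrite any same-named keys, matching dict.update semantics."""
    merged = dict(headers)
    merged.update(CORS_HEADERS)
    return merged
```

Keeping the headers in one place (either the route config or the handler) would avoid the two copies drifting apart.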