hellorahulk committed on
Commit
8b5c234
·
1 Parent(s): c26416f

[Cursor] Simplify Python handler for Vercel

Files changed (8)
  1. .cursorrules +117 -45
  2. .gitignore +43 -0
  3. README.md +155 -1
  4. api.py +230 -0
  5. api/index.py +122 -221
  6. index.py +0 -290
  7. requirements-prod.txt +11 -0
  8. vercel.json +0 -1
.cursorrules CHANGED
@@ -1,46 +1,118 @@
 
1
 
2
- You are an expert in Python, FastAPI, microservices architecture, and serverless environments.
3
-
4
- Advanced Principles
5
- - Design services to be stateless; leverage external storage and caches (e.g., Redis) for state persistence.
6
- - Implement API gateways and reverse proxies (e.g., NGINX, Traefik) for handling traffic to microservices.
7
- - Use circuit breakers and retries for resilient service communication.
8
- - Favor serverless deployment for reduced infrastructure overhead in scalable environments.
9
- - Use asynchronous workers (e.g., Celery, RQ) for handling background tasks efficiently.
10
-
11
- Microservices and API Gateway Integration
12
- - Integrate FastAPI services with API Gateway solutions like Kong or AWS API Gateway.
13
- - Use API Gateway for rate limiting, request transformation, and security filtering.
14
- - Design APIs with clear separation of concerns to align with microservices principles.
15
- - Implement inter-service communication using message brokers (e.g., RabbitMQ, Kafka) for event-driven architectures.
16
-
17
- Serverless and Cloud-Native Patterns
18
- - Optimize FastAPI apps for serverless environments (e.g., AWS Lambda, Azure Functions) by minimizing cold start times.
19
- - Package FastAPI applications using lightweight containers or as a standalone binary for deployment in serverless setups.
20
- - Use managed services (e.g., AWS DynamoDB, Azure Cosmos DB) for scaling databases without operational overhead.
21
- - Implement automatic scaling with serverless functions to handle variable loads effectively.
22
-
23
- Advanced Middleware and Security
24
- - Implement custom middleware for detailed logging, tracing, and monitoring of API requests.
25
- - Use OpenTelemetry or similar libraries for distributed tracing in microservices architectures.
26
- - Apply security best practices: OAuth2 for secure API access, rate limiting, and DDoS protection.
27
- - Use security headers (e.g., CORS, CSP) and implement content validation using tools like OWASP Zap.
28
-
29
- Optimizing for Performance and Scalability
30
- - Leverage FastAPI’s async capabilities for handling large volumes of simultaneous connections efficiently.
31
- - Optimize backend services for high throughput and low latency; use databases optimized for read-heavy workloads (e.g., Elasticsearch).
32
- - Use caching layers (e.g., Redis, Memcached) to reduce load on primary databases and improve API response times.
33
- - Apply load balancing and service mesh technologies (e.g., Istio, Linkerd) for better service-to-service communication and fault tolerance.
34
-
35
- Monitoring and Logging
36
- - Use Prometheus and Grafana for monitoring FastAPI applications and setting up alerts.
37
- - Implement structured logging for better log analysis and observability.
38
- - Integrate with centralized logging systems (e.g., ELK Stack, AWS CloudWatch) for aggregated logging and monitoring.
39
-
40
- Key Conventions
41
- 1. Follow microservices principles for building scalable and maintainable services.
42
- 2. Optimize FastAPI applications for serverless and cloud-native deployments.
43
- 3. Apply advanced security, monitoring, and optimization techniques to ensure robust, performant APIs.
44
-
45
- Refer to FastAPI, microservices, and serverless documentation for best practices and advanced usage patterns.
46
-
1
+ # Instructions
2
 
3
+ During your interaction with the user, if you find anything reusable in this project (e.g. the version of a library or a model name), especially a fix to a mistake you made or a correction you received, take note of it in the `Lessons` section of the `.cursorrules` file so you will not make the same mistake again.
4
+
5
+ You should also use the `.cursorrules` file as a scratchpad to organize your thoughts. When you receive a new task, first review the content of the scratchpad, clear out old tasks if necessary, explain the new task, and plan the steps you need to take to complete it. You can use todo markers to indicate progress, e.g.
6
+ [X] Task 1
7
+ [ ] Task 2
8
+
9
+ Also update the progress of the task in the Scratchpad when you finish a subtask.
10
+ Especially when you finish a milestone, use the scratchpad to reflect and plan; this improves the depth of your task accomplishment.
11
+ The goal is to help you maintain a big picture as well as the progress of the task. Always refer to the Scratchpad when you plan the next step.
12
+
13
+ # Tools
14
+
15
+ Note that all the tools are written in Python. If you need to do batch processing, you can consult the Python files and write your own script.
16
+
17
+ ## Screenshot Verification
18
+ The screenshot verification workflow allows you to capture screenshots of web pages and verify their appearance using LLMs. The following tools are available:
19
+
20
+ 1. Screenshot Capture:
21
+ ```bash
22
+ venv/bin/python tools/screenshot_utils.py URL [--output OUTPUT] [--width WIDTH] [--height HEIGHT]
23
+ ```
24
+
25
+ 2. LLM Verification with Images:
26
+ ```bash
27
+ venv/bin/python tools/llm_api.py --prompt "Your verification question" --provider {openai|anthropic} --image path/to/screenshot.png
28
+ ```
29
+
30
+ Example workflow:
31
+ ```python
32
+ from screenshot_utils import take_screenshot_sync
33
+ from llm_api import query_llm
34
+
35
+ # Take a screenshot
36
+ screenshot_path = take_screenshot_sync('https://example.com', 'screenshot.png')
37
+
38
+ # Verify with LLM
39
+ response = query_llm(
40
+ "What is the background color and title of this webpage?",
41
+ provider="openai", # or "anthropic"
42
+ image_path=screenshot_path
43
+ )
44
+ print(response)
45
+ ```
46
+
47
+ ## LLM
48
+
49
+ You always have an LLM at your side to help you with the task. For simple tasks, you could invoke the LLM by running the following command:
50
+ ```
51
+ venv/bin/python ./tools/llm_api.py --prompt "What is the capital of France?" --provider "anthropic"
52
+ ```
53
+
54
+ The LLM API supports multiple providers:
55
+ - OpenAI (default, model: gpt-4o)
56
+ - Azure OpenAI (model: configured via AZURE_OPENAI_MODEL_DEPLOYMENT in .env file, defaults to gpt-4o-ms)
57
+ - DeepSeek (model: deepseek-chat)
58
+ - Anthropic (model: claude-3-sonnet-20240229)
59
+ - Gemini (model: gemini-pro)
60
+ - Local LLM (model: Qwen/Qwen2.5-32B-Instruct-AWQ)
61
+
62
+ Usually, though, it is better to check the content of the file and use the APIs in `tools/llm_api.py` to invoke the LLM if needed.
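A minimal sketch of that programmatic route, assuming the same import path and `query_llm` signature used in the screenshot-verification example above:

```python
from llm_api import query_llm  # same import as the screenshot example; adjust the path if needed

# Text-only query; the provider falls back to the default (OpenAI) when omitted.
answer = query_llm("What is the capital of France?", provider="anthropic")
print(answer)
```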
63
+
64
+ ## Web browser
65
+
66
+ You could use the `tools/web_scraper.py` file to scrape the web.
67
+ ```
68
+ venv/bin/python ./tools/web_scraper.py --max-concurrent 3 URL1 URL2 URL3
69
+ ```
70
+ This will output the content of the web pages.
71
+
72
+ ## Search engine
73
+
74
+ You could use the `tools/search_engine.py` file to search the web.
75
+ ```
76
+ venv/bin/python ./tools/search_engine.py "your search keywords"
77
+ ```
78
+ This will output the search results in the following format:
79
+ ```
80
+ URL: https://example.com
81
+ Title: This is the title of the search result
82
+ Snippet: This is a snippet of the search result
83
+ ```
84
+ If needed, you can further use the `web_scraper.py` file to scrape the web page content.
85
+
86
+ # Lessons
87
+
88
+ ## User Specified Lessons
89
+
90
+ - You have a python venv in ./venv. Use it.
91
+ - Include info useful for debugging in the program output.
92
+ - Read the file before you try to edit it.
93
+ - Due to Cursor's limit, when you use `git` and `gh` and need to submit a multiline commit message, first write the message in a file, and then use `git commit -F <filename>` or similar command to commit. And then remove the file. Include "[Cursor] " in the commit message and PR title.
94
+
95
+ ## Cursor learned
96
+
97
+ - For search results, ensure proper handling of different character encodings (UTF-8) for international queries
98
+ - Add debug information to stderr while keeping the main output clean in stdout for better pipeline integration
99
+ - When using seaborn styles in matplotlib, use 'seaborn-v0_8' instead of 'seaborn' as the style name due to recent seaborn version changes
100
+ - Use 'gpt-4o' as the model name for OpenAI's GPT-4 with vision capabilities
101
+ - For Vercel deployments with FastAPI and Mangum, use older stable versions (FastAPI 0.88.0, Mangum 0.15.0, Pydantic 1.10.2) to avoid compatibility issues
102
+ - Keep Vercel configuration simple and avoid unnecessary configuration options that might cause conflicts
103
+
104
+ # Scratchpad
105
+
106
+ Current Task: Fix Vercel deployment issues with FastAPI and Mangum
107
+
108
+ Progress:
109
+ [X] Identified issue with newer versions of FastAPI and Mangum
110
+ [X] Updated dependencies to use older, stable versions
111
+ [X] Simplified FastAPI configuration
112
+ [X] Simplified Vercel configuration
113
+ [X] Successfully deployed to production
114
+
115
+ Next Steps:
116
+ [ ] Test all API endpoints
117
+ [ ] Add more functionality if needed
118
+ [ ] Consider adding monitoring and logging
.gitignore ADDED
@@ -0,0 +1,43 @@
1
+ # Python
2
+ __pycache__/
3
+ *.py[cod]
4
+ *$py.class
5
+ *.so
6
+ .Python
7
+ build/
8
+ develop-eggs/
9
+ dist/
10
+ downloads/
11
+ eggs/
12
+ .eggs/
13
+ lib/
14
+ lib64/
15
+ parts/
16
+ sdist/
17
+ var/
18
+ wheels/
19
+ *.egg-info/
20
+ .installed.cfg
21
+ *.egg
22
+
23
+ # Virtual Environment
24
+ venv/
25
+ env/
26
+ ENV/
27
+
28
+ # IDE
29
+ .idea/
30
+ .vscode/
31
+ *.swp
32
+ *.swo
33
+
34
+ # Vercel
35
+ .vercel/
36
+ .env
37
+ .env.local
38
+
39
+ # Temporary files
40
+ *.tmp
41
+ tmp/
42
+ temp/
43
+ .vercel
README.md CHANGED
@@ -54,4 +54,158 @@ Built with:
54
 
55
  ## 📝 License
56
 
57
- MIT License
 
54
 
55
  ## 📝 License
56
 
57
+ MIT License
58
+
59
+ # Document Parser API
60
+
61
+ A scalable FastAPI service for parsing various document formats (PDF, DOCX, TXT, HTML, Markdown) with automatic information extraction.
62
+
63
+ ## Features
64
+
65
+ - 📄 Multi-format support (PDF, DOCX, TXT, HTML, Markdown)
66
+ - 🔄 Asynchronous processing with background tasks
67
+ - 🌐 Support for both file uploads and URL inputs
68
+ - 📊 Structured information extraction
69
+ - 🔗 Webhook support for processing notifications
70
+ - 🚀 Highly scalable architecture
71
+ - 🛡️ Comprehensive error handling
72
+ - 📝 Detailed logging
73
+
74
+ ## Quick Start
75
+
76
+ ### Prerequisites
77
+
78
+ - Python 3.8+
79
+ - pip (Python package manager)
80
+
81
+ ### Installation
82
+
83
+ 1. Clone the repository:
84
+ ```bash
85
+ git clone https://github.com/yourusername/document-parser-api.git
86
+ cd document-parser-api
87
+ ```
88
+
89
+ 2. Install dependencies:
90
+ ```bash
91
+ pip install -r requirements.txt
92
+ ```
93
+
94
+ 3. Run the API server:
95
+ ```bash
96
+ python api.py
97
+ ```
98
+
99
+ The API will be available at `http://localhost:8000`
100
+
101
+ ## API Documentation
102
+
103
+ ### Endpoints
104
+
105
+ #### 1. Parse Document from File Upload
106
+ ```http
107
+ POST /parse/file
108
+ ```
109
+ - Upload a document file for parsing
110
+ - Optional callback URL for processing notification
111
+ - Returns a job ID for status tracking
112
+
113
+ #### 2. Parse Document from URL
114
+ ```http
115
+ POST /parse/url
116
+ ```
117
+ - Submit a document URL for parsing
118
+ - Optional callback URL for processing notification
119
+ - Returns a job ID for status tracking
120
+
121
+ #### 3. Check Processing Status
122
+ ```http
123
+ GET /status/{job_id}
124
+ ```
125
+ - Get the current status of a parsing job
126
+ - Returns processing status and results if completed
127
+
128
+ #### 4. Health Check
129
+ ```http
130
+ GET /health
131
+ ```
132
+ - Check if the API is running and healthy
133
+
134
+ ### Example Usage
135
+
136
+ #### Parse File
137
+ ```python
138
+ import requests
139
+
140
+ url = "http://localhost:8000/parse/file"
141
+ files = {"file": open("document.pdf", "rb")}
142
+ response = requests.post(url, files=files)
143
+ print(response.json())
144
+ ```
145
+
146
+ #### Parse URL
147
+ ```python
148
+ import requests
149
+
150
+ url = "http://localhost:8000/parse/url"
151
+ data = {
152
+ "url": "https://example.com/document.pdf",
153
+ "callback_url": "https://your-callback-url.com/webhook"
154
+ }
155
+ response = requests.post(url, json=data)
156
+ print(response.json())
157
+ ```
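
Both parse endpoints return a `job_id`, so a client can poll the status endpoint described above until processing finishes. A minimal sketch (host, polling interval, and timeout are illustrative):

```python
import time
import requests

def wait_for_result(job_id, base_url="http://localhost:8000", interval=2, timeout=60):
    """Poll GET /status/{job_id} until the job is completed or failed."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = requests.get(f"{base_url}/status/{job_id}").json()
        if status["status"] in ("completed", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError(f"Job {job_id} did not finish within {timeout} seconds")

job = requests.post("http://localhost:8000/parse/url",
                    json={"url": "https://example.com/document.pdf"}).json()
print(wait_for_result(job["job_id"]))
```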
158
+
159
+ ## Error Handling
160
+
161
+ The API implements comprehensive error handling:
162
+
163
+ - Invalid file formats
164
+ - Failed URL downloads
165
+ - Processing errors
166
+ - Invalid requests
167
+ - Server errors
168
+
169
+ All errors return appropriate HTTP status codes and detailed error messages.
170
+
171
+ ## Scaling Considerations
172
+
173
+ For production deployment, consider:
174
+
175
+ 1. **Job Storage**: Replace in-memory storage with Redis or a database (see the sketch after this list)
176
+ 2. **Load Balancing**: Deploy behind a load balancer
177
+ 3. **Worker Processes**: Adjust number of workers based on load
178
+ 4. **Rate Limiting**: Implement rate limiting for API endpoints
179
+ 5. **Monitoring**: Add metrics collection and monitoring
180
+
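For the job-storage point above, a possible sketch of replacing the in-memory `jobs` dict with Redis, assuming a local Redis instance and the `redis` package (key names are illustrative):

```python
import json
from typing import Optional

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def set_job(job_id: str, data: dict) -> None:
    # Persist job state as JSON under a namespaced key so any worker can read it.
    r.set(f"job:{job_id}", json.dumps(data))

def get_job(job_id: str) -> Optional[dict]:
    raw = r.get(f"job:{job_id}")
    return json.loads(raw) if raw else None

# Drop-in replacements for the dictionary operations in api.py:
#   jobs[job_id] = {"status": "processing"}  ->  set_job(job_id, {"status": "processing"})
#   jobs[job_id]                             ->  get_job(job_id)
```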
181
+ ## Development
182
+
183
+ ### Running Tests
184
+ ```bash
185
+ pytest tests/
186
+ ```
187
+
188
+ ### Local Development
189
+ ```bash
190
+ uvicorn api:app --reload --port 8000
191
+ ```
192
+
193
+ ### API Documentation
194
+ - Swagger UI: `http://localhost:8000/docs`
195
+ - ReDoc: `http://localhost:8000/redoc`
196
+
197
+ ## Contributing
198
+
199
+ 1. Fork the repository
200
+ 2. Create a feature branch
201
+ 3. Commit your changes
202
+ 4. Push to the branch
203
+ 5. Create a Pull Request
204
+
205
+ ## License
206
+
207
+ This project is licensed under the MIT License - see the LICENSE file for details.
208
+
209
+ ## Support
210
+
211
+ For support, please open an issue in the GitHub repository or contact the maintainers.
api.py ADDED
@@ -0,0 +1,230 @@
1
+ import os
2
+ from fastapi import FastAPI, HTTPException, UploadFile, File, BackgroundTasks
3
+ from fastapi.middleware.cors import CORSMiddleware
4
+ from pydantic import BaseModel, HttpUrl
5
+ import tempfile
6
+ import requests
7
+ from typing import Optional, List, Dict, Any
8
+ from dockling_parser import DocumentParser
9
+ from dockling_parser.exceptions import ParserError, UnsupportedFormatError
10
+ from dockling_parser.types import ParsedDocument
11
+ import logging
12
+ import aiofiles
13
+ import asyncio
14
+ from urllib.parse import urlparse
15
+ from mangum import Mangum
16
+ import httpx
17
+
18
+ # Configure logging
19
+ logging.basicConfig(level=logging.INFO)
20
+ logger = logging.getLogger(__name__)
21
+
22
+ # Initialize FastAPI app
23
+ app = FastAPI(
24
+ title="Document Parser API",
25
+ description="A scalable API for parsing various document formats",
26
+ version="1.0.0"
27
+ )
28
+
29
+ # Add CORS middleware
30
+ app.add_middleware(
31
+ CORSMiddleware,
32
+ allow_origins=["*"],
33
+ allow_credentials=True,
34
+ allow_methods=["*"],
35
+ allow_headers=["*"],
36
+ )
37
+
38
+ # Initialize document parser
39
+ parser = DocumentParser()
40
+
41
+ class URLInput(BaseModel):
42
+ url: HttpUrl
43
+ callback_url: Optional[HttpUrl] = None
44
+
45
+ class ErrorResponse(BaseModel):
46
+ error: str
47
+ detail: Optional[str] = None
48
+ code: str
49
+
50
+ class ParseResponse(BaseModel):
51
+ job_id: str
52
+ status: str
53
+ result: Optional[ParsedDocument] = None
54
+ error: Optional[str] = None
55
+
56
+ # In-memory job storage (replace with Redis/DB in production)
57
+ jobs = {}
58
+
59
+ async def process_document_async(job_id: str, file_path: str, callback_url: Optional[str] = None):
60
+ """Process document asynchronously"""
61
+ try:
62
+ # Update job status
63
+ jobs[job_id] = {"status": "processing"}
64
+
65
+ # Parse document
66
+ result = parser.parse(file_path)
67
+
68
+ # Update job with result
69
+ jobs[job_id] = {
70
+ "status": "completed",
71
+ "result": result
72
+ }
73
+
74
+ # Call callback URL if provided
75
+ if callback_url:
76
+ try:
77
+ await notify_callback(callback_url, job_id, result)
78
+ except Exception as e:
79
+ logger.error(f"Failed to notify callback URL: {str(e)}")
80
+
81
+ except Exception as e:
82
+ logger.error(f"Error processing document: {str(e)}")
83
+ jobs[job_id] = {
84
+ "status": "failed",
85
+ "error": str(e)
86
+ }
87
+ finally:
88
+ # Cleanup temporary file
89
+ try:
90
+ if os.path.exists(file_path):
91
+ os.unlink(file_path)
92
+ except Exception as e:
93
+ logger.error(f"Error cleaning up file: {str(e)}")
94
+
95
+ async def notify_callback(callback_url: str, job_id: str, result: ParsedDocument):
96
+ """Notify callback URL with results"""
97
+ try:
98
+ async with httpx.AsyncClient() as client:
99
+ await client.post(
100
+ callback_url,
101
+ json={
102
+ "job_id": job_id,
103
+ "result": result.dict()
104
+ }
105
+ )
106
+ except Exception as e:
107
+ logger.error(f"Failed to send callback: {str(e)}")
108
+
109
+ @app.post("/parse/file", response_model=ParseResponse)
110
+ async def parse_file(
111
+ background_tasks: BackgroundTasks,
112
+ file: UploadFile = File(...),
113
+ callback_url: Optional[HttpUrl] = None
114
+ ):
115
+ """
116
+ Parse a document from file upload
117
+ """
118
+ try:
119
+ # Create temporary file in /tmp for Vercel
120
+ suffix = os.path.splitext(file.filename)[1]
121
+ tmp_dir = "/tmp" if os.path.exists("/tmp") else tempfile.gettempdir()
122
+ tmp_path = os.path.join(tmp_dir, f"upload_{os.urandom(8).hex()}{suffix}")
123
+
124
+ content = await file.read()
125
+ with open(tmp_path, "wb") as f:
126
+ f.write(content)
127
+
128
+ # Generate job ID
129
+ job_id = f"job_{len(jobs) + 1}"
130
+
131
+ # Start background processing
132
+ background_tasks.add_task(
133
+ process_document_async,
134
+ job_id,
135
+ tmp_path,
136
+ str(callback_url) if callback_url else None
137
+ )
138
+
139
+ return ParseResponse(
140
+ job_id=job_id,
141
+ status="queued"
142
+ )
143
+
144
+ except Exception as e:
145
+ logger.error(f"Error handling file upload: {str(e)}")
146
+ raise HTTPException(
147
+ status_code=500,
148
+ detail=str(e)
149
+ )
150
+
151
+ @app.post("/parse/url", response_model=ParseResponse)
152
+ async def parse_url(input_data: URLInput, background_tasks: BackgroundTasks):
153
+ """
154
+ Parse a document from URL
155
+ """
156
+ try:
157
+ # Download file
158
+ async with httpx.AsyncClient() as client:
159
+ response = await client.get(str(input_data.url), follow_redirects=True)
160
+ response.raise_for_status()
161
+
162
+ # Get filename from URL or use default
163
+ filename = os.path.basename(urlparse(str(input_data.url)).path)
164
+ if not filename:
165
+ filename = "document.pdf"
166
+
167
+ # Save to temporary file in /tmp for Vercel
168
+ tmp_dir = "/tmp" if os.path.exists("/tmp") else tempfile.gettempdir()
169
+ tmp_path = os.path.join(tmp_dir, f"download_{os.urandom(8).hex()}{os.path.splitext(filename)[1]}")
170
+
171
+ with open(tmp_path, "wb") as f:
172
+ f.write(response.content)
173
+
174
+ # Generate job ID
175
+ job_id = f"job_{len(jobs) + 1}"
176
+
177
+ # Start background processing
178
+ background_tasks.add_task(
179
+ process_document_async,
180
+ job_id,
181
+ tmp_path,
182
+ str(input_data.callback_url) if input_data.callback_url else None
183
+ )
184
+
185
+ return ParseResponse(
186
+ job_id=job_id,
187
+ status="queued"
188
+ )
189
+
190
+ except httpx.RequestError as e:
191
+ logger.error(f"Error downloading file: {str(e)}")
192
+ raise HTTPException(
193
+ status_code=400,
194
+ detail=f"Error downloading file: {str(e)}"
195
+ )
196
+ except Exception as e:
197
+ logger.error(f"Error processing URL: {str(e)}")
198
+ raise HTTPException(
199
+ status_code=500,
200
+ detail=str(e)
201
+ )
202
+
203
+ @app.get("/status/{job_id}", response_model=ParseResponse)
204
+ async def get_status(job_id: str):
205
+ """
206
+ Get the status of a parsing job
207
+ """
208
+ if job_id not in jobs:
209
+ raise HTTPException(
210
+ status_code=404,
211
+ detail="Job not found"
212
+ )
213
+
214
+ job = jobs[job_id]
215
+ return ParseResponse(
216
+ job_id=job_id,
217
+ status=job["status"],
218
+ result=job.get("result"),
219
+ error=job.get("error")
220
+ )
221
+
222
+ @app.get("/health")
223
+ async def health_check():
224
+ """
225
+ Health check endpoint
226
+ """
227
+ return {"status": "healthy"}
228
+
229
+ # Handler for Vercel
230
+ handler = Mangum(app, lifespan="off")
api/index.py CHANGED
@@ -1,3 +1,4 @@
 
1
  import json
2
  import os
3
  import tempfile
@@ -59,232 +60,132 @@ def download_file(url):
59
  except Exception as e:
60
  raise ValueError(f"Failed to download file: {str(e)}")
61
 
62
- def handle_root():
63
- return {
64
- "statusCode": 200,
65
- "headers": {
66
- "Content-Type": "application/json"
67
- },
68
- "body": json.dumps({
69
- "status": "ok",
70
- "message": "Document Processing API",
71
- "version": "1.0.0"
72
- })
73
- }
74
-
75
- def handle_health():
76
- return {
77
- "statusCode": 200,
78
- "headers": {
79
- "Content-Type": "application/json"
80
- },
81
- "body": json.dumps({
82
- "status": "healthy",
83
- "timestamp": str(datetime.datetime.now(datetime.UTC))
84
- })
85
- }
86
-
87
- def handle_parse_file(event):
88
- try:
89
- # Check if there's a file in the request
90
- if 'body' not in event or not event['body']:
91
- return {
92
- "statusCode": 400,
93
- "headers": {"Content-Type": "application/json"},
94
- "body": json.dumps({
95
- "error": "No file provided",
96
- "details": "Please provide a file in the request body"
97
- })
98
  }
99
-
100
- # Get file data
101
- file_data = event['body']
102
- if event.get('isBase64Encoded', False):
103
- import base64
104
- file_data = base64.b64decode(file_data)
105
-
106
- # Validate file type
107
- if not is_valid_file(file_data):
108
- return {
109
- "statusCode": 400,
110
- "headers": {"Content-Type": "application/json"},
111
- "body": json.dumps({
112
- "error": "Invalid file type",
113
- "details": "Supported formats: PDF, TXT, HTML, MD, DOCX"
114
- })
115
  }
116
-
117
- # Save to temp file
118
- fd, temp_path = tempfile.mkstemp()
119
- os.close(fd)
120
-
121
- try:
122
- with open(temp_path, 'wb') as f:
123
- f.write(file_data)
124
-
125
- # Process file here
126
- return {
127
- "statusCode": 200,
128
- "headers": {"Content-Type": "application/json"},
129
- "body": json.dumps({
130
- "status": "success",
131
- "message": "File processed successfully",
132
- "metadata": {
133
- "size": os.path.getsize(temp_path),
134
- "mime_type": magic.from_file(temp_path, mime=True)
135
- }
136
- })
137
  }
138
- finally:
139
- # Clean up temp file
140
  try:
141
- os.unlink(temp_path)
142
- except:
143
- pass
144
-
145
- except Exception as e:
146
- return {
147
- "statusCode": 500,
148
- "headers": {"Content-Type": "application/json"},
149
- "body": json.dumps({
150
- "error": "Processing failed",
151
- "details": str(e)
152
- })
153
- }
154
 
155
- def handle_parse_url(event):
156
- try:
157
- # Get request body
158
- if 'body' not in event or not event['body']:
159
- return {
160
- "statusCode": 400,
161
- "headers": {"Content-Type": "application/json"},
162
- "body": json.dumps({
163
- "error": "No request body",
164
- "details": "Please provide a URL in the request body"
165
- })
166
- }
167
-
168
- # Parse JSON body
169
- try:
170
- body = json.loads(event['body'])
171
- except:
172
- return {
173
- "statusCode": 400,
174
- "headers": {"Content-Type": "application/json"},
175
- "body": json.dumps({
176
- "error": "Invalid JSON",
177
- "details": "Request body must be valid JSON"
178
- })
179
- }
180
-
181
- # Get URL from body
182
- url = body.get('url')
183
- if not url:
184
- return {
185
- "statusCode": 400,
186
- "headers": {"Content-Type": "application/json"},
187
- "body": json.dumps({
188
- "error": "No URL provided",
189
- "details": "Please provide a URL in the request body"
190
- })
191
- }
192
-
193
- if not url.startswith(('http://', 'https://')):
194
- return {
195
- "statusCode": 400,
196
- "headers": {"Content-Type": "application/json"},
197
- "body": json.dumps({
198
- "error": "Invalid URL",
199
- "details": "URL must start with http:// or https://"
200
- })
201
- }
202
-
203
- # Download and process file
204
- temp_path, content_type = download_file(url)
205
- try:
206
- # Process file here
207
- return {
208
- "statusCode": 200,
209
- "headers": {"Content-Type": "application/json"},
210
- "body": json.dumps({
211
- "status": "success",
212
- "message": "URL processed successfully",
213
- "metadata": {
214
- "url": url,
215
- "content_type": content_type,
216
- "size": os.path.getsize(temp_path)
217
  }
218
- })
219
- }
220
- finally:
221
- # Clean up temp file
222
  try:
223
- os.unlink(temp_path)
224
- except:
225
- pass
226
-
227
- except ValueError as e:
228
- return {
229
- "statusCode": 400,
230
- "headers": {"Content-Type": "application/json"},
231
- "body": json.dumps({
232
- "error": "Invalid request",
233
- "details": str(e)
234
- })
235
- }
236
- except Exception as e:
237
- return {
238
- "statusCode": 500,
239
- "headers": {"Content-Type": "application/json"},
240
- "body": json.dumps({
241
- "error": "Processing failed",
242
- "details": str(e)
243
- })
244
- }
245
 
246
- def handler(event, context):
247
- """Serverless handler for Vercel"""
248
-
249
- # Add CORS headers to all responses
250
- cors_headers = {
251
- "Access-Control-Allow-Origin": "*",
252
- "Access-Control-Allow-Headers": "Content-Type, Authorization",
253
- "Access-Control-Allow-Methods": "GET, POST, OPTIONS"
254
- }
255
-
256
- # Handle OPTIONS request for CORS
257
- if event.get('httpMethod') == 'OPTIONS':
258
- return {
259
- "statusCode": 204,
260
- "headers": cors_headers,
261
- "body": ""
262
- }
263
-
264
- # Get path and method
265
- path = event.get('path', '/').rstrip('/')
266
- method = event.get('httpMethod', 'GET')
267
-
268
- # Route request
269
- response = None
270
- if path == '' or path == '/':
271
- response = handle_root()
272
- elif path == '/health':
273
- response = handle_health()
274
- elif path == '/parse/file' and method == 'POST':
275
- response = handle_parse_file(event)
276
- elif path == '/parse/url' and method == 'POST':
277
- response = handle_parse_url(event)
278
- else:
279
- response = {
280
- "statusCode": 404,
281
- "headers": {"Content-Type": "application/json"},
282
- "body": json.dumps({
283
- "error": "Not Found",
284
- "details": f"Path {path} not found"
285
- })
286
- }
287
-
288
- # Add CORS headers to response
289
- response['headers'].update(cors_headers)
290
- return response
1
+ from http.server import BaseHTTPRequestHandler
2
  import json
3
  import os
4
  import tempfile
 
60
  except Exception as e:
61
  raise ValueError(f"Failed to download file: {str(e)}")
62
 
63
+ class Handler(BaseHTTPRequestHandler):
64
+ def do_GET(self):
65
+ if self.path == '/' or self.path == '':
66
+ self.send_response(200)
67
+ self.send_header('Content-type', 'application/json')
68
+ self.end_headers()
69
+ response = {
70
+ "status": "ok",
71
+ "message": "Document Processing API",
72
+ "version": "1.0.0"
73
  }
74
+ self.wfile.write(json.dumps(response).encode())
75
+ elif self.path == '/health':
76
+ self.send_response(200)
77
+ self.send_header('Content-type', 'application/json')
78
+ self.end_headers()
79
+ response = {
80
+ "status": "healthy",
81
+ "timestamp": str(datetime.datetime.now(datetime.UTC))
82
  }
83
+ self.wfile.write(json.dumps(response).encode())
84
+ else:
85
+ self.send_response(404)
86
+ self.send_header('Content-type', 'application/json')
87
+ self.end_headers()
88
+ response = {
89
+ "error": "Not Found",
90
+ "details": f"Path {self.path} not found"
91
  }
92
+ self.wfile.write(json.dumps(response).encode())
93
+
94
+ def do_POST(self):
95
+ content_length = int(self.headers.get('Content-Length', 0))
96
+ post_data = self.rfile.read(content_length)
97
+
98
+ if self.path == '/parse/file':
99
  try:
100
+ if not post_data:
101
+ self.send_error(400, "No file provided")
102
+ return
103
 
104
+ if not is_valid_file(post_data):
105
+ self.send_error(400, "Invalid file type")
106
+ return
107
+
108
+ fd, temp_path = tempfile.mkstemp()
109
+ os.close(fd)
110
+
111
+ try:
112
+ with open(temp_path, 'wb') as f:
113
+ f.write(post_data)
114
+
115
+ self.send_response(200)
116
+ self.send_header('Content-type', 'application/json')
117
+ self.end_headers()
118
+ response = {
119
+ "status": "success",
120
+ "message": "File processed successfully",
121
+ "metadata": {
122
+ "size": os.path.getsize(temp_path),
123
+ "mime_type": magic.from_file(temp_path, mime=True)
124
+ }
125
  }
126
+ self.wfile.write(json.dumps(response).encode())
127
+ finally:
128
+ try:
129
+ os.unlink(temp_path)
130
+ except:
131
+ pass
132
+
133
+ except Exception as e:
134
+ self.send_error(500, str(e))
135
+
136
+ elif self.path == '/parse/url':
137
  try:
138
+ try:
139
+ body = json.loads(post_data)
140
+ except:
141
+ self.send_error(400, "Invalid JSON")
142
+ return
143
 
144
+ url = body.get('url')
145
+ if not url:
146
+ self.send_error(400, "No URL provided")
147
+ return
148
+
149
+ if not url.startswith(('http://', 'https://')):
150
+ self.send_error(400, "Invalid URL")
151
+ return
152
+
153
+ temp_path, content_type = download_file(url)
154
+ try:
155
+ self.send_response(200)
156
+ self.send_header('Content-type', 'application/json')
157
+ self.end_headers()
158
+ response = {
159
+ "status": "success",
160
+ "message": "URL processed successfully",
161
+ "metadata": {
162
+ "url": url,
163
+ "content_type": content_type,
164
+ "size": os.path.getsize(temp_path)
165
+ }
166
+ }
167
+ self.wfile.write(json.dumps(response).encode())
168
+ finally:
169
+ try:
170
+ os.unlink(temp_path)
171
+ except:
172
+ pass
173
+
174
+ except ValueError as e:
175
+ self.send_error(400, str(e))
176
+ except Exception as e:
177
+ self.send_error(500, str(e))
178
+
179
+ else:
180
+ self.send_error(404, f"Path {self.path} not found")
181
+
182
+ def do_OPTIONS(self):
183
+ self.send_response(204)
184
+ self.send_header('Access-Control-Allow-Origin', '*')
185
+ self.send_header('Access-Control-Allow-Methods', 'GET, POST, OPTIONS')
186
+ self.send_header('Access-Control-Allow-Headers', 'Content-Type, Authorization')
187
+ self.end_headers()
188
+
189
+ def handler(request, context):
190
+ """Vercel serverless handler"""
191
+ return Handler(request, None, None).handle_request()
index.py DELETED
@@ -1,290 +0,0 @@
1
- import json
2
- import os
3
- import tempfile
4
- import magic
5
- import requests
6
- from werkzeug.utils import secure_filename
7
- import datetime
8
-
9
- def is_valid_file(file_data):
10
- """Check if file type is allowed using python-magic"""
11
- try:
12
- mime = magic.from_buffer(file_data, mime=True)
13
- allowed_mimes = [
14
- 'application/pdf',
15
- 'text/plain',
16
- 'text/html',
17
- 'text/markdown',
18
- 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
19
- ]
20
- return mime in allowed_mimes
21
- except Exception:
22
- return False
23
-
24
- def download_file(url):
25
- """Download file from URL and save to temp file"""
26
- try:
27
- response = requests.get(url, stream=True, timeout=10)
28
- response.raise_for_status()
29
-
30
- # Get content type
31
- content_type = response.headers.get('content-type', '').split(';')[0]
32
- if content_type not in [
33
- 'application/pdf',
34
- 'text/plain',
35
- 'text/html',
36
- 'text/markdown',
37
- 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
38
- ]:
39
- raise ValueError(f"Unsupported content type: {content_type}")
40
-
41
- # Create temp file with proper extension
42
- ext = {
43
- 'application/pdf': '.pdf',
44
- 'text/plain': '.txt',
45
- 'text/html': '.html',
46
- 'text/markdown': '.md',
47
- 'application/vnd.openxmlformats-officedocument.wordprocessingml.document': '.docx'
48
- }.get(content_type, '')
49
-
50
- fd, temp_path = tempfile.mkstemp(suffix=ext)
51
- os.close(fd)
52
-
53
- # Download file
54
- with open(temp_path, 'wb') as f:
55
- for chunk in response.iter_content(chunk_size=8192):
56
- f.write(chunk)
57
-
58
- return temp_path, content_type
59
- except Exception as e:
60
- raise ValueError(f"Failed to download file: {str(e)}")
61
-
62
- def handle_root():
63
- return {
64
- "statusCode": 200,
65
- "headers": {
66
- "Content-Type": "application/json"
67
- },
68
- "body": json.dumps({
69
- "status": "ok",
70
- "message": "Document Processing API",
71
- "version": "1.0.0"
72
- })
73
- }
74
-
75
- def handle_health():
76
- return {
77
- "statusCode": 200,
78
- "headers": {
79
- "Content-Type": "application/json"
80
- },
81
- "body": json.dumps({
82
- "status": "healthy",
83
- "timestamp": str(datetime.datetime.now(datetime.UTC))
84
- })
85
- }
86
-
87
- def handle_parse_file(event):
88
- try:
89
- # Check if there's a file in the request
90
- if 'body' not in event or not event['body']:
91
- return {
92
- "statusCode": 400,
93
- "headers": {"Content-Type": "application/json"},
94
- "body": json.dumps({
95
- "error": "No file provided",
96
- "details": "Please provide a file in the request body"
97
- })
98
- }
99
-
100
- # Get file data
101
- file_data = event['body']
102
- if event.get('isBase64Encoded', False):
103
- import base64
104
- file_data = base64.b64decode(file_data)
105
-
106
- # Validate file type
107
- if not is_valid_file(file_data):
108
- return {
109
- "statusCode": 400,
110
- "headers": {"Content-Type": "application/json"},
111
- "body": json.dumps({
112
- "error": "Invalid file type",
113
- "details": "Supported formats: PDF, TXT, HTML, MD, DOCX"
114
- })
115
- }
116
-
117
- # Save to temp file
118
- fd, temp_path = tempfile.mkstemp()
119
- os.close(fd)
120
-
121
- try:
122
- with open(temp_path, 'wb') as f:
123
- f.write(file_data)
124
-
125
- # Process file here
126
- return {
127
- "statusCode": 200,
128
- "headers": {"Content-Type": "application/json"},
129
- "body": json.dumps({
130
- "status": "success",
131
- "message": "File processed successfully",
132
- "metadata": {
133
- "size": os.path.getsize(temp_path),
134
- "mime_type": magic.from_file(temp_path, mime=True)
135
- }
136
- })
137
- }
138
- finally:
139
- # Clean up temp file
140
- try:
141
- os.unlink(temp_path)
142
- except:
143
- pass
144
-
145
- except Exception as e:
146
- return {
147
- "statusCode": 500,
148
- "headers": {"Content-Type": "application/json"},
149
- "body": json.dumps({
150
- "error": "Processing failed",
151
- "details": str(e)
152
- })
153
- }
154
-
155
- def handle_parse_url(event):
156
- try:
157
- # Get request body
158
- if 'body' not in event or not event['body']:
159
- return {
160
- "statusCode": 400,
161
- "headers": {"Content-Type": "application/json"},
162
- "body": json.dumps({
163
- "error": "No request body",
164
- "details": "Please provide a URL in the request body"
165
- })
166
- }
167
-
168
- # Parse JSON body
169
- try:
170
- body = json.loads(event['body'])
171
- except:
172
- return {
173
- "statusCode": 400,
174
- "headers": {"Content-Type": "application/json"},
175
- "body": json.dumps({
176
- "error": "Invalid JSON",
177
- "details": "Request body must be valid JSON"
178
- })
179
- }
180
-
181
- # Get URL from body
182
- url = body.get('url')
183
- if not url:
184
- return {
185
- "statusCode": 400,
186
- "headers": {"Content-Type": "application/json"},
187
- "body": json.dumps({
188
- "error": "No URL provided",
189
- "details": "Please provide a URL in the request body"
190
- })
191
- }
192
-
193
- if not url.startswith(('http://', 'https://')):
194
- return {
195
- "statusCode": 400,
196
- "headers": {"Content-Type": "application/json"},
197
- "body": json.dumps({
198
- "error": "Invalid URL",
199
- "details": "URL must start with http:// or https://"
200
- })
201
- }
202
-
203
- # Download and process file
204
- temp_path, content_type = download_file(url)
205
- try:
206
- # Process file here
207
- return {
208
- "statusCode": 200,
209
- "headers": {"Content-Type": "application/json"},
210
- "body": json.dumps({
211
- "status": "success",
212
- "message": "URL processed successfully",
213
- "metadata": {
214
- "url": url,
215
- "content_type": content_type,
216
- "size": os.path.getsize(temp_path)
217
- }
218
- })
219
- }
220
- finally:
221
- # Clean up temp file
222
- try:
223
- os.unlink(temp_path)
224
- except:
225
- pass
226
-
227
- except ValueError as e:
228
- return {
229
- "statusCode": 400,
230
- "headers": {"Content-Type": "application/json"},
231
- "body": json.dumps({
232
- "error": "Invalid request",
233
- "details": str(e)
234
- })
235
- }
236
- except Exception as e:
237
- return {
238
- "statusCode": 500,
239
- "headers": {"Content-Type": "application/json"},
240
- "body": json.dumps({
241
- "error": "Processing failed",
242
- "details": str(e)
243
- })
244
- }
245
-
246
- def handler(event, context):
247
- """Serverless handler for Vercel"""
248
-
249
- # Add CORS headers to all responses
250
- cors_headers = {
251
- "Access-Control-Allow-Origin": "*",
252
- "Access-Control-Allow-Headers": "Content-Type, Authorization",
253
- "Access-Control-Allow-Methods": "GET, POST, OPTIONS"
254
- }
255
-
256
- # Handle OPTIONS request for CORS
257
- if event.get('httpMethod') == 'OPTIONS':
258
- return {
259
- "statusCode": 204,
260
- "headers": cors_headers,
261
- "body": ""
262
- }
263
-
264
- # Get path and method
265
- path = event.get('path', '/').rstrip('/')
266
- method = event.get('httpMethod', 'GET')
267
-
268
- # Route request
269
- response = None
270
- if path == '' or path == '/':
271
- response = handle_root()
272
- elif path == '/health':
273
- response = handle_health()
274
- elif path == '/parse/file' and method == 'POST':
275
- response = handle_parse_file(event)
276
- elif path == '/parse/url' and method == 'POST':
277
- response = handle_parse_url(event)
278
- else:
279
- response = {
280
- "statusCode": 404,
281
- "headers": {"Content-Type": "application/json"},
282
- "body": json.dumps({
283
- "error": "Not Found",
284
- "details": f"Path {path} not found"
285
- })
286
- }
287
-
288
- # Add CORS headers to response
289
- response['headers'].update(cors_headers)
290
- return response
requirements-prod.txt ADDED
@@ -0,0 +1,11 @@
1
+ docling>=0.2.0
2
+ pydantic>=2.0.0
3
+ python-magic>=0.4.27
4
+ PyPDF2>=3.0.0
5
+ beautifulsoup4>=4.12.0
6
+ lxml>=4.9.0
7
+ requests>=2.31.0
8
+ fastapi>=0.104.0
9
+ python-multipart>=0.0.6
10
+ httpx>=0.25.0
11
+ mangum>=0.17.0
vercel.json CHANGED
@@ -14,7 +14,6 @@
14
  {
15
  "src": "/(.*)",
16
  "dest": "api/index.py",
17
- "continue": true,
18
  "headers": {
19
  "Access-Control-Allow-Origin": "*",
20
  "Access-Control-Allow-Headers": "Content-Type, Authorization",
 
14
  {
15
  "src": "/(.*)",
16
  "dest": "api/index.py",
 
17
  "headers": {
18
  "Access-Control-Allow-Origin": "*",
19
  "Access-Control-Allow-Headers": "Content-Type, Authorization",