---
title: CV to CSV Extraction App
emoji: π
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# CV to CSV Extraction App
A Gradio application that extracts publications, talks, and other scholarly accomplishments from faculty CVs (PDFs) using Google's Gemini API with Pydantic-AI for robust structured data extraction.
## Features
- Extract scholarly accomplishments from faculty CVs in PDF format
- Categorize accomplishments into different types (books, journal articles, conference presentations, etc.)
- Display results in a tabular format
- Download results as CSV
- Password protection using Hugging Face secrets
- Robust JSON parsing with Pydantic-AI
## Installation
1. Clone this repository:
```
git clone <repository-url>
cd CV_to_CSV
```
2. Install the required dependencies:
```
pip install -r requirements.txt
```
3. Create a `.env` file in the root directory containing your Google API key and the app password:
```
GOOGLE_API_KEY=your_google_api_key_here
APP_PASSWORD=your_app_password_here
```
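A minimal sketch of how the two values above might be read at startup, using only the standard library. The helper names `parse_env_text` and `read_settings` are hypothetical and not taken from the app's code; the app itself may load `.env` differently (for example through a package such as python-dotenv). Treating `APP_PASSWORD` as optional is an assumption based on the usage notes below.

```python
import os
from typing import Dict, Mapping, Optional


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def read_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, Optional[str]]:
    """Return the two settings this README names; a missing API key is an error."""
    env = os.environ if env is None else env
    api_key = env.get("GOOGLE_API_KEY")
    if not api_key:
        raise RuntimeError("GOOGLE_API_KEY is not set")
    # Assumption: APP_PASSWORD is optional; None means no password was configured.
    return {"api_key": api_key, "app_password": env.get("APP_PASSWORD") or None}


if __name__ == "__main__":
    sample = "GOOGLE_API_KEY=your_google_api_key_here\nAPP_PASSWORD=your_app_password_here\n"
    print(read_settings(parse_env_text(sample)))
```

Failing fast on a missing key surfaces configuration problems at startup rather than on the first API call.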
## Usage
### Running Locally
1. Run the application:
```
python cv_extraction_app.py
```
2. Open your browser and navigate to `http://localhost:7860`
3. Enter the password, if one is set in the `APP_PASSWORD` environment variable
4. Upload one or more faculty CV PDFs and click "Extract Accomplishments"
5. View the extracted accomplishments and download as CSV if desired
### Deploying on Hugging Face Spaces
1. Create a new Space on Hugging Face Spaces with the Gradio SDK
2. Upload your code to the Space
3. Set up the following secrets in your Space settings:
   - `GOOGLE_API_KEY`: Your Google Gemini API key
   - `APP_PASSWORD`: The password you want to use for app authentication
## How It Works
1. **Authentication**: The app checks if the provided password matches the one stored in the environment variable `APP_PASSWORD`
2. **PDF Processing**: The app extracts text from uploaded PDF files using PyPDF2
3. **LLM Processing with Pydantic-AI**:
   - The extracted text is processed using Pydantic-AI with Google's Gemini model
   - Pydantic models define the structure of the expected data
   - This approach ensures more robust parsing and validation of the extracted data
   - If Pydantic-AI processing fails, the app falls back to the standard Gemini API approach
4. **Categorization**: Accomplishments are categorized into different types based on a decision tree approach
5. **Results Display**: The extracted accomplishments are displayed in a tabular format and can be downloaded as CSV
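The first and last steps above (password check and CSV export) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the app's actual code: the function names and the `COLUMNS` list are hypothetical, and the real CSV columns are not documented in this README.

```python
import csv
import hmac
import io
from typing import Dict, List, Optional

# Hypothetical column names; the app's real CSV columns are not listed here.
COLUMNS = ["faculty_name", "category", "year", "description"]


def password_ok(provided: str, expected: Optional[str]) -> bool:
    """Compare the submitted password with APP_PASSWORD in constant time."""
    if not expected:
        # Assumption: with no password configured, access is not restricted.
        return True
    return hmac.compare_digest(provided.encode("utf-8"), expected.encode("utf-8"))


def rows_to_csv(rows: List[Dict[str, object]]) -> str:
    """Serialize accomplishment rows to CSV text, leaving missing fields empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=COLUMNS,
        restval="",              # fill missing columns with an empty string
        extrasaction="ignore",   # silently drop keys that are not in COLUMNS
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    if password_ok("secret", "secret"):
        print(rows_to_csv([{"faculty_name": "A. Author", "category": "Book", "year": 2020}]))
```

`hmac.compare_digest` avoids leaking how many leading characters matched, which a plain `==` comparison can do through timing.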
## Pydantic-AI Integration
The app uses Pydantic-AI to improve the reliability of structured data extraction:
- **Defined Data Models**: Clear schema definitions for faculty data and accomplishments
- **Type Validation**: Ensures fields like years and confidence scores are properly typed
- **Default Values**: Handles missing fields gracefully with sensible defaults
- **Fallback Mechanism**: If Pydantic-AI extraction fails, the app falls back to standard extraction
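The four points above could look roughly like the following, using plain Pydantic models. The class names, field names, defaults, and the `extract_with_fallback` helper are all hypothetical; the app's real schema and its Pydantic-AI agent setup are not shown in this README, so the two extractors are passed in as ordinary callables.

```python
from typing import Callable, List, Optional

from pydantic import BaseModel, Field, ValidationError


class Accomplishment(BaseModel):
    """One extracted item. Field names are illustrative, not the app's schema."""
    description: str
    category: str = "Other"                                  # default for a missing field
    year: Optional[int] = None                               # a string such as "2021" is coerced to int
    confidence: float = Field(default=0.5, ge=0.0, le=1.0)   # must lie in [0, 1]


class FacultyRecord(BaseModel):
    faculty_name: str = "Unknown"
    accomplishments: List[Accomplishment] = Field(default_factory=list)


def extract_with_fallback(
    text: str,
    primary: Callable[[str], dict],
    fallback: Callable[[str], dict],
) -> FacultyRecord:
    """Validate the primary extractor's output; on failure, use the fallback."""
    try:
        return FacultyRecord(**primary(text))
    except (ValidationError, ValueError, KeyError, TypeError):
        # Assumption: the fallback returns a dict of the same shape.
        return FacultyRecord(**fallback(text))
```

In this sketch a result that violates the schema (for example a confidence of 7) is rejected at validation time and the fallback path runs instead, so malformed rows never reach the table or CSV.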
## Customization
### Changing the Password
To change the password, update the `APP_PASSWORD` environment variable:
- Locally: Modify the `.env` file
- On Hugging Face Spaces: Update the secret in the Space settings
### Modifying Categories
To modify the categories of scholarly accomplishments, edit the `MAIN_CATEGORIES` and `SCHOLARLY_WORK_TYPES` lists in `cv_extraction_app.py`.
## Troubleshooting
- **API Key Issues**: Ensure your Google API key is correctly set in the environment variables
- **PDF Extraction Errors**: Some PDFs may be password-protected or have security settings that prevent text extraction
- **LLM Processing Errors**: If the LLM fails to extract accomplishments, try adjusting the prompt or model parameters
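To narrow down PDF extraction errors, a small check like the one below can tell an encrypted file apart from one that simply has no text layer (for example a scanned image). This is a sketch, not the app's code: `open_reader` and `extract_text` are hypothetical names, and it assumes a PyPDF2 release that provides `PdfReader`.

```python
from typing import Any, List


def open_reader(path: str) -> Any:
    """Open a PDF with PyPDF2, imported lazily so extract_text has no hard dependency."""
    from PyPDF2 import PdfReader  # assumes a PyPDF2 version that exposes PdfReader
    return PdfReader(path)


def extract_text(reader: Any) -> str:
    """Join the text of all pages, failing early with a specific message."""
    if getattr(reader, "is_encrypted", False):
        raise ValueError("PDF is encrypted; remove the password before uploading")
    parts: List[str] = []
    for page in reader.pages:
        # Pages without a text layer may yield an empty result; treat it as "".
        parts.append(page.extract_text() or "")
    text = "\n".join(parts).strip()
    if not text:
        raise ValueError("No extractable text; the PDF may be a scanned image")
    return text


if __name__ == "__main__":
    import sys
    print(extract_text(open_reader(sys.argv[1]))[:500])
```

Running it against a problem file from the command line shows which of the two failure modes applies before any LLM call is made.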
## License
This project is licensed under the MIT License.