---
title: CV to CSV Extraction App
emoji: π
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# CV to CSV Extraction App
A Gradio application that extracts publications, talks, and other scholarly accomplishments from faculty CVs (PDFs) using Google's Gemini API with Pydantic-AI for robust structured data extraction.
## Features
- Extract scholarly accomplishments from faculty CVs in PDF format
- Categorize accomplishments into different types (books, journal articles, conference presentations, etc.)
- Display results in a tabular format
- Download results as CSV
- Password protection using Hugging Face secrets
- Robust JSON parsing with Pydantic-AI
## Installation
1. Clone this repository: | |
``` | |
git clone <repository-url> | |
cd CV_to_CSV | |
``` | |
2. Install the required dependencies: | |
``` | |
pip install -r requirements.txt | |
``` | |
3. Create a `.env` file in the root directory with your Google API key and the app password:
``` | |
GOOGLE_API_KEY=your_google_api_key_here | |
APP_PASSWORD=your_app_password_here | |
``` | |
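As a rough illustration of how these two variables might be consumed, here is a minimal sketch. The `Settings` type and `load_settings` function are hypothetical names, not taken from the app. Python does not read `.env` files on its own; a package such as `python-dotenv` (`load_dotenv()`) is commonly used to copy them into the environment first, and whether this app does so is an assumption.

```python
import os
from typing import Mapping, NamedTuple, Optional


class Settings(NamedTuple):
    """Hypothetical container for the two variables in the .env example."""
    google_api_key: str
    app_password: Optional[str]  # None means no password was configured


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read GOOGLE_API_KEY (required) and APP_PASSWORD (optional)."""
    api_key = env.get("GOOGLE_API_KEY", "").strip()
    if not api_key:
        # Fail early with a clear message instead of at the first API call.
        raise RuntimeError("GOOGLE_API_KEY is not set")
    password = env.get("APP_PASSWORD", "").strip() or None
    return Settings(google_api_key=api_key, app_password=password)
```

Passing the environment as a mapping keeps the function easy to test without touching the real process environment.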
## Usage
### Running Locally
1. Run the application: | |
``` | |
python cv_extraction_app.py | |
``` | |
2. Open your browser and navigate to `http://localhost:7860` | |
3. Enter the password (if set in the environment variable `APP_PASSWORD`) | |
4. Upload one or more faculty CV PDFs and click "Extract Accomplishments" | |
5. View the extracted accomplishments and download as CSV if desired
### Deploying on Hugging Face Spaces
1. Create a new Space on Hugging Face Spaces with the Gradio SDK | |
2. Upload your code to the Space | |
3. Set up the following secrets in your Space settings: | |
   - `GOOGLE_API_KEY`: Your Google Gemini API key
   - `APP_PASSWORD`: The password you want to use for app authentication
## How It Works
1. **Authentication**: The app checks if the provided password matches the one stored in the environment variable `APP_PASSWORD` | |
2. **PDF Processing**: The app extracts text from uploaded PDF files using PyPDF2 | |
3. **LLM Processing with Pydantic-AI**: | |
   - The extracted text is processed using Pydantic-AI with Google's Gemini model
   - Pydantic models define the structure of the expected data
   - This approach ensures more robust parsing and validation of the extracted data
   - If Pydantic-AI processing fails, the app falls back to the standard Gemini API approach
4. **Categorization**: Accomplishments are categorized into different types based on a decision tree approach
5. **Results Display**: The extracted accomplishments are displayed in a tabular format and can be downloaded as CSV
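The final step, turning extracted accomplishments into a downloadable CSV, could look roughly like the sketch below. It uses only the standard library. The column names in `COLUMNS` and the function `accomplishments_to_csv` are hypothetical; the app's actual CSV header may differ.

```python
import csv
import io
from typing import Dict, Iterable, List

# Hypothetical column names; the app's real CSV header may differ.
COLUMNS: List[str] = ["faculty_name", "category", "year", "description"]


def accomplishments_to_csv(rows: Iterable[Dict[str, object]]) -> str:
    """Serialize extracted accomplishments to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=COLUMNS,
        restval="",             # missing fields become empty cells
        extrasaction="ignore",  # unexpected keys are dropped, not errors
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()
```

Using `csv.DictWriter` rather than joining strings by hand means titles that contain commas or quotes are escaped correctly.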
## Pydantic-AI Integration
The app uses Pydantic-AI to improve the reliability of structured data extraction:
- **Defined Data Models**: Clear schema definitions for faculty data and accomplishments | |
- **Type Validation**: Ensures fields like years and confidence scores are properly typed | |
- **Default Values**: Handles missing fields gracefully with sensible defaults | |
- **Fallback Mechanism**: If Pydantic-AI extraction fails, the app falls back to standard extraction
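To make these points concrete, here is a minimal sketch using plain Pydantic models (not Pydantic-AI itself, and not the app's actual schema). The class names, field names, and `parse_record` are illustrative assumptions. It shows typed fields, defaults for missing values, and a validation failure that a caller could use as the signal to fall back.

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class Accomplishment(BaseModel):
    # Field names are illustrative, not the app's actual schema.
    category: str = "Other"            # default when the model omits it
    description: str                   # required
    year: Optional[int] = None         # a string such as "2019" is coerced to int
    confidence: float = Field(default=0.0, ge=0.0, le=1.0)


class FacultyRecord(BaseModel):
    faculty_name: str = "Unknown"
    accomplishments: List[Accomplishment] = Field(default_factory=list)


def parse_record(data: dict) -> Optional[FacultyRecord]:
    """Return a validated record, or None so the caller can fall back."""
    try:
        return FacultyRecord(**data)
    except ValidationError:
        return None
```

Returning `None` on failure is one possible design; it keeps the fallback decision in the caller rather than inside the parsing code.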
## Customization
### Changing the Password
To change the password, update the `APP_PASSWORD` environment variable:
- Locally: Modify the `.env` file | |
- On Hugging Face Spaces: Update the secret in the Space settings
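For reference, a password check against `APP_PASSWORD` might be sketched as follows. The function name `password_ok` is hypothetical, and treating an unset password as "no authentication required" is an assumption based on the usage notes, not confirmed behavior.

```python
import hmac
from typing import Optional


def password_ok(supplied: str, expected: Optional[str]) -> bool:
    """Compare a user-supplied password with the configured one."""
    if not expected:
        # Assumption: with no APP_PASSWORD configured, access is open.
        return True
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(
        supplied.encode("utf-8"), expected.encode("utf-8")
    )
```

Because the value is read from the environment, changing the secret typically takes effect only after the app restarts.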
### Modifying Categories
To modify the categories of scholarly accomplishments, edit the `MAIN_CATEGORIES` and `SCHOLARLY_WORK_TYPES` lists in `cv_extraction_app.py`.
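The sketch below illustrates why such a list is worth keeping in one place: labels returned by the model can be normalized against it. The list contents here are placeholders, and `normalize_category` is a hypothetical helper; check `cv_extraction_app.py` for the real values and how they are used.

```python
from typing import Sequence

# Placeholder values only; check cv_extraction_app.py for the real list.
MAIN_CATEGORIES = ["Books", "Journal Articles", "Conference Presentations", "Other"]


def normalize_category(
    label: str,
    allowed: Sequence[str] = MAIN_CATEGORIES,
    fallback: str = "Other",
) -> str:
    """Map a model-produced label onto an allowed category, ignoring case."""
    lookup = {name.casefold(): name for name in allowed}
    return lookup.get(label.strip().casefold(), fallback)
```

With this shape, adding a category is a one-line change to the list, and unknown labels land in the fallback bucket instead of creating stray categories.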
## Troubleshooting
- **API Key Issues**: Ensure your Google API key is correctly set in the environment variables | |
- **PDF Extraction Errors**: Some PDFs may be password-protected or have security settings that prevent text extraction | |
- **LLM Processing Errors**: If the LLM fails to extract accomplishments, try adjusting the prompt or model parameters
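When diagnosing PDF extraction errors, it can help to separate "encrypted" from "no text found". The following is a minimal sketch, not the app's own code: `extract_pdf_text` and `PdfTextError` are hypothetical names, and the function accepts any object that exposes `is_encrypted`, `pages`, and per-page `extract_text()`, as a PyPDF2 `PdfReader` does.

```python
from typing import Any


class PdfTextError(Exception):
    """Raised when no usable text can be read from a PDF."""


def extract_pdf_text(reader: Any) -> str:
    """Join the text of every page of a PyPDF2-style reader object.

    Typical use (requires PyPDF2 to be installed):
        from PyPDF2 import PdfReader
        text = extract_pdf_text(PdfReader("cv.pdf"))
    """
    if getattr(reader, "is_encrypted", False):
        raise PdfTextError("PDF is encrypted; remove the password and retry")
    # extract_text() may return None or "" for pages without a text layer.
    pages = [(page.extract_text() or "") for page in reader.pages]
    text = "\n".join(pages).strip()
    if not text:
        raise PdfTextError("no extractable text (possibly a scanned PDF)")
    return text
```

Distinct error messages make it easier to tell the user whether to unlock the file or to supply a version with a text layer.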
## License
This project is licensed under the MIT License.