---
title: CV to CSV Extraction App
emoji: 📄
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# CV to CSV Extraction App
A Gradio application that extracts publications, talks, and other scholarly accomplishments from faculty CVs (PDFs) using Google's Gemini API with Pydantic-AI for robust structured data extraction.
## Features
- Extract scholarly accomplishments from faculty CVs in PDF format
- Categorize accomplishments into different types (books, journal articles, conference presentations, etc.)
- Display results in a tabular format
- Download results as CSV
- Password protection using Hugging Face secrets
- Robust JSON parsing with Pydantic-AI
## Installation
1. Clone this repository:
```
git clone <repository-url>
cd CV_to_CSV
```
2. Install the required dependencies:
```
pip install -r requirements.txt
```
3. Create a `.env` file in the root directory with your Google API key and the app password:
```
GOOGLE_API_KEY=your_google_api_key_here
APP_PASSWORD=your_app_password_here
```
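As a rough sketch of how these two values might be read at startup: the helper name `load_settings` and the returned dictionary keys are hypothetical, and the sketch assumes the variables are already present in the process environment. Python does not read a `.env` file on its own, and this README does not say which loader the app uses.

```python
import os


def load_settings(env=os.environ):
    """Return the two settings named in the .env example (hypothetical helper)."""
    api_key = env.get("GOOGLE_API_KEY", "").strip()
    if not api_key:
        # Fail early with a clear message instead of failing on the first API call.
        raise RuntimeError("GOOGLE_API_KEY is not set")
    # Assumption: an empty or missing APP_PASSWORD means "no password configured".
    password = env.get("APP_PASSWORD", "").strip() or None
    return {"google_api_key": api_key, "app_password": password}
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise without touching real secrets.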
## Usage
### Running Locally
1. Run the application:
```
python cv_extraction_app.py
```
2. Open your browser and navigate to `http://localhost:7860`
3. Enter the password (if set in the environment variable `APP_PASSWORD`)
4. Upload one or more faculty CV PDFs and click "Extract Accomplishments"
5. View the extracted accomplishments and download as CSV if desired
### Deploying on Hugging Face Spaces
1. Create a new Space on Hugging Face Spaces with the Gradio SDK
2. Upload your code to the Space
3. Set up the following secrets in your Space settings:
   - `GOOGLE_API_KEY`: Your Google Gemini API key
   - `APP_PASSWORD`: The password you want to use for app authentication
## How It Works
1. **Authentication**: The app checks if the provided password matches the one stored in the environment variable `APP_PASSWORD`
2. **PDF Processing**: The app extracts text from uploaded PDF files using PyPDF2
3. **LLM Processing with Pydantic-AI**:
   - The extracted text is processed using Pydantic-AI with Google's Gemini model
   - Pydantic models define the structure of the expected data
   - This approach ensures more robust parsing and validation of the extracted data
   - If Pydantic-AI processing fails, the app falls back to the standard Gemini API approach
4. **Categorization**: Accomplishments are categorized into different types based on a decision tree approach
5. **Results Display**: The extracted accomplishments are displayed in a tabular format and can be downloaded as CSV
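The final step, turning extracted records into a downloadable CSV, could be sketched with the standard library alone. The column names and the function `accomplishments_to_csv` are illustrative assumptions; the app's real CSV columns are not listed in this README.

```python
import csv
import io

# Hypothetical column names; the app's actual columns may differ.
FIELDNAMES = ["faculty_name", "category", "title", "year"]


def accomplishments_to_csv(rows):
    """Serialize a list of dicts to CSV text, leaving missing fields blank."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=FIELDNAMES,
        restval="",             # value written for keys missing from a row
        extrasaction="ignore",  # silently drop keys that are not in FIELDNAMES
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {"faculty_name": "A. Example", "category": "Journal Article",
     "title": "On Parsing, Briefly", "year": 2021},
    {"faculty_name": "A. Example", "category": "Book", "title": "Untitled"},
]
csv_text = accomplishments_to_csv(rows)
print(csv_text)
```

Using `csv.DictWriter` rather than joining strings by hand means titles that contain commas or quotes are escaped correctly.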
## Pydantic-AI Integration
The app uses Pydantic-AI to improve the reliability of structured data extraction:
- **Defined Data Models**: Clear schema definitions for faculty data and accomplishments
- **Type Validation**: Ensures fields like years and confidence scores are properly typed
- **Default Values**: Handles missing fields gracefully with sensible defaults
- **Fallback Mechanism**: If Pydantic-AI extraction fails, the app falls back to standard extraction
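The behaviours listed above (typed fields, defaults for missing values, and a fallback path) can be illustrated with a small stand-in. This sketch deliberately uses a plain dataclass instead of Pydantic so that it runs with the standard library only; the field names and the functions `coerce` and `extract_with_fallback` are hypothetical and are not taken from the app's code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Accomplishment:
    title: str
    category: str = "Other"      # default when the field is missing
    year: Optional[int] = None   # missing years stay empty rather than guessed
    confidence: float = 0.0


def coerce(raw):
    """Validate one raw record; raises ValueError on unusable input."""
    title = str(raw.get("title", "")).strip()
    if not title:
        raise ValueError("title is required")
    year = raw.get("year")
    return Accomplishment(
        title=title,
        category=raw.get("category") or "Other",
        year=int(year) if year not in (None, "") else None,
        confidence=float(raw.get("confidence", 0.0)),
    )


def extract_with_fallback(text, primary, fallback):
    """Try the primary extractor; on any failure, use the fallback instead."""
    try:
        return [coerce(r) for r in primary(text)]
    except Exception:
        # Errors raised by the fallback path are not caught and will propagate.
        return [coerce(r) for r in fallback(text)]


def broken_primary(text):
    raise RuntimeError("structured extraction failed")


def simple_fallback(text):
    return [{"title": text, "year": "2019"}]


items = extract_with_fallback("A talk", broken_primary, simple_fallback)
print(items)
```

Here the year arrives as the string `"2019"` and is converted to an integer, while the missing category and confidence fall back to their defaults.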
## Customization
### Changing the Password
To change the password, update the `APP_PASSWORD` environment variable:
- Locally: Modify the `.env` file
- On Hugging Face Spaces: Update the secret in the Space settings
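For reference, a password check against `APP_PASSWORD` might look like the sketch below. The function name `check_password` is hypothetical, and treating an unset password as "access allowed" is an assumption based on the "if set" wording in the usage steps, not a confirmed behaviour of the app.

```python
import hmac
import os


def check_password(submitted, expected=None):
    """Return True if the submitted password matches APP_PASSWORD."""
    if expected is None:
        expected = os.environ.get("APP_PASSWORD", "")
    if not expected:
        # Assumption: no configured password means the app is open.
        return True
    # compare_digest avoids leaking match position through timing differences.
    return hmac.compare_digest(submitted.encode("utf-8"), expected.encode("utf-8"))
```

Because the value is read from the environment on each call, changing the `.env` file or the Space secret takes effect after the app restarts and reloads its environment.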
### Modifying Categories
To modify the categories of scholarly accomplishments, edit the `MAIN_CATEGORIES` and `SCHOLARLY_WORK_TYPES` lists in `cv_extraction_app.py`.
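As an illustration of what such an edit might involve, the sketch below defines a placeholder `MAIN_CATEGORIES` list and a hypothetical helper, `normalize_category`, that maps free-form labels onto it. The list values are examples drawn from the feature list in this README, not the actual contents of `cv_extraction_app.py`.

```python
# Placeholder values; the real list lives in cv_extraction_app.py.
MAIN_CATEGORIES = ["Books", "Journal Articles", "Conference Presentations", "Other"]


def normalize_category(label, allowed=MAIN_CATEGORIES, default="Other"):
    """Map a label onto an allowed category, ignoring case and outer whitespace."""
    lookup = {name.casefold(): name for name in allowed}
    return lookup.get(label.strip().casefold(), default)


print(normalize_category(" journal articles "))
print(normalize_category("Patents"))
```

If a category is added or renamed, any prompt text or downstream code that refers to the old names would presumably need the same change.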
## Troubleshooting
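- **API Key Issues**: Ensure your Google API key is set in the `GOOGLE_API_KEY` environment variable (in `.env` locally, or as a Space secret on Hugging Face)
- **PDF Extraction Errors**: Some PDFs are password-protected or have security settings that prevent text extraction; scanned, image-only PDFs have no text layer for PyPDF2 to read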
- **LLM Processing Errors**: If the LLM fails to extract accomplishments, try adjusting the prompt or model parameters
## License
This project is licensed under the MIT License.