metadata

title: CV to CSV Extraction App
emoji: 📄
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

CV to CSV Extraction App

A Gradio application that extracts publications, talks, and other scholarly accomplishments from faculty CVs (PDFs) using Google's Gemini API, with Pydantic-AI validating the structured output.

Features

Extract scholarly accomplishments from faculty CVs in PDF format
Categorize accomplishments into different types (books, journal articles, conference presentations, etc.)
Display results in a tabular format
Download results as CSV
Password protection using Hugging Face secrets
Robust JSON parsing with Pydantic-AI

Installation

Clone this repository:

git clone <repository-url>
cd CV_to_CSV

Install the required dependencies:
```
pip install -r requirements.txt
```

Create a .env file in the root directory with your Google API key and the app password:

GOOGLE_API_KEY=your_google_api_key_here
APP_PASSWORD=your_app_password_here
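
As a rough sketch of how these two values might be read at startup: the `Settings` class and `load_settings` function below are hypothetical names, not taken from the app's source, and the code reads an already-populated environment rather than parsing .env itself.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the two values defined in .env."""
    google_api_key: str
    app_password: Optional[str]  # None means no password was configured


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from an environment mapping, failing fast if the API key is missing."""
    api_key = env.get("GOOGLE_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("GOOGLE_API_KEY is not set; add it to .env or the Space secrets")
    # An empty or missing APP_PASSWORD is normalized to None
    password = env.get("APP_PASSWORD", "").strip() or None
    return Settings(google_api_key=api_key, app_password=password)
```

Locally, a loader such as python-dotenv's `load_dotenv()` would typically copy .env into the process environment before a function like this runs; whether the app uses python-dotenv is an assumption.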

Usage

Running Locally

Run the application:
```
python cv_extraction_app.py
```
Open your browser and navigate to http://localhost:7860
Enter the password (required only if the APP_PASSWORD environment variable is set)
Upload one or more faculty CV PDFs and click "Extract Accomplishments"
View the extracted accomplishments and download as CSV if desired

Deploying on Hugging Face Spaces

Create a new Space on Hugging Face Spaces with the Gradio SDK
Upload your code to the Space
Set up the following secrets in your Space settings:
- GOOGLE_API_KEY: Your Google Gemini API key
- APP_PASSWORD: The password you want to use for app authentication

How It Works

Authentication: The app checks if the provided password matches the one stored in the environment variable APP_PASSWORD
PDF Processing: The app extracts text from uploaded PDF files using PyPDF2
LLM Processing with Pydantic-AI:
- The extracted text is processed using Pydantic-AI with Google's Gemini model
- Pydantic models define the structure of the expected data
- Validating the model output against these schemas catches malformed or mistyped fields before they reach the results table
- If Pydantic-AI processing fails, the app falls back to the standard Gemini API approach
Categorization: Accomplishments are categorized into different types based on a decision tree approach
Results Display: The extracted accomplishments are displayed in a tabular format and can be downloaded as CSV
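
The final step, turning extracted accomplishments into a downloadable CSV, can be sketched with the standard library alone. The column names and the `rows_to_csv` helper are assumptions made for illustration; the app's real schema may differ.

```python
import csv
import io
from typing import Dict, Iterable, List

# Assumed column names; the app's real schema may differ.
COLUMNS: List[str] = ["faculty_name", "category", "type", "year", "description"]


def rows_to_csv(rows: Iterable[Dict[str, object]]) -> str:
    """Serialize accomplishment dicts to CSV text, leaving missing fields blank."""
    buffer = io.StringIO()
    # restval fills absent columns; extrasaction="ignore" drops unexpected keys
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, restval="", extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()
```

The returned string could be written to a temporary file and handed to the download component. Quoting of commas and quotes inside titles is handled by the `csv` module.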

Pydantic-AI Integration

The app uses Pydantic-AI to improve the reliability of structured data extraction:

Defined Data Models: Clear schema definitions for faculty data and accomplishments
Type Validation: Ensures fields like years and confidence scores are properly typed
Default Values: Handles missing fields gracefully with sensible defaults
Fallback Mechanism: If Pydantic-AI extraction fails, the app falls back to the standard Gemini API approach
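
The four points above can be illustrated with a small sketch. It uses plain Pydantic models rather than the Pydantic-AI agent API, and every class, field, and function name here is an assumption for illustration, not the app's actual schema.

```python
from typing import Callable, List, Optional

from pydantic import BaseModel, Field, ValidationError


class Accomplishment(BaseModel):
    """One scholarly accomplishment; field names are illustrative."""
    title: str
    category: str = "Other"        # default used when the model omits a category
    year: Optional[int] = None     # numeric strings such as "2019" are coerced to int
    confidence: float = Field(default=0.0, ge=0.0, le=1.0)  # must lie in [0, 1]


class FacultyRecord(BaseModel):
    """All accomplishments extracted from one CV (hypothetical shape)."""
    name: str = "Unknown"
    accomplishments: List[Accomplishment] = Field(default_factory=list)


def parse_with_fallback(
    raw: dict, fallback: Callable[[dict], FacultyRecord]
) -> FacultyRecord:
    """Validate raw model output, deferring to `fallback` if validation fails."""
    try:
        return FacultyRecord(**raw)
    except ValidationError:
        return fallback(raw)
```

Here `fallback` stands in for whatever the standard extraction path does; passing it as a callable keeps the validation step independent of any particular API client.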

Customization

Changing the Password

To change the password, update the APP_PASSWORD environment variable:

Locally: Modify the .env file
On Hugging Face Spaces: Update the secret in the Space settings
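
A password check along the lines described in How It Works might look like the sketch below. The `check_password` name and the rule that an unset APP_PASSWORD leaves the app unprotected are assumptions, not confirmed behavior.

```python
import hmac
import os
from typing import Mapping


def check_password(provided: str, env: Mapping[str, str] = os.environ) -> bool:
    """Return True if `provided` matches APP_PASSWORD.

    Assumption: when APP_PASSWORD is unset or empty, the app is treated as unprotected.
    """
    expected = env.get("APP_PASSWORD", "")
    if not expected:
        return True
    # compare_digest takes time independent of where the two values first differ
    return hmac.compare_digest(provided.encode("utf-8"), expected.encode("utf-8"))
```

If the app reads its environment once at startup, a local change to .env only takes effect after restarting the process.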

Modifying Categories

To modify the categories of scholarly accomplishments, edit the MAIN_CATEGORIES and SCHOLARLY_WORK_TYPES lists in cv_extraction_app.py.
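
For orientation, the two lists might be shaped roughly as follows. The values are placeholders built from the types mentioned under Features, and `normalize_choice` is a hypothetical helper showing how a label outside the lists could be mapped to a default.

```python
from typing import List

# Placeholder values for illustration; the real lists live in cv_extraction_app.py.
MAIN_CATEGORIES: List[str] = ["Publications", "Presentations", "Other"]
SCHOLARLY_WORK_TYPES: List[str] = [
    "Book",
    "Journal Article",
    "Conference Presentation",
    "Other",
]


def normalize_choice(value: str, allowed: List[str], default: str = "Other") -> str:
    """Map a free-text label to an allowed entry, ignoring case and surrounding spaces."""
    lookup = {item.lower(): item for item in allowed}
    return lookup.get(value.strip().lower(), default)
```

With a helper like this, adding an entry to either list is enough for that label to be accepted; anything unrecognized falls back to the default.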

Troubleshooting

API Key Issues: Ensure GOOGLE_API_KEY is set correctly, in the .env file when running locally or in the Space secrets when deployed
PDF Extraction Errors: Some PDFs may be password-protected or have security settings that prevent text extraction
LLM Processing Errors: If the LLM fails to extract accomplishments, try adjusting the prompt or model parameters
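
To tell PDF problems apart from LLM problems, it can help to check the extracted text before it is sent to the model. The sketch below is one way to do that with PyPDF2; the function names, the `PdfTextError` class, and the `min_chars` threshold are all assumptions. An empty result often points to a scanned, image-only PDF with no text layer.

```python
from typing import List


class PdfTextError(Exception):
    """Raised when a PDF cannot yield usable text (hypothetical error type)."""


def join_page_text(pages: List[str], min_chars: int = 20) -> str:
    """Join per-page text and reject results too short to be a real CV."""
    text = "\n".join(pages).strip()
    if len(text) < min_chars:  # threshold is an arbitrary assumption
        raise PdfTextError("almost no text extracted; the PDF may be a scanned image")
    return text


def extract_pdf_text(path: str) -> str:
    """Extract text with PyPDF2, reporting encrypted files explicitly."""
    # Imported here so join_page_text stays usable without PyPDF2 installed
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    if reader.is_encrypted:
        raise PdfTextError(f"{path} is encrypted; remove the password before uploading")
    # extract_text() can return an empty value for pages without a text layer
    return join_page_text([page.extract_text() or "" for page in reader.pages])
```

Surfacing these two cases as distinct error messages makes it clearer whether to fix the input file or to look at the prompt and model settings.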

License

This project is licensed under the MIT License.