title: CV to CSV Extraction App
emoji: π
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
CV to CSV Extraction App
A Gradio application that extracts publications, talks, and other scholarly accomplishments from faculty CVs (PDFs) using Google's Gemini API with Pydantic-AI for robust structured data extraction.
Features
- Extract scholarly accomplishments from faculty CVs in PDF format
- Categorize accomplishments into different types (books, journal articles, conference presentations, etc.)
- Display results in a tabular format
- Download results as CSV
- Password protection using Hugging Face secrets
- Robust JSON parsing with Pydantic-AI
Installation
Clone this repository:
git clone <repository-url>
cd CV_to_CSV
Install the required dependencies:
pip install -r requirements.txt
Create a .env file in the root directory with your Google API key and app password:
GOOGLE_API_KEY=your_google_api_key_here
APP_PASSWORD=your_app_password_here
Usage
Running Locally
Run the application:
python cv_extraction_app.py
Open your browser and navigate to http://localhost:7860
Enter the password (if one is set in the APP_PASSWORD environment variable)
Upload one or more faculty CV PDFs and click "Extract Accomplishments"
View the extracted accomplishments and download as CSV if desired
Deploying on Hugging Face Spaces
Create a new Space on Hugging Face Spaces with the Gradio SDK
Upload your code to the Space
Set up the following secrets in your Space settings:
- GOOGLE_API_KEY: Your Google Gemini API key
- APP_PASSWORD: The password you want to use for app authentication
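Space secrets are exposed to the app as environment variables, so the same lookup can serve both local runs and a deployed Space. The following is a minimal sketch of that lookup and of a password check, not the app's actual code: the helper names load_settings and password_ok are hypothetical, and treating an unset APP_PASSWORD as "no password required" is an assumption.

```python
import hmac
import os
from typing import Mapping, Optional


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read the two settings named in this README from the environment."""
    api_key = env.get("GOOGLE_API_KEY", "")
    if not api_key:
        raise RuntimeError("GOOGLE_API_KEY is not set")
    # Assumption: an unset or empty APP_PASSWORD means no password is required.
    return {"api_key": api_key, "password": env.get("APP_PASSWORD") or None}


def password_ok(provided: str, expected: Optional[str]) -> bool:
    """Compare passwords in constant time; accept anything if none is set."""
    if expected is None:
        return True
    return hmac.compare_digest(provided.encode("utf-8"), expected.encode("utf-8"))
```

Passing a plain dict instead of os.environ makes the helpers easy to exercise without touching real secrets.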
How It Works
Authentication: The app checks whether the provided password matches the one stored in the APP_PASSWORD environment variable
PDF Processing: The app extracts text from uploaded PDF files using PyPDF2
LLM Processing with Pydantic-AI:
- The extracted text is processed using Pydantic-AI with Google's Gemini model
- Pydantic models define the structure of the expected data
- This approach ensures more robust parsing and validation of the extracted data
- If Pydantic-AI processing fails, the app falls back to the standard Gemini API approach
Categorization: Accomplishments are categorized into different types based on a decision tree approach
Results Display: The extracted accomplishments are displayed in a tabular format and can be downloaded as CSV
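The last two steps above can be illustrated with the standard library alone. This is a hedged sketch: in the app the categorization is part of the LLM extraction, so the keyword rules, category labels, and column names below are purely illustrative assumptions that show what a decision-tree style ordering and a CSV export might look like.

```python
import csv
import io
from typing import Dict, List


def categorize(entry: str) -> str:
    """Tiny decision tree: test one cue at a time, in a fixed order."""
    text = entry.lower()
    if "presented at" in text or "conference" in text:
        return "Conference Presentation"
    if "journal" in text or "vol." in text:
        return "Journal Article"
    if "press" in text or "isbn" in text:
        return "Book"
    return "Other"


def to_csv(rows: List[Dict[str, str]]) -> str:
    """Serialize rows to CSV text with a header line (hypothetical columns)."""
    fields = ["faculty", "category", "entry"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

The csv module quotes any field that contains a comma, so free-text entries survive the round trip into a spreadsheet.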
Pydantic-AI Integration
The app uses Pydantic-AI to improve the reliability of structured data extraction:
- Defined Data Models: Clear schema definitions for faculty data and accomplishments
- Type Validation: Ensures fields like years and confidence scores are properly typed
- Default Values: Handles missing fields gracefully with sensible defaults
- Fallback Mechanism: If Pydantic-AI extraction fails, the app falls back to standard extraction
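To make these four points concrete, here is a small sketch built on plain Pydantic models. It is not the app's schema: the class and field names are hypothetical, and the two extractor callables stand in for the Pydantic-AI call and the standard Gemini call, whose real signatures are not shown in this README.

```python
from typing import Callable, List, Optional

from pydantic import BaseModel, Field, ValidationError


class Accomplishment(BaseModel):
    """One scholarly accomplishment (hypothetical field names)."""
    title: str
    category: str = "Other"          # default value for a missing field
    year: Optional[int] = None       # a numeric string such as "2019" is coerced to int
    confidence: float = Field(default=0.5, ge=0.0, le=1.0)


class FacultyCV(BaseModel):
    """All accomplishments found in one CV."""
    faculty_name: str = "Unknown"
    accomplishments: List[Accomplishment] = Field(default_factory=list)


def extract_with_fallback(
    cv_text: str,
    structured: Callable[[str], dict],
    standard: Callable[[str], dict],
) -> FacultyCV:
    """Try the structured extractor first; on failure use the standard one."""
    try:
        return FacultyCV(**structured(cv_text))
    except (ValidationError, ValueError, RuntimeError):
        # Assumption: any of these errors means the structured path failed.
        return FacultyCV(**standard(cv_text))
```

Validating the fallback result against the same model keeps the downstream table code independent of which path produced the data.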
Customization
Changing the Password
To change the password, update the APP_PASSWORD environment variable:
- Locally: Modify the .env file
- On Hugging Face Spaces: Update the secret in the Space settings
Modifying Categories
To modify the categories of scholarly accomplishments, edit the MAIN_CATEGORIES and SCHOLARLY_WORK_TYPES lists in cv_extraction_app.py.
Troubleshooting
- API Key Issues: Ensure your Google API key is correctly set in the environment variables
- PDF Extraction Errors: Some PDFs may be password-protected or have security settings that prevent text extraction
- LLM Processing Errors: If the LLM fails to extract accomplishments, try adjusting the prompt or model parameters
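When a PDF yields no accomplishments, it can help to check the file outside the app first. The sketch below is one possible diagnostic, assuming PyPDF2 2.0 or later (PdfReader, is_encrypted, pages, extract_text); the function names and messages are hypothetical, and the suggestion that an empty result points to a scanned PDF is a common cause rather than a certainty.

```python
from typing import Iterable


def join_page_text(pages: Iterable) -> str:
    """Join per-page text, treating a None or empty page as an empty string."""
    return "\n".join((page.extract_text() or "").strip() for page in pages).strip()


def diagnose_pdf(path: str) -> str:
    """Return a short hint about why text extraction might fail for this file."""
    # Imported lazily so join_page_text stays usable without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    if reader.is_encrypted:
        return "encrypted: remove the password protection before uploading"
    if not join_page_text(reader.pages):
        return "no text found: the PDF may consist of scanned images"
    return "ok"
```

Running diagnose_pdf on a problem file separates PDF-level issues from LLM-level ones before any API call is made.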
License
This project is licensed under the MIT License.