title: CV to CSV Extraction App
emoji: 📄
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

CV to CSV Extraction App

A Gradio application that extracts publications, talks, and other scholarly accomplishments from faculty CVs (PDFs) using Google's Gemini API with Pydantic-AI for robust structured data extraction.

Features

  • Extract scholarly accomplishments from faculty CVs in PDF format
  • Categorize accomplishments into different types (books, journal articles, conference presentations, etc.)
  • Display results in a tabular format
  • Download results as CSV
  • Password protection using Hugging Face secrets
  • Robust JSON parsing with Pydantic-AI

Installation

  1. Clone this repository:

    git clone <repository-url>
    cd CV_to_CSV
    
  2. Install the required dependencies:

    pip install -r requirements.txt
    
  3. Create a .env file in the root directory with your Google API key and app password:

    GOOGLE_API_KEY=your_google_api_key_here
    APP_PASSWORD=your_app_password_here
    

Usage

Running Locally

  1. Run the application:

    python cv_extraction_app.py
    
  2. Open your browser and navigate to http://localhost:7860

  3. Enter the password (if set in the environment variable APP_PASSWORD)

  4. Upload one or more faculty CV PDFs and click "Extract Accomplishments"

  5. View the extracted accomplishments and download as CSV if desired

Deploying on Hugging Face Spaces

  1. Create a new Space on Hugging Face Spaces with the Gradio SDK

  2. Upload your code to the Space

  3. Set up the following secrets in your Space settings:

    • GOOGLE_API_KEY: Your Google Gemini API key
    • APP_PASSWORD: The password you want to use for app authentication

How It Works

  1. Authentication: The app checks if the provided password matches the one stored in the environment variable APP_PASSWORD

  2. PDF Processing: The app extracts text from uploaded PDF files using PyPDF2

  3. LLM Processing with Pydantic-AI:

    • The extracted text is processed using Pydantic-AI with Google's Gemini model
    • Pydantic models define the structure of the expected data
    • This approach ensures more robust parsing and validation of the extracted data
    • If Pydantic-AI processing fails, the app falls back to the standard Gemini API approach

  4. Categorization: Accomplishments are categorized into different types using a decision-tree approach

  5. Results Display: The extracted accomplishments are displayed in a tabular format and can be downloaded as CSV
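
The CSV download in step 5 can be sketched with the standard library alone. This is a minimal illustration, not the app's actual code: the column names in `FIELDNAMES` and the helper `accomplishments_to_csv` are hypothetical, since the README does not show the real schema.

```python
import csv
import io

# Hypothetical column names; the app's real schema is not shown in this README.
FIELDNAMES = ["faculty_name", "category", "title", "year"]


def accomplishments_to_csv(rows):
    """Serialize a list of accomplishment dicts to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    for row in rows:
        # Keep only known columns; missing keys become empty cells.
        writer.writerow({name: row.get(name, "") for name in FIELDNAMES})
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [{"faculty_name": "A. Author", "category": "Book", "title": "Example", "year": 2020}]
    print(accomplishments_to_csv(sample))
```

Building the CSV in memory keeps the sketch independent of how the file is handed to the browser for download.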

Pydantic-AI Integration

The app uses Pydantic-AI to improve the reliability of structured data extraction:

  • Defined Data Models: Clear schema definitions for faculty data and accomplishments
  • Type Validation: Ensures fields like years and confidence scores are properly typed
  • Default Values: Handles missing fields gracefully with sensible defaults
  • Fallback Mechanism: If Pydantic-AI extraction fails, the app falls back to standard extraction
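
The fallback behaviour described above might look roughly like the sketch below. The function `extract_with_fallback` and its two callable arguments are hypothetical stand-ins; the real Pydantic-AI and Gemini calls are not shown in this README and are not reproduced here.

```python
import logging

logger = logging.getLogger(__name__)


def extract_with_fallback(text, primary, fallback):
    """Try the structured extractor first; use the fallback if it raises."""
    try:
        return primary(text)
    except Exception as exc:  # broad on purpose: any failure triggers the fallback
        logger.warning("Structured extraction failed (%s); falling back", exc)
        return fallback(text)


if __name__ == "__main__":
    def structured(text):
        # Stand-in for the Pydantic-AI path; always fails in this demo.
        raise ValueError("schema validation failed")

    def plain(text):
        # Stand-in for the standard extraction path.
        return [{"title": text.strip()}]

    print(extract_with_fallback("  Example Talk  ", structured, plain))
```

Passing the two extractors in as arguments keeps the control flow easy to exercise without any network access.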

Customization

Changing the Password

To change the password, update the APP_PASSWORD environment variable:

  • Locally: Modify the .env file
  • On Hugging Face Spaces: Update the secret in the Space settings
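
For orientation, a password check against APP_PASSWORD could be written as follows. This is a sketch under assumptions: the name `check_password` is hypothetical, and treating an unset variable as "no password required" is inferred from the usage steps rather than confirmed by the source.

```python
import hmac
import os


def check_password(candidate):
    """Return True if candidate matches APP_PASSWORD, or if no password is set."""
    expected = os.environ.get("APP_PASSWORD", "")
    if not expected:
        # Assumption: an unset or empty password disables the check.
        return True
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(candidate.encode("utf-8"), expected.encode("utf-8"))
```

Because the variable is read on each call, a changed value takes effect as soon as the process sees the new environment, which on Spaces typically means after a restart.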

Modifying Categories

To modify the categories of scholarly accomplishments, edit the MAIN_CATEGORIES and SCHOLARLY_WORK_TYPES lists in cv_extraction_app.py.
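
As a rough picture of what such an edit involves, the sketch below uses placeholder values for both lists and a hypothetical helper, `normalize_label`, that maps free-text labels onto an allowed list. The real contents of the lists and how the app uses them should be checked in cv_extraction_app.py.

```python
# Hypothetical values for illustration only; the real lists live in cv_extraction_app.py.
MAIN_CATEGORIES = ["Books", "Journal Articles", "Conference Presentations", "Other"]
SCHOLARLY_WORK_TYPES = ["Monograph", "Edited Volume", "Peer-Reviewed Article", "Invited Talk"]


def normalize_label(label, allowed, default="Other"):
    """Map a free-text label onto an allowed list, ignoring case and padding."""
    lookup = {item.casefold(): item for item in allowed}
    return lookup.get(label.strip().casefold(), default)


if __name__ == "__main__":
    print(normalize_label(" books ", MAIN_CATEGORIES))
    print(normalize_label("Podcast", MAIN_CATEGORIES))
```

If a new category is added, any prompt text or decision logic that names the categories would likely need the same update.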

Troubleshooting

  • API Key Issues: Ensure your Google API key is correctly set in the environment variables
  • PDF Extraction Errors: Some PDFs may be password-protected or have security settings that prevent text extraction
  • LLM Processing Errors: If the LLM fails to extract accomplishments, try adjusting the prompt or model parameters
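
A quick way to rule out the first issue is to check the environment before launching the app. The helper below is a hypothetical diagnostic, not part of the app; it assumes only the variable name GOOGLE_API_KEY documented in this README.

```python
import os

# Variable names taken from this README; GOOGLE_API_KEY is treated as required.
REQUIRED = ("GOOGLE_API_KEY",)


def missing_settings(environ=os.environ):
    """Return the required variable names that are unset or blank."""
    return [name for name in REQUIRED if not environ.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All required settings are present.")
```

Note that this only confirms the variable is present; it does not verify that the key is valid with Google's API.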

License

This project is licensed under the MIT License.