---
title: CV to CSV Extraction App
emoji: 📄
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# CV to CSV Extraction App

A Gradio application that extracts publications, talks, and other scholarly accomplishments from faculty CVs (PDFs) using Google's Gemini API with Pydantic-AI for robust structured data extraction.

## Features

- Extract scholarly accomplishments from faculty CVs in PDF format
- Categorize accomplishments into different types (books, journal articles, conference presentations, etc.)
- Display results in a tabular format
- Download results as CSV
- Password protection using Hugging Face secrets
- Robust JSON parsing with Pydantic-AI

## Installation

1. Clone this repository:
   ```
   git clone <repository-url>
   cd CV_to_CSV
   ```

2. Install the required dependencies:
   ```
   pip install -r requirements.txt
   ```

3. Create a `.env` file in the root directory with your Google API key and app password:
   ```
   GOOGLE_API_KEY=your_google_api_key_here
   APP_PASSWORD=your_app_password_here
   ```

## Usage

### Running Locally

1. Run the application:
   ```
   python cv_extraction_app.py
   ```

2. Open your browser and navigate to `http://localhost:7860`

3. Enter the password (if set in the environment variable `APP_PASSWORD`)

4. Upload one or more faculty CV PDFs and click "Extract Accomplishments"

5. View the extracted accomplishments and download as CSV if desired

### Deploying on Hugging Face Spaces

1. Create a new Space on Hugging Face Spaces with the Gradio SDK

2. Upload your code to the Space

3. Set up the following secrets in your Space settings:
   - `GOOGLE_API_KEY`: Your Google Gemini API key
   - `APP_PASSWORD`: The password you want to use for app authentication

## How It Works

1. **Authentication**: The app checks if the provided password matches the one stored in the environment variable `APP_PASSWORD`

2. **PDF Processing**: The app extracts text from uploaded PDF files using PyPDF2

3. **LLM Processing with Pydantic-AI**: 
   - The extracted text is processed using Pydantic-AI with Google's Gemini model
   - Pydantic models define the structure of the expected data
   - Validating the model's response against these models makes parsing of the extracted data more robust
   - If Pydantic-AI processing fails, the app falls back to the standard Gemini API approach

4. **Categorization**: Accomplishments are categorized into different types based on a decision tree approach

5. **Results Display**: The extracted accomplishments are displayed in a tabular format and can be downloaded as CSV
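
The final step above, turning extracted accomplishments into a downloadable CSV, can be sketched with the standard library. This is a minimal illustration, not the app's actual code: the function name `accomplishments_to_csv` and the column names in `FIELDS` are hypothetical, since the README does not list the real schema.

```python
import csv
import io

# Hypothetical column names; the app's real schema is not shown in this README.
FIELDS = ["faculty_name", "category", "title", "year"]


def accomplishments_to_csv(rows):
    """Serialize a list of accomplishment dicts to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Keep only the known columns; missing keys become empty cells.
        writer.writerow({name: row.get(name, "") for name in FIELDS})
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [{"faculty_name": "A. Author", "category": "Book",
               "title": "Title, with comma", "year": 2020}]
    print(accomplishments_to_csv(sample))
```

Writing to an in-memory buffer keeps the function easy to test; the resulting string can then be saved to a file for download.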

## Pydantic-AI Integration

The app uses Pydantic-AI to improve the reliability of structured data extraction:

- **Defined Data Models**: Clear schema definitions for faculty data and accomplishments
- **Type Validation**: Ensures fields like years and confidence scores are properly typed
- **Default Values**: Handles missing fields gracefully with sensible defaults
- **Fallback Mechanism**: If Pydantic-AI extraction fails, the app falls back to standard extraction
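
The fallback mechanism can be pictured as a small wrapper around two extractors. The sketch below assumes both extractors are plain callables that take the CV text and return a result; `extract_with_fallback`, `structured_extractor`, and `standard_extractor` are hypothetical names, not functions from this repository.

```python
import logging

logger = logging.getLogger(__name__)


def extract_with_fallback(text, structured_extractor, standard_extractor):
    """Try the structured extractor first; use the standard one if it raises."""
    try:
        return structured_extractor(text)
    except Exception as exc:
        # Broad on purpose: any failure in the structured path triggers the fallback.
        logger.warning("Structured extraction failed (%s); falling back", exc)
        return standard_extractor(text)
```

Logging the failure before falling back makes it easier to see how often the structured path is actually failing.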

## Customization

### Changing the Password

To change the password, update the `APP_PASSWORD` environment variable:

- Locally: Modify the `.env` file
- On Hugging Face Spaces: Update the secret in the Space settings
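
For illustration, a password check against `APP_PASSWORD` might look like the sketch below. It is an assumption about how such a check could be written, not the app's actual code: `check_password` is a hypothetical name, and treating an unset variable as "no password required" is inferred from the "if set" wording in the Usage section.

```python
import hmac
import os


def check_password(provided, env=os.environ):
    """Return True if `provided` matches APP_PASSWORD, or if no password is set."""
    expected = env.get("APP_PASSWORD")
    if not expected:
        # Assumption: an unset or empty APP_PASSWORD disables the check.
        return True
    # Constant-time comparison avoids leaking information through timing.
    return hmac.compare_digest(provided.encode("utf-8"), expected.encode("utf-8"))
```

Because the value is read from the environment on each call, changing the secret only requires restarting the app, not editing code.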

### Modifying Categories

To modify the categories of scholarly accomplishments, edit the `MAIN_CATEGORIES` and `SCHOLARLY_WORK_TYPES` lists in `cv_extraction_app.py`.

## Troubleshooting

- **API Key Issues**: Ensure your Google API key is correctly set in the environment variables
- **PDF Extraction Errors**: Some PDFs may be password-protected or have security settings that prevent text extraction
- **LLM Processing Errors**: If the LLM fails to extract accomplishments, try adjusting the prompt or model parameters
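
When diagnosing API key issues, a quick check of the environment can rule out a missing or blank variable. This is a small diagnostic sketch; `missing_settings` and `REQUIRED_SETTINGS` are hypothetical names introduced here, and only `GOOGLE_API_KEY` is treated as required.

```python
import os

# Assumption: only the API key is mandatory; APP_PASSWORD is optional.
REQUIRED_SETTINGS = ("GOOGLE_API_KEY",)


def missing_settings(env=os.environ, required=REQUIRED_SETTINGS):
    """Return the names of required settings that are unset or blank."""
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All required settings are present.")
```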

## License

This project is licensed under the MIT License.