Spaces: Zwounds / cv-to-csv-extractor

	---
	title: CV to CSV Extraction App
	emoji: 📄
	colorFrom: blue
	colorTo: green
	sdk: gradio
	sdk_version: 5.29.0
	app_file: app.py
	pinned: false
	---

	Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

	# CV to CSV Extraction App

	A Gradio application that extracts publications, talks, and other scholarly accomplishments from faculty CVs (PDFs) using Google's Gemini API with Pydantic-AI for robust structured data extraction.

	## Features

	- Extract scholarly accomplishments from faculty CVs in PDF format
	- Categorize accomplishments into different types (books, journal articles, conference presentations, etc.)
	- Display results in a tabular format
	- Download results as CSV
	- Password protection using Hugging Face secrets
	- Robust JSON parsing with Pydantic-AI

	## Installation

	1. Clone this repository:
	```
	git clone <repository-url>
	cd CV_to_CSV
	```

	2. Install the required dependencies:
	```
	pip install -r requirements.txt
	```

	3. Create a `.env` file in the root directory containing your Google API key and the app password:
	```
	GOOGLE_API_KEY=your_google_api_key_here
	APP_PASSWORD=your_app_password_here
	```
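
The app presumably reads these two values from the environment at startup. A minimal sketch of that lookup, using only the standard library, could look like the following. The `Settings` class and `load_settings` function are hypothetical names, not taken from the app, and the sketch assumes the `.env` file has already been loaded into the process environment (for example with a helper such as python-dotenv).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the two values defined in .env."""
    google_api_key: str
    app_password: Optional[str]  # None means no password was configured


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the configuration, failing early if the API key is missing."""
    api_key = env.get("GOOGLE_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("GOOGLE_API_KEY is not set")
    # Assumption: an empty APP_PASSWORD is treated the same as an unset one.
    password = env.get("APP_PASSWORD") or None
    return Settings(google_api_key=api_key, app_password=password)
```

Failing at startup when the key is absent tends to give a clearer error than a rejected API call later on.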

	## Usage

	### Running Locally

	1. Run the application:
	```
	python cv_extraction_app.py
	```

	2. Open your browser and navigate to `http://localhost:7860`

	3. Enter the password (if set in the environment variable `APP_PASSWORD`)

	4. Upload one or more faculty CV PDFs and click "Extract Accomplishments"

	5. View the extracted accomplishments and download as CSV if desired

	### Deploying on Hugging Face Spaces

	1. Create a new Space on Hugging Face Spaces with the Gradio SDK

	2. Upload your code to the Space

	3. Set up the following secrets in your Space settings:
	- `GOOGLE_API_KEY`: Your Google Gemini API key
	- `APP_PASSWORD`: The password you want to use for app authentication

	## How It Works

	1. Authentication: The app checks if the provided password matches the one stored in the environment variable `APP_PASSWORD`

	2. PDF Processing: The app extracts text from uploaded PDF files using PyPDF2

	3. LLM Processing with Pydantic-AI:
	- The extracted text is processed using Pydantic-AI with Google's Gemini model
	- Pydantic models define the structure of the expected data
	- Validating the model's output against these schemas catches missing or mistyped fields at parse time
	- If Pydantic-AI processing fails, the app falls back to the standard Gemini API approach

	4. Categorization: Accomplishments are sorted into types using a decision-tree approach

	5. Results Display: The extracted accomplishments are displayed in a tabular format and can be downloaded as CSV
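
The steps above can be sketched roughly as follows. This is an illustrative outline, not the app's actual code: every function name is hypothetical, the two extractors are passed in as plain callables so that no Gemini or Pydantic-AI calls are assumed, and the CSV columns are left to the caller. Only the `PdfReader` usage reflects the real PyPDF2 API. Categorization (step 4) is omitted.

```python
import csv
import hmac
import io
from typing import Callable, Dict, List, Optional

Row = Dict[str, str]
Extractor = Callable[[str], List[Row]]


def password_ok(supplied: str, expected: Optional[str]) -> bool:
    """Step 1: compare the supplied password with APP_PASSWORD."""
    if not expected:  # assumption: no configured password means open access
        return True
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(supplied.encode(), expected.encode())


def pdf_to_text(path: str) -> str:
    """Step 2: extract text with PyPDF2 (imported lazily)."""
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    # extract_text() may yield nothing for image-only pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_with_fallback(text: str, primary: Extractor, fallback: Extractor) -> List[Row]:
    """Step 3: try the structured extractor first, then the fallback."""
    try:
        return primary(text)
    except Exception:
        return fallback(text)


def rows_to_csv(rows: List[Row], columns: List[str]) -> str:
    """Step 5: serialize rows to CSV text for download."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Passing the extractors in as arguments keeps the fallback logic testable without network access.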

	## Pydantic-AI Integration

	The app uses Pydantic-AI to improve the reliability of structured data extraction:

	- Defined Data Models: Clear schema definitions for faculty data and accomplishments
	- Type Validation: Ensures fields like years and confidence scores are properly typed
	- Default Values: Handles missing fields gracefully with sensible defaults
	- Fallback Mechanism: If Pydantic-AI extraction fails, the app falls back to the standard Gemini API approach
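
To illustrate the points above, here is a minimal sketch of what such models could look like in plain Pydantic. The class and field names are assumptions made for this example and are not taken from the app's source; the Pydantic-AI agent wiring is deliberately left out.

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class Accomplishment(BaseModel):
    """One extracted item; all field names here are hypothetical."""
    title: str
    category: str = "Other"      # default used when the field is missing
    year: Optional[int] = None   # numeric strings such as "2019" are coerced
    confidence: float = Field(default=0.5, ge=0.0, le=1.0)


class FacultyRecord(BaseModel):
    """Accomplishments extracted from a single CV."""
    faculty_name: str = "Unknown"
    accomplishments: List[Accomplishment] = Field(default_factory=list)


def parse_record(data: dict) -> Optional[FacultyRecord]:
    """Validate raw model output; return None so the caller can fall back."""
    try:
        return FacultyRecord(**data)
    except ValidationError:
        return None
```

Returning `None` on a validation error is one simple way to signal that the fallback path should run.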

	## Customization

	### Changing the Password

	To change the password, update the `APP_PASSWORD` environment variable:

	- Locally: Modify the `.env` file
	- On Hugging Face Spaces: Update the secret in the Space settings

	### Modifying Categories

	To modify the categories of scholarly accomplishments, edit the `MAIN_CATEGORIES` and `SCHOLARLY_WORK_TYPES` lists in `cv_extraction_app.py`.
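
As a rough illustration of that edit, the two lists might look like the sketch below. The values shown are placeholders (this README only mentions books, journal articles, and conference presentations), and the `normalize_work_type` helper is hypothetical, included to show one way to keep unexpected labels out of the CSV.

```python
# Placeholder values; the real lists live in cv_extraction_app.py.
MAIN_CATEGORIES = ["Publications", "Presentations", "Other"]
SCHOLARLY_WORK_TYPES = ["Book", "Journal Article", "Conference Presentation", "Other"]


def normalize_work_type(raw: str) -> str:
    """Map a model-supplied label onto a known type, ignoring case and spacing."""
    cleaned = " ".join(raw.split()).casefold()
    for work_type in SCHOLARLY_WORK_TYPES:
        if work_type.casefold() == cleaned:
            return work_type
    return "Other"  # anything unrecognized falls into a catch-all bucket
```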

	## Troubleshooting

	- API Key Issues: Ensure `GOOGLE_API_KEY` is set correctly, either in your `.env` file (local runs) or in the Space secrets (Hugging Face Spaces)
	- PDF Extraction Errors: Some PDFs may be password-protected or have security settings that prevent text extraction
	- LLM Processing Errors: If the LLM fails to extract accomplishments, try adjusting the prompt or model parameters
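
For PDF extraction errors, a quick check like the sketch below can help tell an encrypted file apart from one that simply has no text layer. The `diagnose_pdf` function is a hypothetical helper, not part of the app; it relies only on the `is_encrypted` and `pages` attributes and the `extract_text()` method that PyPDF2's `PdfReader` provides.

```python
from typing import Optional


def diagnose_pdf(reader) -> Optional[str]:
    """Return a short problem description, or None if text can be extracted.

    `reader` is expected to behave like PyPDF2.PdfReader.
    """
    if reader.is_encrypted:
        return "PDF is encrypted; remove the password before uploading"
    has_text = any((page.extract_text() or "").strip() for page in reader.pages)
    if not has_text:
        return "no extractable text (possibly a scanned, image-only PDF)"
    return None
```

A typical call would be `diagnose_pdf(PdfReader("cv.pdf"))`, where the file name is only an example.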

	## License

	This project is licensed under the MIT License.