Zwounds committed on
Commit d7bf02a · verified · 1 Parent(s): 61ccd2e

Upload 6 files

Files changed (6)
  1. .gitignore +4 -0
  2. README-HF.md +38 -0
  3. README.md +91 -14
  4. cv_extraction_app.py +619 -0
  5. huggingface-space.yml +9 -0
  6. requirements.txt +6 -0
.gitignore ADDED
@@ -0,0 +1,4 @@
+ *.pdf
+ /.venv
+ .env
+
README-HF.md ADDED
@@ -0,0 +1,38 @@
+ # CV to CSV Extraction App
+
+ This Gradio app extracts publications, talks, and other scholarly accomplishments from faculty CVs (PDFs) using Google's Gemini API.
+
+ ## Features
+
+ - Extract scholarly accomplishments from faculty CVs in PDF format
+ - Categorize accomplishments into different types (books, journal articles, conference presentations, etc.)
+ - Display results in a tabular format
+ - Download results as CSV
+ - Password protection using Hugging Face secrets
+
+ ## Setup
+
+ ### Required Secrets
+
+ To use this app, you need to set up the following secrets in your Hugging Face Space settings:
+
+ 1. `GOOGLE_API_KEY`: Your Google Gemini API key
+ 2. `APP_PASSWORD`: The password you want to use for app authentication
+
+ ### How to Set Up Secrets
+
+ 1. Go to your Space settings
+ 2. Navigate to the "Secrets" section
+ 3. Add each secret with its corresponding value
+ 4. Restart your Space
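
Reviewer note (not part of the commit): Space secrets reach the app as environment variables, which is how `cv_extraction_app.py` reads `GOOGLE_API_KEY` and `APP_PASSWORD`. The sketch below shows one way that lookup and the password check could be written. The helper names `load_secrets` and `password_matches` are hypothetical, and the constant-time comparison via `hmac.compare_digest` is a swapped-in technique; the commit itself compares with `==`.

```python
import hmac
import os


def load_secrets(env=os.environ):
    """Return (api_key, app_password); either value is None when the secret is unset."""
    return env.get("GOOGLE_API_KEY"), env.get("APP_PASSWORD")


def password_matches(supplied, expected):
    """Compare passwords in constant time.

    Mirrors the commit's behaviour of allowing access when no password is configured.
    """
    if not expected:
        return True
    supplied_bytes = (supplied or "").encode("utf-8")
    return hmac.compare_digest(supplied_bytes, expected.encode("utf-8"))
```

Passing a plain dict as `env` makes the lookup easy to exercise without touching the real environment.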
+
+ ## Usage
+
+ 1. Enter the password you set in the `APP_PASSWORD` secret
+ 2. Upload one or more faculty CV PDFs
+ 3. Click "Extract Accomplishments"
+ 4. View the extracted accomplishments and download as CSV if desired
+
+ ## How It Works
+
+ The app uses Google's Gemini API to analyze CV text and extract scholarly accomplishments, categorizing them into different types based on a decision tree approach.
README.md CHANGED
@@ -1,14 +1,91 @@
- ---
- title: Cv To Csv Extractor
- emoji: 🦀
- colorFrom: blue
- colorTo: purple
- sdk: gradio
- sdk_version: 5.29.0
- app_file: app.py
- pinned: false
- license: mit
- short_description: An AI-powered tool that extracts CVs to CSV.
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # CV to CSV Extraction App
+
+ A Gradio application that extracts publications, talks, and other scholarly accomplishments from faculty CVs (PDFs) using Google's Gemini API.
+
+ ## Features
+
+ - Extract scholarly accomplishments from faculty CVs in PDF format
+ - Categorize accomplishments into different types (books, journal articles, conference presentations, etc.)
+ - Display results in a tabular format
+ - Download results as CSV
+ - Password protection using Hugging Face secrets
+
+ ## Installation
+
+ 1. Clone this repository:
+ ```
+ git clone <repository-url>
+ cd CV_to_CSV
+ ```
+
+ 2. Install the required dependencies:
+ ```
+ pip install -r requirements.txt
+ ```
+
+ 3. Create a `.env` file in the root directory with your Google API key:
+ ```
+ GOOGLE_API_KEY=your_google_api_key_here
+ ```
+
+ ## Usage
+
+ ### Running Locally
+
+ 1. Run the application:
+ ```
+ python cv_extraction_app.py
+ ```
+
+ 2. Open your browser and navigate to `http://localhost:7860`
+
+ 3. Enter the password (if set in the environment variable `APP_PASSWORD`)
+
+ 4. Upload one or more faculty CV PDFs and click "Extract Accomplishments"
+
+ 5. View the extracted accomplishments and download as CSV if desired
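
Reviewer note (not part of the commit): the CSV offered for download in step 5 has one row per accomplishment, with the column names returned by `validate_and_clean_accomplishment` in `cv_extraction_app.py`. The app imports pandas; the sketch below uses the standard-library `csv` module instead, and the function name `rows_to_csv` is hypothetical.

```python
import csv
import io

# Column names as produced by validate_and_clean_accomplishment in this commit.
COLUMNS = [
    "Faculty_Name", "CV_Filename", "Main_Category", "Category", "Year",
    "Description", "DOI_URL", "Funding_Amount", "Confidence", "Needs_Review",
]


def rows_to_csv(rows):
    """Serialize accomplishment dicts to CSV text; missing fields become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Descriptions that contain commas, which is typical of citations, are quoted automatically by the writer.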
+
+ ### Deploying on Hugging Face Spaces
+
+ 1. Create a new Space on Hugging Face Spaces with the Gradio SDK
+
+ 2. Upload your code to the Space
+
+ 3. Set up the following secrets in your Space settings:
+ - `GOOGLE_API_KEY`: Your Google Gemini API key
+ - `APP_PASSWORD`: The password you want to use for app authentication
+
+ ## How It Works
+
+ 1. **Authentication**: The app checks if the provided password matches the one stored in the environment variable `APP_PASSWORD`
+
+ 2. **PDF Processing**: The app extracts text from uploaded PDF files using PyPDF2
+
+ 3. **LLM Processing**: The extracted text is sent to Google's Gemini API to identify faculty names and extract scholarly accomplishments
+
+ 4. **Categorization**: Accomplishments are categorized into different types based on a decision tree approach
+
+ 5. **Results Display**: The extracted accomplishments are displayed in a tabular format and can be downloaded as CSV
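
Reviewer note (not part of the commit): step 3 depends on the model returning a JSON object, and `cv_extraction_app.py` falls back to unwrapping a Markdown code fence when direct parsing fails. A minimal sketch of that fallback follows. The function name `parse_llm_json` is hypothetical, and the fence handling is simplified compared with the commit's version.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker, built here to keep this example readable


def parse_llm_json(response_text):
    """Parse a model reply as JSON, unwrapping a surrounding Markdown fence if needed."""
    text = response_text.strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        if not text.startswith(FENCE):
            raise  # Not fenced, so there is nothing obvious to repair
        # Drop the opening fence line (with or without a "json" tag) and the closing fence.
        body = text.split("\n", 1)[1] if "\n" in text else ""
        body = body.rsplit(FENCE, 1)[0]
        return json.loads(body)
```

A reply that is neither valid JSON nor fenced still raises `json.JSONDecodeError`, so a caller can decide whether to retry.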
+
+ ## Customization
+
+ ### Changing the Password
+
+ To change the password, update the `APP_PASSWORD` environment variable:
+
+ - Locally: Modify the `.env` file
+ - On Hugging Face Spaces: Update the secret in the Space settings
+
+ ### Modifying Categories
+
+ To modify the categories of scholarly accomplishments, edit the `MAIN_CATEGORIES` and `SCHOLARLY_WORK_TYPES` lists in `cv_extraction_app.py`.
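
Reviewer note (not part of the commit): `validate_and_clean_accomplishment` in `cv_extraction_app.py` also keeps a `main_category_map` from each specific type to its main category, so an edit to the two lists probably needs a matching edit there. The sketch below is one way to check that the three stay in sync. The function name `check_categories` is hypothetical and the data is a small subset of the commit's categories.

```python
MAIN_CATEGORIES = [
    "Books & Book Contributions",
    "Funding, Grants & Awards",
    "Other Scholarly Contributions",
]

# Illustrative subset of the type-to-main-category mapping used in the commit.
TYPE_TO_MAIN = {
    "Book, Authored": "Books & Book Contributions",
    "Grant (External)": "Funding, Grants & Awards",
    "Patent": "Other Scholarly Contributions",
}


def check_categories(work_types, type_to_main, main_categories):
    """Return (types with no mapping, mapped main categories that are not declared)."""
    unmapped = [t for t in work_types if t not in type_to_main]
    unknown_main = sorted({m for m in type_to_main.values() if m not in main_categories})
    return unmapped, unknown_main
```

Running such a check at import time would surface a newly added type that silently falls back to "Other Scholarly Contributions".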
+
+ ## Troubleshooting
+
+ - **API Key Issues**: Ensure your Google API key is correctly set in the environment variables
+ - **PDF Extraction Errors**: Some PDFs may be password-protected or have security settings that prevent text extraction
+ - **LLM Processing Errors**: If the LLM fails to extract accomplishments, try adjusting the prompt or model parameters
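
Reviewer note (not part of the commit): before changing the prompt, it may be worth ruling out transient API failures. `cv_extraction_app.py` wraps the Gemini call in a retry loop with exponential backoff (3 seconds initially, doubling, up to 2 retries). The sketch below shows the same idea in isolation. The function name `call_with_backoff` and the injectable `sleep` parameter are hypothetical.

```python
import time


def call_with_backoff(fn, max_retries=2, initial_backoff=3.0, sleep=time.sleep):
    """Call fn(); on failure wait, double the delay, and retry up to max_retries times."""
    delay = initial_backoff
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # Out of retries: surface the last error to the caller
            sleep(delay)
            delay *= 2
```

Injecting `sleep` keeps the helper testable without real waiting.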
+
+ ## License
+
+ This project is licensed under the MIT License.
cv_extraction_app.py ADDED
@@ -0,0 +1,619 @@
+ import gradio as gr
+ import os
+ import tempfile
+ import pandas as pd
+ import logging
+ import time
+ from PyPDF2 import PdfReader
+ import google.generativeai as genai
+ import json
+ import re
+ from dotenv import load_dotenv
+
+ # Load environment variables from .env file
+ load_dotenv()
+
+ # Configure logging
+ logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
+
+ # --- Configuration ---
+ GOOGLE_API_KEY = os.environ.get('GOOGLE_API_KEY')
+ MODEL_NAME = 'gemini-2.5-flash-preview-04-17' # Using the original model as specified
+ APP_PASSWORD = os.environ.get('APP_PASSWORD') # Password for app authentication
+
+ # Main categories (first tier)
+ MAIN_CATEGORIES = [
+     "Books & Book Contributions",
+     "Journal & Article Publications",
+     "Conference & Presentations",
+     "Creative & Artistic Works",
+     "Legal & Technical Documents",
+     "Funding, Grants & Awards", # New Main Category
+     "Other Scholarly Contributions"
+ ]
+
+ # Specific types (second tier) - these will be the actual categories in the CSV
+ SCHOLARLY_WORK_TYPES = [
+     # Books & Book Contributions
+     "Book, Authored",
+     "Book, Chapter",
+     "Book, Edited",
+     "Book, introduction, preface, etc.",
+     # Journal & Article Publications
+     "Journal Article, peer-reviewed",
+     "Journal Article, other",
+     "Newspaper/Magazine Article",
+     "Review/Commentary (including Blogging)",
+     # Conference & Presentations
+     "Conference Presentation - published as proceedings",
+     "Conference Presentation, other",
+     "Lecture (Invited)",
+     # Creative & Artistic Works
+     "Digital Project",
+     "Curated an Art Show",
+     "Direction/Choreography/Dramaturgy/Design",
+     "Exhibited at Curated Art Show",
+     "Music Composition Published/Performed",
+     "Performance (music, dance, theater)",
+     "Play or Screenplay Produced/Performed",
+     "Poem or Short Story Published",
+     # Legal & Technical Documents
+     "Legal Brief (Submitted)",
+     "Legal Review",
+     "Technical/Policy Reports, peer-reviewed",
+     "Technical/Policy Reports, other",
+     # Funding, Grants & Awards
+     "Grant (External)",
+     "Grant (Internal)",
+     "Fellowship",
+     "Award/Honor",
+     # Other Scholarly Contributions
+     "Patent",
+     "Other"
+ ]
+
+ # --- Helper Functions ---
+
+ def clean_text(text):
+     """Cleans text by replacing common ligatures and smart quotes."""
+     replacements = {
+         "\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl", "\ufb03": "ffi", "\ufb04": "ffl",
+         "\u201c": "\"", "\u201d": "\"", "\u2018": "'", "\u2019": "'",
+     }
+     for old, new in replacements.items():
+         text = text.replace(old, new)
+     return text
+
+ def clean_cv_specific_text(text):
+     """Apply CV-specific cleaning rules to improve text quality."""
+     # Remove page numbers (common in CVs) - improved regex
+     text = re.sub(r'\n\s*\d+\s*(\n|$)', '\n', text)
+     text = re.sub(r'^\s*\d+\s*\n', '', text) # Page number at the very beginning
+
+     # Fix common CV formatting issues like names split across lines
+     text = re.sub(r'([a-zA-Z])\s*\n\s*([a-zA-Z])', r'\1 \2', text) # General case for text split over newlines
+     text = re.sub(r'([A-Z][a-z]+(?:-[A-Z][a-z]+)?)\s*\n\s*([A-Z][a-z]+)', r'\1 \2', text) # More specific for names
+
+     # Normalize citation formats - e.g., year punctuation
+     text = re.sub(r'(\d{4})\s*\.\s*', r'\1. ', text)
+     # Remove excessive newlines
+     text = re.sub(r'\n\s*\n', '\n', text)
+     return text
+
+ def extract_text_from_pdf(pdf_file):
+     """Extracts text from a given PDF file."""
+     logging.info(f"Extracting text from: {pdf_file.name}")
+     try:
+         reader = PdfReader(pdf_file.name)
+         text = ""
+         for page in reader.pages:
+             page_text = page.extract_text()
+             if page_text:
+                 text += page_text + "\n"
+         cleaned_text = clean_text(text)
+         cleaned_text = clean_cv_specific_text(cleaned_text) # Apply CV specific cleaning
+         logging.info(f"Successfully extracted and cleaned text from {pdf_file.name} (Length: {len(cleaned_text)})")
+         return cleaned_text
+     except Exception as e:
+         logging.error(f"Error reading PDF {pdf_file.name}: {e}")
+         return None
+
+ def extract_pdf_metadata(pdf_file):
+     """Extract metadata from PDF that might help with faculty identification."""
+     try:
+         reader = PdfReader(pdf_file.name)
+         # reader.metadata can be None when the PDF has no document information
+         metadata = reader.metadata or {}
+         author = metadata.get('/Author', '')
+         title = metadata.get('/Title', '')
+         # PyPDF2 might return Author as a list
+         if isinstance(author, list):
+             author = ", ".join(author) if author else ''
+         if isinstance(title, list):
+             title = ", ".join(title) if title else ''
+
+         return {
+             'author': str(author) if author else '',
+             'title': str(title) if title else '',
+             'filename': os.path.basename(pdf_file.name)
+         }
+     except Exception as e:
+         logging.error(f"Error extracting metadata from {pdf_file.name}: {e}")
+         return {'filename': os.path.basename(pdf_file.name), 'author': '', 'title': ''}
+
+ def get_faculty_name_from_llm(cv_text_chunk):
+     """Sends a small chunk of CV text to LLM to extract only the faculty name."""
+     if not cv_text_chunk:
+         return "Unknown", None
+     prompt = f"""
+ Analyze the following CV text chunk. Identify the primary faculty member's name, usually found prominently at the beginning of the document.
+ Return the result as a single JSON object with a top-level key "faculty_name" and the extracted faculty name as a string.
+ If the name cannot be reliably determined, use "Unknown".
+
+ Example: {{ "faculty_name": "Dr. Jane Doe" }}
+
+ CV Text Chunk:
+ ---
+ {cv_text_chunk}
+ ---
+ JSON Output:
+ """
+     try:
+         model = genai.GenerativeModel(MODEL_NAME)
+         response = model.generate_content(
+             prompt,
+             generation_config=genai.types.GenerationConfig(response_mime_type="application/json")
+         )
+         parsed_json = json.loads(response.text)
+         faculty_name = parsed_json.get("faculty_name", "Unknown")
+         if not isinstance(faculty_name, str) or not faculty_name.strip():
+             faculty_name = "Unknown"
+         return faculty_name, None # No accomplishments from this call
+     except Exception as e:
+         logging.error(f"Error extracting faculty name with LLM: {e}")
+         return "Unknown", None
+
+ def get_accomplishments_from_llm(cv_text, faculty_name_hint=None):
+     """Sends CV text to Google Gemini API and returns faculty name and structured accomplishments."""
+     if not cv_text:
+         return faculty_name_hint or "Unknown", []
+
+     prompt = f"""
+ Analyze the following CV text. First, identify the primary faculty member's name, usually found prominently at the beginning of the document or in the header/footer.
+ Extract the name directly from the CV content. Look for patterns like "Curriculum Vitae of [Name]", "[Name], Ph.D.", or other indicators of the primary faculty member.
+
+ IMPORTANT: Return the faculty name in proper case (e.g., "John Smith" or "Jane Doe-Smith"), NOT in all caps, even if it appears in all caps in the document.
+
+ Second, extract scholarly accomplishments based on the categories below. Follow the decision tree approach to categorize each accomplishment accurately.
+ BE COMPREHENSIVE: Strive to extract ALL identifiable scholarly accomplishments from the CV text that fit the defined categories. Pay attention to all sections of the CV. If an item is ambiguous but potentially relevant, lean towards including it for later review.
+
+ # DECISION TREE FOR CATEGORIZATION:
+
+ Step 1: Determine the general type of scholarly work:
+ - Is it a book or book contribution? → Go to Books & Book Contributions
+ - Is it a journal article or similar publication? → Go to Journal & Article Publications
+ - Is it a conference presentation or lecture? → Go to Conference & Presentations
+ - Is it a creative or artistic work? → Go to Creative & Artistic Works
+ - Is it a legal document or technical report? → Go to Legal & Technical Documents
+ - Is it a grant, fellowship, award, or honor? → Go to Funding, Grants & Awards
+ - Is it something else scholarly? → Go to Other Scholarly Contributions
+
+ Step 2: Within each general type, determine the specific category:
+
+ ## Books & Book Contributions
+ - "Book, Authored": A complete book written by the faculty member as primary author
+ Example: "Smith, J. (2020). The Evolution of Digital Learning. Routledge."
+ - "Book, Chapter": A chapter contributed to a book edited by someone else
+ Example: "Smith, J. (2020). Digital pedagogy frameworks. In A. Johnson (Ed.), Handbook of Educational Technology (pp. 45-67). Routledge."
+ - "Book, Edited": A book where the faculty member served as editor rather than author
+ Example: "Smith, J. (Ed.). (2020). Perspectives on Digital Learning. Routledge."
+ - "Book, introduction, preface, etc.": Shorter contributions to books like forewords, introductions
+ Example: "Smith, J. (2020). Foreword. In A. Johnson, Digital Learning Environments (pp. ix-xi). Routledge."
+
+ ## Journal & Article Publications
+ - "Journal Article, peer-reviewed": Articles published in peer-reviewed academic journals
+ Example: "Smith, J. (2020). Digital literacy in higher education. Journal of Educational Technology, 45(2), 123-145. https://doi.org/10.xxxx/yyyy"
+ Look for: journal name, volume/issue numbers, DOI, mentions of peer review
+ - "Journal Article, other": Articles in non-peer-reviewed journals
+ Example: "Smith, J. (2020). Teaching in digital environments. Educational Practice, 15, 78-92."
+ - "Newspaper/Magazine Article": Articles in popular press or magazines
+ Example: "Smith, J. (2020, March 15). How technology is changing education. The Education Times, pp. 23-24."
+ - "Review/Commentary (including Blogging)": Book reviews, commentaries, blog posts
+ Example: "Smith, J. (2020). [Review of the book Digital Pedagogy, by A. Johnson]. Educational Review, 12(3), 45-47."
+
+ ## Conference & Presentations
+ - "Conference Presentation - published as proceedings": Presentations published in conference proceedings
+ Example: "Smith, J. (2020). Virtual reality in education. Proceedings of the International Conference on Educational Technology, 234-241. IEEE."
+ Look for: "Proceedings of", publisher information, page numbers
+ - "Conference Presentation, other": Presentations at conferences without formal publication
+ Example: "Smith, J. (2020, June). Virtual reality applications. Paper presented at the Educational Technology Conference, Boston, MA."
+ - "Lecture (Invited)": Talks given by invitation rather than through submission process
+ Example: "Smith, J. (2020, April). The future of digital learning. Invited lecture at Harvard University, Cambridge, MA."
+ Look for: "invited", "keynote", "guest lecture"
+
+ ## Creative & Artistic Works
+ - "Digital Project": Digital scholarship, websites, tools, or resources created
+ Example: "Smith, J. (2018-2020). Digital Learning Archive [Web application]. https://digitallearningarchive.org"
+ - "Curated an Art Show": Organization and curation of artistic exhibitions
+ Example: "Smith, J. (Curator). (2020). Digital Art in Education [Exhibition]. University Gallery, Boston, MA."
+ - "Direction/Choreography/Dramaturgy/Design": Creative direction of performances
+ Example: "Smith, J. (Director). (2020). The Digital Divide [Theater production]. University Theater, Boston, MA."
+ - "Exhibited at Curated Art Show": Participation as an artist in exhibitions
+ Example: "Smith, J. (2020). Learning Through Screens [Digital art]. In Digital Expressions, University Gallery, Boston, MA."
+ - "Music Composition Published/Performed": Musical works composed
+ Example: "Smith, J. (Composer). (2020). Digital Sonata [Musical composition]. Performed by Boston Symphony, Symphony Hall, Boston, MA."
+ - "Performance (music, dance, theater)": Performance as an artist
+ Example: "Smith, J. (Performer). (2020). The Digital Age [Dance performance]. Kennedy Center, Washington, DC."
+ - "Play or Screenplay Produced/Performed": Written dramatic works
+ Example: "Smith, J. (Playwright). (2020). Virtual Connections [Play]. Produced at University Theater, Boston, MA."
+ - "Poem or Short Story Published": Creative writing published
+ Example: "Smith, J. (2020). Digital dreams [Poem]. Literary Journal, 23(2), 45-46."
+
+ ## Legal & Technical Documents
+ - "Legal Brief (Submitted)": Legal documents submitted to courts
+ Example: "Smith, J. (2020). Amicus brief in Digital Rights Foundation v. State Board of Education. Supreme Court of Massachusetts."
+ - "Legal Review": Analysis of legal cases or issues
+ Example: "Smith, J. (2020). Digital privacy in educational settings: A legal analysis. Harvard Law Review, 133(4), 1023-1056."
+ - "Technical/Policy Reports, peer-reviewed": Technical reports that underwent peer review
+ Example: "Smith, J. (2020). Digital learning standards (Technical Report No. 2020-05). Educational Technology Consortium. [Peer-reviewed]"
+ - "Technical/Policy Reports, other": Technical reports without peer review
+ Example: "Smith, J. (2020). Implementing digital tools in K-12 (White Paper). Center for Digital Education."
+
+ ## Funding, Grants & Awards
+ - "Grant (External)": Research grants received from external funding agencies (e.g., NSF, NIH, foundations).
+ Example: "Smith, J. (PI). (2021-2024). Project Title. National Science Foundation (#1234567). $500,000."
+ For this category, extract the numeric funding amount into the "funding_amount" field (e.g., 500000).
+ - "Grant (Internal)": Research grants or seed funding received from internal university sources.
+ Example: "Smith, J. (PI). (2020). Pilot study on X. University Research Grant. $10,000."
+ For this category, extract the numeric funding amount into the "funding_amount" field (e.g., 10000).
+ - "Fellowship": Competitive fellowships awarded for research or scholarly work. May or may not have an explicit monetary value listed.
+ Example: "Smith, J. (2019-2020). Doctoral Dissertation Fellowship. Mellon Foundation. $30,000 stipend."
+ If a monetary value is stated, extract it into "funding_amount". Otherwise, use "N/A".
+ - "Award/Honor": Awards, honors, or distinctions received for scholarly work or contributions. Typically no funding amount.
+ Example: "Smith, J. (2022). Best Paper Award, International Conference on Educational Technology."
+ "funding_amount" should usually be "N/A" for this category unless explicitly stated as a monetary prize.
+
+ ## Other Scholarly Contributions
+ - "Patent": Registered intellectual property
+ Example: "Smith, J. (2020). Digital learning assessment system (U.S. Patent No. 10,123,456). U.S. Patent and Trademark Office."
+ - "Other": Scholarly contributions that don't fit other categories, such as datasets, software, or professional service.
+ Example: "Smith, J. (2020). Dataset: Survey of digital learning practices [Data set]. Harvard Dataverse. https://doi.org/10.xxxx/yyyy"
+
+
+ Return the result as a single JSON object containing:
+ 1. A top-level key "faculty_name" with the extracted faculty name as a string. If the name cannot be reliably determined from this text and no hint was provided, use "Unknown". If a hint was provided, prefer the hint if no clear name is in the text.
+ 2. A top-level key "accomplishments" containing a list of JSON objects, where each object represents one accomplishment with the following details:
+ - "category": The specific type of scholarly work from the list above (e.g., "Book, Authored", "Journal Article, peer-reviewed", etc.)
+ - "main_category": The general category this work falls under (e.g., "Books & Book Contributions", "Journal & Article Publications", etc.)
+ - "year": The year the accomplishment occurred (as an integer or string). If multiple years or a range, use the start year or the most prominent year. If no year is found, use "N/A".
+ - "description": The full description or citation of the accomplishment.
+ - "doi_url": The DOI or URL associated with the accomplishment, if present. Use "N/A" if not found.
+ - "funding_amount": For grants, fellowships, or other funded items (see "Funding, Grants & Awards"), the numeric funding amount if explicitly stated (e.g., 250000). Extract only the number, without currency symbols or commas. Use "N/A" if not applicable or not found.
+ - "confidence": A number from 1-5 indicating your confidence in this categorization (5 being highest confidence).
+
+ Ensure the entire output is a single, valid JSON object like this example:
+ {{
+ "faculty_name": "Example Faculty Name",
+ "accomplishments": [
+ {{ "category": "Journal Article, peer-reviewed", "main_category": "Journal & Article Publications", "year": "2023", "description": "...", "doi_url": "...", "funding_amount": "N/A", "confidence": 5 }},
+ {{ "category": "Book, Chapter", "main_category": "Books & Book Contributions", "year": "2022", "description": "...", "doi_url": "N/A", "funding_amount": "N/A", "confidence": 4 }}
+ ]
+ }}
+ Do not include any text before or after the JSON object.
+
+ CV Text:
+ ---
+ {cv_text[:45000]}
+ ---
+
+ JSON Output:
+ """
+     # The prompt includes only the first 45,000 characters of the CV (cv_text[:45000]) to bound request size,
+     # so very long CVs lose their tail. A more robust solution might involve chunking.
+
+     logging.info(f"Sending request to Gemini API for faculty: {faculty_name_hint or 'Unknown'}")
+
+     try:
+         model = genai.GenerativeModel(MODEL_NAME)
+         response = model.generate_content(
+             prompt,
+             generation_config=genai.types.GenerationConfig(
+                 response_mime_type="application/json",
+                 temperature=0.2 # Lower temperature for more consistent JSON formatting
+             )
+         )
+         response_text = response.text.strip()
+
+         # Try to fix common JSON formatting issues before parsing
+         try:
+             parsed_json = json.loads(response_text)
+         except json.JSONDecodeError as e:
+             logging.warning(f"Initial JSON parsing failed: {e}. Attempting to fix common issues.")
+
+             # Try to extract JSON from markdown code blocks if present
+             if response_text.startswith("```json") and "```" in response_text:
+                 code_block_content = response_text.split("```")[1]
+                 if code_block_content.startswith("json"):
+                     code_block_content = code_block_content[4:].strip()
+                 try:
+                     parsed_json = json.loads(code_block_content)
+                     logging.info("Successfully extracted JSON from code block")
+                 except json.JSONDecodeError:
+                     raise # Re-raise if this also fails
+             else:
+                 raise # Re-raise the original error if not in a code block
+
+         extracted_faculty_name = faculty_name_hint or "Unknown"
+         llm_faculty_name = parsed_json.get("faculty_name", "Unknown")
+         if not isinstance(llm_faculty_name, str) or not llm_faculty_name.strip():
+             llm_faculty_name = "Unknown"
+
+         if faculty_name_hint and faculty_name_hint != "Unknown":
+             extracted_faculty_name = faculty_name_hint
+         elif llm_faculty_name != "Unknown":
+             extracted_faculty_name = llm_faculty_name
+
+         accomplishments_list = []
+         if "accomplishments" in parsed_json and isinstance(parsed_json["accomplishments"], list):
+             accomplishments_list = parsed_json["accomplishments"]
+             logging.info(f"Successfully parsed faculty name '{extracted_faculty_name}' and {len(accomplishments_list)} accomplishments.")
+         else:
+             logging.warning("LLM response JSON does not contain a valid 'accomplishments' list.")
+
+         return extracted_faculty_name, accomplishments_list
+     except Exception as e:
+         logging.error(f"Error in LLM processing: {e}")
+         # Re-raise so get_accomplishments_with_retry can apply its retry and fallback logic;
+         # returning a default here would mean the wrapper never sees a failure.
+         raise
+
+ def get_accomplishments_with_retry(cv_text, faculty_name_hint=None, max_retries=2, initial_backoff=3):
+     """Wrapper function that adds retry logic to the LLM API call."""
+     retries = 0
+     backoff_time = initial_backoff
+
+     while retries <= max_retries:
+         try:
+             # Call the original function that might raise exceptions
+             return get_accomplishments_from_llm(cv_text, faculty_name_hint)
+         except json.JSONDecodeError as e:
+             retries += 1
+             logging.error(f"JSONDecodeError on attempt {retries}/{max_retries+1}: {e}. Response might not be valid JSON.")
+             if retries > max_retries:
+                 logging.error(f"Failed after {max_retries+1} attempts due to JSONDecodeError.")
+                 return faculty_name_hint or "Unknown", []
+             # No retry for JSONDecodeError usually, as it implies a persistent issue with response format
+             # However, for robustness, we can allow one retry if it's not the last attempt.
+             if retries <= 1: # Only retry JSON decode once
+                 logging.info(f"Retrying JSON decode in {backoff_time}s...")
+                 time.sleep(backoff_time)
+                 backoff_time *= 2
+             else:
+                 return faculty_name_hint or "Unknown", [] # Give up on JSON decode errors after 1 retry
+         except Exception as e: # Catches other API errors, network issues, etc.
+             retries += 1
+             logging.warning(f"API Error on attempt {retries}/{max_retries+1} for faculty '{faculty_name_hint or 'Unknown'}': {e}")
+             if "content filter" in str(e).lower():
+                 logging.error(f"Content filter triggered for faculty '{faculty_name_hint or 'Unknown'}'. No further retries for this error.")
+                 return faculty_name_hint or "Unknown", [] # Don't retry content filter errors
+
+             if retries > max_retries:
+                 logging.error(f"Failed after {max_retries+1} attempts for faculty '{faculty_name_hint or 'Unknown'}'.")
+                 return faculty_name_hint or "Unknown", []
+
+             logging.info(f"Retrying in {backoff_time}s for faculty '{faculty_name_hint or 'Unknown'}'...")
+             time.sleep(backoff_time)
+             backoff_time *= 2 # Exponential backoff
+     return faculty_name_hint or "Unknown", [] # Should be unreachable if logic is correct
+
405
+ def validate_and_clean_accomplishment(item, faculty_name_cv, filename):
406
+ """Validates and cleans a single accomplishment item."""
407
+ category = item.get("category", "Other")
408
+ main_category_map = {
409
+ "Book, Authored": "Books & Book Contributions",
410
+ "Book, Chapter": "Books & Book Contributions",
411
+ "Book, Edited": "Books & Book Contributions",
412
+ "Book, introduction, preface, etc.": "Books & Book Contributions",
413
+ "Journal Article, peer-reviewed": "Journal & Article Publications",
414
+ "Journal Article, other": "Journal & Article Publications",
415
+ "Newspaper/Magazine Article": "Journal & Article Publications",
416
+ "Review/Commentary (including Blogging)": "Journal & Article Publications",
417
+ "Conference Presentation - published as proceedings": "Conference & Presentations",
418
+ "Conference Presentation, other": "Conference & Presentations",
419
+ "Lecture (Invited)": "Conference & Presentations",
420
+ "Digital Project": "Creative & Artistic Works",
421
+ "Curated an Art Show": "Creative & Artistic Works",
422
+ "Direction/Choreography/Dramaturgy/Design": "Creative & Artistic Works",
423
+ "Exhibited at Curated Art Show": "Creative & Artistic Works",
424
+ "Music Composition Published/Performed": "Creative & Artistic Works",
425
+ "Performance (music, dance, theater)": "Creative & Artistic Works",
426
+ "Play or Screenplay Produced/Performed": "Creative & Artistic Works",
427
+ "Poem or Short Story Published": "Creative & Artistic Works",
428
+ "Legal Brief (Submitted)": "Legal & Technical Documents",
429
+ "Legal Review": "Legal & Technical Documents",
430
+ "Technical/Policy Reports, peer-reviewed": "Legal & Technical Documents",
431
+ "Technical/Policy Reports, other": "Legal & Technical Documents",
432
+ "Grant (External)": "Funding, Grants & Awards",
433
+ "Grant (Internal)": "Funding, Grants & Awards",
434
+ "Fellowship": "Funding, Grants & Awards",
435
+ "Award/Honor": "Funding, Grants & Awards",
436
+ "Patent": "Other Scholarly Contributions",
437
+ "Other": "Other Scholarly Contributions"
438
+ }
+     main_category = item.get("main_category")
+     # If the LLM omitted main_category or returned an unexpected value, derive it from the category
+     if not main_category or main_category not in MAIN_CATEGORIES:
+         main_category = main_category_map.get(category, "Other Scholarly Contributions")
+
+     year = str(item.get("year", "N/A"))  # Ensure year is a string
+     description = (item.get("description") or "").strip()  # Tolerate a null description
+     doi_url = item.get("doi_url", "N/A")
+     funding_amount = item.get("funding_amount", "N/A")
+     confidence = item.get("confidence", 3)  # Default to medium confidence
+     try:
+         confidence = int(confidence)
+     except (ValueError, TypeError):
+         confidence = 3  # Default if conversion fails
+
+     needs_review = confidence < 3
+
+     # Basic validation: skip items with an empty description
+     if not description:
+         return None
+
+     return {
+         "Faculty_Name": faculty_name_cv,
+         "CV_Filename": os.path.basename(filename),
+         "Main_Category": main_category,
+         "Category": category,
+         "Year": year,
+         "Description": description,
+         "DOI_URL": doi_url,
+         "Funding_Amount": funding_amount,
+         "Confidence": confidence,
+         "Needs_Review": "Yes" if needs_review else "No"
+     }
+
+ # --- Gradio App Functions ---
+
+ def check_password(password):
+     """Check if the provided password matches the app password."""
+     if not APP_PASSWORD:
+         # If no password is set, allow access (for development)
+         return True
+     return password == APP_PASSWORD
+
+ def process_cv_files(pdf_files, progress=gr.Progress()):
+     """Process uploaded CV files and extract accomplishments."""
+     if not pdf_files:
+         raise gr.Error("Please upload at least one PDF file.")
+
+     if not GOOGLE_API_KEY:
+         raise gr.Error("Google API key is not configured. Please set the GOOGLE_API_KEY environment variable.")
+
+     genai.configure(api_key=GOOGLE_API_KEY)
+
+     all_accomplishments = []
+     total_steps = len(pdf_files) * 4  # 4 steps per file: start file, extract text, get metadata, extract accomplishments
+     current_step = 0
+
+     # Process each PDF file
+     for i, pdf_file in enumerate(pdf_files):
+         file_name = os.path.basename(pdf_file.name)
+         progress(current_step/total_steps, f"Processing file {i+1}/{len(pdf_files)}: {file_name}")
+         current_step += 1
+
+         # Extract text from PDF
+         progress(current_step/total_steps, f"Extracting text from {file_name}")
+         cv_text = extract_text_from_pdf(pdf_file)
+         if not cv_text:
+             gr.Warning(f"Could not extract text from {file_name}. Skipping.")
+             current_step += 3  # Skip remaining steps for this file
+             continue
+         current_step += 1
+
+         # Get PDF metadata
+         progress(current_step/total_steps, f"Processing metadata for {file_name}")
+         pdf_metadata = extract_pdf_metadata(pdf_file)
+         current_step += 1
+
+         # Extract faculty name and accomplishments
+         progress(current_step/total_steps, f"Extracting accomplishments from {file_name}")
+         faculty_name_cv, accomplishments_list = get_accomplishments_with_retry(cv_text)
+         current_step += 1
+
+         # Fallback logic if LLM returns "Unknown"
+         if faculty_name_cv == "Unknown":
+             metadata_author = pdf_metadata.get('author', '').strip()
+             if metadata_author:
+                 faculty_name_cv = metadata_author
+                 logging.info(f"Used PDF metadata author '{faculty_name_cv}' for {pdf_file.name}")
+
+         if faculty_name_cv == "Unknown":  # If still unknown, try filename
+             name_from_file = os.path.splitext(os.path.basename(pdf_file.name))[0].replace("_", " ").replace("-", " ")
+             # Basic heuristic to see if it looks like a name
+             if len(name_from_file.split()) > 1 and len(name_from_file.split()) < 4:
+                 faculty_name_cv = name_from_file.title()
+
+         # Process accomplishments
+         if accomplishments_list:
+             for item in accomplishments_list:
+                 processed_item = validate_and_clean_accomplishment(item, faculty_name_cv, pdf_file.name)
+                 if processed_item:
+                     all_accomplishments.append(processed_item)
+         else:
+             gr.Warning(f"No accomplishments found for {os.path.basename(pdf_file.name)}.")
+
+     if not all_accomplishments:
+         raise gr.Error("No accomplishments were extracted from the provided PDFs.")
+
+     # Convert to DataFrame for display
+     df = pd.DataFrame(all_accomplishments)
+
+     # Write the CSV to a temporary file on disk for download; close the handle before pandas reopens the path
+     with tempfile.NamedTemporaryFile(delete=False, suffix='.csv') as csv_file:
+         csv_path = csv_file.name
+     df.to_csv(csv_path, index=False)
+
+     return df, csv_path
+
+ # --- Gradio Interface ---
+
+ # Create the authentication interface
+ with gr.Blocks(title="CV to CSV Extraction App") as app:
+     gr.Markdown("# CV to CSV Extraction App")
+     gr.Markdown("Extract publications and accomplishments from faculty CVs")
+
+     # Authentication state
+     authenticated = gr.State(False)
+
+     # Login interface
+     with gr.Group(visible=True) as login_group:
+         gr.Markdown("### Authentication Required")
+         password_input = gr.Textbox(type="password", label="Password")
+         login_button = gr.Button("Login")
+         login_error = gr.Markdown(visible=False)
+
+     # Main app interface (initially hidden)
+     with gr.Group(visible=False) as main_app:
+         with gr.Tab("Extract from CVs"):
+             gr.Markdown("### Upload Faculty CV PDFs")
+             gr.Markdown("Upload one or more PDF files containing faculty CVs. The app will extract publications and other scholarly accomplishments.")
+
+             # File upload
+             pdf_input = gr.File(file_count="multiple", label="Upload CV PDFs", file_types=[".pdf"])
+             process_button = gr.Button("Extract Accomplishments")
+
+             # Results display
+             results = gr.DataFrame(label="Extracted Accomplishments", interactive=False)
+
+             # Download button
+             csv_output = gr.File(label="Download as CSV")
+
+             # Process button click
+             process_button.click(
+                 fn=process_cv_files,
+                 inputs=[pdf_input],
+                 outputs=[results, csv_output],
+                 api_name="extract_accomplishments"
+             )
+
+     # Login button click
+     def login(password):
+         if check_password(password):
+             return {
+                 login_group: gr.update(visible=False),
+                 main_app: gr.update(visible=True),
+                 login_error: gr.update(visible=False),
+                 authenticated: True
+             }
+         else:
+             return {
+                 login_error: gr.update(visible=True, value="Invalid password. Please try again."),
+                 authenticated: False
+             }
+
+     login_button.click(
+         fn=login,
+         inputs=[password_input],
+         outputs=[login_group, main_app, login_error, authenticated]
+     )
+
+ # Launch the app
+ if __name__ == "__main__":
+     app.launch()
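The faculty-name fallback in `process_cv_files` (LLM result first, then the PDF metadata author, then a filename heuristic) could be isolated as a small pure function, which would make it easy to test without Gradio or the Gemini API. The following is a minimal sketch and not part of this commit; `guess_faculty_name` and its parameters are hypothetical names introduced only for illustration.

```python
import os


def guess_faculty_name(llm_name, metadata_author, pdf_path):
    """Hypothetical helper mirroring the fallback order used in process_cv_files."""
    # 1. Trust the LLM unless it reported the sentinel value "Unknown".
    if llm_name != "Unknown":
        return llm_name

    # 2. Fall back to the PDF metadata author, if one is present.
    author = (metadata_author or "").strip()
    if author:
        return author

    # 3. Last resort: treat a 2- or 3-word filename stem as a name.
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    candidate = stem.replace("_", " ").replace("-", " ")
    if 1 < len(candidate.split()) < 4:
        return candidate.title()

    return "Unknown"
```

Under these assumptions, a file named `jane_doe-smith.pdf` with no usable metadata would yield `Jane Doe Smith`, while a stem such as `cv` or `a_b_c_d` would stay `Unknown`.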
huggingface-space.yml ADDED
@@ -0,0 +1,9 @@
+ title: CV to CSV Extraction App
+ emoji: 📄
+ colorFrom: blue
+ colorTo: green
+ sdk: gradio
+ sdk_version: 4.19.2
+ app_file: app.py
+ pinned: false
+ license: mit
requirements.txt ADDED
@@ -0,0 +1,6 @@
+ google-generativeai
+ PyPDF2
+ python-dotenv
+ scholarly
+ gradio
+ pandas