daniel-wojahn committed commit 4bf5701 (verified) · 1 parent: c667561

Upload 19 files
README.md CHANGED
@@ -1,14 +1,122 @@
- ---
- title: Ttm Webapp Hf
- emoji: 💻
- colorFrom: yellow
- colorTo: yellow
- sdk: gradio
- sdk_version: 5.29.1
- app_file: app.py
- pinned: false
- license: mit
- short_description: A web app for analyzing Tibetan textual similarities
- ---
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Tibetan Text Metrics Web App
+
+ [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
+ [![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)
+ [![Project Status: Active – Web app version for accessible text analysis.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
+
+ A user-friendly web application for analyzing textual similarities and variations in Tibetan manuscripts. This tool provides a graphical interface to the core text comparison functionalities of the [Tibetan Text Metrics (TTM)](https://github.com/daniel-wojahn/tibetan-text-metrics) project, making it accessible to researchers without Python or command-line experience. Built with Python and Gradio.
+
+ ## Background
+
+ The Tibetan Text Metrics project aims to provide quantitative methods for assessing textual similarities at the chapter or segment level, helping researchers understand patterns of textual evolution. This web application extends these capabilities by offering an intuitive interface, removing the need for manual script execution and environment setup for end-users.
+
+ ## Key Features of the Web App
+
+ - **Easy File Upload**: Upload one or more Tibetan `.txt` files directly through the browser.
+ - **Automatic Segmentation**: Uses Tibetan section markers (e.g., `༈`) to automatically split texts into comparable chapters or sections.
+ - **Core Metrics Computed**:
+   - **Jaccard Similarity (%)**: Measures vocabulary overlap between segments.
+   - **Normalized Longest Common Subsequence (LCS)**: Identifies the longest shared sequence of words, indicating direct textual parallels.
+   - **Semantic Similarity (BuddhistNLP)**: Uses the `buddhist-nlp/bodhi-sentence-cased-v1` model to compare the contextual meaning of segments.
+   - **TF-IDF Cosine Similarity**: Highlights texts that share important or characteristic terms by comparing their TF-IDF profiles.
+ - **Handles Long Texts**: Implements automated chunking for semantic similarity to process texts exceeding the model's token limit.
+ - **Interactive Visualizations**:
+   - Heatmaps for Jaccard, LCS, Semantic, and TF-IDF similarity metrics, providing a quick overview of inter-segment relationships.
+   - Bar chart displaying word counts per segment.
+ - **Downloadable Results**: Export detailed metrics as a CSV file.
+ - **Simplified Workflow**: No command-line interaction or Python scripting needed for analysis.
+
+ ## Text Segmentation and Best Practices
+
+ **Why segment your texts?**
+
+ To obtain meaningful results, it is highly recommended to divide your Tibetan texts into logical chapters or sections before uploading. Comparing entire texts as a single unit often produces shallow or misleading results, especially for long or complex works. Chapters or sections allow the tool to detect stylistic, lexical, or structural differences that would otherwise be hidden.
+
+ **How to segment your texts:**
+
+ - Use a clear marker (e.g., `༈` or another unique string) to separate chapters/sections in your `.txt` files.
+ - Each segment should represent a coherent part of the text (e.g., a chapter, legal clause, or thematic section).
+ - The tool will automatically split your file on this marker for analysis. If no marker is found, the entire file is treated as a single segment, and a warning will be issued.
+
+ **Best practices:**
+
+ - Ensure your marker is unique and does not appear within a chapter.
+ - Try to keep chapters/sections of similar length for more balanced comparisons.
+ - For poetry or short texts, consider grouping several poems or stanzas as one segment.
+
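As a minimal sketch of what this marker-based splitting looks like (for illustration only; the app's actual segmentation logic lives in `pipeline/process.py` and may differ in details):

```python
# Minimal sketch of marker-based segmentation; the web app's real logic
# lives in pipeline/process.py and may differ in details.
marker = "༈"
raw_text = "བམ་པོ་དང་པོ། ༈ བམ་པོ་གཉིས་པ། ༈ བམ་པོ་གསུམ་པ།"

# Split on the marker and drop empty or whitespace-only pieces.
segments = [seg.strip() for seg in raw_text.split(marker) if seg.strip()]

if len(segments) == 1:
    # Mirrors the behaviour described above: no marker found means one
    # segment plus a warning.
    print("Warning: no section marker found; treating the file as a single segment.")

print(f"{len(segments)} segments")
```

Running this on the three-section sample above yields three segments; a file without the marker would fall through to the warning branch.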
+ ## Implemented Metrics
+
+ The application computes and visualizes the following similarity metrics between corresponding chapters/segments of the uploaded texts:
+
+ 1. **Jaccard Similarity (%)**: This metric quantifies the lexical overlap between two text segments by comparing their sets of *unique* words. It essentially answers the question: 'Of all the distinct words found across these two segments, what proportion is present in both?' It is calculated as `(Number of common unique words) / (Total number of unique words in both texts combined) * 100`. Jaccard Similarity is insensitive to word order and word frequency; it only registers whether a unique word is present or absent. A higher percentage indicates a greater overlap between the vocabularies of the two segments.
+ 2. **Normalized LCS (Longest Common Subsequence)**: This metric measures the length of the longest sequence of words that appears in *both* text segments while maintaining its original relative order. Importantly, these words do not need to be directly adjacent (contiguous) in either text. For example, if Text A is '<u>the</u> quick <u>brown</u> fox <u>jumps</u>' and Text B is '<u>the</u> lazy cat and <u>brown</u> dog <u>jumps</u> high', the LCS is 'the brown jumps'. The length of this common subsequence is then normalized (in this tool, by dividing by the average length of the two segments) to produce a score, which is then presented as a percentage. A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism, as it reflects similarities in how ideas are ordered and expressed sequentially.
+    * *Note on interpretation*: Normalized LCS can be higher than Jaccard Similarity. This often happens when texts share a substantial 'narrative backbone' or common ordered phrases (yielding a high LCS), even while using varied surrounding vocabulary or many unique words outside these core sequences (which lowers the Jaccard score). LCS highlights this sequential, structural similarity, while Jaccard focuses on the overall shared vocabulary regardless of its arrangement.
+ 3. **Semantic Similarity (BuddhistNLP)**: Utilizes the `buddhist-nlp/bodhi-sentence-cased-v1` model to compute the cosine similarity between the semantic embeddings of text segments. This model is fine-tuned for Buddhist studies texts and captures nuances in meaning. For texts exceeding the model's 512-token input limit, an automated chunking strategy is employed: texts are divided into overlapping chunks, each chunk is embedded, and the resulting chunk embeddings are averaged (mean pooling) to produce a single representative vector for the entire segment before comparison.
+ 4. **TF-IDF Cosine Similarity**: This metric first calculates Term Frequency-Inverse Document Frequency (TF-IDF) scores for each word in each text segment. TF-IDF gives higher weight to words that are frequent within a particular segment but relatively rare across the entire collection of segments, identifying terms that are characteristic or discriminative for a segment. Each segment is then represented as a vector of these TF-IDF scores, and the cosine similarity is computed between these vectors. A score closer to 1 indicates that the two segments share more of these important, distinguishing terms, suggesting they cover similar specific topics or themes.
+
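To make the first two metrics concrete, here is a small pure-Python sketch (for illustration only; the app itself tokenizes with `botok` and computes LCS with a numba-accelerated routine in `pipeline/metrics.py`):

```python
from typing import List

def jaccard_percent(words1: List[str], words2: List[str]) -> float:
    """Vocabulary overlap over unique words: |intersection| / |union| * 100."""
    set1, set2 = set(words1), set(words2)
    if not set1 or not set2:
        return 0.0
    return len(set1 & set2) / len(set1 | set2) * 100.0

def normalized_lcs(words1: List[str], words2: List[str]) -> float:
    """LCS length via dynamic programming, normalized by average segment length."""
    m, n = len(words1), len(words2)
    if m == 0 or n == 0:
        return 0.0
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if words1[i - 1] == words2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n] / ((m + n) / 2)

# The example from the description above (English stand-ins for tokens):
a = "the quick brown fox jumps".split()
b = "the lazy cat and brown dog jumps high".split()
print(round(jaccard_percent(a, b), 1))  # 3 shared of 10 unique words -> 30.0
print(round(normalized_lcs(a, b), 2))   # LCS 'the brown jumps' = 3; avg length 6.5 -> 0.46
```

Note how the two scores diverge on the same pair: the segments share only 30% of their vocabulary, yet the ordered backbone 'the brown jumps' keeps the LCS score comparatively high.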
+ ## Getting Started (Running Locally)
+
+ 1. Ensure you have Python 3.10 or newer.
+ 2. Navigate to the `webapp` directory:
+    ```bash
+    cd path/to/tibetan-text-metrics/webapp
+    ```
+ 3. Create a virtual environment (recommended):
+    ```bash
+    python -m venv .venv
+    source .venv/bin/activate  # On macOS/Linux
+    # .venv\Scripts\activate   # On Windows
+    ```
+ 4. Install dependencies:
+    ```bash
+    pip install -r requirements.txt
+    ```
+ 5. Run the web application:
+    ```bash
+    python app.py
+    ```
+ 6. Open your web browser and go to the local URL provided (usually `http://127.0.0.1:7860`).
+
+ ## Usage
+
+ 1. **Upload Files**: Use the file upload interface to select one or more `.txt` files containing Tibetan Unicode text.
+ 2. **Run Analysis**: Click the "Run Analysis" button.
+ 3. **View Results**:
+    - A preview of the similarity metrics will be displayed.
+    - Download the full results as a CSV file.
+    - Interactive heatmaps for Jaccard Similarity, Normalized LCS, Semantic Similarity, and TF-IDF Cosine Similarity will be generated.
+    - A bar chart showing word counts per segment will also be available.
+    - Any warnings (e.g., regarding missing chapter markers) will be displayed.
+
+ ## Structure
+
+ - `app.py` — Gradio web app entry point and UI definition.
+ - `pipeline/` — Modules for file handling, text processing, metrics calculation, and visualization.
+   - `process.py`: Core logic for segmenting texts and orchestrating metric computation.
+   - `metrics.py`: Implementation of Jaccard, LCS, TF-IDF, and Semantic Similarity (including chunking).
+   - `semantic_embedding.py`: Handles loading and using the sentence transformer model.
+   - `tokenize.py`: Tibetan text tokenization using `botok`.
+   - `upload.py`: File upload handling (currently minimal).
+   - `visualize.py`: Generates heatmaps and word count plots.
+ - `requirements.txt` — Python dependencies for the web application.
+
+ ## License
+
+ This project is licensed under the Creative Commons Attribution 4.0 International License - see the [LICENSE](../../LICENSE) file in the main project directory for details.
+
+ ## Citation
+
+ If you use this web application or the underlying TTM tool in your research, please cite the main project:
+
+ ```bibtex
+ @software{wojahn2025ttm,
+   title = {TibetanTextMetrics (TTM): Computing Text Similarity Metrics on POS-tagged Tibetan Texts},
+   author = {Daniel Wojahn},
+   year = {2025},
+   url = {https://github.com/daniel-wojahn/tibetan-text-metrics},
+   version = {0.3.0}
+ }
+ ```
+
+ ---
+ For questions or issues specifically regarding the web application, please refer to the main project's [issue tracker](https://github.com/daniel-wojahn/tibetan-text-metrics/issues) or contact Daniel Wojahn.
app.py ADDED
@@ -0,0 +1,286 @@
+ import gradio as gr
+ from pathlib import Path
+ from pipeline.process import process_texts
+ from pipeline.visualize import generate_visualizations, generate_word_count_chart
+ import cProfile
+ import pstats
+ import io
+
+ from theme import tibetan_theme
+
+
+ # Main interface logic
+ def main_interface():
+     with gr.Blocks(
+         theme=tibetan_theme,
+         title="Tibetan Text Metrics Web App",
+         css=tibetan_theme.get_css_string(),
+     ) as demo:
+         gr.Markdown(
+             """# Tibetan Text Metrics Web App
+ <span style='font-size:18px;'>A user-friendly web application for analyzing textual similarities and variations in Tibetan manuscripts, providing a graphical interface to the core functionalities of the [Tibetan Text Metrics (TTM)](https://github.com/daniel-wojahn/tibetan-text-metrics) project.</span>
+ """,
+             elem_classes="gr-markdown",
+         )
+
+         with gr.Row(elem_id="steps-row"):
+             with gr.Column(scale=1, elem_classes="step-column"):
+                 with gr.Group():
+                     gr.Markdown(
+                         """
+ ## Step 1: Upload Your Tibetan Text Files
+ <span style='font-size:16px;'>Upload one or more `.txt` files. Each file should contain Unicode Tibetan text, segmented into chapters/sections if possible.</span>
+ """,
+                         elem_classes="gr-markdown",
+                     )
+                     file_input = gr.File(
+                         label="Upload Tibetan .txt files",
+                         file_types=[".txt"],
+                         file_count="multiple",
+                     )
+             with gr.Column(scale=1, elem_classes="step-column"):
+                 with gr.Group():
+                     gr.Markdown(
+                         """## Step 2: Configure and Run the Analysis
+ <span style='font-size:16px;'>Choose your analysis options and click the button below to compute metrics and view results. For meaningful analysis, ensure your texts are segmented by chapter or section using a marker like '༈'. The tool will split files based on this marker.</span>
+ """,
+                         elem_classes="gr-markdown",
+                     )
+                     semantic_toggle_radio = gr.Radio(
+                         label="Compute Semantic Similarity?",
+                         choices=["Yes", "No"],
+                         value="Yes",
+                         info="Semantic similarity can be time-consuming. Choose 'No' to speed up analysis if these metrics are not required.",
+                         elem_id="semantic-radio-group",
+                     )
+                     process_btn = gr.Button(
+                         "Run Analysis", elem_id="run-btn", variant="primary"
+                     )
+
+         gr.Markdown(
+             """## Results
+ """,
+             elem_classes="gr-markdown",
+         )
+         # The heatmap_titles and metric_tooltips dictionaries are defined below.
+         csv_output = gr.File(label="Download CSV Results")
+         metrics_preview = gr.Dataframe(
+             label="Similarity Metrics Preview", interactive=False, visible=True
+         )
+         word_count_plot = gr.Plot(label="Word Counts per Segment")
+         # Heatmap tabs for each metric
+         heatmap_titles = {
+             "Jaccard Similarity (%)": "Jaccard Similarity (%): Higher scores (brighter) mean more shared unique words.",
+             "Normalized LCS": "Normalized LCS: Higher scores (brighter) mean longer shared sequences of words.",
+             "Semantic Similarity (BuddhistNLP)": "Semantic Similarity (BuddhistNLP): Higher scores (brighter) mean more similar meanings.",
+             "TF-IDF Cosine Sim": "TF-IDF Cosine Similarity: Higher scores mean texts share more important, distinctive vocabulary.",
+         }
+
80
+
81
+ metric_tooltips = {
82
+ "Jaccard Similarity (%)": """
83
+ ### Jaccard Similarity (%)
84
+ This metric quantifies the lexical overlap between two text segments by comparing their sets of *unique* words.
85
+ It essentially answers the question: 'Of all the distinct words found across these two segments, what proportion of them are present in both?'
86
+ It is calculated as `(Number of common unique words) / (Total number of unique words in both texts combined) * 100`.
87
+ Jaccard Similarity is insensitive to word order and word frequency; it only cares whether a unique word is present or absent.
88
+ A higher percentage indicates a greater overlap in the vocabularies used in the two segments.
89
+ """,
90
+ "Normalized LCS": """
91
+ ### Normalized LCS (Longest Common Subsequence)
92
+ This metric measures the length of the longest sequence of words that appears in *both* text segments, maintaining their original relative order.
93
+ Importantly, these words do not need to be directly adjacent (contiguous) in either text.
94
+ For example, if Text A is '<u>the</u> quick <u>brown</u> fox <u>jumps</u>' and Text B is '<u>the</u> lazy cat and <u>brown</u> dog <u>jumps</u> high', the LCS is 'the brown jumps'.
95
+ The length of this common subsequence is then normalized (in this tool, by dividing by the length of the longer of the two segments) to provide a score, which is then presented as a percentage.
96
+ A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism, as it reflects similarities in how ideas are ordered and expressed sequentially.
97
+
98
+ *Note on Interpretation*: It's possible for Normalized LCS to be higher than Jaccard Similarity. This often happens when texts share a substantial 'narrative backbone' or common ordered phrases (leading to a high LCS), even if they use varied surrounding vocabulary or introduce many unique words not part of these core sequences (which would lower the Jaccard score). LCS highlights this sequential, structural similarity, while Jaccard focuses on the overall shared vocabulary regardless of its arrangement.
99
+ """,
100
+ "Semantic Similarity (BuddhistNLP)": """
101
+ ### Semantic Similarity (BuddhistNLP)
102
+ Utilizes the `buddhist-nlp/bodhi-sentence-cased-v1` model to compute the cosine similarity between the semantic embeddings of text segments.
103
+ This model is fine-tuned for Buddhist studies texts and captures nuances in meaning.
104
+ For texts exceeding the model's 512-token input limit, an automated chunking strategy is employed: texts are divided into overlapping chunks, each chunk is embedded, and the resulting chunk embeddings are averaged (mean pooling) to produce a single representative vector for the entire segment before comparison.
105
+ """,
106
+ "TF-IDF Cosine Sim": """
107
+ ### TF-IDF Cosine Similarity
108
+ This metric first calculates Term Frequency-Inverse Document Frequency (TF-IDF) scores for each word in each text segment.
109
+ TF-IDF gives higher weight to words that are frequent within a particular segment but relatively rare across the entire collection of segments.
110
+ This helps to identify terms that are characteristic or discriminative for a segment.
111
+ Each segment is then represented as a vector of these TF-IDF scores.
112
+ Finally, the cosine similarity is computed between these vectors.
113
+ A score closer to 1 indicates that the two segments share more of these important, distinguishing terms, suggesting they cover similar specific topics or themes.
114
+ """,
115
+ }
+         heatmap_tabs = {}
+         gr.Markdown("## Detailed Metric Analysis", elem_classes="gr-markdown")
+         with gr.Tabs(elem_id="heatmap-tab-group"):
+             for metric_key, descriptive_title in heatmap_titles.items():
+                 with gr.Tab(metric_key):
+                     if metric_key in metric_tooltips:
+                         gr.Markdown(value=metric_tooltips[metric_key])
+                     else:
+                         gr.Markdown(
+                             value=f"### {metric_key}\nDescription not found."
+                         )  # Fallback
+                     heatmap_tabs[metric_key] = gr.Plot(
+                         label=f"Heatmap: {metric_key}", show_label=False
+                     )
+
+         # The outputs in process_btn.click use the short metric names as keys for
+         # heatmap_tabs, e.g. heatmap_tabs["Jaccard Similarity (%)"]. Each plot is
+         # created inside its own tab above, so it is already part of the layout.
+
+         warning_box = gr.Markdown(visible=False)
+
+         def run_pipeline(files, enable_semantic_str):
+             """
+             Processes uploaded files, computes metrics, generates visualizations, and prepares outputs for the UI.
+
+             Args:
+                 files (list): A list of file objects uploaded by the user.
+                 enable_semantic_str (str): "Yes" or "No", from the semantic-similarity radio control.
+
+             Returns:
+                 tuple: A tuple containing the following elements in order:
+                     - csv_path (str | None): Path to the generated CSV results file, or None on error.
+                     - metrics_preview_df (pd.DataFrame | str | None): DataFrame for metrics preview, error string, or None.
+                     - word_count_fig (matplotlib.figure.Figure | None): Plot of word counts, or None on error.
+                     - jaccard_heatmap (matplotlib.figure.Figure | None): Jaccard similarity heatmap, or None.
+                     - lcs_heatmap (matplotlib.figure.Figure | None): LCS heatmap, or None.
+                     - semantic_heatmap (matplotlib.figure.Figure | None): Semantic similarity heatmap, or None.
+                     - tfidf_heatmap (matplotlib.figure.Figure | None): TF-IDF cosine similarity heatmap, or None.
+                     - warning_update (gr.update): Gradio update for the warning box.
+             """
+             if not files:
+                 # Eight values to match the eight outputs wired to process_btn.click
+                 return (
+                     None,
+                     "Please upload files to process.",
+                     None,
+                     None,
+                     None,
+                     None,
+                     None,
+                     gr.update(value="Please upload files to process.", visible=True),
+                 )
+
+             pr = cProfile.Profile()
+             pr.enable()
+
+             # Initialize results so they are defined in `finally` if an early error occurs
+             (
+                 csv_path_res,
+                 metrics_preview_df_res,
+                 word_count_fig_res,
+                 jaccard_heatmap_res,
+                 lcs_heatmap_res,
+                 semantic_heatmap_res,
+                 tfidf_heatmap_res,
+                 warning_update_res,
+             ) = (
+                 None,
+                 "Processing error. See console for details.",
+                 None,
+                 None,
+                 None,
+                 None,
+                 None,
+                 gr.update(
+                     value="Processing error. See console for details.", visible=True
+                 ),
+             )
+
+             try:
+                 filenames = [
+                     Path(file.name).name for file in files
+                 ]  # Path().name gives just the filename
+                 text_data = {
+                     Path(file.name).name: Path(file.name).read_text(encoding="utf-8-sig")
+                     for file in files
+                 }
+
+                 enable_semantic_bool = enable_semantic_str == "Yes"
+                 df_results, word_counts_df_data, warning_raw = process_texts(
+                     text_data, filenames, enable_semantic=enable_semantic_bool
+                 )
+
+                 if df_results.empty:
+                     warning_md = f"**⚠️ Warning:** {warning_raw}" if warning_raw else ""
+                     warning_message = (
+                         "No common chapters found or results are empty. " + warning_md
+                     )
+                     metrics_preview_df_res = warning_message
+                     warning_update_res = gr.update(value=warning_message, visible=True)
+                     # Results for this case are set; `finally` runs, then the function returns.
+                 else:
+                     # heatmap_titles is defined in the enclosing main_interface scope
+                     heatmaps_data = generate_visualizations(
+                         df_results, descriptive_titles=heatmap_titles
+                     )
+                     word_count_fig_res = generate_word_count_chart(word_counts_df_data)
+                     csv_path_res = "results.csv"
+                     df_results.to_csv(csv_path_res, index=False)
+                     warning_md = f"**⚠️ Warning:** {warning_raw}" if warning_raw else ""
+                     metrics_preview_df_res = df_results.head(10)
+
+                     jaccard_heatmap_res = heatmaps_data.get("Jaccard Similarity (%)")
+                     lcs_heatmap_res = heatmaps_data.get("Normalized LCS")
+                     semantic_heatmap_res = heatmaps_data.get(
+                         "Semantic Similarity (BuddhistNLP)"
+                     )
+                     tfidf_heatmap_res = heatmaps_data.get("TF-IDF Cosine Sim")
+                     warning_update_res = gr.update(
+                         visible=bool(warning_raw), value=warning_md
+                     )
+
+             except Exception as e:
+                 # Errors are already logged in process_texts or lower levels
+                 metrics_preview_df_res = f"Error: {str(e)}"
+                 warning_update_res = gr.update(value=f"Error: {str(e)}", visible=True)
+
+             finally:
+                 pr.disable()
+                 s = io.StringIO()
+                 sortby = pstats.SortKey.CUMULATIVE  # Sort by cumulative time spent in function
+                 ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
+                 print("\n--- cProfile Stats (Top 30) ---")
+                 ps.print_stats(30)  # Print the top 30 costly functions
+                 print(s.getvalue())
+                 print("-------------------------------\n")
+
+             return (
+                 csv_path_res,
+                 metrics_preview_df_res,
+                 word_count_fig_res,
+                 jaccard_heatmap_res,
+                 lcs_heatmap_res,
+                 semantic_heatmap_res,
+                 tfidf_heatmap_res,
+                 warning_update_res,
+             )
+
+         process_btn.click(
+             run_pipeline,
+             inputs=[file_input, semantic_toggle_radio],
+             outputs=[
+                 csv_output,
+                 metrics_preview,
+                 word_count_plot,
+                 heatmap_tabs["Jaccard Similarity (%)"],
+                 heatmap_tabs["Normalized LCS"],
+                 heatmap_tabs["Semantic Similarity (BuddhistNLP)"],
+                 heatmap_tabs["TF-IDF Cosine Sim"],
+                 warning_box,
+             ],
+         )
+     return demo
+
+
+ if __name__ == "__main__":
+     demo = main_interface()
+     demo.launch()
pipeline/__init__.py ADDED
@@ -0,0 +1 @@
+
pipeline/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (187 Bytes).
pipeline/__pycache__/metrics.cpython-310.pyc ADDED
Binary file (7.23 kB).
pipeline/__pycache__/process.cpython-310.pyc ADDED
Binary file (3.74 kB).
pipeline/__pycache__/semantic_embedding.cpython-310.pyc ADDED
Binary file (4.02 kB).
pipeline/__pycache__/tokenize.cpython-310.pyc ADDED
Binary file (1.14 kB).
pipeline/__pycache__/upload.cpython-310.pyc ADDED
Binary file (983 Bytes).
pipeline/__pycache__/visualize.cpython-310.pyc ADDED
Binary file (4.37 kB).
pipeline/metrics.py ADDED
@@ -0,0 +1,279 @@
+ import numpy as np
+ import pandas as pd
+ from typing import List, Dict
+ from itertools import combinations
+ from sklearn.metrics.pairwise import cosine_similarity
+ import torch
+ from .semantic_embedding import generate_embeddings
+ from .tokenize import tokenize_texts
+ import logging
+ from numba import njit
+ from sklearn.feature_extraction.text import TfidfVectorizer
+
+ logger = logging.getLogger(__name__)
+
+ MAX_TOKENS_PER_CHUNK = 500  # Max tokens (words via botok) per chunk
+ CHUNK_OVERLAP = 50  # Number of tokens to overlap between chunks
+
+
+ def _chunk_text(
+     original_text_content: str,
+     tokens: List[str],
+     max_chunk_tokens: int,
+     overlap_tokens: int,
+ ) -> List[str]:
+     """
+     Splits a list of tokens into chunks and reconstructs text segments from these token chunks.
+     The reconstructed text segments are intended for embedding models.
+
+     Args:
+         original_text_content (str): The original raw text string. Used if no chunking is needed.
+         tokens (List[str]): The list of botok tokens for original_text_content.
+         max_chunk_tokens (int): Maximum number of botok tokens per chunk.
+         overlap_tokens (int): Number of botok tokens to overlap between chunks.
+
+     Returns:
+         List[str]: A list of text strings, where each string is a chunk.
+     """
+     if not tokens:
+         # Handles empty or whitespace-only original text that produced no tokens
+         return [original_text_content] if original_text_content.strip() else []
+
+     if len(tokens) <= max_chunk_tokens:
+         # If not chunking, return the original text content directly: raw text
+         # segments are passed to the model unmodified, and joining tokens here
+         # would alter spacing, etc.
+         return [original_text_content]
+
+     reconstructed_text_chunks = []
+     start_idx = 0
+     while start_idx < len(tokens):
+         end_idx = min(start_idx + max_chunk_tokens, len(tokens))
+         current_chunk_botok_tokens = tokens[start_idx:end_idx]
+         # Reconstruct the text chunk by joining the botok tokens. This is an
+         # approximation; the semantic model's internal tokenizer will handle this string.
+         reconstructed_text_chunks.append(" ".join(current_chunk_botok_tokens))
+
+         if end_idx == len(tokens):
+             break
+
+         next_start_idx = start_idx + max_chunk_tokens - overlap_tokens
+         if next_start_idx <= start_idx:
+             next_start_idx = start_idx + 1
+         start_idx = next_start_idx
+
+     return reconstructed_text_chunks
+
+
+ @njit
+ def compute_normalized_lcs(words1: List[str], words2: List[str]) -> float:
+     m, n = len(words1), len(words2)
+     if m == 0 or n == 0:
+         return 0.0
+     dp = np.zeros((m + 1, n + 1), dtype=np.int32)
+     for i in range(1, m + 1):
+         for j in range(1, n + 1):
+             if words1[i - 1] == words2[j - 1]:
+                 dp[i, j] = dp[i - 1, j - 1] + 1
+             else:
+                 dp[i, j] = max(dp[i - 1, j], dp[i, j - 1])
+     lcs_length = int(dp[m, n])
+     avg_length = (m + n) / 2
+     return lcs_length / avg_length if avg_length > 0 else 0.0
+
+
+ def compute_semantic_similarity(
+     text1_segment: str,
+     text2_segment: str,
+     tokens1: List[str],
+     tokens2: List[str],
+     model,
+     device,
+ ) -> float:
+     """Computes semantic similarity using a sentence transformer model, with chunking for long texts."""
+     if model is None or device is None:
+         logger.warning(
+             "Semantic similarity model or device not available. Skipping calculation."
+         )
+         return np.nan  # Return NaN if model isn't loaded
+
+     if not text1_segment or not text2_segment:
+         logger.info(
+             "One or both texts are empty for semantic similarity. Returning 0.0."
+         )
+         return 0.0  # Or np.nan, depending on desired behavior for empty inputs
+
+     def _get_aggregated_embedding(
+         raw_text_segment: str, botok_tokens: List[str], model_obj, device_str
+     ) -> torch.Tensor | None:
+         """Helper to get a single embedding for a text, chunking if necessary."""
+         if not botok_tokens and not raw_text_segment.strip():
+             # Effectively empty input
+             logger.info(
+                 f"Text segment is empty or only whitespace: {raw_text_segment[:100]}... Returning None for embedding."
+             )
+             return None
+
+         if len(botok_tokens) > MAX_TOKENS_PER_CHUNK:
+             logger.info(
+                 f"Text segment with ~{len(botok_tokens)} tokens exceeds {MAX_TOKENS_PER_CHUNK}, chunking {raw_text_segment[:30]}..."
+             )
+             # Pass the original raw text and its pre-computed botok tokens to _chunk_text
+             text_chunks = _chunk_text(
+                 raw_text_segment, botok_tokens, MAX_TOKENS_PER_CHUNK, CHUNK_OVERLAP
+             )
+             if not text_chunks:
+                 logger.warning(
+                     f"Chunking resulted in no chunks for segment: {raw_text_segment[:100]}..."
+                 )
+                 return None
+
+             logger.info(
+                 f"Generated {len(text_chunks)} chunks for segment: {raw_text_segment[:30]}..."
+             )
+             chunk_embeddings = generate_embeddings(text_chunks, model_obj, device_str)
+
+             if chunk_embeddings is None or chunk_embeddings.nelement() == 0:
+                 logger.error(
+                     f"Failed to generate embeddings for chunks of text: {raw_text_segment[:100]}..."
+                 )
+                 return None
+             # Mean pooling of chunk embeddings
+             aggregated_embedding = torch.mean(chunk_embeddings, dim=0, keepdim=True)
+             return aggregated_embedding
+         else:
+             # Text is short enough; embed the raw text directly
+             if not raw_text_segment.strip():
+                 logger.info(
+                     f"Text segment is empty or only whitespace: {raw_text_segment[:100]}... Returning None for embedding."
+                 )
+                 return None
+
+             embedding = generate_embeddings([raw_text_segment], model_obj, device_str)
+             if embedding is None or embedding.nelement() == 0:
+                 logger.error(
+                     f"Failed to generate embedding for text: {raw_text_segment[:100]}..."
+                 )
+                 return None
+             return embedding  # Already [1, embed_dim]
+
161
+ try:
162
+ # Pass raw text and its pre-computed botok tokens
163
+ embedding1 = _get_aggregated_embedding(text1_segment, tokens1, model, device)
164
+ embedding2 = _get_aggregated_embedding(text2_segment, tokens2, model, device)
165
+
166
+ if (
167
+ embedding1 is None
168
+ or embedding2 is None
169
+ or embedding1.nelement() == 0
170
+ or embedding2.nelement() == 0
171
+ ):
172
+ logger.error(
173
+ "Failed to obtain one or both aggregated embeddings for semantic similarity."
174
+ )
175
+ return np.nan
176
+
177
+ # Cosine similarity expects 2D arrays, embeddings are [1, embed_dim] and on CPU
178
+ similarity = cosine_similarity(embedding1.numpy(), embedding2.numpy())
179
+ return float(similarity[0][0])
180
+ except Exception as e:
181
+ logger.error(
182
+ f"Error computing semantic similarity with chunking:\nText1: '{text1_segment[:100]}...'\nText2: '{text2_segment[:100]}...'\nError: {e}",
183
+ exc_info=True,
184
+ )
185
+ return np.nan
186
+
187
+
188
+ def compute_all_metrics(
189
+ texts: Dict[str, str], model=None, device=None, enable_semantic: bool = True
190
+ ) -> pd.DataFrame:
191
+ """
192
+ Computes all selected similarity metrics between pairs of texts.
193
+
194
+ Args:
195
+ texts (Dict[str, str]): A dictionary where keys are text identifiers (e.g., filenames or segment IDs)
196
+ and values are the text content strings.
197
+ model (SentenceTransformer, optional): The pre-loaded sentence transformer model.
198
+ Defaults to None.
199
+ device (str, optional): The device the model is on ('cuda' or 'cpu').
200
+ Defaults to None.
201
+
202
+ Returns:
203
+ pd.DataFrame: A DataFrame where each row contains the metrics for a pair of texts,
204
+ including 'Text Pair', 'Jaccard Similarity (%)', 'Normalized LCS',
205
+ and 'Semantic Similarity (BuddhistNLP)'.
206
+ """
207
+ files = list(texts.keys())
208
+ results = []
209
+ # Prepare token lists (always use tokenize_texts for raw Unicode)
210
+ token_lists = {}
211
+ corpus_for_tfidf = [] # For storing space-joined tokens for TF-IDF
212
+
213
+ for fname, content in texts.items():
214
+ tokenized_content = tokenize_texts([content]) # Returns a list of lists
215
+ if tokenized_content and tokenized_content[0]:
216
+ token_lists[fname] = tokenized_content[0]
217
+ else:
218
+ token_lists[fname] = []
219
+ # Regardless of whether tokenized_content[0] exists, prepare entry for TF-IDF corpus
220
+ # If tokens exist, join them; otherwise, use an empty string for that document
221
+ corpus_for_tfidf.append(
222
+ " ".join(token_lists[fname])
223
+ if fname in token_lists and token_lists[fname]
224
+ else ""
225
+ )
226
+
227
+ # TF-IDF Vectorization and Cosine Similarity Calculation
228
+ if corpus_for_tfidf:
229
+ # Using a dummy tokenizer and preprocessor as input is already tokenized (as space-separated strings)
230
+ # and we don't want further case changes or token modifications for Tibetan.
231
+ vectorizer = TfidfVectorizer(
232
+ tokenizer=lambda x: x.split(), preprocessor=lambda x: x, token_pattern=None
233
+ )
234
+ tfidf_matrix = vectorizer.fit_transform(corpus_for_tfidf)
235
+ # Calculate pairwise cosine similarity on the TF-IDF matrix
236
+ # This gives a square matrix where cosine_sim_matrix[i, j] is the similarity between doc i and doc j
237
+ cosine_sim_matrix = cosine_similarity(tfidf_matrix)
238
+ else:
239
+ # Handle case with no texts or all empty texts
240
+ cosine_sim_matrix = np.array(
241
+ [[]]
242
+ ) # Or some other appropriate empty/default structure
243
+
244
+ for i, j in combinations(range(len(files)), 2):
245
+ f1, f2 = files[i], files[j]
246
+ words1, words2 = token_lists[f1], token_lists[f2]
247
+ jaccard = (
248
+ len(set(words1) & set(words2)) / len(set(words1) | set(words2))
249
+ if set(words1) | set(words2)
250
+ else 0.0
251
+ )
252
+ jaccard_percent = jaccard * 100.0
253
+ norm_lcs = compute_normalized_lcs(words1, words2)
254
+
255
+ # Semantic Similarity Calculation
256
+ if enable_semantic:
257
+ # Pass raw texts and their pre-computed botok tokens
258
+ semantic_sim = compute_semantic_similarity(
259
+ texts[f1], texts[f2], words1, words2, model, device
260
+ )
261
+ else:
262
+ semantic_sim = np.nan
263
+ results.append(
264
+ {
265
+ "Text Pair": f"{f1} vs {f2}",
266
+ "Jaccard Similarity (%)": jaccard_percent,
267
+ "Normalized LCS": norm_lcs,
268
+ # Pass tokens1 and tokens2 to compute_semantic_similarity
269
+ "Semantic Similarity (BuddhistNLP)": semantic_sim,
270
+ "TF-IDF Cosine Sim": (
271
+ cosine_sim_matrix[i, j]
272
+ if cosine_sim_matrix.size > 0
273
+ and i < cosine_sim_matrix.shape[0]
274
+ and j < cosine_sim_matrix.shape[1]
275
+ else np.nan
276
+ ),
277
+ }
278
+ )
279
+ return pd.DataFrame(results)
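For reference, the Jaccard figure computed in the loop above can be reproduced in isolation. This is a minimal stdlib-only sketch of the same set-based formula; the token lists are made-up illustrations, not project data:

```python
def jaccard_percent(tokens_a, tokens_b):
    """Jaccard similarity of two token lists, as a percentage of shared vocabulary."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:  # Both lists empty: define similarity as 0.0, as compute_all_metrics does
        return 0.0
    return 100.0 * len(set_a & set_b) / len(union)

# Two toy "chapters" sharing two of four distinct vocabulary items
a = ["ka", "kha", "ga"]
b = ["ka", "kha", "nga"]
print(jaccard_percent(a, b))  # → 50.0
```

Note that this measure ignores token frequency and order; that is why the pipeline pairs it with normalized LCS and TF-IDF cosine similarity.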
pipeline/process.py ADDED
@@ -0,0 +1,128 @@
+ import pandas as pd
+ from typing import Dict, List, Tuple
+ from itertools import combinations
+ from .metrics import compute_all_metrics
+ from .semantic_embedding import get_sentence_transformer_model_and_device
+ from .tokenize import tokenize_texts
+ import logging
+
+ logger = logging.getLogger(__name__)
+
+
+ def process_texts(
+     text_data: Dict[str, str], filenames: List[str], enable_semantic: bool = True
+ ) -> Tuple[pd.DataFrame, pd.DataFrame, str]:
+     """
+     Processes uploaded texts, segments them by chapter marker, and computes metrics between chapters of different files.
+
+     Args:
+         text_data (Dict[str, str]): A dictionary mapping filenames to their content.
+         filenames (List[str]): A list of filenames that were uploaded.
+         enable_semantic (bool): Whether to compute semantic similarity (loads the sentence transformer model).
+
+     Returns:
+         Tuple[pd.DataFrame, pd.DataFrame, str]:
+             - metrics_df: DataFrame with similarity metrics between corresponding chapters of file pairs.
+             - word_counts_df: DataFrame with word counts for each segment (chapter) in each file.
+             - warning: A string containing any warnings generated during processing (e.g., missing chapter markers).
+     """
+     st_model, st_device = None, None
+     if enable_semantic:
+         logger.info(
+             "Semantic similarity enabled. Loading sentence transformer model..."
+         )
+         try:
+             st_model, st_device = get_sentence_transformer_model_and_device()
+             logger.info(
+                 f"Sentence transformer model loaded successfully on {st_device}."
+             )
+         except Exception as e:
+             logger.error(
+                 f"Failed to load sentence transformer model: {e}. Semantic similarity will not be available."
+             )
+     else:
+         logger.info("Semantic similarity disabled. Skipping model loading.")
+
+     # Detect chapter marker
+     chapter_marker = "༈"
+     fallback = False
+     segment_texts = {}
+     for fname in filenames:
+         content = text_data[fname]
+         if chapter_marker in content:
+             segments = [
+                 seg.strip() for seg in content.split(chapter_marker) if seg.strip()
+             ]
+             for idx, seg in enumerate(segments):
+                 seg_id = f"{fname}|chapter {idx+1}"
+                 segment_texts[seg_id] = seg
+         else:
+             seg_id = f"{fname}|chapter 1"
+             segment_texts[seg_id] = content.strip()
+             fallback = True
+     warning = ""
+     if fallback:
+         warning = (
+             "No chapter marker found in one or more files. "
+             "Each file will be treated as a single segment. "
+             "For best results, add a unique marker (e.g., ༈) to separate chapters or sections."
+         )
+     # Group chapters by filename (preserving order)
+     file_to_chapters = {}
+     for seg_id in segment_texts:
+         fname = seg_id.split("|")[0]
+         file_to_chapters.setdefault(fname, []).append(seg_id)
+     # For each pair of files, compare corresponding chapters (by index)
+     results = []
+     files = list(file_to_chapters.keys())
+     for file1, file2 in combinations(files, 2):
+         chaps1 = file_to_chapters[file1]
+         chaps2 = file_to_chapters[file2]
+         min_chaps = min(len(chaps1), len(chaps2))
+         for idx in range(min_chaps):
+             seg1 = chaps1[idx]
+             seg2 = chaps2[idx]
+             # Compute metrics for this chapter pair by running compute_all_metrics
+             # on just these two segments
+             pair_metrics = compute_all_metrics(
+                 {seg1: segment_texts[seg1], seg2: segment_texts[seg2]},
+                 model=st_model,
+                 device=st_device,
+                 enable_semantic=enable_semantic,
+             )
+             # Set the Text Pair and Chapter columns
+             pair_metrics.loc[:, "Text Pair"] = f"{file1} vs {file2}"
+             pair_metrics.loc[:, "Chapter"] = idx + 1
+             results.append(pair_metrics)
+     if results:
+         metrics_df = pd.concat(results, ignore_index=True)
+     else:
+         metrics_df = pd.DataFrame()
+
+     # Calculate word counts
+     word_counts_data = []
+     for seg_id, text_content in segment_texts.items():
+         fname, chapter_info = seg_id.split("|", 1)
+         chapter_num = int(chapter_info.replace("chapter ", ""))
+         # Use botok for an accurate word count of raw Tibetan text
+         tokenized_segments = tokenize_texts([text_content])  # Returns a list of lists
+         if tokenized_segments and tokenized_segments[0]:
+             word_count = len(tokenized_segments[0])
+         else:
+             word_count = 0
+         word_counts_data.append(
+             {
+                 "Filename": fname.replace(".txt", ""),
+                 "ChapterNumber": chapter_num,
+                 "SegmentID": seg_id,
+                 "WordCount": word_count,
+             }
+         )
+     word_counts_df = pd.DataFrame(word_counts_data)
+     if not word_counts_df.empty:
+         word_counts_df = word_counts_df.sort_values(
+             by=["Filename", "ChapterNumber"]
+         ).reset_index(drop=True)
+
+     return metrics_df, word_counts_df, warning
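The marker-based segmentation step above can be exercised on its own. A small stdlib-only sketch of the same splitting logic (the sample strings are illustrative, not project data):

```python
def split_chapters(content: str, marker: str = "༈"):
    """Split a text on the chapter marker, dropping empty segments.

    Mirrors the segmentation logic in process_texts: if the marker is absent,
    the whole (stripped) file is treated as a single segment.
    """
    if marker in content:
        return [seg.strip() for seg in content.split(marker) if seg.strip()]
    return [content.strip()]

print(split_chapters("ལེའུ་དང་པོ། ༈ ལེའུ་གཉིས་པ། ༈ "))  # two segments, trailing emptiness dropped
print(split_chapters("no marker here"))  # → ['no marker here']
```

Because trailing or doubled markers produce empty fragments, the `if seg.strip()` filter is what keeps the chapter indices aligned across files.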
pipeline/semantic_embedding.py ADDED
@@ -0,0 +1,151 @@
+ import logging
+ import torch
+ from sentence_transformers import SentenceTransformer
+
+ # Configure logging
+ logging.basicConfig(
+     level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
+ )
+ logger = logging.getLogger(__name__)
+
+ # Default model ID: a sentence-similarity model fine-tuned for Buddhist texts
+ DEFAULT_MODEL_NAME = "buddhist-nlp/buddhist-sentence-similarity"
+
+
+ def get_sentence_transformer_model_and_device(
+     model_id: str = DEFAULT_MODEL_NAME, device_preference: str = "auto"
+ ):
+     """
+     Loads the Sentence Transformer model and determines the device.
+     Priority: CUDA -> MPS (Apple Silicon) -> CPU.
+
+     Args:
+         model_id (str): The Hugging Face model ID.
+         device_preference (str): Preferred device ("cuda", "mps", "cpu", "auto").
+
+     Returns:
+         tuple: (model, device_str)
+             - model: The loaded SentenceTransformer model.
+             - device_str: The device the model is loaded on ("cuda", "mps", or "cpu").
+     """
+     selected_device_str = ""
+
+     if device_preference == "auto":
+         if torch.cuda.is_available():
+             selected_device_str = "cuda"
+         elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
+             selected_device_str = "mps"
+         else:
+             selected_device_str = "cpu"
+     elif device_preference == "cuda" and torch.cuda.is_available():
+         selected_device_str = "cuda"
+     elif (
+         device_preference == "mps"
+         and hasattr(torch.backends, "mps")
+         and torch.backends.mps.is_available()
+     ):
+         selected_device_str = "mps"
+     else:  # Handles an explicit "cpu" preference, or fallback if the preferred device is unavailable
+         selected_device_str = "cpu"
+
+     logger.info(f"Attempting to use device: {selected_device_str}")
+
+     try:
+         logger.info(
+             f"Loading Sentence Transformer model: {model_id} on device: {selected_device_str}"
+         )
+         # SentenceTransformer expects a string like 'cuda', 'mps', or 'cpu'
+         model = SentenceTransformer(model_id, device=selected_device_str)
+         logger.info(f"Model {model_id} loaded successfully on {selected_device_str}.")
+         return model, selected_device_str
+     except Exception as e:
+         logger.error(
+             f"Error loading model {model_id} on device {selected_device_str}: {e}"
+         )
+         # Fall back to CPU if the initially selected device (CUDA or MPS) failed
+         if selected_device_str != "cpu":
+             logger.warning(
+                 f"Failed to load model on {selected_device_str}, attempting to load on CPU..."
+             )
+             fallback_device_str = "cpu"
+             try:
+                 model = SentenceTransformer(model_id, device=fallback_device_str)
+                 logger.info(
+                     f"Model {model_id} loaded successfully on CPU after fallback."
+                 )
+                 return model, fallback_device_str
+             except Exception as fallback_e:
+                 logger.error(
+                     f"Error loading model {model_id} on CPU during fallback: {fallback_e}"
+                 )
+                 raise fallback_e  # Re-raise if the CPU fallback also fails
+         raise e  # Re-raise the original exception if the selected device was already CPU
+
+
+ def generate_embeddings(texts: list[str], model, device: str):
+     """
+     Generates embeddings for a list of texts using the provided Sentence Transformer model.
+
+     Args:
+         texts (list[str]): A list of texts to embed.
+         model: The loaded SentenceTransformer model.
+         device (str): The device the model is on (primarily for logging; model.encode handles the device).
+
+     Returns:
+         torch.Tensor: A tensor containing the embeddings, moved to CPU.
+     """
+     if not texts:
+         logger.warning(
+             "No texts provided to generate_embeddings. Returning empty tensor."
+         )
+         return torch.empty(0)
+
+     logger.info(f"Generating embeddings for {len(texts)} texts...")
+
+     # The encode method of SentenceTransformer handles tokenization and pooling internally.
+     # It also manages moving data to the model's device.
+     embeddings = model.encode(texts, convert_to_tensor=True, show_progress_bar=True)
+
+     logger.info(f"Embeddings generated with shape: {embeddings.shape}")
+     # Ensure embeddings are on CPU for consistent further processing
+     return embeddings.cpu()
+
+
+ if __name__ == "__main__":
+     # Example usage:
+     logger.info("Starting example usage of semantic_embedding module...")
+
+     test_texts = [
+         "བཀྲ་ཤིས་བདེ་ལེགས།",
+         "hello world",  # Test with non-Tibetan to see behavior
+         "དེ་རིང་གནམ་གཤིས་ཡག་པོ་འདུག",
+     ]
+
+     logger.info("Attempting to load model using default cache directory.")
+     try:
+         # Force CPU for this example to avoid potential CUDA issues in diverse environments
+         st_model, st_device = get_sentence_transformer_model_and_device(
+             device_preference="cpu"  # Explicitly use CPU for this test run
+         )
+
+         if st_model:
+             logger.info(f"Test model loaded on device: {st_device}")
+             example_embeddings = generate_embeddings(test_texts, st_model, st_device)
+             logger.info(
+                 f"Generated example embeddings shape: {example_embeddings.shape}"
+             )
+             if example_embeddings.nelement() > 0:  # Check if the tensor is not empty
+                 logger.info(
+                     f"First embedding (first 10 dims): {example_embeddings[0][:10]}..."
+                 )
+             else:
+                 logger.info("Generated example embeddings tensor is empty.")
+         else:
+             logger.error("Failed to load model for example usage.")
+
+     except Exception as e:
+         logger.error(f"An error occurred during the example usage: {e}")
+
+     logger.info("Finished example usage.")
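Upstream in metrics.py, the chunk embeddings returned by `generate_embeddings` are aggregated with `torch.mean(..., dim=0)`. As a plain-Python stand-in (no torch dependency; the nested lists here are toy vectors, not real embeddings), mean pooling is just a per-dimension average:

```python
def mean_pool(chunk_embeddings):
    """Average a list of equal-length embedding vectors into one segment embedding.

    A list-based stand-in for torch.mean(chunk_embeddings, dim=0): each output
    dimension is the mean of that dimension across all chunk vectors.
    """
    n = len(chunk_embeddings)
    dim = len(chunk_embeddings[0])
    return [sum(vec[i] for vec in chunk_embeddings) / n for i in range(dim)]

# Two 2-dimensional "chunk embeddings" pooled into one vector
print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))  # → [2.0, 3.0]
```

Mean pooling keeps the aggregated vector in the same embedding space as the chunks, so cosine similarity against another segment's pooled vector remains meaningful.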
pipeline/tokenize.py ADDED
@@ -0,0 +1,38 @@
+ from typing import List
+
+ try:
+     from botok import WordTokenizer
+
+     # Initialize the tokenizer once at the module level
+     BOTOK_TOKENIZER = WordTokenizer()
+ except ImportError:
+     # Handle the case where botok might not be installed,
+     # though it's a core dependency for this app.
+     BOTOK_TOKENIZER = None
+     print("ERROR: botok library not found. Tokenization will fail.")
+     # Optionally, raise an error here if botok is absolutely critical for the app to even start
+     # raise ImportError("botok is required for tokenization. Please install it.")
+
+
+ def tokenize_texts(texts: List[str]) -> List[List[str]]:
+     """
+     Tokenizes a list of raw Tibetan texts using botok.
+
+     Args:
+         texts: List of raw text strings.
+
+     Returns:
+         List of tokenized texts (each as a list of tokens).
+     """
+     if BOTOK_TOKENIZER is None:
+         # This case should ideally be handled more gracefully,
+         # perhaps by preventing analysis if the tokenizer failed to load.
+         raise RuntimeError(
+             "Botok tokenizer failed to initialize. Cannot tokenize texts."
+         )
+
+     tokenized_texts_list = []
+     for text_content in texts:
+         tokens = [
+             w.text for w in BOTOK_TOKENIZER.tokenize(text_content) if w.text.strip()
+         ]
+         tokenized_texts_list.append(tokens)
+     return tokenized_texts_list
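The filtering inside `tokenize_texts` only relies on each botok token exposing a `.text` attribute. A sketch with a hypothetical stand-in token class (botok itself is not used here) shows the whitespace-dropping behavior in isolation:

```python
class FakeToken:
    """Illustrative stand-in for a botok token: just carries a .text attribute."""

    def __init__(self, text):
        self.text = text


def filter_tokens(tokens):
    """Keep the text of tokens that are not pure whitespace, as tokenize_texts does."""
    return [t.text for t in tokens if t.text.strip()]


tokens = [FakeToken("བཀྲ་"), FakeToken(" "), FakeToken("ཤིས་")]
print(filter_tokens(tokens))  # → ['བཀྲ་', 'ཤིས་']
```

Dropping whitespace-only tokens matters downstream: it keeps the word counts and the Jaccard/LCS vocabularies free of spurious "words".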
pipeline/upload.py ADDED
@@ -0,0 +1,23 @@
+ import os
+ from typing import List, Tuple
+
+
+ def handle_file_upload(files, input_type: str) -> Tuple[dict, List[str]]:
+     """
+     Reads uploaded files and returns their contents and filenames.
+
+     Args:
+         files: List of uploaded file objects from Gradio.
+         input_type: Type of input (raw, tokenized, pos-tagged).
+
+     Returns:
+         text_data: dict mapping filename to file content.
+         filenames: list of filenames.
+     """
+     text_data = {}
+     filenames = []
+     for file in files:
+         filename = os.path.basename(file.name)
+         with open(file.name, "r", encoding="utf-8") as f:
+             content = f.read()
+         text_data[filename] = content
+         filenames.append(filename)
+     return text_data, filenames
pipeline/visualize.py ADDED
@@ -0,0 +1,154 @@
+ import plotly.graph_objects as go
+ import pandas as pd
+ import plotly.express as px  # For color palettes
+ import numpy as np  # For nanmin/nanmax over the pivoted values
+
+
+ def generate_visualizations(metrics_df: pd.DataFrame, descriptive_titles: dict = None):
+     """
+     Generate heatmap visualizations for all metrics.
+
+     Args:
+         metrics_df: DataFrame with similarity metrics (segment-level).
+         descriptive_titles: Optional dict mapping metric column names to plot titles.
+
+     Returns:
+         heatmaps: dict of {metric_name: plotly Figure} for each metric.
+     """
+
+     # Identify all numeric metric columns (exclude 'Text Pair' and 'Chapter')
+     metric_cols = [
+         col
+         for col in metrics_df.columns
+         if col not in ["Text Pair", "Chapter"] and metrics_df[col].dtype != object
+     ]
+     for col in metrics_df.columns:
+         if "Pattern Similarity" in col and col not in metric_cols:
+             metric_cols.append(col)
+
+     # --- Heatmaps for each metric ---
+     heatmaps = {}
+     # Using the 'Reds' colormap for a red/white gradient.
+     # Chapter 1 will be at the top of the Y-axis due to sort_index(ascending=False).
+     for metric in metric_cols:
+         # Skip this metric if all of its values are NaN
+         if metrics_df[metric].isnull().all():
+             heatmaps[metric] = None
+             continue  # Move to the next metric
+
+         pivot = metrics_df.pivot(index="Chapter", columns="Text Pair", values=metric)
+         pivot = pivot.sort_index(ascending=False)  # Invert Y-axis: Chapter 1 at the top
+         # Additional check: skip if the pivot is empty or all NaNs after pivoting
+         # (e.g., due to single-chapter comparisons)
+         if pivot.empty or pivot.isnull().all().all():
+             heatmaps[metric] = None
+             continue
+
+         cleaned_columns = [col.replace(".txt", "") for col in pivot.columns]
+         cmap = "Reds"  # Apply the 'Reds' colormap to all heatmaps
+         text = [
+             [f"{val:.2f}" if pd.notnull(val) else "" for val in row]
+             for row in pivot.values
+         ]
+         fig = go.Figure(
+             data=go.Heatmap(
+                 z=pivot.values,
+                 x=cleaned_columns,
+                 y=pivot.index,
+                 colorscale=cmap,
+                 zmin=float(np.nanmin(pivot.values)),
+                 zmax=float(np.nanmax(pivot.values)),
+                 text=text,
+                 texttemplate="%{text}",
+                 hovertemplate="Chapter %{y}<br>Text Pair: %{x}<br>Value: %{z:.2f}<extra></extra>",
+                 colorbar=dict(title=metric, thickness=20, tickfont=dict(size=14)),
+             )
+         )
+         plot_title = (
+             descriptive_titles.get(metric, metric) if descriptive_titles else metric
+         )
+         fig.update_layout(
+             title=plot_title,
+             xaxis_title="Text Pair",
+             yaxis_title="Chapter",
+             autosize=False,
+             width=1350,
+             height=1200,
+             font=dict(size=16),
+             margin=dict(l=140, b=80, t=60),
+         )
+         fig.update_xaxes(tickangle=30, tickfont=dict(size=16))
+         fig.update_yaxes(tickfont=dict(size=16), autorange="reversed")
+         # Ensure all integer chapter numbers are shown if the axis is numeric and reversed
+         if pd.api.types.is_numeric_dtype(pivot.index):
+             fig.update_yaxes(
+                 tickmode="array",
+                 tickvals=pivot.index,
+                 ticktext=[str(i) for i in pivot.index],
+             )
+         heatmaps[metric] = fig
+
+     # Use all features, including pattern similarities if present
+     if not metrics_df.empty:
+         # Remove '.txt' from Text Pair labels
+         metrics_df = metrics_df.copy()
+         metrics_df["Text Pair"] = metrics_df["Text Pair"].str.replace(
+             ".txt", "", regex=False
+         )
+     return heatmaps
+
+
+ def generate_word_count_chart(word_counts_df: pd.DataFrame):
+     """
+     Generates a bar chart of word counts per segment (file/chapter).
+
+     Args:
+         word_counts_df: DataFrame with 'Filename', 'ChapterNumber', 'SegmentID', 'WordCount'.
+
+     Returns:
+         plotly Figure for the bar chart, or None if the input is empty.
+     """
+     if word_counts_df.empty:
+         return None
+
+     fig = go.Figure()
+
+     # Assign colors based on Filename
+     unique_files = sorted(word_counts_df["Filename"].unique())
+     colors = px.colors.qualitative.Plotly  # Get a default Plotly color sequence
+
+     for i, filename in enumerate(unique_files):
+         file_df = word_counts_df[word_counts_df["Filename"] == filename].sort_values(
+             "ChapterNumber"
+         )
+         fig.add_trace(
+             go.Bar(
+                 x=file_df["ChapterNumber"],
+                 y=file_df["WordCount"],
+                 name=filename,
+                 marker_color=colors[i % len(colors)],
+                 text=file_df["WordCount"],
+                 textposition="auto",
+                 customdata=file_df[["Filename"]],  # Pass Filename for the hovertemplate
+                 hovertemplate="<b>File</b>: %{customdata[0]}<br>"
+                 + "<b>Chapter</b>: %{x}<br>"
+                 + "<b>Word Count</b>: %{y}<extra></extra>",
+             )
+         )
+
+     fig.update_layout(
+         title_text="Word Counts per Chapter (Grouped by File)",
+         xaxis_title="Chapter Number",
+         yaxis_title="Word Count",
+         barmode="group",
+         font=dict(size=14),
+         legend_title_text="Filename",
+         xaxis=dict(
+             type="category"
+         ),  # Treat chapter numbers as categories for distinct grouping
+         autosize=True,
+         margin=dict(l=80, r=50, b=100, t=50, pad=4),
+     )
+     # Ensure x-axis ticks are shown for all chapter numbers present
+     all_chapter_numbers = sorted(word_counts_df["ChapterNumber"].unique())
+     fig.update_xaxes(
+         tickmode="array",
+         tickvals=all_chapter_numbers,
+         ticktext=[str(ch) for ch in all_chapter_numbers],
+     )
+
+     return fig
requirements.txt ADDED
@@ -0,0 +1,12 @@
+ gradio>=4.0.0
+ pandas
+ plotly
+ matplotlib
+ seaborn
+ scikit-learn
+ # botok is required for Tibetan tokenization; ensure it is available on Hugging Face Spaces
+ botok
+ torch
+ transformers
+ sentence-transformers
+ numba
results.csv ADDED
@@ -0,0 +1,49 @@
+ Text Pair,Jaccard Similarity (%),Normalized LCS,Semantic Similarity (BuddhistNLP),TF-IDF Cosine Sim,Chapter
+ Nepal12.txt vs LTWA.txt,0.24183796856106407,0.0020325203252032522,,0.49779922611270133,1
+ Nepal12.txt vs LTWA.txt,0.49019607843137253,0.006349206349206349,,0.3092411626312882,2
+ Nepal12.txt vs LTWA.txt,0.423728813559322,0.0051813471502590676,,0.2835703043060698,3
+ Nepal12.txt vs LTWA.txt,47.05882352941176,0.6213151927437641,,0.7895311629826965,4
+ Nepal12.txt vs LTWA.txt,29.81366459627329,0.4437299035369775,,0.7168755829405374,5
+ Nepal12.txt vs LTWA.txt,36.31578947368421,0.5212464589235127,,0.8074186048740946,6
+ Nepal12.txt vs LTWA.txt,26.769911504424783,0.2892561983471074,,0.8417955275319969,7
+ Nepal12.txt vs LTWA.txt,43.08943089430895,0.5246753246753246,,0.9303577865963523,8
+ Nepal12.txt vs LTWA.txt,29.625292740046838,0.2795071335927367,,0.9109232758325454,9
+ Nepal12.txt vs LTWA.txt,52.798053527980535,0.6291525423728813,,0.9087016630709266,10
+ Nepal12.txt vs LTWA.txt,53.075170842824605,0.6566920565832427,,0.9670241021079997,11
+ Nepal12.txt vs LTWA.txt,36.029411764705884,0.43727598566308246,,0.7062555004041853,12
+ Nepal12.txt vs LTWA.txt,46.80232558139535,0.6346153846153846,,0.9098381701939877,13
+ Nepal12.txt vs LTWA.txt,34.954407294832826,0.3716433941997852,,0.8681558692770571,14
+ Nepal12.txt vs LTWA.txt,13.288288288288289,0.13510520487264674,,0.672356831077214,15
+ Nepal12.txt vs LTWA.txt,0.205761316872428,0.00186219739292365,,0.379562190805602,16
+ Nepal12.txt vs Leiden.txt,0.23364485981308408,0.0020151133501259445,,0.48802450618062665,1
+ Nepal12.txt vs Leiden.txt,0.5,0.006493506493506494,,0.2674810724560652,2
+ Nepal12.txt vs Leiden.txt,0.43478260869565216,0.005633802816901409,,0.2833623975540526,3
+ Nepal12.txt vs Leiden.txt,47.159090909090914,0.5876543209876544,,0.7517513456553279,4
+ Nepal12.txt vs Leiden.txt,30.075187969924812,0.38866396761133604,,0.4408740561408262,5
+ Nepal12.txt vs Leiden.txt,40.853658536585364,0.549520766773163,,0.8128602295895004,6
+ Nepal12.txt vs Leiden.txt,31.524008350730686,0.2908224076281287,,0.8738496945193566,7
+ Nepal12.txt vs Leiden.txt,39.01273885350319,0.43176831943835015,,0.926450152931149,8
+ Nepal12.txt vs Leiden.txt,27.754415475189237,0.19742670322716727,,0.9388989588908633,9
+ Nepal12.txt vs Leiden.txt,54.936708860759495,0.6423049894588897,,0.921846371542675,10
+ Nepal12.txt vs Leiden.txt,47.205707491082045,0.6151611966308452,,0.9428816469273638,11
+ Nepal12.txt vs Leiden.txt,19.011406844106464,0.24765478424015008,,0.7015349768313621,12
+ Nepal12.txt vs Leiden.txt,53.125,0.6434782608695652,,0.8959335274660175,13
+ Nepal12.txt vs Leiden.txt,31.699346405228756,0.3399765533411489,,0.7322457041553282,14
+ Nepal12.txt vs Leiden.txt,21.730382293762577,0.23124448367166814,,0.7987166674337054,15
+ Nepal12.txt vs Leiden.txt,0.17667844522968199,0.001563721657544957,,0.35541620823150016,16
+ LTWA.txt vs Leiden.txt,44.46351931330472,0.5621676373765511,,0.973014209495479,1
+ LTWA.txt vs Leiden.txt,52.27272727272727,0.6720516962843296,,0.8952409611501382,2
+ LTWA.txt vs Leiden.txt,51.14006514657981,0.6132971506105834,,0.8817937094432065,3
+ LTWA.txt vs Leiden.txt,43.29896907216495,0.5622119815668203,,0.7896479610422875,4
+ LTWA.txt vs Leiden.txt,30.48780487804878,0.39869281045751637,,0.5548358409765478,5
+ LTWA.txt vs Leiden.txt,35.80246913580247,0.4931506849315068,,0.7722629194984254,6
+ LTWA.txt vs Leiden.txt,36.211699164345404,0.44142614601018676,,0.8733787580475953,7
+ LTWA.txt vs Leiden.txt,42.33038348082596,0.4753257007500987,,0.9428524600432496,8
+ LTWA.txt vs Leiden.txt,25.196850393700785,0.16772554002541296,,0.9208737107077982,9
+ LTWA.txt vs Leiden.txt,53.73493975903615,0.6971279373368147,,0.9411333779617488,10
+ LTWA.txt vs Leiden.txt,54.285714285714285,0.6786239341370185,,0.9677013964807858,11
+ LTWA.txt vs Leiden.txt,18.43137254901961,0.1700404858299595,,0.6761607878976519,12
+ LTWA.txt vs Leiden.txt,48.37758112094395,0.6323851203501094,,0.8863173251539074,13
+ LTWA.txt vs Leiden.txt,42.79661016949153,0.570957095709571,,0.7795387299539099,14
+ LTWA.txt vs Leiden.txt,28.455284552845526,0.36042402826855124,,0.7233387457696162,15
+ LTWA.txt vs Leiden.txt,49.572649572649574,0.6079182630906769,,0.9527216526629653,16
theme.py ADDED
@@ -0,0 +1,186 @@
+ import gradio as gr
+ from gradio.themes.utils import colors, sizes, fonts
+
+
+ class TibetanAppTheme(gr.themes.Soft):
+     def __init__(self):
+         super().__init__(
+             primary_hue=colors.blue,  # Primary interactive elements (e.g., #2563eb)
+             secondary_hue=colors.orange,  # For accents if needed, or default buttons
+             neutral_hue=colors.slate,  # For backgrounds, borders, and text
+             font=[
+                 fonts.GoogleFont("Inter"),
+                 "ui-sans-serif",
+                 "system-ui",
+                 "sans-serif",
+             ],
+             radius_size=sizes.radius_md,  # General radius, can be overridden (16px was for cards)
+             text_size=sizes.text_md,  # Base font size (16px)
+         )
+         self.theme_vars_for_set = {
+             # Global & body styles
+             "body_background_fill": "#f0f2f5",
+             "body_text_color": "#333333",
+             # Card styles (.gr-group)
+             "block_background_fill": "#ffffff",
+             "block_radius": "16px",  # May need to be removed if not a valid settable CSS var
+             "block_shadow": "0 4px 12px rgba(0, 0, 0, 0.08)",
+             "block_padding": "24px",
+             "block_border_width": "0px",
+             # Markdown styles
+             "body_text_color_subdued": "#4b5563",
+             # Button styles
+             "button_secondary_background_fill": "#ffffff",
+             "button_secondary_text_color": "#374151",
+             "button_secondary_border_color": "#d1d5db",
+             "button_secondary_border_color_hover": "#adb5bd",
+             "button_secondary_background_fill_hover": "#f9fafb",
+             # Primary button
+             "button_primary_background_fill": "#2563eb",
+             "button_primary_text_color": "#ffffff",
+             "button_primary_border_color": "transparent",
+             "button_primary_background_fill_hover": "#1d4ed8",
+             # HR style
+             "border_color_accent_subdued": "#e5e7eb",
+         }
+         super().set(**self.theme_vars_for_set)
+
+         # Store CSS overrides; these will be converted to a string and applied via gr.Blocks(css=...)
+         self.css_overrides = {
+             ".gradio-container, .gr-block, .gr-markdown, label, input, .gr-slider, .gr-radio, .gr-button": {
+                 "font-family": ", ".join(self.font),
+                 "font-size": "16px !important",
+                 "line-height": "1.6 !important",
+                 "color": "#333333 !important",
+             },
+             ".gr-group": {"margin-bottom": "24px !important"},  # min-height removed
+             ".gr-markdown": {
+                 "background": "transparent !important",
+                 "font-size": "1em !important",
+                 "margin-bottom": "16px !important",
+             },
+             ".gr-markdown h1": {
+                 "font-size": "28px !important",
+                 "font-weight": "600 !important",
+                 "margin-bottom": "8px !important",
+                 "color": "#111827 !important",
+             },
+             ".gr-markdown h2": {
+                 "font-size": "26px !important",
+                 "font-weight": "600 !important",
+                 "color": "var(--primary-600, #2563eb) !important",
+                 "margin-top": "32px !important",
+                 "margin-bottom": "16px !important",
+             },
+             ".gr-markdown h3": {
+                 "font-size": "22px !important",
+                 "font-weight": "600 !important",
+                 "color": "#1f2937 !important",
+                 "margin-top": "24px !important",
+                 "margin-bottom": "12px !important",
+             },
+             ".gr-markdown p, .gr-markdown span": {
+                 "font-size": "16px !important",
+                 "color": "#4b5563 !important",
+             },
+             ".gr-button button": {
+                 "border-radius": "8px !important",
+                 "padding": "10px 20px !important",
+                 "font-weight": "500 !important",
+                 "box-shadow": "0 1px 2px 0 rgba(0, 0, 0, 0.05) !important",
+                 "border": "1px solid #d1d5db !important",
+                 "background-color": "#ffffff !important",
+                 "color": "#374151 !important",
+             },
+             "#run-btn": {
+                 "background": "var(--button-primary-background-fill) !important",
+                 "color": "var(--button-primary-text-color) !important",
+                 "font-weight": "bold !important",
+                 "font-size": "24px !important",
+                 "border": "none !important",
+                 "box-shadow": "var(--button-primary-shadow) !important",
+             },
+             "#run-btn:hover": {  # Changed selector
+                 "background": "var(--button-primary-background-fill-hover) !important",
+                 "box-shadow": "0px 4px 12px rgba(0, 0, 0, 0.15) !important",
+                 "transform": "translateY(-1px) !important",
+             },
+             ".gr-button button:hover": {
+                 "background-color": "#f9fafb !important",
+                 "border-color": "#adb5bd !important",
+             },
+             "hr": {
+                 "margin": "32px 0 !important",
+                 "border": "none !important",
+                 "border-top": "1px solid var(--border-color-accent-subdued) !important",
+             },
+             ".gr-slider, .gr-radio, .gr-file": {"margin-bottom": "20px !important"},
+             ".gr-radio .gr-form button": {
+                 "background-color": "#f3f4f6 !important",
120
+ "color": "#374151 !important",
121
+ "border": "1px solid #d1d5db !important",
122
+ "border-radius": "6px !important",
123
+ "padding": "8px 16px !important",
124
+ "font-weight": "500 !important",
125
+ },
126
+ ".gr-radio .gr-form button:hover": {
127
+ "background-color": "#e5e7eb !important",
128
+ "border-color": "#9ca3af !important",
129
+ },
130
+ ".gr-radio .gr-form button.selected": {
131
+ "background-color": "var(--primary-500, #3b82f6) !important",
132
+ "color": "#ffffff !important",
133
+ "border-color": "var(--primary-500, #3b82f6) !important",
134
+ },
135
+ ".gr-radio .gr-form button.selected:hover": {
136
+ "background-color": "var(--primary-600, #2563eb) !important",
137
+ "border-color": "var(--primary-600, #2563eb) !important",
138
+ },
139
+ "#semantic-radio-group span": { # General selector, refined size
140
+ "font-size": "17px !important",
141
+ "font-weight": "500 !important",
142
+ },
143
+ "#semantic-radio-group div": { # General selector, refined size
144
+ "font-size": "14px !important"
145
+ },
146
+ # Row and Column flex styles for equal height
147
+ "#steps-row": {
148
+ "display": "flex !important",
149
+ "align-items": "stretch !important",
150
+ },
151
+ ".step-column": {
152
+ "display": "flex !important",
153
+ "flex-direction": "column !important",
154
+ },
155
+ ".step-column > .gr-group": {
156
+ "flex-grow": "1 !important",
157
+ "display": "flex !important",
158
+ "flex-direction": "column !important",
159
+ },
160
+ ".tabs > .tab-nav": {"border-bottom": "1px solid #d1d5db !important"},
161
+ ".tabs > .tab-nav > button.selected": {
162
+ "border-bottom": "2px solid var(--primary-500) !important",
163
+ "color": "var(--primary-500) !important",
164
+ "background-color": "transparent !important",
165
+ },
166
+ ".tabs > .tab-nav > button": {
167
+ "color": "#6b7280 !important",
168
+ "background-color": "transparent !important",
169
+ "padding": "10px 15px !important",
170
+ "border-bottom": "2px solid transparent !important",
171
+ },
172
+ }
173
+
174
+ def get_css_string(self) -> str:
175
+ """Converts the self.css_overrides dictionary into a CSS string."""
176
+ css_parts = []
177
+ for selector, properties in self.css_overrides.items():
178
+ props_str = "\n".join(
179
+ [f" {prop}: {value};" for prop, value in properties.items()]
180
+ )
181
+ css_parts.append(f"{selector} {{\n{props_str}\n}}")
182
+ return "\n\n".join(css_parts)
183
+
184
+
185
+ # Instantiate the theme for easy import
186
+ tibetan_theme = TibetanAppTheme()
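

# The block below is a minimal, standalone sketch of the dict-to-CSS
# conversion that get_css_string() performs, using a hypothetical two-rule
# override dict (illustration only; the app itself would pass `tibetan_theme`
# and `tibetan_theme.get_css_string()` to gr.Blocks(theme=..., css=...)).
if __name__ == "__main__":
    demo_overrides = {
        ".gr-markdown": {"font-size": "1em !important"},
        "#run-btn": {"font-weight": "bold !important"},
    }
    demo_parts = []
    for selector, properties in demo_overrides.items():
        # One "prop: value;" line per property, indented inside the rule body.
        props_str = "\n".join(
            f"  {prop}: {value};" for prop, value in properties.items()
        )
        demo_parts.append(f"{selector} {{\n{props_str}\n}}")
    # Rules are joined with a blank line, mirroring get_css_string().
    demo_css = "\n\n".join(demo_parts)
    print(demo_css)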