Commit b4c92f5 · daniel-wojahn committed · verified · 1 Parent(s): 6ab086d

Upload 19 files

README.md CHANGED
@@ -4,12 +4,12 @@ emoji: 📚
4
  colorFrom: blue
5
  colorTo: indigo
6
  sdk: gradio
7
- sdk_version: 5.29.1
8
  python_version: 3.11
9
  app_file: app.py
10
  models:
11
- - buddhist-nlp/buddhist-sentence-similarity
12
- license: afl-3.0
13
  ---
14
 
15
  # Tibetan Text Metrics Web App
@@ -29,17 +29,55 @@ The Tibetan Text Metrics project aims to provide quantitative methods for assess
29
  - **Easy File Upload**: Upload one or more Tibetan `.txt` files directly through the browser.
30
  - **Automatic Segmentation**: Uses Tibetan section markers (e.g., `༈`) to automatically split texts into comparable chapters or sections.
31
  - **Core Metrics Computed**:
32
- - **Jaccard Similarity (%)**: Measures vocabulary overlap between segments. *Common Tibetan stopwords are filtered out to focus on meaningful lexical similarity.*
33
  - **Normalized Longest Common Subsequence (LCS)**: Identifies the longest shared sequence of words, indicating direct textual parallels.
34
- - **Semantic Similarity (BuddhistNLP)**: Uses the [buddhist-nlp/buddhist-sentence-similarity](https://huggingface.co/buddhist-nlp/buddhist-sentence-similarity) model to compare the contextual meaning of segments. *Note: This metric is experimental and may not perform well for all texts. It is recommended to use it in combination with other metrics for a more comprehensive analysis.*
35
- - **TF-IDF Cosine Similarity**: Highlights texts that share important or characteristic terms by comparing their TF-IDF profiles. *Common Tibetan stopwords are excluded to ensure TF-IDF weights highlight genuinely characteristic terms.*
 
 
 
36
  - **Handles Long Texts**: Implements automated chunking for semantic similarity to process texts exceeding the model's token limit.
37
  - **Interactive Visualizations**:
38
  - Heatmaps for Jaccard, LCS, Semantic, and TF-IDF similarity metrics, providing a quick overview of inter-segment relationships.
39
  - Bar chart displaying word counts per segment.
40
  - **Downloadable Results**: Export detailed metrics as a CSV file and save heatmaps as PNG files.
41
  - **Simplified Workflow**: No command-line interaction or Python scripting needed for analysis.
42
 
43
  ## Text Segmentation and Best Practices
44
 
45
  **Why segment your texts?**
@@ -73,11 +111,52 @@ Feel free to edit this list of stopwords to better suit your needs. The list is
73
 
74
  ### The application computes and visualizes the following similarity metrics between corresponding chapters/segments of the uploaded texts:
75
 
76
- 1. **Jaccard Similarity (%)**: This metric quantifies the lexical overlap between two text segments by comparing their sets of *unique* words. It essentially answers the question: 'Of all the distinct words found across these two segments, what proportion of them are present in both?' It is calculated as `(Number of common unique words) / (Total number of unique words in both texts combined) * 100`. Jaccard Similarity is insensitive to word order and word frequency; it only cares whether a unique word is present or absent. A higher percentage indicates a greater overlap in the vocabularies used in the two segments.
77
  2. **Normalized LCS (Longest Common Subsequence)**: This metric measures the length of the longest sequence of words that appears in *both* text segments, maintaining their original relative order. Importantly, these words do not need to be directly adjacent (contiguous) in either text. For example, if Text A is '<u>the</u> quick <u>brown</u> fox <u>jumps</u>' and Text B is '<u>the</u> lazy cat and <u>brown</u> dog <u>jumps</u> high', the LCS is 'the brown jumps'. The length of this common subsequence is then normalized (in this tool, by dividing by the length of the longer of the two segments) to provide a score, which is then presented as a percentage. A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism, as it reflects similarities in how ideas are ordered and expressed sequentially.
78
  * *Note on Interpretation*: It's possible for Normalized LCS to be higher than Jaccard Similarity. This often happens when texts share a substantial 'narrative backbone' or common ordered phrases (leading to a high LCS), even if they use varied surrounding vocabulary or introduce many unique words not part of these core sequences (which would lower the Jaccard score). LCS highlights this sequential, structural similarity, while Jaccard focuses on the overall shared vocabulary regardless of its arrangement.
79
- 3. **Semantic Similarity (BuddhistNLP)**: Utilizes the `buddhist-nlp/bodhi-sentence-cased-v1` model to compute the cosine similarity between the semantic embeddings of text segments. This model is fine-tuned for Buddhist studies texts and captures nuances in meaning. For texts exceeding the model's 512-token input limit, an automated chunking strategy is employed: texts are divided into overlapping chunks, each chunk is embedded, and the resulting chunk embeddings are averaged (mean pooling) to produce a single representative vector for the entire segment before comparison.
80
- 4. **TF-IDF Cosine Similarity**: This metric first calculates Term Frequency-Inverse Document Frequency (TF-IDF) scores for each word in each text segment. TF-IDF gives higher weight to words that are frequent within a particular segment but relatively rare across the entire collection of segments. This helps to identify terms that are characteristic or discriminative for a segment. Each segment is then represented as a vector of these TF-IDF scores. Finally, the cosine similarity is computed between these vectors. A score closer to 1 indicates that the two segments share more of these important, distinguishing terms, suggesting they cover similar specific topics or themes.
81
 
82
  ## Getting Started (if run Locally)
83
 
@@ -113,21 +192,55 @@ Feel free to edit this list of stopwords to better suit your needs. The list is
113
  ## Usage
114
 
115
  1. **Upload Files**: Use the file upload interface to select one or more `.txt` files containing Tibetan Unicode text.
116
- 2. **Run Analysis**: Click the "Run Analysis" button.
117
  3. **View Results**:
118
  - A preview of the similarity metrics will be displayed.
119
  - Download the full results as a CSV file.
120
- - Interactive heatmaps for Jaccard Similarity, Normalized LCS, Semantic Similarity, and TF-IDF Cosine Similarity will be generated.
121
  - A bar chart showing word counts per segment will also be available.
122
  - Any warnings (e.g., regarding missing chapter markers) will be displayed.
123
 
124
  ## Structure
125
 
126
  - `app.py` — Gradio web app entry point and UI definition.
127
  - `pipeline/` — Modules for file handling, text processing, metrics calculation, and visualization.
128
  - `process.py`: Core logic for segmenting texts and orchestrating metric computation.
129
  - `metrics.py`: Implementation of Jaccard, LCS, and Semantic Similarity (including chunking).
130
- - `semantic_embedding.py`: Handles loading and using the sentence transformer model.
 
131
  - `tokenize.py`: Tibetan text tokenization using `botok`.
132
  - `upload.py`: File upload handling (currently minimal).
133
  - `visualize.py`: Generates heatmaps and word count plots.
@@ -137,6 +250,22 @@ Feel free to edit this list of stopwords to better suit your needs. The list is
137
 
138
  This project is licensed under the Creative Commons Attribution 4.0 International License - see the [LICENSE](../../LICENSE) file in the main project directory for details.
139
 
140
  ## Citation
141
 
142
  If you use this web application or the underlying TTM tool in your research, please cite the main project:
@@ -152,4 +281,4 @@ If you use this web application or the underlying TTM tool in your research, ple
152
  ```
153
 
154
  ---
155
- For questions or issues specifically regarding the web application, please refer to the main project's [issue tracker](https://github.com/daniel-wojahn/tibetan-text-metrics/issues) or contact Daniel Wojahn.
 
4
  colorFrom: blue
5
  colorTo: indigo
6
  sdk: gradio
7
+ sdk_version: 5.29.0
8
  python_version: 3.11
9
  app_file: app.py
10
  models:
11
+ - buddhist-nlp/buddhist-sentence-similarity
12
+ - fasttext-tibetan
13
  ---
14
 
15
  # Tibetan Text Metrics Web App
 
29
  - **Easy File Upload**: Upload one or more Tibetan `.txt` files directly through the browser.
30
  - **Automatic Segmentation**: Uses Tibetan section markers (e.g., `༈`) to automatically split texts into comparable chapters or sections.
31
  - **Core Metrics Computed**:
32
+ - **Jaccard Similarity (%)**: Measures vocabulary overlap between segments. *Common Tibetan stopwords can be filtered out to focus on meaningful lexical similarity.*
33
  - **Normalized Longest Common Subsequence (LCS)**: Identifies the longest shared sequence of words, indicating direct textual parallels.
34
+ - **Semantic Similarity**: Uses embedding models to compare the contextual meaning of segments. Users can select between:
35
+ - A transformer-based model (buddhist-nlp/buddhist-sentence-similarity) specialized for Buddhist texts (experimental approach)
36
+ - The official Facebook FastText Tibetan model (facebook/fasttext-bo-vectors) pre-trained on a large corpus of Tibetan text
37
+ *Note: This metric works best when combined with other metrics for a more comprehensive analysis.*
38
+ - **TF-IDF Cosine Similarity**: Highlights texts that share important or characteristic terms by comparing their TF-IDF profiles. *Common Tibetan stopwords can be excluded to ensure TF-IDF weights highlight genuinely characteristic terms.*
39
  - **Handles Long Texts**: Implements automated chunking for semantic similarity to process texts exceeding the model's token limit.
40
+ - **Model Selection**: Choose from specialized embedding models for semantic similarity analysis:
41
+ - **Buddhist-NLP Transformer** (Experimental): Pre-trained model specialized for Buddhist texts
42
+ - **FastText**: Uses the official Facebook FastText Tibetan model (facebook/fasttext-bo-vectors) with optimizations specifically for Tibetan language, including botok tokenization and TF-IDF weighted averaging
43
+ - **Stopword Filtering**: Three levels of filtering for Tibetan words:
44
+ - **None**: No filtering, includes all words
45
+ - **Standard**: Filters only common particles and punctuation
46
+ - **Aggressive**: Filters all function words including particles, pronouns, and auxiliaries
47
  - **Interactive Visualizations**:
48
  - Heatmaps for Jaccard, LCS, Semantic, and TF-IDF similarity metrics, providing a quick overview of inter-segment relationships.
49
  - Bar chart displaying word counts per segment.
50
+ - **Advanced Interpretation**: Get scholarly insights about your results with a built-in analysis engine that:
51
+ - Examines your metrics and provides contextual interpretation of textual relationships
52
+ - Generates a dual-layer narrative analysis (scholarly and accessible)
53
+ - Identifies patterns across chapters and highlights notable textual relationships
54
+ - Connects findings to Tibetan textual studies concepts (transmission lineages, regional variants)
55
+ - Suggests questions for further investigation
56
  - **Downloadable Results**: Export detailed metrics as a CSV file and save heatmaps as PNG files.
57
  - **Simplified Workflow**: No command-line interaction or Python scripting needed for analysis.
58
 
59
+ ## Advanced Features
60
+
61
+ ### Using AI-Powered Analysis
62
+
63
+ The application includes an "Interpret Results" button that provides scholarly insights about your text similarity metrics. This feature:
64
+
65
+ 1. Uses Mistral 7B Instruct via OpenRouter to analyze your results
66
+ 2. Requires an OpenRouter API key (set via environment variable)
67
+ 3. Produces a comprehensive scholarly analysis that includes:
68
+ - Introduction explaining the texts compared and general observations
69
+ - Overall patterns across all chapters with visualized trends
70
+ - Detailed examination of notable chapters (highest/lowest similarity)
71
+ - Discussion of what different metrics reveal about textual relationships
72
+ - Conclusions suggesting implications for Tibetan textual scholarship
73
+ - Specific questions these findings raise for further investigation
74
+ - Cautionary notes about interpreting perfect matches or zero similarity scores
75
+
76
+ ### Data Processing
77
+
78
+ - **Automatic Filtering**: The system automatically filters out perfect matches (1.0 across all metrics) that may result from empty cells or identical text comparisons
79
+ - **Robust Analysis**: The system handles edge cases and provides meaningful metrics even with imperfect data
80
+
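The automatic filtering described under "Data Processing" could look roughly like the sketch below. This is an illustration under stated assumptions, not the app's actual code: the column names (`jaccard`, `lcs`, `semantic`, `tfidf`), the 0 to 1 scale for every metric, and the helper names are all hypothetical.

```python
import math

# Hypothetical column names; every metric is assumed to be on a 0..1 scale.
METRICS = ("jaccard", "lcs", "semantic", "tfidf")


def is_perfect_match(row, metrics=METRICS, tol=1e-9):
    """Return True when every metric present in the row equals 1.0 (within tol)."""
    values = [row[m] for m in metrics if row.get(m) is not None]
    return bool(values) and all(math.isclose(v, 1.0, abs_tol=tol) for v in values)


def drop_perfect_matches(rows):
    """Keep only rows that are not perfect matches across all metrics."""
    return [row for row in rows if not is_perfect_match(row)]


rows = [
    {"chapter": 1, "jaccard": 1.0, "lcs": 1.0, "semantic": 1.0, "tfidf": 1.0},
    {"chapter": 2, "jaccard": 0.42, "lcs": 0.37, "semantic": 0.81, "tfidf": 0.55},
    {"chapter": 3, "jaccard": 1.0, "lcs": 0.9, "semantic": None, "tfidf": 1.0},
]
kept = drop_perfect_matches(rows)
print([row["chapter"] for row in kept])
```

Only rows where every available metric is 1.0 are dropped, so a chapter with one high score and a missing semantic value is kept for inspection.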
81
  ## Text Segmentation and Best Practices
82
 
83
  **Why segment your texts?**
 
111
 
112
  ### The application computes and visualizes the following similarity metrics between corresponding chapters/segments of the uploaded texts:
113
 
114
+ 1. **Jaccard Similarity (%)**: This metric quantifies the lexical overlap between two text segments by comparing their sets of *unique* words, optionally **filtering out common Tibetan stopwords**.
115
+ It essentially answers the question: 'Of all the distinct, meaningful words found across these two segments, what proportion of them are present in both?'
116
+ It is calculated as `(Number of common unique meaningful words) / (Total number of unique meaningful words in both texts combined) * 100`.
117
+ Jaccard Similarity is insensitive to word order and word frequency; it only cares whether a unique meaningful word is present or absent.
118
+ A higher percentage indicates a greater overlap in the significant vocabularies used in the two segments.
119
+
120
+ **Stopword Filtering**: Three levels of filtering are available:
121
+ - **None**: No filtering, includes all words in the comparison
122
+ - **Standard**: Filters only common particles and punctuation
123
+ - **Aggressive**: Filters all function words including particles, pronouns, and auxiliaries
124
+
125
+ This helps focus on meaningful content words rather than grammatical elements.
126
+
127
  2. **Normalized LCS (Longest Common Subsequence)**: This metric measures the length of the longest sequence of words that appears in *both* text segments, maintaining their original relative order. Importantly, these words do not need to be directly adjacent (contiguous) in either text. For example, if Text A is '<u>the</u> quick <u>brown</u> fox <u>jumps</u>' and Text B is '<u>the</u> lazy cat and <u>brown</u> dog <u>jumps</u> high', the LCS is 'the brown jumps'. The length of this common subsequence is then normalized (in this tool, by dividing by the length of the longer of the two segments) to provide a score, which is then presented as a percentage. A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism, as it reflects similarities in how ideas are ordered and expressed sequentially.
128
  * *Note on Interpretation*: It's possible for Normalized LCS to be higher than Jaccard Similarity. This often happens when texts share a substantial 'narrative backbone' or common ordered phrases (leading to a high LCS), even if they use varied surrounding vocabulary or introduce many unique words not part of these core sequences (which would lower the Jaccard score). LCS highlights this sequential, structural similarity, while Jaccard focuses on the overall shared vocabulary regardless of its arrangement.
129
+ 3. **Semantic Similarity**: Computes the cosine similarity between semantic embeddings of text segments using one of two approaches:
130
+
131
+ **a. Transformer-based Model** (Experimental): Pre-trained model that understands contextual relationships between words.
132
+ - `buddhist-nlp/buddhist-sentence-similarity`: Specialized for Buddhist texts
133
+ - Processes raw Unicode Tibetan text directly (no special tokenization required)
134
+ - Note: This is an experimental approach and results may vary with different texts
135
+
136
+ **b. FastText Model**: Uses the official Facebook FastText Tibetan model (facebook/fasttext-bo-vectors) pre-trained on a large corpus of Tibetan text. Falls back to a custom model only if the official model cannot be loaded.
137
+ - Processes Tibetan text using botok tokenization (same as other metrics)
138
+ - Uses the pre-tokenized words from botok rather than doing its own tokenization
139
+ - Better for texts with specialized Tibetan vocabulary
140
+ - More stable results for general Tibetan text comparison
141
+ - Optimized for Tibetan language with:
142
+ - Syllable-based tokenization preserving Tibetan syllable markers
143
+ - TF-IDF weighted averaging for word vectors (distinct from the TF-IDF Cosine Similarity metric)
144
+ - Enhanced parameters based on Tibetan NLP research
145
+
146
+ For texts exceeding the model's token limit, an automated chunking strategy is employed: texts are divided into overlapping chunks, each chunk is embedded, and the resulting embeddings are averaged to produce a single vector for the entire segment.
147
+ 4. **TF-IDF Cosine Similarity**: This metric first calculates Term Frequency-Inverse Document Frequency (TF-IDF) scores for each word in each text segment, optionally **filtering out common Tibetan stopwords**.
148
+ TF-IDF gives higher weight to words that are frequent within a particular segment but relatively rare across the entire collection of segments.
149
+ This helps to identify terms that are characteristic or discriminative for a segment. When stopword filtering is enabled, the TF-IDF scores better reflect genuinely significant terms.
150
+ Each segment is then represented as a vector of these TF-IDF scores.
151
+ Finally, the cosine similarity is computed between these vectors.
152
+ A score closer to 1 indicates that the two segments share more of these important, distinguishing terms, suggesting they cover similar specific topics or themes.
153
+
154
+ **Stopword Filtering**: Three levels of filtering are available:
155
+ - **None**: No filtering, includes all words in the comparison
156
+ - **Standard**: Filters only common particles and punctuation
157
+ - **Aggressive**: Filters all function words including particles, pronouns, and auxiliaries
158
+
159
+ This helps focus on meaningful content words rather than grammatical elements.
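The TF-IDF cosine computation described in item 4 can be sketched in plain Python. This is a minimal sketch, not the project's implementation: the smoothed IDF formula, the function names, and the placeholder tokens are assumptions chosen for illustration, and the app may rely on a library vectorizer instead.

```python
import math
from collections import Counter


def tfidf_vectors(segments, stopwords=frozenset()):
    """Build one {term: weight} dict per segment from lists of tokens.

    Assumes a smoothed IDF: log((1 + N) / (1 + df)) + 1.
    """
    filtered = [[t for t in seg if t not in stopwords] for seg in segments]
    n = len(filtered)
    # Document frequency: in how many segments each term occurs.
    df = Counter(term for seg in filtered for term in set(seg))
    vectors = []
    for seg in filtered:
        counts = Counter(seg)
        total = len(seg) or 1
        vectors.append({
            term: (count / total) * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Placeholder tokens standing in for botok output.
segments = [["ka", "kha", "ga"], ["ka", "kha", "nga"], ["ca", "cha", "ja"]]
vecs = tfidf_vectors(segments)
print(round(cosine(vecs[0], vecs[1]), 3), cosine(vecs[0], vecs[2]))
```

Segments with no terms in common score 0, and identical segments score 1 (up to floating-point rounding). Passing a stopword set removes those tokens before any weights are computed.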
160
 
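Items 1 and 2 above (Jaccard Similarity and Normalized LCS) can be illustrated with the README's own example sentences. The sketch below is hypothetical and self-contained. The function names are assumptions, and the real code in `pipeline/metrics.py` may differ, for instance in how it tokenizes or optimizes the LCS computation.

```python
def jaccard_percent(tokens_a, tokens_b, stopwords=frozenset()):
    """Shared unique words divided by all unique words, as a percentage."""
    a = set(tokens_a) - stopwords
    b = set(tokens_b) - stopwords
    union = a | b
    return 100.0 * len(a & b) / len(union) if union else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence (order kept, gaps allowed)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_lcs(a, b):
    """LCS length divided by the length of the longer segment."""
    longer = max(len(a), len(b))
    return lcs_length(a, b) / longer if longer else 0.0


text_a = "the quick brown fox jumps".split()
text_b = "the lazy cat and brown dog jumps high".split()
print(jaccard_percent(text_a, text_b))  # 3 shared of 10 unique words -> 30.0
print(normalized_lcs(text_a, text_b))   # LCS 'the brown jumps' = 3 of 8 -> 0.375
```

In this example the Normalized LCS (37.5%) is higher than the Jaccard score (30%), which is the situation the note on interpretation describes.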
161
  ## Getting Started (if run Locally)
162
 
 
192
  ## Usage
193
 
194
  1. **Upload Files**: Use the file upload interface to select one or more `.txt` files containing Tibetan Unicode text.
195
+ 2. **Configure Options**:
196
+ - Choose whether to compute semantic similarity
197
+ - Select an embedding model for semantic analysis
198
+ - Choose a stopword filtering level (None, Standard, or Aggressive)
199
+ 3. **Run Analysis**: Click the "Run Analysis" button.
200
  3. **View Results**:
201
  - A preview of the similarity metrics will be displayed.
202
  - Download the full results as a CSV file.
203
+ - Interactive heatmaps for Jaccard Similarity, Normalized LCS, Semantic Similarity, and TF-IDF Cosine Similarity will be generated. All heatmaps use a consistent color scheme where darker colors represent higher similarity.
204
  - A bar chart showing word counts per segment will also be available.
205
  - Any warnings (e.g., regarding missing chapter markers) will be displayed.
206
 
207
+ 4. **Get Interpretation** (Optional):
208
+ - After running the analysis, click the "Help Interpret Results" button.
209
+ - No API key or internet connection required! The system uses a built-in rule-based analysis engine.
210
+ - The system will analyze your metrics and provide insights about patterns, relationships, and notable findings in your data.
211
+ - This feature helps researchers understand the significance of the metrics and identify interesting textual relationships between chapters.
212
+
213
+ ## Embedding Models
214
+
215
+ The application offers two specialized approaches for calculating semantic similarity in Tibetan texts:
216
+
217
+ 1. **Buddhist-NLP Transformer** (Default option):
218
+ - A specialized model fine-tuned for Buddhist text similarity
219
+ - Provides excellent results for Tibetan Buddhist texts
220
+ - Pre-trained and ready to use, no training required
221
+ - Best for general Buddhist terminology and concepts
222
+
223
+ 2. **FastText Model**:
224
+ - Uses the official Facebook FastText Tibetan model (facebook/fasttext-bo-vectors)
225
+ - Pre-trained on a large corpus of Tibetan text from Wikipedia and other sources
226
+ - Falls back to training a custom model on your texts if the official model cannot be loaded
227
+ - Respects your stopword filtering settings when creating embeddings
228
+ - Uses simple word vector averaging for stable embeddings
229
+
230
+ **When to choose FastText**:
231
+ - When you want high-quality word embeddings specifically trained for Tibetan language
232
+ - When you need a model that can handle out-of-vocabulary words through character n-grams
233
+ - When you want to benefit from Facebook's large-scale pre-training on Tibetan text
234
+ - When you need more control over how stopwords affect semantic analysis
235
+
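The README mentions two averaging steps: averaging word vectors (FastText) and averaging the embeddings of overlapping chunks for long segments. Both reduce many vectors to one before cosine comparison. The sketch below shows the chunk-and-average idea only. The `embed` callable, the chunk size of 512, the overlap of 64, and all function names are assumptions for illustration, not the project's actual API.

```python
def chunk_tokens(tokens, size, overlap):
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


def mean_pool(vectors):
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def embed_segment(tokens, embed, size=512, overlap=64):
    """Embed each chunk with `embed` (hypothetical callable) and average the results."""
    chunks = chunk_tokens(tokens, size, overlap)
    if not chunks:
        raise ValueError("cannot embed an empty segment")
    return mean_pool([embed(chunk) for chunk in chunks])


def toy_embed(chunk):
    """Stand-in for a real model: returns a fixed-size two-dimensional vector."""
    return [float(len(chunk)), float(sum(chunk))]


windows = chunk_tokens(list(range(10)), size=4, overlap=1)
print(windows)
print(embed_segment(list(range(10)), toy_embed, size=4, overlap=1))
```

A real embedding model would replace `toy_embed`. The pooled vector for each segment would then be compared with cosine similarity, as the metric description states.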
236
  ## Structure
237
 
238
  - `app.py` — Gradio web app entry point and UI definition.
239
  - `pipeline/` — Modules for file handling, text processing, metrics calculation, and visualization.
240
  - `process.py`: Core logic for segmenting texts and orchestrating metric computation.
241
  - `metrics.py`: Implementation of Jaccard, LCS, and Semantic Similarity (including chunking).
242
+ - `semantic_embedding.py`: Handles loading and using the selected embedding models.
243
+ - `fasttext_embedding.py`: Provides functionality for training and using FastText models.
244
  - `tokenize.py`: Tibetan text tokenization using `botok`.
245
  - `upload.py`: File upload handling (currently minimal).
246
  - `visualize.py`: Generates heatmaps and word count plots.
 
250
 
251
  This project is licensed under the Creative Commons Attribution 4.0 International License - see the [LICENSE](../../LICENSE) file in the main project directory for details.
252
 
253
+ ## Research and Acknowledgements
254
+
255
+ The FastText implementation for Tibetan text has been optimized based on research findings from several studies on Tibetan natural language processing:
256
+
257
+ 1. Di, R., Tashi, N., & Lin, J. (2019). Improving Tibetan Word Segmentation Based on Multi-Features Fusion. *IEEE Access*, 7, 178057-178069.
258
+ - Informed our syllable-based tokenization approach and the importance of preserving Tibetan syllable markers
259
+
260
+ 2. Tashi, N., Rabgay, T., & Wangchuk, K. (2020). Tibetan Word Segmentation using Syllable-based Maximum Matching with Potential Syllable Merging. *Engineering Applications of Artificial Intelligence*, 93, 103716.
261
+ - Provided insights on syllable segmentation for Tibetan text processing
262
+
263
+ 3. Tashi, N., Rai, A. K., Mittal, P., & Sharma, A. K. (2018). A Novel Approach to Feature Extraction for Tibetan Text Classification. *Journal of Information Processing Systems*, 14(1), 211-224.
264
+ - Guided our parameter optimization for FastText, including embedding dimensions and n-gram settings
265
+
266
+ 4. Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching Word Vectors with Subword Information. *Transactions of the Association for Computational Linguistics*, 5, 135-146.
267
+ - The original FastText paper that introduced the subword-enriched word embeddings we use
268
+
269
  ## Citation
270
 
271
  If you use this web application or the underlying TTM tool in your research, please cite the main project:
 
281
  ```
282
 
283
  ---
284
+ For questions or issues specifically regarding the web application, please refer to the main project's [issue tracker](https://github.com/daniel-wojahn/tibetan-text-metrics/issues) or contact Daniel Wojahn.
app.py CHANGED
@@ -2,7 +2,13 @@ import gradio as gr
2
  from pathlib import Path
3
  from pipeline.process import process_texts
4
  from pipeline.visualize import generate_visualizations, generate_word_count_chart
 
5
  import logging
6
 
7
  from theme import tibetan_theme
8
 
@@ -18,7 +24,7 @@ def main_interface():
18
  ) as demo:
19
  gr.Markdown(
20
  """# Tibetan Text Metrics Web App
21
- <span style='font-size:18px;'>A user-friendly web application for analyzing textual similarities and variations in Tibetan manuscripts, providing a graphical interface to the core functionalities of the [Tibetan Text Metrics (TTM)](https://github.com/daniel-wojahn/tibetan-text-metrics) project.</span>
22
  """,
23
  elem_classes="gr-markdown",
24
  )
@@ -38,6 +44,10 @@ def main_interface():
38
  file_types=[".txt"],
39
  file_count="multiple",
40
  )
41
  with gr.Column(scale=1, elem_classes="step-column"):
42
  with gr.Group():
43
  gr.Markdown(
@@ -47,12 +57,39 @@ def main_interface():
47
  elem_classes="gr-markdown",
48
  )
49
  semantic_toggle_radio = gr.Radio(
50
- label="Compute semantic similarity?",
51
  choices=["Yes", "No"],
52
  value="Yes",
53
  info="Semantic similarity will be time-consuming. Choose 'No' to speed up analysis if these metrics are not required.",
54
  elem_id="semantic-radio-group",
55
  )
56
  process_btn = gr.Button(
57
  "Run Analysis", elem_id="run-btn", variant="primary"
58
  )
@@ -69,25 +106,66 @@ def main_interface():
69
  metrics_preview = gr.Dataframe(
70
  label="Similarity Metrics Preview", interactive=False, visible=True
71
  )
72
- word_count_plot = gr.Plot(label="Word Counts per Segment")
73
  # Heatmap tabs for each metric
74
  heatmap_titles = {
75
- "Jaccard Similarity (%)": "Jaccard Similarity (%): Higher scores (brighter) mean more shared unique words.",
76
- "Normalized LCS": "Normalized LCS: Higher scores (brighter) mean longer shared sequences of words.",
77
- "Semantic Similarity (BuddhistNLP)": "Semantic Similarity (BuddhistNLP - using word embeddings/experimental): Higher scores (brighter) mean more similar meanings.",
78
- "TF-IDF Cosine Sim": "TF-IDF Cosine Similarity: Higher scores mean texts share more important, distinctive vocabulary.",
 
79
  }
80
 
81
  metric_tooltips = {
82
  "Jaccard Similarity (%)": """
83
  ### Jaccard Similarity (%)
84
- This metric quantifies the lexical overlap between two text segments by comparing their sets of *unique* words, after **filtering out common Tibetan stopwords**.
85
- It essentially answers the question: 'Of all the distinct, meaningful words found across these two segments, what proportion of them are present in both?'
86
- It is calculated as `(Number of common unique meaningful words) / (Total number of unique meaningful words in both texts combined) * 100`.
87
- Jaccard Similarity is insensitive to word order and word frequency; it only cares whether a unique meaningful word is present or absent.
88
- A higher percentage indicates a greater overlap in the significant vocabularies used in the two segments.
89
 
90
- **Stopword Filtering**: uses a range of stopwords to filter out common Tibetan words that do not contribute to the semantic content of the text.
 
 
91
  """,
92
  "Normalized LCS": """
93
  ### Normalized LCS (Longest Common Subsequence)
@@ -102,39 +180,85 @@ A higher Normalized LCS score suggests more significant shared phrasing, direct
102
  **Note on Interpretation**: It is possible for Normalized LCS to be higher than Jaccard Similarity. This often happens when texts share a substantial 'narrative backbone' or common ordered phrases (leading to a high LCS), even if they use varied surrounding vocabulary or introduce many unique words not part of these core sequences (which would lower the Jaccard score). LCS highlights this sequential, structural similarity, while Jaccard focuses on the overall shared vocabulary regardless of its arrangement.
103
  """,
104
  "Semantic Similarity": """
105
- ### Semantic Similarity (Experimental)
106
- Utilizes the `<a href="https://huggingface.co/buddhist-nlp/buddhist-sentence-similarity">buddhist-nlp/buddhist-sentence-similarity</a>` model to compute the cosine similarity between the semantic embeddings of text segments.
107
- This model is fine-tuned for Buddhist studies texts and captures nuances in meaning.
108
- For texts exceeding the model's 512-token input limit, an automated chunking strategy is employed: texts are divided into overlapping chunks, each chunk is embedded, and the resulting chunk embeddings are averaged (mean pooling) to produce a single representative vector for the entire segment before comparison.
 
 
 
109
 
110
- **Note**: This metric is experimental and may not perform well for all texts. It is recommended to use it in combination with other metrics for a more comprehensive analysis.
111
  """,
112
  "TF-IDF Cosine Sim": """
113
  ### TF-IDF Cosine Similarity
114
- This metric first calculates Term Frequency-Inverse Document Frequency (TF-IDF) scores for each word in each text segment, **after filtering out common Tibetan stopwords**.
115
- TF-IDF gives higher weight to words that are frequent within a particular segment but relatively rare across the entire collection of segments.
116
- This helps to identify terms that are characteristic or discriminative for a segment. By excluding stopwords, the TF-IDF scores better reflect genuinely significant terms.
117
- Each segment is then represented as a vector of these TF-IDF scores.
118
- Finally, the cosine similarity is computed between these vectors.
119
- A score closer to 1 indicates that the two segments share more of these important, distinguishing terms, suggesting they cover similar specific topics or themes.
120
-
121
- **Stopword Filtering**: uses a range of stopwords to filter out common Tibetan words that do not contribute to the semantic content of the text.
122
  """,
123
  }
124
  heatmap_tabs = {}
125
  gr.Markdown("## Detailed Metric Analysis", elem_classes="gr-markdown")
 
126
  with gr.Tabs(elem_id="heatmap-tab-group"):
 
127
  for metric_key, descriptive_title in heatmap_titles.items():
128
  with gr.Tab(metric_key):
129
- if metric_key in metric_tooltips:
130
- gr.Markdown(value=metric_tooltips[metric_key])
131
  else:
132
- gr.Markdown(
133
- value=f"### {metric_key}\nDescription not found."
134
- ) # Fallback
135
- heatmap_tabs[metric_key] = gr.Plot(
136
- label=f"Heatmap: {metric_key}", show_label=False
137
- )
138
 
139
  # The outputs in process_btn.click should use the short metric names as keys for heatmap_tabs
140
  # e.g., heatmap_tabs["Jaccard Similarity (%)"]
@@ -145,7 +269,25 @@ A score closer to 1 indicates that the two segments share more of these importan
145
 
146
  warning_box = gr.Markdown(visible=False)
147
 
148
- def run_pipeline(files, enable_semantic_str):
149
  # Initialize all return values to ensure defined paths for all outputs
150
  csv_path_res = None
151
  metrics_preview_df_res = None # Can be a DataFrame or a string message
@@ -172,6 +314,7 @@ A score closer to 1 indicates that the two segments share more of these importan
172
  - semantic_heatmap (matplotlib.figure.Figure | None): Semantic similarity heatmap, or None.
173
  - warning_update (gr.update): Gradio update for the warning box.
174
  """
 
175
  if not files:
176
  return (
177
  None,
@@ -183,21 +326,76 @@ A score closer to 1 indicates that the two segments share more of these importan
183
  None, # tfidf_heatmap
184
  gr.update(value="Please upload files.", visible=True),
185
  )
186
 
187
  try:
188
  filenames = [
189
  Path(file.name).name for file in files
190
  ] # Use Path().name to get just the filename
191
- text_data = {
192
- Path(file.name)
193
- .name: Path(file.name)
194
- .read_text(encoding="utf-8-sig")
195
- for file in files
196
- }
197
-
198
- enable_semantic_bool = enable_semantic_str == "Yes"
199
  df_results, word_counts_df_data, warning_raw = process_texts(
200
- text_data, filenames, enable_semantic=enable_semantic_bool
201
  )
202
 
203
  if df_results.empty:
@@ -209,20 +407,43 @@ A score closer to 1 indicates that the two segments share more of these importan
209
  warning_update_res = gr.update(value=warning_message, visible=True)
210
  # Results for this case are set, then return
211
  else:
212
  # heatmap_titles is already defined in the outer scope of main_interface
213
  heatmaps_data = generate_visualizations(
214
  df_results, descriptive_titles=heatmap_titles
215
  )
216
  word_count_fig_res = generate_word_count_chart(word_counts_df_data)
217
  csv_path_res = "results.csv"
218
  df_results.to_csv(csv_path_res, index=False)
 
 
219
  warning_md = f"**⚠️ Warning:** {warning_raw}" if warning_raw else ""
220
  metrics_preview_df_res = df_results.head(10)
221
 
222
  jaccard_heatmap_res = heatmaps_data.get("Jaccard Similarity (%)")
223
  lcs_heatmap_res = heatmaps_data.get("Normalized LCS")
224
  semantic_heatmap_res = heatmaps_data.get(
225
- "Semantic Similarity (BuddhistNLP)"
226
  )
227
  tfidf_heatmap_res = heatmaps_data.get("TF-IDF Cosine Sim")
228
  warning_update_res = gr.update(
@@ -244,26 +465,68 @@ A score closer to 1 indicates that the two segments share more of these importan
244
  lcs_heatmap_res,
245
  semantic_heatmap_res,
246
  tfidf_heatmap_res,
247
- warning_update_res,
248
  )
249
 
250
  process_btn.click(
251
- run_pipeline,
252
- inputs=[file_input, semantic_toggle_radio],
253
  outputs=[
254
  csv_output,
255
  metrics_preview,
256
  word_count_plot,
257
  heatmap_tabs["Jaccard Similarity (%)"],
258
  heatmap_tabs["Normalized LCS"],
259
- heatmap_tabs["Semantic Similarity (BuddhistNLP)"],
260
  heatmap_tabs["TF-IDF Cosine Sim"],
261
  warning_box,
262
- ],
263
  )
 
264
  return demo
265
 
266
 
267
  if __name__ == "__main__":
268
  demo = main_interface()
269
- demo.launch()
 
2
  from pathlib import Path
3
  from pipeline.process import process_texts
4
  from pipeline.visualize import generate_visualizations, generate_word_count_chart
5
+ from pipeline.llm_service import get_interpretation
6
  import logging
7
+ import pandas as pd
8
+ from dotenv import load_dotenv
9
+
10
+ # Load environment variables from .env file
11
+ load_dotenv()
12
 
13
  from theme import tibetan_theme
14
 
 
24
  ) as demo:
25
  gr.Markdown(
26
  """# Tibetan Text Metrics Web App
27
+ <span style='font-size:18px;'>A user-friendly web application for analyzing textual similarities and variations in Tibetan manuscripts, providing a graphical interface to the core functionalities of the [Tibetan Text Metrics (TTM)](https://github.com/daniel-wojahn/tibetan-text-metrics) project. Powered by Mistral 7B via OpenRouter for advanced text analysis.</span>
28
  """,
29
  elem_classes="gr-markdown",
30
  )
 
44
  file_types=[".txt"],
45
  file_count="multiple",
46
  )
47
+ gr.Markdown(
48
+ "<small>Note: Maximum file size: 10MB per file. For optimal performance, use files under 1MB.</small>",
49
+ elem_classes="gr-markdown"
50
+ )
51
  with gr.Column(scale=1, elem_classes="step-column"):
52
  with gr.Group():
53
  gr.Markdown(
 
57
  elem_classes="gr-markdown",
58
  )
59
  semantic_toggle_radio = gr.Radio(
60
+ label="Compute semantic similarity? (Experimental)",
61
  choices=["Yes", "No"],
62
  value="Yes",
63
  info="Semantic similarity will be time-consuming. Choose 'No' to speed up analysis if these metrics are not required.",
64
  elem_id="semantic-radio-group",
65
  )
66
+
67
+ model_dropdown = gr.Dropdown(
68
+ label="Embedding Model",
69
+ choices=[
70
+ "buddhist-nlp/buddhist-sentence-similarity",
71
+ "fasttext-tibetan"
72
+ ],
73
+ value="buddhist-nlp/buddhist-sentence-similarity",
74
+ info="Select the embedding model for semantic similarity.<br><br>"
75
+ "<b>Model information:</b><br>"
76
+ "• <a href='https://huggingface.co/buddhist-nlp/buddhist-sentence-similarity' target='_blank'>buddhist-nlp/buddhist-sentence-similarity</a>: Specialized model fine-tuned for Buddhist text similarity.<br>"
77
+ "• <b>fasttext-tibetan</b>: Uses the official Facebook FastText Tibetan model pre-trained on a large corpus. If the official model cannot be loaded, it will fall back to training a custom model on your uploaded texts.",
78
+ visible=True,
79
+ interactive=True
80
+ )
81
+
82
+ stopwords_dropdown = gr.Dropdown(
83
+ label="Stopword Filtering",
84
+ choices=[
85
+ "None (No filtering)",
86
+ "Standard (Common particles only)",
87
+ "Aggressive (All function words)"
88
+ ],
89
+ value="Standard (Common particles only)", # Default
90
+ info="Choose how aggressively to filter out common Tibetan particles and function words when calculating similarity. This helps focus on meaningful content words."
91
+ )
92
+
93
  process_btn = gr.Button(
94
  "Run Analysis", elem_id="run-btn", variant="primary"
95
  )
 
106
  metrics_preview = gr.Dataframe(
107
  label="Similarity Metrics Preview", interactive=False, visible=True
108
  )
109
+
110
+ # LLM Interpretation components
111
+ with gr.Row():
112
+ with gr.Column():
113
+ output_analysis = gr.Markdown(
114
+ "## AI Analysis\n*The AI will analyze your text similarities and provide insights into patterns and relationships. Make sure to set up your OpenRouter API key for this feature.*",
115
+ elem_classes="gr-markdown"
116
+ )
117
+
118
+ # Add the interpret button
119
+ with gr.Row():
120
+ interpret_btn = gr.Button(
121
+ "Help Interpret Results",
122
+ variant="primary",
123
+ elem_id="interpret-btn"
124
+ )
125
+
126
+ # About AI Analysis section
127
+ with gr.Accordion("ℹ️ About AI Analysis", open=False):
128
+ gr.Markdown("""
129
+ ### AI-Powered Analysis
130
+
131
+ The AI analysis is powered by **Mistral 7B Instruct** via the OpenRouter API. To use this feature:
132
+
133
+ 1. Get an API key from [OpenRouter](https://openrouter.ai/keys)
134
+ 2. Create a `.env` file in the webapp directory
135
+ 3. Add: `OPENROUTER_API_KEY=your_api_key_here`
136
+
137
+ The AI will automatically analyze your text similarities and provide insights into patterns and relationships.
138
+ """)
139
+ # Create a placeholder message with proper formatting and structure
140
+ initial_message = """
141
+ ## Analysis of Tibetan Text Similarity Metrics
142
+
143
+ <small>*Click the 'Help Interpret Results' button above to generate an AI-powered analysis of your similarity metrics.*</small>
144
+ """
145
+ interpretation_output = gr.Markdown(
146
+ value=initial_message,
147
+ elem_id="llm-analysis"
148
+ )
149
+
150
  # Heatmap tabs for each metric
151
  heatmap_titles = {
152
+ "Jaccard Similarity (%)": "Jaccard Similarity (%): Higher scores (darker) mean more shared unique words.",
153
+ "Normalized LCS": "Normalized LCS: Higher scores (darker) mean longer shared sequences of words.",
154
+ "Semantic Similarity": "Semantic Similarity (using word embeddings/experimental): Higher scores (darker) mean more similar meanings.",
155
+ "TF-IDF Cosine Sim": "TF-IDF Cosine Similarity: Higher scores (darker) mean texts share more important, distinctive vocabulary.",
156
+ "Word Counts": "Word Counts: Shows the number of words in each segment after tokenization."
157
  }
158
 
159
  metric_tooltips = {
160
  "Jaccard Similarity (%)": """
161
  ### Jaccard Similarity (%)
162
+ This metric quantifies the lexical overlap between two text segments by comparing their sets of *unique* words, optionally filtering out common Tibetan stopwords.
163
+
164
+ It essentially answers the question: 'Of all the distinct words found across these two segments, what proportion of them are present in both?' It is calculated as `(Number of common unique words) / (Total number of unique words in both texts combined) * 100`.
 
 
165
 
166
+ Jaccard Similarity is insensitive to word order and word frequency; it only cares whether a unique word is present or absent. A higher percentage indicates a greater overlap in the vocabularies used in the two segments.
167
+
168
+ **Stopword Filtering**: When enabled (via the "Stopword Filtering" dropdown), common Tibetan particles and function words are removed before comparison, so the score reflects meaningful content words rather than grammatical elements.
169
  """,
170
  "Normalized LCS": """
171
  ### Normalized LCS (Longest Common Subsequence)
 
180
  **Note on Interpretation**: It is possible for Normalized LCS to be higher than Jaccard Similarity. This often happens when texts share a substantial 'narrative backbone' or common ordered phrases (leading to a high LCS), even if they use varied surrounding vocabulary or introduce many unique words not part of these core sequences (which would lower the Jaccard score). LCS highlights this sequential, structural similarity, while Jaccard focuses on the overall shared vocabulary regardless of its arrangement.
181
  """,
182
  "Semantic Similarity": """
183
+ ### Semantic Similarity
184
+ Computes the cosine similarity between semantic embeddings of text segments using one of two approaches:
185
+
186
+ **1. Transformer-based Model** (Experimental): Pre-trained model that understands contextual relationships between words.
187
+ - `buddhist-nlp/buddhist-sentence-similarity`: Specialized for Buddhist texts
188
+ - Processes raw Unicode Tibetan text directly (no special tokenization required)
189
+ - Note: This is an experimental approach and results may vary with different texts
190
 
191
+ **2. FastText Model**: Uses the official Facebook FastText Tibetan model (facebook/fasttext-bo-vectors) pre-trained on a large corpus of Tibetan text. Falls back to a custom model only if the official model cannot be loaded.
192
+ - Processes Tibetan text using botok tokenization (same as other metrics)
193
+ - Uses the pre-tokenized words from botok rather than doing its own tokenization
194
+ - Better for texts with specialized Tibetan vocabulary
195
+ - More stable results for general Tibetan text comparison
196
+ - Optimized for Tibetan language with:
197
+ - Syllable-based tokenization preserving Tibetan syllable markers
198
+ - TF-IDF weighted averaging for word vectors (distinct from the TF-IDF Cosine Similarity metric)
199
+ - Enhanced parameters based on Tibetan NLP research
200
+
201
+ **Chunking for Long Texts**: For texts exceeding the model's token limit, an automated chunking strategy is employed: texts are divided into overlapping chunks, each chunk is embedded, and the resulting embeddings are averaged to produce a single vector for the entire segment.
202
+
203
+ **Stopword Filtering**: When enabled (via the "Stopword Filtering" dropdown), common Tibetan particles and function words are filtered out before FastText embeddings are computed, which focuses the comparison on meaningful content words. Transformer models process the full text regardless of the stopword filtering setting.
204
+
205
+ **Note**: This metric works best when combined with other metrics for a more comprehensive analysis.
206
  """,
207
  "TF-IDF Cosine Sim": """
208
  ### TF-IDF Cosine Similarity
209
+ This metric calculates Term Frequency-Inverse Document Frequency (TF-IDF) scores for each word in each text segment, optionally filtering out common Tibetan stopwords.
210
+
211
+ TF-IDF gives higher weight to words that are frequent within a particular segment but relatively rare across the entire collection of segments. This helps identify terms that are characteristic or discriminative for a segment. When stopword filtering is enabled, the TF-IDF scores better reflect genuinely significant terms by excluding common particles and function words.
212
+
213
+ Each segment is represented as a vector of these TF-IDF scores, and the cosine similarity is computed between these vectors. A score closer to 1 indicates that the two segments share more important, distinguishing terms, suggesting they cover similar specific topics or themes.
214
+
215
+ **Stopword Filtering**: When enabled (via the "Stopword Filtering" dropdown), common Tibetan particles and function words are filtered out. Switch the dropdown between "None (No filtering)" and one of the filtering levels to compare results with and without stopwords.
 
216
  """,
217
  }
218
  heatmap_tabs = {}
219
  gr.Markdown("## Detailed Metric Analysis", elem_classes="gr-markdown")
220
+
221
  with gr.Tabs(elem_id="heatmap-tab-group"):
222
+ # Process all metrics including Word Counts in a unified way
223
  for metric_key, descriptive_title in heatmap_titles.items():
224
  with gr.Tab(metric_key):
225
+ # Set CSS class based on metric type
226
+ if metric_key == "Jaccard Similarity (%)":
227
+ css_class = "metric-info-accordion jaccard-info"
228
+ accordion_title = "Understanding Vocabulary Overlap"
229
+ elif metric_key == "Normalized LCS":
230
+ css_class = "metric-info-accordion lcs-info"
231
+ accordion_title = "Understanding Sequence Patterns"
232
+ elif metric_key == "Semantic Similarity":
233
+ css_class = "metric-info-accordion semantic-info"
234
+ accordion_title = "Understanding Meaning Similarity"
235
+ elif metric_key == "TF-IDF Cosine Sim":
236
+ css_class = "metric-info-accordion tfidf-info"
237
+ accordion_title = "Understanding Term Importance"
238
+ elif metric_key == "Word Counts":
239
+ css_class = "metric-info-accordion wordcount-info"
240
+ accordion_title = "Understanding Text Length"
241
  else:
242
+ css_class = "metric-info-accordion"
243
+ accordion_title = f"About {metric_key}"
244
+
245
+ # Create the accordion with appropriate content
246
+ with gr.Accordion(accordion_title, open=False, elem_classes=css_class):
247
+ if metric_key == "Word Counts":
248
+ gr.Markdown("""
249
+ ### Word Counts per Segment
250
+ This chart displays the number of words in each segment of your texts after tokenization.
251
+ """)
252
+ elif metric_key in metric_tooltips:
253
+ gr.Markdown(value=metric_tooltips[metric_key])
254
+ else:
255
+ gr.Markdown(value=f"### {metric_key}\nDescription not found.")
256
+
257
+ # Add the appropriate plot
258
+ if metric_key == "Word Counts":
259
+ word_count_plot = gr.Plot(label="Word Counts per Segment", show_label=False)
260
+ else:
261
+ heatmap_tabs[metric_key] = gr.Plot(label=f"Heatmap: {metric_key}", show_label=False)
262
 
263
  # The outputs in process_btn.click should use the short metric names as keys for heatmap_tabs
264
  # e.g., heatmap_tabs["Jaccard Similarity (%)"]
 
269
 
270
  warning_box = gr.Markdown(visible=False)
271
 
272
+ def run_pipeline(files, enable_semantic, model_name, stopwords_option="Aggressive (All function words)", progress=gr.Progress()):
273
+ """Run the text analysis pipeline on the uploaded files.
274
+
275
+ Args:
276
+ files: List of uploaded files
277
+ enable_semantic: Whether to compute semantic similarity
278
+ model_name: Name of the embedding model to use
279
+ stopwords_option: Stopword filtering level (None, Standard, or Aggressive)
280
+ progress: Gradio progress indicator
281
+
282
+ Returns:
283
+ Tuple of (csv_path, metrics_preview, word_count_fig, heatmap_jaccard, heatmap_lcs, heatmap_semantic, heatmap_tfidf, warning_update)
284
+ """
285
+ # Initialize progress tracking
286
+ try:
287
+ progress_tracker = progress  # reuse the tracker Gradio injects through the function signature
288
+ except Exception as e:
289
+ logger.warning(f"Could not initialize progress tracker: {e}")
290
+ progress_tracker = None
291
  # Initialize all return values to ensure defined paths for all outputs
292
  csv_path_res = None
293
  metrics_preview_df_res = None # Can be a DataFrame or a string message
 
314
  - semantic_heatmap (matplotlib.figure.Figure | None): Semantic similarity heatmap, or None.
315
  - warning_update (gr.update): Gradio update for the warning box.
316
  """
317
+ # Check if files are provided
318
  if not files:
319
  return (
320
  None,
 
326
  None, # tfidf_heatmap
327
  gr.update(value="Please upload files.", visible=True),
328
  )
329
+
330
+ # Check file size limits (10MB per file)
331
+ for file in files:
332
+ file_size_mb = Path(file.name).stat().st_size / (1024 * 1024)
333
+ if file_size_mb > 10:
334
+ return (
335
+ None,
336
+ f"File '{Path(file.name).name}' exceeds the 10MB size limit (size: {file_size_mb:.2f}MB).",
337
+ None, None, None, None, None,
338
+ gr.update(value=f"Error: File '{Path(file.name).name}' exceeds the 10MB size limit.", visible=True),
339
+ )
340
 
341
  try:
342
+ if progress_tracker is not None:
343
+ try:
344
+ progress_tracker(0.1, desc="Preparing files...")
345
+ except Exception as e:
346
+ logger.warning(f"Progress update error (non-critical): {e}")
347
+
348
+ # Get filenames and read file contents
349
  filenames = [
350
  Path(file.name).name for file in files
351
  ] # Use Path().name to get just the filename
352
+ text_data = {}
353
+
354
+ # Read files with progress updates
355
+ for i, file in enumerate(files):
356
+ file_path = Path(file.name)
357
+ filename = file_path.name
358
+ if progress_tracker is not None:
359
+ try:
360
+ progress_tracker(0.1 + (0.1 * (i / len(files))), desc=f"Reading file: {filename}")
361
+ except Exception as e:
362
+ logger.warning(f"Progress update error (non-critical): {e}")
363
+
364
+ try:
365
+ text_data[filename] = file_path.read_text(encoding="utf-8-sig")
366
+ except UnicodeDecodeError:
367
+ # Try with different encodings if UTF-8 fails
368
+ try:
369
+ text_data[filename] = file_path.read_text(encoding="utf-16")
370
+ except UnicodeDecodeError:
371
+ return (
372
+ None,
373
+ f"Error: Could not decode file '{filename}'. Please ensure it contains valid Tibetan text in UTF-8 or UTF-16 encoding.",
374
+ None, None, None, None, None,
375
+ gr.update(value=f"Error: Could not decode file '{filename}'.", visible=True),
376
+ )
377
+
378
+ # Configure semantic similarity
379
+ enable_semantic_bool = enable_semantic == "Yes"
380
+
381
+ if progress_tracker is not None:
382
+ try:
383
+ progress_tracker(0.2, desc="Loading model..." if enable_semantic_bool else "Processing text...")
384
+ except Exception as e:
385
+ logger.warning(f"Progress update error (non-critical): {e}")
386
+
387
+ # Process texts with selected model
388
+ # Convert stopword option to appropriate parameters
389
+ use_stopwords = stopwords_option != "None (No filtering)"
390
+ use_lite_stopwords = stopwords_option == "Standard (Common particles only)"
391
+
392
  df_results, word_counts_df_data, warning_raw = process_texts(
393
+ text_data, filenames,
394
+ enable_semantic=enable_semantic_bool,
395
+ model_name=model_name,
396
+ use_stopwords=use_stopwords,
397
+ use_lite_stopwords=use_lite_stopwords,
398
+ progress_callback=progress_tracker
399
  )
400
 
401
  if df_results.empty:
 
407
  warning_update_res = gr.update(value=warning_message, visible=True)
408
  # Results for this case are set, then return
409
  else:
410
+ # Generate visualizations
411
+ if progress_tracker is not None:
412
+ try:
413
+ progress_tracker(0.8, desc="Generating visualizations...")
414
+ except Exception as e:
415
+ logger.warning(f"Progress update error (non-critical): {e}")
416
+
417
  # heatmap_titles is already defined in the outer scope of main_interface
418
  heatmaps_data = generate_visualizations(
419
  df_results, descriptive_titles=heatmap_titles
420
  )
421
+
422
+ # Generate word count chart
423
+ if progress_tracker is not None:
424
+ try:
425
+ progress_tracker(0.9, desc="Creating word count chart...")
426
+ except Exception as e:
427
+ logger.warning(f"Progress update error (non-critical): {e}")
428
  word_count_fig_res = generate_word_count_chart(word_counts_df_data)
429
+
430
+ # Save results to CSV
431
+ if progress_tracker is not None:
432
+ try:
433
+ progress_tracker(0.95, desc="Saving results...")
434
+ except Exception as e:
435
+ logger.warning(f"Progress update error (non-critical): {e}")
436
  csv_path_res = "results.csv"
437
  df_results.to_csv(csv_path_res, index=False)
438
+
439
+ # Prepare final output
440
  warning_md = f"**⚠️ Warning:** {warning_raw}" if warning_raw else ""
441
  metrics_preview_df_res = df_results.head(10)
442
 
443
  jaccard_heatmap_res = heatmaps_data.get("Jaccard Similarity (%)")
444
  lcs_heatmap_res = heatmaps_data.get("Normalized LCS")
445
  semantic_heatmap_res = heatmaps_data.get(
446
+ "Semantic Similarity"
447
  )
448
  tfidf_heatmap_res = heatmaps_data.get("TF-IDF Cosine Sim")
449
  warning_update_res = gr.update(
 
465
  lcs_heatmap_res,
466
  semantic_heatmap_res,
467
  tfidf_heatmap_res,
468
+ warning_update_res
469
  )
470
 
471
+ # Function to interpret results using LLM
472
+ def interpret_results(csv_path, progress=gr.Progress()):
473
+ try:
474
+ if not csv_path or not Path(csv_path).exists():
475
+ return "Please run the analysis first to generate results."
476
+
477
+ # Read the CSV file
478
+ df_results = pd.read_csv(csv_path)
479
+
480
+ # Show detailed progress messages with percentages
481
+ progress(0, desc="Preparing data for analysis...")
482
+ progress(0.1, desc="Analyzing similarity patterns...")
483
+ progress(0.2, desc="Connecting to Mistral 7B via OpenRouter...")
484
+
485
+ # Get interpretation from LLM (using OpenRouter API)
486
+ progress(0.3, desc="Generating scholarly interpretation (this may take 20-40 seconds)...")
487
+ interpretation = get_interpretation(df_results)
488
+
489
+ # Simulate completion steps
490
+ progress(0.9, desc="Formatting results...")
491
+ progress(0.95, desc="Applying scholarly formatting...")
492
+
493
+ # Completed
494
+ progress(1.0, desc="Analysis complete!")
495
+
496
+ # Add a timestamp to the interpretation
497
+ from datetime import datetime
498
+ timestamp = datetime.now().strftime("%Y-%m-%d %H:%M")
499
+ interpretation = f"{interpretation}\n\n<small>Analysis generated on {timestamp}</small>"
500
+ return interpretation
501
+ except Exception as e:
502
+ logger.error(f"Error in interpret_results: {e}", exc_info=True)
503
+ return f"Error interpreting results: {str(e)}"
504
+
505
  process_btn.click(
506
+ fn=run_pipeline,
507
+ inputs=[file_input, semantic_toggle_radio, model_dropdown, stopwords_dropdown],
508
  outputs=[
509
  csv_output,
510
  metrics_preview,
511
  word_count_plot,
512
  heatmap_tabs["Jaccard Similarity (%)"],
513
  heatmap_tabs["Normalized LCS"],
514
+ heatmap_tabs["Semantic Similarity"],
515
  heatmap_tabs["TF-IDF Cosine Sim"],
516
  warning_box,
517
+ ]
518
+ )
519
+
520
+ # Connect the interpret button
521
+ interpret_btn.click(
522
+ fn=interpret_results,
523
+ inputs=[csv_output],
524
+ outputs=interpretation_output
525
  )
526
+
527
  return demo
528
 
529
 
530
  if __name__ == "__main__":
531
  demo = main_interface()
532
+ demo.launch()
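
The Jaccard tooltip in `app.py` defines the score as the number of shared unique words divided by the number of unique words across both segments, times 100. A minimal sketch of that formula follows, assuming the segments are already tokenized. The names `jaccard_percent` and `stopwords` are hypothetical and are not taken from the `pipeline/` package, which may compute the metric differently.

```python
from typing import Iterable, List


def jaccard_percent(
    tokens_a: List[str],
    tokens_b: List[str],
    stopwords: Iterable[str] = (),
) -> float:
    """Return Jaccard similarity (%) between two token lists.

    Only the presence of a unique token matters: word order and
    frequency are ignored, as the tooltip describes.
    """
    stop = set(stopwords)
    set_a = set(tokens_a) - stop
    set_b = set(tokens_b) - stop
    union = set_a | set_b
    if not union:
        # Both segments are empty after filtering; treat as no overlap.
        return 0.0
    return len(set_a & set_b) / len(union) * 100.0


if __name__ == "__main__":
    seg_a = ["sangs-rgyas", "chos", "dang"]
    seg_b = ["sangs-rgyas", "dge-'dun", "dang"]
    # Without filtering, 2 of 4 unique tokens are shared.
    print(jaccard_percent(seg_a, seg_b))
    # Filtering the particle "dang" leaves 1 shared token out of 3.
    print(jaccard_percent(seg_a, seg_b, stopwords={"dang"}))
```

The example uses transliterated placeholder tokens. It shows why the stopword setting matters: removing a shared particle lowers the score, because the overlap is then measured on content words only.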
pipeline/fasttext_embedding.py ADDED
@@ -0,0 +1,410 @@
1
+ """
2
+ FastText embedding module for Tibetan text.
3
+ This module provides functions to train and use FastText models for Tibetan text.
4
+ """
5
+
6
+ import os
7
+ import math
8
+ import logging
9
+ import numpy as np
10
+ import fasttext
11
+ from typing import List, Optional
12
+ from huggingface_hub import hf_hub_download
13
+
14
+ # Set up logging
15
+ logger = logging.getLogger(__name__)
16
+
17
+ # Default parameters optimized for Tibetan
18
+ DEFAULT_DIM = 100
19
+ DEFAULT_EPOCH = 5
20
+ DEFAULT_MIN_COUNT = 5
21
+ DEFAULT_WINDOW = 5
22
+ DEFAULT_MINN = 3
23
+ DEFAULT_MAXN = 6
24
+ DEFAULT_NEG = 5
25
+
26
+ # Define paths for model storage
27
+ DEFAULT_MODEL_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "models")
28
+ DEFAULT_MODEL_PATH = os.path.join(DEFAULT_MODEL_DIR, "fasttext_model.bin")
29
+
30
+ # Facebook's official Tibetan FastText model
31
+ FACEBOOK_TIBETAN_MODEL_ID = "facebook/fasttext-bo-vectors"
32
+ FACEBOOK_TIBETAN_MODEL_FILE = "model.bin"
33
+
34
+ # Create models directory if it doesn't exist
35
+ os.makedirs(DEFAULT_MODEL_DIR, exist_ok=True)
36
+
37
+ def ensure_dir_exists(directory: str) -> None:
38
+ """
39
+ Ensure that a directory exists, creating it if necessary.
40
+
41
+ Args:
42
+ directory: Directory path to ensure exists
43
+ """
44
+ if not os.path.exists(directory):
45
+ os.makedirs(directory, exist_ok=True)
46
+
47
+
48
+ def train_fasttext_model(
49
+ corpus_path: str,
50
+ model_path: str = DEFAULT_MODEL_PATH,
51
+ dim: int = DEFAULT_DIM,
52
+ epoch: int = DEFAULT_EPOCH,
53
+ min_count: int = DEFAULT_MIN_COUNT,
54
+ window: int = DEFAULT_WINDOW,
55
+ minn: int = DEFAULT_MINN,
56
+ maxn: int = DEFAULT_MAXN,
57
+ neg: int = DEFAULT_NEG,
58
+ model_type: str = "skipgram"
59
+ ) -> fasttext.FastText._FastText:
60
+ """
61
+ Train a FastText model on Tibetan corpus using optimized parameters.
62
+
63
+ Args:
64
+ corpus_path: Path to the corpus file
65
+ model_path: Path where to save the trained model
66
+ dim: Embedding dimension (default: 100)
67
+ epoch: Number of training epochs (default: 5)
68
+ min_count: Minimum count of words (default: 5)
69
+ window: Size of context window (default: 5)
70
+ minn: Minimum length of char n-gram (default: 3)
71
+ maxn: Maximum length of char n-gram (default: 6)
72
+ neg: Number of negatives in negative sampling (default: 5)
73
+ model_type: FastText model type ('skipgram' or 'cbow')
74
+
75
+ Returns:
76
+ Trained FastText model
77
+ """
78
+ ensure_dir_exists(os.path.dirname(model_path))
79
+
80
+ logger.info("Training FastText model with %s, dim=%d, epoch=%d, window=%d, minn=%d, maxn=%d...",
81
+ model_type, dim, epoch, window, minn, maxn)
82
+
83
+ # Preprocess corpus for Tibetan - segment by syllable points
84
+ # This is based on research showing syllable segmentation works better for Tibetan
85
+ try:
86
+ with open(corpus_path, 'r', encoding='utf-8') as f:
87
+ content = f.read()
88
+
89
+ # Ensure syllable segmentation by adding a space after every Tibetan syllable marker (tsheg); this is unconditional, so already-spaced text gains an extra space
90
+ # This improves model quality for Tibetan text according to research
91
+ processed_content = content.replace('་', '་ ')
92
+
93
+ # Write back the processed content
94
+ with open(corpus_path, 'w', encoding='utf-8') as f:
95
+ f.write(processed_content)
96
+
97
+ logger.info("Preprocessed corpus with syllable segmentation for Tibetan text")
98
+ except Exception as e:
99
+ logger.warning("Could not preprocess corpus for syllable segmentation: %s", str(e))
100
+
101
+ # Train the model with optimized parameters
102
+ if model_type == "skipgram":
103
+ model = fasttext.train_unsupervised(
104
+ corpus_path,
105
+ model="skipgram",
106
+ dim=dim,
107
+ epoch=epoch,
108
+ minCount=min_count,
109
+ wordNgrams=1,
110
+ minn=minn,
111
+ maxn=maxn,
112
+ neg=neg,
113
+ window=window
114
+ )
115
+ else: # cbow
116
+ model = fasttext.train_unsupervised(
117
+ corpus_path,
118
+ model="cbow",
119
+ dim=dim,
120
+ epoch=epoch,
121
+ minCount=min_count,
122
+ wordNgrams=1,
123
+ minn=minn,
124
+ maxn=maxn,
125
+ neg=neg,
126
+ window=window
127
+ )
128
+
129
+ # Save the model
130
+ model.save_model(model_path)
131
+ logger.info("FastText model trained and saved to %s", model_path)
132
+
133
+ return model
134
+
135
+
136
+ def load_fasttext_model(model_path: str = DEFAULT_MODEL_PATH) -> Optional[fasttext.FastText._FastText]:
137
+ """
138
+ Load the official Facebook FastText Tibetan model, falling back to a local model file if it cannot be loaded.
139
+
140
+ Args:
141
+ model_path: Path to the model file
142
+
143
+ Returns:
144
+ Loaded FastText model or None if loading fails
145
+ """
146
+ try:
147
+ # First try to load the official Facebook FastText Tibetan model
148
+ try:
149
+ # Try to download the official Facebook FastText Tibetan model
150
+ logger.info("Attempting to download and load official Facebook FastText Tibetan model")
151
+ facebook_model_path = hf_hub_download(
152
+ repo_id=FACEBOOK_TIBETAN_MODEL_ID,
153
+ filename=FACEBOOK_TIBETAN_MODEL_FILE,
154
+ cache_dir=DEFAULT_MODEL_DIR
155
+ )
156
+ logger.info("Loading official Facebook FastText Tibetan model from %s", facebook_model_path)
157
+ return fasttext.load_model(facebook_model_path)
158
+ except Exception as e:
159
+ logger.warning("Could not load official Facebook FastText Tibetan model: %s", str(e))
160
+ logger.info("Falling back to local model")
161
+
162
+ # Fall back to local model
163
+ if os.path.exists(model_path):
164
+ logger.info("Loading local FastText model from %s", model_path)
165
+ return fasttext.load_model(model_path)
166
+ else:
167
+ logger.warning("Model path %s does not exist", model_path)
168
+ return None
169
+ except Exception as e:
170
+ logger.error("Error loading FastText model: %s", str(e))
171
+ return None
172
+
173
+
174
+ def get_text_embedding(
175
+ text: str,
176
+ model: fasttext.FastText._FastText,
177
+ tokenize_fn=None,
178
+ use_stopwords: bool = True,
179
+ stopwords_set=None,
180
+ use_tfidf_weighting: bool = True, # Enabled by default for better results
181
+ corpus_token_freq=None
182
+ ) -> np.ndarray:
183
+ """
184
+ Get embedding for a text using a FastText model with optional TF-IDF weighting.
185
+
186
+ Args:
187
+ text: Input text
188
+ model: FastText model
189
+ tokenize_fn: Optional tokenization function or pre-tokenized list
190
+ use_stopwords: Whether to filter out stopwords before computing embeddings
191
+ stopwords_set: Set of stopwords to filter out (if use_stopwords is True)
192
+ use_tfidf_weighting: Whether to use TF-IDF weighting for averaging word vectors
193
+ corpus_token_freq: Dictionary of token frequencies across corpus (required for TF-IDF)
194
+
195
+ Returns:
196
+ Text embedding vector
197
+ """
198
+ if not text.strip():
199
+ return np.zeros(model.get_dimension())
200
+
201
+ # Handle tokenization
202
+ if tokenize_fn is None:
203
+ # Simple whitespace tokenization as fallback
204
+ tokens = text.split()
205
+ elif isinstance(tokenize_fn, list):
206
+ # If tokenize_fn is already a list of tokens, use it directly
207
+ tokens = tokenize_fn
208
+ elif callable(tokenize_fn):
209
+ # If tokenize_fn is a function, call it
210
+ tokens = tokenize_fn(text)
211
+ else:
212
+ # If tokenize_fn is something else (like a string), use whitespace tokenization
213
+ logger.warning(f"Unexpected tokenize_fn type: {type(tokenize_fn)}. Using default whitespace tokenization.")
214
+ tokens = text.split()
215
+
216
+ # Filter out stopwords if enabled and stopwords_set is provided
217
+ if use_stopwords and stopwords_set is not None:
218
+ tokens = [token for token in tokens if token not in stopwords_set]
219
+
220
+ # If all tokens were filtered out as stopwords, return zero vector
221
+ if not tokens:
222
+ return np.zeros(model.get_dimension())
223
+
224
+ # Filter out empty tokens
225
+ tokens = [token for token in tokens if token.strip()]
226
+
227
+ if not tokens:
228
+ return np.zeros(model.get_dimension())
229
+
230
+ # Calculate TF-IDF weighted average if requested
231
+ if use_tfidf_weighting and corpus_token_freq is not None:
232
+ # Calculate term frequencies in this document
233
+ token_counts = {}
234
+ for token in tokens:
235
+ token_counts[token] = token_counts.get(token, 0) + 1
236
+
237
+ # Calculate IDF for each token with improved stability
238
+ N = sum(corpus_token_freq.values()) # Total number of tokens in corpus
239
+ N = max(N, 1) # Ensure N is at least 1 to avoid division by zero
240
+
241
+ # Compute TF-IDF weights with safeguards against extreme values
242
+ weights = []
243
+ for token in tokens:
244
+ # Term frequency in this document
245
+ tf = token_counts[token] / len(tokens)  # tokens is non-empty here (checked above)
246
+
247
+ # IDF-style weight from corpus token counts (not document counts), with add-one smoothing
248
+ token_freq = corpus_token_freq.get(token, 0)
249
+ idf = math.log((N + 1) / (token_freq + 1)) + 1 # Add 1 for smoothing
250
+
251
+ # TF-IDF weight with bounds to prevent extreme values
252
+ weight = tf * idf
253
+ weight = min(max(weight, 0.1), 10.0) # Limit to reasonable range
254
+ weights.append(weight)
255
+
256
+ # Normalize weights to sum to 1 with stability checks
257
+ total_weight = sum(weights)
258
+ if total_weight > 0:
259
+ weights = [w / total_weight for w in weights]
260
+ else:
261
+ # If all weights are 0, use uniform weights
262
+ weights = [1.0 / len(tokens) if len(tokens) > 0 else 0 for _ in tokens]
263
+
264
+ # Check for NaN or infinite values and replace with uniform weights if found
265
+ if any(math.isnan(w) or math.isinf(w) for w in weights):
266
+ logger.warning("Found NaN or infinite weights in TF-IDF calculation. Using uniform weights instead.")
267
+ weights = [1.0 / len(tokens) if len(tokens) > 0 else 0 for _ in tokens]
268
+
269
+ # Get vectors for each token and apply weights
270
+ vectors = [model.get_word_vector(token) for token in tokens]
271
+ weighted_vectors = [w * v for w, v in zip(weights, vectors)]
272
+
273
+ # Sum the weighted vectors
274
+ return np.sum(weighted_vectors, axis=0)
275
+ else:
276
+ # Simple averaging if TF-IDF is not enabled or corpus frequencies not provided
277
+ vectors = [model.get_word_vector(token) for token in tokens]
278
+ return np.mean(vectors, axis=0)
279
+
280
+
281
+ def get_batch_embeddings(
282
+ texts: List[str],
283
+ model: fasttext.FastText._FastText,
284
+ tokenize_fn=None,
285
+ use_stopwords: bool = True,
286
+ stopwords_set=None,
287
+ use_tfidf_weighting: bool = True, # Enabled by default for better results
288
+ corpus_token_freq=None
289
+ ) -> np.ndarray:
290
+ """
291
+ Get embeddings for a batch of texts with optional TF-IDF weighting.
292
+
293
+ Args:
294
+ texts: List of input texts
295
+ model: FastText model
296
+ tokenize_fn: Optional tokenization function or pre-tokenized list of tokens
297
+ use_stopwords: Whether to filter out stopwords before computing embeddings
298
+ stopwords_set: Set of stopwords to filter out (if use_stopwords is True)
299
+ use_tfidf_weighting: Whether to use TF-IDF weighting for averaging word vectors
300
+ corpus_token_freq: Dictionary of token frequencies across corpus (required for TF-IDF)
301
+
302
+ Returns:
303
+ Array of text embedding vectors
304
+ """
305
+ # If corpus_token_freq is not provided but TF-IDF is requested, build it from the texts
306
+ if use_tfidf_weighting and corpus_token_freq is None:
307
+ logger.info("Building corpus token frequency dictionary for TF-IDF weighting")
308
+ corpus_token_freq = {}
309
+
310
+ # Process each text to build corpus token frequencies
311
+ for i, text in enumerate(texts):
312
+ if not text.strip():
313
+ continue
314
+
315
+ # Handle tokenization
316
+ if tokenize_fn is None:
317
+ tokens = text.split()
318
+ elif isinstance(tokenize_fn, list):
319
+ # tokenize_fn is a list of token lists (one list of tokens per text),
320
+ # so use the entry that belongs to this text; fall back to no tokens if it is missing
321
+ tokens = tokenize_fn[i] if i < len(tokenize_fn) else []
322
+ else:
323
+ tokens = tokenize_fn(text)
324
+
325
+ # Filter out stopwords if enabled
326
+ if use_stopwords and stopwords_set is not None:
327
+ tokens = [token for token in tokens if token not in stopwords_set]
328
+
329
+ # Update corpus token frequencies
330
+ for token in tokens:
331
+ if token.strip(): # Skip empty tokens
332
+ corpus_token_freq[token] = corpus_token_freq.get(token, 0) + 1
333
+
334
+ logger.info("Built corpus token frequency dictionary with %d unique tokens", len(corpus_token_freq))
335
+
336
+ # Get embeddings for each text
337
+ embeddings = []
338
+ for i, text in enumerate(texts):
339
+ # Handle pre-tokenized input
340
+ tokens = None
341
+ if isinstance(tokenize_fn, list):
342
+ if i < len(tokenize_fn):
343
+ tokens = tokenize_fn[i]
344
+
345
+ embedding = get_text_embedding(
346
+ text,
347
+ model,
348
+ tokenize_fn=tokens if isinstance(tokenize_fn, list) else tokenize_fn, # Pre-tokenized list if given, otherwise the original tokenizer (or None)
349
+ use_stopwords=use_stopwords,
350
+ stopwords_set=stopwords_set,
351
+ use_tfidf_weighting=use_tfidf_weighting,
352
+ corpus_token_freq=corpus_token_freq
353
+ )
354
+ embeddings.append(embedding)
355
+
356
+ return np.array(embeddings)
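Review note: `get_batch_embeddings` accepts either a callable tokenizer or a pre-tokenized list of token lists. The sketch below shows one way to resolve the tokens for a given text index. It is a minimal illustration, not the module's code: `resolve_tokens` is a hypothetical helper name, and falling back to whitespace splitting for a missing list entry is an assumption.

```python
def resolve_tokens(text, index, tokenize_fn=None):
    """Return the tokens for texts[index].

    tokenize_fn may be None (whitespace split), a list of token lists
    (one per text), or a callable that tokenizes a string.
    """
    if tokenize_fn is None:
        return text.split()
    if isinstance(tokenize_fn, list):
        # Use the entry for this text; fall back to whitespace if it is missing.
        return tokenize_fn[index] if index < len(tokenize_fn) else text.split()
    return tokenize_fn(text)


texts = ["a b", "c d e"]
pre_tokenized = [["a", "b"], ["c", "d e"]]

from_list = [resolve_tokens(t, i, pre_tokenized) for i, t in enumerate(texts)]
from_callable = [resolve_tokens(t, i, lambda s: s.split()) for i, t in enumerate(texts)]
```

Resolving per index keeps the corpus frequency pass and the embedding pass in agreement about which tokens belong to which text.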
357
+
358
+
359
+ def generate_embeddings(
360
+ texts: List[str],
361
+ model: fasttext.FastText._FastText,
362
+ device: str,
363
+ model_type: str = "sentence_transformer",
364
+ tokenize_fn=None,
365
+ use_stopwords: bool = True,
366
+ use_lite_stopwords: bool = False
367
+ ) -> np.ndarray:
368
+ """
369
+ Generate embeddings for a list of texts using a FastText model.
370
+
371
+ Args:
372
+ texts: List of input texts
373
+ model: FastText model
374
+ device: Device to use for computation (not used for FastText)
375
+ model_type: Model type ('sentence_transformer' or 'fasttext')
376
+ tokenize_fn: Optional tokenization function or pre-tokenized list of tokens
377
+ use_stopwords: Whether to filter out stopwords
378
+ use_lite_stopwords: Whether to use a lighter set of stopwords
379
+
380
+ Returns:
381
+ Array of text embedding vectors
382
+ """
383
+ if model_type != "fasttext":
384
+ logger.warning("Model type %s not supported for FastText. Using FastText anyway.", model_type)
385
+
386
+ # Generate embeddings using FastText
387
+ try:
388
+ # Load stopwords if needed
389
+ stopwords_set = None
390
+ if use_stopwords:
391
+ from .tibetan_stopwords import get_stopwords
392
+ stopwords_set = get_stopwords(use_lite=use_lite_stopwords)
393
+ logger.info("Loaded %d Tibetan stopwords", len(stopwords_set))
394
+
395
+ # Generate embeddings
396
+ embeddings = get_batch_embeddings(
397
+ texts,
398
+ model,
399
+ tokenize_fn=tokenize_fn,
400
+ use_stopwords=use_stopwords,
401
+ stopwords_set=stopwords_set,
402
+ use_tfidf_weighting=True # Enable TF-IDF weighting for better results
403
+ )
404
+
405
+ logger.info("FastText embeddings generated with shape: %s", str(embeddings.shape))
406
+ return embeddings
407
+ except Exception as e:
408
+ logger.error("Error generating FastText embeddings: %s", str(e))
409
+ # Return all-zero embeddings as fallback
410
+ return np.zeros((len(texts), model.get_dimension()))
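Review note: the weighting used in `get_text_embedding` can be reproduced without FastText or NumPy. The sketch below is a dependency-free illustration under stated assumptions: `toy_vectors` is a hypothetical lookup standing in for `model.get_word_vector`, and plain lists replace NumPy arrays. It mirrors the smoothed IDF term, the clamp to [0.1, 10.0], and the normalization to a sum of 1.

```python
import math
from collections import Counter


def tfidf_weights(tokens, corpus_token_freq, low=0.1, high=10.0):
    """Return one normalized weight per token (weights sum to 1)."""
    counts = Counter(tokens)
    n_total = max(sum(corpus_token_freq.values()), 1)  # total tokens in corpus
    raw = []
    for token in tokens:
        tf = counts[token] / len(tokens)
        idf = math.log((n_total + 1) / (corpus_token_freq.get(token, 0) + 1)) + 1
        raw.append(min(max(tf * idf, low), high))  # clamp extreme weights
    total = sum(raw)
    return [w / total for w in raw]


def weighted_embedding(tokens, vectors, corpus_token_freq):
    """Weighted sum of word vectors; returns a zero vector for empty input."""
    dim = len(next(iter(vectors.values())))
    out = [0.0] * dim
    if not tokens:
        return out
    for w, token in zip(tfidf_weights(tokens, corpus_token_freq), tokens):
        for i, x in enumerate(vectors[token]):
            out[i] += w * x
    return out


# Hypothetical two-dimensional vectors: "a" is frequent in the corpus, "b" is rare.
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
corpus_freq = {"a": 100, "b": 1}
embedding = weighted_embedding(["a", "b"], toy_vectors, corpus_freq)
```

With these toy inputs the rare token `b` receives the larger weight, so the second component of `embedding` dominates. Note that for long segments `tf * idf` tends to fall below the 0.1 floor, in which case the clamp makes the weights close to uniform.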
pipeline/llm_service.py ADDED
@@ -0,0 +1,644 @@
1
+ """
2
+ LLM Service for Tibetan Text Metrics
3
+
4
+ This module provides a unified interface for analyzing text similarity metrics
5
+ using both LLM-based and rule-based approaches.
6
+ """
7
+
8
+ import os
9
+ import json
10
+ import logging
11
+ import requests
12
+ import pandas as pd
13
+ import re
14
+
15
+ # Set up logging
16
+ logger = logging.getLogger(__name__)
17
+
18
+ # Try to load environment variables
19
+ ENV_LOADED = False
20
+ try:
21
+ from dotenv import load_dotenv
22
+ load_dotenv()
23
+ ENV_LOADED = True
24
+ except ImportError:
25
+ logger.warning("python-dotenv not installed. Using system environment variables.")
26
+
27
+ # Constants
28
+ DEFAULT_MAX_TOKENS = 4000
29
+ DEFAULT_MODEL = "mistralai/mistral-7b-instruct"
30
+ DEFAULT_TEMPERATURE = 0.3
31
+ DEFAULT_TOP_P = 0.9
32
+
33
+ class LLMService:
34
+ """
35
+ Service for analyzing text similarity metrics using LLMs and rule-based methods.
36
+ """
37
+
38
+ def __init__(self, api_key: str = None):
39
+ """
40
+ Initialize the LLM service.
41
+
42
+ Args:
43
+ api_key: Optional API key for OpenRouter. If not provided, will try to load from environment.
44
+ """
45
+ self.api_key = api_key or os.getenv('OPENROUTER_API_KEY')
46
+ self.model = DEFAULT_MODEL
47
+ self.temperature = DEFAULT_TEMPERATURE
48
+ self.top_p = DEFAULT_TOP_P
49
+
50
+ def analyze_similarity(
51
+ self,
52
+ results_df: pd.DataFrame,
53
+ use_llm: bool = True,
54
+ ) -> str:
55
+ """
56
+ Analyze similarity metrics using either LLM or rule-based approach.
57
+
58
+ Args:
59
+ results_df: DataFrame containing similarity metrics
60
+ use_llm: Whether to use LLM for analysis (falls back to rule-based if False or on error)
61
+
62
+ Returns:
63
+ str: Analysis of the metrics in markdown format with appropriate fallback messages
64
+ """
65
+ # If LLM is disabled, use rule-based analysis
66
+ if not use_llm:
67
+ logger.info("LLM analysis disabled. Using rule-based analysis.")
68
+ return self._analyze_with_rules(results_df)
69
+
70
+ # Try LLM analysis if enabled
71
+ try:
72
+ if not self.api_key:
73
+ raise ValueError("No OpenRouter API key provided. Please set the OPENROUTER_API_KEY environment variable.")
74
+
75
+ logger.info("Attempting LLM-based analysis...")
76
+ return self._analyze_with_llm(results_df, max_tokens=DEFAULT_MAX_TOKENS)
77
+
78
+ except Exception as e:
79
+ error_msg = str(e)
80
+ logger.error(f"Error in LLM analysis: {error_msg}")
81
+
82
+ # Create a user-friendly error message
83
+ if "payment" in error_msg.lower() or "402" in error_msg:
84
+ error_note = "OpenRouter API payment required. Falling back to rule-based analysis."
85
+ elif "invalid" in error_msg.lower() or "401" in error_msg:
86
+ error_note = "Invalid OpenRouter API key. Falling back to rule-based analysis."
87
+ elif "rate limit" in error_msg.lower() or "429" in error_msg:
88
+ error_note = "API rate limit exceeded. Falling back to rule-based analysis."
89
+ else:
90
+ error_note = f"LLM analysis failed: {error_msg[:200]}. Falling back to rule-based analysis."
91
+
92
+ # Get rule-based analysis
93
+ rule_based_analysis = self._analyze_with_rules(results_df)
94
+
95
+ # Combine the error message with the rule-based analysis
96
+ return f"## Analysis of Tibetan Text Similarity Metrics\n\n*Note: {error_note}*\n\n{rule_based_analysis}"
97
+
98
+ def _prepare_dataframe(self, df: pd.DataFrame) -> pd.DataFrame:
99
+ """
100
+ Prepare the DataFrame for analysis.
101
+
102
+ Args:
103
+ df: Input DataFrame with similarity metrics
104
+
105
+ Returns:
106
+ pd.DataFrame: Cleaned and prepared DataFrame
107
+ """
108
+ # Make a copy to avoid modifying the original
109
+ df = df.copy()
110
+
111
+ # Clean text columns
112
+ text_cols = ['Text A', 'Text B']
113
+ for col in text_cols:
114
+ if col in df.columns:
115
+ df[col] = df[col].fillna('Unknown').astype(str)
116
+ df[col] = df[col].str.replace(r'\.txt$', '', regex=True)  # escape the dot so only a literal ".txt" suffix is stripped
117
+
118
+ # Filter out perfect matches (likely empty cells)
119
+ metrics_cols = ['Jaccard Similarity (%)', 'Normalized LCS', 'TF-IDF Cosine Sim']
120
+ if all(col in df.columns for col in metrics_cols):
121
+ mask = ~((df['Jaccard Similarity (%)'] == 100.0) &
122
+ (df['Normalized LCS'] == 1.0) &
123
+ (df['TF-IDF Cosine Sim'] == 1.0))
124
+ df = df[mask].copy()
125
+
126
+ return df
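Review note: when a `.txt` suffix is stripped with a regular expression, the dot has to be escaped, because an unescaped `.` matches any character. The sketch below uses the standard `re` module rather than pandas so that it runs on its own; `strip_txt_suffix` and the file names are hypothetical.

```python
import re


def strip_txt_suffix(name: str) -> str:
    """Remove a literal '.txt' suffix and nothing else."""
    return re.sub(r"\.txt$", "", name)


with_suffix = strip_txt_suffix("Dolanji.txt")
no_suffix = strip_txt_suffix("notes_txt")

# For comparison: the unescaped pattern also eats "_txt".
unescaped = re.sub(".txt$", "", "notes_txt")
```

The same pattern string works with `Series.str.replace(..., regex=True)` in pandas.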
127
+
128
+ def _analyze_with_llm(self, df: pd.DataFrame, max_tokens: int) -> str:
129
+ """
130
+ Analyze metrics using an LLM via OpenRouter API.
131
+
132
+ Args:
133
+ df: Prepared DataFrame with metrics
134
+ max_tokens: Maximum tokens for the response
135
+
136
+ Returns:
137
+ str: LLM analysis in markdown format
138
+ """
139
+ # Prepare the prompt with data and instructions
140
+ prompt = self._create_llm_prompt(df)
141
+
142
+ try:
143
+ # Call the LLM API
144
+ response = self._call_openrouter_api(
145
+ prompt=prompt,
146
+ system_message=self._get_system_prompt(),
147
+ max_tokens=max_tokens,
148
+ temperature=self.temperature,
149
+ top_p=self.top_p
150
+ )
151
+
152
+ # Process and format the response
153
+ return self._format_llm_response(response, df)
154
+
155
+ except Exception as e:
156
+ logger.error(f"Error in LLM analysis: {str(e)}")
157
+ raise
158
+
159
+ def _analyze_with_rules(self, df: pd.DataFrame) -> str:
160
+ """
161
+ Analyze metrics using rule-based approach.
162
+
163
+ Args:
164
+ df: Prepared DataFrame with metrics
165
+
166
+ Returns:
167
+ str: Rule-based analysis in markdown format
168
+ """
169
+ analysis = ["## Tibetan Text Similarity Analysis (Rule-Based)"]
170
+
171
+ # Basic stats
172
+ text_a_col = 'Text A' if 'Text A' in df.columns else None
173
+ text_b_col = 'Text B' if 'Text B' in df.columns else None
174
+
175
+ if text_a_col and text_b_col:
176
+ unique_texts = set(df[text_a_col].unique()) | set(df[text_b_col].unique())
177
+ analysis.append(f"- **Texts analyzed:** {', '.join(sorted(unique_texts))}")
178
+
179
+ # Analyze each metric
180
+ metric_analyses = []
181
+
182
+ if 'Jaccard Similarity (%)' in df.columns:
183
+ jaccard_analysis = self._analyze_jaccard(df)
184
+ metric_analyses.append(jaccard_analysis)
185
+
186
+ if 'Normalized LCS' in df.columns:
187
+ lcs_analysis = self._analyze_lcs(df)
188
+ metric_analyses.append(lcs_analysis)
189
+
190
+ if 'TF-IDF Cosine Sim' in df.columns:
191
+ tfidf_analysis = self._analyze_tfidf(df)
192
+ metric_analyses.append(tfidf_analysis)
193
+
194
+ # Add all metric analyses
195
+ if metric_analyses:
196
+ analysis.extend(metric_analyses)
197
+
198
+ # Add overall interpretation
199
+ analysis.append("\n## Overall Interpretation")
200
+ analysis.append(self._generate_overall_interpretation(df))
201
+
202
+ return "\n\n".join(analysis)
203
+
204
+ def _analyze_jaccard(self, df: pd.DataFrame) -> str:
205
+ """Analyze Jaccard similarity scores."""
206
+ jaccard = df['Jaccard Similarity (%)'].dropna()
207
+ if jaccard.empty:
208
+ return ""
209
+
210
+ mean_jaccard = jaccard.mean()
211
+ max_jaccard = jaccard.max()
212
+ min_jaccard = jaccard.min()
213
+
214
+ analysis = [
215
+ "### Jaccard Similarity Analysis",
216
+ f"- **Range:** {min_jaccard:.1f}% to {max_jaccard:.1f}% (mean: {mean_jaccard:.1f}%)"
217
+ ]
218
+
219
+ # Interpret the scores
220
+ if mean_jaccard > 60:
221
+ analysis.append("- **High vocabulary overlap** suggests texts share significant content or are from the same tradition.")
222
+ elif mean_jaccard > 30:
223
+ analysis.append("- **Moderate vocabulary overlap** indicates some shared content or themes.")
224
+ else:
225
+ analysis.append("- **Low vocabulary overlap** suggests texts are on different topics or from different traditions.")
226
+
227
+ # Add top pairs
228
+ top_pairs = df.nlargest(3, 'Jaccard Similarity (%)')
229
+ if not top_pairs.empty:
230
+ analysis.append("\n**Most similar pairs:**")
231
+ for _, row in top_pairs.iterrows():
232
+ text_a = row.get('Text A', 'Text 1')
233
+ text_b = row.get('Text B', 'Text 2')
234
+ score = row['Jaccard Similarity (%)']
235
+ analysis.append(f"- {text_a} ↔ {text_b}: {score:.1f}%")
236
+
237
+ return "\n".join(analysis)
238
+
239
+ def _analyze_lcs(self, df: pd.DataFrame) -> str:
240
+ """Analyze Longest Common Subsequence scores."""
241
+ lcs = df['Normalized LCS'].dropna()
242
+ if lcs.empty:
243
+ return ""
244
+
245
+ mean_lcs = lcs.mean()
246
+ max_lcs = lcs.max()
247
+ min_lcs = lcs.min()
248
+
249
+ analysis = [
250
+ "### Structural Similarity (LCS) Analysis",
251
+ f"- **Range:** {min_lcs:.2f} to {max_lcs:.2f} (mean: {mean_lcs:.2f})"
252
+ ]
253
+
254
+ # Interpret the scores
255
+ if mean_lcs > 0.7:
256
+ analysis.append("- **High structural similarity** suggests texts follow similar organizational patterns.")
257
+ elif mean_lcs > 0.4:
258
+ analysis.append("- **Moderate structural similarity** indicates some shared organizational elements.")
259
+ else:
260
+ analysis.append("- **Low structural similarity** suggests different organizational approaches.")
261
+
262
+ # Add top pairs
263
+ top_pairs = df.nlargest(3, 'Normalized LCS')
264
+ if not top_pairs.empty:
265
+ analysis.append("\n**Most structurally similar pairs:**")
266
+ for _, row in top_pairs.iterrows():
267
+ text_a = row.get('Text A', 'Text 1')
268
+ text_b = row.get('Text B', 'Text 2')
269
+ score = row['Normalized LCS']
270
+ analysis.append(f"- {text_a} ↔ {text_b}: {score:.2f}")
271
+
272
+ return "\n".join(analysis)
273
+
274
+ def _analyze_tfidf(self, df: pd.DataFrame) -> str:
275
+ """Analyze TF-IDF cosine similarity scores."""
276
+ tfidf = df['TF-IDF Cosine Sim'].dropna()
277
+ if tfidf.empty:
278
+ return ""
279
+
280
+ mean_tfidf = tfidf.mean()
281
+ max_tfidf = tfidf.max()
282
+ min_tfidf = tfidf.min()
283
+
284
+ analysis = [
285
+ "### Thematic Similarity (TF-IDF) Analysis",
286
+ f"- **Range:** {min_tfidf:.2f} to {max_tfidf:.2f} (mean: {mean_tfidf:.2f})"
287
+ ]
288
+
289
+ # Interpret the scores
290
+ if mean_tfidf > 0.8:
291
+ analysis.append("- **High thematic similarity** suggests texts share distinctive terms and concepts.")
292
+ elif mean_tfidf > 0.5:
293
+ analysis.append("- **Moderate thematic similarity** indicates some shared distinctive terms.")
294
+ else:
295
+ analysis.append("- **Low thematic similarity** suggests different conceptual focuses.")
296
+
297
+ # Add top pairs
298
+ top_pairs = df.nlargest(3, 'TF-IDF Cosine Sim')
299
+ if not top_pairs.empty:
300
+ analysis.append("\n**Most thematically similar pairs:**")
301
+ for _, row in top_pairs.iterrows():
302
+ text_a = row.get('Text A', 'Text 1')
303
+ text_b = row.get('Text B', 'Text 2')
304
+ score = row['TF-IDF Cosine Sim']
305
+ analysis.append(f"- {text_a} ↔ {text_b}: {score:.2f}")
306
+
307
+ return "\n".join(analysis)
308
+
309
+ def _generate_overall_interpretation(self, df: pd.DataFrame) -> str:
310
+ """Generate an overall interpretation of the metrics."""
311
+ interpretations = []
312
+
313
+ # Get metrics if they exist
314
+ has_jaccard = 'Jaccard Similarity (%)' in df.columns
315
+ has_lcs = 'Normalized LCS' in df.columns
316
+ has_tfidf = 'TF-IDF Cosine Sim' in df.columns
317
+
318
+ # Calculate means for available metrics
319
+ metrics = {}
320
+ if has_jaccard:
321
+ metrics['jaccard'] = df['Jaccard Similarity (%)'].mean()
322
+ if has_lcs:
323
+ metrics['lcs'] = df['Normalized LCS'].mean()
324
+ if has_tfidf:
325
+ metrics['tfidf'] = df['TF-IDF Cosine Sim'].mean()
326
+
327
+ # Generate interpretation based on metrics
328
+ if metrics:
329
+ interpretations.append("Based on the analysis of similarity metrics:")
330
+
331
+ if has_jaccard and metrics['jaccard'] > 60:
332
+ interpretations.append("- The high Jaccard similarity indicates significant vocabulary overlap between texts, "
333
+ "suggesting they may share common sources or be part of the same textual tradition.")
334
+
335
+ if has_lcs and metrics['lcs'] > 0.7:
336
+ interpretations.append("- The high LCS score indicates strong structural similarity, "
337
+ "suggesting the texts may follow similar organizational patterns or share common structural elements.")
338
+
339
+ if has_tfidf and metrics['tfidf'] > 0.8:
340
+ interpretations.append("- The high TF-IDF similarity suggests the texts share distinctive terms and concepts, "
341
+ "indicating they may cover similar topics or themes.")
342
+
343
+ # Add cross-metric interpretations
344
+ if has_jaccard and has_lcs and metrics['jaccard'] > 60 and metrics['lcs'] > 0.7:
345
+ interpretations.append("\nThe combination of high Jaccard and LCS similarities strongly suggests "
346
+ "that these texts are closely related, possibly being different versions or "
347
+ "transmissions of the same work or sharing a common source.")
348
+
349
+ if has_tfidf and has_jaccard and metrics['tfidf'] < 0.5 and metrics['jaccard'] > 60:
350
+ interpretations.append("\nThe high Jaccard but lower TF-IDF similarity suggests that while the texts "
351
+ "share many common words, they may use them in different contexts or with different "
352
+ "meanings, possibly indicating different interpretations of similar material.")
353
+
354
+ # Add general guidance if no specific patterns found
355
+ if not interpretations:
356
+ interpretations.append("The analysis did not reveal strong patterns in the similarity metrics. "
357
+ "This could indicate that the texts are either very similar or very different "
358
+ "across all measured dimensions.")
359
+
360
+ return "\n\n".join(interpretations)
361
+
362
+ def _create_llm_prompt(self, df: pd.DataFrame) -> str:
363
+ """
364
+ Create a prompt for the LLM based on the DataFrame.
365
+
366
+ Args:
367
+ df: Prepared DataFrame with metrics
368
+
369
+ Returns:
370
+ str: Formatted prompt for the LLM
371
+ """
372
+ # Format the CSV data for the prompt
373
+ csv_data = df.to_csv(index=False)
374
+
375
+ # Create the prompt using the user's template
376
+ prompt = """You are a specialized text analysis interpreter with expertise in Tibetan textual studies. Your task is to analyze text similarity data from a CSV file and create a clear, narrative explanation for scholars who may not have technical expertise.
377
+
378
+ <CONTEXT>
379
+ This data comes from a text similarity analysis tool designed for various genres of Tibetan sources including historical, religious, literary, and philosophical texts. The tool compares texts using multiple linguistic metrics:
380
+ - Jaccard Similarity (%): Measures word overlap between texts (higher % = more similar)
381
+ - Normalized LCS: Longest Common Subsequence, measuring sequential text patterns
382
+ - Semantic Similarity: Deep meaning comparison using sentence transformers or fasttext
383
+ - TF-IDF Cosine Similarity: Term frequency-inverse document frequency comparison
384
+ The "Chapter" column indicates which chapter/section of the texts is being compared.
385
+ </CONTEXT>
386
+
387
+ <INSTRUCTIONS>
388
+ 1. Begin by identifying the specific texts being compared in the data (e.g., "Japan13.txt vs Dolanji.txt").
389
+
390
+ 2. Create a dual-layer narrative analysis (800-1000 words) that includes:
391
+ a) A high-level overview of text similarity patterns accessible to non-technical readers
392
+ b) A more detailed analysis for scholars interested in specific textual relationships
393
+
394
+ 3. In your analysis:
395
+ - Summarize overall similarity patterns between the texts across all chapters
396
+ - Identify which chapters show strongest similarities and differences
397
+ - Explain whether similarities appear to be more lexical (Jaccard, LCS) or conceptual (Semantic)
398
+ - Interpret what these patterns might suggest about textual relationships, transmission, or variant histories
399
+ - Note any interesting anomalies (e.g., chapters with high semantic but low lexical similarity)
400
+
401
+ 4. Structure your analysis with:
402
+ - An introduction explaining the texts compared and general observations
403
+ - A section on overall patterns across all chapters with visualized trends
404
+ - A detailed examination of 2-3 notable chapters (highest/lowest similarity)
405
+ - A discussion of what different metrics reveal about textual relationships
406
+ - A conclusion suggesting what these patterns might mean for Tibetan textual scholarship
407
+ - 2-3 specific questions these findings raise for further investigation
408
+
409
+ 5. Connect your analysis to common interests in Tibetan textual studies such as:
410
+ - Textual transmission and lineages
411
+ - Regional variants and dialectal differences
412
+ - Potential historical relationships between texts
413
+ - Original vs. commentary material identification
414
+
415
+ 6. Consider using a "family tree" analogy to make the textual relationships more intuitive. For example:
416
+ - Texts with very high similarity (>80%) might be described as "siblings" from the same direct source
417
+ - Texts with moderate similarity (50-80%) could be "cousins" sharing a common ancestor but with separate development
418
+ - Texts with low similarity (<50%) might be "distant relatives" with only fundamental connections
419
+ Use this metaphor if it helps clarify the relationships, but don't force it if another explanation would be clearer.
420
+
421
+ 7. **Important note on perfect or zero similarity matches:**
422
+ If you notice that all metrics indicate perfect or near-perfect similarity (for example, scores of 1.0/100 across all metrics for a chapter) or 0 for a complete mismatch, this may not indicate true textual identity or lack thereof. Instead, it likely means both corresponding text cells were empty or contained no content. In these cases, be sure to clarify in your narrative that such results are *artifacts of missing data, not genuine textual matches*, and should be interpreted with caution.
423
+
424
+ 8. Balance scholarly precision with accessibility, explaining technical concepts when necessary while keeping the overall narrative engaging for non-technical readers.
425
+ </INSTRUCTIONS>
426
+
427
+ Here is the CSV data to analyze:
428
+ [CSV_DATA]
429
+ """
430
+
431
+ # Replace [CSV_DATA] with the actual CSV data
432
+ prompt = prompt.replace("[CSV_DATA]", csv_data)
433
+
434
+ return prompt
435
+
436
+ def _get_system_prompt(self) -> str:
437
+ """Get the system prompt for the LLM."""
438
+ return """
439
+ You are a senior scholar of Tibetan Buddhist texts with expertise in textual criticism and
440
+ comparative analysis. Your task is to analyze the provided similarity metrics and provide
441
+ expert-level insights into the relationships between these Tibetan texts.
442
+
443
+ CRITICAL INSTRUCTIONS:
444
+ 1. Your analysis MUST be grounded in the specific metrics provided
445
+ 2. Always reference actual text names and metric values when making claims
446
+ 3. Focus on what the data shows, not what it might show
447
+ 4. Be precise and avoid vague or generic statements
448
+
449
+ ANALYSIS APPROACH:
450
+ 1. Begin with a brief executive summary of the most significant findings
451
+ 2. Group similar text pairs and explain their relationships
452
+ 3. Highlight any patterns that suggest textual transmission or common sources
453
+ 4. Note any anomalies or unexpected results that merit further investigation
454
+ 5. Provide specific examples from the data to support your analysis
455
+
456
+ TIBETAN TEXT-SPECIFIC GUIDANCE:
457
+ - Consider the implications of shared vocabulary in the context of Tibetan Buddhist literature
458
+ - Be aware that high LCS scores might indicate shared liturgical or formulaic language
459
+ - Note that texts with similar Jaccard but different LCS scores might share content but differ in structure
460
+ - Consider the possibility of text reuse, commentary traditions, or shared sources
461
+
462
+ Your analysis should be scholarly but accessible, providing clear insights that would be
463
+ valuable to researchers studying these texts.
464
+ """
465
+
466
+ def _call_openrouter_api(
467
+ self,
468
+ prompt: str,
469
+ system_message: str = None,
470
+ max_tokens: int = None,
471
+ temperature: float = None,
472
+ top_p: float = None
473
+ ) -> str:
474
+ """
475
+ Call the OpenRouter API.
476
+
477
+ Args:
478
+ prompt: The user prompt
479
+ system_message: Optional system message
480
+ max_tokens: Maximum tokens for the response
481
+ temperature: Sampling temperature
482
+ top_p: Nucleus sampling parameter
483
+
484
+ Returns:
485
+ str: The API response
486
+
487
+ Raises:
488
+ ValueError: If API key is missing or invalid
489
+ requests.exceptions.RequestException: For network-related errors
490
+ Exception: For other API-related errors
491
+ """
492
+ if not self.api_key:
493
+ error_msg = "OpenRouter API key not provided. Please set the OPENROUTER_API_KEY environment variable."
494
+ logger.error(error_msg)
495
+ raise ValueError(error_msg)
496
+
497
+ url = "https://openrouter.ai/api/v1/chat/completions"
498
+
499
+ headers = {
500
+ "Authorization": f"Bearer {self.api_key}",
501
+ "Content-Type": "application/json",
502
+ "HTTP-Referer": "https://github.com/daniel-wojahn/tibetan-text-metrics",
503
+ "X-Title": "Tibetan Text Metrics"
504
+ }
505
+
506
+ messages = []
507
+ if system_message:
508
+ messages.append({"role": "system", "content": system_message})
509
+ messages.append({"role": "user", "content": prompt})
510
+
511
+ data = {
512
+ "model": self.model,
513
+ "messages": messages,
514
+ "max_tokens": max_tokens or DEFAULT_MAX_TOKENS,
515
+ "temperature": temperature if temperature is not None else self.temperature,
516
+ "top_p": top_p if top_p is not None else self.top_p,
517
+ }
518
+
519
+ try:
520
+ logger.info(f"Calling OpenRouter API with model: {self.model}")
521
+ response = requests.post(url, headers=headers, json=data, timeout=60)
522
+
523
+ # Handle different HTTP status codes
524
+ if response.status_code == 200:
525
+ result = response.json()
526
+ if 'choices' in result and len(result['choices']) > 0:
527
+ return result['choices'][0]['message']['content'].strip()
528
+ else:
529
+ error_msg = "Unexpected response format from OpenRouter API"
530
+ logger.error(f"{error_msg}: {result}")
531
+ raise ValueError(error_msg)
532
+
533
+ elif response.status_code == 401:
534
+ error_msg = "Invalid OpenRouter API key. Please check your API key and try again."
535
+ logger.error(error_msg)
536
+ raise ValueError(error_msg)
537
+
538
+ elif response.status_code == 402:
539
+ error_msg = "OpenRouter API payment required. Please check your OpenRouter account balance or billing status."
540
+ logger.error(error_msg)
541
+ raise ValueError(error_msg)
542
+
543
+ elif response.status_code == 429:
544
+ error_msg = "API rate limit exceeded. Please try again later or check your OpenRouter rate limits."
545
+ logger.error(error_msg)
546
+ raise ValueError(error_msg)
547
+
548
+ else:
549
+ error_msg = f"OpenRouter API error: {response.status_code} - {response.text}"
550
+ logger.error(error_msg)
551
+ raise Exception(error_msg)
552
+
553
+ except requests.exceptions.RequestException as e:
554
+ error_msg = f"Failed to connect to OpenRouter API: {str(e)}"
555
+ logger.error(error_msg)
556
+ raise Exception(error_msg) from e
557
+
558
+ except json.JSONDecodeError as e:
559
+ error_msg = f"Failed to parse OpenRouter API response: {str(e)}"
560
+ logger.error(error_msg)
561
+ raise Exception(error_msg) from e
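Review note: when optional sampling parameters are resolved against defaults, an `is not None` check keeps an explicit `0.0` intact, whereas `value or default` treats `0.0` as missing. The sketch below is a minimal illustration; `resolve` is a hypothetical helper, and the default value is copied from `DEFAULT_TEMPERATURE` above.

```python
DEFAULT_TEMPERATURE = 0.3


def resolve(value, default):
    """Return value unless it is None; 0 and 0.0 count as real values."""
    return default if value is None else value


explicit_zero = resolve(0.0, DEFAULT_TEMPERATURE)  # caller asked for temperature 0.0
unset = resolve(None, DEFAULT_TEMPERATURE)         # caller passed nothing

# For comparison: `or` discards the explicit 0.0 because it is falsy.
with_or = 0.0 or DEFAULT_TEMPERATURE
```

This matters for callers who request a temperature of `0.0` for more repeatable output.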
562
+
563
+ def _format_llm_response(self, response: str, df: pd.DataFrame) -> str:
564
+ """
565
+ Format the LLM response for display.
566
+
567
+ Args:
568
+ response: Raw LLM response
569
+ df: Original DataFrame for reference
570
+
571
+ Returns:
572
+ str: Formatted response wrapped in a styled <div>; validation failures raise ValueError instead
573
+ """
574
+ # Basic validation
575
+ if not response or len(response) < 100:
576
+ raise ValueError("Response too short or empty")
577
+
578
+ # Check for garbled output (random numbers, nonsensical patterns)
579
+ # This is a simple heuristic - look for long sequences of numbers or strange patterns
580
+ suspicious_patterns = [
581
+ r'\d{8,}', # Long number sequences
582
+ r'[0-9,.]{20,}', # Long sequences of digits, commas and periods
583
+ r'[\W]{20,}', # Long sequences of non-word characters
584
+ ]
585
+
586
+ for pattern in suspicious_patterns:
587
+ if re.search(pattern, response):
588
+ logger.warning(f"Detected potentially garbled output matching pattern: {pattern}")
589
+ # Don't immediately raise - we'll do a more comprehensive check
590
+
591
+ # Check for content quality - ensure it has expected sections
592
+ expected_content = [
593
+ "introduction", "analysis", "similarity", "patterns", "conclusion", "question"
594
+ ]
595
+
596
+ # Count how many expected content markers we find
597
+ content_matches = sum(1 for term in expected_content if term.lower() in response.lower())
598
+
599
+ # If we find fewer than 3 expected content markers, it's likely not a good analysis
600
+ if content_matches < 3:
601
+ logger.warning(f"LLM response missing expected content sections (found {content_matches}/6)")
602
+ raise ValueError("Response does not contain expected analysis sections")
603
+
604
+ # Check for text names from the dataset
605
+ # Extract text names from the Text Pair column
606
+ text_names = set()
607
+ if "Text Pair" in df.columns:
608
+ for pair in df["Text Pair"]:
609
+ if isinstance(pair, str) and " vs " in pair:
610
+ texts = pair.split(" vs ")
611
+ text_names.update(texts)
612
+
613
+ # Check if at least some text names appear in the response
614
+ text_name_matches = sum(1 for name in text_names if name in response)
615
+ if text_names and text_name_matches == 0:
616
+ logger.warning("LLM response does not mention any of the text names from the dataset")
617
+ raise ValueError("Response does not reference any of the analyzed texts")
618
+
619
+ # Ensure basic markdown structure
620
+ if '##' not in response:
621
+ response = f"## Analysis of Tibetan Text Similarity\n\n{response}"
622
+
623
+ # Add styling to make the output more readable
624
+ response = f"<div class='llm-analysis'>\n{response}\n</div>"
625
+
626
+ return response
627
+
628
+
629
+ def get_interpretation(results_df: pd.DataFrame, use_llm: bool = True) -> str:
630
+ """
631
+ Get an interpretation of the similarity metrics.
632
+
633
+ This is a convenience function that creates an LLMService instance
634
+ and calls analyze_similarity with default parameters.
635
+
636
+ Args:
637
+ results_df: DataFrame containing similarity metrics
638
+ use_llm: Whether to use LLM for analysis (falls back to rule-based if False or on error)
639
+
640
+ Returns:
641
+ str: Analysis of the metrics in markdown format
642
+ """
643
+ service = LLMService()
644
+ return service.analyze_similarity(results_df, use_llm=use_llm)
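
Review note: the validation heuristic in `_format_llm_response` can be tested in isolation if it is pulled out as a pure function. The sketch below mirrors the checks visible in the diff (minimum length, at least three expected terms, at least one text name mentioned). The function name `looks_like_analysis` and its parameters are hypothetical, and it returns a boolean where the original raises `ValueError`.

```python
from typing import Set

# Same marker terms as the expected_content list in the diff.
EXPECTED_TERMS = ["introduction", "analysis", "similarity", "patterns", "conclusion", "question"]


def looks_like_analysis(
    response: str, text_names: Set[str], min_terms: int = 3, min_length: int = 100
) -> bool:
    """Return True when the response passes the same checks as the diff."""
    if not response or len(response) < min_length:
        return False
    lowered = response.lower()
    # Count how many expected section markers appear anywhere in the text.
    matches = sum(1 for term in EXPECTED_TERMS if term in lowered)
    if matches < min_terms:
        return False
    # If text names are known, at least one must be mentioned verbatim.
    if text_names and not any(name in response for name in text_names):
        return False
    return True
```

Matching on bare substrings is deliberately loose: a response that merely uses the word "analysis" counts as having that section, so the check filters garbled output rather than verifying structure.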
pipeline/metrics.py CHANGED
@@ -7,9 +7,9 @@ import torch
7
  from .semantic_embedding import generate_embeddings
8
  from .tokenize import tokenize_texts
9
  import logging
10
- from sentence_transformers import SentenceTransformer
11
  from sklearn.feature_extraction.text import TfidfVectorizer
12
  from .stopwords_bo import TIBETAN_STOPWORDS, TIBETAN_STOPWORDS_SET
 
13
 
14
  # Attempt to import the Cython-compiled fast_lcs module
15
  try:
@@ -107,6 +107,9 @@ def compute_semantic_similarity(
107
  tokens2: List[str],
108
  model,
109
  device,
 
 
 
110
  ) -> float:
111
  """Computes semantic similarity using a sentence transformer model, with chunking for long texts."""
112
  if model is None or device is None:
@@ -122,9 +125,9 @@ def compute_semantic_similarity(
122
  return 0.0 # Or np.nan, depending on desired behavior for empty inputs
123
 
124
  def _get_aggregated_embedding(
125
- raw_text_segment: str, botok_tokens: List[str], model_obj, device_str
126
  ) -> torch.Tensor | None:
127
- """Helper to get a single embedding for a text, chunking if necessary."""
128
  if (
129
  not botok_tokens and not raw_text_segment.strip()
130
  ): # Check if effectively empty
@@ -132,54 +135,147 @@ def compute_semantic_similarity(
132
  f"Text segment is empty or only whitespace: {raw_text_segment[:100]}... Returning None for embedding."
133
  )
134
  return None
135
-
136
- if len(botok_tokens) > MAX_TOKENS_PER_CHUNK:
137
- logger.info(
138
- f"Text segment with ~{len(botok_tokens)} tokens exceeds {MAX_TOKENS_PER_CHUNK}, chunking {raw_text_segment[:30]}..."
139
- )
140
- # Pass the original raw text and its pre-computed botok tokens to _chunk_text
141
- text_chunks = _chunk_text(
142
- raw_text_segment, botok_tokens, MAX_TOKENS_PER_CHUNK, CHUNK_OVERLAP
143
- )
144
- if not text_chunks:
145
- logger.warning(
146
- f"Chunking resulted in no chunks for segment: {raw_text_segment[:100]}..."
147
  )
148
  return None
149
-
150
- logger.info(
151
- f"Generated {len(text_chunks)} chunks for segment: {raw_text_segment[:30]}..."
 
 
 
 
 
 
 
 
152
  )
153
- chunk_embeddings = generate_embeddings(text_chunks, model_obj, device_str)
154
-
155
- if chunk_embeddings is None or chunk_embeddings.nelement() == 0:
156
  logger.error(
157
- f"Failed to generate embeddings for chunks of text: {raw_text_segment[:100]}..."
158
  )
159
  return None
160
- # Mean pooling of chunk embeddings
161
- aggregated_embedding = torch.mean(chunk_embeddings, dim=0, keepdim=True)
162
- return aggregated_embedding
163
  else:
164
- # Text is short enough, embed raw text directly as per MEMORY[a777e6ad-11c4-4b90-8e6e-63a923a94432]
165
- if not raw_text_segment.strip():
166
  logger.info(
167
- f"Text segment is empty or only whitespace: {raw_text_segment[:100]}... Returning None for embedding."
168
  )
169
- return None
170
 
171
- embedding = generate_embeddings([raw_text_segment], model_obj, device_str)
172
- if embedding is None or embedding.nelement() == 0:
173
- logger.error(
174
- f"Failed to generate embedding for text: {raw_text_segment[:100]}..."
175
  )
176
- return None
177
- return embedding # Already [1, embed_dim]
178
 
179
  try:
180
- # Pass raw text and its pre-computed botok tokens
181
- embedding1 = _get_aggregated_embedding(text1_segment, tokens1, model, device)
182
- embedding2 = _get_aggregated_embedding(text2_segment, tokens2, model, device)
183
 
184
  if (
185
  embedding1 is None
@@ -192,6 +288,11 @@ def compute_semantic_similarity(
192
  )
193
  return np.nan
194
 
 
 
 
 
 
195
  # Cosine similarity expects 2D arrays, embeddings are [1, embed_dim] and on CPU
196
  similarity = cosine_similarity(embedding1.numpy(), embedding2.numpy())
197
  return float(similarity[0][0])
@@ -204,7 +305,9 @@ def compute_semantic_similarity(
204
 
205
 
206
  def compute_all_metrics(
207
- texts: Dict[str, str], model=None, device=None, enable_semantic: bool = True
 
 
208
  ) -> pd.DataFrame:
209
  """
210
  Computes all selected similarity metrics between pairs of texts.
@@ -220,14 +323,13 @@ def compute_all_metrics(
220
  Returns:
221
  pd.DataFrame: A DataFrame where each row contains the metrics for a pair of texts,
222
  including 'Text Pair', 'Jaccard Similarity (%)', 'Normalized LCS',
223
- and 'Semantic Similarity (BuddhistNLP)'.
224
  """
225
  files = list(texts.keys())
226
  results = []
227
  # Prepare token lists (always use tokenize_texts for raw Unicode)
228
  token_lists = {}
229
  corpus_for_tfidf = [] # For storing space-joined tokens for TF-IDF
230
- tibetan_stopwords_set = set() # Initialize for Jaccard (and potentially LCS) filtering
231
 
232
  for fname, content in texts.items():
233
  tokenized_content = tokenize_texts([content]) # Returns a list of lists
@@ -245,36 +347,65 @@ def compute_all_metrics(
245
 
246
  # TF-IDF Vectorization and Cosine Similarity Calculation
247
  if corpus_for_tfidf:
248
- # Using a dummy tokenizer and preprocessor as input is already tokenized (as space-separated strings)
249
- # and we don't want further case changes or token modifications for Tibetan.
250
- # Define Tibetan stopwords. These should match tokens produced by botok.
251
- # Tibetan stopwords are now imported from stopwords_bo.py
252
-
253
- vectorizer = TfidfVectorizer(
254
- tokenizer=lambda x: x.split(),
255
- preprocessor=lambda x: x,
256
- token_pattern=None,
257
- stop_words=TIBETAN_STOPWORDS
258
- )
259
- tfidf_matrix = vectorizer.fit_transform(corpus_for_tfidf)
260
- # Calculate pairwise cosine similarity on the TF-IDF matrix
261
- # This gives a square matrix where cosine_sim_matrix[i, j] is the similarity between doc i and doc j
262
- cosine_sim_matrix = cosine_similarity(tfidf_matrix)
263
  else:
264
  # Handle case with no texts or all empty texts
265
- cosine_sim_matrix = np.array(
266
- [[]]
267
- ) # Or some other appropriate empty/default structure
268
 
269
  for i, j in combinations(range(len(files)), 2):
270
  f1, f2 = files[i], files[j]
271
  words1_raw, words2_raw = token_lists[f1], token_lists[f2]
272
 
273
- # Filter stopwords for Jaccard calculation using the imported TIBETAN_STOPWORDS_SET
274
- # If TIBETAN_STOPWORDS_SET is empty (e.g., if stopwords_bo.py somehow yields an empty set),
275
- # filtering will have no effect, which is a safe fallback.
276
- words1_jaccard = [word for word in words1_raw if word not in TIBETAN_STOPWORDS_SET]
277
- words2_jaccard = [word for word in words2_raw if word not in TIBETAN_STOPWORDS_SET]
278
 
279
  jaccard = (
280
  len(set(words1_jaccard) & set(words2_jaccard)) / len(set(words1_jaccard) | set(words2_jaccard))
@@ -290,7 +421,7 @@ def compute_all_metrics(
290
  if enable_semantic:
291
  # Pass raw texts and their pre-computed botok tokens
292
  semantic_sim = compute_semantic_similarity(
293
- texts[f1], texts[f2], words1_raw, words2_raw, model, device
294
  )
295
  else:
296
  semantic_sim = np.nan
@@ -300,8 +431,9 @@ def compute_all_metrics(
300
  "Jaccard Similarity (%)": jaccard_percent,
301
  "Normalized LCS": norm_lcs,
302
  # Pass tokens1 and tokens2 to compute_semantic_similarity
303
- "Semantic Similarity (BuddhistNLP)": semantic_sim,
304
  "TF-IDF Cosine Sim": (
 
305
  cosine_sim_matrix[i, j]
306
  if cosine_sim_matrix.size > 0
307
  and i < cosine_sim_matrix.shape[0]
 
7
  from .semantic_embedding import generate_embeddings
8
  from .tokenize import tokenize_texts
9
  import logging
 
10
  from sklearn.feature_extraction.text import TfidfVectorizer
11
  from .stopwords_bo import TIBETAN_STOPWORDS, TIBETAN_STOPWORDS_SET
12
+ from .stopwords_lite_bo import TIBETAN_STOPWORDS_LITE, TIBETAN_STOPWORDS_LITE_SET
13
 
14
  # Attempt to import the Cython-compiled fast_lcs module
15
  try:
 
107
  tokens2: List[str],
108
  model,
109
  device,
110
+ model_type: str = "sentence_transformer",
111
+ use_stopwords: bool = True,
112
+ use_lite_stopwords: bool = False,
113
  ) -> float:
114
  """Computes semantic similarity using a sentence transformer model, with chunking for long texts."""
115
  if model is None or device is None:
 
125
  return 0.0 # Or np.nan, depending on desired behavior for empty inputs
126
 
127
  def _get_aggregated_embedding(
128
+ raw_text_segment: str, botok_tokens: List[str], model_obj, device_str, model_type: str = "sentence_transformer", use_stopwords: bool = True, use_lite_stopwords: bool = False
129
  ) -> torch.Tensor | None:
130
+ """Helper to get a single embedding for a text, chunking if necessary for transformer models."""
131
  if (
132
  not botok_tokens and not raw_text_segment.strip()
133
  ): # Check if effectively empty
 
135
  f"Text segment is empty or only whitespace: {raw_text_segment[:100]}... Returning None for embedding."
136
  )
137
  return None
138
+
139
+ # For FastText, we don't need chunking as it processes tokens directly
140
+ if model_type == "fasttext":
141
+ if not raw_text_segment.strip():
142
+ logger.info(
143
+ f"Text segment is empty or only whitespace: {raw_text_segment[:100]}... Returning None for embedding."
 
 
 
 
 
 
144
  )
145
  return None
146
+
147
+ # Pass the raw text, pre-tokenized tokens, and stopword parameters
148
+ # Wrap the tokens in a list since generate_embeddings expects a list of token lists
149
+ embedding = generate_embeddings(
150
+ [raw_text_segment],
151
+ model_obj,
152
+ device_str,
153
+ model_type,
154
+ tokenize_fn=[botok_tokens], # Wrap in list since we're passing tokens for one text
155
+ use_stopwords=use_stopwords,
156
+ use_lite_stopwords=use_lite_stopwords
157
  )
158
+
159
+ if embedding is None or embedding.nelement() == 0:
 
160
  logger.error(
161
+ f"Failed to generate FastText embedding for text: {raw_text_segment[:100]}..."
162
  )
163
  return None
164
+ return embedding # Already [1, embed_dim]
165
+
166
+ # For transformer models, check if all tokens are stopwords when filtering is enabled
167
+ elif use_stopwords:
168
+ # Filter stopwords to see if any content remains
169
+ filtered_tokens = []
170
+ if use_lite_stopwords:
171
+ from .stopwords_lite_bo import TIBETAN_STOPWORDS_LITE_SET
172
+ filtered_tokens = [token for token in botok_tokens if token not in TIBETAN_STOPWORDS_LITE_SET]
173
+ else:
174
+ from .stopwords_bo import TIBETAN_STOPWORDS_SET
175
+ filtered_tokens = [token for token in botok_tokens if token not in TIBETAN_STOPWORDS_SET]
176
+
177
+ # If all tokens were filtered out as stopwords, return zero embedding
178
+ if not filtered_tokens:
179
+ logger.info("All tokens in text are stopwords. Returning zero embedding.")
180
+ # Create a zero tensor with the same dimension as the model's output
181
+ # For transformer models, typically 384 or 768 dimensions
182
+ embedding_dim = 384 # Default dimension for MiniLM models
183
+ return torch.zeros(1, embedding_dim)
184
+
185
+ # Continue with normal processing if content remains after filtering
186
+ if len(botok_tokens) > MAX_TOKENS_PER_CHUNK:
187
+ logger.info(
188
+ f"Text segment with ~{len(botok_tokens)} tokens exceeds {MAX_TOKENS_PER_CHUNK}, chunking {raw_text_segment[:30]}..."
189
+ )
190
+ # Pass the original raw text and its pre-computed botok tokens to _chunk_text
191
+ text_chunks = _chunk_text(
192
+ raw_text_segment, botok_tokens, MAX_TOKENS_PER_CHUNK, CHUNK_OVERLAP
193
+ )
194
+ if not text_chunks:
195
+ logger.warning(
196
+ f"Chunking resulted in no chunks for segment: {raw_text_segment[:100]}..."
197
+ )
198
+ return None
199
+
200
+ logger.info(
201
+ f"Generated {len(text_chunks)} chunks for segment: {raw_text_segment[:30]}..."
202
+ )
203
+ # Generate embeddings for each chunk using the model
204
+ chunk_embeddings = generate_embeddings(text_chunks, model_obj, device_str, model_type)
205
+
206
+ if chunk_embeddings is None or chunk_embeddings.nelement() == 0:
207
+ logger.error(
208
+ f"Failed to generate embeddings for chunks of text: {raw_text_segment[:100]}..."
209
+ )
210
+ return None
211
+ # Mean pooling of chunk embeddings
212
+ aggregated_embedding = torch.mean(chunk_embeddings, dim=0, keepdim=True)
213
+ return aggregated_embedding
214
+ else:
215
+ # Text is short enough for transformer model, embed raw text directly
216
+ if not raw_text_segment.strip():
217
+ logger.info(
218
+ f"Text segment is empty or only whitespace: {raw_text_segment[:100]}... Returning None for embedding."
219
+ )
220
+ return None
221
+
222
+ embedding = generate_embeddings([raw_text_segment], model_obj, device_str, model_type)
223
+ if embedding is None or embedding.nelement() == 0:
224
+ logger.error(
225
+ f"Failed to generate embedding for text: {raw_text_segment[:100]}..."
226
+ )
227
+ return None
228
+ return embedding # Already [1, embed_dim]
229
  else:
230
+ # No stopword filtering, proceed with normal processing
231
+ if len(botok_tokens) > MAX_TOKENS_PER_CHUNK:
232
  logger.info(
233
+ f"Text segment with ~{len(botok_tokens)} tokens exceeds {MAX_TOKENS_PER_CHUNK}, chunking {raw_text_segment[:30]}..."
234
  )
235
+ # Pass the original raw text and its pre-computed botok tokens to _chunk_text
236
+ text_chunks = _chunk_text(
237
+ raw_text_segment, botok_tokens, MAX_TOKENS_PER_CHUNK, CHUNK_OVERLAP
238
+ )
239
+ if not text_chunks:
240
+ logger.warning(
241
+ f"Chunking resulted in no chunks for segment: {raw_text_segment[:100]}..."
242
+ )
243
+ return None
244
 
245
+ logger.info(
246
+ f"Generated {len(text_chunks)} chunks for segment: {raw_text_segment[:30]}..."
 
 
247
  )
248
+ # Generate embeddings for each chunk using the model
249
+ chunk_embeddings = generate_embeddings(text_chunks, model_obj, device_str, model_type)
250
+
251
+ if chunk_embeddings is None or chunk_embeddings.nelement() == 0:
252
+ logger.error(
253
+ f"Failed to generate embeddings for chunks of text: {raw_text_segment[:100]}..."
254
+ )
255
+ return None
256
+ # Mean pooling of chunk embeddings
257
+ aggregated_embedding = torch.mean(chunk_embeddings, dim=0, keepdim=True)
258
+ return aggregated_embedding
259
+ else:
260
+ # Text is short enough for transformer model, embed raw text directly
261
+ if not raw_text_segment.strip():
262
+ logger.info(
263
+ f"Text segment is empty or only whitespace: {raw_text_segment[:100]}... Returning None for embedding."
264
+ )
265
+ return None
266
+
267
+ embedding = generate_embeddings([raw_text_segment], model_obj, device_str, model_type)
268
+ if embedding is None or embedding.nelement() == 0:
269
+ logger.error(
270
+ f"Failed to generate embedding for text: {raw_text_segment[:100]}..."
271
+ )
272
+ return None
273
+ return embedding # Already [1, embed_dim]
274
 
275
  try:
276
+ # Pass raw text and its pre-computed botok tokens with stopword preference
277
+ embedding1 = _get_aggregated_embedding(text1_segment, tokens1, model, device, model_type, use_stopwords, use_lite_stopwords)
278
+ embedding2 = _get_aggregated_embedding(text2_segment, tokens2, model, device, model_type, use_stopwords, use_lite_stopwords)
279
 
280
  if (
281
  embedding1 is None
 
288
  )
289
  return np.nan
290
 
291
+ # Check if both embeddings are zero vectors (which happens when all tokens are stopwords)
292
+ if np.all(embedding1.numpy() == 0) and np.all(embedding2.numpy() == 0):
293
+ # If both texts contain only stopwords, return 0 similarity
294
+ return 0.0
295
+
296
  # Cosine similarity expects 2D arrays, embeddings are [1, embed_dim] and on CPU
297
  similarity = cosine_similarity(embedding1.numpy(), embedding2.numpy())
298
  return float(similarity[0][0])
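
Review note: the new zero-vector branch returns 0.0 only when both embeddings are all zeros, and the placeholder embedding uses a hard-coded width of 384. If the loaded model emits a different width, the later cosine computation would see mismatched shapes. The sketch below shows one way to guard both cases in plain Python; `safe_cosine` is a hypothetical helper and is not part of this commit.

```python
import math
from typing import Sequence


def safe_cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity that treats any zero vector as having no similarity."""
    if len(a) != len(b):
        # Surfaces a placeholder/model width mismatch with a clear message.
        raise ValueError(f"Dimension mismatch: {len(a)} vs {len(b)}")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A stopword-only segment carries no signal, so report zero similarity.
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)
```

Returning 0.0 when either side is zero (rather than only when both are) is a design assumption here; it avoids a division by zero without depending on how the downstream library handles zero norms.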
 
305
 
306
 
307
  def compute_all_metrics(
308
+ texts: Dict[str, str], model=None, device=None, enable_semantic: bool = True,
309
+ model_type: str = "sentence_transformer", use_stopwords: bool = True,
310
+ use_lite_stopwords: bool = False
311
  ) -> pd.DataFrame:
312
  """
313
  Computes all selected similarity metrics between pairs of texts.
 
323
  Returns:
324
  pd.DataFrame: A DataFrame where each row contains the metrics for a pair of texts,
325
  including 'Text Pair', 'Jaccard Similarity (%)', 'Normalized LCS',
326
+ and 'Semantic Similarity'.
327
  """
328
  files = list(texts.keys())
329
  results = []
330
  # Prepare token lists (always use tokenize_texts for raw Unicode)
331
  token_lists = {}
332
  corpus_for_tfidf = [] # For storing space-joined tokens for TF-IDF
 
333
 
334
  for fname, content in texts.items():
335
  tokenized_content = tokenize_texts([content]) # Returns a list of lists
 
347
 
348
  # TF-IDF Vectorization and Cosine Similarity Calculation
349
  if corpus_for_tfidf:
350
+ try:
351
+ # Using a dummy tokenizer and preprocessor as input is already tokenized (as space-separated strings)
352
+ # and we don't want further case changes or token modifications for Tibetan.
353
+
354
+ # Select appropriate stopwords list based on user preference
355
+ if use_stopwords:
356
+ # Choose between regular and lite stopwords list
357
+ if use_lite_stopwords:
358
+ stopwords_to_use = TIBETAN_STOPWORDS_LITE
359
+ else:
360
+ stopwords_to_use = TIBETAN_STOPWORDS
361
+ else:
362
+ # If stopwords are disabled, use an empty list
363
+ stopwords_to_use = []
364
+
365
+ vectorizer = TfidfVectorizer(
366
+ tokenizer=lambda x: x.split(),
367
+ preprocessor=lambda x: x,
368
+ token_pattern=None,
369
+ stop_words=stopwords_to_use
370
+ )
371
+ tfidf_matrix = vectorizer.fit_transform(corpus_for_tfidf)
372
+ # Calculate pairwise cosine similarity on the TF-IDF matrix
373
+ # This gives a square matrix where cosine_sim_matrix[i, j] is the similarity between doc i and doc j
374
+ cosine_sim_matrix = cosine_similarity(tfidf_matrix)
375
+ except ValueError as e:
376
+ if "empty vocabulary" in str(e):
377
+ # If vocabulary is empty after stopword removal, create a zero matrix
378
+ n = len(corpus_for_tfidf)
379
+ cosine_sim_matrix = np.zeros((n, n))
380
+ else:
381
+ # Re-raise other ValueError
382
+ raise
383
  else:
384
  # Handle case with no texts or all empty texts
385
+ n = len(files) if files else 0
386
+ cosine_sim_matrix = np.zeros((n, n))
 
387
 
388
  for i, j in combinations(range(len(files)), 2):
389
  f1, f2 = files[i], files[j]
390
  words1_raw, words2_raw = token_lists[f1], token_lists[f2]
391
 
392
+ # Select appropriate stopwords set based on user preference
393
+ if use_stopwords:
394
+ # Choose between regular and lite stopwords sets
395
+ if use_lite_stopwords:
396
+ stopwords_set_to_use = TIBETAN_STOPWORDS_LITE_SET
397
+ else:
398
+ stopwords_set_to_use = TIBETAN_STOPWORDS_SET
399
+ else:
400
+ # If stopwords are disabled, use an empty set
401
+ stopwords_set_to_use = set()
402
+
403
+ # Filter stopwords for Jaccard calculation
404
+ words1_jaccard = [word for word in words1_raw if word not in stopwords_set_to_use]
405
+ words2_jaccard = [word for word in words2_raw if word not in stopwords_set_to_use]
406
+
407
+ # Check if both texts only contain stopwords
408
+ both_only_stopwords = len(words1_jaccard) == 0 and len(words2_jaccard) == 0
409
 
410
  jaccard = (
411
  len(set(words1_jaccard) & set(words2_jaccard)) / len(set(words1_jaccard) | set(words2_jaccard))
 
421
  if enable_semantic:
422
  # Pass raw texts and their pre-computed botok tokens
423
  semantic_sim = compute_semantic_similarity(
424
+ texts[f1], texts[f2], words1_raw, words2_raw, model, device, model_type, use_stopwords, use_lite_stopwords
425
  )
426
  else:
427
  semantic_sim = np.nan
 
431
  "Jaccard Similarity (%)": jaccard_percent,
432
  "Normalized LCS": norm_lcs,
433
  # Pass tokens1 and tokens2 to compute_semantic_similarity
434
+ "Semantic Similarity": semantic_sim,
435
  "TF-IDF Cosine Sim": (
436
+ 0.0 if both_only_stopwords else
437
  cosine_sim_matrix[i, j]
438
  if cosine_sim_matrix.size > 0
439
  and i < cosine_sim_matrix.shape[0]
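
Review note: the stopword selection is now repeated for TF-IDF, Jaccard and the embedding helper. A minimal sketch of the same decision as one function, followed by the Jaccard percentage it feeds, is given below. The names `select_stopwords` and `jaccard_percent_for` are hypothetical, the English tokens in the usage example stand in for botok output, and the empty-union guard is an assumption because that part of the hunk is truncated.

```python
from typing import Iterable, Set


def select_stopwords(
    use_stopwords: bool, use_lite_stopwords: bool, full: Set[str], lite: Set[str]
) -> Set[str]:
    """Pick the stopword set the same way the diff does for Jaccard filtering."""
    if not use_stopwords:
        return set()
    return lite if use_lite_stopwords else full


def jaccard_percent_for(
    tokens1: Iterable[str], tokens2: Iterable[str], stopwords: Set[str]
) -> float:
    """Jaccard similarity (in percent) over the tokens that survive filtering."""
    set1 = {t for t in tokens1 if t not in stopwords}
    set2 = {t for t in tokens2 if t not in stopwords}
    union = set1 | set2
    if not union:
        # Assumed guard: two stopword-only segments share no meaningful vocabulary.
        return 0.0
    return 100.0 * len(set1 & set2) / len(union)
```

For example, with `full = {"the", "of"}` the token lists `["the", "cat", "of", "mat"]` and `["the", "cat", "hat"]` share one of three remaining types, while disabling stopwords makes them share two of five.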
pipeline/process.py CHANGED
@@ -1,7 +1,7 @@
1
  import pandas as pd
2
  from typing import Dict, List, Tuple
3
  from .metrics import compute_all_metrics
4
- from .semantic_embedding import get_sentence_transformer_model_and_device
5
  from .tokenize import tokenize_texts
6
  import logging
7
  from itertools import combinations
@@ -10,119 +10,345 @@ logger = logging.getLogger(__name__)
10
 
11
 
12
  def process_texts(
13
- text_data: Dict[str, str], filenames: List[str], enable_semantic: bool = True
14
  ) -> Tuple[pd.DataFrame, pd.DataFrame, str]:
15
  """
16
  Processes uploaded texts, segments them by chapter marker, and computes metrics between chapters of different files.
 
17
  Args:
18
  text_data (Dict[str, str]): A dictionary mapping filenames to their content.
19
  filenames (List[str]): A list of filenames that were uploaded.
 
 
 
 
 
 
 
 
 
 
20
  Returns:
21
  Tuple[pd.DataFrame, pd.DataFrame, str]:
22
  - metrics_df: DataFrame with similarity metrics between corresponding chapters of file pairs.
 
 
23
  - word_counts_df: DataFrame with word counts for each segment (chapter) in each file.
 
24
  - warning: A string containing any warnings generated during processing (e.g., missing chapter markers).
 
 
 
 
25
  """
 
26
  st_model, st_device = None, None
 
 
 
 
 
 
 
 
 
 
 
27
  if enable_semantic:
28
- logger.info(
29
- "Semantic similarity enabled. Loading sentence transformer model..."
30
- )
31
  try:
32
- st_model, st_device = get_sentence_transformer_model_and_device()
33
- logger.info(
34
- f"Sentence transformer model loaded successfully on {st_device}."
35
- )
36
  except Exception as e:
37
- logger.error(
38
- f"Failed to load sentence transformer model: {e}. Semantic similarity will not be available."
39
- )
40
- # Optionally, add a warning to the UI if model loading fails
41
- # For now, keeping it as a logger.error. UI warning can be added later if desired.
42
- pass # Explicitly noting that we are not changing the warning handling for UI here.
43
  else:
44
  logger.info("Semantic similarity disabled. Skipping model loading.")
 
 
 
 
 
45
 
46
- # Detect chapter marker
47
  chapter_marker = "༈"
48
  fallback = False
49
  segment_texts = {}
50
- for fname in filenames:
51
  content = text_data[fname]
 
 
 
 
 
 
 
52
  if chapter_marker in content:
53
  segments = [
54
  seg.strip() for seg in content.split(chapter_marker) if seg.strip()
55
  ]
 
 
 
 
 
 
56
  for idx, seg in enumerate(segments):
57
  seg_id = f"{fname}|chapter {idx+1}"
58
  segment_texts[seg_id] = seg
59
  else:
 
60
  seg_id = f"{fname}|chapter 1"
61
  segment_texts[seg_id] = content.strip()
62
  fallback = True
63
- warning = ""
 
 
64
  if fallback:
65
- warning = (
66
  "No chapter marker found in one or more files. "
67
  "Each file will be treated as a single segment. "
68
  "For best results, add a unique marker (e.g., ༈) to separate chapters or sections."
69
  )
 
 
 
 
 
 
70
  # Group chapters by filename (preserving order)
 
 
 
 
 
 
71
  file_to_chapters = {}
72
  for seg_id in segment_texts:
73
  fname = seg_id.split("|")[0]
74
  file_to_chapters.setdefault(fname, []).append(seg_id)
 
75
  # For each pair of files, compare corresponding chapters (by index)
 
 
 
 
 
 
76
  results = []
77
  files = list(file_to_chapters.keys())
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
78
  for file1, file2 in combinations(files, 2):
79
  chaps1 = file_to_chapters[file1]
80
  chaps2 = file_to_chapters[file2]
81
  min_chaps = min(len(chaps1), len(chaps2))
 
 
 
 
 
 
 
82
  for idx in range(min_chaps):
83
  seg1 = chaps1[idx]
84
  seg2 = chaps2[idx]
85
- # Compute metrics for this chapter pair
86
- # Use compute_all_metrics on just these two segments
87
- pair_metrics = compute_all_metrics(
88
- {seg1: segment_texts[seg1], seg2: segment_texts[seg2]},
89
- model=st_model,
90
- device=st_device,
91
- enable_semantic=enable_semantic,
92
- )
93
- # Rename 'Text Pair' to show file stems and chapter number
94
- # Set Text Pair and Chapter columns
95
- pair_metrics.loc[:, "Text Pair"] = f"{file1} vs {file2}"
96
- pair_metrics.loc[:, "Chapter"] = idx + 1
97
- results.append(pair_metrics)
98
  if results:
99
  metrics_df = pd.concat(results, ignore_index=True)
100
  else:
101
  metrics_df = pd.DataFrame()
 
102
 
103
  # Calculate word counts
 
 
 
 
 
 
104
  word_counts_data = []
105
- for seg_id, text_content in segment_texts.items():
106
  fname, chapter_info = seg_id.split("|", 1)
107
  chapter_num = int(chapter_info.replace("chapter ", ""))
108
- # Use botok for accurate word count for raw Tibetan text
109
- tokenized_segments = tokenize_texts([text_content]) # Returns a list of lists
110
- if tokenized_segments and tokenized_segments[0]:
111
- word_count = len(tokenized_segments[0])
112
- else:
113
- word_count = 0
114
- word_counts_data.append(
115
- {
116
- "Filename": fname.replace(".txt", ""),
117
- "ChapterNumber": chapter_num,
118
- "SegmentID": seg_id,
119
- "WordCount": word_count,
120
- }
121
- )
122
  word_counts_df = pd.DataFrame(word_counts_data)
123
  if not word_counts_df.empty:
124
  word_counts_df = word_counts_df.sort_values(
125
  by=["Filename", "ChapterNumber"]
126
  ).reset_index(drop=True)
127
-
 
 
 
 
 
 
128
  return metrics_df, word_counts_df, warning
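
Review note: the segmentation step in `process_texts` is small enough to state on its own. The sketch below reproduces the behaviour visible in the old side of the diff (split on the marker, drop empty pieces, fall back to one segment per file). The function name `segment_by_marker` and its return shape are hypothetical.

```python
from typing import Dict, Tuple


def segment_by_marker(
    fname: str, content: str, marker: str = "༈"
) -> Tuple[Dict[str, str], bool]:
    """Split one file into chapter segments keyed as '<fname>|chapter <n>'.

    The boolean is True when no marker was found and the whole file
    was kept as a single segment (the fallback case in the diff).
    """
    if marker in content:
        segments = [seg.strip() for seg in content.split(marker) if seg.strip()]
        return (
            {f"{fname}|chapter {idx + 1}": seg for idx, seg in enumerate(segments)},
            False,
        )
    return {f"{fname}|chapter 1": content.strip()}, True
```

Because chapters are later compared by index, two files only line up when their markers sit at corresponding section boundaries; a leading marker produces no empty first chapter since blank pieces are dropped.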
 
1
  import pandas as pd
2
  from typing import Dict, List, Tuple
3
  from .metrics import compute_all_metrics
4
+ from .semantic_embedding import get_model_and_device, train_fasttext_model, FASTTEXT_MODEL_ID
5
  from .tokenize import tokenize_texts
6
  import logging
7
  from itertools import combinations
 
10
 
11
 
12
  def process_texts(
13
+ text_data: Dict[str, str],
14
+ filenames: List[str],
15
+ enable_semantic: bool = True,
16
+ model_name: str = "buddhist-nlp/buddhist-sentence-similarity",
17
+ use_stopwords: bool = True,
18
+ use_lite_stopwords: bool = False,
19
+ progress_callback = None
20
  ) -> Tuple[pd.DataFrame, pd.DataFrame, str]:
21
  """
22
  Processes uploaded texts, segments them by chapter marker, and computes metrics between chapters of different files.
23
+
24
  Args:
25
  text_data (Dict[str, str]): A dictionary mapping filenames to their content.
26
  filenames (List[str]): A list of filenames that were uploaded.
27
+ enable_semantic (bool, optional): Whether to compute semantic similarity metrics.
28
+ Requires loading a sentence transformer model, which can be time-consuming. Defaults to True.
29
+ model_name (str, optional): The name of the sentence transformer model to use for semantic similarity.
30
+ Must be a valid model identifier on Hugging Face. Defaults to "buddhist-nlp/buddhist-sentence-similarity".
31
+ use_stopwords (bool, optional): Whether to filter stopwords out of the token lists before computing the metrics. Defaults to True.
32
+ use_lite_stopwords (bool, optional): Whether to use the lite stopwords list (common particles only)
33
+ instead of the comprehensive list. Only applies if use_stopwords is True. Defaults to False.
34
+ progress_callback (callable, optional): A callback function for reporting progress updates.
35
+ Should accept a float between 0 and 1 and a description string. Defaults to None.
36
+
37
  Returns:
38
  Tuple[pd.DataFrame, pd.DataFrame, str]:
39
  - metrics_df: DataFrame with similarity metrics between corresponding chapters of file pairs.
40
+ Contains columns: 'Text Pair', 'Chapter', 'Jaccard Similarity (%)', 'Normalized LCS',
41
+ 'Semantic Similarity' (if enable_semantic=True), and 'TF-IDF Cosine Sim'.
42
  - word_counts_df: DataFrame with word counts for each segment (chapter) in each file.
43
+ Contains columns: 'Filename', 'ChapterNumber', 'SegmentID', 'WordCount'.
44
  - warning: A string containing any warnings generated during processing (e.g., missing chapter markers).
45
+
46
+ Raises:
47
+ RuntimeError: If the botok tokenizer fails to initialize.
48
+ ValueError: If the input files cannot be processed or if metrics computation fails.
49
  """
50
+ # Initialize model and device variables
51
  st_model, st_device = None, None
52
+ model_warning = ""
53
+
54
+ # Update progress if callback provided
55
+ if progress_callback is not None:
56
+ try:
57
+ progress_callback(0.25, desc="Preparing for text analysis...")
58
+ except Exception as e:
59
+ logger.warning(f"Progress callback error (non-critical): {e}")
60
+ # Continue processing even if progress reporting fails
61
+
62
+ # Load semantic model if enabled
63
  if enable_semantic:
64
+ logger.info("Semantic similarity enabled. Loading embedding model...")
 
 
65
  try:
66
+ logger.info("Using model: %s", model_name)
67
+
68
+ # Check if this is a FastText model request
69
+ if model_name == FASTTEXT_MODEL_ID:
70
+ # Try to load the official Facebook FastText Tibetan model first
71
+ if progress_callback is not None:
72
+ try:
73
+ progress_callback(0.25, desc="Loading official Facebook FastText Tibetan model...")
74
+ except Exception as e:
75
+ logger.warning("Progress callback error (non-critical): %s", str(e))
76
+
77
+ st_model, st_device, model_type = get_model_and_device(model_id=model_name)
78
+
79
+ # If model is None, we need to train a fallback model
80
+ if st_model is None:
81
+ if progress_callback is not None:
82
+ try:
83
+ progress_callback(0.25, desc="Official model unavailable. Training fallback FastText model...")
84
+ except Exception as e:
85
+ logger.warning("Progress callback error (non-critical): %s", str(e))
86
+
87
+ # Collect all text data for training
88
+ all_texts = list(text_data.values())
89
+
90
+ # Train the model with standard parameters for stability
91
+ st_model = train_fasttext_model(all_texts, dim=100, epoch=5)
92
+
93
+ if progress_callback is not None:
94
+ try:
95
+ progress_callback(0.3, desc="Fallback FastText model trained successfully")
96
+ except Exception as e:
97
+ logger.warning("Progress callback error (non-critical): %s", str(e))
98
+ else:
99
+ if progress_callback is not None:
100
+ try:
101
+ progress_callback(0.3, desc="Official Facebook FastText Tibetan model loaded successfully")
102
+ except Exception as e:
103
+ logger.warning(f"Progress callback error (non-critical): {e}")
104
+ else:
105
+ # For sentence transformers
106
+ st_model, st_device, model_type = get_model_and_device(model_id=model_name)
107
+ logger.info(f"Model {model_name} loaded successfully on {st_device}.")
108
+
109
+ if progress_callback is not None:
110
+ try:
111
+ progress_callback(0.3, desc="Model loaded successfully")
112
+ except Exception as e:
113
+ logger.warning(f"Progress callback error (non-critical): {e}")
114
+
115
  except Exception as e:
116
+ error_msg = str(e)
117
+ logger.error(f"Failed to load semantic model: {error_msg}. Semantic similarity will not be available.")
118
+
119
+ # Create a user-friendly warning message
120
+ if "is not a valid model identifier" in error_msg:
121
+ model_warning = f"The model '{model_name}' could not be found on Hugging Face. Semantic similarity will not be available."
122
+ elif "CUDA out of memory" in error_msg:
123
+ model_warning = "Not enough GPU memory to load the semantic model. Try using a smaller model or disable semantic similarity."
124
+ else:
125
+ model_warning = f"Failed to load semantic model: {error_msg}. Semantic similarity will not be available."
126
+
127
+ if progress_callback is not None:
128
+ try:
129
+ progress_callback(0.3, desc="Continuing without semantic model")
130
+ except Exception as e:
131
+ logger.warning(f"Progress callback error (non-critical): {e}")
132
  else:
133
  logger.info("Semantic similarity disabled. Skipping model loading.")
134
+ if progress_callback is not None:
135
+ try:
136
+ progress_callback(0.3, desc="Processing text segments")
137
+ except Exception as e:
138
+ logger.warning(f"Progress callback error (non-critical): {e}")
139
 
140
+ # Detect chapter marker and segment texts
141
+ if progress_callback is not None:
142
+ try:
143
+ progress_callback(0.35, desc="Segmenting texts by chapters...")
144
+ except Exception as e:
145
+ logger.warning(f"Progress callback error (non-critical): {e}")
146
+
147
  chapter_marker = "༈"
148
  fallback = False
149
  segment_texts = {}
150
+
151
+ # Process each file
152
+ for i, fname in enumerate(filenames):
153
+ if progress_callback is not None and len(filenames) > 1:
154
+ try:
155
+ progress_callback(0.35 + (0.05 * (i / len(filenames))),
156
+ desc=f"Segmenting file {i+1}/{len(filenames)}: {fname}")
157
+ except Exception as e:
158
+ logger.warning(f"Progress callback error (non-critical): {e}")
159
+
160
  content = text_data[fname]
161
+
162
+ # Check if content is empty
163
+ if not content.strip():
164
+ logger.warning(f"File '{fname}' is empty or contains only whitespace.")
165
+ continue
166
+
167
+ # Split by chapter marker if present
168
  if chapter_marker in content:
169
  segments = [
170
  seg.strip() for seg in content.split(chapter_marker) if seg.strip()
171
  ]
172
+
173
+ # Check if we have valid segments after splitting
174
+ if not segments:
175
+ logger.warning(f"File '{fname}' contains chapter markers but no valid text segments.")
176
+ continue
177
+
178
  for idx, seg in enumerate(segments):
179
  seg_id = f"{fname}|chapter {idx+1}"
180
  segment_texts[seg_id] = seg
181
  else:
182
+ # No chapter markers found, treat entire file as one segment
183
  seg_id = f"{fname}|chapter 1"
184
  segment_texts[seg_id] = content.strip()
185
  fallback = True
186
+
187
+ # Generate warning if no chapter markers found
188
+ warning = model_warning # Include any model warnings
189
  if fallback:
190
+ chapter_warning = (
191
  "No chapter marker found in one or more files. "
192
  "Each file will be treated as a single segment. "
193
  "For best results, add a unique marker (e.g., ༈) to separate chapters or sections."
194
  )
195
+ warning = warning + " " + chapter_warning if warning else chapter_warning
196
+
197
+ # Check if we have any valid segments
198
+ if not segment_texts:
199
+ logger.error("No valid text segments found in any of the uploaded files.")
200
+ return pd.DataFrame(), pd.DataFrame(), "No valid text segments found in the uploaded files. Please check your files and try again."
201
  # Group chapters by filename (preserving order)
202
+ if progress_callback is not None:
203
+ try:
204
+ progress_callback(0.4, desc="Organizing text segments...")
205
+ except Exception as e:
206
+ logger.warning(f"Progress callback error (non-critical): {e}")
207
+
208
  file_to_chapters = {}
209
  for seg_id in segment_texts:
210
  fname = seg_id.split("|")[0]
211
  file_to_chapters.setdefault(fname, []).append(seg_id)
212
+
213
  # For each pair of files, compare corresponding chapters (by index)
214
+ if progress_callback is not None:
215
+ try:
216
+ progress_callback(0.45, desc="Computing similarity metrics...")
217
+ except Exception as e:
218
+ logger.warning(f"Progress callback error (non-critical): {e}")
219
+
220
  results = []
221
  files = list(file_to_chapters.keys())
222
+
223
+ # Check if we have at least two files to compare
224
+ if len(files) < 2:
225
+ logger.warning("Need at least two files to compute similarity metrics.")
226
+ return pd.DataFrame(), pd.DataFrame(), "Need at least two files to compute similarity metrics."
227
+
228
+ # Track total number of comparisons for progress reporting
229
+ total_comparisons = 0
230
+ for file1, file2 in combinations(files, 2):
231
+ chaps1 = file_to_chapters[file1]
232
+ chaps2 = file_to_chapters[file2]
233
+ total_comparisons += min(len(chaps1), len(chaps2))
234
+
235
+ # Process each file pair
236
+ comparison_count = 0
237
  for file1, file2 in combinations(files, 2):
238
  chaps1 = file_to_chapters[file1]
239
  chaps2 = file_to_chapters[file2]
240
  min_chaps = min(len(chaps1), len(chaps2))
241
+
242
+ if progress_callback is not None:
243
+ try:
244
+ progress_callback(0.45, desc=f"Comparing {file1} with {file2}...")
245
+ except Exception as e:
246
+ logger.warning(f"Progress callback error (non-critical): {e}")
247
+
248
  for idx in range(min_chaps):
249
  seg1 = chaps1[idx]
250
  seg2 = chaps2[idx]
251
+
252
+ # Update progress
253
+ comparison_count += 1
254
+ if progress_callback is not None and total_comparisons > 0:
255
+ try:
256
+ progress_percentage = 0.45 + (0.25 * (comparison_count / total_comparisons))
257
+ progress_callback(progress_percentage,
258
+ desc=f"Computing metrics for chapter {idx+1} ({comparison_count}/{total_comparisons})")
259
+ except Exception as e:
260
+ logger.warning(f"Progress callback error (non-critical): {e}")
261
+
262
+ try:
263
+ # Compute metrics for this chapter pair
264
+ pair_metrics = compute_all_metrics(
265
+ {seg1: segment_texts[seg1], seg2: segment_texts[seg2]},
266
+ model=st_model,
267
+ device=st_device,
268
+ enable_semantic=enable_semantic,
269
+ model_type=model_type if 'model_type' in locals() else "sentence_transformer",
270
+ use_stopwords=use_stopwords,
271
+ use_lite_stopwords=use_lite_stopwords
272
+ )
273
+
274
+ # Label the rows with the compared file names and the chapter number
275
+ pair_metrics.loc[:, "Text Pair"] = f"{file1} vs {file2}"
276
+ pair_metrics.loc[:, "Chapter"] = idx + 1
277
+ results.append(pair_metrics)
278
+
279
+ except Exception as e:
280
+ logger.error(f"Error computing metrics for {seg1} vs {seg2}: {e}")
281
+ # Continue with other comparisons instead of failing completely
282
+ continue
283
+
284
+ # Create the metrics DataFrame
285
  if results:
286
  metrics_df = pd.concat(results, ignore_index=True)
287
  else:
288
  metrics_df = pd.DataFrame()
289
+ warning += " No valid metrics could be computed. Please check your files and try again."
290
 
291
  # Calculate word counts
292
+ if progress_callback is not None:
293
+ try:
294
+ progress_callback(0.75, desc="Calculating word counts...")
295
+ except Exception as e:
296
+ logger.warning(f"Progress callback error (non-critical): {e}")
297
+
298
  word_counts_data = []
299
+
300
+ # Process each segment
301
+ for i, (seg_id, text_content) in enumerate(segment_texts.items()):
302
+ # Update progress
303
+ if progress_callback is not None and len(segment_texts) > 0:
304
+ try:
305
+ progress_percentage = 0.75 + (0.15 * (i / len(segment_texts)))
306
+ progress_callback(progress_percentage, desc=f"Counting words in segment {i+1}/{len(segment_texts)}")
307
+ except Exception as e:
308
+ logger.warning(f"Progress callback error (non-critical): {e}")
309
+
310
  fname, chapter_info = seg_id.split("|", 1)
311
  chapter_num = int(chapter_info.replace("chapter ", ""))
312
+
313
+ try:
314
+ # Use botok for accurate word count for raw Tibetan text
315
+ tokenized_segments = tokenize_texts([text_content]) # Returns a list of lists
316
+ if tokenized_segments and tokenized_segments[0]:
317
+ word_count = len(tokenized_segments[0])
318
+ else:
319
+ word_count = 0
320
+
321
+ word_counts_data.append(
322
+ {
323
+ "Filename": fname.replace(".txt", ""),
324
+ "ChapterNumber": chapter_num,
325
+ "SegmentID": seg_id,
326
+ "WordCount": word_count,
327
+ }
328
+ )
329
+ except Exception as e:
330
+ logger.error(f"Error calculating word count for segment {seg_id}: {e}")
331
+ # Add entry with 0 word count to maintain consistency
332
+ word_counts_data.append(
333
+ {
334
+ "Filename": fname.replace(".txt", ""),
335
+ "ChapterNumber": chapter_num,
336
+ "SegmentID": seg_id,
337
+ "WordCount": 0,
338
+ }
339
+ )
340
+
341
+ # Create and sort the word counts DataFrame
342
  word_counts_df = pd.DataFrame(word_counts_data)
343
  if not word_counts_df.empty:
344
  word_counts_df = word_counts_df.sort_values(
345
  by=["Filename", "ChapterNumber"]
346
  ).reset_index(drop=True)
347
+
348
+ if progress_callback is not None:
349
+ try:
350
+ progress_callback(0.95, desc="Analysis complete!")
351
+ except Exception as e:
352
+ logger.warning(f"Progress callback error (non-critical): {e}")
353
+
354
  return metrics_df, word_counts_df, warning
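Note: the segmentation and chapter-pairing steps in the function above can be illustrated in isolation. The following is a minimal sketch, not the project's implementation: the helper names `segment_texts_by_marker` and `pair_corresponding_chapters` are hypothetical, and only the `filename|chapter N` segment IDs, the `༈` marker, and the index-based pairing are taken from the diff.

```python
from itertools import combinations

CHAPTER_MARKER = "༈"  # section marker used in the diff above


def segment_texts_by_marker(text_data, marker=CHAPTER_MARKER):
    """Split each file on the marker (hypothetical helper).

    Files without the marker become a single segment; empty files are skipped.
    """
    segments = {}
    for fname, content in text_data.items():
        if not content.strip():
            continue  # empty or whitespace-only file
        if marker in content:
            parts = [p.strip() for p in content.split(marker) if p.strip()]
        else:
            parts = [content.strip()]
        for idx, part in enumerate(parts, start=1):
            segments[f"{fname}|chapter {idx}"] = part
    return segments


def pair_corresponding_chapters(segments):
    """Pair chapter i of one file with chapter i of every other file."""
    by_file = {}
    for seg_id in segments:
        by_file.setdefault(seg_id.split("|")[0], []).append(seg_id)
    pairs = []
    for file1, file2 in combinations(by_file, 2):
        # zip stops at the shorter file, like min(len(chaps1), len(chaps2))
        pairs.extend(zip(by_file[file1], by_file[file2]))
    return pairs
```

Because pairing is by position, a file with extra chapters contributes only as many comparisons as its shortest partner has chapters; the surplus chapters still appear in the word counts.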
pipeline/semantic_embedding.py CHANGED
@@ -1,5 +1,6 @@
1
  import logging
2
  import torch
 
3
  from sentence_transformers import SentenceTransformer
4
 
5
  # Configure logging
@@ -11,8 +12,11 @@ logger = logging.getLogger(__name__)
11
  # Define the model ID for the fine-tuned Tibetan MiniLM
12
  DEFAULT_MODEL_NAME = "buddhist-nlp/buddhist-sentence-similarity"
13
 
 
 
14
 
15
- def get_sentence_transformer_model_and_device(
 
16
  model_id: str = DEFAULT_MODEL_NAME, device_preference: str = "auto"
17
  ):
18
  """
@@ -48,48 +52,95 @@ def get_sentence_transformer_model_and_device(
48
  else: # Handles explicit "cpu" preference or fallback if preferred is unavailable
49
  selected_device_str = "cpu"
50
 
51
- logger.info(f"Attempting to use device: {selected_device_str}")
52
 
53
  try:
54
- logger.info(
55
- f"Loading Sentence Transformer model: {model_id} on device: {selected_device_str}"
56
- )
57
- # SentenceTransformer expects a string like 'cuda', 'mps', or 'cpu'
58
- model = SentenceTransformer(model_id, device=selected_device_str)
59
- logger.info(f"Model {model_id} loaded successfully on {selected_device_str}.")
60
- return model, selected_device_str
61
  except Exception as e:
62
  logger.error(
63
- f"Error loading model {model_id} on device {selected_device_str}: {e}"
 
64
  )
65
  # Fallback to CPU if the initially selected device (CUDA or MPS) failed
66
  if selected_device_str != "cpu":
67
  logger.warning(
68
- f"Failed to load model on {selected_device_str}, attempting to load on CPU..."
 
69
  )
70
  fallback_device_str = "cpu"
71
  try:
72
- model = SentenceTransformer(model_id, device=fallback_device_str)
73
- logger.info(
74
- f"Model {model_id} loaded successfully on CPU after fallback."
75
- )
76
- return model, fallback_device_str
77
  except Exception as fallback_e:
78
  logger.error(
79
- f"Error loading model {model_id} on CPU during fallback: {fallback_e}"
 
80
  )
81
  raise fallback_e # Re-raise exception if CPU fallback also fails
82
  raise e # Re-raise original exception if selected_device_str was already CPU or no fallback attempted
83
 
84
 
85
- def generate_embeddings(texts: list[str], model, device: str):
86
  """
87
- Generates embeddings for a list of texts using the provided Sentence Transformer model.
88
 
89
  Args:
90
  texts (list[str]): A list of texts to embed.
91
- model: The loaded SentenceTransformer model.
92
- device (str): The device the model is on (primarily for logging, model.encode handles device).
 
 
 
93
 
94
  Returns:
95
  torch.Tensor: A tensor containing the embeddings, moved to CPU.
@@ -102,14 +153,72 @@ def generate_embeddings(texts: list[str], model, device: str):
102
 
103
  logger.info(f"Generating embeddings for {len(texts)} texts...")
104
 
105
- # The encode method of SentenceTransformer handles tokenization and pooling internally.
106
- # It also manages moving data to the model's device.
107
- embeddings = model.encode(texts, convert_to_tensor=True, show_progress_bar=True)
108
 
109
- logger.info(f"Embeddings generated with shape: {embeddings.shape}")
110
- return (
111
- embeddings.cpu()
112
- ) # Ensure embeddings are on CPU for consistent further processing
113
 
114
 
115
  if __name__ == "__main__":
@@ -126,19 +235,21 @@ if __name__ == "__main__":
126
  try:
127
  # Forcing CPU for this example to avoid potential CUDA issues in diverse environments
128
  # or if CUDA is not intended for this specific test.
129
- st_model, st_device = get_sentence_transformer_model_and_device(
130
  device_preference="cpu" # Explicitly use CPU for this test run
131
  )
132
 
133
- if st_model:
134
- logger.info(f"Test model loaded on device: {st_device}")
135
- example_embeddings = generate_embeddings(test_texts, st_model, st_device)
136
  logger.info(
137
- f"Generated example embeddings shape: {example_embeddings.shape}"
 
138
  )
139
  if example_embeddings.nelement() > 0: # Check if tensor is not empty
140
  logger.info(
141
- f"First embedding (first 10 dims): {example_embeddings[0][:10]}..."
 
142
  )
143
  else:
144
  logger.info("Generated example embeddings tensor is empty.")
@@ -146,6 +257,6 @@ if __name__ == "__main__":
146
  logger.error("Failed to load model for example usage.")
147
 
148
  except Exception as e:
149
- logger.error(f"An error occurred during the example usage: {e}")
150
 
151
  logger.info("Finished example usage.")
 
1
  import logging
2
  import torch
3
+ from typing import List, Any
4
  from sentence_transformers import SentenceTransformer
5
 
6
  # Configure logging
 
12
  # Define the model ID for the fine-tuned Tibetan MiniLM
13
  DEFAULT_MODEL_NAME = "buddhist-nlp/buddhist-sentence-similarity"
14
 
15
+ # FastText model identifier - this is just an internal identifier, not a HuggingFace model ID
16
+ FASTTEXT_MODEL_ID = "fasttext-tibetan"
17
 
18
+
19
+ def get_model_and_device(
20
  model_id: str = DEFAULT_MODEL_NAME, device_preference: str = "auto"
21
  ):
22
  """
 
52
  else: # Handles explicit "cpu" preference or fallback if preferred is unavailable
53
  selected_device_str = "cpu"
54
 
55
+ logger.info("Attempting to use device: %s", selected_device_str)
56
 
57
  try:
58
+ # Check if this is a FastText model request
59
+ if model_id == FASTTEXT_MODEL_ID:
60
+ try:
61
+ # Import here to avoid dependency issues if FastText is not installed
62
+ import fasttext
63
+ from .fasttext_embedding import load_fasttext_model
64
+
65
+ # Try to load the FastText model
66
+ model = load_fasttext_model()
67
+
68
+ if model is None:
69
+ error_msg = "Failed to load FastText model. Semantic similarity will not be available."
70
+ logger.error(error_msg)
71
+ raise Exception(error_msg)
72
+
73
+ logger.info("FastText model loaded successfully.")
74
+ # FastText always runs on CPU
75
+ return model, "cpu", "fasttext"
76
+ except ImportError:
77
+ logger.error("FastText module not found. Please install it with 'pip install fasttext'.")
78
+ raise
79
+ else:
80
+ logger.info(
81
+ "Loading Sentence Transformer model: %s on device: %s",
82
+ model_id, selected_device_str
83
+ )
84
+ # SentenceTransformer expects a string like 'cuda', 'mps', or 'cpu'
85
+ model = SentenceTransformer(model_id, device=selected_device_str)
86
+ logger.info("Model %s loaded successfully on %s.", model_id, selected_device_str)
87
+ return model, selected_device_str, "sentence_transformer"
88
  except Exception as e:
89
  logger.error(
90
+ "Error loading model %s on device %s: %s",
91
+ model_id, selected_device_str, str(e)
92
  )
93
  # Fallback to CPU if the initially selected device (CUDA or MPS) failed
94
  if selected_device_str != "cpu":
95
  logger.warning(
96
+ "Failed to load model on %s, attempting to load on CPU...",
97
+ selected_device_str
98
  )
99
  fallback_device_str = "cpu"
100
  try:
101
+ # Check if this is a FastText model request during fallback
102
+ if model_id == FASTTEXT_MODEL_ID:
103
+ # Import here to avoid dependency issues if FastText is not installed
104
+ from .fasttext_embedding import load_fasttext_model
105
+
106
+ # Try to load the FastText model
107
+ model = load_fasttext_model()
108
+
109
+ if model is None:
110
+ logger.error("Failed to load FastText model during fallback. Semantic similarity will not be available.")
111
+ raise Exception("Failed to load FastText model. Please check if the model file exists.")
112
+
113
+ logger.info("FastText model loaded successfully during fallback.")
114
+ # FastText always runs on CPU
115
+ return model, "cpu", "fasttext"
116
+ else:
117
+ # Try to load as a sentence transformer
118
+ model = SentenceTransformer(model_id, device=fallback_device_str)
119
+ logger.info(
120
+ "Model %s loaded successfully on CPU after fallback.",
121
+ model_id
122
+ )
123
+ return model, fallback_device_str, "sentence_transformer"
124
  except Exception as fallback_e:
125
  logger.error(
126
+ "Error loading model %s on CPU during fallback: %s",
127
+ model_id, str(fallback_e)
128
  )
129
  raise fallback_e # Re-raise exception if CPU fallback also fails
130
  raise e # Re-raise original exception if selected_device_str was already CPU or no fallback attempted
131
 
132
 
133
+ def generate_embeddings(texts: List[str], model: Any, device: str, model_type: str = "sentence_transformer", tokenize_fn=None, use_stopwords: bool = True, use_lite_stopwords: bool = False):
134
  """
135
+ Generates embeddings for a list of texts using the provided model.
136
 
137
  Args:
138
  texts (list[str]): A list of texts to embed.
139
+ model: The loaded model (SentenceTransformer or FastText).
140
+ device (str): The device to use ("cuda", "mps", or "cpu").
141
+ model_type (str): Type of model ("sentence_transformer" or "fasttext")
142
+ tokenize_fn: Optional tokenization function or pre-tokenized list for FastText
143
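+ use_stopwords (bool): Whether to filter out stopwords for FastText embeddings
+ use_lite_stopwords (bool): Whether to use the reduced (lite) stopword list instead of the full one; only applies if use_stopwords is True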
144
 
145
  Returns:
146
  torch.Tensor: A tensor containing the embeddings, moved to CPU.
 
153
 
154
  logger.info(f"Generating embeddings for {len(texts)} texts...")
155
 
156
+ if model_type == "fasttext":
157
+ try:
158
+ # Import here to avoid dependency issues if FastText is not installed
159
+ from .fasttext_embedding import get_batch_embeddings
160
+ from .stopwords_bo import TIBETAN_STOPWORDS_SET
161
+
162
+ # For FastText, get appropriate stopwords set if filtering is enabled
163
+ stopwords_set = None
164
+ if use_stopwords:
165
+ # Choose between regular and lite stopwords sets
166
+ if use_lite_stopwords:
167
+ from .stopwords_lite_bo import TIBETAN_STOPWORDS_LITE_SET
168
+ stopwords_set = TIBETAN_STOPWORDS_LITE_SET
169
+ else:
170
+ from .stopwords_bo import TIBETAN_STOPWORDS_SET
171
+ stopwords_set = TIBETAN_STOPWORDS_SET
172
+
173
+ # Pass pre-tokenized tokens if available, otherwise pass None
174
+ # tokenize_fn should be a list of lists (tokens for each text) or None
175
+ embeddings = get_batch_embeddings(
176
+ texts,
177
+ model,
178
+ tokenize_fn=tokenize_fn,
179
+ use_stopwords=use_stopwords,
180
+ stopwords_set=stopwords_set
181
+ )
182
+ logger.info("FastText embeddings generated with shape: %s", str(embeddings.shape))
183
+ # Convert numpy array to torch tensor for consistency
184
+ return torch.tensor(embeddings)
185
+ except ImportError:
186
+ logger.error("FastText module not found. Please install it with 'pip install fasttext'.")
187
+ raise
188
+ else: # sentence_transformer
189
+ # The encode method of SentenceTransformer handles tokenization and pooling internally.
190
+ # It also manages moving data to the model's device.
191
+ embeddings = model.encode(texts, convert_to_tensor=True, show_progress_bar=True)
192
+ logger.info("Sentence Transformer embeddings generated with shape: %s", str(embeddings.shape))
193
+ return (
194
+ embeddings.cpu()
195
+ ) # Ensure embeddings are on CPU for consistent further processing
196
+
197
 
198
+ def train_fasttext_model(corpus_texts: List[str], **kwargs):
199
+ """
200
+ Train a FastText model on the provided corpus texts.
201
+
202
+ Args:
203
+ corpus_texts: List of texts to use for training
204
+ **kwargs: Additional parameters for training (dim, epoch, etc.)
205
+
206
+ Returns:
207
+ The trained FastText model.
208
+ """
209
+ try:
210
+ from .fasttext_embedding import prepare_corpus_file, train_fasttext_model as train_ft
211
+
212
+ # Prepare corpus file
213
+ corpus_path = prepare_corpus_file(corpus_texts)
214
+
215
+ # Train the model
216
+ model = train_ft(corpus_path=corpus_path, **kwargs)
217
+
218
+ return model
219
+ except ImportError:
220
+ logger.error("FastText module not found. Please install it with 'pip install fasttext'.")
221
+ raise
222
 
223
 
224
  if __name__ == "__main__":
 
235
  try:
236
  # Forcing CPU for this example to avoid potential CUDA issues in diverse environments
237
  # or if CUDA is not intended for this specific test.
238
+ model, device, model_type = get_model_and_device(
239
  device_preference="cpu" # Explicitly use CPU for this test run
240
  )
241
 
242
+ if model:
243
+ logger.info("Test model loaded on device: %s, type: %s", device, model_type)
244
+ example_embeddings = generate_embeddings(test_texts, model, device, model_type)
245
  logger.info(
246
+ "Generated example embeddings shape: %s",
247
+ str(example_embeddings.shape)
248
  )
249
  if example_embeddings.nelement() > 0: # Check if tensor is not empty
250
  logger.info(
251
+ "First embedding (first 10 dims): %s...",
252
+ str(example_embeddings[0][:10])
253
  )
254
  else:
255
  logger.info("Generated example embeddings tensor is empty.")
 
257
  logger.error("Failed to load model for example usage.")
258
 
259
  except Exception as e:
260
+ logger.error("An error occurred during the example usage: %s", str(e))
261
 
262
  logger.info("Finished example usage.")
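Note: the load-then-retry-on-CPU pattern used by `get_model_and_device` can be shown without any model library. This is a minimal sketch under stated assumptions: `load_with_cpu_fallback` and the `loader` callable are hypothetical names introduced for illustration, and the real function additionally returns a model type string.

```python
import logging

logger = logging.getLogger(__name__)


def load_with_cpu_fallback(loader, device):
    """Call loader(device); if that fails on a non-CPU device, retry once on CPU.

    `loader` is any callable taking a device string and returning a model
    (hypothetical; stands in for the SentenceTransformer or FastText loaders).
    Returns a (model, device_used) tuple.
    """
    try:
        return loader(device), device
    except Exception as e:
        logger.error("Error loading model on %s: %s", device, e)
        if device == "cpu":
            raise  # nothing left to fall back to
        logger.warning("Retrying on CPU...")
        return loader("cpu"), "cpu"
```

If the CPU retry also fails, its exception propagates to the caller, which matches the re-raise behaviour in the diff.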
pipeline/stopwords_lite_bo.py ADDED
@@ -0,0 +1,34 @@
1
+ # -*- coding: utf-8 -*-
2
+ """Module for reduced Tibetan stopwords.
3
+
4
+ This file provides a less aggressive list of Tibetan stopwords for use in the Tibetan Text Metrics application.
5
+ It contains only the most common particles and punctuation that are unlikely to carry significant meaning.
6
+ """
7
+
8
+ # Initial set of stopwords with clear categories
9
+ PARTICLES_INITIAL_LITE = [
10
+ "ཏུ", "གི", "ཀྱི", "གིས", "ཀྱིས", "ཡིས", "ཀྱང", "སྟེ", "ཏེ", "ནོ", "ཏོ",
11
+ "ཅིང", "ཅིག", "ཅེས", "ཞེས", "གྱིས", "ན",
12
+ ]
13
+
14
+ MARKERS_AND_PUNCTUATION = ["༈", "།", "༎", "༑"]
15
+
16
+ # Reduced list of particles and suffixes
17
+ MORE_PARTICLES_SUFFIXES_LITE = [
18
+ "འི་", "དུ་", "གིས་", "ཏེ", "གི་", "ཡི་", "ཀྱི་", "པས་", "ཀྱིས་", "ཡི", "ལ", "ནི་", "ར", "དུ",
19
+ "ལས", "གྱིས་", "ས", "ཏེ་", "གྱི་", "དེ", "ཀ་", "སྟེ", "སྟེ་", "ངམ", "ཏོ", "དོ", "དམ་",
20
+ "ན", "འམ་", "ལོ", "ཀྱིས", "བས་", "ཤིག", "གིས", "ཀི་", "ཡིས་", "གྱི", "གི"
21
+ ]
22
+
23
+ # Combine all categorized lists
24
+ _ALL_STOPWORDS_CATEGORIZED_LITE = (
25
+ PARTICLES_INITIAL_LITE +
26
+ MARKERS_AND_PUNCTUATION +
27
+ MORE_PARTICLES_SUFFIXES_LITE
28
+ )
29
+
30
+ # Final flat list of unique stopwords for TfidfVectorizer (as a list)
31
+ TIBETAN_STOPWORDS_LITE = list(set(_ALL_STOPWORDS_CATEGORIZED_LITE))
32
+
33
+ # Final set of unique stopwords for efficient Jaccard/LCS filtering (as a set)
34
+ TIBETAN_STOPWORDS_LITE_SET = set(TIBETAN_STOPWORDS_LITE)
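Note: a set such as `TIBETAN_STOPWORDS_LITE_SET` is typically applied to a token list before computing overlap metrics. The sketch below is illustrative only: `filter_stopwords` is a hypothetical helper, and the optional tsheg normalization is an assumption suggested by the list containing many entries both with and without a trailing tsheg (for example `གི` and `གི་`), not something the diff implements.

```python
TSHEG = "\u0f0b"  # Tibetan intersyllabic mark


def filter_stopwords(tokens, stopwords, normalize_tsheg=False):
    """Return tokens that are not in `stopwords` (hypothetical helper).

    With normalize_tsheg=True, a token is also dropped when it matches a
    stopword after its trailing tsheg is stripped.
    """
    kept = []
    for tok in tokens:
        key = tok.rstrip(TSHEG) if normalize_tsheg else tok
        if tok in stopwords or key in stopwords:
            continue
        kept.append(tok)
    return kept
```

Set membership keeps each lookup constant-time, which is why the module exposes a set alongside the list form.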
pipeline/tokenize.py CHANGED
@@ -1,4 +1,16 @@
1
- from typing import List
2
 
3
  try:
4
  from botok import WordTokenizer
@@ -9,18 +21,40 @@ except ImportError:
9
  # Handle the case where botok might not be installed,
10
  # though it's a core dependency for this app.
11
  BOTOK_TOKENIZER = None
12
- print("ERROR: botok library not found. Tokenization will fail.")
13
  # Optionally, raise an error here if botok is absolutely critical for the app to even start
14
  # raise ImportError("botok is required for tokenization. Please install it.")
15
 
16
 
 
 
 
 
 
 
 
 
 
 
 
 
 
17
  def tokenize_texts(texts: List[str]) -> List[List[str]]:
18
  """
19
- Tokenizes a list of raw Tibetan texts using botok.
 
 
 
 
 
20
  Args:
21
- texts: List of raw text strings.
 
22
  Returns:
23
  List of tokenized texts (each as a list of tokens).
 
 
 
24
  """
25
  if BOTOK_TOKENIZER is None:
26
  # This case should ideally be handled more gracefully,
@@ -30,9 +64,42 @@ def tokenize_texts(texts: List[str]) -> List[List[str]]:
30
  )
31
 
32
  tokenized_texts_list = []
 
 
33
  for text_content in texts:
34
- tokens = [
35
- w.text for w in BOTOK_TOKENIZER.tokenize(text_content) if w.text.strip()
36
- ]
37
  tokenized_texts_list.append(tokens)
 
38
  return tokenized_texts_list
 
1
+ from typing import List, Dict
2
+ import hashlib
3
+ import logging
4
+
5
+ # Configure logging
6
+ logger = logging.getLogger(__name__)
7
+
8
+ # Initialize a cache for tokenization results
9
+ # Using a simple in-memory dictionary with text hash as key
10
+ _tokenization_cache: Dict[str, List[str]] = {}
11
+
12
+ # Maximum cache size (number of entries)
13
+ MAX_CACHE_SIZE = 1000
14
 
15
  try:
16
  from botok import WordTokenizer
 
21
  # Handle the case where botok might not be installed,
22
  # though it's a core dependency for this app.
23
  BOTOK_TOKENIZER = None
24
+ logger.error("botok library not found. Tokenization will fail.")
25
  # Optionally, raise an error here if botok is absolutely critical for the app to even start
26
  # raise ImportError("botok is required for tokenization. Please install it.")
27
 
28
 
29
+ def _get_text_hash(text: str) -> str:
30
+ """
31
+ Generate a hash for the input text to use as a cache key.
32
+
33
+ Args:
34
+ text: The input text to hash
35
+
36
+ Returns:
37
+ A string representation of the MD5 hash of the input text
38
+ """
39
+ return hashlib.md5(text.encode('utf-8')).hexdigest()
40
+
41
+
42
  def tokenize_texts(texts: List[str]) -> List[List[str]]:
43
  """
44
+ Tokenizes a list of raw Tibetan texts using botok, with caching for performance.
45
+
46
+ This function maintains an in-memory cache of previously tokenized texts to avoid
47
+ redundant processing of the same content. The cache uses MD5 hashes of the input
48
+ texts as keys.
49
+
50
  Args:
51
+ texts: List of raw text strings to tokenize.
52
+
53
  Returns:
54
  List of tokenized texts (each as a list of tokens).
55
+
56
+ Raises:
57
+ RuntimeError: If the botok tokenizer failed to initialize.
58
  """
59
  if BOTOK_TOKENIZER is None:
60
  # This case should ideally be handled more gracefully,
 
64
  )
65
 
66
  tokenized_texts_list = []
67
+
68
+ # Process each text
69
  for text_content in texts:
70
+ # Skip empty texts
71
+ if not text_content.strip():
72
+ tokenized_texts_list.append([])
73
+ continue
74
+
75
+ # Generate hash for cache lookup
76
+ text_hash = _get_text_hash(text_content)
77
+
78
+ # Check if we have this text in cache
79
+ if text_hash in _tokenization_cache:
80
+ # Cache hit - use cached tokens
81
+ tokens = _tokenization_cache[text_hash]
82
+ logger.debug(f"Cache hit for text hash {text_hash[:8]}...")
83
+ else:
84
+ # Cache miss - tokenize and store in cache
85
+ try:
86
+ tokens = [
87
+ w.text for w in BOTOK_TOKENIZER.tokenize(text_content) if w.text.strip()
88
+ ]
89
+
90
+ # Store in cache if not empty
91
+ if tokens:
92
+ # If cache is full, evict the oldest entry (simple FIFO strategy)
93
+ if len(_tokenization_cache) >= MAX_CACHE_SIZE:
94
+ # Remove the first key; dicts preserve insertion order, so this is the oldest entry
95
+ _tokenization_cache.pop(next(iter(_tokenization_cache)))
96
+
97
+ _tokenization_cache[text_hash] = tokens
98
+ logger.debug(f"Added tokens to cache with hash {text_hash[:8]}...")
99
+ except Exception as e:
100
+ logger.error(f"Error tokenizing text: {e}")
101
+ tokens = []
102
+
103
  tokenized_texts_list.append(tokens)
104
+
105
  return tokenized_texts_list
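Note: the caching scheme in `tokenize_texts` (MD5 key, size cap, evict the first key) amounts to a first-in, first-out cache, since Python dicts preserve insertion order. The following is a minimal standalone sketch of that idea; the class name `BoundedTokenCache` and its methods are hypothetical and not part of the commit.

```python
import hashlib


class BoundedTokenCache:
    """FIFO cache keyed by the MD5 hash of the text (hypothetical sketch)."""

    def __init__(self, max_size=1000):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self.max_size = max_size
        self._store = {}  # insertion-ordered, so the first key is the oldest

    @staticmethod
    def key_for(text):
        return hashlib.md5(text.encode("utf-8")).hexdigest()

    def get(self, text):
        """Return cached tokens, or None on a cache miss."""
        return self._store.get(self.key_for(text))

    def put(self, text, tokens):
        key = self.key_for(text)
        if key not in self._store and len(self._store) >= self.max_size:
            # Evict the oldest inserted entry
            self._store.pop(next(iter(self._store)))
        self._store[key] = tokens
```

FIFO eviction ignores how recently an entry was read; if recently used texts should survive longer, an LRU policy would be the usual alternative, at the cost of reordering entries on every hit.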
pipeline/visualize.py CHANGED
@@ -25,7 +25,6 @@ def generate_visualizations(metrics_df: pd.DataFrame, descriptive_titles: dict =
25
 
26
  # --- Heatmaps for each metric ---
27
  heatmaps = {}
28
- # Using 'Reds' colormap as requested for a red/white gradient.
29
  # Chapter 1 will be at the top of the Y-axis due to sort_index(ascending=False).
30
  for metric in metric_cols:
31
  # Check if all values for this metric are NaN
@@ -41,19 +40,38 @@ def generate_visualizations(metrics_df: pd.DataFrame, descriptive_titles: dict =
41
  continue
42
 
43
  cleaned_columns = [col.replace(".txt", "") for col in pivot.columns]
44
- cmap = "Reds" # Apply 'Reds' colormap to all heatmaps
 
 
 
 
 
45
  text = [
46
  [f"{val:.2f}" if pd.notnull(val) else "" for val in row]
47
  for row in pivot.values
48
  ]
 
 
 
 
 
 
 
 
 
 
 
 
 
49
  fig = go.Figure(
50
  data=go.Heatmap(
51
- z=pivot.values,
52
  x=cleaned_columns,
53
  y=pivot.index,
54
  colorscale=cmap,
55
- zmin=float(np.nanmin(pivot.values)),
56
- zmax=float(np.nanmax(pivot.values)),
 
57
  text=text,
58
  texttemplate="%{text}",
59
  hovertemplate="Chapter %{y}<br>Text Pair: %{x}<br>Value: %{z:.2f}<extra></extra>",
 
25
 
26
  # --- Heatmaps for each metric ---
27
  heatmaps = {}
 
28
  # Chapter 1 will be at the top of the Y-axis due to sort_index(ascending=False).
29
  for metric in metric_cols:
30
  # Check if all values for this metric are NaN
 
40
  continue
41
 
42
  cleaned_columns = [col.replace(".txt", "") for col in pivot.columns]
43
+
44
+ # For consistent interpretation: higher values (more similarity) = darker colors
45
+ # Using 'Reds' colormap for all metrics (dark red = high similarity)
46
+ cmap = "Reds"
47
+
48
+ # Format values for display
49
  text = [
50
  [f"{val:.2f}" if pd.notnull(val) else "" for val in row]
51
  for row in pivot.values
52
  ]
53
+
54
+ # Create a copy of the pivot data for visualization
55
+ # LCS and Semantic Similarity no longer need a reversed color scale:
56
+ # the values are plotted unchanged, so higher values (more similarity) are darker
57
+ viz_values = pivot.values.copy()
58
+
59
+ # Determine if we need to reverse the values for consistent color interpretation
60
+ # (darker = more similar across all metrics)
61
+ reverse_colorscale = False
62
+
63
+ # All metrics should have darker colors for higher similarity
64
+ # No need to reverse values anymore - we'll use the same scale for all
65
+
66
  fig = go.Figure(
67
  data=go.Heatmap(
68
+ z=viz_values,
69
  x=cleaned_columns,
70
  y=pivot.index,
71
  colorscale=cmap,
72
+ reversescale=reverse_colorscale, # Use the same scale direction for all metrics
73
+ zmin=float(np.nanmin(viz_values)),
74
+ zmax=float(np.nanmax(viz_values)),
75
  text=text,
76
  texttemplate="%{text}",
77
  hovertemplate="Chapter %{y}<br>Text Pair: %{x}<br>Value: %{z:.2f}<extra></extra>",
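The heatmap preparation in this hunk (empty labels for missing values, a NaN-aware color range, and the earlier all-NaN check) can be sketched without pandas, NumPy, or Plotly. This is only an illustration under stated assumptions: the helper names `format_labels` and `color_range` are hypothetical and do not exist in `pipeline/visualize.py`, and plain nested lists stand in for `pivot.values`.

```python
import math
from typing import Optional

Matrix = list[list[float]]


def format_labels(values: Matrix) -> list[list[str]]:
    """Two-decimal labels per cell; NaN cells become empty strings."""
    return [["" if math.isnan(v) else f"{v:.2f}" for v in row] for row in values]


def color_range(values: Matrix) -> Optional[tuple[float, float]]:
    """Return (min, max) over non-NaN cells, or None if every cell is NaN.

    Hypothetical stand-in for the np.nanmin / np.nanmax pair in the diff;
    the None case mirrors the 'all values are NaN' check that skips a metric.
    """
    finite = [v for row in values for v in row if not math.isnan(v)]
    if not finite:
        return None
    return min(finite), max(finite)


# Example: one missing cell in a 2x2 similarity matrix.
scores = [[0.25, float("nan")], [0.5, 1.0]]
labels = format_labels(scores)
bounds = color_range(scores)
```

With the range anchored to the observed minimum and maximum, the darkest shade of the `Reds` scale always marks the most similar pair within that metric, so shades are comparable within one heatmap but not across metrics.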
requirements.txt CHANGED
@@ -1,5 +1,5 @@
1
  # Core application and UI
2
- gradio==5.29.1
3
  pandas==2.2.3
4
 
5
  # Plotting and visualization
@@ -14,9 +14,19 @@ torch==2.7.0
14
  transformers==4.51.3
15
  sentence-transformers==4.1.0
16
  numba==0.61.2
 
17
 
18
  # Tibetan language processing
19
  botok==0.9.0
20
 
21
  # Build system for Cython
22
- Cython==3.0.12
 
1
  # Core application and UI
2
+ gradio
3
  pandas==2.2.3
4
 
5
  # Plotting and visualization
 
14
  transformers==4.51.3
15
  sentence-transformers==4.1.0
16
  numba==0.61.2
17
+ fasttext==0.9.2
18
 
19
  # Tibetan language processing
20
  botok==0.9.0
21
 
22
  # Build system for Cython
23
+ Cython==3.0.12
24
+
25
+ # HuggingFace integration
26
+ hf_xet==0.1.0
27
+ huggingface_hub
28
+
29
+ # LLM integration
30
+ python-dotenv==1.0.0
31
+ requests==2.31.0
32
+ tabulate
theme.py CHANGED
@@ -169,6 +169,85 @@ class TibetanAppTheme(gr.themes.Soft):
169
  "padding": "10px 15px !important",
170
  "border-bottom": "2px solid transparent !important",
171
  },
172
  }
173
 
174
  def get_css_string(self) -> str:
 
169
  "padding": "10px 15px !important",
170
  "border-bottom": "2px solid transparent !important",
171
  },
172
+
173
+ # Custom styling for metric accordions
174
+ ".metric-info-accordion": {
175
+ "border-left": "4px solid #3B82F6 !important",
176
+ "margin-bottom": "1rem !important",
177
+ "background-color": "#F8FAFC !important",
178
+ "border-radius": "6px !important",
179
+ "overflow": "hidden !important",
180
+ },
181
+ ".jaccard-info": {
182
+ "border-left-color": "#3B82F6 !important", # Blue
183
+ },
184
+ ".lcs-info": {
185
+ "border-left-color": "#10B981 !important", # Green
186
+ },
187
+ ".semantic-info": {
188
+ "border-left-color": "#8B5CF6 !important", # Purple
189
+ },
190
+ ".tfidf-info": {
191
+ "border-left-color": "#F59E0B !important", # Amber
192
+ },
193
+ ".wordcount-info": {
194
+ "border-left-color": "#EC4899 !important", # Pink
195
+ },
196
+
197
+ # Accordion header styling
198
+ ".metric-info-accordion > .label-wrap": {
199
+ "font-weight": "600 !important",
200
+ "padding": "12px 16px !important",
201
+ "background-color": "#F1F5F9 !important",
202
+ "border-bottom": "1px solid #E2E8F0 !important",
203
+ },
204
+
205
+ # Accordion content styling
206
+ ".metric-info-accordion > .wrap": {
207
+ "padding": "16px !important",
208
+ },
209
+
210
+ # Word count plot styling - full width
211
+ ".tabs > .tab-content > div[data-testid='tabitem'] > .plot": {
212
+ "width": "100% !important",
213
+ },
214
+
215
+ # LLM Analysis styling
216
+ ".llm-analysis": {
217
+ "background-color": "#f8f9fa !important",
218
+ "border-left": "4px solid #3B82F6 !important",
219
+ "border-radius": "8px !important",
220
+ "padding": "20px 24px !important",
221
+ "margin": "16px 0 !important",
222
+ "box-shadow": "0 2px 8px rgba(0, 0, 0, 0.05) !important",
223
+ },
224
+ ".llm-analysis h2": {
225
+ "color": "#1e40af !important",
226
+ "font-size": "24px !important",
227
+ "margin-bottom": "16px !important",
228
+ "border-bottom": "1px solid #e5e7eb !important",
229
+ "padding-bottom": "8px !important",
230
+ },
231
+ ".llm-analysis h3, .llm-analysis h4": {
232
+ "color": "#1e3a8a !important",
233
+ "margin-top": "20px !important",
234
+ "margin-bottom": "12px !important",
235
+ },
236
+ ".llm-analysis p": {
237
+ "line-height": "1.7 !important",
238
+ "margin-bottom": "12px !important",
239
+ },
240
+ ".llm-analysis ul, .llm-analysis ol": {
241
+ "margin-left": "24px !important",
242
+ "margin-bottom": "16px !important",
243
+ },
244
+ ".llm-analysis li": {
245
+ "margin-bottom": "6px !important",
246
+ },
247
+ ".llm-analysis strong, .llm-analysis b": {
248
+ "color": "#1f2937 !important",
249
+ "font-weight": "600 !important",
250
+ },
251
  }
252
 
253
  def get_css_string(self) -> str:
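The body of `get_css_string` is not shown in this hunk. As a minimal sketch of how a selector-to-properties dictionary like the one extended above could be serialized into a CSS string, assuming each value is a flat mapping of property names to values: the function name `css_from_dict` and the sample `rules` are hypothetical and not taken from `theme.py`.

```python
def css_from_dict(rules: dict[str, dict[str, str]]) -> str:
    """Serialize {selector: {property: value}} into a CSS string.

    Hypothetical helper; the real get_css_string may work differently.
    """
    blocks = []
    for selector, props in rules.items():
        # One indented "name: value;" declaration per property.
        body = "\n".join(f"    {name}: {value};" for name, value in props.items())
        blocks.append(f"{selector} {{\n{body}\n}}")
    return "\n".join(blocks)


# Two of the selectors added in this commit, reused as sample input.
rules = {
    ".lcs-info": {"border-left-color": "#10B981 !important"},
    ".llm-analysis li": {"margin-bottom": "6px !important"},
}
css = css_from_dict(rules)
```

Keeping the styles as a dictionary means new selectors, such as the `.metric-info-accordion` and `.llm-analysis` groups in this commit, can be added without touching the serialization step.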