Spaces: Running

Upload 19 files

Files changed:
- README.md +142 -13
- app.py +314 -51
- pipeline/fasttext_embedding.py +410 -0
- pipeline/llm_service.py +644 -0
- pipeline/metrics.py +198 -66
- pipeline/process.py +274 -48
- pipeline/semantic_embedding.py +146 -35
- pipeline/stopwords_lite_bo.py +34 -0
- pipeline/tokenize.py +74 -7
- pipeline/visualize.py +23 -5
- requirements.txt +12 -2
- theme.py +79 -0
README.md
CHANGED
@@ -4,12 +4,12 @@ emoji: 📚
 colorFrom: blue
 colorTo: indigo
 sdk: gradio
-sdk_version: 5.29.
 python_version: 3.11
 app_file: app.py
 models:
-- buddhist-nlp/buddhist-sentence-similarity
-
 ---

 # Tibetan Text Metrics Web App
@@ -29,17 +29,55 @@ The Tibetan Text Metrics project aims to provide quantitative methods for assess
 - **Easy File Upload**: Upload one or more Tibetan `.txt` files directly through the browser.
 - **Automatic Segmentation**: Uses Tibetan section markers (e.g., `༈`) to automatically split texts into comparable chapters or sections.
 - **Core Metrics Computed**:
-  - **Jaccard Similarity (%)**: Measures vocabulary overlap between segments. *Common Tibetan stopwords
   - **Normalized Longest Common Subsequence (LCS)**: Identifies the longest shared sequence of words, indicating direct textual parallels.
-  - **Semantic Similarity
-
 - **Handles Long Texts**: Implements automated chunking for semantic similarity to process texts exceeding the model's token limit.
 - **Interactive Visualizations**:
   - Heatmaps for Jaccard, LCS, Semantic, and TF-IDF similarity metrics, providing a quick overview of inter-segment relationships.
   - Bar chart displaying word counts per segment.
 - **Downloadable Results**: Export detailed metrics as a CSV file and save heatmaps as PNG files.
 - **Simplified Workflow**: No command-line interaction or Python scripting needed for analysis.

 ## Text Segmentation and Best Practices

 **Why segment your texts?**
@@ -73,11 +111,52 @@ Feel free to edit this list of stopwords to better suit your needs. The list is

 ### The application computes and visualizes the following similarity metrics between corresponding chapters/segments of the uploaded texts:

-1. **Jaccard Similarity (%)**: This metric quantifies the lexical overlap between two text segments by comparing their sets of *unique* words
 2. **Normalized LCS (Longest Common Subsequence)**: This metric measures the length of the longest sequence of words that appears in *both* text segments, maintaining their original relative order. Importantly, these words do not need to be directly adjacent (contiguous) in either text. For example, if Text A is '<u>the</u> quick <u>brown</u> fox <u>jumps</u>' and Text B is '<u>the</u> lazy cat and <u>brown</u> dog <u>jumps</u> high', the LCS is 'the brown jumps'. The length of this common subsequence is then normalized (in this tool, by dividing by the length of the longer of the two segments) to provide a score, which is then presented as a percentage. A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism, as it reflects similarities in how ideas are ordered and expressed sequentially.
    * *Note on Interpretation*: It's possible for Normalized LCS to be higher than Jaccard Similarity. This often happens when texts share a substantial 'narrative backbone' or common ordered phrases (leading to a high LCS), even if they use varied surrounding vocabulary or introduce many unique words not part of these core sequences (which would lower the Jaccard score). LCS highlights this sequential, structural similarity, while Jaccard focuses on the overall shared vocabulary regardless of its arrangement.
-3. **Semantic Similarity
-

 ## Getting Started (if run locally)
@@ -113,21 +192,55 @@ Feel free to edit this list of stopwords to better suit your needs. The list is

 ## Usage

 1. **Upload Files**: Use the file upload interface to select one or more `.txt` files containing Tibetan Unicode text.
-2. **
 3. **View Results**:
    - A preview of the similarity metrics will be displayed.
    - Download the full results as a CSV file.
-   - Interactive heatmaps for Jaccard Similarity, Normalized LCS, Semantic Similarity, and TF-IDF Cosine Similarity will be generated.
    - A bar chart showing word counts per segment will also be available.
    - Any warnings (e.g., regarding missing chapter markers) will be displayed.
 ## Structure

 - `app.py` — Gradio web app entry point and UI definition.
 - `pipeline/` — Modules for file handling, text processing, metrics calculation, and visualization.
   - `process.py`: Core logic for segmenting texts and orchestrating metric computation.
   - `metrics.py`: Implementation of Jaccard, LCS, and Semantic Similarity (including chunking).
-  - `semantic_embedding.py`: Handles loading and using the
   - `tokenize.py`: Tibetan text tokenization using `botok`.
   - `upload.py`: File upload handling (currently minimal).
   - `visualize.py`: Generates heatmaps and word count plots.
@@ -137,6 +250,22 @@ Feel free to edit this list of stopwords to better suit your needs. The list is

 This project is licensed under the Creative Commons Attribution 4.0 International License - see the [LICENSE](../../LICENSE) file in the main project directory for details.

 ## Citation

 If you use this web application or the underlying TTM tool in your research, please cite the main project:

@@ -152,4 +281,4 @@ If you use this web application or the underlying TTM tool in your research, ple
 ```

 ---
-For questions or issues specifically regarding the web application, please refer to the main project's [issue tracker](https://github.com/daniel-wojahn/tibetan-text-metrics/issues) or contact Daniel Wojahn.
 colorFrom: blue
 colorTo: indigo
 sdk: gradio
+sdk_version: 5.29.0
 python_version: 3.11
 app_file: app.py
 models:
+- buddhist-nlp/buddhist-sentence-similarity
+- fasttext-tibetan
 ---

 # Tibetan Text Metrics Web App
 - **Easy File Upload**: Upload one or more Tibetan `.txt` files directly through the browser.
 - **Automatic Segmentation**: Uses Tibetan section markers (e.g., `༈`) to automatically split texts into comparable chapters or sections.
 - **Core Metrics Computed**:
+  - **Jaccard Similarity (%)**: Measures vocabulary overlap between segments. *Common Tibetan stopwords can be filtered out to focus on meaningful lexical similarity.*
   - **Normalized Longest Common Subsequence (LCS)**: Identifies the longest shared sequence of words, indicating direct textual parallels.
+  - **Semantic Similarity**: Uses embedding models to compare the contextual meaning of segments. Users can select between:
+    - a transformer-based model (`buddhist-nlp/buddhist-sentence-similarity`) specialized for Buddhist texts (experimental approach)
+    - the official Facebook FastText Tibetan model (`facebook/fasttext-bo-vectors`) pre-trained on a large corpus of Tibetan text
+
+    *Note: This metric works best when combined with the other metrics for a more comprehensive analysis.*
+  - **TF-IDF Cosine Similarity**: Highlights texts that share important or characteristic terms by comparing their TF-IDF profiles. *Common Tibetan stopwords can be excluded so that TF-IDF weights highlight genuinely characteristic terms.*
 - **Handles Long Texts**: Implements automated chunking for semantic similarity to process texts exceeding the model's token limit.
+- **Model Selection**: Choose from specialized embedding models for semantic similarity analysis:
+  - **Buddhist-NLP Transformer** (experimental): a pre-trained model specialized for Buddhist texts
+  - **FastText**: the official Facebook FastText Tibetan model (`facebook/fasttext-bo-vectors`) with optimizations specifically for Tibetan, including botok tokenization and TF-IDF weighted averaging
+- **Stopword Filtering**: Three levels of filtering for Tibetan words:
+  - **None**: no filtering; all words are included
+  - **Standard**: filters only common particles and punctuation
+  - **Aggressive**: filters all function words, including particles, pronouns, and auxiliaries
 - **Interactive Visualizations**:
   - Heatmaps for Jaccard, LCS, Semantic, and TF-IDF similarity metrics, providing a quick overview of inter-segment relationships.
   - Bar chart displaying word counts per segment.
+- **Advanced Interpretation**: Get scholarly insights about your results from a built-in analysis engine that:
+  - examines your metrics and provides contextual interpretation of textual relationships
+  - generates a dual-layer narrative analysis (scholarly and accessible)
+  - identifies patterns across chapters and highlights notable textual relationships
+  - connects findings to Tibetan textual studies concepts (transmission lineages, regional variants)
+  - suggests questions for further investigation
 - **Downloadable Results**: Export detailed metrics as a CSV file and save heatmaps as PNG files.
 - **Simplified Workflow**: No command-line interaction or Python scripting needed for analysis.

+## Advanced Features
+
+### Using AI-Powered Analysis
+
+The application includes an "Interpret Results" button that provides scholarly insights about your text similarity metrics. This feature:
+
+1. uses Mistral 7B Instruct via OpenRouter to analyze your results;
+2. requires an OpenRouter API key (set via an environment variable);
+3. provides a comprehensive scholarly analysis, including:
+   - an introduction explaining the texts compared and general observations
+   - overall patterns across all chapters, with visualized trends
+   - a detailed examination of notable chapters (highest/lowest similarity)
+   - a discussion of what the different metrics reveal about textual relationships
+   - conclusions suggesting implications for Tibetan textual scholarship
+   - specific questions these findings raise for further investigation
+   - cautionary notes about interpreting perfect matches or zero similarity scores
+
+### Data Processing
+
+- **Automatic Filtering**: The system automatically filters out perfect matches (1.0 across all metrics) that may result from empty cells or identical text comparisons.
+- **Robust Analysis**: The system handles edge cases and provides meaningful metrics even with imperfect data.
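The perfect-match filter described above amounts to dropping any comparison row whose metrics are all exactly 1.0. A plain-Python sketch (the real pipeline works on a pandas DataFrame; the row layout and column names here are illustrative, not the pipeline's actual ones):

```python
def drop_perfect_matches(rows, metric_keys):
    """Drop comparison rows where every metric equals 1.0 -- these usually
    come from empty cells or a text compared against itself."""
    return [r for r in rows if not all(r[k] == 1.0 for k in metric_keys)]

rows = [
    {"pair": "A vs A", "jaccard": 1.0, "lcs": 1.0},  # suspicious perfect match
    {"pair": "A vs B", "jaccard": 0.4, "lcs": 0.6},
]
kept = drop_perfect_matches(rows, ["jaccard", "lcs"])
print([r["pair"] for r in kept])  # ['A vs B']
```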

 ## Text Segmentation and Best Practices

 **Why segment your texts?**
 ### The application computes and visualizes the following similarity metrics between corresponding chapters/segments of the uploaded texts:

+1. **Jaccard Similarity (%)**: This metric quantifies the lexical overlap between two text segments by comparing their sets of *unique* words, optionally **filtering out common Tibetan stopwords**.
+   It essentially answers the question: 'Of all the distinct, meaningful words found across these two segments, what proportion of them are present in both?'
+   It is calculated as `(Number of common unique meaningful words) / (Total number of unique meaningful words in both texts combined) * 100`.
+   Jaccard Similarity is insensitive to word order and word frequency; it only cares whether a unique meaningful word is present or absent.
+   A higher percentage indicates a greater overlap in the significant vocabularies used in the two segments.
+
+   **Stopword Filtering**: Three levels of filtering are available:
+   - **None**: No filtering, includes all words in the comparison
+   - **Standard**: Filters only common particles and punctuation
+   - **Aggressive**: Filters all function words including particles, pronouns, and auxiliaries
+
+   This helps focus on meaningful content words rather than grammatical elements.
+
 2. **Normalized LCS (Longest Common Subsequence)**: This metric measures the length of the longest sequence of words that appears in *both* text segments, maintaining their original relative order. Importantly, these words do not need to be directly adjacent (contiguous) in either text. For example, if Text A is '<u>the</u> quick <u>brown</u> fox <u>jumps</u>' and Text B is '<u>the</u> lazy cat and <u>brown</u> dog <u>jumps</u> high', the LCS is 'the brown jumps'. The length of this common subsequence is then normalized (in this tool, by dividing by the length of the longer of the two segments) to provide a score, which is then presented as a percentage. A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism, as it reflects similarities in how ideas are ordered and expressed sequentially.
    * *Note on Interpretation*: It's possible for Normalized LCS to be higher than Jaccard Similarity. This often happens when texts share a substantial 'narrative backbone' or common ordered phrases (leading to a high LCS), even if they use varied surrounding vocabulary or introduce many unique words not part of these core sequences (which would lower the Jaccard score). LCS highlights this sequential, structural similarity, while Jaccard focuses on the overall shared vocabulary regardless of its arrangement.
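Using the running example above, the LCS computation can be sketched with the classic dynamic-programming recurrence (a simplified illustration, not the pipeline's actual implementation):

```python
def normalized_lcs(words_a, words_b):
    """Longest common subsequence length, normalized by the longer segment."""
    m, n = len(words_a), len(words_b)
    if m == 0 or n == 0:
        return 0.0
    # Classic O(m*n) dynamic programme over the two word lists.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if words_a[i - 1] == words_b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return 100.0 * dp[m][n] / max(m, n)

a = "the quick brown fox jumps".split()
b = "the lazy cat and brown dog jumps high".split()
print(normalized_lcs(a, b))  # LCS 'the brown jumps' = 3 words / 8 -> 37.5
```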
+3. **Semantic Similarity**: Computes the cosine similarity between semantic embeddings of text segments using one of two approaches:
+
+   **a. Transformer-based model** (experimental): a pre-trained model that understands contextual relationships between words.
+   - `buddhist-nlp/buddhist-sentence-similarity`: specialized for Buddhist texts
+   - Processes raw Unicode Tibetan text directly (no special tokenization required)
+   - Note: this is an experimental approach, and results may vary with different texts
+
+   **b. FastText model**: uses the official Facebook FastText Tibetan model (`facebook/fasttext-bo-vectors`) pre-trained on a large corpus of Tibetan text, falling back to a custom model only if the official model cannot be loaded.
+   - Processes Tibetan text using botok tokenization (the same tokenization used by the other metrics)
+   - Uses the pre-tokenized words from botok rather than doing its own tokenization
+   - Better for texts with specialized Tibetan vocabulary
+   - More stable results for general Tibetan text comparison
+   - Optimized for Tibetan with:
+     - syllable-based tokenization preserving Tibetan syllable markers
+     - TF-IDF weighted averaging for word vectors (distinct from the TF-IDF Cosine Similarity metric)
+     - enhanced parameters based on Tibetan NLP research
+
+   For texts exceeding the model's token limit, an automated chunking strategy is employed: texts are divided into overlapping chunks, each chunk is embedded, and the resulting embeddings are averaged to produce a single vector for the entire segment.
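The chunk-and-average strategy just described can be sketched as follows. The `embed_fn` callable, chunk size, and overlap are illustrative placeholders, not the pipeline's actual values:

```python
import numpy as np

def embed_long_segment(tokens, embed_fn, max_len=512, overlap=50):
    """Split an over-long segment into overlapping chunks, embed each chunk,
    and mean-pool the chunk embeddings into one vector for the segment."""
    step = max_len - overlap
    chunks = [tokens[i:i + max_len] for i in range(0, max(1, len(tokens)), step)]
    vectors = np.stack([embed_fn(chunk) for chunk in chunks])
    return vectors.mean(axis=0)

def cosine_similarity(u, v):
    """Cosine similarity between two segment vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```

In the app, `embed_fn` would be a call into the selected transformer or FastText model; here any function mapping a chunk to a fixed-size vector works.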
+4. **TF-IDF Cosine Similarity**: This metric first calculates Term Frequency-Inverse Document Frequency (TF-IDF) scores for each word in each text segment, optionally **filtering out common Tibetan stopwords**.
+   TF-IDF gives higher weight to words that are frequent within a particular segment but relatively rare across the entire collection of segments.
+   This helps to identify terms that are characteristic or discriminative for a segment. When stopword filtering is enabled, the TF-IDF scores better reflect genuinely significant terms.
+   Each segment is then represented as a vector of these TF-IDF scores.
+   Finally, the cosine similarity is computed between these vectors.
+   A score closer to 1 indicates that the two segments share more of these important, distinguishing terms, suggesting they cover similar specific topics or themes.
+
+   **Stopword Filtering**: Three levels of filtering are available:
+   - **None**: No filtering, includes all words in the comparison
+   - **Standard**: Filters only common particles and punctuation
+   - **Aggressive**: Filters all function words including particles, pronouns, and auxiliaries
+
+   This helps focus on meaningful content words rather than grammatical elements.
+
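A minimal, dependency-free sketch of the TF-IDF-then-cosine procedure. This uses raw `log(N/df)` IDF for clarity; the pipeline itself presumably uses a library vectorizer with smoothed IDF, so exact scores will differ:

```python
import math
from collections import Counter

def tfidf_vectors(segments):
    """Map each token-list segment to a sparse {word: tf-idf} dict."""
    n = len(segments)
    df = Counter(w for seg in segments for w in set(seg))  # document frequency
    vecs = []
    for seg in segments:
        tf = Counter(seg)
        vecs.append({w: (c / len(seg)) * math.log(n / df[w]) for w, c in tf.items()})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Note that with raw IDF, a word occurring in every segment contributes nothing, which is exactly the "characteristic terms" behaviour described above.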
 ## Getting Started (if run locally)

 ## Usage

 1. **Upload Files**: Use the file upload interface to select one or more `.txt` files containing Tibetan Unicode text.
+2. **Configure Options**:
+   - Choose whether to compute semantic similarity
+   - Select an embedding model for semantic analysis
+   - Choose a stopword filtering level (None, Standard, or Aggressive)
+3. **Run Analysis**: Click the "Run Analysis" button.
 4. **View Results**:
    - A preview of the similarity metrics will be displayed.
    - Download the full results as a CSV file.
+   - Interactive heatmaps for Jaccard Similarity, Normalized LCS, Semantic Similarity, and TF-IDF Cosine Similarity will be generated. All heatmaps use a consistent color scheme in which darker colors represent higher similarity.
    - A bar chart showing word counts per segment will also be available.
    - Any warnings (e.g., regarding missing chapter markers) will be displayed.

+5. **Get Interpretation** (optional):
+   - After running the analysis, click the "Help Interpret Results" button.
+   - No API key or internet connection required! The system uses a built-in rule-based analysis engine.
+   - The system will analyze your metrics and provide insights about patterns, relationships, and notable findings in your data.
+   - This feature helps researchers understand the significance of the metrics and identify interesting textual relationships between chapters.
+
+## Embedding Models
+
+The application offers two specialized approaches for calculating semantic similarity in Tibetan texts:
+
+1. **Buddhist-NLP Transformer** (default option):
+   - A specialized model fine-tuned for Buddhist text similarity
+   - Provides excellent results for Tibetan Buddhist texts
+   - Pre-trained and ready to use, no training required
+   - Best for general Buddhist terminology and concepts
+
+2. **FastText Model**:
+   - Uses the official Facebook FastText Tibetan model (`facebook/fasttext-bo-vectors`)
+   - Pre-trained on a large corpus of Tibetan text from Wikipedia and other sources
+   - Falls back to training a custom model on your texts if the official model cannot be loaded
+   - Respects your stopword filtering settings when creating embeddings
+   - Uses simple word vector averaging for stable embeddings
+
+**When to choose FastText**:
+- When you want high-quality word embeddings specifically trained for the Tibetan language
+- When you need a model that can handle out-of-vocabulary words through character n-grams
+- When you want to benefit from Facebook's large-scale pre-training on Tibetan text
+- When you need more control over how stopwords affect semantic analysis
+
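The "TF-IDF weighted averaging" of word vectors mentioned for the FastText path can be sketched like this. The tiny `word_vec` lookup stands in for the real FastText model (which also covers out-of-vocabulary words via character n-grams), and the IDF weights are illustrative:

```python
import numpy as np

def weighted_segment_vector(tokens, word_vec, idf):
    """Average the word vectors of a segment, weighting each word by tf * idf
    so that frequent-but-generic words contribute less to the segment vector."""
    counts = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1
    vecs, weights = [], []
    for tok, tf in counts.items():
        if tok in word_vec:  # real FastText would synthesize OOV vectors from n-grams
            vecs.append(word_vec[tok])
            weights.append(tf * idf.get(tok, 1.0))
    if not vecs:
        return None
    return np.average(np.stack(vecs), axis=0, weights=weights)

# Toy lookup with 3-dimensional vectors and hypothetical IDF weights:
word_vec = {"ka": np.array([1.0, 0.0, 0.0]), "kha": np.array([0.0, 1.0, 0.0])}
idf = {"ka": 2.0, "kha": 0.5}
vec = weighted_segment_vector(["ka", "kha", "ka"], word_vec, idf)
```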
 ## Structure

 - `app.py` — Gradio web app entry point and UI definition.
 - `pipeline/` — Modules for file handling, text processing, metrics calculation, and visualization.
   - `process.py`: Core logic for segmenting texts and orchestrating metric computation.
   - `metrics.py`: Implementation of Jaccard, LCS, and Semantic Similarity (including chunking).
+  - `semantic_embedding.py`: Handles loading and using the selected embedding models.
+  - `fasttext_embedding.py`: Provides functionality for training and using FastText models.
   - `tokenize.py`: Tibetan text tokenization using `botok`.
   - `upload.py`: File upload handling (currently minimal).
   - `visualize.py`: Generates heatmaps and word count plots.
 This project is licensed under the Creative Commons Attribution 4.0 International License - see the [LICENSE](../../LICENSE) file in the main project directory for details.

+## Research and Acknowledgements
+
+The FastText implementation for Tibetan text has been optimized based on research findings from several studies on Tibetan natural language processing:
+
+1. Di, R., Tashi, N., & Lin, J. (2019). Improving Tibetan Word Segmentation Based on Multi-Features Fusion. *IEEE Access*, 7, 178057-178069.
+   - Informed our syllable-based tokenization approach and the importance of preserving Tibetan syllable markers.
+2. Tashi, N., Rabgay, T., & Wangchuk, K. (2020). Tibetan Word Segmentation using Syllable-based Maximum Matching with Potential Syllable Merging. *Engineering Applications of Artificial Intelligence*, 93, 103716.
+   - Provided insights on syllable segmentation for Tibetan text processing.
+3. Tashi, N., Rai, A. K., Mittal, P., & Sharma, A. K. (2018). A Novel Approach to Feature Extraction for Tibetan Text Classification. *Journal of Information Processing Systems*, 14(1), 211-224.
+   - Guided our parameter optimization for FastText, including embedding dimensions and n-gram settings.
+4. Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching Word Vectors with Subword Information. *Transactions of the Association for Computational Linguistics*, 5, 135-146.
+   - The original FastText paper that introduced the subword-enriched word embeddings we use.
+
 ## Citation

 If you use this web application or the underlying TTM tool in your research, please cite the main project:

 ```

 ---
+For questions or issues specifically regarding the web application, please refer to the main project's [issue tracker](https://github.com/daniel-wojahn/tibetan-text-metrics/issues) or contact Daniel Wojahn.
app.py
CHANGED

@@ -2,7 +2,13 @@ import gradio as gr
 from pathlib import Path
 from pipeline.process import process_texts
 from pipeline.visualize import generate_visualizations, generate_word_count_chart
 import logging

 from theme import tibetan_theme
@@ -18,7 +24,7 @@ def main_interface():
     ) as demo:
         gr.Markdown(
             """# Tibetan Text Metrics Web App
-<span style='font-size:18px;'>A user-friendly web application for analyzing textual similarities and variations in Tibetan manuscripts, providing a graphical interface to the core functionalities of the [Tibetan Text Metrics (TTM)](https://github.com/daniel-wojahn/tibetan-text-metrics) project.</span>
             """,
             elem_classes="gr-markdown",
         )
@@ -38,6 +44,10 @@ def main_interface():
                     file_types=[".txt"],
                     file_count="multiple",
                 )
             with gr.Column(scale=1, elem_classes="step-column"):
                 with gr.Group():
                     gr.Markdown(
@@ -47,12 +57,39 @@ def main_interface():
                         elem_classes="gr-markdown",
                     )
                     semantic_toggle_radio = gr.Radio(
-                        label="Compute semantic similarity?",
                         choices=["Yes", "No"],
                         value="Yes",
                         info="Semantic similarity will be time-consuming. Choose 'No' to speed up analysis if these metrics are not required.",
                         elem_id="semantic-radio-group",
                     )
                     process_btn = gr.Button(
                         "Run Analysis", elem_id="run-btn", variant="primary"
                     )
@@ -69,25 +106,66 @@ def main_interface():
         metrics_preview = gr.Dataframe(
             label="Similarity Metrics Preview", interactive=False, visible=True
         )
         # Heatmap tabs for each metric
         heatmap_titles = {
-            "Jaccard Similarity (%)": "Jaccard Similarity (%): Higher scores (
-            "Normalized LCS": "Normalized LCS: Higher scores (
-            "Semantic Similarity
-            "TF-IDF Cosine Sim": "TF-IDF Cosine Similarity: Higher scores mean texts share more important, distinctive vocabulary.",
         }

         metric_tooltips = {
             "Jaccard Similarity (%)": """
### Jaccard Similarity (%)
-This metric quantifies the lexical overlap between two text segments by comparing their sets of *unique* words,
-It is calculated as `(Number of common unique
-Jaccard Similarity is insensitive to word order and word frequency; it only cares whether a unique meaningful word is present or absent.
-A higher percentage indicates a greater overlap in the significant vocabularies used in the two segments.
""",
             "Normalized LCS": """
### Normalized LCS (Longest Common Subsequence)
@@ -102,39 +180,85 @@ A higher Normalized LCS score suggests more significant shared phrasing, direct
**Note on Interpretation**: It is possible for Normalized LCS to be higher than Jaccard Similarity. This often happens when texts share a substantial 'narrative backbone' or common ordered phrases (leading to a high LCS), even if they use varied surrounding vocabulary or introduce many unique words not part of these core sequences (which would lower the Jaccard score). LCS highlights this sequential, structural similarity, while Jaccard focuses on the overall shared vocabulary regardless of its arrangement.
""",
             "Semantic Similarity": """
-### Semantic Similarity
-**
""",
             "TF-IDF Cosine Sim": """
### TF-IDF Cosine Similarity
-This metric
-This helps
-**Stopword Filtering**: uses a range of stopwords to filter out common Tibetan words that do not contribute to the semantic content of the text.
""",
         }
         heatmap_tabs = {}
         gr.Markdown("## Detailed Metric Analysis", elem_classes="gr-markdown")
         with gr.Tabs(elem_id="heatmap-tab-group"):
             for metric_key, descriptive_title in heatmap_titles.items():
                 with gr.Tab(metric_key):
-
                 else:
-

         # The outputs in process_btn.click should use the short metric names as keys for heatmap_tabs
         # e.g., heatmap_tabs["Jaccard Similarity (%)"]
@@ -145,7 +269,25 @@ A score closer to 1 indicates that the two segments share more of these importan
         warning_box = gr.Markdown(visible=False)

-        def run_pipeline(files,
             # Initialize all return values to ensure defined paths for all outputs
             csv_path_res = None
             metrics_preview_df_res = None  # Can be a DataFrame or a string message
@@ -172,6 +314,7 @@ A score closer to 1 indicates that the two segments share more of these importan
                 - semantic_heatmap (matplotlib.figure.Figure | None): Semantic similarity heatmap, or None.
                 - warning_update (gr.update): Gradio update for the warning box.
             """
             if not files:
                 return (
                     None,
@@ -183,21 +326,76 @@ A score closer to 1 indicates that the two segments share more of these importan
                     None,  # tfidf_heatmap
                     gr.update(value="Please upload files.", visible=True),
                 )

             try:
                 filenames = [
                     Path(file.name).name for file in files
                 ]  # Use Path().name to get just the filename
-                text_data = {
-
                 df_results, word_counts_df_data, warning_raw = process_texts(
-                    text_data, filenames,
                 )

                 if df_results.empty:
@@ -209,20 +407,43 @@ A score closer to 1 indicates that the two segments share more of these importan
                 warning_update_res = gr.update(value=warning_message, visible=True)
                 # Results for this case are set, then return
             else:
                 # heatmap_titles is already defined in the outer scope of main_interface
                 heatmaps_data = generate_visualizations(
                     df_results, descriptive_titles=heatmap_titles
                 )
                 word_count_fig_res = generate_word_count_chart(word_counts_df_data)
                 csv_path_res = "results.csv"
                 df_results.to_csv(csv_path_res, index=False)
                 warning_md = f"**⚠️ Warning:** {warning_raw}" if warning_raw else ""
                 metrics_preview_df_res = df_results.head(10)

                 jaccard_heatmap_res = heatmaps_data.get("Jaccard Similarity (%)")
                 lcs_heatmap_res = heatmaps_data.get("Normalized LCS")
                 semantic_heatmap_res = heatmaps_data.get(
-                    "Semantic Similarity
                 )
                 tfidf_heatmap_res = heatmaps_data.get("TF-IDF Cosine Sim")
                 warning_update_res = gr.update(
@@ -244,26 +465,68 @@ A score closer to 1 indicates that the two segments share more of these importan
             lcs_heatmap_res,
             semantic_heatmap_res,
             tfidf_heatmap_res,
-            warning_update_res
         )

         process_btn.click(
-            run_pipeline,
-            inputs=[file_input, semantic_toggle_radio],
             outputs=[
                 csv_output,
                 metrics_preview,
                 word_count_plot,
                 heatmap_tabs["Jaccard Similarity (%)"],
                 heatmap_tabs["Normalized LCS"],
-                heatmap_tabs["Semantic Similarity
                 heatmap_tabs["TF-IDF Cosine Sim"],
                 warning_box,
-            ]
         )
     return demo


 if __name__ == "__main__":
     demo = main_interface()
-    demo.launch()
|
|
|
2 |
 from pathlib import Path
 from pipeline.process import process_texts
 from pipeline.visualize import generate_visualizations, generate_word_count_chart
+from pipeline.llm_service import get_interpretation
 import logging
+import pandas as pd
+from dotenv import load_dotenv
+
+# Load environment variables from .env file
+load_dotenv()

 from theme import tibetan_theme

    ) as demo:
        gr.Markdown(
            """# Tibetan Text Metrics Web App
+<span style='font-size:18px;'>A user-friendly web application for analyzing textual similarities and variations in Tibetan manuscripts, providing a graphical interface to the core functionalities of the [Tibetan Text Metrics (TTM)](https://github.com/daniel-wojahn/tibetan-text-metrics) project. Powered by Mistral 7B via OpenRouter for advanced text analysis.</span>
""",
            elem_classes="gr-markdown",
        )

                    file_types=[".txt"],
                    file_count="multiple",
                )
+               gr.Markdown(
+                   "<small>Note: Maximum file size: 10MB per file. For optimal performance, use files under 1MB.</small>",
+                   elem_classes="gr-markdown"
+               )
            with gr.Column(scale=1, elem_classes="step-column"):
                with gr.Group():
                    gr.Markdown(

                        elem_classes="gr-markdown",
                    )
                    semantic_toggle_radio = gr.Radio(
+                       label="Compute semantic similarity? (Experimental)",
                        choices=["Yes", "No"],
                        value="Yes",
                        info="Semantic similarity will be time-consuming. Choose 'No' to speed up analysis if these metrics are not required.",
                        elem_id="semantic-radio-group",
                    )
+
+                   model_dropdown = gr.Dropdown(
+                       label="Embedding Model",
+                       choices=[
+                           "buddhist-nlp/buddhist-sentence-similarity",
+                           "fasttext-tibetan"
+                       ],
+                       value="buddhist-nlp/buddhist-sentence-similarity",
+                       info="Select the embedding model for semantic similarity.<br><br>"
+                            "<b>Model information:</b><br>"
+                            "• <a href='https://huggingface.co/buddhist-nlp/buddhist-sentence-similarity' target='_blank'>buddhist-nlp/buddhist-sentence-similarity</a>: Specialized model fine-tuned for Buddhist text similarity.<br>"
+                            "• <b>fasttext-tibetan</b>: Uses the official Facebook FastText Tibetan model pre-trained on a large corpus. If the official model cannot be loaded, it will fall back to training a custom model on your uploaded texts.",
+                       visible=True,
+                       interactive=True
+                   )
+
+                   stopwords_dropdown = gr.Dropdown(
+                       label="Stopword Filtering",
+                       choices=[
+                           "None (No filtering)",
+                           "Standard (Common particles only)",
+                           "Aggressive (All function words)"
+                       ],
+                       value="Standard (Common particles only)",  # Default
+                       info="Choose how aggressively to filter out common Tibetan particles and function words when calculating similarity. This helps focus on meaningful content words."
+                   )
+
                    process_btn = gr.Button(
                        "Run Analysis", elem_id="run-btn", variant="primary"
                    )

        metrics_preview = gr.Dataframe(
            label="Similarity Metrics Preview", interactive=False, visible=True
        )
+
+       # LLM Interpretation components
+       with gr.Row():
+           with gr.Column():
+               output_analysis = gr.Markdown(
+                   "## AI Analysis\n*The AI will analyze your text similarities and provide insights into patterns and relationships. Make sure to set up your OpenRouter API key for this feature.*",
+                   elem_classes="gr-markdown"
+               )
+
+       # Add the interpret button
+       with gr.Row():
+           interpret_btn = gr.Button(
+               "Help Interpret Results",
+               variant="primary",
+               elem_id="interpret-btn"
+           )
+
+       # About AI Analysis section
+       with gr.Accordion("ℹ️ About AI Analysis", open=False):
+           gr.Markdown("""
+           ### AI-Powered Analysis
+
+           The AI analysis is powered by **Mistral 7B Instruct** via the OpenRouter API. To use this feature:
+
+           1. Get an API key from [OpenRouter](https://openrouter.ai/keys)
+           2. Create a `.env` file in the webapp directory
+           3. Add: `OPENROUTER_API_KEY=your_api_key_here`
+
+           The AI will automatically analyze your text similarities and provide insights into patterns and relationships.
+           """)
+       # Create a placeholder message with proper formatting and structure
+       initial_message = """
+       ## Analysis of Tibetan Text Similarity Metrics
+
+       <small>*Click the 'Help Interpret Results' button above to generate an AI-powered analysis of your similarity metrics.*</small>
+       """
+       interpretation_output = gr.Markdown(
+           value=initial_message,
+           elem_id="llm-analysis"
+       )
+
        # Heatmap tabs for each metric
        heatmap_titles = {
+           "Jaccard Similarity (%)": "Jaccard Similarity (%): Higher scores (darker) mean more shared unique words.",
+           "Normalized LCS": "Normalized LCS: Higher scores (darker) mean longer shared sequences of words.",
+           "Semantic Similarity": "Semantic Similarity (using word embeddings/experimental): Higher scores (darker) mean more similar meanings.",
+           "TF-IDF Cosine Sim": "TF-IDF Cosine Similarity: Higher scores (darker) mean texts share more important, distinctive vocabulary.",
+           "Word Counts": "Word Counts: Shows the number of words in each segment after tokenization."
        }

        metric_tooltips = {
            "Jaccard Similarity (%)": """
### Jaccard Similarity (%)
+This metric quantifies the lexical overlap between two text segments by comparing their sets of *unique* words, optionally filtering out common Tibetan stopwords.
+
+It essentially answers the question: 'Of all the distinct words found across these two segments, what proportion of them are present in both?' It is calculated as `(Number of common unique words) / (Total number of unique words in both texts combined) * 100`.

+Jaccard Similarity is insensitive to word order and word frequency; it only cares whether a unique word is present or absent. A higher percentage indicates a greater overlap in the vocabularies used in the two segments.
+
+**Stopword Filtering**: When enabled (via the "Stopword Filtering" dropdown), common Tibetan particles and function words are filtered out before comparison. This helps focus on meaningful content words rather than grammatical elements.
""",
            "Normalized LCS": """
### Normalized LCS (Longest Common Subsequence)

**Note on Interpretation**: It is possible for Normalized LCS to be higher than Jaccard Similarity. This often happens when texts share a substantial 'narrative backbone' or common ordered phrases (leading to a high LCS), even if they use varied surrounding vocabulary or introduce many unique words not part of these core sequences (which would lower the Jaccard score). LCS highlights this sequential, structural similarity, while Jaccard focuses on the overall shared vocabulary regardless of its arrangement.
""",
            "Semantic Similarity": """
+### Semantic Similarity
+Computes the cosine similarity between semantic embeddings of text segments using one of two approaches:
+
+**1. Transformer-based Model** (Experimental): Pre-trained model that understands contextual relationships between words.
+- `buddhist-nlp/buddhist-sentence-similarity`: Specialized for Buddhist texts
+- Processes raw Unicode Tibetan text directly (no special tokenization required)
+- Note: This is an experimental approach and results may vary with different texts

+**2. FastText Model**: Uses the official Facebook FastText Tibetan model (facebook/fasttext-bo-vectors) pre-trained on a large corpus of Tibetan text. Falls back to a custom model only if the official model cannot be loaded.
+- Processes Tibetan text using botok tokenization (same as other metrics)
+- Uses the pre-tokenized words from botok rather than doing its own tokenization
+- Better for texts with specialized Tibetan vocabulary
+- More stable results for general Tibetan text comparison
+- Optimized for the Tibetan language with:
+  - Syllable-based tokenization preserving Tibetan syllable markers
+  - TF-IDF weighted averaging for word vectors (distinct from the TF-IDF Cosine Similarity metric)
+  - Enhanced parameters based on Tibetan NLP research
+
+**Chunking for Long Texts**: For texts exceeding the model's token limit, an automated chunking strategy is employed: texts are divided into overlapping chunks, each chunk is embedded, and the resulting embeddings are averaged to produce a single vector for the entire segment.
+
+**Stopword Filtering**: When enabled (via the "Stopword Filtering" dropdown), common Tibetan particles and function words are filtered out before computing embeddings. This helps focus on meaningful content words. Transformer models process the full text regardless of the stopword filtering setting.
+
+**Note**: This metric works best when combined with other metrics for a more comprehensive analysis.
""",
            "TF-IDF Cosine Sim": """
### TF-IDF Cosine Similarity
+This metric calculates Term Frequency-Inverse Document Frequency (TF-IDF) scores for each word in each text segment, optionally filtering out common Tibetan stopwords.
+
+TF-IDF gives higher weight to words that are frequent within a particular segment but relatively rare across the entire collection of segments. This helps identify terms that are characteristic or discriminative for a segment. When stopword filtering is enabled, the TF-IDF scores better reflect genuinely significant terms by excluding common particles and function words.
+
+Each segment is represented as a vector of these TF-IDF scores, and the cosine similarity is computed between these vectors. A score closer to 1 indicates that the two segments share more important, distinguishing terms, suggesting they cover similar specific topics or themes.
+
+**Stopword Filtering**: When enabled (via the "Stopword Filtering" dropdown), common Tibetan particles and function words are filtered out. This can be toggled on/off to compare results with and without stopwords.
""",
        }
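
The Jaccard formula quoted in the tooltip above can be sketched in a few lines of Python. This is an illustrative sketch only, not code from the app; `jaccard_percent`, `tokens_a`, and `tokens_b` are hypothetical names:

```python
def jaccard_percent(tokens_a, tokens_b):
    """Vocabulary overlap of two token lists, as a percentage."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0
    # (number of common unique words) / (total unique words in both texts) * 100
    return len(set_a & set_b) / len(union) * 100.0

# Word order and frequency are ignored; only presence/absence of each unique word matters.
print(jaccard_percent(["ka", "kha", "ga"], ["ka", "ga", "nga"]))  # 50.0
```

In the app, stopword filtering would simply remove particles from both token lists before this set comparison.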
        heatmap_tabs = {}
        gr.Markdown("## Detailed Metric Analysis", elem_classes="gr-markdown")
+
        with gr.Tabs(elem_id="heatmap-tab-group"):
+           # Process all metrics including Word Counts in a unified way
            for metric_key, descriptive_title in heatmap_titles.items():
                with gr.Tab(metric_key):
+                   # Set CSS class based on metric type
+                   if metric_key == "Jaccard Similarity (%)":
+                       css_class = "metric-info-accordion jaccard-info"
+                       accordion_title = "Understanding Vocabulary Overlap"
+                   elif metric_key == "Normalized LCS":
+                       css_class = "metric-info-accordion lcs-info"
+                       accordion_title = "Understanding Sequence Patterns"
+                   elif metric_key == "Semantic Similarity":
+                       css_class = "metric-info-accordion semantic-info"
+                       accordion_title = "Understanding Meaning Similarity"
+                   elif metric_key == "TF-IDF Cosine Sim":
+                       css_class = "metric-info-accordion tfidf-info"
+                       accordion_title = "Understanding Term Importance"
+                   elif metric_key == "Word Counts":
+                       css_class = "metric-info-accordion wordcount-info"
+                       accordion_title = "Understanding Text Length"
                    else:
+                       css_class = "metric-info-accordion"
+                       accordion_title = f"About {metric_key}"
+
+                   # Create the accordion with appropriate content
+                   with gr.Accordion(accordion_title, open=False, elem_classes=css_class):
+                       if metric_key == "Word Counts":
+                           gr.Markdown("""
+                           ### Word Counts per Segment
+                           This chart displays the number of words in each segment of your texts after tokenization.
+                           """)
+                       elif metric_key in metric_tooltips:
+                           gr.Markdown(value=metric_tooltips[metric_key])
+                       else:
+                           gr.Markdown(value=f"### {metric_key}\nDescription not found.")
+
+                   # Add the appropriate plot
+                   if metric_key == "Word Counts":
+                       word_count_plot = gr.Plot(label="Word Counts per Segment", show_label=False)
+                   else:
+                       heatmap_tabs[metric_key] = gr.Plot(label=f"Heatmap: {metric_key}", show_label=False)

        # The outputs in process_btn.click should use the short metric names as keys for heatmap_tabs
        # e.g., heatmap_tabs["Jaccard Similarity (%)"]

        warning_box = gr.Markdown(visible=False)

+       def run_pipeline(files, enable_semantic, model_name, stopwords_option="Aggressive (All function words)", progress=gr.Progress()):
+           """Run the text analysis pipeline on the uploaded files.
+
+           Args:
+               files: List of uploaded files
+               enable_semantic: Whether to compute semantic similarity
+               model_name: Name of the embedding model to use
+               stopwords_option: Stopword filtering level (None, Standard, or Aggressive)
+               progress: Gradio progress indicator
+
+           Returns:
+               Tuple of (metrics_df, heatmap_jaccard, heatmap_lcs, heatmap_semantic, heatmap_tfidf, word_count_fig)
+           """
+           # Initialize progress tracking
+           try:
+               progress_tracker = gr.Progress()
+           except Exception as e:
+               logger.warning(f"Could not initialize progress tracker: {e}")
+               progress_tracker = None
            # Initialize all return values to ensure defined paths for all outputs
            csv_path_res = None
            metrics_preview_df_res = None  # Can be a DataFrame or a string message

            - semantic_heatmap (matplotlib.figure.Figure | None): Semantic similarity heatmap, or None.
            - warning_update (gr.update): Gradio update for the warning box.
            """
+           # Check if files are provided
            if not files:
                return (
                    None,

                    None,  # tfidf_heatmap
                    gr.update(value="Please upload files.", visible=True),
                )
+
+           # Check file size limits (10MB per file)
+           for file in files:
+               file_size_mb = Path(file.name).stat().st_size / (1024 * 1024)
+               if file_size_mb > 10:
+                   return (
+                       None,
+                       f"File '{Path(file.name).name}' exceeds the 10MB size limit (size: {file_size_mb:.2f}MB).",
+                       None, None, None, None, None,
+                       gr.update(value=f"Error: File '{Path(file.name).name}' exceeds the 10MB size limit.", visible=True),
+                   )

            try:
+               if progress_tracker is not None:
+                   try:
+                       progress_tracker(0.1, desc="Preparing files...")
+                   except Exception as e:
+                       logger.warning(f"Progress update error (non-critical): {e}")
+
+               # Get filenames and read file contents
                filenames = [
                    Path(file.name).name for file in files
                ]  # Use Path().name to get just the filename
+               text_data = {}
+
+               # Read files with progress updates
+               for i, file in enumerate(files):
+                   file_path = Path(file.name)
+                   filename = file_path.name
+                   if progress_tracker is not None:
+                       try:
+                           progress_tracker(0.1 + (0.1 * (i / len(files))), desc=f"Reading file: (unknown)")
+                       except Exception as e:
+                           logger.warning(f"Progress update error (non-critical): {e}")
+
+                   try:
+                       text_data[filename] = file_path.read_text(encoding="utf-8-sig")
+                   except UnicodeDecodeError:
+                       # Try with different encodings if UTF-8 fails
+                       try:
+                           text_data[filename] = file_path.read_text(encoding="utf-16")
+                       except UnicodeDecodeError:
+                           return (
+                               None,
+                               f"Error: Could not decode file '(unknown)'. Please ensure it contains valid Tibetan text in UTF-8 or UTF-16 encoding.",
+                               None, None, None, None, None,
+                               gr.update(value=f"Error: Could not decode file '(unknown)'.", visible=True),
+                           )
+
+               # Configure semantic similarity
+               enable_semantic_bool = enable_semantic == "Yes"
+
+               if progress_tracker is not None:
+                   try:
+                       progress_tracker(0.2, desc="Loading model..." if enable_semantic_bool else "Processing text...")
+                   except Exception as e:
+                       logger.warning(f"Progress update error (non-critical): {e}")
+
+               # Process texts with selected model
+               # Convert stopword option to appropriate parameters
+               use_stopwords = stopwords_option != "None (No filtering)"
+               use_lite_stopwords = stopwords_option == "Standard (Common particles only)"
+
                df_results, word_counts_df_data, warning_raw = process_texts(
+                   text_data, filenames,
+                   enable_semantic=enable_semantic_bool,
+                   model_name=model_name,
+                   use_stopwords=use_stopwords,
+                   use_lite_stopwords=use_lite_stopwords,
+                   progress_callback=progress_tracker
                )

                if df_results.empty:

                    warning_update_res = gr.update(value=warning_message, visible=True)
                    # Results for this case are set, then return
                else:
+                   # Generate visualizations
+                   if progress_tracker is not None:
+                       try:
+                           progress_tracker(0.8, desc="Generating visualizations...")
+                       except Exception as e:
+                           logger.warning(f"Progress update error (non-critical): {e}")
+
                    # heatmap_titles is already defined in the outer scope of main_interface
                    heatmaps_data = generate_visualizations(
                        df_results, descriptive_titles=heatmap_titles
                    )
+
+                   # Generate word count chart
+                   if progress_tracker is not None:
+                       try:
+                           progress_tracker(0.9, desc="Creating word count chart...")
+                       except Exception as e:
+                           logger.warning(f"Progress update error (non-critical): {e}")
                    word_count_fig_res = generate_word_count_chart(word_counts_df_data)
+
+                   # Save results to CSV
+                   if progress_tracker is not None:
+                       try:
+                           progress_tracker(0.95, desc="Saving results...")
+                       except Exception as e:
+                           logger.warning(f"Progress update error (non-critical): {e}")
                    csv_path_res = "results.csv"
                    df_results.to_csv(csv_path_res, index=False)
+
+                   # Prepare final output
                    warning_md = f"**⚠️ Warning:** {warning_raw}" if warning_raw else ""
                    metrics_preview_df_res = df_results.head(10)

                    jaccard_heatmap_res = heatmaps_data.get("Jaccard Similarity (%)")
                    lcs_heatmap_res = heatmaps_data.get("Normalized LCS")
                    semantic_heatmap_res = heatmaps_data.get(
+                       "Semantic Similarity"
                    )
                    tfidf_heatmap_res = heatmaps_data.get("TF-IDF Cosine Sim")
                    warning_update_res = gr.update(

                lcs_heatmap_res,
                semantic_heatmap_res,
                tfidf_heatmap_res,
+               warning_update_res
            )

+       # Function to interpret results using LLM
+       def interpret_results(csv_path, progress=gr.Progress()):
+           try:
+               if not csv_path or not Path(csv_path).exists():
+                   return "Please run the analysis first to generate results."
+
+               # Read the CSV file
+               df_results = pd.read_csv(csv_path)
+
+               # Show detailed progress messages with percentages
+               progress(0, desc="Preparing data for analysis...")
+               progress(0.1, desc="Analyzing similarity patterns...")
+               progress(0.2, desc="Connecting to Mistral 7B via OpenRouter...")
+
+               # Get interpretation from LLM (using OpenRouter API)
+               progress(0.3, desc="Generating scholarly interpretation (this may take 20-40 seconds)...")
                interpretation = get_interpretation(df_results)
+
+               # Simulate completion steps
+               progress(0.9, desc="Formatting results...")
+               progress(0.95, desc="Applying scholarly formatting...")
+
+               # Completed
+               progress(1.0, desc="Analysis complete!")
+
+               # Add a timestamp to the interpretation
+               from datetime import datetime
+               timestamp = datetime.now().strftime("%Y-%m-%d %H:%M")
+               interpretation = f"{interpretation}\n\n<small>Analysis generated on {timestamp}</small>"
+               return interpretation
+           except Exception as e:
+               logger.error(f"Error in interpret_results: {e}", exc_info=True)
+               return f"Error interpreting results: {str(e)}"
+
        process_btn.click(
+           fn=run_pipeline,
+           inputs=[file_input, semantic_toggle_radio, model_dropdown, stopwords_dropdown],
            outputs=[
                csv_output,
                metrics_preview,
                word_count_plot,
                heatmap_tabs["Jaccard Similarity (%)"],
                heatmap_tabs["Normalized LCS"],
+               heatmap_tabs["Semantic Similarity"],
                heatmap_tabs["TF-IDF Cosine Sim"],
                warning_box,
+           ]
+       )
+
+       # Connect the interpret button
+       interpret_btn.click(
+           fn=interpret_results,
+           inputs=[csv_output],
+           outputs=interpretation_output
        )
+
    return demo


if __name__ == "__main__":
    demo = main_interface()
+   demo.launch()
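
The chunking strategy described in the Semantic Similarity tooltip (split long texts into overlapping chunks, embed each chunk, then average the chunk vectors into one segment vector) can be sketched as below. This is an illustrative sketch only: `embed_chunk`/`toy_embed` stand in for the real model call, and the chunk sizes are made up, not the app's actual parameters.

```python
import numpy as np

def chunk_text(tokens, chunk_size=8, overlap=2):
    """Split a token list into overlapping chunks."""
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

def embed_long_text(tokens, embed_chunk, chunk_size=8, overlap=2):
    """Embed each chunk, then mean-pool the chunk vectors into one segment vector."""
    chunks = chunk_text(tokens, chunk_size, overlap)
    vectors = np.stack([embed_chunk(c) for c in chunks])
    return vectors.mean(axis=0)

# Toy "model": a deterministic 4-dim embedding based on token lengths.
def toy_embed(chunk):
    return np.array([len(chunk), sum(len(t) for t in chunk), 1.0, 0.0])

vec = embed_long_text([f"tok{i}" for i in range(20)], toy_embed)
print(vec.shape)  # (4,)
```

Mean-pooling keeps the segment vector in the same space as single-chunk embeddings, so cosine similarity between segments of different lengths remains comparable.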
pipeline/fasttext_embedding.py
ADDED
@@ -0,0 +1,410 @@
"""
FastText embedding module for Tibetan text.
This module provides functions to train and use FastText models for Tibetan text.
"""

import os
import math
import logging
import numpy as np
import fasttext
from typing import List, Optional
from huggingface_hub import hf_hub_download

# Set up logging
logger = logging.getLogger(__name__)

# Default parameters optimized for Tibetan
DEFAULT_DIM = 100
DEFAULT_EPOCH = 5
DEFAULT_MIN_COUNT = 5
DEFAULT_WINDOW = 5
DEFAULT_MINN = 3
DEFAULT_MAXN = 6
DEFAULT_NEG = 5

# Define paths for model storage
DEFAULT_MODEL_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "models")
DEFAULT_MODEL_PATH = os.path.join(DEFAULT_MODEL_DIR, "fasttext_model.bin")

# Facebook's official Tibetan FastText model
FACEBOOK_TIBETAN_MODEL_ID = "facebook/fasttext-bo-vectors"
FACEBOOK_TIBETAN_MODEL_FILE = "model.bin"

# Create models directory if it doesn't exist
os.makedirs(DEFAULT_MODEL_DIR, exist_ok=True)

def ensure_dir_exists(directory: str) -> None:
    """
    Ensure that a directory exists, creating it if necessary.

    Args:
        directory: Directory path to ensure exists
    """
    if not os.path.exists(directory):
        os.makedirs(directory, exist_ok=True)


def train_fasttext_model(
    corpus_path: str,
    model_path: str = DEFAULT_MODEL_PATH,
    dim: int = DEFAULT_DIM,
    epoch: int = DEFAULT_EPOCH,
    min_count: int = DEFAULT_MIN_COUNT,
    window: int = DEFAULT_WINDOW,
    minn: int = DEFAULT_MINN,
    maxn: int = DEFAULT_MAXN,
    neg: int = DEFAULT_NEG,
    model_type: str = "skipgram"
) -> fasttext.FastText._FastText:
    """
    Train a FastText model on a Tibetan corpus using optimized parameters.

    Args:
        corpus_path: Path to the corpus file
        model_path: Path where to save the trained model
        dim: Embedding dimension (default: 100)
        epoch: Number of training epochs (default: 5)
        min_count: Minimum count of words (default: 5)
        window: Size of context window (default: 5)
        minn: Minimum length of char n-gram (default: 3)
        maxn: Maximum length of char n-gram (default: 6)
        neg: Number of negatives in negative sampling (default: 5)
        model_type: FastText model type ('skipgram' or 'cbow')

    Returns:
        Trained FastText model
    """
    ensure_dir_exists(os.path.dirname(model_path))

    logger.info("Training FastText model with %s, dim=%d, epoch=%d, window=%d, minn=%d, maxn=%d...",
                model_type, dim, epoch, window, minn, maxn)

    # Preprocess corpus for Tibetan - segment by syllable points
    # This is based on research showing syllable segmentation works better for Tibetan
    try:
        with open(corpus_path, 'r', encoding='utf-8') as f:
            content = f.read()

        # Ensure syllable segmentation by adding spaces after Tibetan syllable markers (if not already present)
        # This improves model quality for Tibetan text according to research
        processed_content = content.replace('་', '་ ')

        # Write back the processed content
        with open(corpus_path, 'w', encoding='utf-8') as f:
            f.write(processed_content)

        logger.info("Preprocessed corpus with syllable segmentation for Tibetan text")
    except Exception as e:
        logger.warning("Could not preprocess corpus for syllable segmentation: %s", str(e))

    # Train the model with optimized parameters
    if model_type == "skipgram":
        model = fasttext.train_unsupervised(
            corpus_path,
            model="skipgram",
            dim=dim,
            epoch=epoch,
            minCount=min_count,
            wordNgrams=1,
            minn=minn,
            maxn=maxn,
            neg=neg,
            window=window
        )
    else:  # cbow
        model = fasttext.train_unsupervised(
            corpus_path,
            model="cbow",
            dim=dim,
            epoch=epoch,
            minCount=min_count,
            wordNgrams=1,
            minn=minn,
            maxn=maxn,
            neg=neg,
            window=window
        )

    # Save the model
    model.save_model(model_path)
    logger.info("FastText model trained and saved to %s", model_path)

    return model

def load_fasttext_model(model_path: str = DEFAULT_MODEL_PATH) -> Optional[fasttext.FastText._FastText]:
    """
    Load a FastText model from file, with fallback to the official Facebook model.

    Args:
        model_path: Path to the model file

    Returns:
        Loaded FastText model or None if loading fails
    """
    try:
        # First try to load the official Facebook FastText Tibetan model
        try:
            # Try to download the official Facebook FastText Tibetan model
            logger.info("Attempting to download and load official Facebook FastText Tibetan model")
            facebook_model_path = hf_hub_download(
                repo_id=FACEBOOK_TIBETAN_MODEL_ID,
                filename=FACEBOOK_TIBETAN_MODEL_FILE,
                cache_dir=DEFAULT_MODEL_DIR
            )
            logger.info("Loading official Facebook FastText Tibetan model from %s", facebook_model_path)
            return fasttext.load_model(facebook_model_path)
        except Exception as e:
            logger.warning("Could not load official Facebook FastText Tibetan model: %s", str(e))
            logger.info("Falling back to local model")

        # Fall back to local model
        if os.path.exists(model_path):
            logger.info("Loading local FastText model from %s", model_path)
            return fasttext.load_model(model_path)
        else:
            logger.warning("Model path %s does not exist", model_path)
            return None
    except Exception as e:
        logger.error("Error loading FastText model: %s", str(e))
        return None

def get_text_embedding(
    text: str,
    model: fasttext.FastText._FastText,
    tokenize_fn=None,
    use_stopwords: bool = True,
    stopwords_set=None,
    use_tfidf_weighting: bool = True,  # Enabled by default for better results
    corpus_token_freq=None
) -> np.ndarray:
    """
    Get embedding for a text using a FastText model with optional TF-IDF weighting.

    Args:
        text: Input text
        model: FastText model
        tokenize_fn: Optional tokenization function or pre-tokenized list
        use_stopwords: Whether to filter out stopwords before computing embeddings
        stopwords_set: Set of stopwords to filter out (if use_stopwords is True)
        use_tfidf_weighting: Whether to use TF-IDF weighting for averaging word vectors
        corpus_token_freq: Dictionary of token frequencies across corpus (required for TF-IDF)

    Returns:
        Text embedding vector
    """
    if not text.strip():
        return np.zeros(model.get_dimension())

    # Handle tokenization
    if tokenize_fn is None:
        # Simple whitespace tokenization as fallback
        tokens = text.split()
    elif isinstance(tokenize_fn, list):
        # If tokenize_fn is already a list of tokens, use it directly
        tokens = tokenize_fn
    elif callable(tokenize_fn):
        # If tokenize_fn is a function, call it
        tokens = tokenize_fn(text)
    else:
        # If tokenize_fn is something else (like a string), use whitespace tokenization
        logger.warning(f"Unexpected tokenize_fn type: {type(tokenize_fn)}. Using default whitespace tokenization.")
        tokens = text.split()

    # Filter out stopwords if enabled and stopwords_set is provided
    if use_stopwords and stopwords_set is not None:
        tokens = [token for token in tokens if token not in stopwords_set]

        # If all tokens were filtered out as stopwords, return zero vector
        if not tokens:
            return np.zeros(model.get_dimension())

    # Filter out empty tokens
    tokens = [token for token in tokens if token.strip()]

    if not tokens:
        return np.zeros(model.get_dimension())

    # Calculate TF-IDF weighted average if requested
    if use_tfidf_weighting and corpus_token_freq is not None:
        # Calculate term frequencies in this document
        token_counts = {}
        for token in tokens:
            token_counts[token] = token_counts.get(token, 0) + 1

        # Calculate IDF for each token with improved stability
        N = sum(corpus_token_freq.values())  # Total number of tokens in corpus
        N = max(N, 1)  # Ensure N is at least 1 to avoid division by zero

        # Compute TF-IDF weights with safeguards against extreme values
        weights = []
        for token in tokens:
            # Term frequency in this document
            tf = token_counts.get(token, 0) / max(len(tokens), 1) if len(tokens) > 0 else 0

            # Inverse document frequency with smoothing to avoid extreme values
            token_freq = corpus_token_freq.get(token, 0)
            idf = math.log((N + 1) / (token_freq + 1)) + 1  # Add 1 for smoothing

            # TF-IDF weight with bounds to prevent extreme values
            weight = tf * idf
            weight = min(max(weight, 0.1), 10.0)  # Limit to reasonable range
            weights.append(weight)

        # Normalize weights to sum to 1 with stability checks
        total_weight = sum(weights)
        if total_weight > 0:
            weights = [w / total_weight for w in weights]
|
260 |
+
else:
|
261 |
+
# If all weights are 0, use uniform weights
|
262 |
+
weights = [1.0 / len(tokens) if len(tokens) > 0 else 0 for _ in tokens]
|
263 |
+
|
264 |
+
# Check for NaN or infinite values and replace with uniform weights if found
|
265 |
+
if any(math.isnan(w) or math.isinf(w) for w in weights):
|
266 |
+
logger.warning("Found NaN or infinite weights in TF-IDF calculation. Using uniform weights instead.")
|
267 |
+
weights = [1.0 / len(tokens) if len(tokens) > 0 else 0 for _ in tokens]
|
268 |
+
|
269 |
+
# Get vectors for each token and apply weights
|
270 |
+
vectors = [model.get_word_vector(token) for token in tokens]
|
271 |
+
weighted_vectors = [w * v for w, v in zip(weights, vectors)]
|
272 |
+
|
273 |
+
# Sum the weighted vectors
|
274 |
+
return np.sum(weighted_vectors, axis=0)
|
275 |
+
else:
|
276 |
+
# Simple averaging if TF-IDF is not enabled or corpus frequencies not provided
|
277 |
+
vectors = [model.get_word_vector(token) for token in tokens]
|
278 |
+
return np.mean(vectors, axis=0)
|
279 |
+
|
280 |
+
|
281 |
+
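The TF-IDF weighting in `get_text_embedding` can be exercised in isolation. A minimal sketch with invented tokens and corpus counts (none of these frequencies come from a real corpus):

```python
import math

# Hypothetical corpus frequencies and one tokenized "document"
corpus_token_freq = {"bka'": 50, "chos": 30, "rin": 5, "po": 100}
tokens = ["chos", "rin", "chos"]

N = max(sum(corpus_token_freq.values()), 1)  # total corpus tokens
token_counts = {}
for t in tokens:
    token_counts[t] = token_counts.get(t, 0) + 1

weights = []
for t in tokens:
    tf = token_counts[t] / len(tokens)                                # term frequency
    idf = math.log((N + 1) / (corpus_token_freq.get(t, 0) + 1)) + 1   # smoothed IDF
    weights.append(min(max(tf * idf, 0.1), 10.0))                     # clamp extremes

total = sum(weights)
weights = [w / total for w in weights]
print([round(w, 3) for w in weights])
```

After normalization the weights sum to 1 and then scale the per-token FastText vectors before summation.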
def get_batch_embeddings(
    texts: List[str],
    model: fasttext.FastText._FastText,
    tokenize_fn=None,
    use_stopwords: bool = True,
    stopwords_set=None,
    use_tfidf_weighting: bool = True,  # Enabled by default for better results
    corpus_token_freq=None,
) -> np.ndarray:
    """
    Get embeddings for a batch of texts with optional TF-IDF weighting.

    Args:
        texts: List of input texts
        model: FastText model
        tokenize_fn: Optional tokenization function, or a list of token lists (one per text)
        use_stopwords: Whether to filter out stopwords before computing embeddings
        stopwords_set: Set of stopwords to filter out (if use_stopwords is True)
        use_tfidf_weighting: Whether to use TF-IDF weighting for averaging word vectors
        corpus_token_freq: Dictionary of token frequencies across corpus (required for TF-IDF)

    Returns:
        Array of text embedding vectors
    """
    # If corpus_token_freq is not provided but TF-IDF is requested, build it from the texts
    if use_tfidf_weighting and corpus_token_freq is None:
        logger.info("Building corpus token frequency dictionary for TF-IDF weighting")
        corpus_token_freq = {}

        # Process each text to build corpus token frequencies
        for i, text in enumerate(texts):
            if not text.strip():
                continue

            # Handle tokenization
            if tokenize_fn is None:
                tokens = text.split()
            elif isinstance(tokenize_fn, list):
                # tokenize_fn is a list of token lists, one per text: use the matching one
                tokens = tokenize_fn[i] if i < len(tokenize_fn) else []
            else:
                tokens = tokenize_fn(text)

            # Filter out stopwords if enabled
            if use_stopwords and stopwords_set is not None:
                tokens = [token for token in tokens if token not in stopwords_set]

            # Update corpus token frequencies
            for token in tokens:
                if token.strip():  # Skip empty tokens
                    corpus_token_freq[token] = corpus_token_freq.get(token, 0) + 1

        logger.info("Built corpus token frequency dictionary with %d unique tokens", len(corpus_token_freq))

    # Get embeddings for each text
    embeddings = []
    for i, text in enumerate(texts):
        # Handle pre-tokenized input: pick the token list matching this text
        tokens = None
        if isinstance(tokenize_fn, list) and i < len(tokenize_fn):
            tokens = tokenize_fn[i]

        embedding = get_text_embedding(
            text,
            model,
            tokenize_fn=tokens,  # Pass the tokens directly, not the function
            use_stopwords=use_stopwords,
            stopwords_set=stopwords_set,
            use_tfidf_weighting=use_tfidf_weighting,
            corpus_token_freq=corpus_token_freq,
        )
        embeddings.append(embedding)

    return np.array(embeddings)

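The corpus-frequency pass at the top of `get_batch_embeddings` can be sketched on its own. The sample texts and the toy stopword set below are invented for illustration, assuming plain whitespace tokenization:

```python
texts = ["bka' chos rin", "chos po po", ""]
stopwords_set = {"po"}  # toy stopword list

corpus_token_freq = {}
for text in texts:
    if not text.strip():
        continue  # empty segments contribute nothing
    tokens = [t for t in text.split() if t not in stopwords_set]
    for token in tokens:
        if token.strip():
            corpus_token_freq[token] = corpus_token_freq.get(token, 0) + 1

print(corpus_token_freq)  # → {"bka'": 1, 'chos': 2, 'rin': 1}
```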
def generate_embeddings(
    texts: List[str],
    model: fasttext.FastText._FastText,
    device: str,
    model_type: str = "sentence_transformer",
    tokenize_fn=None,
    use_stopwords: bool = True,
    use_lite_stopwords: bool = False,
) -> np.ndarray:
    """
    Generate embeddings for a list of texts using a FastText model.

    Args:
        texts: List of input texts
        model: FastText model
        device: Device to use for computation (not used for FastText)
        model_type: Model type ('sentence_transformer' or 'fasttext')
        tokenize_fn: Optional tokenization function, or a list of token lists (one per text)
        use_stopwords: Whether to filter out stopwords
        use_lite_stopwords: Whether to use a lighter set of stopwords

    Returns:
        Array of text embedding vectors
    """
    if model_type != "fasttext":
        logger.warning("Model type %s not supported for FastText. Using FastText anyway.", model_type)

    # Generate embeddings using FastText
    try:
        # Load stopwords if needed
        stopwords_set = None
        if use_stopwords:
            from .tibetan_stopwords import get_stopwords
            stopwords_set = get_stopwords(use_lite=use_lite_stopwords)
            logger.info("Loaded %d Tibetan stopwords", len(stopwords_set))

        # Generate embeddings
        embeddings = get_batch_embeddings(
            texts,
            model,
            tokenize_fn=tokenize_fn,
            use_stopwords=use_stopwords,
            stopwords_set=stopwords_set,
            use_tfidf_weighting=True,  # Enable TF-IDF weighting for better results
        )

        logger.info("FastText embeddings generated with shape: %s", str(embeddings.shape))
        return embeddings
    except Exception as e:
        logger.error("Error generating FastText embeddings: %s", str(e))
        # Return zero embeddings as fallback
        return np.zeros((len(texts), model.get_dimension()))
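Downstream, the `(n_texts, dim)` array returned by `generate_embeddings` is typically compared via cosine similarity. A sketch with toy 2-D vectors standing in for FastText embeddings (real vectors have `model.get_dimension()` components):

```python
import numpy as np

# Toy stand-ins for FastText text embeddings (one row per text)
emb = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])

norms = np.linalg.norm(emb, axis=1, keepdims=True)
norms[norms == 0] = 1.0          # guard against zero vectors from empty texts
unit = emb / norms
sim = unit @ unit.T              # pairwise cosine similarity matrix

print(round(float(sim[0, 1]), 4))  # → 0.7071
print(float(sim[0, 2]))            # → 0.0
```

The zero-vector guard matters here because empty or all-stopword segments produce `np.zeros(...)` embeddings, which would otherwise divide by zero.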
pipeline/llm_service.py
ADDED
@@ -0,0 +1,644 @@
"""
LLM Service for Tibetan Text Metrics

This module provides a unified interface for analyzing text similarity metrics
using both LLM-based and rule-based approaches.
"""

import os
import json
import logging
import requests
import pandas as pd
import re

# Set up logging
logger = logging.getLogger(__name__)

# Try to load environment variables
ENV_LOADED = False
try:
    from dotenv import load_dotenv
    load_dotenv()
    ENV_LOADED = True
except ImportError:
    logger.warning("python-dotenv not installed. Using system environment variables.")

# Constants
DEFAULT_MAX_TOKENS = 4000
DEFAULT_MODEL = "mistralai/mistral-7b-instruct"
DEFAULT_TEMPERATURE = 0.3
DEFAULT_TOP_P = 0.9


class LLMService:
    """
    Service for analyzing text similarity metrics using LLMs and rule-based methods.
    """

    def __init__(self, api_key: str = None):
        """
        Initialize the LLM service.

        Args:
            api_key: Optional API key for OpenRouter. If not provided, will try to load from environment.
        """
        self.api_key = api_key or os.getenv('OPENROUTER_API_KEY')
        self.model = DEFAULT_MODEL
        self.temperature = DEFAULT_TEMPERATURE
        self.top_p = DEFAULT_TOP_P

    def analyze_similarity(
        self,
        results_df: pd.DataFrame,
        use_llm: bool = True,
    ) -> str:
        """
        Analyze similarity metrics using either an LLM or a rule-based approach.

        Args:
            results_df: DataFrame containing similarity metrics
            use_llm: Whether to use an LLM for analysis (falls back to rule-based if False or on error)

        Returns:
            str: Analysis of the metrics in markdown format with appropriate fallback messages
        """
        # If LLM is disabled, use rule-based analysis
        if not use_llm:
            logger.info("LLM analysis disabled. Using rule-based analysis.")
            return self._analyze_with_rules(results_df)

        # Try LLM analysis if enabled
        try:
            if not self.api_key:
                raise ValueError("No OpenRouter API key provided. Please set the OPENROUTER_API_KEY environment variable.")

            logger.info("Attempting LLM-based analysis...")
            return self._analyze_with_llm(results_df, max_tokens=DEFAULT_MAX_TOKENS)

        except Exception as e:
            error_msg = str(e)
            logger.error("Error in LLM analysis: %s", error_msg)

            # Create a user-friendly error message
            if "payment" in error_msg.lower() or "402" in error_msg:
                error_note = "OpenRouter API payment required. Falling back to rule-based analysis."
            elif "invalid" in error_msg.lower() or "401" in error_msg:
                error_note = "Invalid OpenRouter API key. Falling back to rule-based analysis."
            elif "rate limit" in error_msg.lower() or "429" in error_msg:
                error_note = "API rate limit exceeded. Falling back to rule-based analysis."
            else:
                error_note = f"LLM analysis failed: {error_msg[:200]}. Falling back to rule-based analysis."

            # Get rule-based analysis
            rule_based_analysis = self._analyze_with_rules(results_df)

            # Combine the error message with the rule-based analysis
            return f"## Analysis of Tibetan Text Similarity Metrics\n\n*Note: {error_note}*\n\n{rule_based_analysis}"

    def _prepare_dataframe(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Prepare the DataFrame for analysis.

        Args:
            df: Input DataFrame with similarity metrics

        Returns:
            pd.DataFrame: Cleaned and prepared DataFrame
        """
        # Make a copy to avoid modifying the original
        df = df.copy()

        # Clean text columns
        text_cols = ['Text A', 'Text B']
        for col in text_cols:
            if col in df.columns:
                df[col] = df[col].fillna('Unknown').astype(str)
                df[col] = df[col].str.replace(r'\.txt$', '', regex=True)

        # Filter out perfect matches (likely empty cells)
        metrics_cols = ['Jaccard Similarity (%)', 'Normalized LCS', 'TF-IDF Cosine Sim']
        if all(col in df.columns for col in metrics_cols):
            mask = ~((df['Jaccard Similarity (%)'] == 100.0) &
                     (df['Normalized LCS'] == 1.0) &
                     (df['TF-IDF Cosine Sim'] == 1.0))
            df = df[mask].copy()

        return df

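The cleanup in `_prepare_dataframe` can be seen on a toy results table (the file names and scores below are invented): the `.txt` suffix is stripped, and rows where every metric is a perfect match are treated as empty-cell artifacts and dropped.

```python
import pandas as pd

df = pd.DataFrame({
    "Text A": ["Japan13.txt", "Japan13.txt"],
    "Text B": ["Dolanji.txt", "Bonpo.txt"],
    "Jaccard Similarity (%)": [100.0, 42.5],
    "Normalized LCS": [1.0, 0.38],
    "TF-IDF Cosine Sim": [1.0, 0.61],
})

# Strip the .txt suffix from file names
for col in ("Text A", "Text B"):
    df[col] = df[col].str.replace(r"\.txt$", "", regex=True)

# Drop rows where every metric is a perfect match (likely empty cells)
mask = ~((df["Jaccard Similarity (%)"] == 100.0)
         & (df["Normalized LCS"] == 1.0)
         & (df["TF-IDF Cosine Sim"] == 1.0))
df = df[mask].copy()

print(df[["Text A", "Text B"]].to_dict("records"))  # → [{'Text A': 'Japan13', 'Text B': 'Bonpo'}]
```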
    def _analyze_with_llm(self, df: pd.DataFrame, max_tokens: int) -> str:
        """
        Analyze metrics using an LLM via the OpenRouter API.

        Args:
            df: Prepared DataFrame with metrics
            max_tokens: Maximum tokens for the response

        Returns:
            str: LLM analysis in markdown format
        """
        # Prepare the prompt with data and instructions
        prompt = self._create_llm_prompt(df)

        try:
            # Call the LLM API
            response = self._call_openrouter_api(
                prompt=prompt,
                system_message=self._get_system_prompt(),
                max_tokens=max_tokens,
                temperature=self.temperature,
                top_p=self.top_p,
            )

            # Process and format the response
            return self._format_llm_response(response, df)

        except Exception as e:
            logger.error("Error in LLM analysis: %s", str(e))
            raise

    def _analyze_with_rules(self, df: pd.DataFrame) -> str:
        """
        Analyze metrics using a rule-based approach.

        Args:
            df: Prepared DataFrame with metrics

        Returns:
            str: Rule-based analysis in markdown format
        """
        analysis = ["## Tibetan Text Similarity Analysis (Rule-Based)"]

        # Basic stats
        text_a_col = 'Text A' if 'Text A' in df.columns else None
        text_b_col = 'Text B' if 'Text B' in df.columns else None

        if text_a_col and text_b_col:
            unique_texts = set(df[text_a_col].unique()) | set(df[text_b_col].unique())
            analysis.append(f"- **Texts analyzed:** {', '.join(sorted(unique_texts))}")

        # Analyze each metric
        metric_analyses = []

        if 'Jaccard Similarity (%)' in df.columns:
            metric_analyses.append(self._analyze_jaccard(df))

        if 'Normalized LCS' in df.columns:
            metric_analyses.append(self._analyze_lcs(df))

        if 'TF-IDF Cosine Sim' in df.columns:
            metric_analyses.append(self._analyze_tfidf(df))

        # Add all metric analyses
        if metric_analyses:
            analysis.extend(metric_analyses)

        # Add overall interpretation
        analysis.append("\n## Overall Interpretation")
        analysis.append(self._generate_overall_interpretation(df))

        return "\n\n".join(analysis)

    def _analyze_jaccard(self, df: pd.DataFrame) -> str:
        """Analyze Jaccard similarity scores."""
        jaccard = df['Jaccard Similarity (%)'].dropna()
        if jaccard.empty:
            return ""

        mean_jaccard = jaccard.mean()
        max_jaccard = jaccard.max()
        min_jaccard = jaccard.min()

        analysis = [
            "### Jaccard Similarity Analysis",
            f"- **Range:** {min_jaccard:.1f}% to {max_jaccard:.1f}% (mean: {mean_jaccard:.1f}%)"
        ]

        # Interpret the scores
        if mean_jaccard > 60:
            analysis.append("- **High vocabulary overlap** suggests texts share significant content or are from the same tradition.")
        elif mean_jaccard > 30:
            analysis.append("- **Moderate vocabulary overlap** indicates some shared content or themes.")
        else:
            analysis.append("- **Low vocabulary overlap** suggests texts are on different topics or from different traditions.")

        # Add top pairs
        top_pairs = df.nlargest(3, 'Jaccard Similarity (%)')
        if not top_pairs.empty:
            analysis.append("\n**Most similar pairs:**")
            for _, row in top_pairs.iterrows():
                text_a = row.get('Text A', 'Text 1')
                text_b = row.get('Text B', 'Text 2')
                score = row['Jaccard Similarity (%)']
                analysis.append(f"- {text_a} ↔ {text_b}: {score:.1f}%")

        return "\n".join(analysis)

    def _analyze_lcs(self, df: pd.DataFrame) -> str:
        """Analyze Longest Common Subsequence scores."""
        lcs = df['Normalized LCS'].dropna()
        if lcs.empty:
            return ""

        mean_lcs = lcs.mean()
        max_lcs = lcs.max()
        min_lcs = lcs.min()

        analysis = [
            "### Structural Similarity (LCS) Analysis",
            f"- **Range:** {min_lcs:.2f} to {max_lcs:.2f} (mean: {mean_lcs:.2f})"
        ]

        # Interpret the scores
        if mean_lcs > 0.7:
            analysis.append("- **High structural similarity** suggests texts follow similar organizational patterns.")
        elif mean_lcs > 0.4:
            analysis.append("- **Moderate structural similarity** indicates some shared organizational elements.")
        else:
            analysis.append("- **Low structural similarity** suggests different organizational approaches.")

        # Add top pairs
        top_pairs = df.nlargest(3, 'Normalized LCS')
        if not top_pairs.empty:
            analysis.append("\n**Most structurally similar pairs:**")
            for _, row in top_pairs.iterrows():
                text_a = row.get('Text A', 'Text 1')
                text_b = row.get('Text B', 'Text 2')
                score = row['Normalized LCS']
                analysis.append(f"- {text_a} ↔ {text_b}: {score:.2f}")

        return "\n".join(analysis)

    def _analyze_tfidf(self, df: pd.DataFrame) -> str:
        """Analyze TF-IDF cosine similarity scores."""
        tfidf = df['TF-IDF Cosine Sim'].dropna()
        if tfidf.empty:
            return ""

        mean_tfidf = tfidf.mean()
        max_tfidf = tfidf.max()
        min_tfidf = tfidf.min()

        analysis = [
            "### Thematic Similarity (TF-IDF) Analysis",
            f"- **Range:** {min_tfidf:.2f} to {max_tfidf:.2f} (mean: {mean_tfidf:.2f})"
        ]

        # Interpret the scores
        if mean_tfidf > 0.8:
            analysis.append("- **High thematic similarity** suggests texts share distinctive terms and concepts.")
        elif mean_tfidf > 0.5:
            analysis.append("- **Moderate thematic similarity** indicates some shared distinctive terms.")
        else:
            analysis.append("- **Low thematic similarity** suggests different conceptual focuses.")

        # Add top pairs
        top_pairs = df.nlargest(3, 'TF-IDF Cosine Sim')
        if not top_pairs.empty:
            analysis.append("\n**Most thematically similar pairs:**")
            for _, row in top_pairs.iterrows():
                text_a = row.get('Text A', 'Text 1')
                text_b = row.get('Text B', 'Text 2')
                score = row['TF-IDF Cosine Sim']
                analysis.append(f"- {text_a} ↔ {text_b}: {score:.2f}")

        return "\n".join(analysis)

    def _generate_overall_interpretation(self, df: pd.DataFrame) -> str:
        """Generate an overall interpretation of the metrics."""
        interpretations = []

        # Check which metrics exist
        has_jaccard = 'Jaccard Similarity (%)' in df.columns
        has_lcs = 'Normalized LCS' in df.columns
        has_tfidf = 'TF-IDF Cosine Sim' in df.columns

        # Calculate means for available metrics
        metrics = {}
        if has_jaccard:
            metrics['jaccard'] = df['Jaccard Similarity (%)'].mean()
        if has_lcs:
            metrics['lcs'] = df['Normalized LCS'].mean()
        if has_tfidf:
            metrics['tfidf'] = df['TF-IDF Cosine Sim'].mean()

        # Generate interpretation based on metrics
        if metrics:
            interpretations.append("Based on the analysis of similarity metrics:")

            if has_jaccard and metrics['jaccard'] > 60:
                interpretations.append("- The high Jaccard similarity indicates significant vocabulary overlap between texts, "
                                       "suggesting they may share common sources or be part of the same textual tradition.")

            if has_lcs and metrics['lcs'] > 0.7:
                interpretations.append("- The high LCS score indicates strong structural similarity, "
                                       "suggesting the texts may follow similar organizational patterns or share common structural elements.")

            if has_tfidf and metrics['tfidf'] > 0.8:
                interpretations.append("- The high TF-IDF similarity suggests the texts share distinctive terms and concepts, "
                                       "indicating they may cover similar topics or themes.")

            # Add cross-metric interpretations
            if has_jaccard and has_lcs and metrics['jaccard'] > 60 and metrics['lcs'] > 0.7:
                interpretations.append("\nThe combination of high Jaccard and LCS similarities strongly suggests "
                                       "that these texts are closely related, possibly being different versions or "
                                       "transmissions of the same work or sharing a common source.")

            if has_tfidf and has_jaccard and metrics['tfidf'] < 0.5 and metrics['jaccard'] > 60:
                interpretations.append("\nThe high Jaccard but lower TF-IDF similarity suggests that while the texts "
                                       "share many common words, they may use them in different contexts or with different "
                                       "meanings, possibly indicating different interpretations of similar material.")

        # Fall back to general guidance if no specific patterns were found
        # (the header line alone, or an empty list, means no threshold fired)
        if len(interpretations) <= 1:
            interpretations = ["The analysis did not reveal strong patterns in the similarity metrics. "
                               "This could indicate that the texts are either very similar or very different "
                               "across all measured dimensions."]

        return "\n\n".join(interpretations)

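The per-metric helpers above share one pattern: compute the range and mean, bucket the mean against fixed thresholds, then list the top pairs with `nlargest`. That last step in isolation, on invented scores:

```python
import pandas as pd

df = pd.DataFrame({
    "Text A": ["A", "A", "B"],
    "Text B": ["B", "C", "C"],
    "Jaccard Similarity (%)": [72.0, 15.0, 48.0],
})

# Highest-scoring pairs, formatted as markdown bullets
lines = []
for _, row in df.nlargest(3, "Jaccard Similarity (%)").iterrows():
    lines.append(f"- {row['Text A']} ↔ {row['Text B']}: {row['Jaccard Similarity (%)']:.1f}%")

print(lines[0])  # → - A ↔ B: 72.0%
```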
    def _create_llm_prompt(self, df: pd.DataFrame) -> str:
        """
        Create a prompt for the LLM based on the DataFrame.

        Args:
            df: Prepared DataFrame with metrics

        Returns:
            str: Formatted prompt for the LLM
        """
        # Format the CSV data for the prompt
        csv_data = df.to_csv(index=False)

        # Create the prompt using the analysis template
        prompt = """You are a specialized text analysis interpreter with expertise in Tibetan textual studies. Your task is to analyze text similarity data from a CSV file and create a clear, narrative explanation for scholars who may not have technical expertise.

<CONTEXT>
This data comes from a text similarity analysis tool designed for various genres of Tibetan sources including historical, religious, literary, and philosophical texts. The tool compares texts using multiple linguistic metrics:
- Jaccard Similarity (%): Measures word overlap between texts (higher % = more similar)
- Normalized LCS: Longest Common Subsequence, measuring sequential text patterns
- Semantic Similarity: Deep meaning comparison using sentence transformers or FastText
- TF-IDF Cosine Similarity: Term frequency-inverse document frequency comparison
The "Chapter" column indicates which chapter/section of the texts is being compared.
</CONTEXT>

<INSTRUCTIONS>
1. Begin by identifying the specific texts being compared in the data (e.g., "Japan13.txt vs Dolanji.txt").

2. Create a dual-layer narrative analysis (800-1000 words) that includes:
   a) A high-level overview of text similarity patterns accessible to non-technical readers
   b) A more detailed analysis for scholars interested in specific textual relationships

3. In your analysis:
   - Summarize overall similarity patterns between the texts across all chapters
   - Identify which chapters show the strongest similarities and differences
   - Explain whether similarities appear to be more lexical (Jaccard, LCS) or conceptual (Semantic)
   - Interpret what these patterns might suggest about textual relationships, transmission, or variant histories
   - Note any interesting anomalies (e.g., chapters with high semantic but low lexical similarity)

4. Structure your analysis with:
   - An introduction explaining the texts compared and general observations
   - A section on overall patterns across all chapters with visualized trends
   - A detailed examination of 2-3 notable chapters (highest/lowest similarity)
   - A discussion of what different metrics reveal about textual relationships
   - A conclusion suggesting what these patterns might mean for Tibetan textual scholarship
   - 2-3 specific questions these findings raise for further investigation

5. Connect your analysis to common interests in Tibetan textual studies such as:
   - Textual transmission and lineages
   - Regional variants and dialectical differences
   - Potential historical relationships between texts
   - Original vs. commentary material identification

6. Consider using a "family tree" analogy to make the textual relationships more intuitive. For example:
   - Texts with very high similarity (>80%) might be described as "siblings" from the same direct source
   - Texts with moderate similarity (50-80%) could be "cousins" sharing a common ancestor but with separate development
   - Texts with low similarity (<50%) might be "distant relatives" with only fundamental connections
   Use this metaphor if it helps clarify the relationships, but don't force it if another explanation would be clearer.

7. **Important note on perfect or zero similarity matches:**
   If you notice that all metrics indicate perfect or near-perfect similarity (for example, scores of 1.0/100 across all metrics for a chapter) or 0 for a complete mismatch, this may not indicate true textual identity or lack thereof. Instead, it likely means both corresponding text cells were empty or contained no content. In these cases, be sure to clarify in your narrative that such results are *artifacts of missing data, not genuine textual matches*, and should be interpreted with caution.

8. Balance scholarly precision with accessibility, explaining technical concepts when necessary while keeping the overall narrative engaging for non-technical readers.
</INSTRUCTIONS>

Here is the CSV data to analyze:
[CSV_DATA]
"""

        # Replace [CSV_DATA] with the actual CSV data
        prompt = prompt.replace("[CSV_DATA]", csv_data)

        return prompt

    def _get_system_prompt(self) -> str:
        """Get the system prompt for the LLM."""
        return """
You are a senior scholar of Tibetan Buddhist texts with expertise in textual criticism and
comparative analysis. Your task is to analyze the provided similarity metrics and provide
expert-level insights into the relationships between these Tibetan texts.

CRITICAL INSTRUCTIONS:
1. Your analysis MUST be grounded in the specific metrics provided
2. Always reference actual text names and metric values when making claims
3. Focus on what the data shows, not what it might show
4. Be precise and avoid vague or generic statements

ANALYSIS APPROACH:
1. Begin with a brief executive summary of the most significant findings
2. Group similar text pairs and explain their relationships
3. Highlight any patterns that suggest textual transmission or common sources
4. Note any anomalies or unexpected results that merit further investigation
5. Provide specific examples from the data to support your analysis

TIBETAN TEXT-SPECIFIC GUIDANCE:
- Consider the implications of shared vocabulary in the context of Tibetan Buddhist literature
- Be aware that high LCS scores might indicate shared liturgical or formulaic language
- Note that texts with similar Jaccard but different LCS scores might share content but differ in structure
- Consider the possibility of text reuse, commentary traditions, or shared sources

Your analysis should be scholarly but accessible, providing clear insights that would be
valuable to researchers studying these texts.
"""

    def _call_openrouter_api(
        self,
        prompt: str,
        system_message: str = None,
        max_tokens: int = None,
        temperature: float = None,
        top_p: float = None,
    ) -> str:
        """
        Call the OpenRouter API.

        Args:
            prompt: The user prompt
            system_message: Optional system message
            max_tokens: Maximum tokens for the response
            temperature: Sampling temperature
            top_p: Nucleus sampling parameter

        Returns:
            str: The API response

        Raises:
            ValueError: If the API key is missing or invalid
            requests.exceptions.RequestException: For network-related errors
            Exception: For other API-related errors
        """
        if not self.api_key:
            error_msg = "OpenRouter API key not provided. Please set the OPENROUTER_API_KEY environment variable."
            logger.error(error_msg)
            raise ValueError(error_msg)

        url = "https://openrouter.ai/api/v1/chat/completions"

        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
            "HTTP-Referer": "https://github.com/daniel-wojahn/tibetan-text-metrics",
|
503 |
+
"X-Title": "Tibetan Text Metrics"
|
504 |
+
}
|
505 |
+
|
506 |
+
messages = []
|
507 |
+
if system_message:
|
508 |
+
messages.append({"role": "system", "content": system_message})
|
509 |
+
messages.append({"role": "user", "content": prompt})
|
510 |
+
|
511 |
+
data = {
|
512 |
+
"model": self.model,
|
513 |
+
"messages": messages,
|
514 |
+
"max_tokens": max_tokens or DEFAULT_MAX_TOKENS,
|
515 |
+
"temperature": temperature or self.temperature,
|
516 |
+
"top_p": top_p or self.top_p,
|
517 |
+
}
|
518 |
+
|
519 |
+
try:
|
520 |
+
logger.info(f"Calling OpenRouter API with model: {self.model}")
|
521 |
+
response = requests.post(url, headers=headers, json=data, timeout=60)
|
522 |
+
|
523 |
+
# Handle different HTTP status codes
|
524 |
+
if response.status_code == 200:
|
525 |
+
result = response.json()
|
526 |
+
if 'choices' in result and len(result['choices']) > 0:
|
527 |
+
return result['choices'][0]['message']['content'].strip()
|
528 |
+
else:
|
529 |
+
error_msg = "Unexpected response format from OpenRouter API"
|
530 |
+
logger.error(f"{error_msg}: {result}")
|
531 |
+
raise ValueError(error_msg)
|
532 |
+
|
533 |
+
elif response.status_code == 401:
|
534 |
+
error_msg = "Invalid OpenRouter API key. Please check your API key and try again."
|
535 |
+
logger.error(error_msg)
|
536 |
+
raise ValueError(error_msg)
|
537 |
+
|
538 |
+
elif response.status_code == 402:
|
539 |
+
error_msg = "OpenRouter API payment required. Please check your OpenRouter account balance or billing status."
|
540 |
+
logger.error(error_msg)
|
541 |
+
raise ValueError(error_msg)
|
542 |
+
|
543 |
+
elif response.status_code == 429:
|
544 |
+
error_msg = "API rate limit exceeded. Please try again later or check your OpenRouter rate limits."
|
545 |
+
logger.error(error_msg)
|
546 |
+
raise ValueError(error_msg)
|
547 |
+
|
548 |
+
else:
|
549 |
+
error_msg = f"OpenRouter API error: {response.status_code} - {response.text}"
|
550 |
+
logger.error(error_msg)
|
551 |
+
raise Exception(error_msg)
|
552 |
+
|
553 |
+
except requests.exceptions.RequestException as e:
|
554 |
+
error_msg = f"Failed to connect to OpenRouter API: {str(e)}"
|
555 |
+
logger.error(error_msg)
|
556 |
+
raise Exception(error_msg) from e
|
557 |
+
|
558 |
+
except json.JSONDecodeError as e:
|
559 |
+
error_msg = f"Failed to parse OpenRouter API response: {str(e)}"
|
560 |
+
logger.error(error_msg)
|
561 |
+
raise Exception(error_msg) from e
|
562 |
+
|
563 |
+
def _format_llm_response(self, response: str, df: pd.DataFrame) -> str:
|
564 |
+
"""
|
565 |
+
Format the LLM response for display.
|
566 |
+
|
567 |
+
Args:
|
568 |
+
response: Raw LLM response
|
569 |
+
df: Original DataFrame for reference
|
570 |
+
|
571 |
+
Returns:
|
572 |
+
str: Formatted response with fallback if needed
|
573 |
+
"""
|
574 |
+
# Basic validation
|
575 |
+
if not response or len(response) < 100:
|
576 |
+
raise ValueError("Response too short or empty")
|
577 |
+
|
578 |
+
# Check for garbled output (random numbers, nonsensical patterns)
|
579 |
+
# This is a simple heuristic - look for long sequences of numbers or strange patterns
|
580 |
+
suspicious_patterns = [
|
581 |
+
r'\d{8,}', # Long number sequences
|
582 |
+
r'[0-9,.]{20,}', # Long sequences of digits, commas and periods
|
583 |
+
r'[\W]{20,}', # Long sequences of non-word characters
|
584 |
+
]
|
585 |
+
|
586 |
+
for pattern in suspicious_patterns:
|
587 |
+
if re.search(pattern, response):
|
588 |
+
logger.warning(f"Detected potentially garbled output matching pattern: {pattern}")
|
589 |
+
# Don't immediately raise - we'll do a more comprehensive check
|
590 |
+
|
591 |
+
# Check for content quality - ensure it has expected sections
|
592 |
+
expected_content = [
|
593 |
+
"introduction", "analysis", "similarity", "patterns", "conclusion", "question"
|
594 |
+
]
|
595 |
+
|
596 |
+
# Count how many expected content markers we find
|
597 |
+
content_matches = sum(1 for term in expected_content if term.lower() in response.lower())
|
598 |
+
|
599 |
+
# If we find fewer than 3 expected content markers, it's likely not a good analysis
|
600 |
+
if content_matches < 3:
|
601 |
+
logger.warning(f"LLM response missing expected content sections (found {content_matches}/6)")
|
602 |
+
raise ValueError("Response does not contain expected analysis sections")
|
603 |
+
|
604 |
+
# Check for text names from the dataset
|
605 |
+
# Extract text names from the Text Pair column
|
606 |
+
text_names = set()
|
607 |
+
if "Text Pair" in df.columns:
|
608 |
+
for pair in df["Text Pair"]:
|
609 |
+
if isinstance(pair, str) and " vs " in pair:
|
610 |
+
texts = pair.split(" vs ")
|
611 |
+
text_names.update(texts)
|
612 |
+
|
613 |
+
# Check if at least some text names appear in the response
|
614 |
+
text_name_matches = sum(1 for name in text_names if name in response)
|
615 |
+
if text_names and text_name_matches == 0:
|
616 |
+
logger.warning("LLM response does not mention any of the text names from the dataset")
|
617 |
+
raise ValueError("Response does not reference any of the analyzed texts")
|
618 |
+
|
619 |
+
# Ensure basic markdown structure
|
620 |
+
if '##' not in response:
|
621 |
+
response = f"## Analysis of Tibetan Text Similarity\n\n{response}"
|
622 |
+
|
623 |
+
# Add styling to make the output more readable
|
624 |
+
response = f"<div class='llm-analysis'>\n{response}\n</div>"
|
625 |
+
|
626 |
+
return response
|
627 |
+
|
628 |
+
|
629 |
+
def get_interpretation(results_df: pd.DataFrame, use_llm: bool = True) -> str:
|
630 |
+
"""
|
631 |
+
Get an interpretation of the similarity metrics.
|
632 |
+
|
633 |
+
This is a convenience function that creates an LLMService instance
|
634 |
+
and calls analyze_similarity with default parameters.
|
635 |
+
|
636 |
+
Args:
|
637 |
+
results_df: DataFrame containing similarity metrics
|
638 |
+
use_llm: Whether to use LLM for analysis (falls back to rule-based if False or on error)
|
639 |
+
|
640 |
+
Returns:
|
641 |
+
str: Analysis of the metrics in markdown format
|
642 |
+
"""
|
643 |
+
service = LLMService()
|
644 |
+
return service.analyze_similarity(results_df, use_llm=use_llm)
|
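The garbled-output heuristic in `_format_llm_response` can be exercised on its own. A minimal sketch: the regex list is copied from the method above, while `looks_garbled` is a hypothetical helper and the sample strings are invented:

```python
import re

# Patterns copied from _format_llm_response's garbled-output heuristic
suspicious_patterns = [
    r'\d{8,}',        # long number sequences
    r'[0-9,.]{20,}',  # long runs of digits, commas and periods
    r'[\W]{20,}',     # long runs of non-word characters
]

def looks_garbled(text: str) -> bool:
    """Return True if any suspicious pattern matches the text."""
    return any(re.search(p, text) for p in suspicious_patterns)

assert looks_garbled("noise 123456789 noise")          # 9-digit run trips the first pattern
assert not looks_garbled("The Jaccard score is 85.2%.")  # normal prose passes
```

Note that in the method itself a match only logs a warning; rejection is decided by the later section and text-name checks.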
pipeline/metrics.py
CHANGED
@@ -7,9 +7,9 @@ import torch
 from .semantic_embedding import generate_embeddings
 from .tokenize import tokenize_texts
 import logging
-from sentence_transformers import SentenceTransformer
 from sklearn.feature_extraction.text import TfidfVectorizer
 from .stopwords_bo import TIBETAN_STOPWORDS, TIBETAN_STOPWORDS_SET

 # Attempt to import the Cython-compiled fast_lcs module
 try:
@@ -107,6 +107,9 @@ def compute_semantic_similarity(
     tokens2: List[str],
     model,
     device,
 ) -> float:
     """Computes semantic similarity using a sentence transformer model, with chunking for long texts."""
     if model is None or device is None:
@@ -122,9 +125,9 @@ def compute_semantic_similarity(
         return 0.0  # Or np.nan, depending on desired behavior for empty inputs

     def _get_aggregated_embedding(
-        raw_text_segment: str, botok_tokens: List[str], model_obj, device_str
     ) -> torch.Tensor | None:
-        """Helper to get a single embedding for a text, chunking if necessary."""
         if (
             not botok_tokens and not raw_text_segment.strip()
         ):  # Check if effectively empty
@@ -132,54 +135,147 @@ def compute_semantic_similarity(
             f"Text segment is empty or only whitespace: {raw_text_segment[:100]}... Returning None for embedding."
             )
             return None
-        text_chunks = _chunk_text(
-            raw_text_segment, botok_tokens, MAX_TOKENS_PER_CHUNK, CHUNK_OVERLAP
-        )
-        if not text_chunks:
-            logger.warning(
-                f"Chunking resulted in no chunks for segment: {raw_text_segment[:100]}..."
             )
             return None
-        if chunk_embeddings is None or chunk_embeddings.nelement() == 0:
             logger.error(
             )
             return None
         else:
             logger.info(
             )
-            logger.error(
-                f"Failed to generate embedding for text: {raw_text_segment[:100]}..."
             )

     try:
-        # Pass raw text and its pre-computed botok tokens
-        embedding1 = _get_aggregated_embedding(text1_segment, tokens1, model, device)
-        embedding2 = _get_aggregated_embedding(text2_segment, tokens2, model, device)

         if (
             embedding1 is None
@@ -192,6 +288,11 @@ def compute_semantic_similarity(
             )
             return np.nan

         # Cosine similarity expects 2D arrays, embeddings are [1, embed_dim] and on CPU
         similarity = cosine_similarity(embedding1.numpy(), embedding2.numpy())
         return float(similarity[0][0])
@@ -204,7 +305,9 @@ def compute_semantic_similarity(


 def compute_all_metrics(
-    texts: Dict[str, str], model=None, device=None, enable_semantic: bool = True
 ) -> pd.DataFrame:
     """
     Computes all selected similarity metrics between pairs of texts.
@@ -220,14 +323,13 @@ def compute_all_metrics(
     Returns:
         pd.DataFrame: A DataFrame where each row contains the metrics for a pair of texts,
                       including 'Text Pair', 'Jaccard Similarity (%)', 'Normalized LCS',
     """
     files = list(texts.keys())
     results = []
     # Prepare token lists (always use tokenize_texts for raw Unicode)
     token_lists = {}
     corpus_for_tfidf = []  # For storing space-joined tokens for TF-IDF
-    tibetan_stopwords_set = set()  # Initialize for Jaccard (and potentially LCS) filtering

     for fname, content in texts.items():
         tokenized_content = tokenize_texts([content])  # Returns a list of lists
@@ -245,36 +347,65 @@ def compute_all_metrics(

     # TF-IDF Vectorization and Cosine Similarity Calculation
     if corpus_for_tfidf:
     else:
         # Handle case with no texts or all empty texts
         )  # Or some other appropriate empty/default structure

     for i, j in combinations(range(len(files)), 2):
         f1, f2 = files[i], files[j]
         words1_raw, words2_raw = token_lists[f1], token_lists[f2]

         jaccard = (
             len(set(words1_jaccard) & set(words2_jaccard)) / len(set(words1_jaccard) | set(words2_jaccard))
@@ -290,7 +421,7 @@ def compute_all_metrics(
         if enable_semantic:
             # Pass raw texts and their pre-computed botok tokens
             semantic_sim = compute_semantic_similarity(
-                texts[f1], texts[f2], words1_raw, words2_raw, model, device
             )
         else:
             semantic_sim = np.nan
@@ -300,8 +431,9 @@ def compute_all_metrics(
             "Jaccard Similarity (%)": jaccard_percent,
             "Normalized LCS": norm_lcs,
             # Pass tokens1 and tokens2 to compute_semantic_similarity
             "TF-IDF Cosine Sim": (
                 cosine_sim_matrix[i, j]
                 if cosine_sim_matrix.size > 0
                 and i < cosine_sim_matrix.shape[0]

 from .semantic_embedding import generate_embeddings
 from .tokenize import tokenize_texts
 import logging
 from sklearn.feature_extraction.text import TfidfVectorizer
 from .stopwords_bo import TIBETAN_STOPWORDS, TIBETAN_STOPWORDS_SET
+from .stopwords_lite_bo import TIBETAN_STOPWORDS_LITE, TIBETAN_STOPWORDS_LITE_SET

 # Attempt to import the Cython-compiled fast_lcs module
 try:

     tokens2: List[str],
     model,
     device,
+    model_type: str = "sentence_transformer",
+    use_stopwords: bool = True,
+    use_lite_stopwords: bool = False,
 ) -> float:
     """Computes semantic similarity using a sentence transformer model, with chunking for long texts."""
     if model is None or device is None:

         return 0.0  # Or np.nan, depending on desired behavior for empty inputs

     def _get_aggregated_embedding(
+        raw_text_segment: str, botok_tokens: List[str], model_obj, device_str, model_type: str = "sentence_transformer", use_stopwords: bool = True, use_lite_stopwords: bool = False
     ) -> torch.Tensor | None:
+        """Helper to get a single embedding for a text, chunking if necessary for transformer models."""
         if (
             not botok_tokens and not raw_text_segment.strip()
         ):  # Check if effectively empty
             logger.info(
                 f"Text segment is empty or only whitespace: {raw_text_segment[:100]}... Returning None for embedding."
             )
             return None
+
+        # For FastText, we don't need chunking as it processes tokens directly
+        if model_type == "fasttext":
+            if not raw_text_segment.strip():
+                logger.info(
+                    f"Text segment is empty or only whitespace: {raw_text_segment[:100]}... Returning None for embedding."
+                )
+                return None
+
+            # Pass the raw text, pre-tokenized tokens, and stopword parameters
+            # Wrap the tokens in a list since generate_embeddings expects a list of token lists
+            embedding = generate_embeddings(
+                [raw_text_segment],
+                model_obj,
+                device_str,
+                model_type,
+                tokenize_fn=[botok_tokens],  # Wrap in list since we're passing tokens for one text
+                use_stopwords=use_stopwords,
+                use_lite_stopwords=use_lite_stopwords
+            )
+
+            if embedding is None or embedding.nelement() == 0:
                 logger.error(
+                    f"Failed to generate FastText embedding for text: {raw_text_segment[:100]}..."
                 )
                 return None
+            return embedding  # Already [1, embed_dim]
+
+        # For transformer models, check if all tokens are stopwords when filtering is enabled
+        elif use_stopwords:
+            # Filter stopwords to see if any content remains
+            filtered_tokens = []
+            if use_lite_stopwords:
+                from .stopwords_lite_bo import TIBETAN_STOPWORDS_LITE_SET
+                filtered_tokens = [token for token in botok_tokens if token not in TIBETAN_STOPWORDS_LITE_SET]
+            else:
+                from .stopwords_bo import TIBETAN_STOPWORDS_SET
+                filtered_tokens = [token for token in botok_tokens if token not in TIBETAN_STOPWORDS_SET]
+
+            # If all tokens were filtered out as stopwords, return zero embedding
+            if not filtered_tokens:
+                logger.info("All tokens in text are stopwords. Returning zero embedding.")
+                # Create a zero tensor with the same dimension as the model's output
+                # For transformer models, typically 384 or 768 dimensions
+                embedding_dim = 384  # Default dimension for MiniLM models
+                return torch.zeros(1, embedding_dim)
+
+            # Continue with normal processing if content remains after filtering
+            if len(botok_tokens) > MAX_TOKENS_PER_CHUNK:
+                logger.info(
+                    f"Text segment with ~{len(botok_tokens)} tokens exceeds {MAX_TOKENS_PER_CHUNK}, chunking {raw_text_segment[:30]}..."
+                )
+                # Pass the original raw text and its pre-computed botok tokens to _chunk_text
+                text_chunks = _chunk_text(
+                    raw_text_segment, botok_tokens, MAX_TOKENS_PER_CHUNK, CHUNK_OVERLAP
+                )
+                if not text_chunks:
+                    logger.warning(
+                        f"Chunking resulted in no chunks for segment: {raw_text_segment[:100]}..."
+                    )
+                    return None
+
+                logger.info(
+                    f"Generated {len(text_chunks)} chunks for segment: {raw_text_segment[:30]}..."
+                )
+                # Generate embeddings for each chunk using the model
+                chunk_embeddings = generate_embeddings(text_chunks, model_obj, device_str, model_type)
+
+                if chunk_embeddings is None or chunk_embeddings.nelement() == 0:
+                    logger.error(
+                        f"Failed to generate embeddings for chunks of text: {raw_text_segment[:100]}..."
+                    )
+                    return None
+                # Mean pooling of chunk embeddings
+                aggregated_embedding = torch.mean(chunk_embeddings, dim=0, keepdim=True)
+                return aggregated_embedding
+            else:
+                # Text is short enough for transformer model, embed raw text directly
+                if not raw_text_segment.strip():
+                    logger.info(
+                        f"Text segment is empty or only whitespace: {raw_text_segment[:100]}... Returning None for embedding."
+                    )
+                    return None
+
+                embedding = generate_embeddings([raw_text_segment], model_obj, device_str, model_type)
+                if embedding is None or embedding.nelement() == 0:
+                    logger.error(
+                        f"Failed to generate embedding for text: {raw_text_segment[:100]}..."
+                    )
+                    return None
+                return embedding  # Already [1, embed_dim]
         else:
+            # No stopword filtering, proceed with normal processing
+            if len(botok_tokens) > MAX_TOKENS_PER_CHUNK:
                 logger.info(
+                    f"Text segment with ~{len(botok_tokens)} tokens exceeds {MAX_TOKENS_PER_CHUNK}, chunking {raw_text_segment[:30]}..."
                 )
+                # Pass the original raw text and its pre-computed botok tokens to _chunk_text
+                text_chunks = _chunk_text(
+                    raw_text_segment, botok_tokens, MAX_TOKENS_PER_CHUNK, CHUNK_OVERLAP
+                )
+                if not text_chunks:
+                    logger.warning(
+                        f"Chunking resulted in no chunks for segment: {raw_text_segment[:100]}..."
+                    )
+                    return None

+                logger.info(
+                    f"Generated {len(text_chunks)} chunks for segment: {raw_text_segment[:30]}..."
                 )
+                # Generate embeddings for each chunk using the model
+                chunk_embeddings = generate_embeddings(text_chunks, model_obj, device_str, model_type)
+
+                if chunk_embeddings is None or chunk_embeddings.nelement() == 0:
+                    logger.error(
+                        f"Failed to generate embeddings for chunks of text: {raw_text_segment[:100]}..."
+                    )
+                    return None
+                # Mean pooling of chunk embeddings
+                aggregated_embedding = torch.mean(chunk_embeddings, dim=0, keepdim=True)
+                return aggregated_embedding
+            else:
+                # Text is short enough for transformer model, embed raw text directly
+                if not raw_text_segment.strip():
+                    logger.info(
+                        f"Text segment is empty or only whitespace: {raw_text_segment[:100]}... Returning None for embedding."
+                    )
+                    return None
+
+                embedding = generate_embeddings([raw_text_segment], model_obj, device_str, model_type)
+                if embedding is None or embedding.nelement() == 0:
+                    logger.error(
+                        f"Failed to generate embedding for text: {raw_text_segment[:100]}..."
+                    )
+                    return None
+                return embedding  # Already [1, embed_dim]

     try:
+        # Pass raw text and its pre-computed botok tokens with stopword preference
+        embedding1 = _get_aggregated_embedding(text1_segment, tokens1, model, device, model_type, use_stopwords, use_lite_stopwords)
+        embedding2 = _get_aggregated_embedding(text2_segment, tokens2, model, device, model_type, use_stopwords, use_lite_stopwords)

         if (
             embedding1 is None

             )
             return np.nan

+        # Check if both embeddings are zero vectors (which happens when all tokens are stopwords)
+        if np.all(embedding1.numpy() == 0) and np.all(embedding2.numpy() == 0):
+            # If both texts contain only stopwords, return 0 similarity
+            return 0.0
+
         # Cosine similarity expects 2D arrays, embeddings are [1, embed_dim] and on CPU
         similarity = cosine_similarity(embedding1.numpy(), embedding2.numpy())
         return float(similarity[0][0])


 def compute_all_metrics(
+    texts: Dict[str, str], model=None, device=None, enable_semantic: bool = True,
+    model_type: str = "sentence_transformer", use_stopwords: bool = True,
+    use_lite_stopwords: bool = False
 ) -> pd.DataFrame:
     """
     Computes all selected similarity metrics between pairs of texts.

     Returns:
         pd.DataFrame: A DataFrame where each row contains the metrics for a pair of texts,
                       including 'Text Pair', 'Jaccard Similarity (%)', 'Normalized LCS',
+                      and 'Semantic Similarity'.
     """
     files = list(texts.keys())
     results = []
     # Prepare token lists (always use tokenize_texts for raw Unicode)
     token_lists = {}
     corpus_for_tfidf = []  # For storing space-joined tokens for TF-IDF

     for fname, content in texts.items():
         tokenized_content = tokenize_texts([content])  # Returns a list of lists

     # TF-IDF Vectorization and Cosine Similarity Calculation
     if corpus_for_tfidf:
+        try:
+            # Using a dummy tokenizer and preprocessor as input is already tokenized (as space-separated strings)
+            # and we don't want further case changes or token modifications for Tibetan.
+
+            # Select appropriate stopwords list based on user preference
+            if use_stopwords:
+                # Choose between regular and lite stopwords list
+                if use_lite_stopwords:
+                    stopwords_to_use = TIBETAN_STOPWORDS_LITE
+                else:
+                    stopwords_to_use = TIBETAN_STOPWORDS
+            else:
+                # If stopwords are disabled, use an empty list
+                stopwords_to_use = []
+
+            vectorizer = TfidfVectorizer(
+                tokenizer=lambda x: x.split(),
+                preprocessor=lambda x: x,
+                token_pattern=None,
+                stop_words=stopwords_to_use
+            )
+            tfidf_matrix = vectorizer.fit_transform(corpus_for_tfidf)
+            # Calculate pairwise cosine similarity on the TF-IDF matrix
+            # This gives a square matrix where cosine_sim_matrix[i, j] is the similarity between doc i and doc j
+            cosine_sim_matrix = cosine_similarity(tfidf_matrix)
+        except ValueError as e:
+            if "empty vocabulary" in str(e):
+                # If vocabulary is empty after stopword removal, create a zero matrix
+                n = len(corpus_for_tfidf)
+                cosine_sim_matrix = np.zeros((n, n))
+            else:
+                # Re-raise other ValueError
+                raise
     else:
         # Handle case with no texts or all empty texts
+        n = len(files) if files else 0
+        cosine_sim_matrix = np.zeros((n, n))

     for i, j in combinations(range(len(files)), 2):
         f1, f2 = files[i], files[j]
         words1_raw, words2_raw = token_lists[f1], token_lists[f2]

+        # Select appropriate stopwords set based on user preference
+        if use_stopwords:
+            # Choose between regular and lite stopwords sets
+            if use_lite_stopwords:
+                stopwords_set_to_use = TIBETAN_STOPWORDS_LITE_SET
+            else:
+                stopwords_set_to_use = TIBETAN_STOPWORDS_SET
+        else:
+            # If stopwords are disabled, use an empty set
+            stopwords_set_to_use = set()
+
+        # Filter stopwords for Jaccard calculation
+        words1_jaccard = [word for word in words1_raw if word not in stopwords_set_to_use]
+        words2_jaccard = [word for word in words2_raw if word not in stopwords_set_to_use]
+
+        # Check if both texts only contain stopwords
+        both_only_stopwords = len(words1_jaccard) == 0 and len(words2_jaccard) == 0

         jaccard = (
             len(set(words1_jaccard) & set(words2_jaccard)) / len(set(words1_jaccard) | set(words2_jaccard))

         if enable_semantic:
             # Pass raw texts and their pre-computed botok tokens
             semantic_sim = compute_semantic_similarity(
+                texts[f1], texts[f2], words1_raw, words2_raw, model, device, model_type, use_stopwords, use_lite_stopwords
             )
         else:
             semantic_sim = np.nan

             "Jaccard Similarity (%)": jaccard_percent,
             "Normalized LCS": norm_lcs,
             # Pass tokens1 and tokens2 to compute_semantic_similarity
+            "Semantic Similarity": semantic_sim,
             "TF-IDF Cosine Sim": (
+                0.0 if both_only_stopwords else
                 cosine_sim_matrix[i, j]
                 if cosine_sim_matrix.size > 0
                 and i < cosine_sim_matrix.shape[0]
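The chunk aggregation above (`torch.mean(chunk_embeddings, dim=0, keepdim=True)`) is just an element-wise average of the per-chunk vectors. A dependency-free sketch with toy 3-dimensional lists in place of real embedding tensors; `mean_pool` is a hypothetical helper, not part of the pipeline:

```python
def mean_pool(chunk_embeddings):
    """Element-wise average of equal-length vectors, one vector per chunk."""
    n = len(chunk_embeddings)
    dim = len(chunk_embeddings[0])
    return [sum(vec[d] for vec in chunk_embeddings) / n for d in range(dim)]

# Two toy "chunk embeddings" average to a single segment vector
chunks = [[1.0, 0.0, 2.0], [3.0, 4.0, 2.0]]
assert mean_pool(chunks) == [2.0, 2.0, 2.0]
```

Mean pooling keeps the aggregated vector in the same space as the chunk vectors, so cosine similarity between two segment vectors remains meaningful regardless of how many chunks each segment was split into.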
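The stopword-filtered Jaccard step in `compute_all_metrics` reduces to a set computation. A minimal sketch with invented English tokens standing in for botok output (`jaccard_percent` is a hypothetical helper, and the early return mirrors the `both_only_stopwords` guard above):

```python
def jaccard_percent(tokens1, tokens2, stopwords=frozenset()):
    """Jaccard similarity (%) over stopword-filtered token sets."""
    set1 = {t for t in tokens1 if t not in stopwords}
    set2 = {t for t in tokens2 if t not in stopwords}
    if not set1 and not set2:
        return 0.0  # both texts contained only stopwords
    return 100.0 * len(set1 & set2) / len(set1 | set2)

stop = frozenset({"the", "a", "of"})
a = ["the", "lamp", "of", "wisdom"]
b = ["a", "lamp", "of", "compassion"]
# shared content word: {lamp}; union: {lamp, wisdom, compassion}
assert round(jaccard_percent(a, b, stop), 2) == 33.33
```

Filtering particles first keeps the score from being inflated by grammatical words every Tibetan text shares, which is the rationale for the regular and lite stopword lists.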
pipeline/process.py
CHANGED
@@ -1,7 +1,7 @@
|
|
1 |
import pandas as pd
|
2 |
from typing import Dict, List, Tuple
|
3 |
from .metrics import compute_all_metrics
|
4 |
-
from .semantic_embedding import
|
5 |
from .tokenize import tokenize_texts
|
6 |
import logging
|
7 |
from itertools import combinations
|
@@ -10,119 +10,345 @@ logger = logging.getLogger(__name__)
|
|
10 |
|
11 |
|
12 |
def process_texts(
|
13 |
-
text_data: Dict[str, str],
|
|
|
|
|
|
|
|
|
|
|
|
|
14 |
) -> Tuple[pd.DataFrame, pd.DataFrame, str]:
|
15 |
"""
|
16 |
Processes uploaded texts, segments them by chapter marker, and computes metrics between chapters of different files.
|
|
|
17 |
Args:
|
18 |
text_data (Dict[str, str]): A dictionary mapping filenames to their content.
|
19 |
filenames (List[str]): A list of filenames that were uploaded.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
20 |
Returns:
|
21 |
Tuple[pd.DataFrame, pd.DataFrame, str]:
|
22 |
- metrics_df: DataFrame with similarity metrics between corresponding chapters of file pairs.
|
|
|
|
|
23 |
- word_counts_df: DataFrame with word counts for each segment (chapter) in each file.
|
|
|
24 |
- warning: A string containing any warnings generated during processing (e.g., missing chapter markers).
|
|
|
|
|
|
|
|
|
25 |
"""
|
|
|
26 |
st_model, st_device = None, None
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
27 |
if enable_semantic:
|
28 |
-
logger.info(
|
29 |
-
"Semantic similarity enabled. Loading sentence transformer model..."
|
30 |
-
)
|
31 |
try:
|
32 |
-
|
33 |
-
|
34 |
-
|
35 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
36 |
except Exception as e:
|
37 |
-
|
38 |
-
|
39 |
-
|
40 |
-
#
|
41 |
-
|
42 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
43 |
else:
|
44 |
logger.info("Semantic similarity disabled. Skipping model loading.")
|
|
|
|
|
|
|
|
|
|
|
45 |
|
46 |
-
# Detect chapter marker
|
|
|
|
|
|
|
|
|
|
|
|
|
47 |
chapter_marker = "༈"
|
48 |
fallback = False
|
49 |
segment_texts = {}
|
50 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
51 |
content = text_data[fname]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
52 |
if chapter_marker in content:
|
53 |
segments = [
|
54 |
seg.strip() for seg in content.split(chapter_marker) if seg.strip()
|
55 |
]
|
|
|
|
|
|
|
|
|
|
|
|
|
```python
import pandas as pd
from typing import Dict, List, Tuple
from .metrics import compute_all_metrics
from .semantic_embedding import get_model_and_device, train_fasttext_model, FASTTEXT_MODEL_ID
from .tokenize import tokenize_texts
import logging
from itertools import combinations

logger = logging.getLogger(__name__)


def process_texts(
    text_data: Dict[str, str],
    filenames: List[str],
    enable_semantic: bool = True,
    model_name: str = "buddhist-nlp/buddhist-sentence-similarity",
    use_stopwords: bool = True,
    use_lite_stopwords: bool = False,
    progress_callback=None,
) -> Tuple[pd.DataFrame, pd.DataFrame, str]:
    """
    Processes uploaded texts, segments them by chapter marker, and computes metrics
    between chapters of different files.

    Args:
        text_data (Dict[str, str]): A dictionary mapping filenames to their content.
        filenames (List[str]): A list of filenames that were uploaded.
        enable_semantic (bool, optional): Whether to compute semantic similarity metrics.
            Requires loading a sentence transformer model, which can be time-consuming.
            Defaults to True.
        model_name (str, optional): The name of the sentence transformer model to use for
            semantic similarity. Must be a valid model identifier on Hugging Face.
            Defaults to "buddhist-nlp/buddhist-sentence-similarity".
        use_stopwords (bool, optional): Whether to use stopwords in the metrics
            calculation. Defaults to True.
        use_lite_stopwords (bool, optional): Whether to use the lite stopwords list
            (common particles only) instead of the comprehensive list. Only applies if
            use_stopwords is True. Defaults to False.
        progress_callback (callable, optional): A callback function for reporting
            progress updates. Should accept a float between 0 and 1 and a description
            string. Defaults to None.

    Returns:
        Tuple[pd.DataFrame, pd.DataFrame, str]:
            - metrics_df: DataFrame with similarity metrics between corresponding
              chapters of file pairs. Contains columns: 'Text Pair', 'Chapter',
              'Jaccard Similarity (%)', 'Normalized LCS', 'Semantic Similarity'
              (if enable_semantic=True), and 'TF-IDF Cosine Sim'.
            - word_counts_df: DataFrame with word counts for each segment (chapter) in
              each file. Contains columns: 'Filename', 'ChapterNumber', 'SegmentID',
              'WordCount'.
            - warning: A string containing any warnings generated during processing
              (e.g., missing chapter markers).

    Raises:
        RuntimeError: If the botok tokenizer fails to initialize.
        ValueError: If the input files cannot be processed or if metrics computation fails.
    """
    # Initialize model and device variables
    st_model, st_device = None, None
    model_warning = ""

    # Update progress if callback provided
    if progress_callback is not None:
        try:
            progress_callback(0.25, desc="Preparing for text analysis...")
        except Exception as e:
            logger.warning(f"Progress callback error (non-critical): {e}")
            # Continue processing even if progress reporting fails

    # Load semantic model if enabled
    if enable_semantic:
        logger.info("Semantic similarity enabled. Loading embedding model...")
        try:
            logger.info("Using model: %s", model_name)

            # Check if this is a FastText model request
            if model_name == FASTTEXT_MODEL_ID:
                # Try to load the official Facebook FastText Tibetan model first
                if progress_callback is not None:
                    try:
                        progress_callback(0.25, desc="Loading official Facebook FastText Tibetan model...")
                    except Exception as e:
                        logger.warning("Progress callback error (non-critical): %s", str(e))

                st_model, st_device, model_type = get_model_and_device(model_id=model_name)

                # If model is None, we need to train a fallback model
                if st_model is None:
                    if progress_callback is not None:
                        try:
                            progress_callback(0.25, desc="Official model unavailable. Training fallback FastText model...")
                        except Exception as e:
                            logger.warning("Progress callback error (non-critical): %s", str(e))

                    # Collect all text data for training
                    all_texts = list(text_data.values())

                    # Train the model with standard parameters for stability
                    st_model = train_fasttext_model(all_texts, dim=100, epoch=5)

                    if progress_callback is not None:
                        try:
                            progress_callback(0.3, desc="Fallback FastText model trained successfully")
                        except Exception as e:
                            logger.warning("Progress callback error (non-critical): %s", str(e))
                else:
                    if progress_callback is not None:
                        try:
                            progress_callback(0.3, desc="Official Facebook FastText Tibetan model loaded successfully")
                        except Exception as e:
                            logger.warning(f"Progress callback error (non-critical): {e}")
            else:
                # For sentence transformers
                st_model, st_device, model_type = get_model_and_device(model_id=model_name)
                logger.info(f"Model {model_name} loaded successfully on {st_device}.")

                if progress_callback is not None:
                    try:
                        progress_callback(0.3, desc="Model loaded successfully")
                    except Exception as e:
                        logger.warning(f"Progress callback error (non-critical): {e}")

        except Exception as e:
            error_msg = str(e)
            logger.error(f"Failed to load sentence transformer model: {error_msg}. Semantic similarity will not be available.")

            # Create a user-friendly warning message
            if "is not a valid model identifier" in error_msg:
                model_warning = f"The model '{model_name}' could not be found on Hugging Face. Semantic similarity will not be available."
            elif "CUDA out of memory" in error_msg:
                model_warning = "Not enough GPU memory to load the semantic model. Try using a smaller model or disable semantic similarity."
            else:
                model_warning = f"Failed to load semantic model: {error_msg}. Semantic similarity will not be available."

            if progress_callback is not None:
                try:
                    progress_callback(0.3, desc="Continuing without semantic model")
                except Exception as e:
                    logger.warning(f"Progress callback error (non-critical): {e}")
    else:
        logger.info("Semantic similarity disabled. Skipping model loading.")
        if progress_callback is not None:
            try:
                progress_callback(0.3, desc="Processing text segments")
            except Exception as e:
                logger.warning(f"Progress callback error (non-critical): {e}")

    # Detect chapter marker and segment texts
    if progress_callback is not None:
        try:
            progress_callback(0.35, desc="Segmenting texts by chapters...")
        except Exception as e:
            logger.warning(f"Progress callback error (non-critical): {e}")

    chapter_marker = "༈"
    fallback = False
    segment_texts = {}

    # Process each file
    for i, fname in enumerate(filenames):
        if progress_callback is not None and len(filenames) > 1:
            try:
                progress_callback(0.35 + (0.05 * (i / len(filenames))),
                                  desc=f"Segmenting file {i+1}/{len(filenames)}: {fname}")
            except Exception as e:
                logger.warning(f"Progress callback error (non-critical): {e}")

        content = text_data[fname]

        # Check if content is empty
        if not content.strip():
            logger.warning(f"File '{fname}' is empty or contains only whitespace.")
            continue

        # Split by chapter marker if present
        if chapter_marker in content:
            segments = [
                seg.strip() for seg in content.split(chapter_marker) if seg.strip()
            ]

            # Check if we have valid segments after splitting
            if not segments:
                logger.warning(f"File '{fname}' contains chapter markers but no valid text segments.")
                continue

            for idx, seg in enumerate(segments):
                seg_id = f"{fname}|chapter {idx+1}"
                segment_texts[seg_id] = seg
        else:
            # No chapter markers found, treat entire file as one segment
            seg_id = f"{fname}|chapter 1"
            segment_texts[seg_id] = content.strip()
            fallback = True

    # Generate warning if no chapter markers found
    warning = model_warning  # Include any model warnings
    if fallback:
        chapter_warning = (
            "No chapter marker found in one or more files. "
            "Each file will be treated as a single segment. "
            "For best results, add a unique marker (e.g., ༈) to separate chapters or sections."
        )
        warning = warning + " " + chapter_warning if warning else chapter_warning

    # Check if we have any valid segments
    if not segment_texts:
        logger.error("No valid text segments found in any of the uploaded files.")
        return pd.DataFrame(), pd.DataFrame(), "No valid text segments found in the uploaded files. Please check your files and try again."

    # Group chapters by filename (preserving order)
    if progress_callback is not None:
        try:
            progress_callback(0.4, desc="Organizing text segments...")
        except Exception as e:
            logger.warning(f"Progress callback error (non-critical): {e}")

    file_to_chapters = {}
    for seg_id in segment_texts:
        fname = seg_id.split("|")[0]
        file_to_chapters.setdefault(fname, []).append(seg_id)

    # For each pair of files, compare corresponding chapters (by index)
    if progress_callback is not None:
        try:
            progress_callback(0.45, desc="Computing similarity metrics...")
        except Exception as e:
            logger.warning(f"Progress callback error (non-critical): {e}")

    results = []
    files = list(file_to_chapters.keys())

    # Check if we have at least two files to compare
    if len(files) < 2:
        logger.warning("Need at least two files to compute similarity metrics.")
        return pd.DataFrame(), pd.DataFrame(), "Need at least two files to compute similarity metrics."

    # Track total number of comparisons for progress reporting
    total_comparisons = 0
    for file1, file2 in combinations(files, 2):
        chaps1 = file_to_chapters[file1]
        chaps2 = file_to_chapters[file2]
        total_comparisons += min(len(chaps1), len(chaps2))

    # Process each file pair
    comparison_count = 0
    for file1, file2 in combinations(files, 2):
        chaps1 = file_to_chapters[file1]
        chaps2 = file_to_chapters[file2]
        min_chaps = min(len(chaps1), len(chaps2))

        if progress_callback is not None:
            try:
                progress_callback(0.45, desc=f"Comparing {file1} with {file2}...")
            except Exception as e:
                logger.warning(f"Progress callback error (non-critical): {e}")

        for idx in range(min_chaps):
            seg1 = chaps1[idx]
            seg2 = chaps2[idx]

            # Update progress
            comparison_count += 1
            if progress_callback is not None and total_comparisons > 0:
                try:
                    progress_percentage = 0.45 + (0.25 * (comparison_count / total_comparisons))
                    progress_callback(progress_percentage,
                                      desc=f"Computing metrics for chapter {idx+1} ({comparison_count}/{total_comparisons})")
                except Exception as e:
                    logger.warning(f"Progress callback error (non-critical): {e}")

            try:
                # Compute metrics for this chapter pair
                pair_metrics = compute_all_metrics(
                    {seg1: segment_texts[seg1], seg2: segment_texts[seg2]},
                    model=st_model,
                    device=st_device,
                    enable_semantic=enable_semantic,
                    model_type=model_type if 'model_type' in locals() else "sentence_transformer",
                    use_stopwords=use_stopwords,
                    use_lite_stopwords=use_lite_stopwords,
                )

                # Rename 'Text Pair' to show file stems and chapter number
                pair_metrics.loc[:, "Text Pair"] = f"{file1} vs {file2}"
                pair_metrics.loc[:, "Chapter"] = idx + 1
                results.append(pair_metrics)

            except Exception as e:
                logger.error(f"Error computing metrics for {seg1} vs {seg2}: {e}")
                # Continue with other comparisons instead of failing completely
                continue

    # Create the metrics DataFrame
    if results:
        metrics_df = pd.concat(results, ignore_index=True)
    else:
        metrics_df = pd.DataFrame()
        warning += " No valid metrics could be computed. Please check your files and try again."

    # Calculate word counts
    if progress_callback is not None:
        try:
            progress_callback(0.75, desc="Calculating word counts...")
        except Exception as e:
            logger.warning(f"Progress callback error (non-critical): {e}")

    word_counts_data = []

    # Process each segment
    for i, (seg_id, text_content) in enumerate(segment_texts.items()):
        # Update progress
        if progress_callback is not None and len(segment_texts) > 0:
            try:
                progress_percentage = 0.75 + (0.15 * (i / len(segment_texts)))
                progress_callback(progress_percentage, desc=f"Counting words in segment {i+1}/{len(segment_texts)}")
            except Exception as e:
                logger.warning(f"Progress callback error (non-critical): {e}")

        fname, chapter_info = seg_id.split("|", 1)
        chapter_num = int(chapter_info.replace("chapter ", ""))

        try:
            # Use botok for accurate word count for raw Tibetan text
            tokenized_segments = tokenize_texts([text_content])  # Returns a list of lists
            if tokenized_segments and tokenized_segments[0]:
                word_count = len(tokenized_segments[0])
            else:
                word_count = 0

            word_counts_data.append(
                {
                    "Filename": fname.replace(".txt", ""),
                    "ChapterNumber": chapter_num,
                    "SegmentID": seg_id,
                    "WordCount": word_count,
                }
            )
        except Exception as e:
            logger.error(f"Error calculating word count for segment {seg_id}: {e}")
            # Add entry with 0 word count to maintain consistency
            word_counts_data.append(
                {
                    "Filename": fname.replace(".txt", ""),
                    "ChapterNumber": chapter_num,
                    "SegmentID": seg_id,
                    "WordCount": 0,
                }
            )

    # Create and sort the word counts DataFrame
    word_counts_df = pd.DataFrame(word_counts_data)
    if not word_counts_df.empty:
        word_counts_df = word_counts_df.sort_values(
            by=["Filename", "ChapterNumber"]
        ).reset_index(drop=True)

    if progress_callback is not None:
        try:
            progress_callback(0.95, desc="Analysis complete!")
        except Exception as e:
            logger.warning(f"Progress callback error (non-critical): {e}")

    return metrics_df, word_counts_df, warning
```
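Stripped of model loading and progress reporting, the segmentation-and-pairing flow of `process_texts` reduces to a few lines: split each file on the `༈` marker, label segments `"<file>|chapter <n>"`, then walk corresponding chapters of each file pair up to the shorter file's length. The helper names below (`segment_by_marker`, `chapter_pairs`) are illustrative, not part of the pipeline:

```python
from itertools import combinations

CHAPTER_MARKER = "༈"

def segment_by_marker(text_data):
    """Split each file on the chapter marker; a file without markers yields one segment."""
    segment_texts = {}
    for fname, content in text_data.items():
        parts = [s.strip() for s in content.split(CHAPTER_MARKER) if s.strip()]
        for idx, seg in enumerate(parts):
            segment_texts[f"{fname}|chapter {idx + 1}"] = seg
    return segment_texts

def chapter_pairs(segment_texts):
    """Yield (seg_id_1, seg_id_2) for corresponding chapters of each file pair."""
    file_to_chapters = {}
    for seg_id in segment_texts:
        file_to_chapters.setdefault(seg_id.split("|")[0], []).append(seg_id)
    for f1, f2 in combinations(file_to_chapters, 2):
        # zip stops at the shorter chapter list, mirroring min_chaps above
        for c1, c2 in zip(file_to_chapters[f1], file_to_chapters[f2]):
            yield c1, c2
```

Each yielded pair is what `process_texts` hands to `compute_all_metrics` as a two-entry dict.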
pipeline/semantic_embedding.py
CHANGED
```python
import logging
import torch
from typing import List, Any
from sentence_transformers import SentenceTransformer

# Configure logging
# ... (logging setup omitted in this diff view)

# Define the model ID for the fine-tuned Tibetan MiniLM
DEFAULT_MODEL_NAME = "buddhist-nlp/buddhist-sentence-similarity"

# FastText model identifier - this is just an internal identifier, not a HuggingFace model ID
FASTTEXT_MODEL_ID = "fasttext-tibetan"


def get_model_and_device(
    model_id: str = DEFAULT_MODEL_NAME, device_preference: str = "auto"
):
    """
    ...
    """
    # ... (device-selection logic omitted in this diff view)
    else:  # Handles explicit "cpu" preference or fallback if preferred is unavailable
        selected_device_str = "cpu"

    logger.info("Attempting to use device: %s", selected_device_str)

    try:
        # Check if this is a FastText model request
        if model_id == FASTTEXT_MODEL_ID:
            try:
                # Import here to avoid dependency issues if FastText is not installed
                import fasttext
                from .fasttext_embedding import load_fasttext_model

                # Try to load the FastText model
                model = load_fasttext_model()

                if model is None:
                    error_msg = "Failed to load FastText model. Semantic similarity will not be available."
                    logger.error(error_msg)
                    raise Exception(error_msg)

                logger.info("FastText model loaded successfully.")
                # FastText always runs on CPU
                return model, "cpu", "fasttext"
            except ImportError:
                logger.error("FastText module not found. Please install it with 'pip install fasttext'.")
                raise
        else:
            logger.info(
                "Loading Sentence Transformer model: %s on device: %s",
                model_id, selected_device_str
            )
            # SentenceTransformer expects a string like 'cuda', 'mps', or 'cpu'
            model = SentenceTransformer(model_id, device=selected_device_str)
            logger.info("Model %s loaded successfully on %s.", model_id, selected_device_str)
            return model, selected_device_str, "sentence_transformer"
    except Exception as e:
        logger.error(
            "Error loading model %s on device %s: %s",
            model_id, selected_device_str, str(e)
        )
        # Fallback to CPU if the initially selected device (CUDA or MPS) failed
        if selected_device_str != "cpu":
            logger.warning(
                "Failed to load model on %s, attempting to load on CPU...",
                selected_device_str
            )
            fallback_device_str = "cpu"
            try:
                # Check if this is a FastText model request during fallback
                if model_id == FASTTEXT_MODEL_ID:
                    # Import here to avoid dependency issues if FastText is not installed
                    from .fasttext_embedding import load_fasttext_model

                    # Try to load the FastText model
                    model = load_fasttext_model()

                    if model is None:
                        logger.error("Failed to load FastText model during fallback. Semantic similarity will not be available.")
                        raise Exception("Failed to load FastText model. Please check if the model file exists.")

                    logger.info("FastText model loaded successfully during fallback.")
                    # FastText always runs on CPU
                    return model, "cpu", "fasttext"
                else:
                    # Try to load as a sentence transformer
                    model = SentenceTransformer(model_id, device=fallback_device_str)
                    logger.info(
                        "Model %s loaded successfully on CPU after fallback.",
                        model_id
                    )
                    return model, fallback_device_str, "sentence_transformer"
            except Exception as fallback_e:
                logger.error(
                    "Error loading model %s on CPU during fallback: %s",
                    model_id, str(fallback_e)
                )
                raise fallback_e  # Re-raise exception if CPU fallback also fails
        raise e  # Re-raise original exception if selected_device_str was already CPU or no fallback attempted


def generate_embeddings(texts: List[str], model: Any, device: str, model_type: str = "sentence_transformer", tokenize_fn=None, use_stopwords: bool = True, use_lite_stopwords: bool = False):
    """
    Generates embeddings for a list of texts using the provided model.

    Args:
        texts (list[str]): A list of texts to embed.
        model: The loaded model (SentenceTransformer or FastText).
        device (str): The device to use ("cuda", "mps", or "cpu").
        model_type (str): Type of model ("sentence_transformer" or "fasttext")
        tokenize_fn: Optional tokenization function or pre-tokenized list for FastText
        use_stopwords (bool): Whether to filter out stopwords for FastText embeddings

    Returns:
        torch.Tensor: A tensor containing the embeddings, moved to CPU.
    ...
    """
    logger.info(f"Generating embeddings for {len(texts)} texts...")

    if model_type == "fasttext":
        try:
            # Import here to avoid dependency issues if FastText is not installed
            from .fasttext_embedding import get_batch_embeddings
            from .stopwords_bo import TIBETAN_STOPWORDS_SET

            # For FastText, get appropriate stopwords set if filtering is enabled
            stopwords_set = None
            if use_stopwords:
                # Choose between regular and lite stopwords sets
                if use_lite_stopwords:
                    from .stopwords_lite_bo import TIBETAN_STOPWORDS_LITE_SET
                    stopwords_set = TIBETAN_STOPWORDS_LITE_SET
                else:
                    from .stopwords_bo import TIBETAN_STOPWORDS_SET
                    stopwords_set = TIBETAN_STOPWORDS_SET

            # Pass pre-tokenized tokens if available, otherwise pass None
            # tokenize_fn should be a list of lists (tokens for each text) or None
            embeddings = get_batch_embeddings(
                texts,
                model,
                tokenize_fn=tokenize_fn,
                use_stopwords=use_stopwords,
                stopwords_set=stopwords_set
            )
            logger.info("FastText embeddings generated with shape: %s", str(embeddings.shape))
            # Convert numpy array to torch tensor for consistency
            return torch.tensor(embeddings)
        except ImportError:
            logger.error("FastText module not found. Please install it with 'pip install fasttext'.")
            raise
    else:  # sentence_transformer
        # The encode method of SentenceTransformer handles tokenization and pooling internally.
        # It also manages moving data to the model's device.
        embeddings = model.encode(texts, convert_to_tensor=True, show_progress_bar=True)
        logger.info("Sentence Transformer embeddings generated with shape: %s", str(embeddings.shape))
        return (
            embeddings.cpu()
        )  # Ensure embeddings are on CPU for consistent further processing


def train_fasttext_model(corpus_texts: List[str], **kwargs):
    """
    Train a FastText model on the provided corpus texts.

    Args:
        corpus_texts: List of texts to use for training
        **kwargs: Additional parameters for training (dim, epoch, etc.)

    Returns:
        Trained model and path to the model file
    """
    try:
        from .fasttext_embedding import prepare_corpus_file, train_fasttext_model as train_ft

        # Prepare corpus file
        corpus_path = prepare_corpus_file(corpus_texts)

        # Train the model
        model = train_ft(corpus_path=corpus_path, **kwargs)

        return model
    except ImportError:
        logger.error("FastText module not found. Please install it with 'pip install fasttext'.")
        raise


if __name__ == "__main__":
    # ... (test setup omitted in this diff view)
    try:
        # Forcing CPU for this example to avoid potential CUDA issues in diverse environments
        # or if CUDA is not intended for this specific test.
        model, device, model_type = get_model_and_device(
            device_preference="cpu"  # Explicitly use CPU for this test run
        )

        if model:
            logger.info("Test model loaded on device: %s, type: %s", device, model_type)
            example_embeddings = generate_embeddings(test_texts, model, device, model_type)
            logger.info(
                "Generated example embeddings shape: %s",
                str(example_embeddings.shape)
            )
            if example_embeddings.nelement() > 0:  # Check if tensor is not empty
                logger.info(
                    "First embedding (first 10 dims): %s...",
                    str(example_embeddings[0][:10])
                )
            else:
                logger.info("Generated example embeddings tensor is empty.")
        # ... (lines omitted in this diff view)
            logger.error("Failed to load model for example usage.")

    except Exception as e:
        logger.error("An error occurred during the example usage: %s", str(e))

    logger.info("Finished example usage.")
```
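Whichever backend produces them, `generate_embeddings` returns plain CPU tensors, and downstream semantic similarity is typically a cosine over those vectors (the app's actual metric computation lives in `pipeline/metrics.py`, not shown in this diff). A minimal plain-Python sketch of that final step:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (plain-Python sketch)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # guard against empty or all-zero embeddings
    return dot / (norm_u * norm_v)
```

The zero-norm guard matters here because a FastText segment whose tokens are all stopwords can yield a zero vector.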
pipeline/stopwords_lite_bo.py
ADDED
```python
# -*- coding: utf-8 -*-
"""Module for reduced Tibetan stopwords.

This file provides a less aggressive list of Tibetan stopwords for use in the Tibetan Text Metrics application.
It contains only the most common particles and punctuation that are unlikely to carry significant meaning.
"""

# Initial set of stopwords with clear categories
PARTICLES_INITIAL_LITE = [
    "ཏུ", "གི", "ཀྱི", "གིས", "ཀྱིས", "ཡིས", "ཀྱང", "སྟེ", "ཏེ", "ནོ", "ཏོ",
    "ཅིང", "ཅིག", "ཅེས", "ཞེས", "གྱིས", "ན",
]

MARKERS_AND_PUNCTUATION = ["༈", "།", "༎", "༑"]

# Reduced list of particles and suffixes
MORE_PARTICLES_SUFFIXES_LITE = [
    "འི་", "དུ་", "གིས་", "ཏེ", "གི་", "ཡི་", "ཀྱི་", "པས་", "ཀྱིས་", "ཡི", "ལ", "ནི་", "ར", "དུ",
    "ལས", "གྱིས་", "ས", "ཏེ་", "གྱི་", "དེ", "ཀ་", "སྟེ", "སྟེ་", "ངམ", "ཏོ", "དོ", "དམ་",
    "ན", "འམ་", "ལོ", "ཀྱིས", "བས་", "ཤིག", "གིས", "ཀི་", "ཡིས་", "གྱི", "གི"
]

# Combine all categorized lists
_ALL_STOPWORDS_CATEGORIZED_LITE = (
    PARTICLES_INITIAL_LITE +
    MARKERS_AND_PUNCTUATION +
    MORE_PARTICLES_SUFFIXES_LITE
)

# Final flat list of unique stopwords for TfidfVectorizer (as a list)
TIBETAN_STOPWORDS_LITE = list(set(_ALL_STOPWORDS_CATEGORIZED_LITE))

# Final set of unique stopwords for efficient Jaccard/LCS filtering (as a set)
TIBETAN_STOPWORDS_LITE_SET = set(TIBETAN_STOPWORDS_LITE)
```
pipeline/tokenize.py
CHANGED
1 |
+
from typing import List, Dict
|
2 |
+
import hashlib
|
3 |
+
import logging
|
4 |
+
|
5 |
+
# Configure logging
|
6 |
+
logger = logging.getLogger(__name__)
|
7 |
+
|
8 |
+
# Initialize a cache for tokenization results
|
9 |
+
# Using a simple in-memory dictionary with text hash as key
|
10 |
+
_tokenization_cache: Dict[str, List[str]] = {}
|
11 |
+
|
12 |
+
# Maximum cache size (number of entries)
|
13 |
+
MAX_CACHE_SIZE = 1000
|
14 |
|
15 |
try:
|
16 |
from botok import WordTokenizer
|
|
|
21 |
# Handle the case where botok might not be installed,
|
22 |
# though it's a core dependency for this app.
|
23 |
BOTOK_TOKENIZER = None
|
24 |
+
logger.error("botok library not found. Tokenization will fail.")
|
25 |
# Optionally, raise an error here if botok is absolutely critical for the app to even start
|
26 |
# raise ImportError("botok is required for tokenization. Please install it.")
|
27 |
|
28 |
|
29 |
+
def _get_text_hash(text: str) -> str:
|
30 |
+
"""
|
31 |
+
Generate a hash for the input text to use as a cache key.
|
32 |
+
|
33 |
+
Args:
|
34 |
+
text: The input text to hash
|
35 |
+
|
36 |
+
Returns:
|
37 |
+
A string representation of the MD5 hash of the input text
|
38 |
+
"""
|
39 |
+
return hashlib.md5(text.encode('utf-8')).hexdigest()
|
40 |
+
|
41 |
+
|
42 |
def tokenize_texts(texts: List[str]) -> List[List[str]]:
|
43 |
"""
|
44 |
+
Tokenizes a list of raw Tibetan texts using botok, with caching for performance.
|
45 |
+
|
46 |
+
This function maintains an in-memory cache of previously tokenized texts to avoid
|
47 |
+
redundant processing of the same content. The cache uses MD5 hashes of the input
|
48 |
+
texts as keys.
|
49 |
+
|
50 |
Args:
|
51 |
+
texts: List of raw text strings to tokenize.
|
52 |
+
|
53 |
Returns:
|
54 |
List of tokenized texts (each as a list of tokens).
|
55 |
+
|
56 |
+
Raises:
|
57 |
+
RuntimeError: If the botok tokenizer failed to initialize.
|
58 |
"""
|
59 |
if BOTOK_TOKENIZER is None:
|
60 |
# This case should ideally be handled more gracefully,
|
|
|
64 |
)
|
65 |
|
66 |
tokenized_texts_list = []
|
67 |
+
|
68 |
+
# Process each text
|
69 |
for text_content in texts:
|
70 |
+
# Skip empty texts
|
71 |
+
if not text_content.strip():
|
72 |
+
tokenized_texts_list.append([])
|
73 |
+
continue
|
74 |
+
|
75 |
+
# Generate hash for cache lookup
|
76 |
+
text_hash = _get_text_hash(text_content)
|
77 |
+
|
78 |
+
# Check if we have this text in cache
|
79 |
+
if text_hash in _tokenization_cache:
|
80 |
+
# Cache hit - use cached tokens
|
81 |
+
tokens = _tokenization_cache[text_hash]
|
82 |
+
logger.debug(f"Cache hit for text hash {text_hash[:8]}...")
|
83 |
+
else:
|
84 |
+
# Cache miss - tokenize and store in cache
|
85 |
+
try:
|
86 |
+
tokens = [
|
87 |
+
w.text for w in BOTOK_TOKENIZER.tokenize(text_content) if w.text.strip()
|
88 |
+
]
|
89 |
+
|
90 |
+
# Store in cache if not empty
|
91 |
+
if tokens:
|
92 |
+
# If cache is full, remove a random entry (simple strategy)
|
93 |
+
if len(_tokenization_cache) >= MAX_CACHE_SIZE:
|
94 |
+
# Remove first key (oldest if ordered dict, random otherwise)
|
95 |
+
_tokenization_cache.pop(next(iter(_tokenization_cache)))
|
96 |
+
|
97 |
+
_tokenization_cache[text_hash] = tokens
|
98 |
+
logger.debug(f"Added tokens to cache with hash {text_hash[:8]}...")
|
99 |
+
except Exception as e:
|
100 |
+
logger.error(f"Error tokenizing text: {e}")
|
101 |
+
tokens = []
|
102 |
+
|
103 |
tokenized_texts_list.append(tokens)
|
104 |
+
|
105 |
return tokenized_texts_list
|
pipeline/visualize.py
CHANGED
@@ -25,7 +25,6 @@ def generate_visualizations(metrics_df: pd.DataFrame, descriptive_titles: dict =
 
     # --- Heatmaps for each metric ---
     heatmaps = {}
-    # Using 'Reds' colormap as requested for a red/white gradient.
     # Chapter 1 will be at the top of the Y-axis due to sort_index(ascending=False).
     for metric in metric_cols:
         # Check if all values for this metric are NaN
@@ -41,19 +40,38 @@ def generate_visualizations(metrics_df: pd.DataFrame, descriptive_titles: dict =
             continue
 
         cleaned_columns = [col.replace(".txt", "") for col in pivot.columns]
+
+        # For consistent interpretation: higher values (more similarity) = darker colors
+        # Using 'Reds' colormap for all metrics (dark red = high similarity)
+        cmap = "Reds"
+
+        # Format values for display
         text = [
             [f"{val:.2f}" if pd.notnull(val) else "" for val in row]
             for row in pivot.values
        ]
+
+        # Create a copy of the pivot data for visualization
+        viz_values = pivot.values.copy()
+
+        # Same color interpretation for all metrics (darker = more similar),
+        # so no metric needs a reversed scale anymore
+        reverse_colorscale = False
+
         fig = go.Figure(
             data=go.Heatmap(
-                z=pivot.values,
+                z=viz_values,
                 x=cleaned_columns,
                 y=pivot.index,
                 colorscale=cmap,
+                reversescale=reverse_colorscale,  # Use the same scale direction for all metrics
+                zmin=float(np.nanmin(viz_values)),
+                zmax=float(np.nanmax(viz_values)),
                 text=text,
                 texttemplate="%{text}",
                 hovertemplate="Chapter %{y}<br>Text Pair: %{x}<br>Value: %{z:.2f}<extra></extra>",
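Anchoring `zmin`/`zmax` to `np.nanmin`/`np.nanmax`, as the new heatmap code does, keeps the colour scale spanning the actual data range even when some cells are NaN; a minimal illustration with a toy matrix, not app data:

```python
import numpy as np

viz_values = np.array([[0.20, np.nan],
                       [0.80, 0.55]])

# NaN-aware extrema ignore missing cells; plain min/max would return nan
zmin = float(np.nanmin(viz_values))
zmax = float(np.nanmax(viz_values))
```

Without explicit bounds, Plotly infers the range per trace, so the same value could map to different colours across metric heatmaps.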
requirements.txt
CHANGED
@@ -1,5 +1,5 @@
 # Core application and UI
-gradio
+gradio
 pandas==2.2.3
 
 # Plotting and visualization
@@ -14,9 +14,19 @@ torch==2.7.0
 transformers==4.51.3
 sentence-transformers==4.1.0
 numba==0.61.2
+fasttext==0.9.2
 
 # Tibetan language processing
 botok==0.9.0
 
 # Build system for Cython
-Cython==3.0.12
+Cython==3.0.12
+
+# HuggingFace integration
+hf_xet==0.1.0
+huggingface_hub
+
+# LLM integration
+python-dotenv==1.0.0
+requests==2.31.0
+tabulate
theme.py
CHANGED
@@ -169,6 +169,85 @@ class TibetanAppTheme(gr.themes.Soft):
             "padding": "10px 15px !important",
             "border-bottom": "2px solid transparent !important",
         },
+
+        # Custom styling for metric accordions
+        ".metric-info-accordion": {
+            "border-left": "4px solid #3B82F6 !important",
+            "margin-bottom": "1rem !important",
+            "background-color": "#F8FAFC !important",
+            "border-radius": "6px !important",
+            "overflow": "hidden !important",
+        },
+        ".jaccard-info": {
+            "border-left-color": "#3B82F6 !important",  # Blue
+        },
+        ".lcs-info": {
+            "border-left-color": "#10B981 !important",  # Green
+        },
+        ".semantic-info": {
+            "border-left-color": "#8B5CF6 !important",  # Purple
+        },
+        ".tfidf-info": {
+            "border-left-color": "#F59E0B !important",  # Amber
+        },
+        ".wordcount-info": {
+            "border-left-color": "#EC4899 !important",  # Pink
+        },
+
+        # Accordion header styling
+        ".metric-info-accordion > .label-wrap": {
+            "font-weight": "600 !important",
+            "padding": "12px 16px !important",
+            "background-color": "#F1F5F9 !important",
+            "border-bottom": "1px solid #E2E8F0 !important",
+        },
+
+        # Accordion content styling
+        ".metric-info-accordion > .wrap": {
+            "padding": "16px !important",
+        },
+
+        # Word count plot styling - full width
+        ".tabs > .tab-content > div[data-testid='tabitem'] > .plot": {
+            "width": "100% !important",
+        },
+
+        # LLM Analysis styling
+        ".llm-analysis": {
+            "background-color": "#f8f9fa !important",
+            "border-left": "4px solid #3B82F6 !important",
+            "border-radius": "8px !important",
+            "padding": "20px 24px !important",
+            "margin": "16px 0 !important",
+            "box-shadow": "0 2px 8px rgba(0, 0, 0, 0.05) !important",
+        },
+        ".llm-analysis h2": {
+            "color": "#1e40af !important",
+            "font-size": "24px !important",
+            "margin-bottom": "16px !important",
+            "border-bottom": "1px solid #e5e7eb !important",
+            "padding-bottom": "8px !important",
+        },
+        ".llm-analysis h3, .llm-analysis h4": {
+            "color": "#1e3a8a !important",
+            "margin-top": "20px !important",
+            "margin-bottom": "12px !important",
+        },
+        ".llm-analysis p": {
+            "line-height": "1.7 !important",
+            "margin-bottom": "12px !important",
+        },
+        ".llm-analysis ul, .llm-analysis ol": {
+            "margin-left": "24px !important",
+            "margin-bottom": "16px !important",
+        },
+        ".llm-analysis li": {
+            "margin-bottom": "6px !important",
+        },
+        ".llm-analysis strong, .llm-analysis b": {
+            "color": "#1f2937 !important",
+            "font-weight": "600 !important",
+        },
     }
 
     def get_css_string(self) -> str:
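These nested dicts (CSS selector mapped to a property dict) are presumably what `get_css_string` serializes into a stylesheet; a hypothetical serializer for that shape, not the class's actual implementation, might look like:

```python
def css_from_rules(rules: dict) -> str:
    # Hypothetical: flatten {selector: {property: value}} into CSS text.
    blocks = []
    for selector, props in rules.items():
        body = " ".join(f"{prop}: {value};" for prop, value in props.items())
        blocks.append(f"{selector} {{ {body} }}")
    return "\n".join(blocks)

css = css_from_rules({
    ".llm-analysis": {"border-radius": "8px !important"},
    ".llm-analysis li": {"margin-bottom": "6px !important"},
})
```

Keeping rules as Python dicts rather than a raw CSS string makes per-selector overrides (like the accordion colour variants above) simple dictionary updates.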