daniel-wojahn committed
Commit 3011301 · verified · 1 Parent(s): f1e6b77

Refactoring of the tokenization pipeline; adjusted FastText implementation

README.md CHANGED
@@ -8,7 +8,6 @@ sdk_version: 5.29.0
8
  python_version: 3.11
9
  app_file: app.py
10
  models:
11
- - buddhist-nlp/buddhist-sentence-similarity
12
  - fasttext-tibetan
13
  ---
14
 
@@ -32,14 +31,12 @@ The Tibetan Text Metrics project aims to provide quantitative methods for assess
32
  - **Jaccard Similarity (%)**: Measures vocabulary overlap between segments. *Common Tibetan stopwords can be filtered out to focus on meaningful lexical similarity.*
33
  - **Normalized Longest Common Subsequence (LCS)**: Identifies the longest shared sequence of words, indicating direct textual parallels.
34
  - **Semantic Similarity**: Uses embedding models to compare the contextual meaning of segments. Users can select between:
35
- - A transformer-based model (buddhist-nlp/buddhist-sentence-similarity) specialized for Buddhist texts (experimental approach)
36
- - The official Facebook FastText Tibetan model (facebook/fasttext-bo-vectors) pre-trained on a large corpus of Tibetan text
37
  *Note: This metric works best when combined with other metrics for a more comprehensive analysis.*
38
  - **TF-IDF Cosine Similarity**: Highlights texts that share important or characteristic terms by comparing their TF-IDF profiles. *Common Tibetan stopwords can be excluded to ensure TF-IDF weights highlight genuinely characteristic terms.*
39
  - **Handles Long Texts**: Implements automated chunking for semantic similarity to process texts exceeding the model's token limit.
40
- - **Model Selection**: Choose from specialized embedding models for semantic similarity analysis:
41
- - **Buddhist-NLP Transformer** (Experimental): Pre-trained model specialized for Buddhist texts
42
- - **FastText**: Uses the official Facebook FastText Tibetan model (facebook/fasttext-bo-vectors) with optimizations specifically for Tibetan language, including botok tokenization and TF-IDF weighted averaging
43
  - **Stopword Filtering**: Three levels of filtering for Tibetan words:
44
  - **None**: No filtering, includes all words
45
  - **Standard**: Filters only common particles and punctuation
@@ -126,24 +123,8 @@ This helps focus on meaningful content words rather than grammatical elements.
126
 
127
  2. **Normalized LCS (Longest Common Subsequence)**: This metric measures the length of the longest sequence of words that appears in *both* text segments, maintaining their original relative order. Importantly, these words do not need to be directly adjacent (contiguous) in either text. For example, if Text A is '<u>the</u> quick <u>brown</u> fox <u>jumps</u>' and Text B is '<u>the</u> lazy cat and <u>brown</u> dog <u>jumps</u> high', the LCS is 'the brown jumps'. The length of this common subsequence is then normalized (in this tool, by dividing by the length of the longer of the two segments) to provide a score, which is then presented as a percentage. A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism, as it reflects similarities in how ideas are ordered and expressed sequentially.
128
  * *Note on Interpretation*: It's possible for Normalized LCS to be higher than Jaccard Similarity. This often happens when texts share a substantial 'narrative backbone' or common ordered phrases (leading to a high LCS), even if they use varied surrounding vocabulary or introduce many unique words not part of these core sequences (which would lower the Jaccard score). LCS highlights this sequential, structural similarity, while Jaccard focuses on the overall shared vocabulary regardless of its arrangement.
129
- 3. **Semantic Similarity**: Computes the cosine similarity between semantic embeddings of text segments using one of two approaches:
130
-
131
- **a. Transformer-based Model** (Experimental): Pre-trained model that understands contextual relationships between words.
132
- - `buddhist-nlp/buddhist-sentence-similarity`: Specialized for Buddhist texts
133
- - Processes raw Unicode Tibetan text directly (no special tokenization required)
134
- - Note: This is an experimental approach and results may vary with different texts
135
-
136
- **b. FastText Model**: Uses the official Facebook FastText Tibetan model (facebook/fasttext-bo-vectors) pre-trained on a large corpus of Tibetan text. Falls back to a custom model only if the official model cannot be loaded.
137
- - Processes Tibetan text using botok tokenization (same as other metrics)
138
- - Uses the pre-tokenized words from botok rather than doing its own tokenization
139
- - Better for texts with specialized Tibetan vocabulary
140
- - More stable results for general Tibetan text comparison
141
- - Optimized for Tibetan language with:
142
- - Syllable-based tokenization preserving Tibetan syllable markers
143
- - TF-IDF weighted averaging for word vectors (distinct from the TF-IDF Cosine Similarity metric)
144
- - Enhanced parameters based on Tibetan NLP research
145
-
146
- For texts exceeding the model's token limit, an automated chunking strategy is employed: texts are divided into overlapping chunks, each chunk is embedded, and the resulting embeddings are averaged to produce a single vector for the entire segment.
147
  4. **TF-IDF Cosine Similarity**: This metric first calculates Term Frequency-Inverse Document Frequency (TF-IDF) scores for each word in each text segment, optionally **filtering out common Tibetan stopwords**.
148
  TF-IDF gives higher weight to words that are frequent within a particular segment but relatively rare across the entire collection of segments.
149
  This helps to identify terms that are characteristic or discriminative for a segment. When stopword filtering is enabled, the TF-IDF scores better reflect genuinely significant terms.
@@ -210,28 +191,21 @@ This helps focus on meaningful content words rather than grammatical elements.
210
  - The system will analyze your metrics and provide insights about patterns, relationships, and notable findings in your data.
211
  - This feature helps researchers understand the significance of the metrics and identify interesting textual relationships between chapters.
212
 
213
- ## Embedding Models
214
 
215
- The application offers two specialized approaches for calculating semantic similarity in Tibetan texts:
216
 
217
- 1. **Buddhist-NLP Transformer** (Default option):
218
- - A specialized model fine-tuned for Buddhist text similarity
219
- - Provides excellent results for Tibetan Buddhist texts
220
- - Pre-trained and ready to use, no training required
221
- - Best for general Buddhist terminology and concepts
222
-
223
- 2. **FastText Model**:
224
- - Uses the official Facebook FastText Tibetan model (facebook/fasttext-bo-vectors)
225
- - Pre-trained on a large corpus of Tibetan text from Wikipedia and other sources
226
- - Falls back to training a custom model on your texts if the official model cannot be loaded
227
- - Respects your stopword filtering settings when creating embeddings
228
- - Uses simple word vector averaging for stable embeddings
229
 
230
  **When to choose FastText**:
231
- - When you want high-quality word embeddings specifically trained for Tibetan language
232
- - When you need a model that can handle out-of-vocabulary words through character n-grams
233
- - When you want to benefit from Facebook's large-scale pre-training on Tibetan text
234
- - When you need more control over how stopwords affect semantic analysis
235
 
236
  ## Structure
237
 
@@ -244,6 +218,10 @@ The application offers two specialized approaches for calculating semantic simil
244
  - `tokenize.py`: Tibetan text tokenization using `botok`.
245
  - `upload.py`: File upload handling (currently minimal).
246
  - `visualize.py`: Generates heatmaps and word count plots.
 
 
 
 
247
  - `requirements.txt` — Python dependencies for the web application.
248
 
249
  ## License
 
8
  python_version: 3.11
9
  app_file: app.py
10
  models:
 
11
  - fasttext-tibetan
12
  ---
13
 
 
31
  - **Jaccard Similarity (%)**: Measures vocabulary overlap between segments. *Common Tibetan stopwords can be filtered out to focus on meaningful lexical similarity.*
32
  - **Normalized Longest Common Subsequence (LCS)**: Identifies the longest shared sequence of words, indicating direct textual parallels.
33
  - **Semantic Similarity**: Uses embedding models to compare the contextual meaning of segments. Users can select between:
34
+ - A FastText model using official Facebook Tibetan vectors with custom `botok` tokenization (recommended approach).
 
35
  *Note: This metric works best when combined with other metrics for a more comprehensive analysis.*
36
  - **TF-IDF Cosine Similarity**: Highlights texts that share important or characteristic terms by comparing their TF-IDF profiles. *Common Tibetan stopwords can be excluded to ensure TF-IDF weights highlight genuinely characteristic terms.*
37
  - **Handles Long Texts**: Implements automated chunking for semantic similarity to process texts exceeding the model's token limit.
38
+ - **Model Selection**: Semantic similarity analysis uses a FastText model:
39
+ - **FastText**: Uses the official Facebook FastText Tibetan model (`cc.bo.300.bin`) with optimizations specifically for the Tibetan language, including `botok` tokenization and TF-IDF weighted averaging of word vectors to produce segment embeddings.
 
40
  - **Stopword Filtering**: Three levels of filtering for Tibetan words:
41
  - **None**: No filtering, includes all words
42
  - **Standard**: Filters only common particles and punctuation
 
123
 
124
  2. **Normalized LCS (Longest Common Subsequence)**: This metric measures the length of the longest sequence of words that appears in *both* text segments, maintaining their original relative order. Importantly, these words do not need to be directly adjacent (contiguous) in either text. For example, if Text A is '<u>the</u> quick <u>brown</u> fox <u>jumps</u>' and Text B is '<u>the</u> lazy cat and <u>brown</u> dog <u>jumps</u> high', the LCS is 'the brown jumps'. The length of this common subsequence is then normalized (in this tool, by dividing by the length of the longer of the two segments) to provide a score, which is then presented as a percentage. A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism, as it reflects similarities in how ideas are ordered and expressed sequentially.
125
  * *Note on Interpretation*: It's possible for Normalized LCS to be higher than Jaccard Similarity. This often happens when texts share a substantial 'narrative backbone' or common ordered phrases (leading to a high LCS), even if they use varied surrounding vocabulary or introduce many unique words not part of these core sequences (which would lower the Jaccard score). LCS highlights this sequential, structural similarity, while Jaccard focuses on the overall shared vocabulary regardless of its arrangement.
126
+ 3. **Semantic Similarity**: Computes the cosine similarity between semantic embeddings of text segments using a FastText model:
127
+ - **FastText**: Uses the official Facebook FastText Tibetan model (`cc.bo.300.bin`) with optimizations specifically for the Tibetan language, including `botok` tokenization and TF-IDF weighted averaging of word vectors to produce segment embeddings.
128
  4. **TF-IDF Cosine Similarity**: This metric first calculates Term Frequency-Inverse Document Frequency (TF-IDF) scores for each word in each text segment, optionally **filtering out common Tibetan stopwords**.
129
  TF-IDF gives higher weight to words that are frequent within a particular segment but relatively rare across the entire collection of segments.
130
  This helps to identify terms that are characteristic or discriminative for a segment. When stopword filtering is enabled, the TF-IDF scores better reflect genuinely significant terms.
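As a rough illustration of this metric, the sketch below computes TF-IDF cosine similarity over two pre-tokenized segments with scikit-learn. The segments and the identity tokenizer are placeholders; the actual pipeline feeds space-joined `botok` tokens and the selected Tibetan stopword list to the vectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Segments are assumed to be pre-tokenized with botok and re-joined with spaces.
segments = [
    "བཀྲ་ཤིས་ བདེ་ལེགས་ ཞུ་",
    "བདེ་ལེགས་ ཞུ་ རྒྱུ་ ཡིན་",
]

vectorizer = TfidfVectorizer(
    tokenizer=str.split,       # the text is already whitespace-tokenized
    preprocessor=lambda x: x,  # no lowercasing or other preprocessing for Tibetan
    token_pattern=None,        # silence the unused-default-pattern warning
)
tfidf_matrix = vectorizer.fit_transform(segments)

# Pairwise cosine similarity over the TF-IDF vectors (values fall in [0, 1]).
print(cosine_similarity(tfidf_matrix)[0, 1])
```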
 
191
  - The system will analyze your metrics and provide insights about patterns, relationships, and notable findings in your data.
192
  - This feature helps researchers understand the significance of the metrics and identify interesting textual relationships between chapters.
193
 
194
+ ## Embedding Model
195
 
196
+ The application uses a FastText-based approach for calculating semantic similarity in Tibetan texts:
197
 
198
+ **FastText Model Features**:
199
+ - Utilizes the official Facebook FastText word vectors for Tibetan (`cc.bo.300.bin`).
200
+ - Integrates `botok` for accurate Tibetan word tokenization.
201
+ - Employs TF-IDF weighted averaging of word vectors to produce segment embeddings. This method provides more nuanced similarity scores by emphasizing terms that are important within the analyzed texts.
202
+ - The underlying `tibetan-text-metrics` library also supports training custom FastText models on user-uploaded texts for domain-specific accuracy, though this training workflow is not yet exposed in the web UI.
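To make the TF-IDF weighted averaging concrete, here is a minimal sketch of how a segment embedding can be assembled from FastText word vectors. The model path, token list, document-frequency map, and document count are illustrative placeholders; the pipeline derives them from the uploaded corpus.

```python
import math

import fasttext
import numpy as np

model = fasttext.load_model("cc.bo.300.bin")          # placeholder path to the Tibetan vectors
tokens = ["བཀྲ་ཤིས་", "བདེ་ལེགས་", "ཞུ་"]               # botok tokens for one segment
doc_freq = {"བཀྲ་ཤིས་": 3, "བདེ་ལེགས་": 5, "ཞུ་": 9}     # illustrative document frequencies
n_docs = 12                                           # illustrative corpus size

# TF-IDF weight per token: in-segment term frequency times smoothed IDF.
weights = []
for t in tokens:
    tf = tokens.count(t) / len(tokens)
    idf = math.log((n_docs + 1) / (doc_freq.get(t, 0) + 1)) + 1
    weights.append(tf * idf)

# Normalize the weights, then take the weighted sum of the word vectors.
total = sum(weights)
weights = [w / total for w in weights] if total > 0 else [1 / len(tokens)] * len(tokens)
segment_embedding = np.sum(
    [w * model.get_word_vector(t) for w, t in zip(weights, tokens)], axis=0
)
```

Segment embeddings built this way are then compared with cosine similarity, as in the rest of the pipeline.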
203
 
204
  **When to choose FastText**:
205
+ - When you require high-quality word embeddings pre-trained on a large and diverse corpus of Tibetan text (Common Crawl and Wikipedia).
206
+ - When your analysis benefits from a model that can effectively handle out-of-vocabulary Tibetan words and orthographic variations through FastText's character n-gram features.
207
+ - When you want to combine large-scale pre-trained vectors with Tibetan-specific preprocessing (`botok` tokenization and Tibetan stopword lists).
208
+ - When you need fine-grained control over how stopwords affect semantic analysis, as this model's embedding process respects the selected stopword filtering level.
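The out-of-vocabulary point above can be seen directly: FastText composes vectors from character n-grams, so even a token absent from the training data still receives a usable embedding. A tiny sketch (the model path is a placeholder):

```python
import fasttext

model = fasttext.load_model("cc.bo.300.bin")  # placeholder path to the Tibetan vectors

# A spelling the training corpus is unlikely to contain still receives a vector,
# composed from its character n-grams rather than a stored whole-word entry.
vector = model.get_word_vector("བཀྲ་ཤིསས་")
print(vector.shape)  # (300,) for the 300-dimensional Facebook vectors
```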
209
 
210
  ## Structure
211
 
 
218
  - `tokenize.py`: Tibetan text tokenization using `botok`.
219
  - `upload.py`: File upload handling (currently minimal).
220
  - `visualize.py`: Generates heatmaps and word count plots.
221
+ - `fasttext-modelling/` — Scripts and documentation for training custom FastText models.
222
+ - `train_custom_fasttext.py`: Script to train a custom FastText model.
223
+ - `README.md`: Detailed instructions for the training script.
224
+ - `requirements.txt`: Python dependencies specifically for the training script.
225
  - `requirements.txt` — Python dependencies for the web application.
226
 
227
  ## License
app.py CHANGED
@@ -5,6 +5,7 @@ from pipeline.visualize import generate_visualizations, generate_word_count_char
5
  from pipeline.llm_service import get_interpretation
6
  import logging
7
  import pandas as pd
 
8
  from dotenv import load_dotenv
9
 
10
  # Load environment variables from .env file
@@ -14,7 +15,6 @@ from theme import tibetan_theme
14
 
15
  logger = logging.getLogger(__name__)
16
 
17
-
18
  # Main interface logic
19
  def main_interface():
20
  with gr.Blocks(
@@ -65,18 +65,10 @@ def main_interface():
65
  )
66
 
67
  model_dropdown = gr.Dropdown(
68
- label="Embedding Model",
69
- choices=[
70
- "buddhist-nlp/buddhist-sentence-similarity",
71
- "fasttext-tibetan"
72
- ],
73
- value="buddhist-nlp/buddhist-sentence-similarity",
74
- info="Select the embedding model for semantic similarity.<br><br>"
75
- "<b>Model information:</b><br>"
76
- "• <a href='https://huggingface.co/buddhist-nlp/buddhist-sentence-similarity' target='_blank'>buddhist-nlp/buddhist-sentence-similarity</a>: Specialized model fine-tuned for Buddhist text similarity.<br>"
77
- "• <b>fasttext-tibetan</b>: Uses the official Facebook FastText Tibetan model pre-trained on a large corpus. If the official model cannot be loaded, it will fall back to training a custom model on your uploaded texts.",
78
- visible=True,
79
- interactive=True
80
  )
81
 
82
  stopwords_dropdown = gr.Dropdown(
@@ -181,26 +173,19 @@ A higher Normalized LCS score suggests more significant shared phrasing, direct
181
  """,
182
  "Semantic Similarity": """
183
  ### Semantic Similarity
184
- Computes the cosine similarity between semantic embeddings of text segments using one of two approaches:
185
 
186
- **1. Transformer-based Model** (Experimental): Pre-trained model that understands contextual relationships between words.
187
- - `buddhist-nlp/buddhist-sentence-similarity`: Specialized for Buddhist texts
188
- - Processes raw Unicode Tibetan text directly (no special tokenization required)
189
- - Note: This is an experimental approach and results may vary with different texts
190
-
191
- **2. FastText Model**: Uses the official Facebook FastText Tibetan model (facebook/fasttext-bo-vectors) pre-trained on a large corpus of Tibetan text. Falls back to a custom model only if the official model cannot be loaded.
192
  - Processes Tibetan text using botok tokenization (same as other metrics)
193
  - Uses the pre-tokenized words from botok rather than doing its own tokenization
194
  - Better for texts with specialized Tibetan vocabulary
195
  - More stable results for general Tibetan text comparison
196
  - Optimized for Tibetan language with:
197
- - Syllable-based tokenization preserving Tibetan syllable markers
198
  - TF-IDF weighted averaging for word vectors (distinct from the TF-IDF Cosine Similarity metric)
199
  - Enhanced parameters based on Tibetan NLP research
200
 
201
- **Chunking for Long Texts**: For texts exceeding the model's token limit, an automated chunking strategy is employed: texts are divided into overlapping chunks, each chunk is embedded, and the resulting embeddings are averaged to produce a single vector for the entire segment.
202
-
203
- **Stopword Filtering**: When enabled (via the "Filter Stopwords" checkbox), common Tibetan particles and function words are filtered out before computing embeddings. This helps focus on meaningful content words. Transformer models process the full text regardless of stopword filtering setting.
204
 
205
  **Note**: This metric works best when combined with other metrics for a more comprehensive analysis.
206
  """,
@@ -389,10 +374,20 @@ Each segment is represented as a vector of these TF-IDF scores, and the cosine s
389
  use_stopwords = stopwords_option != "None (No filtering)"
390
  use_lite_stopwords = stopwords_option == "Standard (Common particles only)"
391
 
392
  df_results, word_counts_df_data, warning_raw = process_texts(
393
  text_data, filenames,
394
  enable_semantic=enable_semantic_bool,
395
- model_name=model_name,
396
  use_stopwords=use_stopwords,
397
  use_lite_stopwords=use_lite_stopwords,
398
  progress_callback=progress_tracker
 
5
  from pipeline.llm_service import get_interpretation
6
  import logging
7
  import pandas as pd
8
+
9
  from dotenv import load_dotenv
10
 
11
  # Load environment variables from .env file
 
15
 
16
  logger = logging.getLogger(__name__)
17
 
 
18
  # Main interface logic
19
  def main_interface():
20
  with gr.Blocks(
 
65
  )
66
 
67
  model_dropdown = gr.Dropdown(
68
+ choices=["Facebook FastText (Pre-trained)"],
69
+ label="Select Embedding Model",
70
+ value="Facebook FastText (Pre-trained)",
71
+ info="Using Facebook's pre-trained FastText model for semantic similarity. Other model options have been removed."
72
  )
73
 
74
  stopwords_dropdown = gr.Dropdown(
 
173
  """,
174
  "Semantic Similarity": """
175
  ### Semantic Similarity
176
+ Computes the cosine similarity between semantic embeddings of text segments:
177
 
178
+ **FastText Model**: Uses the official Facebook FastText Tibetan model (facebook/fasttext-bo-vectors) pre-trained on a large corpus of Tibetan text. Falls back to a custom model only if the official model cannot be loaded.
 
 
 
 
 
179
  - Processes Tibetan text using botok tokenization (same as other metrics)
180
  - Uses the pre-tokenized words from botok rather than doing its own tokenization
181
  - Better for texts with specialized Tibetan vocabulary
182
  - More stable results for general Tibetan text comparison
183
  - Optimized for Tibetan language with:
184
+ - Word-based tokenization preserving Tibetan syllable markers
185
  - TF-IDF weighted averaging for word vectors (distinct from the TF-IDF Cosine Similarity metric)
186
  - Enhanced parameters based on Tibetan NLP research
187
 
188
+ **Stopword Filtering**: When a filtering level other than "None" is selected in the stopword dropdown, common Tibetan particles and function words are filtered out before computing embeddings. This helps focus on meaningful content words.
 
 
189
 
190
  **Note**: This metric works best when combined with other metrics for a more comprehensive analysis.
191
  """,
 
374
  use_stopwords = stopwords_option != "None (No filtering)"
375
  use_lite_stopwords = stopwords_option == "Standard (Common particles only)"
376
 
377
+ # Map UI model name to internal model ID
378
+ # The UI model_name is "Facebook FastText (Pre-trained)"
379
+ # This mapping ensures the backend receives the correct identifier.
380
+ if model_name == "Facebook FastText (Pre-trained)":
381
+ internal_model_id = "facebook-fasttext-pretrained"
382
+ else:
383
+ # Fallback or error if unexpected model_name, though UI should prevent this
384
+ logger.warning(f"Unexpected model_name from UI: {model_name}. Defaulting to facebook-fasttext-pretrained.")
385
+ internal_model_id = "facebook-fasttext-pretrained"
386
+
387
  df_results, word_counts_df_data, warning_raw = process_texts(
388
  text_data, filenames,
389
  enable_semantic=enable_semantic_bool,
390
+ model_name=internal_model_id, # Use the mapped internal ID
391
  use_stopwords=use_stopwords,
392
  use_lite_stopwords=use_lite_stopwords,
393
  progress_callback=progress_tracker
pipeline/fasttext_embedding.py CHANGED
@@ -4,15 +4,18 @@ This module provides functions to train and use FastText models for Tibetan text
4
  """
5
 
6
  import os
 
7
  import math
8
  import logging
9
  import numpy as np
10
  import fasttext
11
- from typing import List, Optional
 
12
  from huggingface_hub import hf_hub_download
13
 
14
  # Set up logging
15
  logger = logging.getLogger(__name__)
 
16
 
17
  # Default parameters optimized for Tibetan
18
  DEFAULT_DIM = 100
@@ -25,7 +28,7 @@ DEFAULT_NEG = 5
25
 
26
  # Define paths for model storage
27
  DEFAULT_MODEL_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "models")
28
- DEFAULT_MODEL_PATH = os.path.join(DEFAULT_MODEL_DIR, "fasttext_model.bin")
29
 
30
  # Facebook's official Tibetan FastText model
31
  FACEBOOK_TIBETAN_MODEL_ID = "facebook/fasttext-bo-vectors"
@@ -133,41 +136,67 @@ def train_fasttext_model(
133
  return model
134
 
135
 
136
  def load_fasttext_model(model_path: str = DEFAULT_MODEL_PATH) -> Optional[fasttext.FastText._FastText]:
137
  """
138
- Load a FastText model from file, with fallback to official Facebook model.
139
 
140
  Args:
141
- model_path: Path to the model file
142
 
143
  Returns:
144
- Loaded FastText model or None if loading fails
145
  """
146
  try:
147
- # First try to load the official Facebook FastText Tibetan model
148
- try:
149
- # Try to download the official Facebook FastText Tibetan model
150
- logger.info("Attempting to download and load official Facebook FastText Tibetan model")
151
- facebook_model_path = hf_hub_download(
152
- repo_id=FACEBOOK_TIBETAN_MODEL_ID,
153
- filename=FACEBOOK_TIBETAN_MODEL_FILE,
154
- cache_dir=DEFAULT_MODEL_DIR
155
- )
156
- logger.info("Loading official Facebook FastText Tibetan model from %s", facebook_model_path)
157
- return fasttext.load_model(facebook_model_path)
158
- except Exception as e:
159
- logger.warning("Could not load official Facebook FastText Tibetan model: %s", str(e))
160
- logger.info("Falling back to local model")
161
-
162
- # Fall back to local model
163
  if os.path.exists(model_path):
164
- logger.info("Loading local FastText model from %s", model_path)
165
  return fasttext.load_model(model_path)
166
  else:
167
- logger.warning("Model path %s does not exist", model_path)
168
  return None
169
  except Exception as e:
170
- logger.error("Error loading FastText model: %s", str(e))
171
  return None
172
 
173
 
@@ -177,8 +206,10 @@ def get_text_embedding(
177
  tokenize_fn=None,
178
  use_stopwords: bool = True,
179
  stopwords_set=None,
180
- use_tfidf_weighting: bool = True, # Enabled by default for better results
181
- corpus_token_freq=None
 
 
182
  ) -> np.ndarray:
183
  """
184
  Get embedding for a text using a FastText model with optional TF-IDF weighting.
@@ -199,83 +230,141 @@ def get_text_embedding(
199
  return np.zeros(model.get_dimension())
200
 
201
  # Handle tokenization
202
- if tokenize_fn is None:
203
- # Simple whitespace tokenization as fallback
204
- tokens = text.split()
205
- elif isinstance(tokenize_fn, list):
206
- # If tokenize_fn is already a list of tokens, use it directly
207
- tokens = tokenize_fn
208
- elif callable(tokenize_fn):
209
- # If tokenize_fn is a function, call it
210
  tokens = tokenize_fn(text)
 
 
 
 
211
  else:
212
- # If tokenize_fn is something else (like a string), use whitespace tokenization
213
- logger.warning(f"Unexpected tokenize_fn type: {type(tokenize_fn)}. Using default whitespace tokenization.")
 
 
 
214
  tokens = text.split()
215
-
216
- # Filter out stopwords if enabled and stopwords_set is provided
217
- if use_stopwords and stopwords_set is not None:
218
- tokens = [token for token in tokens if token not in stopwords_set]
219
-
220
- # If all tokens were filtered out as stopwords, return zero vector
221
- if not tokens:
222
- return np.zeros(model.get_dimension())
223
-
224
- # Filter out empty tokens
225
- tokens = [token for token in tokens if token.strip()]
226
-
227
- if not tokens:
228
- return np.zeros(model.get_dimension())
229
-
230
- # Calculate TF-IDF weighted average if requested
231
- if use_tfidf_weighting and corpus_token_freq is not None:
232
- # Calculate term frequencies in this document
233
- token_counts = {}
234
- for token in tokens:
235
- token_counts[token] = token_counts.get(token, 0) + 1
236
 
237
- # Calculate IDF for each token with improved stability
238
- N = sum(corpus_token_freq.values()) # Total number of tokens in corpus
239
- N = max(N, 1) # Ensure N is at least 1 to avoid division by zero
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
240
 
241
- # Compute TF-IDF weights with safeguards against extreme values
242
- weights = []
243
- for token in tokens:
244
- # Term frequency in this document
245
- tf = token_counts.get(token, 0) / max(len(tokens), 1) if len(tokens) > 0 else 0
246
-
247
- # Inverse document frequency with smoothing to avoid extreme values
248
- token_freq = corpus_token_freq.get(token, 0)
249
- idf = math.log((N + 1) / (token_freq + 1)) + 1 # Add 1 for smoothing
250
 
251
- # TF-IDF weight with bounds to prevent extreme values
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
252
  weight = tf * idf
253
- weight = min(max(weight, 0.1), 10.0) # Limit to reasonable range
254
- weights.append(weight)
255
 
256
- # Normalize weights to sum to 1 with stability checks
257
- total_weight = sum(weights)
258
- if total_weight > 0:
259
- weights = [w / total_weight for w in weights]
 
 
 
 
 
 
 
260
  else:
261
- # If all weights are 0, use uniform weights
262
- weights = [1.0 / len(tokens) if len(tokens) > 0 else 0 for _ in tokens]
263
-
264
- # Check for NaN or infinite values and replace with uniform weights if found
265
- if any(math.isnan(w) or math.isinf(w) for w in weights):
266
- logger.warning("Found NaN or infinite weights in TF-IDF calculation. Using uniform weights instead.")
267
- weights = [1.0 / len(tokens) if len(tokens) > 0 else 0 for _ in tokens]
268
 
269
- # Get vectors for each token and apply weights
270
- vectors = [model.get_word_vector(token) for token in tokens]
271
- weighted_vectors = [w * v for w, v in zip(weights, vectors)]
 
 
 
 
 
 
 
 
 
 
 
 
272
 
273
- # Sum the weighted vectors
274
- return np.sum(weighted_vectors, axis=0)
275
  else:
276
- # Simple averaging if TF-IDF is not enabled or corpus frequencies not provided
277
- vectors = [model.get_word_vector(token) for token in tokens]
278
- return np.mean(vectors, axis=0)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
279
 
280
 
281
  def get_batch_embeddings(
@@ -284,8 +373,10 @@ def get_batch_embeddings(
284
  tokenize_fn=None,
285
  use_stopwords: bool = True,
286
  stopwords_set=None,
287
- use_tfidf_weighting: bool = True, # Enabled by default for better results
288
- corpus_token_freq=None
 
 
289
  ) -> np.ndarray:
290
  """
291
  Get embeddings for a batch of texts with optional TF-IDF weighting.
@@ -302,54 +393,31 @@ def get_batch_embeddings(
302
  Returns:
303
  Array of text embedding vectors
304
  """
305
- # If corpus_token_freq is not provided but TF-IDF is requested, build it from the texts
306
- if use_tfidf_weighting and corpus_token_freq is None:
307
- logger.info("Building corpus token frequency dictionary for TF-IDF weighting")
308
- corpus_token_freq = {}
309
-
310
- # Process each text to build corpus token frequencies
311
- for text in texts:
312
- if not text.strip():
313
- continue
314
-
315
- # Handle tokenization
316
- if tokenize_fn is None:
317
- tokens = text.split()
318
- elif isinstance(tokenize_fn, list):
319
- # In this case, tokenize_fn should be a list of lists (one list of tokens per text)
320
- # This is not a common use case, so we'll just use the first one as fallback
321
- tokens = tokenize_fn[0] if tokenize_fn else []
322
- else:
323
- tokens = tokenize_fn(text)
324
-
325
- # Filter out stopwords if enabled
326
- if use_stopwords and stopwords_set is not None:
327
- tokens = [token for token in tokens if token not in stopwords_set]
328
-
329
- # Update corpus token frequencies
330
- for token in tokens:
331
- if token.strip(): # Skip empty tokens
332
- corpus_token_freq[token] = corpus_token_freq.get(token, 0) + 1
333
-
334
- logger.info("Built corpus token frequency dictionary with %d unique tokens", len(corpus_token_freq))
335
-
336
  # Get embeddings for each text
337
  embeddings = []
338
- for i, text in enumerate(texts):
339
- # Handle pre-tokenized input
340
- tokens = None
341
- if isinstance(tokenize_fn, list):
 
 
 
342
  if i < len(tokenize_fn):
343
- tokens = tokenize_fn[i]
 
 
 
344
 
345
  embedding = get_text_embedding(
346
- text,
347
  model,
348
- tokenize_fn=tokens, # Pass the tokens directly, not the function
349
  use_stopwords=use_stopwords,
350
  stopwords_set=stopwords_set,
351
  use_tfidf_weighting=use_tfidf_weighting,
352
- corpus_token_freq=corpus_token_freq
 
 
353
  )
354
  embeddings.append(embedding)
355
 
@@ -359,11 +427,12 @@ def get_batch_embeddings(
359
  def generate_embeddings(
360
  texts: List[str],
361
  model: fasttext.FastText._FastText,
362
- device: str,
363
- model_type: str = "sentence_transformer",
364
  tokenize_fn=None,
365
  use_stopwords: bool = True,
366
- use_lite_stopwords: bool = False
 
 
 
367
  ) -> np.ndarray:
368
  """
369
  Generate embeddings for a list of texts using a FastText model.
@@ -371,17 +440,16 @@ def generate_embeddings(
371
  Args:
372
  texts: List of input texts
373
  model: FastText model
374
- device: Device to use for computation (not used for FastText)
375
- model_type: Model type ('sentence_transformer' or 'fasttext')
376
  tokenize_fn: Optional tokenization function or pre-tokenized list of tokens
377
  use_stopwords: Whether to filter out stopwords
378
  use_lite_stopwords: Whether to use a lighter set of stopwords
 
 
 
379
 
380
  Returns:
381
  Array of text embedding vectors
382
  """
383
- if model_type != "fasttext":
384
- logger.warning("Model type %s not supported for FastText. Using FastText anyway.", model_type)
385
 
386
  # Generate embeddings using FastText
387
  try:
@@ -399,7 +467,10 @@ def generate_embeddings(
399
  tokenize_fn=tokenize_fn,
400
  use_stopwords=use_stopwords,
401
  stopwords_set=stopwords_set,
402
- use_tfidf_weighting=True # Enable TF-IDF weighting for better results
 
 
 
403
  )
404
 
405
  logger.info("FastText embeddings generated with shape: %s", str(embeddings.shape))
 
4
  """
5
 
6
  import os
7
+ from pathlib import Path
8
  import math
9
  import logging
10
  import numpy as np
11
  import fasttext
12
+ from collections import Counter
13
+ from typing import List, Set, Optional
14
  from huggingface_hub import hf_hub_download
15
 
16
  # Set up logging
17
  logger = logging.getLogger(__name__)
18
+ logger.setLevel(logging.DEBUG) # Ensure this logger processes DEBUG messages
19
 
20
  # Default parameters optimized for Tibetan
21
  DEFAULT_DIM = 100
 
28
 
29
  # Define paths for model storage
30
  DEFAULT_MODEL_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "models")
31
+ DEFAULT_MODEL_PATH = str(Path(__file__).resolve().parent.parent / "fasttext-modelling" / "tibetan_cbow_model.bin") # Updated to custom model
32
 
33
  # Facebook's official Tibetan FastText model
34
  FACEBOOK_TIBETAN_MODEL_ID = "facebook/fasttext-bo-vectors"
 
136
  return model
137
 
138
 
139
+ def load_facebook_official_tibetan_model() -> Optional[fasttext.FastText._FastText]:
140
+ """
141
+ Downloads (if necessary) and loads the official Facebook FastText Tibetan model.
142
+
143
+ Returns:
144
+ Loaded FastText model or None if loading fails.
145
+ """
146
+ try:
147
+ logger.info("Attempting to download and load official Facebook FastText Tibetan model")
148
+ facebook_model_path = hf_hub_download(
149
+ repo_id=FACEBOOK_TIBETAN_MODEL_ID,
150
+ filename=FACEBOOK_TIBETAN_MODEL_FILE,
151
+ cache_dir=DEFAULT_MODEL_DIR
152
+ )
153
+ logger.info(f"Loading official Facebook FastText Tibetan model from {facebook_model_path}")
154
+ model = fasttext.load_model(facebook_model_path)
155
+ if model:
156
+ logger.info(f"FastText model loaded in load_facebook_official_tibetan_model. Type: {type(model)}")
157
+ try:
158
+ # Basic check: get model dimensions
159
+ dims = model.get_dimension()
160
+ logger.info(f"Model dimensions reported by fasttext_embedding: {dims}")
161
+ # Check for a specific word to see if get_word_vector is callable with a string
162
+ # Using a common Tibetan particle that should be in the vocab
163
+ test_word = "ལ་"
164
+ try:
165
+ vec = model.get_word_vector(test_word)
166
+ logger.info(f"Successfully retrieved vector for test word '{test_word}'. Vector shape: {vec.shape if vec is not None else 'None'}")
167
+ except Exception as e_gwv:
168
+ logger.error(f"Error calling get_word_vector for test word '{test_word}' in fasttext_embedding: {e_gwv}", exc_info=True)
169
+ # Potentially re-raise or handle if this is critical for model validity
170
+ except Exception as e_diag_load:
171
+ logger.error(f"Error during diagnostic checks of loaded FastText model in fasttext_embedding: {e_diag_load}", exc_info=True)
172
+ # If diagnostics fail, the model might be unusable. Consider returning None.
173
+ # For now, let it return the model and fail later if that's the case.
174
+ else:
175
+ logger.error("fasttext.load_model returned None in load_facebook_official_tibetan_model.")
176
+ return model
177
+ except Exception as e_fb:
178
+ logger.error(f"Could not load official Facebook FastText Tibetan model (outer try-except): {str(e_fb)}", exc_info=True)
179
+ return None
180
+
181
  def load_fasttext_model(model_path: str = DEFAULT_MODEL_PATH) -> Optional[fasttext.FastText._FastText]:
182
  """
183
+ Load a custom FastText model from the specified file path.
184
 
185
  Args:
186
+ model_path: Path to the custom model file.
187
 
188
  Returns:
189
+ Loaded FastText model or None if loading fails.
190
  """
191
  try:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
192
  if os.path.exists(model_path):
193
+ logger.info(f"Attempting to load custom FastText model from {model_path}")
194
  return fasttext.load_model(model_path)
195
  else:
196
+ logger.error(f"Custom FastText model path {model_path} does not exist.")
197
  return None
198
  except Exception as e:
199
+ logger.error(f"Could not load custom FastText model from {model_path}: {str(e)}")
200
  return None
201
 
202
 
 
206
  tokenize_fn=None,
207
  use_stopwords: bool = True,
208
  stopwords_set=None,
209
+ use_tfidf_weighting: bool = True,
210
+ corpus_token_freq=None, # Retained for TF, but IDF will use doc_freq_map
211
+ doc_freq_map=None, # Document frequency map for IDF
212
+ total_docs_in_corpus=0 # Total documents in corpus for IDF
213
  ) -> np.ndarray:
214
  """
215
  Get embedding for a text using a FastText model with optional TF-IDF weighting.
 
230
  return np.zeros(model.get_dimension())
231
 
232
  # Handle tokenization
233
+ if callable(tokenize_fn):
 
 
 
 
 
 
 
234
  tokens = tokenize_fn(text)
235
+ logger.debug(f"Tokens from callable tokenize_fn (first 20): {tokens[:20]}")
236
+ elif isinstance(tokenize_fn, list):
237
+ tokens = tokenize_fn # Use the provided list directly
238
+ logger.debug(f"Tokens provided as list (first 20): {tokens[:20]}")
239
  else:
240
+ if tokenize_fn is not None:
241
+ logger.warning(f"tokenize_fn is of unexpected type: {type(tokenize_fn)}. Defaulting to space-split.")
242
+ else:
243
+ # This case handles tokenize_fn being explicitly None
244
+ logger.debug("tokenize_fn is None. Defaulting to space-split.")
245
  tokens = text.split()
246
+ logger.debug(f"Tokens from space-split fallback (first 20): {tokens[:20]}")
247
+
248
+ if use_stopwords and stopwords_set:
249
+ logger.debug(f"Original tokens before stopword check (first 20): {tokens[:20]}")
250
+ original_token_count = len(tokens)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
251
 
252
+ def _remove_stopwords_from_tokens(tokens: List[str], stopwords_set: Set[str]) -> List[str]:
253
+ """
254
+ Removes stopwords from a list of tokens.
255
+ Handles Tibetan punctuation by checking both the token itself and the token after
256
+ stripping trailing '།' or '༔'.
257
+ """
258
+ cleaned_tokens = []
259
+ removed_count = 0
260
+ for token in tokens:
261
+ # 1. Check if the original token itself is a stopword (e.g., standalone '།')
262
+ if token in stopwords_set:
263
+ removed_count += 1
264
+ continue # Skip this token
265
+
266
+ # 2. If not a direct stopword, check if it becomes one after stripping trailing punctuation
267
+ # This handles cases like "གྲུབ་པའི་།" where "གྲུབ་པའི་" is the stopword.
268
+ token_for_check = token
269
+ punctuation_was_stripped = False
270
+ if token.endswith(('།', '༔')):
271
+ stripped_token = token.rstrip('།༔')
272
+ if stripped_token != token: # Check if stripping actually changed the token
273
+ token_for_check = stripped_token
274
+ punctuation_was_stripped = True
275
 
276
+ if punctuation_was_stripped and token_for_check in stopwords_set:
277
+ removed_count += 1
278
+ continue # Skip this token
 
 
 
 
 
 
279
 
280
+ # 3. If neither the original token nor its base form is a stopword, keep it.
281
+ cleaned_tokens.append(token)
282
+
283
+ return cleaned_tokens
284
+
285
+ tokens = _remove_stopwords_from_tokens(tokens, stopwords_set)
286
+ removed_count = original_token_count - len(tokens)
287
+ logger.debug(f"Tokens after stopword removal (removed {removed_count}): {tokens[:20]}")
288
+
289
+ if not tokens:
290
+ logger.debug("Text became empty after tokenization/stopwords, returning zero vector.")
291
+ return np.zeros(model.get_dimension())
292
+
293
+ if use_tfidf_weighting and doc_freq_map and total_docs_in_corpus is not None and total_docs_in_corpus > 0:
294
+ logger.debug("Applying TF-IDF weighting.")
295
+ N_docs = total_docs_in_corpus
296
+ logger.debug(f"Total documents (N_docs) for IDF: {N_docs}")
297
+
298
+ token_counts = Counter(tokens)
299
+ logger.debug(f"Local token counts for this segment (top 5): {dict(token_counts.most_common(5))}")
300
+
301
+ tf_idf_weights = []
302
+ token_details_log = []
303
+
304
+ for token in tokens: # Iterate in original token order
305
+ tf = token_counts.get(token, 0) / len(tokens) if len(tokens) > 0 else 0
306
+ df = doc_freq_map.get(token, 0)
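+ # Smoothed IDF: log((N_docs + 1) / (df + 1)) + 1 stays finite when df is 0 and keeps every weight positive.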
307
+ idf = math.log((N_docs + 1) / (df + 1)) + 1
308
  weight = tf * idf
309
+ tf_idf_weights.append(weight)
310
+ token_details_log.append(f"Token: '{token}', TF: {tf:.4f}, DF: {df}, IDF: {idf:.4f}, Raw_TFIDF: {weight:.4f}")
311
 
312
+ logger.debug("Token TF-IDF details (first 10 tokens):")
313
+ for i, log_entry in enumerate(token_details_log[:10]):
314
+ logger.debug(f" {i+1}. {log_entry}")
315
+
316
+ total_weight = sum(tf_idf_weights)
317
+ logger.debug(f"Sum of raw TF-IDF weights: {total_weight}")
318
+ logger.debug(f"TF-IDF Summary for text snippet (first 100 chars): '{text[:100]}'. Total_TFIDF_Weight: {total_weight:.8e}. Fallback_to_Uniform: {total_weight <= 1e-6}.")
319
+ normalized_weights = []
320
+ if total_weight > 1e-6:
321
+ normalized_weights = [w / total_weight for w in tf_idf_weights]
322
+ logger.debug(f"Normalized weights (first 10): {[f'{w:.4f}' for w in normalized_weights[:10]]}")
323
  else:
324
+ logger.debug("Total TF-IDF weight is very small, falling back to uniform weights.")
325
+ num_tokens = len(tokens)
326
+ if num_tokens > 0:
327
+ normalized_weights = [1/num_tokens] * num_tokens
328
+ logger.debug(f"Uniform weights (first 10): {[f'{w:.4f}' for w in normalized_weights[:10]]}")
 
 
329
 
330
+ weighted_embeddings_sum = np.zeros(model.get_dimension())
331
+ if len(normalized_weights) == len(tokens):
332
+ for i, token in enumerate(tokens):
333
+ word_vector = model.get_word_vector(token)
334
+ vec_sum_for_log = np.sum(word_vector)
335
+ logger.debug(f" Token: '{token}', Word_Vec_Sum: {vec_sum_for_log:.4f}, Applied_Weight: {normalized_weights[i]:.4f}")
336
+ weighted_embeddings_sum += word_vector * normalized_weights[i]
337
+ final_embedding = weighted_embeddings_sum
338
+ else:
339
+ logger.error("Mismatch between token count and normalized_weights count. THIS IS A BUG. Falling back to simple average.")
340
+ embeddings = [model.get_word_vector(t) for t in tokens]
341
+ if embeddings:
342
+ final_embedding = np.mean(embeddings, axis=0)
343
+ else:
344
+ final_embedding = np.zeros(model.get_dimension())
345
 
 
 
346
  else:
347
+ if use_tfidf_weighting:
348
+ logger.debug("TF-IDF weighting was requested but doc_freq_map or total_docs_in_corpus is missing/invalid. Falling back to simple averaging.")
349
+ else:
350
+ logger.debug("Using simple averaging of word vectors (TF-IDF not requested or N_docs=0).")
351
+
352
+ embeddings = []
353
+ for token in tokens:
354
+ word_vector = model.get_word_vector(token)
355
+ embeddings.append(word_vector)
356
+ vec_sum_for_log = np.sum(word_vector)
357
+ logger.debug(f" Token: '{token}', Word_Vec_Sum: {vec_sum_for_log:.4f} (simple avg context)")
358
+
359
+ if embeddings:
360
+ final_embedding = np.mean(embeddings, axis=0)
361
+ else:
362
+ final_embedding = np.zeros(model.get_dimension())
363
+
364
+ final_emb_sum_for_log = np.sum(final_embedding)
365
+ logger.debug(f"Final aggregated embedding sum: {final_emb_sum_for_log:.4f}, shape: {final_embedding.shape}")
366
+ logger.debug(f"--- get_text_embedding finished for text (first 50 chars): {text[:50]} ---")
367
+ return final_embedding
368
 
369
 
370
  def get_batch_embeddings(
 
373
  tokenize_fn=None,
374
  use_stopwords: bool = True,
375
  stopwords_set=None,
376
+ use_tfidf_weighting: bool = True,
377
+ corpus_token_freq=None, # Corpus-wide term frequencies
378
+ doc_freq_map=None, # Document frequency map for IDF
379
+ total_docs_in_corpus=0 # Total documents in corpus for IDF
380
  ) -> np.ndarray:
381
  """
382
  Get embeddings for a batch of texts with optional TF-IDF weighting.
 
393
  Returns:
394
  Array of text embedding vectors
395
  """
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
396
  # Get embeddings for each text
397
  embeddings = []
398
+ for i, text_content in enumerate(texts): # Changed 'text' to 'text_content'
399
+ tokens_or_tokenizer_for_current_text = None
400
+
401
+ if callable(tokenize_fn):
402
+ tokens_or_tokenizer_for_current_text = tokenize_fn # Pass the function itself
403
+ elif isinstance(tokenize_fn, list):
404
+ # If tokenize_fn is a list, it's assumed to be a list of pre-tokenized documents
405
  if i < len(tokenize_fn):
406
+ tokens_or_tokenizer_for_current_text = tokenize_fn[i] # This is List[str] for the current text
407
+ else:
408
+ logger.warning(f"Pre-tokenized list `tokenize_fn` is shorter than the list of texts. Index {i} is out of bounds for `tokenize_fn` with length {len(tokenize_fn)}. Defaulting to None for this text.")
409
+ # If tokenize_fn is None or other, tokens_or_tokenizer_for_current_text remains None (get_text_embedding handles default).
410
 
411
  embedding = get_text_embedding(
412
+ text_content, # Use renamed variable
413
  model,
414
+ tokenize_fn=tokens_or_tokenizer_for_current_text, # Pass the correctly determined function or token list
415
  use_stopwords=use_stopwords,
416
  stopwords_set=stopwords_set,
417
  use_tfidf_weighting=use_tfidf_weighting,
418
+ corpus_token_freq=corpus_token_freq,
419
+ doc_freq_map=doc_freq_map,
420
+ total_docs_in_corpus=total_docs_in_corpus
421
  )
422
  embeddings.append(embedding)
423
 
 
427
  def generate_embeddings(
428
  texts: List[str],
429
  model: fasttext.FastText._FastText,
 
 
430
  tokenize_fn=None,
431
  use_stopwords: bool = True,
432
+ use_lite_stopwords: bool = False,
433
+ corpus_token_freq=None, # Existing: For TF-IDF
434
+ doc_freq_map=None, # Added: For TF-IDF document frequency
435
+ total_docs_in_corpus=0 # Added: For TF-IDF total documents in corpus
436
  ) -> np.ndarray:
437
  """
438
  Generate embeddings for a list of texts using a FastText model.
 
440
  Args:
441
  texts: List of input texts
442
  model: FastText model
 
 
443
  tokenize_fn: Optional tokenization function or pre-tokenized list of tokens
444
  use_stopwords: Whether to filter out stopwords
445
  use_lite_stopwords: Whether to use a lighter set of stopwords
446
+ corpus_token_freq: Precomputed term frequencies for the corpus (for TF-IDF).
447
+ doc_freq_map: Precomputed document frequencies for tokens (for TF-IDF).
448
+ total_docs_in_corpus: Total number of documents in the corpus (for TF-IDF).
449
 
450
  Returns:
451
  Array of text embedding vectors
452
  """
 
 
453
 
454
  # Generate embeddings using FastText
455
  try:
 
467
  tokenize_fn=tokenize_fn,
468
  use_stopwords=use_stopwords,
469
  stopwords_set=stopwords_set,
470
+ use_tfidf_weighting=True, # TF-IDF weighting enabled
471
+ corpus_token_freq=corpus_token_freq, # Pass down
472
+ doc_freq_map=doc_freq_map, # Pass down for TF-IDF
473
+ total_docs_in_corpus=total_docs_in_corpus # Pass down for TF-IDF
474
  )
475
 
476
  logger.info("FastText embeddings generated with shape: %s", str(embeddings.shape))
pipeline/metrics.py CHANGED
@@ -1,15 +1,14 @@
1
  import numpy as np
2
  import pandas as pd
3
- from typing import List, Dict
4
  from itertools import combinations
5
  from sklearn.metrics.pairwise import cosine_similarity
6
- import torch
7
  from .semantic_embedding import generate_embeddings
8
  from .tokenize import tokenize_texts
9
  import logging
10
  from sklearn.feature_extraction.text import TfidfVectorizer
11
- from .stopwords_bo import TIBETAN_STOPWORDS, TIBETAN_STOPWORDS_SET
12
- from .stopwords_lite_bo import TIBETAN_STOPWORDS_LITE, TIBETAN_STOPWORDS_LITE_SET
13
 
14
  # Attempt to import the Cython-compiled fast_lcs module
15
  try:
@@ -21,58 +20,8 @@ except ImportError:
21
 
22
  logger = logging.getLogger(__name__)
23
 
24
- MAX_TOKENS_PER_CHUNK = 500 # Max tokens (words via botok) per chunk
25
- CHUNK_OVERLAP = 50 # Number of tokens to overlap between chunks
26
 
27
 
28
- def _chunk_text(
29
- original_text_content: str,
30
- tokens: List[str],
31
- max_chunk_tokens: int,
32
- overlap_tokens: int,
33
- ) -> List[str]:
34
- """
35
- Splits a list of tokens into chunks and reconstructs text segments from these token chunks.
36
- The reconstructed text segments are intended for embedding models.
37
- Args:
38
- original_text_content (str): The original raw text string. Used if no chunking is needed.
39
- tokens (List[str]): The list of botok tokens for the original_text_content.
40
- max_chunk_tokens (int): Maximum number of botok tokens per chunk.
41
- overlap_tokens (int): Number of botok tokens to overlap between chunks.
42
-
43
- Returns:
44
- List[str]: A list of text strings, where each string is a chunk.
45
- """
46
- if (
47
- not tokens
48
- ): # Handles empty or whitespace-only original text that led to no tokens
49
- return [original_text_content] if original_text_content.strip() else []
50
-
51
- if len(tokens) <= max_chunk_tokens:
52
- # If not chunking, return the original text content directly, as per MEMORY[a777e6ad-11c4-4b90-8e6e-63a923a94432]
53
- # The memory states raw text segments are passed directly to the model.
54
- # Joining tokens here would alter spacing, etc.
55
- return [original_text_content]
56
-
57
- reconstructed_text_chunks = []
58
- start_idx = 0
59
- while start_idx < len(tokens):
60
- end_idx = min(start_idx + max_chunk_tokens, len(tokens))
61
- current_chunk_botok_tokens = tokens[start_idx:end_idx]
62
- # Reconstruct the text chunk by joining the botok tokens. This is an approximation.
63
- # The semantic model's internal tokenizer will handle this string.
64
- reconstructed_text_chunks.append(" ".join(current_chunk_botok_tokens))
65
-
66
- if end_idx == len(tokens):
67
- break
68
-
69
- next_start_idx = start_idx + max_chunk_tokens - overlap_tokens
70
- if next_start_idx <= start_idx:
71
- next_start_idx = start_idx + 1
72
- start_idx = next_start_idx
73
-
74
- return reconstructed_text_chunks
75
-
76
 
77
  def compute_normalized_lcs(words1: List[str], words2: List[str]) -> float:
78
  # Calculate m and n (lengths) here, so they are available for normalization
@@ -100,214 +49,126 @@ def compute_normalized_lcs(words1: List[str], words2: List[str]) -> float:
100
  return lcs_length / avg_length if avg_length > 0 else 0.0
101
 
102
 
 
103
  def compute_semantic_similarity(
104
  text1_segment: str,
105
  text2_segment: str,
106
- tokens1: List[str],
107
- tokens2: List[str],
108
- model,
109
- device,
110
- model_type: str = "sentence_transformer",
111
  use_stopwords: bool = True,
112
  use_lite_stopwords: bool = False,
 
 
 
 
113
  ) -> float:
114
- """Computes semantic similarity using a sentence transformer model, with chunking for long texts."""
115
- if model is None or device is None:
 
 
 
 
116
  logger.warning(
117
- "Semantic similarity model or device not available. Skipping calculation."
118
  )
119
- return np.nan # Return NaN if model isn't loaded
120
 
121
  if not text1_segment or not text2_segment:
122
  logger.info(
123
  "One or both texts are empty for semantic similarity. Returning 0.0."
124
  )
125
- return 0.0 # Or np.nan, depending on desired behavior for empty inputs
126
 
127
  def _get_aggregated_embedding(
128
- raw_text_segment: str, botok_tokens: List[str], model_obj, device_str, model_type: str = "sentence_transformer", use_stopwords: bool = True, use_lite_stopwords: bool = False
129
- ) -> torch.Tensor | None:
130
- """Helper to get a single embedding for a text, chunking if necessary for transformer models."""
131
- if (
132
- not botok_tokens and not raw_text_segment.strip()
133
- ): # Check if effectively empty
 
 
 
 
 
 
134
  logger.info(
135
  f"Text segment is empty or only whitespace: {raw_text_segment[:100]}... Returning None for embedding."
136
  )
137
  return None
138
 
139
- # For FastText, we don't need chunking as it processes tokens directly
140
- if model_type == "fasttext":
141
- if not raw_text_segment.strip():
142
- logger.info(
143
- f"Text segment is empty or only whitespace: {raw_text_segment[:100]}... Returning None for embedding."
144
- )
145
- return None
146
-
147
- # Pass the raw text, pre-tokenized tokens, and stopword parameters
148
- # Wrap the tokens in a list since generate_embeddings expects a list of token lists
149
- embedding = generate_embeddings(
150
- [raw_text_segment],
151
- model_obj,
152
- device_str,
153
- model_type,
154
- tokenize_fn=[botok_tokens], # Wrap in list since we're passing tokens for one text
155
- use_stopwords=use_stopwords,
156
- use_lite_stopwords=use_lite_stopwords
157
- )
158
-
159
- if embedding is None or embedding.nelement() == 0:
160
- logger.error(
161
- f"Failed to generate FastText embedding for text: {raw_text_segment[:100]}..."
162
- )
163
- return None
164
- return embedding # Already [1, embed_dim]
165
 
166
- # For transformer models, check if all tokens are stopwords when filtering is enabled
167
- elif use_stopwords:
168
- # Filter stopwords to see if any content remains
169
- filtered_tokens = []
170
- if use_lite_stopwords:
171
- from .stopwords_lite_bo import TIBETAN_STOPWORDS_LITE_SET
172
- filtered_tokens = [token for token in botok_tokens if token not in TIBETAN_STOPWORDS_LITE_SET]
173
- else:
174
- from .stopwords_bo import TIBETAN_STOPWORDS_SET
175
- filtered_tokens = [token for token in botok_tokens if token not in TIBETAN_STOPWORDS_SET]
176
-
177
- # If all tokens were filtered out as stopwords, return zero embedding
178
- if not filtered_tokens:
179
- logger.info("All tokens in text are stopwords. Returning zero embedding.")
180
- # Create a zero tensor with the same dimension as the model's output
181
- # For transformer models, typically 384 or 768 dimensions
182
- embedding_dim = 384 # Default dimension for MiniLM models
183
- return torch.zeros(1, embedding_dim)
184
-
185
- # Continue with normal processing if content remains after filtering
186
- if len(botok_tokens) > MAX_TOKENS_PER_CHUNK:
187
- logger.info(
188
- f"Text segment with ~{len(botok_tokens)} tokens exceeds {MAX_TOKENS_PER_CHUNK}, chunking {raw_text_segment[:30]}..."
189
- )
190
- # Pass the original raw text and its pre-computed botok tokens to _chunk_text
191
- text_chunks = _chunk_text(
192
- raw_text_segment, botok_tokens, MAX_TOKENS_PER_CHUNK, CHUNK_OVERLAP
193
- )
194
- if not text_chunks:
195
- logger.warning(
196
- f"Chunking resulted in no chunks for segment: {raw_text_segment[:100]}..."
197
- )
198
- return None
199
-
200
- logger.info(
201
- f"Generated {len(text_chunks)} chunks for segment: {raw_text_segment[:30]}..."
202
- )
203
- # Generate embeddings for each chunk using the model
204
- chunk_embeddings = generate_embeddings(text_chunks, model_obj, device_str, model_type)
205
-
206
- if chunk_embeddings is None or chunk_embeddings.nelement() == 0:
207
- logger.error(
208
- f"Failed to generate embeddings for chunks of text: {raw_text_segment[:100]}..."
209
- )
210
- return None
211
- # Mean pooling of chunk embeddings
212
- aggregated_embedding = torch.mean(chunk_embeddings, dim=0, keepdim=True)
213
- return aggregated_embedding
214
- else:
215
- # Text is short enough for transformer model, embed raw text directly
216
- if not raw_text_segment.strip():
217
- logger.info(
218
- f"Text segment is empty or only whitespace: {raw_text_segment[:100]}... Returning None for embedding."
219
- )
220
- return None
221
-
222
- embedding = generate_embeddings([raw_text_segment], model_obj, device_str, model_type)
223
- if embedding is None or embedding.nelement() == 0:
224
- logger.error(
225
- f"Failed to generate embedding for text: {raw_text_segment[:100]}..."
226
- )
227
- return None
228
- return embedding # Already [1, embed_dim]
229
- else:
230
- # No stopword filtering, proceed with normal processing
231
- if len(botok_tokens) > MAX_TOKENS_PER_CHUNK:
232
- logger.info(
233
- f"Text segment with ~{len(botok_tokens)} tokens exceeds {MAX_TOKENS_PER_CHUNK}, chunking {raw_text_segment[:30]}..."
234
- )
235
- # Pass the original raw text and its pre-computed botok tokens to _chunk_text
236
- text_chunks = _chunk_text(
237
- raw_text_segment, botok_tokens, MAX_TOKENS_PER_CHUNK, CHUNK_OVERLAP
238
- )
239
- if not text_chunks:
240
- logger.warning(
241
- f"Chunking resulted in no chunks for segment: {raw_text_segment[:100]}..."
242
- )
243
- return None
244
-
245
- logger.info(
246
- f"Generated {len(text_chunks)} chunks for segment: {raw_text_segment[:30]}..."
247
- )
248
- # Generate embeddings for each chunk using the model
249
- chunk_embeddings = generate_embeddings(text_chunks, model_obj, device_str, model_type)
250
-
251
- if chunk_embeddings is None or chunk_embeddings.nelement() == 0:
252
- logger.error(
253
- f"Failed to generate embeddings for chunks of text: {raw_text_segment[:100]}..."
254
- )
255
- return None
256
- # Mean pooling of chunk embeddings
257
- aggregated_embedding = torch.mean(chunk_embeddings, dim=0, keepdim=True)
258
- return aggregated_embedding
259
- else:
260
- # Text is short enough for transformer model, embed raw text directly
261
- if not raw_text_segment.strip():
262
- logger.info(
263
- f"Text segment is empty or only whitespace: {raw_text_segment[:100]}... Returning None for embedding."
264
- )
265
- return None
266
-
267
- embedding = generate_embeddings([raw_text_segment], model_obj, device_str, model_type)
268
- if embedding is None or embedding.nelement() == 0:
269
- logger.error(
270
- f"Failed to generate embedding for text: {raw_text_segment[:100]}..."
271
- )
272
- return None
273
- return embedding # Already [1, embed_dim]
274
 
275
  try:
276
- # Pass raw text and its pre-computed botok tokens with stopword preference
277
- embedding1 = _get_aggregated_embedding(text1_segment, tokens1, model, device, model_type, use_stopwords, use_lite_stopwords)
278
- embedding2 = _get_aggregated_embedding(text2_segment, tokens2, model, device, model_type, use_stopwords, use_lite_stopwords)
279
-
280
- if (
281
- embedding1 is None
282
- or embedding2 is None
283
- or embedding1.nelement() == 0
284
- or embedding2.nelement() == 0
285
- ):
286
  logger.error(
287
- "Failed to obtain one or both aggregated embeddings for semantic similarity."
288
  )
289
  return np.nan
290
 
291
- # Check if both embeddings are zero vectors (which happens when all tokens are stopwords)
292
- if np.all(embedding1.numpy() == 0) and np.all(embedding2.numpy() == 0):
293
- # If both texts contain only stopwords, return 0 similarity
 
 
 
 
294
  return 0.0
295
-
296
- # Cosine similarity expects 2D arrays, embeddings are [1, embed_dim] and on CPU
297
- similarity = cosine_similarity(embedding1.numpy(), embedding2.numpy())
298
- return float(similarity[0][0])
 
 
 
299
  except Exception as e:
 
 
300
  logger.error(
301
- f"Error computing semantic similarity with chunking:\nText1: '{text1_segment[:100]}...'\nText2: '{text2_segment[:100]}...'\nError: {e}",
302
- exc_info=True,
303
  )
 
304
  return np.nan
305
 
306
 
307
  def compute_all_metrics(
308
- texts: Dict[str, str], model=None, device=None, enable_semantic: bool = True,
309
- model_type: str = "sentence_transformer", use_stopwords: bool = True,
310
- use_lite_stopwords: bool = False
 
311
  ) -> pd.DataFrame:
312
  """
313
  Computes all selected similarity metrics between pairs of texts.
@@ -328,25 +189,60 @@ def compute_all_metrics(
328
  files = list(texts.keys())
329
  results = []
330
  # Prepare token lists (always use tokenize_texts for raw Unicode)
331
- token_lists = {}
332
- corpus_for_tfidf = [] # For storing space-joined tokens for TF-IDF
 
 
 
333
 
334
  for fname, content in texts.items():
335
- tokenized_content = tokenize_texts([content]) # Returns a list of lists
336
- if tokenized_content and tokenized_content[0]:
337
- token_lists[fname] = tokenized_content[0]
338
- else:
339
- token_lists[fname] = []
340
- # Regardless of whether tokenized_content[0] exists, prepare entry for TF-IDF corpus
341
- # If tokens exist, join them; otherwise, use an empty string for that document
342
- corpus_for_tfidf.append(
343
- " ".join(token_lists[fname])
344
- if fname in token_lists and token_lists[fname]
345
- else ""
346
- )
 
 
 
347
 
348
  # TF-IDF Vectorization and Cosine Similarity Calculation
349
- if corpus_for_tfidf:
350
  try:
351
  # Using a dummy tokenizer and preprocessor as input is already tokenized (as space-separated strings)
352
  # and we don't want further case changes or token modifications for Tibetan.
@@ -368,14 +264,14 @@ def compute_all_metrics(
368
  token_pattern=None,
369
  stop_words=stopwords_to_use
370
  )
371
- tfidf_matrix = vectorizer.fit_transform(corpus_for_tfidf)
372
  # Calculate pairwise cosine similarity on the TF-IDF matrix
373
  # This gives a square matrix where cosine_sim_matrix[i, j] is the similarity between doc i and doc j
374
  cosine_sim_matrix = cosine_similarity(tfidf_matrix)
375
  except ValueError as e:
376
  if "empty vocabulary" in str(e):
377
  # If vocabulary is empty after stopword removal, create a zero matrix
378
- n = len(corpus_for_tfidf)
379
  cosine_sim_matrix = np.zeros((n, n))
380
  else:
381
  # Re-raise other ValueError
@@ -421,7 +317,11 @@ def compute_all_metrics(
421
  if enable_semantic:
422
  # Pass raw texts and their pre-computed botok tokens
423
  semantic_sim = compute_semantic_similarity(
424
- texts[f1], texts[f2], words1_raw, words2_raw, model, device, model_type, use_stopwords, use_lite_stopwords
 
 
 
 
425
  )
426
  else:
427
  semantic_sim = np.nan
 
1
  import numpy as np
2
  import pandas as pd
3
+ from typing import List, Dict, Union
4
  from itertools import combinations
5
  from sklearn.metrics.pairwise import cosine_similarity
 
6
  from .semantic_embedding import generate_embeddings
7
  from .tokenize import tokenize_texts
8
  import logging
9
  from sklearn.feature_extraction.text import TfidfVectorizer
10
+ from .stopwords_bo import TIBETAN_STOPWORDS
11
+ from .stopwords_lite_bo import TIBETAN_STOPWORDS_LITE
12
 
13
  # Attempt to import the Cython-compiled fast_lcs module
14
  try:
 
20
 
21
  logger = logging.getLogger(__name__)
22
 
 
 
23
 
24
 
 
 
25
 
26
  def compute_normalized_lcs(words1: List[str], words2: List[str]) -> float:
27
  # Calculate m and n (lengths) here, so they are available for normalization
 
49
  return lcs_length / avg_length if avg_length > 0 else 0.0
50
 
51
 
52
+
53
  def compute_semantic_similarity(
54
  text1_segment: str,
55
  text2_segment: str,
56
+ tokens1: List[str], # botok tokens for text1, not directly used by FastText path but kept for signature
57
+ tokens2: List[str], # botok tokens for text2, not directly used by FastText path but kept for signature
58
+ model, # FastText model object
59
+ model_type: str = "fasttext", # Should always be 'fasttext' when called
 
60
  use_stopwords: bool = True,
61
  use_lite_stopwords: bool = False,
62
+ fasttext_tokenize_fn=None,
63
+ term_freq_corpus=None,
64
+ doc_freq_map=None,
65
+ total_docs_in_corpus=0
66
  ) -> float:
67
+ """Computes semantic similarity using a FastText model."""
68
+ if model_type != "fasttext":
69
+ logger.error(f"compute_semantic_similarity called with unexpected model_type: {model_type}")
70
+ return np.nan
71
+
72
+ if model is None:
73
  logger.warning(
74
+ "FastText model not available for semantic similarity. Skipping calculation."
75
  )
76
+ return np.nan
77
 
78
  if not text1_segment or not text2_segment:
79
  logger.info(
80
  "One or both texts are empty for semantic similarity. Returning 0.0."
81
  )
82
+ return 0.0
83
 
84
  def _get_aggregated_embedding(
85
+ raw_text_segment: str,
86
+ _botok_tokens: List[str], # Parameter name prefixed with _ to indicate it's not used
87
+ model_obj,
88
+ use_stopwords_param: bool,
89
+ use_lite_stopwords_param: bool,
90
+ tokenize_fn_param,
91
+ term_freq_corpus_param,
92
+ doc_freq_map_param,
93
+ total_docs_in_corpus_param
94
+ ) -> Union[np.ndarray, None]:
95
+ """Helper to get a single embedding for a text using FastText."""
96
+ if not raw_text_segment.strip():
97
  logger.info(
98
  f"Text segment is empty or only whitespace: {raw_text_segment[:100]}... Returning None for embedding."
99
  )
100
  return None
101
 
102
+ embedding = generate_embeddings(
103
+ texts=[raw_text_segment],
104
+ model=model_obj,
105
+ tokenize_fn=tokenize_fn_param,
106
+ use_stopwords=use_stopwords_param,
107
+ use_lite_stopwords=use_lite_stopwords_param,
108
+ corpus_token_freq=term_freq_corpus_param,
109
+ doc_freq_map=doc_freq_map_param,
110
+ total_docs_in_corpus=total_docs_in_corpus_param
111
+ )
 
 
 
112
 
113
+ if embedding is None or embedding.size == 0:
114
+ logger.error(
115
+ f"Failed to generate FastText embedding for text: {raw_text_segment[:100]}..."
116
+ )
117
+ return None
118
+ return embedding
 
 
 
119
 
120
  try:
121
+ # Pass all relevant parameters to _get_aggregated_embedding
122
+ emb1 = _get_aggregated_embedding(text1_segment, tokens1, model, use_stopwords, use_lite_stopwords, fasttext_tokenize_fn, term_freq_corpus, doc_freq_map, total_docs_in_corpus)
123
+ emb2 = _get_aggregated_embedding(text2_segment, tokens2, model, use_stopwords, use_lite_stopwords, fasttext_tokenize_fn, term_freq_corpus, doc_freq_map, total_docs_in_corpus)
124
+
125
+ if emb1 is None or emb2 is None or emb1.size == 0 or emb2.size == 0:
 
 
 
 
 
126
  logger.error(
127
+ "Failed to obtain one or both FastText embeddings for semantic similarity."
128
  )
129
  return np.nan
130
 
131
+ # Ensure embeddings are numpy arrays (should be, but defensive)
132
+ if not isinstance(emb1, np.ndarray): emb1 = np.array(emb1)
133
+ if not isinstance(emb2, np.ndarray): emb2 = np.array(emb2)
134
+
135
+ # Handle cases where embeddings are all zeros
136
+ if np.all(emb1 == 0) and np.all(emb2 == 0):
137
+ logger.info("Both FastText embeddings are zero. Semantic similarity is 0.0.")
138
  return 0.0
139
+ if np.all(emb1 == 0) or np.all(emb2 == 0):
140
+ logger.info("One of the FastText embeddings is zero. Semantic similarity is 0.0.")
141
+ return 0.0
142
+
143
+ # Handle NaN or Inf in embeddings
144
+ if np.isnan(emb1).any() or np.isinf(emb1).any() or \
145
+ np.isnan(emb2).any() or np.isinf(emb2).any():
146
+ logger.warning("NaN or Inf found in FastText embeddings. Semantic similarity set to 0.0.")
147
+ return 0.0
148
+
149
+ # Ensure embeddings are 2D for cosine_similarity: [1, dim]
150
+ if emb1.ndim == 1: emb1 = emb1.reshape(1, -1)
151
+ if emb2.ndim == 1: emb2 = emb2.reshape(1, -1)
152
+
153
+ similarity_score = cosine_similarity(emb1, emb2)[0][0]
154
+
155
+ return max(0.0, float(similarity_score))
156
+
157
  except Exception as e:
158
+ safe_text1 = str(text1_segment)[:100] if text1_segment is not None else "N/A"
159
+ safe_text2 = str(text2_segment)[:100] if text2_segment is not None else "N/A"
160
  logger.error(
161
+ f"Error during FastText semantic similarity calculation:\nText1: {safe_text1}...\nText2: {safe_text2}...\nError: {e}"
 
162
  )
163
+ logger.exception("Traceback for FastText semantic similarity calculation error:")
164
  return np.nan
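The corpus statistics threaded through `term_freq_corpus`, `doc_freq_map`, and `total_docs_in_corpus` exist so that `get_batch_embeddings` can weight each word vector before averaging it into a segment embedding. The exact weighting inside `fasttext_embedding.get_batch_embeddings` is not shown in this diff; the sketch below illustrates one plausible IDF-weighted pooling, using only the standard `fasttext` Python API (`get_word_vector`).

```python
import math
import numpy as np

def idf_weighted_segment_vector(tokens, ft_model, doc_freq_map, total_docs_in_corpus):
    """Illustrative IDF-weighted mean of FastText word vectors for one segment."""
    vectors, weights = [], []
    for token in tokens:
        df = doc_freq_map.get(token, 0)
        # Smoothed IDF so tokens unseen in the corpus still get a finite weight.
        idf = math.log((1 + total_docs_in_corpus) / (1 + df)) + 1.0
        vectors.append(ft_model.get_word_vector(token))
        weights.append(idf)
    if not vectors:
        return None
    return np.average(np.stack(vectors), axis=0, weights=np.asarray(weights))
```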
165
 
166
 
167
  def compute_all_metrics(
168
+ texts: Dict[str, str], model=None, enable_semantic: bool = True, # device=None removed
169
+ model_type: str = "fasttext", use_stopwords: bool = True,
170
+ use_lite_stopwords: bool = False,
171
+ fasttext_tokenize_fn=None # Added for FastText specific tokenizer
172
  ) -> pd.DataFrame:
173
  """
174
  Computes all selected similarity metrics between pairs of texts.
 
189
  files = list(texts.keys())
190
  results = []
191
  # Prepare token lists (always use tokenize_texts for raw Unicode)
192
+ token_lists = {} # Stores botok tokens for each text_id, used for Jaccard, LCS, and semantic sim
193
+ corpus_for_sklearn_tfidf = [] # For storing space-joined tokens for scikit-learn's TF-IDF
194
+
195
+ # For FastText TF-IDF related statistics
196
+ term_freq_corpus_for_fasttext = {} # Renamed from global_corpus_token_freq_for_fasttext
197
+ document_frequency_map_for_fasttext = {}
198
+ total_num_documents_for_fasttext = len(texts)
199
+
200
+ stopwords_set_for_fasttext_stats_calc = set()
201
+ if use_stopwords: # This 'use_stopwords' is an arg to compute_all_metrics
202
+ if use_lite_stopwords:
203
+ from .stopwords_lite_bo import TIBETAN_STOPWORDS_LITE_SET
204
+ stopwords_set_for_fasttext_stats_calc = TIBETAN_STOPWORDS_LITE_SET
205
+ else:
206
+ from .stopwords_bo import TIBETAN_STOPWORDS_SET
207
+ stopwords_set_for_fasttext_stats_calc = TIBETAN_STOPWORDS_SET
208
 
209
  for fname, content in texts.items():
210
+ current_tokens_for_file = []
211
+ tokenized_content_list_of_lists = tokenize_texts([content])
212
+ if tokenized_content_list_of_lists and tokenized_content_list_of_lists[0]:
213
+ current_tokens_for_file = tokenized_content_list_of_lists[0]
214
+ token_lists[fname] = current_tokens_for_file
215
+
216
+ corpus_for_sklearn_tfidf.append(" ".join(current_tokens_for_file) if current_tokens_for_file else "")
217
+
218
+ if model_type == "fasttext":
219
+ tokens_for_fasttext_stats = []
220
+ if fasttext_tokenize_fn is not None:
221
+ tokens_for_fasttext_stats = fasttext_tokenize_fn(content)
222
+ else:
223
+ tokens_for_fasttext_stats = current_tokens_for_file
224
+
225
+ filtered_tokens_for_stats = [
226
+ token for token in tokens_for_fasttext_stats if token not in stopwords_set_for_fasttext_stats_calc
227
+ ] if use_stopwords else tokens_for_fasttext_stats
228
+
229
+ # Update corpus-wide term frequencies
230
+ for token in filtered_tokens_for_stats:
231
+ if token.strip():
232
+ term_freq_corpus_for_fasttext[token] = term_freq_corpus_for_fasttext.get(token, 0) + 1
233
+
234
+ # Update document frequencies
235
+ unique_filtered_tokens_in_doc = set(filtered_tokens_for_stats)
236
+ for token in unique_filtered_tokens_in_doc:
237
+ if token.strip():
238
+ document_frequency_map_for_fasttext[token] = document_frequency_map_for_fasttext.get(token, 0) + 1
239
+
240
+ if model_type == "fasttext":
241
+ logger.info(f"Built FastText corpus term frequency map with {len(term_freq_corpus_for_fasttext)} unique tokens.")
242
+ logger.info(f"Built FastText document frequency map with {len(document_frequency_map_for_fasttext)} unique tokens across {total_num_documents_for_fasttext} documents.")
243
 
244
  # TF-IDF Vectorization and Cosine Similarity Calculation
245
+ if corpus_for_sklearn_tfidf:
246
  try:
247
  # Using a dummy tokenizer and preprocessor as input is already tokenized (as space-separated strings)
248
  # and we don't want further case changes or token modifications for Tibetan.
 
264
  token_pattern=None,
265
  stop_words=stopwords_to_use
266
  )
267
+ tfidf_matrix = vectorizer.fit_transform(corpus_for_sklearn_tfidf)
268
  # Calculate pairwise cosine similarity on the TF-IDF matrix
269
  # This gives a square matrix where cosine_sim_matrix[i, j] is the similarity between doc i and doc j
270
  cosine_sim_matrix = cosine_similarity(tfidf_matrix)
271
  except ValueError as e:
272
  if "empty vocabulary" in str(e):
273
  # If vocabulary is empty after stopword removal, create a zero matrix
274
+ n = len(corpus_for_sklearn_tfidf)
275
  cosine_sim_matrix = np.zeros((n, n))
276
  else:
277
  # Re-raise other ValueError
 
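For context, the vectorizer configured above receives documents that are already space-joined botok tokens, so tokenization and preprocessing must be pass-through. A minimal sketch of that setup (the identity callables are illustrative; the actual code may differ in detail):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Documents arrive as pre-tokenized Tibetan text joined by single spaces.
docs = ["ཆོས་ ཀྱི་ དབྱིངས་", "ཆོས་ ཉིད་ ཀྱི་ དབྱིངས་"]

vectorizer = TfidfVectorizer(
    tokenizer=str.split,       # tokens are already separated by spaces
    preprocessor=lambda s: s,  # no lowercasing or normalization for Tibetan
    token_pattern=None,        # silence the unused token_pattern warning
    stop_words=None,           # or a list of Tibetan stopwords
)
tfidf_matrix = vectorizer.fit_transform(docs)
print(cosine_similarity(tfidf_matrix))  # square matrix of pairwise similarities
```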
317
  if enable_semantic:
318
  # Pass raw texts and their pre-computed botok tokens
319
  semantic_sim = compute_semantic_similarity(
320
+ texts[f1], texts[f2], words1_raw, words2_raw, model, model_type, use_stopwords, use_lite_stopwords, # device removed
321
+ fasttext_tokenize_fn=fasttext_tokenize_fn,
322
+ term_freq_corpus=term_freq_corpus_for_fasttext if model_type == "fasttext" else None,
323
+ doc_freq_map=document_frequency_map_for_fasttext if model_type == "fasttext" else None,
324
+ total_docs_in_corpus=total_num_documents_for_fasttext if model_type == "fasttext" else 0
325
  )
326
  else:
327
  semantic_sim = np.nan
pipeline/process.py CHANGED
@@ -1,10 +1,48 @@
1
  import pandas as pd
2
  from typing import Dict, List, Tuple
3
  from .metrics import compute_all_metrics
4
- from .semantic_embedding import get_model_and_device, train_fasttext_model, FASTTEXT_MODEL_ID
 
5
  from .tokenize import tokenize_texts
6
  import logging
7
  from itertools import combinations
 
 
 
8
 
9
  logger = logging.getLogger(__name__)
10
 
@@ -47,10 +85,10 @@ def process_texts(
47
  RuntimeError: If the botok tokenizer fails to initialize.
48
  ValueError: If the input files cannot be processed or if metrics computation fails.
49
  """
50
- # Initialize model and device variables
51
- st_model, st_device = None, None
52
  model_warning = ""
53
-
54
  # Update progress if callback provided
55
  if progress_callback is not None:
56
  try:
@@ -58,77 +96,78 @@ def process_texts(
58
  except Exception as e:
59
  logger.warning(f"Progress callback error (non-critical): {e}")
60
  # Continue processing even if progress reporting fails
61
-
62
  # Load semantic model if enabled
63
  if enable_semantic:
64
  logger.info("Semantic similarity enabled. Loading embedding model...")
65
  try:
66
  logger.info("Using model: %s", model_name)
67
-
68
- # Check if this is a FastText model request
69
- if model_name == FASTTEXT_MODEL_ID:
70
- # Try to load the official Facebook FastText Tibetan model first
71
  if progress_callback is not None:
72
  try:
73
- progress_callback(0.25, desc="Loading official Facebook FastText Tibetan model...")
74
  except Exception as e:
75
- logger.warning("Progress callback error (non-critical): %s", str(e))
76
-
77
- st_model, st_device, model_type = get_model_and_device(model_id=model_name)
78
 
79
- # If model is None, we need to train a fallback model
80
- if st_model is None:
81
- if progress_callback is not None:
82
- try:
83
- progress_callback(0.25, desc="Official model unavailable. Training fallback FastText model...")
84
- except Exception as e:
85
- logger.warning("Progress callback error (non-critical): %s", str(e))
86
-
87
- # Collect all text data for training
88
- all_texts = list(text_data.values())
89
-
90
- # Train the model with standard parameters for stability
91
- st_model = train_fasttext_model(all_texts, dim=100, epoch=5)
92
-
93
  if progress_callback is not None:
94
  try:
95
- progress_callback(0.3, desc="Fallback FastText model trained successfully")
96
  except Exception as e:
97
- logger.warning("Progress callback error (non-critical): %s", str(e))
98
  else:
 
 
 
99
  if progress_callback is not None:
100
  try:
101
- progress_callback(0.3, desc="Official Facebook FastText Tibetan model loaded successfully")
102
  except Exception as e:
103
  logger.warning(f"Progress callback error (non-critical): {e}")
104
- else:
105
- # For sentence transformers
106
- st_model, st_device, model_type = get_model_and_device(model_id=model_name)
107
- logger.info(f"Model {model_name} loaded successfully on {st_device}.")
108
-
 
 
 
 
109
  if progress_callback is not None:
110
  try:
111
- progress_callback(0.3, desc="Model loaded successfully")
112
  except Exception as e:
113
  logger.warning(f"Progress callback error (non-critical): {e}")
114
-
115
- except Exception as e:
116
- error_msg = str(e)
117
- logger.error(f"Failed to load sentence transformer model: {error_msg}. Semantic similarity will not be available.")
118
-
119
- # Create a user-friendly warning message
120
- if "is not a valid model identifier" in error_msg:
121
- model_warning = f"The model '{model_name}' could not be found on Hugging Face. Semantic similarity will not be available."
122
- elif "CUDA out of memory" in error_msg:
123
- model_warning = "Not enough GPU memory to load the semantic model. Try using a smaller model or disable semantic similarity."
124
- else:
125
- model_warning = f"Failed to load semantic model: {error_msg}. Semantic similarity will not be available."
126
-
127
  if progress_callback is not None:
128
  try:
129
- progress_callback(0.3, desc="Continuing without semantic model")
130
- except Exception as e:
131
- logger.warning(f"Progress callback error (non-critical): {e}")
132
  else:
133
  logger.info("Semantic similarity disabled. Skipping model loading.")
134
  if progress_callback is not None:
@@ -177,11 +216,13 @@ def process_texts(
177
 
178
  for idx, seg in enumerate(segments):
179
  seg_id = f"{fname}|chapter {idx+1}"
180
- segment_texts[seg_id] = seg
 
181
  else:
182
  # No chapter markers found, treat entire file as one segment
183
  seg_id = f"{fname}|chapter 1"
184
- segment_texts[seg_id] = content.strip()
 
185
  fallback = True
186
 
187
  # Generate warning if no chapter markers found
@@ -261,14 +302,26 @@ def process_texts(
261
 
262
  try:
263
  # Compute metrics for this chapter pair
 
 
 
264
  pair_metrics = compute_all_metrics(
265
  {seg1: segment_texts[seg1], seg2: segment_texts[seg2]},
266
- model=st_model,
267
- device=st_device,
268
  enable_semantic=enable_semantic,
269
- model_type=model_type if 'model_type' in locals() else "sentence_transformer",
270
  use_stopwords=use_stopwords,
271
- use_lite_stopwords=use_lite_stopwords
 
272
  )
273
 
274
  # Rename 'Text Pair' to show file stems and chapter number
 
1
  import pandas as pd
2
  from typing import Dict, List, Tuple
3
  from .metrics import compute_all_metrics
4
+ from .semantic_embedding import get_model_and_device
5
+ from .fasttext_embedding import load_fasttext_model # Added for custom fasttext
6
  from .tokenize import tokenize_texts
7
  import logging
8
  from itertools import combinations
9
+ import re
10
+
11
+ # FASTTEXT_MODEL_ID identifies the custom-trained Tibetan FastText model.
12
+ # It is defined here because semantic_embedding.py no longer exports it.
13
+ FASTTEXT_MODEL_ID = "fasttext-tibetan" # Ensure this matches the ID used elsewhere
14
+
15
+
16
+ def get_botok_tokens_for_single_text(text: str, mode: str = "syllable") -> list[str]:
17
+ """
18
+ A wrapper around tokenize_texts to make it suitable for tokenize_fn
19
+ in generate_embeddings, which expects a function that tokenizes a single string.
20
+ Accepts a 'mode' argument ('syllable' or 'word') to pass to tokenize_texts.
21
+ """
22
+ if not text.strip():
23
+ return []
24
+ # Pass the mode to tokenize_texts
25
+ tokenized_list_of_lists = tokenize_texts([text], mode=mode)
26
+ if tokenized_list_of_lists and tokenized_list_of_lists[0]:
27
+ return tokenized_list_of_lists[0]
28
+ return []
29
+
30
+ def clean_tibetan_text_for_fasttext(text: str) -> str:
31
+ """
32
+ Applies cleaning steps to Tibetan text similar to those in FastText training:
33
+ - Removes lnX/pX page/line markers.
34
+ - Normalizes double tsheg to single tsheg.
35
+ - Normalizes whitespace.
36
+ """
37
+ # Remove lnX/pX markers
38
+ cleaned_text = re.sub(r"\s*(?:[lL][nN]|[pP])\d{1,3}[abAB]?\s*", " ", text)
39
+ # Normalize double tsheg
40
+ cleaned_text = re.sub(r"།\s*།", "།", cleaned_text)
41
+ # Normalize spaces (multiple spaces to single, strip leading/trailing)
42
+ cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()
43
+ return cleaned_text
44
+
45
+
46
 
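As a quick illustration of the cleaning above (the input is an invented example containing a `p12a` page marker and a double tsheg):

```python
sample = "བཀྲ་ཤིས། ། p12a བདེ་ལེགས།"
print(clean_tibetan_text_for_fasttext(sample))
# -> "བཀྲ་ཤིས། བདེ་ལེགས།"
```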
47
  logger = logging.getLogger(__name__)
48
 
 
85
  RuntimeError: If the botok tokenizer fails to initialize.
86
  ValueError: If the input files cannot be processed or if metrics computation fails.
87
  """
88
+ # Initialize model and model_type variables
89
+ model, model_type = None, None # st_device removed
90
  model_warning = ""
91
+
92
  # Update progress if callback provided
93
  if progress_callback is not None:
94
  try:
 
96
  except Exception as e:
97
  logger.warning(f"Progress callback error (non-critical): {e}")
98
  # Continue processing even if progress reporting fails
99
+
100
  # Load semantic model if enabled
101
  if enable_semantic:
102
  logger.info("Semantic similarity enabled. Loading embedding model...")
103
  try:
104
  logger.info("Using model: %s", model_name)
105
+
106
+ if model_name == FASTTEXT_MODEL_ID: # FASTTEXT_MODEL_ID is 'fasttext-tibetan'
107
+ logger.info(f"Attempting to load custom FastText model: {model_name}")
 
108
  if progress_callback is not None:
109
  try:
110
+ progress_callback(0.25, desc=f"Loading custom FastText model: {model_name}...")
111
  except Exception as e:
112
+ logger.warning(f"Progress callback error (non-critical): {e}")
 
 
113
 
114
+ loaded_custom_model = load_fasttext_model(model_id=model_name) # model_id is expected to be path or key by this func
115
+ if loaded_custom_model:
116
+ model = loaded_custom_model
117
+ model_type = "fasttext"
118
+ logger.info(f"Custom FastText model '{model_name}' loaded successfully.")
 
 
119
  if progress_callback is not None:
120
  try:
121
+ progress_callback(0.3, desc=f"Custom FastText model '{model_name}' loaded.")
122
  except Exception as e:
123
+ logger.warning(f"Progress callback error (non-critical): {e}")
124
  else:
125
+ model_warning = f"Custom FastText model ('{model_name}') failed to load. Semantic similarity will be disabled."
126
+ logger.warning(model_warning)
127
+ enable_semantic = False
128
+
129
+ elif model_name == "facebook-fasttext-pretrained":
130
+ logger.info(f"Attempting to load Facebook FastText model: {model_name}")
131
+ if progress_callback is not None:
132
+ try:
133
+ progress_callback(0.25, desc=f"Loading Facebook FastText model: {model_name}...")
134
+ except Exception as e:
135
+ logger.warning(f"Progress callback error (non-critical): {e}")
136
+
137
+ fb_model, fb_model_type = get_model_and_device(model_id=model_name) # from semantic_embedding
138
+ if fb_model:
139
+ model = fb_model
140
+ model_type = fb_model_type # Should be "fasttext"
141
+ logger.info(f"Facebook FastText model '{model_name}' (type: {model_type}) loaded successfully.")
142
  if progress_callback is not None:
143
  try:
144
+ progress_callback(0.3, desc=f"Facebook FastText model '{model_name}' loaded.")
145
  except Exception as e:
146
  logger.warning(f"Progress callback error (non-critical): {e}")
147
+ else:
148
+ model_warning = f"Facebook FastText model ('{model_name}') failed to load. Semantic similarity will be disabled."
149
+ logger.warning(model_warning)
150
+ enable_semantic = False
151
+
152
+ else: # Any other model_name is unsupported
153
+ model_warning = f"Unsupported model_name: '{model_name}'. Semantic similarity will be disabled. Supported models are '{FASTTEXT_MODEL_ID}' and 'facebook-fasttext-pretrained'."
154
+ logger.warning(model_warning)
155
+ enable_semantic = False
156
  if progress_callback is not None:
157
  try:
158
+ progress_callback(0.3, desc="Unsupported model, continuing without semantic similarity.")
159
  except Exception as e:
160
  logger.warning(f"Progress callback error (non-critical): {e}")
161
+
162
+ except Exception as e: # General catch-all for unexpected errors during model loading attempts
163
+ model_warning = f"An unexpected error occurred while attempting to load model '{model_name}': {e}. Semantic similarity will be disabled."
164
+ logger.error(model_warning, exc_info=True)
165
+ enable_semantic = False
 
 
166
  if progress_callback is not None:
167
  try:
168
+ progress_callback(0.3, desc="Error loading model, continuing without semantic similarity.")
169
+ except Exception as e_cb:
170
+ logger.warning(f"Progress callback error (non-critical): {e_cb}")
171
  else:
172
  logger.info("Semantic similarity disabled. Skipping model loading.")
173
  if progress_callback is not None:
 
216
 
217
  for idx, seg in enumerate(segments):
218
  seg_id = f"{fname}|chapter {idx+1}"
219
+ cleaned_seg = clean_tibetan_text_for_fasttext(seg)
220
+ segment_texts[seg_id] = cleaned_seg
221
  else:
222
  # No chapter markers found, treat entire file as one segment
223
  seg_id = f"{fname}|chapter 1"
224
+ cleaned_content = clean_tibetan_text_for_fasttext(content.strip())
225
+ segment_texts[seg_id] = cleaned_content
226
  fallback = True
227
 
228
  # Generate warning if no chapter markers found
 
302
 
303
  try:
304
  # Compute metrics for this chapter pair
305
+ tokenizer_for_fasttext = None
306
+ current_model_type = model_type  # set during model loading above; None when semantic similarity is disabled
307
+ if current_model_type == "fasttext":
308
+ # Tokenizer setup for FastText model:
309
+ def fasttext_tokenizer_adapter(text_segment: str) -> List[str]:
310
+ cleaned_segment = clean_tibetan_text_for_fasttext(text_segment)
311
+ # Use word-level tokenization for the custom FastText model
312
+ return get_botok_tokens_for_single_text(cleaned_segment, mode="word")
313
+
314
+ tokenizer_for_fasttext = fasttext_tokenizer_adapter
315
+ logger.info("Using botok word-level tokenization for FastText model.")
316
+
317
  pair_metrics = compute_all_metrics(
318
  {seg1: segment_texts[seg1], seg2: segment_texts[seg2]},
319
+ model=model,
 
320
  enable_semantic=enable_semantic,
321
+ model_type=model_type,
322
  use_stopwords=use_stopwords,
323
+ use_lite_stopwords=use_lite_stopwords,
324
+ fasttext_tokenize_fn=tokenizer_for_fasttext
325
  )
326
 
327
  # Rename 'Text Pair' to show file stems and chapter number
pipeline/semantic_embedding.py CHANGED
@@ -1,7 +1,6 @@
1
  import logging
2
- import torch
3
- from typing import List, Any
4
- from sentence_transformers import SentenceTransformer
5
 
6
  # Configure logging
7
  logging.basicConfig(
@@ -9,190 +8,114 @@ logging.basicConfig(
9
  )
10
  logger = logging.getLogger(__name__)
11
 
12
- # Define the model ID for the fine-tuned Tibetan MiniLM
13
- DEFAULT_MODEL_NAME = "buddhist-nlp/buddhist-sentence-similarity"
14
 
15
- # FastText model identifier - this is just an internal identifier, not a HuggingFace model ID
16
- FASTTEXT_MODEL_ID = "fasttext-tibetan"
17
 
18
 
19
- def get_model_and_device(
20
- model_id: str = DEFAULT_MODEL_NAME, device_preference: str = "auto"
21
- ):
22
  """
23
- Loads the Sentence Transformer model and determines the device.
24
- Priority: CUDA -> MPS (Apple Silicon) -> CPU.
25
 
26
  Args:
27
- model_id (str): The Hugging Face model ID.
28
- device_preference (str): Preferred device ("cuda", "mps", "cpu", "auto").
29
 
30
  Returns:
31
- tuple: (model, device_str)
32
- - model: The loaded SentenceTransformer model.
33
- - device_str: The device the model is loaded on ("cuda", "mps", or "cpu").
34
  """
35
- selected_device_str = ""
36
 
37
- if device_preference == "auto":
38
- if torch.cuda.is_available():
39
- selected_device_str = "cuda"
40
- elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
41
- selected_device_str = "mps"
42
- else:
43
- selected_device_str = "cpu"
44
- elif device_preference == "cuda" and torch.cuda.is_available():
45
- selected_device_str = "cuda"
46
- elif (
47
- device_preference == "mps"
48
- and hasattr(torch.backends, "mps")
49
- and torch.backends.mps.is_available()
50
- ):
51
- selected_device_str = "mps"
52
- else: # Handles explicit "cpu" preference or fallback if preferred is unavailable
53
- selected_device_str = "cpu"
54
-
55
- logger.info("Attempting to use device: %s", selected_device_str)
56
-
57
- try:
58
- # Check if this is a FastText model request
59
- if model_id == FASTTEXT_MODEL_ID:
60
- try:
61
- # Import here to avoid dependency issues if FastText is not installed
62
- import fasttext
63
- from .fasttext_embedding import load_fasttext_model
64
-
65
- # Try to load the FastText model
66
- model = load_fasttext_model()
67
-
68
- if model is None:
69
- error_msg = "Failed to load FastText model. Semantic similarity will not be available."
70
- logger.error(error_msg)
71
- raise Exception(error_msg)
72
-
73
- logger.info("FastText model loaded successfully.")
74
- # FastText always runs on CPU
75
- return model, "cpu", "fasttext"
76
- except ImportError:
77
- logger.error("FastText module not found. Please install it with 'pip install fasttext'.")
78
- raise
79
- else:
80
- logger.info(
81
- "Loading Sentence Transformer model: %s on device: %s",
82
- model_id, selected_device_str
83
- )
84
- # SentenceTransformer expects a string like 'cuda', 'mps', or 'cpu'
85
- model = SentenceTransformer(model_id, device=selected_device_str)
86
- logger.info("Model %s loaded successfully on %s.", model_id, selected_device_str)
87
- return model, selected_device_str, "sentence_transformer"
88
- except Exception as e:
89
- logger.error(
90
- "Error loading model %s on device %s: %s",
91
- model_id, selected_device_str, str(e)
92
- )
93
- # Fallback to CPU if the initially selected device (CUDA or MPS) failed
94
- if selected_device_str != "cpu":
95
- logger.warning(
96
- "Failed to load model on %s, attempting to load on CPU...",
97
- selected_device_str
98
- )
99
- fallback_device_str = "cpu"
100
- try:
101
- # Check if this is a FastText model request during fallback
102
- if model_id == FASTTEXT_MODEL_ID:
103
- # Import here to avoid dependency issues if FastText is not installed
104
- from .fasttext_embedding import load_fasttext_model
105
-
106
- # Try to load the FastText model
107
- model = load_fasttext_model()
108
-
109
- if model is None:
110
- logger.error("Failed to load FastText model during fallback. Semantic similarity will not be available.")
111
- raise Exception("Failed to load FastText model. Please check if the model file exists.")
112
-
113
- logger.info("FastText model loaded successfully during fallback.")
114
- # FastText always runs on CPU
115
- return model, "cpu", "fasttext"
116
- else:
117
- # Try to load as a sentence transformer
118
- model = SentenceTransformer(model_id, device=fallback_device_str)
119
- logger.info(
120
- "Model %s loaded successfully on CPU after fallback.",
121
- model_id
122
- )
123
- return model, fallback_device_str, "sentence_transformer"
124
- except Exception as fallback_e:
125
- logger.error(
126
- "Error loading model %s on CPU during fallback: %s",
127
- model_id, str(fallback_e)
128
- )
129
- raise fallback_e # Re-raise exception if CPU fallback also fails
130
- raise e # Re-raise original exception if selected_device_str was already CPU or no fallback attempted
131
-
132
-
133
- def generate_embeddings(texts: List[str], model: Any, device: str, model_type: str = "sentence_transformer", tokenize_fn=None, use_stopwords: bool = True, use_lite_stopwords: bool = False):
134
  """
135
- Generates embeddings for a list of texts using the provided model.
136
 
137
  Args:
138
  texts (list[str]): A list of texts to embed.
139
- model: The loaded model (SentenceTransformer or FastText).
140
- device (str): The device to use ("cuda", "mps", or "cpu").
141
- model_type (str): Type of model ("sentence_transformer" or "fasttext")
142
- tokenize_fn: Optional tokenization function or pre-tokenized list for FastText
143
- use_stopwords (bool): Whether to filter out stopwords for FastText embeddings
 
 
144
 
145
  Returns:
146
- torch.Tensor: A tensor containing the embeddings, moved to CPU.
147
  """
148
  if not texts:
149
  logger.warning(
150
- "No texts provided to generate_embeddings. Returning empty tensor."
151
  )
152
- return torch.empty(0)
153
 
154
- logger.info(f"Generating embeddings for {len(texts)} texts...")
155
 
156
- if model_type == "fasttext":
157
- try:
158
- # Import here to avoid dependency issues if FastText is not installed
159
- from .fasttext_embedding import get_batch_embeddings
160
- from .stopwords_bo import TIBETAN_STOPWORDS_SET
161
-
162
- # For FastText, get appropriate stopwords set if filtering is enabled
163
- stopwords_set = None
164
- if use_stopwords:
165
- # Choose between regular and lite stopwords sets
166
- if use_lite_stopwords:
167
- from .stopwords_lite_bo import TIBETAN_STOPWORDS_LITE_SET
168
- stopwords_set = TIBETAN_STOPWORDS_LITE_SET
169
- else:
170
- from .stopwords_bo import TIBETAN_STOPWORDS_SET
171
- stopwords_set = TIBETAN_STOPWORDS_SET
172
-
173
- # Pass pre-tokenized tokens if available, otherwise pass None
174
- # tokenize_fn should be a list of lists (tokens for each text) or None
175
- embeddings = get_batch_embeddings(
176
- texts,
177
- model,
178
- tokenize_fn=tokenize_fn,
179
- use_stopwords=use_stopwords,
180
- stopwords_set=stopwords_set
181
- )
182
- logger.info("FastText embeddings generated with shape: %s", str(embeddings.shape))
183
- # Convert numpy array to torch tensor for consistency
184
- return torch.tensor(embeddings)
185
- except ImportError:
186
- logger.error("FastText module not found. Please install it with 'pip install fasttext'.")
187
- raise
188
- else: # sentence_transformer
189
- # The encode method of SentenceTransformer handles tokenization and pooling internally.
190
- # It also manages moving data to the model's device.
191
- embeddings = model.encode(texts, convert_to_tensor=True, show_progress_bar=True)
192
- logger.info("Sentence Transformer embeddings generated with shape: %s", str(embeddings.shape))
193
- return (
194
- embeddings.cpu()
195
- ) # Ensure embeddings are on CPU for consistent further processing
196
 
197
 
198
  def train_fasttext_model(corpus_texts: List[str], **kwargs):
@@ -204,59 +127,15 @@ def train_fasttext_model(corpus_texts: List[str], **kwargs):
204
  **kwargs: Additional parameters for training (dim, epoch, etc.)
205
 
206
  Returns:
207
- Trained model and path to the model file
208
- """
209
  try:
210
  from .fasttext_embedding import prepare_corpus_file, train_fasttext_model as train_ft
211
 
212
- # Prepare corpus file
213
  corpus_path = prepare_corpus_file(corpus_texts)
214
-
215
- # Train the model
216
  model = train_ft(corpus_path=corpus_path, **kwargs)
217
 
218
- return model
219
  except ImportError:
220
  logger.error("FastText module not found. Please install it with 'pip install fasttext'.")
221
- raise
222
-
223
-
224
- if __name__ == "__main__":
225
- # Example usage:
226
- logger.info("Starting example usage of semantic_embedding module...")
227
-
228
- test_texts = [
229
- "བཀྲ་ཤིས་བདེ་ལེགས།",
230
- "hello world", # Test with non-Tibetan to see behavior
231
- "དེ་རིང་གནམ་གཤིས་ཡག་པོ་འདུག",
232
- ]
233
-
234
- logger.info("Attempting to load model using default cache directory.")
235
- try:
236
- # Forcing CPU for this example to avoid potential CUDA issues in diverse environments
237
- # or if CUDA is not intended for this specific test.
238
- model, device, model_type = get_model_and_device(
239
- device_preference="cpu" # Explicitly use CPU for this test run
240
- )
241
-
242
- if model:
243
- logger.info("Test model loaded on device: %s, type: %s", device, model_type)
244
- example_embeddings = generate_embeddings(test_texts, model, device, model_type)
245
- logger.info(
246
- "Generated example embeddings shape: %s",
247
- str(example_embeddings.shape)
248
- )
249
- if example_embeddings.nelement() > 0: # Check if tensor is not empty
250
- logger.info(
251
- "First embedding (first 10 dims): %s...",
252
- str(example_embeddings[0][:10])
253
- )
254
- else:
255
- logger.info("Generated example embeddings tensor is empty.")
256
- else:
257
- logger.error("Failed to load model for example usage.")
258
-
259
- except Exception as e:
260
- logger.error("An error occurred during the example usage: %s", str(e))
261
-
262
- logger.info("Finished example usage.")
 
1
  import logging
2
+ from typing import List, Any, Optional
3
+ import numpy as np # Added for type hinting Optional[np.ndarray]
 
4
 
5
  # Configure logging
6
  logging.basicConfig(
 
8
  )
9
  logger = logging.getLogger(__name__)
10
 
11
+ # Define the model ID for the Facebook FastText pretrained model
12
+ DEFAULT_MODEL_NAME = "facebook-fasttext-pretrained"
13
 
14
+ # FASTTEXT_MODEL_ID = "fasttext-tibetan" # Removed: Custom model loading to be handled in process.py directly
 
15
 
16
 
17
+ def get_model_and_device(model_id: str = DEFAULT_MODEL_NAME):
 
 
18
  """
19
+ Loads the Facebook official pre-trained FastText model for Tibetan.
 
20
 
21
  Args:
22
+ model_id (str): The model ID. Must be 'facebook-fasttext-pretrained' (DEFAULT_MODEL_NAME).
 
23
 
24
  Returns:
25
+ Tuple[Optional[Any], Optional[str]]:
26
+ A tuple containing the loaded FastText model and its type ("fasttext"),
27
+ or (None, None) if loading fails or model_id is unsupported.
28
  """
29
+ logger.info("Attempting to load FastText model via semantic_embedding.get_model_and_device: %s", model_id)
30
 
31
+ if model_id == DEFAULT_MODEL_NAME: # DEFAULT_MODEL_NAME is "facebook-fasttext-pretrained"
32
+ try:
33
+ # Importing here to minimize issues if fasttext_embedding also imports from semantic_embedding
34
+ from .fasttext_embedding import load_facebook_official_tibetan_model
35
+
36
+ model = load_facebook_official_tibetan_model()
37
+
38
+ if model:
39
+ logger.info(f"FastText model object received in get_model_and_device. Type: {type(model)}.")
40
+ try:
41
+ logger.info(f"Model dimensions: {model.get_dimension()}")
42
+ # Basic check for model validity via an expected attribute/method
43
+ if hasattr(model, 'get_word_vector'):
44
+ logger.info("Model has 'get_word_vector' method (Python API expected for fasttext.load_model results).")
45
+ except Exception as diag_e:
46
+ logger.error(f"Error during diagnostic check of FastText model '{model_id}': {diag_e}", exc_info=True)
47
+ return model, "fasttext"
48
+ else:
49
+ # This case implies load_facebook_official_tibetan_model returned None without raising an error.
50
+ logger.error(f"Model loading for '{model_id}' via load_facebook_official_tibetan_model() returned None unexpectedly.")
51
+ return None, None
52
+ except Exception as e:
53
+ logger.error(f"Failed to load or initialize FastText model '{model_id}': {e}. Semantic similarity will not be available.", exc_info=True)
54
+ return None, None
55
+ else:
56
+ logger.error(f"Unsupported model_id for get_model_and_device in semantic_embedding.py: '{model_id}'. Only '{DEFAULT_MODEL_NAME}' is supported by this function.")
57
+ return None, None
58
+
59
+
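+ # A short sketch of how process.py consumes this loader (the Tibetan word is only an
+ # example; get_word_vector is the standard fasttext Python API):
+ #
+ #     model, model_type = get_model_and_device(model_id="facebook-fasttext-pretrained")
+ #     if model is not None and model_type == "fasttext":
+ #         vec = model.get_word_vector("བཀྲ་ཤིས་")
+ #         print(vec.shape)  # e.g. (300,) for the official Facebook vectors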
60
+ def generate_embeddings(texts: List[str], model: Any, tokenize_fn=None, use_stopwords: bool = True, use_lite_stopwords: bool = False, corpus_token_freq=None, doc_freq_map=None, total_docs_in_corpus=0) -> Optional[np.ndarray]:
 
 
 
61
  """
62
+ Generates FastText embeddings for a list of texts.
63
 
64
  Args:
65
  texts (list[str]): A list of texts to embed.
66
+ model: The loaded FastText model.
67
+ tokenize_fn: Optional tokenization function for FastText (if different from default botok based).
68
+ use_stopwords (bool): Whether to filter out stopwords for FastText embeddings.
69
+ use_lite_stopwords (bool): Whether to use the 'lite' stopwords list.
70
+ corpus_token_freq: Corpus-wide term frequencies for TF-IDF weighted FastText.
71
+ doc_freq_map: Document frequency map for TF-IDF weighted FastText.
72
+ total_docs_in_corpus: Total documents in corpus for TF-IDF weighted FastText.
73
 
74
  Returns:
75
+ Optional[np.ndarray]: A numpy array containing the embeddings. Returns None if generation fails.
76
  """
77
  if not texts:
78
  logger.warning(
79
+ "No texts provided to generate_embeddings. Returning None."
80
  )
81
+ return None
82
 
83
+ logger.info(f"Generating FastText embeddings for {len(texts)} texts...")
84
 
85
+ try:
86
+ from .fasttext_embedding import get_batch_embeddings
87
+
88
+ stopwords_set = None
89
+ if use_stopwords:
90
+ if use_lite_stopwords:
91
+ from .stopwords_lite_bo import TIBETAN_STOPWORDS_LITE_SET
92
+ stopwords_set = TIBETAN_STOPWORDS_LITE_SET
93
+ else:
94
+ from .stopwords_bo import TIBETAN_STOPWORDS_SET
95
+ stopwords_set = TIBETAN_STOPWORDS_SET
96
+
97
+ embeddings = get_batch_embeddings(
98
+ texts,
99
+ model,
100
+ tokenize_fn=tokenize_fn,
101
+ use_stopwords=use_stopwords,
102
+ stopwords_set=stopwords_set,
103
+ corpus_token_freq=corpus_token_freq,
104
+ doc_freq_map=doc_freq_map,
105
+ total_docs_in_corpus=total_docs_in_corpus
106
+ )
107
+ if embeddings is None:
108
+ logger.error(f"get_batch_embeddings returned None for {len(texts)} texts. First few: {texts[:2]}")
109
+ return None
110
+
111
+ logger.info("FastText embeddings generated with shape: %s", str(embeddings.shape))
112
+ return embeddings
113
+ except ImportError:
114
+ logger.error("Required FastText modules not found. Please ensure 'fasttext' and its dependencies are correctly installed.")
115
+ return None
116
+ except Exception as e:
117
+ logger.error(f"An unexpected error occurred during FastText embedding generation: {e}", exc_info=True)
118
+ return None
 
 
119
 
120
 
121
  def train_fasttext_model(corpus_texts: List[str], **kwargs):
 
127
  **kwargs: Additional parameters for training (dim, epoch, etc.)
128
 
129
  Returns:
130
+ The trained FastText model object (a separate model file path is not returned).
131
+ """
132
  try:
133
  from .fasttext_embedding import prepare_corpus_file, train_fasttext_model as train_ft
134
 
 
135
  corpus_path = prepare_corpus_file(corpus_texts)
 
 
136
  model = train_ft(corpus_path=corpus_path, **kwargs)
137
 
138
+ return model  # the trained FastText model object
139
  except ImportError:
140
  logger.error("FastText module not found. Please install it with 'pip install fasttext'.")
141
+ raise # Re-raising to signal critical failure if training components are missing
 
 
 
pipeline/tibetan_stopwords.py ADDED
@@ -0,0 +1,41 @@
 
 
 
1
+ import logging
2
+
3
+ logger = logging.getLogger(__name__)
4
+
5
+ def get_stopwords(use_lite: bool = False) -> set:
6
+ """
7
+ Returns a set of Tibetan stopwords by importing them from the respective .py files.
8
+
9
+ Args:
10
+ use_lite (bool): If True, returns a smaller, less aggressive list of stopwords
11
+ from stopwords_lite_bo.py.
12
+ Otherwise, returns the full list from stopwords_bo.py.
13
+
14
+ Returns:
15
+ set: A set of stopword strings. Returns an empty set on failure.
16
+ """
17
+ source_module_name = ".stopwords_lite_bo" if use_lite else ".stopwords_bo"
+ stopwords_set = set()
18
+ try:
19
+ if use_lite:
20
+ from .stopwords_lite_bo import STOPWORDS
21
+ stopwords_set = STOPWORDS
22
+ else:
23
+ from .stopwords_bo import STOPWORDS
24
+ stopwords_set = STOPWORDS
25
+
26
+ logger.info(f"Successfully loaded {len(stopwords_set)} stopwords from {source_module_name.lstrip('.')}.py")
27
+ except ImportError:
28
+ logger.error(
29
+ f"Failed to import STOPWORDS from {source_module_name.lstrip('.')}.py. "
30
+ f"Ensure the file exists in the 'pipeline' directory, is a Python module (ends in .py), "
31
+ f"and is importable (e.g., no syntax errors)."
32
+ )
33
+ except AttributeError:
34
+ logger.error(
35
+ f"Variable 'STOPWORDS' (all caps) not found in {source_module_name.lstrip('.')}.py. "
36
+ f"Please ensure the stopword set is defined with this name within the module."
37
+ )
38
+ except Exception as e:
39
+ logger.error(f"An unexpected error occurred while loading stopwords from {source_module_name.lstrip('.')}.py: {e}")
40
+
41
+ return stopwords_set
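A hypothetical usage sketch for this helper (the token list is illustrative; which particles are actually filtered depends on the stopword files):

```python
from pipeline.tibetan_stopwords import get_stopwords

stops = get_stopwords(use_lite=True)
tokens = ["སངས་རྒྱས་", "ཀྱི་", "ཆོས་"]
content_tokens = [t for t in tokens if t not in stops]
```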
pipeline/tokenize.py CHANGED
@@ -39,7 +39,7 @@ def _get_text_hash(text: str) -> str:
39
  return hashlib.md5(text.encode('utf-8')).hexdigest()
40
 
41
 
42
- def tokenize_texts(texts: List[str]) -> List[List[str]]:
43
  """
44
  Tokenizes a list of raw Tibetan texts using botok, with caching for performance.
45
 
@@ -64,6 +64,10 @@ def tokenize_texts(texts: List[str]) -> List[List[str]]:
64
  )
65
 
66
  tokenized_texts_list = []
 
 
 
 
67
 
68
  # Process each text
69
  for text_content in texts:
@@ -73,19 +77,84 @@ def tokenize_texts(texts: List[str]) -> List[List[str]]:
73
  continue
74
 
75
  # Generate hash for cache lookup
76
- text_hash = _get_text_hash(text_content)
 
77
 
78
  # Check if we have this text in cache
79
  if text_hash in _tokenization_cache:
80
  # Cache hit - use cached tokens
81
  tokens = _tokenization_cache[text_hash]
82
- logger.debug(f"Cache hit for text hash {text_hash[:8]}...")
83
  else:
84
  # Cache miss - tokenize and store in cache
85
  try:
86
- tokens = [
87
- w.text for w in BOTOK_TOKENIZER.tokenize(text_content) if w.text.strip()
88
- ]
 
 
 
89
 
90
  # Store in cache if not empty
91
  if tokens:
@@ -95,9 +164,9 @@ def tokenize_texts(texts: List[str]) -> List[List[str]]:
95
  _tokenization_cache.pop(next(iter(_tokenization_cache)))
96
 
97
  _tokenization_cache[text_hash] = tokens
98
- logger.debug(f"Added tokens to cache with hash {text_hash[:8]}...")
99
  except Exception as e:
100
- logger.error(f"Error tokenizing text: {e}")
101
  tokens = []
102
 
103
  tokenized_texts_list.append(tokens)
 
39
  return hashlib.md5(text.encode('utf-8')).hexdigest()
40
 
41
 
42
+ def tokenize_texts(texts: List[str], mode: str = "syllable") -> List[List[str]]:
43
  """
44
  Tokenizes a list of raw Tibetan texts using botok, with caching for performance.
45
 
 
64
  )
65
 
66
  tokenized_texts_list = []
67
+
68
+ if mode not in ["word", "syllable"]:
69
+ logger.warning(f"Invalid tokenization mode: '{mode}'. Defaulting to 'syllable'.")
70
+ mode = "syllable"
71
 
72
  # Process each text
73
  for text_content in texts:
 
77
  continue
78
 
79
  # Generate hash for cache lookup
80
+ cache_key_string = text_content + f"_{mode}" # Include mode in string for hashing
81
+ text_hash = _get_text_hash(cache_key_string)
82
 
83
  # Check if we have this text in cache
84
  if text_hash in _tokenization_cache:
85
  # Cache hit - use cached tokens
86
  tokens = _tokenization_cache[text_hash]
87
+ logger.debug(f"Cache hit for text hash {text_hash[:8]}... (mode: {mode})")
88
  else:
89
  # Cache miss - tokenize and store in cache
90
  try:
91
+ current_tokens = []
92
+ if BOTOK_TOKENIZER:
93
+ raw_botok_items = list(BOTOK_TOKENIZER.tokenize(text_content))
94
+
95
+ if mode == "word":
96
+ for item_idx, w in enumerate(raw_botok_items):
97
+ if hasattr(w, 'text') and isinstance(w.text, str):
98
+ token_text = w.text.strip()
99
+ if token_text: # Ensure token is not empty or just whitespace
100
+ current_tokens.append(token_text)
101
+ # Optionally log if w.text is not a string or missing, for debugging
102
+ # elif w.text is not None:
103
+ # logger.debug(f"Token item {item_idx} has non-string text {type(w.text)} for hash {text_hash[:8]}. Skipping word.")
104
+ # else:
105
+ # logger.debug(f"Token item {item_idx} missing text attribute for hash {text_hash[:8]}. Skipping word.")
106
+ logger.debug(
107
+ f"WORD TOKENS FORMED for hash {text_hash[:8]} (mode: {mode}, first 30): "
108
+ f"{current_tokens[:30]}"
109
+ )
110
+ elif mode == "syllable":
111
+ # This is the original syllable extraction logic
112
+ for item_idx, w in enumerate(raw_botok_items):
113
+ if hasattr(w, 'syls') and w.syls:
114
+ for syl_idx, syl_item in enumerate(w.syls):
115
+ syllable_to_process = None
116
+ if isinstance(syl_item, str):
117
+ syllable_to_process = syl_item
118
+ elif isinstance(syl_item, list):
119
+ try:
120
+ syllable_to_process = "".join(syl_item)
121
+ except TypeError:
122
+ logger.warning(
123
+ f"Syllable item in w.syls was a list, but could not be joined (non-string elements?): {syl_item} "
124
+ f"from word item {item_idx} (text: {getattr(w, 'text', 'N/A')}), syl_idx {syl_idx} "
125
+ f"for hash {text_hash[:8]}. Skipping this syllable."
126
+ )
127
+ continue
128
+
129
+ if syllable_to_process is not None:
130
+ stripped_syl = syllable_to_process.strip()
131
+ if stripped_syl:
132
+ current_tokens.append(stripped_syl)
133
+ elif syl_item is not None:
134
+ logger.warning(
135
+ f"Unexpected type for syllable item (neither str nor list): {type(syl_item)} ('{str(syl_item)[:100]}') "
136
+ f"from word item {item_idx} (text: {getattr(w, 'text', 'N/A')}), syl_idx {syl_idx} "
137
+ f"for hash {text_hash[:8]}. Skipping this syllable."
138
+ )
139
+ elif hasattr(w, 'text') and w.text: # Fallback if no 'syls' but in syllable mode
140
+ if isinstance(w.text, str):
141
+ token_text = w.text.strip()
142
+ if token_text:
143
+ current_tokens.append(token_text) # Treat as a single syllable/token
144
+ elif w.text is not None:
145
+ logger.warning(
146
+ f"Unexpected type for w.text (in syllable mode fallback): {type(w.text)} ('{str(w.text)[:100]}') "
147
+ f"for item {item_idx} (POS: {getattr(w, 'pos', 'N/A')}) "
148
+ f"for hash {text_hash[:8]}. Skipping this token."
149
+ )
150
+ logger.debug(
151
+ f"SYLLABLE TOKENS FORMED for hash {text_hash[:8]} (mode: {mode}, first 30): "
152
+ f"{current_tokens[:30]}"
153
+ )
154
+ tokens = current_tokens
155
+ else:
156
+ logger.error(f"BOTOK_TOKENIZER is None for text hash {text_hash[:8]}, cannot tokenize (mode: {mode}).")
157
+ tokens = []
158
 
159
  # Store in cache if not empty
160
  if tokens:
 
164
  _tokenization_cache.pop(next(iter(_tokenization_cache)))
165
 
166
  _tokenization_cache[text_hash] = tokens
167
+ logger.debug(f"Added tokens to cache with hash {text_hash[:8]}... (mode: {mode})")
168
  except Exception as e:
169
+ logger.error(f"Error tokenizing text (mode: {mode}): {e}")
170
  tokens = []
171
 
172
  tokenized_texts_list.append(tokens)
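A hypothetical call showing the two modes added above (actual botok segmentation depends on its dialect pack, so the example outputs are only indicative):

```python
from pipeline.tokenize import tokenize_texts

text = "བཀྲ་ཤིས་བདེ་ལེགས།"
words = tokenize_texts([text], mode="word")          # e.g. [["བཀྲ་ཤིས་", "བདེ་ལེགས", "།"]]
syllables = tokenize_texts([text], mode="syllable")  # e.g. [["བཀྲ", "ཤིས", "བདེ", "ལེགས"]]
```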
pipeline/visualize.py CHANGED
@@ -149,10 +149,15 @@ def generate_word_count_chart(word_counts_df: pd.DataFrame):
149
  font=dict(size=14),
150
  legend_title_text="Filename",
151
  xaxis=dict(
152
- type="category"
153
- ), # Treat chapter numbers as categories for distinct grouping
154
- autosize=True,
155
- margin=dict(l=80, r=50, b=100, t=50, pad=4),
 
 
 
 
 
156
  )
157
  # Ensure x-axis ticks are shown for all chapter numbers present
158
  all_chapter_numbers = sorted(word_counts_df["ChapterNumber"].unique())
 
149
  font=dict(size=14),
150
  legend_title_text="Filename",
151
  xaxis=dict(
152
+ type="category", # Treat chapter numbers as categories
153
+ automargin=True # Automatically adjust margin for x-axis labels/title
154
+ ),
155
+ yaxis=dict(
156
+ rangemode='tozero', # Ensure y-axis starts at 0 and includes max value
157
+ automargin=True # Automatically adjust margin for y-axis labels/title
158
+ ),
159
+ autosize=True, # Keep for responsiveness in Gradio
160
+ margin=dict(l=80, r=50, b=100, t=50, pad=4) # Keep existing base margins
161
  )
162
  # Ensure x-axis ticks are shown for all chapter numbers present
163
  all_chapter_numbers = sorted(word_counts_df["ChapterNumber"].unique())
results.csv ADDED
@@ -0,0 +1,97 @@
 
 
 
1
+ Text Pair,Jaccard Similarity (%),Normalized LCS,Semantic Similarity,TF-IDF Cosine Sim,Chapter
2
+ Ngari 9.txt vs Nepal12.txt,100.0,1.0,,1.0,1
3
+ Ngari 9.txt vs Nepal12.txt,100.0,1.0,,1.0,2
4
+ Ngari 9.txt vs Nepal12.txt,100.0,1.0,,1.0,3
5
+ Ngari 9.txt vs Nepal12.txt,46.42857142857143,0.6112115732368897,,0.8407944127395544,4
6
+ Ngari 9.txt vs Nepal12.txt,40.42553191489361,0.5191256830601093,,0.5026984774848224,5
7
+ Ngari 9.txt vs Nepal12.txt,47.28260869565217,0.6107784431137725,,0.8380742568060093,6
8
+ Ngari 9.txt vs Nepal12.txt,49.29178470254957,0.5285565939771547,,0.8409605475909782,7
9
+ Ngari 9.txt vs Nepal12.txt,46.07218683651805,0.6053169734151329,,0.9306016557862976,8
10
+ Ngari 9.txt vs Nepal12.txt,51.7557251908397,0.7000429737859906,,0.9600630844581352,9
11
+ Ngari 9.txt vs Nepal12.txt,52.760736196319016,0.710204081632653,,0.9135878707769712,10
12
+ Ngari 9.txt vs Nepal12.txt,14.92842535787321,0.08302507192766133,,0.698638890914812,11
13
+ Ngari 9.txt vs Nepal12.txt,0.0,0.0,,0.0,12
14
+ Ngari 9.txt vs Nepal12.txt,0.0,0.0,,0.0,13
15
+ Ngari 9.txt vs Nepal12.txt,0.0,0.0,,0.0,14
16
+ Ngari 9.txt vs Nepal12.txt,0.0,0.0,,0.0,15
17
+ Ngari 9.txt vs Nepal12.txt,100.0,1.0,,1.0,16
18
+ Ngari 9.txt vs LTWA.txt,0.0,0.0,,0.0,1
19
+ Ngari 9.txt vs LTWA.txt,0.0,0.0,,0.0,2
20
+ Ngari 9.txt vs LTWA.txt,0.0,0.0,,0.0,3
21
+ Ngari 9.txt vs LTWA.txt,47.752808988764045,0.603648424543947,,0.8414077093281586,4
22
+ Ngari 9.txt vs LTWA.txt,48.40764331210191,0.6094808126410836,,0.6526135410649626,5
23
+ Ngari 9.txt vs LTWA.txt,49.13294797687861,0.6297872340425532,,0.8252183235183391,6
24
+ Ngari 9.txt vs LTWA.txt,35.53459119496855,0.4071058475203553,,0.8403529862077375,7
25
+ Ngari 9.txt vs LTWA.txt,45.0,0.601965601965602,,0.9452297806160965,8
26
+ Ngari 9.txt vs LTWA.txt,37.89126853377265,0.29986320109439124,,0.8760838478443608,9
27
+ Ngari 9.txt vs LTWA.txt,51.632047477744806,0.6395222584147665,,0.9317016829510952,10
28
+ Ngari 9.txt vs LTWA.txt,14.979757085020243,0.10742761225346202,,0.7111189597708231,11
29
+ Ngari 9.txt vs LTWA.txt,0.0,0.0,,0.0,12
30
+ Ngari 9.txt vs LTWA.txt,0.0,0.0,,0.0,13
31
+ Ngari 9.txt vs LTWA.txt,0.0,0.0,,0.0,14
32
+ Ngari 9.txt vs LTWA.txt,0.0,0.0,,0.0,15
33
+ Ngari 9.txt vs LTWA.txt,0.0,0.0,,0.0,16
34
+ Ngari 9.txt vs Leiden.txt,0.0,0.0,,0.0,1
35
+ Ngari 9.txt vs Leiden.txt,0.0,0.0,,0.0,2
36
+ Ngari 9.txt vs Leiden.txt,0.0,0.0,,0.0,3
37
+ Ngari 9.txt vs Leiden.txt,41.340782122905026,0.5282331511839709,,0.8525095366316284,4
38
+ Ngari 9.txt vs Leiden.txt,36.80555555555556,0.4734042553191489,,0.5634721694429372,5
39
+ Ngari 9.txt vs Leiden.txt,44.047619047619044,0.4728132387706856,,0.7698959290709281,6
40
+ Ngari 9.txt vs Leiden.txt,35.67251461988304,0.3208020050125313,,0.784262930792386,7
41
+ Ngari 9.txt vs Leiden.txt,41.01123595505618,0.4241099312929419,,0.9275267086147868,8
42
+ Ngari 9.txt vs Leiden.txt,40.31209362808843,0.20184790334044064,,0.9076572014074583,9
43
+ Ngari 9.txt vs Leiden.txt,50.445103857566764,0.6045733407696597,,0.9284684903895061,10
44
+ Ngari 9.txt vs Leiden.txt,16.363636363636363,0.08736942070275404,,0.6999802304139516,11
45
+ Ngari 9.txt vs Leiden.txt,0.0,0.0,,0.0,12
46
+ Ngari 9.txt vs Leiden.txt,0.0,0.0,,0.0,13
47
+ Ngari 9.txt vs Leiden.txt,0.0,0.0,,0.0,14
48
+ Ngari 9.txt vs Leiden.txt,0.0,0.0,,0.0,15
49
+ Ngari 9.txt vs Leiden.txt,0.0,0.0,,0.0,16
50
+ Nepal12.txt vs LTWA.txt,0.0,0.0,,0.0,1
51
+ Nepal12.txt vs LTWA.txt,0.0,0.0,,0.0,2
52
+ Nepal12.txt vs LTWA.txt,0.0,0.0,,0.0,3
53
+ Nepal12.txt vs LTWA.txt,56.493506493506494,0.6959706959706959,,0.8321482637014176,4
54
+ Nepal12.txt vs LTWA.txt,39.71631205673759,0.5386666666666666,,0.7104447145077406,5
55
+ Nepal12.txt vs LTWA.txt,48.795180722891565,0.5898004434589801,,0.8168699067131293,6
56
+ Nepal12.txt vs LTWA.txt,34.954407294832826,0.3365548607163161,,0.861391898750807,7
57
+ Nepal12.txt vs LTWA.txt,51.41509433962265,0.5873239436619718,,0.9310750730815768,8
58
+ Nepal12.txt vs LTWA.txt,41.18705035971223,0.3156208277703605,,0.9075961630628558,9
59
+ Nepal12.txt vs LTWA.txt,60.066006600660074,0.7040533037201555,,0.921390350997517,10
60
+ Nepal12.txt vs LTWA.txt,63.6986301369863,0.7454220634211701,,0.9803189694519824,11
61
+ Nepal12.txt vs LTWA.txt,48.275862068965516,0.5102639296187683,,0.7725258306356406,12
62
+ Nepal12.txt vs LTWA.txt,58.203125,0.7364921030756443,,0.9543942889292814,13
63
+ Nepal12.txt vs LTWA.txt,41.732283464566926,0.4332449160035367,,0.8497746214132795,14
64
+ Nepal12.txt vs LTWA.txt,17.983651226158038,0.1474820143884892,,0.5779105517118261,15
65
+ Nepal12.txt vs LTWA.txt,0.0,0.0,,0.0,16
66
+ Nepal12.txt vs Leiden.txt,0.0,0.0,,0.0,1
67
+ Nepal12.txt vs Leiden.txt,0.0,0.0,,0.0,2
68
+ Nepal12.txt vs Leiden.txt,0.0,0.0,,0.0,3
69
+ Nepal12.txt vs Leiden.txt,57.14285714285714,0.6788617886178862,,0.8403894964769358,4
70
+ Nepal12.txt vs Leiden.txt,38.793103448275865,0.4935064935064935,,0.4684416871587978,5
71
+ Nepal12.txt vs Leiden.txt,60.416666666666664,0.6386138613861386,,0.8441982917223785,6
72
+ Nepal12.txt vs Leiden.txt,43.24324324324324,0.33632734530938124,,0.8839876637274263,7
73
+ Nepal12.txt vs Leiden.txt,50.8235294117647,0.4953338119167265,,0.9373191281412603,8
74
+ Nepal12.txt vs Leiden.txt,44.03927068723703,0.2242042672263029,,0.9196700291527228,9
75
+ Nepal12.txt vs Leiden.txt,67.59581881533101,0.7226027397260274,,0.9462708958951278,10
76
+ Nepal12.txt vs Leiden.txt,60.42780748663101,0.7003094501309212,,0.9722895878422901,11
77
+ Nepal12.txt vs Leiden.txt,23.502304147465438,0.27245508982035926,,0.6893488630692246,12
78
+ Nepal12.txt vs Leiden.txt,67.08333333333333,0.7506382978723404,,0.9466019120384076,13
79
+ Nepal12.txt vs Leiden.txt,42.67782426778243,0.418426103646833,,0.8023010077421123,14
80
+ Nepal12.txt vs Leiden.txt,31.17206982543641,0.2664756446991404,,0.757778410785804,15
81
+ Nepal12.txt vs Leiden.txt,0.0,0.0,,0.0,16
82
+ LTWA.txt vs Leiden.txt,53.5064935064935,0.6359163591635917,,0.9623337315161734,1
83
+ LTWA.txt vs Leiden.txt,60.909090909090914,0.7578659370725034,,0.8852155398192683,2
84
+ LTWA.txt vs Leiden.txt,64.1025641025641,0.7001044932079414,,0.8986878289296542,3
85
+ LTWA.txt vs Leiden.txt,51.21951219512195,0.6568265682656826,,0.8596233249504219,4
86
+ LTWA.txt vs Leiden.txt,40.0,0.5298701298701298,,0.6287776677298036,5
87
+ LTWA.txt vs Leiden.txt,46.308724832214764,0.5415549597855228,,0.7796920776498958,6
88
+ LTWA.txt vs Leiden.txt,43.233082706766915,0.49545136459062283,,0.8819142330949857,7
89
+ LTWA.txt vs Leiden.txt,54.03050108932462,0.5373230373230373,,0.9477445373252964,8
90
+ LTWA.txt vs Leiden.txt,38.75598086124402,0.1898707353252808,,0.887072472781142,9
91
+ LTWA.txt vs Leiden.txt,66.32996632996633,0.7823310271420969,,0.9693004524579277,10
92
+ LTWA.txt vs Leiden.txt,63.537906137184116,0.754516983859311,,0.9830176756030125,11
93
+ LTWA.txt vs Leiden.txt,24.299065420560748,0.18152350081037277,,0.6278532648577805,12
94
+ LTWA.txt vs Leiden.txt,60.1593625498008,0.7367521367521368,,0.9381662329597793,13
95
+ LTWA.txt vs Leiden.txt,59.44444444444444,0.6746987951807228,,0.8771500136505623,14
96
+ LTWA.txt vs Leiden.txt,35.37735849056604,0.39255014326647564,,0.6834100468628878,15
97
+ LTWA.txt vs Leiden.txt,60.45081967213115,0.6875444839857652,,0.9482911929631709,16
user_guide.md ADDED
@@ -0,0 +1,190 @@
1
+ # Tibetan Text Metrics Web Application User Guide
2
+
3
+ ## Introduction
4
+
5
+ Welcome to the Tibetan Text Metrics Web Application! This user-friendly tool allows you to analyze textual similarities and variations in Tibetan manuscripts using multiple computational approaches. The application provides a graphical interface to the core functionalities of the Tibetan Text Metrics (TTM) project.
6
+
7
+ ## Getting Started
8
+
9
+ ### System Requirements
10
+
11
+ - Modern web browser (Chrome, Firefox, Safari, or Edge)
12
+ - For local installation: Python 3.10 or newer
13
+ - Sufficient RAM for processing large texts (4GB minimum, 8GB recommended)
14
+
15
+ ### Installation and Setup
16
+
17
+ #### Online Demo
18
+
19
+ The easiest way to try the application is through our Hugging Face Spaces demo:
20
+ [daniel-wojahn/ttm-webapp-hf](https://huggingface.co/spaces/daniel-wojahn/ttm-webapp-hf)
21
+
22
+ Note: The free tier of Hugging Face Spaces may have performance limitations compared to running locally.
23
+
24
+ #### Local Installation
25
+
26
+ 1. Clone the repository:
27
+ ```bash
28
+ git clone https://github.com/daniel-wojahn/tibetan-text-metrics.git
29
+ cd tibetan-text-metrics/webapp
30
+ ```
31
+
32
+ 2. Create and activate a virtual environment:
33
+ ```bash
34
+ python -m venv venv
35
+ source venv/bin/activate # On Windows: venv\Scripts\activate
36
+ ```
37
+
38
+ 3. Install dependencies:
39
+ ```bash
40
+ pip install -r requirements.txt
41
+ ```
42
+
43
+ 4. Run the application:
44
+ ```bash
45
+ python app.py
46
+ ```
47
+
48
+ 5. Open your browser and navigate to:
49
+ ```
50
+ http://localhost:7860
51
+ ```
52
+
53
+ ## Using the Application
54
+
55
+ ### Step 1: Upload Your Tibetan Text Files
56
+
57
+ 1. Click the "Upload Tibetan .txt files" button to select one or more `.txt` files containing Tibetan text.
58
+ 2. Files should be in UTF-8 or UTF-16 encoding.
59
+ 3. Maximum file size: 10MB per file (for optimal performance, use files under 1MB).
60
+ 4. For best results, your texts should be segmented into chapters/sections using the Tibetan marker '༈' (*sbrul shad*); a minimal sketch for previewing this segmentation follows this list.
61
+
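+ If you want to preview how a file will be segmented before uploading, a minimal sketch along the following lines may help. It is illustrative only (not the application's internal code), and the filename is just an example taken from this repository:
+
+ ```python
+ # Minimal sketch: preview how a file would be split into chapters/sections
+ # on the Tibetan marker '༈' (sbrul shad). Illustrative only.
+ from pathlib import Path
+
+ text = Path("Ngari 9.txt").read_text(encoding="utf-8")  # example filename; adjust encoding if needed
+ segments = [seg.strip() for seg in text.split("༈") if seg.strip()]
+ print(f"{len(segments)} segments found")
+ ```
+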
62
+ ### Step 2: Configure Analysis Options
63
+
64
+ 1. **Semantic Similarity**: Choose whether to compute semantic similarity metrics.
65
+ - "Yes" (default): Includes semantic similarity in the analysis (slower but more comprehensive).
66
+ - "No": Skips semantic similarity calculation for faster processing.
67
+
68
+ 2. **Embedding Model**: Select the model to use for semantic similarity analysis.
69
+ - **sentence-transformers/all-MiniLM-L6-v2** (default): General purpose sentence embedding model (fastest option).
70
+ - **sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2**: Multilingual model with good performance for many languages.
71
+ - **buddhist-nlp/buddhist-sentence-similarity**: Optimized for Buddhist text similarity.
72
+ - **xlm-roberta-base**: Multilingual model that includes Tibetan.
73
+
74
+ 3. Click the "Run Analysis" button to start processing.
75
+
76
+ ### Step 3: View and Interpret Results
77
+
78
+ After processing, the application displays several visualizations and metrics:
79
+
80
+ #### Word Count Chart
81
+
82
+ Shows the number of words in each chapter/segment of each file, allowing you to compare the relative lengths of different texts.
83
+
84
+ #### Similarity Metrics
85
+
86
+ The application computes four different similarity metrics between corresponding chapters of different files:
87
+
88
+ 1. **Jaccard Similarity (%)**: Measures vocabulary overlap between segments after filtering out common Tibetan stopwords. A higher percentage indicates a greater overlap in the significant vocabularies used in the two segments.
89
+
90
+ 2. **Normalized LCS (Longest Common Subsequence)**: Measures the length of the longest sequence of words that appears in both text segments, maintaining their original relative order. A higher score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism.
91
+
92
+ 3. **Semantic Similarity**: Uses a transformer-based model to compute the cosine similarity between the semantic embeddings of text segments. This captures similarities in meaning even when different vocabulary is used.
93
+
94
+ 4. **TF-IDF Cosine Similarity**: Compares texts based on their important, characteristic terms by giving higher weight to words that are frequent within a particular segment but relatively rare across the entire collection.
95
+
96
+ #### Heatmap Visualizations
97
+
98
+ Each metric has a corresponding heatmap visualization where:
99
+ - Rows represent chapters/segments
100
+ - Columns represent text pairs being compared
101
+ - Color intensity indicates similarity (brighter = more similar); a minimal sketch for rebuilding one such heatmap from a results table is shown after this list
102
+
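+ Outside the application, one of these heatmaps can be rebuilt from a results table such as the results.csv included in this repository. The sketch below is illustrative only (it assumes pandas and plotly are installed; the app produces its own plots separately):
+
+ ```python
+ # Minimal sketch: rebuild the Jaccard heatmap from a results table.
+ # Column names follow the results.csv in this repository.
+ import pandas as pd
+ import plotly.express as px
+
+ df = pd.read_csv("results.csv")
+ pivot = df.pivot(index="Chapter", columns="Text Pair", values="Jaccard Similarity (%)")
+ fig = px.imshow(pivot, color_continuous_scale="Viridis", aspect="auto",
+                 labels={"color": "Jaccard (%)"})
+ fig.show()
+ ```
+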
103
+ ### Tips for Effective Analysis
104
+
105
+ 1. **Text Segmentation**: For meaningful chapter-level comparisons, ensure your texts are segmented using the Tibetan marker '༈' (*sbrul shad*).
106
+
107
+ 2. **File Naming**: Use descriptive filenames to make the comparison results easier to interpret.
108
+
109
+ 3. **Model Selection**:
110
+ - For faster processing, use the default model or disable semantic similarity.
111
+ - For Buddhist texts, the buddhist-nlp/buddhist-sentence-similarity model may provide better results.
112
+
113
+ 4. **File Size**:
114
+ - Keep individual files under 1MB for optimal performance.
115
+ - Very large files (>10MB) are not supported and will trigger an error.
116
+
117
+ 5. **Comparing Multiple Texts**: The application requires at least two text files to compute similarity metrics.
118
+
119
+ ## Understanding the Metrics
120
+
121
+ ### Jaccard Similarity (%)
122
+
123
+ This metric quantifies the lexical overlap between two text segments by comparing their sets of unique words, after filtering out common Tibetan stopwords. It essentially answers the question: 'Of all the distinct, meaningful words found across these two segments, what proportion of them are present in both?'
124
+
125
+ It is calculated as:
126
+ ```
127
+ (Number of common unique meaningful words) / (Total number of unique meaningful words in both texts combined) * 100
128
+ ```
129
+
130
+ Jaccard Similarity is insensitive to word order and word frequency; it only cares whether a unique meaningful word is present or absent. A higher percentage indicates a greater overlap in the significant vocabularies used in the two segments.
131
+
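+ As an illustration, a minimal Python sketch of this calculation might look as follows. The tiny stopword set is a hypothetical stand-in for the application's Tibetan stopword lists, and the inputs are assumed to be lists of word tokens:
+
+ ```python
+ # Minimal sketch of Jaccard similarity over unique, stopword-filtered tokens.
+ # STOPWORDS is a hypothetical placeholder for the app's Tibetan stopword lists.
+ STOPWORDS = {"གི", "ནི", "དང"}
+
+ def jaccard_percent(tokens_a, tokens_b):
+     set_a = {t for t in tokens_a if t not in STOPWORDS}
+     set_b = {t for t in tokens_b if t not in STOPWORDS}
+     if not set_a or not set_b:
+         return 0.0
+     return len(set_a & set_b) / len(set_a | set_b) * 100
+ ```
+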
132
+ ### Normalized LCS (Longest Common Subsequence)
133
+
134
+ This metric measures the length of the longest sequence of words that appears in both text segments, maintaining their original relative order. Importantly, these words do not need to be directly adjacent (contiguous) in either text.
135
+
136
+ For example, if Text A is 'the quick brown fox jumps' and Text B is 'the lazy cat and brown dog jumps high', the LCS is 'the brown jumps'.
137
+
138
+ The length of this common subsequence is then normalized (in this tool, by dividing by the length of the longer of the two segments) to provide a score. A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism, as it reflects similarities in how ideas are ordered and expressed sequentially.
139
+
140
+ Unlike other metrics, LCS does not filter out stopwords, allowing it to capture structural similarities and the flow of language, including the use of particles and common words that contribute to sentence construction.
141
+
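+ A minimal dynamic-programming sketch of this metric (illustrative only; it normalizes, as described above, by the length of the longer segment) could look like:
+
+ ```python
+ # Minimal sketch: longest common subsequence length of two token lists,
+ # normalized by the length of the longer list. Illustrative only.
+ def normalized_lcs(tokens_a, tokens_b):
+     m, n = len(tokens_a), len(tokens_b)
+     if m == 0 or n == 0:
+         return 0.0
+     dp = [[0] * (n + 1) for _ in range(m + 1)]  # dp[i][j]: LCS of the prefixes
+     for i in range(1, m + 1):
+         for j in range(1, n + 1):
+             if tokens_a[i - 1] == tokens_b[j - 1]:
+                 dp[i][j] = dp[i - 1][j - 1] + 1
+             else:
+                 dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
+     return dp[m][n] / max(m, n)
+
+ # The example above: the LCS 'the brown jumps' has length 3, the longer text has 8 words.
+ a = "the quick brown fox jumps".split()
+ b = "the lazy cat and brown dog jumps high".split()
+ print(normalized_lcs(a, b))  # 3 / 8 = 0.375
+ ```
+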
142
+ ### Semantic Similarity
143
+
144
+ This metric utilizes transformer-based models to compute the cosine similarity between the semantic embeddings of text segments. The model converts each text segment into a high-dimensional vector that captures its semantic meaning.
145
+
146
+ For texts exceeding the model's token limit, an automated chunking strategy is employed: texts are divided into overlapping chunks, each chunk is embedded, and the resulting chunk embeddings are averaged to produce a single representative vector for the entire segment before comparison.
147
+
148
+ A higher score indicates that the texts express similar concepts or ideas, even if they use different vocabulary or phrasing.
149
+
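+ A simplified sketch of this chunk-embed-average approach is shown below. It uses the sentence-transformers library with one of the models listed in Step 2; the word-count-based chunk size and overlap are illustrative stand-ins for the application's token-aware chunking:
+
+ ```python
+ # Simplified sketch of the chunk-and-average strategy for long segments.
+ # Chunk size and overlap here are illustrative; the app chunks relative to
+ # the model's actual token limit.
+ import numpy as np
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
+
+ def embed_segment(text, chunk_size=200, overlap=50):
+     words = text.split()
+     if len(words) <= chunk_size:
+         return model.encode(text)
+     chunks, start = [], 0
+     while start < len(words):
+         chunks.append(" ".join(words[start:start + chunk_size]))
+         start += chunk_size - overlap
+     chunk_vectors = model.encode(chunks)   # one vector per chunk
+     return np.mean(chunk_vectors, axis=0)  # average into one segment vector
+
+ def cosine(u, v):
+     return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
+ ```
+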
150
+ ### TF-IDF Cosine Similarity
151
+
152
+ This metric first calculates Term Frequency-Inverse Document Frequency (TF-IDF) scores for each word in each text segment, after filtering out common Tibetan stopwords. TF-IDF gives higher weight to words that are frequent within a particular segment but relatively rare across the entire collection of segments.
153
+
154
+ Each segment is then represented as a vector of these TF-IDF scores, and the cosine similarity is computed between these vectors. A score closer to 1 indicates that the two segments share more of these important, distinguishing terms, suggesting they cover similar specific topics or themes.
155
+
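+ A minimal sketch of this metric using scikit-learn follows. It assumes each segment has already been tokenized into space-separated words with stopwords removed; the two segment strings are placeholders:
+
+ ```python
+ # Minimal sketch of TF-IDF cosine similarity between pre-tokenized segments.
+ # The segments below are placeholders; in practice they would be
+ # whitespace-joined, stopword-filtered Tibetan word tokens.
+ from sklearn.feature_extraction.text import TfidfVectorizer
+ from sklearn.metrics.pairwise import cosine_similarity
+
+ segment_a = "tokenized words of chapter four from text A"  # placeholder
+ segment_b = "tokenized words of chapter four from text B"  # placeholder
+
+ vectorizer = TfidfVectorizer(token_pattern=r"[^ ]+")  # split on spaces only
+ tfidf = vectorizer.fit_transform([segment_a, segment_b])
+ score = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
+ print(f"TF-IDF cosine similarity: {score:.3f}")
+ ```
+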
156
+ ## Troubleshooting
157
+
158
+ ### Common Issues and Solutions
159
+
160
+ 1. **"Empty vocabulary" error**:
161
+ - This can occur if a text contains only stopwords or if tokenization fails.
162
+ - Solution: Check your input text to ensure it contains valid Tibetan content.
163
+
164
+ 2. **Model loading errors**:
165
+ - If a model fails to load, the application will continue without semantic similarity.
166
+ - Solution: Try a different model or disable semantic similarity.
167
+
168
+ 3. **Performance issues with large files**:
169
+ - Solution: Split large files into smaller ones or use fewer files at once.
170
+
171
+ 4. **No results displayed**:
172
+ - Solution: Ensure you have uploaded at least two valid text files and that they contain comparable content.
173
+
174
+ 5. **Encoding issues**:
175
+ - If your text appears garbled, it may have encoding problems.
176
+ - Solution: Ensure your files are saved in UTF-8 or UTF-16 encoding (a minimal re-encoding sketch follows this list).
177
+
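+ If you need to re-encode a file, a minimal sketch is shown below; the source encoding used here is an assumption and should be adjusted to match your file:
+
+ ```python
+ # Minimal sketch: re-save a text file as UTF-8.
+ # "utf-16" is an assumed source encoding; change it to match your file.
+ from pathlib import Path
+
+ raw = Path("my_text.txt").read_text(encoding="utf-16")
+ Path("my_text_utf8.txt").write_text(raw, encoding="utf-8")
+ ```
+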
178
+ ### Getting Help
179
+
180
+ If you encounter issues not covered in this guide, please:
181
+ 1. Check the [GitHub repository](https://github.com/daniel-wojahn/tibetan-text-metrics) for updates or known issues.
182
+ 2. Submit an issue on GitHub with details about your problem.
183
+
184
+ ## Acknowledgments
185
+
186
+ The Tibetan Text Metrics project was developed as part of the [Law in Historic Tibet](https://www.law.ox.ac.uk/law-historic-tibet) project at the Centre for Socio-Legal Studies at the University of Oxford.
187
+
188
+ ## License
189
+
190
+ This project is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).