Refactoring of the tokenization pipeline, adjusted FastText implementation
Files changed:
- README.md (+20 -42)
- app.py (+20 -25)
- pipeline/fasttext_embedding.py (+211 -140)
- pipeline/metrics.py (+145 -245)
- pipeline/process.py (+111 -58)
- pipeline/semantic_embedding.py (+92 -213)
- pipeline/tibetan_stopwords.py (+41 -0)
- pipeline/tokenize.py (+77 -8)
- pipeline/visualize.py (+9 -4)
- results.csv (+97 -0)
- user_guide.md (+190 -0)
README.md
CHANGED

@@ -8,7 +8,6 @@ sdk_version: 5.29.0
 python_version: 3.11
 app_file: app.py
 models:
-  - buddhist-nlp/buddhist-sentence-similarity
   - fasttext-tibetan
 ---

@@ -32,14 +31,12 @@ The Tibetan Text Metrics project aims to provide quantitative methods for assessing
 - **Jaccard Similarity (%)**: Measures vocabulary overlap between segments. *Common Tibetan stopwords can be filtered out to focus on meaningful lexical similarity.*
 - **Normalized Longest Common Subsequence (LCS)**: Identifies the longest shared sequence of words, indicating direct textual parallels.
 - **Semantic Similarity**: Uses embedding models to compare the contextual meaning of segments. Users can select between:
-  - A
-  - The official Facebook FastText Tibetan model (facebook/fasttext-bo-vectors) pre-trained on a large corpus of Tibetan text
+  - A FastText model using official Facebook Tibetan vectors with custom `botok` tokenization (recommended approach).
   *Note: This metric works best when combined with other metrics for a more comprehensive analysis.*
 - **TF-IDF Cosine Similarity**: Highlights texts that share important or characteristic terms by comparing their TF-IDF profiles. *Common Tibetan stopwords can be excluded to ensure TF-IDF weights highlight genuinely characteristic terms.*
 - **Handles Long Texts**: Implements automated chunking for semantic similarity to process texts exceeding the model's token limit.
-- **Model Selection**:
-  - **
-  - **FastText**: Uses the official Facebook FastText Tibetan model (facebook/fasttext-bo-vectors) with optimizations specifically for Tibetan language, including botok tokenization and TF-IDF weighted averaging
+- **Model Selection**: Semantic similarity analysis uses a FastText model:
+  - **FastText**: Uses the official Facebook FastText Tibetan model (`cc.bo.300.bin`) with optimizations specifically for Tibetan language, including `botok` tokenization and TF-IDF weighted averaging of word vectors to produce segment embeddings.
 - **Stopword Filtering**: Three levels of filtering for Tibetan words:
   - **None**: No filtering, includes all words
   - **Standard**: Filters only common particles and punctuation

@@ -126,24 +123,8 @@ This helps focus on meaningful content words rather than grammatical elements.

 2. **Normalized LCS (Longest Common Subsequence)**: This metric measures the length of the longest sequence of words that appears in *both* text segments, maintaining their original relative order. Importantly, these words do not need to be directly adjacent (contiguous) in either text. For example, if Text A is '<u>the</u> quick <u>brown</u> fox <u>jumps</u>' and Text B is '<u>the</u> lazy cat and <u>brown</u> dog <u>jumps</u> high', the LCS is 'the brown jumps'. The length of this common subsequence is then normalized (in this tool, by dividing by the length of the longer of the two segments) to provide a score, which is then presented as a percentage. A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism, as it reflects similarities in how ideas are ordered and expressed sequentially.
    * *Note on Interpretation*: It's possible for Normalized LCS to be higher than Jaccard Similarity. This often happens when texts share a substantial 'narrative backbone' or common ordered phrases (leading to a high LCS), even if they use varied surrounding vocabulary or introduce many unique words not part of these core sequences (which would lower the Jaccard score). LCS highlights this sequential, structural similarity, while Jaccard focuses on the overall shared vocabulary regardless of its arrangement.
-3. **Semantic Similarity**: Computes the cosine similarity between semantic embeddings of text segments using
-
-   **a. Transformer-based Model** (Experimental): Pre-trained model that understands contextual relationships between words.
-   - `buddhist-nlp/buddhist-sentence-similarity`: Specialized for Buddhist texts
-   - Processes raw Unicode Tibetan text directly (no special tokenization required)
-   - Note: This is an experimental approach and results may vary with different texts
-
-   **b. FastText Model**: Uses the official Facebook FastText Tibetan model (facebook/fasttext-bo-vectors) pre-trained on a large corpus of Tibetan text. Falls back to a custom model only if the official model cannot be loaded.
-   - Processes Tibetan text using botok tokenization (same as other metrics)
-   - Uses the pre-tokenized words from botok rather than doing its own tokenization
-   - Better for texts with specialized Tibetan vocabulary
-   - More stable results for general Tibetan text comparison
-   - Optimized for Tibetan language with:
-     - Syllable-based tokenization preserving Tibetan syllable markers
-     - TF-IDF weighted averaging for word vectors (distinct from the TF-IDF Cosine Similarity metric)
-     - Enhanced parameters based on Tibetan NLP research
-
-   For texts exceeding the model's token limit, an automated chunking strategy is employed: texts are divided into overlapping chunks, each chunk is embedded, and the resulting embeddings are averaged to produce a single vector for the entire segment.
+3. **Semantic Similarity**: Computes the cosine similarity between semantic embeddings of text segments using a FastText model:
+   - **FastText**: Uses the official Facebook FastText Tibetan model (`cc.bo.300.bin`) with optimizations specifically for Tibetan language, including `botok` tokenization and TF-IDF weighted averaging of word vectors to produce segment embeddings.
 4. **TF-IDF Cosine Similarity**: This metric first calculates Term Frequency-Inverse Document Frequency (TF-IDF) scores for each word in each text segment, optionally **filtering out common Tibetan stopwords**.
    TF-IDF gives higher weight to words that are frequent within a particular segment but relatively rare across the entire collection of segments.
    This helps to identify terms that are characteristic or discriminative for a segment. When stopword filtering is enabled, the TF-IDF scores better reflect genuinely significant terms.

@@ -210,28 +191,21 @@ This helps focus on meaningful content words rather than grammatical elements.
 - The system will analyze your metrics and provide insights about patterns, relationships, and notable findings in your data.
 - This feature helps researchers understand the significance of the metrics and identify interesting textual relationships between chapters.

-## Embedding
+## Embedding Model

-The application
+The application uses a FastText-based approach for calculating semantic similarity in Tibetan texts:

-2. **FastText Model**:
-   - Uses the official Facebook FastText Tibetan model (facebook/fasttext-bo-vectors)
-   - Pre-trained on a large corpus of Tibetan text from Wikipedia and other sources
-   - Falls back to training a custom model on your texts if the official model cannot be loaded
-   - Respects your stopword filtering settings when creating embeddings
-   - Uses simple word vector averaging for stable embeddings
+**FastText Model Features**:
+- Utilizes official Facebook FastText word vectors for Tibetan (`cc.bo.300.bin`)
+- Integrates `botok` for accurate Tibetan word tokenization.
+- Employs TF-IDF weighted averaging of word vectors to produce segment embeddings. This method provides more nuanced similarity scores by emphasizing terms that are important within the analyzed texts.
+- The underlying `tibetan-text-metrics` library also supports training custom FastText models on user-uploaded texts for domain-specific accuracy, though this feature is not yet directly exposed for training in the web UI.

 **When to choose FastText**:
-- When you
-- When
-- When you want to
-- When you need
+- When you require high-quality word embeddings trained on a very large and diverse corpus of classical Tibetan texts.
+- When your analysis benefits from a model that can effectively handle out-of-vocabulary Tibetan words and orthographic variations through FastText's character n-gram features.
+- When you want to leverage a model trained with Tibetan-specific text preprocessing techniques.
+- When you need fine-grained control over how stopwords affect semantic analysis, as this model's embedding process respects the selected stopword filtering level.

 ## Structure

@@ -244,6 +218,10 @@ The application offers two specialized approaches for calculating semantic similarity
 - `tokenize.py`: Tibetan text tokenization using `botok`.
 - `upload.py`: File upload handling (currently minimal).
 - `visualize.py`: Generates heatmaps and word count plots.
+- `fasttext-modelling/` — Scripts and documentation for training custom FastText models.
+  - `train_custom_fasttext.py`: Script to train a custom FastText model.
+  - `README.md`: Detailed instructions for the training script.
+  - `requirements.txt`: Python dependencies specifically for the training script.
 - `requirements.txt` — Python dependencies for the web application.

 ## License
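The README hunks above describe the new segment-embedding strategy: tokenize each segment with `botok`, look up a FastText vector for every token, and combine the vectors with TF-IDF weighting before comparing segments by cosine similarity. The sketch below only illustrates that idea; it is not the Space's code, and the helper name and arguments are illustrative. It assumes `model` is any loaded FastText model exposing `get_word_vector()` and `get_dimension()`.

```python
import math
from collections import Counter
from typing import Dict, List

import numpy as np


def tfidf_weighted_embedding(
    tokens: List[str],
    model,                      # loaded FastText model (get_word_vector / get_dimension)
    doc_freq: Dict[str, int],   # for each token: number of segments containing it
    n_docs: int,                # total number of segments in the corpus
) -> np.ndarray:
    """Combine word vectors into one segment vector using TF-IDF weights."""
    if not tokens:
        return np.zeros(model.get_dimension())

    counts = Counter(tokens)
    weights = []
    for tok in tokens:
        tf = counts[tok] / len(tokens)
        # Smoothed IDF, mirroring the log((N + 1) / (df + 1)) + 1 form used in the diff.
        idf = math.log((n_docs + 1) / (doc_freq.get(tok, 0) + 1)) + 1
        weights.append(tf * idf)

    total = sum(weights)
    if total <= 1e-6:
        # Degenerate case: fall back to uniform weights (plain averaging).
        weights = [1.0 / len(tokens)] * len(tokens)
    else:
        weights = [w / total for w in weights]

    segment_vec = np.zeros(model.get_dimension())
    for tok, w in zip(tokens, weights):
        segment_vec += w * model.get_word_vector(tok)
    return segment_vec
```

Cosine similarity between two such segment vectors is what the app reports as Semantic Similarity.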
app.py
CHANGED

@@ -5,6 +5,7 @@ from pipeline.visualize import generate_visualizations, generate_word_count_char
 from pipeline.llm_service import get_interpretation
 import logging
 import pandas as pd
+
 from dotenv import load_dotenv

 # Load environment variables from .env file

@@ -14,7 +15,6 @@ from theme import tibetan_theme

 logger = logging.getLogger(__name__)

-
 # Main interface logic
 def main_interface():
     with gr.Blocks(

@@ -65,18 +65,10 @@ def main_interface():
         )

         model_dropdown = gr.Dropdown(
-            ],
-            value="buddhist-nlp/buddhist-sentence-similarity",
-            info="Select the embedding model for semantic similarity.<br><br>"
-            "<b>Model information:</b><br>"
-            "• <a href='https://huggingface.co/buddhist-nlp/buddhist-sentence-similarity' target='_blank'>buddhist-nlp/buddhist-sentence-similarity</a>: Specialized model fine-tuned for Buddhist text similarity.<br>"
-            "• <b>fasttext-tibetan</b>: Uses the official Facebook FastText Tibetan model pre-trained on a large corpus. If the official model cannot be loaded, it will fall back to training a custom model on your uploaded texts.",
-            visible=True,
-            interactive=True
+            choices=["Facebook FastText (Pre-trained)"],
+            label="Select Embedding Model",
+            value="Facebook FastText (Pre-trained)",
+            info="Using Facebook's pre-trained FastText model for semantic similarity. Other model options have been removed."
         )

         stopwords_dropdown = gr.Dropdown(

@@ -181,26 +173,19 @@ A higher Normalized LCS score suggests more significant shared phrasing, direct
         """,
         "Semantic Similarity": """
 ### Semantic Similarity
-Computes the cosine similarity between semantic embeddings of text segments
+Computes the cosine similarity between semantic embeddings of text segments:

-**
-- `buddhist-nlp/buddhist-sentence-similarity`: Specialized for Buddhist texts
-- Processes raw Unicode Tibetan text directly (no special tokenization required)
-- Note: This is an experimental approach and results may vary with different texts
-
-**2. FastText Model**: Uses the official Facebook FastText Tibetan model (facebook/fasttext-bo-vectors) pre-trained on a large corpus of Tibetan text. Falls back to a custom model only if the official model cannot be loaded.
+**FastText Model**: Uses the official Facebook FastText Tibetan model (facebook/fasttext-bo-vectors) pre-trained on a large corpus of Tibetan text. Falls back to a custom model only if the official model cannot be loaded.
 - Processes Tibetan text using botok tokenization (same as other metrics)
 - Uses the pre-tokenized words from botok rather than doing its own tokenization
 - Better for texts with specialized Tibetan vocabulary
 - More stable results for general Tibetan text comparison
 - Optimized for Tibetan language with:
-  -
+  - Word-based tokenization preserving Tibetan syllable markers
   - TF-IDF weighted averaging for word vectors (distinct from the TF-IDF Cosine Similarity metric)
   - Enhanced parameters based on Tibetan NLP research

-**
-
-**Stopword Filtering**: When enabled (via the "Filter Stopwords" checkbox), common Tibetan particles and function words are filtered out before computing embeddings. This helps focus on meaningful content words. Transformer models process the full text regardless of stopword filtering setting.
+**Stopword Filtering**: When enabled (via the "Filter Stopwords" checkbox), common Tibetan particles and function words are filtered out before computing embeddings. This helps focus on meaningful content words.

 **Note**: This metric works best when combined with other metrics for a more comprehensive analysis.
         """,

@@ -389,10 +374,20 @@ Each segment is represented as a vector of these TF-IDF scores, and the cosine similarity
         use_stopwords = stopwords_option != "None (No filtering)"
         use_lite_stopwords = stopwords_option == "Standard (Common particles only)"

+        # Map UI model name to internal model ID
+        # The UI model_name is "Facebook FastText (Pre-trained)"
+        # This mapping ensures the backend receives the correct identifier.
+        if model_name == "Facebook FastText (Pre-trained)":
+            internal_model_id = "facebook-fasttext-pretrained"
+        else:
+            # Fallback or error if unexpected model_name, though UI should prevent this
+            logger.warning(f"Unexpected model_name from UI: {model_name}. Defaulting to facebook-fasttext-pretrained.")
+            internal_model_id = "facebook-fasttext-pretrained"
+
         df_results, word_counts_df_data, warning_raw = process_texts(
             text_data, filenames,
             enable_semantic=enable_semantic_bool,
-            model_name=
+            model_name=internal_model_id,  # Use the mapped internal ID
             use_stopwords=use_stopwords,
             use_lite_stopwords=use_lite_stopwords,
             progress_callback=progress_tracker
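The app.py hunks above wire the UI choices into the backend call: the stopword dropdown becomes two boolean flags, and the single model label is mapped to an internal identifier before `process_texts` is invoked. The condensed helper below is a sketch of that wiring, written under the assumption that `process_texts` is importable from `pipeline.process` (not confirmed by this diff); it is not the Space's actual code.

```python
# Sketch only: condenses the option mapping shown in the app.py hunks above.
# Assumes process_texts lives in pipeline.process (an assumption, not shown in this diff).
from pipeline.process import process_texts


def run_analysis(text_data, filenames, stopwords_option: str,
                 enable_semantic: bool, progress_tracker=None):
    # Stopword dropdown -> backend flags (same string comparisons as in app.py).
    use_stopwords = stopwords_option != "None (No filtering)"
    use_lite_stopwords = stopwords_option == "Standard (Common particles only)"

    # The UI now offers a single model, mapped to one internal ID.
    internal_model_id = "facebook-fasttext-pretrained"

    return process_texts(
        text_data, filenames,
        enable_semantic=enable_semantic,
        model_name=internal_model_id,
        use_stopwords=use_stopwords,
        use_lite_stopwords=use_lite_stopwords,
        progress_callback=progress_tracker,
    )
```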
pipeline/fasttext_embedding.py
CHANGED

@@ -4,15 +4,18 @@ This module provides functions to train and use FastText models for Tibetan text
 """

 import os
+from pathlib import Path
 import math
 import logging
 import numpy as np
 import fasttext
-from
+from collections import Counter
+from typing import List, Set, Optional
 from huggingface_hub import hf_hub_download

 # Set up logging
 logger = logging.getLogger(__name__)
+logger.setLevel(logging.DEBUG)  # Ensure this logger processes DEBUG messages

 # Default parameters optimized for Tibetan
 DEFAULT_DIM = 100

@@ -25,7 +28,7 @@ DEFAULT_NEG = 5

 # Define paths for model storage
 DEFAULT_MODEL_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "models")
-DEFAULT_MODEL_PATH =
+DEFAULT_MODEL_PATH = str(Path(__file__).resolve().parent.parent / "fasttext-modelling" / "tibetan_cbow_model.bin")  # Updated to custom model

 # Facebook's official Tibetan FastText model
 FACEBOOK_TIBETAN_MODEL_ID = "facebook/fasttext-bo-vectors"

@@ -133,41 +136,67 @@ def train_fasttext_model(
     return model


+def load_facebook_official_tibetan_model() -> Optional[fasttext.FastText._FastText]:
+    """
+    Downloads (if necessary) and loads the official Facebook FastText Tibetan model.
+
+    Returns:
+        Loaded FastText model or None if loading fails.
+    """
+    try:
+        logger.info("Attempting to download and load official Facebook FastText Tibetan model")
+        facebook_model_path = hf_hub_download(
+            repo_id=FACEBOOK_TIBETAN_MODEL_ID,
+            filename=FACEBOOK_TIBETAN_MODEL_FILE,
+            cache_dir=DEFAULT_MODEL_DIR
+        )
+        logger.info(f"Loading official Facebook FastText Tibetan model from {facebook_model_path}")
+        model = fasttext.load_model(facebook_model_path)
+        if model:
+            logger.info(f"FastText model loaded in load_facebook_official_tibetan_model. Type: {type(model)}")
+            try:
+                # Basic check: get model dimensions
+                dims = model.get_dimension()
+                logger.info(f"Model dimensions reported by fasttext_embedding: {dims}")
+                # Check for a specific word to see if get_word_vector is callable with a string
+                # Using a common Tibetan particle that should be in the vocab
+                test_word = "ལ་"
+                try:
+                    vec = model.get_word_vector(test_word)
+                    logger.info(f"Successfully retrieved vector for test word '{test_word}'. Vector shape: {vec.shape if vec is not None else 'None'}")
+                except Exception as e_gwv:
+                    logger.error(f"Error calling get_word_vector for test word '{test_word}' in fasttext_embedding: {e_gwv}", exc_info=True)
+                    # Potentially re-raise or handle if this is critical for model validity
+            except Exception as e_diag_load:
+                logger.error(f"Error during diagnostic checks of loaded FastText model in fasttext_embedding: {e_diag_load}", exc_info=True)
+                # If diagnostics fail, the model might be unusable. Consider returning None.
+                # For now, let it return the model and fail later if that's the case.
+        else:
+            logger.error("fasttext.load_model returned None in load_facebook_official_tibetan_model.")
+        return model
+    except Exception as e_fb:
+        logger.error(f"Could not load official Facebook FastText Tibetan model (outer try-except): {str(e_fb)}", exc_info=True)
+        return None
+
 def load_fasttext_model(model_path: str = DEFAULT_MODEL_PATH) -> Optional[fasttext.FastText._FastText]:
     """
-    Load a FastText model from
+    Load a custom FastText model from the specified file path.

     Args:
-        model_path: Path to the model file
+        model_path: Path to the custom model file.

     Returns:
-        Loaded FastText model or None if loading fails
+        Loaded FastText model or None if loading fails.
     """
     try:
-        # First try to load the official Facebook FastText Tibetan model
-        try:
-            # Try to download the official Facebook FastText Tibetan model
-            logger.info("Attempting to download and load official Facebook FastText Tibetan model")
-            facebook_model_path = hf_hub_download(
-                repo_id=FACEBOOK_TIBETAN_MODEL_ID,
-                filename=FACEBOOK_TIBETAN_MODEL_FILE,
-                cache_dir=DEFAULT_MODEL_DIR
-            )
-            logger.info("Loading official Facebook FastText Tibetan model from %s", facebook_model_path)
-            return fasttext.load_model(facebook_model_path)
-        except Exception as e:
-            logger.warning("Could not load official Facebook FastText Tibetan model: %s", str(e))
-            logger.info("Falling back to local model")
-
-        # Fall back to local model
         if os.path.exists(model_path):
-            logger.info("
+            logger.info(f"Attempting to load custom FastText model from {model_path}")
             return fasttext.load_model(model_path)
         else:
-            logger.
+            logger.error(f"Custom FastText model path {model_path} does not exist.")
             return None
     except Exception as e:
-        logger.error("
+        logger.error(f"Could not load custom FastText model from {model_path}: {str(e)}")
         return None


@@ -177,8 +206,10 @@ def get_text_embedding(
     tokenize_fn=None,
     use_stopwords: bool = True,
     stopwords_set=None,
-    use_tfidf_weighting: bool = True,
-    corpus_token_freq=None
+    use_tfidf_weighting: bool = True,
+    corpus_token_freq=None,  # Retained for TF, but IDF will use doc_freq_map
+    doc_freq_map=None,  # Document frequency map for IDF
+    total_docs_in_corpus=0  # Total documents in corpus for IDF
 ) -> np.ndarray:
     """
     Get embedding for a text using a FastText model with optional TF-IDF weighting.

@@ -199,83 +230,141 @@ def get_text_embedding(
         return np.zeros(model.get_dimension())

     # Handle tokenization
-    if tokenize_fn
-        # Simple whitespace tokenization as fallback
-        tokens = text.split()
-    elif isinstance(tokenize_fn, list):
-        # If tokenize_fn is already a list of tokens, use it directly
-        tokens = tokenize_fn
-    elif callable(tokenize_fn):
-        # If tokenize_fn is a function, call it
-    if use_stopwords and stopwords_set
-        tokens
-    # If all tokens were filtered out as stopwords, return zero vector
-    if not tokens:
-        return np.zeros(model.get_dimension())
-    # Filter out empty tokens
-    tokens = [token for token in tokens if token.strip()]
-    if not tokens:
-        return np.zeros(model.get_dimension())
-    # Calculate TF-IDF weighted average if requested
-    if use_tfidf_weighting and corpus_token_freq is not None:
-        # Calculate term frequencies in this document
-        token_counts = {}
-        for token in tokens:
-            token_counts[token] = token_counts.get(token, 0) + 1
-            # Term frequency in this document
-            tf = token_counts.get(token, 0) / max(len(tokens), 1) if len(tokens) > 0 else 0
-            # Inverse document frequency with smoothing to avoid extreme values
-            token_freq = corpus_token_freq.get(token, 0)
-            idf = math.log((N + 1) / (token_freq + 1)) + 1  # Add 1 for smoothing
-            logger.warning("Found NaN or infinite weights in TF-IDF calculation. Using uniform weights instead.")
-            weights = [1.0 / len(tokens) if len(tokens) > 0 else 0 for _ in tokens]
-        # Sum the weighted vectors
-        return np.sum(weighted_vectors, axis=0)
+    if callable(tokenize_fn):
         tokens = tokenize_fn(text)
+        logger.debug(f"Tokens from callable tokenize_fn (first 20): {tokens[:20]}")
+    elif isinstance(tokenize_fn, list):
+        tokens = tokenize_fn  # Use the provided list directly
+        logger.debug(f"Tokens provided as list (first 20): {tokens[:20]}")
     else:
+        if tokenize_fn is not None:
+            logger.warning(f"tokenize_fn is of unexpected type: {type(tokenize_fn)}. Defaulting to space-split.")
+        else:
+            # This case handles tokenize_fn being explicitly None
+            logger.debug("tokenize_fn is None. Defaulting to space-split.")
         tokens = text.split()
+        logger.debug(f"Tokens from space-split fallback (first 20): {tokens[:20]}")
+
+    if use_stopwords and stopwords_set:
+        logger.debug(f"Original tokens before stopword check (first 20): {tokens[:20]}")
+        original_token_count = len(tokens)
+
+        def _remove_stopwords_from_tokens(tokens: List[str], stopwords_set: Set[str]) -> List[str]:
+            """
+            Removes stopwords from a list of tokens.
+            Handles Tibetan punctuation by checking both the token itself and the token after
+            stripping trailing '།' or '༔'.
+            """
+            cleaned_tokens = []
+            removed_count = 0
+            for token in tokens:
+                # 1. Check if the original token itself is a stopword (e.g., standalone '།')
+                if token in stopwords_set:
+                    removed_count += 1
+                    continue  # Skip this token
+
+                # 2. If not a direct stopword, check if it becomes one after stripping trailing punctuation
+                # This handles cases like "གྲུབ་པའི་།" where "གྲུབ་པའི་" is the stopword.
+                token_for_check = token
+                punctuation_was_stripped = False
+                if token.endswith(('།', '༔')):
+                    stripped_token = token.rstrip('།༔')
+                    if stripped_token != token:  # Check if stripping actually changed the token
+                        token_for_check = stripped_token
+                        punctuation_was_stripped = True
+
+                if punctuation_was_stripped and token_for_check in stopwords_set:
+                    removed_count += 1
+                    continue  # Skip this token
+
+                # 3. If neither the original token nor its base form is a stopword, keep it.
+                cleaned_tokens.append(token)
+
+            return cleaned_tokens
+
+        tokens = _remove_stopwords_from_tokens(tokens, stopwords_set)
+        removed_count = original_token_count - len(tokens)
+        logger.debug(f"Tokens after stopword removal (removed {removed_count}): {tokens[:20]}")
+
+    if not tokens:
+        logger.debug("Text became empty after tokenization/stopwords, returning zero vector.")
+        return np.zeros(model.get_dimension())
+
+    if use_tfidf_weighting and doc_freq_map and total_docs_in_corpus is not None and total_docs_in_corpus > 0:
+        logger.debug("Applying TF-IDF weighting.")
+        N_docs = total_docs_in_corpus
+        logger.debug(f"Total documents (N_docs) for IDF: {N_docs}")
+
+        token_counts = Counter(tokens)
+        logger.debug(f"Local token counts for this segment (top 5): {dict(token_counts.most_common(5))}")
+
+        tf_idf_weights = []
+        token_details_log = []
+
+        for token in tokens:  # Iterate in original token order
+            tf = token_counts.get(token, 0) / len(tokens) if len(tokens) > 0 else 0
+            df = doc_freq_map.get(token, 0)
+            idf = math.log((N_docs + 1) / (df + 1)) + 1
             weight = tf * idf
+            tf_idf_weights.append(weight)
+            token_details_log.append(f"Token: '{token}', TF: {tf:.4f}, DF: {df}, IDF: {idf:.4f}, Raw_TFIDF: {weight:.4f}")

+        logger.debug("Token TF-IDF details (first 10 tokens):")
+        for i, log_entry in enumerate(token_details_log[:10]):
+            logger.debug(f"  {i+1}. {log_entry}")
+
+        total_weight = sum(tf_idf_weights)
+        logger.debug(f"Sum of raw TF-IDF weights: {total_weight}")
+        logger.debug(f"TF-IDF Summary for text snippet (first 100 chars): '{text[:100]}'. Total_TFIDF_Weight: {total_weight:.8e}. Fallback_to_Uniform: {total_weight <= 1e-6}.")
+        normalized_weights = []
+        if total_weight > 1e-6:
+            normalized_weights = [w / total_weight for w in tf_idf_weights]
+            logger.debug(f"Normalized weights (first 10): {[f'{w:.4f}' for w in normalized_weights[:10]]}")
         else:
+            logger.debug("Total TF-IDF weight is very small, falling back to uniform weights.")
+            num_tokens = len(tokens)
+            if num_tokens > 0:
+                normalized_weights = [1/num_tokens] * num_tokens
+                logger.debug(f"Uniform weights (first 10): {[f'{w:.4f}' for w in normalized_weights[:10]]}")

+        weighted_embeddings_sum = np.zeros(model.get_dimension())
+        if len(normalized_weights) == len(tokens):
+            for i, token in enumerate(tokens):
+                word_vector = model.get_word_vector(token)
+                vec_sum_for_log = np.sum(word_vector)
+                logger.debug(f"  Token: '{token}', Word_Vec_Sum: {vec_sum_for_log:.4f}, Applied_Weight: {normalized_weights[i]:.4f}")
+                weighted_embeddings_sum += word_vector * normalized_weights[i]
+            final_embedding = weighted_embeddings_sum
+        else:
+            logger.error("Mismatch between token count and normalized_weights count. THIS IS A BUG. Falling back to simple average.")
+            embeddings = [model.get_word_vector(t) for t in tokens]
+            if embeddings:
+                final_embedding = np.mean(embeddings, axis=0)
+            else:
+                final_embedding = np.zeros(model.get_dimension())

     else:
+        if use_tfidf_weighting:
+            logger.debug("TF-IDF weighting was requested but doc_freq_map or total_docs_in_corpus is missing/invalid. Falling back to simple averaging.")
+        else:
+            logger.debug("Using simple averaging of word vectors (TF-IDF not requested or N_docs=0).")
+
+        embeddings = []
+        for token in tokens:
+            word_vector = model.get_word_vector(token)
+            embeddings.append(word_vector)
+            vec_sum_for_log = np.sum(word_vector)
+            logger.debug(f"  Token: '{token}', Word_Vec_Sum: {vec_sum_for_log:.4f} (simple avg context)")
+
+        if embeddings:
+            final_embedding = np.mean(embeddings, axis=0)
+        else:
+            final_embedding = np.zeros(model.get_dimension())
+
+    final_emb_sum_for_log = np.sum(final_embedding)
+    logger.debug(f"Final aggregated embedding sum: {final_emb_sum_for_log:.4f}, shape: {final_embedding.shape}")
+    logger.debug(f"--- get_text_embedding finished for text (first 50 chars): {text[:50]} ---")
+    return final_embedding


 def get_batch_embeddings(

@@ -284,8 +373,10 @@ def get_batch_embeddings(
     tokenize_fn=None,
     use_stopwords: bool = True,
     stopwords_set=None,
-    use_tfidf_weighting: bool = True,
-    corpus_token_freq=None
+    use_tfidf_weighting: bool = True,
+    corpus_token_freq=None,  # Corpus-wide term frequencies
+    doc_freq_map=None,  # Document frequency map for IDF
+    total_docs_in_corpus=0  # Total documents in corpus for IDF
 ) -> np.ndarray:
     """
     Get embeddings for a batch of texts with optional TF-IDF weighting.

@@ -302,54 +393,31 @@ def get_batch_embeddings(
     Returns:
         Array of text embedding vectors
     """
-    # If corpus_token_freq is not provided but TF-IDF is requested, build it from the texts
-    if use_tfidf_weighting and corpus_token_freq is None:
-        logger.info("Building corpus token frequency dictionary for TF-IDF weighting")
-        corpus_token_freq = {}
-
-        # Process each text to build corpus token frequencies
-        for text in texts:
-            if not text.strip():
-                continue
-
-            # Handle tokenization
-            if tokenize_fn is None:
-                tokens = text.split()
-            elif isinstance(tokenize_fn, list):
-                # In this case, tokenize_fn should be a list of lists (one list of tokens per text)
-                # This is not a common use case, so we'll just use the first one as fallback
-                tokens = tokenize_fn[0] if tokenize_fn else []
-            else:
-                tokens = tokenize_fn(text)
-
-            # Filter out stopwords if enabled
-            if use_stopwords and stopwords_set is not None:
-                tokens = [token for token in tokens if token not in stopwords_set]
-
-            # Update corpus token frequencies
-            for token in tokens:
-                if token.strip():  # Skip empty tokens
-                    corpus_token_freq[token] = corpus_token_freq.get(token, 0) + 1
-
-        logger.info("Built corpus token frequency dictionary with %d unique tokens", len(corpus_token_freq))
-
     # Get embeddings for each text
     embeddings = []
-    for i,
-        if
+    for i, text_content in enumerate(texts):  # Changed 'text' to 'text_content'
+        tokens_or_tokenizer_for_current_text = None
+
+        if callable(tokenize_fn):
+            tokens_or_tokenizer_for_current_text = tokenize_fn  # Pass the function itself
+        elif isinstance(tokenize_fn, list):
+            # If tokenize_fn is a list, it's assumed to be a list of pre-tokenized documents
             if i < len(tokenize_fn):
+                tokens_or_tokenizer_for_current_text = tokenize_fn[i]  # This is List[str] for the current text
+            else:
+                logger.warning(f"Pre-tokenized list `tokenize_fn` is shorter than the list of texts. Index {i} is out of bounds for `tokenize_fn` with length {len(tokenize_fn)}. Defaulting to None for this text.")
+        # If tokenize_fn is None or other, tokens_or_tokenizer_for_current_text remains None (get_text_embedding handles default).

         embedding = get_text_embedding(
-
+            text_content,  # Use renamed variable
             model,
-            tokenize_fn=
+            tokenize_fn=tokens_or_tokenizer_for_current_text,  # Pass the correctly determined function or token list
             use_stopwords=use_stopwords,
             stopwords_set=stopwords_set,
             use_tfidf_weighting=use_tfidf_weighting,
-            corpus_token_freq=corpus_token_freq
+            corpus_token_freq=corpus_token_freq,
+            doc_freq_map=doc_freq_map,
+            total_docs_in_corpus=total_docs_in_corpus
         )
         embeddings.append(embedding)

@@ -359,11 +427,12 @@ def get_batch_embeddings(
 def generate_embeddings(
     texts: List[str],
     model: fasttext.FastText._FastText,
-    device: str,
-    model_type: str = "sentence_transformer",
     tokenize_fn=None,
     use_stopwords: bool = True,
-    use_lite_stopwords: bool = False
+    use_lite_stopwords: bool = False,
+    corpus_token_freq=None,  # Existing: For TF-IDF
+    doc_freq_map=None,  # Added: For TF-IDF document frequency
+    total_docs_in_corpus=0  # Added: For TF-IDF total documents in corpus
 ) -> np.ndarray:
     """
     Generate embeddings for a list of texts using a FastText model.

@@ -371,17 +440,16 @@ def generate_embeddings(
     Args:
         texts: List of input texts
         model: FastText model
-        device: Device to use for computation (not used for FastText)
-        model_type: Model type ('sentence_transformer' or 'fasttext')
         tokenize_fn: Optional tokenization function or pre-tokenized list of tokens
         use_stopwords: Whether to filter out stopwords
         use_lite_stopwords: Whether to use a lighter set of stopwords
+        corpus_token_freq: Precomputed term frequencies for the corpus (for TF-IDF).
+        doc_freq_map: Precomputed document frequencies for tokens (for TF-IDF).
+        total_docs_in_corpus: Total number of documents in the corpus (for TF-IDF).

     Returns:
         Array of text embedding vectors
     """
-    if model_type != "fasttext":
-        logger.warning("Model type %s not supported for FastText. Using FastText anyway.", model_type)

     # Generate embeddings using FastText
     try:

@@ -399,7 +467,10 @@ def generate_embeddings(
         tokenize_fn=tokenize_fn,
         use_stopwords=use_stopwords,
         stopwords_set=stopwords_set,
-        use_tfidf_weighting=True
+        use_tfidf_weighting=True,  # TF-IDF weighting enabled
+        corpus_token_freq=corpus_token_freq,  # Pass down
+        doc_freq_map=doc_freq_map,  # Pass down for TF-IDF
+        total_docs_in_corpus=total_docs_in_corpus  # Pass down for TF-IDF
     )

     logger.info("FastText embeddings generated with shape: %s", str(embeddings.shape))
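The new `get_text_embedding` signature expects its IDF inputs (`doc_freq_map`, `total_docs_in_corpus`) to be supplied by the caller; the old in-function corpus frequency builder has been removed. Below is a minimal sketch of how a caller could prepare those inputs from pre-tokenized segments and pass them to `get_batch_embeddings`. The helper name is illustrative, and the actual preparation presumably lives in `pipeline/process.py`, which is not shown in this excerpt.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_doc_frequencies(tokenized_segments: List[List[str]]) -> Tuple[Dict[str, int], int]:
    """Document frequency per token: in how many segments does each token occur?"""
    doc_freq: Counter = Counter()
    for tokens in tokenized_segments:
        doc_freq.update(set(tokens))  # count each token at most once per segment
    return dict(doc_freq), len(tokenized_segments)


# Usage with the updated API from the diff above (model is a loaded FastText model,
# all_segment_tokens is a list of botok token lists, one per segment):
# doc_freq_map, total_docs = build_doc_frequencies(all_segment_tokens)
# vectors = get_batch_embeddings(
#     texts, model,
#     tokenize_fn=all_segment_tokens,
#     doc_freq_map=doc_freq_map,
#     total_docs_in_corpus=total_docs,
# )
```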
pipeline/metrics.py
CHANGED
@@ -1,15 +1,14 @@
|
|
1 |
import numpy as np
|
2 |
import pandas as pd
|
3 |
-
from typing import List, Dict
|
4 |
from itertools import combinations
|
5 |
from sklearn.metrics.pairwise import cosine_similarity
|
6 |
-
import torch
|
7 |
from .semantic_embedding import generate_embeddings
|
8 |
from .tokenize import tokenize_texts
|
9 |
import logging
|
10 |
from sklearn.feature_extraction.text import TfidfVectorizer
|
11 |
-
from .stopwords_bo import TIBETAN_STOPWORDS
|
12 |
-
from .stopwords_lite_bo import TIBETAN_STOPWORDS_LITE
|
13 |
|
14 |
# Attempt to import the Cython-compiled fast_lcs module
|
15 |
try:
|
@@ -21,58 +20,8 @@ except ImportError:
|
|
21 |
|
22 |
logger = logging.getLogger(__name__)
|
23 |
|
24 |
-
MAX_TOKENS_PER_CHUNK = 500 # Max tokens (words via botok) per chunk
|
25 |
-
CHUNK_OVERLAP = 50 # Number of tokens to overlap between chunks
|
26 |
|
27 |
|
28 |
-
def _chunk_text(
|
29 |
-
original_text_content: str,
|
30 |
-
tokens: List[str],
|
31 |
-
max_chunk_tokens: int,
|
32 |
-
overlap_tokens: int,
|
33 |
-
) -> List[str]:
|
34 |
-
"""
|
35 |
-
Splits a list of tokens into chunks and reconstructs text segments from these token chunks.
|
36 |
-
The reconstructed text segments are intended for embedding models.
|
37 |
-
Args:
|
38 |
-
original_text_content (str): The original raw text string. Used if no chunking is needed.
|
39 |
-
tokens (List[str]): The list of botok tokens for the original_text_content.
|
40 |
-
max_chunk_tokens (int): Maximum number of botok tokens per chunk.
|
41 |
-
overlap_tokens (int): Number of botok tokens to overlap between chunks.
|
42 |
-
|
43 |
-
Returns:
|
44 |
-
List[str]: A list of text strings, where each string is a chunk.
|
45 |
-
"""
|
46 |
-
if (
|
47 |
-
not tokens
|
48 |
-
): # Handles empty or whitespace-only original text that led to no tokens
|
49 |
-
return [original_text_content] if original_text_content.strip() else []
|
50 |
-
|
51 |
-
if len(tokens) <= max_chunk_tokens:
|
52 |
-
# If not chunking, return the original text content directly, as per MEMORY[a777e6ad-11c4-4b90-8e6e-63a923a94432]
|
53 |
-
# The memory states raw text segments are passed directly to the model.
|
54 |
-
# Joining tokens here would alter spacing, etc.
|
55 |
-
return [original_text_content]
|
56 |
-
|
57 |
-
reconstructed_text_chunks = []
|
58 |
-
start_idx = 0
|
59 |
-
while start_idx < len(tokens):
|
60 |
-
end_idx = min(start_idx + max_chunk_tokens, len(tokens))
|
61 |
-
current_chunk_botok_tokens = tokens[start_idx:end_idx]
|
62 |
-
# Reconstruct the text chunk by joining the botok tokens. This is an approximation.
|
63 |
-
# The semantic model's internal tokenizer will handle this string.
|
64 |
-
reconstructed_text_chunks.append(" ".join(current_chunk_botok_tokens))
|
65 |
-
|
66 |
-
if end_idx == len(tokens):
|
67 |
-
break
|
68 |
-
|
69 |
-
next_start_idx = start_idx + max_chunk_tokens - overlap_tokens
|
70 |
-
if next_start_idx <= start_idx:
|
71 |
-
next_start_idx = start_idx + 1
|
72 |
-
start_idx = next_start_idx
|
73 |
-
|
74 |
-
return reconstructed_text_chunks
|
75 |
-
|
76 |
|
77 |
def compute_normalized_lcs(words1: List[str], words2: List[str]) -> float:
|
78 |
# Calculate m and n (lengths) here, so they are available for normalization
|
@@ -100,214 +49,126 @@ def compute_normalized_lcs(words1: List[str], words2: List[str]) -> float:
|
|
100 |
return lcs_length / avg_length if avg_length > 0 else 0.0
|
101 |
|
102 |
|
|
|
103 |
def compute_semantic_similarity(
|
104 |
text1_segment: str,
|
105 |
text2_segment: str,
|
106 |
-
tokens1: List[str],
|
107 |
-
tokens2: List[str],
|
108 |
-
model,
|
109 |
-
|
110 |
-
model_type: str = "sentence_transformer",
|
111 |
use_stopwords: bool = True,
|
112 |
use_lite_stopwords: bool = False,
|
|
|
|
|
|
|
|
|
113 |
) -> float:
|
114 |
-
"""Computes semantic similarity using a
|
115 |
-
if
|
|
|
|
|
|
|
|
|
116 |
logger.warning(
|
117 |
-
"
|
118 |
)
|
119 |
-
return np.nan
|
120 |
|
121 |
if not text1_segment or not text2_segment:
|
122 |
logger.info(
|
123 |
"One or both texts are empty for semantic similarity. Returning 0.0."
|
124 |
)
|
125 |
-
return 0.0
|
126 |
|
127 |
def _get_aggregated_embedding(
|
128 |
-
raw_text_segment: str,
|
129 |
-
|
130 |
-
|
131 |
-
|
132 |
-
|
133 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
134 |
logger.info(
|
135 |
f"Text segment is empty or only whitespace: {raw_text_segment[:100]}... Returning None for embedding."
|
136 |
)
|
137 |
return None
|
138 |
|
139 |
-
|
140 |
-
|
141 |
-
|
142 |
-
|
143 |
-
|
144 |
-
|
145 |
-
|
146 |
-
|
147 |
-
|
148 |
-
|
149 |
-
embedding = generate_embeddings(
|
150 |
-
[raw_text_segment],
|
151 |
-
model_obj,
|
152 |
-
device_str,
|
153 |
-
model_type,
|
154 |
-
tokenize_fn=[botok_tokens], # Wrap in list since we're passing tokens for one text
|
155 |
-
use_stopwords=use_stopwords,
|
156 |
-
use_lite_stopwords=use_lite_stopwords
|
157 |
-
)
|
158 |
-
|
159 |
-
if embedding is None or embedding.nelement() == 0:
|
160 |
-
logger.error(
|
161 |
-
f"Failed to generate FastText embedding for text: {raw_text_segment[:100]}..."
|
162 |
-
)
|
163 |
-
return None
|
164 |
-
return embedding # Already [1, embed_dim]
|
165 |
|
166 |
-
|
167 |
-
|
168 |
-
|
169 |
-
|
170 |
-
|
171 |
-
|
172 |
-
filtered_tokens = [token for token in botok_tokens if token not in TIBETAN_STOPWORDS_LITE_SET]
|
173 |
-
else:
|
174 |
-
from .stopwords_bo import TIBETAN_STOPWORDS_SET
|
175 |
-
filtered_tokens = [token for token in botok_tokens if token not in TIBETAN_STOPWORDS_SET]
|
176 |
-
|
177 |
-
# If all tokens were filtered out as stopwords, return zero embedding
|
178 |
-
if not filtered_tokens:
|
179 |
-
logger.info("All tokens in text are stopwords. Returning zero embedding.")
|
180 |
-
# Create a zero tensor with the same dimension as the model's output
|
181 |
-
# For transformer models, typically 384 or 768 dimensions
|
182 |
-
embedding_dim = 384 # Default dimension for MiniLM models
|
183 |
-
return torch.zeros(1, embedding_dim)
|
184 |
-
|
185 |
-
# Continue with normal processing if content remains after filtering
|
186 |
-
if len(botok_tokens) > MAX_TOKENS_PER_CHUNK:
|
187 |
-
logger.info(
|
188 |
-
f"Text segment with ~{len(botok_tokens)} tokens exceeds {MAX_TOKENS_PER_CHUNK}, chunking {raw_text_segment[:30]}..."
|
189 |
-
)
|
190 |
-
# Pass the original raw text and its pre-computed botok tokens to _chunk_text
|
191 |
-
text_chunks = _chunk_text(
|
192 |
-
raw_text_segment, botok_tokens, MAX_TOKENS_PER_CHUNK, CHUNK_OVERLAP
|
193 |
-
)
|
194 |
-
if not text_chunks:
|
195 |
-
logger.warning(
|
196 |
-
f"Chunking resulted in no chunks for segment: {raw_text_segment[:100]}..."
|
197 |
-
)
|
198 |
-
return None

Removed from compute_semantic_similarity (old lines 199-273): the former Sentence Transformer code path, duplicated for the with- and without-stopword-filtering branches. For long segments it logged the chunk count, generated per-chunk embeddings with generate_embeddings(text_chunks, model_obj, device_str, model_type) on the chunks produced by _chunk_text(raw_text_segment, botok_tokens, MAX_TOKENS_PER_CHUNK, CHUNK_OVERLAP), and mean-pooled them with torch.mean(chunk_embeddings, dim=0, keepdim=True); segments short enough for the transformer were embedded directly with generate_embeddings([raw_text_segment], model_obj, device_str, model_type) and returned as a [1, embed_dim] tensor.

Removed from the old try block and function signature (old lines 275-310, several lines truncated in this rendering): the generic null/empty checks on embedding1 and embedding2, the old zero-embedding handling, the generic error handler with exc_info=True, and the previous compute_all_metrics signature (texts: Dict[str, str], model=None, model_type=..., use_lite_stopwords: bool = False). The replacement follows in the new version of pipeline/metrics.py below.
@@ -328,25 +189,60 @@, @@ -368,14 +264,14 @@, @@ -421,7 +317,11 @@ def compute_all_metrics(...)
Removed old-side lines (several truncated in this rendering): the previous token-list preparation (token_lists = {} and the per-file loop body), the old TF-IDF guard and the vectorizer.fit_transform(...) call on the previous corpus variable, the n = len(...) fallback used when the vocabulary is empty after stopword removal, and the old compute_semantic_similarity call arguments (texts[f1], texts[f2], words1_raw, words2_raw, model, plus the since-removed device parameter). The unchanged context lines and the full replacement appear in the new version of compute_all_metrics below.
1 |
import numpy as np
|
2 |
import pandas as pd
|
3 |
+
from typing import List, Dict, Union
|
4 |
from itertools import combinations
|
5 |
from sklearn.metrics.pairwise import cosine_similarity
|
|
|
6 |
from .semantic_embedding import generate_embeddings
|
7 |
from .tokenize import tokenize_texts
|
8 |
import logging
|
9 |
from sklearn.feature_extraction.text import TfidfVectorizer
|
10 |
+
from .stopwords_bo import TIBETAN_STOPWORDS
|
11 |
+
from .stopwords_lite_bo import TIBETAN_STOPWORDS_LITE
|
12 |
|
13 |
# Attempt to import the Cython-compiled fast_lcs module
|
14 |
try:
|
|
|
20 |
|
21 |
logger = logging.getLogger(__name__)
|
22 |
|
|
|
|
|
23 |
|
24 |
25 |
|
26 |
def compute_normalized_lcs(words1: List[str], words2: List[str]) -> float:
|
27 |
# Calculate m and n (lengths) here, so they are available for normalization
|
|
|
49 |
return lcs_length / avg_length if avg_length > 0 else 0.0
|
50 |
|
51 |
|
52 |
+
|
53 |
def compute_semantic_similarity(
|
54 |
text1_segment: str,
|
55 |
text2_segment: str,
|
56 |
+
tokens1: List[str], # botok tokens for text1, not directly used by FastText path but kept for signature
|
57 |
+
tokens2: List[str], # botok tokens for text2, not directly used by FastText path but kept for signature
|
58 |
+
model, # FastText model object
|
59 |
+
model_type: str = "fasttext", # Should always be 'fasttext' when called
|
|
|
60 |
use_stopwords: bool = True,
|
61 |
use_lite_stopwords: bool = False,
|
62 |
+
fasttext_tokenize_fn=None,
|
63 |
+
term_freq_corpus=None,
|
64 |
+
doc_freq_map=None,
|
65 |
+
total_docs_in_corpus=0
|
66 |
) -> float:
|
67 |
+
"""Computes semantic similarity using a FastText model."""
|
68 |
+
if model_type != "fasttext":
|
69 |
+
logger.error(f"compute_semantic_similarity called with unexpected model_type: {model_type}")
|
70 |
+
return np.nan
|
71 |
+
|
72 |
+
if model is None:
|
73 |
logger.warning(
|
74 |
+
"FastText model not available for semantic similarity. Skipping calculation."
|
75 |
)
|
76 |
+
return np.nan
|
77 |
|
78 |
if not text1_segment or not text2_segment:
|
79 |
logger.info(
|
80 |
"One or both texts are empty for semantic similarity. Returning 0.0."
|
81 |
)
|
82 |
+
return 0.0
|
83 |
|
84 |
def _get_aggregated_embedding(
|
85 |
+
raw_text_segment: str,
|
86 |
+
_botok_tokens: List[str], # Parameter name prefixed with _ to indicate it's not used
|
87 |
+
model_obj,
|
88 |
+
use_stopwords_param: bool,
|
89 |
+
use_lite_stopwords_param: bool,
|
90 |
+
tokenize_fn_param,
|
91 |
+
term_freq_corpus_param,
|
92 |
+
doc_freq_map_param,
|
93 |
+
total_docs_in_corpus_param
|
94 |
+
) -> Union[np.ndarray, None]:
|
95 |
+
"""Helper to get a single embedding for a text using FastText."""
|
96 |
+
if not raw_text_segment.strip():
|
97 |
logger.info(
|
98 |
f"Text segment is empty or only whitespace: {raw_text_segment[:100]}... Returning None for embedding."
|
99 |
)
|
100 |
return None
|
101 |
|
102 |
+
embedding = generate_embeddings(
|
103 |
+
texts=[raw_text_segment],
|
104 |
+
model=model_obj,
|
105 |
+
tokenize_fn=tokenize_fn_param,
|
106 |
+
use_stopwords=use_stopwords_param,
|
107 |
+
use_lite_stopwords=use_lite_stopwords_param,
|
108 |
+
corpus_token_freq=term_freq_corpus_param,
|
109 |
+
doc_freq_map=doc_freq_map_param,
|
110 |
+
total_docs_in_corpus=total_docs_in_corpus_param
|
111 |
+
)
|
112 |
|
113 |
+
if embedding is None or embedding.size == 0:
|
114 |
+
logger.error(
|
115 |
+
f"Failed to generate FastText embedding for text: {raw_text_segment[:100]}..."
|
116 |
+
)
|
117 |
+
return None
|
118 |
+
return embedding
|
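The helper above hands the raw segment, the stopword flags and the corpus statistics to generate_embeddings, which defers to get_batch_embeddings in pipeline/fasttext_embedding.py (not shown in this diff). As a rough sketch of the idea behind corpus_token_freq, doc_freq_map and total_docs_in_corpus, a TF-IDF-weighted average of FastText word vectors could look like the following; the helper name and the smoothing are assumptions, not the actual implementation.

```python
# Sketch only: TF-IDF-weighted mean of FastText word vectors.
# The real weighting lives in pipeline/fasttext_embedding.get_batch_embeddings.
import math
import numpy as np

def tfidf_weighted_vector(tokens, model, corpus_token_freq, doc_freq_map, total_docs):
    vectors, weights = [], []
    for tok in tokens:
        tf = corpus_token_freq.get(tok, 1)                  # corpus-level term frequency
        df = doc_freq_map.get(tok, 1)                       # documents containing tok
        idf = math.log((1 + total_docs) / (1 + df)) + 1.0   # smoothed IDF (assumption)
        vectors.append(model.get_word_vector(tok))          # fasttext Python API
        weights.append(tf * idf)
    if not vectors:
        return np.zeros(model.get_dimension(), dtype=np.float32)
    weights = np.asarray(weights, dtype=np.float32)
    return np.average(np.stack(vectors), axis=0, weights=weights)
```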
119 |
|
120 |
try:
|
121 |
+
# Pass all relevant parameters to _get_aggregated_embedding
|
122 |
+
emb1 = _get_aggregated_embedding(text1_segment, tokens1, model, use_stopwords, use_lite_stopwords, fasttext_tokenize_fn, term_freq_corpus, doc_freq_map, total_docs_in_corpus)
|
123 |
+
emb2 = _get_aggregated_embedding(text2_segment, tokens2, model, use_stopwords, use_lite_stopwords, fasttext_tokenize_fn, term_freq_corpus, doc_freq_map, total_docs_in_corpus)
|
124 |
+
|
125 |
+
if emb1 is None or emb2 is None or emb1.size == 0 or emb2.size == 0:
|
126 |
logger.error(
|
127 |
+
"Failed to obtain one or both FastText embeddings for semantic similarity."
|
128 |
)
|
129 |
return np.nan
|
130 |
|
131 |
+
# Ensure embeddings are numpy arrays (should be, but defensive)
|
132 |
+
if not isinstance(emb1, np.ndarray): emb1 = np.array(emb1)
|
133 |
+
if not isinstance(emb2, np.ndarray): emb2 = np.array(emb2)
|
134 |
+
|
135 |
+
# Handle cases where embeddings are all zeros
|
136 |
+
if np.all(emb1 == 0) and np.all(emb2 == 0):
|
137 |
+
logger.info("Both FastText embeddings are zero. Semantic similarity is 0.0.")
|
138 |
return 0.0
|
139 |
+
if np.all(emb1 == 0) or np.all(emb2 == 0):
|
140 |
+
logger.info("One of the FastText embeddings is zero. Semantic similarity is 0.0.")
|
141 |
+
return 0.0
|
142 |
+
|
143 |
+
# Handle NaN or Inf in embeddings
|
144 |
+
if np.isnan(emb1).any() or np.isinf(emb1).any() or \
|
145 |
+
np.isnan(emb2).any() or np.isinf(emb2).any():
|
146 |
+
logger.warning("NaN or Inf found in FastText embeddings. Semantic similarity set to 0.0.")
|
147 |
+
return 0.0
|
148 |
+
|
149 |
+
# Ensure embeddings are 2D for cosine_similarity: [1, dim]
|
150 |
+
if emb1.ndim == 1: emb1 = emb1.reshape(1, -1)
|
151 |
+
if emb2.ndim == 1: emb2 = emb2.reshape(1, -1)
|
152 |
+
|
153 |
+
similarity_score = cosine_similarity(emb1, emb2)[0][0]
|
154 |
+
|
155 |
+
return max(0.0, float(similarity_score))
|
156 |
+
|
157 |
except Exception as e:
|
158 |
+
safe_text1 = str(text1_segment)[:100] if text1_segment is not None else "N/A"
|
159 |
+
safe_text2 = str(text2_segment)[:100] if text2_segment is not None else "N/A"
|
160 |
logger.error(
|
161 |
+
f"Error during FastText semantic similarity calculation:\nText1: {safe_text1}...\nText2: {safe_text2}...\nError: {e}"
|
|
|
162 |
)
|
163 |
+
logger.exception("Traceback for FastText semantic similarity calculation error:")
|
164 |
return np.nan
|
165 |
|
166 |
|
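For reference, a hypothetical call with the expanded signature; the texts, tokens and corpus statistics are placeholders, and fasttext_model stands for a model loaded elsewhere.

```python
# Placeholder inputs; normally prepared by compute_all_metrics below.
score = compute_semantic_similarity(
    "ཆོས་ཐམས་ཅད་...", "ཆོས་རྣམས་...",
    tokens1=["ཆོས་", "ཐམས་ཅད་"],
    tokens2=["ཆོས་", "རྣམས་"],
    model=fasttext_model,            # a loaded FastText model (assumed)
    model_type="fasttext",
    use_stopwords=True,
    use_lite_stopwords=False,
    fasttext_tokenize_fn=None,
    term_freq_corpus={},
    doc_freq_map={},
    total_docs_in_corpus=2,
)
```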
167 |
def compute_all_metrics(
|
168 |
+
texts: Dict[str, str], model=None, enable_semantic: bool = True, # device=None removed
|
169 |
+
model_type: str = "fasttext", use_stopwords: bool = True,
|
170 |
+
use_lite_stopwords: bool = False,
|
171 |
+
fasttext_tokenize_fn=None # Added for FastText specific tokenizer
|
172 |
) -> pd.DataFrame:
|
173 |
"""
|
174 |
Computes all selected similarity metrics between pairs of texts.
|
|
|
189 |
files = list(texts.keys())
|
190 |
results = []
|
191 |
# Prepare token lists (always use tokenize_texts for raw Unicode)
|
192 |
+
token_lists = {} # Stores botok tokens for each text_id, used for Jaccard, LCS, and semantic sim
|
193 |
+
corpus_for_sklearn_tfidf = [] # For storing space-joined tokens for scikit-learn's TF-IDF
|
194 |
+
|
195 |
+
# For FastText TF-IDF related statistics
|
196 |
+
term_freq_corpus_for_fasttext = {} # Renamed from global_corpus_token_freq_for_fasttext
|
197 |
+
document_frequency_map_for_fasttext = {}
|
198 |
+
total_num_documents_for_fasttext = len(texts)
|
199 |
+
|
200 |
+
stopwords_set_for_fasttext_stats_calc = set()
|
201 |
+
if use_stopwords: # This 'use_stopwords' is an arg to compute_all_metrics
|
202 |
+
if use_lite_stopwords:
|
203 |
+
from .stopwords_lite_bo import TIBETAN_STOPWORDS_LITE_SET
|
204 |
+
stopwords_set_for_fasttext_stats_calc = TIBETAN_STOPWORDS_LITE_SET
|
205 |
+
else:
|
206 |
+
from .stopwords_bo import TIBETAN_STOPWORDS_SET
|
207 |
+
stopwords_set_for_fasttext_stats_calc = TIBETAN_STOPWORDS_SET
|
208 |
|
209 |
for fname, content in texts.items():
|
210 |
+
current_tokens_for_file = []
|
211 |
+
tokenized_content_list_of_lists = tokenize_texts([content])
|
212 |
+
if tokenized_content_list_of_lists and tokenized_content_list_of_lists[0]:
|
213 |
+
current_tokens_for_file = tokenized_content_list_of_lists[0]
|
214 |
+
token_lists[fname] = current_tokens_for_file
|
215 |
+
|
216 |
+
corpus_for_sklearn_tfidf.append(" ".join(current_tokens_for_file) if current_tokens_for_file else "")
|
217 |
+
|
218 |
+
if model_type == "fasttext":
|
219 |
+
tokens_for_fasttext_stats = []
|
220 |
+
if fasttext_tokenize_fn is not None:
|
221 |
+
tokens_for_fasttext_stats = fasttext_tokenize_fn(content)
|
222 |
+
else:
|
223 |
+
tokens_for_fasttext_stats = current_tokens_for_file
|
224 |
+
|
225 |
+
filtered_tokens_for_stats = [
|
226 |
+
token for token in tokens_for_fasttext_stats if token not in stopwords_set_for_fasttext_stats_calc
|
227 |
+
] if use_stopwords else tokens_for_fasttext_stats
|
228 |
+
|
229 |
+
# Update corpus-wide term frequencies
|
230 |
+
for token in filtered_tokens_for_stats:
|
231 |
+
if token.strip():
|
232 |
+
term_freq_corpus_for_fasttext[token] = term_freq_corpus_for_fasttext.get(token, 0) + 1
|
233 |
+
|
234 |
+
# Update document frequencies
|
235 |
+
unique_filtered_tokens_in_doc = set(filtered_tokens_for_stats)
|
236 |
+
for token in unique_filtered_tokens_in_doc:
|
237 |
+
if token.strip():
|
238 |
+
document_frequency_map_for_fasttext[token] = document_frequency_map_for_fasttext.get(token, 0) + 1
|
239 |
+
|
240 |
+
if model_type == "fasttext":
|
241 |
+
logger.info(f"Built FastText corpus term frequency map with {len(term_freq_corpus_for_fasttext)} unique tokens.")
|
242 |
+
logger.info(f"Built FastText document frequency map with {len(document_frequency_map_for_fasttext)} unique tokens across {total_num_documents_for_fasttext} documents.")
|
243 |
|
244 |
# TF-IDF Vectorization and Cosine Similarity Calculation
|
245 |
+
if corpus_for_sklearn_tfidf:
|
246 |
try:
|
247 |
# Using a dummy tokenizer and preprocessor as input is already tokenized (as space-separated strings)
|
248 |
# and we don't want further case changes or token modifications for Tibetan.
|
|
|
264 |
token_pattern=None,
|
265 |
stop_words=stopwords_to_use
|
266 |
)
|
267 |
+
tfidf_matrix = vectorizer.fit_transform(corpus_for_sklearn_tfidf)
|
268 |
# Calculate pairwise cosine similarity on the TF-IDF matrix
|
269 |
# This gives a square matrix where cosine_sim_matrix[i, j] is the similarity between doc i and doc j
|
270 |
cosine_sim_matrix = cosine_similarity(tfidf_matrix)
|
271 |
except ValueError as e:
|
272 |
if "empty vocabulary" in str(e):
|
273 |
# If vocabulary is empty after stopword removal, create a zero matrix
|
274 |
+
n = len(corpus_for_sklearn_tfidf)
|
275 |
cosine_sim_matrix = np.zeros((n, n))
|
276 |
else:
|
277 |
# Re-raise other ValueError
|
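The vectorizer above uses an identity tokenizer and preprocessor because each corpus entry is already a string of space-joined botok tokens. A minimal standalone sketch of the same configuration, with made-up documents and without the Tibetan stopword list:

```python
# Minimal sketch; the two documents are made-up space-joined token strings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["ཆོས་ ཐམས་ཅད་ ནི་", "ཆོས་ རྣམས་ ནི་"]
vectorizer = TfidfVectorizer(
    tokenizer=str.split,        # input is already tokenized
    preprocessor=lambda x: x,   # no case folding or other preprocessing
    token_pattern=None,         # silence the unused-pattern warning
    stop_words=None,            # or a list of Tibetan stopwords
)
tfidf_matrix = vectorizer.fit_transform(docs)
cosine_sim_matrix = cosine_similarity(tfidf_matrix)  # [i, j] = similarity of doc i and doc j
```

Passing token_pattern=None alongside a custom tokenizer suppresses scikit-learn's warning that the default pattern would otherwise be ignored.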
|
|
317 |
if enable_semantic:
|
318 |
# Pass raw texts and their pre-computed botok tokens
|
319 |
semantic_sim = compute_semantic_similarity(
|
320 |
+
texts[f1], texts[f2], words1_raw, words2_raw, model, model_type, use_stopwords, use_lite_stopwords, # device removed
|
321 |
+
fasttext_tokenize_fn=fasttext_tokenize_fn,
|
322 |
+
term_freq_corpus=term_freq_corpus_for_fasttext if model_type == "fasttext" else None,
|
323 |
+
doc_freq_map=document_frequency_map_for_fasttext if model_type == "fasttext" else None,
|
324 |
+
total_docs_in_corpus=total_num_documents_for_fasttext if model_type == "fasttext" else 0
|
325 |
)
|
326 |
else:
|
327 |
semantic_sim = np.nan
|
pipeline/process.py
CHANGED
@@ -1,10 +1,48 @@, @@ -47,10 +85,10 @@, @@ -58,77 +96,78 @@, @@ -177,11 +216,13 @@, @@ -261,14 +302,26 @@ def process_texts(...)
Removed old-side lines (many truncated in this rendering): the previous import of get_model_and_device alone; the former semantic-model loading path, which called get_model_and_device(model_id=model_name), collected all_texts = list(text_data.values()) and trained a fallback model with train_fasttext_model(all_texts, dim=100, epoch=5); the old progress_callback(0.25/0.3, ...) status messages; the old user-facing warnings for "is not a valid model identifier" and "CUDA out of memory" errors; the old segment assignment that stored segments without cleaning; and the old compute_all_metrics call that passed model=..., device=st_device and model_type. The replacement appears in the new version of process.py below.
1 |
import pandas as pd
|
2 |
from typing import Dict, List, Tuple
|
3 |
from .metrics import compute_all_metrics
|
4 |
+
from .semantic_embedding import get_model_and_device
|
5 |
+
from .fasttext_embedding import load_fasttext_model # Added for custom fasttext
|
6 |
from .tokenize import tokenize_texts
|
7 |
import logging
|
8 |
from itertools import combinations
|
9 |
+
import re
|
10 |
+
|
11 |
+
# Define FASTTEXT_MODEL_ID if not already defined (it should be, from semantic_embedding or globally)
|
12 |
+
# For safety, let's assume it might be needed here directly for conditional logic
|
13 |
+
FASTTEXT_MODEL_ID = "fasttext-tibetan" # Ensure this matches the ID used elsewhere
|
14 |
+
|
15 |
+
|
16 |
+
def get_botok_tokens_for_single_text(text: str, mode: str = "syllable") -> list[str]:
|
17 |
+
"""
|
18 |
+
A wrapper around tokenize_texts to make it suitable for tokenize_fn
|
19 |
+
in generate_embeddings, which expects a function that tokenizes a single string.
|
20 |
+
Accepts a 'mode' argument ('syllable' or 'word') to pass to tokenize_texts.
|
21 |
+
"""
|
22 |
+
if not text.strip():
|
23 |
+
return []
|
24 |
+
# Pass the mode to tokenize_texts
|
25 |
+
tokenized_list_of_lists = tokenize_texts([text], mode=mode)
|
26 |
+
if tokenized_list_of_lists and tokenized_list_of_lists[0]:
|
27 |
+
return tokenized_list_of_lists[0]
|
28 |
+
return []
|
29 |
+
|
30 |
+
def clean_tibetan_text_for_fasttext(text: str) -> str:
|
31 |
+
"""
|
32 |
+
Applies cleaning steps to Tibetan text similar to those in FastText training:
|
33 |
+
- Removes lnX/pX page/line markers.
|
34 |
+
- Normalizes double tsheg to single tsheg.
|
35 |
+
- Normalizes whitespace.
|
36 |
+
"""
|
37 |
+
# Remove lnX/pX markers
|
38 |
+
cleaned_text = re.sub(r"\s*(?:[lL][nN]|[pP])\d{1,3}[abAB]?\s*", " ", text)
|
39 |
+
# Normalize double tsheg
|
40 |
+
cleaned_text = re.sub(r"།\s*།", "།", cleaned_text)
|
41 |
+
# Normalize spaces (multiple spaces to single, strip leading/trailing)
|
42 |
+
cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()
|
43 |
+
return cleaned_text
|
44 |
+
|
45 |
+
|
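For illustration, roughly what the cleaning does to a hypothetical input (exact spacing may differ):

```python
# Hypothetical input; the ln1/p23a page and line markers are placeholders.
sample = "ln1 བཀྲ་ཤིས། ། p23a  བདེ་ལེགས"
print(clean_tibetan_text_for_fasttext(sample))
# -> "བཀྲ་ཤིས། བདེ་ལེགས": the ln1/p23a markers are dropped, the double "། །"
#    collapses to "།", and the remaining whitespace is normalized.
```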
46 |
|
47 |
logger = logging.getLogger(__name__)
|
48 |
|
|
|
85 |
RuntimeError: If the botok tokenizer fails to initialize.
|
86 |
ValueError: If the input files cannot be processed or if metrics computation fails.
|
87 |
"""
|
88 |
+
# Initialize model and model_type variables
|
89 |
+
model, model_type = None, None # st_device removed
|
90 |
model_warning = ""
|
91 |
+
|
92 |
# Update progress if callback provided
|
93 |
if progress_callback is not None:
|
94 |
try:
|
|
|
96 |
except Exception as e:
|
97 |
logger.warning(f"Progress callback error (non-critical): {e}")
|
98 |
# Continue processing even if progress reporting fails
|
99 |
+
|
100 |
# Load semantic model if enabled
|
101 |
if enable_semantic:
|
102 |
logger.info("Semantic similarity enabled. Loading embedding model...")
|
103 |
try:
|
104 |
logger.info("Using model: %s", model_name)
|
105 |
+
|
106 |
+
if model_name == FASTTEXT_MODEL_ID: # FASTTEXT_MODEL_ID is 'fasttext-tibetan'
|
107 |
+
logger.info(f"Attempting to load custom FastText model: {model_name}")
|
|
|
108 |
if progress_callback is not None:
|
109 |
try:
|
110 |
+
progress_callback(0.25, desc=f"Loading custom FastText model: {model_name}...")
|
111 |
except Exception as e:
|
112 |
+
logger.warning(f"Progress callback error (non-critical): {e}")
|
|
|
|
|
113 |
|
114 |
+
loaded_custom_model = load_fasttext_model(model_id=model_name) # model_id is expected to be path or key by this func
|
115 |
+
if loaded_custom_model:
|
116 |
+
model = loaded_custom_model
|
117 |
+
model_type = "fasttext"
|
118 |
+
logger.info(f"Custom FastText model '{model_name}' loaded successfully.")
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
119 |
if progress_callback is not None:
|
120 |
try:
|
121 |
+
progress_callback(0.3, desc=f"Custom FastText model '{model_name}' loaded.")
|
122 |
except Exception as e:
|
123 |
+
logger.warning(f"Progress callback error (non-critical): {e}")
|
124 |
else:
|
125 |
+
model_warning = f"Custom FastText model ('{model_name}') failed to load. Semantic similarity will be disabled."
|
126 |
+
logger.warning(model_warning)
|
127 |
+
enable_semantic = False
|
128 |
+
|
129 |
+
elif model_name == "facebook-fasttext-pretrained":
|
130 |
+
logger.info(f"Attempting to load Facebook FastText model: {model_name}")
|
131 |
+
if progress_callback is not None:
|
132 |
+
try:
|
133 |
+
progress_callback(0.25, desc=f"Loading Facebook FastText model: {model_name}...")
|
134 |
+
except Exception as e:
|
135 |
+
logger.warning(f"Progress callback error (non-critical): {e}")
|
136 |
+
|
137 |
+
fb_model, fb_model_type = get_model_and_device(model_id=model_name) # from semantic_embedding
|
138 |
+
if fb_model:
|
139 |
+
model = fb_model
|
140 |
+
model_type = fb_model_type # Should be "fasttext"
|
141 |
+
logger.info(f"Facebook FastText model '{model_name}' (type: {model_type}) loaded successfully.")
|
142 |
if progress_callback is not None:
|
143 |
try:
|
144 |
+
progress_callback(0.3, desc=f"Facebook FastText model '{model_name}' loaded.")
|
145 |
except Exception as e:
|
146 |
logger.warning(f"Progress callback error (non-critical): {e}")
|
147 |
+
else:
|
148 |
+
model_warning = f"Facebook FastText model ('{model_name}') failed to load. Semantic similarity will be disabled."
|
149 |
+
logger.warning(model_warning)
|
150 |
+
enable_semantic = False
|
151 |
+
|
152 |
+
else: # Any other model_name is unsupported
|
153 |
+
model_warning = f"Unsupported model_name: '{model_name}'. Semantic similarity will be disabled. Supported models are '{FASTTEXT_MODEL_ID}' and 'facebook-fasttext-pretrained'."
|
154 |
+
logger.warning(model_warning)
|
155 |
+
enable_semantic = False
|
156 |
if progress_callback is not None:
|
157 |
try:
|
158 |
+
progress_callback(0.3, desc="Unsupported model, continuing without semantic similarity.")
|
159 |
except Exception as e:
|
160 |
logger.warning(f"Progress callback error (non-critical): {e}")
|
161 |
+
|
162 |
+
except Exception as e: # General catch-all for unexpected errors during model loading attempts
|
163 |
+
model_warning = f"An unexpected error occurred while attempting to load model '{model_name}': {e}. Semantic similarity will be disabled."
|
164 |
+
logger.error(model_warning, exc_info=True)
|
165 |
+
enable_semantic = False
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
166 |
if progress_callback is not None:
|
167 |
try:
|
168 |
+
progress_callback(0.3, desc="Error loading model, continuing without semantic similarity.")
|
169 |
+
except Exception as e_cb:
|
170 |
+
logger.warning(f"Progress callback error (non-critical): {e_cb}")
|
171 |
else:
|
172 |
logger.info("Semantic similarity disabled. Skipping model loading.")
|
173 |
if progress_callback is not None:
|
|
|
216 |
|
217 |
for idx, seg in enumerate(segments):
|
218 |
seg_id = f"{fname}|chapter {idx+1}"
|
219 |
+
cleaned_seg = clean_tibetan_text_for_fasttext(seg)
|
220 |
+
segment_texts[seg_id] = cleaned_seg
|
221 |
else:
|
222 |
# No chapter markers found, treat entire file as one segment
|
223 |
seg_id = f"{fname}|chapter 1"
|
224 |
+
cleaned_content = clean_tibetan_text_for_fasttext(content.strip())
|
225 |
+
segment_texts[seg_id] = cleaned_content
|
226 |
fallback = True
|
227 |
|
228 |
# Generate warning if no chapter markers found
|
|
|
302 |
|
303 |
try:
|
304 |
# Compute metrics for this chapter pair
|
305 |
+
tokenizer_for_fasttext = None
|
306 |
+
current_model_type = model_type if 'model_type' in locals() else "sentence_transformer"
|
307 |
+
if current_model_type == "fasttext":
|
308 |
+
# Tokenizer setup for FastText model:
|
309 |
+
def fasttext_tokenizer_adapter(text_segment: str) -> List[str]:
|
310 |
+
cleaned_segment = clean_tibetan_text_for_fasttext(text_segment)
|
311 |
+
# Use word-level tokenization for the custom FastText model
|
312 |
+
return get_botok_tokens_for_single_text(cleaned_segment, mode="word")
|
313 |
+
|
314 |
+
tokenizer_for_fasttext = fasttext_tokenizer_adapter
|
315 |
+
logger.info("Using botok word-level tokenization for FastText model.")
|
316 |
+
|
317 |
pair_metrics = compute_all_metrics(
|
318 |
{seg1: segment_texts[seg1], seg2: segment_texts[seg2]},
|
319 |
+
model=model,
|
|
|
320 |
enable_semantic=enable_semantic,
|
321 |
+
model_type=model_type,
|
322 |
use_stopwords=use_stopwords,
|
323 |
+
use_lite_stopwords=use_lite_stopwords,
|
324 |
+
fasttext_tokenize_fn=tokenizer_for_fasttext
|
325 |
)
|
326 |
|
327 |
# Rename 'Text Pair' to show file stems and chapter number
|
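Taken together, a hypothetical end-to-end call from process.py into compute_all_metrics with the FastText adapter could look like the sketch below; the file names and text contents are placeholders, and the names refer to the functions defined and imported above.

```python
# Sketch only; segment texts are placeholders.
texts = {
    "A.txt|chapter 1": clean_tibetan_text_for_fasttext("..."),
    "B.txt|chapter 1": clean_tibetan_text_for_fasttext("..."),
}
model = load_fasttext_model(model_id=FASTTEXT_MODEL_ID)

def adapter(segment: str):
    cleaned = clean_tibetan_text_for_fasttext(segment)
    return get_botok_tokens_for_single_text(cleaned, mode="word")

results_df = compute_all_metrics(
    texts,
    model=model,
    enable_semantic=model is not None,
    model_type="fasttext",
    use_stopwords=True,
    use_lite_stopwords=False,
    fasttext_tokenize_fn=adapter,
)
```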
pipeline/semantic_embedding.py
CHANGED
@@ -1,7 +1,6 @@, @@ -9,190 +8,114 @@, @@ -204,59 +127,15 @@
Removed old-side lines (several truncated in this rendering): an additional import alongside the sentence_transformers SentenceTransformer import; the previous DEFAULT_MODEL_NAME and its comment; the former get_model_and_device(model_id, device_preference="auto") that selected CUDA -> MPS -> CPU, loaded either the FastText model (always on CPU) or SentenceTransformer(model_id, device=selected_device_str), and fell back to CPU when the first device failed; the former generate_embeddings(texts, model, device, model_type, ...) whose Sentence Transformer path used model.encode(texts, convert_to_tensor=True, show_progress_bar=True) and returned embeddings.cpu(); the old train_fasttext_model docstring and comments; and the __main__ example that loaded a model on CPU and embedded a few test strings, including "བཀྲ་ཤིས་བདེ་ལེགས།" and "hello world". The replacement appears in the new version of semantic_embedding.py below.
1 |
import logging
|
2 |
+
from typing import List, Any, Optional
|
3 |
+
import numpy as np # Added for type hinting Optional[np.ndarray]
|
|
|
4 |
|
5 |
# Configure logging
|
6 |
logging.basicConfig(
|
|
|
8 |
)
|
9 |
logger = logging.getLogger(__name__)
|
10 |
|
11 |
+
# Define the model ID for the Facebook FastText pretrained model
|
12 |
+
DEFAULT_MODEL_NAME = "facebook-fasttext-pretrained"
|
13 |
|
14 |
+
# FASTTEXT_MODEL_ID = "fasttext-tibetan" # Removed: Custom model loading to be handled in process.py directly
|
|
|
15 |
|
16 |
|
17 |
+
def get_model_and_device(model_id: str = DEFAULT_MODEL_NAME):
|
|
|
|
|
18 |
"""
|
19 |
+
Loads the Facebook official pre-trained FastText model for Tibetan.
|
|
|
20 |
|
21 |
Args:
|
22 |
+
model_id (str): The model ID. Must be 'facebook-fasttext-pretrained' (DEFAULT_MODEL_NAME).
|
|
|
23 |
|
24 |
Returns:
|
25 |
+
Tuple[Optional[Any], Optional[str]]:
|
26 |
+
A tuple containing the loaded FastText model and its type ("fasttext"),
|
27 |
+
or (None, None) if loading fails or model_id is unsupported.
|
28 |
"""
|
29 |
+
logger.info("Attempting to load FastText model via semantic_embedding.get_model_and_device: %s", model_id)
|
30 |
|
31 |
+
if model_id == DEFAULT_MODEL_NAME: # DEFAULT_MODEL_NAME is "facebook-fasttext-pretrained"
|
32 |
+
try:
|
33 |
+
# Importing here to minimize issues if fasttext_embedding also imports from semantic_embedding
|
34 |
+
from .fasttext_embedding import load_facebook_official_tibetan_model
|
35 |
+
|
36 |
+
model = load_facebook_official_tibetan_model()
|
37 |
+
|
38 |
+
if model:
|
39 |
+
logger.info(f"FastText model object received in get_model_and_device. Type: {type(model)}.")
|
40 |
+
try:
|
41 |
+
logger.info(f"Model dimensions: {model.get_dimension()}")
|
42 |
+
# Basic check for model validity via an expected attribute/method
|
43 |
+
if hasattr(model, 'get_word_vector'):
|
44 |
+
logger.info("Model has 'get_word_vector' method (Python API expected for fasttext.load_model results).")
|
45 |
+
except Exception as diag_e:
|
46 |
+
logger.error(f"Error during diagnostic check of FastText model '{model_id}': {diag_e}", exc_info=True)
|
47 |
+
return model, "fasttext"
|
48 |
+
else:
|
49 |
+
# This case implies load_facebook_official_tibetan_model returned None without raising an error.
|
50 |
+
logger.error(f"Model loading for '{model_id}' via load_facebook_official_tibetan_model() returned None unexpectedly.")
|
51 |
+
return None, None
|
52 |
+
except Exception as e:
|
53 |
+
logger.error(f"Failed to load or initialize FastText model '{model_id}': {e}. Semantic similarity will not be available.", exc_info=True)
|
54 |
+
return None, None
|
55 |
+
else:
|
56 |
+
logger.error(f"Unsupported model_id for get_model_and_device in semantic_embedding.py: '{model_id}'. Only '{DEFAULT_MODEL_NAME}' is supported by this function.")
|
57 |
+
return None, None
|
58 |
+
|
59 |
+
|
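load_facebook_official_tibetan_model is defined in pipeline/fasttext_embedding.py and is not part of this diff; presumably it wraps the fasttext Python API along these lines (the path is a placeholder):

```python
# Sketch of what the loader is expected to return; the model path is hypothetical.
import fasttext

model = fasttext.load_model("models/cc.bo.300.bin")
print(model.get_dimension())               # embedding dimensionality, e.g. 300
vector = model.get_word_vector("བཀྲ་ཤིས་")   # per-word (and subword) vector lookup
```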
60 |
+
def generate_embeddings(texts: List[str], model: Any, tokenize_fn=None, use_stopwords: bool = True, use_lite_stopwords: bool = False, corpus_token_freq=None, doc_freq_map=None, total_docs_in_corpus=0) -> Optional[np.ndarray]:
|
61 |
"""
|
62 |
+
Generates FastText embeddings for a list of texts.
|
63 |
|
64 |
Args:
|
65 |
texts (list[str]): A list of texts to embed.
|
66 |
+
model: The loaded FastText model.
|
67 |
+
tokenize_fn: Optional tokenization function for FastText (if different from default botok based).
|
68 |
+
use_stopwords (bool): Whether to filter out stopwords for FastText embeddings.
|
69 |
+
use_lite_stopwords (bool): Whether to use the 'lite' stopwords list.
|
70 |
+
corpus_token_freq: Corpus-wide term frequencies for TF-IDF weighted FastText.
|
71 |
+
doc_freq_map: Document frequency map for TF-IDF weighted FastText.
|
72 |
+
total_docs_in_corpus: Total documents in corpus for TF-IDF weighted FastText.
|
73 |
|
74 |
Returns:
|
75 |
+
Optional[np.ndarray]: A numpy array containing the embeddings. Returns None if generation fails.
|
76 |
"""
|
77 |
if not texts:
|
78 |
logger.warning(
|
79 |
+
"No texts provided to generate_embeddings. Returning None."
|
80 |
)
|
81 |
+
return None
|
82 |
|
83 |
+
logger.info(f"Generating FastText embeddings for {len(texts)} texts...")
|
84 |
|
85 |
+
try:
|
86 |
+
from .fasttext_embedding import get_batch_embeddings
|
87 |
+
|
88 |
+
stopwords_set = None
|
89 |
+
if use_stopwords:
|
90 |
+
if use_lite_stopwords:
|
91 |
+
from .stopwords_lite_bo import TIBETAN_STOPWORDS_LITE_SET
|
92 |
+
stopwords_set = TIBETAN_STOPWORDS_LITE_SET
|
93 |
+
else:
|
94 |
+
from .stopwords_bo import TIBETAN_STOPWORDS_SET
|
95 |
+
stopwords_set = TIBETAN_STOPWORDS_SET
|
96 |
+
|
97 |
+
embeddings = get_batch_embeddings(
|
98 |
+
texts,
|
99 |
+
model,
|
100 |
+
tokenize_fn=tokenize_fn,
|
101 |
+
use_stopwords=use_stopwords,
|
102 |
+
stopwords_set=stopwords_set,
|
103 |
+
corpus_token_freq=corpus_token_freq,
|
104 |
+
doc_freq_map=doc_freq_map,
|
105 |
+
total_docs_in_corpus=total_docs_in_corpus
|
106 |
+
)
|
107 |
+
if embeddings is None:
|
108 |
+
logger.error(f"get_batch_embeddings returned None for {len(texts)} texts. First few: {texts[:2]}")
|
109 |
+
return None
|
110 |
+
|
111 |
+
logger.info("FastText embeddings generated with shape: %s", str(embeddings.shape))
|
112 |
+
return embeddings
|
113 |
+
except ImportError:
|
114 |
+
logger.error("Required FastText modules not found. Please ensure 'fasttext' and its dependencies are correctly installed.")
|
115 |
+
return None
|
116 |
+
except Exception as e:
|
117 |
+
logger.error(f"An unexpected error occurred during FastText embedding generation: {e}", exc_info=True)
|
118 |
+
return None
|
119 |
|
120 |
|
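A hedged usage sketch of the new generate_embeddings signature; the model is assumed to be a loaded FastText model, and the corpus statistics are placeholders that compute_all_metrics in pipeline/metrics.py normally builds.

```python
# Placeholder corpus statistics; see compute_all_metrics in pipeline/metrics.py.
embeddings = generate_embeddings(
    texts=["བཀྲ་ཤིས་བདེ་ལེགས།"],
    model=model,                 # a loaded FastText model (assumed)
    tokenize_fn=None,            # fall back to the default botok-based tokenization
    use_stopwords=True,
    use_lite_stopwords=False,
    corpus_token_freq={},
    doc_freq_map={},
    total_docs_in_corpus=0,
)
if embeddings is not None:
    print(embeddings.shape)      # expected: one row per input text
```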
121 |
def train_fasttext_model(corpus_texts: List[str], **kwargs):
|
|
|
127 |
**kwargs: Additional parameters for training (dim, epoch, etc.)
|
128 |
|
129 |
Returns:
|
130 |
+
Trained model and path to the model file (Note: current implementation returns only model object)
|
131 |
+
""" # Docstring updated for return type
|
132 |
try:
|
133 |
from .fasttext_embedding import prepare_corpus_file, train_fasttext_model as train_ft
|
134 |
|
|
|
135 |
corpus_path = prepare_corpus_file(corpus_texts)
|
|
|
|
|
136 |
model = train_ft(corpus_path=corpus_path, **kwargs)
|
137 |
|
138 |
+
return model # Returns model object, not path as previously suggested by older docstring
|
139 |
except ImportError:
|
140 |
logger.error("FastText module not found. Please install it with 'pip install fasttext'.")
|
141 |
+
raise # Re-raising to signal critical failure if training components are missing
|
pipeline/tibetan_stopwords.py
ADDED
@@ -0,0 +1,41 @@
|
import logging

logger = logging.getLogger(__name__)


def get_stopwords(use_lite: bool = False) -> set:
    """
    Returns a set of Tibetan stopwords by importing them from the respective .py files.

    Args:
        use_lite (bool): If True, returns a smaller, less aggressive list of stopwords
                         from stopwords_lite_bo.py.
                         Otherwise, returns the full list from stopwords_bo.py.

    Returns:
        set: A set of stopword strings. Returns an empty set on failure.
    """
    stopwords_set = set()
    # Name of the source module, used in the log messages below.
    source_module_name = ".stopwords_lite_bo" if use_lite else ".stopwords_bo"
    try:
        if use_lite:
            from .stopwords_lite_bo import STOPWORDS
            stopwords_set = STOPWORDS
        else:
            from .stopwords_bo import STOPWORDS
            stopwords_set = STOPWORDS

        logger.info(f"Successfully loaded {len(stopwords_set)} stopwords from {source_module_name.lstrip('.')}.py")
    except ImportError:
        logger.error(
            f"Failed to import STOPWORDS from {source_module_name.lstrip('.')}.py. "
            f"Ensure the file exists in the 'pipeline' directory, is a Python module (ends in .py), "
            f"and is importable (e.g., no syntax errors)."
        )
    except AttributeError:
        logger.error(
            f"Variable 'STOPWORDS' (all caps) not found in {source_module_name.lstrip('.')}.py. "
            f"Please ensure the stopword set is defined with this name within the module."
        )
    except Exception as e:
        logger.error(f"An unexpected error occurred while loading stopwords from {source_module_name.lstrip('.')}.py: {e}")

    return stopwords_set
|
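A hypothetical usage of the helper from elsewhere in the package:

```python
from pipeline.tibetan_stopwords import get_stopwords  # import path assumed

standard_stops = get_stopwords(use_lite=False)
lite_stops = get_stopwords(use_lite=True)
print(len(standard_stops), len(lite_stops))
```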
pipeline/tokenize.py
CHANGED
@@ -39,7 +39,7 @@ def _get_text_hash(text: str) -> str:, @@ -64,6 +64,10 @@, @@ -73,19 +77,84 @@, @@ -95,9 +164,9 @@ def tokenize_texts(...)
Removed old-side lines (several truncated in this rendering): the previous signature def tokenize_texts(texts: List[str]) -> List[List[str]] without the mode parameter, the cache key computed from the raw text alone, the old single-path botok tokenization body, and the old cache-hit, cache-add and error log messages that did not mention the tokenization mode. The replacement appears in the new version of tokenize.py below.
39 |
return hashlib.md5(text.encode('utf-8')).hexdigest()
|
40 |
|
41 |
|
42 |
+
def tokenize_texts(texts: List[str], mode: str = "syllable") -> List[List[str]]:
|
43 |
"""
|
44 |
Tokenizes a list of raw Tibetan texts using botok, with caching for performance.
|
45 |
|
|
|
64 |
)
|
65 |
|
66 |
tokenized_texts_list = []
|
67 |
+
|
68 |
+
if mode not in ["word", "syllable"]:
|
69 |
+
logger.warning(f"Invalid tokenization mode: '{mode}'. Defaulting to 'syllable'.")
|
70 |
+
mode = "syllable"
|
71 |
|
72 |
# Process each text
|
73 |
for text_content in texts:
|
|
|
77 |
continue
|
78 |
|
79 |
# Generate hash for cache lookup
|
80 |
+
cache_key_string = text_content + f"_{mode}" # Include mode in string for hashing
|
81 |
+
text_hash = _get_text_hash(cache_key_string)
|
82 |
|
83 |
# Check if we have this text in cache
|
84 |
if text_hash in _tokenization_cache:
|
85 |
# Cache hit - use cached tokens
|
86 |
tokens = _tokenization_cache[text_hash]
|
87 |
+
logger.debug(f"Cache hit for text hash {text_hash[:8]}... (mode: {mode})")
|
88 |
else:
|
89 |
# Cache miss - tokenize and store in cache
|
90 |
try:
|
91 |
+
current_tokens = []
|
92 |
+
if BOTOK_TOKENIZER:
|
93 |
+
raw_botok_items = list(BOTOK_TOKENIZER.tokenize(text_content))
|
94 |
+
|
95 |
+
if mode == "word":
|
96 |
+
for item_idx, w in enumerate(raw_botok_items):
|
97 |
+
if hasattr(w, 'text') and isinstance(w.text, str):
|
98 |
+
token_text = w.text.strip()
|
99 |
+
if token_text: # Ensure token is not empty or just whitespace
|
100 |
+
current_tokens.append(token_text)
|
101 |
+
# Optionally log if w.text is not a string or missing, for debugging
|
102 |
+
# elif w.text is not None:
|
103 |
+
# logger.debug(f"Token item {item_idx} has non-string text {type(w.text)} for hash {text_hash[:8]}. Skipping word.")
|
104 |
+
# else:
|
105 |
+
# logger.debug(f"Token item {item_idx} missing text attribute for hash {text_hash[:8]}. Skipping word.")
|
106 |
+
logger.debug(
|
107 |
+
f"WORD TOKENS FORMED for hash {text_hash[:8]} (mode: {mode}, first 30): "
|
108 |
+
f"{current_tokens[:30]}"
|
109 |
+
)
|
110 |
+
elif mode == "syllable":
|
111 |
+
# This is the original syllable extraction logic
|
112 |
+
for item_idx, w in enumerate(raw_botok_items):
|
113 |
+
if hasattr(w, 'syls') and w.syls:
|
114 |
+
for syl_idx, syl_item in enumerate(w.syls):
|
115 |
+
syllable_to_process = None
|
116 |
+
if isinstance(syl_item, str):
|
117 |
+
syllable_to_process = syl_item
|
118 |
+
elif isinstance(syl_item, list):
|
119 |
+
try:
|
120 |
+
syllable_to_process = "".join(syl_item)
|
121 |
+
except TypeError:
|
122 |
+
logger.warning(
|
123 |
+
f"Syllable item in w.syls was a list, but could not be joined (non-string elements?): {syl_item} "
|
124 |
+
f"from word item {item_idx} (text: {getattr(w, 'text', 'N/A')}), syl_idx {syl_idx} "
|
125 |
+
f"for hash {text_hash[:8]}. Skipping this syllable."
|
126 |
+
)
|
127 |
+
continue
|
128 |
+
|
129 |
+
if syllable_to_process is not None:
|
130 |
+
stripped_syl = syllable_to_process.strip()
|
131 |
+
if stripped_syl:
|
132 |
+
current_tokens.append(stripped_syl)
|
133 |
+
elif syl_item is not None:
|
134 |
+
logger.warning(
|
135 |
+
f"Unexpected type for syllable item (neither str nor list): {type(syl_item)} ('{str(syl_item)[:100]}') "
|
136 |
+
f"from word item {item_idx} (text: {getattr(w, 'text', 'N/A')}), syl_idx {syl_idx} "
|
137 |
+
f"for hash {text_hash[:8]}. Skipping this syllable."
|
138 |
+
)
|
139 |
+
elif hasattr(w, 'text') and w.text: # Fallback if no 'syls' but in syllable mode
|
140 |
+
if isinstance(w.text, str):
|
141 |
+
token_text = w.text.strip()
|
142 |
+
if token_text:
|
143 |
+
current_tokens.append(token_text) # Treat as a single syllable/token
|
144 |
+
elif w.text is not None:
|
145 |
+
logger.warning(
|
146 |
+
f"Unexpected type for w.text (in syllable mode fallback): {type(w.text)} ('{str(w.text)[:100]}') "
|
147 |
+
f"for item {item_idx} (POS: {getattr(w, 'pos', 'N/A')}) "
|
148 |
+
f"for hash {text_hash[:8]}. Skipping this token."
|
149 |
+
)
|
150 |
+
logger.debug(
|
151 |
+
f"SYLLABLE TOKENS FORMED for hash {text_hash[:8]} (mode: {mode}, first 30): "
|
152 |
+
f"{current_tokens[:30]}"
|
153 |
+
)
|
154 |
+
tokens = current_tokens
|
155 |
+
else:
|
156 |
+
logger.error(f"BOTOK_TOKENIZER is None for text hash {text_hash[:8]}, cannot tokenize (mode: {mode}).")
|
157 |
+
tokens = []
|
158 |
|
159 |
# Store in cache if not empty
|
160 |
if tokens:
|
|
|
164 |
_tokenization_cache.pop(next(iter(_tokenization_cache)))
|
165 |
|
166 |
_tokenization_cache[text_hash] = tokens
|
167 |
+
logger.debug(f"Added tokens to cache with hash {text_hash[:8]}... (mode: {mode})")
|
168 |
except Exception as e:
|
169 |
+
logger.error(f"Error tokenizing text (mode: {mode}): {e}")
|
170 |
tokens = []
|
171 |
|
172 |
tokenized_texts_list.append(tokens)
|
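The word/syllable split above assumes botok token objects that expose .text and .syls. A minimal sketch of the two modes outside the cache machinery, mirroring the code above rather than reproducing the exact pipeline behaviour:

```python
# Sketch only; botok usage follows the same pattern as tokenize_texts above.
from botok import WordTokenizer

wt = WordTokenizer()
items = wt.tokenize("བཀྲ་ཤིས་བདེ་ལེགས།")

# "word" mode: keep each token's surface text.
word_tokens = [w.text.strip() for w in items
               if getattr(w, "text", None) and w.text.strip()]

# "syllable" mode: flatten each token's syllables.
syllables = []
for w in items:
    for syl in (getattr(w, "syls", None) or []):
        s = ("".join(syl) if isinstance(syl, list) else str(syl)).strip()
        if s:
            syllables.append(s)
```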
pipeline/visualize.py
CHANGED
@@ -149,10 +149,15 @@ def generate_word_count_chart(word_counts_df: pd.DataFrame):
Removed old-side lines (truncated in this rendering): the previous xaxis=dict(type="category", ...) layout block, which lacked the automargin, yaxis, autosize and margin settings added below. The surrounding font, legend_title_text and chapter-tick code is unchanged.
149 |
font=dict(size=14),
|
150 |
legend_title_text="Filename",
|
151 |
xaxis=dict(
|
152 |
+
type="category", # Treat chapter numbers as categories
|
153 |
+
automargin=True # Automatically adjust margin for x-axis labels/title
|
154 |
+
),
|
155 |
+
yaxis=dict(
|
156 |
+
rangemode='tozero', # Ensure y-axis starts at 0 and includes max value
|
157 |
+
automargin=True # Automatically adjust margin for y-axis labels/title
|
158 |
+
),
|
159 |
+
autosize=True, # Keep for responsiveness in Gradio
|
160 |
+
margin=dict(l=80, r=50, b=100, t=50, pad=4) # Keep existing base margins
|
161 |
)
|
162 |
# Ensure x-axis ticks are shown for all chapter numbers present
|
163 |
all_chapter_numbers = sorted(word_counts_df["ChapterNumber"].unique())
|
results.csv
ADDED
@@ -0,0 +1,97 @@
|
1 |
+
Text Pair,Jaccard Similarity (%),Normalized LCS,Semantic Similarity,TF-IDF Cosine Sim,Chapter
|
2 |
+
Ngari 9.txt vs Nepal12.txt,100.0,1.0,,1.0,1
|
3 |
+
Ngari 9.txt vs Nepal12.txt,100.0,1.0,,1.0,2
|
4 |
+
Ngari 9.txt vs Nepal12.txt,100.0,1.0,,1.0,3
|
5 |
+
Ngari 9.txt vs Nepal12.txt,46.42857142857143,0.6112115732368897,,0.8407944127395544,4
|
6 |
+
Ngari 9.txt vs Nepal12.txt,40.42553191489361,0.5191256830601093,,0.5026984774848224,5
|
7 |
+
Ngari 9.txt vs Nepal12.txt,47.28260869565217,0.6107784431137725,,0.8380742568060093,6
Ngari 9.txt vs Nepal12.txt,49.29178470254957,0.5285565939771547,,0.8409605475909782,7
Ngari 9.txt vs Nepal12.txt,46.07218683651805,0.6053169734151329,,0.9306016557862976,8
Ngari 9.txt vs Nepal12.txt,51.7557251908397,0.7000429737859906,,0.9600630844581352,9
Ngari 9.txt vs Nepal12.txt,52.760736196319016,0.710204081632653,,0.9135878707769712,10
Ngari 9.txt vs Nepal12.txt,14.92842535787321,0.08302507192766133,,0.698638890914812,11
Ngari 9.txt vs Nepal12.txt,0.0,0.0,,0.0,12
Ngari 9.txt vs Nepal12.txt,0.0,0.0,,0.0,13
Ngari 9.txt vs Nepal12.txt,0.0,0.0,,0.0,14
Ngari 9.txt vs Nepal12.txt,0.0,0.0,,0.0,15
Ngari 9.txt vs Nepal12.txt,100.0,1.0,,1.0,16
Ngari 9.txt vs LTWA.txt,0.0,0.0,,0.0,1
Ngari 9.txt vs LTWA.txt,0.0,0.0,,0.0,2
Ngari 9.txt vs LTWA.txt,0.0,0.0,,0.0,3
Ngari 9.txt vs LTWA.txt,47.752808988764045,0.603648424543947,,0.8414077093281586,4
Ngari 9.txt vs LTWA.txt,48.40764331210191,0.6094808126410836,,0.6526135410649626,5
Ngari 9.txt vs LTWA.txt,49.13294797687861,0.6297872340425532,,0.8252183235183391,6
Ngari 9.txt vs LTWA.txt,35.53459119496855,0.4071058475203553,,0.8403529862077375,7
Ngari 9.txt vs LTWA.txt,45.0,0.601965601965602,,0.9452297806160965,8
Ngari 9.txt vs LTWA.txt,37.89126853377265,0.29986320109439124,,0.8760838478443608,9
Ngari 9.txt vs LTWA.txt,51.632047477744806,0.6395222584147665,,0.9317016829510952,10
Ngari 9.txt vs LTWA.txt,14.979757085020243,0.10742761225346202,,0.7111189597708231,11
Ngari 9.txt vs LTWA.txt,0.0,0.0,,0.0,12
Ngari 9.txt vs LTWA.txt,0.0,0.0,,0.0,13
Ngari 9.txt vs LTWA.txt,0.0,0.0,,0.0,14
Ngari 9.txt vs LTWA.txt,0.0,0.0,,0.0,15
Ngari 9.txt vs LTWA.txt,0.0,0.0,,0.0,16
Ngari 9.txt vs Leiden.txt,0.0,0.0,,0.0,1
Ngari 9.txt vs Leiden.txt,0.0,0.0,,0.0,2
Ngari 9.txt vs Leiden.txt,0.0,0.0,,0.0,3
Ngari 9.txt vs Leiden.txt,41.340782122905026,0.5282331511839709,,0.8525095366316284,4
Ngari 9.txt vs Leiden.txt,36.80555555555556,0.4734042553191489,,0.5634721694429372,5
Ngari 9.txt vs Leiden.txt,44.047619047619044,0.4728132387706856,,0.7698959290709281,6
Ngari 9.txt vs Leiden.txt,35.67251461988304,0.3208020050125313,,0.784262930792386,7
Ngari 9.txt vs Leiden.txt,41.01123595505618,0.4241099312929419,,0.9275267086147868,8
Ngari 9.txt vs Leiden.txt,40.31209362808843,0.20184790334044064,,0.9076572014074583,9
Ngari 9.txt vs Leiden.txt,50.445103857566764,0.6045733407696597,,0.9284684903895061,10
Ngari 9.txt vs Leiden.txt,16.363636363636363,0.08736942070275404,,0.6999802304139516,11
Ngari 9.txt vs Leiden.txt,0.0,0.0,,0.0,12
Ngari 9.txt vs Leiden.txt,0.0,0.0,,0.0,13
Ngari 9.txt vs Leiden.txt,0.0,0.0,,0.0,14
Ngari 9.txt vs Leiden.txt,0.0,0.0,,0.0,15
Ngari 9.txt vs Leiden.txt,0.0,0.0,,0.0,16
Nepal12.txt vs LTWA.txt,0.0,0.0,,0.0,1
Nepal12.txt vs LTWA.txt,0.0,0.0,,0.0,2
Nepal12.txt vs LTWA.txt,0.0,0.0,,0.0,3
Nepal12.txt vs LTWA.txt,56.493506493506494,0.6959706959706959,,0.8321482637014176,4
Nepal12.txt vs LTWA.txt,39.71631205673759,0.5386666666666666,,0.7104447145077406,5
Nepal12.txt vs LTWA.txt,48.795180722891565,0.5898004434589801,,0.8168699067131293,6
Nepal12.txt vs LTWA.txt,34.954407294832826,0.3365548607163161,,0.861391898750807,7
Nepal12.txt vs LTWA.txt,51.41509433962265,0.5873239436619718,,0.9310750730815768,8
Nepal12.txt vs LTWA.txt,41.18705035971223,0.3156208277703605,,0.9075961630628558,9
Nepal12.txt vs LTWA.txt,60.066006600660074,0.7040533037201555,,0.921390350997517,10
Nepal12.txt vs LTWA.txt,63.6986301369863,0.7454220634211701,,0.9803189694519824,11
Nepal12.txt vs LTWA.txt,48.275862068965516,0.5102639296187683,,0.7725258306356406,12
Nepal12.txt vs LTWA.txt,58.203125,0.7364921030756443,,0.9543942889292814,13
Nepal12.txt vs LTWA.txt,41.732283464566926,0.4332449160035367,,0.8497746214132795,14
Nepal12.txt vs LTWA.txt,17.983651226158038,0.1474820143884892,,0.5779105517118261,15
Nepal12.txt vs LTWA.txt,0.0,0.0,,0.0,16
Nepal12.txt vs Leiden.txt,0.0,0.0,,0.0,1
Nepal12.txt vs Leiden.txt,0.0,0.0,,0.0,2
Nepal12.txt vs Leiden.txt,0.0,0.0,,0.0,3
Nepal12.txt vs Leiden.txt,57.14285714285714,0.6788617886178862,,0.8403894964769358,4
Nepal12.txt vs Leiden.txt,38.793103448275865,0.4935064935064935,,0.4684416871587978,5
Nepal12.txt vs Leiden.txt,60.416666666666664,0.6386138613861386,,0.8441982917223785,6
Nepal12.txt vs Leiden.txt,43.24324324324324,0.33632734530938124,,0.8839876637274263,7
Nepal12.txt vs Leiden.txt,50.8235294117647,0.4953338119167265,,0.9373191281412603,8
Nepal12.txt vs Leiden.txt,44.03927068723703,0.2242042672263029,,0.9196700291527228,9
Nepal12.txt vs Leiden.txt,67.59581881533101,0.7226027397260274,,0.9462708958951278,10
Nepal12.txt vs Leiden.txt,60.42780748663101,0.7003094501309212,,0.9722895878422901,11
Nepal12.txt vs Leiden.txt,23.502304147465438,0.27245508982035926,,0.6893488630692246,12
Nepal12.txt vs Leiden.txt,67.08333333333333,0.7506382978723404,,0.9466019120384076,13
Nepal12.txt vs Leiden.txt,42.67782426778243,0.418426103646833,,0.8023010077421123,14
Nepal12.txt vs Leiden.txt,31.17206982543641,0.2664756446991404,,0.757778410785804,15
Nepal12.txt vs Leiden.txt,0.0,0.0,,0.0,16
LTWA.txt vs Leiden.txt,53.5064935064935,0.6359163591635917,,0.9623337315161734,1
LTWA.txt vs Leiden.txt,60.909090909090914,0.7578659370725034,,0.8852155398192683,2
LTWA.txt vs Leiden.txt,64.1025641025641,0.7001044932079414,,0.8986878289296542,3
LTWA.txt vs Leiden.txt,51.21951219512195,0.6568265682656826,,0.8596233249504219,4
LTWA.txt vs Leiden.txt,40.0,0.5298701298701298,,0.6287776677298036,5
LTWA.txt vs Leiden.txt,46.308724832214764,0.5415549597855228,,0.7796920776498958,6
LTWA.txt vs Leiden.txt,43.233082706766915,0.49545136459062283,,0.8819142330949857,7
LTWA.txt vs Leiden.txt,54.03050108932462,0.5373230373230373,,0.9477445373252964,8
LTWA.txt vs Leiden.txt,38.75598086124402,0.1898707353252808,,0.887072472781142,9
LTWA.txt vs Leiden.txt,66.32996632996633,0.7823310271420969,,0.9693004524579277,10
LTWA.txt vs Leiden.txt,63.537906137184116,0.754516983859311,,0.9830176756030125,11
LTWA.txt vs Leiden.txt,24.299065420560748,0.18152350081037277,,0.6278532648577805,12
LTWA.txt vs Leiden.txt,60.1593625498008,0.7367521367521368,,0.9381662329597793,13
LTWA.txt vs Leiden.txt,59.44444444444444,0.6746987951807228,,0.8771500136505623,14
LTWA.txt vs Leiden.txt,35.37735849056604,0.39255014326647564,,0.6834100468628878,15
LTWA.txt vs Leiden.txt,60.45081967213115,0.6875444839857652,,0.9482911929631709,16

user_guide.md
ADDED
@@ -0,0 +1,190 @@
# Tibetan Text Metrics Web Application User Guide

## Introduction

Welcome to the Tibetan Text Metrics Web Application! This user-friendly tool allows you to analyze textual similarities and variations in Tibetan manuscripts using multiple computational approaches. The application provides a graphical interface to the core functionalities of the Tibetan Text Metrics (TTM) project.

## Getting Started

### System Requirements

- Modern web browser (Chrome, Firefox, Safari, or Edge)
- For local installation: Python 3.10 or newer
- Sufficient RAM for processing large texts (4GB minimum, 8GB recommended)

### Installation and Setup

#### Online Demo

The easiest way to try the application is through our Hugging Face Spaces demo:
[daniel-wojahn/ttm-webapp-hf](https://huggingface.co/spaces/daniel-wojahn/ttm-webapp-hf)

Note: The free tier of Hugging Face Spaces may have performance limitations compared to running locally.

#### Local Installation

1. Clone the repository:
   ```bash
   git clone https://github.com/daniel-wojahn/tibetan-text-metrics.git
   cd tibetan-text-metrics/webapp
   ```

2. Create and activate a virtual environment:
   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```

4. Run the application:
   ```bash
   python app.py
   ```

5. Open your browser and navigate to:
   ```
   http://localhost:7860
   ```

## Using the Application

### Step 1: Upload Your Tibetan Text Files

1. Click the "Upload Tibetan .txt files" button to select one or more `.txt` files containing Tibetan text.
2. Files should be in UTF-8 or UTF-16 encoding.
3. Maximum file size: 10MB per file (for optimal performance, use files under 1MB).
4. For best results, your texts should be segmented into chapters/sections using the Tibetan marker '༈' (*sbrul shad*); a small preview script is sketched after this list.
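
Before uploading, you can check how a file will be split into chapters. The sketch below is illustrative only: the file name is an example, and the app's own segmentation logic (e.g. how empty segments or text before the first marker are handled) may differ in detail.

```python
# Minimal sketch: preview how many chapters the sbrul shad marker yields.
# "Ngari 9.txt" is just an example file name.
from pathlib import Path

MARKER = "༈"  # sbrul shad, the chapter/section marker expected by the app

text = Path("Ngari 9.txt").read_text(encoding="utf-8")
chapters = [seg.strip() for seg in text.split(MARKER) if seg.strip()]

print(f"{len(chapters)} chapter(s) detected")
for number, chapter in enumerate(chapters, start=1):
    print(number, chapter[:30], "…")
```
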
### Step 2: Configure Analysis Options

1. **Semantic Similarity**: Choose whether to compute semantic similarity metrics.
   - "Yes" (default): Includes semantic similarity in the analysis (slower but more comprehensive).
   - "No": Skips semantic similarity calculation for faster processing.

2. **Embedding Model**: Select the model to use for semantic similarity analysis.
   - **sentence-transformers/all-MiniLM-L6-v2** (default): General purpose sentence embedding model (fastest option).
   - **sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2**: Multilingual model with good performance for many languages.
   - **buddhist-nlp/buddhist-sentence-similarity**: Optimized for Buddhist text similarity.
   - **xlm-roberta-base**: Multilingual model that includes Tibetan.

3. Click the "Run Analysis" button to start processing.

### Step 3: View and Interpret Results

After processing, the application displays several visualizations and metrics:

#### Word Count Chart

Shows the number of words in each chapter/segment of each file, allowing you to compare the relative lengths of different texts.

#### Similarity Metrics

The application computes four different similarity metrics between corresponding chapters of different files:

1. **Jaccard Similarity (%)**: Measures vocabulary overlap between segments after filtering out common Tibetan stopwords. A higher percentage indicates a greater overlap in the significant vocabularies used in the two segments.

2. **Normalized LCS (Longest Common Subsequence)**: Measures the length of the longest sequence of words that appears in both text segments, maintaining their original relative order. A higher score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism.

3. **Semantic Similarity**: Uses a transformer-based model to compute the cosine similarity between the semantic embeddings of text segments. This captures similarities in meaning even when different vocabulary is used.

4. **TF-IDF Cosine Similarity**: Compares texts based on their important, characteristic terms by giving higher weight to words that are frequent within a particular segment but relatively rare across the entire collection.

#### Heatmap Visualizations

Each metric has a corresponding heatmap visualization where:
- Rows represent chapters/segments
- Columns represent text pairs being compared
- Color intensity indicates similarity (brighter = more similar)
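
If you export the results to CSV, a similar heatmap can be rebuilt outside the app. This is a rough sketch only: the column names ("Text Pair", "Chapter", "Jaccard Similarity (%)") are assumptions and should be adjusted to match the actual header of your exported file.

```python
# Rough sketch: rebuild a Jaccard heatmap from an exported results CSV.
# Column names below are assumptions -- check them against your CSV header.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("results.csv")
pivot = df.pivot(index="Chapter", columns="Text Pair", values="Jaccard Similarity (%)")

fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(pivot.values, aspect="auto", cmap="viridis")
ax.set_xticks(range(len(pivot.columns)), pivot.columns, rotation=45, ha="right")
ax.set_yticks(range(len(pivot.index)), pivot.index)
fig.colorbar(im, ax=ax, label="Jaccard Similarity (%)")
fig.tight_layout()
plt.show()
```
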
### Tips for Effective Analysis

1. **Text Segmentation**: For meaningful chapter-level comparisons, ensure your texts are segmented using the Tibetan marker '༈' (*sbrul shad*).

2. **File Naming**: Use descriptive filenames to make the comparison results easier to interpret.

3. **Model Selection**:
   - For faster processing, use the default model or disable semantic similarity.
   - For Buddhist texts, the buddhist-nlp/buddhist-sentence-similarity model may provide better results.

4. **File Size**:
   - Keep individual files under 1MB for optimal performance.
   - Very large files (>10MB) are not supported and will trigger an error.

5. **Comparing Multiple Texts**: The application requires at least two text files to compute similarity metrics.

## Understanding the Metrics

### Jaccard Similarity (%)

This metric quantifies the lexical overlap between two text segments by comparing their sets of unique words, after filtering out common Tibetan stopwords. It essentially answers the question: 'Of all the distinct, meaningful words found across these two segments, what proportion of them are present in both?'

It is calculated as:
```
(Number of common unique meaningful words) / (Total number of unique meaningful words in both texts combined) * 100
```

Jaccard Similarity is insensitive to word order and word frequency; it only cares whether a unique meaningful word is present or absent. A higher percentage indicates a greater overlap in the significant vocabularies used in the two segments.
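
Expressed as a minimal Python sketch (the token lists are placeholders; the app performs its own Tibetan tokenization and stopword filtering before this step):

```python
# Minimal sketch of the Jaccard calculation on two already-tokenized,
# stopword-filtered segments (the token lists here are illustrative only).
def jaccard_percent(tokens_a: list[str], tokens_b: list[str]) -> float:
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union) * 100

seg_a = ["རྒྱལ་པོ", "ཁྲིམས", "བཅས"]
seg_b = ["ཁྲིམས", "བཅས", "གཏན"]
print(round(jaccard_percent(seg_a, seg_b), 2))  # 2 shared / 4 unique -> 50.0
```
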
### Normalized LCS (Longest Common Subsequence)

This metric measures the length of the longest sequence of words that appears in both text segments, maintaining their original relative order. Importantly, these words do not need to be directly adjacent (contiguous) in either text.

For example, if Text A is 'the quick brown fox jumps' and Text B is 'the lazy cat and brown dog jumps high', the LCS is 'the brown jumps'.

The length of this common subsequence is then normalized to provide a score. A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism, as it reflects similarities in how ideas are ordered and expressed sequentially.

Unlike other metrics, LCS does not filter out stopwords, allowing it to capture structural similarities and the flow of language, including the use of particles and common words that contribute to sentence construction.
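
A minimal dynamic-programming sketch of the calculation, normalizing here by the longer segment's length (shown with the English example above; the app computes this on Tibetan word tokens):

```python
# Minimal sketch of Normalized LCS over word tokens (dynamic programming).
def normalized_lcs(tokens_a: list[str], tokens_b: list[str]) -> float:
    m, n = len(tokens_a), len(tokens_b)
    if m == 0 or n == 0:
        return 0.0
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if tokens_a[i - 1] == tokens_b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n] / max(m, n)  # normalized by the longer segment

a = "the quick brown fox jumps".split()
b = "the lazy cat and brown dog jumps high".split()
print(round(normalized_lcs(a, b), 3))  # LCS "the brown jumps" -> 3/8 = 0.375
```
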
### Semantic Similarity

This metric utilizes transformer-based models to compute the cosine similarity between the semantic embeddings of text segments. The model converts each text segment into a high-dimensional vector that captures its semantic meaning.

For texts exceeding the model's token limit, an automated chunking strategy is employed: texts are divided into overlapping chunks, each chunk is embedded, and the resulting chunk embeddings are averaged to produce a single representative vector for the entire segment before comparison.

A higher score indicates that the texts express similar concepts or ideas, even if they use different vocabulary or phrasing.
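
The chunk-and-average idea can be sketched with the sentence-transformers library as follows. The chunk size and overlap are illustrative values rather than the app's exact settings, and the model name is simply the default listed in Step 2:

```python
# Rough sketch of the chunk-and-average strategy for long segments.
# Chunk size/overlap are illustrative; the app's actual values may differ.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def embed_long_text(text: str, chunk_words: int = 100, overlap: int = 20) -> np.ndarray:
    words = text.split()
    step = max(chunk_words - overlap, 1)
    chunks = [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), step)] or [text]
    vectors = model.encode(chunks)   # one embedding per chunk
    return np.mean(vectors, axis=0)  # average into a single segment vector

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim = cosine(embed_long_text("segment one ..."), embed_long_text("segment two ..."))
print(round(sim, 3))
```
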
### TF-IDF Cosine Similarity

This metric first calculates Term Frequency-Inverse Document Frequency (TF-IDF) scores for each word in each text segment, after filtering out common Tibetan stopwords. TF-IDF gives higher weight to words that are frequent within a particular segment but relatively rare across the entire collection of segments.

Each segment is then represented as a vector of these TF-IDF scores, and the cosine similarity is computed between these vectors. A score closer to 1 indicates that the two segments share more of these important, distinguishing terms, suggesting they cover similar specific topics or themes.
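
A minimal sketch with scikit-learn, assuming each segment has already been tokenized and stopword-filtered and its tokens joined with spaces (the tokens below are placeholders):

```python
# Minimal sketch of TF-IDF cosine similarity with scikit-learn.
# The app applies its own Tibetan tokenization and stopword list first.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

segments = [
    "རྒྱལ་པོ ཁྲིམས བཅས",  # segment A as space-separated tokens
    "ཁྲིམས བཅས གཏན",      # segment B as space-separated tokens
]

vectorizer = TfidfVectorizer(analyzer=str.split)  # keep pre-tokenized words as-is
tfidf = vectorizer.fit_transform(segments)
print(cosine_similarity(tfidf[0], tfidf[1])[0, 0])
```
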
## Troubleshooting

### Common Issues and Solutions

1. **"Empty vocabulary" error**:
   - This can occur if a text contains only stopwords or if tokenization fails.
   - Solution: Check your input text to ensure it contains valid Tibetan content.

2. **Model loading errors**:
   - If a model fails to load, the application will continue without semantic similarity.
   - Solution: Try a different model or disable semantic similarity.

3. **Performance issues with large files**:
   - Solution: Split large files into smaller ones or use fewer files at once.

4. **No results displayed**:
   - Solution: Ensure you have uploaded at least two valid text files and that they contain comparable content.

5. **Encoding issues**:
   - If your text appears garbled, it may have encoding problems.
   - Solution: Ensure your files are saved in UTF-8 or UTF-16 encoding (a small conversion sketch follows this list).
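
If a file was saved with a different encoding, a quick one-off conversion to UTF-8 can be done in Python. The file names and the source encoding here are examples only; adjust them to your situation:

```python
# Example: convert a UTF-16 file to UTF-8 (adjust names/encoding as needed).
with open("my_text_utf16.txt", encoding="utf-16") as src:
    text = src.read()
with open("my_text_utf8.txt", "w", encoding="utf-8") as dst:
    dst.write(text)
```
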
### Getting Help

If you encounter issues not covered in this guide, please:
1. Check the [GitHub repository](https://github.com/daniel-wojahn/tibetan-text-metrics) for updates or known issues.
2. Submit an issue on GitHub with details about your problem.

## Acknowledgments

The Tibetan Text Metrics project was developed as part of the [Law in Historic Tibet](https://www.law.ox.ac.uk/law-historic-tibet) project at the Centre for Socio-Legal Studies at the University of Oxford.

## License

This project is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).