daniel-wojahn committed
Commit 0bbf2df · verified · 1 Parent(s): 95d7fed

revamped the pipeline and added stopwords and documentation

README.md CHANGED
@@ -1,14 +1,14 @@
1
  ---
2
- title: Tibetan Text Metrics Web App
3
- emoji: 🐢
4
  colorFrom: blue
5
- colorTo: purple
6
  sdk: gradio
7
- sdk_version: 5.29.1
 
8
  app_file: app.py
9
- pinned: false
10
- license: mit
11
- short_description: A web app for analyzing Tibetan textual similarities
12
  ---
13
 
14
  # Tibetan Text Metrics Web App
@@ -17,7 +17,7 @@ short_description: A web app for analyzing Tibetan textual similarities
17
  [![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)
18
  [![Project Status: Active – Web app version for accessible text analysis.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
19
 
20
- A user-friendly web application for analyzing textual similarities and variations in Tibetan manuscripts. This tool provides a graphical interface to the core text comparison functionalities of the [Tibetan Text Metrics (TTM)](https://github.com/daniel-wojahn/tibetan-text-metrics) project, making it accessible to researchers without Python or command-line experience. Built with Python and Gradio.
21
 
22
  ## Background
23
 
@@ -28,15 +28,15 @@ The Tibetan Text Metrics project aims to provide quantitative methods for assess
28
  - **Easy File Upload**: Upload one or more Tibetan `.txt` files directly through the browser.
29
  - **Automatic Segmentation**: Uses Tibetan section markers (e.g., `༈`) to automatically split texts into comparable chapters or sections.
30
  - **Core Metrics Computed**:
31
- - **Jaccard Similarity (%)**: Measures vocabulary overlap between segments.
32
  - **Normalized Longest Common Subsequence (LCS)**: Identifies the longest shared sequence of words, indicating direct textual parallels.
33
- - **Semantic Similarity (BuddhistNLP)**: Uses the `buddhist-nlp/bodhi-sentence-cased-v1` model to compare the contextual meaning of segments.
34
- - **TF-IDF Cosine Similarity**: Highlights texts that share important or characteristic terms by comparing their TF-IDF profiles.
35
  - **Handles Long Texts**: Implements automated chunking for semantic similarity to process texts exceeding the model's token limit.
36
  - **Interactive Visualizations**:
37
  - Heatmaps for Jaccard, LCS, Semantic, and TF-IDF similarity metrics, providing a quick overview of inter-segment relationships.
38
  - Bar chart displaying word counts per segment.
39
- - **Downloadable Results**: Export detailed metrics as a CSV file.
40
  - **Simplified Workflow**: No command-line interaction or Python scripting needed for analysis.
41
 
42
  ## Text Segmentation and Best Practices
@@ -47,19 +47,30 @@ To obtain meaningful results, it is highly recommended to divide your Tibetan te
47
 
48
  **How to segment your texts:**
49
 
50
- - Use a clear marker (e.g., `༈` or another unique string) to separate chapters/sections in your `.txt` files.
51
  - Each segment should represent a coherent part of the text (e.g., a chapter, legal clause, or thematic section).
52
  - The tool will automatically split your file on this marker for analysis. If no marker is found, the entire file is treated as a single segment, and a warning will be issued.
53
 
54
  **Best practices:**
55
 
56
- - Ensure your marker is unique and does not appear within a chapter.
57
  - Try to keep chapters/sections of similar length for more balanced comparisons.
58
  - For poetry or short texts, consider grouping several poems or stanzas as one segment.
59
 
60
  ## Implemented Metrics
61
 
62
- The application computes and visualizes the following similarity metrics between corresponding chapters/segments of the uploaded texts:
63
 
64
  1. **Jaccard Similarity (%)**: This metric quantifies the lexical overlap between two text segments by comparing their sets of *unique* words. It essentially answers the question: 'Of all the distinct words found across these two segments, what proportion of them are present in both?' It is calculated as `(Number of common unique words) / (Total number of unique words in both texts combined) * 100`. Jaccard Similarity is insensitive to word order and word frequency; it only cares whether a unique word is present or absent. A higher percentage indicates a greater overlap in the vocabularies used in the two segments.
65
  2. **Normalized LCS (Longest Common Subsequence)**: This metric measures the length of the longest sequence of words that appears in *both* text segments, maintaining their original relative order. Importantly, these words do not need to be directly adjacent (contiguous) in either text. For example, if Text A is '<u>the</u> quick <u>brown</u> fox <u>jumps</u>' and Text B is '<u>the</u> lazy cat and <u>brown</u> dog <u>jumps</u> high', the LCS is 'the brown jumps'. The length of this common subsequence is then normalized (in this tool, by dividing by the average length of the two segments) to provide a score, which is then presented as a percentage. A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism, as it reflects similarities in how ideas are ordered and expressed sequentially.
@@ -67,7 +78,7 @@ The application computes and visualizes the following similarity metrics between
67
  3. **Semantic Similarity (BuddhistNLP)**: Utilizes the [buddhist-nlp/buddhist-sentence-similarity](https://huggingface.co/buddhist-nlp/buddhist-sentence-similarity) model to compute the cosine similarity between the semantic embeddings of text segments. This model is fine-tuned for Buddhist studies texts and captures nuances in meaning. For texts exceeding the model's 512-token input limit, an automated chunking strategy is employed: texts are divided into overlapping chunks, each chunk is embedded, and the resulting chunk embeddings are averaged (mean pooling) to produce a single representative vector for the entire segment before comparison.
68
  4. **TF-IDF Cosine Similarity**: This metric first calculates Term Frequency-Inverse Document Frequency (TF-IDF) scores for each word in each text segment. TF-IDF gives higher weight to words that are frequent within a particular segment but relatively rare across the entire collection of segments. This helps to identify terms that are characteristic or discriminative for a segment. Each segment is then represented as a vector of these TF-IDF scores. Finally, the cosine similarity is computed between these vectors. A score closer to 1 indicates that the two segments share more of these important, distinguishing terms, suggesting they cover similar specific topics or themes.
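For the chunking strategy described under Semantic Similarity (item 3 above), the following is a minimal sketch of the chunk-and-mean-pool idea. The chunk size, overlap, and helper names are illustrative assumptions, not the pipeline's exact values; only the model name and the overall strategy come from this repository.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("buddhist-nlp/buddhist-sentence-similarity")

def embed_long_segment(tokens: list[str], chunk_size: int = 400, overlap: int = 50) -> np.ndarray:
    """Embed a long segment by splitting it into overlapping chunks and mean-pooling the chunk embeddings."""
    step = chunk_size - overlap
    chunks = [" ".join(tokens[i:i + chunk_size]) for i in range(0, max(len(tokens), 1), step)]
    chunk_embeddings = model.encode(chunks)   # one vector per chunk
    return np.mean(chunk_embeddings, axis=0)  # single vector for the whole segment

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two segment vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```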
69
 
70
- ## Getting Started (Running Locally)
71
 
72
  1. Ensure you have Python 3.10 or newer.
73
  2. Navigate to the `webapp` directory:
@@ -84,11 +95,19 @@ The application computes and visualizes the following similarity metrics between
84
  ```bash
85
  pip install -r requirements.txt
86
  ```
87
- 5. Run the web application:
88
  ```bash
89
  python app.py
90
  ```
91
- 6. Open your web browser and go to the local URL provided (usually `http://127.0.0.1:7860`).
92
 
93
  ## Usage
94
 
 
1
  ---
2
+ title: Tibetan Text Metrics
3
+ emoji: 📚
4
  colorFrom: blue
5
+ colorTo: indigo
6
  sdk: gradio
7
+ sdk_version: 5.29.0
8
+ python_version: 3.10
9
  app_file: app.py
10
+ models:
11
+ - buddhist-nlp/buddhist-sentence-similarity
 
12
  ---
13
 
14
  # Tibetan Text Metrics Web App
 
17
  [![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)
18
  [![Project Status: Active – Web app version for accessible text analysis.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
19
 
20
+ A user-friendly web application for analyzing textual similarities and variations in Tibetan manuscripts. This tool provides a graphical interface to the core text comparison functionalities of the [Tibetan Text Metrics (TTM)](https://github.com/daniel-wojahn/tibetan-text-metrics) project, making it accessible to researchers without Python or command-line experience. Built with Python, Cython, and Gradio.
21
 
22
  ## Background
23
 
 
28
  - **Easy File Upload**: Upload one or more Tibetan `.txt` files directly through the browser.
29
  - **Automatic Segmentation**: Uses Tibetan section markers (e.g., `༈`) to automatically split texts into comparable chapters or sections.
30
  - **Core Metrics Computed**:
31
+ - **Jaccard Similarity (%)**: Measures vocabulary overlap between segments. *Common Tibetan stopwords are filtered out to focus on meaningful lexical similarity.*
32
  - **Normalized Longest Common Subsequence (LCS)**: Identifies the longest shared sequence of words, indicating direct textual parallels.
33
+ - **Semantic Similarity (BuddhistNLP)**: Uses the [buddhist-nlp/buddhist-sentence-similarity](https://huggingface.co/buddhist-nlp/buddhist-sentence-similarity) model to compare the contextual meaning of segments. *Note: This metric is experimental and may not perform well for all texts. It is recommended to use it in combination with other metrics for a more comprehensive analysis.*
34
+ - **TF-IDF Cosine Similarity**: Highlights texts that share important or characteristic terms by comparing their TF-IDF profiles. *Common Tibetan stopwords are excluded to ensure TF-IDF weights highlight genuinely characteristic terms.*
35
  - **Handles Long Texts**: Implements automated chunking for semantic similarity to process texts exceeding the model's token limit.
36
  - **Interactive Visualizations**:
37
  - Heatmaps for Jaccard, LCS, Semantic, and TF-IDF similarity metrics, providing a quick overview of inter-segment relationships.
38
  - Bar chart displaying word counts per segment.
39
+ - **Downloadable Results**: Export detailed metrics as a CSV file and save heatmaps as PNG files.
40
  - **Simplified Workflow**: No command-line interaction or Python scripting needed for analysis.
41
 
42
  ## Text Segmentation and Best Practices
 
47
 
48
  **How to segment your texts:**
49
 
50
+ - Use the Tibetan section marker `༈` (*sbrul shad*) to separate chapters/sections in your `.txt` files.
51
  - Each segment should represent a coherent part of the text (e.g., a chapter, legal clause, or thematic section).
52
  - The tool will automatically split your file on this marker for analysis. If no marker is found, the entire file is treated as a single segment, and a warning will be issued.
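As a point of reference, a minimal sketch of the marker-based splitting described above (illustrative only; the actual splitting logic lives in the pipeline code, and the helper name here is hypothetical):

```python
def split_into_segments(text: str, marker: str = "༈") -> list[str]:
    """Split a Tibetan text on the section marker, dropping empty segments."""
    segments = [seg.strip() for seg in text.split(marker) if seg.strip()]
    if len(segments) <= 1:
        print("Warning: no section marker found; the whole file is treated as one segment.")
    return segments
```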
53
 
54
  **Best practices:**
55
 
56
+ - Ensure the marker is unique and does not appear within a chapter.
57
  - Try to keep chapters/sections of similar length for more balanced comparisons.
58
  - For poetry or short texts, consider grouping several poems or stanzas as one segment.
59
 
60
  ## Implemented Metrics
61
 
62
+ **Stopword Filtering:**
63
+ To enhance the accuracy and relevance of similarity scores, both the Jaccard Similarity and TF-IDF Cosine Similarity calculations incorporate a stopword filtering step. This process removes high-frequency, low-information Tibetan words (e.g., common particles, pronouns, and grammatical markers) before the metrics are computed. This ensures that the resulting scores are more reflective of meaningful lexical and thematic similarities between texts, rather than being skewed by the presence of ubiquitous common words.
64
+
65
+ The comprehensive list of Tibetan stopwords used is adapted and compiled from the following valuable resources:
66
+ - The **Divergent Discourses** project (specifically, its Tibetan stopwords list available at [Zenodo Record 10148636](https://zenodo.org/records/10148636)).
67
+ - The **Tibetan Lucene Analyser** by the Buddhist Digital Archives (BUDA), available on [GitHub: buda-base/lucene-bo](https://github.com/buda-base/lucene-bo).
68
+
69
+ We extend our gratitude to the creators and maintainers of these projects for making their work available to the community.
70
+
71
+ Feel free to edit this list of stopwords to better suit your needs. The list is stored in the `pipeline/stopwords_bo.py` file.
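To make the filtering concrete, here is a minimal sketch of how the stopword set can be applied before a Jaccard comparison, mirroring the approach in `pipeline/metrics.py` (the helper name and token lists are illustrative):

```python
from pipeline.stopwords_bo import TIBETAN_STOPWORDS_SET

def jaccard_percent(tokens_a: list[str], tokens_b: list[str]) -> float:
    """Jaccard similarity (%) over unique tokens, with Tibetan stopwords removed."""
    set_a = {t for t in tokens_a if t not in TIBETAN_STOPWORDS_SET}
    set_b = {t for t in tokens_b if t not in TIBETAN_STOPWORDS_SET}
    union = set_a | set_b
    return 100.0 * len(set_a & set_b) / len(union) if union else 0.0
```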
72
+
73
+ The application computes and visualizes the following similarity metrics between corresponding chapters/segments of the uploaded texts:
74
 
75
  1. **Jaccard Similarity (%)**: This metric quantifies the lexical overlap between two text segments by comparing their sets of *unique* words. It essentially answers the question: 'Of all the distinct words found across these two segments, what proportion of them are present in both?' It is calculated as `(Number of common unique words) / (Total number of unique words in both texts combined) * 100`. Jaccard Similarity is insensitive to word order and word frequency; it only cares whether a unique word is present or absent. A higher percentage indicates a greater overlap in the vocabularies used in the two segments.
76
  2. **Normalized LCS (Longest Common Subsequence)**: This metric measures the length of the longest sequence of words that appears in *both* text segments, maintaining their original relative order. Importantly, these words do not need to be directly adjacent (contiguous) in either text. For example, if Text A is '<u>the</u> quick <u>brown</u> fox <u>jumps</u>' and Text B is '<u>the</u> lazy cat and <u>brown</u> dog <u>jumps</u> high', the LCS is 'the brown jumps'. The length of this common subsequence is then normalized (in this tool, by dividing by the average length of the two segments) to provide a score, which is then presented as a percentage. A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism, as it reflects similarities in how ideas are ordered and expressed sequentially.
 
78
  3. **Semantic Similarity (BuddhistNLP)**: Utilizes the [buddhist-nlp/buddhist-sentence-similarity](https://huggingface.co/buddhist-nlp/buddhist-sentence-similarity) model to compute the cosine similarity between the semantic embeddings of text segments. This model is fine-tuned for Buddhist studies texts and captures nuances in meaning. For texts exceeding the model's 512-token input limit, an automated chunking strategy is employed: texts are divided into overlapping chunks, each chunk is embedded, and the resulting chunk embeddings are averaged (mean pooling) to produce a single representative vector for the entire segment before comparison.
79
  4. **TF-IDF Cosine Similarity**: This metric first calculates Term Frequency-Inverse Document Frequency (TF-IDF) scores for each word in each text segment. TF-IDF gives higher weight to words that are frequent within a particular segment but relatively rare across the entire collection of segments. This helps to identify terms that are characteristic or discriminative for a segment. Each segment is then represented as a vector of these TF-IDF scores. Finally, the cosine similarity is computed between these vectors. A score closer to 1 indicates that the two segments share more of these important, distinguishing terms, suggesting they cover similar specific topics or themes.
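The sketch below mirrors how the pipeline applies TF-IDF with Tibetan stopword filtering; the vectorizer settings follow `pipeline/metrics.py` in this commit, while the two short segments are toy examples invented for illustration (run from the `webapp` directory so the `pipeline` package is importable):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

from pipeline.stopwords_bo import TIBETAN_STOPWORDS

# Each "document" is one segment whose botok tokens have been re-joined with spaces.
segments = [
    "རྒྱལ་པོ ཆེན་པོ ཁྲིམས ཀྱི གཞུང",  # toy segment 1
    "ཁྲིམས ཀྱི གཞུང ལ རྒྱལ་པོ",       # toy segment 2
]

vectorizer = TfidfVectorizer(
    tokenizer=lambda x: x.split(),  # input is already tokenized
    preprocessor=lambda x: x,       # no normalization for Tibetan
    token_pattern=None,
    stop_words=TIBETAN_STOPWORDS,   # drop common particles, pronouns, markers
)
tfidf_matrix = vectorizer.fit_transform(segments)
print(cosine_similarity(tfidf_matrix)[0, 1])  # pairwise TF-IDF cosine similarity
```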
80
 
81
+ ## Getting Started (Running Locally)
82
 
83
  1. Ensure you have Python 3.10 or newer.
84
  2. Navigate to the `webapp` directory:
 
95
  ```bash
96
  pip install -r requirements.txt
97
  ```
98
+ 5. **Compile Cython Extension (Recommended for Performance)**:
99
+ To speed up the Longest Common Subsequence (LCS) calculation, a Cython extension is provided. To compile it:
100
+ ```bash
101
+ # Ensure you are in the webapp directory
102
+ python setup.py build_ext --inplace
103
+ ```
104
+ This step requires a C compiler. If you skip it, the application falls back to a slower, pure Python implementation for LCS (an optional import check is sketched right after this list).
105
+
106
+ 6. **Run the Web Application**:
107
  ```bash
108
  python app.py
109
  ```
110
+ 7. Open your web browser and go to the local URL provided (usually `http://127.0.0.1:7860`).
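Optionally, a quick way to confirm that the Cython extension from step 5 compiled and is importable (assuming you run this from the `webapp` directory; if the import fails, the slower pure Python fallback is used instead):

```python
# Run after `python setup.py build_ext --inplace` from the webapp directory.
from pipeline.fast_lcs import compute_lcs_fast

# Toy token lists: the longest common subsequence is ["ཀ", "ག"], so this prints 2.
print(compute_lcs_fast(["ཀ", "ཁ", "ག"], ["ཀ", "ག"]))
```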
111
 
112
  ## Usage
113
 
app.py CHANGED
@@ -2,12 +2,12 @@ import gradio as gr
2
  from pathlib import Path
3
  from pipeline.process import process_texts
4
  from pipeline.visualize import generate_visualizations, generate_word_count_chart
5
- import cProfile
6
- import pstats
7
- import io
8
 
9
  from theme import tibetan_theme
10
 
 
 
11
 
12
  # Main interface logic
13
  def main_interface():
@@ -29,7 +29,7 @@ def main_interface():
29
  gr.Markdown(
30
  """
31
  ## Step 1: Upload Your Tibetan Text Files
32
- <span style='font-size:16px;'>Upload one or more `.txt` files. Each file should contain Unicode Tibetan text, segmented into chapters/sections if possible.</span>
33
  """,
34
  elem_classes="gr-markdown",
35
  )
@@ -41,16 +41,16 @@ def main_interface():
41
  with gr.Column(scale=1, elem_classes="step-column"):
42
  with gr.Group():
43
  gr.Markdown(
44
- """## Step 2: Configure and Run the Analysis
45
- <span style='font-size:16px;'>Choose your analysis options and click the button below to compute metrics and view results. For meaningful analysis, ensure your texts are segmented by chapter or section using a marker like '༈'. The tool will split files based on this marker.</span>
46
  """,
47
  elem_classes="gr-markdown",
48
  )
49
  semantic_toggle_radio = gr.Radio(
50
- label="Compute Semantic Similarity?",
51
  choices=["Yes", "No"],
52
  value="Yes",
53
- info="Semantic similarity can be time-consuming. Choose 'No' to speed up analysis if these metrics are not required.",
54
  elem_id="semantic-radio-group",
55
  )
56
  process_btn = gr.Button(
@@ -74,18 +74,20 @@ def main_interface():
74
  heatmap_titles = {
75
  "Jaccard Similarity (%)": "Jaccard Similarity (%): Higher scores (brighter) mean more shared unique words.",
76
  "Normalized LCS": "Normalized LCS: Higher scores (brighter) mean longer shared sequences of words.",
77
- "Semantic Similarity (BuddhistNLP)": "Semantic Similarity (BuddhistNLP): Higher scores (brighter) mean more similar meanings.",
78
  "TF-IDF Cosine Sim": "TF-IDF Cosine Similarity: Higher scores mean texts share more important, distinctive vocabulary.",
79
  }
80
 
81
  metric_tooltips = {
82
  "Jaccard Similarity (%)": """
83
  ### Jaccard Similarity (%)
84
- This metric quantifies the lexical overlap between two text segments by comparing their sets of *unique* words.
85
- It essentially answers the question: 'Of all the distinct words found across these two segments, what proportion of them are present in both?'
86
- It is calculated as `(Number of common unique words) / (Total number of unique words in both texts combined) * 100`.
87
- Jaccard Similarity is insensitive to word order and word frequency; it only cares whether a unique word is present or absent.
88
- A higher percentage indicates a greater overlap in the vocabularies used in the two segments.
 
 
89
  """,
90
  "Normalized LCS": """
91
  ### Normalized LCS (Longest Common Subsequence)
@@ -95,22 +97,28 @@ For example, if Text A is '<u>the</u> quick <u>brown</u> fox <u>jumps</u>' and T
95
The length of this common subsequence is then normalized (in this tool, by dividing by the average length of the two segments) to provide a score, which is then presented as a percentage.
96
  A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism, as it reflects similarities in how ideas are ordered and expressed sequentially.
97
 
98
- *Note on Interpretation*: It's possible for Normalized LCS to be higher than Jaccard Similarity. This often happens when texts share a substantial 'narrative backbone' or common ordered phrases (leading to a high LCS), even if they use varied surrounding vocabulary or introduce many unique words not part of these core sequences (which would lower the Jaccard score). LCS highlights this sequential, structural similarity, while Jaccard focuses on the overall shared vocabulary regardless of its arrangement.
 
 
99
  """,
100
- "Semantic Similarity (BuddhistNLP)": """
101
- ### Semantic Similarity (BuddhistNLP)
102
- Utilizes the `buddhist-nlp/bodhi-sentence-cased-v1` model to compute the cosine similarity between the semantic embeddings of text segments.
103
  This model is fine-tuned for Buddhist studies texts and captures nuances in meaning.
104
  For texts exceeding the model's 512-token input limit, an automated chunking strategy is employed: texts are divided into overlapping chunks, each chunk is embedded, and the resulting chunk embeddings are averaged (mean pooling) to produce a single representative vector for the entire segment before comparison.
 
 
105
  """,
106
  "TF-IDF Cosine Sim": """
107
  ### TF-IDF Cosine Similarity
108
- This metric first calculates Term Frequency-Inverse Document Frequency (TF-IDF) scores for each word in each text segment.
109
  TF-IDF gives higher weight to words that are frequent within a particular segment but relatively rare across the entire collection of segments.
110
- This helps to identify terms that are characteristic or discriminative for a segment.
111
  Each segment is then represented as a vector of these TF-IDF scores.
112
  Finally, the cosine similarity is computed between these vectors.
113
  A score closer to 1 indicates that the two segments share more of these important, distinguishing terms, suggesting they cover similar specific topics or themes.
 
 
114
  """,
115
  }
116
  heatmap_tabs = {}
@@ -138,6 +146,16 @@ A score closer to 1 indicates that the two segments share more of these importan
138
  warning_box = gr.Markdown(visible=False)
139
 
140
  def run_pipeline(files, enable_semantic_str):
141
  """
142
  Processes uploaded files, computes metrics, generates visualizations, and prepares outputs for the UI.
143
 
@@ -157,40 +175,15 @@ A score closer to 1 indicates that the two segments share more of these importan
157
  if not files:
158
  return (
159
  None,
160
- "Please upload files to process.",
161
- None,
162
- None,
163
- None,
164
- None,
165
- gr.update(value="Please upload files to process.", visible=True),
 
166
  )
167
 
168
- pr = cProfile.Profile()
169
- pr.enable()
170
-
171
- # Initialize results to ensure they are defined in finally if an early error occurs
172
- (
173
- csv_path_res,
174
- metrics_preview_df_res,
175
- word_count_fig_res,
176
- jaccard_heatmap_res,
177
- lcs_heatmap_res,
178
- semantic_heatmap_res,
179
- tfidf_heatmap_res,
180
- warning_update_res,
181
- ) = (
182
- None,
183
- "Processing error. See console for details.",
184
- None,
185
- None,
186
- None,
187
- None,
188
- None,
189
- gr.update(
190
- value="Processing error. See console for details.", visible=True
191
- ),
192
- )
193
-
194
  try:
195
  filenames = [
196
  Path(file.name).name for file in files
@@ -214,7 +207,7 @@ A score closer to 1 indicates that the two segments share more of these importan
214
  )
215
  metrics_preview_df_res = warning_message
216
  warning_update_res = gr.update(value=warning_message, visible=True)
217
- # Results for this case are set, finally will execute, then return
218
  else:
219
  # heatmap_titles is already defined in the outer scope of main_interface
220
  heatmaps_data = generate_visualizations(
@@ -237,22 +230,12 @@ A score closer to 1 indicates that the two segments share more of these importan
237
  )
238
 
239
  except Exception as e:
240
- # logger.error(f"Error in processing: {e}", exc_info=True) # Already logged in process_texts or lower levels
241
- metrics_preview_df_res = f"Error: {str(e)}"
 
 
242
  warning_update_res = gr.update(value=f"Error: {str(e)}", visible=True)
243
 
244
- finally:
245
- pr.disable()
246
- s = io.StringIO()
247
- sortby = (
248
- pstats.SortKey.CUMULATIVE
249
- ) # Sort by cumulative time spent in function
250
- ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
251
- print("\n--- cProfile Stats (Top 30) ---")
252
- ps.print_stats(30) # Print the top 30 costly functions
253
- print(s.getvalue())
254
- print("-------------------------------\n")
255
-
256
  return (
257
  csv_path_res,
258
  metrics_preview_df_res,
 
2
  from pathlib import Path
3
  from pipeline.process import process_texts
4
  from pipeline.visualize import generate_visualizations, generate_word_count_chart
5
+ import logging
 
 
6
 
7
  from theme import tibetan_theme
8
 
9
+ logger = logging.getLogger(__name__)
10
+
11
 
12
  # Main interface logic
13
  def main_interface():
 
29
  gr.Markdown(
30
  """
31
  ## Step 1: Upload Your Tibetan Text Files
32
+ <span style='font-size:16px;'>Upload one or more `.txt` files. Each file should contain Unicode Tibetan text, segmented into chapters/sections using the marker '༈' (<i>sbrul shad</i>) where possible.</span>
33
  """,
34
  elem_classes="gr-markdown",
35
  )
 
41
  with gr.Column(scale=1, elem_classes="step-column"):
42
  with gr.Group():
43
  gr.Markdown(
44
+ """## Step 2: Configure and run the analysis
45
+ <span style='font-size:16px;'>Choose your analysis options and click the button below to compute metrics and view results. For meaningful analysis, ensure your texts are segmented by chapter or section using the marker '༈' (<i>sbrul shad</i>). The tool will split files based on this marker.</span>
46
  """,
47
  elem_classes="gr-markdown",
48
  )
49
  semantic_toggle_radio = gr.Radio(
50
+ label="Compute semantic similarity?",
51
  choices=["Yes", "No"],
52
  value="Yes",
53
+ info="Semantic similarity will be time-consuming. Choose 'No' to speed up analysis if these metrics are not required.",
54
  elem_id="semantic-radio-group",
55
  )
56
  process_btn = gr.Button(
 
74
  heatmap_titles = {
75
  "Jaccard Similarity (%)": "Jaccard Similarity (%): Higher scores (brighter) mean more shared unique words.",
76
  "Normalized LCS": "Normalized LCS: Higher scores (brighter) mean longer shared sequences of words.",
77
+ "Semantic Similarity": "Semantic Similarity (using word embeddings/experimental): Higher scores (brighter) mean more similar meanings.",
78
  "TF-IDF Cosine Sim": "TF-IDF Cosine Similarity: Higher scores mean texts share more important, distinctive vocabulary.",
79
  }
80
 
81
  metric_tooltips = {
82
  "Jaccard Similarity (%)": """
83
  ### Jaccard Similarity (%)
84
+ This metric quantifies the lexical overlap between two text segments by comparing their sets of *unique* words, after **filtering out common Tibetan stopwords**.
85
+ It essentially answers the question: 'Of all the distinct, meaningful words found across these two segments, what proportion of them are present in both?'
86
+ It is calculated as `(Number of common unique meaningful words) / (Total number of unique meaningful words in both texts combined) * 100`.
87
+ Jaccard Similarity is insensitive to word order and word frequency; it only cares whether a unique meaningful word is present or absent.
88
+ A higher percentage indicates a greater overlap in the significant vocabularies used in the two segments.
89
+
90
+ **Stopword Filtering**: common Tibetan particles, pronouns, and grammatical markers are removed before this metric is computed, so the score reflects content-bearing vocabulary.
91
  """,
92
  "Normalized LCS": """
93
  ### Normalized LCS (Longest Common Subsequence)
 
97
The length of this common subsequence is then normalized (in this tool, by dividing by the average length of the two segments) to provide a score, which is then presented as a percentage.
98
  A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism, as it reflects similarities in how ideas are ordered and expressed sequentially.
99
 
100
+ **No Stopword Filtering.** Unlike metrics such as Jaccard Similarity or TF-IDF Cosine Similarity (which typically filter out common stopwords to focus on content-bearing words), the LCS calculation in this tool intentionally uses the raw, unfiltered sequence of tokens from your texts. This design choice allows LCS to capture structural similarities and the flow of language, including the use of particles and common words that contribute to sentence construction and narrative sequence. By not removing stopwords, LCS can reveal similarities in phrasing and textual structure that might otherwise be obscured, making it a valuable complement to metrics that focus purely on lexical overlap of keywords.
101
+
102
+ **Note on Interpretation**: It is possible for Normalized LCS to be higher than Jaccard Similarity. This often happens when texts share a substantial 'narrative backbone' or common ordered phrases (leading to a high LCS), even if they use varied surrounding vocabulary or introduce many unique words not part of these core sequences (which would lower the Jaccard score). LCS highlights this sequential, structural similarity, while Jaccard focuses on the overall shared vocabulary regardless of its arrangement.
103
  """,
104
+ "Semantic Similarity": """
105
+ ### Semantic Similarity (Experimental)
106
+ Utilizes the <a href="https://huggingface.co/buddhist-nlp/buddhist-sentence-similarity">buddhist-nlp/buddhist-sentence-similarity</a> model to compute the cosine similarity between the semantic embeddings of text segments.
107
  This model is fine-tuned for Buddhist studies texts and captures nuances in meaning.
108
  For texts exceeding the model's 512-token input limit, an automated chunking strategy is employed: texts are divided into overlapping chunks, each chunk is embedded, and the resulting chunk embeddings are averaged (mean pooling) to produce a single representative vector for the entire segment before comparison.
109
+
110
+ **Note**: This metric is experimental and may not perform well for all texts. It is recommended to use it in combination with other metrics for a more comprehensive analysis.
111
  """,
112
  "TF-IDF Cosine Sim": """
113
  ### TF-IDF Cosine Similarity
114
+ This metric first calculates Term Frequency-Inverse Document Frequency (TF-IDF) scores for each word in each text segment, **after filtering out common Tibetan stopwords**.
115
  TF-IDF gives higher weight to words that are frequent within a particular segment but relatively rare across the entire collection of segments.
116
+ This helps to identify terms that are characteristic or discriminative for a segment. By excluding stopwords, the TF-IDF scores better reflect genuinely significant terms.
117
  Each segment is then represented as a vector of these TF-IDF scores.
118
  Finally, the cosine similarity is computed between these vectors.
119
  A score closer to 1 indicates that the two segments share more of these important, distinguishing terms, suggesting they cover similar specific topics or themes.
120
+
121
+ **Stopword Filtering**: common Tibetan particles, pronouns, and grammatical markers are removed before this metric is computed, so the score reflects content-bearing vocabulary.
122
  """,
123
  }
124
  heatmap_tabs = {}
 
146
  warning_box = gr.Markdown(visible=False)
147
 
148
  def run_pipeline(files, enable_semantic_str):
149
+ # Initialize all return values to ensure defined paths for all outputs
150
+ csv_path_res = None
151
+ metrics_preview_df_res = None # Can be a DataFrame or a string message
152
+ word_count_fig_res = None
153
+ jaccard_heatmap_res = None
154
+ lcs_heatmap_res = None
155
+ semantic_heatmap_res = None
156
+ tfidf_heatmap_res = None
157
+ warning_update_res = gr.update(value="", visible=False) # Default: no warning
158
+
159
  """
160
  Processes uploaded files, computes metrics, generates visualizations, and prepares outputs for the UI.
161
 
 
175
  if not files:
176
  return (
177
  None,
178
+ "Please upload files to analyze.",
179
+ None, # word_count_plot
180
+ None, # jaccard_heatmap
181
+ None, # lcs_heatmap
182
+ None, # semantic_heatmap
183
+ None, # tfidf_heatmap
184
+ gr.update(value="Please upload files.", visible=True),
185
  )
186
 
 
187
  try:
188
  filenames = [
189
  Path(file.name).name for file in files
 
207
  )
208
  metrics_preview_df_res = warning_message
209
  warning_update_res = gr.update(value=warning_message, visible=True)
210
+ # Results for this case are set, then return
211
  else:
212
  # heatmap_titles is already defined in the outer scope of main_interface
213
  heatmaps_data = generate_visualizations(
 
230
  )
231
 
232
  except Exception as e:
233
+ logger.error(f"Error in run_pipeline: {e}", exc_info=True)
234
+ # metrics_preview_df_res and warning_update_res are set here.
235
+ # Other plot/file path variables will retain their initial 'None' values set at function start.
236
+ metrics_preview_df_res = f"Error: {str(e)}"
237
  warning_update_res = gr.update(value=f"Error: {str(e)}", visible=True)
238
 
239
  return (
240
  csv_path_res,
241
  metrics_preview_df_res,
packages.txt ADDED
@@ -0,0 +1,2 @@
1
+ build-essential
2
+ python3-dev
pipeline/fast_lcs.pyx ADDED
@@ -0,0 +1,23 @@
1
+ # fast_lcs.pyx
2
+ import numpy as np
3
+
4
+ cimport cython
5
+ cimport numpy as np
6
+
7
+
8
+ @cython.boundscheck(False)
9
+ @cython.wraparound(False)
10
+ def compute_lcs_fast(list words1, list words2):
11
+ cdef int m = len(words1)
12
+ cdef int n = len(words2)
13
+ cdef np.ndarray[np.int32_t, ndim=2] dp = np.zeros((m + 1, n + 1), dtype=np.int32)
14
+ cdef int i, j
15
+
16
+ for i in range(1, m + 1):
17
+ for j in range(1, n + 1):
18
+ if words1[i - 1] == words2[j - 1]:
19
+ dp[i, j] = dp[i - 1, j - 1] + 1
20
+ else:
21
+ dp[i, j] = max(dp[i - 1, j], dp[i, j - 1])
22
+
23
+ return int(dp[m, n])
pipeline/metrics.py CHANGED
@@ -7,8 +7,17 @@ import torch
7
  from .semantic_embedding import generate_embeddings
8
  from .tokenize import tokenize_texts
9
  import logging
10
- from numba import njit
11
  from sklearn.feature_extraction.text import TfidfVectorizer
12
 
13
  logger = logging.getLogger(__name__)
14
 
@@ -65,19 +74,28 @@ def _chunk_text(
65
  return reconstructed_text_chunks
66
 
67
 
68
- @njit
69
  def compute_normalized_lcs(words1: List[str], words2: List[str]) -> float:
 
 
70
  m, n = len(words1), len(words2)
71
- if m == 0 or n == 0:
72
- return 0.0
73
- dp = np.zeros((m + 1, n + 1), dtype=np.int32)
74
- for i in range(1, m + 1):
75
- for j in range(1, n + 1):
76
- if words1[i - 1] == words2[j - 1]:
77
- dp[i, j] = dp[i - 1, j - 1] + 1
78
- else:
79
- dp[i, j] = max(dp[i - 1, j], dp[i, j - 1])
80
- lcs_length = int(dp[m, n])
81
  avg_length = (m + n) / 2
82
  return lcs_length / avg_length if avg_length > 0 else 0.0
83
 
@@ -209,6 +227,7 @@ def compute_all_metrics(
209
  # Prepare token lists (always use tokenize_texts for raw Unicode)
210
  token_lists = {}
211
  corpus_for_tfidf = [] # For storing space-joined tokens for TF-IDF
 
212
 
213
  for fname, content in texts.items():
214
  tokenized_content = tokenize_texts([content]) # Returns a list of lists
@@ -228,8 +247,14 @@ def compute_all_metrics(
228
  if corpus_for_tfidf:
229
  # Using a dummy tokenizer and preprocessor as input is already tokenized (as space-separated strings)
230
  # and we don't want further case changes or token modifications for Tibetan.
 
 
 
231
  vectorizer = TfidfVectorizer(
232
- tokenizer=lambda x: x.split(), preprocessor=lambda x: x, token_pattern=None
 
 
 
233
  )
234
  tfidf_matrix = vectorizer.fit_transform(corpus_for_tfidf)
235
  # Calculate pairwise cosine similarity on the TF-IDF matrix
@@ -243,20 +268,29 @@ def compute_all_metrics(
243
 
244
  for i, j in combinations(range(len(files)), 2):
245
  f1, f2 = files[i], files[j]
246
- words1, words2 = token_lists[f1], token_lists[f2]
247
  jaccard = (
248
- len(set(words1) & set(words2)) / len(set(words1) | set(words2))
249
- if set(words1) | set(words2)
250
  else 0.0
251
  )
 
 
252
  jaccard_percent = jaccard * 100.0
253
- norm_lcs = compute_normalized_lcs(words1, words2)
254
 
255
  # Semantic Similarity Calculation
256
  if enable_semantic:
257
  # Pass raw texts and their pre-computed botok tokens
258
  semantic_sim = compute_semantic_similarity(
259
- texts[f1], texts[f2], words1, words2, model, device
260
  )
261
  else:
262
  semantic_sim = np.nan
 
7
  from .semantic_embedding import generate_embeddings
8
  from .tokenize import tokenize_texts
9
  import logging
10
+ from sentence_transformers import SentenceTransformer
11
  from sklearn.feature_extraction.text import TfidfVectorizer
12
+ from .stopwords_bo import TIBETAN_STOPWORDS, TIBETAN_STOPWORDS_SET
13
+
14
+ # Attempt to import the Cython-compiled fast_lcs module
15
+ try:
16
+ from .fast_lcs import compute_lcs_fast
17
+ USE_CYTHON_LCS = True
18
+ except ImportError:
19
+ # print("Cython fast_lcs not found, using Python LCS. For better performance, compile the Cython module.")
20
+ USE_CYTHON_LCS = False
21
 
22
  logger = logging.getLogger(__name__)
23
 
 
74
  return reconstructed_text_chunks
75
 
76
 
 
77
  def compute_normalized_lcs(words1: List[str], words2: List[str]) -> float:
78
+ # Calculate m and n (lengths) here, so they are available for normalization
79
+ # regardless of which LCS implementation is used.
80
  m, n = len(words1), len(words2)
81
+
82
+ if USE_CYTHON_LCS:
83
+ # Use the Cython-compiled version if available
84
+ lcs_length = compute_lcs_fast(words1, words2)
85
+ else:
86
+ # Fallback to pure Python implementation
87
+ # m, n = len(words1), len(words2) # Moved to the beginning of the function
88
+ # Using numpy array for dp table can be slightly faster than list of lists for large inputs
89
+ # but the primary bottleneck is the Python loop itself compared to Cython.
90
+ dp = np.zeros((m + 1, n + 1), dtype=np.int32)
91
+
92
+ for i in range(1, m + 1):
93
+ for j in range(1, n + 1):
94
+ if words1[i - 1] == words2[j - 1]:
95
+ dp[i, j] = dp[i - 1, j - 1] + 1
96
+ else:
97
+ dp[i, j] = max(dp[i - 1, j], dp[i, j - 1])
98
+ lcs_length = int(dp[m, n])
99
  avg_length = (m + n) / 2
100
  return lcs_length / avg_length if avg_length > 0 else 0.0
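For reference, a small illustrative check of the normalization used here (the score is the LCS length divided by the average of the two segment lengths; the token lists below are toy examples):

```python
from pipeline.metrics import compute_normalized_lcs

words1 = ["ཀ", "ཁ", "ག", "ང"]  # 4 tokens
words2 = ["ཀ", "ག"]            # 2 tokens
# The LCS is ["ཀ", "ག"] (length 2); the average length is (4 + 2) / 2 = 3.
print(compute_normalized_lcs(words1, words2))  # 2 / 3 ≈ 0.667
```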
101
 
 
227
  # Prepare token lists (always use tokenize_texts for raw Unicode)
228
  token_lists = {}
229
  corpus_for_tfidf = [] # For storing space-joined tokens for TF-IDF
230
+ tibetan_stopwords_set = set() # Initialize for Jaccard (and potentially LCS) filtering
231
 
232
  for fname, content in texts.items():
233
  tokenized_content = tokenize_texts([content]) # Returns a list of lists
 
247
  if corpus_for_tfidf:
248
  # Using a dummy tokenizer and preprocessor as input is already tokenized (as space-separated strings)
249
  # and we don't want further case changes or token modifications for Tibetan.
250
+ # Define Tibetan stopwords. These should match tokens produced by botok.
251
+ # Tibetan stopwords are now imported from stopwords_bo.py
252
+
253
  vectorizer = TfidfVectorizer(
254
+ tokenizer=lambda x: x.split(),
255
+ preprocessor=lambda x: x,
256
+ token_pattern=None,
257
+ stop_words=TIBETAN_STOPWORDS
258
  )
259
  tfidf_matrix = vectorizer.fit_transform(corpus_for_tfidf)
260
  # Calculate pairwise cosine similarity on the TF-IDF matrix
 
268
 
269
  for i, j in combinations(range(len(files)), 2):
270
  f1, f2 = files[i], files[j]
271
+ words1_raw, words2_raw = token_lists[f1], token_lists[f2]
272
+
273
+ # Filter stopwords for Jaccard calculation using the imported TIBETAN_STOPWORDS_SET
274
+ # If TIBETAN_STOPWORDS_SET is empty (e.g., if stopwords_bo.py somehow yields an empty set),
275
+ # filtering will have no effect, which is a safe fallback.
276
+ words1_jaccard = [word for word in words1_raw if word not in TIBETAN_STOPWORDS_SET]
277
+ words2_jaccard = [word for word in words2_raw if word not in TIBETAN_STOPWORDS_SET]
278
+
279
  jaccard = (
280
+ len(set(words1_jaccard) & set(words2_jaccard)) / len(set(words1_jaccard) | set(words2_jaccard))
281
+ if set(words1_jaccard) | set(words2_jaccard) # Ensure denominator is not zero
282
  else 0.0
283
  )
284
+ # LCS uses raw tokens (words1_raw, words2_raw) to provide a complementary metric.
285
+ # Semantic similarity also uses raw text and its botok tokens for chunking decisions.
286
  jaccard_percent = jaccard * 100.0
287
+ norm_lcs = compute_normalized_lcs(words1_raw, words2_raw)
288
 
289
  # Semantic Similarity Calculation
290
  if enable_semantic:
291
  # Pass raw texts and their pre-computed botok tokens
292
  semantic_sim = compute_semantic_similarity(
293
+ texts[f1], texts[f2], words1_raw, words2_raw, model, device
294
  )
295
  else:
296
  semantic_sim = np.nan
pipeline/process.py CHANGED
@@ -4,6 +4,7 @@ from .metrics import compute_all_metrics
4
  from .semantic_embedding import get_sentence_transformer_model_and_device
5
  from .tokenize import tokenize_texts
6
  import logging
 
7
 
8
  logger = logging.getLogger(__name__)
9
 
@@ -37,7 +38,8 @@ def process_texts(
37
  f"Failed to load sentence transformer model: {e}. Semantic similarity will not be available."
38
  )
39
  # Optionally, add a warning to the UI if model loading fails
40
- # warning += " Semantic similarity model failed to load."
 
41
  else:
42
  logger.info("Semantic similarity disabled. Skipping model loading.")
43
 
@@ -71,8 +73,6 @@ def process_texts(
71
  fname = seg_id.split("|")[0]
72
  file_to_chapters.setdefault(fname, []).append(seg_id)
73
  # For each pair of files, compare corresponding chapters (by index)
74
- from itertools import combinations
75
-
76
  results = []
77
  files = list(file_to_chapters.keys())
78
  for file1, file2 in combinations(files, 2):
 
4
  from .semantic_embedding import get_sentence_transformer_model_and_device
5
  from .tokenize import tokenize_texts
6
  import logging
7
+ from itertools import combinations
8
 
9
  logger = logging.getLogger(__name__)
10
 
 
38
  f"Failed to load sentence transformer model: {e}. Semantic similarity will not be available."
39
  )
40
  # Optionally, add a warning to the UI if model loading fails
41
+ # For now, keeping it as a logger.error. UI warning can be added later if desired.
42
+ pass # Explicitly noting that we are not changing the warning handling for UI here.
43
  else:
44
  logger.info("Semantic similarity disabled. Skipping model loading.")
45
 
 
73
  fname = seg_id.split("|")[0]
74
  file_to_chapters.setdefault(fname, []).append(seg_id)
75
  # For each pair of files, compare corresponding chapters (by index)
 
 
76
  results = []
77
  files = list(file_to_chapters.keys())
78
  for file1, file2 in combinations(files, 2):
pipeline/stopwords_bo.py ADDED
@@ -0,0 +1,71 @@
1
+ # -*- coding: utf-8 -*-
2
+ """Module for Tibetan stopwords.
3
+
4
+ This file centralizes the Tibetan stopwords list used in the Tibetan Text Metrics application.
5
+ Sources for these stopwords are acknowledged in the main README.md of the webapp.
6
+ """
7
+
8
+ # Initial set of stopwords with clear categories
9
+ PARTICLES_INITIAL = [
10
+ "ཏུ", "གི", "ཀྱི", "གིས", "ཀྱིས", "ཡིས", "ཀྱང", "སྟེ", "ཏེ", "ནོ", "ཏོ",
11
+ "ཅིང", "ཅིག", "ཅེས", "ཞེས", "གྱིས", "ན",
12
+ ]
13
+
14
+ MARKERS_AND_PUNCTUATION = ["༈", "།", "༎", "༑"]
15
+
16
+ ORDINAL_NUMBERS = [
17
+ "དང་པོ", "གཉིས་པ", "གསུམ་པ", "བཞི་པ", "ལྔ་པ", "དྲུག་པ", "བདུན་པ", "བརྒྱད་པ", "དགུ་པ", "བཅུ་པ",
18
+ "བཅུ་གཅིག་པ", "བཅུ་གཉིས་པ", "བཅུ་གསུམ་པ", "བཅུ་བཞི་པ", "བཅོ་ལྔ་པ",
19
+ "བཅུ་དྲུག་པ", "བཅུ་བདུན་པ", "བཅོ་བརྒྱད་པ", "བཅུ་དགུ་པ", "ཉི་ཤུ་པ",
20
+ ]
21
+
22
+ # Additional stopwords from the comprehensive list, categorized for readability
23
+ MORE_PARTICLES_SUFFIXES = [
24
+ "འི་", "དུ་", "གིས་", "ཏེ", "གི་", "ཡི་", "ཀྱི་", "པས་", "ཀྱིས་", "ཡི", "ལ", "ནི་", "ར", "དུ",
25
+ "ལས", "གྱིས་", "ས", "ཏེ་", "གྱི་", "དེ", "ཀ་", "སྟེ", "སྟེ་", "ངམ", "ཏོ", "དོ", "དམ་",
26
+ "གྱིན་", "ན", "འམ་", "ཀྱིན་", "ལོ", "ཀྱིས", "བས་", "ཤིག", "གིས", "ཀི་", "ཡིས་", "གྱི", "གི",
27
+ "བམ་", "ཤིག་", "ནམ", "མིན་", "ནམ་", "ངམ་", "རུ་", "ཤས་", "ཏུ", "ཡིས", "གིན་", "གམ་",
28
+ "གྱིས", "ཅང་", "སམ་", "ཞིག", "འང", "རུ", "དང", "ཡ", "འག", "སམ", "ཀ", "འམ", "མམ",
29
+ "དམ", "ཀྱི", "ལམ", "ནོ་", "སོ་", "རམ་", "བོ་", "ཨང་", "ཕྱི", "ཏོ་", "གེ", "གོ", "རོ་", "བོ",
30
+ "པས", "འི", "རམ", "བས", "མཾ་", "པོ", "ག་", "ག", "གམ", "བམ", "མོ་", "མམ་", "ཏམ་", "ངོ",
31
+ "ཏམ", "གིང་", "ཀྱང" # ཀྱང also in PARTICLES_INITIAL, set() will handle duplicates
32
+ ]
33
+
34
+ PRONOUNS_DEMONSTRATIVES = ["འདི", "གཞན་", "དེ་", "རང་", "སུ་"]
35
+
36
+ VERBS_AUXILIARIES = ["ཡིན་", "མི་", "ལགས་པ", "ཡིན་པ", "ལགས་", "མིན་", "ཡིན་པ་", "མིན", "ཡིན་བ", "ཡིན་ལུགས་"]
37
+
38
+ ADVERBS_QUALIFIERS_INTENSIFIERS = [
39
+ "སོགས་", "ཙམ་", "ཡང་", "ཉིད་", "ཞིང་", "རུང་", "ན་རེ", "འང་", "ཁོ་ན་", "འཕྲལ་", "བར་",
40
+ "ཅུང་ཟད་", "ཙམ་པ་", "ཤ་སྟག་"
41
+ ]
42
+
43
+ QUANTIFIERS_DETERMINERS_COLLECTIVES = [
44
+ "རྣམས་", "ཀུན་", "སྙེད་", "བཅས་", "ཡོངས་", "མཐའ་དག་", "དག་", "ཚུ", "ཚང་མ", "ཐམས་ཅད་",
45
+ "ཅིག་", "སྣ་ཚོགས་", "སྙེད་པ", "རེ་རེ་", "འགའ་", "སྤྱི", "དུ་མ", "མ", "ཁོ་ན", "ཚོ", "ལ་ལ་",
46
+ "སྙེད་པ་", "འབའ་", "སྙེད", "གྲང་", "ཁ་", "ངེ་", "ཅོག་", "རིལ་", "ཉུང་ཤས་", "ཚ་"
47
+ ]
48
+
49
+ CONNECTORS_CONJUNCTIONS = ["དང་", "ཅིང་", "ཤིང་"]
50
+
51
+ INTERJECTIONS_EXCLAMATIONS = ["ཨེ་", "འོ་"]
52
+
53
+ # Combine all categorized lists
54
+ _ALL_STOPWORDS_CATEGORIZED = (
55
+ PARTICLES_INITIAL +
56
+ MARKERS_AND_PUNCTUATION +
57
+ ORDINAL_NUMBERS +
58
+ MORE_PARTICLES_SUFFIXES +
59
+ PRONOUNS_DEMONSTRATIVES +
60
+ VERBS_AUXILIARIES +
61
+ ADVERBS_QUALIFIERS_INTENSIFIERS +
62
+ QUANTIFIERS_DETERMINERS_COLLECTIVES +
63
+ CONNECTORS_CONJUNCTIONS +
64
+ INTERJECTIONS_EXCLAMATIONS
65
+ )
66
+
67
+ # Final flat list of unique stopwords for TfidfVectorizer (as a list)
68
+ TIBETAN_STOPWORDS = list(set(_ALL_STOPWORDS_CATEGORIZED))
69
+
70
+ # Final set of unique stopwords for efficient Jaccard/LCS filtering (as a set)
71
+ TIBETAN_STOPWORDS_SET = set(TIBETAN_STOPWORDS)
pipeline/visualize.py CHANGED
@@ -84,13 +84,6 @@ def generate_visualizations(metrics_df: pd.DataFrame, descriptive_titles: dict =
84
  )
85
  heatmaps[metric] = fig
86
 
87
- # Use all features including pattern similarities if present
88
- if not metrics_df.empty:
89
- # Remove '.txt' from Text Pair labels
90
- metrics_df = metrics_df.copy()
91
- metrics_df["Text Pair"] = metrics_df["Text Pair"].str.replace(
92
- ".txt", "", regex=False
93
- )
94
  return heatmaps
95
 
96
 
 
84
  )
85
  heatmaps[metric] = fig
86
 
 
  return heatmaps
88
 
89
 
pyproject.toml ADDED
@@ -0,0 +1,8 @@
1
+ [build-system]
2
+ requires = [
3
+ "setuptools>=42",
4
+ "Cython>=0.29.21",
5
+ "numpy>=1.20"
6
+ ]
7
+ build-backend = "setuptools.build_meta"
8
+ backend-path = ["."] # Specifies that setuptools.build_meta is in the current directory's PYTHONPATH
requirements.txt CHANGED
@@ -1,12 +1,22 @@
1
- gradio>=4.0.0
2
- pandas
3
- plotly
4
- matplotlib
5
- seaborn
6
- scikit-learn
7
- # botok is required for Tibetan tokenization, ensure it is available on Hugging Face Spaces
8
- botok
9
- torch
10
- transformers
11
- sentence-transformers
12
- numba
1
+ # Core application and UI
2
+ gradio==5.29.1
3
+ pandas==2.2.3
4
+
5
+ # Plotting and visualization
6
+ plotly==6.0.0
7
+ matplotlib==3.10.1
8
+ seaborn==0.13.2
9
+
10
+ # Machine learning and numerical processing
11
+ numpy==1.26.4
12
+ scikit-learn==1.6.1
13
+ torch==2.7.0
14
+ transformers==4.51.3
15
+ sentence-transformers==4.1.0
16
+ numba==0.61.2
17
+
18
+ # Tibetan language processing
19
+ botok==0.9.0
20
+
21
+ # Build system for Cython
22
+ Cython==3.0.12
setup.py ADDED
@@ -0,0 +1,45 @@
1
+ import numpy
2
+ from setuptools import Extension, setup
3
+ from Cython.Build import cythonize
4
+
5
+ # It's good practice to specify encoding for portability
6
+ with open("README.md", "r", encoding="utf-8") as fh:
7
+ long_description = fh.read()
8
+
9
+ setup(
10
+ name="tibetan text metrics webapp",
11
+ version="0.1.0",
12
+ author="Daniel Wojahn / Tibetan Text Metrics",
13
+ author_email="daniel.wojahn@outlook.de",
14
+ description="Cython components for the Tibetan Text Metrics Webapp",
15
+ long_description=long_description,
16
+ long_description_content_type="text/markdown",
17
+ url="https://github.com/daniel-wojahn/tibetan-text-metrics",
18
+ ext_modules=cythonize(
19
+ [
20
+ Extension(
21
+ "pipeline.fast_lcs", # Module name to import: from pipeline.fast_lcs import ...
22
+ ["pipeline/fast_lcs.pyx"],
23
+ include_dirs=[numpy.get_include()],
24
+ )
25
+ ],
26
+ compiler_directives={'language_level' : "3"} # For Python 3 compatibility
27
+ ),
28
+ # Indicates that package data (like .pyx files) should be included if specified in MANIFEST.in
29
+ # For simple cases like this, Cythonize usually handles it.
30
+ include_package_data=True,
31
+ # Although this setup.py is in webapp, it's building modules for the 'pipeline' sub-package
32
+ # We don't list packages here as this setup.py is just for the extension.
33
+ # The main app will treat 'pipeline' as a regular package.
34
+ zip_safe=False, # Cython extensions are generally not zip-safe
35
+ classifiers=[
36
+ "Programming Language :: Python :: 3",
37
+ "License :: OSI Approved :: MIT License",
38
+ "Operating System :: OS Independent",
39
+ ],
40
+ python_requires='>=3.8',
41
+ install_requires=[
42
+ "numpy>=1.20", # Ensure numpy is available for runtime if not just build time
43
+ ],
44
+ # setup_requires is deprecated, use pyproject.toml for build-system requirements
45
+ )