revamped the pipeline and added stopwords and documentation
- README.md +37 -18
- app.py +50 -67
- packages.txt +2 -0
- pipeline/fast_lcs.pyx +23 -0
- pipeline/metrics.py +52 -18
- pipeline/process.py +3 -3
- pipeline/stopwords_bo.py +71 -0
- pipeline/visualize.py +0 -7
- pyproject.toml +8 -0
- requirements.txt +22 -12
- setup.py +45 -0
README.md
CHANGED
@@ -1,14 +1,14 @@
 ---
-title: Tibetan Text Metrics
-emoji:
+title: Tibetan Text Metrics
+emoji: 📚
 colorFrom: blue
-colorTo:
+colorTo: indigo
 sdk: gradio
-sdk_version: 5.29.
+sdk_version: 5.29.0
+python_version: 3.10
 app_file: app.py
-
-
-short_description: A web app for analyzing Tibetan textual similarities
+models:
+  - buddhist-nlp/buddhist-sentence-similarity
 ---
 
 # Tibetan Text Metrics Web App
@@ -17,7 +17,7 @@ short_description: A web app for analyzing Tibetan textual similarities
 [](https://creativecommons.org/licenses/by/4.0/)
 [](https://www.repostatus.org/#active)
 
-A user-friendly web application for analyzing textual similarities and variations in Tibetan manuscripts. This tool provides a graphical interface to the core text comparison functionalities of the [Tibetan Text Metrics (TTM)](https://github.com/daniel-wojahn/tibetan-text-metrics) project, making it accessible to researchers without Python or command-line experience. Built with Python and Gradio.
+A user-friendly web application for analyzing textual similarities and variations in Tibetan manuscripts. This tool provides a graphical interface to the core text comparison functionalities of the [Tibetan Text Metrics (TTM)](https://github.com/daniel-wojahn/tibetan-text-metrics) project, making it accessible to researchers without Python or command-line experience. Built with Python, Cython, and Gradio.
 
 ## Background
 
@@ -28,15 +28,15 @@ The Tibetan Text Metrics project aims to provide quantitative methods for assess
 - **Easy File Upload**: Upload one or more Tibetan `.txt` files directly through the browser.
 - **Automatic Segmentation**: Uses Tibetan section markers (e.g., `༈`) to automatically split texts into comparable chapters or sections.
 - **Core Metrics Computed**:
-  - **Jaccard Similarity (%)**: Measures vocabulary overlap between segments.
+  - **Jaccard Similarity (%)**: Measures vocabulary overlap between segments. *Common Tibetan stopwords are filtered out to focus on meaningful lexical similarity.*
   - **Normalized Longest Common Subsequence (LCS)**: Identifies the longest shared sequence of words, indicating direct textual parallels.
-  - **Semantic Similarity (BuddhistNLP)**: Uses the
-  - **TF-IDF Cosine Similarity**: Highlights texts that share important or characteristic terms by comparing their TF-IDF profiles.
+  - **Semantic Similarity (BuddhistNLP)**: Uses the [buddhist-nlp/buddhist-sentence-similarity](https://huggingface.co/buddhist-nlp/buddhist-sentence-similarity) model to compare the contextual meaning of segments. *Note: This metric is experimental and may not perform well for all texts. It is recommended to use it in combination with other metrics for a more comprehensive analysis.*
+  - **TF-IDF Cosine Similarity**: Highlights texts that share important or characteristic terms by comparing their TF-IDF profiles. *Common Tibetan stopwords are excluded to ensure TF-IDF weights highlight genuinely characteristic terms.*
 - **Handles Long Texts**: Implements automated chunking for semantic similarity to process texts exceeding the model's token limit.
 - **Interactive Visualizations**:
   - Heatmaps for Jaccard, LCS, Semantic, and TF-IDF similarity metrics, providing a quick overview of inter-segment relationships.
   - Bar chart displaying word counts per segment.
-- **Downloadable Results**: Export detailed metrics as a CSV file.
+- **Downloadable Results**: Export detailed metrics as a CSV file and save heatmaps as PNG files.
 - **Simplified Workflow**: No command-line interaction or Python scripting needed for analysis.
 
 ## Text Segmentation and Best Practices
@@ -47,19 +47,30 @@ To obtain meaningful results, it is highly recommended to divide your Tibetan te
 
 **How to segment your texts:**
 
-- Use
+- Use the Tibetan section marker (`༈` (sbrul shad)) to separate chapters/sections in your `.txt` files.
 - Each segment should represent a coherent part of the text (e.g., a chapter, legal clause, or thematic section).
 - The tool will automatically split your file on this marker for analysis. If no marker is found, the entire file is treated as a single segment, and a warning will be issued.
 
 **Best practices:**
 
-- Ensure
+- Ensure the marker is unique and does not appear within a chapter.
 - Try to keep chapters/sections of similar length for more balanced comparisons.
 - For poetry or short texts, consider grouping several poems or stanzas as one segment.
 
 ## Implemented Metrics
 
-The application computes and visualizes the following similarity metrics between corresponding chapters/segments of the uploaded texts:
+**Stopword Filtering:**
+To enhance the accuracy and relevance of similarity scores, both the Jaccard Similarity and TF-IDF Cosine Similarity calculations incorporate a stopword filtering step. This process removes high-frequency, low-information Tibetan words (e.g., common particles, pronouns, and grammatical markers) before the metrics are computed. This ensures that the resulting scores are more reflective of meaningful lexical and thematic similarities between texts, rather than being skewed by the presence of ubiquitous common words.
+
+The comprehensive list of Tibetan stopwords used is adapted and compiled from the following valuable resources:
+- The **Divergent Discourses** project (specifically, its Tibetan stopwords list available at [Zenodo Record 10148636](https://zenodo.org/records/10148636)).
+- The **Tibetan Lucene Analyser** by the Buddhist Digital Archives (BUDA), available on [GitHub: buda-base/lucene-bo](https://github.com/buda-base/lucene-bo).
+
+We extend our gratitude to the creators and maintainers of these projects for making their work available to the community.
+
+Feel free to edit this list of stopwords to better suit your needs. The list is stored in the `pipeline/stopwords_bo.py` file.
+
+### The application computes and visualizes the following similarity metrics between corresponding chapters/segments of the uploaded texts:
 
 1. **Jaccard Similarity (%)**: This metric quantifies the lexical overlap between two text segments by comparing their sets of *unique* words. It essentially answers the question: 'Of all the distinct words found across these two segments, what proportion of them are present in both?' It is calculated as `(Number of common unique words) / (Total number of unique words in both texts combined) * 100`. Jaccard Similarity is insensitive to word order and word frequency; it only cares whether a unique word is present or absent. A higher percentage indicates a greater overlap in the vocabularies used in the two segments.
 2. **Normalized LCS (Longest Common Subsequence)**: This metric measures the length of the longest sequence of words that appears in *both* text segments, maintaining their original relative order. Importantly, these words do not need to be directly adjacent (contiguous) in either text. For example, if Text A is '<u>the</u> quick <u>brown</u> fox <u>jumps</u>' and Text B is '<u>the</u> lazy cat and <u>brown</u> dog <u>jumps</u> high', the LCS is 'the brown jumps'. The length of this common subsequence is then normalized (in this tool, by dividing by the length of the longer of the two segments) to provide a score, which is then presented as a percentage. A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism, as it reflects similarities in how ideas are ordered and expressed sequentially.
@@ -67,7 +78,7 @@ The application computes and visualizes the following similarity metrics between
 3. **Semantic Similarity (BuddhistNLP)**: Utilizes the `buddhist-nlp/bodhi-sentence-cased-v1` model to compute the cosine similarity between the semantic embeddings of text segments. This model is fine-tuned for Buddhist studies texts and captures nuances in meaning. For texts exceeding the model's 512-token input limit, an automated chunking strategy is employed: texts are divided into overlapping chunks, each chunk is embedded, and the resulting chunk embeddings are averaged (mean pooling) to produce a single representative vector for the entire segment before comparison.
 4. **TF-IDF Cosine Similarity**: This metric first calculates Term Frequency-Inverse Document Frequency (TF-IDF) scores for each word in each text segment. TF-IDF gives higher weight to words that are frequent within a particular segment but relatively rare across the entire collection of segments. This helps to identify terms that are characteristic or discriminative for a segment. Each segment is then represented as a vector of these TF-IDF scores. Finally, the cosine similarity is computed between these vectors. A score closer to 1 indicates that the two segments share more of these important, distinguishing terms, suggesting they cover similar specific topics or themes.
 
-## Getting Started (
+## Getting Started (if running locally)
 
 1. Ensure you have Python 3.10 or newer.
 2. Navigate to the `webapp` directory:
@@ -84,11 +95,19 @@ The application computes and visualizes the following similarity metrics between
    ```bash
    pip install -r requirements.txt
    ```
-5.
+5. **Compile Cython Extension (Recommended for Performance)**:
+   To speed up the Longest Common Subsequence (LCS) calculation, a Cython extension is provided. To compile it:
+   ```bash
+   # Ensure you are in the webapp directory
+   python setup.py build_ext --inplace
+   ```
+   This step requires a C compiler. If you skip this, the application will use a slower, pure Python implementation for LCS.
+
+6. **Run the Web Application**:
   ```bash
   python app.py
   ```
-
+7. Open your web browser and go to the local URL provided (usually `http://127.0.0.1:7860`).
 
 ## Usage
 
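The README above describes splitting uploaded files on the Tibetan section marker `༈` and falling back to a single segment (with a warning) when no marker is present. The sketch below illustrates that rule only; the function name and details are invented for illustration, and the app's actual splitting lives in the pipeline package.

```python
# Minimal sketch of the segmentation rule described in the README: split on the
# Tibetan section marker '༈' (sbrul shad) and fall back to a single segment,
# with a warning, when the marker is absent. Illustrative only.
import warnings
from typing import List

def split_into_segments(raw_text: str, marker: str = "༈") -> List[str]:
    segments = [seg.strip() for seg in raw_text.split(marker) if seg.strip()]
    if len(segments) <= 1:
        warnings.warn("No section marker found; treating the whole file as one segment.")
        return [raw_text.strip()]
    return segments

text = "ཆོས་ཀྱི་རྒྱལ་པོ། ༈ རྒྱལ་པོ་ཆེན་པོ། ༈ མཐའ་མ།"
print(len(split_into_segments(text)))  # prints 3
```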
app.py
CHANGED
@@ -2,12 +2,12 @@ import gradio as gr
 from pathlib import Path
 from pipeline.process import process_texts
 from pipeline.visualize import generate_visualizations, generate_word_count_chart
-import cProfile
-import pstats
-import io
+import logging
 
 from theme import tibetan_theme
 
+logger = logging.getLogger(__name__)
+
 
 # Main interface logic
 def main_interface():
@@ -29,7 +29,7 @@ def main_interface():
     gr.Markdown(
         """
 ## Step 1: Upload Your Tibetan Text Files
-<span style='font-size:16px;'>Upload one or more `.txt` files. Each file should contain Unicode Tibetan text, segmented into chapters/sections if possible.</span>
+<span style='font-size:16px;'>Upload one or more `.txt` files. Each file should contain Unicode Tibetan text, segmented into chapters/sections if possible using the marker '༈' (<i>sbrul shad</i>).</span>
         """,
         elem_classes="gr-markdown",
     )
@@ -41,16 +41,16 @@
     with gr.Column(scale=1, elem_classes="step-column"):
         with gr.Group():
             gr.Markdown(
-                """## Step 2: Configure and
-<span style='font-size:16px;'>Choose your analysis options and click the button below to compute metrics and view results. For meaningful analysis, ensure your texts are segmented by chapter or section using
+                """## Step 2: Configure and run the analysis
+<span style='font-size:16px;'>Choose your analysis options and click the button below to compute metrics and view results. For meaningful analysis, ensure your texts are segmented by chapter or section using the marker '༈' (<i>sbrul shad</i>). The tool will split files based on this marker.</span>
                 """,
                 elem_classes="gr-markdown",
             )
             semantic_toggle_radio = gr.Radio(
-                label="Compute
+                label="Compute semantic similarity?",
                 choices=["Yes", "No"],
                 value="Yes",
-                info="Semantic similarity
+                info="Semantic similarity will be time-consuming. Choose 'No' to speed up analysis if these metrics are not required.",
                 elem_id="semantic-radio-group",
             )
             process_btn = gr.Button(
@@ -74,18 +74,20 @@
     heatmap_titles = {
         "Jaccard Similarity (%)": "Jaccard Similarity (%): Higher scores (brighter) mean more shared unique words.",
         "Normalized LCS": "Normalized LCS: Higher scores (brighter) mean longer shared sequences of words.",
-        "Semantic Similarity
+        "Semantic Similarity": "Semantic Similarity (using word embeddings/experimental): Higher scores (brighter) mean more similar meanings.",
         "TF-IDF Cosine Sim": "TF-IDF Cosine Similarity: Higher scores mean texts share more important, distinctive vocabulary.",
     }
 
     metric_tooltips = {
         "Jaccard Similarity (%)": """
 ### Jaccard Similarity (%)
-This metric quantifies the lexical overlap between two text segments by comparing their sets of *unique* words
-It essentially answers the question: 'Of all the distinct words found across these two segments, what proportion of them are present in both?'
-It is calculated as `(Number of common unique words) / (Total number of unique words in both texts combined) * 100`.
-Jaccard Similarity is insensitive to word order and word frequency; it only cares whether a unique word is present or absent.
-A higher percentage indicates a greater overlap in the vocabularies used in the two segments.
+This metric quantifies the lexical overlap between two text segments by comparing their sets of *unique* words, after **filtering out common Tibetan stopwords**.
+It essentially answers the question: 'Of all the distinct, meaningful words found across these two segments, what proportion of them are present in both?'
+It is calculated as `(Number of common unique meaningful words) / (Total number of unique meaningful words in both texts combined) * 100`.
+Jaccard Similarity is insensitive to word order and word frequency; it only cares whether a unique meaningful word is present or absent.
+A higher percentage indicates a greater overlap in the significant vocabularies used in the two segments.
+
+**Stopword Filtering**: uses a range of stopwords to filter out common Tibetan words that do not contribute to the semantic content of the text.
         """,
         "Normalized LCS": """
 ### Normalized LCS (Longest Common Subsequence)
@@ -95,22 +97,28 @@ For example, if Text A is '<u>the</u> quick <u>brown</u> fox <u>jumps</u>' and T
 The length of this common subsequence is then normalized (in this tool, by dividing by the length of the longer of the two segments) to provide a score, which is then presented as a percentage.
 A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism, as it reflects similarities in how ideas are ordered and expressed sequentially.
 
-
+**No Stopword Filtering.** Unlike metrics such as Jaccard Similarity or TF-IDF Cosine Similarity (which typically filter out common stopwords to focus on content-bearing words), the LCS calculation in this tool intentionally uses the raw, unfiltered sequence of tokens from your texts. This design choice allows LCS to capture structural similarities and the flow of language, including the use of particles and common words that contribute to sentence construction and narrative sequence. By not removing stopwords, LCS can reveal similarities in phrasing and textual structure that might otherwise be obscured, making it a valuable complement to metrics that focus purely on lexical overlap of keywords.
+
+**Note on Interpretation**: It is possible for Normalized LCS to be higher than Jaccard Similarity. This often happens when texts share a substantial 'narrative backbone' or common ordered phrases (leading to a high LCS), even if they use varied surrounding vocabulary or introduce many unique words not part of these core sequences (which would lower the Jaccard score). LCS highlights this sequential, structural similarity, while Jaccard focuses on the overall shared vocabulary regardless of its arrangement.
         """,
-        "Semantic Similarity
-### Semantic Similarity (
-Utilizes the
+        "Semantic Similarity": """
+### Semantic Similarity (Experimental)
+Utilizes the `<a href="https://huggingface.co/buddhist-nlp/buddhist-sentence-similarity">buddhist-nlp/buddhist-sentence-similarity</a>` model to compute the cosine similarity between the semantic embeddings of text segments.
 This model is fine-tuned for Buddhist studies texts and captures nuances in meaning.
 For texts exceeding the model's 512-token input limit, an automated chunking strategy is employed: texts are divided into overlapping chunks, each chunk is embedded, and the resulting chunk embeddings are averaged (mean pooling) to produce a single representative vector for the entire segment before comparison.
+
+**Note**: This metric is experimental and may not perform well for all texts. It is recommended to use it in combination with other metrics for a more comprehensive analysis.
         """,
         "TF-IDF Cosine Sim": """
 ### TF-IDF Cosine Similarity
-This metric first calculates Term Frequency-Inverse Document Frequency (TF-IDF) scores for each word in each text segment
+This metric first calculates Term Frequency-Inverse Document Frequency (TF-IDF) scores for each word in each text segment, **after filtering out common Tibetan stopwords**.
 TF-IDF gives higher weight to words that are frequent within a particular segment but relatively rare across the entire collection of segments.
-This helps to identify terms that are characteristic or discriminative for a segment.
+This helps to identify terms that are characteristic or discriminative for a segment. By excluding stopwords, the TF-IDF scores better reflect genuinely significant terms.
 Each segment is then represented as a vector of these TF-IDF scores.
 Finally, the cosine similarity is computed between these vectors.
 A score closer to 1 indicates that the two segments share more of these important, distinguishing terms, suggesting they cover similar specific topics or themes.
+
+**Stopword Filtering**: uses a range of stopwords to filter out common Tibetan words that do not contribute to the semantic content of the text.
         """,
     }
     heatmap_tabs = {}
@@ -138,6 +146,16 @@ A score closer to 1 indicates that the two segments share more of these importan
     warning_box = gr.Markdown(visible=False)
 
     def run_pipeline(files, enable_semantic_str):
+        # Initialize all return values to ensure defined paths for all outputs
+        csv_path_res = None
+        metrics_preview_df_res = None  # Can be a DataFrame or a string message
+        word_count_fig_res = None
+        jaccard_heatmap_res = None
+        lcs_heatmap_res = None
+        semantic_heatmap_res = None
+        tfidf_heatmap_res = None
+        warning_update_res = gr.update(value="", visible=False)  # Default: no warning
+
         """
         Processes uploaded files, computes metrics, generates visualizations, and prepares outputs for the UI.
 
@@ -157,40 +175,15 @@ A score closer to 1 indicates that the two segments share more of these importan
         if not files:
             return (
                 None,
-                "Please upload files to
-                None,
-                None,
-                None,
-                None,
-
+                "Please upload files to analyze.",
+                None,  # word_count_plot
+                None,  # jaccard_heatmap
+                None,  # lcs_heatmap
+                None,  # semantic_heatmap
+                None,  # tfidf_heatmap
+                gr.update(value="Please upload files.", visible=True),
             )
 
-        pr = cProfile.Profile()
-        pr.enable()
-
-        # Initialize results to ensure they are defined in finally if an early error occurs
-        (
-            csv_path_res,
-            metrics_preview_df_res,
-            word_count_fig_res,
-            jaccard_heatmap_res,
-            lcs_heatmap_res,
-            semantic_heatmap_res,
-            tfidf_heatmap_res,
-            warning_update_res,
-        ) = (
-            None,
-            "Processing error. See console for details.",
-            None,
-            None,
-            None,
-            None,
-            None,
-            gr.update(
-                value="Processing error. See console for details.", visible=True
-            ),
-        )
-
         try:
             filenames = [
                 Path(file.name).name for file in files
@@ -214,7 +207,7 @@ A score closer to 1 indicates that the two segments share more of these importan
                 )
                 metrics_preview_df_res = warning_message
                 warning_update_res = gr.update(value=warning_message, visible=True)
-                # Results for this case are set,
+                # Results for this case are set, then return
             else:
                 # heatmap_titles is already defined in the outer scope of main_interface
                 heatmaps_data = generate_visualizations(
@@ -237,22 +230,12 @@ A score closer to 1 indicates that the two segments share more of these importan
             )
 
         except Exception as e:
-
-            metrics_preview_df_res
+            logger.error(f"Error in run_pipeline: {e}", exc_info=True)
+            # metrics_preview_df_res and warning_update_res are set here.
+            # Other plot/file path variables will retain their initial 'None' values set at function start.
+            metrics_preview_df_res = f"Error: {str(e)}"
             warning_update_res = gr.update(value=f"Error: {str(e)}", visible=True)
 
-        finally:
-            pr.disable()
-            s = io.StringIO()
-            sortby = (
-                pstats.SortKey.CUMULATIVE
-            )  # Sort by cumulative time spent in function
-            ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
-            print("\n--- cProfile Stats (Top 30) ---")
-            ps.print_stats(30)  # Print the top 30 costly functions
-            print(s.getvalue())
-            print("-------------------------------\n")
-
         return (
             csv_path_res,
             metrics_preview_df_res,
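The tooltip text above explains how segments longer than the model's 512-token limit are handled: the text is cut into overlapping chunks, each chunk is embedded, and the chunk embeddings are mean-pooled into one vector per segment. The sketch below mirrors that idea with `sentence-transformers`; the chunking helper, chunk sizes, and placeholder token lists are assumptions made for illustration, and the app's real logic lives in `pipeline/semantic_embedding.py` and `pipeline/metrics.py`.

```python
# Illustrative sketch of the "overlapping chunks -> embed -> mean-pool" strategy
# described in the tooltips above. Chunk sizes and token lists are placeholders.
from typing import List
from sentence_transformers import SentenceTransformer, util

def embed_with_chunking(model: SentenceTransformer, tokens: List[str],
                        chunk_size: int = 400, overlap: int = 50):
    step = chunk_size - overlap
    chunks = [" ".join(tokens[i:i + chunk_size])
              for i in range(0, max(len(tokens), 1), step)]
    chunk_embeddings = model.encode(chunks, convert_to_tensor=True)
    return chunk_embeddings.mean(dim=0)  # one pooled vector per segment

model = SentenceTransformer("buddhist-nlp/buddhist-sentence-similarity")
tokens_a = ["བཀྲ་ཤིས་", "བདེ་ལེགས་"] * 400  # stand-ins for botok output
tokens_b = ["བདེ་ལེགས་", "ཆེན་པོ་"] * 400
score = util.cos_sim(embed_with_chunking(model, tokens_a),
                     embed_with_chunking(model, tokens_b)).item()
print(round(score, 3))
```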
packages.txt
ADDED
@@ -0,0 +1,2 @@
+build-essential
+python3-dev
pipeline/fast_lcs.pyx
ADDED
@@ -0,0 +1,23 @@
+# fast_lcs.pyx
+import numpy as np
+
+cimport cython
+cimport numpy as np
+
+
+@cython.boundscheck(False)
+@cython.wraparound(False)
+def compute_lcs_fast(list words1, list words2):
+    cdef int m = len(words1)
+    cdef int n = len(words2)
+    cdef np.ndarray[np.int32_t, ndim=2] dp = np.zeros((m + 1, n + 1), dtype=np.int32)
+    cdef int i, j
+
+    for i in range(1, m + 1):
+        for j in range(1, n + 1):
+            if words1[i - 1] == words2[j - 1]:
+                dp[i, j] = dp[i - 1, j - 1] + 1
+            else:
+                dp[i, j] = max(dp[i - 1, j], dp[i, j - 1])
+
+    return int(dp[m, n])
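The Cython kernel above is the textbook O(m·n) dynamic program for LCS length. After compiling it as described in the README (`python setup.py build_ext --inplace` inside `webapp/`), a quick equivalence check against a pure-Python reference might look like the sketch below; the reference function here is written for illustration and is not part of the app.

```python
# Sanity check for the compiled extension; run from the webapp/ directory after
# building. The pure-Python reference below is illustrative only.
from pipeline.fast_lcs import compute_lcs_fast

def lcs_reference(a, b):
    """Textbook O(m*n) LCS length, for comparison only."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if a[i - 1] == b[j - 1] else max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

w1 = "the quick brown fox jumps".split()
w2 = "the lazy cat and brown dog jumps high".split()
assert compute_lcs_fast(w1, w2) == lcs_reference(w1, w2) == 3  # LCS: "the brown jumps"
```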
pipeline/metrics.py
CHANGED
@@ -7,8 +7,17 @@ import torch
 from .semantic_embedding import generate_embeddings
 from .tokenize import tokenize_texts
 import logging
-from numba import njit
+from sentence_transformers import SentenceTransformer
 from sklearn.feature_extraction.text import TfidfVectorizer
+from .stopwords_bo import TIBETAN_STOPWORDS, TIBETAN_STOPWORDS_SET
+
+# Attempt to import the Cython-compiled fast_lcs module
+try:
+    from .fast_lcs import compute_lcs_fast
+    USE_CYTHON_LCS = True
+except ImportError:
+    # print("Cython fast_lcs not found, using Python LCS. For better performance, compile the Cython module.")
+    USE_CYTHON_LCS = False
 
 logger = logging.getLogger(__name__)
 
@@ -65,19 +74,28 @@ def _chunk_text(
     return reconstructed_text_chunks
 
 
-@njit
 def compute_normalized_lcs(words1: List[str], words2: List[str]) -> float:
+    # Calculate m and n (lengths) here, so they are available for normalization
+    # regardless of which LCS implementation is used.
     m, n = len(words1), len(words2)
-
-
-
-
-
-
-
-
-
-
+
+    if USE_CYTHON_LCS:
+        # Use the Cython-compiled version if available
+        lcs_length = compute_lcs_fast(words1, words2)
+    else:
+        # Fallback to pure Python implementation
+        # m, n = len(words1), len(words2) # Moved to the beginning of the function
+        # Using numpy array for dp table can be slightly faster than list of lists for large inputs
+        # but the primary bottleneck is the Python loop itself compared to Cython.
+        dp = np.zeros((m + 1, n + 1), dtype=np.int32)
+
+        for i in range(1, m + 1):
+            for j in range(1, n + 1):
+                if words1[i - 1] == words2[j - 1]:
+                    dp[i, j] = dp[i - 1, j - 1] + 1
+                else:
+                    dp[i, j] = max(dp[i - 1, j], dp[i, j - 1])
+        lcs_length = int(dp[m, n])
     avg_length = (m + n) / 2
     return lcs_length / avg_length if avg_length > 0 else 0.0
 
@@ -209,6 +227,7 @@ def compute_all_metrics(
     # Prepare token lists (always use tokenize_texts for raw Unicode)
     token_lists = {}
    corpus_for_tfidf = []  # For storing space-joined tokens for TF-IDF
+    tibetan_stopwords_set = set()  # Initialize for Jaccard (and potentially LCS) filtering
 
     for fname, content in texts.items():
         tokenized_content = tokenize_texts([content])  # Returns a list of lists
@@ -228,8 +247,14 @@
     if corpus_for_tfidf:
         # Using a dummy tokenizer and preprocessor as input is already tokenized (as space-separated strings)
         # and we don't want further case changes or token modifications for Tibetan.
+        # Define Tibetan stopwords. These should match tokens produced by botok.
+        # Tibetan stopwords are now imported from stopwords_bo.py
+
         vectorizer = TfidfVectorizer(
-            tokenizer=lambda x: x.split(),
+            tokenizer=lambda x: x.split(),
+            preprocessor=lambda x: x,
+            token_pattern=None,
+            stop_words=TIBETAN_STOPWORDS
         )
         tfidf_matrix = vectorizer.fit_transform(corpus_for_tfidf)
         # Calculate pairwise cosine similarity on the TF-IDF matrix
@@ -243,20 +268,29 @@
 
     for i, j in combinations(range(len(files)), 2):
         f1, f2 = files[i], files[j]
-
+        words1_raw, words2_raw = token_lists[f1], token_lists[f2]
+
+        # Filter stopwords for Jaccard calculation using the imported TIBETAN_STOPWORDS_SET
+        # If TIBETAN_STOPWORDS_SET is empty (e.g., if stopwords_bo.py somehow yields an empty set),
+        # filtering will have no effect, which is a safe fallback.
+        words1_jaccard = [word for word in words1_raw if word not in TIBETAN_STOPWORDS_SET]
+        words2_jaccard = [word for word in words2_raw if word not in TIBETAN_STOPWORDS_SET]
+
         jaccard = (
-            len(set(
-            if set(
+            len(set(words1_jaccard) & set(words2_jaccard)) / len(set(words1_jaccard) | set(words2_jaccard))
+            if set(words1_jaccard) | set(words2_jaccard)  # Ensure denominator is not zero
             else 0.0
         )
+        # LCS uses raw tokens (words1_raw, words2_raw) to provide a complementary metric.
+        # Semantic similarity also uses raw text and its botok tokens for chunking decisions.
         jaccard_percent = jaccard * 100.0
-        norm_lcs = compute_normalized_lcs(
+        norm_lcs = compute_normalized_lcs(words1_raw, words2_raw)
 
         # Semantic Similarity Calculation
         if enable_semantic:
             # Pass raw texts and their pre-computed botok tokens
             semantic_sim = compute_semantic_similarity(
-                texts[f1], texts[f2],
+                texts[f1], texts[f2], words1_raw, words2_raw, model, device
             )
         else:
             semantic_sim = np.nan
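The `TfidfVectorizer` configuration above passes pre-tokenized, space-joined segments together with the imported Tibetan stopword list. The sketch below reproduces that pattern with scikit-learn on two placeholder segments; in the app the corpus comes from botok tokenization inside `compute_all_metrics`, and the stopword list is `TIBETAN_STOPWORDS`.

```python
# Sketch of the TF-IDF + cosine-similarity pattern configured above, using
# placeholder segments and a stand-in stopword list.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

segments = [
    "ཆོས་ ཀྱི་ རྒྱལ་པོ་ ཆེན་པོ་",  # space-joined botok tokens (placeholder)
    "རྒྱལ་པོ་ ཆེན་པོ་ ཆོས་ ལ་",
]
stopwords = ["ཀྱི་", "ལ་"]  # stand-in for TIBETAN_STOPWORDS

vectorizer = TfidfVectorizer(
    tokenizer=lambda x: x.split(),  # input is already tokenized
    preprocessor=lambda x: x,       # no case folding or normalization
    token_pattern=None,
    stop_words=stopwords,
)
tfidf_matrix = vectorizer.fit_transform(segments)
print(cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])[0, 0])
```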
pipeline/process.py
CHANGED
@@ -4,6 +4,7 @@ from .metrics import compute_all_metrics
 from .semantic_embedding import get_sentence_transformer_model_and_device
 from .tokenize import tokenize_texts
 import logging
+from itertools import combinations
 
 logger = logging.getLogger(__name__)
 
@@ -37,7 +38,8 @@ def process_texts(
                 f"Failed to load sentence transformer model: {e}. Semantic similarity will not be available."
             )
             # Optionally, add a warning to the UI if model loading fails
-            #
+            # For now, keeping it as a logger.error. UI warning can be added later if desired.
+            pass  # Explicitly noting that we are not changing the warning handling for UI here.
     else:
         logger.info("Semantic similarity disabled. Skipping model loading.")
 
@@ -71,8 +73,6 @@
         fname = seg_id.split("|")[0]
         file_to_chapters.setdefault(fname, []).append(seg_id)
     # For each pair of files, compare corresponding chapters (by index)
-    from itertools import combinations
-
     results = []
     files = list(file_to_chapters.keys())
     for file1, file2 in combinations(files, 2):
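`process.py` builds a chapter list per file and then compares every pair of uploaded files chapter by chapter, which is why `combinations` is now imported at module level. The sketch below illustrates that pairing with made-up data; whether the app aligns chapters exactly as `zip` does here is an assumption made for illustration.

```python
# Illustration of the pairing logic: every pair of files is compared,
# chapter by chapter, aligned by index. Data below is made up.
from itertools import combinations

file_to_chapters = {
    "textA.txt": ["A chapter 1 tokens", "A chapter 2 tokens"],
    "textB.txt": ["B chapter 1 tokens", "B chapter 2 tokens", "B chapter 3 tokens"],
    "textC.txt": ["C chapter 1 tokens"],
}

for file1, file2 in combinations(file_to_chapters, 2):
    chapters1, chapters2 = file_to_chapters[file1], file_to_chapters[file2]
    # zip() stops at the shorter file, so only chapters present in both are compared
    for idx, (ch1, ch2) in enumerate(zip(chapters1, chapters2), start=1):
        print(f"{file1} vs {file2}, chapter {idx}: {ch1!r} <-> {ch2!r}")
```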
pipeline/stopwords_bo.py
ADDED
@@ -0,0 +1,71 @@
+# -*- coding: utf-8 -*-
+"""Module for Tibetan stopwords.
+
+This file centralizes the Tibetan stopwords list used in the Tibetan Text Metrics application.
+Sources for these stopwords are acknowledged in the main README.md of the webapp.
+"""
+
+# Initial set of stopwords with clear categories
+PARTICLES_INITIAL = [
+    "ཏུ", "གི", "ཀྱི", "གིས", "ཀྱིས", "ཡིས", "ཀྱང", "སྟེ", "ཏེ", "ནོ", "ཏོ",
+    "ཅིང", "ཅིག", "ཅེས", "ཞེས", "གྱིས", "ན",
+]
+
+MARKERS_AND_PUNCTUATION = ["༈", "།", "༎", "༑"]
+
+ORDINAL_NUMBERS = [
+    "དང་པོ", "གཉིས་པ", "གསུམ་པ", "བཞི་པ", "ལྔ་པ", "དྲུག་པ", "བདུན་པ", "བརྒྱད་པ", "དགུ་པ", "བཅུ་པ",
+    "བཅུ་གཅིག་པ", "བཅུ་གཉིས་པ", "བཅུ་གསུམ་པ", "བཅུ་བཞི་པ", "བཅོ་ལྔ་པ",
+    "བཅུ་དྲུག་པ", "བཅུ་བདུན་པ", "བཅོ་བརྒྱད་པ", "བཅུ་དགུ་པ", "ཉི་ཤུ་པ",
+]
+
+# Additional stopwords from the comprehensive list, categorized for readability
+MORE_PARTICLES_SUFFIXES = [
+    "འི་", "དུ་", "གིས་", "ཏེ", "གི་", "ཡི་", "ཀྱི་", "པས་", "ཀྱིས་", "ཡི", "ལ", "ནི་", "ར", "དུ",
+    "ལས", "གྱིས་", "ས", "ཏེ་", "གྱི་", "དེ", "ཀ་", "སྟེ", "སྟེ་", "ངམ", "ཏོ", "དོ", "དམ་",
+    "གྱིན་", "ན", "འམ་", "ཀྱིན་", "ལོ", "ཀྱིས", "བས་", "ཤིག", "གིས", "ཀི་", "ཡིས་", "གྱི", "གི",
+    "བམ་", "ཤིག་", "ནམ", "མིན་", "ནམ་", "ངམ་", "རུ་", "ཤས་", "ཏུ", "ཡིས", "གིན་", "གམ་",
+    "གྱིས", "ཅང་", "སམ་", "ཞིག", "འང", "རུ", "དང", "ཡ", "འག", "སམ", "ཀ", "འམ", "མམ",
+    "དམ", "ཀྱི", "ལམ", "ནོ་", "སོ་", "རམ་", "བོ་", "ཨང་", "ཕྱི", "ཏོ་", "གེ", "གོ", "རོ་", "བོ",
+    "པས", "འི", "རམ", "བས", "མཾ་", "པོ", "ག་", "ག", "གམ", "བམ", "མོ་", "མམ་", "ཏམ་", "ངོ",
+    "ཏམ", "གིང་", "ཀྱང"  # ཀྱང also in PARTICLES_INITIAL, set() will handle duplicates
+]
+
+PRONOUNS_DEMONSTRATIVES = ["འདི", "གཞན་", "དེ་", "རང་", "སུ་"]
+
+VERBS_AUXILIARIES = ["ཡིན་", "མི་", "ལགས་པ", "ཡིན་པ", "ལགས་", "མིན་", "ཡིན་པ་", "མིན", "ཡིན་བ", "ཡིན་ལུགས་"]
+
+ADVERBS_QUALIFIERS_INTENSIFIERS = [
+    "སོགས་", "ཙམ་", "ཡང་", "ཉིད་", "ཞིང་", "རུང་", "ན་རེ", "འང་", "ཁོ་ན་", "འཕྲལ་", "བར་",
+    "ཅུང་ཟད་", "ཙམ་པ་", "ཤ་སྟག་"
+]
+
+QUANTIFIERS_DETERMINERS_COLLECTIVES = [
+    "རྣམས་", "ཀུན་", "སྙེད་", "བཅས་", "ཡོངས་", "མཐའ་དག་", "དག་", "ཚུ", "ཚང་མ", "ཐམས་ཅད་",
+    "ཅིག་", "སྣ་ཚོགས་", "སྙེད་པ", "རེ་རེ་", "འགའ་", "སྤྱི", "དུ་མ", "མ", "ཁོ་ན", "ཚོ", "ལ་ལ་",
+    "སྙེད་པ་", "འབའ་", "སྙེད", "གྲང་", "ཁ་", "ངེ་", "ཅོག་", "རིལ་", "ཉུང་ཤས་", "ཚ་"
+]
+
+CONNECTORS_CONJUNCTIONS = ["དང་", "ཅིང་", "ཤིང་"]
+
+INTERJECTIONS_EXCLAMATIONS = ["ཨེ་", "འོ་"]
+
+# Combine all categorized lists
+_ALL_STOPWORDS_CATEGORIZED = (
+    PARTICLES_INITIAL +
+    MARKERS_AND_PUNCTUATION +
+    ORDINAL_NUMBERS +
+    MORE_PARTICLES_SUFFIXES +
+    PRONOUNS_DEMONSTRATIVES +
+    VERBS_AUXILIARIES +
+    ADVERBS_QUALIFIERS_INTENSIFIERS +
+    QUANTIFIERS_DETERMINERS_COLLECTIVES +
+    CONNECTORS_CONJUNCTIONS +
+    INTERJECTIONS_EXCLAMATIONS
+)
+
+# Final flat list of unique stopwords for TfidfVectorizer (as a list)
+TIBETAN_STOPWORDS = list(set(_ALL_STOPWORDS_CATEGORIZED))
+
+# Final set of unique stopwords for efficient Jaccard/LCS filtering (as a set)
+TIBETAN_STOPWORDS_SET = set(TIBETAN_STOPWORDS)
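`TIBETAN_STOPWORDS` (a list) is meant for `TfidfVectorizer`, while `TIBETAN_STOPWORDS_SET` supports fast membership tests when filtering tokens for Jaccard similarity in `pipeline/metrics.py`. A small sketch of that filtering, with placeholder token lists standing in for botok output:

```python
# Sketch of how the exported stopword set is consumed for Jaccard filtering
# (see pipeline/metrics.py). Token lists are placeholders for botok output.
from pipeline.stopwords_bo import TIBETAN_STOPWORDS_SET

tokens_a = ["ཆོས་", "ཀྱི་", "རྒྱལ་པོ་", "ཆེན་པོ་"]
tokens_b = ["རྒྱལ་པོ་", "ཀྱི་", "ཞེས", "ཆོས་"]

set_a = {t for t in tokens_a if t not in TIBETAN_STOPWORDS_SET}
set_b = {t for t in tokens_b if t not in TIBETAN_STOPWORDS_SET}
union = set_a | set_b
jaccard_percent = (len(set_a & set_b) / len(union) * 100.0) if union else 0.0
print(round(jaccard_percent, 1))
```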
pipeline/visualize.py
CHANGED
@@ -84,13 +84,6 @@ def generate_visualizations(metrics_df: pd.DataFrame, descriptive_titles: dict =
         )
         heatmaps[metric] = fig
 
-    # Use all features including pattern similarities if present
-    if not metrics_df.empty:
-        # Remove '.txt' from Text Pair labels
-        metrics_df = metrics_df.copy()
-        metrics_df["Text Pair"] = metrics_df["Text Pair"].str.replace(
-            ".txt", "", regex=False
-        )
     return heatmaps
 
 
pyproject.toml
ADDED
@@ -0,0 +1,8 @@
+[build-system]
+requires = [
+    "setuptools>=42",
+    "Cython>=0.29.21",
+    "numpy>=1.20"
+]
+build-backend = "setuptools.build_meta"
+backend-path = ["."]  # Specifies that setuptools.build_meta is in the current directory's PYTHONPATH
requirements.txt
CHANGED
@@ -1,12 +1,22 @@
-
-
-
-
-
-
-
-
-
-
-
-
+# Core application and UI
+gradio==5.29.1
+pandas==2.2.3
+
+# Plotting and visualization
+plotly==6.0.0
+matplotlib==3.10.1
+seaborn==0.13.2
+
+# Machine learning and numerical processing
+numpy==1.26.4
+scikit-learn==1.6.1
+torch==2.7.0
+transformers==4.51.3
+sentence-transformers==4.1.0
+numba==0.61.2
+
+# Tibetan language processing
+botok==0.9.0
+
+# Build system for Cython
+Cython==3.0.12
setup.py
ADDED
@@ -0,0 +1,45 @@
+import numpy
+from setuptools import Extension, setup
+from Cython.Build import cythonize
+
+# It's good practice to specify encoding for portability
+with open("README.md", "r", encoding="utf-8") as fh:
+    long_description = fh.read()
+
+setup(
+    name="tibetan text metrics webapp",
+    version="0.1.0",
+    author="Daniel Wojahn / Tibetan Text Metrics",
+    author_email="daniel.wojahn@outlook.de",
+    description="Cython components for the Tibetan Text Metrics Webapp",
+    long_description=long_description,
+    long_description_content_type="text/markdown",
+    url="https://github.com/daniel-wojahn/tibetan-text-metrics",
+    ext_modules=cythonize(
+        [
+            Extension(
+                "pipeline.fast_lcs",  # Module name to import: from pipeline.fast_lcs import ...
+                ["pipeline/fast_lcs.pyx"],
+                include_dirs=[numpy.get_include()],
+            )
+        ],
+        compiler_directives={'language_level': "3"}  # For Python 3 compatibility
+    ),
+    # Indicates that package data (like .pyx files) should be included if specified in MANIFEST.in
+    # For simple cases like this, Cythonize usually handles it.
+    include_package_data=True,
+    # Although this setup.py is in webapp, it's building modules for the 'pipeline' sub-package
+    # We don't list packages here as this setup.py is just for the extension.
+    # The main app will treat 'pipeline' as a regular package.
+    zip_safe=False,  # Cython extensions are generally not zip-safe
+    classifiers=[
+        "Programming Language :: Python :: 3",
+        "License :: OSI Approved :: MIT License",
+        "Operating System :: OS Independent",
+    ],
+    python_requires='>=3.8',
+    install_requires=[
+        "numpy>=1.20",  # Ensure numpy is available for runtime if not just build time
+    ],
+    # setup_requires is deprecated, use pyproject.toml for build-system requirements
+)
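After running the build step from the README (`python setup.py build_ext --inplace` inside `webapp/`), the snippet below is one way to confirm whether the compiled extension is importable, mirroring the try/except guard added in `pipeline/metrics.py`; if the import fails, the app falls back to the pure-Python LCS.

```python
# Run from the webapp/ directory after building the extension.
try:
    from pipeline.fast_lcs import compute_lcs_fast
    print("Cython LCS available:", compute_lcs_fast(list("abcde"), list("ace")))  # -> 3
except ImportError:
    print("fast_lcs not compiled; the app will use the slower pure-Python LCS fallback.")
```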