daniel-wojahn committed commit 4bf5701 (verified) · 1 parent: c667561

Upload 19 files
README.md CHANGED
@@ -1,14 +1,122 @@
- ---
- title: Ttm Webapp Hf
- emoji: 💻
- colorFrom: yellow
- colorTo: yellow
- sdk: gradio
- sdk_version: 5.29.1
- app_file: app.py
- pinned: false
- license: mit
- short_description: A web app for analyzing Tibetan textual similarities
- ---
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Tibetan Text Metrics Web App
+
+ [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
+ [![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)
+ [![Project Status: Active – Web app version for accessible text analysis.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
+
+ A user-friendly web application for analyzing textual similarities and variations in Tibetan manuscripts. This tool provides a graphical interface to the core text comparison functionalities of the [Tibetan Text Metrics (TTM)](https://github.com/daniel-wojahn/tibetan-text-metrics) project, making it accessible to researchers without Python or command-line experience. Built with Python and Gradio.
+
+ ## Background
+
+ The Tibetan Text Metrics project aims to provide quantitative methods for assessing textual similarities at the chapter or segment level, helping researchers understand patterns of textual evolution. This web application extends these capabilities by offering an intuitive interface, removing the need for manual script execution and environment setup for end-users.
+
+ ## Key Features of the Web App
+
+ - **Easy File Upload**: Upload one or more Tibetan `.txt` files directly through the browser.
+ - **Automatic Segmentation**: Uses Tibetan section markers (e.g., `༈`) to automatically split texts into comparable chapters or sections.
+ - **Core Metrics Computed**:
+   - **Jaccard Similarity (%)**: Measures vocabulary overlap between segments.
+   - **Normalized Longest Common Subsequence (LCS)**: Identifies the longest shared sequence of words, indicating direct textual parallels.
+   - **Semantic Similarity (BuddhistNLP)**: Uses the `buddhist-nlp/bodhi-sentence-cased-v1` model to compare the contextual meaning of segments.
+   - **TF-IDF Cosine Similarity**: Highlights texts that share important or characteristic terms by comparing their TF-IDF profiles.
+ - **Handles Long Texts**: Implements automated chunking for semantic similarity to process texts exceeding the model's token limit.
+ - **Interactive Visualizations**:
+   - Heatmaps for Jaccard, LCS, Semantic, and TF-IDF similarity metrics, providing a quick overview of inter-segment relationships.
+   - Bar chart displaying word counts per segment.
+ - **Downloadable Results**: Export detailed metrics as a CSV file.
+ - **Simplified Workflow**: No command-line interaction or Python scripting needed for analysis.
+
+ ## Text Segmentation and Best Practices
+
+ **Why segment your texts?**
+
+ To obtain meaningful results, it is highly recommended to divide your Tibetan texts into logical chapters or sections before uploading. Comparing entire texts as a single unit often produces shallow or misleading results, especially for long or complex works. Chapters or sections allow the tool to detect stylistic, lexical, or structural differences that would otherwise be hidden.
+
+ **How to segment your texts:**
+
+ - Use a clear marker (e.g., `༈` or another unique string) to separate chapters/sections in your `.txt` files.
+ - Each segment should represent a coherent part of the text (e.g., a chapter, legal clause, or thematic section).
+ - The tool will automatically split your file on this marker for analysis. If no marker is found, the entire file is treated as a single segment, and a warning will be issued.
+
+ **Best practices:**
+
+ - Ensure your marker is unique and does not appear within a chapter.
+ - Try to keep chapters/sections of similar length for more balanced comparisons.
+ - For poetry or short texts, consider grouping several poems or stanzas as one segment.
+
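As a minimal sketch of what this marker-based splitting looks like (for illustration only; the app's actual segmentation logic lives in `pipeline/process.py` and may differ in details):

```python
# Minimal sketch of marker-based segmentation; the web app's real logic
# lives in pipeline/process.py and may differ in details.
marker = "༈"
raw_text = "བམ་པོ་དང་པོ། ༈ བམ་པོ་གཉིས་པ། ༈ བམ་པོ་གསུམ་པ།"

# Split on the marker and drop empty or whitespace-only pieces.
segments = [seg.strip() for seg in raw_text.split(marker) if seg.strip()]

if len(segments) == 1:
    # Mirrors the behaviour described above: no marker found means one
    # segment plus a warning.
    print("Warning: no section marker found; treating the file as a single segment.")

print(f"{len(segments)} segments")
```

Running this on the three-section sample above yields three segments; a file without the marker would fall through to the warning branch.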
+ ## Implemented Metrics
+
+ The application computes and visualizes the following similarity metrics between corresponding chapters/segments of the uploaded texts:
+
+ 1. **Jaccard Similarity (%)**: This metric quantifies the lexical overlap between two text segments by comparing their sets of *unique* words. It essentially answers the question: 'Of all the distinct words found across these two segments, what proportion is present in both?' It is calculated as `(Number of common unique words) / (Total number of unique words in both texts combined) * 100`. Jaccard Similarity is insensitive to word order and word frequency; it only registers whether a unique word is present or absent. A higher percentage indicates a greater overlap between the vocabularies of the two segments.
+ 2. **Normalized LCS (Longest Common Subsequence)**: This metric measures the length of the longest sequence of words that appears in *both* text segments while maintaining its original relative order. Importantly, these words do not need to be directly adjacent (contiguous) in either text. For example, if Text A is '<u>the</u> quick <u>brown</u> fox <u>jumps</u>' and Text B is '<u>the</u> lazy cat and <u>brown</u> dog <u>jumps</u> high', the LCS is 'the brown jumps'. The length of this common subsequence is then normalized (in this tool, by dividing by the average length of the two segments) to produce a score, which is then presented as a percentage. A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism, as it reflects similarities in how ideas are ordered and expressed sequentially.
+    * *Note on interpretation*: Normalized LCS can be higher than Jaccard Similarity. This often happens when texts share a substantial 'narrative backbone' or common ordered phrases (yielding a high LCS), even while using varied surrounding vocabulary or many unique words outside these core sequences (which lowers the Jaccard score). LCS highlights this sequential, structural similarity, while Jaccard focuses on the overall shared vocabulary regardless of its arrangement.
+ 3. **Semantic Similarity (BuddhistNLP)**: Utilizes the `buddhist-nlp/bodhi-sentence-cased-v1` model to compute the cosine similarity between the semantic embeddings of text segments. This model is fine-tuned for Buddhist studies texts and captures nuances in meaning. For texts exceeding the model's 512-token input limit, an automated chunking strategy is employed: texts are divided into overlapping chunks, each chunk is embedded, and the resulting chunk embeddings are averaged (mean pooling) to produce a single representative vector for the entire segment before comparison.
+ 4. **TF-IDF Cosine Similarity**: This metric first calculates Term Frequency-Inverse Document Frequency (TF-IDF) scores for each word in each text segment. TF-IDF gives higher weight to words that are frequent within a particular segment but relatively rare across the entire collection of segments, identifying terms that are characteristic or discriminative for a segment. Each segment is then represented as a vector of these TF-IDF scores, and the cosine similarity is computed between these vectors. A score closer to 1 indicates that the two segments share more of these important, distinguishing terms, suggesting they cover similar specific topics or themes.
+
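To make the first two metrics concrete, here is a small pure-Python sketch (for illustration only; the app itself tokenizes with `botok` and computes LCS with a numba-accelerated routine in `pipeline/metrics.py`):

```python
from typing import List

def jaccard_percent(words1: List[str], words2: List[str]) -> float:
    """Vocabulary overlap over unique words: |intersection| / |union| * 100."""
    set1, set2 = set(words1), set(words2)
    if not set1 or not set2:
        return 0.0
    return len(set1 & set2) / len(set1 | set2) * 100.0

def normalized_lcs(words1: List[str], words2: List[str]) -> float:
    """LCS length via dynamic programming, normalized by average segment length."""
    m, n = len(words1), len(words2)
    if m == 0 or n == 0:
        return 0.0
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if words1[i - 1] == words2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n] / ((m + n) / 2)

# The example from the description above (English stand-ins for tokens):
a = "the quick brown fox jumps".split()
b = "the lazy cat and brown dog jumps high".split()
print(round(jaccard_percent(a, b), 1))  # 3 shared of 10 unique words -> 30.0
print(round(normalized_lcs(a, b), 2))   # LCS 'the brown jumps' = 3; avg length 6.5 -> 0.46
```

Note how the two scores diverge on the same pair: the segments share only 30% of their vocabulary, yet the ordered backbone 'the brown jumps' keeps the LCS score comparatively high.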
+ ## Getting Started (Running Locally)
+
+ 1. Ensure you have Python 3.10 or newer.
+ 2. Navigate to the `webapp` directory:
+    ```bash
+    cd path/to/tibetan-text-metrics/webapp
+    ```
+ 3. Create a virtual environment (recommended):
+    ```bash
+    python -m venv .venv
+    source .venv/bin/activate  # On macOS/Linux
+    # .venv\Scripts\activate   # On Windows
+    ```
+ 4. Install dependencies:
+    ```bash
+    pip install -r requirements.txt
+    ```
+ 5. Run the web application:
+    ```bash
+    python app.py
+    ```
+ 6. Open your web browser and go to the local URL provided (usually `http://127.0.0.1:7860`).
+
+ ## Usage
+
+ 1. **Upload Files**: Use the file upload interface to select one or more `.txt` files containing Tibetan Unicode text.
+ 2. **Run Analysis**: Click the "Run Analysis" button.
+ 3. **View Results**:
+    - A preview of the similarity metrics will be displayed.
+    - Download the full results as a CSV file.
+    - Interactive heatmaps for Jaccard Similarity, Normalized LCS, Semantic Similarity, and TF-IDF Cosine Similarity will be generated.
+    - A bar chart showing word counts per segment will also be available.
+    - Any warnings (e.g., regarding missing chapter markers) will be displayed.
+
+ ## Structure
+
+ - `app.py` — Gradio web app entry point and UI definition.
+ - `pipeline/` — Modules for file handling, text processing, metrics calculation, and visualization.
+   - `process.py`: Core logic for segmenting texts and orchestrating metric computation.
+   - `metrics.py`: Implementation of Jaccard, LCS, TF-IDF, and Semantic Similarity (including chunking).
+   - `semantic_embedding.py`: Handles loading and using the sentence transformer model.
+   - `tokenize.py`: Tibetan text tokenization using `botok`.
+   - `upload.py`: File upload handling (currently minimal).
+   - `visualize.py`: Generates heatmaps and word count plots.
+ - `requirements.txt` — Python dependencies for the web application.
+
+ ## License
+
+ This project is licensed under the Creative Commons Attribution 4.0 International License - see the [LICENSE](../../LICENSE) file in the main project directory for details.
+
+ ## Citation
+
+ If you use this web application or the underlying TTM tool in your research, please cite the main project:
+
+ ```bibtex
+ @software{wojahn2025ttm,
+   title = {TibetanTextMetrics (TTM): Computing Text Similarity Metrics on POS-tagged Tibetan Texts},
+   author = {Daniel Wojahn},
+   year = {2025},
+   url = {https://github.com/daniel-wojahn/tibetan-text-metrics},
+   version = {0.3.0}
+ }
+ ```
+
+ ---
+ For questions or issues specifically regarding the web application, please refer to the main project's [issue tracker](https://github.com/daniel-wojahn/tibetan-text-metrics/issues) or contact Daniel Wojahn.
app.py ADDED
@@ -0,0 +1,286 @@
+ import gradio as gr
+ from pathlib import Path
+ from pipeline.process import process_texts
+ from pipeline.visualize import generate_visualizations, generate_word_count_chart
+ import cProfile
+ import pstats
+ import io
+
+ from theme import tibetan_theme
+
+
+ # Main interface logic
+ def main_interface():
+     with gr.Blocks(
+         theme=tibetan_theme,
+         title="Tibetan Text Metrics Web App",
+         css=tibetan_theme.get_css_string(),
+     ) as demo:
+         gr.Markdown(
+             """# Tibetan Text Metrics Web App
+ <span style='font-size:18px;'>A user-friendly web application for analyzing textual similarities and variations in Tibetan manuscripts, providing a graphical interface to the core functionalities of the [Tibetan Text Metrics (TTM)](https://github.com/daniel-wojahn/tibetan-text-metrics) project.</span>
+ """,
+             elem_classes="gr-markdown",
+         )
+
+         with gr.Row(elem_id="steps-row"):
+             with gr.Column(scale=1, elem_classes="step-column"):
+                 with gr.Group():
+                     gr.Markdown(
+                         """
+ ## Step 1: Upload Your Tibetan Text Files
+ <span style='font-size:16px;'>Upload one or more `.txt` files. Each file should contain Unicode Tibetan text, segmented into chapters/sections if possible.</span>
+ """,
+                         elem_classes="gr-markdown",
+                     )
+                     file_input = gr.File(
+                         label="Upload Tibetan .txt files",
+                         file_types=[".txt"],
+                         file_count="multiple",
+                     )
+             with gr.Column(scale=1, elem_classes="step-column"):
+                 with gr.Group():
+                     gr.Markdown(
+                         """## Step 2: Configure and Run the Analysis
+ <span style='font-size:16px;'>Choose your analysis options and click the button below to compute metrics and view results. For meaningful analysis, ensure your texts are segmented by chapter or section using a marker like '༈'. The tool will split files based on this marker.</span>
+ """,
+                         elem_classes="gr-markdown",
+                     )
+                     semantic_toggle_radio = gr.Radio(
+                         label="Compute Semantic Similarity?",
+                         choices=["Yes", "No"],
+                         value="Yes",
+                         info="Semantic similarity can be time-consuming. Choose 'No' to speed up analysis if these metrics are not required.",
+                         elem_id="semantic-radio-group",
+                     )
+                     process_btn = gr.Button(
+                         "Run Analysis", elem_id="run-btn", variant="primary"
+                     )
+
+         gr.Markdown(
+             """## Results
+ """,
+             elem_classes="gr-markdown",
+         )
+         # The heatmap_titles and metric_tooltips dictionaries are defined below.
+         csv_output = gr.File(label="Download CSV Results")
+         metrics_preview = gr.Dataframe(
+             label="Similarity Metrics Preview", interactive=False, visible=True
+         )
+         word_count_plot = gr.Plot(label="Word Counts per Segment")
+         # Heatmap tabs for each metric
+         heatmap_titles = {
+             "Jaccard Similarity (%)": "Jaccard Similarity (%): Higher scores (brighter) mean more shared unique words.",
+             "Normalized LCS": "Normalized LCS: Higher scores (brighter) mean longer shared sequences of words.",
+             "Semantic Similarity (BuddhistNLP)": "Semantic Similarity (BuddhistNLP): Higher scores (brighter) mean more similar meanings.",
+             "TF-IDF Cosine Sim": "TF-IDF Cosine Similarity: Higher scores mean texts share more important, distinctive vocabulary.",
+         }
+
80
+
81
+ metric_tooltips = {
82
+ "Jaccard Similarity (%)": """
83
+ ### Jaccard Similarity (%)
84
+ This metric quantifies the lexical overlap between two text segments by comparing their sets of *unique* words.
85
+ It essentially answers the question: 'Of all the distinct words found across these two segments, what proportion of them are present in both?'
86
+ It is calculated as `(Number of common unique words) / (Total number of unique words in both texts combined) * 100`.
87
+ Jaccard Similarity is insensitive to word order and word frequency; it only cares whether a unique word is present or absent.
88
+ A higher percentage indicates a greater overlap in the vocabularies used in the two segments.
89
+ """,
90
+ "Normalized LCS": """
91
+ ### Normalized LCS (Longest Common Subsequence)
92
+ This metric measures the length of the longest sequence of words that appears in *both* text segments, maintaining their original relative order.
93
+ Importantly, these words do not need to be directly adjacent (contiguous) in either text.
94
+ For example, if Text A is '<u>the</u> quick <u>brown</u> fox <u>jumps</u>' and Text B is '<u>the</u> lazy cat and <u>brown</u> dog <u>jumps</u> high', the LCS is 'the brown jumps'.
95
+ The length of this common subsequence is then normalized (in this tool, by dividing by the length of the longer of the two segments) to provide a score, which is then presented as a percentage.
96
+ A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism, as it reflects similarities in how ideas are ordered and expressed sequentially.
97
+
98
+ *Note on Interpretation*: It's possible for Normalized LCS to be higher than Jaccard Similarity. This often happens when texts share a substantial 'narrative backbone' or common ordered phrases (leading to a high LCS), even if they use varied surrounding vocabulary or introduce many unique words not part of these core sequences (which would lower the Jaccard score). LCS highlights this sequential, structural similarity, while Jaccard focuses on the overall shared vocabulary regardless of its arrangement.
99
+ """,
100
+ "Semantic Similarity (BuddhistNLP)": """
101
+ ### Semantic Similarity (BuddhistNLP)
102
+ Utilizes the `buddhist-nlp/bodhi-sentence-cased-v1` model to compute the cosine similarity between the semantic embeddings of text segments.
103
+ This model is fine-tuned for Buddhist studies texts and captures nuances in meaning.
104
+ For texts exceeding the model's 512-token input limit, an automated chunking strategy is employed: texts are divided into overlapping chunks, each chunk is embedded, and the resulting chunk embeddings are averaged (mean pooling) to produce a single representative vector for the entire segment before comparison.
105
+ """,
106
+ "TF-IDF Cosine Sim": """
107
+ ### TF-IDF Cosine Similarity
108
+ This metric first calculates Term Frequency-Inverse Document Frequency (TF-IDF) scores for each word in each text segment.
109
+ TF-IDF gives higher weight to words that are frequent within a particular segment but relatively rare across the entire collection of segments.
110
+ This helps to identify terms that are characteristic or discriminative for a segment.
111
+ Each segment is then represented as a vector of these TF-IDF scores.
112
+ Finally, the cosine similarity is computed between these vectors.
113
+ A score closer to 1 indicates that the two segments share more of these important, distinguishing terms, suggesting they cover similar specific topics or themes.
114
+ """,
115
+ }
+         heatmap_tabs = {}
+         gr.Markdown("## Detailed Metric Analysis", elem_classes="gr-markdown")
+         with gr.Tabs(elem_id="heatmap-tab-group"):
+             for metric_key, descriptive_title in heatmap_titles.items():
+                 with gr.Tab(metric_key):
+                     if metric_key in metric_tooltips:
+                         gr.Markdown(value=metric_tooltips[metric_key])
+                     else:
+                         gr.Markdown(
+                             value=f"### {metric_key}\nDescription not found."
+                         )  # Fallback
+                     heatmap_tabs[metric_key] = gr.Plot(
+                         label=f"Heatmap: {metric_key}", show_label=False
+                     )
+
+         # The outputs in process_btn.click use the short metric names as keys for
+         # heatmap_tabs, e.g. heatmap_tabs["Jaccard Similarity (%)"]. Each plot is
+         # created inside its own tab above, so it is already part of the layout.
+
+         warning_box = gr.Markdown(visible=False)
+
+         def run_pipeline(files, enable_semantic_str):
+             """
+             Processes uploaded files, computes metrics, generates visualizations, and prepares outputs for the UI.
+
+             Args:
+                 files (list): A list of file objects uploaded by the user.
+                 enable_semantic_str (str): "Yes" or "No", from the semantic-similarity radio control.
+
+             Returns:
+                 tuple: A tuple containing the following elements in order:
+                     - csv_path (str | None): Path to the generated CSV results file, or None on error.
+                     - metrics_preview_df (pd.DataFrame | str | None): DataFrame for metrics preview, error string, or None.
+                     - word_count_fig (matplotlib.figure.Figure | None): Plot of word counts, or None on error.
+                     - jaccard_heatmap (matplotlib.figure.Figure | None): Jaccard similarity heatmap, or None.
+                     - lcs_heatmap (matplotlib.figure.Figure | None): LCS heatmap, or None.
+                     - semantic_heatmap (matplotlib.figure.Figure | None): Semantic similarity heatmap, or None.
+                     - tfidf_heatmap (matplotlib.figure.Figure | None): TF-IDF cosine similarity heatmap, or None.
+                     - warning_update (gr.update): Gradio update for the warning box.
+             """
+             if not files:
+                 # Eight values to match the eight outputs wired to process_btn.click
+                 return (
+                     None,
+                     "Please upload files to process.",
+                     None,
+                     None,
+                     None,
+                     None,
+                     None,
+                     gr.update(value="Please upload files to process.", visible=True),
+                 )
+
+             pr = cProfile.Profile()
+             pr.enable()
+
+             # Initialize results so they are defined in `finally` if an early error occurs
+             (
+                 csv_path_res,
+                 metrics_preview_df_res,
+                 word_count_fig_res,
+                 jaccard_heatmap_res,
+                 lcs_heatmap_res,
+                 semantic_heatmap_res,
+                 tfidf_heatmap_res,
+                 warning_update_res,
+             ) = (
+                 None,
+                 "Processing error. See console for details.",
+                 None,
+                 None,
+                 None,
+                 None,
+                 None,
+                 gr.update(
+                     value="Processing error. See console for details.", visible=True
+                 ),
+             )
+
+             try:
+                 filenames = [
+                     Path(file.name).name for file in files
+                 ]  # Path().name gives just the filename
+                 text_data = {
+                     Path(file.name).name: Path(file.name).read_text(encoding="utf-8-sig")
+                     for file in files
+                 }
+
+                 enable_semantic_bool = enable_semantic_str == "Yes"
+                 df_results, word_counts_df_data, warning_raw = process_texts(
+                     text_data, filenames, enable_semantic=enable_semantic_bool
+                 )
+
+                 if df_results.empty:
+                     warning_md = f"**⚠️ Warning:** {warning_raw}" if warning_raw else ""
+                     warning_message = (
+                         "No common chapters found or results are empty. " + warning_md
+                     )
+                     metrics_preview_df_res = warning_message
+                     warning_update_res = gr.update(value=warning_message, visible=True)
+                     # Results for this case are set; `finally` runs, then the function returns.
+                 else:
+                     # heatmap_titles is defined in the enclosing main_interface scope
+                     heatmaps_data = generate_visualizations(
+                         df_results, descriptive_titles=heatmap_titles
+                     )
+                     word_count_fig_res = generate_word_count_chart(word_counts_df_data)
+                     csv_path_res = "results.csv"
+                     df_results.to_csv(csv_path_res, index=False)
+                     warning_md = f"**⚠️ Warning:** {warning_raw}" if warning_raw else ""
+                     metrics_preview_df_res = df_results.head(10)
+
+                     jaccard_heatmap_res = heatmaps_data.get("Jaccard Similarity (%)")
+                     lcs_heatmap_res = heatmaps_data.get("Normalized LCS")
+                     semantic_heatmap_res = heatmaps_data.get(
+                         "Semantic Similarity (BuddhistNLP)"
+                     )
+                     tfidf_heatmap_res = heatmaps_data.get("TF-IDF Cosine Sim")
+                     warning_update_res = gr.update(
+                         visible=bool(warning_raw), value=warning_md
+                     )
+
+             except Exception as e:
+                 # Errors are already logged in process_texts or lower levels
+                 metrics_preview_df_res = f"Error: {str(e)}"
+                 warning_update_res = gr.update(value=f"Error: {str(e)}", visible=True)
+
+             finally:
+                 pr.disable()
+                 s = io.StringIO()
+                 sortby = pstats.SortKey.CUMULATIVE  # Sort by cumulative time spent in function
+                 ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
+                 print("\n--- cProfile Stats (Top 30) ---")
+                 ps.print_stats(30)  # Print the top 30 costly functions
+                 print(s.getvalue())
+                 print("-------------------------------\n")
+
+             return (
+                 csv_path_res,
+                 metrics_preview_df_res,
+                 word_count_fig_res,
+                 jaccard_heatmap_res,
+                 lcs_heatmap_res,
+                 semantic_heatmap_res,
+                 tfidf_heatmap_res,
+                 warning_update_res,
+             )
+
+         process_btn.click(
+             run_pipeline,
+             inputs=[file_input, semantic_toggle_radio],
+             outputs=[
+                 csv_output,
+                 metrics_preview,
+                 word_count_plot,
+                 heatmap_tabs["Jaccard Similarity (%)"],
+                 heatmap_tabs["Normalized LCS"],
+                 heatmap_tabs["Semantic Similarity (BuddhistNLP)"],
+                 heatmap_tabs["TF-IDF Cosine Sim"],
+                 warning_box,
+             ],
+         )
+     return demo
+
+
+ if __name__ == "__main__":
+     demo = main_interface()
+     demo.launch()
pipeline/__init__.py ADDED
@@ -0,0 +1 @@
+
pipeline/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (187 Bytes).
pipeline/__pycache__/metrics.cpython-310.pyc ADDED
Binary file (7.23 kB).
pipeline/__pycache__/process.cpython-310.pyc ADDED
Binary file (3.74 kB).
pipeline/__pycache__/semantic_embedding.cpython-310.pyc ADDED
Binary file (4.02 kB).
pipeline/__pycache__/tokenize.cpython-310.pyc ADDED
Binary file (1.14 kB).
pipeline/__pycache__/upload.cpython-310.pyc ADDED
Binary file (983 Bytes).
pipeline/__pycache__/visualize.cpython-310.pyc ADDED
Binary file (4.37 kB).
pipeline/metrics.py ADDED
@@ -0,0 +1,279 @@
+ import numpy as np
+ import pandas as pd
+ from typing import List, Dict
+ from itertools import combinations
+ from sklearn.metrics.pairwise import cosine_similarity
+ import torch
+ from .semantic_embedding import generate_embeddings
+ from .tokenize import tokenize_texts
+ import logging
+ from numba import njit
+ from sklearn.feature_extraction.text import TfidfVectorizer
+
+ logger = logging.getLogger(__name__)
+
+ MAX_TOKENS_PER_CHUNK = 500  # Max tokens (words via botok) per chunk
+ CHUNK_OVERLAP = 50  # Number of tokens to overlap between chunks
+
+
+ def _chunk_text(
+     original_text_content: str,
+     tokens: List[str],
+     max_chunk_tokens: int,
+     overlap_tokens: int,
+ ) -> List[str]:
+     """
+     Splits a list of tokens into chunks and reconstructs text segments from these token chunks.
+     The reconstructed text segments are intended for embedding models.
+
+     Args:
+         original_text_content (str): The original raw text string. Used if no chunking is needed.
+         tokens (List[str]): The list of botok tokens for original_text_content.
+         max_chunk_tokens (int): Maximum number of botok tokens per chunk.
+         overlap_tokens (int): Number of botok tokens to overlap between chunks.
+
+     Returns:
+         List[str]: A list of text strings, where each string is a chunk.
+     """
+     if not tokens:
+         # Handles empty or whitespace-only original text that produced no tokens
+         return [original_text_content] if original_text_content.strip() else []
+
+     if len(tokens) <= max_chunk_tokens:
+         # If not chunking, return the original text content directly: raw text
+         # segments are passed to the model unmodified, and joining tokens here
+         # would alter spacing, etc.
+         return [original_text_content]
+
+     reconstructed_text_chunks = []
+     start_idx = 0
+     while start_idx < len(tokens):
+         end_idx = min(start_idx + max_chunk_tokens, len(tokens))
+         current_chunk_botok_tokens = tokens[start_idx:end_idx]
+         # Reconstruct the text chunk by joining the botok tokens. This is an
+         # approximation; the semantic model's internal tokenizer will handle this string.
+         reconstructed_text_chunks.append(" ".join(current_chunk_botok_tokens))
+
+         if end_idx == len(tokens):
+             break
+
+         next_start_idx = start_idx + max_chunk_tokens - overlap_tokens
+         if next_start_idx <= start_idx:
+             next_start_idx = start_idx + 1
+         start_idx = next_start_idx
+
+     return reconstructed_text_chunks
+
+
+ @njit
+ def compute_normalized_lcs(words1: List[str], words2: List[str]) -> float:
+     m, n = len(words1), len(words2)
+     if m == 0 or n == 0:
+         return 0.0
+     dp = np.zeros((m + 1, n + 1), dtype=np.int32)
+     for i in range(1, m + 1):
+         for j in range(1, n + 1):
+             if words1[i - 1] == words2[j - 1]:
+                 dp[i, j] = dp[i - 1, j - 1] + 1
+             else:
+                 dp[i, j] = max(dp[i - 1, j], dp[i, j - 1])
+     lcs_length = int(dp[m, n])
+     avg_length = (m + n) / 2
+     return lcs_length / avg_length if avg_length > 0 else 0.0
+
+
+ def compute_semantic_similarity(
+     text1_segment: str,
+     text2_segment: str,
+     tokens1: List[str],
+     tokens2: List[str],
+     model,
+     device,
+ ) -> float:
+     """Computes semantic similarity using a sentence transformer model, with chunking for long texts."""
+     if model is None or device is None:
+         logger.warning(
+             "Semantic similarity model or device not available. Skipping calculation."
+         )
+         return np.nan  # Return NaN if model isn't loaded
+
+     if not text1_segment or not text2_segment:
+         logger.info(
+             "One or both texts are empty for semantic similarity. Returning 0.0."
+         )
+         return 0.0  # Or np.nan, depending on desired behavior for empty inputs
+
+     def _get_aggregated_embedding(
+         raw_text_segment: str, botok_tokens: List[str], model_obj, device_str
+     ) -> torch.Tensor | None:
+         """Helper to get a single embedding for a text, chunking if necessary."""
+         if not botok_tokens and not raw_text_segment.strip():
+             # Effectively empty input
+             logger.info(
+                 f"Text segment is empty or only whitespace: {raw_text_segment[:100]}... Returning None for embedding."
+             )
+             return None
+
+         if len(botok_tokens) > MAX_TOKENS_PER_CHUNK:
+             logger.info(
+                 f"Text segment with ~{len(botok_tokens)} tokens exceeds {MAX_TOKENS_PER_CHUNK}, chunking {raw_text_segment[:30]}..."
+             )
+             # Pass the original raw text and its pre-computed botok tokens to _chunk_text
+             text_chunks = _chunk_text(
+                 raw_text_segment, botok_tokens, MAX_TOKENS_PER_CHUNK, CHUNK_OVERLAP
+             )
+             if not text_chunks:
+                 logger.warning(
+                     f"Chunking resulted in no chunks for segment: {raw_text_segment[:100]}..."
+                 )
+                 return None
+
+             logger.info(
+                 f"Generated {len(text_chunks)} chunks for segment: {raw_text_segment[:30]}..."
+             )
+             chunk_embeddings = generate_embeddings(text_chunks, model_obj, device_str)
+
+             if chunk_embeddings is None or chunk_embeddings.nelement() == 0:
+                 logger.error(
+                     f"Failed to generate embeddings for chunks of text: {raw_text_segment[:100]}..."
+                 )
+                 return None
+             # Mean pooling of chunk embeddings
+             aggregated_embedding = torch.mean(chunk_embeddings, dim=0, keepdim=True)
+             return aggregated_embedding
+         else:
+             # Text is short enough; embed the raw text directly
+             if not raw_text_segment.strip():
+                 logger.info(
+                     f"Text segment is empty or only whitespace: {raw_text_segment[:100]}... Returning None for embedding."
+                 )
+                 return None
+
+             embedding = generate_embeddings([raw_text_segment], model_obj, device_str)
+             if embedding is None or embedding.nelement() == 0:
+                 logger.error(
+                     f"Failed to generate embedding for text: {raw_text_segment[:100]}..."
+                 )
+                 return None
+             return embedding  # Already [1, embed_dim]
+
161
+ try:
162
+ # Pass raw text and its pre-computed botok tokens
163
+ embedding1 = _get_aggregated_embedding(text1_segment, tokens1, model, device)
164
+ embedding2 = _get_aggregated_embedding(text2_segment, tokens2, model, device)
165
+
166
+ if (
167
+ embedding1 is None
168
+ or embedding2 is None
169
+ or embedding1.nelement() == 0
170
+ or embedding2.nelement() == 0
171
+ ):
172
+ logger.error(
173
+ "Failed to obtain one or both aggregated embeddings for semantic similarity."
174
+ )
175
+ return np.nan
176
+
177
+ # Cosine similarity expects 2D arrays, embeddings are [1, embed_dim] and on CPU
178
+ similarity = cosine_similarity(embedding1.numpy(), embedding2.numpy())
179
+ return float(similarity[0][0])
180
+ except Exception as e:
181
+ logger.error(
182
+ f"Error computing semantic similarity with chunking:\nText1: '{text1_segment[:100]}...'\nText2: '{text2_segment[:100]}...'\nError: {e}",
183
+ exc_info=True,
184
+ )
185
+ return np.nan
186
+
187
+
188
+ def compute_all_metrics(
189
+ texts: Dict[str, str], model=None, device=None, enable_semantic: bool = True
190
+ ) -> pd.DataFrame:
191
+ """
192
+ Computes all selected similarity metrics between pairs of texts.
193
+
194
+ Args:
195
+ texts (Dict[str, str]): A dictionary where keys are text identifiers (e.g., filenames or segment IDs)
196
+ and values are the text content strings.
197
+ model (SentenceTransformer, optional): The pre-loaded sentence transformer model.
198
+ Defaults to None.
199
+ device (str, optional): The device the model is on ('cuda' or 'cpu').
200
+ Defaults to None.
201
+
202
+ Returns:
203
+ pd.DataFrame: A DataFrame where each row contains the metrics for a pair of texts,
204
+ including 'Text Pair', 'Jaccard Similarity (%)', 'Normalized LCS',
205
+ and 'Semantic Similarity (BuddhistNLP)'.
206
+ """
207
+ files = list(texts.keys())
208
+ results = []
209
+ # Prepare token lists (always use tokenize_texts for raw Unicode)
210
+ token_lists = {}
211
+ corpus_for_tfidf = [] # For storing space-joined tokens for TF-IDF
212
+
213
+ for fname, content in texts.items():
214
+ tokenized_content = tokenize_texts([content]) # Returns a list of lists
215
+ if tokenized_content and tokenized_content[0]:
216
+ token_lists[fname] = tokenized_content[0]
217
+ else:
218
+ token_lists[fname] = []
219
+ # Regardless of whether tokenized_content[0] exists, prepare entry for TF-IDF corpus
220
+ # If tokens exist, join them; otherwise, use an empty string for that document
221
+ corpus_for_tfidf.append(
222
+ " ".join(token_lists[fname])
223
+ if fname in token_lists and token_lists[fname]
224
+ else ""
225
+ )
226
+
227
+ # TF-IDF Vectorization and Cosine Similarity Calculation
228
+ if corpus_for_tfidf:
229
+ # Using a dummy tokenizer and preprocessor as input is already tokenized (as space-separated strings)
230
+ # and we don't want further case changes or token modifications for Tibetan.
231
+ vectorizer = TfidfVectorizer(
232
+ tokenizer=lambda x: x.split(), preprocessor=lambda x: x, token_pattern=None
233
+ )
234
+ tfidf_matrix = vectorizer.fit_transform(corpus_for_tfidf)
235
+ # Calculate pairwise cosine similarity on the TF-IDF matrix
236
+ # This gives a square matrix where cosine_sim_matrix[i, j] is the similarity between doc i and doc j
237
+ cosine_sim_matrix = cosine_similarity(tfidf_matrix)
238
+ else:
239
+ # Handle case with no texts or all empty texts
240
+ cosine_sim_matrix = np.array(
241
+ [[]]
242
+ ) # Or some other appropriate empty/default structure
243
+
244
+ for i, j in combinations(range(len(files)), 2):
245
+ f1, f2 = files[i], files[j]
246
+ words1, words2 = token_lists[f1], token_lists[f2]
247
+ jaccard = (
248
+ len(set(words1) & set(words2)) / len(set(words1) | set(words2))
249
+ if set(words1) | set(words2)
250
+ else 0.0
251
+ )
252
+ jaccard_percent = jaccard * 100.0
253
+ norm_lcs = compute_normalized_lcs(words1, words2)
254
+
255
+ # Semantic Similarity Calculation
256
+ if enable_semantic:
257
+ # Pass raw texts and their pre-computed botok tokens
258
+ semantic_sim = compute_semantic_similarity(
259
+ texts[f1], texts[f2], words1, words2, model, device
260
+ )
261
+ else:
262
+ semantic_sim = np.nan
263
+ results.append(
264
+ {
265
+ "Text Pair": f"{f1} vs {f2}",
266
+ "Jaccard Similarity (%)": jaccard_percent,
267
+ "Normalized LCS": norm_lcs,
268
+ # Pass tokens1 and tokens2 to compute_semantic_similarity
269
+ "Semantic Similarity (BuddhistNLP)": semantic_sim,
270
+ "TF-IDF Cosine Sim": (
271
+ cosine_sim_matrix[i, j]
272
+ if cosine_sim_matrix.size > 0
273
+ and i < cosine_sim_matrix.shape[0]
274
+ and j < cosine_sim_matrix.shape[1]
275
+ else np.nan
276
+ ),
277
+ }
278
+ )
279
+ return pd.DataFrame(results)
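For reference, the Jaccard figure computed in the loop above can be reproduced in isolation. This is a minimal stdlib-only sketch of the same set-based formula; the token lists are made-up illustrations, not project data:

```python
def jaccard_percent(tokens_a, tokens_b):
    """Jaccard similarity of two token lists, as a percentage of shared vocabulary."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:  # Both lists empty: define similarity as 0.0, as compute_all_metrics does
        return 0.0
    return 100.0 * len(set_a & set_b) / len(union)

# Two toy "chapters" sharing two of four distinct vocabulary items
a = ["ka", "kha", "ga"]
b = ["ka", "kha", "nga"]
print(jaccard_percent(a, b))  # → 50.0
```

Note that this measure ignores token frequency and order; that is why the pipeline pairs it with normalized LCS and TF-IDF cosine similarity.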
pipeline/process.py ADDED
@@ -0,0 +1,128 @@
+ import pandas as pd
+ from typing import Dict, List, Tuple
+ from itertools import combinations
+ from .metrics import compute_all_metrics
+ from .semantic_embedding import get_sentence_transformer_model_and_device
+ from .tokenize import tokenize_texts
+ import logging
+
+ logger = logging.getLogger(__name__)
+
+
+ def process_texts(
+     text_data: Dict[str, str], filenames: List[str], enable_semantic: bool = True
+ ) -> Tuple[pd.DataFrame, pd.DataFrame, str]:
+     """
+     Processes uploaded texts, segments them by chapter marker, and computes metrics between chapters of different files.
+
+     Args:
+         text_data (Dict[str, str]): A dictionary mapping filenames to their content.
+         filenames (List[str]): A list of filenames that were uploaded.
+         enable_semantic (bool): Whether to compute semantic similarity (loads the sentence transformer model).
+
+     Returns:
+         Tuple[pd.DataFrame, pd.DataFrame, str]:
+             - metrics_df: DataFrame with similarity metrics between corresponding chapters of file pairs.
+             - word_counts_df: DataFrame with word counts for each segment (chapter) in each file.
+             - warning: A string containing any warnings generated during processing (e.g., missing chapter markers).
+     """
+     st_model, st_device = None, None
+     if enable_semantic:
+         logger.info(
+             "Semantic similarity enabled. Loading sentence transformer model..."
+         )
+         try:
+             st_model, st_device = get_sentence_transformer_model_and_device()
+             logger.info(
+                 f"Sentence transformer model loaded successfully on {st_device}."
+             )
+         except Exception as e:
+             logger.error(
+                 f"Failed to load sentence transformer model: {e}. Semantic similarity will not be available."
+             )
+     else:
+         logger.info("Semantic similarity disabled. Skipping model loading.")
+
+     # Detect chapter marker
+     chapter_marker = "༈"
+     fallback = False
+     segment_texts = {}
+     for fname in filenames:
+         content = text_data[fname]
+         if chapter_marker in content:
+             segments = [
+                 seg.strip() for seg in content.split(chapter_marker) if seg.strip()
+             ]
+             for idx, seg in enumerate(segments):
+                 seg_id = f"{fname}|chapter {idx+1}"
+                 segment_texts[seg_id] = seg
+         else:
+             seg_id = f"{fname}|chapter 1"
+             segment_texts[seg_id] = content.strip()
+             fallback = True
+     warning = ""
+     if fallback:
+         warning = (
+             "No chapter marker found in one or more files. "
+             "Each file will be treated as a single segment. "
+             "For best results, add a unique marker (e.g., ༈) to separate chapters or sections."
+         )
+     # Group chapters by filename (preserving order)
+     file_to_chapters = {}
+     for seg_id in segment_texts:
+         fname = seg_id.split("|")[0]
+         file_to_chapters.setdefault(fname, []).append(seg_id)
+     # For each pair of files, compare corresponding chapters (by index)
+     results = []
+     files = list(file_to_chapters.keys())
+     for file1, file2 in combinations(files, 2):
+         chaps1 = file_to_chapters[file1]
+         chaps2 = file_to_chapters[file2]
+         min_chaps = min(len(chaps1), len(chaps2))
+         for idx in range(min_chaps):
+             seg1 = chaps1[idx]
+             seg2 = chaps2[idx]
+             # Compute metrics for this chapter pair by running compute_all_metrics
+             # on just these two segments
+             pair_metrics = compute_all_metrics(
+                 {seg1: segment_texts[seg1], seg2: segment_texts[seg2]},
+                 model=st_model,
+                 device=st_device,
+                 enable_semantic=enable_semantic,
+             )
+             # Set the Text Pair and Chapter columns
+             pair_metrics.loc[:, "Text Pair"] = f"{file1} vs {file2}"
+             pair_metrics.loc[:, "Chapter"] = idx + 1
+             results.append(pair_metrics)
+     if results:
+         metrics_df = pd.concat(results, ignore_index=True)
+     else:
+         metrics_df = pd.DataFrame()
+
+     # Calculate word counts
+     word_counts_data = []
+     for seg_id, text_content in segment_texts.items():
+         fname, chapter_info = seg_id.split("|", 1)
+         chapter_num = int(chapter_info.replace("chapter ", ""))
+         # Use botok for an accurate word count of raw Tibetan text
+         tokenized_segments = tokenize_texts([text_content])  # Returns a list of lists
+         if tokenized_segments and tokenized_segments[0]:
+             word_count = len(tokenized_segments[0])
+         else:
+             word_count = 0
+         word_counts_data.append(
+             {
+                 "Filename": fname.replace(".txt", ""),
+                 "ChapterNumber": chapter_num,
+                 "SegmentID": seg_id,
+                 "WordCount": word_count,
+             }
+         )
+     word_counts_df = pd.DataFrame(word_counts_data)
+     if not word_counts_df.empty:
+         word_counts_df = word_counts_df.sort_values(
+             by=["Filename", "ChapterNumber"]
+         ).reset_index(drop=True)
+
+     return metrics_df, word_counts_df, warning
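The marker-based segmentation step above can be exercised on its own. A small stdlib-only sketch of the same splitting logic (the sample strings are illustrative, not project data):

```python
def split_chapters(content: str, marker: str = "༈"):
    """Split a text on the chapter marker, dropping empty segments.

    Mirrors the segmentation logic in process_texts: if the marker is absent,
    the whole (stripped) file is treated as a single segment.
    """
    if marker in content:
        return [seg.strip() for seg in content.split(marker) if seg.strip()]
    return [content.strip()]

print(split_chapters("ལེའུ་དང་པོ། ༈ ལེའུ་གཉིས་པ། ༈ "))  # two segments, trailing emptiness dropped
print(split_chapters("no marker here"))  # → ['no marker here']
```

Because trailing or doubled markers produce empty fragments, the `if seg.strip()` filter is what keeps the chapter indices aligned across files.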
pipeline/semantic_embedding.py ADDED
@@ -0,0 +1,151 @@
+ import logging
+ import torch
+ from sentence_transformers import SentenceTransformer
+
+ # Configure logging
+ logging.basicConfig(
+     level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
+ )
+ logger = logging.getLogger(__name__)
+
+ # Default model ID: a sentence-similarity model fine-tuned for Buddhist texts
+ DEFAULT_MODEL_NAME = "buddhist-nlp/buddhist-sentence-similarity"
+
+
+ def get_sentence_transformer_model_and_device(
+     model_id: str = DEFAULT_MODEL_NAME, device_preference: str = "auto"
+ ):
+     """
+     Loads the Sentence Transformer model and determines the device.
+     Priority: CUDA -> MPS (Apple Silicon) -> CPU.
+
+     Args:
+         model_id (str): The Hugging Face model ID.
+         device_preference (str): Preferred device ("cuda", "mps", "cpu", "auto").
+
+     Returns:
+         tuple: (model, device_str)
+             - model: The loaded SentenceTransformer model.
+             - device_str: The device the model is loaded on ("cuda", "mps", or "cpu").
+     """
+     selected_device_str = ""
+
+     if device_preference == "auto":
+         if torch.cuda.is_available():
+             selected_device_str = "cuda"
+         elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
+             selected_device_str = "mps"
+         else:
+             selected_device_str = "cpu"
+     elif device_preference == "cuda" and torch.cuda.is_available():
+         selected_device_str = "cuda"
+     elif (
+         device_preference == "mps"
+         and hasattr(torch.backends, "mps")
+         and torch.backends.mps.is_available()
+     ):
+         selected_device_str = "mps"
+     else:  # Handles an explicit "cpu" preference, or fallback if the preferred device is unavailable
+         selected_device_str = "cpu"
+
+     logger.info(f"Attempting to use device: {selected_device_str}")
+
+     try:
+         logger.info(
+             f"Loading Sentence Transformer model: {model_id} on device: {selected_device_str}"
+         )
+         # SentenceTransformer expects a string like 'cuda', 'mps', or 'cpu'
+         model = SentenceTransformer(model_id, device=selected_device_str)
+         logger.info(f"Model {model_id} loaded successfully on {selected_device_str}.")
+         return model, selected_device_str
+     except Exception as e:
+         logger.error(
+             f"Error loading model {model_id} on device {selected_device_str}: {e}"
+         )
+         # Fall back to CPU if the initially selected device (CUDA or MPS) failed
+         if selected_device_str != "cpu":
+             logger.warning(
+                 f"Failed to load model on {selected_device_str}, attempting to load on CPU..."
+             )
+             fallback_device_str = "cpu"
+             try:
+                 model = SentenceTransformer(model_id, device=fallback_device_str)
+                 logger.info(
+                     f"Model {model_id} loaded successfully on CPU after fallback."
+                 )
+                 return model, fallback_device_str
+             except Exception as fallback_e:
+                 logger.error(
+                     f"Error loading model {model_id} on CPU during fallback: {fallback_e}"
+                 )
+                 raise fallback_e  # Re-raise if the CPU fallback also fails
+         raise e  # Re-raise the original exception if the selected device was already CPU
+
+
+ def generate_embeddings(texts: list[str], model, device: str):
+     """
+     Generates embeddings for a list of texts using the provided Sentence Transformer model.
+
+     Args:
+         texts (list[str]): A list of texts to embed.
+         model: The loaded SentenceTransformer model.
+         device (str): The device the model is on (primarily for logging; model.encode handles the device).
+
+     Returns:
+         torch.Tensor: A tensor containing the embeddings, moved to CPU.
+     """
+     if not texts:
+         logger.warning(
+             "No texts provided to generate_embeddings. Returning empty tensor."
+         )
+         return torch.empty(0)
+
+     logger.info(f"Generating embeddings for {len(texts)} texts...")
+
+     # The encode method of SentenceTransformer handles tokenization and pooling internally.
+     # It also manages moving data to the model's device.
+     embeddings = model.encode(texts, convert_to_tensor=True, show_progress_bar=True)
+
+     logger.info(f"Embeddings generated with shape: {embeddings.shape}")
+     # Ensure embeddings are on CPU for consistent further processing
+     return embeddings.cpu()
+
+
+ if __name__ == "__main__":
+     # Example usage:
+     logger.info("Starting example usage of semantic_embedding module...")
+
+     test_texts = [
+         "བཀྲ་ཤིས་བདེ་ལེགས།",
+         "hello world",  # Test with non-Tibetan to see behavior
+         "དེ་རིང་གནམ་གཤིས་ཡག་པོ་འདུག",
+     ]
+
+     logger.info("Attempting to load model using default cache directory.")
+     try:
+         # Force CPU for this example to avoid potential CUDA issues in diverse environments
+         st_model, st_device = get_sentence_transformer_model_and_device(
+             device_preference="cpu"  # Explicitly use CPU for this test run
+         )
+
+         if st_model:
+             logger.info(f"Test model loaded on device: {st_device}")
+             example_embeddings = generate_embeddings(test_texts, st_model, st_device)
+             logger.info(
+                 f"Generated example embeddings shape: {example_embeddings.shape}"
+             )
+             if example_embeddings.nelement() > 0:  # Check if the tensor is not empty
+                 logger.info(
+                     f"First embedding (first 10 dims): {example_embeddings[0][:10]}..."
+                 )
+             else:
+                 logger.info("Generated example embeddings tensor is empty.")
+         else:
+             logger.error("Failed to load model for example usage.")
+
+     except Exception as e:
+         logger.error(f"An error occurred during the example usage: {e}")
+
+     logger.info("Finished example usage.")
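Upstream in metrics.py, the chunk embeddings returned by `generate_embeddings` are aggregated with `torch.mean(..., dim=0)`. As a plain-Python stand-in (no torch dependency; the nested lists here are toy vectors, not real embeddings), mean pooling is just a per-dimension average:

```python
def mean_pool(chunk_embeddings):
    """Average a list of equal-length embedding vectors into one segment embedding.

    A list-based stand-in for torch.mean(chunk_embeddings, dim=0): each output
    dimension is the mean of that dimension across all chunk vectors.
    """
    n = len(chunk_embeddings)
    dim = len(chunk_embeddings[0])
    return [sum(vec[i] for vec in chunk_embeddings) / n for i in range(dim)]

# Two 2-dimensional "chunk embeddings" pooled into one vector
print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))  # → [2.0, 3.0]
```

Mean pooling keeps the aggregated vector in the same embedding space as the chunks, so cosine similarity against another segment's pooled vector remains meaningful.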
pipeline/tokenize.py ADDED
@@ -0,0 +1,38 @@
+ from typing import List
+
+ try:
+     from botok import WordTokenizer
+
+     # Initialize the tokenizer once at the module level
+     BOTOK_TOKENIZER = WordTokenizer()
+ except ImportError:
+     # Handle the case where botok might not be installed,
+     # though it's a core dependency for this app.
+     BOTOK_TOKENIZER = None
+     print("ERROR: botok library not found. Tokenization will fail.")
+     # Optionally, raise an error here if botok is absolutely critical for the app to even start
+     # raise ImportError("botok is required for tokenization. Please install it.")
+
+
+ def tokenize_texts(texts: List[str]) -> List[List[str]]:
+     """
+     Tokenizes a list of raw Tibetan texts using botok.
+
+     Args:
+         texts: List of raw text strings.
+
+     Returns:
+         List of tokenized texts (each as a list of tokens).
+     """
+     if BOTOK_TOKENIZER is None:
+         # This case should ideally be handled more gracefully,
+         # perhaps by preventing analysis if the tokenizer failed to load.
+         raise RuntimeError(
+             "Botok tokenizer failed to initialize. Cannot tokenize texts."
+         )
+
+     tokenized_texts_list = []
+     for text_content in texts:
+         tokens = [
+             w.text for w in BOTOK_TOKENIZER.tokenize(text_content) if w.text.strip()
+         ]
+         tokenized_texts_list.append(tokens)
+     return tokenized_texts_list
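The filtering inside `tokenize_texts` only relies on each botok token exposing a `.text` attribute. A sketch with a hypothetical stand-in token class (botok itself is not used here) shows the whitespace-dropping behavior in isolation:

```python
class FakeToken:
    """Illustrative stand-in for a botok token: just carries a .text attribute."""

    def __init__(self, text):
        self.text = text


def filter_tokens(tokens):
    """Keep the text of tokens that are not pure whitespace, as tokenize_texts does."""
    return [t.text for t in tokens if t.text.strip()]


tokens = [FakeToken("བཀྲ་"), FakeToken(" "), FakeToken("ཤིས་")]
print(filter_tokens(tokens))  # → ['བཀྲ་', 'ཤིས་']
```

Dropping whitespace-only tokens matters downstream: it keeps the word counts and the Jaccard/LCS vocabularies free of spurious "words".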
pipeline/upload.py ADDED
@@ -0,0 +1,23 @@
+ import os
+ from typing import List, Tuple
+
+
+ def handle_file_upload(files, input_type: str) -> Tuple[dict, List[str]]:
+     """
+     Reads uploaded files and returns their contents and filenames.
+
+     Args:
+         files: List of uploaded file objects from Gradio.
+         input_type: Type of input (raw, tokenized, pos-tagged).
+
+     Returns:
+         text_data: dict mapping filename to file content.
+         filenames: list of filenames.
+     """
+     text_data = {}
+     filenames = []
+     for file in files:
+         filename = os.path.basename(file.name)
+         with open(file.name, "r", encoding="utf-8") as f:
+             content = f.read()
+         text_data[filename] = content
+         filenames.append(filename)
+     return text_data, filenames
pipeline/visualize.py ADDED
@@ -0,0 +1,154 @@
+ import plotly.graph_objects as go
+ import pandas as pd
+ import plotly.express as px  # For color palettes
+ import numpy as np  # For nanmin/nanmax over the pivoted values
+
+
+ def generate_visualizations(metrics_df: pd.DataFrame, descriptive_titles: dict = None):
+     """
+     Generate heatmap visualizations for all metrics.
+
+     Args:
+         metrics_df: DataFrame with similarity metrics (segment-level).
+         descriptive_titles: Optional dict mapping metric column names to plot titles.
+
+     Returns:
+         heatmaps: dict of {metric_name: plotly Figure} for each metric.
+     """
+
+     # Identify all numeric metric columns (exclude 'Text Pair' and 'Chapter')
+     metric_cols = [
+         col
+         for col in metrics_df.columns
+         if col not in ["Text Pair", "Chapter"] and metrics_df[col].dtype != object
+     ]
+     for col in metrics_df.columns:
+         if "Pattern Similarity" in col and col not in metric_cols:
+             metric_cols.append(col)
+
+     # --- Heatmaps for each metric ---
+     heatmaps = {}
+     # Using the 'Reds' colormap for a red/white gradient.
+     # Chapter 1 will be at the top of the Y-axis due to sort_index(ascending=False).
+     for metric in metric_cols:
+         # Skip this metric if all of its values are NaN
+         if metrics_df[metric].isnull().all():
+             heatmaps[metric] = None
+             continue  # Move to the next metric
+
+         pivot = metrics_df.pivot(index="Chapter", columns="Text Pair", values=metric)
+         pivot = pivot.sort_index(ascending=False)  # Invert Y-axis: Chapter 1 at the top
+         # Additional check: skip if the pivot is empty or all NaNs after pivoting
+         # (e.g., due to single-chapter comparisons)
+         if pivot.empty or pivot.isnull().all().all():
+             heatmaps[metric] = None
+             continue
+
+         cleaned_columns = [col.replace(".txt", "") for col in pivot.columns]
+         cmap = "Reds"  # Apply the 'Reds' colormap to all heatmaps
+         text = [
+             [f"{val:.2f}" if pd.notnull(val) else "" for val in row]
+             for row in pivot.values
+         ]
+         fig = go.Figure(
+             data=go.Heatmap(
+                 z=pivot.values,
+                 x=cleaned_columns,
+                 y=pivot.index,
+                 colorscale=cmap,
+                 zmin=float(np.nanmin(pivot.values)),
+                 zmax=float(np.nanmax(pivot.values)),
+                 text=text,
+                 texttemplate="%{text}",
+                 hovertemplate="Chapter %{y}<br>Text Pair: %{x}<br>Value: %{z:.2f}<extra></extra>",
+                 colorbar=dict(title=metric, thickness=20, tickfont=dict(size=14)),
+             )
+         )
+         plot_title = (
+             descriptive_titles.get(metric, metric) if descriptive_titles else metric
+         )
+         fig.update_layout(
+             title=plot_title,
+             xaxis_title="Text Pair",
+             yaxis_title="Chapter",
+             autosize=False,
+             width=1350,
+             height=1200,
+             font=dict(size=16),
+             margin=dict(l=140, b=80, t=60),
+         )
+         fig.update_xaxes(tickangle=30, tickfont=dict(size=16))
+         fig.update_yaxes(tickfont=dict(size=16), autorange="reversed")
+         # Ensure all integer chapter numbers are shown if the axis is numeric and reversed
+         if pd.api.types.is_numeric_dtype(pivot.index):
+             fig.update_yaxes(
+                 tickmode="array",
+                 tickvals=pivot.index,
+                 ticktext=[str(i) for i in pivot.index],
+             )
+         heatmaps[metric] = fig
+
+     # Use all features, including pattern similarities if present
+     if not metrics_df.empty:
+         # Remove '.txt' from Text Pair labels
+         metrics_df = metrics_df.copy()
+         metrics_df["Text Pair"] = metrics_df["Text Pair"].str.replace(
+             ".txt", "", regex=False
+         )
+     return heatmaps
+
+
+ def generate_word_count_chart(word_counts_df: pd.DataFrame):
+     """
+     Generates a bar chart of word counts per segment (file/chapter).
+
+     Args:
+         word_counts_df: DataFrame with 'Filename', 'ChapterNumber', 'SegmentID', 'WordCount'.
+
+     Returns:
+         plotly Figure for the bar chart, or None if the input is empty.
+     """
+     if word_counts_df.empty:
+         return None
+
+     fig = go.Figure()
+
+     # Assign colors based on Filename
+     unique_files = sorted(word_counts_df["Filename"].unique())
+     colors = px.colors.qualitative.Plotly  # Get a default Plotly color sequence
+
+     for i, filename in enumerate(unique_files):
+         file_df = word_counts_df[word_counts_df["Filename"] == filename].sort_values(
+             "ChapterNumber"
+         )
+         fig.add_trace(
+             go.Bar(
+                 x=file_df["ChapterNumber"],
+                 y=file_df["WordCount"],
+                 name=filename,
+                 marker_color=colors[i % len(colors)],
+                 text=file_df["WordCount"],
+                 textposition="auto",
+                 customdata=file_df[["Filename"]],  # Pass Filename for the hovertemplate
+                 hovertemplate="<b>File</b>: %{customdata[0]}<br>"
+                 + "<b>Chapter</b>: %{x}<br>"
+                 + "<b>Word Count</b>: %{y}<extra></extra>",
+             )
+         )
+
+     fig.update_layout(
+         title_text="Word Counts per Chapter (Grouped by File)",
+         xaxis_title="Chapter Number",
+         yaxis_title="Word Count",
+         barmode="group",
+         font=dict(size=14),
+         legend_title_text="Filename",
+         xaxis=dict(
+             type="category"
+         ),  # Treat chapter numbers as categories for distinct grouping
+         autosize=True,
+         margin=dict(l=80, r=50, b=100, t=50, pad=4),
+     )
+     # Ensure x-axis ticks are shown for all chapter numbers present
+     all_chapter_numbers = sorted(word_counts_df["ChapterNumber"].unique())
+     fig.update_xaxes(
+         tickmode="array",
+         tickvals=all_chapter_numbers,
+         ticktext=[str(ch) for ch in all_chapter_numbers],
+     )
+
+     return fig
requirements.txt ADDED
@@ -0,0 +1,12 @@
+ gradio>=4.0.0
+ pandas
+ plotly
+ matplotlib
+ seaborn
+ scikit-learn
+ # botok is required for Tibetan tokenization; ensure it is available on Hugging Face Spaces
+ botok
+ torch
+ transformers
+ sentence-transformers
+ numba
results.csv ADDED
@@ -0,0 +1,49 @@
+ Text Pair,Jaccard Similarity (%),Normalized LCS,Semantic Similarity (BuddhistNLP),TF-IDF Cosine Sim,Chapter
+ Nepal12.txt vs LTWA.txt,0.24183796856106407,0.0020325203252032522,,0.49779922611270133,1
+ Nepal12.txt vs LTWA.txt,0.49019607843137253,0.006349206349206349,,0.3092411626312882,2
+ Nepal12.txt vs LTWA.txt,0.423728813559322,0.0051813471502590676,,0.2835703043060698,3
+ Nepal12.txt vs LTWA.txt,47.05882352941176,0.6213151927437641,,0.7895311629826965,4
+ Nepal12.txt vs LTWA.txt,29.81366459627329,0.4437299035369775,,0.7168755829405374,5
+ Nepal12.txt vs LTWA.txt,36.31578947368421,0.5212464589235127,,0.8074186048740946,6
+ Nepal12.txt vs LTWA.txt,26.769911504424783,0.2892561983471074,,0.8417955275319969,7
+ Nepal12.txt vs LTWA.txt,43.08943089430895,0.5246753246753246,,0.9303577865963523,8
+ Nepal12.txt vs LTWA.txt,29.625292740046838,0.2795071335927367,,0.9109232758325454,9
+ Nepal12.txt vs LTWA.txt,52.798053527980535,0.6291525423728813,,0.9087016630709266,10
+ Nepal12.txt vs LTWA.txt,53.075170842824605,0.6566920565832427,,0.9670241021079997,11
+ Nepal12.txt vs LTWA.txt,36.029411764705884,0.43727598566308246,,0.7062555004041853,12
+ Nepal12.txt vs LTWA.txt,46.80232558139535,0.6346153846153846,,0.9098381701939877,13
+ Nepal12.txt vs LTWA.txt,34.954407294832826,0.3716433941997852,,0.8681558692770571,14
+ Nepal12.txt vs LTWA.txt,13.288288288288289,0.13510520487264674,,0.672356831077214,15
+ Nepal12.txt vs LTWA.txt,0.205761316872428,0.00186219739292365,,0.379562190805602,16
+ Nepal12.txt vs Leiden.txt,0.23364485981308408,0.0020151133501259445,,0.48802450618062665,1
+ Nepal12.txt vs Leiden.txt,0.5,0.006493506493506494,,0.2674810724560652,2
+ Nepal12.txt vs Leiden.txt,0.43478260869565216,0.005633802816901409,,0.2833623975540526,3
+ Nepal12.txt vs Leiden.txt,47.159090909090914,0.5876543209876544,,0.7517513456553279,4
+ Nepal12.txt vs Leiden.txt,30.075187969924812,0.38866396761133604,,0.4408740561408262,5
+ Nepal12.txt vs Leiden.txt,40.853658536585364,0.549520766773163,,0.8128602295895004,6
+ Nepal12.txt vs Leiden.txt,31.524008350730686,0.2908224076281287,,0.8738496945193566,7
+ Nepal12.txt vs Leiden.txt,39.01273885350319,0.43176831943835015,,0.926450152931149,8
+ Nepal12.txt vs Leiden.txt,27.754415475189237,0.19742670322716727,,0.9388989588908633,9
+ Nepal12.txt vs Leiden.txt,54.936708860759495,0.6423049894588897,,0.921846371542675,10
+ Nepal12.txt vs Leiden.txt,47.205707491082045,0.6151611966308452,,0.9428816469273638,11
+ Nepal12.txt vs Leiden.txt,19.011406844106464,0.24765478424015008,,0.7015349768313621,12
+ Nepal12.txt vs Leiden.txt,53.125,0.6434782608695652,,0.8959335274660175,13
+ Nepal12.txt vs Leiden.txt,31.699346405228756,0.3399765533411489,,0.7322457041553282,14
+ Nepal12.txt vs Leiden.txt,21.730382293762577,0.23124448367166814,,0.7987166674337054,15
+ Nepal12.txt vs Leiden.txt,0.17667844522968199,0.001563721657544957,,0.35541620823150016,16
+ LTWA.txt vs Leiden.txt,44.46351931330472,0.5621676373765511,,0.973014209495479,1
+ LTWA.txt vs Leiden.txt,52.27272727272727,0.6720516962843296,,0.8952409611501382,2
+ LTWA.txt vs Leiden.txt,51.14006514657981,0.6132971506105834,,0.8817937094432065,3
+ LTWA.txt vs Leiden.txt,43.29896907216495,0.5622119815668203,,0.7896479610422875,4
+ LTWA.txt vs Leiden.txt,30.48780487804878,0.39869281045751637,,0.5548358409765478,5
+ LTWA.txt vs Leiden.txt,35.80246913580247,0.4931506849315068,,0.7722629194984254,6
+ LTWA.txt vs Leiden.txt,36.211699164345404,0.44142614601018676,,0.8733787580475953,7
+ LTWA.txt vs Leiden.txt,42.33038348082596,0.4753257007500987,,0.9428524600432496,8
+ LTWA.txt vs Leiden.txt,25.196850393700785,0.16772554002541296,,0.9208737107077982,9
+ LTWA.txt vs Leiden.txt,53.73493975903615,0.6971279373368147,,0.9411333779617488,10
+ LTWA.txt vs Leiden.txt,54.285714285714285,0.6786239341370185,,0.9677013964807858,11
+ LTWA.txt vs Leiden.txt,18.43137254901961,0.1700404858299595,,0.6761607878976519,12
+ LTWA.txt vs Leiden.txt,48.37758112094395,0.6323851203501094,,0.8863173251539074,13
+ LTWA.txt vs Leiden.txt,42.79661016949153,0.570957095709571,,0.7795387299539099,14
+ LTWA.txt vs Leiden.txt,28.455284552845526,0.36042402826855124,,0.7233387457696162,15
+ LTWA.txt vs Leiden.txt,49.572649572649574,0.6079182630906769,,0.9527216526629653,16
theme.py ADDED
@@ -0,0 +1,186 @@
+ import gradio as gr
+ from gradio.themes.utils import colors, sizes, fonts
+
+
+ class TibetanAppTheme(gr.themes.Soft):
+     def __init__(self):
+         super().__init__(
+             primary_hue=colors.blue,  # Primary interactive elements (e.g., #2563eb)
+             secondary_hue=colors.orange,  # For accents if needed, or default buttons
+             neutral_hue=colors.slate,  # For backgrounds, borders, and text
+             font=[
+                 fonts.GoogleFont("Inter"),
+                 "ui-sans-serif",
+                 "system-ui",
+                 "sans-serif",
+             ],
+             radius_size=sizes.radius_md,  # General radius, can be overridden (16px was for cards)
+             text_size=sizes.text_md,  # Base font size (16px)
+         )
+         self.theme_vars_for_set = {
+             # Global & body styles
+             "body_background_fill": "#f0f2f5",
+             "body_text_color": "#333333",
+             # Card styles (.gr-group)
+             "block_background_fill": "#ffffff",
+             "block_radius": "16px",  # May need to be removed if not a valid settable CSS var
+             "block_shadow": "0 4px 12px rgba(0, 0, 0, 0.08)",
+             "block_padding": "24px",
+             "block_border_width": "0px",
+             # Markdown styles
+             "body_text_color_subdued": "#4b5563",
+             # Button styles
+             "button_secondary_background_fill": "#ffffff",
+             "button_secondary_text_color": "#374151",
+             "button_secondary_border_color": "#d1d5db",
+             "button_secondary_border_color_hover": "#adb5bd",
+             "button_secondary_background_fill_hover": "#f9fafb",
+             # Primary button
+             "button_primary_background_fill": "#2563eb",
+             "button_primary_text_color": "#ffffff",
+             "button_primary_border_color": "transparent",
+             "button_primary_background_fill_hover": "#1d4ed8",
+             # HR style
+             "border_color_accent_subdued": "#e5e7eb",
+         }
+         super().set(**self.theme_vars_for_set)
+
+         # Store CSS overrides; these will be converted to a string and applied via gr.Blocks(css=...)
+         self.css_overrides = {
+             ".gradio-container, .gr-block, .gr-markdown, label, input, .gr-slider, .gr-radio, .gr-button": {
+                 "font-family": ", ".join(self.font),
+                 "font-size": "16px !important",
+                 "line-height": "1.6 !important",
+                 "color": "#333333 !important",
+             },
+             ".gr-group": {"margin-bottom": "24px !important"},  # min-height removed
+             ".gr-markdown": {
+                 "background": "transparent !important",
+                 "font-size": "1em !important",
+                 "margin-bottom": "16px !important",
+             },
+             ".gr-markdown h1": {
+                 "font-size": "28px !important",
+                 "font-weight": "600 !important",
+                 "margin-bottom": "8px !important",
+                 "color": "#111827 !important",
+             },
+             ".gr-markdown h2": {
+                 "font-size": "26px !important",
+                 "font-weight": "600 !important",
+                 "color": "var(--primary-600, #2563eb) !important",
+                 "margin-top": "32px !important",
+                 "margin-bottom": "16px !important",
+             },
+             ".gr-markdown h3": {
+                 "font-size": "22px !important",
+                 "font-weight": "600 !important",
+                 "color": "#1f2937 !important",
+                 "margin-top": "24px !important",
+                 "margin-bottom": "12px !important",
+             },
+             ".gr-markdown p, .gr-markdown span": {
+                 "font-size": "16px !important",
+                 "color": "#4b5563 !important",
+             },
+             ".gr-button button": {
+                 "border-radius": "8px !important",
+                 "padding": "10px 20px !important",
+                 "font-weight": "500 !important",
+                 "box-shadow": "0 1px 2px 0 rgba(0, 0, 0, 0.05) !important",
+                 "border": "1px solid #d1d5db !important",
+                 "background-color": "#ffffff !important",
+                 "color": "#374151 !important",
+             },
+             "#run-btn": {
+                 "background": "var(--button-primary-background-fill) !important",
+                 "color": "var(--button-primary-text-color) !important",
+                 "font-weight": "bold !important",
+                 "font-size": "24px !important",
+                 "border": "none !important",
+                 "box-shadow": "var(--button-primary-shadow) !important",
+             },
+             "#run-btn:hover": {  # Changed selector
+                 "background": "var(--button-primary-background-fill-hover) !important",
+                 "box-shadow": "0px 4px 12px rgba(0, 0, 0, 0.15) !important",
+                 "transform": "translateY(-1px) !important",
+             },
+             ".gr-button button:hover": {
+                 "background-color": "#f9fafb !important",
+                 "border-color": "#adb5bd !important",
+             },
+             "hr": {
+                 "margin": "32px 0 !important",
+                 "border": "none !important",
+                 "border-top": "1px solid var(--border-color-accent-subdued) !important",
+             },
+             ".gr-slider, .gr-radio, .gr-file": {"margin-bottom": "20px !important"},
+             ".gr-radio .gr-form button": {
+                 "background-color": "#f3f4f6 !important",
120
+ "color": "#374151 !important",
121
+ "border": "1px solid #d1d5db !important",
122
+ "border-radius": "6px !important",
123
+ "padding": "8px 16px !important",
124
+ "font-weight": "500 !important",
125
+ },
126
+ ".gr-radio .gr-form button:hover": {
127
+ "background-color": "#e5e7eb !important",
128
+ "border-color": "#9ca3af !important",
129
+ },
130
+ ".gr-radio .gr-form button.selected": {
131
+ "background-color": "var(--primary-500, #3b82f6) !important",
132
+ "color": "#ffffff !important",
133
+ "border-color": "var(--primary-500, #3b82f6) !important",
134
+ },
135
+ ".gr-radio .gr-form button.selected:hover": {
136
+ "background-color": "var(--primary-600, #2563eb) !important",
137
+ "border-color": "var(--primary-600, #2563eb) !important",
138
+ },
139
+ "#semantic-radio-group span": { # General selector, refined size
140
+ "font-size": "17px !important",
141
+ "font-weight": "500 !important",
142
+ },
143
+ "#semantic-radio-group div": { # General selector, refined size
144
+ "font-size": "14px !important"
145
+ },
146
+ # Row and Column flex styles for equal height
147
+ "#steps-row": {
148
+ "display": "flex !important",
149
+ "align-items": "stretch !important",
150
+ },
151
+ ".step-column": {
152
+ "display": "flex !important",
153
+ "flex-direction": "column !important",
154
+ },
155
+ ".step-column > .gr-group": {
156
+ "flex-grow": "1 !important",
157
+ "display": "flex !important",
158
+ "flex-direction": "column !important",
159
+ },
160
+ ".tabs > .tab-nav": {"border-bottom": "1px solid #d1d5db !important"},
161
+ ".tabs > .tab-nav > button.selected": {
162
+ "border-bottom": "2px solid var(--primary-500) !important",
163
+ "color": "var(--primary-500) !important",
164
+ "background-color": "transparent !important",
165
+ },
166
+ ".tabs > .tab-nav > button": {
167
+ "color": "#6b7280 !important",
168
+ "background-color": "transparent !important",
169
+ "padding": "10px 15px !important",
170
+ "border-bottom": "2px solid transparent !important",
171
+ },
172
+ }
173
+
174
+ def get_css_string(self) -> str:
175
+ """Converts the self.css_overrides dictionary into a CSS string."""
176
+ css_parts = []
177
+ for selector, properties in self.css_overrides.items():
178
+ props_str = "\n".join(
179
+ [f" {prop}: {value};" for prop, value in properties.items()]
180
+ )
181
+ css_parts.append(f"{selector} {{\n{props_str}\n}}")
182
+ return "\n\n".join(css_parts)
183
+
184
+
185
+ # Instantiate the theme for easy import
186
+ tibetan_theme = TibetanAppTheme()
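

# The block below is a minimal, standalone sketch of the dict-to-CSS
# conversion that get_css_string() performs, using a hypothetical two-rule
# override dict (illustration only; the app itself would pass `tibetan_theme`
# and `tibetan_theme.get_css_string()` to gr.Blocks(theme=..., css=...)).
if __name__ == "__main__":
    demo_overrides = {
        ".gr-markdown": {"font-size": "1em !important"},
        "#run-btn": {"font-weight": "bold !important"},
    }
    demo_parts = []
    for selector, properties in demo_overrides.items():
        # One "prop: value;" line per property, indented inside the rule body.
        props_str = "\n".join(
            f"  {prop}: {value};" for prop, value in properties.items()
        )
        demo_parts.append(f"{selector} {{\n{props_str}\n}}")
    # Rules are joined with a blank line, mirroring get_css_string().
    demo_css = "\n\n".join(demo_parts)
    print(demo_css)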