A Computational Toolkit for Tibetan Textual Analysis: Methods and Applications of the Tibetan Text Metrics (TTM) Application

Abstract: The study of Tibetan textual traditions, with their vast and complex corpus, presents unique challenges for quantitative analysis. Traditional philological methods, while essential, can be enhanced by computational tools that reveal large-scale patterns of similarity and variation. This paper introduces the Tibetan Text Metrics (TTM) web application, an accessible open-source toolkit designed to bridge this gap. TTM provides a suite of similarity metrics, including Jaccard similarity, Normalized Longest Common Subsequence (LCS), TF-IDF cosine similarity, and semantic similarity computed with embedding models (FastText and SentenceTransformers). A novel feature of the application is its AI-powered interpretation engine, which translates quantitative data into scholarly insights, making complex metrics accessible to a broader audience. By offering a user-friendly interface for sophisticated textual analysis, TTM empowers researchers to explore manuscript relationships, track textual transmission, and uncover new avenues for inquiry within Tibetan studies and the broader digital humanities landscape.

1. Introduction

1.1. The Challenge of Tibetan Textual Scholarship

The Tibetan literary corpus is one of the world's most extensive, encompassing centuries of philosophy, history, and religious doctrine. The transmission of these texts has been a complex process, involving manual copying that resulted in a rich but challenging textual landscape of divergent manuscript lineages. This challenge is exemplified by the development of the TTM application itself, which originated from the analysis of multiple editions of the 17th-century legal text, The Pronouncements in Sixteen Chapters (zhal lce bcu drug). An initial attempt to create a critical edition using standard collation software like CollateX proved untenable; the variations between editions were so substantial that they produced a convoluted apparatus that obscured, rather than clarified, the texts' relationships. It became clear that a different approach was needed—one that could move beyond one-to-one textual comparison to provide a higher-level, quantitative overview of textual similarity. TTM was developed to meet this need, providing a toolkit to assess relationships at the chapter level and reveal the broader patterns of textual evolution that traditional methods might miss.

1.2. Digital Humanities and Under-Resourced Languages

The rise of digital humanities has brought a wealth of computational tools to literary analysis. However, many of these tools are designed for well-resourced languages like English, leaving languages with fewer digital resources, such as Tibetan, underserved. The unique characteristics of the Tibetan script and language, including its syllabic nature and complex orthography, require specialized tools for effective processing and analysis. The Tibetan Text Metrics (TTM) project addresses this need by providing a tailored solution that respects the linguistic nuances of Tibetan, thereby making a vital contribution to the growing field of global digital humanities.

1.3. The Tibetan Text Metrics (TTM) Application

This paper introduces the Tibetan Text Metrics (TTM) web application, a user-friendly, open-source tool designed to make sophisticated textual analysis accessible to scholars of Tibetan, regardless of their technical background. The application empowers researchers to move beyond manual comparison by providing a suite of computational metrics that reveal distinct aspects of textual relationships—from direct lexical overlap (Jaccard similarity) and shared narrative structure (Normalized LCS) to thematic parallels (TF-IDF) and deep semantic connections (FastText and Transformer-based embeddings). This article will detail the methodologies underpinning these metrics, describe the application's key features—including its novel AI-powered interpretation engine—and demonstrate its practical utility through a case study. By doing so, we aim to show how TTM can serve as a valuable assistant in the scholar's toolkit, augmenting traditional research methods and opening up new possibilities for the study of Tibetan textual history.

2. Methodology: A Multi-faceted Approach to Text Similarity

To provide a holistic view of textual relationships, the TTM application employs a multi-faceted methodology that combines lexical, structural, and semantic analysis. This approach is built upon a foundation of Tibetan-specific text processing, ensuring that each metric is applied in a linguistically sound manner.

2.1. Text Pre-processing and Segmentation

Meaningful computational analysis begins with careful pre-processing. The TTM application automates several crucial steps to prepare texts for comparison.

Segmentation: Comparing entire texts, especially long ones, can obscure significant internal variations. TTM therefore defaults to a chapter-level analysis. It automatically segments texts using the Tibetan sbrul shad (༈), a common marker for section breaks. This allows for a more granular comparison, revealing similarities and differences at a structural level that mirrors the text's own divisions. If no marker is found, the application treats the entire file as a single segment and issues a warning.
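
A minimal sketch of this segmentation step (illustrative only, not the application's exact implementation) might look as follows:

```python
# Split a UTF-8 Tibetan text into chapter-level segments on the
# sbrul shad marker (༈). Illustrative sketch; TTM's own splitter may
# handle edge cases differently.
def segment_chapters(raw_text: str, marker: str = "༈") -> list[str]:
    segments = [seg.strip() for seg in raw_text.split(marker) if seg.strip()]
    if len(segments) <= 1:
        print("Warning: no section marker found; treating the file as one segment.")
    return segments or [raw_text.strip()]
```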

Tokenization: To analyze a text computationally, it must be broken down into individual units, or tokens. In Tibetan script the tsheg (་) delimits syllables rather than words, so simple whitespace- or tsheg-based tokenization is inadequate. TTM leverages the botok library, a state-of-the-art tokenizer for Tibetan, which accurately identifies word boundaries, ensuring that the subsequent analysis is based on meaningful linguistic units.
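
The botok tokenizer can be used along the following lines (a minimal example; the options TTM passes to the tokenizer may differ):

```python
from botok import WordTokenizer

# Tokenize a Tibetan string into word-level units rather than bare syllables.
wt = WordTokenizer()
tokens = wt.tokenize("བཀྲ་ཤིས་བདེ་ལེགས།", split_affixes=False)
words = [t.text for t in tokens]
```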

Stopword Filtering: Many words in a language are grammatically necessary but carry little unique semantic weight. These "stopwords" can skew similarity scores by creating an illusion of similarity based on common grammatical structures. TTM provides two levels of optional stopword filtering to address this:

  • The Standard list targets only the most frequent, low-information grammatical particles and punctuation (e.g., the instrumental particle གིས་ (gis), the genitive particle གི་ (gi), and the sentence-ending shad (།)).
  • The Aggressive list includes the standard particles but also removes a wider range of function words, such as pronouns (e.g., འདི (this), དེ་ (that)), auxiliary verbs (e.g., ཡིན་ (is)), and common quantifiers (e.g., རྣམས་ (plural marker)).

This tiered approach allows researchers to fine-tune their analysis, either preserving the grammatical structure or focusing purely on the substantive vocabulary of a text.
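
The sketch below illustrates this two-tier filtering in simplified form; the particle lists shown are abbreviated examples, not TTM's actual lists:

```python
# Abbreviated, illustrative stopword lists (not TTM's full lists).
STANDARD_STOPWORDS = {"གིས་", "གི་", "།"}
AGGRESSIVE_STOPWORDS = STANDARD_STOPWORDS | {"འདི", "དེ་", "ཡིན་", "རྣམས་"}

def filter_tokens(tokens: list[str], level: str = "standard") -> list[str]:
    """Remove stopwords at the requested level ("none", "standard", "aggressive")."""
    if level == "none":
        return list(tokens)
    stops = AGGRESSIVE_STOPWORDS if level == "aggressive" else STANDARD_STOPWORDS
    return [t for t in tokens if t not in stops]
```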

2.2. Lexical and Thematic Similarity Metrics

These metrics focus on the vocabulary and key terms within the texts.

Jaccard Similarity: This metric measures the direct overlap in vocabulary between two segments. It is calculated as the size of the intersection of the word sets divided by the size of their union. The result is a score from 0 to 1, representing the proportion of unique words that are common to both texts. Jaccard similarity is a straightforward and effective measure of shared vocabulary, independent of word order or frequency.
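
Formally, for token sets A and B the score is |A ∩ B| / |A ∪ B|; a direct implementation is shown below.

```python
# Jaccard similarity between two token lists: |A ∩ B| / |A ∪ B|.
def jaccard_similarity(tokens_a: list[str], tokens_b: list[str]) -> float:
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)
```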

TF-IDF Cosine Similarity: Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (a corpus). It gives higher weight to terms that are frequent in one document but rare across the corpus, thus identifying the words that are most characteristic of that document. TTM calculates a TF-IDF vector for each text segment and then uses cosine similarity to measure the angle between these vectors. A higher score indicates that two segments share more of the same characteristic terms, suggesting a thematic similarity.
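
A minimal sketch of this computation using scikit-learn is given below; the vectorizer settings are illustrative and may not match TTM's exact configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_cosine_matrix(segments: list[list[str]]):
    # Segments arrive pre-tokenized; join on spaces and split again so the
    # vectorizer does not apply its default (non-Tibetan) token pattern.
    docs = [" ".join(tokens) for tokens in segments]
    vectorizer = TfidfVectorizer(tokenizer=str.split, token_pattern=None, lowercase=False)
    matrix = vectorizer.fit_transform(docs)
    return cosine_similarity(matrix)  # pairwise scores in [0, 1]
```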

2.3. Structural Similarity Metric

Normalized Longest Common Subsequence (LCS): This metric moves beyond vocabulary to assess structural parallels. The LCS algorithm finds the longest sequence of words that appears in both texts in the same relative order, though not necessarily contiguously. For example, the LCS of "the brown fox jumps" and "the lazy brown dog jumps" is "the brown jumps". TTM normalizes the length of this subsequence to produce a score that reflects shared phrasing and narrative structure. A high LCS score can indicate direct textual borrowing or a shared structural template. To ensure performance, the LCS calculation is optimized with a custom Cython implementation.
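
The logic of the metric can be sketched in pure Python as below. The normalization shown (dividing by the shorter segment's length) is one reasonable choice and is stated here as an assumption rather than TTM's exact formula; the application itself relies on the Cython implementation for speed.

```python
# Word-level LCS length via dynamic programming, normalized by the length
# of the shorter segment (normalization choice is an assumption).
def normalized_lcs(a: list[str], b: list[str]) -> float:
    m, n = len(a), len(b)
    if m == 0 or n == 0:
        return 0.0
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n] / min(m, n)
```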

2.4. Semantic Similarity

To capture similarities in meaning that may not be apparent from lexical overlap, TTM employs semantic similarity using word and sentence embeddings.

FastText Embeddings: The application utilizes the official Facebook FastText model for Tibetan, which represents words as dense vectors in a high-dimensional space. A key advantage of FastText is its use of character n-grams, allowing it to generate meaningful vectors even for out-of-vocabulary words—a crucial feature for handling the orthographic variations common in Tibetan manuscripts. To create a single vector for an entire text segment, TTM employs a sophisticated TF-IDF weighted averaging of the word vectors, giving more weight to the embeddings of characteristic terms.
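
A simplified version of this weighted averaging, using the fasttext Python bindings and pre-computed IDF weights, might look as follows (the model path and weighting details are illustrative assumptions):

```python
import numpy as np
import fasttext

# model = fasttext.load_model("cc.bo.300.bin")  # official Tibetan vectors; path illustrative

def segment_vector(tokens: list[str], model, idf: dict[str, float]) -> np.ndarray:
    """Average FastText word vectors weighted by (assumed) IDF scores."""
    if not tokens:
        return np.zeros(model.get_dimension())
    vectors = np.stack([model.get_word_vector(t) for t in tokens])  # n-gram fallback handles OOV forms
    weights = np.array([idf.get(t, 1.0) for t in tokens])
    return np.average(vectors, axis=0, weights=weights)
```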

Hugging Face Models: In addition to FastText, TTM integrates the sentence-transformers library, providing access to a wide range of pre-trained models from the Hugging Face Hub. This allows researchers to leverage powerful, context-aware models like LaBSE or XLM-RoBERTa, which can capture nuanced semantic relationships between entire sentences and paragraphs.
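
A minimal example of computing segment-level semantic similarity with sentence-transformers is shown below; the model choice (LaBSE) is one of the options mentioned above and is used here purely for illustration.

```python
from sentence_transformers import SentenceTransformer, util

# Encode two chapter texts and compare them with cosine similarity.
model = SentenceTransformer("sentence-transformers/LaBSE")
embeddings = model.encode(["<chapter one text>", "<chapter two text>"], convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()
```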

3. The TTM Web Application: Features and Functionality

The TTM application is designed to be a practical tool for researchers. Its features are built to facilitate an intuitive workflow, from data input to the interpretation of results.

3.1. User Interface and Workflow

Built with the Gradio framework, the application's interface is clean and straightforward. The workflow is designed to be linear and intuitive:

  1. File Upload: Users begin by uploading one or more Tibetan .txt files.
  2. Configuration: Users can then configure the analysis by selecting which metrics to compute, choosing an embedding model, and setting the desired level of stopword filtering.
  3. Execution: A single "Run Analysis" button initiates the entire processing pipeline.

This simple, step-by-step process removes the barriers of command-line tools and complex software setups, making the technology accessible to scholars without programming experience.
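
The overall shape of such a Gradio workflow is sketched below; component names, labels, and options are illustrative rather than TTM's actual source code.

```python
import gradio as gr

def run_analysis(files, metrics, model_name, stopword_level):
    ...  # tokenize, compute the selected metrics, build plots
    return "Analysis complete."

with gr.Blocks() as demo:
    files = gr.File(label="Tibetan .txt files", file_count="multiple", file_types=[".txt"])
    metrics = gr.CheckboxGroup(["Jaccard", "Normalized LCS", "TF-IDF Cosine", "Semantic"], label="Metrics")
    model_name = gr.Dropdown(["FastText (Tibetan)", "sentence-transformers/LaBSE"], label="Embedding model")
    stopwords = gr.Radio(["None", "Standard", "Aggressive"], label="Stopword filtering")
    status = gr.Textbox(label="Status")
    gr.Button("Run Analysis").click(run_analysis, [files, metrics, model_name, stopwords], status)

demo.launch()
```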

3.2. Data Visualization

Understanding numerical similarity scores can be challenging. TTM addresses this by providing rich, interactive visualizations:

  • Heatmaps: For each similarity metric, the application generates a heatmap that provides an at-a-glance overview of the relationships between all text segments. Darker cells indicate higher similarity, allowing researchers to quickly identify areas of strong textual connection.
  • Bar Charts: A word count chart for each text provides a simple but effective visualization of the relative lengths of the segments, which is important context for interpreting the similarity scores.

These visualizations are not only useful for analysis but are also publication-ready, allowing researchers to easily incorporate them into their own work.
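
As an illustration, a static similarity heatmap of pairwise chapter scores can be produced along the following lines (the application's own interactive plots are built differently):

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_heatmap(matrix: np.ndarray, labels: list[str], title: str):
    """Render a pairwise similarity matrix; darker cells mean higher similarity."""
    fig, ax = plt.subplots()
    im = ax.imshow(matrix, cmap="Blues", vmin=0.0, vmax=1.0)
    ax.set_xticks(range(len(labels)))
    ax.set_xticklabels(labels, rotation=90)
    ax.set_yticks(range(len(labels)))
    ax.set_yticklabels(labels)
    fig.colorbar(im, ax=ax, label="similarity")
    ax.set_title(title)
    fig.tight_layout()
    return fig
```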

3.3. AI-Powered Interpretation

A standout feature of the TTM application is its AI-powered interpretation engine. While quantitative metrics are powerful, their scholarly significance is not always self-evident. The "Interpret Results" button addresses this challenge by sending the computed metrics to a large language model (Mistral 7B via the OpenRouter API).

The AI then generates a qualitative analysis of the results, framed in the language of textual scholarship. This analysis typically includes:

  • An overview of the general patterns of similarity.
  • A discussion of notable chapters with particularly high or low similarity.
  • An interpretation of what the different metrics collectively suggest about the texts' relationship (e.g., lexical borrowing vs. structural parallels).
  • Suggestions for further scholarly investigation.

This feature acts as a bridge between the quantitative data and its qualitative interpretation, helping researchers to understand the implications of their findings and to formulate new research questions.
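
For concreteness, the kind of request involved can be sketched as follows; the prompt, parameters, and model identifier shown are assumptions for illustration, not the application's actual prompt or configuration.

```python
import os
import requests

def interpret_results(metrics_summary: str) -> str:
    """Send computed similarity metrics to Mistral 7B via OpenRouter's chat endpoint (sketch)."""
    response = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": "mistralai/mistral-7b-instruct",  # illustrative model identifier
            "messages": [
                {"role": "system", "content": "You assist with Tibetan textual scholarship."},
                {"role": "user", "content": f"Interpret these pairwise similarity results:\n{metrics_summary}"},
            ],
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```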

5. Discussion and Future Directions

5.1. Interpreting the Metrics: A Holistic View

The true analytical power of the TTM application lies not in any single metric, but in the synthesis of all of them. For example, a high Jaccard similarity combined with a low LCS score might suggest that two texts share a common vocabulary but arrange it differently, perhaps indicating a shared topic but different authorial styles. Conversely, a high LCS score with a moderate Jaccard similarity could point to a shared structural backbone or direct borrowing, even with significant lexical variation. The addition of semantic similarity further enriches this picture, revealing conceptual connections that might be missed by lexical and structural methods alone. The TTM application facilitates this holistic approach, encouraging a nuanced interpretation of textual relationships.

5.2. Limitations

While powerful, the TTM application has limitations. The quality of the analysis is highly dependent on the quality of the input texts; poorly scanned or OCR'd texts may yield unreliable results. The performance of the semantic models, while state-of-the-art, may also vary depending on the specific domain of the texts being analyzed. Furthermore, the AI-powered interpretation, while a useful guide, is not a substitute for scholarly expertise and should be treated as a starting point for further investigation, not a definitive conclusion.

5.3. Future Work

The TTM project is under active development, with several potential avenues for future enhancement. These include:

  • Integration of More Models: Expanding the library of available embedding models to include more domain-specific options.
  • Enhanced Visualization: Adding more advanced visualization tools, such as network graphs to show relationships between multiple texts.
  • User-Trainable Models: Exposing the functionality to train custom FastText models directly within the web UI, allowing researchers to create highly specialized models for their specific corpora.

6. Conclusion

The Tibetan Text Metrics web application represents a significant step forward in making computational textual analysis accessible to the field of Tibetan studies. By combining a suite of powerful similarity metrics with an intuitive user interface and a novel AI-powered interpretation engine, TTM lowers the barrier to entry for digital humanities research. It provides scholars with a versatile tool to explore textual relationships, investigate manuscript histories, and generate new, data-driven insights. As such, TTM serves not as a replacement for traditional philology, but as a powerful complement, one that promises to enrich and expand the horizons of Tibetan textual scholarship.