daniel-wojahn committed
Commit 4c96fd8 · 1 Parent(s): 2e45dc8

chapter name feature added

Files changed (3):
  1. academic_article.md +66 -3
  2. app.py +68 -6
  3. pipeline/process.py +21 -12
academic_article.md CHANGED
@@ -6,7 +6,7 @@
6
 
7
  ### 1.1. The Challenge of Tibetan Textual Scholarship
8
 
9
- The Tibetan literary corpus is one of the world's most extensive, encompassing centuries of philosophy, history, jurisprudence, and religious doctrine. The transmission of these texts has been a complex process, involving manual copying by scribes across different regions and monastic traditions. This has resulted in a rich but challenging textual landscape, characterized by numerous variations, scribal errors, and divergent manuscript lineages. For scholars, tracing the history of a text and understanding its evolution requires meticulous philological work. While traditional methods are indispensable, the sheer scale of the available material necessitates computational approaches that can identify patterns and relationships that are not immediately apparent to the human eye.
10
 
11
  ### 1.2. Digital Humanities and Under-Resourced Languages
12
 
@@ -14,7 +14,7 @@ The rise of digital humanities has brought a wealth of computational tools to li
14
 
15
  ### 1.3. The Tibetan Text Metrics (TTM) Application
16
 
17
- This paper introduces the Tibetan Text Metrics (TTM) web application, a user-friendly, open-source tool designed to make sophisticated textual analysis accessible to scholars of Tibetan, regardless of their technical background. The application allows researchers to upload Tibetan texts, automatically segment them into meaningful sections, and compare them using a range of quantitative metrics. This article will detail the methodologies underpinning the TTM application, describe its key features, and demonstrate its practical utility through a case study. By doing so, we aim to show how TTM can serve as a valuable assistant in the scholar's toolkit, augmenting traditional research methods and opening up new possibilities for the study of Tibetan textual history.
18
 
19
  ## 2. Methodology: A Multi-faceted Approach to Text Similarity
20
 
@@ -28,7 +28,12 @@ Meaningful computational analysis begins with careful pre-processing. The TTM ap
28
 
29
  **Tokenization:** To analyze a text computationally, it must be broken down into individual units, or tokens. Given the syllabic nature of the Tibetan script, where morphemes are delimited by a *tsheg* (་), simple whitespace tokenization is inadequate. TTM leverages the `botok` library, a state-of-the-art tokenizer for Tibetan, which accurately identifies word boundaries, ensuring that the subsequent analysis is based on meaningful linguistic units.
30
 
31
- **Stopword Filtering:** Many words in a language (e.g., particles, pronouns) are grammatically necessary but carry little unique semantic weight. These "stopwords" can skew similarity scores by creating an illusion of similarity based on common grammatical structures. TTM provides two levels of optional stopword filtering: a *Standard* list containing the most common particles, and an *Aggressive* list that also removes function words. This allows researchers to focus their analysis on the substantive vocabulary of a text.
32
 
33
  ### 2.2. Lexical and Thematic Similarity Metrics
34
 
@@ -49,3 +54,61 @@ To capture similarities in meaning that may not be apparent from lexical overlap
49
  **FastText Embeddings:** The application utilizes the official Facebook FastText model for Tibetan, which represents words as dense vectors in a high-dimensional space. A key advantage of FastText is its use of character n-grams, allowing it to generate meaningful vectors even for out-of-vocabulary words—a crucial feature for handling the orthographic variations common in Tibetan manuscripts. To create a single vector for an entire text segment, TTM employs a sophisticated TF-IDF weighted averaging of the word vectors, giving more weight to the embeddings of characteristic terms.
50
 
51
  **Hugging Face Models:** In addition to FastText, TTM integrates the `sentence-transformers` library, providing access to a wide range of pre-trained models from the Hugging Face Hub. This allows researchers to leverage powerful, context-aware models like LaBSE or XLM-RoBERTa, which can capture nuanced semantic relationships between entire sentences and paragraphs.
6
 
7
  ### 1.1. The Challenge of Tibetan Textual Scholarship
8
 
9
+ The Tibetan literary corpus is one of the world's most extensive, encompassing centuries of philosophy, history, and religious doctrine. The transmission of these texts has been a complex process, involving manual copying that resulted in a rich but challenging textual landscape of divergent manuscript lineages. This challenge is exemplified by the development of the TTM application itself, which originated from the analysis of multiple editions of the 17th-century legal text, *The Pronouncements in Sixteen Chapters* (*zhal lce bcu drug*). An initial attempt to create a critical edition using standard collation software like CollateX proved untenable; the variations between editions were so substantial that they produced a convoluted apparatus that obscured, rather than clarified, the texts' relationships. It became clear that a different approach was needed—one that could move beyond one-to-one textual comparison to provide a higher-level, quantitative overview of textual similarity. TTM was developed to meet this need, providing a toolkit to assess relationships at the chapter level and reveal the broader patterns of textual evolution that traditional methods might miss.
10
 
11
  ### 1.2. Digital Humanities and Under-Resourced Languages
12
 
 
14
 
15
  ### 1.3. The Tibetan Text Metrics (TTM) Application
16
 
17
+ This paper introduces the Tibetan Text Metrics (TTM) web application, a user-friendly, open-source tool designed to make sophisticated textual analysis accessible to scholars of Tibetan, regardless of their technical background. The application empowers researchers to move beyond manual comparison by providing a suite of computational metrics that reveal distinct aspects of textual relationships—from direct lexical overlap (Jaccard similarity) and shared narrative structure (Normalized LCS) to thematic parallels (TF-IDF) and deep semantic connections (FastText and Transformer-based embeddings). This article will detail the methodologies underpinning these metrics, describe the application's key features—including its novel AI-powered interpretation engine—and demonstrate its practical utility through a case study. By doing so, we aim to show how TTM can serve as a valuable assistant in the scholar's toolkit, augmenting traditional research methods and opening up new possibilities for the study of Tibetan textual history.
18
 
19
  ## 2. Methodology: A Multi-faceted Approach to Text Similarity
20
 
 
28
 
29
  **Tokenization:** To analyze a text computationally, it must be broken down into individual units, or tokens. Given the syllabic nature of the Tibetan script, where morphemes are delimited by a *tsheg* (་), simple whitespace tokenization is inadequate. TTM leverages the `botok` library, a state-of-the-art tokenizer for Tibetan, which accurately identifies word boundaries, ensuring that the subsequent analysis is based on meaningful linguistic units.
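As a minimal, illustrative sketch (not TTM's internal wrapper code), `botok`'s `WordTokenizer` can be invoked along these lines; the sample sentence is only a stand-in:

```python
# Illustrative sketch: word-level tokenization of a Tibetan string with botok.
# The first call downloads botok's default dialect pack; the sentence is a placeholder.
from botok import WordTokenizer

tokenizer = WordTokenizer()

text = "བཀྲ་ཤིས་བདེ་ལེགས་ཞུས་རྒྱུ་ཡིན་"
tokens = tokenizer.tokenize(text)

# Each token object carries the surface form, which later stages filter and compare.
print([t.text for t in tokens])
```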
30
 
31
+ **Stopword Filtering:** Many words in a language are grammatically necessary but carry little unique semantic weight. These "stopwords" can skew similarity scores by creating an illusion of similarity based on common grammatical structures. TTM provides two levels of optional stopword filtering to address this:
32
+
33
+ * The **Standard** list targets only the most frequent, low-information grammatical particles and punctuation (e.g., the instrumental particle `གིས་` (gis), the genitive particle `གི་` (gi), and the sentence-ending *shad* `།`).
34
+ * The **Aggressive** list includes the standard particles but also removes a wider range of function words, such as pronouns (e.g., `འདི` (this), `དེ་` (that)), auxiliary verbs (e.g., `ཡིན་` (is)), and common quantifiers (e.g., `རྣམས་` (plural marker)).
35
+
36
+ This tiered approach allows researchers to fine-tune their analysis, either preserving the grammatical structure or focusing purely on the substantive vocabulary of a text.
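A minimal sketch of this tiered filtering, with abbreviated stand-in lists rather than TTM's full stopword inventories:

```python
# Illustrative two-tier stopword filtering; the lists below are abbreviated examples.
STANDARD_STOPWORDS = {"གིས་", "གི་", "།"}                                     # frequent particles, punctuation
AGGRESSIVE_STOPWORDS = STANDARD_STOPWORDS | {"འདི", "དེ་", "ཡིན་", "རྣམས་"}   # plus common function words

def filter_tokens(tokens: list[str], level: str = "none") -> list[str]:
    """Drop stopwords according to the chosen level: "none", "standard", or "aggressive"."""
    if level == "standard":
        stops = STANDARD_STOPWORDS
    elif level == "aggressive":
        stops = AGGRESSIVE_STOPWORDS
    else:
        return list(tokens)
    return [token for token in tokens if token not in stops]
```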
37
 
38
  ### 2.2. Lexical and Thematic Similarity Metrics
39
 
 
54
  **FastText Embeddings:** The application utilizes the official Facebook FastText model for Tibetan, which represents words as dense vectors in a high-dimensional space. A key advantage of FastText is its use of character n-grams, allowing it to generate meaningful vectors even for out-of-vocabulary words—a crucial feature for handling the orthographic variations common in Tibetan manuscripts. To create a single vector for an entire text segment, TTM employs a sophisticated TF-IDF weighted averaging of the word vectors, giving more weight to the embeddings of characteristic terms.
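A minimal sketch of such TF-IDF weighted averaging; the model path and the idf table are placeholders, and TTM's own implementation may differ in detail:

```python
# Illustrative TF-IDF weighted averaging of FastText word vectors.
# Assumes the official Tibetan model (e.g. cc.bo.300.bin) has been downloaded locally;
# in practice the idf table is computed over the full corpus, not hard-coded.
from collections import Counter

import fasttext
import numpy as np

model = fasttext.load_model("cc.bo.300.bin")  # placeholder path to the pre-trained model

def segment_vector(tokens: list[str], idf: dict[str, float]) -> np.ndarray:
    """Average word vectors, weighting each word by its tf * idf within the segment."""
    counts = Counter(tokens)
    vectors, weights = [], []
    for word, tf in counts.items():
        vectors.append(model.get_word_vector(word))  # n-gram fallback covers unseen spellings
        weights.append(tf * idf.get(word, 1.0))      # default weight for words missing from the idf table
    if not vectors:
        return np.zeros(model.get_dimension())
    return np.average(np.array(vectors), axis=0, weights=np.array(weights))
```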
55
 
56
  **Hugging Face Models:** In addition to FastText, TTM integrates the `sentence-transformers` library, providing access to a wide range of pre-trained models from the Hugging Face Hub. This allows researchers to leverage powerful, context-aware models like LaBSE or XLM-RoBERTa, which can capture nuanced semantic relationships between entire sentences and paragraphs.
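For illustration, comparing two segments with one of these models through `sentence-transformers` might look like the following (LaBSE shown; any compatible model id works):

```python
# Illustrative semantic comparison of two text segments with sentence-transformers.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

segment_a = "..."  # chapter text from edition A (placeholder)
segment_b = "..."  # chapter text from edition B (placeholder)

embeddings = model.encode([segment_a, segment_b], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Semantic similarity: {similarity:.3f}")
```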
57
+
58
+ ## 3. The TTM Web Application: Features and Functionality
59
+
60
+ The TTM application is designed to be a practical tool for researchers. Its features are built to facilitate an intuitive workflow, from data input to the interpretation of results.
61
+
62
+ ### 3.1. User Interface and Workflow
63
+
64
+ Built with the Gradio framework, the application's interface is clean and straightforward. The workflow is designed to be linear and intuitive:
65
+
66
+ 1. **File Upload:** Users begin by uploading one or more Tibetan `.txt` files.
67
+ 2. **Configuration:** Users can then configure the analysis by selecting which metrics to compute, choosing an embedding model, and setting the desired level of stopword filtering.
68
+ 3. **Execution:** A single "Run Analysis" button initiates the entire processing pipeline.
69
+
70
+ This simple, step-by-step process removes the barriers of command-line tools and complex software setups, making the analysis accessible to scholars without programming experience.
71
+
72
+ ### 3.2. Data Visualization
73
+
74
+ Understanding numerical similarity scores can be challenging. TTM addresses this by providing rich, interactive visualizations:
75
+
76
+ * **Heatmaps:** For each similarity metric, the application generates a heatmap that provides an at-a-glance overview of the relationships between all text segments. Darker cells indicate higher similarity, allowing researchers to quickly identify areas of strong textual connection.
77
+ * **Bar Charts:** A word count chart for each text provides a simple but effective visualization of the relative lengths of the segments, which is important context for interpreting the similarity scores.
78
+
79
+ These visualizations are not only useful for analysis but are also publication-ready, allowing researchers to easily incorporate them into their own work.
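As an illustration only (TTM's plotting code is not shown in this diff), a chapter-by-chapter similarity matrix can be rendered as such a heatmap with Plotly; the labels and values below are invented:

```python
# Illustrative heatmap of a similarity matrix; data and labels are made up.
import pandas as pd
import plotly.express as px

labels = ["Chapter 1", "Chapter 2", "Chapter 3"]
matrix = pd.DataFrame(
    [[1.00, 0.62, 0.41],
     [0.62, 1.00, 0.55],
     [0.41, 0.55, 1.00]],
    index=labels,
    columns=labels,
)

fig = px.imshow(matrix, color_continuous_scale="Blues", zmin=0, zmax=1,
                title="Jaccard Similarity")
fig.show()
```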
80
+
81
+ ### 3.3. AI-Powered Interpretation
82
+
83
+ A standout feature of the TTM application is its AI-powered interpretation engine. While quantitative metrics are powerful, their scholarly significance is not always self-evident. The "Interpret Results" button addresses this challenge by sending the computed metrics to a large language model (Mistral 7B via the OpenRouter API).
84
+
85
+ The AI then generates a qualitative analysis of the results, framed in the language of textual scholarship. This analysis typically includes:
86
+
87
+ * An overview of the general patterns of similarity.
88
+ * A discussion of notable chapters with particularly high or low similarity.
89
+ * An interpretation of what the different metrics collectively suggest about the texts' relationship (e.g., lexical borrowing vs. structural parallels).
90
+ * Suggestions for further scholarly investigation.
91
+
92
+ This feature acts as a bridge between the quantitative data and its qualitative interpretation, helping researchers to understand the implications of their findings and to formulate new research questions.
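A minimal sketch of the kind of request such a step could send to OpenRouter's OpenAI-compatible chat endpoint; the model slug, prompt wording, and environment variable name are assumptions rather than TTM's exact implementation:

```python
# Illustrative call to OpenRouter for an LLM reading of the computed metrics.
import os

import requests

def interpret_metrics(metrics_csv: str) -> str:
    response = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},  # assumed env var
        json={
            "model": "mistralai/mistral-7b-instruct",  # assumed model slug
            "messages": [
                {"role": "system",
                 "content": "You are a philological assistant interpreting text-similarity metrics."},
                {"role": "user",
                 "content": f"Summarise what these chapter-level metrics suggest:\n{metrics_csv}"},
            ],
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```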
93
+
94
+ ## 5. Discussion and Future Directions
95
+
96
+ ### 5.1. Interpreting the Metrics: A Holistic View
97
+
98
+ The true analytical power of the TTM application lies not in any single metric, but in the synthesis of all of them. For example, a high Jaccard similarity combined with a low LCS score might suggest that two texts share a common vocabulary but arrange it differently, perhaps indicating a shared topic but different authorial styles. Conversely, a high LCS score with a moderate Jaccard similarity could point to a shared structural backbone or direct borrowing, even with significant lexical variation. The addition of semantic similarity further enriches this picture, revealing conceptual connections that might be missed by lexical and structural methods alone. The TTM application facilitates this holistic approach, encouraging a nuanced interpretation of textual relationships.
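A toy example of the first pattern, using stand-in English tokens and an illustrative normalization (TTM's exact LCS formula may differ):

```python
# Two segments with identical vocabulary but different ordering:
# Jaccard similarity is maximal while the normalized LCS stays low.
def jaccard(a: list[str], b: list[str]) -> float:
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def normalized_lcs(a: list[str], b: list[str]) -> float:
    # Classic dynamic-programming LCS, here normalized by the shorter segment's length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1] / min(len(a), len(b))

seg1 = ["king", "law", "issued", "decree", "land"]
seg2 = ["land", "decree", "king", "issued", "law"]
print(jaccard(seg1, seg2))         # 1.0 -- identical vocabulary
print(normalized_lcs(seg1, seg2))  # 0.4 -- only a short shared subsequence
```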
99
+
100
+ ### 5.2. Limitations
101
+
102
+ While powerful, the TTM application has limitations. The quality of the analysis is highly dependent on the quality of the input texts; poorly scanned or OCR'd texts may yield unreliable results. The performance of the semantic models, while state-of-the-art, may also vary depending on the specific domain of the texts being analyzed. Furthermore, the AI-powered interpretation, while a useful guide, is not a substitute for scholarly expertise and should be treated as a starting point for further investigation, not a definitive conclusion.
103
+
104
+ ### 5.3. Future Work
105
+
106
+ The TTM project is under active development, with several potential avenues for future enhancement. These include:
107
+
108
+ * **Integration of More Models:** Expanding the library of available embedding models to include more domain-specific options.
109
+ * **Enhanced Visualization:** Adding more advanced visualization tools, such as network graphs to show relationships between multiple texts.
110
+ * **User-Trainable Models:** Exposing the functionality to train custom FastText models directly within the web UI, allowing researchers to create highly specialized models for their specific corpora.
111
+
112
+ ## 6. Conclusion
113
+
114
+ The Tibetan Text Metrics web application represents a significant step forward in making computational textual analysis accessible to the field of Tibetan studies. By combining a suite of powerful similarity metrics with an intuitive user interface and a novel AI-powered interpretation engine, TTM lowers the barrier to entry for digital humanities research. It provides scholars with a versatile tool to explore textual relationships, investigate manuscript histories, and generate new, data-driven insights. As such, TTM serves not as a replacement for traditional philology, but as a powerful complement, one that promises to enrich and expand the horizons of Tibetan textual scholarship.
app.py CHANGED
@@ -48,6 +48,18 @@ def main_interface():
48
  "<small>Note: Maximum file size: 10MB per file. For optimal performance, use files under 1MB.</small>",
49
  elem_classes="gr-markdown"
50
  )
51
  with gr.Column(scale=1, elem_classes="step-column"):
52
  with gr.Group():
53
  gr.Markdown(
@@ -257,9 +269,45 @@ Each segment is represented as a vector of these TF-IDF scores, and the cosine s
257
  # For now, this modification focuses on creating the plot object and making it an output.
258
  # The visual placement depends on how Gradio renders children of gr.Tab or if there's another container.
259
 
260
  warning_box = gr.Markdown(visible=False)
261
 
262
- def run_pipeline(files, enable_semantic, model_name, stopwords_option, batch_size, show_progress, progress=gr.Progress()):
263
  """Run the text analysis pipeline on the uploaded files.
264
 
265
  Args:
@@ -340,6 +388,8 @@ Each segment is represented as a vector of these TF-IDF scores, and the cosine s
340
  Path(file.name).name for file in files
341
  ] # Use Path().name to get just the filename
342
  text_data = {}
343
 
344
  # Read files with progress updates
345
  for i, file in enumerate(files):
@@ -390,15 +440,16 @@ Each segment is represented as a vector of these TF-IDF scores, and the cosine s
390
  internal_model_id = "facebook-fasttext-pretrained"
391
 
392
  df_results, word_counts_df_data, warning_raw = process_texts(
393
- text_data,
394
- filenames,
395
  enable_semantic=enable_semantic_bool,
396
  model_name=internal_model_id,
397
  use_stopwords=use_stopwords,
398
  use_lite_stopwords=use_lite_stopwords,
399
  progress_callback=progress_tracker,
400
  batch_size=batch_size,
401
- show_progress_bar=show_progress
 
402
  )
403
 
404
  if df_results.empty:
@@ -504,9 +555,20 @@ Each segment is represented as a vector of these TF-IDF scores, and the cosine s
504
  logger.error(f"Error in interpret_results: {e}", exc_info=True)
505
  return f"Error interpreting results: {str(e)}"
506
 
507
  process_btn.click(
508
  fn=run_pipeline,
509
- inputs=[file_input, semantic_toggle_radio, model_dropdown, stopwords_dropdown, batch_size_slider, progress_bar_checkbox],
510
  outputs=[
511
  csv_output,
512
  metrics_preview,
@@ -516,7 +578,7 @@ Each segment is represented as a vector of these TF-IDF scores, and the cosine s
516
  heatmap_tabs["Semantic Similarity"],
517
  heatmap_tabs["TF-IDF Cosine Sim"],
518
  warning_box,
519
- ]
520
  )
521
 
522
  # Connect the interpret button
 
48
  "<small>Note: Maximum file size: 10MB per file. For optimal performance, use files under 1MB.</small>",
49
  elem_classes="gr-markdown"
50
  )
51
+
52
+ with gr.Column(scale=1, elem_classes="step-column", elem_id="chapter-rename-column"):
53
+ chapter_rename_group = gr.Group(visible=False)
54
+ with chapter_rename_group:
55
+ gr.Markdown(
56
+ """## Step 1.5: Name Your Chapters (Optional)
57
+ <span style='font-size:16px;'>Provide a name for each chapter below. These names will be used in the heatmaps and results.</span>
58
+ """,
59
+ elem_classes="gr-markdown",
60
+ )
61
+ chapter_names_ui = gr.Column()
62
+
63
  with gr.Column(scale=1, elem_classes="step-column"):
64
  with gr.Group():
65
  gr.Markdown(
 
269
  # For now, this modification focuses on creating the plot object and making it an output.
270
  # The visual placement depends on how Gradio renders children of gr.Tab or if there's another container.
271
 
272
+ # Wire up the chapter renaming UI
273
+ file_input.upload(
274
+ setup_chapter_rename_ui,
275
+ inputs=[file_input],
276
+ outputs=[chapter_rename_group, chapter_names_ui, file_data_state]
277
+ )
278
+
279
  warning_box = gr.Markdown(visible=False)
280
 
281
+ # State to hold file info and chapter names
282
+ file_data_state = gr.State(value={})
283
+
284
+ def setup_chapter_rename_ui(files):
285
+ if not files:
286
+ return gr.update(visible=False), [], {}
287
+
288
+ file_data = {}
289
+ chapter_name_inputs = []
290
+ for file in files:
291
+ try:
292
+ content = Path(file.name).read_text(encoding="utf-8-sig")
293
+ segments = content.split('༈')
294
+ num_chapters = len(segments)
295
+ file_data[Path(file.name).name] = {'path': file.name, 'chapters': num_chapters}
296
+
297
+ for i in range(num_chapters):
298
+ default_name = f"{Path(file.name).name} - Chapter {i + 1}"
299
+ chapter_name_inputs.append(gr.Textbox(label=f"Name for Chapter {i+1} in '{Path(file.name).name}'", value=default_name))
300
+ except Exception as e:
301
+ logger.error(f"Error processing file {file.name} for chapter renaming: {e}")
302
+ # Handle file read error gracefully
303
+ pass
304
+
305
+ if not chapter_name_inputs:
306
+ return gr.update(visible=False), [], {}
307
+
308
+ return gr.update(visible=True), chapter_name_inputs, file_data
309
+
310
+ def run_pipeline(files, enable_semantic, model_name, stopwords_option, batch_size, show_progress, chapter_names_list, progress=gr.Progress()):
311
  """Run the text analysis pipeline on the uploaded files.
312
 
313
  Args:
 
388
  Path(file.name).name for file in files
389
  ] # Use Path().name to get just the filename
390
  text_data = {}
391
+ # The chapter_names_list is a list of lists, flatten it
392
+ flat_chapter_names = [name for sublist in chapter_names_list for name in sublist]
393
 
394
  # Read files with progress updates
395
  for i, file in enumerate(files):
 
440
  internal_model_id = "facebook-fasttext-pretrained"
441
 
442
  df_results, word_counts_df_data, warning_raw = process_texts(
443
+ text_data=text_data,
444
+ filenames=filenames,
445
  enable_semantic=enable_semantic_bool,
446
  model_name=internal_model_id,
447
  use_stopwords=use_stopwords,
448
  use_lite_stopwords=use_lite_stopwords,
449
  progress_callback=progress_tracker,
450
  batch_size=batch_size,
451
+ show_progress_bar=show_progress,
452
+ chapter_names=flat_chapter_names
453
  )
454
 
455
  if df_results.empty:
 
555
  logger.error(f"Error in interpret_results: {e}", exc_info=True)
556
  return f"Error interpreting results: {str(e)}"
557
 
558
+ # The `process_btn.click` call needs to be defined here.
559
+ # It will take inputs from all the configuration UI elements and the dynamic chapter name fields.
560
+ # The `chapter_names_ui` component, being a `gr.Column`, will pass the values of its children as a list.
561
  process_btn.click(
562
  fn=run_pipeline,
563
+ inputs=[
564
+ file_input,
565
+ semantic_toggle_radio,
566
+ model_dropdown,
567
+ stopwords_dropdown,
568
+ batch_size_slider,
569
+ progress_bar_checkbox,
570
+ chapter_names_ui, # Pass the column containing dynamic textboxes
571
+ ],
572
  outputs=[
573
  csv_output,
574
  metrics_preview,
 
578
  heatmap_tabs["Semantic Similarity"],
579
  heatmap_tabs["TF-IDF Cosine Sim"],
580
  warning_box,
581
+ ],
582
  )
583
 
584
  # Connect the interpret button
pipeline/process.py CHANGED
@@ -57,7 +57,8 @@ def process_texts(
57
  use_lite_stopwords: bool = False,
58
  progress_callback = None,
59
  batch_size: int = 32,
60
- show_progress_bar: bool = False
 
61
  ) -> Tuple[pd.DataFrame, pd.DataFrame, str]:
62
  """
63
  Processes uploaded texts, segments them by chapter marker, and computes metrics between chapters of different files.
@@ -152,6 +153,7 @@ def process_texts(
152
  chapter_marker = "༈"
153
  fallback = False
154
  segment_texts = {}
 
155
 
156
  # Process each file
157
  for i, fname in enumerate(filenames):
@@ -181,14 +183,19 @@ def process_texts(
181
  continue
182
 
183
  for idx, seg in enumerate(segments):
184
- seg_id = f"{fname}|chapter {idx+1}"
 
 
185
  cleaned_seg = clean_tibetan_text_for_fasttext(seg)
186
- segment_texts[seg_id] = cleaned_seg
 
187
  else:
188
  # No chapter markers found, treat entire file as one segment
189
- seg_id = f"{fname}|chapter 1"
 
190
  cleaned_content = clean_tibetan_text_for_fasttext(content.strip())
191
- segment_texts[seg_id] = cleaned_content
 
192
  fallback = True
193
 
194
  # Generate warning if no chapter markers found
@@ -213,7 +220,7 @@ def process_texts(
213
  logger.warning(f"Progress callback error (non-critical): {e}")
214
 
215
  all_segment_ids = list(segment_texts.keys())
216
- all_segment_contents = list(segment_texts.values())
217
  tokenized_segments_list = tokenize_texts(all_segment_contents)
218
 
219
  segment_tokens = dict(zip(all_segment_ids, tokenized_segments_list))
@@ -294,7 +301,7 @@ def process_texts(
294
  logger.info("Using botok word-level tokenization for FastText model.")
295
 
296
  pair_metrics = compute_all_metrics(
297
- texts={seg1: segment_texts[seg1], seg2: segment_texts[seg2]},
298
  token_lists={seg1: segment_tokens[seg1], seg2: segment_tokens[seg2]},
299
  model=model,
300
  enable_semantic=enable_semantic,
@@ -306,9 +313,10 @@ def process_texts(
306
  show_progress_bar=show_progress_bar
307
  )
308
 
309
- # Rename 'Text Pair' to show file stems and chapter number
 
310
  pair_metrics.loc[:, "Text Pair"] = f"{file1} vs {file2}"
311
- pair_metrics.loc[:, "Chapter"] = idx + 1
312
  results.append(pair_metrics)
313
 
314
  except Exception as e:
@@ -333,7 +341,7 @@ def process_texts(
333
  word_counts_data = []
334
 
335
  # Process each segment
336
- for i, (seg_id, text_content) in enumerate(segment_texts.items()):
337
  # Update progress
338
  if progress_callback is not None and len(segment_texts) > 0:
339
  try:
@@ -342,8 +350,7 @@ def process_texts(
342
  except Exception as e:
343
  logger.warning(f"Progress callback error (non-critical): {e}")
344
 
345
- fname, chapter_info = seg_id.split("|", 1)
346
- chapter_num = int(chapter_info.replace("chapter ", ""))
347
 
348
  try:
349
  # Use botok for accurate word count for raw Tibetan text
@@ -356,6 +363,7 @@ def process_texts(
356
  word_counts_data.append(
357
  {
358
  "Filename": fname.replace(".txt", ""),
 
359
  "ChapterNumber": chapter_num,
360
  "SegmentID": seg_id,
361
  "WordCount": word_count,
@@ -367,6 +375,7 @@ def process_texts(
367
  word_counts_data.append(
368
  {
369
  "Filename": fname.replace(".txt", ""),
 
370
  "ChapterNumber": chapter_num,
371
  "SegmentID": seg_id,
372
  "WordCount": 0,
 
57
  use_lite_stopwords: bool = False,
58
  progress_callback = None,
59
  batch_size: int = 32,
60
+ show_progress_bar: bool = False,
61
+ chapter_names: List[str] = None
62
  ) -> Tuple[pd.DataFrame, pd.DataFrame, str]:
63
  """
64
  Processes uploaded texts, segments them by chapter marker, and computes metrics between chapters of different files.
 
153
  chapter_marker = "༈"
154
  fallback = False
155
  segment_texts = {}
156
+ chapter_name_counter = 0
157
 
158
  # Process each file
159
  for i, fname in enumerate(filenames):
 
183
  continue
184
 
185
  for idx, seg in enumerate(segments):
186
+ # Use custom chapter name if available
187
+ custom_name = chapter_names[chapter_name_counter] if chapter_names and chapter_name_counter < len(chapter_names) else f"Chapter {idx + 1}"
188
+ seg_id = f"{fname}|{custom_name}"
189
  cleaned_seg = clean_tibetan_text_for_fasttext(seg)
190
+ segment_texts[seg_id] = (cleaned_seg, idx + 1) # Store text and original number
191
+ chapter_name_counter += 1
192
  else:
193
  # No chapter markers found, treat entire file as one segment
194
+ custom_name = chapter_names[chapter_name_counter] if chapter_names and chapter_name_counter < len(chapter_names) else "Chapter 1"
195
+ seg_id = f"{fname}|{custom_name}"
196
  cleaned_content = clean_tibetan_text_for_fasttext(content.strip())
197
+ segment_texts[seg_id] = (cleaned_content, 1)
198
+ chapter_name_counter += 1
199
  fallback = True
200
 
201
  # Generate warning if no chapter markers found
 
220
  logger.warning(f"Progress callback error (non-critical): {e}")
221
 
222
  all_segment_ids = list(segment_texts.keys())
223
+ all_segment_contents = [data[0] for data in segment_texts.values()]
224
  tokenized_segments_list = tokenize_texts(all_segment_contents)
225
 
226
  segment_tokens = dict(zip(all_segment_ids, tokenized_segments_list))
 
301
  logger.info("Using botok word-level tokenization for FastText model.")
302
 
303
  pair_metrics = compute_all_metrics(
304
+ texts={seg1: segment_texts[seg1][0], seg2: segment_texts[seg2][0]},
305
  token_lists={seg1: segment_tokens[seg1], seg2: segment_tokens[seg2]},
306
  model=model,
307
  enable_semantic=enable_semantic,
 
313
  show_progress_bar=show_progress_bar
314
  )
315
 
316
+ # Rename 'Text Pair' to show file stems and chapter name
317
+ chapter_name = seg1.split('|', 1)[1]
318
  pair_metrics.loc[:, "Text Pair"] = f"{file1} vs {file2}"
319
+ pair_metrics.loc[:, "Chapter"] = chapter_name
320
  results.append(pair_metrics)
321
 
322
  except Exception as e:
 
341
  word_counts_data = []
342
 
343
  # Process each segment
344
+ for i, (seg_id, (text_content, chapter_num)) in enumerate(segment_texts.items()):
345
  # Update progress
346
  if progress_callback is not None and len(segment_texts) > 0:
347
  try:
 
350
  except Exception as e:
351
  logger.warning(f"Progress callback error (non-critical): {e}")
352
 
353
+ fname, chapter_name = seg_id.split("|", 1)
 
354
 
355
  try:
356
  # Use botok for accurate word count for raw Tibetan text
 
363
  word_counts_data.append(
364
  {
365
  "Filename": fname.replace(".txt", ""),
366
+ "ChapterName": chapter_name,
367
  "ChapterNumber": chapter_num,
368
  "SegmentID": seg_id,
369
  "WordCount": word_count,
 
375
  word_counts_data.append(
376
  {
377
  "Filename": fname.replace(".txt", ""),
378
+ "ChapterName": chapter_name,
379
  "ChapterNumber": chapter_num,
380
  "SegmentID": seg_id,
381
  "WordCount": 0,