Commit 4c96fd8
Parent(s): 2e45dc8
chapter name feature added

Files changed:
- academic_article.md (+66 -3)
- app.py (+68 -6)
- pipeline/process.py (+21 -12)
academic_article.md
CHANGED
@@ -6,7 +6,7 @@
 ### 1.1. The Challenge of Tibetan Textual Scholarship
-The Tibetan literary corpus is one of the world's most extensive, encompassing centuries of philosophy, history,
 ### 1.2. Digital Humanities and Under-Resourced Languages

@@ -14,7 +14,7 @@ The rise of digital humanities has brought a wealth of computational tools to li
 ### 1.3. The Tibetan Text Metrics (TTM) Application
-This paper introduces the Tibetan Text Metrics (TTM) web application, a user-friendly, open-source tool designed to make sophisticated textual analysis accessible to scholars of Tibetan, regardless of their technical background. The application
 ## 2. Methodology: A Multi-faceted Approach to Text Similarity

@@ -28,7 +28,12 @@ Meaningful computational analysis begins with careful pre-processing. The TTM ap
-**Stopword Filtering:** Many words in a language
 ### 2.2. Lexical and Thematic Similarity Metrics

@@ -49,3 +54,61 @@ To capture similarities in meaning that may not be apparent from lexical overlap
### 1.1. The Challenge of Tibetan Textual Scholarship

The Tibetan literary corpus is one of the world's most extensive, encompassing centuries of philosophy, history, and religious doctrine. The transmission of these texts has been a complex process, involving manual copying that resulted in a rich but challenging textual landscape of divergent manuscript lineages. This challenge is exemplified by the development of the TTM application itself, which originated from the analysis of multiple editions of the 17th-century legal text, *The Pronouncements in Sixteen Chapters* (*zhal lce bcu drug*). An initial attempt to create a critical edition using standard collation software like CollateX proved untenable; the variations between editions were so substantial that they produced a convoluted apparatus that obscured, rather than clarified, the texts' relationships. It became clear that a different approach was needed—one that could move beyond one-to-one textual comparison to provide a higher-level, quantitative overview of textual similarity. TTM was developed to meet this need, providing a toolkit to assess relationships at the chapter level and reveal the broader patterns of textual evolution that traditional methods might miss.

### 1.2. Digital Humanities and Under-Resourced Languages

### 1.3. The Tibetan Text Metrics (TTM) Application

This paper introduces the Tibetan Text Metrics (TTM) web application, a user-friendly, open-source tool designed to make sophisticated textual analysis accessible to scholars of Tibetan, regardless of their technical background. The application empowers researchers to move beyond manual comparison by providing a suite of computational metrics that reveal distinct aspects of textual relationships—from direct lexical overlap (Jaccard similarity) and shared narrative structure (Normalized LCS) to thematic parallels (TF-IDF) and deep semantic connections (FastText and Transformer-based embeddings). This article will detail the methodologies underpinning these metrics, describe the application's key features—including its novel AI-powered interpretation engine—and demonstrate its practical utility through a case study. By doing so, we aim to show how TTM can serve as a valuable assistant in the scholar's toolkit, augmenting traditional research methods and opening up new possibilities for the study of Tibetan textual history.

## 2. Methodology: A Multi-faceted Approach to Text Similarity
**Tokenization:** To analyze a text computationally, it must be broken down into individual units, or tokens. Given the syllabic nature of the Tibetan script, where morphemes are delimited by a *tsheg* (་), simple whitespace tokenization is inadequate. TTM leverages the `botok` library, a state-of-the-art tokenizer for Tibetan, which accurately identifies word boundaries, ensuring that the subsequent analysis is based on meaningful linguistic units.
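For readers who want to reproduce this step outside the application, a minimal sketch using `botok`'s default configuration (which downloads its dictionary data on first run) looks like the following; the sample sentence is purely illustrative:

```python
from botok import WordTokenizer

# Default configuration; botok fetches its dialect pack on first use.
wt = WordTokenizer()

sentence = "བཀྲ་ཤིས་བདེ་ལེགས་ཞུས་རྒྱུ་ཡིན་"
tokens = wt.tokenize(sentence, split_affixes=False)

# Each token object carries the surface form along with POS and other attributes.
print([t.text for t in tokens])
```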
**Stopword Filtering:** Many words in a language are grammatically necessary but carry little unique semantic weight. These "stopwords" can skew similarity scores by creating an illusion of similarity based on common grammatical structures. TTM provides two levels of optional stopword filtering to address this:

* The **Standard** list targets only the most frequent, low-information grammatical particles and punctuation (e.g., the instrumental particle `གིས་` (gis), the genitive particle `གི་` (gi), and the sentence-ending *shad* `།`).
* The **Aggressive** list includes the standard particles but also removes a wider range of function words, such as pronouns (e.g., `འདི` (this), `དེ་` (that)), auxiliary verbs (e.g., `ཡིན་` (is)), and common quantifiers (e.g., `རྣམས་` (plural marker)).

This tiered approach allows researchers to fine-tune their analysis, either preserving the grammatical structure or focusing purely on the substantive vocabulary of a text.
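In practice, the filtering simply drops any token found in the selected list. The sketch below uses tiny illustrative sets; the application's own lists are considerably longer:

```python
# Illustrative subsets only; TTM ships fuller curated lists.
STANDARD_STOPWORDS = {"གིས་", "གི་", "།"}
AGGRESSIVE_STOPWORDS = STANDARD_STOPWORDS | {"འདི", "དེ་", "ཡིན་", "རྣམས་"}

def filter_tokens(tokens, level=None):
    """Remove stopwords at the requested level ('standard', 'aggressive', or None)."""
    if level is None:
        return tokens
    stops = STANDARD_STOPWORDS if level == "standard" else AGGRESSIVE_STOPWORDS
    return [t for t in tokens if t not in stops]

print(filter_tokens(["ཁྲིམས་", "གི་", "དོན་"], level="standard"))  # ['ཁྲིམས་', 'དོན་']
```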
### 2.2. Lexical and Thematic Similarity Metrics
**FastText Embeddings:** The application utilizes the official Facebook FastText model for Tibetan, which represents words as dense vectors in a high-dimensional space. A key advantage of FastText is its use of character n-grams, allowing it to generate meaningful vectors even for out-of-vocabulary words—a crucial feature for handling the orthographic variations common in Tibetan manuscripts. To create a single vector for an entire text segment, TTM employs a sophisticated TF-IDF weighted averaging of the word vectors, giving more weight to the embeddings of characteristic terms.
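The weighted-averaging idea can be sketched as follows. This is a simplified illustration rather than the application's exact code: it assumes the pre-trained Tibetan vectors (e.g. `cc.bo.300.bin`) are already downloaded, that segments arrive as token lists, and that a plain smoothed IDF is a reasonable stand-in for TTM's actual weighting:

```python
from collections import Counter
from math import log

import fasttext
import numpy as np

ft = fasttext.load_model("cc.bo.300.bin")  # official Facebook FastText vectors for Tibetan

def tfidf_weighted_vector(segment_tokens, all_segments):
    """Average FastText word vectors, weighting each word by a simple TF-IDF score."""
    n_docs = len(all_segments)
    term_freq = Counter(segment_tokens)
    vec = np.zeros(ft.get_dimension())
    total_weight = 0.0
    for word, tf in term_freq.items():
        df = sum(1 for seg in all_segments if word in seg)
        idf = log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF
        weight = tf * idf
        vec += weight * ft.get_word_vector(word)
        total_weight += weight
    return vec / total_weight if total_weight else vec
```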
**Hugging Face Models:** In addition to FastText, TTM integrates the `sentence-transformers` library, providing access to a wide range of pre-trained models from the Hugging Face Hub. This allows researchers to leverage powerful, context-aware models like LaBSE or XLM-RoBERTa, which can capture nuanced semantic relationships between entire sentences and paragraphs.
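A generic `sentence-transformers` call illustrates how such models are applied to a pair of segments (this shows the library's usage, not TTM's internal pipeline):

```python
from sentence_transformers import SentenceTransformer, util

# LaBSE is one of the multilingual models mentioned above.
model = SentenceTransformer("sentence-transformers/LaBSE")

segments = ["First Tibetan segment ...", "Second Tibetan segment ..."]
embeddings = model.encode(segments, convert_to_tensor=True)

# Cosine similarity between the two segment embeddings.
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Semantic similarity: {score:.3f}")
```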
## 3. The TTM Web Application: Features and Functionality

The TTM application is designed to be a practical tool for researchers. Its features are built to facilitate an intuitive workflow, from data input to the interpretation of results.

### 3.1. User Interface and Workflow

Built with the Gradio framework, the application's interface is clean and straightforward. The workflow is designed to be linear and intuitive:

1. **File Upload:** Users begin by uploading one or more Tibetan `.txt` files.
2. **Configuration:** Users can then configure the analysis by selecting which metrics to compute, choosing an embedding model, and setting the desired level of stopword filtering.
3. **Execution:** A single "Run Analysis" button initiates the entire processing pipeline.

This simple, step-by-step process removes the barriers of command-line tools and complex software setups, making the technology accessible to all scholars.
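The skeleton of such an interface is compact. The sketch below is a simplified stand-in for the real `app.py`: the component names and the `run_analysis` callback are placeholders, not TTM's actual identifiers.

```python
import gradio as gr

def run_analysis(files, metrics, model_name, stopword_level):
    # Placeholder: the real pipeline tokenizes, filters, and scores the uploads.
    return f"Analysed {len(files or [])} file(s) with {model_name}."

with gr.Blocks() as demo:
    files = gr.File(label="Tibetan .txt files", file_count="multiple", file_types=[".txt"])
    metrics = gr.CheckboxGroup(["Jaccard", "Normalized LCS", "TF-IDF", "Semantic"], label="Metrics")
    model_name = gr.Dropdown(["facebook-fasttext-pretrained", "sentence-transformers/LaBSE"], label="Embedding model")
    stopword_level = gr.Radio(["none", "standard", "aggressive"], label="Stopword filtering")
    run_btn = gr.Button("Run Analysis")
    status = gr.Textbox(label="Status")
    run_btn.click(run_analysis, inputs=[files, metrics, model_name, stopword_level], outputs=status)

demo.launch()
```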
### 3.2. Data Visualization

Understanding numerical similarity scores can be challenging. TTM addresses this by providing rich, interactive visualizations:

* **Heatmaps:** For each similarity metric, the application generates a heatmap that provides an at-a-glance overview of the relationships between all text segments. Darker cells indicate higher similarity, allowing researchers to quickly identify areas of strong textual connection.
* **Bar Charts:** A word count chart for each text provides a simple but effective visualization of the relative lengths of the segments, which is important context for interpreting the similarity scores.

These visualizations are not only useful for analysis but are also publication-ready, allowing researchers to easily incorporate them into their own work.
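A heatmap over a pairwise similarity matrix is easy to reproduce outside the app; the following standalone sketch uses synthetic scores purely for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np

chapters = ["Ch. 1", "Ch. 2", "Ch. 3"]
# Synthetic pairwise similarity scores in [0, 1].
scores = np.array([
    [1.00, 0.62, 0.31],
    [0.62, 1.00, 0.48],
    [0.31, 0.48, 1.00],
])

fig, ax = plt.subplots()
im = ax.imshow(scores, cmap="Blues", vmin=0, vmax=1)  # darker cells = higher similarity
ax.set_xticks(range(len(chapters)), labels=chapters)
ax.set_yticks(range(len(chapters)), labels=chapters)
fig.colorbar(im, label="Similarity")
ax.set_title("Chapter-level similarity")
plt.show()
```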
### 3.3. AI-Powered Interpretation

A standout feature of the TTM application is its AI-powered interpretation engine. While quantitative metrics are powerful, their scholarly significance is not always self-evident. The "Interpret Results" button addresses this challenge by sending the computed metrics to a large language model (Mistral 7B via the OpenRouter API).

The AI then generates a qualitative analysis of the results, framed in the language of textual scholarship. This analysis typically includes:

* An overview of the general patterns of similarity.
* A discussion of notable chapters with particularly high or low similarity.
* An interpretation of what the different metrics collectively suggest about the texts' relationship (e.g., lexical borrowing vs. structural parallels).
* Suggestions for further scholarly investigation.

This feature acts as a bridge between the quantitative data and its qualitative interpretation, helping researchers to understand the implications of their findings and to formulate new research questions.
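A request of this kind follows OpenRouter's OpenAI-compatible chat endpoint. The sketch below is an assumption-laden illustration rather than TTM's actual call: the model identifier, prompt wording, and environment variable name are placeholders.

```python
import os
import requests

def interpret_results(metrics_csv: str) -> str:
    """Send a metrics table to a hosted LLM and return its qualitative reading."""
    response = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": "mistralai/mistral-7b-instruct",  # assumed identifier for Mistral 7B
            "messages": [
                {"role": "system", "content": "You assist with Tibetan textual scholarship."},
                {"role": "user", "content": f"Interpret these chapter similarity metrics:\n{metrics_csv}"},
            ],
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```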
## 5. Discussion and Future Directions
### 5.1. Interpreting the Metrics: A Holistic View

The true analytical power of the TTM application lies not in any single metric, but in the synthesis of all of them. For example, a high Jaccard similarity combined with a low LCS score might suggest that two texts share a common vocabulary but arrange it differently, perhaps indicating a shared topic but different authorial styles. Conversely, a high LCS score with a moderate Jaccard similarity could point to a shared structural backbone or direct borrowing, even with significant lexical variation. The addition of semantic similarity further enriches this picture, revealing conceptual connections that might be missed by lexical and structural methods alone. The TTM application facilitates this holistic approach, encouraging a nuanced interpretation of textual relationships.
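To make these combinations concrete, here is a compact sketch of Jaccard similarity and a length-normalized LCS computed over token lists; normalizing by the shorter segment is one common convention, and TTM's exact normalization may differ:

```python
def jaccard(tokens_a, tokens_b):
    """Share of distinct words that the two segments have in common."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def normalized_lcs(tokens_a, tokens_b):
    """Longest common subsequence of words, normalized by the shorter segment."""
    m, n = len(tokens_a), len(tokens_b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if tokens_a[i] == tokens_b[j] else max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / min(m, n) if min(m, n) else 0.0
```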
### 5.2. Limitations

While powerful, the TTM application has limitations. The quality of the analysis is highly dependent on the quality of the input texts; poorly scanned or OCR'd texts may yield unreliable results. The performance of the semantic models, while state-of-the-art, may also vary depending on the specific domain of the texts being analyzed. Furthermore, the AI-powered interpretation, while a useful guide, is not a substitute for scholarly expertise and should be treated as a starting point for further investigation, not a definitive conclusion.
### 5.3. Future Work

The TTM project is under active development, with several potential avenues for future enhancement. These include:

* **Integration of More Models:** Expanding the library of available embedding models to include more domain-specific options.
* **Enhanced Visualization:** Adding more advanced visualization tools, such as network graphs to show relationships between multiple texts.
* **User-Trainable Models:** Exposing the functionality to train custom FastText models directly within the web UI, allowing researchers to create highly specialized models for their specific corpora (a minimal training sketch follows this list).
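Training such a corpus-specific model is already a few lines with `gensim`; the sketch below assumes a pre-tokenized corpus and uses illustrative hyperparameters rather than whatever defaults the web UI would eventually expose:

```python
from gensim.models import FastText

# Each sentence is a list of botok tokens drawn from the researcher's own corpus.
corpus = [
    ["བཀྲ་ཤིས་", "བདེ་ལེགས་"],
    ["ཁྲིམས་", "དོན་", "བཅུ་དྲུག་"],
]

model = FastText(sentences=corpus, vector_size=100, window=5, min_count=1, epochs=10)
model.save("custom_tibetan_fasttext.model")

vector = model.wv["ཁྲིམས་"]  # subword n-grams also cover unseen spellings
```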
## 6. Conclusion
The Tibetan Text Metrics web application represents a significant step forward in making computational textual analysis accessible to the field of Tibetan studies. By combining a suite of powerful similarity metrics with an intuitive user interface and a novel AI-powered interpretation engine, TTM lowers the barrier to entry for digital humanities research. It provides scholars with a versatile tool to explore textual relationships, investigate manuscript histories, and generate new, data-driven insights. As such, TTM serves not as a replacement for traditional philology, but as a powerful complement, one that promises to enrich and expand the horizons of Tibetan textual scholarship.
app.py
CHANGED
@@ -48,6 +48,18 @@ def main_interface():

@@ -257,9 +269,45 @@ Each segment is represented as a vector of these TF-IDF scores, and the cosine s

@@ -340,6 +388,8 @@ Each segment is represented as a vector of these TF-IDF scores, and the cosine s

@@ -390,15 +440,16 @@ Each segment is represented as a vector of these TF-IDF scores, and the cosine s
         df_results, word_counts_df_data, warning_raw = process_texts(
-            text_data,
-            filenames,
-            show_progress_bar=show_progress

@@ -504,9 +555,20 @@ Each segment is represented as a vector of these TF-IDF scores, and the cosine s
         process_btn.click(
             fn=run_pipeline,
-            inputs=[

@@ -516,7 +578,7 @@ Each segment is represented as a vector of these TF-IDF scores, and the cosine s
             warning_box,
-        ]
@@ -48,6 +48,18 @@ def main_interface():
                "<small>Note: Maximum file size: 10MB per file. For optimal performance, use files under 1MB.</small>",
                elem_classes="gr-markdown"
            )

        with gr.Column(scale=1, elem_classes="step-column", elem_id="chapter-rename-column"):
            chapter_rename_group = gr.Group(visible=False)
            with chapter_rename_group:
                gr.Markdown(
                    """## Step 1.5: Name Your Chapters (Optional)
<span style='font-size:16px;'>Provide a name for each chapter below. These names will be used in the heatmaps and results.</span>
""",
                    elem_classes="gr-markdown",
                )
            chapter_names_ui = gr.Column()

        with gr.Column(scale=1, elem_classes="step-column"):
            with gr.Group():
                gr.Markdown(

@@ -257,9 +269,45 @@ Each segment is represented as a vector of these TF-IDF scores, and the cosine s
        # For now, this modification focuses on creating the plot object and making it an output.
        # The visual placement depends on how Gradio renders children of gr.Tab or if there's another container.

        # Wire up the chapter renaming UI
        file_input.upload(
            setup_chapter_rename_ui,
            inputs=[file_input],
            outputs=[chapter_rename_group, chapter_names_ui, file_data_state]
        )

        warning_box = gr.Markdown(visible=False)

        # State to hold file info and chapter names
        file_data_state = gr.State(value={})

        def setup_chapter_rename_ui(files):
            if not files:
                return gr.update(visible=False), [], {}

            file_data = {}
            chapter_name_inputs = []
            for file in files:
                try:
                    content = Path(file.name).read_text(encoding="utf-8-sig")
                    segments = content.split('༈')
                    num_chapters = len(segments)
                    file_data[Path(file.name).name] = {'path': file.name, 'chapters': num_chapters}

                    for i in range(num_chapters):
                        default_name = f"{Path(file.name).name} - Chapter {i + 1}"
                        chapter_name_inputs.append(gr.Textbox(label=f"Name for Chapter {i+1} in '{Path(file.name).name}'", value=default_name))
                except Exception as e:
                    logger.error(f"Error processing file {file.name} for chapter renaming: {e}")
                    # Handle file read error gracefully
                    pass

            if not chapter_name_inputs:
                return gr.update(visible=False), [], {}

            return gr.update(visible=True), chapter_name_inputs, file_data

        def run_pipeline(files, enable_semantic, model_name, stopwords_option, batch_size, show_progress, chapter_names_list, progress=gr.Progress()):
            """Run the text analysis pipeline on the uploaded files.

            Args:

@@ -340,6 +388,8 @@ Each segment is represented as a vector of these TF-IDF scores, and the cosine s
                Path(file.name).name for file in files
            ]  # Use Path().name to get just the filename
            text_data = {}
            # The chapter_names_list is a list of lists, flatten it
            flat_chapter_names = [name for sublist in chapter_names_list for name in sublist]

            # Read files with progress updates
            for i, file in enumerate(files):

@@ -390,15 +440,16 @@ Each segment is represented as a vector of these TF-IDF scores, and the cosine s
            internal_model_id = "facebook-fasttext-pretrained"

            df_results, word_counts_df_data, warning_raw = process_texts(
                text_data=text_data,
                filenames=filenames,
                enable_semantic=enable_semantic_bool,
                model_name=internal_model_id,
                use_stopwords=use_stopwords,
                use_lite_stopwords=use_lite_stopwords,
                progress_callback=progress_tracker,
                batch_size=batch_size,
                show_progress_bar=show_progress,
                chapter_names=flat_chapter_names
            )

            if df_results.empty:

@@ -504,9 +555,20 @@ Each segment is represented as a vector of these TF-IDF scores, and the cosine s
            logger.error(f"Error in interpret_results: {e}", exc_info=True)
            return f"Error interpreting results: {str(e)}"

        # The `process_btn.click` call needs to be defined here.
        # It will take inputs from all the configuration UI elements and the dynamic chapter name fields.
        # The `chapter_names_ui` component, being a `gr.Column`, will pass the values of its children as a list.
        process_btn.click(
            fn=run_pipeline,
            inputs=[
                file_input,
                semantic_toggle_radio,
                model_dropdown,
                stopwords_dropdown,
                batch_size_slider,
                progress_bar_checkbox,
                chapter_names_ui,  # Pass the column containing dynamic textboxes
            ],
            outputs=[
                csv_output,
                metrics_preview,

@@ -516,7 +578,7 @@ Each segment is represented as a vector of these TF-IDF scores, and the cosine s
                heatmap_tabs["Semantic Similarity"],
                heatmap_tabs["TF-IDF Cosine Sim"],
                warning_box,
            ],
        )

        # Connect the interpret button
pipeline/process.py
CHANGED
@@ -57,7 +57,8 @@ def process_texts(
-    show_progress_bar: bool = False

@@ -152,6 +153,7 @@

@@ -181,14 +183,19 @@
-                segment_texts[seg_id] = cleaned_seg
-            segment_texts[seg_id] = cleaned_content

@@ -213,7 +220,7 @@
-    all_segment_contents =

@@ -294,7 +301,7 @@
-                texts={seg1: segment_texts[seg1], seg2: segment_texts[seg2]},

@@ -306,9 +313,10 @@
-            # Rename 'Text Pair' to show file stems and chapter
-            pair_metrics.loc[:, "Chapter"] =

@@ -333,7 +341,7 @@
-    for i, (seg_id, text_content) in enumerate(segment_texts.items()):

@@ -342,8 +350,7 @@
-        fname,
-        chapter_num = int(chapter_info.replace("chapter ", ""))

@@ -356,6 +363,7 @@

@@ -367,6 +375,7 @@
|
|
57 |
use_lite_stopwords: bool = False,
|
58 |
progress_callback = None,
|
59 |
batch_size: int = 32,
|
60 |
+
show_progress_bar: bool = False,
|
61 |
+
chapter_names: List[str] = None
|
62 |
) -> Tuple[pd.DataFrame, pd.DataFrame, str]:
|
63 |
"""
|
64 |
Processes uploaded texts, segments them by chapter marker, and computes metrics between chapters of different files.
|
|
|
153 |
chapter_marker = "༈"
|
154 |
fallback = False
|
155 |
segment_texts = {}
|
156 |
+
chapter_name_counter = 0
|
157 |
|
158 |
# Process each file
|
159 |
for i, fname in enumerate(filenames):
|
|
|
183 |
continue
|
184 |
|
185 |
for idx, seg in enumerate(segments):
|
186 |
+
# Use custom chapter name if available
|
187 |
+
custom_name = chapter_names[chapter_name_counter] if chapter_names and chapter_name_counter < len(chapter_names) else f"Chapter {idx + 1}"
|
188 |
+
seg_id = f"{fname}|{custom_name}"
|
189 |
cleaned_seg = clean_tibetan_text_for_fasttext(seg)
|
190 |
+
segment_texts[seg_id] = (cleaned_seg, idx + 1) # Store text and original number
|
191 |
+
chapter_name_counter += 1
|
192 |
else:
|
193 |
# No chapter markers found, treat entire file as one segment
|
194 |
+
custom_name = chapter_names[chapter_name_counter] if chapter_names and chapter_name_counter < len(chapter_names) else "Chapter 1"
|
195 |
+
seg_id = f"{fname}|{custom_name}"
|
196 |
cleaned_content = clean_tibetan_text_for_fasttext(content.strip())
|
197 |
+
segment_texts[seg_id] = (cleaned_content, 1)
|
198 |
+
chapter_name_counter += 1
|
199 |
fallback = True
|
200 |
|
201 |
# Generate warning if no chapter markers found
|
|
|
220 |
logger.warning(f"Progress callback error (non-critical): {e}")
|
221 |
|
222 |
all_segment_ids = list(segment_texts.keys())
|
223 |
+
all_segment_contents = [data[0] for data in segment_texts.values()]
|
224 |
tokenized_segments_list = tokenize_texts(all_segment_contents)
|
225 |
|
226 |
segment_tokens = dict(zip(all_segment_ids, tokenized_segments_list))
|
|
|
301 |
logger.info("Using botok word-level tokenization for FastText model.")
|
302 |
|
303 |
pair_metrics = compute_all_metrics(
|
304 |
+
texts={seg1: segment_texts[seg1][0], seg2: segment_texts[seg2][0]},
|
305 |
token_lists={seg1: segment_tokens[seg1], seg2: segment_tokens[seg2]},
|
306 |
model=model,
|
307 |
enable_semantic=enable_semantic,
|
|
|
313 |
show_progress_bar=show_progress_bar
|
314 |
)
|
315 |
|
316 |
+
# Rename 'Text Pair' to show file stems and chapter name
|
317 |
+
chapter_name = seg1.split('|', 1)[1]
|
318 |
pair_metrics.loc[:, "Text Pair"] = f"{file1} vs {file2}"
|
319 |
+
pair_metrics.loc[:, "Chapter"] = chapter_name
|
320 |
results.append(pair_metrics)
|
321 |
|
322 |
except Exception as e:
|
|
|
341 |
word_counts_data = []
|
342 |
|
343 |
# Process each segment
|
344 |
+
for i, (seg_id, (text_content, chapter_num)) in enumerate(segment_texts.items()):
|
345 |
# Update progress
|
346 |
if progress_callback is not None and len(segment_texts) > 0:
|
347 |
try:
|
|
|
350 |
except Exception as e:
|
351 |
logger.warning(f"Progress callback error (non-critical): {e}")
|
352 |
|
353 |
+
fname, chapter_name = seg_id.split("|", 1)
|
|
|
354 |
|
355 |
try:
|
356 |
# Use botok for accurate word count for raw Tibetan text
|
|
|
363 |
word_counts_data.append(
|
364 |
{
|
365 |
"Filename": fname.replace(".txt", ""),
|
366 |
+
"ChapterName": chapter_name,
|
367 |
"ChapterNumber": chapter_num,
|
368 |
"SegmentID": seg_id,
|
369 |
"WordCount": word_count,
|
|
|
375 |
word_counts_data.append(
|
376 |
{
|
377 |
"Filename": fname.replace(".txt", ""),
|
378 |
+
"ChapterName": chapter_name,
|
379 |
"ChapterNumber": chapter_num,
|
380 |
"SegmentID": seg_id,
|
381 |
"WordCount": 0,
|
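To see what the renamed segments produce downstream: for an upload `text_a.txt` containing two ༈-delimited segments renamed "Prologue" and "Law One" in the UI, the code above yields word-count records along these lines (the counts themselves are illustrative):

```python
[
    {"Filename": "text_a", "ChapterName": "Prologue", "ChapterNumber": 1,
     "SegmentID": "text_a.txt|Prologue", "WordCount": 412},
    {"Filename": "text_a", "ChapterName": "Law One", "ChapterNumber": 2,
     "SegmentID": "text_a.txt|Law One", "WordCount": 1287},
]
```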