dodo13114 committed on
Commit 0a62c81 · verified · 1 Parent(s): 8d5adee

Upload 3 files

Files changed (3)
  1. README.md +130 -14
  2. mistralocr_app_demo.py +1300 -0
  3. requirements.txt +64 -0
README.md CHANGED
@@ -1,14 +1,130 @@
- ---
- title: Mistral Ocr Translator Demo
- emoji: 👀
- colorFrom: purple
- colorTo: purple
- sdk: gradio
- sdk_version: 5.25.2
- app_file: app.py
- pinned: false
- license: mit
- short_description: ' mistral-ocr-translator-demo'
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ ---
+ title: Mistral OCR Translation Tool
+ emoji: 📄
+ colorFrom: indigo
+ colorTo: purple
+ sdk: gradio
+ sdk_version: "5.25.2"
+ app_file: mistralocr_app_demo.py
+ pinned: false
+ ---
+ # Mistral OCR & Translation Tool
+
+ Convert PDF files to Markdown with image OCR and English-to-Traditional Chinese translation, powered by Mistral, Gemini, and OpenAI models.
+
+ ---
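+
+ ## Features
+
+ - 📄 **PDF OCR**: Extracts text and image content from PDFs with a Mistral model.
+ - 🌐 **Translation**: Translates English content into Traditional Chinese, using Gemini or OpenAI models.
+ - 🖼️ **Image handling**: Automatically saves the images in a PDF and embeds them in the Markdown.
+ - 💾 **Multiple outputs**: Generates Markdown files for both the English original and the Traditional Chinese translation.
+ - 🖥️ **Gradio interface**: An intuitive web UI that needs no local installation.
+
+ ---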
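+
+ ## Quick Start
+
+ The tool is deployed on Hugging Face Spaces, so you can try it without any local setup. Follow these steps:
+
+ 1. **Upload a PDF**:
+    - In the Gradio interface, drag and drop a file or click "上傳 PDF 檔案" (Upload PDF file) and select your PDF.
+    - Small PDFs (<10MB) are recommended for fast processing.
+ 2. **Enter API keys**:
+    - **Mistral API key** (required): used for OCR.
+    - **Gemini/OpenAI key** (optional): used for translation or structuring.
+ 3. **Set options**:
+    - Choose the output formats (Chinese translation, English original; both can be selected).
+    - Enable "處理圖片 OCR" (Process image OCR). It is on by default and suits scanned documents or charts.
+ 4. **Start processing**:
+    - Click the "開始處理" (Start processing) button.
+    - Follow progress in the "處理日誌" (Processing log) tab. When processing finishes, download the results (Markdown and images) from the "下載檔案" (Download files) tab.
+
+ > **Tip**: Make sure your network connection is stable so the API requests can complete. For a first run, pick a PDF that contains both text and figures to exercise the full OCR and translation flow.
+
+ ---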
+
+ ## Requirements
+
+ - **Mistral API key** (required): Get one from the [Mistral Console](https://console.mistral.ai/); used for PDF and image OCR.
+ - **Gemini API key** (optional): Get one from [Google AI Studio](https://aistudio.google.com/app/apikey); used for translation or structuring.
+ - **OpenAI API key** (optional): Get one from the [OpenAI Platform](https://platform.openai.com/api-keys); used for GPT models.
+ - **Network connection**: A stable connection keeps API requests running smoothly.
+
+ > **Note**: API keys are used only while processing and are not stored.
+
+ ---
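+
+ ## API Usage Reference (Rough Estimates)
+
+ The API usage from two real test runs is listed below to help estimate approximate consumption:
+
+ ### Test scenario 1 (Gemini for the full pipeline)
+
+ - **Sample PDF**: First 3 pages of the Jones & Bergen (2025) paper (with 1 image)
+ - **Mistral OCR**: About **4 pages** consumed (the image is processed in one extra call)
+ - **Gemini 2.0 Flash**:
+    - Structuring + translation (single model)
+    - About **7,300 input tokens**
+
+ ### Test scenario 2 (split processing: Gemini structuring + GPT-4o Mini translation)
+
+ - **Sample PDF**: Another 3-page English document (with images)
+ - **Mistral OCR**: About **4 pages** consumed
+ - **Gemini 2.0 Flash** (structuring only):
+    - About **2,357 input tokens**
+ - **GPT-4o Mini** (translation):
+    - About **4,440 input tokens**
+
+ > **Note**: Actual consumption varies with page count, content density, the proportion of images, and how much is translated. These figures are for reference only.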
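
The first test scenario suggests a simple rule of thumb for Mistral OCR usage: one unit per PDF page plus one unit per embedded image when image OCR is enabled (3 pages + 1 image = 4). The sketch below encodes only that inference. The function name `estimate_ocr_pages` and its parameters are hypothetical, and actual billing may differ.

```python
def estimate_ocr_pages(pdf_pages: int, images: int, process_images: bool = True) -> int:
    """Rough Mistral OCR usage: one unit per page, plus one per image if image OCR is on.

    This is an assumption inferred from test scenario 1, not an official pricing rule.
    """
    if pdf_pages < 0 or images < 0:
        raise ValueError("pdf_pages and images must be non-negative")
    return pdf_pages + (images if process_images else 0)


# Test scenario 1: 3 pages with 1 image
print(estimate_ocr_pages(3, 1))  # 4
```

Token counts for Gemini and GPT models depend on content density, so they are not estimated here.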
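+
+ Citation for one of the test samples:
+ Jones, C. R., & Bergen, B. K. (2025). *Large Language Models Pass the Turing Test*. *arXiv preprint* [arXiv:2503.23674](https://arxiv.org/abs/2503.23674)
+ Only the first 3 pages of the paper were used as sample input for testing the processing pipeline; its content was not reproduced, modified, or distributed.
+
+ ---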
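+
+ ## Notes
+
+ - **File size**: Large PDFs (>50MB) may process slowly because of API quotas or Spaces resource limits.
+ - **Translation accuracy**: AI translation can contain errors; check important content against the original.
+ - **Copyright**: Make sure the PDFs you upload comply with copyright law and that you have the right to OCR and translate them.
+ - **Checkpoints**: The tool saves temporary checkpoints to speed up repeated processing; this can be disabled manually.
+
+ ---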
+
+ ## Technology and Credits
+
+ This project integrates the following technologies and extends an official Mistral example:
+
+ - [Mistral AI](https://mistral.ai/): PDF and image OCR.
+ - [Google Gemini](https://ai.google.dev/): Translation and structuring.
+ - [OpenAI](https://openai.com/): GPT model support.
+ - [Gradio](https://www.gradio.app/): Interactive interface.
+ - Adapted from the [Mistral OCR Notebook](https://colab.research.google.com/github/mistralai/cookbook/blob/main/mistral/ocr/structured_ocr.ipynb).
+
+ Thanks to these service providers for their technical support!
+
+ ---
+
+ ## License
+
+ Released under the MIT License; see [LICENSE](./LICENSE).
+
+ **Copyright**: © 2025 David Chang
+
+ ---
+
+ ## Contact and Feedback
+
+ - **Author**: David Chang
+ - **GitHub**: https://github.com/dodo13114arch/mistralocr-pdf2md-translator
+ - **Issues and suggestions**: Issues and pull requests are welcome on GitHub!
+ - **Support the project**: If you find it useful, please give it a star ⭐!
+
+ ---
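+
+ **Disclaimer**
+ This tool is for learning and research purposes only. Users are responsible for complying with the API providers' terms ([Mistral](https://mistral.ai/terms), [Gemini](https://ai.google.dev/terms), [OpenAI](https://openai.com/policies)) and for ensuring that uploaded PDFs are lawful. Translations are for reference only and may be inaccurate; verify them yourself.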
mistralocr_app_demo.py ADDED
@@ -0,0 +1,1300 @@
+ #!/usr/bin/env python3
+ # -*- coding: utf-8 -*-
+
+ """
+ PDF Mistral OCR export tool
+
+ This program automatically converts PDF documents to Markdown. The pipeline:
+
+ 1. Recognize the PDF text and images with the Mistral OCR model
+ 2. Assemble the recognition results into a Markdown file that includes the images
+ 3. Translate the English content into Taiwanese Traditional Chinese with a Gemini model
+ 4. Export the Markdown files (original + translation) and the corresponding images
+
+ New features:
+ - Checkpoints during processing, so intermediate results can be saved
+ - A Gradio interface for adjusting parameters and choosing output formats
+ """
+
+ # Standard libraries
+ import os
+ import json
+ import base64
+ import time
+ import tempfile
+ from pathlib import Path
+ import pickle
+ import certifi
+ import shutil  # Used for zipping images
+ os.environ["SSL_CERT_FILE"] = certifi.where()
+
+ # Third-party libraries
+ from IPython.display import Markdown, display
+ from pydantic import BaseModel
+ from dotenv import load_dotenv
+ import gradio as gr
+
+ # Mistral AI
+ from mistralai import Mistral
+ from mistralai.models import OCRResponse, ImageURLChunk, DocumentURLChunk, TextChunk
+
+ # Google Gemini
+ from google import genai
+ from google.genai import types
+
+ # OpenAI
+ # Import the library (add 'openai' to requirements.txt)
+ try:
+     from openai import OpenAI
+ except ImportError:
+     print("⚠️ OpenAI library not found. Please install it: pip install openai")
+     OpenAI = None  # Set to None if import fails
+
+ # ===== Pydantic Models =====
+
+ class StructuredOCR(BaseModel):
+     file_name: str
+     topics: list[str]
+     languages: str
+     ocr_contents: dict
+
+ # ===== Utility Functions =====
+
+ def retry_with_backoff(func, retries=5, base_delay=1.5):
+     """Retry a function with exponential backoff when the error mentions HTTP 429."""
+     for attempt in range(retries):
+         try:
+             return func()
+         except Exception as e:
+             if "429" in str(e):
+                 wait_time = base_delay * (2 ** attempt)
+                 print(f"⚠️ API rate limit hit. Retrying in {wait_time:.1f}s...")
+                 time.sleep(wait_time)
+             else:
+                 raise  # Re-raise other errors with the original traceback
+     raise RuntimeError("❌ Failed after multiple retries.")
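
As a quick illustration of how the backoff helper behaves, the sketch below repeats the helper so it runs on its own and feeds it a stand-in function. `flaky_request` is hypothetical and only simulates two rate-limit errors before succeeding; the tiny `base_delay` is chosen just to keep the demo fast.

```python
import time


def retry_with_backoff(func, retries=5, base_delay=1.5):
    """Copy of the helper above so this sketch is self-contained."""
    for attempt in range(retries):
        try:
            return func()
        except Exception as e:
            if "429" in str(e):
                # Wait base_delay, 2*base_delay, 4*base_delay, ... between attempts
                time.sleep(base_delay * (2 ** attempt))
            else:
                raise
    raise RuntimeError("Failed after multiple retries.")


calls = {"count": 0}


def flaky_request():
    """Hypothetical API call: rate-limited twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("HTTP 429: too many requests")
    return "ok"


result = retry_with_backoff(flaky_request, retries=4, base_delay=0.01)
print(result, calls["count"])
```

Because the check is a substring match on the error text, any exception whose message happens to contain "429" is treated as a rate limit; matching on a status code attribute would be stricter where the client library exposes one.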
+
+ def replace_images_in_markdown(markdown_str: str, images_dict: dict) -> str:
+     """Replace image placeholders in markdown with base64-encoded images."""
+     for img_name, base64_str in images_dict.items():
+         markdown_str = markdown_str.replace(
+             f"![{img_name}]({img_name})", f"![{img_name}]({base64_str})"
+         )
+     return markdown_str
+
+ def get_combined_markdown(ocr_response: OCRResponse) -> str:
+     """Combine OCR text and images into a single markdown document."""
+     markdowns: list[str] = []
+     for page in ocr_response.pages:
+         image_data = {img.id: img.image_base64 for img in page.images}
+         markdowns.append(replace_images_in_markdown(page.markdown, image_data))
+     return "\n\n".join(markdowns)
+
+ def insert_ocr_below_images(markdown_str, ocr_img_map, page_idx):
+     """Insert OCR results below images in markdown."""
+     for img_id, ocr_text in ocr_img_map.get(page_idx, {}).items():
+         markdown_str = markdown_str.replace(
+             f"![{img_id}]({img_id})",
+             f"![{img_id}]({img_id})\n\n> 📄 Image OCR Result:\n\n```json\n{ocr_text}\n```"
+         )
+     return markdown_str
+
+ def save_images_and_replace_links(markdown_str, images_dict, page_idx, image_folder="images"):
+     """Save base64 images to files and update markdown links."""
+     os.makedirs(image_folder, exist_ok=True)
+     image_id_to_path = {}
+
+     for i, (img_id, base64_str) in enumerate(images_dict.items()):
+         img_bytes = base64.b64decode(base64_str.split(",")[-1])
+         # Relative path for the Markdown link: folder name and file name only
+         img_path = f"{os.path.basename(image_folder)}/page_{page_idx+1}_img_{i+1}.png"
+
+         # Full path where the file is actually written
+         full_img_path = os.path.join(image_folder, f"page_{page_idx+1}_img_{i+1}.png")
+         with open(full_img_path, "wb") as f:
+             f.write(img_bytes)
+         image_id_to_path[img_id] = img_path
+
+     for img_id, img_path in image_id_to_path.items():
+         markdown_str = markdown_str.replace(
+             f"![{img_id}]({img_id})", f"![{img_id}]({img_path})"
+         )
+
+     return markdown_str
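
The link-rewriting helpers above all rely on the same idea: the OCR markdown references each image as `![id](id)`, and a plain string replacement swaps the target for something usable. A minimal sketch of that step follows, with the helper copied so it runs alone; the image id `img-0.jpeg` and the target path are made-up sample values.

```python
def replace_images_in_markdown(markdown_str: str, images_dict: dict) -> str:
    """Copy of the helper above: swap each ![id](id) placeholder for a new target."""
    for img_name, target in images_dict.items():
        markdown_str = markdown_str.replace(
            f"![{img_name}]({img_name})", f"![{img_name}]({target})"
        )
    return markdown_str


# Sample page markdown with one placeholder (hypothetical id)
page_markdown = "Figure 1 shows the setup.\n\n![img-0.jpeg](img-0.jpeg)"
images = {"img-0.jpeg": "images/page_1_img_1.png"}

rewritten = replace_images_in_markdown(page_markdown, images)
print(rewritten)
```

The same call works whether the target is a base64 data URL or a relative file path, which is why the file-saving variant can reuse the pattern.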
+
+ # ===== Translation Functions =====
+
+ # Default translation system prompt. It is written in Traditional Chinese and tells the
+ # model to translate English prose, leave Markdown markup and code untouched, keep common
+ # technical terms in English, and return only the translated Markdown.
+ DEFAULT_TRANSLATION_SYSTEM_INSTRUCTION = """
+ 你是一位專業的技術文件翻譯者。請將我提供的英文 Markdown 內容翻譯成**台灣繁體中文**。
+
+ **核心要求:**
+ 1. **翻譯所有英文文字:** 你的主要工作是翻譯內容中的英文敘述性文字(段落、列表、表格等)。
+ 2. **保持結構與程式碼不變:**
+ * **不要**更改任何 Markdown 標記(如 `#`, `*`, `-`, `[]()`, `![]()`, ``` ```, ` `` `, `---`)。
+ * **不要**翻譯或修改程式碼區塊 (``` ... ```) 和行內程式碼 (`code`) 裡的任何內容。
+ * 若有 JSON,**不要**更改鍵(key),僅翻譯字串值(value)。
+ 3. **處理專有名詞:** 對於普遍接受的英文技術術語、縮寫或專有名詞(例如 API, SDK, CPU, Google, Python 等),傾向於**保留英文原文**。但請確保翻譯了其他所有非術語的常規英文文字。
+ 4. **直接輸出結果:** 請直接回傳翻譯後的完整 Markdown 文件,不要添加任何額外說明。
+ """
+
+ # Updated signature to accept openai_client
+ def translate_markdown_pages(pages, gemini_client, openai_client, model="gemini-2.0-flash", system_instruction=None):
+     """Translate markdown pages using the selected API (Gemini or OpenAI). Yields progress strings and translated page content."""
+     if system_instruction is None:
+         system_instruction = DEFAULT_TRANSLATION_SYSTEM_INSTRUCTION
+
+     # Pages are yielded one at a time instead of being collected in a list
+     total_pages = len(pages)  # Total pages, used for progress messages
+
+     for idx, page in enumerate(pages):
+         progress_message = f"🔁 正在翻譯第 {idx+1} / {total_pages} 頁..."  # "Translating page X / Y..."
+         print(progress_message)  # Print to console
+         yield progress_message  # Yield progress string for Gradio log
+
+         try:
+             if model.startswith("gpt-"):
+                 # --- OpenAI Translation Logic ---
+                 if not openai_client:
+                     error_msg = f"⚠️ OpenAI client not initialized for translation model {model}. Skipping page {idx+1}."
+                     print(error_msg)
+                     yield error_msg
+                     yield f"--- ERROR: OpenAI Client Error for Page {idx+1} ---\n\n{page}"
+                     continue  # Skip to next page
+
+                 print(f" - Translating using OpenAI model: {model}")
+                 try:
+                     # Construct messages for OpenAI translation
+                     # Use the provided system_instruction as the system message
+                     messages = [
+                         {"role": "system", "content": system_instruction},
+                         {"role": "user", "content": page}
+                     ]
+
+                     response = openai_client.chat.completions.create(
+                         model=model,
+                         messages=messages,
+                         temperature=0.1  # Lower temperature for more deterministic translation
+                     )
+                     translated_md = response.choices[0].message.content.strip()
+                 except Exception as openai_e:
+                     error_msg = f"⚠️ OpenAI 翻譯第 {idx+1} / {total_pages} 頁失敗:{openai_e}"  # "OpenAI translation of page X / Y failed"
+                     print(error_msg)
+                     yield error_msg  # Yield error string to Gradio log
+                     yield f"--- ERROR: OpenAI Translation Failed for Page {idx+1} ---\n\n{page}"
+                     continue  # Skip to next page
+
+             elif model.startswith("gemini"):
+                 # --- Gemini Translation Logic ---
+                 print(f" - Translating using Gemini model: {model}")
+                 response = gemini_client.models.generate_content(
+                     model=model,
+                     config=types.GenerateContentConfig(
+                         system_instruction=system_instruction
+                     ),
+                     contents=page
+                 )
+                 translated_md = response.text.strip()
+
+             else:
+                 # --- Unsupported Model ---
+                 error_msg = f"⚠️ Unsupported translation model: {model}. Skipping page {idx+1}."
+                 print(error_msg)
+                 yield error_msg
+                 yield f"--- ERROR: Unsupported Translation Model for Page {idx+1} ---\n\n{page}"
+                 continue  # Skip to next page
+
+             # --- Yield successful translation ---
+             yield translated_md  # Yield the actual translated page content
+
+         except Exception as e:
+             error_msg = f"⚠️ 翻譯第 {idx+1} / {total_pages} 頁失敗:{e}"  # "Translation of page X / Y failed"
+             print(error_msg)
+             yield error_msg  # Yield error string to Gradio log
+             # Yield error marker instead of translated content
+             yield f"--- ERROR: Translation Failed for Page {idx+1} ---\n\n{page}"
+
+     final_message = f"✅ 翻譯完成 {total_pages} 頁。"  # "Finished translating N pages."
+     yield final_message  # Yield final translation status string
+     print(final_message)  # Print final translation status
+     # No return needed for a generator yielding results
+
+ # ===== PDF Processing Functions =====
+
+ def process_pdf_with_mistral_ocr(pdf_path, client, model="mistral-ocr-latest"):
+     """Process PDF with Mistral OCR."""
+     pdf_file = Path(pdf_path)
+
+     # Upload to mistral
+     uploaded_file = client.files.upload(
+         file={
+             "file_name": pdf_file.stem,
+             "content": pdf_file.read_bytes(),
+         },
+         purpose="ocr"
+     )
+
+     signed_url = client.files.get_signed_url(file_id=uploaded_file.id, expiry=1)
+
+     # OCR analyze PDF
+     pdf_response = client.ocr.process(
+         document=DocumentURLChunk(document_url=signed_url.url),
+         model=model,
+         include_image_base64=True
+     )
+
+     return pdf_response
+
+ # Updated function signature to include structure_text_only
+ def process_images_with_ocr(pdf_response, mistral_client, gemini_client, openai_client, structure_model="pixtral-12b-latest", structure_text_only=False):
+     """Process images from PDF pages with OCR and structure using the specified model."""
+     image_ocr_results = {}
+
+     for page_idx, page in enumerate(pdf_response.pages):
+         for i, img in enumerate(page.images):
+             base64_data_url = img.image_base64
+
+             # Extract raw base64 data for Gemini
+             try:
+                 # Handle potential variations in data URL prefix
+                 if ',' in base64_data_url:
+                     base64_content = base64_data_url.split(',', 1)[1]
+                 else:
+                     # Assume it's just the base64 content if no comma prefix
+                     base64_content = base64_data_url
+                 # Decode to raw bytes for Gemini; this also validates the base64 payload
+                 image_bytes = base64.b64decode(base64_content)
+             except Exception as e:
+                 print(f"⚠️ Error decoding base64 for page {page_idx+1}, image {i+1}: {e}. Skipping image.")
+                 continue  # Skip this image if base64 is invalid
+
+             def run_ocr_and_parse():
+                 # Step 1: Basic OCR (always use Mistral OCR for initial text extraction)
+                 print(f" - Performing basic OCR on page {page_idx+1}, image {i+1}...")
+                 image_response = mistral_client.ocr.process(
+                     document=ImageURLChunk(image_url=base64_data_url),
+                     model="mistral-ocr-latest"  # Use the dedicated OCR model here
+                 )
+                 image_ocr_markdown = image_response.pages[0].markdown
+                 print(" - Basic OCR text extracted.")
+
+                 # Step 2: Structure the OCR markdown using the selected model
+                 print(f" - Structuring OCR using: {structure_model}")
+                 if structure_model == "pixtral-12b-latest":
+                     print(" - Using Mistral Pixtral...")
+                     print(" - Sending request to Pixtral API...")
+                     structured = mistral_client.chat.parse(
+                         model=structure_model,  # Use the selected structure_model
+                         messages=[
+                             {
+                                 "role": "user",
+                                 "content": [
+                                     ImageURLChunk(image_url=base64_data_url),
+                                     TextChunk(text=(
+                                         f"This is the image's OCR in markdown:\n{image_ocr_markdown}\n. "
+                                         "Convert this into a structured JSON response with the OCR contents in a sensible dictionary."
+                                     ))
+                                 ]
+                             }
+                         ],
+                         response_format=StructuredOCR,  # Use Pydantic model for expected structure
+                         temperature=0
+                     )
+                     structured_data = structured.choices[0].message.parsed
+                     pretty_text = json.dumps(structured_data.ocr_contents, indent=2, ensure_ascii=False)
+
+                 elif structure_model.startswith("gemini"):  # Handles gemini-2.0-flash etc.
+                     print(f" - Using Google Gemini ({structure_model})...")
+                     # Define the base prompt text
+                     base_prompt_text = f"""
+ You are an expert OCR structuring assistant. Your goal is to extract and structure the relevant content into a JSON object based on the provided information.
+
+ **Initial OCR Markdown:**
+ ```markdown
+ {image_ocr_markdown}
+ ```
+
+ **Task:**
+ Generate a JSON object containing the structured OCR content found in the image. Focus on extracting meaningful information and organizing it logically within the JSON. The JSON should represent the `ocr_contents` field.
+
+ **Output Format:**
+ Return ONLY the JSON object, without any surrounding text or markdown formatting. Example:
+ ```json
+ {{
+   "title": "Example Title",
+   "sections": [
+     {{"header": "Section 1", "content": "Details..."}},
+     {{"header": "Section 2", "content": "More details..."}}
+   ],
+   "key_value_pairs": {{
+     "key1": "value1",
+     "key2": "value2"
+   }}
+ }}
+ ```
+ (Adapt the structure based on the image content.)
+ """
+                     # Prepare API call based on structure_text_only flag
+                     gemini_contents = []
+                     if structure_text_only:
+                         print(" - Mode: Text-only structuring")
+                         # Modify prompt slightly for text-only.
+                         # NOTE: neither target phrase occurs in base_prompt_text above,
+                         # so these replace() calls currently leave the prompt unchanged.
+                         gemini_prompt = base_prompt_text.replace(
+                             "Analyze the provided image and the initial OCR text",
+                             "Analyze the initial OCR text"
+                         ).replace(
+                             "content from the image",
+                             "content from the text"
+                         )
+                         gemini_contents.append(gemini_prompt)
+                     else:
+                         print(" - Mode: Image + Text structuring")
+                         gemini_prompt = base_prompt_text  # Use original prompt
+                         # Prepare image part for Gemini using types.Part.from_bytes
+                         # Assuming PNG, might need dynamic type detection in the future
+                         # Pass the decoded image_bytes, not the base64_content string
+                         try:
+                             image_part = types.Part.from_bytes(
+                                 mime_type="image/png",
+                                 data=image_bytes
+                             )
+                             gemini_contents = [gemini_prompt, image_part]  # Text prompt first, then image Part
+                         except Exception as e:
+                             print(f" - ⚠️ Error creating Gemini image Part: {e}. Skipping image structuring.")
+                             # Fall back to an error payload for this image
+                             pretty_text = json.dumps({"error": "Failed to create Gemini image Part", "details": str(e)}, indent=2, ensure_ascii=False)
+                             return pretty_text  # Exit run_ocr_and_parse for this image
+
+                     # Call the Gemini API via gemini_client.models.generate_content
+                     print(f" - Sending request to Gemini API ({structure_model})...")
+
+                     try:
+                         response = gemini_client.models.generate_content(
+                             model=structure_model,
+                             contents=gemini_contents  # Pass the constructed list
+                         )
+                     except Exception as api_e:
+                         print(f" - ⚠️ Error calling Gemini API: {api_e}")
+                         # Fall back to an error payload for this image
+                         pretty_text = json.dumps({"error": "Failed to call Gemini API", "details": str(api_e)}, indent=2, ensure_ascii=False)
+                         return pretty_text  # Exit run_ocr_and_parse for this image
+
+                     # Extract and clean the JSON response
+                     raw_json_text = response.text.strip()
+                     # Remove potential markdown code fences
+                     if raw_json_text.startswith("```json"):
+                         raw_json_text = raw_json_text[7:]
+                     if raw_json_text.endswith("```"):
+                         raw_json_text = raw_json_text[:-3]
+                     raw_json_text = raw_json_text.strip()
+
+                     # Validate and format the JSON
+                     try:
+                         parsed_json = json.loads(raw_json_text)
+                         pretty_text = json.dumps(parsed_json, indent=2, ensure_ascii=False)
+                     except json.JSONDecodeError as json_e:
+                         print(f" - ⚠️ Gemini response was not valid JSON: {json_e}")
+                         print(f" - Raw response: {raw_json_text}")
+                         # Fallback: return the raw text wrapped in a basic JSON structure
+                         pretty_text = json.dumps({"error": "Failed to parse Gemini JSON response", "raw_output": raw_json_text}, indent=2, ensure_ascii=False)
+
+                 elif structure_model == "gpt-4o-mini":
+                     print(" - Using OpenAI GPT-4o mini...")
+                     if not openai_client:
+                         print(" - ⚠️ OpenAI client not initialized. Skipping.")
+                         return json.dumps({"error": "OpenAI client not initialized. Check API key and library installation."}, indent=2, ensure_ascii=False)
+
+                     # Define the base prompt text for OpenAI
+                     openai_base_prompt = f"""
+ You are an expert OCR structuring assistant. Your goal is to extract and structure the relevant content into a JSON object based on the provided information.
+
+ **Initial OCR Markdown:**
+ ```markdown
+ {image_ocr_markdown}
+ ```
+
+ **Task:**
+ Generate a JSON object containing the structured OCR content found in the image. Focus on extracting meaningful information and organizing it logically within the JSON. The JSON should represent the `ocr_contents` field.
+
+ **Output Format:**
+ Return ONLY the JSON object, without any surrounding text or markdown formatting. Example:
+ ```json
+ {{
+   "title": "Example Title",
+   "sections": [
+     {{"header": "Section 1", "content": "Details..."}},
+     {{"header": "Section 2", "content": "More details..."}}
+   ],
+   "key_value_pairs": {{
+     "key1": "value1",
+     "key2": "value2"
+   }}
+ }}
+ ```
+ (Adapt the structure based on the image content. Ensure the output is valid JSON.)
+ """
+                     # Prepare payload for OpenAI vision based on structure_text_only
+                     openai_content_list = []
+                     if structure_text_only:
+                         print(" - Mode: Text-only structuring")
+                         # Modify prompt slightly for text-only.
+                         # NOTE: neither target phrase occurs in openai_base_prompt above,
+                         # so these replace() calls currently leave the prompt unchanged.
+                         openai_prompt = openai_base_prompt.replace(
+                             "Analyze the provided image and the initial OCR text",
+                             "Analyze the initial OCR text"
+                         ).replace(
+                             "content from the image",
+                             "content from the text"
+                         )
+                         openai_content_list.append({"type": "text", "text": openai_prompt})
+                     else:
+                         print(" - Mode: Image + Text structuring")
+                         openai_prompt = openai_base_prompt  # Use original prompt
+                         # Use the base64_content string directly for the data URL
+                         # Assuming PNG, might need dynamic type detection
+                         image_data_url = f"data:image/png;base64,{base64_content}"
+                         openai_content_list.append({"type": "text", "text": openai_prompt})
+                         openai_content_list.append({
+                             "type": "image_url",
+                             "image_url": {"url": image_data_url, "detail": "auto"},
+                         })
+
+                     print(f" - Sending request to OpenAI API ({structure_model})...")
+                     try:
+                         response = openai_client.chat.completions.create(
+                             model=structure_model,
+                             messages=[
+                                 {
+                                     "role": "user",
+                                     "content": openai_content_list,  # Pass the constructed list
+                                 }
+                             ],
+                             # Optionally add max_tokens if needed, but rely on prompt for JSON structure
+                             # max_tokens=1000,
+                             temperature=0.1  # Lower temperature for deterministic JSON
+                         )
+
+                         raw_json_text = response.choices[0].message.content.strip()
+                         # Clean potential markdown fences
+                         if raw_json_text.startswith("```json"):
+                             raw_json_text = raw_json_text[7:]
+                         if raw_json_text.endswith("```"):
+                             raw_json_text = raw_json_text[:-3]
+                         raw_json_text = raw_json_text.strip()
+
+                         # Validate and format JSON
+                         try:
+                             parsed_json = json.loads(raw_json_text)
+                             pretty_text = json.dumps(parsed_json, indent=2, ensure_ascii=False)
+                         except json.JSONDecodeError as json_e:
+                             print(f" - ⚠️ OpenAI response was not valid JSON: {json_e}")
+                             print(f" - Raw response: {raw_json_text}")
+                             pretty_text = json.dumps({"error": "Failed to parse OpenAI JSON response", "raw_output": raw_json_text}, indent=2, ensure_ascii=False)
+
+                     except Exception as api_e:
+                         print(f" - ⚠️ Error calling OpenAI API: {api_e}")
+                         pretty_text = json.dumps({"error": "Failed to call OpenAI API", "details": str(api_e)}, indent=2, ensure_ascii=False)
+
+                 else:
+                     print(f" - ⚠️ Unsupported structure model: {structure_model}. Skipping structuring.")
+                     # Fallback: return the basic OCR markdown wrapped in JSON
+                     pretty_text = json.dumps({"unstructured_ocr": image_ocr_markdown}, indent=2, ensure_ascii=False)
+
+                 return pretty_text
+
+             try:
+                 # run_ocr_and_parse reads structure_model from the enclosing scope
+                 result = retry_with_backoff(run_ocr_and_parse, retries=4)
+                 image_ocr_results[(page_idx, img.id)] = result
+             except Exception as e:
+                 print(f"❌ Failed at page {page_idx+1}, image {i+1}: {e}")
+
+     # Reorganize results by page
+     ocr_by_page = {}
+     for (page_idx, img_id), ocr_text in image_ocr_results.items():
+         ocr_by_page.setdefault(page_idx, {})[img_id] = ocr_text
+         # Report the image id here: the loop index i from above is stale at this point
+         print(f" - Successfully processed page {page_idx+1}, image {img_id} with {structure_model}.")
+
+     return ocr_by_page
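
The Gemini and OpenAI branches above repeat the same cleanup: strip an optional markdown fence from the model reply, then parse it as JSON and fall back to an error payload. One way to factor that out is sketched below. The helper name `parse_model_json` is hypothetical and not part of the original file; `FENCE` is built programmatically only to avoid writing a literal fence inside this example.

```python
import json

FENCE = "`" * 3  # three backticks, i.e. a markdown code fence


def parse_model_json(raw_text: str):
    """Strip an optional markdown fence and parse JSON, with an error fallback."""
    text = raw_text.strip()
    if text.startswith(FENCE + "json"):
        text = text[len(FENCE + "json"):]
    elif text.startswith(FENCE):
        text = text[len(FENCE):]
    if text.endswith(FENCE):
        text = text[:-len(FENCE)]
    text = text.strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Mirror the fallback used above: keep the raw text for inspection
        return {"error": "Failed to parse JSON response", "raw_output": text}


fenced_reply = FENCE + 'json\n{"title": "Example Title"}\n' + FENCE
print(parse_model_json(fenced_reply))
print(parse_model_json("not json at all"))
```

Unlike the inline version, this sketch also handles a bare fence without the `json` tag, which is an added assumption about what models may return.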
+
+ # ===== Checkpoint Functions =====
+
+ def save_checkpoint(data, filename, console_output=None):
+     """Save data to a checkpoint file and return the status message."""
+     with open(filename, 'wb') as f:
+         pickle.dump(data, f)
+     message = f"✅ 已儲存檢查點:{filename}"  # "Checkpoint saved: <filename>"
+     print(message)
+     return message
+
+ def load_checkpoint(filename, console_output=None):
+     """Load data from a checkpoint file. Returns (data, message), or (None, None) if the file is missing."""
+     if os.path.exists(filename):
+         with open(filename, 'rb') as f:
+             data = pickle.load(f)
+         message = f"✅ 已載入檢查點:{filename}"  # "Checkpoint loaded: <filename>"
+         print(message)
+         return data, message
+     return None, None
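
A short round trip shows how these checkpoint helpers are meant to be used: a missing file yields `None`, so the caller recomputes and saves, and a later run loads the saved data instead. The sketch below uses simplified copies of the two functions (without the status messages) and a made-up sample payload and file name.

```python
import os
import pickle
import tempfile


def save_checkpoint(data, filename):
    """Simplified copy of the helper above: pickle data to a file."""
    with open(filename, "wb") as f:
        pickle.dump(data, f)


def load_checkpoint(filename):
    """Simplified copy of the helper above: return the data, or None if absent."""
    if os.path.exists(filename):
        with open(filename, "rb") as f:
            return pickle.load(f)
    return None


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_raw_page_data.pkl")
    # Same shape as the raw page data: (markdown text, {image id: base64 string})
    raw_page_data = [("# Page 1\n\n![img-0](img-0)", {"img-0": "data:image/png;base64,AAAA"})]

    missing = load_checkpoint(path)      # No checkpoint yet
    save_checkpoint(raw_page_data, path)
    restored = load_checkpoint(path)     # Second run reuses the saved result

print(missing, restored == raw_page_data)
```

Since `pickle.load` can execute arbitrary code from a crafted file, checkpoints should only be loaded from directories the application itself wrote.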
+
+ # ===== Main Processing Function =====
+
+ # Updated function signature to include structure_text_only
+ def process_pdf_to_markdown(
+     pdf_path,
+     mistral_client,
+     gemini_client,
+     openai_client,
+     ocr_model="mistral-ocr-latest",
+     structure_model="pixtral-12b-latest",
+     structure_text_only=False,
+     translation_model="gemini-2.0-flash",
+     translation_system_prompt=None,
+     process_images=True,
+     output_formats_selected=None,  # Output formats chosen in the UI
+     output_dir=None,
+     checkpoint_dir=None,
+     use_existing_checkpoints=True
+ ):
+     """Main function to process PDF to markdown with translation. Yields log messages."""
+     if output_formats_selected is None:
+         # Default: Chinese translation + English original (values must match the UI choices)
+         output_formats_selected = ["中文翻譯", "英文原文"]
+
+     pdf_file = Path(pdf_path)
+     filename_stem = pdf_file.stem
+     # Sanitize the filename stem here as well
+     sanitized_stem = filename_stem.replace(" ", "_")
+     print(f"--- 開始處理檔案: {pdf_file.name} (Sanitized Stem: {sanitized_stem}) ---")  # "Start processing file"
+
+     # Output and checkpoint directories are expected to be created by the caller (Gradio function)
+     # os.makedirs(output_dir, exist_ok=True)
+     # os.makedirs(checkpoint_dir, exist_ok=True)
+
+     # Checkpoint files, named after sanitized_stem
+     pdf_ocr_checkpoint = os.path.join(checkpoint_dir, f"{sanitized_stem}_pdf_ocr.pkl")
+     image_ocr_checkpoint = os.path.join(checkpoint_dir, f"{sanitized_stem}_image_ocr.pkl")
+     # Checkpoint for raw page data (list of tuples: (raw_markdown_text, images_dict))
+     raw_page_data_checkpoint = os.path.join(checkpoint_dir, f"{sanitized_stem}_raw_page_data.pkl")
+
+     # Step 1: Process PDF with OCR (with checkpoint)
+     pdf_response = None
+     load_msg = None
+     if use_existing_checkpoints:
+         pdf_response, load_msg = load_checkpoint(pdf_ocr_checkpoint)
+         if load_msg: yield load_msg
+
+     if pdf_response is None:
+         msg = "🔍 正在處理 PDF OCR..."  # "Running PDF OCR..."
+         yield msg
+         print(msg)
+         pdf_response = process_pdf_with_mistral_ocr(pdf_path, mistral_client, model=ocr_model)
+         save_msg = save_checkpoint(pdf_response, pdf_ocr_checkpoint)  # save_checkpoint already prints
+         if save_msg: yield save_msg
+     else:
+         print("ℹ️ 使用現有 PDF OCR 檢查點。")  # "Using existing PDF OCR checkpoint."
+
+     # Step 2: Process images with OCR (with checkpoint)
+     ocr_by_page = {}
+     if process_images:
+         load_msg = None
+         if use_existing_checkpoints:
+             ocr_by_page, load_msg = load_checkpoint(image_ocr_checkpoint)
+             if load_msg: yield load_msg
+
+         if not ocr_by_page:  # No checkpoint found, or the checkpoint held an empty dict
+             msg = f"🖼️ 正在使用 '{structure_model}' 處理圖片 OCR 與結構化..."  # "Running image OCR and structuring with <model>..."
+             yield msg
+             print(msg)
+             ocr_by_page = process_images_with_ocr(
+                 pdf_response,
+                 mistral_client,
+                 gemini_client,
+                 openai_client,
+                 structure_model=structure_model,
+                 structure_text_only=structure_text_only  # Pass the text-only flag
+             )
+             save_msg = save_checkpoint(ocr_by_page, image_ocr_checkpoint)  # save_checkpoint already prints
+             if save_msg: yield save_msg
+         else:
+             print("ℹ️ 使用現有圖片 OCR 檢查點。")  # "Using existing image OCR checkpoint."
+     else:
+         print("ℹ️ 跳過圖片 OCR 處理。")  # "Skipping image OCR." (process_images was False)
+
+     # Step 3: Create or load RAW page data (markdown text + image dicts)
+     raw_page_data = None  # List of tuples: (raw_markdown_text, images_dict)
+     load_msg = None
+     if use_existing_checkpoints:
+         # Try loading the raw page data checkpoint
+         raw_page_data, load_msg = load_checkpoint(raw_page_data_checkpoint)
+         if load_msg: yield load_msg
+
+     if raw_page_data is None:
+         msg = "📝 正在建立原始頁面資料 (Markdown + 圖片資訊)..."  # "Building raw page data (Markdown + image info)..."
+         yield msg
+         print(msg)
+         raw_page_data = []
+         for page_idx, page in enumerate(pdf_response.pages):
+             images_dict = {img.id: img.image_base64 for img in page.images}
+             raw_md_text = page.markdown  # Just the raw text with ![id](id)
+             raw_page_data.append((raw_md_text, images_dict))  # Store as tuple
+
+         # Save the RAW page data checkpoint
+         save_msg = save_checkpoint(raw_page_data, raw_page_data_checkpoint)
+         if save_msg: yield save_msg
+     else:
+         print("ℹ️ 使用現有原始頁面資料檢查點。")  # "Using existing raw page data checkpoint."
+
+     # Step 3.5: Conditionally insert image OCR results based on CURRENT UI selection
+     pages_after_ocr_insertion = []  # List to hold markdown strings after potential OCR insertion
+     if process_images and ocr_by_page:  # Check if UI wants OCR AND if OCR results exist
+         msg = "✍️ 根據目前設定,正在將圖片 OCR 結果插入 Markdown..."  # "Inserting image OCR results into the Markdown per current settings..."
656
+ yield msg
657
+ print(msg)
658
+ for page_idx, (raw_md, _) in enumerate(raw_page_data): # Iterate through raw data
659
+ # Insert OCR results into the raw markdown text BEFORE replacing links
660
+ md_with_ocr = insert_ocr_below_images(raw_md, ocr_by_page, page_idx)
661
+ pages_after_ocr_insertion.append(md_with_ocr)
662
+ else:
663
+ # If not inserting OCR, just use the raw markdown text
664
+ if process_images and not ocr_by_page:
665
+ msg = "ℹ️ 已勾選處理圖片 OCR,但無圖片 OCR 結果可插入 (可能需要重新執行圖片 OCR)。"
666
+ yield msg
667
+ print(msg)
668
+ elif not process_images:
669
+ msg = "ℹ️ 未勾選處理圖片 OCR,跳過插入步驟。"
670
+ yield msg
671
+ print(msg)
672
+ # Use the raw markdown text directly
673
+ pages_after_ocr_insertion = [raw_md for raw_md, _ in raw_page_data]
674
+
675
+ # Step 3.6: Save images and replace links in the (potentially modified) markdown
676
+ final_markdown_pages = [] # This list will have final file paths as links
677
+ # Use sanitized_stem for image folder name
678
+ image_folder_name = os.path.join(output_dir, f"images_{sanitized_stem}")
679
+ msg = f"🖼️ 正在儲存圖片並更新 Markdown 連結至 '{os.path.basename(image_folder_name)}'..."
680
+ yield msg
681
+ print(msg)
682
+ # Iterate using the pages_after_ocr_insertion list and the original image dicts from raw_page_data
683
+ for page_idx, (md_to_link, (_, images_dict)) in enumerate(zip(pages_after_ocr_insertion, raw_page_data)):
684
+ # Now save images and replace links on the processed markdown (which might have OCR inserted)
685
+ final_md = save_images_and_replace_links(md_to_link, images_dict, page_idx, image_folder=image_folder_name)
686
+ final_markdown_pages.append(final_md)
687
+
688
+ # Step 4: Translate the final markdown pages
689
+ translated_markdown_pages = None # Initialize
690
+ need_translation = "中文翻譯" in output_formats_selected
691
+ if need_translation:
692
+ # Translate the final list with correct image links, passing both clients
693
+ translation_generator = translate_markdown_pages(
694
+ final_markdown_pages,
695
+ gemini_client,
696
+ openai_client, # Pass openai_client
697
+ model=translation_model,
698
+ system_instruction=translation_system_prompt
699
+ )
700
+ # Collect yielded pages from the translation generator
701
+ translated_markdown_pages = [] # Initialize list to store results
702
+ for item in translation_generator:
703
+ # Check if it's a progress string or actual content/error
704
+ # Simple check: assume non-empty strings starting with specific emojis are progress/status
705
+ if isinstance(item, str) and (item.startswith("🔁") or item.startswith("⚠️") or item.startswith("✅")):
706
+ yield item # Forward progress/status string
707
+ else:
708
+ # Assume it's translated content or an error marker page
709
+ translated_markdown_pages.append(item)
710
+ else:
711
+ yield "ℹ️ 跳過翻譯步驟 (未勾選中文翻譯)。"
712
+ print("ℹ️ 跳過翻譯步驟 (未勾選中文翻譯)。")
713
+ translated_markdown_pages = None # Ensure it's None if skipped
714
+
715
+ # Step 5: Combine pages into complete markdown strings
716
+ # The "original" output now correctly reflects the final state before translation
717
+ final_markdown_original = "\n\n---\n\n".join(final_markdown_pages) # Use the final pages with links
718
+ final_markdown_translated = "\n\n---\n\n".join(translated_markdown_pages) if translated_markdown_pages else None
719
+
720
+ # Step 6: Save files based on selection - Use sanitized_stem
721
+ saved_files = {}
722
+ if "英文原文" in output_formats_selected:
723
+ original_md_name = os.path.join(output_dir, f"{sanitized_stem}_original.md")
724
+ try:
725
+ with open(original_md_name, "w", encoding="utf-8") as f:
726
+ f.write(final_markdown_original)
727
+ msg = f"✅ 已儲存原文版:{original_md_name}"
728
+ yield msg
729
+ print(msg) # Console print
730
+ saved_files["original_file"] = original_md_name
731
+ except Exception as e:
732
+ msg = f"❌ 儲存原文版失敗: {e}"
733
+ yield msg
734
+ print(msg)
735
+
736
+ if "中文翻譯" in output_formats_selected and final_markdown_translated:
737
+ translated_md_name = os.path.join(output_dir, f"{sanitized_stem}_translated.md")
738
+ try:
739
+ with open(translated_md_name, "w", encoding="utf-8") as f:
740
+ f.write(final_markdown_translated)
741
+ msg = f"✅ 已儲存翻譯版:{translated_md_name}"
742
+ yield msg
743
+ print(msg) # Console print
744
+ saved_files["translated_file"] = translated_md_name
745
+ except Exception as e:
746
+ msg = f"❌ 儲存翻譯版失敗: {e}"
747
+ yield msg
748
+ print(msg)
749
+
750
+ # Always report image folder path if it was created (i.e., if images existed and were saved)
751
+ # The folder creation happens in save_images_and_replace_links
752
+ image_folder_name = os.path.join(output_dir, f"images_{sanitized_stem}")
753
+ if os.path.isdir(image_folder_name): # Check if the folder actually exists
754
+ msg = f"✅ 圖片資料夾:{image_folder_name}"
755
+ yield msg
756
+ print(msg) # Console print
757
+ saved_files["image_folder"] = image_folder_name
758
+ # else: # Optional: Log if folder wasn't created (e.g., PDF had no images)
759
+ # msg = f"ℹ️ PDF 文件不包含圖片,未建立圖片資料夾。"
760
+ # yield msg
761
+ # print(msg)
762
+
763
+
764
+ print(f"--- 完成處理檔案: {pdf_file.name} ---") # Console print
765
+
766
+ # Return the final result dictionary for Gradio UI update
767
+ yield {
768
+ "saved_files": saved_files, # Dictionary of saved file paths
769
+ "translated_content": final_markdown_translated,
770
+ "original_content": final_markdown_original,
771
+ "output_formats_selected": output_formats_selected # Pass back selections
772
+ }
773
+
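Review note: every step of `process_pdf_to_markdown` follows the same "load checkpoint, else compute and save" shape while yielding status strings. The sketch below isolates that shape. The helpers `save_checkpoint_sketch`, `load_checkpoint_sketch`, and `run_step` are hypothetical stand-ins written for illustration; the real `save_checkpoint` and `load_checkpoint` are defined elsewhere in the file and may differ. Pickle is assumed because the checkpoint files use the `.pkl` suffix.

```python
import os
import pickle
import tempfile


def save_checkpoint_sketch(data, path):
    """Pickle `data` to `path` and return a status message (hypothetical helper)."""
    with open(path, "wb") as f:
        pickle.dump(data, f)
    return f"saved checkpoint: {os.path.basename(path)}"


def load_checkpoint_sketch(path):
    """Return (data, message); both are None when no checkpoint exists."""
    if not os.path.exists(path):
        return None, None
    with open(path, "rb") as f:
        return pickle.load(f), f"loaded checkpoint: {os.path.basename(path)}"


def run_step(path, compute, use_existing=True):
    """Generator: yield status strings, then the step result wrapped in a dict."""
    result = None
    if use_existing:
        result, msg = load_checkpoint_sketch(path)
        if msg:
            yield msg
    if result is None:
        # No usable checkpoint: do the expensive work once and persist it.
        result = compute()
        yield save_checkpoint_sketch(result, path)
    yield {"result": result}
```

Because pickle can execute arbitrary code on load, checkpoints of this kind should only be read from directories the app created itself, which is the case for the per-run temporary directory used here.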
774
+ # ===== Gradio Interface =====
775
+
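Review note: both `process_pdf_to_markdown` and the Gradio handler sanitize the filename stem with `replace(" ", "_")`, which leaves characters such as parentheses or `#` in the image folder name and in Markdown links. A stricter variant could look like the sketch below. `sanitize_stem` and the `"document"` fallback are assumptions for illustration, not part of the committed code.

```python
import re


def sanitize_stem(stem):
    """Collapse runs of characters outside [word chars, '.', '-'] into '_'."""
    # \w is Unicode-aware for str patterns, so CJK filenames are preserved.
    cleaned = re.sub(r"[^\w.-]+", "_", stem)
    # Avoid leading/trailing underscores and never return an empty name.
    return cleaned.strip("_") or "document"
```

This changes output filenames for stems that contain punctuation, so existing checkpoints named with the space-only rule would not be found after switching.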
776
+ def create_gradio_interface():
777
+ """Create a Gradio interface for the PDF to Markdown tool."""
778
+
779
+ # Client initialization is now moved inside process_pdf
780
+
781
+ # Define processing function for Gradio
782
+ def process_pdf( # Updated signature to accept API keys and return file paths + log
783
+ pdf_file,
784
+ # API Keys from UI
785
+ mistral_api_key_input,
786
+ gemini_api_key_input,
787
+ openai_api_key_input,
788
+ # Other parameters
789
+ ocr_model,
790
+ structure_model,
791
+ translation_model,
792
+ translation_system_prompt,
793
+ process_images,
794
+ output_format, # CheckboxGroup list
795
+ use_existing_checkpoints,
796
+ structure_text_only
797
+ ): # -> tuple[str | None, str | None, str | None, str]:
798
+ # Accumulate logs for console output
799
+ log_accumulator = ""
800
+ mistral_client = None
801
+ gemini_client = None
802
+ openai_client = None
803
+ print("\n--- Gradio 處理請求開始 ---") # Console print
804
+ # Placeholders for file outputs and log
805
+ output_original_md_path = None
806
+ output_translated_md_path = None
807
+ output_images_zip_path = None
808
+
809
+ # --- Early Exit Checks ---
810
+ if pdf_file is None:
811
+ log_accumulator += "❌ 請先上傳 PDF 檔案\n"
812
+ print("❌ 錯誤:未上傳 PDF 檔案")
813
+ # Return Nones for files/previews and the error log (6 values total)
814
+ yield None, None, None, None, None, log_accumulator # log_accumulator already contains the error message
815
+ return
816
+
817
+ # --- API Key and Client Initialization ---
818
+ log_accumulator += "🔑 正在初始化 API Clients...\n"
819
+ # Yield updates for the log output only (6 values total)
820
+ yield gr.update(), gr.update(), gr.update(), gr.update(), gr.update(), log_accumulator
821
+
822
+ # Mistral (Required)
823
+ if not mistral_api_key_input:
824
+ log_accumulator += "❌ 錯誤:請務必提供 Mistral API Key。\n"
825
+ print("❌ 錯誤:未提供 Mistral API Key")
826
+ # Yield Nones for files/previews and the error log (6 values total)
827
+ yield None, None, None, None, None, log_accumulator
828
+ return
829
+ try:
830
+ mistral_client = Mistral(api_key=mistral_api_key_input)
831
+ log_accumulator += "✅ Mistral Client 初始化成功。\n"
832
+ print("✅ Mistral Client initialized.")
833
+ except Exception as e:
834
+ log_accumulator += f"❌ 初始化 Mistral Client 失敗: {e}\n"
835
+ print(f"❌ Error initializing Mistral Client: {e}")
836
+ # Yield Nones for files/previews and the error log (6 values total)
837
+ yield None, None, None, None, None, log_accumulator
838
+ return
839
+
840
+ # Gemini (Optional, depends on model selection later)
841
+ if gemini_api_key_input:
842
+ try:
843
+ gemini_client = genai.Client(api_key=gemini_api_key_input)
844
+ log_accumulator += "✅ Gemini Client 初始化成功。\n"
845
+ print("✅ Gemini Client initialized.")
846
+ except Exception as e:
847
+ log_accumulator += f"⚠️ 初始化 Gemini Client 失敗 (若未使用 Gemini 模型可忽略): {e}\n"
848
+ print(f"⚠️ Error initializing Gemini Client (ignore if not using Gemini models): {e}")
849
+ gemini_client = None # Ensure it's None if init fails
850
+ else:
851
+ log_accumulator += "ℹ️ 未提供 Gemini API Key,將無法使用 Gemini 模型。\n"
852
+ print("ℹ️ Gemini API Key not provided.")
853
+ gemini_client = None
854
+
855
+ # OpenAI (Optional, depends on model selection later)
856
+ if openai_api_key_input and OpenAI:
857
+ try:
858
+ openai_client = OpenAI(api_key=openai_api_key_input)
859
+ log_accumulator += "✅ OpenAI Client 初始化成功。\n"
860
+ print("✅ OpenAI Client initialized.")
861
+ except Exception as e:
862
+ log_accumulator += f"⚠️ 初始化 OpenAI Client 失敗 (若未使用 OpenAI 模型可忽略): {e}\n"
863
+ print(f"⚠️ Error initializing OpenAI Client (ignore if not using OpenAI models): {e}")
864
+ openai_client = None # Ensure it's None if init fails
865
+ elif not OpenAI:
866
+ log_accumulator += "ℹ️ OpenAI library 未安裝,無法使用 OpenAI 模型。\n"
867
+ print("ℹ️ OpenAI library not installed.")
868
+ openai_client = None
869
+ else:
870
+ log_accumulator += "ℹ️ 未提供 OpenAI API Key,將無法使用 OpenAI 模型。\n"
871
+ print("ℹ️ OpenAI API Key not provided.")
872
+ openai_client = None
873
+ # --- End API Key and Client Initialization ---
874
+
875
+
876
+ if not output_format:
877
+ log_accumulator += "❌ 請至少選擇一種輸出格式(中文翻譯 或 英文原文)\n"
878
+ print("❌ 錯誤:未選擇輸出格式")
879
+ # Yield Nones for files/previews and the error log (6 values total)
880
+ yield None, None, None, None, None, log_accumulator # log_accumulator already contains the error message
881
+ return
882
+
883
+ pdf_path_obj = Path(pdf_file.name) # Use pdf_file.name for Path object with temp files
884
+ filename_stem = pdf_path_obj.stem
885
+ # Sanitize the filename stem (replace spaces with underscores)
886
+ sanitized_stem = filename_stem.replace(" ", "_")
887
+ print(f"收到檔案: {pdf_path_obj.name} (Sanitized Stem: {sanitized_stem})") # Console print
888
+ print(f"選擇的輸出格式: {output_format}")
889
+
890
+ # --- Output Directory Logic (Using Temp Dir for Gradio Compatibility) ---
891
+ try:
892
+ # Create a unique temporary directory for this run's outputs
893
+ # This directory will be inside Gradio's allowed paths (/tmp)
894
+ temp_base_dir = tempfile.mkdtemp()
895
+ output_dir = os.path.join(temp_base_dir, "outputs") # Subdir for final files
896
+ checkpoint_dir = os.path.join(temp_base_dir, f"checkpoints_{sanitized_stem}") # Subdir for checkpoints
897
+
898
+ os.makedirs(output_dir, exist_ok=True)
899
+ os.makedirs(checkpoint_dir, exist_ok=True)
900
+ log_accumulator += f"📂 使用暫存輸出目錄: {output_dir}\n"
901
+ log_accumulator += f"💾 使用暫存檢查點目錄: {checkpoint_dir}\n"
902
+ print(f"Using temporary output directory: {output_dir}")
903
+ print(f"Using temporary checkpoint directory: {checkpoint_dir}")
904
+
905
+ except Exception as e:
906
+ error_msg = f"❌ 無法建立暫存目錄: {e}"
907
+ log_accumulator += f"{error_msg}\n"
908
+ print(f"❌ 錯誤:{error_msg}")
909
+ # Yield Nones for files/previews and the error log (6 values total)
910
+ yield None, None, None, None, None, log_accumulator # log_accumulator already contains error_msg
911
+ return
912
+ # --- End Output Directory Logic ---
913
+
914
+
915
+ # --- Initial Log Messages ---
916
+ # Yield updates for the log output only (6 values total)
917
+ log_accumulator += f"🚀 開始處理 PDF: {pdf_path_obj.name}\n"
918
+ yield gr.update(), gr.update(), gr.update(), gr.update(), gr.update(), log_accumulator
919
+ # The temp directory paths were already appended to log_accumulator when the directories were created, so they are not logged a second time here.
924
+
925
+ # Determine if translation is needed based on CheckboxGroup selection
926
+ # The former 'translate' checkbox was removed; output_format is now the only control
927
+ need_translation_for_processing = "中文翻譯" in output_format
928
+ log_accumulator += "✅ 將產生中文翻譯\n" if need_translation_for_processing else "ℹ️ 不產生中文翻譯 (未勾選)\n"
929
+ yield gr.update(), gr.update(), gr.update(), gr.update(), gr.update(), log_accumulator
930
+ log_accumulator += "✅ 使用現有檢查點(如果存在)\n" if use_existing_checkpoints else "🔄 重新處理所有步驟(不使用現有檢查點)\n"
931
+ yield gr.update(), gr.update(), gr.update(), gr.update(), gr.update(), log_accumulator
932
+ print(f"需要翻譯: {need_translation_for_processing}, 使用檢查點: {use_existing_checkpoints}")
933
+
934
+ # --- Main Processing ---
935
+ try:
936
+ # process_pdf_to_markdown is a generator, iterate through its yields
937
+ processor = process_pdf_to_markdown(
938
+ pdf_path=pdf_file, # Pass the file path/object directly
939
+ mistral_client=mistral_client,
940
+ gemini_client=gemini_client,
941
+ openai_client=openai_client,
942
+ ocr_model=ocr_model,
943
+ structure_model=structure_model,
944
+ structure_text_only=structure_text_only, # Pass text-only flag
945
+ translation_model=translation_model,
946
+ translation_system_prompt=translation_system_prompt if translation_system_prompt.strip() else None,
947
+ process_images=process_images,
948
+ output_formats_selected=output_format, # Pass selected formats
949
+ output_dir=output_dir,
950
+ checkpoint_dir=checkpoint_dir,
951
+ use_existing_checkpoints=use_existing_checkpoints
952
+ )
953
+
954
+ result_data = None
955
+ # Iterate through the generator from process_pdf_to_markdown
956
+ for item in processor:
957
+ if isinstance(item, dict): # Check if it's the final result dict
958
+ result_data = item
959
+ # Don't yield the dict itself to the log
960
+ elif isinstance(item, str):
961
+ # Append and yield intermediate logs (6 values total)
962
+ log_accumulator += f"{item}\n"
963
+ # Yield updates for the log output only
964
+ yield gr.update(), gr.update(), gr.update(), gr.update(), gr.update(), log_accumulator
965
+ # Handle potential other types if necessary, otherwise ignore
966
+
967
+ # --- Process Final Result for UI ---
968
+ # This part runs after the processor generator is exhausted
969
+ if result_data:
970
+ saved_files_dict = result_data.get("saved_files", {})
971
+ output_original_md_path = saved_files_dict.get("original_file")
972
+ output_translated_md_path = saved_files_dict.get("translated_file")
973
+ image_folder_path = saved_files_dict.get("image_folder") # Gets the folder path
974
+
975
+ # Zip the image folder only if the path exists and it's a directory
976
+ if image_folder_path and os.path.isdir(image_folder_path):
977
+ log_accumulator += f"ℹ️ 找到圖片資料夾: {image_folder_path},嘗試壓縮...\n"
978
+ print(f"ℹ️ Found image folder: {image_folder_path}, attempting to zip...")
979
+ zip_base_name = image_folder_path # Use folder name as base for zip path
980
+ try:
981
+ # Ensure the target zip path doesn't conflict if run multiple times in same temp dir context (though mkdtemp should prevent this)
982
+ output_images_zip_path = shutil.make_archive(zip_base_name, 'zip', root_dir=os.path.dirname(image_folder_path), base_dir=os.path.basename(image_folder_path))
983
+ log_accumulator += f"✅ 已成功壓縮圖片資料夾:{output_images_zip_path}\n"
984
+ print(f"✅ Successfully zipped images: {output_images_zip_path}")
985
+ except Exception as zip_e:
986
+ error_msg = f"⚠️ 壓縮圖片資料夾 '{image_folder_path}' 失敗: {zip_e}"
987
+ log_accumulator += f"{error_msg}\n"
988
+ print(error_msg)
989
+ output_images_zip_path = None # Ensure it's None if zipping failed
990
+ else:
991
+ # Explicitly log if image folder wasn't found or isn't a directory
992
+ if image_folder_path: # Path exists but not a dir
993
+ log_accumulator += f"ℹ️ 找到圖片資料夾路徑,但 '{image_folder_path}' 不是有效的資料夾。無法壓縮。\n"
994
+ print(f"ℹ️ Image folder path found but not a directory: {image_folder_path}. Cannot zip.")
995
+ else: # Path not found in saved_files (likely no images in PDF or folder wasn't saved)
996
+ log_accumulator += "ℹ️ 未找到圖片資料夾路徑 (可能 PDF 無圖片或未儲存)。無法壓縮。\n"
997
+ print("ℹ️ Image folder path not found in saved_files (likely no images in PDF or folder not saved). Cannot zip.")
998
+ output_images_zip_path = None # Ensure it's None
999
+
1000
+
1001
+ final_log_message = "✅ 處理完成!請查看預覽視窗,或至下載檔案視窗下載檔案。" # Updated message
1002
+ log_accumulator += f"{final_log_message}\n"
1003
+ print("--- Gradio 處理請求完成 ---")
1004
+
1005
+ else:
1006
+ final_log_message = "⚠️ 處理完成,但未收到預期的結果字典。"
1007
+ log_accumulator += f"{final_log_message}\n"
1008
+ print(f"⚠️ 警告:{final_log_message}")
1009
+
1010
+ # Final yield: provide paths for file outputs, markdown content for previews, and the final log
1011
+ yield (
1012
+ output_original_md_path,
1013
+ output_translated_md_path,
1014
+ output_images_zip_path,
1015
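+ (result_data or {}).get("original_content") or "無原文內容可預覽", # Content for original preview; safe when result_data is None
1016
+ (result_data or {}).get("translated_content") or "無翻譯內容可預覽", # Content for translated preview; falls back when translation was skipped (value is None)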
1017
+ log_accumulator
1018
+ )
1019
+
1020
+ except Exception as e:
1021
+ error_message = f"❌ Gradio 處理過程中發生未預期錯誤: {str(e)}"
1022
+ log_accumulator += f"{error_message}\n"
1023
+ print(f"❌ 嚴重錯誤:{error_message}")
1024
+ import traceback
1025
+ traceback.print_exc() # Print full traceback to console
1026
+ # Final yield in case of error: provide Nones for files/previews and the error log (6 values total)
1027
+ yield None, None, None, None, None, log_accumulator
1028
+
1029
+ # Create Gradio interface
1030
+ with gr.Blocks(title="Mistral OCR & Translation Tool") as demo:
1031
+ gr.Markdown("""
1032
+ # Mistral OCR & 翻譯工具
1033
+
1034
+ Convert PDF files to Markdown with OCR and English-to-Traditional Chinese translation, powered by Mistral, Gemini, and OpenAI.
1035
+ 將 PDF 文件轉為 Markdown 格式,支援圖片 OCR 和英文到繁體中文翻譯,使用 Mistral、Gemini 和 OpenAI 模型。
1036
+ """)
1037
+
1038
+ with gr.Row():
1039
+ with gr.Column(scale=1):
1040
+ pdf_file = gr.File(label="上傳 PDF 檔案", file_types=[".pdf"])
1041
+
1042
+ with gr.Accordion("基本設定", open=True):
1043
+ # Define default path for placeholder clarity
1044
+ default_output_path_display = os.path.join("桌面", "MistralOCR_Output") # Simplified for display
1045
+ # Output directory is now handled internally using tempfile, remove UI element
1046
+ # output_dir = gr.Textbox(
1047
+ # label="輸出目錄 (請貼上完整路徑)",
1048
+ # placeholder=f"留空預設儲存至:{default_output_path_display}",
1049
+ # info="將所有輸出檔案 (Markdown, 圖片, 檢查點) 儲存於此目錄。",
1050
+ # value="" # Default logic remains in process_pdf
1051
+ # )
1052
+
1053
+ use_existing_checkpoints = gr.Checkbox(
1054
+ label="使用現有檢查點(如果存在)",
1055
+ value=True,
1056
+ info="啟用後,如果檢查點存在,將跳過已完成的步驟。"
1057
+ )
1058
+
1059
+ output_format = gr.CheckboxGroup(
1060
+ label="輸出格式 (可多選)",
1061
+ choices=["中文翻譯", "英文原文"],
1062
+ value=["中文翻譯", "英文原文"], # Default to both
1063
+ info="選擇您需要儲存的 Markdown 檔案格式。"
1064
+ )
1065
+
1066
+ with gr.Accordion("API Keys (請自行填入)", open=True):
1067
+ mistral_api_key_input = gr.Textbox(
1068
+ label="Mistral API Key",
1069
+ type="password",
1070
+ placeholder="請貼上你的 Mistral API Key",
1071
+ info="(必要) 用於 PDF 和圖片 OCR。請從 https://console.mistral.ai/ 獲取。此金鑰僅用於本次處理,不會儲存。"
1072
+ )
1073
+ gemini_api_key_input = gr.Textbox(
1074
+ label="Gemini API Key",
1075
+ type="password",
1076
+ placeholder="請貼上你的 Gemini API Key",
1077
+ info="(推薦) 若選擇 Gemini 模型進行翻譯或結構化,則需要。請從 https://aistudio.google.com/app/apikey 獲取。此金鑰僅用於本次處理,不會儲存。"
1078
+ )
1079
+ openai_api_key_input = gr.Textbox(
1080
+ label="OpenAI API Key",
1081
+ type="password",
1082
+ placeholder="請貼上你的 OpenAI API Key",
1083
+ info="(可選) 若選擇 GPT 模型進行翻譯或結構化,則需要。請從 https://platform.openai.com/api-keys 獲取。此金鑰僅用於本次處理,不會儲存。"
1084
+ )
1085
+
1086
+
1087
+ with gr.Accordion("處理選項", open=False):
1088
+ process_images = gr.Checkbox(
1089
+ label="處理圖片 OCR",
1090
+ value=True,
1091
+ info="啟用後,將對 PDF 中的圖片額外進行 OCR 辨識"
1092
+ )
1093
+
1094
+
1095
+
1096
+ with gr.Accordion("模型設定", open=True):
1097
+ ocr_model = gr.Dropdown(
1098
+ label="OCR 模型",
1099
+ choices=["mistral-ocr-latest"],
1100
+ value="mistral-ocr-latest"
1101
+ )
1102
+ structure_model = gr.Dropdown(
1103
+ label="結構化模型 (用於圖片 OCR)",
1104
+ choices=["pixtral-12b-latest", "gemini-2.0-flash", "gpt-4o-mini", "gpt-4o"], # Added gpt-4o
1105
+ value="gemini-2.0-flash",
1106
+ info="選擇用於結構化圖片 OCR 結果的模型。需要對應的 API Key。"
1107
+ )
1108
+ structure_text_only = gr.Checkbox(
1109
+ label="僅用文字進行結構化 (節省 Token)",
1110
+ value=False,
1111
+ info="勾選後,僅將圖片的初步 OCR 文字傳送給 Gemini 或 OpenAI 進行結構化,不傳送圖片本身。對 Pixtral 無效。⚠️注意:缺少圖片視覺資訊可能導致結構化效果不佳,建議僅在 OCR 文字已足夠清晰時使用。"
1112
+ )
1113
+ translation_model = gr.Dropdown(
1114
+ label="翻譯模型",
1115
+ choices=[
1116
+ "gemini-2.0-flash",
1117
+ "gemini-2.5-pro-exp-03-25",
1118
+ "gemini-2.0-flash-lite",
1119
+ "gpt-4o", # Added OpenAI models
1120
+ "gpt-4o-mini"
1121
+ ],
1122
+ value="gemini-2.0-flash"
1123
+ )
1124
+ with gr.Accordion("進階設定", open=False):
1125
+ translation_system_prompt = gr.Textbox(
1126
+ label="翻譯系統提示詞",
1127
+ value=DEFAULT_TRANSLATION_SYSTEM_INSTRUCTION,
1128
+ lines=10
1129
+ )
1130
+
1131
+ process_button = gr.Button("開始處理", variant="primary")
1132
+
1133
+ with gr.Column(scale=2):
1134
+ with gr.Tab("處理日誌"):
1135
+ console_output = gr.Textbox(
1136
+ label="處理進度",
1137
+ lines=20,
1138
+ max_lines=50,
1139
+ interactive=False,
1140
+ autoscroll=True
1141
+ )
1142
+ with gr.Tab("使用說明"):
1143
+
1144
+ gr.Markdown("""
1145
+ # 使用說明
1146
+
1147
+ 1. 上傳 PDF 檔案(可拖曳或點擊上傳)
1148
+ 2. 輸入 Mistral API 金鑰(必要)及 Gemini/OpenAI 金鑰(可選)
1149
+ 3. 基本設定:
1150
+ - 選擇是否使用現有檢查點(預設啟用)
1151
+ - 選擇輸出格式(中文翻譯、英文原文,可多選)
1152
+ 4. 處理選項:
1153
+ - 選擇是否處理圖片 OCR(預設啟用)
1154
+ 5. 模型與進階設定(可選):
1155
+ - 選擇 OCR、結構化、翻譯模型
1156
+ - 修改翻譯提示詞(若需其他語言)
1157
+ 6. 點擊「開始處理」按鈕
1158
+ 7. 於「處理日誌」標籤查看進度,完成後從「下載檔案」標籤下載結果
1159
+
1160
+ ## 檢查點說明
1161
+
1162
+ - **PDF OCR 檢查點**:儲存 PDF 的 OCR 結果
1163
+ - **圖片 OCR 檢查點**:儲存圖片的 OCR 結構化結果
1164
+ - 若需重新處理,可取消勾選「使用現有檢查點」
1165
+
1166
+ ## 輸出檔案
1167
+
1168
+ - `[檔名]_original.md`:英文原文 Markdown
1169
+ - `[檔名]_translated.md`:繁體中文翻譯 Markdown
1170
+ - `images_[檔名].zip`:PDF 中提取的圖片
1171
+
1172
+ ## API 使用量參考(粗略估計)
1173
+
1174
+ 以下為兩個實際測試場景的 API 使用情況,可供預估大致耗用量:
1175
+
1176
+ ### 測試場景一(Gemini 全流程)
1177
+
1178
+ - **PDF 範例**:Jones & Bergen (2025) 論文前 3 頁(含 1 張圖片)
1179
+ - **Mistral OCR**:消耗約 **4 Pages**(含圖片額外一次處理)
1180
+ - **Gemini 2.0 Flash**:
1181
+ - 結構化 + 翻譯(單模型)
1182
+ - 輸入 Token 約 **7,300 Tokens**
1183
+
1184
+ ### 測試場景二(分開處理:Gemini 結構化 + GPT-4o Mini 翻譯)
1185
+
1186
+ - **PDF 範例**:同一份 3 頁英文文件(含圖片)
1187
+ - **Mistral OCR**:消耗約 **4 Pages**
1188
+ - **Gemini 2.0 Flash**(僅做結構化):
1189
+ - 輸入 Token 約 **2,357 Tokens**
1190
+ - **GPT-4o Mini**(做翻譯):
1191
+ - 輸入 Token 約 **4,440 Tokens**
1192
+
1193
+ > **注意**:實際耗用量會根據 PDF 頁數、內容密度、圖片比例與翻譯範圍有所不同,以上數據僅供參考。
1194
+
1195
+ 測試樣本之一引用:
1196
+ Jones, C. R., & Bergen, B. K. (2025). *Large Language Models Pass the Turing Test*. *arXiv preprint* [arXiv:2503.23674](https://arxiv.org/abs/2503.23674)
1197
+ 本測試僅借用該論文前 3 頁作為輸入範例進行處理流程測試,未轉載、修改或散佈其內容。
1198
+ """)
1199
+
1200
+ with gr.Tab("預覽原文"): # New Tab for Original Preview
1201
+ preview_original_md = gr.Markdown(label="預覽原文 Markdown")
1202
+
1203
+ with gr.Tab("預覽翻譯"): # New Tab for Translated Preview
1204
+ preview_translated_md = gr.Markdown(label="預覽翻譯 Markdown")
1205
+
1206
+
1207
+ with gr.Tab("下載檔案"): # Changed Tab name
1208
+ # Add File output components for downloads
1209
+ output_original_md = gr.File(label="下載原文 Markdown (.md)")
1210
+ output_translated_md = gr.File(label="下載翻譯 Markdown (.md)")
1211
+ output_images_zip = gr.File(label="下載圖片 (.zip)")
1212
+ with gr.Tab("關於"): # Added tab
1213
+ gr.Markdown("""
1214
+ ## 關於 Mistral OCR 翻譯工具
1215
+
1216
+ 本工具由 **David Chang** 開發,旨在將 PDF 文件轉換為 Markdown 格式,支援圖片 OCR 和英文到繁體中文的翻譯。整合以下技術:
1217
+ - **Mistral AI**:PDF 和圖片 OCR
1218
+ - **Google Gemini / OpenAI**:翻譯與結構化
1219
+ - **Gradio**:互動式網頁介面
1220
+
1221
+ ### 版權與授權
1222
+ - **作者**:David Chang
1223
+ - **版權**:© 2025 David Chang
1224
+ - **授權**:MIT 授權,詳見 [LICENSE](https://github.com/dodo13114arch/mistralocr-pdf2md-translator/blob/main/LICENSE)
1225
+ - **GitHub**:https://github.com/dodo13114arch/mistralocr-pdf2md-translator
1226
+
1227
+ ### 感謝
1228
+ 感謝 Mistral AI、Google Gemini、OpenAI 和 Gradio 提供的技術支持,以及 Mistral 官方範例的啟發 ([Colab Notebook](https://colab.research.google.com/github/mistralai/cookbook/blob/main/mistral/ocr/structured_ocr.ipynb))。
1229
+
1230
+ ### 聯繫與反饋
1231
+ 歡迎在 GitHub 上提交問題或建議!
1232
+ """)
1233
+
1234
+ # Define outputs for the click event
1235
+ # Order must match the final yield in process_pdf:
1236
+ # file_orig, file_trans, file_zip, preview_orig, preview_trans, console_log
1237
+ outputs_list = [
1238
+ output_original_md,
1239
+ output_translated_md,
1240
+ output_images_zip,
1241
+ preview_original_md, # Added output for original preview
1242
+ preview_translated_md, # Added output for translated preview
1243
+ console_output
1244
+ ]
1245
+
1246
+ # Define inputs for the click event (remove console_output)
1247
+ inputs_list=[
1248
+ pdf_file,
1249
+ # API Key Inputs
1250
+ mistral_api_key_input,
1251
+ gemini_api_key_input,
1252
+ openai_api_key_input,
1253
+ # Other parameters
1254
+ ocr_model,
1255
+ structure_model,
1256
+ translation_model,
1257
+ translation_system_prompt,
1258
+ process_images,
1259
+ # translate, # Removed
1260
+ output_format, # CheckboxGroup list
1261
+ use_existing_checkpoints,
1262
+ structure_text_only
1263
+ ]
1264
+
1265
+ # Use process_button.click with the generator function
1266
+ process_button.click(
1267
+ fn=process_pdf,
1268
+ inputs=inputs_list,
1269
+ outputs=outputs_list
1270
+ )
1271
+
1272
+ # Add event handler to exit script when UI is closed/unloaded
1273
+ # Removed inputs and outputs arguments as they are not accepted by unload
1274
+ # demo.unload(fn=lambda: os._exit(0))
1275
+
1276
+
1277
+ gr.Markdown("""
1278
+
1279
+ ---
1280
+
1281
+ **免責聲明**
1282
+ 本工具僅供學習與研究用途,整合 Mistral、Google Gemini 和 OpenAI API。請確保:
1283
+ - 您擁有合法的 API 金鑰,並遵守各服務條款([Mistral](https://mistral.ai/terms)、[Gemini](https://ai.google.dev/terms)、[OpenAI](https://openai.com/policies))。
1284
+ - 上傳的 PDF 文件符合版權法規,您有權進行處理。
1285
+ - 翻譯結果可能有誤,請自行驗證。
1286
+ 本工具不儲存任何上傳檔案或 API 金鑰,所有處理均在暫存環境中完成。
1287
+
1288
+ **版權資訊**
1289
+ Copyright © 2025 David Chang. 根據 MIT 授權發布,詳見 [LICENSE](https://github.com/dodo13114arch/mistralocr-pdf2md-translator/blob/main/LICENSE)。
1290
+ GitHub: https://github.com/dodo13114arch/mistralocr-pdf2md-translator
1291
+ """)
1292
+
1293
+ return demo
1294
+
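Review note: `process_pdf` relays status strings from the inner generator to the UI as 6-tuples whose last slot is the growing log, then emits one final tuple with file paths and preview text. The sketch below reproduces that relay without importing Gradio. `relay` and the `NO_CHANGE` sentinel are hypothetical names; `NO_CHANGE` stands in for `gr.update()`, which leaves a component untouched, whereas yielding `None` would clear it.

```python
NO_CHANGE = object()  # stand-in for gr.update(): "leave this output as it is"


def relay(inner):
    """Turn an inner generator of str/dict items into 6-slot output tuples.

    `inner` yields status strings and, at the end, at most one result dict.
    Slot order: original file, translated file, zip, original preview,
    translated preview, log.
    """
    log = ""
    result = None
    for item in inner:
        if isinstance(item, dict):
            result = item  # final payload; never written to the log
        elif isinstance(item, str):
            log += item + "\n"
            yield (NO_CHANGE,) * 5 + (log,)
    # Guard against a missing result so the final yield cannot raise.
    result = result or {}
    files = result.get("saved_files", {})
    yield (
        files.get("original_file"),
        files.get("translated_file"),
        None,  # the zip path would be filled in after archiving
        result.get("original_content") or "",
        result.get("translated_content") or "",
        log,
    )
```

The `or ""` fallbacks matter because a key that is present with the value `None` (for example a skipped translation) bypasses the default argument of `dict.get`.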
1295
+ # ===== Main Execution =====
1296
+
1297
+ if __name__ == "__main__":
1298
+ # Create and launch Gradio interface
1299
+ demo = create_gradio_interface()
1300
+ demo.launch()
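Review note: the image download relies on `shutil.make_archive` with `root_dir` set to the parent directory and `base_dir` set to the folder name, so that entries inside the zip keep the `images_<stem>/` prefix that the Markdown links expect. A minimal sketch of that call is shown below. `zip_folder` is a hypothetical wrapper, not a function from the app.

```python
import os
import shutil
import tempfile
import zipfile


def zip_folder(folder):
    """Zip `folder` next to itself; entries keep the folder name as a prefix."""
    folder = os.path.abspath(folder)
    return shutil.make_archive(
        folder,                             # archive path without the ".zip" suffix
        "zip",
        root_dir=os.path.dirname(folder),   # archive member paths are relative to this
        base_dir=os.path.basename(folder),  # only this subfolder is archived
    )
```

Since the archive is written beside the folder rather than inside it, the zip file does not end up containing itself.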
requirements.txt ADDED
@@ -0,0 +1,64 @@
1
+ aiofiles==23.2.1
2
+ annotated-types==0.7.0
3
+ anyio==4.9.0
4
+ cachetools==5.5.2
5
+ certifi==2025.1.31
6
+ charset-normalizer==3.4.1
7
+ click==8.1.8
8
+ distro==1.9.0
9
+ eval_type_backport==0.2.2
10
+ exceptiongroup==1.2.2
11
+ fastapi==0.115.12
12
+ ffmpy==0.5.0
13
+ filelock==3.18.0
14
+ fsspec==2025.3.2
15
+ google-auth==2.38.0
16
+ google-genai==1.9.0
17
+ gradio==5.25.2
18
+ gradio_client==1.8.2
19
+ groovy==0.1.2
20
+ h11==0.14.0
21
+ httpcore==1.0.7
22
+ httpx==0.28.1
23
+ huggingface-hub==0.30.1
24
+ idna==3.10
25
+ ipython # Added back for runtime dependency
26
+ Jinja2==3.1.6
27
+ jiter==0.9.0
28
+ markdown-it-py==3.0.0
29
+ MarkupSafe==3.0.2
30
+ mdurl==0.1.2
31
+ mistralai==1.6.0
32
+ numpy==2.2.4
33
+ openai==1.73.0
34
+ orjson==3.10.16
35
+ pandas==2.2.3
36
+ pillow==11.1.0
37
+ pyasn1==0.6.1
38
+ pyasn1_modules==0.4.2
39
+ pydantic==2.11.2
40
+ pydantic_core==2.33.1
41
+ pydub==0.25.1
42
+ python-dateutil==2.9.0 # Use latest available version
43
+ python-dotenv==1.1.0
44
+ python-multipart==0.0.20
45
+ pytz==2025.2
46
+ PyYAML==6.0.2
47
+ requests==2.32.3
48
+ rich==14.0.0
49
+ rsa==4.9
50
+ ruff==0.11.4
51
+ safehttpx==0.1.6
52
+ semantic-version==2.10.0
53
+ shellingham==1.5.4
54
+ sniffio==1.3.1
55
+ starlette==0.46.1
56
+ tomlkit==0.13.2
57
+ tqdm==4.67.1
58
+ typer==0.15.2
59
+ typing-inspection==0.4.0
60
+ typing_extensions==4.13.1
61
+ tzdata==2025.2
62
+ urllib3==2.3.0
63
+ uvicorn==0.34.0
64
+ websockets==15.0.1