KshitizTayal committed
Commit 974a7e1 · verified · 1 Parent(s): 4a43961

Update README.md

Files changed (1)
  1. README.md +150 -3
README.md CHANGED
@@ -1,3 +1,150 @@
- ---
- license: apache-2.0
- ---
+ # 🌐 T5-Based Multilingual Text Translator
+
+ This repository presents a fine-tuned T5-small model for multilingual text translation across English, French, German, Italian, and Portuguese. It includes FP16 quantization for efficient inference and gTTS-based speech synthesis support for accessibility.
+
+ ---
+
+ ## 📝 Problem Statement
+
+ The goal is to translate text between English and multiple European languages using a transformer-based model. Instead of relying on black-box APIs, this project fine-tunes the T5 model on parallel multilingual corpora, enabling offline translation and potential customization.
+
+ ---
+
+ ## 📊 Dataset
+
+ - **Source:** Custom parallel corpus (`.txt` files) with one-to-one sentence alignments.
+ - **Languages Supported:**
+   - English
+   - French
+   - German
+   - Italian
+   - Portuguese
+
+ - **Structure:**
+   - Each language has a corresponding `.txt` file.
+   - Lines are aligned by index to form translation pairs.
+
+ - **Example Input Format:**
+ ```
+ Source: translate English to French: I am a student.
+ Target: Je suis un étudiant.
+ ```
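The index-aligned files described above could be turned into prefixed training pairs roughly as follows. This is a minimal sketch, not the project's actual preprocessing: the function name `build_pairs`, the in-memory toy corpus, and the choice to generate every ordered language pair are assumptions for illustration. Only the `translate X to Y:` prefix format comes from the example above.

```python
from itertools import permutations


def build_pairs(corpus):
    """Build (source, target) pairs from index-aligned sentences.

    `corpus` maps a language name to its list of sentences; sentence i in
    one language is assumed to be the translation of sentence i in the others.
    """
    lengths = {len(lines) for lines in corpus.values()}
    if len(lengths) > 1:
        raise ValueError("all languages must have the same number of lines")

    pairs = []
    # Every ordered pair of distinct languages yields one translation direction.
    for src_lang, tgt_lang in permutations(corpus, 2):
        for src, tgt in zip(corpus[src_lang], corpus[tgt_lang]):
            pairs.append((f"translate {src_lang} to {tgt_lang}: {src}", tgt))
    return pairs


# Toy data standing in for english.txt and french.txt (hypothetical content).
corpus = {
    "English": ["I am a student.", "How are you?"],
    "French": ["Je suis un étudiant.", "Comment allez-vous ?"],
}
pairs = build_pairs(corpus)
print(pairs[0])
```

In practice each list would be read from its `.txt` file with one sentence per line; the length check guards against files that have drifted out of alignment.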
+
+ ---
+
+ ## 🧠 Model Details
+
+ - **Architecture:** T5-small
+ - **Tokenizer:** `T5Tokenizer`
+ - **Model:** `T5ForConditionalGeneration`
+ - **Task Type:** Sequence-to-Sequence Translation (Supervised Fine-tuning)
+
+ ---
+
+ ## 🔧 Installation
+
+ ```bash
+ # sentencepiece is required by T5Tokenizer
+ pip install transformers datasets torch gtts sentencepiece
+ ```
+
+ ---
+
+ ## 🚀 Loading the Model
+
+ ```python
+ from transformers import T5ForConditionalGeneration, T5Tokenizer
+ import torch
+
+ # Load the quantized (float16) model and its tokenizer
+ model = T5ForConditionalGeneration.from_pretrained("quantized_model", torch_dtype=torch.float16)
+ tokenizer = T5Tokenizer.from_pretrained("quantized_model")
+
+ # Translation example: the task prefix selects the language pair
+ source = "translate English to German: How are you?"
+ inputs = tokenizer(source, return_tensors="pt", padding=True, truncation=True)
+
+ # Generate without tracking gradients
+ with torch.no_grad():
+     outputs = model.generate(**inputs)
+
+ print("Translated:", tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+
+ ---
+
+ ## 📈 Performance Metrics
+
+ Because the model was fine-tuned for only a single epoch, no performance metrics were computed. A production-level system should be evaluated with BLEU or ROUGE scores.
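To illustrate what a BLEU evaluation involves, here is a simplified, dependency-free sketch of corpus-level BLEU. It is not part of this project: the function names, whitespace tokenization, single reference per hypothesis, and lack of smoothing are all assumptions made to keep the example short. A real evaluation would more likely use an established implementation such as sacreBLEU.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU with one reference per hypothesis.

    Uses whitespace tokenization and no smoothing, so the score is 0.0
    whenever any n-gram order has no match.
    """
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # hypothesis n-gram counts per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp_tok, n), ngrams(ref_tok, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes hypotheses shorter than their references.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


refs = ["Je suis un étudiant ."]
perfect = corpus_bleu(["Je suis un étudiant ."], refs)
partial = corpus_bleu(["Je suis un étudiant"], refs)
print(f"perfect={perfect:.3f} partial={partial:.3f}")
```

The second hypothesis matches every n-gram it contains but is one token shorter than the reference, so only the brevity penalty lowers its score.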
+
+ ---
+
+ ## 🏋️ Fine-Tuning Details
+
+ ### 📚 Dataset Preparation
+
+ - Five text files in total (`english.txt`, `french.txt`, etc.), one per language.
+ - Sentences are aligned by line index to form parallel translation pairs.
+
+ ### 🔧 Training Configuration
+
+ - **Epochs:** 1
+ - **Batch size:** 4
+ - **Max sequence length:** 128
+ - **Model base:** `t5-small`
+ - **Framework:** Hugging Face Transformers + PyTorch
+ - **Evaluation strategy:** 10% of the data held out as a test split
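The 10% hold-out listed above could be produced along these lines. This is a sketch under stated assumptions: the helper name `train_test_split`, the fixed seed, and the generated toy pairs are hypothetical, and the project may instead rely on `Dataset.train_test_split` from the `datasets` library.

```python
import random


def train_test_split(pairs, test_fraction=0.1, seed=42):
    """Shuffle the pairs deterministically and hold out a test fraction."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical stand-in for the real (source, target) pairs.
pairs = [(f"translate English to French: sentence {i}", f"phrase {i}") for i in range(100)]
train, test = train_test_split(pairs)
print(len(train), len(test))
```

Shuffling before the split keeps the test set from being dominated by whichever sentences happen to sit at the start of the files, and the seed makes the split reproducible.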
+
+ ---
+
+ ## 🔄 Quantization
+
+ Post-training quantization was performed by converting the model weights to half precision (FP16) with `.half()`, which reduces model size and can speed up inference on hardware with native FP16 support.
+
+ ```python
+ from transformers import T5ForConditionalGeneration
+
+ # Load full-precision model
+ model_fp32 = T5ForConditionalGeneration.from_pretrained("model")
+
+ # Convert to half precision (in place; both names refer to the same model)
+ model_fp16 = model_fp32.half()
+ model_fp16.save_pretrained("quantized_model")
+ ```
+
+ **Model Size Comparison:**
+
+ | Type             | Size (KB) |
+ |------------------|-----------|
+ | FP32 (Original)  | ~6,904    |
+ | FP16 (Quantized) | ~3,452    |
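The 2:1 ratio in the table is what one would expect when each weight drops from 4 bytes to 2. A comparison like this can be re-measured with a small helper such as the one below. It is a sketch only: `dir_size_kb` and `compression_ratio` are hypothetical names, not functions from the project's script.

```python
import os


def dir_size_kb(path):
    """Return the total size of all files under `path`, in kilobytes."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
    return total_bytes / 1024


def compression_ratio(fp32_kb, fp16_kb):
    """Ratio of the FP32 size to the FP16 size."""
    return fp32_kb / fp16_kb


# Sizes reported in the table above.
print(compression_ratio(6904, 3452))
```

Calling `dir_size_kb("model")` and `dir_size_kb("quantized_model")` would reproduce the comparison for the two directories named in this README.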
+
+ ---
+
+ ## 📁 Repository Structure
+
+ ```
+ .
+ ├── model/                       # Contains FP32 model files
+ │   ├── config.json
+ │   ├── model.safetensors
+ │   ├── tokenizer_config.json
+ │   └── ...
+ ├── quantized_model/             # Contains FP16 quantized model files
+ │   ├── config.json
+ │   ├── model.safetensors
+ │   ├── tokenizer_config.json
+ │   └── ...
+ ├── README.md                    # Documentation
+ └── multilingual_translator.py   # Training and inference script
+ ```
+
+ ---
+
+ ## ⚠️ Limitations
+
+ - Trained on a small dataset for only one epoch, so it may not generalize well to all phrases or complex sentences.
+ - Language coverage is limited to 5 predefined languages.
+ - gTTS depends on Google Translate's text-to-speech API and requires internet access.
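Since the gTTS dependency is only mentioned in passing, here is a minimal sketch of how a translated sentence might be voiced. The `speak` helper, the `GTTS_LANG_CODES` mapping, and the output filename are assumptions for illustration, not code taken from `multilingual_translator.py`. The synthesis call itself needs a network connection.

```python
# Language codes accepted by gTTS for the five supported languages.
GTTS_LANG_CODES = {
    "English": "en",
    "French": "fr",
    "German": "de",
    "Italian": "it",
    "Portuguese": "pt",
}


def speak(text, language, out_path="translation.mp3"):
    """Synthesize `text` to an MP3 file with gTTS (requires internet access)."""
    # Imported lazily so the mapping above is usable without gTTS installed.
    from gtts import gTTS

    code = GTTS_LANG_CODES[language]
    gTTS(text=text, lang=code).save(out_path)
    return out_path


print(GTTS_LANG_CODES["German"])
```

For example, `speak("Wie geht es dir?", "German")` would write `translation.mp3` when the service is reachable.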
+
+ ---
+
+ ## 🤝 Contributing
+
+ Feel free to submit issues or PRs to add more language pairs, extend training, or integrate a UI for real-time use.