# 🌐 T5-Based Multilingual Text Translator

This repository presents a fine-tuned T5-small model for multilingual text translation across English, French, German, Italian, and Portuguese. It includes FP16 quantization for efficient inference and gTTS-based speech synthesis for accessibility.

---

## 📝 Problem Statement

The goal is to translate text between English and multiple European languages using a transformer-based model. Instead of using black-box APIs, this project fine-tunes the T5 model on parallel multilingual corpora, enabling offline translation and potential customization.

---

## 📊 Dataset

- **Source:** Custom parallel corpus (`.txt` files) with one-to-one sentence alignments.
- **Languages Supported:**
  - English
  - French
  - German
  - Italian
  - Portuguese

- **Structure:**
  - Each language has a corresponding `.txt` file.
  - Lines are aligned by index to form translation pairs.

- **Example Input Format:**
  ```
  Source: translate English to French: I am a student.
  Target: Je suis un étudiant.
  ```
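
A minimal sketch of turning such index-aligned files into training pairs (the `make_pairs` helper and file names are illustrative, not code from this repository):

```python
# Hypothetical helper: pair up index-aligned sentences from two files
# and prepend the T5 task prefix shown in the example above.
def make_pairs(src_path, tgt_path, src_lang="English", tgt_lang="French"):
    with open(src_path, encoding="utf-8") as f:
        src_lines = [line.strip() for line in f if line.strip()]
    with open(tgt_path, encoding="utf-8") as f:
        tgt_lines = [line.strip() for line in f if line.strip()]
    assert len(src_lines) == len(tgt_lines), "files must be index-aligned"
    prefix = f"translate {src_lang} to {tgt_lang}: "
    return [(prefix + s, t) for s, t in zip(src_lines, tgt_lines)]

pairs = make_pairs("english.txt", "french.txt")
print(pairs[0])
# ('translate English to French: I am a student.', 'Je suis un étudiant.')
```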

---

## 🧠 Model Details

- **Architecture:** T5-small
- **Tokenizer:** `T5Tokenizer`
- **Model:** `T5ForConditionalGeneration`
- **Task Type:** Sequence-to-Sequence Translation (Supervised Fine-tuning)
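
For reference, the components named above can be loaded directly from the Hugging Face Hub; this is the pre-trained starting point for fine-tuning, not the fine-tuned checkpoint shipped here:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Pre-trained base checkpoint that fine-tuning starts from
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

print(f"Parameters: {model.num_parameters():,}")  # roughly 60M for t5-small
```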

---

## 🔧 Installation

```bash
pip install transformers datasets torch gtts
```

---

## 🚀 Loading the Model

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch

# Load quantized model (float16)
model = T5ForConditionalGeneration.from_pretrained("quantized_model", torch_dtype=torch.float16)
tokenizer = T5Tokenizer.from_pretrained("quantized_model")

# Translation example
source = "translate English to German: How are you?"
inputs = tokenizer(source, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    outputs = model.generate(**inputs)

print("Translated:", tokenizer.decode(outputs[0], skip_special_tokens=True))
```
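
For the speech-synthesis side mentioned in the introduction, here is a minimal sketch using gTTS (the file name and language code are illustrative; gTTS calls Google's service, so it needs internet access):

```python
from gtts import gTTS

# Synthesize the translated text to an MP3 file.
# "de" is the gTTS code for German; pick the code matching the target language.
translated = "Wie geht es dir?"  # e.g. the decoded output from the example above
gTTS(text=translated, lang="de").save("translation.mp3")
```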

---

## 📈 Performance Metrics

Because this project fine-tunes for only a single epoch on a small corpus, performance metrics were not computed for this release. For a production-level system, translation quality should be evaluated with BLEU or a similar metric such as chrF (ROUGE is designed primarily for summarization).
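
If you want to add such an evaluation, a minimal sketch using the SacreBLEU metric from the `evaluate` library (the sentences below are placeholders, not results from this model):

```python
# pip install evaluate sacrebleu
import evaluate

bleu = evaluate.load("sacrebleu")

# Placeholder data: model outputs vs. reference translations
predictions = ["Je suis un étudiant."]
references = [["Je suis étudiant."]]

result = bleu.compute(predictions=predictions, references=references)
print(f"BLEU: {result['score']:.1f}")
```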

---

## 🏋️ Fine-Tuning Details

### 📚 Dataset Preparation

- Five text files in total (`english.txt`, `french.txt`, etc.), one per language.
- Sentences are aligned by index to form parallel translation pairs.

### 🔧 Training Configuration

- **Epochs:** 1  
- **Batch size:** 4  
- **Max sequence length:** 128  
- **Model base:** `t5-small`  
- **Framework:** Hugging Face Transformers + PyTorch  
- **Evaluation strategy:** 10% held-out test split (see the sketch below)
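
A condensed sketch of a training loop matching these settings; it reuses the `pairs` list from the Dataset section, and the learning rate is an assumption (the actual implementation lives in `multilingual_translator.py`):

```python
import torch
from torch.utils.data import DataLoader
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # lr is assumed

# 10% held out for evaluation; `pairs` as built in the Dataset section
split = int(0.9 * len(pairs))
train_pairs, test_pairs = pairs[:split], pairs[split:]

def collate(batch):
    src, tgt = zip(*batch)
    enc = tokenizer(list(src), padding=True, truncation=True,
                    max_length=128, return_tensors="pt")
    labels = tokenizer(list(tgt), padding=True, truncation=True,
                       max_length=128, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # mask padding in the loss
    enc["labels"] = labels
    return enc

loader = DataLoader(train_pairs, batch_size=4, shuffle=True, collate_fn=collate)

model.train()
for epoch in range(1):  # single epoch, per the configuration above
    for batch in loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```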

---

## 🔄 Quantization

Post-training quantization was performed by converting the weights to half precision (FP16) with `.half()`, reducing model size and improving inference speed.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the full-precision model
model_fp32 = T5ForConditionalGeneration.from_pretrained("model")

# Convert the weights to half precision and save
model_fp16 = model_fp32.half()
model_fp16.save_pretrained("quantized_model")

# Save the tokenizer too, so "quantized_model" can be loaded on its own
# (the loading example above reads the tokenizer from this directory)
T5Tokenizer.from_pretrained("model").save_pretrained("quantized_model")
```

**Model Size Comparison:**

| Type            | Size (KB) |
|------------------|-----------|
| FP32 (Original)  | ~6,904 KB |
| FP16 (Quantized) | ~3,452 KB |
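
The numbers above can be checked by summing the file sizes in each directory (a hypothetical helper, assuming the layout shown under Repository Structure):

```python
import os

# Hypothetical helper: total size of all files directly inside `path`, in KB
def dir_size_kb(path):
    files = (os.path.join(path, f) for f in os.listdir(path))
    return sum(os.path.getsize(f) for f in files if os.path.isfile(f)) / 1024

print(f"FP32: ~{dir_size_kb('model'):,.0f} KB")
print(f"FP16: ~{dir_size_kb('quantized_model'):,.0f} KB")
```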

---

## 📁 Repository Structure

```
.
├── model/                       # Contains FP32 model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   └── ...
├── quantized_model/            # Contains FP16 quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   └── ...
├── README.md                   # Documentation
└── multilingual_translator.py  # Training and inference script
```

---

## ⚠️ Limitations

- Trained on a small dataset for only one epoch, so it may not generalize well to unseen phrases or complex sentences.
- Language coverage is limited to five predefined languages.
- gTTS depends on Google's Text-to-Speech service and requires internet access.

---

## 🀝 Contributing

Feel free to submit issues or PRs to add more language pairs, extend training, or integrate a UI for real-time use.