# T5-Based Multilingual Text Translator
This repository presents a fine-tuned T5-small model for multilingual text translation across English, French, German, Italian, and Portuguese. It includes quantization for efficient inference and speech synthesis support for accessibility.
---
## Problem Statement
The goal is to translate text between English and multiple European languages using a transformer-based model. Instead of using black-box APIs, this project fine-tunes the T5 model on parallel multilingual corpora, enabling offline translation and potential customization.
---
## Dataset
- **Source:** Custom parallel corpus (`.txt` files) with one-to-one sentence alignments.
- **Languages Supported:**
  - English
  - French
  - German
  - Italian
  - Portuguese
- **Structure:**
  - Each language has a corresponding `.txt` file.
  - Lines are aligned by index to form translation pairs.
- **Example Input Format:**
```
Source: translate English to French: I am a student.
Target: Je suis un étudiant.
```
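The index-aligned files and the prefixed input format above can be turned into training pairs along these lines. This is a minimal sketch, not the repository's actual preprocessing: the function name `build_pairs` and the in-memory `lines_by_language` dictionary are assumptions made for illustration, and only the `translate X to Y:` prefix is taken from the example above.

```python
def build_pairs(lines_by_language, source_lang, target_lang):
    """Pair index-aligned sentences and prepend the T5 task prefix.

    `lines_by_language` (hypothetical) maps a language name to its list of
    sentences, e.g. the lines read from english.txt and french.txt.
    """
    source_lines = lines_by_language[source_lang]
    target_lines = lines_by_language[target_lang]
    if len(source_lines) != len(target_lines):
        raise ValueError("Aligned files must have the same number of lines")

    prefix = f"translate {source_lang} to {target_lang}: "
    pairs = []
    for src, tgt in zip(source_lines, target_lines):
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:  # skip pairs where either side is empty
            pairs.append({"source": prefix + src, "target": tgt})
    return pairs


corpus = {
    "English": ["I am a student.", "How are you?"],
    "French": ["Je suis un étudiant.", "Comment allez-vous ?"],
}
pairs = build_pairs(corpus, "English", "French")
print(pairs[0]["source"])  # translate English to French: I am a student.
```

Checking that both files have the same number of lines catches a misaligned corpus early, before it silently produces wrong translation pairs.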
---
## Model Details
- **Architecture:** T5-small
- **Tokenizer:** `T5Tokenizer`
- **Model:** `T5ForConditionalGeneration`
- **Task Type:** Sequence-to-Sequence Translation (Supervised Fine-tuning)
---
## Installation
```bash
pip install transformers datasets torch sentencepiece gtts
```
---
## Loading the Model
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch

# Load the FP16 ("quantized") model and its tokenizer from the local directory
model = T5ForConditionalGeneration.from_pretrained("quantized_model", torch_dtype=torch.float16)
tokenizer = T5Tokenizer.from_pretrained("quantized_model")

# Translation example: the task prefix selects the language pair
source = "translate English to German: How are you?"
inputs = tokenizer(source, return_tensors="pt", padding=True, truncation=True)

# Generate without tracking gradients, then decode the token IDs back to text
with torch.no_grad():
    outputs = model.generate(**inputs)
print("Translated:", tokenizer.decode(outputs[0], skip_special_tokens=True))
```
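The translated text can then be passed to gTTS for speech output. The sketch below is an assumption about how that step might look, since the speech code itself is not shown in this README: `speak`, `LANG_CODES`, and the default output filename are hypothetical names, while `gTTS(text=..., lang=...)` and `.save(...)` are the library's own calls. Running it requires internet access.

```python
# Assumed mapping from the supported language names to gTTS language codes.
LANG_CODES = {
    "English": "en",
    "French": "fr",
    "German": "de",
    "Italian": "it",
    "Portuguese": "pt",
}


def speak(text, language, out_path="translation.mp3"):
    """Synthesize `text` in `language` and save it as an MP3 file."""
    from gtts import gTTS  # imported lazily; requires `pip install gtts`

    code = LANG_CODES[language]  # raises KeyError for unsupported languages
    gTTS(text=text, lang=code).save(out_path)
    return out_path
```

For example, `speak("Wie geht es dir?", "German")` would write the spoken sentence to `translation.mp3`.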
---
## Performance Metrics
Because the model was fine-tuned for only a single epoch, no performance metrics were computed. A production-level system should be evaluated with BLEU or ROUGE scores on a held-out test set.
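To give a sense of what a BLEU evaluation involves, here is a simplified, dependency-free sketch. It is an illustration only, assuming whitespace tokenization, one reference per sentence, and no smoothing; the function names are hypothetical, and an established tool such as sacreBLEU would be the better choice for scores meant to be reported.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of length `n` in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-100 scale (no smoothing)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngram_counts(h, n), ngram_counts(r, n)
            # Clipped matches: an n-gram counts at most as often as in the reference
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty for outputs shorter than the references
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


score = simple_corpus_bleu(["Je suis un étudiant ."], ["Je suis un étudiant ."])
print(f"BLEU: {score:.1f}")
```

Without smoothing, any corpus with no matching 4-grams scores zero, so this version is only informative on test sets of reasonable size.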
---
## Fine-Tuning Details
### Dataset Preparation
- Five text files (`english.txt`, `french.txt`, etc.), one per language.
- Sentences are aligned by line index to form parallel translation pairs.
### Training Configuration
- **Epochs:** 1
- **Batch size:** 4
- **Max sequence length:** 128
- **Model base:** `t5-small`
- **Framework:** Hugging Face Transformers + PyTorch
- **Evaluation split:** 10% of the data held out as a test set
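The 10% hold-out could be produced as in the sketch below. It is a plain-Python illustration under assumptions: `split_pairs`, the seed, and the `CONFIG` dictionary are hypothetical names that simply restate the settings listed above, and the training script may instead rely on `Dataset.train_test_split` from the `datasets` library.

```python
import random


def split_pairs(pairs, test_fraction=0.1, seed=42):
    """Shuffle the pairs and hold out `test_fraction` of them for testing."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Training settings listed above, gathered in one place
CONFIG = {"epochs": 1, "batch_size": 4, "max_length": 128, "base_model": "t5-small"}

train, test = split_pairs([f"pair-{i}" for i in range(100)])
print(len(train), len(test))  # 90 10
```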
---
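## Quantization
After training, the model weights were cast from FP32 to half precision (FP16) with `.half()`. This halves the model size and can speed up inference on hardware with FP16 support. This README refers to the FP16 model as the quantized model.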
```python
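from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load full-precision model
model_fp32 = T5ForConditionalGeneration.from_pretrained("model")
# Convert to half precision
model_fp16 = model_fp32.half()
model_fp16.save_pretrained("quantized_model")
# Save the tokenizer alongside it so "quantized_model" can be loaded on its own
T5Tokenizer.from_pretrained("model").save_pretrained("quantized_model")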
```
**Model Size Comparison:**
| Type             | Size (KB) |
|------------------|-----------|
| FP32 (Original)  | ~6,904    |
| FP16 (Quantized) | ~3,452    |
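A comparison like the one in the table can be reproduced by summing file sizes on disk. The helper names below are hypothetical, and the sketch assumes the two directories from this repository (`model/` and `quantized_model/`) exist locally and are not empty.

```python
from pathlib import Path


def dir_size_kb(directory):
    """Return the total size of all files under `directory` in KB (1 KB = 1024 bytes)."""
    total_bytes = sum(p.stat().st_size for p in Path(directory).rglob("*") if p.is_file())
    return total_bytes / 1024


def compare_sizes(fp32_dir="model", fp16_dir="quantized_model"):
    """Print the on-disk size of both model directories and their ratio."""
    fp32_kb, fp16_kb = dir_size_kb(fp32_dir), dir_size_kb(fp16_dir)
    print(f"FP32: {fp32_kb:,.0f} KB | FP16: {fp16_kb:,.0f} KB | ratio: {fp16_kb / fp32_kb:.2f}")
```

Because tokenizer and config files are not affected by the precision change, a ratio measured this way may come out slightly above 0.5.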
---
## Repository Structure
```
.
├── model/                      # Contains FP32 model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   └── ...
├── quantized_model/            # Contains FP16 quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   └── ...
├── README.md                   # Documentation
└── multilingual_translator.py  # Training and inference script
```
---
## Limitations
- Trained on a small dataset for only one epoch, so it may not generalize well to all phrases or complex sentences.
- Language coverage is limited to 5 predefined languages.
- gTTS depends on the Google Translate text-to-speech API and requires internet access.
---
## Contributing
Feel free to submit issues or PRs to add more language pairs, extend training, or integrate a UI for real-time use.