|
# T5-Based Multilingual Text Translator
|
|
|
This repository provides a T5-small model fine-tuned for text translation between English, French, German, Italian, and Portuguese. It includes FP16 quantization for efficient inference and gTTS-based speech synthesis for accessibility.
|
|
|
--- |
|
|
|
## Problem Statement
|
|
|
The goal is to translate text between English and multiple European languages using a transformer-based model. Instead of using black-box APIs, this project fine-tunes the T5 model on parallel multilingual corpora, enabling offline translation and potential customization. |
|
|
|
--- |
|
|
|
## Dataset
|
|
|
- **Source:** Custom parallel corpus (`.txt` files) with one-to-one sentence alignments. |
|
- **Languages Supported:** |
|
- English |
|
- French |
|
- German |
|
- Italian |
|
- Portuguese |
|
|
|
- **Structure:** |
|
- Each language has a corresponding `.txt` file. |
|
- Lines are aligned by index to form translation pairs. |
|
|
|
- **Example Input Format:** |
|
```
Source: translate English to French: I am a student.
Target: Je suis un étudiant.
```
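
For illustration, aligned files can be paired into prefixed training examples as follows. This is a minimal sketch; the `load_pairs` helper is hypothetical, and the filenames follow the convention described in the Fine-Tuning section below:

```python
from pathlib import Path

def load_pairs(src_file, tgt_file, src_lang, tgt_lang):
    """Pair lines by index and prepend the T5 task prefix."""
    src_lines = Path(src_file).read_text(encoding="utf-8").splitlines()
    tgt_lines = Path(tgt_file).read_text(encoding="utf-8").splitlines()
    prefix = f"translate {src_lang} to {tgt_lang}: "
    # zip() stops at the shorter file; the corpus is assumed to be strictly one-to-one.
    return [(prefix + s.strip(), t.strip()) for s, t in zip(src_lines, tgt_lines)]

pairs = load_pairs("english.txt", "french.txt", "English", "French")
# e.g. ('translate English to French: I am a student.', 'Je suis un étudiant.')
```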
|
|
|
--- |
|
|
|
## Model Details
|
|
|
- **Architecture:** T5-small |
|
- **Tokenizer:** `T5Tokenizer` |
|
- **Model:** `T5ForConditionalGeneration` |
|
- **Task Type:** Sequence-to-Sequence Translation (Supervised Fine-tuning) |
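
For reference, the pre-trained base is instantiated with the classes listed above before any fine-tuning:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Pre-trained starting point; fine-tuning updates these weights.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
```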
|
|
|
--- |
|
|
|
## Installation
|
|
|
```bash
pip install transformers datasets torch gtts sentencepiece
```

`sentencepiece` is required by `T5Tokenizer`.
|
|
|
--- |
|
|
|
## Loading the Model
|
|
|
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch

# Load the quantized (FP16) model and its tokenizer.
# Note: FP16 inference is intended for GPU; on CPU, cast back with model.float().
model = T5ForConditionalGeneration.from_pretrained("quantized_model", torch_dtype=torch.float16)
tokenizer = T5Tokenizer.from_pretrained("quantized_model")
model.eval()

# Translation example: the task prefix selects the language pair.
source = "translate English to German: How are you?"
inputs = tokenizer(source, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    outputs = model.generate(**inputs, max_length=128)

print("Translated:", tokenizer.decode(outputs[0], skip_special_tokens=True))
```
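
Since the project bundles gTTS for speech synthesis, the translated text can also be spoken aloud. A minimal sketch; the language code (`"de"` for the German example above) is assumed to match gTTS's supported identifiers, and an internet connection is required:

```python
from gtts import gTTS

# Synthesize the translation to an MP3 file (calls Google's TTS service).
translated = tokenizer.decode(outputs[0], skip_special_tokens=True)
gTTS(translated, lang="de").save("translation.mp3")
```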
|
|
|
--- |
|
|
|
## Performance Metrics
|
|
|
Because the model was fine-tuned for only a single epoch, performance metrics were not explicitly computed. For a production-level system, BLEU or ROUGE scores should be evaluated on a held-out test set, as sketched below.
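
As an illustration, BLEU could be computed with the `evaluate` library's sacreBLEU wrapper (an extra dependency not listed in the installation step):

```python
import evaluate  # pip install evaluate sacrebleu

bleu = evaluate.load("sacrebleu")
predictions = ["Je suis un étudiant."]    # model outputs
references = [["Je suis un étudiant."]]   # one or more references per prediction
result = bleu.compute(predictions=predictions, references=references)
print(f"BLEU: {result['score']:.2f}")
```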
|
|
|
--- |
|
|
|
## Fine-Tuning Details
|
|
|
### Dataset Preparation
|
|
|
- A total of 5 text files (`english.txt`, `french.txt`, etc.), one per language.

- Sentences are aligned by line index to form parallel translation pairs.
|
|
|
### Training Configuration
|
|
|
- **Epochs:** 1 |
|
- **Batch size:** 4 |
|
- **Max sequence length:** 128 |
|
- **Model base:** `t5-small` |
|
- **Framework:** Hugging Face Transformers + PyTorch |
|
- **Evaluation strategy:** 10% test split |
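
For reference, a minimal training loop matching this configuration might look like the sketch below. It reuses the `pairs` list from the Dataset section; the learning rate is an assumed value, since the original optimizer settings are not documented here:

```python
import torch
from torch.utils.data import DataLoader
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def collate(batch):
    sources, targets = zip(*batch)
    enc = tokenizer(list(sources), padding=True, truncation=True,
                    max_length=128, return_tensors="pt")
    labels = tokenizer(list(targets), padding=True, truncation=True,
                       max_length=128, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    enc["labels"] = labels
    return enc

split = int(0.9 * len(pairs))  # 10% held out, per the evaluation strategy
train_loader = DataLoader(pairs[:split], batch_size=4, shuffle=True, collate_fn=collate)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # lr is an assumption
model.train()
for _ in range(1):  # single epoch
    for batch in train_loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

model.save_pretrained("model")
tokenizer.save_pretrained("model")
```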
|
|
|
--- |
|
|
|
## Quantization
|
|
|
Post-training quantization was performed with PyTorch's `.half()` conversion to FP16, halving the model size and speeding up inference on FP16-capable hardware.
|
|
|
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the full-precision model
model_fp32 = T5ForConditionalGeneration.from_pretrained("model")

# Convert weights to half precision and save
model_fp16 = model_fp32.half()
model_fp16.save_pretrained("quantized_model")

# Also save the tokenizer so both load from the same directory
T5Tokenizer.from_pretrained("model").save_pretrained("quantized_model")
```
|
|
|
**Model Size Comparison:** |
|
|
|
| Type | Size (KB) | |
|
|------------------|-----------| |
|
| FP32 (Original) | ~6,904 KB | |
|
| FP16 (Quantized) | ~3,452 KB | |
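
These figures can be reproduced by summing the file sizes in each directory (assuming the layout shown in the Repository Structure section below):

```python
import os

def dir_size_kb(path):
    """Total size of all files under `path`, in KB."""
    return sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, files in os.walk(path)
        for name in files
    ) / 1024

print(f"FP32: {dir_size_kb('model'):,.0f} KB")
print(f"FP16: {dir_size_kb('quantized_model'):,.0f} KB")
```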
|
|
|
--- |
|
|
|
## Repository Structure
|
|
|
```
.
├── model/                       # Contains FP32 model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   └── ...
├── quantized_model/             # Contains FP16 quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   └── ...
├── README.md                    # Documentation
└── multilingual_translator.py   # Training and inference script
```
|
|
|
--- |
|
|
|
## Limitations
|
|
|
- Trained on a small dataset for only one epoch, so the model may not generalize well to all phrases or complex sentences.

- Language coverage is limited to the 5 predefined languages.

- gTTS depends on Google's Text-to-Speech service and requires internet access.
|
|
|
--- |
|
|
|
## Contributing
|
|
|
Feel free to submit issues or PRs to add more language pairs, extend training, or integrate a UI for real-time use.
|
|