# 🌐 T5-Based Multilingual Text Translator

This repository presents a fine-tuned T5-small model for multilingual text translation across English, French, German, Italian, and Portuguese. It includes quantization for efficient inference and speech synthesis support for accessibility.

---

## 📝 Problem Statement

The goal is to translate text between English and multiple European languages using a transformer-based model. Instead of using black-box APIs, this project fine-tunes the T5 model on parallel multilingual corpora, enabling offline translation and potential customization.

---

## 📊 Dataset

- **Source:** Custom parallel corpus (`.txt` files) with one-to-one sentence alignments.
- **Languages Supported:**
  - English
  - French
  - German
  - Italian
  - Portuguese

- **Structure:**
  - Each language has a corresponding `.txt` file.
  - Lines are aligned by index to form translation pairs.

- **Example Input Format:**
  ```
  Source: translate English to French: I am a student.
  Target: Je suis un étudiant.
  ```
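
The index-aligned layout above can be turned into prompt/target pairs with a few lines of Python. This is a minimal sketch, not the repository's actual preprocessing: the names `read_lines` and `build_pairs` are hypothetical, only `english.txt` and `french.txt` are named in this README (the other file names are guesses), and generating every ordered language pair is an assumption that may be broader than what was used for training.

```python
from itertools import permutations
from pathlib import Path

# Assumed mapping from language name to corpus file. Only english.txt and
# french.txt are named in this README; the other file names are guesses.
LANGUAGE_FILES = {
    "English": "english.txt",
    "French": "french.txt",
    "German": "german.txt",
    "Italian": "italian.txt",
    "Portuguese": "portuguese.txt",
}


def read_lines(path):
    """Return the stripped lines of a UTF-8 text file."""
    return [line.strip() for line in Path(path).read_text(encoding="utf-8").splitlines()]


def build_pairs(corpus):
    """Turn {language: [sentences]} into (source, target) training pairs.

    Sentences at the same index are assumed to be translations of each other.
    """
    lengths = {len(sentences) for sentences in corpus.values()}
    if len(lengths) > 1:
        raise ValueError("corpus files are not aligned: line counts differ")

    pairs = []
    # Assumption: every ordered pair of languages is used as a direction.
    for src_lang, tgt_lang in permutations(corpus, 2):
        for src, tgt in zip(corpus[src_lang], corpus[tgt_lang]):
            if src and tgt:  # skip blank lines
                pairs.append((f"translate {src_lang} to {tgt_lang}: {src}", tgt))
    return pairs
```

With the files in place, `build_pairs({lang: read_lines(path) for lang, path in LANGUAGE_FILES.items()})` would produce the full pair list.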

---

## 🧠 Model Details

- **Architecture:** T5-small
- **Tokenizer:** `T5Tokenizer`
- **Model:** `T5ForConditionalGeneration`
- **Task Type:** Sequence-to-Sequence Translation (Supervised Fine-tuning)

---

## 🔧 Installation

```bash
pip install transformers datasets torch gtts sentencepiece
```

---

## 🚀 Loading the Model

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch

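# FP16 inference is best supported on GPU; fall back to FP32 on CPU,
# where some PyTorch builds lack half-precision kernels
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Load the FP16 checkpoint (weights are upcast when running on CPU)
model = T5ForConditionalGeneration.from_pretrained("quantized_model", torch_dtype=dtype).to(device)
tokenizer = T5Tokenizer.from_pretrained("quantized_model")

# Translation example
source = "translate English to German: How are you?"
inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=128).to(device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128)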

print("Translated:", tokenizer.decode(outputs[0], skip_special_tokens=True))
```
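
The introduction mentions speech synthesis support through gTTS. The snippet below is a rough sketch of how a translated string might be voiced; it is not taken from `multilingual_translator.py`. The `speak` helper, the `TTS_CODES` mapping, and the default output path are hypothetical names introduced here for illustration.

```python
# ISO 639-1 codes used by gTTS for the five supported languages.
TTS_CODES = {
    "English": "en",
    "French": "fr",
    "German": "de",
    "Italian": "it",
    "Portuguese": "pt",
}


def speak(text, language, out_path="translation.mp3"):
    """Synthesize text with gTTS and save it as an MP3 (needs internet access)."""
    # Imported lazily so the rest of a script still works without gTTS installed.
    from gtts import gTTS

    gTTS(text=text, lang=TTS_CODES[language]).save(out_path)
    return out_path
```

For example, `speak("Wie geht es dir?", "German")` would write `translation.mp3` to the working directory, provided the machine is online.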

---

## 📈 Performance Metrics

Because the model was fine-tuned for only a single epoch, no performance metrics were computed. A production-level system should be evaluated with BLEU or ROUGE scores on the held-out test split.
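
As a starting point for such an evaluation, BLEU can be approximated in pure Python. The sketch below is a simplified corpus-level BLEU, assuming whitespace tokenization, one reference per hypothesis, and no smoothing, so its scores will not match standard tools exactly. The function names are illustrative; an established package such as `sacrebleu` is the better choice for reported numbers.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU in the range 0..1.

    Assumes one reference per hypothesis, whitespace tokenization,
    and no smoothing.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clipped counts: an n-gram is credited at most as often
            # as it occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any n-gram order with zero matches gives BLEU = 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)
```

Here `hypotheses` would be the decoded model outputs for the test split and `references` the corresponding target sentences.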

---

## 🏋️ Fine-Tuning Details

### 📚 Dataset Preparation

- Five text files (`english.txt`, `french.txt`, etc.), one per language.
- Sentences are aligned by line index to form parallel translation pairs.

### 🔧 Training Configuration

- **Epochs:** 1  
- **Batch size:** 4  
- **Max sequence length:** 128  
- **Model base:** `t5-small`  
- **Framework:** Hugging Face Transformers + PyTorch  
- **Evaluation strategy:** 10% test split  
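
The 10% test split can be reproduced with a seeded shuffle. This is a minimal sketch under assumptions: the helper name `train_test_split`, the seed value, and the use of plain Python lists are illustrative, and the actual script may rely on the `datasets` library instead.

```python
import random


def train_test_split(pairs, test_fraction=0.1, seed=42):
    """Shuffle pairs reproducibly and hold out test_fraction of them.

    Returns (train_pairs, test_pairs).
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one test example whenever there is any data.
    n_test = max(1, round(len(shuffled) * test_fraction)) if shuffled else 0
    return shuffled[n_test:], shuffled[:n_test]
```

If the same sentence appears in several translation directions, splitting by sentence index before building the pairs avoids leaking test sentences into the training set.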

---

## 🔄 Quantization

Post-training quantization was performed by casting the model weights from FP32 to half precision (FP16) with `.half()` to reduce model size and improve inference speed.

```python
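from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the full-precision model and its tokenizer
model_fp32 = T5ForConditionalGeneration.from_pretrained("model")
tokenizer = T5Tokenizer.from_pretrained("model")

# Convert to half precision (in place) and save alongside the tokenizer
model_fp16 = model_fp32.half()
model_fp16.save_pretrained("quantized_model")
tokenizer.save_pretrained("quantized_model")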
```

**Model Size Comparison:**

| Type             | Size (KB) |
|------------------|-----------|
| FP32 (Original)  | ~6,904    |
| FP16 (Quantized) | ~3,452    |
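
The 2:1 ratio in the table follows from storage width: FP32 uses 4 bytes per parameter and FP16 uses 2. The sketch below shows that arithmetic and one way to measure a loaded model. The helper names are hypothetical, and the measurement counts parameters only, so it ignores buffers and file metadata.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def estimated_size_kb(num_params, dtype):
    """Approximate weight storage in KB (1 KB = 1024 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024


def parameter_size_kb(model):
    """Measure the parameter storage of a loaded PyTorch model in KB.

    Uses model.parameters(), which counts tied weights once.
    """
    total_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    return total_bytes / 1024
```

Calling `parameter_size_kb` on the model before and after `.half()` should show the same halving as the table.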

---

## 📁 Repository Structure

```
.
├── model/                       # Contains FP32 model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   └── ...
├── quantized_model/            # Contains FP16 quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   └── ...
├── README.md                   # Documentation
└── multilingual_translator.py  # Training and inference script
```

---

## ⚠️ Limitations

- Trained on a small dataset for only one epoch, so it may not generalize well to all phrases or complex sentences.
- Language coverage is limited to 5 predefined languages.
- Speech synthesis relies on gTTS, which calls Google's online text-to-speech service and therefore requires internet access.

---

## 🤝 Contributing

Feel free to submit issues or PRs to add more language pairs, extend training, or integrate a UI for real-time use.