# 🌐 T5-Based Multilingual Text Translator
This repository presents a fine-tuned T5-small model for multilingual text translation across English, French, German, Italian, and Portuguese. It includes quantization for efficient inference and speech synthesis support for accessibility.
---
## 📝 Problem Statement
The goal is to translate text between English and multiple European languages using a transformer-based model. Instead of using black-box APIs, this project fine-tunes the T5 model on parallel multilingual corpora, enabling offline translation and potential customization.
---
## 📊 Dataset
- **Source:** Custom parallel corpus (`.txt` files) with one-to-one sentence alignments.
- **Languages Supported:**
  - English
  - French
  - German
  - Italian
  - Portuguese
- **Structure:**
  - Each language has a corresponding `.txt` file.
  - Lines are aligned by index to form translation pairs.
- **Example Input Format:**
```
Source: translate English to French: I am a student.
Target: Je suis un étudiant.
```
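
The index alignment and task prefix described above can be sketched as follows. This is a minimal illustration, not the repository's actual preprocessing code: `build_pairs` and the in-memory `corpus` dictionary are hypothetical stand-ins for reading the aligned `.txt` files.

```python
from typing import Dict, List, Tuple


def build_pairs(
    corpus: Dict[str, List[str]], source_lang: str, target_lang: str
) -> List[Tuple[str, str]]:
    """Pair line i of the source language with line i of the target language."""
    src_lines = corpus[source_lang]
    tgt_lines = corpus[target_lang]
    if len(src_lines) != len(tgt_lines):
        raise ValueError("Aligned files must have the same number of lines")
    # T5-style task prefix, matching the example input format above
    prefix = f"translate {source_lang} to {target_lang}: "
    return [(prefix + s.strip(), t.strip()) for s, t in zip(src_lines, tgt_lines)]


# Hypothetical stand-in for the contents of english.txt and french.txt
corpus = {
    "English": ["I am a student.", "How are you?"],
    "French": ["Je suis un étudiant.", "Comment allez-vous ?"],
}
pairs = build_pairs(corpus, "English", "French")
print(pairs[0])
```

Raising on a length mismatch is a deliberate choice here: a single missing line would silently shift every later pair out of alignment.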
---
## 🧠 Model Details
- **Architecture:** T5-small
- **Tokenizer:** `T5Tokenizer`
- **Model:** `T5ForConditionalGeneration`
- **Task Type:** Sequence-to-Sequence Translation (Supervised Fine-tuning)
---
## 🔧 Installation
```bash
pip install transformers datasets torch sentencepiece gtts
```
---
## 🚀 Loading the Model
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch
# Load quantized model (float16)
model = T5ForConditionalGeneration.from_pretrained("quantized_model", torch_dtype=torch.float16)
tokenizer = T5Tokenizer.from_pretrained("quantized_model")
# Translation example
source = "translate English to German: How are you?"
inputs = tokenizer(source, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model.generate(**inputs)
print("Translated:", tokenizer.decode(outputs[0], skip_special_tokens=True))
```
---
## 📈 Performance Metrics
Because the model was fine-tuned for a single epoch, no evaluation metrics were computed. A production-level system should be evaluated with BLEU or ROUGE scores on the held-out test split.
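
For a rough idea of what a BLEU evaluation involves, here is a minimal pure-Python sketch. It is not part of this repository: `corpus_bleu` is a hypothetical helper that assumes whitespace tokenization, one reference per hypothesis, and no smoothing. Scores intended for reporting are better computed with an established implementation such as sacreBLEU.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[str], references: List[str], max_n: int = 4) -> float:
    """Unsmoothed corpus-level BLEU with a single reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    # Without smoothing, any n-gram order with zero matches gives a score of 0
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


# Hypothetical model output compared against a reference translation
print(corpus_bleu(["Wie geht es dir ?"], ["Wie geht es dir ?"]))
```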
---
## 🏋️ Fine-Tuning Details
### 📚 Dataset Preparation
- Five text files, one per language (`english.txt`, `french.txt`, etc.).
- Sentences are aligned by line index to form parallel translation pairs.
### 🔧 Training Configuration
- **Epochs:** 1
- **Batch size:** 4
- **Max sequence length:** 128
- **Model base:** `t5-small`
- **Framework:** Hugging Face Transformers + PyTorch
- **Evaluation strategy:** 10% of the data held out as a test split
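
The 10% hold-out can be sketched as below. This is an illustrative assumption about how such a split might be done, not the code from `multilingual_translator.py`; the `train_test_split` helper and the fixed seed are hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_test_split(
    items: Sequence[T], test_fraction: float = 0.1, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Shuffle deterministically, then hold out `test_fraction` of the items."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG keeps the split reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical stand-in for 100 (source, target) translation pairs
examples = [(f"source {i}", f"target {i}") for i in range(100)]
train, test = train_test_split(examples)
print(len(train), len(test))
```

Shuffling before splitting matters when the corpus files are ordered by topic or length, since a plain tail slice would then give an unrepresentative test set.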
---
## 🔄 Quantization
After training, the model weights were converted to half precision (FP16) with `.half()`. This roughly halves the model size and can improve inference speed on hardware with FP16 support.
```python
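from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load full-precision model
model_fp32 = T5ForConditionalGeneration.from_pretrained("model")
# Convert to half precision
model_fp16 = model_fp32.half()
model_fp16.save_pretrained("quantized_model")
# Save the tokenizer alongside the FP16 weights so the directory loads on its own
T5Tokenizer.from_pretrained("model").save_pretrained("quantized_model")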
```
**Model Size Comparison:**
| Type             | Size (KB) |
|------------------|-----------|
| FP32 (Original)  | ~6,904    |
| FP16 (Quantized) | ~3,452    |
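
One way to reproduce a size comparison like this is to sum the file sizes in each model directory. The sketch below is an assumption about how the figures could be measured, not the method used for the table; `dir_size_kb` is a hypothetical helper, and the directory names follow the repository layout.

```python
from pathlib import Path


def dir_size_kb(path: str) -> float:
    """Total size of all regular files under `path`, in kilobytes (1 KB = 1024 bytes)."""
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file()) / 1024


# Report sizes only for directories that exist locally
for name in ("model", "quantized_model"):
    if Path(name).is_dir():
        print(f"{name}: {dir_size_kb(name):,.0f} KB")
```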
---
## 📁 Repository Structure
```
.
├── model/                       # Contains FP32 model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   └── ...
├── quantized_model/             # Contains FP16 quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   └── ...
├── README.md                    # Documentation
└── multilingual_translator.py   # Training and inference script
```
---
## ⚠️ Limitations
- Trained on a small dataset for only one epoch, so it may not generalize well to all phrases or complex sentences.
- Language coverage is limited to 5 predefined languages.
- Speech synthesis uses gTTS, which depends on Google's online text-to-speech service and requires internet access.
---
## 🤝 Contributing
Feel free to submit issues or PRs to add more language pairs, extend training, or integrate a UI for real-time use.