EgyTriplet Fine-Tuned Model πŸ‡ͺπŸ‡¬

This model is a fine-tuned version of multilingual-e5-large for semantic embedding tasks in Egyptian Arabic and Modern Standard Arabic (MSA).

It was trained on the EgyTriplets - 2M dataset, a large-scale triplet dataset built with an automated pipeline that combines translation, quality scoring, and hard-negative mining.
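
Below is a minimal usage sketch with sentence-transformers. The repo id `metga97/egytriplet-e5-large-instruct` is taken from this page; the `query:` / `passage:` prefixes are an assumption carried over from the multilingual-e5-large convention and may not apply if the fine-tuning dropped them.

```python
# A minimal sketch, assuming the sentence-transformers package and the
# repo id "metga97/egytriplet-e5-large-instruct" from this page.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("metga97/egytriplet-e5-large-instruct")

# E5-family models expect "query: " / "passage: " prefixes; we assume the
# fine-tuned model keeps that convention from multilingual-e5-large.
query = "query: ازاي اروح المطار من هنا؟"             # Egyptian Arabic
passages = [
    "passage: كيف أصل إلى المطار من هنا؟",            # MSA paraphrase
    "passage: أفضل الأماكن لشراء الكتب في القاهرة",   # unrelated
]

q_emb = model.encode(query, normalize_embeddings=True)
p_embs = model.encode(passages, normalize_embeddings=True)

# With normalized embeddings, the dot product equals cosine similarity.
scores = p_embs @ q_emb
print(scores)  # the paraphrase should score higher than the unrelated passage
```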


πŸ’‘ What’s Special?

  • Fine-tuned using Triplet Loss
  • Supports dialectal and standard Arabic embedding
  • Trained on 2 million anchor-positive-negative triplets
  • Improves retrieval, semantic search, and paraphrase detection

πŸ§ͺ Training Details

| Setting | Value |
|---|---|
| Base Model | multilingual-e5-large / multilingual-e5-large-instruct |
| Dataset | EgyTriplets - 2M |
| Loss Function | TripletMarginLoss (margin = 0.3) |
| Batch Size | 16 |
| Epochs | 3 |
| Learning Rate | 2e-5 |
| Hardware | NVIDIA A100 40GB |
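
Below is a sketch of the training objective, assuming PyTorch's built-in `nn.TripletMarginLoss` with the margin from the table above; the actual training script, pooling strategy, and distance metric may differ.

```python
# A sketch of the objective, assuming PyTorch's built-in TripletMarginLoss
# with the margin reported above (0.3). Pooling and distance metric here
# are assumptions, not the confirmed training setup.
import torch
import torch.nn as nn

loss_fn = nn.TripletMarginLoss(margin=0.3)  # default p=2 (Euclidean distance)

# Toy anchor/positive/negative embeddings standing in for pooled model outputs.
anchor   = torch.randn(16, 1024)  # batch size 16, e5-large hidden size 1024
positive = anchor + 0.05 * torch.randn(16, 1024)  # close to the anchor
negative = torch.randn(16, 1024)                  # a mined hard negative in practice

# loss = mean(max(d(a, p) - d(a, n) + margin, 0))
loss = loss_fn(anchor, positive, negative)
print(loss.item())
```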

πŸ“ˆ Performance (Triplet Loss)

| Model | Final Triplet Loss |
|---|---|
| egytriplet-e5-large | 0.107 |
| egytriplet-e5-large-instruct | 0.103 |

Lower loss indicates better semantic alignment.


🧠 Intended Use

This model is best suited for:

  • Semantic similarity
  • Information retrieval
  • Search ranking
  • Arabic paraphrase detection
  • Dialect-to-MSA alignment tasks (see the sketch below)
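
As a concrete example of dialect-to-MSA alignment, the sketch below embeds Egyptian Arabic queries and MSA candidates, then matches each query to its nearest candidate. The repo id and the prefix convention are the same assumptions as in the earlier example.

```python
# A hedged sketch of dialect-to-MSA alignment; repo id and prefixes are
# assumptions carried over from the usage example above.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("metga97/egytriplet-e5-large-instruct")

egy = ["query: النهارده الجو حر اوي", "query: عايز اكل حاجة حلوة"]
msa = [
    "passage: الطقس حار جدًا اليوم",
    "passage: أريد أن آكل شيئًا حلوًا",
]

egy_emb = model.encode(egy, normalize_embeddings=True)
msa_emb = model.encode(msa, normalize_embeddings=True)

sims = util.cos_sim(egy_emb, msa_emb)  # (2, 2) similarity matrix
best = sims.argmax(dim=1)              # index of the closest MSA sentence
for q, idx in zip(egy, best.tolist()):
    print(q, "->", msa[idx])
```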

πŸ—£οΈ Languages

  • Egyptian Arabic (ar-eg)
  • Modern Standard Arabic (msa)

πŸ“„ License

CC BY 4.0 β€” free to use, adapt, and share with attribution.


πŸ“¬ Citation

If you use this model in your work, please cite:

@misc{egytriplets2024,
  author = {Mohammad Essam},
  title = {EgyTriplets: Generating 2 Million Egyptian Arabic Triplets via Transformer-Based Translation and Retrieval},
  year = {2024},
  url = {https://huggingface.co/datasets/metga97/egytriplets-2m}
}