# EgyTriplet Fine-Tuned Model 🇪🇬
This model is a fine-tuned version of multilingual-e5-large
for semantic embedding tasks in Egyptian Arabic and Modern Standard Arabic (MSA).
It was trained on the EgyTriplets-2M dataset, a large-scale triplet dataset built with an automated pipeline of translation, quality scoring, and hard-negative mining.
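A minimal usage sketch is shown below. It assumes the checkpoint loads through `sentence-transformers` like its E5 base model and that inputs carry the E5-style `query: ` / `passage: ` prefixes; the Arabic sentences are made-up illustrations, not taken from the training data.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("metga97/egytriplet-e5-large-instruct")

# E5-style models expect "query: " / "passage: " input prefixes.
query = "query: ازاي اروح المطار من وسط البلد؟"        # Egyptian Arabic
passages = [
    "passage: كيف أصل إلى المطار من وسط المدينة؟",      # MSA paraphrase
    "passage: أفضل المطاعم في القاهرة",                 # unrelated
]

q_emb = model.encode([query], normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
print(util.cos_sim(q_emb, p_emb))  # higher score = more similar
```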
## 💡 What's Special?
- Fine-tuned using Triplet Loss
- Supports dialectal and standard Arabic embedding
- Trained on 2 million anchor-positive-negative triplets
- Boosts retrieval, semantic search, and paraphrase detection tasks
## 🧪 Training Details
| Setting | Value |
|---|---|
| Base Model | multilingual-e5-large / multilingual-e5-large-instruct |
| Dataset | EgyTriplets-2M |
| Loss Function | TripletMarginLoss (margin = 0.3) |
| Batch Size | 16 |
| Epochs | 3 |
| Learning Rate | 2e-5 |
| Hardware | NVIDIA A100 40GB |
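The settings above suggest a training step along the following lines. This is an illustrative sketch reconstructed from the table, not the authors' released code; the pooling scheme and helper names are assumptions.

```python
import torch
from torch.nn import TripletMarginLoss
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large")
model = AutoModel.from_pretrained("intfloat/multilingual-e5-large")
loss_fn = TripletMarginLoss(margin=0.3)           # margin from the table
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def embed(texts):
    # Texts would carry E5-style "query: " / "passage: " prefixes in practice.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    out = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    emb = (out * mask).sum(1) / mask.sum(1)       # mean pooling, as in E5
    return torch.nn.functional.normalize(emb, dim=-1)

def train_step(anchors, positives, negatives):    # lists of 16 strings each
    loss = loss_fn(embed(anchors), embed(positives), embed(negatives))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```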
## 📊 Performance (Triplet Loss)
| Model | Final Triplet Loss |
|---|---|
| egytriplet-e5-large | 0.107 |
| egytriplet-e5-large-instruct | 0.103 |
Lower loss = better semantic alignment.
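As a sketch of how such numbers can be measured on held-out triplets, reusing the hypothetical `embed` helper from the training sketch above and adding triplet accuracy as a complementary, easier-to-read metric:

```python
import torch

def triplet_metrics(anchors, positives, negatives, margin=0.3):
    # `embed` is the hypothetical helper defined in the training sketch
    # above (mean-pooled, L2-normalized embeddings).
    with torch.no_grad():
        a, p, n = embed(anchors), embed(positives), embed(negatives)
    loss = torch.nn.functional.triplet_margin_loss(a, p, n, margin=margin)
    # Complementary metric: how often the positive outranks the negative
    # (dot product of normalized vectors equals cosine similarity).
    acc = ((a * p).sum(-1) > (a * n).sum(-1)).float().mean()
    return loss.item(), acc.item()
```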
## 🧠 Intended Use
This model is best suited for:
- Semantic similarity
- Information retrieval
- Search ranking
- Arabic paraphrase detection
- Dialect-to-MSA alignment tasks (see the sketch below)
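For the dialect-to-MSA case, a minimal sketch using the `semantic_search` utility from `sentence-transformers` (the example sentences are invented):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("metga97/egytriplet-e5-large-instruct")

egy = ["query: انا تعبان اوي النهارده"]       # Egyptian: "I'm very tired today"
msa = [
    "passage: أشعر بتعب شديد اليوم",           # MSA paraphrase
    "passage: الطقس جميل هذا الصباح",          # unrelated
]
hits = util.semantic_search(
    model.encode(egy, convert_to_tensor=True, normalize_embeddings=True),
    model.encode(msa, convert_to_tensor=True, normalize_embeddings=True),
    top_k=1,
)
print(hits)  # [[{'corpus_id': 0, 'score': ...}]] -> best MSA match per query
```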
## 🗣️ Languages
- Egyptian Arabic (ar-eg)
- Modern Standard Arabic (msa)
## 📄 License
CC BY 4.0: free to use, adapt, and share with attribution.
## 💬 Citation
If you use this model in your work, please cite:
```bibtex
@misc{egytriplets2024,
  author = {Mohammad Essam},
  title  = {EgyTriplets: Generating 2 Million Egyptian Arabic Triplets via Transformer-Based Translation and Retrieval},
  year   = {2024},
  url    = {https://huggingface.co/datasets/metga97/egytriplets-2m}
}
```