EgyTriplet Fine-Tuned Model πŸ‡ͺπŸ‡¬

This model is a fine-tuned version of multilingual-e5-large for semantic embedding tasks in Egyptian Arabic and Modern Standard Arabic (MSA).

It was trained on the EgyTriplets - 2M dataset, a large-scale triplet dataset built with an automated pipeline that combines translation, quality scoring, and hard-negative mining.
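
Below is a minimal usage sketch with sentence-transformers. The repo id `metga97/egytriplet-e5-large-instruct` is taken from this page; the `query:` / `passage:` prefixes are an assumption carried over from the multilingual-e5-large convention and may not apply if the fine-tuning dropped them.

```python
# A minimal sketch, assuming the sentence-transformers package and the
# repo id "metga97/egytriplet-e5-large-instruct" from this page.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("metga97/egytriplet-e5-large-instruct")

# E5-family models expect "query: " / "passage: " prefixes; we assume the
# fine-tuned model keeps that convention from multilingual-e5-large.
query = "query: ازاي اروح المطار من هنا؟"             # Egyptian Arabic
passages = [
    "passage: كيف أصل إلى المطار من هنا؟",            # MSA paraphrase
    "passage: أفضل الأماكن لشراء الكتب في القاهرة",   # unrelated
]

q_emb = model.encode(query, normalize_embeddings=True)
p_embs = model.encode(passages, normalize_embeddings=True)

# With normalized embeddings, the dot product equals cosine similarity.
scores = p_embs @ q_emb
print(scores)  # the paraphrase should score higher than the unrelated passage
```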


πŸ’‘ What’s Special?

  • Fine-tuned using Triplet Loss
  • Supports dialectal and standard Arabic embedding
  • Trained on 2 million anchor-positive-negative triplets
  • Improves retrieval, semantic search, and paraphrase detection

πŸ§ͺ Training Details

| Setting | Value |
|---|---|
| Base Model | multilingual-e5-large / multilingual-e5-large-instruct |
| Dataset | EgyTriplets - 2M |
| Loss Function | TripletMarginLoss (margin = 0.3) |
| Batch Size | 16 |
| Epochs | 3 |
| Learning Rate | 2e-5 |
| Hardware | NVIDIA A100 40GB |
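
Below is a sketch of the training objective, assuming PyTorch's built-in `nn.TripletMarginLoss` with the margin from the table above; the actual training script, pooling strategy, and distance metric may differ.

```python
# A sketch of the objective, assuming PyTorch's built-in TripletMarginLoss
# with the margin reported above (0.3). Pooling and distance metric here
# are assumptions, not the confirmed training setup.
import torch
import torch.nn as nn

loss_fn = nn.TripletMarginLoss(margin=0.3)  # default p=2 (Euclidean distance)

# Toy anchor/positive/negative embeddings standing in for pooled model outputs.
anchor   = torch.randn(16, 1024)  # batch size 16, e5-large hidden size 1024
positive = anchor + 0.05 * torch.randn(16, 1024)  # close to the anchor
negative = torch.randn(16, 1024)                  # a mined hard negative in practice

# loss = mean(max(d(a, p) - d(a, n) + margin, 0))
loss = loss_fn(anchor, positive, negative)
print(loss.item())
```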

πŸ“ˆ Performance (Triplet Loss)

| Model | Final Triplet Loss |
|---|---|
| egytriplet-e5-large | 0.107 |
| egytriplet-e5-large-instruct | 0.103 |

Lower loss indicates better semantic alignment.


🧠 Intended Use

This model is best suited for:

  • Semantic similarity
  • Information retrieval
  • Search ranking
  • Arabic paraphrase detection
  • Dialect-to-MSA alignment tasks (see the sketch below)
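
As a concrete example of dialect-to-MSA alignment, the sketch below embeds Egyptian Arabic queries and MSA candidates, then matches each query to its nearest candidate. The repo id and the prefix convention are the same assumptions as in the earlier example.

```python
# A hedged sketch of dialect-to-MSA alignment; repo id and prefixes are
# assumptions carried over from the usage example above.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("metga97/egytriplet-e5-large-instruct")

egy = ["query: النهارده الجو حر اوي", "query: عايز اكل حاجة حلوة"]
msa = [
    "passage: الطقس حار جدًا اليوم",
    "passage: أريد أن آكل شيئًا حلوًا",
]

egy_emb = model.encode(egy, normalize_embeddings=True)
msa_emb = model.encode(msa, normalize_embeddings=True)

sims = util.cos_sim(egy_emb, msa_emb)  # (2, 2) similarity matrix
best = sims.argmax(dim=1)              # index of the closest MSA sentence
for q, idx in zip(egy, best.tolist()):
    print(q, "->", msa[idx])
```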

πŸ—£οΈ Languages

  • Egyptian Arabic (ar-eg)
  • Modern Standard Arabic (msa)

πŸ“„ License

CC BY 4.0 β€” free to use, adapt, and share with attribution.


πŸ“¬ Citation

If you use this model in your work, please cite:

@misc{egytriplets2024,
  author = {Mohammad Essam},
  title = {EgyTriplets: Generating 2 Million Egyptian Arabic Triplets via Transformer-Based Translation and Retrieval},
  year = {2024},
  url = {https://huggingface.co/datasets/metga97/egytriplets-2m}
}