metadata

language:
  - fa
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - loss:CachedMultipleNegativesRankingLoss
widget:
  - source_sentence: درنا از پرندگان مهاجر با پاهای بلند و گردن دراز است.
    sentences:
      - >-
        درناها با قامتی بلند و بال‌های پهن، از زیباترین پرندگان مهاجر به شمار
        می‌روند.
      - درناها پرندگانی کوچک با پاهای کوتاه هستند که مهاجرت نمی‌کنند.
      - ایران برای بار دیگر توانست به مدال طلا دست یابد.
  - source_sentence: در زمستان هوای تهران بسیار آلوده است.
    sentences:
      - تهران هوای پاکی در فصل زمستان دارد.
      - مشهد و تهران شلوغ‌ترین شهرهای ایران هستند.
      - در زمستان‌ها هوای تهران پاک نیست.
  - source_sentence: یادگیری زبان خارجی فرصت‌های شغلی را افزایش می‌دهد.
    sentences:
      - تسلط بر چند زبان، شانس استخدام در شرکت‌های بین‌المللی را بالا می‌برد.
      - دانستن زبان‌های خارجی تأثیری در موفقیت شغلی ندارد.
      - دمای هوا در قطب جنوب به پایین‌ترین حد خود در 50 سال اخیر رسید.
  - source_sentence: سفر کردن باعث گسترش دیدگاه‌های فرهنگی می‌شود.
    sentences:
      - بازدید از کشورهای مختلف به درک بهتر تنوع فرهنگی کمک می‌کند.
      - سفر کردن هیچ تأثیری بر دیدگاه‌های فرهنگی افراد ندارد
      - دمای هوا در قطب جنوب به پایین‌ترین حد خود در 50 سال اخیر رسید.
base_model:
  - PartAI/TookaBERT-Large

Tooka-SBERT-V2-Large

This model is a Sentence Transformers model trained for semantic textual similarity and embedding tasks. It maps sentences and paragraphs to a dense vector space, where semantically similar texts are close together.

The model is trained in two sizes: Small and Large.

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install sentence-transformers==3.4.1

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("PartAI/Tooka-SBERT-V2-Large")
# Run inference
sentences = [
    'درنا از پرندگان مهاجر با پاهای بلند و گردن دراز است.',
    'درناها با قامتی بلند و بال‌های پهن، از زیباترین پرندگان مهاجر به شمار می‌روند.',
    'درناها پرندگانی کوچک با پاهای کوتاه هستند که مهاجرت نمی‌کنند.'
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])

🛠️ Training Details

The training is performed in two stages:

1. Pretraining on the Targoman News dataset
2. Fine-tuning on multiple synthetic datasets

Stage 1: Pretraining

We use an asymmetric setup in which titles play the role of queries and texts the role of documents.
Input formatting:
- Titles are prepended with "سوال: " ("Question: ")
- Texts are prepended with "متن: " ("Text: ")
Loss function: CachedMultipleNegativesRankingLoss
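
The input formatting and the ranking objective can be sketched in plain Python. This is a minimal illustration, not the training code: the helper names `format_pair` and `in_batch_ranking_loss`, the scale of 20, and the toy similarity matrices are assumptions. The actual training used the library's CachedMultipleNegativesRankingLoss, which, as far as we understand it, optimizes the same in-batch-negatives objective while processing the batch in memory-bounded chunks.

```python
import math

QUERY_PREFIX = "سوال: "  # prepended to titles, as stated in the card
TEXT_PREFIX = "متن: "    # prepended to texts, as stated in the card


def format_pair(title: str, text: str) -> tuple[str, str]:
    """Return the (anchor, positive) strings for one title/text pair."""
    return QUERY_PREFIX + title, TEXT_PREFIX + text


def in_batch_ranking_loss(sim: list[list[float]], scale: float = 20.0) -> float:
    """Mean cross-entropy where the correct column for row i is column i.

    sim[i][j] is the cosine similarity between anchor i and text j, so every
    other text in the batch acts as a negative for anchor i.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [scale * s for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(sim)


anchor, positive = format_pair("عنوان خبر", "متن کامل خبر")
good = in_batch_ranking_loss([[0.9, 0.1], [0.2, 0.8]])  # matching pairs score highest
bad = in_batch_ranking_loss([[0.1, 0.9], [0.8, 0.2]])   # mismatched pairs score highest
print(anchor, "|", positive)
print(f"aligned batch loss={good:.4f}, misaligned batch loss={bad:.4f}")
```

The loss is small when each title is most similar to its own text and large when another text in the batch scores higher.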

Stage 2: Fine-tuning

Loss functions, used across multiple synthetic datasets:
- CachedMultipleNegativesRankingLoss
- CoSENTLoss
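
For readers unfamiliar with CoSENTLoss, the following plain-Python sketch shows the ranking idea behind it as we understand it: for any two sentence pairs where one has a higher gold similarity label than the other, the model is penalized when its predicted cosine score does not follow the same order. The function name `cosent_loss`, the scale of 20, and the toy scores are assumptions for illustration, not the training code.

```python
import math


def cosent_loss(scores: list[float], labels: list[float], scale: float = 20.0) -> float:
    """log(1 + sum(exp(scale * (s_low - s_high)))) over every two sentence
    pairs where the first has a lower gold label than the second."""
    terms = [0.0]  # exp(0) = 1 supplies the "1 +" inside the log
    for s_low, y_low in zip(scores, labels):
        for s_high, y_high in zip(scores, labels):
            if y_low < y_high:
                terms.append(scale * (s_low - s_high))
    m = max(terms)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in terms))


labels = [1.0, 0.5, 0.0]                           # gold similarity of three sentence pairs
consistent = cosent_loss([0.9, 0.5, 0.1], labels)  # predicted scores follow the label order
inverted = cosent_loss([0.1, 0.5, 0.9], labels)    # predicted scores reverse it
print(f"consistent={consistent:.4f}, inverted={inverted:.4f}")
```

Only the relative order of the scores matters here, which is why this kind of loss suits datasets with graded similarity labels.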

📊 Evaluation

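We evaluate our model on the PTEB Benchmark. On the cross-task average it outperforms multilingual-e5-base (72.05 vs. 70.09), although multilingual-e5-base remains ahead on the Retrieval and Reranking averages.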

For Retrieval and Reranking tasks, we follow the same asymmetric structure, prepending:
- "سوال: " to queries
- "متن: " to documents

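The prefixing for retrieval can be sketched as follows. This is a hedged illustration rather than the evaluation code: `rank_documents`, `cosine`, and the toy encoder `toy_encode` (with its tiny vocabulary) are hypothetical names introduced here so the example runs without downloading the model.

```python
import math
from typing import Callable, Sequence

QUERY_PREFIX = "سوال: "  # prepended to queries, as stated in the card
DOC_PREFIX = "متن: "     # prepended to documents, as stated in the card


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_documents(query: str, documents: Sequence[str], encode: Callable) -> list:
    """Return (document index, score) pairs sorted by descending similarity."""
    query_vec = encode([QUERY_PREFIX + query])[0]
    doc_vecs = encode([DOC_PREFIX + d for d in documents])
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


VOCAB = ["تهران", "هوا", "درنا", "پرنده"]  # toy vocabulary (assumption)


def toy_encode(texts: Sequence[str]) -> list:
    """Stand-in encoder: word counts over VOCAB; the prefixes are ignored."""
    return [[float(t.split().count(w)) for w in VOCAB] for t in texts]


ranking = rank_documents("هوا تهران", ["درنا پرنده", "تهران هوا آلوده"], toy_encode)
print(ranking)
```

With the real model, `encode` could be the `model.encode` method from the usage section, since it also takes a list of strings and returns one vector per string.
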
Model	#Params	Pair-Classification-Avg	Classification-Avg	Retrieval-Avg	Reranking-Avg	CrossTasks-Avg
Tooka-SBERT-V2-Large	353M	80.24	74.73	59.80	73.44	72.05
Tooka-SBERT-V2-Small	123M	75.69	72.16	61.24	73.40	70.62
jina-embeddings-v3	572M	71.88	79.27	65.18	64.62	70.24
multilingual-e5-base	278M	70.76	69.71	63.90	76.01	70.09
Tooka-SBERT-V1-Large	353M	81.52	71.54	45.61	60.44	64.78
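
The CrossTasks-Avg column appears to be the unweighted mean of the four task averages. The card does not state this, so the check below is an inference from the reported numbers; the names `rows` and `cross_task_mean` are illustrative.

```python
# Scores copied from the table above, in column order:
# pair classification, classification, retrieval, reranking,
# followed by the reported cross-task average.
rows = {
    "Tooka-SBERT-V2-Large": ([80.24, 74.73, 59.80, 73.44], 72.05),
    "Tooka-SBERT-V2-Small": ([75.69, 72.16, 61.24, 73.40], 70.62),
    "jina-embeddings-v3": ([71.88, 79.27, 65.18, 64.62], 70.24),
    "multilingual-e5-base": ([70.76, 69.71, 63.90, 76.01], 70.09),
    "Tooka-SBERT-V1-Large": ([81.52, 71.54, 45.61, 60.44], 64.78),
}


def cross_task_mean(scores: list[float]) -> float:
    """Unweighted mean of the per-task averages."""
    return sum(scores) / len(scores)


for name, (scores, reported) in rows.items():
    mean = cross_task_mean(scores)
    # Allow for rounding of the reported value to two decimals.
    status = "consistent" if abs(mean - reported) < 0.01 else "differs"
    print(f"{name}: mean={mean:.4f}, reported={reported} ({status})")
```

Because the mean is unweighted, a task group with a single dataset (Pair-Classification) counts as much as a group with four datasets (Classification).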

Task-Specific Datasets in PTEB

Pair-Classification:
- FarsTail
Classification:
- MassiveIntentClassification
- MassiveScenarioClassification
- MultilingualSentimentClassification
- PersianFoodSentimentClassification
Retrieval:
- MIRACLRetrieval
- NeuCLIR2023Retrieval
- WikipediaRetrievalMultilingual
Reranking:
- MIRACLReranking
- WikipediaRerankingMultilingual

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

CachedMultipleNegativesRankingLoss

@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup}, 
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}