---
license: apache-2.0
language:
- en
base_model:
- intfloat/e5-base-unsupervised
pipeline_tag: sentence-similarity
---
# cadet-embed-base-v1
**cadet-embed-base-v1** is a BERT-base embedding model fine-tuned **from `intfloat/e5-base-unsupervised`** with

* **cross-encoder listwise distillation** (teachers: `RankT5-3B` and `BAAI/bge-reranker-v2.5-gemma2-lightweight`; a sketch of the objective is shown below)
* **purely synthetic queries** (generated with Llama-3.1 8B: questions, claims, titles, keywords, and zero-shot & few-shot web queries) over 400k passages in total from the MS MARCO, DBPedia, and Wikipedia corpora.

The result: highly effective BERT-base retrieval.

We provide our training code and the scripts used to generate synthetic queries at https://github.com/manveertamber/cadet-dense-retrieval.
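
To give a rough idea of what listwise distillation involves, here is a minimal, illustrative sketch: the student bi-encoder's query–passage similarities over a candidate list are softmax-normalized and pulled toward the teacher reranker's score distribution with a KL divergence. The function name, temperature, and toy scores are assumptions for illustration, not the exact training code (see the repository above for that).

```python
import torch
import torch.nn.functional as F

def listwise_distillation_loss(student_scores, teacher_scores, temperature=1.0):
    """KL divergence between the teacher's and student's score distributions
    over the candidate passages of a single query.

    student_scores: (num_passages,) dot products between the query embedding
                    and the candidate passage embeddings (bi-encoder student)
    teacher_scores: (num_passages,) relevance scores from a cross-encoder
                    reranker for the same query-passage pairs (teacher)
    """
    student_log_probs = F.log_softmax(student_scores / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_scores / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="sum")

# Toy example: the student is pushed to reproduce the teacher's ranking.
student = torch.tensor([0.62, 0.55, 0.58])   # bi-encoder similarities
teacher = torch.tensor([4.1, -2.3, 1.7])     # cross-encoder reranker scores
loss = listwise_distillation_loss(student, teacher)
```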
---
## Quick start
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("manveertamber/cadet-embed-base-v1")

# E5-style prefixes: prepend "query: " to queries and "passage: " to passages
query = "query: capital of France"

passages = [
    "passage: Paris is the capital and largest city of France.",
    "passage: Berlin is known for its vibrant art scene.",
    "passage: The Eiffel Tower is located in Paris, France."
]

# Encode and L2-normalize the embeddings
q_emb = model.encode(query, normalize_embeddings=True)
p_embs = model.encode(passages, normalize_embeddings=True)  # shape (n_passages, dim)

# Cosine similarity reduces to a dot product on normalized embeddings
scores = np.dot(p_embs, q_emb)  # shape (n_passages,)

# Rank passages by score
for passage, score in sorted(zip(passages, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}\t{passage}")
```
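
For larger candidate sets, the built-in `sentence_transformers.util.semantic_search` helper can handle scoring and top-k selection. The corpus below and the `top_k` value are illustrative assumptions; the prefix convention is the same as above.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("manveertamber/cadet-embed-base-v1")

# Hypothetical corpus; remember the "passage: " prefix
corpus = ["passage: " + text for text in [
    "Paris is the capital and largest city of France.",
    "Berlin is known for its vibrant art scene.",
    "The Eiffel Tower is located in Paris, France.",
]]

corpus_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)
query_emb = model.encode("query: capital of France", convert_to_tensor=True, normalize_embeddings=True)

# Top-2 passages by cosine similarity
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.3f}\t{corpus[hit['corpus_id']]}")
```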
If you use this model, please cite:
```bibtex
@article{tamber2025conventionalcontrastivelearningfalls,
title={Conventional Contrastive Learning Often Falls Short: Improving Dense Retrieval with Cross-Encoder Listwise Distillation and Synthetic Data},
author={Manveer Singh Tamber and Suleman Kazi and Vivek Sourabh and Jimmy Lin},
journal={arXiv:2505.19274},
year={2025}
}
```