---
license: apache-2.0
language:
- en
base_model:
- intfloat/e5-base-unsupervised
pipeline_tag: sentence-similarity
---

# cadet-embed-base-v1

**cadet-embed-base-v1** is a BERT-base embedding model fine-tuned **from `intfloat/e5-base-unsupervised`** with:

* **cross-encoder listwise distillation** (teachers: `RankT5-3B` and `BAAI/bge-reranker-v2.5-gemma2-lightweight`; a sketch of the objective follows below)
* **purely synthetic queries** (generated by Llama-3.1 8B: questions, claims, titles, keywords, and zero-shot & few-shot web queries) over 400k passages in total from the MSMARCO, DBpedia, and Wikipedia corpora

The result: highly effective BERT-base retrieval.
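
For intuition, here is a minimal sketch of the listwise-distillation idea: the student's query-passage similarities over a candidate list are pushed toward the teacher cross-encoder's score distribution via KL divergence. This is a paraphrase, not the actual training code, and the function name, temperature parameter, and reduction are illustrative assumptions; see the repository linked below for the real implementation.

```python
import torch
import torch.nn.functional as F

def listwise_distillation_loss(student_scores: torch.Tensor,
                               teacher_scores: torch.Tensor,
                               temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over one query's candidate list.

    student_scores: query-passage dot products from the embedding model, shape (num_candidates,)
    teacher_scores: cross-encoder scores for the same candidates, shape (num_candidates,)
    """
    # Softmax turns raw scores into distributions over the candidate list.
    student_log_probs = F.log_softmax(student_scores / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_scores / temperature, dim=-1)
    # KL divergence pushes the student's ranking distribution toward the teacher's.
    return F.kl_div(student_log_probs, teacher_probs, reduction="sum")
```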

We provide our training code and scripts to generate synthetic queries at https://github.com/manveertamber/cadet-dense-retrieval.
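
For illustration, query generation might look like the following hypothetical sketch; the prompt wording, model checkpoint, and sampling settings here are assumptions, and the actual prompts and scripts are in the repository above.

```python
from transformers import pipeline

# Hypothetical query-generation setup; the real prompts cover several query
# types (questions, claims, titles, keywords, zero-/few-shot web queries).
generator = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")

passage = "Paris is the capital and largest city of France."
prompt = (
    "Write a natural web search query that the following passage answers.\n"
    f"Passage: {passage}\n"
    "Query:"
)

out = generator(prompt, max_new_tokens=32, do_sample=True, temperature=0.8)
print(out[0]["generated_text"])
```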

---

## Quick start

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("manveertamber/cadet-embed-base-v1")

# Following the E5 convention, prefix queries with "query: "
# and passages with "passage: ".
query = "query: capital of France"

passages = [
    "passage: Paris is the capital and largest city of France.",
    "passage: Berlin is known for its vibrant art scene.",
    "passage: The Eiffel Tower is located in Paris, France."
]

# Encode; with normalized embeddings, the dot product is cosine similarity
q_emb = model.encode(query, normalize_embeddings=True)
p_embs = model.encode(passages, normalize_embeddings=True)  # shape (n_passages, dim)

scores = np.dot(p_embs, q_emb)  # shape (n_passages,)

# Rank passages by descending similarity to the query
for passage, score in sorted(zip(passages, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}\t{passage}")
```
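
If you prefer plain `transformers`, the following sketch should be equivalent, assuming mean pooling over non-padding token embeddings followed by L2 normalization (the usual E5-style setup; the bundled Sentence-Transformers config is authoritative):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("manveertamber/cadet-embed-base-v1")
model = AutoModel.from_pretrained("manveertamber/cadet-embed-base-v1")
model.eval()

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)           # (batch, seq_len, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean over non-padding tokens
    return F.normalize(pooled, p=2, dim=-1)                # unit-length embeddings

q = embed(["query: capital of France"])
p = embed(["passage: Paris is the capital and largest city of France."])
print((q @ p.T).item())  # cosine similarity
```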

If you use this model, please cite:

```
@article{tamber2025conventionalcontrastivelearningfalls,
  title={Conventional Contrastive Learning Often Falls Short: Improving Dense Retrieval with Cross-Encoder Listwise Distillation and Synthetic Data},
  author={Manveer Singh Tamber and Suleman Kazi and Vivek Sourabh and Jimmy Lin},
  journal={arXiv:2505.19274},
  year={2025}
}
```