	---
	license: apache-2.0
	language:
	- en
	base_model:
	- intfloat/e5-base-unsupervised
	pipeline_tag: sentence-similarity
	---


	# cadet-embed-base-v1

	cadet-embed-base-v1 is a BERT-base embedding model fine-tuned from `intfloat/e5-base-unsupervised` using:

	* cross-encoder listwise distillation (teachers: `RankT5-3B` and `BAAI/bge-reranker-v2.5-gemma2-lightweight`)
	* purely synthetic queries generated by Llama-3.1 8B (questions, claims, titles, keywords, and zero-shot and few-shot web queries) over a total of 400k passages from the MS MARCO, DBPedia, and Wikipedia corpora

	The result: highly effective BERT-base retrieval.
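
For readers unfamiliar with listwise distillation, the idea can be sketched as follows: for each query, the teacher's relevance scores over a list of candidate passages are turned into a probability distribution, and the student is trained so that its own similarity scores induce a matching distribution. The snippet below is a minimal illustration using a KL-divergence objective, which is one common formulation. The exact loss, temperature, and the way the two teachers are combined are not stated in this card, so treat the function names, the `temperature` parameter, and the toy scores as assumptions; the linked repository is the authoritative reference.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over one list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def listwise_kl(teacher_scores, student_scores, temperature=1.0):
    """KL(teacher || student) over one query's candidate passages.

    Hypothetical helper: an illustrative objective, not necessarily the
    loss used to train cadet-embed-base-v1.
    """
    log_p = log_softmax([t / temperature for t in teacher_scores])
    log_q = log_softmax([s / temperature for s in student_scores])
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

# Toy scores for three candidate passages (invented for illustration).
teacher = [4.0, 1.0, -2.0]         # cross-encoder relevance scores
student_agrees = [4.0, 1.0, -2.0]  # student ranks passages like the teacher
student_flipped = [-2.0, 1.0, 4.0] # student ranks them in reverse

loss_agrees = listwise_kl(teacher, student_agrees)
loss_flipped = listwise_kl(teacher, student_flipped)
print(f"agrees:  {loss_agrees:.4f}")
print(f"flipped: {loss_flipped:.4f}")
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two rankings diverge, so minimizing it pushes the embedding model toward the cross-encoder's ordering of the whole candidate list rather than toward a single positive passage.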


	We provide our training code and scripts to generate synthetic queries at https://github.com/manveertamber/cadet-dense-retrieval.

	---

	## Quick start
	```python
	from sentence_transformers import SentenceTransformer
	import numpy as np

	model = SentenceTransformer("manveertamber/cadet-embed-base-v1")

	query = "query: capital of France"

	passages = [
	    "passage: Paris is the capital and largest city of France.",
	    "passage: Berlin is known for its vibrant art scene.",
	    "passage: The Eiffel Tower is located in Paris, France.",
	]

	# Encode
	q_emb = model.encode(query, normalize_embeddings=True)
	p_embs = model.encode(passages, normalize_embeddings=True) # shape (n_passages, dim)

	scores = np.dot(p_embs, q_emb) # shape (n_passages,)

	# Rank passages by score
	for passage, score in sorted(zip(passages, scores), key=lambda x: x[1], reverse=True):
	    print(f"{score:.3f}\t{passage}")
	```
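
The quick start prefixes inputs with `query: ` and `passage: `, following the E5 convention of the base model. Assuming raw text should receive the same prefixes before encoding, a pair of small helpers can keep that consistent and return only the top results. The names `with_prefix` and `top_k` are hypothetical, and the scores below are made-up stand-ins for the output of `np.dot(p_embs, q_emb)`.

```python
def with_prefix(texts, kind):
    """Prepend the 'query: ' or 'passage: ' marker used in the quick start."""
    if kind not in ("query", "passage"):
        raise ValueError(f"kind must be 'query' or 'passage', got {kind!r}")
    return [f"{kind}: {text}" for text in texts]

def top_k(passages, scores, k=2):
    """Return the k highest-scoring (passage, score) pairs, best first."""
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

raw_passages = [
    "Paris is the capital and largest city of France.",
    "Berlin is known for its vibrant art scene.",
    "The Eiffel Tower is located in Paris, France.",
]

# These prefixed strings are what would be passed to model.encode(...).
prefixed_passages = with_prefix(raw_passages, "passage")
prefixed_query = with_prefix(["capital of France"], "query")[0]

# Made-up similarity scores; real values come from the model.
scores = [0.82, 0.31, 0.64]

for passage, score in top_k(raw_passages, scores, k=2):
    print(f"{score:.3f}\t{passage}")
```

Because `top_k` only needs something iterable for `scores`, it also accepts the NumPy array produced in the quick start.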



	If you use this model, please cite:

	```
	@article{tamber2025conventionalcontrastivelearningfalls,
	  title={Conventional Contrastive Learning Often Falls Short: Improving Dense Retrieval with Cross-Encoder Listwise Distillation and Synthetic Data},
	  author={Manveer Singh Tamber and Suleman Kazi and Vivek Sourabh and Jimmy Lin},
	  journal={arXiv:2505.19274},
	  year={2025}
	}
	```