---
language:
- en
tags:
- ColBERT
- PyLate
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:99515
- loss:Contrastive
base_model: EuroBERT/EuroBERT-210m
datasets:
- reasonir/reasonir-data
pipeline_tag: sentence-similarity
library_name: PyLate
metrics:
- accuracy
model-index:
- name: PyLate model based on EuroBERT/EuroBERT-210m
  results:
  - task:
      type: col-berttriplet
      name: Col BERTTriplet
    dataset:
      name: Unknown
      type: unknown
    metrics:
    - type: accuracy
      value: 0.973160982131958
      name: Accuracy
license: cc-by-nc-4.0
---

## Fine-Tuned Model

**`fjmgAI/reason-colBERT-210M-EuroBERT`**

## Base Model

**`EuroBERT/EuroBERT-210m`**

## Fine-Tuning Method

Fine-tuning was performed using **[PyLate](https://github.com/lightonai/pylate)**, with contrastive training on the [rag-comprehensive-triplets](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets) dataset. The model maps sentences and paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity with the MaxSim operator.

## Dataset

**[`reasonir/reasonir-data`](https://huggingface.co/datasets/reasonir/reasonir-data)**

### Description

This dataset was used for the English language and contains **101,000 examples**, designed for **rag-comprehensive-triplets** and prepared using a data preprocessing script from the BRIGHT dataset.

## Fine-Tuning Details

- The model was trained with **contrastive training**.
- It was evaluated with `pylate.evaluation.colbert_triplet.ColBERTTripletEvaluator`.

| Metric       | Value      |
|:-------------|:-----------|
| **accuracy** | **0.9732** |

## Usage

First, install the PyLate library:

```bash
pip install -U pylate
```

### Retrieval

PyLate provides a streamlined interface to index and retrieve documents using ColBERT models. The index leverages the Voyager HNSW index to efficiently handle document embeddings and enable fast retrieval.
#### Indexing documents

First, load the ColBERT model and initialize the Voyager index, then encode and index your documents:

```python
import torch
from pylate import indexes, models, retrieve

# Step 1: Load the ColBERT model
model = models.ColBERT(
    model_name_or_path="fjmgAI/reason-colBERT-210M-EuroBERT",
    trust_remote_code=True,
)

# Move the model to GPU if available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Step 2: Initialize the Voyager index
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
    override=True,  # This overwrites the existing index if any
)

# Step 3: Encode the documents
documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]

documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,  # Set to False to indicate that these are documents, not queries
    show_progress_bar=True,
)

# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)
```

Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can reuse it later by loading it:

```python
# To load an index, instantiate it with the correct folder/name and without overriding it
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
)
```

#### Retrieving top-k documents for queries

Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries.
To do so, initialize the ColBERT retriever with the index you want to search, encode the queries, and then retrieve the top-k documents to get the IDs and relevance scores of the top matches:

```python
# Step 1: Initialize the ColBERT retriever
retriever = retrieve.ColBERT(index=index)

# Step 2: Encode the queries
queries_embeddings = model.encode(
    ["query for document 3", "query for document 1"],
    batch_size=32,
    is_query=True,  # Set to True to indicate that these are queries
    show_progress_bar=True,
)

# Step 3: Retrieve top-k documents
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=10,  # Retrieve the top 10 matches for each query
)
```

### Reranking

If you only want to use the ColBERT model to rerank the results of a first-stage retrieval pipeline, without building an index, you can use the `rank.rerank` function and pass it the queries and documents to rerank:

```python
import torch
from pylate import rank, models

queries = [
    "query A",
    "query B",
]

# One list of candidate documents per query
documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

# Ids aligned with the candidate documents above
documents_ids = [
    [1, 2],
    [1, 3, 2],
]

model = models.ColBERT(
    model_name_or_path="fjmgAI/reason-colBERT-210M-EuroBERT",
    trust_remote_code=True,
)

# Move the model to GPU if available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```

### Framework Versions

- Python: 3.10.12
- Sentence Transformers: 4.0.2
- PyLate: 1.2.0
- Transformers: 4.48.2
- PyTorch: 2.5.1+cu121
- Accelerate: 1.2.1
- Datasets: 3.3.1
- Tokenizers: 0.21.0

## Purpose

This fine-tuned model is designed for scenarios that require **efficient embedding-based retrieval through reasoning**, comparing embeddings at the token level with its MaxSim
operation, which makes it well suited to **question answering and document retrieval**.

- **Developed by:** fjmgAI
- **License:** Unfortunately, since the [ReasonIR data](https://huggingface.co/datasets/reasonir/reasonir-data) has been released under a cc-by-nc-4.0 license, we cannot release this model under an Apache 2.0 license. However, the authors of ReasonIR [released code to generate the data](https://github.com/facebookresearch/ReasonIR/tree/main/synthetic_data_generation). Anyone willing to regenerate the data could then reproduce this model under an Apache 2.0 license.
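
As an illustration of the token-level MaxSim comparison mentioned under Purpose, the sketch below scores a query against two documents in plain Python. It is a minimal, dependency-free sketch and not PyLate's implementation: the function name `maxsim_score` and the 2-dimensional toy embeddings are hypothetical (the real model emits 128-dimensional vectors per token), and it assumes the usual ColBERT formulation in which each query token takes its best dot product over all document tokens and the results are summed.

```python
def maxsim_score(query_embeddings, document_embeddings):
    """Sum, over query tokens, of the best dot product with any document token.

    Hypothetical helper for illustration only; not part of PyLate.
    """
    score = 0.0
    for q in query_embeddings:
        # Best-matching document token for this query token
        best = max(
            sum(qi * di for qi, di in zip(q, d))
            for d in document_embeddings
        )
        score += best
    return score


# Toy token embeddings (made-up values, one vector per token)
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_b = [[0.5, 0.5], [0.5, 0.5]]

scores = {
    "doc_a": maxsim_score(query, doc_a),
    "doc_b": maxsim_score(query, doc_b),
}

# Rank documents from highest to lowest MaxSim score
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Because every query token is matched independently, a document scores highly only if it contains a good match for each part of the query, which is the property that late-interaction retrieval and reranking rely on.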