Russian Constructicon Embedder

This is a specialized sentence-transformers model fine-tuned from intfloat/multilingual-e5-large-instruct for finding Russian Constructicon patterns in text. The model is trained to compare Russian text examples with construction patterns from the Russian Constructicon database, enabling semantic search for linguistic constructions.

Model Details

Model Description

  • Model Type: Sentence Transformer specialized for Russian Constructicon patterns
  • Base model: intfloat/multilingual-e5-large-instruct
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity
  • Language: Russian
  • Training Dataset: Russian Constructicon examples and patterns

Model Purpose

This model is specifically designed to encode Russian text examples and Constructicon patterns into a shared embedding space where similar constructions are close together. It enables:

  • Finding Constructicon patterns that match given Russian text examples
  • Semantic search through Russian construction databases
  • Similarity comparison between text examples and linguistic patterns
  • Construction pattern retrieval and ranking

Usage

Primary Usage (RusCxnPipe Library)

This model is designed to be used with the RusCxnPipe library for automatic Russian Constructicon pattern extraction:

from ruscxnpipe import SemanticSearch

# Initialize with this specific model
search = SemanticSearch(
    model_name="Futyn-Maker/ruscxn-embedder",
    query_prefix="Instruct: Given a sentence, find the constructions of the Russian Constructicon that it contains\nQuery: ",
    pattern_prefix=""
)

# Find construction candidates
examples = ["Петр так и замер.", "Мы, мягко говоря, совсем не ладили."]
results = search.find_candidates(queries=examples, n=5)

for result in results:
    print(f"Example: {result['query']}")
    for candidate in result['candidates']:
        print(f"  Pattern: {candidate['pattern']} (similarity: {candidate['similarity']:.3f})")

Direct Usage (Sentence Transformers)

For advanced users who want to use the model directly:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Futyn-Maker/ruscxn-embedder")

# Note: Use the correct prefixes for optimal performance
query_prefix = "Instruct: Given a sentence, find the constructions of the Russian Constructicon that it contains\nQuery: "
pattern_prefix = ""

# Encode a Russian example
example = query_prefix + "Петр так и замер."
example_embedding = model.encode(example)

# Encode construction patterns (no prefix needed)
patterns = [
    "NP-Nom так и VP-Pfv",
    "VP вокруг да около",
    "мягко говоря, Cl"
]
pattern_embeddings = model.encode(patterns)

# Calculate similarities
from sentence_transformers.util import cos_sim
similarities = cos_sim(example_embedding, pattern_embeddings)
print(similarities)
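
To turn these scores into a retrieval result, the highest-scoring pattern can be selected directly (continuing the snippet above):

# Pick the best-matching pattern for the example
best_idx = int(similarities.argmax())
best_score = float(similarities[0, best_idx])
print(f"Best match: {patterns[best_idx]} (similarity: {best_score:.3f})")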

Out-of-Scope Use

While this model is optimized for Russian Constructicon pattern matching, it may also be useful for other tasks involving Russian linguistic patterns, such as:

  • Clustering of similar constructions (see the sketch below)
  • Classification of constructions

However, performance on these tasks has not been systematically evaluated.
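
As an illustration of the clustering use case, here is a minimal, unevaluated sketch using scikit-learn's KMeans on pattern embeddings; the pattern list is a hypothetical sample:

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("Futyn-Maker/ruscxn-embedder")

# A small hypothetical sample of Constructicon-style patterns to group
patterns = [
    "NP-Nom так и VP-Pfv",
    "VP вокруг да около",
    "мягко говоря, Cl",
    "грубо говоря, Cl",
]

# Outputs are unit-normalized (see Model Architecture), so Euclidean
# k-means behaves like clustering by cosine similarity
embeddings = model.encode(patterns)
labels = KMeans(n_clusters=2, n_init="auto", random_state=0).fit_predict(embeddings)

for pattern, label in zip(patterns, labels):
    print(label, pattern)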

Training Details

Training Dataset

The model was trained on 15,298 examples from the Russian Constructicon database, where each training sample consists of:

  • Query: A Russian text example with the instruction prefix
  • Pattern: A corresponding Constructicon pattern (one such pair is sketched below)
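
For concreteness, here is a minimal sketch of one such pair, assembled from the usage examples above (the exact training data format is not published in this card):

# The instruction prefix attached to every query during training
query_prefix = ("Instruct: Given a sentence, find the constructions "
                "of the Russian Constructicon that it contains\nQuery: ")

# One (query, pattern) training pair
query = query_prefix + "Мы, мягко говоря, совсем не ладили."  # "We, to put it mildly, did not get along at all."
pattern = "мягко говоря, Cl"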

Training Objective

The model was fine-tuned using CachedMultipleNegativesSymmetricRankingLoss (see the sketch after this list) to learn embeddings where:

  • Examples containing a construction are similar to that construction's pattern
  • The embedding space preserves semantic relationships between related constructions
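
A hypothetical sketch of how this objective can be instantiated with the sentence-transformers API (the actual training script is not part of this card):

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CachedMultipleNegativesSymmetricRankingLoss

model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

# Symmetric in-batch negatives: each query should rank its own pattern
# highest, and each pattern should rank its own query highest; the
# "Cached" variant trades compute for memory, enabling large batch sizes
loss = CachedMultipleNegativesSymmetricRankingLoss(model, mini_batch_size=32)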

Training Hyperparameters

  • Learning rate: 2e-05
  • Batch size: 1024
  • Training epochs: 10 (best model from epoch 5)
  • Warmup ratio: 0.1
  • Weight decay: 0.01
  • Loss function: CachedMultipleNegativesSymmetricRankingLoss (see the trainer sketch below)
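
These settings map onto the sentence-transformers trainer API roughly as follows (an assumed sketch, not the author's actual configuration):

from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="ruscxn-embedder",      # hypothetical output path
    learning_rate=2e-5,
    per_device_train_batch_size=1024,  # batch size from the list above
    num_train_epochs=10,
    warmup_ratio=0.1,
    weight_decay=0.01,
)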

Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
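
Because the final Normalize() module outputs unit-length vectors, cosine similarity between embeddings reduces to a plain dot product:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Futyn-Maker/ruscxn-embedder")
emb = model.encode(["NP-Nom так и VP-Pfv", "VP вокруг да около"])

print(np.linalg.norm(emb, axis=1))  # ~[1. 1.] -- unit-normalized
print(emb[0] @ emb[1])              # identical to their cosine similarity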

Performance

The model achieved its best validation performance at epoch 5 with a validation loss of 0.1145.

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 4.1.0
  • Transformers: 4.51.3
  • PyTorch: 2.7.0+cu126