# Russian Constructicon Embedder
This is a specialized sentence-transformers model fine-tuned from intfloat/multilingual-e5-large-instruct for finding Russian Constructicon patterns in text. The model is trained to compare Russian text examples with construction patterns from the Russian Constructicon database, enabling semantic search for linguistic constructions.
## Model Details

### Model Description

- Model Type: Sentence Transformer specialized for Russian Constructicon patterns
- Base Model: intfloat/multilingual-e5-large-instruct
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 1024 dimensions
- Similarity Function: Cosine Similarity
- Language: Russian
- Training Dataset: Russian Constructicon examples and patterns
### Model Purpose
This model is specifically designed to encode Russian text examples and Constructicon patterns into a shared embedding space where similar constructions are close together. It enables:
- Finding Constructicon patterns that match given Russian text examples
- Semantic search through Russian construction databases
- Similarity comparison between text examples and linguistic patterns
- Construction pattern retrieval and ranking
## Usage

### Primary Usage (RusCxnPipe Library)
This model is designed to be used with the RusCxnPipe library for automatic Russian Constructicon pattern extraction:
```python
from ruscxnpipe import SemanticSearch

# Initialize with this specific model
search = SemanticSearch(
    model_name="Futyn-Maker/ruscxn-embedder",
    query_prefix="Instruct: Given a sentence, find the constructions of the Russian Constructicon that it contains\nQuery: ",
    pattern_prefix="",
)

# Find construction candidates
examples = [
    "Петр так и замер.",  # "Peter simply froze."
    "Мы, мягко говоря, совсем не ладили.",  # "We, to put it mildly, did not get along at all."
]
results = search.find_candidates(queries=examples, n=5)

for result in results:
    print(f"Example: {result['query']}")
    for candidate in result['candidates']:
        print(f"  Pattern: {candidate['pattern']} (similarity: {candidate['similarity']:.3f})")
```
### Direct Usage (Sentence Transformers)
For advanced users who want to use the model directly:
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("Futyn-Maker/ruscxn-embedder")

# Note: use the correct prefixes for optimal performance
query_prefix = "Instruct: Given a sentence, find the constructions of the Russian Constructicon that it contains\nQuery: "
pattern_prefix = ""  # patterns are encoded without a prefix

# Encode a Russian example
example = query_prefix + "Петр так и замер."  # "Peter simply froze."
example_embedding = model.encode(example)

# Encode construction patterns (no prefix needed)
patterns = [
    "NP-Nom так и VP-Pfv",
    "VP вокруг да около",
    "мягко говоря, Cl",
]
pattern_embeddings = model.encode(patterns)

# Calculate similarities
similarities = cos_sim(example_embedding, pattern_embeddings)
print(similarities)
```
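For this example, the first pattern (`NP-Nom так и VP-Pfv`) should receive the highest similarity, since the sentence instantiates that construction.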
## Out-of-Scope Use
While this model is optimized for Russian Constructicon pattern matching, it may also be useful for other tasks involving Russian linguistic patterns, such as:
- Clustering of similar constructions
- Classification of constructions
However, performance on these tasks has not been systematically evaluated.
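As one illustration, a clustering experiment could be sketched as follows. This is a minimal, unevaluated sketch: scikit-learn and the choice of `KMeans` with two clusters are assumptions for demonstration, not part of the model's documented usage.

```python
# Minimal sketch: clustering Constructicon patterns with this embedder.
# scikit-learn and the cluster count are illustrative assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("Futyn-Maker/ruscxn-embedder")

patterns = [
    "NP-Nom так и VP-Pfv",
    "VP вокруг да около",
    "мягко говоря, Cl",
]  # in practice, a larger pattern inventory

embeddings = model.encode(patterns)  # unit-length vectors (the model ends with Normalize())
labels = KMeans(n_clusters=2, n_init="auto", random_state=0).fit_predict(embeddings)

for pattern, label in zip(patterns, labels):
    print(label, pattern)
```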
## Training Details

### Training Dataset
The model was trained on 15,298 examples from the Russian Constructicon database, where each training sample consists of:
- Query: A Russian text example with the instruction prefix
- Pattern: A corresponding Constructicon pattern
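Schematically, one such pair might look as follows (the field names and dict layout are hypothetical; the released training format is not documented here):

```python
# Schematic training pair; the dict layout and field names are hypothetical.
pair = {
    "query": (
        "Instruct: Given a sentence, find the constructions of the "
        "Russian Constructicon that it contains\nQuery: Петр так и замер."
    ),
    "pattern": "NP-Nom так и VP-Pfv",
}
```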
### Training Objective
The model was fine-tuned using CachedMultipleNegativesSymmetricRankingLoss to learn embeddings where:
- Examples containing a construction are similar to that construction's pattern
- The embedding space preserves semantic relationships between related constructions
### Training Hyperparameters
- Learning rate: 2e-05
- Batch size: 1024
- Training epochs: 10 (best model from epoch 5)
- Warmup ratio: 0.1
- Weight decay: 0.01
- Loss function: CachedMultipleNegativesSymmetricRankingLoss
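A minimal sketch of how a comparable fine-tuning run could be set up with these hyperparameters is shown below. The actual training script and data format are not published in this card, so the dataset stub, column names, and `mini_batch_size` are illustrative assumptions:

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CachedMultipleNegativesSymmetricRankingLoss

model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

# Illustrative stand-in for the 15,298 (query, pattern) pairs.
train_dataset = Dataset.from_dict({
    "query": [
        "Instruct: Given a sentence, find the constructions of the "
        "Russian Constructicon that it contains\nQuery: Петр так и замер."
    ],
    "pattern": ["NP-Nom так и VP-Pfv"],
})

# The cached loss trades compute for memory, which makes the large batch size
# feasible; mini_batch_size here is an assumption, not a documented setting.
loss = CachedMultipleNegativesSymmetricRankingLoss(model, mini_batch_size=32)

args = SentenceTransformerTrainingArguments(
    output_dir="ruscxn-embedder",
    learning_rate=2e-5,
    per_device_train_batch_size=1024,
    num_train_epochs=10,
    warmup_ratio=0.1,
    weight_decay=0.01,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```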
### Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
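One practical consequence of the final `Normalize()` module is that embeddings come out unit-length, so dot products coincide with cosine similarities. A quick check (a sketch, not from the original card):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Futyn-Maker/ruscxn-embedder")
emb = model.encode(["Петр так и замер.", "NP-Nom так и VP-Pfv"])

print(np.linalg.norm(emb, axis=1))  # ~[1.0, 1.0]: outputs are unit-length
print(emb @ emb.T)                  # dot products equal cosine similarities here
```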
## Performance
The model achieved its best validation performance at epoch 5 with a validation loss of 0.1145.
## Framework Versions
- Python: 3.10.12
- Sentence Transformers: 4.1.0
- Transformers: 4.51.3
- PyTorch: 2.7.0+cu126