---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:15298
- loss:CachedMultipleNegativesSymmetricRankingLoss
- russian
- constructicon
- nlp
- linguistics
base_model: intfloat/multilingual-e5-large-instruct
widget:
- source_sentence: 'Instruct: Given a sentence, find the constructions of the Russian
Constructicon that it contains
Query: Петр так и замер.'
sentences:
- NP-Nom так и VP-Pfv
- VP вокруг да около
- NP-Nom в гробу видать NP-Acc
- source_sentence: 'Instruct: Given a sentence, find the constructions of the Russian
Constructicon that it contains
Query: Мы, мягко говоря, совсем не ладили.'
sentences:
- VP по всем правилам (NP-Gen)
- как насчёт XP?
- мягко говоря, Cl
- source_sentence: 'Instruct: Given a sentence, find the constructions of the Russian
Constructicon that it contains
Query: Не беспокойтесь, всё будет сделано в лучшем виде.'
sentences:
- быть может, XP/Cl
- вот было бы здорово, если бы Cl
- всё будет Adv/Adj-Short
- source_sentence: 'Instruct: Given a sentence, find the constructions of the Russian
Constructicon that it contains
Query: Самолет до Саратова уже год как отменили.'
sentences:
- показать, где раки зимуют NP-Dat
- VP как угорелый
- (вот) (уже) (NumCrd-Nom/NumCrd-Acc) NP Cop как Cl/NP-Nom (вот) (уже) (NumCrd-Acc)
NP как XP
- source_sentence: 'Instruct: Given a sentence, find the constructions of the Russian
Constructicon that it contains
Query: Срочно делай уроки, а не то будешь иметь дело с раздраженным отцом!'
sentences:
- Cl, (а) не то Aux-Fut иметь дело с NP-Ins
- VP (NP-Acc) с ног на голову
- VP под NP-Acc
pipeline_tag: sentence-similarity
library_name: sentence-transformers
language:
- ru
---
# Russian Constructicon Embedder
This is a specialized [sentence-transformers](https://www.SBERT.net) model fine-tuned from [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct) for finding Russian Constructicon patterns in text. The model is trained to compare Russian text examples with construction patterns from the Russian Constructicon database, enabling semantic search for linguistic constructions.
## Model Details
### Model Description
- **Model Type:** Sentence Transformer specialized for Russian Constructicon patterns
- **Base model:** [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct)
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 1024 dimensions
- **Similarity Function:** Cosine Similarity
- **Language:** Russian
- **Training Dataset:** Russian Constructicon examples and patterns
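These properties can be verified directly after loading the model; a minimal sketch:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Futyn-Maker/ruscxn-embedder")
print(model.max_seq_length)                      # 512
print(model.get_sentence_embedding_dimension())  # 1024
```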
### Model Purpose
This model is specifically designed to encode Russian text examples and Constructicon patterns into a shared embedding space where similar constructions are close together. It enables:
- Finding Constructicon patterns that match given Russian text examples
- Semantic search through Russian construction databases
- Similarity comparison between text examples and linguistic patterns
- Construction pattern retrieval and ranking
## Usage
### Primary Usage (RusCxnPipe Library)
This model is designed to be used with the [RusCxnPipe](https://github.com/Futyn-Maker/ruscxnpipe) library for automatic Russian Constructicon pattern extraction:
```python
from ruscxnpipe import SemanticSearch

# Initialize with this specific model
search = SemanticSearch(
    model_name="Futyn-Maker/ruscxn-embedder",
    query_prefix="Instruct: Given a sentence, find the constructions of the Russian Constructicon that it contains\nQuery: ",
    pattern_prefix=""
)

# Find construction candidates
examples = ["Петр так и замер.", "Мы, мягко говоря, совсем не ладили."]
results = search.find_candidates(queries=examples, n=5)

for result in results:
    print(f"Example: {result['query']}")
    for candidate in result['candidates']:
        print(f"  Pattern: {candidate['pattern']} (similarity: {candidate['similarity']:.3f})")
```
### Direct Usage (Sentence Transformers)
For advanced users who want to use the model directly:
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("Futyn-Maker/ruscxn-embedder")

# Note: use the correct prefixes for optimal performance
query_prefix = "Instruct: Given a sentence, find the constructions of the Russian Constructicon that it contains\nQuery: "
pattern_prefix = ""

# Encode a Russian example (with the instruction prefix)
example = query_prefix + "Петр так и замер."
example_embedding = model.encode(example)

# Encode construction patterns (no prefix needed)
patterns = [
    "NP-Nom так и VP-Pfv",
    "VP вокруг да около",
    "мягко говоря, Cl",
]
pattern_embeddings = model.encode(patterns)

# Calculate cosine similarities
similarities = cos_sim(example_embedding, pattern_embeddings)
print(similarities)
```
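For retrieval and ranking over a larger pattern inventory, the standard `semantic_search` utility from sentence-transformers can be used on top of the same encoding scheme. A sketch with a small, illustrative pattern list (in practice this would be the full set of Russian Constructicon patterns):
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search

model = SentenceTransformer("Futyn-Maker/ruscxn-embedder")

query_prefix = ("Instruct: Given a sentence, find the constructions "
                "of the Russian Constructicon that it contains\nQuery: ")

# Illustrative inventory only
patterns = [
    "NP-Nom так и VP-Pfv",
    "мягко говоря, Cl",
    "всё будет Adv/Adj-Short",
    "VP как угорелый",
]

query_embedding = model.encode(query_prefix + "Петр так и замер.", convert_to_tensor=True)
pattern_embeddings = model.encode(patterns, convert_to_tensor=True)

# Rank all patterns by cosine similarity and keep the top 3
hits = semantic_search(query_embedding, pattern_embeddings, top_k=3)
for hit in hits[0]:
    print(f"{patterns[hit['corpus_id']]} (score: {hit['score']:.3f})")
```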
## Out-of-Scope Use
While this model is optimized for Russian Constructicon pattern matching, it may also be useful for other tasks involving Russian linguistic patterns, such as:
- Clustering of similar constructions
- Classification of constructions
However, performance on these tasks has not been systematically evaluated.
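As an illustration of the clustering use mentioned above, pattern embeddings can be grouped with any off-the-shelf algorithm. A minimal sketch using scikit-learn's KMeans (not part of this model's tooling, and untested for this task):
```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("Futyn-Maker/ruscxn-embedder")

# Illustrative patterns; the cluster count is arbitrary
patterns = [
    "NP-Nom так и VP-Pfv",
    "VP вокруг да около",
    "мягко говоря, Cl",
    "быть может, XP/Cl",
]
embeddings = model.encode(patterns)  # shape: (4, 1024)

labels = KMeans(n_clusters=2, n_init="auto", random_state=0).fit_predict(embeddings)
for pattern, label in zip(patterns, labels):
    print(label, pattern)
```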
## Training Details
### Training Dataset
The model was trained on **15,298 examples** from the Russian Constructicon database, where each training sample consists of:
- **Query:** A Russian text example with the instruction prefix
- **Pattern:** A corresponding Constructicon pattern
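For illustration, a single training pair in this format might look as follows (the example is taken from the widget samples above; the field names are illustrative, not the exact dataset schema):
```python
# One (query, pattern) training pair as described above
pair = {
    "query": ("Instruct: Given a sentence, find the constructions "
              "of the Russian Constructicon that it contains\n"
              "Query: Петр так и замер."),
    "pattern": "NP-Nom так и VP-Pfv",
}
```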
### Training Objective
The model was fine-tuned using **CachedMultipleNegativesSymmetricRankingLoss** to learn embeddings where:
- Examples containing a construction are similar to that construction's pattern
- The embedding space preserves semantic relationships between related constructions
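This objective is available as a built-in loss in sentence-transformers; a minimal sketch of how it is typically attached to a model (the mini-batch size here is an assumption, not a documented training setting):
```python
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

# Symmetric variant: the in-batch ranking loss is computed in both
# directions (example -> pattern and pattern -> example); the cached
# version chunks the batch so large batch sizes fit in GPU memory.
loss = losses.CachedMultipleNegativesSymmetricRankingLoss(model, mini_batch_size=32)
```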
### Training Hyperparameters
- **Learning rate:** 2e-05
- **Batch size:** 1024
- **Training epochs:** 10 (best model from epoch 5)
- **Warmup ratio:** 0.1
- **Weight decay:** 0.01
- **Loss function:** CachedMultipleNegativesSymmetricRankingLoss
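A sketch of how these hyperparameters map onto sentence-transformers training arguments (the output directory is an assumption; this is not the exact training script):
```python
from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="ruscxn-embedder",      # assumed; not stated in the card
    learning_rate=2e-5,
    per_device_train_batch_size=1024,  # feasible thanks to the cached loss
    num_train_epochs=10,
    warmup_ratio=0.1,
    weight_decay=0.01,
)
```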
### Model Architecture
```
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
```
## Performance
The model achieved its best validation performance at epoch 5 with a validation loss of **0.1145**.
## Framework Versions
- Python: 3.10.12
- Sentence Transformers: 4.1.0
- Transformers: 4.51.3
- PyTorch: 2.7.0+cu126