|
--- |
|
tags: |
|
- sentence-transformers |
|
- sentence-similarity |
|
- feature-extraction |
|
- generated_from_trainer |
|
- dataset_size:15298 |
|
- loss:CachedMultipleNegativesSymmetricRankingLoss |
|
- russian |
|
- constructicon |
|
- nlp |
|
- linguistics |
|
base_model: intfloat/multilingual-e5-large-instruct |
|
widget: |
|
- source_sentence: 'Instruct: Given a sentence, find the constructions of the Russian |
|
Constructicon that it contains |
|
|
|
Query: Петр так и замер.' |
|
sentences: |
|
- NP-Nom так и VP-Pfv |
|
- VP вокруг да около |
|
- NP-Nom в гробу видать NP-Acc |
|
- source_sentence: 'Instruct: Given a sentence, find the constructions of the Russian |
|
Constructicon that it contains |
|
|
|
Query: Мы, мягко говоря, совсем не ладили.' |
|
sentences: |
|
- VP по всем правилам (NP-Gen) |
|
- как насчёт XP? |
|
- мягко говоря, Cl |
|
- source_sentence: 'Instruct: Given a sentence, find the constructions of the Russian |
|
Constructicon that it contains |
|
|
|
Query: Не беспокойтесь, всё будет сделано в лучшем виде.' |
|
sentences: |
|
- быть может, XP/Cl |
|
- вот было бы здорово, если бы Cl |
|
- всё будет Adv/Adj-Short |
|
- source_sentence: 'Instruct: Given a sentence, find the constructions of the Russian |
|
Constructicon that it contains |
|
|
|
Query: Самолет до Саратова уже год как отменили.' |
|
sentences: |
|
- показать, где раки зимуют NP-Dat |
|
- VP как угорелый |
|
- (вот) (уже) (NumCrd-Nom/NumCrd-Acc) NP Cop как Cl/NP-Nom (вот) (уже) (NumCrd-Acc) |
|
NP как XP |
|
- source_sentence: 'Instruct: Given a sentence, find the constructions of the Russian |
|
Constructicon that it contains |
|
|
|
Query: Срочно делай уроки, а не то будешь иметь дело с раздраженным отцом!' |
|
sentences: |
|
- Cl, (а) не то Aux-Fut иметь дело с NP-Ins |
|
- VP (NP-Acc) с ног на голову |
|
- VP под NP-Acc |
|
pipeline_tag: sentence-similarity |
|
library_name: sentence-transformers |
|
language: |
|
- ru |
|
--- |
|
|
|
# Russian Constructicon Embedder |
|
|
|
This is a specialized [sentence-transformers](https://www.SBERT.net) model fine-tuned from [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct) for finding Russian Constructicon patterns in text. The model is trained to compare Russian text examples with construction patterns from the Russian Constructicon database, enabling semantic search for linguistic constructions. |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
- **Model Type:** Sentence Transformer specialized for Russian Constructicon patterns |
|
- **Base model:** [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct) |
|
- **Maximum Sequence Length:** 512 tokens |
|
- **Output Dimensionality:** 1024 dimensions |
|
- **Similarity Function:** Cosine Similarity |
|
- **Language:** Russian |
|
- **Training Dataset:** Russian Constructicon examples and patterns |
|
|
|
### Model Purpose |
|
|
|
This model is specifically designed to encode Russian text examples and Constructicon patterns into a shared embedding space where similar constructions are close together. It enables: |
|
|
|
- Finding Constructicon patterns that match given Russian text examples |
|
- Semantic search through Russian construction databases |
|
- Similarity comparison between text examples and linguistic patterns |
|
- Construction pattern retrieval and ranking |
|
|
|
## Usage |
|
|
|
### Primary Usage (RusCxnPipe Library) |
|
|
|
This model is designed to be used with the [RusCxnPipe](https://github.com/Futyn-Maker/ruscxnpipe) library for automatic Russian Constructicon pattern extraction: |
|
|
|
```python |
|
from ruscxnpipe import SemanticSearch |
|
|
|
# Initialize with this specific model |
|
search = SemanticSearch( |
|
model_name="Futyn-Maker/ruscxn-embedder", |
|
query_prefix="Instruct: Given a sentence, find the constructions of the Russian Constructicon that it contains\nQuery: ", |
|
pattern_prefix="" |
|
) |
|
|
|
# Find construction candidates |
|
examples = ["Петр так и замер.", "Мы, мягко говоря, совсем не ладили."] |
|
results = search.find_candidates(queries=examples, n=5) |
|
|
|
for result in results: |
|
print(f"Example: {result['query']}") |
|
for candidate in result['candidates']: |
|
print(f" Pattern: {candidate['pattern']} (similarity: {candidate['similarity']:.3f})") |
|
``` |
|
|
|
### Direct Usage (Sentence Transformers) |
|
|
|
For advanced users who want to use the model directly: |
|
|
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
|
|
model = SentenceTransformer("Futyn-Maker/ruscxn-embedder") |
|
|
|
# Note: Use the correct prefixes for optimal performance |
|
query_prefix = "Instruct: Given a sentence, find the constructions of the Russian Constructicon that it contains\nQuery: " |
|
pattern_prefix = "" |
|
|
|
# Encode a Russian example |
|
example = query_prefix + "Петр так и замер." |
|
example_embedding = model.encode(example) |
|
|
|
# Encode construction patterns (no prefix needed) |
|
patterns = [ |
|
"NP-Nom так и VP-Pfv", |
|
"VP вокруг да около", |
|
"мягко говоря, Cl" |
|
] |
|
pattern_embeddings = model.encode(patterns) |
|
|
|
# Calculate similarities |
|
from sentence_transformers.util import cos_sim |
|
similarities = cos_sim(example_embedding, pattern_embeddings) |
|
print(similarities) |
|
``` |
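
Beyond pairwise similarity, top-k retrieval over a pattern inventory can be done with `sentence_transformers.util.semantic_search`. A minimal sketch, reusing the three patterns above as a stand-in for a full inventory:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search

model = SentenceTransformer("Futyn-Maker/ruscxn-embedder")
query_prefix = "Instruct: Given a sentence, find the constructions of the Russian Constructicon that it contains\nQuery: "

patterns = ["NP-Nom так и VP-Pfv", "VP вокруг да около", "мягко говоря, Cl"]
queries = ["Петр так и замер.", "Мы, мягко говоря, совсем не ладили."]

# Patterns are encoded without a prefix, queries with the instruction prefix
pattern_embeddings = model.encode(patterns, convert_to_tensor=True)
query_embeddings = model.encode([query_prefix + q for q in queries], convert_to_tensor=True)

# Retrieve the top-2 patterns per query; each hit carries 'corpus_id' and 'score'
for query, hits in zip(queries, semantic_search(query_embeddings, pattern_embeddings, top_k=2)):
    print(query)
    for hit in hits:
        print(f"  {patterns[hit['corpus_id']]} ({hit['score']:.3f})")
```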
|
|
|
## Out-of-Scope Use |
|
|
|
While this model is optimized for Russian Constructicon pattern matching, it may also be useful for other tasks involving Russian linguistic patterns, such as: |
|
|
|
- Clustering of similar constructions (sketched below)
|
- Classification of constructions |
|
|
|
However, performance on these tasks has not been systematically evaluated. |
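
As an illustration of the clustering use case, patterns can be embedded and grouped with scikit-learn's `AgglomerativeClustering`. This is a sketch, not an evaluated recipe: the distance threshold is an arbitrary illustrative value, and it assumes scikit-learn ≥ 1.2 (for the `metric` argument):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

model = SentenceTransformer("Futyn-Maker/ruscxn-embedder")

patterns = [
    "NP-Nom так и VP-Pfv",
    "VP вокруг да около",
    "мягко говоря, Cl",
    "быть может, XP/Cl",
]

# The model's output embeddings are L2-normalized, so cosine distance is well-behaved
embeddings = model.encode(patterns)

clustering = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.4,  # illustrative value, not tuned
    metric="cosine",
    linkage="average",
).fit(embeddings)

for pattern, label in zip(patterns, clustering.labels_):
    print(label, pattern)
```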
|
|
|
## Training Details |
|
|
|
### Training Dataset |
|
|
|
The model was trained on **15,298 examples** from the Russian Constructicon database, where each training sample consists of: |
|
- **Query:** A Russian text example with the instruction prefix |
|
- **Pattern:** A corresponding Constructicon pattern |
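
Concretely, one training pair (taken from the widget examples above) looks like this:

```python
pair = {
    "query": (
        "Instruct: Given a sentence, find the constructions of the Russian "
        "Constructicon that it contains\nQuery: Петр так и замер."
    ),
    "pattern": "NP-Nom так и VP-Pfv",
}
```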
|
|
|
### Training Objective |
|
|
|
The model was fine-tuned using **CachedMultipleNegativesSymmetricRankingLoss** to learn embeddings where: |
|
- Examples containing a construction are similar to that construction's pattern |
|
- The embedding space preserves semantic relationships between related constructions |
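
In effect, this is a symmetric in-batch-negatives objective: within a batch, every non-matching pattern acts as a negative for each example and vice versa, and the loss is averaged over both retrieval directions. Schematically, for a batch of $N$ (example, pattern) pairs with similarity function $s$ and temperature $\tau$ (the "Cached" variant computes the same loss with gradient caching, which is what makes the large batch size below feasible):

$$
\mathcal{L} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{e^{s(q_i,p_i)/\tau}}{\sum_{j=1}^{N} e^{s(q_i,p_j)/\tau}} + \log\frac{e^{s(q_i,p_i)/\tau}}{\sum_{j=1}^{N} e^{s(q_j,p_i)/\tau}}\right]
$$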
|
|
|
### Training Hyperparameters |
|
|
|
- **Learning rate:** 2e-05 |
|
- **Batch size:** 1024 |
|
- **Training epochs:** 10 (best model from epoch 5) |
|
- **Warmup ratio:** 0.1 |
|
- **Weight decay:** 0.01 |
|
- **Loss function:** CachedMultipleNegativesSymmetricRankingLoss |
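
For reference, a run with these hyperparameters can be configured roughly as follows with the sentence-transformers trainer. This is a sketch under stated assumptions (the single data pair is illustrative, and the `mini_batch_size` used for gradient caching is an assumed value), not the exact training script:

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CachedMultipleNegativesSymmetricRankingLoss

model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

# (query, pattern) pairs; queries carry the instruction prefix described above
train_dataset = Dataset.from_dict({
    "anchor": [
        "Instruct: Given a sentence, find the constructions of the Russian "
        "Constructicon that it contains\nQuery: Петр так и замер.",
    ],
    "positive": ["NP-Nom так и VP-Pfv"],
})

# Gradient caching keeps the effective batch size of 1024 within GPU memory;
# mini_batch_size here is an assumed value, not taken from the original run
loss = CachedMultipleNegativesSymmetricRankingLoss(model, mini_batch_size=32)

args = SentenceTransformerTrainingArguments(
    output_dir="ruscxn-embedder",
    num_train_epochs=10,
    per_device_train_batch_size=1024,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    weight_decay=0.01,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```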
|
|
|
### Model Architecture |
|
|
|
``` |
|
SentenceTransformer( |
|
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel |
|
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True}) |
|
(2): Normalize() |
|
) |
|
``` |
|
|
|
## Performance |
|
|
|
The model achieved its best validation performance at epoch 5 with a validation loss of **0.1145**. |
|
|
|
## Framework Versions |
|
|
|
- Python: 3.10.12 |
|
- Sentence Transformers: 4.1.0 |
|
- Transformers: 4.51.3 |
|
- PyTorch: 2.7.0+cu126 |
|
|