metadata
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:15298
  - loss:CachedMultipleNegativesSymmetricRankingLoss
  - russian
  - constructicon
  - nlp
  - linguistics
base_model: intfloat/multilingual-e5-large-instruct
widget:
  - source_sentence: >-
      Instruct: Given a sentence, find the constructions of the Russian
      Constructicon that it contains

      Query: Петр так и замер.
    sentences:
      - NP-Nom так и VP-Pfv
      - VP вокруг да около
      - NP-Nom в гробу видать NP-Acc
  - source_sentence: >-
      Instruct: Given a sentence, find the constructions of the Russian
      Constructicon that it contains

      Query: Мы, мягко говоря, совсем не ладили.
    sentences:
      - VP по всем правилам (NP-Gen)
      - как насчёт XP?
      - мягко говоря, Cl
  - source_sentence: >-
      Instruct: Given a sentence, find the constructions of the Russian
      Constructicon that it contains

      Query: Не беспокойтесь, всё будет сделано в лучшем виде.
    sentences:
      - быть может, XP/Cl
      - вот было бы здорово, если бы Cl
      - всё будет Adv/Adj-Short
  - source_sentence: >-
      Instruct: Given a sentence, find the constructions of the Russian
      Constructicon that it contains

      Query: Самолет до Саратова уже год как отменили.
    sentences:
      - показать, где раки зимуют NP-Dat
      - VP как угорелый
      - >-
        (вот) (уже) (NumCrd-Nom/NumCrd-Acc) NP Cop как Cl/NP-Nom (вот) (уже)
        (NumCrd-Acc) NP как XP
  - source_sentence: >-
      Instruct: Given a sentence, find the constructions of the Russian
      Constructicon that it contains

      Query: Срочно делай уроки, а не то будешь иметь дело с раздраженным отцом!
    sentences:
      - Cl, (а) не то Aux-Fut иметь дело с NP-Ins
      - VP (NP-Acc) с ног на голову
      - VP под NP-Acc
pipeline_tag: sentence-similarity
library_name: sentence-transformers
language:
  - ru

Russian Constructicon Embedder

This is a sentence-transformers model fine-tuned from intfloat/multilingual-e5-large-instruct to find Russian Constructicon patterns in text. It is trained to match Russian text examples against construction patterns from the Russian Constructicon database, which enables semantic search for linguistic constructions.

Model Details

Model Description

  • Model Type: Sentence Transformer specialized for Russian Constructicon patterns
  • Base model: intfloat/multilingual-e5-large-instruct
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity
  • Language: Russian
  • Training Dataset: Russian Constructicon examples and patterns

Model Purpose

This model encodes Russian text examples and Constructicon patterns into a shared embedding space, where an example lies close to the patterns of the constructions it contains. It enables:

  • Finding Constructicon patterns that match given Russian text examples
  • Semantic search through Russian construction databases
  • Similarity comparison between text examples and linguistic patterns
  • Construction pattern retrieval and ranking

Usage

Primary Usage (RusCxnPipe Library)

This model is designed to be used with the RusCxnPipe library for automatic Russian Constructicon pattern extraction:

from ruscxnpipe import SemanticSearch

# Initialize with this specific model
search = SemanticSearch(
    model_name="Futyn-Maker/ruscxn-embedder",
    query_prefix="Instruct: Given a sentence, find the constructions of the Russian Constructicon that it contains\nQuery: ",
    pattern_prefix=""
)

# Find construction candidates
examples = ["Петр так и замер.", "Мы, мягко говоря, совсем не ладили."]
results = search.find_candidates(queries=examples, n=5)

for result in results:
    print(f"Example: {result['query']}")
    for candidate in result['candidates']:
        print(f"  Pattern: {candidate['pattern']} (similarity: {candidate['similarity']:.3f})")

Direct Usage (Sentence Transformers)

For advanced users who want to use the model directly:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("Futyn-Maker/ruscxn-embedder")

# Queries take the instruction prefix; patterns are encoded without a prefix
query_prefix = "Instruct: Given a sentence, find the constructions of the Russian Constructicon that it contains\nQuery: "
pattern_prefix = ""

# Encode a Russian example
example = query_prefix + "Петр так и замер."
example_embedding = model.encode(example)

# Encode construction patterns (pattern_prefix is empty, so they are passed as-is)
patterns = [
    "NP-Nom так и VP-Pfv",
    "VP вокруг да около",
    "мягко говоря, Cl"
]
pattern_embeddings = model.encode([pattern_prefix + p for p in patterns])

# Cosine similarity between the example and each pattern; result has shape (1, 3)
similarities = cos_sim(example_embedding, pattern_embeddings)
print(similarities)
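
To turn the similarity scores into a ranked candidate list, something like the following could be used. This is a minimal sketch: `rank_patterns` is a hypothetical helper, not part of sentence-transformers or RusCxnPipe, and the scores below are made-up placeholders rather than real model output. With the code above, real scores could be obtained as `similarities[0].tolist()`.

```python
def rank_patterns(patterns, scores, top_k=3):
    """Return the top_k (pattern, score) pairs, highest score first."""
    if len(patterns) != len(scores):
        raise ValueError("patterns and scores must have the same length")
    ranked = sorted(zip(patterns, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Placeholder scores for illustration only (not produced by the model)
patterns = ["NP-Nom так и VP-Pfv", "VP вокруг да около", "мягко говоря, Cl"]
toy_scores = [0.82, 0.31, 0.27]

for pattern, score in rank_patterns(patterns, toy_scores, top_k=2):
    print(f"{pattern}: {score:.3f}")
```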

Out-of-Scope Use

The model is optimized for matching Russian text against Constructicon patterns; other uses fall outside what it was trained and evaluated for. It may still be useful for related tasks involving Russian linguistic patterns, such as:

  • Clustering of similar constructions
  • Classification of constructions

However, performance on these tasks has not been systematically evaluated.

Training Details

Training Dataset

The model was trained on 15,298 examples from the Russian Constructicon database, where each training sample consists of:

  • Query: A Russian text example with the instruction prefix
  • Pattern: A corresponding Constructicon pattern
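
The shape of one training sample can be sketched as follows. The field names `query` and `pattern` and the helper `make_training_pair` are assumptions made for illustration; the actual dataset schema is not given in this card. The prefix string is the one shown in the usage examples, and the sentence and pattern are borrowed from them.

```python
# Instruction prefix as given in the usage examples of this card
QUERY_PREFIX = (
    "Instruct: Given a sentence, find the constructions of the Russian "
    "Constructicon that it contains\nQuery: "
)


def make_training_pair(example, pattern):
    """Build one (query, pattern) sample; field names are hypothetical."""
    return {
        "query": QUERY_PREFIX + example,  # text example with the instruction prefix
        "pattern": pattern,               # pattern is left without a prefix
    }


pair = make_training_pair("Петр так и замер.", "NP-Nom так и VP-Pfv")
print(pair["query"])
print(pair["pattern"])
```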

Training Objective

The model was fine-tuned using CachedMultipleNegativesSymmetricRankingLoss to learn embeddings where:

  • Examples containing a construction are similar to that construction's pattern
  • The embedding space preserves semantic relationships between related constructions
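
The idea behind a symmetric in-batch ranking loss can be sketched in plain Python. This is an illustration of the general technique, not the sentence-transformers implementation: every other pattern in the batch serves as a negative for a query, and every other query as a negative for a pattern, with the two cross-entropy terms averaged. The `scale` value of 20 is an assumption (the scale used in training is not stated here), and the gradient caching that the "Cached" variant adds to make large batches fit in memory is not shown.

```python
import math


def cross_entropy_diagonal(scores):
    """Mean cross-entropy over rows, where the correct column for row i is i."""
    total = 0.0
    for i, row in enumerate(scores):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(s - row_max) for s in row))
        total += log_sum_exp - row[i]
    return total / len(scores)


def symmetric_ranking_loss(query_vecs, pattern_vecs, scale=20.0):
    """Average of query-to-pattern and pattern-to-query in-batch losses.

    Inputs are assumed to be L2-normalized, so the dot product is the cosine.
    """
    scores = [
        [scale * sum(q * p for q, p in zip(qv, pv)) for pv in pattern_vecs]
        for qv in query_vecs
    ]
    transposed = [list(column) for column in zip(*scores)]
    return (cross_entropy_diagonal(scores) + cross_entropy_diagonal(transposed)) / 2


# Toy 2-d unit vectors: matched pairs give a low loss, swapped pairs a high one
queries = [[1.0, 0.0], [0.0, 1.0]]
print(symmetric_ranking_loss(queries, [[1.0, 0.0], [0.0, 1.0]]))
print(symmetric_ranking_loss(queries, [[0.0, 1.0], [1.0, 0.0]]))
```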

Training Hyperparameters

  • Learning rate: 2e-05
  • Batch size: 1024
  • Training epochs: 10 (best model from epoch 5)
  • Warmup ratio: 0.1
  • Weight decay: 0.01
  • Loss function: CachedMultipleNegativesSymmetricRankingLoss
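
As a rough worked example, the hyperparameters above imply the following step counts. The sketch assumes that all 15,298 examples are used for training, that the final partial batch is kept, and that there is no gradient accumulation or multi-device splitting; none of these details are stated here. `schedule_summary` is a hypothetical helper, and the exact rounding a trainer applies to warmup steps may differ.

```python
import math


def schedule_summary(num_examples, batch_size, epochs, warmup_ratio):
    """Return (steps_per_epoch, total_steps, warmup_steps) under the stated assumptions."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last partial batch kept
    total_steps = steps_per_epoch * epochs
    warmup_steps = round(total_steps * warmup_ratio)
    return steps_per_epoch, total_steps, warmup_steps


print(schedule_summary(15298, 1024, 10, 0.1))  # (15, 150, 15)
```

Under these assumptions an epoch is only 15 optimizer steps, so the warmup phase would cover roughly the first epoch.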

Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
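
The Pooling and Normalize stages listed above can be illustrated with a small plain-Python sketch. It shows the general technique (masked mean pooling followed by L2 normalization) on toy 2-dimensional vectors, not the library's actual code; the function names are hypothetical. Because the output vectors have unit length, their dot product equals their cosine similarity.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the vectors of tokens whose mask value is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [value / count for value in summed]


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Three toy token vectors; the last one is padding and is masked out
tokens = [[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]]
mask = [1, 1, 0]
embedding = l2_normalize(mean_pool(tokens, mask))
print(embedding, dot(embedding, embedding))
```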

Performance

The model achieved its best validation performance at epoch 5 with a validation loss of 0.1145.

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 4.1.0
  • Transformers: 4.51.3
  • PyTorch: 2.7.0+cu126