---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:15298
- loss:CachedMultipleNegativesSymmetricRankingLoss
- russian
- constructicon
- nlp
- linguistics
base_model: intfloat/multilingual-e5-large-instruct
widget:
- source_sentence: 'Instruct: Given a sentence, find the constructions of the Russian
    Constructicon that it contains

    Query: Петр так и замер.'
  sentences:
  - NP-Nom так и VP-Pfv
  - VP вокруг да около
  - NP-Nom в гробу видать NP-Acc
- source_sentence: 'Instruct: Given a sentence, find the constructions of the Russian
    Constructicon that it contains

    Query: Мы, мягко говоря, совсем не ладили.'
  sentences:
  - VP по всем правилам (NP-Gen)
  - как насчёт XP?
  - мягко говоря, Cl
- source_sentence: 'Instruct: Given a sentence, find the constructions of the Russian
    Constructicon that it contains

    Query: Не беспокойтесь, всё будет сделано в лучшем виде.'
  sentences:
  - быть может, XP/Cl
  - вот было бы здорово, если бы Cl
  - всё будет Adv/Adj-Short
- source_sentence: 'Instruct: Given a sentence, find the constructions of the Russian
    Constructicon that it contains

    Query: Самолет до Саратова уже год как отменили.'
  sentences:
  - показать, где раки зимуют NP-Dat
  - VP как угорелый
  - (вот) (уже) (NumCrd-Nom/NumCrd-Acc) NP Cop как Cl/NP-Nom (вот) (уже) (NumCrd-Acc) NP как XP
- source_sentence: 'Instruct: Given a sentence, find the constructions of the Russian
    Constructicon that it contains

    Query: Срочно делай уроки, а не то будешь иметь дело с раздраженным отцом!'
  sentences:
  - Cl, (а) не то Aux-Fut иметь дело с NP-Ins
  - VP (NP-Acc) с ног на голову
  - VP под NP-Acc
pipeline_tag: sentence-similarity
library_name: sentence-transformers
language:
- ru
---

# Russian Constructicon Embedder

This is a specialized [sentence-transformers](https://www.SBERT.net) model fine-tuned from [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct) for finding Russian Constructicon patterns in text. The model is trained to compare Russian text examples with construction patterns from the Russian Constructicon database, enabling semantic search for linguistic constructions.

## Model Details

### Model Description

- **Model Type:** Sentence Transformer specialized for Russian Constructicon patterns
- **Base model:** [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct)
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 1024 dimensions
- **Similarity Function:** Cosine Similarity
- **Language:** Russian
- **Training Dataset:** Russian Constructicon examples and patterns
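The sequence-length and dimensionality figures above can be checked directly against the loaded model; a minimal sanity-check sketch:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Futyn-Maker/ruscxn-embedder")

# Both values should match the Model Description above
print(model.get_sentence_embedding_dimension())  # 1024
print(model.max_seq_length)                      # 512
```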
### Model Purpose

This model is specifically designed to encode Russian text examples and Constructicon patterns into a shared embedding space where similar constructions are close together. It enables:

- Finding Constructicon patterns that match given Russian text examples
- Semantic search through Russian construction databases
- Similarity comparison between text examples and linguistic patterns
- Construction pattern retrieval and ranking

## Usage

### Primary Usage (RusCxnPipe Library)

This model is designed to be used with the [RusCxnPipe](https://github.com/Futyn-Maker/ruscxnpipe) library for automatic Russian Constructicon pattern extraction:

```python
from ruscxnpipe import SemanticSearch

# Initialize with this specific model
search = SemanticSearch(
    model_name="Futyn-Maker/ruscxn-embedder",
    query_prefix="Instruct: Given a sentence, find the constructions of the Russian Constructicon that it contains\nQuery: ",
    pattern_prefix=""
)

# Find construction candidates
examples = ["Петр так и замер.", "Мы, мягко говоря, совсем не ладили."]
results = search.find_candidates(queries=examples, n=5)

for result in results:
    print(f"Example: {result['query']}")
    for candidate in result['candidates']:
        print(f"  Pattern: {candidate['pattern']} (similarity: {candidate['similarity']:.3f})")
```

### Direct Usage (Sentence Transformers)

For advanced users who want to use the model directly:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Futyn-Maker/ruscxn-embedder")

# Note: use the correct prefixes for optimal performance
query_prefix = "Instruct: Given a sentence, find the constructions of the Russian Constructicon that it contains\nQuery: "
pattern_prefix = ""

# Encode a Russian example
example = query_prefix + "Петр так и замер."
example_embedding = model.encode(example)

# Encode construction patterns (no prefix needed)
patterns = [
    "NP-Nom так и VP-Pfv",
    "VP вокруг да около",
    "мягко говоря, Cl"
]
pattern_embeddings = model.encode(patterns)

# Calculate similarities
from sentence_transformers.util import cos_sim
similarities = cos_sim(example_embedding, pattern_embeddings)
print(similarities)
```

## Out-of-Scope Use

While this model is optimized for Russian Constructicon pattern matching, it may also be useful for other tasks involving Russian linguistic patterns, such as:

- Clustering of similar constructions
- Classification of constructions

However, performance on these tasks has not been systematically evaluated.
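As an illustration of such a secondary use, the sketch below clusters a few construction patterns by embedding similarity. It is untested for this purpose, and the patterns, the cluster count, and the choice of scikit-learn's `AgglomerativeClustering` are arbitrary choices made for the example:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

model = SentenceTransformer("Futyn-Maker/ruscxn-embedder")

# A few construction patterns (no prefix, as in the direct-usage example above)
patterns = [
    "NP-Nom так и VP-Pfv",
    "мягко говоря, Cl",
    "быть может, XP/Cl",
    "VP вокруг да около",
]

embeddings = model.encode(patterns, normalize_embeddings=True)

# With normalized embeddings, Euclidean distance is monotonically
# related to cosine distance, so standard clustering applies directly
clustering = AgglomerativeClustering(n_clusters=2).fit(embeddings)
for pattern, label in zip(patterns, clustering.labels_):
    print(label, pattern)
```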
## Training Details

### Training Dataset

The model was trained on **15,298 examples** from the Russian Constructicon database, where each training sample consists of:

- **Query:** A Russian text example with the instruction prefix
- **Pattern:** A corresponding Constructicon pattern

### Training Objective

The model was fine-tuned using **CachedMultipleNegativesSymmetricRankingLoss** to learn embeddings where:

- Examples containing a construction are similar to that construction's pattern
- The embedding space preserves semantic relationships between related constructions

### Training Hyperparameters

- **Learning rate:** 2e-05
- **Batch size:** 1024
- **Training epochs:** 10 (best model from epoch 5)
- **Warmup ratio:** 0.1
- **Weight decay:** 0.01
- **Loss function:** CachedMultipleNegativesSymmetricRankingLoss

A sketch of how this training setup can be reproduced with `sentence-transformers` is given at the end of this card.

### Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```

## Performance

The model achieved its best validation performance at epoch 5, with a validation loss of **0.1145**.

## Framework Versions

- Python: 3.10.12
- Sentence Transformers: 4.1.0
- Transformers: 4.51.3
- PyTorch: 2.7.0+cu126
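## Training Sketch

For reference, below is a minimal sketch of the training setup described above, using the `SentenceTransformerTrainer` API. This is not the original training script: the two dataset rows are illustrative stand-ins for the real 15,298 (query, pattern) pairs, and details such as the validation split and best-checkpoint selection are omitted.

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CachedMultipleNegativesSymmetricRankingLoss

PREFIX = (
    "Instruct: Given a sentence, find the constructions of the Russian "
    "Constructicon that it contains\nQuery: "
)

# Illustrative stand-in for the real (query, pattern) training pairs
train_dataset = Dataset.from_dict({
    "query": [
        PREFIX + "Петр так и замер.",
        PREFIX + "Мы, мягко говоря, совсем не ладили.",
    ],
    "pattern": ["NP-Nom так и VP-Pfv", "мягко говоря, Cl"],
})

model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")
loss = CachedMultipleNegativesSymmetricRankingLoss(model)  # in-batch negatives

args = SentenceTransformerTrainingArguments(
    output_dir="ruscxn-embedder",
    num_train_epochs=10,
    per_device_train_batch_size=1024,  # large batches give more in-batch negatives
    learning_rate=2e-5,
    warmup_ratio=0.1,
    weight_decay=0.01,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```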