Futyn-Maker committed on
Commit 754fb38 · verified · 1 Parent(s): 779f254

Initial upload of Russian Constructicon embedder model

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 1024,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
README.md ADDED
@@ -0,0 +1,194 @@
+ ---
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - generated_from_trainer
+ - dataset_size:15298
+ - loss:CachedMultipleNegativesSymmetricRankingLoss
+ - russian
+ - constructicon
+ - nlp
+ - linguistics
+ base_model: intfloat/multilingual-e5-large-instruct
+ widget:
+ - source_sentence: 'Instruct: Given a sentence, find the constructions of the Russian
+     Constructicon that it contains
+ 
+     Query: Петр так и замер.'
+   sentences:
+   - NP-Nom так и VP-Pfv
+   - VP вокруг да около
+   - NP-Nom в гробу видать NP-Acc
+ - source_sentence: 'Instruct: Given a sentence, find the constructions of the Russian
+     Constructicon that it contains
+ 
+     Query: Мы, мягко говоря, совсем не ладили.'
+   sentences:
+   - VP по всем правилам (NP-Gen)
+   - как насчёт XP?
+   - мягко говоря, Cl
+ - source_sentence: 'Instruct: Given a sentence, find the constructions of the Russian
+     Constructicon that it contains
+ 
+     Query: Не беспокойтесь, всё будет сделано в лучшем виде.'
+   sentences:
+   - быть может, XP/Cl
+   - вот было бы здорово, если бы Cl
+   - всё будет Adv/Adj-Short
+ - source_sentence: 'Instruct: Given a sentence, find the constructions of the Russian
+     Constructicon that it contains
+ 
+     Query: Самолет до Саратова уже год как отменили.'
+   sentences:
+   - показать, где раки зимуют NP-Dat
+   - VP как угорелый
+   - (вот) (уже) (NumCrd-Nom/NumCrd-Acc) NP Cop как Cl/NP-Nom (вот) (уже) (NumCrd-Acc)
+     NP как XP
+ - source_sentence: 'Instruct: Given a sentence, find the constructions of the Russian
+     Constructicon that it contains
+ 
+     Query: Срочно делай уроки, а не то будешь иметь дело с раздраженным отцом!'
+   sentences:
+   - Cl, (а) не то Aux-Fut иметь дело с NP-Ins
+   - VP (NP-Acc) с ног на голову
+   - VP под NP-Acc
+ pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
+ language:
+ - ru
+ ---
+ 
+ # Russian Constructicon Embedder
+ 
+ This is a specialized [sentence-transformers](https://www.SBERT.net) model fine-tuned from [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct) for finding Russian Constructicon patterns in text. The model is trained to compare Russian text examples with construction patterns from the Russian Constructicon database, enabling semantic search for linguistic constructions.
+ 
+ ## Model Details
+ 
+ ### Model Description
+ - **Model Type:** Sentence Transformer specialized for Russian Constructicon patterns
+ - **Base model:** [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct)
+ - **Maximum Sequence Length:** 512 tokens
+ - **Output Dimensionality:** 1024 dimensions
+ - **Similarity Function:** Cosine Similarity
+ - **Language:** Russian
+ - **Training Dataset:** Russian Constructicon examples and patterns
+ 
+ ### Model Purpose
+ 
+ This model is specifically designed to encode Russian text examples and Constructicon patterns into a shared embedding space where similar constructions are close together. It enables:
+ 
+ - Finding Constructicon patterns that match given Russian text examples
+ - Semantic search through Russian construction databases
+ - Similarity comparison between text examples and linguistic patterns
+ - Construction pattern retrieval and ranking
+ 
+ ## Usage
+ 
+ ### Primary Usage (RusCxnPipe Library)
+ 
+ This model is designed to be used with the [RusCxnPipe](https://github.com/Futyn-Maker/ruscxnpipe) library for automatic Russian Constructicon pattern extraction:
+ 
+ ```python
+ from ruscxnpipe import SemanticSearch
+ 
+ # Initialize with this specific model
+ search = SemanticSearch(
+     model_name="Futyn-Maker/ruscxn-embedder",
+     query_prefix="Instruct: Given a sentence, find the constructions of the Russian Constructicon that it contains\nQuery: ",
+     pattern_prefix=""
+ )
+ 
+ # Find construction candidates
+ examples = ["Петр так и замер.", "Мы, мягко говоря, совсем не ладили."]
+ results = search.find_candidates(queries=examples, n=5)
+ 
+ for result in results:
+     print(f"Example: {result['query']}")
+     for candidate in result['candidates']:
+         print(f" Pattern: {candidate['pattern']} (similarity: {candidate['similarity']:.3f})")
+ ```
+ 
+ ### Direct Usage (Sentence Transformers)
+ 
+ For advanced users who want to use the model directly:
+ 
+ ```python
+ from sentence_transformers import SentenceTransformer
+ from sentence_transformers.util import cos_sim
+ 
+ model = SentenceTransformer("Futyn-Maker/ruscxn-embedder")
+ 
+ # Note: Use the correct prefixes for optimal performance
+ query_prefix = "Instruct: Given a sentence, find the constructions of the Russian Constructicon that it contains\nQuery: "
+ pattern_prefix = ""
+ 
+ # Encode a Russian example
+ example = query_prefix + "Петр так и замер."
+ example_embedding = model.encode(example)
+ 
+ # Encode construction patterns (no prefix needed)
+ patterns = [
+     "NP-Nom так и VP-Pfv",
+     "VP вокруг да около",
+     "мягко говоря, Cl"
+ ]
+ pattern_embeddings = model.encode(patterns)
+ 
+ # Calculate similarities
+ similarities = cos_sim(example_embedding, pattern_embeddings)
+ print(similarities)
+ ```
+ 
+ ## Other Uses
+ 
+ While this model is optimized for Russian Constructicon pattern matching, it may also be useful for other tasks involving Russian linguistic patterns, such as:
+ 
+ - Clustering of similar constructions (see the sketch below)
+ - Classification of constructions
+ 
+ However, performance on these tasks has not been systematically evaluated.
+ 
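+ As an illustration of the clustering use case, here is a minimal, unevaluated sketch using the greedy community-detection utility from sentence-transformers; the pattern list, threshold, and minimum community size below are arbitrary choices for demonstration, not recommended settings:
+ 
+ ```python
+ from sentence_transformers import SentenceTransformer
+ from sentence_transformers.util import community_detection
+ 
+ model = SentenceTransformer("Futyn-Maker/ruscxn-embedder")
+ 
+ # A handful of construction patterns to group (arbitrary examples)
+ patterns = [
+     "NP-Nom так и VP-Pfv",
+     "VP вокруг да около",
+     "мягко говоря, Cl",
+     "быть может, XP/Cl",
+     "как насчёт XP?",
+ ]
+ embeddings = model.encode(patterns, convert_to_tensor=True)
+ 
+ # Greedy clustering over cosine similarities; the threshold is an untuned guess
+ clusters = community_detection(embeddings, threshold=0.8, min_community_size=2)
+ for cluster in clusters:
+     print([patterns[i] for i in cluster])
+ ```
+ 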
+ ## Training Details
+ 
+ ### Training Dataset
+ 
+ The model was trained on **15,298 examples** from the Russian Constructicon database, where each training sample consists of:
+ - **Query:** A Russian text example with the instruction prefix
+ - **Pattern:** A corresponding Constructicon pattern
+ 
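+ For illustration, a single training pair might look like this (constructed from the widget examples above, not an actual row of the training data):
+ 
+ ```python
+ pair = {
+     "query": (
+         "Instruct: Given a sentence, find the constructions of the Russian "
+         "Constructicon that it contains\nQuery: Петр так и замер."
+     ),
+     "pattern": "NP-Nom так и VP-Pfv",
+ }
+ ```
+ 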
+ ### Training Objective
+ 
+ The model was fine-tuned using **CachedMultipleNegativesSymmetricRankingLoss** to learn embeddings where:
+ - Examples containing a construction are similar to that construction's pattern
+ - The embedding space preserves semantic relationships between related constructions
+ 
+ ### Training Hyperparameters
+ 
+ - **Learning rate:** 2e-05
+ - **Batch size:** 1024
+ - **Training epochs:** 10 (best model from epoch 5)
+ - **Warmup ratio:** 0.1
+ - **Weight decay:** 0.01
+ - **Loss function:** CachedMultipleNegativesSymmetricRankingLoss
+ 
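+ A minimal sketch of how such a run is typically wired together with the sentence-transformers trainer API (the dataset file name, its column layout, and the single-device batch setup here are assumptions for illustration, not the actual training script):
+ 
+ ```python
+ from datasets import load_dataset
+ from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
+ from sentence_transformers.losses import CachedMultipleNegativesSymmetricRankingLoss
+ from sentence_transformers.training_args import SentenceTransformerTrainingArguments
+ 
+ model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")
+ 
+ # Assumed: a CSV of (query, pattern) text pairs; in-batch negatives come from the loss
+ train_dataset = load_dataset("csv", data_files="train_pairs.csv")["train"]
+ 
+ # Cached variant keeps the large batch feasible by chunking the forward pass
+ loss = CachedMultipleNegativesSymmetricRankingLoss(model)
+ 
+ args = SentenceTransformerTrainingArguments(
+     output_dir="ruscxn-embedder",
+     num_train_epochs=10,
+     per_device_train_batch_size=1024,
+     learning_rate=2e-5,
+     warmup_ratio=0.1,
+     weight_decay=0.01,
+ )
+ 
+ trainer = SentenceTransformerTrainer(
+     model=model,
+     args=args,
+     train_dataset=train_dataset,
+     loss=loss,
+ )
+ trainer.train()
+ ```
+ 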
+ ### Model Architecture
+ 
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
+   (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+   (2): Normalize()
+ )
+ ```
+ 
+ ## Performance
+ 
+ The model achieved its best validation performance at epoch 5 with a validation loss of **0.1145**.
+ 
+ ## Framework Versions
+ 
+ - Python: 3.10.12
+ - Sentence Transformers: 4.1.0
+ - Transformers: 4.51.3
+ - PyTorch: 2.7.0+cu126
config.json ADDED
@@ -0,0 +1,28 @@
+ {
+   "_name_or_path": "./embedding_model",
+   "architectures": [
+     "XLMRobertaModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": 4096,
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "xlm-roberta",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 24,
+   "output_past": true,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.49.0",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 250002
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "__version__": {
+     "sentence_transformers": "4.1.0",
+     "transformers": "4.49.0",
+     "pytorch": "2.4.1+cu121"
+   },
+   "prompts": {},
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4dbbc5d5741756a3532cfc39fdcc273a2f25008b7325d6dff551dbabf9a977b3
+ size 2239607176
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Normalize",
+     "type": "sentence_transformers.models.Normalize"
+   }
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": false
+ }
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
+ size 5069051
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:883b037111086fd4dfebbbc9b7cee11e1517b5e0c0514879478661440f137085
+ size 17082987
tokenizer_config.json ADDED
@@ -0,0 +1,63 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "250001": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [],
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "extra_special_tokens": {},
+   "mask_token": "<mask>",
+   "max_length": 512,
+   "model_max_length": 512,
+   "pad_to_multiple_of": null,
+   "pad_token": "<pad>",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "</s>",
+   "stride": 0,
+   "tokenizer_class": "XLMRobertaTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "<unk>"
+ }