Futyn-Maker committed on
Commit 754fb38 · verified · 1 Parent(s): 779f254

Initial upload of Russian Constructicon embedder model

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 1024,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
README.md ADDED
@@ -0,0 +1,194 @@
+ ---
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - generated_from_trainer
+ - dataset_size:15298
+ - loss:CachedMultipleNegativesSymmetricRankingLoss
+ - russian
+ - constructicon
+ - nlp
+ - linguistics
+ base_model: intfloat/multilingual-e5-large-instruct
+ widget:
+ - source_sentence: 'Instruct: Given a sentence, find the constructions of the Russian
+     Constructicon that it contains
+ 
+     Query: Петр так и замер.'
+   sentences:
+   - NP-Nom так и VP-Pfv
+   - VP вокруг да около
+   - NP-Nom в гробу видать NP-Acc
+ - source_sentence: 'Instruct: Given a sentence, find the constructions of the Russian
+     Constructicon that it contains
+ 
+     Query: Мы, мягко говоря, совсем не ладили.'
+   sentences:
+   - VP по всем правилам (NP-Gen)
+   - как насчёт XP?
+   - мягко говоря, Cl
+ - source_sentence: 'Instruct: Given a sentence, find the constructions of the Russian
+     Constructicon that it contains
+ 
+     Query: Не беспокойтесь, всё будет сделано в лучшем виде.'
+   sentences:
+   - быть может, XP/Cl
+   - вот было бы здорово, если бы Cl
+   - всё будет Adv/Adj-Short
+ - source_sentence: 'Instruct: Given a sentence, find the constructions of the Russian
+     Constructicon that it contains
+ 
+     Query: Самолет до Саратова уже год как отменили.'
+   sentences:
+   - показать, где раки зимуют NP-Dat
+   - VP как угорелый
+   - (вот) (уже) (NumCrd-Nom/NumCrd-Acc) NP Cop как Cl/NP-Nom (вот) (уже) (NumCrd-Acc)
+     NP как XP
+ - source_sentence: 'Instruct: Given a sentence, find the constructions of the Russian
+     Constructicon that it contains
+ 
+     Query: Срочно делай уроки, а не то будешь иметь дело с раздраженным отцом!'
+   sentences:
+   - Cl, (а) не то Aux-Fut иметь дело с NP-Ins
+   - VP (NP-Acc) с ног на голову
+   - VP под NP-Acc
+ pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
+ language:
+ - ru
+ ---
+ 
+ # Russian Constructicon Embedder
+ 
+ This is a specialized [sentence-transformers](https://www.SBERT.net) model fine-tuned from [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct) for finding Russian Constructicon patterns in text. The model is trained to compare Russian text examples with construction patterns from the Russian Constructicon database, enabling semantic search for linguistic constructions.
+ 
+ ## Model Details
+ 
+ ### Model Description
+ - **Model Type:** Sentence Transformer specialized for Russian Constructicon patterns
+ - **Base model:** [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct)
+ - **Maximum Sequence Length:** 512 tokens
+ - **Output Dimensionality:** 1024 dimensions
+ - **Similarity Function:** Cosine Similarity
+ - **Language:** Russian
+ - **Training Dataset:** Russian Constructicon examples and patterns
+ 
+ ### Model Purpose
+ 
+ This model is specifically designed to encode Russian text examples and Constructicon patterns into a shared embedding space where similar constructions are close together. It enables:
+ 
+ - Finding Constructicon patterns that match given Russian text examples
+ - Semantic search through Russian construction databases
+ - Similarity comparison between text examples and linguistic patterns
+ - Construction pattern retrieval and ranking
+ 
+ ## Usage
+ 
+ ### Primary Usage (RusCxnPipe Library)
+ 
+ This model is designed to be used with the [RusCxnPipe](https://github.com/Futyn-Maker/ruscxnpipe) library for automatic Russian Constructicon pattern extraction:
+ 
+ ```python
+ from ruscxnpipe import SemanticSearch
+ 
+ # Initialize with this specific model
+ search = SemanticSearch(
+     model_name="Futyn-Maker/ruscxn-embedder",
+     query_prefix="Instruct: Given a sentence, find the constructions of the Russian Constructicon that it contains\nQuery: ",
+     pattern_prefix=""
+ )
+ 
+ # Find construction candidates
+ examples = ["Петр так и замер.", "Мы, мягко говоря, совсем не ладили."]
+ results = search.find_candidates(queries=examples, n=5)
+ 
+ for result in results:
+     print(f"Example: {result['query']}")
+     for candidate in result['candidates']:
+         print(f" Pattern: {candidate['pattern']} (similarity: {candidate['similarity']:.3f})")
+ ```
+ 
+ ### Direct Usage (Sentence Transformers)
+ 
+ For advanced users who want to use the model directly:
+ 
+ ```python
+ from sentence_transformers import SentenceTransformer
+ from sentence_transformers.util import cos_sim
+ 
+ model = SentenceTransformer("Futyn-Maker/ruscxn-embedder")
+ 
+ # Note: Use the correct prefixes for optimal performance
+ query_prefix = "Instruct: Given a sentence, find the constructions of the Russian Constructicon that it contains\nQuery: "
+ pattern_prefix = ""
+ 
+ # Encode a Russian example
+ example = query_prefix + "Петр так и замер."
+ example_embedding = model.encode(example)
+ 
+ # Encode construction patterns (no prefix needed)
+ patterns = [
+     "NP-Nom так и VP-Pfv",
+     "VP вокруг да около",
+     "мягко говоря, Cl"
+ ]
+ pattern_embeddings = model.encode(patterns)
+ 
+ # Calculate similarities
+ similarities = cos_sim(example_embedding, pattern_embeddings)
+ print(similarities)
+ ```
+ 
+ ## Other Uses
+ 
+ While this model is optimized for Russian Constructicon pattern matching, it may also be useful for other tasks involving Russian linguistic patterns, such as:
+ 
+ - Clustering of similar constructions (see the sketch below)
+ - Classification of constructions
+ 
+ However, performance on these tasks has not been systematically evaluated.
+ 
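+ As an illustration of the clustering use case, here is a minimal, unevaluated sketch using the greedy community-detection utility from sentence-transformers; the pattern list, threshold, and minimum community size below are arbitrary choices for demonstration, not recommended settings:
+ 
+ ```python
+ from sentence_transformers import SentenceTransformer
+ from sentence_transformers.util import community_detection
+ 
+ model = SentenceTransformer("Futyn-Maker/ruscxn-embedder")
+ 
+ # A handful of construction patterns to group (arbitrary examples)
+ patterns = [
+     "NP-Nom так и VP-Pfv",
+     "VP вокруг да около",
+     "мягко говоря, Cl",
+     "быть может, XP/Cl",
+     "как насчёт XP?",
+ ]
+ embeddings = model.encode(patterns, convert_to_tensor=True)
+ 
+ # Greedy clustering over cosine similarities; the threshold is an untuned guess
+ clusters = community_detection(embeddings, threshold=0.8, min_community_size=2)
+ for cluster in clusters:
+     print([patterns[i] for i in cluster])
+ ```
+ 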
+ ## Training Details
+ 
+ ### Training Dataset
+ 
+ The model was trained on **15,298 examples** from the Russian Constructicon database, where each training sample consists of:
+ - **Query:** A Russian text example with the instruction prefix
+ - **Pattern:** A corresponding Constructicon pattern
+ 
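+ For illustration, a single training pair might look like this (constructed from the widget examples above, not an actual row of the training data):
+ 
+ ```python
+ pair = {
+     "query": (
+         "Instruct: Given a sentence, find the constructions of the Russian "
+         "Constructicon that it contains\nQuery: Петр так и замер."
+     ),
+     "pattern": "NP-Nom так и VP-Pfv",
+ }
+ ```
+ 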
+ ### Training Objective
+ 
+ The model was fine-tuned using **CachedMultipleNegativesSymmetricRankingLoss** to learn embeddings where:
+ - Examples containing a construction are similar to that construction's pattern
+ - The embedding space preserves semantic relationships between related constructions
+ 
+ ### Training Hyperparameters
+ 
+ - **Learning rate:** 2e-05
+ - **Batch size:** 1024
+ - **Training epochs:** 10 (best model from epoch 5)
+ - **Warmup ratio:** 0.1
+ - **Weight decay:** 0.01
+ - **Loss function:** CachedMultipleNegativesSymmetricRankingLoss
+ 
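+ A minimal sketch of how such a run is typically wired together with the sentence-transformers trainer API (the dataset file name, its column layout, and the single-device batch setup here are assumptions for illustration, not the actual training script):
+ 
+ ```python
+ from datasets import load_dataset
+ from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
+ from sentence_transformers.losses import CachedMultipleNegativesSymmetricRankingLoss
+ from sentence_transformers.training_args import SentenceTransformerTrainingArguments
+ 
+ model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")
+ 
+ # Assumed: a CSV of (query, pattern) text pairs; in-batch negatives come from the loss
+ train_dataset = load_dataset("csv", data_files="train_pairs.csv")["train"]
+ 
+ # Cached variant keeps the large batch feasible by chunking the forward pass
+ loss = CachedMultipleNegativesSymmetricRankingLoss(model)
+ 
+ args = SentenceTransformerTrainingArguments(
+     output_dir="ruscxn-embedder",
+     num_train_epochs=10,
+     per_device_train_batch_size=1024,
+     learning_rate=2e-5,
+     warmup_ratio=0.1,
+     weight_decay=0.01,
+ )
+ 
+ trainer = SentenceTransformerTrainer(
+     model=model,
+     args=args,
+     train_dataset=train_dataset,
+     loss=loss,
+ )
+ trainer.train()
+ ```
+ 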
+ ### Model Architecture
+ 
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
+   (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+   (2): Normalize()
+ )
+ ```
+ 
+ ## Performance
+ 
+ The model achieved its best validation performance at epoch 5 with a validation loss of **0.1145**.
+ 
+ ## Framework Versions
+ 
+ - Python: 3.10.12
+ - Sentence Transformers: 4.1.0
+ - Transformers: 4.51.3
+ - PyTorch: 2.7.0+cu126
config.json ADDED
@@ -0,0 +1,28 @@
+ {
+   "_name_or_path": "./embedding_model",
+   "architectures": [
+     "XLMRobertaModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": 4096,
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "xlm-roberta",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 24,
+   "output_past": true,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.49.0",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 250002
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "__version__": {
+     "sentence_transformers": "4.1.0",
+     "transformers": "4.49.0",
+     "pytorch": "2.4.1+cu121"
+   },
+   "prompts": {},
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4dbbc5d5741756a3532cfc39fdcc273a2f25008b7325d6dff551dbabf9a977b3
+ size 2239607176
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Normalize",
+     "type": "sentence_transformers.models.Normalize"
+   }
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": false
+ }
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
+ size 5069051
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:883b037111086fd4dfebbbc9b7cee11e1517b5e0c0514879478661440f137085
+ size 17082987
tokenizer_config.json ADDED
@@ -0,0 +1,63 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "250001": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [],
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "extra_special_tokens": {},
+   "mask_token": "<mask>",
+   "max_length": 512,
+   "model_max_length": 512,
+   "pad_to_multiple_of": null,
+   "pad_token": "<pad>",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "</s>",
+   "stride": 0,
+   "tokenizer_class": "XLMRobertaTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "<unk>"
+ }