netandreus committed
Commit aba9483 · verified · 1 Parent(s): c2de3ac

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,390 @@
- ---
- license: mit
- ---
+ ---
+ license: apache-2.0
+ base_model: BAAI/bge-reranker-v2-m3
+ tags:
+ - generated_from_trainer
+ library_name: sentence-transformers
+ pipeline_tag: text-ranking
+ model-index:
+ - name: bge_reranker
+   results: []
+ ---
+
+ # Reranker
+
+ **For more details, please refer to our GitHub repo: [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding/tree/master).**
+
+ - [Model List](#model-list)
+ - [Usage](#usage)
+ - [Fine-tuning](#fine-tune)
+ - [Evaluation](#evaluation)
+ - [Citation](#citation)
+
+ Unlike an embedding model, a reranker takes a query and a document as input and directly outputs a similarity score instead of an embedding.
+ You can get a relevance score by feeding a query and a passage to the reranker,
+ and the score can be mapped to a float value in [0,1] with a sigmoid function.
+
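The sigmoid mapping can be sketched in plain Python; the raw scores below are the example outputs shown in the Usage section (illustrative values):

```python
import math

def sigmoid(x: float) -> float:
    """Map a raw reranker score onto (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Raw scores taken from the usage examples further down in this card:
print(sigmoid(-5.65234375))  # ≈ 0.0035: not relevant
print(sigmoid(5.26171875))   # ≈ 0.9948: highly relevant
```

This is exactly what `normalize=True` does in the FlagEmbedding usage below.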
+ ## Model List
+
+ | Model | Base model | Language | layerwise | feature |
+ |:---|:---:|:---:|:---:|:---|
+ | [BAAI/bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base) | [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) | Chinese and English | - | Lightweight reranker model, easy to deploy, with fast inference. |
+ | [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large) | [xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) | Chinese and English | - | Lightweight reranker model, easy to deploy, with fast inference. |
+ | [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3) | [bge-m3](https://huggingface.co/BAAI/bge-m3) | Multilingual | - | Lightweight reranker model with strong multilingual capabilities, easy to deploy, with fast inference. |
+ | [BAAI/bge-reranker-v2-gemma](https://huggingface.co/BAAI/bge-reranker-v2-gemma) | [gemma-2b](https://huggingface.co/google/gemma-2b) | Multilingual | - | Suitable for multilingual contexts; performs well in both English proficiency and multilingual capabilities. |
+ | [BAAI/bge-reranker-v2-minicpm-layerwise](https://huggingface.co/BAAI/bge-reranker-v2-minicpm-layerwise) | [MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16) | Multilingual | 8-40 | Suitable for multilingual contexts; performs well in both English and Chinese, and lets you freely select output layers to accelerate inference. |
+
+ You can select a model according to your scenario and resources:
+ - For **multilingual** use, choose [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3) or [BAAI/bge-reranker-v2-gemma](https://huggingface.co/BAAI/bge-reranker-v2-gemma).
+
+ - For **Chinese or English**, choose [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3) or [BAAI/bge-reranker-v2-minicpm-layerwise](https://huggingface.co/BAAI/bge-reranker-v2-minicpm-layerwise).
+
+ - For **efficiency**, choose [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3) or the lower layers of [BAAI/bge-reranker-v2-minicpm-layerwise](https://huggingface.co/BAAI/bge-reranker-v2-minicpm-layerwise).
+
+ - For the best performance, we recommend [BAAI/bge-reranker-v2-minicpm-layerwise](https://huggingface.co/BAAI/bge-reranker-v2-minicpm-layerwise) or [BAAI/bge-reranker-v2-gemma](https://huggingface.co/BAAI/bge-reranker-v2-gemma).
+
+ ## Usage
+ ### Using FlagEmbedding
+
+ ```shell
+ pip install -U FlagEmbedding
+ ```
+
+ #### For normal rerankers (bge-reranker-base / bge-reranker-large / bge-reranker-v2-m3)
+
+ Get relevance scores (higher scores indicate more relevance):
+
+ ```python
+ from FlagEmbedding import FlagReranker
+ reranker = FlagReranker('BAAI/bge-reranker-v2-m3', use_fp16=True)  # Setting use_fp16 to True speeds up computation with a slight performance degradation
+
+ score = reranker.compute_score(['query', 'passage'])
+ print(score)  # -5.65234375
+
+ # You can map the score into 0-1 by setting "normalize=True", which applies the sigmoid function to the score
+ score = reranker.compute_score(['query', 'passage'], normalize=True)
+ print(score)  # 0.003497010252573502
+
+ scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']])
+ print(scores)  # [-8.1875, 5.26171875]
+
+ # You can map the scores into 0-1 by setting "normalize=True", which applies the sigmoid function to the scores
+ scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']], normalize=True)
+ print(scores)  # [0.00027803096387751553, 0.9948403768236574]
+ ```
+
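In a retrieval pipeline these scores are typically used to reorder candidate passages. A minimal sketch in plain Python; the hard-coded scores stand in for `reranker.compute_score(...)` output (taken from the example above), so no model download is needed:

```python
query = 'what is panda?'
passages = [
    'hi',
    'The giant panda (Ailuropoda melanoleuca) is a bear species endemic to China.',
]
# Placeholder scores standing in for reranker.compute_score([[query, p] for p in passages]):
scores = [-8.1875, 5.26171875]

# Sort passages by score, highest (most relevant) first:
ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
for passage, score in ranked:
    print(f'{score:+9.4f}  {passage}')
```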
+ #### For LLM-based reranker
+
+ ```python
+ from FlagEmbedding import FlagLLMReranker
+ reranker = FlagLLMReranker('BAAI/bge-reranker-v2-gemma', use_fp16=True)  # Setting use_fp16 to True speeds up computation with a slight performance degradation
+ # reranker = FlagLLMReranker('BAAI/bge-reranker-v2-gemma', use_bf16=True)  # You can also set use_bf16=True to speed up computation with a slight performance degradation
+
+ score = reranker.compute_score(['query', 'passage'])
+ print(score)
+
+ scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']])
+ print(scores)
+ ```
+
+ #### For LLM-based layerwise reranker
+
+ ```python
+ from FlagEmbedding import LayerWiseFlagLLMReranker
+ reranker = LayerWiseFlagLLMReranker('BAAI/bge-reranker-v2-minicpm-layerwise', use_fp16=True)  # Setting use_fp16 to True speeds up computation with a slight performance degradation
+ # reranker = LayerWiseFlagLLMReranker('BAAI/bge-reranker-v2-minicpm-layerwise', use_bf16=True)  # You can also set use_bf16=True to speed up computation with a slight performance degradation
+
+ score = reranker.compute_score(['query', 'passage'], cutoff_layers=[28])  # Adjust 'cutoff_layers' to pick which layers are used for computing the score
+ print(score)
+
+ scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']], cutoff_layers=[28])
+ print(scores)
+ ```
+
+ ### Using Hugging Face Transformers
+
+ #### For normal rerankers (bge-reranker-base / bge-reranker-large / bge-reranker-v2-m3)
+
+ Get relevance scores (higher scores indicate more relevance):
+
+ ```python
+ import torch
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-v2-m3')
+ model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-v2-m3')
+ model.eval()
+
+ pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
+ with torch.no_grad():
+     inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
+     scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
+     print(scores)
+ ```
+
+ #### For LLM-based reranker
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ def get_inputs(pairs, tokenizer, prompt=None, max_length=1024):
+     if prompt is None:
+         prompt = "Given a query A and a passage B, determine whether the passage contains an answer to the query by providing a prediction of either 'Yes' or 'No'."
+     sep = "\n"
+     prompt_inputs = tokenizer(prompt, return_tensors=None, add_special_tokens=False)['input_ids']
+     sep_inputs = tokenizer(sep, return_tensors=None, add_special_tokens=False)['input_ids']
+     inputs = []
+     for query, passage in pairs:
+         query_inputs = tokenizer(f'A: {query}', return_tensors=None, add_special_tokens=False,
+                                  max_length=max_length * 3 // 4, truncation=True)
+         passage_inputs = tokenizer(f'B: {passage}', return_tensors=None, add_special_tokens=False,
+                                    max_length=max_length, truncation=True)
+         item = tokenizer.prepare_for_model(
+             [tokenizer.bos_token_id] + query_inputs['input_ids'],
+             sep_inputs + passage_inputs['input_ids'],
+             truncation='only_second',
+             max_length=max_length,
+             padding=False,
+             return_attention_mask=False,
+             return_token_type_ids=False,
+             add_special_tokens=False
+         )
+         item['input_ids'] = item['input_ids'] + sep_inputs + prompt_inputs
+         item['attention_mask'] = [1] * len(item['input_ids'])
+         inputs.append(item)
+     return tokenizer.pad(
+         inputs,
+         padding=True,
+         max_length=max_length + len(sep_inputs) + len(prompt_inputs),
+         pad_to_multiple_of=8,
+         return_tensors='pt',
+     )
+
+ tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-v2-gemma')
+ model = AutoModelForCausalLM.from_pretrained('BAAI/bge-reranker-v2-gemma')
+ yes_loc = tokenizer('Yes', add_special_tokens=False)['input_ids'][0]
+ model.eval()
+
+ pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
+ with torch.no_grad():
+     inputs = get_inputs(pairs, tokenizer)
+     scores = model(**inputs, return_dict=True).logits[:, -1, yes_loc].view(-1, ).float()
+     print(scores)
+ ```
+
+ #### For LLM-based layerwise reranker
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ def get_inputs(pairs, tokenizer, prompt=None, max_length=1024):
+     if prompt is None:
+         prompt = "Given a query A and a passage B, determine whether the passage contains an answer to the query by providing a prediction of either 'Yes' or 'No'."
+     sep = "\n"
+     prompt_inputs = tokenizer(prompt, return_tensors=None, add_special_tokens=False)['input_ids']
+     sep_inputs = tokenizer(sep, return_tensors=None, add_special_tokens=False)['input_ids']
+     inputs = []
+     for query, passage in pairs:
+         query_inputs = tokenizer(f'A: {query}', return_tensors=None, add_special_tokens=False,
+                                  max_length=max_length * 3 // 4, truncation=True)
+         passage_inputs = tokenizer(f'B: {passage}', return_tensors=None, add_special_tokens=False,
+                                    max_length=max_length, truncation=True)
+         item = tokenizer.prepare_for_model(
+             [tokenizer.bos_token_id] + query_inputs['input_ids'],
+             sep_inputs + passage_inputs['input_ids'],
+             truncation='only_second',
+             max_length=max_length,
+             padding=False,
+             return_attention_mask=False,
+             return_token_type_ids=False,
+             add_special_tokens=False
+         )
+         item['input_ids'] = item['input_ids'] + sep_inputs + prompt_inputs
+         item['attention_mask'] = [1] * len(item['input_ids'])
+         inputs.append(item)
+     return tokenizer.pad(
+         inputs,
+         padding=True,
+         max_length=max_length + len(sep_inputs) + len(prompt_inputs),
+         pad_to_multiple_of=8,
+         return_tensors='pt',
+     )
+
+ tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-v2-minicpm-layerwise', trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained('BAAI/bge-reranker-v2-minicpm-layerwise', trust_remote_code=True, torch_dtype=torch.bfloat16)
+ model = model.to('cuda')
+ model.eval()
+
+ pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
+ with torch.no_grad():
+     inputs = get_inputs(pairs, tokenizer).to(model.device)
+     all_scores = model(**inputs, return_dict=True, cutoff_layers=[28])
+     all_scores = [scores[:, -1].view(-1, ).float() for scores in all_scores[0]]
+     print(all_scores)
+ ```
+
+ ## Fine-tune
+
+ ### Data Format
+
+ Training data should be a JSON Lines file, where each line is a dict like this:
+
+ ```
+ {"query": str, "pos": List[str], "neg": List[str], "prompt": str}
+ ```
+
+ `query` is the query, `pos` is a list of positive texts, `neg` is a list of negative texts, and `prompt` indicates the relationship between the query and the texts. If you have no negative texts for a query, you can randomly sample some from the entire corpus as negatives.
+
+ See [toy_finetune_data.jsonl](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker/toy_finetune_data.jsonl) for a toy data file.
+
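Randomly sampling negatives from the corpus, as suggested above, might look like this (the tiny corpus, query, and prompt text here are illustrative):

```python
import json
import random

corpus = [
    'The giant panda is a bear species endemic to China.',
    'Paris is the capital of France.',
    'Photosynthesis converts light energy into chemical energy.',
    'The Danube flows through ten countries.',
]
query = 'what is panda?'
pos = ['The giant panda is a bear species endemic to China.']

# Sample negatives only from corpus documents that are not positives:
candidates = [doc for doc in corpus if doc not in pos]
neg = random.sample(candidates, k=2)

record = {'query': query, 'pos': pos, 'neg': neg, 'prompt': ''}
print(json.dumps(record))  # one line of the training .jsonl file
```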
+ ### Train
+
+ You can fine-tune the reranker with the following code:
+
+ **For llm-based reranker**
+
+ ```shell
+ torchrun --nproc_per_node {number of gpus} \
+ -m FlagEmbedding.llm_reranker.finetune_for_instruction.run \
+ --output_dir {path to save model} \
+ --model_name_or_path google/gemma-2b \
+ --train_data ./toy_finetune_data.jsonl \
+ --learning_rate 2e-4 \
+ --num_train_epochs 1 \
+ --per_device_train_batch_size 1 \
+ --gradient_accumulation_steps 16 \
+ --dataloader_drop_last True \
+ --query_max_len 512 \
+ --passage_max_len 512 \
+ --train_group_size 16 \
+ --logging_steps 1 \
+ --save_steps 2000 \
+ --save_total_limit 50 \
+ --ddp_find_unused_parameters False \
+ --gradient_checkpointing \
+ --deepspeed stage1.json \
+ --warmup_ratio 0.1 \
+ --bf16 \
+ --use_lora True \
+ --lora_rank 32 \
+ --lora_alpha 64 \
+ --use_flash_attn True \
+ --target_modules q_proj k_proj v_proj o_proj
+ ```
+
+ **For llm-based layerwise reranker**
+
+ ```shell
+ torchrun --nproc_per_node {number of gpus} \
+ -m FlagEmbedding.llm_reranker.finetune_for_layerwise.run \
+ --output_dir {path to save model} \
+ --model_name_or_path openbmb/MiniCPM-2B-dpo-bf16 \
+ --train_data ./toy_finetune_data.jsonl \
+ --learning_rate 2e-4 \
+ --num_train_epochs 1 \
+ --per_device_train_batch_size 1 \
+ --gradient_accumulation_steps 16 \
+ --dataloader_drop_last True \
+ --query_max_len 512 \
+ --passage_max_len 512 \
+ --train_group_size 16 \
+ --logging_steps 1 \
+ --save_steps 2000 \
+ --save_total_limit 50 \
+ --ddp_find_unused_parameters False \
+ --gradient_checkpointing \
+ --deepspeed stage1.json \
+ --warmup_ratio 0.1 \
+ --bf16 \
+ --use_lora True \
+ --lora_rank 32 \
+ --lora_alpha 64 \
+ --use_flash_attn True \
+ --target_modules q_proj k_proj v_proj o_proj \
+ --start_layer 8 \
+ --head_multi True \
+ --head_type simple \
+ --lora_extra_parameters linear_head
+ ```
+
+ Our rerankers are initialized from [google/gemma-2b](https://huggingface.co/google/gemma-2b) (for the llm-based reranker) and [openbmb/MiniCPM-2B-dpo-bf16](https://huggingface.co/openbmb/MiniCPM-2B-dpo-bf16) (for the llm-based layerwise reranker), and we train them on a mixture of multilingual datasets:
+
+ - [bge-m3-data](https://huggingface.co/datasets/Shitao/bge-m3-data)
+ - [quora train data](https://huggingface.co/datasets/quora)
+ - [fever train data](https://fever.ai/dataset/fever.html)
+
+ ## Evaluation
+
+ - llama-index.
+
+ ![image-20240317193909373](./assets/llama-index.png)
+
+ - BEIR.
+
+ Rerank the top 100 results from bge-en-v1.5 large:
+
+ ![image-20240317174633333](./assets/BEIR-bge-en-v1.5.png)
+
+ Rerank the top 100 results from e5 mistral 7b instruct:
+
+ ![image-20240317172949713](./assets/BEIR-e5-mistral.png)
+
+ - CMTEB-retrieval.
+ Rerank the top 100 results from bge-zh-v1.5 large:
+
+ ![image-20240317173026235](./assets/CMTEB-retrieval-bge-zh-v1.5.png)
+
+ - miracl (multilingual).
+ Rerank the top 100 results from bge-m3:
+
+ ![image-20240317173117639](./assets/miracl-bge-m3.png)
+
+ ## Citation
+
+ If you find this repository useful, please consider giving it a star and a citation:
+
+ ```bibtex
+ @misc{li2023making,
+   title={Making Large Language Models A Better Foundation For Dense Retrieval},
+   author={Chaofan Li and Zheng Liu and Shitao Xiao and Yingxia Shao},
+   year={2023},
+   eprint={2312.15503},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
+ }
+ @misc{chen2024bge,
+   title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
+   author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
+   year={2024},
+   eprint={2402.03216},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
+ }
+ ```
config.json ADDED
@@ -0,0 +1,34 @@
+ {
+   "_name_or_path": "BAAI/bge-m3",
+   "architectures": [
+     "XLMRobertaForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 1024,
+   "id2label": {
+     "0": "LABEL_0"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 4096,
+   "label2id": {
+     "LABEL_0": 0
+   },
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 8194,
+   "model_type": "xlm-roberta",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 24,
+   "output_past": true,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.38.1",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 250002
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d9e3e081faff1eefb84019509b2f5558fd74c1a05a2c7db22f74174fcedb5286
+ size 2271071852
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
+ size 5069051
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:69564b696052886ed0ac63fa393e928384e0f8caada38c1f4864a9bfbf379c15
+ size 17098273
tokenizer_config.json ADDED
@@ -0,0 +1,55 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "250001": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "mask_token": "<mask>",
+   "model_max_length": 8192,
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "sp_model_kwargs": {},
+   "tokenizer_class": "XLMRobertaTokenizer",
+   "unk_token": "<unk>"
+ }