Add new SentenceTransformer model
Browse files
- README.md +80 -78
- config.json +14 -20
- config_sentence_transformers.json +1 -1
- model.safetensors +2 -2
- special_tokens_map.json +7 -43
- tokenizer.json +26 -88
- tokenizer_config.json +19 -21
- vocab.json +0 -0
README.md
CHANGED
@@ -6,65 +6,67 @@ tags:
 - generated_from_trainer
 - dataset_size:4893
 - loss:TripletLoss
-base_model:
+base_model: distilbert/distilroberta-base
 widget:
-- source_sentence:
-
-
-
-    same time. The possible existence of a parallel universe has been scientifically
-    conceded, Captain.[SEP]All right. What would happen if another universe, say a
-    minus universe, came into contact with a positive universe such as ours?
+- source_sentence: Indeed. Allow me to rephrase. Will you join me for dinner? I am
+    honoured, Commander. Are the guards also invited? Mister Spock. That corridor
+    is forbidden to all but loyal Romulans. Of course. I shall obey your restrictions.[SEP]I
+    hope that one day there will be no need for you to observe any restrictions.
   sentences:
-  - '
-  -
-  -
+  - ' EKG showed arrhythmia, pRobably just a mild heart attack.'
+  - It would be illogical to assume that all conditions remain stable.
+  - Your very presence will destroy the people you seek. Surely you know that.
-- source_sentence:
-
-
-
-    There's an inscription, several languages.[SEP]The Keeper's dead.
+- source_sentence: Mudd. And he has Christine. She's in danger. My love. He's going
+    planet side. No. Not with my Christine. Relax, darling. I'll set you down somewhere
+    safe and then I'll be off discreetly. We must go after them, Captain. I'll lead
+    a landing party.[SEP]Spock, you're obviously not yourself. Maybe some rest.
   sentences:
-  - '
-
-
-
-
-
-
-
-
+  - ' If it''s meningococcus, half the passengers on this plane could get infected
+    and die before we reach New York.'
+  - Tactically well planned. When the Federation investigates, we'll be recorded as
+    just another mysterious starship disappearance.
+  - Captain, I insist upon going. Christine. I can't stand the thought of any danger
+    to her, to the woman I love.
+- source_sentence: That is precisely why we should not fight. My ship is at stake.
+    I will not harm others, Captain. His convictions are most profound in this matter,
+    Captain. So are mine, Spock. If I believed that there was a peaceful way out of
+    this[SEP]The risk will be mine alone. If I fail, you lose nothing. After all,
+    I'm no warrior.
   sentences:
-  -
-
-  -
-  -
-
-
-
-
+  - The captain knows that I have fought at his side before and will do so now, if
+    need be. However, I too, am a Vulcan, bred to peace. Let him attempt it.
+  - ' A torch test could "�'
+  - I have retained more strength than any of you. My internal structure is different,
+    Captain, my life span longer. It is wiser if I go to the temple to try and find
+    the communicators and contact the ship.
+- source_sentence: So now it has virtually unlimited power. Captain, what'll we do?
+    Spock, Scotty, come with me. Report, Spock. The multitronic unit is drawing more
+    and more power directly from the warp engines. The computer now controls all helm,
+    navigation, and engineering functions. And communications and fire control.[SEP]We'll
+    reach the rendezvous point for the war games within an hour. We must regain control
+    of the ship by then.
   sentences:
-  -
-
-  -
-
-
-
-
-
-
-
+  - There is one possibility. The automatic helm navigation circuit relays might be
+    disrupted from engineering level three.
+  - Nothing there.
+  - ' Wow, you remember where our first date was? I didn''t think you were paying
+    attention.'
+- source_sentence: I want facts, not poetry. I have given you the facts, Captain.
+    The entire magnetic field in this solar system simply blinked. The planet below,
+    the mass of which we're measuring, attained zero gravity. That's impossible. What
+    you're describing Is non-existence. Standard General Alert signal from Starfleet
+    Command, Captain.[SEP]All stations to immediate alert status. Stand by.
   sentences:
-  -
-  - '
-
+  - As you may recall from your histories, this conflict was fought,
+  - ' Mm hmm. [Quick wink to the parents.] Okay, lean forwards. Now hold very still,
+    okay? [He picks at Clancy''s neck with some tweezers.] Got it!'
+  - Captain, scanners now report a life object on the planet surface below.
 pipeline_tag: sentence-similarity
 library_name: sentence-transformers
 metrics:
 - cosine_accuracy
 model-index:
-- name: SentenceTransformer based on
+- name: SentenceTransformer based on distilbert/distilroberta-base
   results:
   - task:
       type: triplet
@@ -74,7 +76,7 @@ model-index:
       type: evaluator_enc
     metrics:
     - type: cosine_accuracy
-      value: 0.
+      value: 0.9995912313461304
      name: Cosine Accuracy
   - task:
      type: triplet
@@ -84,19 +86,19 @@ model-index:
       type: evaluator_val
     metrics:
     - type: cosine_accuracy
-      value: 0.
+      value: 0.9861111044883728
      name: Cosine Accuracy
 ---
 
-# SentenceTransformer based on
+# SentenceTransformer based on distilbert/distilroberta-base
 
-This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [
+This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [distilbert/distilroberta-base](https://huggingface.co/distilbert/distilroberta-base). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
 
 ## Model Details
 
 ### Model Description
 - **Model Type:** Sentence Transformer
-- **Base model:** [
+- **Base model:** [distilbert/distilroberta-base](https://huggingface.co/distilbert/distilroberta-base) <!-- at revision fb53ab8802853c8e4fbdbcd0529f21fc6f459b2b -->
 - **Maximum Sequence Length:** 128 tokens
 - **Output Dimensionality:** 768 dimensions
 - **Similarity Function:** Cosine Similarity
@@ -114,7 +116,7 @@ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [m
 
 ```
 SentenceTransformer(
-  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model:
+  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: RobertaModel
   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
 )
 ```
@@ -137,9 +139,9 @@ from sentence_transformers import SentenceTransformer
 model = SentenceTransformer("greatakela/gnlp_hw1_encoder")
 # Run inference
 sentences = [
-    "
-
-    "
+    "I want facts, not poetry. I have given you the facts, Captain. The entire magnetic field in this solar system simply blinked. The planet below, the mass of which we're measuring, attained zero gravity. That's impossible. What you're describing Is non-existence. Standard General Alert signal from Starfleet Command, Captain.[SEP]All stations to immediate alert status. Stand by.",
+    'Captain, scanners now report a life object on the planet surface below.',
+    " Mm hmm. [Quick wink to the parents.] Okay, lean forwards. Now hold very still, okay? [He picks at Clancy's neck with some tweezers.] Got it!",
 ]
 embeddings = model.encode(sentences)
 print(embeddings.shape)
@@ -186,7 +188,7 @@ You can finetune this model on your own dataset.
 
 | Metric              | evaluator_enc | evaluator_val |
 |:--------------------|:--------------|:--------------|
-| **cosine_accuracy** | **0.
+| **cosine_accuracy** | **0.9996**    | **0.9861**    |
 
 <!--
 ## Bias, Risks and Limitations
@@ -212,13 +214,13 @@ You can finetune this model on your own dataset.
 |         | sentence_0 | sentence_1 | sentence_2 |
 |:--------|:-----------|:-----------|:-----------|
 | type    | string     | string     | string     |
-| details | <ul><li>min: 2 tokens</li><li>mean: 83.
+| details | <ul><li>min: 2 tokens</li><li>mean: 83.72 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 19.05 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 19.47 tokens</li><li>max: 128 tokens</li></ul> |
 * Samples:
-  | sentence_0
-
-  | <code>
-  | <code>
-  | <code>
+  | sentence_0 | sentence_1 | sentence_2 |
+  |:-----------|:-----------|:-----------|
+  | <code>I'm not a plebe. This is today, fifteen years later. What are you doing here? I'm being exactly what you expect me to be, Jimmy boy. Did you enjoy it, Captain? Yes, I enjoyed it. After all these years. I did enjoy it. The one thing I wanted to do after all these years was to beat the tar out of Finnegan. Which supports a theory I've been formulating.[SEP]That we're all meeting people and things that we happen to be thinking about at the moment.</code> | <code>Yes. Somehow our thoughts are read, these things are quickly manufactured and provided for us.</code> | <code> You did not suddenly fall in love with me. You were looking for something, and I happened to be st "�</code> |
+  | <code>McCoy here. Received and understood. But we still have some doubts up here, Captain. Can you tell us any more? Not really. When do you plan to beam back up, Captain? I think we'll spend the night here, Mister Spock.[SEP]No! No, no, no.</code> | <code>And you will continue to check in every four hours?</code> | <code> Is Everything ok?</code> |
+  | <code>Do you think it would cause a complete breakdown of discipline if a lowly lieutenant kissed a Starship Captain on the bridge of his ship? Let's try. See? No change. Discipline goes on. And so must the Enterprise. Goodbye, Jim. Goodbye, Areel. Better luck next time. I had pretty good luck this time. I lost, didn't l?[SEP]She's a very good lawyer.</code> | <code>Obviously.</code> | <code> [over PA system, somberly] Ladies and gentlemen, we have a passenger with a confirmed case of bacterial meningitis.</code> |
 * Loss: [<code>TripletLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#tripletloss) with these parameters:
 ```json
 {
@@ -357,25 +359,25 @@ You can finetune this model on your own dataset.
 ### Training Logs
 | Epoch  | Step | Training Loss | evaluator_enc_cosine_accuracy | evaluator_val_cosine_accuracy |
 |:------:|:----:|:-------------:|:-----------------------------:|:-----------------------------:|
-| -1     | -1   | -             | 0.
-| 0.4902 | 300  | -             | 0.
-| 0.8170 | 500  |
-| 0.9804 | 600  | -             | 0.
-| 1.0    | 612  | -             | 0.
-| 1.4706 | 900  | -             | 0.
-| 1.6340 | 1000 | 0.
-| 1.9608 | 1200 | -             | 0.
-| 2.0    | 1224 | -             | 0.
-| 2.4510 | 1500 | 0.
-| 2.9412 | 1800 | -             | 0.
-| 3.0    | 1836 | -             | 0.
-| -1     | -1   | -             | -
+| -1     | -1   | -             | 0.5931                        | -                             |
+| 0.4902 | 300  | -             | 0.9832                        | -                             |
+| 0.8170 | 500  | 1.0694        | -                             | -                             |
+| 0.9804 | 600  | -             | 0.9926                        | -                             |
+| 1.0    | 612  | -             | 0.9939                        | -                             |
+| 1.4706 | 900  | -             | 0.9965                        | -                             |
+| 1.6340 | 1000 | 0.1834        | -                             | -                             |
+| 1.9608 | 1200 | -             | 0.9988                        | -                             |
+| 2.0    | 1224 | -             | 0.9988                        | -                             |
+| 2.4510 | 1500 | 0.0539        | 0.9992                        | -                             |
+| 2.9412 | 1800 | -             | 0.9996                        | -                             |
+| 3.0    | 1836 | -             | 0.9996                        | -                             |
+| -1     | -1   | -             | -                             | 0.9861                        |
 
 
 ### Framework Versions
 - Python: 3.11.11
 - Sentence Transformers: 3.4.1
-- Transformers: 4.
+- Transformers: 4.49.0
 - PyTorch: 2.5.1+cu124
 - Accelerate: 1.3.0
 - Datasets: 3.3.2
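The updated usage snippet above pins down the expected behaviour end to end. As a quick sanity check of the uploaded weights, one can embed a widget triple and compare scores; a minimal sketch built on the card's own example (the two long strings are abbreviated here with `[...]`, and `model.similarity` is the scorer sentence-transformers 3.x attaches from the card's similarity function, cosine in this case):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("greatakela/gnlp_hw1_encoder")

sentences = [
    "I want facts, not poetry. [...] All stations to immediate alert status. Stand by.",  # anchor (abbreviated)
    "Captain, scanners now report a life object on the planet surface below.",            # positive
    " Mm hmm. [Quick wink to the parents.] [...] Got it!",                                 # negative (abbreviated)
]
embeddings = model.encode(sentences)               # array of shape (3, 768)
scores = model.similarity(embeddings, embeddings)  # 3x3 cosine similarity matrix
print(scores)  # expect scores[0, 1] > scores[0, 2]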
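The card's tags (`loss:TripletLoss`, `dataset_size:4893`) and the sentence_0/sentence_1/sentence_2 schema describe a standard triplet fine-tune on top of distilroberta-base. A hedged sketch of that setup; the one-row dataset and the default TripletLoss margin are stand-ins, since the card's exact loss-parameter block is truncated above:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import TripletLoss

# Loading a plain transformers checkpoint makes sentence-transformers add a mean
# pooling head by default, matching the Pooling module shown in the card.
model = SentenceTransformer("distilbert/distilroberta-base")

# Stand-in triplet; the real dataset has 4,893 rows of (context, reply, distractor).
train_dataset = Dataset.from_dict({
    "sentence_0": ["There's an inscription, several languages.[SEP]The Keeper's dead."],
    "sentence_1": ["a plausible next line of dialogue"],
    "sentence_2": ["an unrelated distractor line"],
})

loss = TripletLoss(model)  # margin left at the library default; the card's value is elided

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```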
config.json
CHANGED
@@ -1,33 +1,27 @@
 {
-  "_name_or_path": "
+  "_name_or_path": "distilroberta-base",
   "architectures": [
-    "
+    "RobertaModel"
   ],
   "attention_probs_dropout_prob": 0.1,
+  "bos_token_id": 0,
+  "classifier_dropout": null,
+  "eos_token_id": 2,
   "hidden_act": "gelu",
   "hidden_dropout_prob": 0.1,
   "hidden_size": 768,
   "initializer_range": 0.02,
   "intermediate_size": 3072,
-  "layer_norm_eps": 1e-
-  "
-  "
-  "max_relative_positions": -1,
-  "model_type": "deberta",
+  "layer_norm_eps": 1e-05,
+  "max_position_embeddings": 514,
+  "model_type": "roberta",
   "num_attention_heads": 12,
-  "num_hidden_layers":
-  "pad_token_id":
-  "
-  "pooler_hidden_act": "gelu",
-  "pooler_hidden_size": 768,
-  "pos_att_type": [
-    "c2p",
-    "p2c"
-  ],
-  "position_biased_input": false,
-  "relative_attention": true,
+  "num_hidden_layers": 6,
+  "pad_token_id": 1,
+  "position_embedding_type": "absolute",
   "torch_dtype": "float32",
-  "transformers_version": "4.
-  "type_vocab_size":
+  "transformers_version": "4.49.0",
+  "type_vocab_size": 1,
+  "use_cache": true,
   "vocab_size": 50265
 }
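The effect of this change is that the checkpoint now deserializes as a 6-layer RoBERTa encoder rather than the earlier DeBERTa configuration. A small check, assuming the repo is publicly reachable:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("greatakela/gnlp_hw1_encoder")
assert config.model_type == "roberta"
assert config.num_hidden_layers == 6   # distilroberta-base is a 6-layer model
assert config.pad_token_id == 1
assert config.vocab_size == 50265
```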
config_sentence_transformers.json
CHANGED
@@ -1,7 +1,7 @@
 {
   "__version__": {
     "sentence_transformers": "3.4.1",
-    "transformers": "4.
+    "transformers": "4.49.0",
     "pytorch": "2.5.1+cu124"
   },
   "prompts": {},
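This file only records the library versions sentence-transformers saw at save time. A sketch that surfaces a mismatch against the local environment (the file is assumed to sit in the working directory):

```python
import json

import sentence_transformers
import transformers

with open("config_sentence_transformers.json") as fh:
    saved = json.load(fh)["__version__"]

local = {
    "sentence_transformers": sentence_transformers.__version__,
    "transformers": transformers.__version__,
}
for key, value in local.items():
    if saved.get(key) != value:
        print(f"{key}: saved {saved.get(key)} vs local {value}")
```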
model.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:f0502135f9f964ae538e564593524e73ba2fbe10f4e311f1ba3be445c87d2844
+size 328485128
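The weights file itself is a Git LFS pointer; the `oid` and `size` above identify the blob. A sketch that verifies a downloaded copy against the pointer (local path assumed):

```python
import hashlib

EXPECTED_OID = "f0502135f9f964ae538e564593524e73ba2fbe10f4e311f1ba3be445c87d2844"
EXPECTED_SIZE = 328_485_128

sha = hashlib.sha256()
size = 0
with open("model.safetensors", "rb") as fh:
    for chunk in iter(lambda: fh.read(1 << 20), b""):  # hash in 1 MiB chunks
        sha.update(chunk)
        size += len(chunk)

assert size == EXPECTED_SIZE, f"size mismatch: {size}"
assert sha.hexdigest() == EXPECTED_OID, "sha256 mismatch"
```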
special_tokens_map.json
CHANGED
@@ -1,51 +1,15 @@
 {
-  "bos_token":
-
-
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "cls_token": {
-    "content": "[CLS]",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "eos_token": {
-    "content": "[SEP]",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
+  "bos_token": "<s>",
+  "cls_token": "<s>",
+  "eos_token": "</s>",
   "mask_token": {
-    "content": "
+    "content": "<mask>",
     "lstrip": true,
-    "normalized": true,
-    "rstrip": false,
-    "single_word": false
-  },
-  "pad_token": {
-    "content": "[PAD]",
-    "lstrip": false,
     "normalized": false,
     "rstrip": false,
     "single_word": false
   },
-  "
-
-
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "unk_token": {
-    "content": "[UNK]",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  }
+  "pad_token": "<pad>",
+  "sep_token": "</s>",
+  "unk_token": "<unk>"
 }
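The net effect: the BERT-style `[CLS]`/`[SEP]`/`[PAD]`/`[UNK]` map is replaced by RoBERTa's `<s>`/`</s>`/`<pad>`/`<unk>`/`<mask>` set, with only `mask_token` keeping its expanded form (it needs `lstrip: true`). A quick check through transformers, assuming the repo is public:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("greatakela/gnlp_hw1_encoder")
assert tok.bos_token == tok.cls_token == "<s>"
assert tok.eos_token == tok.sep_token == "</s>"
assert tok.pad_token == "<pad>" and tok.unk_token == "<unk>"
assert tok.mask_token == "<mask>"
```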
tokenizer.json
CHANGED
@@ -10,54 +10,54 @@
     "strategy": "BatchLongest",
     "direction": "Right",
     "pad_to_multiple_of": null,
-    "pad_id":
+    "pad_id": 1,
     "pad_type_id": 0,
-    "pad_token": "
+    "pad_token": "<pad>"
   },
   "added_tokens": [
     {
       "id": 0,
-      "content": "
+      "content": "<s>",
       "single_word": false,
       "lstrip": false,
       "rstrip": false,
-      "normalized":
+      "normalized": true,
       "special": true
     },
     {
       "id": 1,
-      "content": "
+      "content": "<pad>",
       "single_word": false,
       "lstrip": false,
       "rstrip": false,
-      "normalized":
+      "normalized": true,
       "special": true
     },
     {
       "id": 2,
-      "content": "
+      "content": "</s>",
       "single_word": false,
       "lstrip": false,
       "rstrip": false,
-      "normalized":
+      "normalized": true,
       "special": true
     },
     {
       "id": 3,
-      "content": "
+      "content": "<unk>",
       "single_word": false,
       "lstrip": false,
       "rstrip": false,
-      "normalized":
+      "normalized": true,
       "special": true
     },
     {
       "id": 50264,
-      "content": "
+      "content": "<mask>",
       "single_word": false,
       "lstrip": true,
       "rstrip": false,
-      "normalized":
+      "normalized": false,
       "special": true
     }
   ],
@@ -69,79 +69,17 @@
     "use_regex": true
   },
   "post_processor": {
-    "type": "
-    "
-
-
-          "id": "[CLS]",
-          "type_id": 0
-        }
-      },
-      {
-        "Sequence": {
-          "id": "A",
-          "type_id": 0
-        }
-      },
-      {
-        "SpecialToken": {
-          "id": "[SEP]",
-          "type_id": 0
-        }
-      }
-    ],
-    "
-
-
-          "id": "[CLS]",
-          "type_id": 0
-        }
-      },
-      {
-        "Sequence": {
-          "id": "A",
-          "type_id": 0
-        }
-      },
-      {
-        "SpecialToken": {
-          "id": "[SEP]",
-          "type_id": 0
-        }
-      },
-      {
-        "Sequence": {
-          "id": "B",
-          "type_id": 1
-        }
-      },
-      {
-        "SpecialToken": {
-          "id": "[SEP]",
-          "type_id": 1
-        }
-      }
-    ],
-    "
-
-        "id": "[CLS]",
-        "ids": [
-          1
-        ],
-        "tokens": [
-          "[CLS]"
-        ]
-      },
-      "[SEP]": {
-        "id": "[SEP]",
-        "ids": [
-          2
-        ],
-        "tokens": [
-          "[SEP]"
-        ]
-      }
-    }
+    "type": "RobertaProcessing",
+    "sep": [
+      "</s>",
+      2
+    ],
+    "cls": [
+      "<s>",
+      0
+    ],
+    "trim_offsets": true,
+    "add_prefix_space": false
   },
   "decoder": {
     "type": "ByteLevel",
@@ -159,10 +97,10 @@
     "byte_fallback": false,
     "ignore_merges": false,
     "vocab": {
-      "
-      "
-      "
-      "
+      "<s>": 0,
+      "<pad>": 1,
+      "</s>": 2,
+      "<unk>": 3,
       ".": 4,
       "Ġthe": 5,
       ",": 6,
@@ -50423,7 +50361,7 @@
       "madeupword0000": 50261,
       "madeupword0001": 50262,
       "madeupword0002": 50263,
-      "
+      "<mask>": 50264
     },
     "merges": [
       [
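The most consequential change here is the post-processor: `RobertaProcessing` wraps single inputs as `<s> ... </s>` and pairs as `<s> A </s></s> B </s>`, replacing the old `[CLS]`/`[SEP]` template. A sketch of what that looks like end to end through a fast tokenizer:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("greatakela/gnlp_hw1_encoder")

single = tok("Report, Spock.")["input_ids"]
pair = tok("Report, Spock.", "Nothing there.")["input_ids"]

print(tok.convert_ids_to_tokens(single))  # starts with '<s>', ends with '</s>'
print(tok.convert_ids_to_tokens(pair))    # the two segments are joined by '</s>' '</s>'
```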
tokenizer_config.json
CHANGED
@@ -1,60 +1,58 @@
 {
-  "add_bos_token": false,
   "add_prefix_space": false,
   "added_tokens_decoder": {
     "0": {
-      "content": "
+      "content": "<s>",
       "lstrip": false,
-      "normalized":
+      "normalized": true,
       "rstrip": false,
       "single_word": false,
       "special": true
     },
     "1": {
-      "content": "
+      "content": "<pad>",
       "lstrip": false,
-      "normalized":
+      "normalized": true,
       "rstrip": false,
       "single_word": false,
       "special": true
     },
     "2": {
-      "content": "
+      "content": "</s>",
       "lstrip": false,
-      "normalized":
+      "normalized": true,
       "rstrip": false,
       "single_word": false,
       "special": true
     },
     "3": {
-      "content": "
+      "content": "<unk>",
       "lstrip": false,
-      "normalized":
+      "normalized": true,
       "rstrip": false,
       "single_word": false,
       "special": true
     },
     "50264": {
-      "content": "
+      "content": "<mask>",
       "lstrip": true,
-      "normalized":
+      "normalized": false,
       "rstrip": false,
       "single_word": false,
       "special": true
     }
   },
-  "bos_token": "
+  "bos_token": "<s>",
   "clean_up_tokenization_spaces": false,
-  "cls_token": "
-  "
-  "eos_token": "[SEP]",
+  "cls_token": "<s>",
+  "eos_token": "</s>",
   "errors": "replace",
   "extra_special_tokens": {},
-  "mask_token": "
+  "mask_token": "<mask>",
   "model_max_length": 128,
-  "pad_token": "
-  "sep_token": "
-  "tokenizer_class": "
-  "
-  "
+  "pad_token": "<pad>",
+  "sep_token": "</s>",
+  "tokenizer_class": "RobertaTokenizer",
+  "trim_offsets": true,
+  "unk_token": "<unk>"
 }
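With `tokenizer_class` now `RobertaTokenizer` and `model_max_length` kept at 128 (matching the card's maximum sequence length), batches pad with id 1 and truncate at 128 tokens. A small sketch of that behaviour:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("greatakela/gnlp_hw1_encoder")
assert tok.model_max_length == 128
assert tok.pad_token_id == 1  # '<pad>' per the updated files

batch = tok(["Nothing there.", "Captain, scanners now report a life object."],
            padding=True, truncation=True)
assert all(len(ids) <= 128 for ids in batch["input_ids"])
```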
vocab.json
CHANGED
The diff for this file is too large to render. See raw diff.