Update README.md
README.md (CHANGED)

## basic usage

> note: this is 'one way' to use the encoder, not 'the only way'. suggestions and ideas welcome.

Below is an example and a set of functions to compute the cosine similarity between the embeddings of different texts with this model.

## Functions

### load_model_and_tokenizer

<details>
<summary><b>Details</b></summary>

Loads the model and tokenizer based on `model_name`. It returns a tuple containing the loaded model and tokenizer.

```python
from typing import List, Tuple

import torch
from transformers import AutoModel, AutoModelForTextEncoding, AutoTokenizer


def load_model_and_tokenizer(model_name: str) -> Tuple[AutoModel, AutoTokenizer]:
    """
    Load the model and tokenizer based on the given model name.

    Args:
        model_name (str): The name of the model to be loaded.

    Returns:
        Tuple[AutoModelForTextEncoding, AutoTokenizer]: The loaded model and tokenizer.
    """
    model = AutoModelForTextEncoding.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model.eval()  # Deactivate Dropout
    return model, tokenizer
```

</details>

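A quick sanity check (a minimal sketch; it assumes the function above is defined and the packages listed under Usage are installed):

```python
model, tokenizer = load_model_and_tokenizer("pszemraj/flan-ul2-text-encoder")
```

Note that `device_map="auto"` requires the `accelerate` package, and `torch_dtype=torch.bfloat16` assumes your hardware supports bf16; adjust or drop these arguments if that does not match your setup.
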
### get_embeddings

This computes the embeddings for the given texts with the loaded model and tokenizer, via weighted mean pooling across seq_len (as in [SGPT](https://github.com/Muennighoff/sgpt#symmetric-semantic-search-be)).

<details>
<summary><b>Details</b></summary>

```python
def get_embeddings(model: AutoModel, tokenizer: AutoTokenizer, texts: List[str]) -> torch.Tensor:
    """
    Get the embeddings via weighted mean pooling across seq_len.

    Args:
        model (AutoModel): The model to be used for getting embeddings.
        tokenizer (AutoTokenizer): The tokenizer to be used for encoding the texts.
        texts (List[str]): The texts for which to compute embeddings.
    """
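    # NOTE: the original function body is not shown in this diff. The lines below
    # are a minimal sketch of SGPT-style weighted mean pooling and may differ
    # from the original implementation.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(model.device)
    with torch.no_grad():
        last_hidden_state = model(**batch).last_hidden_state  # (batch, seq_len, hidden)
    # Position weights 1..seq_len give later tokens more influence (weighted mean pooling).
    weights = (
        torch.arange(1, last_hidden_state.shape[1] + 1, device=last_hidden_state.device)
        .unsqueeze(0)
        .unsqueeze(-1)
        .expand(last_hidden_state.size())
        .float()
    )
    input_mask_expanded = (
        batch["attention_mask"].unsqueeze(-1).expand(last_hidden_state.size()).float()
    )
    sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded * weights, dim=1)
    sum_mask = torch.sum(input_mask_expanded * weights, dim=1)
    return sum_embeddings / sum_mask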
```

</details>

### calculate_cosine_similarity

Helper function to compute and print out the cosine similarity scores.

<details>
<summary><b>click to expand</b></summary>
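
The function body itself is not shown in this diff. A minimal sketch, assuming `scipy` (installed under Usage below) is used for the distance computation, could look like the following; treat it as an illustration rather than the exact original implementation:

```python
from scipy.spatial.distance import cosine


def calculate_cosine_similarity(embeddings: torch.Tensor, texts: List[str]) -> None:
    """Print the cosine similarity between the first text and every other text."""
    # scipy works on CPU float arrays, so cast the (possibly bf16/GPU) embeddings first
    embeddings = embeddings.float().cpu().numpy()
    for i in range(1, len(texts)):
        similarity = 1 - cosine(embeddings[0], embeddings[i])
        print(f"Cosine similarity between '{texts[0]}' and '{texts[i]}': {similarity:.4f}")
```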

</details>

## Usage

Install packages:

```bash
pip install transformers accelerate sentencepiece scipy
```

Then, you can use the functions to compute embeddings and similarity scores:

```python
model_name = "pszemraj/flan-ul2-text-encoder"
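
# The middle of this example is not shown in the diff; the lines below are an
# illustrative completion (placeholder texts) using the functions defined above.
texts = [
    "I like to eat apples.",
    "Apples are my favourite fruit.",
    "The weather forecast predicts rain tomorrow.",
]

model, tokenizer = load_model_and_tokenizer(model_name)
embeddings = get_embeddings(model, tokenizer, texts)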
calculate_cosine_similarity(embeddings, texts)
```

This will print the cosine similarity between the first text and all other texts in the `texts` list.

## References

This guide is based on the examples provided in the [sGPT repository](https://github.com/Muennighoff/sgpt#symmetric-semantic-search-be).

```
@article{muennighoff2022sgpt,
  title={SGPT: GPT Sentence Embeddings for Semantic Search},
  author={Muennighoff, Niklas},
  journal={arXiv preprint arXiv:2202.08904},
  year={2022}
}
```