pszemraj committed
Commit c1e1ca7 · 1 Parent(s): 3e7409c

Update README.md

Files changed (1)
  1. README.md +23 -19

README.md CHANGED
@@ -13,9 +13,9 @@ THe encoder from [flan-ul2](https://huggingface.co/google/flan-ul2). This model
 
 ## basic usage
 
-> note: this is 'a way' of using the encoder, and not 'the only way'. suggestions and ideas welcome
+> note: this is 'one way' to use the encoder, not 'the only way'. suggestions and ideas welcome.
 
-This guide provides a set of functions to calculate the cosine similarity between the embeddings of different texts. The embeddings are calculated using a pre-trained model.
+Below is an example and a set of functions to compute the cosine similarity between the embeddings of different texts with this model
 
 ## Functions
 
@@ -24,9 +24,11 @@ This guide provides a set of functions to calculate the cosine similarity betwee
 <details>
 <summary><b>Details</b></summary>
 
-This function loads the model and tokenizer based on the given model name. It returns a tuple containing the loaded model and tokenizer.
+loads the model and tokenizer based on `model_name`. It returns a tuple containing the loaded model and tokenizer.
 
 ```python
+from transformers import AutoModelForTextEncoding, AutoTokenizer
+
 def load_model_and_tokenizer(model_name: str) -> Tuple[AutoModel, AutoTokenizer]:
     """
     Load the model and tokenizer based on the given model name.
@@ -35,9 +37,9 @@ def load_model_and_tokenizer(model_name: str) -> Tuple[AutoModel, AutoTokenizer]
         model_name (str): The name of the model to be loaded.
 
     Returns:
-        Tuple[AutoModel, AutoTokenizer]: The loaded model and tokenizer.
+        Tuple[AutoModelForTextEncoding, AutoTokenizer]: The loaded model and tokenizer.
     """
-    model = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
+    model = AutoModelForTextEncoding.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
     tokenizer = AutoTokenizer.from_pretrained(model_name)
     model.eval()  # Deactivate Dropout
     return model, tokenizer
@@ -47,7 +49,7 @@ def load_model_and_tokenizer(model_name: str) -> Tuple[AutoModel, AutoTokenizer]
 
 ### get_embeddings
 
-This function gets the embeddings for the given texts using the provided model and tokenizer. It returns the calculated embeddings.
+This computes the embeddings for the given texts given the model and tokenizer via weighted mean pooling across seq_len (as in [SGPT](https://github.com/Muennighoff/sgpt#symmetric-semantic-search-be))
 
 <details>
 <summary><b>Details</b></summary>
 
@@ -56,7 +58,7 @@ This function gets the embeddings for the given texts using the provided model a
 ```python
 def get_embeddings(model: AutoModel, tokenizer: AutoTokenizer, texts: List[str]) -> torch.Tensor:
     """
-    Get the embeddings for the given texts using the provided model and tokenizer.
+    Get the embeddings via weighted mean pooling across seq_len
 
     Args:
         model (AutoModel): The model to be used for getting embeddings.
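The rest of `get_embeddings` falls between the hunks shown here, so the pooling code itself is not visible in this diff. As a point of reference only, a minimal sketch of position-weighted mean pooling over `seq_len`, in the style of the SGPT example the README links to, might look like the following; the README's actual implementation may differ in details:

```python
# Sketch only: position-weighted mean pooling in the style of the linked
# SGPT example. Not necessarily identical to the README's implementation.
from typing import List

import torch
from transformers import AutoModel, AutoTokenizer


def get_embeddings(model: AutoModel, tokenizer: AutoTokenizer, texts: List[str]) -> torch.Tensor:
    # Tokenize with padding so all texts share one batch tensor
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(model.device)

    with torch.no_grad():
        last_hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden)

    # Position weights 1..seq_len, broadcast to the hidden dimension
    weights = (
        torch.arange(1, last_hidden.shape[1] + 1, device=last_hidden.device)
        .unsqueeze(0)
        .unsqueeze(-1)
        .expand(last_hidden.size())
        .float()
    )
    # Attention mask zeroes out padding positions
    mask = batch["attention_mask"].unsqueeze(-1).expand(last_hidden.size()).float()

    # Weighted mean over the sequence dimension, ignoring padding
    summed = torch.sum(last_hidden * mask * weights, dim=1)
    denom = torch.sum(mask * weights, dim=1)
    return summed / denom
```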
@@ -103,7 +105,7 @@ def get_embeddings(model: AutoModel, tokenizer: AutoTokenizer, texts: List[str])
 
 ### calculate_cosine_similarity
 
-This function calculates and prints the cosine similarity between the first text and all other texts. It does not return anything.
+Helper fn to compute and print out cosine similarity
 
 <details>
 <summary><b>click to expand</b></summary>
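The body of this helper is likewise collapsed in the diff. A minimal sketch consistent with the description above, using `scipy` (which the Usage section installs), could be:

```python
# Sketch only: compare the first text's embedding against all the others.
from typing import List

import torch
from scipy.spatial.distance import cosine


def calculate_cosine_similarity(embeddings: torch.Tensor, texts: List[str]) -> None:
    reference = embeddings[0].float().cpu().numpy()
    for i in range(1, len(texts)):
        other = embeddings[i].float().cpu().numpy()
        similarity = 1 - cosine(reference, other)  # scipy's cosine() is a distance
        print(f"Similarity between '{texts[0]}' and '{texts[i]}': {similarity:.4f}")
```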
@@ -128,13 +130,13 @@ def calculate_cosine_similarity(embeddings: torch.Tensor, texts: List[str]) -> N
 
 ## Usage
 
-To use these functions, you need to have the `transformers` and `scipy` libraries installed. You can install these with pip:
+Install packages:
 
 ```bash
-pip install transformers scipy
+pip install transformers accelerate sentencepiece scipy
 ```
 
-Then, you can use the functions in your Python code as needed. For example:
+Then, you can use the functions to compute embeddings and similarity scores:
 
 ```python
 model_name = "pszemraj/flan-ul2-text-encoder"
@@ -153,13 +155,15 @@ calculate_cosine_similarity(embeddings, texts)
 
 This will print the cosine similarity between the first text and all other texts in the `texts` list.
 
-<details>
-<summary><b>Customization</b></summary>
-
-You can customize the texts by modifying the `texts` list. You can also use a different model by changing the `model_name` variable.
-
-</details>
-
 ## References
 
-This guide is based on the examples provided in the [sGPT repository](https://github.com/Muennighoff/sgpt#symmetric-semantic-search-be).
+This guide is based on the examples provided in the [sGPT repository](https://github.com/Muennighoff/sgpt#symmetric-semantic-search-be).
+
+```
+@article{muennighoff2022sgpt,
+  title={SGPT: GPT Sentence Embeddings for Semantic Search},
+  author={Muennighoff, Niklas},
+  journal={arXiv preprint arXiv:2202.08904},
+  year={2022}
+}
+```