Update README.md
README.md (CHANGED)

## basic usage

> note: this is 'one way' to use the encoder, not 'the only way'. suggestions and ideas welcome.

Below is an example and a set of functions to compute the cosine similarity between the embeddings of different texts with this model.

## Functions

### load_model_and_tokenizer

<details>
<summary><b>Details</b></summary>

Loads the model and tokenizer based on `model_name`. It returns a tuple containing the loaded model and tokenizer.

```python
from typing import List, Tuple

import torch
from transformers import AutoModel, AutoModelForTextEncoding, AutoTokenizer


def load_model_and_tokenizer(model_name: str) -> Tuple[AutoModel, AutoTokenizer]:
    """
    Load the model and tokenizer based on the given model name.

    Args:
        model_name (str): The name of the model to be loaded.

    Returns:
        Tuple[AutoModelForTextEncoding, AutoTokenizer]: The loaded model and tokenizer.
    """
    model = AutoModelForTextEncoding.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model.eval()  # Deactivate Dropout
    return model, tokenizer
```

</details>

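A quick sanity check (a minimal sketch; it assumes the function above is defined and the packages listed under Usage are installed):

```python
model, tokenizer = load_model_and_tokenizer("pszemraj/flan-ul2-text-encoder")
```

Note that `device_map="auto"` requires the `accelerate` package, and `torch_dtype=torch.bfloat16` assumes your hardware supports bf16; adjust or drop these arguments if that does not match your setup.
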
### get_embeddings

This computes the embeddings for the given texts with the loaded model and tokenizer, via weighted mean pooling across seq_len (as in [SGPT](https://github.com/Muennighoff/sgpt#symmetric-semantic-search-be)).

<details>
<summary><b>Details</b></summary>

```python
def get_embeddings(model: AutoModel, tokenizer: AutoTokenizer, texts: List[str]) -> torch.Tensor:
    """
    Get the embeddings via weighted mean pooling across seq_len.

    Args:
        model (AutoModel): The model to be used for getting embeddings.
        tokenizer (AutoTokenizer): The tokenizer to be used for encoding the texts.
        texts (List[str]): The texts for which to compute embeddings.
    """
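    # NOTE: the original function body is not shown in this diff. The lines below
    # are a minimal sketch of SGPT-style weighted mean pooling and may differ
    # from the original implementation.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(model.device)
    with torch.no_grad():
        last_hidden_state = model(**batch).last_hidden_state  # (batch, seq_len, hidden)
    # Position weights 1..seq_len give later tokens more influence (weighted mean pooling).
    weights = (
        torch.arange(1, last_hidden_state.shape[1] + 1, device=last_hidden_state.device)
        .unsqueeze(0)
        .unsqueeze(-1)
        .expand(last_hidden_state.size())
        .float()
    )
    input_mask_expanded = (
        batch["attention_mask"].unsqueeze(-1).expand(last_hidden_state.size()).float()
    )
    sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded * weights, dim=1)
    sum_mask = torch.sum(input_mask_expanded * weights, dim=1)
    return sum_embeddings / sum_mask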
```

</details>

### calculate_cosine_similarity

Helper function to compute and print out the cosine similarity scores.

<details>
<summary><b>click to expand</b></summary>
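
The function body itself is not shown in this diff. A minimal sketch, assuming `scipy` (installed under Usage below) is used for the distance computation, could look like the following; treat it as an illustration rather than the exact original implementation:

```python
from scipy.spatial.distance import cosine


def calculate_cosine_similarity(embeddings: torch.Tensor, texts: List[str]) -> None:
    """Print the cosine similarity between the first text and every other text."""
    # scipy works on CPU float arrays, so cast the (possibly bf16/GPU) embeddings first
    embeddings = embeddings.float().cpu().numpy()
    for i in range(1, len(texts)):
        similarity = 1 - cosine(embeddings[0], embeddings[i])
        print(f"Cosine similarity between '{texts[0]}' and '{texts[i]}': {similarity:.4f}")
```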

</details>

## Usage

Install packages:

```bash
pip install transformers accelerate sentencepiece scipy
```

Then, you can use the functions to compute embeddings and similarity scores:

```python
model_name = "pszemraj/flan-ul2-text-encoder"
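
# The middle of this example is not shown in the diff; the lines below are an
# illustrative completion (placeholder texts) using the functions defined above.
texts = [
    "I like to eat apples.",
    "Apples are my favourite fruit.",
    "The weather forecast predicts rain tomorrow.",
]

model, tokenizer = load_model_and_tokenizer(model_name)
embeddings = get_embeddings(model, tokenizer, texts)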
calculate_cosine_similarity(embeddings, texts)
```

This will print the cosine similarity between the first text and all other texts in the `texts` list.

## References

This guide is based on the examples provided in the [sGPT repository](https://github.com/Muennighoff/sgpt#symmetric-semantic-search-be).

```
@article{muennighoff2022sgpt,
  title={SGPT: GPT Sentence Embeddings for Semantic Search},
  author={Muennighoff, Niklas},
  journal={arXiv preprint arXiv:2202.08904},
  year={2022}
}
```