<div style="border-left: 4px solid #00A000; background-color: #F0FFF0; padding: 10px; margin: 10px 0;">
    <strong>Tip:</strong> To run the default model for this notebook, we suggest using 4 x L4 GPUs. You can also use a smaller GPU, but you must swap in a smaller model; in that case, we suggest the `meta-llama/Meta-Llama-3-8B-Instruct` model.
</div>

# Creating a synthetic dataset for fine-tuning a Sentence Transformer model 

In the previous notebook, we prepared a dataset that we'll use to generate our synthetic dataset. We'll now focus on generating a dataset that we can use to train or fine-tune a sentence similarity model using the Sentence Transformers library. 

As a reminder, the type of dataset format we're working towards creating has three columns: "anchor", "positive", and "negative". In our case, "anchor" is the text from the domain we want our Sentence Transformers model to work well with. The "positive" text is a text which should be similar in some way to the "anchor" text, whilst the "negative" should be in some way dissimilar to our original text.

As a very simple example, if our "anchor" text is "Bill 179 introduces restrictions on the use of pesticides in Canada", a "positive" example might be "a law regulating the use of pesticides", and a "negative" example might be "a bill related to education funding". Whilst it is possible to train a Sentence Transformer model using only positive examples, the inclusion of negative examples can help the model to learn more about the space of possible sentences and improve its performance. The [Sentence Transformers documentation](https://www.sbert.net/docs/sentence_transformer/training_overview.html) goes into more detail about this.
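To make the target format concrete, here is the example above written as a single row. This is a minimal sketch using a plain Python dict; the variable names are only for illustration.

```python
# One training example in the (anchor, positive, negative) triplet format.
triplet = {
    "anchor": "Bill 179 introduces restrictions on the use of pesticides in Canada",
    "positive": "a law regulating the use of pesticides",
    "negative": "a bill related to education funding",
}

# A dataset is a collection of such rows, one or more per anchor text.
rows = [triplet]
columns = sorted(rows[0].keys())
print(columns)
```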

## Creating positive and negative pairs using an LLM

Large language models, especially instruction-tuned models, offer a great deal of control and flexibility in text generation. We can use this to create a synthetic dataset for training a sentence similarity model: we pass our anchor text to the LLM alongside a prompt and ask it to generate positive and negative examples, which gives us a lot of control over the dataset's composition.

### What do we mean by similarity? 

How will the model be used in practice? Since we are fine-tuning a model and have a lot of control over the dataset, we should consider what we want similarity to look like. 

In a RAG use case, we might default to creating a prompt that's something like "based on this text, write a user query that would be satisfied by this text". Whilst that could make sense, for many RAG applications the embeddings are used to give extra context to an LLM based on a user prompt. This user prompt might take the form of a query, but it's more likely to be a question than the kind of query we would make to a database. Depending on the use case, we might want to generate prompts that are more like the type of questions we expect the model to encounter.
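As a rough illustration of how the notion of similarity can be encoded in the prompt, the sketch below keeps one instruction template per style. The template wording and the `build_prompt` helper are hypothetical and only meant to show the idea.

```python
# Hypothetical instruction templates, each targeting a different notion of similarity.
TEMPLATES = {
    "search_query": (
        "Based on this text, write a short search query that this text would satisfy."
        "\n\nText: {text}"
    ),
    "user_question": (
        "Based on this text, write a question a user might ask that this text answers."
        "\n\nText: {text}"
    ),
}


def build_prompt(style: str, text: str) -> str:
    """Fill the chosen template with the anchor text."""
    return TEMPLATES[style].format(text=text)


example = build_prompt(
    "user_question",
    "This bill directs the Department of Energy to establish a grant program.",
)
print(example)
```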

Let's start with our imports.

In [1]:
import json
import random

from datasets import concatenate_datasets, load_dataset
from huggingface_hub import DatasetCard, login
from outlines import generate, models
from pydantic import BaseModel, conlist, constr
from tqdm.auto import tqdm
from vllm import LLM, SamplingParams

If you haven't authenticated with the Hugging Face Hub yet, you can do so below.

In [2]:
login()


We can now load the dataset we created in the previous step. 

In [3]:
ds = load_dataset("davanstrien/bill_summary_us_chunks", split="train")

You can see below we currently have a dataset consisting of an `id` and `section` column. In our case, the `section` column is the text we want to use as our anchor text.

In [4]:
ds

Dataset({
    features: ['id', 'section'],
    num_rows: 3446013
})

We'll create a sample of chunks from our processed dataset. We'll keep this reasonably small here, but if you use this notebook with your own dataset, you may want to increase the number of examples you work with. 

In [5]:
subset_idx = random.sample(range(len(ds)), k=1000)

In [6]:
ds = ds.select(subset_idx)
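If you want the same sample each time you rerun the notebook, one option is to draw the indices from a seeded random number generator. This is a minimal sketch; the `sample_indices` helper and the seed value are assumptions, not part of the original notebook.

```python
import random


def sample_indices(n_rows: int, k: int, seed: int = 42) -> list[int]:
    """Return k distinct row indices, reproducibly for a given seed."""
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    return rng.sample(range(n_rows), k=k)


# 3,446,013 is the number of rows reported for the dataset above.
first = sample_indices(3_446_013, k=1000)
second = sample_indices(3_446_013, k=1000)
```

The resulting list could then be passed to `ds.select(...)` in the same way as `subset_idx`.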

## Creating an LLM

Our choice of LLM is crucial, as it forms the basis for creating our positive and negative example pairs from the dataset we're working with. We need an LLM that can respond to formatted prompts. We've used `vLLM` to run our models (we'll delve into the reasons shortly). However, the question remains: which model should we choose?

### Choosing an LLM for synthetic data generation

Since we're focusing on using an open weights LLM, we've already excluded LLMs that we can only access via an API. However, we still need to choose an LLM. There are a few significant criteria we are considering:

- The size of the LLM and the resources required to run the LLM
- The performance of the LLM 
- The license of the LLM

Whilst it may seem that you want to use the largest model possible to get the best possible data, this increases the cost and time involved in generating your dataset. Whilst you could immediately use an LLM to create responses for a batch of data, when identifying a suitable LLM for generating synthetic data you often want to:

- Test a few different LLMs
- Experiment with the best prompt 

While it's possible to reuse a standard prompt, it's often more effective to modify it. For instance, when generating synthetic data for embedding datasets, we may need to capture various types of similarity through our prompts. A generic prompt may not be as effective in this scenario. Additionally, it's beneficial to iterate quickly on different LLMs to identify the ones that perform best for a range of prompts. For this initial exploration, platforms like Hugging Chat or local LLM apps like LM Studio can be used to test various LLMs and prompts.

When exploring prompts and generations for this task, I found the following models all worked pretty well:

| Model | Parameter Count |
|-------|-----------------|
| [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | 8B parameters |
| [01-ai/Yi-1.5-34B-Chat](https://huggingface.co/01-ai/Yi-1.5-34B-Chat) | 34B parameters |
| [casperhansen/llama-3-70b-instruct-awq](https://huggingface.co/casperhansen/llama-3-70b-instruct-awq) | 70B parameters |
| [Qwen/Qwen2-72B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2-72B-Instruct-AWQ) | 72B parameters |

You will see that AWQ-quantized versions are used for the Llama-3-70b and Qwen2-72b models. This is because the unquantized versions of these models are too large to fit into the combined VRAM of 4 x L4 GPUs.
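A back-of-the-envelope calculation shows why. The sketch below estimates the memory needed for the weights alone, assuming 16-bit weights for unquantized models, 4-bit weights for the AWQ versions, and 24 GB of VRAM per L4 GPU. It ignores the KV cache, activations, and quantization overhead, so real requirements are higher.

```python
def weight_memory_gb(n_params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return n_params_billion * bits_per_param / 8


TOTAL_VRAM_GB = 4 * 24  # four L4 GPUs, assuming 24 GB each

for name, params_b, bits in [
    ("Yi-1.5-34B (16-bit)", 34, 16),
    ("Llama-3-70B (16-bit)", 70, 16),
    ("Llama-3-70B (4-bit AWQ)", 70, 4),
    ("Qwen2-72B (4-bit AWQ)", 72, 4),
]:
    need = weight_memory_gb(params_b, bits)
    verdict = "fits" if need < TOTAL_VRAM_GB else "does not fit"
    print(f"{name}: ~{need:.0f} GB of weights, {verdict} in {TOTAL_VRAM_GB} GB")
```

Even when the weights fit, the remaining VRAM has to hold the KV cache, which is one reason `gpu_memory_utilization` may need tuning.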

For this notebook we use the Yi-1.5-34B model as a default. This model is a good balance between performance and resource requirements. 

We can now load the model and tokenizer. You may need to reduce the `gpu_memory_utilization` if you are using a smaller GPU or are hitting CUDA out of memory errors. You may also need to adjust this parameter if you have very long anchor texts.


<div style="border-left: 4px solid #00A000; background-color: #F0FFF0; padding: 10px; margin: 10px 0;">
    <strong>Tip:</strong> If you are running on a single GPU, update the `tensor_parallel_size` below to 1.
</div>

In [9]:
llm = LLM(
    "01-ai/Yi-1.5-34B-Chat",
    tensor_parallel_size=4,
    tokenizer="01-ai/Yi-1.5-34B-Chat",
    gpu_memory_utilization=0.9,
    enable_chunked_prefill=True,
)

2024-06-19 14:03:15,692 INFO worker.py:1753 -- Started a local Ray instance.

INFO 06-19 14:03:16 config.py:623] Defaulting to use mp for distributed inference
INFO 06-19 14:03:16 config.py:707] Chunked prefill is enabled (EXPERIMENTAL).
INFO 06-19 14:03:16 llm_engine.py:161] Initializing an LLM engine (v0.5.0) with config: model='01-ai/Yi-1.5-34B-Chat', speculative_config=None, tokenizer='01-ai/Yi-1.5-34B-Chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=01-ai/Yi-1.5-34B-Chat)
(VllmWorkerProcess pid=44273) INFO 06-19 14:03:19 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWork

Traceback (most recent call last):
  File "/home/user/miniconda/lib/python3.9/multiprocessing/resource_tracker.py", line 201, in main
    cache[rtype].remove(name)
KeyError: '/psm_e0bb0061'

(VllmWorkerProcess pid=44274) INFO 06-19 14:03:20 weight_utils.py:218] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=44273) INFO 06-19 14:03:20 weight_utils.py:218] Using model weights format ['*.safetensors']
INFO 06-19 14:03:20 weight_utils.py:218] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=44272) INFO 06-19 14:03:20 weight_utils.py:218] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=44273) INFO 06-19 14:03:32 model_runner.py:159] Loading model weights took 16.0451 GB
INFO 06-19 14:03:32 model_runner.py:159] Loading model weights took 16.0451 GB
(VllmWorkerProcess pid=44272) INFO 06-19 14:03:32 model_runner.py:159] Loading model weights took 16.0451 GB
(VllmWorkerProcess pid=44274) INFO 06-19 14:03:32 model_runner.py:159] Loading model weights took 16.0451 GB
INFO 06-19 14:03:33 distributed_gpu_executor.py:56] # GPU blocks: 3187, # CPU bloc

Below are some configs for other models you might want to experiment with. 


In [1]:
# llm = LLM(
#     "meta-llama/Meta-Llama-3-8B-Instruct",
#     tensor_parallel_size=4,
#     tokenizer="meta-llama/Meta-Llama-3-8B-Instruct",
#     gpu_memory_utilization=0.9,
#     enable_chunked_prefill=True,
# )

In [2]:
# llm = LLM(
#     "Qwen/Qwen2-72B-Instruct-AWQ",
#     quantization="AWQ",
#     tensor_parallel_size=4,
#     tokenizer="Qwen/Qwen2-72B-Instruct-AWQ",
#     gpu_memory_utilization=0.8,
#     enable_chunked_prefill=True,
# )

### Constrained Generation using Outlines


Now that we have created an LLM and loaded our dataset, we can move to the next step of creating our dataset. As a reminder, we are looking to create a dataset with three columns: "anchor", "positive", and "negative", using the anchor text from our dataset as a starting point.

We can do something like this as a prompt:

"Given the following text, write a sentence that is similar to the text. Write a sentence that is dissimilar to the text."

This prompt (might!) generate a sentence that is similar to the anchor text and a sentence that is dissimilar to the anchor text. However, we may want some way to have more certainty that the LLM produces the text we want (at least in terms of format). For example, if we want to generate valid JSON, the first token generated should be `{`.

Since we want to have two types of output, "good" and "bad" (aka positive and negative examples), we may want to produce data that looks something like:

{"good": "a sentence that is similar to the text", "bad": "a sentence that is dissimilar to the text"}


A strong LLM with a good prompt will probably already do this, but we can also use a technique called guided generation to enforce it more strongly. This has a few advantages:

- the outputs are what we expect, i.e. valid JSON, which makes processing them in later steps easier
- we can have some additional control over other parts of the generation process
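The first advantage is easy to see with the standard library alone. The sketch below compares parsing a well-formed output with parsing a free-form reply; both strings are made-up examples rather than real model outputs.

```python
import json

structured = (
    '{"good": "a sentence that is similar to the text", '
    '"bad": "a sentence that is dissimilar to the text"}'
)
free_form = "Sure! Here is a similar sentence: ... and here is a dissimilar one: ..."


def try_parse(output: str):
    """Return the parsed dict, or None if the output is not valid JSON."""
    try:
        return json.loads(output)
    except json.JSONDecodeError:
        return None


parsed_ok = try_parse(structured)  # a dict with "good" and "bad" keys
parsed_bad = try_parse(free_form)  # None, so this row would need a retry or be dropped
```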

For guided generation, we'll use a library called [Outlines](https://github.com/outlines-dev/outlines), which allows you to create constraints on the generation process.

> Outlines〰 is a Python library that allows you to use Large Language Model in a simple and robust way (with structured generation).

One of the advantages of Outlines over some other libraries is that it does structured generation by directly altering the behavior of the model's generation process rather than using prompting and lots of retries. The Outlines library has an integration with `vLLM`, which we used to load our model.

We can start by passing the `llm` we just created into the Outlines `models.VLLM` class.

In [10]:
model = models.VLLM(llm)

### Defining the constraints

We can now define the constraints we want to use. We can do this in various ways using the Outlines library, but one nice way is to use a Pydantic class to represent the output we want. If you are not familiar with Pydantic, it is a

> fast and extensible library for validating and serializing data using Python type hints.

In our case we want to have two fields "good" and "bad" which are strings. We could define this as a Pydantic class like this:


```python

from pydantic import BaseModel

class Example(BaseModel):
    good: str
    bad: str

```

This will ensure we get a JSON object with two fields, "good" and "bad", which are strings. However, when creating data for a sentence embedding task, it can be useful to have multiple generations for each anchor text. In particular, for the `bad` examples we may want the LLM to produce multiple bad examples and then choose the "hard" negative example, i.e. the one that is most similar to the anchor text. This can help the model learn more quickly and deal with more complex examples.

We can define this as a Pydantic class like this:

```python

from pydantic import BaseModel
from typing import List

class Example(BaseModel):
    good: str
    bad: List[str]

```

You can see that we now have a list of strings for the "bad" field. This will allow us to generate multiple negative examples for each anchor text.
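Once we have several negatives per anchor, picking the "hard" one amounts to scoring each against the anchor and keeping the highest-scoring one. The sketch below uses word overlap (Jaccard similarity) as a simple stand-in; in practice you would more likely score with the cosine similarity of embeddings. The example texts and helper names are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two strings, from 0.0 to 1.0."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def hardest_negative(anchor: str, negatives: list[str]) -> str:
    """Pick the negative that is most similar to the anchor."""
    return max(negatives, key=lambda neg: jaccard(anchor, neg))


anchor = "a bill that restricts the use of pesticides on farms"
negatives = [
    "a bill related to education funding",
    "a bill that restricts the use of fertilizers on farms",
    "a treaty on maritime borders",
]
hard = hardest_negative(anchor, negatives)
print(hard)
```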

In addition to controlling the structure of the JSON, we can also define some additional constraints. For this example we'll create a Pydantic class that has:

- a good field which is a list of 1 string
- a bad field which is a list of 3 strings
- we also specify a constraint on the minimum and maximum length of the strings



In [11]:
class AbstractDescriptions(BaseModel):
    good: conlist(constr(min_length=20, max_length=200), min_length=1, max_length=1)  # type: ignore
    bad: conlist(constr(min_length=20, max_length=200), min_length=3, max_length=3)  # type: ignore


schema = AbstractDescriptions.model_json_schema()

Although it can be tempting to make a Pydantic class that has a lot of constraints, it's often better to start with a simple class and then add more constraints as needed. Whilst Outlines will help guide the model to the output you want, if that output is very hard for an LLM to generate, you may get suboptimal results.
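To see exactly what the constraints in `AbstractDescriptions` require, it can help to write them out as a plain check. The sketch below uses only the standard library and is not a replacement for Pydantic; the `check_generation` helper and the example strings are hypothetical.

```python
import json


def check_generation(raw: str) -> bool:
    """Check a JSON string against the same limits as AbstractDescriptions."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        return False
    good, bad = data.get("good"), data.get("bad")
    if not isinstance(good, list) or not isinstance(bad, list):
        return False
    # Exactly one good description and exactly three bad ones.
    if len(good) != 1 or len(bad) != 3:
        return False
    # Every description must be a string of 20 to 200 characters.
    return all(isinstance(s, str) and 20 <= len(s) <= 200 for s in good + bad)


valid = json.dumps({
    "good": ["A legislative proposal to promote energy efficiency"],
    "bad": [
        "A federal reform of planning applications for pipelines",
        "A bill related to education funding for schools",
        "A treaty covering maritime borders and fishing",
    ],
})
too_few = json.dumps({
    "good": ["A legislative proposal to promote energy efficiency"],
    "bad": ["too short"],
})
print(check_generation(valid), check_generation(too_few))
```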

## Creating our prompt

Now we turn to writing a prompt that will generate the data we want. In this case, we write a prompt that is focused on the specific data we are working with (US legislation). We give the LLM a few-shot example to help it understand the type of response we want as well as the format.

In addition, we pass in the constraints we defined above as a JSON Schema. This is another reason you may want to limit the number of constraints you have: as the JSON Schema gets larger, the size of your prompts grows with it.

In [12]:
def format_prompt(text: str) -> str:
    return f"""Write one good and three bad abstract descriptions of the following text. Output the descriptions in a JSON file with keys ‘good’ and ‘bad’.
Example:
Text: Community Energy Savings Program Act of 2019\n\nThis bill directs the Department of Energy to establish a grant program for states and Indian tribes to provide loans to consumers and communities that want to implement cost-effective energy efficiency measures.
Good description: A legislative proposal to promote energy efficiency through financial incentives
Bad description: A federal reform relating to the process for submitting planning applications related to oil pipelines

Note: Descriptions can vary in abstraction, detail, and focus. Both good and bad descriptions should be short (max 20 words)

Text to describe: {text}
Return a JSON object with the keys 'good' and 'bad' using this schema: {schema}."""

We now add a new column to our dataset called `prompt`, which contains the prompt we want to use.

In [15]:
ds = ds.map(lambda x: {"prompt": format_prompt(x['section'])})

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

We now create a `generator` object using Outlines. We pass our model and the `AbstractDescriptions` Pydantic model we created to Outlines `generate.json`. Outlines will then "compile" our Pydantic model into a [finite state machine](https://en.wikipedia.org/wiki/Finite-state_machine) (FSM) representation of our Schema. Outlines constrains the possible tokens that the LLM can generate at different steps during the generation.

In [17]:
generator = generate.json(model, AbstractDescriptions)

Compiling FSM index for all state transitions: 100%|██████████| 4845/4845 [01:44<00:00, 46.55it/s]
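To build some intuition for how a finite state machine can constrain generation, here is a toy sketch. It is not how Outlines is implemented: the hand-written `FSM` table, the tiny `VOCAB`, and the random choice standing in for model sampling are all simplifying assumptions.

```python
import json
import random

# Toy finite state machine: each state maps an allowed token to the next state.
# It only accepts strings of the form {"good":"yes"} or {"good":"no"}.
FSM = {
    "start": {"{": "key"},
    "key": {'"good"': "colon"},
    "colon": {":": "value"},
    "value": {'"yes"': "close", '"no"': "close"},
    "close": {"}": "done"},
}
VOCAB = ["{", "}", ":", '"good"', '"yes"', '"no"', "hello", "Sure!"]


def constrained_sample(rng: random.Random) -> str:
    """Sample token by token, only ever choosing tokens the FSM allows."""
    state, output = "start", []
    while state != "done":
        # Mask the vocabulary down to the tokens valid in the current state.
        allowed = [tok for tok in VOCAB if tok in FSM[state]]
        # Stand-in for sampling from the model's masked token probabilities.
        token = rng.choice(allowed)
        output.append(token)
        state = FSM[state][token]
    return "".join(output)


text = constrained_sample(random.Random(0))
parsed = json.loads(text)  # always parses, because invalid tokens were never allowed
```

Tokens such as `hello` or `Sure!` can never be emitted here, which is the key difference from relying on prompting and retries.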


We now have a `generator` to which we can pass our prompts. Although we created this via Outlines, under the hood the generation is being done by `vLLM`. `vLLM` supports [offline batched inference](https://docs.vllm.ai/en/latest/getting_started/quickstart.html#offline-batched-inference). Many LLM serving engines are primarily optimized for deployment via an API where a variable number of requests will be sent to the model. In our case, we have a batch of prompts for which we want to generate responses. The vLLM `LLM` Class:

> includes a tokenizer, a language model (possibly distributed across multiple GPUs), and GPU memory space allocated for intermediate states (aka KV cache). Given a batch of prompts and sampling parameters, this Class generates texts from the model using an intelligent batching mechanism and efficient memory management.

While we could still work to optimize throughput further, this Class represents a good starting point for synthetic data generation. 

### Generating our dataset

We are now ready to generate our dataset. We'll use the `generator` object to generate our data. We'll run our dataset as batches to allow us to retry generating data that fails. 

In the below `process_batch` function, we roughly do the following:
- Split our dataset into an initial batch size
- Try and generate a response for the batch
- If we have errors for the batch, we split the batch that failed into a smaller subset and retry
- We go through this process with a maximum number of retries and a minimum batch size

There are ways we could further optimize this, but this works quite well for not skipping a whole batch whilst not wasting too much compute/time retrying tiny batches to generate a bit of extra data. 
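The split-and-retry idea can be sketched on plain Python lists. This is a simplified version that halves a failing batch at each retry, which differs from the sharding used in `process_batch`; `fake_generate` is a hypothetical stand-in for the LLM call.

```python
def generate_with_retries(items, generate_fn, retry_count=0, max_retries=3, min_batch_size=2):
    """Run generate_fn on a batch; on failure, halve the batch and retry each half."""
    if not items:
        return []
    try:
        return generate_fn(items)
    except Exception:
        # Give up once retries are exhausted or the batch is too small to split.
        if retry_count >= max_retries or len(items) <= min_batch_size:
            return []
        mid = len(items) // 2
        results = []
        for half in (items[:mid], items[mid:]):
            results.extend(
                generate_with_retries(
                    half, generate_fn, retry_count + 1, max_retries, min_batch_size
                )
            )
        return results


def fake_generate(batch):
    """Hypothetical stand-in for the LLM call: fails if the batch contains 'bad'."""
    if "bad" in batch:
        raise ValueError("generation failed")
    return [text.upper() for text in batch]


prompts = ["a", "b", "c", "d", "bad", "e", "f", "g"]
results = generate_with_retries(prompts, fake_generate)
print(results)
```

In this run the prompt `"e"` is lost together with `"bad"` because their shared sub-batch reaches the minimum size, which illustrates the trade-off between recovering data and spending compute on ever smaller batches.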


We can specify some generation parameters that will be passed to vLLM when generating the data. 

In [None]:
params = SamplingParams(n=1, max_tokens=800, best_of=2, temperature=0.7)

We can now generate our data. Feel free to adjust the `batch_size`, `max_retries`, and `min_batch_size` to suit your needs.

In [18]:
def process_batch(batch, batch_size, retry_count=0, max_retries=3, min_batch_size=25):
    if len(batch) == 0:
        return []

    batch_prompts = batch["prompt"]

    try:
        batch_results = generator(batch_prompts, sampling_params=params)
        return batch.add_column(
            "generations",
            [result.model_dump_json() for result in batch_results],
        )
    except Exception as e:
        print(f"Error in generation for batch (size {len(batch)}): {e}")

        if retry_count >= max_retries or batch_size <= min_batch_size:
            print("Max retries reached or batch size too small. Skipping batch.")
            return []

        print(f"Splitting batch and retrying (retry count: {retry_count + 1})")
        num_sub_batches = 2 ** (retry_count + 1)
        sub_batch_size = max(batch_size // num_sub_batches, min_batch_size)
        sub_batches = [
            batch.shard(num_shards=num_sub_batches, index=i)
            for i in range(num_sub_batches)
        ]
        print(f"Sub batch size {sub_batch_size}")
        updated_sub_batches = []
        for sub_batch in sub_batches:
            updated_sub_batch = process_batch(
                sub_batch, sub_batch_size, retry_count + 1, max_retries, min_batch_size
            )
            if len(updated_sub_batch) > 0:
                updated_sub_batches.append(updated_sub_batch)

        return concatenate_datasets(updated_sub_batches) if updated_sub_batches else []


initial_batch_size = 100
num_batches = len(ds) // initial_batch_size + (len(ds) % initial_batch_size != 0)
dataset_parts = [ds.shard(num_shards=num_batches, index=i) for i in range(num_batches)]

updated_parts = []
for part in tqdm(dataset_parts):
    updated_part = process_batch(part, initial_batch_size)
    if len(updated_part) > 0:
        updated_parts.append(updated_part)

if updated_parts:
    ds = concatenate_datasets(updated_parts)
else:
    print("No successfully updated parts. The dataset remains unchanged.")

  0%|          | 0/10 [00:00<?, ?it/s]


Processed prompts:   0%|          | 0/100 [00:00<?, ?it/s, Generation Speed: 0.00 toks/s]
Processed prompts:   1%|          | 1/100 [00:39<1:05:13, 39.53s/it, Generation Speed: 1.64 toks/s]
Processed prompts:   2%|▏         | 2/100 [00:42<29:49, 18.26s/it, Generation Speed: 2.89 toks/s]
Processed prompts:   3%|▎         | 3/100 [00:43<16:39, 10.31s/it, Generation Speed: 4.41 toks/s]
Processed prompts:   5%|▌         | 5/100 [00:44<07:27,  4.71s/it, Generation Speed: 7.16 toks/s]
Processed prompts:   6%|▌         | 6/100 [00:45<05:37,  3.59s/it, Generation Speed: 8.50 toks/s]
Processed prompts:   8%|▊         | 8/100 [00:46<03:33,  2.32s/it, Generation Speed: 10.96 toks/s]
Processed prompts:   9%|▉         | 9/100 [00:47<02:58,  1.96s/it, Generation Speed: 12.26 toks/s]
Processed prompts:  11%|█         | 11/100 [00:52<03:17,  2.22s/it, Generation Speed: 13.56 toks/s]
Processed prompts:  12%|█▏        | 12/100 [00:55<03:23,  2.31s/it, Generation Speed: 14.2

Flattening the indices:   0%|          | 0/100 [00:00<?, ? examples/s]



If we look at our dataset we can see that we have a new column called `generations` which contains the generated data.

In [22]:
ds

Dataset({
    features: ['id', 'section', 'prompt', 'generations'],
    num_rows: 1000
})

Let's look at an example row:

In [24]:
ds[0]

{'id': '116s4049rs',
 'section': 'rear admiral in the Navy, or an equivalent grade in the Space Force is under investigation for alleged misconduct or pending the disposition of an adverse personnel action at the time of retirement, the Secretary of the military department concerned may— (A) conditionally determine the highest permanent grade of satisfactory service on active duty of the officer pending completion of the investigation or resolution of the personnel action, as applicable; and (B) retire the officer in that conditional grade, subject to subsection (e).',
 'prompt': "Write one good and three bad abstract descriptions of the following text. Output the descriptions in a JSON file with keys ‘good’ and ‘bad’.\nExample:\nText: Community Energy Savings Program Act of 2019\n\nThis bill directs the Department of Energy to establish a grant program for states and Indian tribes to provide loans to consumers and communities that want to implement cost-effective energy efficiency mea

We'll do some more work to get the dataset into a format that is compatible with the Sentence Transformers training APIs, but let's first push the raw dataset to the Hub. We can push this to a config called `raw`.

In [25]:
ds.push_to_hub('bill_summary_us_chunks-similarity', "raw")

CommitInfo(commit_url='https://huggingface.co/datasets/davanstrien/bill_summary_us_chunks-similarity-yi/commit/026cf1d1555ec6fece47b57fe54dc0d01f793221', commit_message='Upload dataset', commit_description='', oid='026cf1d1555ec6fece47b57fe54dc0d01f793221', pr_url=None, pr_revision=None, pr_num=None)

We can also confirm that an example from the `generations` column loads as valid JSON.

In [26]:
json.loads(ds[0]['generations'])

{'good': ['Retirement of rear admiral under investigation for misconduct may be conditional on the outcome of the investigation and resolution of personnel actions.'],
 'bad': ['A proposal to change the retirement benefits for military personnel based on their years of service.',
  'Legislation to implement new training programs for naval officers.',
  'Investigation process for harassment complaints in the military.']}
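
Checking a single row doesn't tell us whether every generation parsed cleanly. Below is a minimal sketch of a validity check we could run over the whole column. The helper name `is_valid_generation` and the toy `samples` are hypothetical and not part of the original notebook; the check assumes we want non-empty lists of strings under both the `good` and `bad` keys.

```python
import json


def is_valid_generation(text):
    """Return True if `text` is JSON with non-empty string lists under 'good' and 'bad'."""
    try:
        data = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        # Not parseable JSON (or not a string at all, e.g. None)
        return False
    if not isinstance(data, dict):
        return False
    for key in ("good", "bad"):
        values = data.get(key)
        # Each key must hold a non-empty list...
        if not isinstance(values, list) or not values:
            return False
        # ...whose items are all non-blank strings
        if not all(isinstance(v, str) and v.strip() for v in values):
            return False
    return True


# Toy strings standing in for rows of the `generations` column (made up for illustration).
samples = [
    '{"good": ["A summary."], "bad": ["Unrelated one.", "Unrelated two."]}',
    '{"good": [], "bad": ["Unrelated."]}',
    "not json at all",
]
validity = [is_valid_generation(s) for s in samples]
print(validity)  # [True, False, False]
```

With a check like this, rows that fail could be dropped before formatting, for example with `ds.filter(lambda row: is_valid_generation(row["generations"]))`.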

### Formatting the dataset for Sentence Transformers

The triplet format we're working towards for Sentence Transformers has three columns: "anchor", "positive", and "negative". We can now format our dataset in this way.

For now we'll just randomly select a positive and negative example from the generations we created. A later notebook will look at how we can select the best positive and negative examples for each anchor text in other ways.

In [27]:
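def format_for_st(row):
    # The anchor is the original bill section
    anchor = row["section"]
    # Parse the LLM output, which holds lists under the "good" and "bad" keys
    generations = json.loads(row["generations"])
    good = generations.get("good")
    bad = generations.get("bad")
    # Randomly pick one positive and one negative description
    positive = random.choice(good)
    negative = random.choice(bad)
    return {"anchor": anchor, "positive": positive, "negative": negative}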

In [28]:
format_for_st(ds[0])

{'anchor': 'rear admiral in the Navy, or an equivalent grade in the Space Force is under investigation for alleged misconduct or pending the disposition of an adverse personnel action at the time of retirement, the Secretary of the military department concerned may— (A) conditionally determine the highest permanent grade of satisfactory service on active duty of the officer pending completion of the investigation or resolution of the personnel action, as applicable; and (B) retire the officer in that conditional grade, subject to subsection (e).',
 'positive': 'Retirement of rear admiral under investigation for misconduct may be conditional on the outcome of the investigation and resolution of personnel actions.',
 'negative': 'Legislation to implement new training programs for naval officers.'}
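
Because `format_for_st` relies on `random.choice`, the selected pair can differ between runs. If we want the selection to be reproducible, one option is to use a seeded random number generator per row. The sketch below is an assumption about how this could be done, not what the notebook does; `pick_pair` and the toy `example` string are hypothetical.

```python
import json
import random


def pick_pair(generations_json, seed):
    """Pick one 'good' and one 'bad' description using a private, seeded RNG."""
    generations = json.loads(generations_json)
    rng = random.Random(seed)  # the same seed always yields the same choices
    positive = rng.choice(generations["good"])
    negative = rng.choice(generations["bad"])
    return positive, negative


# A made-up generation in the same shape as the `generations` column.
example = json.dumps(
    {"good": ["summary"], "bad": ["bad one", "bad two", "bad three"]}
)
first = pick_pair(example, seed=0)
second = pick_pair(example, seed=0)
print(first == second)  # True
```

When mapping over a dataset, the row index could serve as the seed, for instance by passing `with_indices=True` to `ds.map`.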

In [29]:
ds = ds.map(format_for_st)

We then remove the columns we no longer need:

In [30]:
ds = ds.remove_columns([c for c in ds.column_names if c not in {'anchor', 'positive', 'negative'}])

We now push to the Hub again. This time we don't specify a config, so the dataset will be pushed to the default config.

In [39]:
info = ds.push_to_hub('bill_summary_us_chunks-similarity')

### Add some metadata to the Dataset Card

We can now add some metadata to the dataset card. This will help us keep track of the dataset and understand how it was created.

In [38]:
card = DatasetCard.load("davanstrien/bill_summary_us_chunks-similarity")

In [33]:
card.data['tags'] = ['sentence-transformers', 'synthetic', 'synthetic-data-workshop']

In [42]:
card.push_to_hub("davanstrien/bill_summary_us_chunks-similarity")

CommitInfo(commit_url='https://huggingface.co/datasets/davanstrien/bill_summary_us_chunks-similarity/commit/8cf7a2e9209f631f5395296b0f8ee081101f4479', commit_message='Upload README.md with huggingface_hub', commit_description='', oid='8cf7a2e9209f631f5395296b0f8ee081101f4479', pr_url=None, pr_revision=None, pr_num=None)
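
Note that assigning a new list to `card.data['tags']` replaces any tags the card already had. If we wanted to keep existing tags, a small helper along these lines could merge the two lists first. This is only a sketch: `merge_tags` is a hypothetical helper, and the `current` tags are made up for illustration.

```python
def merge_tags(existing, new):
    """Return existing tags followed by new ones, skipping duplicates and keeping order."""
    merged = list(existing or [])  # tolerate a card with no tags yet (None)
    for tag in new:
        if tag not in merged:
            merged.append(tag)
    return merged


# Made-up existing tags; a real card may have none or different ones.
current = ["legal", "synthetic"]
added = ["sentence-transformers", "synthetic", "synthetic-data-workshop"]
print(merge_tags(current, added))
# ['legal', 'synthetic', 'sentence-transformers', 'synthetic-data-workshop']
```

The merged list could then be assigned to the card's tags before pushing the card to the Hub.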

fin