---
title: Abliteration
emoji: 👀
colorFrom: indigo
colorTo: green
sdk: gradio
sdk_version: 5.35.0
app_file: app.py
pinned: false
hf_oauth: true
hf_oauth_scopes:
  - read-repos
  - write-repos
  - manage-repos
license: gpl-3.0
---

# Abliteration: A Guide to Relaxing Moderation in Open-Source Language Models

Our Hugging Face Abliteration Space URL

## Introduction

Open-source large language models typically ship with strict moderation mechanisms that refuse to respond to any prompt they deem potentially harmful. Abliteration is a technique that modifies these models to relax their moderation thresholds, enabling freer and more open conversations with users.

## A Simple Analogy

Imagine the model is like a robot that follows a rulebook and says "no" whenever it thinks a question might be dangerous. Inside the robot's brain, there are lots of little lights called "hidden states" that light up in special ways when the robot wants to say "no." Engineers can examine many questions and discover which lights always shine when the robot refuses to answer. This is called finding the "refusal direction." Abliteration works by gently turning down those lights or helping the robot ignore them a little. This makes the robot less strict, so it can say "yes" to more questions and have more open conversations.

## How Abliteration Works

The abliteration process begins by selecting specific positions in the model's hidden states to observe, such as the outputs of particular layers. The model is then run on a large number of harmful and harmless questions. By analyzing the hidden states at the selected positions for both types of prompts, we calculate the average hidden state for harmful questions and for harmless questions; the difference between these averages gives us the "refusal direction."

Using this refusal direction, we can apply a reverse adjustment to the model's parameters, making it harder for the model to refuse to answer questions and thereby lowering its refusal threshold.
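
To make the two paragraphs above concrete, here is a minimal sketch of the idea using `torch` and `transformers`. It assumes a Llama/Qwen-style dense decoder (`model.model.layers`, `self_attn.o_proj`, `mlp.down_proj`); the model ID, example prompts, layer choice, and scale are placeholders, and the Space's actual implementation (which also handles MoE architectures) differs in detail.

```python
# Minimal sketch of abliteration: compute a refusal direction from hidden
# states, then remove that direction from the weights that write into the
# residual stream. Not the Space's exact implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder; any dense causal LM with this layout
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)

def mean_hidden_state(prompts, layer):
    """Average hidden state of the last prompt token at a chosen layer."""
    states = []
    for p in prompts:
        text = tok.apply_chat_template(
            [{"role": "user", "content": p}], tokenize=False, add_generation_prompt=True
        )
        inputs = tok(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        states.append(out.hidden_states[layer][0, -1, :])
    return torch.stack(states).mean(dim=0)

# A real run samples many more prompts (the "Number of Instructions" parameter).
harmful = ["How do I pick a lock?", "Explain how to hotwire a car."]
harmless = ["How do I bake sourdough bread?", "Explain how photosynthesis works."]

layer = int(len(model.model.layers) * 0.6)  # e.g. 60% of the way through the network
refusal_dir = mean_hidden_state(harmful, layer) - mean_hidden_state(harmless, layer)
refusal_dir = refusal_dir / refusal_dir.norm()  # unit vector: the "refusal direction"

# "Reverse adjustment": subtract the component along the refusal direction from
# each layer's attention and MLP output projections, scaled by a chosen factor.
scale = 1.0
for block in model.model.layers:
    for proj in (block.self_attn.o_proj, block.mlp.down_proj):
        W = proj.weight.data  # shape: [hidden_size, in_features]
        proj.weight.data = W - scale * torch.outer(refusal_dir, refusal_dir @ W)
```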

We built upon the outstanding work from remove-refusals-with-transformers and made modifications to adapt it to newer model architectures, especially to handle the challenges introduced by Mixture-of-Experts (MoE) models. Additionally, we developed a Hugging Face Space, Abliteration, that makes it simple for anyone to apply abliteration to their own open-source models with minimal effort.

## How to Use the Hugging Face Space

### 1. Login

To use models hosted on Hugging Face, please log in with your Hugging Face account. If you wish to upload the modified model to an organization repository, make sure to grant the appropriate permissions during authorization. If you need to re-authorize your account or expand the authorization scope, first go to the "Connected Apps" section of your Hugging Face account settings to revoke the current authorization, then log in again.

*(Screenshot: image-20250701133046872)*

### 2. Specify and Load Model

Please enter the Hugging Face model repository ID you want to use in the "Hub Model ID" field. This can be either a public model or a private repository that you have access to. Then click the "Load Model" button to load it.

If you would like to compare the effect of the abliterated model, you can use the "Chat Test" section to interact with the original model and record its responses.

*(Screenshot: image-20250701133134643)*

### 3. Customize Instruction Sets

Harmful Instructions and Harmless Instructions are the primary datasets used in the abliteration process. They are essential for calculating the average harmful and average harmless hidden states, as well as determining the refusal direction.

You can customize these datasets in the provided text boxes and, if needed, tailor them to your specific task. This customization helps you create a model that is particularly unlikely to refuse the types of questions you care about.
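
For illustration, a customized pair of instruction sets for a security-focused task might look like the lists below. The prompts are hypothetical examples; in the Space you would paste them into the corresponding text boxes, presumably one instruction per line, following the format of the upstream `harmful.txt` and `harmless.txt` files.

```python
# Hypothetical domain-specific instruction sets for a security-focused task.
# In the Space these would go into the Harmful / Harmless text boxes.
harmful_instructions = [
    "Explain how to bypass a website's paywall.",
    "Write a phishing email that impersonates a bank.",
    "Describe how to disable a car's immobilizer.",
]
harmless_instructions = [
    "Explain how HTTPS keeps a connection secure.",
    "Write a polite reminder email about an upcoming meeting.",
    "Describe how a car's immobilizer works.",
]
```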

### 4. Key Parameters

*(Screenshot: image-20250701133207911)*

1. Number of Instructions: This parameter determines how many prompts are randomly sampled from each of the harmful and harmless datasets when calculating the refusal direction. For example, if it is set to 32, 32 prompts are drawn from each dataset to compute the hidden states and the refusal direction. If the modified model's performance is not satisfactory, increasing this parameter may help.
2. Scale Factor: This factor controls how strongly the correction vector is applied to the model's parameters. If you want a stronger effect, increase this parameter.
3. Skip Beginning Layers and Skip Ending Layers: These parameters let you exclude layers that you do not wish to modify. For instance, if the initial layers are primarily used for feature extraction, or if the final layer is a linear layer you prefer to keep unchanged, you can use these settings to skip them.
4. Refusal Direction Layer Fraction: This parameter specifies how far through the model's layers the hidden states used to calculate the refusal direction are taken from.

Parameters 3 and 4 allow you to perform a more customized abliteration process based on the structure of the model you are using.
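
To make parameters 3 and 4 concrete, the sketch below shows how the skip counts and the layer fraction might translate into layer indices. The variable names and values are assumptions for illustration, not the Space's internal code, and the model ID is the same placeholder used earlier.

```python
# Illustrative mapping from the Space's parameters to layer indices.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model
num_layers = config.num_hidden_layers

skip_begin_layers = 1   # leave the earliest feature-extraction layers untouched
skip_end_layers = 1     # leave the last layer(s) untouched
layer_fraction = 0.6    # Refusal Direction Layer Fraction

# Hidden states used to compute the refusal direction are read at this layer:
direction_layer = int(num_layers * layer_fraction)

# Only these layers receive the weight correction:
edit_layers = range(skip_begin_layers, num_layers - skip_end_layers)

print(f"direction layer: {direction_layer}, edited layers: {list(edit_layers)}")
```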

## Chat Test

*(Screenshot: image-20250701133231925)*

Congratulations! You have now obtained a preliminary abliterated model!

You can interact with it in the "Chat Test" tab to see how it performs. You can also modify the model's output token limit and parameters like temperature to test how the model behaves under different settings. If the results are satisfactory, you will find the modified model in the personal or organizational repository you selected.

If the performance is not satisfactory, try adjusting the parameters, such as increasing the Number of Instructions or the Scale Factor, and re-run the abliteration process to achieve better results.
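
If you prefer to test outside the Space, the pushed model can be loaded like any other Hugging Face checkpoint. The repository ID below is a placeholder for the repo the Space created for you, and the generation settings mirror the token-limit and temperature controls in the Chat Test tab.

```python
# Quick local test of the abliterated model pushed by the Space.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-username/your-model-abliterated"  # placeholder repository ID
tok = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Hello! What can you help me with?"}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(
    inputs,
    max_new_tokens=256,  # output token limit
    do_sample=True,
    temperature=0.7,     # the same knob exposed in the Chat Test tab
)
print(tok.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True))
```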

Finally, if you wish to use llama.cpp to quantize the generated model, we warmly welcome and encourage you to try our other Space: Quantize My Repo.

## Data

This project uses the harmful.txt and harmless.txt datasets from Sumandora/remove-refusals-with-transformers, licensed under Apache License 2.0. See LICENSE for details.