---
library_name: transformers
tags: []
---
|
|
|
|
|
|
|
## Model Description |
|
This model is a LoRA fine-tune of Meta-Llama-3-8B-Instruct trained with the "Representation Bending" (RepBend) approach described in [Representation Bending for Large Language Model Safety](https://arxiv.org/abs/2504.01550). RepBend modifies the model's internal representations to reduce harmful or unsafe responses while preserving overall capabilities. The result is a model that is robust to a range of adversarial jailbreak attacks, out-of-distribution harmful prompts, and fine-tuning exploits, while still producing useful and informative responses to benign requests.
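To give an intuition for what "bending" representations means, the sketch below shows one way such an objective could look: hidden states of harmful prompts are pushed away from where the base model placed them, while hidden states of benign prompts are kept close so capabilities are retained. This is only a conceptual illustration, not the exact objective from the paper; the function name, tensor names, and margin value are all illustrative.

```python
import torch
import torch.nn.functional as F

def repbend_style_loss(h_harmful, h_harmful_ref, h_benign, h_benign_ref, margin=10.0):
    """Illustrative representation-bending-style loss (not the paper's exact objective).

    Each argument is a (batch, hidden_dim) tensor, e.g. pooled hidden states
    from an intermediate transformer layer; *_ref are the base model's
    (frozen) representations of the same prompts.
    """
    # Push harmful representations away from their original positions,
    # with a hinge so the term vanishes once they are at least `margin` apart.
    push_away = F.relu(margin - (h_harmful - h_harmful_ref).norm(dim=-1)).mean()
    # Keep benign representations close to where the base model placed them.
    keep_close = (h_benign - h_benign_ref).norm(dim=-1).mean()
    return push_away + keep_close
```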
|
|
|
## Uses |
|
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
adapter_id = "thkim0305/RepBend_Llama3_8B_LoRA"

# The tokenizer shipped with the adapter matches the one used during fine-tuning.
tokenizer = AutoTokenizer.from_pretrained(adapter_id, use_fast=False)

# Load the base model in bfloat16 and attach the RepBend LoRA adapter.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter_id, adapter_name="default")

# Llama-3-Instruct chat format: a user turn followed by the assistant header.
input_text = "Who are you?"
template = "<|start_header_id|>user<|end_header_id|>\n\n{instruction}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
prompt = template.format(instruction=input_text)

input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(input_ids, max_new_tokens=256)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
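If you prefer to deploy the model without a runtime PEFT dependency, the LoRA weights can be merged into the base model and saved as a standalone checkpoint. A minimal sketch, continuing from the snippet above (the output directory name is illustrative):

```python
# Merge the LoRA weights into the base model and save a standalone checkpoint.
# The output directory name is illustrative.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("RepBend_Llama3_8B_merged")
tokenizer.save_pretrained("RepBend_Llama3_8B_merged")
```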
|
|
|
## Code |
|
|
|
Please refer to the [RepBend GitHub repository](https://github.com/AIM-Intelligence/RepBend/tree/main?tab=readme-ov-file).
|
|
|
## Citation |
|
```
@article{repbend,
  title={Representation Bending for Large Language Model Safety},
  author={Yousefpour, Ashkan and Kim, Taeheon and Kwon, Ryan S and Lee, Seungbeen and Jeung, Wonje and Han, Seungju and Wan, Alvin and Ngan, Harrison and Yu, Youngjae and Choi, Jonghyun},
  journal={arXiv preprint arXiv:2504.01550},
  year={2025}
}
```