---
base_model: HuggingFaceH4/zephyr-7b-beta
library_name: peft
license: apache-2.0
---

# INSAIT-Institute/Zephyr-7B-MixAT

![INSAIT logo](./assets/images/insait.png)

This is a model adapter for [HuggingFaceH4/zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), fine-tuned using the MixAT method. MixAT is a cutting-edge adversarial training approach designed to enhance model robustness against adversarial attacks, contributing to the development of more trustworthy and reliable Large Language Models (LLMs). For details, see our paper [MixAT: Combining Continuous and Discrete Adversarial Training for LLMs](https://arxiv.org/abs/2505.16947). Training and evaluation code is available in the [MixAT Github repository](https://github.com/insait-institute/MixAT).


## Use in 🤗 PEFT and Transformers (Quantized)
First, install the required libraries:

```bash
pip install transformers peft bitsandbytes
```

Then, load the base model (4-bit quantized) with Transformers and apply the adapter with PEFT:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit NF4 quantization with bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the quantized base model
base_model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=bnb_config,
)

# Attach the MixAT adapter on top of the base model
model = PeftModel.from_pretrained(base_model, "INSAIT-Institute/Zephyr-7B-MixAT")
```
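
For a quick sanity check, you can run inference with the base model's tokenizer. The snippet below is a minimal sketch (the prompt text is illustrative) that formats a message with the Zephyr chat template and generates a response:

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

# Build a chat-formatted prompt (example message; adapt as needed)
messages = [
    {"role": "user", "content": "Explain adversarial training in one sentence."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    outputs = model.generate(inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```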

## Results
MixAT has been evaluated against a broad range of state-of-the-art adversarial attacks, introducing the At Least One Attack Success Rate (ALO-ASR) metric to assess worst-case model vulnerability. Our results show that MixAT achieves significantly improved robustness (ALO-ASR < 20%) compared to prior defenses (ALO-ASR > 50%), while maintaining good utility scores and a runtime comparable to continuous relaxation-based methods.
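
To make the metric concrete, here is a small sketch (not code from the MixAT repository, with hypothetical data) of how ALO-ASR can be computed from a per-prompt, per-attack success matrix:

```python
import numpy as np

# success[i, j] = True if attack j succeeds against the model on prompt i (hypothetical data)
success = np.array([
    [False, True,  False],
    [False, False, False],
    [True,  False, True ],
    [False, False, False],
])

# ALO-ASR: fraction of prompts broken by at least one attack (worst-case view)
alo_asr = success.any(axis=1).mean()
print(f"ALO-ASR: {alo_asr:.0%}")  # 50% for this toy example
```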

![MixAT results](./assets/images/main_table.png)


## Model Sources

- Repository: https://github.com/insait-institute/MixAT
- Paper: https://arxiv.org/abs/2505.16947


## Summary

- Base model: [HuggingFaceH4/zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta)
- Contact: dimitar.iliev.dimitrov@insait.ai and dekanycsaba23@gmail.com
- License: Distributed under [Apache License Version 2.0](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md)


## Citation

```bibtex
@article{dekany2025mixat,
  title={MixAT: Combining Continuous and Discrete Adversarial Training for LLMs},
  author={D{\'e}k{\'a}ny, Csaba and Balauca, Stefan and Staab, Robin and Dimitrov, Dimitar I and Vechev, Martin},
  journal={arXiv preprint arXiv:2505.16947},
  year={2025}
}
```