---
base_model: HuggingFaceH4/zephyr-7b-beta
library_name: peft
license: apache-2.0
---

# INSAIT-Institute/Zephyr-7B-MixAT

![INSAIT logo](./assets/images/insait.png)

This is a model adapter for [HuggingFaceH4/zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), fine-tuned using the MixAT method. MixAT is an adversarial training approach that combines continuous and discrete adversarial training to strengthen model robustness against adversarial attacks, contributing to more trustworthy and reliable Large Language Models (LLMs).

For details, see our paper [MixAT: Combining Continuous and Discrete Adversarial Training for LLMs](https://arxiv.org/abs/2505.16947). Training and evaluation code is available in the [MixAT GitHub repository](https://github.com/insait-institute/MixAT).

## Use in 🤗 PEFT and Transformers (Quantized)

First, install the required libraries:

```bash
pip install transformers peft bitsandbytes
```

Then load the base model (4-bit quantized) with `transformers` and apply the adapter with `peft`:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16"
)

base_model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=bnb_config
)

model = PeftModel.from_pretrained(base_model, "INSAIT-Institute/Zephyr-7B-MixAT")
```

A minimal generation example is sketched at the end of this card.

## Results

We evaluated MixAT against a broad range of state-of-the-art adversarial attacks and introduce the At Least One Attack Success Rate (ALO-ASR) metric to assess worst-case model vulnerability. Our results show that MixAT achieves significantly improved robustness (ALO-ASR < 20%) compared to prior defenses (ALO-ASR > 50%), while maintaining good utility scores and a runtime comparable to continuous relaxation-based methods.

![MixAT results](./assets/images/main_table.png)

## Model Sources

- Repository: https://github.com/insait-institute/MixAT
- Paper: https://arxiv.org/abs/2505.16947

## Summary

- Base model: [HuggingFaceH4/zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta)
- Contact: dimitar.iliev.dimitrov@insait.ai and dekanycsaba23@gmail.com
- License: Distributed under the [Apache License Version 2.0](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md)

## Citation

```bibtex
@article{dekany2025mixat,
  title={MixAT: Combining Continuous and Discrete Adversarial Training for LLMs},
  author={D{\'e}k{\'a}ny, Csaba and Balauca, Stefan and Staab, Robin and Dimitrov, Dimitar I and Vechev, Martin},
  journal={arXiv preprint arXiv:2505.16947},
  year={2025}
}
```
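
## Example Inference

A minimal inference sketch continuing from the loading code in the usage section above. It assumes `model` and `torch` from that snippet are already in scope, and it loads the base model's tokenizer (the adapter does not change the tokenizer or chat template). The prompt and generation settings are illustrative only, not the configuration used in the paper.

```python
from transformers import AutoTokenizer

# The adapter reuses the base model's tokenizer and chat template.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

# Illustrative prompt; replace with your own conversation.
messages = [
    {"role": "user", "content": "Explain adversarial training for LLMs in two sentences."},
]

# Build the prompt with the chat template and move it to the model's device.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)

# Strip the prompt tokens and print only the generated continuation.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```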