---
base_model: HuggingFaceH4/zephyr-7b-beta
library_name: peft
license: apache-2.0
---

# INSAIT-Institute/Zephyr-7B-MixAT

![](./assets/mixat_illustration.png)
This is a model adapter for [HuggingFaceH4/zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), fine-tuned using the MixAT method. MixAT is an adversarial training approach that combines continuous and discrete adversarial training to enhance model robustness against adversarial attacks, contributing to the development of more trustworthy and reliable Large Language Models (LLMs). For details, see our paper [MixAT: Combining Continuous and Discrete Adversarial Training for LLMs](https://arxiv.org/abs/2505.16947). Training and evaluation code is available in the [MixAT GitHub repository](https://github.com/insait-institute/MixAT).

## Use in 🤗 PEFT and Transformers (Quantized)

First, install the required libraries:

```bash
pip install transformers peft bitsandbytes accelerate
```

Then, load the base model (4-bit quantized) with 🤗 Transformers and apply the adapter with 🤗 PEFT:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Configure 4-bit NF4 quantization for the base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the quantized base model
base_model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=bnb_config,
)

# Apply the MixAT adapter on top of the base model
model = PeftModel.from_pretrained(base_model, "INSAIT-Institute/Zephyr-7B-MixAT")
```
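
Once the adapter is loaded, the model can be used like any other causal LM in 🤗 Transformers. The snippet below is a minimal inference sketch, not part of the official MixAT code; the prompt and generation settings are only illustrative, and it reuses the `model` object created above together with the base model's chat template:

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

# Example prompt, formatted with Zephyr's chat template
messages = [
    {"role": "user", "content": "Explain adversarial training for LLMs in one sentence."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)

# Greedy decoding of a short completion with the adapted model
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Print only the newly generated tokens
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```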

## Results

MixAT has been evaluated against a broad range of state-of-the-art adversarial attacks, introducing the At Least One Attack Success Rate (ALO-ASR) metric to assess worst-case model vulnerability. Our results show that MixAT achieves significantly improved robustness (ALO-ASR < 20%) compared to prior defenses (ALO-ASR > 50%), while maintaining good utility scores and a runtime comparable to continuous relaxation-based methods.
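
To make the metric concrete, here is a small illustrative sketch (not code from the MixAT repository) of how ALO-ASR can be computed from a hypothetical boolean matrix recording, for each prompt and each attack, whether the attack succeeded:

```python
import numpy as np

def alo_asr(success: np.ndarray) -> float:
    """At Least One Attack Success Rate: fraction of prompts broken by at least one attack.

    success[i, j] is True if attack j succeeded on prompt i.
    """
    return float(success.any(axis=1).mean())

# Hypothetical results: 4 prompts evaluated under 3 different attacks
results = np.array([
    [False, True,  False],  # prompt 0: broken by attack 1
    [False, False, False],  # prompt 1: robust to all attacks
    [True,  False, True ],  # prompt 2: broken by attacks 0 and 2
    [False, False, False],  # prompt 3: robust to all attacks
])
print(alo_asr(results))  # 0.5
```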

![ALO-ASR results table](./assets/alo_table.png)

## Model Sources

- Repository: https://github.com/insait-institute/MixAT
- Paper: https://arxiv.org/abs/2505.16947

## Summary

- Base model: [HuggingFaceH4/zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta)
- Contact: dimitar.iliev.dimitrov@insait.ai and dekanycsaba23@gmail.com
- License: Distributed under the [Apache License, Version 2.0](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md)

## Citation

```bibtex
@article{dekany2025mixat,
  title={MixAT: Combining Continuous and Discrete Adversarial Training for LLMs},
  author={D{\'e}k{\'a}ny, Csaba and Balauca, Stefan and Staab, Robin and Dimitrov, Dimitar I and Vechev, Martin},
  journal={arXiv preprint arXiv:2505.16947},
  year={2025}
}
```
|