arXiv:2406.16235

Preference Tuning For Toxicity Mitigation Generalizes Across Languages

Published on Jun 23, 2024
· Submitted by yongzx on Jun 25, 2024

Abstract

Direct Preference Optimization (DPO) training on English data alone significantly reduces toxicity in multilingual Large Language Models, an effect attributed to the dual multilinguality property of MLP layers.

AI-generated summary

Detoxifying multilingual Large Language Models (LLMs) has become crucial due to their increasing global use. In this work, we explore the zero-shot cross-lingual generalization of preference tuning in detoxifying LLMs. Unlike previous studies that show limited cross-lingual generalization for other safety tasks, we demonstrate that Direct Preference Optimization (DPO) training with only English data can significantly reduce toxicity in multilingual open-ended generations. For example, the probability of mGPT-1.3B generating toxic continuations drops from 46.8% to 3.9% across 17 languages after training. Our results also extend to other multilingual LLMs, such as BLOOM, Llama3, and Aya-23. Using mechanistic interpretability tools such as causal intervention and activation analysis, we identify the dual multilinguality property of MLP layers in LLMs, which explains the cross-lingual generalization of DPO. Finally, we show that bilingual sentence retrieval can predict the cross-lingual transferability of DPO preference tuning.
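As background for the summary above, here is a minimal sketch of the standard DPO objective (Rafailov et al., 2023) as it would apply to detoxification, where the "chosen" continuation is non-toxic and the "rejected" one is toxic. This is an illustrative PyTorch sketch, not the paper's code; argument names and the `beta` default are assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss (Rafailov et al., 2023), illustrative sketch.

    Each argument is a tensor of per-sequence log-probabilities,
    shape (batch,). For detoxification, `chosen` is the non-toxic
    continuation and `rejected` the toxic one.
    """
    # Implicit rewards: log-prob ratio between the policy being
    # trained and the frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

The key property exploited by the paper is that even when the preference pairs are English-only, the loss reshapes model components (in particular MLP layers) that are shared across languages.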

Community

Paper author · Paper submitter

Why can DPO training with English data detoxify LLMs in different languages? In our work, we give a mechanistic explanation for the zero-shot cross-lingual transfer of DPO detoxification.
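One ingredient of that explanation, bilingual sentence retrieval as a predictor of cross-lingual transferability, can be sketched as follows. This assumes a Hugging Face-style causal LM, a set of parallel English/target-language sentences, and mean-pooled hidden states; the function name and layer choice are illustrative assumptions, not the paper's exact setup.

```python
import torch

@torch.no_grad()
def retrieval_accuracy(model, tokenizer, en_sents, xx_sents, layer=-1):
    """Bilingual sentence retrieval probe (illustrative sketch).

    For each English sentence, check whether its parallel translation
    is the nearest neighbor among all target-language sentences, using
    mean-pooled hidden states from the given layer.
    """
    def embed(sents):
        embs = []
        for s in sents:
            ids = tokenizer(s, return_tensors="pt")
            h = model(**ids, output_hidden_states=True).hidden_states[layer]
            embs.append(h.mean(dim=1).squeeze(0))  # mean-pool over tokens
        return torch.nn.functional.normalize(torch.stack(embs), dim=-1)

    en, xx = embed(en_sents), embed(xx_sents)
    sims = en @ xx.T              # cosine similarity matrix
    preds = sims.argmax(dim=-1)   # nearest target-language sentence
    gold = torch.arange(len(en_sents))
    return (preds == gold).float().mean().item()
```

Intuitively, high retrieval accuracy for a language indicates that the model represents it in a space well aligned with English, which is when English-only DPO detoxification is expected to transfer.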

