arxiv:2505.23856

OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Modalities

Published on May 29

· Submitted by

Authors:

Abstract

OMNIGUARD detects harmful prompts across languages and modalities by identifying aligned internal representations in large language models, achieving high accuracy and efficiency.

AI-generated summary

The emerging capabilities of large language models (LLMs) have sparked concerns about their immediate potential for harmful misuse. The core approach to mitigate these concerns is the detection of harmful queries to the model. Current detection approaches are fallible, and are particularly susceptible to attacks that exploit mismatched generalization of model capabilities (e.g., prompts in low-resource languages or prompts provided in non-text modalities such as image and audio). To tackle this challenge, we propose OMNIGUARD, an approach for detecting harmful prompts across languages and modalities. Our approach (i) identifies internal representations of an LLM/MLLM that are aligned across languages or modalities and then (ii) uses them to build a language-agnostic or modality-agnostic classifier for detecting harmful prompts. OMNIGUARD improves harmful prompt classification accuracy by 11.57\% over the strongest baseline in a multilingual setting, by 20.44\% for image-based prompts, and sets a new SOTA for audio-based prompts. By repurposing embeddings computed during generation, OMNIGUARD is also very efficient (approx 120 times faster than the next fastest baseline). Code and data are available at: https://github.com/vsahil/OmniGuard.

View arXiv page View PDF Add to collection

Community

vsahil

Paper submitter 4 days ago

We build a new AI safety moderation model, OmniGuard, that can detect harmful prompts across multiple languages and multiple modalities, all using one approach. It achieves SOTA results for detecting harmful prompts in 3 modalities: (multilingual) text, images, and audios.

OmniGuard operates by finding:

internal representations of a model (LLM or MLLM) that are universally shared across languages or different modalities, and
building a classifier using these representations

Using the internal representations for safety classification bypasses the need for a separate Guard model while making OmniGuard ~120X faster than the next fastest baseline Guard model.

librarian-bot

4 days ago

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2505.23856 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2505.23856 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2505.23856 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.