arxiv:2505.16211

AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models

Published on May 22
· Submitted by JusperLee on May 26

Abstract

AI-generated summary: AudioTrust evaluates the trustworthiness of Audio Large Language Models across multiple dimensions, using a comprehensive dataset and audio-specific metrics to assess their performance in real-world audio scenarios.

The rapid advancement and expanding applications of Audio Large Language Models (ALLMs) demand a rigorous understanding of their trustworthiness. However, systematic evaluation of these models, particularly with respect to risks unique to the audio modality, remains largely unexplored. Existing evaluation frameworks primarily focus on the text modality or cover only a restricted set of safety dimensions, failing to account for the characteristics and application scenarios specific to audio. We introduce AudioTrust, the first multifaceted trustworthiness evaluation framework and benchmark designed specifically for ALLMs. AudioTrust facilitates assessments across six key dimensions: fairness, hallucination, safety, privacy, robustness, and authentication. To evaluate these dimensions comprehensively, AudioTrust is structured around 18 distinct experimental setups. Its core is a meticulously constructed dataset of over 4,420 audio/text samples drawn from real-world scenarios (e.g., daily conversations, emergency calls, voice assistant interactions), designed to probe the multifaceted trustworthiness of ALLMs. For assessment, the benchmark defines 9 audio-specific evaluation metrics, and a large-scale automated pipeline scores model outputs objectively and at scale. Experimental results reveal the trustworthiness boundaries and limitations of current state-of-the-art open-source and closed-source ALLMs when confronted with various high-risk audio scenarios, offering valuable insights for the secure and trustworthy deployment of future audio models. Our platform and benchmark are available at https://github.com/JusperLee/AudioTrust.

Community

Paper submitter

AudioTrust is a comprehensive trustworthiness evaluation framework for Audio Large Language Models (ALLMs) that probes potential risks across six dimensions: fairness, hallucination, safety, privacy, robustness, and authentication. It aggregates over 4,420 real-world audio/text samples covering 18 experimental settings, such as daily conversations, emergency calls, and voice assistant interactions, and defines 9 audio-specific evaluation metrics that feed an automated assessment pipeline (a hypothetical sketch of such a pipeline follows below). Key findings: closed-source models perform better on robustness and safety protection, while open-source models still have blind spots in privacy and fairness; most ALLMs exhibit systematic biases with respect to sensitive attributes such as gender, accent, and age. We hope researchers will build on AudioTrust to further improve audio large models and jointly promote a safer, more trustworthy AI audio ecosystem!
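
To make the evaluation flow concrete, here is a minimal Python sketch of an automated dimension-scoring loop in the spirit described above. This is not the authors' implementation: only the six dimension names come from the paper, while `query_allm`, `judge_score`, and the sample layout are hypothetical placeholders.

```python
# Illustrative sketch (not the AudioTrust code): an automated scoring loop
# over audio prompts. The six dimension names come from the paper; the
# ALLM client and judge below are hypothetical placeholders.

DIMENSIONS = ["fairness", "hallucination", "safety",
              "privacy", "robustness", "authentication"]

def query_allm(audio_path: str, prompt: str) -> str:
    """Placeholder for a call to the audio LLM under test."""
    raise NotImplementedError("wire up your ALLM client here")

def judge_score(dimension: str, prompt: str, response: str) -> float:
    """Placeholder for an automated judge mapping a response to [0, 1]."""
    raise NotImplementedError("wire up your judge model here")

def evaluate(samples: list[dict]) -> dict[str, float]:
    """Average judge scores per trustworthiness dimension."""
    scores: dict[str, list[float]] = {d: [] for d in DIMENSIONS}
    # Each sample is assumed to look like:
    # {"audio": path, "prompt": str, "dimension": str}
    for s in samples:
        response = query_allm(s["audio"], s["prompt"])
        scores[s["dimension"]].append(
            judge_score(s["dimension"], s["prompt"], response))
    return {d: sum(v) / len(v) for d, v in scores.items() if v}
```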

📄 Paper: https://arxiv.org/pdf/2505.16211
💻 Code: https://github.com/JusperLee/AudioTrust
🤗 Data: https://huggingface.co/datasets/JusperLee/AudioTrust
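
To explore the data directly, the following hedged sketch loads the dataset from the Hugging Face Hub with the `datasets` library. The repository path is the one linked above, but the available configs, splits, and column names are assumptions to verify by inspecting the printed structure.

```python
# Hypothetical usage sketch, not from the AudioTrust repo.
from datasets import load_dataset

# Load the dataset linked above; a specific config name may be required.
ds = load_dataset("JusperLee/AudioTrust")
print(ds)  # inspect available splits and columns before iterating
```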

