Sounding that Object: Interactive Object-Aware Image to Audio Generation
Abstract
Generating accurate sounds for complex audio-visual scenes is challenging, especially in the presence of multiple objects and sound sources. In this paper, we propose an *interactive object-aware audio generation* model that grounds sound generation in user-selected visual objects within images. Our method integrates object-centric learning into a conditional latent diffusion model, which learns to associate image regions with their corresponding sounds through multi-modal attention. At test time, our model employs image segmentation to allow users to interactively generate sounds at the *object* level. We theoretically validate that our attention mechanism functionally approximates test-time segmentation masks, ensuring the generated audio aligns with selected objects. Quantitative and qualitative evaluations show that our model outperforms baselines, achieving better alignment between objects and their associated sounds. Project page: https://tinglok.netlify.app/files/avobject/
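To make the object-level conditioning concrete, the sketch below shows one plausible form of masked multi-modal cross-attention in PyTorch: audio diffusion latents attend to image patch features, and an optional test-time mask (e.g., from a segmentation model) restricts attention to patches of the user-selected object. All module names, tensor shapes, and the masking scheme here are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): cross-attention from audio
# diffusion latents to image patch features, with an optional test-time
# object mask derived from a segmentation map. Names/shapes are assumptions.
import torch
from torch import nn


class MaskedCrossAttention(nn.Module):
    def __init__(self, latent_dim: int, image_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=latent_dim, kdim=image_dim, vdim=image_dim,
            num_heads=num_heads, batch_first=True,
        )

    def forward(self, audio_latents, image_patches, object_mask=None):
        # audio_latents:  (B, T, latent_dim) audio latent tokens being denoised
        # image_patches:  (B, N, image_dim)  patch features from an image encoder
        # object_mask:    (B, N) bool, True where a patch belongs to the
        #                 user-selected object (e.g., from a segmentation model)
        attn_mask = None
        if object_mask is not None:
            # Disallow attention to patches outside the selected object,
            # emulating a test-time segmentation mask at the attention level.
            disallowed = ~object_mask                          # (B, N)
            attn_mask = disallowed.unsqueeze(1).expand(
                -1, audio_latents.size(1), -1
            )                                                  # (B, T, N)
            attn_mask = attn_mask.repeat_interleave(
                self.attn.num_heads, dim=0
            )                                                  # (B*heads, T, N)
        out, _ = self.attn(audio_latents, image_patches, image_patches,
                           attn_mask=attn_mask)
        return out


# Tiny smoke test with made-up sizes.
layer = MaskedCrossAttention(latent_dim=256, image_dim=768)
z = torch.randn(2, 64, 256)                  # audio latent tokens
img = torch.randn(2, 196, 768)               # 14x14 grid of ViT patches
mask = torch.zeros(2, 196, dtype=torch.bool)
mask[:, :49] = True                          # pretend these patches cover the object
out = layer(z, img, mask)                    # (2, 64, 256)
```

Masking at the attention level, rather than blacking out image pixels, leaves the encoder's features intact and mirrors the abstract's claim that the learned attention functionally approximates test-time segmentation masks.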
Community
Similar papers recommended by the Semantic Scholar API:
- Hearing and Seeing Through CLIP: A Framework for Self-Supervised Sound Source Localization (2025)
- In-the-wild Audio Spatialization with Flexible Text-guided Localization (2025)
- TARO: Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning for Synchronized Video-to-Audio Synthesis (2025)
- Seeing Soundscapes: Audio-Visual Generation and Separation from Soundscapes Using Audio-Visual Separator (2025)
- OmniAudio: Generating Spatial Audio from 360-Degree Video (2025)
- MAGREF: Masked Guidance for Any-Reference Video Generation (2025)
- KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation (2025)