Fork-Merge Decoding: Enhancing Multimodal Understanding in Audio-Visual Large Language Models
Abstract
The Fork-Merge Decoding strategy improves balanced multimodal understanding in audio-visual large language models by separating and then combining modality-specific reasoning.
The goal of this work is to enhance balanced multimodal understanding in audio-visual large language models (AV-LLMs) by addressing modality bias without requiring additional training. In current AV-LLMs, audio and video features are typically processed jointly in the decoder. While this strategy facilitates unified multimodal understanding, it may introduce modality bias, where the model tends to over-rely on one modality due to imbalanced training signals. To mitigate this, we propose Fork-Merge Decoding (FMD), a simple yet effective inference-time strategy that requires no additional training or architectural modifications. FMD first performs modality-specific reasoning by processing audio-only and video-only inputs through the early decoder layers (the fork phase), and then merges the resulting hidden states for joint reasoning in the remaining layers (the merge phase). This approach promotes balanced modality contributions and leverages complementary information across modalities. We evaluate our method on two representative AV-LLMs, VideoLLaMA2 and video-SALMONN, using three benchmark datasets. Experimental results show consistent performance improvements on audio, video, and combined audio-visual reasoning tasks, demonstrating the effectiveness of inference-time interventions for robust multimodal understanding.
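To make the fork-merge flow concrete, here is a minimal PyTorch sketch of the inference-time pipeline described above. Everything specific in it is an assumption for illustration rather than the paper's implementation: the `fork_depth` split, forming unimodal inputs by masking the other modality's embeddings, and averaging the two hidden states at the merge point (the paper's merge operator may differ).

```python
# Minimal sketch of Fork-Merge Decoding (FMD) at inference time.
# Assumptions (not taken from the paper): the AV-LLM decoder is exposed as an
# ordered list of transformer blocks, unimodal prompts are obtained by masking
# the other modality's token embeddings, and the merge is a plain average of
# hidden states. Fork depth and merge operator are hypothetical.
import torch
import torch.nn as nn


def fork_merge_decode(
    decoder_layers: nn.ModuleList,    # full stack of decoder blocks
    lm_head: nn.Module,               # projects hidden states to vocab logits
    audio_only_embeds: torch.Tensor,  # (B, T, D) prompt with video tokens masked
    video_only_embeds: torch.Tensor,  # (B, T, D) prompt with audio tokens masked
    fork_depth: int = 8,              # number of early layers run per modality (assumed)
) -> torch.Tensor:
    """Run modality-specific early layers (fork), then joint late layers (merge)."""
    h_audio, h_video = audio_only_embeds, video_only_embeds

    # Fork phase: each unimodal stream passes through the early layers
    # separately, so neither modality dominates the other's early reasoning.
    for layer in decoder_layers[:fork_depth]:
        h_audio = layer(h_audio)
        h_video = layer(h_video)

    # Merge phase: combine the unimodal hidden states (simple average here)
    # and continue through the remaining layers for joint audio-visual reasoning.
    h_joint = 0.5 * (h_audio + h_video)
    for layer in decoder_layers[fork_depth:]:
        h_joint = layer(h_joint)

    # Next-token logits from the merged representation.
    return lm_head(h_joint)


# Toy usage with stand-in transformer blocks (real AV-LLM layers also handle
# attention masks and KV caches; those details are omitted in this sketch).
layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True) for _ in range(12)]
)
head = nn.Linear(64, 1000)   # hypothetical vocabulary size
a = torch.randn(1, 16, 64)   # toy audio-only prompt embeddings
v = torch.randn(1, 16, 64)   # toy video-only prompt embeddings
logits = fork_merge_decode(layers, head, a, v, fork_depth=6)
print(logits.shape)          # torch.Size([1, 16, 1000])
```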
Community
Fork-Merge Decoding (FMD) enhances reasoning in audio-visual large language models (AV-LLMs) at inference time by leveraging both unimodal understanding and cross-modal interactions, without additional training.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- AVCD: Mitigating Hallucinations in Audio-Visual Large Language Models through Contrastive Decoding (2025)
- DGFNet: End-to-End Audio-Visual Source Separation Based on Dynamic Gating Fusion (2025)
- Representation Learning for Semantic Alignment of Language, Audio, and Visual Modalities (2025)
- Scaling and Enhancing LLM-based AVSR: A Sparse Mixture of Projectors Approach (2025)
- EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning (2025)
- Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model (2025)
- Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM (2025)