
CoGenAV: Contrastive-Generative Audio-Visual Representation Learning


πŸš€ Project Overview

CoGenAV is an audio-visual representation learning framework based on contrastive-generative synchronization. It learns efficient, generalizable audio-visual representations by aligning speech, lip movements, and text, and it performs strongly across multiple audio-visual tasks, including:

  • Audio-Visual Speech Recognition (AVSR)
  • Visual Speech Recognition (VSR)
  • Audio-Visual Speech Enhancement and Separation (AVSE/AVSS)
  • Active Speaker Detection (ASD)

πŸ—οΈ Framework

The left panel depicts the Audio-Visual Feature Representation framework and the Contrastive-Generative Synchronization Training methodology. For generative synchronization, we design a Feature Adaptation Module and employ a frozen pre-trained ASR model as the Speech Recognition (SR) head. The right panel demonstrates the application of CoGenAV to diverse downstream tasks, including Visual Speech Recognition (VSR), Audio-Visual Speech Recognition (AVSR), Audio-Visual Speech Separation (AVSS), Audio-Visual Speech Enhancement (AVSE), and Active Speaker Detection (ASD).
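
The training code itself is not shown here, but the idea behind contrastive-generative synchronization can be sketched as an alignment loss between synchronized audio and lip features (shown here at the utterance level) plus a transcription loss from the frozen SR head applied through the adapter. The snippet below is only an illustrative sketch of that combination; the pooling, loss form, temperature, and weighting are assumptions, not the official implementation.

    import torch
    import torch.nn.functional as F

    def contrastive_generative_loss(audio_feat, visual_feat, sr_logits, targets,
                                    temperature=0.07, gen_weight=1.0):
        """Illustrative sketch of a contrastive-generative objective.

        audio_feat, visual_feat: (B, T, D) synchronized audio / lip features
        sr_logits: (B, L, V) token logits from the frozen SR head via the adapter
        targets:   (B, L)    ground-truth transcript token ids
        """
        # Contrastive synchronization: pooled audio and visual features from the
        # same clip are positives; all other pairs in the batch are negatives.
        a = F.normalize(audio_feat.mean(dim=1), dim=-1)   # (B, D)
        v = F.normalize(visual_feat.mean(dim=1), dim=-1)  # (B, D)
        sim = a @ v.t() / temperature                      # (B, B) similarity matrix
        labels = torch.arange(sim.size(0), device=sim.device)
        l_contrastive = 0.5 * (F.cross_entropy(sim, labels) +
                               F.cross_entropy(sim.t(), labels))
        # Generative synchronization: the adapted features must let the frozen
        # SR head predict the transcript (token-level cross-entropy here).
        l_generative = F.cross_entropy(sr_logits.transpose(1, 2), targets)
        return l_contrastive + gen_weight * l_generative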


🌟 Key Advantages

  1. Efficient Learning: High-performance models can be trained with only 223 hours of labeled data (from the LRS2 dataset).
  2. Cross-Task Generalizability: Unified representation learning allows direct adaptation to various downstream tasks without task-specific architectural adjustments.
  3. Robustness: Performance improves by 70%+ in noisy environments (0 dB SNR), significantly outperforming traditional audio-only models.
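
For context, a 0 dB SNR condition means the added noise has the same power as the clean speech. A minimal sketch of how such a test mixture can be constructed is shown below (illustrative only; the noise types and mixing protocol used for the reported results are not specified here):

    import torch

    def mix_at_snr(speech: torch.Tensor, noise: torch.Tensor, snr_db: float = 0.0) -> torch.Tensor:
        """Mix noise into clean speech at a target SNR in dB (assumes noise is at least as long as speech)."""
        noise = noise[: speech.numel()]
        speech_power = speech.pow(2).mean()
        noise_power = noise.pow(2).mean().clamp_min(1e-10)
        # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db;
        # at 0 dB the speech and noise end up with equal power.
        scale = torch.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
        return speech + scale * noise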

Usage

  1. Install dependencies:

    pip install -r requirements.txt
    # Ensure that both whisper and fairseq are installed
    pip install -U openai-whisper
    git clone https://github.com/pytorch/fairseq
    cd fairseq
    pip install --editable ./
    
  2. Infer CoGenAV for VSR/AVSR (see the input-preparation sketch after step 4):

     import whisper
     from whisper.model import AudioEncoder
     from infer_vsr_avsr import cogenav_forward
     from models.cogenav import CoGenAV
     # Override the Whisper encoder's forward function
     AudioEncoder.forward = cogenav_forward
     # Load CoGenAV model
     cogenav = CoGenAV(cfg_file="config/base.yaml", model_tensor="weights/base_cogenav.pt")
     # Load Whisper model as SR_Head
     SR_Head = whisper.load_model("small", download_root="weights/whisper/")
     SR_Head.encoder.adapter = cogenav.adapter.half()
     # Prepare input using CoGenAV
     input_ids = cogenav(video, audio).permute(0, 2, 1)  # For cogenav_av
     # input_ids = cogenav(video, None).permute(0, 2, 1)  # For cogenav_v
     # input_ids = cogenav(None, audio).permute(0, 2, 1)  # For cogenav_a
     # input_ids = audio  # For whisper_a
     # Decode using Whisper model
     result = whisper.decode(SR_Head, input_ids, options)[0]
    
  3. Infer CoGenAV for AVSS/AVSE (see the input-preparation sketch after step 4):

     from models.cogenav import CoGenAV
     from models.sepformer import build_Sepformer
     # Load CoGenAV model
     cogenav = CoGenAV(cfg_file="config/base.yaml", model_tensor="weights/base_cogenav.pt")
     # Load Sepformer model as the AVSS/AVSE head
     sepformer_head = build_Sepformer().cuda()
     # Separate the target speech from the mixed waveform, conditioned on the lip features
     lip_feature = cogenav(video, None, use_upsampler=False)
     sep_wav = sepformer_head.forward(audio_mix, lip_feature)
    
  4. Run the inference scripts:

     python infer_vsr_avsr.py --input_type cogenav_av --model_size large --cogenav_ckpt weights/large_cogenav.pt
     python infer_avse_avss.py --task_type avse
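
The snippets in steps 2 and 3 assume that video, audio, audio_mix, and options are already prepared. A minimal sketch of one way to build them is given below; the 16 kHz sampling rate, the use of Whisper-style log-mel features for the audio branch, the lip-crop tensor layout, and the sample file names are all assumptions, so check the repository's data pipeline for the exact preprocessing.

     import torch
     import whisper

     device = "cuda"

     # Audio branch: Whisper-style log-mel spectrogram of a 16 kHz waveform
     # (assumption: CoGenAV consumes the same mel features as the Whisper encoder).
     wav = whisper.load_audio("sample/clean.wav")  # 16 kHz mono float32
     audio = whisper.log_mel_spectrogram(whisper.pad_or_trim(wav)).unsqueeze(0).to(device)

     # Video branch: grayscale lip-region crops with a (batch, frames, height, width)
     # layout; "sample/lip_crops.pt" is a hypothetical pre-cropped tensor file.
     video = torch.load("sample/lip_crops.pt").unsqueeze(0).to(device)

     # Mixture waveform for separation/enhancement (step 3), shaped (batch, samples).
     audio_mix = torch.from_numpy(whisper.load_audio("sample/mix.wav")).unsqueeze(0).to(device)

     # Decoding options for the Whisper SR head.
     options = whisper.DecodingOptions(language="en", without_timestamps=True, fp16=True)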
    

🎬 Demo

Demo for AVSR/VSR

AVSR/VSR

Demo for AVSS/AVSE

AVSS (Audio-Visual Speech Separation)
AVSE (Audio-Visual Speech Enhancement)

Results

CoGenAV for VSR/AVSR

| Size  | SR Head        | Modalities | VSR  | AVSR @ noise | AVSR @ clean | AVSR @ clean (SFT Whisper) |
|-------|----------------|------------|------|--------------|--------------|----------------------------|
| -     | Whisper medium | A          | -    | 34.2         | 6.4          | 1.5                        |
| Base  | Whisper small  | AV         | 24.8 | 5.2          | 2.5          | -                          |
| Large | Whisper medium | AV         | 20.4 | 2.6          | 1.8          | 1.27                       |

Note: VSR/AVSR results on LRS2, reported as WER (%); training is conducted solely on the LRS2 dataset.
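
WER can be recomputed from the decoded transcripts with any standard scorer, for example with the jiwer package (not a dependency of this repository, shown purely as an illustration):

    # pip install jiwer
    from jiwer import wer

    reference  = "the quick brown fox jumps over the lazy dog"
    hypothesis = "the quick brown fox jumped over a lazy dog"
    print(f"WER: {100 * wer(reference, hypothesis):.1f}%")  # word error rate in percent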

CoGenAV Base for AVSS/AVSE

| Task | SS Head      | Test Dataset    | SI-SNRi | SDRi | PESQ |
|------|--------------|-----------------|---------|------|------|
| AVSS | AV-Sepformer | mix_2_spk_tt    | 15.7    | 16.0 | 3.23 |
| AVSE | AV-Sepformer | lrs2_test+noise | 8.3     | 9.0  | 2.56 |

Note: AVSS/AVSE results on LRS2. Metrics are averaged over all speakers in each test set; higher SI-SNRi, SDRi, and PESQ are better.
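
For reference, SI-SNRi is the improvement in scale-invariant SNR of the separated signal over the unprocessed mixture. A minimal sketch of the SI-SNR computation is given below (SDR and PESQ need external tools such as mir_eval or the pesq package and are omitted):

    import torch

    def si_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        """Scale-invariant SNR in dB between a 1-D estimate and a 1-D reference signal."""
        est = est - est.mean()
        ref = ref - ref.mean()
        # Project the estimate onto the reference to isolate the target component.
        s_target = (torch.dot(est, ref) / (ref.pow(2).sum() + eps)) * ref
        e_noise = est - s_target
        return 10 * torch.log10(s_target.pow(2).sum() / (e_noise.pow(2).sum() + eps))

    # SI-SNRi = si_snr(separated, clean) - si_snr(mixture, clean)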

CoGenAV Base for ASD

| Task | SD Head | Test Dataset | mAP  |
|------|---------|--------------|------|
| ASD  | LRASD   | Talkies      | 96.3 |