FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion
This repo is a PyTorch implementation of FUSE, proposed in our paper: FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion
Results and Weights
MVSEC outdoor_day1
Methods | Input | a1 | a2 | a3 | Abs.Rel | RMSE | RMSElog |
---|---|---|---|---|---|---|---|
E2Depth | E | 0.567 | 0.772 | 0.876 | 0.346 | 8.564 | 0.421 |
EReFormer | E | 0.664 | 0.831 | 0.923 | 0.271 | - | 0.333 |
HMNet | E | 0.690 | 0.849 | 0.930 | 0.254 | 6.890 | 0.319 |
RAMNet | I+E | 0.541 | 0.778 | 0.877 | 0.303 | 8.526 | 0.424 |
SRFNet | I+E | 0.637 | 0.810 | 0.900 | 0.268 | 8.453 | 0.375 |
HMNet | I+E | 0.717 | 0.868 | 0.940 | 0.230 | 6.922 | 0.310 |
PCDepth | I+E | 0.712 | 0.867 | 0.941 | 0.228 | 6.526 | 0.301 |
Ours | I+E | 0.745 | 0.892 | 0.957 | 0.196 | 6.004 | 0.270 |
MVSEC outdoor_night1
Methods | Input | a1 | a2 | a3 | Abs.Rel | RMSE | RMSElog |
---|---|---|---|---|---|---|---|
E2Depth | E | 0.408 | 0.615 | 0.754 | 0.591 | 11.210 | 0.646 |
EReFormer | E | 0.547 | 0.753 | 0.881 | 0.317 | - | 0.415 |
HMNet | E | 0.513 | 0.714 | 0.837 | 0.323 | 9.008 | 0.482 |
RAMNet | I+E | 0.296 | 0.502 | 0.635 | 0.583 | 13.340 | 0.830 |
SRFNet | I+E | 0.433 | 0.662 | 0.800 | 0.371 | 11.469 | 0.521 |
HMNet | I+E | 0.497 | 0.661 | 0.784 | 0.349 | 10.818 | 0.543 |
PCDepth | I+E | 0.632 | 0.822 | 0.922 | 0.271 | 6.715 | 0.354 |
Ours | I+E | 0.629 | 0.824 | 0.923 | 0.261 | 6.587 | 0.351 |
Weights
To facilitate community exchange, we release our model weights: the foundation model and its metric versions on the target datasets (MVSEC and DENSE). As described in our paper, when adapting to a target dataset we train only the depth decoding head while keeping the image-event joint encoder (consisting of the image encoder, the event encoder, and FreDFuse) frozen, so the three versions of the model differ only in the weights of the depth decoding head. For each version, we provide three sizes: small, base, and large.
Foundation Model
Below are the weights of the image-event joint depth estimation foundation model, obtained by knowledge transfer through our proposed FUSE framework, using Depth Anything V2 as the image depth estimation foundation model. The depth decoding head weights here come from Depth Anything V2, so the model outputs inverse depth rather than depth.
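Since the head predicts inverse depth, turning a prediction into (up-to-scale) depth is a simple reciprocal; a minimal sketch (function name is illustrative):

```python
import torch

def inverse_depth_to_depth(inv_depth: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Convert an inverse-depth prediction to (up-to-scale) depth.
    Clamping avoids division by zero in far or sky regions."""
    return 1.0 / inv_depth.clamp(min=eps)
```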
MVSEC
These are metric depth estimators obtained by freezing the image-event joint encoder weights of our foundation model and training the depth decoding head on the MVSEC dataset. The metrics are evaluation results on MVSEC outdoor_night1.
Methods | a1 | a2 | a3 | Abs.Rel | RMSE | RMSElog | Weights |
---|---|---|---|---|---|---|---|
SMALL | 0.613 | 0.814 | 0.922 | 0.267 | 6.785 | 0.352 | Baidu |
BASE | 0.632 | 0.827 | 0.925 | 0.270 | 6.445 | 0.348 | Baidu |
LARGE | 0.629 | 0.824 | 0.923 | 0.261 | 6.587 | 0.351 | Baidu |
DENSE
These are metric depth estimators obtained by freezing the image-event joint encoder weights of our foundation model and training the depth decoding head on the DENSE dataset. The 10m/20m/30m columns report the average absolute depth error (in meters) up to the corresponding cutoff distance.
Methods | Abs.Rel | RMSElog | 10m | 20m | 30m | Weights |
---|---|---|---|---|---|---|
SMALL | 0.401 | 0.491 | 1.231 | 4.262 | 7.052 | Baidu |
BASE | 0.348 | 0.467 | 1.199 | 3.732 | 5.976 | Baidu |
LARGE | 0.385 | 0.457 | 1.286 | 3.998 | 6.639 | Baidu |
Installation
conda create -n fuse python=3.9
conda activate fuse
pip install -r requirements.txt
Data preparation
We use the following datasets:
- MVSEC: Multi Vehicle Stereo Event Camera Dataset
- DENSE: Learning Monocular Dense Depth from Events
- EventScape: Combining Events and Frames using Recurrent Asynchronous Multimodal Networks for Monocular Depth Prediction
Since the images, events, and depth labels in MVSEC are asynchronous, image-event-depth pairs must be constructed manually, which can be done by running the script `scripts/process_mvsec_hdf5.py`.
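For intuition, such pairing is typically done by nearest-timestamp matching; the sketch below is illustrative and not the exact logic of `scripts/process_mvsec_hdf5.py`:

```python
import numpy as np

def match_nearest(image_ts: np.ndarray, depth_ts: np.ndarray) -> np.ndarray:
    """For each depth timestamp, return the index of the image whose
    timestamp is closest. Both arrays are assumed sorted ascending."""
    idx = np.searchsorted(image_ts, depth_ts)     # insertion points
    idx = np.clip(idx, 1, len(image_ts) - 1)
    left, right = image_ts[idx - 1], image_ts[idx]
    # Pick the neighbor with the smaller absolute time difference.
    use_left = (depth_ts - left) < (right - depth_ts)
    return np.where(use_left, idx - 1, idx)
```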
For the event voxel grid representation, we use 3 time bins. DENSE and EventScape originally provide voxel grids with 5 bins, which you can convert using the scripts under the `scripts` directory.
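For reference, a 3-bin voxel grid can be built from raw events with bilinear interpolation along the time axis; a minimal sketch (the default resolution assumes MVSEC's 346x260 DAVIS sensor, and the function is illustrative rather than the repo's conversion script):

```python
import torch

def events_to_voxel_grid(x, y, t, p, num_bins=3, height=260, width=346):
    """Accumulate events (x, y, t, polarity in {-1, +1}) into a
    (num_bins, H, W) voxel grid with bilinear temporal interpolation."""
    voxel = torch.zeros(num_bins, height, width)
    # Normalize timestamps to the range [0, num_bins - 1].
    t = (t - t[0]) / max((t[-1] - t[0]).item(), 1e-9) * (num_bins - 1)
    left = t.floor().long().clamp(0, num_bins - 1)   # earlier bin index
    right = (left + 1).clamp(max=num_bins - 1)       # later bin index
    w_right = t - left.float()                       # weight of the later bin
    flat = voxel.view(num_bins, -1)
    idx = y.long() * width + x.long()                # flattened pixel index
    flat.index_put_((left, idx), p.float() * (1.0 - w_right), accumulate=True)
    flat.index_put_((right, idx), p.float() * w_right, accumulate=True)
    return voxel
```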
Training
The training process is divided into three stages: Feature Alignment, Feature Fusion, and Adaptation to the Target Dataset.
We use image-event pairs (without depth ground truth) from EventScape, together with Depth Anything V2, to train the image-event joint encoder. We then use DENSE and MVSEC as target datasets to verify the model's effectiveness; on DENSE and MVSEC, we train only the decoder.
Feature Alignment
Run the script `align_feature.sh`. You need to modify the variables `load_from` and `save_path` in the script, which are the path of the pre-trained Depth Anything weights and the path to save the training weights, respectively.
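Conceptually, this stage aligns event-branch features to the frozen Depth Anything V2 image features using image-event pairs only, so no depth labels are required. Below is a heavily simplified sketch of one alignment step; the actual objective and the frequency-decoupled design follow the paper, and all names here are illustrative:

```python
import torch
import torch.nn.functional as F

def alignment_step(event_encoder, image_encoder, voxel, image, optimizer):
    """One label-free alignment step: pull event-branch features toward
    the frozen image-encoder features of the paired frame."""
    with torch.no_grad():                 # the image branch stays frozen
        target = image_encoder(image)
    pred = event_encoder(voxel)
    # Cosine-similarity alignment over the channel dimension.
    loss = 1.0 - F.cosine_similarity(pred, target, dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```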
Feature Fusion
Run the script `fuse_feature.sh`. The variable `prompt_encoder_pretrained` in this script should be the path to the event encoder weights obtained in the feature alignment stage.
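Loading those stage-one weights into the event (prompt) encoder might look like the following sketch; the checkpoint layout and function name are assumptions:

```python
import torch

def load_prompt_encoder(event_encoder, ckpt_path):
    """Load the stage-one event-encoder weights, i.e. the path set as
    prompt_encoder_pretrained in fuse_feature.sh."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state = ckpt.get("state_dict", ckpt)   # handle both checkpoint layouts
    missing, unexpected = event_encoder.load_state_dict(state, strict=False)
    print(f"missing keys: {missing}, unexpected keys: {unexpected}")
```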
Adaptation to the Target Dataset
Run the script `train.sh`.
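Since this stage trains only the depth decoding head on top of the frozen joint encoder, the setup reduces to freezing the encoder parameters; a minimal sketch with illustrative attribute names and learning rate:

```python
import torch

def setup_adaptation(model, lr=5e-5):
    """Freeze the image-event joint encoder and optimize only the depth
    decoding head, matching the adaptation stage described above."""
    for p in model.joint_encoder.parameters():
        p.requires_grad = False
    model.joint_encoder.eval()            # also fix BatchNorm statistics
    head_params = [p for p in model.depth_head.parameters() if p.requires_grad]
    return torch.optim.AdamW(head_params, lr=lr)
```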
Inference
Run the script `run.sh`.
`load_from` indicates the path of the pre-trained weights to be loaded.
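A hedged end-to-end sketch of what inference does with `load_from` (model class, forward signature, and checkpoint layout are assumptions):

```python
import torch

@torch.no_grad()
def run_inference(model, load_from, image, voxel):
    """Load the weights given by load_from, then predict depth for one
    image + 3-bin event voxel grid pair (forward signature assumed)."""
    ckpt = torch.load(load_from, map_location="cpu")
    model.load_state_dict(ckpt.get("state_dict", ckpt), strict=False)
    model.eval()
    return model(image, voxel)
```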
Evaluation
Run the script `eval.sh`.
`predictions_dataset` and `target_dataset` represent the directory paths where the prediction results and the ground-truth depth data are stored, respectively. `clip_distance` indicates the maximum depth to be evaluated, which is 80 for MVSEC and 1000 for DENSE.
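For reference, the reported metrics follow their standard definitions; a sketch of computing them under the `clip_distance` mask (helper name and masking convention are assumptions, not `eval.sh` itself):

```python
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray, clip_distance: float = 80.0):
    """Threshold accuracies a1-a3, Abs.Rel, RMSE, and RMSElog, computed
    on valid pixels, i.e. 0 < gt <= clip_distance."""
    mask = (gt > 0) & (gt <= clip_distance)
    pred, gt = np.clip(pred[mask], 1e-6, None), gt[mask]
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "a1": (ratio < 1.25).mean(),
        "a2": (ratio < 1.25 ** 2).mean(),
        "a3": (ratio < 1.25 ** 3).mean(),
        "Abs.Rel": (np.abs(pred - gt) / gt).mean(),
        "RMSE": np.sqrt(((pred - gt) ** 2).mean()),
        "RMSElog": np.sqrt(((np.log(pred) - np.log(gt)) ** 2).mean()),
    }
```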
Acknowledgements
This project includes code from the following repositories:
- Depth-Anything-V2 - We use Depth Anything V2 as the image depth foundation model