FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation
Abstract
FinMME is a comprehensive multimodal dataset for financial research, and FinScore is its accompanying evaluation system; together they highlight the challenges that even advanced models such as GPT-4o face in the finance domain.
Multimodal Large Language Models (MLLMs) have developed rapidly in recent years. In the financial domain, however, there is a notable lack of effective, specialized multimodal evaluation datasets. To advance the development of MLLMs in finance, we introduce FinMME, which comprises more than 11,000 high-quality financial research samples spanning 18 financial domains and 6 asset classes, featuring 10 major chart types and 21 subtypes. We ensure data quality through 20 annotators and carefully designed validation mechanisms. In addition, we develop FinScore, an evaluation system that incorporates hallucination penalties and multi-dimensional capability assessment to provide an unbiased evaluation. Extensive experiments show that even state-of-the-art models such as GPT-4o perform unsatisfactorily on FinMME, underscoring its challenging nature. The benchmark is also highly robust: prediction variation under different prompts remains below 1%, indicating greater reliability than existing datasets. Our dataset and evaluation protocol are available at https://huggingface.co/datasets/luojunyu/FinMME and https://github.com/luo-junyu/FinMME.
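Since FinMME is hosted on the Hugging Face Hub, it can be pulled with the `datasets` library. A minimal loading sketch; the split name and field names below are assumptions, so consult the dataset card for the actual schema:

```python
# pip install datasets
from datasets import load_dataset

# Repository ID from the paper; the split name is an assumption --
# see https://huggingface.co/datasets/luojunyu/FinMME for the real schema.
ds = load_dataset("luojunyu/FinMME", split="train")

print(ds)                # column names and sample count
sample = ds[0]
print(sample.keys())     # inspect available fields before relying on them
```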
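The abstract does not spell out how FinScore folds the hallucination penalty into the final number. Below is a minimal sketch of one plausible formulation, assuming per-sample correctness and hallucination judgments and a penalty weight `lambda_h`; the names and the subtractive formula are illustrative, not the paper's actual protocol:

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    correct: bool        # answer matches the gold label
    hallucinated: bool   # answer asserts content absent from the chart/text

def fin_score(judgments: list[Judgment], lambda_h: float = 0.5) -> float:
    """Accuracy minus a weighted hallucination rate, clipped at zero.

    Illustrative only: the real FinScore combines hallucination penalties
    with multi-dimensional capability assessment as described in the paper.
    """
    n = len(judgments)
    if n == 0:
        return 0.0
    accuracy = sum(j.correct for j in judgments) / n
    halluc_rate = sum(j.hallucinated for j in judgments) / n
    return max(0.0, accuracy - lambda_h * halluc_rate)
```

A subtractive penalty keeps the score interpretable as penalized accuracy; a multiplicative discount would be an equally plausible design.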
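The sub-1% robustness figure implies measuring how scores shift when the prompt is reworded. One way to compute such a spread, assuming a hypothetical `predict(prompt, sample) -> bool` wrapper around the model under test (not part of any released evaluation code):

```python
def prompt_variation(predict, samples, prompt_templates):
    """Max spread in accuracy across prompt templates, as a fraction.

    `predict(prompt, sample)` -> bool is a hypothetical model wrapper;
    each template is a format string with a `{question}` placeholder.
    """
    accuracies = []
    for template in prompt_templates:
        hits = [predict(template.format(question=s["question"]), s)
                for s in samples]
        accuracies.append(sum(hits) / len(hits))
    return max(accuracies) - min(accuracies)  # < 0.01 means <1% variation
```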
Community
FinMME is a pioneering benchmark dataset for multimodal financial AI, filling a notable gap in specialized evaluation resources, and it is designed to be highly challenging even for state-of-the-art models.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs (2025)
- Infi-Med: Low-Resource Medical MLLMs with Robust Reasoning Evaluation (2025)
- ChartMind: A Comprehensive Benchmark for Complex Real-world Multimodal Chart Question Answering (2025)
- BnMMLU: Measuring Massive Multitask Language Understanding in Bengali (2025)
- ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering (2025)
- Towards Multi-dimensional Evaluation of LLM Summarization across Domains and Languages (2025)
- MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks (2025)