FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation
Abstract
FinMME is a comprehensive multimodal dataset for financial research, and FinScore is its accompanying evaluation system; together they highlight the challenges that even advanced models such as GPT-4o face in the finance domain.
Multimodal Large Language Models (MLLMs) have developed rapidly in recent years. In the financial domain, however, there is a notable lack of effective, specialized multimodal evaluation datasets. To advance the development of MLLMs in finance, we introduce FinMME, which comprises more than 11,000 high-quality financial research samples spanning 18 financial domains and 6 asset classes, featuring 10 major chart types and 21 subtypes. We ensure data quality through 20 annotators and carefully designed validation mechanisms. In addition, we develop FinScore, an evaluation system that incorporates hallucination penalties and multi-dimensional capability assessment to provide an unbiased evaluation. Extensive experiments show that even state-of-the-art models such as GPT-4o perform unsatisfactorily on FinMME, underscoring its challenging nature. The benchmark is also highly robust: prediction variation under different prompts remains below 1%, indicating greater reliability than existing datasets. Our dataset and evaluation protocol are available at https://huggingface.co/datasets/luojunyu/FinMME and https://github.com/luo-junyu/FinMME.
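Since FinMME is hosted on the Hugging Face Hub, it can be pulled with the `datasets` library. A minimal loading sketch; the split name and field names below are assumptions, so consult the dataset card for the actual schema:

```python
# pip install datasets
from datasets import load_dataset

# Repository ID from the paper; the split name is an assumption --
# see https://huggingface.co/datasets/luojunyu/FinMME for the real schema.
ds = load_dataset("luojunyu/FinMME", split="train")

print(ds)                # column names and sample count
sample = ds[0]
print(sample.keys())     # inspect available fields before relying on them
```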
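The abstract does not spell out how FinScore folds the hallucination penalty into the final number. Below is a minimal sketch of one plausible formulation, assuming per-sample correctness and hallucination judgments and a penalty weight `lambda_h`; the names and the subtractive formula are illustrative, not the paper's actual protocol:

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    correct: bool        # answer matches the gold label
    hallucinated: bool   # answer asserts content absent from the chart/text

def fin_score(judgments: list[Judgment], lambda_h: float = 0.5) -> float:
    """Accuracy minus a weighted hallucination rate, clipped at zero.

    Illustrative only: the real FinScore combines hallucination penalties
    with multi-dimensional capability assessment as described in the paper.
    """
    n = len(judgments)
    if n == 0:
        return 0.0
    accuracy = sum(j.correct for j in judgments) / n
    halluc_rate = sum(j.hallucinated for j in judgments) / n
    return max(0.0, accuracy - lambda_h * halluc_rate)
```

A subtractive penalty keeps the score interpretable as penalized accuracy; a multiplicative discount would be an equally plausible design.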
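The sub-1% robustness figure implies measuring how scores shift when the prompt is reworded. One way to compute such a spread, assuming a hypothetical `predict(prompt, sample) -> bool` wrapper around the model under test (not part of any released evaluation code):

```python
def prompt_variation(predict, samples, prompt_templates):
    """Max spread in accuracy across prompt templates, as a fraction.

    `predict(prompt, sample)` -> bool is a hypothetical model wrapper;
    each template is a format string with a `{question}` placeholder.
    """
    accuracies = []
    for template in prompt_templates:
        hits = [predict(template.format(question=s["question"]), s)
                for s in samples]
        accuracies.append(sum(hits) / len(hits))
    return max(accuracies) - min(accuracies)  # < 0.01 means <1% variation
```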
Community
FinMME is a pioneering benchmark dataset for multimodal financial AI, filling a notable gap in specialized evaluation resources, and it is designed to be highly challenging even for state-of-the-art models.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs (2025)
- Infi-Med: Low-Resource Medical MLLMs with Robust Reasoning Evaluation (2025)
- ChartMind: A Comprehensive Benchmark for Complex Real-world Multimodal Chart Question Answering (2025)
- BnMMLU: Measuring Massive Multitask Language Understanding in Bengali (2025)
- ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering (2025)
- Towards Multi-dimensional Evaluation of LLM Summarization across Domains and Languages (2025)
- MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks (2025)