MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning
Abstract
A new image-to-code model and dataset are developed to enhance multimodal mathematical reasoning, resulting in a model that sets a new open-source state of the art on mathematical problem-solving benchmarks.
Natural language image-caption datasets, widely used for training Large Multimodal Models (LMMs), mainly focus on natural scenarios and overlook the intricate details of mathematical figures that are critical for problem-solving, hindering the advancement of current LMMs in multimodal mathematical reasoning. To this end, we propose leveraging code as supervision for cross-modal alignment, since code inherently encodes all information needed to generate the corresponding figure, establishing a precise connection between the two modalities. Specifically, we co-develop our image-to-code model and dataset with a model-in-the-loop approach, resulting in FigCodifier, an image-to-code model, and ImgCode-8.6M, the largest image-code dataset to date. Furthermore, we utilize FigCodifier to synthesize novel mathematical figures and then construct MM-MathInstruct-3M, a high-quality multimodal math instruction fine-tuning dataset. Finally, we present MathCoder-VL, trained on ImgCode-8.6M for cross-modal alignment and subsequently fine-tuned on MM-MathInstruct-3M for multimodal math problem solving. Our model achieves a new open-source SOTA across all six metrics. Notably, it surpasses GPT-4o and Claude 3.5 Sonnet on the geometry problem-solving subset of MathVista, with improvements of 8.9% and 9.2%, respectively. The dataset and models will be released at https://github.com/mathllm/MathCoder.
Community
- [2025.05.16] 🤗 MathCoder-VL-2B, MathCoder-VL-8B and FigCodifier-8B are available now! 🔥🔥🔥
- [2025.05.16] Our MathCoder-VL is accepted to ACL 2025 Findings. 🔥🔥🔥
Related papers recommended by the Semantic Scholar API:
- MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems (2025)
- Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding (2025)
- Unicorn: Text-Only Data Synthesis for Vision Language Model Training (2025)
- MM-IFEngine: Towards Multimodal Instruction Following (2025)
- Enhancing the Geometric Problem-Solving Ability of Multimodal LLMs via Symbolic-Neural Integration (2025)
- Capybara-OMNI: An Efficient Paradigm for Building Omni-Modal Language Models (2025)
- Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency (2025)