---
language:
- en
---

# Adapting Multimodal Large Language Models to Domains via Post-Training

This project adapts general Multimodal Large Language Models (MLLMs) to specific domains such as science and industry to improve their real-world applicability. It focuses on three main areas:

### 1. Data Synthesis

- We develop a generate-then-filter pipeline built on open-source models to synthesize diverse visual-instruction tasks from domain-specific image-caption pairs (a rough sketch follows this list).
- The synthesized data outperforms data created through manual rules or generated by closed-source models (e.g., GPT-4V/GPT-4o).
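A minimal sketch of the generate-then-filter idea (not the project's actual pipeline; `synthesizer`, `filter_model`, and their methods are hypothetical placeholders standing in for the open-source generator and the consistency-based filter):

```python
# Minimal sketch of a generate-then-filter loop over image-caption pairs.
# `synthesizer` and `filter_model` are hypothetical stand-ins for the
# open-source generator model and the filtering step.

def synthesize_visual_instructions(image_caption_pairs, synthesizer, filter_model):
    kept = []
    for image, caption in image_caption_pairs:
        # Generate candidate instruction-response tasks from the image-caption pair.
        candidates = synthesizer.generate(image=image, caption=caption)
        # Keep only candidates whose responses stay consistent with the caption.
        for task in candidates:
            if filter_model.is_consistent(task=task, caption=caption):
                kept.append({
                    "image": image,
                    "instruction": task["instruction"],
                    "response": task["response"],
                })
    return kept
```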

### 2. Training Pipeline

- Instead of the conventional two-stage training (image-caption pairs first, then visual-instruction tasks), we use single-stage training to increase task diversity for domain-specific post-training (see the sketch after this list).
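A rough illustration of the difference between the two recipes (the example records and `train_one_stage` are hypothetical placeholders, not the project's training code):

```python
import random

# Hypothetical in-memory examples; in practice these would be the domain
# image-caption corpus and the synthesized visual-instruction tasks.
caption_examples = [
    {"image": "slide_001.png", "text": "An H&E-stained tissue section."},
]
task_examples = [
    {"image": "slide_001.png",
     "instruction": "Which staining method is shown?",
     "response": "Hematoxylin and eosin (H&E)."},
]

def train_one_stage(examples):
    # Placeholder for a supervised fine-tuning loop over the given examples.
    for example in examples:
        pass

# Conventional recipe: stage 1 on caption_examples, then stage 2 on task_examples.
# Single-stage recipe used here: shuffle everything into one mixture, train once.
single_stage_mixture = caption_examples + task_examples
random.shuffle(single_stage_mixture)
train_one_stage(single_stage_mixture)
```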

### 3. Task Evaluation

- We evaluate our method in important domains including biomedicine, food, and remote sensing.
- We post-train and evaluate MLLMs on domain-specific tasks to demonstrate the effectiveness of our approach.

## Resources

🤗 We share our data and models with example usage; a minimal inference sketch follows the table below. Feel free to open issues or discussions! 🤗

| Model | Repo ID in HF 🤗 | Domain | Base Model | Training Data | Evaluation Benchmark |
|:------|:-----------------|:-------|:-----------|:--------------|:---------------------|
| Visual Instruction Synthesizer | AdaptLLM/visual-instruction-synthesizer | - | open-llava-next-llama3-8b | VisionFLAN and ALLaVA | - |
| AdaMLLM-med-1B | AdaptLLM/biomed-InternVL3-1B | Biomedicine | InternVL3-1B | biomed-visual-instructions | biomed-VQA-benchmark |
| AdaMLLM-med-4B | AdaptLLM/biomed-gemma-3-4b-it | Biomedicine | gemma-3-4b-it | biomed-visual-instructions | biomed-VQA-benchmark |
| AdaMLLM-med-3B | AdaptLLM/biomed-Qwen2.5-VL-3B-Instruct | Biomedicine | Qwen2.5-VL-3B-Instruct | biomed-visual-instructions | biomed-VQA-benchmark |
| AdaMLLM-food-3B | AdaptLLM/food-Qwen2.5-VL-3B-Instruct | Food | Qwen2.5-VL-3B-Instruct | food-visual-instructions | food-VQA-benchmark |
| AdaMLLM-remote-sensing-3B | AdaptLLM/remote-sensing-Qwen2.5-VL-3B-Instruct | Remote Sensing | Qwen2.5-VL-3B-Instruct | remote-sensing-visual-instructions | remote-sensing-VQA-benchmark |
| AdaMLLM-med-2B | AdaptLLM/biomed-Qwen2-VL-2B-Instruct | Biomedicine | Qwen2-VL-2B-Instruct | biomed-visual-instructions | biomed-VQA-benchmark |
| AdaMLLM-food-2B | AdaptLLM/food-Qwen2-VL-2B-Instruct | Food | Qwen2-VL-2B-Instruct | food-visual-instructions | food-VQA-benchmark |
| AdaMLLM-remote-sensing-2B | AdaptLLM/remote-sensing-Qwen2-VL-2B-Instruct | Remote Sensing | Qwen2-VL-2B-Instruct | remote-sensing-visual-instructions | remote-sensing-VQA-benchmark |
| AdaMLLM-med-8B | AdaptLLM/biomed-LLaVA-NeXT-Llama3-8B | Biomedicine | open-llava-next-llama3-8b | biomed-visual-instructions | biomed-VQA-benchmark |
| AdaMLLM-food-8B | AdaptLLM/food-LLaVA-NeXT-Llama3-8B | Food | open-llava-next-llama3-8b | food-visual-instructions | food-VQA-benchmark |
| AdaMLLM-remote-sensing-8B | AdaptLLM/remote-sensing-LLaVA-NeXT-Llama3-8B | Remote Sensing | open-llava-next-llama3-8b | remote-sensing-visual-instructions | remote-sensing-VQA-benchmark |
| AdaMLLM-med-11B | AdaptLLM/biomed-Llama-3.2-11B-Vision-Instruct | Biomedicine | Llama-3.2-11B-Vision-Instruct | biomed-visual-instructions | biomed-VQA-benchmark |
| AdaMLLM-food-11B | AdaptLLM/food-Llama-3.2-11B-Vision-Instruct | Food | Llama-3.2-11B-Vision-Instruct | food-visual-instructions | food-VQA-benchmark |
| AdaMLLM-remote-sensing-11B | AdaptLLM/remote-sensing-Llama-3.2-11B-Vision-Instruct | Remote Sensing | Llama-3.2-11B-Vision-Instruct | remote-sensing-visual-instructions | remote-sensing-VQA-benchmark |
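As a minimal inference sketch for one of the Qwen2-VL-based checkpoints (assuming a recent `transformers` release with Qwen2-VL support, `accelerate` installed, a local image file, and that the adapted checkpoint keeps the base model's architecture; swap in any repo ID from the table with its matching model class):

```python
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "AdaptLLM/biomed-Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"  # device_map needs accelerate
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example_scan.png")  # hypothetical local image path
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe the key findings in this image."},
        ],
    }
]

# Build the chat prompt and pack text + image into model inputs.
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

# Generate and decode only the newly produced tokens.
output_ids = model.generate(**inputs, max_new_tokens=256)
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```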

Code: https://github.com/bigai-ai/QA-Synthesizer

## Citation

If you find our work helpful, please cite us.

### Adapt MLLM to Domains

```bibtex
@article{adamllm,
  title={On Domain-Specific Post-Training for Multimodal Large Language Models},
  author={Cheng, Daixuan and Huang, Shaohan and Zhu, Ziyu and Zhang, Xintong and Zhao, Wayne Xin and Luan, Zhongzhi and Dai, Bo and Zhang, Zhenliang},
  journal={arXiv preprint arXiv:2411.19930},
  year={2024}
}
```

### Adapt LLM to Domains (ICLR 2024)

```bibtex
@inproceedings{adaptllm,
  title={Adapting Large Language Models via Reading Comprehension},
  author={Daixuan Cheng and Shaohan Huang and Furu Wei},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=y886UXPEZ0}
}
```