---
license: other
license_name: qwen
license_link: https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct/blob/main/LICENSE
language:
- ja
- en
pipeline_tag: image-text-to-text
tags:
- multimodal
library_name: transformers
---

# Stockmark-2-VL-100B-beta

![image/jpeg](./assets/stmk2_vl_logo.png)

## Introduction

**Stockmark-2-VL-100B-beta** is a 100-billion-parameter Japanese-specialized visual language model with Chain-of-Thought (CoT) reasoning for document reading comprehension. Because Stockmark-2-VL-100B-beta was trained with synthetic data generated by [Qwen2.5-VL-72B](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct), it is provided in accordance with the [Qwen license](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct/blob/main/LICENSE).

As a beta release, Stockmark-2-VL-100B-beta is still undergoing improvement and evaluation. Feedback and insights from users will help refine future versions. See [our blog](https://stockmark-tech.hatenablog.com/entry/2025/06/03/101007) for details.

This project is supported by [GENIAC](https://www.meti.go.jp/policy/mono_info_service/geniac/index.html).

## Model architecture

The architecture of Stockmark-2-VL-100B-beta follows the same framework as [LLaVA-OneVision](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/):

- **LLM**: we used the smaller [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) in our preliminary experiments on architecture and data, and adopted [stockmark/Stockmark-2-100B-Instruct-beta](https://huggingface.co/stockmark/Stockmark-2-100B-Instruct-beta) for the official training run.
- **Vision encoder**: instead of the [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) encoder used in the original LLaVA-OneVision, we employed the newly released [google/siglip2-so400m-patch14-384](https://huggingface.co/google/siglip2-so400m-patch14-384), which offers better multilingual performance. In ablation experiments, the model with SigLIP2 outperformed the one with SigLIP.
- **Projector**: the 2-layer MLP projector was initialized with random weights; a schematic sketch of this component is shown below.
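To make the composition above concrete, the following is a minimal schematic of a LLaVA-OneVision-style 2-layer MLP projector, which maps vision-encoder features into the LLM embedding space. It is an illustration only, not the released implementation: the class name is ours, 1152 corresponds to the SigLIP-so400m feature width, and 8192 is an assumed placeholder for the LLM hidden size rather than the actual dimension of Stockmark-2-100B-Instruct-beta.

```python
import torch
import torch.nn as nn


class TwoLayerMLPProjector(nn.Module):
    """Schematic LLaVA-style projector: vision features -> LLM embedding space."""

    def __init__(self, vision_hidden_size: int, llm_hidden_size: int):
        super().__init__()
        self.linear_1 = nn.Linear(vision_hidden_size, llm_hidden_size)
        self.act = nn.GELU()
        self.linear_2 = nn.Linear(llm_hidden_size, llm_hidden_size)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patch_tokens, vision_hidden_size)
        return self.linear_2(self.act(self.linear_1(image_features)))


# Illustrative dimensions only: 1152 is the SigLIP-so400m feature width;
# 8192 is a placeholder for the LLM hidden size, not the real value.
projector = TwoLayerMLPProjector(vision_hidden_size=1152, llm_hidden_size=8192)
patch_tokens = torch.randn(1, 729, 1152)  # dummy batch of 729 patch tokens (27 x 27)
print(projector(patch_tokens).shape)      # torch.Size([1, 729, 8192])
```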
## Evaluation

### Japanese document reading comprehension performance evaluation

We evaluated document reading comprehension performance using the following three benchmarks:

- [JDocQA](https://github.com/mizuumi/JDocQA): a total of 1,175 questions. We evaluated Stockmark-2-VL-100B-beta with [llm-jp-eval-mm](https://github.com/llm-jp/llm-jp-eval-mm) and adopted the LLM-as-a-judge score as the comparison metric (using `gpt-4o-2024-11-20` as the judge model). Scores for models other than Stockmark-2-VL-100B-beta were taken from the [llm-jp-eval-mm leaderboard](https://llm-jp.github.io/llm-jp-eval-mm/) as of 2025/5/15.
- [BusinessSlideVQA](https://github.com/stockmarkteam/business-slide-questions): a benchmark of 220 questions, constructed by our team, for evaluating the ability to comprehend complex Japanese business slide images. The evaluation metric is the same as for JDocQA: LLM-as-a-judge scoring with `gpt-4o-2024-11-20` as the judge model.
- JChartQA: a benchmark we constructed by randomly sampling 100 questions from ChartQA-val and translating the English text in both the questions and the images into Japanese.

The comparison results are shown in the table below. Stockmark-2-VL-100B-beta significantly outperforms the other domestic VLMs on all metrics. It also achieves higher scores than GPT-4o on BusinessSlideVQA and JChartQA, demonstrating overall document reading comprehension performance superior to that of GPT-4o.

|                                                                                     | BusinessSlideVQA /LLM | JChartQA /Acc | JDocQA /LLM |
|-------------------------------------------------------------------------------------|-----------------------|---------------|-------------|
| [Heron-NVILA-Lite-15B](https://huggingface.co/turing-motors/Heron-NVILA-Lite-15B)  | 2.8                   | 0.41          | 2.7         |
| [sarashina2-vision-14b](https://huggingface.co/sbintuitions/sarashina2-vision-14b) | 3.3                   | 0.52          | 3.1         |
| [llm-jp-3-vila-14b](https://huggingface.co/llm-jp/llm-jp-3-vila-14b)                | 2.0                   | 0.23          | 2.5         |
| [gpt-4o-2024-11-20](https://platform.openai.com/docs/models/gpt-4o)                 | 4.1                   | 0.77          | **3.6**     |
| **Stockmark-2-VL-100B-beta**                                                        | **4.2**               | **0.88**      | 3.5         |

### Japanese general domain VQA

We selected the following three commonly used benchmarks to evaluate model performance on Japanese general-domain VQA:

- [Heron-Bench](https://huggingface.co/datasets/turing-motors/Japanese-Heron-Bench)
- [JA-VLM-Bench-In-the-Wild](https://huggingface.co/datasets/SakanaAI/JA-VLM-Bench-In-the-Wild)
- [JA-VG-VQA500](https://huggingface.co/datasets/SakanaAI/JA-VG-VQA-500)

All three benchmarks were evaluated with llm-jp-eval-mm. Generation parameters were left at their default values, and `gpt-4o-2024-11-20` was used as the judge model for scoring. Stockmark-2-VL-100B-beta achieved the highest scores on Heron-Bench and JA-VG-VQA500 and also ranked among the top performers on JA-VLM-Bench-In-the-Wild, demonstrating that it is currently a state-of-the-art domestic model for Japanese general-domain tasks.

|                                                                                     | Heron-Bench /LLM | JA-VLM-Bench-In-the-Wild /LLM | JA-VG-VQA500 /LLM |
|-------------------------------------------------------------------------------------|------------------|-------------------------------|-------------------|
| [Heron-NVILA-Lite-15B](https://huggingface.co/turing-motors/Heron-NVILA-Lite-15B)  | 73.5             | **4.4**                       | 4.0               |
| [sarashina2-vision-14b](https://huggingface.co/sbintuitions/sarashina2-vision-14b) | 60.1             | 4.0                           | 3.7               |
| [llm-jp-3-vila-14b](https://huggingface.co/llm-jp/llm-jp-3-vila-14b)                | 68.0             | 4.1                           | 3.9               |
| **Stockmark-2-VL-100B-beta**                                                        | **78.8**         | 4.1                           | **4.1**           |

## Quickstart

### Inference using 🤗Transformers

Stockmark-2-VL-100B-beta is built on the LLaVA-OneVision architecture, so please make sure you have `transformers>=4.45.0` installed.

```bash
pip install "transformers>=4.45.0" accelerate torchvision pillow
```

The following is a code snippet demonstrating how to use Stockmark-2-VL-100B-beta with pure `transformers`.
```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
from huggingface_hub import hf_hub_download

model_id = "stockmark/Stockmark-2-VL-100B-beta"

# Load the model in bfloat16 and spread it across the available GPUs.
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

conversation = [
    {
        "role": "system",
        "content": "あなたは誠実で優秀な日本人のアシスタントです。",
    },
    {
        "role": "user",
        "content": "30歳未満の社員に対するアンケート回答結果で、最も割合が高かった「使用頻度」は何ですか?",
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Download the demo image bundled with the model repository.
img_path = hf_hub_download(
    repo_id=model_id,
    filename="assets/demo.png",
)
raw_image = Image.open(img_path)

inputs = processor(images=raw_image, text=prompt, return_tensors="pt").to("cuda").to(torch.bfloat16)

output_ids = model.generate(**inputs, max_new_tokens=255, do_sample=False)
# Strip the prompt tokens from the generated sequences before decoding.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
answer = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)[0].strip()
print(answer)
```

### Inference using vLLM

The following is a code snippet demonstrating how to use Stockmark-2-VL-100B-beta with `vLLM`.

```python
import os

from PIL import Image
from transformers import AutoProcessor
from huggingface_hub import hf_hub_download
from vllm import LLM, SamplingParams

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"


def main():
    model_id = "stockmark/Stockmark-2-VL-100B-beta"
    processor = AutoProcessor.from_pretrained(
        model_id,
        trust_remote_code=True,
    )

    message = [
        {
            "role": "system",
            "content": "あなたは誠実で優秀な日本人のアシスタントです。",
        },
        {
            "role": "user",
            "content": "30歳未満の社員に対するアンケート回答結果で、最も割合が高かった「使用頻度」は何ですか?",
        },
    ]
    # Build the prompt with the model's chat template.
    prompt = processor.apply_chat_template(message, add_generation_prompt=True)
    print(prompt)

    llm = LLM(
        model=model_id,
        tensor_parallel_size=2,
        limit_mm_per_prompt={"image": 1},
        trust_remote_code=True,
        dtype="bfloat16",
    )

    # Download the demo image bundled with the model repository.
    img_path = hf_hub_download(
        repo_id=model_id,
        filename="assets/demo.png",
    )
    image = Image.open(img_path)

    inputs = {
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    }
    sampling_params = SamplingParams(
        temperature=0,
        max_tokens=256,
    )
    outputs = llm.generate(
        inputs,
        sampling_params=sampling_params,
    )

    answer = outputs[0].outputs[0].text
    print(answer)


if __name__ == "__main__":
    main()
```

### Evaluation using `llm-jp-eval-mm`

If you are interested in evaluating Stockmark-2-VL-100B-beta with llm-jp-eval-mm, please add the following code to llm-jp-eval-mm.
#### Model class

The following is the model class for Stockmark-2-VL-100B-beta in llm-jp-eval-mm. Please place it in the `llm-jp-eval-mm/examples` directory.

```python
# -*- coding: utf-8 -*-

"""
@File        : stockmark_vl.py
@Description : The VLM model class for Stockmark-2-VL-100B-beta.
"""

import torch
from PIL import Image
from transformers import LlavaOnevisionForConditionalGeneration, AutoProcessor

from base_vlm import BaseVLM
from utils import GenerationConfig

DEFAULT_IMAGE_TOKEN = "<image>"


class VLM(BaseVLM):
    def __init__(self, model_id) -> None:
        self.model_id = model_id
        self.model = LlavaOnevisionForConditionalGeneration.from_pretrained(
            self.model_id,
            torch_dtype=torch.bfloat16,
            trust_remote_code=True,
            low_cpu_mem_usage=True,
            device_map="auto",
        )
        self.processor = AutoProcessor.from_pretrained(self.model_id, trust_remote_code=True)

    def generate(
        self,
        images: list[Image.Image],
        text: str,
        gen_kwargs: GenerationConfig = GenerationConfig(),
    ) -> str:
        # Prepend one image token per input image to the question text.
        content = DEFAULT_IMAGE_TOKEN * len(images) + "\n" + text
        messages = [
            {
                "role": "system",
                "content": "あなたは誠実で優秀な日本人のアシスタントです。",
            },
            {
                "role": "user",
                "content": content,
            },
        ]
        prompt = self.processor.apply_chat_template(
            messages, add_generation_prompt=True
        )

        if len(images) == 0:
            images = None

        inputs = self.processor(images=images, text=prompt, return_tensors="pt").to(
            "cuda"
        ).to(torch.bfloat16)

        output_ids = self.model.generate(**inputs, **gen_kwargs.__dict__)
        # Strip the prompt tokens from the generated sequences before decoding.
        generated_ids = [
            output_ids[len(input_ids):]
            for input_ids, output_ids in zip(inputs.input_ids, output_ids)
        ]
        answer = self.processor.batch_decode(
            generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
        )[0].strip()
        return answer
```

Please make sure that the entry for Stockmark-2-VL-100B-beta is included in `MODEL_ID_TO_CLASS_PATH` in `llm-jp-eval-mm/examples/model_table.py`.

```python
MODEL_ID_TO_CLASS_PATH = {
    "stockmark/Stockmark-2-VL-100B-beta": "stockmark_vl.VLM",
}
```
#### Dependency group

Use the following command to create a dependency group for Stockmark-2-VL-100B-beta in llm-jp-eval-mm.

```bash
uv add --group stockmark_vl \
  "transformers>=4.49.0" "torch>=2.5.1" "torchvision>=0.20.1" \
  "flash-attn>=2.7.3" "accelerate>=0.27.2" "sentencepiece>=0.2.0" \
  "pillow>=10.4.0" "protobuf>=5.29.3"
```
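Once the model class, the `MODEL_ID_TO_CLASS_PATH` entry, and the dependency group are in place, the class above can be exercised directly as a quick sanity check before running a full benchmark. The snippet below is a minimal sketch under the assumption that it is run from the `llm-jp-eval-mm/examples` directory (so that `stockmark_vl.py` and its local imports resolve); it reuses the demo image bundled with this model repository.

```python
from huggingface_hub import hf_hub_download
from PIL import Image

from stockmark_vl import VLM  # the model class defined above

model_id = "stockmark/Stockmark-2-VL-100B-beta"
vlm = VLM(model_id)

# Reuse the demo image bundled with the model repository.
img_path = hf_hub_download(repo_id=model_id, filename="assets/demo.png")
image = Image.open(img_path)

question = "30歳未満の社員に対するアンケート回答結果で、最も割合が高かった「使用頻度」は何ですか?"
print(vlm.generate([image], question))
```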
## Risks and Limitations

As a beta release, this model has not been fully calibrated to ensure compliance with social norms, ethical standards, and legal regulations.

Since Stockmark-2-VL-100B-beta is a visual reasoning model, it may ignore formatting requirements given in the prompt and include the output of its CoT process in the final response.

## License

[Qwen license](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct/blob/main/LICENSE)

## Developed by

[Stockmark Inc.](https://stockmark.co.jp/)