Stockmark-2-VL-100B-beta
Introduction
Stockmark-2-VL-100B-beta is a 100-billion-parameter Japanese-specialized visual language model with Chain-of-Thought (CoT) reasoning for document reading comprehension. Because it was trained with synthetic data generated by Qwen2.5-VL-72B, it is provided in accordance with the Qwen license.
As a beta release, Stockmark-2-VL-100B-beta is still undergoing improvements and evaluations. Feedback and insights from users will help refine future versions.
See our blog for details.
This project is supported by GENIAC.
Model architecture
The architecture of Stockmark-2-VL-100B-beta follows the same framework as LLaVA-OneVision:
- LLM: in our preliminary experiments on architecture and data, we used the smaller Qwen/Qwen2-7B-Instruct; for the official training run, we used stockmark/Stockmark-2-100B-Instruct-beta.
- Vision encoder: instead of google/siglip-so400m-patch14-384, which the original LLaVA-OneVision uses, we employed the newer google/siglip2-so400m-patch14-384 for its better multilingual performance. In ablation experiments, the model with SigLIP2 outperformed the one with SigLIP.
- Projector: the projector is a 2-layer MLP initialized with random weights (a minimal sketch follows below).
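To make the composition concrete, here is a minimal sketch of a LLaVA-OneVision-style, randomly initialized 2-layer MLP projector as described above. The hidden sizes are illustrative assumptions (1152 is the SigLIP2-so400m feature width; the LLM hidden size is a placeholder, not the actual Stockmark-2 value), and this is not the exact training code.

```python
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Sketch of a LLaVA-OneVision-style 2-layer MLP projector (illustrative sizes)."""

    def __init__(self, vision_hidden_size: int = 1152, llm_hidden_size: int = 8192):
        super().__init__()
        # Two linear layers with a GELU in between; weights start from random
        # initialization, as described for Stockmark-2-VL-100B-beta.
        self.proj = nn.Sequential(
            nn.Linear(vision_hidden_size, llm_hidden_size),
            nn.GELU(),
            nn.Linear(llm_hidden_size, llm_hidden_size),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # Map patch embeddings from the vision encoder into the LLM embedding space.
        return self.proj(image_features)
```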
Evaluation
Japanese document reading comprehension performance evaluation
We evaluated document reading comprehension performance using the following three benchmarks:
- JDocQA: 1,175 questions in total. We evaluated Stockmark-2-VL-100B-beta using llm-jp-eval-mm and adopted the LLM-as-a-judge score as the comparison metric, with gpt-4o-2024-11-20 as the judge model (a minimal sketch of this judge-scoring setup follows the list). Scores for models other than Stockmark-2-VL-100B-beta were taken from the llm-jp-eval-mm leaderboard as of 2025/5/15.
- BusinessSlideVQA: a benchmark of 220 questions constructed by our team to evaluate comprehension of complex Japanese business slide images. The evaluation metric is the same as for JDocQA: LLM-as-a-judge scoring with gpt-4o-2024-11-20 as the judge model.
- JChartQA: we randomly sampled 100 questions from ChartQA-val and translated the English text in both the questions and the images into Japanese to construct this benchmark.
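For reference, below is a minimal sketch of what LLM-as-a-judge scoring looks like in principle. The actual judge prompts, scales, and answer parsing are defined inside llm-jp-eval-mm; the rubric prompt here is a simplified assumption, not the one used for the scores in this card.

```python
from openai import OpenAI

client = OpenAI()


def judge_score(question: str, reference: str, prediction: str) -> int:
    # Simplified, hypothetical rubric: ask gpt-4o-2024-11-20 for a 1-5 score.
    prompt = (
        "Rate the model answer against the reference answer on a 1-5 scale.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "Reply with the number only."
    )
    response = client.chat.completions.create(
        model="gpt-4o-2024-11-20",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```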
The comparison results are shown in the table below. Stockmark-2-VL-100B-beta significantly outperforms the other domestic VLMs on all three benchmarks. It also scores higher than GPT-4o on BusinessSlideVQA and JChartQA, indicating document reading comprehension performance that overall surpasses GPT-4o.
| Model | BusinessSlideVQA /LLM | JChartQA /Acc | JDocQA /LLM |
|---|---|---|---|
| Heron-NVILA-Lite-15B | 2.8 | 0.41 | 2.7 |
| sarashina2-vision-14b | 3.3 | 0.52 | 3.1 |
| llm-jp-3-vila-14b | 2.0 | 0.23 | 2.5 |
| gpt-4o-2024-11-20 | 4.1 | 0.77 | 3.6 |
| Stockmark-2-VL-100B-beta | 4.2 | 0.88 | 3.5 |
Japanese general domain VQA
We evaluated model performance on Japanese general-domain VQA using three commonly used benchmarks: Heron-Bench, JA-VLM-Bench-In-the-Wild, and JA-VG-VQA500. All three were evaluated with llm-jp-eval-mm, with all generation parameters left at their default values and gpt-4o-2024-11-20 as the judge model for scoring.
Stockmark-2-VL-100B-beta achieved the highest scores on Heron-Bench and JA-VG-VQA500, and also ranked among the top performers on JA-VLM-Bench-In-the-Wild. This demonstrates that Stockmark-2-VL-100B-beta is currently a state-of-the-art domestic model for Japanese general domain tasks.
| Model | Heron-Bench /LLM | JA-VLM-Bench-In-the-Wild /LLM | JA-VG-VQA500 /LLM |
|---|---|---|---|
| Heron-NVILA-Lite-15B | 73.5 | 4.4 | 4.0 |
| sarashina2-vision-14b | 60.1 | 4.0 | 3.7 |
| llm-jp-3-vila-14b | 68.0 | 4.1 | 3.9 |
| Stockmark-2-VL-100B-beta | 78.8 | 4.1 | 4.1 |
Quickstart
Inference using 🤗Transformers
Stockmark-2-VL-100B-beta is built on the LLaVA-OneVision architecture, so please make sure you have `transformers>=4.45.0` installed:
```bash
pip install "transformers>=4.45.0" accelerate torchvision pillow
```
The following is a code snippet demonstrating how to use Stockmark-2-VL-100B-beta with pure `transformers`.
```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
from huggingface_hub import hf_hub_download

model_id = "stockmark/Stockmark-2-VL-100B-beta"

# Load the model in bfloat16 and shard it across the available GPUs.
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# The user message asks (in Japanese): "In the survey results for employees
# under 30, which 'usage frequency' had the highest share?"
conversation = [
    {
        "role": "system",
        "content": "あなたは誠実で優秀な日本人のアシスタントです。"
    },
    {
        "role": "user",
        "content": "<image>30歳未満の社員に対するアンケート回答結果で、最も割合が高かった「使用頻度」は何ですか?",
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Download the demo image bundled with the model repository.
img_path = hf_hub_download(
    repo_id=model_id,
    filename="assets/demo.png"
)
raw_image = Image.open(img_path)

inputs = processor(images=raw_image, text=prompt, return_tensors="pt").to("cuda").to(torch.bfloat16)
output_ids = model.generate(**inputs, max_new_tokens=255, do_sample=False)

# Strip the prompt tokens from the generated sequence before decoding.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
answer = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)[0].strip()
print(answer)
```
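The checkpoint is roughly 200 GB in bfloat16, so `device_map="auto"` will shard it across every visible GPU. If you want to cap how much memory each device may use, one option is the `max_memory` argument, shown in the hedged sketch below; the 80 GiB per-GPU budget is an illustrative assumption, not a recommendation from the model authors.

```python
# Sketch only: limit per-GPU memory when sharding the ~200 GB bf16 checkpoint.
# Uses the imports and model_id from the snippet above; adjust "80GiB" to your hardware.
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    device_map="auto",
    max_memory={i: "80GiB" for i in range(torch.cuda.device_count())},
)
```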
Inference using vLLM
The following is a code snippet demonstrating how to use Stockmark-2-VL-100B-beta with vLLM.
```python
import os

from PIL import Image
from transformers import AutoProcessor
from huggingface_hub import hf_hub_download
from vllm import LLM, SamplingParams

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"


def main():
    model_id = "stockmark/Stockmark-2-VL-100B-beta"
    processor = AutoProcessor.from_pretrained(
        model_id,
        trust_remote_code=True
    )

    # The user message asks (in Japanese): "In the survey results for employees
    # under 30, which 'usage frequency' had the highest share?"
    message = [
        {
            "role": "system",
            "content": "あなたは誠実で優秀な日本人のアシスタントです。"
        },
        {
            "role": "user",
            "content": "<image>30歳未満の社員に対するアンケート回答結果で、最も割合が高かった「使用頻度」は何ですか?"
        }
    ]
    prompt = processor.apply_chat_template(message, add_generation_prompt=True)
    print(prompt)

    # Shard the model across two GPUs with tensor parallelism.
    llm = LLM(
        model=model_id,
        tensor_parallel_size=2,
        limit_mm_per_prompt={"image": 1},
        trust_remote_code=True,
        dtype="bfloat16",
    )

    # Download the demo image bundled with the model repository.
    img_path = hf_hub_download(
        repo_id=model_id,
        filename="assets/demo.png"
    )
    image = Image.open(img_path)

    inputs = {
        "prompt": prompt,
        "multi_modal_data": {
            "image": image
        },
    }
    sampling_params = SamplingParams(
        temperature=0,
        max_tokens=256
    )
    outputs = llm.generate(
        inputs,
        sampling_params=sampling_params,
    )
    answer = outputs[0].outputs[0].text
    print(answer)


if __name__ == "__main__":
    main()
```
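`tensor_parallel_size=2` assumes two GPUs that together hold the bfloat16 weights. If that does not match your setup, one option is to raise the degree of tensor parallelism, as in the sketch below; the GPU count of 4 is an illustrative assumption.

```python
# Sketch only: adjust tensor parallelism to the number of available GPUs.
# Uses the LLM import from the snippet above.
llm = LLM(
    model="stockmark/Stockmark-2-VL-100B-beta",
    tensor_parallel_size=4,
    limit_mm_per_prompt={"image": 1},
    trust_remote_code=True,
    dtype="bfloat16",
)
```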
Evaluation using llm-jp-eval-mm
If you are interested in evaluating Stockmark-2-VL-100B-beta using llm-jp-eval-mm, please add the following code to llm-jp-eval-mm.
Model class
The following is the model class for Stockmark-2-VL-100B-beta in llm-jp-eval-mm. Please place it in the llm-jp-eval-mm/examples directory.
```python
# -*- coding: utf-8 -*-
"""
@File        : stockmark_vl.py
@Description : The VLM model class for Stockmark-2-VL-100B-beta.
"""
import torch
from PIL import Image
from transformers import LlavaOnevisionForConditionalGeneration, AutoProcessor

from base_vlm import BaseVLM
from utils import GenerationConfig

DEFAULT_IMAGE_TOKEN = "<image>"


class VLM(BaseVLM):
    def __init__(self, model_id) -> None:
        self.model_id = model_id
        self.model = LlavaOnevisionForConditionalGeneration.from_pretrained(
            self.model_id,
            torch_dtype=torch.bfloat16,
            trust_remote_code=True,
            low_cpu_mem_usage=True,
            device_map="auto"
        )
        self.processor = AutoProcessor.from_pretrained(self.model_id)

    def generate(
        self,
        images: list[Image.Image],
        text: str,
        gen_kwargs: GenerationConfig = GenerationConfig(),
    ) -> str:
        content = DEFAULT_IMAGE_TOKEN * len(images) + "\n" + text
        messages = [
            {
                "role": "system",
                "content": "あなたは誠実で優秀な日本人のアシスタントです。"
            },
            {
                "role": "user",
                "content": content,
            },
        ]
        prompt = self.processor.apply_chat_template(
            messages, add_generation_prompt=True
        )
        if len(images) == 0:
            images = None
        inputs = self.processor(images=images, text=prompt, return_tensors="pt").to(
            "cuda"
        ).to(torch.bfloat16)
        output_ids = self.model.generate(**inputs, **gen_kwargs.__dict__)
        generated_ids = [
            output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)
        ]
        answer = self.processor.batch_decode(
            generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
        )[0].strip()
        return answer
```
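As a quick sanity check of the wrapper above, something like the following sketch can be run from the llm-jp-eval-mm/examples directory (so that base_vlm and utils resolve); the image path and question are placeholders, not part of the benchmark setup.

```python
# Hypothetical smoke test for the VLM wrapper above; run from llm-jp-eval-mm/examples.
from PIL import Image
from stockmark_vl import VLM

vlm = VLM("stockmark/Stockmark-2-VL-100B-beta")
image = Image.open("demo.png")  # placeholder path to any local test image
print(vlm.generate([image], "この画像には何が写っていますか?"))  # "What is shown in this image?"
```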
Please make sure that an entry for Stockmark-2-VL-100B-beta is included in MODEL_ID_TO_CLASS_PATH in llm-jp-eval-mm/examples/model_table.py:
```python
MODEL_ID_TO_CLASS_PATH = {
    "stockmark/Stockmark-2-VL-100B-beta": "stockmark_vl.VLM",
}
```
Dependency group
Use the following command to create a dependency group for Stockmark-2-VL-100B-beta in llm-jp-eval-mm.
```bash
uv add --group stockmark_vl "transformers>=4.49.0" "torch>=2.5.1" "torchvision>=0.20.1" "flash-attn>=2.7.3" "accelerate>=0.27.2" "sentencepiece>=0.2.0" "pillow>=10.4.0" "protobuf>=5.29.3"
```
Risks and Limitations
As a beta release, this model has not been fully calibrated to ensure compliance with social norms, ethical standards, and legal regulations.
Because Stockmark-2-VL-100B-beta is a visual reasoning model, it may ignore formatting requirements given in the prompt and include the output of its CoT process in the response.
License
Stockmark-2-VL-100B-beta is provided in accordance with the Qwen license.
Developed by
Stockmark Inc.