---
license: other
license_name: qwen
license_link: https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct/blob/main/LICENSE
language:
- ja
- en
pipeline_tag: image-text-to-text
tags:
- multimodal
library_name: transformers
---

# Stockmark-2-VL-100B-beta

![image/jpeg](./assets/stmk2_vl_logo.png)

## Introduction

**Stockmark-2-VL-100B-beta** is a 100-billion-parameter Japanese-specialized visual language model with Chain-of-Thought (CoT) reasoning for document reading comprehension. Because Stockmark-2-VL-100B-beta was trained with synthetic data generated by [Qwen2.5-VL-72B](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct), it is provided in accordance with the [Qwen license](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct/blob/main/LICENSE).

As a beta release, Stockmark-2-VL-100B-beta is still undergoing improvement and evaluation. Feedback and insights from users will help refine future versions. See [our blog](https://stockmark-tech.hatenablog.com/entry/2025/06/03/101007) for details.

This project is supported by [GENIAC](https://www.meti.go.jp/policy/mono_info_service/geniac/index.html).

## Model architecture

The architecture of Stockmark-2-VL-100B-beta follows the same framework as [LLaVA-OneVision](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/):

- **LLM**: we used the smaller [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) in our preliminary experiments on architecture and data, and adopted [stockmark/Stockmark-2-100B-Instruct-beta](https://huggingface.co/stockmark/Stockmark-2-100B-Instruct-beta) for the official training run.
- **Vision encoder**: instead of the [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) encoder used in the original LLaVA-OneVision, we employed the newly released [google/siglip2-so400m-patch14-384](https://huggingface.co/google/siglip2-so400m-patch14-384), which offers better multilingual performance. In ablation experiments, the model with SigLIP2 outperformed the one with SigLIP.
- **Projector**: the 2-layer MLP projector was initialized with random weights; a schematic sketch of this component is shown below.
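To make the composition above concrete, the following is a minimal schematic of a LLaVA-OneVision-style 2-layer MLP projector, which maps vision-encoder features into the LLM embedding space. It is an illustration only, not the released implementation: the class name is ours, 1152 corresponds to the SigLIP-so400m feature width, and 8192 is an assumed placeholder for the LLM hidden size rather than the actual dimension of Stockmark-2-100B-Instruct-beta.

```python
import torch
import torch.nn as nn


class TwoLayerMLPProjector(nn.Module):
    """Schematic LLaVA-style projector: vision features -> LLM embedding space."""

    def __init__(self, vision_hidden_size: int, llm_hidden_size: int):
        super().__init__()
        self.linear_1 = nn.Linear(vision_hidden_size, llm_hidden_size)
        self.act = nn.GELU()
        self.linear_2 = nn.Linear(llm_hidden_size, llm_hidden_size)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patch_tokens, vision_hidden_size)
        return self.linear_2(self.act(self.linear_1(image_features)))


# Illustrative dimensions only: 1152 is the SigLIP-so400m feature width;
# 8192 is a placeholder for the LLM hidden size, not the real value.
projector = TwoLayerMLPProjector(vision_hidden_size=1152, llm_hidden_size=8192)
patch_tokens = torch.randn(1, 729, 1152)  # dummy batch of 729 patch tokens (27 x 27)
print(projector(patch_tokens).shape)      # torch.Size([1, 729, 8192])
```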
## Evaluation

### Japanese document reading comprehension performance evaluation

We evaluated document reading comprehension performance using the following three benchmarks:

- [JDocQA](https://github.com/mizuumi/JDocQA): a total of 1,175 questions. We evaluated Stockmark-2-VL-100B-beta with [llm-jp-eval-mm](https://github.com/llm-jp/llm-jp-eval-mm) and adopted the LLM-as-a-judge score as the comparison metric (using `gpt-4o-2024-11-20` as the judge model). Scores for models other than Stockmark-2-VL-100B-beta were taken from the [llm-jp-eval-mm leaderboard](https://llm-jp.github.io/llm-jp-eval-mm/) as of 2025/5/15.
- [BusinessSlideVQA](https://github.com/stockmarkteam/business-slide-questions): a benchmark of 220 questions, constructed by our team, for evaluating the ability to comprehend complex Japanese business slide images. The evaluation metric is the same as for JDocQA: LLM-as-a-judge scoring with `gpt-4o-2024-11-20` as the judge model.
- JChartQA: a benchmark we constructed by randomly sampling 100 questions from ChartQA-val and translating the English text in both the questions and the images into Japanese.

The comparison results are shown in the table below. Stockmark-2-VL-100B-beta significantly outperforms the other domestic VLMs on all metrics. It also achieves higher scores than GPT-4o on BusinessSlideVQA and JChartQA, demonstrating overall document reading comprehension performance superior to that of GPT-4o.

|                                                                                     | BusinessSlideVQA /LLM | JChartQA /Acc | JDocQA /LLM |
|-------------------------------------------------------------------------------------|-----------------------|---------------|-------------|
| [Heron-NVILA-Lite-15B](https://huggingface.co/turing-motors/Heron-NVILA-Lite-15B)  | 2.8                   | 0.41          | 2.7         |
| [sarashina2-vision-14b](https://huggingface.co/sbintuitions/sarashina2-vision-14b) | 3.3                   | 0.52          | 3.1         |
| [llm-jp-3-vila-14b](https://huggingface.co/llm-jp/llm-jp-3-vila-14b)                | 2.0                   | 0.23          | 2.5         |
| [gpt-4o-2024-11-20](https://platform.openai.com/docs/models/gpt-4o)                 | 4.1                   | 0.77          | **3.6**     |
| **Stockmark-2-VL-100B-beta**                                                        | **4.2**               | **0.88**      | 3.5         |

### Japanese general domain VQA

We selected the following three commonly used benchmarks to evaluate model performance on Japanese general-domain VQA:

- [Heron-Bench](https://huggingface.co/datasets/turing-motors/Japanese-Heron-Bench)
- [JA-VLM-Bench-In-the-Wild](https://huggingface.co/datasets/SakanaAI/JA-VLM-Bench-In-the-Wild)
- [JA-VG-VQA500](https://huggingface.co/datasets/SakanaAI/JA-VG-VQA-500)

All three benchmarks were evaluated with llm-jp-eval-mm. Generation parameters were left at their default values, and `gpt-4o-2024-11-20` was used as the judge model for scoring. Stockmark-2-VL-100B-beta achieved the highest scores on Heron-Bench and JA-VG-VQA500 and also ranked among the top performers on JA-VLM-Bench-In-the-Wild, demonstrating that it is currently a state-of-the-art domestic model for Japanese general-domain tasks.

|                                                                                     | Heron-Bench /LLM | JA-VLM-Bench-In-the-Wild /LLM | JA-VG-VQA500 /LLM |
|-------------------------------------------------------------------------------------|------------------|-------------------------------|-------------------|
| [Heron-NVILA-Lite-15B](https://huggingface.co/turing-motors/Heron-NVILA-Lite-15B)  | 73.5             | **4.4**                       | 4.0               |
| [sarashina2-vision-14b](https://huggingface.co/sbintuitions/sarashina2-vision-14b) | 60.1             | 4.0                           | 3.7               |
| [llm-jp-3-vila-14b](https://huggingface.co/llm-jp/llm-jp-3-vila-14b)                | 68.0             | 4.1                           | 3.9               |
| **Stockmark-2-VL-100B-beta**                                                        | **78.8**         | 4.1                           | **4.1**           |

## Quickstart

### Inference using 🤗Transformers

Stockmark-2-VL-100B-beta is built on the LLaVA-OneVision architecture, so please make sure you have `transformers>=4.45.0` installed.

```bash
pip install "transformers>=4.45.0" accelerate torchvision pillow
```

The following is a code snippet demonstrating how to use Stockmark-2-VL-100B-beta with pure `transformers`.
```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
from huggingface_hub import hf_hub_download

model_id = "stockmark/Stockmark-2-VL-100B-beta"

# Load the model in bfloat16 and spread it across the available GPUs.
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

conversation = [
    {
        "role": "system",
        "content": "あなたは誠実で優秀な日本人のアシスタントです。",
    },
    {
        "role": "user",
        "content": "30歳未満の社員に対するアンケート回答結果で、最も割合が高かった「使用頻度」は何ですか?",
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Download the demo image bundled with the model repository.
img_path = hf_hub_download(
    repo_id=model_id,
    filename="assets/demo.png",
)
raw_image = Image.open(img_path)

inputs = processor(images=raw_image, text=prompt, return_tensors="pt").to("cuda").to(torch.bfloat16)

output_ids = model.generate(**inputs, max_new_tokens=255, do_sample=False)
# Strip the prompt tokens from the generated sequences before decoding.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
answer = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)[0].strip()
print(answer)
```

### Inference using vLLM

The following is a code snippet demonstrating how to use Stockmark-2-VL-100B-beta with `vLLM`.

```python
import os

from PIL import Image
from transformers import AutoProcessor
from huggingface_hub import hf_hub_download
from vllm import LLM, SamplingParams

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"


def main():
    model_id = "stockmark/Stockmark-2-VL-100B-beta"
    processor = AutoProcessor.from_pretrained(
        model_id,
        trust_remote_code=True,
    )

    message = [
        {
            "role": "system",
            "content": "あなたは誠実で優秀な日本人のアシスタントです。",
        },
        {
            "role": "user",
            "content": "30歳未満の社員に対するアンケート回答結果で、最も割合が高かった「使用頻度」は何ですか?",
        },
    ]
    # Build the prompt with the model's chat template.
    prompt = processor.apply_chat_template(message, add_generation_prompt=True)
    print(prompt)

    llm = LLM(
        model=model_id,
        tensor_parallel_size=2,
        limit_mm_per_prompt={"image": 1},
        trust_remote_code=True,
        dtype="bfloat16",
    )

    # Download the demo image bundled with the model repository.
    img_path = hf_hub_download(
        repo_id=model_id,
        filename="assets/demo.png",
    )
    image = Image.open(img_path)

    inputs = {
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    }
    sampling_params = SamplingParams(
        temperature=0,
        max_tokens=256,
    )
    outputs = llm.generate(
        inputs,
        sampling_params=sampling_params,
    )

    answer = outputs[0].outputs[0].text
    print(answer)


if __name__ == "__main__":
    main()
```

### Evaluation using `llm-jp-eval-mm`

If you are interested in evaluating Stockmark-2-VL-100B-beta with llm-jp-eval-mm, please add the following code to llm-jp-eval-mm.
#### Model class

The following is the model class for Stockmark-2-VL-100B-beta in llm-jp-eval-mm. Please place it in the `llm-jp-eval-mm/examples` directory.

```python
# -*- coding: utf-8 -*-

"""
@File        : stockmark_vl.py
@Description : The VLM model class for Stockmark-2-VL-100B-beta.
"""

import torch
from PIL import Image
from transformers import LlavaOnevisionForConditionalGeneration, AutoProcessor

from base_vlm import BaseVLM
from utils import GenerationConfig

DEFAULT_IMAGE_TOKEN = "<image>"


class VLM(BaseVLM):
    def __init__(self, model_id) -> None:
        self.model_id = model_id
        self.model = LlavaOnevisionForConditionalGeneration.from_pretrained(
            self.model_id,
            torch_dtype=torch.bfloat16,
            trust_remote_code=True,
            low_cpu_mem_usage=True,
            device_map="auto",
        )
        self.processor = AutoProcessor.from_pretrained(self.model_id, trust_remote_code=True)

    def generate(
        self,
        images: list[Image.Image],
        text: str,
        gen_kwargs: GenerationConfig = GenerationConfig(),
    ) -> str:
        # Prepend one image token per input image to the question text.
        content = DEFAULT_IMAGE_TOKEN * len(images) + "\n" + text
        messages = [
            {
                "role": "system",
                "content": "あなたは誠実で優秀な日本人のアシスタントです。",
            },
            {
                "role": "user",
                "content": content,
            },
        ]
        prompt = self.processor.apply_chat_template(
            messages, add_generation_prompt=True
        )

        if len(images) == 0:
            images = None

        inputs = self.processor(images=images, text=prompt, return_tensors="pt").to(
            "cuda"
        ).to(torch.bfloat16)

        output_ids = self.model.generate(**inputs, **gen_kwargs.__dict__)
        # Strip the prompt tokens from the generated sequences before decoding.
        generated_ids = [
            output_ids[len(input_ids):]
            for input_ids, output_ids in zip(inputs.input_ids, output_ids)
        ]
        answer = self.processor.batch_decode(
            generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
        )[0].strip()
        return answer
```

Please make sure that the entry for Stockmark-2-VL-100B-beta is included in `MODEL_ID_TO_CLASS_PATH` in `llm-jp-eval-mm/examples/model_table.py`.

```python
MODEL_ID_TO_CLASS_PATH = {
    "stockmark/Stockmark-2-VL-100B-beta": "stockmark_vl.VLM",
}
```
#### Dependency group

Use the following command to create a dependency group for Stockmark-2-VL-100B-beta in llm-jp-eval-mm.

```bash
uv add --group stockmark_vl \
  "transformers>=4.49.0" "torch>=2.5.1" "torchvision>=0.20.1" \
  "flash-attn>=2.7.3" "accelerate>=0.27.2" "sentencepiece>=0.2.0" \
  "pillow>=10.4.0" "protobuf>=5.29.3"
```
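Once the model class, the `MODEL_ID_TO_CLASS_PATH` entry, and the dependency group are in place, the class above can be exercised directly as a quick sanity check before running a full benchmark. The snippet below is a minimal sketch under the assumption that it is run from the `llm-jp-eval-mm/examples` directory (so that `stockmark_vl.py` and its local imports resolve); it reuses the demo image bundled with this model repository.

```python
from huggingface_hub import hf_hub_download
from PIL import Image

from stockmark_vl import VLM  # the model class defined above

model_id = "stockmark/Stockmark-2-VL-100B-beta"
vlm = VLM(model_id)

# Reuse the demo image bundled with the model repository.
img_path = hf_hub_download(repo_id=model_id, filename="assets/demo.png")
image = Image.open(img_path)

question = "30歳未満の社員に対するアンケート回答結果で、最も割合が高かった「使用頻度」は何ですか?"
print(vlm.generate([image], question))
```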
## Risks and Limitations

As a beta release, this model has not been fully calibrated to ensure compliance with social norms, ethical standards, and legal regulations.

Since Stockmark-2-VL-100B-beta is a visual reasoning model, it may ignore formatting requirements given in the prompt and include the output of its CoT process in the final response.

## License

[Qwen license](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct/blob/main/LICENSE)

## Developed by

[Stockmark Inc.](https://stockmark.co.jp/)