Stockmark-2-VL-100B-beta
Introduction
Stockmark-2-VL-100B-beta is a 100-billion-parameter Japanese-specialized visual language model with Chain-of-Thought (CoT) reasoning for document reading comprehension. Because it was trained with synthetic data generated by Qwen2.5-VL-72B, it is provided in accordance with the Qwen license.
As a beta release, Stockmark-2-VL-100B-beta is still undergoing improvements and evaluations. Feedback and insights from users will help refine future versions.
See our blog for details.
This project is supported by GENIAC.
Model architecture
The architecture of Stockmark-2-VL-100B-beta follows the same framework as LLaVA-OneVision:
- LLM: in our preliminary experiments on architecture and data, we used the smaller Qwen/Qwen2-7B-Instruct; for the official training run, we used stockmark/Stockmark-2-100B-Instruct-beta.
- Vision encoder: instead of google/siglip-so400m-patch14-384, which the original LLaVA-OneVision uses, we employed the newer google/siglip2-so400m-patch14-384 for its better multilingual performance. In ablation experiments, the model with SigLIP2 outperformed the one with SigLIP.
- Projector: the projector is a 2-layer MLP initialized with random weights (a minimal sketch follows below).
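To make the composition concrete, here is a minimal sketch of a LLaVA-OneVision-style, randomly initialized 2-layer MLP projector as described above. The hidden sizes are illustrative assumptions (1152 is the SigLIP2-so400m feature width; the LLM hidden size is a placeholder, not the actual Stockmark-2 value), and this is not the exact training code.

```python
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Sketch of a LLaVA-OneVision-style 2-layer MLP projector (illustrative sizes)."""

    def __init__(self, vision_hidden_size: int = 1152, llm_hidden_size: int = 8192):
        super().__init__()
        # Two linear layers with a GELU in between; weights start from random
        # initialization, as described for Stockmark-2-VL-100B-beta.
        self.proj = nn.Sequential(
            nn.Linear(vision_hidden_size, llm_hidden_size),
            nn.GELU(),
            nn.Linear(llm_hidden_size, llm_hidden_size),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # Map patch embeddings from the vision encoder into the LLM embedding space.
        return self.proj(image_features)
```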
Evaluation
Japanese document reading comprehension performance evaluation
We evaluated document reading comprehension performance using the following three benchmarks:
- JDocQA: 1,175 questions in total. We evaluated Stockmark-2-VL-100B-beta using llm-jp-eval-mm and adopted the LLM-as-a-judge score as the comparison metric, with gpt-4o-2024-11-20 as the judge model (a minimal sketch of this judge-scoring setup follows the list). Scores for models other than Stockmark-2-VL-100B-beta were taken from the llm-jp-eval-mm leaderboard as of 2025/5/15.
- BusinessSlideVQA: a benchmark of 220 questions constructed by our team to evaluate comprehension of complex Japanese business slide images. The evaluation metric is the same as for JDocQA: LLM-as-a-judge scoring with gpt-4o-2024-11-20 as the judge model.
- JChartQA: we randomly sampled 100 questions from ChartQA-val and translated the English text in both the questions and the images into Japanese to construct this benchmark.
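For reference, below is a minimal sketch of what LLM-as-a-judge scoring looks like in principle. The actual judge prompts, scales, and answer parsing are defined inside llm-jp-eval-mm; the rubric prompt here is a simplified assumption, not the one used for the scores in this card.

```python
from openai import OpenAI

client = OpenAI()


def judge_score(question: str, reference: str, prediction: str) -> int:
    # Simplified, hypothetical rubric: ask gpt-4o-2024-11-20 for a 1-5 score.
    prompt = (
        "Rate the model answer against the reference answer on a 1-5 scale.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "Reply with the number only."
    )
    response = client.chat.completions.create(
        model="gpt-4o-2024-11-20",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```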
The comparison results are shown in the table below. Stockmark-2-VL-100B-beta significantly outperforms the other domestic VLMs on all three benchmarks. It also scores higher than GPT-4o on BusinessSlideVQA and JChartQA, indicating document reading comprehension performance that overall surpasses GPT-4o.
| Model | BusinessSlideVQA /LLM | JChartQA /Acc | JDocQA /LLM |
|---|---|---|---|
| Heron-NVILA-Lite-15B | 2.8 | 0.41 | 2.7 |
| sarashina2-vision-14b | 3.3 | 0.52 | 3.1 |
| llm-jp-3-vila-14b | 2.0 | 0.23 | 2.5 |
| gpt-4o-2024-11-20 | 4.1 | 0.77 | 3.6 |
| Stockmark-2-VL-100B-beta | 4.2 | 0.88 | 3.5 |
Japanese general domain VQA
We evaluated model performance on Japanese general-domain VQA using three commonly used benchmarks: Heron-Bench, JA-VLM-Bench-In-the-Wild, and JA-VG-VQA500. All three were evaluated with llm-jp-eval-mm, with all generation parameters left at their default values and gpt-4o-2024-11-20 as the judge model for scoring.
Stockmark-2-VL-100B-beta achieved the highest scores on Heron-Bench and JA-VG-VQA500, and also ranked among the top performers on JA-VLM-Bench-In-the-Wild. This demonstrates that Stockmark-2-VL-100B-beta is currently a state-of-the-art domestic model for Japanese general domain tasks.
| Model | Heron-Bench /LLM | JA-VLM-Bench-In-the-Wild /LLM | JA-VG-VQA500 /LLM |
|---|---|---|---|
| Heron-NVILA-Lite-15B | 73.5 | 4.4 | 4.0 |
| sarashina2-vision-14b | 60.1 | 4.0 | 3.7 |
| llm-jp-3-vila-14b | 68.0 | 4.1 | 3.9 |
| Stockmark-2-VL-100B-beta | 78.8 | 4.1 | 4.1 |
Quickstart
Inference using 🤗Transformers
Stockmark-2-VL-100B-beta is built on the LLaVA-OneVision architecture, so please make sure you have `transformers>=4.45.0` installed:
```bash
pip install "transformers>=4.45.0" accelerate torchvision pillow
```
The following is a code snippet demonstrating how to use Stockmark-2-VL-100B-beta with pure `transformers`.
```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
from huggingface_hub import hf_hub_download

model_id = "stockmark/Stockmark-2-VL-100B-beta"

# Load the model in bfloat16 and shard it across the available GPUs.
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# The user message asks (in Japanese): "In the survey results for employees
# under 30, which 'usage frequency' had the highest share?"
conversation = [
    {
        "role": "system",
        "content": "あなたは誠実で優秀な日本人のアシスタントです。"
    },
    {
        "role": "user",
        "content": "<image>30歳未満の社員に対するアンケート回答結果で、最も割合が高かった「使用頻度」は何ですか?",
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Download the demo image bundled with the model repository.
img_path = hf_hub_download(
    repo_id=model_id,
    filename="assets/demo.png"
)
raw_image = Image.open(img_path)

inputs = processor(images=raw_image, text=prompt, return_tensors="pt").to("cuda").to(torch.bfloat16)
output_ids = model.generate(**inputs, max_new_tokens=255, do_sample=False)

# Strip the prompt tokens from the generated sequence before decoding.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
answer = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)[0].strip()
print(answer)
```
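The checkpoint is roughly 200 GB in bfloat16, so `device_map="auto"` will shard it across every visible GPU. If you want to cap how much memory each device may use, one option is the `max_memory` argument, shown in the hedged sketch below; the 80 GiB per-GPU budget is an illustrative assumption, not a recommendation from the model authors.

```python
# Sketch only: limit per-GPU memory when sharding the ~200 GB bf16 checkpoint.
# Uses the imports and model_id from the snippet above; adjust "80GiB" to your hardware.
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    device_map="auto",
    max_memory={i: "80GiB" for i in range(torch.cuda.device_count())},
)
```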
Inference using vLLM
The following is a code snippet demonstrating how to use Stockmark-2-VL-100B-beta with vLLM.
```python
import os

from PIL import Image
from transformers import AutoProcessor
from huggingface_hub import hf_hub_download
from vllm import LLM, SamplingParams

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"


def main():
    model_id = "stockmark/Stockmark-2-VL-100B-beta"
    processor = AutoProcessor.from_pretrained(
        model_id,
        trust_remote_code=True
    )

    # The user message asks (in Japanese): "In the survey results for employees
    # under 30, which 'usage frequency' had the highest share?"
    message = [
        {
            "role": "system",
            "content": "あなたは誠実で優秀な日本人のアシスタントです。"
        },
        {
            "role": "user",
            "content": "<image>30歳未満の社員に対するアンケート回答結果で、最も割合が高かった「使用頻度」は何ですか?"
        }
    ]
    prompt = processor.apply_chat_template(message, add_generation_prompt=True)
    print(prompt)

    # Shard the model across two GPUs with tensor parallelism.
    llm = LLM(
        model=model_id,
        tensor_parallel_size=2,
        limit_mm_per_prompt={"image": 1},
        trust_remote_code=True,
        dtype="bfloat16",
    )

    # Download the demo image bundled with the model repository.
    img_path = hf_hub_download(
        repo_id=model_id,
        filename="assets/demo.png"
    )
    image = Image.open(img_path)

    inputs = {
        "prompt": prompt,
        "multi_modal_data": {
            "image": image
        },
    }
    sampling_params = SamplingParams(
        temperature=0,
        max_tokens=256
    )
    outputs = llm.generate(
        inputs,
        sampling_params=sampling_params,
    )
    answer = outputs[0].outputs[0].text
    print(answer)


if __name__ == "__main__":
    main()
```
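`tensor_parallel_size=2` assumes two GPUs that together hold the bfloat16 weights. If that does not match your setup, one option is to raise the degree of tensor parallelism, as in the sketch below; the GPU count of 4 is an illustrative assumption.

```python
# Sketch only: adjust tensor parallelism to the number of available GPUs.
# Uses the LLM import from the snippet above.
llm = LLM(
    model="stockmark/Stockmark-2-VL-100B-beta",
    tensor_parallel_size=4,
    limit_mm_per_prompt={"image": 1},
    trust_remote_code=True,
    dtype="bfloat16",
)
```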
Evaluation using llm-jp-eval-mm
If you are interested in evaluating Stockmark-2-VL-100B-beta using llm-jp-eval-mm, please add the following code to llm-jp-eval-mm.
Model class
The following is the model class for Stockmark-2-VL-100B-beta in llm-jp-eval-mm. Please place it in the llm-jp-eval-mm/examples directory.
```python
# -*- coding: utf-8 -*-
"""
@File        : stockmark_vl.py
@Description : The VLM model class for Stockmark-2-VL-100B-beta.
"""
import torch
from PIL import Image
from transformers import LlavaOnevisionForConditionalGeneration, AutoProcessor

from base_vlm import BaseVLM
from utils import GenerationConfig

DEFAULT_IMAGE_TOKEN = "<image>"


class VLM(BaseVLM):
    def __init__(self, model_id) -> None:
        self.model_id = model_id
        self.model = LlavaOnevisionForConditionalGeneration.from_pretrained(
            self.model_id,
            torch_dtype=torch.bfloat16,
            trust_remote_code=True,
            low_cpu_mem_usage=True,
            device_map="auto"
        )
        self.processor = AutoProcessor.from_pretrained(self.model_id)

    def generate(
        self,
        images: list[Image.Image],
        text: str,
        gen_kwargs: GenerationConfig = GenerationConfig(),
    ) -> str:
        content = DEFAULT_IMAGE_TOKEN * len(images) + "\n" + text
        messages = [
            {
                "role": "system",
                "content": "あなたは誠実で優秀な日本人のアシスタントです。"
            },
            {
                "role": "user",
                "content": content,
            },
        ]
        prompt = self.processor.apply_chat_template(
            messages, add_generation_prompt=True
        )
        if len(images) == 0:
            images = None
        inputs = self.processor(images=images, text=prompt, return_tensors="pt").to(
            "cuda"
        ).to(torch.bfloat16)
        output_ids = self.model.generate(**inputs, **gen_kwargs.__dict__)
        generated_ids = [
            output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)
        ]
        answer = self.processor.batch_decode(
            generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
        )[0].strip()
        return answer
```
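As a quick sanity check of the wrapper above, something like the following sketch can be run from the llm-jp-eval-mm/examples directory (so that base_vlm and utils resolve); the image path and question are placeholders, not part of the benchmark setup.

```python
# Hypothetical smoke test for the VLM wrapper above; run from llm-jp-eval-mm/examples.
from PIL import Image
from stockmark_vl import VLM

vlm = VLM("stockmark/Stockmark-2-VL-100B-beta")
image = Image.open("demo.png")  # placeholder path to any local test image
print(vlm.generate([image], "この画像には何が写っていますか?"))  # "What is shown in this image?"
```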
Please make sure that an entry for Stockmark-2-VL-100B-beta is included in MODEL_ID_TO_CLASS_PATH in llm-jp-eval-mm/examples/model_table.py:
```python
MODEL_ID_TO_CLASS_PATH = {
    "stockmark/Stockmark-2-VL-100B-beta": "stockmark_vl.VLM",
}
```
Dependency group
Use the following command to create a dependency group for Stockmark-2-VL-100B-beta in llm-jp-eval-mm.
```bash
uv add --group stockmark_vl "transformers>=4.49.0" "torch>=2.5.1" "torchvision>=0.20.1" "flash-attn>=2.7.3" "accelerate>=0.27.2" "sentencepiece>=0.2.0" "pillow>=10.4.0" "protobuf>=5.29.3"
```
Risks and Limitations
As a beta release, this model has not been fully calibrated to ensure compliance with social norms, ethical standards, and legal regulations.
Because Stockmark-2-VL-100B-beta is a visual reasoning model, it may ignore formatting requirements given in the prompt and include the output of its CoT process in the response.
License
Stockmark-2-VL-100B-beta is provided in accordance with the Qwen license.
Developed by
Stockmark Inc.