|
--- |
|
base_model: |
|
- Qwen/Qwen2.5-VL-3B-Instruct |
|
datasets: |
|
- remyxai/OpenSpaces |
|
language: |
|
- en |
|
library_name: transformers |
|
license: apache-2.0 |
|
pipeline_tag: image-text-to-text |
|
tags: |
|
- remyx |
|
- vqasynth |
|
- spatial-reasoning |
|
- multimodal |
|
- vlm |
|
- vision-language |
|
- robotics |
|
- distance-estimation |
|
- embodied-ai |
|
- quantitative-spatial-reasoning |
|
new_version: remyxai/SpaceThinker-Qwen2.5VL-3B |
|
model-index: |
|
- name: SpaceQwen2.5-VL-3B-Instruct |
|
results: |
|
- task: |
|
type: visual-question-answering |
|
name: Spatial Reasoning |
|
dataset: |
|
name: 3DSRBench |
|
type: benchmark |
|
metrics: |
|
- type: success_rate |
|
value: 0.515 |
|
name: Overall Success Rate |
|
- type: success_rate |
|
value: 0.5 |
|
name: Overall Success Rate |
|
- type: success_rate |
|
value: 0.3045 |
|
name: Overall Success Rate |
|
- type: success_rate |
|
value: 0.5767 |
|
name: Overall Success Rate |
|
- type: success_rate |
|
value: 0.3663 |
|
name: Overall Success Rate |
|
- type: success_rate |
|
value: 0.33 |
|
name: Overall Success Rate |
|
- type: success_rate |
|
value: 0.4392 |
|
name: Overall Success Rate |
|
- type: success_rate |
|
value: 0.6554 |
|
name: Overall Success Rate |
|
- type: success_rate |
|
value: 0.2615 |
|
name: Overall Success Rate |
|
- type: success_rate |
|
value: 0.2322 |
|
name: Overall Success Rate |
|
- type: success_rate |
|
value: 0.7373 |
|
name: Overall Success Rate |
|
- type: success_rate |
|
value: 0.5179 |
|
name: Overall Success Rate |
|
- type: success_rate |
|
value: 0.4879 |
|
name: Overall Success Rate |
|
--- |
|
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/647777304ae93470ffc28913/v4edJliSy46xBA8g5ZXf8.png" width="500"/> |
|
|
|
# SpaceQwen2.5-VL-3B-Instruct |
|
|
|
The model is evaluated in the paper [OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models](https://huggingface.co/papers/2506.03135). More information can be found at the [project page](https://qizekun.github.io/omnispatial/).
|
|
|
|
|
- **Model Type:** Multimodal, Vision-Language Model |
|
- **Architecture:** `Qwen2.5-VL-3B-Instruct`
|
- **Model Size:** 3.75B parameters (FP16) |
|
- **Finetuned from:** Qwen/Qwen2.5-VL-3B-Instruct |
|
- **Finetune Strategy:** LoRA (Low-Rank Adaptation) |
|
- **License:** Apache-2.0 |
|
|
|
### Model Overview |
|
|
|
This model uses data synthesis techniques and publicly available models to reproduce the approach described in SpatialVLM, enhancing the spatial reasoning of multimodal models.

With a pipeline of expert models, we infer spatial relationships between objects in a scene and generate a VQA dataset for spatial reasoning.
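
To illustrate the kind of question/answer pair such a pipeline produces, here is a minimal sketch (not the VQASynth API; the object names and 3D centroids below are hypothetical):

```python
import numpy as np

# Minimal sketch: once expert models (e.g. depth estimation + segmentation)
# recover 3D centroids for two objects, a templated QA pair about their
# metric distance can be generated. Names and coordinates are illustrative.
def distance_qa(name_a, centroid_a, name_b, centroid_b):
    dist_m = float(np.linalg.norm(np.asarray(centroid_a) - np.asarray(centroid_b)))
    question = f"How far apart are the {name_a} and the {name_b}?"
    answer = f"The {name_a} and the {name_b} are about {dist_m:.2f} meters apart."
    return {"question": question, "answer": answer}

print(distance_qa("pallet", [0.4, 0.1, 2.3], "forklift", [1.9, 0.0, 3.1]))
```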
|
|
|
|
|
## Running SpaceQwen2.5-VL-3B-Instruct |
|
|
|
### Ollama |
|
To launch with Ollama, run:
|
```bash |
|
ollama run hf.co/remyxai/SpaceQwen2.5-VL-3B-Instruct:latest |
|
``` |
|
|
|
### Transformers |
|
|
|
Install the Qwen vision utilities:

```bash
pip install "qwen-vl-utils[decord]==0.0.8"
```
|
|
|
To run inference on a sample image: |
|
```python |
|
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor |
|
from qwen_vl_utils import process_vision_info |
|
|
|
model = Qwen2_5_VLForConditionalGeneration.from_pretrained( |
|
"remyxai/SpaceQwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto" |
|
) |
|
processor = AutoProcessor.from_pretrained("remyxai/SpaceQwen2.5-VL-3B-Instruct") |
|
|
|
messages = [ |
|
{ |
|
"role": "user", |
|
"content": [ |
|
{ |
|
"type": "image", |
|
"image": "https://raw.githubusercontent.com/remyxai/VQASynth/refs/heads/main/assets/warehouse_sample_2.jpeg", |
|
}, |
|
{"type": "text", "text": "What is the height of the man in the red hat in feet?"}, |
|
], |
|
} |
|
] |
|
|
|
# Preparation for inference |
|
text = processor.apply_chat_template( |
|
messages, tokenize=False, add_generation_prompt=True |
|
) |
|
image_inputs, video_inputs = process_vision_info(messages) |
|
inputs = processor( |
|
text=[text], |
|
images=image_inputs, |
|
videos=video_inputs, |
|
padding=True, |
|
return_tensors="pt", |
|
) |
|
inputs = inputs.to(model.device)  # move inputs to the same device as the model
|
|
|
# Inference: Generation of the output |
|
generated_ids = model.generate(**inputs, max_new_tokens=128) |
|
generated_ids_trimmed = [ |
|
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids) |
|
] |
|
output_text = processor.batch_decode( |
|
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False |
|
) |
|
print(output_text) |
|
``` |
|
|
|
### GGUF |
|
|
|
Or run **SpaceQwen2.5-VL-3B-Instruct** using **llama.cpp**: |
|
```bash |
|
./llama-qwen2vl-cli -m /path/to/SpaceQwen2.5-VL-3B-Instruct/SpaceQwen2.5-VL-3B-Instruct-F16.gguf \ |
|
--mmproj /path/to/SpaceQwen2.5-VL-3B-Instruct/spaceqwen2.5-vl-3b-instruct-vision.gguf \ |
|
-p "What's the height of the man in the red hat?" \ |
|
--image /path/to/warehouse_sample_2.jpeg --threads 24 -ngl 99 |
|
``` |
|
|
|
|
|
## Dataset & Training |
|
|
|
**SpaceQwen2.5-VL-3B-Instruct** uses LoRA to fine-tune [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) on the |
|
[OpenSpaces](https://huggingface.co/datasets/remyxai/OpenSpaces) dataset.
|
|
|
**Dataset Summary**: |
|
- ~10k synthetic spatial reasoning traces |
|
- Question types: spatial relations (metric distances with units, above, left-of, contains, closest to)
|
- Format: image (RGB) + question + answer |
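
A minimal sketch for inspecting the dataset with the `datasets` library (the split name and column layout are assumptions; check the dataset card for the exact schema):

```python
from datasets import load_dataset

# Load OpenSpaces from the Hugging Face Hub and inspect its structure.
ds = load_dataset("remyxai/OpenSpaces")
print(ds)                        # available splits
print(ds["train"].features)      # column names/types ("train" split assumed)
print(ds["train"][0])            # one image + question + answer record
```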
|
|
|
- **Dataset:** [OpenSpaces](https://huggingface.co/datasets/remyxai/OpenSpaces) |
|
- **Code:** [VQASynth](https://github.com/remyxai/VQASynth/tree/main) |
|
- **Reference:** [SpatialVLM](https://spatial-vlm.github.io/) |
|
|
|
Scripts for LoRA SFT are available in the [trl](https://github.com/huggingface/trl/blob/main/examples/scripts/sft_vlm.py) examples.
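
The exact LoRA hyperparameters are not published here; the sketch below only illustrates how a LoRA adapter for the base model might be configured with `peft` (rank, alpha, dropout, and target modules are assumptions, not the settings used to train SpaceQwen):

```python
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

# Illustrative LoRA setup on the base model; hyperparameters are assumptions.
base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # confirm only adapter weights are trainable
```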
|
|
|
|
|
## Model Evaluation |
|
|
|
### SpatialScore
|
**SpaceQwen** performs strongly in the 3D positional-relation categories of the SpatialScore-Hard comparison shown in the table below:
|
|
|
 |
|
|
|
Read more about the comprehensive spatial reasoning benchmark: [SpatialScore](https://haoningwu3639.github.io/SpatialScore/). |
|
|
|
The following chart compares the performance of **SpaceQwen** and **SpaceThinker** across the source benchmarks that make up **SpatialScore**.
|
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/647777304ae93470ffc28913/M9DMOXkhef7-LBzGNxykm.png" alt="SpaceQwen_v_SpaceThinker" style="max-height: 250px;"> |
|
|
|
### OmniSpatial
|
|
|
**OmniSpatial** is another comprehensive spatial reasoning benchmark that assesses dynamic reasoning, complex spatial logic, spatial interaction, and perspective-taking capabilities. |
|
 |
|
|
|
Learn more about [OmniSpatial](https://qizekun.github.io/omnispatial/). |
|
|
|
## ⚠️ Limitations & Ethical Considerations |
|
|
|
- Performance may degrade in cluttered environments or under unusual camera perspectives.
|
- The model was fine-tuned on synthetic spatial reasoning data generated from an internet image dataset.
|
- Multimodal biases inherent to the base model (Qwen2.5-VL) may persist. |
|
- Not intended for use in safety-critical or legal decision-making. |
|
|
|
> Users are encouraged to evaluate outputs critically and consider fine-tuning for domain-specific safety and performance. |
|
|
|
## Citation |
|
```bibtex
|
@article{chen2024spatialvlm, |
|
title = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities}, |
|
author = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei}, |
|
journal = {arXiv preprint arXiv:2401.12168}, |
|
year = {2024}, |
|
url = {https://arxiv.org/abs/2401.12168}, |
|
} |
|
|
|
@misc{qwen2.5-VL, |
|
title = {Qwen2.5-VL}, |
|
url = {https://qwenlm.github.io/blog/qwen2.5-vl/}, |
|
author = {Qwen Team}, |
|
month = {January}, |
|
year = {2025} |
|
} |
|
|
|
@article{wu2025spatialscore, |
|
author = {Wu, Haoning and Huang, Xiao and Chen, Yaohui and Zhang, Ya and Wang, Yanfeng and Xie, Weidi}, |
|
title = {SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding}, |
|
journal = {arXiv preprint arXiv:2505.17012}, |
|
year = {2025}, |
|
} |
|
|
|
@article{omnispatial25, |
|
title = {OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models}, |
|
author = {Jia, Mengdi and Qi, Zekun and Zhang, Shaochen and Zhang, Wenyao and Yu, Xinqiang and He, Jiawei and Wang, He and Yi, Li},
|
journal = {arXiv preprint arXiv:2506.03135}, |
|
year = {2025} |
|
} |
|
``` |