---
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
datasets:
- remyxai/OpenSpaces
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- remyx
- vqasynth
- spatial-reasoning
- multimodal
- vlm
- vision-language
- robotics
- distance-estimation
- embodied-ai
- quantitative-spatial-reasoning
new_version: remyxai/SpaceThinker-Qwen2.5VL-3B
model-index:
- name: SpaceQwen2.5-VL-3B-Instruct
results:
- task:
type: visual-question-answering
name: Spatial Reasoning
dataset:
name: 3DSRBench
type: benchmark
metrics:
- type: success_rate
value: 0.515
name: Overall Success Rate
- type: success_rate
value: 0.5
name: Overall Success Rate
- type: success_rate
value: 0.3045
name: Overall Success Rate
- type: success_rate
value: 0.5767
name: Overall Success Rate
- type: success_rate
value: 0.3663
name: Overall Success Rate
- type: success_rate
value: 0.33
name: Overall Success Rate
- type: success_rate
value: 0.4392
name: Overall Success Rate
- type: success_rate
value: 0.6554
name: Overall Success Rate
- type: success_rate
value: 0.2615
name: Overall Success Rate
- type: success_rate
value: 0.2322
name: Overall Success Rate
- type: success_rate
value: 0.7373
name: Overall Success Rate
- type: success_rate
value: 0.5179
name: Overall Success Rate
- type: success_rate
value: 0.4879
name: Overall Success Rate
---
<img src="https://cdn-uploads.huggingface.co/production/uploads/647777304ae93470ffc28913/v4edJliSy46xBA8g5ZXf8.png" width="500"/>
# SpaceQwen2.5-VL-3B-Instruct
The model is evaluated in the paper [OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models](https://huggingface.co/papers/2506.03135). More information can be found on the [project page](https://qizekun.github.io/omnispatial/).
- **Model Type:** Multimodal, Vision-Language Model
- **Architecture:** `Qwen2.5-VL-3B-Instruct`
- **Model Size:** 3.75B parameters (FP16)
- **Finetuned from:** Qwen/Qwen2.5-VL-3B-Instruct
- **Finetune Strategy:** LoRA (Low-Rank Adaptation)
- **License:** Apache-2.0
### Model Overview
This model reproduces the approach described in SpatialVLM, using data synthesis techniques and publicly available models to enhance the spatial reasoning of multimodal models.
With a pipeline of expert models, spatial relationships between objects in a scene are inferred to create a VQA dataset for spatial reasoning.
## Running SpaceQwen2.5-VL-3B-Instruct
### Ollama
To launch with ollama, run:
```bash
ollama run hf.co/remyxai/SpaceQwen2.5-VL-3B-Instruct:latest
```
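For programmatic access, the sketch below uses the `ollama` Python client against a locally running Ollama server. Only the model tag comes from the command above; the client call and the local image path are assumptions for illustration.
```python
import ollama  # pip install ollama; assumes a local Ollama server is running

# The model tag matches the `ollama run` command above; the image path is a
# local file supplied for illustration.
response = ollama.chat(
    model="hf.co/remyxai/SpaceQwen2.5-VL-3B-Instruct:latest",
    messages=[
        {
            "role": "user",
            "content": "What is the height of the man in the red hat in feet?",
            "images": ["warehouse_sample_2.jpeg"],
        }
    ],
)
print(response["message"]["content"])
```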
### Transformers
Install the Qwen vision utilities (a recent `transformers` release with Qwen2.5-VL support is also required):
```bash
pip install "qwen-vl-utils[decord]==0.0.8"
```
To run inference on a sample image:
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the fine-tuned model and its processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "remyxai/SpaceQwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("remyxai/SpaceQwen2.5-VL-3B-Instruct")

# Build a chat message containing an image and a spatial question
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://raw.githubusercontent.com/remyxai/VQASynth/refs/heads/main/assets/warehouse_sample_2.jpeg",
            },
            {"type": "text", "text": "What is the height of the man in the red hat in feet?"},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generate and decode only the newly produced tokens
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
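For GPUs with limited memory, the weights can optionally be loaded in 4-bit via `bitsandbytes`. This is a minimal sketch assuming `bitsandbytes` is installed; the quantization settings below are illustrative and not part of the released model.
```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig

# 4-bit NF4 quantization settings (illustrative; requires `pip install bitsandbytes`)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "remyxai/SpaceQwen2.5-VL-3B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("remyxai/SpaceQwen2.5-VL-3B-Instruct")
# The rest of the inference flow is identical to the example above.
```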
### GGUF
Or run **SpaceQwen2.5-VL-3B-Instruct** using **llama.cpp**:
```bash
./llama-qwen2vl-cli -m /path/to/SpaceQwen2.5-VL-3B-Instruct/SpaceQwen2.5-VL-3B-Instruct-F16.gguf \
--mmproj /path/to/SpaceQwen2.5-VL-3B-Instruct/spaceqwen2.5-vl-3b-instruct-vision.gguf \
-p "What's the height of the man in the red hat?" \
--image /path/to/warehouse_sample_2.jpeg --threads 24 -ngl 99
```
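To fetch the GGUF weights referenced above without cloning the full repository, `huggingface_hub` can download individual files. The filenames below mirror the paths in the command above; confirm them against the repository's file listing.
```python
from huggingface_hub import hf_hub_download

# Filenames follow the llama.cpp command above; verify in the repo's "Files" tab.
gguf_path = hf_hub_download(
    repo_id="remyxai/SpaceQwen2.5-VL-3B-Instruct",
    filename="SpaceQwen2.5-VL-3B-Instruct-F16.gguf",
)
mmproj_path = hf_hub_download(
    repo_id="remyxai/SpaceQwen2.5-VL-3B-Instruct",
    filename="spaceqwen2.5-vl-3b-instruct-vision.gguf",
)
print(gguf_path, mmproj_path)
```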
## Dataset & Training
**SpaceQwen2.5-VL-3B-Instruct** uses LoRA to fine-tune [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) on the
[OpenSpaces](https://huggingface.co/datasets/remyxai/OpenSpaces) dataset.
**Dataset Summary**:
- ~10k synthetic spatial reasoning traces
- Question types: spatial relations (distances with units, above, left of, contains, closest to)
- Format: image (RGB) + question + answer
- **Dataset:** [OpenSpaces](https://huggingface.co/datasets/remyxai/OpenSpaces)
- **Code:** [VQASynth](https://github.com/remyxai/VQASynth/tree/main)
- **Reference:** [SpatialVLM](https://spatial-vlm.github.io/)
Scripts for LoRA SFT are available in [trl](https://github.com/huggingface/trl/blob/main/examples/scripts/sft_vlm.py).
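As a minimal sketch, a LoRA adapter could be attached to the base model with `peft` as shown below; the rank, alpha, and target modules are illustrative assumptions, not the published training configuration (see the trl script linked above for a complete SFT setup).
```python
from transformers import Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Start from the base model, as in the original fine-tune
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
)

# Illustrative LoRA settings -- not the published training configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```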
## Model Evaluation
### SpatialScore
**SpaceQwen** shines in the 3D positional relations categories of the SpatialScore-Hard comparison featured in the table below:
![image/png](https://cdn-uploads.huggingface.co/production/uploads/647777304ae93470ffc28913/sNei_Js6IjEKKHK717PeZ.png)
Read more about the comprehensive spatial reasoning benchmark: [SpatialScore](https://haoningwu3639.github.io/SpatialScore/).
The following chart compares the performance of **SpaceQwen** and **SpaceThinker** across the sources that make up the **SpatialScore** benchmark.
<img src="https://cdn-uploads.huggingface.co/production/uploads/647777304ae93470ffc28913/M9DMOXkhef7-LBzGNxykm.png" alt="SpaceQwen_v_SpaceThinker" style="max-height: 250px;">
### OmniSpatial
**OmniSpatial** is another comprehensive spatial reasoning benchmark that assesses dynamic reasoning, complex spatial logic, spatial interaction, and perspective-taking capabilities.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/647777304ae93470ffc28913/EDHmFRztyTI-lhdgEYZzP.png)
Learn more about [OmniSpatial](https://qizekun.github.io/omnispatial/).
## ⚠️ Limitations & Ethical Considerations
- Performance may degrade in cluttered environments or under unusual camera perspectives.
- This model was fine-tuned using synthetic reasoning over an internet image dataset.
- Multimodal biases inherent to the base model (Qwen2.5-VL) may persist.
- Not intended for use in safety-critical or legal decision-making.
> Users are encouraged to evaluate outputs critically and consider fine-tuning for domain-specific safety and performance.
## Citation
```bibtex
@article{chen2024spatialvlm,
title = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
author = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
journal = {arXiv preprint arXiv:2401.12168},
year = {2024},
url = {https://arxiv.org/abs/2401.12168},
}
@misc{qwen2.5-VL,
title = {Qwen2.5-VL},
url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
author = {Qwen Team},
month = {January},
year = {2025}
}
@article{wu2025spatialscore,
author = {Wu, Haoning and Huang, Xiao and Chen, Yaohui and Zhang, Ya and Wang, Yanfeng and Xie, Weidi},
title = {SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding},
journal = {arXiv preprint arXiv:2505.17012},
year = {2025},
}
@article{omnispatial25,
title = {OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models},
author = {Mengdi Jia and Zekun Qi and Shaochen Zhang and Wenyao Zhang and Xinqiang Yu and Jiawei He and He Wang and Li Yi},
journal = {arXiv preprint arXiv:2506.03135},
year = {2025}
}
```