|
--- |
|
base_model: |
|
- Qwen/Qwen2.5-VL-3B-Instruct |
|
datasets: |
|
- remyxai/OpenSpaces |
|
language: |
|
- en |
|
library_name: transformers |
|
license: apache-2.0 |
|
pipeline_tag: image-text-to-text |
|
tags: |
|
- remyx |
|
- vqasynth |
|
- spatial-reasoning |
|
- multimodal |
|
- vlm |
|
- vision-language |
|
- robotics |
|
- distance-estimation |
|
- embodied-ai |
|
- quantitative-spatial-reasoning |
|
new_version: remyxai/SpaceThinker-Qwen2.5VL-3B |
|
model-index: |
|
- name: SpaceQwen2.5-VL-3B-Instruct |
|
results: |
|
- task: |
|
type: visual-question-answering |
|
name: Spatial Reasoning |
|
dataset: |
|
name: 3DSRBench |
|
type: benchmark |
|
metrics: |
|
- type: success_rate |
|
value: 0.515 |
|
name: Overall Success Rate |
|
- type: success_rate |
|
value: 0.5 |
|
name: Overall Success Rate |
|
- type: success_rate |
|
value: 0.3045 |
|
name: Overall Success Rate |
|
- type: success_rate |
|
value: 0.5767 |
|
name: Overall Success Rate |
|
- type: success_rate |
|
value: 0.3663 |
|
name: Overall Success Rate |
|
- type: success_rate |
|
value: 0.33 |
|
name: Overall Success Rate |
|
- type: success_rate |
|
value: 0.4392 |
|
name: Overall Success Rate |
|
- type: success_rate |
|
value: 0.6554 |
|
name: Overall Success Rate |
|
- type: success_rate |
|
value: 0.2615 |
|
name: Overall Success Rate |
|
- type: success_rate |
|
value: 0.2322 |
|
name: Overall Success Rate |
|
- type: success_rate |
|
value: 0.7373 |
|
name: Overall Success Rate |
|
- type: success_rate |
|
value: 0.5179 |
|
name: Overall Success Rate |
|
- type: success_rate |
|
value: 0.4879 |
|
name: Overall Success Rate |
|
--- |
|
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/647777304ae93470ffc28913/v4edJliSy46xBA8g5ZXf8.png" width="500"/> |
|
|
|
# SpaceQwen2.5-VL-3B-Instruct |
|
|
|
The model is evaluated in the paper [OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models](https://huggingface.co/papers/2506.03135). More information can be found at the [project page](https://qizekun.github.io/omnispatial/).
|
|
|
|
|
- **Model Type:** Multimodal, Vision-Language Model |
|
- **Architecture:** `Qwen2.5-VL-3B-Instruct`
|
- **Model Size:** 3.75B parameters (FP16) |
|
- **Finetuned from:** Qwen/Qwen2.5-VL-3B-Instruct |
|
- **Finetune Strategy:** LoRA (Low-Rank Adaptation) |
|
- **License:** Apache-2.0 |
|
|
|
### Model Overview |
|
|
|
This model uses data synthesis techniques and publicly available models to reproduce the approach described in SpatialVLM, enhancing the spatial reasoning of multimodal models.

With a pipeline of expert models, we infer spatial relationships between objects in a scene and generate a VQA dataset for spatial reasoning.
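
To illustrate the kind of question/answer pair such a pipeline produces, here is a minimal sketch (not the VQASynth API; the object names and 3D centroids below are hypothetical):

```python
import numpy as np

# Minimal sketch: once expert models (e.g. depth estimation + segmentation)
# recover 3D centroids for two objects, a templated QA pair about their
# metric distance can be generated. Names and coordinates are illustrative.
def distance_qa(name_a, centroid_a, name_b, centroid_b):
    dist_m = float(np.linalg.norm(np.asarray(centroid_a) - np.asarray(centroid_b)))
    question = f"How far apart are the {name_a} and the {name_b}?"
    answer = f"The {name_a} and the {name_b} are about {dist_m:.2f} meters apart."
    return {"question": question, "answer": answer}

print(distance_qa("pallet", [0.4, 0.1, 2.3], "forklift", [1.9, 0.0, 3.1]))
```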
|
|
|
|
|
## Running SpaceQwen2.5-VL-3B-Instruct |
|
|
|
### Ollama |
|
To launch with Ollama, run:
|
```bash |
|
ollama run hf.co/remyxai/SpaceQwen2.5-VL-3B-Instruct:latest |
|
``` |
|
|
|
### Transformers |
|
|
|
Install the Qwen vision utilities:

```bash
pip install "qwen-vl-utils[decord]==0.0.8"
```
|
|
|
To run inference on a sample image: |
|
```python |
|
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor |
|
from qwen_vl_utils import process_vision_info |
|
|
|
model = Qwen2_5_VLForConditionalGeneration.from_pretrained( |
|
"remyxai/SpaceQwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto" |
|
) |
|
processor = AutoProcessor.from_pretrained("remyxai/SpaceQwen2.5-VL-3B-Instruct") |
|
|
|
messages = [ |
|
{ |
|
"role": "user", |
|
"content": [ |
|
{ |
|
"type": "image", |
|
"image": "https://raw.githubusercontent.com/remyxai/VQASynth/refs/heads/main/assets/warehouse_sample_2.jpeg", |
|
}, |
|
{"type": "text", "text": "What is the height of the man in the red hat in feet?"}, |
|
], |
|
} |
|
] |
|
|
|
# Preparation for inference |
|
text = processor.apply_chat_template( |
|
messages, tokenize=False, add_generation_prompt=True |
|
) |
|
image_inputs, video_inputs = process_vision_info(messages) |
|
inputs = processor( |
|
text=[text], |
|
images=image_inputs, |
|
videos=video_inputs, |
|
padding=True, |
|
return_tensors="pt", |
|
) |
|
inputs = inputs.to(model.device)  # move inputs to the same device as the model
|
|
|
# Inference: Generation of the output |
|
generated_ids = model.generate(**inputs, max_new_tokens=128) |
|
generated_ids_trimmed = [ |
|
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids) |
|
] |
|
output_text = processor.batch_decode( |
|
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False |
|
) |
|
print(output_text) |
|
``` |
|
|
|
### GGUF |
|
|
|
Or run **SpaceQwen2.5-VL-3B-Instruct** using **llama.cpp**: |
|
```bash |
|
./llama-qwen2vl-cli -m /path/to/SpaceQwen2.5-VL-3B-Instruct/SpaceQwen2.5-VL-3B-Instruct-F16.gguf \ |
|
--mmproj /path/to/SpaceQwen2.5-VL-3B-Instruct/spaceqwen2.5-vl-3b-instruct-vision.gguf \ |
|
-p "What's the height of the man in the red hat?" \ |
|
--image /path/to/warehouse_sample_2.jpeg --threads 24 -ngl 99 |
|
``` |
|
|
|
|
|
## Dataset & Training |
|
|
|
**SpaceQwen2.5-VL-3B-Instruct** uses LoRA to fine-tune [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) on the |
|
[OpenSpaces](https://huggingface.co/datasets/remyxai/OpenSpaces) dataset.
|
|
|
**Dataset Summary**: |
|
- ~10k synthetic spatial reasoning traces |
|
- Question types: spatial relations (metric distances with units, above, left-of, contains, closest to)
|
- Format: image (RGB) + question + answer |
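
A minimal sketch for inspecting the dataset with the `datasets` library (the split name and column layout are assumptions; check the dataset card for the exact schema):

```python
from datasets import load_dataset

# Load OpenSpaces from the Hugging Face Hub and inspect its structure.
ds = load_dataset("remyxai/OpenSpaces")
print(ds)                        # available splits
print(ds["train"].features)      # column names/types ("train" split assumed)
print(ds["train"][0])            # one image + question + answer record
```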
|
|
|
- **Dataset:** [OpenSpaces](https://huggingface.co/datasets/remyxai/OpenSpaces) |
|
- **Code:** [VQASynth](https://github.com/remyxai/VQASynth/tree/main) |
|
- **Reference:** [SpatialVLM](https://spatial-vlm.github.io/) |
|
|
|
Scripts for LoRA SFT are available in the [trl](https://github.com/huggingface/trl/blob/main/examples/scripts/sft_vlm.py) examples.
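
The exact LoRA hyperparameters are not published here; the sketch below only illustrates how a LoRA adapter for the base model might be configured with `peft` (rank, alpha, dropout, and target modules are assumptions, not the settings used to train SpaceQwen):

```python
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

# Illustrative LoRA setup on the base model; hyperparameters are assumptions.
base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # confirm only adapter weights are trainable
```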
|
|
|
|
|
## Model Evaluation |
|
|
|
### SpatialScore
|
**SpaceQwen** performs strongly in the 3D positional-relation categories of the SpatialScore-Hard comparison shown in the table below:
|
|
|
 |
|
|
|
Read more about the comprehensive spatial reasoning benchmark: [SpatialScore](https://haoningwu3639.github.io/SpatialScore/). |
|
|
|
The following chart compares the performance of **SpaceQwen** and **SpaceThinker** across the source benchmarks that make up **SpatialScore**.
|
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/647777304ae93470ffc28913/M9DMOXkhef7-LBzGNxykm.png" alt="SpaceQwen_v_SpaceThinker" style="max-height: 250px;"> |
|
|
|
### OmniSpatial
|
|
|
**OmniSpatial** is another comprehensive spatial reasoning benchmark that assesses dynamic reasoning, complex spatial logic, spatial interaction, and perspective-taking capabilities. |
|
 |
|
|
|
Learn more about [OmniSpatial](https://qizekun.github.io/omnispatial/). |
|
|
|
## ⚠️ Limitations & Ethical Considerations |
|
|
|
- Performance may degrade in cluttered environments or under unusual camera perspectives.
|
- The model was fine-tuned on synthetic spatial reasoning data generated from an internet image dataset.
|
- Multimodal biases inherent to the base model (Qwen2.5-VL) may persist. |
|
- Not intended for use in safety-critical or legal decision-making. |
|
|
|
> Users are encouraged to evaluate outputs critically and consider fine-tuning for domain-specific safety and performance. |
|
|
|
## Citation |
|
```bibtex
|
@article{chen2024spatialvlm, |
|
title = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities}, |
|
author = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei}, |
|
journal = {arXiv preprint arXiv:2401.12168}, |
|
year = {2024}, |
|
url = {https://arxiv.org/abs/2401.12168}, |
|
} |
|
|
|
@misc{qwen2.5-VL, |
|
title = {Qwen2.5-VL}, |
|
url = {https://qwenlm.github.io/blog/qwen2.5-vl/}, |
|
author = {Qwen Team}, |
|
month = {January}, |
|
year = {2025} |
|
} |
|
|
|
@article{wu2025spatialscore, |
|
author = {Wu, Haoning and Huang, Xiao and Chen, Yaohui and Zhang, Ya and Wang, Yanfeng and Xie, Weidi}, |
|
title = {SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding}, |
|
journal = {arXiv preprint arXiv:2505.17012}, |
|
year = {2025}, |
|
} |
|
|
|
@article{omnispatial25, |
|
title = {OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models}, |
|
author = {Jia, Mengdi and Qi, Zekun and Zhang, Shaochen and Zhang, Wenyao and Yu, Xinqiang and He, Jiawei and Wang, He and Yi, Li},
|
journal = {arXiv preprint arXiv:2506.03135}, |
|
year = {2025} |
|
} |
|
``` |