|
--- |
|
license: apache-2.0 |
|
base_model: |
|
- mistralai/Mistral-7B-Instruct-v0.3 |
|
--- |
|
# Introduction |
|
|
|
## Task Description |
|
This model was fine-tuned to improve its ability to perform spatial |
|
reasoning tasks. The objective is to enable the model to interpret natural language queries related to |
|
spatial relationships, directions, and locations, and to output actionable responses.
|
The task addresses limitations in current LLMs, which often fail to perform precise spatial reasoning, |
|
such as determining relationships between points on a map, planning routes, or identifying locations |
|
based on bounding boxes. |
|
|
|
## Task Importance |
|
Spatial reasoning is important for a wide range of applications such as navigation and geospatial |
|
analysis. Many smaller LLMs, while strong in general reasoning, often lack the ability to interpret |
|
spatial relationships with precision or utilize real-world geographic data effectively. For example, they |
|
struggle to answer queries like “What’s between Point A and Point B?” or “Find me the fastest route |
|
avoiding traffic at 8 AM tomorrow.” Even when given access to geospatial information, smaller models often struggle to correctly interpret user questions.
|
|
|
## Related Work/Gap Analysis |
|
|
|
While there is ongoing research in integrating LLMs with geospatial systems, most existing solutions |
|
rely on symbolic AI or rule-based systems rather than leveraging the generalization capabilities of |
|
LLMs. Additionally, the paper “Advancing Spatial Reasoning in Large Language Models: An In-Depth |
|
Evaluation and Enhancement Using the StepGame Benchmark” concluded that larger models like
|
GPT-4 perform well in mapping natural language descriptions to spatial relations but struggle with |
|
multi-hop reasoning. That paper used StepGame as its benchmark for spatial reasoning.
|
Fine-tuning a model addresses the gap identified in that paper, since the only solution explored in its research was prompt engineering with chain-of-thought.
|
|
|
Research by organizations like OpenAI and Google has focused on improving contextual reasoning |
|
through fine-tuning, but there is limited work targeting spatial reasoning. |
|
|
|
## Main Results |
|
|
|
The fine-tuned model slightly improved on general knowledge tasks such as MMLU Geography and bAbI Task 17 compared to the original Mistral-7B base model. However, its performance on the SpatialEval spatial reasoning benchmark declined sharply, suggesting an incompatibility between the prompt style used for StepGame training and SpatialEval's multiple-choice formatting.
|
|
|
# Training Data |
|
|
|
For this fine-tuning task, the StepGame dataset was used. It is large and provides multi-step spatial reasoning challenges. The train-test split is predefined, with 50,000 rows in the train split and 10,000 in the test
|
split. It focuses on multi-step problem-solving with |
|
spatial relationships, such as directional logic, relative positioning, and route-based reasoning. It |
|
presents text-based tasks that require stepwise deductions, ensuring the model develops strong |
|
reasoning abilities beyond simple fact recall. The dataset follows a story, question, and answer template to assess spatial reasoning, as depicted below.
|
|
|
 |
|
|
|
|
|
# Training Method |
|
|
|
For this spatial reasoning task, LoRA (Low-Rank Adaptation) was used as the training method.
|
LoRA allows for efficient fine-tuning of large language models by freezing the majority of the model weights and only updating small, |
|
low-rank adapter matrices within attention layers. It significantly reduces the computational cost and memory requirements of full |
|
fine-tuning, making it ideal for working with limited GPU resources. LoRA is especially effective for task-specific |
|
adaptation when the dataset is moderately sized and the instruction formatting is consistent, as is the case with the StepGame dataset.
|
In previous experiments with spatial reasoning fine-tuning, LoRA performed better than prompt tuning. While prompt tuning resulted in close to 0% accuracy on both the StepGame and |
|
MMLU evaluations, LoRA preserved partial task performance (18% accuracy) and retained some general knowledge ability (46% accuracy on |
|
MMLU geography vs. 52% before training). I used a learning rate of 1e-4, batch size of 4, and trained for 3 epochs. |
|
This setup preserved general reasoning ability while improving spatial accuracy. |
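The sketch below outlines this LoRA setup with the `peft` and `transformers` Trainer APIs. Only the learning rate, batch size, and epoch count come from the run described above; the rank, alpha, dropout, and target attention projections are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

base_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)

# Freeze the base weights and attach low-rank adapters to the attention layers.
lora_config = LoraConfig(
    r=16,                # assumed rank
    lora_alpha=32,       # assumed scaling factor
    lora_dropout=0.05,   # assumed dropout
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Hyperparameters from the run described above: lr 1e-4, batch size 4, 3 epochs.
args = TrainingArguments(
    output_dir="spatial_lora_mistral",
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    logging_steps=50,
)

# train_dataset should be the tokenized StepGame train split (see Training Data).
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()
```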
|
|
|
# Evaluation |
|
|
|
| Model | MMLU-Geography (%) | SpatialEval (%) | bAbI Task 17 (%) |
|-------------------------------------------------|-------|-------|-------|
| mistralai/Mistral-7B-Instruct-v0.3 (base model) | 75.63 | 36.07 | 51.00 |
| sareena/spatial_lora_mistral (fine-tuned)       | 76.17 | 0.00  | 53.00 |
| meta-llama/Llama-2-7b-hf                        | 42.42 | 18.25 | 48.00 |
| google/gemma-7b                                 | 80.30 | 7.01  | 58.00 |
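As a rough illustration of how the free-text answers behind these numbers can be scored, the sketch below computes exact-match accuracy over short generated responses. It is not the harness that produced the table; the normalization rules are assumptions.

```python
# Illustrative scoring sketch, not the evaluation harness used for the table above.
def exact_match_accuracy(predictions, references):
    """Exact-match accuracy after light normalization of short answers."""
    def norm(text: str) -> str:
        return text.strip().lower().removeprefix("a:").strip()
    correct = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return correct / max(len(references), 1)

# Example: model outputs vs. gold StepGame-style answers.
preds = ["A: left", "below", "upper-right"]
golds = ["left", "below", "upper-left"]
print(f"Accuracy: {exact_match_accuracy(preds, golds):.2%}")  # -> 66.67%
```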
|
|
|
## Benchmark Tasks |
|
|
|
### SpatialEval Benchmark
|
|
|
This benchmark provides more realistic question-answer pairs than the StepGame benchmark. It uses place names rather than abstract letter identifiers for entities and answers, but still requires the same multi-step geospatial reasoning capability. It complements StepGame by testing broader spatial logic in more realistic scenarios.
|
|
|
### bAbI Dataset (Task 17) |
|
|
|
This benchmark was introduced by Facebook AI Research. It includes 20 |
|
synthetic question-answering tasks designed to evaluate various reasoning abilities in |
|
models. Task 17 ("Positional Reasoning") specifically assesses spatial reasoning through textual descriptions. Pathfinding in combination with positional reasoning can also help assess the model's performance on tasks such as calculating routes, a common application for a model fine-tuned for geospatial reasoning.
|
|
|
|
|
### MMLU Benchmark (Geography and Environmental Science Subsets)
|
|
|
This benchmark is a comprehensive evaluation suite designed to assess a model's |
|
reasoning and knowledge across a wide range of subjects. When focusing on geography |
|
and environmental science subsets, the benchmark offers an opportunity to test both |
|
domain-specific knowledge and broader reasoning abilities relevant to geospatial tasks. |
|
The benchmark consists of multiple-choice questions covering topics from elementary to advanced levels. Here it is used to assess the model's general performance after fine-tuning and its ability to apply knowledge across the subjects most relevant to its fine-tuning task.
|
|
|
|
|
## Comparison Models |
|
LLaMA-2 and Gemma represent strong alternatives from Meta and Google, respectively, offering diverse architectural approaches with a similar number of parameters and
|
training data sources. Including these models allowed for a more meaningful evaluation of how my fine-tuned model performs |
|
not just against its own baseline, but also against state-of-the-art peers on spatial reasoning and general knowledge tasks. |
|
|
|
# Usage and Intended Uses |
|
This model is designed to assist with natural language spatial reasoning, particularly in tasks that involve multi-step relational |
|
inference between objects or locations described in text. It could be integrated into agentic spatial systems or text-based game bots.
|
|
|
|
|
|
|
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the fine-tuned model and tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained("sareena/spatial_lora_mistral")
tokenizer = AutoTokenizer.from_pretrained("sareena/spatial_lora_mistral")

# Move to GPU if one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Ask a spatial reasoning question in the prompt format described below.
inputs = tokenizer("Q: The couch is to the left of the table. The lamp is on the couch. Where is the lamp?", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
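If the hub repository stores only the LoRA adapter rather than merged weights (an assumption about how the checkpoint was published), it can instead be loaded through `peft`, which pulls in the base model automatically:

```python
# Alternative loading path for an adapter-only repository (assumption).
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained("sareena/spatial_lora_mistral")
```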
|
|
|
# Prompt Format |
|
The model is trained on instruction-style input with a spatial reasoning question: |
|
|
|
```text |
|
Q: The couch is to the left of the table. The lamp is on the couch. Where is the lamp in relation to the table? |
|
``` |
|
|
|
# Expected Output Format |
|
The output is a short, natural language spatial answer: |
|
|
|
```text |
|
A: left |
|
``` |
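For downstream use, the `Q:`/`A:` convention can be handled with a small helper like the illustrative sketch below; the parsing rules are assumptions rather than part of the model.

```python
# Illustrative helpers for the prompt and answer formats shown above.
def build_prompt(story: str, question: str) -> str:
    return f"Q: {story} {question}"

def extract_answer(generated: str) -> str:
    # Keep only the text after the final "A:" marker, first line only.
    tail = generated.rsplit("A:", 1)[-1].strip()
    return tail.splitlines()[0] if tail else ""

prompt = build_prompt(
    "The couch is to the left of the table. The lamp is on the couch.",
    "Where is the lamp in relation to the table?",
)
print(prompt)
print(extract_answer("Q: ... A: left"))  # -> "left"
```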
|
|
|
# Limitations |
|
|
|
The model is still limited in deep reasoning capabilities and sometimes fails multi-hop spatial tasks. |
|
LoRA helps balance this trade-off, but fine-tuning on more diverse spatial tasks could yield stronger generalization. |
|
Performance on the SpatialEval benchmark dropped drastically due to the incompatibility between the prompt style used for training and the multiple-choice formatting in SpatialEval. Future work to remediate this would be to test more prompt formats during training or to use instruction-tuned datasets more similar to the downstream evaluations.
|
|
|
# Citations
|
|
|
1. **Hendrycks, Dan**, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. |
|
*"Measuring Massive Multitask Language Understanding."* arXiv preprint arXiv:2009.03300 (2020). |
|
|
|
2. **Li, Fangjun**, et al.
   *"Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation and Enhancement Using the StepGame Benchmark."* arXiv preprint arXiv:2401.03991 (2024). [https://arxiv.org/abs/2401.03991](https://arxiv.org/abs/2401.03991)
|
|
|
3. **Mirzaee, Roshanak**, Hossein Rajaby Faghihi, Qiang Ning, and Parisa Kordjamshidi. |
|
*"SpartQA: A Textual Question Answering Benchmark for Spatial Reasoning."* arXiv preprint arXiv:2104.05832 (2021). |
|
|
|
4. **Shi, Zhengxiang**, et al.
   *"StepGame: A New Benchmark for Robust Multi-Hop Spatial Reasoning in Texts."* arXiv preprint arXiv:2204.08292 (2022). [https://arxiv.org/abs/2204.08292](https://arxiv.org/abs/2204.08292)
|
|
|
5. **Wang, Mila**, Xiang Lorraine Li, and William Yang Wang. |
|
*"SpatialEval: A Benchmark for Spatial Reasoning Evaluation."* arXiv preprint arXiv:2104.08635 (2021). |
|
|
|
6. **Weston, Jason**, Antoine Bordes, Sumit Chopra, and Tomas Mikolov. |
|
*"Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks."* arXiv preprint arXiv:1502.05698 (2015). |