GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
Abstract
GUI-Actor is a VLM-based method that uses attention for coordinate-free GUI grounding, outperforming existing methods while offering better generalization and more efficient fine-tuning.
One of the principal challenges in building VLM-powered GUI agents is visual grounding, i.e., localizing the appropriate screen region for action execution based on both the visual content and the textual plans. Most existing work formulates this as a text-based coordinate generation task. However, these approaches suffer from several limitations: weak spatial-semantic alignment, inability to handle ambiguous supervision targets, and a mismatch between the dense nature of screen coordinates and the coarse, patch-level granularity of visual features extracted by models such as Vision Transformers. In this paper, we propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated <ACTOR> token with all relevant visual patch tokens, enabling the model to propose one or more action regions in a single forward pass. Complementing this, we further design a grounding verifier to evaluate the proposed candidates and select the most plausible action region for execution. Extensive experiments show that GUI-Actor outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks, with improved generalization to unseen screen resolutions and layouts. Notably, GUI-Actor-7B even surpasses UI-TARS-72B (38.1) on ScreenSpot-Pro, achieving scores of 40.7 with Qwen2-VL and 44.6 with Qwen2.5-VL as backbones. Furthermore, by incorporating the verifier, we find that fine-tuning only the newly introduced action head (~100M parameters for a 7B model) while keeping the VLM backbone frozen is sufficient to match the performance of previous state-of-the-art models, highlighting that GUI-Actor can endow the underlying VLM with effective grounding capabilities without compromising its general-purpose strengths.
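To make the mechanism concrete, below is a minimal PyTorch sketch of an attention-style action head of the kind the abstract describes: the hidden state of a dedicated <ACTOR> token is projected to a query, each visual patch token to a key, and their scaled dot products are normalized into a probability map over patches, from which action regions can be read off. All dimensions, projection sizes, and the single-query formulation are illustrative assumptions, not the authors' released implementation, and the grounding verifier is omitted.

```python
import torch
import torch.nn as nn


class ActionHead(nn.Module):
    """Illustrative attention-based action head (sketch, not the released code).

    Scores every visual patch token against the hidden state of a dedicated
    <ACTOR> token, producing a probability map over patches instead of
    generating text coordinates.
    """

    def __init__(self, hidden_dim: int, proj_dim: int = 1024):
        super().__init__()
        # Separate projections for the <ACTOR> query and the patch keys.
        self.query_proj = nn.Linear(hidden_dim, proj_dim)
        self.key_proj = nn.Linear(hidden_dim, proj_dim)
        self.scale = proj_dim ** -0.5

    def forward(self, actor_hidden: torch.Tensor, patch_hidden: torch.Tensor) -> torch.Tensor:
        # actor_hidden: (B, D) hidden state of the <ACTOR> token
        # patch_hidden: (B, N, D) hidden states of the N visual patch tokens
        q = self.query_proj(actor_hidden).unsqueeze(1)              # (B, 1, P)
        k = self.key_proj(patch_hidden)                             # (B, N, P)
        scores = (q @ k.transpose(1, 2)).squeeze(1) * self.scale    # (B, N)
        return scores.softmax(dim=-1)                               # attention over patches


if __name__ == "__main__":
    # Toy usage with made-up sizes: a 28x28 patch grid and 3584-d hidden states.
    B, N, D = 1, 28 * 28, 3584
    head = ActionHead(hidden_dim=D)
    attn = head(torch.randn(B, D), torch.randn(B, N, D))
    best_patch = attn.argmax(dim=-1)  # index of the most likely action patch
    print(attn.shape, best_patch)
```

In this sketch, candidate action regions would correspond to the highest-weighted patches (mapped back to pixel centers via the patch grid), and a separate verifier could then rank those candidates before execution.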
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning (2025)
- Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis (2025)
- UIShift: Enhancing VLM-based GUI Agents through Self-supervised Reinforcement Learning (2025)
- ReGUIDE: Data Efficient GUI Grounding via Spatial Reasoning and Search (2025)
- Visual Test-time Scaling for GUI Agent Grounding (2025)
- GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents (2025)
- UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis (2025)