Holo1-7B
Model Description
Holo1 is an Action Vision-Language Model (VLM) developed by HCompany for use in the Surfer-H web agent system. It is designed to interact with web interfaces like a human user.
As part of a broader agentic architecture, Holo1 acts as a policy, localizer, or validator, helping the agent understand and act in digital environments.
Trained on a mix of open-access, synthetic, and self-generated data, Holo1 enables state-of-the-art (SOTA) performance on the WebVoyager benchmark, offering the best accuracy/cost tradeoff among current models. It also excels in UI localization tasks such as Screenspot, Screenspot-V2, Screenspot-Pro, GroundUI-Web, and our own newly introduced benchmark WebClick.
Holo1 is optimized for both accuracy and cost-efficiency, making it a strong open-source alternative to existing VLMs.
For more details, check our paper and our blog post.
- Developed by: HCompany
- Model type: Action Vision-Language Model
- Finetuned from model: Qwen/Qwen2.5-VL-7B-Instruct
- Paper: https://arxiv.org/abs/2506.02865
- Blog Post: https://www.hcompany.ai/surfer-h
- License: Apache 2.0
Results
Surfer-H: Pareto-Optimal Performance on WebVoyager
Surfer-H is designed to be flexible and modular. It is composed of three independent components:
- A Policy model that plans, decides, and drives the agent's behavior
- A Localizer model that sees and understands visual UIs to drive precise interactions
- A Validator model that checks whether the answer is valid
The agent thinks before acting, takes notes, and can retry if its answer is rejected. It can operate with different models for each module, allowing for tradeoffs between accuracy, speed, and cost.
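As a rough sketch of this modular design (the class and signatures below are hypothetical and not the released Surfer-H code), the agent can be viewed as three pluggable callables wired into a step-and-retry loop:

# Hypothetical sketch of the Surfer-H module split; names and signatures are
# illustrative only and do not correspond to the released implementation.
from dataclasses import dataclass
from typing import Callable

@dataclass
class SurferH:
    policy: Callable[[str, str], dict]                # (task, page state) -> next action, e.g. {"action": "answer", "content": ...}
    localizer: Callable[[str, str], tuple[int, int]]  # (screenshot, element description) -> (x, y) click target
    validator: Callable[[str, str], bool]             # (task, candidate answer) -> accept or reject

    def solve(self, task: str, get_state: Callable[[], str], max_steps: int = 30) -> str | None:
        for _ in range(max_steps):
            state = get_state()
            action = self.policy(task, state)                    # plan the next step
            if action["action"] == "click_element":
                x, y = self.localizer(state, action["element"])  # ground the element description to pixels
                # ... dispatch click(x, y) to the browser here ...
            elif action["action"] == "answer" and self.validator(task, action["content"]):
                return action["content"]                         # accepted answer; a rejected one triggers another attempt
        return None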
We evaluated Surfer-H on the WebVoyager benchmark: 643 real-world web tasks ranging from retrieving prices to finding news or scheduling events.

We’ve tested multiple configurations, from GPT-4.1-powered agents to 100% open Holo1 setups. Among them, the fully Holo1-based agents offered the strongest tradeoff between accuracy and cost:
- Surfer-H + Holo1-7B: 92.2% accuracy at $0.13 per task
- Surfer-H + GPT-4.1: 92.0% at $0.54 per task
- Surfer-H + Holo1-3B: 89.7% at $0.11 per task
- Surfer-H + GPT-4.1-mini: 88.8% at $0.26 per task
This places Holo1-powered agents on the Pareto frontier, delivering the best accuracy per dollar. Unlike other agents that rely on custom APIs or brittle wrappers, Surfer-H operates purely through the browser — just like a real user. Combined with Holo1, it becomes a powerful, general-purpose, cost-efficient web automation system.
Holo1: State-of-the-Art UI Localization
A key skill for the real-world utility of our VLMs within agents is localization: the ability to identify the precise coordinates of a user interface (UI) element that must be interacted with to complete a task or follow an instruction. To assess this capability, we evaluated our Holo1 models on several established localization benchmarks, including Screenspot, Screenspot-V2, Screenspot-Pro, GroundUI-Web, and our own newly introduced benchmark, WebClick.


Get Started with the Model
We provide two Spaces to experiment with Localization and Navigation:
- https://huggingface.co/spaces/Hcompany/Holo1-Navigation
- https://huggingface.co/spaces/Hcompany/Holo1-Localization
We provide starter code for the localization task, i.e. image + instruction -> click coordinates.
We also provide code to reproduce the Screenspot evaluations: screenspot_eval.py
Prepare model, processor
Holo1 models are based on the Qwen2.5-VL architecture, which is supported natively in transformers. Below is a simple usage example; you can load the model and the processor as follows:
import json
from typing import Any

from transformers import AutoModelForImageTextToText, AutoProcessor

# default: Load the model on the available device(s)
# We recommend enabling flash_attention_2 for better acceleration and memory saving.
model = AutoModelForImageTextToText.from_pretrained(
    "Hcompany/Holo1-7B",
    torch_dtype="auto",
    # torch_dtype=torch.bfloat16,
    # attn_implementation="flash_attention_2",
    device_map="auto",
)
# default processor
processor = AutoProcessor.from_pretrained("Hcompany/Holo1-7B")
# The default range for the number of visual tokens per image in the model is 4-1280.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# processor = AutoProcessor.from_pretrained(model_dir, min_pixels=min_pixels, max_pixels=max_pixels)
# Helper function to run inference
def run_inference(messages: list[dict[str, Any]]) -> list[str]:
    # Preparation for inference
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(
        text=[text],
        images=image,  # `image` is the resized screenshot prepared in the next section
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to(model.device)
    generated_ids = model.generate(**inputs, max_new_tokens=128)
    generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
    return processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)
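The navigation and localization prompt builders used below return message lists for you. If you want to query the model directly, here is a minimal sketch of the chat message structure accepted by the processor's apply_chat_template; the instruction text is only an example, and the image itself is passed separately to the processor inside run_inference:

# Minimal illustrative message for run_inference; the actual prompts used in our
# evaluations are built by the navigation / localization helpers shown below.
example_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Select July 14th as the check-out date"},
        ],
    }
]
# output = run_inference(example_messages)[0]  # requires `image` from the next section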
Prepare image and instruction
WARNING: Holo1 uses absolute coordinates (number of pixels), and the Hugging Face processor resizes input images. For the predicted coordinates to match the image, you need to resize it with smart_resize first.
import requests
from PIL import Image
from transformers.models.qwen2_vl.image_processing_qwen2_vl import smart_resize

# Prepare image and instruction
image_url = "https://huggingface.co/Hcompany/Holo1-7B/resolve/main/calendar_example.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

# Resize the image so that predicted absolute coordinates match the size of the image.
image_processor = processor.image_processor
resized_height, resized_width = smart_resize(
    image.height,
    image.width,
    factor=image_processor.patch_size * image_processor.merge_size,
    min_pixels=image_processor.min_pixels,
    max_pixels=image_processor.max_pixels,
)
image = image.resize(size=(resized_width, resized_height), resample=None)  # type: ignore
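The model predicts coordinates in the resized image space. If you need to act on the original, full-resolution screenshot (for example, to drive a real browser), map the prediction back with a simple rescaling. A minimal sketch, assuming you stored the original dimensions before calling image.resize:

# Map a click predicted on the smart_resized image back to the original resolution.
# original_width / original_height must be saved before the resize above.
def to_original_coordinates(
    x: int, y: int,
    original_width: int, original_height: int,
    resized_width: int, resized_height: int,
) -> tuple[int, int]:
    return (
        round(x * original_width / resized_width),
        round(y * original_height / resized_height),
    )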
Navigation with Structured Output
import json

from . import navigation  # navigation helper module provided with the model repository

task = "Book a hotel in Paris on August 3rd for 3 nights"
prompt = navigation.get_navigation_prompt(task, image, step=1)
navigation_str = run_inference(prompt)[0]
navigation_step = navigation.NavigationStep(**json.loads(navigation_str))
print(navigation_step)
# Expected: NavigationStep(note='', thought='I need to select the check-out date as August 3rd and then proceed to search for hotels.', action=ClickElementAction(action='click_element', element='August 3rd on the calendar', x=777, y=282))
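For reference, the expected output above suggests a structured-output schema along the following lines. This is only an approximate reconstruction using pydantic; the navigation module shipped with the repository is the authoritative definition:

# Approximate reconstruction of the schema implied by the NavigationStep printed
# above; field names are inferred from that output, not copied from the repository.
from typing import Literal
from pydantic import BaseModel

class ClickElementAction(BaseModel):
    action: Literal["click_element"]
    element: str  # natural-language description of the target UI element
    x: int        # absolute pixel coordinates in the (resized) screenshot
    y: int

class NavigationStep(BaseModel):
    note: str     # scratchpad notes the agent carries across steps
    thought: str  # the model's reasoning before acting
    action: ClickElementAction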
Localization with click(x, y)
from . import localization
instruction = "Select July 14th as the check-out date"
prompt = localization.get_localization_prompt(image, instruction)
coordinates = run_inference(prompt)[0]
print(coordinates)
# Expected Click(352, 348)
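With this prompt the click is returned as plain text rather than JSON, so it has to be parsed. A minimal sketch of a parser for the Click(x, y) format shown above:

import re

def parse_click(output: str) -> tuple[int, int]:
    # Extract (x, y) from a "Click(x, y)" string produced by the localization prompt.
    match = re.search(r"Click\((\d+),\s*(\d+)\)", output)
    if match is None:
        raise ValueError(f"Could not parse click coordinates from: {output!r}")
    return int(match.group(1)), int(match.group(2))

# parse_click("Click(352, 348)") -> (352, 348)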
Localization with Structured Output
We trained Holo1 as an Action VLM with extensive use of JSON and tool calls. It can therefore be queried reliably with structured output:
import json
from . import localization
instruction = "Select July 14th as the check-out date"
prompt = localization.get_localization_prompt_structured_output(image, instruction)
coordinates_structured_str = run_inference(prompt)[0]
coordinates_structured = localization.ClickAction(**json.loads(coordinates_structured_str))
print(coordinates_structured)
# Expected ClickAction(action='click', x=352, y=340)
Citation
BibTeX:
@misc{andreux2025surferhmeetsholo1costefficient,
title={Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights},
author={Mathieu Andreux and Breno Baldas Skuk and Hamza Benchekroun and Emilien Biré and Antoine Bonnet and Riaz Bordie and Matthias Brunel and Pierre-Louis Cedoz and Antoine Chassang and Mickaël Chen and Alexandra D. Constantinou and Antoine d'Andigné and Hubert de La Jonquière and Aurélien Delfosse and Ludovic Denoyer and Alexis Deprez and Augustin Derupti and Michael Eickenberg and Mathïs Federico and Charles Kantor and Xavier Koegler and Yann Labbé and Matthew C. H. Lee and Erwan Le Jumeau de Kergaradec and Amir Mahla and Avshalom Manevich and Adrien Maret and Charles Masson and Rafaël Maurin and Arturo Mena and Philippe Modard and Axel Moyal and Axel Nguyen Kerbel and Julien Revelle and Mats L. Richter and María Santos and Laurent Sifre and Maxime Theillard and Marc Thibault and Louis Thiry and Léo Tronchon and Nicolas Usunier and Tony Wu},
year={2025},
eprint={2506.02865},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2506.02865},
}