BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese
Abstract
BrowseComp-ZH evaluates large language models on real-time Chinese web browsing tasks, highlighting challenges in retrieval and reasoning beyond existing English benchmarks.
As large language models (LLMs) evolve into tool-using agents, the ability to browse the web in real time has become a critical yardstick for measuring their reasoning and retrieval competence. Existing benchmarks such as BrowseComp concentrate on English and overlook the linguistic, infrastructural, and censorship-related complexities of other major information ecosystems -- most notably Chinese. To address this gap, we introduce BrowseComp-ZH, a high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web. BrowseComp-ZH consists of 289 multi-hop questions spanning 11 diverse domains. Each question is reverse-engineered from a short, objective, and easily verifiable answer (e.g., a date, number, or proper noun). A two-stage quality control protocol is applied to ensure high question difficulty and answer uniqueness. We benchmark over 20 state-of-the-art language models and agentic search systems on BrowseComp-ZH. Despite their strong conversational and retrieval capabilities, most models struggle severely: many score below 10% accuracy, and only a handful exceed 20%. Even the best-performing system, OpenAI's DeepResearch, reaches just 42.9%. These results demonstrate the considerable difficulty of BrowseComp-ZH, where success demands not only effective retrieval strategies, but also sophisticated reasoning and information reconciliation -- capabilities that current models have yet to master. Our dataset, construction guidelines, and benchmark results have been publicly released at https://github.com/PALIN2018/BrowseComp-ZH.
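For readers who want to score their own systems against these numbers, below is a minimal evaluation sketch. It assumes a JSONL file with `question` and `answer` fields and a user-supplied `ask_model` function; the file path, field names, and exact-match scoring are illustrative assumptions, not the authors' official grading protocol (see the GitHub repository for the released format and instructions).

```python
import json
import re

def normalize(text: str) -> str:
    """Lowercase and drop whitespace/punctuation so that short factual
    answers (dates, numbers, proper nouns) compare robustly."""
    return re.sub(r"[\s\u3000。，、．,.]+", "", text.strip().lower())

def ask_model(question: str) -> str:
    """Placeholder: call your browsing-enabled LLM or agentic search
    system here and return its final short answer."""
    raise NotImplementedError

def evaluate(path: str) -> float:
    """Compute exact-match accuracy over BrowseComp-ZH-style records,
    assuming each JSONL line holds {"question": ..., "answer": ...}."""
    correct = total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            pred = ask_model(item["question"])
            correct += normalize(pred) == normalize(item["answer"])
            total += 1
    return correct / total if total else 0.0

# Example: print(f"accuracy = {evaluate('browsecomp_zh.jsonl'):.1%}")
```

Exact matching on normalized strings is a deliberately simple choice that works because every answer is short and objective; a model-based judge would be a reasonable alternative for paraphrased outputs.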
Community
Excited to share our recent work: BrowseComp-ZH, the first high-difficulty benchmark specifically designed to evaluate large language models (LLMs) on Chinese web browsing tasks.
BrowseComp-ZH serves as a critical testbed for assessing:
Reasoning-augmented LLMs
Agent-based search systems
Retrieval-augmented generation (RAG) in non-English contexts
We constructed 289 multi-constraint questions across 11 domains (e.g., Film, Art, History, Medicine), each reverse-engineered from a factual answer and validated through a rigorous two-stage quality control process.
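To make the reverse-engineered, multi-constraint design concrete, here is a schematic record of the kind such a pipeline produces. The field names and the question template are hypothetical, not taken from the released dataset:

```python
# Schematic BrowseComp-ZH-style record (hypothetical; not an actual
# dataset item). The question is written backwards from a short,
# verifiable answer by stacking constraints that each require a
# separate lookup, so no single page answers it directly.
sample_item = {
    "domain": "Film",
    "question": (
        "Which director satisfies all of: won award X in decade Y, "
        "adapted novel Z into a film, and chaired the jury of festival W?"
    ),
    "answer": "<one proper noun>",  # short, objective, easily verifiable
    "num_constraints": 3,           # all constraints must hold at once
}
```

Because the answer is fixed first, correctness can be verified mechanically, while answering the question still demands multi-hop search across pages.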
Despite strong performance on existing benchmarks, mainstream models struggled significantly on BrowseComp-ZH:
1️⃣ GPT-4o: 6.2% accuracy
2️⃣ Most models scored below 10%
3️⃣ Even the best-performing system, OpenAI DeepResearch, achieved only 42.9%
Why is this benchmark so challenging?
Chinese web content is highly fragmented across platforms
Tasks demand multi-hop reasoning and cross-page synthesis
This work is a collaboration between HKUST (Guangzhou), Peking University, Zhejiang University, Alibaba, ByteDance, NIO, and others. We hope it contributes to advancing multilingual, tool-using LLM agents and inspires further research in Chinese web intelligence.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents (2025)
- Evaluating Multi-Hop Reasoning in Large Language Models: A Chemistry-Centric Case Study (2025)
- MHTS: Multi-Hop Tree Structure Framework for Generating Difficulty-Controllable QA Datasets for RAG Evaluation (2025)
- A Survey of Large Language Model Agents for Question Answering (2025)
- YourBench: Easy Custom Evaluation Sets for Everyone (2025)
- ExCoT: Optimizing Reasoning for Text-to-SQL with Execution Feedback (2025)
- Discovering Knowledge Deficiencies of Language Models on Massive Knowledge Base (2025)