---
title: TRAIL Leaderboard
emoji: 🥇
colorFrom: green
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: true
license: mit
short_description: Trace Reasoning and Agentic Issue Localization Leaderboard
sdk_version: 5.19.0
tags:
- leaderboard
---

# Model Performance Leaderboard

This Hugging Face Space hosts a leaderboard for comparing model performance across the evaluation metrics of the TRAIL dataset.

## Features

- **Submit Your Answers**: Run your model on the TRAIL dataset and submit your results.
- **Leaderboard**: View how your submissions are ranked.

## Instructions

* Please refer to our GitHub repository at https://github.com/patronus-ai/trail-benchmark for step-by-step instructions on how to run your model on the TRAIL dataset.
* Please upload a zip file containing your model outputs (a minimal packaging example is included at the end of this README). The zip file should contain:
  - One or more directories with model outputs
  - JSON files with the model's predictions inside each directory
  - Directory names that indicate the split (GAIA_ or SWE_)
* Once the evaluation is complete, we'll upload the scores (this process will soon be automated).

## Benchmarking on TRAIL

[TRAIL (Trace Reasoning and Agentic Issue Localization)](https://arxiv.org/abs/2505.08638) is a benchmark dataset of 148 annotated AI agent execution traces containing 841 errors across reasoning, execution, and planning categories. Created from real-world software engineering and information retrieval tasks, it challenges even state-of-the-art LLMs, with the best model achieving only 11% accuracy, highlighting the difficulty of trace debugging for complex agent workflows.

## License

This project is open source and available under the MIT license.
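
## Example: Packaging a Submission

A minimal sketch of how the submission zip could be assembled. The directory names `GAIA_outputs/` and `SWE_outputs/` and the archive name below are illustrative placeholders; only the split prefixes (GAIA_ or SWE_) and the per-directory JSON prediction files are prescribed above, and the JSON schema itself should follow the instructions in the trail-benchmark repository.

```python
import zipfile
from pathlib import Path

# Illustrative directory names: one directory per split, each holding the
# model's JSON predictions. Only the GAIA_ / SWE_ prefix is required.
output_dirs = [Path("GAIA_outputs"), Path("SWE_outputs")]
archive = Path("trail_submission.zip")

with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
    for out_dir in output_dirs:
        for json_file in sorted(out_dir.glob("*.json")):
            # Keep the directory name in the archive so the split stays identifiable.
            zf.write(json_file, arcname=json_file.as_posix())

print(f"Wrote {archive}")
```

The resulting zip can then be uploaded through the submission form in this Space.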