jitinpatronus committed
Commit 3069ced · verified · 1 Parent(s): e0cd95e

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -23,13 +23,13 @@ This is a Hugging Face Space that hosts a leaderboard for comparing model perfor
 
  ## Instructions
 
- 1. Please refer to our GitHub repository athttps://github.com/patronus-ai/trail-benchmark for step‑by‑step instructions on how to run your model with the TRAIL dataset.
+ 1. Please refer to our GitHub repository at https://github.com/patronus-ai/trail-benchmark for step‑by‑step instructions on how to run your model with the TRAIL dataset.
  2. Compress the resulting JSON outputs into a ZIP archive whose filename begins with SWE_/GAIA_, and submit it.
  3. Once the evaluation is complete, we’ll upload the scores (this process will soon be automated).
 
  ## Benchmarking on TRAIL
 
- TRAIL(Trace Reasoning and Agentic Issue Localization) is a benchmark dataset of 148 annotated AI agent execution traces containing 841 errors across reasoning, execution, and planning categories. Created from real-world software engineering and information retrieval tasks, it challenges even state-of-the-art LLMs, with the best model achieving only 11% accuracy, highlighting the difficulty of trace debugging for complex agent workflows.
+ [TRAIL(Trace Reasoning and Agentic Issue Localization)](https://arxiv.org/abs/2505.08638) is a benchmark dataset of 148 annotated AI agent execution traces containing 841 errors across reasoning, execution, and planning categories. Created from real-world software engineering and information retrieval tasks, it challenges even state-of-the-art LLMs, with the best model achieving only 11% accuracy, highlighting the difficulty of trace debugging for complex agent workflows.
 
  ## License
 
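
Step 2 of the README instructions in this diff (compress the JSON outputs into a ZIP archive whose filename begins with SWE_/GAIA_) could be scripted roughly as follows. This is a minimal sketch, not the project's official tooling. It assumes that "SWE_/GAIA_" means the filename starts with either `SWE_` or `GAIA_` depending on the split, and that preserving the output folder's relative layout inside the archive is acceptable. The function name `zip_outputs` and the parameters `split`, `model_name`, and `dest_dir` are hypothetical, introduced only for illustration.

```python
import zipfile
from pathlib import Path

# Assumed mapping from split name to the filename prefix mentioned in the README.
PREFIXES = {"swe": "SWE_", "gaia": "GAIA_"}


def zip_outputs(output_dir, split, model_name, dest_dir="."):
    """Bundle every .json file under output_dir into <PREFIX><model_name>.zip.

    Hypothetical helper: the archive layout (relative paths preserved)
    is an assumption, not something the README specifies.
    """
    key = split.lower()
    if key not in PREFIXES:
        raise ValueError(f"split must be one of {sorted(PREFIXES)}, got {split!r}")

    output_dir = Path(output_dir)
    archive = Path(dest_dir) / f"{PREFIXES[key]}{model_name}.zip"

    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        # Only JSON files are added; other files in the folder are skipped.
        for path in sorted(output_dir.rglob("*.json")):
            zf.write(path, arcname=path.relative_to(output_dir).as_posix())
    return archive
```

For example, `zip_outputs("outputs", "gaia", "my_model")` would write `GAIA_my_model.zip` to the current directory. Check the GitHub repository's instructions for the exact layout the leaderboard expects before submitting.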