import pandas as pd
import streamlit as st

st.set_page_config(
    page_title="JuStRank",
    page_icon="🧑🏻‍⚖️",
    # layout="wide",
    initial_sidebar_state="auto",
    menu_items=None,
)

st.title("🧑🏻‍⚖️ JuStRank: The Best Judges for Ranking Systems 🧑🏻‍⚖️")

url = "https://arxiv.org/abs/2412.09569"
st.subheader("Check out our [ACL paper](%s) for more details" % url)

def prettify_judge_name(judge_name):
    pretty_judge = (judge_name[0].upper()+judge_name[1:]).replace("Gpt", "GPT")
    return pretty_judge
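# Illustrative only (assumed judge names; the real ones come from the CSV), e.g.
#   prettify_judge_name("gpt-4o")       -> "GPT-4o"
#   prettify_judge_name("mixtral-8x7b") -> "Mixtral-8x7b"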


def format_digits(flt, num_digits=3):
    format_str = "{:."+str(num_digits-1)+"f}"
    format_str_zeroes = "{:."+str(num_digits)+"f}"
    return format_str_zeroes.format(flt)[1:] if (0 < flt < 1) else format_str.format(flt)
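# Illustrative only: values strictly between 0 and 1 keep num_digits decimals with
# the leading zero dropped; anything else gets num_digits-1 decimals, e.g.
#   format_digits(0.7534) -> ".753",  format_digits(1.2345) -> "1.23",  format_digits(0.0) -> "0.00"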


df = pd.read_csv("./best_judges_single_agg.csv")[["Judge Model", "Realization", "Ranking Agreement", "Decisiveness", "Bias"]]
df["Judge Model"] = df["Judge Model"].apply(prettify_judge_name)

styled_data = (
    # A single RdYlGn gradient on the agreement column; fixed vmin/vmax pin the
    # colour scale to the 0.5-0.9 range (values outside are clipped to the ends).
    df.style.background_gradient(
        subset=["Ranking Agreement"],
        cmap="RdYlGn",
        vmin=0.5,
        vmax=0.9,
    )
    .format(subset=["Ranking Agreement", "Decisiveness", "Bias"], formatter=format_digits)
    .set_properties(**{"text-align": "center"})
)


st.dataframe(styled_data, use_container_width=True, height=800, hide_index=True)

st.text("\n\n")
st.markdown(
    r"""
    This leaderboard measures the **system-level performance and behavior of LLM judges**, and was created as part of the **[JuStRank paper](https://www.arxiv.org/abs/2412.09569)** from ACL 2025.
    
    Judges are sorted by their *Ranking Agreement* with humans, i.e., by how closely each judge's ranking of the systems (generative models) matches the human ranking of those same systems on [LMSys Arena](https://lmarena.ai/leaderboard/text/hard-prompts-english).
    
    We also compare judges in terms of the *Decisiveness* and *Bias* reflected in their judgment behaviors (refer to the paper for details).
    
    In our research we tested 10 **LLM judges** and 8 **reward models**, and asked them to score the [responses](https://huggingface.co/datasets/lmarena-ai/arena-hard-auto/tree/main/data/arena-hard-v0.1/model_answer) of 63 systems to the 500 questions from Arena Hard v0.1.
    For each LLM judge we tried 4 different _realizations_, i.e., combinations of prompt and scoring method used with that judge.
    
    In total, the judge ranking is derived from **[1.5 million raw judgment scores](https://huggingface.co/datasets/ibm-research/justrank_judge_scores)** (48 judge realizations X 63 target systems X 500 instances).

    If you find this useful, please cite our work 🤗

    ```bibtex
    @inproceedings{gera2025justrank,
        title={JuStRank: Benchmarking LLM Judges for System Ranking}, 
        author={Gera, Ariel and Boni, Odellia and Perlitz, Yotam and Bar-Haim, Roy and Eden, Lilach and Yehudai, Asaf},
        booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
        month={July},
        address={Vienna, Austria},
        year={2025},
        url={www.arxiv.org/abs/2412.09569}, 
    }
    ```
    """
)
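
# ---------------------------------------------------------------------------
# Illustrative sketch only (not used by the app above, and not necessarily the
# paper's exact protocol): one common way to measure system-level "Ranking
# Agreement" is a rank correlation between a judge's per-system scores and the
# human (Arena) scores for the same systems. The series names below are
# assumptions for the example; the leaderboard itself reads precomputed values
# from best_judges_single_agg.csv.
# ---------------------------------------------------------------------------
def example_ranking_agreement(judge_system_scores: pd.Series, human_system_scores: pd.Series) -> float:
    """Kendall rank correlation between two system-level score series indexed by system name."""
    # Keep only the systems that appear in both series, then correlate.
    aligned = pd.concat([judge_system_scores, human_system_scores], axis=1, join="inner")
    return aligned.iloc[:, 0].corr(aligned.iloc[:, 1], method="kendall")

# Example (hypothetical data):
#   example_ranking_agreement(
#       pd.Series({"sys_a": 7.1, "sys_b": 6.4, "sys_c": 8.0}),
#       pd.Series({"sys_a": 1210.0, "sys_b": 1105.0, "sys_c": 1290.0}),
#   )  # -> 1.0 (the two rankings agree perfectly)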