<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <meta name="description" content="JQL: Judging Quality across Languages - A pipeline for multilingual data filtering.">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>JQL: Judging Quality across Languages</title>
  <link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro" rel="stylesheet">
  <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bulma@0.9.4/css/bulma.min.css">
  <style>
    body { font-family: 'Noto Sans', sans-serif; }
    .hero.is-primary { background-color: #f9d5e5; }
    .subtitle img { max-width: 100%; height: auto; }
    .section-title { margin-top: 2em; }
  </style>
</head>
<body>
<section class="hero is-primary">
  <div class="hero-body">
    <div class="container has-text-centered">
      <h1 class="title is-1">🦊 JQL: Judging Quality across Languages</h1>
      <p class="subtitle is-5">Scalable and lightweight multilingual data filtering with LLM-based annotators</p>
    </div>
  </div>
</section>

<section class="section">
  <div class="container content">
    <h2 class="title is-3">🧩 Main Pipeline Steps</h2>
    <figure>
      <img src="https://cdn-uploads.huggingface.co/production/uploads/64bfc4d55ce3d382c05c0f9a/1zPQcwqt9Li_gCvd04_2_.png" alt="JQL Pipeline Overview">
      <figcaption><em>Figure 1: Overview of the JQL pipeline</em></figcaption>
    </figure>

    <ol>
      <li><strong>📋 Ground Truth Creation:</strong> Human annotators label monolingual documents based on a structured instruction prompt. These documents are translated into all target languages to create a multilingual gold-standard dataset. (See Figure 1)</li>
      <li><strong>🤖 LLM-as-a-Judge Selection &amp; Data Annotation:</strong> Strong multilingual LLMs (e.g., Gemma, Mistral, LLaMA) are evaluated against the ground truth, and top-performing models are used to produce synthetic annotations. (See Figure 1)</li>
      <li><strong>🪶 Lightweight Annotator Training:</strong> Train compact regression heads on frozen multilingual embeddings to create efficient, high-throughput annotators. (See Figure 1)</li>
      <li><strong>🚀 Scalable Data Filtering:</strong> Use trained annotators to filter large-scale pretraining corpora using quantile thresholds. (See Figure 1)</li>
    </ol>
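    <p>The training and filtering steps above can be illustrated with a minimal, standard-library-only Python sketch. It is not the JQL implementation: the function names (<code>predict</code>, <code>train_linear_head</code>, <code>quantile_threshold</code>, <code>filter_by_quantile</code>) are hypothetical, a plain linear head fitted by full-batch gradient descent stands in for the compact regression heads, and both the nearest-rank quantile and the strict "above the threshold" rule are assumptions about how the cut-off might be applied.</p>
    <pre><code>import math


def predict(weights, bias, vec):
    """Score one embedding with a linear head: dot(weights, vec) + bias."""
    return sum(w * x for w, x in zip(weights, vec)) + bias


def train_linear_head(embeddings, targets, lr=0.1, steps=2000):
    """Fit a linear regression head by full-batch gradient descent on MSE.

    The embeddings are treated as fixed inputs (the "frozen" part);
    only the head's weights and bias are updated.
    """
    n, dim = len(embeddings), len(embeddings[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for vec, target in zip(embeddings, targets):
            err = predict(weights, bias, vec) - target
            for j in range(dim):
                grad_w[j] += 2.0 * err * vec[j] / n
            grad_b += 2.0 * err / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


def quantile_threshold(scores, q):
    """Return the q-quantile of scores (q in 0..1), nearest-rank method."""
    if not scores:
        raise ValueError("scores must be non-empty")
    ordered = sorted(scores)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def filter_by_quantile(docs, scores, q):
    """Keep documents scoring strictly above the q-quantile (an assumption)."""
    threshold = quantile_threshold(scores, q)
    return [doc for doc, score in zip(docs, scores) if score > threshold]


if __name__ == "__main__":
    # Toy data with made-up scores; real scores would come from a trained annotator.
    docs = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
    scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    print(filter_by_quantile(docs, scores, 0.6))  # ['g', 'h', 'i', 'j']
</code></pre>
    <p>Under these assumptions, a 0.6 quantile keeps roughly the top 40% of documents by score and a 0.7 quantile roughly the top 30%. In practice the scores would come from <code>predict</code> applied to each document's embedding.</p>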
  </div>
</section>

<section class="section">
  <div class="container content">
    <h2 class="title is-3">📊 Results</h2>
    <ul>
      <li><strong>Accuracy:</strong> Spearman’s ρ &gt; 0.87 with human ground truth</li>
      <li><strong>Downstream LLM Training:</strong>
        <ul>
          <li>+7.2% benchmark performance improvement</li>
          <li>+4.8% token retention vs. FineWeb2 heuristic filter</li>
          <li>Effective threshold strategies: 0.6 and 0.7 quantile</li>
        </ul>
      </li>
      <li><strong>Annotation Speed:</strong> ~11,000 docs/min (A100 GPU, avg. 690 tokens)</li>
    </ul>
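    <p>The accuracy figure above is a rank correlation. As a rough illustration of how such a number can be computed, the following standard-library Python sketch implements Spearman’s ρ as the Pearson correlation of average ranks. The function names and the example scores are made up for illustration; this is not the project's evaluation code.</p>
    <pre><code>import math


def average_ranks(values):
    """Assign 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start != len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 != len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    human = [1, 2, 3, 4, 5]   # hypothetical human quality labels
    model = [2, 1, 4, 3, 5]   # hypothetical annotator scores
    print(spearman_rho(human, model))  # 0.8
</code></pre>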
  </div>
</section>

<section class="section">
  <div class="container content">
    <h2 class="title is-3">📁 Available Artifacts</h2>
    <ul>
      <li>✅ Ground truth annotations in 35 languages</li>
      <li>✅ Synthetic LLM-annotated dataset (14M+ documents)</li>
      <li>✅ Lightweight annotation models:
        <ul>
          <li>JQL-Gemma</li>
          <li>JQL-Mistral</li>
          <li>JQL-Llama</li>
        </ul>
      </li>
      <li>✅ Training &amp; inference scripts (coming soon)</li>
    </ul>
  </div>
</section>

<section class="section">
  <div class="container content">
    <h2 class="title is-3">📜 Citation</h2>
    <p>If you use JQL, the annotations, or the pretrained annotators, please cite the paper:</p>
    <pre><code>@article{your2024jql,
  title={JQL: Judging Quality across Languages},
  author={Your, Name and Collaborators, Here},
  journal={Conference or preprint archive},
  year={2024}
}</code></pre>
  </div>
</section>

</body>
</html>