<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <meta name="description" content="JQL: Judging Quality across Languages - A pipeline for multilingual data filtering.">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>JQL: Judging Quality across Languages</title>
  <link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro" rel="stylesheet">
  <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bulma@0.9.4/css/bulma.min.css">
  <style>
    body { font-family: 'Noto Sans', sans-serif; }
    .hero.is-primary { background-color: #f9d5e5; }
    .subtitle img { max-width: 100%; height: auto; }
    .section-title { margin-top: 2em; }
  </style>
</head>
<body>
<section class="hero is-primary">
  <div class="hero-body">
    <div class="container has-text-centered">
      <h1 class="title is-1">🦊 JQL: Judging Quality across Languages</h1>
      <p class="subtitle is-5">Scalable and lightweight multilingual data filtering with LLM-based annotators</p>
    </div>
  </div>
</section>

<section class="section">
  <div class="container content">
    <p>
      High-quality multilingual data is crucial for training effective large language models (LLMs).
      <strong>JQL (Judging Quality across Languages)</strong> is a scalable and lightweight multilingual data filtering approach that distills the judgment capabilities of strong 
      multilingual LLMs into efficient cross-lingual annotators. 
    </p>
    <p>
      Overall, JQL improves data quality, retains more tokens, and generalizes to unseen languages. It outperforms heuristic baselines and enables cost-efficient multilingual pretraining data curation at scale.
    </p>
  </div>
</section>
  
<section class="section">
  <div class="container content">
    <h2 class="title is-3">🧩 Main Pipeline Steps</h2>
    <figure>
      <img src="https://cdn-uploads.huggingface.co/production/uploads/64bfc4d55ce3d382c05c0f9a/1zPQcwqt9Li_gCvd04_2_.png" alt="JQL Pipeline Overview">
      <figcaption><em>Figure 1: Overview of the JQL pipeline</em></figcaption>
    </figure>

    <ol>
      <li><strong>πŸ“‹ Ground Truth Creation:</strong> Human annotators label monolingual documents based on a structured instruction prompt. These documents are translated into all target languages to create a multilingual gold-standard dataset. (See Figure 1)</li>
      <li><strong>πŸ€– LLM-as-a-Judge Selection & Data Annotation:</strong> Strong multilingual LLMs (e.g., Gemma, Mistral, LLaMA) are evaluated against the ground truth, and top-performing models are used to produce synthetic annotations. (See Figure 1)</li>
      <li><strong>πŸͺΆ Lightweight Annotator Training:</strong> Train compact regression heads on frozen multilingual embeddings to create efficient, high-throughput annotators. (See Figure 1)</li>
      <li><strong>πŸš€ Scalable Data Filtering:</strong> Use trained annotators to filter large-scale pretraining corpora using quantile thresholds. (See Figure 1)</li>
    </ol>
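    <p>
      For illustration, here is a minimal sketch of the judge-selection step: candidate LLMs are ranked by Spearman agreement between their scores and the human ground truth. The <code>judge_score</code> callables are hypothetical placeholders for whatever LLM call produces a per-document quality rating; this is not the released JQL code.
    </p>
    <pre><code># Rank candidate LLM judges by agreement with human labels.
# `candidates` maps a model name to a hypothetical scoring callable.
from scipy.stats import spearmanr

def select_judges(candidates, documents, human_scores):
    agreement = {}
    for name, judge_score in candidates.items():
        llm_scores = [judge_score(doc) for doc in documents]
        rho, _ = spearmanr(llm_scores, human_scores)
        agreement[name] = rho
    # Highest-agreement models become the synthetic annotators.
    return sorted(agreement.items(), key=lambda kv: kv[1], reverse=True)</code></pre>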
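    <p>
      The lightweight annotators themselves can be pictured as a small regression head on top of frozen document embeddings. The PyTorch sketch below is illustrative; the embedding dimension, head width, and hyperparameters are assumptions, not the exact JQL configuration.
    </p>
    <pre><code># A compact regression head over precomputed (frozen) embeddings.
import torch
import torch.nn as nn

class QualityHead(nn.Module):
    def __init__(self, embed_dim=1024, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # scalar quality score
        )

    def forward(self, embeddings):
        return self.mlp(embeddings).squeeze(-1)

def train(head, loader, epochs=3, lr=1e-3):
    # Regress the head's output onto the LLM judges' scores.
    opt = torch.optim.AdamW(head.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for emb, score in loader:  # batches of (embedding, judge score)
            opt.zero_grad()
            loss = loss_fn(head(emb), score)
            loss.backward()
            opt.step()</code></pre>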
  </div>
</section>

<section class="section">
  <div class="container content">
    <h2 class="title is-3">πŸ“Š Results</h2>
    <ul>
      <li><strong>βœ”οΈ Accuracy:</strong> Spearman’s ρ > 0.87 with human ground truth</li>
      <li><strong>πŸ“ˆ Downstream LLM Training:</strong>
        <ul>
          <li>+7.2% benchmark performance improvement</li>
          <li>+4.8% token retention vs. FineWeb2 heuristic filter</li>
          <li>Effective threshold strategies: 0.6 and 0.7 quantile (see the sketch below)</li>
        </ul>
      </li>
      <li><strong>⚑ Annotation Speed:</strong> ~11,000 docs/min (A100 GPU, avg. 690 tokens)</li>
    </ul>
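    <p>
      As a concrete reading of the threshold strategy, the sketch below keeps only documents whose predicted quality lies above a chosen quantile of the score distribution; the function name and inputs are illustrative, not a released API.
    </p>
    <pre><code># Quantile-threshold filtering: keep documents scoring above the
# q-quantile of the annotator's score distribution (e.g., q=0.6 or 0.7).
import numpy as np

def filter_by_quantile(texts, scores, q=0.6):
    threshold = np.quantile(scores, q)
    return [t for t, s in zip(texts, scores) if s >= threshold]</code></pre>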
  </div>
</section>

<section class="section">
  <div class="container content">
    <h2 class="title is-3">πŸ“ Available Artifacts</h2>
    <ul>
      <li>πŸ“„ Ground truth annotations in 35 languages</li>
      <li>🧠 Synthetic LLM-annotated dataset (14M+ documents)</li>
      <li>πŸͺΆ Lightweight annotation models:
        <ul>
          <li>JQL-Gemma</li>
          <li>JQL-Mistral</li>
          <li>JQL-Llama</li>
        </ul>
      </li>
      <li>πŸ› οΈ Training & inference scripts (coming soon)</li>
    </ul>
  </div>
</section>

<section class="section">
  <div class="container content">
    <h2 class="title is-3">πŸ“œ Citation</h2>
    <p>If you use JQL, the annotations, or the pretrained annotators, please cite the paper:</p>
    <pre><code>@article{your2024jql,
  title={JQL: Judging Quality across Languages},
  author={Your, Name and Collaborators, Here},
  journal={Conference or preprint archive},
  year={2024}
}</code></pre>
  </div>
</section>

</body>
</html>