🦊 JQL: Judging Quality across Languages

Scalable and lightweight multilingual data filtering with LLM-based annotators

High-quality multilingual data is crucial for training effective large language models (LLMs). JQL (Judging Quality across Languages) is a scalable and lightweight multilingual data filtering approach that distills the judgment capabilities of strong multilingual LLMs into efficient cross-lingual annotators.

Overall, JQL improves data quality, retains more tokens, and generalizes to unseen languages. It outperforms heuristic baselines and enables cost-efficient multilingual pretraining data curation at scale.

πŸ“Š Results

βœ”οΈ Accuracy

  • Spearman’s ρ > 0.87 with human ground truth
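
The accuracy check above boils down to a rank correlation between model scores and human ratings. A minimal sketch using `scipy.stats.spearmanr`; the scores below are illustrative placeholders, not values from the paper:

```python
# Hypothetical sketch: validating an annotator against human ground truth
# with Spearman's rank correlation. All scores here are made up.
from scipy.stats import spearmanr

human_scores = [4, 2, 5, 1, 3, 4, 2]                 # illustrative human ratings
model_scores = [3.8, 2.1, 4.9, 1.2, 3.3, 4.1, 1.9]   # illustrative model outputs

rho, pvalue = spearmanr(human_scores, model_scores)
print(f"Spearman's rho = {rho:.3f}")
```

Spearman's ρ is rank-based, so the annotator only needs to order documents consistently with humans; its absolute score scale does not matter.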

πŸ“ˆ Downstream LLM Training Impact

  • +7.2% benchmark performance improvement
  • +4.8% token retention compared to the FineWeb2 heuristic filter
  • Reliable thresholding at the 0.6 and 0.7 quantiles
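
Quantile thresholding keeps only the documents whose quality score lands in the top portion of the score distribution. A minimal self-contained sketch (function names and data are illustrative, not from the JQL codebase):

```python
# Hypothetical sketch: quantile-based filtering of annotator scores.
# JQL reports reliable results at the 0.6 and 0.7 quantiles.

def quantile(scores, q):
    """Return the q-quantile of `scores` via linear interpolation."""
    s = sorted(scores)
    idx = q * (len(s) - 1)
    lo, hi = int(idx), min(int(idx) + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (idx - lo)

def filter_by_quantile(docs, scores, q=0.7):
    """Keep only documents scoring at or above the q-quantile threshold."""
    threshold = quantile(scores, q)
    return [d for d, s in zip(docs, scores) if s >= threshold]

kept = filter_by_quantile(["a", "b", "c", "d", "e"],
                          [0.1, 0.9, 0.5, 0.7, 0.3], q=0.6)
print(kept)  # → ['b', 'd']
```

Because the threshold is a quantile of the corpus's own score distribution rather than a fixed cutoff, the same setting transfers across languages with differently scaled score distributions.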

⚑ Annotation Speed

  • ~11,000 docs/min (on A100 GPU, avg. 690 tokens per doc)

πŸ“ Available Artifacts

  • πŸ“„ Ground truth annotations in 35 languages
  • 🧠 Synthetic LLM-annotated dataset (14M+ documents)
  • πŸͺΆ Lightweight annotation models:
    • JQL-Gemma
    • JQL-Mistral
    • JQL-Llama
  • πŸ› οΈ Training & inference scripts (coming soon)

🧩 Main Pipeline Steps

Figure 1: Overview of the JQL pipeline
  1. πŸ“‹ Ground Truth Creation: Human annotators label monolingual documents following a structured instruction prompt; these documents are then translated into all target languages to create a multilingual gold-standard dataset (see Figure 1).
  2. πŸ€– LLM-as-a-Judge Selection & Data Annotation: Strong multilingual LLMs (e.g., Gemma, Mistral, LLaMA) are evaluated against the ground truth, and the top-performing models produce synthetic annotations at scale.
  3. πŸͺΆ Lightweight Annotator Training: Compact regression heads are trained on frozen multilingual embeddings to create efficient, high-throughput annotators.
  4. πŸš€ Scalable Data Filtering: The trained annotators filter large-scale pretraining corpora using quantile-based score thresholds.
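
Step 3 above can be sketched as a small regression head on top of frozen embeddings. Everything below is a hypothetical illustration: the embedding dimension, head architecture, and training data are placeholders, not the released JQL models:

```python
# Hypothetical sketch: a lightweight regression head trained on frozen
# multilingual document embeddings to predict quality scores.
import torch
import torch.nn as nn

class QualityRegressionHead(nn.Module):
    """Maps a frozen document embedding to a scalar quality score."""
    def __init__(self, embed_dim: int = 1024, hidden_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        return self.mlp(embeddings).squeeze(-1)

# Training-loop sketch: embeddings come from a frozen encoder (not shown);
# targets stand in for the LLM judges' synthetic quality scores.
head = QualityRegressionHead()
opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
emb = torch.randn(8, 1024)      # stand-in for frozen encoder outputs
target = torch.rand(8)          # stand-in for judge scores
for _ in range(10):
    loss = nn.functional.mse_loss(head(emb), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because only the small head is trained while the multilingual encoder stays frozen, annotation at inference time is cheap, which is what enables the ~11,000 docs/min throughput reported above.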


πŸ“œ Citation

If you use JQL, the annotations, or the pretrained annotators, please cite the paper:

@article{your2024jql,
  title={JQL: Judging Quality across Languages},
  author={Your, Name and Collaborators, Here},
  journal={Conference or preprint archive},
  year={2024}
}