QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation
Abstract
Qari-OCR, a series of fine-tuned vision-language models, achieves state-of-the-art performance in Arabic OCR through iterative optimization on specialized datasets, handling diacritics, fonts, layouts, and low-resolution images.
The inherent complexities of Arabic script, namely its cursive nature, diacritical marks (tashkeel), and varied typography, pose persistent challenges for Optical Character Recognition (OCR). We present Qari-OCR, a series of vision-language models derived from Qwen2-VL-2B-Instruct, progressively optimized for Arabic through iterative fine-tuning on specialized synthetic datasets. Our leading model, QARI v0.2, establishes a new open-source state-of-the-art with a Word Error Rate (WER) of 0.160, Character Error Rate (CER) of 0.061, and BLEU score of 0.737 on diacritically rich texts. Qari-OCR demonstrates superior handling of tashkeel, diverse fonts, and document layouts, alongside strong performance on low-resolution images. Further explorations (QARI v0.3) showcase strong potential for structural document understanding and handwritten text. This work delivers a marked improvement in Arabic OCR accuracy and efficiency, with all models and datasets released to foster further research.
QARI-OCR: High-Fidelity Arabic Text Recognition via Multimodal LLM Adaptation
QARI-OCR is a state-of-the-art Arabic OCR system fine-tuned from the vision-language model Qwen2-VL-2B-Instruct. It achieves unprecedented accuracy in transcribing printed, diacritized, and handwritten Arabic text through targeted training on large-scale synthetic datasets. QARI-OCR is open-source and sets a new benchmark for Arabic text recognition and document layout understanding.
Key Highlights
- State-of-the-Art Accuracy
  - Character Error Rate (CER): 0.061
  - Word Error Rate (WER): 0.160
  - BLEU Score: 0.737
- Multimodal LLM Fine-Tuning
  - Built on Qwen2-VL-2B-Instruct
  - Enhanced for Arabic with diacritics, ligatures, and classical scripts
- Robust to Document Complexity
  - Mixed layouts, various fonts, degraded scans, and handwritten text
Model Versions

Version | Focus | Dataset Size | Diacritics | Layout-Aware | Handwriting |
---|---|---|---|---|---|
v0.1 | Clean text, 5 fonts, no diacritics | 5,000 | ✗ | ✗ | ✗ |
v0.2 | Diacritized/classical text, 10 fonts | 50,000 | ✓ | ✗ | ✗ |
v0.3 | Realistic layouts, mixed sizes, handwriting | 10,000 | – | ✓ | ✓ |
Performance (Test Set of 200 Pages)

Model | CER ↓ | WER ↓ | BLEU ↑ |
---|---|---|---|
Tesseract | 0.436 | 0.889 | 0.108 |
EasyOCR | 0.791 | 0.918 | 0.051 |
Mistral OCR | 0.210 | 0.570 | 0.440 |
AIN | 0.640 | 0.210 | 0.830 |
QARI v0.2 | 0.061 | 0.160 | 0.737 |
QARI v0.3 | 0.300 | 0.545 | 0.485 |
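For reference, the CER and WER figures above follow the standard edit-distance definitions. A minimal Python sketch (not the authors' evaluation script) that computes both:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # dp[j] holds D[i][j] for the current row
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                          # deletion
                dp[j - 1] + 1,                      # insertion
                prev + (ref[i - 1] != hyp[j - 1]),  # substitution (0 if equal)
            )
            prev = cur
    return dp[n]

def cer(ref, hyp):
    """Character Error Rate: character edits divided by reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

def wer(ref, hyp):
    """Word Error Rate: the same edit distance, computed over word tokens."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)

# Works on Arabic text directly, since it operates on Unicode code points:
score = cer("السلام عليكم", "السلم عليكم")
```

Note that diacritized references make CER stricter: each dropped or wrong tashkeel mark counts as a character-level error.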
Features

- ✅ Diacritic recognition (fathah, kasrah, dammah, sukun, shadda, tanwin)
- ✅ Font diversity (12+ Arabic fonts)
- ✅ Layout parsing with HTML tag reconstruction (v0.3)
- ✅ Handwritten text recognition (v0.3)
- ✅ Robust to blur, noise, and low-resolution scans
- ✅ All models and datasets are open-source
Pipeline Overview
Dataset Generation
- Text Source: News & classical Arabic corpora
- Rendering: HTML → PDF → Image
- Degradation: Clean, moderate, and heavy noise
- Annotation: Paired with exact ground-truth
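The degradation step can be illustrated with a small Pillow sketch. The blur radii and noise fractions below are assumptions chosen to mimic the clean/moderate/heavy tiers, not the paper's exact parameters:

```python
import random
from PIL import Image, ImageFilter

def degrade(img, level="moderate", seed=0):
    """Scan-like degradation: Gaussian blur plus salt-and-pepper noise.
    Levels loosely mirror the clean/moderate/heavy tiers described above."""
    if level == "clean":
        return img.copy()
    rng = random.Random(seed)  # seeded for reproducible datasets
    radius, noise_frac = {"moderate": (1, 0.01), "heavy": (2, 0.05)}[level]
    out = img.filter(ImageFilter.GaussianBlur(radius))
    w, h = out.size
    for _ in range(int(w * h * noise_frac)):
        # Flip a random pixel to pure black or white
        out.putpixel((rng.randrange(w), rng.randrange(h)),
                     rng.choice((0, 255)))
    return out

# Example on a blank grayscale "page"; real pages would come from rendering.
page = Image.new("L", (200, 100), 255)
noisy = degrade(page, "heavy")
```

Since the ground truth is the text used to render each page, annotations remain exact regardless of how heavily the image is degraded.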
Model Training
- Backbone: Qwen2-VL-2B-Instruct
- Fine-tuning: LoRA adapters (4-bit), PEFT, Unsloth
- Frameworks: Hugging Face `trl` + `SFTTrainer`
- Training Specs:
  - 1 epoch, AdamW optimizer (lr=2e-4), 48 GB A6000 GPU
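A minimal sketch of the adapter configuration with `peft`. The rank, alpha, dropout, and target modules here are illustrative assumptions, since the README only specifies 4-bit LoRA with PEFT/Unsloth and lr=2e-4:

```python
from peft import LoraConfig

# Illustrative LoRA adapter configuration for the Qwen2-VL language backbone.
# r, lora_alpha, lora_dropout, and target_modules are assumed values,
# not the authors' published hyperparameters.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

This config would be passed to the fine-tuning wrapper (e.g. `trl`'s `SFTTrainer` via a PEFT-wrapped model) so that only the low-rank adapter weights are updated.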
Quantization Impact

Model | Precision | CER ↓ | WER ↓ | BLEU ↑ |
---|---|---|---|---|
QARI v0.2 | 8-bit | 0.091 | 0.255 | 0.583 |
QARI v0.2 | 4-bit | 3.452 | 4.516 | 0.001 |
QARI v0.3 | 8-bit | 0.133 | 0.353 | 0.472 |
QARI v0.3 | 4-bit | 3.228 | 6.428 | 0.001 |
⚠️ Use 8-bit quantization for best accuracy. 4-bit is not recommended for OCR tasks requiring fine-grained recognition.
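Loading in 8-bit with `transformers` and `bitsandbytes` might look like the configuration sketch below; the repository id is taken from the Resources link and should be checked against the model card:

```python
from transformers import (
    AutoProcessor,
    BitsAndBytesConfig,
    Qwen2VLForConditionalGeneration,
)

# 8-bit loading, per the recommendation above. The repo id is assumed from
# the Hugging Face link in Resources; verify the exact id on the model card.
model_id = "riotu-lab/QARI-OCR"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```

Swapping `load_in_8bit=True` for `load_in_4bit=True` reproduces the 4-bit rows above, which is why the warning steers users to 8-bit for OCR.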
Resources

- Hugging Face Models & Data: https://huggingface.co/riotu-lab/QARI-OCR
- Paper: arXiv:2506.02295
⚠️ Limitations
- Suboptimal with dense text or narrow line spacing
- Limited recognition for figures/charts and embedded numerals
- Peripheral elements (e.g., margins/page numbers) sometimes skipped
Citation

@article{wasfy2025qari,
  title={QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation},
  author={Ahmed Wasfy and Omer Nacar and Abdelakreem Elkhateb and Mahmoud Reda and Omar Elshehy and Adel Ammar and Wadii Boulila},
  journal={arXiv preprint arXiv:2506.02295},
  year={2025}
}