QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation
Abstract
Qari-OCR, a series of fine-tuned vision-language models, achieves state-of-the-art performance in Arabic OCR through iterative optimization on specialized datasets, handling diacritics, fonts, layouts, and low-resolution images.
The inherent complexities of Arabic script, namely its cursive nature, diacritical marks (tashkeel), and varied typography, pose persistent challenges for Optical Character Recognition (OCR). We present Qari-OCR, a series of vision-language models derived from Qwen2-VL-2B-Instruct, progressively optimized for Arabic through iterative fine-tuning on specialized synthetic datasets. Our leading model, QARI v0.2, establishes a new open-source state-of-the-art with a Word Error Rate (WER) of 0.160, Character Error Rate (CER) of 0.061, and BLEU score of 0.737 on diacritically rich texts. Qari-OCR demonstrates superior handling of tashkeel, diverse fonts, and document layouts, alongside strong performance on low-resolution images. Further explorations (QARI v0.3) showcase strong potential for structural document understanding and handwritten text. This work delivers a marked improvement in Arabic OCR accuracy and efficiency, with all models and datasets released to foster further research.
QARI-OCR: High-Fidelity Arabic Text Recognition via Multimodal LLM Adaptation
QARI-OCR is a state-of-the-art Arabic OCR system fine-tuned from the vision-language model Qwen2-VL-2B-Instruct. It achieves unprecedented accuracy in transcribing printed, diacritized, and handwritten Arabic text through targeted training on large-scale synthetic datasets. QARI-OCR is open-source and sets a new benchmark for Arabic text recognition and document layout understanding.
Key Highlights
- State-of-the-Art Accuracy
  - Character Error Rate (CER): 0.061
  - Word Error Rate (WER): 0.160
  - BLEU Score: 0.737
- Multimodal LLM Fine-Tuning
  - Built on Qwen2-VL-2B-Instruct
  - Enhanced for Arabic with diacritics, ligatures, and classical scripts
- Robust to Document Complexity
  - Mixed layouts, various fonts, degraded scans, and handwritten text
Model Versions

Version | Focus | Dataset Size | Diacritics | Layout-Aware | Handwriting |
---|---|---|---|---|---|
v0.1 | Clean text, 5 fonts, no diacritics | 5,000 | ✗ | ✗ | ✗ |
v0.2 | Diacritized/classical text, 10 fonts | 50,000 | ✓ | ✗ | ✗ |
v0.3 | Realistic layouts, mixed sizes, handwriting | 10,000 | – | ✓ | ✓ |
Performance (Test Set of 200 Pages)

Model | CER ↓ | WER ↓ | BLEU ↑ |
---|---|---|---|
Tesseract | 0.436 | 0.889 | 0.108 |
EasyOCR | 0.791 | 0.918 | 0.051 |
Mistral OCR | 0.210 | 0.570 | 0.440 |
AIN | 0.640 | 0.210 | 0.830 |
QARI v0.2 | 0.061 | 0.160 | 0.737 |
QARI v0.3 | 0.300 | 0.545 | 0.485 |
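For reference, the CER and WER figures above follow the standard edit-distance definitions. A minimal Python sketch (not the authors' evaluation script) that computes both:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # dp[j] holds D[i][j] for the current row
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                          # deletion
                dp[j - 1] + 1,                      # insertion
                prev + (ref[i - 1] != hyp[j - 1]),  # substitution (0 if equal)
            )
            prev = cur
    return dp[n]

def cer(ref, hyp):
    """Character Error Rate: character edits divided by reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

def wer(ref, hyp):
    """Word Error Rate: the same edit distance, computed over word tokens."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)

# Works on Arabic text directly, since it operates on Unicode code points:
score = cer("السلام عليكم", "السلم عليكم")
```

Note that diacritized references make CER stricter: each dropped or wrong tashkeel mark counts as a character-level error.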
Features

- ✅ Diacritic recognition (fathah, kasrah, dammah, sukun, shadda, tanwin)
- ✅ Font diversity (12+ Arabic fonts)
- ✅ Layout parsing with HTML tag reconstruction (v0.3)
- ✅ Handwritten text recognition (v0.3)
- ✅ Robust to blur, noise, and low-resolution scans
- ✅ All models and datasets are open-source
Pipeline Overview
Dataset Generation
- Text Source: News & classical Arabic corpora
- Rendering: HTML → PDF → Image
- Degradation: Clean, moderate, and heavy noise
- Annotation: Paired with exact ground-truth
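The degradation step can be illustrated with a small Pillow sketch. The blur radii and noise fractions below are assumptions chosen to mimic the clean/moderate/heavy tiers, not the paper's exact parameters:

```python
import random
from PIL import Image, ImageFilter

def degrade(img, level="moderate", seed=0):
    """Scan-like degradation: Gaussian blur plus salt-and-pepper noise.
    Levels loosely mirror the clean/moderate/heavy tiers described above."""
    if level == "clean":
        return img.copy()
    rng = random.Random(seed)  # seeded for reproducible datasets
    radius, noise_frac = {"moderate": (1, 0.01), "heavy": (2, 0.05)}[level]
    out = img.filter(ImageFilter.GaussianBlur(radius))
    w, h = out.size
    for _ in range(int(w * h * noise_frac)):
        # Flip a random pixel to pure black or white
        out.putpixel((rng.randrange(w), rng.randrange(h)),
                     rng.choice((0, 255)))
    return out

# Example on a blank grayscale "page"; real pages would come from rendering.
page = Image.new("L", (200, 100), 255)
noisy = degrade(page, "heavy")
```

Since the ground truth is the text used to render each page, annotations remain exact regardless of how heavily the image is degraded.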
Model Training
- Backbone: Qwen2-VL-2B-Instruct
- Fine-tuning: LoRA adapters (4-bit), PEFT, Unsloth
- Frameworks: Hugging Face `trl` + `SFTTrainer`
- Training Specs:
  - 1 epoch, AdamW optimizer (lr=2e-4), 48 GB A6000 GPU
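A minimal sketch of the adapter configuration with `peft`. The rank, alpha, dropout, and target modules here are illustrative assumptions, since the README only specifies 4-bit LoRA with PEFT/Unsloth and lr=2e-4:

```python
from peft import LoraConfig

# Illustrative LoRA adapter configuration for the Qwen2-VL language backbone.
# r, lora_alpha, lora_dropout, and target_modules are assumed values,
# not the authors' published hyperparameters.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

This config would be passed to the fine-tuning wrapper (e.g. `trl`'s `SFTTrainer` via a PEFT-wrapped model) so that only the low-rank adapter weights are updated.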
Quantization Impact

Model | Precision | CER ↓ | WER ↓ | BLEU ↑ |
---|---|---|---|---|
QARI v0.2 | 8-bit | 0.091 | 0.255 | 0.583 |
QARI v0.2 | 4-bit | 3.452 | 4.516 | 0.001 |
QARI v0.3 | 8-bit | 0.133 | 0.353 | 0.472 |
QARI v0.3 | 4-bit | 3.228 | 6.428 | 0.001 |
⚠️ Use 8-bit quantization for best accuracy. 4-bit is not recommended for OCR tasks requiring fine-grained recognition.
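Loading in 8-bit with `transformers` and `bitsandbytes` might look like the configuration sketch below; the repository id is taken from the Resources link and should be checked against the model card:

```python
from transformers import (
    AutoProcessor,
    BitsAndBytesConfig,
    Qwen2VLForConditionalGeneration,
)

# 8-bit loading, per the recommendation above. The repo id is assumed from
# the Hugging Face link in Resources; verify the exact id on the model card.
model_id = "riotu-lab/QARI-OCR"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```

Swapping `load_in_8bit=True` for `load_in_4bit=True` reproduces the 4-bit rows above, which is why the warning steers users to 8-bit for OCR.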
Resources

- Hugging Face Models & Data: https://huggingface.co/riotu-lab/QARI-OCR
- Paper: arXiv:2506.02295
⚠️ Limitations
- Suboptimal with dense text or narrow line spacing
- Limited recognition for figures/charts and embedded numerals
- Peripheral elements (e.g., margins/page numbers) sometimes skipped
Citation

@article{wasfy2025qari,
  title={QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation},
  author={Ahmed Wasfy and Omer Nacar and Abdelakreem Elkhateb and Mahmoud Reda and Omar Elshehy and Adel Ammar and Wadii Boulila},
  journal={arXiv preprint arXiv:2506.02295},
  year={2025}
}