arxiv:2506.02295

QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation

Published on Jun 2 · Submitted by Omartificial-Intelligence-Space on Jun 4

Authors: Ahmed Wasfy, Omer Nacar, Abdelakreem Elkhateb, Mahmoud Reda, Omar Elshehy, Adel Ammar, Wadii Boulila

AI-generated summary

Qari-OCR, a series of fine-tuned vision-language models, achieves state-of-the-art performance in Arabic OCR through iterative optimization on specialized datasets, handling diacritics, fonts, layouts, and low-resolution images.

Abstract

The inherent complexities of Arabic script, namely its cursive nature, diacritical marks (tashkeel), and varied typography, pose persistent challenges for Optical Character Recognition (OCR). We present Qari-OCR, a series of vision-language models derived from Qwen2-VL-2B-Instruct, progressively optimized for Arabic through iterative fine-tuning on specialized synthetic datasets. Our leading model, QARI v0.2, establishes a new open-source state-of-the-art with a Word Error Rate (WER) of 0.160, Character Error Rate (CER) of 0.061, and BLEU score of 0.737 on diacritically-rich texts. Qari-OCR demonstrates superior handling of tashkeel, diverse fonts, and document layouts, alongside impressive performance on low-resolution images. Further explorations (QARI v0.3) showcase strong potential for structural document understanding and handwritten text. This work delivers a marked improvement in Arabic OCR accuracy and efficiency, with all models and datasets released to foster further research.

Community


QARI-OCR: High-Fidelity Arabic Text Recognition via Multimodal LLM Adaptation

QARI-OCR is an Arabic OCR system fine-tuned from the vision-language model Qwen2-VL-2B-Instruct. Through targeted training on large-scale synthetic datasets, it achieves open-source state-of-the-art accuracy in transcribing printed, diacritized, and handwritten Arabic text. All models and datasets are open-source, setting a new benchmark for Arabic text recognition and document layout understanding.
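A minimal inference sketch is shown below, assuming the released checkpoints follow the standard transformers API for Qwen2-VL models; the repository id, image path, and prompt wording are illustrative placeholders rather than values taken from the paper.

```python
# Minimal inference sketch for a Qwen2-VL-based OCR checkpoint.
# The repo id, image path, and prompt are placeholders, not confirmed values.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from PIL import Image

model_id = "your-org/qari-ocr-v0.2"  # hypothetical repo id; substitute the released checkpoint
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("arabic_page.png")  # a scanned or rendered Arabic page
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Extract the Arabic text in this image."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=1024)
new_tokens = output_ids[:, inputs.input_ids.shape[1]:]  # drop the prompt tokens
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```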


🧠 Key Highlights

  • State-of-the-Art Accuracy
    • Character Error Rate (CER): 0.061
    • Word Error Rate (WER): 0.160
    • BLEU Score: 0.737
  • Multimodal LLM Fine-Tuning
    • Built on Qwen2-VL-2B-Instruct
    • Enhanced for Arabic with diacritics, ligatures, and classical scripts
  • Robust to Document Complexity
    • Mixed layouts, various fonts, degraded scans, and handwritten text

🚀 Model Versions

| Version | Focus | Dataset Size | Diacritics | Layout-Aware | Handwriting |
|---------|-------|--------------|------------|--------------|-------------|
| v0.1 | Clean text, 5 fonts, no diacritics | 5,000 | ❌ | ❌ | ❌ |
| v0.2 | Diacritized/classical text, 10 fonts | 50,000 | ✅ | ❌ | ❌ |
| v0.3 | Realistic layouts, mixed sizes, handwriting | 10,000 | ✅ | ✅ | ✅ |

📊 Performance (Test Set of 200 Pages)

| Model | CER ↓ | WER ↓ | BLEU ↑ |
|-------|-------|-------|--------|
| Tesseract | 0.436 | 0.889 | 0.108 |
| EasyOCR | 0.791 | 0.918 | 0.051 |
| Mistral OCR | 0.210 | 0.570 | 0.440 |
| AIN | 0.640 | 0.210 | 0.830 |
| QARI v0.2 | 0.061 | 0.160 | 0.737 |
| QARI v0.3 | 0.300 | 0.545 | 0.485 |
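For context on the reported metrics, here is a small sketch of how CER, WER, and BLEU can be computed; jiwer and nltk are assumed tooling (the post does not name its evaluation libraries), and the strings are placeholders.

```python
# Sketch of the evaluation metrics; jiwer and nltk are assumed tools,
# and the reference/hypothesis strings are placeholders.
import jiwer
from nltk.translate.bleu_score import corpus_bleu

references = ["النص المرجعي الأول", "النص المرجعي الثاني"]  # ground-truth transcriptions
hypotheses = ["النص المتوقع الأول", "النص المتوقع الثاني"]  # model outputs

cer = jiwer.cer(references, hypotheses)   # character error rate (lower is better)
wer = jiwer.wer(references, hypotheses)   # word error rate (lower is better)
bleu = corpus_bleu(
    [[r.split()] for r in references],    # each hypothesis gets a list of tokenized references
    [h.split() for h in hypotheses],
)                                         # BLEU in [0, 1] (higher is better)

print(f"CER={cer:.3f}  WER={wer:.3f}  BLEU={bleu:.3f}")
```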

📝 Features

  • ✅ Diacritic recognition (fathah, kasrah, dammah, sukun, shadda, tanwin)
  • ✅ Font diversity (12+ Arabic fonts)
  • ✅ Layout parsing with HTML tag reconstruction (v0.3)
  • ✅ Handwritten text recognition (v0.3)
  • ✅ Robust to blur, noise, and low-resolution scans
  • ✅ All models and datasets are open-source

🏗️ Pipeline Overview

Dataset Generation

  1. Text Source: News & classical Arabic corpora
  2. Rendering: HTML → PDF → Image
  3. Degradation: Clean, moderate, and heavy noise
  4. Annotation: Paired with exact ground-truth
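The rendering and degradation steps could look roughly like the following sketch, assuming weasyprint, pdf2image (with poppler installed), and Pillow; the font, sample text, and noise level are illustrative choices, not the exact setup from the paper.

```python
# Sketch of one page of the HTML -> PDF -> image pipeline with a degraded variant.
# weasyprint, pdf2image, and Pillow are assumed tools; font and text are placeholders.
from weasyprint import HTML
from pdf2image import convert_from_path
from PIL import ImageFilter

arabic_text = "نصٌّ عربيٌّ مُشكَّلٌ من المدوَّنة"  # one line drawn from the text corpus
html = f"""
<html dir="rtl">
  <body style="font-family: 'Amiri', serif; font-size: 18pt;">
    <p>{arabic_text}</p>
  </body>
</html>
"""

HTML(string=html).write_pdf("page.pdf")            # render HTML to PDF
image = convert_from_path("page.pdf", dpi=150)[0]  # rasterize the first page

image.save("page_clean.png")                                         # clean variant
image.filter(ImageFilter.GaussianBlur(1.2)).save("page_noisy.png")   # moderately degraded variant

# pair every rendered image with its exact ground-truth string
with open("page.gt.txt", "w", encoding="utf-8") as f:
    f.write(arabic_text)
```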

Model Training

  • Backbone: Qwen2-VL-2B-Instruct
  • Fine-tuning: LoRA adapters (4-bit), PEFT, Unsloth
  • Frameworks: Hugging Face trl + SFTTrainer
  • Training Specs:
    • 1 epoch, AdamW optimizer (lr=2e-4), 48GB A6000 GPU
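A hedged sketch of a comparable 4-bit LoRA setup with peft and bitsandbytes is shown below; the rank, alpha, target modules, and batch size are assumptions for illustration, and the paper's own run additionally used Unsloth and trl's SFTTrainer.

```python
# Sketch of a 4-bit LoRA configuration on the Qwen2-VL-2B-Instruct backbone.
# Rank, alpha, target modules, and batch size are assumed values, not from the paper.
import torch
from transformers import Qwen2VLForConditionalGeneration, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", quantization_config=bnb, device_map="auto"
)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,            # illustrative hyperparameters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)

# training arguments matching the stated specs: 1 epoch, AdamW, lr=2e-4
args = TrainingArguments(
    output_dir="qari-lora",
    num_train_epochs=1,
    learning_rate=2e-4,
    optim="adamw_torch",
    per_device_train_batch_size=1,   # assumed; fits a 48 GB A6000 comfortably
    bf16=True,
)
# `args` would then be passed to trl's SFTTrainer together with the paired image/text dataset.
```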

🔬 Quantization Impact

| Model | Precision | CER ↓ | WER ↓ | BLEU ↑ |
|-------|-----------|-------|-------|--------|
| QARI v0.2 | 8-bit | 0.091 | 0.255 | 0.583 |
| QARI v0.2 | 4-bit | 3.452 | 4.516 | 0.001 |
| QARI v0.3 | 8-bit | 0.133 | 0.353 | 0.472 |
| QARI v0.3 | 4-bit | 3.228 | 6.428 | 0.001 |

⚠️ Use 8-bit quantization for best accuracy. 4-bit is not recommended for OCR tasks requiring fine-grained recognition.
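Loading in 8-bit with bitsandbytes is a one-line change in the standard transformers call; the repository id below is a placeholder.

```python
# 8-bit loading sketch; the repo id is a placeholder, not a confirmed model name.
from transformers import Qwen2VLForConditionalGeneration, BitsAndBytesConfig

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "your-org/qari-ocr-v0.2",                                   # hypothetical repo id
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights
    device_map="auto",
)
```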


📚 Resources


⚠️ Limitations

  • Suboptimal with dense text or narrow line spacing
  • Limited recognition for figures/charts and embedded numerals
  • Peripheral elements (e.g., margins/page numbers) sometimes skipped

🧾 Citation



@article{wasfy2025qari,
  title={QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation},
  author={Ahmed Wasfy and Omer Nacar and Abdelakreem Elkhateb and Mahmoud Reda and Omar Elshehy and Adel Ammar and Wadii Boulila},
  journal={arXiv preprint arXiv:2506.02295},
  year={2025}
}

