SWEETPnx ZAEBUC Model

Model Description

CAMeL-Lab/text-editing-zaebuc-pnx is a text editing model tailored for grammatical error correction (GEC) in Modern Standard Arabic (MSA). The model is based on AraBERTv02, which we fine-tuned using the ZAEBUC dataset. This model was introduced in our ACL 2025 paper, Enhancing Text Editing for Grammatical Error Correction: Arabic as a Case Study, where we refer to it as SWEET (Subword Edit Error Tagger).

The model was fine-tuned to fix punctuation (i.e., Pnx) errors. Details about the training procedure, data preprocessing, and hyperparameters are available in the paper. The fine-tuning code and associated resources are publicly available on our GitHub repository: https://github.com/CAMeL-Lab/text-editing.

Intended uses

To use the CAMeL-Lab/text-editing-zaebuc-pnx model, you must clone our text editing GitHub repository and follow the installation requirements. We used this SWEETPnx model to report results on the ZAEBUC dev and test sets in our paper. This model is intended to be used together with the SWEETNoPnx model (CAMeL-Lab/text-editing-zaebuc-nopnx).

How to use

Clone our text editing GitHub repository and follow the installation requirements, then run:

from transformers import BertTokenizer, BertForTokenClassification
import torch
import torch.nn.functional as F
from gec.tag import rewrite


nopnx_tokenizer = BertTokenizer.from_pretrained('CAMeL-Lab/text-editing-zaebuc-nopnx')
nopnx_model = BertForTokenClassification.from_pretrained('CAMeL-Lab/text-editing-zaebuc-nopnx')

pnx_tokenizer = BertTokenizer.from_pretrained('CAMeL-Lab/text-editing-zaebuc-pnx')
pnx_model = BertForTokenClassification.from_pretrained('CAMeL-Lab/text-editing-zaebuc-pnx')


def predict(model, tokenizer, text, decode_iter=1):
    """Predict an edit tag for each subword of `text` and apply the edits; repeat `decode_iter` times."""
    for _ in range(decode_iter):
        tokenized_text = tokenizer(text, return_tensors="pt", is_split_into_words=True)
        with torch.no_grad():
            logits = model(**tokenized_text).logits
        # Pick the most probable edit label at every subword position.
        preds = F.softmax(logits.squeeze(0), dim=-1)
        preds = torch.argmax(preds, dim=-1).cpu().numpy()
        # Drop the predictions for the [CLS] and [SEP] special tokens.
        edits = [model.config.id2label[p] for p in preds[1:-1]]
        subwords = tokenizer.convert_ids_to_tokens(tokenized_text['input_ids'][0][1:-1])
        assert len(edits) == len(subwords)
        # Apply the predicted edits to the subwords to produce the corrected text.
        text = rewrite(subwords=[subwords], edits=[edits])[0][0]
    return text


text = 'يجب الإهتمام ب الصحه و لا سيما ف ي الصحه النفسيه ياشباب المستقبل،،'.split()

output_sent = predict(nopnx_model, nopnx_tokenizer, text, decode_iter=2)
output_sent = predict(pnx_model, pnx_tokenizer, output_sent.split(), decode_iter=1)
print(output_sent) # يجب الاهتمام بالصحة ولا سيما في الصحة النفسية يا شباب المستقبل .
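
The two calls above form a cascade: the NoPnx model runs first (two decoding iterations), and its output is re-split on whitespace and passed to the Pnx model (one iteration). A minimal sketch of that chaining as a reusable helper is shown here. The names `run_cascade`, `Stage`, `toy_nopnx`, and `toy_pnx` are hypothetical and not part of our repository; the toy stages are stand-ins so the sketch runs without downloading the models.

```python
from typing import Callable, List, Sequence, Tuple

# A stage takes a list of words and a number of decoding iterations and
# returns the corrected sentence as one string (the same shape as `predict`).
Stage = Callable[[List[str], int], str]


def run_cascade(sentence: str, stages: Sequence[Tuple[Stage, int]]) -> str:
    """Apply each (stage, decode_iter) pair in order, re-splitting on whitespace between stages."""
    for stage, decode_iter in stages:
        sentence = stage(sentence.split(), decode_iter)
    return sentence


# Toy stand-ins (hypothetical, for illustration only; they ignore decode_iter).
def toy_nopnx(words: List[str], decode_iter: int) -> str:
    # Pretend correction: lowercase every word.
    return " ".join(w.lower() for w in words)


def toy_pnx(words: List[str], decode_iter: int) -> str:
    # Pretend punctuation fix: make sure the sentence ends with a period token.
    if words and words[-1] != ".":
        words = words + ["."]
    return " ".join(words)


result = run_cascade("Hello  World", [(toy_nopnx, 2), (toy_pnx, 1)])
print(result)  # hello world .
```

With the real models, each stage could be a small wrapper around `predict`, for example `lambda words, n: predict(nopnx_model, nopnx_tokenizer, words, decode_iter=n)` followed by the equivalent wrapper for `pnx_model`.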

Citation

@misc{alhafni-habash-2025-enhancing,
      title={Enhancing Text Editing for Grammatical Error Correction: Arabic as a Case Study}, 
      author={Bashar Alhafni and Nizar Habash},
      year={2025},
      eprint={2503.00985},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.00985}, 
}