SWEET MADAR CODA Model

Model Description

CAMeL-Lab/text-editing-coda is a text editing model for grammatical error correction (GEC) in dialectal Arabic (DA). It is based on AraBERTv02, which we fine-tuned on the MADAR CODA corpus. The model was introduced in our ACL 2025 paper, Enhancing Text Editing for Grammatical Error Correction: Arabic as a Case Study, where we refer to it as SWEET (Subword Edit Error Tagger). It achieved state-of-the-art (SOTA) performance on the MADAR CODA dataset. Details about the training procedure, data preprocessing, and hyperparameters are available in the paper. The fine-tuning code and associated resources are publicly available in our GitHub repository: https://github.com/CAMeL-Lab/text-editing.

Intended uses

To use the CAMeL-Lab/text-editing-coda model, you must clone our text editing GitHub repository and follow the installation requirements. We used this SWEET model to report results on the MADAR CODA dev and test sets in our paper.

How to use

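Clone our text editing GitHub repository and follow the installation requirements. The example below then predicts one edit tag per subword and applies those edits to produce the corrected sentence: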

from transformers import BertTokenizer, BertForTokenClassification
import torch
import torch.nn.functional as F
from gec.tag import rewrite

tokenizer = BertTokenizer.from_pretrained('CAMeL-Lab/text-editing-coda')
model = BertForTokenClassification.from_pretrained('CAMeL-Lab/text-editing-coda')

text = 'أنا بعطيك رقم تلفونو و عنوانو'.split()

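# Tokenize the pre-split words into subwords; [CLS] and [SEP] are added automatically
tokenized_text = tokenizer(text, return_tensors="pt", is_split_into_words=True)

with torch.no_grad():
    # One row of scores per position, including [CLS] and [SEP]
    logits = model(**tokenized_text).logits
    preds = F.softmax(logits.squeeze(), dim=-1)
    preds = torch.argmax(preds, dim=-1).cpu().numpy()
    # Map label ids to edit tags, dropping the [CLS] and [SEP] positions
    edits = [model.config.id2label[p] for p in preds[1:-1]]
    # Sanity check: exactly one edit per subword
    assert len(edits) == len(tokenized_text['input_ids'][0][1:-1])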

print(edits) # ['R_[ا]K*', 'K*I_[ا]K', 'K*', 'K*', 'K*', 'K*', 'K*R_[ه]', 'K*', 'MK*', 'R_[ه]']
# Recover the subword strings, then apply the predicted edits to rebuild the sentence
subwords = tokenizer.convert_ids_to_tokens(tokenized_text['input_ids'][0][1:-1])
output_sent = rewrite(subwords=[subwords], edits=[edits])[0][0]
print(output_sent) # انا باعطيك رقم تلفونه وعنوانه
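
The decoding step in the example above (softmax, argmax, then dropping the first and last positions) can be illustrated without loading the model. The following is a minimal sketch in plain Python, assuming toy logits and a three-label map made up for illustration; `decode_edits`, `TOY_ID2LABEL`, and `toy_logits` are hypothetical names that are not part of the repository, and the real label set comes from `model.config.id2label`.

```python
import math

# Hypothetical label map for illustration only; the real one is model.config.id2label.
TOY_ID2LABEL = {0: "K*", 1: "R_[ه]", 2: "MK*"}


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(row):
    """Index of the largest value in a row."""
    return max(range(len(row)), key=lambda i: row[i])


def decode_edits(logits, id2label):
    """Pick the highest-scoring label per position, skipping the first and
    last positions, which hold [CLS] and [SEP] for a single BERT input."""
    return [id2label[argmax(row)] for row in logits[1:-1]]


# Toy logits: one row per position ([CLS], three subwords, [SEP]).
toy_logits = [
    [0.1, 0.2, 0.3],   # [CLS], ignored
    [2.0, 0.5, -1.0],  # highest score at label 0
    [0.0, 3.1, 0.2],   # highest score at label 1
    [-0.5, 0.1, 1.7],  # highest score at label 2
    [0.3, 0.2, 0.1],   # [SEP], ignored
]

edits = decode_edits(toy_logits, TOY_ID2LABEL)

# Softmax is monotonic, so it never changes which index wins the argmax.
same_winner = all(argmax(softmax(row)) == argmax(row) for row in toy_logits)
```

Because softmax preserves the ordering of scores, the explicit softmax in the example above is optional when only the labels are needed; it matters if the probabilities themselves are of interest.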

Citation

@misc{alhafni-habash-2025-enhancing,
      title={Enhancing Text Editing for Grammatical Error Correction: Arabic as a Case Study}, 
      author={Bashar Alhafni and Nizar Habash},
      year={2025},
      eprint={2503.00985},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.00985}, 
}
