File size: 2,209 Bytes
b0240c0
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
---
license: apache-2.0
---
# Implementation of ACL 2024 findings "Improving Grammatical Error Correction via Contextual Data Augmentation"

[github link](https://github.com/wyxstriker/CDA4GEC)

# Model Weights
We release the model weights of each training stage.
Our model is trained based on the Fairseq framework, details of the weights and links to them are below.

|Name|Data Info|Download Link|
|:--:|--|--|
|Stage1|Pre-training on [C4 synthetic data](https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction) with 200M scale|[CDA4GEC](https://huggingface.co/DecoderImmortal/CDA4GEC)/tree/main/stage1_checkpoint_best.pt|
|Stage2+|Fine-tuning on the augmented Lang8, NUCLE, FCE and W&I+L datasets|[CDA4GEC](https://huggingface.co/DecoderImmortal/CDA4GEC)/tree/main/stage2_checkpoint_best.pt|
|Stage3+|Continue fine-tuning on the augmented W&I+L dataset|[CDA4GEC](https://huggingface.co/DecoderImmortal/CDA4GEC)/tree/main/stage3_checkpoint_best.pt|

# Synthetic Data
> We only release the synthetic pseudo-data, please follow the official process to apply for the original annotated data.


|DataInfo|Amount|Source|Path|
|:--:|:--:|:--:|:--:|
|stage2+|2M|Lang-8 & NUCLE & FCE & W&I+L|[CDA4GEC](https://huggingface.co/DecoderImmortal/CDA4GEC)/tree/main/pseudo/stage2|
|stage3+|200K|W&I+L|[CDA4GEC](https://huggingface.co/DecoderImmortal/CDA4GEC)/tree/main/pseudo/stage3|

# Citation
If you find this work is useful for your research, please cite our paper:

```
@inproceedings{wang-etal-2024-improving-grammatical,
    title = "Improving Grammatical Error Correction via Contextual Data Augmentation",
    author = "Wang, Yixuan  and
      Wang, Baoxin  and
      Liu, Yijun  and
      Zhu, Qingfu  and
      Wu, Dayong  and
      Che, Wanxiang",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand and virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.647",
    pages = "10898--10910",
}
```