---
license: apache-2.0
---



# GrammarCoder

We propose GrammarCoder, a grammar-based code model built on a decoder-only architecture, which excels at auto-regressive tasks such as code generation, completion, and translation. To enhance its code generation ability, we apply continued pre-training and instruction tuning on top of existing code model weights (i.e., DeepSeek-Coder-1.3B-Base, Qwen2.5-1.5B-Base, and Qwen2.5-7B-Base), expanding its knowledge base.


# Results
Compared with baselines trained under the same experimental settings, GrammarCoder achieves better performance on standard code generation benchmarks. The following table reports code generation accuracy against the baselines.

| **Model**                      | **HumanEval** | **HumanEval+** | **MBPP** | **MBPP+** |
|--------------------------------|---------------|----------------|----------|-----------|
| **Base Models**                |               |                |          |           |
| DeepSeek-Coder-1.3B-Base       | 34.8          | 28.7           | 56.7     | 47.9      |
| Qwen2.5-1.5B-Base              | 37.2          | 32.9           | 60.2     | 49.6      |
| Qwen2.5-7B-Base                | 57.9          | 50.6           | 74.9     | 62.9      |
| **Normal Token-Based CPT**     |               |                |          |           |
| DeepSeek-Coder-1.3B-Base (CPT) | 43.9          | 39.6           | 61.4     | 51.3      |
| Qwen2.5-1.5B-Base (CPT)        | 50.6          | 42.7           | 60.3     | 51.1      |
| Qwen2.5-7B-Base (CPT)          | 68.9          | 65.2           | 81.5     | 69.8      |
| **Grammar-Based CPT**          |               |                |          |           |
| **GrammarCoder-1.3B-Base**     | 63.4          | 57.3           | 68.3     | 56.9      |
| **GrammarCoder-1.5B-Base**     | 63.4          | 59.1           | 64.8     | 55.3      |
| **GrammarCoder-7B-Base**       | **76.8**      | **71.3**       | **85.2** | **71.7**  |


The models have been open-sourced; the weights and corresponding tokenizers are available on Hugging Face in the [GrammarCoder collection](https://huggingface.co/collections/qyliang/grammarcoder-683fe8778270d31b08fe54a4).
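
As a quick sanity check, the released checkpoints should load with the standard `transformers` API. The sketch below uses a hypothetical repository id (`qyliang/GrammarCoder-7B-Base`); substitute the actual id from the collection linked above.

```python
# Minimal loading sketch; the repo id is an assumption -- take the real
# one from the GrammarCoder collection linked above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "qyliang/GrammarCoder-7B-Base"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))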

# Requirements
- tree_sitter: 0.23.2
- tree_sitter_python: 0.23.5
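
These packages provide the grammar parsing that a grammar-based code representation is built on. As a minimal sketch (an illustration of the pinned packages, not the model's actual preprocessing pipeline, which is described in the repository), the following parses a Python snippet into a syntax tree:

```python
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

# Build a parser for the Python grammar (tree_sitter 0.23.x API).
PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)

code = b"def add(a, b):\n    return a + b\n"
tree = parser.parse(code)

# Walk the syntax tree and print each node's grammar type.
def walk(node, depth=0):
    print("  " * depth + node.type)
    for child in node.children:
        walk(child, depth + 1)

walk(tree.root_node)
```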


# Evaluation
Please refer to the [GitHub repository](https://github.com/LIANGQINGYUAN/GrammarCoder) for details on the evaluation.
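
The HumanEval(+)/MBPP(+) numbers above are the kind of scores reported by the EvalPlus harness. As a hedged sketch (not necessarily the authors' exact harness; see the repository for the official setup, and note the hypothetical repo id again), one could produce a samples file with EvalPlus's data API and then score it with the `evalplus.evaluate` CLI:

```python
# Hedged sketch of generating HumanEval+ samples for EvalPlus scoring.
from evalplus.data import get_human_eval_plus, write_jsonl
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "qyliang/GrammarCoder-7B-Base"  # hypothetical repo id, as above
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

def complete(prompt: str) -> str:
    # Greedy decoding; return only the tokens generated after the prompt.
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=512)
    gen = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(gen, skip_special_tokens=True)

samples = [
    {"task_id": task_id, "solution": problem["prompt"] + complete(problem["prompt"])}
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)
# Then score with: evalplus.evaluate --dataset humaneval --samples samples.jsonl
```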

# Citation
```bibtex
@article{liang2025grammar,
  title={Grammar-Based Code Representation: Is It a Worthy Pursuit for LLMs?},
  author={Liang, Qingyuan and Zhang, Zhao and Sun, Zeyu and Lin, Zheng and Luo, Qi and Xiao, Yueyi and Chen, Yizhou and Zhang, Yuqun and Zhang, Haotian and Zhang, Lu and Chen, Bin and Xiong, Yingfei},
  journal={arXiv preprint arXiv:2503.05507},
  year={2025}
}
```