---
license: apache-2.0
tags:
- qwen
- math
- fine-tuned
- open-r1
- supervised-finetuning
- evaluation
datasets:
- open-r1/OpenR1-Math-220k
- Idavidrein/gpqa
- HuggingFaceH4/MATH-500
metrics:
- accuracy
base_model:
- Qwen/Qwen2.5-0.5B
pipeline_tag: text-generation
library_name: transformers
language:
- en
model-index:
- name: Qwen2.5-0.5B-Math220k (Checkpoint-15000)
  results:
  - task:
      type: multiple-choice
    dataset:
      name: GPQA
      type: open
    metrics:
    - name: Accuracy (Clean Extraction)
      type: accuracy
      value: 0.386
    - name: Accuracy (All Extraction)
      type: accuracy
      value: 0.410
  - task:
      type: mathematical-reasoning
    dataset:
      name: MATH500
      type: open
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.219
---

# Qwen2.5-0.5B-Math220k (Checkpoint-15000)

This model is a supervised fine-tuned variant of [Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B), trained on the **`default` split of math220k (OpenR1-Math-220k)** for step-by-step mathematical reasoning and standardized answer formatting.

## Training

- **Base model:** Qwen2.5-0.5B
- **Dataset:** math220k `default` subset (83k train, 10k test, filtered for verified answers)
- **Training steps:** 15,000
- **Checkpoint interval:** 500 steps
- **Learning rate:** 2.5e-6 with a **cosine decay schedule**
- **Batch size:** 64
- **Prompting format:** guided step-by-step reasoning, with enforced final answer formatting (`Answer:` or `\boxed{}`)

## Evaluation

All evaluations were performed on **bootstrapped datasets (size = 1000)** to ensure fair, stable comparisons.

| Dataset        | Accuracy (Clean) | Accuracy (All) |
|----------------|------------------|----------------|
| GPQA (merged)  | 0.386            | 0.410          |
| MATH500        | 0.219            | N/A            |

- **Clean extraction:** only answers in canonical form (`Answer: X`, `\boxed{X}`)
- **All extraction:** also includes fuzzy-matched final answers in phrases like "the correct answer is X"

Evaluation was performed with `eval_checkpoints_auto.py` using local bootstrapped datasets.

For detailed evaluation results and charts, see:
[DexinRen/open-r1_DR_test/dexin_src/eval_output](https://github.com/DexinRen/open-r1_DR_test/tree/master/dexin_src/eval_output)

## Limitations

- The math220k dataset contains noisy solution traces (only final answers were filtered for correctness), so the model may pick up flawed reasoning patterns.
- This checkpoint prioritizes **formatting discipline and correctness of final answers** over full reasoning transparency.
- MATH500 generalization is slightly degraded relative to the base model, which is expected after narrow SFT.

## Files Included

- `model.safetensors`: model weights
- `tokenizer.json`, `vocab.json`, `config.json`: tokenizer and model config
- All files are stored with **Git LFS** for proper large-file support.

## Citation

If you use this model, please cite:

Dexin Ren. "Fine-Tuning Qwen2.5-0.5B for Mathematical Reasoning." 2025. Available at: https://huggingface.co/DexinR/qwen2.5-math220k-ckpt15000

## Recommended Usage

For basic use:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("DexinR/qwen2.5-math220k-ckpt15000")
model = AutoModelForCausalLM.from_pretrained("DexinR/qwen2.5-math220k-ckpt15000", trust_remote_code=True)
```

For reproducible evaluation, use the [custom formatter and evaluation code](https://github.com/DexinRen/open-r1_DR_test):

```python
from transformers import AutoTokenizer

from dexin_src.utils.formatter import Formatter

tokenizer = AutoTokenizer.from_pretrained("DexinR/qwen2.5-math220k-ckpt15000")
formatter = Formatter(tokenizer)
formatted_prompt = formatter.format_prompt(example)  # example is a row from your dataset
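For a quick end-to-end check, the sketch below generates a completion for a single problem and prints the model's continuation. The prompt wording, `max_new_tokens` budget, and greedy decoding are illustrative assumptions, not the exact configuration used during training or evaluation.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "DexinR/qwen2.5-math220k-ckpt15000"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
model.eval()

# Illustrative prompt: ask for step-by-step reasoning and a boxed final answer,
# mirroring the answer formatting the model was fine-tuned to produce.
prompt = (
    "Solve the following problem step by step and give the final answer as \\boxed{...}.\n\n"
    "Problem: What is 17 * 24?\n\nSolution:"
)

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=256,                    # assumed budget; raise for longer problems
        do_sample=False,                       # greedy decoding for reproducibility
        pad_token_id=tokenizer.eos_token_id,
    )

# Print only the newly generated tokens, not the prompt.
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```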
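The clean/all extraction distinction described in the Evaluation section can be approximated with simple pattern matching on the generated text. The regular expressions below are an illustrative re-implementation of those rules, not the exact logic in `eval_checkpoints_auto.py`.

```python
import re

def extract_answer(text: str, mode: str = "clean"):
    """Pull a final answer out of model output.

    "clean" accepts only the canonical forms the model was trained to emit
    (\\boxed{...} or an "Answer: ..." line); "all" additionally accepts fuzzy
    phrasings such as "the correct answer is X". These patterns are assumed
    approximations and may differ from the repository's evaluation script.
    """
    boxed = re.search(r"\\boxed\{([^{}]+)\}", text)
    if boxed:
        return boxed.group(1).strip()
    answer_line = re.search(r"Answer:\s*(.+)", text)
    if answer_line:
        return answer_line.group(1).strip()
    if mode == "all":
        fuzzy = re.search(r"the correct answer is\s*([^\s.,]+)", text, re.IGNORECASE)
        if fuzzy:
            return fuzzy.group(1).strip()
    return None
```

Accuracy under the two modes then differs only in how many non-canonical completions are counted as answered.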