---
license: apache-2.0
tags:
- qwen
- math
- fine-tuned
- open-r1
- supervised-finetuning
- evaluation
datasets:
- open-r1/OpenR1-Math-220k
- Idavidrein/gpqa
- HuggingFaceH4/MATH-500
metrics:
- accuracy
base_model:
- Qwen/Qwen2.5-0.5B
pipeline_tag: text-generation
library_name: transformers
language:
- en
model-index:
- name: Qwen2.5-0.5B-Math220k (Checkpoint-15000)
  results:
  - task:
      type: multiple-choice
    dataset:
      name: GPQA
      type: open
    metrics:
    - name: Accuracy (Clean Extraction)
      type: accuracy
      value: 0.386
    - name: Accuracy (All Extraction)
      type: accuracy
      value: 0.410
  - task:
      type: mathematical-reasoning
    dataset:
      name: MATH500
      type: open
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.219
---

# Qwen2.5-0.5B-Math220k (Checkpoint-15000)

This model is a supervised fine-tuned variant of [Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B), trained on the **`default` split of math220k (OpenR1-Math-220k)** for step-by-step mathematical reasoning and standardized answer formatting.

## Training

- **Base model:** Qwen2.5-0.5B
- **Dataset:** math220k `default` subset (83k train, 10k test, filtered for verified answers)
- **Training steps:** 15,000
- **Checkpoint interval:** 500 steps
- **Learning rate:** 2.5e-6 with a **cosine decay schedule**
- **Batch size:** 64
- **Prompting format:** guided step-by-step reasoning, with enforced final answer formatting (`Answer:` or `\boxed{}`)

## Evaluation

All evaluations were performed on **bootstrapped datasets (size = 1000)** to ensure fair, stable comparisons.

| Dataset        | Accuracy (Clean) | Accuracy (All) |
|----------------|------------------|----------------|
| GPQA (merged)  | 0.386            | 0.410          |
| MATH500        | 0.219            | N/A            |

- **Clean extraction:** only answers in canonical form (`Answer: X`, `\boxed{X}`)
- **All extraction:** also includes fuzzy-matched final answers in phrases like "the correct answer is X"

Evaluation was performed with `eval_checkpoints_auto.py` using local bootstrapped datasets.

For detailed evaluation results and charts, see:
[DexinRen/open-r1_DR_test/dexin_src/eval_output](https://github.com/DexinRen/open-r1_DR_test/tree/master/dexin_src/eval_output)

## Limitations

- The math220k dataset contains noisy solution traces (only final answers were filtered for correctness), so the model may pick up flawed reasoning patterns.
- This checkpoint prioritizes **formatting discipline and correctness of final answers** over full reasoning transparency.
- MATH500 generalization is slightly degraded relative to the base model, which is expected after narrow SFT.

## Files Included

- `model.safetensors`: model weights
- `tokenizer.json`, `vocab.json`, `config.json`: tokenizer and model config
- All files are stored with **Git LFS** for proper large-file support.

## Citation

If you use this model, please cite:

Dexin Ren. "Fine-Tuning Qwen2.5-0.5B for Mathematical Reasoning." 2025. Available at: https://huggingface.co/DexinR/qwen2.5-math220k-ckpt15000

## Recommended Usage

For basic use:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("DexinR/qwen2.5-math220k-ckpt15000")
model = AutoModelForCausalLM.from_pretrained("DexinR/qwen2.5-math220k-ckpt15000", trust_remote_code=True)
```

For reproducible evaluation, use the [custom formatter and evaluation code](https://github.com/DexinRen/open-r1_DR_test):

```python
from transformers import AutoTokenizer

from dexin_src.utils.formatter import Formatter

tokenizer = AutoTokenizer.from_pretrained("DexinR/qwen2.5-math220k-ckpt15000")
formatter = Formatter(tokenizer)
formatted_prompt = formatter.format_prompt(example)  # example is a row from your dataset
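For a quick end-to-end check, the sketch below generates a completion for a single problem and prints the model's continuation. The prompt wording, `max_new_tokens` budget, and greedy decoding are illustrative assumptions, not the exact configuration used during training or evaluation.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "DexinR/qwen2.5-math220k-ckpt15000"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
model.eval()

# Illustrative prompt: ask for step-by-step reasoning and a boxed final answer,
# mirroring the answer formatting the model was fine-tuned to produce.
prompt = (
    "Solve the following problem step by step and give the final answer as \\boxed{...}.\n\n"
    "Problem: What is 17 * 24?\n\nSolution:"
)

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=256,                    # assumed budget; raise for longer problems
        do_sample=False,                       # greedy decoding for reproducibility
        pad_token_id=tokenizer.eos_token_id,
    )

# Print only the newly generated tokens, not the prompt.
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```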
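The clean/all extraction distinction described in the Evaluation section can be approximated with simple pattern matching on the generated text. The regular expressions below are an illustrative re-implementation of those rules, not the exact logic in `eval_checkpoints_auto.py`.

```python
import re

def extract_answer(text: str, mode: str = "clean"):
    """Pull a final answer out of model output.

    "clean" accepts only the canonical forms the model was trained to emit
    (\\boxed{...} or an "Answer: ..." line); "all" additionally accepts fuzzy
    phrasings such as "the correct answer is X". These patterns are assumed
    approximations and may differ from the repository's evaluation script.
    """
    boxed = re.search(r"\\boxed\{([^{}]+)\}", text)
    if boxed:
        return boxed.group(1).strip()
    answer_line = re.search(r"Answer:\s*(.+)", text)
    if answer_line:
        return answer_line.group(1).strip()
    if mode == "all":
        fuzzy = re.search(r"the correct answer is\s*([^\s.,]+)", text, re.IGNORECASE)
        if fuzzy:
            return fuzzy.group(1).strip()
    return None
```

Accuracy under the two modes then differs only in how many non-canonical completions are counted as answered.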