Model Card
SFT for mathematical reasoning in our MathIF project.
Github Repository: https://github.com/TingchenFu/MathIF
Training Details
We base our experiments on the DeepScaler dataset, which contains approximately 40k math reasoning samples. We first distill long CoT reasoning traces from QwQ-32B, filtering out samples where QwQ-32B fails to generate a correct answer or the CoT exceeds 8192 tokens. This results in 18k high-quality examples.
The training is conducted using 16 NVIDIA H100 GPUs. For reinforcement learning, we adopt the GRPO framework and use verifiable outcome-based rewards. The model is trained with VeRL framework with most hyper-parameters following the default setting.
Evaluation
We use nucleus sampling (T=1.0, p=0.95) with a maximum generation length of 16,384 tokens for decoding and vLLM engine for efficient inference.
Citation
BibTeX:
@article{fu2025scaling,
title={Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models},
author={Fu, Tingchen and Gu, Jiawei and Li, Yafu and Qu, Xiaoye and Cheng, Yu},
journal={arXiv preprint arXiv:2505.14810},
year={2025}
}
- Downloads last month
- 4
Model tree for TingchenFu/sft_8k_qwen-2.5-7b_05021953
Base model
Qwen/Qwen2.5-7B