Model Card

A supervised fine-tuned (SFT) model for mathematical reasoning, trained as part of our MathIF project.

GitHub repository: https://github.com/TingchenFu/MathIF

Training Details

We base our experiments on the DeepScaleR dataset, which contains approximately 40k math reasoning samples. We first distill long chain-of-thought (CoT) reasoning traces from QwQ-32B, then filter out samples where QwQ-32B fails to generate a correct answer or where the CoT exceeds 8192 tokens. This yields 18k high-quality examples.
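The filtering step can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the `Sample` record layout, the field names, and the exact-match correctness check are all assumptions made for the example; only the 8192-token cutoff comes from the description above.

```python
from dataclasses import dataclass

MAX_COT_TOKENS = 8192  # length cutoff stated in the card


@dataclass
class Sample:
    # Hypothetical record layout; the real dataset schema may differ.
    question: str
    reference_answer: str
    teacher_answer: str   # final answer extracted from the teacher's trace
    cot_token_count: int  # token length of the teacher's CoT trace


def is_correct(sample: Sample) -> bool:
    # Assumption: exact string match after trimming whitespace.
    # A real math verifier would normalize expressions more carefully.
    return sample.teacher_answer.strip() == sample.reference_answer.strip()


def keep(sample: Sample) -> bool:
    # Drop samples the teacher got wrong or whose CoT exceeds the cutoff.
    return is_correct(sample) and sample.cot_token_count <= MAX_COT_TOKENS


def filter_samples(samples: list[Sample]) -> list[Sample]:
    return [s for s in samples if keep(s)]
```

A trace of exactly 8192 tokens is kept here, since the description only excludes traces that exceed the limit.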

Training is conducted on 16 NVIDIA H100 GPUs. For reinforcement learning, we adopt the GRPO framework with verifiable outcome-based rewards. The model is trained with the VeRL framework, with most hyperparameters left at their default settings.
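For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage it is built around: several responses are sampled for one prompt, each gets an outcome reward, and rewards are standardized within the group. This is an illustrative sketch, not the VeRL implementation; the function names, the exact-match reward, the use of the population standard deviation, and the `eps` value are assumptions.

```python
from statistics import mean, pstdev


def outcome_reward(predicted: str, reference: str) -> float:
    # Assumption: a binary verifiable reward based on exact answer match.
    return 1.0 if predicted.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    # Standardize each reward against the other responses to the same prompt.
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every response in a group receives the same reward, all advantages are zero, so that prompt contributes no learning signal.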

Evaluation

For decoding, we use nucleus sampling (T=1.0, p=0.95) with a maximum generation length of 16,384 tokens, and the vLLM engine for efficient inference.
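These decoding settings map onto vLLM's `SamplingParams` roughly as shown below. This is a sketch under assumptions: the `generate` helper and the `DECODING` dictionary are names introduced here, and the prompt format (for example, whether a chat template is applied) is not specified in this card.

```python
# Decoding settings stated in the card.
DECODING = {"temperature": 1.0, "top_p": 0.95, "max_tokens": 16384}


def generate(prompts: list[str], model_id: str) -> list[str]:
    # Imported lazily so the settings above can be inspected without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_id)
    params = SamplingParams(**DECODING)
    outputs = llm.generate(prompts, params)
    # Each RequestOutput holds one or more completions; take the first.
    return [out.outputs[0].text for out in outputs]
```

A call would look like `generate(["What is 12 * 13?"], "TingchenFu/sft_8k_qwen-2.5-7b_05021953")`, which requires a GPU environment with vLLM installed.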

Citation

BibTeX:

@article{fu2025scaling,
  title={Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models},
  author={Fu, Tingchen and Gu, Jiawei and Li, Yafu and Qu, Xiaoye and Cheng, Yu},
  journal={arXiv preprint arXiv:2505.14810},
  year={2025}
}

TingchenFu/sft_8k_qwen-2.5-7b_05021953
