Repository for:

ThinkEdit-deepseek-llama3-8b

(We also release ThinkEdit versions of the Qwen-distilled models: ThinkEdit-deepseek-qwen-1.5b, ThinkEdit-deepseek-qwen-14b, and ThinkEdit-deepseek-qwen-32b.)

Authors: Chung-En Sun, Ge Yan, Tsui-Wei Weng
Paper: ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models

GitHub: https://github.com/Trustworthy-ML-Lab/ThinkEdit


Introduction

Reasoning-augmented models sometimes fail by generating overly short, abstract chain-of-thought (CoT) traces, which hurts their accuracy.

ThinkEdit is a lightweight weight-editing method that:

  • Identifies the ~4% of attention heads that drive "short reasoning"
  • Removes the "short reasoning" direction from their output projections
  • Edits only ~0.2% of total model parameters
  • Boosts accuracy, especially on cases with short reasoning traces
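The editing step above amounts to projecting a direction out of each selected head's output weights. The following NumPy sketch illustrates the idea; the function name, the weight shapes, and the way the direction is obtained are illustrative assumptions, not the paper's exact procedure (see the GitHub repo for that):

```python
# Illustrative sketch: remove a "short reasoning" direction d from one
# attention head's slice of the output projection. Shapes are assumptions.
import numpy as np

def remove_direction(W_head: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Project direction d out of a head's output weights.

    W_head: (head_dim, hidden_dim) slice of the attention output projection
            for a single head; each row writes into the residual stream.
    d:      (hidden_dim,) "short reasoning" direction (need not be unit norm).
    """
    d = d / np.linalg.norm(d)
    # Subtract each row's component along d, so the edited head can no
    # longer write anything into the residual stream along that direction.
    return W_head - np.outer(W_head @ d, d)

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 4096))   # one head's output weights
d = rng.standard_normal(4096)          # extracted direction (placeholder)
W_edited = remove_direction(W, d)
# The edited weights have (numerically) zero component along d.
print(np.abs(W_edited @ (d / np.linalg.norm(d))).max())
```

Because only the selected heads' slices of the output projection are touched, the total number of modified parameters stays small (~0.2% of the model).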

Full Performance Results

1. Overall Accuracy

| Model | GSM8K | MMLU Elementary Math | MATH-Level1 | MATH-Level5 | MATH-500 |
|---|---|---|---|---|---|
| deepseek-qwen-32b | 92.97 ± 0.39 | 95.93 ± 0.83 | 96.41 ± 0.45 | 91.27 ± 0.53 | 91.62 ± 0.58 |
| ThinkEdit-deepseek-qwen-32b | 95.25 ± 0.25 | 98.02 ± 0.31 | 96.02 ± 0.42 | 91.31 ± 0.50 | 91.60 ± 0.65 |
| deepseek-qwen-14b | 90.80 ± 0.36 | 95.08 ± 0.65 | 96.32 ± 0.35 | 90.25 ± 0.72 | 91.48 ± 0.55 |
| ThinkEdit-deepseek-qwen-14b | 93.78 ± 0.50 | 96.56 ± 0.84 | 96.38 ± 0.52 | 91.03 ± 0.44 | 91.92 ± 0.63 |
| deepseek-llama3-8b | 82.26 ± 0.91 | 96.01 ± 0.62 | 93.46 ± 0.84 | 85.49 ± 0.83 | 87.26 ± 1.16 |
| ThinkEdit-deepseek-llama3-8b | 89.44 ± 0.55 | 96.19 ± 0.73 | 94.44 ± 0.31 | 86.49 ± 0.54 | 88.06 ± 1.09 |
| deepseek-qwen-1.5b | 79.15 ± 1.08 | 68.52 ± 1.56 | 93.00 ± 0.33 | 75.48 ± 0.90 | 82.22 ± 1.29 |
| ThinkEdit-deepseek-qwen-1.5b | 84.56 ± 0.79 | 90.66 ± 0.97 | 93.66 ± 0.62 | 75.05 ± 0.82 | 82.24 ± 0.89 |

2. Accuracy on Short Reasoning Cases (shortest 5% / 10% / 20% of reasoning traces)

Each cell reports accuracy on the 5% / 10% / 20% of problems for which the model produced the shortest reasoning traces.

| Model | GSM8K | MMLU Elementary Math | MATH-Level1 | MATH-Level5 | MATH-500 |
|---|---|---|---|---|---|
| deepseek-qwen-32b | 98.31 / 97.18 / 96.20 | 97.78 / 97.03 / 95.87 | 100.00 / 100.00 / 98.97 | 93.03 / 96.36 / 97.35 | 86.40 / 92.00 / 94.00 |
| ThinkEdit-deepseek-qwen-32b | 98.92 / 97.71 / 97.83 | 97.78 / 97.57 / 97.20 | 100.00 / 100.00 / 98.74 | 98.03 / 98.64 / 97.99 | 92.00 / 94.40 / 95.80 |
| deepseek-qwen-14b | 96.31 / 95.65 / 92.93 | 93.89 / 96.22 / 95.60 | 99.52 / 99.30 / 97.70 | 89.39 / 94.32 / 96.25 | 86.40 / 91.40 / 93.50 |
| ThinkEdit-deepseek-qwen-14b | 96.31 / 96.18 / 96.77 | 97.78 / 95.14 / 96.53 | 99.53 / 98.62 / 98.67 | 96.67 / 97.88 / 98.11 | 91.20 / 93.20 / 95.00 |
| deepseek-llama3-8b | 88.92 / 87.18 / 85.82 | 97.22 / 96.49 / 96.80 | 97.14 / 94.88 / 94.83 | 78.64 / 88.79 / 93.41 | 82.00 / 81.40 / 88.30 |
| ThinkEdit-deepseek-llama3-8b | 97.08 / 95.27 / 93.95 | 97.78 / 98.65 / 97.87 | 100.00 / 99.30 / 98.62 | 95.61 / 96.89 / 97.12 | 92.80 / 93.60 / 94.40 |
| deepseek-qwen-1.5b | 88.46 / 87.48 / 85.02 | 62.78 / 62.16 / 60.53 | 97.62 / 95.12 / 93.91 | 91.52 / 95.00 / 95.72 | 82.40 / 89.80 / 93.40 |
| ThinkEdit-deepseek-qwen-1.5b | 92.62 / 92.90 / 92.32 | 87.78 / 88.11 / 88.67 | 95.71 / 95.58 / 96.44 | 95.15 / 96.59 / 97.27 | 90.80 / 92.00 / 94.20 |

Usage

ThinkEdit models are drop-in replacements: load and prompt them exactly as you would the original DeepSeek-distilled models.
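For example, loading with Hugging Face `transformers` works the same way as for the base distilled model (the prompt and generation settings below are illustrative, not tuned recommendations):

```python
# Illustrative loading sketch; requires `transformers` and `torch` installed,
# plus enough GPU/CPU memory for an 8B-parameter BF16 model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cesun/ThinkEdit-deepseek-llama3-8b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "What is 17 * 24? Think step by step."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```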

Citation

@misc{sun2025thinkedit,
      title={ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models}, 
      author={Chung-En Sun and Ge Yan and Tsui-Wei Weng},
      year={2025},
      eprint={2503.22048},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.22048}, 
}