Zhihu-ai
/

Zhi-Create-DSR1-14B

@@ -26,7 +26,7 @@ In the [LLM Creative Story-Writing Benchmark](https://github.com/lechmazur/writi
 The figure below shows the performance comparison across different domains in WritingBench:
-![writingbench](./images/writingbench_score.png)
 <figcaption style="text-align:center; font-size:0.9em; color:#666">
 Figure 1: WritingBench performance of Zhi-Create-DSR1-14B and DeepSeek-R1-Distill-Qwen-14B across 6 domains and 3 writing requirements evaluated with WritingBench critic model (scale: 1-10). The six domains include: (D1) Academic & Engineering, (D2) Finance & Business, (D3) Politics & Law, (D4) Literature & Art, (D5) Education, and (D6) Advertising & Marketing. The three writing requirements assessed are: (R1) Style, (R2) Format, and (R3) Length. Here, "C" indicates category-specific scores.
@@ -51,7 +51,7 @@ Our evaluation results suggest promising improvements in the model's creative wr
 With respect to general capabilities, evaluations indicate modest improvements of **2%–5% in knowledge and reasoning tasks (CMMLU, MMLU-Pro)**, alongside encouraging progress in mathematical reasoning as measured by benchmarks such as **AIME-2024, AIME-2025, and GSM8K**. The results suggest that the model maintains a balanced performance profile, with improvements observed across creative writing, knowledge/reasoning, and mathematical tasks compared to DeepSeek-R1-Distill-Qwen-14B. These characteristics potentially make it suitable for a range of general-purpose applications. We conducted additional evaluations on the instruction-following ifeval benchmark, with experimental results demonstrating a performance improvement in model capabilities from an initial score of **71.43** to an enhanced score of **74.71**.
-![general](./images/general_score.png)
 <figcaption style="text-align:center; font-size:0.9em; color:#666">
 Figure 2: When evaluating model performance, it is recommended to conduct multiple tests and average the results. (We use n=16 and max_tokens=32768 for mathematical tasks and n=2 for others)

 The figure below shows the performance comparison across different domains in WritingBench:
+![writingbench](./writingbench_score.png)
 <figcaption style="text-align:center; font-size:0.9em; color:#666">
 Figure 1: WritingBench performance of Zhi-Create-DSR1-14B and DeepSeek-R1-Distill-Qwen-14B across 6 domains and 3 writing requirements evaluated with WritingBench critic model (scale: 1-10). The six domains include: (D1) Academic & Engineering, (D2) Finance & Business, (D3) Politics & Law, (D4) Literature & Art, (D5) Education, and (D6) Advertising & Marketing. The three writing requirements assessed are: (R1) Style, (R2) Format, and (R3) Length. Here, "C" indicates category-specific scores.
 With respect to general capabilities, evaluations indicate modest improvements of **2%–5% in knowledge and reasoning tasks (CMMLU, MMLU-Pro)**, alongside encouraging progress in mathematical reasoning as measured by benchmarks such as **AIME-2024, AIME-2025, and GSM8K**. The results suggest that the model maintains a balanced performance profile, with improvements observed across creative writing, knowledge/reasoning, and mathematical tasks compared to DeepSeek-R1-Distill-Qwen-14B. These characteristics potentially make it suitable for a range of general-purpose applications. We conducted additional evaluations on the instruction-following ifeval benchmark, with experimental results demonstrating a performance improvement in model capabilities from an initial score of **71.43** to an enhanced score of **74.71**.
+![general](./general_score.png)
 <figcaption style="text-align:center; font-size:0.9em; color:#666">
 Figure 2: When evaluating model performance, it is recommended to conduct multiple tests and average the results. (We use n=16 and max_tokens=32768 for mathematical tasks and n=2 for others)