Parkerlambert123 committed on
Commit ede5af4 · verified · 1 Parent(s): bd874a3

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -26,7 +26,7 @@ In the [LLM Creative Story-Writing Benchmark](https://github.com/lechmazur/writi
 
 The figure below shows the performance comparison across different domains in WritingBench:
 
-![writingbench](./images/writingbench_score.png)
+![writingbench](./writingbench_score.png)
 
 <figcaption style="text-align:center; font-size:0.9em; color:#666">
 Figure 1: WritingBench performance of Zhi-Create-DSR1-14B and DeepSeek-R1-Distill-Qwen-14B across 6 domains and 3 writing requirements evaluated with WritingBench critic model (scale: 1-10). The six domains include: (D1) Academic & Engineering, (D2) Finance & Business, (D3) Politics & Law, (D4) Literature & Art, (D5) Education, and (D6) Advertising & Marketing. The three writing requirements assessed are: (R1) Style, (R2) Format, and (R3) Length. Here, "C" indicates category-specific scores.
@@ -51,7 +51,7 @@ Our evaluation results suggest promising improvements in the model's creative wr
 
 With respect to general capabilities, evaluations indicate modest improvements of **2%–5% in knowledge and reasoning tasks (CMMLU, MMLU-Pro)**, alongside encouraging progress in mathematical reasoning as measured by benchmarks such as **AIME-2024, AIME-2025, and GSM8K**. The results suggest that the model maintains a balanced performance profile, with improvements observed across creative writing, knowledge/reasoning, and mathematical tasks compared to DeepSeek-R1-Distill-Qwen-14B. These characteristics potentially make it suitable for a range of general-purpose applications. We conducted additional evaluations on the instruction-following ifeval benchmark, with experimental results demonstrating a performance improvement in model capabilities from an initial score of **71.43** to an enhanced score of **74.71**.
 
-![general](./images/general_score.png)
+![general](./general_score.png)
 
 <figcaption style="text-align:center; font-size:0.9em; color:#666">
 Figure 2: When evaluating model performance, it is recommended to conduct multiple tests and average the results. (We use n=16 and max_tokens=32768 for mathematical tasks and n=2 for others)
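The Figure 2 caption recommends running each benchmark several times and averaging. A minimal sketch of that protocol follows, assuming a caller-supplied `run_once(seed)` function that performs one evaluation pass and returns a score. `run_once`, `averaged_score`, and `samples_for` are hypothetical names, and the README does not describe how individual runs are seeded or aggregated beyond the sample counts given in the caption.

```python
from statistics import mean, stdev
from typing import Callable, Tuple

# Sample counts and token limit as stated in the Figure 2 caption.
N_SAMPLES = {"math": 16, "other": 2}
MAX_TOKENS_MATH = 32768


def samples_for(task_type: str) -> int:
    """Return the number of runs for a task type (hypothetical helper)."""
    return N_SAMPLES.get(task_type, N_SAMPLES["other"])


def averaged_score(run_once: Callable[[int], float], n: int) -> Tuple[float, float]:
    """Run an evaluation n times and return (mean, sample standard deviation).

    run_once is an assumed callable: it takes a seed and returns one score.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    scores = [run_once(seed) for seed in range(n)]
    # The sample standard deviation is undefined for a single run.
    spread = stdev(scores) if n > 1 else 0.0
    return mean(scores), spread
```

Reporting the spread next to the mean helps show whether a gap between two models is larger than the run-to-run noise, which matters most for the tasks evaluated with only n=2.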