Update README.md

README.md (CHANGED)

@@ -26,7 +26,7 @@ In the [LLM Creative Story-Writing Benchmark](https://github.com/lechmazur/writi

The figure below shows the performance comparison across different domains in WritingBench:

-[image: WritingBench performance comparison (previous version)]
+[image: WritingBench performance comparison (updated version)]

<figcaption style="text-align:center; font-size:0.9em; color:#666">
Figure 1: WritingBench performance of Zhi-Create-DSR1-14B and DeepSeek-R1-Distill-Qwen-14B across 6 domains and 3 writing requirements, evaluated with the WritingBench critic model (scale: 1-10). The six domains include: (D1) Academic & Engineering, (D2) Finance & Business, (D3) Politics & Law, (D4) Literature & Art, (D5) Education, and (D6) Advertising & Marketing. The three writing requirements assessed are: (R1) Style, (R2) Format, and (R3) Length. Here, "C" indicates category-specific scores.
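
As a reading aid for the caption, the sketch below shows one way the per-category "C" scores could be formed, assuming each evaluated sample yields a (domain, requirement, score) record from the critic model. This is a hypothetical illustration, not code from WritingBench or this repository; the record layout is an assumption.

```python
from collections import defaultdict
from statistics import mean

# Assumed record layout: (domain, requirement, critic score on a 1-10 scale),
# e.g. ("D4", "R1", 8.0) for a Literature & Art sample judged on Style.
records = [
    ("D4", "R1", 8.0),
    ("D4", "R2", 7.5),
    ("D1", "R3", 6.0),
]

def category_scores(records):
    """Average critic scores per domain -- one reading of the "C" values."""
    by_domain = defaultdict(list)
    for domain, _requirement, score in records:
        by_domain[domain].append(score)
    return {domain: mean(scores) for domain, scores in by_domain.items()}

print(category_scores(records))  # {'D4': 7.75, 'D1': 6.0}
```
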

@@ -51,7 +51,7 @@ Our evaluation results suggest promising improvements in the model's creative wr

With respect to general capabilities, evaluations indicate modest improvements of **2%–5% on knowledge and reasoning tasks (CMMLU, MMLU-Pro)**, alongside encouraging progress in mathematical reasoning on benchmarks such as **AIME-2024, AIME-2025, and GSM8K**. The results suggest a balanced performance profile, with gains over DeepSeek-R1-Distill-Qwen-14B across creative writing, knowledge/reasoning, and mathematical tasks, which should make the model suitable for a range of general-purpose applications. On the instruction-following benchmark **IFEval**, the score improves from **71.43** to **74.71**.

-[image: general-capability benchmark comparison (previous version)]
+[image: general-capability benchmark comparison (updated version)]

<figcaption style="text-align:center; font-size:0.9em; color:#666">
Figure 2: When evaluating model performance, we recommend running multiple tests and averaging the results. (We use n=16 and max_tokens=32768 for mathematical tasks, and n=2 for the others.)
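
The sampling-and-averaging protocol in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the team's actual evaluation harness: `generate`, `score`, and the problem lists are hypothetical placeholders, and only the n and max_tokens settings come from the caption.

```python
import statistics

# Hypothetical stand-ins -- the README does not show the real generation
# or grading code, so these names are illustrative only.
def generate(prompt: str, max_tokens: int) -> str:
    """Sample one completion from the model (e.g. via an inference API)."""
    raise NotImplementedError

def score(answer: str, reference: str) -> float:
    """Grade one completion, e.g. exact-match accuracy for GSM8K/AIME."""
    raise NotImplementedError

def evaluate(problems, n: int, max_tokens: int) -> float:
    """Run each problem n times and average, as the Figure 2 caption advises."""
    per_problem = []
    for prompt, reference in problems:
        # n independent samples per problem smooth out decoding variance.
        runs = [score(generate(prompt, max_tokens), reference) for _ in range(n)]
        per_problem.append(statistics.mean(runs))
    return statistics.mean(per_problem)

# Caption settings: n=16 and max_tokens=32768 for mathematical tasks,
# n=2 for the other benchmarks (the non-math token limit is not stated).
# math_acc = evaluate(math_problems, n=16, max_tokens=32768)
# other_acc = evaluate(other_problems, n=2, max_tokens=...)
```
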