Update README.md
README.md
@@ -119,23 +119,28 @@ DeepMath-1.5B is created by finetuning deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
|
|
119 |
|
120 |
<sub>Difficulty distribution comparison.</sub> </div>
|
121 |
|
122 |
-
**2.
|
123 |
|
124 |
<div align="center"> <img src="./assets/github-domain.png" width="50%"/>
|
125 |
|
126 |
<sub>Hierarchical breakdown of mathematical topics covered in DeepMath-103K.</sub></div>
|
127 |
|
128 |
-
|
129 |
|
130 |
<div align="center"> <img src="./assets/github-contamination-case.png" width="80%"/>
|
131 |
|
132 |
<sub>Detected contamination examples. Subtle conceptual overlaps can also be identified.</sub> </div>
|
133 |
|
134 |
-
**
|
135 |
|
136 |
<div align="center"> <img src="./assets/github-data-sample.png" width="90%"/>
|
137 |
|
138 |
-
<sub>
|
139 |
|
140 |
- **Question**: The mathematical problem statement.
|
141 |
- **Final Answer**: A reliably verifiable final answer, enabling robust rule-based reward functions for RL.
|
@@ -145,22 +150,73 @@ DeepMath-1.5B is created by finetuning deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
|
|
145 |
|
146 |
## 📊 Main Results
|
147 |
|
148 |
-
|
149 |
|
150 |
|
151 |
-
| Model | MATH 500 | AMC23 | Olympiad Bench | Minerva Math | AIME24 | AIME25 |
|
152 |
-
| :----------------------: | :------: | :------: | :------------: | :----------: | :------: | :------: |
|
153 |
-
| Qwen2.5-7B-Base | 54.8 | 35.3 | 27.8 | 16.2 | 7.7 | 5.4 |
|
154 |
-
| Open-Reasoner-Zero-7B | 81.8 | 58.9 | 47.9 | 38.4 | 15.6 | 14.4 |
|
155 |
-
| Qwen-2.5-7B-SimpleRL-Zoo | 77.0 | 55.8 | 41.0 | 41.2 | 15.6 | 8.7 |
|
156 |
-
| [DeepMath-Zero-7B](https://huggingface.co/zwhe99/DeepMath-Zero-7B) | **85.5** | **64.7** | **51.0** | **45.3** | **20.4** | **17.5** |
|
157 |
|
158 |
-
| Model | MATH 500 | AMC23 | Olympiad Bench | Minerva Math | AIME24 | AIME25 |
|
159 |
-
| :---------------------: | :------: | :------: | :------------: | :----------: | :------: | :------: |
|
160 |
-
| R1-Distill-Qwen-1.5B | 84.7 | 72.0 | 53.1 | 36.6 | 29.4 | 24.8 |
|
161 |
-
| DeepScaleR-1.5B-Preview | 89.4 | 80.3 | 60.9 | 42.2 | **42.3** | 29.6 |
|
162 |
-
| Still-3-1.5B-Preview | 86.6 | 75.8 | 55.7 | 38.7 | 30.8 | 24.6 |
|
163 |
-
| [DeepMath-1.5B](https://huggingface.co/zwhe99/DeepMath-1.5B) | **89.9** | **82.3** | **61.8** | **42.5** | 37.3 | **30.8** |
|
164 |
|
165 |
## 🙏 Acknowledgements
|
166 |
|
@@ -171,6 +227,8 @@ This work can not be done without the help of the following works:
|
|
171 |
- **[TIGER-Lab/WebInstructSub](https://huggingface.co/datasets/TIGER-Lab/WebInstructSub)**: Instruction data from MathStackExchange and ScienceStackExchange.
|
172 |
- **[AI-MO/NuminaMath-CoT](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT)**: Approximately 860k math problems.
|
173 |
|
174 |
## 📚 Citation
|
175 |
```bibtex
|
176 |
@article{deepmath,
|
@@ -182,4 +240,4 @@ This work can not be done without the help of the following works:
|
|
182 |
primaryClass={cs.CL},
|
183 |
url={https://arxiv.org/abs/2504.11456},
|
184 |
}
|
185 |
-
```
|
|
|
119 |
|
120 |
<sub>Difficulty distribution comparison.</sub> </div>
|
121 |
|
122 |
+
**2. Data Diversity and Novelty**: DeepMath-103K spans a wide spectrum of mathematical subjects, including Algebra, Calculus, Number Theory, Geometry, Probability, and Discrete Mathematics.
|
123 |
|
124 |
<div align="center"> <img src="./assets/github-domain.png" width="50%"/>
|
125 |
|
126 |
<sub>Hierarchical breakdown of mathematical topics covered in DeepMath-103K.</sub></div>
|
127 |
|
128 |
+
The problems in DeepMath-103K are novel and unique, whereas many existing datasets resemble one another and overlap.
|
129 |
+
<div align="center"> <img src="./assets/github-tsne.png" width="70%"/>
|
130 |
+
|
131 |
+
<sub>Embedding distributions of different datasets.</sub></div>
|
132 |
+
|
133 |
+
**3. Rigorous Decontamination**: Built from diverse sources, DeepMath-103K underwent meticulous decontamination against common benchmarks using semantic matching. This minimizes test set leakage and promotes fair model evaluation.
|
134 |
|
135 |
<div align="center"> <img src="./assets/github-contamination-case.png" width="80%"/>
|
136 |
|
137 |
<sub>Detected contamination examples. Subtle conceptual overlaps can also be identified.</sub> </div>
|
138 |
|
139 |
+
**4. Rich Data Format**: Each sample in DeepMath-103K is structured with rich information to support various research applications:
|
140 |
|
141 |
<div align="center"> <img src="./assets/github-data-sample.png" width="90%"/>
|
142 |
|
143 |
+
<sub>An example data sample from DeepMath-103K.</sub> </div>
|
144 |
|
145 |
- **Question**: The mathematical problem statement.
|
146 |
- **Final Answer**: A reliably verifiable final answer, enabling robust rule-based reward functions for RL.
|
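To illustrate what a rule-based reward over the **Final Answer** field could look like, here is a minimal sketch. It is not the project's reward function: the `\boxed{...}` answer convention and the helper names `extract_boxed`, `normalize`, `as_number`, and `rule_based_reward` are all assumptions made for this example. The Quick Start environment installs `math-verify`, which is likely the better tool for real answer checking.

```python
from fractions import Fraction
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never closed


def normalize(answer: str) -> str:
    """Strip surrounding whitespace, dollar signs, and inner spaces."""
    return answer.strip().strip("$").replace(" ", "")


def as_number(s: str) -> Optional[Fraction]:
    """Parse strings such as '0.5' or '1/2' exactly; None if not numeric."""
    try:
        return Fraction(s)
    except (ValueError, ZeroDivisionError):
        return None


def rule_based_reward(response: str, final_answer: str) -> float:
    """Hypothetical reward: 1.0 if the boxed answer matches, else 0.0."""
    predicted = extract_boxed(response)
    if predicted is None:
        return 0.0
    pred, ref = normalize(predicted), normalize(final_answer)
    if pred == ref:
        return 1.0
    p, r = as_number(pred), as_number(ref)
    return 1.0 if p is not None and p == r else 0.0
```

String equality handles symbolic answers that match verbatim, and the exact `Fraction` comparison treats `0.5` and `1/2` as equal. Anything beyond that, such as equivalent LaTeX expressions, needs a real verifier.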
|
|
150 |
|
151 |
## 📊 Main Results
|
152 |
|
153 |
+
DeepMath series models achieve many **SOTA** results on challenging math benchmarks:
|
154 |
+
|
155 |
+
<div align="center"> <img src="./assets/github-main.png" width="90%"/>
|
156 |
+
|
157 |
+
<sub>Math reasoning performance.</sub> </div>
|
158 |
+
|
159 |
+
|
160 |
+
## 🎯 Quick Start
|
161 |
+
|
162 |
+
#### Environment Preparation
|
163 |
+
|
164 |
+
```shell
|
165 |
+
git clone --recurse-submodules https://github.com/zwhe99/DeepMath.git && cd DeepMath
|
166 |
+
|
167 |
+
conda create -y -n deepmath python=3.12.2 && conda activate deepmath
|
168 |
+
pip3 install "ray[default]"
|
169 |
+
pip3 install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
|
170 |
+
pip3 install flash-attn==2.7.4.post1 --no-build-isolation
|
171 |
+
pip3 install omegaconf==2.4.0.dev3 hydra-core==1.4.0.dev1 antlr4-python3-runtime==4.11.0 vllm==0.7.3
|
172 |
+
pip3 install "math-verify[antlr4_11_0]==0.7.0" fire deepspeed tensorboardX prettytable datasets transformers==4.49.0
|
173 |
+
pip3 install -e verl
|
174 |
+
```
|
175 |
+
|
176 |
+
|
177 |
|
178 |
+
#### Evaluation
|
179 |
+
|
180 |
+
```shell
|
181 |
+
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 VLLM_ATTENTION_BACKEND=XFORMERS VLLM_USE_V1=1 VLLM_WORKER_MULTIPROC_METHOD=spawn python3 uni_eval.py \
|
182 |
+
--base_model zwhe99/DeepMath-Zero-7B \
|
183 |
+
--chat_template_name orz \
|
184 |
+
--system_prompt_name simplerl \
|
185 |
+
--output_dir \
|
186 |
+
--bf16 True \
|
187 |
+
--tensor_parallel_size 8 \
|
188 |
+
--data_id zwhe99/MATH \
|
189 |
+
--split math500 \
|
190 |
+
--max_model_len 32768 \
|
191 |
+
--temperature 0.6 \
|
192 |
+
--top_p 0.95 \
|
193 |
+
--n 16
|
194 |
+
```
|
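The command above draws `--n 16` samples per problem at a non-zero temperature. One common way to turn such samples into a single score is to average correctness over the samples of each problem, then over problems. The sketch below shows that arithmetic. Whether `uni_eval.py` aggregates this way is an assumption, and the function name `mean_accuracy_over_samples` is hypothetical.

```python
from typing import List


def mean_accuracy_over_samples(correct: List[List[bool]]) -> float:
    """Average accuracy in percent.

    correct[i][j] is True when sample j for problem i was judged correct.
    Each problem contributes the fraction of its samples that are correct.
    """
    per_problem = [sum(row) / len(row) for row in correct]
    return 100.0 * sum(per_problem) / len(per_problem)


# Two problems, 16 samples each: one always solved, one solved half the time.
results = [[True] * 16, [True] * 8 + [False] * 8]
print(mean_accuracy_over_samples(results))  # 75.0
```

Averaging over several samples reduces the variance that sampling at `--temperature 0.6` introduces, which matters most on small benchmarks.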
195 |
+
|
196 |
+
|
197 |
+
|
198 |
+
#### Training
|
199 |
+
|
200 |
+
* Data Preparation
|
201 |
+
|
202 |
+
```shell
|
203 |
+
DATA_DIR=/path/to/your/data
|
204 |
+
python3 verl/examples/data_preprocess/deepmath_103k.py --local_dir "$DATA_DIR"
|
205 |
+
```
|
206 |
+
|
207 |
+
* Start Ray
|
208 |
+
|
209 |
+
```shell
|
210 |
+
# Head node (×1)
|
211 |
+
ray start --head --port=6379 --node-ip-address=$HEAD_ADDR --num-gpus=8
|
212 |
+
|
213 |
+
# Worker nodes (×7 or ×11)
|
214 |
+
ray start --address=$HEAD_ADDR:6379 --node-ip-address=$WORKER_ADDR --num-gpus=8
|
215 |
+
```
|
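Before launching training, it can help to confirm that every node has joined the cluster. The following is a minimal sketch, assuming 8 GPUs per node as in the commands above. The helper names `missing_gpus` and `check_cluster` are made up for this example and are not part of the repository.

```python
def missing_gpus(resources: dict, num_nodes: int, gpus_per_node: int = 8) -> int:
    """Return how many GPUs are still missing from the cluster.

    resources is a mapping like the one returned by ray.cluster_resources(),
    where the "GPU" key holds the total GPU count as a float.
    """
    expected = num_nodes * gpus_per_node
    return max(0, expected - int(resources.get("GPU", 0)))


def check_cluster(num_nodes: int) -> None:
    """Connect to the running Ray cluster and report missing GPUs."""
    import ray  # imported lazily so missing_gpus works without Ray installed

    ray.init(address="auto")  # attach to the cluster started with `ray start`
    missing = missing_gpus(ray.cluster_resources(), num_nodes)
    if missing:
        print(f"Cluster is short of {missing} GPU(s); some nodes may not have joined.")
    else:
        print("All expected GPUs are available.")
```

On the head node, `check_cluster(8)` corresponds to one head plus seven workers, and `check_cluster(12)` to one head plus eleven workers.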
216 |
+
|
217 |
+
* Launch training on the head node. See `scripts/train` for training scripts.
|
218 |
|
219 |
|
220 |
|
221 |
## 🙏 Acknowledgements
|
222 |
|
|
|
227 |
- **[TIGER-Lab/WebInstructSub](https://huggingface.co/datasets/TIGER-Lab/WebInstructSub)**: Instruction data from MathStackExchange and ScienceStackExchange.
|
228 |
- **[AI-MO/NuminaMath-CoT](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT)**: Approximately 860k math problems.
|
229 |
|
230 |
+
|
231 |
+
|
232 |
## 📚 Citation
|
233 |
```bibtex
|
234 |
@article{deepmath,
|
|
|
240 |
primaryClass={cs.CL},
|
241 |
url={https://arxiv.org/abs/2504.11456},
|
242 |
}
|
243 |
+
```
|