# DeepSeek-R1-0528-GPTQ-Int4-Int8Mix-Compact

Base model: deepseek-ai/DeepSeek-R1-0528

This repository delivers an Int4 + selectively-Int8 GPTQ DeepSeek-R1-0528 model: only layers that are highly sensitive to quantization remain in Int8, while the rest stay Int4, preserving generation quality with minimal file-size overhead.

Preliminary trials show that converting the entire model to pure Int4 (AWQ/GPTQ) under the quantization layout used in vLLM’s current DeepSeek-R1 implementation degrades inference accuracy and can produce faulty outputs. Layer-wise fine-grained quantization substantially mitigates this issue.
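To make the scheme concrete, here is a minimal illustrative sketch of the kind of per-layer precision map a mixed Int4/Int8 quantization uses; the layer names, the `bits_for` helper, and the split shown are hypothetical examples, not this repo's actual configuration.

```python
# Hypothetical per-layer precision map (illustration only): a few
# quantization-sensitive projections are kept at 8-bit, everything
# else defaults to 4-bit.
SENSITIVE_INT8_LAYERS = {
    "model.layers.0.self_attn.kv_b_proj",
    "model.layers.0.mlp.gate",
}

def bits_for(layer_name: str) -> int:
    """Return the weight bit-width to use for a given layer."""
    return 8 if layer_name in SENSITIVE_INT8_LAYERS else 4

assert bits_for("model.layers.0.self_attn.kv_b_proj") == 8
assert bits_for("model.layers.0.mlp.experts.0.down_proj") == 4
```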

Temporary patch:
vLLM 0.9.0 does not yet natively support per-layer quantization for MoE modules.
We added a `get_moe_quant_method` function to `gptq_marlin.py` as an interim fix.
Until the upstream PR is merged, please replace the original file with the one provided in this repo.

Variant Overview

| Variant | Characteristics | File Size | Recommended Scenario |
|---------|-----------------|-----------|----------------------|
| Lite | Only the most critical layers upgraded to Int8; size close to pure Int4 | 355 GB | Resource-constrained, lightweight server deployments |
| Compact | More Int8 layers, relatively higher output quality | 414 GB | VRAM-sufficient deployments focused on answer quality (e.g., 8 × A100) |
| Medium | Compact plus fully-Int8 attention layers; high quality with reduced long-context loss | 445 GB | VRAM-rich deployments needing both top answer quality and high concurrency (e.g., 8 × H20) |

Choose the variant that best matches your hardware and quality requirements.

【Model Update Date】

2025-05-31
1. Initial commit

【Dependencies】

```
vllm==0.9.0
transformers==4.52.3
```
### 【💡 Notes on New vLLM Versions 💡】

1. Recommended: Use V0 Inference Mode

Before launching vLLM, set the environment variable:

```bash
export VLLM_USE_V1=0
```
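If you launch vLLM from Python rather than the shell, the same switch can be set in-process; a minimal sketch, assuming the variable is set before vLLM's engine is created:

```python
import os

# Select V0 inference mode; this must happen before the vLLM engine
# is initialized, or the setting will not take effect.
os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM  # import after the environment variable is set
```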
### 【💡 Patch for gptq_marlin.py💡】

At present, vllm==0.9.0 lacks support for per-layer quantization configurations for the MoE module, which leads to errors when loading this model. As a simple interim fix, we added a `get_moe_quant_method` function to the `gptq_marlin.py` file.

Until the PR is merged, please replace the `gptq_marlin.py` file in your installation with the attached version, placing it at:

```
.../site-packages/vllm/model_executor/layers/quantization/gptq_marlin.py
```
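If you are unsure where your vLLM installation lives, one convenient way (our suggestion, not part of the official patch instructions) is to resolve the path from the installed package:

```python
import os
import vllm

# Resolve the installed location of gptq_marlin.py so the patched
# copy shipped in this repo can be dropped in its place.
patch_target = os.path.join(
    os.path.dirname(vllm.__file__),
    "model_executor", "layers", "quantization", "gptq_marlin.py",
)
print(patch_target)
```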

【Model List】

| File Size | Latest Update Time |
|-----------|--------------------|
| 414 GB | 2025-06-01 |

【Model Download】

```python
from huggingface_hub import snapshot_download
snapshot_download('QuantTrio/DeepSeek-R1-0528-GPTQ-Int4-Int8Mix-Compact', cache_dir="local_path")
```
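Once downloaded, the model can be loaded with vLLM's offline API. The sketch below is illustrative only: the model path is a placeholder for wherever `snapshot_download` put the files, and `tensor_parallel_size=8` assumes an 8-GPU node as in the variant table above.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/DeepSeek-R1-0528-GPTQ-Int4-Int8Mix-Compact",  # placeholder path
    tensor_parallel_size=8,  # e.g., 8 x A100 for the Compact variant
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)
outputs = llm.generate(["Explain GPTQ quantization in one paragraph."], params)
print(outputs[0].outputs[0].text)
```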

# DeepSeek-R1-0528


1. Introduction

The DeepSeek R1 model has undergone a minor version upgrade, with the current version being DeepSeek-R1-0528. In the latest update, DeepSeek R1 has significantly improved its depth of reasoning and inference capabilities by leveraging increased computational resources and introducing algorithmic optimization mechanisms during post-training. The model has demonstrated outstanding performance across various benchmark evaluations, including mathematics, programming, and general logic. Its overall performance is now approaching that of leading models, such as O3 and Gemini 2.5 Pro.

Compared to the previous version, the upgraded model shows significant improvements in handling complex reasoning tasks. For instance, in the AIME 2025 test, the model’s accuracy has increased from 70% in the previous version to 87.5% in the current version. This advancement stems from enhanced thinking depth during the reasoning process: in the AIME test set, the previous model used an average of 12K tokens per question, whereas the new version averages 23K tokens per question.

Beyond its improved reasoning capabilities, this version also offers a reduced hallucination rate, enhanced support for function calling, and better experience for vibe coding.

2. Evaluation Results

DeepSeek-R1-0528

For all our models, the maximum generation length is set to 64K tokens. For benchmarks requiring sampling, we use a temperature of 0.6, a top-p value of 0.95, and generate 16 responses per query to estimate pass@1.
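Concretely, pass@1 under this protocol is the fraction of the 16 sampled responses that are correct, averaged over all queries; a minimal sketch (the function name is ours, for illustration):

```python
def estimate_pass_at_1(correct_flags_per_query: list[list[bool]]) -> float:
    """Mean over queries of the fraction of sampled responses that are correct."""
    per_query = [sum(flags) / len(flags) for flags in correct_flags_per_query]
    return sum(per_query) / len(per_query)

# Toy example with 2 queries and 4 samples each (the protocol above uses 16):
print(estimate_pass_at_1([
    [True, True, False, True],    # 0.75
    [False, False, True, False],  # 0.25
]))  # -> 0.5
```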

| Category | Benchmark (Metric) | DeepSeek R1 | DeepSeek R1 0528 |
|----------|--------------------|-------------|------------------|
| General | MMLU-Redux (EM) | 92.9 | 93.4 |
| | MMLU-Pro (EM) | 84.0 | 85.0 |
| | GPQA-Diamond (Pass@1) | 71.5 | 81.0 |
| | SimpleQA (Correct) | 30.1 | 27.8 |
| | FRAMES (Acc.) | 82.5 | 83.0 |
| | Humanity's Last Exam (Pass@1) | 8.5 | 17.7 |
| Code | LiveCodeBench (2408-2505) (Pass@1) | 63.5 | 73.3 |
| | Codeforces-Div1 (Rating) | 1530 | 1930 |
| | SWE Verified (Resolved) | 49.2 | 57.6 |
| | Aider-Polyglot (Acc.) | 53.3 | 71.6 |
| Math | AIME 2024 (Pass@1) | 79.8 | 91.4 |
| | AIME 2025 (Pass@1) | 70.0 | 87.5 |
| | HMMT 2025 (Pass@1) | 41.7 | 79.4 |
| | CNMO 2024 (Pass@1) | 78.8 | 86.9 |
| Tools | BFCL_v3_MultiTurn (Acc) | - | 37.0 |
| | Tau-Bench (Pass@1) | - | 53.5 (Airline) / 63.9 (Retail) |
Note: We use the Agentless framework to evaluate model performance on SWE-Verified. We evaluate only text-only prompts in the HLE test set. GPT-4.1 is employed to play the user role in the Tau-Bench evaluation.

5. License

This code repository is licensed under the MIT License. The use of DeepSeek-R1 models is also subject to the MIT License. The DeepSeek-R1 series (including Base and Chat) supports commercial use and distillation.

6. Citation

```bibtex
@misc{deepseekai2025deepseekr1incentivizingreasoningcapability,
      title={DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning},
      author={DeepSeek-AI},
      year={2025},
      eprint={2501.12948},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.12948},
}
```

7. Contact

If you have any questions, please raise an issue or contact us at service@deepseek.com.
