---
pipeline_tag: text-generation
base_model:
- deepseek-ai/DeepSeek-R1
license: mit
---
# Model Overview
## Description:
The NVIDIA DeepSeek R1 FP4 model is the quantized version of DeepSeek AI's DeepSeek R1 model, an auto-regressive language model that uses an optimized transformer architecture. For more information, please check [here](https://huggingface.co/deepseek-ai/DeepSeek-R1). The NVIDIA DeepSeek R1 FP4 model was quantized with [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer).
This model is ready for commercial/non-commercial use.
## Third-Party Community Consideration
This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party’s requirements for this application and use case; see link to Non-NVIDIA [(DeepSeek R1) Model Card](https://huggingface.co/deepseek-ai/DeepSeek-R1).
### License/Terms of Use:
[MIT](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/mit.md)
## Model Architecture:
**Architecture Type:** Transformer
**Network Architecture:** DeepSeek R1
## Input:
**Input Type(s):** Text
**Input Format(s):** String
**Input Parameters:** One-Dimensional (1D): Sequences
**Other Properties Related to Input:** Context length up to 128K tokens
## Output:
**Output Type(s):** Text
**Output Format:** String
**Output Parameters:** 1D: Sequences
**Other Properties Related to Output:** N/A
## Software Integration:
**Supported Runtime Engine(s):**
* TensorRT-LLM
**Supported Hardware Microarchitecture Compatibility:**
* NVIDIA Blackwell
**Preferred Operating System(s):**
* Linux
## Model Version(s):
The model was quantized with nvidia-modelopt **v0.23.0**.
## Datasets:
* Calibration Dataset: [cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail)
  * Data collection method: Automated.
  * Labeling method: Unknown.
* Evaluation Dataset: [MMLU](https://github.com/hendrycks/test)
  * Data collection method: Unknown.
  * Labeling method: N/A.
## Inference:
**Engine:** TensorRT-LLM
**Test Hardware:** B200
## Post Training Quantization
This model was obtained by quantizing the weights and activations of DeepSeek R1 to the FP4 data type, ready for inference with TensorRT-LLM. Only the weights and activations of the linear operators within the transformer blocks are quantized. This optimization reduces the number of bits per parameter from 8 to 4, cutting the disk size and GPU memory requirements by approximately 1.6x.
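As a rough illustration of that flow, the sketch below shows how a post-training FP4 quantization pass with TensorRT Model Optimizer is typically wired up: load the model, calibrate on cnn_dailymail, and quantize. The config name `NVFP4_DEFAULT_CFG`, the calibration sample size, and the single-process model loading are assumptions for illustration; the full 671B-parameter model requires multi-GPU sharding, and this is not NVIDIA's exact production recipe.
```python
# A minimal sketch of FP4 post-training quantization with TensorRT Model Optimizer
# (pip install nvidia-modelopt). Config name and calibration size are assumptions.
import torch
import modelopt.torch.quantization as mtq
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# Multi-GPU sharding of the full model is omitted here for brevity.
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Calibration data: a small slice of cnn_dailymail, the dataset listed above.
calib_texts = load_dataset("abisee/cnn_dailymail", "3.0.0", split="train[:512]")["article"]

def forward_loop(model):
    # modelopt calls this to push calibration batches through the model and
    # collect activation statistics for the linear layers it will quantize.
    with torch.no_grad():
        for text in calib_texts:
            inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
            model(**inputs.to(model.device))

# Quantize weights and activations of the linear operators to FP4.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
```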
## Usage
### Deploy with TensorRT-LLM
To deploy the quantized FP4 checkpoint with the [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) LLM API, follow the sample code below (this requires 8x B200 GPUs and TensorRT-LLM built from source against the latest main branch):
* LLM API sample usage:
```python
from tensorrt_llm import SamplingParams
from tensorrt_llm._torch import LLM


def main():
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(max_tokens=32)

    llm = LLM(model="nvidia/DeepSeek-R1-FP4", tensor_parallel_size=8, enable_attention_dp=True)
    outputs = llm.generate(prompts, sampling_params)

    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


# The entry point of the program needs to be guarded for process spawning.
if __name__ == '__main__':
    main()
```
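To steer decoding, the same `SamplingParams` object accepts the usual sampling knobs. A minimal variation on the example above; the temperature/top-p values follow DeepSeek's published recommendation for R1 (around 0.6 / 0.95), and the exact parameter set of `SamplingParams` may vary across TensorRT-LLM versions:
```python
from tensorrt_llm import SamplingParams

# Sampled decoding instead of the near-default settings above.
# Treat the keyword names as assumptions against your TensorRT-LLM build.
sampling_params = SamplingParams(
    max_tokens=1024,   # R1 emits long reasoning traces, so allow a generous budget
    temperature=0.6,
    top_p=0.95,
)
```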
### Evaluation
The accuracy benchmark results are presented in the table below; the FP4 checkpoint closely tracks the FP8 baseline, with at most a 1.2-point drop (on MATH-500):
| Precision | MMLU | GSM8K | AIME 2024 | GPQA Diamond | MATH-500 |
|-----------|------|-------|-----------|--------------|----------|
| FP8       | 90.8 | 96.3  | 80.0      | 69.7         | 95.4     |
| FP4       | 90.7 | 96.1  | 80.0      | 69.2         | 94.2     |