---
license: mit
license_link: https://huggingface.co/rednote-hilab/dots.llm1.inst/blob/main/LICENSE
pipeline_tag: text-generation
base_model: rednote-hilab/dots.llm1.base
tags:
- chat
library_name: transformers
language:
- en
- zh
---

# dots1


## 1. Introduction


`dots.llm1` is a large-scale MoE model that activates 14B parameters out of a total of 142B parameters, delivering performance on par with state-of-the-art models while reducing training and inference costs. 
Leveraging our meticulously crafted and efficient data processing pipeline, `dots.llm1` achieves performance comparable to Qwen2.5-72B when trained on 11.2T high-quality tokens without synthetic data. To foster further research, we open-source intermediate training checkpoints at every one trillion tokens, providing valuable insights into the learning dynamics of large language models.


<p align="center">
  <img width="90%" src="./figures/performance.png">
</p>


## 2. Model Summary

**This repo contains the base and instruction-tuned `dots.llm1` models**, which have the following features:

- Type: A 14B/142B MoE model trained on 11.2T tokens.
- Training Stage: Pretraining & Post-training
- Architecture: Multi-head attention with QK-Norm in the attention layers; fine-grained MoE selecting the top 6 of 128 routed experts, plus 2 shared experts (see the sketch below).
- Number of Layers: 62
- Number of Attention Heads: 32
- Context Length: 32,768 tokens
- License: MIT

For more details, please refer to our [report](dots1_tech_report.pdf).
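
As a rough illustration of the fine-grained MoE routing described above, the sketch below combines 2 always-active shared experts with a top-6 selection over 128 routed experts. It is a simplified, hypothetical reference implementation for intuition only; the hidden sizes, gating function, and dispatch strategy follow common MoE conventions rather than the exact `dots.llm1` code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Illustrative fine-grained MoE: top-6 of 128 routed experts + 2 shared experts."""

    def __init__(self, hidden_size: int, expert_size: int,
                 n_routed: int = 128, n_shared: int = 2, top_k: int = 6):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, n_routed, bias=False)
        make_expert = lambda: nn.Sequential(
            nn.Linear(hidden_size, expert_size), nn.SiLU(),
            nn.Linear(expert_size, hidden_size),
        )
        self.routed_experts = nn.ModuleList(make_expert() for _ in range(n_routed))
        self.shared_experts = nn.ModuleList(make_expert() for _ in range(n_shared))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_size)
        scores = F.softmax(self.router(x), dim=-1)              # routing probabilities
        weights, idx = torch.topk(scores, self.top_k, dim=-1)   # top-6 experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize the gates
        shared_out = sum(e(x) for e in self.shared_experts)     # shared experts: always on
        # Clear-but-slow per-token loop; real implementations batch tokens per expert.
        routed_out = torch.stack([
            sum(w * self.routed_experts[int(i)](x[t]) for w, i in zip(weights[t], idx[t]))
            for t in range(x.size(0))
        ])
        return shared_out + routed_out

# Tiny smoke test with made-up sizes (the real model's dimensions differ).
layer = ToyMoELayer(hidden_size=64, expert_size=128)
print(layer(torch.randn(4, 64)).shape)  # torch.Size([4, 64])
```

Only the two shared experts and the six selected routed experts run for each token, which is why roughly 14B of the 142B parameters are active per forward pass.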

## 3. Example Usage

### Model Downloads

<div align="center">

| **Model** | **#Total Params** | **#Activated Params** | **Context Length** | **Download Link** |
| :------------: | :------------: | :------------: | :------------: | :------------: |
| dots.llm1.base | 142B | 14B | 32K   | [🤗 Hugging Face](https://huggingface.co/rednote-hilab/dots.llm1.base)   |
| dots.llm1.inst  | 142B | 14B |  32K   | [🤗 Hugging Face](https://huggingface.co/rednote-hilab/dots.llm1.inst)   |

</div>

### Inference with huggingface

#### Text Completion

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "rednote-hilab/dots.llm1.base"
tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16, attn_implementation="eager")
model.generation_config = GenerationConfig.from_pretrained(model_name)

text = "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```

#### Chat Completion

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "rednote-hilab/dots.llm1.inst"
tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16, attn_implementation="eager")
model.generation_config = GenerationConfig.from_pretrained(model_name)

messages = [
    {"role": "user", "content": "Write a piece of quicksort code in C++"}
]
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=200)

result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)
```


### Inference with sglang
[SGLang](https://github.com/sgl-project/sglang) is a fast serving framework for large language models and vision language models. SGLang can be used to launch a server with an OpenAI-compatible API. `sglang>=***` is required. Launching a server is as easy as

```shell
python -m sglang.launch_server --model-path dots.llm1.inst --tp 8 --host 0.0.0.0 --port 8000
```
An OpenAI-compatible API will be available at `http://localhost:8000/v1`.

### Inference with vllm
[vLLM](https://github.com/vllm-project/vllm) is a high-throughput and memory-efficient inference and serving engine for LLMs.
`vllm>=***` is recommended.

```shell
vllm serve dots.llm1.inst --port 8000 --tensor-parallel-size 8
```
An OpenAI-compatible API will be available at `http://localhost:8000/v1`.
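
Either server exposes the same OpenAI-compatible endpoint, so a standard OpenAI client can query it. Below is a minimal sketch using the `openai` Python package; the `model` value is assumed to match the path passed to the server and may need to be adjusted to whatever name the server actually registers.

```python
from openai import OpenAI

# Point the client at the local OpenAI-compatible server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="dots.llm1.inst",  # assumed to match the served model name
    messages=[{"role": "user", "content": "Write a piece of quicksort code in C++"}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```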

## 4. Evaluation Results

Detailed evaluation results are reported in this [📑 report](dots1_tech_report.pdf).

## Citation

If you find `dots.llm1` useful or want to use it in your projects, please cite our paper:

```bibtex
@article{dots1,
      title={dots.llm1 Technical Report}, 
      author={rednote-hilab},
      journal={arXiv preprint arXiv:TBD},
      year={2025}
}
```