
# OpenSeek-Small-v1 Model Documentation

## Overview

OpenSeek-Small-v1 is the initial production model of the OpenSeek project.

- Uses a DeepSeek-V3-like Mixture-of-Experts (MoE) architecture.
- Comprises 1.4 billion total parameters, of which 0.4 billion are activated per token.
- Trained on 720 billion tokens.
- Approaches the performance of 1-billion-parameter dense models while activating fewer parameters per token (see the verification sketch below and the Evaluation section).
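
As a quick sanity check, the total parameter count can be verified locally after downloading the weights. This is a minimal sketch, not part of the official usage; it only confirms the ~1.4B total figure, since the 0.4B activated figure depends on the MoE routing rather than a raw parameter count:

```python
from transformers import AutoModelForCausalLM

# The checkpoint ships custom MoE modeling code, hence trust_remote_code.
model = AutoModelForCausalLM.from_pretrained(
    "BAAI/OpenSeek-Small-v1", trust_remote_code=True
)

# Total parameters should come out near 1.4B; only ~0.4B of these
# are activated per token by the MoE router.
total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total / 1e9:.2f}B")
```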

## Training Data

720 billion (0.72T) tokens of high-quality pretraining data. The sampling ratio for each domain is as follows:

| Name | Ratio (%) |
|---|---|
| Nemotron-CC-high-actual-actual-high | 1.26 |
| Nemotron-CC-high-actual-actual-low | 0.67 |
| Nemotron-CC-high-actual-actual-mid | 2.05 |
| Nemotron-CC-high-synthetic-distill-high | 1.59 |
| Nemotron-CC-high-synthetic-distill-low | 0.64 |
| Nemotron-CC-high-synthetic-distill-mid | 2.32 |
| Nemotron-CC-high-synthetic-diverse_qa_pairs-high | 4.67 |
| Nemotron-CC-high-synthetic-diverse_qa_pairs-low | 2.16 |
| Nemotron-CC-high-synthetic-diverse_qa_pairs-mid | 7.58 |
| Nemotron-CC-high-synthetic-extract_knowledge-high | 6.43 |
| Nemotron-CC-high-synthetic-extract_knowledge-low | 0.07 |
| Nemotron-CC-high-synthetic-extract_knowledge-mid | 2.22 |
| Nemotron-CC-high-synthetic-knowledge_list-high | 1.88 |
| Nemotron-CC-high-synthetic-knowledge_list-low | 0.74 |
| Nemotron-CC-high-synthetic-knowledge_list-mid | 3.20 |
| Nemotron-CC-high-synthetic-wrap_medium-high | 3.89 |
| Nemotron-CC-high-synthetic-wrap_medium-low | 0.65 |
| Nemotron-CC-high-synthetic-wrap_medium-mid | 6.18 |
| Nemotron-CC-low-synthetic-wrap_medium-high | 0.17 |
| Nemotron-CC-low-synthetic-wrap_medium-low | 0.30 |
| Nemotron-CC-low-synthetic-wrap_medium-mid | 1.08 |
| Nemotron-CC-medium-actual-actual-high | 2.20 |
| Nemotron-CC-medium-actual-actual-low | 4.48 |
| Nemotron-CC-medium-actual-actual-mid | 7.76 |
| arxiv | 0.32 |
| books | 1.98 |
| code | 3.43 |
| cot_synthesis_CC | 9.82 |
| cot_synthesis_OpenSource | 0.46 |
| cot_synthesis_arxiv | 4.15 |
| cot_synthesis_code | 1.32 |
| cot_synthesis_math | 2.19 |
| cot_synthesis_wiki | 0.83 |
| math | 0.83 |
| pes2o | 0.31 |
| stack | 0.19 |
| wiki | 0.29 |
| zh_cc | 9.65 |
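
The ratios sum to roughly 100, i.e. they are percentage weights over the mixture. A hypothetical sketch of how such weights translate into per-domain sampling probabilities during pretraining (the `DOMAIN_RATIOS` dict and this loop are illustrative, not the project's actual data loader):

```python
import random

# Illustrative subset of the domain weights from the table above (percent).
DOMAIN_RATIOS = {
    "cot_synthesis_CC": 9.82,
    "zh_cc": 9.65,
    "Nemotron-CC-medium-actual-actual-mid": 7.76,
    "code": 3.43,
    "wiki": 0.29,
}

# Normalize to probabilities and draw source domains for the next documents.
total = sum(DOMAIN_RATIOS.values())
domains = list(DOMAIN_RATIOS)
weights = [w / total for w in DOMAIN_RATIOS.values()]

batch = random.choices(domains, weights=weights, k=8)
print(batch)  # e.g. ['zh_cc', 'cot_synthesis_CC', 'code', ...]
```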

## Wandb

Our training curves are recorded in Weights & Biases (wandb).

## Evaluation

| Category | Metric (shots) | Llama-3.2-1B | Qwen2.5-1.5B | Qwen2.5-0.5B | OLMo-1B-0724 | OpenSeek-Small-v1 |
|---|---|---|---|---|---|---|
| English-Commonsense Reasoning | HellaSwag (5-shot) | 0.4830 | 0.5007 | 0.4007 | 0.4909 | 0.3893 |
| | TruthfulQA (0-shot) | 0.3773 | 0.4663 | 0.3986 | 0.4029 | 0.3990 |
| | Winogrande (5-shot) | 0.6212 | 0.6448 | 0.5683 | 0.6290 | 0.5541 |
| | CommonsenseQA (5-shot) | 0.3120 | 0.7445 | 0.5487 | 0.1949 | 0.2048 |
| | PIQA (5-shot) | 0.7514 | 0.7612 | 0.7111 | 0.7459 | 0.7203 |
| | OpenBookQA (5-shot) | 0.2960 | 0.3340 | 0.2720 | 0.3080 | 0.2560 |
| | BoolQ (5-shot) | 0.6590 | 0.7774 | 0.6572 | 0.6508 | 0.6165 |
| English-Problem-Solving | ARC Easy (5-shot) | 0.6940 | 0.8043 | 0.6780 | 0.6111 | 0.6237 |
| | ARC Challenge (5-shot) | 0.3532 | 0.4846 | 0.3370 | 0.3063 | 0.3157 |
| | MMLU (5-shot) | 0.3124 | 0.6165 | 0.4818 | 0.2869 | 0.2654 |
| English-Mathematics | GSM8K (5-shot) | 0.0637 | 0.6194 | 0.3495 | 0.0159 | 0.0182 |
| | Minerva Math (4-shot) | 0.0180 | 0.2876 | 0.1160 | 0.0182 | 0.0010 |
| Chinese | CEval (5-shot) | 0.2779 | 0.6954 | 0.5423 | 0.2340 | 0.2422 |
| | CMMLU (5-shot) | 0.2687 | 0.6882 | 0.5300 | 0.2570 | 0.2468 |
| Average Metrics | Average-English (w/o Math) | 0.4859 | 0.6134 | 0.5053 | 0.4627 | 0.4345 |
| | Average-English | 0.4118 | 0.5868 | 0.4599 | 0.3884 | 0.3637 |
| | Average-Chinese | 0.2733 | 0.6918 | 0.5362 | 0.2455 | 0.2445 |
| | Average | 0.3920 | 0.6018 | 0.4708 | 0.3680 | 0.3466 |
| | Average (w/o Math) | 0.4505 | 0.6265 | 0.5105 | 0.4265 | 0.4028 |

While activating only 0.4 billion parameters per token, OpenSeek-Small-v1 approaches the average scores of the 1-billion-parameter dense baselines (most closely OLMo-1B-0724), demonstrating strong efficiency per activated parameter.

*(Figure: logC_vs_Metric_Average)*
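
The benchmark names and shot counts above match the task naming of EleutherAI's lm-evaluation-harness; this README does not state which harness produced the numbers, but a sketch of reproducing a single metric with that harness (assuming `lm_eval` >= 0.4, `pip install lm-eval`) could look like:

```python
import lm_eval

# Evaluate one benchmark (HellaSwag, 5-shot) against the released checkpoint.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=BAAI/OpenSeek-Small-v1,trust_remote_code=True",
    tasks=["hellaswag"],
    num_fewshot=5,
)
print(results["results"]["hellaswag"])
```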

## Usage Instructions

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The repository ships custom modeling code, so trust_remote_code is required.
model = AutoModelForCausalLM.from_pretrained("BAAI/OpenSeek-Small-v1", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("BAAI/OpenSeek-Small-v1", trust_remote_code=True)

# Greedy generation of a short continuation.
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))
```
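
For less repetitive completions, the standard transformers sampling knobs apply. A small variation of the snippet above (the parameter values here are illustrative, not tuned recommendations):

```python
# Sampled generation: temperature/top_p trade determinism for diversity.
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```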