OpenSeek-Small v1 Model Documentation
Overview
OpenSeek-Small-v1 is the initial production model of the OpenSeek project.
- Utilizes DeepSeek-V3-like MoE architecture.
- Comprises 1.4 billion total parameters, with 0.4 billion activated parameters.
- Trained on 720 billion tokens.
- Demonstrates superior efficiency compared to 1-billion-parameter models.
Training Data
- 0.72TB tokens of high-quality pretraining data and the ratio for each domain is as follows:
Name Ratio Nemotron-CC-high-actual-actual-high 1.26 Nemotron-CC-high-actual-actual-low 0.67 Nemotron-CC-high-actual-actual-mid 2.05 Nemotron-CC-high-synthetic-distill-high 1.59 Nemotron-CC-high-synthetic-distill-low 0.64 Nemotron-CC-high-synthetic-distill-mid 2.32 Nemotron-CC-high-synthetic-diverse_qa_pairs-high 4.67 Nemotron-CC-high-synthetic-diverse_qa_pairs-low 2.16 Nemotron-CC-high-synthetic-diverse_qa_pairs-mid 7.58 Nemotron-CC-high-synthetic-extract_knowledge-high 6.43 Nemotron-CC-high-synthetic-extract_knowledge-low 0.07 Nemotron-CC-high-synthetic-extract_knowledge-mid 2.22 Nemotron-CC-high-synthetic-knowledge_list-high 1.88 Nemotron-CC-high-synthetic-knowledge_list-low 0.74 Nemotron-CC-high-synthetic-knowledge_list-mid 3.20 Nemotron-CC-high-synthetic-wrap_medium-high 3.89 Nemotron-CC-high-synthetic-wrap_medium-low 0.65 Nemotron-CC-high-synthetic-wrap_medium-mid 6.18 Nemotron-CC-low-synthetic-wrap_medium-high 0.17 Nemotron-CC-low-synthetic-wrap_medium-low 0.30 Nemotron-CC-low-synthetic-wrap_medium-mid 1.08 Nemotron-CC-medium-actual-actual-high 2.20 Nemotron-CC-medium-actual-actual-low 4.48 Nemotron-CC-medium-actual-actual-mid 7.76 arxiv 0.32 books 1.98 code 3.43 cot_synthesis_CC 9.82 cot_synthesis_OpenSource 0.46 cot_synthesis_arxiv 4.15 cot_synthesis_code 1.32 cot_synthesis_math 2.19 cot_synthesis_wiki 0.83 math 0.83 pes2o 0.31 stack 0.19 wiki 0.29 zh_cc 9.65
Wandb
Our training curves have been recorded in Weights & Biases wandb.
Evaluation
Category | Metrics (shots) | Llama-3.2-1B | Qwen2.5-1.5B | Qwen2.5-0.5B | OLMo-1B-0724 | OpenSeek-Small-v1 |
---|---|---|---|---|---|---|
English-Commonsense Reasoning | HellaSwag (5-shot) | 0.4830 | 0.5007 | 0.4007 | 0.4909 | 0.3893 |
TruthfulQA (0-shot) | 0.3773 | 0.4663 | 0.3986 | 0.4029 | 0.3990 | |
Winogrande (5-shot) | 0.6212 | 0.6448 | 0.5683 | 0.6290 | 0.5541 | |
CommonsenseQA (5-shot) | 0.3120 | 0.7445 | 0.5487 | 0.1949 | 0.2048 | |
PIQA (5-shot) | 0.7514 | 0.7612 | 0.7111 | 0.7459 | 0.7203 | |
OpenBookQA (5-shot) | 0.2960 | 0.3340 | 0.2720 | 0.3080 | 0.2560 | |
BoolQ (5-shot) | 0.6590 | 0.7774 | 0.6572 | 0.6508 | 0.6165 | |
English-Problem-Solving | ARC Easy (5-shot) | 0.6940 | 0.8043 | 0.6780 | 0.6111 | 0.6237 |
ARC Challenge (5-shot) | 0.3532 | 0.4846 | 0.3370 | 0.3063 | 0.3157 | |
MMLU (5-shot) | 0.3124 | 0.6165 | 0.4818 | 0.2869 | 0.2654 | |
English-Mathematics | GSM8K (5-shot) | 0.0637 | 0.6194 | 0.3495 | 0.0159 | 0.0182 |
Minerva Math (4-shot) | 0.0180 | 0.2876 | 0.1160 | 0.0182 | 0.0010 | |
Chinese | CEval (5-shot) | 0.2779 | 0.6954 | 0.5423 | 0.2340 | 0.2422 |
CMMLU (5-shot) | 0.2687 | 0.6882 | 0.5300 | 0.2570 | 0.2468 | |
Average Metrics | Average-English(w/o Math) | 0.4859 | 0.6134 | 0.5053 | 0.4627 | 0.4345 |
Average-English | 0.4118 | 0.5868 | 0.4599 | 0.3884 | 0.3637 | |
Average-Chinese | 0.2733 | 0.6918 | 0.5362 | 0.2455 | 0.2445 | |
Average | 0.3920 | 0.6018 | 0.4708 | 0.3680 | 0.3466 | |
Average(w/o Math) | 0.4505 | 0.6265 | 0.5105 | 0.4265 | 0.4028 |
OpenSeek-Small-v1 demonstrates superior efficiency compared to 1-billion-parameter models.
Usage Instructions
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("BAAI/OpenSeek-Small-v1",trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("BAAI/OpenSeek-Small-v1",trust_remote_code=True)
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))