
# OpenSeek-Small-v1 Model Documentation

## Overview

OpenSeek-Small-v1 is the initial production model of the OpenSeek project.

- Uses a DeepSeek-V3-like Mixture-of-Experts (MoE) architecture.
- Comprises 1.4 billion total parameters, of which 0.4 billion are activated per token.
- Trained on 720 billion tokens.
- Approaches the performance of 1-billion-parameter dense models while activating fewer parameters per token (see the verification sketch below and the Evaluation section).
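
As a quick sanity check, the total parameter count can be verified locally after downloading the weights. This is a minimal sketch, not part of the official usage; it only confirms the ~1.4B total figure, since the 0.4B activated figure depends on the MoE routing rather than a raw parameter count:

```python
from transformers import AutoModelForCausalLM

# The checkpoint ships custom MoE modeling code, hence trust_remote_code.
model = AutoModelForCausalLM.from_pretrained(
    "BAAI/OpenSeek-Small-v1", trust_remote_code=True
)

# Total parameters should come out near 1.4B; only ~0.4B of these
# are activated per token by the MoE router.
total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total / 1e9:.2f}B")
```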

## Training Data

720 billion (0.72T) tokens of high-quality pretraining data. The sampling ratio for each domain is as follows:

| Name | Ratio (%) |
|---|---|
| Nemotron-CC-high-actual-actual-high | 1.26 |
| Nemotron-CC-high-actual-actual-low | 0.67 |
| Nemotron-CC-high-actual-actual-mid | 2.05 |
| Nemotron-CC-high-synthetic-distill-high | 1.59 |
| Nemotron-CC-high-synthetic-distill-low | 0.64 |
| Nemotron-CC-high-synthetic-distill-mid | 2.32 |
| Nemotron-CC-high-synthetic-diverse_qa_pairs-high | 4.67 |
| Nemotron-CC-high-synthetic-diverse_qa_pairs-low | 2.16 |
| Nemotron-CC-high-synthetic-diverse_qa_pairs-mid | 7.58 |
| Nemotron-CC-high-synthetic-extract_knowledge-high | 6.43 |
| Nemotron-CC-high-synthetic-extract_knowledge-low | 0.07 |
| Nemotron-CC-high-synthetic-extract_knowledge-mid | 2.22 |
| Nemotron-CC-high-synthetic-knowledge_list-high | 1.88 |
| Nemotron-CC-high-synthetic-knowledge_list-low | 0.74 |
| Nemotron-CC-high-synthetic-knowledge_list-mid | 3.20 |
| Nemotron-CC-high-synthetic-wrap_medium-high | 3.89 |
| Nemotron-CC-high-synthetic-wrap_medium-low | 0.65 |
| Nemotron-CC-high-synthetic-wrap_medium-mid | 6.18 |
| Nemotron-CC-low-synthetic-wrap_medium-high | 0.17 |
| Nemotron-CC-low-synthetic-wrap_medium-low | 0.30 |
| Nemotron-CC-low-synthetic-wrap_medium-mid | 1.08 |
| Nemotron-CC-medium-actual-actual-high | 2.20 |
| Nemotron-CC-medium-actual-actual-low | 4.48 |
| Nemotron-CC-medium-actual-actual-mid | 7.76 |
| arxiv | 0.32 |
| books | 1.98 |
| code | 3.43 |
| cot_synthesis_CC | 9.82 |
| cot_synthesis_OpenSource | 0.46 |
| cot_synthesis_arxiv | 4.15 |
| cot_synthesis_code | 1.32 |
| cot_synthesis_math | 2.19 |
| cot_synthesis_wiki | 0.83 |
| math | 0.83 |
| pes2o | 0.31 |
| stack | 0.19 |
| wiki | 0.29 |
| zh_cc | 9.65 |
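
The ratios sum to roughly 100, i.e. they are percentage weights over the mixture. A hypothetical sketch of how such weights translate into per-domain sampling probabilities during pretraining (the `DOMAIN_RATIOS` dict and this loop are illustrative, not the project's actual data loader):

```python
import random

# Illustrative subset of the domain weights from the table above (percent).
DOMAIN_RATIOS = {
    "cot_synthesis_CC": 9.82,
    "zh_cc": 9.65,
    "Nemotron-CC-medium-actual-actual-mid": 7.76,
    "code": 3.43,
    "wiki": 0.29,
}

# Normalize to probabilities and draw source domains for the next documents.
total = sum(DOMAIN_RATIOS.values())
domains = list(DOMAIN_RATIOS)
weights = [w / total for w in DOMAIN_RATIOS.values()]

batch = random.choices(domains, weights=weights, k=8)
print(batch)  # e.g. ['zh_cc', 'cot_synthesis_CC', 'code', ...]
```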

## Wandb

Our training curves are recorded in Weights & Biases (wandb).

## Evaluation

| Category | Metric (shots) | Llama-3.2-1B | Qwen2.5-1.5B | Qwen2.5-0.5B | OLMo-1B-0724 | OpenSeek-Small-v1 |
|---|---|---|---|---|---|---|
| English-Commonsense Reasoning | HellaSwag (5-shot) | 0.4830 | 0.5007 | 0.4007 | 0.4909 | 0.3893 |
| | TruthfulQA (0-shot) | 0.3773 | 0.4663 | 0.3986 | 0.4029 | 0.3990 |
| | Winogrande (5-shot) | 0.6212 | 0.6448 | 0.5683 | 0.6290 | 0.5541 |
| | CommonsenseQA (5-shot) | 0.3120 | 0.7445 | 0.5487 | 0.1949 | 0.2048 |
| | PIQA (5-shot) | 0.7514 | 0.7612 | 0.7111 | 0.7459 | 0.7203 |
| | OpenBookQA (5-shot) | 0.2960 | 0.3340 | 0.2720 | 0.3080 | 0.2560 |
| | BoolQ (5-shot) | 0.6590 | 0.7774 | 0.6572 | 0.6508 | 0.6165 |
| English-Problem-Solving | ARC Easy (5-shot) | 0.6940 | 0.8043 | 0.6780 | 0.6111 | 0.6237 |
| | ARC Challenge (5-shot) | 0.3532 | 0.4846 | 0.3370 | 0.3063 | 0.3157 |
| | MMLU (5-shot) | 0.3124 | 0.6165 | 0.4818 | 0.2869 | 0.2654 |
| English-Mathematics | GSM8K (5-shot) | 0.0637 | 0.6194 | 0.3495 | 0.0159 | 0.0182 |
| | Minerva Math (4-shot) | 0.0180 | 0.2876 | 0.1160 | 0.0182 | 0.0010 |
| Chinese | CEval (5-shot) | 0.2779 | 0.6954 | 0.5423 | 0.2340 | 0.2422 |
| | CMMLU (5-shot) | 0.2687 | 0.6882 | 0.5300 | 0.2570 | 0.2468 |
| Average Metrics | Average-English (w/o Math) | 0.4859 | 0.6134 | 0.5053 | 0.4627 | 0.4345 |
| | Average-English | 0.4118 | 0.5868 | 0.4599 | 0.3884 | 0.3637 |
| | Average-Chinese | 0.2733 | 0.6918 | 0.5362 | 0.2455 | 0.2445 |
| | Average | 0.3920 | 0.6018 | 0.4708 | 0.3680 | 0.3466 |
| | Average (w/o Math) | 0.4505 | 0.6265 | 0.5105 | 0.4265 | 0.4028 |

While activating only 0.4 billion parameters per token, OpenSeek-Small-v1 approaches the average scores of the 1-billion-parameter dense baselines (most closely OLMo-1B-0724), demonstrating strong efficiency per activated parameter.

*(Figure: logC_vs_Metric_Average)*
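
The benchmark names and shot counts above match the task naming of EleutherAI's lm-evaluation-harness; this README does not state which harness produced the numbers, but a sketch of reproducing a single metric with that harness (assuming `lm_eval` >= 0.4, `pip install lm-eval`) could look like:

```python
import lm_eval

# Evaluate one benchmark (HellaSwag, 5-shot) against the released checkpoint.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=BAAI/OpenSeek-Small-v1,trust_remote_code=True",
    tasks=["hellaswag"],
    num_fewshot=5,
)
print(results["results"]["hellaswag"])
```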

## Usage Instructions

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The repository ships custom modeling code, so trust_remote_code is required.
model = AutoModelForCausalLM.from_pretrained("BAAI/OpenSeek-Small-v1", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("BAAI/OpenSeek-Small-v1", trust_remote_code=True)

# Greedy generation of a short continuation.
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))
```
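
For less repetitive completions, the standard transformers sampling knobs apply. A small variation of the snippet above (the parameter values here are illustrative, not tuned recommendations):

```python
# Sampled generation: temperature/top_p trade determinism for diversity.
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```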