---
language: en
tags:
  - pytorch
  - causal-lm
  - language-model
  - text-generation
license: mit
datasets:
  - wikitext
metrics:
  - perplexity
  - loss
library_name: pytorch
pipeline_tag: text-generation
model-index:
  - name: SmolLM-125M
    results:
      - task:
          type: text-generation
          name: Language Modeling
        dataset:
          type: wikitext
          name: WikiText-2
        metrics:
          - type: perplexity
            value: to_be_updated
          - type: loss
            value: to_be_updated
---

# SmolLM-125M: A Lightweight Language Model for Consumer Hardware

This is a 125M parameter language model designed to be trained and run on consumer hardware with limited VRAM (4GB+). The model follows a GPT-style architecture but is optimized for efficiency and memory usage.

## Model Details

- **Architecture:** GPT-style Transformer
- **Parameters:** 125M
- **Context Length:** 512 tokens
- **Vocabulary:** 50,257 tokens (GPT-2 tokenizer)
- **Training Data:** WikiText-2
- **Hardware Requirements:** 4GB+ VRAM GPU

## Architecture Specifications

- **Layers:** 12 transformer blocks
- **Attention Heads:** 12
- **Embedding Dimension:** 768
- **Activation:** GELU
- **Layer Normalization:** Pre-norm
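
These dimensions account for roughly the stated 125M parameters. A back-of-the-envelope sketch (assuming a 4x MLP expansion and GPT-2-style weight tying between the token embedding and the output head, which are common conventions but not confirmed here):

```python
# Rough parameter count from the specifications above; biases and layer
# norms are omitted because they contribute well under 1M parameters.
vocab_size, block_size = 50257, 512
n_layer, n_embd = 12, 768

token_emb = vocab_size * n_embd             # ~38.6M (shared with the output head if tied)
pos_emb = block_size * n_embd               # ~0.4M
attn_per_layer = 4 * n_embd * n_embd        # Q, K, V and output projections
mlp_per_layer = 2 * n_embd * (4 * n_embd)   # up- and down-projections with 4x expansion
blocks = n_layer * (attn_per_layer + mlp_per_layer)

total = token_emb + pos_emb + blocks
print(f"~{total / 1e6:.0f}M parameters")    # ~124M
```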

## Training Details

- **Hardware Used:** GTX 1650 (4GB VRAM)
- **Training Time:** ~4 hours
- **Batch Size:** 4 (effective batch size 16 with gradient accumulation)
- **Learning Rate:** 3e-4 with cosine decay
- **Weight Decay:** 0.1
- **Optimizer:** AdamW (see the training sketch below)
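
A minimal sketch of how these settings fit together in PyTorch (illustrative, not the repository's actual training script; the `train` function, its arguments, and the assumption that `model(x, y)` returns `(logits, loss)` are all hypothetical):

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def train(model, train_loader, total_optimizer_steps, accum_steps=4, device="cuda"):
    """One pass over the data with gradient accumulation and cosine LR decay."""
    optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
    scheduler = CosineAnnealingLR(optimizer, T_max=total_optimizer_steps)

    model.train()
    optimizer.zero_grad()
    for step, (x, y) in enumerate(train_loader):
        x, y = x.to(device), y.to(device)
        _, loss = model(x, y)                # assumed signature: forward returns (logits, loss)
        (loss / accum_steps).backward()      # scale so accumulated gradients average correctly
        if (step + 1) % accum_steps == 0:    # micro-batch of 4 -> effective batch size of 16
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```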

## Memory Optimizations

1. Length-based batch scheduling (see the sketch below)
2. Gradient accumulation (4 steps)
3. Dynamic batch scheduling
4. Pre-padded sequences
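
The training code is not reproduced here, but as an illustration, length-based batch scheduling with pre-padding can be approximated by grouping examples of similar token length so that each batch needs minimal padding. A hypothetical sketch:

```python
def length_bucketed_batches(token_sequences, batch_size=4, pad_id=0):
    """Yield batches of similarly sized sequences, pre-padded to a common length."""
    order = sorted(range(len(token_sequences)), key=lambda i: len(token_sequences[i]))
    for start in range(0, len(order), batch_size):
        batch = [token_sequences[i] for i in order[start:start + batch_size]]
        max_len = max(len(seq) for seq in batch)
        # Pre-pad every sequence in the batch (optimizations 1 and 4 above).
        yield [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
```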

## Usage

```python
from transformers import AutoTokenizer
from model import SmallLanguageModel, ModelConfig

# Initialize model
config = ModelConfig(
    vocab_size=50257,
    block_size=512,
    n_layer=12,
    n_head=12,
    n_embd=768,
    dropout=0.1,
    bias=True
)
model = SmallLanguageModel(config)

# Generate text
tokenizer = AutoTokenizer.from_pretrained("gpt2")
input_text = "Once upon a time"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_length=100)
generated_text = tokenizer.decode(output_ids[0])
```
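
Note that `SmallLanguageModel(config)` above creates a randomly initialized model, so trained weights need to be loaded before generation. A minimal sketch, assuming the checkpoint was saved as a plain PyTorch `state_dict` (the filename is a placeholder, not the actual file in this repository):

```python
import torch

state_dict = torch.load("checkpoint.pt", map_location="cpu")  # placeholder path
model.load_state_dict(state_dict)
model.eval()  # disable dropout during generation
```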

## Limitations

- Limited context window (512 tokens)
- Smaller capacity compared to larger models
- Training data limited to WikiText-2

## License

This model is released under the MIT License.