
# Models

## Vision Backbone (ViT)

This is a very lightweight Vision Transformer written in native PyTorch. I took inspiration from the following sources:
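To give a rough sense of the shape of such a model, here is a minimal, self-contained sketch of a patch-embedding ViT encoder in plain PyTorch. The class name, dimensions, and use of `nn.TransformerEncoder` are illustrative choices for this README, not the actual nanoVLM implementation.

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Minimal ViT sketch: patchify -> add positional embeddings -> transformer encoder."""
    def __init__(self, img_size=224, patch_size=16, dim=256, depth=4, heads=4):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patch embedding as a strided convolution
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.patch_embed(x)              # (B, dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)     # (B, num_patches, dim)
        x = x + self.pos_embed
        return self.encoder(x)               # (B, num_patches, dim)

# tokens = MiniViT()(torch.randn(1, 3, 224, 224))  # -> (1, 196, 256)
```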

## Language Model (Llama / SmolLM)

This is a decoder-only LM following the Llama 2/3 architecture. Inspiration came from the following sources:
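Below is a hedged sketch of a tiny decoder-only LM with pre-norm blocks and a causal attention mask. RoPE, KV caching, and the gated SwiGLU MLP are omitted for brevity, and all names and sizes are illustrative rather than the real model's; `nn.RMSNorm` requires PyTorch >= 2.4.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One pre-norm decoder block: causal self-attention + MLP (RoPE omitted for brevity)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.norm1 = nn.RMSNorm(dim)   # Llama uses RMSNorm; available in PyTorch >= 2.4
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.RMSNorm(dim)
        # Simplified MLP; the real architecture uses a gated SwiGLU feed-forward
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                          # x: (B, T, dim)
        T = x.size(1)
        # Boolean causal mask: True above the diagonal means "cannot attend"
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.norm1(x)
        h, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + h
        return x + self.mlp(self.norm2(x))

class MiniDecoderLM(nn.Module):
    """Token embedding -> stacked decoder blocks -> LM head."""
    def __init__(self, vocab=32000, dim=256, depth=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList([DecoderBlock(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, vocab, bias=False)

    def forward(self, ids):                        # ids: (B, T)
        x = self.embed(ids)
        for blk in self.blocks:
            x = blk(x)
        return self.head(x)                        # (B, T, vocab) logits
```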

## Modality Projection

This is a simple MLP (a single linear layer) that projects the image patch encodings into the language embedding space, combined with a simple pixel shuffle (https://arxiv.org/pdf/2504.05299).
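The idea is roughly: merge each r x r neighbourhood of patch tokens into one token (the pixel shuffle, which shortens the sequence by r^2 while multiplying the feature dimension by r^2), then map the result into the language embedding space with one linear layer. Here is a minimal sketch with illustrative dimensions, not the actual nanoVLM code:

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Pixel shuffle (merge r x r patch neighbours) followed by a single linear layer."""
    def __init__(self, vit_dim=256, lm_dim=576, r=2):
        super().__init__()
        self.r = r
        self.proj = nn.Linear(vit_dim * r * r, lm_dim)

    def forward(self, x):                          # x: (B, N, D), N must be a square grid
        B, N, D = x.shape
        s = int(N ** 0.5)
        r = self.r
        x = x.view(B, s, s, D)
        # Split the grid into r x r blocks and fold each block into the feature dimension
        x = x.view(B, s // r, r, s // r, r, D)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, (s // r) ** 2, r * r * D)
        return self.proj(x)                        # (B, N / r^2, lm_dim)

# out = ModalityProjector()(torch.randn(1, 196, 256))  # -> (1, 49, 576)
```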

## Vision-Language-Model

This brings all the individual parts together and handles the concatenation of image and text embeddings. It is built as a simplified version of SmolVLM (https://arxiv.org/pdf/2504.05299).
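As a rough sketch of how the pieces could be wired together (reusing the illustrative classes from the sketches above, again not the actual nanoVLM code): the image is encoded and projected, the resulting image tokens are prepended to the text embeddings, and the decoder runs over the joint sequence.

```python
import torch
import torch.nn as nn

class MiniVLM(nn.Module):
    """Sketch of the full pipeline: ViT -> modality projector -> decoder LM over [image; text]."""
    def __init__(self, vit, projector, lm):
        super().__init__()
        self.vit, self.projector, self.lm = vit, projector, lm

    def forward(self, image, input_ids):
        img_emb = self.projector(self.vit(image))      # (B, N_img, lm_dim)
        txt_emb = self.lm.embed(input_ids)             # (B, T, lm_dim)
        x = torch.cat([img_emb, txt_emb], dim=1)       # image tokens prepended to text
        for blk in self.lm.blocks:                     # run the decoder over the joint sequence
            x = blk(x)
        return self.lm.head(x)                         # (B, N_img + T, vocab) logits

# vlm = MiniVLM(MiniViT(dim=256),
#               ModalityProjector(vit_dim=256, lm_dim=576),
#               MiniDecoderLM(dim=576))
# logits = vlm(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 16)))
```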