amalad committed
Commit 1fbc10a · 1 Parent(s): 67c7cea

Update README and examples

Files changed (2):
  1. README.md +10 -4
  2. examples.py +1 -1
README.md CHANGED
@@ -5,13 +5,13 @@ license_link: >-
   https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
 ---

-# Llama-Nemotron-Nano-VL-8B-V1
+# Llama-3.1-Nemotron-Nano-VL-8B-V1

 ## Model Overview

 ### Description

-Llama-Nemotron-Nano-VL-8B-V1 is a leading document intelligence vision language model (VLMs) that enables the ability to query and summarize images and video from the physical or virtual world. Llama-Nemotron-Nano-VL-8B-V1 is deployable in the data center, cloud and at the edge, including Jetson Orin and laptop by AWQ 4bit quantization through TinyChat framework. We find: (1) image-text pairs are not enough, interleaved image-text is essential; (2) unfreezing LLM during interleaved image-text pre-training enables in-context learning; (3)re-blending text-only instruction data is crucial to boost both VLM and text-only performance.
+Llama Nemotron Nano VL is a leading document intelligence vision language model (VLM) that can query and summarize images and video from the physical or virtual world. Llama Nemotron Nano VL is deployable in the data center, in the cloud, and at the edge, including on Jetson Orin and laptops via AWQ 4-bit quantization through the TinyChat framework. We find that: (1) image-text pairs are not enough; interleaved image-text data is essential; (2) unfreezing the LLM during interleaved image-text pre-training enables in-context learning; (3) re-blending text-only instruction data is crucial to boost both VLM and text-only performance.

 This model was trained on commercial images and videos for all three stages of training and supports single image and video inference.

@@ -94,13 +94,19 @@ Supported Operating System(s): Linux<br>
 ### Model Versions:
 Llama-3.1-Nemotron-Nano-VL-8B-V1

-## Usage
+## Quick Start
+
+### Install Dependencies
+```
+pip install transformers accelerate timm einops open-clip-torch
+```
+
+### Usage
 ```python
 from PIL import Image
 from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

-path = "nvidia/Llama-Nemotron-Nano-VL-8B-V1"
+path = "nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1"
 model = AutoModel.from_pretrained(path, trust_remote_code=True, device_map="cuda").eval()
 tokenizer = AutoTokenizer.from_pretrained(path)
 image_processor = AutoImageProcessor.from_pretrained(path, trust_remote_code=True, device="cuda")
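The usage snippet in this hunk stops after loading the model, tokenizer, and image processor. A minimal sketch of how inference might continue is below; the `chat` helper, its keyword arguments, and the `example.png` input are assumptions (InternVL-style remote code), not something this diff confirms.

```python
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

path = "nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1"
model = AutoModel.from_pretrained(path, trust_remote_code=True, device_map="cuda").eval()
tokenizer = AutoTokenizer.from_pretrained(path)
image_processor = AutoImageProcessor.from_pretrained(path, trust_remote_code=True, device="cuda")

# Preprocess a local image; "example.png" is a hypothetical input file.
image = Image.open("example.png")
image_features = image_processor([image])  # assumed to return model-ready tensors (e.g. pixel_values)

# ASSUMPTION: the repo's trust_remote_code model class exposes an
# InternVL-style `chat` helper; the exact name and signature may differ.
generation_config = dict(max_new_tokens=256, do_sample=False)
response = model.chat(
    tokenizer=tokenizer,
    question="Describe the image.",
    generation_config=generation_config,
    **image_features,
)
print(response)
```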
examples.py CHANGED
@@ -2,7 +2,7 @@ import torch
 from transformers import AutoTokenizer, AutoModel, AutoImageProcessor
 from PIL import Image

-path = "nvidia/Llama-Nemotron-Nano-VL-8B-V1"
+path = "nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1"
 model = AutoModel.from_pretrained(
     path,
     torch_dtype=torch.bfloat16,
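The hunk's context window cuts off inside the `from_pretrained` call. For orientation, a plausible shape of the full load in examples.py is sketched below; every argument past `torch_dtype` is an assumption inferred from the README snippet, not the file's actual contents.

```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoImageProcessor

path = "nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1"

model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,   # visible in the hunk: half-precision weights
    trust_remote_code=True,       # ASSUMPTION: mirrors the README's loading call
    device_map="cuda",            # ASSUMPTION: place the model on a single GPU
).eval()
tokenizer = AutoTokenizer.from_pretrained(path)
image_processor = AutoImageProcessor.from_pretrained(path, trust_remote_code=True, device="cuda")
```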