gaunernst committed on
Commit 0bcfd35 · verified · 1 Parent(s): ad6960f

Update README.md

Files changed (1)
  1. README.md +31 -96
README.md CHANGED
@@ -1,13 +1,11 @@
---
+ base_model: google/gemma-3-4b-it
license: gemma
- library_name: transformers
+ tags:
+ - gemma3
+ - gemma
+ - google
pipeline_tag: image-text-to-text
- extra_gated_heading: Access Gemma on Hugging Face
- extra_gated_prompt: To access Gemma on Hugging Face, you’re required to review and
-   agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging
-   Face and click below. Requests are processed immediately.
- extra_gated_button_content: Acknowledge license
- base_model: google/gemma-3-4b-it
---

# Gemma 3 4B Instruction-tuned QAT compressed-tensors
@@ -26,6 +24,16 @@ Below is the original model card.

**Model Page**: [Gemma](https://ai.google.dev/gemma/docs/core)

+ > [!Note]
+ > This repository corresponds to the 4B **instruction-tuned** version of the Gemma 3 model in GGUF format using Quantization Aware Training (QAT).
+ > The GGUF corresponds to Q4_0 quantization.
+ >
+ > Thanks to QAT, the model is able to preserve quality similar to `bfloat16` while significantly reducing the memory requirements
+ > to load the model.
+ >
+ > You can find the half-precision version [here](https://huggingface.co/google/gemma-3-4b-it).
+
+
**Resources and Technical Documentation**:

* [Gemma 3 Technical Report][g3-tech-report]
@@ -72,106 +80,29 @@ for everyone.

### Usage

- Below, there are some code snippets on how to get quickly started with running the model. First, install the Transformers library. Gemma 3 is supported starting from transformers 4.50.0.
+ Below are some code snippets to help you quickly get started with running the model.
+
+ **llama.cpp (text-only)**

```sh
- $ pip install -U transformers
+ ./llama-cli -hf google/gemma-3-4b-it-qat-q4_0-gguf -p "Write a poem about the Kraken."
```

- Then, copy the snippet from the section that is relevant for your use case.
-
- #### Running with the `pipeline` API
+ **llama.cpp (image input)**

- You can initialize the model and processor for inference with `pipeline` as follows.
-
- ```python
- from transformers import pipeline
- import torch
-
- pipe = pipeline(
-     "image-text-to-text",
-     model="google/gemma-3-4b-it",
-     device="cuda",
-     torch_dtype=torch.bfloat16
- )
- ```
-
- With instruction-tuned models, you need to use chat templates to process our inputs first. Then, you can pass it to the pipeline.
-
- ```python
- messages = [
-     {
-         "role": "system",
-         "content": [{"type": "text", "text": "You are a helpful assistant."}]
-     },
-     {
-         "role": "user",
-         "content": [
-             {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
-             {"type": "text", "text": "What animal is on the candy?"}
-         ]
-     }
- ]
-
- output = pipe(text=messages, max_new_tokens=200)
- print(output[0]["generated_text"][-1]["content"])
- # Okay, let's take a look!
- # Based on the image, the animal on the candy is a **turtle**.
- # You can see the shell shape and the head and legs.
+ ```sh
+ wget https://github.com/bebechien/gemma/blob/main/surprise.png?raw=true -O ~/Downloads/surprise.png
+ ./llama-gemma3-cli -hf google/gemma-3-4b-it-qat-q4_0-gguf -p "Describe this image." --image ~/Downloads/surprise.png
```

- #### Running the model on a single/multi GPU
-
- ```python
- # pip install accelerate
+ **ollama (text only)**

- from transformers import AutoProcessor, Gemma3ForConditionalGeneration
- from PIL import Image
- import requests
- import torch
+ Using GGUFs with Ollama via Hugging Face does not support image inputs at the moment. Please check the [docs on running gated repositories](https://huggingface.co/docs/hub/en/ollama#run-private-ggufs-from-the-hugging-face-hub).

- model_id = "google/gemma-3-4b-it"
-
- model = Gemma3ForConditionalGeneration.from_pretrained(
-     model_id, device_map="auto"
- ).eval()
-
- processor = AutoProcessor.from_pretrained(model_id)
-
- messages = [
-     {
-         "role": "system",
-         "content": [{"type": "text", "text": "You are a helpful assistant."}]
-     },
-     {
-         "role": "user",
-         "content": [
-             {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
-             {"type": "text", "text": "Describe this image in detail."}
-         ]
-     }
- ]
-
- inputs = processor.apply_chat_template(
-     messages, add_generation_prompt=True, tokenize=True,
-     return_dict=True, return_tensors="pt"
- ).to(model.device, dtype=torch.bfloat16)
-
- input_len = inputs["input_ids"].shape[-1]
-
- with torch.inference_mode():
-     generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
-     generation = generation[0][input_len:]
-
- decoded = processor.decode(generation, skip_special_tokens=True)
- print(decoded)
-
- # **Overall Impression:** The image is a close-up shot of a vibrant garden scene,
- # focusing on a cluster of pink cosmos flowers and a busy bumblebee.
- # It has a slightly soft, natural feel, likely captured in daylight.
+ ```sh
+ ollama run hf.co/google/gemma-3-4b-it-qat-q4_0-gguf
```

-
### Citation

```none
@@ -270,6 +201,10 @@ development workflow."*

## Evaluation

+ > [!Note]
+ > The evaluation in this section corresponds to the original checkpoint, not the QAT checkpoint.
+ >
+
Model evaluation metrics and results.

### Benchmark Results
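
As a rough sanity check on the memory-saving claim in the added QAT note, here is a minimal back-of-envelope sketch. It is not part of the commit: the ~4B parameter count and the standard Q4_0 layout (4-bit weights plus one fp16 scale per 32-weight block, about 4.5 bits per weight) are assumptions, and the figures ignore file metadata, the KV cache, and activations.

```python
# Back-of-envelope weight-storage estimate: bfloat16 vs. Q4_0.
# Assumptions (not from the model card): ~4B parameters, Q4_0 stored as
# 4-bit weights plus one fp16 scale per 32-weight block (~4.5 bits/weight).

N_PARAMS = 4e9                       # assumed parameter count, order of magnitude only
BF16_BITS_PER_WEIGHT = 16            # bfloat16: 16 bits per weight
Q4_0_BITS_PER_WEIGHT = 4 + 16 / 32   # 4-bit quant + fp16 scale per 32-weight block

def weight_gib(bits_per_weight: float, n_params: float = N_PARAMS) -> float:
    """GiB needed to store n_params weights at the given bits per weight."""
    return n_params * bits_per_weight / 8 / 2**30

print(f"bf16 : ~{weight_gib(BF16_BITS_PER_WEIGHT):.1f} GiB")  # ~7.5 GiB
print(f"Q4_0 : ~{weight_gib(Q4_0_BITS_PER_WEIGHT):.1f} GiB")  # ~2.1 GiB
```

Actual GGUF file sizes will differ somewhat, since some tensors (for example, embeddings) may be stored at higher precision and the file also carries metadata.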