litert-community/Gemma3-1B-IT
This repository provides several variants of google/Gemma-3-1B-IT that are ready for deployment on Android using the LiteRT (formerly TensorFlow Lite) stack and the MediaPipe LLM Inference API.
Use the models
Colab
Disclaimer: The target deployment surfaces for the LiteRT models are Android, iOS, and Web, and the stack has been optimized for performance on those targets. Trying the models out in Colab is an easier way to familiarize yourself with the LiteRT stack, with the caveat that performance (memory and latency) in Colab can be much worse than on a local device.
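As a hint of what that looks like, here is a minimal sketch of loading one of these models with the LiteRT Python API in Colab. The package and interpreter API are real, but the filename below is an assumption; check this repo's file list (and the official Colab) for the actual names.

```python
# Minimal sketch: download a model from this repo and inspect its
# signatures with the LiteRT Python API. The filename is a placeholder --
# substitute an actual .tflite file from this repo.
from huggingface_hub import hf_hub_download
from ai_edge_litert.interpreter import Interpreter

model_path = hf_hub_download(
    repo_id="litert-community/Gemma3-1B-IT",
    filename="gemma3-1b-it-int4.tflite",  # hypothetical filename
)

interpreter = Interpreter(model_path=model_path)

# LLM exports in LiteRT expose separate prefill and decode signatures;
# listing them shows which entry points this model provides.
for name, sig in interpreter.get_signature_list().items():
    print(name, "inputs:", sig["inputs"])
```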
Customize
Fine-tune Gemma 3 1B and deploy it with either LiteRT or the MediaPipe LLM Inference API.
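As a rough sketch of the fine-tuning half of that workflow, LoRA fine-tuning of the base google/gemma-3-1b-it checkpoint with Hugging Face PEFT might look like the following. The hyperparameters and target modules are illustrative assumptions, not the settings used for the variants in this repo; converting the merged model to LiteRT or a MediaPipe task bundle is a separate step covered by the fine-tuning guide.

```python
# Sketch of LoRA fine-tuning google/gemma-3-1b-it with Hugging Face PEFT.
# Hyperparameters and target modules are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "google/gemma-3-1b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# ... train with your dataset (e.g. transformers.Trainer), then merge the
# adapters and convert the merged model with the AI Edge tooling before
# deploying via LiteRT or the MediaPipe LLM Inference API.
```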
Android
- Download and install the APK.
- Follow the instructions in the app.
To build the demo app from source, please follow the instructions from the GitHub repository.
iOS
- Clone the MediaPipe samples repository and follow the instructions to build the LLM Inference iOS Sample App using Xcode.
- Run the app via the iOS simulator or deploy to an iOS device.
Performance
Android
Note that all benchmark stats are from a Samsung S24 Ultra, with multiple prefill signatures enabled.
| Backend | Quantization scheme | Context length | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | CPU Memory (RSS in MB) | GPU Memory (RSS in MB) | Model size (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CPU | fp32 (baseline) | 1280 | 49 | 10 | 5.59 | 4,123 | | 3,824 |
| CPU | dynamic_int4 (block size 128) | 1280 | 138 | 50 | 2.33 | 982 | | 657 |
| CPU | dynamic_int4 (block size 128) | 4096 | 87 | 37 | 3.40 | 1,145 | | 657 |
| CPU | dynamic_int4 (block size 32) | 1280 | 107 | 48 | 3.49 | 1,045 | | 688 |
| CPU | dynamic_int4 (block size 32) | 4096 | 79 | 36 | 4.40 | 1,210 | | 688 |
| CPU | dynamic_int4 QAT | 2048 | 322 | 47 | 3.10 | 1,138 | | 529 |
| CPU | dynamic_int8 | 1280 | 177 | 33 | 1.69 | 1,341 | | 1,005 |
| CPU | dynamic_int8 | 4096 | 123 | 29 | 2.34 | 1,504 | | 1,005 |
| GPU | dynamic_int4 QAT | 2048 | 2585 | 56 | 4.50 | 1,205 | | 529 |
| GPU | dynamic_int8 | 1280 | 1191 | 24 | 4.68 | 2,164 | 1,059 | 1,005 |
| GPU | dynamic_int8 | 4096 | 814 | 24 | 4.99 | 2,167 | 1,181 | 1,005 |
- For the list of supported quantization schemes, see supported-schemes. These models use prefill signature lengths of 32, 128, 512, and 1280 (see the sketch after this list).
- Model size: measured by the size of the .tflite flatbuffer (serialization format for LiteRT models)
- Memory: indicator of peak RAM usage
- CPU inference is accelerated via the LiteRT XNNPACK delegate with 4 threads
- Benchmarks are run with the cache enabled and initialized; during the first run, time-to-first-token may differ.
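To make the "multiple prefill signatures" note concrete: the export contains one prefill entry point per supported prefill length, and a runtime picks the smallest one that fits the prompt. Below is a hedged sketch of selecting one with the LiteRT Python API; the signature names are assumptions based on common LiteRT LLM exports, so verify them with get_signature_list().

```python
# Sketch: select a prefill signature and the decode signature by name.
# Signature names are assumptions; num_threads=4 mirrors the XNNPACK
# setting used in the CPU benchmarks above.
from ai_edge_litert.interpreter import Interpreter

interpreter = Interpreter(
    model_path="gemma3-1b-it-int4.tflite",  # placeholder filename
    num_threads=4,
)
prefill = interpreter.get_signature_runner("prefill_128")  # hypothetical name
decode = interpreter.get_signature_runner("decode")        # hypothetical name
print(prefill.get_input_details())
```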
Web
Note that all benchmark stats are from a MacBook Pro 2024 (Apple M4 Max chip), with a KV cache size of 1280, 1024 tokens of prefill, and 256 tokens of decode.
| Backend | Quantization scheme | Precision | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | CPU Memory (MB) | GPU Memory (MB) | Model size (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPU | dynamic_int4 | F16 | 4339 | 133 | 0.51 | 460 | 1,331 | 700 |
| GPU | dynamic_int4 | F32 | 2837 | 134 | 0.49 | 481 | 1,331 | 700 |
| GPU | dynamic_int4 QAT | F16 | 1702 | 77 | | | | 529 |
| GPU | dynamic_int8 | F16 | 4321 | 126 | 0.60 | 471 | 1,740 | 1,011 |
| GPU | dynamic_int8 | F32 | 2805 | 129 | 0.58 | 474 | 1,740 | 1,011 |
- Model size: measured by the size of the .tflite flatbuffer (serialization format for LiteRT models)
- dynamic_int4: quantized model with int4 weights and float activations.
- dynamic_int8: quantized model with int8 weights and float activations.
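To make the "block size" wording in the tables above concrete, here is a toy, self-contained sketch of block-wise int4 weight quantization. It only illustrates the idea, not the actual LiteRT quantizer.

```python
# Toy illustration of block-wise int4 weight quantization ("dynamic_int4"):
# weights are split into blocks, each block gets its own scale, and
# activations stay in float. Not the LiteRT quantizer.
import numpy as np

def quantize_int4_blockwise(weights: np.ndarray, block_size: int = 32):
    """Quantize a 1-D float array to int4 values with one scale per block."""
    blocks = weights.reshape(-1, block_size)
    # Symmetric quantization: int4 covers [-8, 7], so scale by max |w|.
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales = np.maximum(scales, 1e-8)  # avoid division by zero
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(128).astype(np.float32)
q, s = quantize_int4_blockwise(w, block_size=32)
print("max reconstruction error:", np.abs(w - dequantize(q, s)).max())
```

A smaller block size means more per-block scales, which improves reconstruction accuracy at the cost of a slightly larger model; this matches the Android table, where the block size 32 variant is 688 MB versus 657 MB for block size 128.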