Access Gemma3-1B-IT on Hugging Face

This repository is publicly accessible, but you must review and agree to the Gemma license before accessing its files and content. Make sure you are logged in to Hugging Face; access requests are processed immediately.

litert-community/Gemma3-1B-IT

This repository provides several variants of google/Gemma-3-1B-IT that are ready for on-device deployment on Android, iOS, and the Web using the LiteRT (formerly TFLite) stack and the MediaPipe LLM Inference API.

Use the models

Colab

Disclaimer: The target deployment surfaces for the LiteRT models are Android, iOS, and the Web, and the stack has been optimized for performance on these targets. Trying out the models in Colab is an easy way to familiarize yourself with the LiteRT stack, with the caveat that performance (memory and latency) in Colab can be much worse than on a local device.

Open In Colab

Customize

Fine-tune Gemma 3 1B and deploy it with either LiteRT or the MediaPipe LLM Inference API:

Open In Colab

Android

  • Download and install the APK.
  • Follow the instructions in the app.

To build the demo app from source, please follow the instructions from the GitHub repository.
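
If you would rather integrate the model into your own app than use the prebuilt demo, the MediaPipe LLM Inference API can load one of the .task bundles from this repository directly. The Kotlin sketch below is a minimal, illustrative example only: the bundle file name and on-device path are placeholders for wherever you copy the downloaded model, and generation parameters are left at their defaults.

```kotlin
// Minimal sketch: running a Gemma3-1B-IT .task bundle on Android with the
// MediaPipe LLM Inference API (Gradle artifact com.google.mediapipe:tasks-genai).
// The model path below is a placeholder; point it at wherever you copied the bundle.
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

fun runGemma(context: Context, prompt: String): String {
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/llm/gemma3-1b-it-int4.task") // placeholder path
        .setMaxTokens(1024) // combined prompt + response token budget
        .build()

    val llm = LlmInference.createFromOptions(context, options)
    val response = llm.generateResponse(prompt) // synchronous, single-turn generation
    llm.close() // release native resources when done
    return response
}
```

The same API also supports streaming output via an asynchronous generation variant; see the MediaPipe LLM Inference documentation for details.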

iOS

  • Clone the MediaPipe samples repository and follow the instructions to build the LLM Inference iOS Sample App using Xcode.
  • Run the app via the iOS simulator or deploy to an iOS device.

Performance

Android

Note that all benchmark stats are from a Samsung S24 Ultra with multiple prefill signatures enabled.

| Backend | Quantization scheme | Context length | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | CPU Memory (RSS in MB) | GPU Memory (RSS in MB) | Model size (MB) | Model |
|---|---|---|---|---|---|---|---|---|---|
| CPU | fp32 (baseline) | 1280 | 49 | 10 | 5.59 | 4,123 | | 3,824 | πŸ”— |
| CPU | dynamic_int4 (block size 128) | 1280 | 138 | 50 | 2.33 | 982 | | 657 | πŸ”— |
| CPU | dynamic_int4 (block size 128) | 4096 | 87 | 37 | 3.40 | 1,145 | | 657 | πŸ”— |
| CPU | dynamic_int4 (block size 32) | 1280 | 107 | 48 | 3.49 | 1,045 | | 688 | πŸ”— |
| CPU | dynamic_int4 (block size 32) | 4096 | 79 | 36 | 4.40 | 1,210 | | 688 | πŸ”— |
| CPU | dynamic_int4 QAT | 2048 | 322 | 47 | 3.10 | 1,138 | | 529 | πŸ”— |
| CPU | dynamic_int8 | 1280 | 177 | 33 | 1.69 | 1,341 | | 1,005 | πŸ”— |
| CPU | dynamic_int8 | 4096 | 123 | 29 | 2.34 | 1,504 | | 1,005 | πŸ”— |
| GPU | dynamic_int4 QAT | 2048 | 2585 | 56 | 4.50 | 1,205 | | 529 | πŸ”— |
| GPU | dynamic_int8 | 1280 | 1191 | 24 | 4.68 | 2,164 | 1,059 | 1,005 | πŸ”— |
| GPU | dynamic_int8 | 4096 | 814 | 24 | 4.99 | 2,167 | 1,181 | 1,005 | πŸ”— |

  • For the list of supported quantization schemes, see supported-schemes. For these models, we use prefill signature lengths of 32, 128, 512, and 1280.
  • Model size: measured by the size of the .tflite flatbuffer (the serialization format for LiteRT models).
  • Memory: an indicator of peak RAM usage.
  • CPU inference is accelerated via the LiteRT XNNPACK delegate with 4 threads.
  • Benchmarks are run with the cache enabled and initialized; during the first run, time-to-first-token may differ.

Web

Note that all benchmark stats are from a 2024 MacBook Pro (Apple M4 Max chip) running with a 1280-token KV cache, 1024 prefill tokens, and 256 decode tokens.

| Backend | Quantization scheme | Precision | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | CPU Memory (MB) | GPU Memory (MB) | Model size (MB) | Model |
|---|---|---|---|---|---|---|---|---|---|
| GPU | dynamic_int4 | F16 | 4339 | 133 | 0.51 | 460 | 1,331 | 700 | πŸ”— |
| GPU | dynamic_int4 | F32 | 2837 | 134 | 0.49 | 481 | 1,331 | 700 | πŸ”— |
| GPU | dynamic_int4 QAT | F16 | 1702 | 77 | | | | 529 | πŸ”— |
| GPU | dynamic_int8 | F16 | 4321 | 126 | 0.6 | 471 | 1,740 | 1,011 | πŸ”— |
| GPU | dynamic_int8 | F32 | 2805 | 129 | 0.58 | 474 | 1,740 | 1,011 | πŸ”— |

  • Model size: measured by the size of the .tflite flatbuffer (serialization format for LiteRT models)
  • dynamic_int4: quantized model with int4 weights and float activations.
  • dynamic_int8: quantized model with int8 weights and float activations.