Llama.cpp hybrid layer quantization of Llama-4-Scout-17B-16E-Instruct by meta-llama

Original model: https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct

The hybrid quant employs different quantization levels on a per-layer basis to enable both high performance and small file size at the same time. This particular quant achieves a ~50G gguf with the same perplexity as a ~60G IQ4_XS gguf. The quants employed are all K quants, to avoid slow CPU processing of IQ quants, and they range from Q3_K_S at the early layers up to the quant used for the final layer, which is reflected in the quant name. In this case the final layer is Q4_K_M, so the quant is called Q4_K_H. Note there is no unique Q4_K_H quant, since the selection of quantization levels as a function of layer is arbitrary. For this file the layer quants are as follows:

embed  : Q3_K
0..7   : Q3_K_S
8..23  : alt Q3_K_M Q3_K_S
24..44 : Q3_K_M
45     : Q3_K_L
46     : Q4_K_S
47     : Q4_K_M
output : Q5_K
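
The per-layer assignment above can be checked directly from the gguf file. The sketch below assumes the gguf Python package's gguf-dump script, which lists each tensor together with its quantization type; it is an inspection aid only, not part of the quantization recipe:

```bash
# Hedged sketch: inspect per-tensor quant types in the gguf file.
# Assumes the gguf Python package provides the gguf-dump script, which
# prints tensor name, shape and type for every tensor in the file.
pip install gguf

# Show the tensors belonging to layer 45, which should reflect the
# Q3_K_L assignment listed above.
gguf-dump Llama-4-Scout-17B-16E-Instruct.Q4_K_H.gguf | grep 'blk\.45\.'
```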

These quants were selected based on combined subjective and objective performance evaluations to give both high performance and moderate file size.

A slightly smaller version of this model is now available which intentionally drops to Q2_K on alternate deep layers to add a small, controlled amount of extra nonlinearity and possibly help bootstrap better generation creativity for vision mode. It is named Q4_V_H (V = Vision, though it does not have to be used for vision; it works for text-only input as well) to distinguish it from Q4_K_H. This quant has the side benefit of just fitting under the 50G split-threshold size, so only one file needs to be downloaded to run it. For this file the layer quants are as follows:

embed  : Q3_K
0..7   : alt Q3_K_S Q2_K
8..23  : alt Q3_K_M Q3_K_S
24..44 : Q3_K_M
45     : Q3_K_L
46     : Q4_K_S
47     : Q4_K_M
output : Q5_K

A second, smaller quant is also available with predominantly Q3_K_S across layers, as follows:

(layer distribution updates : v2 5/5/25, v3 5/6/25)
embed  : Q3_K
0..15  : alt Q3_K_S Q2_K
16..38 : Q3_K_S
39..45 : Q3_K_M
46     : Q3_K_L
47     : Q4_K_S
output : Q5_K

This quant reduces file size to 46.6G while maintaining perplexity (10.2) at the same level as a homogeneous Q3_K_M quant, with a ~5G smaller file.

A third quant is also available which makes heavier use of Q2_K, as follows:

embed  : Q3_K
0..15  : alt Q2_K Q2_K_S
16..23 : Q2_K
24..31 : alt Q3_K_S Q2_K
32..43 : Q3_K_S
44..47 : Q3_K_M
output : Q5_K

This quant reduces file size to 42.8G while maintaining good perplexity (10.47), close to a homogeneous Q3_K_M while being 9G smaller, and it exhibits no generation artifacts (nonsense words or words randomly completed with Chinese characters) across a wide range of test prompts.

Comparison:

| Quant | Size (e9 B) | PPL | Comment |
|-------|-------------|-----|---------|
| Q2_K | 39.6 | 11.4 | many generation artifacts, nonsense words, completes words with Chinese tokens randomly |
| Q2_K_H | 42.8 | 10.4 | no generation artifacts, no nonsense words, very close to Q3_K_M perplexity |
| Q3_K_M_3_5 | 51.6 | 10.2 | Q3_K_M with Q5_K output layer |
| Q3_K_H | 46.6 | 10.2 | hybrid quant with Q5_K output layer, Q3_K_M perplexity |
| Q4_V_H | 49.8 | 9.54 | Q4_K_H with alt Q2_K on deepest layers for extra early-stage nonlinearity |
| Q4_K_H | 50.4 | 9.54 | hybrid quant with Q5_K output layer, IQ4_XS perplexity |
| IQ4_XS | 59.9 | 9.57 | IQ4_XS with default embed/output layer |
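
For reference, perplexity figures like those in the table can be measured with llama.cpp's llama-perplexity tool. The test corpus used for the table is not stated here, so the command below (using a wiki.test.raw style text file) is only an illustrative sketch:

```bash
# Illustrative sketch: measure perplexity of one of the quants.
# The corpus file is an assumption, not necessarily the one used above.
./llama-perplexity \
  -m Llama-4-Scout-17B-16E-Instruct.Q3_K_H.gguf \
  -f wiki.test.raw \
  -ngl 99 -ot exps=CPU
```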

Usage Notes:

This model can be run on a computer with 64G of RAM using a combination of CPU and a single GPU. The Q2_K_H quant might be able to run on a 48G RAM machine, but the Q3_K_H and Q4_K_H quants almost certainly won't without either partial offload to a high-VRAM GPU or use of disk swap.

A good setup is to offload all model layers to the GPU and all non-shared expert FFN tensors to the CPU by specifying the override-tensor flag -ot exps=CPU, as discussed here: https://github.com/ggml-org/llama.cpp/discussions/13154 . This will open up around 45k tokens of KV cache on a 12G VRAM GPU at q8_0 KV precision and run at around 10 t/s on a decent CPU. Note that flash attention is needed to run a q8_0 KV cache. To speed up prompt processing under this configuration by 2x to 10x, you can also either force ubatch (-ub) down to 16 or use the new --no-op-offload switch added in https://github.com/ggml-org/llama.cpp/pull/13386.
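
As a concrete illustration of this setup, a launch line along the following lines can be used; the context size, KV cache type and choice of llama-server are illustrative assumptions and should be tuned to the available VRAM:

```bash
# Sketch of the CPU+GPU split described above: all layers nominally on the
# GPU, expert FFN tensors forced to CPU via the override-tensor flag.
# Flash attention is enabled so the q8_0 KV cache can be used.
./llama-server \
  -m Llama-4-Scout-17B-16E-Instruct.Q4_K_H.gguf \
  -ngl 99 \
  -ot exps=CPU \
  -fa \
  -ctk q8_0 -ctv q8_0 \
  -c 45056
```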

Update: tests on my HW (9900K CPU + 4070 GPU) show that neither --no-op-offload nor -ub 16 is effective as of llama.cpp b5379:

| Configuration | PP (t/s) | TG (t/s) | Notes |
|---------------|----------|----------|-------|
| ubatch 128, without --no-op-offload | 30.86 | 9.23 | best PP and TG config |
| ubatch 128, with --no-op-offload | 25.74 | 8.35 | |
| ubatch 16, without --no-op-offload | 21.45 | 9.23 | |

Therefore it will be necessary to experiment with these flags to determine best settings on different HW configs.
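
A simple way to run this experiment is to time the same prompt with each flag combination and compare the prompt eval / eval rates that llama-cli reports at exit, for example:

```bash
# Quick A/B sketch: compare prompt processing and generation speed with and
# without --no-op-offload and a reduced ubatch. The prompt file, token count
# and ubatch values are illustrative choices.
COMMON="-m Llama-4-Scout-17B-16E-Instruct.Q4_K_H.gguf -ngl 99 -ot exps=CPU -fa -f prompt.txt -n 128"

./llama-cli $COMMON -ub 128                    # baseline
./llama-cli $COMMON -ub 128 --no-op-offload    # with the PR 13386 switch
./llama-cli $COMMON -ub 16                     # reduced ubatch
```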

At version b5237 and above, a change was made to the llama.cpp flash attention code which introduces an apparent loss of precision in the attention layer computation of around 1 to 3 bits, causing higher numerical noise and degraded performance. Generations will still function, but for highest performance it may be necessary to turn flash attention off and run with an F16 KV cache until/unless this problem is resolved. Discussion on the llama.cpp issue tracker: https://github.com/ggml-org/llama.cpp/issues/13287
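
In practice this amounts to choosing between the two KV cache configurations sketched below; the second avoids the flash attention path at the cost of a larger KV cache:

```bash
# Flash attention on with q8_0 KV cache (smaller KV cache, but subject to
# the precision issue described above on b5237+ builds):
./llama-server -m Llama-4-Scout-17B-16E-Instruct.Q4_K_H.gguf \
  -ngl 99 -ot exps=CPU -fa -ctk q8_0 -ctv q8_0

# Flash attention off with the default F16 KV cache (larger KV cache,
# avoids the flash attention precision loss):
./llama-server -m Llama-4-Scout-17B-16E-Instruct.Q4_K_H.gguf \
  -ngl 99 -ot exps=CPU
```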

Image capability:

As of llama.cpp b5423, vision capability has been added for Llama 4: https://github.com/ggml-org/llama.cpp/pull/13282 . To run vision mode, follow the docs in the mtmd README in the tools directory of the source tree: https://github.com/ggml-org/llama.cpp/blob/master/tools/mtmd/README.md . Basically all that is needed is to generate an "mmproj" file for Llama 4 Scout, and then it can be tested with the CLI. Preliminary testing shows performance to be excellent (ignore the warning about degraded vision mode for Llama 4 when the model loads; it works extremely well over the wide range of test cases I have run). For convenience the mmproj file is also made available in this model repository.
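
Once the model gguf and the mmproj file are downloaded, a vision prompt can be run with the mtmd CLI along the following lines (the image path and prompt text are placeholders):

```bash
# Sketch: vision query with llama-mtmd-cli using the mmproj from this repo.
# The image file and prompt text are illustrative placeholders; the offload
# flags mirror the text-mode setup described in the usage notes above.
./llama-mtmd-cli \
  -m Llama-4-Scout-17B-16E-Instruct.Q4_V_H.gguf \
  --mmproj Llama-4-Scout-17B-16E-Instruct.mmproj \
  --image photo.jpg \
  -p "Describe this image." \
  -ngl 99 -ot exps=CPU
```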

Download the desired file from the table below:

| Link | Type | Size (e9 B) | Notes |
|------|------|-------------|-------|
| Llama-4-Scout-17B-16E-Instruct.Q2_K_H.gguf | Q2_K_H | 42.8 | Good quality |
| Llama-4-Scout-17B-16E-Instruct.Q3_K_H.gguf | Q3_K_H | 46.6 | Solid quality |
| Llama-4-Scout-17B-16E-Instruct.Q4_K_H.gguf | Q4_K_H | 50.4 | Excellent quality |
| Llama-4-Scout-17B-16E-Instruct.Q4_V_H.gguf | Q4_V_H | 49.8 | Extra deep-layer nonlinearity for vision mode |
| Llama-4-Scout-17B-16E-Instruct.mmproj | mmproj | 1.75 | Multimedia projector for vision mode |
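
For example, a single quant (plus the mmproj for vision mode) can be fetched with huggingface-cli:

```bash
# Example: download one quant plus the mmproj from this repository.
pip install -U "huggingface_hub[cli]"
huggingface-cli download steampunque/Llama-4-Scout-17B-16E-Instruct-Hybrid-GGUF \
  Llama-4-Scout-17B-16E-Instruct.Q4_V_H.gguf \
  Llama-4-Scout-17B-16E-Instruct.mmproj \
  --local-dir .
```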

A discussion thread about the hybrid quant approach can be found here on the llama.cpp git repository:

https://github.com/ggml-org/llama.cpp/discussions/13040
