
John Leimgruber III
ubergarm's activity
1.5 bpw
Wahoo thanks for sharing your work!
benchmarks
Thanks for your work! Any chance for something between Q2_K_R and Q3_K_R?
This is a good question, Ed. As we've discussed, I'm still developing an intuition for these kinds of things.
In my limited experience, two scenarios are most common:
- The quantization doesn't damage the model too much, so it's not immediately noticeable during inference. 𝜌PPL is probably over 95%.
- The model barely works, can't form sentences, repeats short phrases forever, and is clearly damaged.
Rarely have I seen the in-between case where the model is obviously acting different but is still somewhat coherent, just making a lot of mistakes. That might be an interesting place to explore for this "cut-off", so to speak. I wish I had more stats on it. Specifically, it happened on my first exllamav3 exl3 quantization of a "failed" ParetoQ QAT of a 1B model quantized to 2bpw lol:
https://gist.github.com/ubergarm/9d560bab80241b90dac802e91b656743#references
The drop-down there shows the model is somewhat coherent, but definitely goofed up pretty good haha...
Anyway, I'll keep a closer eye on 𝜌PPL since I'm running a lot of KLD comparisons lately. Cheers!
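To make the 𝜌PPL point a bit more concrete, here is a minimal sketch of one way to compute that kind of statistic, assuming 𝜌PPL is read as the correlation between the per-token log-likelihoods of the quantized model and the full-precision baseline over the same evaluation text. The `.npy` file names are hypothetical stand-ins for log-likelihood dumps from the KLD comparison runs mentioned above.
```python
# Minimal sketch, assuming rho-PPL is read as the correlation between per-token
# log-likelihoods of the quantized model and the full-precision baseline.
# The .npy files are hypothetical dumps of those per-token values.
import numpy as np

ll_base = np.load("logprobs_bf16.npy")    # per-token ln p(token) from the baseline
ll_quant = np.load("logprobs_quant.npy")  # per-token ln p(token) from the quantized model

# Perplexity of each model over the same token stream
ppl_base = np.exp(-ll_base.mean())
ppl_quant = np.exp(-ll_quant.mean())

# Correlation of the per-token log-likelihoods, reported as a percentage;
# ~100% means the quant tracks the baseline almost token-for-token.
rho_ppl = np.corrcoef(ll_base, ll_quant)[0, 1] * 100.0

print(f"PPL(base)={ppl_base:.3f}  PPL(quant)={ppl_quant:.3f}  rho-PPL={rho_ppl:.2f}%")
```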
After enhancing llama.cpp to handle user-defined quantization levels for arbitrary tensors (https://github.com/ggml-org/llama.cpp/pull/12511), I have added an option to prune whole layers (https://github.com/ggml-org/llama.cpp/pull/13037), and have published two versions of google/gemma-3-12b-it for demo and testing purposes:
* Tensor-wise: eaddario/gemma-3-12b-it-GGUF
* Pruned: eaddario/gemma-3-12b-it-pruned-GGUF
Even though the perplexity scores of the pruned version are 3 times higher, the ARC, HellaSwag, MMLU, TruthfulQA and WinoGrande scores are holding up remarkably well, considering two layers were removed (26 and 29). This seems to support Xin Men et al.'s conclusions in ShortGPT: Layers in Large Language Models are More Redundant Than You Expect (2403.03853).
The results summary is in the model card and the full test results are in the ./scores directory. Questions/feedback are always welcome.
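For anyone wanting to poke at the same idea outside of GGUF, here is a minimal sketch of layer pruning at the transformers level rather than the llama.cpp option from the linked PR; the attribute path `model.model.layers` assumes a LLaMA/Gemma-style text-only causal LM, and the commented-out call is only an illustration using the layer indices from this post.
```python
# Minimal sketch of the layer-pruning idea (not the llama.cpp implementation):
# drop a set of decoder layers from a causal LM checkpoint and re-save it.
# Assumes a LLaMA/Gemma-style text model exposing `model.model.layers`;
# multimodal checkpoints may nest the decoder stack differently.
import torch
from torch import nn
from transformers import AutoModelForCausalLM

def prune_layers(model_id: str, drop: set[int], out_dir: str) -> None:
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    layers = model.model.layers  # the decoder stack
    model.model.layers = nn.ModuleList(
        layer for idx, layer in enumerate(layers) if idx not in drop
    )
    model.config.num_hidden_layers = len(model.model.layers)
    model.save_pretrained(out_dir)  # remaining layers are renumbered on save

# Layers 26 and 29 are the two removed in the gemma-3-12b-it demo above:
# prune_layers("google/gemma-3-12b-it", {26, 29}, "gemma-3-12b-it-pruned-demo")
```
This mirrors only the structural side of the demo; the published GGUFs were produced with the llama-quantize pruning option from the linked PR.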
Interesting comparisons!
https://huggingface.co/blog/autoround

Local Installation Video and Testing - Step by Step

Larger quants request
