John Leimgruber III

ubergarm

AI & ML interests

Open LLMs and Astrophotography image processing.

Organizations

None yet

ubergarm's activity

New activity in ubergarm/DeepSeek-R1-0528-GGUF about 2 hours ago:
"1.5 bpw" (#6, opened 6 days ago by lmganon123)
New activity in anikifoss/DeepSeek-R1-0528-DQ4_K_R4 about 2 hours ago
New activity in ubergarm/DeepSeek-R1-0528-GGUF about 2 hours ago:
"benchmarks" (#8, opened 1 day ago by BernardH)
replied to eaddario's post 3 days ago

This is a good question, Ed. As we've discussed, I'm still developing an intuition for these kinds of things.

In my limited experience, there tend to be two common scenarios:

  1. The quantization doesn't damage the model too much and so is not immediately noticeable during inference. Probably 𝜌PPL is over 95%.
  2. The model barely works: it can't form sentences, repeats small phrases forever, and is very damaged.

Rarely have I seen a situation that is kind of in-between, where the model is obviously acting differently but is still somewhat coherent, though it makes a lot of mistakes. That might be an interesting place to explore for this "cut-off", so to speak. I wish I had more stats on it. Specifically, it happened on my first exllamav3 exl3 quantization of a "failed" ParetoQ QAT of a 1B model quantized to 2bpw lol:

https://gist.github.com/ubergarm/9d560bab80241b90dac802e91b656743#references

The dropdown there shows the model is somewhat coherent, but definitely goofed up pretty good haha...

Anyway, I'll keep a closer eye on 𝜌PPL as I'm running a lot of KLD comparisons lately. Cheers!
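
For what it's worth, here's a rough Python sketch of the kind of numbers I mean (my own loose definitions, not necessarily the exact statistics llama.cpp reports in its KLD output): the mean KL divergence between the baseline and quantized token distributions, plus the correlation of per-token NLLs, which is the 𝜌PPL-style signal I'm watching. The function name and array layout are just illustrative.

```python
import numpy as np

def kld_and_ppl_corr(base_logprobs, quant_logprobs, token_ids):
    """base_logprobs, quant_logprobs: [n_tokens, vocab_size] log-probabilities
    for the same eval text from the baseline and the quantized model.
    token_ids: [n_tokens] the actual next-token ids from the eval text."""
    base_logprobs = np.asarray(base_logprobs, dtype=np.float64)
    quant_logprobs = np.asarray(quant_logprobs, dtype=np.float64)
    token_ids = np.asarray(token_ids)

    # Mean KL(base || quant) averaged over token positions.
    p_base = np.exp(base_logprobs)
    kld = np.mean(np.sum(p_base * (base_logprobs - quant_logprobs), axis=-1))

    # Per-token negative log-likelihood of the true token under each model.
    idx = np.arange(len(token_ids))
    nll_base = -base_logprobs[idx, token_ids]
    nll_quant = -quant_logprobs[idx, token_ids]

    # Pearson correlation of per-token NLLs: close to 1.0 means the quant
    # ranks hard/easy tokens the same way the baseline does.
    rho = np.corrcoef(nll_base, nll_quant)[0, 1]
    return kld, rho
```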

reacted to eaddario's post with 🚀 3 days ago
Layer-wise and Pruned versions of google/gemma-3-12b-it

After enhancing llama.cpp to handle user-defined quantization levels for arbitrary tensors (https://github.com/ggml-org/llama.cpp/pull/12511), I have added an option to prune whole layers (https://github.com/ggml-org/llama.cpp/pull/13037), and have published two versions of google/gemma-3-12b-it for demo and testing purposes:

* Tensor-wise: eaddario/gemma-3-12b-it-GGUF
* Pruned: eaddario/gemma-3-12b-it-pruned-GGUF

Even though the Perplexity scores of the pruned version are 3 times higher, the ARC, HellaSwag, MMLU, TruthfulQA and WinoGrande scores are holding up remarkably well, considering two layers (26 and 29) were removed. This seems to support Xin Men et al.'s conclusions in ShortGPT: Layers in Large Language Models are More Redundant Than You Expect (arXiv:2403.03853).
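
As a rough illustration of the layer-pruning idea (not the llama.cpp GGUF-level implementation from the PR above), here is a minimal PyTorch/transformers sketch that drops decoder layers 26 and 29. It assumes a Llama-style stack exposed as model.model.layers; the exact attribute path may differ for the multimodal Gemma 3 checkpoints.

```python
# Minimal sketch: drop whole decoder layers before evaluation/re-export.
# This is a conceptual illustration only, not the llama.cpp pruning code.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-12b-it", torch_dtype=torch.bfloat16
)

prune = {26, 29}  # layer indices from the results above
layers = model.model.layers  # assumed ModuleList of decoder blocks
model.model.layers = torch.nn.ModuleList(
    [layer for i, layer in enumerate(layers) if i not in prune]
)
model.config.num_hidden_layers = len(model.model.layers)

# model.save_pretrained("gemma-3-12b-it-pruned")  # then convert/quantize as usual
```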

A results summary is in the model card, and test results are in the ./scores directory. Questions and feedback are always welcome.
New activity in eaddario/Qwen3-30B-A3B-GGUF 3 days ago