ubergarm committed
Commit 3a52698 · 1 Parent(s): b384829

Update README & be patient as upload is slooow!

Files changed (1):
  1. README.md +116 -6
README.md CHANGED
@@ -19,6 +19,8 @@ This quant collection **REQUIRES** [ik_llama.cpp](https://github.com/ikawrakow/i
  These quants provide best in class quality for the given memory footprint.
 
  ## Big Thanks
+ Special thanks to `u/un_passant` for additional hardware access for this special project!
+
  Shout out to Wendell and the **Level1Techs** crew, the community [Forums](https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/225826), [YouTube Channel](https://www.youtube.com/@Level1Techs)! **BIG thanks** for providing **BIG hardware** expertise and access to run these experiments and make these great quants available to the community!!!
 
  Also thanks to all the folks in the quanting and inferencing community here and on `r/LocalLLaMA` for tips and tricks helping each other run all the fun new models!
@@ -30,22 +32,130 @@ So far these are my best recipes offering the great quality in good memory footp
 
  #### ubergarm/DeepSeek-R1T-Chimera-IQ4_KS
 
- TODO
+ *NOTE*: This quant may take a *long time* to upload, hopefully less than a month lol...
+
+ 338.456 GiB (4.326 BPW)
+
+ - type f32: 361 tensors - norms etc.
+ - type q6_0: 61 tensors - attn_k_b (not divisible by 256 so can't use iq6_k)
+ - type iq6_k: 551 tensors - balance of attn, token_embd, output, output_norm, shared experts
+ - type iq4_ks: 174 tensors - `ffn_(down|gate|up)_exps` routed experts
+
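As a quick sanity check on the figures above, total bits divided by bits per weight gives the implied tensor-element count; this is plain arithmetic on the stated size and BPW, nothing else assumed:

```bash
# 338.456 GiB -> bytes -> bits, then divide by 4.326 bits per weight
echo '338.456 * 2^30 * 8 / 4.326' | bc -l
# ~= 6.72e11 (about 672 billion) tensor elements, in line with the ~671B-parameter DeepSeek V3/R1 base
```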
+ This quant is designed to take advantage of faster [iq4_ks](https://github.com/ikawrakow/ik_llama.cpp/pull/417) CUDA
+ performance and is *not* pre-repacked, allowing multi-GPU users to offload
+ additional layers easily. If you have enough RAM to hold it all, you can
+ use `-rtr` for run-time repacking of the remaining layers on CPU for improved
+ performance, or use the offline repack tool for a custom solution tailored
+ to your exact hardware configuration.
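For example, a mixed CPU+GPU launch along these lines keeps the `ffn_*_exps` routed experts in system RAM while offloading the rest. This is only a rough sketch: the path, context size, and thread count are placeholders, and the flag spellings (`-ngl`, `--override-tensor`, `-rtr`) assume a recent ik_llama.cpp build.

```bash
# Sketch: offload everything except the routed experts to GPU(s); tune to your hardware.
./build/bin/llama-server \
    --model /path/to/DeepSeek-R1T-Chimera-IQ4_KS.gguf \
    --ctx-size 8192 \
    -ngl 99 \
    --override-tensor exps=CPU \
    --threads 40
# Running fully from system RAM instead? Drop -ngl/--override-tensor and add -rtr
# to run-time repack the CPU layers as described above.
```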
 
  ## Quantization
  <details>
 
- <summary>👈Secret Recipe</summary>
+ <summary>👈 Secret Recipe</summary>
 
  ```bash
- TODO
- (mostly iq6_k for all attention/shared experts and iq4_ks for all ffn layers)
+ #!/usr/bin/env bash
+
+ custom="
+ # Token embedding and output tensors
+ # note token_embd cannot be repacked quant type
+ token_embd\.weight=iq6_k
+ output\.weight=iq6_k
+ output_norm\.weight=iq6_k
+
+ # First 3 dense layers (0-2)
+ blk\.[0-2]\.attn_k_b.*=q6_0
+ blk\.[0-2]\.attn_.*=iq6_k
+ blk\.[0-2]\..*=iq6_k
+
+ # All attention, norm weights, and bias tensors for MoE layers (3-60)
+ # Except blk.*.attn_k_b.weight, which is not divisible by 256 so iq6_k won't work; go with q6_0
+ blk\.[3-9]\.attn_k_b.*=q6_0
+ blk\.[1-5][0-9]\.attn_k_b.*=q6_0
+ blk\.60\.attn_k_b.*=q6_0
+
+ blk\.[3-9]\.attn_.*=iq6_k
+ blk\.[1-5][0-9]\.attn_.*=iq6_k
+ blk\.60\.attn_.*=iq6_k
+
+ blk\.[3-9]\.ffn_norm\.weight=iq6_k
+ blk\.[1-5][0-9]\.ffn_norm\.weight=iq6_k
+ blk\.60\.ffn_norm\.weight=iq6_k
+
+ blk\.[3-9]\.exp_probs_b\.bias=iq6_k
+ blk\.[1-5][0-9]\.exp_probs_b\.bias=iq6_k
+ blk\.60\.exp_probs_b\.bias=iq6_k
+
+ # Shared Experts (3-60)
+ blk\.[3-9]\.ffn_down_shexp\.weight=iq6_k
+ blk\.[1-5][0-9]\.ffn_down_shexp\.weight=iq6_k
+ blk\.60\.ffn_down_shexp\.weight=iq6_k
+
+ blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=iq6_k
+ blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=iq6_k
+ blk\.60\.ffn_(gate|up)_shexp\.weight=iq6_k
+
+ # The bulk of the model size is below
+ # Routed Experts (3-60)
+ # usually ffn_down is made a bit bigger than ffn_(gate|up) but you do you
+ blk\.[3-9]\.ffn_down_exps\.weight=iq4_ks
+ blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq4_ks
+ blk\.60\.ffn_down_exps\.weight=iq4_ks
+
+ blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq4_ks
+ blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq4_ks
+ blk\.60\.ffn_(gate|up)_exps\.weight=iq4_ks
+ "
+
+ custom=$(
+ echo "$custom" | grep -v '^#' | \
+ sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
+ )
+
+ ./build/bin/llama-quantize \
+ --imatrix /mnt/models/ubergarm/DeepSeek-R1T-Chimera-GGUF/DeepSeek-R1T-Chimera.imatrix \
+ --custom-q "$custom" \
+ /media/b/data2/models/ubergarm/DeepSeek-R1T-Chimera-GGUF/DeepSeek-R1T-Chimera-256x21B-BF16-00001-of-00030.gguf \
+ /media/b/data2/models/ubergarm/DeepSeek-R1T-Chimera-GGUF/DeepSeek-R1T-Chimera-IQ4_KS.gguf \
+ IQ4_KS \
+ 40
+ ```
+
+ </details>
+
+ ## imatrix
+ Based on [some discussions on imatrix
+ methodology](https://github.com/ikawrakow/ik_llama.cpp/issues/383#issuecomment-2878086261)
+ I chose the tried and true old school methodology using the
+ default context length of 512. This is one of the first imatrices
+ generated using the [updated imatrix calculation fix for
+ MLA](https://github.com/ikawrakow/ik_llama.cpp/pull/411), so I went lower
+ than Q8_0 on the attention tensors for this MLA quant (iq6_k) given the
+ discussions there and recent CUDA speed improvements.
+
+ <details>
+
+ <summary>👈 Imatrix Methodology</summary>
+
+ ```
+ wget https://gist.githubusercontent.com/tristandruyen/9e207a95c7d75ddf37525d353e00659c/raw/571fda718462de863e5a0171078c175420c7649a/calibration_data_v5_rc.txt
+
+ numactl -N 0 -m 0 \
+ ./build/bin/llama-imatrix \
+ --verbosity 1 \
+ -m /media/b/data2/models/ubergarm/DeepSeek-R1T-Chimera-GGUF/DeepSeek-R1T-Chimera-Q8_0.gguf \
+ -f calibration_data_v5_rc.txt \
+ -o DeepSeek-R1T-Chimera.imatrix \
+ --layer-similarity \
+ --ctx-size 512 \
+ --numa numactl \
+ --threads 40
+
+ # NOTE: I actually forgot --layer-similarity, otherwise I would publish that here. Sorry!
  ```
 
  </details>
 
- ## Discussion
- Updated imatrix MLA stuff.
 
  ## References
  * [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/)