Update README & be patient as upload is slooow!

README.md

This quant collection **REQUIRES** [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/).
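If you have not already built it, a minimal build sketch looks roughly like the following. This is illustrative only; see the ik_llama.cpp README for the authoritative steps and for the backend options (CUDA etc.) that match your hardware.

```bash
# Rough sketch only; consult the ik_llama.cpp README for current build options
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build
cmake --build build --config Release -j "$(nproc)"
```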

These quants provide best in class quality for the given memory footprint.

## Big Thanks

Special thanks to `u/un_passant` for additional hardware access for this special project!

Shout out to Wendell and the **Level1Techs** crew, the community [Forums](https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/225826), and the [YouTube Channel](https://www.youtube.com/@Level1Techs)! **BIG thanks** for providing **BIG hardware** expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community here and on `r/LocalLLaMA` for tips and tricks, helping each other run all the fun new models!

So far these are my best recipes, offering great quality in a good memory footprint.
#### ubergarm/DeepSeek-R1T-Chimera-IQ4_KS

*NOTE*: This quant may take a *long time* to upload, hopefully less than a month lol...

338.456 GiB (4.326 BPW)

- type f32: 361 tensors - norms etc.
- type q6_0: 61 tensors - attn_k_b (not divisible by 256 so can't use iq6_k)
- type iq6_k: 551 tensors - balance of attn, token_embd, output, output_norm, shared experts
- type iq4_ks: 174 tensors - `ffn_(down|gate|up)_exps` routed experts

This quant is designed to take advantage of faster [iq4_ks](https://github.com/ikawrakow/ik_llama.cpp/pull/417) CUDA performance and is *not* pre-repacked, allowing multi-GPU users to offload additional layers easily. If you have enough RAM to hold it all, you can use `-rtr` for run-time repacking of the remaining layers on CPU for improved performance, or use the offline repack tool for a custom solution tailored to your exact hardware configuration.
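As a rough illustration only (not a tested command), a hybrid CPU+GPU launch with run-time repacking might look like the sketch below. The model path, context size, thread count, and the `-ot exps=CPU` offload pattern are placeholders to adapt to your own rig, and you should confirm the exact flags against your ik_llama.cpp build.

```bash
# Hypothetical example: keep the routed experts on CPU, offload everything else
# to GPU, and let -rtr repack the CPU-resident tensors at load time.
./build/bin/llama-server \
    --model /media/b/data2/models/ubergarm/DeepSeek-R1T-Chimera-GGUF/DeepSeek-R1T-Chimera-IQ4_KS.gguf \
    --ctx-size 32768 \
    -fa -fmoe \
    -ngl 99 \
    -ot exps=CPU \
    -rtr \
    --threads 40 \
    --host 127.0.0.1 \
    --port 8080
```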

## Quantization

<details>

<summary>👈 Secret Recipe</summary>

```bash
#!/usr/bin/env bash

custom="
# Token embedding and output tensors
# note token_embd cannot be repacked quant type
token_embd\.weight=iq6_k
output\.weight=iq6_k
output_norm\.weight=iq6_k

# First 3 dense layers (0-3)
blk\.[0-2]\.attn_k_b.*=q6_0
blk\.[0-2]\.attn_.*=iq6_k
blk\.[0-2]\..*=iq6_k

# All attention, norm weights, and bias tensors for MoE layers (3-60)
# Except blk.*.attn_k_b.weight is not divisible by 256 and no iq6_k so go with q6_0
blk\.[3-9]\.attn_k_b.*=q6_0
blk\.[1-5][0-9]\.attn_k_b.*=q6_0
blk\.60\.attn_k_b.*=q6_0

blk\.[3-9]\.attn_.*=iq6_k
blk\.[1-5][0-9]\.attn_.*=iq6_k
blk\.60\.attn_.*=iq6_k

blk\.[3-9]\.ffn_norm\.weight=iq6_k
blk\.[1-5][0-9]\.ffn_norm\.weight=iq6_k
blk\.60\.ffn_norm\.weight=iq6_k

blk\.[3-9]\.exp_probs_b\.bias=iq6_k
blk\.[1-5][0-9]\.exp_probs_b\.bias=iq6_k
blk\.60\.exp_probs_b\.bias=iq6_k

# Shared Experts (3-60)
blk\.[3-9]\.ffn_down_shexp\.weight=iq6_k
blk\.[1-5][0-9]\.ffn_down_shexp\.weight=iq6_k
blk\.60\.ffn_down_shexp\.weight=iq6_k

blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=iq6_k
blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=iq6_k
blk\.60\.ffn_(gate|up)_shexp\.weight=iq6_k

# The bulk of the model size is below
# Routed Experts (3-60)
# usually ffn_down is made a bit bigger than ffn_(gate|up) but you do you
blk\.[3-9]\.ffn_down_exps\.weight=iq4_ks
blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq4_ks
blk\.60\.ffn_down_exps\.weight=iq4_ks

blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq4_ks
blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq4_ks
blk\.60\.ffn_(gate|up)_exps\.weight=iq4_ks
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --imatrix /mnt/models/ubergarm/DeepSeek-R1T-Chimera-GGUF/DeepSeek-R1T-Chimera.imatrix \
    --custom-q "$custom" \
    /media/b/data2/models/ubergarm/DeepSeek-R1T-Chimera-GGUF/DeepSeek-R1T-Chimera-256x21B-BF16-00001-of-00030.gguf \
    /media/b/data2/models/ubergarm/DeepSeek-R1T-Chimera-GGUF/DeepSeek-R1T-Chimera-IQ4_KS.gguf \
    IQ4_KS \
    40
```
</details>
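For what it's worth, the `grep`/`sed` pipeline above simply drops the comment lines and flattens the remaining `regex=type` rules into the single comma-separated list passed via `--custom-q`. A tiny illustrative example with a shortened rule list (not the real recipe):

```bash
# Illustrative only: two rules instead of the full recipe above
custom="
# comments like this get dropped
token_embd\.weight=iq6_k
output\.weight=iq6_k
"
echo "$custom" | grep -v '^#' | sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
# prints: token_embd\.weight=iq6_k,output\.weight=iq6_k
```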

## imatrix

Based on [some discussions on imatrix methodology](https://github.com/ikawrakow/ik_llama.cpp/issues/383#issuecomment-2878086261) I chose the tried and true old-school methodology using the default context length of 512. This is one of the first imatrices generated using the [updated imatrix calculation fix for MLA](https://github.com/ikawrakow/ik_llama.cpp/pull/411), so I went lower than Q8_0 on the attention tensors for this MLA quant (iq6_k) given the discussions there and the recent CUDA speed improvements.

<details>

<summary>👈 Imatrix Methodology</summary>

```
wget https://gist.githubusercontent.com/tristandruyen/9e207a95c7d75ddf37525d353e00659c/raw/571fda718462de863e5a0171078c175420c7649a/calibration_data_v5_rc.txt

numactl -N 0 -m 0 \
./build/bin/llama-imatrix \
    --verbosity 1 \
    -m /media/b/data2/models/ubergarm/DeepSeek-R1T-Chimera-GGUF/DeepSeek-R1T-Chimera-Q8_0.gguf \
    -f calibration_data_v5_rc.txt \
    -o DeepSeek-R1T-Chimera.imatrix \
    --layer-similarity \
    --ctx-size 512 \
    --numa numactl \
    --threads 40

# NOTE: I actually forgot --layer-similarity otherwise would publish that here. Sorry!
```

</details>

## References
* [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/)