ubergarm committed
Commit 3a52698 · 1 Parent(s): b384829

Update README & be patient as upload is slooow!

Files changed (1):
  1. README.md +116 -6
README.md CHANGED
@@ -19,6 +19,8 @@ This quant collection **REQUIRES** [ik_llama.cpp](https://github.com/ikawrakow/i
  These quants provide best in class quality for the given memory footprint.
 
  ## Big Thanks
+ Special thanks to `u/un_passant` for additional hardware access for this special project!
+
  Shout out to Wendell and the **Level1Techs** crew, the community [Forums](https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/225826), [YouTube Channel](https://www.youtube.com/@Level1Techs)! **BIG thanks** for providing **BIG hardware** expertise and access to run these experiments and make these great quants available to the community!!!
 
  Also thanks to all the folks in the quanting and inferencing community here and on `r/LocalLLaMA` for tips and tricks helping each other run all the fun new models!
@@ -30,22 +32,130 @@ So far these are my best recipes offering the great quality in good memory footp
 
  #### ubergarm/DeepSeek-R1T-Chimera-IQ4_KS
 
- TODO
+ *NOTE*: This quant may take a *long time* to upload, hopefully less than a month lol...
+
+ 338.456 GiB (4.326 BPW)
+
+ - type f32: 361 tensors - norms etc.
+ - type q6_0: 61 tensors - attn_k_b (not divisible by 256 so can't use iq6_k)
+ - type iq6_k: 551 tensors - balance of attn, token_embd, output, output_norm, shared experts
+ - type iq4_ks: 174 tensors - `ffn_(down|gate|up)_exps` routed experts
+
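As a quick sanity check on the figures above, total bits divided by bits per weight gives the implied tensor-element count; this is plain arithmetic on the stated size and BPW, nothing else assumed:

```bash
# 338.456 GiB -> bytes -> bits, then divide by 4.326 bits per weight
echo '338.456 * 2^30 * 8 / 4.326' | bc -l
# ~= 6.72e11 (about 672 billion) tensor elements, in line with the ~671B-parameter DeepSeek V3/R1 base
```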
+ This quant is designed to take advantage of faster [iq4_ks](https://github.com/ikawrakow/ik_llama.cpp/pull/417) CUDA
+ performance and is *not* pre-repacked, allowing multi-GPU users to offload
+ additional layers easily. If you have enough RAM to hold it all, you can
+ use `-rtr` for run-time repacking of the remaining layers on CPU for improved
+ performance, or use the offline repack tool for a custom solution tailored
+ to your exact hardware configuration.
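For example, a mixed CPU+GPU launch along these lines keeps the `ffn_*_exps` routed experts in system RAM while offloading the rest. This is only a rough sketch: the path, context size, and thread count are placeholders, and the flag spellings (`-ngl`, `--override-tensor`, `-rtr`) assume a recent ik_llama.cpp build.

```bash
# Sketch: offload everything except the routed experts to GPU(s); tune to your hardware.
./build/bin/llama-server \
    --model /path/to/DeepSeek-R1T-Chimera-IQ4_KS.gguf \
    --ctx-size 8192 \
    -ngl 99 \
    --override-tensor exps=CPU \
    --threads 40
# Running fully from system RAM instead? Drop -ngl/--override-tensor and add -rtr
# to run-time repack the CPU layers as described above.
```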
 
  ## Quantization
  <details>
 
- <summary>👈Secret Recipe</summary>
+ <summary>👈 Secret Recipe</summary>
 
  ```bash
- TODO
- (mostly iq6_k for all attention/shared experts and iq4_ks for all ffn layers)
+ #!/usr/bin/env bash
+
+ custom="
+ # Token embedding and output tensors
+ # note token_embd cannot be repacked quant type
+ token_embd\.weight=iq6_k
+ output\.weight=iq6_k
+ output_norm\.weight=iq6_k
+
+ # First 3 dense layers (0-2)
+ blk\.[0-2]\.attn_k_b.*=q6_0
+ blk\.[0-2]\.attn_.*=iq6_k
+ blk\.[0-2]\..*=iq6_k
+
+ # All attention, norm weights, and bias tensors for MoE layers (3-60)
+ # Except blk.*.attn_k_b.weight, which is not divisible by 256 so iq6_k won't work; go with q6_0
+ blk\.[3-9]\.attn_k_b.*=q6_0
+ blk\.[1-5][0-9]\.attn_k_b.*=q6_0
+ blk\.60\.attn_k_b.*=q6_0
+
+ blk\.[3-9]\.attn_.*=iq6_k
+ blk\.[1-5][0-9]\.attn_.*=iq6_k
+ blk\.60\.attn_.*=iq6_k
+
+ blk\.[3-9]\.ffn_norm\.weight=iq6_k
+ blk\.[1-5][0-9]\.ffn_norm\.weight=iq6_k
+ blk\.60\.ffn_norm\.weight=iq6_k
+
+ blk\.[3-9]\.exp_probs_b\.bias=iq6_k
+ blk\.[1-5][0-9]\.exp_probs_b\.bias=iq6_k
+ blk\.60\.exp_probs_b\.bias=iq6_k
+
+ # Shared Experts (3-60)
+ blk\.[3-9]\.ffn_down_shexp\.weight=iq6_k
+ blk\.[1-5][0-9]\.ffn_down_shexp\.weight=iq6_k
+ blk\.60\.ffn_down_shexp\.weight=iq6_k
+
+ blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=iq6_k
+ blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=iq6_k
+ blk\.60\.ffn_(gate|up)_shexp\.weight=iq6_k
+
+ # The bulk of the model size is below
+ # Routed Experts (3-60)
+ # usually ffn_down is made a bit bigger than ffn_(gate|up) but you do you
+ blk\.[3-9]\.ffn_down_exps\.weight=iq4_ks
+ blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq4_ks
+ blk\.60\.ffn_down_exps\.weight=iq4_ks
+
+ blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq4_ks
+ blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq4_ks
+ blk\.60\.ffn_(gate|up)_exps\.weight=iq4_ks
+ "
+
+ custom=$(
+ echo "$custom" | grep -v '^#' | \
+ sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
+ )
+
+ ./build/bin/llama-quantize \
+ --imatrix /mnt/models/ubergarm/DeepSeek-R1T-Chimera-GGUF/DeepSeek-R1T-Chimera.imatrix \
+ --custom-q "$custom" \
+ /media/b/data2/models/ubergarm/DeepSeek-R1T-Chimera-GGUF/DeepSeek-R1T-Chimera-256x21B-BF16-00001-of-00030.gguf \
+ /media/b/data2/models/ubergarm/DeepSeek-R1T-Chimera-GGUF/DeepSeek-R1T-Chimera-IQ4_KS.gguf \
+ IQ4_KS \
+ 40
+ ```
+
+ </details>
+
+ ## imatrix
+ Based on [some discussions on imatrix
+ methodology](https://github.com/ikawrakow/ik_llama.cpp/issues/383#issuecomment-2878086261)
+ I chose the tried and true old school methodology using the
+ default context length of 512. This is one of the first imatrices
+ generated using the [updated imatrix calculation fix for
+ MLA](https://github.com/ikawrakow/ik_llama.cpp/pull/411), so I went lower
+ than Q8_0 on the attention tensors for this MLA quant (iq6_k) given the
+ discussions there and recent CUDA speed improvements.
+
+ <details>
+
+ <summary>👈 Imatrix Methodology</summary>
+
+ ```
+ wget https://gist.githubusercontent.com/tristandruyen/9e207a95c7d75ddf37525d353e00659c/raw/571fda718462de863e5a0171078c175420c7649a/calibration_data_v5_rc.txt
+
+ numactl -N 0 -m 0 \
+ ./build/bin/llama-imatrix \
+ --verbosity 1 \
+ -m /media/b/data2/models/ubergarm/DeepSeek-R1T-Chimera-GGUF/DeepSeek-R1T-Chimera-Q8_0.gguf \
+ -f calibration_data_v5_rc.txt \
+ -o DeepSeek-R1T-Chimera.imatrix \
+ --layer-similarity \
+ --ctx-size 512 \
+ --numa numactl \
+ --threads 40
+
+ # NOTE: I actually forgot --layer-similarity, otherwise I would publish that here. Sorry!
  ```
 
  </details>
 
- ## Discussion
- Updated imatrix MLA stuff.
 
  ## References
  * [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/)