WAIT FOR ENTIRE UPLOAD TO FINISH BEFORE DOWNLOADING!
WARNING
Cooked this quant on a remote rig with a limited uplink, so the upload will take a while. Make sure it finishes uploading before you bother downloading the GGUF files.
ik_llama.cpp imatrix Quantizations of tngtech/DeepSeek-R1T-Chimera
This quant collection REQUIRES the ik_llama.cpp fork to support the advanced non-linear SotA quants. Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!
These quants provide best-in-class quality for the given memory footprint.
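If you haven't built the fork yet, here is a minimal sketch of a CUDA build, assuming a typical cmake toolchain (the exact flags may differ slightly depending on your checkout and hardware):
# grab and build ik_llama.cpp (CUDA build shown; drop -DGGML_CUDA=ON for CPU-only)
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)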
Big Thanks
Special thanks to u/un_passant for additional hardware access for this special project!
Shout out to Wendell and the Level1Techs crew, the community Forums, and the YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!
Also thanks to all the folks in the quanting and inferencing community here and on r/LocalLLaMA for tips and tricks helping each other run all the fun new models!
Excited to share and learn together. Thanks!
Quant Collection
So far these are my best recipes, offering great quality at good memory-footprint breakpoints.
DeepSeek-R1T-Chimera-IQ4_KS
NOTE: This quant may take a long time to upload, hopefully less than a month lol...
338.456 GiB (4.326 BPW)
- type f32: 361 tensors - norms etc.
- type q6_0: 61 tensors - attn_k_b (not divisible by 256 so can't use iq6_k)
- type iq6_k: 551 tensors - balance of attn, token_embd, output, output_norm, shared experts
- type iq4_ks: 174 tensors - ffn_(down|gate|up)_exps routed experts
This quant is designed to take advantage of the faster iq4_ks CUDA
performance and is not pre-repacked, allowing multi-GPU users to offload
additional layers easily. If you have enough RAM to hold it all, you can
use -rtr for run-time repacking of the remaining layers on CPU for improved
performance, or use the offline repack tool for a custom solution tailored
to your exact hardware configuration.
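As a rough, untuned sketch only (the split filename, layer count, and thread count below are placeholders for your own hardware), a hybrid CPU+GPU launch with run-time repacking might look something like this:
# keep routed experts on CPU (-ot exps=CPU) and repack them at load time (-rtr);
# note -rtr disables mmap, so the whole model must fit in system RAM
./build/bin/llama-server \
    --model DeepSeek-R1T-Chimera-IQ4_KS-00001-of-00008.gguf \
    --ctx-size 32768 \
    -mla 2 -fa -fmoe \
    -ot exps=CPU \
    -rtr \
    --n-gpu-layers 99 \
    --threads 40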
Quantization
Secret Recipe
#!/usr/bin/env bash
custom="
# Token embedding and output tensors
# note token_embd cannot be repacked quant type
token_embd\.weight=iq6_k
output\.weight=iq6_k
output_norm\.weight=iq6_k
# First 3 dense layers (0-2)
blk\.[0-2]\.attn_k_b.*=q6_0
blk\.[0-2]\.attn_.*=iq6_k
blk\.[0-2]\..*=iq6_k
# All attention, norm weights, and bias tensors for MoE layers (3-60)
# Except blk.*.attn_k_b.weight, which is not divisible by 256 (so no iq6_k); go with q6_0
blk\.[3-9]\.attn_k_b.*=q6_0
blk\.[1-5][0-9]\.attn_k_b.*=q6_0
blk\.60\.attn_k_b.*=q6_0
blk\.[3-9]\.attn_.*=iq6_k
blk\.[1-5][0-9]\.attn_.*=iq6_k
blk\.60\.attn_.*=iq6_k
blk\.[3-9]\.ffn_norm\.weight=iq6_k
blk\.[1-5][0-9]\.ffn_norm\.weight=iq6_k
blk\.60\.ffn_norm\.weight=iq6_k
blk\.[3-9]\.exp_probs_b\.bias=iq6_k
blk\.[1-5][0-9]\.exp_probs_b\.bias=iq6_k
blk\.60\.exp_probs_b\.bias=iq6_k
# Shared Experts (3-60)
blk\.[3-9]\.ffn_down_shexp\.weight=iq6_k
blk\.[1-5][0-9]\.ffn_down_shexp\.weight=iq6_k
blk\.60\.ffn_down_shexp\.weight=iq6_k
blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=iq6_k
blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=iq6_k
blk\.60\.ffn_(gate|up)_shexp\.weight=iq6_k
# The bulk of the model size is below
# Routed Experts (3-60)
# usually ffn_down is made a bit bigger than ffn_(gate|up) but you do you
blk\.[3-9]\.ffn_down_exps\.weight=iq4_ks
blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq4_ks
blk\.60\.ffn_down_exps\.weight=iq4_ks
blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq4_ks
blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq4_ks
blk\.60\.ffn_(gate|up)_exps\.weight=iq4_ks
"
# strip comment lines and join the remaining rules into a single
# comma-separated string for the --custom-q argument below
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--imatrix /mnt/models/ubergarm/DeepSeek-R1T-Chimera-GGUF/DeepSeek-R1T-Chimera.imatrix \
--custom-q "$custom" \
/media/b/data2/models/ubergarm/DeepSeek-R1T-Chimera-GGUF/DeepSeek-R1T-Chimera-256x21B-BF16-00001-of-00030.gguf \
/media/b/data2/models/ubergarm/DeepSeek-R1T-Chimera-GGUF/DeepSeek-R1T-Chimera-IQ4_KS.gguf \
IQ4_KS \
40
imatrix
Based on some discussions of imatrix methodology, I chose the tried-and-true old-school approach with the default context length of 512. This is one of the first imatrices generated with the updated imatrix calculation fix for MLA, so I went lower than Q8_0 on the attention tensors for this MLA quant (iq6_k), given the discussions there and recent CUDA speed improvements.
Imatrix Methodology
wget https://gist.githubusercontent.com/tristandruyen/9e207a95c7d75ddf37525d353e00659c/raw/571fda718462de863e5a0171078c175420c7649a/calibration_data_v5_rc.txt
numactl -N 0 -m 0 \
./build/bin/llama-imatrix \
--verbosity 1 \
-m /media/b/data2/models/ubergarm/DeepSeek-R1T-Chimera-GGUF/DeepSeek-R1T-Chimera-Q8_0.gguf \
-f calibration_data_v5_rc.txt \
-o DeepSeek-R1T-Chimera.imatrix \
--layer-similarity \
--ctx-size 512 \
--numa numactl \
--threads 40
# NOTE: I actually forgot --layer-similarity, otherwise I would publish that data here. Sorry!