1.5 bpw

#6
by lmganon123 - opened

I would definitely be interested in a 1.5 bpw quant for a 128GB RAM + 24GB VRAM setup. I would probably sit down and do it myself if the main model didn't take like 1TB of space. My experience with the first unsloth quants is that they are usable and in some ways better than Qwen 235B.

Good to know that even the small one still might be worth it over Qwen 235B as I was wondering that myself given the similar size.

I just discovered ik has implemented some QTIP/exl3-style "trellis quants", including a 2.125 bpw iq2_kt... Might be able to use that to keep perplexity good while shrinking it further.
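For the curious, cooking one of those would basically be a standard llama-quantize pass targeting the new type - just a sketch, assuming ik_llama.cpp's llama-quantize accepts IQ2_KT as a target; the paths and imatrix file below are placeholders:

# rough sketch: quantize a full-size GGUF down to the 2.125 bpw trellis type
# (placeholder paths; assumes IQ2_KT is a valid target type here)
./build/bin/llama-quantize \
    --imatrix /path/to/imatrix-DeepSeek-R1-0528.dat \
    /path/to/DeepSeek-R1-0528-BF16.gguf \
    /path/to/DeepSeek-R1-0528-IQ2_KT.gguf \
    IQ2_KT 24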

128GB RAM + 24GB VRAM is still gonna be tough though, as I'm less familiar with the various sub-2 bpw quants, which have their own speed issues as well...

Challenging stuff!

I've also had some experience with exl3 2 bpw for 70B and it is incredible. Pretty sure it's gonna be even better for R1, considering how larger models handle quantization brain damage better.

@lmganon123

I'm about to upload DeepSeek-R1-0528-IQ1_S_R4, 131GiB at 1.664 BPW, which should barely fit on 128GiB RAM + 24GB VRAM. The perplexity isn't as bad as I expected and it seems at least usable. Not the fastest, and probably no extra GPU offload due to the quant types, but still interesting if you're desperate! hahah

Let me know if you try it out and get it working! It should land by Monday morning, Eastern time zone.

Will do.

I compared mine against the two smallest others I could find. Both bartowski's and unsloth's look pretty good given the size (lower is better). Mine seems to hold up okay for the size though, so definitely curious to hear how it works out!

(attached chart: kld-r1-0528-smol-bois.png)
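If anyone wants to reproduce this kind of comparison, it's basically a two-pass llama-perplexity run - a sketch with placeholder file names: first dump baseline logits from a high-quality quant, then score each small quant against that baseline.

# 1) save baseline logits from a high-quality reference quant (placeholder paths)
./build/bin/llama-perplexity -m DeepSeek-R1-0528-Q8_0.gguf \
    -f calibration-text.txt --kl-divergence-base r1-0528-baseline-logits.bin

# 2) compute KL-divergence of a small quant against that baseline
./build/bin/llama-perplexity -m DeepSeek-R1-0528-IQ1_S_R4.gguf \
    --kl-divergence-base r1-0528-baseline-logits.bin --kl-divergence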

I gave it a try, and sadly I am on Windows, which uses a lot of RAM by itself. Looking at Task Manager, I am pretty sure I start hitting the SSD just a little bit, and whenever that happens the performance tanks completely, at least for me. I got 10 T/s prompt processing and 1 T/s generation with 12k ctx, which is in unusable territory for me.

I'm using the IQ1_S_R4 on CachyOS (Arch Linux variant), using a 24+128 configuration - specifically an RTX 4090, 128GB of DDR4 @ 3600MT/s (CL18, loose timings) and a ~3500MB/s read M.2 SSD - and I can confirm that it requires mmap (if you try forcing -rtr it'll OOM) and uses the SSD. This is on boot with very little running.

That said, my generation eval is 4.5 T/s. This is with 64K context, an fp16 context cache, and otherwise the recommended settings, plus -ser 6,1. For context, running the Unsloth IQ1_S is around half as fast, so I consider this a great improvement - this is in the ballpark of usability now.
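Concretely, that works out to a launch command roughly like the following (just a sketch of the recommended settings with my context size and -ser value filled in; the model path and --threads count are placeholders for your own setup):

./build/bin/llama-server \
    --model /path/to/DeepSeek-R1-0528-IQ1_S_R4-00001-of-00003.gguf \
    --ctx-size 65536 \
    -ctk f16 \
    -mla 3 -fa \
    -amb 512 \
    -fmoe \
    -ser 6,1 \
    --n-gpu-layers 99 \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 16 \
    --host 127.0.0.1 \
    --port 8080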

I think the output of this quant is very acceptable. I gave it some coding prompts and I think the results are still better than the big Qwen3 MoE with a similarly-sized quant. It's also not all that noticeably different in practice from the DS R1 0528 Unsloth IQ1_S (though I'm sure more extensive testing would reveal differences in quality).

Much thanks ubergarm for making this more accessible to those of us with gaming rigs and dual-channel 4-slot motherboards!

> I gave it a try, and sadly I am on Windows, which uses a lot of RAM by itself.

> and I can confirm that it requires mmap (if you try forcing -rtr it'll OOM) and uses the SSD. This is on boot with very little running.

Ahh okay, I knew it would be tight and likely require running headless or at least closing your browser lol... Regarding Windows, that is probably gonna be harder, assuming it has higher overhead than Linux. I too run Arch, btw, and use a tiling window manager on X and manually disable picom, my desktop compositor, to free up max resources haha...

There may still be hope for @names-are-for-friends though, if you want to try:

  1. Use -ctk q8_0, as quantized cache will free up some more VRAM
  2. Dial back to 16k context with -c 16384 just for now; you can try to turn it up again later if this works
  3. Now try to offload as many layers as possible before VRAM OOM, e.g. -ot "blk\.(3|4)\.ffn_.*=CUDA0" and extend to (3|4|5) etc. until OOM

EDIT: More discussion over here: https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13360298

The whole thing will look something like this:

# Notes:
#   -ctk q8_0 could go down to q4_0 for a little more VRAM
#   -amb 512 could go down to 256 in a pinch for a little more VRAM
#   the -ot "blk\.(3|4)\..." offload could go up e.g. (3|4|5|6) etc...
#     *EDIT* this probably won't help given IQ1_S_R4 doesn't run on CUDA yet... oof...
#   --override-tensor exps=CPU goes last, since -ot order matters
#   --threads = number of physical cores e.g. 8 or 16 etc...
./build/bin/llama-server \
    --model DeepSeek-R1-0528-IQ1_S_R4-00001-of-00003.gguf \
    --alias ubergarm/DeepSeek-R1-0528-IQ1_S_R4 \
    --ctx-size 16384 \
    -ctk q8_0 \
    -mla 3 -fa \
    -amb 512 \
    -fmoe \
    --n-gpu-layers 99 \
    -ot "blk\.(3|4)\.ffn_.*=CUDA0" \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 16 \
    --host 127.0.0.1 \
    --port 8080

Hopefully some combination of these things will help you out. Also when you compile give this a go too:

# -DGGML_CUDA_IQK_FORCE_BF16=1 may be faster on 3090s and probably slower on 4090s, but it prevents issues when offloading `_r4` layers onto GPU
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=OFF -DGGML_BLAS=OFF  -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1
cmake --build build --config Release -j $(nproc)

With some luck you will be able to get out of mmap() territory and likely get a decent speed boost if fully in RAM+VRAM.

Thanks again for the feedback!

The reason I used fp16 cache and large context is because I believed these were both held in VRAM, which is obviously underutilised otherwise with this small of a quant and with the given offload... but I could try the offload you designated as a matter of science? Throw some of those FFN onto the GPU and see what happens?

I'll add too, after a few runs, I think it's more like prompt eval = 17T/s, gen eval = 5T/s. I ran it with some pretty extensive existing context, and with some fresh prompts too. I was also generally running it with GDScript queries, and it seemed to work pretty well without confusing GDScript with normal Python, and even without mixing up old Godot objects with new ones...

I might try the compilation flags but I think current performance is good enough.

> The reason I used fp16 cache and large context is because I believed these were both held in VRAM, which is obviously underutilised otherwise with this small of a quant and with the given offload... but I could try the offload you designated as a matter of science? Throw some of those FFN onto the GPU and see what happens?

Yeah, makes sense now. I totally forgot the IQ1_S_R4 can't actually run on GPU, despite putting it on the model card haha... You technically can offload it, but it just sits on the GPU like expensive RAM, so no speed-ups (except for maybe not mmap()ing). Only the bigger 2/3/4/5 bit repacked quants run on GPU now, as of like a week ago.

So I'm rolling another version of the exact same quant, except not "pre-repacked", so it will be plain IQ1_S, which will actually run extra layers on GPU. Keep an eye on the repo; it should arrive within 12 hours if all goes well and it uploads smoothly overnight.

Sorry, there is no way to "un-repack" the existing quant, only a feature to repack a "normal" one... haha...

Thanks for your patience and feedback and glad to know at least it seems to be working okay despite being so smol!

Cheers!

EDIT

I think it will barely fit: it works with -rtr in about ~116.1GiB RAM plus, according to nvidia-smi, about 23412MiB VRAM when using -ctk f16. Using -ctk q8_0 instead gets you in at 22448MiB VRAM comfortably, so you could add some context.

CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-server \
    --model /mnt/raid/hf/DeepSeek-R1-0528-GGUF/IQ1_S/DeepSeek-R1-0528-IQ1_S-00001-of-00003.gguf \
    --alias ubergarm/DeepSeek-R1-0528-IQ1_S \
    --ctx-size 32768 \
    -ctk f16 \
    -mla 3 -fa \
    -amb 256 \
    -fmoe \
    --n-gpu-layers 99 \
    -ot "blk\.(3|4|5|6)\.ffn_.*=CUDA0" \
    --override-tensor exps=CPU \
    -rtr \
    --parallel 1 \
    --threads 24 \
    --host 127.0.0.1 \
    --port 8080

llm_load_tensors:        CPU buffer size = 117936.00 MiB
llm_load_tensors:  CUDA_Host buffer size =   469.99 MiB
llm_load_tensors:      CUDA0 buffer size = 17851.01 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 3
llama_new_context_with_model: attn_max_b = 256
llama_new_context_with_model: fused_moe  = 1
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init:      CUDA0 KV buffer size =  2196.00 MiB
llama_new_context_with_model: KV self size  = 2196.00 MiB, c^KV (f16): 2196.00 MiB, kv^T: not used
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.99 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  3041.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    78.01 MiB

If you need to shave down some more you can get more aggressive, e.g. with -ctk q4_0 and maybe kicking attn_kv_b out of VRAM you can possibly even offload another layer. (That attn_kv_b tensor might not actually be used depending on which mla setting you use etc.; I don't fully grok it, but got the hot tip from @anikifoss 's [chonky DQ4_K_R4 quant](https://huggingface.co/anikifoss/DeepSeek-R1-0528-DQ4_K_R4).)

CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-server \
    --model /mnt/raid/hf/DeepSeek-R1-0528-GGUF/IQ1_S/DeepSeek-R1-0528-IQ1_S-00001-of-00003.gguf \
    --alias ubergarm/DeepSeek-R1-0528-IQ1_S \
    --ctx-size 32768 \
    -ctk q4_0 \
    -mla 3 -fa \
    -amb 256 \
    -fmoe \
    --n-gpu-layers 99 \
    -ot "blk\.(3|4|5|6|7)\.ffn_.*=CUDA0" \
    -ot attn_kv_b=CPU \
    -ot exps=CPU \
    -rtr \
    --parallel 1 \
    --threads 24 \
    --host 127.0.0.1 \
    --port 8080

llm_load_tensors:        CPU buffer size = 116278.13 MiB
llm_load_tensors:  CUDA_Host buffer size =   469.99 MiB
llm_load_tensors:      CUDA0 buffer size = 19508.89 MiB
....................................................................................................
============ Repacked 61 tensors
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 3
llama_new_context_with_model: attn_max_b = 256
llama_new_context_with_model: fused_moe  = 1
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init:      CUDA0 KV buffer size =   617.64 MiB
llama_new_context_with_model: KV self size  =  617.62 MiB, c^KV (q4_0):  617.62 MiB, kv^T: not used
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.99 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  3109.31 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    78.01 MiB

Interestingly it is a couple GiB larger; maybe the _r4 quant optimizes storage a bit by being repacked, not completely sure. Okay, I'll upload it!

EDIT Here it is: https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/tree/main/IQ1_S

Can confirm that with:

IQ1_S-00001-of-00003.gguf -fa -mla 3 -ctk q8_0 -ctv q8_0 --ctx-size 25000 -amb 256 --n-gpu-layers 99 -ot "blk.(3|4|5|6).ffn_.*=CUDA0" --override-tensor exps=CPU

I still have around 1GB of free VRAM and RAM each and no longer hit the SSD on Windows. Speed went to 1.7 T/s generation and 17 T/s prompt processing, which probably makes sense since I get 5 T/s with your Qwen 235B quant on my setup.

@lmganon123

Yay, thanks for confirming you were able to free up enough resources to run it, even on Windows!

I'd actually suggest the IQ1_S_R4 again now, given ik added support for it specifically for GPU offload. No need for the IQ1_S anymore honestly, and it might have some of its own issues, looking at the discussions over there around fmoe etc.

The iq1_s_r4 might get a little more tok/sec generation.

Otherwise you could even consider going -ctk q4_0 and bringing -amb down to 128 (or even 64 in a pinch, though I don't usually recommend that, as going too small can slow things down). Finally, I learned from anikifoss to also try -ot attn_kv_b=CPU, as the model likely isn't actually using that "extra" tensor, so it can free up a little more VRAM. That would hopefully let you add at least one or maybe two more layers to GPU.
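Something like this, roughly (a sketch adapted from the commands earlier in the thread; nudge the (3|4|5|6) list and --threads to whatever your box tolerates before OOM):

./build/bin/llama-server \
    --model /path/to/DeepSeek-R1-0528-IQ1_S_R4-00001-of-00003.gguf \
    --ctx-size 32768 \
    -ctk q4_0 \
    -mla 3 -fa \
    -amb 128 \
    -fmoe \
    --n-gpu-layers 99 \
    -ot "blk\.(3|4|5|6)\.ffn_.*=CUDA0" \
    -ot attn_kv_b=CPU \
    -ot exps=CPU \
    --parallel 1 \
    --threads 16 \
    --host 127.0.0.1 \
    --port 8080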

Okie, have fun, and I'm curious whether you find the tiny R1-0528 better than that Qwen 235B quant in practice too.
