Are there any changes in method vs. your v3-0324 quant?
I used your v3-0324 IQ notes to create an R1T IQ4 gguf, results seem good to me -- anything new here?
Hello again!
The main differences I'm experimenting with in my method are:
- Did not use repacked `_R4` quants in the release so folks with multi-GPU and more VRAM can use it. (Repacked quants can technically be offloaded onto GPU, but they act like "expensive RAM" and the CPU still does the calculations.) If you need mmap() to run off of NVMe due to low RAM, you'll have to repack the quant yourself on this one.
- Experimenting with smaller attention tensors using the recent MLA imatrix calculation fix. Last time I went all Q8_0, though it sounds like even iq5_k might be sufficient now.
- Experimenting with the iq4_ks quant given it is one of the fastest of the new types, though all the iqN_k quants are now faster on CUDA. (A rough sketch of this kind of recipe is below.)
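To make that concrete, here is a minimal sketch of the kind of per-tensor recipe I mean, assuming ik_llama.cpp's llama-quantize accepts --custom-q regex=type overrides; the tensor-name patterns, paths, and thread count are placeholders to adapt to your own build:

```bash
# Hypothetical mix: lighter attention (iq5_k), q8_0 shared experts, iq4_ks
# routed experts, with IQ4_KS as the fallback type for anything unmatched.
CUSTOM="blk\..*\.attn_.*=iq5_k,blk\..*\.ffn_.*_shexp\.weight=q8_0,blk\..*\.ffn_.*_exps\.weight=iq4_ks"

./build/bin/llama-quantize \
    --imatrix /path/to/imatrix.dat \
    --custom-q "$CUSTOM" \
    /path/to/DeepSeek-R1T-bf16.gguf \
    /path/to/DeepSeek-R1T-IQ4_KS.gguf \
    IQ4_KS 24
```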
I still used the evshiron llama.cpp fork + triton-cpu method to convert fp8 to bf16, and used a non-imatrix Q8_0 as the baseline to make the imatrix with my usual method, which still seems sufficient from the research and discussions I've seen. If there is any clear evidence of tangible benefits from changing it, I'll adjust accordingly.
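For context, the rough shape of that pipeline looks something like this; the convert-script arguments, binary paths, and file names are assumptions from my own setup rather than an exact recipe:

```bash
# 1. fp8 safetensors -> bf16 GGUF with the evshiron llama.cpp fork
#    (its convert script dequantizes DeepSeek's fp8 on CPU via triton-cpu).
python convert_hf_to_gguf.py --outtype bf16 \
    --outfile DeepSeek-R1T-bf16.gguf /path/to/DeepSeek-R1T/

# 2. Plain (non-imatrix) Q8_0 to serve as the imatrix baseline.
./build/bin/llama-quantize DeepSeek-R1T-bf16.gguf DeepSeek-R1T-Q8_0.gguf Q8_0 24

# 3. Importance matrix computed from that Q8_0 over a calibration corpus.
./build/bin/llama-imatrix -m DeepSeek-R1T-Q8_0.gguf \
    -f calibration_data.txt -o imatrix.dat --ctx-size 512
```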
Also, while some folks have made ik_llama.cpp custom R1T quants, I haven't seen them uploaded on huggingface, so I'm trying to release at least one to make it easier for folks to try it out.
Though honestly, not sure if this upload will ever finish, looking into the uplink issue currently haha...
Have you found any method changes that you like or suggest possible benefits in Perplexity, KL-Divergence, or benchmarks?
Thanks!
> repacked quants can technically be offloaded onto GPU, but they act like "expensive RAM"
I discovered this the hard way. I'd say it's worse than expensive RAM at 1000+ ms/tok.
It might be nice if ik would implement the inverse of -runtime-repack.
> Last time I went all Q8_0
Given the small footprint of the shared tensors, there is something satisfying about keeping them Q8, but maybe that's irrational. More on that below.
> I still used the evshiron llama.cpp fork + triton-cpu method to convert fp8 to bf16
I was not aware of these. All of my CUDA devices are Turing so I had waited until someone uploaded a BF16 conversion.
> and used a non-imatrix Q8_0 as the baseline to make the imatrix
You didn't describe the process back to the base model, but I inferred from the script that it was probably base -> ??? -> Q8_0 -> imatrix, and did the same.
> Also, while some folks have made ik_llama.cpp custom R1T quants, I haven't seen them uploaded on huggingface, so I'm trying to release at least one to make it easier for folks to try it out.
I was planning to upload mine, with back-links to your V3 for the background -- no need if you're putting one up.
> Have you found any method changes that you like or suggest possible benefits in Perplexity, KL-Divergence, or benchmarks?
I think I get the gist of what these are trying to measure, but I still don't have an intuition for it at the level that I'd be confident to e.g. write about. Charts show number-go-down and that's encouraging. The results in actual use (e.g. IQ4-experts + Q8-shared per your methods) seem good when running the model. My strategy has been to run the largest (highest-fidelity) quant and drop down BPW-wise to get a throughput bump. For that I'm fine with the Q8/Q4 mixture. I have RAM for Q8-everything, so I don't need quants to solve a will-it-fit problem.
> It might be nice if ik would implement the inverse of -runtime-repack.
Interesting idea, and yeah, sorry you had to figure that out the hard way! I basically built it for my single 3090TI 24GB VRAM rig, but going forward I'm not gonna "pre-repack" so folks have more flexibility.
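For anyone landing here later: since this upload is not pre-repacked, CPU and hybrid users can still get the interleaved layout at load time. A minimal sketch, with the caveat that the exact flag spelling (-rtr / --run-time-repack) and the model path are assumptions on my part:

```bash
# Repack supported tensors into the interleaved CPU layout when loading.
# Note this loads the weights instead of mmap()-ing them, which is why the
# low-RAM NVMe case above needs an offline repack instead.
./build/bin/llama-server -m /path/to/DeepSeek-R1T-IQ4_KS.gguf -rtr
```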
> Given the small footprint of the shared tensors, there is something satisfying about keeping them Q8, but maybe that's irrational. More on that below.
Yeah, they are a small percentage of the overall size of the model. I'm not 100% sure this model will ever finish uploading honestly; gonna try, but it's very slow right now. Feel free to upload yours, and in the model card README.md some folks are beginning to add `ik_llama.cpp` as a tag to make those quants easier to find. There is plenty of room for more ik quant cookers!
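In case it helps, the tag just goes in the YAML metadata block at the top of the model card README.md; a tiny illustrative sketch (only the ik_llama.cpp tag is from this thread, the rest is placeholder):

```bash
# Write an example front-matter block; this goes at the very top of README.md.
cat <<'EOF' > model_card_front_matter.yml
---
tags:
  - ik_llama.cpp
  - gguf
---
EOF
```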
> I was not aware of these. All of my CUDA devices are Turing so I had waited until someone uploaded a BF16 conversion.
Right, the only thing my 3090 can't do natively is that specific `fp8` format... The original method was described here, which is mostly the same, plus a patch that might be needed to get triton-cpu to build. It's pretty quick and not very resource intensive, though fast disk i/o does help.
> You didn't describe the process back to the base model, but I inferred from the script that it was probably base -> ??? -> Q8_0 -> imatrix, and did the same.
Yeah, I figure it isn't too bad given the original is `fp8` anyway, so `q8_0` as the baseline is probably fine.
> I was planning to upload mine, with back-links to your V3 for the background -- no need if you're putting one up.
Go for it! If you have a q8_0 attention one ready to go, folks might want that! I expect the iq6_k attention to be fine but don't have any comparisons yet.
Yeah, I'm still learning how best to benchmark these big models in a meaningful way. Perplexity and KLD might not be the best for everything, but they are convenient ways of getting a feel for how various quants of the same model differ from one another. I recently did a big writeup for a reddit post on Qwen3-30B-A3B, as it was smaller and much faster to test. I have a lot of the methodology more or less documented in a gist with some results and graphs.
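For anyone wanting to reproduce that kind of comparison, the measurements boil down to roughly the following (a minimal sketch; file names are placeholders and the gist has the actual parameters):

```bash
# Perplexity of a quant over a held-out text file.
./build/bin/llama-perplexity -m DeepSeek-R1T-IQ4_KS.gguf -f wiki.test.raw

# KL-divergence: first save reference logits from the highest-fidelity quant
# you can run (e.g. the Q8_0), then score the smaller quant against them.
./build/bin/llama-perplexity -m DeepSeek-R1T-Q8_0.gguf -f wiki.test.raw \
    --kl-divergence-base logits-q8_0.bin
./build/bin/llama-perplexity -m DeepSeek-R1T-IQ4_KS.gguf -f wiki.test.raw \
    --kl-divergence-base logits-q8_0.bin --kl-divergence
```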
Okay, get your tools sharp and ready for that R2 release coming soon :tm: lmao... cheers!