Which draft model is recommended for speculative decoding?

#12
by devnen - opened

Thank you for your hard work. I can achieve impressive inference speeds (about 25 tokens/s) using the latest version of llama.cpp, the UD-Q4_K_XL quantization, and the following command:

llama-server \
  --model ./Qwen3-235B-A22B-UD-Q4_K_XL/Qwen3-235B-A22B-UD-Q4_K_XL-00001-of-00003.gguf \
  -c 8096 \
  -b 1024 \
  -ngl 999 \
  -fa \
  --parallel 1 \
  --threads 16 \
  -mg 0 \
  --host 0.0.0.0 \
  --port 8080

Could you recommend the optimal draft model to pass via the --model-draft parameter for speculative decoding?
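For reference, here is a rough sketch of how the draft model would be wired into the same command, assuming a small same-family GGUF (Qwen3-0.6B is used here purely as an illustration, not a confirmed recommendation) and a recent llama.cpp build that supports --model-draft, -ngld, and --draft-max; exact flag names and defaults may differ by version:

llama-server \
  --model ./Qwen3-235B-A22B-UD-Q4_K_XL/Qwen3-235B-A22B-UD-Q4_K_XL-00001-of-00003.gguf \
  --model-draft ./Qwen3-0.6B-Q4_K_M.gguf \
  -ngld 999 \
  --draft-max 16 \
  -c 8096 \
  -b 1024 \
  -ngl 999 \
  -fa \
  --parallel 1 \
  --threads 16 \
  -mg 0 \
  --host 0.0.0.0 \
  --port 8080

The key constraint is that the draft model must share the target model's vocabulary, so the actual question is which small Qwen3 variant (and quantization) gives the best acceptance rate and speedup here.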
