Which draft model is recommended for speculative decoding?

#12
by devnen - opened

Thank you for your hard work. I can achieve impressive inference speeds (about 25 tokens/s) using the latest version of llama.cpp, the UD-Q4_K_XL quantization, and the following command:

llama-server \
  --model ./Qwen3-235B-A22B-UD-Q4_K_XL/Qwen3-235B-A22B-UD-Q4_K_XL-00001-of-00003.gguf \
  -c 8096 \
  -b 1024 \
  -ngl 999 \
  -fa \
  --parallel 1 \
  --threads 16 \
  -mg 0 \
  --host 0.0.0.0 \
  --port 8080

Could you recommend the optimal draft model to pass via the --model-draft parameter for speculative decoding?
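For reference, here is a rough sketch of how the draft model would be wired into the same command, assuming a small same-family GGUF (Qwen3-0.6B is used here purely as an illustration, not a confirmed recommendation) and a recent llama.cpp build that supports --model-draft, -ngld, and --draft-max; exact flag names and defaults may differ by version:

llama-server \
  --model ./Qwen3-235B-A22B-UD-Q4_K_XL/Qwen3-235B-A22B-UD-Q4_K_XL-00001-of-00003.gguf \
  --model-draft ./Qwen3-0.6B-Q4_K_M.gguf \
  -ngld 999 \
  --draft-max 16 \
  -c 8096 \
  -b 1024 \
  -ngl 999 \
  -fa \
  --parallel 1 \
  --threads 16 \
  -mg 0 \
  --host 0.0.0.0 \
  --port 8080

The key constraint is that the draft model must share the target model's vocabulary, so the actual question is which small Qwen3 variant (and quantization) gives the best acceptance rate and speedup here.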
