You might be interested in this: a draft model for the full `deepseek-r1` model!

#1
by jukofyork - opened

I tested a few different models, and yours worked best for creating a draft model for the full `deepseek-r1` model:

https://huggingface.co/jukofyork/DeepSeek-R1-DRAFT-0.5B

Owner

Cool, I'm glad it worked for your case! I was just working on vocab transplanting too, and your tool seems to work very well. Thank you!

I'm nearly done with the 0.5B for `deepseek-r1`, and it will be interesting to compare how you have been creating your drafts (e.g. I'm using over 3B tokens, but you seem to be getting good results at a tiny fraction of this!).

Did you use the `--unmapped-init-scale` option or just leave the unmatched tokens as zero? I found the initial perplexity was roughly halved when I used `--unmapped-init-scale 1.0`, so I used that to seed my initial models (but without any post-fine-tuning, leaving these as zero seemed to get a better hit-rate).
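For context, the spirit of that flag is a scaled initialisation for the unmatched tokens instead of zeros; a simplified sketch rather than the actual transplant-vocab code, with illustrative names and shapes:

```python
import torch

def init_unmapped_rows(embed: torch.Tensor, donor_mean: torch.Tensor,
                       unmapped_ids, scale: float = 0.0) -> torch.Tensor:
    """Initialise embedding rows for tokens with no donor match.

    scale=0.0 leaves them as zeros; scale=1.0 copies the donor's mean
    embedding at full strength (the spirit of --unmapped-init-scale 1.0).
    """
    for tok_id in unmapped_ids:
        embed[tok_id] = donor_mean * scale
    return embed

# Toy example: a small embedding table with a handful of unmapped rows.
embed = torch.zeros(1_024, 64)
donor_mean = torch.randn(64)  # stand-in for the donor model's mean embedding row
embed = init_unmapped_rows(embed, donor_mean, unmapped_ids=range(8), scale=1.0)
```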

I also wonder if we should just be using temperature=1 and no other samplers, and then letting the model generate starting from a single `<BOS>` token and nothing else?

This would likely generate the most distributionally accurate data to fine-tune on, but for `deepseek-r1` I can only generate around 1-1.5M tokens a day even using 4 machines, so I decided to just use the most representative datasets I could find for it (and 100% from R1, not the distilled versions).
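In transformers terms it would look something like this (the model id is just a stand-in; the full `deepseek-r1` obviously needs a proper multi-GPU serving stack, but the sampler settings are the point):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in model id for illustration only.
model_id = "deepseek-ai/DeepSeek-R1"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Start from a single <BOS> token and sample at temperature=1 with every
# other sampler disabled, so the output follows the model's own
# distribution as closely as possible.
input_ids = torch.tensor([[tok.bos_token_id]], device=model.device)
out = model.generate(
    input_ids,
    do_sample=True,
    temperature=1.0,
    top_k=0,      # disable top-k
    top_p=1.0,    # disable nucleus sampling
    max_new_tokens=4096,
)
print(tok.decode(out[0], skip_special_tokens=False))
```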

For other, smaller models I can generate quite a lot of data like this over a couple of days, so it might be interesting to try later.

Owner

I used `--unmapped-init-scale 1.0`, but I found that for DeepSeek-Qwen-32B and QwQ the initial loss was higher than just starting from Qwen2.5-0.5B, even though I initialize the new tokens the same way; I don't know why.

On the other hand, the transplant for Mistral-Small worked great, as the vocab is completely different from Qwen's. I was trying to 'pretrain' it but that didn't work well.

I agree that it may be better for matching the whole distribution, but I'm not sure if it's worth it for a draft model, as you may need quite a few billion tokens. Instead, focusing on a selection of common tasks for this kind of model (coding, math) seems the natural way to go given the computation constraints. I used temperature=0.6 and no other samplers for the same reason.

Let me know how those experiments go!

I should have the first version finished later today, so I'm interested to see how it works out!

I also have a couple of ideas for the transplant_vocab code that I will add later this week:

  • Better mean initialisation, using a parameter `t` to control how "front loaded" the mean should be for the blended `lm_head` tokens (see the sketch after this list).
  • An override option to help map the ChatML tokens to the DeepSeek tokens, etc.
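Roughly what I mean by "front loaded" for the first idea; just a sketch, and the exact weighting scheme may end up different:

```python
import torch

def front_loaded_mean(rows: torch.Tensor, t: float) -> torch.Tensor:
    """Blend the donor lm_head rows for the sub-tokens that make up one
    transplanted token.

    rows: (n, hidden) donor rows, in sub-token order.
    t:    decay factor; t=1.0 gives a plain mean, smaller t puts more
          weight on the earlier ("front") sub-tokens.
    """
    n = rows.shape[0]
    weights = t ** torch.arange(n, dtype=rows.dtype)
    weights = weights / weights.sum()
    return (weights[:, None] * rows).sum(dim=0)

# e.g. a token that splits into 3 donor sub-tokens, hidden size 4096
rows = torch.randn(3, 4096)
blended = front_loaded_mean(rows, t=0.5)  # ~57% / 29% / 14% weighting
```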

I've merged all the changes for now, but I'm not 100% sure they will work for all models:

https://github.com/jukofyork/transplant-vocab/pulls?q=is%3Apr+is%3Aclosed

(I'm still working on the full `deepseek-r1` and `deepseek-v3` models currently)

Owner

Looks good, I will try it in the coming days (too many tasks in my backlog right now) and I will let you know if the issues are solved.

No problem, there are some other discussions ongoing in this thread:

https://huggingface.co/rdsm/QwenPhi-4-0.5b-Draft/discussions/1
