You might be interested in this: a draft model for the full `deepseek-r1` model!
I tested a few different models, and yours worked the best to create a draft model for the full `deepseek-r1` model.
Cool, I'm glad it worked for your case! I was just working on vocab transplanting too, and your tool seems to work very well. Thank you!
I'm nearly done with the 0.5b for `deepseek-r1`, and it will be interesting to compare how you have been creating your drafts (e.g. I'm using over 3B tokens, but you seem to be getting good results at a tiny fraction of this!).
Did you use the `--unmapped-init-scale` option or just leave the unmatched tokens as zero? I found the initial perplexity was around half when I used `--unmapped-init-scale 1.0`, so I used that to seed my initial models (but without post-fine-tuning, leaving these as zero seemed to get a better hit-rate).
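For reference, this is roughly the comparison I have in mind, as a sketch only (it's my assumption of what the scale controls, not the actual transplant-vocab code): unmatched rows either stay at zero, or get seeded with a scaled mean of the donor embeddings.

```python
# Sketch only: my assumption of what --unmapped-init-scale controls,
# not the actual transplant-vocab implementation.
import torch

def init_unmapped_rows(new_embed: torch.Tensor,
                       donor_embed: torch.Tensor,
                       unmapped_ids: list[int],
                       init_scale: float = 0.0) -> torch.Tensor:
    """init_scale=0.0 gives zero rows; 1.0 gives the full donor-mean row."""
    donor_mean = donor_embed.mean(dim=0)   # mean over the donor vocab
    for tok_id in unmapped_ids:
        new_embed[tok_id] = init_scale * donor_mean
    return new_embed
```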
I also wonder if we should just be using `temperature=1`, no other samplers, and then letting the model generate starting with a single `<BOS>` token and nothing else?
This would likely generate the most distributionally accurate data to fine-tune on, but for `deepseek-r1` I can only generate around 1-1.5M tokens a day even using 4 machines, so I decided to just use the most representative datasets I could find for it (and 100% from R1, not the distilled versions). For other smaller models I can generate quite a lot of data like this over a couple of days, so it might be interesting to try later.
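For concreteness, this is the sort of generation loop I mean; just a sketch using transformers, with the model path as a placeholder:

```python
# Sketch of the "single <BOS>, temperature=1, no other samplers" idea.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/target-model"  # placeholder: whichever model you want to sample from
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")

# Start from nothing but the BOS token and sample the raw distribution.
bos_id = tokenizer.bos_token_id
if bos_id is None:          # some tokenizers define no BOS; fall back to EOS
    bos_id = tokenizer.eos_token_id
input_ids = torch.tensor([[bos_id]])

output = model.generate(
    input_ids,
    do_sample=True,
    temperature=1.0,   # unmodified model distribution
    top_k=0,           # disable top-k filtering
    top_p=1.0,         # disable nucleus sampling
    max_new_tokens=2048,
)
print(tokenizer.decode(output[0], skip_special_tokens=False))
```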
I used `--unmapped-init-scale 1.0`, but I found that for DeepSeek-Qwen-32B and QwQ the initial loss was higher than just starting from Qwen2.5-0.5B, even though I initialize the new tokens the same way; I don't know why.
On the other hand, the transplant for Mistral-Small worked great, as the vocab is completely different from Qwen's. I was trying to 'pretrain' it but that didn't work well.
I agree that it may be better for matching the whole distribution, but I'm not sure if it's worth it for a draft model, as you may need quite a few billion tokens. Instead, focusing on a selection of common tasks for this kind of model (coding, math) seems the natural way to go given the compute constraints. I used `temperature=0.6` and no other samplers for the same reason.
Let me know how those experiments go!
I should have the first version finished later today, so interested to see how it works out!
I've also got a couple of ideas for the `transplant_vocab` code that I will add later this week:
- Better mean initialisation using a parameter `t` to control how "front loaded" the mean should be for the `lm_head` blended tokens (see the sketch after this list).
- An override option to help map the chat-ml tokens to the deepseek tokens, etc.
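For the first idea, something like this is what I have in mind, as a rough sketch of one possible weighting only (not the final implementation): when a target token maps to a sequence of donor token pieces, blend their `lm_head` rows with geometrically decaying weights so the earlier pieces count for more, with `t=1.0` recovering a plain mean.

```python
# Rough sketch of a "front-loaded" mean; just one possible interpretation.
import torch

def front_loaded_mean(rows: torch.Tensor, t: float = 1.0) -> torch.Tensor:
    """rows: (n_pieces, hidden) lm_head rows for the donor token pieces.
    t < 1 weights earlier pieces more heavily; t = 1 is a plain mean."""
    n = rows.shape[0]
    weights = torch.tensor([t ** i for i in range(n)],
                           dtype=rows.dtype, device=rows.device)
    weights = weights / weights.sum()            # normalise to sum to 1
    return (weights.unsqueeze(1) * rows).sum(dim=0)
```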
I've merged all the changes for now, but not 100% sure if they will work for all models:
https://github.com/jukofyork/transplant-vocab/pulls?q=is%3Apr+is%3Aclosed
(I'm still working on the full `deepseek-r1` and `deepseek-v3` models currently)
Looks good. I will try it in the coming days (too many tasks in my backlog right now), and I will let you know if the issues are solved.
No problem, there are some other discussions ongoing in this thread:
https://huggingface.co/rdsm/QwenPhi-4-0.5b-Draft/discussions/1