---
pipeline_tag: any-to-any
---

This is the Chameleon-7b checkpoint, converted using the script [convert_chameleon_weights_to_hf.py](https://github.com/Alpha-VLLM/Lumina-mGPT/blob/main/lumina_mgpt/model/chameleon/convert_chameleon_weights_to_hf.py) from the [Lumina-mGPT](https://github.com/Alpha-VLLM/Lumina-mGPT) repository.

This release is intended to ease the initialization of Lumina-mGPT training. Before using this model, please ensure you have obtained permission to access the official Chameleon checkpoints available at [Hugging Face](https://huggingface.co/facebook/chameleon-7b). Usage of this model is at the user's own risk.
<h2 style="color:rosybrown">Differences from the official chameleon-7B release</h2>

*This model is **almost the same** as the official chameleon-7B release, with one important difference in the **qk-norm** implementation:*

For unknown reasons, in the 34B Chameleon model, which was trained with 8-way model parallelism, the weights of the qk-norm layers, which are expected to be identical across model-parallel ranks, turn out to differ (see [here](https://github.com/huggingface/transformers/pull/31534#issuecomment-2207354677) for details). More intuitively, this means the attention heads can be divided into one group for the 7B model and 8 groups for the 34B model, where the qk-norm parameters are the same within each group but differ across groups. To accommodate this, `transformers` copies the qk-norm parameters out to the full shape `num_heads * head_dim`. However, if the Chameleon model is then further finetuned, as in the case of Lumina-mGPT, the qk-norm parameters can diverge even further, to the point where every attention head ends up with different parameters, which is not ideal.

To solve this, we slightly change the implementation: the qk-norm parameters are instead stored with shape `model_parallel_size x head_dim`, where `model_parallel_size` is 1 for the 7B model and 8 for the 34B model, and they are expanded to `num_heads * head_dim` at forward time via `repeat_interleave`. This ensures that the qk-norm parameters always remain consistent within the existing groups.
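
For illustration, below is a minimal PyTorch sketch of this storage scheme, not the actual Lumina-mGPT code: the learnable weight is kept with shape `(model_parallel_size, head_dim)` and expanded to `num_heads * head_dim` with `repeat_interleave` in the forward pass, so heads that belong to the same model-parallel group always share identical parameters. The class name `GroupedQKNorm` and the RMS-style normalization are simplifications chosen for the sketch.

```python
import torch
import torch.nn as nn


class GroupedQKNorm(nn.Module):
    """Sketch of a qk-norm whose weights are shared within model-parallel groups.

    The learnable weight has shape (model_parallel_size, head_dim) and is
    expanded to num_heads * head_dim at forward time, so heads that belonged
    to the same model-parallel rank always keep identical parameters.
    """

    def __init__(self, num_heads: int, head_dim: int,
                 model_parallel_size: int = 1, eps: float = 1e-6):
        super().__init__()
        assert num_heads % model_parallel_size == 0
        self.num_heads = num_heads
        self.head_dim = head_dim
        self.model_parallel_size = model_parallel_size
        self.eps = eps
        # One weight vector per model-parallel group: 1 group for 7B, 8 for 34B.
        self.weight = nn.Parameter(torch.ones(model_parallel_size, head_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, num_heads * head_dim)
        heads_per_group = self.num_heads // self.model_parallel_size
        # Expand (model_parallel_size, head_dim) -> (num_heads * head_dim);
        # consecutive heads in the same group receive identical parameters.
        weight = self.weight.repeat_interleave(heads_per_group, dim=0).reshape(-1)
        # RMS-style normalization over each head's dimension (simplified here).
        x = x.view(*x.shape[:-1], self.num_heads, self.head_dim)
        x = x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        x = x.reshape(*x.shape[:-2], self.num_heads * self.head_dim)
        return x * weight


# Example: a 7B-like configuration (32 heads, head_dim 128) uses a single group,
# so all heads share one weight vector, matching the original 7B behavior.
qk_norm = GroupedQKNorm(num_heads=32, head_dim=128, model_parallel_size=1)
out = qk_norm(torch.randn(2, 16, 32 * 128))
```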