---
pipeline_tag: any-to-any
---

This is the Chameleon-7b checkpoint, converted using the script [convert_chameleon_weights_to_hf.py](https://github.com/Alpha-VLLM/Lumina-mGPT/blob/main/lumina_mgpt/model/chameleon/convert_chameleon_weights_to_hf.py) from the [Lumina-mGPT](https://github.com/Alpha-VLLM/Lumina-mGPT) repository.

This release is intended to ease the initialization of Lumina-mGPT training. Before using this model, please ensure you have obtained permission to access the official Chameleon checkpoints available at [Hugging Face](https://huggingface.co/facebook/chameleon-7b). Usage of this model is at the user's own risk.
<h2 style="color:rosybrown">Differences from the official chameleon-7B release</h2>

*This model is **almost the same** as the official chameleon-7B release, with one important difference in the **qk-norm** implementation:*

For unknown reasons, in the 34B Chameleon model, which was trained with 8-way model parallelism, the weights of the qk-norm layers, which are expected to be identical across model-parallel ranks, turn out to differ (see [here](https://github.com/huggingface/transformers/pull/31534#issuecomment-2207354677) for details). More intuitively, this means the attention heads can be divided into one group for the 7B model and 8 groups for the 34B model, where the qk-norm parameters are the same within each group but differ across groups. To accommodate this, `transformers` copies the qk-norm parameters out to the full shape `num_heads * head_dim`. However, if the Chameleon model is then further finetuned, as in the case of Lumina-mGPT, the qk-norm parameters can diverge even further, to the point where every attention head ends up with different parameters, which is not ideal.

To solve this, we slightly change the implementation: the qk-norm parameters are instead stored with shape `model_parallel_size x head_dim`, where `model_parallel_size` is 1 for the 7B model and 8 for the 34B model, and they are expanded to `num_heads * head_dim` at forward time via `repeat_interleave`. This ensures that the qk-norm parameters always remain consistent within the existing groups.
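
For illustration, below is a minimal PyTorch sketch of this storage scheme, not the actual Lumina-mGPT code: the learnable weight is kept with shape `(model_parallel_size, head_dim)` and expanded to `num_heads * head_dim` with `repeat_interleave` in the forward pass, so heads that belong to the same model-parallel group always share identical parameters. The class name `GroupedQKNorm` and the RMS-style normalization are simplifications chosen for the sketch.

```python
import torch
import torch.nn as nn


class GroupedQKNorm(nn.Module):
    """Sketch of a qk-norm whose weights are shared within model-parallel groups.

    The learnable weight has shape (model_parallel_size, head_dim) and is
    expanded to num_heads * head_dim at forward time, so heads that belonged
    to the same model-parallel rank always keep identical parameters.
    """

    def __init__(self, num_heads: int, head_dim: int,
                 model_parallel_size: int = 1, eps: float = 1e-6):
        super().__init__()
        assert num_heads % model_parallel_size == 0
        self.num_heads = num_heads
        self.head_dim = head_dim
        self.model_parallel_size = model_parallel_size
        self.eps = eps
        # One weight vector per model-parallel group: 1 group for 7B, 8 for 34B.
        self.weight = nn.Parameter(torch.ones(model_parallel_size, head_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, num_heads * head_dim)
        heads_per_group = self.num_heads // self.model_parallel_size
        # Expand (model_parallel_size, head_dim) -> (num_heads * head_dim);
        # consecutive heads in the same group receive identical parameters.
        weight = self.weight.repeat_interleave(heads_per_group, dim=0).reshape(-1)
        # RMS-style normalization over each head's dimension (simplified here).
        x = x.view(*x.shape[:-1], self.num_heads, self.head_dim)
        x = x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        x = x.reshape(*x.shape[:-2], self.num_heads * self.head_dim)
        return x * weight


# Example: a 7B-like configuration (32 heads, head_dim 128) uses a single group,
# so all heads share one weight vector, matching the original 7B behavior.
qk_norm = GroupedQKNorm(num_heads=32, head_dim=128, model_parallel_size=1)
out = qk_norm(torch.randn(2, 16, 32 * 128))
```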