Add diffllama #34083
Conversation
I am coding now, but it's my first time contributing to transformers or any other OSS project, so I may ask you for some help.
Force-pushed 765db6a to 269055e
I still have an error at modeling_diffllama.py line 377 in apply_rotary_pos_emb: "query_states" must be torch.Size([2, 32, 10, 128]), but it is torch.Size([2, 64, 10, 64]). I need to change either "query_states" or "cos"/"sin".
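For reference, a minimal sketch of how this mismatch can be avoided, following the suggestion later made in review (keep Llama's head layout, apply RoPE on the full head_dim, and only split inside forward). Shapes and the choice of split axis are illustrative, not the final implementation:

```python
import torch

batch, num_heads, seq_len, head_dim = 2, 32, 10, 128

# Keep the same projection layout as Llama: [batch, num_heads, seq_len, head_dim],
# so cos/sin from the rotary embedding (built for the full head_dim) still match.
query_states = torch.randn(batch, num_heads, seq_len, head_dim)
key_states = torch.randn(batch, num_heads, seq_len, head_dim)

# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)

# Only after RoPE, split into the two halves used by differential attention.
q1, q2 = torch.chunk(query_states, 2, dim=1)  # two groups of num_heads // 2 heads each
k1, k2 = torch.chunk(key_states, 2, dim=1)
print(q1.shape)  # torch.Size([2, 16, 10, 128])
```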
ArthurZucker
left a comment
Hey! I think this would be an awesome fit for modular transformers!
A bit of doc here: https://huggingface.co/docs/transformers/en/modular_transformers
This would help isolate the changes!
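For readers unfamiliar with the modular format, a rough sketch of what such a file could look like, assuming only the attention genuinely needs changes (names are illustrative; inside the repo the imports would be relative, and utils/modular_model_converter.py expands the file into a full modeling_diffllama.py):

```python
# Sketch of src/transformers/models/diffllama/modular_diffllama.py, not the final file.
from torch import nn

from transformers.models.llama.modeling_llama import (
    LlamaForCausalLM,
    LlamaModel,
    LlamaRMSNorm,
    LlamaRotaryEmbedding,
)


class DiffLlamaRMSNorm(LlamaRMSNorm):
    pass  # identical to Llama; the converter copies the implementation


class DiffLlamaRotaryEmbedding(LlamaRotaryEmbedding):
    pass


class DiffLlamaAttention(nn.Module):
    """The only class that genuinely differs: differential attention goes here."""


class DiffLlamaModel(LlamaModel):
    pass


class DiffLlamaForCausalLM(LlamaForCausalLM):
    pass
```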
I've finished implementing the normal/eager attention, and I can run it with AutoModelForCausalLM.generate().
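For example, a quick generation smoke test along these lines (the checkpoint path is a placeholder for a locally converted DiffLlama checkpoint, not a released model):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "path/to/local-diffllama-checkpoint"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer("Differential attention is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```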
I've also adapted it to fit modular transformers.
You don't need to divide by 2 if we use the same number of attention heads as Llama; instead you can just split in forward. Co-authored-by: Minho Ryu <[email protected]>
fit to changing "num_heads // 2" place Co-authored-by: Minho Ryu <[email protected]>
new codes are more meaningful than before Co-authored-by: Minho Ryu <[email protected]>
new codes are more meaningful than before Co-authored-by: Minho Ryu <[email protected]>
fit to changing "num_heads // 2" place Co-authored-by: Minho Ryu <[email protected]>
fix dividing twice by sqrt(self.head_dim) Co-authored-by: Minho Ryu <[email protected]>
fix dividing twice by sqrt(self.head_dim) Co-authored-by: Minho Ryu <[email protected]>
fit to changing "num_heads // 2" place, and more visible Co-authored-by: Minho Ryu <[email protected]>
bzantium
left a comment
Implemented flash and SDPA attention as well.
Co-authored-by: Minho Ryu <[email protected]>
Co-authored-by: Minho Ryu <[email protected]>
Hey, sorry for the delay! You can just write `class DiffLlamaRotaryEmbedding(LlamaRotaryEmbedding): pass` in the modular file. In your case, you will probably need to only rewrite the attention classes 😉
Are you still working on this PR, @weak-kajuma?
@Cyrilvallez Could you review again? I made
Cyrilvallez
left a comment
Hey! A great first modular! But you can still cut a lot of code: the only real differences here are the attention classes, so it's perfect for modular to pick up everything else by itself!
LMK if you run into any issues
You may need to rebase/merge on
@Cyrilvallez Could you review again? Modular transformers is very easy and good. I can also pass all tests after merging the latest changes.
@Cyrilvallez any plans to review this PR?
Cyrilvallez
left a comment
Alright, very good! Final comments 🤗
class DiffLlamaRMSNorm(LlamaRMSNorm):
    pass


ALL_LAYERNORM_LAYERS.append(DiffLlamaRMSNorm)


class DiffLlamaRotaryEmbedding(LlamaRotaryEmbedding):
    pass


class DiffLlamaMLP(MistralMLP):
    pass
Should be removed!
If I remove DiffLlamaMLP, then AttributeError: 'DiffLlamaConfig' object has no attribute 'mlp_bias' is raised. So I cannot remove it.
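For context, a simplified sketch of why the pass-through matters, assuming the usual difference between the two parents (LlamaMLP reads config.mlp_bias, while MistralMLP hardcodes bias=False); this is illustrative, not the actual transformers source:

```python
import torch.nn as nn


class LlamaStyleMLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Reads config.mlp_bias, so a config without that attribute raises AttributeError.
        self.gate_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=config.mlp_bias)


class MistralStyleMLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Hardcodes bias=False, so no extra config attribute is needed.
        self.gate_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
```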
Good call! 🤗
Force-pushed 7b0da01 to b4ff5f3
ArthurZucker
left a comment
Very, very nice! The only thing missing is to update based on #35235! If you don't want to, we'll just open a PR afterwards!
self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias)
self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias)

self.lambda_init = lambda_init_fn(layer_idx)
this should go in _init_weights() AFAIK!
Sorry, I'm not familiar with _init_weights(). How should I move this there?
It's not really a weight initialization, just the declaration of a parameter, so it should be OK as-is.
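For readers who haven't seen it, lambda_init_fn is a deterministic function of the layer index (following the Differential Transformer paper), which is why it can stay in __init__; a minimal sketch, assuming the paper's formula:

```python
import math


def lambda_init_fn(layer_idx: int) -> float:
    # lambda_init = 0.8 - 0.6 * exp(-0.3 * layer_idx): a fixed per-layer scalar,
    # not a learned weight, so no random initialization in _init_weights() is needed.
    return 0.8 - 0.6 * math.exp(-0.3 * layer_idx)


print([round(lambda_init_fn(i), 3) for i in range(4)])  # grows with depth towards 0.8
```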
Cyrilvallez
left a comment
Ok, then I think it should be ready to merge, we'll take it from there! Thanks a lot for the contribution! cc @ArthurZucker
Cool, let's merge then! 🤗
Thanks for iterating and congrats on the model merge 🥳
* first adding diffllama
* add Diff Attention and other but still with errors
* complete making Diff-Attention
* fix some bugs which may be caused by transformers-cli while adding model
* fix a bug caused by forgetting KV cache...
* Update src/transformers/models/diffllama/modeling_diffllama.py: You don't need to divide by 2 if we use the same number of attention heads as Llama; instead you can just split in forward. Co-authored-by: Minho Ryu <[email protected]>
* Update src/transformers/models/diffllama/modeling_diffllama.py: fit to changing "num_heads // 2" place Co-authored-by: Minho Ryu <[email protected]>
* Update src/transformers/models/diffllama/modeling_diffllama.py: new codes are more meaningful than before Co-authored-by: Minho Ryu <[email protected]>
* Update src/transformers/models/diffllama/modeling_diffllama.py: new codes are more meaningful than before Co-authored-by: Minho Ryu <[email protected]>
* Update src/transformers/models/diffllama/modeling_diffllama.py: fit to changing "num_heads // 2" place Co-authored-by: Minho Ryu <[email protected]>
* Update src/transformers/models/diffllama/modeling_diffllama.py: fix dividing twice by sqrt(self.head_dim) Co-authored-by: Minho Ryu <[email protected]>
* Update src/transformers/models/diffllama/modeling_diffllama.py: fix dividing twice by sqrt(self.head_dim) Co-authored-by: Minho Ryu <[email protected]>
* Update src/transformers/models/diffllama/modeling_diffllama.py: fit to changing "num_heads // 2" place, and more visible Co-authored-by: Minho Ryu <[email protected]>
* I found the attention was still mis-implemented from the paper as of e072544.
* re-implemented
* adding groupnorm Co-authored-by: Minho Ryu <[email protected]>
* align with transformers code style Co-authored-by: Minho Ryu <[email protected]>
* fix typo Co-authored-by: Minho Ryu <[email protected]>
* adding groupnorm Co-authored-by: Minho Ryu <[email protected]>
* change SdpaAttention to DiffSdpaAttention Co-authored-by: Minho Ryu <[email protected]>
* fix bug
* Update src/transformers/models/diffllama/modeling_diffllama.py: resolve "not same outputs" problem Co-authored-by: Minho Ryu <[email protected]>
* fix bugs of places of "GroupNorm with scale" and etc.
* Revert "fix bugs of places of "GroupNorm with scale" and etc" (this reverts commit 26307d9)
* simplify multiple attention (matmul) operations into one by repeating value_states Co-authored-by: Minho Ryu <[email protected]>
* simplify multiple attention (matmul) operations into one by repeating value_states Co-authored-by: Minho Ryu <[email protected]>
* simplify multiple attention (matmul) operations into one by repeating value_states Co-authored-by: Minho Ryu <[email protected]>
* remove missed type
* add diffllama model_doc
* apply make style/quality
* apply review comment about model
* apply review comment about test
* place diffllama alphabetically in src/transformers/__init__.py
* fix forgotten code
* support parameters that are not initialized with standard deviation 0 in the conventional method
* add DiffLlamaConfig to CONFIG_CLASSES_TO_IGNORE_FOR_DOCSTRING_CHECKPOINT_CHECK in utils/check_config_docstrings.py
* remove unused property of config
* add to supported model list
* add to sdpa supported model list
* fix copyright, remove pretraining_tensor_parallel, and modify for initialization test
* remove unused imports, etc.
* empty commit
* empty commit
* empty commit
* apply modular transformers but with bugs
* revert prev commit
* create src/transformers/models/diffllama/modular_diffllama.py
* run utils/modular_model_converter.py
* empty commit
* leaner modular diffllama
* remove more and more in modular_diffllama.py
* remove more and more in modular_diffllama.py
* resolve missing docstring entries
* force reset
* convert modular
---------
Co-authored-by: Minho Ryu <[email protected]>
What does this PR do?
This PR adds the code for DiffLlama, which is the Llama model equipped with Differential Transformer attention. Please refer to the Differential Transformer paper. @ArthurZucker
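For reviewers who haven't read the paper, a rough sketch of the core differential attention idea (eager only, no masking or caching; the learned lambda parametrization and the exact GroupNorm placement are simplified, so treat this as an illustration rather than a copy of the PR's code):

```python
import torch
import torch.nn.functional as F


def differential_attention(q1, k1, q2, k2, v, lam, lambda_init):
    # Two softmax attention maps over the same values; subtracting them cancels
    # common-mode "attention noise", which is the core idea of Differential Transformer.
    d = q1.shape[-1]
    a1 = F.softmax(q1 @ k1.transpose(-1, -2) / d**0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-1, -2) / d**0.5, dim=-1)
    out = (a1 - lam * a2) @ v
    # The paper additionally applies a per-head RMSNorm/GroupNorm before scaling by (1 - lambda_init).
    return out * (1 - lambda_init)


# Tiny smoke test with illustrative shapes: batch=2, heads=4, seq=10, half-size q/k dims.
q1, q2, k1, k2 = (torch.randn(2, 4, 10, 32) for _ in range(4))
v = torch.randn(2, 4, 10, 64)
print(differential_attention(q1, k1, q2, k2, v, lam=0.5, lambda_init=0.2).shape)  # [2, 4, 10, 64]
```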