
Conversation

@weak-kajuma (Contributor):

What does this PR do?

This PR adds the code for DiffLlama, which is a Llama model with Differential Transformer attention. Please refer to the Differential Transformer paper. @ArthurZucker

@weak-kajuma (Contributor, Author):

I'm working on the code now, but this is my first time contributing to transformers or any other OSS project, so I may ask for some help.

@weak-kajuma (Contributor, Author):

I still have an error in modeling_diffllama.py at line 377, in apply_rotary_pos_emb: query_states should be torch.Size([2, 32, 10, 128]), but it is torch.Size([2, 64, 10, 64]). I need to change either query_states or cos & sin.
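For readers hitting the same shape error: the mismatch comes from projecting the queries into twice as many half-size heads before the rotary step, while cos/sin are still built for the full head_dim of 128. Below is a rough sketch of one way to make the shapes line up (the approach suggested later in review): keep the Llama layout through apply_rotary_pos_emb and split afterwards. The tensor names are illustrative, not this PR's actual code.

import torch

# Rough sketch (not this PR's code): keep the Llama layout
# [batch, num_heads, seq, head_dim] through apply_rotary_pos_emb, so cos/sin
# built for head_dim=128 still match, and split each head into its two
# differential halves only afterwards.
batch, num_heads, seq_len, head_dim = 2, 32, 10, 128

query_states = torch.randn(batch, num_heads, seq_len, head_dim)  # [2, 32, 10, 128]
key_states = torch.randn(batch, num_heads, seq_len, head_dim)

# ... query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin) ...

# Only now split for differential attention: two [2, 32, 10, 64] halves
# instead of a single [2, 64, 10, 64] tensor hitting the rotary step.
q1, q2 = query_states.chunk(2, dim=-1)
k1, k2 = key_states.chunk(2, dim=-1)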

@ArthurZucker (Collaborator) left a comment:

Hey! I think this would be an awesome fit for modular transformers!
A bit of doc here: https://huggingface.co/docs/transformers/en/modular_transformers

This would help isolate the changes!

@weak-kajuma (Contributor, Author):

I've finished implementing the normal/eager attention, and I can run generation with AutoModelForCausalLM.generate().
Next I'll adapt it for FlashAttention2 and SDPA attention.
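For reference, here is a rough, self-contained sketch of the eager differential-attention computation from the Differential Transformer paper. The names, shapes, and the inline RMS normalization are illustrative assumptions and may not match modeling_diffllama.py exactly; in the model, lam is built per layer from learned vectors as exp(lambda_q1·lambda_k1) − exp(lambda_q2·lambda_k2) + lambda_init, and the headwise norm carries a learnable scale. Causal masking is omitted for brevity.

import math
import torch


def lambda_init_fn(layer_idx: int) -> float:
    # Depth-dependent starting value for lambda from the paper (0-based layer index here).
    return 0.8 - 0.6 * math.exp(-0.3 * layer_idx)


def diff_attention(q1, q2, k1, k2, v, lam, lambda_init):
    # q1, q2, k1, k2: [batch, heads, seq, head_dim]; v: [batch, heads, seq, v_dim].
    scale = 1.0 / math.sqrt(q1.shape[-1])
    a1 = torch.softmax(q1 @ k1.transpose(-1, -2) * scale, dim=-1)
    a2 = torch.softmax(q2 @ k2.transpose(-1, -2) * scale, dim=-1)
    out = (a1 - lam * a2) @ v  # difference of the two attention maps

    # Headwise RMS normalization ("GroupNorm" in the paper), then the fixed
    # (1 - lambda_init) rescaling; the real module also has a learnable scale.
    out = out * torch.rsqrt(out.pow(2).mean(-1, keepdim=True) + 1e-6)
    return out * (1.0 - lambda_init)


# Tiny smoke test with made-up sizes; lam would normally come from learned vectors.
b, h, s, d = 2, 4, 10, 64
q1, q2, k1, k2 = (torch.randn(b, h, s, d) for _ in range(4))
v = torch.randn(b, h, s, 2 * d)
lambda_init = lambda_init_fn(0)
out = diff_attention(q1, q2, k1, k2, v, torch.tensor(lambda_init), lambda_init)
print(out.shape)  # torch.Size([2, 4, 10, 128])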

@weak-kajuma (Contributor, Author):

I've also adapted it to fit modular transformers.

weak-kajuma and others added 8 commits on October 20, 2024 at 11:52

You don't need to divide by 2 if we use the same number of attention heads as Llama; instead you can just split in forward.

Co-authored-by: Minho Ryu <[email protected]>

fit to changing the "num_heads // 2" placement

Co-authored-by: Minho Ryu <[email protected]>

new code is more meaningful than before

Co-authored-by: Minho Ryu <[email protected]>

new code is more meaningful than before

Co-authored-by: Minho Ryu <[email protected]>

fit to changing the "num_heads // 2" placement

Co-authored-by: Minho Ryu <[email protected]>

fix dividing twice by sqrt(self.head_dim)

Co-authored-by: Minho Ryu <[email protected]>

fix dividing twice by sqrt(self.head_dim)

Co-authored-by: Minho Ryu <[email protected]>

fit to changing the "num_heads // 2" placement, and make it more readable

Co-authored-by: Minho Ryu <[email protected]>
@bzantium (Contributor) left a comment:

Implemented flash and SDPA attention as well.
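As an illustration of how the SDPA path can cover both softmax maps with a single kernel call (the "repeating value_states" trick that shows up later in the commit history), here is a hedged sketch; the function name, the interleaved head layout, and the inline normalization are assumptions, not the exact code in this PR.

import torch
import torch.nn.functional as F


def diff_sdpa(q, k, v, lam, lambda_init, is_causal=True):
    # q, k: [batch, 2 * num_heads, seq, head_dim], with the two halves of head i
    # interleaved at indices 2*i and 2*i + 1; v: [batch, num_heads, seq, v_dim].
    v = v.repeat_interleave(2, dim=1)  # both halves of a head attend over the same V
    out = F.scaled_dot_product_attention(q, k, v, is_causal=is_causal)

    # Un-interleave into the two attention outputs and take their difference.
    b, two_h, s, dv = out.shape
    o1, o2 = out.reshape(b, two_h // 2, 2, s, dv).unbind(dim=2)
    out = o1 - lam * o2

    # Headwise RMS normalization and (1 - lambda_init) rescaling, as in the eager path.
    out = out * torch.rsqrt(out.pow(2).mean(-1, keepdim=True) + 1e-6)
    return out * (1.0 - lambda_init)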

@weak-kajuma (Contributor, Author):

@bzantium I found that the attention was still mis-implemented relative to the paper as of e072544, so I'll revert to e072544 and re-implement it following your suggested code style.

@Cyrilvallez (Member):

Hey, sorry for the delay!
In order to use modular transformers, you need to create a new file, modular_diffllama.py, in which you can use inheritance from the different Llama classes. Then, to automatically create the modeling_diffllama.py file, just use our CLI: python utils/modular_model_converter.py --files_to_parse src/transformers/models/diffllama/modular_diffllama.py from the root of the transformers repo 🤗
LMK if you need more guidance on this! You can find some modular examples, e.g. here.
Basically, for any class that is similar to a Llama class, you can directly inherit from it to avoid rewriting it; e.g. if DiffLlamaRotaryEmbedding is similar to LlamaRotaryEmbedding, you can use

class DiffLlamaRotaryEmbedding(LlamaRotaryEmbedding):
    pass

in the modular file. In your case, you will probably need to only rewrite the attention classes 😉
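As a concrete illustration of that pattern, a slim modular_diffllama.py could look roughly like the sketch below; this is a hypothetical skeleton, not the file that was eventually merged, and the DiffLlamaAttention body is deliberately left unimplemented.

from transformers.models.llama.modeling_llama import LlamaAttention, LlamaRotaryEmbedding


class DiffLlamaRotaryEmbedding(LlamaRotaryEmbedding):
    pass


class DiffLlamaAttention(LlamaAttention):
    def forward(self, *args, **kwargs):
        # The differential-attention computation (two softmax maps combined as
        # attn1 - lambda * attn2, headwise norm, (1 - lambda_init) rescaling)
        # is the only part that really needs a new body.
        raise NotImplementedError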

@effortprogrammer commented on Nov 30, 2024:

Are you still working on this PR, @weak-kajuma ?

@weak-kajuma (Contributor, Author):

@Cyrilvallez Could you review again? I made modular_diffllama.py.

@Cyrilvallez (Member) left a comment:

Hey! A great first modular! But you can still cut a lot of code: the only differences here are the attention classes, so it's perfect for modular to pick up everything else by itself!
LMK if you run into any issues.

@Cyrilvallez (Member):

You may need to rebase/merge on main, though, for modular to work perfectly, as you seem to be a bit far behind. If something does not work as expected after my comments, you should try that first 🤗

@weak-kajuma (Contributor, Author):

@Cyrilvallez Could you review again? Modular transformers is very easy to use and works well. Also, all tests pass after merging the latest changes.

@effortprogrammer:

@Cyrilvallez any plans to review this PR?

@Cyrilvallez (Member) left a comment:

Alright, very good! Final comments 🤗

Comment on lines 54 to 66:

class DiffLlamaRMSNorm(LlamaRMSNorm):
    pass


ALL_LAYERNORM_LAYERS.append(DiffLlamaRMSNorm)


class DiffLlamaRotaryEmbedding(LlamaRotaryEmbedding):
    pass


class DiffLlamaMLP(MistralMLP):
    pass
Member:

Should be removed!

Contributor (Author):

If I remove DiffLlamaMLP, then AttributeError: 'DiffLlamaConfig' object has no attribute 'mlp_bias' is raised. So I cannot remove it.
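For context on why that happens: in transformers, LlamaMLP builds its linear layers with bias=config.mlp_bias, while MistralMLP hard-codes bias=False, so inheriting from MistralMLP sidesteps the missing attribute. A simplified illustration with placeholder class names (not the library's exact code):

import torch.nn as nn


class LlamaStyleMLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Llama's MLP reads config.mlp_bias -> AttributeError with DiffLlamaConfig.
        self.gate_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=config.mlp_bias)


class MistralStyleMLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Mistral's MLP hard-codes bias=False, so no mlp_bias attribute is needed.
        self.gate_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)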

Collaborator:

Good call! 🤗

@ArthurZucker (Collaborator) left a comment:

Very, very nice! The only thing missing is to update based on #35235! If you don't want to, we'll just open a PR afterwards!

self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias)
self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias)

self.lambda_init = lambda_init_fn(layer_idx)
Collaborator:

this should go in _init_weights() AFAIK!

Contributor (Author):

Sorry, I'm not familiar with _init_weights(). How does it work?

Member:

It's not really a weight initialization, just the declaration of a parameter, so it should be OK as-is.
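For anyone else unfamiliar with it: _init_weights() is the PreTrainedModel hook that post_init() calls on every submodule to initialize weights, while declaring parameters such as lambda_init stays in __init__. A generic Llama-style sketch with a placeholder class name (not DiffLlama's exact code):

import torch.nn as nn
from transformers import PreTrainedModel


class ExamplePreTrainedModel(PreTrainedModel):
    # Generic weight-init hook; post_init() calls this once per submodule.
    def _init_weights(self, module):
        std = self.config.initializer_range
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()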

@weak-kajuma (Contributor, Author):

I know the main change in #35235 is about attention, but I may not be able to adapt differential attention to match #35235. I know you are busy, but I would appreciate it if you could open that follow-up PR.

@Cyrilvallez (Member) left a comment:

Ok, then I think it should be ready to merge, we'll take it from there! Thanks a lot for the contribution! cc @ArthurZucker

@ArthurZucker (Collaborator):

Cool let's merge then! 🤗

@ArthurZucker merged commit 96bf3d6 into huggingface:main on Jan 7, 2025. 23 checks passed.
@ArthurZucker (Collaborator):

Thanks for iterating, and congrats on the model merge 🥳

AlanPonnachan pushed a commit to AlanPonnachan/transformers that referenced this pull request on Jan 7, 2025:
* first adding diffllama

* add Diff Attention and other but still with errors

* complete making attention Diff-Attention

* fix some bugs which may be caused by transformer-cli while adding model

* fix a bug caused by forgetting KV cache...

* Update src/transformers/models/diffllama/modeling_diffllama.py

You don't need to divide by 2 if we use the same number of attention heads as Llama; instead you can just split in forward.

Co-authored-by: Minho Ryu <[email protected]>

* Update src/transformers/models/diffllama/modeling_diffllama.py

fit to changing the "num_heads // 2" placement

Co-authored-by: Minho Ryu <[email protected]>

* Update src/transformers/models/diffllama/modeling_diffllama.py

new code is more meaningful than before

Co-authored-by: Minho Ryu <[email protected]>

* Update src/transformers/models/diffllama/modeling_diffllama.py

new code is more meaningful than before

Co-authored-by: Minho Ryu <[email protected]>

* Update src/transformers/models/diffllama/modeling_diffllama.py

fit to changing the "num_heads // 2" placement

Co-authored-by: Minho Ryu <[email protected]>

* Update src/transformers/models/diffllama/modeling_diffllama.py

fix dividing twice by sqrt(self.head_dim)

Co-authored-by: Minho Ryu <[email protected]>

* Update src/transformers/models/diffllama/modeling_diffllama.py

fix dividing twice by sqrt(self.head_dim)

Co-authored-by: Minho Ryu <[email protected]>

* Update src/transformers/models/diffllama/modeling_diffllama.py

fit to changing the "num_heads // 2" placement, and make it more readable

Co-authored-by: Minho Ryu <[email protected]>

* I found the attention was still mis-implemented relative to the paper as of e072544.

* re-implemented

* adding groupnorm

Co-authored-by: Minho Ryu <[email protected]>

* align with transformers code style

Co-authored-by: Minho Ryu <[email protected]>

* fix typo

Co-authored-by: Minho Ryu <[email protected]>

* adding groupnorm

Co-authored-by: Minho Ryu <[email protected]>

* change SdpaAttention to DiffSdpaAttention

Co-authored-by: Minho Ryu <[email protected]>

* fix bug

* Update src/transformers/models/diffllama/modeling_diffllama.py

resolve "not same outputs" problem

Co-authored-by: Minho Ryu <[email protected]>

* fix bugs of places of "GroupNorm with scale" and etc

* Revert "fix bugs of places of "GroupNorm with scale" and etc"

This reverts commit 26307d9.

* simplify multiple attention (matmul) operations into one by repeating value_states

Co-authored-by: Minho Ryu <[email protected]>

* simplify multiple attention (matmul) operations into one by repeating value_states

Co-authored-by: Minho Ryu <[email protected]>

* simplify multiple attention (matmul) operations into one by repeating value_states

Co-authored-by: Minho Ryu <[email protected]>

* remove missed type

* add diffllama model_doc

* apply make style/quality

* apply review comment about model

* apply review comment about test

* place diffllama alphabetically on the src/transformers/__init__.py

* fix forgot code

* Supports parameters that are not initialized with standard deviation 0 in the conventional method

* add DiffLlamaConfig to CONFIG_CLASSES_TO_IGNORE_FOR_DOCSTRING_CHECKPOINT_CHECK on utils/check_config_docstrings.py

* remove unused property of config

* add to supported model list

* add to sdpa supported model list

* fix copyright, remove pretraining_tensor_parallel, and modify for initialization test

* remove unused import and etc.

* empty commit

* empty commit

* empty commit

* apply modular transformers but with bugs

* revert prev commit

* create src/transformers/model/diffllama/modular_diffllama.py

* run utils/modular_model_converter.py

* empty commit

* leaner modular diffllama

* remove more and more in modular_diffllama.py

* remove more and more in modular_diffllama.py

* resolve missing docstring entries

* force reset

* convert modular

---------

Co-authored-by: Minho Ryu <[email protected]>