Fix: Falcon processor doesn't account for a layout difference of qkv between transformers and GGUF #35088
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
LGTM! cc @Isotr0py for a second look. In this refactor PR, he removed `model_size`. Maybe there is a better check to deduce whether `new_decoder_architecture` should be set to `True` or not.
```python
if model_size == "40b":
    parsed_parameters["config"]["new_decoder_architecture"] = True
```
Can you add tests for the 40b model, even if we won't run them on the CI (add @Skip)?
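For reference, a rough sketch of what such a skipped test could look like; the repo id, file name, and generation check below are placeholders and are not taken from the actual test suite:

```python
import unittest

from transformers import AutoModelForCausalLM, AutoTokenizer


class FalconGgufIntegrationTest(unittest.TestCase):
    @unittest.skip("Falcon 40B GGUF is too large to run on CI")
    def test_falcon_40b_q2_k(self):
        # Placeholder repo/file, based on the checkpoint linked later in this thread.
        model_id = "maddes8cht/tiiuae-falcon-40b-instruct-gguf"
        gguf_file = "tiiuae-falcon-40b-instruct-Q2_K.gguf"

        tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=gguf_file)
        model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=gguf_file)

        inputs = tokenizer("Hello", return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=10)
        # Only check that generation runs and produces new tokens.
        self.assertGreater(out.shape[1], inputs["input_ids"].shape[1])
```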
In fact, I'm thinking about avoiding the `model_size` extracted from the file name, because some user fine-tuned models may use custom filenames without "40b" in them explicitly. (I also removed the `model_size` logic in #34385.) Since `new_decoder_architecture` implies two layernorms for attention in each decoder layer, I would prefer to check for the existence of `attn_norm_2` to determine `new_decoder_architecture`. You can refer to https://huggingface.co/maddes8cht/tiiuae-falcon-40b-instruct-gguf?show_file_info=tiiuae-falcon-40b-instruct-Q2_K.gguf, which indeed has `attn_norm_2` params.
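As a rough illustration, assuming the `gguf` Python package and a locally downloaded checkpoint (the path below is a placeholder), the check can be done directly on the tensor names:

```python
from gguf import GGUFReader

# Placeholder path to a locally downloaded Falcon GGUF checkpoint.
reader = GGUFReader("tiiuae-falcon-40b-instruct-Q2_K.gguf")

# Falcon 40B decoder layers carry a second attention layernorm ("attn_norm_2"),
# so its presence is a reasonable signal for new_decoder_architecture.
has_attn_norm_2 = any("attn_norm_2" in tensor.name for tensor in reader.tensors)
print("new_decoder_architecture:", has_attn_norm_2)
```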
Suggested change:

```diff
- if model_size == "40b":
-     parsed_parameters["config"]["new_decoder_architecture"] = True
+ new_decoder_architecture = any("attn_norm_2" in tensor.name for tensor in reader.tensors)
+ parsed_parameters["config"]["new_decoder_architecture"] = new_decoder_architecture
```
cc @MekkCyber
What does this PR do?
For Falcon 40B in the `transformers` modeling code, the Q, K, and V tensors are fused and stored in an interleaved manner. This means that, for each group, the Q tensors for all heads in the group are stacked together, followed by the K and V matrices of that group. However, in the GGUF layout, `falcon.tensor_data_layout` is set to `jploski`, which changes how the fused Q, K, and V tensors are stored (more info here). In this layout, the Q, K, and V tensors are stored sequentially instead of interleaved. This sequential storage makes them easier to use on the `llama_cpp` side. To handle this difference, the PR processes the `qkv` tensors to convert them back into the interleaved format required for `transformers` modeling.

Additionally, this PR adds a new field, `new_decoder_architecture`, to the configuration. This is necessary to ensure the Falcon modeling code handles the `fused_qkv` sizes correctly. The PR also fixes the name of the `num_key_value_heads` field in the Falcon configuration, changing it to the correct name, `num_kv_heads`.
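To make the layout difference concrete, here is a minimal numpy sketch of the kind of conversion described above. It is not the PR's actual code; the function name, argument names, and shape assumptions are illustrative only:

```python
import numpy as np


def interleave_falcon_qkv(weights, num_attention_heads, num_kv_heads, head_dim, hidden_size):
    """Convert a fused qkv weight from the sequential GGUF ("jploski") layout,
    where all Q rows come first, then all K rows, then all V rows, into a
    per-group interleaved layout of [Q heads of the group, K, V]."""
    q_rows = num_attention_heads * head_dim
    kv_rows = num_kv_heads * head_dim
    q, k, v = np.split(weights, [q_rows, q_rows + kv_rows], axis=0)

    heads_per_group = num_attention_heads // num_kv_heads
    q = q.reshape(num_kv_heads, heads_per_group * head_dim, hidden_size)
    k = k.reshape(num_kv_heads, head_dim, hidden_size)
    v = v.reshape(num_kv_heads, head_dim, hidden_size)

    # Stack each group as [Q heads of the group, K, V], then flatten back to 2D.
    qkv = np.concatenate([q, k, v], axis=1)
    return qkv.reshape(-1, hidden_size)
```

With Falcon 40B's configuration (128 attention heads, 8 KV groups, head_dim 64, hidden_size 8192), both the input and the output of this function have shape (9216, 8192); only the row ordering changes.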
Who can review?
@SunMarc @LysandreJik