FEAT : Adding VPTQ quantization method to HFQuantizer #34770
Conversation
cc @MekkCyber

Niiiiice! 🚀

Awesome PR @wejoncy, thank you for adding this amazing quantization method! The integration looks very smooth 🔥! I just left some minor comments.
from accelerate import init_empty_weights
from vptq import VQuantLinear
Just for consistency with other quantization methods, the imports are not done inside replace_with_xxx_linear; you can move them outside 😁. And there's no need to check whether vptq and accelerate are available at this stage: the validate_environment method in the quantizer will take care of that.
Good suggestion.
Thanks
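A minimal sketch of the suggested import placement, assuming the integration module follows the other replace_with_*_linear integrations (the guard style and names here are illustrative, not the final PR code):

```python
# Sketch only: imports sit at module level, guarded like other optional backends.
# replace_with_vptq_linear can then use them directly, without raising its own
# ImportError, because the quantizer's validate_environment() has already checked both.
from transformers.utils import is_accelerate_available, is_vptq_available

if is_accelerate_available():
    from accelerate import init_empty_weights
if is_vptq_available():
    from vptq import VQuantLinear
```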
    quantization_config=None,
    current_key_name=None,
    has_been_replaced=False,
):
Just a question: I noticed you don't need modules_to_not_convert, so I assume you quantize all layers, even the embeddings and lm_head. Does that work even in extreme quantization?
Well, we use modules_to_convert in the config JSON; it lists all the layers that should be quantized, along with their corresponding parameters.
But I have added modules_to_not_convert to the function signature to let users specify which layers should be excluded.
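For illustration, the signature with the opt-out parameter might look like the following sketch (parameter order and defaults may differ from the actual PR):

```python
def replace_with_vptq_linear(
    model,
    quantization_config=None,
    modules_to_not_convert=None,  # e.g. ["lm_head"]: listed modules keep their original precision
    current_key_name=None,
    has_been_replaced=False,
):
    # Modules named in modules_to_not_convert are skipped; everything described in the
    # config's per-layer parameters is swapped for VQuantLinear.
    ...
```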
else:
    torch_dtype = torch.float32
    logger.info(
        "CUDA is unavailable. Assuming VPTQ inference on CPU and loading the model in `torch.float32`. To overwrite it, set `torch_dtype` manually."
    )
In the overview page, you specified that vptq doesn't work on CPU; was that intended?
We are working on CPU kernels, but let's mark it as unavailable on CPU for now.
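For context, a sketch of the dtype selection being discussed; only the CPU fallback appears in the quoted diff, so the CUDA branch below is an assumption:

```python
import logging

import torch

logger = logging.getLogger(__name__)


def update_torch_dtype(torch_dtype):
    # Sketch: fp16 on CUDA (assumed); fp32 fallback on CPU, matching the quoted log message.
    if torch_dtype is None:
        if torch.cuda.is_available():
            torch_dtype = torch.float16
        else:
            torch_dtype = torch.float32
            logger.info(
                "CUDA is unavailable. Assuming VPTQ inference on CPU and loading the model in "
                "`torch.float32`. To overwrite it, set `torch_dtype` manually."
            )
    return torch_dtype
```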
raise ImportError("Using `vptq` quantization requires VPTQ: `pip install vptq`")
import vptq

assert version.Version(vptq.__version__) >= version.Version("0.0.4"), "vptq is available since 0.0.4"
I think it would be cleaner to have something like this for vptq:

def is_vptq_available(min_version: str = VPTQ_MIN_VERSION):
    return _vptq_available and version.parse(_vptq_version) >= version.parse(min_version)

You can follow the example of accelerate in src/transformers/utils/import_utils.py.
Sounds good. Thanks.
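A sketch of what following the accelerate pattern in src/transformers/utils/import_utils.py could look like; _is_package_available is the existing helper defined in that file, and the 0.0.4 minimum comes from the assertion above:

```python
from packaging import version

# Module-level resolution, mirroring how accelerate is handled in import_utils.py.
VPTQ_MIN_VERSION = "0.0.4"
_vptq_available, _vptq_version = _is_package_available("vptq", return_version=True)


def is_vptq_available(min_version: str = VPTQ_MIN_VERSION):
    return _vptq_available and version.parse(_vptq_version) >= version.parse(min_version)
```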
    kwargs (`Dict[str, Any]`, *optional*):
        Additional parameters from which to initialize the configuration object.
"""
From what I understand, the config.json file will contain a VPTQ layer config with all the specified parameters for each Linear layer in each model layer, so for a Llama3-70B with 80 model layers and 7 Linears per layer that would be 80 * 7 = 560 layer configs. Is there a way to reduce that?
Yes, that's exactly right. The config file contains all the layer-specific parameters, and it is a bit huge.
We use this to customize the quantization parameters for each linear layer to get the best performance.
But we will try to figure out a way to allow a simplified configuration.
Could we, for each Linear layer type like q_proj or down_proj, have a list of parameter values where each entry corresponds to a model layer?
Hi @MekkCyber, sorry, I'm not quite following your suggestion, could you please explain a little bit more?
What we do is give each linear layer its own parameter dict, such as:

model.decoder_model.layers.0.self_attn.q_proj: {"v_len": [-1, 8], ...},
model.decoder_model.layers.0.self_attn.k_proj: {"v_len": [-1, 8], ...},
...
model.decoder_model.layers.n.self_attn.q_proj: {"v_len": [-1, 8], ...},
Hi @wejoncy, sorry I wasn't clear. I meant something like this, for example for the up_proj linears:

model.layers.mlp.up_proj: {
    "bias": [null, null, ..., null],
    "enable_norm": [true, true, false, ..., true],
    ...
    "num_centroids": [
        (-1, 65536), (-1, 65536), ...,
    ],
    ...
},

So each property contains a list of values, one per layer; the length of each list is the number of model layers, and for the i-th layer we would access the bias property, for example, with model.layers.mlp.up_proj["bias"][i].
Hi @MekkCyber, this is not quite doable, as we have to allow some layers to keep their original precision. But we have a "Shared_layer_config" that contains an attention_layer entry shared by all other layers.
Here is an example: https://huggingface.co/VPTQ-community/Meta-Llama-3.1-8B-Instruct-v12-k65536-4096-woft/blob/main/config.json
EXPECTED_OUTPUT = "Hello my name is Sarah and I am a 25 year old woman from the United States. I am a college graduate and I am currently working as a marketing specialist for a small"

device_map = "cuda"
Small nit! If inference on CPU works, you can add a test for the CPU case too.
I have fixed the CPU device option availability check. We will optimize the CPU kernel in the next VPTQ release.
After that I will open another PR to support CPU and add a CPU unit test.
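Purely as an illustration of that deferred follow-up, a CPU test could mirror the existing GPU test once the kernels land; everything below (test name, fixture attributes) is hypothetical and assumes the same test class setup:

```python
def test_quantized_model_cpu(self):
    """Hypothetical follow-up test: run the quantized model on CPU once VPTQ ships CPU kernels."""
    model = AutoModelForCausalLM.from_pretrained(self.model_name, device_map="cpu")
    input_ids = self.tokenizer(self.input_text, return_tensors="pt")
    output = model.generate(**input_ids, max_new_tokens=self.max_new_tokens, do_sample=False)
    self.assertEqual(self.tokenizer.decode(output[0], skip_special_tokens=True), self.EXPECTED_OUTPUT)
```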
Hi @MekkCyber, I really appreciate your review of this PR, and thanks for your interest in integrating VPTQ quantization into transformers. Please let me know if there is anything that needs to be fixed.

Thanks for iterating @wejoncy, LGTM 🔥! Left only two small nits.
if not is_accelerate_available():
    raise ImportError("Using `vptq` quantization requires Accelerate: `pip install accelerate`")

if not is_vptq_available():
Just a small nit: now that is_vptq_available checks the version too, you can change the message to specify that vptq should be installed and that it requires the minimum version you want.
Yes, we have VPTQ_MIN_VERSION = "0.0.4" as the default.
Good suggestion. I have added the minimum version to the error message to guide users clearly.
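For instance, the validate_environment check could read roughly like this (a sketch; the exact wording in the PR may differ, and VPTQ_MIN_VERSION is the constant discussed above):

```python
if not is_vptq_available():
    raise ImportError(
        f"Using `vptq` quantization requires VPTQ >= {VPTQ_MIN_VERSION}: `pip install -U vptq`"
    )
```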
if isinstance(module, VQuantLinear):
    nb_vptq_linear += 1

self.assertEqual(nb_linears - 25, nb_vptq_linear)
Just for a bit of clarity, you can add a comment here noting that 25 corresponds to the 24 decoder.layers.{layer_idx}.fc1 modules plus lm_head in modules_to_not_convert.
Sounds good. Thanks
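Applied to the snippet above, the suggested comment would look something like this (the counts are taken from the reviewer's note):

```python
if isinstance(module, VQuantLinear):
    nb_vptq_linear += 1

# 25 = 24 decoder.layers.{layer_idx}.fc1 modules + lm_head, which are listed in
# modules_to_not_convert and therefore stay as regular nn.Linear layers.
self.assertEqual(nb_linears - 25, nb_vptq_linear)
```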
Thanks for the updates @wejoncy! Can't wait for the PR to be merged!

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
SunMarc
left a comment
Thanks for the PR! Excited to see more VPTQ models on the Hub. Just a few nits. Thanks also for the nice tests and the documentation.
Thanks @SunMarc for your review! We are actively developing new models and algorithms, and we expect to release support for multimodal models (LLaMA 3.2 and Qwen2-VL) next month.
friendly ping @ArthurZucker

Hi @ArthurZucker,

Hi @ArthurZucker, Best regards,
ArthurZucker
left a comment
Let's go! 🤗
very nice readme! 🤗
Super sorry for the late review, I am a bit under water 😓

Thanks so much! And thanks to everyone for their help! Special thanks to @xianbaoqian for the support, super grateful! 😊
Hey @YangWang92, I was going through the VPTQ paper, and I wanted to make sure that I understand correctly: VPTQ is the same as GPTQ except that the quantization step is not linear but uses vector quantization (VQ) instead. Is that correct?
Hi @MekkCyber, |
What does this PR do?
This PR adds support for Extreme Low-bit Vector Post-Training Quantization (VPTQ) to the transformers library.
VPTQ is a novel post-training quantization method that leverages Vector Quantization to achieve high accuracy on LLMs at an extremely low bit-width (<2-bit). VPTQ can compress 70B, and even 405B, models to 1-2 bits without retraining while maintaining high accuracy. More details here: https://github.com/microsoft/vptq
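As a quick illustration of the integration, loading a pre-quantized checkpoint should not require any extra arguments. The sketch below uses the community checkpoint linked earlier in the thread; the prompt and generation settings are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pre-quantized VPTQ checkpoint referenced earlier in this thread.
model_id = "VPTQ-community/Meta-Llama-3.1-8B-Instruct-v12-k65536-4096-woft"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="cuda")

inputs = tokenizer("Hello my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```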
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@SunMarc @younesbelkada @ArthurZucker