Conversation

@wejoncy
Contributor

@wejoncy wejoncy commented Nov 18, 2024

What does this PR do?

This PR adds support for Extreme Low-bit Vector Post-Training Quantization (VPTQ) to the transformers library.

VPTQ is a novel post-training quantization method that leverages Vector Quantization to achieve high accuracy on LLMs at extremely low bit-widths (<2 bits). VPTQ can compress 70B and even 405B models to 1-2 bits without retraining while maintaining high accuracy. More details here: https://github.com/microsoft/vptq
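
A minimal usage sketch of what this enables once merged (the checkpoint id below is one of the VPTQ-community models referenced later in this thread; `vptq` must be installed):

```python
# Hedged sketch: load and run a pre-quantized VPTQ checkpoint via plain from_pretrained
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "VPTQ-community/Meta-Llama-3.1-8B-Instruct-v12-k65536-4096-woft"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Hello my name is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```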

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@SunMarc @younesbelkada @ArthurZucker

@wejoncy wejoncy marked this pull request as ready for review November 18, 2024 08:19
@SunMarc SunMarc requested a review from MekkCyber November 18, 2024 14:39
@SunMarc
Member

SunMarc commented Nov 18, 2024

cc @MekkCyber

@ArthurZucker
Collaborator

Niiiiice! 🚀

@MekkCyber
Contributor

MekkCyber commented Nov 20, 2024

Awesome PR @wejoncy, thank you for adding this amazing quantization method! The integration looks very smooth 🔥! I just left some minor comments.

)

from accelerate import init_empty_weights
from vptq import VQuantLinear
Contributor

Just for consistency with other quantization methods, the imports are not done inside replace_with_xxx_linear; you can move them outside 😁. And there is no need to check whether vptq and accelerate are available at this stage, the validate_environment method in the quantizer will take care of that.
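
For illustration, one possible layout this suggests (a sketch with assumed helper names, mirroring how other quantizer integrations guard optional imports at module level):

```python
# Sketch only: module-level imports guarded by availability helpers, so the
# replace_with_vptq_linear function itself does no importing or checking;
# validate_environment in the quantizer is what actually enforces requirements.
from transformers.utils import is_accelerate_available, is_vptq_available

if is_accelerate_available():
    from accelerate import init_empty_weights

if is_vptq_available():
    from vptq import VQuantLinear
```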

Contributor Author

Good suggestion.
Thanks

quantization_config=None,
current_key_name=None,
has_been_replaced=False,
):
Contributor

Just a question: I noticed you don't need modules_to_not_convert, so I assume you quantize all layers, even the embeddings and lm_head. Does that work even in extreme quantization?

Contributor Author

Well, we used modules_to_convert in the config JSON; it lists all the layers that should be quantized, along with the corresponding parameters.
But I added modules_to_not_convert to the function signature to allow users to set which layers should be excluded.

else:
torch_dtype = torch.float32
logger.info(
"CUDA is unavailable. Assuming VPTQ inference on CPU and loading the model in `torch.float32`. To overwrite it, set `torch_dtype` manually."
Contributor

In the overview page, you specified that vptq doesn't work on CPU, was that intended?

Contributor Author

We are working on CPU kernels, but let's mark it as unavailable on CPU for now.
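
For reference, a user can always override the dtype instead of relying on this default (a sketch; torch_dtype and device_map are standard from_pretrained kwargs, and the model id is one of the community checkpoints linked later in this thread):

```python
# Sketch: force fp16 on GPU instead of relying on the CUDA-availability default
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "VPTQ-community/Meta-Llama-3.1-8B-Instruct-v12-k65536-4096-woft",
    torch_dtype=torch.float16,
    device_map="cuda",
)
```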

raise ImportError("Using `vptq` quantization requires VPTQ: `pip install vptq`")
import vptq

assert version.Version(vptq.__version__) >= version.Version("0.0.4"), "vptq is available since 0.0.4"
Contributor

I think it would be cleaner to have something like this for vptq:

def is_vptq_available(min_version: str = VPTQ_MIN_VERSION):
    return _vptq_available and version.parse(_vptq_version) >= version.parse(min_version)

You can follow the example of accelerate in src/transformers/utils/import_utils.py.
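
For completeness, a self-contained sketch of such a check (not the actual transformers implementation, which would cache availability and version at import time like accelerate does):

```python
# Minimal, standalone sketch of the suggested availability check
import importlib.metadata
import importlib.util

from packaging import version

VPTQ_MIN_VERSION = "0.0.4"


def is_vptq_available(min_version: str = VPTQ_MIN_VERSION) -> bool:
    if importlib.util.find_spec("vptq") is None:
        return False
    return version.parse(importlib.metadata.version("vptq")) >= version.parse(min_version)
```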

Contributor Author

Sounds good. Thanks.

kwargs (`Dict[str, Any]`, *optional*):
Additional parameters from which to initialize the configuration object.
"""

Contributor

From what I understand, the config.json file will contain a vptq layer config with all the specified parameters for each Linear layer in each model layer, so for a Llama3-70B with 80 model layers and 7 Linears per layer that would be 80*7 = 560 layer configs. Is there a way to reduce that?

Contributor Author

Yes, that's exactly right. The config file contains all the per-layer parameters, and it is a bit huge.
We used this to customize different quantization parameters for different linear layers to get the best performance.
But we will try to figure out a way to allow a simplified configuration.

Contributor

Can we, for each Linear layer like q_proj or down_proj, have a list of parameter values, where each value corresponds to a layer?

Contributor Author

Hi @MekkCyber, sorry, I'm not quite following your suggestion, could you please explain a little bit more?
What we did is, for each linear layer, we give it a parameters dict, such as

model.decoder_model.layers.0.self_attn.q_proj: {"v_len":[-1, 8], .....},
model.decoder_model.layers.0.self_attn.k_proj: {"v_len":[-1, 8], .....},
.
.
.
model.decoder_model.layers.n.self_attn.q_proj: {"v_len":[-1, 8], .....},

Contributor

Hi @wejoncy, sorry I wasn't clear. I meant something like this, for example for the up_proj linears:

model.layers.mlp.up_proj: {
        "bias": [null, null, ..., null],
        "enable_norm": [true, true, false, ..., true],
        ...
        "num_centroids": [
          (-1, 65536), (-1, 65536), ...,
        ],
        ...
      },

So each property contains a list of values, and each value corresponds to a layer; the length of the lists is the number of layers, and for the i-th layer we access the bias property, for example, using model.layers.mlp.up_proj["bias"][i]

Contributor Author

Hi @MekkCyber, this is not quite doable, as we have to allow some layers to keep their original precision. But we have a "shared_layer_config", which contains an attention_layer config that is shared by all the other layers.
Here is an example: https://huggingface.co/VPTQ-community/Meta-Llama-3.1-8B-Instruct-v12-k65536-4096-woft/blob/main/config.json
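
For anyone following along, the shape of that configuration can be inspected directly from the published checkpoint (a small sketch that only prints the top-level keys, without assuming a specific schema):

```python
# Sketch: peek at the quantization_config of a published VPTQ checkpoint
import json

from huggingface_hub import hf_hub_download

path = hf_hub_download(
    "VPTQ-community/Meta-Llama-3.1-8B-Instruct-v12-k65536-4096-woft", "config.json"
)
with open(path) as f:
    quant_config = json.load(f).get("quantization_config", {})
print(list(quant_config))
```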


EXPECTED_OUTPUT = "Hello my name is Sarah and I am a 25 year old woman from the United States. I am a college graduate and I am currently working as a marketing specialist for a small"

device_map = "cuda"
Contributor

Small nit! If inference on CPU works, you can add a test for the CPU case too.

Contributor Author

I have fixed the CPU device option availability check. We will optimize the CPU kernel in the next VPTQ release.
After that I will open another PR to support CPU and add a CPU unit test.

@MekkCyber MekkCyber requested review from MekkCyber and SunMarc and removed request for MekkCyber November 20, 2024 11:12
@wejoncy
Contributor Author

wejoncy commented Nov 21, 2024

> Awesome PR @wejoncy, thank you for adding this amazing quantization method! The integration looks very smooth 🔥! I just left some minor comments.

Hi @MekkCyber, really appreciate your review of this PR, and thanks for your interest in integrating VPTQ quantization into transformers.
I have addressed your comments in the new commits, thanks again.

Please let me know if there is anything that needs to be fixed.

@SunMarc SunMarc requested a review from MekkCyber November 21, 2024 16:30
@MekkCyber
Contributor

Thanks for iterating @wejoncy, LGTM 🔥! Left only two small nits.

if not is_accelerate_available():
raise ImportError("Using `vptq` quantization requires Accelerate: `pip install accelerate`")

if not is_vptq_available():
Contributor

Just a small nit: now that is_vptq_available checks the version too, you can change the message to specify that vptq should be installed and that it requires the minimum version you want.

Contributor Author

Yes, we have VPTQ_MIN_VERSION=0.0.4 as the default.

Good suggestion. I have added the minimum version to the error message to guide users clearly.
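
For illustration, the updated guard could look roughly like this (a sketch; the exact constant location and wording in the final code may differ):

```python
# Sketch of the validate_environment checks with the minimum version surfaced to the user
from transformers.utils import is_accelerate_available, is_vptq_available

VPTQ_MIN_VERSION = "0.0.4"  # assumed default, matching the comment above

if not is_accelerate_available():
    raise ImportError("Using `vptq` quantization requires Accelerate: `pip install accelerate`")

if not is_vptq_available():
    raise ImportError(
        f"Using `vptq` quantization requires vptq>={VPTQ_MIN_VERSION}: `pip install -U vptq`"
    )
```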

if isinstance(module, VQuantLinear):
nb_vptq_linear += 1

self.assertEqual(nb_linears - 25, nb_vptq_linear)
Contributor

Just for a bit of clarity, you can specify here in a comment that 25 corresponds to the 24 decoder.layers.{layer_idx}.fc1 layers + lm_head in modules_to_not_convert.

Contributor Author

Sounds good. Thanks

@MekkCyber
Contributor

Thanks for the updates @wejoncy! Can't wait for the PR to be merged!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@SunMarc SunMarc left a comment

Thanks for the PR! Excited to see more VPTQ models on the Hub. Just a few nits. Thanks also for the nice tests and the documentation.

@SunMarc SunMarc requested a review from ArthurZucker November 22, 2024 16:01
@YangWang92
Contributor

> Thanks for the PR! Excited to see more VPTQ models on the Hub. Just a few nits. Thanks also for the nice tests and the documentation.

Thanks @SunMarc for your review! We are actively developing new models and algorithms, and we expect to release support for multimodal models (LLaMA 3.2 and Qwen2-VL) next month.

@SunMarc
Member

SunMarc commented Dec 2, 2024

friendly ping @ArthurZucker

@wejoncy
Contributor Author

wejoncy commented Dec 6, 2024

Hi @ArthurZucker,
A friendly ping. Could you please take a look at this PR? I will actively update the code if there are any suggestions.
Thanks.

@YangWang92
Contributor

Hi @ArthurZucker,
Just a polite ping to check in—may I ask if there’s anything else we need to modify? Thank you!

Best regards,
Yang

@huggingface huggingface deleted a comment from code30x58 Dec 18, 2024
Collaborator

@ArthurZucker ArthurZucker left a comment

Let's go! 🤗

Collaborator

very nice readme! 🤗

@ArthurZucker ArthurZucker merged commit 4e27a40 into huggingface:main Dec 20, 2024
26 checks passed
@ArthurZucker
Collaborator

Super sorry for the late review, I am a bit under water 😓

@wejoncy wejoncy deleted the vptq/dev branch December 20, 2024 08:48
@YangWang92
Contributor

> Super sorry for the late review, I am a bit under water 😓

Thanks so much! And thanks to everyone for their help! Special thanks to @xianbaoqian for the support—super grateful! 😊

@MekkCyber
Contributor

Hey @YangWang92, I was going through the VPTQ paper, and I wanted to make sure that I understand correctly: is VPTQ the same as GPTQ except that the quantization step is not linear but uses VQ instead?

@YangWang92
Contributor

> Hey @YangWang92, I was going through the VPTQ paper, and I wanted to make sure that I understand correctly: is VPTQ the same as GPTQ except that the quantization step is not linear but uses VQ instead?

Hi @MekkCyber,
Sorry for the late reply! I've been quite busy these days. VPTQ is indeed quite similar to GPTQ in terms of how it models the optimization problem, as both use second-order optimization methods for quantization. The key difference is that VPTQ specifically addresses the additional research questions introduced by Vector Quantization.
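
For readers skimming the thread, a toy illustration of the vector-quantization idea (this is not the VPTQ algorithm or its kernels, just the codebook-lookup concept that distinguishes VQ from uniform quantization):

```python
import torch

# Weights are grouped into short vectors; each vector is replaced by the index of its
# nearest centroid in a learned codebook, and dequantization is a simple table lookup.
num_centroids, vector_len = 256, 8
codebook = torch.randn(num_centroids, vector_len)   # learned centroid vectors
weights = torch.randn(1024, vector_len)             # original weight vectors

# assign each weight vector to its nearest centroid
# (with 256 centroids, each index fits in a single byte per group of 8 weights)
indices = torch.cdist(weights, codebook).argmin(dim=1)

# "dequantize" by looking the centroids back up
reconstructed = codebook[indices]
print(indices.shape, reconstructed.shape)  # torch.Size([1024]) torch.Size([1024, 8])
```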
