Conversation

@wejoncy
Contributor

@wejoncy wejoncy commented Nov 18, 2024

What does this PR do?

This PR adds support for Extreme Low-bit Vector Post-Training Quantization (VPTQ) to the transformers library.

VPTQ is a novel post-training quantization method that leverages Vector Quantization to achieve high accuracy on LLMs at extremely low bit-widths (<2 bits). VPTQ can compress 70B and even 405B models to 1-2 bits without retraining while maintaining high accuracy. More details here: https://github.com/microsoft/vptq
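
A minimal usage sketch of what this enables once merged (the checkpoint id below is one of the VPTQ-community models referenced later in this thread; `vptq` must be installed):

```python
# Hedged sketch: load and run a pre-quantized VPTQ checkpoint via plain from_pretrained
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "VPTQ-community/Meta-Llama-3.1-8B-Instruct-v12-k65536-4096-woft"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Hello my name is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```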

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@SunMarc @younesbelkada @ArthurZucker

@wejoncy wejoncy marked this pull request as ready for review November 18, 2024 08:19
@SunMarc SunMarc requested a review from MekkCyber November 18, 2024 14:39
@SunMarc
Member

SunMarc commented Nov 18, 2024

cc @MekkCyber

@ArthurZucker
Collaborator

Niiiiice! 🚀

@MekkCyber
Contributor

MekkCyber commented Nov 20, 2024

Awesome PR @wejoncy, thank you for adding this amazing quantization method! The integration looks very smooth 🔥! I just left some minor comments.

)

from accelerate import init_empty_weights
from vptq import VQuantLinear
Contributor

Just for consistency with other quantization methods, the imports are not done inside replace_with_xxx_linear; you can move them outside 😁. And there is no need to check whether vptq and accelerate are available at this stage, the validate_environment method in the quantizer will take care of that.
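
For illustration, one possible layout this suggests (a sketch with assumed helper names, mirroring how other quantizer integrations guard optional imports at module level):

```python
# Sketch only: module-level imports guarded by availability helpers, so the
# replace_with_vptq_linear function itself does no importing or checking;
# validate_environment in the quantizer is what actually enforces requirements.
from transformers.utils import is_accelerate_available, is_vptq_available

if is_accelerate_available():
    from accelerate import init_empty_weights

if is_vptq_available():
    from vptq import VQuantLinear
```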

Contributor Author

Good suggestion.
Thanks

quantization_config=None,
current_key_name=None,
has_been_replaced=False,
):
Contributor

Just a question: I noticed you don't need modules_to_not_convert, so I assume you quantize all layers, even the embeddings and lm_head. Does that work even in extreme quantization?

Contributor Author

Well, we used modules_to_convert in the config JSON; it lists all the layers that should be quantized, along with the corresponding parameters.
But I added modules_to_not_convert to the function signature to allow users to set which layers should be excluded.

else:
torch_dtype = torch.float32
logger.info(
"CUDA is unavailable. Assuming VPTQ inference on CPU and loading the model in `torch.float32`. To overwrite it, set `torch_dtype` manually."
Contributor

In the overview page, you specified that vptq doesn't work on CPU, was that intended?

Contributor Author

We are working on CPU kernels, but let's mark it as unavailable on CPU for now.
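
For reference, a user can always override the dtype instead of relying on this default (a sketch; torch_dtype and device_map are standard from_pretrained kwargs, and the model id is one of the community checkpoints linked later in this thread):

```python
# Sketch: force fp16 on GPU instead of relying on the CUDA-availability default
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "VPTQ-community/Meta-Llama-3.1-8B-Instruct-v12-k65536-4096-woft",
    torch_dtype=torch.float16,
    device_map="cuda",
)
```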

raise ImportError("Using `vptq` quantization requires VPTQ: `pip install vptq`")
import vptq

assert version.Version(vptq.__version__) >= version.Version("0.0.4"), "vptq is available since 0.0.4"
Contributor

I think it would be cleaner to have something like this for vptq:

def is_vptq_available(min_version: str = VPTQ_MIN_VERSION):
    return _vptq_available and version.parse(_vptq_version) >= version.parse(min_version)

You can follow the example of accelerate in src/transformers/utils/import_utils.py.
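
For completeness, a self-contained sketch of such a check (not the actual transformers implementation, which would cache availability and version at import time like accelerate does):

```python
# Minimal, standalone sketch of the suggested availability check
import importlib.metadata
import importlib.util

from packaging import version

VPTQ_MIN_VERSION = "0.0.4"


def is_vptq_available(min_version: str = VPTQ_MIN_VERSION) -> bool:
    if importlib.util.find_spec("vptq") is None:
        return False
    return version.parse(importlib.metadata.version("vptq")) >= version.parse(min_version)
```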

Contributor Author

Sounds good. Thanks.

kwargs (`Dict[str, Any]`, *optional*):
Additional parameters from which to initialize the configuration object.
"""

Contributor

From what I understand, the config.json file will contain a vptq layer config with all the specified parameters for each Linear layer in each model layer, so for a Llama3-70B with 80 model layers and 7 Linears per layer that would be 80*7 = 560 layer configs. Is there a way to reduce that?

Contributor Author

Yes, that's exactly right. The config file contains all the per-layer parameters, and it is a bit huge.
We used this to customize different quantization parameters for different linear layers to get the best performance.
But we will try to figure out a way to allow a simplified configuration.

Contributor

Can we, for each Linear layer like q_proj or down_proj, have a list of parameter values, where each value corresponds to a layer?

Contributor Author

Hi @MekkCyber, sorry, I'm not quite following your suggestion, could you please explain a little bit more?
What we did is, for each linear layer, we give it a parameters dict, such as

model.decoder_model.layers.0.self_attn.q_proj: {"v_len":[-1, 8], .....},
model.decoder_model.layers.0.self_attn.k_proj: {"v_len":[-1, 8], .....},
.
.
.
model.decoder_model.layers.n.self_attn.q_proj: {"v_len":[-1, 8], .....},

Contributor

Hi @wejoncy, sorry I wasn't clear. I meant something like this, for example for the up_proj linears:

model.layers.mlp.up_proj: {
        "bias": [null, null, ..., null],
        "enable_norm": [true, true, false, ..., true],
        ...
        "num_centroids": [
          (-1, 65536), (-1, 65536), ...,
        ],
        ...
      },

So each property contains a list of values, and each value corresponds to a layer; the length of the lists is the number of layers, and for the i-th layer we access the bias property, for example, using model.layers.mlp.up_proj["bias"][i]

Contributor Author

Hi @MekkCyber, this is not quite doable, as we have to allow some layers to keep their original precision. But we have a "shared_layer_config", which contains an attention_layer config that is shared by all the other layers.
Here is an example: https://huggingface.co/VPTQ-community/Meta-Llama-3.1-8B-Instruct-v12-k65536-4096-woft/blob/main/config.json
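
For anyone following along, the shape of that configuration can be inspected directly from the published checkpoint (a small sketch that only prints the top-level keys, without assuming a specific schema):

```python
# Sketch: peek at the quantization_config of a published VPTQ checkpoint
import json

from huggingface_hub import hf_hub_download

path = hf_hub_download(
    "VPTQ-community/Meta-Llama-3.1-8B-Instruct-v12-k65536-4096-woft", "config.json"
)
with open(path) as f:
    quant_config = json.load(f).get("quantization_config", {})
print(list(quant_config))
```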


EXPECTED_OUTPUT = "Hello my name is Sarah and I am a 25 year old woman from the United States. I am a college graduate and I am currently working as a marketing specialist for a small"

device_map = "cuda"
Contributor

Small nit! If inference on CPU works, you can add a test for the CPU case too.

Contributor Author

I have fixed the CPU device option availability check. We will optimize the CPU kernel in the next VPTQ release.
After that I will open another PR to support CPU and add a CPU unit test.

@MekkCyber MekkCyber requested review from MekkCyber and SunMarc and removed request for MekkCyber November 20, 2024 11:12
@wejoncy
Contributor Author

wejoncy commented Nov 21, 2024

> Awesome PR @wejoncy, thank you for adding this amazing quantization method! The integration looks very smooth 🔥! I just left some minor comments.

Hi @MekkCyber, really appreciate your review of this PR, and thanks for your interest in integrating VPTQ quantization into transformers.
I have addressed your comments in the new commits, thanks again.

Please let me know if there is anything that needs to be fixed.

@SunMarc SunMarc requested a review from MekkCyber November 21, 2024 16:30
@MekkCyber
Contributor

Thanks for iterating @wejoncy, LGTM 🔥! Left only two small nits.

if not is_accelerate_available():
raise ImportError("Using `vptq` quantization requires Accelerate: `pip install accelerate`")

if not is_vptq_available():
Contributor

Just a small nit: now that is_vptq_available checks the version too, you can change the message to specify that vptq should be installed and that it requires the minimum version you want.

Contributor Author

Yes, we have VPTQ_MIN_VERSION=0.0.4 as the default.

Good suggestion. I have added the minimum version to the error message to guide users clearly.
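
For illustration, the updated guard could look roughly like this (a sketch; the exact constant location and wording in the final code may differ):

```python
# Sketch of the validate_environment checks with the minimum version surfaced to the user
from transformers.utils import is_accelerate_available, is_vptq_available

VPTQ_MIN_VERSION = "0.0.4"  # assumed default, matching the comment above

if not is_accelerate_available():
    raise ImportError("Using `vptq` quantization requires Accelerate: `pip install accelerate`")

if not is_vptq_available():
    raise ImportError(
        f"Using `vptq` quantization requires vptq>={VPTQ_MIN_VERSION}: `pip install -U vptq`"
    )
```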

if isinstance(module, VQuantLinear):
nb_vptq_linear += 1

self.assertEqual(nb_linears - 25, nb_vptq_linear)
Contributor

Just for a bit of clarity, you can specify here in a comment that 25 corresponds to the 24 decoder.layers.{layer_idx}.fc1 layers + lm_head in modules_to_not_convert.

Contributor Author

Sounds good. Thanks

@MekkCyber
Contributor

Thanks for the updates @wejoncy! Can't wait for the PR to be merged!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@SunMarc SunMarc left a comment

Thanks for the PR! Excited to see more VPTQ models on the Hub. Just a few nits. Thanks also for the nice tests and the documentation.

@SunMarc SunMarc requested a review from ArthurZucker November 22, 2024 16:01
@YangWang92
Contributor

> Thanks for the PR! Excited to see more VPTQ models on the Hub. Just a few nits. Thanks also for the nice tests and the documentation.

Thanks @SunMarc for your review! We are actively developing new models and algorithms, and we expect to release support for multimodal models (LLaMA 3.2 and Qwen2-VL) next month.

@SunMarc
Member

SunMarc commented Dec 2, 2024

friendly ping @ArthurZucker

@wejoncy
Contributor Author

wejoncy commented Dec 6, 2024

Hi @ArthurZucker,
A friendly ping. Could you please take a look at this PR? I will actively update the code if there are any suggestions.
Thanks.

@YangWang92
Contributor

Hi @ArthurZucker,
Just a polite ping to check in—may I ask if there’s anything else we need to modify? Thank you!

Best regards,
Yang

@huggingface huggingface deleted a comment from code30x58 Dec 18, 2024
Collaborator

@ArthurZucker ArthurZucker left a comment

Let's go! 🤗

Collaborator

very nice readme! 🤗

@ArthurZucker ArthurZucker merged commit 4e27a40 into huggingface:main Dec 20, 2024
26 checks passed
@ArthurZucker
Collaborator

Super sorry for the late review, I am a bit under water 😓

@wejoncy wejoncy deleted the vptq/dev branch December 20, 2024 08:48
@YangWang92
Contributor

> Super sorry for the late review, I am a bit under water 😓

Thanks so much! And thanks to everyone for their help! Special thanks to @xianbaoqian for the support—super grateful! 😊

@MekkCyber
Contributor

Hey @YangWang92, I was going through the VPTQ paper, and I wanted to make sure that I understand correctly: is VPTQ the same as GPTQ except that the quantization step is not linear but uses VQ instead?

@YangWang92
Contributor

> Hey @YangWang92, I was going through the VPTQ paper, and I wanted to make sure that I understand correctly: is VPTQ the same as GPTQ except that the quantization step is not linear but uses VQ instead?

Hi @MekkCyber,
Sorry for the late reply! I've been quite busy these days. VPTQ is indeed quite similar to GPTQ in terms of how it models the optimization problem, as both use second-order optimization methods for quantization. The key difference is that VPTQ specifically addresses the additional research questions introduced by Vector Quantization.
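
For readers skimming the thread, a toy illustration of the vector-quantization idea (this is not the VPTQ algorithm or its kernels, just the codebook-lookup concept that distinguishes VQ from uniform quantization):

```python
import torch

# Weights are grouped into short vectors; each vector is replaced by the index of its
# nearest centroid in a learned codebook, and dequantization is a simple table lookup.
num_centroids, vector_len = 256, 8
codebook = torch.randn(num_centroids, vector_len)   # learned centroid vectors
weights = torch.randn(1024, vector_len)             # original weight vectors

# assign each weight vector to its nearest centroid
# (with 256 centroids, each index fits in a single byte per group of 8 weights)
indices = torch.cdist(weights, codebook).argmin(dim=1)

# "dequantize" by looking the centroids back up
reconstructed = codebook[indices]
print(indices.shape, reconstructed.shape)  # torch.Size([1024]) torch.Size([1024, 8])
```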
