example: int4 weight decompression #2193

rupakroyintel · 2024-10-29T18:27:59Z

Description

oneDNN supports INT4 autoGPTQ and AWQ quantization features. This PR adds an example that demonstrates MatMul INT4 weights decompression and shows how to configure the APIs for autoGPTQ and AWQ quantization. The request originally came from the IPEX team: "AWQ (activation-aware quantization) is very popular in the community and we need to support it. We need the oneDNN INT4 GEMM API to support the below input packing approach. The weights are packed in the N direction, [K, N/8]; zero points are packed in both K and N, [K/G, N/8]; scales are in the K direction, [K/G, N]. The input data type of weights and zero points is int32, and that of scales is fp16."

Checklist

General

[✔ ] Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
[✔ ] Have you formatted the code using clang-format?

Performance improvements

Have you submitted performance data that demonstrates performance improvements? Not yet

New features

Have you published an RFC for the new feature? No
Was the RFC approved? N/A
Have you added relevant tests? N/A

Bug fixes

Have you included information on how to reproduce the issue (either in a github issue or in this PR)?
Have you added relevant regression tests?

RFC PR

Does RFC document follow the template?
Have you added a link to the rendered document?

examples/tutorials/matmul/int4_weight_decompression_cmnts.cpp

shu1chen · 2024-10-30T02:14:52Z

The file name of the example int4_weight_decompression_cmnts.cpp doesn't seem good. What is cmnts?

rupakroyintel · 2024-10-30T04:52:32Z

Removed int4_weight_decompression_cmnts.cpp and added int4_weight_decompression.cpp.

vpirogov · 2024-10-31T23:50:17Z

@rupakroyintel, please make sure commits in your branch comply with contributing guidelines and do not contain merge commits.

@theComputeKid, @mgouicem, looks like PR Checks / title does not catch issue with commit history...

examples/tutorials/matmul/int4_weight_decompression.cpp

vpirogov · 2024-10-31T23:59:58Z

examples/tutorials/matmul/int4_weight_decompression.cpp

+
+// Compares the results of reference matrix multiplication and oneDNN weights
+// decompression.
+void compare_ref_and_weights_decompression(engine::kind engine_kind) {


It would be great to follow the structure and flow of the int8 decompression example (weights_decompression_matmul) and add information about the specifics of int4 data storage. If you remember, the case that triggered the request for this example was related to feeding prepacked weights to oneDNN and dealing with groups and zero-points.

shu1chen · 2024-11-05

Previously in the Teams PyTorch channel, Dmitry provided the following detailed advice:

...
Secondly, using this group of 8 along N in case of PT is not required. Group size says how many consecutive points of the tensor zero points are applied to should share a single zero point value. It has nothing to do with how PT pack their zero points.

Thirdly, it is the most important HOW these zero points are stored in memory. There was a recent story where IPEX engineer tried to enable oneDNN's int4 and failed to do so because weights were transposed (because of that 8xPack thing), and everything what should have been done was to transpose them again to match oneDNN's API. I would assume this story should follow the same pattern - before calling oneDNN API, it's highly likely those zero points should be transposed and only then passed as an int4 object inside the library to get correct results.

@dzarukin What do you suggest? It seems that it's better to pass an int4 object to oneDNN rather than to prepack 8*int4 and pass an int32 object.

dzarukin · 2024-11-05

oneDNN developed an API to work with int4 memory objects directly. This hasn't happened in PyTorch yet; their implementation relies on pre-packing. The example should probably demonstrate how to translate the "eight int4 values packed into a single int value" representation into oneDNN terms, and what memory operations are required (necessary transpositions and/or reorders).
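As a rough illustration of the translation step discussed above, the following sketch expands 32-bit words that each hold eight int4 values into one value per element. The helper names `extract_s4` and `unpack_int32_to_s4` are hypothetical (they are not oneDNN or PyTorch APIs), and the bit order is an assumption: value i is taken from bits [4*i, 4*i + 3] and treated as signed. A framework may use a different order, such as an interleaved one, or unsigned values.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a oneDNN or PyTorch API): returns the i-th signed
// 4-bit value of a 32-bit word. Assumes value i occupies bits [4*i, 4*i + 3].
inline int8_t extract_s4(uint32_t word, int i) {
    const int nibble = static_cast<int>((word >> (4 * i)) & 0xFu);
    // Sign-extend: nibbles 8..15 represent -8..-1.
    return static_cast<int8_t>(nibble >= 8 ? nibble - 16 : nibble);
}

// Hypothetical helper: expands a row-major [K, N/8] matrix of packed words
// into a row-major [K, N] matrix with one int4 value per int8_t element.
// Assumes N is a multiple of 8.
inline std::vector<int8_t> unpack_int32_to_s4(
        const std::vector<uint32_t> &packed, int K, int N) {
    std::vector<int8_t> out(static_cast<std::size_t>(K) * N);
    for (int k = 0; k < K; ++k) {
        for (int n = 0; n < N; ++n) {
            const std::size_t word_idx
                    = static_cast<std::size_t>(k) * (N / 8) + n / 8;
            out[static_cast<std::size_t>(k) * N + n]
                    = extract_s4(packed[word_idx], n % 8);
        }
    }
    return out;
}
```

Once the values are addressable individually, any transposition or repacking into the layout the library expects can be reasoned about per element rather than per 32-bit word.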

mgouicem · 2024-11-04T14:49:06Z

@theComputeKid, @mgouicem, looks like PR Checks / title does not catch issue with commit history...

Let me see what goes wrong in the jobs. I checked out the branch and ran the check locally, and it properly catches the first improper message.

> git remote add rupakroy https://github.com/rupakroyintel/oneDNN.git
> git fetch rupakroy
> git co add_int4_decompression_example
Updating files: 100% (776/776), done.
branch 'add_int4_decompression_example' set up to track 'rupakroy/add_int4_decompression_example'.
Switched to a new branch 'add_int4_decompression_example'
> python3 ./.github/automation/commit-msg-check.py "1abe160095ef52c7ad879b75331dbe4b4e17be6d" "1fe8ee54b18c764d32932d21e776a86f46a6d0cf"
msg: Merge branch 'add_int4_decompression_example' of https://github.com/rupakroyintel/oneDNN into add_int4_decompression_example
Traceback (most recent call last):
  File "./.github/automation/commit-msg-check.py", line 82, in <module>
    main()
  File "./.github/automation/commit-msg-check.py", line 77, in main
    __numCharacterCheck(commit_msg)
  File "./.github/automation/commit-msg-check.py", line 58, in __numCharacterCheck
    raise ValueError(
ValueError: Please see contribution guidelines. Message summary must be less than 72. Got: 124

rupakroyintel · 2024-11-22T20:44:21Z

@vpirogov @dzarukin We tried translating packed 8 int4 values into a single int value. However, it looks like the zero-points attribute wei:per_ocic:s4:32x8 is not supported. Here is the output from benchdnn:

./tests/benchdnn/benchdnn --matmul --engine=gpu --dt=f16:s4:f16 --stag=any --wtag=abc --dtag=acb --attr-scales=wei:per_ocic:f16:32x1 --attr-zero-points=wei:per_ocic:s4:32x8 --attr-fpmath=f16:true 7x24x32:7x32x64
Error: Function 'check_dnnl_status' at (/home/intel/rroy/int4_decompression/oneDNN/tests/benchdnn/dnnl_common.hpp:327) returned 'unimplemented'
Error: Function 'create_primitive' at (/home/intel/rroy/int4_decompression/oneDNN/tests/benchdnn/dnnl_common.hpp:401) returned '1'
Error: Function 'init_prim' at (/home/intel/rroy/int4_decompression/oneDNN/tests/benchdnn/dnnl_common.hpp:471) returned '1'
Error: Function 'createit' at (/home/intel/rroy/int4_decompression/oneDNN/tests/benchdnn/matmul/matmul.cpp:881) returned '1'
Error: Function 'create' at (/home/intel/rroy/int4_decompression/oneDNN/tests/benchdnn/utils/task.hpp:49) returned '1'
0:UNIMPLEMENTED __REPRO: --matmul --engine=gpu --dt=f16:s4:f16 --wtag=abc --dtag=acb --attr-scales=wei:per_ocic:f16:32x1 --attr-zero-points=wei:per_ocic:s4:32x8 --attr-fpmath=f16:true 7x24x32:7x32x64
tests:1 passed:0 skipped:0 mistrusted:0 unimplemented:1 invalid_arguments:0 failed:1 listed:0
total: 0.05s; fill: 0.00s (0%); compute_ref: 0.00s (0%); compare: 0.00s (0%);

dzarukin · 2024-11-22T23:23:13Z

@rupakroyintel, oneDNN has no knowledge of the external implementation detail of packing eight int4 values into one word. The zero-point group API is not designed for it. From the oneDNN perspective, you need to think about each value independently and use a single dimension in groups. The observed benchdnn output is expected.
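To make the "each value independently, groups along a single dimension" point concrete, here is a minimal reference sketch of grouped dequantization. It is not oneDNN code: `dequantize_grouped` is a hypothetical name, `float` stands in for fp16 scales, and the formula `scale * (w - zero_point)` with groups applied along K only is the assumption being illustrated.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference routine (not a oneDNN API): dequantizes a row-major
// [K, N] matrix of int4 values (one per int8_t) using zero points and scales
// that are grouped along K only. G_zp and G_sc are the group sizes along K;
// K is assumed to be divisible by both.
inline std::vector<float> dequantize_grouped(const std::vector<int8_t> &w,
        const std::vector<int8_t> &zp, // row-major [K / G_zp, N]
        const std::vector<float> &scales, // row-major [K / G_sc, N]
        int K, int N, int G_zp, int G_sc) {
    std::vector<float> out(w.size());
    for (int k = 0; k < K; ++k) {
        for (int n = 0; n < N; ++n) {
            const std::size_t idx = static_cast<std::size_t>(k) * N + n;
            // G consecutive points along K share one zero point / one scale.
            const int z = zp[static_cast<std::size_t>(k / G_zp) * N + n];
            const float s = scales[static_cast<std::size_t>(k / G_sc) * N + n];
            out[idx] = s * static_cast<float>(w[idx] - z);
        }
    }
    return out;
}
```

Note that nothing here groups eight values along N: the packing factor of the storage format does not appear in the quantization arithmetic at all.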

rupakroyintel · 2025-03-12T02:01:54Z

@dzarukin @vpirogov @shu1chen I have pushed the latest changes. The example passed on GPU. However, it failed on CPU. I have added the verbose log for the failing case:
ONEDNN_VERBOSE=all ./tutorials-matmul-int4-weight-decompression-cpp
...
onednn_verbose,v1,primitive,create:dispatch,matmul,cpu,matmul,brg_matmul:avx10_1_512_amx_fp16,undef,src:f16::blocked:ab::f0 wei:s4::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-fpmath:f16:true attr-scales:wei:3:f16:48x1 attr-zero-points:wei:3:s4:24x1,,100x96:96x1000,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:114
onednn_verbose,v1,primitive,create:dispatch,matmul,cpu,matmul,brg_matmul:avx10_1_512_amx,undef,src:f16::blocked:ab::f0 wei:s4::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-fpmath:f16:true attr-scales:wei:3:f16:48x1 attr-zero-points:wei:3:s4:24x1,,100x96:96x1000,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:114
onednn_verbose,v1,primitive,create:dispatch,matmul,cpu,matmul,brg_matmul:avx10_1_512,undef,src:f16::blocked:ab::f0 wei:s4::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-fpmath:f16:true attr-scales:wei:3:f16:48x1 attr-zero-points:wei:3:s4:24x1,,100x96:96x1000,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:114
onednn_verbose,v1,primitive,create:dispatch,matmul,cpu,matmul,brg_matmul:avx512_core_bf16,undef,src:f16::blocked:ab::f0 wei:s4::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-fpmath:f16:true attr-scales:wei:3:f16:48x1 attr-zero-points:wei:3:s4:24x1,,100x96:96x1000,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:114
onednn_verbose,v1,primitive,create:dispatch,matmul,cpu,matmul,brg_matmul:avx512_core_vnni,undef,src:f16::blocked:ab::f0 wei:s4::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-fpmath:f16:true attr-scales:wei:3:f16:48x1 attr-zero-points:wei:3:s4:24x1,,100x96:96x1000,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:114
onednn_verbose,v1,primitive,create:dispatch,matmul,cpu,matmul,brg_matmul:avx512_core,undef,src:f16::blocked:ab::f0 wei:s4::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-fpmath:f16:true attr-scales:wei:3:f16:48x1 attr-zero-points:wei:3:s4:24x1,,100x96:96x1000,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:114
onednn_verbose,v1,primitive,create:dispatch,matmul,cpu,matmul,brg_matmul:avx2_vnni_2,undef,src:f16::blocked:ab::f0 wei:s4::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-fpmath:f16:true attr-scales:wei:3:f16:48x1 attr-zero-points:wei:3:s4:24x1,,100x96:96x1000,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:114
onednn_verbose,v1,primitive,create:dispatch,matmul,cpu,matmul,brg_matmul:avx2_vnni,undef,src:f16::blocked:ab::f0 wei:s4::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-fpmath:f16:true attr-scales:wei:3:f16:48x1 attr-zero-points:wei:3:s4:24x1,,100x96:96x1000,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:115
onednn_verbose,v1,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit:f32,undef,src:f16::blocked:ab::f0 wei:s4::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-fpmath:f16:true attr-scales:wei:3:f16:48x1 attr-zero-points:wei:3:s4:24x1,,100x96:96x1000,unsupported datatype combination,src/cpu/matmul/gemm_f32_matmul.cpp:93
onednn_verbose,v1,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit:bf16,undef,src:f16::blocked:ab::f0 wei:s4::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-fpmath:f16:true attr-scales:wei:3:f16:48x1 attr-zero-points:wei:3:s4:24x1,,100x96:96x1000,unsupported datatype combination,src/cpu/matmul/gemm_bf16_matmul.cpp:63
onednn_verbose,v1,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit:bf16,undef,src:f16::blocked:ab::f0 wei:s4::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-fpmath:f16:true attr-scales:wei:3:f16:48x1 attr-zero-points:wei:3:s4:24x1,,100x96:96x1000,unsupported datatype combination,src/cpu/matmul/gemm_bf16_matmul.cpp:63
onednn_verbose,v1,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit,undef,src:f16::blocked:ab::f0 wei:s4::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-fpmath:f16:true attr-scales:wei:3:f16:48x1 attr-zero-points:wei:3:s4:24x1,,100x96:96x1000,unsupported datatype combination,src/cpu/matmul/gemm_x8s8s32x_matmul.cpp:110
onednn_verbose,v1,primitive,create:dispatch,matmul,cpu,matmul,brg_matmul:avx2,undef,src:f16::blocked:ab::f0 wei:s4::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-fpmath:f16:true attr-scales:wei:3:f16:48x1 attr-zero-points:wei:3:s4:24x1,,100x96:96x1000,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:115
oneDNN error caught:
Status: unimplemented
Message: could not create a primitive descriptor for the matmul primitive. Run workload with environment variable ONEDNN_VERBOSE=all to get additional diagnostic information.
Example failed on CPU.

dzarukin · 2025-03-12T02:18:58Z

@rupakroyintel Decompression is not implemented for AVX2.
I suggest passing the allow_empty argument to the primitive descriptor constructor and checking whether the resulting descriptor is empty; if it is, finish the example successfully with an "unsupported" message.

shu1chen · 2025-03-12T02:54:52Z

Since this datatype combination is unsupported on CPU, another option is to skip the test on CPU and only enable this example on GPU, similar to what is done in weights_decompression_matmul.cpp#L193-#L195:

// CPU is not supported
if (engine_kind != engine::kind::gpu) return 0;

It would also be great to add the link to this new example in the documentation matmul.md, examples.md.

shu1chen · 2025-03-13T01:59:43Z

examples/tutorials/matmul/int4_weight_decompression.cpp

+    dnnl::matmul::primitive_desc matmul_pd;
+
+    // Create a MatMul primitive descriptor
+    matmul_pd = matmul::primitive_desc(eng, a_md, b_s4_md, c_md, attr, true);


I suggest adding a code comment to clarify that the last argument, true, is the allow_empty flag, and to explain why it is passed.

examples/tutorials/matmul/int4_weight_decompression.cpp

rupakroyintel · 2025-03-13T14:35:02Z

@vpirogov @dzarukin @shu1chen I have made the changes based on the reviews. Can you please check and approve the changes?

dzarukin · 2025-03-14T18:29:55Z

examples/tutorials/matmul/int4_weight_decompression.cpp

+    memory::desc a_md({M, K}, memory::data_type::f16,
+            memory::format_tag::ab); // M x K layout
+    // oneDNN doesn't have a notion of format for zero-points and it's always considered as tag::ab
+    // In this example, we align the weights format to match the format tag::ab of the zero-points


IIRC, PyTorch already had weights in ba which suits computations better. And zero-points and scales were in ba as well which was the problem for them. It's probably better to show how to transpose zero-points (which I can see below in the code) and scales instead because it's less memory to touch and should be more effective at the end.
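The transposition of int4 zero-points mentioned above could look roughly like the sketch below. It is only an illustration: `get_nibble`, `set_nibble`, and `transpose_s4` are hypothetical helpers, and the storage convention (two values per byte, even logical index in the low nibble) is an assumption that should be checked against the actual buffers. For zero-points going from ba to ab, `rows` would be N and `cols` would be K/G.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Reads the 4-bit field at logical index idx from a two-per-byte buffer.
// Assumption: even indices live in the low nibble, odd ones in the high nibble.
inline uint8_t get_nibble(const std::vector<uint8_t> &buf, std::size_t idx) {
    const uint8_t byte = buf[idx / 2];
    return static_cast<uint8_t>((idx % 2 == 0) ? (byte & 0x0F) : (byte >> 4));
}

// Writes the low 4 bits of v at logical index idx, keeping the other nibble.
inline void set_nibble(std::vector<uint8_t> &buf, std::size_t idx, uint8_t v) {
    uint8_t &byte = buf[idx / 2];
    if (idx % 2 == 0)
        byte = static_cast<uint8_t>((byte & 0xF0) | (v & 0x0F));
    else
        byte = static_cast<uint8_t>((byte & 0x0F) | ((v & 0x0F) << 4));
}

// Hypothetical helper: transposes a row-major [rows, cols] matrix of 4-bit
// values into a row-major [cols, rows] matrix, both stored two per byte.
inline std::vector<uint8_t> transpose_s4(
        const std::vector<uint8_t> &src, std::size_t rows, std::size_t cols) {
    std::vector<uint8_t> dst((rows * cols + 1) / 2, 0);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            set_nibble(dst, c * rows + r, get_nibble(src, r * cols + c));
    return dst;
}
```

Because zero-points and scales have K/G rows instead of K, transposing them touches far fewer bytes than transposing the weights, which is the efficiency argument made in the comment.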

dzarukin · 2025-03-14T18:33:22Z

examples/tutorials/matmul/int4_weight_decompression.cpp

+    // Set fpmath mode with `apply_to_int=true` to apply fpmath mode behavior to
+    // integral primitives (in this example, matmul).


... to indicate the library should up convert int4 weights to f16.

rupakroyintel · 2025-03-17

@dzarukin Are you suggesting adding a comment here (that the library automatically up-converts int4 to f16), or should I explicitly up-convert the int4 weights to f16?

dzarukin · 2025-03-17

It's just about the comment, to make it clearer.

rupakroyintel requested a review from a team as a code owner October 29, 2024 18:28

shu1chen reviewed Oct 30, 2024

shu1chen reviewed Oct 30, 2024

rupakroyintel changed the title ~~Add int4 decompression example~~ example: int4 weight decompression Oct 31, 2024

vpirogov reviewed Oct 31, 2024

vpirogov reviewed Oct 31, 2024

rupakroyintel force-pushed the add_int4_decompression_example branch from 1abe160 to 41ddb5f Compare March 10, 2025 01:52

github-actions bot added the component:examples label Mar 10, 2025

rupakroyintel force-pushed the add_int4_decompression_example branch 5 times, most recently from 5ebab41 to 658ac77 Compare March 11, 2025 03:06

rupakroyintel force-pushed the add_int4_decompression_example branch 2 times, most recently from ad69b5a to 1bfd3a6 Compare March 12, 2025 20:07

shu1chen reviewed Mar 13, 2025

rupakroyintel requested review from a team as code owners March 13, 2025 03:44

github-actions bot added the documentation A request to change/fix/improve the documentation. Codeowner: @oneapi-src/onednn-doc label Mar 13, 2025

rupakroyintel force-pushed the add_int4_decompression_example branch from e934315 to 7df0915 Compare March 13, 2025 06:41

rupakroyintel requested review from dzarukin and shu1chen March 13, 2025 19:39

shu1chen approved these changes Mar 14, 2025

rupakroyintel requested a review from vpirogov March 14, 2025 05:19

dzarukin reviewed Mar 14, 2025

rupakroyintel force-pushed the add_int4_decompression_example branch from 7df0915 to f29c7d0 Compare March 17, 2025 18:55

gpu: example: add example for int4 weight decompression

f29c7d0
