example: int4 weight decompression #2193
The file name of the example, int4_weight_decompression_cmnts.cpp, doesn't seem good. What is cmnts?
Removed int4_weight_decompression_cmnts.cpp and added int4_weight_decompression.cpp.
@rupakroyintel, please make sure commits in your branch comply with contributing guidelines and do not contain merge commits. @theComputeKid, @mgouicem, looks like
// Compares the results of reference matrix multiplication and oneDNN weights
// decompression.
void compare_ref_and_weights_decompression(engine::kind engine_kind) {
It would be great to follow the structure and flow of the int8 decompression example (weights_decompression_matmul) and add information about the specifics of int4 data storage. If you remember, the case that triggered the request for this example was related to feeding prepacked weights to oneDNN and dealing with groups and zero-points.
Previously, in the Teams PyTorch channel, Dmitry provided the following detailed advice:
...
Secondly, using this group of 8 along N in the case of PT is not required. The group size says how many consecutive points of the tensor that the zero points are applied to should share a single zero-point value. It has nothing to do with how PT packs its zero points.
Thirdly, and most importantly, what matters is HOW these zero points are stored in memory. There was a recent story where an IPEX engineer tried to enable oneDNN's int4 and failed because the weights were transposed (because of that 8xPack thing); all that needed to be done was to transpose them again to match oneDNN's API. I would assume this story follows the same pattern: before calling the oneDNN API, it's highly likely those zero points should be transposed and only then passed as an int4 object to the library to get correct results.
@dzarukin What do you suggest? It seems that it's better to pass an int4 object to oneDNN rather than to prepack 8*int4 and pass an int32 object.
oneDNN developed an API to work with int4 memory objects directly. This hasn't happened in PyTorch yet; their implementation relies on pre-packing. The example should probably demonstrate how to translate the "8 int4 values packed into a single int value" representation into oneDNN terms and which memory operations are needed (transpositions and/or reorders).
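To make the "8 int4 values in a single int" idea concrete, here is a minimal sketch of unpacking one 32-bit word into eight signed 4-bit values. The helper name `unpack_s4x8` is hypothetical and not part of oneDNN or PyTorch. The sketch assumes that value i occupies bits [4*i, 4*i + 3] and that the values are signed (s4); a real packer may interleave the values differently or store them unsigned, so the order must be checked against the producer of the data.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper (not a oneDNN or PyTorch API): splits one 32-bit word
// into eight signed 4-bit values.
// Assumption: value i occupies bits [4*i, 4*i + 3] of the word.
std::array<int8_t, 8> unpack_s4x8(uint32_t packed) {
    std::array<int8_t, 8> out {};
    for (int i = 0; i < 8; ++i) {
        const uint8_t nibble = (packed >> (4 * i)) & 0xF;
        // Sign-extend the 4-bit two's complement value to 8 bits.
        out[i] = static_cast<int8_t>(nibble >= 8 ? nibble - 16 : nibble);
    }
    return out;
}
```

Once the values are separated like this, each one can be treated independently, which is the view oneDNN takes of an int4 tensor.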
Let me see what goes off in the jobs. I checked out the branch and ran locally, and it properly catches the first improper message.
@vpirogov @dzarukin We tried translating packed 8 int4 values into a single int value. However, it looks like the zero-points attribute wei:per_ocic:s4:32x8 is not supported. Here is the output from benchdnn:
@rupakroyintel, oneDNN knows nothing about the 8-int4-value packing, which is an implementation detail external to it. The zero-point group API is not designed for it. From the oneDNN perspective, you need to think about each value independently and define groups along a single dimension. The observed benchdnn output is expected.
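As an illustration of groups along a single dimension, the following reference sketch dequantizes a K x N weight matrix where every G consecutive values along K share one zero point and one scale. The function name, the row-major layouts, and the one-value-per-byte storage are assumptions made for readability; this is a plain C++ reference, not the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative reference only. w is K x N (row-major), one int4 value per
// int8_t for readability. zp and scale are (K / G) x N (row-major).
std::vector<float> dequantize_grouped(const std::vector<int8_t> &w,
        const std::vector<int8_t> &zp, const std::vector<float> &scale,
        int K, int N, int G) {
    assert(K % G == 0);
    std::vector<float> out(static_cast<std::size_t>(K) * N);
    for (int k = 0; k < K; ++k) {
        const int g = k / G; // group index along K, the only grouped dimension
        for (int n = 0; n < N; ++n)
            out[k * N + n] = (w[k * N + n] - zp[g * N + n]) * scale[g * N + n];
    }
    return out;
}
```

Note that the 8-values-per-word packing does not appear anywhere in this computation: the group size G describes only how many values share a zero point.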
@dzarukin @vpirogov @shu1chen I have pushed the latest changes. The example passed on GPU. However, it failed on CPU. I have added the verbose log for the failing case:
@rupakroyintel Decompression is not implemented for AVX2.
Since this datatype combination is unsupported on CPU, another option is to skip the test on CPU and only enable this example on GPU, similar to what is done in weights_decompression_matmul.cpp#L193-#L195:
It would also be great to add a link to this new example in the documentation (matmul.md and examples.md).
dnnl::matmul::primitive_desc matmul_pd;

// Create a MatMul primitive descriptor
matmul_pd = matmul::primitive_desc(eng, a_md, b_s4_md, c_md, attr, true);
I suggest adding a comment in the code to clarify that the last argument, `true`, is used as the `allow_empty` argument, and to explain the reason for its inclusion.
memory::desc a_md({M, K}, memory::data_type::f16,
        memory::format_tag::ab); // M x K layout
// oneDNN doesn't have a notion of format for zero-points and it's always considered as tag::ab
// In this example, we align the weights format to match the format tag::ab of the zero-points
IIRC, PyTorch already had weights in `ba`, which suits computations better. Zero-points and scales were in `ba` as well, which was the problem for them. It's probably better to show how to transpose zero-points (which I can see below in the code) and scales instead, because that touches less memory and should be more efficient in the end.
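The transposition suggested here can be sketched as a plain row-major transpose. The helper below is illustrative and works on unpacked values (one element per array entry); it assumes the zero-points or scales arrive as an N x (K / G) row-major buffer, which corresponds to `ba`, and produces the (K / G) x N buffer that corresponds to `ab`. Real int4 data stored two values per byte would need to be unpacked first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative helper: transposes a row-major rows x cols matrix into a
// row-major cols x rows matrix. Works on unpacked elements only.
template <typename T>
std::vector<T> transpose(const std::vector<T> &src, int rows, int cols) {
    assert(src.size() == static_cast<std::size_t>(rows) * cols);
    std::vector<T> dst(src.size());
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            dst[c * rows + r] = src[r * cols + c];
    return dst;
}
```

Because zero-points and scales hold one entry per group rather than one per weight, they are roughly G times smaller than the weights along K, which is why transposing them is the cheaper option.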
// Set fpmath mode with `apply_to_int=true` to apply fpmath mode behavior to
// integral primitives (in this example, matmul).
... to indicate the library should up convert int4 weights to f16.
@dzarukin Are you suggesting that I add a comment here (the library automatically up-converts int4 -> f16), or should I explicitly up-convert the int4 weights to f16?
It's just about the comment to make it more clear.
Description
oneDNN supports INT4 autoGPTQ and AWQ quantization features. This PR adds a oneDNN example that demonstrates MatMul INT4 weights decompression support and shows how to configure the APIs for the autoGPTQ and AWQ quantization features. The request originally came from the IPEX team: "AWQ (activation-aware quantization) is very popular in the community and we need to support it. We need the oneDNN INT4 GEMM API to support the below input packing approach. The weights are packed in the N direction, [K, N/8]; the zero points are packed in both K and N, [K/G, N/8]; the scales are in the K direction, [K/G, N]. The input data type of the weights and zero points is int32 and the scales are fp16."
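The shapes quoted in the request can be written down as a small sketch. The struct and function names are hypothetical; the only facts taken from the description are the three shapes and the fact that each int32 element holds 8 int4 values.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names; shapes are counted in elements of the stated data type.
struct PackedShapes {
    int64_t wei_rows, wei_cols; // int32 weights: [K, N/8]
    int64_t zp_rows, zp_cols; // int32 zero points: [K/G, N/8]
    int64_t scale_rows, scale_cols; // fp16 scales: [K/G, N]
};

// K and N are the logical weight dimensions, G is the group size along K.
PackedShapes awq_packed_shapes(int64_t K, int64_t N, int64_t G) {
    assert(N % 8 == 0); // 8 int4 values per int32 along N
    assert(K % G == 0); // whole groups along K
    return {K, N / 8, K / G, N / 8, K / G, N};
}
```

For example, with K = N = 4096 and G = 128, the weights occupy a 4096 x 512 int32 buffer, the zero points a 32 x 512 int32 buffer, and the scales a 32 x 4096 fp16 buffer.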
Checklist
General
`make test` and `make test_benchdnn_*` pass locally for each commit?
Performance improvements
New features
Bug fixes
RFC PR