src: cpu: aarch64: lowp_matmul_sq: Make weights constant #2212

fadara01 · 2024-11-08T16:08:55Z

Description

Setting the weights as constant allows us to avoid redundant pretranspose operations in Arm Compute Library (ACL) every time execute is called (they are now run once and cached). This delivers big speedups, especially for relatively small matmuls.
Note that this is a temporary fix that needs to be handled carefully by primitive caches in frameworks, since the ACL object now holds more state; i.e. we want to make sure that the cache maps a layer with a specific set of weights to the oneDNN primitive storing those weights.
We're currently working on the proper fix for this, which involves making lowp_gemm stateless and fixed-format in ACL and oneDNN.
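
To make the "run once and cached" behaviour concrete, here is a minimal sketch of the idea. It is not the ACL or oneDNN code: `ToyMatmul`, its `weights_constant` flag and `transpose_count()` are hypothetical names invented for illustration, and the weights are assumed to be stored as an N x K row-major array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy matmul, NOT the ACL or oneDNN implementation.
// Computes dst[M x N] = src[M x K] * wei^T, with wei stored as N x K.
class ToyMatmul {
public:
    ToyMatmul(std::size_t n, std::size_t k, bool weights_constant)
        : n_(n), k_(k), weights_constant_(weights_constant) {}

    std::vector<float> execute(const std::vector<float> &src,
            const std::vector<float> &wei, std::size_t m) {
        // Without the constant flag the transpose is redone on every call.
        // With it, the transpose runs on the first call only and is reused.
        if (!weights_constant_ || !prepared_) {
            transpose(wei);
            prepared_ = true;
        }
        std::vector<float> dst(m * n_, 0.0f);
        for (std::size_t i = 0; i < m; ++i)
            for (std::size_t p = 0; p < k_; ++p)
                for (std::size_t j = 0; j < n_; ++j)
                    dst[i * n_ + j] += src[i * k_ + p] * wei_t_[p * n_ + j];
        return dst;
    }

    // Number of times the weights were transposed so far.
    int transpose_count() const { return transpose_count_; }

private:
    void transpose(const std::vector<float> &wei) {
        wei_t_.assign(k_ * n_, 0.0f);
        for (std::size_t j = 0; j < n_; ++j)
            for (std::size_t p = 0; p < k_; ++p)
                wei_t_[p * n_ + j] = wei[j * k_ + p];
        ++transpose_count_;
    }

    std::size_t n_, k_;
    bool weights_constant_;
    bool prepared_ = false;
    int transpose_count_ = 0;
    std::vector<float> wei_t_; // cached K x N copy of the weights
};
```

Calling `execute` twice on an object built with `weights_constant = true` transposes once; with `false` it transposes twice. The cached copy is also the extra state the description warns about: the object is now tied to the weights it first saw.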

Fixes # (github issue)

Checklist

General

Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
Have you formatted the code using clang-format?

Performance improvements

Have you submitted performance data that demonstrates performance improvements?

New features

Have you published an RFC for the new feature?
Was the RFC approved?
Have you added relevant tests?

Bug fixes

Have you included information on how to reproduce the issue (either in a github issue or in this PR)?
Have you added relevant regression tests?

RFC PR

Does RFC document follow the template?
Have you added a link to the rendered document?

mgouicem · 2024-11-08T16:23:42Z

src/cpu/aarch64/matmul/acl_lowp_matmul_sq.hpp

+                    arm_compute::TensorShape(N(), K()), 1,
+                    acl_utils::get_acl_data_t(wei_d.data_type(), true),
+                    arm_compute::QuantizationInfo(1.0, 0, true));
+            almc_.wei_tensor_info.set_are_values_constant(true);


Hi @fadara01,
What does constant mean here? Is it about constant values within the tensor or just that the shape is constant?
If it is about constant values in weights: primitive creation does not seem to take the weights, so does it mean the primitive state changes upon first execution?

mgouicem · 2024-11-08T16:28:35Z

src/cpu/aarch64/matmul/acl_lowp_matmul_sq.hpp

+    }
+
+    status_t execute(const exec_ctx_t &ctx) const {
+        std::lock_guard<std::mutex> _lock {this->mtx};


Is the mutex protection only about the quantization parameters that are reset, or is there something else due to constant weights?
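
For context on the pattern being asked about, the sketch below shows a `const` execute method that takes a lock because it writes per-call parameters into state shared by every caller. This is only an assumption about why such a lock might be needed; `ToyPrimitive` and `scratch_scale_` are hypothetical and do not come from the patch.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical primitive-like object, NOT the class from the diff.
class ToyPrimitive {
public:
    // execute() is const, but it writes a per-call parameter into a
    // member shared by all callers, so the whole call is serialized.
    int execute(int src_scale) const {
        std::lock_guard<std::mutex> lock {mtx_};
        scratch_scale_ = src_scale; // per-call value stored on the object
        return scratch_scale_ * 2; // must read the value this call wrote
    }

private:
    mutable std::mutex mtx_; // mutable so a const method can lock it
    mutable int scratch_scale_ = 0;
};
```

Without the lock, two threads calling `execute` on the same object could interleave the write and the read, so one call could compute with the other call's parameter. Any additional state cached on first execution would need the same protection.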

mgouicem

One general note here, oneDNN primitive API has no concept of constant memory object, so such an optimization is unsafe (there is no guarantee from primitive implementation perspective that weights will not change between execute calls).

oneDNN Graph API has that concept though, and it might make sense to add it to primitive API if it helps. But I would hold off committing this patch unless we extend API with constant memory objects.
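
The hazard described in this note can be sketched with a toy object that caches its weights on the first call. This is an illustration only: `CachingDot` is a hypothetical stand-in, not oneDNN or ACL code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a primitive that packs its weights once.
struct CachingDot {
    std::vector<int> packed; // copy of the weights made on first execute
    bool prepared = false;

    int execute(const std::vector<int> &src, const std::vector<int> &wei) {
        if (!prepared) {
            packed = wei; // later changes to wei are never seen
            prepared = true;
        }
        int acc = 0;
        for (std::size_t i = 0; i < src.size(); ++i)
            acc += src[i] * packed[i];
        return acc;
    }
};
```

If the caller passes different weights on a second `execute`, the object silently keeps using the first set. Nothing in the call signature tells the caller that the weights must stay fixed, which is the gap a constant memory object in the API would close.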


TaoLv and others added 27 commits January 8, 2025 23:15

examples: graph: int4 gated mlp

369312f

benchdnn: inputs: graph: add gated mlp int4 case

49931b1

graph: backend: dnnl: pattern: mlp: restriction for inputs

a1f8b40

benchdnn: inputs: graph: fix batch file test_graph_all

6a980c4

xe: sdpa: Add support for causal mask in micro_sdpa

351237a

xe: sdpa: Create a tile op to perform the mask operation

96d9b07

xe: sdpa: Update copyright to 2025

b743c3e

x64: brgemm: common base class for all brgemm jit kernels

4d5816d

x64: brgemm: support arbitrary K for AMX

38f63f8

x64: brgemm 1x1 conv: support arbitrary ic without rtus

6e62e6c

x64: brgemm_matmul_copy_utils: support arbitrary padding

968ea24

x64: brgemm matmul: support arbitrary K on AMX

bc0ce23

benchdnn: binary: update filling

eba1ce0

fixup: x64: conv: brdgmm: enable zps per group

20254fc

xehpg: jit: gemm: updated int4/8 compressed weights strategies

8c4f815

xe: jit: codegen: add 64-bit immediate ecmp

2dc3440

xe: jit: ir: improve assert type issue message

599e9fa

xe: jit: ir: enable printing s64 type

0fc6e4d

xe: jit: conv: restrict usage to integer dimension sizes

3a5e23e

xe: jit: v2: conv: disable large buffer support

77dfda9

xe: ocl: enable large buffer offset support for ref_convolution

2b28d05

xe: ocl: disable large buffer support for wino convolution

66f51c8

xe: jit: enable emul(qw, qw, dw)

52c6803

xe: jit: gemm: disable large buffer jit:xe_hp implementation support

1215eb5

xe: jit: force A64

dae0e59

xe: jit: gemm: use 64-bit arithmetic for ld offsets

731804f

xe: ocl: gemm: fix ref_gemm types

f0bb3e2

echeresh and others added 30 commits January 17, 2025 16:32

xe: conv_v2: enable Stream-K kernels

5c3788a

xe: conv_v2: remove hw from kernel descriptor

6133246

xe: conv_v2: planner: fix bwd_w with bias

76d4e4e

common, xe: add API to set kernel/primitive cache capacity separately

2d136da

tests: benchdnn: add support for graph execution

c4e4c31

Currently limited to SYCL with L0 backend.

xe: jit: gemm: fix signedness in arg to ilog2

5502912

xe: jit: pooling: correct dim idx assertions

35a5f25

ngen: mark const guard variable as such

5c98ed0

xe: jit: gemm: add dynamic quantization kernel for 2nd token

1d4690a

xe: jit: gemm: move LSC instruction type override before post ops

03a215c

xe: jit: gemm: add N=32 dynamic quantization kernel

eb54552

xe: jit: gemm: use better performance model for [FO]OS Xe2 kernels

1d821ef

cpu: x64: brgemm_conv: move get_{o,k}w_range functions to utils

e9d44f3

cpu: x64: brgconv: improve get_brgemm_ur exec time

611f2fe

The function runs over every M, creating a full brg descriptor for each one. For large W this causes a significant slowdown. Restricting the set to only the relevant M values improves the execution time.

graph: backend: gc: fix itt include path

4cd2799

gtests: graph: unit: gc: fix gtest include path

fb9592e

graph: backend: gc: fix brg variable name

ada0370

graph: utils: pm: verify iterator after find

c1e04db

graph: backend: dnnl: fix type mismatch

0cb6ddf

graph: utils: pm: use const reference to avoid copy

0715445

benchdnn: graph: custom driver: fix uninitialized variables

51d3b4c

benchdnn: graph: parser: check return value of parse_dt

7e82464

cpu: x64: use plain layout for matmul-based IP for forward_training

5bba3d8

cpu: x64: enable matmul-based IP for bwd_d

26a5387

cpu: x64: enable matmul-based IP for bwd_wb

22037c4

src: cpu: aarch64: Enable matmul static quantisation.

849da8e

src: cpu: aarch64: Enable convolution static quantisation.

507bb69

Rebase

fdad2ea
