

@joeldushouyu
Contributor

Support GELU operation for ggml-hexagon.

@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Dec 11, 2025
@joeldushouyu
Contributor Author

Both commit 2a787a6 and commit 83412e0 are fully functional GELU implementations that pass the official ggml tests when run with:

HB=0  ./scripts/snapdragon/adb/run-tool.sh test-backend-ops -b HTP0 -o GELU

The GELU code in commit 2a787a6 is a straightforward sigmoid-based GELU implementation, while commit 83412e0 is a 7th-order polynomial approximation of GELU generated by the qhcg tool from the Hexagon SDK, with some modifications.
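For reference, the sigmoid formulation approximates GELU(x) as x · sigmoid(1.702·x). A minimal scalar C sketch of that approach (illustrative only, not the HVX kernel code):

```c
#include <math.h>

// Sigmoid-based GELU approximation: GELU(x) ~= x * sigmoid(1.702 * x).
// The HVX kernel vectorizes this; the polynomial variant instead
// evaluates a fitted 7th-order polynomial over a bounded input range.
static inline float gelu_sigmoid_ref(float x) {
    return x / (1.0f + expf(-1.702f * x));
}
```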

When running on an input of size 4096×4304 (the non-linear input data size for the Gemma 3 vision model), I saw a significant performance difference between the two implementations.

| data size | GELU-sigmoid | GELU-polynomial approximation |
|---|---|---|
| 4096×4304 (unaligned address) | 11772 µs | 6531 µs |
| 4096×4096 (aligned address) | 4988 µs | 4799 µs |

The data above were collected on a Samsung Galaxy S25 Ultra using the test repo I wrote (currently on the refactor-dev branch).

For the timings above, I recorded the longest time among the 6 threads, as printed to the FARF log.

In addition, when plotting the polynomial approximation with the plot script I wrote in my test repo, I saw little if any error between the polynomial-approximation version and the CPU reference.

*(plots: polynomial approximation vs. CPU reference)*

@joeldushouyu
Contributor Author

joeldushouyu commented Dec 11, 2025

After revisiting the polynomial-approximation implementation, I noticed a block-prefetch operation in the code, which I have commented out in commit 7233999 for a fair comparison. With that in mind, here are the updated results:

| data size | GELU-sigmoid | GELU-polynomial approximation |
|---|---|---|
| 4096×4304 (unaligned) | 11833 µs | 8680 µs |
| 4096×4096 (aligned) | 5006 µs | 7990 µs |

From the new testbench runs, a few things stand out:

  1. The 7th-order polynomial path is actually slower computationally than the sigmoid-based GELU.
  2. The current L2-prefetch logic used in the regular SiLU, GELU, and SwiGLU kernels appears to be either underutilized or triggered too late. This likely explains why the polynomial approximation, with its more aggressive L2 prefetching, outperforms the sigmoid-GELU in the unaligned case (see the prefetch sketch after this list).
  3. The unaligned version of the sigmoid-GELU could potentially benefit from adopting the same unaligned-load strategy used in the polynomial-approximation path.
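For context, Hexagon's l2fetch instruction prefetches a 2-D block of memory into the L2 cache. A minimal sketch of the kind of block prefetch under discussion (the helper name is illustrative, not the actual kernel wrapper):

```c
#include <stdint.h>

// Prefetch a 2-D block into L2: `width` bytes per row, `height` rows,
// `stride` bytes between row starts. The 32-bit l2fetch operand packs
// stride[31:16], width[15:8], height[7:0]. (Illustrative helper only.)
static inline void l2fetch_block(const void *addr, uint32_t stride,
                                 uint32_t width, uint32_t height) {
    uint32_t param = (stride << 16) | (width << 8) | height;
    __asm__ __volatile__(" l2fetch(%0,%1) " : : "r"(addr), "r"(param));
}
```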

For the remainder of this PR, I plan to:

  • refine the L2-prefetching strategy in sigmoid-GELU, and
  • apply the polynomial-approximation’s unaligned-load approach to the sigmoid-GELU path

to see whether we can push sigmoid-GELU performance ahead of the polynomial method.

May I kindly ask for your thoughts and suggestions, @max-krasnyansky?

@joeldushouyu
Contributor Author

For commit fc2289d, I noticed a significant performance gap between the two sigmoid implementations that handle unaligned input vectors.

The first version treats x as an aligned HVX_Vector pointer and uses Q6_V_valign_VVR to handle any address misalignment, while the second version treats the input as an unaligned HVX_UVector pointer.
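For illustration, the two load strategies look roughly like this (a minimal sketch ignoring tail handling, not the actual kernel code):

```c
#include <stdint.h>
#include <hexagon_types.h>
#include <hexagon_protos.h>

// Strategy 1: aligned loads plus valign. Load the two aligned vectors
// that straddle x, then let Q6_V_valign_VVR shift the byte stream by the
// pointer's misalignment (only the low bits of the last operand are used,
// so passing the pointer itself is the usual idiom). In a loop, `hi` is
// reused as the next iteration's `lo`, so each step costs one load.
static HVX_Vector load_valign(const float *x) {
    const HVX_Vector *p = (const HVX_Vector *) ((uintptr_t) x & ~(uintptr_t) 127);
    HVX_Vector lo = p[0];
    HVX_Vector hi = p[1];
    return Q6_V_valign_VVR(hi, lo, (uint32_t) (uintptr_t) x);
}

// Strategy 2: let the memory subsystem handle the misalignment by going
// through an unaligned vector pointer.
static HVX_Vector load_unaligned(const float *x) {
    return *(const HVX_UVector *) x;
}
```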

On my local device, when used inside the GELU kernel, the first implementation runs in about 6400–6500 µs, whereas the unaligned version takes around 7300–7400 µs for an input of size 4096 × 4304. This seems consistent with my point (3) above? It would be good to have someone else verify these numbers, but if they hold, it might be worth applying the same approach to other functions such as hvx_mul_f32 and hvx_add_f32?

@joeldushouyu
Contributor Author

With the newest commit, cbd4e93, GELU now takes about 5300–5400 µs for the 4096×4304 data size.

@max-krasnyansky
Collaborator

> For commit fc2289d, I noticed a significant performance gap between the two sigmoid implementations that handle unaligned input vectors.
>
> The first version treats x as an aligned HVX_Vector pointer and uses Q6_V_valign_VVR to handle any address misalignment, while the second version treats the input as an unaligned HVX_UVector pointer.
>
> On my local device, when used inside the GELU kernel, the first implementation runs in about 6400–6500 µs, whereas the unaligned version takes around 7300–7400 µs for an input of size 4096 × 4304. This seems consistent with my point (3) above? It would be good to have someone else verify these numbers, but if they hold, it might be worth applying the same approach to other functions such as hvx_mul_f32 and hvx_add_f32?

Yes, it's generally better to use an explicit valign instead of unaligned accesses, especially when the data is in DDR.
In some cases (like reading the scales from VTCM in the q4/q8/mxfp4 matmuls) unaligned reads require fewer instructions/registers and are therefore beneficial, but in general it's better to use valign.

@max-krasnyansky
Collaborator

> After revisiting the polynomial-approximation implementation, I noticed a block-prefetch operation in the code, which I have commented out in commit 7233999 for a fair comparison. With that in mind, here are the updated results:
>
> | data size | GELU-sigmoid | GELU-polynomial approximation |
> |---|---|---|
> | 4096×4304 (unaligned) | 11833 µs | 8680 µs |
> | 4096×4096 (aligned) | 5006 µs | 7990 µs |
>
> From the new testbench runs, a few things stand out:
>
> 1. The 7th-order polynomial path is actually slower computationally than the sigmoid-based GELU.
> 2. The current L2-prefetch logic used in the regular SiLU, GELU, and SwiGLU kernels appears to be either underutilized or triggered too late. This likely explains why the polynomial approximation, with its more aggressive L2 prefetching, outperforms the sigmoid-GELU in the unaligned case.
> 3. The unaligned version of the sigmoid-GELU could potentially benefit from adopting the same unaligned-load strategy used in the polynomial-approximation path.
>
> For the remainder of this PR, I plan to:
>
> • refine the L2-prefetching strategy in sigmoid-GELU, and
> • apply the polynomial-approximation’s unaligned-load approach to the sigmoid-GELU path
>
> to see whether we can push sigmoid-GELU performance ahead of the polynomial method.
>
> May I kindly ask for your thoughts and suggestions, @max-krasnyansky?

Yes. Review & revisit of the l2-prefetch logic (in all the Ops) is on the TODO list :).
Thanks for starting the discussion and updates in this area!

@joeldushouyu
Contributor Author

> For commit fc2289d, I noticed a significant performance gap between the two sigmoid implementations that handle unaligned input vectors.
> The first version treats x as an aligned HVX_Vector pointer and uses Q6_V_valign_VVR to handle any address misalignment, while the second version treats the input as an unaligned HVX_UVector pointer.
> On my local device, when used inside the GELU kernel, the first implementation runs in about 6400–6500 µs, whereas the unaligned version takes around 7300–7400 µs for an input of size 4096 × 4304. This seems consistent with my point (3) above? It would be good to have someone else verify these numbers, but if they hold, it might be worth applying the same approach to other functions such as hvx_mul_f32 and hvx_add_f32?
>
> Yes, it's generally better to use an explicit valign instead of unaligned accesses, especially when the data is in DDR. In some cases (like reading the scales from VTCM in the q4/q8/mxfp4 matmuls) unaligned reads require fewer instructions/registers and are therefore beneficial, but in general it's better to use valign.

Thanks for confirming this! I can submit another PR that applies the same optimizations, both valign and larger L2 fetches, to all the other ops if that sounds good to you.

@joeldushouyu joeldushouyu marked this pull request as ready for review December 16, 2025 13:42
@max-krasnyansky
Collaborator

@joeldushouyu there are a bunch of reformatting changes (spaces, etc).
Looks like your editor applied those automatically. Let's please undo those.
Especially the #pragma unroll: clang-format almost always gets this one wrong; it should be aligned to the loop.
Otherwise looks good to me

@joeldushouyu
Contributor Author

joeldushouyu commented Dec 16, 2025

Probably this should be done in another PR (if we like it) along with the remaining optimizations.

As I look at the current code implementation, take the hvx_mul_scalar_f32 function as an example. The current logic for handling unaligned loads looks like the following:

```
if (aligned src && aligned dst) {
    HVX_Vector src ...
    HVX_Vector dst ...
} else {
    HVX_UVector src ...
    HVX_UVector dst ...
}
```

Looking through the SDK documentation, it says that the difference between aligned and unaligned loads/stores can cause a significant performance difference, even at the L2 cache level.

Thus, I wonder if it would be beneficial to change the logic to the following:

```
if (aligned src && aligned dst) {
    HVX_Vector src ...
    HVX_Vector dst ...
} else if (aligned src && !aligned dst) {
    HVX_Vector src ...
    HVX_UVector dst ...
} else if (!aligned src && aligned dst) {
    HVX_UVector src ...
    HVX_Vector dst ...
} else {
    HVX_UVector src ...
    HVX_UVector dst ...
}
```

In other words, we would handle the different possible address-alignment cases at a finer granularity, as sketched below. I wonder whether this would help performance, or whether it would end up hurting it, since the program size would increase and might exceed the L1 instruction cache size, causing frequent DDR accesses for the program instructions.
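A minimal sketch of how that four-way dispatch could be selected at runtime (illustrative only; the real hvx_mul_scalar_f32 would keep its own loop bodies):

```c
#include <stdint.h>

// 128-byte HVX vector alignment test (illustrative helper).
static inline int is_vec_aligned(const void *p) {
    return ((uintptr_t) p & 127u) == 0;
}

// Four-way dispatch over src/dst alignment; each branch would use
// HVX_Vector or HVX_UVector pointers for src/dst accordingly.
void hvx_op_dispatch(const float *src, float *dst, uint32_t n) {
    int sa = is_vec_aligned(src);
    int da = is_vec_aligned(dst);
    if (sa && da) {
        // aligned src, aligned dst: HVX_Vector loads and stores
    } else if (sa) {
        // aligned src, unaligned dst: HVX_Vector loads, HVX_UVector stores
    } else if (da) {
        // unaligned src, aligned dst: HVX_UVector loads, HVX_Vector stores
    } else {
        // both unaligned: HVX_UVector loads and stores
    }
}
```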

@max-krasnyansky

@max-krasnyansky
Collaborator

> Probably this should be done in another PR (if we like it) along with the remaining optimizations.
> ...

Yep, something like that.
It makes sense to review/update the l2-prefetching and the general handling of unaligned data in all those Ops.

@max-krasnyansky
Collaborator

I'm seeing consistent segfaults with the GELU test. Tested on Gen4 (S25+) and Gen5 devices.

$ HB=0 ./scripts/snapdragon/adb/run-tool.sh test-backend-ops -b HTP0 -o GELU
  ...
  GELU(type=f32,ne_a=[128,2,2,2],v=0): OK
/workspace/ggml/src/ggml-hexagon/ggml-hexagon.cpp:315: ggml-hex: dspqueue_read failed: 0x0000002e
0: 0x7880b1171c 
1: 0x7880b11628 ggml_print_backtrace
2: 0x7880b11914 ggml_abort
3: 0x787addddfc _ZN20ggml_hexagon_session5flushEv

$ adb logcat -e CDSP
--------- beginning of main
12-16 20:41:38.840  2382  2736 V adsprpc : 033b:06: CDSP0:[R]: ############################### Process on cDSP0 CRASHED!!!!!!! ########################################
12-16 20:41:38.840  2382  2736 V adsprpc : 033b:06: CDSP0:[R]: --------------------- Crash Details are furnished below ------------------------------------
12-16 20:41:38.840  2382  2736 V adsprpc : 033b:06: CDSP0:[R]: Process "/frpc/f0542a20 test-backend-op" crashed in thread "0xfd040010:work" due to TLBMISS RW occurrence
12-16 20:41:38.840  2382  2736 V adsprpc : 033b:06: CDSP0:[R]: Crashed Shared Object "./libggml-htp-v81.so" load address : 0xFDE9B000 
...

Tested with NHVX=1...8 with the same result (i.e. not threading related).
Also tested with OPMASK=0x3 (i.e. dispatch Ops but skip compute on the HTP):

$ HB=0 OPMASK=0x3 ./scripts/snapdragon/adb/run-tool.sh test-backend-ops -b HTP0 -o GELU
  GELU(type=f16,ne_a=[128,2,2,2],v=0): not supported [HTP0] 
  GELU(type=f16,ne_a=[5,7,11,13],v=0): not supported [HTP0] 
  GELU(type=f16,ne_a=[128,2,2,2],v=1): not supported [HTP0] 
  GELU(type=f16,ne_a=[5,7,11,13],v=1): not supported [HTP0] 
[GELU] ERR = 1.512005117 > 0.000000100   GELU(type=f32,ne_a=[128,2,2,2],v=0): FAIL
[GELU] ERR = 1.512211652 > 0.000000100   GELU(type=f32,ne_a=[5,7,11,13],v=0): FAIL
  GELU(type=f32,ne_a=[128,2,2,2],v=1): not supported [HTP0] 
  GELU(type=f32,ne_a=[5,7,11,13],v=1): not supported [HTP0] 
  0/2 tests passed

Failing tests:
  GELU(type=f32,ne_a=[128,2,2,2],v=0)
  GELU(type=f32,ne_a=[5,7,11,13],v=0)
  Backend HTP0: FAIL

So the host side looks OK (failures are expected since compute is skipped).

This is probably due to l2fetch fetching past the mapped buffer with GELU(type=f32,ne_a=[5,7,11,13],v=0).
I'll stare at the code some more to see if I can spot anything obvious.
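If that turns out to be the cause, the usual fix is to clamp the prefetch extent to the bytes that remain in the mapped buffer. A hedged sketch, reusing the illustrative l2fetch_block helper from earlier (not the actual kernel code):

```c
#include <stdint.h>

// Prefetch at most `remaining` bytes, never past the end of the mapped
// buffer (sketch only; assumes the l2fetch_block helper shown earlier,
// treating the data as 128-byte rows).
static inline void l2fetch_clamped(const uint8_t *p, uint32_t remaining,
                                   uint32_t chunk) {
    uint32_t len = remaining < chunk ? remaining : chunk;
    if (len > 0) {
        l2fetch_block(p, 128 /*stride*/, 128 /*width*/, (len + 127) / 128 /*rows*/);
    }
}
```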

@joeldushouyu
Contributor Author

> Probably this should be done in another PR (if we like it) along with the remaining optimizations.
> ...
>
> Yep, something like that. It makes sense to review/update the l2-prefetching and the general handling of unaligned data in all those Ops.

I see. I am planning to submit another PR to enable swiglu_oai. The improvement PR should probably come after that.

@joeldushouyu
Contributor Author

joeldushouyu commented Dec 16, 2025

> I'm seeing consistent segfaults with the GELU test. Tested on Gen4 (S25+) and Gen5 devices.
>
> $ HB=0 ./scripts/snapdragon/adb/run-tool.sh test-backend-ops -b HTP0 -o GELU
>   ...
>   GELU(type=f32,ne_a=[128,2,2,2],v=0): OK
> /workspace/ggml/src/ggml-hexagon/ggml-hexagon.cpp:315: ggml-hex: dspqueue_read failed: 0x0000002e
> 0: 0x7880b1171c
> 1: 0x7880b11628 ggml_print_backtrace
> 2: 0x7880b11914 ggml_abort
> 3: 0x787addddfc _ZN20ggml_hexagon_session5flushEv
>
> $ adb logcat -e CDSP
> --------- beginning of main
> 12-16 20:41:38.840  2382  2736 V adsprpc : 033b:06: CDSP0:[R]: ############################### Process on cDSP0 CRASHED!!!!!!! ########################################
> 12-16 20:41:38.840  2382  2736 V adsprpc : 033b:06: CDSP0:[R]: --------------------- Crash Details are furnished below ------------------------------------
> 12-16 20:41:38.840  2382  2736 V adsprpc : 033b:06: CDSP0:[R]: Process "/frpc/f0542a20 test-backend-op" crashed in thread "0xfd040010:work" due to TLBMISS RW occurrence
> 12-16 20:41:38.840  2382  2736 V adsprpc : 033b:06: CDSP0:[R]: Crashed Shared Object "./libggml-htp-v81.so" load address : 0xFDE9B000
> ...
>
> Tested with NHVX=1...8 with the same result (i.e. not threading related). Also tested with OPMASK=0x3 (i.e. dispatch Ops but skip compute on the HTP):
>
> $ HB=0 OPMASK=0x3 ./scripts/snapdragon/adb/run-tool.sh test-backend-ops -b HTP0 -o GELU
>   GELU(type=f16,ne_a=[128,2,2,2],v=0): not supported [HTP0]
>   GELU(type=f16,ne_a=[5,7,11,13],v=0): not supported [HTP0]
>   GELU(type=f16,ne_a=[128,2,2,2],v=1): not supported [HTP0]
>   GELU(type=f16,ne_a=[5,7,11,13],v=1): not supported [HTP0]
> [GELU] ERR = 1.512005117 > 0.000000100   GELU(type=f32,ne_a=[128,2,2,2],v=0): FAIL
> [GELU] ERR = 1.512211652 > 0.000000100   GELU(type=f32,ne_a=[5,7,11,13],v=0): FAIL
>   GELU(type=f32,ne_a=[128,2,2,2],v=1): not supported [HTP0]
>   GELU(type=f32,ne_a=[5,7,11,13],v=1): not supported [HTP0]
>   0/2 tests passed
>
> Failing tests:
>   GELU(type=f32,ne_a=[128,2,2,2],v=0)
>   GELU(type=f32,ne_a=[5,7,11,13],v=0)
>   Backend HTP0: FAIL
>
> So the host side looks OK (failures are expected since compute is skipped).
>
> This is probably due to l2fetch fetching past the mapped buffer with GELU(type=f32,ne_a=[5,7,11,13],v=0). I'll stare at the code some more to see if I can spot anything obvious.

Sorry, I failed to test the newer commit with the standard test cases. I just pushed a commit (52f43fb) that fixes the issue: basically, an unsigned int variable was overflowing when decremented. I'm not sure why it did not occur when running on larger sizes, like 4096×4304, though.
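For illustration, this class of bug looks roughly like the following (hypothetical names, not the exact kernel code):

```c
#include <stdint.h>

#define VLEN 128  // HVX vector length in bytes

// Hypothetical sketch of the bug: once `left` drops below VLEN, the
// unsigned subtraction wraps around to a huge value, so the loop keeps
// reading (and prefetching) far past the mapped buffer -> TLBMISS on the
// DSP. Sizes that are exact multiples of the vector length (such as the
// 4096x4304 float runs) never hit the wrap, which would explain why only
// small odd shapes like [5,7,11,13] crashed.
void process_buggy(const uint8_t *src, uint32_t nbytes) {
    uint32_t left = nbytes;
    while (left > 0) {
        // ... consume one vector from src ...
        src  += VLEN;
        left -= VLEN;  // wraps when left < VLEN
    }
}
```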

Right now, all the GELU tests run and pass on my end with the fix.

@max-krasnyansky
Collaborator

> I'm seeing consistent segfaults with the GELU test. Tested on Gen4 (S25+) and Gen5 devices.
> ...
>
> Sorry, I failed to test the newer commit with the standard test cases. I just pushed a commit (52f43fb) that fixes the issue: basically, an unsigned int variable was overflowing when decremented. I'm not sure why it did not occur when running on larger sizes, like 4096×4304, though.
>
> Right now, all the GELU tests run and pass on my end with the fix.

Yep. Looks good here too. Merging after CI pass

@max-krasnyansky
Collaborator

@joeldushouyu

> Right now, all the GELU tests run and pass on my end with the fix.
>
> Yep. Looks good here too. Merging after CI pass

Looks like the editor-config check is failing.
https://github.com/ggml-org/llama.cpp/actions/runs/20284037169/job/58253291589?pr=17921

Could you please fix those trailing spaces but don't run the full clang-format :)

@joeldushouyu
Contributor Author

joeldushouyu commented Dec 17, 2025

> @joeldushouyu
>
> Right now, all the GELU tests run and pass on my end with the fix.
>
> Yep. Looks good here too. Merging after CI pass
>
> Looks like the editor-config check is failing. https://github.com/ggml-org/llama.cpp/actions/runs/20284037169/job/58253291589?pr=17921
>
> Could you please fix those trailing spaces but don't run the full clang-format :)

I pushed a fix that addresses the editorconfig errors, and hopefully I have done everything correctly.

Out of curiosity, what C++ formatting tool do you use to ensure the code passes the CI checks? Previously, I was using clang-format as specified in CONTRIBUTING.md, but perhaps that document is out of date?

@max-krasnyansky
Collaborator

> Looks like the editor-config check is failing. https://github.com/ggml-org/llama.cpp/actions/runs/20284037169/job/58253291589?pr=17921
> Could you please fix those trailing spaces but don't run the full clang-format :)
>
> I pushed a fix that addresses the editorconfig errors, and hopefully I have done everything correctly.
>
> Out of curiosity, what C++ formatting tool do you use to ensure the code passes the CI checks? Previously, I was using clang-format as specified in CONTRIBUTING.md, but perhaps that document is out of date?

I did use clang-format to format everything before upstreaming, but as I mentioned, it does get several things wrong, like mis-indenting the pragmas (and, I believe, ifdefs). Also, some of the code is more readable when formatted by hand, slightly outside the strict format spec. We could technically sprinkle a bunch of // clang-format off/on markers, but that gets ugly quickly as well.
Anyway, clang-format is the right thing, with exceptions :) We could do a separate PR with just format updates (fixed by hand) if the code drifts too far.
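For example, the pragma case mentioned above can be protected so that #pragma unroll stays aligned with its loop (illustrative snippet):

```c
// Keep the pragma aligned with its loop; clang-format would
// otherwise re-indent it.
static float sum4(const float *x) {
    float sum = 0.0f;
    // clang-format off
    #pragma unroll
    for (int i = 0; i < 4; i++) {
        sum += x[i];
    }
    // clang-format on
    return sum;
}
```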

btw Editor Config check can be triggered manually https://github.com/ggml-org/llama.cpp/actions/workflows/editorconfig.yml (Run workflow)
You can run that in your fork/branch.

@max-krasnyansky max-krasnyansky merged commit 4470a07 into ggml-org:master Dec 17, 2025
64 of 71 checks passed
