ggml-hexagon: gelu operation #17921
Conversation
While both commit 2a787a6 and commit 83412e0 are fully functional GELU implementations that pass the official ggml test when run with `HB=0 ./scripts/snapdragon/adb/run-tool.sh test-backend-ops -b HTP0 -o GELU`, there is a significant performance difference between the two that may be worth a discussion. The GELU code from commit 2a787a6 is simply a sigmoid-based GELU implementation, whereas commit 83412e0 is a 7th-order polynomial approximation of GELU generated with the qhcg tool from the Hexagon SDK, with some modifications. When running on an input of size [4096 x 4304] (the non-linear input data size for the gemma3 vision model), I see a significant performance gap between these two implementations.
The data above were collected on a Samsung Galaxy S25 Ultra using the test repo I wrote (currently on the refactor-dev branch). For the usec numbers above, I recorded the longest time among the 6 threads as printed in the FARF log. In addition, when plotting the polynomial approximation with the plot script in my test repo, I did not see much, if any, error between the polynomial-approximation version and the CPU reference.
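For reference, here is a scalar sketch of the two formulations being compared. These are illustrative only, not the HVX kernels from the commits, and the 1.702 constant is the commonly used value, assumed here; the qhcg-generated polynomial coefficients are not reproduced.

```c
#include <math.h>

// Sigmoid-based GELU (the form used by the sigmoid variant):
//   gelu(x) ~= x * sigmoid(1.702 * x)
static float gelu_sigmoid_ref(float x) {
    return x / (1.0f + expf(-1.702f * x));
}

// Exact erf-based definition, handy as the CPU reference when checking the
// polynomial approximation for error:
//   gelu(x) = 0.5 * x * (1 + erf(x / sqrt(2)))
static float gelu_erf_ref(float x) {
    return 0.5f * x * (1.0f + erff(x * 0.70710678f));
}
```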
After revisiting the polynomial-approximation implementation, I noticed there was a block-prefetch operation in the code which I have commented out in commit 7233999 for a fair comparison. With that in mind, here are the updated results:
From the new testbench runs, a few things stand out:
For the remainder of this PR, I plan to:
to see whether we can push sigmoid-GELU performance ahead of the polynomial method. May I kindly ask for your thoughts and suggestions, @max-krasnyansky?
For commit fc2289d, I noticed a significant performance gap between the two sigmoid implementations that handle unaligned input vectors. The first version treats … On my local device, when used inside the GELU kernel, the first implementation runs in about 6400–6500 µs, whereas the unaligned version takes around 7300–7400 µs for an input of size 4096 × 4304. This seems consistent with my point (3) above? It would be good to have someone else verify these numbers, but if they hold, it might be worth applying the same approach to other functions such as …
With the newest commit, cbd4e93, GELU now takes about 5300–5400 cycles for the 4096x4304 data size.
Yes, it's generally better to use explicit …
Yes. Review & revisit of the l2-prefetch logic (in all the Ops) is on the TODO list :).
Thanks for confirming this! I can submit another PR that applies the same optimization, both valign and a larger L2 fetch, to all other ops if that sounds good to you.
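For context on the valign part, here is a rough sketch of the classic HVX pattern for reading a 128-byte vector from a possibly unaligned address using two aligned loads plus `Q6_V_valign_VVR`. This is illustrative only, assumes 128-byte HVX mode, and is not the code in this PR.

```c
#include <stdint.h>
#include <hexagon_types.h>       // HVX_Vector / HVX_UVector typedefs
#include <hvx_hexagon_protos.h>  // Q6_V_* intrinsics

// Illustrative sketch: load one 128-byte vector starting at a possibly
// unaligned pointer p. Note that it may read up to 127 bytes past the last
// element actually needed, so the caller must guarantee that this is safe.
static inline HVX_Vector hvx_load_unaligned_sketch(const void * p) {
    const HVX_Vector * a = (const HVX_Vector *) ((uintptr_t) p & ~(uintptr_t) 127);
    HVX_Vector lo = a[0];   // aligned vector containing p
    HVX_Vector hi = a[1];   // next aligned vector
    // valign extracts the 128 bytes that start at the (low bits of the)
    // address offset out of the {hi, lo} byte pair.
    return Q6_V_valign_VVR(hi, lo, (uint32_t) (uintptr_t) p);
}
```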
@joeldushouyu there are a bunch of reformatting changes (spaces, etc). |
This reverts commit 952877e.
Probably this should be done in another PR (if we like it), along with the remaining optimizations. Looking at the current implementation, for example the hvx_mul_scalar_f32 function, the current logic for handling unaligned loads is roughly the following:

```
if (aligned src && aligned dst) {
    HVX_Vector  src ...
    HVX_Vector  dst ...
} else {
    HVX_UVector src ...
    HVX_UVector dst ...
}
```

Looking through the SDK documentation, it says that the difference between aligned and unaligned loads/stores can cause a significant performance difference, even at the L2 cache level. So I wonder whether it would be beneficial to change the logic to the following:

```
if (aligned src && aligned dst) {
    HVX_Vector  src ...
    HVX_Vector  dst ...
} else if (aligned src && !aligned dst) {
    HVX_Vector  src ...
    HVX_UVector dst ...
} else if (!aligned src && aligned dst) {
    HVX_UVector src ...
    HVX_Vector  dst ...
} else {
    HVX_UVector src ...
    HVX_UVector dst ...
}
```

In other words, handle each possible address-alignment combination at a finer granularity. I wonder whether this would help performance, or whether it could end up hurting it, since the code size increases and might exceed the L1 instruction cache, causing frequent DDR accesses for the program instructions?
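To make the idea concrete, here is a rough sketch of what one of the extra branches (aligned src, unaligned dst) could look like for an hvx_mul_scalar_f32-style loop. The function name is illustrative, not the actual ggml-hexagon implementation, and the sketch assumes HVX v68+ qf32 support and a 128-byte vector length.

```c
#include <stdint.h>
#include <string.h>
#include <hexagon_types.h>
#include <hvx_hexagon_protos.h>

// Illustrative only: multiply n floats by a scalar when src is known to be
// 128-byte aligned but dst may not be. HVX_Vector accesses assume alignment;
// HVX_UVector tells the compiler to emit unaligned memory operations.
static void hvx_mul_scalar_f32_sketch(const float * src, float * dst, float s, uint32_t n) {
    const uint32_t elems_per_vec = 128 / sizeof(float);   // 32 f32 per vector
    const uint32_t nvec          = n / elems_per_vec;

    const HVX_Vector * vsrc = (const HVX_Vector *) src;   // aligned loads
    HVX_UVector      * vdst = (HVX_UVector *) dst;        // unaligned stores

    uint32_t bits;
    memcpy(&bits, &s, sizeof(bits));
    HVX_Vector vs = Q6_V_vsplat_R(bits);                  // broadcast scalar

    for (uint32_t i = 0; i < nvec; i++) {
        HVX_Vector prod = Q6_Vqf32_vmpy_VsfVsf(vsrc[i], vs);
        vdst[i] = Q6_Vsf_equals_Vqf32(prod);               // qf32 -> IEEE sf
    }
    // The n % elems_per_vec tail elements would be handled separately.
}
```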
Yep, something like that.
I'm seeing consistent segfaults with the GELU test. Tested on Gen4 (S25+) and Gen5 devices. Tested with NHVX=1...8 with the same result (i.e. not threading related). So the host side looks OK (failures are expected since compute is skipped). This is probably due to …
I see. I am planning to submit another PR for enabling swiglu_oai. The improvement PR should probably come after it.
Sorry, I failed to test the newer commit with the standard test case. I just pushed a commit (52f43fb) that fixes the issue; basically, an unsigned int variable overflows when decremented. Not sure why it did not occur when running on larger sizes, like 4096x4304, though. Right now, I can run and pass all the GELU tests on my end with the fix.
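For anyone curious about this bug class, here is a minimal standalone illustration (not the actual kernel code) of how a decrementing unsigned counter can wrap around and run off the end of a buffer:

```c
#include <stddef.h>

// Buggy count-down: "i >= 0" is always true for an unsigned type, so when i
// reaches 0 the decrement wraps to a huge value and dst[i] goes out of bounds.
static void fill_buggy(float * dst, unsigned int n) {
    for (unsigned int i = n - 1; i >= 0; --i) {   // never terminates
        dst[i] = 0.0f;
    }
}

// One common fix: decrement as part of the loop test so it stops after i == 0
// (and handles n == 0 without wrapping).
static void fill_fixed(float * dst, unsigned int n) {
    for (unsigned int i = n; i-- > 0; ) {
        dst[i] = 0.0f;
    }
}
```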
Yep. Looks good here too. Merging after CI passes.
Looks like the editor-config check is failing. Could you please fix those trailing spaces but don't run the full clang-format :) |
I pushed a fix that addresses the editorconfig issues, and hopefully I have done everything correctly. Out of curiosity, what C++ formatting tool do you use to ensure the code passes the CI checks? Previously, I was using clang-format as specified in CONTRIBUTING.md, but perhaps the document is out of date?
I did use … Btw, the Editor Config check can be triggered manually: https://github.com/ggml-org/llama.cpp/actions/workflows/editorconfig.yml


Support GELU operation for ggml-hexagon.