Improve coalesced reduction performance for tall and thin matrices (up to 2.6x faster) #2259
…lower LSU utilization and coalesced global stores
/ok to test
/ok to test
/ok to test
Thanks Louis for this update! Overall it looks good, I just have two questions.
Thanks Louis for the explanation! The PR looks good to me.
/ok to test
/ok to test
/merge
This PR implements two optimizations to `coalescedReductionThinKernel`, which is used for coalesced reductions of matrices that are tall (many rows) and/or thin (few columns).

The benchmark below shows the achieved SOL percentage on A30. On H200, I also measured 84% SOL for 32 columns and up to 94% for 512 columns.