Improve coalesced reduction performance for tall and thin matrices (up to 2.6x faster) #2259

Nyrio · 2024-04-09T19:19:04Z

This PR implements two optimizations to `coalescedReductionThinKernel`, which is used for coalesced reductions of matrices that are tall (many rows) and/or thin (few columns):

Process multiple rows per warp to increase the number of bytes in flight and amortize load latencies.
Use a vectorized reduction to avoid the LSU bottleneck and to issue fewer global stores, which are also at least partially coalesced.
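
To make the two points above concrete, here is a minimal CPU-side C++ model of one way such a scheme can work. It is a sketch under assumptions, not the code from this PR: the names `thin_row_sums`, `kWarpSize` and `kRowsPerWarp` are hypothetical, the "warp" is simulated with plain loops, and the halving exchange stands in for what would be warp shuffles on the GPU.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical sizes for illustration; not RAFT's actual policy values.
constexpr int kWarpSize    = 32;  // lanes in the simulated warp
constexpr int kRowsPerWarp = 4;   // rows reduced together (power of two)

using LaneVectors = std::array<std::array<float, kRowsPerWarp>, kWarpSize>;

// CPU model of a row-wise sum over a row-major n_rows x n_cols matrix.
std::vector<float> thin_row_sums(const std::vector<float>& data,
                                 std::size_t n_rows,
                                 std::size_t n_cols)
{
  std::vector<float> out(n_rows, 0.0f);
  for (std::size_t row0 = 0; row0 < n_rows; row0 += kRowsPerWarp) {
    // 1) Each lane accumulates one partial sum per row of the group,
    //    so several independent loads are in flight per lane.
    LaneVectors v{};
    for (int lane = 0; lane < kWarpSize; ++lane) {
      for (int r = 0; r < kRowsPerWarp; ++r) {
        const std::size_t row = row0 + static_cast<std::size_t>(r);
        if (row >= n_rows) { continue; }
        for (std::size_t col = static_cast<std::size_t>(lane); col < n_cols;
             col += kWarpSize) {
          v[lane][r] += data[row * n_cols + col];
        }
      }
    }
    // 2) Vectorized exchange: at each step a lane keeps one half of its
    //    values and adds the partner's copy of that same half.
    for (int half = kRowsPerWarp / 2; half >= 1; half /= 2) {
      LaneVectors next{};
      for (int lane = 0; lane < kWarpSize; ++lane) {
        const int partner = lane ^ half;
        const int keep    = (lane & half) ? half : 0;
        for (int i = 0; i < half; ++i) {
          next[lane][i] = v[lane][keep + i] + v[partner][keep + i];
        }
      }
      v = next;
    }
    // 3) Each lane now holds one value for row (lane % kRowsPerWarp);
    //    a butterfly over the remaining lane bits completes the sums.
    for (int bit = kRowsPerWarp; bit < kWarpSize; bit *= 2) {
      LaneVectors next = v;
      for (int lane = 0; lane < kWarpSize; ++lane) {
        next[lane][0] = v[lane][0] + v[lane ^ bit][0];
      }
      v = next;
    }
    // 4) Adjacent lanes write adjacent outputs (the coalesced store).
    for (int lane = 0; lane < kRowsPerWarp; ++lane) {
      const std::size_t row = row0 + static_cast<std::size_t>(lane);
      if (row < n_rows) { out[row] = v[lane][0]; }
    }
  }
  return out;
}
```

In this model, with 4 rows and 32 lanes, each lane exchanges 2 + 1 + 3 = 6 values, whereas reducing the 4 rows one after another with a 5-step butterfly would take 20 exchanges per lane. The results for the 4 rows also end up in 4 neighbouring lanes, so they can be written to consecutive addresses together.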

The benchmark below shows the achieved SOL (speed-of-light) percentage on A30. On H200, I also measured 84% SOL for 32 columns and up to 94% for 512 columns.

copy-pr-bot · 2024-04-09T19:19:07Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Nyrio · 2024-04-09T19:24:35Z

/ok to test

Nyrio · 2024-04-09T19:31:57Z

/ok to test

Nyrio · 2024-04-10T10:44:03Z

/ok to test

tfeher

Thanks Louis for this update! Overall it looks good; I just have two questions.

cpp/include/raft/util/reduction.cuh

cpp/include/raft/linalg/detail/coalesced_reduction-inl.cuh

tfeher

Thanks Louis for the explanation! The PR looks good to me.

tfeher · 2024-04-17T09:19:35Z

/ok to test

tfeher · 2024-04-22T11:53:20Z

/ok to test

cjnolet · 2024-04-22T14:56:26Z

/merge

Nyrio added 2 commits April 8, 2024 18:25

Increase number of loads per thread in coalescedReductionThinKernel

859495a

Use optimized vector warp reduce in coalescedReductionThinKernel for …

51dfdbd

…lower LSU utilization and coalesced global stores

Nyrio added the 3 - Ready for Review, improvement, non-breaking, and cpp labels Apr 9, 2024

Nyrio requested a review from a team as a code owner April 9, 2024 19:19

Merge branch 'branch-24.06' into enh-reduction-perf

673f804

Update copyright year

102dc33

Avoid circular include dependency in reduction.cuh

591a9e3

tfeher reviewed Apr 12, 2024

cpp/include/raft/util/reduction.cuh (resolved)

cpp/include/raft/linalg/detail/coalesced_reduction-inl.cuh (resolved)

tfeher approved these changes Apr 16, 2024

tfeher added 2 commits April 16, 2024 09:45

Merge branch 'branch-24.06' into enh-reduction-perf

2b2d8af

Merge branch 'branch-24.06' into enh-reduction-perf

b553fb0

Merge branch 'branch-24.06' into enh-reduction-perf

001c2dd

cjnolet approved these changes Apr 22, 2024

rapids-bot bot merged commit 317a61c into rapidsai:branch-24.06 Apr 22, 2024
69 checks passed
