Fix and improve one-block radix select #1878

Merged
merged 16 commits into rapidsai:branch-23.12 on Nov 9, 2023

Conversation

yong-wang
Contributor

@yong-wang yong-wang commented Oct 9, 2023

  • fix matrix::detail::select::radix::calc_chunk_size() for the one-block kernel
  • use calc_buf_len() rather than len as the buffer size of the one-block kernel (see the sketch below)
  • reduce the register footprint of the one-block kernel by reducing the number of buffer pointers
  • reduce the buffer size by 1/8 for all radix select functions

Resolve #1823
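
A hedged sketch of the second bullet's idea (hypothetical code, not the actual RAFT implementation: `buf_len` is assumed to be the value returned by `calc_buf_len(len)`, and the two-key-buffer/two-index-buffer split is an assumption for illustration):

```cpp
#include <cstddef>

// Hypothetical sizing helper (illustration only): the one-block kernel's
// scratch space is sized from buf_len = calc_buf_len(len) rather than from
// len itself, which is what this PR changes.
template <typename T, typename IdxT>
std::size_t one_block_scratch_bytes(std::size_t chunk_size, std::size_t buf_len)
{
  // Two ping-pong key buffers and two index buffers per row in the chunk.
  return chunk_size * buf_len * 2 * (sizeof(T) + sizeof(IdxT));
}
```

Because calc_buf_len(len) is much smaller than len (e.g. 1/32 of len for float-uint32_t, as discussed below), the scratch allocation shrinks accordingly.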

@copy-pr-bot

copy-pr-bot bot commented Oct 9, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the cpp label Oct 9, 2023
@cjnolet cjnolet added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels Oct 9, 2023
@cjnolet
Member

cjnolet commented Oct 9, 2023

/ok to test

@cjnolet
Member

cjnolet commented Oct 12, 2023

/ok to test

@yong-wang
Contributor Author

Using a smaller buffer for radix_topk_one_block_kernel adds a few branches to the code, which increases the register count and decreases occupancy.

To alleviate register pressure, radix_topk_one_block_kernel is changed to take only one buf pointer as input instead of 4 buf pointers. These pointers have long live ranges and hence increase register pressure.

As a result, the occupancy is kept almost unchanged by this PR.
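
To make this concrete, here is a hedged sketch of the signature change (not the actual RAFT kernel; the parameter list, buffer layout, and offset arithmetic are assumptions, and alignment handling is omitted):

```cpp
// Before (schematically): four buffer pointers passed in, all with long live ranges.
//   radix_topk_one_block_kernel(..., T* buf1, IdxT* idx_buf1, T* buf2, IdxT* idx_buf2)
//
// After: a single scratch pointer; the four buffers are derived inside the kernel,
// close to where they are used, so fewer values need to stay live in registers.
template <typename T, typename IdxT>
__global__ void radix_topk_one_block_kernel(const T* in,
                                            const IdxT* in_idx,
                                            const IdxT len,
                                            const IdxT k,
                                            T* out,
                                            IdxT* out_idx,
                                            const IdxT buf_len,  // assumed to come from calc_buf_len(len)
                                            char* bufs)          // one pointer instead of four
{
  char* base     = bufs + static_cast<size_t>(blockIdx.x) * buf_len * 2 * (sizeof(T) + sizeof(IdxT));
  T* buf1        = reinterpret_cast<T*>(base);
  T* buf2        = buf1 + buf_len;
  IdxT* idx_buf1 = reinterpret_cast<IdxT*>(buf2 + buf_len);
  IdxT* idx_buf2 = idx_buf1 + buf_len;
  // ... the radix passes then alternate between (buf1, idx_buf1) and (buf2, idx_buf2) ...
}
```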

For sm80, #register and occupancy are the same before and after this PR:

| T | IdxT | #reg | occupancy |
| --- | --- | --- | --- |
| float | int32 | 32 | 100% |
| float | uint32 | 32 | 100% |
| float | int64 | 44 | 50% |
| float | uint64 | 44 | 50% |
| half | int32 | 32 | 100% |
| half | uint32 | 32 | 100% |
| half | int64 | 44 | 50% |
| half | uint64 | 44 | 50% |
| double | int32 | 32 | 100% |
| double | uint32 | 32 | 100% |
| double | int64 | 44 | 50% |
| double | uint64 | 44 | 50% |

For sm90, #register and occupancy change just a little after this PR (- is for commit cf085586, + denotes this PR):

|   | T | IdxT | #reg | occupancy |
| --- | --- | --- | --- | --- |
| - | float | int32 | 40 | 75% |
| + | float | int32 | 39 | 75% |
|   | float | uint32 | 40 | 75% |
|   | float | int64 | 64 | 50% |
|   | float | uint64 | 64 | 50% |
| - | half | int32 | 40 | 75% |
| + | half | int32 | 39 | 75% |
|   | half | uint32 | 40 | 75% |
|   | half | int64 | 64 | 50% |
|   | half | uint64 | 64 | 50% |
| - | double | int32 | 40 | 75% |
| + | double | int32 | 39 | 75% |
|   | double | uint32 | 40 | 75% |
| - | double | int64 | 68 | 25% |
| - | double | uint64 | 68 | 25% |
| + | double | int64 | 64 | 50% |
| + | double | uint64 | 71 | 25% |

@yong-wang
Contributor Author

Benchmark results.

(1) Perf for inputs containing inf (e.g. label="10#1000000#1#infs-0.1") changes a lot. This is due to the reduced buffer size (decreased by 1/8), so buffer writes happen less often. The speedup data for this PR are:

For A100:

| speedup | min | max | avg |
| --- | --- | --- | --- |
| float-uint32_t kRadix11bits | 0.994 | 2.278 | 1.329 |
| float-uint32_t kRadix11bitsExtraPass | 0.997 | 2.276 | 1.327 |
| float-uint32_t kRadix8bits | 0.893 | 3.047 | 1.522 |
| double-uint32_t kRadix11bits | 0.998 | 2.858 | 1.437 |
| double-uint32_t kRadix11bitsExtraPass | 0.998 | 2.856 | 1.437 |
| double-uint32_t kRadix8bits | 0.850 | 3.232 | 1.473 |
| double-int64_t kRadix11bits | 0.998 | 1.031 | 1.008 |
| double-int64_t kRadix11bitsExtraPass | 0.997 | 1.030 | 1.005 |
| double-int64_t kRadix8bits | 0.952 | 1.012 | 0.993 |

For H100:

| speedup | min | max | avg |
| --- | --- | --- | --- |
| float-uint32_t kRadix11bits | 0.984 | 2.207 | 1.342 |
| float-uint32_t kRadix11bitsExtraPass | 0.983 | 2.205 | 1.336 |
| float-uint32_t kRadix8bits | 0.979 | 2.859 | 1.519 |
| double-uint32_t kRadix11bits | 0.995 | 3.502 | 1.614 |
| double-uint32_t kRadix11bitsExtraPass | 0.996 | 3.502 | 1.615 |
| double-uint32_t kRadix8bits | 0.857 | 4.116 | 1.710 |
| double-int64_t kRadix11bits | 0.996 | 1.053 | 1.008 |
| double-int64_t kRadix11bitsExtraPass | 0.996 | 1.053 | 1.008 |
| double-int64_t kRadix8bits | 0.996 | 1.009 | 1.001 |

Although there are slowdown cases for kRadix8bits (e.g. -15%), most cases are faster.

(2) Perf for cases with label "same-leading-bits" is almost the same, so I omit the data here.

@yong-wang
Contributor Author

(3) Cases with label "20000#500", that is, batch size = 20000 and len=500, become slower:

For A100:

| speedup | min | max | avg |
| --- | --- | --- | --- |
| float-uint32_t kRadix11bits | 0.891 | 0.954 | 0.922 |
| float-uint32_t kRadix11bitsExtraPass | 0.892 | 0.955 | 0.922 |
| float-uint32_t kRadix8bits | 0.881 | 0.930 | 0.900 |
| double-uint32_t kRadix11bits | 0.911 | 0.944 | 0.929 |
| double-uint32_t kRadix11bitsExtraPass | 0.911 | 0.944 | 0.929 |
| double-uint32_t kRadix8bits | 0.910 | 0.952 | 0.937 |
| double-int64_t kRadix11bits | 0.944 | 0.974 | 0.964 |
| double-int64_t kRadix11bitsExtraPass | 0.944 | 0.974 | 0.964 |
| double-int64_t kRadix8bits | 0.949 | 0.985 | 0.967 |

For H100:

| speedup | min | max | avg |
| --- | --- | --- | --- |
| float-uint32_t kRadix11bits | 0.877 | 0.983 | 0.937 |
| float-uint32_t kRadix11bitsExtraPass | 0.877 | 0.983 | 0.937 |
| float-uint32_t kRadix8bits | 1.040 | 1.147 | 1.066 |
| double-uint32_t kRadix11bits | 0.906 | 0.953 | 0.923 |
| double-uint32_t kRadix11bitsExtraPass | 0.906 | 0.954 | 0.924 |
| double-uint32_t kRadix8bits | 1.096 | 1.112 | 1.103 |
| double-int64_t kRadix11bits | 0.944 | 0.977 | 0.965 |
| double-int64_t kRadix11bitsExtraPass | 0.944 | 0.977 | 0.965 |
| double-int64_t kRadix8bits | 0.946 | 0.991 | 0.970 |

The reason is that the chunk size was chosen to be 20000 before and becomes 4320 after this PR, so five kernel launches are needed instead of just one.

It's easy to fix this by setting a minimum chunk size. For example, if 10 million elements per chunk are allowed, we can set the minimum chunk size to 10M/len = 10M/500 = 20000. This is somewhat reasonable because a few million elements should not take too much memory. However, it feels like overfitting to the benchmark parameters.
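
For concreteness, a minimal sketch of the clamp option (hypothetical, not part of this PR; `clamp_chunk_size` and the 10M threshold are illustrative names and values only):

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical clamp around the computed chunk size (illustration only).
// With len = 500 it yields a minimum of 10'000'000 / 500 = 20'000 rows per chunk,
// restoring a single launch for the 20000#500 case instead of ceil(20000 / 4320) = 5.
inline std::size_t clamp_chunk_size(std::size_t chunk_size, std::size_t len)
{
  constexpr std::size_t kMaxElemsPerChunk = 10'000'000;  // "a few million elements" budget
  const std::size_t min_chunk = std::max<std::size_t>(1, kMaxElemsPerChunk / len);
  return std::max(chunk_size, min_chunk);
}
```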

More aggressively, we could remove chunking entirely, because the buffer size is quite small after this PR. Taking float-uint32_t data as an example, the buffer size is 1/32 of len. There are 4 buffers, so the total buffer size is 1/8 of the input data. This means the memory overhead is only 1/8 even if we don't use chunking at all.
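
Restating that arithmetic as a quick check (treating keys and indices as 4-byte elements and the 2-key/2-index buffer split as an assumption; the len/32 per-buffer length is the figure quoted above):

```cpp
#include <cassert>
#include <cstddef>

int main()
{
  const std::size_t len        = 1u << 20;     // any row length (power of two for exact division)
  const std::size_t buf_len    = len / 32;     // per-buffer length after this PR (float-uint32_t case)
  const std::size_t total_elem = 4 * buf_len;  // 4 buffers in total
  assert(total_elem == len / 8);               // i.e. ~1/8 of the input, even with no chunking at all
  return 0;
}
```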

Any suggestion about these options?

@yong-wang
Contributor Author

(4) For other cases, the overall effects are positive:
For A100:

| speedup | min | max | avg |
| --- | --- | --- | --- |
| float-uint32_t kRadix11bits | 0.995 | 1.412 | 1.253 |
| float-uint32_t kRadix11bitsExtraPass | 0.995 | 1.393 | 1.245 |
| float-uint32_t kRadix8bits | 1.175 | 1.479 | 1.378 |
| double-uint32_t kRadix11bits | 1.060 | 1.252 | 1.158 |
| double-uint32_t kRadix11bitsExtraPass | 1.059 | 1.251 | 1.158 |
| double-uint32_t kRadix8bits | 1.003 | 1.275 | 1.091 |
| double-int64_t kRadix11bits | 1.048 | 1.141 | 1.125 |
| double-int64_t kRadix11bitsExtraPass | 1.048 | 1.140 | 1.124 |
| double-int64_t kRadix8bits | 1.000 | 1.033 | 1.011 |

For H100:

| speedup | min | max | avg |
| --- | --- | --- | --- |
| float-uint32_t kRadix11bits | 1.034 | 1.708 | 1.428 |
| float-uint32_t kRadix11bitsExtraPass | 1.032 | 1.669 | 1.403 |
| float-uint32_t kRadix8bits | 1.264 | 1.616 | 1.431 |
| double-uint32_t kRadix11bits | 1.197 | 1.397 | 1.296 |
| double-uint32_t kRadix11bitsExtraPass | 1.200 | 1.398 | 1.295 |
| double-uint32_t kRadix8bits | 1.013 | 1.378 | 1.132 |
| double-int64_t kRadix11bits | 1.058 | 1.128 | 1.108 |
| double-int64_t kRadix11bitsExtraPass | 1.058 | 1.128 | 1.109 |
| double-int64_t kRadix8bits | 0.998 | 1.046 | 1.014 |

(5) Besides the cases in the current benchmark, I also ran the case batch=4000 and len=500000. It's similar to the slow case reported by Artem, and it's the case that this PR is supposed to solve.

| speedup | A100 | H100 |
| --- | --- | --- |
| float-uint32_t kRadix11bits | 2.335 | 2.537 |
| float-uint32_t kRadix11bitsExtraPass | 2.335 | 2.536 |
| float-uint32_t kRadix8bits | 2.770 | 3.686 |
| double-uint32_t kRadix11bits | 2.014 | 2.489 |
| double-uint32_t kRadix11bitsExtraPass | 2.015 | 2.489 |
| double-uint32_t kRadix8bits | 2.152 | 3.042 |
| double-int64_t kRadix11bits | 2.579 | 2.582 |
| double-int64_t kRadix11bitsExtraPass | 2.579 | 2.582 |
| double-int64_t kRadix8bits | 1.786 | 2.449 |

The speedup is higher if len is larger (but more memory is used). However, I'm not sure whether to add such a case to the benchmark: it uses 4e3 * 5e5 * 8 bytes = 16 GB for the input of the double data type. Some GPUs might fail to run this, so it can be inconvenient for testing.

@yong-wang yong-wang marked this pull request as ready for review October 17, 2023 07:32
@yong-wang yong-wang requested a review from a team as a code owner October 17, 2023 07:32
@yong-wang yong-wang requested review from tfeher and achirkin October 17, 2023 07:33
@benfred
Member

benfred commented Nov 7, 2023

/ok to test

Contributor

@tfeher tfeher left a comment

Thanks @yong-wang for these changes! It is nice to see these perf improvements! The PR looks good to me.

Regarding the open question about the parameter combination where we see a regression, I think we should address that in a follow-up. I have created issue #1974 to keep track of it and continue the discussion there.

Contributor

@achirkin achirkin left a comment

Thanks, @yong-wang, for the PR and the thorough performance analysis. I'm looking forward to seeing this merged and getting one step closer to removing the legacy select-k implementations (I've created a follow-up issue: #1975). Just a couple of very small nitpicks.

Review comments on cpp/include/raft/matrix/detail/select_radix.cuh (outdated, resolved)
@yong-wang
Contributor Author

/ok to test

Contributor

@achirkin achirkin left a comment

Perfect, thanks for the extra comments!

@achirkin
Contributor

achirkin commented Nov 9, 2023

/ok to test

@benfred
Member

benfred commented Nov 9, 2023

/merge

@rapids-bot rapids-bot bot merged commit 93e393d into rapidsai:branch-23.12 Nov 9, 2023
61 checks passed
benfred added a commit to benfred/raft that referenced this pull request Nov 10, 2023
With rapidsai#1878 merged, the performance
of the radix select algorithms is much improved and we no longer need
to incorporate the faiss block select algorithm. With rapidsai#1878 merged,
faiss block select goes from being the 3rd ranked selection algorithm,
to the 5th.

This regenerates the heuristic function with the latest benchmark times,
and removes the faiss block select in favour of kWarpImmediate and
kRadix11bitsExtraPass.
rapids-bot bot pushed a commit that referenced this pull request Nov 14, 2023
With #1878 merged, the performance of the radix select algorithms is much improved and we no longer need to incorporate the faiss block select algorithm. With #1878 merged, faiss block select goes from being the 3rd ranked selection algorithm, to the 5th.

This regenerates the heuristic function with the latest benchmark times, and removes the faiss block select in favour of kWarpImmediate and kRadix11bitsExtraPass.

Authors:
  - Ben Frederickson (https://github.com/benfred)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: #1985
Labels
cpp, improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change)

Successfully merging this pull request may close these issues.

[BUG][Performance] matrix::detail::select::radix: inefficient "one-block" kernel launch