-
Notifications
You must be signed in to change notification settings - Fork 197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix and improve one-block radix select #1878
Fix and improve one-block radix select #1878
Conversation
/ok to test |
/ok to test |
Using smaller buffer for To alleviate register pressure, As a result, the occupancy is kept almost unchanged for this PR. For sm80, #register and occupancy are the same before and after this PR:
For sm90, #register and occupancy change just a little after this PR (
|
Benchmark results. (1) Perf for inputs containing inf (e.g. label="10#1000000#1#infs-0.1") changes a lot. It dues to the change of buffer size (decreased by 1/8) so writing buffer happens less often. The speedup data for this PR is: For A100:
For H100:
Although there are slowdown cases for kRadix8bits, e.g. -15%, most cases are faster. (2) Perf for cases with label "same-leading-bits" is almost the same, so I omit the data here. |
(3) Cases with label "20000#500", that is, batch size = 20000 and len=500, become slower: For A100:
For H100:
The reason is that the chunk size is chosen to be 20000 before, and becomes 4320 after this PR. So five kernels are run instead of just one. It's easy to fix this by setting minimal chunk size. For example, if 10 million elements per chunk is allowed, we can set minimal chunk size to be 10M/len= 10M/500 = 20000. It's somewhat reasonable because a few million elements should not take too much memory. However, it feels like overfitting the benchmark parameters. More aggressively, it's possible that we remove chunking. The reason is that the buffer size is quite small after this PR. Take float-uint32_t data as an example, the buffer size is 1/32 of len. There are 4 buffers so the total buffer size is 1/8 of input data. It means the memory overhead is 1/8 even if we don't use chunking at all. Any suggestion about these options? |
(4) For other cases, the overall effects are positive:
For H100:
(5) Beside the cases in current benchmark, I also run the case batch=4000 and len=500000. It's similar to the slow case reported by Artem. It's the case that this PR supposed to solve.
Speedup is higher if len is larger (but uses more memory). But I'm not sure whether to add such case to the benchmark. It uses 4e3 * 5e5 * 8 = 16GB for the input of double data type. Some GPUs might fail to run this, so it can be inconvenient for testing. |
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @yong-wang for these changes! It is nice to see these perf improvements! The PR looks good to me.
Regarding the open question about the parameter combination where we see a regression, I think we shall address that in a follow up. I have created Issue #1974 to keep track and continue the discussion there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @yong-wang for the PR and the thorough performance analysis. I'm looking forward to see this merged and getting one step closer to removing the legacy select-k implementations (I've created an issue for a follow-up #1975). Just a couple very small nitpicks.
/ok to test |
Fix small typos in the comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect, thanks for the extra comments!
/ok to test |
/merge |
With rapidsai#1878 merged, the performance of the radix select algorithms is much improved and we no longer need to incorporate the faiss block select algorithm. With rapidsai#1878 merged, faiss block select goes from being the 3rd ranked selection algorithm, to the 5th. This regenerates the heuristic function with the latest benchmark times, and removes the faiss block select in favour of kWarpImmediate and kRadix11bitsExtraPass.
With #1878 merged, the performance of the radix select algorithms is much improved and we no longer need to incorporate the faiss block select algorithm. With #1878 merged, faiss block select goes from being the 3rd ranked selection algorithm, to the 5th. This regenerates the heuristic function with the latest benchmark times, and removes the faiss block select in favour of kWarpImmediate and kRadix11bitsExtraPass. Authors: - Ben Frederickson (https://github.com/benfred) Approvers: - Tamas Bela Feher (https://github.com/tfeher) URL: #1985
calc_buf_len()
rather thanlen
as the buffer size of one-block kernelResolve #1823