Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

matrix::select_k: move selection and warp-sort primitives #1085

Merged
merged 42 commits into from
Jan 23, 2023
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
Show all changes
42 commits
Select commit Hold shift + click to select a range
39c10a9
Make warp-level bitonic sort public
achirkin Dec 9, 2022
6cda736
Move spatial::*::select_topk to matrix::select_k
achirkin Dec 9, 2022
c5631bf
Fix includes style
achirkin Dec 9, 2022
fb88433
Use cmake-format
achirkin Dec 9, 2022
f64325b
Refactored warpsort module and made tests for all implementations in …
achirkin Dec 13, 2022
20d01d7
Resort to UVM when radix buffers are too big
achirkin Dec 14, 2022
4813bae
Adjust the dummy_block_sort_t to the changes in the warpsort impl
achirkin Dec 14, 2022
6cdb79a
Fix incorrect include
achirkin Dec 14, 2022
870fc86
Add benchmarks
achirkin Dec 14, 2022
2af45bf
Update CMakeLists.txt style
achirkin Dec 14, 2022
5b336ee
Update CMakeLists.txt style
achirkin Dec 14, 2022
b3e5d9c
Add mdspanified interface
achirkin Dec 15, 2022
164157b
Remove benchmarks for the legacy interface
achirkin Dec 15, 2022
69c81dd
Remove a TODO comment about a seemingly resolved bug
achirkin Dec 15, 2022
d64b12b
Merge remote-tracking branch 'rapidsai/branch-23.02' into enh-matrix-…
achirkin Dec 15, 2022
9d4476a
Fix the changed include extension
achirkin Dec 15, 2022
3e40435
Fix includes in tests
achirkin Dec 16, 2022
e20578e
Merge remote-tracking branch 'rapidsai/branch-23.02' into enh-matrix-…
achirkin Dec 16, 2022
b2c79f5
Merge branch 'branch-23.02' into enh-matrix-topk
achirkin Dec 20, 2022
98e2c2a
Address comments: bitonic_sort
achirkin Dec 20, 2022
af4c146
Replace stream argument with handle_t
achirkin Dec 20, 2022
471828e
rename files to select.* -> select_k.*
achirkin Dec 20, 2022
f6ff223
Use raft macros
achirkin Dec 20, 2022
066208d
Try to pass null and non-null arguments to select_k
achirkin Dec 20, 2022
aeaa1ef
Remove raw-pointer api from the public namespace
achirkin Dec 20, 2022
685b6bf
Updates public docs (add example usage)
achirkin Dec 21, 2022
5c42209
Merge remote-tracking branch 'rapidsai/branch-23.02' into enh-matrix-…
achirkin Jan 9, 2023
2cea50d
Add device_mem_resource
achirkin Jan 9, 2023
a31e61e
Add Doxygen docs
achirkin Jan 10, 2023
a8c5a70
Merge remote-tracking branch 'rapidsai/branch-23.02' into enh-matrix-…
achirkin Jan 10, 2023
8a5978b
Revert the memory_resource param changes in the detail namespace to a…
achirkin Jan 10, 2023
8e58cab
Merge remote-tracking branch 'rapidsai/branch-23.02' into enh-matrix-…
achirkin Jan 11, 2023
a01a75f
Remove device_mem_resource
achirkin Jan 11, 2023
c6256b7
Merge branch 'branch-23.02' into enh-matrix-topk
achirkin Jan 16, 2023
c25e859
Merge branch 'branch-23.02' into enh-matrix-topk
cjnolet Jan 19, 2023
6e56106
Merge branch 'branch-23.02' into enh-matrix-topk
achirkin Jan 20, 2023
c78d9b0
Reference a TODO issue
achirkin Jan 20, 2023
a55a6cb
Merge branch 'enh-matrix-topk' of github.com:achirkin/raft into enh-m…
achirkin Jan 20, 2023
307b113
Deprecation notice
achirkin Jan 20, 2023
c0ce160
Add [in] annotation to all arguments
achirkin Jan 20, 2023
e2cc7ad
Merge branch 'branch-23.02' into enh-matrix-topk
achirkin Jan 23, 2023
dc3043c
Merge branch 'branch-23.02' into enh-matrix-topk
cjnolet Jan 23, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
Add benchmarks
  • Loading branch information
achirkin committed Dec 14, 2022
commit 870fc8641571d808d3eaa4cd5e87f7df7dbc1a5c
2 changes: 1 addition & 1 deletion cpp/bench/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -102,7 +102,7 @@ if(BUILD_BENCH)
bench/main.cpp
)

ConfigureBench(NAME MATRIX_BENCH PATH bench/matrix/argmin.cu bench/main.cpp)
ConfigureBench(NAME MATRIX_BENCH PATH bench/matrix/argmin.cu bench/matrix/select.cu bench/main.cpp)

ConfigureBench(
NAME RANDOM_BENCH PATH bench/random/make_blobs.cu bench/random/permute.cu bench/random/rng.cu
Expand Down
126 changes: 126 additions & 0 deletions cpp/bench/matrix/select.cu
Original file line number Diff line number Diff line change
@@ -0,0 +1,126 @@
/*
achirkin marked this conversation as resolved.
Show resolved Hide resolved
* Copyright (c) 2022, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include "../../test/matrix/select.cuh"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer not to mix/include test files in benchmark code (and vice versa)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a great deal of shared code in this file! Would you consider adding some shared test+bench header-only component? As far as I know, at the moment we have only one more place where we explicitly refer to a test header from the benchmarks (bench/neighbors/refine.cu#L39), but there could definitely be more.


#include <common/benchmark.hpp>

#include <raft/random/rng.cuh>
#include <raft/sparse/detail/utils.h>
#include <raft/util/cudart_utils.hpp>

#include <raft/matrix/detail/select_radix.cuh>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these are needed to be tested individually, I think we should expose public APIs for them. in general, we should only be benchmarking/testing public APIs.

(I know we are benchmarking the balanced kmeans which is currently in detail, but we're working to move this out into the public).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation variants in radix- and warpsort- files weren't and were never meant to be public. We've been automatically selecting the implementation in the public wrappers based on the runtime arguments. I see a few issues in making all of them public:

  1. There are a few rather experimental implementations (in warpsort- version), which are not meant to be used directly and probably may be deleted after we do extensive benchmarking.
  2. The api of all these functions is changing a little bit, it's going to be difficult to maintain this in the public namespace. We're also waiting for more implementations to arrive :) (faster radix-based selection)
  3. The api not very convenient for an end-user and contains template parameters that are for performance tweaking and implementation detail (block size for radix, warp-level implementation template for warpsort)
  4. All-in-all I don't see any benefit in imposing on the end user a burden of selecting one implementation of an algorithm among a dozen variants. I believe, we should continue to provide just one wrapper and be able to tweak the selection logic as new implementations and optimizations arrive to raft.

At the same time, we cannot remove the tests for individual variants, because the selection logic of the public wrapper makes it impossible to test full input ranges for every variant.
And we really need the individual benchmarks, so that we can compare the implementations and then adjust the selection logic (and repeat this as new implementations and optimizations arrive).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation variants in radix- and warpsort- files weren't and were never meant to be public.

I think this begs the questions as to how much we should expose and how advanced we should consider our users to be. I'm not opposed to keeping these internal if we have a good well-defined (and well-documented) set of rules guiding the choice, resource, and runtime trade-offs of each.

At the same time, we cannot remove the tests for individual variants, because the selection logic of the public wrapper makes it impossible to test full input ranges for every variant.

I'm not saying we should remove the tests for individual variants but testing internals does not replace the need to test the public APIs comprehensively and exhaustively with all available options. After all, the users interact with the public APIs and the quality of their user experience should be better guaranteed by successfully passing tests.

The problem that arises from testing only internals and not also exhaustively testing the public APIs is that we gain a false sense of correctness when a developer makes changes to the public APIs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment, we have 8 implementations/combinations. At least one more (new advances in radix-based select) in the works, and some of the current ones in warp_sort may be strictly worse than others. I hope we can keep all of them in detail, so that we can improve the implementation selection logic and remove some of the implementations in future.

Perhaps, it's not clear behind the templates, but the public API interface is the most tested in test/matrix/select_k.cu. Besides the simple tests against static in-out values, we compare the outputs of the various implementations against the public one; this should ensure that the public-selected-implementation is compared against other non-selected implementations for various parameter ranges.
I think, this should cover the public interface well, but I'd be happy to add more tests if you have any further suggestions.

#include <raft/matrix/detail/select_warpsort.cuh>
#include <raft/matrix/select.cuh>

#include <rmm/device_uvector.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

namespace raft::matrix {

using namespace raft::bench; // NOLINT

template <typename KeyT, typename IdxT, select::Algo Algo>
struct selection : public fixture {
explicit selection(const select::params& p)
: params_(p),
in_dists_(p.batch_size * p.len, stream),
in_ids_(p.batch_size * p.len, stream),
out_dists_(p.batch_size * p.k, stream),
out_ids_(p.batch_size * p.k, stream)
{
raft::sparse::iota_fill(in_ids_.data(), IdxT(p.batch_size), IdxT(p.len), stream);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should get this out of raft::sparse at some point. Nothin to do in this PR but just making a note.

raft::random::RngState state{42};
raft::random::uniform(handle, state, in_dists_.data(), in_dists_.size(), KeyT(-1.0), KeyT(1.0));
}

void run_benchmark(::benchmark::State& state) override // NOLINT
{
using_pool_memory_res res;
try {
std::ostringstream label_stream;
label_stream << params_.batch_size << "#" << params_.len << "#" << params_.k;
state.SetLabel(label_stream.str());
loop_on_state(state, [this]() {
select::select_k_impl<KeyT, IdxT>(Algo,
in_dists_.data(),
in_ids_.data(),
params_.batch_size,
params_.len,
params_.k,
out_dists_.data(),
out_ids_.data(),
params_.select_min,
stream);
});
} catch (raft::exception& e) {
state.SkipWithError(e.what());
}
}

private:
const select::params params_;
rmm::device_uvector<KeyT> in_dists_, out_dists_;
rmm::device_uvector<IdxT> in_ids_, out_ids_;
};

const std::vector<select::params> kInputs{
{20000, 500, 1, true}, {20000, 500, 2, true}, {20000, 500, 4, true},
{20000, 500, 8, true}, {20000, 500, 16, true}, {20000, 500, 32, true},
{20000, 500, 64, true}, {20000, 500, 128, true}, {20000, 500, 256, true},

{1000, 10000, 1, true}, {1000, 10000, 2, true}, {1000, 10000, 4, true},
{1000, 10000, 8, true}, {1000, 10000, 16, true}, {1000, 10000, 32, true},
{1000, 10000, 64, true}, {1000, 10000, 128, true}, {1000, 10000, 256, true},

{100, 100000, 1, true}, {100, 100000, 2, true}, {100, 100000, 4, true},
{100, 100000, 8, true}, {100, 100000, 16, true}, {100, 100000, 32, true},
{100, 100000, 64, true}, {100, 100000, 128, true}, {100, 100000, 256, true},

{10, 1000000, 1, true}, {10, 1000000, 2, true}, {10, 1000000, 4, true},
{10, 1000000, 8, true}, {10, 1000000, 16, true}, {10, 1000000, 32, true},
{10, 1000000, 64, true}, {10, 1000000, 128, true}, {10, 1000000, 256, true},
};

#define SELECTION_REGISTER(KeyT, IdxT, A) \
namespace BENCHMARK_PRIVATE_NAME(selection) \
{ \
using SelectK = selection<KeyT, IdxT, select::Algo::A>; \
RAFT_BENCH_REGISTER(SelectK, #KeyT "/" #IdxT "/" #A, kInputs); \
}

SELECTION_REGISTER(float, int, kPublicApi); // NOLINT
SELECTION_REGISTER(float, int, kRadix8bits); // NOLINT
SELECTION_REGISTER(float, int, kRadix11bits); // NOLINT
SELECTION_REGISTER(float, int, kWarpAuto); // NOLINT
SELECTION_REGISTER(float, int, kWarpImmediate); // NOLINT
SELECTION_REGISTER(float, int, kWarpFiltered); // NOLINT
SELECTION_REGISTER(float, int, kWarpDistributed); // NOLINT
SELECTION_REGISTER(float, int, kWarpDistributedShm); // NOLINT

SELECTION_REGISTER(double, int, kRadix8bits); // NOLINT
SELECTION_REGISTER(double, int, kRadix11bits); // NOLINT
SELECTION_REGISTER(double, int, kWarpAuto); // NOLINT

SELECTION_REGISTER(double, size_t, kRadix8bits); // NOLINT
SELECTION_REGISTER(double, size_t, kRadix11bits); // NOLINT
SELECTION_REGISTER(double, size_t, kWarpImmediate); // NOLINT
SELECTION_REGISTER(double, size_t, kWarpFiltered); // NOLINT
SELECTION_REGISTER(double, size_t, kWarpDistributed); // NOLINT
SELECTION_REGISTER(double, size_t, kWarpDistributedShm); // NOLINT

} // namespace raft::matrix
Loading