Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ivf_flat::index: hide implementation details #747

Merged
merged 147 commits into from
Aug 24, 2022
Merged
Changes from 1 commit
Commits
Show all changes
147 commits
Select commit Hold shift + click to select a range
35ab60d
inital commit and formatting cleanup
achirkin May 13, 2022
24e8c4d
update save/load index function to work with cuann benchmark suite, s…
achirkin May 16, 2022
cb8bcd2
Added benchmarks.
achirkin May 16, 2022
884723c
Merge branch 'branch-22.06' into fea-knn-ivf-flat
achirkin May 17, 2022
8c4a0a0
Add a missing parameter docs
achirkin May 17, 2022
070fd05
Adapt to the changes in the warpsort api
achirkin May 17, 2022
83b6630
cleanup: use WarpSize constant
achirkin May 17, 2022
3a2703c
cleanup: remove unnecessary helpers
achirkin May 17, 2022
31bbaec
Use a more efficient warp_sort_filtered
achirkin May 17, 2022
4b40181
Recover files that have only non-relevant changes to reduce the size …
achirkin May 17, 2022
7e3041c
wip: replacing explicit allocations with rmm buffers
achirkin May 17, 2022
f6556b7
Merge branch 'branch-22.06' into fea-knn-ivf-flat
achirkin May 17, 2022
f75761f
Merge branch 'branch-22.06' into fea-knn-ivf-flat
achirkin May 18, 2022
dd558b4
Update cpp/include/raft/spatial/knn/detail/ann_quantized_faiss.cuh
achirkin May 18, 2022
94b3cbe
Update cpp/include/raft/spatial/knn/detail/ann_quantized_faiss.cuh
achirkin May 18, 2022
2be45a9
wip: replace cudaMemcpy with raft::copy
achirkin May 18, 2022
30c32a9
Simplified some cudaMemcpy invocations
achirkin May 18, 2022
c8e7b4d
Merge branch 'branch-22.06' into fea-knn-ivf-flat
achirkin May 19, 2022
150a438
Refactoring with helper functions
achirkin May 19, 2022
ddfb8cc
Make the scratch buf 3x L2 cache size
achirkin May 19, 2022
b788e2e
Remove serialization code for now
achirkin May 19, 2022
3e1c14d
remove obsolete comment
achirkin May 19, 2022
a001999
Add a missing sync
achirkin May 19, 2022
2d08271
Rename ann_quantized_faiss
achirkin May 19, 2022
0f88aaa
wip from manual allocations to rmm: updated some parts with pointer r…
achirkin May 19, 2022
363dfc9
wip from manual allocations to rmm
achirkin May 19, 2022
e5399f8
fix style
achirkin May 19, 2022
306f5bf
Set minimum memory pool size in radix_topk to 256 bytes
achirkin May 20, 2022
fd7d2ba
wip malloc-to-rmm: removed most of the manual allocations
achirkin May 20, 2022
403667a
misc cleanup
achirkin May 20, 2022
4c6d563
Refactoing; used raft::handle in place of cublas handle everywhere
achirkin May 20, 2022
3ae52ea
Fix the value type at runtime (use templates instead of runtime dtype)
achirkin May 20, 2022
6fecd7f
ceildiv
achirkin May 20, 2022
174854f
Use rmm's memory pool in place of explicitly allocated buffers
achirkin May 20, 2022
b45b14c
Merge branch 'branch-22.06' into fea-knn-ivf-flat
achirkin May 20, 2022
ca1aaad
Use raft logging
achirkin May 24, 2022
4228a02
Merge branch 'branch-22.06' into fea-knn-ivf-flat
achirkin May 24, 2022
70d84ec
Updated logging and nvtx markers
achirkin May 24, 2022
f9c12f8
clang-format
achirkin May 24, 2022
17968e4
Merge branch 'branch-22.06' into fea-knn-ivf-flat
achirkin May 24, 2022
957ac94
Use the recommended logger header
achirkin May 24, 2022
ccfbccc
Use warpsort for smaller k
achirkin May 25, 2022
7819397
Using raft helpers
achirkin May 25, 2022
510c467
Determine the template parameters Capacity and Veclen recursively
achirkin May 25, 2022
c5087be
wip: refactoring and reducing duplicate calls
achirkin May 26, 2022
f850a4a
Merge branch 'branch-22.06' into fea-knn-ivf-flat
achirkin May 27, 2022
c5f1c89
Refactor and document ann_ivf_flat_kernel
achirkin May 27, 2022
7b2b9ff
Documenting and refactoring the kernel
achirkin May 27, 2022
913edfb
Merge branch 'branch-22.06' into fea-knn-ivf-flat
achirkin May 30, 2022
b1208ed
Add a case of high dimensionality
achirkin May 31, 2022
a30ade5
Add more sync into the test to detect device errors
achirkin May 31, 2022
84db732
Add more sync into the test to detect device errors
achirkin May 31, 2022
346afb2
Allow large batch sizes and document more functions
achirkin May 31, 2022
fc201b5
Add a lower bound on expected recall
achirkin May 31, 2022
4021ea2
Compure required memory dynamically
achirkin May 31, 2022
ea8b1c4
readability quickfix
achirkin May 31, 2022
d8a034a
Correct the smem size for the warpsort and add launch bounds
achirkin May 31, 2022
d97d248
Add couple checks against floating point exceptions
achirkin Jun 1, 2022
2e64037
Don't run kmeans on empty dataset
achirkin Jun 2, 2022
9ed50ac
Order all ops by a cuda stream
achirkin Jun 2, 2022
1f9352c
Update comments
achirkin Jun 2, 2022
c048af2
Suggest replacing _cuann_sqsum
achirkin Jun 2, 2022
96f39a8
wip: refactoting utils
achirkin Jun 2, 2022
888daeb
minor comments
achirkin Jun 2, 2022
e6ff267
ann_utils refactoring, docs, and clang-tidy
achirkin Jun 3, 2022
426f713
Merge branch 'branch-22.06' into fea-knn-ivf-flat
achirkin Jun 7, 2022
bacb402
Refactor tests and reduce their memory footprint
achirkin Jun 7, 2022
4042b28
Refactored and documents ann_kmeans_balanced
achirkin Jun 7, 2022
bb5726b
Use memory_resource for temp data in kmeans
achirkin Jun 7, 2022
810c26b
Address clang-tidy and other refactoring suggestions
achirkin Jun 8, 2022
042c410
Move part of the index building onto gpu
achirkin Jun 8, 2022
7ace0fb
Document the index building kernel
achirkin Jun 15, 2022
e9c0d49
Merge branch 'branch-22.08' into fea-knn-ivf-flat
achirkin Jun 15, 2022
3515715
Added a dims padding todo
achirkin Jun 15, 2022
6bd6560
Move kmeans-related allocations and routines to ann_kmeans_balanced.cuh
achirkin Jun 15, 2022
2811814
Add documentation to the build_optimized_kmeans
achirkin Jun 15, 2022
fc3e46e
Using mdarrays and structured index
achirkin Jun 16, 2022
fb8c4b1
Fixed a memory leak and introduced a few assertions to check pointer …
achirkin Jun 17, 2022
f3b2cb2
Merge branch 'branch-22.08' into fea-knn-ivf-flat
cjnolet Jun 17, 2022
092d428
Refactoring build_optimized_kmeans
achirkin Jun 17, 2022
fbcb16b
A few smaller refactorings for kmeans
achirkin Jun 17, 2022
29ca199
Add docs to public methods of the handle
achirkin Jun 20, 2022
38b3cec
Made the metric be a part of the index struct and set the greater_ = …
achirkin Jun 21, 2022
d19bb5f
Do not persist grid_dim_x between searches
achirkin Jun 21, 2022
9094707
Refactor names according to clang-tidy
achirkin Jun 21, 2022
325e201
Refactor the usage of stream and params
achirkin Jun 21, 2022
2a3eb33
Refactor api to have symmetric index/search params
achirkin Jun 21, 2022
867beca
refactor away ivf_flat_index
achirkin Jun 22, 2022
059a6c0
Add the memory resource argument to warp_sort_topk
achirkin Jun 22, 2022
df17b5b
update docs
achirkin Jun 22, 2022
fe9ced1
Allow empty mesoclusters
achirkin Jun 23, 2022
91fdcbb
Add low-dimensional and non-veclen-aligned-dimensional test cases
achirkin Jun 23, 2022
be14c63
Refactor and document loadAndComputeDist
achirkin Jun 23, 2022
eeb4601
Minor renamings
achirkin Jun 23, 2022
025e5a5
Add 8bit int types to knn benchmarks
achirkin Jun 23, 2022
3821366
Fix incorrect data mapping for int8 types
achirkin Jun 24, 2022
d596842
Merge branch 'branch-22.08' into fea-knn-ivf-flat
achirkin Jun 24, 2022
a29baa7
Introduce kIndexGroupSize constant
achirkin Jun 27, 2022
546bef8
Cleanup ann_quantized
achirkin Jun 27, 2022
32d0d2e
Add several type aliases and helpers for creating mdarrays
achirkin Jun 27, 2022
5f427c0
Remove unnecessary inlines and fix docs
achirkin Jun 28, 2022
c581fe2
More refactoring and a few forceinlines
achirkin Jun 28, 2022
805e78c
Add a helper for creating pool_memory_resource when it makes sense
achirkin Jun 29, 2022
a4973e6
Force move the mdarrays when creating index to avoid copying them
achirkin Jun 29, 2022
68c267e
Minor refactorings
achirkin Jun 29, 2022
f2b8ed8
Add nvtx annotations to the outermost ANN calls for better performanc…
achirkin Jun 29, 2022
f91c7f7
Add a few more test cases and annotations for them
achirkin Jun 29, 2022
84b1c5b
Fix a typo
achirkin Jun 29, 2022
afc1f6a
Move ensure_integral_extents to the detail folder
achirkin Jun 30, 2022
3a10f86
Lift the requirement to have query pointers aligned with Veclen
achirkin Jun 30, 2022
9f5c64c
Merge branch 'branch-22.08' into enh-mdarray-helpers
achirkin Jun 30, 2022
1afd667
Use move semantics for the index everywhere, but try to keep it const…
achirkin Jun 30, 2022
73ce9e1
Update documentation
achirkin Jun 30, 2022
2a45645
Remove the debug path USE_FAISS
achirkin Jun 30, 2022
75a48b4
Add a type trait for checking if the conversion between two numeric t…
achirkin Jul 1, 2022
ed25cae
Merge branch 'branch-22.08' into fea-knn-ivf-flat
achirkin Jul 1, 2022
388200c
Support 32bit and unsigned indices in bruteforce KNN
achirkin Jul 1, 2022
f08df83
Merge branch 'enh-mdarray-helpers' into fea-knn-ivf-flat
achirkin Jul 1, 2022
9200886
Merge branch 'enh-knn-bruteforce-uint32' into fea-knn-ivf-flat
achirkin Jul 1, 2022
14bfe02
Make index type a template parameter
achirkin Jul 1, 2022
1283cbe
Revert the api changes as much as possible and deprecate the old api
achirkin Jul 1, 2022
e73b259
Remove the stream argument from the public API
achirkin Jul 4, 2022
8e7ffb8
Merge branch 'branch-22.08' into fea-knn-ivf-flat
achirkin Jul 5, 2022
5f5dc0d
Merge branch 'branch-22.08' into fea-knn-ivf-flat
achirkin Jul 5, 2022
03ebbe0
Simplify kmeans::predict a little bit
achirkin Jul 6, 2022
cde7f97
Factor out predict from the other ops in kmeans for use outside of th…
achirkin Jul 7, 2022
305bbcd
Add new function extend(index, new_vecs, new_inds) to ivf_flat
achirkin Jul 20, 2022
76c383f
Merge branch 'branch-22.08' into fea-knn-ivf-flat
achirkin Jul 21, 2022
7f640a9
Improve the docs
achirkin Jul 21, 2022
2e9eda5
Fix using non-existing log function
achirkin Jul 21, 2022
dc62a0f
Hide all data components from ifv_flat::index and expose immutable views
achirkin Jul 21, 2022
fb841c3
Replace thurst::exclusive_scan with thrust::inclusive_scan to avoid a…
achirkin Jul 22, 2022
04bb5dc
Merge branch 'fea-knn-ivf-flat' into enh-knn-ivf-flat-hide-impl
achirkin Jul 22, 2022
c95ea85
ann_common.h: remove deps on cuda code, so that the file can be inclu…
achirkin Jul 22, 2022
0c72ee8
ann_common.h: remove deps on cuda code, so that the file can be inclu…
achirkin Jul 22, 2022
0196695
Make helper overloads inline for linking in cuml
achirkin Jul 22, 2022
eb15639
Split processing.hpp into *.cuh and *.hpp to avoid incomplete types
achirkin Jul 22, 2022
e4b2b39
WIP: investigating segmentation fault in cuml test
achirkin Jul 25, 2022
6bc0fcb
Revert the wip-changes from the last commit
achirkin Jul 26, 2022
f599aaf
Merge remote-tracking branch 'origin/fea-knn-ivf-flat' into enh-knn-i…
achirkin Jul 26, 2022
a191410
Merge branch 'branch-22.08' into enh-knn-ivf-flat-hide-impl
achirkin Jul 28, 2022
317ddf3
Enhance documentation
achirkin Jul 28, 2022
114fb63
Fix couple typos in docs
achirkin Jul 28, 2022
1d283ae
Change the data indexing to size_t to make sure the total size (size*…
achirkin Jul 28, 2022
a9bd2d6
Merge branch 'branch-22.08' into enh-knn-ivf-flat-hide-impl
achirkin Aug 2, 2022
f9d55a7
Make ivf_flat::index look a little bit more like knn::sparse api
achirkin Aug 2, 2022
fef6dac
Test both overloads of
achirkin Aug 2, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
wip malloc-to-rmm: removed most of the manual allocations
  • Loading branch information
achirkin committed May 20, 2022
commit fd7d2ba953a35d57cea456e9a03f98467ad3c9e3
185 changes: 67 additions & 118 deletions cpp/include/raft/spatial/knn/detail/ann_ivf_flat.cuh
Original file line number Diff line number Diff line change
Expand Up @@ -146,6 +146,8 @@ class cuivflHandle {
uint32_t getDim();

private:
rmm::cuda_stream_view stream_; // The stream for build and search

uint32_t device_;
cublasHandle_t cublas_handle_;
cudaDataType_t dtype_;
Expand All @@ -160,39 +162,35 @@ class cuivflHandle {
size_t ninterleave_; // The number of elements in 32 interleaved group for input dataset
size_t buf_topk_size_; // The size of buffer used for topk select.
size_t floatQuerySize; // The size of float converted queries from int8_t/uint8_t
rmm::cuda_stream_view stream_; // The stream for build and search
uint32_t veclen; // The vectorization length of dataset in index.
uint32_t gridDimX_; // The number of blocks launched across nprobe.
uint32_t veclen; // The vectorization length of dataset in index.
uint32_t gridDimX_; // The number of blocks launched across nprobe.

private:
// device pointer
// The device memory pointer; inverted list for data; size [ninterleave_, dim_]
void* list_data_dev_ptr_;

// The device memory pointer; inverted list for index; size [ninterleave_]
uint32_t* list_index_dev_ptr_;
rmm::device_uvector<uint32_t> list_index_dev_;
// The device memory pointer; Used for list_data_manage_ptr_; size [nlist_]
uint32_t* list_prefix_interleaved_dev_ptr_;
rmm::device_uvector<uint32_t> list_prefix_interleaved_dev_;
// The device memory pointer; the number of each cluster(list); size [nlist_]
uint32_t* list_lengths_dev_ptr_;
rmm::device_uvector<uint32_t> list_lengths_dev_;
// The device memory pointer; centriod; size [nlist_, dim_]
float* centriod_dev_ptr_;
rmm::device_uvector<float> centriod_dev_;
// The device memory pointer; centriod norm ; size [nlist_, dim_]
float* centriod_norm_dev_ptr_;
rmm::device_uvector<float> centriod_norm_dev_;

// host pointer
// The host memory pointer; inverted list for data; size [ninterleave_, dim_]
void* list_data_host_ptr_;
// The host memory pointer; inverted list for index; size [ninterleave_]
uint32_t* list_index_host_ptr_;
std::vector<uint32_t> list_index_host_;
// The host memory pointer; Used for list_data_manage_ptr_; size [nlist_]
uint32_t* list_prefix_interleaved_host_ptr_;
std::vector<uint32_t> list_prefix_interleaved_host_;
// The host memory pointer; the number of each cluster(list); size [nlist_]
uint32_t* list_lengths_host_ptr_;
// The host memory pointer; centriod; size [nlist_, dim_]
float* centriod_host_ptr_;
// The host memory pointer; centriod norm ; size [nlist_, dim_]
float* centriod_norm_host_ptr_;
std::vector<uint32_t> list_lengths_host_;

// The device memory; used for topk select.
void* buf_dev_ptr_;

Expand Down Expand Up @@ -220,17 +218,24 @@ cuivflHandle::cuivflHandle(raft::distance::DistanceType metric_type,
uint32_t nlist,
uint32_t niter,
uint32_t device)
: stream_(rmm::cuda_stream_default),
device_(device),
dim_(dim),
nlist_(nlist),
niter_(niter),
metric_type_(metric_type),
list_index_dev_(0, stream_),
list_prefix_interleaved_dev_(0, stream_),
list_lengths_dev_(0, stream_),
centriod_dev_(0, stream_),
centriod_norm_dev_(0, stream_),
list_index_host_(0),
list_prefix_interleaved_host_(0),
list_lengths_host_(0)
{
// Device
device_ = device;
dim_ = dim;
nlist_ = nlist;
niter_ = niter;
metric_type_ = metric_type;
floatQuerySize = 0;
veclen = 1;
gridDimX_ = 0;
stream_ = rmm::cuda_stream_default;

if ((dim % 4) == 0) {
veclen = 4;
Expand All @@ -247,19 +252,8 @@ cuivflHandle::cuivflHandle(raft::distance::DistanceType metric_type,
throw cuivflStatus_t::CUIVFL_STATUS_CUBLAS_ERROR;
}

list_data_dev_ptr_ = nullptr;
list_index_dev_ptr_ = nullptr;
list_prefix_interleaved_dev_ptr_ = nullptr;
list_lengths_dev_ptr_ = nullptr;
centriod_dev_ptr_ = nullptr;
centriod_norm_dev_ptr_ = nullptr;

list_data_host_ptr_ = nullptr;
list_index_host_ptr_ = nullptr;
list_prefix_interleaved_host_ptr_ = nullptr;
list_lengths_host_ptr_ = nullptr;
centriod_host_ptr_ = nullptr;
centriod_norm_host_ptr_ = nullptr;
list_data_dev_ptr_ = nullptr;
list_data_host_ptr_ = nullptr;

buf_dev_ptr_ = nullptr;
hierarchialClustering_ = true;
Expand All @@ -274,51 +268,10 @@ cuivflHandle::~cuivflHandle()
cudaFree(list_data_dev_ptr_);
list_data_dev_ptr_ = nullptr;
}
if (list_index_dev_ptr_ != nullptr) {
cudaFree(list_index_dev_ptr_);
list_index_dev_ptr_ = nullptr;
}
if (list_prefix_interleaved_dev_ptr_ != nullptr) {
cudaFree(list_prefix_interleaved_dev_ptr_);
list_prefix_interleaved_dev_ptr_ = nullptr;
}
if (list_lengths_dev_ptr_ != nullptr) {
cudaFree(list_lengths_dev_ptr_);
list_lengths_dev_ptr_ = nullptr;
}
if (centriod_dev_ptr_ != nullptr) {
cudaFree(centriod_dev_ptr_);
centriod_dev_ptr_ = nullptr;
}
if (centriod_norm_dev_ptr_ != nullptr) {
cudaFree(centriod_norm_dev_ptr_);
centriod_norm_dev_ptr_ = nullptr;
}

if (list_data_host_ptr_ != nullptr) {
free(list_data_host_ptr_);
list_data_host_ptr_ = nullptr;
}
if (list_index_host_ptr_ != nullptr) {
free(list_index_host_ptr_);
list_index_host_ptr_ = nullptr;
}
if (list_prefix_interleaved_host_ptr_ != nullptr) {
free(list_prefix_interleaved_host_ptr_);
list_prefix_interleaved_host_ptr_ = nullptr;
}
if (list_lengths_host_ptr_ != nullptr) {
free(list_lengths_host_ptr_);
list_lengths_host_ptr_ = nullptr;
}
if (centriod_host_ptr_ != nullptr) {
free(centriod_host_ptr_);
centriod_host_ptr_ = nullptr;
}
if (centriod_norm_host_ptr_ != nullptr) {
free(centriod_norm_host_ptr_);
centriod_norm_host_ptr_ = nullptr;
}
cublasDestroy(cublas_handle_);
} // end func cuivflHandle::cuivflHand

Expand Down Expand Up @@ -630,7 +583,7 @@ cuivflStatus_t cuivflHandle::cuivflBuildIndex(const void* dataset,
stream_ = stream;

rmm::mr::managed_memory_resource managed_memory;
rmm::device_uvector<float> centriod_manage_buf(nlist_ * dim_, stream, &managed_memory);
rmm::device_uvector<float> centriod_manage_buf(nlist_ * dim_, stream_, &managed_memory);
auto centriod_manage_ptr = centriod_manage_buf.data();

if (this == NULL || nrow_ == 0) { return CUIVFL_STATUS_NOT_INITIALIZED; }
Expand All @@ -639,39 +592,37 @@ cuivflStatus_t cuivflHandle::cuivflBuildIndex(const void* dataset,
}

// Alloc manage memory for centriods, trainset and workspace
rmm::device_uvector<uint32_t> datasetLabels_buf(nrow_, stream, &managed_memory); // [numDataset]
rmm::device_uvector<uint32_t> datasetLabels_buf(nrow_, stream_, &managed_memory); // [numDataset]
auto datasetLabels = datasetLabels_buf.data();

// Step 3: Predict labels of the whole dataset
cuivflBuildOptimizedKmeans(
centriod_manage_ptr, dataset, trainset, datasetLabels, dtype, nrow, ntrain, stream);
centriod_manage_ptr, dataset, trainset, datasetLabels, dtype, nrow, ntrain, stream_);

// Step 3.2: Calculate the L2 related result
centriod_norm_host_ptr_ = (float*)malloc(sizeof(float) * nlist_);
RAFT_CUDA_TRY(cudaMalloc(&centriod_norm_dev_ptr_, sizeof(float) * nlist_));
centriod_norm_dev_.resize(nlist_, stream_);

if (metric_type_ == raft::distance::DistanceType::L2Expanded) {
utils::_cuann_sqsum(nlist_, dim_, centriod_manage_ptr, centriod_norm_dev_ptr_);
utils::_cuann_sqsum(nlist_, dim_, centriod_manage_ptr, centriod_norm_dev_.data());
#ifdef DEBUG_L2
printDevPtr(centriod_norm_dev_ptr_, 20, "centriod_norm_dev_ptr_");
printDevPtr(centriod_norm_dev_.data(), 20, "centriod_norm_dev_");
#endif
}

// Step 4: Record the number of elements in each clusters
RAFT_CUDA_TRY(cudaDeviceSynchronize());
list_lengths_host_ptr_ = (uint32_t*)malloc(sizeof(uint32_t) * nlist_);
list_prefix_interleaved_host_ptr_ = (uint32_t*)malloc(sizeof(uint32_t) * nlist_);
memset(list_lengths_host_ptr_, 0, sizeof(uint32_t) * nlist_);

list_prefix_interleaved_host_.resize(nlist_);
list_lengths_host_.assign(nlist_, 0);
for (uint32_t i = 0; i < nrow_; i++) {
uint32_t id_cluster = datasetLabels[i];
list_lengths_host_ptr_[id_cluster] += 1;
list_lengths_host_[id_cluster] += 1;
}

ninterleave_ = 0;
for (uint32_t i = 0; i < nlist_; i++) {
list_prefix_interleaved_host_ptr_[i] = ninterleave_;
ninterleave_ += ((list_lengths_host_ptr_[i] - 1) / WarpSize + 1) * WarpSize;
list_prefix_interleaved_host_[i] = ninterleave_;
ninterleave_ += ((list_lengths_host_[i] - 1) / WarpSize + 1) * WarpSize;
}

if (dtype == CUDA_R_32F) {
Expand All @@ -684,9 +635,8 @@ cuivflStatus_t cuivflHandle::cuivflBuildIndex(const void* dataset,
list_data_host_ptr_ = malloc(sizeof(int8_t) * ninterleave_ * dim_);
memset(list_data_host_ptr_, 0, sizeof(int8_t) * ninterleave_ * dim_);
}
list_index_host_ptr_ = (uint32_t*)malloc(sizeof(uint32_t) * ninterleave_);
memset(list_index_host_ptr_, 0, sizeof(uint32_t) * ninterleave_);
memset(list_lengths_host_ptr_, 0, sizeof(uint32_t) * nlist_);
list_index_host_.assign(ninterleave_, 0);
list_lengths_host_.assign(nlist_, 0);

if ((dtype == CUDA_R_8I) || (dtype == CUDA_R_8U)) {
if ((dim_ % 16) == 0) {
Expand All @@ -698,8 +648,8 @@ cuivflStatus_t cuivflHandle::cuivflBuildIndex(const void* dataset,

for (size_t i = 0; i < nrow_; i++) {
uint32_t id_cluster = datasetLabels[i];
uint32_t current_add = list_lengths_host_ptr_[id_cluster];
uint32_t interleave_add = list_prefix_interleaved_host_ptr_[id_cluster];
uint32_t current_add = list_lengths_host_[id_cluster];
uint32_t interleave_add = list_prefix_interleaved_host_[id_cluster];

if (dtype == CUDA_R_32F) {
float* list_data = (float*)list_data_host_ptr_;
Expand All @@ -717,17 +667,15 @@ cuivflStatus_t cuivflHandle::cuivflBuildIndex(const void* dataset,
_ivfflat_interleaved(
list_data, ori_data + i * dim_, dim_, current_add, interleave_add, veclen);
}
list_index_host_ptr_[interleave_add + current_add] = i;
list_lengths_host_ptr_[id_cluster] += 1;
list_index_host_[interleave_add + current_add] = i;
list_lengths_host_[id_cluster] += 1;
}

RAFT_CUDA_TRY(cudaMalloc(&centriod_dev_ptr_, sizeof(float) * nlist_ * dim_));
copy(centriod_dev_ptr_, centriod_manage_ptr, nlist_ * dim_, stream);

// Store index on GPU memory: temp WAR until we've entire index building buffers on device
RAFT_CUDA_TRY(cudaMalloc(&list_prefix_interleaved_dev_ptr_, sizeof(uint32_t) * nlist_));
RAFT_CUDA_TRY(cudaMalloc(&list_lengths_dev_ptr_, sizeof(uint32_t) * nlist_));
RAFT_CUDA_TRY(cudaMalloc(&list_index_dev_ptr_, sizeof(uint32_t) * ninterleave_));
list_index_dev_.resize(ninterleave_, stream_);
list_prefix_interleaved_dev_.resize(nlist_, stream_);
list_lengths_dev_.resize(nlist_, stream_);
centriod_dev_.resize(nlist_ * dim_, stream_);

if (dtype_ == CUDA_R_32F) {
RAFT_CUDA_TRY(cudaMalloc(&list_data_dev_ptr_, sizeof(float) * ninterleave_ * dim_));
Expand All @@ -738,15 +686,16 @@ cuivflStatus_t cuivflHandle::cuivflBuildIndex(const void* dataset,
}

// Step 3: Read the list
copy(list_prefix_interleaved_dev_ptr_, list_prefix_interleaved_host_ptr_, nlist_, stream);
copy(list_lengths_dev_ptr_, list_lengths_host_ptr_, nlist_, stream);
copy(list_prefix_interleaved_dev_.data(), list_prefix_interleaved_host_.data(), nlist_, stream_);
copy(list_lengths_dev_.data(), list_lengths_host_.data(), nlist_, stream_);
copy(centriod_dev_.data(), centriod_manage_ptr, nlist_ * dim_, stream_);

RAFT_CUDA_TRY(cudaMemcpyAsync(list_data_dev_ptr_,
list_data_host_ptr_,
utils::cuda_datatype_size(dtype_) * ninterleave_ * dim_,
cudaMemcpyHostToDevice,
stream));
copy(list_index_dev_ptr_, list_index_host_ptr_, ninterleave_, stream);
stream_));
copy(list_index_dev_.data(), list_index_host_.data(), ninterleave_, stream_);

return cuivflStatus_t::CUIVFL_STATUS_SUCCESS;
} // end func cuivflBuildIndex
Expand Down Expand Up @@ -1001,9 +950,9 @@ cuivflStatus_t cuivflHandle::cuivflSearchImpl(const T* queries, // [numQueries,
beta = 1.0f;
utils::_cuann_sqsum(batch_size, dim_, convertedQueries, query_norm_dev_ptr);
utils::_cuann_outer_add(
query_norm_dev_ptr, batch_size, centriod_norm_dev_ptr_, nlist_, distance_buffer_dev_ptr);
query_norm_dev_ptr, batch_size, centriod_norm_dev_.data(), nlist_, distance_buffer_dev_ptr);
#ifdef DEBUG_L2
utils::printDevPtr(centriod_norm_dev_ptr_, 20, "centriod_norm_dev_ptr_");
utils::printDevPtr(centriod_norm_dev_.data(), 20, "centriod_norm_dev_");
utils::printDevPtr(distance_buffer_dev_ptr, 20, "distance_buffer_dev_ptr");
#endif
} else {
Expand All @@ -1018,7 +967,7 @@ cuivflStatus_t cuivflHandle::cuivflSearchImpl(const T* queries, // [numQueries,
batch_size,
dim_,
&alpha,
centriod_dev_ptr_,
centriod_dev_.data(),
CUDA_R_32F,
dim_,
convertedQueries,
Expand Down Expand Up @@ -1059,10 +1008,10 @@ cuivflStatus_t cuivflHandle::cuivflSearchImpl(const T* queries, // [numQueries,
if constexpr (std::is_same<T, float>{}) {
ivfflat_interleaved_scan<float, float>(queries,
coarse_indices_dev_ptr,
list_index_dev_ptr_,
list_index_dev_.data(),
list_data_dev_ptr_,
list_lengths_dev_ptr_,
list_prefix_interleaved_dev_ptr_,
list_lengths_dev_.data(),
list_prefix_interleaved_dev_.data(),
metric_type_,
nprobe,
k,
Expand All @@ -1078,10 +1027,10 @@ cuivflStatus_t cuivflHandle::cuivflSearchImpl(const T* queries, // [numQueries,
// we use int32_t for accumulation, and final store in fp32
ivfflat_interleaved_scan<uint8_t, uint32_t>(queries,
coarse_indices_dev_ptr,
list_index_dev_ptr_,
list_index_dev_.data(),
list_data_dev_ptr_,
list_lengths_dev_ptr_,
list_prefix_interleaved_dev_ptr_,
list_lengths_dev_.data(),
list_prefix_interleaved_dev_.data(),
metric_type_,
nprobe,
k,
Expand All @@ -1096,10 +1045,10 @@ cuivflStatus_t cuivflHandle::cuivflSearchImpl(const T* queries, // [numQueries,
} else if constexpr (std::is_same<T, int8_t>{}) {
ivfflat_interleaved_scan<int8_t, int32_t>(queries,
coarse_indices_dev_ptr,
list_index_dev_ptr_,
list_index_dev_.data(),
list_data_dev_ptr_,
list_lengths_dev_ptr_,
list_prefix_interleaved_dev_ptr_,
list_lengths_dev_.data(),
list_prefix_interleaved_dev_.data(),
metric_type_,
nprobe,
k,
Expand Down