Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cagra index construction without copying device mdarrays #1494

Merged
merged 4 commits into from
Jul 20, 2023

Conversation

tfeher
Copy link
Contributor

@tfeher tfeher commented May 8, 2023

This PR aims to improve the workflow when dealing with large datasets. When experimenting with different versions of the knn-graph, we might want to construct indices with the same dataset (see #1435 for further discussion).

If the dataset is already in device memory (and rows are properly aligned / padded), then we only store a reference to the dataset, therefore multiple indices can refer to the same dataset. Similarly, when knn_graph is a device array, then store only a reference.

Additionally, this PR adds update_dataset and update_graph methods to the index.
Closes #1479

@tfeher tfeher requested a review from a team as a code owner May 8, 2023 09:48
@github-actions github-actions bot added the cpp label May 8, 2023
@tfeher
Copy link
Contributor Author

tfeher commented May 8, 2023

Linking the original discussion: #1375 (comment) and tagging @divyegala and @enp1s0 for review.

@tfeher tfeher self-assigned this May 8, 2023
@tfeher tfeher added improvement Improvement / enhancement to an existing function breaking Breaking change labels May 8, 2023
@tfeher
Copy link
Contributor Author

tfeher commented May 8, 2023

Currently API change is non breaking. Behavior is changed: if device arrays go out of scope while the index is alive, then index would be invalid. Therefore labelled as breaking change.

Copy link
Member

@cjnolet cjnolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple fairly minor things. It would help if we could capture an issue to revisit the sometimes owning semantics. I think it's another great case for the mdbuffer that @wphicks has been working on.

cpp/include/raft/neighbors/cagra_types.hpp Outdated Show resolved Hide resolved
cpp/include/raft/neighbors/cagra_types.hpp Outdated Show resolved Hide resolved
private:
raft::distance::DistanceType metric_;
raft::device_matrix<T, IdxT, row_major> dataset_;
raft::device_matrix<IdxT, IdxT, row_major> graph_;
raft::device_matrix_view<const T, IdxT, row_major> dataset_view_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have some types to help us w/ the sometimes owning semantics that I think can be useful here. I'm okay merging this PR in for release the way it is, but I think we should consider using something like the mdbuffer or TemporaryDeviceBuffer in the future. Though I think the TemporaryDeviceBuffer will require some adjustments to make it useful for this specific case (tagging @divyegala).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point, we shall return to this once we have a better way to express this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to what @cjnolet mentioned here. I think it will be pretty straightforward to transition to mdbuffer once it's available (hopefully very soon).

@tfeher tfeher force-pushed the cagra_index_no_copy branch from 40ba206 to 1227797 Compare July 19, 2023 22:06
@tfeher tfeher requested review from a team as code owners July 19, 2023 22:06
@tfeher tfeher force-pushed the cagra_index_no_copy branch from 1227797 to 1934c59 Compare July 19, 2023 22:12
@tfeher tfeher changed the base branch from branch-23.06 to branch-23.08 July 19, 2023 22:13
Copy link
Contributor Author

@tfeher tfeher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the PR according to the comments.

cpp/include/raft/neighbors/cagra_types.hpp Outdated Show resolved Hide resolved
cpp/test/neighbors/ann_cagra.cuh Outdated Show resolved Hide resolved
private:
raft::distance::DistanceType metric_;
raft::device_matrix<T, IdxT, row_major> dataset_;
raft::device_matrix<IdxT, IdxT, row_major> graph_;
raft::device_matrix_view<const T, IdxT, row_major> dataset_view_;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point, we shall return to this once we have a better way to express this.

cpp/include/raft/neighbors/cagra_types.hpp Outdated Show resolved Hide resolved
cpp/include/raft/neighbors/cagra_types.hpp Outdated Show resolved Hide resolved
Copy link
Member

@cjnolet cjnolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @tfeher!

@cjnolet
Copy link
Member

cjnolet commented Jul 20, 2023

/merge

@rapids-bot rapids-bot bot merged commit db4797f into rapidsai:branch-23.08 Jul 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Breaking change cpp improvement Improvement / enhancement to an existing function Vector Search
Projects
Development

Successfully merging this pull request may close these issues.

[FEA] CAGRA avoid unnecessary device memory copies of dataset / knn graph
5 participants