Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add python bindings for kmeans fit #1016

Merged
merged 13 commits into from
Nov 30, 2022

Conversation

benfred
Copy link
Member

@benfred benfred commented Nov 15, 2022

As a first attempt at exposing the kmeans inertia - this adds python bindings for the kmeans::fit function, which returns the inertia as part of the results.

This also adds support for accessing mdspan api's from cython directly

As a first attempt at exposing the kmeans inertia - this adds python bindings for
the kmeans::fit function, which returns the inertia as part of the results.

This also adds support for accessing mdspan api's from cython directly
@benfred benfred requested review from a team as code owners November 15, 2022 01:48
@benfred benfred marked this pull request as draft November 15, 2022 01:48
Copy link
Member

@cjnolet cjnolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really clean and, other than a few quirks that I think Cython introduces with the typing, I think this really looks like more of a blend of Python + C++ than having to use raw pointers for everything. Overall, I like it.

cdef handle_t *h = <handle_t*><size_t>handle.getHandle()

x_cai = X.__cuda_array_interface__
dtype = np.dtype(x_cai["typestr"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably have some input validation here- X.shape[1] == centroids.shape[1], n_clusters > 0, etc...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm validating the shape/dtype/c-contiguous values - and checking for things like the n_clusters > 0 , and # of features being consistent between X and centroids should be done in the C++ code already .

python/pylibraft/pylibraft/cluster/kmeans.pyx Outdated Show resolved Hide resolved
void fit(handle_t const& handle,
const KMeansParams& params,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really nice. And I like that this gets us closer to ditching the pointer-based APIs.

cpp/src/distance/kmeans_fit_double.cu Show resolved Hide resolved
cpp/src/distance/kmeans_fit_float.cu Show resolved Hide resolved
python/pylibraft/pylibraft/common/mdspan.pyx Outdated Show resolved Hide resolved

centroids, inertia, n_iter = fit(params, device_ndarray(X))

# TODO: validate that centroids are reasonable ... somehow
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a few different ways to verify this. One way could be to wrap the make_blobs primitive and generate n_clusters blobs, that way we have label information that we can use to compare against each point's nearest centroid.



cdef device_vector_view[ElementType, int] \
device_vector_view_from_array(arr, ElementType * p) except +:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for the function to accept type information directly? Like device_vector_view_from_array[ElementType](arr)?

return make_device_matrix_view(<ElementType*>ptr, <int>rows, <int>cols)


cdef device_matrix_view[const ElementType, int] \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do really like this because it makes it really easy to pass in an array and have it extract the underlying pointer. One thing that stinks here, though, is that the device_matrix_view needs to have type information associated to it at compile-time while the arr's type is only known at runtime.

const_device_matrix_view_from_array[float](X, <float*>NULL),
f_sample_weights,
device_matrix_view_from_array[float](centroids, <float*>NULL),
make_host_scalar_view[float, int](&f_inertia),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part looks really clean, honestly.


from pylibraft.common.input_validation import *
from pylibraft.distance import DISTANCE_TYPES

ctypedef const float const_float


def is_c_cont(cai, dt):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be in the input_validation utils in pylibraft.common now

Rather than wrap in a function using FusedTypes, use make_device_matrix_view
directly
@cjnolet cjnolet added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Nov 21, 2022
@cjnolet
Copy link
Member

cjnolet commented Nov 21, 2022

rerun tests

1 similar comment
@cjnolet
Copy link
Member

cjnolet commented Nov 21, 2022

rerun tests

@benfred benfred changed the base branch from branch-22.12 to branch-23.02 November 21, 2022 20:20
@github-actions github-actions bot added the gpuCI label Nov 28, 2022
@benfred benfred changed the base branch from branch-23.02 to branch-22.12 November 28, 2022 23:47
@benfred
Copy link
Member Author

benfred commented Nov 28, 2022

I changed the branch back to 22.12 until we get the 23.03 branch caught up - which I took a stab at here #1051

@github-actions github-actions bot removed the gpuCI label Nov 29, 2022
@cjnolet cjnolet changed the base branch from branch-22.12 to branch-23.02 November 29, 2022 21:50
Copy link
Member

@cjnolet cjnolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is almost there. Just two minor things and I think it's ready to merge. I'm excited to see the pointer-based runtime APIs disappear!

# TODO: use a fixed RNG state on params
params = KMeansParams(n_clusters=n_clusters)

# fit the centroids, make sure inertia has gone down
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you create a github issue to expose make_blobs in a future PR and reference it here in this comment?

Ultimately, the make_blobs allows us to generate Gaussian clusters that we can directly compare the output against. I think it's out of scope for this PR but we definitely want to revisit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #1059 - and added a TODO in the test

@@ -71,3 +71,19 @@ def data(self):
Returns the data pointer of the underlying CUDA array interface
"""
return self.cai_["data"][0]

def validate(self, expected_dims=None, expected_dtype=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this function but I think the name could be a little more descriptive. Maybe something like validate_shape_dtype?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in last commit

* Add seed/metric/verbosity to python kmeans params
* Add a todo about make_blobs testing
* rename cai_wrapper.validate to validate_shape_dtype
@benfred benfred marked this pull request as ready for review November 30, 2022 19:33
Copy link
Member

@cjnolet cjnolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@cjnolet
Copy link
Member

cjnolet commented Nov 30, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 070cca2 into rapidsai:branch-23.02 Nov 30, 2022
@benfred benfred deleted the cython_kmeans branch November 30, 2022 21:12
@benfred benfred self-assigned this Dec 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake cpp improvement Improvement / enhancement to an existing function non-breaking Non-breaking change python
Projects
Development

Successfully merging this pull request may close these issues.

2 participants