-
Notifications
You must be signed in to change notification settings - Fork 197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add python bindings for kmeans fit #1016
Conversation
As a first attempt at exposing the kmeans inertia - this adds python bindings for the kmeans::fit function, which returns the inertia as part of the results. This also adds support for accessing mdspan api's from cython directly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really clean and, other than a few quirks that I think Cython introduces with the typing, I think this really looks like more of a blend of Python + C++ than having to use raw pointers for everything. Overall, I like it.
cdef handle_t *h = <handle_t*><size_t>handle.getHandle() | ||
|
||
x_cai = X.__cuda_array_interface__ | ||
dtype = np.dtype(x_cai["typestr"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably have some input validation here- X.shape[1] == centroids.shape[1], n_clusters > 0, etc...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm validating the shape/dtype/c-contiguous values - and checking for things like the n_clusters > 0 , and # of features being consistent between X and centroids should be done in the C++ code already .
void fit(handle_t const& handle, | ||
const KMeansParams& params, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really nice. And I like that this gets us closer to ditching the pointer-based APIs.
|
||
centroids, inertia, n_iter = fit(params, device_ndarray(X)) | ||
|
||
# TODO: validate that centroids are reasonable ... somehow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a few different ways to verify this. One way could be to wrap the make_blobs
primitive and generate n_clusters
blobs, that way we have label information that we can use to compare against each point's nearest centroid.
|
||
|
||
cdef device_vector_view[ElementType, int] \ | ||
device_vector_view_from_array(arr, ElementType * p) except +: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible for the function to accept type information directly? Like device_vector_view_from_array[ElementType](arr)
?
return make_device_matrix_view(<ElementType*>ptr, <int>rows, <int>cols) | ||
|
||
|
||
cdef device_matrix_view[const ElementType, int] \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do really like this because it makes it really easy to pass in an array and have it extract the underlying pointer. One thing that stinks here, though, is that the device_matrix_view
needs to have type information associated to it at compile-time while the arr
's type is only known at runtime.
const_device_matrix_view_from_array[float](X, <float*>NULL), | ||
f_sample_weights, | ||
device_matrix_view_from_array[float](centroids, <float*>NULL), | ||
make_host_scalar_view[float, int](&f_inertia), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part looks really clean, honestly.
|
||
from pylibraft.common.input_validation import * | ||
from pylibraft.distance import DISTANCE_TYPES | ||
|
||
ctypedef const float const_float | ||
|
||
|
||
def is_c_cont(cai, dt): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be in the input_validation utils in pylibraft.common now
Rather than wrap in a function using FusedTypes, use make_device_matrix_view directly
rerun tests |
1 similar comment
rerun tests |
I changed the branch back to 22.12 until we get the 23.03 branch caught up - which I took a stab at here #1051 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is almost there. Just two minor things and I think it's ready to merge. I'm excited to see the pointer-based runtime APIs disappear!
# TODO: use a fixed RNG state on params | ||
params = KMeansParams(n_clusters=n_clusters) | ||
|
||
# fit the centroids, make sure inertia has gone down |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you create a github issue to expose make_blobs in a future PR and reference it here in this comment?
Ultimately, the make_blobs allows us to generate Gaussian clusters that we can directly compare the output against. I think it's out of scope for this PR but we definitely want to revisit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created #1059 - and added a TODO in the test
@@ -71,3 +71,19 @@ def data(self): | |||
Returns the data pointer of the underlying CUDA array interface | |||
""" | |||
return self.cai_["data"][0] | |||
|
|||
def validate(self, expected_dims=None, expected_dtype=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this function but I think the name could be a little more descriptive. Maybe something like validate_shape_dtype?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in last commit
* Add seed/metric/verbosity to python kmeans params * Add a todo about make_blobs testing * rename cai_wrapper.validate to validate_shape_dtype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@gpucibot merge |
As a first attempt at exposing the kmeans inertia - this adds python bindings for the kmeans::fit function, which returns the inertia as part of the results.
This also adds support for accessing mdspan api's from cython directly