
Protect balanced k-means out-of-memory in some cases #1161

Merged

Conversation

achirkin (Contributor)

There's no guarantee that our balanced k-means implementation always produces balanced clusters. In the first stage, when the mesoclusters are trained, the biggest cluster can grow larger than half of all the input data. This becomes a problem in the second stage, when in build_fine_clusters the mesocluster data is copied into a temporary buffer. If that size is too big, there may not be enough memory on the device. A quick workaround:

  1. Expand the error reporting (RAFT_LOG_WARN)
  2. Artificially limit the mesocluster size in the event of highly unbalanced clustering (see the sketch below)
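
For illustration, a minimal sketch of the capping idea. The helper gather_capped and its signature are hypothetical and not RAFT's actual code; only the loop bound mirrors the change discussed in the diff below.

template <typename IdxT, typename LabelT>
IdxT gather_capped(const LabelT* labels, IdxT n_rows, LabelT i,
                   IdxT mesocluster_size_max, IdxT* mc_trainset_ids)
{
  // Collect at most mesocluster_size_max row indices with label i, so the temporary
  // buffer copied in build_fine_clusters stays bounded even when one mesocluster
  // absorbs most of the input data.
  IdxT k = 0;
  for (IdxT j = 0; j < n_rows && k < mesocluster_size_max; j++) {
    if (labels[j] == i) { mc_trainset_ids[k++] = j; }
  }
  return k;  // number of indices actually gathered (<= mesocluster_size_max)
}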

…when the mesoclusters turn out to be unbalanced
@achirkin requested a review from a team as a code owner January 20, 2023 09:33
@github-actions bot added the cpp label Jan 20, 2023
@achirkin added the 3 - Ready for Review, improvement (Improvement / enhancement to an existing function), and non-breaking (Non-breaking change) labels Jan 20, 2023
@tfeher (Contributor) left a comment

Thanks Artem for the PR, this is a good workaround for the imbalanced training issue. LGTM.

- for (IdxT j = 0; j < n_rows; j++) {
-   if (labels_mptr[j] == (LabelT)i) { mc_trainset_ids[k++] = j; }
+ for (IdxT j = 0; j < n_rows && k < mesocluster_size_max; j++) {
+   if (labels_mptr[j] == LabelT(i)) { mc_trainset_ids[k++] = j; }
  }
  if (k != mesocluster_sizes[i])
tfeher (Contributor) commented on the diff:

I suggest we omit the warning when k is limited due to mesocluster_size_max.

Suggested change
- if (k != mesocluster_sizes[i])
+ if (k != mesocluster_sizes[i] && k < mesocluster_size_max)

achirkin (Contributor, Author) replied:

I think it's better to leave the warning, because it also gives extra information about which clusters are the offending ones and emphasizes the importance of the previous warning about the unbalanced mesoclusters. In the logs, they go one after another, and it's rather easy to see the link between the two.
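
For context, a hedged sketch of how the capped loop and the warning fit together, assembled from the diff above (the warning text and its arguments are illustrative, not the literal RAFT message):

IdxT k = 0;
// Stop gathering once the cap is reached, even if more rows carry label i.
for (IdxT j = 0; j < n_rows && k < mesocluster_size_max; j++) {
  if (labels_mptr[j] == LabelT(i)) { mc_trainset_ids[k++] = j; }
}
// Warn when fewer rows were gathered than the recorded mesocluster size. With the
// suggested change above, the condition would also require k < mesocluster_size_max,
// so that merely hitting the cap would not trigger the warning.
if (k != mesocluster_sizes[i]) {
  RAFT_LOG_WARN("mesocluster %d: gathered %d rows, expected %d",
                int(i), int(k), int(mesocluster_sizes[i]));
}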


@codecov-commenter

Codecov Report

Base: 87.99% // Head: 87.99% // No change to project coverage 👍

Coverage data is based on head (d247dab) compared to base (d233a2c).
Patch has no changes to coverable lines.

Additional details and impacted files
@@              Coverage Diff              @@
##           branch-23.02    #1161   +/-   ##
=============================================
  Coverage         87.99%   87.99%           
=============================================
  Files                21       21           
  Lines               483      483           
=============================================
  Hits                425      425           
  Misses               58       58           


@cjnolet (Member) commented Jan 21, 2023

/merge

@rapids-bot rapids-bot bot merged commit b70519e into rapidsai:branch-23.02 Jan 21, 2023
ahendriksen pushed a commit to ahendriksen/raft that referenced this pull request Jan 23, 2023

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: rapidsai#1161
Labels
5 - Ready to Merge, cpp, improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change)
4 participants