Faster matrix-vector-ops #401
Conversation
Force-pushed: d4a1e69 to 5c63642
I have a question here. I've used NVTX profiling for measuring the execution time in the old and new versions. However, I've noticed it was not used before. Shall I remove it from the PR? Also, if you want to measure the execution time of the old version, check out the last commit before I replaced the
int BlockSize,
typename Lambda,
typename... Vecs>
void matrixLinewiseVecCols(Type* out,
Tagging @divyegala for his ongoing work to hide implementation details of dense linalg
primitives.
Overall, the changes here look great but we've been bitten a few times recently by making changes in RAFT that inadvertently break cuML (or its CI). The longer-term goal as RAFT becomes more stable is to get to a point where it can act independently from consuming projects. For now, however, we should open a PR in cuml that builds against this branch and verify CI runs successfully.
Yeah, actually, I just found out today that cuML's CD solver tests crash with my changes in RAFT on a matrix of size (1×2).
Test cuml with changes in rapidsai/raft#401
Force-pushed: 488754e to 40e1a7e
Thanks @achirkin for this PR. I have reviewed the row-wise code path. Overall it looks great, but it is difficult to follow the complex indexing logic. I have left a few suggestions where the description could be improved.
I have finished the review, I have a few additional comments toward improving the description of the algorithm. Apart from these, it looks good.
rerun tests
Thanks @achirkin for addressing the issues and improving the description of the functions! It looks good to me.
Updates mostly look good from my end. I realized I had not submitted my original review, which had a bunch of requests for docs that have already been addressed through Tamas' review.
I just have one small question and a trivial update to use the new (non-deprecated) macros.
Changes LGTM, thanks for adding the constants
@gpucibot merge
Introduce matrixLinewiseOp for applying row- or column-wise operations on matrices with a (templated) fixed number of vectors. This is a rewrite of matrixVectorOp. The new primitive is on average 2x faster across various numbers of columns/rows. In the general case, it improves performance by reusing vector values across multiple matrix rows/columns (loading each vector value once, or at least keeping it cached). In the edge case, it uses vectorized load/store operations on the input/output matrices even when the pointers are not properly aligned or the vectors' length is not a multiple of the alignment.