
Add lds and sts inline ptx instructions to force vector instruction generation #273

Merged: 25 commits into rapidsai:branch-21.08 on Jun 21, 2021

Conversation

@mdoijade (Contributor) commented Jun 11, 2021

Adds inline PTX assembly for lds & sts instructions for float, float2, float4, double, and double2.
This ensures the compiler doesn't mistakenly generate non-vectorized instructions where we need the vectorized version.
It also ensures that we always generate non-generic ld/st instructions, preventing the compiler from emitting generic ld/st.
These functions now require the given shmem pointer to be aligned to the vector length; for example, for float4 lds/sts the shmem pointer must be 16-byte aligned, otherwise the operation may silently fail or raise a runtime error.
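To make the mechanism concrete, here is a minimal sketch of such a wrapper for the float4 case. The function names (sts, lds) and exact PTX strings are illustrative assumptions, not copied from the PR diff; __cvta_generic_to_shared converts the generic pointer to a shared-state-space address so that the non-generic st.shared/ld.shared forms are emitted.

```cuda
#include <cstdint>

// Illustrative only: force a single 128-bit shared-memory store/load
// for float4 via inline PTX, so the compiler cannot fall back to
// scalar or generic ld/st. The caller must pass a 16-byte aligned
// shared-memory pointer, per the requirement stated above.
__device__ inline void sts(float4* addr, const float4& x) {
  // Convert the generic pointer to a 32-bit shared-state-space address.
  auto s = static_cast<uint32_t>(__cvta_generic_to_shared(addr));
  asm volatile("st.shared.v4.f32 [%0], {%1, %2, %3, %4};"
               :
               : "r"(s), "f"(x.x), "f"(x.y), "f"(x.z), "f"(x.w));
}

__device__ inline void lds(float4& x, const float4* addr) {
  auto s = static_cast<uint32_t>(
      __cvta_generic_to_shared(const_cast<float4*>(addr)));
  asm volatile("ld.shared.v4.f32 {%0, %1, %2, %3}, [%4];"
               : "=f"(x.x), "=f"(x.y), "=f"(x.z), "=f"(x.w)
               : "r"(s));
}
```

With a properly aligned pointer these compile to single STS.128/LDS.128 SASS instructions; with a misaligned one the behavior is undefined, which is exactly the silent-failure / runtime-error caveat described above.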

mdoijade added 24 commits May 12, 2021 22:03
…r usage in all contraction based kernels so that n is along x dir and m is along y dir blocks
…kernels. --add launch config generator function to launch optimal grid size kernel for these pairwise dist kernels
…ed up over previous version. -- improve logic of the grid launch config generator for x-dir blocks
… for subsequent gridStrideX variations. This overall improves the perf of fusedL2NN to 1.85x over the previous version. --Also remove checking keys and only check values in the fusedL2nn test case, as a row may have multiple keys with the same min val
…und in launchConfigGenerator. --Use constexpr in shmemSize.
…e sure next grid stride doesn't pollute shmem before completion of this calculation
@mdoijade requested review from a team as code owners June 11, 2021 16:00
@github-actions (bot) added the cpp label Jun 11, 2021
@mdoijade (Contributor, Author) commented

@teju85 to help with review.

@teju85 (Member) commented Jun 12, 2021

@mdoijade did you get a chance to check whether this change causes any perf regressions, especially in the distance prims? (Asking because, AFAIK, those are the only kernels using these methods.)

@teju85 added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels Jun 12, 2021
@mdoijade (Contributor, Author) commented

> @mdoijade did you get a chance to check whether this change causes any perf regressions, especially in the distance prims? (Asking because, AFAIK, those are the only kernels using these methods.)

@teju85 yes, I checked the performance; this patch causes no regression in the distance prims. Also, can you rerun the tests? It looks like one of the configs hit an intermittent CI issue with the docker installation.

@teju85 (Member) commented Jun 14, 2021

rerun tests

@teju85 (Member) left a comment

Changes LGTM

@mdoijade (Contributor, Author) commented

@teju85 can we merge this PR now? BTW, the above failure in "gpuCI/raft/gpu/cuda/11.0/python/3.7/ubuntu16.04" is a CI docker installation issue.

@teju85 (Member) commented Jun 16, 2021

@dantegd can we get some help getting CI to pass for this PR, please?

@teju85 (Member) commented Jun 18, 2021

rerun tests

@mdoijade (Contributor, Author) commented

@teju85 can we merge this now?

@teju85 (Member) commented Jun 21, 2021

@gpucibot merge

@rapids-bot (bot) merged commit b266d54 into rapidsai:branch-21.08 Jun 21, 2021
Labels: cpp · improvement (Improvement / enhancement to an existing function) · non-breaking (Non-breaking change)