
[FEA] Support for half-float mixed precision in brute-force #2382

Merged
20 commits merged into rapidsai:branch-24.10 on Aug 22, 2024

Conversation

rhdong
Member

@rhdong rhdong commented Jul 17, 2024

  • distance supports half-float
  • SDDMM supports half-float
  • gemm supports multi-type composition
  • transpose & copy support half
  • random supports half
@rhdong rhdong requested review from benfred and cjnolet July 17, 2024 05:22
@rhdong rhdong requested review from a team as code owners July 17, 2024 05:22
@rhdong rhdong added the enhancement, 3 - Ready for Review, feature request, and non-breaking labels and removed the cpp and CMake labels Jul 17, 2024
@@ -219,38 +219,7 @@ if(BUILD_TESTS)
NAME
LINALG_TEST
PATH
linalg/add.cu
Member

I think this file was accidentally checked in - we should probably keep the linalg/sparse tests here

Member Author

OMG, I forgot to restore the test cases a second time... T_T

* @param a the value to convert
*/
template <typename T>
HDI auto half2float(T& a)
Member

I think this API should be renamed - since it not only converts from half to float, but also handles conversions when T=float.

I also think we probably want both input and output template parameters for the conversion - rather than having a template parameter for the input and assuming the output is always a float.

The faiss codebase has a generic 'convert' operator that does this
(https://github.com/facebookresearch/faiss/blob/dd72e4121dc6684c6fbf289949ba4526a54d9f3b/faiss/gpu/utils/ConversionOperators.cuh#L39-L45), which seems to work pretty well.
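
A minimal sketch of what such a generic conversion helper could look like (the name convert and the exact overload set are illustrative assumptions, not the PR's final API):

#include <cuda_fp16.h>

// Generic fallback: rely on the built-in conversion between the two types.
template <typename OutT, typename InT>
__host__ __device__ inline OutT convert(InT v)
{
  return static_cast<OutT>(v);
}

// __half overload: route through float explicitly, since __half does not
// convert implicitly to arbitrary types on all toolkits.
template <typename OutT>
__device__ inline OutT convert(__half v)
{
  return static_cast<OutT>(__half2float(v));
}

With both the input and output types parameterized, convert<float>(h) and convert<__half>(f) resolve naturally, and the T-to-T case collapses to a no-op cast.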

@@ -102,7 +102,13 @@ template <typename T>
RAFT_INLINE_FUNCTION auto asin(T x)
{
#ifdef __CUDA_ARCH__
return ::asin(x);
if constexpr (std::is_same<T, __half>::value) {
Member

Do we need half support for the asin function?

I'm wondering if we should either remove the half support for this function or add half support for all the other trigonometric functions in this file.

Member Author

Yeah, maybe we can. For now it's a trade-off: I see your point, but some distance algorithms need this function when supporting half, so removing it would cause a compilation error. Bringing half support to everything would be ideal, but the workload could get out of control.
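
For context, a minimal sketch of the pattern the diff hints at - promote __half to float, call the float math function, and convert back (the namespace and the exact body are assumptions for illustration; only the if constexpr branch on __half comes from the diff):

#include <cuda_fp16.h>
#include <type_traits>

namespace sketch {

template <typename T>
__device__ auto asin(T x)
{
  if constexpr (std::is_same<T, __half>::value) {
    // There is no native half-precision asin, so round-trip through float.
    return __float2half(asinf(__half2float(x)));
  } else {
    return ::asin(x);
  }
}

}  // namespace sketch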

@github-actions github-actions bot removed the CMake label Jul 17, 2024
Member

@cjnolet cjnolet left a comment

I'm a little concerned by the transpose kernel, and I think this PR needs some additional benchmarks - both on the perf with half precision and on the increase in build time and binary size.

Given that we're butting right up against code freeze, I'd like to retarget this PR to 24.10 so we can make sure we're happy with the perf and the changes to the build.

const half* __restrict__ in,
half* __restrict__ out)
{
__shared__ half tile[TILE_DIM][TILE_DIM + 1];
Member

Transpose of a non-symmetric non-square matrix is non-trivial to parallelize. This seems to be doing it all in a single kernel. This makes me wonder if we might be lacking test coverage here.
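
For reference, a minimal sketch of the classic tiled shared-memory transpose that the __shared__ tile in the diff context points at (grid geometry and names are assumptions, not the PR's exact kernel). The +1 padding on the tile's inner dimension staggers columns across shared-memory banks so the transposed read avoids bank conflicts, and the bounds checks are what make non-square shapes work:

#include <cuda_fp16.h>

constexpr int TILE_DIM   = 32;
constexpr int BLOCK_ROWS = 8;

// Transposes an n_rows x n_cols row-major matrix into out (n_cols x n_rows).
__global__ void transpose_half(const half* __restrict__ in,
                               half* __restrict__ out,
                               int n_rows, int n_cols)
{
  __shared__ half tile[TILE_DIM][TILE_DIM + 1];

  int x = blockIdx.x * TILE_DIM + threadIdx.x;  // column in the input
  int y = blockIdx.y * TILE_DIM + threadIdx.y;  // row in the input

  // Coalesced read of a TILE_DIM x TILE_DIM tile into shared memory.
  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
    if (x < n_cols && (y + j) < n_rows) {
      tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * n_cols + x];
    }
  }
  __syncthreads();

  // Coalesced write of the transposed tile; the block indices are swapped
  // and the bounds are checked against the transposed shape.
  x = blockIdx.y * TILE_DIM + threadIdx.x;  // column in the output
  y = blockIdx.x * TILE_DIM + threadIdx.y;  // row in the output
  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
    if (x < n_rows && (y + j) < n_cols) {
      out[(y + j) * n_rows + x] = tile[threadIdx.x][threadIdx.y + j];
    }
  }
}

Launched with blockDim = (TILE_DIM, BLOCK_ROWS) over ceil(n_cols / TILE_DIM) x ceil(n_rows / TILE_DIM) blocks, both the load and the store stay coalesced even when n_rows != n_cols, which is exactly the case where test coverage matters most.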

@rhdong rhdong changed the base branch from branch-24.08 to branch-24.10 July 26, 2024 17:06
@rhdong
Member Author

rhdong commented Aug 13, 2024

/ok to test

@cjnolet
Member

cjnolet commented Aug 20, 2024

/ok to test

@cjnolet
Member

cjnolet commented Aug 21, 2024

/ok to test

@raydouglass
Member

/okay to test

@cjnolet
Member

cjnolet commented Aug 21, 2024

@rhdong before we merge this, we need to make sure it doesn't break cuml. Can you build cuML with this RAFT branch and just verify it doesn't need to be updated?

@rhdong
Member Author

rhdong commented Aug 21, 2024

> @rhdong before we merge this, we need to make sure it doesn't break cuml. Can you build cuML with this RAFT branch and just verify it doesn't need to be updated?

OK, I will verify it.

@cjnolet
Member

cjnolet commented Aug 22, 2024

/merge

@rapids-bot rapids-bot bot merged commit db07998 into rapidsai:branch-24.10 Aug 22, 2024
58 checks passed
Labels
3 - Ready for Review, cpp, enhancement, feature request, non-breaking

4 participants