Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding padded layout 'layout_padded_general' #725

Conversation

mfoerste4
Copy link
Collaborator

This is a different approach / followup PR of #663 for issue #497.

I implemented a layout_padded_general within raft to statically enforce padding on mdpsan accesses.

  • The layout has template parameters for ValueType, StorageOrder (default row_major_t), and ByteAlignment (default 128)
  • in order to not require changes upstream I skipped submdspan functionality right now. I have a branch on a mdspan fork where I tested this though (https://github.com/mfoerste4/mdspan/tree/layout_padded).

@mfoerste4
Copy link
Collaborator Author

@achirkin, please have a look at this. It is now possible to retrieve the layout / padding width at compile time which makes it possible to target optimized kernel code correctly.

Copy link
Contributor

@achirkin achirkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I wonder if we can further simplify/optimize some parts of the linewiseOp for padded data?..

strides[r] = stride;
if (stride == 1) {
stride *=
std::max<size_t>(alignment, (__exts.extent(r) + alignment - 1) / alignment * alignment);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nitpick: perhaps, we can use raft::ceildivfor readability?

strides[r] = stride;
if (stride == 1) {
stride *=
std::max<size_t>(alignment, (__exts.extent(r) + alignment - 1) / alignment * alignment);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nitpick: perhaps, we can use raft::ceildiv for readability?

for (int k = threadIdx.x; k < VecElems * BlockSize; k += BlockSize, j += BlockSize) {
while (j >= rowLenPadded)
j -= rowLenPadded;
shm[k] = j < rowLen ? p[j] : Type(1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity: why ones and not zeroes? :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had division ops in mind and did not want to risk any division by zero. I was unsure whether this might cause issues with tools like valgrind?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, good idea, I guess. I don't think it will cost anything anyway.

// similar to layout_strided, but contiguous with padding in second smallest stride dimension
template <typename ValueType,
StorageOrderType StorageOrder = StorageOrderType::row_major_t,
size_t ByteAlignment = 128>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just realized that the padding of the strides is done in terms of elements rather than bytes. I assume, it wouldn't be possible to express the padding in bytes due to how mapping works? If no, what would you think about having the template parameter expressed also in elements?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mapping redirects index-access based on elements, so yes - I don't think that can be changed. But as the hardware constraint for data is byte-based I thought it would be good to have a template based on the bytes (with reasonable default) here. This way the user does not have to think about the width of the datatypes he uses and the class computes the element-alignment automatically.
It can also be retrieved statically from the layout via
static constexpr size_t element_alignment = std::max(ByteAlignment / sizeof(ValueType), 1ul);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Maybe you'd also want to constrain it here to always be power-of-two (I've noticed you used Pow2<..> utils somewhere)? That would still cover our hardware-inspired use cases.

@mfoerste4
Copy link
Collaborator Author

Looks good! I wonder if we can further simplify/optimize some parts of the linewiseOp for padded data?..

The current implementation basically consists of the old main kernel which was running the aligned data portion. I don't think we can simplify it further without loosing performance. Regarding optimizations - we could skip the operation on the padded portion of data, but I guess that would only have very limited effect on larger datasets.

@cjnolet cjnolet added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jul 14, 2022
@mfoerste4
Copy link
Collaborator Author

@achirkin , what are the next steps here to proceed?

Copy link
Contributor

@achirkin achirkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this looks good to me.

The only thing I'm not sure is whether we'd want to define the padded layout synonyms for public use with or without the template parameter padding size. On the one hand, the parameterized version seems to be more logical for whatever use case. On the other hand, any application I could imagine, it only matters to have the specific padding of 128 bytes for the coalesced memory access. Or, maybe, we should have both?

Copy link
Contributor

@mhoemmen mhoemmen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for soliciting my review! I have a few comments on the design.


template <typename ElementType, storage_order_type order>
using padded_layout = detail::stdex::layout_padded_general<
detail::stdex::padding<std::remove_cv_t<std::remove_reference_t<ElementType>>>::value,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[OPTIONAL]

C++20 introduces remove_cvref_t, but alas, we're probably stuck on C++17 at the latest : - ( You could always use the feature test macro __cpp_lib_remove_cvref:

namespace detail::stdex {

#if defined(__cpp_lib_remove_cvref)

using ::std::remove_cvref;
using ::std::remove_cvref_t;

#else

template<class T>
struct remove_cvref {
    using type = ::std::remove_cv_t<::std::remove_reference_t<T>>;
};
template<class T>
using remove_cvref_t = typename remove_cvref<T>::type;

#endif

} 

// that encodes alignment as a non-type template parameter.
assert(input_pointer == alignTo(input_pointer, alignment::value));

pointer aligned_pointer = input_pointer; // assert_aligned(input_pointer,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[INFORMATIVE NOTE]

I've been thinking about how to express this idea. The issue is that std::assume_aligned (C++20) or equivalent compiler built-ins (e.g., GCC's __builtin_assume_aligned) don't affect the type. My current thinking is that aligned_accessor::access should do return (std::assume_aligned(p))[i];. Please see my comment below on aligned_accessor::access.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my latest thoughts on assume_aligned etc., please see my aligned_accessor PR kokkos/mdspan#176 .

@mfoerste4 mfoerste4 requested review from a team as code owners October 17, 2022 11:08
@mfoerste4 mfoerste4 changed the base branch from branch-22.10 to branch-22.12 October 17, 2022 11:13
@mfoerste4
Copy link
Collaborator Author

@cjnolet , thanks for reviewing. I re-based the branch to 22.12 and tried to align naming of template classes and API as close to the existing pattern as possible.

cjnolet
cjnolet previously approved these changes Oct 19, 2022
@cjnolet cjnolet removed the request for review from a team October 19, 2022 22:29
@cjnolet cjnolet dismissed their stale review October 19, 2022 22:35

Premature approval. Hoping for @mhoemmen's blessing before we merge this over.

@mhoemmen
Copy link
Contributor

@cjnolet @mfoerste4 I've been completely overwhelmed with my current project and haven't had time to review this. Please don't feel like you have to wait on me, though I appreciate that you asked me for feedback!

@cjnolet
Copy link
Member

cjnolet commented Oct 27, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit af05bcc into rapidsai:branch-22.12 Oct 27, 2022
rapids-bot bot pushed a commit that referenced this pull request Oct 28, 2022
This should fix the failures @Nyrio found in [#911](#911 (comment)). This is a test issue within a new testcase that was introduced by [#725](#725).

@cjnolet , FYI.

Authors:
  - Malte Förster (https://github.com/mfoerste4)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #964
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cpp improvement Improvement / enhancement to an existing function non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants