Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to improve driftsort with dynamically sized data #30

Open
SchrodingerZhu opened this issue Dec 4, 2024 · 1 comment
Open

How to improve driftsort with dynamically sized data #30

SchrodingerZhu opened this issue Dec 4, 2024 · 1 comment

Comments

@SchrodingerZhu
Copy link

SchrodingerZhu commented Dec 4, 2024

Outside Rust, I am trying to port driftsort as libc's qsort implementation. See https://github.com/SchrodingerZhu/qsort-driftsort.

For qsort, the size of the data is unknown at build time. I found that memcpy can influence the performance significantly. So I made some specialization for small data sizes:

  void copy_nonoverlapping(BlobPtr dst, size_t n = 1) const {
#if __has_builtin(__builtin_memcpy_inline)
    if (element_size <= 16 && n == 1) {
      std::byte *src = data;
      std::byte *dest = dst.data;
      if (element_size >= 8) {
        __builtin_memcpy_inline(dest, src, 8);
        __builtin_memcpy_inline(&dest[element_size - 8], &src[element_size - 8],
                                8);
      } else if (element_size >= 4) {
        __builtin_memcpy_inline(dest, src, 4);
        __builtin_memcpy_inline(&dest[element_size - 4], &src[element_size - 4],
                                4);
      } else if (element_size >= 2) {
        __builtin_memcpy_inline(dest, src, 2);
        __builtin_memcpy_inline(&dest[element_size - 2], &src[element_size - 2],
                                2);
      } else {
        __builtin_memcpy_inline(dest, src, 1);
      }
      return;
    }
#endif
    std::memcpy(dst, data, element_size * n);
  }

In this way, it beats glibc's implementation in most cases with small data sizes. See performance number:
https://docs.google.com/spreadsheets/d/1p5Ltrq-HyVY4M_lKTWRD0pauJgT34yrxwWx6ZQnrit8/edit?gid=329046775#gid=329046775

However, when data size is large (e.g. std::array<long long, 16>), I can see there is still a huge gap between driftsort and glibc's implementation. See benchmark_qsort_with_large_data_size portion of the data in the above link.

Another complication is that, for qsort_r interface, we do not know the alignment of the data, while the comparison function may be invoked on scratch pad. So what I have to do is

  size_t alignment = guess_alignment(element_size, data);
  BlobComparator<Comp> comp{element_size, alignment, compare};

  // Insertion sort is always fine because we never call comparison on temporary
  // space
  constexpr size_t MAX_LEN_ALWAYS_INSERTION_SORT = 20;
  constexpr size_t MAX_ALIGNMENT = 32;
  if (DRIFTSORT_LIKELY(length <= MAX_LEN_ALWAYS_INSERTION_SORT)) {
    BlobPtr v = comp.lift(data);
    return small::insertion_sort_shift_left(v, length, 1, comp);
  }

  // unfortunately, due to qsort_r's restriction, we cannot really know the
  // alignment of over-aligned types, so we have to fall back to slow path.
  // This is similar to the behavior of glibc's qsort.
  if (DRIFTSORT_UNLIKELY(alignment > MAX_ALIGNMENT))
    return trivial_heap_sort(data, length, comp);

This does not affect the performance here but I would like to know if you have more ideas on improving the sorting algorithm for qsort scenario.

Thanks!


P.S. although the code is heavily fuzzed, I have not guarantee that it resembles the driftsort exactly.

@Voultapher
Copy link
Owner

Hey, a couple thoughts based on what you shared:

  • I'm surprised you picked driftsort for this purpose. All the qsort implementations I am aware of are in-place and unstable. I'd expect it to be a hard sell to suggest switching to an implementation that allocates. Plus there are various optimizations not done or done specifically because of the stability requirement Orson and I designed driftsort in mind with.
  • There are various optimizations in driftsort and ipnsort that only really make sense if the compiler can inline the comparison function, which isn't possible in C without PGO. I've written previously about the performance impacts of it here.
  • If you want to get good test coverage I suggest you fork this repository and attach your implementation to it, so it can leverage the existing tests and benchmarks.
  • Benchmarks are tricky, for me to evaluate and comment on your findings I would have to dig through your benchmark code and implementation, and frankly I don't have the time for that currently. I suggest you fork this repository and attach your implementation to it, c_std_sys.cpp should be a good starting point see this commit for a list of associated changes. From there you can follow the top-level README to run and visualize benchmarks. In this familiar framework it would be easier for me to reason about performance characteristics.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants