How to improve driftsort with dynamically sized data #30

SchrodingerZhu · 2024-12-04T22:27:40Z

Outside Rust, I am trying to port driftsort as libc's qsort implementation. See https://github.com/SchrodingerZhu/qsort-driftsort.

For qsort, the size of the data is unknown at build time. I found that memcpy can influence the performance significantly. So I made some specialization for small data sizes:

  void copy_nonoverlapping(BlobPtr dst, size_t n = 1) const {
#if __has_builtin(__builtin_memcpy_inline)
    if (element_size <= 16 && n == 1) {
      std::byte *src = data;
      std::byte *dest = dst.data;
      if (element_size >= 8) {
        __builtin_memcpy_inline(dest, src, 8);
        __builtin_memcpy_inline(&dest[element_size - 8], &src[element_size - 8],
                                8);
      } else if (element_size >= 4) {
        __builtin_memcpy_inline(dest, src, 4);
        __builtin_memcpy_inline(&dest[element_size - 4], &src[element_size - 4],
                                4);
      } else if (element_size >= 2) {
        __builtin_memcpy_inline(dest, src, 2);
        __builtin_memcpy_inline(&dest[element_size - 2], &src[element_size - 2],
                                2);
      } else {
        __builtin_memcpy_inline(dest, src, 1);
      }
      return;
    }
#endif
    std::memcpy(dst, data, element_size * n);
  }

In this way, it beats glibc's implementation in most cases with small data sizes. See performance number:
https://docs.google.com/spreadsheets/d/1p5Ltrq-HyVY4M_lKTWRD0pauJgT34yrxwWx6ZQnrit8/edit?gid=329046775#gid=329046775

However, when data size is large (e.g. std::array<long long, 16>), I can see there is still a huge gap between driftsort and glibc's implementation. See benchmark_qsort_with_large_data_size portion of the data in the above link.

Another complication is that, for qsort_r interface, we do not know the alignment of the data, while the comparison function may be invoked on scratch pad. So what I have to do is

  size_t alignment = guess_alignment(element_size, data);
  BlobComparator<Comp> comp{element_size, alignment, compare};

  // Insertion sort is always fine because we never call comparison on temporary
  // space
  constexpr size_t MAX_LEN_ALWAYS_INSERTION_SORT = 20;
  constexpr size_t MAX_ALIGNMENT = 32;
  if (DRIFTSORT_LIKELY(length <= MAX_LEN_ALWAYS_INSERTION_SORT)) {
    BlobPtr v = comp.lift(data);
    return small::insertion_sort_shift_left(v, length, 1, comp);
  }

  // unfortunately, due to qsort_r's restriction, we cannot really know the
  // alignment of over-aligned types, so we have to fall back to slow path.
  // This is similar to the behavior of glibc's qsort.
  if (DRIFTSORT_UNLIKELY(alignment > MAX_ALIGNMENT))
    return trivial_heap_sort(data, length, comp);

This does not affect the performance here but I would like to know if you have more ideas on improving the sorting algorithm for qsort scenario.

Thanks!

P.S. although the code is heavily fuzzed, I have not guarantee that it resembles the driftsort exactly.

The text was updated successfully, but these errors were encountered:

Voultapher · 2024-12-05T08:09:07Z

Hey, a couple thoughts based on what you shared:

I'm surprised you picked driftsort for this purpose. All the qsort implementations I am aware of are in-place and unstable. I'd expect it to be a hard sell to suggest switching to an implementation that allocates. Plus there are various optimizations not done or done specifically because of the stability requirement Orson and I designed driftsort in mind with.
There are various optimizations in driftsort and ipnsort that only really make sense if the compiler can inline the comparison function, which isn't possible in C without PGO. I've written previously about the performance impacts of it here.
If you want to get good test coverage I suggest you fork this repository and attach your implementation to it, so it can leverage the existing tests and benchmarks.
Benchmarks are tricky, for me to evaluate and comment on your findings I would have to dig through your benchmark code and implementation, and frankly I don't have the time for that currently. I suggest you fork this repository and attach your implementation to it, c_std_sys.cpp should be a good starting point see this commit for a list of associated changes. From there you can follow the top-level README to run and visualize benchmarks. In this familiar framework it would be easier for me to reason about performance characteristics.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to improve driftsort with dynamically sized data #30

How to improve driftsort with dynamically sized data #30

SchrodingerZhu commented Dec 4, 2024 •

edited

Loading

Voultapher commented Dec 5, 2024

How to improve driftsort with dynamically sized data #30

How to improve driftsort with dynamically sized data #30

Comments

SchrodingerZhu commented Dec 4, 2024 • edited Loading

Voultapher commented Dec 5, 2024

SchrodingerZhu commented Dec 4, 2024 •

edited

Loading