1. 79
    1. 18

      That’s a great discussion. When you do the strlen, you’re streaming data in, so you’ll hit fast paths in the prefetcher and, importantly, the termination branch will be predicted not taken, so you can run hundreds of instructions ahead, past cache misses, with a very high degree of parallelism. Eventually you’ll resolve the branch that exits the loop and unwind the speculation, which comes with a fixed cost.

      After that, the cache will be warm. The most important thing here is that you don’t need to reorder the stores. You’re writing full cache lines with data that’s already in cache, so you’ll do a little bit of reassembly and then write entire new lines. Modern caches will allocate a line in place, so if you store a whole line of data they won’t fetch it from memory first. If the source and destination are differently aligned, one cache miss in the middle can cause the stores to back up in the store queue waiting for the cache and back-pressure the entire pipeline.
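
      Concretely, the split version being discussed looks something like this (a minimal sketch, not any particular libc’s implementation):

      #include <string.h>

      /* Minimal sketch of the split approach: measure first, then copy.
         Illustrative only, not any particular libc's code. */
      size_t strlcpy_split(char *dst, const char *src, size_t dstsize)
      {
          size_t len = strlen(src);          /* pass 1: streaming reads, predictable branch */
          if (dstsize > 0) {
              size_t n = len < dstsize ? len : dstsize - 1;
              memcpy(dst, src, n);           /* pass 2: source is now warm in cache */
              dst[n] = '\0';
          }
          return len;
      }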

      1. 1

        You’re writing full cache lines with data that’s already in cache and so you’ll do a little bit of reassembly and then write entire new lines

        Do modern CPUs do any coalescing of stores? I.e. is this:

        vmovdqu YMMWORD PTR [r8], ymm0
        vmovdqu YMMWORD PTR [r8+32], ymm1
        

        coalesced into a single cacheline-sized store somehow?

        1. 2

          Yes, there’s a small store queue. If a full cache line of data is in the store queue, you avoid the load from memory.

          1. 1

            Yes, there’s a small store queue. If a full cache line of data is in the store queue, you avoid the load from memory.

            Thanks. Do the stores need to be aligned to be coalesced? I.e. does this, for r8 not aligned to cacheline, avoid one of the 3 loads? If not, are the alignment requirements documented somewhere for x64/aarch64?

            vmovdqu YMMWORD PTR [r8], ymm0
            vmovdqu YMMWORD PTR [r8+32], ymm1
            vmovdqu YMMWORD PTR [r8+64], ymm2
            
            1. 2

              It will vary a lot across microarchitectures. Generally, stores that span a cache line will consume multiple entries in the store queue and then be coalesced, but store queues vary both in the width of their entries and in their number. I’m also not sure how much reordering is permitted on x86 with TSO; that may require holding up the coalesced store of the middle line until the load of the first one has happened. There’s also some complexity in that Intel chips often fill pairs of cache lines (but evict individual ones), so the miss on the first line may still bring the middle line into the LLC, where it may then be replaced.

              TL;DR: Computers are weird.

              1. 2

                Dumb followup question: do you think Torvalds’ argument, that the ISA should have some kind of rep mov that can skip some of the CPU’s internal machinery (store coalescing, but maybe the register file too?), holds water?

                (It doesn’t need to be rep mov; you could imagine a limited version that requires the source, destination, and size to be cacheline-aligned, or even a single-cacheline copy.)

                1. 5

                  Yes. I was one of the reviewers for Arm’s version, which is designed to avoid the complex microcode requirements of rep movsb. In the Arm version, each memcpy operation is split into three instructions. On a complex out-of-order machine, you’ll probably treat the first and last as NOPs and just do everything in the middle, but in a simpler design the first and last can handle unaligned starts and ends and the middle one can do a bulk copy.
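
                  In software terms, the split is roughly the following (an illustrative sketch assuming 64-byte cache lines, not the architected semantics of the Arm instructions):

                  #include <stddef.h>
                  #include <stdint.h>

                  /* Illustrative prologue/main/epilogue split, assuming 64-byte lines. */
                  static void copy_split(unsigned char *dst, const unsigned char *src, size_t n)
                  {
                      /* "prologue": bytes up to the first line-aligned destination address */
                      size_t head = (64 - ((uintptr_t)dst & 63)) & 63;
                      if (head > n)
                          head = n;
                      for (size_t i = 0; i < head; i++)
                          dst[i] = src[i];
                      dst += head; src += head; n -= head;

                      /* "main": whole destination lines, which a core can stream as
                         full-line stores with no read of the destination */
                      while (n >= 64) {
                          for (size_t i = 0; i < 64; i++)
                              dst[i] = src[i];
                          dst += 64; src += 64; n -= 64;
                      }

                      /* "epilogue": whatever is left at the tail */
                      for (size_t i = 0; i < n; i++)
                          dst[i] = src[i];
                  }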

                  The bulk copy can be very efficient if it’s doing a cache line at a time. Even if the source and destination have different alignment, if you can load two cache lines and then fill one from the overlapping bits, you guarantee that you never need to read anything from the destination. There are a bunch of other advantages:

                  • You know where the end is, so if it’s in the middle of a cache line you can request that load right at the start.
                  • The loads are entirely predictable, so you can issue a lot of them at once without needing to involve the speculative-execution machinery.
                  • The stores are unsequenced, so you can make them visible in any order, as long as all of the earlier ones are visible when you take an interrupt.
                  • There’s no register renaming involved in the copies; you’re not making one vector register live for the duration of the copy. This is especially important for smaller cores, where you might not bother with register renaming on the vector registers (there are enough vector registers to keep the pipelines full without it, and the vector register state is huge).
                  • For very large copies, if they miss in cache you can punt them lower in the memory subsystem.

                  The last point is the same motivation as for atomic read-modify-write sequences. If you have something like CXL memory, you have around a 1,500-cycle latency. If you need to do an atomic operation by pulling that data into the cache and operating on it, it will almost certainly back-pressure the pipeline. If you can just send an instruction to the remote memory controller to do it, the CPU can treat it as retired (unless it needs to do something with the result). If you’re doing a copy of a page in CXL memory (e.g. a CoW fault after fork), you can just send a message telling the remote controller to do the copy and, at the same time, read a small subset of the values at the source into the cache for the destination. Some CXL memory controllers do deduplication (very useful if you have eight cloud nodes using one CXL memory device, all running multiple copies of the same VM images), and so having an explicit copy makes this easy: the copy is just a metadata update that adds a mapping-table entry and updates a reference count.

        2. 2

          yes ‘write combining’

          the main thing you want is to avoid actually fetching the cache line in question from memory

    2. 7

      In the openbsd version, because the length and copy loop are fused together, whether or not the next byte will be copied depends on the byte value of the previous iteration. Effectively the cost of this dependency is now not just imposed on the length computation but also on the copy operation. And to add insult to injury, dependencies are not just difficult for the CPU, they are also difficult for the compiler to optimize/auto-vectorize resulting in worse code generation - a compounding effect.

      This seems wrong? The if ((*dst++ = *src++) == '\0') branch should be very predictable and shouldn’t hinder the CPU.

      What I believe is happening is that splitting the strlen and memcpy loops removes the early exit from the memcpy loop, allowing the autovectorizer to kick in.

      Godbolt seems to confirm my thesis, with bespoke_strlcpy being vectorized at -O3 by gcc 14.
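
      For reference, the fused loop in question is essentially this (an OpenBSD-style sketch, slightly simplified):

      #include <stddef.h>

      /* OpenBSD-style fused loop, slightly simplified: the early exit on the
         byte just copied is what keeps the autovectorizer away. */
      size_t strlcpy_fused(char *dst, const char *src, size_t dstsize)
      {
          const char *s = src;
          size_t n = dstsize;

          if (n != 0) {
              while (--n != 0) {
                  if ((*dst++ = *s++) == '\0')   /* copy and test in one step */
                      return (size_t)(s - src - 1);
              }
              *dst = '\0';                       /* dst is full: NUL-terminate */
          }
          while (*s++ != '\0')                   /* keep scanning to return strlen(src) */
              ;
          return (size_t)(s - src - 1);
      }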

    3. 4

      I recently implemented a SIMD-accelerated strlcpy. Two implementations are provided: a scalar implementation, which calls into strlen and memcpy to do the job, and an SSE-vectorised implementation that does it all in one pass.

      Turns out that when you program carefully, the naïve approach can be beaten soundly:

      os: FreeBSD
      arch: amd64
      cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
              │ strlcpy.scalar.out │        strlcpy.baseline.out         │
              │       sec/op       │   sec/op     vs base                │
      Short           129.10µ ± 0%   79.67µ ± 0%  -38.29% (p=0.000 n=20)
      Mid             29.928µ ± 1%   8.652µ ± 1%  -71.09% (p=0.000 n=20)
      Long            10.064µ ± 0%   3.577µ ± 0%  -64.45% (p=0.000 n=20)
      geomean          33.88µ        13.51µ       -60.12%
      
              │ strlcpy.scalar.out │          strlcpy.baseline.out          │
              │        B/s         │      B/s       vs base                 │
      Short           923.4Mi ± 0%   1496.3Mi ± 0%   +62.05% (p=0.000 n=20)
      Mid             3.890Gi ± 1%   13.455Gi ± 1%  +245.89% (p=0.000 n=20)
      Long            11.57Gi ± 0%    32.54Gi ± 0%  +181.31% (p=0.000 n=20)
      geomean         3.436Gi         8.617Gi       +150.76%
      

      Here “short” is a benchmark with strings that average 16 bytes of length, “mid” averages 64 bytes and “long” is one 128 kB string.
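
      The core of the vectorised loop is roughly the following (a heavily simplified sketch of the idea, not the actual implementation: it assumes src can be read in whole 16-byte chunks past the NUL and uses a GCC/Clang builtin; a real implementation uses aligned loads so it never crosses into an unmapped page):

      #include <emmintrin.h>   /* SSE2 intrinsics */
      #include <stddef.h>

      /* Simplified one-pass sketch: scan for the NUL and copy in the same loop. */
      size_t strlcpy_sse_sketch(char *dst, const char *src, size_t dstsize)
      {
          const __m128i zero = _mm_setzero_si128();
          size_t i = 0, done = 0;

          for (;;) {
              __m128i chunk = _mm_loadu_si128((const __m128i *)(src + i));
              int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
              if (mask != 0) {
                  /* the NUL is in this chunk: finish the tail byte by byte */
                  size_t len = i + (size_t)__builtin_ctz((unsigned)mask);
                  if (dstsize > 0) {
                      size_t n = len < dstsize ? len : dstsize - 1;
                      for (; done < n; done++)
                          dst[done] = src[done];
                      dst[n] = '\0';
                  }
                  return len;                    /* strlcpy returns strlen(src) */
              }
              if (done == i && i + 16 <= dstsize) {
                  /* chunk is NUL-free and fits in dst: store it as-is */
                  _mm_storeu_si128((__m128i *)(dst + i), chunk);
                  done = i + 16;
              }
              i += 16;
          }
      }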

      I’d say that the problem is more that gcc and clang suck at optimising string code rather than that the naïve approach is inherently superior or anything.

    4. 4

      I very much agree with the “Don’t throw away the length” section at the end. NUL-terminated strings probably deserve to be called another “billion dollar mistake”!

      About 10 years ago I made up a C++ struct called slice that’s just {const void*; size_t} with a bunch of useful methods, and started using it everywhere. It’s been extremely useful. Now that the standard lib has evolved, I’m using string_view and span for the same purposes.
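
      The shape is trivial; rendered in plain C for illustration (made-up names, and the real thing is a C++ struct with methods):

      #include <stddef.h>
      #include <string.h>

      /* A pointer plus an explicit length; illustrative names only. */
      typedef struct {
          const void *data;
          size_t      len;
      } slice;

      static slice slice_from_cstr(const char *s)
      {
          slice out = { s, strlen(s) };   /* measure once, carry the length around */
          return out;
      }

      static slice slice_drop(slice s, size_t n)
      {
          /* assumes n <= s.len */
          slice out = { (const char *)s.data + n, s.len - n };
          return out;
      }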

      1. 3

        This was also one of the places where I’ve seen Objective-C massively outperform C++. Being able to provide custom NSString subclasses that do this kind of thing (sharing or not sharing based on app-specific heuristics) can be a huge win.

        1. 1

          Yes, and in fact it points to a wrong direction that performance work has been moving in for a long time: chasing (fairly) small across-the-board gains by making things more static so the compiler can optimize a bit better.

          While these across-the-board gains aren’t nothing, they are far less valuable than having the flexibility to slot in problem-specific optimizations like the one you mentioned. For example, I also have one of these in MPWFoundation, and for high-perf parsing tasks it is the difference between performance that would largely be a joke and performance that is world-beating.

          The problem is that these much more valuable optimizations aren’t something that a library provider can easily quantify, as they are both highly local and specialized, and also just represent opportunities, not concrete gains that the library or compiler provider can point to.

          One could argue that the small across-the-board gains are actually more valuable because of their wider impact, but that ignores the fact that most code doesn’t need much if any optimization (Knuth’s 97/3 rule). Having all that boilerplate and steering code be a bit faster is almost entirely useless compared to being able to optimize the crap out of your hot spots.

      2. 2

        That’s kind of what GLib does: https://docs.gtk.org/glib/struct.String.html

        If you can get away with most operations being in your code, it’s a nice approach.

    5. 6

      it’s important to make it clear whether we are talking about it in an academic setting or in a practical setting because CPUs do not care about common sense or big-O notation.

      Doing one or two traversals is the same big-O complexity, and that’s kind of the point. It’s unfortunate to end the blog post with a negative remark, on something completely unrelated, that suggests a reasoning error about big-O notation. (There are also instances of asymptotically better algorithms being slower in practice on small enough inputs, for example linear search versus binary search.)

    6. 3

      Yes, it’s often incredibly hard to predict the performance impact of such things.

      But even if the author is right about the reason why, it’s wrong of them to suggest that CPUs are not bound by complexity analysis (the exact quote is “CPUs do not care about common sense or big-O notation.”).

      Big O notation ignores coefficients and constants, and those values can definitely result in systems where, for any problem that matters, the practical performance will be better with the algorithm that is “less complex”. Good hash tables are an obvious example– the (amortized) constant overhead is often high enough that an associative array can easily be trivially better depending on the circumstances (number of items, how hashes are generated, etc).
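
      To make that concrete, for a handful of entries something like this flat, linearly scanned table (illustrative, not tuned) can beat a general-purpose hash table outright:

      #include <stddef.h>
      #include <string.h>

      /* A flat array of key/value pairs with linear lookup: no hashing, no
         indirection, and the whole table fits in a cache line or two. */
      struct kv { const char *key; int value; };

      static int lookup(const struct kv *table, size_t n, const char *key, int *out)
      {
          for (size_t i = 0; i < n; i++) {
              if (strcmp(table[i].key, key) == 0) {
                  *out = table[i].value;
                  return 1;
              }
          }
          return 0;   /* not found */
      }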

      1. 3

        especially fun: consider bubblesort on small numbers of items. no calls, no recursion, basically perfectly predictable code. i wonder at what element count the threshold between stupid N² bubblesort and quicksort lies

        1. 2

          On some NVidia GPUs, the threshold is 31 elements. IIRC on most other systems, insertion sort is more cache-friendly than bubble sort.
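
          In practice the question shows up as the small-N cutoff in a hybrid sort, something like this (the cutoff of 16 is illustrative; real libraries tune it per platform):

          /* Illustrative hybrid: quicksort that falls back to insertion sort
             below a small-N cutoff. */
          #define CUTOFF 16

          static void insertion_sort(int *a, int n)
          {
              for (int i = 1; i < n; i++) {
                  int v = a[i], j = i - 1;
                  while (j >= 0 && a[j] > v) {
                      a[j + 1] = a[j];
                      j--;
                  }
                  a[j + 1] = v;
              }
          }

          static void hybrid_sort(int *a, int n)
          {
              if (n <= CUTOFF) {              /* small, predictable, cache-friendly */
                  insertion_sort(a, n);
                  return;
              }
              int pivot = a[n / 2], i = 0, j = n - 1;
              while (i <= j) {                /* Hoare-style partition */
                  while (a[i] < pivot) i++;
                  while (a[j] > pivot) j--;
                  if (i <= j) {
                      int t = a[i]; a[i] = a[j]; a[j] = t;
                      i++; j--;
                  }
              }
              hybrid_sort(a, j + 1);
              hybrid_sort(a + i, n - i);
          }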

      2. 1

        Good hash tables are an obvious example– … an associative array can easily be trivially better

        I agree about asymptotic complexity, but hash tables are a type of associative array. Did you mean a specific other type of associative array?

        1. 1

          Hash tables are always associative, but not always in a flat array (e.g., chaining). But point taken on the ambiguous terminology. I generally think of old-school LISP, where associative arrays are essentially lists of 2-tuples, where lookups are linear.

          1. 1

            Association lists, yes: I use them both in old Emacs Lisp and modern Guile, although I imagine a more contiguous structure would be more efficient on modern CPUs.

    7. 3

      C/C++ optimizers don’t currently identify these patterns and compile both implementations to the same assembly sequence, but ideally they would. I’m not sure there is any fundamental reason why they couldn’t.

      Something smells off to me when reordering C/C++ code has drastic effects like this. Might as well write assembly if there is going to be this much correspondence between the source and the resulting machine code.

      1. 2

        char pointers and memcpy are allowed to break aliasing rules, so they’re hard to optimize: they could be doing very hacky, weird stuff. Per the language rules (not the library docs), the source and dest strings could overlap.

        1. 2

          The string parameters of strlcpy and memcpy are restrict-qualified, so this does not apply.

        2. 1

          memcpy is undefined behaviour if the source and destination overlap; memmove is defined to work as expected for overlaps. (Not entirely as expected, though, because a zero-length memmove is not a no-op: it is undefined behaviour when a pointer is NULL.) Some codebases forbid memcpy and require memmove instead, because overlaps are dangerous.
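
          For example (illustrative), deleting the first character of a string shifts overlapping bytes, so it needs memmove:

          #include <string.h>

          /* The source and destination regions overlap, so memmove is required;
             memcpy here would be undefined behaviour. */
          static void drop_first_char(char *s)
          {
              size_t n = strlen(s);
              if (n > 0)
                  memmove(s, s + 1, n);   /* remaining characters plus the NUL */
          }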

          1. 1

            I mean strlcpy doesn’t have restrict pointers (at least not in any docs I found), so the compiler can’t safely change a custom loop to memcpy.

            1. 1

              Good point!