1.
  1.  

    1.

      Even though Intel introduced branch hints in Pentium 4, they found that in practice there was no performance benefit and they just increased code size

      This is complicated. This claim is definitely true for x86, because branch hints were introduced long after the relevant cores were deep superscalar out-of-order designs. Branch hints can be a big win on in-order cores, though. Most such cores statically predict backwards branches as taken and forward ones as not taken, and the compiler will lay code out assuming this; a few ISAs provide hints beyond that.

      The one branch hint that is useful in hardware is not a common one: an “unpredictable” hint. This tells the implementation that the branch is data dependent and that prior executions are probably not useful in predicting the new target. The main benefit of this is to avoid wasting branch-predictor state training on something where you’ll keep getting it wrong. Megamorphic call sites (typically <1% of call sites in OO languages) benefit from this. Or did: I’ve no idea how it works on a modern neural-network predictor; those are basically magic.
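
      In source code, the nearest widely available equivalents are compiler-level hints rather than ISA hints. A minimal sketch using GCC/Clang builtins (the likely/unlikely macros are just the conventional wrappers, not something either compiler defines for you):

      /* __builtin_expect biases block layout and the compiler's own static
         prediction; on x86 it does not emit a hardware branch-hint prefix. */
      #define likely(x)   __builtin_expect(!!(x), 1)
      #define unlikely(x) __builtin_expect(!!(x), 0)

      int checked_read(const int *p)
      {
          if (unlikely(p == 0))   /* cold path: laid out out of line */
              return -1;
          return *p;              /* hot path: falls straight through */
      }

      /* Clang also provides __builtin_unpredictable(cond), which is roughly
         the source-level form of the "unpredictable" hint described above. */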

      -O3 produces much faster code than -O2

      I’m not sure what clang does here, but GCC never made this claim. In fact, quite the reverse. GCC’s claim was that -O2 enabled the optimisations that were almost always a win, and -O3 the ones that were sometimes a big win and sometimes a regression. GCC also exposes a lot of tuning knobs, so the advice was to default to -O2, try -O3, and if that isn’t a speedup (and you really care) then try tweaking some of the other knobs until you make it faster.
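
      For what it’s worth, this is the kind of loop where the answer genuinely goes either way, so the only honest advice is to measure it at -O2, at -O3, and at -O2 plus individual extra flags (e.g. GCC’s -funroll-loops). A sketch; the function name is just illustrative:

      /* -O3 may unroll and vectorise this more aggressively than -O2.
         Whether that wins depends on trip counts, i-cache pressure, and the
         target, which is why the advice is to benchmark rather than assume. */
      void saxpy(float *y, const float *x, float a, int n)
      {
          for (int i = 0; i < n; ++i)
              y[i] = a * x[i] + y[i];
      }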

      The compiler optimizes for data locality

      There’s some really nice work from MIT a decade ago that shows why, even for constrained things like shader programs, this is a really hard problem. The tradeoff between caching and recomputing is incredibly complex. A memory access that hits in the cache is about as fast as a single ALU op (maybe faster than a multiply or, worse, a divide), but one that goes all the way out to main memory is going to take as long as hundreds of operations.
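
      To make the caching-versus-recomputing tension concrete, here is a sketch (function names are made up): the first version does more arithmetic, the second does less arithmetic but adds memory traffic that is only cheap while the temporary stays in cache.

      #include <stddef.h>

      /* Option A: recompute the intermediate value in each consumer loop. */
      void consumers_recompute(float *out1, float *out2, const float *x, size_t n)
      {
          for (size_t i = 0; i < n; ++i) out1[i] = (x[i] * x[i] + 1.0f) * 2.0f;
          for (size_t i = 0; i < n; ++i) out2[i] = (x[i] * x[i] + 1.0f) * 3.0f;
      }

      /* Option B: cache the intermediate in a buffer. Fewer ALU ops, but the
         extra stores and reloads only win if tmp stays cache-resident. */
      void consumers_cached(float *out1, float *out2, float *tmp,
                            const float *x, size_t n)
      {
          for (size_t i = 0; i < n; ++i) tmp[i]  = x[i] * x[i] + 1.0f;
          for (size_t i = 0; i < n; ++i) out1[i] = tmp[i] * 2.0f;
          for (size_t i = 0; i < n; ++i) out2[i] = tmp[i] * 3.0f;
      }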

      -O0 gives you fast compilation

      There’s another axis here. For C++ or other languages that make heavy use of COMDATs, you may find that linking is a lot slower. Especially with -g and without separate debug info. Modern linkers improve that a lot, but it can be a complex tradeoff between doing more work in the compiler and less in the linker.

      C++ templates are slow to compile because C++’s compilation model is terrible

      Yes and no. There are two distinct issues:

      • Compiling templates is slow because it’s doing a lot of work at compile time.
      • Compiling template-heavy codebases is slow because you’re duplicating work in a lot of compilation units and throwing away the duplicated work at the end.

      The second is definitely a dominating factor (you can offset this with extern template definitions to some degree) but it’s not the whole story. Compile-time reification, like any other compiler feature, is slower than not doing it. Complex templates are, by definition, complex.
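
      For reference, the extern template mitigation looks like this (a sketch with a made-up Widget type):

      // widget.h - a template that many translation units instantiate as Widget<int>.
      template <typename T>
      struct Widget {
          T value;
          T twice() const { return value + value; }
      };

      // Suppress the implicit instantiation in every TU that includes the header...
      extern template struct Widget<int>;

      // widget.cpp - ...and provide the single explicit instantiation here.
      template struct Widget<int>;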

      Beyond personal experience, Ubisoft has also reported using unity builds to improve compilation times of its games

      There’s a big caveat here: It’s based on Windows, where process creation is slow and can dominate compilation times for small projects. Visual Studio tries to amortise this with a single cl.exe invocation for half a dozen or so source files, but if a lot of the code is in headers then combining them may offset the template-duplication problem.

      For most things I’ve worked on, separate compilation is faster, including the project I’ve just done where it’s only a dozen source files (splitting one out into a separate plugin that’s built as a shared library roughly halved build times).

      The role of the inline keyword

      This is somewhat confusing because C and C++ both have a keyword called inline but they are totally unrelated. The misconception is mostly to do with the C++ one.
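
      A quick sketch of the C++ meaning, since that is where the misconception lives: inline is mostly a statement about linkage and the one-definition rule (“identical definitions may appear in many translation units; keep one”), not a promise to inline anything.

      // C++: this definition can live in a header included by every TU;
      // the linker keeps a single copy. It does not force inlining.
      inline int clamp_to_byte(int x)
      {
          return x < 0 ? 0 : (x > 255 ? 255 : x);
      }
      // Actually forcing (or forbidding) inlining is a separate, compiler-specific
      // attribute, e.g. __attribute__((always_inline)) in GCC/Clang.
      // C99's inline has different rules again: a plain inline definition does
      // not by itself provide the external definition of the function.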

      So, to learn compilers, as I suggested in a Reddit comment, you can take a look at the Go compiler

      Not sure I’d recommend that. It’s written in a style I found very hard to follow.

      99% correctness rate is ok

      Nuno Lopes and John Regehr wrote a position paper a few years ago that framed the answer here a lot better. They said that a compiler optimisation is composed of three things:

      • A transform, which is valid if and only if some conditions about the surrounding code hold.
      • An analysis that determines (conservatively - false negatives are fine though undesirable, false positives are unacceptable) whether those conditions are true.
      • An heuristic that determines whether doing the transform would improve performance for the current program.

      The first two are a good fit for formal verification, because you need to be 100% correct; the last one is a good fit for machine learning, because being mostly right is still a big win.
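
      As a toy example of the decomposition, take the peephole “rewrite x / 2 as x >> 1 for signed x” (all names below are hypothetical; real compilers do this on IR, not on C):

      #include <stdbool.h>

      /* Analysis result: conservative sign information about x. Saying
         "don't know" (a false negative) is fine; wrongly claiming
         non-negative (a false positive) is not. */
      typedef struct { bool known_nonnegative; } Facts;

      /* Heuristic: free to be wrong, it only costs performance. */
      static bool shift_cheaper_than_divide(void) { return true; }

      /* Transform: only applied when the analysis proves its precondition. */
      int divide_by_two(int x, Facts f)
      {
          if (f.known_nonnegative && shift_cheaper_than_divide())
              return x >> 1;   /* transformed code */
          return x / 2;        /* original, always-correct code */
      }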

      1.  

        I’m curious what kind of project would have linking as a significant time cost compared to compiling. On my main project, roughly 600k lines without counting the dependencies’ headers (which amount to multiple millions of lines - boost, qt, and a ton of other libraries), compiling is a ~10 minute affair and linking the whole thing with mold takes less than a second.

        1.  

          My little web browser was one - compiling takes about two seconds, while linking used to take ages (the Chromium Embedded Framework is like a 2 GB library). I switched it to dynamic loading and now it is faster to link and faster to start up. That’s kinda cheating, because if you included however long it takes to build Chromium and its dependencies I’m sure it’d be hours, but I’ve never done that, so linking my project was slower than compiling my project due to linking in the large prebuilt library.

          But yikes, if compiling ever took me 10 minutes, I’d lose my mind lol. I consider 3 seconds to be just about the upper limit of my tolerance!

          1.  

            Linking the entirety of Chromium with mold is around 1s, IIRC. Note that when I say 10 minutes for compiling, it’s for the whole project; editing a single file is generally on the order of seconds too.

      2.  

        Beyond personal experience, Ubisoft has also reported using unity builds to improve compilation times of its games

        There’s a big caveat here: It’s based on Windows, where process creation is slow and can dominate compilation times for small projects. Visual Studio tries to amortise this with a single cl.exe invocation for half a dozen or so source files, but if a lot of the code is in headers then combining them may offset the template-duplication problem.

        For most things I’ve worked on, separate compilation is faster, including the project I’ve just done where it’s only a dozen source files (splitting one out into a separate plugin that’s built as a shared library roughly halved build times).

        I worked at a company with a huge C++ codebase. Release build times were on the order of hours. I experimented with unity builds and found significant build speedups, even on Linux and AIX.

        In our case, many source files #included the same set of header files, which transitively pulled in several hundred megabytes of source code for each compilation unit, so parsing consumed around 95% of compile time. Unity builds helped substantially here.
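
        For anyone who hasn’t seen one, a unity (or “jumbo”) build is literally just this (a sketch; the file names are made up): a single translation unit that textually includes the others, so the shared headers are parsed once instead of once per file.

        // unity.cpp - the only file handed to the compiler.
        #include "renderer.cpp"
        #include "physics.cpp"
        #include "audio.cpp"
        // Caveat: everything now shares one TU, so static functions, anonymous
        // namespaces, and macros that used to be private to one .cpp file can
        // start colliding.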

        1.  

          A lot of that benefit can also be achieved with precompiled headers. Visual Studio and Xcode both default to them. I’m not sure how big the windows.h monster is, but Cocoa.h was over 10 MB twenty years ago and has been growing ever since. Objective-C compilation units are typically small, usually under a thousand lines, so parsing that was a bottleneck; with precompiled headers, clang emits a serialised (and lazily deserialisable) AST for the headers, so only the referenced entities get reloaded.

          This is less of an issue than it was, because the parsing phase in clang is really fast. One of the original motivations for clang was that distcc didn’t get more than an 8x speedup no matter how many machines you had, because you needed to preprocess locally and ship the preprocessed source, and GCC was very slow at that phase. For C and Objective-C, well over 90% of the time is now lowering to IR, optimisation, and code generation. With over a hundred MB of includes, I can see either approach having benefits.

          For C++, the parsing is a bit slower but instantiating templates is a lot more work. This involves cloning a chunk of AST, filling in the template parameter placeholders with concrete things, and then type checking, but then also involves generating IR for each instantiation, optimising it, and then (if it’s the same instantiation in multiple compilation units) more work for the linker to throw them all away.

          Sony has a thing that they call a ‘compilation database’, which might be the worst named thing ever. It’s basically a cache for every phase of the compiler. Instantiating a template? Check if that template is instantiated with the same types elsewhere and reproduce that. Lowering an AST subtree to IR? There’s a cache. Optimising a function? Maybe it’s in another compilation unit and someone has done it already. Generating machine code for a function? Might be done already. It’s a massive speedup for incremental builds, but also quite a large win for clean builds.

      3.  

        There’s some really nice work from MIT a decade ago that shows why, even for constrained things like shader programs, this is a really hard problem

        have a link?

        1.  

          I don’t know where it was published, sadly. I saw a talk by one of the people who did the work and chatted afterwards when I visited.

          1.  

            Halide, maybe?

    2.

      I think my favorite bit is from footnote 35:

      I have no idea what counts as a credible source in C++-land. The C++ standard and all these “authoritative sources” appear more and more like broadcasts of random string generators.

    3.  

      I thought this was really interesting, but wanted to share this discussion I just had.

      The linked article under “Undefined behavior only enables optimizations” gives this example for why UB in a spec can prevent a compiler making optimizations:

      Suppose that you have this loop:

      int b, c;
      ...
      for (int i = 0; i < N; ++i) {
        a[i] = b + c;
      }
      

      What we would like to do here is not perform this addition in every iteration. Instead, we would like to hoist it out of a loop, in a temporary, and only do a simple copy:

      int tmp = b + c;
      for (int i = 0; i < N; ++i) {
        a[i] = tmp;
      }
      

      But surprise, surprise, we can’t do that, if we don’t know the values of b, c, N. The reason is that as we learned, if b + c overflows, that’s undefined behavior. Now, assume we hoist it and N <= 0.

      However, my coworker has convinced me that the compiler can safely make transformations like this. Overflowing b + c is only UB in the original C code; it’s not like the CPU will explode if you overflow an addition in the generated assembly. The transformed version of the source code isn’t C code (though it’s printed as C here), it just has to do the same thing as the original within the C abstract machine.

      In the case where N = 0, the overflowed addition is still never written to memory so it’s not observable and therefore doesn’t matter how it was defined.

      1.  

        The transformed version of the source code isn’t C code

        Only because of the (somewhat remote) possibility that a signed integer add might trap on some machines or compilers that add overflow checks.

        But you can do this:

        int tmp;
        if (N > 0) tmp = b + c;
        for (int i = 0; i < N; ++i) {
          a[i] = tmp;
        }
        

        It will never trap unless the original code does. There is a slight overhead that can only matter at all if N is small. tmp might be uninitialised, but only in cases where its value is not used.

        Incidentally, the original code is likely to be compiled as …

        if (N <= 0) goto skip;
        {
          int i = 0;
          loop:
          a[i] = b + c;
          ++i;
          if (i < N) goto loop;
        }
        skip:
        

        … in order to avoid needing a branch to a branch.

        In this case you already have a completely natural, zero overhead, can’t introduce a new trap, place to move the b + c to.

        1.  

          Absolutely. The compiler knows the target ISA and operations that were undefined in C like overflowed addition and div/0 will have definitions there. As long as it uses them in a way that satisfies the original C code all is well.

          In your example, couldn’t the b + c still be hoisted just above loop: into a temporary? On RISC-V that avoids the two (potential) loads and the add within the loop, right?

          1.  

            in your example, couldn’t the b + c still be hoisted just above loop: into a temporary?

            That was precisely my point: “In this case you already have a completely natural, zero overhead, can’t introduce a new trap, place to move the b + c to.”

            Namely: just above “loop:”
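
            Spelled out, that looks like (a sketch, in the same style as above):

            if (N <= 0) goto skip;
            {
              int tmp = b + c;   /* hoisted: evaluated only if the loop runs */
              int i = 0;
              loop:
              a[i] = tmp;
              ++i;
              if (i < N) goto loop;
            }
            skip: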

            The compiler knows the target ISA and operations that were undefined in C like overflowed addition and div/0 will have definitions there.

            True, but my examples do not depend on anything like that. They are both conforming transformations entirely within the C language.

            1.  

              Ah, I understand now, thanks!

      2. [Comment removed by author]

    4.  

      This seems to have a lot of its own misconceptions buried in it. To start with, there’s a bit about code size being an optimal substructure problem, which the author has already struck out themselves:

      To see what this means in practice, let’s say you have a function A whose size you want to minimize (i.e., find the minimum size). Now, let’s say you split the function in two parts and you find their minimums independently. Optimal substructure tells us that if you just merge these two, you’ll get the minimum version of the whole function. In other words, combining individual minimums gives you a global minimum

      If a compiler decides on a set of functions that it will compile separately and independently as a first step, then within that constrained solution it is true (or at least close enough to true) that minimising the compiled size of each function will minimise the program size. But of course functions are not independent: they call each other, they can be inlined, and different calling contexts may have different optimal translations.

      Knowing which paths are hot in any Javascript bytecode is not enough to produce optimized code; you also need to know the types. And you only know the types at runtime. This is the main reason why JIT compilers compile code at runtime.

      It’s largely the same thing. The code branches and does different things depending on the type - that’s the nature of a dynamically-typed language - and knowing which type (or types) usually turn up at a given point is what decides the hot path for an operation. A JIT compiler often needs to insert guards to check that the type is what was expected, and handle it if not - that’s the cold path.
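
      A sketch of what the specialised hot path looks like once the types are known (all names and the Value layout here are made up; real engines use their own value representation):

      /* JIT output for `x + 1`, specialised after observing that x has so far
         always been an integer. The guard separates the optimised hot path
         from the generic cold path (interpreter or deoptimisation). */
      typedef struct { int tag; long as_int; double as_double; } Value;
      enum { TAG_INT, TAG_DOUBLE };

      Value generic_add(Value x, long y);   /* slow path, defined elsewhere */

      Value add_one_hot(Value x)
      {
          if (x.tag != TAG_INT)             /* guard: does the assumption still hold? */
              return generic_add(x, 1);     /* cold path */
          x.as_int += 1;                    /* hot path: one integer add */
          return x;
      }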

      Nope, undefined behavior can also disable optimizations. I wrote an article on this.

      If a compiler observes that a particular state or code path necessitates undefined behaviour, it can exclude that state/path from consideration when choosing what code to emit. In no circumstance does the presence of undefined behavior require disabling optimisations.

      The example in the linked article shows a loop, with an unknown and possibly zero iteration count, where an addition of two int values inside the loop is repeated on each iteration. It claims that the addition can’t be hoisted outside the loop because that would make it unconditional (where it was previously conditional upon at least one iteration of the loop being executed), and that turns conditional UB into unconditional UB. While that’s true for a transformation at the source-code level, optimisations operate on an intermediate representation, and the output is assembly, where addition of two signed integers is generally not undefined behaviour.

      Compilers can and will hoist an addition outside of a loop in cases that look like the example in the linked article, eg: https://godbolt.org/z/YaxbjvssW

      (There you can see the addl %esi, %edi instruction executed almost immediately upon entering the function, unconditionally).

      It is true that the compiler can’t introduce changes which break the behaviour of a program (in the absence of UB). This means that in certain cases the compiler must be conservative about optimisations involving operations that might exhibit UB (dereferencing a potentially null pointer makes a better example than the integer addition used in the linked article though). Stating this as “UB can disable optimisations”, though, is a big stretch.

    5.  

      Long, but absolutely fantastic. I didn’t know that -O3 and -O2 were so close to each other, and that people had been using single-compilation-unit “unity” builds to speed up compilation. And this gives so many citations to follow up on!

      1.  

        With the D programming language, all-at-once compiles (not exactly the same as a unity build dumping it all in one file, but either passing all files to the compiler at once or letting it discover them as needed by following imports automatically) often outperform separate compiles, sometimes significantly so, because the compiler is able to share more work. If two modules import each other and you compile them separately, it reparses, recalculates compile-time functions, and reinstantiates templates for each invocation. If done all at once, it sees its in-memory cache and reuses it.

        I wasn’t surprised to see the author had some D experience here - he cites its standard lib build (which is actually pretty slow to compile as far as D code goes!) - as this result is pretty surprising to newbies and sometimes people don’t believe it even after seeing it for themselves!

        C and C++ aren’t exactly the same, but there’s still a lot of duplicated work the compiler is able to eliminate when you’re able to combine things.