The Go compiler needs to be smarter

One of my favorite languages is the Go language. I love its simplicity. It is popular and useful in a cloud setting. Many popular tools are written in Go, and for good reasons.

I gave a talk on Go last year and I was asked for a criticism of Go. I do not mind that Go lacks exceptions or generics. These features are often overrated.

However, as charming as Go might be, I find that its compiler is not on par with what I expect from other programming languages. It could be excused when Go was young and immature. But I think that the Go folks now need to tackle these issues.

My first beef with the compiler is that it is shy about inlining. Inlining is the process by which you bring  a function into another function, bypassing the need for a function call. Even when the function call is inexpensive, inlining often brings many benefits.

Go improved its inlining strategies over time, but I would still describe it as “overly shy”.

Let me consider the following example where I first define a function which sums the element in an array, and then I call this function on an array I just defined:

func sum(x  []uint64) uint64 {
    var sum = uint64(0)
    for _, v := range x {
        sum += v
    }
    return  sum
}


func fun() uint64 {
    x := []uint64{10, 20, 30}
    return sum(x)
}

Whether you use Rust, Swift, C, C++… you expect a good optimizing compiler to basically inline the call to the ‘sum’ function and then to figure out that the answer can be determined at compile time and to optimize the ‘fun’ function to something trivial.

Not so in Go. It will construct the array and then call the ‘sum’ function.

In practice, it means that if you want good performance in Go, you often have to manually inline your functions. And I mean: fully inline. You have to write a really explicit function if you want the Go compiler to optimize the computation away, like so:

func fun3() uint64 { 
    x := [3]uint64{10001, 21, 31}
    return x[0] + x[1] + x[2] 
}

My second concern with the Go language is that it has no real concept of runtime constant variable. That is, you have compile-time constants but if you have a variable that is set once in the life of your program, and never change, Go will still treat it as if it could change. The compiler does not help you.

Let us take an example. Go has added nice function that give you access to fast processor instructions. For example, most x64 processors have a popcnt instruction that gives you the number 1-bit in a 64-bit word. It used to be that the only way to access this instruction in Go was by writing assembly. Thas been resolved. So let us put this code into action:

import "math/bits"

func silly() int {
    return  bits.OnesCount64(1) + bits.OnesCount64(2)
}
    

This function should return 2 since both values provided (1 and 2) have exactly one bit set. I bet that most C/C++ compilers can figure that one out. But we may excuse Go for not getting there.

Go needs to check, before using the popcnt instruction, that the processor supports it. When you start Go, it queries the processor and fills a variable with this knowledge. This could be done at compile-time but then your binary would crash or worse when run on a processor that does not support popcnt.

In a language with just-in-time compilation like Java or C#, the processor is detected at compile-time so no check is needed. In less fanciful languages like C or C++, the programmer needs to check what the processor supports themselves.

I can excuse Go for checking that popcnt is supported each and every time that the ‘silly’ function called. But that is not what Go does. Go checks it twice:

        cmpb    runtime.x86HasPOPCNT(SB), $0
        jeq     silly_pc115
        movl    $1, AX
        popcntq AX, AX
        cmpb    runtime.x86HasPOPCNT(SB), $0
        jeq     silly_pc85
        movl    $2, CX
        popcntq CX, CX

That is because the compiler does not trust, or cannot determine, that the variable ‘runtime.x86HasPOPCNT’ is a runtime constant.

Some people will object that such checks are inexpensive. I think that this view should be challenged:

  • As is apparent in the assembly code I provide, you might be doubling or at least increasing by 50% the number of instructions required. A comparison and a jump is cheap, but so is popcnt (some processors can retire two popcnt per cycle!). Increasing the number of instructions makes code slower.
  • It is true that the branch/jump is likely to be correctly predicted by the processor. This makes the guarding code much cheaper than a branch that could sometimes be mispredicted. But that does not mean that you are not getting hurt:

    Even when it is impossible to remove all branches, reducing the number of branches “almost always taken” or “almost never taken” may help the processor better predict the remaining branches.  (…) A possible simplified explanation for this phenomenon is that processors use the history of recent branches to predict future branches. Uninformative branches may reduce the ability of the processors to make good predictions.

Go’s saving grace is that it makes it easy to integrate assembly code into your code base. So you can write your performance-critical in C, compile it, and use the result in your Go project. That is how we do it in roaring, for example. People have ported the really fast Stream VByte encoding and the very fast simdjson parser in Go, again by using assembly. It works.

However, it leaves the bulk of the Go software running at a fraction of the performance it could reach with a great optimizing compiler.

Appendix: Compiling Go with gccgo solves these particular problems. However, reportedly, the overall performance of gccgo is worse.

Daniel Lemire, "The Go compiler needs to be smarter," in Daniel Lemire's blog, June 4, 2020, https://lemire.me/blog/2020/06/04/the-go-compiler-needs-to-be-smarter/.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

25 thoughts on “The Go compiler needs to be smarter”

  1. I spent some time messing with the compiler’s code generation for OnesCount64, and because I am sometimes a flake, I accidentally generated a variant in which it had the conditional branch, but didn’t have the actual popcount instruction.

    My microbenchmark was of a loop that unions two 1024-item slices of uint64, and possibly popcounts them. Time on my laptop came out to:

    450ns: union the slices, no popcount
    520ns: union the slices, popcount instruction with no branch
    600ns: union the slices, always-untaken branch, no popcount (oops)
    810ns: go’s normal code (branch plus popcount)

    What’s interesting to me is that the popcount instructions add 70ns to execution without the branch, but 210ns to execution with the branch for some reason. I don’t know why, but the net result is that with those branches, the function goes from 450->810ns if I do the popcounts (probably too expensive and unacceptable, better to do the popcounts in a separate pass after multiple runs of this). But if it were 450->520, I’d probably do the popcounts in every pass and drop all the additional complexity…

    1. The generated assembly is better, since the value is loaded in a register once… but Go still checks the register value each and every time… it is depressing. I realize that these are hard problems, but most other optimizing compilers would do better.

      1. I gather you next steps will be a Part 2 blog posting outlining the details of your PR to address these complaints?

  2. https://github.com/golang/go/issues/17566 is an active issue about improving the in-lining cost model that has been open for 4 years. It actually references a previous blog post of yours from 2017 https://lemire.me/blog/2017/09/05/go-does-not-inline-functions-when-it-should/ A brief summary of that issue is that one reason in-lining is poor is that it uses a crude heuristic applied to the AST rather than SSA form of the code when more information is available. There is also a lot of debate around the trade-off between increased code size and improved performance with in-lining.

    For the popcount issue, even with the recognition that the feature detection for popcount is a runtime constant, it is still a hard problem to decide on the trade-off between code size and performance. For optimum performance you would want to transform code like.

    for (hot loop):
    if (Likely) hasPopCount:
    usePopCountInst;
    else (Unlikely):
    useFallBack;

    to

    if (Likely) hasPopCount:
    for (hot loop):
    usePopCountInst;
    else (Unlikely):
    for (hot loop):
    useFallBack;

    That is, hoist the check out of the loop and duplicate the hot loop and maybe even move the unlikely hot loop into a cold seciton of code to avoid polluting the instruction cache. If you decide to go this path, now there is a decision for how high to hoist the feature check (only very small loops? all the way up to the whole function?), which affects the generated code size. I’m not aware of other approaches for a single binary to efficiently support different instructions at run time, are you?

    Also, the go implementation of Stream-VByte that you mention is actually a pure go implementation. I have a version that implements the SSE3 version from your paper at https://github.com/bmkessler/streamvbyte Go does make it fairly easy to implement assembly optimizations. Although the function call overhead due to not being able to in-line functions written in assembly means you can’t just use short snippets of assembly for good performance.

    1. even with the recognition that the feature detection for popcount is a runtime constant, it is still a hard problem to decide on the trade-off between code size and performance.

      You make good points.

      Just so we are clear, I realize that these problems are hard.

      I’m not aware of other approaches for a single binary to efficiently support different instructions at run time, are you?

      In theory, you could use a fat binary approach where you have multiple versions of the code, compiled for multiple platforms, and then you select the right one at runtime. That is what simdjson does. As far as I can tell, it is never done automatically.

      Also, the go implementation of Stream-VByte that you mention is actually a pure go implementation. I have a version that implements the SSE3 version from your paper at https://github.com/bmkessler/streamvbyte

      I corrected the link.

      Go does make it fairly easy to implement assembly optimizations.

      Yes. With the caveat that you need to do it for sizeable functions.

      That is precisely why I still think that Go is underrated. Despite all my criticism, I believe it got many things right.

  3. Following are some other frustration I have with go compiler

    Verbose lambda syntax.
    No tail recursion optimization
    Non exhaustive switch. This one might be tricky with duck typing type system, but some low handing fruits can be picked.

    I don’t agree with Go’s core belief of fast compile time. I believe adding a few seconds in the compile phase is worth it to make the language safer and less verbose

  4. A good program is explicit in what it intends to do. An important quality. A competent programmer knows which functions are pure and may choose to move those with constant arguments to an initialisation step or to pre-compute constants.

  5. Perhaps this is an unintentional design trade off. Go does not make a strong distinction between debug and release builds. Code optimizations belong in the release build and logically introduce some degree of overhead that slows compilation in the frequent debug builds.

      1. With more and more CPUs with many cores coming onto the market (ARM/AWS Graviton2 with 64 cores, AMD EPYC with 48 cores), in an ideal world it would be both possible to keep the crucial-for-productivity short build times as well as devote more processing resources to smarter optimizations.

        I’d be happy to pay the price of a cup of coffee per day for this (versus running on a large or even medium instance since it doesn’t matter much anyways for development).

  6. Excellent blog post.

    As a rule of thumb, I expect the Go software I write to be ~2/3 as fast as what I could write in C. I’ve used cgo and assembly to do better, but neither is currently good enough for me to want to use them in personal projects; I’m still more productive writing everything in C/C++ whenever it’s worth the effort to do better than 2/3-speed. But this isn’t due to some fundamental limitation of the language, it’s almost entirely driven by compiler weaknesses like the ones you describe here.

  7. A comment on reading comments:
    When I click on “Comments” at the end of the email, I go to the start of the article instead of the comments. I then have to scroll through the already-read article to get to the comments.
    A small thing, but I find it annoying.
    Using Brave on OS X 10.13.6.

  8. What are the drawbacks to too aggressively in-lining code?

    Binary size seems like the most obvious thing to me, but are there performance risks from it?

    1. The two big drawbacks are bigger binaries and slower compilation; the effects there are very noticeable. And surveys of gophers consistently show more concern about binary size and compilation speed than performance.

      Binary size can actually improve due to increased and/or smarter inlining (although it often does not). Compilation speed never does. Disabling inlining speeds up compilation by 10%. The challenge is not just to increase inlining, which is easy, but to use our (implicit) inlining budget better, which is not easy.

      In my experiments, performance degradation to increased inlining is rare but possible, at least on amd64. One extreme example occurs with helper functions in machine-generated code. The compiler itself uses many of these, and increasing inlining can cause these to blow up, causing noticeable performance regressions.

            1. Not that it matters, really–we’re just two people chatting in the comments of a blog–but I’m personally more open to annotations and hints now than I was when I first replied on that issue. The no-knobs philosophy is predicated on having the toolchain eventually be good enough that the knobs don’t buy you much.

              I’ve come to believe that there are a few places where toolchain progress is extremely difficult due to trade-offs vs:

              compiler performance (e.g. powerful BCE)
              binary size (e.g. inlining)
              usability (e.g. FGO/PGO)
              available compiler programmer hours (lots of things)
              difficult API design (e.g. vector instructions without assembly)
              stability (e.g. if we change inlining heuristics and 90% of programs benefit and 30% regress we will spend an untold amount of time and energy dealing with those regressions)

              In some such cases, I’m increasingly open to providing a pressure release valve such as inlining annotations or unsafe.Unreachable (https://github.com/golang/go/issues/30582). However, I am not the decision maker; not even close.

              Knobs notwithstanding, I don’t think the dream of better inlining in Go is dead, despite the widespread despair. If we move inlining later in the compiler, we will be in a position to do much better. And it will be easier to improve the heuristics. And it’ll provide enough disruption that we can sneak in a bunch of other improvements under the same umbrella.

              As long as I’m writing a novel, I might mention that one source of despair is that to a first approximation no one is even working on inlining. I don’t know why the Go team has not prioritized that; I have very little insight into their decisions. I can say that I am not because making meaningful progress requires more time more consistently than I can dedicate as a just-for-fun volunteer. I will note, though, that “no one is working on it” is only true until it is not. 🙂

  9. This is exactly my complaint about Go. I posted about it two months ago: https://www.reddit.com/r/golang/comments/fp9mrk/how_much_performance_headroom_is_left_longer/

    They ruthlessly optimize for compile time, which values the programmer’s time over the users’ time. Since there might be thousands or millions of users, and a program might run millions of times, this never made any sense to me.

    I want to let applications compile over a weekend for maximum optimization if it’s something that will run thousands or millions of times, or run for thousands of hours. It would be great to have an alternative compiler for Go, a true and modern optimizing compiler. I’ve thought about the business prospects for one, a proprietary compiler and toolchain – seems like there’s a market for one, and the Google compiler just isn’t very good. It doesn’t leverage modern CPUs, and it doesn’t seem to support Whole Program Optimization / LTO, or Profile Guided Optimization. LTO is less relevant for Go than C/C++, but there are still some opportunities.

Leave a Reply

Your email address will not be published.

You may subscribe to this blog by email.