One important note that I wish would be made in articles like these is that the code presented is explicitly not ISO C. E.g., alloca() is non-standard, sigaltstack() is not specified by ISO C but rather by POSIX, etc.
Using these things is, of course, fine. But by using them you are writing code that is specific to the platforms that satisfy your assumptions. The best way to document this, in my opinion, would be to add a note to each proposed solution listing which feature-test macros have to be satisfied in order to use it.
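As a rough illustration of that suggestion, a per-snippet guard might look something like the sketch below; the macros tested here are assumptions on my part and vary by platform and libc, so treat it as an example of the kind of note being proposed, not a definitive list.

```c
/* Sketch only: illustrative feature checks, not an exhaustive or portable list. */
#if defined(_WIN32)
#  include <malloc.h>        /* MSVC spells the stack-allocation extension _alloca() */
#  define alloca _alloca
#else
#  include <unistd.h>        /* POSIX systems define _POSIX_VERSION here */
#  ifndef _POSIX_VERSION
#    error "sigaltstack() and mmap() require POSIX; this code is not plain ISO C"
#  endif
#  include <alloca.h>        /* common, though not universal, home of alloca() */
#endif
```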
A great post though; with the end of Intel’s tick-tock, I’m hoping for a resurgence of care in the world of programming for performance, and these kinds of posts helping guide programmers are a necessity!
This sounds interesting and I’d love to learn more about it, but I don’t quite get it. Could you unpack it a bit? Thanks!
Sure. If you are not familiar with Intel’s Tick-Tock, the tl;dr is that since 2007 Intel has tried to keep an annual release cadence, alternating between shrinking its fabrication process (tick) and overhauling its microarchitecture (tock). This has served as one of the more real-world measures that people use when they discuss the most recent redefinitions of Moore’s Law (it tended to coincide with a near-doubling of transistor counts every two years).
This year, Intel announced that it is retiring Tick-Tock in favor of a three-step cycle, more descriptively named “process-architecture-optimization”. We have been headed in this direction for a while (each tick was taking longer and longer to complete, so the cycle was already stretching out); it is just that the elongation and the change in strategy are now official.
The moral of the story is that the old adage that “computers double in speed every two years” (never something anyone should have relied on anyway) should be considered over, at least under our current programming and optimization strategies.
While I am sure we will eventually find another source of performance gains (quantum? easily accessible mass multithreading? something else?), I am hoping this leads to a resurgence of programmers caring about single-threaded performance. People have told me so many times that “Python is fast enough” or that “there is no need to optimize code because, in a couple of years, all computers will be able to handle what I write now”. Perhaps, now that I can (admittedly, with some exaggeration) say that “Moore’s Law is dead”, people will start caring about performance again.
New law: mobile phones double in cores every two years.

Excellent, thank you so much for elaborating. I had heard about the tick-tock rhythm before but had no idea it was officially being deprecated. Really cool stuff.
The industry has plateaued in performance for a while. A Core 2 Duo from 2006 can run Windows 10 absolutely fine. Modern sites with bloated JS are unfortunately more of a stretch, but it can handle the web cromulently. (Source: My gaming PC still uses Core 2 Duo.)
The motivation for this is lower-end and/or mobile hardware; netbooks and tablets were a major driver of the plateau.
That’s all true, but every mainstream platform I know of (thankfully) has equivalents of these. Developers working on OS X/iOS should be looking at NSZone and friends, while Windows developers likely want to look at the HeapCreate/VirtualAlloc family of functions.
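For the Windows side, a dedicated heap for one structure might be set up roughly as in the sketch below; the sizes and names are arbitrary, it is just meant to show the shape of the HeapCreate/HeapAlloc API.

```c
/* Sketch: give one data structure its own growable Win32 heap so its churn
 * stays out of the process default heap. Sizes here are arbitrary. */
#include <windows.h>

int main(void)
{
    HANDLE arena = HeapCreate(0, 1 << 20 /* initial size */, 0 /* 0 = growable */);
    if (arena == NULL)
        return 1;

    int *nodes = HeapAlloc(arena, HEAP_ZERO_MEMORY, 1024 * sizeof *nodes);
    if (nodes != NULL) {
        nodes[0] = 42;
        HeapFree(arena, 0, nodes);
    }

    HeapDestroy(arena);   /* tears down everything allocated from this heap at once */
    return 0;
}
```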
There’s some good advice here, but note that brk/sbrk are fairly dated, and you may well be better off making your own stack out of mmap. That way you can have multiple stack-alloc arenas and you won’t pay a syscall penalty on every alloc/dealloc.
Also a note on the multi-threaded alloc: your caches will thank you for using thread-local pools.
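A minimal sketch of both ideas together, assuming a POSIX mmap with MAP_ANONYMOUS and a C11 compiler for _Thread_local: one anonymous mapping per thread, used as a LIFO bump arena, so there is one syscall up front and plain pointer arithmetic (with no cross-thread cache traffic) afterwards.

```c
/* Sketch: a per-thread stack (LIFO) arena carved out of a single mmap'd region. */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

typedef struct { uint8_t *base, *top, *end; } arena;

static _Thread_local arena tl_arena;            /* each thread owns its own arena */

static int arena_init(size_t cap)
{
    void *p = mmap(NULL, cap, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    tl_arena.base = tl_arena.top = p;
    tl_arena.end  = tl_arena.base + cap;
    return 0;
}

static void *arena_push(size_t n)
{
    n = (n + 15) & ~(size_t)15;                 /* keep 16-byte alignment */
    if ((size_t)(tl_arena.end - tl_arena.top) < n)
        return NULL;                            /* arena exhausted */
    void *p = tl_arena.top;
    tl_arena.top += n;
    return p;
}

static void arena_pop_to(void *mark)            /* LIFO "free": roll back to a mark */
{
    tl_arena.top = mark;
}

static void arena_destroy(void)
{
    munmap(tl_arena.base, (size_t)(tl_arena.end - tl_arena.base));
}
```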
One major limitation of malloc (even in the best implementations, like jemalloc and dlmalloc) is that it is a single allocator serving every data structure in your program. This is a mistake: a huge performance gain can be had by using a separate allocator for each of your data structures, or rather, for each of your data usage patterns.
I’m interested to read more about this, any link suggestion?
http://paperswelove.org/2015/video/ryan-zezeski-memory-by-the-slab/
Long story short - if you know how large the type is, you can make a lot of assumptions that reduce fragmentation.
Besides tighter packing, you have better control over object lifetime. If you have a foo and a bar type of equal size, and you alternately allocate a bajillion of each, then free all the foo, that doesn’t really free up any memory for objects of a different size.
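As a toy illustration of what both of these replies describe (not the approach from the talk linked above), here is a trivial fixed-size pool; one pool per type means foo frees go straight back to the foo pool, with no per-block headers and no fragmentation across size classes. The foo/bar names are hypothetical.

```c
/* Sketch: a trivial fixed-size pool. Every block in a pool has the same size,
 * so freed blocks are recycled by a single pointer push/pop, and freeing every
 * foo leaves the foo pool full of slots ready for the next foo. */
#include <stdlib.h>

struct slot { struct slot *next; };

typedef struct {
    struct slot *free_list;   /* recycled blocks of this pool's size */
    size_t       size;        /* fixed object size; must be >= sizeof(struct slot) */
} pool;

static void *pool_alloc(pool *p)
{
    if (p->free_list) {                 /* fast path: pop a recycled slot */
        struct slot *s = p->free_list;
        p->free_list = s->next;
        return s;
    }
    return malloc(p->size);             /* slow path: grow the pool via malloc */
}

static void pool_free(pool *p, void *obj)
{
    struct slot *s = obj;               /* push the block back onto this pool's list */
    s->next = p->free_list;
    p->free_list = s;
}

/* One pool per type, e.g.
 *   static pool foo_pool = { NULL, sizeof(struct foo) };
 *   static pool bar_pool = { NULL, sizeof(struct bar) };
 */
```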
You might also find the memory allocation discussions in Jason Gregory’s Game Engine Architecture interesting. Even if you’re not interested in making games per se, there’s a lot of useful information about memory allocation performance and techniques for doing it very efficiently that apply well beyond the games field.