In compiler implementation, "global" optimizations are those that apply to an entire procedure at a time, in contrast to "local" optimizations, which operate on individual basic blocks only. After some work, float unboxing is the latest optimization which is now done globally in Factor.
Factor's float unboxing optimization used to be part of
local value numbering. Originally I planned on implementing global value numbering and trying to do float unboxing there, but since then I discovered a better way. I also improved the SSA coalescing pass I talked about
last time, and implemented some optimizations which help compile the code in the
math.vectors vocabulary more efficiently.
Global float unboxing
Instead of modeling float unboxing as an algebraic simplification problem (where
unbox(box(x))
can be rewritten as
x
) I now model it as a representation problem, ie, the optimizer makes a global decision, for every virtual register, whether to store it as a tagged pointer, or an unboxed float.
When the low-level IR is built, no boxing or unboxing operations are inserted, and the compiler assumes all floating point values will end up unboxed. The new
compiler.cfg.representations pass is responsible for inserting just enough boxing and unboxing operations to make the resulting code valid. This pass consists of several step.
Representations and register classes
A representation is like a low-level data type associated to a virtual register. At this stage, possible repesentations are tagged pointers, single-precision floats, and double-precision floats. Note that fixnums (which are stored directly inside a pointer, rather than being heap allocated) are considered as tagged pointers, and that single-precision floats are only used by the FFI. In the future, I will also have a representation type for untagged integers (so that Factor code can perform arithmetic on 32 and 64-bit numbers without boxing them) and various SSE vector types.
A register class is a set of machine registers that instructions can operate on. Every representation has a register class, and values in this representation must be stored in registers of that class. Multiple representations can share a register class; for example, single and double precision floats all go into floating point registers. The register allocator deals with register classes.
When the low-level IR is built, virtual registers do not have representations assigned to them. Instead, this mapping is established later, by the
compiler.cfg.representations pass.
First step: compute possible representations for each virtual register
The first step to deciding what representation to use for a virtual register is to take stock of the possibilities. The
compiler.cfg.representations.preferred vocabulary defines words which output, for each instruction, the preferred representation of the input values, and the preferred representation of the output value, if any.
For example, the
##slot
instruction, which accesses an object's slot, prefers that its inputs and outputs be tagged pointers. On the other hand,
##add-float
prefers that its inputs and outputs are unboxed floats (for otherwise, boxing is required), and
##integer>float
takes an integer but outputs an unboxed float.
This information is collected into a single mapping, from virtual registers to sequences of representations.
Second step: computing costs for different representations
The next step is to compute the cost of keeping each virtual register in every possible representation for that register. The compiler uses a simple cost heuristic. The cost of keeping a register in a given representation
R
is obtained by iterating over all instructions that use that register, but expect to receive the value in a different representation from
R
. For each such instruction, its loop nesting depth is added up. Finally, if the representation
R
differs from the output representation of the instruction that computed the register, the loop nesting is added to the tally. This gives an approximation of the overhead of boxing and unboxing the value.
To calculate the loop nesting depth of an instruction, I use the standard natural loop detection algorithm. This is implemented in
compiler.cfg.loop-detection. The details can be found in any moderately advanced compiler textbook; for each basic block, it identifies a set of loops this basic block belongs to (this set may be empty). A loop consists of a header, a set of more or more end blocks, and a set of blocks making up this loop. Multiple loops can share a header, but each loop can only have a single header because all control flow graphs that the low-level optimizer constructs are reducible (at least I think so).
Third step: representation selection
Once costs have been computed, it is easy to assign to each virtual register the representation that has the least cost. There is one subtlety here though; if an instruction outputs a tagged pointer, no other representation may be used for that virtual register. For example, in this code, it would not be sound to pick an unboxed float representation for the input parameter, since in one branch, it is used as a fixnum, and trying to perform a float unboxing on a fixnum will just crash:
: silly-word ( x -- y )
{
{ [ dup fixnum? ] [ 1 + ] }
{ [ dup float? ] [ 3 * ] }
} cond ;
The problem here is that in low-level IR, the
##peek
instruction which loads the value from the data stack outputs a tagged value, and
a priori the compiler cannot assume its going to be a boxed float that can be safely unboxed -- only the output of a float-producing instruction can be unboxed. A more complicated unboxing algorithm based on control-dependence was suggested to me by
Daniel Ehrenberg. However, because the low-level IR is single-assignment, the potential overhead from the current unboxing scheme is not so great, and so I'm not worried about it right now.
Fourth step: inserting boxing and unboxing instructions
Once the compiler has determined the representation to use for each virtual register, it must now iterate over all instructions again, this time, inserting conversions whenever a virtual register is used with a representation other than the one that was chosen for it. This pass introduces new temporaries, and renames virtual registers used and defined in instructions, in order to preserve single assignment. Other than that, it is relatively straightforward.
Live-range coalescing revisited
Last time I mentioned that I had implemented a rather complex algorithm for converting out of SSA form and doing copy coalescing at the same time. After having some problems getting the algorithm working in certain corner cases, I decided to scrap it and implement something simpler instead. My new SSA coalescing algorithm uses the classical interference graph approach, except it doesn't actually have to build the interference graph, since interference testing on SSA form can be done in O(1) time.
The register allocator used to do its own coalescing, to eliminate copies inserted by the two-operand conversion pass. The two-operand pass converts
x = y + z
into
x = y; x = x + z
(and similarly for all other arithmetic operations). This makes arithmetic operations fit the x86 instruction encoding scheme, where the destination register is also the first register. Implementing copy coalescing twice, in two different ways, seemed inelegant to me. A big breakthrough was when I was able to get my SSA coalescing algorithm working on a relaxed SSA form, where values can be reassigned more than once, but only within the same basic block where they are defined. This means that the two-operand conversion pass can run before SSA coalescing, and the register allocator does not need to perform coalescing at all.
The new coalescing algorithm was not noticably slower than the old one (it still benefits from not having to construct an interference graph), the code is much simpler, and the register allocator's copy coalescing code is gone. The compiler feels cleaner as a result.
Improved tuple unboxing
I implemented
escape analysis for tuple unboxing last year and recently came up with a way to improve it. Whereas previously, only freshly-allocated tuples could be unboxed, I realized that if an input parameter to a word is known to be an instance of an immutable tuple class, and the input parameter does not escape (ie, not passed to a subroutine call, or returned) then it can be unboxed, too. Whereas unboxing a freshly-allocated tuple involves deleting the allocation call, and replacing slot access with references to the inputs to the allocation call, unboxing an input parameter is a bit trickier. Actual code has to be inserted at the beginning of the word, which unpacks the tuple. This optimization is only a win if the same slot is accessed more than once within a word; in particular, its a win if the slot access occurs inside a loop. This situation came up with
specialized arrays; a specialized array is a tuple with two slots, a byte array, and a length. To access the nth element of a specialized array involves an indirection, because the byte array has to be retrieved. By unboxing the specialized array before a loop, a slot access is eliminated from the loop. This gives the high-level optimizer a form of loop-invariant code motion (real loop-invariant code motion might appear in the low-level optimizer at some point, too).
Vector operations on specialized arrays
I used to use
hints to make the
math.vectors vocabulary faster. While vector words could be used on any sequence type, hints would automatically compile a specialized version of each vector word for the
double-array
type. This gave a nice speedup on benchmarks which performed vector operations on arrays of double-precision floats, but it didn't help at all for vector operations on other specialized arrays, and there was still at least one runtime dispatch per vector operation, to determine if the generic or hinted version should be used.
While hints have their place, I don't use them for vector words anymore. Instead, I cooked up some special compiler extensions for them (easier than it sounds; the compiler is pretty extensible, at least the pass that I needed to extend for this). Now, every specialized array type compiles a specialized set of vector words, and if the input types of a vector word are known at compile time, the right specialized version is used instead. Of course there's no actual code duplication here, the code is generated at runtime.
Benchmark results
The above improvements, as well as the changes I outlined in my
previous post, give a nice speedup since
the last time I did some benchmarks.
Benchmark | May 31st | July 17 | Aug 10 |
benchmark.backtrack | 1.767561 | 1.330641 | 1.111011 |
benchmark.base64 | 1.997951 | 1.738677 | 1.536447 |
benchmark.beust1 | 2.765257 | 2.461088 | 2.360084 |
benchmark.beust2 | 3.584958 | 1.694427 | 1.691499 |
benchmark.binary-search | 1.55002 | 1.574595 | 1.585475 |
benchmark.binary-trees | 1.845798 | 1.733355 | 1.771849 |
benchmark.bootstrap1 | 10.860492 | 11.447687 | 9.962971 |
benchmark.dawes | 0.229999 | 0.161726 | 0.104693 |
benchmark.dispatch1 | 2.015653 | 2.119268 | 2.252401 |
benchmark.dispatch2 | 1.817941 | 1.216618 | 1.629185 |
benchmark.dispatch3 | 2.568987 | 1.899128 | 2.692071 |
benchmark.dispatch4 | 2.319587 | 2.032182 | 2.073563 |
benchmark.dispatch5 | 2.346744 | 1.614045 | 1.369721 |
benchmark.empty-loop-0 | 0.146716 | 0.12589 | 0.12608 |
benchmark.empty-loop-1 | 0.430314 | 0.342426 | 0.351241 |
benchmark.empty-loop-2 | 0.429012 | 0.342097 | 0.350182 |
benchmark.euler150 | 16.901451 | 15.288867 | 13.046828 |
benchmark.euler186 | 8.805434999999999 | 7.920478 | 7.997483 |
benchmark.fannkuch | 3.202698 | 2.964037 | 2.859117 |
benchmark.fasta | 5.52608 | 4.934112 | 5.135706 |
benchmark.gc0 | 2.15066 | 1.993158 | 2.03638 |
benchmark.gc1 | 4.984841 | 4.961272 | 5.075487 |
benchmark.gc2 | 3.327706 | 3.265462 | 3.376618 |
benchmark.iteration | 3.736756 | 3.30438 | 3.22603 |
benchmark.javascript | 9.79904 | 9.164517 | 9.000165000000001 |
benchmark.knucleotide | 0.282296 | 0.251879 | 0.231547 |
benchmark.mandel | 0.125304 | 0.123945 | 0.123321 |
benchmark.md5 | 0.946516 | 0.85062 | 0.84516 |
benchmark.nbody | 3.982774 | 3.349595 | 2.204085 |
benchmark.nested-empty-loop-1 | 0.116351 | 0.135936 | 0.053609 |
benchmark.nested-empty-loop-2 | 0.692668 | 0.438932 | 0.390032 |
benchmark.nsieve | 0.714772 | 0.698262 | 0.694443 |
benchmark.nsieve-bits | 1.451828 | 0.907247 | 0.727394 |
benchmark.nsieve-bytes | 0.312481 | 0.300053 | 0.225218 |
benchmark.partial-sums | 1.205072 | 1.221245 | 1.152942 |
benchmark.random | 2.574773 | 2.706893 | 2.403702 |
benchmark.raytracer | 3.481714 | 2.914116 | 2.428384 |
benchmark.recursive | 5.964279 | 3.215277 | 3.178855 |
benchmark.regex-dna | 0.132406 | 0.093095 | 0.08619400000000001 |
benchmark.reverse-complement | 3.811822 | 3.257535 | 3.003156 |
benchmark.ring | 1.756481 | 1.79823 | 1.696271 |
benchmark.sha1 | 2.267648 | 1.473887 | 1.43003 |
benchmark.sockets | 8.794280000000001 | 8.783398 | 8.643011 |
benchmark.sort | 0.421628 | 0.363383 | 0.367602 |
benchmark.spectral-norm | 3.830249 | 4.036353 | 3.277558 |
benchmark.stack | 2.086594 | 1.014408 | 0.91959 |
benchmark.sum-file | 0.528061 | 0.422194 | 0.383183 |
benchmark.tuple-arrays | 0.127335 | 0.103421 | 0.08908199999999999 |
benchmark.typecheck1 | 0.876559 | 0.6723440000000001 | 0.744205 |
benchmark.typecheck2 | 0.878561 | 0.671624 | 0.770056 |
benchmark.typecheck3 | 0.86596 | 0.670099 | 0.699071 |
benchmark.ui-panes | 0.426701 | 0.372301 | 0.369152 |
benchmark.xml | 2.351934 | 2.187999 | 1.630145 |