Saturday, May 29, 2010

Comparing Factor's performance against V8, LuaJIT, SBCL, and CPython

Together with Daniel Ehrenberg and Joe Groff, I'm writing a paper about Factor for DLS2010. We would appreciate feedback about the draft version of the paper. As part of the paper we include a performance comparison between Factor, V8, LuaJIT, SBCL, and Python. The performance comparison consists of some benchmarks from the The Computer Language Benchmarks Game. I'm posting the results here first, in case there's something really stupid here.

Language implementations

Factor and V8 were built from their respective repositories. SBCL is version 1.0.38. LuaJIT is version 2.0.0beta4. CPython is version 3.1.2. All language implementations were built as 64-bit binaries and run on an 2.4 GHz Intel Core 2 Duo.

Benchmark implementations

Factor implementations of the benchmarks can be found in our source repository:

Implementations for the other languages can be found at the language benchmark game CVS repository:

LuaJITSBCLV8CPython
binary-trees binarytrees.lua-2.luabinarytrees.sbcl binarytrees.javascript binarytrees.python3-6.python3
fasta fasta.lua fasta.sbcl fasta.javascript-2.javascript fasta.python3-2.python3
knucleotide knucleotide.lua-2.luaknucleotide.sbcl-3.sbcl knucleotide.javascript-3.javascriptknucleotide.python3-4.python3
nbody nbody.lua-2.lua nbody.sbcl nbody.javascript nbody.python3-4.python3
regex-dna regexdna.sbcl-3.sbcl regexdna.javascript regexdna.python3
reverse-complement revcomp.lua revcomp.sbcl revcomp.javascript-2.javascript revcomp.python3-4.python3
spectral-norm spectralnorm.lua spectralnorm.sbcl-3.sbclspectralnorm.javascript spectralnorm.python3-5.python3

In order to make the reverse complement benchmark work with SBCL on Mac OS X, I had to apply this patch; I don't understand why:

--- bench/revcomp/revcomp.sbcl 9 Feb 2007 17:17:26 -0000 1.4
+++ bench/revcomp/revcomp.sbcl 29 May 2010 08:32:19 -0000
@@ -26,8 +26,7 @@

(defun main ()
(declare (optimize (speed 3) (safety 0)))
- (with-open-file (in "/dev/stdin" :element-type +ub+)
- (with-open-file (out "/dev/stdout" :element-type +ub+ :direction :output :if-exists :append)
+ (let ((in sb-sys:*stdin*) (out sb-sys:*stdout*))
(let ((i-buf (make-array +buffer-size+ :element-type +ub+))
(o-buf (make-array +buffer-size+ :element-type +ub+))
(chunks nil))
@@ -72,4 +71,4 @@
(setf start 0)
(go read-chunk))))
end-of-input
- (flush-chunks)))))))
+ (flush-chunks))))))

Running the benchmarks

I used Factor's deploy tool to generate minimal images for the Factor benchmarks, and then ran them from the command line:

./factor -e='USE: tools.deploy "benchmark.nbody-simd" deploy'
time benchmark.nbody-simd.app/Contents/MacOS/benchmark.nbody-simd

For the scripting language implementations (LuaJIT and V8) I ran the scripts from the command line:

time ./d8 ~/perf/shootout/bench/nbody/nbody.javascript -- 1000000
time ./src/luajit ~/perf/shootout/bench/nbody/nbody.lua-2.lua 1000000

For SBCL, I did what the shootout does, and compiled each file into a new core:

ln -s ~/perf/shootout/bench/nbody/nbody.sbcl .

cat > nbody.sbcl_compile <<EOF
(proclaim '(optimize (speed 3) (safety 0) (debug 0) (compilation-speed 0) (space 0)))
(handler-bind ((sb-ext:defconstant-uneql (lambda (c) (abort c))))
(load (compile-file "nbody.sbcl" )))
(save-lisp-and-die "nbody.core" :purify t)
EOF

sbcl --userinit /dev/null --load nbody.sbcl_compile

cat > nbody.sbcl_run <<EOF
(proclaim '(optimize (speed 3) (safety 0) (debug 0) (compilation-speed 0) (space 0)))
(main) (quit)
EOF

time sbcl --dynamic-space-size 500 --noinform --core nbody.core --userinit /dev/null --load nbody.sbcl_run 1000000

For CPython, I precompiled each script into bytecode first:

python3.1 -OO -c "from py_compile import compile; compile('nbody.python3-4.py')"

Benchmark results

All running times are wall clock time from the Unix time command. I ran each benchmark 5 times and used the best result.

FactorLuaJITSBCL V8 CPython
fasta 2.597s1.689s2.105s3.948s 35.234s
reverse-complement2.377s1.764s2.955s3.884s 1.669s
nbody 0.393s0.604s0.402s4.569s 37.086s
binary-trees 1.764s6.295s1.349s2.119s 19.886s
spectral-norm 1.377s1.358s2.229s12.227s1m44.675s
regex-dna 0.990sN/A 0.973s0.166s 0.874s
knucleotide 1.820s0.573s0.766s1.876s 1.805s

Benchmark analysis

Some notes on the results:

  • There is no Lua implementation of the regex-dna benchmark.
  • Some of the SBCL benchmark implementations can make use of multiple cores if SBCL is compiled with thread support. However, by default, thread support seems to be disabled on Mac OS X. None of the other language implementations being tested have native thread support, so this is a single-core performance test.
  • Factor's string manipulation still needs work. The fasta, knucleotide and reverse-complement benchmarks are not as fast as they should be.
  • The binary-trees benchmark is a measure of how fast objects can be allocated, and how fast the garbage collector can reclaim dead objects. LuaJIT loses big here, perhaps because it lacks generational garbage collection, and because Lua's tables are an inefficient object representation.
  • The regex-dna benchmark is a measure of how efficient the regular expression implementation is in the language. V8 wins here, because it uses Google's heavily-optimized Irregexp library.
  • Factor beats the other implementations on the nbody benchmark because it is able to make use of SIMD.
  • For some reason SBCL is slower than the others on spectral-norm. It should be generating the same code.
  • The benchmarks exercise insufficiently-many language features. Any benchmark that uses native-sized integers (for example, an implementation of the SHA1 algorithm) would shine on SBCL and suffer on all the others. Similarly, any benchmark that requires packed binary data support would shine on Factor and suffer on all the others. However, the benchmarks in the shootout mostly consist of scalar floating point code, and text manipulation only.

Conclusions

Factor's performance is coming along nicely. I'd like to submit Factor to the computer language shootout soon. Before doing that, we need a Debian package, and the deploy tool needs to be easier to use from the command line.

Monday, May 10, 2010

A collection of small compiler improvements

One of the big optimizations I'm planning on implementing in Factor is support for computations on machine-word-sized integers. The motivation is as follows. While code that operates on small integers (fixnums) does not currently allocate memory for intermediate values, and as a result can compile very efficiently if type checks are eliminated, sometimes fixnum precision is not quite enough. Using bignums in algorithms such as SHA1 that require 32-bit or 64-bit arithmetic incurs a big performance hit over writing the code in a systems language. My plan is to have the compiler support machine integers as an intermediate step between fixnums and bignums. Machine integers would be boxed and unboxed at call boundaries, but tight loops operating on them will run in registers.

While I haven't yet implemented the above optimization, I've laid the groundwork and made it possible to represent operations on untagged integers at the level of compiler IR at least. While unboxed floats do not cause any complications for the GC, unboxed integers do, because they share a register file with tagged pointers. To make Factor's precise GC work, the compiler can now distinguish registers containing tagged pointers from those containing arbitrary integer data, by breaking down the register class of integer registers into two "representations", tagged and untagged.

Since the above reworking touched many different parts of the compiler, I also took the time to implement some minor optimizations and clean up some older code. These improvements will be the topic of this post.

To understand this post better, you should probably skim the list of low-level IR instructions defined in compiler.cfg.instructions first. Also, feel free to post comments here or e-mail me questions about the design of the Factor compiler.

Improvements to value numbering

In the x86 architecture, almost any instruction can take a memory operand, and memory operands take the form (base + displacement * scale + offset), where base and displacement are registers, offset is a 32-bit signed integer, and scale is 1, 2, 4 or 8. While using a complex addressing mode is not any faster than writing out the arithmetic by hand with a temporary register used to store the final address, it can reduce register pressure and code size.

The compiler now makes better use of complex addressing modes. Prior to optimization, the low level IR builder lowers specialized array access to the ##load-memory-imm and ##store-memory-imm instructions. For example, consider the following code:

USING: alien.c-types specialized-arrays ;
SPECIALIZED-ARRAY: float

: make-data ( -- seq ) 10 iota [ 3.5 + sqrt ] float-array{ } map-as ;
The inner loop consists of the following low level IR:
##integer>float 23 14
##sqrt 24 23
##load-integer 25 2
##slot-imm 26 15 2 7
##load-integer 27 4
##mul 28 14 27
##tagged>integer 29 26
##add-imm 30 29 7
##add 31 30 28
##store-memory-imm 24 31 0 float-rep f
##load-integer 32 1
##add 33 14 32

The ##mul, ##add-imm and ##add instructions compute the address input to ##store-memory-imm. After value numbering, these three instructions have been fused with ##store-memory-imm to form ##store-memory:

##integer>float 62 57
##sqrt 63 62
##slot-imm 65 48 2 7
##tagged>integer 68 65
##store-memory 63 68 57 2 7 float-rep f
##add-imm 72 57 1

Optimistic copy propagation

While the low-level optimizer's value numbering and alias analysis passes only operate on individual basic blocks (local optimizations), the copy propagation pass is global, meaning it takes the entire procedure being compiled into account.

Eventually I will merge the local alias analysis and value numbering passes with the copy propagation pass to get an optimization known as "global value numbering with redundant load elimination". One step along this road is a rewrite that makes the copy propagation pass optimistic, in the following sense.

Copy propagation concerns itself with eliminating ##copy instructions, which correspond to assignments from one SSA value to another (new) value:

y = x

When such an instruction is processed, all usages of the value y can be replaced with the value x, in every basic block, and the definition of y deleted.

A simple extension of the above treats a ##phi instruction where all inputs are equal the same as a copy. This optimizes cases such as:

 a = ...
if(...)
{
b = a
}
c = phi(b,a)

The case of ##phi instructions where one of the inputs is carried by a back edge in the control flow graph (ie, a loop) is where the optimistic assumption comes in. Consider the following code, where the definition of x'' is carried by a back edge:

x' = 0

while(...)
{
x = phi(x',x'')
x'' = ...
}

Optimistic copy propagation is able to eliminate the phi node entirely. Switching from pessimistic to optimistic copy propagation eliminated about 10% of copy instructions. While this is not a great amount, it did reduce spilling in a few subroutines I looked at, and there is no increase in code complexity from this new approach.

Representation selection improvements

The low-level IR now includes three new instructions, ##load-double, ##load-float and ##load-vector. These are inteded for loading unboxed floating point values and SIMD vectors into vector registers without using an integer register to temporarily hold a pointer to the boxed value.

Uses of these instructions are inserted by the representation selection pass. If the destination of a ##load-reference is used in an unboxed representation, then instead of inserting a memory load following the ##load-reference, the instruction is replaced with one of the new variants.

Loads from constant addresses directly into floating point and vector registers are only possible on x86, and not PowerPC, so the optimization is not performed on the latter architecture. On x86-32, an immediate displacement can encode any 32-bit address. On x86-64, loading from an absolute 64-bit address requires an integer register, however instruction pointer-relative addressing is supported with a 32-bit displacement. To make use of this capability, the compiler now supports a new "binary literal area" for unboxed data that compiled code can reference. The unboxed data is placed immediately following a word's machine code, allowing RIP-relative addressing to be used on x86-64.

This optimization, together with value numbering improvements, helps reduce pressure on integer registers in floating point loops.

Register allocator improvements

Once representation selection has run, each SSA value has an associated representation and register class. Each representation always belongs to one specific register class; a register class is a finite set of registers from which the register allocator is allowed to choose a register for this type of value.

Before register allocation, the SSA destruction pass transforms ##phi instructions into ##copys, and then subsequently performs live range coalescing to eliminate as many of the copies as possible.

Coalescing two SSA values from different register classes does not make sense; the compiler will not be able to generate valid machine code if a single SSA value is used as both a float and an integer, since on most CPUs the integer and floating point register files are distinct.

However, coalescing values with different representations, but the same register class, is okay. Consider the case where a double-precision float is computed, and then converted into a single precision float, and subsequently stored into memory. If the double-precision value is not used after the conversion, then the same register that held the input value can be reused to store the result of the single-precision conversion.

Previously this pass only coalesced values with identical representations; now I've generalized it to coalescing values with the same register class but possibly different representations. This reduces register pressure and spilling.

The tricky part about making it work is that the register allocator needs to know the value's representation, not just its register class, for generating correct spills and reloads. For example, a spilled value with register class float-regs can use anywhere from 4 to 16 bytes in the stack frame, depending on the specific representation: single precision float, double precision float, or SIMD vector. Clearly, if coalescing simply lost this fine-grained representation information and only retained register classes, the register allocator would not have enough information to generate valid code.

The solution is to have live range coalescing compute the equivalence classes of SSA values without actually renaming any usages to point to the canonical representative. The renaming map is made available to the register allocator. Most places in the register allocator where instruction operands are examined make use of the renaming map.

Until the splitting phase, these equivalence classes are in one-to-one correspondence with live intervals, and each live interval has a single machine register and spill slot. However, when spill and reload decisions are made, the register allocator uses the original SSA values to look up representations.

If Factor's compiler did not use SSA form at all, there would still be a copy coalescing pass, and the register allocator could also support coalescing values with different representations, by first performing a dataflow analysis known as reaching definitions. This would propagate representation information to use sites.

Retaining SSA structure all the way until register allocation gives you the results of this reaching definitions analysis "for free". In both the SSA and non-SSA case, you still don't want definitions with multiple different representations reaching the same call site. The SSA equivalent of this would be a phi instruction whose inputs had different representations. The compiler ensures this doesn't happen by first splitting up the set of SSA values into strongly-connected components (SCCs) whose edges are phi nodes. The same representation is then assigned to each member of every SCC; if an agreeable representation is not found then it falls back on boxing and tagging all members.

Notice that while both representation selection and SSA destruction group values into equivalence classes, the groupings do not correspond to each other and neither one is a refinement of the other. Not all copies resulting from phi nodes get coalesced away, so a single SCC may intersect multiple coalescing classes. This can happen if the first input to a phi node is live at the definition point of the second:

b = ...

if(...)
{
a = ...
foo(b)
}
/* 'c' can be coalesced with 'a' or 'b' -- but not both.
All three values belong to the same SCC and must have the same
representation */
c = phi(a,b)
On the other hand, two values from different SCCs might also get coalesced together:
a = some-double-float
/* if 'a' is not live after this instruction, then 'a' and 'b'
can share a register. They belong to different SCCs and have
different representations */
b = double-to-single-precision-float(a)

Compiler code cleanups

In addition to just adding new optimizations I've also simplified and refactored the internals of some of the passes I worked on.

One big change is that the "machine representation" used at the very end of the compilation process is gone. Previously, after register allocation had run, the control flow graph would be flattened into a list of instructions with CFG edges replaced by labels and jumps. The code generator would then iterate over this flat representation and emit machine code for each virtual instruction. However since no optimization passes ran on the flat representation, it was generated and then immediately consumed. Combining the linearization pass with the code generation pass let me eliminate about a dozen IR instructions that were specific to the flat representation (anything dealing with labels and jumps to labels). It also sped up compilation time slightly.

Value numbering's common subexpression elimination now uses a simpler and more efficient scheme for hashed lookup of expressions. New algebraic simplifications are also easier to add now, because the def-use graph is easier to traverse.

Some instructions which could be expressed in terms of others were removed, namely the ##string-nth and ##set-string-nth-fast instructions. Reading and writing characters from strings can be done by combining simpler memory operations already present in the IR.

A number of instructions were combined. All of the instructions for reading from memory in different formats, ##alien-unsigned-1, ##alien-signed-4, ##alien-double, have been merged into a single ##load-memory instruction which has a constant operands storing the C type and representation of the data being loaded. Similarly, ##set-alien-signed-8, etc have all been merged into ##store-memory. This restructuring allows optimization passes to more easily treat all memory accesses in a uniform way.

A bug fix for alias analysis

While working on the above I found an interesting bug in alias analysis. It would incorrectly eliminate loads and stores in certain cases. The optimizer assumed that a pointer to a freshly-allocated object could never alias a pointer read from the heap. However the problem of course is that such a pointer could be stored in another object, and then read back in.

Here is a Factor example that demonstrates the problem. First, a couple of tuple definitions; the type declarations and initial value are important further on.

USING: typed locals ;

TUPLE: inner { value initial: 5 } ;
TUPLE: outer { value inner } ;

Now, a very contrieved word which takes two instances of outer,
mutates one, and reads a value out of the other:

TYPED: testcase ( x: outer y: outer -- )
inner new :> z ! Create a new instance of 'inner'
z x value<< ! Store 'z' in 'x'
y value>> :> new-z ! Load 'new-z' from 'y'
10 new-z value<< ! Store a value inside 'new-z'
z value>> ! Read a value out of 'z'
;

Note that the initial value of the value slot in the inner tuple is 5, so inner new creates a new instance holding the value 5. In the case where x and y point at the same object, new-z and z will point to the same object, and the newly-allocated object's value is set to 10. However, the compiler did not realize this, and erronously constant-folded z value>> down to 5!

This bug never came up in actual code; the facepalm moment came while I was doing some code refactoring and noticed that the above case was not being handled.

Instruction scheduling to reduce register pressure

I'm not the only one who's been busy improving the Factor compiler. One of the co-inventors of Factor has cooked up an instruction scheduler and improved method inlining. The scheduler has been merged, and the it looks like the method inlining improvements should be ready in a few days.

Some benchmarks

Here is a comparison between the April 17th build of Factor with the latest code from the GIT repository. I ran a few of the language shootout benchmarks on a 2.4 GHz MacBook Pro. Factor was built in 32-bit mode, because the new optimizations are most effective on the register-starved x86.

BenchmarkBefore (ms)After (ms)
nbody-simd415349
fasta26672485
knucleotide182177
regex-dna10986
spectral-norm16411347