Together with Daniel Ehrenberg and Joe Groff, I'm writing a paper about Factor for DLS2010. We would appreciate feedback on the draft version of the paper. As part of the paper we include a performance comparison between Factor, V8, LuaJIT, SBCL, and CPython. The performance comparison consists of some benchmarks from the Computer Language Benchmarks Game. I'm posting the results here first, in case there's something really stupid here.
Language implementations
Factor and V8 were built from their respective repositories. SBCL is version 1.0.38, LuaJIT is version 2.0.0-beta4, and CPython is version 3.1.2. All language implementations were built as 64-bit binaries and run on a 2.4 GHz Intel Core 2 Duo.
Benchmark implementations
Factor implementations of the benchmarks can be found in our source repository:
Implementations for the other languages can be found in the Computer Language Benchmarks Game CVS repository:
In order to make the reverse complement benchmark work with SBCL on Mac OS X, I had to apply this patch; I don't understand why:
--- bench/revcomp/revcomp.sbcl 9 Feb 2007 17:17:26 -0000 1.4
+++ bench/revcomp/revcomp.sbcl 29 May 2010 08:32:19 -0000
@@ -26,8 +26,7 @@
(defun main ()
(declare (optimize (speed 3) (safety 0)))
- (with-open-file (in "/dev/stdin" :element-type +ub+)
- (with-open-file (out "/dev/stdout" :element-type +ub+ :direction :output :if-exists :append)
+ (let ((in sb-sys:*stdin*) (out sb-sys:*stdout*))
(let ((i-buf (make-array +buffer-size+ :element-type +ub+))
(o-buf (make-array +buffer-size+ :element-type +ub+))
(chunks nil))
@@ -72,4 +71,4 @@
(setf start 0)
(go read-chunk))))
end-of-input
- (flush-chunks)))))))
+ (flush-chunks))))))
Running the benchmarks
I used Factor's deploy tool to generate minimal images for the Factor benchmarks, and then ran them from the command line:
./factor -e='USE: tools.deploy "benchmark.nbody-simd" deploy'
time benchmark.nbody-simd.app/Contents/MacOS/benchmark.nbody-simd
For the scripting language implementations (LuaJIT and V8), I ran the scripts directly from the command line:
time ./d8 ~/perf/shootout/bench/nbody/nbody.javascript -- 1000000
time ./src/luajit ~/perf/shootout/bench/nbody/nbody.lua-2.lua 1000000
For SBCL, I did what the shootout does, and compiled each file into a new core:
ln -s ~/perf/shootout/bench/nbody/nbody.sbcl .
cat > nbody.sbcl_compile <<EOF
(proclaim '(optimize (speed 3) (safety 0) (debug 0) (compilation-speed 0) (space 0)))
(handler-bind ((sb-ext:defconstant-uneql (lambda (c) (abort c))))
(load (compile-file "nbody.sbcl" )))
(save-lisp-and-die "nbody.core" :purify t)
EOF
sbcl --userinit /dev/null --load nbody.sbcl_compile
cat > nbody.sbcl_run <<EOF
(proclaim '(optimize (speed 3) (safety 0) (debug 0) (compilation-speed 0) (space 0)))
(main) (quit)
EOF
time sbcl --dynamic-space-size 500 --noinform --core nbody.core --userinit /dev/null --load nbody.sbcl_run 1000000
For CPython, I precompiled each script into bytecode first:
python3.1 -OO -c "from py_compile import compile; compile('nbody.python3-4.py')"
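The timing run is then analogous to the other implementations, for example (with -OO the interpreter picks up the precompiled bytecode automatically, as long as the script hasn't changed):
# example timing run; -OO makes python3.1 reuse the bytecode compiled above
time python3.1 -OO nbody.python3-4.py 1000000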
Benchmark results
All running times are wall-clock times reported by the Unix time command. I ran each benchmark five times and used the best result.
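Collecting the five timings for a benchmark amounts to a simple shell loop around the commands shown above, for example (sketched here for the deployed Factor nbody benchmark; the same pattern applies to the other implementations):
# run the deployed benchmark five times; the best wall-clock time goes into the table below
for i in 1 2 3 4 5; do
    time benchmark.nbody-simd.app/Contents/MacOS/benchmark.nbody-simd
done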
| Benchmark | Factor | LuaJIT | SBCL | V8 | CPython |
|---|---|---|---|---|---|
| fasta | 2.597s | 1.689s | 2.105s | 3.948s | 35.234s |
| reverse-complement | 2.377s | 1.764s | 2.955s | 3.884s | 1.669s |
| nbody | 0.393s | 0.604s | 0.402s | 4.569s | 37.086s |
| binary-trees | 1.764s | 6.295s | 1.349s | 2.119s | 19.886s |
| spectral-norm | 1.377s | 1.358s | 2.229s | 12.227s | 1m44.675s |
| regex-dna | 0.990s | N/A | 0.973s | 0.166s | 0.874s |
| knucleotide | 1.820s | 0.573s | 0.766s | 1.876s | 1.805s |
Benchmark analysis
Some notes on the results:
- There is no Lua implementation of the regex-dna benchmark.
- Some of the SBCL benchmark implementations can make use of multiple cores if SBCL is compiled with thread support. However, thread support appears to be disabled by default on Mac OS X (a quick way to check is sketched after this list). None of the other language implementations being tested have native thread support, so this is a single-core performance test.
- Factor's string manipulation still needs work. The fasta, knucleotide and reverse-complement benchmarks are not as fast as they should be.
- The binary-trees benchmark is a measure of how fast objects can be allocated, and how fast the garbage collector can reclaim dead objects. LuaJIT loses big here, perhaps because it lacks generational garbage collection, and because Lua's tables are an inefficient object representation.
- The regex-dna benchmark is a measure of how efficient the regular expression implementation is in the language. V8 wins here, because it uses Google's heavily-optimized Irregexp library.
- Factor beats the other implementations on the nbody benchmark because it is able to make use of SIMD.
- For some reason, SBCL is slower than Factor and LuaJIT on spectral-norm; it should be generating essentially the same code.
- The benchmarks exercise too few language features. Any benchmark that uses native-sized integers (for example, an implementation of the SHA1 algorithm) would shine on SBCL and suffer on all the others. Similarly, any benchmark that requires packed binary data support would shine on Factor and suffer on all the others. However, the benchmarks in the shootout consist mostly of scalar floating-point code and text manipulation.
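As an aside, a quick way to check whether a given SBCL build was compiled with thread support (see the note above about multiple cores) is to look for :sb-thread in *features*; for example, with a stock sbcl binary on the PATH:
sbcl --noinform --eval '(progn (print (find :sb-thread *features*)) (sb-ext:quit))'
This prints :SB-THREAD if threads are available and NIL otherwise.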
Conclusions
Factor's performance is coming along nicely. I'd like to submit Factor to the computer language shootout soon. Before doing that, we need a Debian package, and the deploy tool needs to be easier to use from the command line.
16 comments:
For DLS2010, some might be interested to see a performance comparison with Ruby and/or Python. Any data on those?
dh: good point. I will run the benchmarks in Python tomorrow and update this blog post.
Odd. When I last hacked on the spectral-norm implementation for SBCL, it was tying with C, Fortran, and all the rest of those in performance. That was close to two years ago; perhaps there's been some regression in the compiler.
Perhaps include a pypy comparison (but you already have python which I guess is more meaningful)?
Pretty much meaningless comparison.
Out of curiosity I took a look at spectral-norm written in Factor and noticed [as soon as my eyes adapted to it] that you are using unboxed arrays (I would not be surprised if the compiler had special support for such arrays). The same is probably true for the tuples used in the binary-trees benchmark. It also appears that you are giving hints to the JIT [I interpret things like ; inline as hints, maybe I am wrong].
It also seems to me that Factor itself has simpler and stricter scoping rules compared to JavaScript, Python, or (probably) even Lua.
Given all these tricks up the sleeve, and given the semantic differences, it would just be surprising to see slower execution.
Ouch. I just looked at the nbody benchmark sources.
Please correct me if I am wrong: you are declaring a structure there with typed fields while other dynamic languages use dictionaries to store bodies?
Maybe you should use proper statistics rather than taking the fastest time possible. Like run it enough times to do a t-test to show the significance of the difference between measurements.
You say you're the world's greatest systems programmer, but all you academics know how to do is prove academic theorems about the module substructure over nilpotent Lie algebras, and even then you can't manage to generalize it past the two-step case. Your method fails to scale, because in the real world, nilpotent Lie algebras can take as many as ten steps to get to zero. Just like your statistics.
Hey. Can you make clear in your comparison that you're comparing CPython and not Python? (Python is a language, CPython is an actual VM).
Also, you don't need a debian package of factor to deploy it on computer language shootout.
"computer language shootout"
Please use the title shown on the website - The Computer Language Benchmarks Game
> For Python, I precompiled each script into bytecode first
Did you check if doing so gave faster or slower timings?
> Implementations for the other languages can be found at ...
The benchmarks game website often shows more than one program per language per benchmark - how is anyone to know which programs you timed?
multiprocessing in Python? For what? O_o
> can be found at the language benchmark game CVS repository
The column headings seem to have gotten mixed up - they don't match the languages linked to in the cells.
Maybe some redundancy could be removed from the table - the column of benchmark names?
> Benchmark results
The table contents seem completely disorganised.
Easier to read if there's some order imposed:
- pick a language, and sort the benchmark rows on the time taken for that language
- pick the benchmark program that took the longest or shortest time for that language, and sort the language cols on the time taken for that benchmark
The links to the JavaScript program implementations point to editions that are several years old; some have been much improved since then. This makes me wonder whether your test used these older/slower editions.