tag:blogger.com,1999:blog-170878502024-09-07T23:38:39.314-04:00Factor: a practical stack language<a href="http://factorcode.org/slava">Slava Pestov</a>'s weblog, primarily about <a href="http://factorcode.org">Factor</a>.Slava Pestovhttp://www.blogger.com/profile/02768382790667979877[email protected]Blogger520125tag:blogger.com,1999:blog-17087850.post-24827090337251828542010-09-18T23:40:00.003-04:002010-09-18T23:54:35.251-04:00Factor 0.94 now available<p>Factor 0.94 is now available from the <a href="http://factorcode.org">Factor website</a>, five months after the previous release, Factor 0.93. Binaries are provided for 10 platforms.</p> <p>As usual, contributors did most of the work. Thanks to Daniel Ehrenberg, Dmitry Shubin, Doug Coleman, Erik Charlebois, Joe Groff, John Benediktsson, Jose A. Ortega Ruiz, Niklas Waern, Samuel Tardieu, Sascha Matzke and everyone else who helped out this time around!</p> <h3>Incompatible changes:</h3> <ul> <li>The PowerPC architecture is no longer supported. (Slava Pestov)</li> <li>The <a href="http://docs.factorcode.org/content/word-require-when%2Cvocabs.loader.html">require-when</a> word now supports dependencies on multiple vocabularies. (Daniel Ehrenberg)</li> <li>The <code>C-ENUM:</code> word in the C library interface has been replaced with <a href="http://docs.factorcode.org/content/word-ENUM__colon__%2Calien.syntax.html">ENUM:</a>, a much improved word for defining type-safe enumerations. (Erik Charlebois, Joe Groff)</li> <li>Tuple slot setter words with stack effect <code>( value object -- )</code> are now named <code>foo<<</code> instead of <code>(>>foo)</code>. Most code is unaffected since it uses the <code>>>foo</code> form. (Slava Pestov)</li> <li>The older <code>system-micros</code> word, which returned microseconds since the Unix epoch as an integer, has been removed. For a while, the recommended way to get the current time has been to call the <code>now</code> word from the <a href="http://docs.factorcode.org/content/vocab-calendar.html">calendar</a> vocabulary, which returned a <a href="http://docs.factorcode.org/content/word-timestamp%2Ccalendar.html">timestamp</a> instance. (Doug Coleman)</li> <li>A few sequence-related words were moved from the <a href="http://docs.factorcode.org/content/vocab-generalizations.html">generalizations</a> vocabulary to <a href="http://docs.factorcode.org/content/vocab-sequences.generalizations.html">sequences.generalizations</a>. (Slava Pestov)</li> <li>The <code>alarms</code> vocabulary has been renamed to <a href="http://docs.factorcode.org/content/vocab-timers.html">timers</a> to better explain its true purpose, with improved timing accuracy and robustness. (Doug Coleman)</li> <li><a href="http://docs.factorcode.org/content/vocab-cocoa.subclassing.html">cocoa.subclassing</a>: the syntax for defining new Objective-C classes has been changed to improve readability. (Slava Pestov)</li> <li><a href="http://docs.factorcode.org/content/vocab-io.streams.limited.html">io.streams.limited</a>: the ability to throw errors on EOF was extracted from limited streams, and limited streams simplified as a result. Throwing on EOF is now implemented by the <a href="http://docs.factorcode.org/content/vocab-io.streams.throwing.html">io.streams.throwing</a> vocabulary. 
(Doug Coleman)</li> </ul> <h3>New libraries:</h3> <ul> <li><a href="http://docs.factorcode.org/content/vocab-boyer-moore.html">boyer-moore</a>: efficient text search algorithm (Dmitry Shubin)</li> <li><a href="http://docs.factorcode.org/content/vocab-checksums.internet.html">checksums.internet</a>: implementation of checksum algorithm used by ICMP for the <a href="http://docs.factorcode.org/content/vocab-checksums.html">checksums</a> framework (John Benediktsson)</li> <li><a href="http://docs.factorcode.org/content/vocab-gdbm.html">gdbm</a>: disk-based hash library binding (Dmitry Shubin)</li> <li><a href="http://docs.factorcode.org/content/vocab-io.encodings.detect.html">io.encodings.detect</a>: binary file/text encoding detection heuristics from jEdit (Joe Groff)</li> <li><a href="http://docs.factorcode.org/content/vocab-javascriptcore.html">javascriptcore</a>: FFI to the WebKit JavaScript engine (Doug Coleman)</li> <li><a href="http://docs.factorcode.org/content/vocab-lua.html">lua</a>: FFI to the Lua scripting language (Erik Charlebois)</li> <li><a href="http://docs.factorcode.org/content/vocab-oauth.html">oauth</a>: minimal implementation of client-side OAuth (Slava Pestov)</li> <li><a href="http://docs.factorcode.org/content/vocab-sequences.unrolled.html">sequences.unrolled</a>: efficient unrolled loops with constant iteration count (Joe Groff)</li> </ul> <h3>Improved libraries:</h3> <ul> <li><a href="http://docs.factorcode.org/content/vocab-cuda.html">cuda</a>: various improvements (Joe Groff, Doug Coleman)</li> <li><a href="http://docs.factorcode.org/content/vocab-game.input.html">game.input</a>: now uses XInput2 on X11 (Niklas Waern)</li> <li><a href="http://docs.factorcode.org/content/vocab-gpu.render.html">gpu.render</a>, <a href="http://docs.factorcode.org/content/vocab-gpu.buffers.html">gpu.buffers</a>: various improvements (Joe Groff)</li> <li><a href="http://docs.factorcode.org/content/vocab-math.combinatorics.html">math.combinatorics</a>: various improvements (John Benediktsson)</li> <li><a href="http://docs.factorcode.org/content/vocab-math.primes.html">math.primes</a>: various improvements (Samuel Tardieu)</li> <li><a href="http://docs.factorcode.org/content/vocab-math.vectors.simd.cords.html">math.vectors.simd.cords</a>: compound "256-bit" SIMD types now support the full set of SIMD operations (Joe Groff)</li> <li><a href="http://docs.factorcode.org/content/vocab-mongodb.html">mongodb</a>: various improvements (Sascha Matzke)</li> <li><a href="http://docs.factorcode.org/content/vocab-twitter.html">twitter</a>: now uses OAuth as required by Twitter (Slava Pestov)</li> <li><a href="http://docs.factorcode.org/content/vocab-unix.users.html">unix.users,</a> <a href="http://docs.factorcode.org/content/vocab-unix.groups.html">unix.groups</a>: various improvements (Doug Coleman)</li> </ul> <h3>Compiler improvements:</h3> <ul> <li>Improved instruction selection, copy propagation, representation selection, and register allocation; details in a <a href="http://factor-language.blogspot.com/2010/05/collection-of-small-compiler.html">blog post</a>. (Slava Pestov) <li>An instruction scheduling pass now runs prior to register allocation, intended to reduce register pressure by moving uses closer to definitions; details in a <a href="http://useless-factor.blogspot.com/2010/02/instruction-scheduling-for-register.html">blog post</a>. 
(Daniel Ehrenberg)</li> <li>The code generation for the C library interface has been revamped; details in a <a href="http://factor-language.blogspot.com/2010/07/overhauling-factors-c-library-interface.html">blog post</a>. (Slava Pestov)</li> <li>Support for something similar to what C++ and C# programmers refer to as "value types": binary data can now be allocated on the call stack, and passed to C functions like any other pointer. The <a href="http://docs.factorcode.org/content/word-with-out-parameters%2Calien.data.html">with-out-parameters</a> combinator replaces tricky code for allocating and juggling multiple temporary byte arrays used as out parameters for C function calls, making this idiom easier to read and more efficient. The <a href="http://docs.factorcode.org/content/word-with-scoped-allocation,alien.data.html">with-scoped-allocation</a> combinator presents a more general, lower-level interface. (Slava Pestov)</li> <li>The compiler can now use the x87 floating point unit on older CPUs where SSE2 is not available. However, this is not recommended, because the build farm does not test Factor (or build any binaries) in x87 mode, so this support can break at any time. To use x87 code generation, you must download the source code and bootstrap Factor yourself, on a CPU without SSE2. (Slava Pestov)</li> </ul> <h3>Miscellaneous improvements:</h3> <ul> <li><code>fuel</code>: Factor's Ultimate Emacs Library has seen many improvements, and also some keyboard shortcuts have changed; see the <a href="">README</a>. (Erik Charlebois, Dmitry Shubin, Jose A. Ortega Ruiz)</li> <li>A new <code>factor.cmd</code> script is now included in the <code>build-support</code> directory, to automate the update/build/bootstrap cycle for those who build from source. Its functionality is a subset of the <code>factor.sh</code> script for Unix. (Joe Groff)</li> <li>The default set of icons shipped in <code>misc</code> has been tweaked, with better contrast and improved appearance when scaled down. (Joe Groff)</li> </ul>Slava Pestovhttp://www.blogger.com/profile/02768382790667979877[email protected]0tag:blogger.com,1999:blog-17087850.post-32199213637609458182010-09-11T22:23:00.006-04:002010-09-13T23:53:29.344-04:00An overview of Factor's I/O library<p>Factor has grown a very powerful and high-level I/O library over the years; however, it is easy to get lost in the forest of reference documentation surrounding the <a href="http://docs.factorcode.org/content/vocab-io.html">io</a> vocabulary hierarchy. In this blog post I'm attempting to give an overview of the functionality available, with some easy-to-digest examples, along with links for further reading. I will also touch upon some common themes that come up throughout the library, such as encoding support, timeouts, and uses for dynamically-scoped variables.</p> <p>Factor's I/O library is the work of many contributors over the years. Implementing FFI bindings to native I/O APIs, developing high-level abstractions on top, and making the whole thing cross-platform takes many people. In particular <a href="http://docs.factorcode.org/content/author-Doug%20Coleman.html">Doug Coleman</a> did a lot of heavy lifting early on for the Windows port, and also implemented several new cross-platform features such as file system metadata and memory mapped files.</p> <p>Our I/O library is competitive with Python's APIs and Java's upcoming NIO2 in breadth of functionality. 
I like to think the design is quite a bit cleaner too, because instead of being a thin wrapper over POSIX we try to come up with clear and coherent APIs that make sense on both Windows and Unix.</p> <h3>First example: converting a text file from MacRoman to UTF8</h3> <p>The <a href="http://docs.factorcode.org/content/vocab-io.files.html">io.files</a> vocabulary defines words for reading and writing files. It supports two modes of operation in a pretty standard fashion:</p> <ul> <li>Stream-based: <a href="http://docs.factorcode.org/content/word-with-file-reader,io.files.html">with-file-reader</a>, <a href="http://docs.factorcode.org/content/word-with-file-writer,io.files.html">with-file-writer</a></li> <li>Entire file: <a href="http://docs.factorcode.org/content/word-file-contents,io.files.html">file-contents</a>, <a href="http://docs.factorcode.org/content/word-set-file-contents,io.files.html">set-file-contents</a>, <a href="http://docs.factorcode.org/content/word-file-lines,io.files.html">file-lines</a>, <a href="http://docs.factorcode.org/content/word-set-file-lines,io.files.html">set-file-lines</a></li> </ul> <p>What makes Factor's file I/O interesting is that it takes advantage of pervasive support for I/O encoding. In Factor, a string is not a sequence of bytes; it is a sequence of Unicode code points. When reading and writing strings on external resources, which only consist of bytes, an encoding parameter is given to specify the conversion from strings to byte arrays.</p> <p>Let's convert <code>foo.txt</code> from MacRoman, an older encoding primarily used by classic Mac OS, to UTF8:</p> <pre>USING: io.encodings.8-bit.mac-roman io.encodings.utf8 io.files ;<br /><br />"foo.txt" mac-roman file-contents<br />"foo.txt" utf8 set-file-contents</pre> <p>This is a very simple and concise implementation, but it has the downside that the entire file is read into memory. For most small text files this does not matter, but if efficiency is a concern then we can do the conversion a line at a time:</p> <pre>USING: io io.encodings.8-bit.mac-roman io.encodings.utf8<br />io.files ;<br /><br />"out.txt" utf8 [<br /> "in.txt" mac-roman [<br /> [ print ] each-line<br /> ] with-file-reader<br />] with-file-writer</pre> <h3>Converting a directory full of files from MacRoman to UTF8</h3> <p>The <a href="http://docs.factorcode.org/content/vocab-io.directories.html">io.directories</a> vocabulary defines words for listing and modifying directories. Let's make the above example more interesting by performing the conversion on a directory full of files:</p> <pre>USING: io.directories io.encodings.8-bit.mac-roman<br />io.encodings.utf8 io.files ;<br /><br />: convert-directory ( path -- )<br /> [<br /> [<br /> [ mac-roman file-contents ] keep<br /> utf8 set-file-contents<br /> ] each<br /> ] with-directory-files ;</pre> <h3>An aside: generalizing the "current working directory"</h3> <p>If you run the following, you will see that <code>with-directory-files</code> returns relative, and not absolute, file names:</p> <pre>"/path/to/some/directory"<br />[ [ print ] each ] with-directory-files</pre> <p>So the question is, how did <code>file-contents</code> above know what directory to look for files in? 
The answer is that in addition to calling the quotation with the directory's contents, the <code>with-directory-files</code> word also rebinds the <a href="http://docs.factorcode.org/content/word-current-directory,io.pathnames.html">current-directory</a> dynamic variable.</p> <p>This directory is the Factor equivalent of the familiar Unix notion of "current working directory". It generalizes the Unix feature by making it dynamically-scoped; within the quotation passed to the <code>with-directory</code> combinator, relative paths are resolved relative to that directory, but other coroutines executing at the time, or code after the quotation, are unaffected. This functionality is implemented entirely at the library level; all pathname strings are "normalized" with the <code>normalize-pathname</code> word before being handed off to the operating system.</p> <p>When calling a shell command with <code>io.launcher</code>, the child process is run from the Factor <code>current-directory</code>, so relative pathnames passed on the command line will just work. However, when making C FFI calls which take pathnames, you must pass in absolute paths only, or normalize the path with <code>normalize-path</code> first; otherwise the C code will search for it in the wrong place.</p> <h3>Checking free disk space</h3> <p>The <a href="http://docs.factorcode.org/content/vocab-io.files.info.html">io.files.info</a> vocabulary defines two words which return tuples containing information about a file, and the file system containing the file, respectively:</p> <ul> <li><a href="http://docs.factorcode.org/content/word-file-info,io.files.info.html">file-info</a></li> <li><a href="http://docs.factorcode.org/content/word-file-system-info,io.files.info.html">file-system-info</a></li> </ul> <p>Let's say your application needs to install some files in the user's home directory, but instead of failing half-way through in the event that there is insufficient space, you'd rather display a friendly error message upfront:</p> <pre>ERROR: buy-a-new-disk ;<br /><br />: gb ( m -- n ) 30 2^ * ;<br /><br />: check-space ( -- )<br /> home file-system-info free-space>> 10 gb <<br /> [ buy-a-new-disk ] when ;</pre> <p>Now if there is less than 10 GB available, the <code>check-space</code> word will throw a <code>buy-a-new-disk</code> error.</p> <p>The <code>file-system-info</code> word reports a bunch of other info. There is a Factor implementation of the Unix <code>df</code> command in the <a href="http://docs.factorcode.org/content/vocab-tools.files.html">tools.files</a> vocabulary: <pre>( scratchpad ) file-systems.<br />+device-name+ +available-space+ +free-space+ +used-space+ +total-space+ +percent-used+ +mount-point+<br />/dev/disk0s2 15955816448 16217960448 183487713280 199705673728 91 /<br />fdesc 0 0 1024 1024 100 /dev<br />fdesc 0 0 1024 1024 100 /dev<br />map -hosts 0 0 0 0 0 /net<br />map auto_home 0 0 0 0 0 /home<br />/dev/disk1s2 15922262016 15922262016 383489052672 399411314688 96 /Users/slava</pre> <p>Doug has two blog posts about these features, <a href="http://code-factor.blogspot.com/2009/01/files-and-file-systems-in-factor-part-1.html">part 1</a> and <a href="http://code-factor.blogspot.com/2009/01/files-and-file-systems-in-factor-part-2.html">part 2</a>.</p> <h3>Unix only: symbolic links</h3> <p>Factor knows about symbolic links on Unix. 
The <a href="http://docs.factorcode.org/content/vocab-io.files.links.html">io.files.links</a> vocabulary defines a pair of words, <a href="http://docs.factorcode.org/content/word-make-link,io.files.links.html">make-link</a> and <a href="http://docs.factorcode.org/content/word-make-hard-link,io.files.links.html">make-hard-link</a>. The <a href="http://docs.factorcode.org/content/word-link-info,io.files.info.html">link-info</a> word is like <a href="http://docs.factorcode.org/content/word-file-info,io.files.info.html">file-info</a> except it doesn't follow symbolic links. Finally, the <a href="http://docs.factorcode.org/content/vocab-io.directories.hierarchy.html">directory hierarchy traversal words</a> don't follow links, so a link cycle or bogus link to / somewhere won't break everything.</p> <h3>File system monitoring</h3> <p>The <a href="http://docs.factorcode.org/content/vocab-io.monitors.html">io.monitors</a> vocabulary implements real-time file and directory change monitoring. Unfortunately at this point in time, it is only supported on Windows, Linux and Mac. Neither one of FreeBSD and OpenBSD exposes the necessary information to user-space.</p> <p>Here is an example for watching a directory for changes, and logging them:</p> <pre>USE: io.monitors<br /><br />: watch-loop ( monitor -- )<br /> dup next-change path>> print flush watch-loop ;<br /><br />: watch-directory ( path -- )<br /> [ t [ watch-loop ] with-monitor ] with-monitors ;</pre> <p>Try pasting the above code into a Factor listener window, and then run <code>home watch-directory</code>. Every time a file in your home directory is modified, its full pathname will be printed in the listener.</p> <p>Java will only begin to support symbolic links and directory monitoring in the upcoming JDK7 release.</p> <h3>Memory mapped files</h3> <p>The <a href="http://docs.factorcode.org/content/vocab-io.mmap.html">io.mmap</a> vocabulary defines support for working with memory-mapped files. The highest-level and easiest to use interface is the <a href="http://docs.factorcode.org/content/word-with-mapped-array,io.mmap.html">with-mapped-array</a> combinator. It takes a file name, a <a href="http://docs.factorcode.org/content/article-c-types-specs.html">C type</a>, and a quotation. The quotation can perform generic sequence operations on the mapped file.</p> <p>Here is an example which reverses each group of 4 bytes:</p> <pre>USING: alien.c-types grouping io.mmap sequences<br />specialized-arrays ;<br />SPECIALIZED-ARRAY: char<br /><br />"mydata.dat" char [<br /> 4 <sliced-groups><br /> [ reverse! drop ] each<br />] with-mapped-array</pre> <p>The <a href="http://docs.factorcode.org/content/word-__lt__sliced-groups__gt__%2Cgrouping.html"><sliced-groups></a> word returns a view of an underlying sequence, grouped into n-element subsequences. Mutating one of these subsequences in-place mutates the underlying sequence, which in our case is a mapped view of a file.</p> <p>A more efficient implementation of the above is also possible, by mapping in the file as an <code>int</code> array and then performing bitwise arithmetic on the elements.</p> <h3>Launching processes</h3> <p>Factor's <a href="http://docs.factorcode.org/content/vocab-io.launcher.html">io.launcher</a> vocabulary was originally developed for use by the <a href="http://factor-language.blogspot.com/2010/09/making-factors-continuous-build-system.html">build farm</a>. 
The build farm needs to launch processes with precise control over I/O redirection and timeouts, and so a rich set of cross-platform functionality was born.</p> <p>The central concept in the library is the <code>process</code> tuple, constructed by calling <code><process></code>. Various slots of the process tuple can be filled in to specify the command line, environment variables, redirection, and so on. Then the process can be run in various ways: in the foreground, in the background, or with input and output attached to Factor streams.</p> <p>The launcher's I/O redirection is very flexible. If you don't touch the redirection slots in a process tuple, the subprocess will just inherit the current standard input and output. You can specify a file name to read from or write to, a file name to append to, or even supply a pipe object, constructed from the <a href="http://docs.factorcode.org/content/vocab-io.pipes.html">io.pipes</a> vocabulary.</p> <pre><process><br /> "rotate-logs" >>command<br /> +closed+ >>stdin<br /> "out.txt" >>stdout<br /> "error.log" <appender> >>stderr</pre> <p>It is possible to specify a timeout when running a process:</p> <pre><process><br /> { "ssh" "myhost" "-l" "jane" "do-calculation" } >>command<br /> 15 minutes >>timeout<br /> "results.txt" >>stdout<br />run-process</pre> The process will be killed if it runs for longer than the timeout period. Many other features are supported: setting environment variables, setting process priority, and so on. The <a href="http://docs.factorcode.org/content/vocab-io.launcher.html">io.launcher</a> vocabulary has all the details. <p>Support for timeouts is a cross-cutting concern that touches many parts of the I/O API. This support is consolidated in the <a href="http://docs.factorcode.org/content/vocab-io.timeouts.html">io.timeouts</a> vocabulary. The <code>set-timeout</code> generic word is supported by all external resources which provide interruptible blocking operations.</p> <p>Timeouts are implemented on top of our <a href="http://code-factor.blogspot.com/2009/11/monotonic-timers.html">monotonic timer support</a>; changing your system clock while Factor is running won't screw with active network connections.</p> <h3>Unix only: file ownership and permissions</h3> <p>The <code>io.files.info.unix</code> vocabulary defines words for reading and writing file ownership and permissions. Using this vocabulary, we can write a shell script to a file, make it executable, and run it. An essential component of any multi-language quine:</p> <pre>USING: io.encodings.ascii io.files io.files.info.unix<br />io.launcher ;<br /><br />"""<br />#!/bin/sh<br />echo "Hello, polyglot"<br />""" "script.sh" ascii set-file-contents<br />OCT: 755 "script.sh" set-file-permissions<br />"./script.sh" run-process</pre> <p>There are even more Unix-specific words in the <a href="http://docs.factorcode.org/content/vocab-unix.users.html">unix.users</a> and <a href="http://docs.factorcode.org/content/vocab-unix.groups.html">unix.groups</a> vocabularies. 
Using these words enables listing all users on the system, converting user names to UIDs and back, and even <code>setuid</code> and <code>setgid</code>.</p> <h3>Networking</h3> <p>Factor's <a href="http://docs.factorcode.org/content/vocab-io.sockets.html">io.sockets</a> vocabulary supports stream and packet-based networking.</p> <ul> <li>Stream-based: <a href="http://docs.factorcode.org/content/word-with-client,io.sockets.html">with-client</a>, <a href="http://docs.factorcode.org/content/word-__lt__server__gt__,io.sockets.html"><server></a>, <a href="http://docs.factorcode.org/content/word-accept,io.sockets.html">accept</a></li> <li>Packet-based: <a href="http://docs.factorcode.org/content/word-__lt__datagram__gt__,io.sockets.html"><datagram></a>, <a href="http://docs.factorcode.org/content/word-send%2Cio.sockets.html">send</a>, <a href="http://docs.factorcode.org/content/word-receive%2Cio.sockets.html">receive</a></li> </ul> <p>Network addresses are specified in a flexible manner. Specific classes exist for IPv4, IPv6 and Unix domain socket addressing. When a network socket is constructed, that endpoint is bound to a given address specifier.</p> <p>Connecting to <code>http://www.apple.com</code>, sending a GET request, and reading the result:</p> <pre>USING: io io.encodings.utf8 io.sockets ;<br /><br />"www.apple.com" 80 <inet> utf8 [<br /> """GET / HTTP/1.1\r<br />host: www.apple.com\r<br />connection: close\r<br />\r<br />""" write flush<br /> contents<br />] with-client<br />print</pre> <p>SSL support is almost transparent; the only difference is that the address specifier is wrapped in <code><secure></code>:</p> <pre>USING: io io.encodings.utf8 io.sockets<br />io.sockets.secure ;<br /><br />"www.cia.gov" 443 <inet> <secure> utf8 [<br /> """GET / HTTP/1.1\r<br />host: www.cia.gov\r<br />connection: close\r<br />\r<br />""" write flush<br /> contents<br />] with-client<br />print</pre> <p>For details, see the <a href="http://docs.factorcode.org/content/vocab-io.sockets.secure.html">io.sockets.secure</a> documentation, and my <a href="http://factor-language.blogspot.com/2008/05/ssltls-support-added.html">blog post about SSL in Factor.</a>.</p> <p>Of course you'd never send HTTP requests directly using sockets; instead you'd use the <a href="http://docs.factorcode.org/content/vocab-http.client.html">http.client</a> vocabulary.</p> <h3>Network servers</h3> <p>Factor's <a href="http://docs.factorcode.org/content/vocab-io.servers.connection.html">io.servers.connection</a> vocabulary is so cool, that a couple of years back I made a <a href="http://factor.blip.tv/file/1316060/">screencast about it</a>. 
Nowadays the sample application developed in that screencast is in the <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=extra/time-server/time-server.factor;hb=HEAD">extra/time-server</a>; the implementation is very concise and elegant.</p> <h3>Under the hood</h3> <p>All of this functionality is implemented in pure Factor code on top of our excellent <a href="http://docs.factorcode.org/content/article-alien.html">C library interface</a> and extensive bindings to POSIX and Win32 in the <a href="http://docs.factorcode.org/content/vocab-unix.html">unix</a> and <a href="http://docs.factorcode.org/content/vocab-windows.html">windows</a> vocabulary hierarchies, respectively.</p> <p>As much as possible, I/O is performed with non-blocking operations; synchronous reads and writes only suspend the current coroutine and switch to the next runnable one rather than hanging the entire VM. I recently <a href="http://factor-language.blogspot.com/2010/04/switching-call-stacks-on-different.html">rewrote the coroutines implementation to use direct context switching rather than continuations</a>.</p> <p>Co-ordination and scheduling of coroutines is handled with a series of <a href="http://factor-language.blogspot.com/2008/02/some-changes-to-threads.html">simple concurrency abstractions</a>.</p>Slava Pestovhttp://www.blogger.com/profile/02768382790667979877[email protected]5tag:blogger.com,1999:blog-17087850.post-35853000479220662652010-09-05T20:50:00.006-04:002010-09-06T02:11:35.333-04:00Making Factor's continuous build system more robust<p>I've done some work on <a href="http://concatenative.org/wiki/view/Factor/Build%20farm">Factor's continuous build system</a> over the weekend to make it more robust in the face of failure, with improved error reporting and less manual intervention required to fix problems when they come up. The current build system is called "mason", because it is based on an earlier build system named "builder" that was written by Eduardo Cavazos. Every binary package you download from <a href="http://factorcode.org">factorcode.org</a> was built, tested and benchmarked by mason.</p> <h3>Checking for disk space</h3> <p>Every once in a while build machines run out of disk space. This is a condition that Git doesn't handle very gracefully; if a git pull fails half way through, it leaves the repository in an inconsistent state. Instead of failing during source code checkout or testing, mason now checks disk usage before attempting a build. If less than 1 Gb is free, it sends out a warning e-mail and takes no further action. Disk usage is also now part of every build report; for example, take a look at the <a href="http://builds.factorcode.org/report?os=macosx&cpu=x86.32">latest Mac OS X report</a>. Finally, mason does a better job of cleaning up after itself when builds fail, reducing the rate of disk waste overall.</p> <p>I must say the disk space check was very easy to implement using Doug Coleman's excellent cross-platform <code>file-system-info</code> library. Factor's I/O libraries are top-notch and everything works as expected across all of the platforms we run on.</p> <h3>Git repository corruption</h3> <p>Git is not 100% reliable, and sometimes repositories will end up in a funny state. One cause is when the disk fills up in the middle of a pull, but it seems to happen in other cases too. 
For example, just a few days ago, our 64-bit Windows build machine started failing builds with the following error: <pre>From git://factorcode.org/git/factor<br /> * branch master -> FETCH_HEAD<br />Updating d386ea7..eece1e3<br /><br />error: Entry 'basis/io/sockets/windows/windows.factor' not uptodate. Cannot merge.</pre> <p>Of course nobody actually edits files in the repository in question; it's a clone of the official repo that gets updated every 5 minutes. Why git messed up here I have no clue, but instead of expecting software to be perfect, we can design for failure.</p> <p>If a pull fails with a merge error, or if the working copy somehow ends up containing modified or untracked files, mason deletes the repository and clones it again from scratch, instead of just falling over and requiring manual intervention.</p> <h3>Error e-mail throttling</h3> <p>Any time mason encounters an error, such as not being able to pull from the Factor Git repository, disk space exhaustion, or intermittent network failure, it sends out an e-mail to <a href="http://sourceforge.net/mailarchive/forum.php?forum_name=factor-builds">Factor-builds</a>. Since it checks for new code every 5 minutes, this can get very annoying if there is a problem with the machine and nobody is able to fix it immediately; the <a href="http://sourceforge.net/mailarchive/forum.php?forum_name=factor-builds">Factor-builds list</a> would get spammed with hundreds of duplicate messages. Now, mason uses a heuristic to limit the number of error e-mails sent out. If two errors are sent within twenty minutes of each other, no further errors are sent for another 6 hours.</p> <h3>More robust new code check</h3> <p>Previously mason would initiate a build if a git pull had pulled in patches. This was insufficient though, because if a build was killed half way through, for example due to power failure or machine reboot, it would not re-attempt a build when it came back up until new patches were pushed. Now mason compares the latest git revision with the last one that was actually built to completion (whether or not there were errors).</p> <h3>Build system dashboard</h3> <p>I've put together <a href="http://builds.factorcode.org/dashboard">a simple dashboard page</a> showing build system status. Sometimes VMs will crash (FreeBSD is particularly flaky when running under VirtualBox, for example) and we don't always notice that a VM is down until several days later, when no builds are being uploaded. Since mason now sends out heartbeats every 5 minutes to a central server, it was easy to put together a dashboard showing which machines have not sent any heartbeats for a while. These machines are likely to be down. The dashboard also allows a build to be forced even if no new code was pushed to the repository; this is useful to test things out after changing machine configuration.</p> <p>The dashboard nicely complements my earlier work on <a href="http://factor-language.blogspot.com/2009/05/live-status-display-for-factors-build.html">live status display</a> for the build farm.</p> <h3>Conclusion</h3> <p>I think mason is one of the most advanced continuous integration systems among open source language implementations, never mind the less mainstream ones such as Factor. And thanks to Factor's advanced libraries, it is only 1600 lines of code. 
Here is a selection of the functionality from Factor's standard library used by mason:</p> <ul> <li><a href="http://docs.factorcode.org/content/vocab-io.files.info.html">io.files.info</a> - checking disk usage</li> <li><a href="http://docs.factorcode.org/content/vocab-io.launcher.html">io.launcher</a> - running processes, such as git, make, zip, tar, ssh, and of course the actual Factor instance being tested</li> <li><a href="http://docs.factorcode.org/content/vocab-io.timeouts.html">io.timeouts</a> - timeouts on network operations and child processes are invaluable; Factor's consistent and widely-used timeout API makes it easy</li> <li><a href="http://docs.factorcode.org/content/vocab-http.client.html">http.client</a> - downloading boot images, making POST requests to builds.factorcode.org for the live status display feature</li> <li><a href="http://docs.factorcode.org/content/vocab-smtp.html">smtp</a> - sending build report e-mails</li> <li><a href="http://docs.factorcode.org/content/vocab-twitter.html">twitter</a> - Tweeting binary upload notifications</li><li><a href="http://docs.factorcode.org/content/vocab-oauth.html">oauth</a> - yes, Factor has a library to support the feared OAuth. Everyone complains about how hard OAuth is, but if you have easy to use libraries for HMAC, SHA1 and HTTP then it's no big deal at all.<li><a href="http://docs.factorcode.org/content/vocab-xml.syntax.html">xml.syntax</a> - constructing HTML-formatted build reports using XML literal syntax</li> </ul>Slava Pestovhttp://www.blogger.com/profile/02768382790667979877[email protected]0tag:blogger.com,1999:blog-17087850.post-58080765481282501842010-09-03T23:57:00.003-04:002010-09-04T15:06:56.438-04:00Two things every Unix developer should know<p>Unix programming can be tricky. There are many subtleties many developers are not aware of. In this post, I will describe just two of them... my favorite Unix quirks, if you will.</p> <h3>Interruptible system calls</h3> <p>On Unix, any system call which blocks can potentially fail with an errno of <code>EINTR</code>, which indicates that the caller must retry the system call. The <code>EINTR</code> error can be raised at any time for any reason, so essentially <i>every</i> I/O operation on a Unix system must be prepared to handle this error properly. Surprisingly to some, this includes the C standard library functions such as <code>fread()</code>, <code>fwrite()</code>, and so on.</p> <p>For example, if you are writing a network server, then most of the time, you want to ignore the <code>SIGPIPE</code> signal which is raised when the client closes its end of a socket. However, this ignored signal can cause some pending I/O in the server to return <code>EINTR</code>.</p> <p>A commonly-held belief is that setting the <code>SA_RESTART</code> flag with the <code>sigaction()</code> system call means that if that signal is delivered, system calls are restarted for you and <code>EINTR</code> doesn't need to be handled. Unfortunately this is not true. The reason is that certain signals are unmaskable. For instance, on Mac OS X, if your process is blocking reading on standard input, and the user suspends the program by sending it <code>SIGSTOP</code> (usually by pressing ^Z in the terminal), then upon resumption, your <code>read()</code> call will immediately fail with <code>EINTR</code>.</p> <p>Don't believe me? The Mac OS X <code>cat</code> program is not actually interrupt-safe, and has this bug. 
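Handling this correctly in portable C code means wrapping every blocking call in a retry loop. Here is a minimal sketch of the idiom (my own illustration, not code taken from <code>cat</code> or from the Factor VM):</p> <pre>/* Retry a read() that may fail spuriously with EINTR.
   This is a generic illustration of the idiom, not code from any particular program. */
#include <errno.h><br />#include <unistd.h><br /><br />ssize_t safe_read(int fd, void *buf, size_t len)<br />{<br />    ssize_t n;<br />    do<br />        n = read(fd, buf, len);   /* may be interrupted by a signal */<br />    while (n == -1 && errno == EINTR); /* retry until it is not */<br />    return n;<br />}</pre> <p>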
Run <code>cat</code> with no arguments in a terminal, press ^Z, then type <code>%1</code>, and you'll get an error from cat!</p> <pre>$ cat<br />^Z<br />[1]+ Stopped cat<br />$ %1<br />cat<br />cat: stdin: Interrupted system call<br />$</pre> <p>As far as I'm aware, Factor properly handles interruptible system calls, and has for a while now, thanks to <a href="http://code-factor.blogspot.com">Doug Coleman</a> explaining the issue to me 4 years ago. Not having to deal with crap like this (not to mention being able to write cross-platform code that runs on both Unix and Windows) is one of the advantages of using a high-level language like Factor or Java over C.</p> <h3>Subprocesses inherit semi-random things from the parent process</h3> <p>When you <code>fork()</code> your process, various things are copied from the parent to the child; environment variables, file descriptors, the ignored signal mask, and so on. Less obvious is the fact that <code>exec()</code> doesn't reset everything. If shared file descriptors such as stdin and stdout were set to non-blocking in the parent, the child will start with these descriptors non-blocking also, which will most likely break most programs. <a href="http://factor-language.blogspot.com/2008/07/dont-set-shared-file-descriptors-to-non.html">I've blogged about this problem before</a>.</p> <p>A similar issue is that if you elect to ignore certain signals with the <code>SIG_IGN</code> action using <code>sigaction()</code>, then subprocesses will inherit this behavior. Again, this can break processes. Until yesterday, Factor would ignore <code>SIGPIPE</code> using this mechanism, and child processes spawned with the <code>io.launcher</code> vocabulary that expected to receive <code>SIGPIPE</code> would not work properly. There are various workarounds; you can reset the signal mask before the <code>exec()</code> call, or you can do what I did in Factor, and ignore the signal by giving it an empty signal handler instead of a <code>SIG_IGN</code> action.</p>Slava Pestovhttp://www.blogger.com/profile/02768382790667979877[email protected]7tag:blogger.com,1999:blog-17087850.post-34749270883540235662010-07-28T02:36:00.002-04:002010-07-28T02:36:39.168-04:00Overhauling Factor's C library interface<p>For the last while I've been working on an overhaul of Factor's <a href="http://docs.factorcode.org/content/article-alien.html">C library interface</a> and associated compiler infrastructure. The goals of the rewrite were three-fold:</p> <ul> <li>Improving performance</li> <li>Cleaning up some crusty old code that dates back to my earliest experiments with native code generation</li> <li>Laying the groundwork for future compiler improvements</li> </ul> <p>These changes were all behind-the-scenes; I did not introduce any new functionality or language changes, and I think Factor's FFI is already quite stable and feature-complete.</p> <h3>The previous FFI implementation</h3> <p>I started work on Factor's C library interface (FFI) almost immediately after I bootstrapped the native implementation off of JFactor. 
I began experimenting with an SDL-based UI early on and immediately decided I wanted to have a real FFI, instead of extending the VM with primitives which wrap C functions by hand.</p> <p>Over time, both the FFI and compiler have evolved in parallel, but there was little integration between them, other than the fact that both used the same <a href="http://factor-language.blogspot.com/2009/09/survey-of-domain-specific-languages-in.html">assembler DSL</a> under the hood. The result was that FFI calls were generated in a somewhat inefficient way. Since the optimizing compiler had no knowledge of how they work, it would save all values to the data stack first. Then the FFI code generator would take over; it would pop the input parameters one by one, pass them to a per-type C function in the Factor VM which unboxed the value, then stored the value in the stack frame. Finally, the target C function would be invoked, then another C function in the VM would box the return value, and the return value would be pushed to the stack. The optimizing compiler would then take over, possibly generating code to pop the value from the stack.</p> <p>The redundant stack traffic was wasteful. In some cases, the optimizing compiler would generate code to box a value and push it to the stack, only to have the FFI then generate code to pop it and unbox it immediately after. To make matters worse, over time the optimizing compiler gained the ability to box and unbox values with open-coded assembly sequences, but the FFI would still make calls into the VM to do it.</p> <p>All in all, it was about time I rewrote the FFI, modernizing it and integrating it with the rest of the compiler in the process.</p> <h3>Factoring FFI calls into simpler instructions</h3> <p>Most <a href="http://github.com/slavapestov/factor/blob/master/basis/compiler/cfg/instructions/instructions.factor">low-level IR instructions</a> are very simple; FFI calls used to be the exception. Now I've split these up into smaller instructions. Parameters and return values are now read from and written to the stack with the same <code>##peek</code> and <code>##replace</code> instructions that everything else uses, and boxing and unboxing parameters and return values is now done with the <a href="http://factor-language.blogspot.com/2010/05/collection-of-small-compiler.html">representation selection pass</a>. A couple of oddball C types, such as <code>long long</code> on x86-32, still require VM calls to box and unbox, and I added new instructions for those.</p> <p>One slightly tricky thing that came up was that I had to re-engineer low-level IR to support instructions with multiple output values. This is required for C calls which return structs and <code>long long</code> types by value, since each word-size chunk of the return value is returned in its own register, and these chunks have to be re-assembled later. In the future, I will be able to use this support to add instructions such as the x86 division instruction, which computes <code>x / y</code> and <code>x mod y</code> simultaneously.</p> <p>I also had to change low-level IR to distinguish between instructions with side effects and those without. Previously, optimization passes would assume any instruction with an output value did not have side effects, and could be deleted if the output value was not used. 
This is no longer true for C calls; a C function might both have a side effect and return a value.</p> <h3>GC maps</h3> <p>Now that FFI calls no longer force the optimizer to sync all live values to the data and retain stacks, it can happen that SSA values are live across an FFI call. These values get spilled to the call stack by the register allocator. Spilling to the call stack is cheaper than spilling to the data and retain stacks, because floating point values and integers do not need to be boxed, and since the spilling is done later in the optimization process, optimizations can proceed across the call site instead of being stumped by pushes and pops on either side.</p> <p>However, since FFI calls can invoke callbacks, which in turn run Factor code, which can trigger a garbage collection, the garbage collector must be able to identify spill slots in call frames which contain tagged pointers.</p> <p>The technique I'm using is to record "GC maps". This idea comes from a paper titled <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.87.71&rep=rep1&type=pdf">Compiler Support for Garbage Collection in a Statically Typed Language</a> (Factor is dynamically typed, and the paper itself doesn't have anything specific to static typing in it, so I found the title a bit odd). The Java HotSpot VM uses the same technique. The basic idea is that for every call site, you record a bitmap indicating which spill slots contain tagged pointers. This information is then stored in a per-code-block map, where the keys are return addresses and the values are these bitmaps.</p> <p>In the future, I intend on using GC maps at call sites of Factor words as well, instead of spilling temporary values to the retain stack; then I can eliminate the retain stack altogether, freeing up a register. After this is done the data stack will only be used to pass parameters between words, and not to store temporaries within a word. This will allow more values to be unboxed in more situations, and it will improve accuracy of compiler analyses.</p> <p>In fact, getting GC maps worked out was my primary motivation for this FFI rewrite; the code cleanups and performance improvements were just gravy.</p> <h3>Callback support improvements</h3> <p>Another one of those things that only makes sense when you look at how Factor evolved is that the body of an FFI callback was compiled with the non-optimizing compiler, rather than the optimizing compiler. It used to be that only certain definitions could be optimized, because static stack effects were optional and there were many common idioms which did not have static stack effects. It has been more than a year since I undertook the engineering effort to make the compiler enforce <a href="http://factor-language.blogspot.com/2009/03/better-static-safety-for-higher-order.html">static stack safety</a>, and the implementation of callbacks was the last vestigial remnant from the bad old days.</p> <p>This design made callbacks harder to debug than they should be; if you used up too many parameters, or forgot to push a return value, you'd be greeted with a runtime error instead of a compiler error. 
Now this has been fixed, and callbacks are fully compiled with the optimizing compiler.</p>Slava Pestovhttp://www.blogger.com/profile/02768382790667979877[email protected]1tag:blogger.com,1999:blog-17087850.post-49425850957673930472010-07-02T00:53:00.002-04:002010-07-02T00:57:54.288-04:00Factor talk in Boston, July 26thI will be presenting Factor at the Boston Lisp Users' Group on July 26th, 2010. Details in <a href="http://fare.livejournal.com/157280.html">François-René Rideau's announcement</a>. I'm also going to be giving a quick talk at the <a href="http://emerginglangs.com/">Emerging Languages Camp</a> in Portland, July 22nd. Unfortunately registration for this camp is already full.Slava Pestovhttp://www.blogger.com/profile/02768382790667979877[email protected]1tag:blogger.com,1999:blog-17087850.post-21770550530702994072010-05-29T04:46:00.019-04:002010-05-31T15:06:35.879-04:00Comparing Factor's performance against V8, LuaJIT, SBCL, and CPython<p>Together with <a href="http://useless-factor.blogspot.com">Daniel Ehrenberg</a> and <a href="http://duriansoftware.com/joe">Joe Groff</a>, I'm writing a paper about Factor for <a href="http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=9566©ownerid=3264">DLS2010</a>. We would appreciate feedback about the <a href="http://useless-factor.blogspot.com/2010/05/paper-for-dls-2010.html">draft version of the paper</a>. As part of the paper we include a performance comparison between Factor, V8, LuaJIT, SBCL, and Python. The performance comparison consists of some benchmarks from the <a href="http://shootout.alioth.debian.org/">The Computer Language Benchmarks Game</a>. I'm posting the results here first, in case there's something really stupid here.</p> <h3>Language implementations</h3> <p>Factor and V8 were built from their respective repositories. SBCL is version 1.0.38. LuaJIT is version 2.0.0beta4. CPython is version 3.1.2. 
All language implementations were built as 64-bit binaries and run on an 2.4 GHz Intel Core 2 Duo.</p> <h3>Benchmark implementations</h3> <p>Factor implementations of the benchmarks can be found in our source repository:</p> <ul> <li><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=extra/benchmark/binary-trees/binary-trees.factor;hb=HEAD">binary-trees</a></li> <li><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=extra/benchmark/fasta/fasta.factor;hb=HEAD">fasta</a></li> <li><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=extra/benchmark/knucleotide/knucleotide.factor;hb=HEAD">knucleotide</a></li> <li><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=extra/benchmark/nbody-simd/nbody-simd.factor;hb=HEAD">nbody</a></li> <li><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=extra/benchmark/regex-dna/regex-dna.factor;hb=HEAD">regex-dna</a></li> <li><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=extra/benchmark/reverse-complement/reverse-complement.factor;hb=HEAD">reverse-complement</a></li> <li><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=extra/benchmark/spectral-norm/spectral-norm.factor;hb=HEAD">spectral-norm</a></li> </ul> <p>Implementations for the other languages can be found at the language benchmark game <a href="https://alioth.debian.org/scm/?group_id=30402">CVS repository</a>:</p><table border="3"> <tr><th></th><th>LuaJIT</th><th>SBCL</th><th>V8</th><th>CPython</th></tr> <tr><td>binary-trees </td><td><a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/binarytrees/binarytrees.lua-2.lua?view=markup&root=shootout">binarytrees.lua-2.lua</a></td><td><a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/binarytrees/binarytrees.sbcl?view=markup&root=shootout">binarytrees.sbcl</a> </td><td><a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/binarytrees/binarytrees.javascript?view=markup&root=shootout">binarytrees.javascript</a> </td><td><a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/binarytrees/binarytrees.python3-6.python3?view=markup&root=shootout">binarytrees.python3-6.python3</a> </tr> <tr><td>fasta </td><td><a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/fasta/fasta.lua?view=markup&root=shootout">fasta.lua</a> </td><td><a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/fasta/fasta.sbcl?view=markup&root=shootout">fasta.sbcl</a> </td><td><a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/fasta/fasta.javascript-2.javascript?view=markup&root=shootout">fasta.javascript-2.javascript</a> </td><td><a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/fasta/fasta.python3-2.python3?view=markup&root=shootout">fasta.python3-2.python3</a> </td></tr> <tr><td>knucleotide </td><td><a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/knucleotide/knucleotide.lua-2.lua?view=markup&root=shootout">knucleotide.lua-2.lua</a></td><td><a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/knucleotide/knucleotide.sbcl-3.sbcl?view=markup&root=shootout">knucleotide.sbcl-3.sbcl</a> </td><td><a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/knucleotide/knucleotide.javascript-3.javascript?view=markup&root=shootout">knucleotide.javascript-3.javascript</a></td><td><a 
href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/knucleotide/knucleotide.python3-4.python3?view=markup&root=shootout">knucleotide.python3-4.python3</a> </td></tr> <tr><td>nbody </td><td><a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/nbody/nbody.lua-2.lua?view=markup&root=shootout">nbody.lua-2.lua</a> </td><td><a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/nbody/nbody.sbcl?view=markup&root=shootout">nbody.sbcl</a> </td><td><a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/nbody/nbody.javascript?view=markup&root=shootout">nbody.javascript</a> </td><td><a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/nbody/nbody.python3-4.python3 ?view=markup&root=shootout">nbody.python3-4.python3</a> </td></tr> <tr><td>regex-dna </td><td> </td><td><a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/regexdna/regexdna.sbcl-3.sbcl?view=markup&root=shootout">regexdna.sbcl-3.sbcl</a> </td><td><a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/regexdna/regexdna.javascript?view=markup&root=shootout">regexdna.javascript</a> </td><td><a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/regexdna/regexdna.python3 ?view=markup&root=shootout">regexdna.python3</a> </td></tr> <tr><td>reverse-complement </td><td><a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/revcomp/revcomp.lua?view=markup&root=shootout">revcomp.lua</a> </td><td><a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/revcomp/revcomp.sbcl?view=markup&root=shootout">revcomp.sbcl</a> </td><td><a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/revcomp/revcomp.javascript-2.javascript?view=markup&root=shootout">revcomp.javascript-2.javascript</a> </td><td><a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/revcomp/revcomp.python3-4.python3 ?view=markup&root=shootout">revcomp.python3-4.python3</a> </td></tr> <tr><td>spectral-norm </td><td><a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/spectralnorm/spectralnorm.lua?view=markup&root=shootout">spectralnorm.lua</a> </td><td><a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/spectralnorm/spectralnorm.sbcl-3.sbcl?view=markup&root=shootout">spectralnorm.sbcl-3.sbcl</a></td><td><a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/spectralnorm/spectralnorm.javascript?view=markup&root=shootout">spectralnorm.javascript</a> </td><td><a href="https://alioth.debian.org/scm/viewvc.php/shootout/bench/spectralnorm/spectralnorm.python3-5.python3?view=markup&root=shootout">spectralnorm.python3-5.python3</a></td></tr> </table><p>In order to make the reverse complement benchmark work with SBCL on Mac OS X, I had to apply this patch; I don't understand why:</p> <pre>--- bench/revcomp/revcomp.sbcl 9 Feb 2007 17:17:26 -0000 1.4<br />+++ bench/revcomp/revcomp.sbcl 29 May 2010 08:32:19 -0000<br />@@ -26,8 +26,7 @@<br /> <br /> (defun main ()<br /> (declare (optimize (speed 3) (safety 0)))<br />- (with-open-file (in "/dev/stdin" :element-type +ub+)<br />- (with-open-file (out "/dev/stdout" :element-type +ub+ :direction :output :if-exists :append)<br />+ (let ((in sb-sys:*stdin*) (out sb-sys:*stdout*))<br /> (let ((i-buf (make-array +buffer-size+ :element-type +ub+))<br /> (o-buf (make-array +buffer-size+ :element-type +ub+))<br /> (chunks nil))<br />@@ -72,4 +71,4 @@<br /> (setf start 0)<br /> (go read-chunk))))<br /> end-of-input<br />- (flush-chunks)))))))<br />+ (flush-chunks))))))<br /></pre> <h3>Running the benchmarks</h3> 
<p>I used Factor's deploy tool to generate minimal images for the Factor benchmarks, and then ran them from the command line:</P> <pre>./factor -e='USE: tools.deploy "benchmark.nbody-simd" deploy'<br />time benchmark.nbody-simd.app/Contents/MacOS/benchmark.nbody-simd</pre> <p>For the scripting language implementations (LuaJIT and V8) I ran the scripts from the command line:</p> <pre>time ./d8 ~/perf/shootout/bench/nbody/nbody.javascript -- 1000000<br />time ./src/luajit ~/perf/shootout/bench/nbody/nbody.lua-2.lua 1000000</pre> <p>For SBCL, I did what the shootout does, and compiled each file into a new core:</p> <pre>ln -s ~/perf/shootout/bench/nbody/nbody.sbcl .<br /><br />cat > nbody.sbcl_compile <<EOF<br />(proclaim '(optimize (speed 3) (safety 0) (debug 0) (compilation-speed 0) (space 0)))<br />(handler-bind ((sb-ext:defconstant-uneql (lambda (c) (abort c))))<br /> (load (compile-file "nbody.sbcl" )))<br />(save-lisp-and-die "nbody.core" :purify t)<br />EOF<br /><br />sbcl --userinit /dev/null --load nbody.sbcl_compile<br /><br />cat > nbody.sbcl_run <<EOF<br />(proclaim '(optimize (speed 3) (safety 0) (debug 0) (compilation-speed 0) (space 0)))<br />(main) (quit)<br />EOF<br /><br />time sbcl --dynamic-space-size 500 --noinform --core nbody.core --userinit /dev/null --load nbody.sbcl_run 1000000</pre><p>For CPython, I precompiled each script into bytecode first:</p><pre>python3.1 -OO -c "from py_compile import compile; compile('nbody.python3-4.py')"</pre> <h3>Benchmark results</h3><p>All running times are wall clock time from the Unix <code>time</code> command. I ran each benchmark 5 times and used the best result.</p><table border='3'> <tr><td> </td><th>Factor</th><th>LuaJIT</th><th>SBCL </th><th>V8 </th><th>CPython </th></tr> <tr><td>fasta </td><td>2.597s</td><td>1.689s</td><td>2.105s</td><td>3.948s </td><td>35.234s </td></tr> <tr><td>reverse-complement</td><td>2.377s</td><td>1.764s</td><td>2.955s</td><td>3.884s </td><td>1.669s </td></tr> <tr><td>nbody </td><td>0.393s</td><td>0.604s</td><td>0.402s</td><td>4.569s </td><td>37.086s </td></tr> <tr><td>binary-trees </td><td>1.764s</td><td>6.295s</td><td>1.349s</td><td>2.119s </td><td>19.886s </td></tr> <tr><td>spectral-norm </td><td>1.377s</td><td>1.358s</td><td>2.229s</td><td>12.227s</td><td>1m44.675s</td></tr> <tr><td>regex-dna </td><td>0.990s</td><td>N/A </td><td>0.973s</td><td>0.166s </td><td>0.874s </td></tr> <tr><td>knucleotide </td><td>1.820s</td><td>0.573s</td><td>0.766s</td><td>1.876s </td><td>1.805s </td></tr> </table><br /><h3>Benchmark analysis</h3> <p>Some notes on the results:</p> <ul> <li>There is no Lua implementation of the regex-dna benchmark.</li><li>Some of the SBCL benchmark implementations can make use of multiple cores if SBCL is compiled with thread support. However, by default, thread support seems to be disabled on Mac OS X. None of the other language implementations being tested have native thread support, so this is a single-core performance test.</li><li>Factor's string manipulation still needs work. The fasta, knucleotide and reverse-complement benchmarks are not as fast as they should be.</li> <li>The binary-trees benchmark is a measure of how fast objects can be allocated, and how fast the garbage collector can reclaim dead objects. LuaJIT loses big here, perhaps because it lacks generational garbage collection, and because Lua's tables are an inefficient object representation.</li> <li>The regex-dna benchmark is a measure of how efficient the regular expression implementation is in the language. 
V8 wins here, because it uses Google's heavily-optimized Irregexp library.</li> <li>Factor beats the other implementations on the nbody benchmark because it is able to make use of SIMD.</li> <li>For some reason SBCL is slower than the others on spectral-norm. It should be generating the same code.</li> <li>The benchmarks exercise too narrow a range of language features. Any benchmark that uses native-sized integers (for example, an implementation of the SHA1 algorithm) would shine on SBCL and suffer on all the others. Similarly, any benchmark that requires packed binary data support would shine on Factor and suffer on all the others. However, the benchmarks in the shootout mostly consist of scalar floating point code, and text manipulation only.</li></ul><h3>Conclusions</h3><p>Factor's performance is coming along nicely. I'd like to submit Factor to the computer language shootout soon. Before doing that, we need a Debian package, and the deploy tool needs to be easier to use from the command line.</p>Slava Pestovhttp://www.blogger.com/profile/02768382790667979877[email protected]16tag:blogger.com,1999:blog-17087850.post-50435528124337529252010-05-10T18:23:00.003-04:002010-05-10T18:27:55.947-04:00A collection of small compiler improvements<p>One of the big optimizations I'm planning on implementing in Factor is support for computations on machine-word-sized integers. The motivation is as follows. While code that operates on small integers (fixnums) does not currently allocate memory for intermediate values, and as a result can compile very efficiently if type checks are eliminated, sometimes fixnum precision is not quite enough. Using bignums in algorithms such as SHA1 that require 32-bit or 64-bit arithmetic incurs a big performance hit over writing the code in a <a href="http://bitbucket.org/kssreeram/clay">systems language</a>. My plan is to have the compiler support machine integers as an intermediate step between fixnums and bignums. Machine integers would be boxed and unboxed at call boundaries, but tight loops operating on them would run in registers.</p> <p>While I haven't yet implemented the above optimization, I've laid the groundwork and made it possible to represent operations on untagged integers at the level of compiler IR at least. While unboxed floats do not cause any complications for the GC, unboxed integers do, because they share a register file with tagged pointers. To make Factor's precise GC work, the compiler can now distinguish registers containing tagged pointers from those containing arbitrary integer data, by breaking down the register class of integer registers into two "representations", tagged and untagged.</p> <p>Since the above reworking touched many different parts of the compiler, I also took the time to implement some minor optimizations and clean up some older code. These improvements will be the topic of this post.</p> <p>To understand this post better, you should probably skim the list of low-level IR instructions defined in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/instructions/instructions.factor;hb=HEAD">compiler.cfg.instructions</a> first.
Also, feel free to post comments here or e-mail me questions about the design of the Factor compiler.</p> <h3>Improvements to value numbering</h3> <p>In the x86 architecture, almost any instruction can take a memory operand, and memory operands take the form <code>(base + displacement * scale + offset)</code>, where <code>base</code> and <code>displacement</code> are registers, <code>offset</code> is a 32-bit signed integer, and <code>scale</code> is 1, 2, 4 or 8. While using a complex addressing mode is not any faster than writing out the arithmetic by hand with a temporary register used to store the final address, it can reduce register pressure and code size.</p> <p>The compiler now makes better use of complex addressing modes. Prior to optimization, the low level IR builder lowers specialized array access to the <code>##load-memory-imm</code> and <code>##store-memory-imm</code> instructions. For example, consider the following code:</p> <pre>USING: alien.c-types specialized-arrays ;<br />SPECIALIZED-ARRAY: float<br /><br />: make-data ( -- seq ) 10 iota [ 3.5 + sqrt ] float-array{ } map-as ;</pre> The inner loop consists of the following low level IR: <pre>##integer>float 23 14<br />##sqrt 24 23<br />##load-integer 25 2<br />##slot-imm 26 15 2 7<br />##load-integer 27 4<br />##mul 28 14 27<br />##tagged>integer 29 26<br />##add-imm 30 29 7<br />##add 31 30 28<br />##store-memory-imm 24 31 0 float-rep f<br />##load-integer 32 1<br />##add 33 14 32</pre> <p>The <code>##mul</code>, <code>##add-imm</code> and <code>##add</code> instructions compute the address input to <code>##store-memory-imm</code>. After value numbering, these three instructions have been fused with <code>##store-memory-imm</code> to form <code>##store-memory</code>:</p> <pre>##integer>float 62 57<br />##sqrt 63 62<br />##slot-imm 65 48 2 7<br />##tagged>integer 68 65<br />##store-memory 63 68 57 2 7 float-rep f<br />##add-imm 72 57 1</pre> <h3>Optimistic copy propagation</h3> <p>While the low-level optimizer's value numbering and alias analysis passes only operate on individual basic blocks (<i>local</i> optimizations), the copy propagation pass is <i>global</i>, meaning it takes the entire procedure being compiled into account.</p> <p>Eventually I will merge the local alias analysis and value numbering passes with the copy propagation pass to get an optimization known as "global value numbering with redundant load elimination". One step along this road is a rewrite that makes the copy propagation pass <i>optimistic</i>, in the following sense.</p> <p>Copy propagation concerns itself with eliminating <code>##copy</code> instructions, which correspond to assignments from one SSA value to another (new) value:</p> <pre>y = x</pre> <p>When such an instruction is processed, all usages of the value <code>y</code> can be replaced with the value <code>x</code>, in every basic block, and the definition of <code>y</code> deleted.</p> <p>A simple extension of the above treats a <code>##phi</code> instruction where all inputs are equal the same as a copy. This optimizes cases such as:</p> <pre> a = ...<br />if(...)<br />{<br /> b = a<br />}<br />c = phi(b,a)</pre> <p>The case of <code>##phi</code> instructions where one of the inputs is carried by a back edge in the control flow graph (ie, a loop) is where the optimistic assumption comes in. 
Consider the following code, where the definition of <code>x''</code> is carried by a back edge:</p> <pre>x' = 0<br /><br />while(...)<br />{<br /> x = phi(x',x'')<br /> x'' = ...<br />}</pre> <p>Optimistic copy propagation is able to eliminate the phi node entirely. Switching from pessimistic to optimistic copy propagation eliminated about 10% of copy instructions. While this is not a great amount, it did reduce spilling in a few subroutines I looked at, and there is no increase in code complexity from this new approach.</p> <h3>Representation selection improvements</h3> <p>The low-level IR now includes three new instructions, <code>##load-double</code>, <code>##load-float</code> and <code>##load-vector</code>. These are intended for loading unboxed floating point values and SIMD vectors into vector registers without using an integer register to temporarily hold a pointer to the boxed value.</p> <p>Uses of these instructions are inserted by the representation selection pass. If the destination of a <code>##load-reference</code> is used in an unboxed representation, then instead of inserting a memory load following the <code>##load-reference</code>, the instruction is replaced with one of the new variants.</p> <p>Loads from constant addresses directly into floating point and vector registers are only possible on x86, and not PowerPC, so the optimization is not performed on the latter architecture. On x86-32, an immediate displacement can encode any 32-bit address. On x86-64, loading from an absolute 64-bit address requires an integer register; however, instruction pointer-relative addressing is supported with a 32-bit displacement. To make use of this capability, the compiler now supports a new "binary literal area" for unboxed data that compiled code can reference. The unboxed data is placed immediately following a word's machine code, allowing RIP-relative addressing to be used on x86-64.</p> <p>This optimization, together with value numbering improvements, helps reduce pressure on integer registers in floating point loops.</p> <h3>Register allocator improvements</h3> <p>Once representation selection has run, each SSA value has an associated representation and register class. Each representation always belongs to one specific register class; a register class is a finite set of registers from which the register allocator is allowed to choose a register for this type of value.</p> <p>Before register allocation, the SSA destruction pass transforms <code>##phi</code> instructions into <code>##copy</code>s, and subsequently performs live range coalescing to eliminate as many of the copies as possible.</p> <p>Coalescing two SSA values from different register classes does not make sense; the compiler will not be able to generate valid machine code if a single SSA value is used as both a float and an integer, since on most CPUs the integer and floating point register files are distinct.</p> <p>However, coalescing values with different representations, but the same register class, is okay. Consider the case where a double-precision float is computed, and then converted into a single-precision float, and subsequently stored into memory.
If the double-precision value is not used after the conversion, then the same register that held the input value can be reused to store the result of the single-precision conversion.</p> <p>Previously this pass only coalesced values with identical representations; now I've generalized it to coalescing values with the same register class but possibly different representations. This reduces register pressure and spilling.</p> <p>The tricky part about making it work is that the register allocator needs to know the value's representation, not just its register class, for generating correct spills and reloads. For example, a spilled value with register class <code>float-regs</code> can use anywhere from 4 to 16 bytes in the stack frame, depending on the specific representation: single precision float, double precision float, or SIMD vector. Clearly, if coalescing simply lost this fine-grained representation information and only retained register classes, the register allocator would not have enough information to generate valid code.</p> <p>The solution is to have live range coalescing compute the equivalence classes of SSA values without actually renaming any usages to point to the canonical representative. The renaming map is made available to the register allocator. Most places in the register allocator where instruction operands are examined make use of the renaming map.</p> <p>Until the splitting phase, these equivalence classes are in one-to-one correspondence with live intervals, and each live interval has a single machine register and spill slot. However, when spill and reload decisions are made, the register allocator uses the original SSA values to look up representations.</p> <p>If Factor's compiler did not use SSA form at all, there would still be a copy coalescing pass, and the register allocator could also support coalescing values with different representations, by first performing a dataflow analysis known as <a href="http://en.wikipedia.org/wiki/Reaching_definition">reaching definitions</a>. This would propagate representation information to use sites.</p> <p>Retaining SSA structure all the way until register allocation gives you the results of this reaching definitions analysis "for free". In both the SSA and non-SSA case, you still don't want definitions with multiple different representations reaching the same call site. The SSA equivalent of this would be a phi instruction whose inputs had different representations. The compiler ensures this doesn't happen by first splitting up the set of SSA values into strongly-connected components (SCCs) whose edges are phi nodes. The same representation is then assigned to each member of every SCC; if an agreeable representation is not found then it falls back on boxing and tagging all members.</p> <p>Notice that while both representation selection and SSA destruction group values into equivalence classes, the groupings do not correspond to each other and neither one is a refinement of the other. Not all copies resulting from phi nodes get coalesced away, so a single SCC may intersect multiple coalescing classes. 
This can happen if the first input to a phi node is live at the definition point of the second: <pre>b = ...<br /><br />if(...)<br />{<br /> a = ...<br /> foo(b)<br />}<br />/* 'c' can be coalesced with 'a' or 'b' -- but not both.<br />All three values belong to the same SCC and must have the same<br />representation */<br />c = phi(a,b)</pre> On the other hand, two values from different SCCs might also get coalesced together:</p> <pre>a = some-double-float<br />/* if 'a' is not live after this instruction, then 'a' and 'b'<br />can share a register. They belong to different SCCs and have<br />different representations */<br />b = double-to-single-precision-float(a)</pre> <h3>Compiler code cleanups</h3> <p>In addition to just adding new optimizations, I've also simplified and refactored the internals of some of the passes I worked on.</p> <p>One big change is that the "machine representation" used at the very end of the compilation process is gone. Previously, after register allocation had run, the control flow graph would be flattened into a list of instructions with CFG edges replaced by labels and jumps. The code generator would then iterate over this flat representation and emit machine code for each virtual instruction. However, since no optimization passes ran on the flat representation, it was generated and then immediately consumed. Combining the linearization pass with the code generation pass let me eliminate about a dozen IR instructions that were specific to the flat representation (anything dealing with labels and jumps to labels). It also sped up compilation slightly.</p> <p>Value numbering's common subexpression elimination now uses a simpler and more efficient scheme for hashed lookup of expressions. New algebraic simplifications are also easier to add now, because the def-use graph is easier to traverse.</p> <p>Some instructions which could be expressed in terms of others were removed, namely the <code>##string-nth</code> and <code>##set-string-nth-fast</code> instructions. Reading and writing characters from strings can be done by combining simpler memory operations already present in the IR.</p> <p>A number of instructions were combined. All of the instructions for reading from memory in different formats (<code>##alien-unsigned-1</code>, <code>##alien-signed-4</code>, <code>##alien-double</code>, and so on) have been merged into a single <code>##load-memory</code> instruction which has constant operands storing the C type and representation of the data being loaded. Similarly, <code>##set-alien-signed-8</code>, etc. have all been merged into <code>##store-memory</code>. This restructuring allows optimization passes to more easily treat all memory accesses in a uniform way.</p> <h3>A bug fix for alias analysis</h3> <p>While working on the above I found an interesting bug in alias analysis. It would incorrectly eliminate loads and stores in certain cases. The optimizer assumed that a pointer to a freshly-allocated object could never alias a pointer read from the heap. However, the problem of course is that such a pointer could be stored in another object, and then read back in.</p> <p>Here is a Factor example that demonstrates the problem.
First, a couple of tuple definitions; the type declarations and initial value are important further on.</p> <pre>USING: typed locals ;<br /><br />TUPLE: inner { value initial: 5 } ;<br />TUPLE: outer { value inner } ;</pre> <p>Now, a very contrived word which takes two instances of <code>outer</code>,<br />mutates one, and reads a value out of the other:</p> <pre>TYPED: testcase ( x: outer y: outer -- )<br /> inner new :> z ! Create a new instance of 'inner'<br /> z x value<< ! Store 'z' in 'x'<br /> y value>> :> new-z ! Load 'new-z' from 'y'<br /> 10 new-z value<< ! Store a value inside 'new-z' <br /> z value>> ! Read a value out of 'z'<br /> ;</pre> <p>Note that the initial value of the <code>value</code> slot in the <code>inner</code> tuple is 5, so <code>inner new</code> creates a new instance holding the value 5. In the case where <code>x</code> and <code>y</code> point at the same object, <code>new-z</code> and <code>z</code> will point to the same object, and the newly-allocated object's value is set to 10. However, the compiler did not realize this, and erroneously constant-folded <code>z value>></code> down to 5!</p> <p>This bug never came up in actual code; the facepalm moment came while I was doing some code refactoring and noticed that the above case was not being handled.</p> <h3>Instruction scheduling to reduce register pressure</h3> <p>I'm not the only one who's been busy improving the Factor compiler. One of the <a href="http://useless-factor.blogspot.com">co-inventors of Factor</a> has cooked up an <a href="http://useless-factor.blogspot.com/2010/02/instruction-scheduling-for-register.html">instruction scheduler</a> and <a href="http://useless-factor.blogspot.com/2010/04/guarded-method-inlining-for-factor.html">improved method inlining</a>. The scheduler has been merged, and it looks like the method inlining improvements should be ready in a few days.</p> <h3>Some benchmarks</h3> <p>Here is a comparison between the April 17th build of Factor and the latest code from the Git repository. I ran a few of the <a href="http://shootout.alioth.debian.org/">language shootout benchmarks</a> on a 2.4 GHz MacBook Pro. Factor was built in 32-bit mode, because the new optimizations are most effective on the register-starved x86.</p> <table> <tr><th>Benchmark</th><th>Before (ms)</th><th>After (ms)</th></tr> <tr><td>nbody-simd</td><td>415</td><td>349</td></tr> <tr><td>fasta</td><td>2667</td><td>2485</td></tr> <tr><td>knucleotide</td><td>182</td><td>177</td></tr> <tr><td>regex-dna</td><td>109</td><td>86</td></tr> <tr><td>spectral-norm</td><td>1641</td><td>1347</td></tr> </table>Slava Pestovhttp://www.blogger.com/profile/02768382790667979877[email protected]2tag:blogger.com,1999:blog-17087850.post-57406824513404690202010-04-16T18:24:00.004-04:002010-04-17T01:01:56.876-04:00Factor 0.93 now available<p>After two months of development, Factor 0.93 is now available for download from the <a href="http://factorcode.org">Factor website</a>. A big thanks to all the contributors, testers and users. This release would not have been possible without your help. A summary of the most important changes follows:</p><h3>Incompatible changes:</h3> <ul> <li>Factor no longer supports NetBSD, due to limitations in that operating system.
(Slava Pestov)</li> <li><a href="http://docs.factorcode.org/content/article-sets.html">sets</a>, <a href="http://docs.factorcode.org/content/article-hash-sets.html">hash-sets</a>, <a href="http://docs.factorcode.org/content/article-bit-sets.html">bit-sets</a>: these vocabularies have been redesigned around a new <a href="http://useless-factor.blogspot.com/2010/03/protocol-for-sets.html">generic protocol for sets</a> (Daniel Ehrenberg) <li>Strings can no longer be used to refer to C types in FFI calls; you must use C type words instead. Also, ABI specifiers are now symbols rather than strings. So for example, the following code will no longer work: <pre>: my-callback ( -- callback ) "int" { "int" "int" } "cdecl" [ + ] alien-callback ;</pre> you must now write: <pre>: my-callback ( -- callback ) int { int int } cdecl [ + ] alien-callback ;</pre> (Joe Groff)</li> <li>The behavior of string types has changed. The <code>char*</code> C type is now just a bare pointer; to get the automatic conversion to and from Factor strings, use the <code>c-string</code> type. See <a href="http://docs.factorcode.org/content/article-c-strings.html">the documentation</a> for details. (Joe Groff)</li> <li>FFI function return values which are pointers to structs are now boxed in a struct class, instead of returning a bare alien. This means that many <code>memory>struct</code> calls can be removed. If the return value is actually a pointer to an array of structs, use specialized arrays as before. (Joe Groff)</li> <li>C-ENUM: now takes an additional parameter which is either the name of a new C type to define, or <code>f</code>. This type is aliased to <code>int</code>. (Erik Charlebois)</li> <li><a href="http://duriansoftware.com/joe/Improving-Factor's-error-messages.html">The stack checker now supports row polymorphic stack effects</a>, for improved error checking. See <a href="http://docs.factorcode.org/content/article-effects-variables.html">the documentation</a> for details. (Joe Groff)</li> </ul> <h3>New features:</h3> <ul> <li>Co-operative threads now use efficient context-switching primitives instead of copying stacks with continuations (Slava Pestov)</li> <li>Added <a href="http://docs.factorcode.org/content/word-final%2Csyntax.html">final class declarations</a>, to prohibit classes from being subclassed. This enables <a href="http://factor-language.blogspot.com/2010/02/final-classes-and-platform-specific.html">a few compiler optimizations</a> (Slava Pestov)</li> <li>Added <a href="http://factor-language.blogspot.com/2010/02/final-classes-and-platform-specific.html">platform support metadata</a> for vocabularies. Vocabulary directories can now contain a <code>platforms.txt</code> file listing operating system names which they can be loaded under. (Slava Pestov)</li> <li>The deploy tool can now deploy <a href="http://docs.factorcode.org/content/article-loading-libs.html">native libraries</a> and <a href="http://docs.factorcode.org/content/article-deploy-resources.html">resource files</a>, and supports <a href="http://docs.factorcode.org/content/article-vocabs.icons.html">custom icons</a>. 
(Joe Groff)</li> <li><a href="http://useless-factor.blogspot.com/2010/03/expressing-joint-behavior-of-modules.html">Joint behavior of vocabularies</a> - the new <a href="http://docs.factorcode.org/content/word-require-when%2Cvocabs.loader.html">require-when</a> word can be used to express that a vocabulary should be loaded if two existing vocabularies have been loaded (Daniel Ehrenberg)</li> <li>Callstack overflow is now caught and reported on all platforms except for Windows and OpenBSD. (Slava Pestov)</li> <li>The prettyprinter now has some limits set by default. This prevents out of memory errors when printing large structures, such as the global namespace. Use the <a href="http://docs.factorcode.org/content/word-without-limits%2Cprettyprint.config.html">without-limits</a> combinator to disable limits. (Slava Pestov)</li> <li>Added <a href="http://docs.factorcode.org/content/word-fastcall%2Calien.html">fastcall</a> and <a href="http://docs.factorcode.org/content/word-thiscall%2Calien.html">thiscall</a> ABIs on x86-32 (Joe Groff)</li> <li>The build farm's version release process is now more automated (Slava Pestov)</li> </ul> Improved libraries: <ul> <li><a href="http://docs.factorcode.org/content/vocab-delegate.html">delegate</a>: add <a href="http://docs.factorcode.org/content/word-BROADCAST__colon__%2Cdelegate.html">BROADCAST:</a> syntax, which delegates a generic word with no outputs to an array of multiple delegates. (Joe Groff)</li> <li><a href="http://docs.factorcode.org/content/vocab-game.input.html">game.input</a>: X11 support. (William Schlieper)</li> <li><a href="http://docs.factorcode.org/content/vocab-gpu.html">gpu</a>: geometry shader support. (Joe Groff)</li> <li><a href="http://docs.factorcode.org/content/vocab-opengl.gl.html">opengl</a>: OpenGL 4.0 support. (Joe Groff)</li> </ul> New libraries: <ul> <li><a href="http://docs.factorcode.org/content/vocab-bit.ly.html">bit.ly</a>: Factor interface to the <a href="http://bit.ly">bit.ly</a> URL shortening service. (Slava Pestov)</li> <li><a href="http://docs.factorcode.org/content/vocab-chipmunk.html">chipmunk</a>: binding for Chipmunk 2D physics library. (Erik Charlebois)</li> <li><a href="http://docs.factorcode.org/content/vocab-cuda.html">cuda</a>: binding to NVidia's CUDA API for GPU computing. (Doug Coleman, Joe Groff)</li> <li><a href="http://docs.factorcode.org/content/vocab-cursors.html">cursors</a>: experimental library for iterating over collections, inspired by STL iterators. (Joe Groff)</li> <li><a href="http://docs.factorcode.org/content/vocab-elf.html">elf</a>: parsing ELF binaries. The <a href="http://docs.factorcode.org/content/vocab-elf.nm.html">elf.nm</a> vocabulary is a demo which prints all symbols in an ELF binary. (Erik Charlebois)</li> <li><a href="http://docs.factorcode.org/content/vocab-images.pbm.html">images.pbm</a>, <a href="http://docs.factorcode.org/content/vocab-images.pgm.html">images.pgm</a>, <a href="http://docs.factorcode.org/content/vocab-ppm.html">images.ppm</a>: libraries for loading PBM, PGM and PPM images (Erik Charlebois)</li> <li><a href="http://docs.factorcode.org/content/vocab-macho.html">macho</a>: parsing Mach-O binaries. (Erik Charlebois)</li> <li><a href="http://docs.factorcode.org/content/vocab-opencl.html">opencl</a>: binding for the OpenCL standard interface for GPU computing. (Erik Charlebois)</li> <li><a href="http://docs.factorcode.org/content/vocab-path-finding.html">path-finding</a>: implementation of A* and BFS path finding algorithms. 
(Samuel Tardieu)</li> <li><a href="http://docs.factorcode.org/content/vocab-windows.ddk.html">windows.ddk</a>: binding for Windows hardware abstraction layer. (Erik Charlebois)</li> </ul>Slava Pestovhttp://www.blogger.com/profile/02768382790667979877[email protected]0tag:blogger.com,1999:blog-17087850.post-79294957129300342372010-04-08T20:16:00.002-04:002010-04-09T14:54:28.047-04:00Frame-based structured exception handling on Windows x86-64<p>Factor used to use vectored exception handlers, registered with <a href="http://msdn.microsoft.com/en-us/library/ms680588(VS.85).aspx">AddVectoredExceptionHandler</a>, however vectored handlers are somewhat problematic. A vectored handler is always called prior to any frame-based handlers, so Factor could end up reporting bogus exceptions if the FFI is used to call a library that uses SEH internally. This prompted me to switch to frame-based exception handlers. Unfortunately, these are considerably more complex to use, and the implementation differs between 32-bit and 64-bit Windows.</p> <p>I briefly discussed frame-based SEH on 32-bit Windows in my <a href="http://factor-language.blogspot.com/2010/04/switching-call-stacks-on-different.html">previous post</a>. During the switch to 64 bits, Microsoft got rid of the old frame-based SEH implementation and introduced a new, lower-overhead approach. Instead of pushing and popping exception handlers onto a linked list at runtime, the system maintains a set of function tables, where each function table stores exception handling and stack frame unwinding information.</p> <p>Normally, the 64-bit Windows function tables are written into the executable by the native compiler. However, language implementations which generate code at runtime need to be able to define new function tables dynamically. This is done with the <a href="http://msdn.microsoft.com/en-us/library/ms680588(VS.85).aspx">RtlAddFunctionTable()</a> function.</p> <p>It took me a while to figure out the correct way to call this function. I found the <a href="http://hg.openjdk.java.net/jdk7/jsn/hotspot/file/d9bc824aa078/src/os_cpu/windows_x86/vm/os_windows_x86.cpp">os_windows_x86.cpp</a> source file from Sun's HotSpot Java implementation was very helpful, and I based my code on the <code>register_code_area()</code> function from this file.</p> <p>Factor and HotSpot only use function tables in a very simple manner, to set up an exception handler. Function tables can also be used to define stack unwinding behavior; this allows debuggers to generate backtraces, and so on. Doing that is more complicated and I don't understand how it works, so I won't attempt to discuss it here.</p> <p>The <code>RtlAddFunctionTable()</code> function takes an array of <code>RUNTIME_FUNCTION</code> structures and a base address. For some unknown reason, all pointers in the structures passed to this function are 32-bit integers relative to the base address.</p> <p>For a runtime compiler that does not need to perform unwinding, it suffices to map the entire code heap to one <code>RUNTIME_FUNCTION</code>. A <code>RUNTIME_FUNCTION</code> has three fields:</p> <ul> <li><code>BeginAddress</code> - the start of the function</li> <li><code>EndAddress</code> - the end of the function</li> <li><code>UnwindData</code> - a pointer to unwind data</li> </ul> <p>All pointers are relative to the base address passed into <code>RtlAddFunctionTable()</code>. The unwind data can take various forms. 
For the simple case of no unwind codes and an exception handler, the following structure is used:</p> <pre>struct UNWIND_INFO {<br /> UBYTE Version:3;<br /> UBYTE Flags:5;<br /> UBYTE SizeOfProlog;<br /> UBYTE CountOfCodes;<br /> UBYTE FrameRegister:4;<br /> UBYTE FrameOffset:4;<br /> ULONG ExceptionHandler;<br /> ULONG ExceptionData[1];<br />};</pre> <p>The <code>Version</code> and <code>Flags</code> fields should be set to 1, the <code>ExceptionHandler</code> field set to a function pointer, and the rest of the fields set to 0. The exception handler pointer must be expressed relative to the base address, and it must also be within the memory range specified by the <code>BeginAddress</code> and <code>EndAddress</code> fields of the <code>RUNTIME_FUNCTION</code> structure. The exception handler has the same function signature as in the 32-bit SEH case:</p> <pre>LONG exception_handler(PEXCEPTION_RECORD e, void *frame, PCONTEXT c, void *dispatch)</pre> <p>In both Factor and HotSpot, the exception handler is a C function; however, the <code>RtlAddFunctionTable()</code> API requires that it be within the bounds of the runtime code heap. To get around the restriction, both VMs allocate a small trampoline in the code heap which simply jumps to the exception handler, and use a pointer to the trampoline instead. Similarly, because the "pointers" in these structures are actually 32-bit integers, it helps to allocate the <code>RUNTIME_FUNCTION</code> and <code>UNWIND_INFO</code> in the code heap as well, to ensure that everything is within the same 4GB memory range.</p><p>The above explanation probably didn't make much sense, so I invite you to check out the source code instead: <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/os-windows-nt-x86.64.cpp;hb=HEAD">os-windows-nt-x86.64.cpp</a>.</p>Slava Pestovhttp://www.blogger.com/profile/02768382790667979877[email protected]2tag:blogger.com,1999:blog-17087850.post-37950581336141231712010-04-04T16:24:00.005-04:002010-04-04T16:35:09.363-04:00Switching call stacks on different platforms<p>User-space implementations of coroutines and green threads need to be able to switch the CPU's call stack pointer between different memory regions. Since this is inherently CPU- and OS-specific, I'll limit my discussion to CPUs and platforms that Factor supports.</p> <h3>System APIs for switching call stacks</h3> <ul> <li>Windows has an API for creating and switching between contexts, which it calls "fibers". The main functions to look at are <a href="http://msdn.microsoft.com/en-us/library/ms682402(VS.85).aspx">CreateFiber()</a> and <a href="http://msdn.microsoft.com/en-us/library/ms686350(VS.85).aspx">SwitchToFiber()</a>. On Windows, by far the easiest way to switch call stacks is to just use fibers.</li> <li>Most Unix systems have a set of functions operating on <code>ucontext</code>s, such as <code>makecontext()</code>, <code>swapcontext()</code> and so on. On some systems, these APIs are poorly implemented and documented.</li> <li>The C standard library functions <code>setjmp()</code> and <code>longjmp()</code> can be (ab)used to switch contexts. The <code>jmp_buf</code> structure stored to by <code>setjmp()</code> and read from by <code>longjmp()</code> contains a snapshot of all registers, including the call stack pointer. Once you've allocated your own call stack, you can capture a <code>jmp_buf</code> with <code>setjmp()</code>, change the call stack pointer, and then resume execution with <code>longjmp()</code>.
As far as I know, this is completely undocumented and unsupported with every major C compiler.</li> <li>The most direct way is to write some assembly to switch the relevant registers directly. Paradoxically, this approach is the most portable out of all of these, because while it is CPU-specific, the details are pretty much the same across most OSes, Windows being the exception. Switching call stacks on Windows is a bit more involved than just changing the <code>ESP</code> register, and requires updating some Windows-specific in-memory structures. However, it is still easy enough to do directly, if you do not wish to use the fiber API. I will describe the details at the end of this post.</li> </ul> <h3>High-level libraries for switching call stacks</h3> <p>A couple of existing libraries implement high-level, cross-platform abstractions using some combination of the above mechanisms.</p> <ul> <li><a href="http://www.dekorte.com/projects/opensource/libcoroutine/">libcoroutine</a> uses fibers on Windows, and ucontext and longjmp on Unix.</li> <li><a href="http://software.schmorp.de/pkg/libcoro.html">libcoro</a> uses a handful of Unix-specific APIs if they're available, or falls back onto some hand-coded assembly routines. The latter works on Windows.</li> </ul> <h3>NetBSD/x86 limitation</h3> <p>There is no reliable way to switch call stacks on NetBSD/x86, because of a limitation in the implementation of <code>libpthread</code>. Even if your program doesn't use pthreads, it might link with a library that does, such as Glib. The pthread library replaces various C standard library functions with its own versions, and since this includes some extremely common calls such as <code>malloc()</code>, almost nothing will work as a result.</p> <p>The reason is that the NetBSD pthread library uses a silly trick to implement thread-local storage, one which unfortunately assumes that every native thread has exactly one call stack. The trick is to always allocate thread call stacks on multiples of the maximum call stack size. Then, each thread's unique ID can just be the result of masking the call stack pointer by the maximum call stack size.</p> <h3>Windows-specific concerns</h3> <p>So far, I've only worked out the details of switching call stacks on 32-bit Windows. I'll talk about 64-bit Windows in another post.</p> <p>Windows has a mechanism known as <a href="http://msdn.microsoft.com/en-us/library/ms680657(VS.85).aspx">Structured Exception Handling</a>, which is a form of scoped exception handling with kernel support. Windows uses SEH to deliver processor traps to user-space applications (illegal memory access, division by zero, etc). Some C++ compilers also use SEH to implement C++ exceptions.</p> <p>On Windows, simply changing the <code>ESP</code> register to point at another memory region is insufficient because structured exception handling must know the current call stack's beginning and end, as well as a pointer to the innermost exception handler record.</p> <p>These three pieces of information are stored in a per-thread information block, known as the TIB.
The fiber switching APIs update the TIB for you, which is why Microsoft recommends their use.</p> <h4>The TIB</h4> <p>The TIB is defined by the <code>NT_TIB</code> struct in <code>winnt.h</code>:</p> <pre>typedef struct _NT_TIB {<br /> struct _EXCEPTION_REGISTRATION_RECORD *ExceptionList;<br /> PVOID StackBase;<br /> PVOID StackLimit;<br /> PVOID SubSystemTib;<br /> _ANONYMOUS_UNION union {<br /> PVOID FiberData;<br /> DWORD Version;<br /> } DUMMYUNIONNAME;<br /> PVOID ArbitraryUserPointer;<br /> struct _NT_TIB *Self;<br />} NT_TIB,*PNT_TIB;</pre> <p>The relevant fields that must be updated when switching call stacks are <code>ExceptionList</code>, <code>StackBase</code> and <code>StackLimit</code>.</p> Note that since the x86 call stack grows down, <code>StackLimit</code> is the start of the call stack's memory region and <code>StackBase</code> is the end.</p> <h4>Structured exception handlers</h4> <p>Structured exception handlers are stored in a linked list on the call stack, with the head of the list pointed at by the <code>ExceptionList</code> field of the TIB:</p> <pre>struct _EXCEPTION_REGISTRATION_RECORD<br />{<br /> PEXCEPTION_REGISTRATION_RECORD Next;<br /> PEXCEPTION_DISPOSITION Handler;<br />} EXCEPTION_REGISTRATION_RECORD, *PEXCEPTION_REGISTRATION_RECORD;</pre> <p>The exception list should be saved and restored when switching between call stacks, and a new call stack should begin with an empty list (<code>ExceptionList</code> set to <code>NULL</code>).</p> <p>If the <code>ExceptionList</code> field of the TIB or the <code>Next</code> field of an exception record point outside the call stack, then the handler in question will not be called at all.</p> <h4>Accessing the TIB</h4> <p>On x86-32, the current thread's TIB is stored starting at address 0 in the segment pointed at by the FS segment register. The <code>Self</code> field always points at the struct itself. Since Windows uses flat addressing, the segment storing the TIB begins somewhere in the linear 32-bit address space, so an assembly routine to fetch the TIB and return a pointer to it in EAX looks like this:</p> <pre>fetch_tib:<br /> mov eax,[fs:24]<br /> ret</pre> <p>This assembly routine can then be called from C code, which can manipulate the TIB like any other structure.</p><h4>Sample code for updating Windows-specific structures</h4><p>The <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/x86/32/winnt/bootstrap.factor;hb=HEAD">basis/cpu/x86/32/winnt/bootstrap.factor</a> file defines assembly subroutines for updating the TIB when switching call stacks. These routines are invoked by the x86-32 non-optimizing compiler backend:</p> <ul> <li><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/x86/bootstrap.factor;hb=HEAD">basis/cpu/x86/bootstrap.factor</a></li> <li><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/x86/32/bootstrap.factor;hb=HEAD">basis/cpu/x86/32/bootstrap.factor</a></li></ul>Slava Pestovhttp://www.blogger.com/profile/02768382790667979877[email protected]2tag:blogger.com,1999:blog-17087850.post-65847845443026505302010-03-23T04:29:00.004-04:002010-03-23T04:41:46.088-04:00Incompatible change in Mach exception handler behavior on Mac OS X 10.6<p>On Mac OS X, Factor uses Mach exceptions rather than Unix signals to receive notifications of illegal memory accesses and arithmetic exceptions from the operating system. This is used to catch datastack underflows, division by zero, and the like. 
The code implementing this can be found in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/mach_signal.cpp;hb=HEAD">vm/mach_signal.c</a> in the Factor repository. This file is based on code from the <a href="http://libsigsegv.sourceforge.net/">GNU libsigsegv</a> project, with <a href="http://sourceforge.net/mailarchive/message.php?msg_name=200503102200.32002.bruno%40clisp.org">special permission</a> to use it under a BSD license instead of the restrictive GPL.</p><p>It seems that as of Mac OS X 10.6, exceptions raised by child processes are now reported to the parent if the parent has an exception handler thread. This caused a problem in Factor if a child process crashed with an access violation; Factor would think the access violation occurred in Factor code, and die with an assertion failure when attempting to map the thread ID back into a Factor VM object. It seems the simplest way is to add the following clause to the <code>catch_exception_raise()</code> function:</p><pre>if(task != mach_task_self()) return KERN_FAILURE;</pre><p>This fix should be applicable to the original <code>libsigsegv</code> code as well, along with other projects, such as CLisp, that use this library.</p>Slava Pestovhttp://www.blogger.com/profile/02768382790667979877[email protected]2tag:blogger.com,1999:blog-17087850.post-56496260094830504152010-03-11T03:16:00.005-05:002010-03-11T03:55:58.344-05:00Adding a recaptcha to the Factor pastebin<p>Lately the <a href="http://paste.factorcode.org">Factor pastebin</a> has been flooded with penis enlargement spam. The pastebin had its own very weak captcha that worked well enough for a while: there was a form field that was required to be left blank, and if it was not blank validation would fail. However, spammers have started working around it so we needed something stronger. I decided to solve the problem once and for all by integrating <a href="http://code-factor.blogspot.com">Doug Coleman</a>'s <a href="http://docs.factorcode.org/content/article-furnace.recaptcha.html">furnace.recaptcha</a> vocabulary into the pastebin. Doing this was surprisingly easy.</p><p>First, I changed the <code>USING:</code> form in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=extra/webapps/pastebin/pastebin.factor;hb=HEAD">webapps.pastebin</a> to reference the <code>furnace.recaptcha</code> vocabulary:</p><pre>USING: namespaces assocs sorting sequences kernel accessors<br />hashtables db.types db.tuples db combinators<br />calendar calendar.format math.parser math.order syndication urls<br />xml.writer xmode.catalog validators<br />html.forms<br />html.components<br />html.templates.chloe<br />http.server<br />http.server.dispatchers<br />http.server.redirection<br />http.server.responses<br />furnace<br />furnace.actions<br />furnace.redirection<br />furnace.auth<br />furnace.auth.login<br />furnace.boilerplate<br />furnace.recaptcha<br />furnace.syndication<br />furnace.conversations ;</pre><p>Next, I changed the <code>validate-entity</code> word. This word is used to validate a form submission when adding a new paste or a new annotation. 
Instead of validating a captcha using the old simple scheme, it now calls <code>validate-recaptcha</code>:</p><pre>: validate-entity ( -- )<br /> {<br /> { "summary" [ v-one-line ] }<br /> { "author" [ v-one-line ] }<br /> { "mode" [ v-mode ] }<br /> { "contents" [ v-required ] }<br /> } validate-params<br /> validate-recaptcha ;</pre><p>Finally, I edited the <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=extra/webapps/pastebin/new-paste.xml;hb=HEAD">new-paste.xml</a> and <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=extra/webapps/pastebin/new-annotation.xml;hb=HEAD">new-annotation.xml</a> templates to add a recaptcha component inside the <code>t:form</code> tag:</p><pre><tr><td colspan="2"><t:recaptcha /></td></tr></pre><p>The implementation of the <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/furnace/recaptcha/recaptcha.factor;hb=HEAD">furnace.recaptcha</a> vocabulary is very straightforward and takes advantage of several features of Factor's web framework. It uses <a href="http://docs.factorcode.org/content/article-http.client.html">http.client</a> to communicate with the recaptcha server, and <a href="http://docs.factorcode.org/content/article-furnace.conversations.html">furnace.conversations</a> to pass the validation error between requests. Finally, it uses the <code>CHLOE:</code> parsing word to define the <code>t:recaptcha</code> tag for use in templates:</p><pre>CHLOE: recaptcha drop [ render-recaptcha ] [xml-code] ;</pre>Slava Pestovhttp://www.blogger.com/profile/02768382790667979877[email protected]0tag:blogger.com,1999:blog-17087850.post-6462412966115457192010-02-24T02:35:00.006-05:002010-02-24T03:27:29.644-05:00Final classes and platform-specific vocabularies<h3>Final classes</h3> <p>I added final classes, in the Java sense. Attempting to inherit from a final class raises an error. To declare a class as final, suffix the definition with "final", in the same manner that words can be declared "inline" or "foldable":</p> <pre>TUPLE: point x y z ; final</pre> <p>The motivation for final classes was not obvious. There are three main reasons I added this feature.</p> <p><a href="http://factor-language.blogspot.com/2009/05/unboxed-tuple-arrays-in-factor.html">Unboxed tuple arrays</a> used to have the caveat that if you store an instance of a subclass into a tuple array of a superclass, then the slots of the subclass would be "sliced off":</p> <pre>TUPLE: point-2d x y ;<br /><br />TUPLE-ARRAY: point-2d<br /><br />TUPLE: point-3d < point-2d z ;<br /><br />SYMBOL: my-array<br /><br />1 <point-2d> my-array set<br /><br />1 2 3 point-3d boa my-array get set-first<br /><br />my-array get first .<br />=> T{ point-2d { x 1 } { y 2 } }</pre> <p>This warranted a paragraph in the documentation, and vigilance on the part of the programmer. Now, tuple arrays simply enforce that the element class is final, and if it is not, an error is raised. This removes a potential source of confusion; it is always nice when written warnings in the documentation can be replaced by language features.</p> <p>Joe Groff's <a href="http://docs.factorcode.org/content/article-typed.html">typed</a> library has a similar problem. This library has a feature where input and output parameters which are read-only tuples are passed by value to improve performance. This could cause the same "slicing problem" as above. 
Now, <code>typed</code> only passes final read-only tuples by value.</p> <p>Finally, there was a previous mechanism for prohibiting subclassing, but it wasn't exposed as part of the syntax. It was used by the implementation of <a href="http://docs.factorcode.org/content/article-classes.struct.html">struct classes</a> to prevent subclassing of structs. The struct class implementation now simply declares struct classes as final.</p> <h3>Typed words can now be declared foldable and flushable</h3> <p>Factor has a pretty unique feature; the user can declare words <a href="http://docs.factorcode.org/content/word-foldable%2Csyntax.html">foldable</a> (which makes them eligible for constant folding at compile time if the inputs are all literal) or <a href="http://docs.factorcode.org/content/word-flushable%2Csyntax.html">flushable</a> (which makes them eligible for dead code elimination if the output results are not used). These declarations now work with typed words.</p> <h3>Platform-specific vocabularies</h3> <p>I added a facility to declare what operating systems a vocabulary runs on. Loading a vocabulary on an unsupported platform raises an error, with a restart if you know what you're doing. The <code>load-all</code> word skips vocabularies which are not supported by the current platform.</p> <p>If a <code>platforms.txt</code> file exists in the vocabulary's directory, this file is interpreted as a newline-separated list of operating system names from <a href="http://docs.factorcode.org/content/vocab-system.html">the system vocabulary</a>. This complements the existing <a href="http://docs.factorcode.org/content/article-vocabs.metadata.html">vocabulary metadata</a> for authors, tags, and summary.</p> <p>This feature helps the build farm avoid loading code for the wrong platform. It used to be that vocabularies with the "unportable" tag set would be skipped by <code>load-all</code>, however this was too coarse-grained. For example, both the curses and DirectX bindings were tagged as unportable, and so the build farm was not loading or testing them on any platform. However, curses is available on Unix systems, and DirectX is available on Windows systems. With the new approach, there is a <code>extra/curses/platforms.txt</code> file listing <code>unix</code> as a supported platform, and the various DirectX vocabulary directories have <code>platforms.txt</code> files listing <code>windows</code>.</p>Slava Pestovhttp://www.blogger.com/profile/02768382790667979877[email protected]2tag:blogger.com,1999:blog-17087850.post-81095469370233913452010-02-16T03:47:00.003-05:002010-02-16T05:49:10.945-05:00Factor 0.92 now available<p>I'm proud to announce the release of Factor 0.92. This release comes two years after the last release, <a href="http://factor-language.blogspot.com/2007/12/factor-091-now-available.html">0.91</a>. Proceed to <a href="http://factorcode.org/">the Factor website</a> or directly to <a href="http://downloads.factorcode.org/releases/0.92/">downloads directory</a> to obtain a copy.</p> <p>More than 30 developers have contributed code to this release. I would like to thank all of the users and contributors for their efforts; Factor would not be at the stage it is today without your help.</p> <p>Since there have been so many changes since the last release, the below changelog is only an incomplete summary. 
In particular, there have been many incompatible syntax and core library changes since the last release.</p> <p>The next release cycle will not last as long as this one; from this point on, I'm planning on having monthly releases or so. There will be fewer incompatible language changes in the upcoming months; the core language has almost been finalized at this point.</p> <h3>Language improvements</h3> <ul> <li><a href="http://docs.factorcode.org/content/article-inference.html">Static stack effects enforced for all words</a> (Slava Pestov, Daniel Ehrenberg)</li> <li><a href="http://docs.factorcode.org/content/article-objects.html">New object system</a>: generic accessors, inheritance replaces delegation, type declarations on slots, read only slots, singletons (Slava Pestov)</li> <li><a href="http://docs.factorcode.org/content/article-dataflow-combinators.html">Data flow combinators</a> (Eduardo Cavazos)</li> <li><a href="http://docs.factorcode.org/content/article-fry.html">Partial application syntax</a> (Eduardo Cavazos)</li> <li>Integers-as-sequences is no longer supported; use the <a href="http://docs.factorcode.org/content/article-sequences-integers.html">iota</a> virtual sequence instead (Doug Coleman, Slava Pestov)</li> </ul> <h3>Notable new libraries</h3> <ul> <li><a href="http://docs.factorcode.org/content/article-db.html">db</a>: Relational database access library with SQLite and PostgreSQL backends (Doug Coleman)</li> <li><a href="http://docs.factorcode.org/content/article-furnace.html">furnace</a>: web framework that runs <a href="http://concatenative.org">concatenative.org</a> (Slava Pestov)</li> <li><a href="http://docs.factorcode.org/content/vocab-io.encodings.html">io.encodings</a>: Extensive support for I/O encodings. All operations that transform strings to bytes and vice versa now take an encoding parameter (Daniel Ehrenberg)</li> <li><a href="http://docs.factorcode.org/content/article-unicode.html">unicode</a>: Unicode-aware case conversion, character classes, collation, breaking, and normalization (Daniel Ehrenberg)</li> <li><a href="http://docs.factorcode.org/content/article-regexp.html">regexp</a>: DFA-based regular expression matching. 
Supports Unicode and compiles to machine code (Daniel Ehrenberg)</li> <li><a href="http://docs.factorcode.org/content/article-io.sockets.secure.html">io.sockets.secure</a>: secure sockets with OpenSSL (Slava Pestov)</li> <li><a href="http://docs.factorcode.org/content/vocab-io.files.info.html">io.files.info</a>, <a href="http://docs.factorcode.org/content/vocab-io.monitors.html">io.monitors</a>: File system metadata and change monitoring (Doug Coleman, Slava Pestov)</li> <li><a href="http://docs.factorcode.org/content/vocab-images.html">images</a>: loading and displaying BMP, TIFF, PNG, JPEG and GIF images (Doug Coleman, Marc Fauconneau)</li> <li><a href="http://docs.factorcode.org/content/article-gpu-summary.html">gpu</a>, <a href="http://docs.factorcode.org/content/vocab-game.html">game</a>: Frameworks for GPU rendering and game development using OpenGL (Joe Groff)</li> <li><a href="http://factor-language.blogspot.com/2009/09/advanced-floating-point-features.html">math.floats.env</a>: advanced IEEE floating point features (Joe Groff)</li> <li><a href="http://docs.factorcode.org/content/vocab-math.blas.html">math.blas</a>, <a href="http://docs.factorcode.org/content/article-alien.fortran.html">alien.fortran</a>: bindings to BLAS linear algebra library, built on top of a new Fortran FFI</li> <li><a href="http://docs.factorcode.org/content/article-math.vectors.simd.html">math.vectors.simd</a>: high-performance SIMD arithmetic with support for SSE versions up to 4.2 (Joe Groff)</li> <li><a href="http://docs.factorcode.org/content/article-specialized-arrays.html">specialized-arrays</a>, <a href="http://docs.factorcode.org/content/article-specialized-vectors.html">specialized-vectors</a>, <a href="http://docs.factorcode.org/content/article-classes.struct.html">classes.struct</a>: high-performance support for unboxed value types (Slava Pestov, Joe Groff)</li> </ul> <h3>Implementation improvements</h3> <ul> <li>Thanks to the continuous integration system, binary packages are now available for 13 platforms (Eduardo Cavazos, Slava Pestov)</li> <li>The compiler has been rewritten to improve compile time, performance of generated code, and correctness (<a href="http://factor-language.blogspot.com/2008/08/new-optimizer.html">1</a>, <a href="http://factor-language.blogspot.com/2008/11/new-low-level-optimizer-and-code.html">2</a>, <a href="http://factor-language.blogspot.com/2009/07/improvements-to-factors-register.html">3</a>, <a href="http://factor-language.blogspot.com/2009/07/improved-value-numbering-branch.html">4</a>, <a href="http://factor-language.blogspot.com/2009/07/dataflow-analysis-computing-dominance.html">5</a>, <a href="http://factor-language.blogspot.com/2009/08/global-float-unboxing-and-some-other.html">6</a>) (Slava Pestov)</li> <li><a href="http://factor-language.blogspot.com/2009/05/factors-implementation-of-polymorphic.html">Polymorphic inline caching</a> (Slava Pestov)</li> <li>The garbage collector has been rewritten. 
<a href="http://factor-language.blogspot.com/2009/10/improved-write-barriers-in-factors.html">A more precise write barrier speeds up minor collections</a>, and <a href="http://factor-language.blogspot.com/2009/11/mark-compact-garbage-collection-for.html">a mark-sweep-compact algorithm is now used for full collections</a> (Slava Pestov)</li> <li>The Factor UI now supports Unicode font rendering, and the UI developer tools have all seen significant improvements (Slava Pestov)</li> </ul> <h3>Editor integration</h3> <ul> <li><a href="http://factor-language.blogspot.com/2009/01/screencast-editing-factor-code-with.html">fuel</a>: Factor's Ultimate Emacs Library, a powerful emacs mode for editing Factor code. See <code>misc/fuel/README</code> in the Factor download for details (Jose A Ortega)</li> </ul>Slava Pestovhttp://www.blogger.com/profile/02768382790667979877[email protected]1tag:blogger.com,1999:blog-17087850.post-19546665207364344242010-01-25T14:44:00.013-05:002010-01-26T01:08:54.834-05:00Replacing GNU assembler with Factor codeLately a major goal of mine has been to get the Factor VM to be as close as possible to standard C++, and to compile it with at least one non-gcc compiler, namely Microsoft's Windows SDK.<br /><br />I've already eliminated usages of gcc-specific features from the C++ source (register variables mostly) and <a href="http://factor-language.blogspot.com/2009/12/freeing-factor-from-gccs-embrace-and.html">blogged about it</a>.<br /><br />The remaining hurdle was that several of the VM's low-level subroutines were written directly in GNU assembler, and not C++. Microsoft's toolchain uses a different assembler syntax, and I don't fancy writing the same routines twice, especially since the code in question is already x86-specific. Instead, I've decided to eliminate assembly code from the VM altogether. Factor's compiler infrastructure has perfectly good assembler DSLs for x86 and PowerPC written in Factor itself already.<br /><br />Essentially, I rewrite the GNU assembler source files in Factor itself. The individual assembler routines have been replaced by new functionality added to both the non-optimizing and optimizing compiler backends. This avoids the issue of the GNU assembler -vs- Microsoft assembler syntax entirely. Now all assembly in the implementation is consistently written in the same postfix Factor assembler syntax.<br /><br />The non-optimizing compiler already supports "sub-primitives", words whose definition consists of assembler opcodes, inlined at each call site. I added new sub-primitives to these files to replace some of the VM's assembly routines:<br /><ul> <li><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/x86/bootstrap.factor;hb=HEAD">basis/cpu/x86/bootstrap.factor</a></li> <li><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/x86/32/bootstrap.factor;hb=HEAD">basis/cpu/x86/32/bootstrap.factor</a></li> <li><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/x86/64/bootstrap.factor;hb=HEAD">basis/cpu/x86/64/bootstrap.factor</a></li> <li><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/ppc/bootstrap.factor;hb=HEAD">basis/cpu/ppc/bootstrap.factor</a></li> </ul><br /><br />A few entry points are now generated by the optimizing compiler, too. The optimizing compiler has complicated machinery for generating arbitrary machine code. 
I extended this with a Factor language feature similar to C's inline assembly, where Factor's assembler DSL is used to generate arbitrary assembly within a word's compiled definition. This is more flexible than the non-optimizing compiler's sub-primitives, and has wide applications beyond the immediate task of replacing a few GNU assembler routines with Factor code.<br /><br /><h3>Factor platform ABI</h3><br />Before jumping into a discussion of the various assembly routines in the VM, it is important to understand the Factor calling convention first, and how it differs from C.<br /><br />Factor's machine language calling convention in a nutshell:<br /><ul><br /><li>Two registers are reserved, for the data and retain stacks, respectively.</li><br /><li>Both registers point into an array of tagged pointers. Quotations and words pass and receive parameters as tagged pointers on the data stack.</li><br /><li>On PowerPC and x86-64, an additional register is reserved for the current VM instance.</li><br /><li>The call stack register (ESP on x86, r1 on PowerPC) is used like in C, with call frames stored in a contiguous fashion.</li><br /><li>Call frames must have a bit of metadata so that the garbage collector can mark code blocks that are referenced via return addresses. This ensures that currently-executing code is not deallocated, even if no other references to it remain.</li><br /><li>This GC metadata consists of three things: the stack frame's size in bytes, a pointer to the start of a compiled code block, and a return address inside that code block. Since every frame records its size and the next frame immediately follows, the garbage collector can trace and update all return addresses accurately.</li><br /><li>A call frame can have arbitrary size, but the garbage collector does not inspect the additional payload; it can be any blob of binary data at all. The optimizing compiler generates large call frames in a handful of rare situations when a scratch area is needed for raw data.</li><br /><li>Quotations must be called with a pointer to the quotation object in a distinguished register (even on x86-32, where the C ABI does not use registers at all). Remaining registers do not have to be preserved, and can be used for any purpose in the compiled code.</li><br /><li>Tail calls to compiled words must load the program counter into a special register (EBX on x86-32). This allows polymorphic inline caches at tail call sites to patch the call address if the cache misses. A non-tail call PIC can look at the return address on the call stack, but for a space-saving tail-call, this is not available, so to make inline caching work in all cases, tail calls have to pass this address directly. The only compiled blocks that read the value of this register on entry are tail call PIC miss stubs.</li><br /><li>All other calls to compiled words are made without any registers having defined contents at all, so effectively all registers that are not reserved for a specific purpose are volatile.</li><br /><li>The call stack pointer must be suitably aligned so that SIMD code can spill vector data to the call frame. 
This is already the case in the C ABI on all platforms except non-Mac 32-bit x86.</li></ul><br /><br /><h3>Replacing the <code>c_to_factor()</code> entry point</h3><br />There are two situations where C code needs to jump into Factor; when Factor is starting up, and when C functions invoke function pointers generated by <code>alien-callback</code>.<br /><br />There used to be a <code>c_to_factor()</code> function defined in a GNU assembly source file as part of the VM, that would take care of translating from the C ABI to the Factor ABI. C++ code can call assembly routines that obey the C calling convention directly.<br /><br />Now that the special assembly entry point is gone from the VM, a valid question to ask is how is it even possible to switch ABIs and jump out into Factor-land, without stepping outside the bounds of C++, by writing some inline assembler at least. It seems like an impossible dilemma.<br /><br />The Factor VM ensures that the transition stub that C code uses to call into Factor is generated seemingly out of thin air. It turns out the only unportable C++ feature you really need when bootstrapping a JIT is the ability to cast a data pointer into a function pointer, and call the result.<br /><br />The new <code>factor_vm::c_to_factor()</code> method, called on VM startup, looks for a function pointer in a member variable named <code>factor_vm::c_to_factor_func</code>. Initially, the value is NULL, and if this is the case, it dynamically generates the entry point and then calls the brand-new function pointer:<br /><pre>void factor_vm::c_to_factor(cell quot)<br />{<br /> /* First time this is called, wrap the c-to-factor sub-primitive inside<br /> of a callback stub, which saves and restores non-volatile registers<br /> as per platform ABI conventions, so that the Factor compiler can treat<br /> all registers as volatile */<br /> if(!c_to_factor_func)<br /> {<br /> tagged<word> c_to_factor_word(special_objects[C_TO_FACTOR_WORD]);<br /> code_block *c_to_factor_block = callbacks->add(c_to_factor_word.value(),0);<br /> c_to_factor_func = (c_to_factor_func_type)c_to_factor_block->entry_point();<br /> }<br /><br /> c_to_factor_func(quot);<br />}</pre><br />All machine code generated by the Factor compiler is stored in the code heap, where blocks of code can move. But <code>c_to_factor()</code> needs a stable function pointer to make the initial jump out of C and into Factor. As I briefly mentioned in a blog post about <a href="http://factor-language.blogspot.com/2009/11/mark-compact-garbage-collection-for.html">mark sweep compact garbage collection</a>, Factor has a separate <i>callback heap</i> for allocating unmovable function pointers intended to be passed to C.<br /><br />This callback heap is used for the initial startup entry point too, as well as callbacks generated by <code>alien-callback</code>..<br /><br />As mentioned in the comment, the callback stub now takes care of saving<br />and restoring non-volatile registers, as well as aligning the stack frame. You can see how callback stubs are defined with the Factor assembler by grepping for <code>callback-stub</code> in the non-optimizing compiler backend.<br /><br />The new callback stub covers part of what the old assembly <code>c_to_factor()</code> entry point did. 
The remaining component is calling the quotation itself, and this is now done by a special word <code>c-to-factor</code>.<br /><br />The <code>c-to-factor</code> word loads the data stack and retain stack pointers and jumps to the quotation's compiled definition. Grep for <code>c-to-factor-impl</code> in the non-optimizing compiler backend.<br /><br />In effect, by abusing the non-optimizing compiler's support for "subprimitives", machine code for the C-to-Factor entry point can be generated by the VM itself.<br /><br />Callbacks generated by <code>alien-callback</code> in the optimizing compiler used to contain a call to the <code>c_to_factor()</code> assembly routine. The equivalent machine code is now generated directly by the optimizing compiler.<br /><br /><h3>Replacing the <code>throw_impl()</code> entry point</h3><br />When Factor code throws an error, a continuation is popped off the catch stack, and resumed. When the VM needs to throw an error, it has to go through the same motions, but perform a non-local return to unwind any C++ stack frames first, before it can jump back into Factor and resume another continuation.<br /><br />There used to be an assembly routine named <code>throw_impl()</code> which would take a quotation and a new value for the stack pointer.<br /><br />This is now handled in a similar manner to <code>c_to_factor()</code>. The <code>unwind-native-frames</code> word in <code>kernel.private</code> is another one of those very special sub-primitives that uses the C calling convention for receiving parameters. It reloads the data and retain stack registers, and changes the call stack pointer to the given parameter. The call is coming in from C++ code, and the contents of these registers are not guaranteed since they play no special role in the C ABI. Grep for <code>unwind-native-frames</code> in the non-optimizing compiler backend.<br /><br /><h3>Replacing the <code>lazy_jit_compile_impl()</code> and <code>set_callstack()</code>entry points</h3><br />As I discussed in my most recent blog post, on the <a href="http://factor-language.blogspot.com/2010/01/how-factor-implements-closures.html">implementation of closures in Factor</a>, quotation compilation is deferred, and initially all quotations point to the same shared entry point. This shared entry point used to be an assembly routine in the VM. It is now the <code>lazy-jit-compile</code> sub-primitive.<br /><br />The <code>set-callstack</code> primitive predates the existence of sub-primitives, so it was implemented as an assembly routine in the VM for historical reasons.<br /><br />These two entry points are never called directly from C++ code in the VM, so unlike <code>c_to_factor()</code> and <code>throw_impl()</code>, there is no C++ code to fish out generated code from a special word and turn it into a function pointer.<br /><br /><h3>Inline assembly with the <code>alien-assembly</code> combinator</h3><br />I added a new word, <code>alien-assembly</code>. In the same way as <code>alien-invoke</code>, it generates code which marshals Factor data into C values, and passes them according to the C calling convention; but where <code>alien-invoke</code> would generate a<br />subroutine call, <code>alien-assembly</code> just calls the quotation at<br />compile time instead, no questions asked. 
The quotation can emit<br />any machine code it desires, but the result has to obey the C calling<br />convention.<br /><br />Here is an example: a horribly unportable way to add two floating-point numbers that only works on x86.64:<br /><pre>: add ( a b -- c )<br /> double { double double } "cdecl"<br /> [ XMM0 XMM1 ADDPD ]<br /> alien-assembly ;<br /><br />1.5 2.0 add .<br />3.5</pre><br />The important thing is that unlike assembly code in the VM, using this feature the same assembly code will work regardless of whether the Factor VM was compiled with gcc or the Microsoft toolchain!<br /><br /><h3>Replacing FPU state entry points</h3><br />The remaining VM assembly routines were used to save and restore FPU state used for <a href="http://factor-language.blogspot.com/2009/09/advanced-floating-point-features.html">advanced floating point features</a>. The <code>math.floats.env</code> vocabulary would call these assembly routines as if they were ordinary C functions, using <code>alien-invoke</code>. After the refactoring, the optimizing compiler now generates code for these routines using <code>alien-assembly</code> instead. A hook dispatching on CPU type is used to pick the implementation for the current CPU.<br /><br />See these files for details:<br /><ul> <li><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/math/floats/env/x86/32/32.factor">basis/math/floats/env/x86/32/32.factor</a></li> <li><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/math/floats/env/x86/64/64.factor">basis/math/floats/env/x86/64/64.factor</a></li> </ul><br /><br />On PowerPC, using non-gcc compilers is not a goal, so these routines remain in the VM, and the <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/cpu-ppc.S">vm/cpu-ppc.S</a> source file still exists. It contains these FPU routines along with one other entry point for flushing the instruction cache (this is a no-op on x86).<br /><br /><h3>Compiling Factor with the Windows SDK</h3><br />With this refactoring out of the way, the Factor VM can now be compiled using the Microsoft toolchain, in addition to Cygwin and Mingw. The primary benefit of using the Microsoft toolchain is that it has allowed me to revive the 64-bit Windows support.<br /><br />Last time I managed to <a href="http://factor-language.blogspot.com/2008/11/factor-port-to-win64-in-progress.html">get gcc working successfully on Win64</a>, the Factor VM was written in C. The switch to C++ killed the Win64 port, because the 64-bit Windows Mingw port is a huge pain to install correctly. I never got gcc to produce a working executable after the C++ rewrite of the VM, and as a result we haven't had a binary package on this platform since April 2009.<br /><br />Microsoft's free (as in beer) <a href="http://www.microsoft.com/downloads/details.aspx?FamilyID=c17ba869-9671-4330-a63e-1fd44e0e2505&displaylang=en">Windows SDK</a> includes command-line compilers for x86-32 and x86-64, together with various tools, such as a linker, and <code>nmake</code>, a program similar to Unix <code>make</code>. Factor now ships with an <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=Nmakefile">Nmakefile</a> that uses the SDK tools to build the VM:<br /><pre>nmake /f nmakefile</pre><br /><br />After fixing some minor compile errors, the Windows SDK produced a working Win64 executable. After updating the FFI code a little, I quickly got it passing all of the compiler tests. 
As a result of this work, Factor binaries for 64-bit Windows will be available again soon.Slava Pestovhttp://www.blogger.com/profile/02768382790667979877[email protected]5tag:blogger.com,1999:blog-17087850.post-7277035385176911442010-01-18T02:45:00.009-05:002010-01-18T07:08:34.824-05:00How Factor implements closuresThe recent blog post from the Clozure CL team on <a href="http://ccl.clozure.com/blog/?p=53">Clozure CL's implementation of closures</a> inspired me to do a similar writeup about Factor's implementation. It is often said that "closures are a poor man's objects", or "objects are a poor man's closures". Factor takes the former view, because as you will see they are largely implemented within the object system itself.<br /><br /><h3>Quotations</h3><br />First, let us look at quotations, and what happens internally when a quotation is called. A <a href="http://docs.factorcode.org/content/article-quotations.html">quotation</a> is a sequence of literals and words. Quotations do not close over any lexical environment; they are entirely self-contained, and their evaluation semantics only depend on their elements, not any state from the time they were constructed. So quotations are anonymous functions but not closures.<br /><br />Internally, a quotation is a pair, consisting of an array and a machine code entry point. The array stores the quotation's elements, and when you print a quotation with the prettyprinter, this is how Factor knows what its elements are: <pre>( scratchpad ) [ "out.txt" utf8 [ "Hi" print ] with-file-writer ] .<br />[ "out.txt" utf8 [ "Hi" print ] with-file-writer ]</pre><br />The machine code entry point refers to a code block in the code heap containing the quotation's compiled definition.<br /><br />The <a href="http://docs.factorcode.org/content/word-call%2Ckernel.html">call</a> generic word calls quotations. This word can take any <a href="http://docs.factorcode.org/content/word-callable%2Cquotations.html">callable</a>, not just a quotation, and has a method for each type of callable. The <code>callable</code> class includes quotations, as well as <code>curry</code> and <code>compose</code> which are discussed below. This means that closure invocation is implemented on top of method dispatch in Factor.<br /> <br />For reasons that will become clear in the last section on compiler optimizations, most quotations never have their entry points called directly, and so it would be a waste of time to compile all quotations as they are read by the parser.<br /><br />Instead, all freshly-parsed quotations have their entry points set to the <code>lazy-jit-compile</code> primitive from the <a href="http://docs.factorcode.org/content/vocab-kernel.private.html">kernel.private</a> vocabulary.<br /><br />The <code>call</code> generic word has a method on the quotation class. This method invokes the <a href="http://docs.factorcode.org/content/word-%28call%29%2Ckernel.private.html">(call)</a> primitive from the kernel.private vocabulary. The <code>(call)</code> primitive does not type check, since by the time it is called it is known that the input is in fact a quotation. This primitive has a very simple machine code definition: <pre>mov eax,[esi] ! Pop datastack<br />sub esi,4<br />jmp [eax+13] ! Jump to quotation's entry point</pre>Note that the quotation is stored in the EAX register; this is important. 
Recall that initially, a quotation's entry point is set to the <code>lazy-jit-compile</code> word, and that all quotations initially share this entry point.<br /><br />This word, which is not meant to be invoked directly, compiles quotations the first time they are called. Since all quotations share the same initial entry point, it needs to know which quotation invoked it. This is done by passing the quotation to this word in the EAX register. The <code>lazy-jit-compile</code> word compiles this quotation, sets its entry point to the new compiled code block, and then calls it again. On subsequent calls of the same quotation, the new compiled definition will be jumped to directly, and the <code>lazy-jit-compile</code> entry point is not involved.<br /><br />If you're interested in the definition of <code>lazy-jit-compile</code>, search for it in these files:<ul><li><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/x86/32/bootstrap.factor;hb=HEAD">basis/cpu/x86/32/bootstrap.factor</a></li><li><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/x86/64/bootstrap.factor;hb=HEAD">basis/cpu/x86/64/bootstrap.factor</a></li><li><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/ppc/bootstrap.factor;hb=HEAD">basis/cpu/ppc/bootstrap.factor</a></li></ul><br /><h3>The curry and compose words</h3> <h4>curry</h4><br />The <a href="http://docs.factorcode.org/content/word-curry%2Ckernel.html">curry</a> word is the fundamental constructor for making closures. It takes a value and a <code>callable</code>, and returns something that prints out as if it were a new quotation:<pre>( scratchpad ) 5 [ + 2 * ] curry .<br />[ 5 + 2 * ]</pre><br />This is not a quotation though, but rather an instance of the <code>curry</code> class. Instances of this class are pairs of elements: an object, and another <code>callable</code>.<br /><br />Instances of <code>curry</code> are <code>callable</code>, because the <code>call</code> generic word has a method on <code>curry</code>. This method pushes the first element of the pair on the data stack, then recursively calls <code>call</code> on the second element.<br /> <br />Calls to <code>curry</code> can be chained together: <pre>( scratchpad ) { "a" "b" "c" } 1 2 [ 3array ] curry curry map .<br />{ { "a" 1 2 } { "b" 1 2 } { "c" 1 2 } }</pre><br />Note that using <code>curry</code>, many callables can be constructed which share the same compiled definition; only the data value differs. <br /><br />Technically, <code>curry</code> is just an optimization -- it would be possible to simulate it by using sequence words to construct a new quotation with the desired value prepended, but this would be extremely inefficient. Prepending an element would take O(n) time, and furthermore, result in the new quotation being compiled the first time the result was called. Indeed, in some simpler concatenative languages such as <a href="http://www.latrobe.edu.au/philosophy/phimvt/joy.html">Joy</a>, quotations are just linked lists, and they execute in the interpreter, so partial application can be done by using the <code>cons</code> primitive for creating a new linked list.<br /><br /><h4>compose</h4><br />The <a href="http://docs.factorcode.org/content/word-compose%2Ckernel.html">compose</a> word takes two <code>callable</code>s and returns a new instance of the <code>compose</code> class. Instances of this class are pairs of <code>callable</code>s. 
<pre>( scratchpad ) [ 3 + ] [ sqrt ] compose .<br />[ 3 + sqrt ]</pre><br />As with <code>curry</code>, the <code>call</code> generic word has a method on the <code>compose</code> class. This method recursively applies <code>call</code> to both elements.<br /><br />It is possible to express the effect of <code>compose</code> using <code>curry</code> and <code>dip</code>: <pre>: my-compose ( quot1 quot2 -- compose )<br /> [ [ call ] dip call ] curry curry ; inline</pre><br />The main reason <code>compose</code> exists as a distinct type is to make the result prettyprint better. Were it defined as above, the result would not prettyprint in a nice way: <pre>( scratchpad ) [ 3 + ] [ sqrt ] [ [ call ] dip call ] curry curry .<br />[ [ 3 + ] [ sqrt ] [ call ] dip call ]</pre><br />The <code>curry</code> and <code>compose</code> constructors are sufficient to express all higher-level forms of closures.<br /><br /><h4>An aside: compose and dip</h4><br />It is possible to express <code>compose</code> using <code>curry</code> and <code>dip</code>. Conversely, it is also possible to express <code>dip</code> using <code>compose</code> and <code>curry</code>. <pre>: my-dip ( a quot -- ) swap [ ] curry compose call ;</pre><br />Here is an example of how this works. Suppose we have the following code: <pre>123 321 [ number>string ] my-dip</pre><br />Using the above definition of <code>my-dip</code>, we get <pre>123 321 [ number>string ] swap [ ] curry compose call ! substitute definition of 'my-dip'<br />123 [ number>string ] 321 [ ] curry compose call ! evaluate swap<br />123 [ number>string ] [ 321 ] compose call ! evaluate curry<br />123 [ number>string 321 ] call ! evaluate compose<br />123 number>string 321 ! evaluate call<br />"123" 321 ! evaluate number>string</pre><br /><h3>Fry syntax</h3><br />The <a href="http://docs.factorcode.org/content/article-fry.html">fry syntax</a> provides nicer syntax sugar for more complicated usages of <code>curry</code> and <code>compose</code>. Beginners learning Factor should start with fry syntax, and probably don't need to know about <code>curry</code> and <code>compose</code> at all; but this syntax desugars trivially into <code>curry</code> and <code>compose</code>, as explained in the documentation.<br /><br /><h3>Lexical variables</h3><br />Code written with <a href="http://docs.factorcode.org/content/article-locals.html">the locals vocabulary</a> can create closures by referencing lexical variables from nested quotations. For example, here is a word from the compiler which computes the breadth-first order on a control-flow graph: <pre>:: breadth-first-order ( cfg -- bfo )<br /> <dlist> :> work-list<br /> cfg post-order length <vector> :> accum<br /> cfg entry>> work-list push-front<br /> work-list [<br /> [ accum push ]<br /> [ dom-children work-list push-all-front ] bi<br /> ] slurp-deque<br /> accum ;</pre><br />In the body of the word, three lexical variables are used; the input parameter <code>cfg</code>, and the two local bindings <code>work-list</code> and <code>accum</code>.<br /><br />When the parser reads the above definition, it creates several quotations whose bodies reference local variables, for example, <code>[ accum push ]</code>. Before defining the new word, however, the :: parsing word rewrites this into concatenative code.<br /><br />Suppose we were doing this rewrite by hand. The first step would be to make closed-over variables explicit. 
We can do this by currying the values of <code>accum</code> and <code>work-list</code> onto the two quotations passed to <code>bi</code>: <pre>:: breadth-first-order ( cfg -- bfo )<br /> <dlist> :> work-list<br /> cfg post-order length <vector> :> accum<br /> cfg entry>> work-list push-front<br /> work-list [<br /> accum [ push ] curry<br /> work-list [ [ dom-children ] dip push-all-front ] curry bi<br /> ] slurp-deque<br /> accum ;</pre><br />And now, we can do the same transformation on the quotation passed to <code>slurp-deque</code>: <pre>:: breadth-first-order ( cfg -- bfo )<br /> <dlist> :> work-list<br /> cfg post-order length <vector> :> accum<br /> cfg entry>> work-list push-front<br /> work-list accum work-list [<br /> [ [ push ] curry ] dip<br /> [ [ dom-children ] dip push-all-front ] curry bi<br /> ] curry curry slurp-deque<br /> accum ;</pre><br />Note how three usages of <code>curry</code> have appeared, and all lexical variable usage now only occurs at the same level where the variable is actually defined. The final rewrite into concatenative code is trivial, and involves some tricks with <code>dup</code> and <code>dip</code>.<br /><br />A lexical closure closing over many variables will be rewritten into code which builds a long chain of <code>curry</code> instances, essentially a linked list. This is less efficient than closure representations used in Lisp implementations, where typically all closed-over values are stored in a single array. However in practice this is not usually a problem, because of optimizations outlined below.<br /><h4>Mutable local variables</h4><br />Mutable local variables are denoted by suffixing their name with <code>!</code>. Here is an example of a loop that counts a mutable variable up to 100:<pre>:: counted-loop-test ( -- )<br /> 0 :> i!<br /> 100 [ i 1 + i! ] times ;</pre><br />Factor distinguishes between binding (done with <code>:></code>) and assignment (done with <code>foo!</code> where <code>foo</code> is a local previously declared to be mutable). At a fundamental level, all bindings and closures are immutable; code using mutable locals is rewritten to close over a mutable heap-allocated box instead, and reads and writes involve an extra indirection. The previous example is roughly equivalent to the following, where we use an immutable local variable holding a mutable one-element array:<pre>:: counted-loop-test ( -- )<br /> { 0 } clone :> i<br /> 100 [ i first 1 + i set-first ] times ;</pre><br />For this reason, code that uses mutable locals does not optimize as well, and iterative loops that use mutable local variables can run slower than tail-recursive functions which use immutable local variables. This might be addressed in the future with improved compiler optimizations.<br /><br /><h3>Optimizations</h3> <h4>Quotation inlining</h4><br />If after inlining, the compiler sees that <code>call</code> is applied to a literal quotation, it inlines the quotation's body at the call site. 
This optimization works very well when quotations are used as "downward closures", and this is why most quotations never have their dynamic entry point invoked at all.<br /><br /><h4>Escape analysis</h4><br />Since <code>curry</code> and <code>compose</code> are ordinary tuple classes, any time some code constructs instances of <code>curry</code> and <code>compose</code>, and immediately unpacks them, the compiler's escape analysis pass can eliminate the allocations entirely.<br /> <br />The escape analysis pass has no special knowledge of <code>curry</code> and <code>compose</code>; it applies an optimization intended for object-oriented code.<br /> <br />Again, this helps optimize code with "downward closures", with the result that most usages of <code>curry</code> and <code>compose</code> will never allocate any memory at run time.<br /><br />For more details, see my blog post about <a href="http://factor-language.blogspot.com/2008/08/algorithm-for-escape-analysis.html">Factor's escape analysis algorithm</a>.Slava Pestovhttp://www.blogger.com/profile/02768382790667979877[email protected]2tag:blogger.com,1999:blog-17087850.post-6573779102238107922010-01-10T09:34:00.001-05:002010-01-10T09:35:50.280-05:00Factor's bootstrap process explained<h3>Separation of concerns between Factor VM and library code</h3><br />The Factor VM implements an abstract machine consisting of a data heap of objects, a code heap of machine code blocks, and a set of stacks. The VM loads an image file on startup, which becomes the data and code heap. It then begins executing code in the image, by calling a special <i>startup quotation</i>.<br /><br />When new source files are loaded into a running Factor instance by the developer, they are parsed and compiled into a collection of objects -- words, quotations, and other literals, along with executable machine code. The new data and code heaps can then be saved into another image file, for faster loading in the future.<br /><br />Factor's core data structures, object system, and source parser are implemented in Factor and live in the image, so the Factor VM does not have enough machinery to start with an empty data and code heap and parse Factor source files by itself. Instead, the VM needs to start from a data and code heap that already contains enough Factor code to parse source files. This poses a chicken-and-egg problem; how do you build a Factor system from source code? The VM can be compiled with a C++ compiler, but the result is not sufficient by itself.<br /><br />Some image-based language systems cannot generate new images from scratch at all, and the only way to create a new image is to snapshot an existing session. This is the simplest approach but it has serious downsides -- it lacks determinism and reproducibility, and it is difficult to make big changes to the system.<br /><br />While Factor can snapshot the current execution state into an image, it also has a tool to generate a "fresh" image from source. While this tool is written in Factor and runs inside of an existing Factor system, the resulting images depend as little as possible on the particular state of the system that generated them.<br /><br /><h3>Stage 1 bootstrap: creating a boot image</h3><br />The initial data heap comes from a <i>boot image</i>, which is built from an existing Factor system, known as the <i>host</i>. The result is a new boot image which can run in the <i>target</i> system. 
Boot images are created using the <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/bootstrap/image/image.factor;hb=HEAD">bootstrap.image</a> tool, whose main entry point is the <a href="http://docs.factorcode.org/content/word-make-image%2Cbootstrap.image.html">make-image</a> word. This word can be invoked from the listener in a running Factor instance: <pre>"x86.32" make-image</pre><br /><br />The <a href="http://docs.factorcode.org/content/word-make-image%2Cbootstrap.image.html">make-image</a> word parses source files using the host's parser, and the code in those source files forms the target image. This tool can be thought of as a form of cross-compiler, except boot images only contain a data heap, and not a code heap. The code heap is generated on the target by the VM, and later by the target's optimizing compiler in Factor.<br /> <br />The <a href="http://docs.factorcode.org/content/word-make-image%2Cbootstrap.image.html">make-image</a> word runs the source file <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=core/bootstrap/stage1.factor;hb=HEAD">core/bootstrap/stage1.factor</a>, which kicks off the bootstrap process.<br /><br /><h4>Building the embryo for the new image</h4><br />The global namespace is the most important object stored in an image file. The global namespace contains various global variables that are used by the parser, among them the <a href="http://docs.factorcode.org/content/word-dictionary%2Cvocabs.html">dictionary</a>. The dictionary is a hashtable mapping vocabulary names to vocabulary objects. Vocabulary objects contain various bits of state, among them a hashtable mapping word names to word objects. Word objects store their definition as a quotation. The dictionary is how code is represented in memory in Factor; it is built and modified by loading source files from disk.<br /><br />One of the tasks performed by <code>stage1.factor</code> is to read the source file <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=core/bootstrap/primitives.factor;hb=HEAD">core/bootstrap/primitives.factor</a>. This source file creates a minimal global namespace and dictionary that target code can be loaded into. This initial dictionary consists of primitive words corresponding to all primitives implemented in the VM, along with some initial state for the object system, consisting of built-in classes such as <a href="http://docs.factorcode.org/content/word-array%2Carrays.html">array</a>. The code in this file runs in the host, but it constructs objects that will ultimately end up in the boot image of the target.<br /><br />A second piece of code that runs in order to prepare the environment for the target is the CPU-specific backend for the VM's non-optimizing compiler. 
Again, these are source files which run on the host:<br /><ul><br /><li><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/x86/bootstrap.factor;hb=HEAD">basis/cpu/x86/bootstrap.factor</a></li><br /><li><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/x86/32/bootstrap.factor;hb=HEAD">basis/cpu/x86/32/bootstrap.factor</a></li><br /><li><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/x86/64/bootstrap.factor;hb=HEAD">basis/cpu/x86/64/bootstrap.factor</a></li><br /><li><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/ppc/bootstrap.factor;hb=HEAD">basis/cpu/ppc/bootstrap.factor</a></li><br /></ul><br />The non-optimizing compiler does little more than glue chunks of machine code together, so the backends are relatively simple and consist of several dozen short machine code definitions. These machine code chunks are stored as byte arrays, constructed by Factor's x86 and PowerPC assemblers.<br /><br /><h4>Loading core source files</h4><br />Once the initial global environment consisting of primitives and built-in classes has been prepared, source files comprising the core library are loaded in. From this point on, code read from disk does not run in the host, only in the target. The host's parser is still being used, though.<br /><br />Factor's vocabulary system loads dependencies automatically, so <code>stage1.factor</code> simply calls <a href="http://docs.factorcode.org/content/word-require%2Cvocabs.loader.html">require</a> on a few essential vocabularies which end up pulling in everything in the <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=core;hb=HEAD">core</a> <a href="http://docs.factorcode.org/content/article-vocabs.roots.html">vocabulary root</a>.<br /><br />During normal operation, any source code at the top level of a source file, not in any definition, is run when the source file is loaded. During stage1 bootstrap, top-level forms from source files in <code>core</code> are not run on the host. Instead, they need to be run on the target, when the VM is launched with the new boot image.<br /><br />After loading all source files from <code>core</code>, the startup quotation is constructed. The startup quotation begins by calling top-level forms in core source files in the order in which they were loaded, and then runs <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/bootstrap/stage2.factor;hb=HEAD">basis/bootstrap/stage2.factor</a>.<br /><br /><h4>Serializing the result</h4><br />At this point, stage1 bootstrap has constructed a new global namespace consisting of a dictionary, object system meta-data, and other objects, together with a startup quotation which can kick off the next stage of bootstrap.<br /><br />Data heap objects that the VM needs to know about, such as the global namespace, startup quotation, and non-optimizing compiler definitions, are stored in an array of "special objects". Entries are defined in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/objects.hpp;hb=HEAD">vm/objects.hpp</a>, and in the image file they are stored in the image header.<br /><br />This object graph, rooted at the special objects array, is now serialized to disk into an image file. The bootstrap image generator serializes objects in the same format in which they are stored in the VM's heap, but it does this without dumping VM's memory directly. 
This allows object layouts to be changed relatively easily, by first updating the bootstrap image tool, generating an image with the new layouts, then updating the VM and running the new VM with the new image.<br /><br />The bootstrap image generator also takes care to write the resulting data with the correct cell size and endianness. Along with containing CPU-specific machine code templates for the non-optimizing compiler, this is what makes boot images platform-specific.<br /><br /><h3>Stage 2 bootstrap: fleshing out a development image</h3><br />At this point, the host system has written a boot image file to disk, and the next stage of bootstrap can begin. This stage runs on the target, and is initiated by starting the Factor VM with the new boot image: <pre>./factor -i=boot.x86.32.image</pre> The VM reads the new image into an empty data heap. At this point, it also notices that the boot image does not have a code heap, so it cannot start executing the boot quotation just yet.<br /><br /><h4>Early initialization</h4><br />Boot images have a special flag set in them which kicks off the "early init" process in the VM. This only takes a few seconds, and entails compiling all words in the image with the non-optimizing compiler. Once this is done, the VM can call the startup quotation. Quotations are also compiled by the non-optimizing compiler the first time they are called.<br /><br />This startup quotation was constructed during stage1 bootstrap. It runs top-level forms in core source files, then runs <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/bootstrap/stage2.factor;hb=HEAD">basis/bootstrap/stage2.factor</a>.<br /><br /><h4>Loading major subsystems</h4><br />Whereas the goal of stage1 bootstrap is to generate a minimal image that contains barely enough code to be able to load additional source files, stage2 creates a usable development image containing the optimizing compiler, documentation, UI tools, and everything else that makes it into a recognizable Factor system.<br /><br />The major vocabularies loaded by stage2 bootstrap include: <ul> <li><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/bootstrap/compiler/compiler.factor;hb=HEAD">Optimizing compiler</a> -- after the optimizing compiler has been loaded, all words in the image are compiled again. This is the longest part of stage2 bootstrap, taking several minutes.</li> <li><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/bootstrap/help/help.factor;hb=HEAD">Help system</a></li> <li><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/bootstrap/tools/tools.factor;hb=HEAD">Developer tools</a></li> <li><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/bootstrap/ui/ui.factor;hb=HEAD">UI framework</a></li> <li><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/bootstrap/ui/tools/tools.factor;hb=HEAD">UI tools</a></li> </ul><br /><h4>Finishing up</h4><br />The last step taken by stage2 bootstrap is to install a new startup quotation. 
This startup quotation does the usual command-line processing; if no switches are specified, it starts the UI listener, otherwise it runs a source file or vocabulary given on the command line.<br /><br />Once the new startup quotation has been installed, the current session is saved to a new image file using the <a href="http://docs.factorcode.org/content/word-save-image-and-exit%2Cmemory.html">save-image-and-exit</a> word.Slava Pestovhttp://www.blogger.com/profile/02768382790667979877[email protected]0tag:blogger.com,1999:blog-17087850.post-10721600699464479752009-12-27T06:59:00.001-05:002009-12-27T07:04:36.373-05:00Freeing Factor from gcc's embrace-and-extend C language extensionsI have completed a refactoring of the Factor VM, eliminating usage of gcc-specific language extensions, namely <a href="http://gcc.gnu.org/onlinedocs/gcc/Global-Reg-Vars.html">global register variables</a>, and on x86-32, the <a href="http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html">regparm calling convention</a>. My motivation for this is two-fold.<br /><br />First of all, I'd like to have the option of compiling the Factor VM with compilers other than gcc, such as Visual Studio. While gcc runs on all of Factor's supported platforms and then some, setting up a build environment using the GNU toolchain on Windows takes a bit of work, especially on 64-bit Windows. Visual Studio will provide an easier alternative for people who wish to build Factor from source on that platform. In the future, I'd also like to try using <a href="http://clang.llvm.org/">Clang</a> to build Factor.<br /><br />The second reason is that the global register variable extension is poorly implemented in gcc. Anyone who has followed Factor development for a while will know that gcc bugs come up pretty frequently, and most of these seem to be related to global register variables. This is quite simply one of the less well-tested corners of gcc, and the gcc developers seem to mostly ignore optimizer bugs triggered by this feature.<br /><br />The Factor VM used a pair of register variables to hold data stack and retain stack pointers. These are just ordinary fields in a struct now. Back in the days when Factor was interpreted and the interpreter was part of the VM, a lot of time was spent executing code within the VM itself, and keeping these pointers in registers was important. Nowadays the Factor implementation compiles to machine code even during interactive use, using a pair of compilers called the non-optimizing compiler and optimizing compiler. Code generated by Factor's compilers tends to dominate the performance profile, rather than code in the VM itself. Compiled code can utilize registers in any manner desired, and so it continues to access the data stack and retain stack through registers. To make it work with the modified VM, the compiler generates code for saving and restoring these registers in the VM's <code>context</code> structure before and after calls into the VM.<br /><br />A few functions defined in the VM used gcc's <code>regparm</code> calling convention. Normally, on x86-32, function parameters are always passed in an array on the call stack addressed by the <code>esp</code> register; <code>regparm</code> functions instead pass the first 1, 2 or 3 arguments in registers. Whether or not this results in a performance boost is debatable, but my motivation for using this feature was not performance but perceived simplicity. 
The optimizing compiler would generate calls to these functions, and the machine code generated is a little simpler if it can simply stick parameters in registers instead of storing them on the call stack. Eliminating <code>regparm</code> did not in reality make anything more complex, as only a few locations were affected and the issue was limited to the x86-32 backend only.<br /><br />I'm pretty happy with how the refactoring turned out. It did not seem to affect performance at all, which was not too surprising, since code generated by Factor's optimizing compiler did not change, and the additional overhead surrounding Factor to C calls is lost in the noise.<br /><br />My goal of getting Factor to build with other compilers is not quite achieved yet, however. While the gcc extensions are gone from C++ code, the VM still has some 800 lines of assembly source using GNU assembler syntax, in the files <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/cpu-x86.32.S;hb=HEAD">cpu-x86.32.S</a>, <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/cpu-x86.64.S;hb=HEAD">cpu-x86.64.S</a>, and <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/cpu-ppc.S;hb=HEAD">cpu-ppc.S</a>. This code includes fundamental facilities which cannot be implemented in C++, such as the main C to Factor call gate, the low-level <a href="http://docs.factorcode.org/content/word-set-callstack%2Ckernel.html">set-callstack</a> primitive used by the <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=core/continuations/continuations.factor;hb=HEAD">continuations implementation</a>, and a few other things. The assembly source also has a few auxiliary CPU-dependent functions, for example for saving and restoring <a href="http://factor-language.blogspot.com/2009/09/advanced-floating-point-features.html">FPU flags</a>, and detecting the SSE version on x86.<br /><br />I plan on eliminating the assembly from the VM entirely, by having the Factor compiler generate this code instead. The non-optimizing compiler can generate things such as the C to Factor call gate. For the remaining assembly routines, such as FPU feature access and SSE version detection, I plan on adding an "inline assembly" facility to Factor itself, much like gcc's <code>asm</code> statement. The result will be that the VM will be a pure C++ codebase, and the machine code generation will be entirely offloaded into Factor code, where it belongs. Factor's assembler DSL is much nicer to use than GNU assembler.Slava Pestovhttp://www.blogger.com/profile/02768382790667979877[email protected]5tag:blogger.com,1999:blog-17087850.post-14504016633533613032009-12-06T02:00:00.001-05:002009-12-06T02:27:53.835-05:00Reducing image size by eliminating literal tables from code heap entries<h3>Introduction</h3> The compiled code heap consists of code blocks which reference other code blocks and objects in the data heap. 
For example, consider the following word:<pre>: hello ( -- ) "Hello world" print ;</pre><br />The machine code for this word is the following:<br /><pre>000000010ef55f10: 48b87b3fa40d01000000 mov rax, 0x10da43f7b<br />000000010ef55f1a: 4983c608 add r14, 0x8<br />000000010ef55f1e: 498906 mov [r14], rax<br />000000010ef55f21: 48bb305ff50e01000000 mov rbx, 0x10ef55f30 (hello+0x20)<br />000000010ef55f2b: e9e0ffabff jmp 0x10ea15f10 (print)<br /></pre><br />The immediate operand of the first <code>mov</code> instruction (<code>0x10da43f7b</code>) is the address of the string <code>"Hello world"</code> in the data heap. The immediate operand of the last <code>jmp</code> instruction (<code>0x10ea15f10</code>) is the address of the machine code of the <code>print</code> word in the code heap.<br />Unlike some dynamic language JITs where all references to data and compiled code from machine code are done via indirection tables, Factor embeds the actual addresses of the data in the code. This means that the garbage collector needs to be able to find all pointers in a code heap block (for the "sweep" phase of garbage collection), and update them (for the "compact" phase). <br /><h3>Relocation tables</h3> Associated to each code block is a relocation table, which tells the VM what instruction operands contain special values that it must be aware of. The relocation table is a list of triples, packed into a byte array: <ul> <li>The <i>relocation type</i> is an instance of the <code>relocation_type</code> enum in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/instruction_operands.hpp;hb=HEAD">instruction_operands.hpp</a>. This value tells the VM what kind of value to deposit in the operand -- possibilities include a data heap pointer, the machine code of a word, and so on.</li> <li>The <i>relocation class</i> is an instance of the <code>relocation_class</code> enum in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/instruction_operands.hpp;hb=HEAD">instruction_operands.hpp</a>. This value tells the VM how the operand is represented -- the instruction format, whether or not it is a relative address, and such.</li> <li>The <i>relocation offset</i> is a byte offset from the start of the code block where the value is to be stored.</li> </ul> Code that needs to inspect relocation table entries uses the <code>each_instruction_operand()</code> method defined in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/code_blocks.hpp;hb=HEAD">code_block.hpp</a>. This is a template method which can accept any object overloading <code>operator()</code>.<br /><h3>Literal tables</h3> The next part is what I just finished refactoring. I'll describe the old approach first. The simplest way, and what Factor used until now, is the following. Relocation table entries that expect a parameter, such as those that deposit addresses from the data heap and code heap, take the parameter from a literal table associated to each code block. When the compiler compiles a word, it spits out some machine code and a literal table. It hands these off to the Factor VM. The "sweep" phase of the garbage collector traces each code block's literal table, and the "compact" phase, after updating the literal table, stores the operands back in each instruction referenced from the relocation table.<br /><h3>Eliminating literal tables</h3> The problem with the old approach is that the garbage collector doesn't really need the literal table. 
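Each relocation entry pins down exactly where an instruction operand lives in a code block and how it is encoded, so walking the relocation table is enough to visit every embedded value. Here is a simplified sketch of such a walk; the flattened entry type and the function name are hypothetical stand-ins, not the VM's actual <code>each_instruction_operand()</code> API: <pre>#include <cstddef>
#include <cstdint>

/* Hypothetical flattened relocation entry -- the real VM packs these
   triples into a byte array. */
struct relocation_entry {
    int type;              /* what kind of value the operand holds */
    int klass;             /* how the operand is encoded in the instruction */
    std::uint32_t offset;  /* byte offset of the operand from the code block start */
};

/* Visit every instruction operand of a code block. Because each entry
   records where an operand lives and how it is encoded, the current value
   (a data heap pointer, a code heap address, ...) can be read straight
   out of the machine code itself. */
template <typename Visitor>
void each_operand(unsigned char *code_start,
                  const relocation_entry *entries, std::size_t count,
                  Visitor &visitor)
{
    for (std::size_t i = 0; i < count; i++)
        visitor(entries[i], code_start + entries[i].offset);
}</pre>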
The address of each data heap object and code block referenced from machine code is already stored in the machine code itself. Indeed, the only thing missing until now was a way to read instruction operands out of instructions. With this in place, code blocks no longer had to hold on to the literal table after being constructed. Each code block's literal table is only used to deposit the initial values into machine code when a code block is first compiled by the compiler. Subsequently, the literal table becomes garbage and is collected by the garbage collector. When tracing code blocks, the garbage collector traverses the instruction operands themselves, using the relocation table alone. In addition to the space savings gained by not keeping these arrays of literals around, another interesting side-effect of this refactoring is that a full garbage collection no longer resets generic word call sites back to the cold call entry point, which would effectively discard all inline caches (read about <a href="http://factor-language.blogspot.com/2009/05/factors-implementation-of-polymorphic.html">inline caching in Factor</a>).<br /><h3>Coping with code redefinition</h3> A call to a word is compiled as a direct jump in Factor. This means that if a word is redefined and recompiled, existing call sites need to be updated to point to the new definition. The implementation of this is slightly more subtle now that literal tables are gone. Every code block in the code heap has a reference to an <code>owner</code> object in its header (see the <code>code_block</code> struct in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/code_blocks.cpp;hb=HEAD">code_blocks.cpp</a>). The owner is either a word or a quotation. Words and quotations, in turn, have a field which references a compiled code heap block. The latter always points at the most recent compiled definition of that object (note that quotations are only ever compiled once, because they are immutable. Words, however, can be redefined by reloading source files). At the end of each compilation unit, the code heap is traversed, and each code block is updated in turn. The code block's relocation table is consulted, and instruction operands which reference compiled code heap blocks are updated. Before this would be done by overwriting all call targets from the literal table. Now, this is accomplished by looking at the owner object of the current target, and then storing the owner's most recent code block back in the instruction operand. This is implemented in the <code>update_word_references()</code> method defined in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/code_blocks.cpp;hb=HEAD">code_blocks.cpp</a>. In addition to helping with redefinition, the owner object reference is used to construct call stack traces.<br /><h3>Additional space savings in deployed binaries</h3> Normally, every compiled code block references its owner object, so that code redefinition can work. This means that if a word or quotation is live, then the code block corresponding to its most recent definition will be live, and vice versa. In deployed images where the compiler and debugger have been stripped out, words cannot be redefined and stack traces are not needed, so the owner field can be stripped out. This means that a word object can get garbage collected at deploy time even if its compiled code block is called. As it turns out, most words are never used as objects, and can be stripped out in this manner. 
So the literal table removal has an even bigger effect in deployed images than in development images.<br /><h3>Size comparison</h3> The following table shows image sizes before and after this change on Mac OS X x86-64. <table> <tr><th>Image</th><th>Before</th><th>After</th></tr> <tr><td>Development image</td><td>92 MB</td><td>90 MB</td></tr> <tr><td>Minimal development image</td><td>8.6 MB</td><td>8.2 MB</td></tr> <tr><td>Deployed <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=extra/hello-ui;hb=HEAD">hello-ui</a></td><td>2.3 MB</td><td>1.5 MB</td></tr> <tr><td>Deployed <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=extra/bunny;hb=HEAD">bunny</a></td><td>3.5 MB</td><td>3.1 MB</td></tr> </table>Slava Pestovhttp://www.blogger.com/profile/02768382790667979877[email protected]3tag:blogger.com,1999:blog-17087850.post-85012924109148218732009-11-16T03:24:00.006-05:002009-11-16T13:30:06.482-05:00Mark-compact garbage collection for oldest generation, and other improvementsFactor now uses a mark-sweep-compact garbage collector for the oldest generation (known as tenured space), in place of the copying collector that it used before. This reduces memory usage. The mark-sweep collector used for the code heap has also been improved. It now shares code with the data heap collector and uses the same compaction algorithm. <br /><br /><h3>Mark-sweep-compact garbage collection for tenured space</h3> <h4>Mark phase</h4> During the mark phase, the garbage collector computes the set of reachable objects by starting from the "roots"; global variables, and objects referenced from runtime stacks. When an object is visited, the mark phase checks if the object has already been marked. If it hasn't yet been marked, it is marked, and any objects that it refers to are then also visited in turn. If an object has already been marked, nothing is done. As described above, the algorithm is recursive, which is problematic. There are two approaches to turn it into an iterative algorithm; either objects yet to be visited are pushed on a mark stack, and a top-level loop drains this stack, or a more complicated scheme known as "pointer inversion" is used. I decided against pointer inversion, since the mark stack approach is simpler and I have yet to observe the mark stack grow beyond 128Kb or so anyway. I use an <code>std::vector</code> for the mark stack and it works well enough. The mark stack and the loop that drains it are implemented in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/full_collector.cpp;hb=HEAD">full_collector.cpp</a>. There are several approaches to representing the set of objects which are currently marked, also. The two most common are storing a mark bit for each object in the object's header, and storing mark bits in a bitmap off to the side. I chose the latter approach, since it speeds up the sweep and compact phases of the garbage collector, and doesn't require changing object header layout. Each bit in the mark bitmap corresponds to 16 bytes of heap space. For reasons that will become clear, when an object is marked in the mark bitmap, every bit corresponding to space taken up by the object is marked, not just the first bit. The mark bitmap is implemented in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/mark_bits.hpp;hb=HEAD">mark_bits.hpp</a>. 
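To make the bitmap layout concrete, here is a simplified sketch of a mark bitmap at this 16-bytes-per-bit granularity; it is an illustration only, not the VM's actual <code>mark_bits</code> code: <pre>#include <cstddef>
#include <cstdint>
#include <vector>

/* Simplified mark bitmap: one bit covers 16 bytes of heap space, and
   offsets are byte offsets from the start of tenured space. */
struct mark_bitmap {
    static const std::size_t granularity = 16;
    std::vector<std::uint64_t> bits;

    explicit mark_bitmap(std::size_t heap_size)
        : bits(heap_size / granularity / 64 + 1, 0) {}

    void set(std::size_t bit)
    {
        bits[bit / 64] |= std::uint64_t(1) << (bit % 64);
    }

    bool marked(std::size_t bit) const
    {
        return (bits[bit / 64] >> (bit % 64)) & 1;
    }

    /* Mark every bit covered by the object's storage, not just the first
       one, so that later phases can treat runs of set bits as allocated
       space and runs of clear bits as free space. */
    void mark_object(std::size_t offset, std::size_t size)
    {
        std::size_t first = offset / granularity;
        std::size_t last = (offset + size + granularity - 1) / granularity;
        for (std::size_t b = first; b < last; b++)
            set(b);
    }
};</pre> Marking the full extent of each object is what lets the sweep and compact phases below work on runs of bits in the bitmap alone, without touching the heap. 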
<h4>Sweep phase</h4> Once the mark phase is complete, the mark bitmap now has an up-to-date picture of what regions in the heap correspond to reachable objects, and which are free. The sweep phase begins by clearing the free list, and then computes a new one by traversing the mark bitmap. For every contiguous range of clear bits, a new entry is added to the free list. This is why I use a mark bitmap instead of mark bits in object headers; if I had used mark bits in headers, then the sweep phase would need to traverse the entire heap, not just the mark bitmap, which has significantly worse locality. The sweep phase is implemented by the <code>free_list_allocator::sweep()</code> method in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/free_list_allocator.hpp;hb=HEAD">free_list_allocator.hpp</a>. The key to an efficient implementation of the sweep algorithm is the <code>log2()</code> function in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/bitwise_hacks.hpp;hb=HEAD">bitwise_hacks.hpp</a>; it is used to find the first set bit in a cell. I use the <code>BSR</code> instruction on x86 and <code>cntlzw</code> on PowerPC. The sweep phase also needs to update the object start offset map used for <a href="http://factor-language.blogspot.com/2009/10/improved-write-barriers-in-factors.html">the card marking write barrier</a>. When collecting a young generation, the garbage collector scans the set of marked cards. It needs to know where the first object in each card is, so that it can properly identify pointers. This information is maintained by the <code>object_start_map</code> class defined in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/object_start_map.cpp;hb=HEAD">object_start_map.cpp</a>. If an object that happens to be the first object in a card was deallocated by the sweep phase, the object start map must be updated to point at a subsequent object in that card. This is done by the <code>object_start_map::update_for_sweep()</code> method. <h4>Compact phase</h4> The compact phase is optional; it only runs if the garbage collector determines that there is too much fragmentation, or if the user explicitly requests it. The compact phase does not require that the sweep phase has been run, only the mark phase. Like the sweep phase, the compact phase relies on the mark bitmap having been computed. Whereas the sweep phase identifies free blocks and adds them to the free list, the compact phase slides allocated space up so that all free space ends up at the end of the heap, in a single block. The compact phase has two steps. The first step computes a forwarding map; a data structure that can tell you the final destination of every heap block. It is easy to see that the final destination of every block can be determined from the number of set bits in the mark bitmap that precede it. Since the forwarding map is consulted frequently -- once for every pointer in every object that is live during compaction -- it is important that lookups are as fast as possible. The forwarding map should also be space-efficient. This completely rules out using a hashtable (with many small objects in the heap, it would grow to be almost as big as the heap itself) or simply scanning the bitmap and counting bits every time (since now compaction will become an <code>O(n^2)</code> algorithm). 
The correct solution to the forwarding map problem is simple, and well-known in language implementation circles, but I wasn't aware of it until I studied the <a href="http://ccl.clozure.com/manual/chapter16.4.html">Clozure Common Lisp garbage collector</a>. You count the bits set in every group of 32 (or 64) bits in the mark bitmap, building an array of cumulative sums as you go. Then, to count the number of bits that are set up to a given element, you look up the pre-computed population count for the nearest 32 (or 64) bit boundary, and manually compute the population count for the rest. This gives you a forwarding map with <code>O(1)</code> lookup time (a rough sketch in code appears below). This relies on a fast population count routine; I used the standard CPU-independent technique in the <code>popcount()</code> function of <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/bitwise_hacks.hpp;hb=HEAD">bitwise_hacks.hpp</a>. Nowadays, as of SSE 4.2, x86 CPUs even include a <code>POPCNT</code> instruction, but since compaction spends most of its time in <code>memmove()</code>, I didn't investigate if this would offer a speedup. It would require a CPUID check at runtime and the fallback would still need to be there for pre-Intel i7 CPUs and PowerPC, so it didn't seem worth the extra complexity to me. Once the forwarding map has been computed, objects can be moved up and the pointers that they contain updated in one pass. A mark-compact cycle takes roughly twice as long as a mark-sweep, which is why I elected not to perform a mark-compact on every full collection. Compacting on every full collection would lead to a simpler implementation (no sweep phase, and no free list; allocation in tenured space is done by bumping a pointer just as with a copying collector), but the performance penalty didn't seem worth the minor code size reduction to me. <br /><br /><h3>Code heap compaction</h3> Code heap compaction has been in Factor for a while, in the form of the <a href="http://docs.factorcode.org/content/word-save-image-and-exit,memory.html">save-image-and-exit</a> word. This used an inefficient multi-pass algorithm (known as the "LISP-2 compaction algorithm"); however, since it only ran right before exiting Factor, it didn't matter too much. Now, code heap compaction can happen at any time as a result of heap fragmentation, and it uses the same efficient algorithm as tenured space compaction. Compaction moves code around, and doing this at a time other than right before the VM exits creates a few complications: <ul> <li>Return addresses in the callstack need to be updated.</li> <li>If an inline cache misses, a new inline cache is compiled, and the call site for the old cache is patched. Inline cache construction allocates memory and can trigger a GC, which may in turn trigger a code heap compaction; if this occurs, the return address passed into the inline cache miss stub may have moved, and the code to patch the call site needs to be aware of this.</li> <li>If a Factor callback is passed into C code, then moving code around in the code heap may invalidate the callback, and the garbage collector has no way to update any function pointers that C libraries might be referencing.</li> </ul> The solution to the first problem was straightforward and involved adding some additional code to the code heap compaction pass. The second problem is trickier.
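Before getting to that, here is a rough sketch of the forwarding map scheme described earlier in this post: one population count per bitmap word, stored as cumulative sums, plus a partial count to finish each lookup. The names are mine, and I'm using the GCC/Clang <code>__builtin_popcountll()</code> builtin in place of the VM's portable <code>popcount()</code>:
<pre>#include <cstddef><br />#include <cstdint><br />#include <vector><br /><br />// Forwarding map over a mark bitmap stored as 64-bit words. The final<br />// destination of a live 16-byte block is (number of set bits before it) * 16,<br />// since compaction slides all live data towards the start of the heap.<br />struct forwarding_map {<br />    std::vector<std::uint64_t> bits;      // the mark bitmap, one bit per 16 bytes<br />    std::vector<std::size_t> cumulative;  // set bits in all words before word i<br /><br />    explicit forwarding_map(std::vector<std::uint64_t> const &bits_) : bits(bits_) {<br />        cumulative.resize(bits.size());<br />        std::size_t sum = 0;<br />        for (std::size_t i = 0; i < bits.size(); i++) {<br />            cumulative[i] = sum;<br />            sum += __builtin_popcountll(bits[i]);   // one popcount per word, computed once<br />        }<br />    }<br /><br />    // Number of set bits strictly before bit index `bit`: O(1) per query.<br />    std::size_t marked_before(std::size_t bit) const {<br />        std::size_t word = bit / 64, offset = bit % 64;<br />        std::uint64_t below = bits[word] & ((std::uint64_t(1) << offset) - 1);<br />        return cumulative[word] + __builtin_popcountll(below);<br />    }<br /><br />    // Where does the block that used to live at old_offset end up?<br />    std::size_t new_offset(std::size_t old_offset) const {<br />        return marked_before(old_offset / 16) * 16;<br />    }<br />};</pre>
The whole map is just the bitmap plus one counter per word, and each lookup is O(1). As for the trickier second problem: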
I added a new <code>code_root</code> smart pointer class, defined in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/code_roots.hpp;hb=HEAD">code_roots.hpp</a>, to complement the <code>data_root</code> smart pointer (see my blog post about <a href="http://factor-language.blogspot.com/2009/05/factor-vm-ported-to-c.html">moving the VM to C++</a> for details about that). The inline cache compiler wraps the return address in a <code>code_root</code> to ensure that if the code block that contains it is moved by any allocations, the return address can be updated properly. I solved the last problem with a layer of indirection (that's how all problems are solved in CS, right?). Callbacks are still allocated in the code heap, but the function pointer passed to C is actually stored in a separate "callback heap" where every block consists of a single jump instruction and nothing else. When a code heap compaction occurs, code blocks in the code heap might be moved around, and all jumps in the callback heap are updated. Blocks within the callback heap are never moved around (or even deallocated, since that isn't safe either). <br /><br /><h3>New object alignment and tagging scheme</h3> The last major change I wanted to discuss is that objects are now aligned on 16-byte boundaries, rather than 8-byte boundaries. This wastes more memory, but I've already offset most of the increase with some space optimizations, with more to follow. There are several benefits to this new system. First of all, the binary payload of byte array objects now begins on a 16-byte boundary, which allows SIMD intrinsics to use aligned access instructions, which are much faster. Second, it simplifies the machine code for inline caches and megamorphic method dispatch. Since the low 4 bits of every pointer are now clear, this allows all built-in VM types to fit in the pointer tag itself, so the logic to get a class descriptor for an object in a dispatch stub is very simple now; here is pseudo-code for the assembly that gets generated: <pre>cell tag = ptr & 15; if(tag == tuple_tag) tag = ptr[cell - tuple_tag]; ... dispatch on tag ...</pre>Slava Pestovhttp://www.blogger.com/profile/02768382790667979877[email protected]3tag:blogger.com,1999:blog-17087850.post-56759668091024465072009-10-15T07:16:00.009-04:002009-10-16T00:13:07.306-04:00Improved write barriers in Factor's garbage collectorThe Factor compiler has been evolving very quickly lately; it has been almost completely rewritten several times in the last couple of years. The garbage collector, on the other hand, hasn't seen nearly as much action. The last time I did any work on it was <a href="http://factor-language.blogspot.com/2008/05/garbage-collection-throughput.html">May 2008</a>, and before that, <a href="http://factor-language.blogspot.com/2007/05/garbage-collector-improvements.html">May 2007</a>. Now more than a year later, I've devoted a week or so to working on it. The result is a cleaner implementation, and improved performance.<br /><br /><h3>Code restructuring</h3>I re-organized the garbage collector code to be more extensible and easier to maintain. I did this by splitting off a bunch of the garbage collector methods from the <code>factor_vm</code> class into their own set of classes. I made extensive use of template metaprogramming to help structure code in a natural way. Many people dislike C++, primarily because of templates, but I don't feel that way at all. 
Templates are my favorite C++ feature, and if it weren't for templates C++ would just be a shitty object-oriented dialect of C.<br /><br />First up is the <code>collector</code> template class, defined in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/collector.hpp;hb=HEAD">collector.hpp</a>:<pre>template<typename TargetGeneration, typename Policy> struct collector</pre>This class has two template parameters:<ul><li><code>TargetGeneration</code> - this is the generation that the collector will be copying objects to. A generation is any class that implements the <code>allot()</code> method.</li><li><code>Policy</code> - this is a class that simulates a higher-order function. It implements a <code>should_copy_p()</code> method that tells the collector if a given object should be promoted to the target generation, or left alone.</li></ul>On its own, the collector class can't do any garbage collection; it just implements methods which trace GC roots: <code>trace_contexts()</code> (traces active stacks), <code>trace_roots()</code> (traces global roots), and <code>trace_handle()</code> (traces one pointer).<br /><br />Next up is the <code>copying_collector</code> template class, defined in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/copying_collector.hpp;hb=HEAD">copying_collector.hpp</a>:<pre>template<typename TargetGeneration, typename Policy> struct copying_collector</pre>This class has the same two template parameters as <code>collector</code>; the target generation must define one additional method, <code>next_object_after()</code>. This is used when scanning newly copied objects. This class implements logic for scanning marked cards, as well as Cheney's algorithm for copying garbage collection.<br /><br />Then, there are four subclasses implementing each type of GC pass:<ul><li><code>nursery_collector</code> - copies live objects from the nursery into aging space, defined in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/nursery_collector.hpp;hb=HEAD">nursery_collector.hpp</a> and <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/nursery_collector.cpp;hb=HEAD">nursery_collector.cpp</a></li><li><code>aging_collector</code> - copies live objects from the first aging semi-space to the second aging semi-space, defined in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/aging_collector.hpp;hb=HEAD">aging_collector.hpp</a> and <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/aging_collector.cpp;hb=HEAD">aging_collector.cpp</a></li><li><code>to_tenured_collector</code> - copies live objects from aging space into tenured space, defined in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/to_tenured_collector.hpp;hb=HEAD">to_tenured_collector.hpp</a> and <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/to_tenured_collector.cpp;hb=HEAD">to_tenured_collector.cpp</a></li><li><code>full_collector</code> - copies live objects from the first tenured semi-space to the second tenured semi-space, defined in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/full_collector.hpp;hb=HEAD">full_collector.hpp</a> and <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/full_collector.cpp;hb=HEAD">full_collector.cpp</a></li></ul>Each class subclasses <code>copying_collector</code> and specializes the two template
arguments. For example, let's take a look at the declaration of the nursery collector:<pre>struct nursery_collector : copying_collector<aging_space,nursery_policy></pre>This subclass specializes its superclass to copy objects to aging space, using the following policy class:<br /><pre>struct nursery_policy {<br /> factor_vm *myvm;<br /><br /> nursery_policy(factor_vm *myvm_) : myvm(myvm_) {}<br /><br /> bool should_copy_p(object *untagged)<br /> {<br /> return myvm->nursery.contains_p(untagged);<br /> }<br />};</pre><br />That is, any object that is in the nursery will be copied to aging space by the <code>nursery_collector</code>. Other collector subclasses are similar.<br /><br />This code all uses templates, rather than virtual methods, so every collector pass will have a specialized code path generated for it. This gives higher performance, with cleaner code, than is possible in C. The old garbage collector was a tangle of conditionals, C functions, and global state.<br /><br /><h3>Partial array card marking</h3>When a pointer to an object in the nursery is stored into a container in aging or tenured space, the container object must be added to a "remembered set", so that on the next minor collection it can be scanned and its elements considered as GC roots.<br /><h4>Old way</h4>Storing a pointer into an object marked the card containing the header of the object. On a minor GC, all marked cards would be scanned; every object whose header was contained in a marked card would be scanned in its entirety.<br /><h4>Problem</h4>Storing a pointer into an array would force the entire array to be scanned on the next minor collection. This is bad if the array is large. Consider an algorithm that stores successive elements into an array on every iteration, and also performs enough work per iteration to trigger a nursery collection. Now every nursery collection -- and hence every iteration of the loop -- is scanning the entire array. We're doing a quadratic amount of work for what should be a linear-time algorithm.<br /><h4>New way</h4>Storing a pointer into an object now marks the card containing the slot that was mutated. On a minor GC, all marked cards are scanned. Every object in every marked card is inspected, but only the subrange of slots that fits inside the card is scanned. This greatly reduces the burden placed on the GC from mutation of large arrays. The implementation is tricky. I need to spend some time thinking about and simplifying the code; as it stands, the card scanning routine has three nested loops and two uses of goto!<br /><h4>Implementation</h4><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/copying_collector.hpp;hb=HEAD">copying_collector.hpp</a>, <code>trace_cards()</code><br /><br /><h3>New object promotion policy</h3>When aging space is being collected, objects contained in marked cards in tenured space must be traced.<br /><h4>Old way</h4>These cards would be scanned, but could not be unmarked, since the objects they refer to were copied to the other aging semi-space, and would need to be traced on the next aging collection.<br /><h4>The problem</h4>The old way was bad because these cards would remain marked for a long time, and would be re-scanned on every collection.
Furthermore, the objects they reference would likely live on for a long time, since they're referenced from a tenured object, and would needlessly bounce back and forth between the two aging semi-spaces.<br /><h4>New way</h4>Instead, an aging collection proceeds in two phases: the first phase promotes aging space objects referenced from tenured space to tenured space, unmarking all marked cards. The second phase copies all reachable objects from the first aging semi-space to the second. This promotes objects likely to live for a long time all the way to tenured space, and scans fewer cards on an aging collection since more cards can be unmarked.<br /><h4>Implementation</h4><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/aging_collector.cpp;hb=HEAD">aging_collector.cpp</a><br /><br /><h3>Faster code heap remembered set</h3>If a code block references objects in the nursery, the code block needs to be updated after a nursery collection. This is because the machine code of compiled words directly refers to objects; there's no indirection through a literal table at runtime. This improves performance but increases garbage collector complexity.<br /><h4>Old way</h4>When a new code block was allocated, a global flag would be set. A flag would also be set in the code block's header. On the next nursery collection, the entire code heap would be scanned, and any code blocks with this flag set in them would have their literals traced by the garbage collector.<br /><h4>New way</h4>The problem with the old approach is that adding a single code block entails a traversal of the entire code heap on the next minor GC, which is bad for cache. While most code does not allocate in the code heap, the one major exception is the compiler itself. When loading source code, a significant portion of running time was spent scanning the code heap during minor collections. Now, the code blocks containing literals which refer to the nursery and aging spaces are stored in a pair of STL sets. On a nursery or aging collection, these sets are traversed and the code blocks they contain are traced. These sets are typically very small, and in fact empty most of the time.<br /><h4>Implementation</h4><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/code_heap.cpp;hb=HEAD">code_heap.cpp</a>, <code>write_barrier()</code><br /><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/copying_collector.hpp;hb=HEAD">copying_collector.hpp</a>, <code>trace_code_heap_roots()</code><br /><br /><h3>Faster card marking and object allocation</h3>The compiler's code generator now generates tighter code for common GC-related operations too. A write barrier looks like this in pseudo-C:<pre>cards[(obj - data_heap_start) >> card_bits] = card_mark_mask;</pre>Writing out the pointer arithmetic by hand, we get:<pre>*(cards + ((obj - data_heap_start) >> card_bits)) = card_mark_mask;</pre>Re-arranging some operations,<pre>*((obj >> card_bits) + (cards - (data_heap_start >> card_bits))) = card_mark_mask;</pre>Now, the entire expression <pre>(cards - (data_heap_start >> card_bits))</pre> is a constant. Factor stores this in a VM-global variable, named <code>cards_offset</code>. The value used to be loaded from the global variable every time a write barrier would execute. Now, its value is inlined directly into machine code. This requires code heap updates if the data heap grows, since then either the data heap start or the card array base pointer might change.
However, the upside is that it eliminates several instructions from the write barrier. Here is a sample generated write barrier sequence; only 5 instructions on x86.32:<pre>0x08e3ae84: lea (%edx,%eax,1),%ebp<br />0x08e3ae87: shr $0x8,%ebp<br />0x08e3ae8a: movb $0xc0,0x20a08000(%ebp)<br />0x08e3ae91: shr $0xa,%ebp<br />0x08e3ae94: movb $0xc0,0x8de784(%ebp)</pre><br />Object allocations had a slight inefficiency; the code generated to compute the effective address of the nursery allocation pointer did too much arithmetic. Adding support to the VM for embedding offsets of VM global variables directly into machine code saved one instruction from every object allocation. Here is some generated machine code to box a floating point number; only 6 instructions on x86.32 (of course Factor does <a href="http://factor-language.blogspot.com/2009/08/global-float-unboxing-and-some-other.html">float unboxing</a> to make your code even faster):<pre>0x08664a33: mov $0x802604,%ecx<br />0x08664a38: mov (%ecx),%eax<br />0x08664a3a: movl $0x18,(%eax)<br />0x08664a40: or $0x3,%eax<br />0x08664a43: addl $0x10,(%ecx)<br />0x08664a46: movsd %xmm0,0x5(%eax)</pre><h4>Implementation</h4><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/x86/x86.factor;hb=HEAD">cpu/x86/x86.factor</a>: <code>%write-barrier</code>, <code>%allot</code><br /><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/ppc/ppc.factor;hb=HEAD">cpu/ppc/ppc.factor</a>: <code>%write-barrier</code>, <code>%allot</code><br /><br /><h3>Performance comparison</h3><br />I compared the performance of a Mac OS X x86-32 build from October 5th, with the latest sources as of today.<br /><br />Bootstrap time saw a nice boost, going from 363 seconds, down to 332 seconds.<br /><br />The time to load and compile all libraries in the source tree (<code>load-all</code>) was reduced from 537 seconds to 426 seconds.<br /><br />Here is a microbenchmark demonstrating the faster card marking in a very dramatic way:<br /><pre>: bench ( -- seq ) 10000000 [ >bignum 1 + ] map ;</pre><br />The best time out of 5 iterations on the old build was 12.9 seconds. Now, it has gone down to 1.9 seconds.Slava Pestovhttp://www.blogger.com/profile/02768382790667979877[email protected]3tag:blogger.com,1999:blog-17087850.post-25790549264412200992009-09-30T23:52:00.007-04:002009-10-01T17:13:37.022-04:00A survey of domain-specific languages in FactorFactor has good support for implementing mini-languages as libraries. In this post I'll describe the general techniques and look at some specific examples. I don't claim any of this is original research -- Lisp and Forth programmers have been doing DSLs for decades, and recently the Ruby, Haskell and even Java communities are discovering some of these concepts and adding a few of their own to the mix. However, I believe Factor brings some interesting incremental improvements to the table, and the specific combination of capabilities found in Factor is unique.<br /><br /><h3>Preliminaries</h3><br />How does one embed a mini-language in Factor? Let us look at what goes on when a source file is parsed:<br /><ul><li>The parser reads the input program. This consists of definitions and top-level forms. The parser constructs syntax trees, and adds definitions to the dictionary. The result of parsing is the set of top-level forms in the file.</li><li>The compiler is run with all changed definitions. The compiler essentially takes syntax trees as input, and produces machine code as output. 
Once the compiler finishes compiling the new definitions, they are added to the VM's code heap and may be executed.</li><li>The top level forms in the file are run, if any.</li></ul><br />In Factor, all of these stages are extensible. Note that all of this happens when the source file is loaded into memory -- Factor is biased towards compile-time meta-programming.<br /><br /><h4>Extending the parser with parsing words</h4><br />Parsing words which execute at parse time can be defined. Parsing words can take over the parser entirely and parse custom syntax. All of Factor's standard syntax, such as <a href="http://docs.factorcode.org/content/word-__colon__,syntax.html">:</a> for defining words and <code>[</code> for reading a quotation, is actually implemented by parsing words defined in the <a href="http://docs.factorcode.org/content/vocab-syntax.html">syntax</a> vocabulary. Commonly-used libraries such as <a href="http://docs.factorcode.org/content/vocab-memoize.html">memoize</a> and <a href="http://docs.factorcode.org/content/vocab-specialized-arrays.html">specialized-arrays</a> add their own parsing words for custom definitions and data types. These don't qualify as domain-specific languages since they're too trivial, but they're very useful.<br /><br /><h4>Homoiconic syntax</h4><br />In most mainstream languages, the data types found in the syntax tree are quite different from the data types you operate on at runtime. For example, consider the following Java program:<br /><pre>if(x < 3) { return x + 3; } else { return foo(x); }</pre><br />This might parse into something like<br /><pre>IfNode(<br /> condition: LessThan(Identifier(x),Integer(3))<br /> trueBranch: ReturnNode(Add(Identifier(x),Integer(3)))<br /> falseBranch: ReturnNode(MethodCall(foo,Identifier(x)))<br />)</pre><br />You cannot work with an "if node" in your program; the identifier <code>x</code> does not exist at runtime, and so on.<br /><br />In Factor and Lisp, the parser constructs objects such as strings, numbers and lists directly, and identifiers (known as symbols in Lisp, and words in Factor) are first-class types. Consider the following Factor code:<br /><pre>x 3 < [ x 3 + ] [ x foo ] if</pre><br />This parses as a quotation of 6 elements:<br /><ul><li>The word <code>x</code></li><li>The integer 3</li><li>The word <code><</code></li><li>A quotation with three elements, <code>x</code>, <code>3</code>, and <code>+</code></li><li>A quotation with two elements, <code>x</code>, and <code>foo</code></li><li>The word <code>if</code></li></ul><br />An array literal, like <code>{ 1 2 3 }</code>, parses as an array of three integers, not an AST node representing an array with three child AST nodes representing integers.<br /><br />The flipside of homoiconic syntax is being able to print out (almost) any data in a form that can be parsed back in; in Factor, the <a href="http://docs.factorcode.org/content/word-.,prettyprint.html">.</a> word does this.<br /><br />What homoiconic syntax gives you as a library author is the ability to write mini-languages without writing a parser. Since you can input almost any data type with literal syntax, you can program your mini-language in parse trees directly. The mini-language can process nested arrays, quotations, tuples and words instead of parsing a string representation first.<br /><br /><h4>Compile-time macros</h4><br />All of Factor's data types, including quotations and words, can be constructed at runtime.
Factor also supports <a href="http://docs.factorcode.org/content/vocab-macros.html">compile-time macros</a> in the Lisp sense, but unlike Lisp where they are used to prevent evaluation, a Factor macro is called just like an ordinary word, except the parameters have to be compile-time literals. The macro evaluates to a quotation, and the quotation is compiled in place of the macro call.<br /><br />Everything that can be done with macros can also be done by constructing quotations at runtime and using <a href="http://docs.factorcode.org/content/word-call(,syntax.html">call(</a> -- macros just provide a speed boost.<br /><br /><h3>Parsing word-based DSLs</h3><br />Many DSLs have a parsing word as their main entry point. The parsing word either takes over the parser completely to parse custom syntax, or it defines some new words and then lets the main parser take over again.<br /><br /><h4>Local variables</h4><br />A common misconception among people taking a casual look at Factor is that it doesn't offer any form of lexical scoping or named values at all. For example, Reg Braithwaite authoritatively states on <a href="http://weblog.raganwald.com/2008/03/tool-time.html">his weblog</a>:<br /><blockquote><i>the Factor programming language imposes a single, constraining set of rules on the programmer: programmers switching to Factor must relinquish their local variables to gain Factor’s higher-order programming power.</i></blockquote><br />In fact, Factor supports lexically scoped local variables via the <a href="http://docs.factorcode.org/content/vocab-locals.html">locals</a> vocabulary, and this library is used throughout the codebase. It looks like about 1% of all words in a default image use lexical variables.<br /><br />The locals vocabulary implements a set of parsing words which augment the standard defining words. For example, <a href="http://docs.factorcode.org/content/word-__colon____colon__,locals.html">::</a> reads a word definition where input arguments are stored in local variables, instead of being on the stack:<br /><pre>:: lerp ( a b c -- d )<br /> a c *<br /> b 1 c - *<br /> + ;</pre><br />The locals vocabulary also supports "let" statements, lambdas with full lexical closure semantics, and mutable variables. The locals vocabulary compiles lexical variable usage down to stack shuffling, and <a href="http://docs.factorcode.org/content/word-curry,kernel.html">curry</a> calls (for constructing quotations that close over a variable). This makes it quite efficient, especially since in many cases the Factor compiler can eliminate the closure construction using escape analysis. The choice of whether or not to use locals is one that can be made purely on a stylistic basis, since it has very little effect on performance.<br /><br /><h4>Parsing expression grammars</h4><br /><a href="http://en.wikipedia.org/wiki/Parsing_expression_grammar">Parsing expression grammars</a> describe a certain class of languages, as well as a formalism for parsing these languages.<br /><br /><a href="http://bluishcoder.co.nz">Chris Double</a> implemented a PEG library in Factor.
The <a href="http://docs.factorcode.org/content/vocab-peg.html">peg</a> vocabulary offers a combinator-style interface for constructing parsers, and <a href="http://www.bluishcoder.co.nz/2008/04/factor-parsing-dsl.html">peg.ebnf</a> builds on top of this and defines a declarative syntax for specifying parsers.<br /><br />A simple example of a PEG grammar can be found in Chris Double's <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=extra/peg/pl0;hb=HEAD">peg.pl0</a> vocabulary. More elaborate examples can be found in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=extra/peg/javascript;hb=HEAD">peg.javascript</a> (JavaScript parser by Chris Double) and <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=extra/smalltalk;hb=HEAD">smalltalk.parser</a> (Smalltalk parser by me).<br /><br />One downside of PEGs is that they have some performance problems; the standard formulation has exponential runtime in the worst case, and the "Packrat" variant that Factor uses runs in linear time but also linear space. For heavy-duty parsing, it appears as if LL and LR parsers are best, and it would be nice if Factor had an implementation of such a parser generator.<br /><br />However, PEGs are still useful for simple parsing tasks and prototyping. They are used throughout the Factor codebase for many things, including but not limited to:<br /><ul><li>Parsing regular expressions in the <a href="http://docs.factorcode.org/content/vocab-regexp.html">regexp</a> vocabulary (<a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/regexp/parser;hb=HEAD">source</a>)</li><li>Parsing URLs in the <a href="http://docs.factorcode.org/content/vocab-urls.html">urls</a> vocabulary (<a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/urls;hb=HEAD">source</a>)</li><li>Parsing HTTP headers in the <a href="http://docs.factorcode.org/content/vocab-http.server.html">http.server</a> and <a href="http://docs.factorcode.org/content/vocab-http.client.html">http.client</a> vocabularies (<a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/http/parsers;hb=HEAD">source</a>)</li></ul><br />PEGs can also be used in conjunction with parsing words to embed source code written with custom grammars in Factor source files directly. The next DSL is an example of that.<br /><br /><h4>Infix expressions</h4><br />Philipp Brüschweiler's <a href="http://docs.factorcode.org/content/vocab-infix.html">infix</a> vocabulary defines a parsing word which parses an infix math expression using PEGs. The result is then compiled down to <a href="http://docs.factorcode.org/content/vocab-locals.html">locals</a>, which in turn compile down to stack code.<br /><br />Here is a word which solves a quadratic equation <code>ax^2 + bx + c = 0</code> using the quadratic formula. Infix expressions can only return one value, so this word computes the first root only:<br /><pre>USING: infix locals ;<br /><br />:: solve-quadratic ( a b c -- r )<br /> [infix (-b + sqrt(b*b-4*a*c))/(2*a) infix] ;</pre><br />Note that we're using two mini-languages here: <code>::</code> begins a definition with named parameters stored in local variables, and <code>[infix</code> parses an infix expression which can access these local variables.<br /><br /><h4>XML literals</h4><br /><a href="http://useless-factor.blogspot.com">Daniel Ehrenberg's</a> XML library defines a convenient syntax for constructing XML documents.
Dan describes it in detail in <a href="http://useless-factor.blogspot.com/2009/01/factor-supports-xml-literal-syntax.html">a blog post</a>, with plenty of examples, so I won't repeat it here. The neat thing here is that by adding a pair of parsing words, <code>[XML</code> and <code><XML</code>, he was able to integrate XML snippets into Factor, with parse-time well-formedness checking, no less.<br /><br /><a href="http://docs.factorcode.org/content/vocab-xml.html">Dan's XML library</a> is now used throughout the Factor codebase, particularly in the web framework, for both parsing and generating XML. For example, the <a href="http://concatenative.org">concatenative.org wiki</a> uses a markup language called "farkup". The <a href="http://concatenative.org/wiki/view/Farkup">farkup</a> markup language, developed by Dan, <a href="http://code-factor.blogspot.com">Doug Coleman</a> and myself, makes heavy use of XML literals. Farkup is implemented by first parsing the markup into an abstract syntax tree, and then converting this to HTML using a recursive tree walk that builds XML literals. We avoid constructing XML and HTML through raw string concatenation; instead we use XML literals everywhere now. This results in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/farkup;hb=HEAD">cleaner, more secure code</a>.<br /><br />Compare the design of our farkup library with <a href="http://code.unicoders.org/browser/django/trunk/contrib/markdown.py">markdown.py</a> used by <a href="http://reddit.com">reddit.com</a>. The latter is implemented with a series of regular expression hacks and lots of ad-hoc string processing which attempts to produce something resembling HTML in the end. New markup injection attacks are found all the time; there was a particularly clever one involving a JavaScript worm that knocked reddit right down a few days ago. I don't claim that Farkup is 100% secure by any means, and certainly it has had an order of magnitude less testing, but without a doubt centralizing XHTML generation makes it much easier to audit and identify potential injection problems.<br /><br /><h4>C library interface</h4><br />Our C library interface (or FFI) was quite low-level initially but after a ton of work by Alex Chapman, Joe Groff, and myself, it has quite a DSLish flavor. C library bindings resemble C header files on mushrooms. Here is a taste:<br /><pre>TYPEDEF: int cairo_bool_t<br /><br />CONSTANT: CAIRO_CONTENT_COLOR HEX: 1000<br />CONSTANT: CAIRO_CONTENT_ALPHA HEX: 2000<br />CONSTANT: CAIRO_CONTENT_COLOR_ALPHA HEX: 3000<br /><br />STRUCT: cairo_matrix_t<br /> { xx double }<br /> { yx double }<br /> { xy double }<br /> { yy double }<br /> { x0 double }<br /> { y0 double } ;<br /> <br />FUNCTION: void<br />cairo_transform ( cairo_t* cr, cairo_matrix_t* matrix ) ;</pre><br />The Factor compiler generates stubs for calling C functions on the fly from these declarative descriptions; there is no C code generator and no dependency on a C compiler. In fact, C library bindings are so easy to write that for many contributors, it is their first project in Factor. When <a href="http://code-factor.blogspot.com">Doug Coleman</a> first got involved in Factor, he began by writing a PostgreSQL binding, followed by an implementation of the MD5 checksum. 
Both libraries have been heavily worked on since then and are still in use.<br /><br />For complete examples of FFI usage, check out any of the following:<br /><ul><li><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/unix;hb=HEAD">unix</a> - POSIX bindings</li><li><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/windows;hb=HEAD">windows</a> - Windows API bindings</li><li><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/db/sqlite/ffi;hb=HEAD">db.sqlite.ffi</a> - SQLite bindings</li><li><a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/opengl/gl;hb=HEAD">opengl.gl</a> - OpenGL bindings</li></ul><br />There are many more uses of the FFI, of course. Since Factor has a minimal VM, all I/O, graphics and interaction with the outside world in general is done with the FFI. Search the Factor source tree for source files that use the <a href="http://docs.factorcode.org/content/vocab-alien.syntax.html">alien.syntax</a> vocabulary.<br /><br /><h4>GPU shaders</h4><br /><a href="http://duriansoftware.com">Joe Groff</a> cooked up a nice DSL for passing uniform parameters to pixel and vertex shaders. In his <a href="http://duriansoftware.com/joe/Spring-cleaning.html">blog post</a>, Joe writes:<br /><blockquote><i>The library makes it easy to load and interactively update shaders, define binary formats for GPU vertex buffers, and feed parameters to shader code using Factor objects. </i></blockquote><br />Here is a snippet from the <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=extra/gpu/demos/raytrace;hb=HEAD">gpu.demos.raytrace</a> demo:<br /><pre>GLSL-SHADER-FILE: raytrace-vertex-shader vertex-shader "raytrace.v.glsl"<br />GLSL-SHADER-FILE: raytrace-fragment-shader fragment-shader "raytrace.f.glsl"<br />GLSL-PROGRAM: raytrace-program<br /> raytrace-vertex-shader raytrace-fragment-shader ;<br /><br />UNIFORM-TUPLE: sphere-uniforms<br /> { "center" vec3-uniform f }<br /> { "radius" float-uniform f }<br /> { "color" vec4-uniform f } ;<br /><br />UNIFORM-TUPLE: raytrace-uniforms<br /> { "mv-inv-matrix" mat4-uniform f }<br /> { "fov" vec2-uniform f }<br /> <br /> { "spheres" sphere-uniforms 4 }<br /><br /> { "floor-height" float-uniform f }<br /> { "floor-color" vec4-uniform 2 }<br /> { "background-color" vec4-uniform f }<br /> { "light-direction" vec3-uniform f } ;</pre><br />The <a href="http://docs.factorcode.org/content/word-GLSL-SHADER-FILE__colon__,gpu.shaders.html">GLSL-SHADER-FILE:</a> parsing word tells Factor to load a GLSL shader program. The GPU framework automatically checks the file for modification, reloading it if necessary.<br /><br />The <a href="http://docs.factorcode.org/content/word-UNIFORM-TUPLE__colon__,gpu.render.html">UNIFORM-TUPLE:</a> parsing word defines a new tuple class, together with methods which destructure the tuple and bind textures and uniform parameters. Uniform parameters are named as such because they define values which remain constant at every pixel or vertex that the shader program operates on.<br /><br /><h4>Instruction definitions in the compiler</h4><br />This one is rather obscure and technical, but it has made my job easier over the last few weeks.
<a href="http://factor-language.blogspot.com/2009/09/eliminating-some-boilerplate-in.html">I blogged about it already</a>.<br /><br /><h3>Other examples</h3><br />The next set of DSLs don't involve parsing words as much as just clever tricks with evaluation semantics.<br /><br /><h4>Inverse</h4><br /><a href="http://useless-factor.blogspot.com">Daniel Ehrenberg's</a> <a href="http://docs.factorcode.org/content/vocab-inverse.html">inverse</a> library implements a form of pattern matching by computing the inverse of a Factor quotation. The fundamental combinator, <a href="http://docs.factorcode.org/content/word-undo,inverse.html">undo</a>, takes a Factor quotation, and executes it "in reverse". So if there is a constructed tuple on the stack, undoing the constructor will leave the slots on the stack. If the top of the stack doesn't match anything that the constructor could've produced, then the inverse fails, and pattern matching can move on to the next clause. This library works by introspecting quotations and the words they contain. Dan gives many details and examples in <a href="http://www.scribd.com/doc/12803254/Pattern-matching-in-concatenative-languages">his paper on inverse</a>.<br /><br /><h4>Help system</h4><br /><a href="http://docs.factorcode.org/content/article-help.html">Factor's help system</a> uses an s-expression-like markup language. Help markup is parsed by the Factor parser without any special parsing words. A markup element is an array where the first element is a distinguished word and the rest are parameters. Examples:<br /><pre>"This is " { $strong "not" } " a good idea"<br /><br />{ $list<br /> "milk"<br /> "flour"<br /> "eggs"<br />}<br /><br />{ $link "help" }</pre><br />This markup is rendered either directly on the Factor UI (like in this <a href="http://factorcode.org/factor-windows7.png">screenshot</a>) or via HTML, as on the <a href="http://docs.factorcode.org">docs.factorcode.org</a> site.<br /><br />The nice thing about being able to view help in the UI environment is the sheer interactive nature of it. Unlike something like javadoc, there is no offline processing step which takes your source file and spits out rendered markup. You just load a vocabulary into your Factor instance and the documentation is available instantly. You can look at the help for any documented word by simply typing something like <pre>\ append help</pre> in the UI listener. While working on your own vocabulary, you can reload changes to the documentation and see them appear instantly in the UI's help browser.<br /><br />Finally, it is worth mentioning that because of the high degree of semantic information encoded in documentation, many kinds of mistakes can be caught in an automated fashion. The <a href="http://docs.factorcode.org/content/article-help.lint.html">help lint</a> tool finds inconsistencies between the actual parameters that a function takes, and the documented parameters, as well as code examples that don't evaluate to what the documentation claims they evaluate to, and a few other things.<br /><br />You won't find a lot of comments in Factor source, because the help system is much more useful. Instead of plain-text comments that can go out of date, why not have rich text with hyperlinks and semantic information?<br /><br />For examples of help markup, look at any file whose name ends with <code>-docs.factor</code> in the Factor source tree. 
There are plenty.<br /><br /><h4>x86 and PowerPC assemblers</h4><br />I put this one last since it's not really a DSL at all, just a nice API. The lowest level of Factor's compiler generates machine code from the compiler's intermediate form in a CPU-specific way. The CPU backends for x86 and PowerPC use the <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/cpu/x86/assembler;hb=HEAD">cpu.x86.assembler</a> and <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/cpu/ppc/assembler;hb=HEAD">cpu.ppc.assembler</a> vocabularies for this, respectively. The way the assemblers work is that they define a set of words corresponding to CPU instructions. Instruction words take operands from the stack -- objects representing registers, immediate values, and, in the case of x86, addressing modes. They then combine the operands and the instruction into a binary opcode, and add it to a byte vector stored in a dynamically-scoped variable. So instead of calling methods on, and passing around, an 'assembler object' as you would in, say, a JIT coded in C++, you wrap the code generation in a call to Factor's <a href="http://docs.factorcode.org/content/word-make,make.html">make</a> word, and simply invoke instruction words therein. The result looks like assembler source, except it is postfix. Here is an x86 example:<br /><pre>[<br /> EAX ECX ADD<br /> XMM0 XMM1 HEX: ff SHUFPS<br /> AL 7 OR<br /> RAX 15 [+] RDI MOV<br />] B{ } make .</pre><br />When evaluated, the above will print out the following:<br /><pre>B{ 1 200 15 198 193 255 131 200 7 72 137 120 15 }</pre><br />Of course, <a href="http://docs.factorcode.org/content/word-B{,syntax.html">B{</a> is the homoiconic syntax for a byte array. Note the way indirect memory operands are constructed; first, we push the register (<code>RAX</code>) then the displacement (here the constant 15, but register displacement is supported by x86 too). Then we call a word <code>[+]</code> which constructs an object representing the addressing mode <code>[RAX+15]</code>.<br /><br />The rationale for choosing this somewhat funny syntax for indirect operands (there is also a <code>[]</code> word for memory loads without a displacement), rather than some kind of parser hack that allows one to write <code>[RAX]</code> or <code>[R14+RDI]</code> directly, is that in reality the compiler only rarely deals with hard-coded register assignments. Instead, the register allocator makes decisions a level above, and passes them to the code generator. Here is a typical compiler code generation template from the <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/x86/x86.factor;hb=HEAD">cpu.x86</a> vocabulary:<br /><pre>M:: x86 %check-nursery ( label temp1 temp2 -- )<br /> temp1 load-zone-ptr<br /> temp2 temp1 cell [+] MOV<br /> temp2 1024 ADD<br /> temp1 temp1 3 cells [+] MOV<br /> temp2 temp1 CMP<br /> label JLE ;</pre><br />Here, I'm using the locals vocabulary together with the assembler. The <code>temp1</code> and <code>temp2</code> parameters are registers and <code>label</code> is, as its name implies, a label to jump to. This snippet generates machine code that checks whether or not the new object allocation area has enough space; if so, it jumps to the label, otherwise it falls through (code to save live registers and call the GC is generated next).
The <code>load-zone-ptr</code> word is like an assembler macro here; it takes a register and generates some more code with it.<br /><br />The PowerPC assembler is a tad more interesting. Since the x86 instruction set is so complex, with many addressing modes and so on, the x86 assembler is implemented in a rather tedious manner. Obvious duplication is abstracted out. However, there is a lot of case-by-case code for different groups of instructions, with no coherent underlying abstraction allowing the instruction set to be described in a declarative way.<br /><br />On PowerPC, the situation is better. Since the instruction set is a lot more regular (fixed width instructions, only a few distinct instruction formats, no addressing modes), the PowerPC assembler itself is built using a DSL specifically for describing PowerPC instructions:<br /><pre>D: ADDI 14<br />D: ADDIC 12<br />D: ADDIC. 13<br />D: ADDIS 15<br />D: CMPI 11<br />D: CMPLI 10<br />...</pre><br />The PowerPC instruction format DSL is defined in the <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/cpu/ppc/assembler/backend;hb=HEAD">cpu.ppc.assembler.backend</a> vocabulary, and as a result the <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/ppc/assembler/assembler.factor;hb=HEAD">cpu.ppc.assembler</a> vocabulary itself is mostly trivial.<br /><br /><h3>Last words</h3><br />Usually my blog posts describe recent progress in the Factor implementation, and I tend to write about what I'm working on right now. I'm currently working on code generation for SIMD vector instructions in the Factor compiler. I was going to blog about this instead, but decided not to do it until the SIMD implementation and API settle down some more.<br /><br />With this post I decided to try something a bit different, and instead just describe an aspect of Factor that interests me, without any of it being particularly breaking news. If you've been following Factor development closely, there is literally nothing in this post that you would not have seen already; however, I figured people who don't track the project so closely might appreciate a general survey like this. I'm also thinking of writing a post describing Factor's various high-level I/O libraries. I'd appreciate any feedback, suggestions and ideas on this matter.Slava Pestovhttp://www.blogger.com/profile/02768382790667979877[email protected]10tag:blogger.com,1999:blog-17087850.post-10917974223199166112009-09-12T23:20:00.010-04:002009-09-13T17:05:16.230-04:00Advanced floating point features: exceptions, rounding modes, denormals, unordered compares, and moreFactor now has a nice library for introspecting and modifying the floating point environment. <a href="http://duriansoftware.com/joe">Joe Groff</a> implemented most of it, and I helped out with debugging and additional floating point comparison operations. All of these features are part of the <a href="http://en.wikipedia.org/wiki/IEEE_754-2008">IEEE floating point specification</a> and are implemented on all modern CPUs; however, few programming languages expose a nice interface for working with them. C compilers typically provide hard-to-use low-level intrinsics and other languages don't tend to bother at all.
Two exceptions are the <a href="http://sbcl.org">SBCL compiler</a> and the <a href="http://www.digitalmars.com/d/">D language</a>.<br /><br />The new functionality is mostly contained in the <code>math.floats.env</code> vocabulary, with a few new words in <code>math</code> for good measure. The new code is in the repository but it is not completely stable yet; there are still some issues we need to work out on the more obscure platforms.<br /><br />To follow along with the examples below, you'll need to get a git checkout from the master branch and load the vocabulary in your listener:<br /><pre>USE: math.floats.env</pre><br />The first two features, floating point exceptions and traps, are useful for debugging numerical algorithms and detecting potentially undesirable situations (NaNs appearing, underflow, overflow, etc).<br /><br /><h3>Floating point exceptions</h3><br />One of the first things people learn about floating point is that it has "special" values: positive and negative infinity, and not-a-number (NaN) values. These appear as the results of computations where the answer is undefined (division by zero, square root of -1, etc) or the answer is too small or large to be represented as a float (2 to the power of 10000, etc). A less widely-known fact is that when a special value is computed, "exception flags" are set in a hardware register which can be read back in. Most languages do not offer any way to access this functionality.<br /><br />In Factor, exception flags can be read using the <code>collect-fp-exceptions</code> combinator, which first clears the flags, calls a quotation, then outputs any flags which were set. For example, division by zero sets the division by zero exception flag and returns infinity:<br /><pre>( scratchpad ) [ 1.0 0.0 / ] collect-fp-exceptions . .<br />{ +fp-zero-divide+ }<br />1/0.</pre><br />Dividing 1 by 3 sets the inexact flag, because the result (0.333....) cannot be represented as a float:<br /><pre>( scratchpad ) [ 1.0 3.0 / ] collect-fp-exceptions . .<br />{ +fp-inexact+ }<br />0.3333333333333333</pre><br />The fact that 1/3 does not have a terminating decimal or binary expansion is well-known, however one thing that many beginners find surprising is that some numbers which have terminating decimal expansions nevertheless cannot be represented precisely as floats because they do not terminate in binary (one classic case is 1.0 - 0.9 - 0.1 != 0.0):<br /><pre>( scratchpad ) [ 4.0 10.0 / ] collect-fp-exceptions . .<br />{ +fp-inexact+ }<br />0.4</pre><br />Raising a number to a power that is too large sets both the inexact and overflow flags, and returns infinity:<br /><pre>( scratchpad ) [ 2.0 10000.0 ^ ] collect-fp-exceptions . .<br />{ +fp-inexact+ +fp-overflow+ }<br />1/0.</pre><br />The square root of 4 is an exact value; no exceptions were set:<br /><pre>( scratchpad ) [ 4.0 sqrt ] collect-fp-exceptions . .<br />{ }<br />2.0</pre><br />The square root of 2 is not exact on the other hand:<br /><pre>( scratchpad ) [ 2.0 sqrt ] collect-fp-exceptions . .<br />{ +fp-inexact+ }<br />1.414213562373095</pre><br />Factor supports complex numbers, so taking the square root of -1 returns an exact value and does not set any exceptions:<br /><pre>( scratchpad ) [ -1.0 sqrt ] collect-fp-exceptions . 
.<br />{ }<br />C{ 0.0 1.0 }</pre><br />However, we can observe the invalid operation exception flag being set if we call the internal <code>fsqrt</code> word, which operates on floats only and calls the libc function (or uses the <code>SQRTSD</code> instruction on SSE2):<br /><pre>( scratchpad ) USE: math.libm [ -1.0 fsqrt ] collect-fp-exceptions . .<br />{ +fp-invalid-operation+ }<br />NAN: 8000000000000</pre><br />I describe the new <code>NAN:</code> syntax later in this post.<br /><br /><h3>Signaling traps</h3><br />Being able to inspect floating point exceptions set after a piece of code runs is all well and good, but what if you have a tricky underflow bug, or a NaN is popping up somewhere, and you want to know exactly where? In this case it is possible to set a flag in the FPU's control register which triggers a trap when an exception is raised. This trap is delivered to the Factor process as a signal (Unix), Mach exception (Mac OS X), or SEH exception (Windows). Factor then throws it as an exception which can be caught using any of Factor's <a href="http://docs.factorcode.org/content/article-errors.html">error handling</a> words, or just left unhandled in which case it will bubble up to the listener.<br /><br />The <code>with-fp-traps</code> combinator takes a list of traps and runs a quotation with those traps enabled; when the quotation completes (or throws an error) the former FPU state is restored again (indeed it has to be this way, since running the Factor UI's rendering code with traps enabled quickly kills it). The <code>all-fp-exceptions</code> word is equivalent to specifying <code>{ +fp-invalid-operation+ +fp-overflow+ +fp-underflow+ +fp-zero-divide+ +fp-inexact+ }</code>. Here is an example:<br /><pre>( scratchpad ) all-fp-exceptions [ 0.0 0.0 / ] with-fp-traps<br />Floating point trap</pre><br />Without the combinator wrapped around it, <code>0.0 0.0 /</code> simply returns a NaN value without throwing anything.<br /><br /><h3>Rounding modes</h3><br />Unlike exceptions and traps, which do not change the result of a computation but merely set status flags (or interrupt it), the next two features, the rounding mode and denormal mode, actually change the results of computations. As with exceptions and traps, they are implemented as scoped combinators rather than global state changes to ensure that code using these features is 'safe' and cannot change floating point state of surrounding code.<br /><br />If a floating point operation produces an inexact result, there is the question of how the result will be rounded to a value representable as a float. There are four rounding modes in IEEE floating point:<ul><li><code>+round-nearest+</code></li><li><code>+round-down+</code></li><li><code>+round-up+</code></li><li><code>+round-zero+</code></li></ul><br />Here is an example of an inexact computation done with two different rounding modes; the default (<code>+round-nearest+</code>) and <code>+round-up+</code>:<br /><pre>( scratchpad ) 1.0 3.0 / .<br />0.3333333333333333<br />( scratchpad ) +round-up+ [ 1.0 3.0 / ] with-rounding-mode .<br />0.3333333333333334</pre><br /><br /><h3>Denormals</h3><br /><a href="http://en.wikipedia.org/wiki/Denormal_number">Denormal numbers</a> are numbers where the exponent consists of zero bits (the minimum value) but the mantissa is not all zeros. Denormal numbers are undesirable because they have lower precision than normal floats, and on some CPUs computations with denormals are less efficient than with normals. 
IEEE floating point supports two denormal modes: you can elect to have denormals "flush" to zero (<code>+denormal-flush+</code>), or you can "keep" denormals (<code>+denormal-keep+</code>). The latter is the default:<br /><pre>( scratchpad ) +denormal-flush+ [ 51 2^ bits>double 0.0 + ] with-denormal-mode .<br />0.0<br />( scratchpad ) 51 2^ bits>double 0.0 + .<br />1.112536929253601e-308</pre><br /><br /><h3>Ordered and unordered comparisons</h3><br />In math, for any two numbers <code>a</code> and <code>b</code>, one of the following three properties holds:<br /><ul><li>a < b</li><li>a = b</li><li>a > b</li></ul><br />In floating point, there is a fourth possibility: <code>a</code> and <code>b</code> are <i>unordered</i>. This occurs if one of the two values is a NaN. The <code>unordered?</code> predicate tests for this possibility:<br /><pre>( scratchpad ) NAN: 8000000000000 1.0 unordered? .<br />t</pre><br />If an ordered comparison word such as <code><</code> or <code>>=</code> is called with two values which are unordered, it returns <code>f</code> and sets the <code>+fp-invalid-operation+</code> exception:<br /><pre>( scratchpad ) NAN: 8000000000000 1.0 [ < ] collect-fp-exceptions . .<br />{ +fp-invalid-operation+ }<br />f</pre><br />If traps are enabled this will throw an error:<br /><pre>( scratchpad ) NAN: 8000000000000 1.0 { +fp-invalid-operation+ } [ < ] with-fp-traps <br />Floating point trap</pre><br />If your numerical algorithm has a legitimate use for NaNs, and you wish to run it with traps enabled, and have certain comparisons not signal traps when inputs are NaNs, you can use <i>unordered</i> comparisons in those cases instead:<br /><pre>( scratchpad ) NAN: 8000000000000 1.0 [ u< ] collect-fp-exceptions . .<br />{ }<br />f</pre><br />Unordered versions of all the comparisons are now defined: <code>u<</code>, <code>u<=</code>, <code>u></code>, and <code>u>=</code>. Equality of numbers is always unordered, so it does not raise traps if one of the inputs is a NaN. In particular, if both inputs are NaNs, equality always returns <code>f</code>:<br /><pre>( scratchpad ) NAN: 8000000000000 dup [ number= ] collect-fp-exceptions . .<br />{ }<br />f</pre><br /><br /><h3>Half-precision floats</h3><br />Everyone and their drunk buddy knows about IEEE single (32-bit) and double (64-bit) floats; IEEE also defines half-precision 16-bit floats. These are not used nearly as much; they come up in graphics programming for example, since GPUs use them for certain calculations with color components where you don't need more accuracy. The <code>half-floats</code> vocabulary provides some support for working with half-floats. It defines a pair of words for converting Factor's double-precision floats to and from half-floats, as well as C type support for passing half-floats to C functions via FFI, and building packed arrays of half-floats for passing to the GPU.<br /><br /><h3>Literal syntax for NaNs and hexadecimal floats</h3><br />You may have noticed the funny <code>NAN:</code> syntax above. Previously all NaN values would print as <code>0/0.</code>; however, this is inaccurate since not all NaNs are created equal: because of how IEEE floating point works, a value is a NaN if the exponent consists of all ones and the mantissa is nonzero, with the mantissa otherwise unspecified. The mantissa is known as the "NaN payload" in this case. NaNs now print out, and can be parsed back in, using a syntax that makes the payload explicit.
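In bit-level terms (shown here as a standalone C++ sketch, not anything from the Factor VM), a double is a NaN whenever its exponent field is all ones and its 52-bit mantissa, the payload, is nonzero:<br />
<pre>#include <cstdint><br />#include <cstdio><br />#include <cstring><br /><br />// Build a double whose exponent field is all ones and whose 52-bit mantissa<br />// carries the given payload; a payload of zero would be +infinity, not a NaN.<br />double make_nan(std::uint64_t payload) {<br />    std::uint64_t bits = (0x7ffULL << 52) | (payload & ((1ULL << 52) - 1));<br />    double result;<br />    std::memcpy(&result, &bits, sizeof result);<br />    return result;<br />}<br /><br />int main() {<br />    double quiet = make_nan(0x8000000000000ULL);      // the canonical "quiet" NaN payload<br />    std::printf("%f %d\n", quiet, quiet != quiet);    // NaNs are never equal to themselves<br />}</pre><br />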
A NaN can also be constructed with an arbitrary payload using the <code><fp-nan></code> word:<br /><pre>( scratchpad ) HEX: deadbeef <fp-nan> .<br />NAN: deadbeef</pre><br />The old <code>0/0.</code> syntax still works; it is shorthand for <code>NAN: 8000000000000</code>, the canonical "quiet" NaN.<br /><br />Some operations produce NaNs with different payloads:<br /><pre>( scratchpad ) USE: math.libm<br />( scratchpad ) 2.0 facos .<br />NAN: 8000000000022</pre><br />In general, there is very little you can do with the NaN payload.<br /><br />A more useful feature is hexadecimal float literals. When reading a float from a decimal string, or printing a float to a decimal string, there is sometimes ambiguity due to rounding. No such problem exists with hexadecimal floats.<br /><br />Here is an example of printing the same number first as a decimal and then as a hexadecimal float:<br /><pre>( scratchpad ) USE: math.constants<br />( scratchpad ) pi .<br />3.141592653589793<br />( scratchpad ) pi .h<br />1.921fb54442d18p1</pre><br />Java supports hexadecimal float literals as of Java 1.5. Hats off to the Java designers for adding this! It would be nice if they would add the rest of the IEEE floating point functionality in Java 7.<br /><h3>Signed zero</h3><br />Unlike two's-complement integer arithmetic, IEEE floating point has both positive and negative zero. Negative zero appears as the result of computations on very small negative numbers that underflow. Signed zeros also have applications in complex analysis, because they allow a choice of branch cut to be made. Factor's <code>abs</code> word used to be implemented incorrectly on floats: it would check whether the input was negative, and if so, multiply it by negative one. This was a problem because negative zero is not less than zero, and so the absolute value of negative zero would be reported as negative zero. The correct implementation of the absolute value function on floats is to simply clear the sign bit.
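Here is a rough sketch of what clearing the sign bit looks like, expressed with the <code>double>bits</code> and <code>bits>double</code> words; the <code>fabs-sketch</code> word below is just an illustration, not the actual implementation of <code>abs</code>:<br /><pre>( scratchpad ) ! illustration only: mask off the sign bit of the 64-bit representation<br />( scratchpad ) : fabs-sketch ( x -- y ) double>bits HEX: 7fffffffffffffff bitand bits>double ;<br />( scratchpad ) -0.0 fabs-sketch .<br />0.0<br />( scratchpad ) -3.5 fabs-sketch .<br />3.5</pre><br />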
It works properly now:<br /><pre>( scratchpad ) -0.0 abs .<br />0.0</pre><br /><br /><h3>Implementation</h3><br />The implementation of the above features consists of several parts:<br /><ul><li>Cross-platform Factor code in the <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/math/floats/env;hb=HEAD">math.floats.env</a> vocabulary implementing the high-level API</li><li>Assembly code in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/cpu-x86.32.S;hb=HEAD">vm/cpu-x86.32.S</a>, <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/cpu-x86.64.S;hb=HEAD">vm/cpu-x86.64.S</a>, and <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=vm/cpu-ppc.S;hb=HEAD">vm/cpu-ppc.S</a> to read and write x87, SSE2 and PowerPC FPU control registers</li><li>Low-level code in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/math/floats/env/x86;hb=HEAD">math.floats.env.x86</a> and <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=tree;f=basis/math/floats/env/ppc;hb=HEAD">math.floats.env.ppc</a> which implements the high-level API in terms of the assembly functions, by calling them via <a href="http://docs.factorcode.org/content/article-alien.html">Factor's FFI</a> and parsing control registers into a cross-platform representation in terms of Factor symbols</li><li>Miscellaneous words for taking floats apart into their bitwise representation in the <code>math</code> vocabulary</li><li>Compiler support for ordered and unordered floating point compare instructions in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/instructions/instructions.factor;hb=HEAD">compiler.cfg.instructions</a></li><li>CPU-specific code generation for ordered and unordered floating point compare instructions in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/x86/x86.factor;hb=HEAD">cpu.x86</a> and <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/ppc/ppc.factor;hb=HEAD">cpu.ppc</a></li></ul>Slava Pestovhttp://www.blogger.com/profile/02768382790667979877[email protected]2tag:blogger.com,1999:blog-17087850.post-20148145482956210532009-09-02T06:59:00.003-04:002009-09-02T07:20:06.195-04:00Eliminating some boilerplate in the compilerAdding new instructions to the low-level optimizer was too hard. Multiple places had to be updated, and I would do all this by hand:<br /><ul><li>The instruction tuple itself is defined in the <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/instructions/instructions.factor;hb=HEAD">compiler.cfg.instructions</a> vocabulary with the <code>INSN:</code> parsing word, which also creates a word with the same name that constructs the instruction and adds it to the current sequence.</li><li>Instructions which have a destination register have convenient constructors in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/hats/hats.factor;hb=HEAD">compiler.cfg.hats</a> which create a fresh virtual register, create an instruction with this register as the destination, and output the register. So for example, <code>1 2 ^^add</code> would create an add instruction with a fresh destination register, and output this register.
This is roughly equivalent to <code>0 1 2 ##add</code>, with <code>0</code> standing in for the freshly allocated destination register.</li><li>Instructions that use virtual registers must be added to the <code>vreg-insn</code> union, and must respond to the methods <code>defs-vreg</code>, <code>uses-vregs</code>, and <code>temp-vregs</code> in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/def-use/def-use.factor;hb=HEAD">compiler.cfg.def-use</a>. This 'def-use' information is used by SSA construction, dead code elimination, copy coalescing, and register allocation, among other things.</li><li>Methods have to be defined for the instruction in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/renaming/functor/functor.factor;hb=HEAD">compiler.cfg.renaming.functor</a>. This functor is used to generate code for renaming virtual registers in instructions. The renaming code is used for SSA construction, representation selection, and register allocation, among other things.</li><li>Instructions which use non-integer representations (e.g., operations on floats) must respond to the methods <code>defs-vreg-rep</code>, <code>uses-vreg-reps</code>, and <code>temp-vreg-reps</code> in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/representations/preferred/preferred.factor">compiler.cfg.representations.preferred</a>.</li><li>Instructions must respond to the <code>generate-insn</code> method, defined in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/codegen/codegen.factor;hb=HEAD">compiler.codegen</a>.</li><li>Instructions which participate in value numbering must define an "expression" variant, and respond to the <code>>expr</code> method defined in <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/compiler/cfg/value-numbering/expressions/expressions.factor;hb=HEAD">compiler.cfg.value-numbering.expressions</a>.</li></ul><br />As you can see, this is a lot of duplicated work. I used inheritance and mixins to model relationships between instructions and reduce some of this duplication by defining methods on common superclasses rather than on individual instructions where possible, but this never seemed to work out well.<br /><br />If you look at the latest versions of the source files I linked to above, you'll see that all the repetitive copy-pasted code has been replaced with meta-programming. The new approach extends the <code>INSN:</code> parsing word so that all relevant information is now specified in one place. There is also a new <code>PURE-INSN:</code> parsing word to mark instructions which participate in value numbering; previously this was done with a superclass. For example,<br /><pre>PURE-INSN: ##add<br />def: dst/int-rep<br />use: src1/int-rep src2/int-rep ;</pre><br />This defines an instruction tuple, an expression tuple, def/use information, representation information, a method for converting instructions to expressions, and a constructor, all at the same time.<br /><br />
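To give a sense of what this buys you, here is a rough, hypothetical sketch of the kind of hand-written boilerplate that the single <code>PURE-INSN:</code> definition above replaces. The generic words are the ones listed in the bullet points at the top; the <code>insn</code> superclass name and the exact method bodies are assumptions for illustration, and the real generated code also covers the expression tuple, the <code>>expr</code> method and the renaming methods:<br /><pre>! hypothetical hand-written equivalent, for illustration only<br />TUPLE: ##add < insn dst src1 src2 ;<br /><br />M: ##add defs-vreg dst>> ;<br />M: ##add uses-vregs [ src1>> ] [ src2>> ] bi 2array ;<br />M: ##add defs-vreg-rep drop int-rep ;<br />M: ##add uses-vreg-reps drop { int-rep int-rep } ;</pre><br />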
For the code generator's <code>generate-insn</code> method, not every instruction has a straightforward implementation; some, like GC checks and FFI calls, postpone a fair amount of work until the very last stage of compilation. However, for most instructions, this method simply extracts all the slots from the instruction tuple, then calls a CPU-specific hook in the <a href="http://gitweb.factorcode.org/gitweb.cgi?p=factor/.git;a=blob;f=basis/cpu/architecture/architecture.factor;hb=HEAD">cpu.architecture</a> vocabulary to generate code. For these cases, a <code>CODEGEN:</code> parsing word sets up the relevant boilerplate;<br /><pre>CODEGEN: ##add %add</pre><br />is equivalent to<br /><pre>M: ##add generate-insn [ dst>> ] [ src1>> ] [ src2>> ] tri %add ;</pre><br />This nicely cleans up all the repetition I mentioned in the bullet points at the top.<br /><br />I've been aware of this boilerplate for a while, but wanted the design of the compiler to settle down first. Now that most of the machinery is in place, I feel comfortable cooking up some complex meta-programming to clean things up. Adding new instructions should now be easier. I plan on adding some SSE2 vector operations soon, and this was the main motivation behind this cleanup.<br /><br />How would you do this in other languages? In Lisp, you would use a macro which expands into a bunch of <code>defclass</code>, <code>defmethod</code>, and similar forms. In Java, you might use annotations:<br /><pre>public class AddInsn extends Insn {<br /> @Def @Representation("int") Register dst;<br /> @Use @Representation("int") Register src1;<br /> @Use @Representation("int") Register src2;<br />}</pre>Slava Pestovhttp://www.blogger.com/profile/02768382790667979877[email protected]0