Monday, December 22, 2008

My employer decided to buy me a Google Android phone for XMas. I appreciate the gesture. Thank you.

I'm not a 'gadget' person. I've never owned a pager. I carried a cell phone for all of a week before I got fed up with it. I've never had a PDA, Game Boy, or PSP. I had an HP-34C in high school, but that was quite some time ago. So what am I going to do with an Android Phone? Unfortunately, there are some strings attached to the gift, so I can't sell it. (But I can complain about it. If you attach obligations to a `gift', it is no longer technically a gift.)

So what's this gadget do?

Android vs. Poke in the Eye with a Sharp Stick

Android wins here. In fact, I'd probably prefer to be poked in the eye with an Android. On the other hand, the sharp stick would work best if I wanted to poke someone else's eye. I guess it's safest to say that an Android phone is not interchangeable with a sharp stick.

Android vs. Nothing at All

This is more of a toss-up. Nothing At All has extremely low maintenance, is trivial to upgrade, and is easy to replace if lost. It suffers a bit on the functionality scale, but it works as advertised and precisely satisfies its requirements (two things that are almost always found wanting in high-tech gadgets). The last upgrade was early last century when they added vacuum energy, but prior to that, Nothing At All has been pretty stable for a long time.

Android can do a fair imitation of Nothing At All if you don't take it out of the box. There are two minor issues: it is more opaque, and one can still lose it. But if you amortize these drawbacks over the time you don't have to deal with it because it is in its box, these are fairly negligable. Nonetheless, I think Nothing At All has a slight advantage here.

But we haven't looked at the applications for which Android is *intended* to be better than Nothing At All.

First off, Android is a cell phone. Phones are good if you want to talk to somebody. I don't. However, it does seem that you can use Android even without having the cell-phone SIM card. But the phone is careful to tell me every time I turn it on that it still has no SIM card. I guess I won't have to worry about people surreptitiously installing unwanted SIM cards.

So what else does it do? If I'm not using it as a phone, that makes the `dialer' and `call log' rather pointless. It can tell me the time. Never mind, I already have a watch. It has a web browser, that's good if I don't have my computer. But I have a computer at work and at home, so I'd only use Android if I wanted to browse the web, forgot to bring my computer, but remembered to bring Android.

Come to think of it, how does Android stack up to my laptop?

Android vs. Laptop

Android is smaller. It weighs less and fits in my pocket, so it is easy to carry around. But there are drawbacks. The screen is no bigger than the device itself. The keyboard on my laptop is cramped enough to be uncomfortable to use for long periods, so the Android keyboard is so tiny as to be *almost* (but not quite) unusable. I'm amused that the cover slides open to reveal the keyboard, but the engineer in me wonders how many times I can do that before it breaks. Come to think of it, they should have put the keyboard on the slide-out part. I could still use a good chunk of the system if the keyboard broke off, but the entire thing is useless if the display breaks off.

I use my computer for email, work, surfing the web, and general hacking. Android slurped up a copy of my email, so I can read it with the phone, and I can browse the web, but I can't use it for work or general computer hacking. It doesn't run emacs or have a self-hosting development environment. Come to think of it, the keyboard is so small that I can't actually compose or answer email, and the web simply isn't useful at 480x320.

In this contest, the laptop wins hands-down.

At Least I Can Program It

Are you kidding?! Did you look at the SDK?

Wednesday, November 12, 2008

You want to know what I think? I'll tell you what I think. Here's what I think: Java Java Java is is is too too too damn damn damn verbose verbose verbose. That's what I think. And I'm sticking to it. So there.

You want to know what I think? I'll tell you what I think. Here's what I think: There's There's There's way way way too too too much much much boilerplate boilerplate boilerplate in in in Java Java Java. That's what I think. And I'm sticking to it. So there.

You want to know what I think? I'll tell you what I think. Here's what I think: If If If Charles Charles Charles Dickens Dickens Dickens had had had designed designed designed a a a language language language, it it it would would would be be be a a a lot lot lot like like like Java Java Java. That's what I think. And I'm sticking to it. So there.

I was writing a small helper class to correctly paste together the CGI arguments to an http service. The idea is simple. Rather than simply appending strings in a kludgy way, you'd collect a number of instances of the objects you want to pass (mostly integers, text, and enums) and the code would magically marshal them in to a correctly escaped URI. No big deal.

Two days later...

Sixteen hundred lines of code distributed in twenty-nine files. Is it any wonder that people get carpal tunnel syndrome? And this is mostly `throwaway' code that abstracts only the small amount of functionality I need. I don't even want to think how awful it would be if I had tried to engineer a full-blown class-based utility.

Of course I was asked “Why didn't you just use the mumble class?” A few reasons, but the primary reason was that the suggested class worked like this:

    FooBuilder fooBuilder = new fooBuilder();

    FooPartBuilder fooPartBuilder = new fooPartBuilder();
    fooPartBuilder.addName ("fooPartName");
    fooPartBuilder.addValue ("fooPartValue");

    FooPart fooPart = fooPartBuilder.createPart();
    fooBuilder.addPart (fooPart);
    .... etc. etc. ...

I wanted something like this:

    Foo myFoo = new Foo (new FooPart ("fooPartName", "fooPartValue"),
                         new BellOption (BellOption.Sound.Tinkle),
                         new WhistleOption (), ...)

What's the difference? As I was skimming the code for FooBuilder (and the code for AbstractFoo, FooInterface, AbstractFooBase, etc...) I saw a note that mentioned ‘this isn't thread safe’. Great. Another thing to think about. The problem with the FooBuilder model is that it is just lousy with state. It starts empty and then you smash some state into it. This is stupid. It begins in a state that no one wants it to be in, so the first thing everyone does is smash the state to something more useful.

They way I wanted to do it was to simply construct the object in its final configuration. The object is immutable and stateless. It is guaranteed to be thread safe because there is no way to modify it. The JVM could keep separate copies all over the place for all I care. And because there are no mutators, the code is about 30% smaller. That's 30% fewer javadoc comments that replicate the argument lists.

Another gripe: Having a database that keeps track of where the source code resides is a good idea. Using the file system as a ‘poor man's database’ for that purpose is a bad idea. To give a concrete example, I wanted to say, formally, that ‘every optional CGI argument must provide a marshaling function’. Simple statement. Here goes:

    public abstract class OptionalCgiArgument {
        abstract String marshal() throws MarshalingException;
    }

But it should be in a file all by itself, which means a copyright notice, a package declaration, and, if I'm being pedantic, a Javadoc comment.

    // Copyright (c) 2008

    package com.foo.bar.baz.utility.api.helper;

    /**
     *  Abstract base class for OptionalCgiArguments.
     *  Ensures that all OptionalCgiArguments provide
     *  a marshaling method.
     *
     * @author jmarshall
     */
    public abstract class OptionalCgiArgument {
        abstract String marshal() throws MarshalingException;
    }

Did I forget the unit test?

Tuesday, November 4, 2008

I haven't dropped off the face of the planet, but I don't have much to report. Mostly, I've been hacking on this Scheme interpreter off and on trying to figure out how to minimize the amount of variable lookup. I have a plan, but it is a bit complex, so I'm writing some Scheme code to generate some C# code that will work in the interpreter.

Thursday, October 9, 2008

Now that we have static environments, we can take advantage of the known locations of lexical variables. For instance, if the variable is bound in the immediately enclosing lambda, we can simply fetch it because it cannot be shadowed. If a variable is completely free in a fragment of code, that is, if it is a `global' or `top-level' variable, we can grab the value cell at load time and simply dereference it to evaluate the variable.

Before loading or evaluating a piece of SCode, we make a pre-pass over it to `compile' the variable lookup. (We could do this at lambda-construction time, but then we'd have to make multiple passes over deeply nested lambda expressions. Nonetheless, multiple passes are much simpler to understand. When contructing a lambda, we'd walk the body and replace all the lambda-bound variables that are free in the body with variables that `knew' their lexical depth. And it is unlikely that the lexical depth will exceed a very small number, so the multiple passes wouldn't add up. On the other hand, you would still need a final pass just before evaluation, and if you were too simple-minded about multiple passes you'd end up with an exponential growth factor.)

As I was saying, we make a pre-pass over the code to `compile' the variable lookup. We create a LexicalMap that holds a model of the runtime environment, and as we walk the code, we simulate the variable lookup against the lexical map. As we simulate the lookup, we note how deep we have to search, whether there will be incremental bindings in any of the frames, and where we ultimately find the variable (if at all). If we find the variable, we look at the path we used to get at it and create an instance of a specialized variable class that performs that particular style of lookup at runtime. If we don't find the variable, we create an instance of a specialized variable class that performs a deep search.

Wednesday, October 8, 2008

Look! Up in the environment!

Variable lookup is right up there with continuation management as an important performance bottleneck. In one benchmark, I measured 2.9 billion evaluations of which 991 million (34%) of them were variable lookups. A good chunk of compiler optimizations revolves around getting the variables to where they need to be as quickly as possible and keeping them in easy to find places.

My interpreter uses the standard `environment model' of evaluation where a linked chain of lexical environment chains are maintained during evaluation. Every variable lookup goes through a `deep search' of the frames looking for a binding with a matching name. Needless to say, this is not blazingly fast. There are two parts to solving this. First is to reduce the need for the deep search, second is to optimize the fast path.

MIT Scheme supports first-class environments. The special form the-environment returns the lexical environment at the point of invocation. The returned environment can be used as an argument to eval and that's where the trouble begins. If eval is called, the user is going to expect that the code will run in a context with the same bindings visible as where the environment was captured. This means we cannot move or collapse the environment frames because it may cause changes visible to the user. Even worse, if the user evals a define expression, or loads a file, it may inject bindings that were not previously visible. This means we cannot even cache the locations of the variables. The flexibility of first-class environments comes with a heavy cost. On the other hand, this flexibility is hardly ever used, so we end up paying this cost to no advantage.

So how do we fix this? First, because (the-environment) is the only documented way to obtain a first-class environment, we can walk the SCode and determine whether a first-class environment will be needed. I do this when I construct an SCode lambda expression. If the body contains a call to (the-environment), I construct a StandardLambda object, but if it does not, I construct a StaticLambda object. A StandardLambda, when evaluated, constructs a StandardClosure. A StandardClosure, when applied, constructs a StandardEnvironment. A StandardEnvironment must be layed out exactly as specified by the source code and variable lookups must do a deep search because the user might inject bindings via define or load.

A StaticEnvironment however, has an important property: because the user cannot use load or eval with it (he can't name the object to pass it as an argument, so this does not create a user-visible restriction), we know that there will not be incremental definitions that could hide the bindings in the environment. We know the lexical depth and offset of all variables bound in a StaticEnvironment.

So the first task is to identify and separate StandardLambda and StaticLambda instances.

Monday, October 6, 2008

Weird bug

I found (to some extent) the bug that has been vexing me for the past couple of weeks. It is obscure and I can't think of a good reason that things would work this way, but I do have a reliable way to avoid the bug.

The symptoms were this: After booting Scheme and getting to the REPL, forms typed in the REPL would be incorrectly rewritten with uninterned symbols appearing in body of the form. Naturally, these would not match the bindings of the interned symbols appearing in the binding lists, so unbound variables would be raised.

I first changed the way uninterned symbols printed so I could easily identify them. This showed a second symptom: the forms typed to the REPL would not be rewritten the same each time. Retyping the form to the REPL would give a different answer on the second rewrite.

Despite these problems, enough of Scheme is working to load and initialize the world and get to the prompt. This takes some 6 million evaluations.

A little poking around led me to the guts of the syntaxer. This is the component of MIT Scheme that builds the abstract syntax tree from s-expressions. It has two renaming phases: first, it renames all identifiers to uninterned symbols. This has the effect of alpha-renaming the bound variables so that macro expansion cannot inadvertently capture bindings. After everything is expanded, the second renaming phase renames the indentifiers back to their original names provided that it does not introduce a naming conflict. If there is a naming conflict, a non-conflicting identifier is generated for the name.

The renaming information is kept in a set of hash tables called the rename-database. What appeared to be happening was that during the second phase the code either thought that there was a conflict in renaming the variables in the body of some lambda expressions, or it could not find the entry in the rename-database that contained the original name. Either explanation could account for it. I suspected a problem with hash tables.

Now the first baffling thing is this: most of MIT Scheme is written in Scheme itself, so the hash table implementation is really only dependent on some very trivial arithmetic. Furthermore, since the hash code is computed by calling a primitive procedure that I wrote, I was able to ensure that all symbols would hash to the same bucket, thus turning hash-table lookup into a linear search for symbols. This depends only on checking EQ-ness of symbols.

The CLR has a notion of interned and uninterned immutable strings. These are semantically equivalent to symbols and I used them as such. I thought that perhaps some uninterned strings were being inadvertently interned or that perhaps some strings I thought were interned were not. I added a two-character prefix to my uninterned strings whenever I created them and I added code that checked that if the two characters were present, the symbol was uninterned, and if absent the symbol was interned. Furthermore, I checked that no interned symbol had the prefix and no uninterned symbol was missing the prefix. I performed this check on every call to EQ and on every call to a Scheme primitive that manipulated symbols. I even put a test in the variable lookup code to check the state of any symbol that may have been placed in a variable.

This code showed clearly that I was correctly managing the uninterned and interned symbol sets. No symbol accidentally migrated.

The addition of the prefix characters helped when stepping through the code. It was easy to see which symbols were uninterned and which were interned by simply looking for the prefix. Unfortunately, it is hard to distinguish between two different uninterned symbols with the same name (and there cannot be two interned symbols with the same name). I could see a few cases where two uninterned symbols with the same name were being compared and the answer indicated they were different symbols.

So to make it easier for me to tell which uninterned symbols were which, I changed the Scheme primitive that creates uninterned symbols to add a unique prefix to each one.

The bug went away.

`Aha!' you say, it *was* a case of accidentally confusing two symbols. Well, I *thought* that, too, but it is weirder than that. If giving each uninterned symbol a unique name solves the problem, then we seem to have this kind of bug:

Two symbols that are NOT identical (occupy different locations in memory) are determined to be identical because they contain the same sequence of characters.

Now I suspected a bug of this sort fairly early on in the debugging process, so I was a bit confused why I didn't find it earlier. (After all, it seems obvious)

Here is an obvious fact: the number of uninterned symbols is strictly equal to or less than the number of calls made to the primitive that creates them. There is exactly one routine in my interpreter that creates uninterned symbols, so each one must come from a call to that routine. Of course I had put a breakpoint at that routine so I could see who was generating these symbols.

The call to make-uninterned-symbol happens exactly *once* per identifier. This, too, make sense: you only need one uninterned name per identifier.

So if we are comparing two symbols that are NOT identical, where did the second one come from?

Ah, you say, maybe it was left over from a previous expansion. As it so happens, I instrumented EQ quite heavily because I wanted to be sure I knew what was what. When I use the code that creates each symbol uniquely, I can see what the uninterned symbol is compared against. It is compared against itself and the original interned symbol. Under no circumstances did I see a comparison with any uninterned symbol *other* than itself. That makes sense because I was working with very simple lambda expressions that had only one bound variable.

So if we are creating only *one* uninterned symbol per binding, and only checking against that one uninterned symbol, how could we confuse them?

The explanation I'm stuck with is this: the CLR may not be guaranteeing that uninterned symbols have reference semantics. That is, the `identity' of an uninterned string (as defined as the location in memory where the bits are) might not be preserved, but can even *split* (two strings that were once EQ become not EQ). I can't think of why an implementation would do this, but it does explain what I am seeing. (I'm thinking there might be an explanation involving cross-generational pools of uninterned strings, but I dunno.)

The workaround is trivial: I no longer use uninterned strings as `naked' object but wrap them in a trivial class object. Since class instances are defined to have reference semantics, there is no problem with using object-equality on them.

I suppose I can try to write a program that demonstrates this issue, but I'm getting tired of debugging it and I have a workaround. But this is one weird bug.

Wednesday, October 1, 2008

A curious bug

I've been hacking away on my Scheme system, but I'll give a full update later on. I've encountered a curious bug, though. When I get to the REPL and type (let ((x 3)) x), I get an unbound variable error. It seems that the syntaxer (the part of MIT Scheme that expands macros and converts the s-expressions into s-code) is renaming the bound variable x.

What is curious about this bug is the specific nature of it. Since my version of MIT-Scheme is still incomplete, I often run into missing primitives or as-yet-unimplemented behavior. This just drops me into the C# debugger. But in order to cold-load and initialize Scheme from an empty image, I needed to get a lot of stuff working correctly. Scheme does some 6 million evaluations during the cold load, so I know that large portions of the language work just fine. The unimplemented portions just crash, so it is somewhat baffling that I've encountered something that is so obviously incorrect, yet correct enough to allow booting and even to attempt evaluation.

My guess right now is that there is something slightly wrong with hash tables. It's just a hunch, though. I have tried some simple tests from the REPL, and they look ok. Unfortunately, there is a lot of evaluation going on between hitting the return key and actually running `eval', so I haven't been able to find the problem by tracing the interpretation.

Friday, September 12, 2008

Correction

“Hamiltonian graphs cannot be expressed in existential monadic second order logic because palindromes are not a regular language.”

Apparently, Keith Ellul immediately realized that without MONADIC the statement is false, because it is “trivial” to see that hamiltonian graphs can be expressed in existential second order logic.

I can't wait to tell the guys at the pub.

Did you know?

Hamiltonian graphs cannot be expressed in existential second order logic because palindromes are not a regular language.

I have to say that is one of the most obscure sentences that I've seen in a long time.

Monday, September 8, 2008

In ‘Continuations from Generalized Stack inspection’, and in the Addendum, I showed how to abuse the exception handling mechanism to implement call-with-current-continuation on a virtual machine that doesn't allow access to the stack. The problem with the method is that exception handling is astoundingly slow. Nearly any other method of handling continuations will outperform it.

The key to the method is noticing that the method associated with each stack frame has full access to its own frame even if the other methods do not. The exception mechanism was a way to transfer control back to method, but to a segment of code that unloads the stack rather than continues the computation. This is quite similar to what we just did with the calling conventions to EvalStep. The machinery necessary to handle first-class continuations is an almost trivial addition. We create special singleton object and test the return value for this object. If we see it, we unload the current frame and return. This propagates down the stack until we reach the bottom and the stack is empty. As an example, the code for interpreting a conditional expression is this:
public override bool EvalStep (out object answer, ref SCode expression, ref Environment environment)
{
    SCode unev = this.predicate;
    Environment env = environment;
    while (unev.EvalStep (out answer, ref unev, ref env)) { };
    if (answer == Interpreter.UnwindStack) {
        UnwindState xxx = <get unwind state>
        xxx.AddFrame (new ConditionalState (this, environment));
        return;
        }

    expression = (answer is bool) && (bool) answer == false
        ? this.alternative
        : this.consequent;
    return true;
}
The special code has a simple task. It should save any relevant state out of the current stack frame into a data structure, then add that data structure to the list of reified stack frames, and finally return the ‘Interpreter.UnwindStack’ object itself.

Where do we keep the reified stack frames? We'd like to pass it down from the frames above, but it seems a waste to allocate yet another local variable and ref argument for this purpose because they are used so rarely. We use the ‘Wolf in Sheep's clothing trick’ to accomplish this.
class UnwindState : Environment
{
   <whatever needed for unwinding>
 
   <a bunch of virtual methods that
    make the compiler think we handle the ‘Envionment’ 
    protocol, but just raise errors>
}
And now we can stuff the unwind state in the Environment:
     
public override bool EvalStep (out object answer, ref SCode expression, ref Environment environment)
{
    SCode unev = this.predicate;
    Environment env = environment;
    while (unev.EvalStep (out answer, ref unev, ref env)) { };
    if (answer == Interpreter.UnwindStack) {
        ((UnwindState) env).AddFrame (new ConditionalState (this, environment));
        return;
        }

    expression = (answer is bool) && (bool) answer == false
        ? this.alternative
        : this.consequent;
    return true;
}
Our calling convention is a fair amount more complicated than it was. Each caller to EvalStep must wrap the call in a while loop, then it must check for the answer being ‘Interpreter.UnwindStack’, and if so, handle the stack unwinding via the environment variable. Not pretty, but it works.

Continuations in my version of Scheme are no longer allocated in the heap, but on the stack where we can take advantage of the hardware. Tail recursion is preserved via an unusual calling convention. However, the code is able to load and unload the stack, so call-with-current-continuation works correctly.

With these changes, the C# version of the MIT-Scheme interpreter now outperforms the original C version.

Thursday, September 4, 2008

Now it starts getting interesting.

I have a Scheme system that can interpret the MIT-Scheme S-Code, but it doesn't use the underlying MIT-Scheme runtime system. It can read the fasl binaries that MIT Scheme produces, but the object format is whatever the .NET system is providing. How does it perform? Not particularly well.

It's pretty easy to figure out why. Even the simplest piece of code is creating and invoking continuations at an enormous rate. Each continuation is being allocated on the heap. Now it is the case with modern generational garbage collection that heap allocation has almost no cost, but even the very tiny cost of allocation adds up quickly when you do it for nearly every node you traverse in the abstract syntax tree.

But the real problem is this: the CPU has specialized hardware that is specifically designed to efficiently allocate, deallocate and manipulate continuations, and we aren't using it. Again, it is sitting there managing the continuations used by the virtual functions in our interpreter, while we use the main data paths and the memory management software library to mimic it's functionality for the virtual machine. In fact, we go out of our way to get it to discard these useless continuations so we can preserve tail recursion.

Why do we do this? First, the C# continuations don't quite do what we want. C#, like most other languages, unconditionally allocates continuations on procedure entry and deallocates them on exit. Scheme, on the other hand, allocates continuations only if they logically needed to regain control. If a routine has completed its work and is simply going to call another routine to get the answer, it just jumps there and the callee transmits the answer directly back to whatever routine initiated the call chain.

Second, C# continuations are too opaque for our purposes. Scheme's call-with-current-continuation primitive reifies the continuation — it takes the stack and turns it into a first-class object that can be manipulated by the user. C#, like Java, does not permit programs to randomly access the stack and certainly doesn't let programs unload and reload the contents of the stack.

But there are techniques to get around these limitations.

Let me describe how I solve the first problem.

The EvalStep method in my interpreter originally had this signature:
void EvalStep (Interpreter interpreter);
and a node in the abstract systax tree would do something like this:
 sealed class Conditional : SCode, ISystemHunk3
 {
     readonly SCode predicate;
     readonly SCode consequent;
     readonly SCode alternative;

     override void EvalStep (Interpreter interpreter)
     {
         interpreter.Continuation =
             new ConditionalDecide (interpreter.Continuation, 
                                    this,
                                    interpreter.Environment));
         interpreter.Expression = this.predicate;
         return;
     }  }
Note how the new continutation takes as one of its arguments the existing continuation. We put that into the interpreter continuation register, set the interpreter expression register to the predicate of the conditional and then return control to the caller (which is the trampoline).

The point is to evaluate the predicate and pass the answer on to the newly created ConditionalDecide continuation. It has code like this:
override void Invoke (Interpreter interpreter, object value)
{
    interpreter.Continuation = this.parent;
    interpreter.Environment = this.environment;
    interpreter.Expression = (value is bool && (bool) value == false)
                               ? this.expression.Alternative
                               : this.Expression.Consequent;
    return;

These two code fragments are part of a single logical routine that is more easily written like this:
override void EvalStep (Interpreter interpreter)
{
    object value = this.predicate.EvalStep (interpreter);
    if (value is bool && (bool) value == false)
        return this.expression.Altenative.EvalStep (interpreter);
    else
        return this.expression.Consequent.EvalStep (interpreter);
}
This is not only easier to write, it performs much better because we are co-opting the continuations that are used for C# subroutines for our own sub-evaluations. Unfortunately, however, the tail calls to EvalStep are allocating a continuation that destroys tail recursion. We need to transfer control to one of these routines without allocating a new continuation.

Here's the trick: we'll get the caller to do the transfer for us. We'll return to the caller with some indication that we're not really done, but that we intend for someone else to finish our job. Essentially, we are taking some of the code that logically belongs in our routine and placing it in the caller. This isn't possible in general because the caller is unknown and the callees can be arbitrary, but in the context of the interpreter, we have an extraordinary special case: Nearly every call that is necessary to be tail recursive is a call to EvalStep. (There are a couple of primitives that call back into the interpreter that we'll deal with in a bit.)

We'll define a new calling convention that requires the caller to handle the transfer to EvalStep for us. It's a little unorthodox, but no more so than requiring every routine to return to a trampoline. The new signature of EvalStep will be this:
    bool EvalStep (out object answer, ref SCode expression);
The boolean return value indicates whether we've returned control with an answer or whether we've returned with a new expression we wish to tail call. The caller must invoke EvalStep like this:
    SCode expr = <expression to evaluate>;
    object answer;
    while (expr.EvalStep (out answer, ref expr)) { };
    // When the while exits, we have an answer.
And the callee should do one of two things. Either assign an answer and return false, or assign a new expression and return true. The EvalStep handler for the conditional expression above would therefore look like this:
override bool EvalStep (out object answer, ref SCode expression)
{
    SCode  expr = expression.Predicate;
    object predicateValue;
    while (expr.EvalStep (out predicateValue, ref expr)) { };

    if (predicateValue is bool && (bool) predicateValue == false)
        expression = expression.Altenative;
    else
        expression = expression.Consequent;

    return true;
}
This isn't quite the whole story. Evaluation needs an environment and that is missing. The EvalStep routines will need to provide this to the caller as well, so it will handled with a ref argument, too. Here is the code for conditionals:
bool EvalStep (out object answer, ref SCode expression, ref Environment environment)
{
    SCode expr = this.predicate;
    Environment env = environment;
    while (expr.EvalStep (out answer, ref expr, ref env)) { };

    expression = (answer is bool) && (bool) answer == false
        ? this.alternative
        : this.consequent;
    // request tail recursion.
    return true;
}
Here is the code for variables:
override bool EvalStep (out object value, ref SCode expression, ref Environment environment)
{
    if (environment.DeepSearch (out value, this.name))
        throw new UnboundVariableException (this.name);
    // actually returning a value
    return false;
}
If you look at this code you'll notice that what we are doing is severely abusing the language features to change the calling conventions. We allocate some stack for EvalStep in the caller's frame. We then pass control to EvalStep and give it the addresses of the allocated variables. EvalStep eventually returns control to us, but it isn't logically finished. Recall that we've moved some of the code in the callee to the caller. If the return value is true, the while loop performs a method call to EvalStep on the new value of expr. This is the call that the callee would have performed, if he was allowed to clear the stack prior to the call. Since he is not allowed to do this, we just move the code around a little so that it runs after the return.

With the code in this form, the interpreter continuations and the C# continuations are one and the same object, and we can take advantage of the processor's stack hardware for performance. This solves the first problem we have with C# continuations. The second problem is harder to solve.

Tuesday, September 2, 2008

S-code

MIT Scheme has an internal format for code called ‘S-Code’. This is a set of data structures that represent the abstract syntax tree of the language with all the surface syntax expanded away. There are nodes for conditionals, nodes for literal values, nodes for variables, nodes for lambda expressions, etc. Each node has a particular evaluation rule. The interpreter traverses the syntax tree and evaluates each node according to the appropriate evaluation rule.

MIT Scheme uses a rather old high-tagged pointer format. This is a legacy from the Scheme 81 chip. A machine word in MIT Scheme is 32-bits, but the top six bits encode the type of word. The remaining twenty-six bits may encode an immediate value, like a small integer (fixnum), or it may encode the address at which a larger data structure may be found. Twenty-six bits of address was more than ample in 1981, but it is quite small by today's standards.

The MIT Scheme interpreter is essentially an emulator for a Scheme 81-like chip (although it has evolved substantially). The core of the interpreter is a loop that reads the expression ‘register’, extracts the tag bits and dispatches (with a switch statement) to the code that implements the evaluation rule for the expression.

My implementation of MIT Scheme uses a different strategy. Each S-Code object is an instance of a C# class. The details of how the instance is represented in memory are unimportant in this implementation because we aren't relying on the ability to inspect the raw bits. The interpreter traverses the S-Code by invoking a virtual function EvalStep defined on the abstract base class. Every concrete S-Code class must provide a concrete method that implements the appropriate evaluation rule. Rather than a big switch statement, the interpreter simply calls EvalStep to dispatch to the appropriate routine.

This requires a few slight changes to the interpreter. In MIT Scheme, literal objects can be embedded directly within the S-Code. If you refer to a number in your program, that number appears directly in the abstract syntax tree. Since every object in MIT Scheme has a tag, the interpreter will dispatch to the `self-evaluating' code when it encounters one of these embedded objects. That code simply returns the object itself.

In my version, it must be the case that every object that the interpreter acts upon is derived from the abstract class SCode and has an EvalStep virtual method. The implementors of the .NET runtime did not have the foresight to add an EvalStep method to the object, so the interpreter cannot dispatch on those. There are several solutions, but the easiest is this: self-evaluating objects are never allowed to appear directly in S-Code, they must be wrapped within a Quotation node. The implementation ensures this automatically. When an S-Code node is initialized, any subfield that is not itself an S-Code object is wrapped within a Quotation.

MIT Scheme is naturally tail recursive. The underlying dispatch loop is a while statement. The stack provided by the physical machine is not used, so the code only has to manage a virtual stack.

While the .NET CLR has the ability to perform tail-recursive calls, the C# compiler makes no use of it. This means we have to go out of our way to implement tail recursion. There are several techniques for this, but the only realistic one that allows us to use the normal function calling mechanism is a trampoline. (Baker's Cheney on the M.T.A technique won't work because the CLR prohibits the creation of linked objects on the stack.) In my implementation, the trampoline looks like this:
    public sealed class Interpreter
    {
        SCode expression;
        Environment environment;
        Continuation continuation;

        void Run ()
        {
            while (true)
                this.expression.EvalStep (this);
        }
    }
Each ‘bounce’ through the trampoline fetches the contents of the expression register and invokes the EvalStep method. The interpreter object itself is passed to the method. The method performs it's actions and modifies the interpreter's expression register before returning to the trampoline. The various EvalStep methods can ‘return’ a value by passing it on to the continuation object.

There's not much really interesting here although there is one amusing observation. To preserve tail recursion, all calls have to go through the trampoline to empty the C# stack. However, when invoking a Scheme continuation, we do not have to use the trampoline because there are no infinite chains of continuations (so control eventually returns with finite resource usage). Therefore, the routine that implements the continuation is called like a normal function. We have this curious inversion: when calling a Scheme function, we perform a return in C# in order to pop the C# stack and transfer control via the trampoline, but when performing a return in Scheme (invoking a continuation), we use a C# call and push the C# stack.

Next, a heap is fast, but the stack is faster...

Threaded interpreter

I mentioned earlier that I was playing around with my own version of MIT/GNU Scheme. Part of the reason was to understand why interpreters are so much slower than compilers and where the ‘interpretation overhead’ comes from.

Virtual machines are largely like their physical counterparts. This is because of two reasons:
  1. we aren't that imaginative, so computer architectures have the same basic building blocks,
  2. it's easier to implement an instruction if you can just use the physical hardware.
But sometimes you can find a virtual machine performing an operation very similar to the physical one, but in a far less efficient manner. Take for example a simple byte-code interpreter. The byte codes are stored in a vector and are addressed by a virtual program counter. The main dispatch loop is like this:
  while (true) {
    byte instruction = instructionStream[programCounter];
    switch (instruction) {
        case NOP:
            programCounter += 1;
            break;

        case JUMP:
            programCounter = instructionStream[programCounter + 1];
            break;
        }
    }
This is pretty straightforward and we know it doesn't perform that well. We can blame branch prediction, locality, memory traffic, etc. But the real problem is simply this: the CPU has specialized hardware that is specifically designed to efficiently fetch and decode an instruction stream, and we aren't using it. It's sitting in the background repeatedly fetching and decoding the ‘while’ loop, but we're using the main data paths and the program logic to fetch and decode our virtual instruction stream. The physical program counter is being incremented automatically by a specialized adder, but the virtual one has nothing to do with it and is using the main ALU.

Why do we do this? First of all, the virtual instruction set is quite a bit different from the native instruction set, and second of all it is somewhat tricky to dynamically install new code in the running image, but quite easy to install new data. But people have come up with other techniques to deal with this problem.

One technique is JIT compilation. JIT compilation translates the virtual instruction stream to a physical instruction stream. In it's simplest form, it can simply concatenate the subsections of code that implement each virtual instruction. JIT compilers typically perform very few optimizations, yet the resulting code is substantially faster. This is because we are now co-opting the instruction fetch hardware on the CPU to fetch and decode our translated virtual instructions directly.

Threaded interpretation is a different tactic. Threaded interpretation acutally refers to three similar techniques. ‘Subroutine’ threaded interpretation substitutes each virtual instruction with a native CALL instruction to the routine that implements the virtual instruction. So the resulting instruction stream would look like this:
   call push_immediate 3
   call push_variable x
   call add
   .... 
This gives a pretty-good speedup because we are again using the instruction fetch hardware to do our dispatch, but we also end up using the stack hardware for no good reason. On the other hand, it is extremely simple to get working.

Direct threaded code works by embedding the actual code that implements the virtual instructions within the data structure that represents the source language. The data structures are laid out in memory with the data first followed by a code block. The data structures are referred to by the pointer to the code block. You interpret the code by literally jumping to it. The data structures in the source code have the literal address of substructure in them, so the native instruction fetching hardware will be traversing the data structures directly.

Direct threading has mixed performance. On some machines, this is very fast, but other machines may make use of the fact that code and data are not often directly intermixed. Modern hardware even raises an error condition if it detects this because it is symptomatic of the kind of self-modifying code that malicious software uses. Moving the data structures (for garbage collection) is also problematic because you have to re-link the code blocks that are embedded within them.

Indirect threading works by putting the address of the code that implements the virtual instructions (rather than the code itself) within the data structures that represent the source language. Rather than directly jumping into the data structure itself, you perform an indirect jump to the address stored in the data structure. The extra level of indirection causes some minor delay, but it is still quite fast and fairly easy to implement.

There is one more technique that I haven't seen a name for, so allow me to call it ‘virtual threading’. We use a slightly higher level language and implement our interpretation subroutines as ‘virtual methods’. Implementations of languages with ‘virtual methods’ typically have a table of methods that are shared by all instances of a class of data structure. The address of the table is stored somewhere in the data structure and invoking a ‘virtual method’ basically involves dereferencing the table pointer and performing an indirect jump. Since this involves two indirections, and a call and return, it is slower than indirect threading, but there are many advantages.

First, and primarily, we are once more using the instruction fetching hardware to fetch our instruction stream directly rather than using the main data paths. This improves the performance.

Secondly, by laying out our interpreter data structures in a format that is nearly ubiquitous these days, we can take advantage of all the software that has been designed to manipulate and manage these sorts of data structures. Additionally, we can more easily interoperate with third-party software because they expect foreign data to be layed out in this manner.

Third, we can use a high-level language like C# or Java to implement the interpreter without incurring too much of a performance penalty. It is very difficult to implement indirect threading in these languages, but virtual method dispatch is built-in.

Finally, because ‘virtual method’ calling is so ubiquitous, there is an incentive to the hardware manufacturers to ensure that it is efficient.

Teaser

I've did some interesting things over the long weekend. It's going to take a few blog posts to explain them fully, but I think this will be worth it.

Monday, August 18, 2008

Differential Benchmarking

I wanted to find out the fastest way to return multiple values in the Microsoft CLR. In order to measure this, I needed to have two pieces of code that `do the same thing' except for the way values are returned. For a baseline, I used the Takeuchi function:
        static int Tak (int x, int y, int z)
        {
            return (!(y < x))
                ? z
                : Tak (Tak (x - 1, y, z),
                       Tak (y - 1, z, x),
                       Tak (z - 1, x, y));
        }
Tak (18, 12, 6) makes 63609 function calls and 47706 subtractions.

This version uses an out variable to return the value:
        void TakOut (out int answer, int x, int y, int z)
        {
            if (!(y < x)) {
                answer = z;
                return;
            }
            int x1;
            int y1;
            int z1;
            TakOut (out x1, x - 1, y, z);
            TakOut (out y1, y - 1, z, x);
            TakOut (out z1, z - 1, x, y);
            TakOut (out answer, x1, y1, z1);
        }
But the version using an out variable has an additional argument. I wanted to separate the cost of passing an additional argument from the total cost of using an out parameter, so I wrote another version of Tak that took an extra dummy argument:
        static int Tak1 (int dummy, int x, int y, int z)
        {
            return (!(y < x))
            ? z
            : Tak1 (z, Tak1 (y, x - 1, y, z),
                       Tak1 (x, y - 1, z, x),
                       Tak1 (dummy, z - 1, x, y));
        }
These run pretty quick, so I put them in a loop that runs them ten times and measures the total time. I ran this loop a thousand times and recorded the results.

I didn't expect to find Tak1 to be substantially faster than Tak, so I did some investigating. Look at the graph for Tak: You can clearly see a change in performance (there are two vertical segments). I had forgotten that my machine was in power save mode and was dynamically tuning the CPU clock speed. Somewhere in there the machine decided to hit the gas. I put the machine in performance mode to max out the clock and got a much more expected result.
In this graph, Tak is the normal Tak function. Tak1a and Tak1b are defined as follows:
        static int Tak1a (int dummy, int x, int y, int z)
        {
            return (!(y < x))
            ? z
            : Tak1a (dummy, Tak1a (dummy, x - 1, y, z),
                            Tak1a (dummy, y - 1, z, x),
                            Tak1a (dummy, z - 1, x, y));
        }

        static int Tak1b (int dummy, int x, int y, int z)
        {
            return (!(y < x))
            ? z
            : Tak1b (z, Tak1b (y, x - 1, y, z),
                        Tak1b (x, y - 1, z, x),
                        Tak1b (dummy, z - 1, x, y));
        }
We can see from the graph that it takes an additional 8-10 ns to process the additional argument, depending on whether we change the value or keep it the same. I'll speculate that this is because the processor can notice the re-use of the argument and avoid a register rename or something along those lines.

So what about passing an out variable? From this graph we can see that returning an answer in an out variable makes the code a factor of 2 slower. Using a ref variable is slightly worse, but that may be because we have to initialize the ref variable before we use it. In either case, that's quite a hit.

It isn't likely that returning an object from the heap will be fast, but we ought to actually mesaure it. This is easily done by simply declaring the return value to be of type object. The CLR will be forced to box the object on the heap.
Surprisingly, this isn't all that much slower than using an out or ref variable. What about returning a struct on the stack? It seems that this is quicker than using an out variable.

If you look closely at these benchmarks you will see that the baseline performance of Tak varies a bit from run to run. This could be because of a number of reasons. It doesn't matter all that much because I was interested in the difference between the various mechanisms of returning values.

All the code above was run with these settings:
# LENOVO ThinkPad T61
# 2 processors
#     GenuineIntel
#     Intel(R) Core(TM)2 Duo CPU     T7700  @ 2.40GHz
#     x64 Family 6 Model 15 Stepping 11
#     2401 MHz
# Working set:  9969664
# Overflow checking is disabled.
# Compiled in RELEASE mode.
# Debugger is not attached.
# Stopwatch is high resolution.  14318180 ticks per second.
# 
# 2008-Aug-18 20:50:11
# Microsoft Windows NT 6.0.6001 Service Pack 1
# CLR 2.0.50727.3053
# 
# See http://eval.apply.googlepages.com/ for further information.
# 
# TakTest, Version=1.0.3152.23099, Culture=neutral, PublicKeyToken=null
# Debugging Flags: IgnoreSymbolStoreSequencePoints
# JIT Optimizer enabled
# JIT Tracking disabled
I've put the code online if you want to play with it.
I changed the search engine at the top of the page to search a whole bunch of Lisp oriented web sites in addition to the Hyperspec.

Monday, August 11, 2008

Interpreter Overhead

Consider this client/server model of computing. The server is a general purpose virtual machine/runtime system. The client is the program which makes ‘requests’ to the server and receives responses. And let's be completely abstract here. The client is capable of nothing but making requests. The server does all the work. If the client needs temporary storage for a variable, it requests that the server allocate it. If it wants to load a value into that variable, it requests that the server load the location. The client isn't much more than an instruction stream, but keep the image of an active, but extraordinarily primitive client in your mind.

The server, on the other hand, is amazingly powerful compared to the client, but totally ignorant. It simply does not know what to do or what it is doing. It just mechanically follows the requests of the client.

This is simply a description of the Harvard architecture, but I want to emphasize that each component --- the program (client) and the processor (server) --- sees the other as a black box. And although the processor is usually responsible for fetching instructions, I've made the client responsible for feeding them to the processor.

A general purpose processing machine combined with a program gives you a special purpose machine. Your x86 processor is general purpose, but Excel turns it into a spreadsheet processor. We can consider a program as a ‘meta-operator’ that transforms one machine into another. Instead of making requests directly to the machine, we make them to the program/machine combination. The program makes requests of the underlying machine and arranges for the underlying machine to deliver the answer in a useful format.

If you have a program written in a language that your machine doesn't natively understand you have two options to make it run. You can translate the program to a language the machine does understand, or you can translate the machine to a machine that understands the program. The first technique is compiling, the second is interpretation. An interpreter is a program that converts a machine in one language to a machine in another.

Interpretation almost always performs slower than compilation. The usual reason is given as ‘interpreter overhead’. But what is interpreter overhead? If we look at the model I described above, we have this situation if we have compiled our program:
                    compiler
           program ---------> program'

               +--------------------
               | program' | machine
               +--------------------
But if it is interpreted, we have this situation:
           +---------------------------------
           |         +-----------------------
           | program | interpreter | machine
           |         +-----------------------
           +---------------------------------
Obviously the extra layer introduced by interpretation is a problem. But in our model, the client doesn't do anything. The machine is ultimately the thing that takes action. The client is essentially passive. The only reason that things would take longer is because the machine is doing more work in the second situation.
           +---------------------------------------
           |         +-----------------------------
           | program | interpreter | machine
           |         |             |  + excess work
           |         +-----------------------------
           +---------------------------------------
The interpreter must be making more requests of the machine than the compiled program. (I never miss the obvious.) That's the interpreter overhead.

Let's turn this around. Now we're take the point of view of the machine and we are getting instructions fed to us. The picture is like this:
            -------------------+
                       program | machine
            -------------------+
or perhaps it is like this:
                     +-------------+
           ----------+             |
            program  | interpreter | machine
           ----------+             |
                     +-------------+
Another picture:
                     +--------------+
           ----------+              |
            program  | interpreter  |   machine
           ----------+              |
                     +--------------+
                          |    |
                     +--------------+
                     | co processor |
                     +--------------+
I've attached an ‘interpreter co-processor’ to the picture. What does it do? It handles all the ‘interpreter overhead’. If the interpreter needs some temporary storage to hold a variable, or if it has to maintain a lookup table to find subroutines, the co-processor is the machine that handles that. On the other hand, if the interpreter needs to perform an addition because the user's program calls ‘+’, this is given directly to the machine. The co-processor is only there to help the interpreter, not to perform any work requested by the program.

From the machine's point of view, the only way we can tell if two programs are different is if the series of instructions being fed to us is different. We don't really know, or even care what is going on outside. We're just following orders, whether they come directly from a program or via an interpreter (which, after all, is just another program as far as we're concerned), or whether the interpreter has a funny co-processor attached.

Suppose we have two programs, X and Y, that have only a small difference between them. They both feed the exact same instructions in the exact same order to the machine, but program Y uses the interpreter co-processor half as much as program X. Since the instruction stream to the machine is identical in both, it should be the case that both programs ‘do’ the same thing. The machine is the only component that can take action, and you've told it to take the same action in both cases. But program Y uses fewer resources than program X. By making fewer demands of the co-processor, perhaps our machine can be cheaper or run cooler. The instructions that are given to the co-processor are not essential to running the program.

Our machine is universal, so we really don't need to buy a separate co-processor. We can emulate it.
                     +--------------+
           ----------+              +--+---------+
            program  | interpreter  |  | machine |
           ----------+              +--+--+---+--+
                     +--------------+     |   |
                          |    |          |   |
                     +--------------------+   |
                     | co processor emulator  |
                     +------------------------+
How does this differ from the picture that looked like this?
                     +-------------+
           ----------+             |
            program  | interpreter | machine
           ----------+             |
                     +-------------+
It doesn't, except that I can point at the instructions coming through the ‘emulated co-processor’ and identify them as ‘interpreter overhead’. And as I mentioned just above, those instructions aren't essential to running the program.

If we are not trying to ‘optimize’ the program, then a compiled program should be identical to an interpreted one, but without as much ‘interpreter overhead’. The overhead comes from the fact that we are emulating some of the behavior that the interpreter relies on. Compilation wins because we remove as much of the need for the emulation as we can. But some machines these days are pretty sophisticated. Do we need to emulate the desired behavior, or can we co-opt the machine's built-in behavior to approximate the desired behavior?

Saturday, August 9, 2008

Real life nastiness

Here's an example I stole from the support code for PLT Scheme. This function seems to check whether we've hit a symbol delimiter or comment.
int wxMediaStreamIn::IsDelim(char c)
{
  if (scheme_isspace((unsigned char)c))
    return 1;
  else if (c == '#') {
    long pos;
    char next[1];
    pos = f->Tell();
    f->Read(next, 1);
    if (next[0] == '|') {
      f->Seek(pos - 1);
      return 1;
    } else {
      f->Seek(pos);
      return 0;
    }
  } else if (c == ';') {
    long pos;
    pos = f->Tell();
    f->Seek(pos - 1);
    return 1;
  } else
    return 0;
}
This function might be skipping a token in the input stream. I'm not really sure.
void wxMediaStreamIn::SkipOne(int recur)
{
  char buf[1];

  if (recur) {
    buf[0] = '#';
  } else {
    SkipWhitespace(buf);
  }

  if (!bad) {
    if (buf[0] == '#') {
      /* Byte string */
      if (f->Read(buf, 1) == 1) {
 if (buf[0] != '"') {
   bad = 1;
   BAD_PRINTF(("bad 12\n"));
 } else {
   while (1) {
     if (f->Read(buf, 1) != 1) {
       bad = 1;
       BAD_PRINTF(("bad 13\n"));
       break;
     }
     if (buf[0] == '"') {
       break;
     } else if (buf[0] == '\\') {
       if (f->Read(buf, 1) != 1) {
  bad = 1;
  BAD_PRINTF(("bad 14\n"));
  break;
       }
     }
   }
 }
      } else {
 bad = 1;
 BAD_PRINTF(("bad 15\n"));
      }
    } else if (buf[0] == '(') {
      /* List of byte strings */
      while (!bad) {
 do {
   if (f->Read(buf, 1) != 1) {
     bad = 1;
     BAD_PRINTF(("bad 16\n"));
     break;
   }
 } while (!IsDelim(buf[0]));
 if (buf[0] == ')')
   break;
 else if (buf[0] == '#') {
   SkipOne(TRUE);
 } else {
   bad = 1;
   break;
 }
      }
    } else {
      /* Number */
      do {
 if (f->Read(buf, 1) != 1) {
   bad = 1;
   BAD_PRINTF(("bad 16\n"));
   break;
 }
      } while (!IsDelim(buf[0]));
    }

    if (!bad && !recur)
      IncItemCount();
  }
}
Now suppose we wanted to change this code by in-lining the calls to IsDelim. This is not a trivial task. In fact, the calls to BAD_PRINTF seem to indicate that the author of the code had no easy time getting to this point. The duplication of code blocks shows that some in-lining has already been done.

Now consider this code. It is part of a compiler. It establishes an escape continuation upon entry to a block, and if possible, optimizes the code if it can determine either that the continuation is not used or that the call to the continuation is at the same dynamic depth as the continuation.
  (define (bind-escape-continuation variable-name binder-type)
    (lambda (block receiver)
      ;; Grab a continuation to return to if necessary.
      (let* ((block (block/make block '()))
             (cont-var (binding/variable (variable/make&bind! block variable-name #f #f)))
             (result (receiver block)))
        (cond ((and (non-local-exit? result)
                    (eq? (non-local-exit/variable result) cont-var)
                    (not (variable/referenced? cont-var (non-local-exit/body result))))
               (evil-message "Optimize:  Coalesce catch/throw" cont-var)
               (non-local-exit/body result))
              ((not (variable/referenced? cont-var result))
               (evil-message "Optimize:  Remove binding for non-local-exit" cont-var)
               result)
              (else (make binder-type
                      :block    block
                      :variable cont-var
                      :body     result))))))
And this function compiles a for statement. An escape continuation is necessary to handle calls to break and continuue.
  (defmethod (compile-expression (target <void-target>) block (statement <c-for>))
    (cons-expression
     block
     (compile-expression (make <effect-target> :parent target) block (for/initialize statement))
     ((bind-escape-continuation 'break <compiled-bind-break>)
      block
      (lambda (break-block)
        (let* ((for-block (block/make break-block '()))
               (for-var   (binding/variable (variable/make&bind! block 'for #f #f))))
          (make-iteration
           (variable/name for-var)
           (make-sequence
            (list (make-conditional
                   (compile-expression (make <predicate-target> :parent target)
                                       for-block (for/predicate statement))
                   ((bind-escape-continuation 'continue <compiled-bind-continue>)
                    for-block
                    (lambda (continue-block)
                      (let ((continue-binding (block/lookup-name continue-block 'continue #f)))
                        ;; stow the step part.
                        (setf! (binding/value continue-binding)
                               (compile-expression (make <statement-target> :parent target)
                                                   break-block (for/step statement)))
                        (cons-expression continue-block
                                         (compile-expression (make <statement-target> :parent target)
                                                             continue-block (for/body statement))
                                         (make <non-local-exit>
                                           :block continue-block
                                           :variable (binding/variable continue-binding)
                                           :body (binding/value continue-binding))))))
                   (make <non-local-exit>
                     :block for-block
                     :variable (binding/variable
                                (block/lookup-name break-block 'break #f))
                     :body #f))
                  (make <compiled-combination>
                    :operator (make <compiled-reference> :variable for-var)
                    :operands '()))
            )))))))
Inlining the call to bind-escape-continuation is trivial. (Try it.) This isn't due to the problem domain unless character at a time parsing is that much more difficult than flow-control in a compiler.

One reason I chose Lisp or Scheme for experimental code is because I can move things around so much easier than in other languages. When I tinker with code I often decide to pull a chunk out, combine it with another, change my mind, put it back, change something to a function, move a binding, etc. Lisp simply does not get in my way as much as other languages.

Done testing?

The previous post looked good to me on Blogger and under Google reader. Anyone have problems with it?

Another test.

Some of the posts coming through a feed seem to be completely screwed up with this new stylesheet. I changed the stylesheet tag to use <pre> instead of a custom class and this seems to work on the blog itself. But when I went to the reader, one of the posts looked ok, but another looked wrong. I'm encouraged by the first, but maybe I'm being fooled by some cacheing that's going on somewhere. I'm going to make a few more test posts. If it works or fails, I'd like to know. (If you have hints, etc. It'd be nice, too.)
(define (iter tree))
     (car efficiently
  ;; wite-char #\space)))
     (if (> (cdr entrees)
 (displambda (term)
      (scand (pair? tree)
  (if (leaf-node? thing)
   (n (right-child tree)))
  (n (right-child tree tree)
  (if (leaf-node? tree tree)
    (probabilith the assumption that all-terms))))))

 (for-each-list-points terms an intep trees)
         (begin
        (define (do-term term)
     (display b)
   (let ((best-a alpha beta m N)
    (do ((i 0 (+ i 1))) (if (/ (* *alpha* (gamma-point dp)
      (gamma-ratio num 0)
   (if (> rl (d (right-child right-child)
     (do-it level tree)
         (set ((pi_k))
    (num den)
  ;; Compute gamma (n tree))
        (hash-tree (cluster-data-points terms tree)
  (d (left-child tree)))

Self-learning methodologies are particularly key when it comes to replicated theory. While conventional wisdom states that this issue is always surmounted by the understanding of congestion control, we believe that a different solution is necessary. Despite the fact that conventional wisdom states that this problem is always surmounted by the refinement of the UNIVAC computer, we believe that a different approach is necessary. We emphasize that CARPER requests XML, without investigating DNS. indeed, public-private key pairs and Moore's Law have a long history of connecting in this manner.

A test post.

I've gotten sick of the way blogger is showing my code. I'm playing with the stylesheet to see if it can be improved.
        void Run ()
        {
            // The etc argument isn't used.
            // We want EvalStep to be an expression rather
            // than a statement so we chain the return value
            // around in case we think of a use in the
            // future.
            object etc = null;
            while (true)
                etc = this.expression.EvalStep (this, etc);
        }
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
What happens if I try to do in C# what I did in Lisp a few days ago?


Here are the functions we'll need to compute our probabilities in C#:
double PH1 ()
{
    return Utility.Product<string> (this.TermWeight, Globals.AllTerms);
}

double D ()
{
    return Globals.TuningAlpha * Utility.Gamma (N ())
           + leftChild.D () * rightChild.D ();
}

double Pi ()
{
    return Globals.TuningAlpha * Utility.Gamma (N ()) / D ();
}

double P ()
{
    double pi_k = Pi ();
    return pi_k * PH1 () + ((1.0 - pi_k) * leftChild.P () * rightChild.P ());
}

double R ()
{
    return Pi () * PH1 () / P ();
}

double ProbabilityToOdds (double p)
{
    return p / (1.0 - p);
}

double ROdds ()
{
    return ProbabilityToOdds (R());
}
The marginal probability of clustering is given by function R, and the marginal odds of clustering is given by function ROdds.

In the Scheme version, I proceeded to convert the program to odds space by substituting the definitions of R, probability->odds, Pi, etc. and then rewriting and simplifying the resulting code. This is because in Scheme the name of the function and the function itself are interchangable.

Let's see how this works in C#. First let's try ProbabilityToOdds. In Scheme, we have this:
(define probability->odds
  (lambda (p) (/ p (- 1.0 p))))

(define r-odds
  (lambda (cluster)
    (probability->odds (R cluster))))
And we simply replace the name with the definition:
(define r-odds
  (lambda (cluster)
    ((lambda (p) (/ p (- 1.0 p))) (R cluster))))
But in C#, what exactly is ProbabilityToOdds defined as?
    ProbabilityToOdds = ????
It's a static method, not a first-class object. But C# has delegate objects that can be used to act as if they were first-class methods. We can do this:
Func<double,double> P2O = ProbabilityToOdds;
And now we have a delegate for ProbabilityToOdds. The delegate is a first-class object. We can use it just like we used ProbabilityToOdds. For example
double ProbabilityToOdds (double p)
{
    return p / (1.0 - p);
}

static Func<double,double&gt P2O = ProbabilityToOdds;

double ROdds ()
{
    return P2O (R());
}
That didn't get us very far. We sort of want to go in the reverse direction. We want to make an unnamed delegate. You can do this in C#.
static Func<double,double&gt P2O =
    delegate (double p)
    {
        return p / (1.0 - p);
    }

double ROdds ()
{
    return P2O (R());
}
This is a little weird because it seems that the delegate doesn't have a declared return type. It gets weirder. Let's substitute P2O with the delegate.
double ROdds ()
{
    return delegate (double p)
           {
               return p / (1.0 - p);
           } (R());
}

Compiler error:  ``Method name expected.''
It seems that the compiler is playing some tricks on us. Anonymous delegates weren't originally in the language, and they aren't directly supported by the runtime. However, the compiler understands what it is we're trying to accomplish and arranges to create a good simulation. At least in the first case. We confused it with the second. With a couple of hints to the compiler, we can say what we mean:
double ROdds ()
{
    return ((Func<double,double>)
             delegate (double p)
             {
                 return p / (1.0 - p);
             }) (R());
}
We can substitute R, too.
double ROdds1 ()
{
    return ((Func<double, double>)
             delegate (double p)
             {
                 return p / (1.0 - p);
             }) (((Func<double>) delegate ()
                 {
                     return Pi () * PH1() / P ();
                 }) ());
}
We'd better simplify this. It's getting a little messy. That second function takes no arguments and is being applied to an empty argument list. We can η-reduce it.
double ROdds1 ()
{
    return ((Func<double, double>)
             delegate (double p)
             {
                 return p / (1.0 - p);
             }) (Pi () * PH1() / P ());
}
It might not be immediately clear what I just did, so I'll re-set this.
double ROdds1 ()
{
    return ((Func<double, double>)
             delegate (double p)
             {
                 return p / (1.0 - p);
             }) (((Func<double>) delegate ()
                 {
                     return Pi () * PH1() / P ();
                 }) ());
}
η-reduction is removing the function call apparatus (in green) and using the computed value (in red) directly:
double ROdds1 ()
{
    return ((Func<double, double>)
             delegate (double p)
             {
                 return p / (1.0 - p);
             }) (Pi () * PH1() / P ());
}
When invoking a function, we compute the value of the operands, bind them to the formal paramaters, and evaluate the body. In essence, we do this:
double ROdds1 ()
{
    return ((Func<double>)
             delegate ()
             {
                 double p = Pi () * PH1 () / P ();
                 return p / (1.0 - p);
             }) ();
}
Now we can η-reduce, or we could if this were syntactically allowed:
double ROdds1 ()
{
    return
             double p = Pi () * PH1 () / P ();
             p / (1.0 - p);
}
I removed the function call apparatus just as I did before, but this time it didn't work. The body of a function in C# consists of a series of statements, but a call invoking a function is an expression. You can't use a statement where an expression is expected. Then why did it work before? There was a single statement in the function body and it was a return statement. When we removed the return, we serendipitously converted the return statement to an expression. Since there were no preceeding expressions, we got away with it. In the current case, however, have a problem.

We'll have to move any statements that were in the body of our delegate to a context where statements are permitted. With luck, there will be a context close enough that it won't change the semantics of the code. As it turns out, we're in luck this time.
double ROdds1 ()
{
    double p = Pi () * PH1 () / P ();
    return

             p / (1.0 - p);
}
We were lucky because the context of the call to the delegate was the first subexpression of a statement and we didn't have to move the code very far. But if it were something more complicated, like, say, this:
    for (int i = 0; i < ((Func<int>)
                            delegate ()
                            {
                               object x = foo();
                               return bar (x, x);
                            })(); i++)
We're going to have more problems. The nearest place to hoist the binding of x is outside of loop, and this will likely change the semantics.

There is a hack to work around this limitation, but it is really very nasty. In C#, as in C, assignment is an expression rather than a statement. We can separate the binding of p from computing the value if we can figure out how to sequence the expressions.
double ROdds1 ()
{
    double p;
    return
            First
             p = Pi () * PH1 () / P ()
            then
             p / (1.0 - p);
}
In C and C++, you can sequence expressions with a comma. This is one of those things people use all the time in for statements, but forget that it can be used elsewhere as well. You would write this:
double ROdds1 ()
{
    double p;
    // In C or C++ this will work.
    // NOT IN JAVA OR C#
    return (
             p = Pi () * PH1 () / P (),
             p / (1.0 - p));
}
Unfortunately, Java and C# have overlooked the comma expression. Fortunately, they did not overlook the conditional expression. We can abuse it to force our sequencing because the predicate in the conditional is required to be evaluated first.
double ROdds1 ()
{
    double p;
    return ((p = Pi () * PH1 () / P ()) == 0.0 || true)
            ? p / (1.0 - p)
            : 0.0;
}
The assignment is put in the predicate as part of boolean expression that is always true, then the second expression is put in the true branch of the conditional. The false branch is a dummy value of the correct type.

I did say it was very nasty, but sometimes programming is a dirty job.

Obviously, we don't need to go this far for the problem at hand, this:
double ROdds1 ()
{
    double p = Pi () * PH1 () / P ();
    return p / (1.0 - p);
}
will do just fine.

Ok. This is ridiculous. People don't do this sort of stuff when they hack C, C# or Java. The reason is obvious. It's far too hard and confusing, even in this toy example. It would be insane to try to do this in real life code. Instead, of just cutting and pasting code from here to there, we'd look at the destination context, reason about how the return value is going to be used and how it could be best computed in the new context, and then rewrite the code at the destination to have the desired effect.

More to follow...

Thursday, August 7, 2008

In progress

I'm working on another blog entry, but as part of the task I have ported the simple Bayesian Hierarchical Clustering — the low performance version — algorithm to C#. It is available here: http://code.google.com/p/jrm-code-project/source/browse/ under the BHCTest directory.

Tuesday, August 5, 2008

The final trick to improving the performance is a change in how the basic clustering routine works. Heller and Ghahramani present this pseudocode:

  initialize: number of clusters c = n, and
          Di = {x(i)} for i = 1 ... n
  while c > 1 do
    Find the pair Di and Dj with the highest
        probability of the merged hypothesis:

                     pik p(Dk|H1k)
               rk = ----------------
                        p(Dk|Tk)

     Merge Dk ← Di ∪ Dj, Tk ← (Ti, Tj)
     Delete Di and Dj, c ← c - 1
  end while

.
The naive implementation would examine every pair of clusters to find the one with the best score. On the next iteration, it would do it again. This adds up quick. Keeping these scores in a table would make it much quicker, but the table size becomes the issue.

But we're on the right track. The problem with a table, besides the size, is that most of the entries are useless. If clusters A and B score well together, we certainly don't care how poorly clusters A and Q or A and Z or A and J etc. score. So we'll keep a smaller table that has a single entry for each cluster that is the cluster that it scores best against. This table will start with the same number of entries as we have leaf nodes, and it will gradually grow smaller as we create bigger and bigger clusters.

Now the issue becomes how to cleverly maintain that table. First, let us assume the steady state that each entry in the table has a key which is a cluster and a value which is another cluster that produces the best score when these are combined. This will be our invariant. When we add a new cluster to the table, we have to do the following:
  1. Find the other cluster that is the best match.

    and
  2. Update any cluster that would get a better score with the new entry than with the best it already had.

This operations can proceed in a single pass. We walk the table and test the score of merging the existing key with the new entry. We track the best we find as we walk the table. At each entry, however, we also check to see if the test score exceeds the score of that entry. If it does, then the new cluster becomes the best match for that entry, otherwise we leave the entry alone.

We start by putting 2 leaf nodes in the table, each of which matches the other as the best combination. Then we incrementally add each leaf node of our starting set. Once all the leaf nodes are added, we know the best match for each leaf.

After initializing the table, we'll iterate by selecting the best match in the table, deleting the matching elements from the table, actually performing the merge, and then inserting the merged cluster back into the table. Deleting the elements is a little tricky. Obviously, we will remove the two entries that are keyed by those elements, but it may be the case that other entries would be the best match against the ones we are deleting. We have to fix these up if we are to maintain the invariant. An easy way to maintain the invariant is to simply remove and then re-insert any entry that matches one of the two elements we are deleting.

This works pretty good, but we end up rescoring a lot of entries. Since each rescore involves scanning the entire table, that still adds up to quite a bit of work.

I'm somewhat lazy and I don't want to rescore entries unless I really need to. When an entry in the table has to be rescored because it's best matching cluster is being deleted, there's only two things that could happen: either the new cluster about to be added will be an even better score than the one we just deleted, in which case we'll be paired with that one when it gets entered, or whatever it is we'll be paired with will have an equal or worse score than what we had.

We'll change the invariant in this way. Each entry either contains the actual best cluster that the entry key should merge with, or it contains an upper bound on the best score that the key could have if we knew what it was. So rather than rescore an entry when it becomes invalid, we simply drop the matching element and retain the old score.

Now when we insert an element into the table, we still have to match it against the remaining elements. If an entry has a deferred score, and we equal or exceed that score when the new cluster is merged with the entry, we can simply replace the entry and update the score. If not, we retain the deferred score.

When we go to find the best entry in the table, we use the deferred scores as upper bound estimates of the actual score for that entry. If the best entry exceeds all the deferred scores, we can use it without determining the real match of the deferred scores. However, if we find some deferred scores that have an upper bound that exceeds the best we find in the table, then we really need to find out the actual best match for the deferred score. We do this by finally scanning the table for that entry.

Recall that each time we iterate, we make the table smaller by one. (We delete two nodes and add a single combined one). The longer we defer scoring an entry, the smaller the table gets and the fewer entries need to be scanned. This reduces the number of scoring calculations that we have to do.

Finally, we can use a little bit of knowledge about the kind of things we want to merge. In the problem I'm working on, each data point has only a handful of terms from a selection of thousands. I *know* that the data points won't score well against others unless they have common terms. By keeping a table that maps terms to clusters, I can skip computing scores for matches that simply won't happen. On the other hand, some data points are duplicates of others and obviously should be grouped together. There is no need to search for matches and invoke the full clustering machinery. We'll just take the duplicates and form initial clusters out of those before even populating the table.

So how much does all this help? I can now cluster 20000 data points by examining only 25 million possible clusters. This takes several minutes, but we're a long ways from the 20 or so data points that we started with.

Monday, August 4, 2008

And on a completely different subject

In addition to the Bayesian Hierarchical clustering, I have been playing with my port of MIT Scheme. The code is at http://code.google.com/p/jrm-code-project/source/checkout. Browse down to trunk/MIT-Scheme.

This is a port/rewrite of the MIT CScheme interpreter in C#. It can load and run both the MIT Scheme source code or the syntaxed "bin" files of MIT Scheme. Although it is far from complete, it can cold load MIT Scheme, get to a prompt and start evaluating expressions. Rather that using a huge switch statement like CScheme, this interpreter uses virtual method dispatch to direct evaluation. Tail recursion is achieved via a trampoline. Memory is managed by the CLR runtime.

Why did I do this? I wanted to find out a few things.
  • Is virtual method dispatch fast enough to base an interpreter on?
  • Is the CLR really unsuitable for dynamic languages, or is that a myth?
  • Can MIT Scheme benefit from a more sophisticated GC like in the CLR?
  • What tricks and techniques can we use to improve the performance of the interpreter itself?
I'll blog about this from time to time.

Dynamic Programming

At an interview I had a few years back I was given a question on Dynamic Programming. Fortunately I had written a diff program recently and was familiar with the term. Although the term has a broad meaning, the interviewer was looking for a particular approach to a particular problem. Let me outline that.

Given a sequence of objects, a subsequence is derived by deleting zero or more objects from the sequence. If our sequence is (a b c d e), one subsequence is (a b e), another is (b d). Given two sequences there exists at least one and possibly several sequences that are common to both. The longest common subsequence is of interest in programs like diff.

There is an obvious recursive way to find the length of the longest common subsequence:
(define (lcs seq1 seq2)
  (cond ((or (null? seq1) (null? seq2)) 0)
        ((eq? (car seq1) (car seq2)) 
         (+ 1 (lcs (cdr seq1) (cdr seq2))))
        (else (let ((t1 (lcs seq1 (cdr seq2)))
                    (t2 (lcs (cdr seq1) seq2)))
                (if (> t1 t2)
                    t1
                    t2)))))
.
Informally, we look at the head of each sequence. If the heads match, then that element is common. If not, then either we should drop the head of the first or we should drop the head of the second. We try both and pick the longer one.

This is not a very efficient algorithm. There is a lot of repeated computation. To save time and computation, we have a trick. We create an N by M array to hold the intermediate results of the recursive calls. Along the top of the array we put sequence1 in reverse order. Along the left side we put sequence 2 in reverse order. Now we can fill out the cells in the array. In the first row and the first column we put zero. This corresponds to the first clause in our conditional. In the remaining cells we do this: if the object at the top of the column matches the object in the left of the row, we add 1 to the value in the cell diagonally up and to the left of us. Otherwise, we take the maximum of the cell immediately to the left and the cell immediately above. We fill the array from left to right, top to bottom until we get to the end cell which now contains the answer.

This is a cute trick, but it is unnecessarily complicated. If you instead take the simple, obvious recursive solution and just memoize the results as they are computed, the result will be computed in the same order and with the same performance improvement as with generating an explicit table. In the interview I had, the interviewer didn't know that and was expecting me to write out a table of intermediate results. He kept hinting at it until I had to tell him that simply adding memoization to the recursive solution would yield the same answer without having to reconfigure the problem to work from bottom up in a table.

The next speedup to the Bayesian Hierarchical Clustering comes from this exact trick. When we compute the values of ph1-log and px-log for a subtree, we simply memoize them in the tree structure itself. The next time we need the value we can simply fetch it from the stored location. We can make this even more efficient by computing the value at the time of tree creation so we don't have to check if the value is memoized — it will always be present in the tree structure. If we change the code in this way, we'll see something amusing. Although conceptually we are computing the values top-down as we build the tree, we are, through memoization, computing the values in bottom-up order.

Memoization also helps significantly for log-bernoulli-beta. Because we are building our clusters from bottom up, a significant amount of time is spent checking small clusters and leaf nodes. This means that the two final arguments to log-bernoulli-beta are usually small integers. Rather than compute the values on demand, it is much quicker to pre-compute the values for small integers and simply fetch them from the table.

These improvements to the code add about another factor of ten, so we can now cluster about one or two thousand data points in a reasonable amount of time. This turned out to be not quite enough for my problem and I needed to squeeze out some more performance. That is in the next post.
Last time I converted the original Bayesian Hierarchical Clustering code to work in odds space rather than in probability space. This was the first part of converting the entire program to work in log-odds space. Converting to log space is somewhat similar to the previous conversion, but there are a couple of differences. In converting to odds space, we mostly had a hairball of simple algebra to untangle. When we convert to log space we'll have to convert more of the program and pay attention to the operators.

First we convert the top-level function r-odds:

(define (r-log-odds cluster) 
  (if (leaf-node? cluster)
      (error "Odds are infinite.")
      (10log10 (/ (ph1a cluster)
                  (* (px (left-child cluster))
                     (px (right-child cluster)))))))

.
Recall that the log of a ratio is the difference of the logs of the numerator and denominator and the log of a product is the sum of the logs of the multiplicands. We can rewrite it like this:

(define (r-log-odds cluster) 
  (if (leaf-node? cluster)
      (error "Odds are infinite.")
      (- (10log10 (ph1a cluster))
         (+ (10log10 (px (left-child cluster)))
            (10log10 (px (right-child cluster)))))))

.
And now we'll write log versions of ph1a and px, so we have:

(define (r-log-odds cluster) 
  (if (leaf-node? cluster)
      (error "Odds are infinite.")
      (- (ph1a-log cluster)
         (+ (px-log (left-child cluster))
            (px-log (right-child cluster))))))

(define (px-log cluster)
  (if (leaf-node? cluster)
      (ph1a-log cluster)
      (10log10 (+ (ph1a cluster)
                  (* (px (left-child cluster))
                     (px (right-child cluster)))))))

.
Unfortunately, the log of a sum doesn't have an easy transformation. On the other hand, it isn't all that hard, either.

(define (log-sum left right)
  (if (> right left)
      (log-sum right left)
      (+ left (10log10 (+ 1.0 (inverse-10log10 (- right left)))))))

.
If the difference between left and right is small enough, taking the inverse log (exponentiating) won't be much of an issue, but if the difference gets big, this could cause us some problems. For the moment, we'll just leave it and hope we don't have to revisit it later.

(define (px-log cluster)
  (if (leaf-node? cluster)
      (ph1a-log cluster)
      (log-sum (ph1a-log cluster)
               (+ (px-log (left-child cluster))
                  (px-log (right-child cluster))))))

(define (ph1a-log cluster)
  (+ (10log10 *alpha*)
     (log-gamma (n cluster))
     (ph1-log cluster)))

.
We're hardly the first to want to compute in log odds space. The log-gamma function is fairly well-known and there are good approximations to it.

  (define (ph1-log cluster)
    (let ((data-points (cluster-data-points cluster)))
      (sum (lambda (term)
                 (let ((present (count-occurrances term data-points))
                       (absent (- (length data-points) present))
                   (log-bernoulli-beta A B present absent))))
                all-terms))
     )

.
Ph1-log is like ph1, but we take the sum over the term rather than the product. Log-bernoulli-beta is trivially defined as the sums and differences of the appropriate log-gamma calls.

This part of the conversion was incredibly easy, but I did gloss over a couple of interesting things. I rather cavalierly propagated the log function through conditionals:

   (log (if <predicate>          (if <predicate>
            <consequence>    =>      (log <consequence>)
            <alternative>))          (log <alternative>))

.
That turns out to be a legal transformation, but it would be nice to understand why. Suppose instead of a conditional, I had a different operation that combined two functions.

   (log (op f1 f2))  <=!=>  (op (log f1) (log f2)) **WRONG

.
I cannot simply rewrite this as (op (log f1) (log f2)). If op were *, this would be clearly wrong. What I need to do is find an operation in log-space that has the effect of op after exponentiation.

   (log (op f1 f2))  <===>  ((logequiv op) (log f1) (log f2))

         (logequiv *) = +
         (logequiv /) = -
         (logequiv +) = log-sum

         (logequiv if) = ????

.
We know that * becomes +, / becomes -, + becomes log-sum, but we should find the log-space equivalent of ‘if’.

         (logequiv if) = if

.
It turns out that if simply stays if, so the transformation above is legal.

With the change from exact probabilities to floating-point operations in log space, we get about a factor of 10 improvement in performance. Now we can cluster hundreds of data points rather than only dozens. There are two more substantial speedups to apply.