The problem
About a month ago, I made some
changes to the garbage collector. These changes included drastically decreasing the default nursery size from 16mb to 1mb. This turned out to have a negative effect on performance, with GC consuming as much as 60% of run time on some tests.
Instead of making the default nursery larger, which would just mask the problem until we hit workloads with 16 times as much allocation (or when the compiler began to generate code that was 16 times faster, or any combination of the above), I decided to investigate the root cause of the problem.
Turns out that every minor GC was taking on the order of a few milliseconds at the minimum, which is too much time. Most of the time was spent scanning the cards array for marked cards. Even if there weren't any marked cards, with a 64mb heap and 64 byte card size, that's a million array entries to check.
Faster card scanning
The first thing I did was change how the card scan loop operates. Instead of iterating over the array a byte at a time, and checking if each byte satisfies the card mark mask, we iterate over the array four bytes at a time, checking each group of four bytes against the mask. GCC should really be doing this optimization for me, because its a simple case of vectorization, but it doesn't, so I just wrote the code out by hand:
u32 *quad_ptr;
u32 quad_mask = mask | (mask << 8) | (mask << 16) | (mask << 24);
for(quad_ptr = (u32 *)first_card; quad_ptr < (u32 *)last_card; quad_ptr++)
{
if(*quad_ptr & quad_mask)
{
F_CARD *ptr = (F_CARD *)quad_ptr;
int card;
for(card = 0; card < 4; card++)
{
if(ptr[card] & mask)
{
collect_card(&ptr[card],gen,here);
ptr[card] &= ~unmask;
}
}
}
}
This improved card scan performance by a factor of 3, which makes sense, because we do four times less loop iterations, but at some point, you still have to hit memory, which dampens things out a bit.
This optimization improved performance, but not sufficiently so. I realized that card scanning hurt the most in benchmarks which didn't write into objects in the old generation at all; for example, a benchmark which operates on bignums, where the intermediate values are extremely short-lived, fills up the nursery very rapidly, and every minor GC checks all cards, which not only fills up the cache with junk, but doesn't achieve anything useful as no cards will be marked.
On the other hard, a pointer recording approach where all stores are written into a buffer is also unacceptable, because in code with a high rate of mutation, you pay a hefty price every time the buffer overflows. So I had to find a compromise.
Introducing card decks
My solution is something I haven't seen done anywhere else before, but it is pretty trivial if you think about it. It amounts to two-level card marking. We have an array of cards, where each card maps to a small amount of memory: 64 bytes in my case, but it could be anything; and an array of
card decks. Each card deck corresponds to 1024 cards, and if a card is marked, then its deck must be marked too. This way, on a minor GC, we scan the deck array first, and then only check cards corresponding to marked decks.
This complicates the write barrier, because now on every store we have to mark a card and a deck, not just a card as before. I decided to counterweight this by simplifying the write barrier in another way. Formerly, it would read the card, "or" it with a mask, and store it back. This was done because the card contained an additional piece of information, and that is the offset within the card where the first object begins. Cards have two mark bits, denoting a possible pointer from this generation to either aging space or the nursery, and the other six bits were reserved for the offset of the first object. By moving object offsets (I call them "allot markers") into a separate array, I removed a read operation from the write barrier; now it simply has to write the mark mask to the card, without caring about its existing contents. This also freed up two bits in the allot marker, allowing me to switch to 256 byte cards (and 256 kilobyte card decks).
These improvements really helped on benchmarks which allocated large numbers of extremely short-lived objects, however on benchmarks which allocated longer-lived objects the improvements were still major but not so drastic. It turns out that if objects survives long enough to be copied from the nursery into aging space, then it is possible for this copying to begin to dominate running time.
Low-level optimizations
The inner loop of the GC looked like so:
void copy_handle(CELL *handle)
{
CELL pointer = *handle;
if(!immediate_p(pointer) && should_copy(pointer))
*handle = copy_object(pointer);
}
CELL collect_next(CELL scan)
{
do_slots(scan,copy_handle);
if(collecting_gen == TENURED)
do_code_slots(scan);
return scan + untagged_object_size(scan);
}
This is very elegant code;
do_slots
is a higher-order function which applies a function pointer to each slot of the object. However, GCC doesn't try to optimize higher-order code at all, after all it is a C compiler not an ML compiler, and it isn't Sufficiently Smart Either! So I rewrote the above function by manually inlining
do_slots()
and replacing the function pointer invocation with the body of
copy_handle()
. The next thing I realized was that
should_copy()
contains a large series of conditional tests which did not depend on its input parameters, only global variables whose value was invariant for the duration of the collection:
/* test if the pointer is in generation being collected, or a younger one. */
INLINE bool should_copy(CELL untagged)
{
if(in_zone(newspace,untagged))
return false;
if(collecting_gen == TENURED)
return true;
else if(HAVE_AGING_P && collecting_gen == AGING)
return !in_zone(&data_heap->generations[TENURED],untagged);
else if(HAVE_NURSERY_P && collecting_gen == NURSERY)
return in_zone(&nursery,untagged);
else
{
critical_error("Bug in should_copy",untagged);
return false;
}
}
So again, I did a bit of hand-optimization. If
foo()
does not depend on the loop variable or the side effects of the body, then the following are equivalent:
while(A)
{
if(foo()) B else C;
}
if(foo)
{
while(A) B;
}
else
{
while(A) C;
}
In the optimizing compiler literature, this is known as
loop unswitching.
The new
collect_next()
is much longer, but also much faster. It looks like this:
CELL collect_next(CELL scan)
{
CELL *obj = (CELL *)scan;
CELL *end = (CELL *)(scan + binary_payload_start(scan));
obj++;
CELL newspace_start = newspace->start;
CELL newspace_end = newspace->end;
if(HAVE_NURSERY_P && collecting_gen == NURSERY)
{
CELL nursery_start = nursery.start;
CELL nursery_end = nursery.end;
for(; obj < end; obj++)
{
CELL pointer = *obj;
if(!immediate_p(pointer)
&& (pointer >= nursery_start && pointer < nursery_end))
*obj = copy_object(pointer);
}
}
else if(HAVE_AGING_P && collecting_gen == AGING)
{
F_ZONE *tenured = &data_heap->generations[TENURED];
CELL tenured_start = tenured->start;
CELL tenured_end = tenured->end;
for(; obj < end; obj++)
{
CELL pointer = *obj;
if(!immediate_p(pointer)
&& !(pointer >= newspace_start && pointer < newspace_end)
&& !(pointer >= tenured_start && pointer < tenured_end))
*obj = copy_object(pointer);
}
}
else if(collecting_gen == TENURED)
{
for(; obj < end; obj++)
{
CELL pointer = *obj;
if(!immediate_p(pointer)
&& !(pointer >= newspace_start && pointer < newspace_end))
*obj = copy_object(pointer);
}
do_code_slots(scan);
}
else
critical_error("Bug in collect_next",0);
return scan + untagged_object_size(scan);
}
That's a hell of a code explosion, but it made a real difference to the performance of the garbage collector.
Benchmarks
The first benchmark just allocates 3-element arrays in a loop:
: garbage 1 2 3 3array ;
: garbage-loop 150000000 [ garbage drop ] times ;
[ garbage-loop ] time
Here are the results with different nursery sizes; all times are in milliseconds and I ran each benchmark multiple times to ensure I was getting reliable results:
| Before | After |
1mb nursery: | 8908 | 4461 |
2mb nursery: | 6455 | 4451 |
4mb nursery: | 5336 | 4582 |
8mb nursery: | 4779 | 4582 |
The second benchmark allocates larger arrays in a loop:
: garbage 100 f <array> ;
: garbage-loop 15000000 [ garbage drop ] times ;
[ garbage-loop ] time
The results are similar:
| Before | After |
1mb nursery: | 10418 | 2158 |
2mb nursery: | 6364 | 2178 |
4mb nursery: | 4966 | 3223 |
8mb nursery: | 4249 | 3294 |
An interesting phenomenon occurs where after the GC changes, a larger nursery actually begins to hurt performance.
Dan and I hypothesized that this is due to the larger nursery hurting locality, with poor algorithms masking this effect with the older GC.
The next benchmark allocates a lot of temporary arrays but stores them into larger arrays which are somewhat longer lived:
: garbage 100 f <array> ;
: garbage-loop 150 [ 10000 [ drop garbage ] map drop ] times ;
[ garbage-loop ] time
The size of the aging generation makes a difference here so I tried the benchmark with more parameters:
| Before | After |
1mb nursery/2mb aging: | 8693 | 5525 |
2mb nursery/2mb aging: | 7318 | 5255 |
4mb nursery/2mb aging: | 3939 | 2628 |
8mb nursery/2mb aging: | 2314 | 1526 |
1mb nursery/8mb aging: | 3455 | 1493 |
4mb nursery/4mb aging: | 3024 | 1894 |
4mb nursery/4mb aging: | 3157 | 1577 |
8mb nursery/8mb aging: | 1533 | 1003 |
The above benchmarks are somewhat unrealistic, however other benchmarks showed speedups too, some more drastic than others:
Benchmark | Before | After | SHA1 | 12198 | 7282 |
---|
Random | 18989 | 14459 |
---|
Sort | 15685 | 13476 |
---|
Raytracer | 27700 | 12966 |
---|
Mandelbrot | 4633 | 1847 |
---|
Before the shrinking of the nursery, the runtimes of these benchmarks looked much like they do now; it was only until a month ago that GC time began to dominate benchmarks like that. However I like having a small nursery, and making it smaller forced me to optimize the garbage collector for higher throughput in constrained-memory conditions.
Looking ahead
The next big change I will be making to the garbage collector is switching to using mark/sweep for the old generation. This will reduce memory consumption by almost 50%.
The other side of this coin is compiler optimizations. If the compiler performed more advanced static analysis, it could eliminate much of the dynamic memory allocation in the above benchmarks, cutting GC time further. Of course, if we have a great compiler and a great GC, we'll just have great performance all around.