Many programmers use malloc() and free() and pay no more mind to "allocating memory", but I often find 2-20× speed improvements by rethinking memory usage, so it is something I usually want to examine when joining a distressed project.
The first step is to stop allocating "memory", and to start allocating specific somethings instead. Consider the usage pattern of each data structure:
There are other common data structures (like hash tables) that resist a good memory allocation strategy. We can usually find another data structure (like a trie) which is faster and better precisely because we can find a good memory allocation strategy for it.
One major limitation of malloc (and even the best implementations like jemalloc and dlmalloc) is that it is a single allocator used for every data structure. This is a mistake: a huge performance gain can be had by using a separate allocator for each of your data structures — or rather, for each of your data usage patterns.
And yet you can still start with malloc, if you wrap your use of it by type. Wrapping malloc in this way does not mean:
void*wrap_malloc(unsigned long long n) { return malloc(n); }
but something like:
void*alloc_foo(void) { return malloc(sizeof(struct foo)); }
These functions can then be profiled individually.
By the way: I do not use size_t, but you are free to. This is not code that I am expecting you to cut and paste, but to read and meditate on. This is not the only surprise in this article.
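To make the profiling point concrete, here is a sketch of what per-type wrappers might look like (struct foo and the live-object counter are illustrative, not part of any real allocator):

```c
#include <stdlib.h>

struct foo { int a, b; };          /* illustrative type */

/* one counter per type: each usage pattern can be measured, and each
   wrapper can later be re-pointed at a specialised allocator */
static unsigned long long foo_live = 0;

void *alloc_foo(void) { foo_live++; return malloc(sizeof(struct foo)); }
void free_foo(void *p) { foo_live--; free(p); }
```

Because every foo passes through these two functions, swapping malloc out for a specialised pool later is a two-line change.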
The operating system has a tool for allocating contiguous virtual memory, which it hands out in units called pages.
Most ad-hoc memory allocation will not be done at the page level, so I will not spend a great deal of time on it, except from the user-space perspective:
void*page_alloc(unsigned long long bytes) { void *x; for(;;) { x = mmap(0, bytes, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); if(x != MAP_FAILED) return x; out_of_memory(); } }
This will suffice for discussion; so long as it is understood that out_of_memory() could abort with an error message, attempt some garbage collection, or add swap, error handling will not be explored further.
void page_return(void *page,unsigned long long bytes) { munmap(page,bytes); }
Most architectures have specialised instructions for stack allocation, and many operating systems extend the stack automatically on fault, but C offers no functions for using the stack (which is funny, because POSIX has a function for creating one).
gcc supports a popular extension called alloca which can allocate on the stack. Microsoft's C compiler also supports alloca, but calls it _alloca.
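A portability shim and a sketch of typical use (the function is illustrative; note that alloca gives no indication of failure, it simply overflows the stack):

```c
#include <string.h>
#if defined(_MSC_VER)
#include <malloc.h>
#define alloca _alloca
#else
#include <alloca.h>
#endif

/* scratch space on the stack: nothing to free, it vanishes on return */
int count_spaces(const char *s, unsigned long long n) {
    char *tmp = alloca(n + 1);
    memcpy(tmp, s, n);
    tmp[n] = 0;
    int c = 0;
    for (unsigned long long i = 0; i < n; ++i)
        if (tmp[i] == ' ') ++c;
    return c;
}
```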
When using the stack, programmers often want multiple stacks (when they should consider how a stack-heavy application can benefit from multiple processes instead). Something not often considered is that UNIX comes with two stacks: the second one is called the break, and is manipulated with the brk()/sbrk() calls. A neat trick is to increase the break on fault (assuming 4k pages):
static void oops(int s,siginfo_t *p,void *x){ brk((void*)((((unsigned long long)p->si_addr) + (16LL<<22)) & ~4095ULL)); }
This is, of course, what many operating systems do with the stack, but being able to do it in user-space effectively gives us two stacks (one of which we manipulate manually via a global "break" pointer).
It is worth mentioning that the C library malloc might be using and caching the results of sbrk, so it may be unwise to mix this trick with your C library malloc; it is very useful in new programs that have no intention of using the C library malloc.
An array has a type and a length, and the most common allocation operations are to copy and to extend (append), with "slicing" being common enough that it is worth optimising.
The buddy allocator is an excellent fit for arrays.
union alloc_header{ void*next_free; struct { char type, bucket; unsigned short unused; unsigned int ref; unsigned long long length; }; } *free_table[32] = {0};
An array is sizeof(union alloc_header) + length * sizes[type] bytes long.
When we allocate an array, we allocate it out of a bucket. gcc has a very efficient way to obtain the bucket from the number of bytes: 60 - __builtin_clzll(byte_size); (Why does this work? We use 4 bits for alignment, so there cannot be a smaller allocation than 16 bytes, and we want the smallest allocations in the lowest buckets; 64 bits minus those 4 gives the 60 we start from.)
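The arithmetic is easy to convince yourself of with a few assertions (bucket_of is just the expression above given a name):

```c
/* bucket b holds blocks of 16LL<<b bytes; the formula gives a bucket
   whose block always fits byte_size, header included */
static char bucket_of(unsigned long long byte_size) {
    return 60 - __builtin_clzll(byte_size);   /* undefined for 0, which cannot occur */
}
```

Exact powers of two land one bucket higher than strictly necessary (32 bytes goes to the 64-byte bucket); the formula trades that slack for being branch-free.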
If free_table[bucket] contains a value, we can pop a value off.
void*alloc(char type, unsigned long long length) { unsigned long long data_bytes = length * sizes[type]; unsigned long long byte_size = sizeof(union alloc_header) + data_bytes; char bucket = 60 - __builtin_clzll(byte_size); union alloc_header *h; if((h = free_table[bucket])) { free_table[bucket] = h->next_free; goto found; }
If we don't have a bucket, we can allocate it out of the next larger slot by splitting it in half: the astute reader will note that the size of a bucket is 16LL<<bucket and each larger bucket is twice the size of the smaller bucket (with the smallest being 16 bytes).
Powers of two are wasteful if you have a bunch of arrays that are just a little larger than a neat power of two, and virtual memory is under pressure (e.g. on 32-bit or smaller address spaces). You can use another series (the Fibonacci series might work well) by changing how the bucket is chosen and by replacing the bucket_to_bytes function, expressed here as 16LL<<x.
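A sketch of what that might look like with a table (the series here is Fibonacci-style and purely illustrative). Conveniently, each Fibonacci bucket splits exactly into the two buckets below it, so the buddy-style splitting still works:

```c
/* illustrative series: each entry is the sum of the previous two,
   so a bucket-n block splits into one bucket n-1 and one bucket n-2 */
static const unsigned long long bucket_to_bytes[] = {
    16, 32, 48, 80, 128, 208, 336, 544, 880, 1424, 2304, 3728, 6032
};

static int fib_bucket(unsigned long long byte_size) {
    int b = 0;
    while (bucket_to_bytes[b] < byte_size) ++b;   /* linear scan; fine for ~30 buckets */
    return b;
}
```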
unsigned long long j=1+bucket; while(j<32 && !(h=free_table[j])) ++j; if(!h) { j=bucket; if(j<22)j=22; h=page_alloc(16LL<<j); }
Making the minimum region 64MB (16LL<<22) is worth some experimentation; note however that we also support larger regions (up to 16LL<<31).
free_table[j] = h->next_free; while(--j >= bucket) { h->next_free = 0; free_table[j] = h; h = (union alloc_header*)(((char*)h)+(16LL<<j)); }
Splitting the pages does not need to be difficult, and the free list works fine for temporary storage. If you are using multiple threads, using a CAS to push entries onto the free list is worth experimenting with.
found: h->length = length; h->ref = 0; h->type = type; h->bucket = bucket; void *data = &h[1]; memset(data,0,data_bytes); return data; }
If you design your functions to fully consume their arguments then you simply reference (ref) any value you keep, and unreference (unref) any value you receive.
void *ref(void *x) { union alloc_header *h = x;h--; h->ref++; return x; }
Unreferencing should put the object back onto its free list:
void unref(void *x) { union alloc_header *h = x;h--; char bucket = h->bucket; if(h->ref--)return; h->next_free = free_table[bucket]; free_table[bucket] = h; }
Arrays work well with reference counting, because an operation that takes an array of length n and produces an array of length n can reuse the array when it is unshared:
void*need(void *x,unsigned long long length) { union alloc_header *h = x;h--; unsigned long long data_bytes = length * sizes[h->type]; unsigned long long old_data_bytes = h->length * sizes[h->type]; if(h->ref || (data_bytes+sizeof(union alloc_header)) > (16LL<<h->bucket)) { void *y = alloc(h->type, length); memcpy(y, x, data_bytes < old_data_bytes ? data_bytes : old_data_bytes); unref(x); return y; }
Growing the array in place additionally zeroes the newly exposed part of the bucket:
if(length > h->length) memset(((char*)x) + old_data_bytes, 0, data_bytes - old_data_bytes); h->length = length; return x; }
If you reserve a type for pointers to other arrays, and you always ref it when adding it to that array, then you can have unref clean up recursively:
void unref2(void *x) { union alloc_header *h = x;h--; char bucket = h->bucket; if(h->ref--)return; if(h->type==BOXTYPE) { for(unsigned long long i=0;i<h->length;++i)unref2(((void**)x)[i]); } h->next_free = free_table[bucket]; free_table[bucket] = h; }
Because my array allocator knows the length of the array, I never need to store the length in the object. This pattern disappears:
struct foo { int length; char d[0]; } *s = malloc(sizeof(int)+n);
because I can simply do this:
char *s=alloc(CHARTYPE, n);
and then any time I want the length, get it:
unsigned long long length(void*data) { return ((union alloc_header *)data)[-1].length; }
I also expose the number of references the same way,
int sharedp(void*d) { return ((union alloc_header *)d)[-1].ref > 0; }
so that it is easy to reuse the (big) array instead of consuming it elsewhere in my program.
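For meditation, the idiom reduces to this malloc-backed sketch, stripped of the buddy machinery (alloc_chars is an illustrative name): the header sits immediately below the pointer the caller sees.

```c
#include <stdlib.h>
#include <string.h>

union alloc_header {
    void *next_free;
    struct { char type, bucket; unsigned short unused;
             unsigned int ref; unsigned long long length; };
};

void *alloc_chars(unsigned long long n) {
    union alloc_header *h = malloc(sizeof *h + n);
    if (!h) return 0;
    h->length = n;
    h->ref = 0;
    memset(&h[1], 0, n);          /* data begins just past the header */
    return &h[1];
}

unsigned long long length(void *data) {
    return ((union alloc_header *)data)[-1].length;
}
```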
Objects have a class, and a fixed size based on their class. This means there may be many objects of just a few sizes. Interleaving them has no cache benefit, and makes it difficult to return memory. When we have memory usage like this, we can do better than the array approach.
struct page_info { int classno, count, scavange; }; union object_info { union object_info *next; }; union object_info *free_list[num_classes] = {0}; int sizes[num_classes] = {...};
Note that you do not have to use int, and you should use something smaller than int if you have a small number of classes, or if not that many objects can fit on a page. However, you should make sure that page_info is big enough to force alignment (even though compilers are usually smart enough to do this), because an alignment failure here can cause unexpected slowdown elsewhere.
Allocating classes with minimal fragmentation is simple when using a separate free list for each class. First, the number of objects per page is:
int objects_per_page(int classno) { return (PAGESZ - sizeof(struct page_info)) / sizes[classno]; } struct page_info *page_info(void *x) { return (struct page_info *)(((unsigned long long)x) & ~(PAGESZ-1)); }
If your classes are bigger than an operating system page size, use a PAGESZ that is the LCM of the operating system page size and your largest class size.
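One caveat worth spelling out: the mask in page_info only recovers the page base if PAGESZ is a power of two and every page is PAGESZ-aligned; mmap only promises OS-page alignment, so for a larger PAGESZ, page_alloc must over-allocate and align. A sketch of the invariant, using aligned_alloc and an illustrative 64k PAGESZ:

```c
#include <stdint.h>
#include <stdlib.h>

#define PAGESZ 65536ULL   /* must be a power of two for the mask below */

struct page_info { int classno, count, scavange; };

/* works for any pointer into the page, provided the page base
   is PAGESZ-aligned */
static struct page_info *page_info_of(void *x) {
    return (struct page_info *)(((uintptr_t)x) & ~(PAGESZ - 1));
}
```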
void *alloc(int classno) { struct page_info *p; union object_info *h = free_list[classno]; if(h) { free_list[classno] = h->next; p = page_info(h); p->count++; p->scavange=0; return (void*)h; } void *q = page_alloc(PAGESZ); p = q; p->count = 1; p->scavange= 0; p->classno = classno;
Now split the rest of the page onto the free list:
q = &p[1]; int n = objects_per_page(classno), i; for(i=1;i<n;++i) { h = q; h->next = free_list[classno]; free_list[classno] = h; q = ((char*)q) + sizes[classno]; } return q; }
The last element is ours.
Freeing an object is trivial, and when the page count goes to zero the page can be returned to the operating system once the free list is scavenged. The general strategy is to walk the free list incrementally:
Doing this "in one pass" sounds easy, but it can cause your program to pause in a data-dependent way. Writing your garbage collector without pauses is easy enough, and you can control how many steps you perform per allocation based on how frequently you need to collect.
char phase[num_classes] = {0}; union object_info *cursor[num_classes] = {0}; union object_info *to_be_deleted[num_classes] = {0}; void free_some_phase1(int classno) { struct page_info *p; union object_info *h; if(!(h=cursor[classno])) { if(!(h = free_list[classno])) { phase[classno] = 2; return; } p = page_info(h); if(p->count == 0) {
Remove it from the free list and add it to the scavange list:
free_list[classno] = h->next; p->scavange++; h->next = to_be_deleted[classno]; to_be_deleted[classno] = h; } else {
Otherwise just advance the cursor:
cursor[classno] = h; } } else { union object_info *j = h->next; if(!j) {
Out of elements on this freelist? Move to phase2
phase[classno] = 2; return; } p = page_info(j);
Again, if there are no used members left on the page, move this to the scavange list:
if(p->count == 0) { p->scavange++; h->next = j->next; j->next = to_be_deleted[classno]; to_be_deleted[classno] = j;
Otherwise simply advance the cursor:
} else { cursor[classno] = j; } } }
In phase 2 we have a simple linked list to drain, until we have seen all of its elements. We reuse our scavange counter, so values 0…n refer to reclaimed pages, and n…n×2 are being dropped (never to be recovered), with the actual page being returned to the operating system at n×2.
void free_some_phase2(int classno) { struct page_info *p; union object_info *h; if((h=to_be_deleted[classno])) { int i, n = objects_per_page(classno); p = page_info(h); to_be_deleted[classno] = h->next; if(p->scavange == (2*n)) { page_return(p,PAGESZ); } else { if(p->scavange >= n) p->scavange++; h->next = free_list[classno]; free_list[classno] = h; } } else { phase[classno] = 1; } }
We can run some number of phases every allocation, and tune the number of rounds so that the memory is under control:
void free_some(void) { int i,j; for(j=0;j<NROUNDS;++j) { for(i=0;i<num_classes;++i) if(phase[i]!=2) free_some_phase1(i); for(i=0;i<num_classes;++i) if(phase[i]==2) free_some_phase2(i); } }
A useful trick when memory usage is data-dependent is to have out_of_memory() increase the number of rounds and retry.
Another useful trick is (if using CAS to insert/remove entries from the free list) to free from another thread.
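A sketch of the CAS push with C11 atomics (in practice you want one head per class; a single head keeps the fragment short). Popping concurrently needs more care because of the ABA problem, which is why pushing from many threads and popping from one is the easy split:

```c
#include <stdatomic.h>

union object_info { union object_info *next; };

static _Atomic(union object_info *) free_head;   /* NULL at start */

void free_push(union object_info *h) {
    union object_info *old = atomic_load_explicit(&free_head, memory_order_relaxed);
    do {
        h->next = old;   /* link to the head we last saw... */
    } while (!atomic_compare_exchange_weak_explicit(
                 &free_head, &old, h,            /* ...and swing it atomically */
                 memory_order_release, memory_order_relaxed));
}
```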