Faster, scalable memory allocations in Apache Arrow with jemalloc
Published: 20 Jul 2018
By: Uwe Korn (uwe)
With the release of Apache Arrow 0.9, we have switched the default allocator for array buffers from the system allocator to jemalloc on OSX and Linux. This applies to the C++/GLib/Python implementations of Arrow. Changing the default allocator is normally done to avoid problems that occur with many small, frequent (de)allocations. In Arrow, by contrast, we typically deal with large in-memory datasets. While jemalloc provides good strategies for avoiding RAM fragmentation for allocations smaller than a memory page (4KB), it also provides functionality that improves performance on allocations that span several memory pages.
Outside of Apache Arrow, jemalloc powers the infrastructure of Facebook (which is also where most of its development happens). It is also the default allocator in Rust, and it helps Redis reduce memory fragmentation on Linux (see “Allocator” in the Redis documentation).
One allocation specialty that we require in Arrow is that memory should be 64-byte aligned, so that we can get the most performance out of SIMD instruction sets like AVX. While most modern SIMD instructions also work on unaligned memory, their performance is much better on aligned memory. To get the best performance for our analytical applications, we want all memory to be allocated such that SIMD performance is maximized.
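As a small illustration (a sketch of our own, not code from the Arrow sources), an AVX kernel can use aligned loads only if the buffer alignment is guaranteed; a 64-byte-aligned buffer always satisfies the 32-byte requirement of _mm256_load_pd, whereas an unguaranteed buffer would force the slower-on-some-hardware unaligned variant _mm256_loadu_pd:

#include <immintrin.h>  /* compile with -mavx */
#include <stddef.h>

/* Sums n doubles, assuming data comes from a 64-byte-aligned allocation
   (e.g. aligned_alloc(64, ...)); tail elements are omitted for brevity. */
double sum_aligned(const double* data, size_t n) {
    __m256d acc = _mm256_setzero_pd();
    for (size_t i = 0; i + 4 <= n; i += 4) {
        /* Aligned load: valid because data + i stays 32-byte aligned. */
        acc = _mm256_add_pd(acc, _mm256_load_pd(data + i));
    }
    double tmp[4];
    _mm256_storeu_pd(tmp, acc);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}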
For aligned allocations, the standard C and POSIX APIs only provide aligned_alloc(size_t alignment, size_t size) and posix_memalign(void **ptr, size_t alignment, size_t size) to allocate aligned memory. Neither of them caters for expanding an existing allocation. While the realloc function can often expand allocations without moving them physically, it does not ensure that the alignment is kept in the case the allocation is moved.
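A minimal sketch (our own illustration, not Arrow code) of this limitation:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* Error handling omitted for brevity. */
    void* ptr = aligned_alloc(64, 64 * 1024);   /* 64-byte aligned */
    ptr = realloc(ptr, 1024 * 1024);            /* may move the block... */
    /* ...and the new address is not required to be 64-byte aligned. */
    printf("still 64-byte aligned: %s\n",
           ((uintptr_t)ptr % 64 == 0) ? "yes" : "no (not guaranteed)");
    free(ptr);
    return 0;
}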
In the case where Arrow was built without jemalloc enabled, this resulted in copying the data on each expansion of an allocation. To reduce the number of memory copies, we use jemalloc’s *allocx() APIs to create, modify and free aligned allocations. One of the typical tasks where this gives us a major speedup is the incremental construction of an Arrow table that consists of several columns. We often don’t know the size of the table in advance and need to expand our allocations as the data is loaded.
To incrementally build a vector, expanding the memory by a factor of 2 on each step, we would use the following C code with the standard aligned allocation APIs:
#include <stdlib.h>
#include <string.h>

size_t size = 128 * 1024;
void* ptr = aligned_alloc(64, size);
for (int i = 0; i < 10; i++) {
    size_t new_size = size * 2;
    /* aligned_alloc cannot grow in place: allocate a new aligned block ... */
    void* ptr2 = aligned_alloc(64, new_size);
    /* ... then copy the existing data over and release the old block. */
    memcpy(ptr2, ptr, size);
    free(ptr);
    ptr = ptr2;
    size = new_size;
}
free(ptr);
With jemalloc’s special APIs, we are able to omit the explicit call to memcpy. When an expansion cannot be done in place, the copy still happens inside the allocator, but it is no longer needed on every expansion. This simplifies our user code to:
#include <jemalloc/jemalloc.h>

size_t size = 128 * 1024;
/* Allocate a 64-byte-aligned block. */
void* ptr = mallocx(size, MALLOCX_ALIGN(64));
for (int i = 0; i < 10; i++) {
    size *= 2;
    /* Grow the allocation while keeping the alignment; jemalloc only
       copies the data if the block cannot be expanded in place. */
    ptr = rallocx(ptr, size, MALLOCX_ALIGN(64));
}
dallocx(ptr, MALLOCX_ALIGN(64));
To see the real-world benefits of using jemalloc, we look at the benchmarks in Arrow C++. There we have modeled a typical use case of incrementally building up an array of primitive values. For the build-up of the array, we don’t know the number of elements in the final array, so we need to continuously expand the memory region in which the data is stored. The code for this benchmark is part of the builder-benchmark in the Arrow C++ sources as BuildPrimitiveArrayNoNulls.
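The rough shape of that workload, sketched here in C on top of the jemalloc APIs (our own illustration; the actual benchmark uses the Arrow C++ builder classes, and the names below are hypothetical):

#include <jemalloc/jemalloc.h>
#include <stdint.h>
#include <stddef.h>

/* Append values one by one, doubling the 64-byte-aligned buffer on demand,
   roughly mirroring what the builder benchmark exercises. */
typedef struct {
    int64_t* data;
    size_t length;
    size_t capacity;
} Int64Vector;

void int64_vector_init(Int64Vector* v) {
    v->capacity = 1024;
    v->length = 0;
    v->data = (int64_t*)mallocx(v->capacity * sizeof(int64_t), MALLOCX_ALIGN(64));
}

void int64_vector_append(Int64Vector* v, int64_t value) {
    if (v->length == v->capacity) {
        v->capacity *= 2;
        /* rallocx keeps the 64-byte alignment and may grow in place. */
        v->data = (int64_t*)rallocx(v->data, v->capacity * sizeof(int64_t),
                                    MALLOCX_ALIGN(64));
    }
    v->data[v->length++] = value;
}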
Runtimes without jemalloc:
BM_BuildPrimitiveArrayNoNulls/repeats:3 636726 us 804.114MB/s
BM_BuildPrimitiveArrayNoNulls/repeats:3 621345 us 824.019MB/s
BM_BuildPrimitiveArrayNoNulls/repeats:3 625008 us 819.19MB/s
BM_BuildPrimitiveArrayNoNulls/repeats:3_mean 627693 us 815.774MB/s
BM_BuildPrimitiveArrayNoNulls/repeats:3_median 625008 us 819.19MB/s
BM_BuildPrimitiveArrayNoNulls/repeats:3_stddev 8034 us 10.3829MB/s
Runtimes with jemalloc:
BM_BuildPrimitiveArrayNoNulls/repeats:3 630881 us 811.563MB/s
BM_BuildPrimitiveArrayNoNulls/repeats:3 352891 us 1.41687GB/s
BM_BuildPrimitiveArrayNoNulls/repeats:3 351039 us 1.42434GB/s
BM_BuildPrimitiveArrayNoNulls/repeats:3_mean 444937 us 1.21125GB/s
BM_BuildPrimitiveArrayNoNulls/repeats:3_median 352891 us 1.41687GB/s
BM_BuildPrimitiveArrayNoNulls/repeats:3_stddev 161035 us 371.335MB/s
The benchmark was run three times for each configuration to see the performance differences. The first run in each configuration yielded the same performance, but in all subsequent runs the version using jemalloc was about twice as fast. In these cases, the memory region that was used for constructing the array could be expanded in place without moving the data around. This was possible because there were memory pages assigned to the process that were unused but not yet reclaimed by the operating system. Without jemalloc, we cannot make use of them, simply because the default allocator has no API that provides aligned reallocation.
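A small probe (illustration only, not part of Arrow) makes this visible: comparing the pointer before and after rallocx shows whether jemalloc could expand the allocation in place or had to move, and therefore copy, the data.

#include <jemalloc/jemalloc.h>
#include <stdio.h>

int main(void) {
    size_t size = 128 * 1024;
    void* ptr = mallocx(size, MALLOCX_ALIGN(64));
    for (int i = 0; i < 10; i++) {
        void* old = ptr;
        size *= 2;
        ptr = rallocx(ptr, size, MALLOCX_ALIGN(64));
        printf("grew to %zu bytes: %s\n", size,
               ptr == old ? "in place" : "moved (data copied by jemalloc)");
    }
    dallocx(ptr, MALLOCX_ALIGN(64));
    return 0;
}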