Ridiculously fast base64 encoding and decoding

Computers store data as streams of bits. Binary files, such as images, audio, or video, can contain just about any sequence of bits.

However, we also often use text formats; for example, web pages and emails are required to be text. So how do we send images by email? How do we embed images within web pages? One possibility is to point to a distinct binary file. Another common approach is to directly embed the binary data within the web page or the email using base64. Base64 is just a standard text format that can be used to code any binary input. To be precise, a base64 code is always a valid ASCII text (and thus, it is also valid UTF-8). Each byte of base64 code contains 6 bits of data. Thus we “waste” about 2 bits per byte. Hence the base64 equivalent of a binary file is about 33% larger. In practice, this size increase is rarely a source of concern. As far as I know, email attachments are almost always encoded as base64.
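
To make the arithmetic concrete, here is a minimal C sketch of a base64 encoder (my own illustrative code, not any particular library): every group of 3 input bytes (24 bits) becomes 4 output characters carrying 6 bits each, and a trailing partial group is completed with “=” padding.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* The standard base64 alphabet: 64 symbols, so each character carries 6 bits. */
    static const char b64chars[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    /* Encode len bytes from src into dst (dst must have room for 4*ceil(len/3)+1
       characters). Returns the number of characters written. */
    static size_t base64_encode(const uint8_t *src, size_t len, char *dst) {
        size_t i = 0, o = 0;
        for (; i + 3 <= len; i += 3) {               /* full 3-byte groups */
            uint32_t triple = (uint32_t)src[i] << 16 | (uint32_t)src[i + 1] << 8 | src[i + 2];
            dst[o++] = b64chars[(triple >> 18) & 0x3F];
            dst[o++] = b64chars[(triple >> 12) & 0x3F];
            dst[o++] = b64chars[(triple >> 6) & 0x3F];
            dst[o++] = b64chars[triple & 0x3F];
        }
        if (i < len) {                               /* 1 or 2 leftover bytes */
            uint32_t triple = (uint32_t)src[i] << 16;
            if (i + 1 < len) triple |= (uint32_t)src[i + 1] << 8;
            dst[o++] = b64chars[(triple >> 18) & 0x3F];
            dst[o++] = b64chars[(triple >> 12) & 0x3F];
            dst[o++] = (i + 1 < len) ? b64chars[(triple >> 6) & 0x3F] : '=';
            dst[o++] = '=';
        }
        dst[o] = '\0';
        return o;
    }

    int main(void) {
        const char *msg = "Hello, World!";           /* 13 bytes */
        char out[32];
        base64_encode((const uint8_t *)msg, strlen(msg), out);
        printf("%s\n", out);                         /* prints SGVsbG8sIFdvcmxkIQ== */
        return 0;
    }

The 13 input bytes become 20 output characters; for long inputs the padding is negligible and the expansion settles at the 4/3 ratio discussed above.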

When writing HTML, I find it handy to encode my images directly in the HTML using a data URI. For example, in a recent post, I included a PNG file within my HTML code. Major websites like Google use data URIs all the time. The small downside is that the web pages are larger (obviously) and the embedded images cannot benefit from image caching. On the upside, you save the browser a separate network request.

If you are a web developer, you can use Web Storage to create a client-side database for your application. This client-side database can contain images and arbitrary data, but because Web Storage only stores strings, binary content must be encoded as text, typically in base64.

Most database engines support binary data, but several require that it be encoded as base64 at some point: MongoDB, Elasticsearch, Amazon SimpleDB, and Amazon DynamoDB. I am probably missing a few.

Base64 is commonly used in cryptography to exchange keys (PEM files, for example, are base64-encoded). A variant of base64 (base64url) is also used to pass arbitrary data as part of a URI.

Thankfully, encoding and decoding base64 is fast. Yet there are cases where it can become a problem. Matt Crane and Jimmy Lin found that decoding binary attributes from base64 in Amazon DynamoDB is slow.

How fast can you decode base64 data? On a recent Intel processor, it takes roughly 2 cycles per byte (from cache) when using a fast decoder like the one from the Chrome browser. This fast decoder is basically doing table lookups. That’s much slower than copying data within the cache (which takes less than 0.05 cycles per byte).
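
As an illustration of this table-lookup approach, here is a scalar sketch (mine, not Chrome’s actual code) that processes 4 characters at a time with one table lookup per character and folds validation into the table; “=” padding is ignored for brevity.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static int8_t d64[256];                          /* ASCII -> 6-bit value, -1 = invalid */

    static void init_table(void) {
        const char *b64chars =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
        memset(d64, -1, sizeof(d64));
        for (int i = 0; i < 64; i++)
            d64[(uint8_t)b64chars[i]] = (int8_t)i;
    }

    /* Decode a padding-free base64 string whose length is a multiple of 4.
       Returns the number of bytes written, or -1 if an invalid character is found. */
    static ptrdiff_t base64_decode(const char *src, size_t len, uint8_t *dst) {
        uint8_t *out = dst;
        for (size_t i = 0; i + 4 <= len; i += 4) {
            int a = d64[(uint8_t)src[i]],     b = d64[(uint8_t)src[i + 1]];
            int c = d64[(uint8_t)src[i + 2]], d = d64[(uint8_t)src[i + 3]];
            if ((a | b | c | d) < 0) return -1;      /* abort on illegal characters */
            uint32_t triple = (uint32_t)a << 18 | (uint32_t)b << 12 | (uint32_t)c << 6 | (uint32_t)d;
            *out++ = (uint8_t)(triple >> 16);
            *out++ = (uint8_t)(triple >> 8);
            *out++ = (uint8_t)triple;
        }
        return out - dst;
    }

    int main(void) {
        init_table();
        uint8_t buf[64];
        ptrdiff_t n = base64_decode("aGVsbG8gd29ybGQh", 16, buf);
        if (n > 0) fwrite(buf, 1, (size_t)n, stdout);  /* prints "hello world!" */
        putchar('\n');
        return 0;
    }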

Is this the best you can do?

Alfred Klomp showed a few years ago that you could do much better using vector instructions. Wojciech Muła, myself and a few others (namely Howard and Kurz) decided to seriously revisit the problem. Wojciech Muła has a web page on the topic.

We found that, in the end, you could speed up the decoding by a factor of ten and use about 0.2 cycles per byte on recent Intel processors using vector instructions. That’s still more than a copy, but much less likely to ever be a bottleneck. I should point out that this 0.2 cycles per byte includes error handling: the decoder must decode and validate the input (e.g., if illegal characters are found, the decoding should be aborted).
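
To give a flavour of the vectorized approach, here is a sketch (mine, not the code from the paper) of just the final bit-packing step, written in C with AVX2 intrinsics: 32 bytes, each holding an already-translated and validated 6-bit value, become 24 output bytes using two multiply-add instructions, a byte shuffle and a lane permutation. In the real decoder, the translation and validation steps are vectorized as well; here they are done with a scalar loop to keep the example short. Compile with -mavx2.

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Pack 32 bytes holding 6-bit values into 24 bytes (3 bytes per group of 4 values). */
    static __m256i pack_sixbit(__m256i values) {
        /* 16-bit lanes: (c0 << 6) | c1 and (c2 << 6) | c3 */
        __m256i merged = _mm256_maddubs_epi16(values, _mm256_set1_epi32(0x01400140));
        /* 32-bit lanes: (c0 << 18) | (c1 << 12) | (c2 << 6) | c3 */
        __m256i packed = _mm256_madd_epi16(merged, _mm256_set1_epi32(0x00011000));
        /* keep the 3 useful bytes of each 32-bit word, in big-endian order */
        packed = _mm256_shuffle_epi8(packed, _mm256_setr_epi8(
            2, 1, 0, 6, 5, 4, 10, 9, 8, 14, 13, 12, -1, -1, -1, -1,
            2, 1, 0, 6, 5, 4, 10, 9, 8, 14, 13, 12, -1, -1, -1, -1));
        /* compact the two 12-byte halves into 24 contiguous bytes */
        return _mm256_permutevar8x32_epi32(packed,
            _mm256_setr_epi32(0, 1, 2, 4, 5, 6, 7, 7));
    }

    int main(void) {
        const char *b64 = "YWJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4";   /* 32 characters */
        const char *b64chars =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
        uint8_t sixbit[32];
        for (int i = 0; i < 32; i++)   /* scalar translation, just for the demo */
            sixbit[i] = (uint8_t)(strchr(b64chars, b64[i]) - b64chars);
        uint8_t out[32];
        _mm256_storeu_si256((__m256i *)out,
            pack_sixbit(_mm256_loadu_si256((const __m256i *)sixbit)));
        fwrite(out, 1, 24, stdout);    /* prints "abcdefghijklmnopqrstuvwx" */
        putchar('\n');
        return 0;
    }

In the paper, the translation step itself is vectorized with nibble-based vpshufb lookups that simultaneously flag illegal characters.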

Our research code is available so you can reproduce our results. Our paper is available from arXiv and has been accepted for publication by ACM Transactions on the Web.

My understanding is that our good results have been integrated in Klomp’s base64 library.

Further reading:

Wojciech Muła and Daniel Lemire, Faster Base64 Encoding and Decoding Using AVX2 Instructions, ACM Transactions on the Web (preprint available on arXiv).

Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ).

14 thoughts on “Ridiculously fast base64 encoding and decoding”

    1. But you should warn about the use of AVX2.

      The paper is called: “Faster Base64 Encoding and Decoding Using AVX2 Instructions”.

      Unfortunately, the use of AVX2 severely throttles the CPU, which can cause system-wide performance issues.

      Intel reduces the turbo frequency depending on the instruction mix. On Skylake-X, AVX-512 instructions have a greater effect than heavy AVX2 instructions (those involving multiplications or floating-point operations). Simple AVX2 instructions can be used without any reduction in turbo frequency. The effect is tiny on processors with few active cores (e.g., 4), and unlikely to be measurable, but it is larger on wide chips with many active cores (e.g., 28).

      If you have a chip with many active cores (much more than 4), a CPU-heavy load, and a case where AVX-512 does not accelerate the computation much, then you can get a negative outcome. This is discussed in Intel’s optimization manual.

      The link you refer to describes exactly this scenario: they have 24-core processors with all cores active, and they use AVX-512 instructions.

      1. To be fair to the grandparent poster, the “normal” frequency of almost any recent Intel chip is totally irrelevant. The chip almost never runs at that speed. It’s almost always either “off” (in some non-zero C-state), running at the minimum frequency (i.e., the most efficient frequency, usually at Vmin, around 500-1000 MHz), or running at the maximum turbo frequency. Only rarely will you find it running at other frequencies between the minimum and the normal frequency, which usually happens during workload transitions.

        “Normal” (the frequency printed on the box) isn’t at all special here in terms of how often it is used: your chip probably runs at the “normal” frequency less than 1% of the time. If you want to know how fast your CPU will run something, the turbo frequency is essentially the only number you need to know (and the turbo ratio table for multiple running cores, unfortunately).

        Intel puts it on the box, probably for historical reasons and because of the confusing way the turbo ratio depends on the number of running cores: for my 4-core CPU they can either say “2.6 GHz” or “3.5/3.4/3.3/3.2 GHz”, and for a 28-core CPU, well…

        Intel also positions the normal frequency as the “guaranteed” frequency, but in practice this has almost no meaning today: except in very small form factors or with very poor cooling, you’ll generally run at the max turbo indefinitely, and if you get hot enough or draw too much current you can go below normal anyway, so essentially all frequencies are “if conditions permit”.

        Historically, and still to some extent today, the normal frequency was important for the power management API the chip offers: it exposes to the OS the ability to adjust the frequency between the minimum and normal frequencies, so normal was relevant there – for turbo speeds you had to let the hardware take control. Later on, the chips offered more control over turbo ratios too, but the interface (i.e., which MSRs you write and what you write to them) was totally different. These days the recommended mode of operation is “HWP” (hardware performance management), which essentially gives the CPU control over the whole frequency range (the P-states), so that distinction has mostly disappeared.

        I wanted to comment on the AVX2/AVX512 throttling too, since I think there is some misunderstanding above, but this is already long enough… 🙂

        I’m happy to add that part later if anyone is interested.

      1. It is pretty easy to prevent the compiler from emitting AVX2 instructions in the code you are compiling, using the appropriate compiler flags, but that’s only part of the story – you also have to check any third-party libraries you use, especially the C and C++ standard libraries, which almost everyone uses.

        The C library especially is almost always implemented with AVX2 for functions like memcpy, and you’ll often get these faster versions even if you didn’t compile with AVX2 flags (or even if you compiled before AVX2 existed) through the magic of runtime dispatch (including the runtime linker IFUNC magic; a short sketch is given at the end of this comment).

        Finally, even interrupts or other processes running on the same CPU (including at the same time on the sibling hyperthread) might decide to use AVX2, slowing down your whole CPU (the interrupt case is admittedly a bit of a stretch!).
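
        For readers who have not seen this mechanism, here is a minimal sketch of IFUNC-style dispatch with GCC/Clang on Linux (the function names are made up for the example, and the AVX2 path is only a stub): the resolver runs once, when the dynamic linker binds the symbol, and it picks an implementation based on the CPU it finds.

            #include <stddef.h>
            #include <stdio.h>

            static size_t count_chars_scalar(const char *s) {
                size_t n = 0;                      /* portable fallback */
                while (s[n] != '\0') n++;
                return n;
            }

            static size_t count_chars_avx2(const char *s) {
                /* a real version would use AVX2 intrinsics; we reuse the
                   scalar code so the sketch stays short and runnable */
                return count_chars_scalar(s);
            }

            /* The resolver returns a pointer to the implementation to use. */
            static size_t (*resolve_count_chars(void))(const char *) {
                __builtin_cpu_init();
                return __builtin_cpu_supports("avx2") ? count_chars_avx2
                                                      : count_chars_scalar;
            }

            /* Callers see one symbol; the dynamic linker binds it to whatever
               the resolver returned, with no per-call branch. */
            size_t count_chars(const char *s) __attribute__((ifunc("resolve_count_chars")));

            int main(void) {
                printf("%zu\n", count_chars("base64"));   /* prints 6 */
                return 0;
            }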

          1. Exactly, which is one of the tick marks in the column for “how a runtime-interpreted language like Java can be faster than a native compiled language like C”. That is, it can use CPU instructions that weren’t even invented when the source was compiled!

  1. Hi Daniel

    It would be really GREAT if you could make a “SIMD tutorial” for newcomers.

    As you said, there is very little information about how to use SIMD in practice.

    And please, if you decide to do so, use “C” for simplicity 🙂

    Best
    Mica

  2. Hi Daniel,

    Is sending images and video files in base64 format within the HTML file through a web API a good approach, compared with sending the HTML, images, and videos all in a zip file?

  3. Daniel,

    I’m a big fan of your work and love reading your blog posts, but in this one, Wojciech Muła seems to be the victim of double-UTF-8 encoding.

    After realizing that your blog has the appropriate charset META element and encoding, I saw something I’ve seen before.

    > Wojciech MuÅ‚a

    The broken code points are U+00C5 U+201A, but these came from a source that was originally UTF-8 and was encoded again as if it weren’t. U+201A comes from CP-1252’s mapping of the C1 control range, revealing that the original byte was 0x82.

    Properly stored, these should be the _bytes_ 0xC5 0x82, which decode properly to U+0142, or “ł”.
