Introduction

The printf family of functions is commonly used on size-constrained platforms for debugging, logging, and string formatting. Code for these platforms is commonly written in C, where type-safe APIs like iostreams and fmt are unavailable. Even in C++, these APIs have something of a reputation for increasing code size over printf, since C++ template monomorphization typically occurs at the point of use. By comparison, calls to printf incur only C variable-argument-list overhead. Uses tend to outnumber definitions, so overall code size tends to suffer with the type-safe approaches unless there are quite few format calls.

However, while call sites are compact, a printf implementation must include every possible aspect of string formatting in the final binary. Real programs commonly use very little of that functionality. Runtime-variable format strings are quite rare, and clang already statically analyzes the contents and semantics of format strings to turn printf calls into e.g. puts calls. However, without the cooperation of libc, its transformations are limited to those expressible in the public C API.

Weak Implementation Functions

The pairing of clang and a statically-linked llvm-libc presents an opportunity to develop transformations that would allow dropping unused printf implementation code at a finer granularity. Here is an approach to doing so suggested in an earlier Discourse thread. (This was also mentioned as having been used in an ARM toolchain.)

The basic idea is that clang would analyze printf calls with statically-known format strings and emit a series of strong, but vacuous, references to symbols declaring that various printf format-string characteristics are present. The call to printf would then be rewritten to an alternative entry point that is allowed to drop any aspect of its implementation not required by a characteristic present in the link.

For example, the following call:

printf("%f", 42.0);

would be rewritten to the equivalent of:

__printf_core("%f", 42.0);
asm(".globl __printf_f");

Here __printf_core is the printf implementation, and __printf_f is a symbol declaring that the format string contains a %f specifier. The implementation of __printf_core could freely place the implementation functions needed for the %f specifier in a translation unit that defines that symbol. If those implementation functions were only weakly referenced by __printf_core, they would not be brought into the link unless some call actually required them.

printf itself would be a thin wrapper around __printf_core, but it would also declare that all possible printf characteristics are present. This provides a conservative way to opt out of the feature.

llvm-libc Implementation

llvm-libc is actually quite well suited to this; most of the meat of the implementation is hidden behind convenient convert_<type> functions. Making these weak should provide most of the benefit of this approach with relatively little effort. The format specifier parsing logic is separate, and it seems like there’s relatively little opportunity to break parts of it out. It would be nice to have data on how much size removing all conversion functions from printf saves.

Extensibility

Rewriting a call from printf to __printf_core establishes a contract between the compiler and libc implementation that supports this feature. Extending this contract isn’t trivial.

For example, say the only characteristic is “supports floats” with symbol __printf_float. A __printf_core with this contract must retain float-related implementation if __printf_float is present in the link.

Say we later wanted to make this finer-grained, adding __printf_f, __printf_g, and __printf_e for individual specifiers. If the newer compiler no longer emitted __printf_float, it would make an older libc crash on %f.

Conversely, say we made the newer compiler still emit __printf_float. A libc would want to take the presence of __printf_float but the absence of __printf_f as evidence that %f isn't used. This wouldn't hold for objects from an older compiler, which only knows about __printf_float.

One way around this would be to include a version identifier to disambiguate the contract, e.g. __printf_core_v1. This problem seems akin to other symbol versioning concerns in compilers, but I have little direct experience with these, so advice here would be helpful.

Another option is to get it right the first time. Famous last words though.

Other Possible Optimizations

  • iprintf and small_printf are pre-existing alternative entry points for printf that preclude certain implementation characteristics (floats and large integers, respectively). This is a coarser mechanism than the one presented here, since it doesn’t compose well.
  • The core implementation of printf could be inlined by the compiler into a series of calls to a new cluster of API routines. This would essentially turn a printf call into an iostreams/fmt call at compile time. This may be appropriate in some cases, but it would likely come with a size cost if applied generally.

The GPU case has a custom printf using the RPC interface. The sprintf family of functions, however, is problematic because it uses function pointers to carry a callback for how to treat the user-provided buffer. I had a patch up in [libc] Template the writing mode for the writer class by jhuber6 · Pull Request #111559 · llvm/llvm-project · GitHub; however, this creates concerns about bloating the binary size by duplicating the LUT. If these were split into different functions, maybe that could be avoided.

Something like Symbol metadata would be useful to communicate the printf entry points (or symbol remapping direction) to the linker (cc @snidertm). We’ve done something similar downstream at TI to optimize printf with our proprietary libc.

Ah, that’s a good point. It’s likely that more sophisticated trimming could be done with linker involvement; that’s another alternative. Involving the linker would extend the API between three logically-independent tools rather than just two though.

Also (no rush) ping to: @smithp35

Thanks for the ping. Will take a look next week.

One way around this would be to include a version identifier to disambiguate the contract, e.g. __printf_core_v1

That’s how we did it in the Arm toolchain you’re (surely) referring to. The idea is that if you make printf more modular in a later version of the toolchain, you introduce __printf_core_v2 which pulls in even less than v1 did, but then you keep the symbol __printf_core_v1 as well, so that older object files referring to that symbol can still depend on having all the parts of printf that that core implied. In general, if you link with a collection of object files built with different compiler versions and they pull in lots of different versions of __printf_core_vN, you get the union of all the printf subsets that each object expected to be able to depend on without specifically asking.

Other details of what we did:

We had an analyzer pass which handles printf with a constant format string, by reading and parsing that string. That can get a lot of very detailed information: not just which top-level format letters are used (%f, %d etc), but on which integer types (was there any %llx or only %x? %Lf or just %f?) and presence/absence of complicating flags (e.g. if nothing specified a field width then there’s no need to include the subroutine that pads the output to it).

For printf with a variable format string but a fixed argument list, we had a fallback pass that scans the list of argument types, which doesn’t give you as much information, but it’s still worth having – in particular you can still spot that no floats are used.

The weak point of the system is the case where the format string and the argument list are unknown at compile time. And unfortunately, that is common, e.g. lots of code bases contain wrapper functions like this …

void app_specific_log_function(LogContext *ctx, const char *fmt, ...) {
  char buf[LOG_MSG_MAX];
  va_list ap;
  va_start(ap, fmt);
  vsnprintf(buf, sizeof(buf), fmt, ap);
  write_log_message(ctx, buf);
  va_end(ap);
}

… and as soon as you have one of those, our system stops being able to exclude any part of printf.

Message localization is another confounder (when all your printf format strings come from a gettext lookup or equivalent), but less of one, because in that situation you can at least still use the list of argument types.

Thanks for posting. I don’t have anything to add over statham-arm for the description of Arm Compiler’s printf optimisation. We could probably go into more details of the implementation if that were helpful.

Thinking about the generic wrapper use case, I'm wondering if the format attribute could help (Attributes in Clang — Clang 20.0.0git documentation); it could identify the format string parameter in the wrapper.

Thinking about versioning:

In practice we've found that our printf optimisation stabilised quickly, and we haven't had a need to continuously change the interface. However, this may not generalise. The most common challenge we have faced is customers that have built relocatable objects with our proprietary compiler, supplied them as binaries to another project (this still happens in regulated industries), and had these linked against newlib or some other toolchain's C library, producing undefined symbol errors.

I think there’s some kind of policy decision could be made here about backwards compatibility for binary only relocatable objects. If we want to support it as an option, then we’ll need to have different symbol names (no symbol versioning in static linking), and we’ll need to keep each implementation in the static archive. If we don’t want to support it then we’ll need to advise people making binary only relocatable objects that need to persist to not use the optimisation.

I’ve not got any experience with actively using symbol versioning, only its implementation in the linker. I think symbol versioning is best for symbols that must keep the same symbol name, but may change interface (API/ABI) in incompatible ways between releases. An implementation of each version exists in the shared object for backwards compatibility, but only the latest (called default) version is bound to at static link time. This only works for dynamic linking.

I don’t think symbol versioning is required in this instance as we are free to choose our own symbol names for private implementation details. To get an advantage from symbol versioning we’d need to make sure that the private interface to the printf implementation remained the same, using the same symbol names. If we are talking about changing the granularity of the interface in future versions I don’t think we could guarantee that we’d stick to the same symbol names.

In summary I think symbol versioning isn’t likely to be that helpful in this case.

The proposed implementation sounds feasible, but is this complexity really worth it?

In musl’s printf implementation, I measured the cost at ~8K additional size for floating-point support in printf (which included soft-float support for 128-bit long double), on top of ~6K for integer-only printf. It seems likely – without having measured – that llvm-libc is not nearly as size-efficient. Maybe size-optimizing llvm-libc within the current interfaces, first, would be a better goal – e.g. to get the baseline size for printf down to something near 14K?

In any case, this proposal should provide actual numbers for the expected savings.

A few thoughts on the wrapper case, expanding on @smithp35’s reference to the format attribute.

The same concerns apply to wrappers as to printf-family functions; there could be calls invisible to the compiler, such as through assembly, and there could be runtime format strings given to it. Accordingly, there would intrinsically need to be both a core and a non-core version of the wrapper too; the former would be able to call __printf_core_vN, while the latter would not.

In any TU that contains a call to the wrapper, the compiler would need to be made aware that a core version of the wrapper is available with the proper semantics. An attribute on the declaration in the header seems like a reasonable mechanism for this. This actually seems to be a straight generalization of the mechanism used for the printf function; it seems plausible that the implementation of printf in llvm-libc (or others) could itself use this mechanism to advertise the alternative version.

There’s a credible chance this doesn’t need to be special cased to get the gains. We have almost all the pieces needed already in place. I’m available to fill in the pieces if helpful, this is all directly of interest to me.

If libc is statically linked bitcode, there's a decent chance the entire binary is statically linked; that is, it doesn't care about the platform ABI. If you compile with -expand-variadics-override=lowering, there is an IR pass which (if it knows your target) will replace all calls to variadic functions with calls to equivalent va_list-taking functions. At that point, function inlining, specialisation, and so forth can kick in. Actually, even if you don't pass that flag, printf probably isn't called via function pointers very often, so it gets transformed anyway.

Specialising vprintf implementations with respect to the format string seems pretty likely to yield the same code that this RFC is requesting, without any special casing.

Missing pieces are likely to be:

  • the variadic lowering probably doesn’t know your architecture (but if you tell me which one you care about, I’ll add it)
  • ideally we’d have a lookup table for printf → vprintf and similar, for where we don’t need to create a new function taking a va_list. That’s just something we haven’t got around to
  • the function specialiser might ignore printf, but if it does that’s likely to be fixable for llvm libc

@jhuber6 thoughts on tagging printf as always-specialise on known format string? There’s a whole lot of control flow in there which I expect to burn out when the format string is known, and always-specialise should be available as an attribute soonish

Sounds reasonable; we currently leave a lot of transformations on the table when it comes to printf. It might be a little difficult for the compiler to see all the way through these calls, however, and there's the issue that without special casing we'd still need to load the multi-megabyte float conversion tables just to optimize them out, which has some link-time implications.

In musl’s printf implementation, I measured the cost at ~8K additional size for floating-point support in printf (which included soft-float support for 128-bit long double), on top of ~6K for integer-only printf.

Even if we saved 8K, that’s a lot! One of the motivating examples we’ve hit with a partner was a bootloader that needed to fit into 24K and used printf for logging. Even if it had a more moderate 128K available, that’s 6.25% of the available space.

Anecdotally, that’s one of the major motivations for folks wanting to massage printf in particular. On embedded systems, it’s often the library that pulls in the most code for the least benefit (judged against how it’s actually used).

In any case, this proposal should provide actual numbers for the expected savings.

100% agree; I’m spending some time today to build llvm-libc with conversion functions stubbed out.

To be clear, I would expect that any numbers you show will only go to show that llvm-libc is currently not at all suited for use on an ultra-tiny-code system.

I’d guess that llvm-libc’s printf configured for integers-only will already be larger than the full printf implementation from musl or other similar small libc implementations.

But certainly the float-formatting code will be large, since it currently depends on the fairly-hefty ryu constant tables. That’s great if you care about performance – but if you mainly care about code-size, they’re way too large. For such use-cases, it’d be nice if there was a size-optimized build mode available, which chooses slower but smaller-code-size algorithms.

Some quick data:

An x86-64 (for convenience) llvm-libc printf with the float call stubbed out comes to around 16K of text and 8K of rodata. The rodata appears to be the strerror tables.

Pulling out all conversion except that needed for %d leads to a size of around 14K, all things considered.

This was just done by commenting out the convert functions and building a “%d” hello world; this is an admittedly very lazy methodology.

For embedded/size-conscious targets, LLVM-libc has several options to better optimize for size beyond just disabling floats. The ones I expect could be applied without anyone noticing are disabling %n, %m (the strerror one), and index mode (a POSIX extension). I'm not sure what the size would be after applying those flags, but you can find and play with them here: llvm-project/libc/config/config.json at main · llvm/llvm-project · GitHub

For shrinking floats without disabling them entirely there’s the Dyadic Float option which skips the Ryu tables, in exchange for only printing up to ~50 digits of the requested value. This isn’t generally a problem in practice, but it can make some unit tests fail.

For how much size saving you’d get from disabling various non-float conversions, I’d expect %m to be the biggest (as @mysterymath’s evidence seems to suggest). Other than that the conversion code just isn’t very large. The source for int_converter.h is ~200 LOC, string_converter.h is less than 100, and most of the other converters end up being special cases of those two (e.g. %p is a hex int conversion, or a string conversion for nullptr).

Maybe size-optimizing llvm-libc within the current interfaces, first, would be a better goal – e.g. to get the baseline size for printf down to something near 14K?

So, regarding this point specifically, I don’t think it will ever be possible to get a float-bearing printf down that small on the most size-sensitive platforms, since they typically don’t include an FPU. It’s also often the case that printf is the only float-using code in the codebase, so the full size of the soft float library would be properly accounted as an implementation concern of printf.

That specific concern could be addressed by the lighter-weight iprintf transform. It does seem like llvm-libc's iprintf would still be pretty big, though, and this is only one of a number of ways that we could get that size down further.

I took a quick look at the printf implementation, and there isn't a lot of glaringly obvious fat to trim. Embedded printfs are pretty famously unreadable, and work to get the overall size down may involve finicky optimizations in that direction. That's one of the attractive things about a mechanism like this one: it's a more black-box way to trim down a printf implementation.

One correction to a point I made earlier:

The 24K example we ran into didn't go over size due to float, but rather due to fixed-point printf support. This is useful on embedded, but it didn't end up being used in that particular bootloader.

One more observation: removing index mode, as per @michaelrj-google's suggestion, decreases the %d-only size all the way from 14K to 10K. I'd definitely argue that a highly size-constrained printf for a non-POSIX system shouldn't contain difficult POSIX extensions by default.

Okay, I’ll stop replying to myself after this :wink:

I spent some more time scrubbing things out of printf that we could most likely get rid of with this optimization. First, I removed parsing for widths, precisions, and length modifiers. Then, I changed instances of inline_memxxx to memxxx. This brought the %d-only size down to 4.5K. For comparison, the no-float mpaland embedded printf is 4.4K. So I'd say supporting size-constrained systems is an achievable goal for llvm-libc's printf: still bigger, but it can be brought down to a similar order of magnitude.

I’m not very satisfied with printf("%d", 42) taking 4.5K, though, compared to how one would implement this by hand. Even without constant-folding the whole printf implementation down to a write syscall, the vast majority of the remaining implementation is still totally dead.

Code written in the style of printf is in one way ideal for code size: runtime dispatch tends to produce vastly less duplicated code than monomorphization does, but it's harder for the compiler or linker to separately GC unused code.

It seems like in a full LTO build, similar to @JonChesterfield's suggestion, it should be possible to collect all possible arguments and format strings into classes, then run a form of abstract interpretation over printf to DCE code that cannot possibly be hit. This doesn't seem like the kind of thing that would be very specific to printf. I'd also suspect that the abstract-interpretation lattice needed for good results wouldn't be very onerous; even just "constant or unknown" might suffice.

Still, even such a solution would require LTO, and the ETA would probably be years. I don't think that future possibility makes this approach not worth considering, even if it's what we'd eventually want to push embedded folks towards.