The Big Array Size Survey for C (2024-11-06)

New in C2y is an operator that does something people have been asking us for, for decades: something that computes the size in elements (NOT bytes) of an array-like thing. This is a great addition and came from the efforts of Alejandro Colomar in N3369, and was voted into C2y during the recently-finished Minneapolis, MN, USA 2024 standardization meeting. But, there’s been some questions about whether we chose the right name or not, and rather than spend an endless amount of Committee time bikeshedding and arguing about this, I wanted to put this question to you, the user, with a survey! (Link to the survey at the bottom of the article.)

The Operator

Before we get to the survey (link at the bottom), the point of this article is to explain the available choices so you, the user, can make a more informed decision. The core of this survey is to provide a built-in, language-level name for the behavior of the following macro, named SIZE_KEYWORD:

#define SIZE_KEYWORD(...) (sizeof(__VA_ARGS__) / sizeof(*(__VA_ARGS__)))

int main () {
	int arfarf[] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
	return SIZE_KEYWORD(arfarf); // same as: `return 10;`
}

This is called nitems() in BSD-style C, ARRAY_SIZE() by others in C with macros, _countof() in MSVC-style C, std::size() (a library feature) and std::extent_v<...> in C++, len() in Python, ztdc_size() in my personal C library, extent in Fortran and other language terminology, and carries many other names both in different languages but also in C itself.

The survey here is not for the naming of a library-based macro (though certain ways of accessing this functionality could be through a macro): there is consensus in the C Standard Committee to make this a normal in-language operator so we can build type safety directly into the language operator rather than come up with increasingly hideous uses of _Generic to achieve the same goal. This keeps compile-times low and also has the language accept responsibility for things that it, honestly, should’ve been responsible for since 1985.

This is the basic level of knowledge you need to access the survey and answer. Further below is an explanation of each important choice in the survey related to the technical features. We encourage you to read this whole blog article before accessing the survey to understand the rationale. The link is at the bottom of this article.

The Choices

The survey has a few preliminary questions about experience level and current/past usage of C; this does not necessarily change how impactful your choice selection will be! It just might reveal certain trends or ideas amongst certain subsets of individuals. It is also not meant to be extremely specific or even all that deeply accurate. Even if you’re not comfortable with C, but you are forced to use it at your Day Job because Nobody Else Will Do This Damn Work, well. You may not like it, but that’s still “Professional / Industrial” C development!

The core of the survey, however, revolves around two choices:

  • the usage pattern required to get to said operator/keyword;
  • and, the spelling of the operator/keyword itself.

There are several spellings, and three usage patterns. We’ll elucidate the usage patterns first, and then discuss the spellings. Given that this paper and feature were already accepted into C2y, but that C2y has only JUST started and is still in active development, the goal of this survey is to determine if the community has any sort of preference for the spelling of this operator. Ideally, it would have been nice if people saw the papers in the WG14 document log and made their opinions known ahead of time, but this time I am doing my best to reach out to everyone via this article and the survey that is linked at the bottom of the article.

Usage Pattern

Using SIZE_KEYWORD as in the first code sample, this section explains the three usage patterns and their pros and cons. The program is always meant to return 42.

const double barkbark[] = { 0.0, 0.5, 7.0, 14.7, 23.3, 42.0 };

static_assert(SIZE_KEYWORD(barkbark) == 6, "must have a size of 6");

int main () {
	return (int)barkbark[SIZE_KEYWORD(barkbark) - 1];
}

Underscore and capital letter _Keyword; Macro in a New Header

This technique is a common, age-old way of providing a feature in C. It avoids clobbering the global user namespace with a new keyword that could collide with user-defined or standards-defined macros (from e.g. POSIX, or macros that already exist in your headers). A keyword still exists, but it’s spelled with an underscore and a capital letter to prevent any breakage. The user-friendly, lowercase name is only added through a new macro in a new header, so as to prevent breaking old code. Some notable features that USED to be like this:

  • _Static_assert/static_assert with <assert.h>
  • _Alignof/alignof with <stdalign.h>
  • _Thread_local/thread_local with <threads.h>
  • _Bool/bool with <stdbool.h>

As an example, it would look like this:

#include <stdkeyword.h>

const double barkbark[] = { 0.0, 0.5, 7.0, 14.7, 23.3, 42.0 };

_Static_assert(keyword_macro(barkbark) == 6, "must have a size of 6");

int main () {
	return (int)barkbark[_Keyword(barkbark) - 1];
}

Underscore and capital letter _Keyword; No Macro in Header

This is a newer way of providing functionality where no effort is made to provide a nice spelling. It’s not used very often, except in cases where people expect that the spelling won’t be used often or the lowercase name might conflict with an important concept that others deem too important to take for a given spelling. This does not happen often in C, and as such there’s really only one prominent example that exists in the standard outside of extensions:

  • _Generic; no macro ever provided in a header

As an example, it would look like this:

// no header
const double barkbark[] = { 0.0, 0.5, 7.0, 14.7, 23.3, 42.0 };

static_assert(_Keyword(barkbark) == 6, "must have a size of 6");

int main () {
	return (int)barkbark[_Keyword(barkbark) - 1];
}

Lowercase keyword; No Macro in Header

This is the boldest way of providing functionality in the C programming language. Oftentimes, this does not happen in C without a sister language like C++ bulldozing code away from using specific lowercase identifiers. It can also happen if a popular extension dominates the industry and makes it attractive to keep a certain spelling. Technically, everyone acknowledges that the lowercase spelling is what we want in most cases, but we settle for the other two solutions because adding keywords of popular words tends to break somebody’s code. That leads to a lot of grumbling and pissed-off developers who view code being “broken” in this way as annoying busywork added onto their workloads. For C23, specifically, a bunch of things were changed from the _Keyword + macro approach to using the lowercase name, since C++ had already effectively turned them into reserved names:

  • true, false, and bool
  • thread_local
  • static_assert
  • alignof
  • typeof (already an existing extension in many places)

As an example, it would look like this:

// no header
const double barkbark[] = { 0.0, 0.5, 7.0, 14.7, 23.3, 42.0 };

static_assert(keyword(barkbark) == 6, "must have a size of 6");

int main () {
	return (int)barkbark[keyword(barkbark) - 1];
}

Keyword Spellings

By far the biggest war over this is not with the usage pattern of the feature, but the actual spelling of the keyword. This prompted a survey from engineer Chris Bazley at ARM, who published his results in N3350 Feedback for C2y - Survey results for naming of new nelementsof() operator. The survey here is not going to query the same set of names, but only the names that seemed to have the most discussion and support in the various e-mails, Committee Meeting discussion, and other drive-by social media / Hallway talking people have done.

Most notably, these options are presented as containing both the lowercase keyword name and the underscore-capitalized _Keyword name. Specific combinations of spelling and usage pattern can be given later during an optional question in the survey, along with any remarks you’d like to leave at the end in a text box that can handle a fair bit of text. There are only 6 names, modeled after the most likely spellings in the style of the sizeof operator. If you have another name you think is REALLY important, please add it at the end in the comments section. Some typical names not included, with the reasoning:

  • size/SIZE is too close to sizeof and this is not a library function; it would also bulldoze over pretty much every codebase in existence and jeopardize other languages built on top of / around C.
  • nitems/NITEMS is a BSD-style way of spelling this and we do not want to clobber that existing definition.
  • ARRAY_SIZE/stdc_size and similar renditions are not provided because this is an operator exposed through a keyword and not a macro, but even then array_size/_Array_size were deemed too awkward to spell.
  • dimsof/dimensionsof was, similarly, not all that popular and dimensions as a word did not convey the meaning very appropriately to begin with.
  • Other brave but unfortunately unmentioned spellings that did not make the cut.

The options in the survey are as below:

lenof / _Lenof

A very short spelling that utilizes the word “length”, but shortened in the typical C fashion. Very short and easy to type, and it also fits in with most individual’s idea of how this works. It is generally favored amongst C practitioners, and is immediately familiar to Pythonistas. A small point of contention: doing _Lenof(L"barkbark") produces the answer “9”, not “8” (the null terminator is counted, just as in sizeof("barkbark")). This has led some to believe this would result in “confusion” when doing string processing. It’s unclear whether this worry is well-founded in any data and not just a nomenclature issue.

As “len” and lenof are popular in C code, this one would likely need an underscore-capital-letter keyword and a macro to manage its introduction, but it is short.

const double barkbark[] = { 0.0, 0.5, 7.0, 14.7, 23.3, 42.0 };

static_assert(_Lenof(barkbark) == 6, "must have a length of 6");

int main () {
	return (int)barkbark[lenof(barkbark) - 1];
}

lengthof / _Lengthof

This spelling won in Chris Bazley’s ARM survey of the 40 highly-qualified C/C++ engineers and is popular in many places. Being spelled out fully seems to be of benefit and heartens many users who are sort of sick of a wide variety of C’s crunchy, forcefully shortened spellings like creat (or len, for that matter, though len is much more understood and accepted). It is the form that was voted into C2y as _Lengthof, though it’s noted that the author of the paper that put _Lengthof into C is strongly against its existence and thinks this choice will encourage off-by-one errors (similarly to lenof discussed above). Still, it seems like both the least hated and most popular among the C Committee and the adherents who had responded to Alejandro Colomar’s GCC patch for this operator. Whether it will continue to be popular with the wider community has yet to be seen.

As “length” and lengthof are popular in C code, this one would likely need an underscore-capital-letter keyword and a macro to introduce it carefully into existing C code.

const double barkbark[] = { 0.0, 0.5, 7.0, 14.7, 23.3, 42.0 };

static_assert(_Lengthof(barkbark) == 6, "must have a length of 6");

int main () {
	return (int)barkbark[lengthof(barkbark) - 1];
}

countof / _Countof

This spelling is a favorite of many people who want a word shorter than length but still fully spelled out that matches its counterpart size/sizeof. It has strong existing usage in codebases around the world, including a definition of this macro in Microsoft’s C library. It’s favored by a few on the C Committee, and I also received an e-mail about COUNT being provided by the C library as a macro. It was, unfortunately, not polled in the ARM survey. It also conflicts with C++’s idea of count as an algorithm rather than an operation (C++ just uses size for counting the number of elements). It is dictionary-definition accurate to what this feature is attempting to do, and does not typically come with the off-by-one concerns associated with strings and “length”.

As “count” and countof are popular in C code, this too would need some management in its usage pattern to make it available everywhere without getting breakage in some existing code.

const double barkbark[] = { 0.0, 0.5, 7.0, 14.7, 23.3, 42.0 };

static_assert(_Countof(barkbark) == 6, "must have a count of 6");

int main () {
	return (int)barkbark[countof(barkbark) - 1];
}

nelemsof / _Nelemsof

This spelling is an alternative to nitems() from BSD (chosen so as not to take nitems from BSD). nelemsof is also seen as the short, cromulent spelling of another suggestion in this list, nelementsof. It is a short spelling with no separator between n and elems, but it emphasizes that this is the number of elements being counted and not anything else. The n is seen as a universal letter for the count of things, and most people who encounter it understand it readily enough. It avoids off-by-one concerns by not being associated with strings in any manner, though n being a common substitution for “length” might bring this up in a few people’s minds.

As “nelems” and nelemsof are popular in C code, this too would need some management in its usage pattern to make it available everywhere without breaking some existing code.

const double barkbark[] = { 0.0, 0.5, 7.0, 14.7, 23.3, 42.0 };

static_assert(_Nelemsof(barkbark) == 6, "must have a length of 6");

int main () {
	return (int)barkbark[nelemsof(barkbark) - 1];
}

nelementsof / _Nelementsof

This is the long spelling of the nelemsof option just prior. It is the preferred name of the author of N3369, Alejandro Colomar, before WG14 worked to get consensus to change the name to _Lengthof for C2y. It’s a longer name that very clearly states what it is doing, and all of the rationale for nelemsof applies.

This is one of the only options with a name so long and unusual that it shows up absolutely nowhere that matters. It can be standardized without fear as nelementsof with no macro version whatsoever, straight up becoming a keyword in the core C language without any macro/header song-and-dance.

const double barkbark[] = { 0.0, 0.5, 7.0, 14.7, 23.3, 42.0 };

static_assert(nelementsof(barkbark) == 6, "must have a length of 6");

int main () {
	return (int)barkbark[nelementsof(barkbark) - 1];
}

extentof / _Extentof

During the discussion of the paper at the Minneapolis 2024 meeting, there was a surprising amount of in-person vouching for the name extentof. Its proponents also envisioned it coming with a form that lets you pass in which dimension of a multidimensional array you want the extent of, similar to C++’s std::extent_v and std::rank_v. Choosing this name comes with the implicit understanding that additional work would be done to furnish a rankof/_Rankof (or similarly spelled) operator for C as well, to allow for better programmability over multidimensional arrays. This option tends to appeal to Fortran users and mathematically-minded individuals in general conversation, and has a certain appeal among older folks for some reason I have not been able to pin down in my observations and discussions; whether or not this will hold broadly in the C community is anyone’s guess.

As “extent” and extentof are popular in C code, this one would likely need a macro version with an underscore-capital-letter keyword, but the usage pattern can be introduced gradually and gracefully.

const double barkbark[] = { 0.0, 0.5, 7.0, 14.7, 23.3, 42.0 };

static_assert(_Extentof(barkbark) == 6, "must have an extent of 6");

int main () {
	return (int)barkbark[extentof(barkbark) - 1];
}

The Survey

Here’s the survey: https://www.allcounted.com/s?did=qld5u66hixbtj&lang=en_US.

There is an optional question at the end of the survey, before the open-ended comments, that allows for you to also rank and choose very specific combinations of spelling and feature usage mechanism. This allows for greater precision beyond just answering the two core questions, if you want to explain it.

Employ your democratic right to have a voice and inform the future of C, today!

Good Luck! 💚

5 Years Later: The First Win (2024-10-08)

N3366 - Restartable Functions for Efficient Character Conversions has made it into the C2Y Standard (A.K.A. “the next C standard after C23”). And one of my longest struggles — the sole reason I actually came down to the C Standards Committee in the first place — has come to a close.

Yes.

When I originally set out on this journey, it was over 6 years ago in the C++ Unicode Study Group, SG16. I had written a text renderer in C#, and then in C++. As I attempted to make that text renderer cross-platform in the years leading up to finally joining Study Group 16, I kept running into the disgustingly awful APIs for doing text conversions in C and C++. Why was getting e.g. Windows Command Line Arguments into UTF-8 so difficult in standard C and C++? Why were the C standard functions on a default-rolled Ubuntu LTS at the time handing me data with the accent marks stripped off? It was terrible. It was annoying. It didn’t make sense.

It needed to stop.

Originally, I went to C++. But the more I talked and worked in the C++ Committee, the more I learned that they weren’t exactly as powerful or as separate from C as they kept claiming. This was especially when it came to the C standard library, where important questions about wchar_t, the execution encoding, and the wide execution encoding were constantly punted to the C standard library rather than changed or mandated in C++ to be better. Every time I wanted to pitch the idea of just mandating a UTF-8 execution encoding by default, or a UTF-8 literal encoding by default, I just kept getting the same qualms: “C owns the execution encoding” and “C owns the wide encoding” and “C hasn’t really separated wchar_t from its historical mistakes”. And on and on and on. So, well.

I went down there.

Of course, there were even more problems. Originally, I had proposed interfaces that looked fairly identical to the existing set of functions already inside of <wchar.h> and <uchar.h>. This was, unfortunately, a big problem: the existing design, as enumerated in presentation after presentation and blog post after blog post, was truly abysmal. These 1980s/1990s functions are wholly incapable of handling the encodings that were present even in 1980, and due to certain requirements on types such as wchar_t we ended up creating problematic functions with unbreakable Application Binary Interfaces (ABIs).

During a conversation on what is now very-very-old Twitter, I was expressing my frustration about these functions and how they’re fundamentally broken, but also how, if I wanted to see success, there was probably no other way to get the job done. After all, what is the most conservative and new-stuff-hostile language if not C, the language that’s barely responded to everything from world-shattering security concerns to unearthed poor design decisions for some 40 years at that point? And yet, Henri Sivonen pointed out that going that route was still just as bad: why would I standardize something I know is busted beyond all hope?

Contending with that was difficult. Why should I be made to toil due to C’s goofed up 1989 deficiencies? But, at the same time, how could I be responsible for continuing that failure into the future in-perpetuity? Neither of these questions was more daunting than the fact that what was supposed to be a “quick detour” into C would instantly crumble away if I accepted this burden. Doing things the right way meant I was signing up for not just a quick, clean, 1-year-max brisk journey, but a deep dungeon dive that could take an unknown and untold amount of time. I had to take a completely different approach from iconv and WideCharToMultiByte and uconvConvert and mbrtowc; I would need to turn a bunch of things upside down and inside out and come up with something entirely new that could handle everything I was talking about. I had to take the repulsive force of the oldest C APIs, and grasp the attractive forces of all of the existing transcoding APIs,

and unite them into something entirely different and powerful…



An anthropomorphic sheep wearing a purple robe with a blue scarf stares intently and directly at the viewer, pupils solid and without light with the whites of their eyes fully showing. Their hand is extended towards the viewer, with their thumb and pinky extended out while their ring and middle fingers are curled in. The index finger is curled in, but less so, and rests on top of the ring and middle fingers, triggering the ancient Imaginary Technique. Bright light emits from the meeting point of the index, ring, and middle fingers just above the palm, ready to unleash the Great Energy.
Imaginary Technique: Cuneicode

Henri was right.

It took a lot out of me to make this happen. But, I made it happen. Obviously, it will take some time for me to make the patches to implement this for glibc, then musl-libc. I don’t quite remember if the Bionic team handling Android’s standard library takes submissions, and who knows if Apple’s C APIs are something I can contribute usefully to. Microsoft’s C standard library, unlike its C++ one, is also still proprietary and hidden. Microsoft still does a weird thing where, on some occasions, it completely ignores its own Code Page setting and just decides to use UTF-8 only, but only for very specific functions and not all of them.

I GENUINELY hope Microsoft doesn’t make the mistake in these new functions to not provide proper conversions to UTF-8, UTF-16, and UTF-32 through their locale-based execution encoding. These APIs are supposed to give them all the room to do proper translation of locale-based execution encoding data to the UTFs, so that customers can rely on the standard to properly port older and current application data out into Unicode. They can use the dedicated UTF-8-to-UTF-16 and vice versa functions if needed. The specification also makes it so they don’t have to accumulate data in the mbstate_t except for radical stateful encodings, meaning there’s no ABI concerns for their existing stuff so long as they’re careful!

But Microsoft isn’t exactly required to listen to me, personally, and the implementation-defined nature of execution encoding gives them broad latitude to do whatever the hell they want. This includes ignoring their own OEM/Active CodePage settings and just forcing the execution encoding for specific functions to be “UTF-8 only”, while keeping it not-UTF-8 for other functions where it does obey the OEM/Active CodePage.

All in All, Though?

The job is done. The next target is for P1629 to be updated and to start attending SG16 and C++ again (Hi, Tom!). There’s an open question if I should just abandon WG14 now that the work is done, and it is kind of tempting, but for now… I’m just going to try to get some sleep in, happy in the thought that it finally happened.

We did it, chat.

A double-thanks to TomTom and Peter Bindels, as well as the Netherlands National Body, NEN. They allowed me to attend C meetings as a Netherlands expert for 5 years now, ensuring this result could happen. A huge thanks to all the Sponsors and Patrons too. We haven’t written much in either of those places so it might feel barren and empty but I promise you every pence going into those is quite literally keeping me and the people helping going.

And, most importantly, an extremely super duper megathanks to h-vetinari, who spent quite literally more than a year directly commenting on every update to the C papers in my repository and keeping me motivated and in the game. It cannot be overstated how much those messages and that review aided me in moving forward.

God Bless You. 💚

Improving _Generic in C2y (2024-08-01)

The first two meetings of C after C23 was finalized are over, and we have started working on C2y. We decided that this cycle we’re not going to do that “Bugfix” followed by “Release” stuff, because that proved to be a REALLY bad idea that killed a ton of momentum and active contributors during the C11 to C17 timeframe. So, this time, we’re hitting both bugfixes AND features so we can make sure we don’t lose valuable contributions and fixes by stalling for 5 to 6 years again. So, with that… on to fixes!

Generic Selection, a Primer

_Generic — the keyword used for the feature known as Generic Selection — is a deeply hated C feature that everyone likes to dunk on for being both too much and not good enough at the same time. It was introduced in C11, and the way it works is simple: you pass in an expression, it figures out the type of that expression, and it allows you to match on that type. With each match, you can insert an expression that will be executed, thereby giving you the ability to effectively have “type-based behavior”. It looks like this:

int f () {
	return 45;
}

int main () {
	const int a = 1;
	return _Generic(a,
		int: a + 2,
		default: f() + 4
	);
}

As demonstrated by the snippet above, _Generic(...) is considered an expression itself. So it can be used anywhere an expression can be used, which is useful for macros (which was its primary reason for being). The feature was cooked up in C11 and was based off of a GCC built-in (__builtin_choose_expr) and an EDG special feature (__generic) available at the time, after a few papers came in that said type-generic macros were absolutely unimplementable. While C has a colloquial rule that the C standard library can “require magic not possible by normal users”, it was exceedingly frustrating to implement type-generic macros — specifically, <tgmath.h> — without any language support at all. Thus, _Generic was created and a language hole was patched out.

There are, however, 2 distinct problems with _Generic as it exists at the moment.

Problem 0: “l-value conversion”

One of the things the expression put into a _Generic expression undergoes is something called “l-value conversion” for determining the type. “l-value conversion” is a fancy “phrase of power” (POP) in the standard that means a bunch of things, but the two things we’re primarily concerned about are:

  • arrays turn into pointers;
  • and, qualifiers are stripped off.

This makes some degree of sense. After all, if we took the example above:

int f () {
	return 45;
}

int main () {
	const int a = 1;
	return _Generic(a,
		int: a + 2,
		default: f() + 4
	);
}

and said that this example returns 49 (i.e., that it takes the default: branch here because the int: branch doesn’t match), a lot of people would be mad. This helps _Generic resolve to types without needing to write something very, very convoluted and painful like so:

int f () {
	return 45;
}

int main () {
	const int a = 1;
	return _Generic(a,
		int: a + 2,
		const int: a + 2,
		volatile int: a + 2,
		const volatile int: a + 2,
		default: f() + 4
	);
}

In this way, the POP “l-value conversion” is very useful. But it also makes things harder: if you want to actually check whether something is const or has a specific type, you have to take the expression’s address and match against pointer types instead. Consider this TYPE_MATCHES_EXPR bit, Version Draft 0:

#define TYPE_MATCHES_EXPR(DESIRED_TYPE, ...) \
	_Generic((__VA_ARGS__),\
		DESIRED_TYPE: 1,\
		default: 0 \
	)

If you attempt to use it, it will actually just straight up fail due to l-value conversion:

static const int a;
static_assert(TYPE_MATCHES_EXPR(const int, a), "AAAAUGH!"); // fails with "AAAAUGH!"

We can use a trick of hiding the qualifiers we want behind a pointer to prevent “top-level” qualifiers from being stripped off the expression:

#define TYPE_MATCHES_EXPR(DESIRED_TYPE, ...) \
	_Generic(&(__VA_ARGS__),\
		DESIRED_TYPE*: 1,\
		default: 0\
	)

And this will work in the first line below, but FAIL for the second line!

static const int a;
static_assert(TYPE_MATCHES_EXPR(const int, a), "AAAAUGH!"); // works, nice!
static_assert(TYPE_MATCHES_EXPR(int, 54), "AAAAUGH!"); // fails with "AAAAUGH!"

In order to combat this problem, you can use typeof (standardized in C23) to add a little spice by creating a null pointer expression:

#define TYPE_MATCHES_EXPR(DESIRED_TYPE, ...) \
	_Generic((typeof((__VA_ARGS__))*)0,\
		DESIRED_TYPE*: 1,\
		default: 0\
	)

Now it’ll work:

static const int a;
static_assert(TYPE_MATCHES_EXPR(const int, a), "AAAAUGH!"); // works, nice!
static_assert(TYPE_MATCHES_EXPR(int, 54), "AAAAUGH!"); // works, yay!

But, in all reality, this sort of “make a null pointer expression!!” nonsense is esoteric, weird, and kind of ridiculous to learn. We didn’t have typeof when _Generic was standardized, so the next problem just happened as a natural consequence of “standardize exactly what you need to solve the problem”.

Problem 1: Expressions Only?!

The whole reason we need to form a pointer to the DESIRED_TYPE we want is to (a) avoid the consequences of l-value conversion and (b) have something that is guaranteed (more or less) to not cause any problems when we evaluate it. Aside from terrible issues with Variably-Modified Types/Variable-Length Arrays and all of the Deeply Problematic issues that come from being able to use side-effectful functions/expressions as part of types in C (even if _Generic guarantees it won’t evaluate the selection expression), this means forming a null pointer to something is the LEAST problematic way we can handle any given incoming expression with typeof.

More generally, however, this was expected to just solve the problem of “make type-generic macros in C to implement <tgmath.h>”. There was no other benefit, even if a whole arena of cool uses grew out of _Generic and its capabilities (including very, very basic type inspection / queries at compile-time). The input to type-generic macros was always an expression, and so _Generic only needed to take an expression to get started. There was also no standardized typeof, so there was no way to take the INPUT parameter or __VA_ARGS__ parameter of a macro and get a type out of it in standard C anyway. So, it only seemed natural that _Generic took only an expression. Naturally, as brains got thinking about things,

someone figured out that maybe we can do a lot better than that!

Moving the Needle

Implementers had, at the time, been complaining about not having a way to match on types directly without doing the silly pointer tricks above because they wanted to implement tests. And some of them complained that the standard wasn’t giving them the functionality to solve the problem, and that it was annoying to reinvent such tricks from first principles. This, of course, is at the same time that implementers were also saying we shouldn’t just bring papers directly to the standard, accusing paper authors of “inventing new stuff and not standardizing existing practice”. This, of course, did not seem to apply to their own issues and problems, for which they were happy to blame ISO C for not figuring out a beautiful set of features that could readily solve the problems they were facing.

But, one implementer then got a brilliant idea. What if they flexed their implementer muscles? What if they improved _Generic and reported on that experience without waiting for C standard to do it first? What if implementers fulfilled their end of the so-called “bargain” where they actually implemented extensions? And then, as C’s previous charters kept trying to promise (and then fail to deliver on over and over again over decades), what if those implementers then turned around to the C standard to standardize their successful existing practice so that we could all be Charter-Legal about all of this? After all, it would be way, WAY better than being perpetually frozen with fear that if they implemented a (crappy) extension they’d be stuck with it forever, right? It seems like a novel idea in this day and age where everything related to C seems conservative and stiff and boring. But?

Aaron Ballman decided to flex those implementer muscles, bucking the cognitive dissonance of complaining that ISO C wasn’t doing anything, not writing a paper, and not following up on his own implementation. He kicked off the discussion. He pushed through with the feature. And you wouldn’t believe it, but:

it worked out great.

N3260 - Generic Selection Expression with Type Operand

It’s as simple as the paper title: N3260 puts a type where the expression usually goes. Aaron got it into Clang in a few months, since it was such a simple paper and had relatively small wording changes. Using a type name rather than an expression in there, _Generic received the additional power to get direct matching with no l-value conversion. This meant that qualifier stripping — and more – did not happen. So we can now write TYPE_MATCHES_EXPR like so:

#define TYPE_MATCHES_EXPR(DESIRED_TYPE, ...) \
	_Generic(typeof((__VA_ARGS__)),\
		DESIRED_TYPE: 1,\
		default: 0\
	)

static const int a;
static_assert(TYPE_MATCHES_EXPR(const int, a), "AAAAUGH!"); // works, nice!
static_assert(TYPE_MATCHES_EXPR(int, 54), "AAAAUGH!"); // works, nice!

This code looks normal. Reads normal. Has no pointer shenanigans, no null pointer constant casting; none of that crap is included. You match on a type, you check for exactly that type, and life is good.

Clang shipped this quietly after some discussion and enabled it just about everywhere. GCC soon did the same in its trunk, because it was just a good idea. Using the flag -pedantic will have it be annoying about the fact that it’s a “C2y extension” if you aren’t using the latest standard flag, but this is C. You should be using the latest standard flag, the standard has barely changed in any appreciable way in years; the risk is minimal. And now, the feature is in C2y officially, because Aaron Ballman was willing to kick the traditional implementer Catch-22 in the face and be brave.

Thank you, Aaron!

The other compilers are probably not going to catch up for a bit, but now _Generic is much easier to handle on the two major implementations. It’s more or less a net win! Though, it… DOES provide for a bit of confusion when used in certain scenarios. For example, using the same code from the beginning of the article, this:

int f () {
	return 45;
}

int main () {
	const int a = 1;
	return _Generic(typeof(a),
		int: a + 2,
		default: f() + 4
	);
}

does not match on int anymore, IF you use the type-based match. In fact, it will match on default: now and consequently will call f() and add 4 to it to return 49. That’s gonna fuck some people’s brains up, and it will also expose some people to the interesting quirks and flaws about whether certain expressions — casts, member accesses, accesses into qualified arrays, etc. — result in specific types. We’ve already uncovered one fun issue in the C standard about whether this:

struct x { const int i; };

struct x f();

int main () {
	return _Generic(typeof(f().i),
		int: 1,
		const int: 2,
		default: 0
	);
}

will make the program return 1 or 2 (the correct answer is 2, but GCC and Clang disagree because of course they do). More work will need to be done to make this less silly, and I have some papers I’m writing to make this situation better by tweaking _Generic. _Generic, in general, still needs a few overhauls so it works better with the compatibility rules and also doesn’t introduce very silly undefined behavior with respect to Variable-Length Arrays and Fixed-Size Array generic types. But that’s a topic

for another time. 💚

]]>
<![CDATA[The first two meetings of C after C23 was finalized are over, and we have started working on C2y. We decided that this cycle we’re not going to do that “Bugfix” followed by “Release” stuff, because that proved to be a REALLY bad idea that killed a ton of momentum and active contributors during the C11 to C17 timeframe. So, this time, we’re hitting both bugfixes AND features so we can make sure we don’t lose valuable contributions and fixes by stalling for 5 to 6 years again. So, with that…]]>
Constant Integer Type Declarations Initialized With Constant Expressions Should Be Constants2024-06-16T00:00:00+00:002024-06-16T00:00:00+00:00https://thephd.dev/Constant%20Integers%20in%20C<![CDATA[

Constant integer-typed (including enumeration-typed) object declarations in C that are immediately initialized with an integer constant expression should just be constant expressions. That’s it. That’s the whole article; it’s going to be one big propaganda piece for an upcoming change I would like to make to the C standard for C2y/C3a!

Doing The “Obvious”, Obviously

As per usual, everyone loves complaining about the status quo and then not doing anything about it. Complaining is a fine form of feedback, but the problem with a constant stream of criticism/feedback is that nominally it has to be directed — eventually — into some kind of material change for the better. Otherwise, it’s just a good way to waste time and burn yourself out! As one would correctly imagine, this “duh, this is obvious” feature is not in the C standard. But, it seemed like making this change would take too much time, effort, and would be too onerous to wrangle. However, this is no longer the case anymore!

Thanks to changes made in C23 by Eris Celeste and Jens Gustedt (woo, thanks you two!), we can now write a very simple and easy specification for this that makes it terrifyingly simple to accomplish. We also know this will not be an (extra) implementation burden to conforming C23 compilers for the next revision of the standard thanks to constexpr being allowed in C23 for object declarations (but not functions!). As we now have such constexpr machinery for objects, there is no need to go the C++ route of trying to accomplish this in the before-constexpr times. This makes both the wording and the semantics easy to write about and reason about.

How It Works

The simple way to achieve this is to take every non-extern, const-qualified (with no other storage class specifiers except static in some cases) integer-typed (including enum-typed) declaration and upgrade it implicitly to be a constexpr declaration. It only works if you’re initializing it with an integer constant expression (a specific kind of Phrase of Power in C standardese), as well as a few other constraints. There are a few reasons for it to be limited to non-extern declarations, and a few reasons for it to be limited to integer and integer-like types rather than the full gamut of floating/struct/union/etc. types. Let’s take a peek into some of the constraints and the reasoning behind them, and why it ended up this way.

Non-extern only!

An extern object declaration could refer to read-only memory that is only read-only from the perspective of the C program. For example, it could refer to a location in memory written to by the OS, or handled by lower level routines that pull their values from a register or other hardware. (Typically, these are also marked volatile, but the point still stands.) We cannot have things that are visible outside of the translation unit and (potentially) affected by other translation units / powers outside of C marked as true constants; it would present a sincere conflict of interest. But, because of extern, we have a clear storage class specifier that allows us to know when things follow this rule or when things do not. This makes it trivially simple to know when something is entirely internal to the translation unit and the C program and does not “escape” the C abstract machine!

This makes it easy to identify which integer typed declarations would meet our goals, here. Though, it does bring up the important question of “why not the other stuff, too?”. After all, if we can do this for integers, why not structures with compound literals? Why not with string literals? Why not with full array initializers and array object declarations inside of a function?! All of these things can be VERY useful to make standards-mandated available to the optimizer.

Integer-Typed Declarations? Why Not “Literally Everything™”?

Doing this for integer types is more of a practicality than a full-on necessity. The reason it is practical is because 99% of all compilers already compute integer constant expressions for the purposes of the preprocessor and the purposes of the most basic internal compiler improvements. Any serious commercial compiler (and most toy compilers) can compute 1 + 1 at compile-time, and not offload that expression off to a run-time calculation.

However, we know that most C compilers do not go as far as GCC or Clang which will do its damnedest to compute not only every integer constant expression, but compound literal and structure initialization expression and string/table access at compile-time. If we extend this paper to types beyond integers, then we quickly exit the general blessing we obtain from “We Are Standardizing Widely-Deployed Existing Practice”. At that point, we would not be standardizing widespread existing practice, but instead the behavior of a select few powerful compilers whose built-in constant folders and optimizers are powerhouses among the industry and the flagships of their name.

C++ does process almost everything it can at compile-time when possible, under the “manifestly constant evaluated” rules and all of its derivatives. This has resulted in serious work on the forward progress of constant expression parsers, including a whole new constant expression interpreter in Clang¹. However, C is not really that much of a brave language; historically, standard and implementation-provided C has been at least a decade (or a few decades) behind what could be considered basic functionality, requiring an independent hackup of what are bogstandard basic features from users and vendors alike. Given my role as “primary agitator for the destruction of C” (or improvement of C; depends on who’s being asked at the time), it seems fitting to take yet another decades-old idea and try to get it through the ol’ Standards Committee Gauntlet.

With that being the case, the changes to C23’s constant expression rules were already seen as potentially harmful for smaller implementations. (Personally, I think we went exactly as far as we needed to in order to make the situation less objectively awful.) So, trying to make ALL initializers be parsed for potential constant expressions would likely be a bridge too far and ultimately tank the paper and halt any progress. Plus, it turns out we tried to do the opposite of what I’m proposing here! And,

it actually got dunked on by C implementers?!

We Failed To Do It The Opposite Way

A while back, I wrote about the paper N2713 and how it downgraded implementation-defined integer constant expressions to be treated like normal numbers “for the purposes of treatment by the language and its various frontends”. This was a conservative fix because, as the very short paper stated, there was implementation divergence and smaller compilers were not keeping up with the larger ones. Floating point-to-integer conversions being treated as constants, more complex expressions, even something like __builtin_popcount(…) function calls with constants being treated as a constant expression by GCC and Clang were embarrassing the smaller commercial offerings and their constant expression parsers.

It turns out that implementation divergence mattered a lot. A competing paper got published during the “fix all the bugs before C23” timeframe, and it pointed all of this out in paper N3138 “Rebuttal to N2713”. The abstract of N3138 makes it pretty clear: “[N2713] diverges from existing practice and breaks code.” While we swear up and down that existing implementations are less important in our Charter (lol), the Committee DOES promise that existing code in C (and sometimes, C-derivative) languages will be protected and prioritized as highly as is possible. This ultimately destroyed N2713, and resulted in it being considered implementation-defined again whether or not non-standards-blessed constant expressions could be considered constants.

Effectively, the world rejected the idea that being downgraded and needing to ignore warnings about potential VLAs (that would get upgraded to constant arrays at optimization time) was appropriate. Therefore, if C programmers rejected going in the direction that these had to be treated for compiler frontend purposes as not-constants, we should instead go in the opposite direction, and start treating these things as constant expressions. So, rather than downgrading the experience (insofar as making certain expressions be not constants and not letting implementations upgrade them in their front-ends, but only their optimizers), let’s try upgrading it!

Formalizing the Upgrade

In order to do this, I have written a paper currently colloquially named NXXX1 until I order a proper paper number. The motivation is similar to what’s in this blog post, and it contains a table that can explain the changes better than I possibly ever could in text. So, let’s take a look:

int file_d0 = 1;
_Thread_local int file_d1 = 1;
extern int file_d2;
static int file_d3 = 1;
_Thread_local static int file_d4 = 1;
const int file_d5 = 1;
constexpr int file_d6 = 1;
static const int file_d7 = 1;

int file_d2 = 1;

int main (int argc, char* argv[]) {
	int block_d0 = 1;
	extern int block_d1;
	static int block_d2 = 1;
	_Thread_local static int block_d3 = 1;
	const int block_d4 = 1;
	const int block_d5 = file_d6;
	const int block_d6 = block_d4;
	static const int block_d7 = 1;
	static const int block_d8 = file_d5;
	static const int block_d9 = file_d6;
	constexpr int block_d10 = 1;
	static constexpr int block_d11 = 1;
	int block_d12 = argc;
	const int block_d13 = argc;
	const int block_d14 = block_d0;
	const volatile int block_d15 = 1;

	return 0;
}

int block_d1 = 1;
Declaration   constexpr Before?   constexpr After?   Comment
file_d0       no                  no                  no change; extern implicitly, non-const
file_d1       no                  no                  no change; _Thread_local, extern implicitly, non-const
file_d2       no                  no                  no change; extern explicitly, non-const
file_d3       no                  no                  no change; non-const
file_d4       no                  no                  no change; _Thread_local, non-const
file_d5       no                  no                  no change; extern implicitly
file_d6       yes                 yes                 no change; constexpr explicitly
file_d7       no                  yes                 static and const, initialized by constant expression
block_d0      no                  no                  no change; non-const
block_d1      no                  no                  no change; extern explicitly, non-const
block_d2      no                  no                  no change; non-const, static
block_d3      no                  no                  no change; _Thread_local, static, non-const
block_d4      no                  yes                 const; initialized with literal
block_d5      no                  yes                 const; initialized with other constexpr variable
block_d6      no                  yes                 const, initialized by other constant expression
block_d7      no                  yes                 static and const, initialized with literal
block_d8      no                  no                  no change; non-constant expression initializer
block_d9      no                  yes                 static and const, initialized by constant expression
block_d10     yes                 yes                 no change; constexpr explicitly
block_d11     yes                 yes                 no change; constexpr explicitly
block_d12     no                  no                  no change; non-const, non-constant expression initializer
block_d13     no                  no                  no change; non-constant expression initializer
block_d14     no                  no                  no change; non-constant expression initializer
block_d15     no                  no                  no change; volatile

For the actual “words in the standard” changes, we’re effectively just making a small change to “§6.7 Declarations, §6.7.1 General” in the latest C standard. It’s an entirely new paragraph that just spins up a bulleted list, saying:

(NEW)13✨ If one of a declaration’s init declarators matches the second form (a declarator followed by an equal sign = and an initializer) and meets the following criteria:

— it is the first visible declaration of the identifier;

— it contains no other storage-class specifiers except static, auto, or register;

— it does not declare the identifier with external linkage;

— its type is an integer type or an enumeration type that is const-qualified but not otherwise qualified, and is non-atomic;

— and, its initializer is an integer constant expression (6.6);

then it behaves as if a constexpr storage-class specifier is implicitly added for that declarator specifically. The declared identifier is then a named constant and is valid in all contexts where a named constant of the corresponding type is valid to form a constant expression of that specific kind (6.6).

Thanks to the improvements to §6.6 from Celeste and Gustedt, and their work on constexpr, the change here is very small, simple, and minimal. This covers all the widely-available existing practice we care about, without providing undue burden for many serious C implementations of C23 and beyond. It also would make a wide variety of integer constant expressions from the “Rebuttal” paper N3138 into valid constant expressions, according to the current rules of the latest C standard. This would be an improvement as it would mean the constant expressions written by users could be relied on across platforms that use a -std=c2y flag or claim to conform to the latest (working draft) C standard.

All in All, Though?

I’m just hoping I can get something as simple as this into C. It’s been long overdue given the number of ways folks complain about how C++ has this but C doesn’t, and it would deeply unify existing practice across implementations. It also helps to remove an annoying style of diagnostic warnings from -Wpedantic/-Wall-style warning lists, too!

The next meeting for C is around October, 2024. I’ll be trying to bring the paper there, to get it formalized, along with the dozens of other papers and features I am working on. Even if my hair will go fully grey by the time this is available on all platforms, I will keep working at it. We deserve the C that people keep talking about, on all implementations.

If not in my lifetime, in yours. 💚

  1. You can read a writeup about it on RedHat’s blog (Part 1, Part 2), or directly from the LLVM documentation

]]>
<![CDATA[Constant integer-typed (including enumeration-typed) object declarations in C that are immediately initialized with an integer constant expression should just be constant expressions. That’s it.]]>
Why Not Just Do Simple C++ RAII in C?2024-05-21T00:00:00+00:002024-05-21T00:00:00+00:00https://thephd.dev/just-raii-it-bro<![CDATA[

Ever since I finished publishing the “defer” paper and successfully defended it on its first go-around (it now has tentative approval to go to a Technical Specification, I just need to obtain the necessary written boilerplate to do so), an old criticism repeats itself frequently. Both internally to the C and C++ Standards Committee, as well as to many outside, the statement is exactly as the title implies: to implement a general-purpose undo mechanism for C, why not just make Objects with Well-Scoped, Deterministic Lifetimes and build it out of that like C++? This idiom, known as Resource Acquisition Is Initialization (RAII), is C++’s biggest bread and butter and its main claim to fame over just about every other language that grew up near it and after it (including all of the garbage collected languages such as Managed C++, D, Go, etc.). I have received no less than 5 external-to-WG14 (the formal abbreviation for the C Standards Committee) requests/asks about this, and innumerable posts internal to the C Standard mailing lists.

So, let’s just get this off the table right now so I can keep referring to this post every time somebody asks:

You ✨Cannot✨ Have “Simple RAII” in C

That’s the entire premise of this article. There’s a few reasons this is not possible – some mentioned in the defer paper version N3199, and others that I just sort of took for granted that people would understand but do not – and so, to clear up confusion, they will be written down here. There are two MAJOR reasons one cannot take the object-oriented semantics and syntax of RAII from C++ as-is, without jeopardizing sincere qualities about C:

  • RAII is syntactically difficult to achieve in C due to the semantics imbued on those syntax constructs by C++;
  • and, RAII is semantically impossible in C due to C’s utterly underwhelming type/object model.

To start with, let’s go over the syntax of C++, and how it achieves RAII. We will also discuss a version of RAII that uses not-C++ syntax, which would work… at least until the second bulleted reason above dropkicks that in the face. So, let’s begin:

RAII: C++ Syntax

As a quick primer for those who are not familiar, C++ achieves its general purpose do-and-undo mechanism through the use of constructors and destructors. Constructors are function calls that are always invoked on the creation of an object, and destructors are always invoked when an object leaves scope. One can handle doing the construction and destruction manually, but we don’t have to talk about such complicated cases yet. The syntax looks as follows:

#include <cstdlib>

struct ObjectType {
	int a;
	double b;
	void* c;

	/* CONSTRUCTOR: */
	ObjectType() : a(1), b(2.2), c(malloc(30)) {

	}

	/* DESTRUCTOR: */
	~ObjectType() {
		free(c);
	}
};

In the above code snippet, we have a structure named ObjectType. It has a single constructor, that takes no arguments, and initializes all 3 of its members to some default values. It also has a destructor, which is meant to “undo” anything in the class that the programmer likes. In this case, I am using it to purposefully free the data that I originally mallocd into the member c during construction. Thus, when I use the class in this manner:

#include <cstdio>

int main () {
	ObjectType thing = {};
	printf("%d %f %p", thing.a, thing.b, thing.c);
	return 0;
}

despite not seeing any other code in that snippet, that code will:

  1. create automatic storage duration memory to put thing in (A.K.A. stack space for a stack variable);
  2. call the constructor on that automatic storage duration memory location (A.K.A. the thing that sets those values and does malloc)
  3. perform the printf call
  4. prepare the return statement with the value of 0
  5. call the destructor on that automatic storage duration memory location (A.K.A. the thing that calls free to release the memory)
  6. release the automatic storage duration memory (A.K.A. cleans up the stack)
  7. return from the function with the value 0 being transported in whatever manner the implementation has defined

This is a fairly simple set of steps, but it’s a powerful concept in C++ because no matter what happens (modulo some of the more completely bananas situations), once an object is “properly constructed” (all the data members are initialized from the TypeName (...) : … { list and reach the opening brace) in some region of memory, the compiler will always deterministically call the destructor at a fixed location. There are no wibbly-wobbly semantics like .NET IL finalizers or Lua __gc methods: the object is created, the object is destroyed, always. (Again, we are ignoring more manual cases at the moment such as using new/delete, its array friends, or placement new & other sorts of shenanigans.) As Scott Meyers described it, this is a “powerful, general-purpose undo mechanism” and it’s one of the most influential concepts in deterministic, non-garbage-collected systems programming. Every other language worth being so much as spit on either employs deep garbage collection (Go, D, Java, Lua, C#, etc.) or automatic reference counting (Objective-C, Objective-C++, Swift, etc.), uses RAII (Rust with Drop, C++, etc.), or does absolutely nothing while saying to Go Fuck Yourself™ and kicking the developer in the shins for good measure (C, etc.).

The first problem with this, however, is a technical hangup. When C++ created their constructors, they created them with a concept called function overloading in mind. This very quickly gets into the weeds of Application Binary Interfaces and other thorny issues, which are thankfully already thoroughly written about in this expansive blog post, but for the sake of brevity revisiting these concepts is helpful to understand the issue.

Problem 0: Function Overloading

Function overloading is a technique where software engineers, in source code and syntactically, name what are at their core two different functions the same name. That single name is used as a way of referring to two different, distinct function calls by employing extra information, such as the number of arguments, the types of the arguments, and other clues when that single name gets used:

// FUNCTION 0
int func (int a);
// FUNCTION 1
double func (double b);

int main () {
    int x = func(2); // calls FUNCTION 0, func(int)
    double y = func(3.3); // calls FUNCTION 1, func(double)
    return (int)(x + y);   
}

However, when the source code has to stop being source code and instead needs to be serialized as an actual, runnable, on-the-0s-and-1s-machine binary, linkers and loaders do not have things like compile-time “type” information and what not at-the-ready. It is too expensive to carry that information around, all the time, in perpetuity so that when someone runs a program it can differentiate between “call func that does stuff with an integer” versus “call func that does stuff with a 64-bit IEEE 754 floating point number”. So, it undergoes a transformation that transforms func(int) or func(double) into something that looks like this at the assembly level:

main:
        push    rbx
        mov     edi, 2
        call    _Z4funci # call FUNCTION 0
        movsd   xmm0, QWORD PTR .LC0[rip]
        mov     ebx, eax
        call    _Z4funcd # call FUNCTION 1
        movapd  xmm1, xmm0
        pxor    xmm0, xmm0
        cvtsi2sd        xmm0, ebx
        pop     rbx
        addsd   xmm0, xmm1
        cvttsd2si       eax, xmm0
        ret
.LC0:
        .long   1717986918
        .long   1074423398

The code looks messy because we’re working with doubles and so it generates all sorts of stuff for passing arguments and also casting it down to a 32-bit int for the return expression, but the 2 important lines are call _Z4funci and call _Z4funcd. Believe it or not, these weird identifiers in the assembly correspond to the func(int) and func(double) identifiers in the code. This technique is called “name mangling”, and it powers a huge amount of C++’s featureset. Name mangling is how, as argument types or the number of arguments change, things like the Application Binary Interface (ABI) of function calls can be preserved. The compiler is taking the name of the function func and the arguments int/double and mangling it into the final identifier present in the binary, so that it can call the right function without having a full type system present at the machine instruction level. This has the obvious benefit that the same conceptual name can be used multiple different ways in code with different data types, mapping strongly to the “this is the algorithm, and it can work with multiple data types” idea. Thus, the compiler worries about the actual dispatch details and resolves them at compile-time, which means there is no runtime cost to do matching on argument count or argument types. Having it resolved at compile-time and mapped out through mangling allows it to just directly call the right code during execution. The reason this becomes important is because this is how constructors must be implemented.

Problem 1: Member Functions

Consider the same ObjectType from before, but with multiple constructors:

#include <cstdlib>

struct ObjectType {
	int a;
	double b;
	void* c;

	/* CONSTRUCTOR 0: */
	ObjectType() : a(1), b(2.2), c(malloc(30)) {

	}

	/* CONSTRUCTOR 1: */
	ObjectType(double value) : a((int)(value / 2.0)), b(value), c(malloc(30)) {

	}

	/* DESTRUCTOR: */
	~ObjectType() {
		free(c);
	}
};

#include <cstdio>

int main () {
	ObjectType x = {};
	ObjectType y = {50.0};
	printf("x: %d %f %p\n", x.a, x.b, x.c);
	printf("y: %d %f %p\n", y.a, y.b, y.c);
	return 0;
}

We can see the following assembly:

.LC1:
	.string "x: %d %f %p\n"
.LC2:
	.string "y: %d %f %p\n"
main:
	push    r12
	push    rbp
	push    rbx
	sub     rsp, 64
	mov     rdi, rsp
	lea     rbp, [rsp+32]
	mov     rbx, rsp
	call    _ZN10ObjectTypeC1Ev
	movsd   xmm0, QWORD PTR .LC0[rip]
	mov     rdi, rbp
	call    _ZN10ObjectTypeC1Ed
	mov     rdx, QWORD PTR [rsp+16]
	movsd   xmm0, QWORD PTR [rsp+8]
	mov     edi, OFFSET FLAT:.LC1
	mov     eax, 1
	mov     esi, DWORD PTR [rsp]
	call    printf
	mov     rdx, QWORD PTR [rsp+48]
	movsd   xmm0, QWORD PTR [rsp+40]
	mov     edi, OFFSET FLAT:.LC2
	mov     eax, 1
	mov     esi, DWORD PTR [rsp+32]
	call    printf
	mov     rdi, rbp
	call    _ZN10ObjectTypeD1Ev
	mov     rdi, rsp
	call    _ZN10ObjectTypeD1Ev
	add     rsp, 64
	xor     eax, eax
	pop     rbx
	pop     rbp
	pop     r12
	ret
	mov     r12, rax
	jmp     .L3
	mov     r12, rax
	jmp     .L2
main.cold:
.L2:
	mov     rdi, rbp
	call    _ZN10ObjectTypeD1Ev
.L3:
	mov     rdi, rbx
	call    _ZN10ObjectTypeD1Ev
	mov     rdi, r12
	call    _Unwind_Resume
.LC0:
	.long   0
	.long   1078525952 

Again, we notice in particular the use of these special, mangled identifiers for the call instructions: call _ZN10ObjectTypeC1Ev, call _ZN10ObjectTypeC1Ed, and call _ZN10ObjectTypeD1Ev. It has the name of the type (…10ObjectType…) in it this time, but more or less just mangles it out. This is where the heart of our problems lies. If C wants to steal C++’s syntax for RAII, and C wants to be able to share (header file) source code that enjoys simple RAII objects, every single C implementation needs to implement a Name Mangler compatible with C++ for the platforms they target. And how hard could that possibly be?

Hm.

Here are some name manglings for the one argument ObjectType constructor:

  • _ZN10ObjectTypeC1Ed (GCC/Clang on Linux; x86-64, ARM, ARM64, and i686)
  • ??0ObjectType@@QEAA@N@Z (MSVC; x86-64, ARM64)
  • ??0ObjectType@@QAE@N@Z (MSVC; i686)

That’s three different name manglings for only a handful of platforms! And while some name manglers are partially documented or at least provided as a library so that it can be built upon, the name manglers for others are not only utterly undocumented but completely inscrutable. So much so that on some platforms (like MSVC on any architecture), certain name manglings are not guaranteed to be 1:1 and can in fact “demangle” into multiple different plausible entities. If an implementation gets the name mangling wrong, well, that’s just a damn shame for the end user who has to deal with it! Of course, nobody’s claiming that name mangling is an unsolvable problem; it is readily solved in codebases such as Clang and GCC. But, it is worth noting that, as C’s specification stands now, there is no requirement to mangle any functions.

This is both a blessing, and a curse. The former because functions that users write are pretty much 1:1 when they are written under a C compiler. If a function is named glorbinflorbin in C, the name that shows up in the binary is glorbinflorbin with maybe some extra underscores added in places somewhere on certain implementations. But, the latter comes into play for precisely this reason: if there is no name mangling performed that considers things such as related enclosing member object, argument types, and similar, then it is impossible to have even mildly useful features that can do things like avoid name clashes when a function prototype is generated with the wrong types. It is, in fact, the primary reason that C ends up in interesting problems when using typedefs inside of its function declarations. Even if the typedefs change, the function names do not because there is no concept of “member functions” or “function overloading” or anything like that. It’s why the intmax_t problem is such an annoying one.

What Does This Have To Do With RAII?

Well, the devil is in these sorts of details. In order to introduce nominal support for something like constructors, name mangling (or something that allows the user to control how names come out on the other side) needs to be made manifest in C. If name mangling is chosen as the implementation choice and a syntax identical to C++ is chosen, the implementation becomes responsible for furnishing a name mangler. And, because people are (usually) not trying to be evil, there should be ABI compatibility between the C implementation’s name mangler and C++’s name mangler so that code written with constructors in one language interoperates just fine with the other, without requiring extern "C" to be placed on every constructor. (Right now, extern "C" is not legal to place on any member function in any C++ implementation.)

The reason this is desirable is obvious: header code could be shared between the languages, which makes sense in a world where “just steal C++’s constructors and destructors” is the overall design decision for C. But this is very much a nonstarter for implementation reasons. Most implementers get annoyed when we require them to implement things that might take significant effort. While Clang and GCC likely won’t give a damn so long as it’s not C++-modules levels of difficult (and MSVC ignores us until it ships in a real standard), there’s hundreds of C compilers and implementers of WILDLY varying quality. Unlike the 4-5 C++ compilers that exist today, C compilers and their implementers are still cranking things out, sometimes as significant pillars of their (local) software economy. Now, while I personally loathe using things like lines of code as a functional metric for code, it can help us estimate complexity in a very crude, contextless way. Checking in on Clang’s Itanium Mangler, it clocks in somewhere on the order of about 7,000 lines of code. Which really doesn’t sound so bad,

until chibicc’s entire codebase measures somewhere around 7,300 lines of code.

“Double the amount of code in my entire codebase excluding tests for this feature” very much does not pass the smell test of implementability for C. This is also not including, you know, all the rest of the code required for actually implementing the “parse constructors and destructors” bit. Though, thankfully, that part is a lot less work than the name mangler. And I can guarantee that, since there’s quite literally hundreds of C implementations, many of them will… “have fun”. If two or three different ways to mangle ObjectType::ObjectType(double) are bad, wait until a couple dozen implementers who have concerns outside of “C++ Compatibility” – some even with an active bone to pick with C++ – are handed a gaggle of features that rely on a core mechanic that is entirely unspecified. I am certainly not the smartest person out there,

but I know a goddamn interoperability bloodbath when I see one.

But… What If Name Mangling Was Not a Problem?

This is the other argument I have received a handful of times on both the C mailing list, and in my inbox. It’s not a bad argument; after all, the entire above argument hinges on the idea of stealing the syntax from C++ entirely and copying their semantics bit-for-bit. If we simply refuse to do it the way C++ does it, does the above argument go away? Thusly appears the following suggestion, which boils down to something like the snippet below. However, before we continue, note that this syntax comes partially from an e-mail sent to me. PLEASE, second-to-last person who sent me an e-mail about this and notices the syntax looks similar to what was in the e-mail: I am not trying to make fun of you or the syntax you have shown me, I am just trying to explain as best as I can. With that said:

#include <stdlib.h>

struct nya {
	void* data_that_must_be_freed;
};

_Constructor void nya_init(struct nya *nya, int n) {
	nya->data_that_must_be_freed = malloc(n);
}

_Destructor void nya_clear(struct nya *nya) {
	free(nya->data_that_must_be_freed);
}

int main () {
	struct nya n = {30};
	return 0;
}

The above uses the _Constructor and _Destructor tags on function declarations/definitions to associate a function with either the constructed type struct nya or the destructed type struct nya * (a pointer to an already-existing struct nya to destroy). The sequence of events, here, is pretty simple too:

  1. n’s memory is allocated (off of the stack), and its address is passed to;
  2. nya_init, which then calls malloc to initialize its data member;
  3. the return 0 is processed, storing the 0 value to do the actual return later, while;
  4. nya_clear is called on the memory for n, and the data member is appropriately freed;
  5. finally, main returns 0.

It has the same deterministic destruction properties as RAII here. But, notably, it is attached to a free-floating function.

This does the smart thing and gets around the name mangling issue! The person e-mailing me here has sidestepped the whole issue about sharing syntax with C++ and its function overloading issue, which is brilliant! If you can associate a regular, normal function call with these actions, it is literally no longer necessary to provide a name mangling scheme. It does not need to exist, so nobody will implement one: it’s just calling a normal function. (Kudos to Rust for figuring part of this out themselves as well, though they still need name mangling thanks to Traits and Generics.) It avoids all of the very weird fixes other people tried to propose on the C standards internal mailing list by saying things like “only allow one constructor” or “make C++ have extern "C" on constructors work and then have C and C++ mangle them differently” or “just implement name manglers for all C compilers that implement C2y/C3a, it’s fine”. Implementability can certainly be achieved with this.

Other forms of this come from a derivation of the two existing Operators proposals (Marcus Johnson’s n3201 and Jacob Navia’s n3051), most particularly n3201. The recommendation from n3201’s author was to just use a different “association” that did not actually affect the syntax of the function itself, so code that produces the same effect under n3201’s guidance (but slightly modified from the way it was presented in n3201, because that syntax has Problems™) might look like:

#include <stdlib.h>

struct nya {
	void* data_that_must_be_freed;
};

void nya_init(struct nya *nya, int n) {
	nya->data_that_must_be_freed = malloc(n);
}

void nya_clear(struct nya *nya) {
	free(nya->data_that_must_be_freed);
}

_Operator = nya_init;
_Operator ~ nya_clear;

int main () {
	struct nya n = {30};
	return 0;
}

Completely ignoring syntax choices here and the consequences therein, these _Operator statements would associate a function call with an action. = in this case seems to apply to construction, and ~ seems to apply to destruction. As usual, because the association is made using a statement and type information at compile-time, the compiler can know to simply call nya_init and nya_clear without needing to set up a complex, implementation-internal name mangling scheme to figure out which constructor/member/whatever function it needs to call to initialize the object correctly. It also doesn’t rob C++ of its syntax while trying to impose different semantics on it. Nor does it just tell C implementations the functional equivalent of “git gud” with respect to implementing the name mangler(s) required to play nice with other systems. There is, unfortunately, one really annoying problem with having this form of constructors and destructors, and it’s the same problem that C++ had when it first started out trying to tackle the same problem back in the 80s and 90s:

none of these proposals come with an Object Model, and C does not have a real Object Model aside from its Effective Types model!

RAII: C++ Semantics

While the syntax problem can be designed around with any number of interesting permutations or fixes, whether it’s _Operator or _Constructor or whatever, the actual brass-tacks semantics that C++ endows on memory obtained for these objects are very strict and direct. When someone allocates some memory and casts it to a type and begins to access it, both [c.malloc] and [intro.object]/11-13 cover them by giving them implicitly created objects, so long as those types satisfy the requirements of being trivial and implicitly-creatable types. On top of that, for constructors and destructors, there is an ENORMOUSLY robust system that comes with it beyond these implicitly created objects. This post was going to be extremely long, but thanks to an excellent writeup by Basit Ayantunde, everything anyone needs to know about the C++ object model is already all written up. To fully understand all the details, shortcuts, tricks, and more, please read Basit’s article; becoming a better C++ developer (if that’s desirable) is an inevitability from digesting it.

This, of course, leaves us to talk about just C and RAII and how those semantics play out.

C: Effective Types

In C, we do not have a robust object model. The closest thing we have is the effective type rules, and they work VIA lvalue accesses rather than applying immediately on cast. The full wording is found in §6.5.1 “General” of N3220, which states:

The effective type of an object for an access to its stored value is the declared type of the object, if any. If a value is stored into an object having no declared type through an lvalue having a type that is not a non-atomic character type, then the type of the lvalue becomes the effective type of the object for that access and for subsequent accesses that do not modify the stored value. If a value is copied into an object having no declared type using memcpy or memmove, or is copied as an array of character type, then the effective type of the modified object for that access and for subsequent accesses that do not modify the value is the effective type of the object from which the value is copied, if it has one. For all other accesses to an object having no declared type, the effective type of the object is simply the type of the lvalue used for the access.

This is a bunch of text to say something really simple: if a region of memory (like a pointer obtained from malloc) is present, and it is cast to a specific type for the purposes of reading or writing, that region is marked with a given type, and the type plus the region informs what the effective type of the memory is. The first write or access is what solidifies it as such. The effective type follows a memory region through memmove or memcpy done with appropriate objects of the appropriate size. Fairly straightforward, standard stuff. The next paragraph after this then lists the only types through which any later accesses or writes to that region, whether through casts or aliasing pointers, may be performed:

  • a type compatible with the effective type of the object,
  • a qualified version of a type compatible with the effective type of the object,
  • the signed or unsigned type compatible with the underlying type of the effective type of the object,
  • the signed or unsigned type compatible with a qualified version of the underlying type of the effective type of the object,
  • an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
  • a character type.

This is, effectively, C’s aliasing rules. Once a type is set into that region of memory, any cast from one type to another (e.g. casting it first to uint32_t* to write to it, and then trying to read it as a float* next) must land on that list to be standard-sanctioned. If it doesn’t, then undefined behavior is invoked and programs are free to behave in very strange ways, at the whim of implementations or hardware or whatever. While I am not holding the person who sent me the simple one-off e-mail accountable to this, in the wider C ecosystem and in discussion even on the C mailing list, there seemed to be a distinct lack of appreciation for how thought-through the C++ system is and why it is this way in the first place. This also becomes glaringly clear after reading n3201 and going through 95% of the discussions around “RAII in C” that just try to boil it down to simple syntactical solutions with basic code motion. The bigger picture is NOT being considered. There is not even a tiny amount of respect for where C or C++ comes from. It is not just about effective types and shadowy rules about how they handle dynamic memory: even simpler things just completely fall apart in these counterproposals. Take, for example, a very simple question.

“How do you handle copies?”

Taking the _Operator example from above again, let’s add a single line of spice to this:

#include <stdlib.h>

struct nya {
	void* data_that_must_be_freed;
};

void nya_init(struct nya *nya, int n) {
	nya->data_that_must_be_freed = malloc(n);
}

void nya_clear(struct nya *nya) {
	free(nya->data_that_must_be_freed);
}

_Operator = nya_init;
_Operator ~ nya_clear;

int main () {
	struct nya n = {30};
	struct nya n2 = n; // OH SHIT--
	return 0;
}

In a proposal like n3201, what happens here? The actual answer is “the proposal literally does not answer this question”. Assuming (briefly, if I can be allowed such for a moment) the “basic” or “default” for how it works right now, the answer is probably “just memcpy like before”, which is wrong. n3201 is not the first “just do a quick RAII in C” proposal sent to me over e-mail to make this mistake. Simply performing a memberwise copy of struct nya from n to n2 leads to an obvious double free when n2 goes out of scope, frees the memory pointed to by data_that_must_be_freed, and then n attempts to free that data as well. This is an infinitely classic blunder, and in critical enough code it becomes a security blunder. The suggestions that stem from pointing this out range from unserious to just disappointing, including things like “just ban copying the structure”. Nobody needs a degree in Programming Language Design to see that “just ban simple automatic storage duration structure copying” is a terrible usability and horrific ergonomics decision to make, but that’s where we are. And it’s so baffling that it is impossible even to be mad that the suggestion is brought up.

Or, take n3201’s case (which updates the previous paper, n3182). When responding to the (ever-present) criticism of operators – including those for initialization/assignment – that someone could do something weird inside of the operator, n3201 adds a constraint which reads:

Functions must contain the matching operator in their function bodies. i.e. _Operator declarations that associate the compares-equal operator with a function, must contain the compares-equal operator in the body of the function named in the _Operator declaration. (iostream-esque shenanigans with overloading the bitwise shift operators to read/write characters and strings isn’t allowed).

That the proposal has something for initialization (but not cleanup), does not mention that the code snippet in the proposal itself apparently (?) leaks memory, and imposes a constraint that is deeply unsettling for any type (there’s plenty of vec4 or other mathematics code where I’m using intrinsics that look nothing like the operators they’re being implemented for) does not seem to bother the author in the slightest. Instead, there’s just a palpable hatred of C++ there, apparently so strong that it overrides any practical engagement with the problem space. The proposal – and much of the counter-backlash I had to sift through on the mailing lists and elsewhere as people proposed stripped-down RAII solutions for C under the guise of being “simple” – is too busy taking potshots at C++ to address clear and present dangers to its own functionality.

C as an Anti-C++

And this is where things just keep getting worse, because so much of C’s culture seems to swirl around the idea of either being “budget, simple, understandable C++” or “Anti/Nega-C++”. Instead of engaging on C’s stated merits or goals, like:

  • what-you-write-is-what’s-inside (a function foo produces a binary symbol named foo);
  • uncompromised, direct access to the hardware (through close collaboration with implementation-defined asm, intrinsics, and unparalleled control of the compiler (severe work in progress, honestly));
  • simple enough that it can always be used to glue two or more languages together (for any single given platform/compiler combination);
  • and, being a smaller language focused on its use cases (K&R literally sold C on being good at strings – we can see how that’s been going in the last 30 years).

We instead get “why doesn’t this PRIMITIVE, UNHOLY C just become C++” proposals, and similar just-as-ill-considered “here is my simpler (AND BETTER THAN THAT CRAPPY C++ STUFF) feature” proposals. Sometimes, like the person who e-mailed me with the struct nya example, there’s a genuine curiosity for exploring a different design space that could serve as an actually better basis. But even at our highest echelons, there is a constant spectre of C++ that continually drives an underlying and utterly unhelpful antagonism, one that prevents actual technical evaluation. It results in things like _Operator throwing itself in the way of RAII, trying to half-heartedly solve the RAII problem without actually engaging with the sincere, instructive merit of the C++ object model. It also prevents actually evaluating the things that make RAII weak, including problems with the strong association with objects that actually manifest in its own standard library.

The negative tradeoffs for defer are numerous, especially since it absolutely loses many of the abilities that come from being a regular object with a well-defined lifetime. This means it is not as powerful as constructors and destructors, including that it is prone to Repeat-Yourself-Syndrome since the defer entity itself is not reusable. It cannot be attached to individual members of a structure, nor can it be passed through function calls or stored on the heap. It cannot be transferred with move constructors or duplicated with copy constructors in a natural way, or in any way as a matter of fact! It can only exist at function scope, not at file scope, and only exists procedurally.

The beneficial tradeoffs are it avoids the Static Initialization Order Fiasco that comes with having objects with constructors at file scope or marked static at function scope. It also does not combine lambdas with object-based destructors to torch 15+ years of life asking the C++ Standards Committee to standardize std::scope_guard only to ultimately be denied success at retirement age (sorry, Peter Sommerlad) because of the C++ Standard Library’s ironclad exceptions-and-destructors rule. And, to be clear, it was the right decision for them to do that! Poking a hole in the “all destructors from the standard library are noexcept” mandate adds needless library complexity gymnastics for a feature that the language should be taking care of! The proper realization after that would be that a language feature is required to sidestep the concerns that come with the Object Model. Of course, I do not expect the C++ Standard Committee’s Evolution Working Group to take that situation seriously as a body; likely, they will leave Library Evolution Working Group out to dry on the matter.

Coming to these sorts of conclusions only arises through behaving as an engineer that is looking to improve at their craft and strengthen their tools, rather than getting into a hammer-measuring pissing contest with the engineers down the hall.

But. Alas!

It still leaves a sour taste, though. It sort of lingers at the back of anyone’s mouth when they sit down to think about it, because it is kind of distasteful.

Genuinely, I understand that C can be behind. Very behind, in fact: taking 30 years to standardize typeof, not performing macro-gymnastics to get to typeof_unqual in the same 30 years, and not making any meaningful moves to work on things like e.g. “Statement Expressions” (something even the Tiny C Compiler implements) easily illustrates just how gut wrenchingly difficult it is to move the needle just a centimeter in this increasingly Godless industry. But when people propose a feature that has had 40+ years of work and refinement and care put into it, but at no point do they sit down and think about “what happens if I copy this object using the usual syntax” or “do we need some syntax for moving objects from one place to another” or “maybe I should not provoke a double free in the world’s most harmless looking code”, the thoughts start coming in. Is this being taken seriously? Is it just forgetfulness? Is it just so automatic nobody thinks about it? Is the pedagogy what is behind here, and is there a teaching crisis for this language?

So Many Questions

And yet, I will see not one damn answer, that’s for sure. Genuinely, I yearn for it, because getting half-baked things like n3201 or similar is kind of rough to deal with. On one hand there’s the overwhelming urge to just grab the proposal, rip it up, get a whiteboard, and go “here, HERE. WHERE IS YOUR OBJECT MODEL. WHAT HAPPENS TO THE EFFECTIVE TYPE RULES. DID YOU THINK ABOUT COPYING AND MOVING THINGS. WHAT HAPPENS IF SOMEBODY USES THESE IN A COMPOUND ASSIGNMENT EXPRESSION. WHAT HAPPENS IF THEY ARE ASSIGNED FROM A TEMPORARY. HOW DO YOU PASS THAT IN TO THE USER. WHAT ARE THE THINGS THEY CAN CONTROL. HOW DO WE HANDLE THIS FROM HEAP MEMORY OR A STACK ARRAY OF UNSIGNED CHARACTERS.”

But that kind of tone, that sort of engagement is antagonistic, probably in the extreme.

It’s also not how I would like to engage with anyone. Like, the person who sent me an e-mail with the cute struct nya and the very simple and nice _Constructor syntax might not even have gotten that deep in the C standard and likely barely knows the effective type rules; I sure as hell barely understand them and I’m in charge of goddamn editing them when a few of the big upcoming papers finally make their way through the C Committee.

If I respond to an e-mail like that – with all the capital letters and everything – it would be completely out of line and also would be very unfair, because it is not their fault. I haven’t done that to anyone so far, but the fact that the thought exists in my head is Not Fun™. It’s not anyone’s fault, it’s just an internal struggle with thinking the whole industry is a lot farther along on these problems and continuously feeling like I am very much too stupid to be here. Like, I’m a goddamn moron, a genuine idiot, I cannot be ahead of the game, am I being pranked? Am I being tested, to see if I really belong here? Is someone going to swing in out of the blue and go “AHA, YOU MISSED THE OBVIOUS!”? Something is absolutely not adding up.

The utterly pervasive and constant feeling that a lot of people – way too many people – are really trying to invent these things from first principles and pretend like they were the first people to ever conceive of these ideas… it feels pretty miserable, all things considered. Going through life evaluating effectively no prior art in other languages, domains, C codebases as they exist today, just… anything. It’s a constant nagging pull when working on things like standard C that for the life of me I cannot seem to shake no matter how carefully I break it down. Hell, part of writing this post is so I can stick a link to it in my defer paper and in the defer Technical Specification when it happens so I don’t have to sit down and walk through why I chose a procedural-style, object-less idiom for C rather than trying to load the RAII shotgun and point it at our beloved 746-and-counting page C standard.

Changing a programming language’s whole object model is hard. Adding “things that must be run in order to bring an object into existence, and things that must be run in order to kill an object, modulo Effective Type rules, with No Other Exceptions” is a big deal. Where in the proposals do they discuss new/delete, and why they are used as wrappers around malloc to ensure construction and destruction are coupled with memory creation to prevent very common bugs? Where is the consideration for placement new or being able to call destructors manually on an object or a region of memory? RAII enables simple idioms but it is not a simple endeavor! Weakening portions of RAII makes it so much less useful and so much less powerful, which is really weird! Isn’t the thing people keep telling me about C that it’s the language of ultimate power and ultimate control? Why does that repeatedly not show up in these discussions?!

It feels so bizarre to have to actually sit down and explain some of these things sometimes because a lot of these things have become second nature to me, but it is just a part of the curse.

“It was Just Some E-mails, Man, Calm Down!”

To be very clear, the person who sent the e-mail – whose syntax I stole using struct nya * for this post for the _Constructor/_Destructor idea – is not someone I actually expect to send me a 5 page e-mail thesis on enhancements to the C object model. That person CLEARLY was just trying to give me a quick simple idea they thought up of that made it easy on them / solved the problem at hand, and I certainly don’t fault them for thinking of it! Their initiative actually demonstrates that rather than just doing the copy-paste roboticism of people who would blindly steal syntax from C++ and then strip off the bits they don’t like and go “See? Simple!” they’re actually thinking about and engaging with the technical merits of the problem. I certainly wish n3201 and other solutions had a fraction of that spark and curiosity and eagerness to explore the space and actually push the needle for C forward, rather than just being driven by trying to define C as “anti-C++”.

My intention is to keep moving forward with proposals like defer, among many others over the next few years, to start actually improving C for C’s sake. Sometimes this will mean cloning an idea right out of C++ and putting it in C; other times, weighing the pros and cons and addressing the inherent deficiencies in approaches to produce something better will be much more desirable. Knee-jerk reactions like those in n3201 rarely serve to help either language and are producing demonstrably worse outcomes; this also concerns me because I have had an idea for handling operators in C for a long time now, and seeing the current proposals do a poor job of handling the landscape is not going to bolster anyone’s confidence in how to do it…!

But, the person who inquired VIA e-mail deserves an enthusiastic “NICE”, a thumbs up, and maybe a cookie and a warm glass of milk for actually thinking about the problem domain. … In fact.

Cookies and milk sounds real good right now… 💚

]]>
<![CDATA[Ever since I finished publishing the “defer” paper and successfully defended it on its first go-around (it now has tentative approval to go to a Technical Specification, I just need to obtain the necessary written boilerplate to do so), an old criticism]]>
Implementing #embed for C and C++2023-10-19T00:00:00+00:002023-10-19T00:00:00+00:00https://thephd.dev/implementing-#embed<![CDATA[

I received a few complaints that #embed was difficult to implement and hard to optimize. And, the people making these claims are not exactly wrong. While std::embed was designed to be very simple and easy, the new #embed directive does the usual C thing: it’s “simple” on its face, but because of how C and C++ work and how the languages gel, it has a ton of devils in the details. In this post, I’m going to describe the way I implemented #embed in both GCC and Clang, and the style of work I used to support the few companies/vendors I worked with on an early version of #embed. I’ll use the publicly available version of #embed that I offered to Clang as a tool to display one of the usable techniques to get the guaranteed speedup for the subset of cases that matter (e.g., char/signed char/unsigned char array initialization).

Let’s get started.

Support Level 0: Basic #embed Expansion

Before we talk about the fast version of #embed, we need to discuss what it is specified to be. Consider the following two data files:

single_byte.txt:

a

art.txt:

           __  _
       .-.'  `; `-._  __  _
      (_,         .-:'  `; `-._
    ,'o"(        (_,           )
   (__,-'      ,'o"(            )>
      (       (__,-'            )
       `-'._.--._(             )
          |||  |||`-'._.--._.-'
                     |||  |||

We posit these are UTF-8 encoded text files, meaning the byte value of a is 97 (hexadecimal 0x61) with a size of 1 for the single_byte.txt, and the art.txt file has multiple values with a size of 275 (including the trailing \n newline). We then deploy these files using #embed, a new directive standardized in C23 and in-progress for standardization for C++26:

const unsigned char arr[] = {
#embed <art.txt>
};

int main () {
	return
#embed <single_byte.txt>
	;
}

The way #embed works is, conceptually, very simple: the preprocessor (stages 1 through 4 of the 7-stage compilation process of C and C++) expands the directive, according to any embed parameters and the file, and produces a “comma-delimited list of integer constant expressions” (or “integral constant expressions cast to unsigned char” for C++, but they mean the same thing here). Each value goes from 0 to 255. So, for the files above and the given program, that would look like this:

const unsigned char arr[] = {
0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x5f,
0x5f, 0x20, 0x20, 0x5f, 0x0a, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
0x2e, 0x2d, 0x2e, 0x27, 0x20, 0x20, 0x60, 0x3b, 0x20, 0x60, 0x2d, 0x2e,
0x5f, 0x20, 0x20, 0x5f, 0x5f, 0x20, 0x20, 0x5f, 0x0a, 0x20, 0x20, 0x20,
0x20, 0x20, 0x20, 0x28, 0x5f, 0x2c, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
0x20, 0x20, 0x20, 0x2e, 0x2d, 0x3a, 0x27, 0x20, 0x20, 0x60, 0x3b, 0x20,
0x60, 0x2d, 0x2e, 0x5f, 0x0a, 0x20, 0x20, 0x20, 0x20, 0x2c, 0x27, 0x6f,
0x22, 0x28, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x28, 0x5f,
0x2c, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
0x29, 0x0a, 0x20, 0x20, 0x20, 0x28, 0x5f, 0x5f, 0x2c, 0x2d, 0x27, 0x20,
0x20, 0x20, 0x20, 0x20, 0x20, 0x2c, 0x27, 0x6f, 0x22, 0x28, 0x20, 0x20,
0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x29, 0x3e,
0x0a, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x28, 0x20, 0x20, 0x20, 0x20,
0x20, 0x20, 0x20, 0x28, 0x5f, 0x5f, 0x2c, 0x2d, 0x27, 0x20, 0x20, 0x20,
0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x29, 0x0a, 0x20,
0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x60, 0x2d, 0x27, 0x2e, 0x5f, 0x2e,
0x2d, 0x2d, 0x2e, 0x5f, 0x28, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x29, 0x0a, 0x20, 0x20, 0x20, 0x20,
0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x7c, 0x7c, 0x7c, 0x20, 0x20, 0x7c,
0x7c, 0x7c, 0x60, 0x2d, 0x27, 0x2e, 0x5f, 0x2e, 0x2d, 0x2d, 0x2e, 0x5f,
0x2e, 0x2d, 0x27, 0x0a, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
0x20, 0x7c, 0x7c, 0x7c, 0x20, 0x20, 0x7c, 0x7c, 0x7c, 0x0a
};

int main () {
	return
0x61
	;
}

Simple enough. The problem with this — which is the problem with depending on program outputs from e.g. xxd -i or random python scripts you wrote because xxd is packaged only VIA vim for some inexplicable reason — is that it is slow. Horrifically slow, in fact. Taking a computer with the following specification:

OS Name: Microsoft Windows 10 Pro
Version: 10.0.19045 Build 19045
System Type: x64-based PC
Processor: AMD Ryzen 9 5950X 16-Core Processor, 3401 MHz, 16 Core(s), 32 Logical Processor(s)
Installed Physical Memory (RAM): 32.0 GB
Total Physical Memory: 31.9 GB
Total Virtual Memory: 36.7 GB

and dropping in a simple 40 MB file potato.bin filled with random data processed through xxd -i takes over 70 seconds to process. And, the worst part is, no matter how much we try to optimize a C++ frontend to parse things faster, the numbers do not get any better! So, we know expanding to a list of integer constants is very bad for build speed: why, then, is #embed specified in this manner? The reality on the ground is that C compilers are very weak creatures. Compared to the 4 or 5 central C++ compilers that exist in the world, there are easily over 100 different C compilers of varying flavors, powers, and implementation effort. At the end of the day, we had to write a specification that allowed the world’s worst compiler to continue being the world’s worst compiler (presumably, because their implementers are making a tradeoff for some other aspect of C they like more).

Therefore, at support level 0, just “expanding to a list of integers” (or a single integer if there is only one byte in the file) is the core behavior. This behavior is not entirely useless, however, and it will notably be used for some of the more interesting cases we will start outlining as we keep on implementing more and more specialized behavior to increase speed.

The first step is, obviously, adding flags to ensure that the compiler frontend knows where to find data. Do NOT use the #include paths specified through -I for this: that is a surefire way to make life terrifically annoying and difficult for users, and to pull in headers or data nobody ever wanted. Use a separate flag that provides directories for this. The implementation I made for Clang used -embed-dir WHATEVER and -embed-dir=WHATEVER. Given my data is in a directory called ./media, the invocation would look like: clang -embed-dir=./media/ -o main.exe main.c. All of the search directories are accumulated in order; additionally, the current directory of the file we are working with (e.g., main.c) is used for lookup when #embed "whatever.h" (with quotes) is used.

Now that we can find the files, the way this works in Clang is simple. We create a pseudo-file inside of the compiler, give it a fancy name, and then quite literally just dump the integer literal tokens into it. Stepping back, this:

const unsigned char arr[] = {
#embed <art.txt>
};

int main () {
	return
#embed <single_byte.txt>
	;
}

Is more faithfully represented by a multi-file split:

////////////////////////////////////////////////
// Enter `main.xxd.cpp`
////////////////////////////////////////////////
const unsigned char arr[] = {
////////////////////////////////////////////////
// Enter `art.txt`-generated
// file internally named `<built-in:embed:1>`
////////////////////////////////////////////////
0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x5f,
0x5f, 0x20, 0x20, 0x5f, 0x0a, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
0x2e, 0x2d, 0x2e, 0x27, 0x20, 0x20, 0x60, 0x3b, 0x20, 0x60, 0x2d, 0x2e,
0x5f, 0x20, 0x20, 0x5f, 0x5f, 0x20, 0x20, 0x5f, 0x0a, 0x20, 0x20, 0x20,
0x20, 0x20, 0x20, 0x28, 0x5f, 0x2c, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
0x20, 0x20, 0x20, 0x2e, 0x2d, 0x3a, 0x27, 0x20, 0x20, 0x60, 0x3b, 0x20,
0x60, 0x2d, 0x2e, 0x5f, 0x0a, 0x20, 0x20, 0x20, 0x20, 0x2c, 0x27, 0x6f,
0x22, 0x28, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x28, 0x5f,
0x2c, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
0x29, 0x0a, 0x20, 0x20, 0x20, 0x28, 0x5f, 0x5f, 0x2c, 0x2d, 0x27, 0x20,
0x20, 0x20, 0x20, 0x20, 0x20, 0x2c, 0x27, 0x6f, 0x22, 0x28, 0x20, 0x20,
0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x29, 0x3e,
0x0a, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x28, 0x20, 0x20, 0x20, 0x20,
0x20, 0x20, 0x20, 0x28, 0x5f, 0x5f, 0x2c, 0x2d, 0x27, 0x20, 0x20, 0x20,
0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x29, 0x0a, 0x20,
0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x60, 0x2d, 0x27, 0x2e, 0x5f, 0x2e,
0x2d, 0x2d, 0x2e, 0x5f, 0x28, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x29, 0x0a, 0x20, 0x20, 0x20, 0x20,
0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x7c, 0x7c, 0x7c, 0x20, 0x20, 0x7c,
0x7c, 0x7c, 0x60, 0x2d, 0x27, 0x2e, 0x5f, 0x2e, 0x2d, 0x2d, 0x2e, 0x5f,
0x2e, 0x2d, 0x27, 0x0a, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
0x20, 0x7c, 0x7c, 0x7c, 0x20, 0x20, 0x7c, 0x7c, 0x7c, 0x0a
////////////////////////////////////////////////
// Return to `main.xxd.cpp`
////////////////////////////////////////////////
};

int main () {
	return
////////////////////////////////////////////////
// Enter `single_byte.txt`-generated
// file internally named `<built-in:embed:2>`
////////////////////////////////////////////////
0x61
////////////////////////////////////////////////
// Return to `main.xxd.cpp`
////////////////////////////////////////////////
	;
}

Internally, there is just a memory buffer called <built-in:embed:1> (where 1 just means it is the first file being inserted; 2 would be the second, and so on and so forth). It is presented as a “file”, and we just “enter” that memory buffer as a “file” and parse it like normal. Very simple stuff, it behaves exactly like #include. As a compiler developer, you also need to make sure you update your support for /showIncludes file dependency generation (MSVC) or -MMD Makefile dependency generation (GCC, Clang, or literally most other compilers). This allows #embed to work pretty much out-of-the-box with your makefile generators and other types of dependency-parsing tools that exist out in the wild, without requiring any updates on the build system side.
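For illustration, the dependency file generated by -MMD for the two-directive example above might look something like this (a hypothetical sketch; exact paths and formatting vary by compiler):

```make
# hypothetical main.d, generated by: clang -MMD -embed-dir=./media -c main.c
main.o: main.c media/art.txt media/single_byte.txt
```

Because the embedded files appear as ordinary prerequisites, make-style build systems will rebuild the translation unit when the data files change, with no special handling.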

A Clang-Specific Explosion

Another Clang-specific part of this that’s very awkward is that you need to generate an actual in-memory source file with this data in it, rather than just directly creating a token stream and pushing that into the compiler’s frontend to handle. The reason here is much less language-design oriented and more compiler-architecture oriented. An earlier version of this code simply generated a sequence of tokens and jammed it back into the parser to deal with. This suddenly caused an inadvertent, potentially infinite number of out-of-bounds reads in the part of Clang responsible for dumping token representations back out as a fully preprocessed file.

The problem was that, somewhat hilariously, rather than Clang hardcoding the write out of things such as comma tokens by using a "," in the compiler’s code to be dumped into the output stream for the preprocessed file, it would simply assume there was a comma in the (original or generated) source code that represented the comma token. That caused the Clang “write preprocessed file” action to go look up a source location for a magic comma token that was being generated but had no backing source data in its SourceManager, and whose source location was just pointing at where the #embed had been. The result was effectively performing random reads of unknown data and piping that straight into the output stream.

It was a fun bug to track down:

A failed reprinting of the source code accessing (potentially already-released?) memory.

Compiler-specific shenanigans aside…

If you could get away with generating tokens directly rather than source code, that would save you a bit of time performing what most compilers call “tokenization” of source code. But, because I did not feel like dealing with Clang’s source location-based assumptions, I simply generated a source file and had Clang process that instead. This results in a bit of lost time (not too significant, really, but some work always takes longer than simply not doing the work at all). A more optimized version of this would sidestep these problems deftly and avoid having to re-tokenize raw generated source code back into a sequence of {integer literal} {comma} {integer literal} … tokens.

Nevertheless, solving this issue meant that we could dump out a fully preprocessed file when given the -E option. This meant that specific C and C++ tools that just preprocessed source files and did not retain include flag or embed directory information could reliably parse/process these all-bits-included files that just had the integer list expansion baked right in. This served as the baseline support for #embed. There was just one more thing to do to round out Level 0 support…

Support Level 0, Part II: Preprocessor Parameters

Preprocessor parameters are a newer way to pass additional information to preprocessor directives in C23. They are a whitespace-delimited sequence of foo, bar(potentially-empty-balanced-token-sequence), vendor::baz, or vendor::quux(potentially-empty-balanced-token-sequence) arguments that can be given to a preprocessor directive. They are only utilized for the #embed directive at the moment, but as compiler implementers find their bravery to actually start implementing extensions again instead of just constantly poking the Standards Committee to act first, they may start showing up in other places as a means to perform fun tasks.

Fun ideas aside, there are 4 different preprocessor parameters that are mandated by the C standard for #embed: limit, prefix, suffix, and if_empty.

  • limit( constant-expression ) takes an integer constant expression and lets a file be up to (but no bigger than) the provided limit. This is useful for #embed <infinity_file> limit(value), like #embed </dev/urandom> limit(64).
  • prefix(balanced-token-sequence)/suffix(balanced-token-sequence) both take a sequence of tokens and apply it to the beginning or end of any generated integer token sequence, respectively. If there is no data in the file (or if it is set to limit(0), which will trigger the file to be considered empty), then this parameter has no effect.
  • if_empty(balanced-token-sequence) takes the sequence of tokens and expands the directive to those tokens, if there is no data in the file (or if it is set to limit(0)).

Implementing these is not hard: all one has to do is drop the token/text sequence out where expected. When one encounters the #embed directive and parses a prefix or suffix parameter, all one needs to do is search for the file. If the file is empty, both token sequences are ignored; otherwise, they get placed before or after the embed directive’s contents, respectively. Conversely, if if_empty is present and the file is empty, then its token sequence appears where the integer sequence would have.

limit(…) is just doing min(limit-expression, size-of-file): if the file size is larger than the limit, then the limit should be chosen; otherwise, the file size should be chosen. limit(…) specifically refers to the number of integer literals that will be created in the sequence list, and not necessarily the number of bytes. These hold a 1:1 correlation on most implementations (e.g., CHAR_BIT == size-of-filesystem-byte), but care needs to be taken on the World’s Weirdest Implementations™ (e.g., CHAR_BIT == 9 and fs-byte == 8, or similar foolish shenanigans). The actual wording in the specification for C and C++ has protections against this, by very literally talking about the bit size of the file (or the provided limit-expr ✖ bit width), the bit width of each integer literal, and how the second must cleanly divide into the first. A diagnostic is required if it does not cleanly divide. The full available range can then be defined in interval notation as

\([0, min(limit, file size))\).

There is one other parameter that is part of the Clang implementation that was asked for frequently when I was standardizing #embed. Unfortunately, I am not superhuman and did not have enough time to roll it out into the standard. Part of standardization is, of course, Standardizing Existing Practice, and so as part of the next level of support, adding a few vendor-specific parameters will help bolster adding them to the next C standard.

Support Level 1: clang::offset

This will obviously have to be called gnu::offset for GCC, and then everybody will copy from there. But, the goal is effectively to create an offset( constant-expression ) preprocessor parameter. This does exactly what you’d expect: it would drop up to constant-expression elements from the beginning of the read data. This also has the chance to turn the data empty as well, if the offset is greater than the data (after the limit is applied). So, for example:

#embed <single_byte.txt> limit(0) /* empty */
#embed <single_byte.txt> offset(1) /* empty */
#embed </dev/urandom> offset(1) limit(1) /* empty */
#embed </dev/urandom> offset(458946493) limit(1) /* empty */

Notably, the last one is not a constraint violation: it simply does min(offset-expression, size-of-file). The full available range can then be defined with interval notation as

\([min(offset, limit, file size), min(limit, file size))\).

There are also many more advanced parameters that can be provided, such as a parameter for width( constant-expression ). This would define the number of bits that would be used for each element to generate the integer literal, which could be useful for initializing larger integral types or custom types when the data is type-punned. But, with that done, I could now move on to speeding the whole thing up! Retaining the support for various constructs above is nuanced, as we will see as we start talking about the next level: built-in speed support.

Support Level 2: Speedy Builtins

So, we implemented a basic preprocessor directive and dumped the contents to a file. It:

  • is slow for large files even though we’re generating the data directly in the preprocessor;
  • has tooling support (e.g. icecc/distcc) through “data is directly inside the generated preprocessed file”;
  • allows us to use it in places where only a single expressions (integer literal) is expected, such as return from int main();
  • and, works to initialize an array of unsigned char type (or any other type that accepts a list of comma-delimited integer literals).

We need to retain all of these properties, while speeding up the invocation significantly. For this, we implement a compiler-specific built-in. We will call this built-in __builtin_pp_embed. It will take 3 arguments:

  • the expected type of each element (for now, always unsigned char);
  • the filename as a string literal;
  • and, the data encoded as a base64 string literal.

There are more advanced³ versions of this built-in that I have implemented in other versions of this code, but I am not talking about such implementations here. Of course, I am glossing over the most interesting facet of this list: that last bullet point about “base64 string literal”. Some may read that and go ❓❓, and it would not be a bad reaction honestly! It does sound very silly, but it is actually an important facet of the new built-in.

Surviving the -E Tools

One of the requirements for this functionality is that it survives existing tooling. This includes icecc or distcc that employs -E upon the code to generate a single file before throwing it up to a server to build that single preprocessed source file. If you want a “fast” built-in that respects this, then that necessarily means that every time a file is processed with -E — every time data is pulled into a single source file — all of the data must be present. This means that you cannot just put an (absolute) file path into the built-in; icecc and distcc do not replicate the source file tree in any way, shape, or form. Most other tools also do not include full source tree information or work this way, nor do any of those “send us a single preprocessed file” bug-reporting tools for C and C++ toolchains expect your whole working include (and now, embed) directory structure.

Thusly, when you “finish” preprocessing, you need to contain all of the data in a friendly-to-tools manner. Friendly in this case includes being friendly to tools that break source code down into logical source code lines and then use regex to find #include or #embed directives. So, when processing this main.cpp file:

const unsigned char arr[] = {
#embed <art.txt>
};

int main () {
	return
#embed <single_byte.txt>
	;
}

things end up looking like this when you generate the built-in based code after preprocessing (with large comment block annotations, similar to above code examples):

////////////////////////////////////////////////
// Enter `main.cpp`
////////////////////////////////////////////////
const unsigned char arr[] = {
////////////////////////////////////////////////
// Enter `art.txt`-generated
// file internally named `<built-in:embed:1>`
////////////////////////////////////////////////
__builtin_pp_embed(unsigned char, "/home/derp/pp_embed/examples/media/art.txt",
"ICAgICAgICAgICBfXyAgXwogICAgICAgLi0uJyAgYDsgYC0uXyAgX18g"
"IF8KICAgICAgKF8sICAgICAgICAgLi06JyAgYDsgYC0uXwogICAgLCdv"
"IiggICAgICAgIChfLCAgICAgICAgICAgKQogICAoX18sLScgICAgICAsJ"
"28iKCAgICAgICAgICAgICk+CiAgICAgICggICAgICAgKF9fLC0nICAgIC"
"AgICAgICAgKQogICAgICAgYC0nLl8uLS0uXyggICAgICAgICAgICAgKQo"
"gICAgICAgICAgfHx8ICB8fHxgLScuXy4tLS5fLi0nCiAgICAgICAgICAg"
"ICAgICAgICAgIHx8fCAgfHx8Cg==");
////////////////////////////////////////////////
// Return to `main.cpp`
////////////////////////////////////////////////
};

int main () {
	return
////////////////////////////////////////////////
// Enter `single_byte.txt`-generated
// file internally named `<built-in:embed:1>`
////////////////////////////////////////////////
__builtin_pp_embed(unsigned char, "/home/derp/pp_embed/examples/media/single_byte.txt", "YQ==");
////////////////////////////////////////////////
// Return to `main.cpp`
////////////////////////////////////////////////
	;
}

Notice how this source file only contains constructs that are:

  • blindly ASCII parse-ready;
  • do not require access to the original source files anymore;
  • and, understandable as normal C or C++ source code.

This means that icecc/distcc-style tools would not trip up a re-run of the compiler on the single unified source file. Base64 encoding the data in the second string literal argument is important, because data from a file could look like either valid C++ source when it is meant to be data or could contain bytes in the data that would absolutely destroy traditional/typical C and C++ tooling (like actual embedded nulls).

A Poorly Conceived Idea

A few C++ implementers had a poorly thought-through idea for how to handle this during -E processing. Particularly, their idea was to inject a special, compiler-specific _Pragma rather than something like __builtin_pp_embed; it would indicate the number of bytes of the #embed‘d file before dumping the data raw into the source file. As you can imagine, doing the _Pragma would mean the fully-preprocessed version of this file:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
	char src[] = {
#embed __FILE__
	}, *argv[] = { "./out", NULL };
	FILE *fd = fopen("src.c", "w+");
	fwrite(src, sizeof(src), 1, fd);
	fclose(fd);
	system("${CC} src.c -o out");
	return execv(argv[0], argv);
}

Would trip most tools up. Tools would not understand a generated compiler-specific _Pragma/#pragma that would contain C++ source code, such as:

/* stdio.h expansion here */
/* stdlib.h expansion here */
/* unistd.h expansion here */

int main(void) {
	char src[] = {
///////////////
// start pragma
#pragma embed 286 #include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
	char src[] = {
#embed __FILE__
	}, *argv[] = { "./out", NULL };
	FILE *fd = fopen("src.c", "w+");
	fwrite(src, sizeof(src), 1, fd);
	fclose(fd);
	system("${CC} src.c -o out");
	return execv(argv[0], argv);
}
///////////////
// end pragma
	}, *argv[] = { "./out", NULL };
	FILE *fd = fopen("src.c", "w+");
	fwrite(src, sizeof(src), 1, fd);
	fclose(fd);
	system("${CC} src.c -o out");
	return execv(argv[0], argv);
}

This is, of course, a travesty of new lines and other directives nested in on itself. This absolutely destroys and breaks tooling built on top of -E preprocessed source files. Therefore, the data must be turned into a form that is palpably understandable by something that can handle “regex for function calls” or “regex after logical line processing for preprocessor directives”. Anything that interferes with that idea breaks too much tooling to be (widely) viable, though it may be suitable for internal-only processing. However, if someone has a compiler with a fused preprocessor, C or C++ language frontend, and backend, they could skip this hullabaloo about _Pragmas or built-ins or what-have-you and just blast the memory into the optimal place in the compiler on the first go-around.

All in all, not a worthwhile long-term implementation strategy and one I almost lost a bunch of time trying to make happen; here’s to you not having to make the same mistake as I did.

Nevertheless,

Adding support for this is actually more complicated than imagined. For example, because this is a preprocessor directive, melting things down into a built-in can produce many surprising consequences for where it appears. It’s not just return statements or function invocations; it can appear in arguments, in template parameters, in places where nothing is expected, and so much more:

#embed <media/empty>
;

void f (unsigned char x) { (void)x;}
void g () {}
void h (unsigned char x, int y) {(void)x; (void)y;}
int i () {
	return
#embed <single_byte.txt>
		;
}

_Static_assert(
#embed <single_byte.txt> suffix(,)
""
);
_Static_assert(
#embed <single_byte.txt>
, ""
);
_Static_assert(sizeof(
#embed <single_byte.txt>
) ==
sizeof(unsigned char)
, ""
);
_Static_assert(sizeof
#embed <single_byte.txt>
, ""
);
_Static_assert(sizeof(
#embed <jk.txt>
) ==
sizeof(unsigned char)
, ""
);

#ifdef __cplusplus
template <int First, int Second>
void j() {
	static_assert(First == 'j', "");
	static_assert(Second == 'k', "");
}
#endif

void do_stuff() {
	f(
#embed <single_byte.txt>
	);
	g(
#embed <media/empty>
	);
	h(
#embed <jk.txt>
	);
	int r = i();
	(void)r;
#ifdef __cplusplus
	j<
#embed <jk.txt>
	>(
#embed <media/empty>
	);
#endif
}

This is what cost the majority of the implementation time when working on the built-in. Because the built-in is generated by the preprocessor but parsed by the frontend, handling this was the largest portion of what made it difficult. There’s a few things we did to make all of this work out rather simply, at least in Clang. Different compilers have different architectures, but many of these ideas can be applied universally.

0. Implement __builtin_pp_embed as a Keyword

Trying to parse __builtin_pp_embed as a “built-in function call”, versus just grafting support directly into the parser with __builtin_pp_embed explicitly as a keyword, is significantly more nightmarish. It opts into a lot of mechanisms and code around function calls that assume a single return value (not the case for empty embeds or embeds that produce multiple integer literals). It absolutely requires manual tweaking if you want to do things like read a type name as an argument without having your typical function-body parser explode. There are also rules in both C and C++ that automatically decay arrays to pointers when put in normal function calls, making it difficult to retrieve information when you get to the “Semantic Analysis” part of working with code.

Instead, parsing __builtin_pp_embed as a keyword and then simply expecting the parentheses, type name, file name string literal, and string literal arguments results in far less code and far fewer post-hoc adjustments. It’s also marginally faster than reverse-engineering the proper data during Semantic Analysis and Constant Expression parsing. Internally, this produces a distinct PPEmbedExpr that contains the base64-decoded data as a StringLiteral (Clang) or a distinct VECTOR_CST-style tree node with a string node stored as part of the VECTOR_CST’s data and pattern (GCC). Most compilers have special __builtin_* markers that are treated as private keywords, even TCC. This implementation technique allows you to get right into the special internal format necessary for later processing and speed recognition.

1. Stringent Speed Requirements

In your compiler architecture, you want to eliminate any node or leaf object that represents the built-in as soon as is technologically possible. Specifically for Clang, this is possible by recognizing a sequence of conditions:

  • if the built-in is being used to initialize an array of character type (e.g. char or unsigned char or even signed char);
  • and if there is only ONE initializer in the list of initializers for an object that is the built-in;

then, the built-in node or tree element just gets replaced with a magic string literal that was generated from the decoded base64 data. The realization here comes from noting that string literals are, quite literally, the fastest array initializers in almost every C and C++ compiler today. String literals and their initializers are also often rarely copied, making them supremely ideal for the goal of initializing these arrays. This also prevents us from having to give a single damn about further downstream portions of either Clang or GCC’s compiler architecture: just substitute in a single string literal and let the usual “array initialization from a string literal” take hold.

This seemed like a hack, because it meant I did not have to touch much, if any, of Clang’s semantic analysis (earlier versions of my patch got completely lost in the SemaExpr sauce and the constant expression parser while trying to gain bigger and bigger speedups), nor did I have to so much as look at the Code Gen. But, it actually paid off enormously in implementation speed, implementation correctness, and end-result speed. I encourage almost every single compiler to follow the above 2 guidelines; if they cannot form a single initializer with all of the provided initializers so that it can simply be folded down into a typical array initialization of one of the character arrays, then go to Step #2.

2. Aggressively Expand Everywhere Else

If the two conditions above aren’t met and the initializer data cannot be massaged into the moral, spiritual, and factual equivalent of a string literal initialization, expand the directive. This is where things become really difficult, because not everything is allowed to expand in-place. For example, the return __builtin_pp_embed(…); statement cannot handle having 2 integers present. It can work with 1 integer, or with 0 (for a void function that does return ;). This requires a manual diagnostic when handling a return expression, and it has to be done as early as possible in either the parser or the semantic analyzer. This is where implementation difficulty fully ramps up, and is where I spent the least amount of time for the patch. A lot of things work correctly in the basic case, but extended and honestly completely asinine usages of #embed can and do break the compiler.

Because the compiler was not built to accept a comma-delimited list of integers anywhere and everywhere, the idea that a single expression — __builtin_pp_embed(…) — could turn into one is a fine way to make every part of the compiler scream. So, instead, I focused my energy on getting things correct for the typical usages and a few odd places, and leaving the rest to, effectively, undefined behavior and fate.

Thankfully, there’s a few key places in Clang where All Function Call Arguments are finalized/massaged, and central locations where All Template Arguments are processed, so the two big cases where this may happen are easy to handle. Initialization also has a Single Coalescing Location, which takes care of all structure and array initialization and makes it easy to get the speedup. It’s all the tiny little stragglers that need to be cleaned up, and that’s where my energy levels hit rock bottom. Having already done a lot of this boilerplate over the last 5 years, implementing it all again for #embed and std::embed over and over and over and over is… draining!

3. Recognize Simple Cases in the Preprocessor

Another way to avoid problems with #embed ruining things in unexpected ways in your compiler is — when doing the transition from #embed to __builtin_pp_embed in the preprocessor — simply not generating the built-in when it would not be useful. So, for example, if you detect that the file is empty (limit is too small or offset is too high or both, or the file is legitimately empty), there is absolutely no reason to produce the built-in at all. Just expand it out to nothing and leave. If there’s an if_empty(…), just pour the text directly into a buffer, make it a new file, and deem that the expansion. Let it parse normally, like any other preprocessor expansion. It avoids a wide class of issues related to “how do I delete this tree node / expression leaf out from itself?!”. The inverse of this, of course, is…

4. Recognize When Something Is Not Built-in-able

If you took the advice in the preceding section, then the only time you’re going to make a built-in is if there is data. So, we know for a fact that any suffix(…) or prefix(…) has to be valid and will be put out by the directive. If there is a suffix(…) or prefix(…) parameter, you can do a quick check to see if it is worth your time to turn it into a built-in. For prefixes, check if there is a comma delimited list of integers that end with a comma. For suffixes, check if it starts with a comma and then becomes a comma delimited list of integers after that. If one or both of these hold true, then you can just immediately slurp that data up into whatever binary data was produced, making sure that each integer constant is within the range \([0, 2^{CHAR\_BIT})\). Then you emit a single __builtin_pp_embed without adding any additional tokens before or after it.

If either is not, then just bail and expand the list of integers as programmed in Support Level 0. This was actually pointed out during the C Meeting by Joseph Myers — a prominent contributor to glibc, GCC’s C code, and several other highly used projects — as something implementations can do to keep code optimized as early as possible and not trip up the conditions above. This was also a primary reason why suffix(…) and prefix(…) were kept as embed parameters, despite being able to program this in multiple different ways. For example, all of these will make a null-terminated string:

const char non_optimized0[] = {
#embed "shaders/super_glitter.glsl"
	, 0
};

const char non_optimized1[] = {
#embed "shaders/super_glitter.glsl"
#if __has_embed("shaders/super_glitter.glsl") == __STDC_EMBED_FOUND__
	,
#endif
	0
};

const char optimized0[] = {
#embed "shaders/super_glitter.glsl" \
	suffix(, 0)
};

It is worth noting that non_optimized0 also just completely breaks if shaders/super_glitter.glsl is empty. But, assuming there is data in the file, the first two will not optimize cleanly; the current directive will just vomit it out into a mess. Contrarily, the last one will much more easily be optimized by most implementations. Doing this for both prefix(…) and suffix(…) will become increasingly important as people use these parameters to do things such as providing integer sequences that map to "#version 420" or other common top-level string boilerplate for all sorts of files.

But…

That’s it, in terms of implementation prowess. Ostensibly, Support Level 0 is enough to be a conforming implementation. There are tons of examples in the Clang pull request and other places for this. There are also many more extensions that can be implemented for this functionality. I can only hope that implementers that read this are emboldened to add more directives, get spicy with how they implement things, and try expanding on their techniques into the future. A stagnant implementer culture that always wants to reach for assured, standards-mandated things is no fun in a world as vast and as lovely as Computing. And, of course, having dreams and realizing them means that all of us, together, get to see…

A screenshot of a Microsoft Windows Terminal, showing 3 prompts. One is "type main.c", showing a simple main.c file that makes a `constexpr` array and `#embed "potato.bin"` into it. The next prompt is "type main.xxd.c", showing a simple main.xxd.c which just includes an xxd-generated "potato.bin.h" file. The last prompt is a call to "dir", showing -- in particular -- the sizes of "main.exe", "main.no_builtin.exe", "main.xxd.exe", "potato.bin", and "potato.bin.h". A single command line PowerShell prompt command, which reads: Measure-Command { D:\Sync\Cross\llvm-project.cmake\vs\install\x64-Release\bin\clang.exe -std=c++2c -x c++ main.xxd.c -o main.xxd.exe | Out-Default }. It shows: TotalSeconds: 75.0577052 A single command line PowerShell prompt command, which reads: Measure-Command { D:\Sync\Cross\llvm-project.cmake\vs\install\x64-Release\bin\clang.exe -std=c++2c -x c++ main.c -o main.no_builtin.exe -fno-builtin-pp_embed | Out-Default }. It shows: TotalSeconds: 128.3183197

Just how much faster things can be compared to the tools we’ve been using for 40+ years:

A single command line PowerShell prompt command, which reads: Measure-Command { D:\Sync\Cross\llvm-project.cmake\vs\install\x64-Release\bin\clang.exe -std=c++2c -x c++ main.c -o main.exe | Out-Default }. It shows: TotalSeconds: 2.395532

A better future is possible. A future that’s at least 37x as fast as the one we’re living in. We just have to grasp it.

With our own two hands. 💚4

Footnotes

  1. Notably, C++ needs a cast to unsigned char with each integer literal due to its type deduction rules. Each element of the array is meant to be an unsigned char value, whereas a plain e.g. 212 is considered an int, which means initialization of an unsigned char array might not go well for uses of auto or Class Template Argument Deduction (CTAD).

  2. This actually takes CHAR_BIT bits from the file, going from \([ 0, 2^{CHAR\_BIT} )\) for each generated integer constant expression. 

  3. For example, a four-argument version of the built-in would take: the expected type of each element; the number of bits to use per-element (for the width( constant-expression ) parameter mentioned earlier); the filename as a string literal; and, the data encoded as base64 string literal. One could provide a vendor::type(int32_t) attribute to demand that sizeof(int32_t) * CHAR_BIT bits are used. These are not implemented in the Clang branch we are discussing in this implementation, though I have successfully implemented it previously (with any std::is_trivial_v type, which includes literally all types from C). 

  4. Header image photo by Martin Lopez from Pexels

]]>
cuneicode, and the Future of Text in C
2023-06-07T00:00:00+00:00
https://thephd.dev/cuneicode-and-the-future-of-text-in-c
<![CDATA[

Following up from the last post, there is a lot more we need to cover. This was intended to be the post where we talk exclusively about benchmarks and numbers. But, I have unfortunately been perfectly taunted and status-locked, like a monster whose “aggro” was pulled by a tank. The reason, of course, is due to a few folks taking issue with my outright dismissal of the C and C++ APIs (and not showing them in the last post’s teaser benchmarks).

Therefore, this post will be squarely focused on cataloguing the C and C++ APIs in detail, and how to design ourselves away from those mistakes in C.

Part of this post will add to the table from Part 1, talking about why the API is trash (rather than just taking it for granted that industry professionals, hobbyists, and academic experts have already discovered how trash it is). I also unfortunately had to add it to the benchmarks we are doing (which means using it in anger). As a refresher, here’s where the table we created left off, with all of what we discovered (including errata from comments people sent in):

Feature Set 👇 vs. Library 👉 ICU libiconv simdutf encoding_rs/encoding_c ztd.text
Handles Legacy Encodings
Handles UTF Encodings 🤨
Bounded and Safe Conversion API
Assumed Valid Conversion API
Unbounded Conversion API
Counting API
Validation API
Extensible to (Runtime) User Encodings
Bulk Conversions
Single Conversions
Custom Error Handling 🤨 🤨
Updates Input Range (How Much Read™) 🤨
Updates Output Range (How Much Written™)
Feature Set 👇 vs. Library 👉 boost.text utf8cpp Standard C Standard C++ Windows API
Handles Legacy Encodings 🤨 🤨
Handles UTF Encodings 🤨 🤨
Bounded and Safe Conversion API 🤨
Assumed Valid Conversion API
Unbounded Conversion API
Counting API 🤨
Validation API 🤨
Extensible to (Runtime) User Encodings
Bulk Conversions 🤨 🤨
Single Conversions
Custom Error Handling
Updates Input Range (How Much Read™)
Updates Output Range (How Much Written™)

In this article, what we’re going to be doing is sizing up particularly the standard C and C++ interfaces, benchmarking all of the APIs in the table, and discussing in particular the various quirks and tradeoffs that come with doing things in this manner. We will also be showing off the C-based API that we have spent all this time leading up to, its own tradeoffs, and if it can tick all of the boxes like ztd.text does. The name of the C API is going to be cuneicode, a portmanteau of Cuneiform (one of the first writing systems) and Unicode (of Unicode Consortium fame).

Feature Set 👇 vs. Library 👉 Standard C Standard C++ ztd.text ztd.cuneicode
Handles Legacy Encodings 🤨 🤨
Handles UTF Encodings 🤨 🤨
Bounded and Safe Conversion API 🤨
Assumed Valid Conversion API
Unbounded Conversion API
Counting API
Validation API
Extensible to (Runtime) User Encodings
Bulk Conversions 🤨
Single Conversions
Custom Error Handling
Updates Input Range (How Much Read™)
Updates Output Range (How Much Written™)

First, we are going to thoroughly review why the C API is a failure API, and all the ways it precipitates the failures of the encoding conversions it was meant to cover (including the existing-at-the-time Big5-HKSCS case that it does not support).

Then, we will discuss the C++-specific APIs that exist outside of the C standard. This will include going beneath std::wstring_convert’s deprecated API, to find what powers the string conversions that it used to provide. In particular, we will discuss std::codecvt<InternCharType, ExternCharType, StateObject> and the various derived classes std::codecvt(_utf8/_utf16/_utf8_utf16). We will also talk about how the C API’s most pertinent failure leaks into the C++ API, and how that pitfall is the primary reason why Windows, specific IBM platforms, lots of BSD platforms, and more cannot properly support UTF-16 or UTF-32 in their core C or C++ standard library offerings.

Finally, we will discuss ztd.cuneicode / cuneicode, a C library for doing encoding conversions that does not make exceedingly poor decisions in its interfaces.

Standard C

Standard C’s primary deficiency is its constant clinging to and dependency upon the “multibyte” encoding and the “wide character” encoding. In the upcoming C23 draft, these have been clarified to be the Literal Encoding (for "foo" strings, at compile-time (“translation time”)), the Wide Literal Encoding (for L"foo" strings, at compile-time), Execution Encoding (for any const char*/[const char*, size_t] that goes into run time (“execution time”) function calls), and Wide Execution Encoding (for any const wchar_t*/[const wchar_t*, size_t] that goes into run time function calls). In particular, C relies on the Execution Encoding in order to go-between UTF-8, UTF-16, UTF-32 or Wide Execution encodings. This is clear from the functions present in both the <wchar.h> headers and the <uchar.h> headers:

// From Execution Encoding to Unicode
size_t mbrtoc32(char32_t* restrict pc32, const char* restrict s,
	size_t n, mbstate_t* restrict ps );
size_t mbrtoc16(char16_t* restrict pc16, const char* restrict s,
	size_t n, mbstate_t* restrict ps );
size_t mbrtoc8(char8_t* restrict pc8, const char* restrict s,
	size_t n, mbstate_t* restrict ps ); // ⬅ C23 addition

// To Execution Encoding from Unicode
size_t c32rtomb(char* restrict s, char32_t c32,
	mbstate_t* restrict ps);
size_t c16rtomb(char* restrict s, char16_t c16,
	mbstate_t* restrict ps);
size_t c8rtomb(char* restrict s, char8_t c8,
	mbstate_t* restrict ps); // ⬅ C23 addition

// From Execution Encoding to Wide Execution Encoding
size_t mbrtowc(wchar_t* restrict pwc, const char* restrict s,
	size_t n, mbstate_t* restrict ps);
// Bulk form of above
size_t mbsrtowcs(wchar_t* restrict dst, const char** restrict src,
	size_t len, mbstate_t* restrict ps);
// From Wide Execution Encoding to Execution Encoding
size_t wcrtomb(char* restrict s, wchar_t wc,
	mbstate_t* restrict ps);
// Bulk form of above
size_t wcsrtombs(char* restrict dst, const wchar_t** restrict src,
	size_t len, mbstate_t* restrict ps);

The naming pattern is “(prefix)(s?)(r)to(suffix)(s?)”, where “s” means “string” (bulk processing), “r” means “restartable” (it takes a state parameter so that a string can be re-processed by itself), and the core to in the middle just signifies that the conversion goes from the prefix-identified encoding to the suffix-identified encoding. mb means “multibyte”, wc means “wide character”, and c8/16/32 are “UTF-8/16/32”, respectively.
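To make the naming pattern concrete, here is a minimal sketch (assuming the "C" locale for simplicity, with error handling trimmed down): mbrtowc — “mb”, restartable, to “wc” — converts one character at a time, while mbsrtowcs — “mbs”, restartable, to “wcs” — is its bulk counterpart. The helper names are mine, not standard ones.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <wchar.h>

// Single conversions: one multibyte character per call, state threaded
// through an explicit mbstate_t. Returns wide characters written, or
// (size_t)-1 on failure.
size_t one_at_a_time(wchar_t* out, size_t out_count, const char* s, size_t n) {
	mbstate_t state = {0};
	size_t written = 0;
	while (n > 0 && written < out_count) {
		size_t consumed = mbrtowc(&out[written], s, n, &state);
		if (consumed == (size_t)-1 || consumed == (size_t)-2)
			return (size_t)-1; // encoding error / incomplete input
		if (consumed == 0)
			break; // decoded the terminating null character
		written += 1;
		s += consumed;
		n -= consumed;
	}
	return written;
}

// Bulk conversion: the whole string in one call. An implementation is free
// to use SIMD or other whole-buffer tricks here, which the per-character
// version above cannot benefit from.
size_t all_at_once(wchar_t* out, size_t out_count, const char* s) {
	mbstate_t state = {0};
	// mbsrtowcs advances its source pointer as it goes.
	return mbsrtowcs(out, &s, out_count, &state);
}
```

Both helpers produce identical output for valid input; the bulk one is simply the only shape of API that leaves the library room to optimize.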

Those are the only functions available, and with it comes an enormous dirge of problems that go beyond the basic API design nitpicking of libraries like simdutf or encoding_rs/encoding_c. First and foremost, it does not include all the possible pairings of encodings that it already acknowledges it knows about. Secondly, it does not include full bulk transformations (except in the case of going between execution encoding and wide execution encoding). All in all, it’s an exceedingly disappointing offering, as shown by the tables below.

For “Single Conversions”, what’s provided by the C Standard is as follows:

from ↓ / to →   mb   wc   c8   c16  c32
mb              ❌   ✅   ✅   ✅   ✅
wc              ✅   ❌   ❌   ❌   ❌
c8              ✅   ❌   ❌   ❌   ❌
c16             ✅   ❌   ❌   ❌   ❌
c32             ✅   ❌   ❌   ❌   ❌

For “Bulk Conversions”, what’s provided by the C Standard is as follows:

from ↓ / to →   mbs  wcs  c8s  c16s c32s
mbs             ❌   ✅   ❌   ❌   ❌
wcs             ✅   ❌   ❌   ❌   ❌
c8s             ❌   ❌   ❌   ❌   ❌
c16s            ❌   ❌   ❌   ❌   ❌
c32s            ❌   ❌   ❌   ❌   ❌

As with the other table, the “✅” is for that conversion sequence being supported, and the “❌” is for no support. As you can see from all the “❌” in the above table, we have effectively missed out on a ton of functionality needed to go to and from Unicode encodings. C only provides bulk conversion functions for the “mbs”/”wcs” series of functions, meaning you can kiss any SIMD or other bulk-processing optimizations goodbye for just about every other kind of conversion in C’s API, including any UTF-8/16/32 conversions. Also note that C23 and C++20/23 had additional burdens to fix:

  • u"" and U"" string literals did not have to be UTF-16 or UTF-32 encoded;
  • c16 and c32 did not have to actually mean UTF-16 or UTF-32 for the execution encoding;
  • the C and C++ Committee believed that mb could serve as “UTF-8” through the locale, which is why they left c8rtomb and mbrtoc8 out of the picture entirely until it was fixed up in C23 by Tom Honermann.

These are no longer problems thanks to C++’s Study Group 16, with papers from R.M. Fernandes, Tom Honermann, and myself. If you’ve read my previous blog posts, I went into detail about how the C and C++ implementations could simply define the standard macros __STDC_UTF_32__ and __STDC_UTF_16__ to 0. That is, an implementation could give you a big fat middle finger with respect to what encoding was being used by the mbrtoc16/32 functions, and also not tell you what is in the u"foo" and U"foo" string literals.

This was an allowance that we worked hard to nuke out of existence. It was imperative that we did not allow yet another escalation of char16_t/char32_t and friends ending up in the same horrible situation as wchar_t where it’s entirely platform (and possibly run time) dependent what encoding is used for those functions. As mentioned in the previously-linked blog post talking about C23 improvements, we were lucky that nobody was doing the “wrong” thing with it and always provided UTF-32 and UTF-16. This made it easy to hammer the behavior in all current and future revisions of C and C++. This, of course, does not answer why wchar_t is actually something to fear, and why we didn’t want char16_t and char32_t to become either of those.

So let’s talk about why wchar_t is literally The Devil.

C and wchar_t

This is a clause that currently haunts C, and is — hopefully — on its way out the door for the C++ Standard in C++23 or C++26. But, the wording in the C Standard is pretty straightforward (emphasis mine):

wide character

value representable by an object of type wchar_t, capable of representing any character in the current locale

— §3.7.3 Definitions, “wide character”, N3054

This one definition means that if you have an input into mbrtowc that needs to output more than one (1) wchar_t for the desired output, there is no appreciable way to describe that in the standard C API. This is because there is no reserved return code for mbrtowc to describe needing to serialize a second wchar_t into the wchar_t* restrict pwc. Furthermore, despite being a pointer type, pwc expects only a single wchar_t to be output into it. Changing the definition in future standards to allow for 2 or more wchar_t’s to be written to pwc is a recipe for overwriting the stack for code that updates its standard library but does not re-compile the application to use a larger output buffer. Taking this kind of action would ultimately end up breaking applications in horrific ways (A.K.A, an ABI Break), so it is fundamentally unfixable in Standard C.
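The "no reserved return code" point is easy to verify against mbrtowc's contract. Below is a sketch enumerating its complete return-value vocabulary (C11 §7.29.6.3.2), plus a tiny wrapper of my own (wide_units_written is not a standard function) that makes the consequence explicit: the number of wchar_t written per call can only ever be 0 or 1.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <wchar.h>

// mbrtowc's entire return-value vocabulary -- every value is spoken for, so
// there is nothing left over to mean "this input needs a SECOND wchar_t":
//
//   0            the null wide character was decoded
//   1 ... n      that many bytes were consumed; exactly ONE wchar_t written
//   (size_t)-2   incomplete (but so-far-valid) input; nothing written
//   (size_t)-1   encoding error (errno == EILSEQ); nothing written
size_t wide_units_written(const char* s, size_t n, mbstate_t* state) {
	wchar_t ignored;
	size_t result = mbrtowc(&ignored, s, n, state);
	if (result == (size_t)-1 || result == (size_t)-2)
		return 0; // error or incomplete: nothing was written
	return 1; // success (including the null character): always exactly one
}
```

There is no path through this function that yields 2, which is exactly why a Big5-HKSCS-style "one input unit, two output code points" conversion cannot be expressed.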

This is why encodings like Big5-HKSCS cannot be used in Standard C. Despite libraries advertising support for them like glibc and its associated locale machinery, they return non-standard and unexpected values to cope with inputs that need to write out two UTF-32 code points for a single indivisible unit of input. Most applications absolutely cannot cope with these return values, and so they start just outputting straight up garbage values as they don’t know how to bank up the characters and perform reads anymore, let alone do writes. It’s doubly-fun when others get to see it in real-time, too:

oh wow, even better: glibc goes absolutely fucking apeshit (returns 0 for each mbrtowc() after the initial one that eats 2 bytes; herein wc modified to write the resulting character)

A screenshot of a terminal in the directory `~/code/voreutils`. It is executing the following bash commands, one after another, and displaying their output `printf '\x88\x62\x88\x56\x48' | LOCPATH=/tmp/loc LC_ALL=zh_TW.Big5-HKSCS out/cmd/wc -m`. It shows a newline-separated list of output code points in hexadecimal without an "0x" prefix and as 8 numbers, which read, sequentially, `000000ca`, `00000304`, `00000304`, `00000304`, `00000304`. On the next line it then says `5`, indicating there were 5 output code points. Further usage shows the input string to `printf` being steadily truncated, showing increasingly silly outputs, including repeated `00000304` that show the output is effectively a failure.

наб, July 9th, 2022

This same reasoning applies to distributions that attempt to use UTF-16 as their wchar_t encoding. Similar to how Big5-HKSCS requires two (2) UTF-32 code points for some inputs, large swaths of the UTF-16 encoded characters use a double code unit sequence. Despite platforms like e.g. Microsoft Windows having the demonstrated ability to produce wchar_t strings that are UTF-16 based, the standard library must explicitly use an encoding that is called “UCS-2”. If a string is input that requires 2 UTF-16 code units — a leading surrogate code unit and its trailing counterpart surrogate code unit — there is no expressible way to output both in C. All that happens is that, even if the implementation recognizes an input sequence that generates 2 UTF-16 wchar_t code units, it will get chopped in half and the library reports “yep, all good”. Microsoft is far from the only company in this boat: IBM cannot fix many of their machines, same as Oracle and a slew of others that are ABI-locked into UTF-16 because they tried to adopt Unicode 1.0 faster than everyone else and got screwed over by the Unicode Consortium’s crappy UCS-2 version 1.0 design.

This, of course, does not even address that wchar_t on specific platforms does not have to be either UTF-16 or UTF-32, and thanks to some weasel-wording in the C Standard it can be impossible to detect if even your string literals are appropriately UTF-32, let alone UTF-16. Specifically, the predefined macro __STDC_ISO_10646__ can actually be turned off by a compiler because, as far as the C Committee is concerned, it is a reflection of a run time property (whether or not mbrtowc can handle UTF-32 or UTF-16, for example), which is decided by locale (yes, wchar_t can depend on locale, like it does on several IBM and *BSD-based machines). Thusly, as __STDC_ISO_10646__ is a reflection of a run time property, it becomes technically impossible to define before-hand, at compile time, in the compiler.

So, the easiest answer — even if your compiler knows it encodes L"foo" strings as 32-bit wchar_t with UTF-32 code points — is to just advertise its value as 0. It’s 100% technically correct to do so, and that’s exactly what compilers like Clang do. GCC would be in a similar boat as well, but they cut a backdoor implementation deal with numerous platforms. A header called stdc-predef.h is looked up at the start of compilation and contains a definition determining whether the platform’s wchar_t encoding may be considered UTF-32, among other such configuration parameters. If so, GCC defines __STDC_ISO_10646__ appropriately. Clang doesn’t want to deal with stdc-predef.h, or lock in any of the GCC-specific ways of doing things in this avenue too much, so they just shrug their shoulders and set it to 0.

I could not fix this problem exactly, myself. It was incredibly frustrating, but ultimately I did get something for a few implementations. In coordination with Corentin Jabot’s now-accepted P1885, I provided named macro string literals or numeric identifiers to identify the encoding of a string literal1. This allows a person to identify (at least for GCC, Clang, and (now) MSVC) the encoding they use with some degree of reliability and accuracy. The mechanism through which I implemented this and suggested it is entirely compiler-dependent, so it’s not as if other frontends for C or C++ will do this. I hope they’ll follow through and not continue to leave their users out to dry. For C++26 and beyond, Corentin Jabot’s paper will be enough to solve things on the C++ side. C is still left in the dark, but that’s just how it is all the time anyways these days so it’s not like C developers will be any less sad than when they started.

C and “multibyte” const char* Encodings

As mentioned briefly before, the C and C++ Committee believed that the Execution Encoding could just simply be made to be UTF-8. This was back when people still had faith in locales (an attitude still depressingly available in today’s ecosystem, but in a much more damaging and sinister form to be talked about later). In particular, there are no Unicode conversions except those that go through the opaque, implementation-defined Execution Encoding. For example, if you wanted to go from the Wide Execution Encoding (const wchar_t*) to UTF-8, you cannot simply convert directly from a const wchar_t* wide_str string — whatever encoding it may be — to UTF-8. You have to:

  • set up an intermediate const char temp[MB_LEN_MAX]; temporary holder;
  • call wcrtomb(temp, *wide_str, …);
  • feed the data from temp into mbrtoc8(…, temp, …);
  • loop over the wide_str string until you are out of input;
  • loop over any leftover intermediate input and write it out; and,
  • drain any leftover state-held data by checking mbsinit(state) (if using the r-based restartable functions).
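The steps above can be sketched as a loop. One hedge: the UTF-8 endpoint would use mbrtoc8, but that function is C23 and not yet everywhere, so this sketch (the function name wide_to_utf32 is mine) pivots to UTF-32 via the C11 mbrtoc32 instead — the shape of the pivot is identical. Error handling is simplified and the final mbsinit drain step is omitted.

```c
#include <assert.h>
#include <limits.h>
#include <locale.h>
#include <stddef.h>
#include <uchar.h>
#include <wchar.h>

// Wide execution encoding -> execution encoding -> UTF-32, pivoting every
// character through an intermediate multibyte buffer. Returns the number of
// char32_t written, or (size_t)-1 on any failure.
size_t wide_to_utf32(char32_t* out, size_t out_count, const wchar_t* wide_str) {
	mbstate_t to_mb_state = {0};
	mbstate_t to_c32_state = {0};
	size_t written = 0;
	for (; *wide_str != L'\0'; ++wide_str) {
		// Step 1: one wide character into the intermediate multibyte holder.
		char temp[MB_LEN_MAX];
		size_t mb_len = wcrtomb(temp, *wide_str, &to_mb_state);
		if (mb_len == (size_t)-1)
			return (size_t)-1; // not representable in the execution encoding
		// Step 2: drain the intermediate holder into UTF-32.
		const char* p = temp;
		while (mb_len > 0) {
			if (written == out_count)
				return (size_t)-1; // out of output room
			size_t consumed = mbrtoc32(&out[written], p, mb_len, &to_c32_state);
			if (consumed == (size_t)-1 || consumed == (size_t)-2)
				return (size_t)-1; // encoding error / split character
			written += 1;
			if (consumed == (size_t)-3)
				continue; // emitted from pending state; no bytes consumed
			if (consumed == 0)
				consumed = 1; // emitted U+0000; one byte consumed
			p += consumed;
			mb_len -= consumed;
		}
	}
	return written;
}
```

Note how the conversion is hostage to whatever the execution encoding happens to be: if a wide character has no representation in the current locale's multibyte encoding, wcrtomb fails and the whole conversion dies, even though both endpoints are perfectly good Unicode encodings.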

The first and most glaring problem is: what happens if the Execution Encoding is not Unicode? It’s a frightfully common case, and as much as the Linux playboys love to shill for their platform and the “everything is UTF-8 by default” perspective, they’re not being honest with you or really anyone else on the globe. For example, on a freshly installed WSL Ubuntu LTS, with sudo apt-get update and sudo apt-get dist-upgrade freshly run, when I write a C or C++ program to query what the locale is with setlocale(LC_ALL, NULL) and compile that program with Clang 15 with as advanced of a glibc/libstdc++ as I can get, this is what the printout reads:

=== Encoding Names ===
Literal Encoding: UTF-8
Wide Literal Encoding: UTF-32
Execution Encoding: ANSI_X3.4-1968
Wide Execution Encoding: UTF-32

If you look up what “ANSI_X3.4-1968” means, you’ll find that it’s the most obnoxious and fancy way to spell a particularly old encoding. That is to say, my default locale when I ask and use it in C or C++ — on my brand new Ubuntu 20.04 Focal LTS server, achieved from just pressing “ok” to all the setup options, installing build essentials, and then going out of my way to get the most advanced Clang I can and combine it with the most up-to-date glibc and libstdc++ I can —

is ASCII.

Not UTF-8. Not Latin-1!

Just ASCII.

Post-locale, “const char* is always UTF-8”, “UTF-8 is the only encoding you’ll need” world, eh? 🙄

Windows fares no better, pulling out the generic default locale associated with my typical location since the computer’s install. This means that if I decide today is a good day to transcode between UTF-16 and UTF-8 the “standard” way, everything that is not ASCII will simply be mangled, errored on, or destroyed. I have to adjust my tests when I’m using code paths that go through standard C or C++ paths, because apparently “Hárold” is too hardcore for Ubuntu 22.04 LTS and glibc to handle. I have long since had to teach not only myself, but others how to escape the non-UTF-8 hell on all kinds of machines. For a Windows example, someone sent me a screenshot of a bit of code whose comments looked very much like it was mojibake’d over Telegram:

A screenshot from telegram showing a C++ class definition that contains a series of comments with the text COMPLETELY mangled and unreadable, just a dozen of them line after line made of strange symbols and complete gibberish.

Visual Studio was, of course, doing typical Microsoft-text-editor-based Visual Studio things here. It was clear what went down, so I gave them some advice:

A screenshot of a telegram conversation. The text is all from a person with a rat avatar, named "Jö Brk": "you're dealing with 1 of 2 encodings." "1 - Windows 1251. Cyrillic [sic] encoding, used in Russia." "2 - UTF-8" "It's treating it as the wrong encoding. Chances are you need to go to "open file as", specifically, and ask it to open as UTF-8." "If that doesn't work, try Windows 1251." "When you're done with either, try re-saving the file as UTF-8, after you find the right encoding." "I'm putting money on it being Windows-1251".

And, ‘lo and behold:

A screenshot of a telegram conversation, continued after the first. From anonymous: "1251 works". Response from Jö Brk: "There ya go."

Of course, some questions arise. One might be “How did you know it was Windows 1251?”. The answer is that I spent a little bit of time in The Mines™ using alternative locales on my Windows machine — Japanese, Chinese, German, and Russian — and got to experience first-hand how things got pretty messed up by an overwhelming high number of programs. And that’s just the tip of the iceberg: Windows 1251 is the most persistent encoding for Cyrillic data into/out of Central & Eastern Europe, as well as Far North Asia. There’s literally an entire Wiki that contains common text sequences and their Mojibake forms when incorrectly interpreted as other forms of encodings, and most Cyrillic users are so used to being fucked over by computing in general that they memorized the various UTF and locale-based mojibake results, able to read pure mangled text and translate that in-real-time to what it actually means in their head. (I can’t do that: I have to look stuff up.) It’s just absurdly common to be dealing with this:

Me: “Why is the program not working?”
[looks at error logs]
Me: “Aha.”

A screenshot of Ólafur Waage's log file, with a horrific mangling of what is supposed to be the path "C:\Users\Ólafur Waage\AppData" that instead mangles the "Ó" to instead look like: Ã and quotation mark.

Ólafur Waage, May 22nd, 2023

Even the file name for the above embedded image had to be changed from ólafur-tweet.png to waage-tweet.png, because Ruby — when your Windows and Ruby is not “properly configured” (????) — will encounter that name from Jekyll, then proceed to absolutely both crap and piss the bed about it by trying to use the C standard-based sysopen/rb_sysopen on it. By default, that will use the locale-based file APIs on Windows, rather than utilizing the over 2-decade old W-based Windows APIs to open files. It’s extraordinary that despite some 20+ years of Unicode, almost every programming language, operating system, core library, or similar just straight up does not give a single damn about default Unicode support in any meaningful way! (Thank God at least Python tries to do something and gets pretty far with its myriad of approaches.)

There are other ways to transition your application to UTF-8 on Windows, even if you might receive Windows 1251 data or something else. Some folks achieve it by drilling Application Manifests into their executables. But that only works for applications; ztd.text and ztd.cuneicode are libraries. How the hell am I supposed to Unicode-poison an application that consumes my library? The real answer is that there is still no actual solution, and so I spend my time telling others about this crappy world when C and C++ programs inevitably destroy people’s data. But, there is one Nuclear Option you can deploy as a Windows user, just to get UTF-8 by-default as the default codepage for C and C++ applications:

A screenshot of the Windows 10 Settings screen, showing a sequence of windows eventually leading to the hidden Region Settings so that the check box for "Beta: Use Unicode UTF-8 for worldwide language support" can be checked off.

Yep, the option to turn on UTF-8 by default is buried underneath the new Settings screen, under the “additional clocks” Legacy Settings window on the first tab, into the “Region” Legacy Settings window on the second tab (“Administrative”), and then you need to click the “Change system locale” button, check a box, and reset your computer.

But sure, after you do all of that, you get to live in a post-locale world2. 🙃

And It Gets Worse

Because of course it gets worse. The functions I listed previously all have an r in the middle of their names; this is an indicator that these functions take an mbstate_t* parameter. This means that the state used for the conversion sequence is not taken from its alternative location. The alternative location is, of course, implementation-defined when you are not using the r-styled functions.

This alternative mbstate_t object might be a static storage duration object maintained by the implementation. It may be thread_local, it may not, and whether or not it is thread safe, there is still the distinct horribleness that it is an opaquely shared object. So even if the implementation makes it thread-safe and synchronizes access to it (kiss your performance good-bye!), if, at any point, someone uses the non-r versions of the above standard C functions, any subsequent non-r functions downstream of them have their state changed out from underneath them. Somehow, our systems programming language adopted scripting-language style behavior, where everything is connected to everything else in a jumble of hidden and/or global state, grabbing variables and functionality from wherever and just loading it up willy-nilly. This is, of course, dependable and rational behavior that can and will last for a long time and absolutely not cause severe problems down the road. It definitely won’t lead to ranting screeds3 from developers who have to deal with it.
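The hidden-state versus owned-state split is easy to see side by side. In this sketch (helper names are mine; "C" locale assumed), passing a null mbstate_t* is what selects the library-internal shared state that the paragraph above warns about.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <wchar.h>

// Passing NULL for the mbstate_t* makes mbrtowc fall back to an internal,
// library-owned state object -- one that every other NULL-passing caller in
// the process shares with us.
wchar_t decode_with_hidden_state(const char* s, size_t n) {
	wchar_t wc = L'\0';
	size_t consumed = mbrtowc(&wc, s, n, NULL);
	return (consumed != (size_t)-1 && consumed != (size_t)-2) ? wc : L'\0';
}

// An explicit mbstate_t lives on our stack: reentrant, private, and immune
// to some other library call yanking the shift state out from under us.
wchar_t decode_with_owned_state(const char* s, size_t n) {
	wchar_t wc = L'\0';
	mbstate_t state = {0};
	size_t consumed = mbrtowc(&wc, s, n, &state);
	return (consumed != (size_t)-1 && consumed != (size_t)-2) ? wc : L'\0';
}
```

Both calls behave identically for this trivial input; the difference only bites in stateful encodings, where the hidden object can be left mid-sequence by a completely unrelated caller.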

Of course, even using the r functions still leaves the need to go through the multibyte character set. Even if you pass in your own mbstate_t object, you still have to consult with the (global) locale. If at any point someone calls setlocale(LC_ALL, "fck.U"); you become liable to deal with that change in further downstream function calls. Helpfully, the C standard manifests this as unspecified behavior, even if we are storing our own state in an mbstate_t! If one function call starts in one locale with one associated encoding, but ends up in another locale with a different associated encoding during the next function call, well. Eat shit, I guess! This is because mbstate_t, despite being the “state” parameter, is still beholden to the locale when the function call was made and the mbstate_t object is not meant to store any data about the locale for the function call! Most likely you end up with either hard crashes or strange, undecipherable output even for what were supposed to be legal input sequences, because the locale has changed in a way that is invisible to both the function call and the shared state between function calls with mbstate_t.

So, even if you try your hardest, use the restartable functions, and track your data meticulously with mbstate_t, libraries in the stack that may set locale will blow up everyone downstream of them, and applications which set locale may end up putting their upstream dependencies in an untested state of flux that they are entirely unprepared for. Of course, nobody sees fit to address this: there’s no reasonable locale_t object that can be passed into any of these functions, no way of removing the shadowy specter of mutable global state from the core underlying functionality of our C libraries. You either use it and deal with getting punched in the face at seemingly random points in time, or you don’t and rewrite large swaths of your standard library distribution.

All in all, just one footgun after another when it comes to using Standard C in any remotely scalable fashion. It is no surprise that the advice for these functions about their use is “DO. NOT.”, which really inspires confidence that this is the base that every serious computation engine in the world builds on for their low-level systems programming. This, of course, leaves only the next contender to consider: standard C++.

Standard C++

When I originally discussed Standard C++, I approached it from its flagship API — std::wstring_convert<…> — and all the problems therein. But, there was a layer beneath that I had filed away as “trash”, but that could still be used to get around many of std::wstring_convert<…>’s glaring issues. For example, wstring_convert::to_bytes always returns a new std::string-alike by-value, meaning that there’s no room to pre-allocate or pre-reserve data (giving it the worst of the allocation penalty and any pessimistic growth sizing as the string is converted). It also always assumes that the “output” type is char-based, while the input type is Elem-based. Coupled with the by-value, allocated return types, it makes it impossible to save on space or time, or make it interoperable with a wide variety of containers (e.g., TArray<…> from Unreal Engine or boost::static_vector), requiring an additional copy to put it into something as simple as a std::vector.

But, it would be unfair to judge the higher-level — if trashy — convenience API when there is a lower-level one present in virtual-based codecvt classes. These are member functions, and so the public-facing API and the protected, derive-ready API are both shown below:

template <typename InternalCharType,
	typename ExternalCharType,
	typename StateType>
class codecvt {
public:
	std::codecvt_base::result out( StateType& state,
		const InternalCharType* from,
		const InternalCharType* from_end,
		const InternalCharType*& from_next,
		ExternalCharType* to,
		ExternalCharType* to_end,
		ExternalCharType*& to_next ) const;

	std::codecvt_base::result in( StateType& state,
		const ExternalCharType* from,
		const ExternalCharType* from_end,
		const ExternalCharType*& from_next,
		InternalCharType* to,
		InternalCharType* to_end,
		InternalCharType*& to_next ) const;

	// …

protected:
	virtual std::codecvt_base::result do_out( StateType& state,
		const InternalCharType* from,
		const InternalCharType* from_end,
		const InternalCharType*& from_next,
		ExternalCharType* to,
		ExternalCharType* to_end,
		ExternalCharType*& to_next ) const;

	virtual std::codecvt_base::result do_in( StateType& state,
		const ExternalCharType* from,
		const ExternalCharType* from_end,
		const ExternalCharType*& from_next,
		InternalCharType* to,
		InternalCharType* to_end,
		InternalCharType*& to_next ) const;

	// …
};

Now, this template is not supposed to be anything and everything, which is why it additionally has virtual functions on it. And, despite the poorness of the std::wstring_convert<…> APIs, we can immediately see the enormous benefits of the API here, even if it is a little verbose:

  • it cares about having both a beginning and an end;
  • it contains a third pointer-by-reference (rather than using a double-pointer) to allow someone to know where it stopped in its conversion sequences; and,
  • it takes a StateType, allowing it to work over a wide variety of potential encodings.

This is an impressive and refreshing departure from the usual dreck, as far as the API is concerned. As an early adopter of codecvt and wstring_convert, however, I absolutely suffered its suboptimal API and the poor implementations. Whether it was Microsoft missing wchar_t specializations that caused wstring_convert to break until they fixed it, or MinGW’s patched library deciding today was a good day to always swap the bytes of the input string and produce Big Endian data no matter what parameters were used, it was always a slog and a struggle to get the API to do what it was supposed to do.

But the thought was there. You can see how this API could be the one that delivered C++ out of the “this is garbage nonsense” mines. Maybe, it could even be the API for C++ that would bring us all to the promised land over C. They even had classes prepared to do just that, committing to UTF-8, UTF-16, and UTF-32 while C was still struggling to get past “char* is always (sometimes) UTF-8, just use the multibyte encoding for that”:

template <typename Elem,
	unsigned long Maxcode = 0x10ffff,
	std::codecvt_mode Mode = (std::codecvt_mode)0
> class codecvt_utf8 : public std::codecvt<Elem, char, std::mbstate_t>;

template <typename Elem,
	unsigned long Maxcode = 0x10ffff,
	std::codecvt_mode Mode = (std::codecvt_mode)0
> class codecvt_utf16 : public std::codecvt<Elem, char, std::mbstate_t>;

template <typename Elem,
	unsigned long Maxcode = 0x10ffff,
	std::codecvt_mode Mode = (std::codecvt_mode)0
> class codecvt_utf8_utf16 : public std::codecvt<Elem, char, std::mbstate_t>;

UTF-32 is supported by passing in char32_t as the Elem element type. The codecvt API was byte-oriented, meaning it was made for serialization. That meant it would do little-endian or big-endian serialization by default, and you had to pass in std::codecvt_mode::little_endian to get it to behave. Similarly, it sometimes would generate or consume byte order markers if you passed in std::codecvt_mode::consume_header or std::codecvt_mode::generate_header (but it only generates a header for UTF-16 or UTF-8, NOT for UTF-32 since UTF-32 was considered the “internal” character type for these and therefore not on the “serialization” side, which is what the “external” character type was designated for). It was a real shame that the implementations were fairly lackluster when it first came out because this sounds like (almost) everything you could want. By virtue of being a virtual-based interface, you could also add your own encodings to this, which therefore made it both compile-time and run-time extensible. Finally, it also contained error codes that went beyond just “yes the conversion worked” and “no it didn’t lol xd”, with the std::codecvt_base::result enumeration:

enum result {
	ok,
	partial,
	error,
	noconv
};

whose values mean:

  • result::ok — conversion was completed with no error;
  • result::partial — not all source characters were converted;
  • result::error — encountered an invalid character; and,
  • result::noconv — no conversion required, input and output types are the same.

This is almost identical to ztd.text’s ztd::text::encoding_error type, with the caveat that ztd.text also accounts for the “all source characters could be converted, but the write out was partial” case, while gluing the result::noconv into its version of result::ok instead. This small difference, however, does contribute to one problem. And that one problem does, eventually, fully cripple the API.

The “1:N” and “N:1” Rule

Remember how this interface is tied to the idea of “internal” and “external” characters, and the normal “wide string” versus the internal “byte string”? This is where something sinister leaks into the API, by way of a condition imposed by the C++ standard. Despite managing to free itself from wchar_t issues by way of having an API that could allow for multiple input and multiple outputs, it reintroduces them by applying a new restriction focused exclusively on basic_filebuf-related containers.

A codecvt facet that is used by basic_­filebuf ([file.streams]) shall have the property that if

do_out(state, from, from_end, from_next, to, to_end, to_next)

would return ok, where from != from_­end, then

do_out(state, from, from + 1, from_next, to, to_end, to_next)

shall also return ok, and that if

do_in(state, from, from_end, from_next, to, to_end, to_next)

would return ok, where to != to_­end, then

do_in(state, from, from_end, from_next, to, to + 1, to_next)

shall also return ok.252

— Draft C++ Standard, §30.4.2.5.3 [locale.codecvt.virtuals] ¶4

And the footnote reads:

252) Informally, this means that basic_­filebuf assumes that the mappings from internal to external characters is 1 to N: that a codecvt facet that is used by basic_­filebuf can translate characters one internal character at a time.

— Draft C++ Standard, §30.4.2.5.3 [locale.codecvt.virtuals] ¶4 Footnote 252

In reality, what this means is that, when dealing with basic_filebuf as the thing sitting on top of the do_in/do_out conversions, you must not only be able to convert 1 element at a time, but also avoid returning result::partial and just say “hey, chief, everything looks ok to me!”. This means that if someone, say, hands you an incomplete stream from inside the file, you’re supposed to be able to read only 1 byte of a 4-byte UTF-8 character, say “hey, this is a good, complete character — return std::codecvt_base::ok!!”, and then let the file proceed even if it never provides you any other data.

It’s hard to find a proper reason for why this is allowed. If you are always allowed to feed in exactly 1 internal character, and it is always expected to form a complete “N” external/output characters, then insofar as basic_filebuf is concerned it’s allowed to completely mangle your data. This means that any encoding where it does not produce 1:N data (for example, produces 2:N or really anything with M:N where M ≠ 1) is completely screwed. Were you writing to a file? Well, good luck with that. Network share? Ain’t that a shame! Manage to open a file descriptor for a pipe and it’s wrapped in a basic_filebuf? Sucks to be you! Everything good about the C++ APIs gets flushed right down the toilet, all because they wanted to support the — ostensibly weird-as-hell — requirement that you can read or write things exactly 1 character at a time. Wouldn’t you be surprised that some implementation, somewhere, used exactly one character internally as a buffer? And, if we were required to ask for more than that, it would be an ABI break to fix it? (If this is a surprise to you, you should go back and read this specific section in this post about ABI breaks and how they ruin all of us.)

Of course, they are not really supporting anything, because in order to avoid having to cache the return value from in or out on any derived std::codecvt-derived class, it just means you can be fed a completely bogus stream and it’s just considered… okay. That’s it. Nothing you or anyone else can do about such a situation: you get nothing but suffering on your plate for this one.

An increasingly nonsensical part of how this specification works is that there’s no real way for the std::codecvt class to know that it’s being called up-stream by a std::basic_filebuf, so either every derived std::codecvt object has to be okay with artificial truncation, or the developers of the std::basic_filebuf have to simply throw away error codes they are not interested in and ignore any incomplete / broken sequences. It seems like most standard libraries choose the latter, which results in, effectively, all encoding procedures for all files being broken in the same way wchar_t is broken in C, but for literally every encoding type, since a derived class has no way to figure out whether it is being driven by a basic_filebuf or not.

Even more bizarrely, because of the failures of the specification, std::codecvt_utf16/std::codecvt_utf8 are, in general, meant to handle UCS-2 and not UTF-164. (UCS-2 is the Unicode Specification v1.0 “wide character” set that handles 65535 code points maximum, which Unicode has already surpassed quite some time ago.) Nevertheless, most (all?) implementations seem to defy the standard, further making the class’s stated purpose in code a lot more worthless. There are also additional quirks that trigger undefined behavior when using this type for text-based or binary-based file writing. For example, under the deprecated <codecvt> description for codecvt_utf16, the standard says in a bullet point

The multibyte sequences may be written only as a binary file. Attempting to write to a text file produces undefined behavior.

Which, I mean. … What? Seriously? My encoding things do not work well with my text file writing, the one place it’s supposed to be super useful in? Come on! At this point, there is simply an enduring horror that leads to a bleak — if not fully disturbed — fascination about the whole situation.

Fumbling the Bag

If it were not for all of these truly whack-a-doodle requirements, we would likely have no problems. But it’s too late: any API that uses virtual functions is calcified for eternity. Its interfaces and guarantees can never be changed, because changing them and their dependents means breaking very strong binary guarantees made about usage and expectations. I was truly excited to see std::codecvt’s interface surpass its menial std::wstring_convert counterpart in ways that actually made it a genuinely forward-thinking API. But. It ultimately ends up going in the trash like every other Standard API out there. So close,

yet so far!

The rest of the API is the usual lack of thought put into an API to optimize for speed cases. No way to pass nullptr as a marker to the to/from_end pointers to say “I genuinely don’t care, write like the wind”, though on certain standard library implementations you could probably just get away with it5. There’s also no way to just pass in nullptr for the entire to_* sets of pointers to say “I just want you to give me the count back”; and indeed, there’s no way to compute such a count with the triple-input-pointer, triple-output-pointer API. This is why the libiconv-style of pointer-to-pointer, pointer-to-size API ends up superior: it’s able to capture all use cases without presenting problematic internal API choices or external user use choices (even if libiconv itself does not live up to its API’s potential).

This is, ostensibly, part of why the performance of the std::wstring_convert class of APIs sucks as well. They ultimately cannot perform a pre-count and then a reservation (doing a basic from_next - from check to see if the input is large enough to justify a .reserve(…)/.resize() call) before looping and push_back/insert-ing into the target string using the .in and .out APIs on std::codecvt. You just have to make an over-estimate of the size and pre-reserve, or skip that and serialize into a temporary buffer before dumping into the output. This is the implementation choice that e.g. MSVC makes, converting some ~16 characters at a time before vomiting them into the target string in a loop until std::codecvt::in/out exhausts all the input. You can imagine that encoding at most 16 characters per loop iteration for a string several megabytes long is going to be an enormous no-no for many performance use cases, so that tends to get in the way a little bit.

There is, of course, one other bit about the whole C++ API that once again comes to rear its ugly head in our faces.

Old Habits Die Hard

There is also another significant problem with the usage of std::codecvt for its job: it relies on a literal locale object / locale facet to get its job done. Spinning up a std::codecvt can be expensive due to its interaction with std::locale and the necessity of being attached to a locale. It is likely intended that these classes can be used standalone, without attaching them to a locale at all, as their destructors (unlike those of other locale-based facets) were made public and callable rather than hidden/private. This means they can be declared on the stack and used directly, at least.

This was noticeably bad back when I was still using std::codecvt and std::wstring_convert myself in sol2. Creating a fresh object to do a conversion resulted in horrible performance characteristics for that convert-to-UTF-8 routine relying on standard library facilities. These days, I have been doing a hand-written, utterly naïve UTF-8 conversion, which has stupidly better performance characteristics simply because it’s not dragging along whatever baggage comes with locales, facets, wstring_convert, codecvt, and all of their ilk. Which is just so deeply and bewilderingly frustrating: I can get a thumbs up from users by doing the most head-empty, braindead thing imaginable, and it’s just so much better than the default actions that come with the standard library.

Constantly, we in the Committee are annoyed by game development programmers (and I am, too!), or entirely dismissive of many of their concerns. But it is ENTIRELY plausible to see how they can start writing off entire standard libraries when, over and over again, you can do the world’s dumbest implementation of something and it kicks the standard library’s ass for inputs small and large. This does not extrapolate to other areas, but it only takes a handful of bad experiences — especially back 10 or 20 years ago when library implementations were so much worse — to convince someone not to waste their time investigating and benchmarking when it is so much easier on the time-financials tradeoff to just assume it is godawful trash and write something quick ‘n’ dirty that was going to perform better anyways.

What a time to be alive trying to ask people to use Standard C and C++, when they can throw a junior developer at a problem and get better performance and compilation times for a very normal and standard thing like converting to UTF-8.

I certainly don’t endorse the attitude of taking 20 year old perceptions and applying them to vastly improved modern infrastructure that has demonstrably changed, but it doesn’t take a rocket scientist to see how we ended up on this particular horizon of understanding.

But, That Done and Dusts That

C and C++ are now Officially Critiqued™ and hopefully I don’t have to have anyone crawl out of the woodwork to talk about X or Y thing again and how I’m not being fair enough by just calling it outright garbage. Now, all of us should thoroughly understand why it’s garbage and how unusable it is.

Nevertheless, if these APIs are garbage, how do we build our own good one? Clearly, if I have all of this evidence and all of these opinions, assuredly I’ve been able to make a better API? So, let’s try to dig in on that. I already figured out the C++ API in ztd.text and have written about it extensively, so let’s cook up ztd.cuneicode (or just cuneicode), from the ground up, with a good interface.

A Wonderful API

For a C function, we need to have 4 capabilities, as outlined by the table above.

  • Single conversions, to transcode one indivisible unit of information at a time.
  • Bulk conversions, to transcode a large buffer as fast as possible (a speed optimization over single conversion with the same properties).
  • Validation, to check whether an input is valid and can be converted to the output encoding (or where an error would occur in the input if some part is invalid).
  • Counting, to know how much output is needed (often with an indication of where an error would occur in the input, if the full input can’t be counted successfully).

We also know from the ztd.text blog post and design documentation, as well as the analysis from the previous blog post and the above table, that we need to provide specific information for the given capabilities:

  • how much input was consumed (even in the case of an error);
  • how much output was written (even in the case of an error);
  • that “input read” should only include how much was successfully read (e.g., stops before the error happens and should not be a partial read);
  • that “output written” should only include how much was successfully converted (e.g., points to just after the last successful serialization, and not any partial writes); and,
  • that we may need additional state associated with a given encoding to handle it properly (any specific “shift sequences” or held-onto state; we’ll talk about this more thoroughly when demonstrating the new API).

It turns out that there is already one C API that does most of what we want design-wise, even if its potential was not realized by the people who worked on it and standardized its interface in POSIX!

Borrowing Perfection

This library has the perfect interface design and principles with — as is standard with most C APIs — the worst actual execution on said design. To review, let’s take a look at the libiconv conversion interface:

size_t iconv(
	iconv_t cd, // any necessary custom information and state
	char ** inbuf, // an input buffer and how far we have progressed
	size_t * inbytesleft, // the size of the input buffer and how much is left
	char ** outbuf, // an output buffer and how far we have progressed
	size_t * outbytesleft); // the size of the output buffer and how much is left

As stated in Part 1, while the libiconv library itself will fail to utilize the interface for the purposes we will list just below, we ourselves can adapt it to do these kinds of operations:

  • normal output writing (iconv(cd, inbuf, inbytesleft, outbuf, outbytesleft));
  • unbounded output writing (iconv(cd, inbuf, inbytesleft, outbuf, nullptr));
  • output size counting (iconv(cd, inbuf, inbytesleft, nullptr, outbytesleft)); and,
  • input validation (iconv(cd, inbuf, inbytesleft, nullptr, nullptr)).

Unfortunately, what is missing most from this API is the “single” conversion handling. But, you can always create a bulk conversion by wrapping a one-off conversion, or create a one-off conversion by wrapping a bulk conversion (with severe performance implications either way). We’ll add that to the list of things to include when we hijack and re-do this API.

So, at least for a C-style API, we need 2 separate classes of functions for one-off and bulk conversion. In Standard C, they did this by having mbrtowc (without an s, to signify the one-at-a-time conversion nature) and by having mbsrtowcs (with an s, to signify a whole-string conversion). Finally, the last missing piece here is an “assume valid” conversion. We can achieve this by providing a flag or a boolean on the “state”; in the case of iconv_t cd, it would be done at the time the iconv_t cd object is generated. For the Standard C APIs, it could be baked into the mbstate_t type (though they would likely never do that, because adding it might change how big the struct is, and thus destroy ABI).

With all of this in mind, we can start a conversion effort for all of the fixed conversions. When I say “fixed”, I mean conversions from a specific encoding to another, known before we compile. These will be meant to replace the C conversions of the same style such as mbrtowc or c8rtomb, and fill in the gaps they never covered (including e.g. single vs. bulk conversions). Some of these known encodings will still be linked to runtime encodings that are based on the locale. But, rather than using them and needing to pray to the heavens that the internal multibyte C encoding is UTF-8 (like with the aforementioned wcrtomb -> mbrtoc8/16/32 style of conversions), we’ll just provide a direct conversion routine and cut out the wchar_t encoding/multibyte encoding middle man.

Static Conversion Functions for C

The function names here are going to be kind of gross, but they will be “idiomatic” standard C. We will be using the same established prefixes from the C Standard group of functions, with some slight modifications to the mb and wc ones to allow for sufficient differentiation from the existing standard ones. Plus, we will be adding a “namespace” (in C, that means just adding a prefix) of cnc_ (for “cuneicode”), as well as adding the letter n to indicate that these are explicitly the “sized” functions (much like strncpy and friends) and that we are not dealing with null terminators at all in this API. Thusly, we end up with functions that look like this:

// Single conversion
cnc_XnrtoYn(size_t* p_destination_buffer_len,
	CharY** p_maybe_destination_buffer,
	size_t* p_source_buffer_len,
	CharX** p_source_buffer,
	cnc_mcstate_t* p_state);

// Bulk conversion
cnc_XsnrtoYsn(size_t* p_destination_buffer_len,
	CharY** p_maybe_destination_buffer,
	size_t* p_source_buffer_len,
	CharX** p_source_buffer,
	cnc_mcstate_t* p_state);

As shown, the s indicates that we are processing as many elements as possible (historically, s would stand for string here). The tags that replace X and Y in the function names, and their associated CharX and CharY types, are:

Tag   Character Type          Default Associated Encoding
mc    char                    Execution (Locale) Encoding
mwc   wchar_t                 Wide Execution (Locale) Encoding
c8    char8_t/unsigned char   UTF-8
c16   char16_t                UTF-16
c32   char32_t                UTF-32

The optional encoding suffix is for the left-hand-side (from, X) encoding first, before the right-hand side (to, Y) encoding. If the encoding is the default associated encoding, then it can be left off. If it may be ambiguous which tag is referring to which optional encoding suffix, both encoding suffixes are provided. The reason we do not use mb or wc (like pre-existing C functions) is because those prefixes are tainted forever by API and ABI constraints in the C standard to refer to “bullshit multibyte encoding limited by a maximum output of MB_MAX_LEN”, and “can only ever output 1 value and is insufficient even if it is picked to be UTF-32”, respectively. The new name “mc” stands for “multi character”, and “mwc” stands for — you guessed it — “multi wide character”, to make it explicitly clear there’s multiple values that will be going into and coming out of these functions.

This means that if we want to convert from UTF-8 to UTF-16, bulk, the function to call is cnc_c8snrtoc16sn(…). Similarly, converting from the Wide Execution Encoding to UTF-32 (non-bulk) would be cnc_mwcnrtoc32n(…). There is, however, a caveat: at times, you may not be able to differentiate solely based on the encodings present, rather than the character type. In those cases, particularly for legacy encodings, the naming scheme is extended by adding an additional suffix directly specifying the encoding of one or both of the ends of the conversion. For example, a function that converts from Punycode (RFC) to UTF-32 (non-bulk) would be spelled cnc_mcnrtoc32n_punycode(…) and use char for CharX and char32_t for CharY. A function to convert specifically from SHIFT-JIS to EUC-JP (in bulk) would be spelled cnc_mcsnrtomcsn_shift_jis_euc_jp(…) and use char for both CharX and CharY. Furthermore, since people like to use char for UTF-8 despite associated troubles with char’s signedness, a function converting from UTF-8 to UTF-16 in this style would be cnc_mcsnrtoc16sn_utf8(…). The function that converts the execution encoding to char-based UTF-8 is cnc_mcsnrtomcsn_exec_utf8(…).

The names are definitely a mouthful, but it covers all of the names we could need for any given encoding pair for the functions that are named at compile-time and do not go through a system similar to libiconv. Given this naming scheme, we can stamp out all the core functions between the 5 core encodings present on C and C++-based, locale-heavy systems (UTF-8, UTF-16, UTF-32, Execution Encoding, and Wide Execution Encoding), and have room for additional functions using specific names.

Finally, there is the matter of “conversions where the input is assumed good and valid”. In ztd.text, you get this from using the ztd::text::assume_valid_handler error handler object and its associated type. Because we do not have templates and we cannot provide type-based, compile-time polymorphism without literally writing a completely new function, cnc_mcstate_t has a function that will set its “assume valid” state. The proper = {} init of cnc_mcstate_t will keep it off as normal. But you can set it explicitly using the function, which helps us cover the “input is valid” bit.

Given all of this, we can demonstrate a small usage of the API here:

#include <ztd/cuneicode.h>

#include <ztd/idk/size.h>

#include <stdio.h>
#include <stdbool.h>
#include <string.h>
#include <limits.h> // CHAR_BIT
#include <uchar.h>  // char32_t

int main() {
	const char32_t input_data[] = U"Bark Bark Bark 🐕‍🦺!";
	char output_data[ztd_c_array_size(input_data) * 4] = {};
	cnc_mcstate_t state                                = {};
	// set the "do UB shit if invalid" bit to true
	cnc_mcstate_set_assume_valid(&state, true);
	const size_t starting_input_size  = ztd_c_string_array_size(input_data);
	size_t input_size                 = starting_input_size;
	const char32_t* input             = input_data;
	const size_t starting_output_size = ztd_c_array_size(output_data);
	size_t output_size                = starting_output_size;
	char* output                      = output_data;
	cnc_mcerror err                   = cnc_c32snrtomcsn_utf8(
	                       &output_size, &output, &input_size, &input, &state);
	const bool has_err          = err != cnc_mcerr_ok;
	const size_t input_read     = starting_input_size - input_size;
	const size_t output_written = starting_output_size - output_size;
	const char* const conversion_result_title_str = (has_err
		? "Conversion failed... \xF0\x9F\x98\xAD" // UTF-8 bytes for 😭
		: "Conversion succeeded \xF0\x9F\x8E\x89"); // UTF-8 bytes for 🎉
	const size_t conversion_result_title_str_size
		= strlen(conversion_result_title_str);
	// Use fwrite to prevent conversions / locale-sensitive-probing from
	// fprintf family of functions
	fwrite(conversion_result_title_str, sizeof(*conversion_result_title_str),
		conversion_result_title_str_size, has_err ? stderr : stdout);
	fprintf(has_err ? stderr : stdout,
		"\n\tRead: %zu %zu-bit elements"
		"\n\tWrote: %zu %zu-bit elements\n",
		(size_t)(input_read), (size_t)(sizeof(*input) * CHAR_BIT),
		(size_t)(output_written), (size_t)(sizeof(*output) * CHAR_BIT));
	fprintf(stdout, "%s Conversion Result:\n", has_err ? "Partial" : "Complete");
	fwrite(output_data, sizeof(*output_data), output_written, stdout);
	// The stream is (possibly) line-buffered, so make sure an extra "\n" is written
	// out; this is actually critical for some forms of stdout/stderr mirrors. They
	// won't show the last line even if you manually call fflush(…) !
	fwrite("\n", sizeof(char), 1, stdout);
	return has_err ? 1 : 0;
}

Which (on a terminal that hasn’t lost its mind6) produces the following output:

Conversion succeeded 🎉
	Read: 19 32-bit elements
	Wrote: 27 8-bit elements
Complete Conversion Result:
Bark Bark Bark 🐕‍🦺! 

Of course, some readers may have a question about how the example is written. Two things, in particular…

fwrite? Huh??

The reason we always write out Unicode data using fwrite rather than fprintf/printf or similar is because on Microsoft Windows, the default assumption is that input strings are in the locale encoding. In order to have that data reach specific kinds of terminals, certain terminal implementations on Windows will attempt to convert from what they suspect the encoding of the application’s strings to be (e.g., from %s/%.*s) to the encoding of the terminal. In almost all cases, this assumption is wrong when you have a capable terminal (such as the new Windows Terminal, or the dozens of terminal emulators that run on Windows). The C Standard for fprintf and its %s modifier specifies no conversions, but it does not explicitly forbid them from doing this, either. They are also under no obligation to properly identify the encoding of the input that goes to the terminal, either.

For example, even if I put UTF-8 data into fprintf("%s", (const char*)u8"🐈 meow 🐱");, it can assume that the data I put in is not UTF-8 but, in fact, ISO 8859-1 or Mac Cyrillic or GBK. This is, of course, flagrantly wrong for our given example. But, it doesn’t matter: it will misinterpret that data as one kind of encoding and blindly encode it to whatever the internal Terminal encoding is (which is probably UTF-16 or some adjacent flavor thereof).

The result is that you will get a bunch of weird symbols or a bunch of empty cells in your terminal, leading to confused users and no Unicode-capable output. So, the cross-platform solution is to use fwrite specifically for data that we expect implementations like Microsoft will mangle on various terminal displays (such as in VSCode, Microsoft Terminal, or just plain ol’ cmd.exe that is updated enough and under the right settings). This bypasses any internal %s decoding that happens, and basically shoves the bytes as-is straight to the terminal. Given it is just a sequence of bytes going to the terminal, it will be decoded directly by the display functions of the terminal and the shown cells, at least for the new Windows Terminal, will show us UTF-8 output.

It’s not ideal and it makes the examples a lot less readable and tangible, but that is (unfortunately) how it is.

What is with the "foo" string literal but the e.g. \xF0\x9F\x98\xAD sequence??

Let us take yet another look at this very frustrating initialization:

	// …

	const char* const conversion_result_title_str = (has_err
		? "Conversion failed... \xF0\x9F\x98\xAD" // UTF-8 bytes for 😭
		: "Conversion succeeded \xF0\x9F\x8E\x89"); // UTF-8 bytes for 🎉

	// …

You might look at this and be confused. And, rightly, you should be: why not just use a u8"" string literal? And, with that u8"" literal, why not just use e.g. u8"Blah blah\U0001F62D" to get the crying face? Well, unfortunately, I regret to inform you that

MSVC is At It Again!

Let’s start with just using u8"" and trying to put the crying face into the string literal, directly:

	// …

	const char very_normal[] = u8"😭";

	// …

This seems like the normal way of doing things. Compile on GCC? Works fine. Compile on Clang? Also works fine enough. Compile on MSVC? Well, I hope you brought tissues. If you forget to use the /utf-8 flag, this breaks in fantastic ways. First, it will translate your crying emoji into a mangled sequence of code units, at the source level. Then, as the compiler goes forward, a bunch of really fucked up bytes that no longer correspond to the sobbing face emoji (Unicode code point U+0001F62D) will each individually be treated as its own Unicode code point. So you will get 4 code points, each one more messed up than the last, but it doesn’t stop there, because MSVC — in particular — has a wild behavior. The size of the string literal here won’t be 4 (number of mangled bytes) + 1 (null terminator) to make sizeof(very_normal) 5. No, the sizeof(very_normal) here is NINE (9)!

See, Microsoft did this funny thing where, inside of the u8"", each byte is not considered as part of a sequence. Each byte is considered its own Unicode code point, all by itself. So the 4 fucked up bytes (plus null terminator) are each treated as distinct, individual code points (and not a sequence of code units). Each of these is then expanded out to their UTF-8 representation, one at a time. Since the high bit is set on all of these, each “code point” effectively translates to a double-byte UTF-8 sequence. Now, normally, that’s like… whatever, right? We didn’t specify /utf-8, so we’re getting garbage into our string literal at some super early stage in the lexing of the source. “Aha!”, you say. “I will inject each byte, one at a time, using a \xHH sequence, where HH is a 0-255 hexadecimal character.” And you would be right on Clang. And right on GCC. And right according to the C and C++ standard. You would even be correct if it was an L"" string literal, where it would insert one code unit corresponding to that sequence. But if you type this:

	// …

	const char very_normal[] = u8"\xF0\x9F\x98\xAD";

	// …

You would not be correct on MSVC.

The above code snippet is what folks typically reach for, when they cannot guarantee /utf-8 or /source-charset:.65001 (the Microsoft Codepage name for UTF-8). “If I can just inject the bytes, I can get a u8"" string literal, typed with unsigned values, converted into my const char[] array.” This makes sense. This is what people do, to have cross-platform source code that will work everywhere, including places that were slow to adopt \U... sequences. It’s a precise storage of the bytes, and each \x fits for each character.

But it won’t work on MSVC.

The sizeof(very_normal) here, even with /utf-8 specified, is still 9. This is because, as the previous example shows, it treats each code unit here as being a full code point. These are all high-bits-set values, and thus are treated as 2-byte UTF-8 sequences, plus the null terminator. No other compiler does this. MSVC does not have this behavior even for its other string literal types; it’s just with UTF-8 that they pull this. So even if you can’t have u8"😭" in your source code — and you try to place it into the string in an entirely agnostic way that gets around bad source encoding — it will still punch you in the face. This is not standards-conforming. It was never standards-conforming, but to be doubly sure, the wording around this in both C and C++ was clarified in recent additions to make it extremely clear that this is not the right behavior.

There are open bug reports against MSVC for this. There were also bug reports against the implementation before they nuked their bug tracker and swapped to the current Visual Studio Voice. I was there, back when I was not involved in the standard and too code-young to get what was going on. Back when the libogonek author and others tried to file against MSVC about this behavior. Back when the “CTP”s were still a thing MSVC was doing.

They won’t fix this. The standard means nothing here; they just don’t give a shit. Could be because of source compatibility reasons. But even if it’s a source compatibility issue, they won’t even lock a standards conforming behavior behind a flag. /permissive- won’t fix it. /utf-8 won’t fix it. There’s no /Zc:unfuck-my-u8-literals-please flag. Nothing. The behavior will remain. It will screw up every piece of testing code you have written to test for specific patterns in your strings while you’re trying to make sure the types are correct. There is nothing you can do but resign yourself to not using u8"" anymore for those use cases.

Removing the u8 in front gets the desired result. Using const char very_normal[] = u8"\U0001F62D"; also works, except that only applies if you’re writing UTF-8 exactly. If you’re trying to set up e.g. an MUTF-8 null terminator (to work with Android/Java UTF-8 strings) inside of your string literal to do a quick operation? If you want to insert some “false” bytes in your test suite’s strings to check if your function works? …

Hah.

Stack that with the recent char8_t type changes for u8"", and it’s a regular dumpster fire on most platforms to program around for the latest C++ version.

Nevertheless!

This manages to cover the canonical conversions between most of the known encodings that come out of the C standard:

[Table: a 5×5 coverage matrix of the single-unit conversion functions between mc, mwc, c8, c16, and c32.]

[Table: the matching 5×5 coverage matrix of the bulk (string) conversion functions between mcs, mwcs, c8s, c16s, and c32s.]

Anything else has the special suffix added, but ultimately it is not incredibly satisfactory. After all, part of the wonderful magic of ztd.text and libogonek is the fact that — at compile-time — they could connect 2 encodings together. Now, there’s likely not a way to fully connect 2 encodings at compile-time in C without some of the most disgusting macros I would ever write being spawned from the deepest pits of hell. And, I would likely need an extension or two, like Blocks or Statement Expressions, to make it work out nicely so that it could be used everywhere a normal function call/expression is expected.

Nevertheless, not all is lost. I promised an interface that could automatically connect 2 disparate encodings, similar to how ztd::text::transcode(…) can give you the ability to convert between a freshly-created Shift-JIS and UTF-8 without writing that specific conversion routine yourself. This is critical functionality, because it is the step beyond what Rust libraries like encoding_rs offer, and outstrips what libraries like simdutf, utf8cpp, or Standard C could ever offer. If we do it right, it can even outstrip libiconv, where there is a fixed set of encodings defined by the owner of the libiconv implementation that cannot be extended without recompiling the library and building in new routines. ICU includes functionality to connect two disparate encodings, but the library must be modified/recompiled to include new encoding routines, even though it has the ucnv_convertEx function that takes any 2 disparate encodings and transcodes through UChars (UTF-16). Part of the promise of this article was that we could not only achieve maximum speed, but allow for an infinity of conversions within C.

So let’s build the C version of all of this.

General-Purpose Interconnected Conversions Require Genericity

The collection of cuneicode functions above is strongly typed, and the encodings on both ends are known. In most cases (save for the internal execution and wide execution encodings, where things may be a bit ambiguous to an end-user, though not to the standard library vendor), there is no need for any intermediate conversion steps. They do not need any potential intermediate storage because both ends of the transcoding operation are known. libiconv provides us with a good idea of what the input and output need to look like, but having a generic pivot is a different matter. ICU and a few other libraries have an explicit pivot source; other libraries (like encoding_rs) want you to coordinate the conversion from the disparate encoding to UTF-8 or UTF-16 and then to the destination encoding yourself (and therefore provide your own UTF-8/16 pivot). Here’s how ICU does it in its ucnv_convertEx API:

U_CAPI void ucnv_convertEx(
	UConverter *targetCnv, UConverter *sourceCnv, // converters describing the encodings
	char **target, const char *targetLimit, // destination
	const char **source, const char *sourceLimit, // source data
	UChar *pivotStart, UChar **pivotSource, UChar **pivotTarget, const UChar *pivotLimit, // ❗❗ the pivot ❗❗
	UBool reset, UBool flush, UErrorCode *pErrorCode); // error code out-parameter

The buffers have to be type-erased, which means either providing void*, aliasing-capable char*, or aliasing-capable unsigned char*. (Aliasing is when a pointer to one type is used to look at the data of a fundamentally different type; only char and unsigned char can do that, and std::byte if C++ is on the table.) After we type-erase the buffers so that we can work on a “byte” level, we then need to develop what ICU calls UConverters. Converters effectively handle converting between their desired representation (e.g., SHIFT-JIS or EUC-KR) and transport to a given neutral middle-ground encoding (such as UTF-32, UTF-16, or UTF-8). In the case of ICU, they convert to UChar objects, which are at-least 16-bit sized objects which can hold UTF-16 code units for UTF-16 encoded data. This becomes the Unicode-based anchor through which all communication happens, and why it is named the “pivot”.

Pivoting: Getting from A to B, through C

ICU is not the first library to come up with this. Featured in libiconv, libogonek, my own libraries, encoding_rs (in the examples, but not the API itself), and more, libraries have been using this “pivoting” technique for coming up on a decade and a half now. It is effectively the same platonic ideal of “so long as there is a common, universal encoding that can handle the input data, we will make sure there is an encoding route from A to this ideal intermediate, and then go to B through said intermediate”. Let’s take a look at ucnv_convertEx from ICU again:

U_CAPI void ucnv_convertEx (UConverter *targetCnv, UConverter *sourceCnv,
	char **target, const char *targetLimit,
	const char **source, const char *sourceLimit,
	UChar *pivotStart, UChar **pivotSource, UChar **pivotTarget, // pivot / indirect
	const UChar *pivotLimit,
	UBool reset, UBool flush, UErrorCode *pErrorCode);

The pivot is typed as various levels of UChar pointers, where UChar is a stand-in for a type wide enough to hold 16 bits (like uint_least16_t). More specifically, the UChar-based pivot buffer is meant to be the place where UTF-16 intermediate data is stored when there is no direct conversion between two encodings. The iconv library has the same idea, except it does not expose the pivot buffer to you. Emphasis mine:

It provides support for the encodings…

… [huuuuge list] …

It can convert from any of these encodings to any other, through Unicode conversion.

GNU version of libiconv, May 21st, 2023

In fact, depending on what library you use, you can be dealing with a “pivot”, “substrate”, or “go-between” encoding that usually ends up being one of UTF-8, UTF-16, or UTF-32. Occasionally, non-Unicode pivots can be used as well but they are exceedingly rare as they often do not accommodate characters from both sides of the equation, in a way that Unicode does (or gives room to). Still, just because somebody writes a few decades-old libraries and frameworks around it, doesn’t necessarily prove that pivots are the most useful technique. So, are pivots actually useful?

When I wrote my previous article about generic conversions, we used the concept of a UTF-32-based pivot to convert between UTF-8 and Shift-JIS, without either encoding routine knowing about the other. Of course, because this is C and we do not have ugly templates, we cannot use compile-time checking to make sure the decode_one of one “Lucky 7” object and the encode_one of the other “Lucky 7” object line up. So, we instead need a system where encoding pairs identify themselves in some way, and then identify that as the pivot point. That is, for this diagram:

An image showing a sequence of conversions with three-quarter circle arrows showing a swirling progression through 4 different stops: "Encoded Single Input" ↪ "Decode -> Unicode Code Point" ↪ "Unicode Code Point -> Encode" ↪ "Encoded Single Output"

And make the slight modification that allows for this:

A modification of the previous image. It shows a sequence of conversions with three-quarter circle arrows showing a swirling progression through 4 different stops: "Encoded Single Input" ↪ "Decode -> {hastily blotched over text written over with {SOMETHING}}" ↪ "{hastily blotched over text written over with {SOMETHING}} -> Encode" ↪ "Encoded Single Output"

The “something” is our indirect encoding, and it will also be used as the pivot. Of course, we still can’t know what that pivot will be, so we will once again use a type-erased bucket of information for that. Ultimately, our final API for doing this will look like this:

#include <stddef.h>

typedef enum cnc_mcerror {
	cnc_mcerr_ok = 0,
	cnc_mcerr_invalid_sequence = 1,
	cnc_mcerr_incomplete_input = 2,
	cnc_mcerr_insufficient_output = 3
} cnc_mcerror;

struct cnc_conversion;
typedef struct cnc_conversion cnc_conversion;

typedef struct cnc_pivot_info {
	size_t bytes_size;
	unsigned char* bytes;
	cnc_mcerror error;
} cnc_pivot_info;

cnc_mcerror cnc_conv_one_pivot(cnc_conversion* conversion,
	size_t* p_output_bytes_size, unsigned char** p_output_bytes,
	size_t* p_input_bytes_size, const unsigned char** p_input_bytes,
	cnc_pivot_info* p_pivot_info);

cnc_mcerror cnc_conv_pivot(cnc_conversion* conversion,
	size_t* p_output_bytes_size, unsigned char** p_output_bytes,
	size_t* p_input_bytes_size, const unsigned char** p_input_bytes,
	cnc_pivot_info* p_pivot_info);

The _one suffixed function does one-by-one conversions, and the other is for bulk conversions. We can see that the API shape here looks pretty much exactly like libiconv, with the extra addition of the cnc_pivot_info structure for the ability to control how much space is dedicated to the pivot. If p_pivot_info->bytes is a null pointer, or p_pivot_info is, itself, a null pointer, then it will just use some implementation-defined, internal buffer for a pivot. From this single function, we can spawn the entire batch of functionality we initially yearned for in libiconv. But, rather than force you to write nullptr/NULL in the exact-right spot of the cnc_conv_pivot function, we instead just provide you everything you need anyways:

// cnc_conv bulk variants
cnc_mcerror cnc_conv(cnc_conversion* conversion,
	size_t* p_output_bytes_size, unsigned char** p_output_bytes,
	size_t* p_input_bytes_size, const unsigned char** p_input_bytes);

cnc_mcerror cnc_conv_count_pivot(cnc_conversion* conversion,
	size_t* p_output_bytes_size,
	size_t* p_input_bytes_size, const unsigned char** p_input_bytes,
	cnc_pivot_info* p_pivot_info);
cnc_mcerror cnc_conv_count(cnc_conversion* conversion,
	size_t* p_output_bytes_size,
	size_t* p_input_bytes_size, const unsigned char** p_input_bytes);

bool cnc_conv_is_valid_pivot(cnc_conversion* conversion,
	size_t* p_input_bytes_size, const unsigned char** p_input_bytes,
	cnc_pivot_info* p_pivot_info);
bool cnc_conv_is_valid(cnc_conversion* conversion,
	size_t* p_input_bytes_size, const unsigned char** p_input_bytes);

cnc_mcerror cnc_conv_unbounded_pivot( cnc_conversion* conversion,
	unsigned char** p_output_bytes,
	size_t* p_input_bytes_size, const unsigned char** p_input_bytes,
	cnc_pivot_info* p_pivot_info);
cnc_mcerror cnc_conv_unbounded(cnc_conversion* conversion,
	unsigned char** p_output_bytes,
	size_t* p_input_bytes_size, const unsigned char** p_input_bytes);

// cnc_conv_one single variants
cnc_mcerror cnc_conv_one(cnc_conversion* conversion,
	size_t* p_output_bytes_size, unsigned char** p_output_bytes,
	size_t* p_input_bytes_size, const unsigned char** p_input_bytes);

cnc_mcerror cnc_conv_one_count_pivot(cnc_conversion* conversion,
	size_t* p_output_bytes_size,
	size_t* p_input_bytes_size, const unsigned char** p_input_bytes,
	cnc_pivot_info* p_pivot_info);
cnc_mcerror cnc_conv_one_count(cnc_conversion* conversion,
	size_t* p_output_bytes_size,
	size_t* p_input_bytes_size, const unsigned char** p_input_bytes);

bool cnc_conv_one_is_valid_pivot(cnc_conversion* conversion,
	size_t* p_input_bytes_size, const unsigned char** p_input_bytes,
	cnc_pivot_info* p_pivot_info);
bool cnc_conv_one_is_valid(cnc_conversion* conversion,
	size_t* p_input_bytes_size, const unsigned char** p_input_bytes);

cnc_mcerror cnc_conv_one_unbounded_pivot(cnc_conversion* conversion,
	unsigned char** p_output_bytes, size_t* p_input_bytes_size,
	const unsigned char** p_input_bytes, cnc_pivot_info* p_pivot_info);
cnc_mcerror cnc_conv_one_unbounded(cnc_conversion* conversion,
	unsigned char** p_output_bytes,
	size_t* p_input_bytes_size, const unsigned char** p_input_bytes);

It’s a lot of declarations, but you might be surprised to learn that the internal implementation of almost all of these is just one function call!

bool cnc_conv_is_valid(cnc_conversion* conversion,
	size_t* p_bytes_in_count,
	const unsigned char** p_bytes_in)
{
	cnc_mcerror err = cnc_conv_pivot(conversion,
		NULL, NULL,
		p_bytes_in_count, p_bytes_in,
		NULL); 
	return err == cnc_mcerr_ok;
}

Providing these functions is mostly a convenience. Since the implementation is so simple, it warrants giving people exactly what they want: named functions that communicate what they are doing, as opposed to a splatter of NULL/nullptr arguments that communicates nothing to an external reader about why the call is shaped that way. Still, for as much as I talk these functions up, there are two very important bits I’ve been sort of skirting around:

  • How the heck do we get a cnc_conversion* handle?
  • How do we make sure we provide generic connection points between random encodings?

Well, strap in, because we are going to be crafting a reusable, general-purpose encoding library that allows for run time extension of the available encodings (without loss of speed).

cuneicode and the Encoding Registry

As detailed in Part 1 and hinted at above, libiconv — and many other existing encoding infrastructures — do not provide a way to expand their encoding knowledge at run time. They ship with a fixed set of encodings, and you must either directly modify the library or directly edit data files in order to coax more encodings out of the interface. In the case of Standard C, sometimes that means injecting more files into the system locale files, or other brittle/non-portable things. We need a means of loading up and controlling a central place where we can stuff all our encodings. Not only that, but we also:

  • need to allow for controlling all allocations made; and,
  • need to allow for loading up an encoding registry with any “defaults” the library may have ready for us.

So, we need to come up with a new entity for the API that we will term a cnc_conversion_registry. As the name implies, it’s a place to store all of our conversions, and thusly a brief outline of the most important parts of the API should look like this:

typedef void*(cnc_allocate_function)(size_t requested_size, size_t alignment,
	size_t* p_actual_size, void* user_data);
typedef void*(cnc_reallocate_function)(void* original, size_t requested_size,
	size_t alignment, size_t* p_actual_size, void* user_data);
typedef void*(cnc_allocation_expand_function)(void* original, size_t original_size,
	size_t alignment, size_t expand_left, size_t expand_right, size_t* p_actual_size,
	void* user_data);
typedef void*(cnc_allocation_shrink_function)(void* original, size_t original_size,
	size_t alignment, size_t reduce_left, size_t reduce_right, size_t* p_actual_size,
	void* user_data);
typedef void(cnc_deallocate_function)(
	void* ptr, size_t ptr_size, size_t alignment, void* user_data);

typedef struct cnc_conversion_heap {
	void* user_data;
	cnc_allocate_function* allocate;
	cnc_reallocate_function* reallocate;
	cnc_allocation_expand_function* expand;
	cnc_allocation_shrink_function* shrink;
	cnc_deallocate_function* deallocate;
} cnc_conversion_heap;

cnc_conversion_heap cnc_create_default_heap(void);

typedef enum cnc_open_error {
	cnc_open_err_ok = 0,
	cnc_open_err_no_conversion_path = -1,
	cnc_open_err_insufficient_output = -2,
	cnc_open_err_invalid_parameter = -3,
	cnc_open_err_allocation_failure = -4
} cnc_open_error;

typedef enum cnc_registry_options {
	cnc_registry_options_none = 0,
	cnc_registry_options_empty = 1,
	cnc_registry_options_default = cnc_registry_options_none,
} cnc_registry_options;

struct cnc_conversion_registry;
typedef struct cnc_conversion_registry cnc_conversion_registry;

cnc_open_error cnc_registry_open(cnc_conversion_registry** p_out_registry, cnc_conversion_heap* p_heap,
	cnc_registry_options registry_options);
cnc_open_error cnc_registry_new(cnc_conversion_registry** p_out_registry,
	cnc_registry_options registry_options);

This is a LOT to digest. So, we’re going to walk through it, from top-to-bottom. The first 5 are function type definitions: they define the 5 different core operations an allocator can perform. Following the order of the type definitions:

  • cnc_allocate_function: an allocation function. Creates/acquires memory to write into. That memory can come from anywhere, so long as it contains as much as size was requested. It can give more space than requested (due to block size or for alignment purposes), and so the function takes the actual size as a pointer parameter to give it back to the end-user.
  • cnc_reallocate_function: a reallocation function. Takes an already-allocated block of memory and sees if it can potentially expand it in place or move it to another place (perhaps using memory relocation) with a larger size. Might result in a memcpy action to get the memory from one place to another place, or might do nothing and simply return nullptr while not doing anything to the original pointer. Tends to be used as an optimization, and may perhaps be a superset of the cnc_allocation_expand_function.
  • cnc_allocation_expand_function: an expansion function. This function takes an already-done allocation and attempts to expand it in-place. If it cannot succeed at expanding to the left (before) or right (after) of the memory section by the requested amounts, it will simply return nullptr and do nothing. Returns a new pointer by return value and fills out the actual size through a size_t pointer parameter.
  • cnc_allocation_shrink_function: a shrinking function. This function takes an already-done allocation and attempts to shrink it in-place. If it cannot succeed at shrinking from the left (before) or right (after) of the memory section by the requested amounts, it will simply return nullptr and do nothing. Returns a new pointer by return value and fills out the actual size through a size_t pointer parameter.
  • cnc_deallocate_function: a deallocation function. Releases previously-allocated memory.

From there, we compose a heap that contains one of each of the above functions, plus a void* which acts as a user data that goes into the heap. The user data’s purpose is to provide any additional information that may be needed contextually by this heap to perform its job (for example, a pointer to a span of memory that is then used as a raw arena). 99% of people will ignore the existence of the heap, however, and just use either cnc_create_default_heap, or just call cnc_registry_new which will create a defaulted heap for you. (The default heap will just shell out to malloc and friends.) The defaulted heap is then passed to cnc_registry_open.

Finally, there’s the registry options. Occasionally, it’s useful to create an entirely empty registry, so there’s a cnc_registry_options_empty for that, but otherwise the default is to stuff the registry with all of the pre-existing encodings that the library knows about. So, we can create a registry for this by doing:

cnc_conversion_registry* registry = NULL;
cnc_open_error reg_err            = cnc_registry_new(&registry, cnc_registry_options_default);

So far, the usage is surprisingly simple, despite all the stuff we talked about. The cnc_conversion_registry is a never-completed type, because it’s meant to just be a (strongly-typed) handle value (rather than just passing around a void*). The various error codes come from the cnc_open_error enumeration, and the names themselves explain pretty clearly what could happen. Some of the error codes don’t matter for this specific function, because it’s just opening the registry. The most we could run into is a cnc_open_err_allocation_failure or cnc_open_err_invalid_parameter; otherwise, we will just get cnc_open_err_ok! Assuming that we did, in fact, get cnc_open_err_ok, we can move on to the next part, which is opening/newing up a cnc_conversion* from our freshly created cnc_conversion_registry.

Creating a cuneicode Conversion

Dealing with allocations can be a pretty difficult task. As with the cnc_registry_new function, we are going to provide a number of entry points that simply shell out to the heap passed in during registry creation so that, once again, 99% of users do not have to care where their memory comes from for these smaller objects. But, it’s still important to let users override such defaults and control the space: this is paramount to allow for a strictly-controlled embedded implementation that can compile and run the API we are presenting here. So, let’s get into the (thorny) rules of both creating a conversion object, and providing routines to give our own conversion routines. First, let’s start with creating a conversion object to use:

cnc_open_error cnc_conv_new(cnc_conversion_registry* registry,
	const char* from, const char* to,
	cnc_conversion** out_p_conversion,
	cnc_conversion_info* p_info);

cnc_open_error cnc_conv_new_n(cnc_conversion_registry* registry,
	size_t from_size, const char* from,
	size_t to_size, const char* to,
	cnc_conversion** out_p_conversion,
	cnc_conversion_info* p_info);

cnc_open_error cnc_conv_open(
	cnc_conversion_registry* registry, const char* from, const char* to,
	cnc_conversion** out_p_conversion, size_t* p_available_space, unsigned char* space,
	cnc_conversion_info* p_info);

cnc_open_error cnc_conv_open_n(cnc_conversion_registry* registry,
	size_t from_size, const char* from,
	size_t to_size, const char* to,
	cnc_conversion** out_p_conversion,
	size_t* p_available_space, unsigned char* space,
	cnc_conversion_info* p_info);

As shown with the registry APIs, there are 2 distinct variants: the _open and _new styles. The _new style pulls its memory from the heap passed in during registry creation. It’s the simplest and easiest and effectively runs with whatever is on the heap at the time. However, sometimes that’s not local-enough for some folks. Therefore, the _open variants of the functions ask for a size_t* describing how much space is available, and a space pointer that points to an area of memory containing at least *p_available_space bytes. Each set of APIs takes a from name and a to name: these are encoding names that are compared in a specific manner. That is:

  • it is basic ASCII Latin Alphabet (A-Z, a-z) case-insensitive;
  • ASCII _, -, . (period), and ` ` (space) are considered identical to one another;
  • and the input must be UTF-8.

The reason that the rules are like this is so "UTF-8" and "utf-8" and "utf_8" and "Utf-8" are all considered identical. This is different from Standard C and C++, where setlocale is not required to do any sort of invariant-folding comparison, and implementations can instead consider "C.UTF-8", "C.Utf-8", "c.utf-8" and similar name variations as completely different. That is, while one platform will affirm that "C.UTF-8" is a valid locale/encoding, another platform will reject this despite having the moral, spiritual, and semantic equivalent of "C.UTF-8" because you spelled it with lowercase letters rather than some POSIX-blessed “implementation-defined” nutjobbery. Perhaps in the future I could provide Unicode-based title casing/case folding, but at the moment 99% of encoding names are mostly-ASCII identifiers. (It could be possible in the future to provide a suite of translated names for the to and from codes, but that is a bridge we can cross at a later date.)

The _n and non-_n styles of functions are just variations on providing a size for the from and to names; this makes it easy to avoid allocation if you parse a name out of another format (e.g., passing in a validated sub-input that identifies the encoding from a buffer that contains an <?xml … ?> encoding tag in an XHTML file, or the <meta> tag). If you don’t call the _n functions, we do the C thing and call strlen on the input from and to buffers. (This is, obviously, a problem if the string is not 0-terminated, as is the case with directly accessing a region of memory inside of the raw loaded text that represents a <meta> tag or a #pragma file_encoding "kz1048".) It’s not great, but momentum is momentum: C programmers and the APIs they use/sit beneath them on their systems expect footgun-y null-terminated strings, no matter how many times literally everyone gets it wrong in their lifespan as a C or C++ programmer.

Now that we know all of this, we can start talking about direct matches and indirect matches and the cnc_conversion_info structure:

typedef struct cnc_conversion_info {
	const ztd_char8_t* from_code_data;
	size_t from_code_size;
	const ztd_char8_t* to_code_data;
	size_t to_code_size;
	bool is_indirect;
	const ztd_char8_t* indirect_code_data;
	size_t indirect_code_size;
} cnc_conversion_info;

The (to|from)_code_(data/size) fields should be self-explanatory: when the conversion from from to to is found, it hands the user the sized strings of the found conversions. These names should compare equal under the function ztdc_is_encoding_name_equal_n_c8(…) to the from/to code passed in to any of the cnc_conv_new_*/cnc_conv_open_* functions. Note it may not be identical (even if they are considered equivalent) as mentioned with the “normalizing” algorithm above. The names provided in the cnc_conversion_info structure are what is stored inside of the registry, and not the name provided to the function call.

The interesting bit is the is_indirect boolean value and the indirect_code_(data/size) fields. If is_indirect is true, then the indirect_ fields will be populated with the name (and the size of the name) of the indirect encoding that is used as a pivot between the two encoding pairs!

Indirect Encoding Connection

If we are going to have a way to connect two entirely disparate encodings through a common medium, then we need to be able to direct an encoding through an intermediate. This is where indirect conversions come in. The core idea is, thankfully, not complex, and works as follows:

  • if there is an encoding conversion from “from” to “{Something}”;
  • and, if there is an encoding from “{Something}” to “to”;
  • then, a conversion entry will be created that internally connects from to to through {Something} as the general-purpose pivot.

So, for a brief second, if we assume we have an encoding conversion from an encoding called “SHIFT-JIS” to “UTF-32”, and we have an encoding conversion from “UTF-32” to “UTF-8”, we could simply ask to go from “Shift-JIS” to “UTF-8” without explicitly writing that encoding conversion ourselves. Since cuneicode comes with an encoding conversion that does Shift-JIS ➡ UTF-32 and UTF-32 ➡ UTF-8, we can try out the following code ourselves and verify it works with the APIs we have been discussing up until now. This is the exact same example we had back in the C++ article.

Step one is to open a registry:

#include <ztd/cuneicode.h>

#include <ztd/idk/size.h>

#include <stdio.h>
#include <stdbool.h>

int main() {
	cnc_conversion_registry* registry = NULL;
	{
		cnc_open_error err
			= cnc_registry_new(&registry, cnc_registry_options_empty);
		if (err != cnc_open_err_ok) {
			fprintf(stderr, "[error] could not open a new empty registry.");
			return 2;
		}
	}

	// …

That’s the first step. If we fail to open a conversion registry, we return 2 out of main and bail. Otherwise, this gets us an entirely empty registry. Normally, we would use cnc_registry_options_default to have the registry filled with all of the existing conversions that exist in cuneicode added to the registry by-default, but we’re going to use the fact that it’s empty to test that there does not exist a conversion between the 2 encodings we want to work with. That test looks like this:

	// …

	// Verify that no conversion exists
	// for Shift-JIS to UTF-8.
	{
		cnc_conversion* conversion          = NULL;
		cnc_conversion_info conversion_info = { 0 };
		cnc_open_err err                    = cnc_conv_new(
			registry, "shift-jis", "utf-8", &conversion, &conversion_info);
		if (err != cnc_open_err_no_conversion_path) {
			fprintf(stderr,
				"[error] there should be no conversion path, but one exists");
			cnc_registry_delete(registry);
			return 3;
		}
	}

	// …

If this passes, we know we have a registry that does not have a conversion. Good. Now we can test if our idea of an indirect conversion is real. In order to do that, we’re going to need some APIs for adding conversions to the registry. There are a LOT and they are all documented in the ztd.cuneicode documentation; the ones we are going to focus on will be the cnc_registry_add_single(…) and cnc_registry_add_alias(…), which look like this:

typedef cnc_mcerr(cnc_conversion_function)(cnc_conversion* conversion,
	size_t* p_output_bytes_size, unsigned char** p_output_bytes, size_t* p_input_bytes_size,
	const unsigned char** p_input_bytes, cnc_pivot_info* p_pivot_info, void* p_state);

typedef bool(cnc_state_is_complete_function)(
	const cnc_conversion* conversion, const void* p_state);

typedef void(cnc_close_function)(void* data);

cnc_open_err cnc_registry_add_single(
	cnc_conversion_registry* registry, const char* from, const char* to,
	cnc_conversion_function* single_conversion_function,
	cnc_state_is_complete_function* state_is_complete_function,
	cnc_open_function* open_function,
	cnc_close_function* close_function);

cnc_open_err cnc_registry_add_alias(
	cnc_conversion_registry* registry, const char* alias,
	const char* original);

To start, there’s a somewhat complex function signature for cnc_conversion_functions. Most of those parameters should be recognizable from the previous definitions we talked about earlier, with the addition of a single extra void* called p_state. This is an advanced parameter that we will not be talking about here, which is for managing state in a conversion- and encoding-agnostic manner. Because Shift-JIS, UTF-32, and UTF-8 do not require state with the way our API is structured, we will not need to use it. We will not need to touch the *p_pivot_info either, as we will not be attempting to use any sort of “temporary buffer space” (what the p_pivot_info is meant to control) in order to do the conversion. We will be using the static conversion functions that know which encodings they are going to and which they are coming from, and we will know beforehand that they do not need any state. That will make our implementation of the cnc_conversion_function* single_conversion_function very simple.

The last three parameters to cnc_registry_add_single are also advanced parameters that we will not be covering here. Because we have no state we are interested in, we can pass NULL for the state_is_complete_function. Similarly, because there is no extra data or state we need to maintain inside the cnc_conversion* opaque handle, we will not need either an open_function or a close_function to describe the format of the memory that is contained within. Therefore, we will pass NULL and NULL to these parameters, too. The rest of the parameters — registry, from, and to — are straightforward manifestations of what we expect. We want the registry we are going to put the conversion into, the (null-terminated) name to register on the “from” side of the encoding, and the (null-terminated) name of the encoding on the “to” side.

The cnc_registry_add_alias function is meant to behave exactly as you expect: if someone asks for the name pointed to by alias, it will instead give them the name at original. This is to add common names to more descriptive encodings so that the usual existing names can map usefully on any given platform. Note that, like the APIs above, there are other versions such as cnc_registry_add_single_n and cnc_registry_add_alias_n so that counted strings can be used for the names as opposed to null-terminated strings, similar to the other APIs talked about above.

Now that we’ve established that, it’s time for the fun bit: writing a type-erased cnc_conversion_function that does a single indivisible unit of encoding work.

“Single” Conversions

As described in the last article, there is a generally-applicable, widely-applied way to cut “bulk” work down so that it does a “single” unit of work, and a way to stack a “single” unit of work to make it do “bulk” work. The performance of these approaches is bloody AWFUL. But! It will work. This is why we are using cnc_registry_add_single: the name is a clue that we are going to only provide a function which does a single unit of indivisible work. Then, the API — automatically — is going to take that single function and proceed to run it in a hard loop, over and over again, until it either errors or consumes all of the input data. “Consumes all of the input data” also includes any data accumulated in the state; this is what the cnc_state_is_complete_function is for. Again, our encodings have no state, so we will just be providing NULL for the completion function and other related information to let them default internally, but that is its purpose.

So, we just have to implement the single conversion and the library will automatically take care of the rest for us, which is pretty nice! Going above the main function in the same file, we start with the following skeletons:

// …

#include <ztd/idk/size.h>
#include <ztd/idk/assert.h>
#include <ztd/idk/align.h>
#include <ztd/idk/assume_aligned.h>

#include <stddef.h>
#include <stdalign.h>

static inline cnc_mcerr shift_jis_x0208_to_utf32(
	cnc_conversion* conversion,
	size_t* p_bytes_out_count, unsigned char** p_bytes_out,
	size_t* p_bytes_in_count, const unsigned char** p_bytes_in,
	cnc_pivot_info* p_pivot_info, void* state)
{
	// …
}

static inline cnc_mcerr utf32_to_utf8(cnc_conversion* conversion,
	size_t* p_bytes_out_count, unsigned char** p_bytes_out,
	size_t* p_bytes_in_count, const unsigned char** p_bytes_in,
	cnc_pivot_info* p_pivot_info, void* state)
{
	// …
}

Okay, so far so good. Let’s start with the shift_jis_x0208_to_utf32 function, and fill it in.

Shift-JIS to UTF-32

There is a LOT to consider, so we will take it piece by piece:

static inline cnc_mcerr shift_jis_x0208_to_utf32(
	cnc_conversion* conversion,
	size_t* p_bytes_out_count, unsigned char** p_bytes_out,
	size_t* p_bytes_in_count, const unsigned char** p_bytes_in,
	cnc_pivot_info* p_pivot_info, void* state)
{
	// since we know our conversion we can safely ignore many of the parameters
	(void)conversion;
	(void)p_pivot_info;
	(void)state;
	// set up variables for use
	const char* elements_in       = NULL;
	ztd_char32_t* elements_out    = NULL;
	const char** p_elements_in    = &elements_in;
	ztd_char32_t** p_elements_out = &elements_out;
	size_t elements_in_count      = 0;
	size_t elements_out_count     = 0;
	size_t* p_elements_in_count   = NULL;
	size_t* p_elements_out_count  = NULL;

	// …
}

Nothing too complex, here. We are setting up a BUNCH of variables for us to use. Notably, we are trying to get strongly-typed pointers out of the existing byte-based ones, since internally we want to work with whole, complete code units and code points rather than going through everything byte-by-byte. As stated before, we are ignoring state, conversion, and p_pivot_info, since we know everything about the conversions we are going to do.

static inline cnc_mcerr shift_jis_x0208_to_utf32(
	cnc_conversion* conversion,
	size_t* p_bytes_out_count, unsigned char** p_bytes_out,
	size_t* p_bytes_in_count, const unsigned char** p_bytes_in,
	cnc_pivot_info* p_pivot_info, void* state)
{
	// …

	// if the counts are non-null, adjust them to be element counts
	if (p_bytes_in_count) {
		elements_in_count   = ((*p_bytes_in_count) / sizeof(char));
		p_elements_in_count = &elements_in_count;
	}
	if (p_bytes_out_count) {
		elements_out_count   = ((*p_bytes_out_count) / sizeof(ztd_char32_t));
		p_elements_out_count = &elements_out_count;
	}

	// if the pointers are not null, set their values up here
	if (p_bytes_in) {
		elements_in
			= ZTD_ASSUME_ALIGNED(alignof(char), (const char*)*p_bytes_in);
	}
	if (p_bytes_out) {
		// if the pointer is non-null, double-check alignment to prevent UB from
		// unaligned reads
		// NOTE: a more sophisticated implementation would do the conversion and
		// then break each write down into one-byte-at-a-time writes to allow for
		// unaligned pointers. We don't do that here, for simplicity's sake.
		ZTD_ASSERT(p_bytes_out
				? ztdc_is_aligned(alignof(ztd_char32_t), *p_bytes_out)
				: true);
		elements_out = ZTD_ASSUME_ALIGNED(
			alignof(ztd_char32_t), (ztd_char32_t*)*p_bytes_out);
	}
	
	// …
}

This is a lot of work to do a little, but it is, thankfully, very simple work that gets complicated by trying to be pedantically correct, the worst kind of correct. So:

  • In order to work internally, we need element counts and not byte counts. We convert to the right size by dividing by the size of the character.
    • (Yes, the / sizeof(char) is redundant. It’s just for symmetry of the code, and it thankfully doesn’t hurt anyone but the worst non-optimized C compilers.)
  • If the pointer parameters are not null pointers, then we override the initialization in the last code snippet by setting it to a real value.
    • We use ztd_char32_t because, unfortunately, Mac OS is broken and does not implement <uchar.h> properly on its platforms, and is missing a char32_t definition in C mode.
    • For the ztd_char32_t* pointer, we make sure it is aligned. This is because, technically, unaligned reads and writes are undefined behavior. x86 and x86_64 allow it, but e.g. PowerPC will either mangle the data or just straight up lie/crash with the data. Not fun to debug, so it’s asserted.
    • Once we’re done asserting it, we set the pointer with ZTD_ASSUME_ALIGNED, which isn’t strictly necessary but since we already went through all this trouble why the hell not?

That covers all of the steps above. Again, a lot of noise just to be pedantic, but it never hurts to not lead anyone down the stray path, right? Now that we have our proper pointers and proper sizes, we get to do the hard part: converting between Shift-JIS and UTF-32. That part looks like this:

static inline cnc_mcerr shift_jis_x0208_to_utf32(
	cnc_conversion* conversion,
	size_t* p_bytes_out_count, unsigned char** p_bytes_out,
	size_t* p_bytes_in_count, const unsigned char** p_bytes_in,
	cnc_pivot_info* p_pivot_info, void* state)
{
	// …

	// do actual conversion
	cnc_mcerr err = cnc_mcntoc32n_shift_jis_x0208(p_elements_out_count,
		p_elements_out, p_elements_in_count, p_elements_in);
	
	// …
}

Yeah, of course I’m cheating! Do you really want to see me write a Shift-JIS converter in the middle of this function? Absolutely not: I did not convert these pointers to something usable just to ignore something I already implemented: we call into the static conversion function and go on about our day like normal creatures that have better things to do. The rest of the function is just converting back to byte sizes and byte pointers and having a good time:

static inline cnc_mcerr shift_jis_x0208_to_utf32(
	cnc_conversion* conversion,
	size_t* p_bytes_out_count, unsigned char** p_bytes_out,
	size_t* p_bytes_in_count, const unsigned char** p_bytes_in,
	cnc_pivot_info* p_pivot_info, void* state)
{
	// …

	// do actual conversion
	// NOTE: we're just going to use what's provided by cuneicode,
	// but we COULD write our own here!
	cnc_mcerr err = cnc_mcntoc32n_shift_jis_x0208(p_elements_out_count,
		p_elements_out, p_elements_in_count, p_elements_in);

	// translate pointers back to byte pointers
	if (p_bytes_in) {
		*p_bytes_in = (const unsigned char*)elements_in;
	}
	if (p_bytes_out) {
		*p_bytes_out = (unsigned char*)elements_out;
	}
	// If the counts are non-null, translate them back into normal byte counts.
	if (p_bytes_in_count) {
		*p_bytes_in_count = elements_in_count * sizeof(char);
	}
	if (p_bytes_out_count) {
		*p_bytes_out_count = elements_out_count * sizeof(ztd_char32_t);
	}
	return err;
}

You can imagine that the implementation for utf32_to_utf8 uses much the same mechanisms: we convert to concrete pointers, assert that they are aligned, and then pass it into the pre-existing cnc_mcntoc32n_utf8 function that cuneicode has. Again, it’s not a crash course in UTF-32 to UTF-8 conversion, but we’re always meant to work smarter, not harder. I am not implementing these conversions for the eightieth time in my life just to score points on a technical writeup.

Back to main

With that out of the way, we can get back to our main function and start using these functions to do a type-erased conversion with our encoding registry. Let’s add the 2 conversions to our registry now:

// …

int main() {
	// …

	// Actually add the conversion here that we need.
	{
		// Shift-JIS to UTF-32
		cnc_open_err to_utf32_err
			= cnc_registry_add_single(registry, "shift-jis-x0208", "utf-32",
				shift_jis_x0208_to_utf32, NULL, NULL, NULL);
		if (to_utf32_err != cnc_open_err_ok) {
			fprintf(stderr,
				"[error] could not add conversion from shift-jis-x0208 to utf-32");
			cnc_registry_delete(registry);
			return 4;
		}
		// UTF-32 to UTF-8
		cnc_open_err to_utf8_err = cnc_registry_add_single(
			registry, "utf-32", "utf-8", utf32_to_utf8, NULL, NULL, NULL);
		if (to_utf8_err != cnc_open_err_ok) {
			fprintf(stderr,
				"[error] could not add conversion from utf-32 to utf-8");
			cnc_registry_delete(registry);
			return 5;
		}
	}

	// …
}

Very straightforward. We add the Shift-JIS to UTF-32 conversion, and then the UTF-32 to UTF-8 conversion routine. We want to be able to access the Shift-JIS X0208 encoding by just using the name “Shift-JIS”, so we connect it using an alias:

	// …

	// Ease-of-use alias for Shift-JIS's name
	{
		cnc_open_err err
		     = cnc_registry_add_alias(registry, "shift-jis", "shift-jis-x0208");
		if (err != cnc_open_err_ok) {
			fprintf(stderr,
			     "[error] could not add alias that maps shift-jis to "
			     "shift-jis-x0208");
			cnc_registry_delete(registry);
			return 6;
		}
	}

	// …

Past this point, we just need to actually create the conversion and attempt the actual work. We tested before that the conversion did NOT work, so now we will require that it does work (otherwise, we bail from main with the error code 7):

	// …

	cnc_conversion* conversion          = NULL;
	cnc_conversion_info conversion_info = { 0 };
	{
		cnc_open_err err = cnc_conv_new(
			registry, "shift-jis", "utf-8", &conversion, &conversion_info);
		if (err != cnc_open_err_ok) {
			fprintf(stderr, "[error] could not open a new conversion");
			cnc_registry_delete(registry);
			return 7;
		}
	}

	// …

As before, the conversion_info variable has been filled in at this point, so now we can use it to get information about what we opened up into the cnc_conversion* handle:

	// …

	fprintf(stdout, "Opened a conversion from \"");
	// Use fwrite to prevent conversions / locale-sensitive-probing from
	// fprintf family of functions with `%s`
	fwrite(conversion_info.from_code_data,
		sizeof(*conversion_info.from_code_data), conversion_info.from_code_size,
		stdout);
	fprintf(stdout, "\" to \"");
	fwrite(conversion_info.to_code_data, sizeof(*conversion_info.to_code_data),
		conversion_info.to_code_size, stdout);
	if (conversion_info.is_indirect) {
		fprintf(stdout, "\" (through \"");
		fwrite(conversion_info.indirect_code_data,
			sizeof(*conversion_info.indirect_code_data),
			conversion_info.indirect_code_size, stdout);
		fprintf(stdout, "\")");
	}
	else {
		fprintf(stdout, "\"");
	}
	fprintf(stdout, "\n");

	// …

Executing the code up until this point, we’ll get something like:

Opened a conversion from "shift-jis-x0208" to "utf-8" (through "utf-32")

which is what we were expecting. Right now, cuneicode only has a conversion routine between Shift-JIS ⬅➡ UTF-32, so it only has one “indirect” encoding to pick from. The rest of this code should look familiar to the example given above for the compile-time known encoding conversions, save for the fact that we are passing values through unsigned char* rather than any strongly-typed const char* or char8_t* types. That means we need to get the array sizes in bytes (not that it matters too much, since the input and output values are in char and unsigned char arrays):

	// …

	const char input_data[]
		= "\x61\x6c\x6c\x20\x61\x63\x63\x6f\x72\x64\x69\x6e\x67\x20\x74\x6f\x20"
		  "\x82\xAF\x82\xA2\x82\xA9\x82\xAD\x2c\x20\x75\x66\x75\x66\x75\x66\x75"
		  "\x21";
	// leave room: the UTF-8 output can be bigger than the Shift-JIS input
	unsigned char output_data[ztdc_c_array_size(input_data) * 4] = { 0 };

	const size_t starting_input_size  = ztdc_c_string_array_size(input_data);
	size_t input_size                 = starting_input_size;
	const unsigned char* input        = (const unsigned char*)&input_data[0];
	const size_t starting_output_size = ztdc_c_array_size(output_data);
	size_t output_size                = starting_output_size;
	unsigned char* output             = (unsigned char*)&output_data[0];
	cnc_mcerr err
		= cnc_conv(conversion, &output_size, &output, &input_size, &input);
	const bool has_err          = err != cnc_mcerr_ok;
	const size_t input_read     = starting_input_size - input_size;
	const size_t output_written = starting_output_size - output_size;
	const char* const conversion_result_title_str = (has_err
		? "Conversion failed... \xF0\x9F\x98\xAD" // UTF-8 bytes for 😭
		: "Conversion succeeded \xF0\x9F\x8E\x89"); // UTF-8 bytes for 🎉
	const size_t conversion_result_title_str_size
		= strlen(conversion_result_title_str);
	// Use fwrite to prevent conversions / locale-sensitive-probing from
	// fprintf family of functions
	fwrite(conversion_result_title_str, sizeof(*conversion_result_title_str),
		conversion_result_title_str_size, has_err ? stderr : stdout);
	fprintf(has_err ? stderr : stdout,
		"\n\tRead: %zu %zu-bit elements"
		"\n\tWrote: %zu %zu-bit elements\n",
		(size_t)(input_read), (size_t)(sizeof(*input) * CHAR_BIT),
		(size_t)(output_written), (size_t)(sizeof(*output) * CHAR_BIT));
	fprintf(stdout, "%s Conversion Result:\n", has_err ? "Partial" : "Complete");
	fwrite(output_data, sizeof(*output_data), output_written, stdout);
	// the stream (may be) line-buffered, so make sure an extra "\n" is written
	// out this is actually critical for some forms of stdout/stderr mirrors; they
	// won't show the last line even if you manually call fflush(…) !
	fprintf(stdout, "\n");

	// clean up resources
	cnc_conv_delete(conversion);
	cnc_registry_delete(registry);
	return has_err ? 1 : 0;
}

Finally. That’s it. So, now we can run all of this, and so, we can see the following output from the whole program:

Opened a conversion from "shift-jis-x0208" to "utf-8" (through "utf-32")
Conversion succeeded 🎉
	Read: 35 8-bit elements
	Wrote: 39 8-bit elements
Complete Conversion Result:
all according to けいかく, ufufufu!

Nice!

A general-purpose pivoting mechanism that can choose an intermediate and allow us to transcode through it, that we created ourselves! That means we have covered most of what is inside of the table, even when we use an encoding that is as obnoxious to write an implementation for as Punycode. Of course, despite demonstrating that a conversion can go through an indirect/intermediate encoding, that does not necessarily prove that we can do that for any encoding we want. The algorithm inside of cuneicode prefers conversions to and from UTF-32, UTF-8, and UTF-16 before any other encoding, but after that it’s a random grab bag of whichever matching encoding pair is discovered first.

This can, of course, be a problem. You may want to bias the selection of the intermediate encoding one way or another; to solve this problem, we just have to add another function call that takes a “filtering”/”selecting” function.

Indirect Control: Choosing an Indirect Encoding

Because this is C, we just add some more prefixes/suffixes on to the existing collection of function names, so we end up with a variant of cnc_conv_new that is instead named cnc_conv_new_select and its friends:

typedef bool(cnc_indirect_selection_function)(
	size_t from_size, const char* from,
	size_t to_size, const char* to,
	size_t indirect_size, const char* indirect);

cnc_open_err cnc_conv_new_n_select(cnc_conversion_registry* registry,
	size_t from_size, const char* from,
	size_t to_size, const char* to,
	cnc_indirect_selection_function* selection, // ❗ this parameter
	cnc_conversion** out_p_conversion, cnc_conversion_info* p_info);

cnc_open_err cnc_conv_open_n_select(cnc_conversion_registry* registry,
	size_t from_size, const char* from,
	size_t to_size, const char* to,
	cnc_indirect_selection_function* selection,
	cnc_conversion** out_p_conversion,
	size_t* p_available_space, void* space,
	cnc_conversion_info* p_info);

A cnc_indirect_selection_function type effectively takes the from name, the to name, and the indirect name and passes them to a function that returns a bool. This allows a function to wait for e.g. a specific indirect name to select, or maybe will reject any conversion that features an indirect conversion at all (the indirect name will be a null pointer to signify that it’s a direct conversion). For example, here’s a function that will only allow direct conversions or Unicode-based go-betweens:

#include <ztd/cuneicode.h>

#include <stdbool.h>

static inline bool filter_non_unicode_indirect(size_t from_size, const char* from,
	size_t to_size, const char* to,
	size_t indirect_size, const char* indirect) {
	// unused warnings removal
	(void)from_size;
	(void)from;
	(void)to_size;
	(void)to;
	if (indirect == NULL) {
		// if there's no indirect, then it's a direct conversion
		// which is fine.
		return true;
	}
	// otherwise, it must be Unicode
	return ztdc_is_unicode_encoding_name_n(indirect_size, indirect);
}

This function might come in handy to guarantee, for example, the best possible chance that 2 encodings can convert between each other. Typically, Unicode’s entire purpose is to enable going from one encoded set of text to another without any loss, whether through publicly available/assigned code points or through usage of the private use area. A user can further shrink this surface area by demanding that the go-between is something like UTF-8. This can be particularly handy for UTF-EBCDIC, which has many bit-level similarities with UTF-8 that can be used for optimization purposes as a go-between.

cuneicode itself, when a non-_select version of cnc_conv_(open|new) is used, provides a selection function that simply returns true. This is because cuneicode, internally, has special mechanisms that directly scan a subset of the list of known Unicode encodings and check them first. If there’s a conversion routine stored in the registry to or from UTF-8, UTF-16, or UTF-32, it will select and prioritize those first before going on to let the function pick whatever happens to be discovered first. The choice is unspecified and not stable between invocations of the cnc_conv creation functions, but that’s because I’m reserving the right to improve the storage of the conversion routines in the registry, and thus might need to change the data structures and their iteration paths / qualities in the future.

So There We Have It

We have an API that can:

  • statically convert between 2 encodings using information known at compile-time (through the naming scheme of the functions);
  • run-time convert between 2 encodings with a known, explicitly provided pathway between them (cuneicode encoding registry); and,
  • run-time convert between 2 encodings with a run-time discovered, and typically Unicode-preferred, pathway between them (by arbitrarily connecting two different encodings through an indirect encoding).

This satisfies all our requirements, and has an API that can work on even the tiniest devices to boot. (We did not go over the allocation-less APIs that are signified by the _open functions; this will be the subject of a deep-dive for a later blog post.) So, now it comes time to fill in our tables from the last blog post about the functionality. It should come as no surprise that we check all the boxes, because we built it to check all the boxes.

Feature Set 👇 vs. Library 👉 ICU libiconv simdutf encoding_rs/encoding_c ztd.text ztd.cuneicode
Handles Legacy Encodings
Handles UTF Encodings 🤨
Bounded and Safe Conversion API
Assumed Valid Conversion API
Unbounded Conversion API
Counting API
Validation API
Extensible to (Runtime) User Encodings
Bulk Conversions
Single Conversions
Custom Error Handling 🤨 🤨
Updates Input Range (How Much Read™) 🤨
Updates Output Range (How Much Written™)
Feature Set 👇 vs. Library 👉 boost.text utf8cpp Standard C Standard C++ Windows API
Handles Legacy Encodings 🤨 🤨
Handles UTF Encodings 🤨 🤨
Bounded and Safe Conversion API 🤨
Assumed Valid Conversion API
Unbounded Conversion API
Counting API 🤨
Validation API 🤨
Extensible to (Runtime) User Encodings
Bulk Conversions 🤨 🤨
Single Conversions
Custom Error Handling
Updates Input Range (How Much Read™)
Updates Output Range (How Much Written™)

There’s more API surface that we have not covered in this code. For example, there are functions that help do error handling (e.g. inserting replacement characters while skipping bad input, among other things). However, because this is C, this creates a combinatoric explosion of API surface: there need to be so, SO many functions to handle it. One of the ways to mitigate this would be to use a combination of macros with Statement Expressions and similar tricks. Unfortunately, statement expressions are non-standard. What we do instead is create some pretty disgusting abominations with macros… though! Even though they are disgusting abominations, it actually ends up working somewhat decently (for now):

#include <ztd/cuneicode.h>

#include <ztd/idk/size.h>

#include <stdio.h>
#include <stdbool.h>
#include <string.h>

int main() {
	const ztd_char32_t input_data[] = U"Bark Bark Bark \xFFFFFFFF🐕‍🦺!";
	ztd_char8_t output_data[ztdc_c_array_size(input_data) * CNC_C8_MAX] = { 0 };
	cnc_mcstate_t state                                                 = { 0 };
	const size_t starting_input_size  = ztdc_c_string_array_size(input_data);
	size_t input_size                 = starting_input_size;
	const ztd_char32_t* input         = input_data;
	const size_t starting_output_size = ztdc_c_array_size(output_data);
	size_t output_size                = starting_output_size;
	ztd_char8_t* output               = output_data;
	cnc_error_result err_result
	     = cnc_cxsnrtocysn_with_handler( // ❗ generic macro function call
			&output_size, &output, &input_size,
	          &input, &state,
			cnc_skip_input_and_replace_error_cxntocyn, // ❗ special object to pick handler
			NULL);
	const size_t input_read     = starting_input_size - input_size;
	const size_t output_written = starting_output_size - output_size;
	const bool has_err          = err_result.error_code != cnc_mcerr_ok;
	const char* const conversion_result_title_str = (has_err
		? "Conversion failed... \xF0\x9F\x98\xAD"   // UTF-8 bytes for 😭
		: "Conversion succeeded \xF0\x9F\x8E\x89"); // UTF-8 bytes for 🎉
	const size_t conversion_result_title_str_size
	     = strlen(conversion_result_title_str);
	// Use fwrite to prevent conversions / locale-sensitive-probing from
	// fprintf family of functions
	fwrite(conversion_result_title_str, sizeof(*conversion_result_title_str),
	     conversion_result_title_str_size, has_err ? stderr : stdout);
	fprintf(has_err ? stderr : stdout,
	     "\n\tRead: %zu %zu-bit elements"
	     "\n\tWrote: %zu %zu-bit elements"
	     "\n\tTotal # of errors handled in input: %zu\n",
	     input_read, (size_t)(sizeof(*input) * CHAR_BIT), output_written,
	     (size_t)(sizeof(*output) * CHAR_BIT), err_result.error_count);
	fprintf(stdout, "%s Conversion Result:\n", has_err ? "Partial" : "Complete");
	fwrite(output_data, sizeof(*output_data), output_written, stdout);
	// the stream (may be) line-buffered, so make sure an extra "\n" is written
	// out this is actually critical for some forms of stdout/stderr mirrors; they
	// won't show the last line even if you manually call fflush(…) !
	fwrite("\n", sizeof(char), 1, stdout);
	return has_err ? 1 : 0;
}

I will not show the implementation of this, because quite frankly it’s downright sinful. It’s the kind of stuff I’d have to go to confession for, but… well. I’m not going to the pearly gates yet, so I won’t have to account for the things I’ve done. I am not supposed to feel bad for the code I have written, nor for the state of the world as it relates to text in C and C++ … And yet? If I can be honest with you, dear reader…

I Feel My Sins Crawling On My Back

The sad reality is that I attempted to standardize at least a portion of this work (not the generic indirect conversions and encoding registry part) in C, for C23. I failed. So even as I sit here, lamenting the state of the ecosystem, angry at glibc, perpetually pissed off about Windows’s ucrt.dll, angry at Standard C and Standard C++?

The reality is that I’m no better.

This was my responsibility to fix, to see through to the end. It was, in fact, the sole reason I came down to work on C in the first place. Becoming Project Editor, helping with enums, doing #embed, fixing initialization with = {}, typeof, and so much more… that was extra. Unplanned. The Unicode conversion functions were the one thing I did plan. This is the one thing I had the most knowledge about, a solid game plan for. Numerous individuals pulled through for me, even submitted National Body comments on my behalf so this could be cleaned up in time.

I still didn’t make it happen.

It also had knock-on effects: without these functions, we also did not get u8/u16/u32 specifiers for the fprintf family of functions. The conversions were hard to specify without the functions being there (again, because I did not succeed in my mission). So not only did I fail in my mission, but my failure became other people’s problems. Just my one thing I failed to do, and it kept on going. And going.

And going.

This means we miss a whole cycle; no Unicode functionality in C for another spin of the standardization wheel. And, because the C++ library imports all C functions de-facto and de-jure, C++ does not get it either. This only makes me shudder more, because the deadline for the next version of the C standard is not set in stone. Some are advocating we take another 10 years to roll out the next version of the C standard with more features. Another 10 years with no cross-platform printing mechanism. Another 10 years without even the static functions going between Unicode encodings and freeing people from their locale / wide locale-sensitive conversion functions. Another 10 years of functions which are not thread-safe.

10 years of getting to watch the ecosystem continue to slide into hell, because I couldn’t get my words on a paper correct so I couldn’t put it in the magic sky document.

10 years of responsibility.

Ten. Years.

Mn.

Well. Not everything can go perfect, right? But, there is more to discuss than my abject inability to get the job done for just static, Unicode-centric conversions in C. In particular, one of the things hinted at by Part 1 was this interface — despite it doing things like updating the input/output pointers as well as the input/output sizes — could be fast. Now that we have both the static conversion sections and the registry for this C library, is it possible to be fast? Can we compete with the Big Dogs™ like ICU and encoding_rs for their conversion routines? Where does standard C and C++ fit on the scale? And, well, here’s what it looks like:

I am very tired and the alt-text here is so massive it actually can't be posted in Markdown (I have to use HTML tags directly, I think?). I apologize for this.

But we’ll discuss these benchmarks more… Next time. If you’ve stuck around to read this whole giant article, wow, nice job! Unlike me with my C papers, you did great! 🎊

Hopefully, it enriched you, even if only a little. 💚

  1. I implemented this in both Clang and GCC myself, because why wait for somebody else to give you what you deserve? For MSVC, I had to wait until they got punched in the face by not having this information available for about a year and a half; after having accidentally punched themselves by not providing it, they did a numeric version, as I suggested, which has a reliable mapping.

  2. Narrator: they were lying. Windows still had many applications that refused to acknowledge this default locale, as they would soon find out when needing fwrite on their machine to print UTF-8 to a capable console. 

  3. Not that we endorse the language here, clearly the commit author is having a Certified Moment® and so this commit is filled with your usual videogame chat ableist thoroughfare. But, even if packaged in this manner, a valid complaint is a valid complaint. 

  4. See [depr.locale.stdcvt]. 

  5. NOTE: I am lying. I tried this. This is a bad idea. Do not do it on any available implementation, ever. At the best you’ll segmentation fault or get an assert failure. At the worst you will make a security issue. This is me, in all my seriousness, letting you know this is a TERRIBLE idea. 

  6. The Good Terminals™ includes Windows Terminal, a handful of Command Prompt shims, PowerShell (most of the time), the Console on Mac OS, and (most) Linux Terminals not designed by people part of the weird anti-Unicode Fiefdoms that exist in the many Canon *nix Universes. 

  7. Aliasing-capable means that the pointer can be used as the destination of a pointer cast and then be used in certain ways without violating the rules of C and C++ on what is commonly called “strict aliasing”. Generally, this means that if data has one type, it cannot be used through a pointer as another type (e.g., getting the address of a float variable, then casting the float* to a unsigned int* and accessing it through the unsigned int*). Strict aliasing is meant to allow a greater degree of optimizations by being capable of knowing certain data types can always be handled in a specific way / with specific instructions. 

]]>
<![CDATA[Following up from the last post, there is a lot more we need to cover. This was intended to be the post where we talk exclusively about benchmarks and numbers. But, I have unfortunately been perfectly taunted and status-locked,]]>
Undefined behavior, and the Sledgehammer Principle2023-02-02T00:00:00+00:002023-02-02T00:00:00+00:00https://thephd.dev/c-undefined-behavior-and-the-sledgehammer<![CDATA[

Previously, an article made the rounds concerning Undefined behavior that made the usual Rust crowd go nuts and the usual C and C++ people get grumpy that someone Did Not Understand the Finer Points and Nuance of Their Brilliant Language. So, as usual, it’s time for me to do what I do best and add nothing of value to the conversation whatsoever.

It’s time to get into The Big One in C and C++, and the Sledgehammer Principle.

Undefined behavior

This article is the one that flew around late November 2022 with people having a minor conniption over the implications that GCC will take your grubby signed integer and exploit the Undefined behavior of the C Standard to load its shotgun and start blasting your code. The full code looks like this:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

uint8_t tab[0x1ff + 1];

uint8_t f(int32_t x)
{
    if (x < 0)
        return 0;
    int32_t i = x * 0x1ff / 0xffff;
    if (i >= 0 && i < sizeof(tab)) {
        printf("tab[%d] looks safe because %d is between [0;%d[\n", i, i, (int)sizeof(tab));
        return tab[i];
    }

    return 0;
}

int main(int argc, char **argv)
{
    (void)argc;
    return f(atoi(argv[1]));
}

The “bad” code that GCC optimizes into a problem is contained in f, particularly the multiply-and-then-check:

    // …
    int32_t i = x * 0x1ff / 0xffff;
    if (i >= 0 && i < sizeof(tab)) {
        printf("tab[%d] looks safe because %d is between [0;%d[\n", i, i, (int)sizeof(tab));
        return tab[i];
    }
    // …

This program, after being compiled with GCC using e.g. gcc -O2 -Wall -o f_me_up_gnu_daddy, can be run as ./f_me_up_gnu_daddy 50000000, to which you will be greeted with a lovely segmentation fault (that will, of course, core dump, as is tradition). As the blog points out, it’ll even printf out “a non-sense lie, go straight into dereferencing tab, and die miserably”.

If you’re following along, 50,000,000 multiplied by 0x1ff (511 in decimal) results in 25,550,000,000; in other words, a number FAR too big for a 32-bit integer (whose maximum is a meager 2,147,483,647). This triggers a signed integer overflow. But the optimizer assumes that signed integer overflow can’t happen since the number is already positive (that’s what the x < 0 check guarantees, plus the constant multiplication). So, eventually, GCC takes this code and punches it in the face during its optimization step, and effectively removes the i >= 0 check and all it implies. The blog post author is, of course, not happy about this. And the fact is,

they’re not alone.

The Great Struggle

First thing I’d like to point out is that this isn’t the first time the C Language, implementations, or the C Standard came under flak for this kind of optimization. Earlier last year, someone posted the exact same style of code – using a signed integer index and then trying to bolt safety checks onto it after doing the arithmetic – and (at the time, before the Twitter Collapse and they locked their account) tagged it with the hashtag #vulnerability and said GCC was making their code more dangerous. Prior to that, Victor Yodaiken went on a The-Committee-and-implementers-have-lost-their-marbles bender for about a year and a half, which culminated in his paper for how ISO C was Not Suitable For Operating System Development (and even published a video explaining his position in Proceedings of the 11th Workshop on Programming Languages and Operating Systems).

And that’s just the recent examples, because the same issues go back a long way.

Given the number of times people have taken serious affront to compilers optimizing on Undefined behavior, you’d think WG14 — the C Committee — or WG21 — the C++ Committee — would make it their business to bring a solution to what has been a recurring issue in the C and C++ communities for decades now. But before we get into things that were done and should be done, we should talk about why everyone is increasingly freaking out about Undefined behavior, and why in particular it’s starting to become a more frequent occurrence. After all, it has System Programmers™ and Compiler Vendors/Implementers® getting upset and starting staring contests to see who flinches first. The eyes get dry, the boredom cycles start settling in, and it becomes difficult to keep hands on the keyboard and focus on what’s going on…

An anthropomorphic sheep sits at a computer, eyes bleary and tired as they stare directly at the viewer in a dimly lit room. Their hand-hooves are on the keyboard, and their head is tilted to the side in tiredness, but they're trying to maintain an upright posture and keep staring as best as they can. A portrait in the room of an anthropomorphic sheep with similar eyes stares at the viewer as well, somewhat creepily.

And, well. Unfortunately,

We Blinked First

As Victor Yodaiken tries to point out in his blog post and presentation, he believes that Undefined behavior was not meant to be the tool that people (particularly, compiler implementers) are using it for today. The blog post linked above is also shocked, and cites the Principle of least astonishment as reasoning for why GCC, Clang, and other compilers are being meanie-meanie-buttfaces for optimizing the code in this manner. And the best possible reaction (less in an amusing sense and more in a “people really were depending on this stuff, huh?” sense) is from felix-gcc in a much older GCC bug report:

signed type overflow is undefined by the C standard, use unsigned int for the addition or use -fwrapv.

You have GOT to be kidding?

PLEASE REVERT THIS CHANGE. This will create MAJOR SECURITY ISSUES in ALL MANNER OF CODE. I don’t care if your language lawyers tell you gcc is right. THIS WILL CAUSE PEOPLE TO GET HACKED.

felix-gcc, January 15, 2007

Users blinked first in the staring contest, and in that brief moment GCC took all the liberty it wanted to begin optimizing on Undefined behavior. Clang followed suit, and now life is getting dicey for a lot of developers who believe they are writing robust, safe code but actually aren’t anymore because they were using constructs that the C Standard defined as Undefined behavior. The language lawyers and compiler politicians had, seemingly, won out and people like Victor Yodaiken, felix-gcc, and bug/ubitux (the author of the blog post that sparked the most recent outbreak of protest against optimization) were left with nothing.

… Which is, of course, not the whole truth.

The truth is that, even as much as Yodaiken argues in his blog that Undefined behavior being used for optimization is simply a “reading error”, the problem did not start with a reading error. The problem started much earlier, when the precursor to ISO C set you up, some of you before you were even born or knew what a computer was.

Hands-off

WG14 — back before it was called ISO/IEC SC22 JTC1 WG14 and even before it was formally known as an ANSI Committee — had a problem.

There were a bunch of computers, and they had deeply divergent behavior. On top of that, a lot of things were hard to check or accommodate with the compute power available at the time. At that moment, enumerating all of the possible behaviors seemed to be a daunting task, bordering on impossible (and seriously, if people don’t want to write documentation now it is hard to imagine how much people wanted to deal with it in the era of punch card mainframes). They also did not want to prevent different kinds of hardware from using their shiny new language, or to preclude the various uses different kinds of computers would end up having. So they devised a scheme, back in the day when there was effectively one compiler vendor per deeply disturbing/cursed architecture being built:

what if they just didn’t?

Enter Undefined (and friends) behavior. It was the Committee’s little out. Things that were:

  • too hard (e.g., verifying the One Definition Rule was Definitely Followed between all of the header-included versions of an inline function); or
  • they weren’t sure about (what happens if, in the future, someone comes up with a computer that uses an even more exotic CHAR_BIT or even wackier address spaces?); or
  • too damn hard to document (integer conventions, overflow behaviors (modulo, saturating, trapping, etc.)).

They deemed it some flavor of Undefined/unspecified/implementation-defined behavior. Using 1’s complement versus 2’s complement? Undefined behavior at the tips of integer ranges. Using a different kind of shifter for your 16-bit integers and you shift the top bits? Undefined behavior. Pass a too-big argument into a function call meant to negate things? Unspecified/Undefined behavior! Multiply two integers together and they’re not within the range anymore? That’s right,

Undefined behavior.

The Curse

WG14 got to wash their hands of the problem. And for the next 30/40 years, it stayed that way. Users, of course, couldn’t just write programs on top of “Undefined behavior”. So folks like felix-gcc, Victor Yodaiken, and perhaps hundreds of thousands of others struck what was effectively a backroom deal with their compiler implementers. Compilers would just “generate the code”, and let users basically do “whatever they told the machine to do”. This is the interpretation that Yodaiken ultimately tries to get us back to by performing the most tortured and drawn-out sentence reading of all time in the above-linked blog post about Undefined behavior being a “reading error” in C. Whether or not anyone gets on — or wants to get on — the same grammatical train as he does doesn’t really matter, because in reality there’s a de-facto pecking order for how C code is interpreted. This ordering determines everything, from how Undefined behavior gets handled, to which optimizations trigger and which ones don’t, to how Implementation-defined Behavior and Unspecified Behavior get written down; EVERYTHING that the light touches, as far as your code and your implementation are concerned. The ordering — from most powerful to least powerful — when it comes to interpreting the behavior is as follows:

  1. The Sum of Human Knowledge about code generation / interpretation
  2. The Compiler Vendor / Implementer
  3. The C Standard and other related standards (e.g., POSIX, MISRA, ISO-26262, ISO-12207)
  4. The User (⬅ We Are Here)

As much as I would not like this to be the case, users – me, you and every other person not hashing out the bits ‘n’ bytes of your Frequently Used Compiler — get exactly one label in this situation.

Bottom Bitch

That’s right. When used in a conscientious and consenting relationship it’s a lot of fun to throw this word around, but in the context of dealing with vendors, it really doesn’t help! When all they have to do is throw up the Stone Wall and say “sorry, Standard Said So” and make off like bandits, no matter how masochistic any of us turn out to be, it hurts in the bad way. “But wait,” you say, desperately trying to fight the label we’ve all been slapped with. “What about -fwrapv or -ftrapv or -fno-delete-null-pointer-checks? That’s me, the user, being in control!” Unfortunately, that’s not real control. That’s something your implementation gives you.

It’s still entirely in their control, and frequently when you migrate to compilers outside of the niceties that GCC or Clang offer you, you can get shafted in exactly the same way. They can also take it away from you. Clang’s flag is very obvious about this and the freedom it affords them when they make flags like -enable-trivial-auto-var-init-zero-knowing-it-will-be-removed-from-clang. Even embedded compilers like SDCC fall prey to implementation-defined behavior, which produces a different structure size for this bitfield sequence in this bug report. And, to be very clear here, SDCC is not wrong here; the C Standard allows them to do exactly this, and likely for the machines and compatibility they compile to this is exactly what they need to be doing. And, well, that’s the thing.

It’s the Standard’s Fault

Compiler vendors and implementers are always allowed to do whatever they want, and many times they break with standards to go and do their own things for compatibility reasons. But even though “the standard” is ranked beneath a compiler vendor and implementer, it’s still a mighty sword to swing. While you, the user, are powerless in the face of a vendor, the Standard is an effective weapon you can wield to get the behaviors you want. This is not only where felix-gcc and ubitux failed, but where 30 years of C programmer communities failed. They leaned too heavily on their implementers and these backroom, invisible deals, praying to some callous and capricious deity that their assumptions are not violated. But implementers have their own priorities, their own benchmarks, and their own milestones to hit. Every day of accepting whatever schlop an implementation handed us — whether it was a high quality Clang-level operational control through #pragmas and other schlop, or Compiler-written-by-a-drunk-EE-in-a-weekend schlop — was a day we condemned our own future.

For all the talk C programmers love to make about how “close to the metal” they are, they never really were. It was always based on that invisible contract that your implementer would “do the right thing”, which as it turns out tends to mean different things to different people and different communities. While signed integer overflow optimizations based on UB make Benchmark Line Go Up, it has a noticeable impact on the ability to predictably handle overflow at a hardware level because the compiler vendor precludes you from trying in the first place by optimizing based on that.

This is why C and C++ programmers get so pissed off at GCC, or Clang, or whatever implementation doesn’t do what they want the compiler to do. It shatters the illusion that they were in the driver’s seat for their code, and absolutely violates the Principle of Least Astonishment. Not because the concept of Undefined behavior hasn’t been explained to death or that they don’t understand it, but because it questions the very nature of that long-held “C is just a macro assembler” perspective. And as we keep saying, year over year, GCC bug after GCC bug, highly-upvoted blogpost after highly-commented writeup, that perspective isn’t going to change because it’s a fundamentally-held belief of the C and — to a lesser extent — the C++ community. “Native” code, “Machine” code, inline “assembly”, “close to the metal”; all of it is part of that shiny veneer of being the most badass person with a computer and access to a terminal in any given room.

And compiler vendors doing this threatens the programmer’s sacred and holy commune that is meant to be between them and the hardware.

Fighting Back

If you’ve been visiting this blog long enough, you know that we don’t do JUST problems. We’re engineers, here. Compiler vendors and implementers are not evil, but they’ve clearly drawn their line in the sand: they are going to optimize based on Undefined behavior. The more of it we have in the standard the more at-risk us we-tell-the-hardware-what-to-do folks are going to be. If compiler vendors are going to start flexing on us and optimizing UB, then there’s a few things we can do to take back what belongs to us.

Adopting the ‘Sledgehammer Principle’

I’ve personally developed a very simple rule called the ‘Sledgehammer Principle’. Whenever you commit undefined behavior, you have to imagine that you’ve taken a sledgehammer and smashed it into a large, very expensive vase your mother (or father, or grandmother, or whoever happens to be closest to you) owns. This means that if you needed that vase (e.g., the result of that signed integer multiply), you have to imagine that it is now completely unrecoverable. The vase is goddamn smashed, after all; only through brutally difficult craftsmanship could you ever imagine putting it back together. This means that, before you swing your sledgehammer around, you check before you commit the undefined behavior, not afterwards. For example, taking the program from the blog post, here’s a modification that can be done to check if you have fucked everything up before things go straight to hell:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>

uint8_t tab[0x1ff + 1];

uint8_t f(int32_t x)
{
    if (x < 0)
        return 0;
    // overflow check
    if ((INT32_MAX / 0x1ff) <= x) {
        printf("overflow prevented!\n");
        return 0;
    }
    // we have verified swinging this
    // particular sledge hammer is a-okay! 🎉
    int32_t i = x * 0x1ff / 0xffff;
    if (i < sizeof(tab)) {
        printf("tab[%d] looks safe because %d is between [0,%d) 🎊\n", i, i, (int)sizeof(tab));
        return tab[i];
    }
    else {
        printf("tab[%d] is NOT safe; not executing 😱!\n", i);
    }
    return 0;
}

int main(int argc, char* argv[])
{
    (void)argc;
    memset(tab, INT_MAX, sizeof(tab));
    return f(atoi(argv[1]));
}

This is a safe way to check for overflow before you actually commit the sin. This particular overflow check doesn’t have to worry about negative or other cases because we check for “less than zero” earlier, which makes this particularly helpful. The coding style of the above snippet is not great, of course: we’re using magic numbers and not specifying things, but it gets across the general idea of the Sledgehammer Principle: check before you swing, not after the vase is broken. But that’s only a minor balm on the overall issue, truly.

Why is doing Integer Multiplies like swinging a Sledge Hammer?

This is, of course, the crux of it all. Why is doing something so simple — especially something that was perfectly within the purview of “I can check it post-facto in my hardware” — so difficult? And the answer of course lies above, in the fact that the compiler implementers are the Top Dogs in this scenario. We, the users, are still at the bottom of the power ranking. We’re not Super Saiyan Goku ready to take on Super Broly in an epic battle; we’re the weak, pathetic Krillin of the whole shebang that’s about to get punched/slapped/destroyed for comedic effect.

So, how do we get this sledgehammer out of our hands? How do we make it so every vase we touch does not have the potential to shatter into a million irreplaceable pieces?

Our Greatest Weapon

Notice how every single bug report linked in this blog post ends with “the standard says we can, no I am not joking, take a hike” (not exactly with that tone, but you get the idea). If these vendors are going to be all about conforming to the C Standard, then what we need to start doing is investing in changing or adding to the Standard so we can start having it reflect the behaviors we want. It truly sucks that K&R left so much Undefined in the first place, and that the first iteration of the ANSI C Committee left it that way, and it snowballed into hundreds of places of Unspecified and Undefined behavior that are exploitable by not just the Committee, but by red teamers that know how to abuse the code we write daily.

This is not a hopeless situation, however. In C, we finally standardized the <stdckdint.h> header thanks to David Svoboda’s tireless efforts to produce safer, better integers in C. I wrote about its usage here, but it may take too long for standard library implementations to roll it out. If you’re not interested in waiting, you can grab a publicly-available version of the code written to a pretty high quality here (C++-heads can grab Peter Sommerlad’s Simple Safe Integer library themselves, since C++ itself hasn’t made any progress in this area). It won’t be perfect everywhere, but that’s the fun of open source code; everyone can make it a little better, so we can all stop reimplementing foundational things like five hundred times. Getting really high quality goods out of things will also help us hold standard library implementations accountable to the high level of performance we expect from them and others. It may also start encouraging them to finally share at least some of their code between each other so we’re not all racing to do the exact same thing on 80 different platforms. Svoboda managed to change the C Standard for the better, and while it won’t fix everything it does set a precedent for how we can make this a tractable problem that is solvable before most of us retire. There is, of course, a lot more to do:

  • ckd_div was not included in David’s proposal. This is because the only cases of failure for division are N / 0 and {}_MIN / -1 (the latter because the result would not be representable in a 2’s complement integer; {}_MAX of any given integer type does not fit there).
  • ckd_modulus has exactly the same problems as ckd_div, so anything that solves ckd_div can bring ckd_modulus along for the ride.
  • ckd_right_shift and ckd_left_shift are both not included. There is Undefined behavior if we shift into the high bit; it would be very nice to provide definitions for these so that there is well-defined behavior on a shift that happens to move bits into the high bit for a signed integer, especially since we now have 2’s complement behavior in C.

This, of course, only covers problematic mathematics for C and C++-style integers. There are a TON of other Undefined/Unspecified behaviors that cause serious issues for users, not least of which is the fact that things like NULL + 0 are Undefined behavior, or that passing NULL with a length of 0 to library functions (representing e.g. an empty array) is also Undefined behavior.

Integer promotion also tends to be a source of bugs (e.g., left-shifting a 16-bit unsigned short holding the value 0xFFFF by 16 promotes it to a signed int first, and the shift then produces a value not representable in that int, resulting in undefined behavior). This is solved by using Erich Keane, Aaron Ballman, Tommy Hoffner and Melanie Blower’s recently-standardized _BitInt(N) type, which I also wrote about here. This is something C++ doesn’t currently have except in the form of bespoke libraries; we’ll see if they move in this space, but for now using the C feature in C++ (e.g., in Clang) might be a good way to get the non-promoting integers you need to take back control of your code from problematic behavior. (A slight warning, however: we have not solved _BitInt(N) for generic functions, so it will not work with <stdckdint.h>’s generic functions, since we do not have generic parametricity in C.)

Do Not Go Quietly

It’s a lot of effort, and we have to keep actively working at it. None of this will come to us overnight. But, this is the world we inherited from our forebears. It’s the one where we’re the least of the ecosystem, and where vendors and implementers still have outsize control of what goes on in every aspect. But if they are going to hold the C Standard up as the Holy Text that justifies everything they do, then — to survive — we must flip the script. A lot of people don’t want fixes or changes or C. They like stability, they like things being frozen. They view it as a feature when it takes 5 years to put something as simple as #embed into C because that’s how we stop C from becoming “a mess” (“a mess like C++”, they sometimes say). But, for me and many others trying to write to the hardware, trying to write to the metal, trying to write code that says what we mean,

C and C++ are already broken.

We can either leave it like this and keep letting the vendors take our space from us. Or, we can fight back. I threw money into a bus ticket and slept in a rickety room, where I got to see some very large spiders and a few other many-legged bugs say good morning to me on my curtain every day. Just for the chance to get into this mess of a language, just because I believed it doesn’t have to stay busted and broken and difficult. We deserve signed integers that don’t trigger undefined behavior on a flippin’ left shift. We deserve multiplies and subtracts and adds that don’t punch us in the face. We deserve code that actually respects what we write and what we say when we say it, so we can build the necessary safety guarantees into what I know C programmers are shipping to millions of people all over the globe daily. We deserve better, and if our forefathers aren’t going to settle the question and give it to us?

Then we better damn well do it ourselves.

… Of Course.

There is a secret option as opposed to all of this. You could not remember the Sledgehammer Principle. You can ignore the minutia of Undefined or Implementation-defined or Unspecified or whatever behavior. You could simply just use tools

that don’t treat you like you’re too stupid to write what you mean. 💚

Art by the ever-skillful Stratica!

]]>
<![CDATA[Previously, an article made the rounds concerning Undefined behavior that made the usual Rust crowd go nuts and the usual C and C++ people get grumpy that someone Did Not Understand the Finer Points and Nuance of Their Brilliant Language. So, as usual, it’s time for me to do what I do best]]>
The Wonderfully Terrible World of C and C++ Encoding APIs (with Some Rust)2022-10-12T00:00:00+00:002022-10-12T00:00:00+00:00https://thephd.dev/the-c-c++-api-landscape<![CDATA[

Last time we talked about encodings, we went in with a C++-like design where we proved that so long as you implement the required operations on a single encoding type, you can go between any two encodings on the planet. This meant you didn’t need to specifically write an e.g. SHIFT-JIS-to-UTF-8 or UTF-EBCDIC-to-Big5-HKSCS pairwise function, it Just Worked™ as long as you had some common pivot between functions. But, it involved a fair amount of template shenanigans and a bunch of things that, quite frankly, do not exist in C.

Can we do the same in C, nicely?

Well, dear reader, it’s time we found out! Is it possible to make an interface that can do the same as C++, in C? Can we express the ability to go from one arbitrary encoding to another arbitrary encoding? And, because we’re talking about C, can it be made to be fast?

In totality, we will be looking at the designs and performance of:

We will not only be comparing API design/behavior, but also the speed with benchmarks to show whether or not the usage of the design we come up with will be workable. After all, performance is part of correctness measurements. No use running an algorithm that’s perfect if it will only compute the answer by the Heat Death of the Universe, right? So, with all of that in mind, let’s start by trying to craft a C API that can cover all of the concerns we need to cover. In order to do that without making a bunch of mistakes or repeated bad ideas from the last five decades, we’ll be looking at all the above mentioned libraries and how they handle things.

Enumerating the Needs

At the outset, this is a pretty high-level view of what we are going for:

  1. know how much data was read from an input string, even in the case of failure;
  2. know how much data was written to an output, even in the case of failure;
  3. get an indicative classification of the error that occurred; and,
  4. control what happens with the input/output when a source does not have proper data within it.

There will also be sub-concerns that fit into the above but are noteworthy enough to call out on their own:

  1. the code can be/will be fast;
  2. the code can handle all valid input values; and,
  3. the code is useful for other higher level operations that are not strictly about encoding/decoding/transcoding stuff (validation, counting, etc.).

With all this in mind, we can start evaluating APIs. To help us, we’ll create the skeleton of a table we’re going to use:

| Feature Set 👇 vs. Library 👉 | ICU | libiconv | simdutf | encoding_rs/encoding_c | ztd.text |
|---|---|---|---|---|---|
| Handles Legacy Encodings | ❓ | ❓ | ❓ | ❓ | ❓ |
| Handles UTF Encodings | ❓ | ❓ | ❓ | ❓ | ❓ |
| Bounded and Safe Conversion API | ❓ | ❓ | ❓ | ❓ | ❓ |
| Assumed Valid Conversion API | ❓ | ❓ | ❓ | ❓ | ❓ |
| Unbounded Conversion API | ❓ | ❓ | ❓ | ❓ | ❓ |
| Counting API | ❓ | ❓ | ❓ | ❓ | ❓ |
| Validation API | ❓ | ❓ | ❓ | ❓ | ❓ |
| Extensible to (Runtime) User Encodings | ❓ | ❓ | ❓ | ❓ | ❓ |
| Bulk Conversions | ❓ | ❓ | ❓ | ❓ | ❓ |
| Single Conversions | ❓ | ❓ | ❓ | ❓ | ❓ |
| Custom Error Handling | ❓ | ❓ | ❓ | ❓ | ❓ |
| Updates Input Range (How Much Read™) | ❓ | ❓ | ❓ | ❓ | ❓ |
| Updates Output Range (How Much Written™) | ❓ | ❓ | ❓ | ❓ | ❓ |

| Feature Set 👇 vs. Library 👉 | boost.text | utf8cpp | Standard C | Standard C++ | Windows API |
|---|---|---|---|---|---|
| Handles Legacy Encodings | ❓ | ❓ | ❓ | ❓ | ❓ |
| Handles UTF Encodings | ❓ | ❓ | ❓ | ❓ | ❓ |
| Bounded and Safe Conversion API | ❓ | ❓ | ❓ | ❓ | ❓ |
| Assumed Valid Conversion API | ❓ | ❓ | ❓ | ❓ | ❓ |
| Unbounded Conversion API | ❓ | ❓ | ❓ | ❓ | ❓ |
| Counting API | ❓ | ❓ | ❓ | ❓ | ❓ |
| Validation API | ❓ | ❓ | ❓ | ❓ | ❓ |
| Extensible to (Runtime) User Encodings | ❓ | ❓ | ❓ | ❓ | ❓ |
| Bulk Conversions | ❓ | ❓ | ❓ | ❓ | ❓ |
| Single Conversions | ❓ | ❓ | ❓ | ❓ | ❓ |
| Custom Error Handling | ❓ | ❓ | ❓ | ❓ | ❓ |
| Updates Input Range (How Much Read™) | ❓ | ❓ | ❓ | ❓ | ❓ |
| Updates Output Range (How Much Written™) | ❓ | ❓ | ❓ | ❓ | ❓ |

The “❓” just means we haven’t evaluated it / do not know what we’ll get. Before evaluating the performance, we’re going to go through every library listed here (for the most part; one of the libraries is mine (ztd.text) so I’m just going to brush over it briefly since I already wrote a big blog post about it and thoroughly documented all of its components) and talk about the salient points of each library’s API design. There will be praise for certain parts and criticism for others. Some of these APIs will be old; not that it matters, because many are still in use and considered fundamental. Let’s talk about what the feature sets mean:

Handles Legacy Encodings

This is a pretty obvious feature: whether or not you can process at least some (not all) legacy encodings. Typical legacy encodings are things like Latin-1, EUC-KR, Big5-HKSCS, Shift-JIS, and similar. Usually this comes down to a library trying (and failing) to handle things, or just not bothering with them at all and refusing to provide any structure for such legacy encodings.

Handles UTF Encodings

What it says on the tin: the library can convert UTF-8, UTF-16, and/or UTF-32, generally between one another but sometimes to outside encodings. Nominally, you would believe this is table stakes to even be discussed here but, believe it or not, some people think that not all Unicode conversions should be fully supported, so it has to become a row in our feature table. We do not count specifically converting to UTF-16 Little Endian, or UTF-32 Big Endian, or what have you: this can be accomplished by doing a UTF-N conversion and then immediately doing byte swaps on the output code units of the conversion.
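The byte-swap step just mentioned is simple enough to sketch directly. The helper names here are mine, not from any of the libraries under discussion: convert to native-endian UTF-16 first, then swap each 16-bit code unit to get the other endianness.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Swap the two bytes of a single UTF-16 code unit.
inline char16_t byteswap16(char16_t u) {
	return static_cast<char16_t>((static_cast<std::uint16_t>(u) >> 8)
	                           | (static_cast<std::uint16_t>(u) << 8));
}

// Post-process a normal UTF-16 conversion's output into the opposite endianness.
inline void swap_utf16_inplace(char16_t* units, std::size_t count) {
	for (std::size_t i = 0; i < count; ++i)
		units[i] = byteswap16(units[i]);
}
```

This is why a library can get away with only providing native-endian UTF-16/UTF-32 conversions: the endianness fix-up is a cheap, separable pass over the output.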

Safe Conversion API

Safe conversion APIs are evaluated on their ability to have a well-bounded input range and a well-bounded output range that will not start writing or reading off into space when used. This includes things like having a size go with your output pointer (or a “limit” pointer to know where to stop), and avoiding the use of C-style strings on input (which is bad because it limits what can be put into the function before it is forced to stop due to null termination semantics). Note that all encodings have encodings for null termination, and that stopping on null terminators was so pervasive and so terrible that it spawned an entire derivative-UTF-8 encoding so that normal C-style string operations worked on it (see Java Programming Language documentation, §4.4.7, Table 4.5 and other documentation concerning “Modified UTF-8”).
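To make "well-bounded on both ends" concrete, here is a toy sketch (every name in it is invented for illustration, not taken from any library here): a Latin-1-to-UTF-8 converter where both input and output are [begin, limit) pairs, so it can never read or write out of bounds, and where the result says exactly where each side stopped.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type: reports exactly how far reading and writing got.
struct l1_to_u8_result {
	const unsigned char* input_stopped_at; // how much was read
	unsigned char* output_stopped_at;      // how much was written
	bool output_exhausted;                 // stopped because the output was full
};

inline l1_to_u8_result latin1_to_utf8_bounded(
	const unsigned char* in, const unsigned char* in_limit,
	unsigned char* out, unsigned char* out_limit) {
	while (in != in_limit) {
		unsigned char c = *in;
		std::size_t needed = (c < 0x80) ? 1 : 2;
		if (static_cast<std::size_t>(out_limit - out) < needed)
			return { in, out, true }; // bounded: refuse to overrun the output
		if (c < 0x80) {
			*out++ = c;
		}
		else {
			*out++ = static_cast<unsigned char>(0xC0 | (c >> 6));
			*out++ = static_cast<unsigned char>(0x80 | (c & 0x3F));
		}
		++in;
	}
	return { in, out, false };
}
```

Note there is no null terminator anywhere in this shape: embedded nulls pass straight through, and the caller always learns both positions, whether the call ran to completion or not.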

Assumed Valid Conversion API

Assumed valid conversion APIs are conversions that assume the input is already valid. This can drastically increase speed because checking for e.g. overlong sequences, illegal sequences, and other things can be entirely skipped. Note that it does not imply unbounded conversion, which is talked about just below. Assumed valid conversions are where the input is assumed valid; unbounded conversions are where the output is assumed to be appropriately sized/aligned/etc. Both are dangerous and unsafe and may lead to undefined behavior (unpredictable branching in the algorithm, uncontrolled reads and stray writes, etc.) or buffer overruns. This does not mean it is always bad to have: Rust saw significant performance increases when it stopped verifying known-valid UTF-8 string data when constructing and moving around its strings, for both compile-time and run-time workloads.

Lacking this API can result in speed drops, but not always.

Unbounded Conversion API

Unbounded conversions are effectively conversions with bounds checking on the output turned off. This is a frequent mainstay of not only old, crusty C functions but somehow managed to stay as a pillar of the C++ standard library. Unbounded conversions typically only take a single output iterator / output pointer, and generally have no built-in check to see if the output is exhausted for writing. This can lead to buffer overruns for folks who do not appropriately check or size their outputs. Occasionally, speed gains do come from unbounded writing but it is often more powerful and performance-impacting to have assumed valid conversion APIs instead. Combining both assumed valid and unbounded conversions tend to offer the highest performance of all APIs, pointing to a need for these APIs to exist even if they are not inherently safe.

Counting API

This is not too much of a big deal for APIs; nominally, it just allows someone to count how many bytes / code units / code points result from converting an input sequence from X to Y, without actually doing the conversion and/or without actually writing to an output buffer. There are ways to repurpose normal single/bulk conversions to achieve this API without requiring a library author to write it, but for this feature we will consider the explicit existence of such counting APIs directly because they can be optimized on their own with better performance characteristics if they are not simply wrappers around bulk/single conversion APIs.
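As a sketch of what a dedicated counting API buys you, here is a toy counter (name and shape invented for illustration) that answers "how many UTF-16 code units would this UTF-8 input become?" without writing a single byte of output. It assumes already-valid input, which is exactly why it can be leaner than a full conversion:

```cpp
#include <cassert>
#include <cstddef>

// Count UTF-16 code units produced by converting assumed-valid UTF-8,
// without performing the conversion or touching an output buffer.
inline std::size_t utf16_units_from_utf8(const unsigned char* in, std::size_t size) {
	std::size_t count = 0;
	for (std::size_t i = 0; i < size; ) {
		unsigned char c = in[i];
		if      (c < 0x80) { i += 1; count += 1; } // ASCII: one unit
		else if (c < 0xE0) { i += 2; count += 1; } // 2-byte sequence: one unit
		else if (c < 0xF0) { i += 3; count += 1; } // 3-byte sequence: one unit
		else               { i += 4; count += 2; } // 4-byte sequence: surrogate pair
	}
	return count;
}
```

Because it only inspects lead bytes and skips, a counting routine like this touches far less data per code point than a converting loop would, which is the optimization opportunity the paragraph above alludes to.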

Validation API

This is identical to counting APIs but for checking validity. Just like counting APIs, significant speed gains can be achieved by taking advantage of the lack of the need to count, write, or report/handle errors in any tangible or detailed fashion. This is mostly for the purposes of optimization, but may come in handy for speeding up general conversions where checking validity before doing an assumed-valid conversion may be faster, especially for larger blocks of data.
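A validation API can likewise be sketched as a standalone scan. This toy version (illustrative names; real libraries differ) checks UTF-8 structure, overlong encodings, surrogates, and the U+10FFFF ceiling, and reports the byte offset of the first error so it composes with error handling:

```cpp
#include <cassert>
#include <cstddef>

struct validate_result { bool ok; std::size_t error_at; };

inline validate_result validate_utf8_sketch(const unsigned char* in, std::size_t size) {
	for (std::size_t i = 0; i < size; ) {
		unsigned char c = in[i];
		std::size_t len;
		unsigned long cp, min;
		if      (c < 0x80) { i += 1; continue; }
		else if (c < 0xC0) return { false, i }; // stray continuation byte
		else if (c < 0xE0) { len = 2; cp = c & 0x1F; min = 0x80; }
		else if (c < 0xF0) { len = 3; cp = c & 0x0F; min = 0x800; }
		else if (c < 0xF8) { len = 4; cp = c & 0x07; min = 0x10000; }
		else return { false, i };
		if (size - i < len) return { false, i }; // truncated sequence
		for (std::size_t k = 1; k < len; ++k) {
			if ((in[i + k] & 0xC0) != 0x80) return { false, i };
			cp = (cp << 6) | (in[i + k] & 0x3F);
		}
		if (cp < min) return { false, i };                     // overlong
		if (cp >= 0xD800 && cp <= 0xDFFF) return { false, i }; // surrogate
		if (cp > 0x10FFFF) return { false, i };                // out of range
		i += len;
	}
	return { true, size };
}
```

Since this never decodes into an output buffer or builds detailed error objects, it is the kind of routine that can be made very fast, and then paired with an assumed-valid conversion as described above.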

Extensible to (Runtime) User Encodings

This feature is for the ability to add encodings into the API. For example, if the API does not have functions for TSCII, or recognize TSCII, is it possible to slot that into an API without needing to abandon what principles or goals the library sets up for itself? There is also the question of whether or not the extensibility happens at the pre-build step (adding extra data files, adding new flags, and more before doing a rebuild) or if it is actually accommodated by the API of the library itself.

Bulk Conversions

Bulk conversions are a way of converting as much input as possible to fill up as much output as possible. The only stopping conditions for bulk conversions are exhausted input (success), not enough room in the output, an illegal input sequence, or an incomplete sequence (but only at the very end of the input when the input is exhausted). Bulk conversions open the door to using Single Instruction Multiple Data (SIMD) CPU instructions, GPU processing, parallel processing, and more to convert large regions of text at a time.

More notably, given a stable single conversion function, running that conversion in a loop would produce the same effect, but may be slower due to various reasons (less able to optimize the loop, cannot be easily restructured to use SIMD, and more). Bulk conversions get around that by stating up-front they will process as much data as possible.
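The "run the single conversion in a loop" composition can be sketched directly. Everything here is invented for illustration: the single step does one unit of work (one UTF-32 code point to UTF-8), and the bulk function is just that step looped with the usual stop conditions (input exhausted, output full, illegal input).

```cpp
#include <cassert>
#include <cstddef>

enum class conv_error { ok, invalid_input, output_too_small };

// Single conversion: one code point in, 1-4 code units out; advances `out`.
inline conv_error encode_one(char32_t cp, unsigned char*& out, unsigned char* out_limit) {
	if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
		return conv_error::invalid_input;
	std::size_t len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
	if (static_cast<std::size_t>(out_limit - out) < len)
		return conv_error::output_too_small;
	switch (len) {
	case 1: *out++ = static_cast<unsigned char>(cp); break;
	case 2: *out++ = static_cast<unsigned char>(0xC0 | (cp >> 6));
	        *out++ = static_cast<unsigned char>(0x80 | (cp & 0x3F)); break;
	case 3: *out++ = static_cast<unsigned char>(0xE0 | (cp >> 12));
	        *out++ = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
	        *out++ = static_cast<unsigned char>(0x80 | (cp & 0x3F)); break;
	default:*out++ = static_cast<unsigned char>(0xF0 | (cp >> 18));
	        *out++ = static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F));
	        *out++ = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
	        *out++ = static_cast<unsigned char>(0x80 | (cp & 0x3F)); break;
	}
	return conv_error::ok;
}

// Bulk conversion: loop the single step; `in`/`out` mark exactly where we stopped.
inline conv_error encode_bulk(const char32_t*& in, const char32_t* in_limit,
                              unsigned char*& out, unsigned char* out_limit) {
	while (in != in_limit) {
		conv_error e = encode_one(*in, out, out_limit);
		if (e != conv_error::ok) return e;
		++in;
	}
	return conv_error::ok;
}
```

A real bulk implementation would replace the loop body with SIMD over big runs of input; the point is that the *interface* of the bulk function does not need to change to allow that.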

Single Conversions

Single conversions are, effectively, doing “one unit of work” at a time. Whether converting Unicode one code point at a time or making iterators/views to go over a range of UTF-8 one bundle of non-error code units at a time, the implication here is that it uses minimal memory while guaranteeing that forward progress is made/work is done. It is absolutely not the most performant way of encoding, but it makes all other operations possible out of the composition of this single unit of work, so there’s that. If a single conversion is the only thing you have, you can typically build up the bulk conversion from it.

In the opposite case, where only a bulk conversion API is available, you can still implement a single conversion API. Just take a bulk API, break the input off into a subrange of size 1 from the beginning of the input. Then, call the bulk API. If it succeeds, that’s all you need to do. If not, you take the subrange and make it a subrange of size 2 from the start of the input. You keep looping up until the bulk API successfully converts that sub-chunk of input, or until input exhaustion. This method is, of course, horrifically inefficient. It is inadvisable to do this, unless correctness and feature set is literally your only goal with your library. Pressuring your user to provide a single conversion first, and then a bulk conversion, will provide far better performance metrics.
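The growing-prefix trick described above can be sketched as follows. The bulk API here is a toy (it decodes structurally-sound UTF-8 to UTF-32 and fails on truncated sequences; all names are mine), and the "single" conversion is built on top of it exactly as the text describes: try a 1-byte prefix, grow on failure.

```cpp
#include <cassert>
#include <cstddef>

// Toy bulk API: all-or-nothing decode; fails if any sequence is incomplete.
inline bool bulk_utf8_to_utf32(const unsigned char* in, std::size_t size,
                               char32_t* out, std::size_t* written) {
	std::size_t n = 0;
	for (std::size_t i = 0; i < size; ) {
		unsigned char c = in[i];
		std::size_t len = c < 0x80 ? 1 : c < 0xE0 ? 2 : c < 0xF0 ? 3 : 4;
		if (size - i < len) return false; // incomplete sequence: whole call fails
		char32_t cp = c & (0xFF >> (len + (len > 1))); // lead-byte payload mask
		for (std::size_t k = 1; k < len; ++k)
			cp = (cp << 6) | (in[i + k] & 0x3F);
		out[n++] = cp;
		i += len;
	}
	*written = n;
	return true;
}

// Single conversion built from the bulk API: grow the prefix until one chunk
// converts. Horrifically inefficient, exactly as advertised.
inline std::size_t single_utf8_to_utf32(const unsigned char* in, std::size_t size,
                                        char32_t* out) {
	for (std::size_t prefix = 1; prefix <= size; ++prefix) {
		std::size_t written = 0;
		if (bulk_utf8_to_utf32(in, prefix, out, &written))
			return prefix; // bytes consumed for exactly one unit of work
	}
	return 0; // input exhausted without a complete sequence
}
```

For a 4-byte-max encoding like UTF-8 this re-runs the bulk machinery up to four times per code point; for stateful or wide multi-byte encodings it gets even worse, which is why providing the single conversion as the primitive is the better layering.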

Custom Error Handling

This is just the ability for a user to change what happens on failure/error. Note that this is still possible if the algorithm just stops where the error is and hands you that information; you can decide to skip and/or insert your own kind of replacement characters. (Or panic/crash, write to log, trigger an event; whatever it is you like to do.) I did not think I needed this category, but after working with a bunch of different APIs it is surprising how many do not offer functions to handle this exact use case and instead force replacements or just throw this information into the sea of forgetfulness.

Updates Input Range / Updates Output Range

It’s split into two categories so that I can document which part of this each API handles appropriately. This is for when, on either success or error/failure, an API tells you where in the input it failed (the “input range” bit) if it did, and how much it wrote before it succeeded/failed (the “output range” bit). I was always annoyed by this part, and I only got increasingly annoyed that it seems most APIs just give you a big fat middle finger and tell you to go pound sand when it comes to figuring out what happened with the data. This also ties in with “custom error handling” quite a bit; not returning this information means that even if the user was prepared to handle errors on their own, it is fundamentally impossible to do so, since they lose any possible information about where errors occurred or how much output was written, and so cannot avoid re-doing work. Oftentimes, not getting this information results in you needing to effectively treat the entire buffer of input and the entire buffer of output as one big blob of No-Man’s-Land. You could have 4 GB of input data resulting in 8.6 GB of output data, and other APIs will literally successfully convert all but the very last byte and then simply report “we messed up, boss”.

Where did you mess up?

“🤷‍♂️”

Okay, but how far did you get before you messed up?

“🤷‍♂️🤷‍♂️”

Okay, okay, easy there, how much did you actually manage to get out before you think you messed up?

“🤷‍♂️¯\_(ツ)_/¯🤷‍♂️”

D… D-Do you remember ANYTHING about what we just spent several minutes of compute… doing…?

“Oh, hey boss! You ready to get started on all this data? It’s gonna be great.”

If that conversation is concerning and/or frustrating to you, then welcome to the world of string and text APIs in C and C++. It’s having this conversation, every damn day, day in and day out, dusk to dawn. It’s pretty bad!

Nevertheless

Now that we have all of our feature sets and know what we are looking for in an API, we can start talking about all of these libraries. The core goal here will be in dealing with the issues of each API and trying to fill out as much of the functionality as required in the tables. The goal will be to fill it all with ✅ in each row, indicating full support. If there is only partial support, we will indicate that with 🤨 and add notes where necessary. If there is no support, we will use a ❌.

ICU

ICU has almost everything about their APIs correct, which is great because it means we can start with an almost-perfect example of how things work out. They have a number of APIs specialized for UTF-8 and UTF-16 conversions (and we do benchmark those), but what we will evaluate here is the ucnv_convertEx API that serves as their basic and fundamental conversion primitive. Here’s what it looks like:

U_CAPI void ucnv_convertEx(
	UConverter *targetCnv, UConverter *sourceCnv, // converters describing the encodings
	char **target, const char *targetLimit, // destination
	const char **source, const char *sourceLimit, // source data
	UChar *pivotStart, UChar **pivotSource, UChar **pivotTarget, const UChar *pivotLimit, // pivot
	UBool reset, UBool flush, UErrorCode *pErrorCode); // error code out-parameter

Some things done really well are the pErrorCode value and the pointer-based source/limit arguments. The error code pointer means that a user cannot ever forget to pass in an error code object. What’s more, if the value going into the function isn’t set to the proper 0-value (U_SUCCESS), the function will return a “warning” error code that the value going in was an unexpected value. This forces you to initialize the error code with UErrorCode error = U_SUCCESS;, then pass it in properly to be changed to something. You can still ignore what it sets the value to and pretend nothing ever happened, but you have, at least, forced the user to reckon with its existence. If someone is a Space CadetⓇ like me and still forgets to check it, well. That is on them.

Furthermore, the pointer arguments are also extraordinarily helpful. Taking a pointer-to-pointer of the first argument means that the algorithm can increment the pointer past what it has read, to successfully indicate how much data was read that produced an output. For example, UTF-8 may have anywhere from 1 to 4 code units of data (1 to 4 unsigned chars of data) in it. It is impossible to know how much data was read from a “successful” decoding operation, because it is a variable amount. Due to this, you cannot know how much was read/written without the function telling you, or you going backwards to re-do the work the function already did.

Updating the pointer value makes sure you know how much input was successfully read. An even better version of this updates the pointer only if the operation was successful, rather than just doing it willy-nilly. This is the way ICU works, and it is incredibly helpful when *pErrorCode is not set to U_SUCCESS. It allows you to insert replacement characters and/or skip over the problematic input if needed, allowing for greater control over how errors are handled.
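The pointer-to-pointer idiom is easy to demonstrate outside of ICU. This sketch (the function is mine, not ICU's) consumes complete UTF-8 sequences and advances `*source` past exactly what it processed, leaving any incomplete trailing sequence unconsumed so the caller knows precisely where to resume:

```cpp
#include <cassert>
#include <cstddef>

// Count complete UTF-8 sequences in [*source, source_limit), advancing
// *source past exactly the bytes that were consumed.
inline std::size_t count_complete_utf8(const unsigned char** source,
                                       const unsigned char* source_limit) {
	const unsigned char* p = *source;
	std::size_t count = 0;
	while (p != source_limit) {
		unsigned char c = *p;
		std::size_t len = c < 0x80 ? 1 : c < 0xE0 ? 2 : c < 0xF0 ? 3 : 4;
		if (static_cast<std::size_t>(source_limit - p) < len)
			break; // incomplete trailing sequence: leave it unconsumed
		p += len;
		++count;
	}
	*source = p; // caller can see exactly how far we got
	return count;
}
```

Because UTF-8 sequences are variable-length, the caller has no way to recompute this position cheaply on their own; having the callee publish it through the pointer is what makes resumable, chunked processing possible.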

Of course, there are some things the ICU function is not good at, and one of those is having potentially unchecked functions for increased speed. One of the things I attempted to do while using ucnv_convertEx was pass in a nullptr for the targetLimit argument. The thinking was that, even if that was not a valid ending pointer, the nullptr would never compare equal to any valid pointer passed through the *target argument. Therefore, it could be used as a pseudo-cheap way to have an “unbounded” write, and perhaps the optimizer could take advantage of this information for statically-built versions of ICU. Unfortunately, if you pass in nullptr, ICU will immediately reject the function call with U_ILLEGAL_ARGUMENT as the error code. You also cannot do “counting” in a straightforward manner, since nullptr is not allowed as an argument to the target-parameters.

There are other dedicated functions you can use to do counting (“pre-flighting”, as the ICU documentation calls it since it also performs other functions), and there are also other conversion functions you can do to instrument the “in-between” part of the conversion too (that is serviced by the pivotStart, pivotSource, pivotTarget, and pivotLimit parameters). But, all in all it still contains much of the right desired shape of an API that is useful for text encoding, and does provide a way to do conversions between arbitrary encodings.

Single conversion or one-by-one conversions are supported by using the CharacterIterator abstraction, but it has very strange usability semantics because it iterates over 16-bit UChars for UTF-16. Stitching together surrogate pairs takes work and, in most cases, is almost certainly not worth it given the API and its design for this portion.

For going to UTF-8 or UTF-16 — which should cover the majority of modern conversions — there are also dedicated APIs for it, such as ucnv_fromUChars/ucnv_toUChars and u_strFromUTF8 and u_strToUTF8. These APIs take different parameters but share much of the same philosophies as the ucnv_convertEx described above, generally lacking a pivot buffer set of arguments since direct Unicode conversions need no in-between pivot. Unlike many of the other libraries compared / tested in this API and benchmark comparison we are doing, they do support quite an array of legacy encodings. The only problem is that adding new encoding conversions takes rebuilding the library and/or modifying data files to change what is available.

Despite being an early contender and having to base their API model around Unicode 1.0 and a 16-bit UChar (and thus settling on UTF-16 as the typical go-between), ICU’s interface has at least made amends for this by providing a rich set of functionality. It can be difficult to figure it all out and how to use it appropriately, but once you do it works well. It may fail in some regards due to not embracing different kinds of performant API surfaces but by-and-large it provides everything someone could need (and more, but we are only concerned with encoding conversions at this point).

simdutf

As its name states, simdutf wants to be a Single Instruction Multiple Data implementation (SIMD) for all of the Unicode Transformation Formats (UTFs). And it does achieve that; with speeds that tear apart tons of data and rebuild it in the appropriate UTF encoding, Professor Lemire earned himself a sweet spot in the SPIRE 2021: String Processing and Information Retrieval Journal/Symposium thingy for his paper, “Unicode at Gigabytes per Second”. The interface is markedly simpler than ICU’s; with no generic conversions to worry about and no special conversion features, simdutf takes the much simpler route of doing the following:

simdutf_warn_unused size_t convert_utf8_to_utf16le(
	const char * input,
	size_t length,
	char16_t* utf16_output) noexcept;
simdutf_warn_unused result convert_utf8_to_utf16le_with_errors(
	const char * input,
	size_t length,
	char16_t* utf16_output) noexcept;
simdutf_warn_unused size_t utf8_length_from_utf16le(
	const char16_t* input,
	size_t length) noexcept;
simdutf_warn_unused bool validate_utf8(
	const char *buf,
	size_t len) noexcept;
simdutf_warn_unused result validate_utf8_with_errors(
	const char *buf,
	size_t len) noexcept;

simdutf_warn_unused is a macro expanding to [[nodiscard]] or similar shenanigans like __attribute__((warn_unused)) on the appropriate compiler(s). The result type is:

struct result {
	error_code error;
	size_t position;
};

It rinses and repeats the above for UTF-8 (using (const )char*), UTF-16 (using (const )char16_t*), and UTF-32 (using (const )char32_t*). You will notice a couple of things lacking from this interface, before we get into the “SIMD” part of simdutf:

  • How much input have I read, if something goes wrong?
  • How much output have you written, if something goes wrong?
  • How do I know if the buffer utf8/16/32_output is full?

For the version that is not suffixed with _with_errors, the first two cases are assumed not to happen: there are no output errors because it assumes the input data is well-formed UTF-N (for N in {8, 16, 32}). This only returns a size_t, telling you how much data was written into the *_output pointer. The assumption is that the entire input buffer was well-formed, after all, and that means the entire input buffer was consumed, assuming no problems. The *_with_errors functions have just one problem, however…

One, The Other, But Definitely Not Both

You are a smart, intelligent programmer. You know data coming in can have the wrong values frequently, due to either user error, corruption, or just straight up maliciousness. Out of an abundance of caution, you allocate a buffer that is big enough. You run the convert_utf8_to_utf16le_with_errors function on the input data, not one to let others get illegal data past you! And you were right to: some bad data came in, and the result structure’s error field has an enum error_code of OVERLONG: hah! Someone tried to sneak an overlong-UTF-8-encoded character into your data, to trap you! You pat yourself on the back, having caught the problem. But, now…. hm! Well, this is interesting. You have both an input pointer and an utf16_output pointer, but there’s only one size_t position field! Reading the documentation, that applies to the input, so… okay! You know where the badly-encoded overlong UTF-8 sequence is! But… uhm. Er.

How much output did you write again…?

This is simdutf’s problem. If you do a successful read of all the input, and output all the appropriate data, the position field on the result structure tells you how many characters were written to the output. You know it was successful, so you know you’ve consumed all the input; that’s fine! But when an error occurs? You only know how much input was processed before the error. Did you write 8.6 GB of data and only fail on the very last byte? Did you want to not start from the beginning of that 8.6 GB buffer and page in a shitload of memory? Eat dirt, loser; we’re not going to tell you squat. Normally, I’d be a-okay with that order of business. But there’s just one teensy, tiny problem with simdutf here! If you go into the implementations (the VARIOUS implementations, using everything from SSE2 to AVX2), you’ll notice a particular… pattern:

A screenshot of the internals of the simdutf library, particularly one of its 128-bit SSE processing blocks. It demonstrates knowing exactly how many output characters are written, specifically with red circles on the screenshot showing the output pointer being taken as a reference ("char16_t*&") and demonstrates that it returns the number of characters both read and written from its internal functions.

It. Knows.

It knows how much it’s been writing, and it just deigns not to tell you when it’s done. And I can see why it doesn’t pass that information back. After all, we all know how expensive it is to have an extra output_position field. That’s a WHOLE wasted size_t in the case where we read the input successfully and output everything nicely; it would be silly to include it! If we did not successfully read everything, what good can the input be anyhow?

Sarcasm aside, simdutf gets so many points for having routines that assume validity and do not, as well as length counting functions for input sequences and more, but just drops the ball on this last crucial bit of information! You either get to know the input is fully consumed and the output you wrote, or where you messed up in the input sequence, but you can’t get both.

Of course, it also doesn’t take buffer safety seriously either. Not that I blame simdutf for this: this has been an ongoing problem that C and C++ continue to not take seriously in all of its APIs, new and old. Nothing exemplifies this more than the standard C and C++ ways of “handling” this problem.

A Brief & Ranting Segue: Output Ranges and “Modern” C++

(Skip this rant by clicking here!)

Since the dawn of time, C and C++ have been on team “output limits are for losers”. I wrote an extensive blogpost about Output Ranges and their benefits after doing some benchmarks and citing Victor Zverovich’s work on fmtlib. At the time, that blogpost was fueled by rumblings of the idea that we do not need output ranges, or even a single output iterator (which is like an output pointer char16_t* utf16_output that Lemire’s functions take). Instead, we should take sink functions, and those sink functions can just be optimized into the speedy direct writes we need. The blog post showed that you can not only have the “sink” based API thanks to Stepanov’s iterator categorizations (an output iterator does exactly what a “sink” function is meant to do), but you can also get the performance upgrades to a direct write by having an output range composed of contiguous iterators of [T*, T*)/[T*, size_t]. This makes output ranges both better performing and, in many cases safer than both single-iterator and sink functions. So, when we standardized ranges, what did we do with old C++ functions that had signatures like

namespace std {
	template <typename InputIt, typename OutputIt>
	constexpr OutputIt copy(
		InputIt first,
		InputIt last,
		OutputIt d_first);
}

the above? Well, we did a little outfitting and made the ones in std::ranges look like…

namespace std {
	namespace ranges {
	template <std::input_iterator I, std::sentinel_for<I> S, std::weakly_incrementable O>
		requires  std::indirectly_copyable<I, O>
		constexpr copy_result<I, O> copy(
			I first,
			S last,
			O result); // ... well, shit.
	}
}

… Oh. We… we still only take an output iterator. There’s no range here. Well, hold on, there’s a version that takes a range! Surely, in the version that takes an input range, O will be a range too–

namespace std {
	namespace ranges {
		template <ranges::input_range R, std::weakly_incrementable O>
		requires  std::indirectly_copyable<ranges::iterator_t<R>, O>
		constexpr copy_result<ranges::borrowed_iterator_t<R>, O> copy(
			R&& r, // yay, a range!
			O result); // ... lmao
	}
}

Ah.

… Nice. Nice. Nice nice nice cool cool. Good. Great, even.

Fantastic.

One of the hilarious bits about this is that one of the penalties of doing SIMD-style writes and reads outside the bounds of a proper data pointer can be corruption and/or bricking of your device. If you have a pointer type for O result, and you start trying to do SIMD or other nonsense without knowing the size (or having an end pointer which can be converted into a size) with the single output pointer, on some hardware going past the ending boundary of the data and working on it means that you can, effectively, brick the device you’re running code on.

Now, this might not mean anything for std::(ranges::)copy, which might not rely on SIMD at all to get its work done. But, this does affect all the implementations that do want to use SIMD under the hood and may need to port to more exotic architectures; not having the size means you can’t be sure if/when you might do an “over-read” or “over-write” of a section of data, and therefore you must be extra pessimistic when optimizing on those devices. To be clear: a lot of computing does not run on such devices (e.g., all the devices Windows runs on and cares about do not have this problem). But, if you’re going to be writing a standard it might behoove us to actually give people the tools they need to not accidentally destroy their own esoteric devices when they use their SIMD instructions. When you have a range (in particular, a contiguous range with a size), you can safely work within the boundaries of both the input and output data and not trigger spurious failures/device bricking from being too “optimistic” with reads and writes outside of boundaries.

The weird part is that we also already have a range-based solution to “if I have to take a range, then I’m forced to bounds check against that range”. If you take an output range, you can also take an infinity range that simply does unbounded writes. This is something I’ve been using extensively since the earliest range-v3 days: unbounded_view. In fact, I gave a whole talk about how by using output ranges you can get safety by-default (the right default) and then get speed when you want it in an easily-searchable, greppable manner (timed video link):

A screenshot of a presentation titled "Catch ⬆️: Unicode for C++23". This slide in particular demonstrations using a "std::span" for output purposes, then an "unbounded_view", and then an "unbounded_view" with an "assume_valid" handler for even more speed.
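That "safe by default, fast by explicit opt-in" shape can be sketched with a few invented types (loosely modeled on the unbounded_view idea; none of these names are from range-v3, ztd.text, or the standard library): one algorithm, two output-range types, and the dangerous one is grep-ably named.

```cpp
#include <cassert>
#include <cstddef>

// Bounded output range: checks its limit, so writes can never overrun.
struct bounded_output {
	unsigned char* first;
	unsigned char* last;
	bool full() const { return first == last; }
	void push(unsigned char c) { *first++ = c; }
};

// Explicitly-named unbounded output: the grep-able opt-out of bounds checking.
struct unbounded_output {
	unsigned char* first;
	bool full() const { return false; } // the optimizer can delete the check
	void push(unsigned char c) { *first++ = c; }
};

// One algorithm serves both: safety (or not) is chosen by the output type.
template <typename Output>
std::size_t fill_bytes(Output out, unsigned char value, std::size_t n) {
	std::size_t written = 0;
	for (; written < n && !out.full(); ++written)
		out.push(value);
	return written;
}
```

With the bounded type, `full()` is a real comparison; with the unbounded type it is a constant, so the compiler emits the straight-line write loop, and a code review or grep for "unbounded" finds every place someone deliberately took the lid off.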

It still baffles me that we can’t push people with our standard APIs to have decent performance metrics with safety first, and then ask people to deliberately pull the jar lid off to get to the dangerous and terrifying C++ mojo later. But, we continue to do this for C, C++, and similar code without taking a whole-library or whole-standard approach to being conscientious of the security vulnerability-rich environments we like to saddle developers with. These sorts of decisions are infectious because they are effectively the standards-endorsed interfaces, and routinely we suffer from logic errors which leak into unchecked buffer overruns and worse because the every-day tools employed in C and C++ codebases next to the usual logic we do are often the most unsafe possible tools we have. You cannot debug build or iterator-checking your way out of fundamentally poor design decisions. No matter how hard Stephan T. Lavavej or Louis Dionne or Jonathan Wakely iterator-safety the standard libraries in debug mode, leaving open the potential for gaping issues in our release builds is not helpful for the forward progress of C and C++ to be considered effective languages for an industry suffering from a wide variety of increasingly sophisticated security attacks.

But I digress.

The real problem here is that, in simdutf, if your data is not perfectly valid you are liable to waste work done in the face of a failed conversion. Kiss that 8.5999 GB goodbye and prepare to start from the beginning of that buffer all over again, because the interface still does not return how much output was written! In at least one win for Modern C++ interfaces, the new std::ranges algorithms in C++ did learn from the past at least a little bit. They return both the input iterator and the output iterator (std::ranges::copy_result<I, O>) passed into the function. simdutf has, unfortunately, been learning from the old C++ and C school of functions rather than the latest C++ school of functions! So, even if they both make the same unbounded-output mistake, simdutf doesn’t get the updated input/output perspective correct. And I really do mean the C school of function calls: the Linux Kernel has gotten into this same situation with trying to make a string copy function!

The Kernel folks are now deprecating strlcpy. They have begun the (long, painful?) maybe-migration to the newly decided-on strscpy. They are, once again, trying to convince everyone that the new strscpy is better than strlcpy, the same way the people who wrote strlcpy convinced everyone that it was better than strncpy. This time, they declare, they have really cracked the string copy functionality and came up with the optimal return values and output processing. And you know what, maybe for a bulk of the situations they care about, the people who designed strscpy are right! Unfortunately, you get tired of reading about the same return-value-but-changed-slightly or null-termination-but-we-handled-it-better-this-time-I-swear mistakes over and over again, you know? Even the article writer is resigned to this apparent infinity-cycle of “let’s write a new copy function” every decade or two (emphasis mine):

… That would end a 20-year sporadic discussion on the best way to do bounded string copies in the kernel — all of those remaining strncpy() calls notwithstanding — at least until some clever developer comes up with an even better function and starts the whole process anew.

Jonathan Corbet, August 25th, 2022

I wish we would give everyone the input-and-output bounded copies with a full set of error codes so they could capture all the situations that matter, and then introduce strlcpy/strncpy/strscpy as optimizations where they are confident they can introduce it as such. But, instead, we’re just going to keep subtly tweaking/modifying/poking at the same damn function every 20 years. And keep introducing weird, messed up behavioral intrigues that drive people up the wall. It does give us all something to do, I guess! Clearly, we do not have enough things to be working on at the lowest levels of computing, beneath all else, except whether or not we’ve got our string copy functions correct. That’s the kind of stuff we need to be spending the time of the literal smartest people on the earth figuring out. Again. And get it right this time! For real. We promise. Double heart-cross and mega hope-to-die promise. Like SUPER-UBER pinky promise, it’s perfect this time, ultra swearsies!!

Ultra swearsies…

A sheep stares with utter exhaustion at their phone, eyes baggy and eyebrows drawn in with exasperated anger and hopelessness.

Rant Over

Regardless of how poorly output pointers are handled in the majority of C and C++ APIs, and ignoring the vast track record of people messing up null termination, sizes, and other such things, simdutf has more or less a standard interface offering most of the functionality you could want for Unicode conversions. Its combination of functions that do not check for valid input and ones that do (which are suffixed with _with_errors) allows for getting all of the information you need, albeit sometimes you need to make multiple function calls and walk over the data multiple times (e.g., call validate_utf8 before calling utf8_length_from_utf16le since the length function does not bother doing validation).

simdutf also does not bother itself with genericity or pivots or anything, because it solely works for a fixed set of encodings. This makes it interesting for Unicode cases (which is hopefully the vast majority of encoding conversions performed), but utterly useless when someone has to go battle the legacy dragons that lurk in the older codebases.

utf8cpp

utf8cpp is what it says on the tin: UTF-8 conversions. It also includes some UTF-16 and UTF-32 conversion routines and, as normal for most newer APIs, does not bother with anything else. It has both checked and unchecked conversions, and the APIs all follow an STL-like approach to storing information.

template <typename u16bit_iterator, typename octet_iterator>
u16bit_iterator utf8to16 (
	octet_iterator start,
	octet_iterator end,
	u16bit_iterator result);
namespace unchecked {
	template <typename u16bit_iterator, typename octet_iterator>
	u16bit_iterator utf8to16 (
		octet_iterator start,
		octet_iterator end,
		u16bit_iterator result);
}

You can copy-and-paste all of my criticisms for simdutf onto this one, as well as all my praises. Short, simple, sweet API, has an explicit opt-in for unchecked behavior (using a namespace to do this is a nice flex), and makes it clear what is on offer. As a side benefit, it also includes utf8::iterator and utf16::iterator classes to do iterator and view-like stuff with, which can help cover a pretty vast set of functionality that the basic functions cannot provide.

It goes without saying that extensibility is not built into this package, but it will be fun to test its speed. Error handling is left to the user, which means that custom error handling / replacement / etc. can be done. Of course, just like simdutf, it thinks input iterator returns are for losers, so there’s no telling where exactly the error might be in e.g. an UTF-16 sequence or something to that effect. However, for ease-of-use, utf8cpp also includes an utf8::replace_invalid function to replace bad UTF-8 sequences with a replacement character sequence. It also has utf8::find_valid, so you can scan for bad things in-advance and either remove/replace/eliminate them in a given object yourself. (UTF-8 only, of course!)
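For a feel of what a replace-invalid pass does, here is a hypothetical sketch (NOT utf8cpp's actual implementation): copy structurally valid sequences through and substitute U+FFFD for each bad byte. It is simplified to check only sequence lengths and continuation bytes; a real validator also rejects overlongs and surrogates:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical toy version of a "replace invalid UTF-8" convenience pass.
inline std::string toy_replace_invalid(std::string_view in) {
	static const char replacement[] = "\xEF\xBF\xBD"; // U+FFFD in UTF-8
	std::string out;
	for (std::size_t i = 0; i < in.size();) {
		unsigned char b = static_cast<unsigned char>(in[i]);
		std::size_t len = b < 0x80 ? 1
			: (b & 0xE0) == 0xC0 ? 2
			: (b & 0xF0) == 0xE0 ? 3
			: (b & 0xF8) == 0xF0 ? 4
			: 0; // stray continuation / illegal lead byte
		bool ok = len != 0 && i + len <= in.size();
		for (std::size_t k = 1; ok && k < len; ++k)
			ok = (static_cast<unsigned char>(in[i + k]) & 0xC0) == 0x80;
		if (ok) { out.append(in.substr(i, len)); i += len; }
		else    { out.append(replacement);       i += 1;   } // skip one bad byte
	}
	return out;
}
```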

encoding_rs/encoding_c

encoding_rs is, perhaps surprisingly, THE Rust entry point into this discussion. This will make it interesting from a performance perspective and an API perspective, since it has a C version — encoding_c — that provides a C-like API with a C++ wrapper around it where possible. Rather than freely-creatable conversion objects, it has a much weirder design philosophy: it uses static const objects of specific types to signal which encoding is which:

// …

/// The UTF-8 encoding.
extern ENCODING_RS_NOT_NULL_CONST_ENCODING_PTR const UTF_8_ENCODING;

/// The gb18030 encoding.
extern ENCODING_RS_NOT_NULL_CONST_ENCODING_PTR const GB18030_ENCODING;

/// The macintosh encoding.
extern ENCODING_RS_NOT_NULL_CONST_ENCODING_PTR const MACINTOSH_ENCODING;

// …

The types underlying them are all the same, so you select whichever encoding you need either by referencing the static const object in code or by using the encoding_for_label(uint8_t const* label, size_t label_len) function. You then start calling the (decoder|encoder)_(decode|encode)_(to|from)_utf16 functions (or using the object-oriented function calls that do exactly the same thing but by calling (decoder|encoder)->(decode|encode)_(to|from)_utf16 on a decoder/encoder pointer):

uint32_t encoder_encode_from_utf16(
	ENCODING_RS_ENCODER* encoder,
	char16_t const* src,
	size_t* src_len,
	uint8_t* dst,
	size_t* dst_len,
	bool last,
	bool* had_replacements);
uint32_t encoder_encode_from_utf16_without_replacement(
	ENCODING_RS_ENCODER* encoder,
	char16_t const* src,
	size_t* src_len,
	uint8_t* dst,
	size_t* dst_len,
	bool last);
uint32_t decoder_decode_to_utf16(
	ENCODING_RS_DECODER* decoder,
	uint8_t const* src,
	size_t* src_len,
	char16_t* dst,
	size_t* dst_len,
	bool last,
	bool* had_replacements);
uint32_t decoder_decode_to_utf16_without_replacement(
	ENCODING_RS_DECODER* decoder,
	uint8_t const* src,
	size_t* src_len,
	char16_t* dst,
	size_t* dst_len,
	bool last);

First off, I would just like to state, for the record:

FINALLY. Someone finally included some damn sizes to go with both pointers. This is the first API since ICU not to just blindly follow in the footsteps of either the standard library, C string functions, or whatever other nonsense keeps people from writing properly checked functions (with optional opt-outs for speed purposes). This is most likely due to the fact that this is a Rust library underneath, and the way data is handled is with built-in language slices (the equivalent of C++’s std::span, and the equivalent of C’s nothing because C still hates its users and wants them to make their own miserable structure type with horrible usability interfaces). encoding_rs unfortunately fails to provide functions that do no checking here. Weirdly enough, this functionality could be built in by allowing dst_len to be NULL, giving the “write indiscriminately into dst and I won’t care” functionality. But, encoding_rs just… does not, instead stating:

… UB ensues if any of the pointer arguments is NULL, src and src_len don’t designate a valid block of memory or dst and dst_len don’t designate a valid block of memory. …

So, that’s that. Remember that my qualm is not that there are unsafe versions of functions: it’s that there exist unsafe functions without well-designed, safe alternatives. encoding_rs swings all the way in the other direction, much like ICU, and says “unbounded writing is for losers”, leaving the use case out in the cold. The size_t pointer parameters still need to be as given, because the original Rust functions return sizes indicating how much was written. Rather than returning a structure (potentially painful to do in FFI contexts), these functions load up the new size values through the size_t* pointers, reporting how much of the input was read and how much of the output was written.
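That in/out size convention can be sketched with a self-contained, hypothetical stand-in; the “encoding” below just narrows ASCII, but the parameter dance is the same one encoding_c’s functions perform:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical toy illustrating the encoding_rs/encoding_c calling
// convention only: buffer sizes go IN through the pointers, and the
// amounts actually consumed/produced come back OUT through them.
inline void toy_encode_from_utf16(
	char16_t const* src, std::size_t* src_len,
	std::uint8_t* dst, std::size_t* dst_len) {
	std::size_t read = 0, written = 0;
	while (read < *src_len && written < *dst_len) {
		char16_t c = src[read];
		if (c > 0x7F) break; // a real encoder would handle or replace this
		dst[written] = static_cast<std::uint8_t>(c);
		++read;
		++written;
	}
	*src_len = read;    // out-param: code units consumed
	*dst_len = written; // out-param: bytes produced
}
```

After the call, the caller knows both how far the input was read and how far the output was filled, no matter where the function stopped — the property most of the other APIs in this list fail to provide.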

Error handling can be done automatically for you by using the normal functions, with an indication that replacements occurred in the output bool* parameter had_replacements. Callers who want to apply some of their own handling and not just scrub directly over malformed sequences have to use the _without_replacement-suffixed functions.

Finally, the functions present here always go: to UTF-8 or UTF-16; or, from UTF-8 or UTF-16. It is your job to write a function that stitches 2 encodings together, if you need to go from one exotic/legacy encoding to another. This is provided in examples (here, and here), but not in the base package: transcoding between any 2 encodings is something you must specifically work out. The design is also explicitly not made to be extensible; what the author does is effectively his own package-specific hacks to pry the mechanisms and Traits open with his bare hands to get the additional functionality (such as in this crate).

This makes it a little painful to add one’s own encodings using the library, but it can technically be done. I will not vouch for such a path because when the author tells me “I explicitly made it as hard as possible to make it extensible”, I don’t take that as an invitation to go trying to force extensibility. Needless to say, the API was built for streaming and is notable because it is used in Mozilla Firefox and a handful of other places like Thunderbird. It is also frequently talked up as THE Rust Conversion API, but that wording has only come from, in my experience, Mozilla and Mozilla-adjacent folks who had to use the Gecko API (and were thus influenced by it), so that might just be me getting the echo feedback from a specific silo of Rustaceans. But if it’s powering things like Thunderbird, it’s got to be good, especially on the performance front, right?

I did encounter a pretty annoying usability bug when trying to convert from UTF-8 to UTF-16 the “generic” way, and ended up with an unexplained spurious conversion failure that seemed to mangle the data. It was, apparently, derived from the fact that you cannot ask for an encoder (an “output encoding”) of an UTF-16 type, which is itself apparently a restriction derived from the WHATWG encoding specification:

4.3. Output encodings

To get an output encoding from an encoding encoding, run these steps:

  • If encoding is replacement or UTF-16BE/LE, then return UTF-8.
  • Return encoding

— §4.3 WHATWG Encoding Specification, June 20, 2022

Yes, you read that right. If the encoding is UTF-16, return the UTF-8 encoding instead. Don’t raise an error, don’t print that something’s off to console, just slam an UTF-8 in there. I spent a good moment doing all sorts of checks/tests, trying to figure out why the Rust code was apparently giving me an UTF-8 encoder when I asked for an UTF-16 encoder:

A screenshot showing the Visual Studio Code debugger, highlighting a variable called "u16enc". The variable was initialized using "auto u16enc = UTF_16LE_ENCODING->new_encoder();" which produced a type of "std::unique_ptr<encoding_rs::Encoder>". Inspecting the variable in the left-hand-side panel and checking deep into its data members, it reveals that it has a name of "UTF-8" and not "UTF-16".

I’m certainly glad that encoding_rs cleaves that closely to the encoding specification, but you can backdoor this by simply generating a decoder for the encoding you are going from and directly calling decoder->decode_to_utf16_without_replacement(…). This, of course, begs the question of why we are following the specification this closely in the first place, if I can simply cheat my way out of the situation by shooting the encoder portion in the face and doing a direct conversion from the decoder half. It also begs the question of why the WHATWG specification willingly returns you a false encoding rather than raising an error. I’m sure there’s a good reason, but encoding_rs does not say it (other than stony-faced “it’s what the spec does”), and the WHATWG spec does not make it immediately obvious what the benefit of this is supposed to be. So I will simply regard it as the infinite wisdom of people 1,000 times my superior and scold myself for being too dumb to read the docs appropriately.

Topping off my Unicode Conversion troubles, encoding_rs (and its various derivatives like charset) don’t believe in UTF-32 as an encoding either. To be fair, neither does the WHATWG specification, but I’ve got applications trafficking in UTF-32 text (including e.g. the very popular Harfbuzz shaper and the Freetype API), so… I guess we’re just ignoring the hell out of those use cases and all of the wchar_t-based code out there in the world for *nix distributions.

Finally, there does not seem to be an “assume input is valid” conversion API either, despite the Rust ecosystem itself needing to have such functionality to drastically improve its own UTF-8 compile-time and run-time workloads for known-good strings. It’s not the end of the world to have neither unbounded nor assumed valid conversions, but it certainly means that there could be plenty of performance left out on the table from the API here. We also have to remember that encoding_rs’s job is strictly for web code, and maybe they just don’t trust anyone to do a conversion the unsafe way without endangering too-important data. Which is likely a fair call to make, but as somebody that’s trying to crush Execution and Wide Execution encodings from the C libraries like a watermelon between my thighs, the library comes up disappointingly short on necessary functionality.

It has certainly colored my impression of Rust’s text encoding facilities if this is the end-all, be-all crate that was hyped up to me for handling things in Rust.

libiconv

libiconv has an interface similar to ICU, but entirely slimmed down. There is no pivot parameter, and instead of taking a pair of [T**, T*), it works on – perhaps strangely – [T**, size_t*):

size_t iconv(
	iconv_t cd,
	char ** inbuf,
	size_t * inbytesleft,
	char ** outbuf,
	size_t * outbytesleft);

Initially, when I first encountered this, I thought libiconv was doing something special here. I thought they were going to use the nullptr argument for outbytesleft and outbuf to add additional modes to the functions, such as:

  • input validation (iconv(cd, inbuf, inbytesleft, nullptr, nullptr)), similar to validate_utf8 from simdutf;
  • output size counting (iconv(cd, inbuf, inbytesleft, nullptr, outbytesleft)), similar to utf16le_length_from_utf8 from simdutf; and,
  • unbounded output writing (iconv(cd, inbuf, inbytesleft, outbuf, nullptr)), similar to the lack of an “end” or “size” done by simdutf and utf8cpp and many other APIs.

libiconv was, of course, happy to NOT provide any of that functionality, nor give other functions capable of doing so either. This was uniquely frustrating to me, because the shape of the API was ripe and ready to provide all of these capabilities. Remember up above, where we noted that a bulk API can be repurposed for the goals of doing counting, unbounded output writing, and validation? This is exactly what was meant: the provided API surface of libiconv could achieve all of these goals as a bulk encoding provider. You could even repurpose the iconv API to do one-by-one encoding (the sticking point being, of course, that performance would be crap).
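For reference, here is a minimal sketch of the one mode the API does give you — fully-bounded bulk conversion — wrapped into a small helper. The "UTF-8"/"UTF-16LE" names are an assumption that holds on glibc; as POSIX gives no guarantees, other implementations may spell them differently:

```cpp
#include <cassert>
#include <cstddef>
#include <iconv.h>

// Minimal bounded bulk conversion through iconv. Both "left" counters are
// in/out: iconv() advances the buffer pointers and decrements the counters
// as it consumes input and produces output.
inline std::size_t utf8_to_utf16le(
	const char* in, std::size_t in_size, char* out, std::size_t out_size) {
	iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
	if (cd == (iconv_t)-1) return (std::size_t)-1;
	char* inbuf = const_cast<char*>(in); // iconv's signature is not const-correct
	char* outbuf = out;
	std::size_t inleft = in_size, outleft = out_size;
	std::size_t rc = iconv(cd, &inbuf, &inleft, &outbuf, &outleft);
	iconv_close(cd);
	if (rc == (std::size_t)-1) return (std::size_t)-1;
	return out_size - outleft; // bytes actually written
}
```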

The only thing going for libiconv, instead, is the wide variety of encodings it supports. Other than that, it’s a decent API surface whose potential is not at all taken advantage of, including the fact that, despite having a type-erased encoding interface, it does not provide any way to add new encodings to that interface. (If you want to do that, you need to add it manually into the code and then recompile the entire library, giving no means of runtime addition that is not expressly added to it by some outside force.)

Additionally, the encoding names given to the function that creates iconv_t conversion descriptor objects are not stable. For example, asking for the “UTF-32” encoding does not necessarily mean you will be provided with the UTF-32 encoding that matches the endianness of the machine compiled for. (This actually became a problem for me because a DLL meant to be used for Postgres’s libiconv got picked up in my application once. Suffice to say, deep shenanigans ensued as my little endian machine suddenly started chugging down big-endian UTF-32 data.) In fact, asking for “UTF-32” does not guarantee there is any relationship between the encoding name you asked for and the actual byte representation; despite being a POSIX standard, there are no guarantees about the name <-> encoding mapping. There is also no way to control Byte Order Marks, which is hilarious when e.g. you are trying to compile the C Standard using LaTeX and a bad libiconv implementation (thanks, Postgres) that inserts them poisons your system installation.

It is further infuriating that the error handling modes for POSIX can range from “stop doing things and return so the user can take care of it”, to “insert ASCII ? everywhere an error occurs” (glibc does this, and sometimes uses the Unicode Replacement Character when it likes), or even “insert ASCII * everywhere an error occurs” (musl-libc; do not ask me why they chose * instead of the nearly-universally-applied ?). How do you ask for a different behavior? Well, by building an entirely different libiconv module based on a completely different standard library and/or backing implementation of the functionality. Oh, the functionality comes from a library that is part of your core distribution? Well, just figure out the necessary linker and loader flags to get the right functions first. Duh!

Of course. How could I be such a bimbo! Just need to reach into my system and turn into a mad dog frothing at the mouth about encodings to get the behavior that works best for me. I just need to create patches and hold my distribution updates at gunpoint so I can inject the things I need! So simple. So easy!!

An anthropomorphic, smol sheep in a robe and a scarf, with beady little eyes and down-turned ears going "a" with their mouth open in disbelieving, mostly quiet, shocked agony.

In short, libiconv is a great API tainted by a lot of exceedingly poor specification choices, implementation choices, and deep POSIX baggage. Its lack of imagination for a better world and contentment with a broken, lackluster specification is only rivaled by its flaccid, uninspired API usage and its crippling lack of behavioral guarantees. At the very least, GNU libiconv provides a large variety of encodings, but lacks extensibility or any meaningful way to override or control how specific encoding conversions behave, leaving you at the mercy of the system.

In other words, it behaves exactly like every other deeply necessary C API in the world, so no surprises there.

boost.text

This is perhaps the most spiritually progressive UTF encoding and decoding library that exists. But, while having the ability to perhaps add more encodings to its repertoire, it distinctly refuses to and instead consists only of UTF-based encodings. There is a larger, richer offering of strictly Unicode functionality (normalization, bidirectional algorithms, word break algorithms, etc.) that the library provides, but we are — perhaps unfortunately — not dealing with APIs outside of conversions for now. Like utf8cpp and simdutf before it, it offers simple free functions for conversions:

template <std::input_iterator I, std::sentinel_for<I> S, 
	std::output_iterator<uint16_t> O> 
transcode_result<I, O> transcode_to_utf16(
	I first,
	S last,
	O out);

It also offers range-based solutions, more fleshed out than utf8cpp’s. These are created from a boost::text::as_utfN(...) function call (where N is {8, 16, 32}) and produce iterator/range types for going from the input type (deduced from the pointer (treated as a null-terminated C-string) or from the range’s value type) to the N-deduced output type.

As usual, the criticism of “please do not assume unbounded writes are the only thing I am interested in or need” applies to boost.text’s API here. Thankfully, it does something better than simdutf or utf8cpp: it gives back both the incremented first and the incremented out iterators. Meaning that if I passed 3 pointers in, I will get 2 updated pointers back, allowing me to know how much was read and how much was written. There is an open question about whether or not one can safely subtract pointers that may have a difference larger than PTRDIFF_MAX without invoking undefined behavior, but I have resigned myself that it is more or less an impossible problem to solve in a C and C++ standards-compliant way for now (modulo relying on not-always-present uintptr_t).
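The “give back BOTH updated positions” shape is simple to sketch. boost.text’s transcode_result carries the final input and output iterators; a hypothetical toy version (the conversion body here merely widens ASCII — the return shape is the point) looks like:

```cpp
#include <cassert>
#include <cstddef>

// Sketch of the boost.text-style result: both final positions come back.
template <typename In, typename Out>
struct transcode_result { In in; Out out; };

// Toy "conversion": widen ASCII bytes to char16_t, stopping at anything else.
template <typename In, typename Out>
transcode_result<In, Out> toy_transcode(In first, In last, Out out) {
	while (first != last) {
		unsigned char c = static_cast<unsigned char>(*first);
		if (c >= 0x80) break; // stop at the first byte the toy cannot handle
		*out++ = static_cast<char16_t>(c);
		++first;
	}
	// The caller learns exactly where reading and writing both stopped.
	return { first, out };
}
```

Pass 3 positions in, get 2 positions back: that is all it takes to know both how much was read and how much was written.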

boost.text’s unfortunate drawback is in its error handling scheme. You are only allowed to insert one (1) character — maybe — through its error handler abstraction, and only when using its iterator-based replacement facilities. In its transcode_to_utfN functions, it actively refuses to do anything other than insert the canonical Unicode replacement character, which means you cannot get the fast iteration functions to stop on bad data. It will just plow right on through, stick the replacement character in, and then smile a big smile at you while pretending everything is okay. This can have some pretty gnarly performance implications for folks who want to do things like, say, stop on an error or perform some other kind of error handling, forcing you to use the (less well-performing) iterator APIs.

But this is par-for-the-course for boost.text; it was made to be an incredibly opinionated API. I personally think its opinionated approach to everything is not how you get work done when you want to pitch it to save C and C++ from its encoding troubles, and when I sent my mailing list review for it when I still actively participated in Boost, I very vocally explained why I think it’s the wrong direction. It is, in fact, one of the bigger reasons I got involved with the C++ Standards Committee SG16: someone was doing something wrong on the internet and I couldn’t let that pass:

A screenshot of 2 slides on a presentation deck, where the first one includes a quote from Zach Laine – author of boost.text – saying (paraphrasing) "Yes, it's text_view stapled to UTF-8, with quite a few staples". The next slide shows a staple remover, and explains my intent to de-couple the boost.text codebase from its UTF-specific assumptions.

It’s still not a bad library. It has fewer features and less ergonomic coverage than simdutf, but the optimizations Zach Laine put into the encoding conversion layer are slick and at times compete with simdutf, when it’s working. (More on that later, when we get to the benchmarks.)

Standard C and C++

I am not even going to deign to consider C++ here. The APIs for doing conversions were so colossally terrible that they were deprecated in C++17 (NOTE: not yet removed, but certainly deeply discouraged by existing compiler flags). I myself have suffered an immensely horrible number of bugs trying to use the C++ version from <codecvt>, and users of my sol2 library have also suffered from the exceedingly poor implementation quality derived from an even worse API that does not even deserve to be mentioned in the same breath as the rest of the APIs here. You can read some of the criticisms:

<codecvt> and std::wstring_convert are dead and I will never hide how glad I am we put that thing in the trash can.

I’m also less than thrilled about the C Standard API for conversions. There are a number of problems, but I won’t regale you with all of them because I wrote a whole paper about it so I could fix it eventually in C. It did not make C23 because a last minute objection to the structure of the wording handling state ended up costing the paper its ability to make the deadline. Sorry; despite being the project editor I am (A) new to this, (B) extremely exhausted, and (C) not good at this at all, unfortunately!

Nevertheless, the C standard does not support UTF-16 as a wide encoding right now, which puts it at odds with at least 6 different existing major C platforms today. Even if the wide (wchar_t) encoding is UTF-32, the C API is still fundamentally incapable of representing many of the legacy text encodings it is supposed to handle in the first place. This has made even steely open source contributors stare slack-jawed at embattled C libraries like glibc, which have no choice but to effectively jettison themselves into nonsense behavior because the C standard provides no proper handling of things. This case, in particular, arises when Big5-HKSCS needs to return two UTF-32 code points (e.g. U"\u00CA\u0304") for certain input sequences (e.g. "Ê̄"):

oh wow, even better: glibc goes absolutely fucking apeshit (returns 0 for each mbrtowc() after the initial one that eats 2 bytes; herein wc modified to write the resulting character)

— наб, July 9, 2022

In fact, implementations can do whatever they want in the face of Big5-HKSCS, since it’s outside the Standard’s auspices:

Florian raised a similar issue in May of 2019 and the general feedback at that time was that BIG5-HKSCS is simply not supported by ISO C. I expect the same answer from POSIX which is harmonized with ISO C in this case.

If BIG5-HKSCS is not supported, then the standard will have nothing to say about which values can be returned after the first or second input bytes are read. …

— Carlos O’Donnell, March 30, 2020

And indeed, the standard cannot handle it. Both because of the assumption that a single wchar_t (one UTF-32 code unit, a single char32_t) can represent all characters from any character set, and the horrible API design that went into the mbrtowc/wcrtomb/etc. function calls. My paper details much of the pitfalls and I won’t review them exhaustively here, but suffice to say everyone who has had their head in the trenches for a long time has conclusively reached the point where we know the original APIs are bunk garbage. I have no intention of rehashing why these utilities are garbage and do not work, and seek only to supplant them and then drive them back into the burning hell whence forth they deigned to sputter out of.
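For those who have not had the displeasure, the core of the standard loop looks like this. Each successful mbrtowc call yields exactly one wchar_t, which is exactly the structural assumption a Big5-HKSCS sequence expanding to two code points cannot satisfy. (A minimal sketch using plain ASCII in the default "C" locale, so it runs anywhere; decode_loop is a hypothetical helper name.)

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// The standard one-at-a-time decode loop: every successful mbrtowc() call
// produces exactly ONE wchar_t — no more, and never two, which is the
// built-in assumption that multi-code-point expansions break.
inline std::size_t decode_loop(const char* s, wchar_t* out, std::size_t out_max) {
	std::mbstate_t st {};
	std::size_t n = std::strlen(s), produced = 0, i = 0;
	while (i < n && produced < out_max) {
		std::size_t r = std::mbrtowc(&out[produced], s + i, n - i, &st);
		if (r == (std::size_t)-1 || r == (std::size_t)-2)
			return (std::size_t)-1; // encoding error / incomplete sequence
		if (r == 0) break;          // decoded an embedded null
		i += r;
		++produced; // one wchar_t per call, by construction
	}
	return produced;
}
```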

Also, fun fact: the <uchar.h> functions which do attempt to do Unicode conversions (but use the execution encoding as the “go between”, so if you do not have a UTF-8 multibyte encoding every encoding is lossy and worthless) are not present on Mac OS. Which is weird, because Mac OS went all-in on UTF-8 encoding conversions as its char* encoding in all of its languages, so it could just… make that assumption and ignore all the BSD-like files left lurking in the guts of the OS. But they don’t, so instead they just provide… nothing.

All in all, if the C standard was at least capable — or the C++ standard ever rolled its sleeves up and designed something halfway good — we might not be in this mess. But we are, and that is the driving reason for this whole article. Of course, some platforms realized that the C and C++ standards are trash, so they invented their own functions. Like, for example, the Win32 folks.

Windows API

The Windows API has 2 pretty famous functions for doing conversions: WideCharToMultiByte and MultiByteToWideChar. They convert from a given code page to UTF-16, and from UTF-16 to a given code page. The signatures from MSDN look as follows:

int WideCharToMultiByte(
	UINT                               CodePage,
	DWORD                              dwFlags,
	_In_NLS_string_(cchWideChar)LPCWCH lpWideCharStr,
	int                                cchWideChar,
	LPSTR                              lpMultiByteStr,
	int                                cbMultiByte,
	LPCCH                              lpDefaultChar,
	LPBOOL                             lpUsedDefaultChar
);

int MultiByteToWideChar(
	UINT                              CodePage,
	DWORD                             dwFlags,
	_In_NLS_string_(cbMultiByte)LPCCH lpMultiByteStr,
	int                               cbMultiByte,
	LPWSTR                            lpWideCharStr,
	int                               cchWideChar
);

There is no “one by one” API here; just bulk. And, similar to the criticisms levied at simdutf, the standard library, and so many other APIs, they only have a single return int that is used as both the error code channel and the return value for the text. (We will ignore that they are using int, which definitely means you cannot be using larger than 4 GB buffers, even on a 64-bit machine, without getting a loop prepped to do the function call multiple times.) I am willing to understand Windows’s poor design because this API is some literal early-2000s crap. I imagine with all the APIs Windows cranks out regularly, they might have an alternative to this one by now. But if they do, (A) I cannot find such an API, and (B) people who literally work (and worked) on the VC++ runtime have stated in no uncertain terms that the C and C++ code to use for these conversions is the WideCharToMultiByte/MultiByteToWideChar interfaces.

So, well, that’s what we’re using.

This API certainly suffers from its age as well. For example, it does things like assume you would only want to insert 1 replacement character (and that the replacement character can fit neatly in one UTF-16 code unit). This was fixed in more recent versions of Windows with the introduction of the MB_ERR_INVALID_CHARS flag that can be passed to the dwFlags parameter, where the conversion simply fails if there are invalid characters. Of course, the same problem as simdutf manifests, but in an even worse fashion. Because the error code channel is the same as the “# of written bytes” channel (the return value), returning an error code means you cannot even communicate to the user where you left off in the input, or how much output you have written. If the conversion fails and you want to, say, insert a replacement u'\xFFFD' or u'?' by yourself and skip over the single bit of problematic input, you simply cannot, because you have no idea where in the output to put it. You also don’t know where the error has occurred in the input. It’s the old string conversion issue detailed at the start of this article, all over again, and it’s infuriating.

ztd.text

Turns out I have an entire slab of documentation you can read about the design, and an entire article explaining part of that design out, so I really won’t bother explaining ztd.text. It’s the API I developed, the API mentioned in the video linked above, and what I’ve poured way too much of my time into for the sole purpose of saving the C and C++ landscape from its terrible encoding woes. I have people reaching out from different companies already attempting re-implementations of the specification for their platforms, and progress continues to move forward.

It checks every single box in the table’s row for the desired feature sets, obviously. If it didn’t, I would have gone back and whipped my API into shape to make sure it did, but I didn’t have to because, unlike just about every other API in this list, it actually paid attention to everything that came before it and absorbed their lessons. I didn’t make obvious mistakes or skip over use cases because, as it turns out, listening and learning are really, really powerful tools to prevent rehashing 30 year old discussions.

Wild, isn’t it?

So… What Happens Now?

We listed a few criteria and talked about it, so let’s try to make a clear table of what we want out of an API and what each library gives us in terms of conversions. As a reminder, here’s the key:

  • ✅ Meets all the necessary criteria of the feature.
  • ❌ Does not meet all the necessary criteria of the feature.
  • 🤨 Partially meets the necessary criteria with caveats or addendums.

And here’s how each of the libraries squares up.

Feature Set 👇 vs. Library 👉 ICU libiconv simdutf encoding_rs/encoding_c ztd.text
Handles Legacy Encodings
Handles UTF Encodings 🤨
Bounded and Safe Conversion API
Assumed Valid Conversion API
Unbounded Conversion API
Counting API
Validation API
Extensible to (Runtime) User Encodings
Bulk Conversions
Single Conversions
Custom Error Handling 🤨 🤨
Updates Input Range (How Much Read™) 🤨
Updates Output Range (How Much Written™)
Feature Set 👇 vs. Library 👉 boost.text utf8cpp Standard C Standard C++ Windows API
Handles Legacy Encodings 🤨 🤨
Handles UTF Encodings 🤨 🤨
Bounded and Safe Conversion API 🤨
Assumed Valid Conversion API
Unbounded Conversion API
Counting API 🤨
Validation API 🤨
Extensible to (Runtime) User Encodings
Bulk Conversions 🤨 🤨
Single Conversions
Custom Error Handling
Updates Input Range (How Much Read™)
Updates Output Range (How Much Written™)

Briefly covering the “🤨” for each library/API:

  • libiconv: error handling and insertion of replacements is implementation-defined, and the replacements are also implementation-defined, and whether or not it even does it is implementation-defined, and whether or not it’s any good is – you guessed it! — implementation-defined.
  • simdutf: it only reports how much output was written on success, and only reports how much input was read if you’re careful, making inserting custom handling a lot harder than is necessary.
  • encoding_rs: cannot handle UTF-32 from C or C++ (but gets it for free in Rust because you can convert to Rust char, which is a Unicode Scalar Value).
  • boost.text: this one has an “❌” for custom error handling, despite enabling it for its ranges, because its bulk transcoding functions refuse you the opportunity or chance to do that and they do not provide functions to allow you to change it.
  • utf8cpp: it does not provide counting and validation APIs for all UTF functions, so even if restricted to purely UTF functions, validation and counting must be done by the end-user, or through using iterators with a slower API.
  • Standard C: it’s trash.
  • Standard C++: provides next-to-nothing of its own that is not sourced from C, and when it does it somehow makes it worse. Also trash.
  • Windows API: it does not handle UTF-32. From the documentation of its Code Page identifiers (emphasis mine):

    12000 utf-32 Unicode UTF-32, little endian byte order; available only to managed applications
    12001 utf-32BE Unicode UTF-32, big endian byte order; available only to managed applications

    UTF-32 continues to be for losers, apparently!

This is the full spread. Every marker should be explained above; if something is missing, do let me know, because I am going to routinely reference this table as the Definitive™ Feature List for all of these libraries from now until I die and also in important articles and journals. I spent way too much time investigating these APIs, suffering through their horrible builds, and knick-knack-patty-whacking these APIs and benchmarks and investigations together. I most certainly never want to touch libiconv again, and even though I’m tired as hell I’ve already put “remake ztd.text in Rust so I can have an UTF-32 conversion as part of a Rust text library, For God’s Sake” on my list of things to do. (Or someone else will get to do it before I do, which would be grrreeat.)

But… Where’s Your C API?

Right. I said I was going to use all of this to see if we can make an API in C that matches the power of my C++ one, and learns all the necessary lessons from all the C and C++ APIs that litter the text encoding battlefield. A C API for working with text that covers all of the use cases and bases that already exist in the industry. Which is exactly what I did when I created ztd.cuneicode, a powerful C library that allows runtime extension and uncompromised speed while absorbing the lessons of Stepanov’s STL, iconv’s interface, and libogonek/ztd.text’s state handling apparatus. The time has come to explain the ultimate C encoding API to you

… Part 2!

Sorry! It turns out this article has quite literally surpassed 10,000 words and quite frankly there’s still a LOT to talk about. The next one might be another 10,000-word banger (and I sincerely do not want it to be because then that means I will be writing far past New Years and into 2023). So, the actual design of the C library, its benefits, and more, will all come later. But I won’t just leave you empty-handed! In fact, here’s a little teaser…

A benchmark teaser. Dear low-vision reader, I apologize for the crappy alt-text in this current release. I will go back and write a much more detailed one for these teaser graphs, but right now I am exhausted. Maybe I can beg Kate or ask someone else for help with this, because I am just energy-wasted.

Nyeheheh, such beautiful graphs…!

See you soon 💚.

]]>
<![CDATA[Last time we talked about encodings, we went in with a C++-like design where we proved that]]>
C23 is Finished: Here is What is on the Menu2022-07-31T00:00:00+00:002022-07-31T00:00:00+00:00https://thephd.dev/C23-the-end<![CDATA[

It’s That Blog Post. The release one, where we round up all the last of the features approved since the last time I blogged. If you’re new here, you’ll want to go check out these previous articles to learn more about what is/isn’t going into C23, many of them including examples, explanations, and some rationale:

The last meeting was pretty jam-packed, and a lot of things made it through at the 11th hour. We also lost quite a few good papers and features too, so they’ll have to be reintroduced next cycle, which might take us a whole extra 10 years to do. Some of us are agitating for a faster release cycle, mainly because we have 20+ years of existing practice we’ve effectively ignored and there’s a lot of work we should be doing to reduce that backlog significantly. It’s also just no fun waiting for __attribute__((cleanup(…))) (defer), statement expressions, better bitfields, wide pointers (a native pointer + size construct), a language-based generic function pointer type (GCC’s void(*)(void)) and like 20 other things for another 10 years when they’ve been around for decades.

But, I’ve digressed long enough: let’s not talk about the future, but the present. What’s in C23? Well, it’s everything (sans the typo-fixes the Project Editors - me ‘n’ another guy - have to do) present in N3047. Some of them pretty big blockbuster features for C (C++ will mostly quietly laugh, but that’s fine because C is not C++ and we take pride in what we can get done here, with our community.) The first huge thing that will drastically improve code is a combination-punch of papers written by Jens Gustedt and Alex Gilding.

N3006 + N3018 - constexpr for Object Definitions

Link.

I suppose most people did not see this one coming down the pipe for C. (Un?)Fortunately, C++ was extremely successful with constexpr and C implementations were cranking out larger and larger constant expression parsers for serious speed gains and to do more type checking and semantic analysis at compile-time. Sadly, despite many compilers getting powerful constant expression processors, standard C just continued to give everyone who ended up silently relying on those increasingly beefy and handsome compilers’ tricks a gigantic middle finger.

For example, in my last post about C (or just watching me post on Twitter), I explained how this:

const int n = 5 + 4;
int purrs[n];
// …

is some highly illegal contraband. This creates a Variable-Length Array (VLA), not a constant-sized array with size 9. What’s worse is that the Usual Compilers™ (GCC, Clang, ICC, MSVC, and most other optimizing compilers actually worth compiling with) were typically powerful enough to basically turn the code generation of this object – so long as you didn’t pass it to something expecting an actual Variably-Modified Types (also talked about in another post) – into working like a normal C array.

This left people relying on the fact that this was a C array, even though it never was. And it created enough confusion that we had to accept N2713 to add clarification to the Standard Library to tell people that No, Even If You Can Turn That Into A Constant Expression, You Cannot Treat It Like One For The Sake Of the Language. One way to force an error up-front if the compiler would potentially turn something into not-a-VLA behind-your-back is to do:

const int n = 5 + 4;
int purrs[n] = { 0 }; // 💥
// …

VLAs are not allowed to have initializers1, so adding one makes a compiler scream at you for daring to write one. Of course, if you’re one of those S P E E D junkies, this could waste precious cycles in potentially initializing your data to its 0 bit representation. So, there really was no way to win when dealing with const here, despite everyone’s mental model – thanks to the word’s origin in “constant” – latching onto n being a constant expression. Compilers “accidentally” supporting it by either not treating it as a VLA (and requiring the paper I linked to earlier to be added to C23 as a clarification), or treating it as a VLA but extension-ing and efficient-code-generating the problem away just resulted in one too many portability issues. So, in true C fashion, we added a 3rd way that was DEFINITELY unmistakable:

constexpr int n = 5 + 4;
int purrs[n] = { 0 }; // ✅
// …

The new constexpr keyword for C means you don’t have to guess at whether that is a constant expression, or hope your compiler’s optimizer or frontend is powerful enough to treat it like one to get the code generation you want if VLAs with other extensions are on-by-default. You are guaranteed that this object is a constant expression, and if it is not the compiler will loudly yell at you. While doing this, the wording for constant expressions was also improved dramatically, allowing:

  • compound literals (with the constexpr storage class specifier);
  • structures and unions with member access by .;
  • and, the usual arithmetic / integer constant expressions,

to all be constant expressions now.

Oh No, Those Evil Committee People are Ruining™ my Favorite LanguageⓇ with C++ NonsenseⒸ!

Honestly? I kind of wish I could ruin C sometimes, but believe it or not: we can’t!

Note that there are no function calls included in this, so nobody has to flip out or worry that we’re going to go the C++ route of stacking on a thousand different “please make this function constexpr so I can commit compile-time crimes”. It’s just for objects right now. There is interest in doing this for functions, but unlike C++ the intent is to provide a level of constexpr functions that is so weak it’s worse than even the very first C++11 constexpr model, and substantially worse than what GCC, Clang, ICC, and MSVC can provide at compile-time right now in their C implementations.

This is to keep it easy to implement evaluation in smaller compilers and prevent feature-creep like the C++ feature. C is also protected from additional feature creep because, unlike C++, there’s no template system. What justified half of the improvements to constexpr functions in C++ was “well, if I just rewrite this function in my favorite Functional Language – C++ Templates! – and tax the compiler even harder, I can do exactly what I want with worse compile-time and far more object file bloat”. This was a scary consideration for many on the Committee, but we will not actually go that direction precisely because we are in the C language and not C++.

You cannot look sideways and squint and say “well, if I just write this in the most messed up way possible, I can compute a constant expression in this backdoor Turing complete functional language”; it just doesn’t exist in C. Therefore, there is no prior art or justification for an ever-growing selection of constant expression library functions or marked-up headers. Even if we get constexpr functions, it will be literally and intentionally be underpowered and weak. It will be so bad that the best you can do with it is write a non-garbage max function to use as the behind-the-scenes for a max macro with _Generic. Or, maybe replace a few macros with something small and tiny.

Some people will look at this and go: “Well. That’s crap. The reason I use constexpr in my C++-like-C is so I can write beefy compile-time functions to do lots of heavy computation once at a compile-time, and have it up-to-date with the build at the same time. I can really crunch a perfect hash or create a perfect table that is hardware-specific and tailored without needing to drop down to platform-specific tricks. If I can’t do that, then what good is this?” And it’s a good series of questions, dear reader. But, my response to this for most C programmers yearning for better is this:

we get what we shill for.

With C we do not ultimately have the collective will or implementers brave enough to take-to-task making a large constant expression parser, even if the C language is a lot simpler to write one for compared to C++. Every day we keep proclaiming C is a simple and beautiful language that doesn’t need features, even features that are compile-time only with no runtime overhead. That means, in the future, the only kind of constant functions on the table are ones with no recursion, only one single statement allowed in a function body, plus additional restrictions to get in your way. But that’s part of the appeal, right? The compilers may be weak, the code generation may be awful, most of the time you have to abandon actually working in C and instead just use it as a macro assembler and drop down to bespoke, hand-written platform-specific assembly nested in a god-awful compiler-version-specific #ifdef, but That’s The Close-To-The-Metal C I’m Talkin’ About, Babyyyyy!!

“C is simple” also means “the C standard is underpowered and cannot adequately express everything you need to get the job done”. But if you ask your vendor nicely and promise them money, cookies, and ice cream, maybe they’ll deign to hand you something nice. (But it will be outside the standard, so I hope you’re ready to put an expensive ring 💍 on your vendor’s finger and marry them.)

N3038 - Introduce Storage Classes for Compound Literals

Link.

Earlier, I sort of glazed over the fact that Compound Literals are now part of things that can be constant expressions. Well, this is the paper that enables such a thing! This is a feature that actually solves a problem C++ was having as well, while also fixing a lot of annoyances with C. For those of you in the dark and who haven’t caught up with C99, C has a feature called Compound Literals. It’s a way to create any type - usually, structures - that have a longer lifetime than normal and can act as a temporary going into a function. They’re used pretty frequently in code examples and stuff done by Andre Weissflog of sokol_gfx.h fame, who writes some pretty beautiful C code (excerpted from the link):

#define SOKOL_IMPL
#define SOKOL_GLCORE33
#include <sokol_gfx.h>
#define GLFW_INCLUDE_NONE
#include <GLFW/glfw3.h>

int main(int argc, char* argv[]) {

	/* create window and GL context via GLFW */
	glfwInit();
	/* … CODE ELIDED … */

	/* setup sokol_gfx */
	sg_setup(&(sg_desc){0}); // ❗ Compound Literal

	/* a vertex buffer */
	const float vertices[] = {
		// positions            // colors
		0.0f,  0.5f, 0.5f,     1.0f, 0.0f, 0.0f, 1.0f,
		0.5f, -0.5f, 0.5f,     0.0f, 1.0f, 0.0f, 1.0f,
		-0.5f, -0.5f, 0.5f,     0.0f, 0.0f, 1.0f, 1.0f
	};
	sg_buffer vbuf = sg_make_buffer(&(sg_buffer_desc){  // ❗ Compound Literal
		.data = SG_RANGE(vertices)
	});

	/* a shader */
	sg_shader shd = sg_make_shader(&(sg_shader_desc){  // ❗ Compound Literal
		.vs.source =
			"#version 330\n"
			"layout(location=0) in vec4 position;\n"
			"layout(location=1) in vec4 color0;\n"
			"out vec4 color;\n"
			"void main() {\n"
			"  gl_Position = position;\n"
			"  color = color0;\n"
			"}\n",
		.fs.source =
			"#version 330\n"
			"in vec4 color;\n"
			"out vec4 frag_color;\n"
			"void main() {\n"
			"  frag_color = color;\n"
			"}\n"
	});

	/* … CODE ELIDED … */
	return 0;
}

C++ doesn’t have them (though GCC, Clang, and a few other compilers support them out of necessity). There is a paper by Zhihao Yuan to support Compound Literal syntax in C++, but there was a hang up. Compound Literals have a special lifetime in C called “block scope” lifetime. That is, compound literals in functions behave as-if they are objects created in the enclosing scope, and therefore retain that lifetime. In C++, where we have destructors, unnamed/invisible C++ objects being l-values (objects whose address you can take) and having “Block Scope” lifetime (lifetime until where the next } was) resulted in the usual intuitive behavior of C++’s temporaries-passed-to-functions turning into a nightmare.

For C, this didn’t matter and - in many cases - the behavior was even relied on to have longer-lived “temporaries” that survived beyond the duration of a function call to, say, chain with other function calls in a macro expression. For C++, this meant that some types of RAII resource holders – like mutexen/locks, or just data holders like dynamic arrays – would hold onto the memory for way too long.

The conclusion from the latest conversation was “we can’t have compound literals, as they are, in C++, since C++ won’t take the semantics of how they work from the C standard in their implementation-defined extensions and none of the implementations want to change behavior”. Which is pretty crappy: taking an extension from C’s syntax and then kind of just… smearing over its semantics is a bit of a rotten thing to do, even if the new semantics are better for C++.

Nevertheless, Jens Gustedt’s paper saves us a lot of the trouble. While default, plain compound literals have “block scope” (C) or “temporary r-value scope” (C++), with the new storage-class specification feature, you can control that. Borrowing the sg_setup function above that takes the sg_desc structure type:

#include <sokol_gfx.h>

SOKOL_GFX_API_DECL void sg_setup(const sg_desc *desc);

we are going to add the static modifier, which means that the compound literal we create has static storage duration:

int main (int argc, const char* argv[]) {
	/* … CODE ELIDED … */
	/* setup sokol_gfx */
	sg_setup(&(static sg_desc){0}); // ❗ Compound Literal	
	/* … CODE ELIDED … */
}

Similarly, auto, thread_local, and even constexpr can go there. constexpr is perhaps the most pertinent to people today, because right now using compound literals in initializers for const data is technically SUPER illegal:

typedef struct crime {
    int criming;
} crime;

const crime crimes = (crime){ 11 }; // ❗ ILLEGAL!!

int main (int argc, char* argv[]) {
    return crimes.criming;
}

It will work on a lot of compilers (unless warnings/errors are cranked up), but it’s similar to the VLA situation. The minute a compiler decides to get snooty and picky, they have all the justification in the world because the standard is on their side. With the new constexpr specifier, both structures and unions are considered constant expressions, and it can also be applied to compound literals as well:

typedef struct crime {
    int criming;
} crime;

const crime crimes = (constexpr crime){ 11 }; // ✅ LEGAL BABYYYYY!

int main (int argc, char* argv[]) {
    return crimes.criming;
}

Nice.

N3017 - #embed

Link.

Go read this to find out all about the feature and how much of a bloody pyrrhic victory it was.

N3033 - Comma Omission and Deletion (__VA_OPT__ in C and Preprocessor Wording Improvements)

Link.

This paper was a long time coming. C++ got it first, making it slightly hilarious that C harps on standardizing existing practice so much but C++ tends to beat it to the punch for features which solve long-standing Preprocessor shenanigans. If you’ve ever had to use __VA_ARGS__ in C, and you needed to pass 0 arguments to that ..., or try to use a comma before the __VA_ARGS__, you know that things got genuinely messed up when that code had to be ported to other platforms. It got a special entry in GCC’s documentation because of how blech the situation ended up being:

… GNU CPP permits you to completely omit the variable arguments in this way. In the above examples, the compiler would complain, though, since the expansion of the macro still has the extra comma after the format string.

To help solve this problem, CPP behaves specially for variable arguments used with the token paste operator, ‘##’. If instead you write

#define debug(format, ...) fprintf (stderr, format, ## __VA_ARGS__)

and if the variable arguments are omitted or empty, the ‘##’ operator causes the preprocessor to remove the comma before it. If you do provide some variable arguments in your macro invocation, GNU CPP does not complain about the paste operation and instead places the variable arguments after the comma. …

This is solved by the use of the C++-developed __VA_OPT__, which expands out to a legal token sequence if and only if the arguments passed to the variadic are not empty. So, the above could be rewritten as:

#define debug(format, ...) fprintf (stderr, format __VA_OPT__(,) __VA_ARGS__)

This is safe and contains no extensions now. It also avoids any preprocessor undefined behavior. Furthermore, C23 allows you to pass nothing for the argument, giving users a way out of the previous constraint violation and murky implementation behaviors. It works both in the case where you write debug("meow") and in the case where you write debug("meow", ) (with the empty argument passed explicitly). It’s a truly elegant design and we have Thomas Köppe to thank for bringing it to both C and C++ for us. This will allow a really nice standard behavior for macros, and is especially good for formatting macros that no longer need to do weird tricks to special-case for having no arguments.

Which, speaking of 0-argument functions…

N2975 - Relax requirements for variadic parameter lists

Link.

This paper is pretty simple. It recognizes that there’s really no reason not to allow

void f();

to exist in C. C++ has it, and all the arguments get passed successfully, and nobody’s lost any sleep over it. It was also an important filler since, as talked about in old blog posts, we have finally taken the older function call style and put it down after 30+ years of being in existence as a feature that never got to see a single proper C standard release non-deprecated. This was great! Except, as that previous blog post mentions, we had no way of having a general-purpose Application Binary Interface (ABI)-defying function call anymore. That turned out to be bad enough that after the deprecation and removal we needed to push for a fix, and lucky for us void f(...); had not made it into standard C yet.

So, we put it in. No longer needing the first parameter, and no longer requiring it for va_start, meant we could provide a clean transition path for everyone relying on K&R functions to move to the ...-based function calls. This means that mechanical upgrades of old codebases - with tools - are now on the table for migrating old code to C23-compatibility, while finally putting K&R function calls – and all their lack of safety – in the dirt. 30+ years, but we could finally capitalize on Dennis M. Ritchie’s dream here, and put these function calls to bed.

Of course, compilers that support both C and C++, and compilers that already had void f(...); functions as an extension, may have deployed an ABI that is incompatible with the old K&R declarations of void f();. This means that a mechanical upgrade will need to check with the vendor and:

  • make sure that the new declaration occupies the same calling convention as the old one;
  • or, if the person calling the function cannot update the other side (which might be pulling in assembly or using a different language),

then the upgrade that goes through to replace every void f(); may need to also add a vendor attribute to make sure the function calling convention is compatible with the old K&R one. Personally, I suggest:

[[vendor::kandr]] void f();

, or something similar. But, ABI exists outside the standard: you’ll need to talk to your vendor about that one when you’re ready to port to an exclusively-post-C23 world. (I doubt anyone will compile for an exclusively C23-and-above world, but it is nice to know there is a well-defined migration path for users still hooked on a 30+ year deprecated feature). Astute readers may ask: if they don’t have a parameter to go off of, how do they commit stack-walking sins to get to the arguments? And, well, the answer is you still can: ztd.vargs has a proof-of-concept of that (on Windows). You still need some way to get the stack pointer in some cases, but that’s been something compilers have provided as an intrinsic for a while now (or something you could do by committing register crimes). In ztd.vargs, I had to drop down into assembly to start fishing for stuff more directly when I couldn’t commit more direct built-in compiler crimes. So, this is everyone’s chance to get really in touch with that bare-metal they keep bragging about for C. Polish off those dusty manuals and compiler docs, it’s time to get intimately familiar with all the sins the platform is doing on the down-low!

N3029 - Improved Normal Enumerations

Link.

What can I say about this paper, except…

What The Hell, Man?

It is absolutely bananas to me that C – the systems programming language, the language where const int n = 5 is not a constant expression so people tell you to use enum { n = 5 } instead – just had this situation going on, since its inception. “16 bits is enough for everyone” is what Unicode said, and we paid for it by having UTF-16, a maximum limit of 21 bits for our Unicode code points (“Unicode scalar values” if you’re a nerd), and the entire C and C++ standard libraries with respect to text encoding just being completely impossible to use. (On top of the library not working for Big5-HKSCS as a multibyte/narrow encoding). So of course, when I finally sat down with the C standard and read through the thing, it was infuriating to find that the exact wording in there for enumeration constants was “must be representable by an int”. 32 bits may be good, but there were plenty of platforms where int was still 16 bits. Worse, if you put code into a compiler where the value was too big, not only would you not get errors on most compilers, you’d sometimes just get straight up miscompiles. This is not because the compiler vendor is a jerk or bad at their job; the standard literally just phones it in, and every compiler from ICC to MSVC let you go past the low 16-bit limit and occasionally even exceed the 32-bit INT_MAX without so much as a warning. It was a worthless clause in the standard,

and it took a lot out of me to fight to correct this one.

The paper next in this blog post was seen as the fix, and we decided that the old code – the code where people used 0x10000 as a bit flag – was just going to be non-portable garbage. Did you go to a compiler where int is 16 bits and INT_MAX is smaller than 0x10000? Congratulations: your code was non-standard, you’re now in implementation-defined territory, pound sand! It took a lot of convincing, nearly got voted down the first time we took a serious poll on it (just barely scraped by with consensus), but the paper rolled in to C23 at the last meeting. A huge shout out to Aaron Ballman who described this paper as “value-preserving”, which went a really long way in connecting everyone’s understanding of how this was meant to work. It added a very explicit set of rules on how to do the computation of the enumeration constant’s value, so that it was large enough to handle constants like 0x10000 or ULLONG_MAX. It keeps it to be int wherever possible to preserve the semantics of old code, but if someone exceeds the size of int then it’s actually legal to upgrade the backing type now:

enum my_values {
	a = 0, // 'int'
	b = 1, // 'int'
	c = 3, // 'int'
	d = 0x1000, // 'int'
	f = 0xFFFFF, // 'int' still
	g, // implicit +1, on 16-bit platform upgrades type of the constant here
	e = g + 24, // uses "current" type of g - 'long' or 'long long' - to do math and set value to 'e'
	i = ULLONG_MAX // 'unsigned long' or 'unsigned long long' now
};

When the enumeration is completed (the closing brace), the implementation gets to select a single type that my_values is compatible with, and that’s the type used for all the enumerations here if int is not big enough to hold ULLONG_MAX. That means this next snippet:

int main (int argc, char* argv[]) {
	// when enum is complete,
	// it can select any type
	// that it wants, so long as its
	// big enough to represent the type
	return _Generic(a,
		unsigned long: 1,
		unsigned long long: 0,
		default: 3);
}

can still return any of 1, 0, or 3. But, at the very least, you know a, or g or i will never truncate or lose the value you put in as a constant expression, which was the goal. The type was always implementation-defined (see: -fshort-enum shenanigans of old). All of that old code that used to be wrong is now no longer wrong. All of those people who tried to write wrappers/shims for OpenGL who used enumerations for their integer-constants-with-nice-identifier-names are also now correct, so long as they are using C23. (This is also one reason why the OpenGL constants in some of the original OpenGL code are written as preprocessor defines (#define GL_ARB_WHATEVER …) and not enumerations. Enumerations would break with any of the OpenGL values above 0xFFFF on embedded platforms; they had to make the move to macros, otherwise it was busted.)

Suffice to say I’m extremely happy this paper got in and that we retroactively fixed a lot of code that was not supposed to be compiling on a lot of platforms, at all. The underlying type of an enumeration can still be some implementation-defined integer type, but that’s what this next paper is for…

N3030 - Enhanced Enumerations

Link.

This was the paper everyone was really after. It also got in, and rather than being about “value-preservation”, it was about type preservation. I could write a lot, but whitequark – as usual – describes it best:

i realized today that C is so bad at its job that it needs the help of C++ to make some features of its ABI usable (since you can specify the width of an enum in C++ but not C)

Catherine (@whitequark), May 25th, 2020

C getting dumpstered by C++ is a common occurrence, but honestly? For a feature like this? It is beyond unacceptable that C could not give a specific type for its enumerations, and therefore made the use of enumerations in e.g. bit fields or similar poisonous, bad, and non-portable. There’s already so much to contend with in C to write good close-to-the-hardware code: now we can’t even use enumerations portably without 5000 static checks and special flags to make sure we got the right type for our enumerations? Utter hogwash and a blight on the whole C community that it took this long to fix the problem. But, as whitequark also stated:

in this case the solution to “C is bad at its job” is definitely to “fix C” because, even if you hate C so much you want to eradicate it completely from the face of the earth, we’ll still be stuck with the C ABI long after it’s gone

Catherine (@whitequark), May 25th, 2020

It was time to roll up my sleeves and do what I always did: take these abominable programming languages to task for their inexcusably poor behavior. The worst part is, I almost let this paper slip by because someone else – Clive Pygott – was handling it. In fact, Clive was handling this even before Catherine made the tweet; N2008, from…

oh my god, it’s from 2016.

I had not realized Clive had been working on it this long until, during one meeting, Clive – when asked about the status of an updated version of this paper – said (paraphrasing) “yeah, I’m not carrying this paper forward anymore, I’m tired, thanks”.

That’s not, uh, good. I quickly snapped up in my chair, slammed the Mute-Off button, and nearly fumbled the mechanical mute on my microphone as I sputtered a little so I could speak up: “hey, uh, Clive, could you forward me all the feedback for that paper? There’s a lot of people that want this feature, and it’s really important to them, so send me all the feedback and I’ll see if I can do something”. True to Clive’s word, minutes after the final day on the mid-2021 meeting, he sent me all the notes. And it was…

… a lot.

I didn’t realize Clive had this much push back. It was late 2021. 2022 was around the corner, we were basically out of time to workshop stuff. I frequently went to twitter and ranted about enumerations, from October 2021 and onward. The worst part is, most people didn’t know, so they just assumed I was cracked up about something until I pointed them to the words in the standard and then revealed all the non-standard behavior. Truly, the C specification for enumerations was something awful.

Of course, no matter how much I fumed, anger is useless without direction.

I honed that virulent ranting into a weapon: two papers that eventually became what you’re reading about now. N3029 and N3030 were the crystallization of how much I hated this part of C, hated its specification, loathed the way the Committee worked, and despised a process that led us for over 30 years to end up in this exact spot. This man – Clive – had been at this since 2016. It’s 2022. 5 years in, he gave up trying to placate all the feedback, and that left me only 1 year to clean this stuff up.

Honestly, if I didn’t have a weird righteous anger, the paper would’ve never made it.

Never underestimate the power of anger. A lot of folk and many cultures spend time trying to get you to “manage your emotions” and “find serenity”, often to the complete exclusion of getting mad at things. You wanna know what I think?

🖕 ““Serenity””

Serenity, peace, all of that can be taken and shoved where the sun don’t shine. We were delivered a hot garbage language, made Clive Pygott – one of the smartest people working on the C Memory Model – gargle Committee feedback for 5 years, get stuck in a rocky specification, and ultimately abandon the effort. Then, it took some heroic editing and WAY too much of 3 people’s time – Robert Seacord, Jens Gustedt, and Joseph Myers – just to hammer it into shape while I dragged that thing kicking and screaming across the finish line. Even I can’t keep that up for a long time, especially with all the work I also had to do with #embed and Modern Bit Utilities and 10+ other proposals I was fighting to fix. “Angry” is quite frankly not a strong enough word to describe a process that can make something so necessary spin its wheels for 5 years. It’s absolutely bananas this is how ISO-based, Committee-based work has to be done. To all the other languages eyeing the mantle of C and C++, thinking that an actual under-ISO working group will provide anything to them:

Do. Not.

Nothing about ISO or IEC or its various subcommittees incentivizes progress. It incentivizes endless feedback loops, heavyweight processes, individual burnout, and low return-on-investment. Do anything – literally anything – else with your time. If you need the ISO sticker because you want ASIL A/B/C/D certification for your language, then by all means: figure out a way to make it work. But keep your core process, your core feedback, your core identity out of ISO. You can standardize existing practices way better than this, and without nearly this much gnashing of teeth and pullback. No matter how politely it’s structured, the policies of ISO and the way it expects Committees to be structured are a deeply-embedded form of bureaucratic violence against the least of these, its contributors, and you deserve better than this. So much of this CIA sabotage field manual’s list:

A picture of the CIA handbook covering lessons (1) through (8), which you can read at the link which covers more detail. From the pictured page: (1) Insist on doing everything through “channels.” Never permit short-cuts to be taken in order to expedite decisions. (2) Make “speeches.” Talk as frequently as possible and at great length. Illustrate your “points” by long anecdotes and accounts of personal experiences. (3) When possible, refer all matters to committees, for “further study and consideration.” Attempt to make the committee as large as possible — never less than five. (4) Bring up irrelevant issues as frequently as possible. (5) Haggle over precise wordings of communications, minutes, resolutions. (6) Refer back to matters decided upon at the last meeting and attempt to re-open the question of the advisability of that decision. (7) Advocate “caution.” Be “reasonable” and urge your fellow-conferees to be “reasonable” and avoid haste which might result in embarrassments or difficulties later on. (8) Be worried about the propriety of any decision - raise the question of whether such action as is contemplated lies within the jurisdiction of the group or whether it might conflict with the policy of some higher echelon.

should not have a directly-applicable analogue that describes how an International Standards Organization conducts business. But if it is what it is, then it’s time to roll up the sleeves. Don’t be sad. Get mad. Get even.

Anyways, enumerations. You can add types to them:

enum e : unsigned short {
    x
};

int main (int argc, char* argv[]) {
    return _Generic(x, unsigned short: 0, default: 1);
}

Unlike before, this will always return 0 on every platform, no exceptions. You can stick it in structures and unions and use it with bitfields and as long as your implementation is not completely off its rocker, you will get entirely dependable alignment, padding, and sizing behavior. Enjoy! 🎉

N3020 - Qualifier-preserving Standard Functions

Link.

This is a relatively simple paper, but closes up a hole that’s existed for a while. Nominally, it’s undefined-behavior to modify an originally-const array – especially a string literal – through a non-const pointer. So,

why exactly were strchr, bsearch, strpbrk, strrchr, strstr, memchr, and their wide counterparts basically taking const in and stripping it out of the return value?

The reason is because these had to be singular functions that defined a single externally-visible function call. There’s no overloading in C, so back in the old days when these functions were cooked up, you could only have one. We could not exclude people who wanted to write into the returned pointers of these functions, so we made the optimal (at the time) choice of simply removing the const from the return values. This was not ideal, but it got us through the door.

Now, with type-generic macros in the table, we do not have this limitation. It was just a matter of someone getting inventive enough and writing the specification up for it, and that’s exactly what Alex Gilding did! It looks a little funny in the standardese, but:

#include <string.h>
QChar *strchr(QChar *s, int c);

This describes that if you pass in a pointer to const-qualified char, you get back a pointer to const-qualified char; similarly if there is no const. It’s a nice little addition that can help improve read-only memory safety. It might mean that people using any one of the aforementioned functions as a free-and-clear “UB-cast” to trick the compiler will have to fess up and use a real cast instead.

N3042 - Introduce the nullptr constant

Link.

To me, this one was a bit obviously in need, though not everyone thinks so. For a long time, people liked using NULL, (void*)0, and literal 0 as the null pointer constant. And they are certainly not wrong to do so: the first one in that list is a preprocessor macro resolving to either of the other 2. While nominally it would be nice if it always resolved to (void*)0, compatibility for older C library implementations and the code built on top of it demands that we not change NULL. Of course, this made for some interesting problems in portability:

#include <stdio.h>

int main (int argc, char* argv[]) {
	printf("ptr: %p", NULL); // oops
	return 0;
}

Now, nobody’s passing NULL directly to printf(…), but in a roundabout way we had NULL - the macro itself - filtering down into function calls with variadic arguments. Or, more critically, we had people just passing straight up literal 0. “It’s the null pointer constant, that’s perfectly fine to pass to something expecting a pointer, right?” This was, of course, wrong. It would be nice if this was true, but it wasn’t, and on certain ABIs that had consequences. The same registers and stack locations for passing a pointer were not always the same as were used for literal 0 or - worse - they were the same, but the literal 0 didn’t fill in all the expected space of the register (32-bit vs. 64-bit, for example). That meant people doing printf("%p", 0); were in many ways relying purely on the luck of their implementation, because it was actual undefined behavior! Whoops.

nullptr and the associated nullptr_t type in <stddef.h> fixes that problem. You can specify nullptr, and it’s required to have the same underlying representation as the null pointer constant in char* or void* form. This means it will always be passed correctly, for all ABIs, and you won’t read garbage bits. It also aids in the case of _Generic: with NULL being implementation-defined, you could end up with void* or 0. With nullptr, you get exactly nullptr_t: this means you don’t need to lose the _Generic slot for both int or void*, especially if you’re expecting actual void* pointers that point to stuff. Small addition, gets rid of some Undefined Behavior cases, nice change.

Someone recently challenged me, however: they said this change is not necessary and bollocks, and we should simply force everyone to define NULL to be void*. I said that if they’d like that, then they should go to those vendors themselves and ask them to change and see how it goes. They said they would, and they’d like a list of vendors defining NULL to be 0. Problem: quite a few of them are proprietary, so here’s my Open Challenge:

if you (yes, you!!) have got a C standard library (or shim/replacement) where you define NULL to be 0 and not the void-pointer version, send me a mail and I’ll get this person in touch with you so you can duke it out with each other. If they manage to convince enough vendors/maintainers, I’ll convince the National Body I’m with to write a National Body Comment asking for nullptr to be rescinded. Of course, they’ll need to not only reach out to these people, but convince them to change their NULL from 0 to ((void*)0), which. Well.

Good luck to the person who signed up for this.

N3022 - Modern Bit Utilities

Link.

Remember how there were all those instructions available since like 1978 – you know, in the Before Times™, before I was even born and my parents were still young? – and how we had easy access to them through all our C compilers because we quickly standardized existing practice from last century?

… Yeah, I don’t remember us doing that either.

Modern Bit Utilities isn’t so much “modern” as “catching up to 40-50 years ago”. There were some specification problems and I spent way too much time fighting on so many fronts that, eventually, something had to suffer: although the paper provides wording for Rotate Left/Right, 8-bit Endian-Aware Loads/Stores, and 8-bit Memory Reversal (fancy way of saying, “byteswap”), the specification had enough tiny issues in it that opposition mounted to prevent it from being included-and-then-fixed-up-during-the-C23-commenting-period, or just included at all. I was also too tired by the last meeting day, Friday, to actually try to fight hard for it, so even though a few other members of WG14 sacrificed 30 minutes of their block to get Rotate Left/Right in, others insisted that they wanted to do the Rotate Left/Right functions in a different style. I was too tired to fight too hard over it, so I decided to just defer it to post-C23 and come back later.

Sorry.

Still, with the new <stdbit.h>, this paper provides:

  • Endian macros (__STDC_ENDIAN_BIG__, __STDC_ENDIAN_LITTLE__, __STDC_ENDIAN_NATIVE__)
  • stdc_popcount
  • stdc_bit_width
  • stdc_leading_zeros/stdc_leading_ones/stdc_trailing_zeros/stdc_trailing_ones
  • stdc_first_leading_zero/stdc_first_leading_one/stdc_first_trailing_zero/stdc_first_trailing_one
  • stdc_has_single_bit
  • stdc_bit_ceil
  • stdc_bit_floor

“Where’s the endian macros for Honeywell architectures or PDP endianness?” You can get that if __STDC_ENDIAN_NATIVE__ isn’t equal to either the little OR the big macro:

#include <stdbit.h>
#include <stdio.h>

int main (int argc, char* argv[]) {
	if (__STDC_ENDIAN_NATIVE__ == __STDC_ENDIAN_LITTLE__) {
		printf("little endian! uwu\n");
	}
	else if (__STDC_ENDIAN_NATIVE__ == __STDC_ENDIAN_BIG__) {
		printf("big endian OwO!\n");
	}
	else {
		printf("what is this?!\n");
	}
	return 0;
}

If you fall into the last branch, you have some weird endianness. We do not provide a macro for that name because there is too much confusion around what the exact proper byte order for “PDP Endian” or “Honeywell Endian” or “Bi Endian” would end up being.

“What’s that ugly stdc_ prefix?”

For the bit functions, a prefix was added to them in the form of stdc_…. Why?

popcount is a really popular function name. If the standard were to take it, we’d effectively be loading up a gun to shoot a ton of existing codebases right in the face. The only proper resolution I could get to the problem was adding stdc_ in front. It’s not ideal, but honestly it’s the best I could do on short notice. We do not have namespaces in C, which means any time we add functionality we basically have to square off with users. It’s certainly not a fun part of proposal development: thus, we get a stdc_ prefix. Perhaps it will be the first of many functions to use such prefixes so we do not have to step on users’ toes, but I imagine for enhancements and fixes to existing functionality, we will keep writing function names by the old rules. This will be decided later by a policy paper, but that policy paper only applies to papers after C23 (and after we get to have that discussion).

N3006 + N3007 - Type Inference for object definitions

Link.

This is a pretty simple paper, all things considered. If you ever used __auto_type from GCC: this is that, with the name auto. I describe it like this because it’s explicitly not like C++’s auto feature: it’s significantly weaker and far more limited. Whereas C++’s auto allows you to declare multiple variables on the same line and even deduce partial qualifiers / types with it (such as auto* ptr1 = thing, *ptr2 = other_thing; to demand that thing and other_thing are some kind of pointer or convertible to one), the C version of auto is modeled pretty directly after the weaker version of __auto_type. You can only declare one variable at a time. There’s no pointer-capturing. And so on, and so forth:

int main (int argc, char* argv[]) {
	auto a = 1;
	return a; // returns int, no mismatches
}

It’s most useful in macro expressions, where it lets you avoid duplicating expressions. Instead of:

#define F(_NAME, ARG, ARG2, ARG3) \
	typeof(ARG + (ARG2 || ARG3)) _NAME = ARG + (ARG2 | ARG3);

int main (int argc, char* argv[]) {
	F(a, 1, 2, 3);
	return a;
}

you can write:

#define F(_NAME, ARG, ARG2, ARG3) \
	auto _NAME = ARG + (ARG2 | ARG3);

int main (int argc, char* argv[]) {
	F(a, 1, 2, 3);
	return a;
}

Being less prone to subtle or small errors that may not be caught by your compiler is a good thing when working with specific expressions. (You may have noticed the left-hand side of the _NAME definition in the first version had a subtle typo. If you did: congratulations! If you didn’t: well, auto is for you.) Expressions in macros can get exceedingly complicated, and worse, if unnamed structs or similar are being used, it can be hard-to-impossible to name their types. auto makes it possible to grasp these types and use them properly, resulting in a smoother experience.

Despite being a simple feature, I expect this will be one of the most divisive for C programmers. People already took to the streets in a few places to declare C a dead language, permanently ruined by this change. And, as a Committee member, if that actually ends up being the case? If this actually ends up completely destroying C for any of the reasons people have against auto and type inference for a language that quite literally just let you completely elide types in function calls and gave you “implicit int” behavior that compilers today still have to support so that things like OpenSSL can still compile?2

A picture of an anthropomorphic sheep, their eyes squinted small as they say "Heh.". They look EXTREMELY smug, a not-too-large grin with the corners of their mouth turned up quite high.

Don’t threaten me with a good time, now.

N2897 - memset_explicit

Link.

memset_explicit is memset_s from Annex K, without the Annex K history/baggage. It serves functionally the same purpose, too. It took a lot (perhaps too much) discussion, but Miguel Ojeda pursued it all the way to the end. So, now we have a standard, mandated, always-present memset_explicit that can be used in security-sensitive contexts, provided your compiler and standard library implementers work together to not Be Evil™.

Hoorah! 🎉

N2888 - Exact-width Integer Types May Exceed (u)intmax_t

Link.

The writing has been on the wall for well over a decade now; intmax_t and uintmax_t have been inadequate for the entire industry and have been consistently limiting the evolution of C’s integer types year over year, affecting downstream languages. While we cannot exempt every single integer type from the trappings of intmax_t and uintmax_t, we can at least bless the intN_t types and uintN_t types so they can go beyond what the two max types handle. There is active work in this area to allow us to transition to a better ABI and let these two types live up to their promises, but for now the least we could do is let the vector extensions and extended compiler modes for uint128_t, uint256_t, uint512_t, etc. etc. all get some time in the sun and out of the (u)intmax_t shadow.

This doesn’t help for the preprocessor, though, since you are still stuck with the maximum value that intmax_t and uintmax_t can handle. Integer literals and expressions will still be stuck dealing with this problem, but at the very least there should be some small amount of portability between the Beefy Machines™ and the presence of the newer UINT128_WIDTH and such macros.

Not the best we can do, but progress in the right direction! 🎉

And That’s All I’m Writing About For Now

Note that I did not say “that is it”: quite a few more features made it in; it’s just that my hands are tired and there are a lot of accepted papers. I also do not feel I can do some of them great justice, and quite frankly the papers themselves make better explanations than I do. Particularly, N2956 - unsequenced functions is a really interesting paper that can enable some intense optimizations with user attribute markup. Its performance improvements can also be applied locally:

#include <math.h>
#include <fenv.h>

inline double distance (double const x[static 2]) [[reproducible]] {
	#pragma FP_CONTRACT OFF
	#pragma FENV_ROUND  FE_TONEAREST
	// We assert that sqrt will not be called with invalid arguments
	// and the result only depends on the argument value.
	extern typeof(sqrt) [[unsequenced]] sqrt;
	return sqrt(x[0]*x[0] + x[1]*x[1]);
}

I’ll leave the paper to explain how exactly that’s supposed to work, though! On top of that, we also removed Trigraphs??! (N2940) from C, and we made it so the _BitInt feature can be used with bit fields (N2969, nice). (If you don’t know what Trigraphs are, consider yourself blessed.)

Another really consequential paper is the Tag Compatibility paper by Martin Uecker, N3037. It makes for defining generic data structures through macros a lot easier, and does not require a pre-declaration in order to use it nicely. A lot of people were thrilled about this one and picked up on the improvement immediately: it helps us get one step closer to maybe having room to start shipping some cool container libraries in the future. You should be on the lookout for when compilers implement this, and rush off to the races to start developing nicer generic container libraries for C in conjunction with all the new features we put in!

There is also a lot of functionality that did not make it, such as Unicode Functions, defer, Lambdas/Blocks/Nested Functions, wide function pointers, constexpr functions, the byteswap and other low-level bit functionality I spoke of before, statement expressions, additional macro functionality, break break (or something like it), size_t literals, __supports_literal, Transparent Aliases, and more.

But For Now?

My work is done. I’ve got to go take a break and relax. You can find the latest draft copy of the Committee Draft Standard N3047 here. It’s probably filled with typos and other mistakes; I’m not a great project editor, honestly, but I do try, and I guess that’s all I can do for all of us. That’s it for me and C for the whole year. So now, 3

A transformation sequence. An anthropomorphic schoolgirl sheep with lavender-purple hair runs with a piece of toast in their mouth, looking rushed and hurried, wearing a big pink bow over their chest with the typical white blouse over a knee-length blue skirt. Their image fades and drifts off to the right, slowly turning before they end up fully transformed into what looks like a very realistic sheep, whose floof is the same color as the lavender-purple hair from before.

it’s sleepy time. Nighty night, and thanks for coming on this wild ride with me 💚4.

Footnotes

  1. except for {}, which is a valid initializer as of C23 thanks to a different paper I wrote. This was meant to properly 0 out VLA data without requiring memset, and it is safer because it includes no elements-to-initialize in its list (which means a VLA of size 0, or a VLA that is “too small” to fit the initializer, need not become some kind of weird undefined behavior / “implementation-defined constraint violation” sort of deal). 

  2. “Heh” Anya-style Art by Stratica from Twitter. 

  3. Animorph Posting by Aria Beingessner, using art drawn by lilwilbug for me! 

  4. Title photo by Terje Sollie from Pexels. 

]]>