wstring_convert sucks #571

ThePhD · 2018-01-27T16:07:30Z

And so does codecvt.

It's time to write some utf8/16/32 conversions, seeing as there's literally no simple header-only library to perform just these conversions without a million years of baggage stacked on top of it.

ThePhD · 2018-01-29T03:23:40Z

It's hilarious: there's TONS of utf8 routines out there. But nobody -- NOBODY -- thought they'd make a minimal UTF code point conversion library, with none of the extra garbage involved! Goodness... gracious. I guess I'm writing one myself.

sagamusix · 2018-01-29T18:59:35Z

Maybe our FromUTF8/ToUTF8 implementations would help you avoid having to write most of the conversion stuff yourself: https://github.com/OpenMPT/openmpt/blob/master/common/mptString.cpp#L842

elvea-dev · 2018-01-29T22:38:43Z

@ThePhD What's your specific problem with codecvt? For what it's worth, I use the following routines, and on Windows I reinterpret_cast the char16_t data to wchar_t and pass it to Windows API functions (such as _wfopen). That works fine as far as I can tell.

Having said that, I agree that Unicode support in standard C++ is a bad joke :-(

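#include <codecvt> // std::codecvt_utf8_utf16 (header deprecated since C++17)
#include <cwchar>  // std::mbstate_t
#include <locale>  // std::wstring_convert, std::codecvt
#include <string>

// UTF-8 -> UTF-16; from_bytes throws std::range_error on malformed input
std::u16string utf8_to_utf16(const char *s)
{
	std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conversion;
	return conversion.from_bytes(s);
}

// UTF-16 -> UTF-8
std::string utf16_to_utf8(const std::u16string &s)
{
	std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conversion;
	return conversion.to_bytes(s);
}

// std::codecvt has a protected destructor; this wrapper makes it public
// so that wstring_convert can own and destroy the facet.
template <class internT, class externT, class stateT>
struct codecvt : std::codecvt<internT,externT,stateT>
{ ~codecvt(){} };

// UTF-8 -> UTF-32
std::u32string utf8_to_utf32(const char *s)
{
	std::wstring_convert<codecvt<char32_t,char,std::mbstate_t>,char32_t> conversion;
	return conversion.from_bytes(s);
}

// UTF-32 -> UTF-8
std::string utf32_to_utf8(const std::u32string &s)
{
	std::wstring_convert<codecvt<char32_t,char,std::mbstate_t>,char32_t> conversion;
	return conversion.to_bytes(s);
}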

ThePhD · 2018-01-30T00:01:47Z

The Right Way™

When I want a conversion routine, I need only 2 things.

1. The conversion function itself
2. Any state associated with making that conversion happen

In terms of a C or even C++ API, that's my_result_struct my_converting_function( my_state* ptr_to_state, inputs ... );. That's all I need. I don't need whatever weird stuff codecvt is offering me, I don't need its forced locales or anything else. If I need the locale, it should be bundled up in the converter's state structure, which upon construction requires me to pass in or default said locale. Note that in doing so, I can trim down the state for constructs that do not need it (HELLO, UTF: no state needed for each conversion). That means that for utf conversions, it boils down to this:

#include <array>
#include <cstddef>

// Placeholder enum: the exact set of error codes is an assumption, not settled here
enum class utf_error_code { ok, invalid_code_point, no_input };

struct utf_state {}; // empty as all get-out

template <typename It>
struct utf8_result {
     utf_error_code error_code; // did it even work?
     std::size_t num_code_units; // number of used code units
     std::array<unsigned char, 4> code_units; // partial array of code units with [num_code_units]
     It next; // where the converter left off, useful for variable-width encodings like UTF!
};

template <typename Iterator> // forward only
utf8_result<Iterator> to_utf8( utf_state&, Iterator start, Iterator end ) {
    // blah blah blah encoding stuff
}
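For illustration, here is a minimal sketch of how the elided body might be filled in. It is not the implementation that ended up in sol2. The utf_error_code enumerators and the encode_all helper are assumptions made up for this example, and the input is assumed to be a range of UTF-32 code points.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical error enum; the enumerators are assumptions for this sketch.
enum class utf_error_code { ok, invalid_code_point, no_input };

struct utf_state {}; // UTF needs no per-conversion state

template <typename It>
struct utf8_result {
    utf_error_code error_code;               // did it even work?
    std::size_t num_code_units;              // number of used code units
    std::array<unsigned char, 4> code_units; // only the first num_code_units are valid
    It next;                                 // where the converter left off
};

// Encodes exactly one code point read from [start, end). Never throws.
template <typename Iterator>
utf8_result<Iterator> to_utf8(utf_state&, Iterator start, Iterator end) {
    utf8_result<Iterator> r{utf_error_code::ok, 0, {}, start};
    if (start == end) {
        r.error_code = utf_error_code::no_input;
        return r;
    }
    const char32_t cp = static_cast<char32_t>(*start);
    r.next = ++start;
    // Surrogates and values above U+10FFFF are not encodable.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        r.error_code = utf_error_code::invalid_code_point;
        return r;
    }
    if (cp < 0x80) {
        r.code_units[0] = static_cast<unsigned char>(cp);
        r.num_code_units = 1;
    } else if (cp < 0x800) {
        r.code_units[0] = static_cast<unsigned char>(0xC0 | (cp >> 6));
        r.code_units[1] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        r.num_code_units = 2;
    } else if (cp < 0x10000) {
        r.code_units[0] = static_cast<unsigned char>(0xE0 | (cp >> 12));
        r.code_units[1] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
        r.code_units[2] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        r.num_code_units = 3;
    } else {
        r.code_units[0] = static_cast<unsigned char>(0xF0 | (cp >> 18));
        r.code_units[1] = static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F));
        r.code_units[2] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
        r.code_units[3] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        r.num_code_units = 4;
    }
    return r;
}

// Hypothetical driver loop: stops at the first error and reports it by value.
inline utf_error_code encode_all(const std::u32string& in, std::string& out) {
    utf_state state;
    for (auto it = in.begin(); it != in.end();) {
        const auto r = to_utf8(state, it, in.end());
        if (r.error_code != utf_error_code::ok) return r.error_code;
        for (std::size_t i = 0; i < r.num_code_units; ++i)
            out.push_back(static_cast<char>(r.code_units[i]));
        it = r.next;
    }
    return utf_error_code::ok;
}
```

The converter itself touches only its arguments and the returned struct. The only allocation in the sketch is the caller's own output string in encode_all.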

Simple, no heap allocations, everything on the stack. We set up a for-loop around this, converting to the code units we want from the code points we have. That means we also need to have a from_utf8 at some point, but this was just for showing the kind of interface that isn't a pile of steaming piss. Note that by returning an error code you also can satisfy the needs of people who need a non-throwing API, which is a very pertinent need for those working in HPC and critical environments where the unpredictability of a throw is just not acceptable.
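The from_utf8 direction mentioned above might look like the sketch below, following the same shape: state in, iterators in, a plain result struct out. The utf32_result type, the extra error enumerators, and the decode_all loop are all hypothetical names introduced for this example. The checks are the usual well-formedness rules for UTF-8: valid lead byte, valid continuation bytes, no overlong forms, no surrogates, nothing above U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Hypothetical error enum for the decoding direction.
enum class utf_error_code {
    ok, no_input, invalid_lead, truncated,
    invalid_continuation, overlong, invalid_code_point
};

struct utf_state {}; // still empty: decoding one code point needs no state

template <typename It>
struct utf32_result {
    utf_error_code error_code; // did it even work?
    char32_t code_point;       // valid only when error_code == ok
    It next;                   // where the converter left off
};

// Decodes exactly one code point from the UTF-8 code units in [start, end).
template <typename Iterator>
utf32_result<Iterator> from_utf8(utf_state&, Iterator start, Iterator end) {
    utf32_result<Iterator> r{utf_error_code::ok, 0, start};
    if (start == end) {
        r.error_code = utf_error_code::no_input;
        return r;
    }
    const unsigned char lead = static_cast<unsigned char>(*start);
    ++start;
    std::size_t extra = 0; // number of continuation bytes expected
    char32_t cp = 0;       // code point being assembled
    char32_t min = 0;      // smallest value this length may encode
    if (lead < 0x80)                { extra = 0; cp = lead;        min = 0; }
    else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min = 0x10000; }
    else {
        r.error_code = utf_error_code::invalid_lead;
        r.next = start;
        return r;
    }
    for (std::size_t i = 0; i < extra; ++i) {
        if (start == end) {
            r.error_code = utf_error_code::truncated;
            r.next = start;
            return r;
        }
        const unsigned char c = static_cast<unsigned char>(*start);
        if ((c & 0xC0) != 0x80) {
            r.error_code = utf_error_code::invalid_continuation;
            r.next = start;
            return r;
        }
        cp = (cp << 6) | (c & 0x3F);
        ++start;
    }
    r.next = start;
    if (cp < min)
        r.error_code = utf_error_code::overlong;
    else if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        r.error_code = utf_error_code::invalid_code_point;
    else
        r.code_point = cp;
    return r;
}

// Hypothetical driver loop: the for-loop around the converter, non-throwing.
inline utf_error_code decode_all(const std::string& in, std::u32string& out) {
    utf_state state;
    for (auto it = in.begin(); it != in.end();) {
        const auto r = from_utf8(state, it, in.end());
        if (r.error_code != utf_error_code::ok) return r.error_code;
        out.push_back(r.code_point);
        it = r.next;
    }
    return utf_error_code::ok;
}
```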

Note we could even just not have the state there to start with! But keeping it lets you make things more generic: every converter takes the same number of arguments, with a base "state" class that you type-cast in your conversion function if it's necessary (UTF is blessed in that it is a stateless conversion).
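A rough sketch of that generic variant, under heavy assumptions: conversion_state, ucs2_state, utf16_result, to_ucs2 and convert_all are all invented for illustration. The point is only that a stateless converter (UTF-16 here) and a converter with configuration in its state (a toy UCS-2 encoder with a replacement character) can share one signature and one driver loop.

```cpp
#include <array>
#include <cstddef>
#include <string>

enum class utf_error_code { ok, no_input, invalid_code_point };

// Hypothetical base class so that every converter shares one signature.
struct conversion_state {};

struct utf_state : conversion_state {}; // UTF itself still needs nothing

// Hypothetical stateful converter: configuration lives in the state object,
// the way a locale-like setting might.
struct ucs2_state : conversion_state {
    char16_t replacement = u'?'; // emitted for code points outside the BMP
    std::size_t replaced = 0;    // running count of substitutions
};

struct utf16_result {
    utf_error_code error_code;
    std::size_t num_code_units;
    std::array<char16_t, 2> code_units;
    const char32_t* next;
};

using converter = utf16_result (*)(conversion_state&, const char32_t*, const char32_t*);

// Shared front end: reads one code point and validates it.
inline utf16_result read_code_point(const char32_t* start, const char32_t* end, char32_t& cp) {
    utf16_result r{utf_error_code::ok, 0, {}, start};
    if (start == end) {
        r.error_code = utf_error_code::no_input;
        return r;
    }
    cp = *start;
    r.next = start + 1;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        r.error_code = utf_error_code::invalid_code_point;
    return r;
}

// Stateless: the state argument is accepted and ignored.
inline utf16_result to_utf16(conversion_state&, const char32_t* start, const char32_t* end) {
    char32_t cp = 0;
    utf16_result r = read_code_point(start, end, cp);
    if (r.error_code != utf_error_code::ok) return r;
    if (cp < 0x10000) {
        r.code_units[0] = static_cast<char16_t>(cp);
        r.num_code_units = 1;
    } else {
        cp -= 0x10000; // split into a surrogate pair
        r.code_units[0] = static_cast<char16_t>(0xD800 + (cp >> 10));
        r.code_units[1] = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        r.num_code_units = 2;
    }
    return r;
}

// Stateful: type-casts the base state. The caller must pass a ucs2_state.
inline utf16_result to_ucs2(conversion_state& base, const char32_t* start, const char32_t* end) {
    auto& state = static_cast<ucs2_state&>(base);
    char32_t cp = 0;
    utf16_result r = read_code_point(start, end, cp);
    if (r.error_code != utf_error_code::ok) return r;
    if (cp < 0x10000) {
        r.code_units[0] = static_cast<char16_t>(cp);
    } else {
        r.code_units[0] = state.replacement;
        ++state.replaced;
    }
    r.num_code_units = 1;
    return r;
}

// One driver loop for any converter with the shared signature.
inline utf_error_code convert_all(converter fn, conversion_state& state,
                                  const std::u32string& in, std::u16string& out) {
    const char32_t* it = in.data();
    const char32_t* const end = it + in.size();
    while (it != end) {
        const utf16_result r = fn(state, it, end);
        if (r.error_code != utf_error_code::ok) return r.error_code;
        out.append(r.code_units.data(), r.num_code_units);
        it = r.next;
    }
    return utf_error_code::ok;
}
```

The unchecked static_cast is the price of the uniform signature: handing to_ucs2 anything other than a ucs2_state is undefined behavior, which is one argument for dropping the state parameter wherever it is not needed.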

Implementation left to the reader!

... Oh wait, that's me.

Warning: Rant(ionale)

sol2/sol/stack_get.hpp, line 470 in 63ec47b:

// Thanks, MinGW and libstdc++, for introducing this absolutely asinine bug

On top of the above absolute SNAFU on MinGW's part: rather than commit to fixing the behavior and weird interface of codecvt, the C++ standards committee just deprecated it outright. A good move because it is -- demonstrably -- complete garbage, truly.

Of course, they don't introduce any replacement, and then std lib vendors added the deprecation tags to it before the replacement is agreed upon, which means that I also have to document (and add to all the test builds) that it's deprecated and I need to do something else. And ALSO document VC++'s deprecation warning, because I immediately get not one, but two issues opened asking me if it's okay to proceed beyond the warnings, because most people don't even use codecvt and what is this strange error showing up whoa. So now their deprecation warning is part of my documentation, just to make sure people know it's safe. Of course, even if it's safe, it doesn't make codecvt any less crap.

Codecvt itself is garbage, but the implementations themselves are worse. From MinGW's bug, to where VC++ has a build of itself where wchar_t and char16_t inside of codecvt and wstring_convert legitimately don't compile. So. I have to do that thing where I type-pun with a reinterpret_cast on VC++ (also the subject of many raised eyebrows and borderline annoying commentary about why I'm doing some weird behavior, and you have to answer their questions and assuage their concerns and literally please stop asking, not even a completely paranoid linter is complaining about it, just STOP).

What's hilarious about the VC++ bug is that it's not even a hair-tugging runtime weirdness like MinGW's byte-swapping bug that somebody quite literally programmed into the library after it worked fine in the 4.x and some 5.x iterations: all they had to do was use codecvt and wstring_convert with char16_t once and it would have outright snapped whatever testing harness or build they had in half.

But things fall through the cracks all the time! Look at my commit history and you'll see it (thanks, no Two-Phase Lookup for MSVC). It was fixed rather speedily. Unfortunately, that doesn't help me when somebody still has VS Version X and company rules say they're not updating for a long time because {Corporate and Technical Debt Reasons, usually}.

Oh. Don't forget that codecvt is also absolutely, terribly slow to construct a single instance of. As to 'why'? Probably because it also interfaces with the herculean monstrosity that is the C and C++ "locale" abstraction, making every construction something on the order of jaw-droppingly dogpoo slow. I waved my hand at it and said "eh I don't care, it's niche, nobody will care if it's slow, right?"

Wrong.

Everybody cares. Performance matters. And whoever designed codecvt in the first place didn't think that maybe the global locale should NOT be in there, given the locale's historical record of being not only poorly understood and used, but poorly optimized. (Or maybe optimized as much as it could be, given the implementation's constraints? I confess I never read codecvt's standardese closely.)

ThePhD · 2018-01-30T00:12:03Z

Also, hilariously from the code @sagamusix linked: https://github.com/OpenMPT/openmpt/blob/master/common/mptString.cpp#L576

Codecvt implementations are trash, everywhere, and hopefully something more useful for doing encoding and decoding is properly standardized.

This helps avoid bringing in <codecvt> and Boost.Locale just for converting between UTF-8 and UTF-16 on Windows in a locale-agnostic way. See also ThePhD/sol2#571.

ThePhD added the Feature.Can Do label Jan 27, 2018

ThePhD added this to the Feature milestone Jan 27, 2018

ThePhD self-assigned this Jan 27, 2018

ThePhD closed this as completed in f48ba8b Feb 3, 2018

Ruin0x11 mentioned this issue May 20, 2019

Add new map format elonafoobar/elonafoobar#1294

Closed

andim2 mentioned this issue Sep 4, 2020

wstring_convert constructor repeat many times unnecessary causes performance degradation #326

Closed
