Searched the document for the word “fast” and it did not turn up.
The thing where dev tooling for other languages is written in Rust and then becomes much much faster… Somebody maybe should be doing that for Rust itself.
The goal of this project is to make it easier to integrate Rust into existing C and C++ projects that are traditionally built with gcc (Linux being the most prominent).
Given that gcc and clang are comparable at compilation speed, with clang maybe being slightly better, I wouldn’t expect this project to improve compilation speed, nor do I believe it should be within scope for this project.
Wow, newlines in filenames being officially deprecated?!
Re. modern C, multithreaded code really needs to target C11 or later for atomics. POSIX now requires C17 support; C17 is basically a bugfix revision of C11 without new features. (hmm, I have been calling it C18 owing to the publication year of the standard, but C23 is published this year so I guess there’s now a tradition of matching nominal years with C++ but taking another year for the ISO approval process…)
Nice improvements to make, and plenty of other good stuff too.
It seems like both C and POSIX have woken up from a multi-decade slumber and are improving much faster than they used to. Have a bunch of old farts retired or something?
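To make the C11 atomics point above concrete, here is a minimal sketch of the facilities being referred to (it assumes the optional <threads.h> is available; with pthreads the atomic part looks the same):

#include <stdatomic.h>
#include <threads.h>   /* optional even in C17: check __STDC_NO_THREADS__ */

static atomic_int counter;   /* static atomics are zero-initialised */

static int worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++)
        atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
    return 0;
}

int main(void) {
    thrd_t t[4];
    for (int i = 0; i < 4; i++) thrd_create(&t[i], worker, NULL);
    for (int i = 0; i < 4; i++) thrd_join(t[i], NULL);
    return atomic_load(&counter) == 400000 ? 0 : 1;
}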
I have been calling it C18 owing to the publication year of the standard, but C23 is published this year so I guess there’s now a tradition of matching nominal years with C++
I believe the date is normally the year when the standard is ratified. Getting ISO to actually publish the standard takes an unbounded amount of time and no one cares because everyone works from the ratified draft.
As a fellow brit, you may be amused to learn that the BSI shut down the BSI working group that fed WG14 this year because all of their discussions were on the mailing list and so they didn’t have the number of meetings that the BSI required for an active standards group. The group that feeds WG21 (of which I am a member) is now being extra careful about recording attendance.
Unfortunately, there were a lot of changes between the final public draft and the document actually being finished. ISO is getting harsher about this and didn’t allow the final draft to be public. This time around people will probably reference the “first draft” of C2y instead, which is functionally identical to the final draft of C23.
There are a bunch of web sites that have links to the free version of each standard. The way to verify that you are looking at the right one is
look at the committee mailings which include a summary of the documents for a particular meeting
look for the editor’s draft and the editor’s comments (two adjacent documents)
the comments will say if the draft is the one you want
Sadly I can’t provide examples because www.open-std.org isn’t working for me right now :-( It’s been unreliable recently, does anyone know what’s going on?
For C23, Cppreference links to N3301, the most recent C2y draft. Unfortunate that the site is down, so we can’t easily check whether all those June 2024 changes were also made to C23. The earlier C2y draft (N3220) only has minor changes listed. Cppreference also links to N3149, the final WD of C23, which is protected by low quality ZIP encryption.
I think for C23 the final committee draft was last year but they didn’t finish the ballot process and incorporating the feedback from national bodies until this summer. Dunno how that corresponds to ISO FDIS and ratification. Frankly, the less users of C and C++ (or any standards tbh) have to know or care about ISO the better.
Re modern C: are there improvements in C23 that didn’t come from either C++ or are standardization of stuff existing implementations have had for ages?
I think the main ones are _BitInt, <stdbit.h>, <stdckdint.h>, #embed
Generally the standard isn’t the place where innovation should happen, though that’s hard to avoid if existing practice is a load of different solutions for the same problem.
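A rough sketch of what those C23 additions look like in use (compiler support still varies, so treat this as illustrative rather than something every toolchain accepts today):

#include <stdckdint.h>
#include <stdbit.h>
#include <stdio.h>

int main(void) {
    /* <stdckdint.h>: checked arithmetic, returns true on overflow */
    int sum;
    if (ckd_add(&sum, 2000000000, 2000000000))
        puts("overflow detected");

    /* <stdbit.h>: type-generic bit utilities */
    printf("ones: %u\n", (unsigned)stdc_count_ones(0xF0F0u));

    /* _BitInt: integers of an exact bit width */
    unsigned _BitInt(12) tiny = 4095uwb;
    printf("tiny: %u\n", (unsigned)tiny);

    /* #embed (not used here) splats a file's bytes into an initializer:
       static const unsigned char logo[] = {
       #embed "logo.png"
       };                                                               */
    return 0;
}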
They made realloc(ptr, 0) undefined behaviour. Oh, sorry, you said improvements.
I learned about this yesterday in the discussion of rebasing C++26 on C23 and the discussion from the WG21 folks can be largely summarised as ‘new UB, in a case that’s trivial to detect dynamically? WTF? NO!’. So hopefully that won’t make it back into C++.
realloc(ptr, 0) was broken by C99 because, since then, when NULL is returned you can’t tell whether it successfully freed the pointer or whether it failed the way malloc(0) may fail, leaving the old pointer still allocated.
POSIX has changed its specification so realloc(ptr, 0) is obsolescent so you can’t rely on POSIX to save you. (My links to old versions of POSIX have mysteriously stopped working which is super annoying, but I’m pretty sure the OB markers are new.)
C ought to require that malloc(0) returns NULL and (like it was before C99) realloc(ptr,0) is equivalent to free(ptr). It’s tiresome having to write the stupid wrappers to fix the spec bug in every program.
Maybe C++ can fix it and force C to do the sensible thing and revert to the non-footgun ANSI era realloc().
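For anyone who hasn’t hit this, the kind of wrapper being complained about looks something like this sketch (the xrealloc name and the abort-on-failure policy are just one common choice):

#include <stdlib.h>

void *xrealloc(void *ptr, size_t size) {
    if (size == 0) {              /* restore the pre-C99 meaning: free and be done */
        free(ptr);
        return NULL;
    }
    void *p = realloc(ptr, size);
    if (p == NULL)                /* genuine allocation failure; ptr is still valid */
        abort();
    return p;
}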
C ought to require that malloc(0) returns NULL and (like it was before C99) realloc(ptr,0) is equivalent to free(ptr). It’s tiresome having to write the stupid wrappers to fix the spec bug in every program.
98% sure some random vendor with a representative via one of the national standards orgs will veto it.
In cases like this it would be really helpful to know who are the bad actors responsible for making things worse, so we can get them to fix their bugs.
It was already UB in practice. I guarantee that there are C++ compilers / C stdlib implementations out there that together will make 99% of C++ programs that do realloc(ptr, 0) have UB.
Not even slightly true. POSIX mandates one of two behaviours for this case, which are largely compatible. I’ve seen a lot of real-world code that is happy with either of those behaviours but does trigger things that are now UB in C23.
But POSIX is not C++. And realloc(ptr, 0) will never be UB with a POSIX-compliant compiler, since POSIX defines the behavior. Compilers and other standards are always free to define things that aren’t defined in the C standard. realloc(ptr, 0) was UB “in practice” for C due to non-POSIX compilers. They could not find any reasonable behavior for it that would work for every vendor. Maybe there just aren’t enough C++ compilers out there for this to actually be a problem for C++, though.
And realloc(ptr, 0) will never be UB with a POSIX-compliant compiler
In general, POSIX does not change the behaviour of compiler optimisations. Compilers are free to optimise based on UB in accordance with the language semantics.
They could not find any reasonable behavior for it that would work for every vendor
Then make it IB, which comes with a requirement that you document what you do, but doesn’t require that you do a specific thing, only that it’s deterministic.
Maybe there just aren’t enough C++ compilers out there for this to actually be a problem for C++, though.
No, the C++ standards committee just has a policy of not introducing new kinds of UB in a place where they’re trivially avoidable.
In general, POSIX does not change the behaviour of compiler optimisations. Compilers are free to optimise based on UB in accordance with the language semantics.
C23 does not constrain implementations when it comes to the behavior of realloc(ptr, 0), but POSIX does. POSIX C is not the same thing as standard C. Any compiler that wants to be POSIX-compliant has to follow the semantics laid out by POSIX. Another example of this is function pointer to void * casts and vice versa. UB in C, but mandated by POSIX.
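For reference, the function-pointer case shows up most visibly with dlsym(); something like this sketch is undefined by the letter of ISO C but required to work by POSIX (the library name is purely illustrative, and older glibc needs -ldl at link time):

#include <dlfcn.h>
#include <stdio.h>

int main(void) {
    void *handle = dlopen("libm.so.6", RTLD_NOW);
    if (!handle) return 1;
    /* dlsym returns void *, which must be converted to a function pointer */
    double (*cosine)(double) = (double (*)(double))dlsym(handle, "cos");
    if (cosine) printf("cos(0) = %f\n", cosine(0.0));
    dlclose(handle);
    return 0;
}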
No, the C++ standards committee just has a policy of not introducing new kinds of UB in a place where they’re trivially avoidable.
They introduced lots of new UB in C++20, so I don’t believe this.
When I’ve tried emulating X86-64 on Apple Silicon using QEMU it’s been incredibly slow, like doing ls took like 1-2 seconds. So if these fine people manage to emulate games then I’m very impressed!
QEMU emulation (TCG) is very slow! Its virtue is that it can run anything on anything, but it’s not useful for productivity or gaming. I used to use it to hack around a FEX RootFS as root, and even just downloading and installing packages with dnf was excruciatingly slow.
Emulators that optimize for performance (such as FEX, box64, and Rosetta, and basically every modern game console emulator too) are in a very different league. Of course, the tradeoff is they only support very specific architecture combinations.
As @lina says, QEMU is general. It works a few instructions at a time, generates an IR (TCG IR, which was originally designed for TCC, which was originally an IOCCC entry), does a small amount of optimisation, and emits the result.
Rosetta 2 works on much larger units but, more importantly, AArch64 was designed to support x86 emulation and it can avoid the intermediate representation entirely. Most x86-64 instructions are mapped to 1-2 instructions. The x86-64 register file is mapped into 16 of the AArch64 registers, with the rest used for emulator state.
Apple has a few additional features that make it easier:
They use some of the reserved bits in the flags register for x86-compatible flags emulation.
They implement a TSO mode, which automatically sets the fence bits on loads and stores.
FEX doesn’t (I think) take advantage of these (or possibly does, but only on Apple hardware?), but even without them it’s quite easy (as in, it’s a lot of engineering work, but each bit of it is easy) to translate x86-64 binaries to AArch64. Arm got a few things wrong but both Apple and Microsoft gave a lot of feedback and newer AArch64 revisions have a bunch of extensions that make Rosetta 2-style emulation easy.
RISC-V’s decision to not have a flags register would make this much harder.
There are two more hardware features: SSE denormal handling (FTZ/DAZ) and a change in SIMD vector handling. Those are standardized as FEAT_AFP in newer ARM architectures, but Apple doesn’t implement the standard version yet. The nonstandard Apple version is not usable in FEX due to a technicality in how they implemented it (they made the switch privileged and global, while FEX needs to be able to switch between modes efficiently, unlike Rosetta, and calling into the kernel would be too slow).
FEX does use TSO mode on Apple hardware though, that’s by far the biggest win and something you can’t just emulate performantly if the hardware doesn’t support it. Replacing all the loads/stores with synchronized ones is both slower and also less flexible (fewer addressing modes) so it ends up requiring more instructions too.
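To illustrate what “replacing all the loads/stores with synchronized ones” means in practice, here is a rough C sketch of how a translator without a hardware TSO mode has to treat guest memory accesses (acquire loads and release stores approximate x86 ordering; the helper names are mine):

#include <stdatomic.h>
#include <stdint.h>

static inline uint64_t guest_load64(_Atomic uint64_t *p) {
    /* lowered to an ordered load (ldar/ldapr-style) with restricted addressing modes */
    return atomic_load_explicit(p, memory_order_acquire);
}

static inline void guest_store64(_Atomic uint64_t *p, uint64_t v) {
    /* lowered to an ordered store (stlr-style), again with restricted addressing */
    atomic_store_explicit(p, v, memory_order_release);
}

With hardware TSO, plain loads and stores already obey the guest’s ordering rules, so the translator keeps the full set of addressing modes and skips the extra instructions.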
them it’s quite easy […] to translate x86-64 binaries to AArch64
[…]
RISC-V’s decision to not have a flags register would make this much harder.
Dumb question: is there a reason not to always ahead-of-time compile to the native arch anyway?
(I believe that is what RPCS3 does, see the LLVM recompiler option).
As I understand it, that’s more or less what Rosetta 2 does: it hooks into mmap calls and binary translates libraries as they’re loaded. The fact that the mapping is simple means that this can be done with very low latency. It has a separate mode for JIT compilers that works more incrementally. I’m impressed by how well the latter works: the Xilinx tools are Linux Java programs (linked to a bunch of native libraries) and they work very well in Rosetta on macOS, in a Linux VM.
The Dynamo Rio work 20 or so years ago showed that JITs can do better by taking advantage of execution patterns. VirtualPC for Mac did this kind of thing to avoid the need to calculate flags (which were more expensive on PowerPC) when they weren’t used. In contrast, Apple Silicon simply makes it sufficiently cheap to calculate the flags that this is not needed.
Rosetta does do this, but you have to support runtime code generation (that has to be able to interact with AOT-generated code) at minimum because of JITs (though ideally a JIT implementation should check to see if it is being translated and not JIT), but also if you don’t support JIT translating you can get a huge latency spike/pause when a new library is loaded.
So no matter what you always have to support some degree of runtime codegen/translation, so it’s just a question of can you get enough of a win from an AOT as well as the runtime codegen to justify the additional complexity.
Ignore the trashy title, this is actually really neat. They offload IO to separate threads which means the main thread now gets commands in batches; so the main thread can interleave the data structure traversals for multiple keys from the batch, so it can make much better use of the memory system’s concurrency.
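As a toy illustration of the interleaving idea (not their code): advance a whole batch of binary searches one step at a time, so several cache misses can be in flight at once instead of each lookup stalling serially.

#include <stddef.h>

#define BATCH 16

/* lower_bound for BATCH keys over one sorted array, advanced in lock step */
void batched_lower_bound(const int *sorted, size_t n,
                         const int keys[BATCH], size_t out[BATCH]) {
    size_t lo[BATCH], hi[BATCH];
    for (int i = 0; i < BATCH; i++) { lo[i] = 0; hi[i] = n; }
    for (int active = 1; active; ) {
        active = 0;
        for (int i = 0; i < BATCH; i++) {      /* one probe per key per round */
            if (lo[i] < hi[i]) {
                size_t mid = lo[i] + (hi[i] - lo[i]) / 2;
                if (sorted[mid] < keys[i]) lo[i] = mid + 1; else hi[i] = mid;
                active = 1;
            }
        }
    }
    for (int i = 0; i < BATCH; i++) out[i] = lo[i];
}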
I wonder whether anyone has tried to fold system load into the concept of time. I.e. time flows slower if the system is under load from other requests.
Oops, that’s a total mistake on our part! I’ve (re)added a reference (it was lost along the way somehow) and included it back in our latter discussion. Thanks for pointing out that omission!
The original mailing-list thread started when someone came back to their workstation to find it magically unlocked: while they were gone, the system had run out of memory and the OOM killer had chosen to kill the xlock process!
If anything, this speaks more to how badly X fits modern use cases than anything. There are lots of reasons that the locker can die (not only OOM), but the fact that this can “unlock” your desktop is the actual absurd part.
If wayland was good, we’d all be using it by now. It has had so, so much time to prove itself.
My experience with wayland has been:
it’s never the default when I install a popular window manager
every 5 years I see if I should tweak settings to upgrade, and find out that if I do that it’s going to break some core use case, for example gaming, streaming, or even screenshots for god’s sake.
it’s been like 20 years now
I think the creators of Wayland have done us a disservice, by convincing everyone it is the way forward, while not actually addressing all the use cases swiftly and adequately, leaving us all in window manager limbo for two decades.
Maybe my opinion will change when I upgrade to Plasma 6. Although, if you search this page for “wayland” you see a lot of bugs…
it’s never the default when I install a popular window manager
Your information might be outdated: not only is it the default in Plasma 6 and GNOME 46, but they’ve actually worked to allow compiling them with zero Xorg support. I believe a lot of distros are now not only enabling it by default but have express plans to no longer ship Xorg at all outside of xwayland.
If wayland was good, we’d all be using it by now. It has had so, so much time to prove itself.
Keep in mind that I gave Wayland as an example of how they should have fixed the issue (e.g.: having a protocol where if the locker fails, the session is just not opened to everyone).
My experience with Wayland is that it works for my use cases. I can understand the frustration of not working for yours (I had a similar experience 5 years ago, but since switching to Sway 2 years ago it seems finally good enough for me), but this is not a “Wayland is good and X is bad”, it is “X is not designed for modern use cases”.
This is a thread about OS kernels and memory management. There are lots of us who use Linux but don’t need a desktop environment there. With that in mind, please consider saving the Wayland vs X discussion for another thread.
If Linux was any good, we’d all be using it by now.
If Dvorak was any good, we’d all be typing on it by now.
If the metric system was any good, the US would be using it by now.
None of the above examples are perfect, I just want to insist that path dependence is a thing. Wayland, being so different from X, introduced many incompatibilities, so it had much inertia to overcome right from the start. People need clear, substantial, and immediate benefits to consider paying even small switching costs, and Wayland’s are pretty significant.
Except on the desktop. You could argue it’s just one niche among many, but it remains a bloody visible one.
Dvorak isn’t very good (although I personally use it). Extremely low value-add.
Hmm, you’re effectively saying that no layout is very good, and all have extremely low value-add… I’m not sure I believe that, even if we ignore chording layouts that let stenotypists type faster than human speech: really, we can’t do significantly better than what was effectively a fairly random layout?
The metric system is ubiquitous, including in the US. […]
I call bullshit on this one. Last time I went there it was all about miles and inches and square feet. Scientists may use it, but day to day you’re still stuck with the imperial system, even down to your standard measurements: wires are gauged, your wood is 2 by 4 inches, even your screws use imperial threads.
Oh, and there was this Mars probe that went crashing down because of an imperial/metric mismatch. I guess they converted everything to metric since then, but just think of what it took to do that even for this small, highly technical niche.
That being said…
I think Wayland sucked a lot.
I can believe it did (I don’t have a first hand opinion on this).
Just on a detail, I believe it doesn’t matter to most people what their keyboard layout is, and I’ve wasted a lot of time worrying about it. A basically random one like qwerty is just fine. That doesn’t affect your main point though, especially since the example of stenography layouts is a slam dunk. Many people still do transcription using qwerty, and THAT is crazy path-dependence.
The display protocol used by a system built in the bazaar style of development is not a question of design, but of community support/network effect. It can be the very best thing ever, and it won’t matter if no client supports it.
Also, the creators of Wayland are the ex-maintainers of X, it’s not like they were not familiar with the problem at hand. You sometimes have to break backwards compatibility for good.
Sure, it’s good to be finally happening. My point is if we didn’t have Wayland distracting us, a different effort could have gotten us there faster. It’s always the poorly executed maybe-solution that prevents the ideal solution from being explored. Still, I’m looking forward to upgrading to Plasma 6 within the next year or so.
Designing Wayland was a massive effort, it wasn’t just the Xorg team going “we got bored of this and now you have to use the new thing”, they worked very closely with DE developers to design something that wouldn’t make the same mistakes Xorg did.
Take the perspective of a client developer chasing after the tumbleweed of ‘protocols’ drifting around and try to answer ‘what am I supposed to implement and use’? To me it looked like a Picasso painting of ill-fitting and internally conflicted ideas. Let this continue a few cycles more and X11 will look clean and balanced by comparison. Someone should propose a desktop icon protocol for the sake of it; then again, someone probably already has.
It might even turn out so well that one of these paths will have a fighting chance against the open desktop being further marginalised as a thin client in the Azure clouded future; nothing more than a silhouette behind unwashed Windows, a virtualized ghost of its former self.
That battle is quickly being lost.
The unabridged story behind Arcan should be written down (and maybe even published) during the coming year or so as the next thematic shift is around the corner. That will cover how it’s just me but also not. A lot of people have indirectly used the thing without ever knowing, which is my preferred strategy for most things.
Right now another fellow is on his way from another part of Europe for a hackathon in my fort out in the wilderness.
Arcan does look really cool in the demos, and I’d like to try it, but last time I tried to build it I encountered a compilation bug (and submitted a PR to fix it) and I’ve never been able to get any build of it to give me an actually usable DE.
I’m sure it’s possible, but last time I tried I gave up before I worked out how.
I also wasn’t able to get Wayland to work adequately, but I got further and it was more “this runs but is not very good” instead of “I don’t understand how to build or run this”.
Maybe being a massive effort is not actually a good sign.
Arguably Wayland took so long because it decided to fix issues that didn’t need fixing. Did somebody actually care about a rogue program already running on your desktop being able to capture the screen and clipboard?
edit: I mean, I guess so since they put in the effort. It’s just hard for me to fathom.
Was it a massive effort? Its development starting 20 years ago does not equate to a “massive effort,” especially considering that the first 5 years involved a mere handful of people working on it as a hobby. The remainder of the time was spent gaining enough network effect, rather than technical effort.
Sorry, but this appeal to the effort it took to develop wayland is just embarrassing.
Vaporware does not mean good. Au contraire, it usually means terrible design by committee, as is the case with wayland.
Besides, do you know how much effort it took to develop X?
It’s so tiring to keep reading this genre of comment over and over again, especially given that we have crazyloglad in this community utterly deconstructing it every time.
This is true but I do think it was also solved in X. Although there were only a few implementations as it required working around X more than using it.
IIRC GNOME and GDM would coordinate so that when you lock your screen it actually switched back to GDM. This way if anything started or crashed in your session it wouldn’t affect the screen locking. And if GDM crashed it would just restart without granting any access.
That being said it is much simpler in Wayland where the program just declares itself a screen locker and everything just works.
Crashing the locker hasn’t been a very good bypass route in some time now (see e.g. xsecurelock, which is more than 10 years old, I think). xlock, the program mentioned in the original mailing list thread, is literally 1980s software.
X11 screen lockers do have a lot of other problems (e.g. input grabbing) primarily because, unlike Wayland, X11 doesn’t really have a lock protocol, so screen lockers mostly play whack-a-mole with other clients. Technically Wayland doesn’t have one, either, as the session lock protocol is in staging, but I think most Wayland lockers just go with that.
Unfortunately, last time I looked at it Wayland delegated a lot of responsibilities to third-parties, too. E.g. session lock state is usually maintained by the compositor (or in any case, a lot of ad-hoc solutions developed prior to the current session lock protocol did). Years ago, “resilient” schemes that tried to restart the compositor if it crashed routinely suffered from the opposite problem: crashing the screen locker was fine, but if the OOM reaped the compositor, it got unlocked.
I’d have thought there was a fairly narrow space where CPU SIMD mattered in game engines. Most of the places where SIMD will get an 8x speedup, GPU offload will get a 1000x speedup. This is even more true on systems with a unified memory architecture where there’s much less cost in moving data from CPU to GPU. It would be nice for the article to discuss this.
Early SIMD (MMX, 3DNow, SSE1) on mainstream CPUs typically gave a 10-30% speedup in games, but then adding a dedicated GPU more than doubled the resolution and massively increased the rendering quality.
The big advantage of CSV and TSV is that you can edit them in a text editor. If you’re putting non-printing characters in as field separators you lose this. If you don’t need that property then there are a lot of better options up to, and including, sqlite databases.
Though I suppose that it has the advantage of not coming with any meaning pre-loaded into it. Yet. If we use these delimiter tokens for data files then people will be at least slightly discouraged from overloading them in ways that break those files.
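For concreteness, a tiny sketch of what such a delimiter-separated record looks like on the wire, using the ASCII unit separator (0x1F) between fields and the record separator (0x1E) between records:

#include <stdio.h>

int main(void) {
    const char *fields[2][3] = {
        {"alice", "42", "green"},
        {"bob, jr.", "7", "tab\tsafe"},   /* commas and tabs no longer need escaping */
    };
    for (int r = 0; r < 2; r++)
        for (int c = 0; c < 3; c++)
            printf("%s%c", fields[r][c], c < 2 ? '\x1f' : '\x1e');
    return 0;
}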
A text editor or grep is one of M, and TSV is one of N.
If you have an arbitrary DSV, you’re not really in that world any more – now you need to write your own tools.
FWIW I switched from CSV to TSV because the format is much simpler. As far as I can tell, there is exactly one TSV format, but multiple different CSV formats in practice. There’s less room for misunderstanding.
grep (GNU grep 3.11) does pass the non-printables through but doesn’t recognise \x1e as a line separator (and has no option to specify that either) which means you get the whole wash of data whatever you search for.
And there are more tools, like head and tail and shuf.
xargs -0 and find -print0 actually have the same problem – I pointed this out somewhere on https://www.oilshell.org
It kind of “infects” into head -0, tail -0, sort -0, … which are sometimes spelled sort -z, etc.
The Oils solution is “TSV8” (not fully implemented) – basically you can optionally use JSON-style strings within TSV cells.
So head tail grep cat awk cut work for “free”. But if you need to represent something with tabs or with \x1f, you can. (It handles arbitrary binary data, which is a primary rationale for the J8 Notation upgrade of JSON - https://www.oilshell.org/release/latest/doc/j8-notation.html)
I don’t really see the appeal of \x1f because it just “pushes the problem around”.
Now instead of escaping tab, you have to escape \x1f. In practice, TSV works very well for me – I can do nearly 100% of my work without tabs.
If I need them, then there’s TSV8 (or something different like sqlite).
head can be done in awk; the rest likely require converting newline-terminated output to zero-terminated output and passing it to the zero-terminated version of themselves. With both DSV and zero-terminated commands, I’d make a bunch of aliases and call it a day.
I guess that counts as “writing your own tools”, but I end up turning commonly used commands into functions and scripts anyway, so I don’t see it as a great burden. I guess to each their own workflow.
I appreciate that PureScript has ado for applicative do-notation instead of overloading do + a language pragma & doesn’t need to have something complicated trying to guess what can be done in parallel. Seems this offers some speedups but now you have implicit things going on & do for monads is supposed to be about expressing sequential data flow.
Generally execution is implicit in Haskell so this is in line with the culture / philosophy of the language. Though, I agree with you that having explicit control over how things are executed is often necessary to engineer things that behave predictably.
Also, Haskell is one of the few languages that lets the programmer hook into the optimizer with rewrite rules.
So this paper can be seen as equivalent to a rewrite rule that rewrites monad binds into applicative <*> for performance reasons.
(there’s also other precedent for this kind of high level optimizations in Haskell. E.g. stream fusion.)
// Assumes `use core::arch::x86_64::*;` is in scope and that `State` is an
// alias for the 128-bit SSE register type (i.e. `type State = __m128i;`).
#[inline(always)]
pub unsafe fn get_partial_unsafe(data: *const State, len: usize) -> State {
    // Lane indices 0..=15 for the 16 bytes of the vector
    let indices =
        _mm_set_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
    // Create a mask by comparing the indices to the length
    let mask = _mm_cmpgt_epi8(_mm_set1_epi8(len as i8), indices);
    // Mask the bytes that don't belong to our stream
    _mm_and_si128(_mm_loadu_si128(data), mask)
}
You can generate the mask by reading from a buffer at an offset. It will save you a couple of instructions:
e: of course this comes from somewhere - found this dumb proof of concept / piece of crap i made forever ago https://files.catbox.moe/pfs2qu.txt (point to avoid page faults; no other ‘bounds’ here..)
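If I’ve understood the trick, it looks roughly like this in C (a sketch under my own naming, not the linked proof of concept): a 32-byte table of 0xFF then 0x00, loaded at offset 16 - len, yields exactly len leading 0xFF bytes without building the mask from compares.

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

static const uint8_t mask_table[32] = {
    0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
    0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
    0,   0,   0,   0,   0,   0,   0,   0,
    0,   0,   0,   0,   0,   0,   0,   0,
};

/* len must be in 0..16 */
static __m128i get_partial(const __m128i *data, size_t len) {
    __m128i mask = _mm_loadu_si128((const __m128i *)(mask_table + 16 - len));
    return _mm_and_si128(_mm_loadu_si128(data), mask);
}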
Interesting, I might try this out on some of the other tasks on HighLoad. It would be nice if we could get some hard evidence that engaging the stream prefetcher by interleaving is actually the reason for the perf gain. Otherwise, I don’t buy explanations like these at face value, especially when virtual memory is involved.
One thing that makes it tricky to optimize on the platform is that you have limited ways to exfiltrate profiling data. Your only way of introspecting the system is to dump info to stderr.
If you have an Intel CPU, can’t you try it locally? You can just pipe /dev/urandom into your program.
Maybe run the program under perf (or whatever the equivalent is on Mac OS or Windows) to count cache misses?
You can, but you need to have the same hardware as the target platform (in this case a Haswell server). I have a Coffee Lake desktop processor, so whatever results I get locally will not be representative.
I was thinking of running both the streaming and interleaved on the same machine to see if you could replicate the same speedup (even for a different microarchitecture).
That said, I looked up the phrase mentioned in the blog post in the January 2023 version of the Intel Optimization Manual, and I could only find it under “EARLIER GENERATIONS OF INTEL® 64” for Sandy Bridge.
Not sure it would apply to Skylake/Coffee Lake.
In the openbsd version, because the length and copy loop are fused together, whether or not the next byte will be copied depends on the byte value of the previous iteration.
Effectively the cost of this dependency is now not just imposed on the length computation but also on the copy operation. And to add insult to injury, dependencies are not just difficult for the CPU, they are also difficult for the compiler to optimize/auto-vectorize resulting in worse code generation - a compounding effect.
This seems wrong? The if ((*dst++ = *src++) == '\0') branch should be very predictable and shouldn’t hinder the CPU.
What I believe is happening is that splitting the strlen and memcpy loops removes the early exit from the memcpy loop, allowing the autovectorizer to kick in.
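A sketch of the two shapes being compared (simplified, not the actual OpenBSD code): in the fused loop every copied byte feeds the loop-exit test, while the split version lets strlen stream ahead and memcpy run as a branch-free bulk copy.

#include <stddef.h>
#include <string.h>

/* fused: copy and length check in one loop, strlcpy-style return value */
size_t copy_fused(char *dst, const char *src, size_t dsize) {
    size_t i = 0;
    if (dsize != 0) {
        for (; i + 1 < dsize; i++)
            if ((dst[i] = src[i]) == '\0')   /* exit depends on the byte just copied */
                return i;
        dst[i] = '\0';                       /* out of room: truncate */
    }
    while (src[i] != '\0') i++;              /* keep counting for the return value */
    return i;
}

/* split: find the length first, then do one bulk copy */
size_t copy_split(char *dst, const char *src, size_t dsize) {
    size_t len = strlen(src);
    if (dsize != 0) {
        size_t n = len < dsize - 1 ? len : dsize - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return len;
}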
That’s a great discussion. When you do the strlen, you’re streaming data in, so you’ll hit fast paths in the prefetch and, importantly, the termination branch will be predicted not taken so you can run hundreds of instructions forward, skipping cache misses, and do it with a very high degree of parallelism. Eventually, you’ll resolve the branch that escapes the loop and unwind, which comes with a fixed cost.
After that, the cache will be warm. The most important thing here is that you don’t need to reorder the stores. You’re writing full cache lines with data that’s already in cache and so you’ll do a little bit of reassembly and then write entire new lines. Modern caches do allocate in place, so if you store a whole line of data they won’t fetch from memory. If the source and destination are differently aligned, one cache miss in the middle can cause the stores to back up in the store queue waiting for the cache and backpressure the entire pipeline.
Yes, there’s a small store queue. If a full cache line of data is in the store queue, you avoid the load from memory.
Thanks. Do the stores need to be aligned to be coalesced? I.e. does this, for r8 not aligned to cacheline, avoid one of the 3 loads? If not, are the alignment requirements documented somewhere for x64/aarch64?
It will vary a lot across microarchitectures. Generally, stores that span a cache line will consume multiple entries in a store queue and then be coalesced, but the size of the store queue may vary both in the width of entries and the number. I’m also not sure of the amount of reordering that’s permitted on x86 with TSO; that may require holding up the coalesced store of the middle line until the load of the first one has happened. There’s also some complexity in that Intel chips often fill pairs of cache lines (but evict individual ones), so the miss in the first may still bring the middle line into LLC and then be replaced.
Dumb followup question: do you think that Torvalds argument that the ISA should have some kind of rep mov that can skip some of the CPU internal machinery (store coalescing but maybe the register file too?) holds water?
(It doesn’t need to be rep mov; you can imagine a limited version that requires alignment to cache lines in source, size, and destination, or even a single cache-line copy.)
Yes. I was one of the reviewers for Arm’s version, which is designed to avoid the complex microcode requirements of rep movsb. In the Arm version, each memcpy operation is split into three instructions. On a complex out-of-order machine, you’ll probably treat the first and last as NOPs and just do everything in the middle, but in a simpler design the first and last can handle unaligned starts and ends and the middle one can do a bulk copy.
The bulk copy can be very efficient if it’s doing a cache line at a time. Even if the source and destination have different alignment, if you can load two cache lines and then fill one from overlapping bits, you guarantee that you’re never needing to read anything from the target. There are a bunch of other advantages:
You know where the end is, so if it’s in the middle of a cache line you can request that load right at the start.
The loads are entirely predictable and so you can issue a load of them together without needing to involve the speculative execution machinery.
The stores are unsequenced, so you can make them visible in any order as long as you make all of the ones before visible when you take an interrupt.
There’s no register rename involved in the copies, you’re not making one vector register live for the duration of the copy. This is especially important for smaller cores, where you might not bother with register renaming on the vector registers (there are enough vector registers to keep the pipelines full without it and the vector register state is huge).
For very large copies, if they miss in cache you can punt them lower in the memory subsystem.
The last point is the same motivation as atomic read-modify-write sequences. If you have something like CXL memory, you have around a 1500 cycle latency. If you need to do an atomic operation by pulling that data into the cache and operating on it, it will almost certainly back-pressure the pipeline. If you can just send an instruction to the remote memory controller to do it, the CPU can treat it as retired (unless it needs to do something with the result). If you’re doing a copy of a page in CXL memory (e.g. a CoW fault after fork), you can just send a message telling the remote controller to do the copy and, at the same time, read a small subset of the values at the source into the cache for the destination. Some CXL memory controllers do deduplication (very useful if you have eight cloud nodes using one CXL memory device and all running multiple copies of the same VM images) and so having an explicit copy makes this easy: the copy is just a metadata update that adds a mapping table entry and updates a reference count.
I also missed that it was in POSIX, but the equivalent functionality has been in BSD libcs for a long time. It’s how things like asprintf are implemented. The example on the POSIX site is logically how it works, though the real code allocates the FILE on the stack and initialises it (avoiding the second heap allocation), which you can get away with if you are libc.
I’m slightly surprised that earlier BSDs had their own stdio unrelated to the 7th Edition stdio: I thought there was more sharing at that time … but I suppose the BSD / AT&T USL / Bell Labs Research unix divergence had already started before the 7th Edition.
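As a sketch of how that works (this one builds on POSIX open_memstream rather than whatever the BSD libc does internally, so treat it as an illustration of the idea, not the real implementation):

#define _POSIX_C_SOURCE 200809L
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

int my_asprintf(char **out, const char *fmt, ...) {
    size_t size = 0;
    FILE *f = open_memstream(out, &size);   /* FILE backed by a growable buffer */
    if (!f) return -1;
    va_list ap;
    va_start(ap, fmt);
    int n = vfprintf(f, fmt, ap);
    va_end(ap);
    if (fclose(f) != 0 || n < 0) {
        free(*out);
        *out = NULL;
        return -1;
    }
    return (int)size;                        /* bytes written, excluding the NUL */
}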
Question for people with a DB background on mmap vs a manual buffer pool. What would be a typical workload where you would expect mmap to do badly, and how would you implement the buffer pool to fix that?
Searched the document for the word “fast” and it did not turn up.
The thing where dev tooling for other languages is written in Rust and then becomes much much faster… Somebody maybe should be doing that for Rust itself.
The goal of this project is to make it easier to integrate Rust into existing C and C++ projects that are traditionally built with gcc (Linux being the most prominent).
Given that gcc and clang are comparable at compilation speed, with clang maybe being slightly better, I wouldn’t expect this project to improve compilation speed, nor do I believe it should be within scope for this project.
Wow, newlines in filenames being officially deprecated?!
Re. modern C, multithreaded code really needs to target C11 or later for atomics. POSIX now requires C17 support; C17 is basically a bugfix revision of C11 without new features. (hmm, I have been calling it C18 owing to the publication year of the standard, but C23 is published this year so I guess there’s now a tradition of matching nominal years with C++ but taking another year for the ISO approval process…)
Nice improvements to make, and plenty of other good stuff too.
It seems like both C and POSIX have woken up from a multi-decade slumber and are improving much faster than they used to. Have a bunch of old farts retired or something?
Even in standard naming they couldn’t avoid an off-by-one error. ¯\_(ツ)_/¯
I believe the date is normally the year when the standard is ratified. Getting ISO to actually publish the standard takes an unbounded amount of time and no one cares because everyone works from the ratified draft.
As a fellow brit, you may be amused to learn that the BSI shut down the BSI working group that fed WG14 this year because all of their discussions were on the mailing list and so they didn’t have the number of meetings that the BSI required for an active standards group. The group that feeds WG21 (of which I am a member) is now being extra careful about recording attendance.
Unfortunately, there were a lot of changes between the final public draft and the document actually being finished. ISO is getting harsher about this and didn’t allow the final draft to be public. This time around people will probably reference the “first draft” of C2y instead, which is functionally identical to the final draft of C23.
There are a bunch of web sites that have links to the free version of each standard. The way to verify that you are looking at the right one is
look at the committee mailings which include a summary of the documents for a particular meeting
look for the editor’s draft and the editor’s comments (two adjacent documents)
the comments will say if the draft is the one you want
Sadly I can’t provide examples because www.open-std.org isn’t working for me right now :-( It’s been unreliable recently, does anyone know what’s going on?
Or just look at cppreference …
https://en.cppreference.com/w/cpp/language/history
https://en.cppreference.com/w/c/language/history
For C23, Cppreference links to N3301, the most recent C2y draft. Unfortunate that the site is down, so we can’t easily check whether all those June 2024 changes were also made to C23. The earlier C2y draft (N3220) only has minor changes listed. Cppreference also links to N3149, the final WD of C23, which is protected by low quality ZIP encryption.
I think most of open-std is available via the Archive, e.g. here is N3301: https://web.archive.org/web/20241002141328/https://open-std.org/JTC1/SC22/WG14/www/docs/n3301.pdf
For C23 the documents are
I think for C23 the final committee draft was last year but they didn’t finish the ballot process and incorporating the feedback from national bodies until this summer. Dunno how that corresponds to ISO FDIS and ratification. Frankly, the less users of C and C++ (or any standards tbh) have to know or care about ISO the better.
Re modern C: are there improvements in C23 that didn’t come from either C++ or are standardization of stuff existing implementations have had for ages?
It’s best to watch the standard editor’s blog and sometimes twitter for this information.
https://thephd.dev/
https://x.com/__phantomderp
I think the main ones are _BitInt, <stdbit.h>, <stdckdint.h>, #embed
Generally the standard isn’t the place where innovation should happen, though that’s hard to avoid if existing practice is a load of different solutions for the same problem.
They made realloc(ptr, 0) undefined behaviour. Oh, sorry, you said improvements.
I learned about this yesterday in the discussion of rebasing C++26 on C23 and the discussion from the WG21 folks can be largely summarised as ‘new UB, in a case that’s trivial to detect dynamically? WTF? NO!’. So hopefully that won’t make it back into C++.
realloc(ptr, 0) was broken by C99 because, since then, when NULL is returned you can’t tell whether it successfully freed the pointer or whether it failed the way malloc(0) may fail, leaving the old pointer still allocated.
POSIX has changed its specification so realloc(ptr, 0) is obsolescent so you can’t rely on POSIX to save you. (My links to old versions of POSIX have mysteriously stopped working which is super annoying, but I’m pretty sure the OB markers are new.)
C ought to require that malloc(0) returns NULL and (like it was before C99) realloc(ptr,0) is equivalent to free(ptr). It’s tiresome having to write the stupid wrappers to fix the spec bug in every program.
Maybe C++ can fix it and force C to do the sensible thing and revert to the non-footgun ANSI era realloc().
98% sure some random vendor with a representative via one of the national standards orgs will veto it.
In cases like this it would be really helpful to know who are the bad actors responsible for making things worse, so we can get them to fix their bugs.
Alas, I don’t know. I’ve just heard from people on the C committee that certain things would be vetoed by certain vendors.
Oh good grief, it looks like some of the BSDs did not implement C89 properly, and failed to implement realloc(ptr, 0) as free(ptr) as they should have:
FreeBSD 2.2 man page / phkmalloc source
OpenBSD also used phkmalloc; NetBSD’s malloc was conformant with C89 in 1999.
It was already UB in practice. I guarantee that there are C++ compilers / C stdlib implementations out there that together will make 99% of C++ programs that do realloc(ptr, 0) have UB.
Not even slightly true. POSIX mandates one of two behaviours for this case, which are largely compatible. I’ve seen a lot of real-world code that is happy with either of those behaviours but does trigger things that are now UB in C23.
But POSIX is not C++. And realloc(ptr, 0) will never be UB with a POSIX-compliant compiler, since POSIX defines the behavior. Compilers and other standards are always free to define things that aren’t defined in the C standard. realloc(ptr, 0) was UB “in practice” for C due to non-POSIX compilers. They could not find any reasonable behavior for it that would work for every vendor. Maybe there just aren’t enough C++ compilers out there for this to actually be a problem for C++, though.
In general, POSIX does not change the behaviour of compiler optimisations. Compilers are free to optimise based on UB in accordance with the language semantics.
Then make it IB, which comes with a requirement that you document what you do, but doesn’t require that you do a specific thing, only that it’s deterministic.
No, the C++ standards committee just has a policy of not introducing new kinds of UB in a place where they’re trivially avoidable.
C23 does not constrain implementations when it comes to the behavior of realloc(ptr, 0), but POSIX does. POSIX C is not the same thing as standard C. Any compiler that wants to be POSIX-compliant has to follow the semantics laid out by POSIX. Another example of this is function pointer to void * casts and vice versa. UB in C, but mandated by POSIX.
They introduced lots of new UB in C++20, so I don’t believe this.
It doesn’t really list the integers in question, but an interesting article anyway!
“Integer multiplies” in this context means “x64 instruction that treats register content as integers and, among other things, multiplies them”.
When I’ve tried emulating X86-64 on Apple Silicon using QEMU it’s been incredibly slow, like doing ls took like 1-2 seconds. So if these fine people manage to emulate games then I’m very impressed!
QEMU emulation (TCG) is very slow! Its virtue is that it can run anything on anything, but it’s not useful for productivity or gaming. I used to use it to hack around a FEX RootFS as root, and even just downloading and installing packages with dnf was excruciatingly slow.
Emulators that optimize for performance (such as FEX, box64, and Rosetta, and basically every modern game console emulator too) are in a very different league. Of course, the tradeoff is they only support very specific architecture combinations.
As @lina says, QEMU is general. It works a few instructions at a time, generates an IR (TCG IR, which was originally designed for TCC, which was originally an IOCCC entry), does a small amount of optimisation, and emits the result.
Rosetta 2 works on much larger units but, more importantly, AArch64 was designed to support x86 emulation and it can avoid the intermediate representation entirely. Most x86-64 instructions are mapped to 1-2 instructions. The x86-64 register file is mapped into 16 of the AArch64 registers, with the rest used for emulator state.
Apple has a few additional features that make it easier:
They use some of the reserved bits in the flags register for x86-compatible flags emulation.
They implement a TSO mode, which automatically sets the fence bits on loads and stores.
FEX doesn’t (I think) take advantage of these (or possibly does, but only on Apple hardware?), but even without them it’s quite easy (as in, it’s a lot of engineering work, but each bit of it is easy) to translate x86-64 binaries to AArch64. Arm got a few things wrong but both Apple and Microsoft gave a lot of feedback and newer AArch64 revisions have a bunch of extensions that make Rosetta 2-style emulation easy.
RISC-V’s decision to not have a flags register would make this much harder.
There are two more hardware features: SSE denormal handling (FTZ/DAZ) and a change in SIMD vector handling. Those are standardized as FEAT_AFP in newer ARM architectures, but Apple doesn’t implement the standard version yet. The nonstandard Apple version is not usable in FEX due to a technicality in how they implemented it (they made the switch privileged and global, while FEX needs to be able to switch between modes efficiently, unlike Rosetta, and calling into the kernel would be too slow).
FEX does use TSO mode on Apple hardware though, that’s by far the biggest win and something you can’t just emulate performantly if the hardware doesn’t support it. Replacing all the loads/stores with synchronized ones is both slower and also less flexible (fewer addressing modes) so it ends up requiring more instructions too.
Dumb question: is there a reason not to always ahead-of-time compile to the native arch anyway? (I believe that is what RPCS3 does, see the LLVM recompiler option).
As I understand it, that’s more or less what Rosetta 2 does: it hooks into mmap calls and binary translates libraries as they’re loaded. The fact that the mapping is simple means that this can be done with very low latency. It has a separate mode for JIT compilers that works more incrementally. I’m impressed by how well the latter works: the Xilinx tools are Linux Java programs (linked to a bunch of native libraries) and they work very well in Rosetta on macOS, in a Linux VM.
The Dynamo Rio work 20 or so years ago showed that JITs can do better by taking advantage of execution patterns. VirtualPC for Mac did this kind of thing to avoid the need to calculate flags (which were more expensive on PowerPC) when they weren’t used. In contrast, Apple Silicon simply makes it sufficiently cheap to calculate the flags that this is not needed.
Rosetta does do this, but you have to support runtime code generation (that has to be able to interact with AOT-generated code) at minimum because of JITs (though ideally a JIT implementation should check to see if it is being translated and not JIT), but also if you don’t support JIT translating you can get a huge latency spike/pause when a new library is loaded.
So no matter what you always have to support some degree of runtime codegen/translation, so it’s just a question of can you get enough of a win from an AOT as well as the runtime codegen to justify the additional complexity.
Ignore the trashy title, this is actually really neat. They offload IO to separate threads which means the main thread now gets commands in batches; so the main thread can interleave the data structure traversals for multiple keys from the batch, so it can make much better use of the memory system’s concurrency.
That’s similar to the famous talk by Gor Nishanov about using coroutines to interleave multiple binary searches.
Can’t ignore the trashy title, it’s spam.
I wonder whether anyone has tried to fold system load into the concept of time. I.e. time flows slower if the system is under load from other requests.
It sounds like you’re looking for queue management algorithms, such as CoDel, PIE or BBR.
Slightly surprised there’s no mention of the seemingly-similar Prolly Tree. (At least, not in the first half of the paper…)
Oops, that’s a total mistake on our part! I’ve (re)added a reference (it was lost along the way somehow) and included it back in our latter discussion. Thanks for pointing out that omission!
That’s another very cute B-tree-like data structure. Thanks for the pointer!
The funny thing about the airline example is that airlines do overcommit tickets, and they will ask people not to board if too many people show up…
If anything, this speaks more to how badly X fits modern use cases than anything. There are lots of reasons that the locker can die (not only OOM), but the fact that this can “unlock” your desktop is the actual absurd part.
This would be impossible in Wayland, for example.
If wayland was good, we’d all be using it by now. It has had so, so much time to prove itself.
My experience with wayland has been:
I think the creators of Wayland have done us a disservice, by convincing everyone it is the way forward, while not actually addressing all the use cases swiftly and adequately, leaving us all in window manager limbo for two decades.
Maybe my opinion will change when I upgrade to Plasma 6. Although, if you search this page for “wayland” you see a lot of bugs…
Your information might be outdated: not only is it the default in Plasma 6 and GNOME 46, but they’ve actually worked to allow compiling them with zero Xorg support. I believe a lot of distros are now not only enabling it by default but have express plans to no longer ship Xorg at all outside of xwayland.
Keep in mind that I gave Wayland as an example of how they should have fixed the issue (e.g.: having a protocol where if the locker fails, the session is just not opened to everyone).
My experience with Wayland is that it works for my use cases. I can understand the frustration of not working for yours (I had a similar experience 5 years ago, but since switching to Sway 2 years ago it seems finally good enough for me), but this is not a “Wayland is good and X is bad”, it is “X is not designed for modern use cases”.
Yeah I realize I’m changing the topic. Your point stands.
This is a thread about OS kernels and memory management. There are lots of us who use Linux but don’t need a desktop environment there. With that in mind, please consider saving the Wayland vs X discussion for another thread.
Lone nerd tries to stop nerd fight. Gets trampled. News at 11
By the same logic, we could argue that:
If Linux was any good, we’d all be using it by now.
If Dvorak was any good, we’d all be typing on it by now.
If the metric system was any good, the US would be using it by now.
None of the above examples are perfect, I just want to insist that path dependence is a thing. Wayland, being so different from X, introduced many incompatibilities, so it had much inertia to overcome right from the start. People need clear, substantial, and immediate benefits to consider paying even small switching costs, and Wayland’s are pretty significant.
I think the logic works fine:
I think Wayland sucked a lot. And it has finally started to be good enough to get people to switch. And I’m mad that it took so long.
Except on the desktop. You could argue it’s just one niche among many, but it remains a bloody visible one.
Hmm, you’re effectively saying that no layout is very good, and all have extremely low value-add… I’m not sure I believe that, even if we ignore chording layouts that let stenotypists type faster than human speech: really, we can’t do significantly better than what was effectively a fairly random layout?
I call bullshit on this one. Last time I went there it was all about miles and inches and square feet. Scientists may use it, but day to day you’re still stuck with the imperial system, even down to your standard measurements: wires are gauged, your wood is 2 by 4 inches, even your screws use imperial threads.
Oh, and there was this Mars probe that went crashing down because of an imperial/metric mismatch. I guess they converted everything to metric since then, but just think of what it took to do that even for this small, highly technical niche.
That being said…
I can believe it did (I don’t have a first hand opinion on this).
Just on a detail, I believe it doesn’t matter to most people what their keyboard layout is, and I’ve wasted a lot of time worrying about it. A basically random one like qwerty is just fine. That doesn’t affect your main point though, especially since the example of stenography layouts is a slam dunk. Many people still do transcription using qwerty, and THAT is crazy path-dependence.
Linux isn’t very good on the desktop, speaking as a Linux desktop user since 2004.
The display protocol used by a system built in the bazaar style of development is not a question of design, but of community support/network effect. It can be the very best thing ever, and it won’t matter if no client supports it.
Also, the creators of Wayland are the ex-maintainers of X, it’s not like they were not familiar with the problem at hand. You sometimes have to break backwards compatibility for good.
Seems to be happening though? Disclaimer, self reported data.
The other survey I could find puts Wayland at 8%, but it dates to early 2022.
Sure, it’s good to be finally happening. My point is if we didn’t have Wayland distracting us, a different effort could have gotten us there faster. It’s always the poorly executed maybe-solution that prevents the ideal solution from being explored. Still, I’m looking forward to upgrading to Plasma 6 within the next year or so.
Designing Wayland was a massive effort, it wasn’t just the Xorg team going “we got bored of this and now you have to use the new thing”, they worked very closely with DE developers to design something that wouldn’t make the same mistakes Xorg did.
Meanwhile, Arcan is basically just @crazyloglad and does a far better job of solving the problems with X11 than Wayland ever will.
The appeal to effort argument in the parent comment is just https://imgur.com/gallery/many-projects-GWHoJMj which aptly describes the entire thing.
Being a little smug, https://www.divergent-desktop.org/blog/2020/10/29/improving-x/ has this little thing:
I think we are at 4 competing icon protocols now. Mechanism over policy: https://arcan-fe.com/2019/05/07/another-low-level-arcan-client-a-tray-icon-handler/
The closing bit:
It might even turn out so well that one of these paths will have a fighting chance against the open desktop being further marginalised as a thin client in the Azure clouded future; nothing more than a silhouette behind unwashed Windows, a virtualized ghost of its former self.
That battle is quickly being lost.
The unabridged story behind Arcan should be written down (and maybe even published) during the coming year or so as the next thematic shift is around the corner. That will cover how it’s just me but also not. A lot of people have indirectly used the thing without ever knowing, which is my preferred strategy for most things.
Right now another fellow is on his way from another part of Europe for a hackathon in my fort out in the wilderness.
Arcan does look really cool in the demos, and I’d like to try it, but last time I tried to build it I encountered a compilation bug (and submitted a PR to fix it) and I’ve never been able to get any build of it to give me an actually usable DE.
I’m sure it’s possible, but last time I tried I gave up before I worked out how.
I also wasn’t able to get Wayland to work adequately, but I got further and it was more “this runs but is not very good” instead of “I don’t understand how to build or run this”.
Maybe being a massive effort is not actually a good sign.
Arguably Wayland took so long because it decided to fix issues that didn’t need fixing. Did somebody actually care about a rogue program already running on your desktop being able to capture the screen and clipboard?
edit: I mean, I guess so since they put in the effort. It’s just hard for me to fathom.
Was it a massive effort? Its development starting 20 years ago does not equate to a “massive effort,” especially considering that the first 5 years involved a mere handful of people working on it as a hobby. The remainder of the time was spent gaining enough network effect, rather than technical effort.
Sorry, but this appeal to the effort it took to develop wayland is just embarrassing.
Vaporware does not mean good. Au contraire, it usually means terrible design by committee, as is the case with wayland.
Besides, do you know how much effort it took to develop X?
It’s so tiring to keep reading this genre of comment over and over again, especially given that we have crazyloglad in this community utterly deconstructing it every time.
This is true, but I do think it was also solved in X, although there were only a few implementations, as it required working around X more than using it.
IIRC GNOME and GDM would coordinate so that when you lock your screen it actually switched back to GDM. This way if anything started or crashed in your session it wouldn’t affect the screen locking. And if GDM crashed it would just restart without granting any access.
That being said it is much simpler in Wayland where the program just declares itself a screen locker and everything just works.
Crashing the locker hasn’t been a very good bypass route in some time now (see e.g. xsecurelock, which is more than 10 years old, I think).
xlock, the program mentioned in the original mailing list thread, is literally 1980s software. X11 screen lockers do have a lot of other problems (e.g. input grabbing), primarily because, unlike Wayland, X11 doesn’t really have a lock protocol, so screen lockers mostly play whack-a-mole with other clients. Technically Wayland doesn’t have one, either, as the session lock protocol is in staging, but I think most Wayland lockers just go with that.
Unfortunately, last time I looked at it, Wayland delegated a lot of responsibilities to third parties, too. E.g. session lock state is usually maintained by the compositor (or at least that’s what a lot of ad-hoc solutions developed prior to the current session lock protocol did). Years ago, “resilient” schemes that tried to restart the compositor if it crashed routinely suffered from the opposite problem: crashing the screen locker was fine, but if the OOM killer reaped the compositor, the session got unlocked.
I’d have thought there was a fairly narrow space where CPU SIMD mattered in game engines. Most of the places where SIMD will get an 8x speedup, GPU offload will get a 1000x speedup. This is even more true on systems with a unified memory architecture where there’s much less cost in moving data from CPU to GPU. It would be nice for the article to discuss this.
Early SIMD (MMX, 3DNow, SSE1) on mainstream CPUs typically gave a 10-30% speedup in games, but then adding a dedicated GPU more than doubled the resolution and massively increased the rendering quality.
An earlier article from the same author mentions that to switch to the GPU they’d have to change the algorithm.
The big advantage of CSV and TSV is that you can edit them in a text editor. If you’re putting non-printing characters in as field separators you lose this. If you don’t need that property then there are a lot of better options up to, and including, sqlite databases.
Obvious solution is to put non-printing characters on the keyboard
…APL user?
Close; he uses J.
And after some time, people would start using them for crazy stuff that no one anticipated and this solution wouldn’t work anymore 👌
Though I suppose that it has the advantage of not coming with any meaning pre-loaded into it. Yet. If we use these delimiter tokens for data files then people will be at least slightly discouraged from overloading them in ways that break those files.
Also, grep works on both CSV and TSV, which is very useful … it won’t end up printing crap to your terminal. diff and git merge can work to a degree as well.
Bytes and text are essential narrow waists :) I may change this to “M x N waist” to be more clear. A text editor or grep is one of M, and TSV is one of N. If you have an arbitrary DSV, you’re not really in that world any more – now you need to write your own tools.
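To make “write your own tools” concrete, here is a minimal, hypothetical C sketch (my own illustration, not from the thread) that converts \x1f/\x1e-delimited input into TSV so the usual line-oriented tools can be used downstream:

```c
/* Hypothetical dsv2tsv: convert \x1f/\x1e-delimited input to TSV on stdout.
 * Minimal sketch: no quoting or escaping, assumes fields contain no tabs
 * or newlines. */
#include <stdio.h>

int main(void) {
    int c;
    while ((c = getchar()) != EOF) {
        if (c == 0x1F)          /* unit separator -> field boundary (tab)     */
            putchar('\t');
        else if (c == 0x1E)     /* record separator -> end of record (newline) */
            putchar('\n');
        else
            putchar(c);
    }
    return 0;
}
```

It deliberately punts on quoting and escaping, which is exactly the trade-off being debated in this subthread.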
FWIW I switched from CSV to TSV, because the format is much simpler. As far as I can tell, there is exactly one TSV format, but multiple different CSV formats in practice. There’s less room for misunderstanding.
Do you? I believe awk and tr deal with it just fine: e.g. tr to convert from DSV to TSV for printing, and awk for selecting single columns and printing them as TSV. Also, I think grep shouldn’t have any problems either; it should pass the non-printable characters through as-is?
grep (GNU grep 3.11) does pass the non-printables through but doesn’t recognise \x1e as a line separator (and has no option to specify that either), which means you get the whole wash of data whatever you search for. You’d have to pipe it through tr to swap \x1e for \n before grep.
Fair, I didn’t know. You can use awk as a grep substitute though.
It’s cool that that works, but I’d argue it is indeed a case of writing your own tools! Compare with the plain TSV case. And there are more tools, like head and tail and shuf. xargs -0 and find -print0 actually have the same problem – I pointed this out somewhere on https://www.oilshell.org. It kind of “infects” into head -0, tail -0, sort -0, … which are sometimes spelled sort -z, etc.
The Oils solution is “TSV8” (not fully implemented) – basically you can optionally use JSON-style strings within TSV cells. So head tail grep cat awk cut work for “free”. But if you need to represent something with tabs or with \x1f, you can. (It handles arbitrary binary data, which is a primary rationale for the J8 Notation upgrade of JSON – https://www.oilshell.org/release/latest/doc/j8-notation.html)
I don’t really see the appeal of \x1f because it just “pushes the problem around”. Now instead of escaping tab, you have to escape \x1f. In practice, TSV works very well for me – I can do nearly 100% of my work without tabs. If I need them, then there’s TSV8 (or something different like sqlite).
head can be done in awk; the rest would likely require converting the newline-terminated output to zero-terminated output and passing it to the zero-terminated version of themselves. With both DSV and zero-terminated commands, I’d make a bunch of aliases and call it a day. I guess that counts as “writing your own tools”, but I end up turning commonly used commands into functions and scripts anyway, so I don’t see it as a great burden. I guess to each their own workflow.
The other major advantage is the ubiquity of the format. You lose a lot of tools if you aren’t using the common formats.
Interesting stuff for certain types of number-crunching nerds. It’s impressive what AMD’s pulled off here.
Or if you just want an ironic laugh rather than anything too useful, open it up and ^F VP2INTERSECT.
Not just number crunching. Double shuffles are a big deal for certain kinds of string processing.
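For a sense of what that looks like, here is a hedged sketch (my own illustration, assuming AVX-512VBMI is available) of a “double shuffle” used as a 128-entry byte lookup table, e.g. for bulk ASCII case conversion:

```c
/* Hedged sketch (not from the thread): a "double shuffle"
 * (_mm512_permutex2var_epi8, i.e. vpermi2b/vpermt2b, AVX-512VBMI) used as a
 * 128-entry byte lookup table -- here, upper-casing 64 ASCII bytes at once.
 * Build with something like: gcc -O2 -march=icelake-client example.c */
#include <immintrin.h>
#include <stdint.h>

void toupper_ascii_64(const uint8_t *in, uint8_t *out) {
    uint8_t lut[128];
    for (int i = 0; i < 128; i++)
        lut[i] = (i >= 'a' && i <= 'z') ? (uint8_t)(i - 32) : (uint8_t)i;

    __m512i lo  = _mm512_loadu_si512(lut);       /* table entries 0..63   */
    __m512i hi  = _mm512_loadu_si512(lut + 64);  /* table entries 64..127 */
    __m512i idx = _mm512_loadu_si512(in);        /* bytes double as indices
                                                    (assumes ASCII input)  */

    /* Bits [6:0] of each input byte select one of the 128 table bytes,
       spanning both source registers in a single instruction. */
    __m512i res = _mm512_permutex2var_epi8(lo, idx, hi);
    _mm512_storeu_si512(out, res);
}
```

The point is that one shuffle can index into a table spanning two vector registers, which is what makes byte-granularity table lookups cheap for string work.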
I appreciate that PureScript has ado for applicative do-notation instead of overloading do + a language pragma, & doesn’t need something complicated trying to guess what can be done in parallel. Seems this offers some speedups, but now you have implicit things going on, & do for monads is supposed to be about expressing sequential data flow.
Generally, execution is implicit in Haskell, so this is in line with the culture / philosophy of the language. Though I agree with you that having explicit control over how things are executed is often necessary to engineer things that behave predictably.
Also, Haskell is one of the few languages that lets the programmer hook into the optimizer with rewrite rules. So this paper can be seen as equivalent to a rewrite rule that rewrites monad binds into applicative <*> for performance reasons. (There’s also other precedent for this kind of high-level optimization in Haskell, e.g. stream fusion.)
In the quoted snippet, you can generate the mask by reading from a buffer at an offset instead. It will save you a couple of instructions.
I feel like this line is missing an addition of 16? And also probably a cast or two so that the pointer arithmetic works out correctly?
why trying to do this crap in c is always a mistake
e: ‘better’ (no overread):
of course overread is actually fine…
e: of course this comes from somewhere - found this dumb proof of concept / piece of crap i made forever ago https://files.catbox.moe/pfs2qu.txt (point to avoid page faults; no other ‘bounds’ here..)
Correct. I shouldn’t write comments late in the evening…
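For anyone following along, here is a hedged sketch of the general “read the mask from a buffer at an offset” trick being discussed; the table layout and offsets are illustrative, not the snippet from the thread:

```c
/* Hedged sketch: load a partial-vector byte mask from a constant table
 * instead of computing it. mask_table + (16 - n) yields n leading 0xFF bytes.
 * Requires SSE2; layout is illustrative. */
#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

static const uint8_t mask_table[32] = {
    0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
    0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
    0,    0,    0,    0,    0,    0,    0,    0,
    0,    0,    0,    0,    0,    0,    0,    0,
};

/* Returns a mask whose first n bytes are 0xFF (0 <= n <= 16). */
static inline __m128i first_n_bytes_mask(size_t n) {
    return _mm_loadu_si128((const __m128i *)(mask_table + (16 - n)));
}
```

The “16 - n” offset and the pointer cast are where the missing “addition of 16” and casts mentioned above come in.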
Interesting, I might try this out on some of the other tasks on HighLoad. It would be nice if we could get some hard evidence that engaging the stream prefetcher by interleaving is actually the reason for the perf gain. Otherwise, I don’t buy explanations like these at face value, especially when virtual memory is involved.
One thing that makes it tricky to optimize on the platform is that you have limited ways to exfiltrate profiling data. Your only way of introspecting the system is to dump info to stderr.
If you have an Intel CPU, can’t you try it locally? You can just pipe /dev/urandom into your program. Maybe run the program under perf (or whatever the equivalent is on macOS or Windows) to count cache misses?
You can, but you need to have the same hardware as the target platform (in this case a Haswell server). I have a Coffee Lake desktop processor, so whatever results I get locally will not be representative.
I was thinking of running both the streaming and interleaved versions on the same machine to see if you could replicate the same speedup (even on a different microarchitecture).
That said, I looked up the phrase mentioned in the blog post in the January 2023 version of the Intel Optimization Manual, and I could only find it under “EARLIER GENERATIONS OF INTEL® 64” for Sandy Bridge. Not sure it would apply to Skylake/Coffee Lake.
Good point, this piqued my interest. I know what I’ll be doing next weekend :P
This seems wrong? The if ((*dst++ = *src++) == '\0') branch should be very predictable and shouldn’t hinder the CPU. What I believe is happening is that splitting the strlen and memcpy loops removes the early exit from the memcpy loop, allowing the autovectorizer to kick in. Godbolt seems to confirm my thesis, with bespoke_strlcpy being vectorized at -O3 by gcc 14.
That’s a great discussion. When you do the strlen, you’re streaming data in, so you’ll hit fast paths in the prefetch and, importantly, the termination branch will be predicted not taken so you can run hundreds of instructions forward, skipping cache misses, and do it with a very high degree of parallelism. Eventually, you’ll resolve the branch that escapes the loop and unwind, which comes with a fixed cost.
After that, the cache will be warm. The most important thing here is that you don’t need to reorder the stores. You’re writing full cache lines with data that’s already in cache and so you’ll do a little bit of reassembly and then write entire new lines. Modern caches do allocate in place, so if you store a whole line of data they won’t fetch from memory. If the source and destination are differently aligned, one cache miss in the middle can cause the stores to back up in the store queue waiting for the cache and backpressure the entire pipeline.
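For reference, a hedged sketch of the two shapes being compared (illustrative code and names, not the article’s exact bespoke_strlcpy):

```c
/* Hedged sketch of the two shapes under discussion. The byte-at-a-time loop
 * has an early exit on every iteration; the split version gives the compiler
 * a straight-line, bounded copy it can vectorize. */
#include <stddef.h>
#include <string.h>

/* Copy-with-early-exit: every iteration may be the last. */
size_t strlcpy_loop(char *dst, const char *src, size_t size) {
    size_t i = 0;
    if (size) {
        for (; i + 1 < size && src[i] != '\0'; i++)
            dst[i] = src[i];
        dst[i] = '\0';
    }
    while (src[i] != '\0')      /* keep scanning to return strlen(src) */
        i++;
    return i;
}

/* Split version: measure first, then do a bounded memcpy. */
size_t strlcpy_split(char *dst, const char *src, size_t size) {
    size_t len = strlen(src);
    if (size) {
        size_t n = len < size - 1 ? len : size - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return len;
}
```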
Do modern CPUs do any coalescing of stores? I.e. would a run of small adjacent stores covering a cache line be coalesced into a single cacheline-sized store somehow?
Yes, there’s a small store queue. If a full cache line of data is in the store queue, you avoid the load from memory.
Thanks. Do the stores need to be aligned to be coalesced? I.e. does this, for r8 not aligned to cacheline, avoid one of the 3 loads? If not, are the alignment requirements documented somewhere for x64/aarch64?
It will vary a lot across microarchitectures. Generally, stores that span a cache line will consume multiple entries in a store queue and then be coalesced, but the size of the store queue may vary both in the width of entries and the number. I’m also not sure about the amount of reordering that’s permitted on x86 with TSO; that may require holding up the coalesced store of the middle line until the load of the first one has happened. There’s also some complexity in that Intel chips often fill pairs of cache lines (but evict individual ones), so the miss on the first may still bring the middle line into LLC and then be replaced.
TL;DR: Computers are weird.
Dumb followup question: do you think that Torvalds’ argument that the ISA should have some kind of rep mov that can skip some of the CPU-internal machinery (store coalescing, but maybe the register file too?) holds water? (It doesn’t need to be rep mov; you can imagine a limited version that requires cacheline alignment of the source, destination, and size, or even a single-cacheline copy.)
Yes. I was one of the reviewers for Arm’s version, which is designed to avoid the complex microcode requirements of rep movsb. In the Arm version, each memcpy operation is split into three instructions. On a complex out-of-order machine, you’ll probably treat the first and last as NOPs and just do everything in the middle, but in a simpler design the first and last can handle unaligned starts and ends and the middle one can do a bulk copy.
The bulk copy can be very efficient if it’s doing a cache line at a time. Even if the source and destination have different alignment, if you can load two cache lines and then fill one from overlapping bits, you guarantee that you never need to read anything from the target. There are a bunch of other advantages:
The last point is the same motivation as atomic read-modify-write sequences. If you have something like CXL memory, you have around a 1500 cycle latency. If you need to do an atomic operation by pulling that data into the cache and operating on it, it will almost certainly back pressure the pipeline. If you can just send an instruction to the remote memory controller to do it, the CPU can treat it as retired (unless it needs to do something with the result). If you’re doing a copy of a page in CXL memory (e.g. a CoW fault after fork), you can just send a message telling the remote controller to do the copy and, at the same time, read a small subset values at the source into the cache for the destination. Some CXL memory controllers do deduplication (very useful if you have eight cloud nodes using one CXL memory device and all running multiple copies of the same VM images) and so having an explicit copy makes this easy: the copy is just a metadata update that adds a mapping table entry and updates a reference count.
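To illustrate the head/bulk/tail split described above in plain C (purely a sketch of the shape, not the semantics of Arm’s actual instructions):

```c
/* Hedged sketch of the prologue/main/epilogue shape: handle the unaligned
 * start, copy whole 64-byte blocks aligned to the destination, then handle
 * the unaligned tail. Assumes non-overlapping buffers. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

void copy3(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Prologue: copy up to the destination's next 64-byte boundary. */
    size_t head = (64 - ((uintptr_t)d & 63)) & 63;
    if (head > n) head = n;
    memcpy(d, s, head);
    d += head; s += head; n -= head;

    /* Main: destination-aligned whole cache lines. */
    while (n >= 64) {
        memcpy(d, s, 64);
        d += 64; s += 64; n -= 64;
    }

    /* Epilogue: the unaligned tail. */
    memcpy(d, s, n);
}
```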
yes ‘write combining’
the main thing you want is to avoid actually paging in the cache line in question
People that want a “string builder” interface in C might want to check open_memstream.
Holy shit, I had not noticed that POSIX had grown that functionality!
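For reference, a minimal sketch of open_memstream() used as a string builder (error handling elided):

```c
/* Minimal sketch of open_memstream() (POSIX.1-2008) as a string builder:
 * stdio writes grow a malloc'd buffer; fclose() finalizes buf and len.
 * Error handling elided for brevity. */
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    char *buf = NULL;
    size_t len = 0;
    FILE *f = open_memstream(&buf, &len);

    fprintf(f, "hello");
    fprintf(f, ", %s!", "world");
    fclose(f);                              /* flushes and sets buf/len */

    printf("%s (%zu bytes)\n", buf, len);   /* -> hello, world! (13 bytes) */
    free(buf);
    return 0;
}
```

The buffer is heap-allocated by the implementation, so the caller frees it after closing the stream.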
The more flexible precursors are funopen and the badly-named fopencookie.
POSIX also has fmemopen which can read or write with a fixed-sized buffer, where open_memstream() is write-only to a dynamically growing buffer.
Naming things is hard, but come on. :P
I also missed that it was in POSIX, but the equivalent functionality has been in BSD libcs for a long time. It’s how things like asprintf are implemented. The example on the POSIX site is logically how it works, though the real code allocates the FILE on the stack and initialises it (avoiding the second heap allocation), which you can get away with if you are libc.
Yeah, funopen dates from 4.4BSD, a little bit too late to be in the more widely-copied 4.3BSD but years before fopencookie.
I’m slightly surprised that earlier BSDs had their own stdio unrelated to the 7th Edition stdio: I thought there was more sharing at that time … but I suppose the BSD / AT&T USL / Bell Labs Research unix divergence had already started before the 7th Edition.
Possibly relevant article that uses C + tail-call optimization to implement some of the same optimizations.
Question for people with a DB background on mmap vs a manual buffer pool: what would be a typical workload where you would expect mmap to do badly, and how would you implement the buffer pool to fix that?