Searched the document for the word “fast” and it did not turn up.
The thing where dev tooling for other languages is written in Rust and then becomes much much faster… Somebody maybe should be doing that for Rust itself.
The goal of this project is to make it easier to integrate Rust into existing C and C++ projects that are traditionally built with gcc (Linux being the most prominent).
Given that gcc and clang are comparable at compilation speed, with clang maybe being slightly better, I wouldn’t expect this project to improve compilation speed, nor do I believe it should be within scope for this project.
Wow, newlines in filenames being officially deprecated?!
Re. modern C, multithreaded code really needs to target C11 or later for atomics. POSIX now requires C17 support; C17 is basically a bugfix revision of C11 without new features. (hmm, I have been calling it C18 owing to the publication year of the standard, but C23 is published this year so I guess there’s now a tradition of matching nominal years with C++ but taking another year for the ISO approval process…)
Nice improvements to make, and plenty of other good stuff too.
It seems like both C and POSIX have woken up from a multi-decade slumber and are improving much faster than they used to. Have a bunch of old farts retired or something?
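To make the C11 atomics point above concrete, here is a minimal sketch of the facilities being referred to (it assumes the optional <threads.h> is available; with pthreads the atomic part looks the same):

#include <stdatomic.h>
#include <threads.h>   /* optional even in C17: check __STDC_NO_THREADS__ */

static atomic_int counter;   /* static atomics are zero-initialised */

static int worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++)
        atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
    return 0;
}

int main(void) {
    thrd_t t[4];
    for (int i = 0; i < 4; i++) thrd_create(&t[i], worker, NULL);
    for (int i = 0; i < 4; i++) thrd_join(t[i], NULL);
    return atomic_load(&counter) == 400000 ? 0 : 1;
}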
I have been calling it C18 owing to the publication year of the standard, but C23 is published this year so I guess there’s now a tradition of matching nominal years with C++
I believe the date is normally the year when the standard is ratified. Getting ISO to actually publish the standard takes an unbounded amount of time and no one cares because everyone works from the ratified draft.
As a fellow brit, you may be amused to learn that the BSI shut down the BSI working group that fed WG14 this year because all of their discussions were on the mailing list and so they didn’t have the number of meetings that the BSI required for an active standards group. The group that feeds WG21 (of which I am a member) is now being extra careful about recording attendance.
Unfortunately, there were a lot of changes between the final public draft and the document actually being finished. ISO is getting harsher about this and didn’t allow the final draft to be public. This time around people will probably reference the “first draft” of C2y instead, which is functionally identical to the final draft of C23.
There are a bunch of web sites that have links to the free version of each standard. The way to verify that you are looking at the right one is
look at the committee mailings which include a summary of the documents for a particular meeting
look for the editor’s draft and the editor’s comments (two adjacent documents)
the comments will say if the draft is the one you want
Sadly I can’t provide examples because www.open-std.org isn’t working for me right now :-( It’s been unreliable recently, does anyone know what’s going on?
For C23, Cppreference links to N3301, the most recent C2y draft. Unfortunate that the site is down, so we can’t easily check whether all those June 2024 changes were also made to C23. The earlier C2y draft (N3220) only has minor changes listed. Cppreference also links to N3149, the final WD of C23, which is protected by low quality ZIP encryption.
I think for C23 the final committee draft was last year but they didn’t finish the ballot process and incorporating the feedback from national bodies until this summer. Dunno how that corresponds to ISO FDIS and ratification. Frankly, the less users of C and C++ (or any standards tbh) have to know or care about ISO the better.
Re modern C: are there improvements in C23 that didn’t come from either C++ or are standardization of stuff existing implementations have had for ages?
I think the main ones are _BitInt, <stdbit.h>, <stdckdint.h>, #embed
Generally the standard isn’t the place where innovation should happen, though that’s hard to avoid if existing practice is a load of different solutions for the same problem.
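A rough sketch of what those C23 additions look like in use (compiler support still varies, so treat this as illustrative rather than something every toolchain accepts today):

#include <stdckdint.h>
#include <stdbit.h>
#include <stdio.h>

int main(void) {
    /* <stdckdint.h>: checked arithmetic, returns true on overflow */
    int sum;
    if (ckd_add(&sum, 2000000000, 2000000000))
        puts("overflow detected");

    /* <stdbit.h>: type-generic bit utilities */
    printf("ones: %u\n", (unsigned)stdc_count_ones(0xF0F0u));

    /* _BitInt: integers of an exact bit width */
    unsigned _BitInt(12) tiny = 4095uwb;
    printf("tiny: %u\n", (unsigned)tiny);

    /* #embed (not used here) splats a file's bytes into an initializer:
       static const unsigned char logo[] = {
       #embed "logo.png"
       };                                                               */
    return 0;
}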
They made realloc(ptr, 0) undefined behaviour. Oh, sorry, you said improvements.
I learned about this yesterday in the discussion of rebasing C++26 on C23 and the discussion from the WG21 folks can be largely summarised as ‘new UB, in a case that’s trivial to detect dynamically? WTF? NO!’. So hopefully that won’t make it back into C++.
realloc(ptr, 0) was broken by C99 because, since then, when NULL is returned you can’t tell whether it successfully freed the pointer or whether it failed the way malloc(0) may fail, leaving the old pointer still allocated.
POSIX has changed its specification so realloc(ptr, 0) is obsolescent so you can’t rely on POSIX to save you. (My links to old versions of POSIX have mysteriously stopped working which is super annoying, but I’m pretty sure the OB markers are new.)
C ought to require that malloc(0) returns NULL and (like it was before C99) realloc(ptr,0) is equivalent to free(ptr). It’s tiresome having to write the stupid wrappers to fix the spec bug in every program.
Maybe C++ can fix it and force C to do the sensible thing and revert to the non-footgun ANSI era realloc().
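For anyone who hasn’t hit this, the kind of wrapper being complained about looks something like this sketch (the xrealloc name and the abort-on-failure policy are just one common choice):

#include <stdlib.h>

void *xrealloc(void *ptr, size_t size) {
    if (size == 0) {              /* restore the pre-C99 meaning: free and be done */
        free(ptr);
        return NULL;
    }
    void *p = realloc(ptr, size);
    if (p == NULL)                /* genuine allocation failure; ptr is still valid */
        abort();
    return p;
}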
C ought to require that malloc(0) returns NULL and (like it was before C99) realloc(ptr,0) is equivalent to free(ptr). It’s tiresome having to write the stupid wrappers to fix the spec bug in every program.
98% sure some random vendor with a representative via one of the national standards orgs will veto it.
In cases like this it would be really helpful to know who are the bad actors responsible for making things worse, so we can get them to fix their bugs.
It was already UB in practice. I guarantee that there are C++ compilers / C stdlib implementations out there that together will make 99% of C++ programs that do realloc(ptr, 0) have UB.
Not even slightly true. POSIX mandates one of two behaviours for this case, which are largely compatible. I’ve seen a lot of real-world code that is happy with either of those behaviours but does trigger things that are now UB in C23.
But POSIX is not C++. And realloc(ptr, 0) will never be UB with a POSIX-compliant compiler, since POSIX defines the behavior. Compilers and other standards are always free to define things that aren’t defined in the C standard. realloc(ptr, 0) was UB “in practice” for C due to non-POSIX compilers. They could not find any reasonable behavior for it that would work for every vendor. Maybe there just aren’t enough C++ compilers out there for this to actually be a problem for C++, though.
And realloc(ptr, 0) will never be UB with a POSIX-compliant compiler
In general, POSIX does not change the behaviour of compiler optimisations. Compilers are free to optimise based on UB in accordance with the language semantics.
They could not find any reasonable behavior for it that would work for every vendor
Then make it IB, which comes with a requirement that you document what you do, but doesn’t require that you do a specific thing, only that it’s deterministic.
Maybe there just aren’t enough C++ compilers out there for this to actually be a problem for C++, though.
No, the C++ standards committee just has a policy of not introducing new kinds of UB in a place where they’re trivially avoidable.
In general, POSIX does not change the behaviour of compiler optimisations. Compilers are free to optimise based on UB in accordance with the language semantics.
C23 does not constrain implementations when it comes to the behavior of realloc(ptr, 0), but POSIX does. POSIX C is not the same thing as standard C. Any compiler that wants to be POSIX-compliant has to follow the semantics laid out by POSIX. Another example of this is function pointer to void * casts and vice versa. UB in C, but mandated by POSIX.
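For reference, the function-pointer case shows up most visibly with dlsym(); something like this sketch is undefined by the letter of ISO C but required to work by POSIX (the library name is purely illustrative, and older glibc needs -ldl at link time):

#include <dlfcn.h>
#include <stdio.h>

int main(void) {
    void *handle = dlopen("libm.so.6", RTLD_NOW);
    if (!handle) return 1;
    /* dlsym returns void *, which must be converted to a function pointer */
    double (*cosine)(double) = (double (*)(double))dlsym(handle, "cos");
    if (cosine) printf("cos(0) = %f\n", cosine(0.0));
    dlclose(handle);
    return 0;
}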
No, the C++ standards committee just has a policy of not introducing new kinds of UB in a place where they’re trivially avoidable.
They introduced lots of new UB in C++20, so I don’t believe this.
When I’ve tried emulating X86-64 on Apple Silicon using QEMU it’s been incredibly slow, like doing ls took like 1-2 seconds. So if these fine people manage to emulate games then I’m very impressed!
QEMU emulation (TCG) is very slow! Its virtue is that it can run anything on anything, but it’s not useful for productivity or gaming. I used to use it to hack around a FEX RootFS as root, and even just downloading and installing packages with dnf was excruciatingly slow.
Emulators that optimize for performance (such as FEX, box64, and Rosetta, and basically every modern game console emulator too) are in a very different league. Of course, the tradeoff is they only support very specific architecture combinations.
As @lina says, QEMU is general. It works a few instructions at a time, generates an IR (TCG IR, which was originally designed for TCC, which was originally an IOCCC entry), does a small amount of optimisation, and emits the result.
Rosetta 2 works on much larger units but, more importantly, AArch64 was designed to support x86 emulation and it can avoid the intermediate representation entirely. Most x86-64 instructions are mapped to 1-2 instructions. The x86-64 register file is mapped into 16 of the AArch64 registers, with the rest used for emulator state.
Apple has a few additional features that make it easier:
They use some of the reserved bits in the flags register for x86-compatible flags emulation.
They implement a TSO mode, which automatically sets the fence bits on loads and stores.
FEX doesn’t (I think) take advantage of these (or possibly does, but only on Apple hardware?), but even without them it’s quite easy (as in, it’s a lot of engineering work, but each bit of it is easy) to translate x86-64 binaries to AArch64. Arm got a few things wrong but both Apple and Microsoft gave a lot of feedback and newer AArch64 revisions have a bunch of extensions that make Rosetta 2-style emulation easy.
RISC-V’s decision to not have a flags register would make this much harder.
There are two more hardware features: SSE denormal handling (FTZ/DAZ) and a change in SIMD vector handling. Those are standardized as FEAT_AFP in newer ARM architectures, but Apple doesn’t implement the standard version yet. The nonstandard Apple version is not usable in FEX due to a technicality in how they implemented it (they made the switch privileged and global, while FEX needs to be able to switch between modes efficiently, unlike Rosetta, and calling into the kernel would be too slow).
FEX does use TSO mode on Apple hardware though, that’s by far the biggest win and something you can’t just emulate performantly if the hardware doesn’t support it. Replacing all the loads/stores with synchronized ones is both slower and also less flexible (fewer addressing modes) so it ends up requiring more instructions too.
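To illustrate what “replacing all the loads/stores with synchronized ones” means in practice, here is a rough C sketch of how a translator without a hardware TSO mode has to treat guest memory accesses (acquire loads and release stores approximate x86 ordering; the helper names are mine):

#include <stdatomic.h>
#include <stdint.h>

static inline uint64_t guest_load64(_Atomic uint64_t *p) {
    /* lowered to an ordered load (ldar/ldapr-style) with restricted addressing modes */
    return atomic_load_explicit(p, memory_order_acquire);
}

static inline void guest_store64(_Atomic uint64_t *p, uint64_t v) {
    /* lowered to an ordered store (stlr-style), again with restricted addressing */
    atomic_store_explicit(p, v, memory_order_release);
}

With hardware TSO, plain loads and stores already obey the guest’s ordering rules, so the translator keeps the full set of addressing modes and skips the extra instructions.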
them it’s quite easy […] to translate x86-64 binaries to AArch64
[…]
RISC-V’s decision to not have a flags register would make this much harder.
Dumb question: is there a reason not to always ahead-of-time compile to the native arch anyway?
(I believe that is what RPCS3 does, see the LLVM recompiler option).
As I understand it, that’s more or less what Rosetta 2 does: it hooks into mmap calls and binary translates libraries as they’re loaded. The fact that the mapping is simple means that this can be done with very low latency. It has a separate mode for JIT compilers that works more incrementally. I’m impressed by how well the latter works: the Xilinx tools are Linux Java programs (linked to a bunch of native libraries) and they work very well in Rosetta on macOS, in a Linux VM.
The Dynamo Rio work 20 or so years ago showed that JITs can do better by taking advantage of execution patterns. VirtualPC for Mac did this kind of thing to avoid the need to calculate flags (which were more expensive on PowerPC) when they weren’t used. In contrast, Apple Silicon simply makes it sufficiently cheap to calculate the flags that this is not needed.
Rosetta does do this, but you have to support runtime code generation (that has to be able to interact with AOT-generated code) at minimum because of JITs (though ideally a JIT implementation should check to see if it is being translated and not JIT), but also if you don’t support JIT translating you can get a huge latency spike/pause when a new library is loaded.
So no matter what you always have to support some degree of runtime codegen/translation, so it’s just a question of can you get enough of a win from an AOT as well as the runtime codegen to justify the additional complexity.
Ignore the trashy title, this is actually really neat. They offload IO to separate threads which means the main thread now gets commands in batches; so the main thread can interleave the data structure traversals for multiple keys from the batch, so it can make much better use of the memory system’s concurrency.
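As a toy illustration of the interleaving idea (not their code): advance a whole batch of binary searches one step at a time, so several cache misses can be in flight at once instead of each lookup stalling serially.

#include <stddef.h>

#define BATCH 16

/* lower_bound for BATCH keys over one sorted array, advanced in lock step */
void batched_lower_bound(const int *sorted, size_t n,
                         const int keys[BATCH], size_t out[BATCH]) {
    size_t lo[BATCH], hi[BATCH];
    for (int i = 0; i < BATCH; i++) { lo[i] = 0; hi[i] = n; }
    for (int active = 1; active; ) {
        active = 0;
        for (int i = 0; i < BATCH; i++) {      /* one probe per key per round */
            if (lo[i] < hi[i]) {
                size_t mid = lo[i] + (hi[i] - lo[i]) / 2;
                if (sorted[mid] < keys[i]) lo[i] = mid + 1; else hi[i] = mid;
                active = 1;
            }
        }
    }
    for (int i = 0; i < BATCH; i++) out[i] = lo[i];
}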
I wonder whether anyone has tried to fold system load into the concept of time. I.e. time flows slower if the system is under load from other requests.
Oops, that’s a total mistake on our part! I’ve (re)added a reference (it was lost along the way somehow) and included it back in our latter discussion. Thanks for pointing out that omission!
The original mailing-list thread started when someone came back to their workstation to find it magically unlocked: while they were gone, the system had run out of memory and the OOM killer had chosen to kill the xlock process!
If anything, this speaks more to how badly X fits modern use cases than anything. There are lots of reasons that the locker can die (not only OOM), but the fact that this can “unlock” your desktop is the actual absurd part.
If wayland was good, we’d all be using it by now. It has had so, so much time to prove itself.
My experience with wayland has been:
it’s never the default when I install a popular window manager
every 5 years I see if I should tweak settings to upgrade, and find out that if I do that it’s going to break some core use case, for example gaming, streaming, or even screenshots for god’s sake.
it’s been like 20 years now
I think the creators of Wayland have done us a disservice, by convincing everyone it is the way forward, while not actually addressing all the use cases swiftly and adequately, leaving us all in window manager limbo for two decades.
Maybe my opinion will change when I upgrade to Plasma 6. Although, if you search this page for “wayland” you see a lot of bugs…
it’s never the default when I install a popular window manager
Your information might be outdated: not only is it the default in Plasma 6 and GNOME 46, but they’ve actually worked to allow compiling them with zero Xorg support. I believe a lot of distros are now not only enabling it by default but have express plans to no longer ship Xorg at all outside of xwayland.
If wayland was good, we’d all be using it by now. It has had so, so much time to prove itself.
Keep in mind that I gave Wayland as an example of how they should have fixed the issue (e.g.: having a protocol where if the locker fails, the session is just not opened to everyone).
My experience with Wayland is that it works for my use cases. I can understand the frustration of not working for yours (I had a similar experience 5 years ago, but since switching to Sway 2 years ago it seems finally good enough for me), but this is not a “Wayland is good and X is bad”, it is “X is not designed for modern use cases”.
This is a thread about OS kernels and memory management. There are lots of us who use Linux but don’t need a desktop environment there. With that in mind, please consider saving the Wayland vs X discussion for another thread.
If Linux was any good, we’d all be using it by now.
If Dvorak was any good, we’d all be typing on it by now.
If the metric system was any good, the US would be using it by now.
None of the above examples are perfect, I just want to insist that path dependence is a thing. Wayland, being so different from X, introduced many incompatibilities, so it had much inertia to overcome right from the start. People need clear, substantial, and immediate benefits to consider paying even small switching costs, and Wayland’s are pretty significant.
Except on the desktop. You could argue it’s just one niche among many, but it remains a bloody visible one.
Dvorak isn’t very good (although I personally use it). Extremely low value-add.
Hmm, you’re effectively saying that no layout is very good, and all have extremely low value-add… I’m not sure I believe that, even if we ignore chording layouts that let stenotypists type faster than human speech: really, we can’t do significantly better than what was effectively a fairly random layout?
The metric system is ubiquitous, including in the US. […]
I call bullshit on this one. Last time I went there it was all about miles and inches and square feet. Scientists may use it, but day to day you’re still stuck with the imperial system, even down to your standard measurements: wires are gauged, your wood is 2 by 4 inches, even your screws use imperial threads.
Oh, and there was this Mars probe that went crashing down because of an imperial/metric mismatch. I guess they converted everything to metric since then, but just think of what it took to do that even for this small, highly technical niche.
That being said…
I think Wayland sucked a lot.
I can believe it did (I don’t have a first hand opinion on this).
Just on a detail, I believe it doesn’t matter to most people what their keyboard layout is, and I’ve wasted a lot of time worrying about it. A basically random one like qwerty is just fine. That doesn’t affect your main point though, especially since the example of stenography layouts is a slam dunk. Many people still do transcription using qwerty, and THAT is crazy path-dependence.
The display protocol used by a system built in the bazaar style of development is not a question of design, but of community support/network effect. It can be the very best thing ever, and it won’t matter if no client supports it.
Also, the creators of Wayland are the ex-maintainers of X, it’s not like they were not familiar with the problem at hand. You sometimes have to break backwards compatibility for good.
Sure, it’s good to be finally happening. My point is if we didn’t have Wayland distracting us, a different effort could have gotten us there faster. It’s always the poorly executed maybe-solution that prevents the ideal solution from being explored. Still, I’m looking forward to upgrading to Plasma 6 within the next year or so.
Designing Wayland was a massive effort, it wasn’t just the Xorg team going “we got bored of this and now you have to use the new thing”, they worked very closely with DE developers to design something that wouldn’t make the same mistakes Xorg did.
Take the perspective of a client developer chasing after the tumbleweed of ‘protocols’ drifting around and try to answer ‘what am I supposed to implement and use’? To me it looked like a Picasso painting of ill-fitting and internally conflicted ideas. Let this continue a few cycles more and X11 will look clean and balanced by comparison. Someone should propose a desktop icon protocol for the sake of it; then again, someone probably already has.
It might even turn out so well that one of these paths will have a fighting chance against the open desktop being further marginalised as a thin client in the Azure clouded future; nothing more than a silhouette behind unwashed Windows, a virtualized ghost of its former self.
That battle is quickly being lost.
The unabridged story behind Arcan should be written down (and maybe even published) during the coming year or so as the next thematic shift is around the corner. That will cover how it’s just me but also not. A lot of people have indirectly used the thing without ever knowing, which is my preferred strategy for most things.
Right now another fellow is on his way from another part of Europe for a hackathon in my fort out in the wilderness.
Arcan does look really cool in the demos, and I’d like to try it, but last time I tried to build it I encountered a compilation bug (and submitted a PR to fix it) and I’ve never been able to get any build of it to give me an actually usable DE.
I’m sure it’s possible, but last time I tried I gave up before I worked out how.
I also wasn’t able to get Wayland to work adequately, but I got further and it was more “this runs but is not very good” instead of “I don’t understand how to build or run this”.
Maybe being a massive effort is not actually a good sign.
Arguably Wayland took so long because it decided to fix issues that didn’t need fixing. Did somebody actually care about a rogue program already running on your desktop being able to capture the screen and clipboard?
edit: I mean, I guess so since they put in the effort. It’s just hard for me to fathom.
Was it a massive effort? Its development starting 20 years ago does not equate to a “massive effort,” especially considering that the first 5 years involved a mere handful of people working on it as a hobby. The remainder of the time was spent gaining enough network effect, rather than technical effort.
Sorry, but this appeal to the effort it took to develop wayland is just embarrassing.
Vaporware does not mean good. Au contraire, it usually means terrible design by committee, as is the case with wayland.
Besides, do you know how much effort it took to develop X?
It’s so tiring to keep reading this genre of comment over and over again, especially given that we have crazyloglad in this community utterly deconstructing it every time.
This is true but I do think it was also solved in X. Although there were only a few implementations as it required working around X more than using it.
IIRC GNOME and GDM would coordinate so that when you lock your screen it actually switched back to GDM. This way if anything started or crashed in your session it wouldn’t affect the screen locking. And if GDM crashed it would just restart without granting any access.
That being said it is much simpler in Wayland where the program just declares itself a screen locker and everything just works.
Crashing the locker hasn’t been a very good bypass route in some time now (see e.g. xsecurelock, which is more than 10 years old, I think). xlock, the program mentioned in the original mailing list thread, is literally 1980s software.
X11 screen lockers do have a lot of other problems (e.g. input grabbing) primarily because, unlike Wayland, X11 doesn’t really have a lock protocol, so screen lockers mostly play whack-a-mole with other clients. Technically Wayland doesn’t have one, either, as the session lock protocol is in staging, but I think most Wayland lockers just go with that.
Unfortunately, last time I looked at it Wayland delegated a lot of responsibilities to third-parties, too. E.g. session lock state is usually maintained by the compositor (or in any case, a lot of ad-hoc solutions developed prior to the current session lock protocol did). Years ago, “resilient” schemes that tried to restart the compositor if it crashed routinely suffered from the opposite problem: crashing the screen locker was fine, but if the OOM reaped the compositor, it got unlocked.
I’d have thought there was a fairly narrow space where CPU SIMD mattered in game engines. Most of the places where SIMD will get an 8x speedup, GPU offload will get a 1000x speedup. This is even more true on systems with a unified memory architecture where there’s much less cost in moving data from CPU to GPU. It would be nice for the article to discuss this.
Early SIMD (MMX, 3DNow, SSE1) on mainstream CPUs typically gave a 10-30% speedup in games, but then adding a dedicated GPU more than doubled the resolution and massively increased the rendering quality.
The big advantage of CSV and TSV is that you can edit them in a text editor. If you’re putting non-printing characters in as field separators you lose this. If you don’t need that property then there are a lot of better options up to, and including, sqlite databases.
Though I suppose that it has the advantage of not coming with any meaning pre-loaded into it. Yet. If we use these delimiter tokens for data files then people will be at least slightly discouraged from overloading them in ways that break those files.
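For concreteness, a tiny sketch of what such a delimiter-separated record looks like on the wire, using the ASCII unit separator (0x1F) between fields and the record separator (0x1E) between records:

#include <stdio.h>

int main(void) {
    const char *fields[2][3] = {
        {"alice", "42", "green"},
        {"bob, jr.", "7", "tab\tsafe"},   /* commas and tabs no longer need escaping */
    };
    for (int r = 0; r < 2; r++)
        for (int c = 0; c < 3; c++)
            printf("%s%c", fields[r][c], c < 2 ? '\x1f' : '\x1e');
    return 0;
}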
A text editor or grep is one of M, and TSV is one of N.
If you have an arbitrary DSV, you’re not really in that world any more – now you need to write your own tools.
FWIW I switched from CSV to TSV because the format is much simpler. As far as I can tell, there is exactly one TSV format, but multiple different CSV formats in practice. There’s less room for misunderstanding.
grep (GNU grep 3.11) does pass the non-printables through but doesn’t recognise \x1e as a line separator (and has no option to specify that either) which means you get the whole wash of data whatever you search for.
And there are more tools, like head and tail and shuf.
xargs -0 and find -print0 actually have the same problem – I pointed this out somewhere on https://www.oilshell.org
It kind of “infects” into head -0, tail -0, sort -0, … which are sometimes spelled sort -z, etc.
The Oils solution is “TSV8” (not fully implemented) – basically you can optionally use JSON-style strings within TSV cells.
So head tail grep cat awk cut work for “free”. But if you need to represent something with tabs or with \x1f, you can. (It handles arbitrary binary data, which is a primary rationale for the J8 Notation upgrade of JSON - https://www.oilshell.org/release/latest/doc/j8-notation.html)
I don’t really see the appeal of \x1f because it just “pushes the problem around”.
Now instead of escaping tab, you have to escape \x1f. In practice, TSV works very well for me – I can do nearly 100% of my work without tabs.
If I need them, then there’s TSV8 (or something different like sqlite).
head can be done in awk; the rest likely require converting newline-terminated output to zero-terminated output and passing it to the zero-terminated version of themselves. With both DSV and zero-terminated commands, I’d make a bunch of aliases and call it a day.
I guess that counts as “writing your own tools”, but I end up turning commonly used commands into functions and scripts anyway, so I don’t see it as a great burden. I guess to each their own workflow.
I appreciate that PureScript has ado for applicative do-notation instead of overloading do + a language pragma & doesn’t need to have something complicated trying to guess what can be done in parallel. Seems this offers some speedups but now you have implicit things going on & do for monads is supposed to be about expressing sequential data flow.
Generally execution is implicit in Haskell so this is in line with the culture / philosophy of the language. Though, I agree with you that having explicit control over how things are executed is often necessary to engineer things that behave predictably.
Also, Haskell is one of the few languages that lets the programmer hook into the optimizer with rewrite rules.
So this paper can be seen as equivalent to a rewrite rule that rewrites monad binds into applicative <*> for performance reasons.
(there’s also other precedent for this kind of high level optimizations in Haskell. E.g. stream fusion.)
// Assumes `use core::arch::x86_64::*;` is in scope and that `State` is an
// alias for the 128-bit SSE register type (i.e. `type State = __m128i;`).
#[inline(always)]
pub unsafe fn get_partial_unsafe(data: *const State, len: usize) -> State {
    // Lane indices 0..=15 for the 16 bytes of the vector
    let indices =
        _mm_set_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
    // Create a mask by comparing the indices to the length
    let mask = _mm_cmpgt_epi8(_mm_set1_epi8(len as i8), indices);
    // Mask the bytes that don't belong to our stream
    _mm_and_si128(_mm_loadu_si128(data), mask)
}
You can generate the mask by reading from a buffer at an offset. It will save you a couple of instructions:
e: of course this comes from somewhere - found this dumb proof of concept / piece of crap i made forever ago https://files.catbox.moe/pfs2qu.txt (point to avoid page faults; no other ‘bounds’ here..)
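If I’ve understood the trick, it looks roughly like this in C (a sketch under my own naming, not the linked proof of concept): a 32-byte table of 0xFF then 0x00, loaded at offset 16 - len, yields exactly len leading 0xFF bytes without building the mask from compares.

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

static const uint8_t mask_table[32] = {
    0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
    0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
    0,   0,   0,   0,   0,   0,   0,   0,
    0,   0,   0,   0,   0,   0,   0,   0,
};

/* len must be in 0..16 */
static __m128i get_partial(const __m128i *data, size_t len) {
    __m128i mask = _mm_loadu_si128((const __m128i *)(mask_table + 16 - len));
    return _mm_and_si128(_mm_loadu_si128(data), mask);
}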
Interesting, I might try this out on some of the other tasks on HighLoad. It would be nice if we could get some hard evidence that engaging the stream prefetcher by interleaving is actually the reason for the perf gain. Otherwise, I don’t buy explanations like these at face value, especially when virtual memory is involved.
One thing that makes it tricky to optimize on the platform is that you have limited ways to exfiltrate profiling data. Your only way of introspecting the system is to dump info to stderr.
If you have an Intel CPU, can’t you try it locally? You can just pipe /dev/urandom into your program.
Maybe run the program under perf (or whatever the equivalent is on Mac OS or Windows) to count cache misses?
You can, but you need to have the same hardware as the target platform (in this case a Haswell server). I have a Coffee Lake desktop processor, so whatever results I get locally will not be representative.
I was thinking of running both the streaming and interleaved on the same machine to see if you could replicate the same speedup (even for a different microarchitecture).
That said, I looked up the phrase mentioned in the blog post in the January 2023 version of the Intel Optimization Manual, and I could only find it under “EARLIER GENERATIONS OF INTEL® 64” for Sandy Bridge.
Not sure it would apply to Skylake/Coffee Lake.
In the openbsd version, because the length and copy loop are fused together, whether or not the next byte will be copied depends on the byte value of the previous iteration.
Effectively the cost of this dependency is now not just imposed on the length computation but also on the copy operation. And to add insult to injury, dependencies are not just difficult for the CPU, they are also difficult for the compiler to optimize/auto-vectorize resulting in worse code generation - a compounding effect.
This seems wrong? The if ((*dst++ = *src++) == '\0') branch should be very predictable and shouldn’t hinder the CPU.
What I believe is happening is that splitting the strlen and memcpy loops removes the early exit from the memcpy loop, allowing the autovectorizer to kick in.
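A sketch of the two shapes being compared (simplified, not the actual OpenBSD code): in the fused loop every copied byte feeds the loop-exit test, while the split version lets strlen stream ahead and memcpy run as a branch-free bulk copy.

#include <stddef.h>
#include <string.h>

/* fused: copy and length check in one loop, strlcpy-style return value */
size_t copy_fused(char *dst, const char *src, size_t dsize) {
    size_t i = 0;
    if (dsize != 0) {
        for (; i + 1 < dsize; i++)
            if ((dst[i] = src[i]) == '\0')   /* exit depends on the byte just copied */
                return i;
        dst[i] = '\0';                       /* out of room: truncate */
    }
    while (src[i] != '\0') i++;              /* keep counting for the return value */
    return i;
}

/* split: find the length first, then do one bulk copy */
size_t copy_split(char *dst, const char *src, size_t dsize) {
    size_t len = strlen(src);
    if (dsize != 0) {
        size_t n = len < dsize - 1 ? len : dsize - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return len;
}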
That’s a great discussion. When you do the strlen, you’re streaming data in, so you’ll hit fast paths in the prefetch and, importantly, the termination branch will be predicted not taken so you can run hundreds of instructions forward, skipping cache misses, and do it with a very high degree of parallelism. Eventually, you’ll resolve the branch that escapes the loop and unwind, which comes with a fixed cost.
After that, the cache will be warm. The most important thing here is that you don’t need to reorder the stores. You’re writing full cache lines with data that’s already in cache and so you’ll do a little bit of reassembly and then write entire new lines. Modern caches do allocate in place, so if you store a whole line of data they won’t fetch from memory. If the source and destination are differently aligned, one cache miss in the middle can cause the stores to back up in the store queue waiting for the cache and backpressure the entire pipeline.
Yes, there’s a small store queue. If a full cache line of data is in the store queue, you avoid the load from memory.
Thanks. Do the stores need to be aligned to be coalesced? I.e. does this, for r8 not aligned to cacheline, avoid one of the 3 loads? If not, are the alignment requirements documented somewhere for x64/aarch64?
It will vary a lot across microarchitectures. Generally, stores that span a cache line will consume multiple entries in a store queue and then be coalesced, but the size of the store queue may vary both in the width of entries and the number. I’m also not sure of the amount of reordering that’s permitted on x86 with TSO; that may require holding up the coalesced store of the middle line until the load of the first one has happened. There’s also some complexity in that Intel chips often fill pairs of cache lines (but evict individual ones), so the miss in the first may still bring the middle line into LLC and then be replaced.
Dumb followup question: do you think that Torvalds argument that the ISA should have some kind of rep mov that can skip some of the CPU internal machinery (store coalescing but maybe the register file too?) holds water?
(It doesn’t need to be rep mov; you can imagine a limited version that requires alignment to cache lines in source, size, and destination, or even a single cache-line copy.)
Yes. I was one of the reviewers for Arm’s version, which is designed to avoid the complex microcode requirements of rep movsb. In the Arm version, each memcpy operation is split into three instructions. On a complex out-of-order machine, you’ll probably treat the first and last as NOPs and just do everything in the middle, but in a simpler design the first and last can handle unaligned starts and ends and the middle one can do a bulk copy.
The bulk copy can be very efficient if it’s doing a cache line at a time. Even if the source and destination have different alignment, if you can load two cache lines and then fill one from overlapping bits, you guarantee that you’re never needing to read anything from the target. There are a bunch of other advantages:
You know where the end is, so if it’s in the middle of a cache line you can request that load right at the start.
The loads are entirely predictable and so you can issue a load of them together without needing to involve the speculative execution machinery.
The stores are unsequenced, so you can make them visible in any order as long as you make all of the ones before visible when you take an interrupt.
There’s no register rename involved in the copies, you’re not making one vector register live for the duration of the copy. This is especially important for smaller cores, where you might not bother with register renaming on the vector registers (there are enough vector registers to keep the pipelines full without it and the vector register state is huge).
For very large copies, if they miss in cache you can punt them lower in the memory subsystem.
The last point is the same motivation as atomic read-modify-write sequences. If you have something like CXL memory, you have around a 1500 cycle latency. If you need to do an atomic operation by pulling that data into the cache and operating on it, it will almost certainly back-pressure the pipeline. If you can just send an instruction to the remote memory controller to do it, the CPU can treat it as retired (unless it needs to do something with the result). If you’re doing a copy of a page in CXL memory (e.g. a CoW fault after fork), you can just send a message telling the remote controller to do the copy and, at the same time, read a small subset of the values at the source into the cache for the destination. Some CXL memory controllers do deduplication (very useful if you have eight cloud nodes using one CXL memory device and all running multiple copies of the same VM images) and so having an explicit copy makes this easy: the copy is just a metadata update that adds a mapping table entry and updates a reference count.
I also missed that it was in POSIX, but the equivalent functionality has been in BSD libcs for a long time. It’s how things like asprintf are implemented. The example on the POSIX site is logically how it works, though the real code allocates the FILE on the stack and initialises it (avoiding the second heap allocation), which you can get away with if you are libc.
I’m slightly surprised that earlier BSDs had their own stdio unrelated to the 7th Edition stdio: I thought there was more sharing at that time … but I suppose the BSD / AT&T USL / Bell Labs Research unix divergence had already started before the 7th Edition.
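As a sketch of how that works (this one builds on POSIX open_memstream rather than whatever the BSD libc does internally, so treat it as an illustration of the idea, not the real implementation):

#define _POSIX_C_SOURCE 200809L
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

int my_asprintf(char **out, const char *fmt, ...) {
    size_t size = 0;
    FILE *f = open_memstream(out, &size);   /* FILE backed by a growable buffer */
    if (!f) return -1;
    va_list ap;
    va_start(ap, fmt);
    int n = vfprintf(f, fmt, ap);
    va_end(ap);
    if (fclose(f) != 0 || n < 0) {
        free(*out);
        *out = NULL;
        return -1;
    }
    return (int)size;                        /* bytes written, excluding the NUL */
}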
Question for people with a DB background on mmap vs a manual buffer pool. What would be a typical workload where you would expect mmap to do badly, and how would you implement the buffer pool to fix that?
Searched the document for the word “fast” and it did not turn up.
The thing where dev tooling for other languages is written in Rust and then becomes much much faster… Somebody maybe should be doing that for Rust itself.
The goal of this project is to make it easier to integrate Rust into existing C and C++ projects that are traditionally built with gcc (Linux being the most prominent).
Given that gcc and clang are comparable at compilation speed, with clang maybe being slightly better, I wouldn’t expect this project to improve compilation speed, nor do I believe it should be within scope for this project.
Wow, newlines in filenames being officially deprecated?!
Re. modern C, multithreaded code really needs to target C11 or later for atomics. POSIX now requires C17 support; C17 is basically a bugfix revision of C11 without new features. (hmm, I have been calling it C18 owing to the publication year of the standard, but C23 is published this year so I guess there’s now a tradition of matching nominal years with C++ but taking another year for the ISO approval process…)
Nice improvements to make, and plenty of other good stuff too.
It seems like both C and POSIX have woken up from a multi-decade slumber and are improving much faster than they used to. Have a bunch of old farts retired or something?
Even in standard naming they couldn’t avoid an off-by-one error. ¯\_(ツ)_/¯
I believe the date is normally the year when the standard is ratified. Getting ISO to actually publish the standard takes an unbounded amount of time and no one cares because everyone works from the ratified draft.
As a fellow brit, you may be amused to learn that the BSI shut down the BSI working group that fed WG14 this year because all of their discussions were on the mailing list and so they didn’t have the number of meetings that the BSI required for an active standards group. The group that feeds WG21 (of which I am a member) is now being extra careful about recording attendance.
Unfortunately, there were a lot of changes between the final public draft and the document actually being finished. ISO is getting harsher about this and didn’t allow the final draft to be public. This time around people will probably reference the “first draft” of C2y instead, which is functionally identical to the final draft of C23.
There are a bunch of web sites that have links to the free version of each standard. The way to verify that you are looking at the right one is
look at the committee mailings which include a summary of the documents for a particular meeting
look for the editor’s draft and the editor’s comments (two adjacent documents)
the comments will say if the draft is the one you want
Sadly I can’t provide examples because www.open-std.org isn’t working for me right now :-( It’s been unreliable recently, does anyone know what’s going on?
Or just look at cppreference …
https://en.cppreference.com/w/cpp/language/history
https://en.cppreference.com/w/c/language/history
For C23, Cppreference links to N3301, the most recent C2y draft. Unfortunate that the site is down, so we can’t easily check whether all those June 2024 changes were also made to C23. The earlier C2y draft (N3220) only has minor changes listed. Cppreference also links to N3149, the final WD of C23, which is protected by low quality ZIP encryption.
I think most of open-std is available via the Archive, e.g. here is N3301: https://web.archive.org/web/20241002141328/https://open-std.org/JTC1/SC22/WG14/www/docs/n3301.pdf
For C23 the documents are
I think for C23 the final committee draft was last year but they didn’t finish the ballot process and incorporating the feedback from national bodies until this summer. Dunno how that corresponds to ISO FDIS and ratification. Frankly, the less users of C and C++ (or any standards tbh) have to know or care about ISO the better.
Re modern C: are there improvements in C23 that didn’t come from either C++ or are standardization of stuff existing implementations have had for ages?
It’s best to watch the standard editor’s blog and sometimes twitter for this information.
https://thephd.dev/
https://x.com/__phantomderp
I think the main ones are _BitInt, <stdbit.h>, <stdckdint.h>, #embed
Generally the standard isn’t the place where innovation should happen, though that’s hard to avoid if existing practice is a load of different solutions for the same problem.
They made realloc(ptr, 0) undefined behaviour. Oh, sorry, you said improvements.
I learned about this yesterday in the discussion of rebasing C++26 on C23 and the discussion from the WG21 folks can be largely summarised as ‘new UB, in a case that’s trivial to detect dynamically? WTF? NO!’. So hopefully that won’t make it back into C++.
realloc(ptr, 0) was broken by C99 because, since then, when NULL is returned you can’t tell whether it successfully freed the pointer or whether it failed the way malloc(0) may fail, leaving the old pointer still allocated.
POSIX has changed its specification so realloc(ptr, 0) is obsolescent so you can’t rely on POSIX to save you. (My links to old versions of POSIX have mysteriously stopped working which is super annoying, but I’m pretty sure the OB markers are new.)
C ought to require that malloc(0) returns NULL and (like it was before C99) realloc(ptr,0) is equivalent to free(ptr). It’s tiresome having to write the stupid wrappers to fix the spec bug in every program.
Maybe C++ can fix it and force C to do the sensible thing and revert to the non-footgun ANSI era realloc().
98% sure some random vendor with a representative via one of the national standards orgs will veto it.
In cases like this it would be really helpful to know who are the bad actors responsible for making things worse, so we can get them to fix their bugs.
Alas, I don’t know. I’ve just heard from people on the C committee that certain things would be vetoed by certain vendors.
Oh good grief, it looks like some of the BSDs did not implement C89 properly, and failed to implement realloc(ptr, 0) as free(ptr) as they should have:
FreeBSD 2.2 man page / phkmalloc source
OpenBSD also used phkmalloc; NetBSD’s malloc was conformant with C89 in 1999.
It was already UB in practice. I guarantee that there are C++ compilers / C stdlib implementations out there that together will make 99% of C++ programs that do realloc(ptr, 0) have UB.
Not even slightly true. POSIX mandates one of two behaviours for this case, which are largely compatible. I’ve seen a lot of real-world code that is happy with either of those behaviours but does trigger things that are now UB in C23.
But POSIX is not C++. And realloc(ptr, 0) will never be UB with a POSIX-compliant compiler, since POSIX defines the behavior. Compilers and other standards are always free to define things that aren’t defined in the C standard. realloc(ptr, 0) was UB “in practice” for C due to non-POSIX compilers. They could not find any reasonable behavior for it that would work for every vendor. Maybe there just aren’t enough C++ compilers out there for this to actually be a problem for C++, though.
In general, POSIX does not change the behaviour of compiler optimisations. Compilers are free to optimise based on UB in accordance with the language semantics.
Then make it IB, which comes with a requirement that you document what you do, but doesn’t require that you do a specific thing, only that it’s deterministic.
No, the C++ standards committee just has a policy of not introducing new kinds of UB in a place where they’re trivially avoidable.
C23 does not constrain implementations when it comes to the behavior of realloc(ptr, 0), but POSIX does. POSIX C is not the same thing as standard C. Any compiler that wants to be POSIX-compliant has to follow the semantics laid out by POSIX. Another example of this is function pointer to void * casts and vice versa. UB in C, but mandated by POSIX.
They introduced lots of new UB in C++20, so I don’t believe this.
It doesn’t really list the integers in question, but an interesting article anyway!
“Integer multiplies” in this context means “x64 instruction that treats register content as integers and, among other things, multiplies them”.
When I’ve tried emulating X86-64 on Apple Silicon using QEMU it’s been incredibly slow, like doing ls took like 1-2 seconds. So if these fine people manage to emulate games then I’m very impressed!
QEMU emulation (TCG) is very slow! Its virtue is that it can run anything on anything, but it’s not useful for productivity or gaming. I used to use it to hack around a FEX RootFS as root, and even just downloading and installing packages with dnf was excruciatingly slow.
Emulators that optimize for performance (such as FEX, box64, and Rosetta, and basically every modern game console emulator too) are in a very different league. Of course, the tradeoff is they only support very specific architecture combinations.
As @lina says, QEMU is general. It works a few instructions at a time, generates an IR (TCG IR, which was originally designed for TCC, which was originally an IOCCC entry), does a small amount of optimisation, and emits the result.
Rosetta 2 works on much larger units but, more importantly, AArch64 was designed to support x86 emulation and it can avoid the intermediate representation entirely. Most x86-64 instructions are mapped to 1-2 instructions. The x86-64 register file is mapped into 16 of the AArch64 registers, with the rest used for emulator state.
Apple has a few additional features that make it easier:
They use some of the reserved bits in the flags register for x86-compatible flags emulation.
They implement a TSO mode, which automatically sets the fence bits on loads and stores.
FEX doesn’t (I think) take advantage of these (or possibly does, but only on Apple hardware?), but even without them it’s quite easy (as in, it’s a lot of engineering work, but each bit of it is easy) to translate x86-64 binaries to AArch64. Arm got a few things wrong but both Apple and Microsoft gave a lot of feedback and newer AArch64 revisions have a bunch of extensions that make Rosetta 2-style emulation easy.
RISC-V’s decision to not have a flags register would make this much harder.
There are two more hardware features: SSE denormal handling (FTZ/DAZ) and a change in SIMD vector handling. Those are standardized as FEAT_AFP in newer ARM architectures, but Apple doesn’t implement the standard version yet. The nonstandard Apple version is not usable in FEX due to a technicality in how they implemented it (they made the switch privileged and global, while FEX needs to be able to switch between modes efficiently, unlike Rosetta, and calling into the kernel would be too slow).
FEX does use TSO mode on Apple hardware though, that’s by far the biggest win and something you can’t just emulate performantly if the hardware doesn’t support it. Replacing all the loads/stores with synchronized ones is both slower and also less flexible (fewer addressing modes) so it ends up requiring more instructions too.
Dumb question: is there a reason not to always ahead-of-time compile to the native arch anyway? (I believe that is what RPCS3 does, see the LLVM recompiler option).
As I understand it, that’s more or less what Rosetta 2 does: it hooks into mmap calls and binary translates libraries as they’re loaded. The fact that the mapping is simple means that this can be done with very low latency. It has a separate mode for JIT compilers that works more incrementally. I’m impressed by how well the latter works: the Xilinx tools are Linux Java programs (linked to a bunch of native libraries) and they work very well in Rosetta on macOS, in a Linux VM.
The Dynamo Rio work 20 or so years ago showed that JITs can do better by taking advantage of execution patterns. VirtualPC for Mac did this kind of thing to avoid the need to calculate flags (which were more expensive on PowerPC) when they weren’t used. In contrast, Apple Silicon simply makes it sufficiently cheap to calculate the flags that this is not needed.
Rosetta does do this, but you have to support runtime code generation (that has to be able to interact with AOT-generated code) at minimum because of JITs (though ideally a JIT implementation should check to see if it is being translated and not JIT), but also if you don’t support JIT translating you can get a huge latency spike/pause when a new library is loaded.
So no matter what you always have to support some degree of runtime codegen/translation, so it’s just a question of can you get enough of a win from an AOT as well as the runtime codegen to justify the additional complexity.
Ignore the trashy title, this is actually really neat. They offload IO to separate threads which means the main thread now gets commands in batches; so the main thread can interleave the data structure traversals for multiple keys from the batch, so it can make much better use of the memory system’s concurrency.
That’s similar to the famous talk by Gor Nishanov about using coroutines to interleave multiple binary searches.
Can’t ignore the trashy title, it’s spam.
I wonder whether anyone has tried to fold system load into the concept of time. I.e. time flows slower if the system is under load from other requests.
It sounds like you’re looking for queue management algorithms, such as CoDel, PIE or BBR.
Slightly surprised there’s no mention of the seemingly-similar Prolly Tree. (At least, not in the first half of the paper…)
Oops, that’s a total mistake on our part! I’ve (re)added a reference (it was lost along the way somehow) and included it back in our latter discussion. Thanks for pointing out that omission!
That’s another very cute B-tree-like data structure. Thanks for the pointer!
The funny thing about the airline example is that airlines do overcommit tickets, and they will ask people not to board if too many people show up…
If anything, this speaks more to how badly X fits modern use cases than anything. There are lots of reasons that the locker can die (not only OOM), but the fact that this can “unlock” your desktop is the actual absurd part.
This would be impossible in Wayland, for example.
If wayland was good, we’d all be using it by now. It has had so, so much time to prove itself.
My experience with wayland has been:
I think the creators of Wayland have done us a disservice, by convincing everyone it is the way forward, while not actually addressing all the use cases swiftly and adequately, leaving us all in window manager limbo for two decades.
Maybe my opinion will change when I upgrade to Plasma 6. Although, if you search this page for “wayland” you see a lot of bugs…
Your information might be outdated: not only is it the default in Plasma 6 and GNOME 46, but they’ve actually worked to allow compiling them with zero Xorg support. I believe a lot of distros are now not only enabling it by default but have express plans to no longer ship Xorg at all outside of xwayland.
Keep in mind that I gave Wayland as an example of how they should have fixed the issue (e.g.: having a protocol where if the locker fails, the session is just not opened to everyone).
My experience with Wayland is that it works for my use cases. I can understand the frustration of not working for yours (I had a similar experience 5 years ago, but since switching to Sway 2 years ago it seems finally good enough for me), but this is not a “Wayland is good and X is bad”, it is “X is not designed for modern use cases”.
Yeah I realize I’m changing the topic. Your point stands.
This is a thread about OS kernels and memory management. There are lots of us who use Linux but don’t need a desktop environment there. With that in mind, please consider saving the Wayland vs X discussion for another thread.
Lone nerd tries to stop nerd fight. Gets trampled. News at 11
By the same logic, we could argue that:
If Linux was any good, we’d all be using it by now.
If Dvorak was any good, we’d all be typing on it by now.
If the metric system was any good, the US would be using it by now.
None of the above examples are perfect, I just want to insist that path dependence is a thing. Wayland, being so different from X, introduced many incompatibilities, so it had much inertia to overcome right from the start. People need clear, substantial, and immediate benefits to consider paying even small switching costs, and Wayland’s are pretty significant.
I think the logic works fine:
I think Wayland sucked a lot. And it has finally started to be good enough to get people to switch. And I’m mad that it took so long.
Except on the desktop. You could argue it’s just one niche among many, but it remains a bloody visible one.
Hmm, you’re effectively saying that no layout is very good, and all have extremely low value-add… I’m not sure I believe that, even if we ignore chording layouts that let stenotypists type faster than human speech: really, we can’t do significantly better than what was effectively a fairly random layout?
I call bullshit on this one. Last time I went there it was all about miles and inches and square feet. Scientists may use it, but day to day you’re still stuck with the imperial system, even down to your standard measurements: wires are gauged, your wood is 2 by 4 inches, even your screws use imperial threads.
Oh, and there was this Mars probe that went crashing down because of an imperial/metric mismatch. I guess they converted everything to metric since then, but just think of what it took to do that even for this small, highly technical niche.
That being said…
I can believe it did (I don’t have a first hand opinion on this).
Just on a detail, I believe it doesn’t matter to most people what their keyboard layout is, and I’ve wasted a lot of time worrying about it. A basically random one like qwerty is just fine. That doesn’t affect your main point though, especially since the example of stenography layouts is a slam dunk. Many people still do transcription using qwerty, and THAT is crazy path-dependence.
Linux isn’t very good on the desktop, speaking as a Linux desktop user since 2004.
The display protocol used by a system built in the bazaar style of development is not a question of design, but of community support/network effect. It can be the very best thing ever, and it won’t matter if no client supports it.
Also, the creators of Wayland are the ex-maintainers of X, it’s not like they were not familiar with the problem at hand. You sometimes have to break backwards compatibility for good.
Seems to be happening though? Disclaimer, self reported data.
The other survey I could find puts Wayland at 8%, but it dates to early 2022.
Sure, it’s good to be finally happening. My point is if we didn’t have Wayland distracting us, a different effort could have gotten us there faster. It’s always the poorly executed maybe-solution that prevents the ideal solution from being explored. Still, I’m looking forward to upgrading to Plasma 6 within the next year or so.
Designing Wayland was a massive effort, it wasn’t just the Xorg team going “we got bored of this and now you have to use the new thing”, they worked very closely with DE developers to design something that wouldn’t make the same mistakes Xorg did.
Meanwhile, Arcan is basically just @crazyloglad and does a far better job of solving the problems with X11 than Wayland ever will.
The appeal to effort argument in the parent comment is just https://imgur.com/gallery/many-projects-GWHoJMj which aptly describes the entire thing.
Being a little smug, https://www.divergent-desktop.org/blog/2020/10/29/improving-x/ has this little thing:
I think we are at 4 competing icon protocols now. Mechanism over policy: https://arcan-fe.com/2019/05/07/another-low-level-arcan-client-a-tray-icon-handler/
The closing bit:
It might even turn out so well that one of these paths will have a fighting chance against the open desktop being further marginalised as a thin client in the Azure clouded future; nothing more than a silhouette behind unwashed Windows, a virtualized ghost of its former self.
That battle is quickly being lost.
The unabridged story behind Arcan should be written down (and maybe even published) during the coming year or so as the next thematic shift is around the corner. That will cover how it’s just me but also not. A lot of people have indirectly used the thing without ever knowing, which is my preferred strategy for most things.
Right now another fellow is on his way from another part of Europe for a hackathon in my fort out in the wilderness.
Arcan does look really cool in the demos, and I’d like to try it, but last time I tried to build it I encountered a compilation bug (and submitted a PR to fix it) and I’ve never been able to get any build of it to give me an actually usable DE.
I’m sure it’s possible, but last time I tried I gave up before I worked out how.
I also wasn’t able to get Wayland to work adequately, but I got further and it was more “this runs but is not very good” instead of “I don’t understand how to build or run this”.
Maybe being a massive effort is not actually a good sign.
Arguably Wayland took so long because it decided to fix issues that didn’t need fixing. Did somebody actually care about a rogue program already running on your desktop being able to capture the screen and clipboard?
edit: I mean, I guess so since they put in the effort. It’s just hard for me to fathom.
Was it a massive effort? Its development starting 20 years ago does not equate to a “massive effort,” especially considering that the first 5 years involved a mere handful of people working on it as a hobby. The remainder of the time was spent gaining enough network effect, rather than technical effort.
Sorry, but this appeal to the effort it took to develop wayland is just embarrassing.
Vaporware does not mean good. Au contraire, it usually means terrible design by committee, as is the case with wayland.
Besides, do you know how much effort it took to develop X?
It’s so tiring to keep reading this genre of comment over and over again, especially given that we have crazyloglad in this community utterly deconstructing it every time.
This is true, but I do think it was also solved in X, although there were only a few implementations, as it required working around X more than using it.
IIRC GNOME and GDM would coordinate so that when you lock your screen it actually switched back to GDM. This way if anything started or crashed in your session it wouldn’t affect the screen locking. And if GDM crashed it would just restart without granting any access.
That being said it is much simpler in Wayland where the program just declares itself a screen locker and everything just works.
Crashing the locker hasn’t been a very good bypass route in some time now (see e.g. xsecurelock, which is more than 10 years old, I think).
xlock, the program mentioned in the original mailing list thread, is literally 1980s software. X11 screen lockers do have a lot of other problems (e.g. input grabbing), primarily because, unlike Wayland, X11 doesn’t really have a lock protocol, so screen lockers mostly play whack-a-mole with other clients. Technically Wayland doesn’t have one, either, as the session lock protocol is in staging, but I think most Wayland lockers just go with that.
Unfortunately, last time I looked at it, Wayland delegated a lot of responsibilities to third parties, too. E.g. session lock state is usually maintained by the compositor (or at least that’s what a lot of ad-hoc solutions developed prior to the current session lock protocol did). Years ago, “resilient” schemes that tried to restart the compositor if it crashed routinely suffered from the opposite problem: crashing the screen locker was fine, but if the OOM killer reaped the compositor, the session got unlocked.
I’d have thought there was a fairly narrow space where CPU SIMD mattered in game engines. Most of the places where SIMD will get an 8x speedup, GPU offload will get a 1000x speedup. This is even more true on systems with a unified memory architecture where there’s much less cost in moving data from CPU to GPU. It would be nice for the article to discuss this.
Early SIMD (MMX, 3DNow, SSE1) on mainstream CPUs typically gave a 10-30% speedup in games, but then adding a dedicated GPU more than doubled the resolution and massively increased the rendering quality.
An earlier article from the same author mentions that to switch to the GPU they’d have to change the algorithm.
The big advantage of CSV and TSV is that you can edit them in a text editor. If you’re putting non-printing characters in as field separators you lose this. If you don’t need that property then there are a lot of better options up to, and including, sqlite databases.
Obvious solution is to put non-printing characters on the keyboard
…APL user?
Close; he uses J.
And after some time, people would start using them for crazy stuff that no one anticipated and this solution wouldn’t work anymore 👌
Though I suppose that it has the advantage of not coming with any meaning pre-loaded into it. Yet. If we use these delimiter tokens for data files then people will be at least slightly discouraged from overloading them in ways that break those files.
Also, grep works on both CSV and TSV, which is very useful … it won’t end up printing crap to your terminal. diff and git merge can work to a degree as well.
Bytes and text are essential narrow waists :) I may change this to “M x N waist” to be more clear. A text editor or grep is one of M, and TSV is one of N. If you have an arbitrary DSV, you’re not really in that world any more – now you need to write your own tools.
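To make “write your own tools” concrete, here is a minimal, hypothetical C sketch (my own illustration, not from the thread) that converts \x1f/\x1e-delimited input into TSV so the usual line-oriented tools can be used downstream:

```c
/* Hypothetical dsv2tsv: convert \x1f/\x1e-delimited input to TSV on stdout.
 * Minimal sketch: no quoting or escaping, assumes fields contain no tabs
 * or newlines. */
#include <stdio.h>

int main(void) {
    int c;
    while ((c = getchar()) != EOF) {
        if (c == 0x1F)          /* unit separator -> field boundary (tab)     */
            putchar('\t');
        else if (c == 0x1E)     /* record separator -> end of record (newline) */
            putchar('\n');
        else
            putchar(c);
    }
    return 0;
}
```

It deliberately punts on quoting and escaping, which is exactly the trade-off being debated in this subthread.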
FWIW I switched from CSV to TSV, because the format is much simpler. As far as I can tell, there is exactly one TSV format, but multiple different CSV formats in practice. There’s less room for misunderstanding.
Do you? I believe awk and tr deal with it just fine: e.g. tr to convert from DSV to TSV for printing, and awk for selecting single columns and printing them as TSV. Also, I think grep shouldn’t have any problems either; it should pass the non-printable characters through as-is?
grep (GNU grep 3.11) does pass the non-printables through but doesn’t recognise \x1e as a line separator (and has no option to specify that either), which means you get the whole wash of data whatever you search for. You’d have to pipe it through tr to swap \x1e for \n before grep.
Fair, I didn’t know. You can use awk as a grep substitute though.
It’s cool that that works, but I’d argue it is indeed a case of writing your own tools! Compare with the plain TSV case. And there are more tools, like head and tail and shuf. xargs -0 and find -print0 actually have the same problem – I pointed this out somewhere on https://www.oilshell.org. It kind of “infects” into head -0, tail -0, sort -0, … which are sometimes spelled sort -z, etc.
The Oils solution is “TSV8” (not fully implemented) – basically you can optionally use JSON-style strings within TSV cells. So head tail grep cat awk cut work for “free”. But if you need to represent something with tabs or with \x1f, you can. (It handles arbitrary binary data, which is a primary rationale for the J8 Notation upgrade of JSON – https://www.oilshell.org/release/latest/doc/j8-notation.html)
I don’t really see the appeal of \x1f because it just “pushes the problem around”. Now instead of escaping tab, you have to escape \x1f. In practice, TSV works very well for me – I can do nearly 100% of my work without tabs. If I need them, then there’s TSV8 (or something different like sqlite).
head can be done in awk; the rest would likely require converting the newline-terminated output to zero-terminated output and passing it to the zero-terminated version of themselves. With both DSV and zero-terminated commands, I’d make a bunch of aliases and call it a day. I guess that counts as “writing your own tools”, but I end up turning commonly used commands into functions and scripts anyway, so I don’t see it as a great burden. I guess to each their own workflow.
The other major advantage is the ubiquity of the format. You lose a lot of tools if you aren’t using the common formats.
Interesting stuff for certain types of number-crunching nerds. It’s impressive what AMD’s pulled off here.
Or if you just want an ironic laugh rather than anything too useful, open it up and ^F VP2INTERSECT.
Not just number crunching. Double shuffles are a big deal for certain kinds of string processing.
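For a sense of what that looks like, here is a hedged sketch (my own illustration, assuming AVX-512VBMI is available) of a “double shuffle” used as a 128-entry byte lookup table, e.g. for bulk ASCII case conversion:

```c
/* Hedged sketch (not from the thread): a "double shuffle"
 * (_mm512_permutex2var_epi8, i.e. vpermi2b/vpermt2b, AVX-512VBMI) used as a
 * 128-entry byte lookup table -- here, upper-casing 64 ASCII bytes at once.
 * Build with something like: gcc -O2 -march=icelake-client example.c */
#include <immintrin.h>
#include <stdint.h>

void toupper_ascii_64(const uint8_t *in, uint8_t *out) {
    uint8_t lut[128];
    for (int i = 0; i < 128; i++)
        lut[i] = (i >= 'a' && i <= 'z') ? (uint8_t)(i - 32) : (uint8_t)i;

    __m512i lo  = _mm512_loadu_si512(lut);       /* table entries 0..63   */
    __m512i hi  = _mm512_loadu_si512(lut + 64);  /* table entries 64..127 */
    __m512i idx = _mm512_loadu_si512(in);        /* bytes double as indices
                                                    (assumes ASCII input)  */

    /* Bits [6:0] of each input byte select one of the 128 table bytes,
       spanning both source registers in a single instruction. */
    __m512i res = _mm512_permutex2var_epi8(lo, idx, hi);
    _mm512_storeu_si512(out, res);
}
```

The point is that one shuffle can index into a table spanning two vector registers, which is what makes byte-granularity table lookups cheap for string work.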
I appreciate that PureScript has ado for applicative do-notation instead of overloading do + a language pragma, & doesn’t need something complicated trying to guess what can be done in parallel. Seems this offers some speedups, but now you have implicit things going on, & do for monads is supposed to be about expressing sequential data flow.
Generally, execution is implicit in Haskell, so this is in line with the culture / philosophy of the language. Though I agree with you that having explicit control over how things are executed is often necessary to engineer things that behave predictably.
Also, Haskell is one of the few languages that lets the programmer hook into the optimizer with rewrite rules. So this paper can be seen as equivalent to a rewrite rule that rewrites monad binds into applicative <*> for performance reasons. (There’s also other precedent for this kind of high-level optimization in Haskell, e.g. stream fusion.)
In the quoted snippet, you can generate the mask by reading from a buffer at an offset instead. It will save you a couple of instructions.
I feel like this line is missing an addition of 16? And also probably a cast or two so that the pointer arithmetic works out correctly?
why trying to do this crap in c is always a mistake
e: ‘better’ (no overread):
of course overread is actually fine…
e: of course this comes from somewhere - found this dumb proof of concept / piece of crap i made forever ago https://files.catbox.moe/pfs2qu.txt (point to avoid page faults; no other ‘bounds’ here..)
Correct. I shouldn’t write comments late in the evening…
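For anyone following along, here is a hedged sketch of the general “read the mask from a buffer at an offset” trick being discussed; the table layout and offsets are illustrative, not the snippet from the thread:

```c
/* Hedged sketch: load a partial-vector byte mask from a constant table
 * instead of computing it. mask_table + (16 - n) yields n leading 0xFF bytes.
 * Requires SSE2; layout is illustrative. */
#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

static const uint8_t mask_table[32] = {
    0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
    0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
    0,    0,    0,    0,    0,    0,    0,    0,
    0,    0,    0,    0,    0,    0,    0,    0,
};

/* Returns a mask whose first n bytes are 0xFF (0 <= n <= 16). */
static inline __m128i first_n_bytes_mask(size_t n) {
    return _mm_loadu_si128((const __m128i *)(mask_table + (16 - n)));
}
```

The “16 - n” offset and the pointer cast are where the missing “addition of 16” and casts mentioned above come in.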
Interesting, I might try this out on some of the other tasks on HighLoad. It would be nice if we could get some hard evidence that engaging the stream prefetcher by interleaving is actually the reason for the perf gain. Otherwise, I don’t buy explanations like these at face value, especially when virtual memory is involved.
One thing that makes it tricky to optimize on the platform is that you have limited ways to exfiltrate profiling data. Your only way of introspecting the system is to dump info to stderr.
If you have an Intel CPU, can’t you try it locally? You can just pipe /dev/urandom into your program. Maybe run the program under perf (or whatever the equivalent is on macOS or Windows) to count cache misses?
You can, but you need to have the same hardware as the target platform (in this case a Haswell server). I have a Coffee Lake desktop processor, so whatever results I get locally will not be representative.
I was thinking of running both the streaming and interleaved versions on the same machine to see if you could replicate the same speedup (even on a different microarchitecture).
That said, I looked up the phrase mentioned in the blog post in the January 2023 version of the Intel Optimization Manual, and I could only find it under “EARLIER GENERATIONS OF INTEL® 64” for Sandy Bridge. Not sure it would apply to Skylake/Coffee Lake.
Good point, this piqued my interest. I know what I’ll be doing next weekend :P
This seems wrong? The if ((*dst++ = *src++) == '\0') branch should be very predictable and shouldn’t hinder the CPU. What I believe is happening is that splitting the strlen and memcpy loops removes the early exit from the memcpy loop, allowing the autovectorizer to kick in. Godbolt seems to confirm my thesis, with bespoke_strlcpy being vectorized at -O3 by gcc 14.
That’s a great discussion. When you do the strlen, you’re streaming data in, so you’ll hit fast paths in the prefetch and, importantly, the termination branch will be predicted not taken so you can run hundreds of instructions forward, skipping cache misses, and do it with a very high degree of parallelism. Eventually, you’ll resolve the branch that escapes the loop and unwind, which comes with a fixed cost.
After that, the cache will be warm. The most important thing here is that you don’t need to reorder the stores. You’re writing full cache lines with data that’s already in cache and so you’ll do a little bit of reassembly and then write entire new lines. Modern caches do allocate in place, so if you store a whole line of data they won’t fetch from memory. If the source and destination are differently aligned, one cache miss in the middle can cause the stores to back up in the store queue waiting for the cache and backpressure the entire pipeline.
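For reference, a hedged sketch of the two shapes being compared (illustrative code and names, not the article’s exact bespoke_strlcpy):

```c
/* Hedged sketch of the two shapes under discussion. The byte-at-a-time loop
 * has an early exit on every iteration; the split version gives the compiler
 * a straight-line, bounded copy it can vectorize. */
#include <stddef.h>
#include <string.h>

/* Copy-with-early-exit: every iteration may be the last. */
size_t strlcpy_loop(char *dst, const char *src, size_t size) {
    size_t i = 0;
    if (size) {
        for (; i + 1 < size && src[i] != '\0'; i++)
            dst[i] = src[i];
        dst[i] = '\0';
    }
    while (src[i] != '\0')      /* keep scanning to return strlen(src) */
        i++;
    return i;
}

/* Split version: measure first, then do a bounded memcpy. */
size_t strlcpy_split(char *dst, const char *src, size_t size) {
    size_t len = strlen(src);
    if (size) {
        size_t n = len < size - 1 ? len : size - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return len;
}
```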
Do modern CPUs do any coalescing of stores? I.e. would a run of small adjacent stores covering a cache line be coalesced into a single cacheline-sized store somehow?
Yes, there’s a small store queue. If a full cache line of data is in the store queue, you avoid the load from memory.
Thanks. Do the stores need to be aligned to be coalesced? I.e. does this, for r8 not aligned to cacheline, avoid one of the 3 loads? If not, are the alignment requirements documented somewhere for x64/aarch64?
It will vary a lot across microarchitectures. Generally, stores that span a cache line will consume multiple entries in a store queue and then be coalesced, but the size of the store queue may vary both in the width of entries and the number. I’m also not sure about the amount of reordering that’s permitted on x86 with TSO; that may require holding up the coalesced store of the middle line until the load of the first one has happened. There’s also some complexity in that Intel chips often fill pairs of cache lines (but evict individual ones), so the miss on the first may still bring the middle line into LLC and then be replaced.
TL;DR: Computers are weird.
Dumb followup question: do you think that Torvalds’ argument that the ISA should have some kind of rep mov that can skip some of the CPU-internal machinery (store coalescing, but maybe the register file too?) holds water? (It doesn’t need to be rep mov; you can imagine a limited version that requires cacheline alignment of the source, destination, and size, or even a single-cacheline copy.)
Yes. I was one of the reviewers for Arm’s version, which is designed to avoid the complex microcode requirements of rep movsb. In the Arm version, each memcpy operation is split into three instructions. On a complex out-of-order machine, you’ll probably treat the first and last as NOPs and just do everything in the middle, but in a simpler design the first and last can handle unaligned starts and ends and the middle one can do a bulk copy.
The bulk copy can be very efficient if it’s doing a cache line at a time. Even if the source and destination have different alignment, if you can load two cache lines and then fill one from overlapping bits, you guarantee that you never need to read anything from the target. There are a bunch of other advantages:
The last point is the same motivation as atomic read-modify-write sequences. If you have something like CXL memory, you have around a 1500 cycle latency. If you need to do an atomic operation by pulling that data into the cache and operating on it, it will almost certainly back pressure the pipeline. If you can just send an instruction to the remote memory controller to do it, the CPU can treat it as retired (unless it needs to do something with the result). If you’re doing a copy of a page in CXL memory (e.g. a CoW fault after fork), you can just send a message telling the remote controller to do the copy and, at the same time, read a small subset values at the source into the cache for the destination. Some CXL memory controllers do deduplication (very useful if you have eight cloud nodes using one CXL memory device and all running multiple copies of the same VM images) and so having an explicit copy makes this easy: the copy is just a metadata update that adds a mapping table entry and updates a reference count.
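To illustrate the head/bulk/tail split described above in plain C (purely a sketch of the shape, not the semantics of Arm’s actual instructions):

```c
/* Hedged sketch of the prologue/main/epilogue shape: handle the unaligned
 * start, copy whole 64-byte blocks aligned to the destination, then handle
 * the unaligned tail. Assumes non-overlapping buffers. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

void copy3(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Prologue: copy up to the destination's next 64-byte boundary. */
    size_t head = (64 - ((uintptr_t)d & 63)) & 63;
    if (head > n) head = n;
    memcpy(d, s, head);
    d += head; s += head; n -= head;

    /* Main: destination-aligned whole cache lines. */
    while (n >= 64) {
        memcpy(d, s, 64);
        d += 64; s += 64; n -= 64;
    }

    /* Epilogue: the unaligned tail. */
    memcpy(d, s, n);
}
```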
yes ‘write combining’
the main thing you want is to avoid actually paging in the cache line in question
People that want a “string builder” interface in C might want to check open_memstream.
Holy shit, I had not noticed that POSIX had grown that functionality!
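For reference, a minimal sketch of open_memstream() used as a string builder (error handling elided):

```c
/* Minimal sketch of open_memstream() (POSIX.1-2008) as a string builder:
 * stdio writes grow a malloc'd buffer; fclose() finalizes buf and len.
 * Error handling elided for brevity. */
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    char *buf = NULL;
    size_t len = 0;
    FILE *f = open_memstream(&buf, &len);

    fprintf(f, "hello");
    fprintf(f, ", %s!", "world");
    fclose(f);                              /* flushes and sets buf/len */

    printf("%s (%zu bytes)\n", buf, len);   /* -> hello, world! (13 bytes) */
    free(buf);
    return 0;
}
```

The buffer is heap-allocated by the implementation, so the caller frees it after closing the stream.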
The more flexible precursors are funopen and the badly-named fopencookie.
POSIX also has fmemopen which can read or write with a fixed-sized buffer, where open_memstream() is write-only to a dynamically growing buffer.
Naming things is hard, but come on. :P
I also missed that it was in POSIX, but the equivalent functionality has been in BSD libcs for a long time. It’s how things like asprintf are implemented. The example on the POSIX site is logically how it works, though the real code allocates the FILE on the stack and initialises it (avoiding the second heap allocation), which you can get away with if you are libc.
Yeah, funopen dates from 4.4BSD, a little bit too late to be in the more widely-copied 4.3BSD but years before fopencookie.
I’m slightly surprised that earlier BSDs had their own stdio unrelated to the 7th Edition stdio: I thought there was more sharing at that time … but I suppose the BSD / AT&T USL / Bell Labs Research unix divergence had already started before the 7th Edition.
Possibly relevant article that uses C + tail-call optimization to implement some of the same optimizations.
Question for people with a DB background on mmap vs a manual buffer pool: what would be a typical workload where you would expect mmap to do badly, and how would you implement the buffer pool to fix that?