Threads for abbeyj

    1. 2

      Does not feel that different from systemd’s socket units…

      1. 3

        listen allows for a per-user, per-port allocation, and (while I’m not absolutely sure) I don’t think systemd allows that.

        Also, listen provides strong isolation by default (no access to the network or the filesystem, except where specified in the namespace script), and while it is possible to engineer it with systemd, I don’t think many users bother, so having it by default with listen provides better security.

        Finally, relying on GNU Guix containers allows for tailored dependencies, preventing some vulnerable piece of software that isn’t actually needed from sitting there as a way in for attackers.

        1. 5

          I think I understand how this is “per-port” (you can configure each port individually, and one of the configuration settings is the user) but not how it is “per-port, per-user”. If it was “per-port, per-user”, that, to me, would indicate that user alice and user bob could both listen on a given port. There’s no obvious way to make that work and I don’t think that listen attempts this.

          Or, to put it another way, imagine that we have 65535 ports and 2 users on the system: alice and bob. A “per-port” design would have 65535 different “boxes”, each of which can hold some configuration settings. A “per-user” design would have 2 different boxes in which to put configuration settings: one box for alice and one box for bob. A “per-port, per-user” design would have 65535 * 2 different boxes in which to put configuration settings: one for port 1 for alice, one for port 1 for bob, one for port 2 for alice, one for port 2 for bob, etc. It seems like listen has 65535 boxes so that would make it “per-port”, not “per-port, per-user”, at least to my way of thinking.

          1. 2

            You are right. I should probably phrase it better; what I meant is what you call per-port.

            The thing to compare to is Linux’s coarse privileged/unprivileged dichotomy, which defines two ranges: one for those with CAP_NET_BIND_SERVICE (basically root; hardly anybody uses capabilities), and the other range being a free-for-all.

            What I meant by per-port, per-user is that you are not bound by just two ranges: you can configure each port individually (“per-port”), and you can also attribute those individual ports to any user or any group (“per-user”, as you can with files), not just the CAP_NET_BIND_SERVICE-having users and the rest.

            I will probably rephrase it in the next iteration of the article.

        2. 2

          Systemd has socket units (== ini files) that define what to listen for, and systemd will fire up the matching service as needed. Same basic idea as inetd. Systemd of course runs with more privilege than listen, but since listen can start services more privileged than itself by calling out to guix (I think), that is probably not that big a win.

          Do guix containers work like nix os containers: Expose the store into the container? Or does guix play mount-tricks to hide all the unnecessary parts of the store or create a different store with just the necessary parts linked into it? If not then the containerization does not help too much: An attacker can still use everything in the store to attack the system with.

          But I admit, I never looked into guix too deeply. I played with nixos for a bit (and did not like it:-), so I did not see the need to look at guix. That does not even have my preferred plumbing layer for Linux that I need to build the systems I like to build.

          1. 4

            Do guix containers work like nix os containers: Expose the store into the container?

            I just checked because I was curious: guix shell containers use a tmpfs for the container’s root and selectively mount dependencies into a smaller /gnu/store.

            1. 2

              I wrote the first version of call-with-container. You’re correct. Guix uses bind mounts to bring in only the necessary store items.

            2. 1

              Wow, that is a nice way to do things:-)

          2. 4

            As chadcatlett said, Guix indeed does not expose the whole store from within the container.

            listen can not start binaries more privileged than itself. A user can choose to expose part of the files they have access to, for listen to enjoy, but listen itself can not force this.

            The user sharing the files typically does it in a guix container as well, which allows for only sharing a part of the fs tree, and making it read only if need be.

            I don’t want to veer off too much into troll territory, but there’s also the fact that listen is a 211-line shell script, whereas systemd is a behemoth that spans the whole system. The attack surface of listen really is smaller.

            I should probably make all of that clearer in the article, which I find now to be a bit meandering, but there’s a lot to explain, so it was difficult to write.

            1. 1

              Oh, no doubt that listen needs less privilege:-) Whether it is really less code is harder to tell though… systemd IIRC does the listening in PID1 itself. The listening itself is very little extra code, which in itself should be fairly safe (it never reads from the sockets it listens on). All the service management code is in PID1 anyway in systemd’s approach.

              Your article was an interesting read; it is always fun to see how other people approach problems.

              1. 1

                Yeah, that’s true that the inetd part of systemd is probably not much additional code.

                Thanks for the kind words :)

    2. 2

      This can’t be right…?

      #if defined(4) …
      
        1. 2

          Aaah of course. Generating a website about C using CPP is a rather fraught endeavor!

      1. 1

        Ya, that produces an error using GCC.

    3. 7

      This is what the $LESSEDIT environment variable was designed for. Quoting from the man page:

        if an environment variable LESSEDIT is defined, it is used as the
        command to be executed when the v command is invoked.  The
        LESSEDIT string is expanded in the same way as the prompt
        strings.  The default value for LESSEDIT is:
      
             %E ?lm+%lm. %g
      
        Note that this expands to the editor name, followed by a + and
        the line number, followed by the shell-escaped file name.  If
        your editor does not accept the "+linenumber" syntax, or has
        other differences in invocation syntax, the LESSEDIT variable
        can be changed to modify this default.
      

      In this specific case I think you can set

      LESSEDIT="echo LINE %lm; %E %g"
      

      Or you could hardcode the editor name here instead of using %E if you want to use something different than what’s in $VISUAL or $EDITOR.

      If you don’t like the choice of using the middle line for the line number, change the lm to lt or lj or something else. See the manpage for available options.

    4. 2

      In this:

      #[inline(always)]
      pub unsafe fn get_partial_unsafe(data: *const State, len: usize) -> State {
          let indices = 
              _mm_set_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
          // Create a mask by comparing the indices to the length
          let mask = _mm_cmpgt_epi8(_mm_set1_epi8(len as i8), indices);
          // Mask the bytes that don't belong to our stream
          return _mm_and_si128(_mm_loadu_si128(data), mask);
      }
      

      You can generate the mask by reading from a buffer at an offset. It will save you a couple of instructions:

      // Shorthand for sixteen -1, followed by sixteen 0s.
      static uint64_t mask_buffer[4] = { -1, -1, 0,  0}; 
      
      __m128i get_partial(__m128i* data, size_t len) {
          __m128i mask = _mm_loadu_si128(mask_buffer - len);
          __m128i raw_data = _mm_loadu_si128(data);
          return _mm_and_si128(raw_data, mask);
      }
      
      1. 2

        __m128i mask = _mm_loadu_si128(mask_buffer - len);

        I feel like this line is missing an addition of 16? And also probably a cast or two so that the pointer arithmetic works out correctly?

         __m128i mask = _mm_loadu_si128((__m128i *)((unsigned char *)mask_buffer + 16 - len));
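
        For reference, a self-contained sketch with that correction applied might look something like this (just a sketch; it assumes len is always in the range 0..16 and that loading 16 bytes from data is safe):

        #include <immintrin.h>
        #include <stddef.h>
        #include <stdint.h>

        /* 16 bytes of 0xff followed by 16 bytes of 0x00. */
        static const uint64_t mask_buffer[4] = { UINT64_MAX, UINT64_MAX, 0, 0 };

        /* Keep the first `len` bytes of *data and zero the rest (0 <= len <= 16). */
        __m128i get_partial(const __m128i *data, size_t len) {
            const unsigned char *p = (const unsigned char *)mask_buffer + 16 - len;
            __m128i mask = _mm_loadu_si128((const __m128i *)p);
            __m128i raw_data = _mm_loadu_si128(data);
            return _mm_and_si128(raw_data, mask);
        }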
        
        1. 2

          cast

          why trying to do this crap in c is always a mistake

          section .rodata
          mask: dq -1, -1, 0, 0
          section .text
          ; in rsi ptr rcx length
          ; out xmm0 loaded data
          movdqu xmm0, [rsi]
          neg rcx
          lea rax, [mask + 16] ;riprel limited addressing forms
          vpand xmm0, xmm0, [rax + rcx] ;avx unaligned ok
          

          e: ‘better’ (no overread):

          section .rodata
          blop: dq 0x0706050403020100, 0x0f0e0d0c0b0a0908, 0x8080808080808080, 0x8080808080808080
          section .text
          movdqu xmm0, [rsi + rcx - 16]
          neg rcx
          lea rax, [blop + 16]
          vpshufb xmm0, xmm0, [rax + rcx]
          

          of course overread is actually fine…

          e: of course this comes from somewhere - found this dumb proof of concept / piece of crap i made forever ago https://files.catbox.moe/pfs2qu.txt (point to avoid page faults; no other ‘bounds’ here..)

        2. 1

          Correct. I shouldn’t write comments late in the evening…

    5. 6

      Hi, author here. I’ve posted this mainly to get some feedback.

      badkeys is a tool to detect known-vulnerable cryptographic keys (things like the Debian OpenSSL bug or ROCA). I received a feature request to implement detection of keys for the xz backdoor.

      Part of the xz backdoor functionality is that RSA keys of a certain form can trigger the backdoor. Detecting those is relatively simple (multiplying the first 2 ints and adding the 3rd has to create a number between 0 and 3, see code here: https://github.com/badkeys/badkeys/blob/main/badkeys/rsakeys/xzbackdoor.py ).
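
      For illustration only, the check as described boils down to something like this C sketch. This is not the badkeys code; in particular, the byte order and the inclusive upper bound are assumptions on my part, so treat the linked Python as authoritative:

      #include <stdbool.h>
      #include <stdint.h>
      #include <string.h>

      /* Sketch of the described check on the leading 16 bytes of an RSA modulus:
       * read them as two 32-bit ints and one 64-bit int, multiply the first two,
       * add the third, and see whether the (wrapping) result lands in 0..3. */
      bool looks_like_xz_backdoor_key(const uint8_t n[16]) {
          uint32_t a, b;
          uint64_t c;
          memcpy(&a, n, 4);      /* assumed byte order */
          memcpy(&b, n + 4, 4);
          memcpy(&c, n + 8, 8);
          return (uint64_t)a * b + c <= 3;
      }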

      This is a bit different from what badkeys usually does. The other checks are all indicating a broken/insecure key. In this case, we are detecting keys that we know have been created by a malicious actor.

      The check is computationally cheap compared to the other badkeys checks. What worries me a bit is that it has a small, but not completely irrelevant, false positive risk. According to my back-of-the-envelope calc, if we’d scan 12 billion keys (that’s the number of certs in the CT logs, though they are not all RSA and not all unique), we’d have a false positive risk of around 1:500,000.

      1. 1

        Is it possible to reduce the false-positive rate a bit more by looking at the Ed448 signature that’s embedded in the RSA key? You wouldn’t be able to verify the signature itself because you don’t know what host key it was intended for. But I think that it would be possible to extract R and make sure that it is a point on the curve and to extract S and make sure that it is less than the group order. I’m not sure what the chances are that random data forms a valid-looking Ed448 signature so I don’t know if this is really worth it or not.

        1. 1

          Patches welcome :-)

      2. 1

        Hi, author here. I’ve posted this mainly to get some feedback.

        What kind of feedback are you looking for? Whether people think the tradeoff of having that false positive risk is worth it?

      3. 1

        The gmpy2 that gets pulled in doesn’t seem to be compatible with python-3.12. Works with 3.10.
        Paramiko is not listed in requirements.txt

        Kinda wish the tool was written in Go to avoid having to deal with stuff like this.

        1. 2

          Unfortunately, there’s little I can do about gmpy2’s upstream not making a python 3.12 compatible release. Some distros have backported patches, so depending on what system you’re on, installing it from your distro may be an option, otherwise you can install their alpha version. I can’t avoid using gmpy2, as it’s the only high-performance bignum library available in python.

          paramiko is intentionally an optional dependency, only required for ssh host scans. The core functionality of badkeys does not need it. It is listed as an optional feature dependency in the pyproject.toml file. However, I intend to provide a better error message and mention this in the documentation.

      4. 1

        Probably unfair to ask of you, but it would be cool if someone could do that ecosystem-wide scan :)

    6. 2

      The underlined chapter titles break badly on Firefox Android, it took me a while before I realised that wasn’t an intentional glitch. Always hurts to see websites designed solely for Chrome.

      1. 3

        I see the same thing. With a slightly different font size I can also get the same thing to happen on Firefox Desktop.

        The web page uses CSS to set text-decoration-thickness: 0.25ex and text-underline-offset: 4%. When the browser renders an underline it is supposed to leave a gap for letters with descenders so that the underline doesn’t draw over them. However it looks like in this case something about these settings, the font metrics, and/or Firefox’s algorithm leaves it thinking that the line is “too close” to some letters and so it doesn’t draw the line under those letters.

        Chrome seems to interpret the same CSS slightly differently and draws the underline a bit farther away from the text. This means that it doesn’t get treated as intersecting the letters so there are no gaps in the underline. I don’t know if one of these behaviors is correct and the other is incorrect or if both are permissible. If you adjust the line so it is closer and/or thicker then Chrome also starts showing similar behavior with the line disappearing under certain letters.

        I have a bit of a hard time blaming the web site for this. The author is using standard CSS with values that don’t seem unreasonable and it renders correctly in both Firefox and Chrome on the desktop (given the font sizes in use). How much testing can we really expect somebody to do to try to find a minor cosmetic issue like this one that somehow only seems to impact Firefox on Android?

      2. 2

        FWIW, it works fine for me with Firefox on Linux (v124.0.1). Actually, I thought it looks pretty awesome - hadn’t seen any design like it, before.

    7. 17

      Isn’t the 24-bit truncation happening in this quoted portion of the code?

      add  word [gdt.dest], 0xfe00
      adc  byte [gdt.dest+2], 0
      

      To extend this to 32-bit sizes you’d need to add another line

      adc  byte [gdt.dest+5], 0
      

      See https://en.wikipedia.org/wiki/Segment_descriptor#/media/File:SegmentDescriptor.svg for why it is +5 and not +3 like you might otherwise expect.

      1. 7

        Oh, thank you very much for this suggestion!

        To verify, I removed a few instructions from the error function to make space in the MBR and added the adc command as you suggested, and this does indeed seem to fix the issue!

        I’ll try and get this properly submitted in the coming days.

        Edit: I updated the article to include your fix for the benefit of other interested readers: https://michael.stapelberg.ch/posts/2024-02-11-minimal-linux-bootloader-debugging-story/#update-a-fix — thanks again!

        1. 4

          Glad to help. This is one of those few times when understanding 16-bit 8086 assembly turns out to still have practical applications.

          When I was first reading the article and I got to the part where you showed the source for read_protected_mode_kernel, I thought that you had probably already tracked down the problem to somewhere within this block and were challenging the reader to try and spot the bug. When looking for a 24-bit limit, having add word followed by adc byte practically screams “look here!” so I thought I had likely found it. Then I was surprised when I read the rest of the article and there was no definitive answer for where the limit was coming from.

          It is always nice when the fix is a single line. And when you can get it correct on the first try. It is too bad there is no extra space to insert this instruction without removing something else. You could maybe save a byte by using add byte [gdt.dest+1], 0xfe instead of add word [gdt.dest], 0xfe00 since everything is always 256-byte aligned? I’m sure there are other places to save a byte or two but nothing I can find right now while I’m on my phone.

          1. 1

            Yeah, my assembly skills are limited to reading and making educated guesses :)

            Unfortunately the extra adc instruction seems to take 5 bytes, so changing the error routine seems like the least-invasive change to me to get enough extra bytes :)

            Sebastian Plotz (the author of Minimal Linux Bootloader) also emailed me, confirming that your fix looks good to him as well, and saying he wants to publish a fixed version.

    8. 2

      In the first picture you have 0x000081DA as the “Physical address”. Should this be 0x0000F49D?

      1. 2

        Definitely. Thanks for noticing. Will fix! Fixed!

    9. 1

      I’ve had a question on my mind for a while, and haven’t found time to dig in for answers. Since it’s fairly on-topic, I’ll hope a shell wizard comes across the thread. Thanks for documenting this behavior!

      When working in general purpose languages, I’ve written pipelines with shells in them to pass binary data through /dev/fd/%d files (to something more interesting than cat, of course):

      printf '%b' 'a\0bc\0de\0f' | od -c
      # => a  \0   b   c  \0   d   e  \0   f
      
      printf '%b' 'a\0bc\0de\0f' \
        | bash -c '\
            f () { dd if=/dev/stdin of=/dev/stdout bs=1 count=3 | base64; }; \
            FIRST_THIRD="$(f)"; SECOND_THIRD="$(f)"; \
            cat <(base64 -d <<< "$FIRST_THIRD") <(base64 -d <<< "$SECOND_THIRD") -' \
        | od -c
      # => a  \0   b   c  \0   d   e  \0   f
      

      I’d probably do the chunking and encoding in the host language, exporting it from the calling-process’s environment, but you get the idea. When you use process-substitution like that in bash or create an additional file descriptor in any POSIX shell:

      { echo 1 >&4; } 4>&1
      

      You don’t need to create a named pipe or tempfile in the global file-system namespace. Is there no such file, managed by the shell behind the scenes, and if there isn’t, how can I (in eg. any language with GLIBC bindings) create the same sort of pipe?


      edit: There’s almost no chance the code I was thinking of behaves well when arbitrarily signaled, and now I’ll be excessively conscious of that for my next project :p

      also, I guess the file backing /dev/fd/4 there is the existing stdout, so this is mostly about what’s really going on under the hood with process substitution and how I could emulate it

      1. 2

        Are you looking for a pipe? The ordinary, non-named kind? If so, man 2 pipe. If not then I’m not sure exactly what you’re asking. Can you be more specific?
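
        To make that concrete, here is a minimal sketch (error handling omitted) of creating exactly that kind of unnamed pipe from C. On Linux, bash’s process substitution is essentially this plus handing the child a /dev/fd/N path that refers to one end:

        #include <stdio.h>
        #include <unistd.h>

        int main(void) {
            int fds[2];                          /* fds[0] = read end, fds[1] = write end */
            pipe(fds);                           /* no name anywhere in the filesystem */

            const char msg[] = "a\0bc";          /* binary-safe: NUL bytes pass through */
            write(fds[1], msg, sizeof msg - 1);
            close(fds[1]);

            char buf[16];
            ssize_t n = read(fds[0], buf, sizeof buf);
            printf("read %zd bytes\n", n);
            return 0;
        }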

        1. 2

          Are you looking for a pipe? The ordinary, non-named kind? If so, man 2 pipe.

          Ahh, it’s so obvious. Thx! I’m coming at this from the top down (high-level -> low), and kept it non-specific because I really didn’t want to specify a host language. I’m bright enough to see the shell doing something I can’t, and know enough to know it’s something any process should be able to do (hence it must be in eg. libc), but didn’t quite make the connection that such platforms are exposing popen but not pipe. When (nebulously) they do expose a pipe or port API, it’s often without the ability to access the underlying file descriptors unless you’ve called open on an actual file. I figure that must be because a clever runtime doesn’t actually need to open a pipe for most use-cases (ie. most programs must not actually need it), like with green threads, and that it impedes cross-platform operation, but now I know what I need to bend these platforms into doing. Thx again!

      2. 1

        In addition to unnamed pipes described by the other comment, the shell notation 5>&7 (make fd 5 a copy of fd 7) is equivalent to the POSIX API dup2(7, 5) - see man 2 dup2.
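
        For example, the 4>&1 in the { echo 1 >&4; } 4>&1 snippet above corresponds roughly to this sketch:

        #include <string.h>
        #include <unistd.h>

        int main(void) {
            /* 4>&1: make fd 4 refer to the same open file as stdout. */
            dup2(STDOUT_FILENO, 4);

            /* echo 1 >&4 */
            const char msg[] = "1\n";
            write(4, msg, strlen(msg));
            return 0;
        }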

    10. 1

      I could have sworn that I once used loadhigh on a 286. The article seems to suggest that doing this without emm386.exe is not possible but emm386.exe won’t work on a 286 so I couldn’t have been using that. Maybe I’m misremembering? Or maybe I had a third-party driver to make this work?

      I think I was overlaying the monochrome text video memory at B000-B7FF. That worked fine as long as you never switched into monochrome text video mode. If you ever did then your PC would crash. This worked without remapping any RAM from extended memory since there was already memory at that address.

      I was kind of hoping for some mention of the Extended BIOS Data Area. But if you’re only going to mention the BIOS Data Area in one sentence then it probably doesn’t make sense to spend time talking about the XBDA.

      The memory map reserved two chunks of the address space for video: one for monochrome displays and one for color displays. But only one of them can be in use at any given time.

      I don’t think this is strictly true. It was possible to have both a CGA card and an MDA card in a single machine with a monitor connected to each. This was useful for programmers who could have a program under development running on the CGA card and use the MDA card to show a debugger at the same time. Early multi-monitor support!

      You have a small typo: “When UBMs are available…”

      1. 1

        I think you may be right. In principle, UMBs could be offered by the chipset or other hardware given the right drivers, and I think I read a note on that in my research. But then I couldn’t find more details so I chose to simplify the story in the post. Would love to know more though!

        Will fix the typo tomorrow. Thanks.

      2. 1

        I could have sworn that I once used loadhigh on a 286.

        With stock DOS, be it MS-DOS, PC DOS or DR DOS, it was impossible.

        But there were a handful of relatively expensive specialist memory managers that only worked on specific 80286 motherboard chipsets which could use the chipset (rather than the CPU) to map UMBs on 286s. I never saw a single one in a decade in the DOS industry but I read about them.

        The best known was from the vendors of 386Max (now GPL!) and it was called BlueMax, because it only ran on 286 IBM PS/2 machines. “Big Blue” -> Blue Max, geddit?

      3. 1

        Yes, one could, but it was chipset-specific, and specific chipset drivers were available to link UMBs into the chain. This seems to mention a few:

        https://retrocomputing.stackexchange.com/questions/7785/what-286-chipsets-support-umbs

        Err - from memory, the monochrome memory was 64k from A000, the colour 64k from B000, and the colour text mode started at B800. So (pre DOS 5/6), one could claw back 96k before the text (CGA/EGA/VGA) region.

        Yes, one could use both cards at the same time (or 2 VGAs, one in mono mode). I’d often use that form when debugging, especially device drivers, as one could be used for simple direct logging via memory writes, w/o affecting the use of the other.

        As to load high without specific DOS (or driver) support, that was easy. Link the UMBs into the main chain, use the Int 21 call to change the memory allocation strategy to “last fit”, then load the program. However, many TSRs had their own ability to relocate themselves, assuming the chains were linked and the system was still in “first fit” mode.

        Usage did rather change once DR-DOS (then MS-DOS) got the ability to relocate the major portion of their code into UMBs, and the choice of what to juggle where was altered.

        1. 3

          EGA/VGA graphics mode address space was 64K from A000-AFFF. MDA (monochrome) text mode address space was 32K from B000-B7FF. CGA address space was 32K from B800-BFFF, used for both text and graphics modes. These latter two ranges were designed to not overlap so that you could have both a CGA and an MDA card in a single machine at the same time with no conflicts. I think this also worked when combining an EGA card with an MDA card.

          VGA was mostly backward compatible with CGA and had some limited MDA support so a VGA card would have memory at the entire region A000-BFFF. If you limited yourself to just CGA modes then you’d only be using B800-BFFF and all of A000-B7FF would be available. But making it into regular conventional memory was complicated by the XBDA that lived just below A000, putting an inconvenient hole in the middle of your free space. And there were probably also a few programs that would get confused if they saw more than 640K of available conventional memory.

          I liked to play games and those would use VGA graphics modes and thus would need A000-AFFF. And totally avoiding 80x25 color text mode was basically impossible so that meant that B800-BFFF was also in use. But B000-B7FF was only used for MDA monochrome text modes and basically nothing used those so this space was very convenient to use as a UMB.

          1. 1

            You’re correct, I misremembered where the MDA area was.

            However many early clones simply did not have an Extended BIOS Data Area (like the one I used in 1990), or, when such started appearing, had a BIOS option to relocate it to low memory. Hence there was no impediment to having the main memory region grow beyond 640k.

            As to programs not liking it, possibly. At the time, we were mainly using text mode, CLI based development tools, and they relished the extra memory. My first machine at work had an MDA, so I reclaimed the A000 region; I eventually got a colour card in it, and then reclaimed the next 32k.

            Some colleagues had 386sx machines with 1M of RAM; once I got one I was able to add UMBs and move various things there, gaining an even larger TPA.

          2. 1

            MDA (monochrome) text mode address space was 32K from B000-B7FF.

            Yup. I fairly routinely used

            device=c:\dos\emm386.exe ram frame=none I=b000-b7ff 
            

            … to use that mono space, unneeded on a colour machine, and enable EMS for disk caching etc. without dedicating a 64kB UMB to the page frame. It usually worked on almost all machines.

    11. 2

      Another thorn in the PC side was the A20 gate “required” to run MS-DOS on 80286s (which could address 16MB of physical memory). The issue was software that relied upon memory wrapping around the 1MB boundary, but amusingly enough, I think the only software that relied on that was … MS-DOS itself! To remain compatible (somewhat) with CP/M. Sigh.

      1. 1

        http://www.os2museum.com/wp/who-needs-the-address-wraparound-anyway/ goes into details on this. Does anybody know of any program in the wild that actually used call 5? I assume that there must have been at least a few in the early days that were the result of running a program through one of the 8080 to 8086 translators. Maybe those programs got quickly displaced by native ones so that there are no examples left to easily find?

    12. 21

      Best wishes to Drew and everyone else. You guys are more important than the site, glad you’re getting sleep.

      EDIT: Hopefully the DoS is a mistake and not a protection racket.

      For those suggesting solutions: thank you. We need a layer 3 solution and that means $$$$$

      The sort of money you could earn running a business to make this “go away” would be impressive. That’s concerning.

      1. 19

        The sort of money you could earn running a business to make this “go away” would be impressive.

        It sounds like you are describing Cloudflare?

        1. 11

          And it sounds like the money is indeed impressive!

          1. 5

            “nice site you got there, would be a real pity if it got DDoS’d”

          2. 1

            jokes aside, do you happen to know what order of magnitude we’re talking about for network filtering from Cloudflare?

            1. 2

              Ranked by traffic, 3,280 out of 10,000 most popular websites globally use Cloudflare.

              source

              1. 1

                thanks! I was referring to the monetary costs though.

                1. 1

                  Even the free tier gets significant DDOS mitigation.

                  1. 6

                    for the specific case, from the article:

                    We spoke to CloudFlare and were quoted a number we cannot reasonably achieve within our financial means

                    1. 5

                      Ahh - sourcehut want layer 3 protection - I assume to keep using their own IP range. That’s quite expensive via cloudflare (whose cheap service is where you point your visitors to cloudflare’s IP addresses, and they do reverse proxy stuff, which protects you from layer 3 attacks by hiding your real IP).

                      That doesn’t work for SourceHut because they offer a bunch of non-HTTP services (and looking at the status page, all the HTTP stuff is up).

                      It’s also true in my experience that if you speak to sales at any vendor, you get quoted eyewatering numbers - I think because sales staff are very expensive. I’ve seen pricing gaps of 1000x between “credit card payment via form” and “speak to sales” for what is very much the same underlying service.

      2. 5

        I don’t know if they are in need of funds for smoothing this transition but I’m willing to pony up some money if there’s a fund started.

    13. 6

      I’ve written more than my fair share of IP stacks, usually for high speed packet capture.

      On one of these, every now and again, the stats would show that we were capturing at a line rate of terabits per second (or more), on devices with 100Mb Ethernet.

      I tore my hair out trying to reproduce the problem. It turned out that every now and again a corrupted IP packet would come through with a header length field that was impossibly small. We’d drop the packet as corrupt and update the “invalid packets” counter, but we also updated the “mbps” counter to show that we had still processed that amount of data (i.e. we handled it just fine, we weren’t overloaded, it’s the packet that was wrong), except we used the packet length computed from the corrupt header when adding to the stats…

      1. 4

        My favourite network bug (which, fortunately, I didn’t have to debug) was on our CHERI MIPS prototype. The CPU allowed loads to execute in speculation because loads never have side effects. It turns out that they do if they’re loading from a memory-mapped device FIFO. Sometimes, in speculation, the CPU would load 32 bits of data from the FIFO. The network stack would then detect the checksum mismatch and drop the packet. The sender would then adapt to higher packet loss and slow down. The user just saw very slow network traffic. Once this was understood, the CPU bug was fixed (only loads of cached memory are allowed in speculation). This was made extra fun by branch predictor aliasing, which meant that often the load would happen in speculation on a totally unrelated bit of code (and not the same bit).

        Apparently the Xbox 360 had a special uncached load instruction with a similar bug and the only way to avoid problems was to not have that instruction anywhere in executable memory.

    14. 1

      The “fake threads” discussed here remind me of the Virtual Threads from Java’s Project Loom.

      They also sound very much like Green Threads. Rust originally started out with Green Threads and then eventually dropped them and decided to support only Native Threads. Bringing them back now might face a difficult path.

    15. 5

      What does “RL:” stand for? The program is called rusage so if there was going to be any prefix on these lines I would have expected “RU:” or maybe “rusage:”.

      I tried to see if there were any hints in the git history. This prefix was added in https://github.com/jart/cosmopolitan/commit/cc1920749eb81f346badaf55fbf79620cb718a55. This commit touches over a thousand files. The description of this commit talks about TLS and makes no mention of rusage. It appears that a fairly large rewrite of rusage somehow sneaked its way into this otherwise unrelated commit. So that’s a dead end.

      What size of buffer is std::ifstream using? Does increasing that make any difference at all? If the goal is to reduce system call overhead it would seem like an easy way to do that is to reduce the number of system calls by making each one read more data.
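
      As a sketch of the principle, using C stdio rather than std::ifstream (where the equivalent knob would be rdbuf()->pubsetbuf(), with implementation-defined effect; the filename here is just a placeholder):

      #include <stdio.h>

      int main(void) {
          FILE *f = fopen("input.txt", "r");    /* placeholder file name */
          if (!f)
              return 1;

          /* Give stdio a 1 MiB buffer so each underlying read() pulls in ~1 MiB
           * instead of the default few KiB, reducing the number of system calls. */
          static char big_buf[1 << 20];
          setvbuf(f, big_buf, _IOFBF, sizeof big_buf);

          size_t total = 0, n;
          char chunk[4096];
          while ((n = fread(chunk, 1, sizeof chunk, f)) > 0)
              total += n;
          printf("%zu bytes\n", total);
          fclose(f);
          return 0;
      }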

      an i/o system call that happens magically as a page fault is still a system call.

      Is it still a system call though? A “context switch”, sure. But I wouldn’t think of this as a system call myself.

      1. 1

        Is it still a system call though? A “context switch”, sure. But I wouldn’t think of this as a system call myself.

        It is not an explicit system call, but I would expect a fault to be as expensive as a system call: your CPU state has to be preserved, we have to go to another privilege level which means dumping out all the speculation stuff and doing all the other expensive mitigation for the sieve-like nature of modern CPUs, and you’re then in the kernel futzing with data structures and potentially arranging disk I/O and putting the thread to sleep until that comes back.

        The only time mapping memory is cheaper is when there are enough spare resources (address space in the user process, physical RAM for the page cache in the kernel) to have already mapped the things you need – and the kernel has to have been smart enough to read it in the background on its own and arrange the mappings in advance. But then who gets billed for that background work is always an interesting (and frequently unanswered) question in models like prefetched mappings or even the newer completion ring style APIs.

    16. 1

      This kinda seems like doing things the hard way. It will work, eventually. But if you avoid the use of all the tools that are designed to make your life easier you’re going to be spending a lot of time trying things with no guarantee of progress. There was no attempt to use valgrind. There was a brief mention of ASan but it is dismissed without trying it because it “probably won’t be of use”.

      Where did the original binaries come from? I see that they were living in /usr/bin, which means there is a good chance they were installed via the operating system’s package manager. If this is accurate then there’s a decent chance that the debug symbols (that have been stripped from the binaries in the package) are installable as a separate package.

      Take OpenSUSE for example. The openrct2, openrct2-debuginfo, and openrct2-debugsource packages are available at https://download.opensuse.org/repositories/games/openSUSE_Tumbleweed/x86_64/. These should be installable via the command line. (Is it zypper on openSUSE?)

      For Ubuntu the openrct2 and openrct2-dbgsym packages are available from https://ppa.launchpadcontent.net/openrct2/nightly/ubuntu/pool/main/o/openrct2/. These should be installable via apt.

      Installing these packages would provide a meaningful backtrace in the debugger when a crash occurs.

      If for some reason you cannot get debug symbols this way, how about doing a release build with debug information included using -g? A build created this way should crash just the same as the release build that has had its debug information stripped, but will provide a meaningful backtrace in the debugger when a crash occurs. It won’t be as nice to use in the debugger as a dedicated debug build but when the problem doesn’t reproduce in a debug build, debugging the release build is better than nothing.

      1. 1

        I think it’s possible that the author is not aware of the tools or, in the case of the packaged debug symbols, not aware that such an option exists.

        I regularly catch myself forgetting about debug tools at my disposal. I often work in embedded environments where, for some reason or another, be it a compiler from the ’90s or a toolchain issue (the toolchain often being supplied by a vendor), it is not possible to use all the nice tools available when doing development on a “big machine”. Then, when going back to development on a mainstream OS, I somehow forget to break out of the old habits.

        The advice presented is all helpful IMO.

    17. 2

      If everything is still an fd, I’m not sure what the difference is between actually implementing this approach and acting as if this approach was implemented. Nothing is stopping you from using pread/pwrite on files (obviously) and just never calling lseek. An error from read on a file is not much different than an error from lseek on a pipe or socket, or named pipe, not to mention platform specific things like timerfds and what not.

      Also, unless you remove dup altogether, you just shift the problem to when you duplicate the post-streamified fd. Even if lseek is gone, reads on the two fds will interfere with the current position in the same way.

      I could see this working if fds and sds (“stream descriptors”) were different types but I think the existence of fifos means open can’t return just fds (non-streamified descriptors).

      1. 3

        You can avoid calling lseek yourself but if you dup a descriptor and hand it off to a library or another process you can’t control whether or not it calls lseek on its descriptor. I guess if it decides to do that you’d still be fine as long as you only used pread/pwrite and never did anything that read the file position.

        I’m not entirely clear on the author’s proposal but it sounds like the idea is that if you dupped a “streaming view” of a file then the duplicated copy would have its own independent file position? Or maybe dup on a “streaming view” works the same way that things do now (with a shared file position) but if that bothered you then you could choose to not call dup on the streaming view. Instead you’d create a brand new streaming view from the same underlying file. Then each streaming view would have its own position and you could hand one to other code without worrying about the effects on your own streaming view.

        Of course none of this solves the issue of what do to if you have a real stream (not a streaming view of a file) like a pipe. If you dup it then a read from any copy will consume the data and that data won’t be available on other copies of the descriptor. Maybe this is simply defined as the expected behavior and thus as OK?

        Named pipes (FIFOs) would complicate things. But this article seems like it proposing an alternative OS design that is not POSIX but is instead “POSIX-like”. In this alternative world we could say that named pipes are not supported. Or that they have to be opened with a special open_named_pipe syscall. Or that the file descriptor returned by calling open on a named pipe is a special stub descriptor. Attempting to call pread/pwrite on the stub will fail. The only way to do anything useful with the stub would to be to create a streaming view from it and then call read/write on that streaming view. This is admittedly kind of ugly but that’s the price for maintaining the ability to open named pipes with open.

        There are probably other complications. How do you handle writes to files open for O_APPEND? Does pwrite write to the file at the requested offset or does it ignore that offset and write at the end? If it does write at the requested offset, how can you atomically append some data to the file? You can’t ask for the current size and then write at that offset because the file size might change between the first call and the second.

        What do you do about select and poll and friends? Do these take streaming views instead of file descriptors now?

        Overall I don’t hate the idea. If we were going to put this in object-oriented terms then the current system would have pread and pwrite methods on the file descriptor interface. But some classes that implement that interface (like pipes, sockets, etc.) don’t support those operations so they just fail at runtime if you try to call those methods. Usually this is a sign that you’ve got your interfaces designed poorly. The most obvious fix for this type of thing would be to split things up into two (or more) interfaces and have each class implement only the methods that make sense for that particular class, and maybe create some adapter classes to help out. That seems to be what’s being proposed here, with the role of the adapter class being played by the “streaming view”. The most significant difference that I can see is that constructing new wrapper objects would normally be considered fairly cheap but constructing the streaming view would require an extra syscall which could be expensive enough that people would want to avoid it.

        I wonder if it would be possible to emulate the streaming view in userspace in some place like the C library. That would get the kernel entirely out of the business of maintaining file offsets and leave them up to the process to track. The C library would be allowed to call read and write on objects like pipes and sockets but for real files it would only be allowed to call pread and pwrite. If the user code creates a streaming view of a file and tries to call read on it then the C library would have to translate that request to an equivalent pread call and then update the file position stored in the streaming view. Doing this for any POSIX environment would probably be somewhere between difficult and impossible but maybe one can imagine an OS design where it could be made to work.
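
        As a rough sketch of what that userspace emulation could look like (hypothetical names, no error handling, and ignoring complications like O_APPEND):

        #include <sys/types.h>
        #include <unistd.h>

        /* A hypothetical "streaming view": the kernel only ever sees pread/pwrite,
         * and the current position lives entirely in user space. */
        struct stream_view {
            int   fd;
            off_t pos;
        };

        ssize_t sv_read(struct stream_view *sv, void *buf, size_t len) {
            ssize_t n = pread(sv->fd, buf, len, sv->pos);
            if (n > 0)
                sv->pos += n;   /* advance only this view's private position */
            return n;
        }

        ssize_t sv_write(struct stream_view *sv, const void *buf, size_t len) {
            ssize_t n = pwrite(sv->fd, buf, len, sv->pos);
            if (n > 0)
                sv->pos += n;
            return n;
        }

        Two views created over the same underlying descriptor would then have completely independent positions, which is the property being asked for here.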

        1. 1

          My point isn’t that “this isn’t necessary because discipline”, it’s “the amount that this helps doesn’t reduce the discipline required in any significant way.” Everything is still read(Object, … ), pread(Object, …), ioctl(Object, …) etc. Removing lseek doesn’t stop two processes or threads from interfering with each other with read and its implicit seeks on a pipe, socket or streamed file.

    18. 2

      I would exercise caution with the Linux version. It will work fine for processes under a debugger or processes that aren’t being ptraced at all. But for things that are being ptraced by something other than a debugger (like strace, or ltrace), it will abruptly kill that process.

    19. 5

      The part about “Local crashes and global equilibria” reminds me of the story of how AT&T’s entire long distance network went down for most of a day in 1990: https://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collapse