Threads for stevan

    1. 8

      As someone who never again wants to write a k8s yaml file, nor a yaml file to give to the cloud provider to stand up a k8s cluster to consume additional yaml files, is learning a BEAM-based language for me? Or do these BEAM VM apps just get deployed to k8s pods defined with yaml files?

      I feel since I’m a distributed systems guy I should probably just learn Erlang or Gleam for the vibes alone.

      1. 23

        Learning Erlang* / OTP is definitely worth it if you do distributed systems things. It has a lot of good ideas. Monitors, live code updates, and so on are all interesting. Even if you never use them, some of the ideas will apply in other contexts.

        If someone told me I’d built an Erlang, I’d consider it praise, so the article doesn’t quite work for me.

        *Or probably Elixir now. It didn’t exist when I learned Erlang. Also, I am the person who looks at Prolog-like syntax and is made happy.

        1. 5

          Another huge benefit is that it gives you a new way to think about and solve problems. The main application I maintain at work is a high-throughput, soft-real-time imaging/ML pipeline. In a previous life I worked on a pretty big Elixir-based distributed system, and this imaging system is definitely highly Elixir/OTP-inspired despite being written in C++. All message passing, shared nothing except for a few things for performance (we have shared_ptrs that point to immutable image buffers, as an example). Let It Die for all of the processing threads, with supervisors that restart pipeline stages if they crash. It’s been running in production for close to 6 years now and it continually amazes me how absolutely robust it has been.

          1. 2

            All message passing, shared nothing except for a few things for performance (we have shared_ptrs that point to immutable image buffers, as an example).

            Funny thing, Erlang does exactly the same thing with large binaries.

      2. 7

        The vibes, as you suggested, are a great reason.

        1. 3

          That was excellent, thank you!

      3. 3

        Given your interest in formal methods, I can recommend having a look at Joe Armstrong’s thesis. I think he makes a very good case for why Erlang behaviours make sense from a testing and formal verification point of view (even though the thesis itself doesn’t contain any verification).

        Erlang’s behaviours are interfaces that give you nice building blocks once you implement them, e.g. there’s a state machine behaviour, once you implement it using your sequential business logic, you get a concurrent server (all the concurrency is hidden away in the OTP implementation which is programmed against the interface).
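
        To make the idea concrete outside of Erlang, here’s a rough Rust sketch of what a behaviour boils down to (hypothetical names, nothing like the real OTP API, and no supervision or hot code upgrade): the user implements the sequential callback logic, the library supplies the generic concurrent server.

          use std::sync::mpsc;
          use std::thread;

          // The "behaviour": the user only implements sequential business logic.
          trait StateMachine: Send + 'static {
              type Input: Send + 'static;
              type Output: Send + 'static;
              fn transition(&mut self, input: Self::Input) -> Self::Output;
          }

          // The generic concurrent server, programmed against the interface;
          // all of the concurrency is hidden away in here.
          fn spawn_server<M: StateMachine>(
              mut machine: M,
          ) -> mpsc::Sender<(M::Input, mpsc::Sender<M::Output>)> {
              let (tx, rx) = mpsc::channel::<(M::Input, mpsc::Sender<M::Output>)>();
              thread::spawn(move || {
                  // Each request carries a reply channel; the server applies the
                  // sequential transition function to one message at a time.
                  for (input, reply) in rx {
                      let _ = reply.send(machine.transition(input));
                  }
              });
              tx
          }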

        The thesis is quite special in that he was 50+ years old when he wrote it, with like 20 years of distributed systems programming experience already, compared to most theses, which are written by 20-something-year-olds.

      4. 2

        This reminds me of the efforts of a couple of projects years ago to run BEAM on the cloud, like Erlang on Xen (ling) or the rumprun unikernel; there are also other things like ergo (“golang OTP”).

      5. 1

        Yes! I was in the exact same situation as you and learned Elixir. The processes API is straightforward and relatively simple. Elixir’s documentation on the topic is superb. On top of that, Elixir is a great language: functional and very ergonomic. Erlang is fine too, but Elixir has that quick-to-write syntactic sugar comparable to Python or Ruby.

    2. 8

      At this point “make invalid states unrepresentable” is a bit old-hat. Great, sure, but what about making invalid state transitions unrepresentable?

      1. 11

        Using dependent types one can represent state machines as follows:

          StateMachine = record {
            State      : Type
            Input      : State -> Type
            Output     : (s : State) -> Input s -> Type
            transition : (s : State) (i : Input s) -> Output s i -> State
          }
        

        As you can see the input and output types depend on the state, so one can disallow transitions (e.g. return only a valid subset of inputs in each state, or even the empty type if no inputs are allowed in some state).

        (Dependent records are merely syntactic sugar for Sigma types, i.e. dependent pair types, and (x : A) is a Pi type, i.e. a dependent function type.)

        A slight variant of these state machines, where we move the State from being a field inside the record to being a parameter, gives rise to predicate transformers:

          Predicate A = A -> Type
        
          [[_]] : StateMachine State -> (Predicate State -> Predicate State)
        

        These predicate transformers form a monad where bind is the equivalent of the sequencing rule (;) in Hoare logic, which means we can compose operations whose “protocol” is captured by the state machine.
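
        For what it’s worth, here is a rough Lean 4 rendering of the parameterised variant (a sketch: I use Prop for predicates, and I only define the transformer, not its monad structure):

          structure StateMachine (State : Type) where
            Input      : State → Type
            Output     : (s : State) → Input s → Type
            transition : (s : State) → (i : Input s) → Output s i → State

          def Predicate (A : Type) : Type := A → Prop

          -- The induced predicate transformer: from state s, every enabled
          -- input/output step must land in a state satisfying Q.
          def wp {State : Type} (m : StateMachine State)
              (Q : Predicate State) : Predicate State :=
            fun s => ∀ (i : m.Input s) (o : m.Output s i), Q (m.transition s i o)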

      2. 8

        That’s what the typestate pattern is for.
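
        For anyone unfamiliar, here is a minimal Rust sketch of the pattern (a hypothetical connection API, purely for illustration): each protocol state is its own type, and transitions consume the old state, so invalid transitions simply don’t type-check.

          // Each protocol state is a distinct type.
          struct Disconnected;
          struct Connected;
          struct Authenticated;

          // A transition into one of several possible states has to return an
          // enum, which is where the pattern starts to get unwieldy.
          enum AuthResult {
              Ok(Authenticated),
              Denied(Disconnected),
          }

          impl Disconnected {
              // Transitions consume `self`, so a stale handle cannot be reused.
              fn connect(self) -> Connected {
                  Connected
              }
          }

          impl Connected {
              fn authenticate(self, password: &str) -> AuthResult {
                  if password == "hunter2" {
                      AuthResult::Ok(Authenticated)
                  } else {
                      AuthResult::Denied(Disconnected)
                  }
              }
          }

          impl Authenticated {
              fn query(&self) -> &'static str {
                  "secret data"
              }
          }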

        1. 3

          I am getting deja vu :) And almost exactly a year to the day!

          https://nitter.poast.org/tomjaguarpaw/status/1727434384104387055

        2. 2

          I have written a typestate program in Rust, but unfortunately it is unbelievably unwieldy if an operation transitions to one of several possible states.

    3. 4

      I have a draft article about rewriting in which I’ve collected a bunch of quotes; here’s a selection:

      In the following interview, Joe Armstrong explains that he would often write a piece of code and then, the next day, throw it away and rewrite it from scratch.

      Also, in the early days of Erlang it was possible to do a total rewrite of the whole language in less than a week. New language features were added in one work session; if you couldn’t get the idea out of your brain and code it up in that time, then you didn’t do it, Joe explained (17:10).

      In a later talk he elaborated saying:

      “We need to break systems down into small understandable components with message passing between them and with contracts describing what’s going on between them so we can understand them, otherwise we just won’t be able to make software that works. I think the limit of human understandability is something like 128KB of code in any language. So we really need to box things down into small units of computation and formally verify them and the protocols in particular.”

      I found the 128KB interesting. It reminds me of Forth, where you are forced to fit your code into blocks (1024 characters, or 16 lines of 64 characters).

      Speaking of Forth, Chuck Moore also appears to be a rewriter. He said something similar:

      “Instead of being rewritten, software has features added. And becomes more complex. So complex that no one dares change it, or improve it, for fear of unintended consequences. But adding to it seems relatively safe. We need dedicated programmers who commit their careers to single applications. Rewriting them over and over until they’re perfect.” (2009)

      Chuck reimplemented his Forth many times; in fact, Forth’s design seems to be centered around being easily reimplementable on new hardware (this was back when new CPUs had new instruction sets). Another example is Chuck’s OKAD, his VLSI design tools, about which he comments:

      “I’ve spent more time with it than any other; have re-written it multiple times; and carried it to a satisfying level of maturity.”

      Something I’m curious about is: what would tools and processes that encourage rewriting look like? The post mentions that it’s somehow related to the Extreme Programming movement, but doesn’t give a precise reference…

      1. 1

        what would tools and processes that encourage rewriting look like

        I’ve been doing this on a micro level with PRs. I write it once for me, then again (usually in another branch) for the reviewer, so the commits are small and make more sequential sense. Usually I’m surprised that I’m able to make improvements and produce a better outcome. Sometimes I find bugs I didn’t notice the first time around.

        It helps that it’s in Rust so strong types and lint tools etc. give me lots of confidence in refactors. It takes more time, but reduces review length and improves the outcome so it’s a net win IMHO.

        It’s not for every team and every situation, but it’s a good tool to have in the toolbox.

      2. 1

        I came to the comments to mention the time Joe Armstrong mentioned this technique to me in the hallway at an Erlang conference, when I asked him how he approaches productivity. It felt unnatural at first, but over time I’ve tried applying it to more things, including a bunch of the projects that ended up in the komora github project. I’m so happy to see the videos you linked, since I was never able to find anything he wrote about it.

        When I’ve had the luxury of having enough time to comfortably apply this technique, it has really resulted in some nicely modular components, since you’re forced to break things down into pieces that you can reasonably attempt in a day. But over time, you can build these simple modules into more and more interesting, well-decomposed complex systems.

    4. 2

      You’ll note the build script tries to tail the container logs to nearly-automate the copying, but I’m not sure if it actually works – I haven’t taken the time to troubleshoot it.

      What you could do instead is:

      1. Remove the infinite sleep and the grep logic that tries to extract the path to the binary;
      2. Simply call cp "$(cabal list-bin $EXECUTABLE_NAME)" bin after the containerised build finishes.

      Another small improvement would be to add caches outside of the image:

      "$DOCKER" run \
        -v "$PWD":/mnt/:ro \
        -v "$PWD/cache/cabal-cache-musl":/root/.cache/cabal \
        -v "$PWD/cache/cabal-store-musl":/root/.local/state/cabal/store \
        -v "$PWD/cache/dist-newstyle-musl":/mnt/dist-newstyle \
        muslghc
      

      That way we can avoid rebuilding everything between runs. I suppose we could reuse the host’s cache here, but I prefer to keep them separate.

      Thanks for writing this up, by the way; I found it useful. I guess what would be really neat is if we could have a nix config (with system dependencies) and then derive the alpine dockerfile from that (that way we wouldn’t need to update system dependencies in two places).

      1. 1

        Ohh, yeah, that’s really clever. Thanks!

    5. 2

      The example of processing requests and updating a database and cache lends itself nicely to the kind of pipelining that LMAX Disruptors offer.

      It seems to me that people prefer, or at least that a lot more effort is put into, async-like solutions to parallel programming compared to pipelining.

      Does anyone have thoughts on why that might be the case?

      1. 2

        For parallel programming, pipelining gives you at most a constant-factor speedup, where the constant is defined by the application topology and is usually small (smaller than the number of CPU cores a laptop has these days). It’s much more fruitful to re-phrase the problem in an embarrassingly parallel way, to get infinite speedup.

        For concurrent programming, io_uring is basically the disruptor pattern.

        1. 2

          I’m not sure I follow. If you don’t mind, can we build a bit further upon your example? Imagine we have an input queue of requests and we want to produce an output queue of responses.

          First, how do you bridge this gap with your async solution? Do you concurrently map your process function over the input queue (bounded by some worker pool of some size)?

          If so, is what you mean by “infinite speedup” that there will always be plenty of green threads around and the runtime will schedule them as best it can, and that this scales with the number of cores (because the runtime does)?

          As opposed to pipelining which essentially looks like:

                                        +-- update db -----+
                                       /                    \
          input -> compute_response --+                      +--> output
                                       \                    /
                                        +-- update cache --+
          

          Where we only get a speedup of factor 3, because we only have 3 things we can run in parallel?

          1. 3

            Yes, if your problem is parallel processing, partitioning the data into independent subproblems and solving each subproblem sequentially is usually the faster way to get more speedup.

            If your problem is concurrent processing, then it’s a different problem. All cool async/await implementations literally use the LMAX disruptor underneath, in the form of io_uring.

            1. 3

              Here are a couple of advantages I see with the pipeline approach:

              1. Determinism (the outputs preserve the order of the inputs); this will be harder with the async approach;

              2. Batching and backpressure can be trivially applied at any stage of the pipeline (in isolation, without affecting the rest of the pipeline; see the sketch below), which will require more reworking in the async case;

              3. We can apply what Jim Gray calls “partition parallelism” or what the Disruptor people call “sharding” at any stage of the pipeline or the whole pipeline. This effectively adds more cores to a stage without breaking determinism.
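
              A minimal sketch of point 2 in Rust, using bounded channels between stages so a slow stage automatically backpressures the one before it (toy stages, not the db/cache example above):

                use std::sync::mpsc::sync_channel;
                use std::thread;

                fn main() {
                    // Bounded channels: a full buffer blocks the sender,
                    // giving backpressure between stages for free.
                    let (tx1, rx1) = sync_channel::<u64>(16);
                    let (tx2, rx2) = sync_channel::<u64>(16);

                    // Stage 1: compute a response (toy: square the request).
                    let stage1 = thread::spawn(move || {
                        for req in rx1 {
                            tx2.send(req * req).unwrap();
                        }
                        // tx2 is dropped here, closing the next stage's input.
                    });

                    // Stage 2: "update the db" (toy: print), preserving input order.
                    let stage2 = thread::spawn(move || {
                        for resp in rx2 {
                            println!("response: {resp}");
                        }
                    });

                    for req in 0..100u64 {
                        tx1.send(req).unwrap();
                    }
                    drop(tx1); // close the pipeline so the stages terminate
                    stage1.join().unwrap();
                    stage2.join().unwrap();
                }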

              It seems to me that by moving some of the decisions from the runtime to the user, we should be able to do a better job.

              Just throwing everything at the runtime and hoping for the best reminds me of Joe Armstrong’s anecdote of an unmodified Erlang program running only 33 times faster on a 64-core machine, rather than 64 times faster as per the Ericsson higher-ups’ expectations.

              What I’m curious about is: can we do better than Erlang by listening to Jim?

    6. 3

      Regarding the “fancy domain-specific language (DSL) for breadth-first search” and building upon a comment you made a month ago: I feel like we are not very far off from being able to do these breadth-first searches in a simulation using the actual implementation, at least for safety properties. I’m still not exactly sure how to do liveness and fairness in a simulator as opposed to a model checker. It feels like one would need to use timeouts and assert that good things happen before they go off?

      Using the actual implementation in the simulator would avoid having to do refinement, but perhaps one would still want to start with a more abstract specification. Which brings me to: I never understood the formulation of refinement via temporal existential quantification. For me it’s easier to grasp the predicate-transformer formulation that can be found in Back and von Wright’s Refinement Calculus, Morgan’s book Programming from Specifications, and in Abrial’s Z, B and Event-B languages. In fact Event-B is the only language I’ve seen in which going all the way from abstract specification to concrete implementation (that can be run) is actually doable (as opposed to talking about the possibility of being able to do it, as in TLA+).

      Regarding machine-checkable proof languages: I find the Curry-Howard isomorphism between propositions-as-types and proofs-as-programs so aesthetically pleasing that I feel it must be the right way forward here. Kleppmann gave a nice talk about how to verify distributed systems using Isabelle/HOL; while Isabelle doesn’t make use of the CH isomorphism, the technique can be adapted.

      What I don’t see quite yet is how to make a language in which all this is possible at the same time. Ideally one would like to make an abstract specification of some protocol, say some consensus algorithm, model check it, prove it correct, refine an implementation from the spec, and finally test the implementation in a simulator (to ensure no false assumptions were made about the environment). Currently this requires different tools and libraries; I think it would be good to package it all up in one system and come up with ways of structuring the code which make it easier to rigorously check it in this way.

      1. 4

        I’m still not exactly sure how to do liveness and fairness in a simulator

        Wrote a bit about simulation testing for liveness in https://tigerbeetle.com/blog/2023-07-06-simulation-testing-for-liveness!

        The TL;DR: yes, you use a (logical) timeout, which you could set to be very large. For a real system, at some point you probably should be able to replace it with a tighter timeout (the system makes forward progress after each pair of nodes has exchanged two messages, or some such), but we haven’t got to that point yet.

        With liveness, the hard part is not the timeout, but making sure that your simulation is capable of simulating hard liveness conditions. For example, if your simulation can always crash & restart a replica with a certain probability, then it doesn’t check liveness, because crash & restart might be exactly the thing that gets your system unstuck!

        So what you need to do is to use swarm testing to pick a subset of failure modes, and then make those failures permanent.

        To say it another way, the hard part about liveness is not what happens with the system, but what is allowed to happen with the simulator itself. You need to be able to simulate (infinite) behaviors where the infinite tail looks different from the finite start.

        1. 3

          When timeouts are added, a counterexample represents a finite prefix of a behavior where the timeout has expired. This constitutes a safety violation, rather than a liveness issue. You don’t check liveness.

          1. 1

            I think what happens here is that I strengthen the property to make it a safety one? Liveness says that, assuming fairness, progress happens eventually. I strengthen that to the statement that, assuming fairness, progress happens faster than (logical) time T.

            This seems to work well enough in practice, in the sense that it does find genuine liveness bugs where the system can get stuck, and doesn’t have false positives. The latter in particular means that the algorithm does converge from any given state in a bounded amount of time, which of course might not be true for certain mathematical models, but I think should hold for any practical system?

              1. 2

                Not exactly sure what you mean by stronger here, but Reducing Liveness to Safety in First-Order Logic (2018) further muddies the waters here (at least for me).

              2. 1

                I am not claiming that any liveness property can be strengthened to a safety property, but some certainly can!

                Consider a system whose state is a binary bit, and whose trajectory is a bitstring. Consider these two properties:

                • P1: some bit is 1
                • P2: the first bit is 1

                P1 is liveness, P2 is safety, according to the definition in Alpern & Schneider. But P2 is stronger than P1, so, if you managed to prove P2 for your system, you’ve also got P1 for free.

                I am also making an informal claim that the system I am working on behaves roughly similarly, in the sense that interesting liveness properties can be strengthened to safety ones.

                I am actually curious just how powerful this technique is? The counter-example that first comes to mind is this:

                Let the state of the system be a pair of naturals, and possible moves be:

                • decrease the second number: (a, b) -> (a, b - n)
                • decrease the first number, and set the second number to arbitrary value: (a, b) -> (a - n, X)

                The liveness property in question is that the system eventually reaches (0, 0). Now, the “just use a timeout” trick clearly doesn’t work here, because, for any value of the timeout I choose, you could decrement fst and set snd to timeout * 100500.

                But I think it is possible to set a timeout more cleverly? First, let’s define liveness not as reaching (0, 0), but as decreasing fst (so, just making progress). Then, let’s allow me to take a look at the current state of the system and make the timeout depend on that. Given a pair (x, y), I can say that it’ll make progress within y steps, at which point I would select a new timeout.
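
                A sketch of that state-dependent timeout as a checker, under the assumption that each move strictly decreases its component (my toy formulation of the example above, nothing more):

                  enum Move {
                      DecSnd(u64),      // (a, b) -> (a, b - n)
                      DecFst(u64, u64), // (a, b) -> (a - n, x), snd set arbitrarily
                  }

                  // Progress means decreasing fst. From a state (a, b), fst must
                  // decrease within b + 1 steps, or we report a liveness violation.
                  fn check_progress(mut state: (u64, u64), moves: &[Move]) -> bool {
                      let mut budget = state.1 + 1;
                      for m in moves {
                          match *m {
                              Move::DecSnd(n) => {
                                  state.1 = state.1.saturating_sub(n.max(1));
                                  budget -= 1;
                                  if budget == 0 {
                                      return false; // fst stuck for too long
                                  }
                              }
                              Move::DecFst(n, x) => {
                                  state.0 = state.0.saturating_sub(n.max(1));
                                  state.1 = x;
                                  budget = x + 1; // re-derive the timeout from the new state
                              }
                          }
                      }
                      true
                  }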

        2. 1

          (It would be nice to have a couple of simple-to-understand and self-contained examples of how to test for liveness.)

        3. 1

          Neat blog post! There is a broken link to a file on GitHub which was moved: https://github.com/tigerbeetledb/tigerbeetle/blob/main/src/simulator.zig

      2. 2

        Using the actual implementation in the simulator would avoid having to do refinement

        I don’t think this is necessarily true, as there are correctness properties that can only be expressed through refinement (this is a new idea to me actually, previously I just thought refinement was about going from a high-level design to a more detailed design instead of a way of expressing whole-system correctness properties). Imagine you want to ensure your system, at a high level, follows the standard two-phase commit state transition of Propose -> Commit -> (Success OR Fail). This can only really be expressed through refinement. A simple example but they can certainly get more complicated.

        I find the Curry-Howard isomorphism between propositions-as-types and proofs-as-programs so aesthetically pleasing that I feel it must be the right way forward here.

        I think Lean is the closest language we have to this right now. There’s a very interesting textbook called Functional Programming in Lean that I worked through. It has a section called Pitfalls of Programming with Dependent Types which highlights a big issue: when you’re writing proofs (which in dependent types means justifying you have the right to call a function), it is extremely hard to maintain encapsulation. Your proofs end up looking like a reflection of internal implementations. This makes refactoring unbelievably difficult as parts of your implementation become load-bearing very very quickly. Whether this is an artifact of dependent types and will be solved by some future, as-yet-unheralded type system is uncertain.

        I do think there is something to be said for how simple TLA+ proofs are. It’s just induction, which most people have learned at some point! If your invariant holds in the initial state(s) of your program, and you show that if the invariant is true of an arbitrary state then it is true of its successor state, then you’ve proved the invariant holds for all system states. It should be noted that this style of proof also very strongly mirrors the gritty details of your specification, though.
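
        For reference, this is essentially Lamport’s INV1 rule, writing Inv' for Inv evaluated in the successor state:

          \frac{Init \Rightarrow Inv \qquad Inv \land Next \Rightarrow Inv'}
               {Init \land \Box[Next]_{vars} \Rightarrow \Box Inv}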

        Ideally one would like to make an abstract specification of some protocol, say some consensus algorithm, model check it, prove it correct, refine an implementation from the spec, and finally test the implementation in a simulator (to ensure no false assumptions were made about the environment).

        I guess this is the holy grail of formal verification. I don’t know whether we even have a disparate toolchain that does this yet. Certainly there are some formal verification languages like Dafny, but I don’t think they’re very good as high-level specification languages, nor do they have simulator functionality.

        1. 1

          […] a way of expressing whole-system correctness properties). Imagine you want to ensure your system, at a high level, follows the standard two-phase commit state transition of Propose -> Commit -> (Success OR Fail). This can only really be expressed through refinement.

          Interesting, I’m not completely sure I follow. Do you mean to say that it’s a way to express a whole-system correctness property without involving the whole system? Because otherwise, why can’t we show that 2PC works by: 1) having all actors involved in the protocol log their actions, and 2) expressing the property in terms of all the logs?

          Your proofs end up looking like a reflection of internal implementations. This makes refactoring unbelievably difficult as parts of your implementation become load-bearing very very quickly. Whether this is an artifact of dependent types and will be solved by some future, as-yet-unheralded type system is uncertain.

          I’m mostly concerned about the black-box behaviour of a (sub)system (i.e. what’s observable from the outside); what happens inside the box, less so. What happens between the boxes, i.e. the protocols, is important as well. So yes, if you change the API then you’ll have to change your black-box proofs, but the same is true for testing.

          (Regarding inside the box: I haven’t used Lean and only skimmed the chapter that you linked to, but my experience is that once you have massaged a definition to the point where the proofs you want to make about it are neat, then you have already done all the refactoring. It’s like an extreme version of test-driven design.)

          I don’t know whether we even have a disparate toolchain that does this yet.

          Which steps are you missing?

          1. 1

            You can certainly write predicates over history variables, but that usually makes model-checking intractable because it causes the state space to grow factorially. As an aside, Ron Pressler likes to make the point that all invariant & liveness checking is secretly a form of refinement checking, which I thought was uselessly pedantic, but is true. Refinement checking as it’s usually referred to is I guess the most powerful possible expression of this process.

      3. 1

        […] In fact Event-B is the only language I’ve seen in which going all the way from abstract specification to concrete implementation (that can be run) is actually doable (as opposed to talking about the possibility of being able to do it, as in TLA+).

        Do you have a good repo or online example of this? I couldn’t really pull it out of Abrial’s Event-B book.

    7. 4

      Does anyone happen to know of a small self-contained example which uses the InternPool technique to implement an AST and an interpreter for a small toy language (e.g. integers and addition)? The answers to one of the audience questions seem to suggest that it should be possible, but I can’t immediately see it from the API of the InternPool alone.

      1. 6

        The API of InternPool is a little specialized to its specific use case of, well, interning. For something like an AST or IR, you’re not trying to intern things; just to construct a sequence, which will be immutable after initial construction. So, the constraints are a bit different in this case, leading to a different API design. In particular, the “friendly” representation isn’t really a “key” in any sense, it’s just unwrapped data. I’d name the types a little differently here.

        You probably wouldn’t have a get function, since the logic to construct the AST is all centralised in one place anyway – the parser – so that logic might as well just live there.

        Here’s a skeleton of a toy example I threw together, it might help to get the idea across: https://zigbin.io/039842

        1. 3

          Thanks, that’s really helpful. You should consider turning that into a proper blog post! I think we need simple examples of data-oriented design like that, in order for these techniques to catch on in the programming languages community.

      2. 1

        At its core it’s just an array hash map. You put something in the map and find out its index. That’s it.

        All the complications added on top of it are just organizing memory layout with various techniques to maintain trivial serializability, efficient memory usage, type safety, and thread safety.
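
        A minimal Rust sketch of that core, ignoring all of the memory-layout tricks (this is the generic string-interner idea, not Zig’s actual InternPool API):

          use std::collections::HashMap;

          // The core of interning: a map from value to index, plus a vector
          // so that the index can be turned back into the value.
          #[derive(Default)]
          struct InternPool {
              map: HashMap<String, u32>,
              items: Vec<String>,
          }

          impl InternPool {
              fn intern(&mut self, s: &str) -> u32 {
                  if let Some(&index) = self.map.get(s) {
                      return index; // already interned: same value, same index
                  }
                  let index = self.items.len() as u32;
                  self.items.push(s.to_string());
                  self.map.insert(s.to_string(), index);
                  index
              }

              fn get(&self, index: u32) -> &str {
                  &self.items[index as usize]
              }
          }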

    8. 2

      Perhaps we can do better than spawning lots of processes and context switching between them, regardless of whether that happens in the language runtime or the operating system?

      People like Jim Gray advocated using pipeline parallelism instead (one process/thread per core) and partitioning/sharding slow stages in the pipeline (scaling a stage by assigning more cores to it). This is also what the LMAX disruptor does, and what databases do (for a more recent example, see Umbra’s morsels).

      Given the success of pipelining in manufacturing and hardware, I wonder why nobody seems to be following Jim’s advice and trying to make it easier to do in software (perhaps via new programming languages)?

      1. 1

        Somewhat similar to Spark (the big data framework).

        1. 1

          I suppose, but perhaps we need something without the clunkiness of being a library, and without the overhead that causes it to be slower than using a single thread when run locally?

    9. 7

      It seems to me like the whole world would be a lot better off if we added support for determinism right on down to the hardware level. As mentioned here, one has to jump through elaborate hoops at higher levels of the stack to get anywhere close to determinism.

      That’s why the Antithesis approach is the most exciting one that I’ve seen so far, since it handles so much of the determinism under the hood.

      The closest thing I’ve seen to implementing determinism in the software stack itself is Deterministic Process Groups, which allow for determinism in OS task scheduling and thus concurrency. Does anyone know of anything else like this?

      1. 3

        It’s not clear to me that it’s desirable to solve the problem of determinism as far down the stack as possible.

        For example, can deterministic hardware be as performant as non-deterministic hardware? Is solving it at the OS level better than at the programming-language runtime level? The programming-language level should have more information about what the programmer intends to do, and might therefore be in a better place to make the right decisions with regards to scheduling/introducing interesting entropy? (A bit similar to how database people don’t want to rely on the OS scheduler or IO system, because at the database level they have more information and can make better scheduling and IO decisions.)

        1. 2

          (Antithesis employee here)

          I agree that if you can make a specific simulator for your use case you probably will have a better time. The issue is that it’s hard. You need to make sure every library, third-party service, etc. is deterministic, or that you can mock it. If you’re FoundationDB or TigerBeetle and took the time to structure the code to make that easy, that’s awesome!

          If you have a codebase that is older and you want to add DST to it, that’s where I view the hypervisor approach as the better option. You don’t have to change anything and you get DST “for free”.

        2. 1

          If you want to cover as much software as possible (which is a desirable thing in this case), then you need to be as low in the stack as possible. FoundationDB built a simulator, but that simulator only makes sense for FoundationDB. A database only needs to care about itself, but if you have software that uses POSIX threads, networking, locking, atomic operations and filesystems, the language has no way of handling that without completely restricting what resources from the OS are available to you. And if you do that, then it only works for that language.

          Antithesis (from the same people as FoundationDB) tries to be generic: because it sits a lot lower in the stack, it works with all software (in theory; there are probably limitations).

        3. 1

          The issue is that, at each level, it can’t be solved because of leaky non-determinism at some lower level. So I hear you about performance, but there is likely a way to have deterministic and non-deterministic modes that are functionally equivalent, where the non-deterministic mode can be chosen for performance reasons.

          Other than that, the lower the better to me.

    10. 2

      I think tools like these are very cool and they have me wondering about the long-term relevance of TLA⁺. Beyond formal specifications as a way of getting your thinking straight, the big value of TLA⁺ is a terse & expressive language for extracting concurrent logic into a deterministic simulator where you can poke at it and state invariants or liveness properties. However, if you’re able to simulate your system deterministically in-situ then that is obviously less work. For software that runs across multiple processes or even nodes you have Antithesis, which is an expensive proprietary product, but I can see simple free single-process deterministic thread schedulers like this proliferating across languages - perhaps as Hypothesis extensions. In the TLA⁺ world we know that random simulation is very effective at finding invariant violations, and often more tractable than a full search of the state space. These schedulers offer that.

      1. 2

        If we decide that writing specifications and ensuring that our programs adhere to them is important, then I think one would hope that our programming languages get better at doing those things.

        I’d like to see programming languages with (optionally) deterministic run-time systems which can be toggled on during testing; that way all the stuff in my post would be hidden away in the run-time rather than having to be implemented and dealt with by the user. Once this is in place, various verification methods can more easily follow, model checking being one example.

        I believe the P programming language can model check across multiple processes and even nodes. There’s a lot more room for experimentation in this area though.

    11. 4

      But you also might choose to provide some sort of guarantees about execution time, and that’s really hard. Two approaches work. One is to make sure that processing is obviously linear.

      In case anyone’s curious, Martin Hofmann’s paper Linear types and non-size-increasing polynomial time computation (1999) explains how to do this.

      As we know from the PRF exercise, this won’t actually prevent people from writing arbitrary recursive programs. It’ll just require some roundabout code to do that. But maybe that’ll be enough of a speedbump to make someone invent a simple solution, instead of brute-forcing the most obvious one?

      Quoting from TIGER_STYLE.md:

      Do not use recursion to ensure that all executions that should be bounded are bounded. […]

      Put a limit on everything because, in reality, this is what we expect—everything has a limit. For example, all loops and all queues must have a fixed upper bound to prevent infinite loops or tail latency spikes. This follows the “fail-fast” principle so that violations are detected sooner rather than later. Where a loop cannot terminate (e.g. an event loop), this must be asserted.

      I see only allowing primitive (or, more accurately, structural) recursion as the language designer’s way of helping the user do the sensible thing. Writing an infinite loop sometimes happens by mistake, unlike using foreach <really large number> to effectively get a while loop (that’s a conscious choice).

      In my opinion general recursion is an effect; it doesn’t belong in a pure language. When used, it should be made explicit via some effect system (both to other humans and to the machine).

      1. 1

        Quoting from TIGER_STYLE.md:

        Thanks! As embarrassing as it is, I didn’t make this connection!

    12. 3

      I wonder if aiming for being intuitive is perhaps better than aiming for not being too strange?

      Consider SQL: strange (unusual) compared to the languages around at the time it was introduced, but intuitive (easy to learn) given some prior exposure to discrete math.

      Another example: dependently typed languages (e.g. Agda and Idris) are more strange than Haskell, but more intuitive (because type-level computations unify many of Haskell’s extensions in a coherent way).

      I also think that being intuitive is closely tied to having a nice and simple denotational semantics.

      SQL has relational algebra (not precise, but good enough for intuition). Agda and Idris have Martin-Löf type theory (predicate logic). Rust, on the other hand, is an example of a language where the syntax was designed before the nice and simple semantics was found. I know people have tried to give a semantics for Rust, but doing it the other way around makes it difficult because: 1) your semantics is now constrained by your syntactic choices, and 2) the semantics can/should inform the design of the syntax but since you’ve already committed to a syntax it gets expensive to change.

      1. 1

        Maybe that confuses what I would disambiguate as grammar strangeness vs. syntax strangeness. SQL is, in most cases, just words and quotes and a semicolon. The allowed syntax is very concise and simple, and not strange at all.

    13. 1

      @matklad: Thanks for turning your comments from my previous post into their own proper post; it’s easier to follow now that you’ve dressed your code with some words!

      As a response to my post I received an email asking if I’d seen the talk Property-testing async code in Rust to build reliable distributed systems by Antonio Scandurra. I hadn’t seen it, but have watched it since, and it seems to be about doing a similar thing to what you do, but at the async level rather than the thread level (in about 100 LOC); they use this approach to test the Zed editor. I thought I’d mention it here, as it might be interesting to compare the two approaches.

    14. 20

      (note: the following comment probably warrants a “rant” tag)

      I have two alternative hypotheses for the explanation of the status quo.

      The first hypothesis is that the state of the art in academia doesn’t necessarily correlate with what is actually useful in practice. I’m not claiming (yet) that that’s what is happening with stateful property-based testing, but I have a good example from a different area — there’s a lot of literature in academia about dealing with errors during parsing, and all that literature was not helpful to me when I had to solve the problem. Although I have read a tonne of papers on the topic, none of them ultimately fed into the solution that worked.

      My second hypothesis is that academia is geared towards proving to other academics that a paper is worth a publication, but is pretty bad at actually explaining ideas in a way that allows incorporating them into software people use. This post is actually an example of this. I am pretty much the target audience — I regularly use property-based testing for my hobby stuff, my work thing is basically an appendage to one giant randomized test of strict serializability, I wrote a property-based testing library, and I formulated a specific idea which, while not a theoretical advancement, I think moved the state of the art of practice a little bit further.

      I very much want to learn about a property-based testing technique that I should be using instead of what I use currently. But it’s very hard to use the present post to do that!

      It starts with a very long-winded historical aside, which, while interesting, doesn’t really help me learn anything. And then there is a series of examples which just bog down in Haskell’s idiosyncrasies, and it’s very hard to extract the actual idea out of the implementation gunk. This reminds me of another recent case, when I needed to learn about Reed-Solomon coding. The Wikipedia page read like a wall of thick graffiti to me. And then I read https://mazzo.li/posts/reed-solomon.html, posted here on lobsters, and it said, basically:

      A polynomial of degree N is uniquely identified by N+1 points on its graph. Evaluate the polynomial at M > N+1 points, give each point to a separate person, and then any N+1 people can recover the original polynomial.
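
      (For the record, the recovery step behind that one-liner is Lagrange interpolation: given N+1 points (x_i, y_i), the unique polynomial of degree at most N through them is

        p(x) = \sum_{i=0}^{N} y_i \prod_{j \neq i} \frac{x - x_j}{x_i - x_j}

      which you can evaluate anywhere once you have any N+1 of the shares.)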

      I see this being repeated over and over — there’s a vast chasm between the essential simplicity of an idea, which can often fit in a tweet, and the fluff and rigor which wrap the idea inside an academically-adjacent piece of writing.

      I’d love to read a Tweet-sized explanation of what stateful property-based testing is, and how it is better than doing the same thing “by hand”.

      Coming back to the first hypothesis: I do, in fact, have a suspicion that the academic state of the art of property-based testing isn’t all that useful in practice. I make two claims:

      • You get 70% of the benefits of property-based testing if you pass in a PRNG to your test function.
      • You get 85% of the benefits, if the PRNG is finite (at some point, instead of giving you heads or tails, it says error_out_of_entropy).

      As a specific example, here’s how you can write a property test with minimization for the ring buffer (repo):

      #[test]
      fn test_queue() {
        arbtest::arbtest(|rng| {
          let len_max = rng.int_in_range(0..=10)?;
          let mut queue = Queue::new(len_max);
          let mut queue_model: VecDeque<i32> = VecDeque::new();
      
          while !rng.is_empty() {
            match *rng.choose(&["add", "remove"])? {
              "add" if queue_model.len() < len_max => {
                let x: i32 = rng.arbitrary()?;
                queue_model.push_back(x);
                queue.push(x);
              }
              "remove" if queue_model.len() > 0 => {
                let x_model = queue_model.pop_front().unwrap();
                let x = queue.pop();
                assert_eq!(x_model, x);
              }
              "add" | "remove" => (),
              _ => unreachable!(),
            }
            assert_eq!(queue.len(), queue_model.len())
          }
          Ok(())
        });
      }
      

      Minimization there is very simplistic, but it works well enough for this case. And you don’t need a tree of types & monads to use the thing; the rng is the familiar API of a random number generator.

      In this sense, property-based testing reminds me of the cottage industry of parser-combinator libraries: the idea lends itself very well to functional programming & static typing, but, in practice, production code tends to use a hand-written parser, because you can express all the same ideas in stupid first-order programming just fine. Same here — just pass in the rng!

      1. 5

        I also found the article somewhat hard to read, but - having played quite a bit with proptesting - did get two worthwhile things out of it.

        The check-state-vs-model pattern seems worth making explicit, even if it isn’t particularly new (and may be “obvious”). I agree that the difference between the value delivered by just knowing the pattern and writing it out (as you did) and the value delivered by actual library support isn’t huge. (And e.g. Hypothesis’ stateful testing seems to be making some trade-offs that do not suit my use case.)

        The attempt to model parallelism explicitly is… probably not the final form of that idea, but something that I’d be idly interested in toying around with:

        • Generating parallel testcases seems easy enough to be worth trying in some cases, even if without a deterministic scheduler you’re just hoping to hit a race and can’t really shrink (the “wait 100 ms” is a gross hack, although probably sometimes effective enough?)
        • Checking against possible linearizations using the non-parallel model is admirably low-effort, so may deliver good value-per-effort; of course, something like Jepsen’s Elle is a lot more sophisticated, both in what constraints it can handle and in scaling to more concurrency than just three interleaved threads, but is also quite a bit harder to use.

        (The author’s TODO is also quite full of additional ideas.)

        1. 3

          Checking against possible linearizations using the non-parallel model is admirably low-effort,

          Yeah, this bit was novel to me as well, but, if I reflect on how I’ve dealt with this historically, the answer is that there’s some internal linearization point in the system, so, rather than combinatorially searching through all interleavings, you can ask the system which interleaving it thinks is the correct linearization, and then check that.

          Specifically, in TigerBeetle tests, when the test harness submits transfers concurrently, it doesn’t know in which order they’ll execute. But the response from the system assigns logical timestamps to the transfers, so the test doesn’t need combinatorics, it just learns the correct permutation from the system itself.

          1. 1

            That’s very interesting. And I’m pretty sure it’s indeed the correct solution in cases where you’re able to introspect a system that does linearize events.

            However, at that point you’ve left the area I’d be willing to call “admirably low-effort” (but you’re delivering so much value that it’s clearly worthwhile!) There really may be room in the design space for something you’d just throw on some random code that has a lock or two and surely doesn’t have any interesting concurrency “phenomena” ;-).

        2. 1

          Checking against possible linearizations using the non-parallel model is admirably low-effort

          This is also what Jepsen’s Knossos checker does btw.

      2. 3

        I’d love to read a Tweet-sized explanation of what stateful property-based testing is, and how it is better than doing the same thing “by hand”.

        To see what the 400 LOC library from the post does that your “by hand” code doesn’t try:

        1. Extending your example with a "new" command that creates a new queue. Or imagine you are testing an API which returns a UUID/file handle or anything that you can’t hardcode into your tests like you did with let mut queue = Queue::new(len_max);. Hint: you’ll find that the functionality you need is some kind of "symbolic reference" (an integer) which is a placeholder for the queue/UUID/file handle, etc., and which will only become available as you execute against the real system;
        2. Extending your example to do concurrent execution of commands against the queue. Hint: you can get this for free from the sequential model via linearisability.

        Those are the two things a library should do (but most libraries don’t). If you don’t need that functionality, then by all means do it “by hand”. It seems to me that most users don’t know that this functionality is even possible; how could they, if their libraries don’t provide it?
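
        To make point 1 concrete, here’s a hand-wavy Rust sketch of what the library maintains for you (hypothetical types; Queue is the ring buffer from your example):

          // A symbolic reference: "the i-th queue created so far". Generated
          // commands can mention queues that will only exist at execution time.
          #[derive(Clone, Copy)]
          struct Var(usize);

          enum Command {
              New { len_max: usize }, // executing this binds a fresh Var
              Push(Var, i32),
              Pop(Var),
          }

          // Executing a generated command list resolves each Var against the
          // queues actually created so far.
          fn execute(commands: &[Command]) {
              let mut queues: Vec<Queue> = Vec::new();
              for command in commands {
                  match *command {
                      Command::New { len_max } => queues.push(Queue::new(len_max)),
                      Command::Push(Var(i), x) => queues[i].push(x),
                      Command::Pop(Var(i)) => {
                          let _ = queues[i].pop();
                      }
                  }
              }
          }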

        1. 4

          Could you formulate these two examples as specific litmus tests, which could serve as a benchmark for property testing approaches?

          For 1., I am pretty sure that the “just pass an rng” approach actually handles that more simply? Inside the for loop, you have access to the entire real system, so you can just query it to figure out what operations make sense. I guess a good model here would be testing a banking system, where you want to query against existing accounts? Something like this:

          let mut bank = Bank::new();
          let mut accounts: Vec<AccountId> = Vec::new();
          while !rng.is_empty() {
            match *rng.choose(&["create_account", "transfer"])? {
              "create_account" => {
                let account = bank.create_account();
                accounts.push(account);
              }
              "transfer" if accounts.len() > 0 => {
                let dr = rng.choose(&accounts)?;
                let cr = rng.choose(&accounts)?;
                if (cr != dr) {
                  let amount: u123 = rng.arbitrary()?;
                  bank.tranfer(cr, dr, amount)
                }
              }
              ...
            }
          }
          

          For 2, I think I really need to see a specific example. For TigerBeetle, we pretty much use the above approach (sans automatic minimization) to test strict serializability. Concurrency is just state machines, after all. Two state machines running concurrently are just one product state machine running sequentially.

          What might be tricky is testing parallelism — that all your atomic operations conform to the memory model and such. There, you need a tricky runtime library to turn your “normal” code using atomics into an explicit state machine, but that seems more specific to a particular language runtime than to the general idea of property-based testing.

          1. 4

            For 1., I am pretty sure that the “just pass an rng” approach actually handles that more simply? Inside the for loop, you have access to the entire real system, so you can just query it to figure out what operations make sense.

            That’s pretty nice. I still don’t understand how shrinking/minimisation of counterexamples works in your approach, however. If a bug is found using 10 accounts, but only 3 are needed to trigger the bug, how is it shrunk to the minimal counterexample? Also if you remove an account you can potentially invalidate later transfers involving said account, how is that dealt with?

            For 2, I think I really need to see a specific example.

            The simplest example from the post is a counter. If you implement it by first reading the value and then writing the value + 1, then there’s an obvious race condition that the library can find using an integer as a model.

            1. 2

              I wrote a bit about the internal implementation at https://tigerbeetle.com/blog/2023-03-28-random-fuzzy-thoughts, specifically the Finite PRNG section (sorry that the section anchors don’t work yet)

              The TL;DR is that the rng underneath is a finite sequence of bytes. So, when you ask it for a random number, it chops four bytes off the start of the sequence. If the sequence is empty, the error is reported to the caller, and it needs to short-circuit the current test.
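
              In code, the finite rng is roughly this (a sketch of the idea, not arbtest’s actual internals):

                struct OutOfEntropy;

                // A finite "rng": a slice of bytes consumed by each request.
                struct FiniteRng<'a> {
                    bytes: &'a [u8],
                }

                impl FiniteRng<'_> {
                    fn u32(&mut self) -> Result<u32, OutOfEntropy> {
                        if self.bytes.len() < 4 {
                            return Err(OutOfEntropy); // the test short-circuits here
                        }
                        let (head, rest) = self.bytes.split_at(4);
                        self.bytes = rest;
                        Ok(u32::from_le_bytes(head.try_into().unwrap()))
                    }

                    fn is_empty(&self) -> bool {
                        self.bytes.is_empty()
                    }
                }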

              This gives you a universal minimization strategy — just try to shorten the prng sequence. For example, fix the prefix and try to find a shorter but still failing suffix. Or keep the length the same and make the individual values smaller.

              But that’s if you are smart. The arbtest is dumber than that. When minimizing, it doesn’t look at the failing sequence at all. It just shortens its length (it tries halving first; when that fails, it goes to constant decrements), and then generates a new shorter sequence completely at random.

              Counter-intuitively, this naive strategy works well for many real-life situations! The thing is, most examples are small, and there are few small examples! So picking a random small example works!

              The smarter thing would be to plug something like AFL into the input prng, so that you get coverage-guided minimization.

              1. 2

                Counter-intuitively, this naive strategy works well for many real-life situations!

                Does it work for the two scenarios I asked for though?

                If a bug is found using 10 accounts, but only 3 are needed to trigger the bug, how is it shrunk to the minimal counterexample? Also if you remove an account you can potentially invalidate later transfers involving said account, how is that dealt with?

                1. 2

                  Does it work for the two scenarios I asked for though?

                  Yup! I implemented the bank example, where:

                  • there are operations for creating and deleting accounts
                  • you can transfer between existing accounts (transferring against a non-existing account is a programming error and a bug in the test, so it needs to be careful not to do that)
                  • the invariant checked is that the total balance remains unchanged.
                  • the bug is that you can delete an account with non-zero balance, violating the invariant.

                  https://github.com/matklad/arbring/commit/09098849eb5bb1524bb866c248262a867db401ad

                  Here’s one minimization trace I got from that example:

                  accounts created 4, transfers created 4 seed 0xff9d5f7f00000020, seed size 32, search time 103.00ns
                  accounts created 1, transfers created 1 seed 0x61955ad200000010, seed size 16, search time 351.87µs
                  accounts created 1, transfers created 1 seed 0xd5bf1b3a00000008, seed size 8, search time 1.08ms
                  accounts created 1, transfers created 1 seed 0x62b3860600000007, seed size 7, search time 21.60ms
                  accounts created 1, transfers created 1 seed 0x68a8d5af00000006, seed size 6, search time 21.70ms
                  accounts created 1, transfers created 1 seed 0x43e1e68400000005, seed size 5, search time 21.76ms
                  minimized
                  seed 0x43e1e68400000005, seed size 5, search time 100.00ms
                  

                  So, it started with an example generated using 32 random bytes, which were used to generate 4 accounts and 4 transfers; then it halved the amount of entropy a couple of times down to 8 bytes, and then brought that down to just 5 random input bytes, which were enough to create one account, one transfer, and, I assume, one deletion.

                  Also if you remove an account you can potentially invalidate later transfers involving said account, how is that dealt with?

                  That’s the niftiest thing about this approach — you don’t need to solve this problem, because it simply doesn’t arise. At any point, the test has access to both the model and the real system, so it can just look at the current state to choose a reasonable next step. I call this interactive testing, as opposed to the pattern where you separately create a plan and then execute it. The main insight is that you don’t need up-front planning to do minimization — minimization works at the source of randomness.

                  1. 2

                    I call this interactive testing, as opposed to the pattern where you separately create a plan and then execute it. The main insight is that you don’t need up-front planning to do minimization — minimization works at the source of randomness.

                    Interesting, thanks!

            2. 2

              The simplest example from the post is a counter.

              Naturally, this took a bit longer to answer. The “property testing” bits are simple here:

              #[derive(Default)]
              struct BadCounter {
                // NB: This is a spiced-up AtomicU32, not just the std::sync one!
                value: AtomicU32,
              }
              
              impl BadCounter {
                fn increment(&self) {
                  // self.value.fetch_add(1, SeqCst);
                  let value = self.value.load(SeqCst);
                  self.value.store(value + 1, SeqCst);
                }
              
                fn get(&self) -> u32 {
                  self.value.load(SeqCst)
                }
              }
              
              #[test]
              fn test_bad_counter() {
                arbtest::arbtest(|rng| {
                  let counter = BadCounter::default();
                  let mut counter_model: u32 = 0;
              
                  thread::scope(|scope| {
                    let mut t1 = controlled_thread(scope, &counter); // Creates a thread whose state is a `&counter`
                    let mut t2 = controlled_thread(scope, &counter);
              
                    while !rng.is_empty() {
                      for t in [&mut t1, &mut t2] {
                        if rng.arbitrary()? { // Pick at random which thread needs to advance
                          if t.is_blocked() { // Thread is blocked on an atomic op, unblock it
                            t.unblock()
                          } else {
                            t.act(|it| it.increment()); // Instruct the thread to do something by sending it a closure that acts on state
                            counter_model += 1;
                          }
                        }
                      }
                    }
                    drop((t1, t2));
              
                    assert_eq!(counter.get(), counter_model);
                    Ok(())
                  })
                });
              }
              

              Basically, we create two “managed” threads which we can manually interleave from within our test, and then it’s just standard state-machine testing stuff, just with explicit steps for advancing concurrent operations.

              Turning a normal AtomicU32::load into something that can be paused externally is of course pretty cursed code (source), but that isn’t property-based testing strictly speaking, rather just some generic atomics instrumentation.

              The property part here is checking with the model – a simple sequential counter. Which is, in some sense, cheating — we are not checking linearizability here, we just check the known answer, because the ops commute. But in my experience you usually can come up with a reasonable sequential model which doesn’t require reverse-engineering a particular linearization.

              1. 3

                The property part here is checking with the model – a simple sequential counter. Which is, in some sense, cheating — we are not checking linearizability here, we just check the known answer, because the ops commute.

                You also don’t seem to be generating gets? If you did, how would you deal with the situation where two increments and a read all happen concurrently? In that case the read can return three different (all correct) values. (I explain this in more detail in the introduction of the parallel testing section.)

                Also, you seem to be taking a white-box testing approach (i.e. instrumenting threads and atomics)? This wouldn’t work if you only had black-box access to the system you were testing (or if it were too expensive to rewrite/instrument)?

                But in my experience you usually can come up with a reasonable sequential model which doesn’t require reverse-engineering a particular linearization.

                Can you elaborate on this, I’m not sure I understand what you mean?

                1. 2

                  You also don’t seem to be generating gets? If you did, how would you deal with the situation when two increments and a read all happen concurrently?

                  Depends on the model I want my counter to maintain! For example, if the model is eventual consistency, then more than two answers are possible.

                  For linearizability, the model would have to maintain a per-thread state and check that:

                  • reads of each thread are monotonic
                  • reads of each thread count at least this thread’s increment.
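
                  As a sketch in code (the names here are invented, not from the post), such a per-thread model could be:

                  // Per-thread view of the counter; illustrative only.
                  #[derive(Default)]
                  struct ThreadView {
                    last_read: u32,  // most recent value this thread observed
                    increments: u32, // increments this thread has completed
                  }

                  impl ThreadView {
                    fn on_increment(&mut self) {
                      self.increments += 1;
                    }

                    fn on_read(&mut self, value: u32) {
                      // reads of each thread are monotonic
                      assert!(value >= self.last_read, "read went backwards");
                      // reads count at least this thread’s own increments
                      assert!(value >= self.increments, "own increments lost");
                      self.last_read = value;
                    }
                  }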

                  But in my experience you usually can come up with a reasonable sequential model which doesn’t require reverse-engineering a particular linearization.

                  So, one of the approaches of the post is capturing a concurrent history and then finding its linearization by checking all possible interleavings. The essential element here is a combinatorial search over all possible interleavings.

                  What I am saying is that often times you don’t have to search for the correct linearization of the history. You can just ask the system directly what it thinks is the appropriate linearization, and then verify that directly. For TigerBeetle, when the testing harness submits transactions to the cluster, it of course doesn’t know in which order they are going to be executed. But a response to a transaction contains its logical timestamp, assigned by the system itself.

                  So, for a testing harness to check that the results are consistent, it doesn’t need to look through possible permutations of transactions. It takes the permutation implied by the logical timestamps, and verifies:

                  • that the permutation is consistent with the events originally submitted
                  • that the results we actually got match the results of applying that permutation to a sequential model.
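
                  In code, that check could look roughly like this (all types are invented stand-ins; this is not TigerBeetle’s actual harness):

                  #[derive(Clone, Copy)]
                  struct Transfer { amount: i64 }

                  #[derive(Debug, PartialEq, Clone, Copy)]
                  struct TransferResult { ok: bool }

                  struct Submitted { id: u64, op: Transfer }
                  struct Response { id: u64, logical_ts: u64, result: TransferResult }

                  #[derive(Default)]
                  struct SequentialModel { balance: i64 }

                  impl SequentialModel {
                    fn apply(&mut self, t: &Transfer) -> TransferResult {
                      let ok = self.balance + t.amount >= 0;
                      if ok { self.balance += t.amount; }
                      TransferResult { ok }
                    }
                  }

                  fn check(submitted: &[Submitted], mut responses: Vec<Response>) {
                    // take the permutation the system itself claims, via logical timestamps
                    responses.sort_by_key(|r| r.logical_ts);
                    let mut model = SequentialModel::default();
                    for r in &responses {
                      // the permutation must be consistent with what was submitted...
                      let s = submitted.iter().find(|s| s.id == r.id).expect("unknown id");
                      // ...and the actual results must match a sequential replay
                      assert_eq!(model.apply(&s.op), r.result);
                    }
                  }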

                  The counter example is similar: we could test it by probing for interleavings, and that would tell us whether it is linearizable. But we just know what the answer should be, so we can test for that directly, getting both the linearizability and the “counter counts correctly” properties.

                  1. 3

                    What I am saying is that often times you don’t have to search for the correct linearization of the history. You can just ask the system directly what it thinks is the appropriate linearization, and then verify that directly.

                    Again, this seems to assume a white-box approach where you have access to the system you are testing? (I asked this already, but you didn’t answer.)

                    For linearizability, the model would have to maintain a per-thread state and check that:

                    This doesn’t seem to follow mechanically from the sequential model? I.e. it sounds like you need to think about specific things to test in the concurrent case? In the post, you get the concurrent tests for free (no extra code) from the sequential model.

                    1. 2

                      Again, this seem to assume a white-box approach where you have access to the system you are testing?

                      Sorry, missed that bit with all the concurrent threads!

                      I think instrumentation is indeed required if you want to get reproducibility. Without instrumentation in some form, re-running with the same actions might pass due to a different real-time interleaving. Which, I think, negates most of the useful parts of a property testing library? At that point, just spawn 10 threads, and check the counter at the end.

                      This doesn’t seem to follow mechanically from the sequential model?

                      Precisely, that’s my point! While you can get a general property for free for an arbitrary system, in practice just looking at the system and checking the exact required behavior is not hard. And it has the benefit that you can check fine-grained properties (the counter is correct, rather than merely linearizable) and tweak the consistency model (you can weaken linearizability to eventual consistency, or strengthen it to strict serializability).

                      1. 2

                        While you can get a general property for free for an arbitrary system, in practice just looking at the system and checking the exact required behavior is not hard.

                        My point is: why don’t we package this free stuff up for people who don’t even know they can have it, nevermind be able to do the stuff you suggest by hand?

      3. 2

        Shrinking is IMO pretty load-bearing for “useful” property testing tools, and you’re not doing it in your test. It’s not exactly a hard thing to do by hand, but it is tedious to set up, and is a bit of a design problem. Having people show up with shrinking logic for you is nice.

        (Aside: I think that you can make a property testing library that is pretty minimal in most languages, and when libraries instead lean too much into things like macro magic, we kinda lose that point)

        Though I feel like I’ve already had this conversation with you in some other post on this website ;)

        1. 2

          I do shrinking though :-)

          It’s just that you don’t necessarily need to write any explicit code for shrinking; with a finite RNG it falls out naturally.

          See this comment/commit which shows an example: https://lobste.rs/s/uutqvn/sad_state_property_based_testing#c_q5pgww
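
          For readers who haven’t seen the trick: the generated value is a pure function of a finite byte string, so the minimizer only ever shrinks the bytes and re-runs the generator. A rough sketch on top of the arbitrary crate (the generator itself is made up for illustration):

          use arbitrary::{Result, Unstructured};

          // Fewer input bytes mean fewer and smaller elements, so shortening
          // or zeroing the byte string shrinks the value automatically; no
          // per-type shrinkers are needed.
          fn gen_list(u: &mut Unstructured) -> Result<Vec<u32>> {
            let mut out = Vec::new();
            while !u.is_empty() {
              out.push(u.arbitrary()?);
            }
            Ok(out)
          }

          // The fully shrunk case is the empty byte string:
          // gen_list(&mut Unstructured::new(&[])) == Ok(vec![])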

    15. 2

      This is a fascinating perspective to me because, while I know of QuickCheck and have used Haskell, I primarily use something like Hypothesis instead. And, if it’s not available in my language of choice, then Hypothesis provides a pattern which can be ported, even to dynamically-typed systems. This means that the closed-source claims don’t quite land for me; if Hypothesis were closed-source, I’d just write my own.

      I liked the worked examples in the second half. It’s always fun to see what kinds of properties we can invent.

      You may be interested in this 2016 list of QuickCheck alternatives from the author of Hypothesis, which is also the article from which that big list in Wikipedia is sourced.

      1. 2

        This means that the closed-source claims don’t quite land for me; if Hypothesis were closed-source, I’d just write my own.

        Not sure I follow, what do you mean with “closed-source claims”?

        1. 2

          Ah, sorry, poor wording. I’m talking about the framing that Hughes sets up and to which you reply. I want to do a little close reading:

          [We] have founded a start-up … to develop … Quviq QuickCheck. Despite enthusiasm [for the first implementation] … it was never adopted in industry. We tried to give away the technology, and it didn’t work! … [W]e are putting a great deal of work into helping customers adopt the technology. It was naïve to expect that simply putting source code on the web would suffice to make that happen, and it would also be unreasonable to expect funding agencies to pay for all the work involved.

          I see a disconnection here. Mostly, there seems to be an expectation that corporations will readily adopt good engineering practices, combined with an expectation that corporations would contribute money or other resources to the open-source ecosystem which they are exploiting. Admittedly, the commons was not as rich twenty-five years ago and we had not earned the cynicism of the dot-com bust.

    16. 2

      This is an interesting writeup, but the title feels like taking two concepts and declaring them to be the most important things you can have in a property testing library, to the exclusion of all else, and I don’t think that’s obviously true.

      On another point, perhaps you should add jqwik as worth considering for Java? It’s more actively developed than QuickTheories. I plan to consider porting my uses of QuickTheories to it, but that’s something I keep putting off.

      1. 2

        feels like taking two concepts and declaring them to be the most important things you can have in a property testing library, to the exclusion of all else, and I don’t think that’s obviously true.

        Well they do seem to be the foundation upon which the spin-off company was built. What other features would you say are as important?

        On another point, perhaps you should add jqwik as worth considering for Java?

        Added, thanks!

        1. 7

          What other features would you say are as important?

          Integrated shrinking seems extremely important. Beyond that, there’s a ton that goes into providing good generators, properly shrinking values like floats, etc. The ergonomics of the library are also extremely important.

        2. 5

          Re: features: I’ve found Hypothesis’ level of polish to be quite valuable. That seems to be a mix of real theory (Hypothesis’ what-I-assume-is-mostly swarm testing is surprisingly good at generating examples that pass even fairly stringent preconditions) and “just” an extensive built-in library of nasty generators and well-considered shrinkers for many common Python types.

    17. 2

      I hadn’t heard of “stateful” proptesting. Can someone tell me if I’m wrong on this? The way I’ve solved this is by just taking in a Vec<Action> and applying the actions, where Action might be:

      enum Action { Push(u32), Pop } // u32 standing in for a generated value
      

      This has satisfied my needs historically. Where I’m imagining that this is insufficient is something where random sequences of Action aren’t valid? Like, I need some sort of rule that says “if the last generated value was even, the next one must be odd”? Historically, if I’ve had these sorts of constraints, I post-process the values, but I can see how it may be desirable and more efficient for the framework to handle that, since the post-processing means that the generated inputs aren’t what you actually test against.

      edit: I misunderstood parallel testing - it’s for detecting races, not for performance.

      My guess as to why these things aren’t adopted is simply because they aren’t that important. I love property tests and I use them plenty. When I’ve run into more complex property tests I haven’t found the frameworks to be overly limiting to me in the majority of cases. Given that in 90% of cases I’m totally happy, in 9% of cases I’m able to work around things, and perhaps in 1% of cases I just don’t write a proptest because it seems too painful, I don’t really think I would be motivated to do things differently and the work to meet those metrics is far less than the work to eat the last 10%.

      1. 2

        Where I’m imagining that this is insufficient is something where random sequences of Action aren’t valid?

        Yes, you essentially want something like preconditions to solve this properly. Another thing that’s missing in your simple setup is “symbolic references”. Imagine you wanted to have an action which creates a new stack, then you’d want to generate something like [New(size), Push(Var(0), generated), Pop(Var(0))] where the Vars don’t exist until you start executing the real commands. A nice benefit from including New is that shrinking will find the smallest stack in which the property doesn’t hold (I do this in the queue example in the post).
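
        A rough sketch of what that looks like in code (invented types, not from any particular library):

        // A symbolic reference: an index into stacks that will only exist
        // once the plan is executed.
        #[derive(Debug, Clone, Copy)]
        struct Var(usize);

        #[derive(Debug)]
        enum Action {
          New { size: usize }, // creates a stack, bound to the next Var
          Push(Var, u32),
          Pop(Var),
        }

        // At generation time, a precondition ensures that Push/Pop only
        // mention Var(i) with i < the number of News generated so far.
        fn execute(plan: &[Action]) {
          let mut stacks: Vec<Vec<u32>> = Vec::new(); // Var(i) -> stacks[i]
          for action in plan {
            match action {
              Action::New { size } => stacks.push(Vec::with_capacity(*size)),
              Action::Push(Var(i), x) => stacks[*i].push(*x),
              Action::Pop(Var(i)) => { stacks[*i].pop(); }
            }
          }
        }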

        Parallel processing certainly seems nice but I usually expect those to be a matter of the testing harness. Rust tests already run in parallel

        This is a different kind of “parallel” than what I’m talking about, what I mean is that you use several threads to concurrently execute Actions against the software under test.

        1. 1

          Yes, [..] Makes sense.

          This is a different kind of “parallel”

          Ah I completely misinterpreted the parallel bit. I see now that you were referring to detecting errors where the issue comes up under parallel execution, not an optimization. Lazy reading on my part.

      2. 1

        One thing I like with hypothesis is that if you lean into the stateful testing tooling, then it can do things like spit out a unit test for a failing property test.

        For example, from their docs:

        AssertionError: assert set() == {b''}
        
        ------------ Hypothesis ------------
        
        state = DatabaseComparison()
        var1 = state.add_key(k=b'')
        var2 = state.add_value(v=var1)
        state.save(k=var1, v=var2)
        state.delete(k=var1, v=var2)
        state.values_agree(k=var1)
        state.teardown()
        

        You have this whole script that gets spat out; you can copy-paste it into your shell or the company Slack, etc. It’s not really dark magic, but it saves on the tedium.

    18. 3

      Just imagine how intricate optimization algorithms could get with knowledge of effects like this!

      One effect that people don’t talk about often is potential non-termination. I’d imagine knowing that a function always terminates could be useful for optimising.

      Speaking of going beyond type systems, why stop at effects? You could have the type system statically check that your function runs in polynomial time (see Linear Types and Non-Size-Increasing Polynomial Time Computation by Martin Hofmann (1999)).

      1. 3

        Maybe also the optimizer could recognize that some effects have some of the guarantees of pure code:

        • commutative effects can be reordered (like incrementing a counter, or generating unique opaque ids)
        • idempotent effects can be deduped (like forcing a thunk, or throwing an error)
    19. 4

      Have you seen Labelled BNF grammar? Here’s your example in it:

      EAdd.      Exp  ::= Exp "+" Exp1 ;
      EMul.      Exp1 ::= Exp1 "*" Exp2 ;
      EInt.      Exp2 ::= Integer ;
      coercions  Exp 2 ;
      

      There’s a tool called bnfc which can generate an AST, lexer, parser, pretty printer, and docs from the above in many programming languages.

      1. 1

        Yes, cool project. Unfortunately, it doesn’t guarantee that the parser and the pretty printer match, i.e. it doesn’t give a round-trip guarantee. Here is the bug I reported about this: https://github.com/BNFC/bnfc/issues/360 .

        I wrote a Haskell library which does something similar to bnfc (but restricted to Haskell), and which does give a round-trip guarantee! https://mail.haskell.org/pipermail/haskell-cafe/2021-July/134217.html

    20. 3

      I’m a bit of a dabbler in program verification, and know just enough to be confused about the central idea of this post:

      In some sense, the Rust type checker is performing a separation logic proof for us, each time we compile our program.

      If these steps are separable in this way (the Rust compiler does half, and the verifier does half), why don’t verifiers work that way? Is it simply because the Rust compiler can reject programs it can’t reason about cleanly?

      I’m really attracted to the idea that Rust is easier to verify, both because I like Rust for other reasons, and because I like verification. But, in my ignorance, I’m missing something about this post (and Graydon’s) that explains why the two steps can’t be ahem separated when verifying a language like Java (that is otherwise memory safe but allows aliasing of mutable references).

      Anybody know the answer?

      1. 7

        (ignorant noob when it comes to verification):

        My understanding is that Rust brings two things to the table relative to, eg Java:

        • programs already come with modular aliasing annotations. You can imagine augmenting Java with some kind of “this is a unique reference” annotations for proofs specifically, but they’ll naturally be much less ergonomic than what Rust has.
        • more importantly, the programs written in Rust already have nice aliasing properties.

        The last one is important and is the real reason why C++ can’t incrementally evolve into Rust.

        You can add lifetime annotations with Rust’s semantics to the C++ language! What you can’t do, though, is add those lifetime annotations to the actual C++ code you already have, because that code plainly doesn’t conform to the Rust aliasing model. It doesn’t observe shared^mutable.

        So, Rust has annotations that enable modular reasoning, and it forces you to change your source code logic in a way that is more amenable to modular reasoning. The first part you can have in any language. The second part plainly doesn’t work for existing bodies of code.
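
        For the record, here is shared^mutable in one function (my example, not from the parent comment): a value may have many readers or one writer at a time, never both.

        fn example() {
          let mut v = vec![1, 2, 3];

          let first = &v[0]; // shared borrow: readers only
          println!("{first}");

          v.push(4); // fine: the shared borrow ended on the previous line.
                     // Swapping this line with the println! above is a
                     // compile error; typical existing C++ code is full of
                     // exactly that pattern, which is why the annotations
                     // can’t simply be retrofitted.
        }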

      2. 2

        If these steps are separable in this way (the Rust compiler does half, and the verifier does half), why don’t verifiers work that way?

        Footnote 5 says they do:

        The first tool to note this was Prusti (2019), which verifies Rust programs in a separation logic verifier but uses typing information to automate the separation logic parts of the proof!

      3. 1

        Because if they are separated, there can be legal compiled programs that break the rules, so the rules that simplify reasoning… are not rules!