1. 23
  1.  

    1. 3

      Regarding “fancy domain-specific language (DSL) for breadth-first search” and building upon a comment you made a month ago: I feel like we are not very far off from being able to do these breadth-first searches in a simulation using the actual implementation. At least for safety properties, I’m still not exactly sure how to do liveness and fairness in a simulator as opposed to a model-checker. It feels like one would need to use timeouts and assert that good things happen before they go off?

      Using the actual implementation in the simulator would avoid having to do refinement, but perhaps one would still want to start with a more abstract specification. Which brings me to: I never understood the formulation of refinement via temporal existential quantification. For me it’s easier to grasp the predicate transformers formulation that can be found in Back and von Wright’s Refinement calculus, Morgan’s book Programming from Specifications, and in Abrial’s Z, B and Event-B languages. In fact Event-B is the only language, that I’ve seen, in which going all the way from abstract specification to concrete implementation (that can be run) is actually doable (as opposed to talking about the possibility of being able to do it as in TLA+).

      Regarding machine-checkable proof languages: I find the Curry-Howard isomorphism between propositions-as-types and proofs-as-programs so aesthetically pleasing that I feel it must be the right forward here. Kleppmann gave a nice talk about how to verify distributed systems using Isabelle/HOL, while Isabelle doesn’t make use of the CH isomorphism, the technique can be adopted.

      What I don’t see quite yet is how to make a language in which all this is possible at the same time. Ideally one would like to make an abstract specification of some protocol, say some consensus algorithm, model check it, prove it correct, refine an implementation from the spec, and finally test the implementation in a simulator (to ensure no false assumptions were made about the environment). Currently this requires different tools and libraries, I think it would be good to package it all up in one system and come up with ways of structuring the code which makes it easier to rigorously check it in this way.

      1. 4

        I’m still not exactly sure how to do liveness and fairness in a simulator

        Wrote a bit simulation testing for liveness in https://tigerbeetle.com/blog/2023-07-06-simulation-testing-for-liveness!

        The TL;DR: yes, you use the (logical) timeout, which you could set to be very large. For a real system, at some point you probably should be able to replace it with a more tight timeout (a system makes forward progress after each pair of nodes exchanged two messages or some such), but we haven’t got to that point yet.

        With liveness, the hard part is not the timeout, but making sure that your simulation is capable of simulating hard livenes condition. For example, if your simulation can always crash & restart a replcia with a certain probability, than it doesn’t check liveness, because crash&restart might be exactly the thing that gets your system unstuck!

        So what you need to do is to use swarm testing to pick a subset of failure modes, and then make those failures permanent.

        To say it another way, the hard part about liveness is not what happens with the system, but what is allowed to happen with the simulator itself. You need to be able to simulate (infinite) behaviors where the infinite tail looks differently from the finite start.

        1. 3

          When timeouts are added, a counterexample represents a finite prefix of a behavior where the timeout has expired. This constitutes a safety violation, rather than a liveness issue. You don’t check liveness.

          1. 1

            I think what happens here is that I strengthen the property to make it a safety one? Liveness says that, assuming fairness, progress happens eventually. I strengthen that to a statement that, assuming fairness, progress happens faster than (logical) time T.

            This seems to work well enough in practice, in a sense that it does find genuine liveness bugs where the system can get stuck, and doesn’t have false positives. The latter in particular means that the algorithm does converge from any given state in a bounded amount of time, which of course might not be true for certain mathematical models, but I think should hold for any practical system?

              1. 2

                Not exactly sure what you mean by stronger here, but Reducing Liveness to Safety in First-Order Logic (2018), further muddles the water here (at least for me).

              2. 1

                I am not claiming that any liveness property can be strengthened to a safety property, but some certainly can!

                Consider a system whose state is a binary bit, and whose trajectory is a bitstring. Consider this two properties:

                • P1: some bit is 1
                • P2: the first bit is 1

                P1 is liveness, P2 is safety, according to definition in Alpern&Schneider. But P2 is stronger than P1, so, if you managed to prove P2 for your system, you’ve also got P1 for free.

                I am also making an informal claim that the system I am working on seems to work roughly similar, in a sense that interesting liveness properties can be strengthened to safety ones.

                I am actually curious just how powerful this technique is? The counter-example that first comes to mind is this:

                Let the state of the system be a pair of naturals, and possible moves be:

                • decrease the second number: (a, b) -> (a, b - n)
                • decrease the first number, and set the second number to arbitrary value: (a, b) -> (a - n, X)

                The liveness property in question is that the system eventually reaches (0, 0). Now, “just use a timeout” trick clearly doesn’t work here, because, for any value of the timeout I choose, you could decrement fst and set snd to timeout * 100500.

                But I think it is possible to set a timeout smarter? First, let’s define liveness not as reaching (0, 0), but as decreasing fst (so, just making progress). Then, let’s allow me to take a look at the current state of the system and make timeout depending on that. Given a pair (x, y), I can say that it’ll make progress in y steps, at which point I would select a new timeout.

        2. 1

          (It would be nice to have a couple of simple to understand and self-contained examples of how to test for liveness.)

        3. 1

          Neat blog post, there is a broken link to a file on github which was moved: https://github.com/tigerbeetledb/tigerbeetle/blob/main/src/simulator.zig

      2. 2

        Using the actual implementation in the simulator would avoid having to do refinement

        I don’t think this is necessarily true, as there are correctness properties that can only be expressed through refinement (this is a new idea to me actually, previously I just thought refinement was about going from a high-level design to a more detailed design instead of a way of expressing whole-system correctness properties). Imagine you want to ensure your system, at a high level, follows the standard two-phase commit state transition of Propose -> Commit -> (Success OR Fail). This can only really be expressed through refinement. A simple example but they can certainly get more complicated.

        I find the Curry-Howard isomorphism between propositions-as-types and proofs-as-programs so aesthetically pleasing that I feel it must be the right forward here.

        I think Lean is the closest language we have to this right now. There’s a very interesting textbook called Functional Programming in Lean that I worked through. It has a section called Pitfalls of Programming with Dependent Types which highlights a big issue: when you’re writing proofs (which in dependent types means justifying you have the right to call a function), it is extremely hard to maintain encapsulation. Your proofs end up looking like a reflection of internal implementations. This makes refactoring unbelievably difficult as parts of your implementation become load-bearing very very quickly. Whether this is an artifact of dependent types and will be solved by some future, as-yet-unheralded type system is uncertain.

        I do think there is something to be said for how simple TLA+ proofs are. It’s just induction, which most people have learned at some point! If your invariant holds in the initial state(s) of your program, and you show that if the invariant is true of an arbitrary state then it is true of its successor state, then you’ve proved the invariant holds for all system states. It should be noted that this style of proof also very strongly mirrors the gritty details of your specification, though.

        Ideally one would like to make an abstract specification of some protocol, say some consensus algorithm, model check it, prove it correct, refine an implementation from the spec, and finally test the implementation in a simulator (to ensure no false assumptions were made about the environment).

        I guess this is the holy grail of formal verification. I don’t know whether we even have a disparate toolchain that does this yet. Certainly there are some formal verification languages like Dafny but I don’t think they’re very good as high-level specification languages nor do they have simulator functionality.

        1. 1

          […] a way of expressing whole-system correctness properties). Imagine you want to ensure your system, at a high level, follows the standard two-phase commit state transition of Propose -> Commit -> (Success OR Fail). This can only really be expressed through refinement.

          Interesting, I’m not completely sure I follow. Do you mean to say that it’s a way to express a whole-system correctness property without involving the whole system? Because otherwise, why can’t we show that 2PC works by: 1) have all actors, involved in the protocol, log their actions, 2) express the property in terms of all logs?

          Your proofs end up looking like a reflection of internal implementations. This makes refactoring unbelievably difficult as parts of your implementation become load-bearing very very quickly. Whether this is an artifact of dependent types and will be solved by some future, as-yet-unheralded type system is uncertain.

          I’m mostly concerned about black-box behaviour of a (sub)system (i.e. what’s observable from the outside). What happens inside the box, less so. Between the boxes, i.e. the protocols, is important as well. So yes, if you change the API then you’ll have to change your blackbox proofs, but the same is true for testing.

          (Regarding inside the box: I haven’t used Lean and only skimmed the chapter that you linked to, but my expereince is that once you have massaged a definition to the point where the proofs you want to make about it are neat, then you have already done all the refactoring. It’s like an extreme version of test-driven design.)

          I don’t know whether we even have a disparate toolchain that does this yet.

          Which steps are you missing?

          1. 1

            You can certainly write predicates over history variables, but that usually makes model-checking intractable because it causes the state space to grow factorially. As an aside, Ron Pressler likes to make the point that all invariant & liveness checking is secretly a form of refinement checking, which I thought was uselessly pedantic, but is true. Refinement checking as it’s usually referred to is I guess the most powerful possible expression of this process.

      3. 1

        . In fact Event-B is the only language, that I’ve seen, in which going all the way from abstract specification to concrete implementation (that can be run) is actually doable (as opposed to talking about the possibility of being able to do it as in TLA+).

        Do you have a good repo or online example of this? I couldn’t really pull it out of the Abral Event-B book.