1. 38
    1. 20

      (note: the following comment probably warrants a “rant” tag)

      I have two alternative hypotheses for explaining the status quo.

      The first hypothesis is that the state of the art in academia doesn’t necessarily correlate with what is actually useful in practice. I am not claiming (yet) that that’s what is happening with stateful property-based testing, but I have a good example from a different area — there’s a lot of literature in academia about dealing with errors during parsing, and all that literature was not helpful to me when I had to solve the problem. Although I have read a tonne of papers on the topic, none ultimately fed into the solution that worked.

      My second hypothesis is that academia is geared toward proving to other academics that a paper is worth a publication, but is pretty bad at actually explaining ideas in a way that allows incorporating them into software people use. This post is actually an example of this. I am pretty much the target audience — I regularly use property-based testing for my hobby stuff, my work thing is basically an appendage to one giant randomized test of strict serializability, I wrote a property-based testing library, and I formulated a specific idea which, while not a theoretical advancement, I think moved the state of the art of practice a little bit further.

      I very much want to learn about a property-based testing technique which I should be using instead of what I use currently. But it’s very hard to use the present post to do that!

      It starts with a very long-winded historical aside, which, while interesting, doesn’t really help me to learn anything. And then there is a series of examples which just bog down in Haskell’s idiosyncrasies, and it’s very hard to extract the actual idea out of the implementation gunk. This reminds me of another recent case, when I needed to learn about Reed-Solomon coding. The Wikipedia page read like a wall of thick graffiti to me. And then I read https://mazzo.li/posts/reed-solomon.html posted here on lobsters, and it said, basically:

      A polynomial of degree N is uniquely identified by N+1 points on its graph. Evaluate the polynomial at M > N+1 points, give each point to a separate person, and then any N+1 people can recover the original polynomial.
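
      To make that concrete, here is a tiny illustration of my own (not from the linked post), using Lagrange interpolation over f64 for a degree-1 polynomial:

      // Recover a degree-N polynomial from any N+1 of its points (Lagrange).
      fn eval_lagrange(points: &[(f64, f64)], x: f64) -> f64 {
        let mut sum = 0.0;
        for (i, &(xi, yi)) in points.iter().enumerate() {
          let mut term = yi;
          for (j, &(xj, _)) in points.iter().enumerate() {
            if i != j {
              term *= (x - xj) / (xi - xj);
            }
          }
          sum += term;
        }
        sum
      }

      fn main() {
        let p = |x: f64| 3.0 * x + 2.0; // degree 1: two points pin it down
        // Hand out 4 points; any 2 of them recover the polynomial.
        let shares: Vec<(f64, f64)> = (1..=4).map(|x| (x as f64, p(x as f64))).collect();
        assert_eq!(eval_lagrange(&shares[1..3], 0.0), 2.0); // p(0) recovered
      }

      (Real Reed-Solomon works over a finite field rather than floats, but the recovery idea is the same.)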

      I see this being repeated over and over — there’s a vast chasm between the essential simplicity of an idea, which can often fit in a Tweet, and the fluff and rigor which wrap the idea inside an academically-adjacent piece of writing.

      I’d love to read a Tweet-sized explanation of what stateful property-based testing is, and how it is better than doing the same thing “by hand”.

      Coming back to the first hypothesis: I do, in fact, have a suspicion that academic state of the art of property based testing isn’t all that useful in practice. I make two claims:

      • You get 70% of the benefits of property-based testing if you pass in a PRNG to your test function.
      • You get 85% of the benefits, if the PRNG is finite (at some point, instead of giving you heads or tails, it says error_out_of_entropy).

      As a specific example, here’s how you can write a property test with minimization for the ring buffer (repo):

      use std::collections::VecDeque; // the model is a std deque

      #[test]
      fn test_queue() {
        arbtest::arbtest(|rng| {
          let len_max = rng.int_in_range(0..=10)?;
          let mut queue = Queue::new(len_max);
          let mut queue_model: VecDeque<i32> = VecDeque::new();
      
          while !rng.is_empty() {
            match *rng.choose(&["add", "remove"])? {
              "add" if queue_model.len() < len_max => {
                let x: i32 = rng.arbitrary()?;
                queue_model.push_back(x);
                queue.push(x);
              }
              "remove" if queue_model.len() > 0 => {
                let x_model = queue_model.pop_front().unwrap();
                let x = queue.pop();
                assert_eq!(x_model, x);
              }
              "add" | "remove" => (),
              _ => unreachable!(),
            }
            assert_eq!(queue.len(), queue_model.len())
          }
          Ok(())
        });
      }
      

      Minimization there is very simplistic, but it works well enough for this case. And you don’t need a tree of types & monads to use the thing — the rng has the familiar API of a random number generator.

      In this sense, property-based testing reminds me of the cottage industry of parser-combinator libraries: the idea lends itself very well to functional programming & static typing, but, in practice, production code tends to use a hand-written parser, because you can express all the same ideas in stupid first-order programming just fine. Same here — just pass in the rng!

      1. 5

        I also found the article somewhat hard to read, but - having played quite a bit with proptesting - did get two worthwhile things out of it.

        The check-state-vs-model pattern seems worth making explicit, even if it isn’t particularly new (and may be “obvious”). I agree that the value delivered by just knowing the pattern and writing it out (as you did) and the value delivered by actual library support isn’t huge. (And e.g. Hypothesis’ stateful testing seems to be making some trade-offs that do not suit my use case.)

        The attempt to model parallelism explicitly is… probably not the final form of that idea, but something that I’d be idly interested in toying around with:

        • Generating parallel testcases seems easy enough to be worth trying in some cases, even if without a deterministic scheduler you’re just hoping to hit a race and can’t really shrink (the “wait 100 ms” is a gross hack, although probably sometimes effective enough?)
        • Checking against possible linearizations using the non-parallel model is admirably low-effort, so may deliver good value-per-effort; of course, something like Jepsen’s Elle is a lot more sophisticated, both in what constraints it can handle and in scaling to more concurrency than just three interleaved threads, but is also quite a bit harder to use.

        (The author’s TODO is also quite full of additional ideas.)

        1. 3

          Checking against possible linearizations using the non-parallel model is admirably low-effort,

          Yeah, this bit was novel to me as well, but, if I reflect on how I’ve dealt with this historically, the answer is that there’s some internal linearization point in the system, so, rather than combinatorially searching through all interleavings, you can ask the system which interleaving it thinks is the correct linearization, and then check that.

          Specifically, in TigerBeetle tests, when the test harness submits transfers concurrently, it doesn’t know in which order they’ll execute. But the response from the system assigns logical timestamps to the transfers, so the test doesn’t need combinatorics, it just learns the correct permutation from the system itself.

          1. 1

            That’s very interesting. And I’m pretty sure it’s indeed the correct solution in cases where you’re able to introspect a system that does linearize events.

            However, at that point you’ve left the area I’d be willing to call “admirably low-effort” (but you’re delivering so much value that it’s clearly worthwhile!) There really may be room in the design space for something you’d just throw on some random code that has a lock or two and surely doesn’t have any interesting concurrency “phenomena” ;-).

        2. 1

          Checking against possible linearizations using the non-parallel model is admirably low-effort

          This is also what Jepsen’s Knossos checker does btw.

      2. 3

        I’d love to read a Tweet-sized explanation of what stateful property-based testing is, and how it is better than doing the same thing “by hand”.

        To see what the 400 LOC library from the post does that your “by hand” code doesn’t, try:

        1. Extending your example with a “new” command that creates a new queue. Or imagine you are testing an API which returns a uuid/file handle or anything that you can’t hard-code into your tests like you did with let mut queue = Queue::new(len_max);. Hint: you’ll find that the functionality you need is some kind of “symbolic reference” (an integer) which is a placeholder for the queue/uuid/file handle, etc., and which will only become available as you execute against the real system (see the sketch after this list);
        2. Extending your example to do concurrent execution of commands against the queue. Hint: you can get this for free from the sequential model via linearisability.

        Those are the two things a library should do (but most libraries don’t). If you don’t need that functionality, then by all means do it “by hand”. It seems to me that most users don’t know that this functionality is even possible — how could they, if their libraries don’t provide it?
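
        As a hypothetical sketch of point 1 (names are mine, not any particular library’s API): commands are generated against Var placeholders, and the placeholders only get bound to real objects while executing against the system:

        use std::collections::HashMap;

        // "Symbolic references": Var(n) stands in for a handle that won't
        // exist until the commands are executed for real.
        #[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
        struct Var(usize);

        enum Cmd {
          New { len_max: usize },          // creates a queue, bound to the next Var
          Push { queue: Var, value: i32 },
          Pop { queue: Var },
        }

        fn execute(cmds: &[Cmd]) {
          // Bindings from symbolic vars to real objects appear as we execute.
          // (A plain Vec<i32> stands in for the real queue here.)
          let mut env: HashMap<Var, Vec<i32>> = HashMap::new();
          for cmd in cmds {
            match cmd {
              Cmd::New { len_max: _ } => {
                let var = Var(env.len());
                env.insert(var, Vec::new());
              }
              Cmd::Push { queue, value } => env.get_mut(queue).unwrap().push(*value),
              Cmd::Pop { queue } => {
                env.get_mut(queue).unwrap().pop();
              }
            }
          }
        }

        The point is that the generated Vec<Cmd> is plain data that exists before any real handles do, which is what lets a library shrink the plan up front.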

        1. 4

          Could you formulate these two examples as specific litmus tests, which could serve as a benchmark for property testing approaches?

          For 1., I am pretty sure that the “just pass an rng” approach actually handles that more simply? Inside the loop, you have access to the entire real system, so you can just query it to figure out what operations make sense. I guess a good model here would be testing a banking system, where you want to query against existing accounts. Something like this:

          let mut bank = Bank::new();
          let mut accounts: Vec<AccountId> = Vec::new();
          while !rng.is_empty() {
            match *rng.choose(&["create_account", "transfer"])? {
              "create_account" => {
                let account = bank.create_account();
                accounts.push(account);
              }
              "transfer" if accounts.len() > 0 => {
                let dr = *rng.choose(&accounts)?;
                let cr = *rng.choose(&accounts)?;
                if cr != dr {
                  let amount: u32 = rng.arbitrary()?;
                  bank.transfer(cr, dr, amount);
                }
              }
              ...
            }
          }
          

          For 2, I think I really need to see a specific example. For TigerBeetle, we pretty much use the above approach (sans automatic minimization) to test strict serializability. Concurrency is just state machines, after all. Two state machines running concurrently is just one product state machine running sequentially.

          What might be tricky is testing parallelism — that all your atomic operations conform to the memory model and such. There, you need a tricky runtime library to turn your “normal” code using atomics into an explicit state machine, but that seems more specific to a particular language runtime, rather than to the general idea of property-based testing.

          1. 4

            For 1., I am pretty sure that “just pass an rng” approach actually handles that simpler? Inside the for loop, you have access to the entire real system, so you can just query it to figure out what operations makes sense.

            That’s pretty nice. I still don’t understand how shrinking/minimisation of counterexamples works in your approach however. If a bug is found using 10 accounts, but only 3 are needed to trigger the bug, how is it shrunk to the minimal counterexample? Also if you remove an account you can potentially invalidate later transfers involving said account, how is that dealt with?

            For 2, I think I really need to see a specific example.

            The simplest example from the post is a counter. If you implement it by first reading the value and then writing the value + 1, then there’s an obvious race condition that the library can find using an integer as a model.

            1. 2

              I wrote a bit about the internal implementation at https://tigerbeetle.com/blog/2023-03-28-random-fuzzy-thoughts, specifically the Finite PRNG section (sorry that the section anchors don’t work yet)

              The TLDR is that the rng underneath is a finite sequence of bytes. So, when you ask it for a random number, it chops four bytes from the start of the sequence. If the sequence is empty, the error is reported to the caller, and it needs to short-circuit the current test.

              This gives you a universal minimization strategy — just try to shorten the prng sequence. For example, fix the prefix, and try to find a shorter but still failing suffix. Or keep the length the same and make the individual values smaller.
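
              A minimal sketch of such a finite PRNG (my paraphrase of the idea, not arbtest’s actual implementation):

              // A finite PRNG: a tape of bytes that runs out.
              struct FiniteRng {
                entropy: Vec<u8>, // shorter tape => smaller generated test
              }

              struct OutOfEntropy;

              impl FiniteRng {
                fn u32(&mut self) -> Result<u32, OutOfEntropy> {
                  // Chop four bytes off the front; fail once the tape is dry,
                  // so the caller can short-circuit the current test.
                  if self.entropy.len() < 4 {
                    return Err(OutOfEntropy);
                  }
                  let rest = self.entropy.split_off(4);
                  let value = u32::from_le_bytes(self.entropy[..].try_into().unwrap());
                  self.entropy = rest;
                  Ok(value)
                }

                fn is_empty(&self) -> bool {
                  self.entropy.is_empty()
                }
              }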

              But that’s if you are smart. arbtest is dumber than that. When minimizing, it doesn’t look at the failing sequence at all. It just shortens its length (it tries halving first; when that fails, it goes to constant decrements), and then generates a new, shorter sequence completely at random.

              Counter-intuitively, this naive strategy works well for many real-life situations! The thing is, most examples are small, and there are few small examples! So picking a random small example works!
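
              In code, the size schedule is roughly this (a sketch only; the real arbtest also bounds the whole search with a time budget):

              // `fails_with(len)` runs the test with a fresh random tape of
              // `len` bytes and reports whether it still fails.
              fn shrink(mut len: usize, fails_with: impl Fn(usize) -> bool) -> usize {
                // Try halving the entropy budget first...
                while len > 1 && fails_with(len / 2) {
                  len /= 2;
                }
                // ...then fall back to constant decrements.
                while len > 0 && fails_with(len - 1) {
                  len -= 1;
                }
                len // the smallest budget that still reproduced a failure
              }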

              The smarter thing would be to plug something like AFL into the input prng, so that you get coverage-guided minimization.

              1. 2

                Counter-intuitively, this naive strategy works well for many real-life situations!

                Does it work for the two scenarios I asked for though?

                If a bug is found using 10 accounts, but only 3 are needed to trigger the bug, how is it shrunk to the minimal counterexample? Also if you remove an account you can potentially invalidate later transfers involving said account, how is that dealt with?

                1. 2

                  Does it work for the two scenarios I asked for though?

                  Yup! I implemented the bank example, where:

                  • there are operations for creating and deleting accounts
                  • you can transfer between existing accounts (transferring against a non-existing account is a programming error and a bug in the test, so it needs to be careful not to do that)
                  • the invariant checked is that the total balance remains unchanged.
                  • the bug is that you can delete an account with non-zero balance, violating the invariant.

                  https://github.com/matklad/arbring/commit/09098849eb5bb1524bb866c248262a867db401ad

                  Here’s one minimization trace I got from that example:

                  accounts created 4, transfers created 4 seed 0xff9d5f7f00000020, seed size 32, search time 103.00ns
                  accounts created 1, transfers created 1 seed 0x61955ad200000010, seed size 16, search time 351.87µs
                  accounts created 1, transfers created 1 seed 0xd5bf1b3a00000008, seed size 8, search time 1.08ms
                  accounts created 1, transfers created 1 seed 0x62b3860600000007, seed size 7, search time 21.60ms
                  accounts created 1, transfers created 1 seed 0x68a8d5af00000006, seed size 6, search time 21.70ms
                  accounts created 1, transfers created 1 seed 0x43e1e68400000005, seed size 5, search time 21.76ms
                  minimized
                  seed 0x43e1e68400000005, seed size 5, search time 100.00ms
                  

                  So, it started with an example generated using 32 random bytes, which were used to generate 4 accounts and 4 transfers, then it halved the amount of entropy a couple of times down to 8 bytes, and then brought that down to just 5 random input bytes, which were enough to create one account, one transfer, and, I assume, one deletion.

                  Also if you remove an account you can potentially invalidate later transfers involving said account, how is that dealt with?

                  That’s the niftiest thing about this approach — you don’t need to solve this problem because it simply doesn’t arise. At any point, the test has access to both the model and the real system, so it can just look at the current state to choose a reasonable next step. I call this interactive testing, as opposed to the pattern where you separately create a plan and then execute it. The main insight is that you don’t need up-front planning to do minimization — minimization works at the source of randomness.

                  1. 2

                    I call this interactive testing, as opposed to the pattern where you separately create a plan and then execute it. The main insight is that you don’t need up-front planning to do minimization — minimization works at the source of randomness.

                    Interesting, thanks!

            2. 2

              The simplest example from the post is a counter.

              Naturally, this took a bit longer to answer. The “property testing” bits are simple here:

              #[derive(Default)]
              struct BadCounter {
                // NB: This is a spiced-up AtomicU32, not just the std::sync one!
                value: AtomicU32,
              }
              
              impl BadCounter {
                fn increment(&self) {
                  // self.value.fetch_add(1, SeqCst);
                  let value = self.value.load(SeqCst);
                  self.value.store(value + 1, SeqCst);
                }
              
                fn get(&self) -> u32 {
                  self.value.load(SeqCst)
                }
              }
              
              #[test]
              fn test_bad_counter() {
                arbtest::arbtest(|rng| {
                  let counter = BadCounter::default();
                  let mut counter_model: u32 = 0;
              
                  thread::scope(|scope| {
                    let mut t1 = controlled_thread(scope, &counter); // Creates a thread whose state is a `&counter`
                    let mut t2 = controlled_thread(scope, &counter);
              
                    while !rng.is_empty() {
                      for t in [&mut t1, &mut t2] {
                        if rng.arbitrary()? { // Pick at random which thread needs to advance
                          if t.is_blocked() { // Thread is blocked on an atomic op, unblock it
                            t.unblock()
                          } else {
                            t.act(|it| it.increment()); // Instruct the thread to do something by sending it a closure that acts on state
                            counter_model += 1;
                          }
                        }
                      }
                    }
                    drop((t1, t2));
              
                    assert_eq!(counter.get(), counter_model);
                    Ok(())
                  })
                });
              }
              

              Basically, we create two “managed” threads which we can manually interleave from within our test, and then it’s just standard state-machine testing stuff, just with explicit steps for advancing concurrent operations.

              Turning a normal AtomicU32::load into something that can be paused externally is of course pretty cursed code (source), but that’s not really property-based testing, strictly speaking — rather just some generic atomics instrumentation.

              The property part here is checking with the model – a simple sequential counter. Which is, in some sense, cheating — we are not checking linearizability here, we just check the known answer, because the ops commute. But in my experience you usually can come up with a reasonable sequential model which doesn’t require reverse-engineering a particular linearization.

              1. 3

                The property part here is checking with the model – a simple sequential counter. Which is, in some sense, cheating — we are not checking linearizability here, we just check the known answer, because the ops commute.

                You also don’t seem to be generating gets? If you did, how would you deal with the situation when two increments and a read all happen concurrently? In that case the read can return three different (all correct) values. (I explain this in more detail in the introduction of the parallel testing section.)

                Also you seem to be taking a white-box testing approach (i.e. instrumenting thread and atomics)? This wouldn’t work if you didn’t have access to the system you were testing, i.e. black-box, (or it would be too expensive to rewrite/instrument)?

                But in my experience you usually can come up with a reasonable sequential model which doesn’t require reverese-engineering a particular linearization.

                Can you elaborate on this, I’m not sure I understand what you mean?

                1. 2

                  You also don’t seem to be generating gets? If you did, how would you deal with the situation when two increments and a read all happen concurrently?

                  Depends on the model I want my counter to maintain! For example, if the model is eventual consistency, then more than two answers are possible.

                  For linearizability, the model would have to maintain per-thread state and check that (see the sketch after this list):

                  • reads of each thread are monotonic
                  • reads of each thread count at least this thread’s own increments.
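
                  Something like this sketch (my illustration, not code from the post):

                  // Per-thread model state for the counter.
                  struct ThreadModel {
                    own_increments: u32, // increments this thread has completed
                    last_read: u32,      // most recent value this thread observed
                  }

                  impl ThreadModel {
                    fn on_increment(&mut self) {
                      self.own_increments += 1;
                    }

                    fn on_read(&mut self, observed: u32) {
                      // Reads must be monotonic within a thread...
                      assert!(observed >= self.last_read, "reads went backwards");
                      // ...and must account for this thread's own increments.
                      assert!(observed >= self.own_increments, "lost an increment");
                      self.last_read = observed;
                    }
                  }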

                  But in my experience you usually can come up with a reasonable sequential model which doesn’t require reverse-engineering a particular linearization.

                  So, one of the approaches of the post is capturing a concurrent history and then finding its linearization by checking all possible interleavings. The essential element here is a combinatorial search over all possible interleavings.

                  What I am saying is that often you don’t have to search for the correct linearization of the history. You can just ask the system directly what it thinks is the appropriate linearization, and then verify that directly. For TigerBeetle, when the testing harness submits transactions to the cluster, it of course doesn’t know in which order they are going to be executed. But a response to a transaction contains its logical timestamp, assigned by the system itself.

                  So, for the testing harness to check that the results are consistent, it doesn’t need to look through possible permutations of transactions. It takes the permutation implied by the logical timestamps, and verifies (see the sketch after this list) that:

                  • that permutation is consistent with the events originally submitted
                  • that the results we actually got match the results from application of the permutation to a sequential model.
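
                  In outline, the check is shaped like this (a sketch of the idea, not TigerBeetle’s actual harness):

                  // Replay the system-chosen linearization against a sequential
                  // model and compare results.
                  struct Submitted<Op, Res> {
                    op: Op,
                    logical_timestamp: u64, // assigned by the system's response
                    actual_result: Res,
                  }

                  fn check<Op: Clone, Res: PartialEq + std::fmt::Debug, M>(
                    mut history: Vec<Submitted<Op, Res>>,
                    mut model: M,
                    apply: impl Fn(&mut M, Op) -> Res,
                  ) {
                    // The system tells us the order; no combinatorial search.
                    history.sort_by_key(|s| s.logical_timestamp);
                    for s in &history {
                      let expected = apply(&mut model, s.op.clone());
                      assert_eq!(expected, s.actual_result);
                    }
                  }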

                  The counter example is similar: we could test it by probing for interleavings, and that would tell us whether it is linearizable. But we just know what the answer should be. So we can test for that directly, getting both the linearizability and “the counter counts correctly” properties.

                  1. 3

                    What I am saying is that often times you don’t have to search for the correct linearization of the history. You can just ask the system directly what it thinks is the appropriate linearization, and then verify that directly.

                    Again, this seems to assume a white-box approach where you have access to the system you are testing? (I asked this already, but you didn’t answer.)

                    For linearizability, the model would have to maintain a per-thread state and check that:

                    This doesn’t seem to follow mechanically from the sequential model? I.e. it sounds like you need to think about specific things to test in the concurrent case? In the post, you get the concurrent tests for free (no extra code) from the sequential model.

                    1. 2

                      Again, this seems to assume a white-box approach where you have access to the system you are testing?

                      Sorry, missed that bit with all the concurrent threads!

                      I think instrumentation is indeed required if you want to get reproducibility. Without instrumentation in some form, re-running with the same actions might pass due to a different real-time interleaving. Which, I think, negates most of the useful parts of a property testing library? At that point, just spawn 10 threads and check the counter at the end.

                      This doesn’t seem to follow mechanically from the sequential model?

                      Precisely, that’s my point! While you can get a general property for free for an arbitrary system, in practice just looking at the system and checking the exact required behavior is not hard. And it has the benefit that you can check fine-grained properties (the counter is correct, rather than merely linearizable) and tweak the consistency model (you can weaken linearizability to eventual consistency, or strengthen it to strict serializability).

                      1. 2

                        While you can get a general property for free for an arbitrary system, in practice just looking at the system and checking the exact required behavior is not hard.

                        My point is: why don’t we package this free stuff up for people who don’t even know they can have it, nevermind be able to do the stuff you suggest by hand?

      3. 2

        Shrinking is IMO pretty load-bearing for “useful” property testing tools, and you’re not doing it in your test. It’s not exactly a hard thing to do by hand, but it is tedious to set up, and is a bit of a design problem. Having people show up with shrinking logic for you is nice.

        (Aside: I think that you can make a property testing library that is pretty minimal in most languages, and when libraries instead lean too much into things like macro magic, we kinda lose that point)

        Though I feel like I’ve already had this conversation with you in some other post on this website ;)

        1. 2

          I do shrinking though :-)

          It’s just that you don’t necessarily need to write any explicit code for shrinking; with a finite rng it falls out naturally.

          See this comment/commit which shows an example: https://lobste.rs/s/uutqvn/sad_state_property_based_testing#c_q5pgww

    2. 3

      Just today I saw a recent talk from Python North Bay 2024 about improvements to Hypothesis:

      Hypothesis Levels Up: codegen, observability, fault localization, and a black-magic backend

      https://youtu.be/NRtAEXqo0rc?si=kqF1HD_Z14Qcrxqg

    3. 2

      I hadn’t heard of “stateful” proptesting. Can someone tell me if I’m wrong on this? The way I’ve solved this is by just taking in a Vec<Action> and applying them, where Action might be:

      enum Action { Push(i32 /* a generated value */), Pop }
      

      This has satisfied my needs historically. Where I’m imagining that this is insufficient is something where random sequences of Action aren’t valid? Like, I need some sort of rule that says “if the last generated was even the next one must be odd”? Historically if I’ve had these sorts of constraints I post-process the values but I can see how it may be desirable and more efficient for the framework to handle that, since the post-process means that the generated inputs aren’t what you actually test against.
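
      For what it’s worth, the post-processing I mean is something like this toy repair pass for the “even then odd” rule (hypothetical, of course):

      // Generate values freely, then patch the sequence so an even value is
      // always followed by an odd one. Note the tested input is no longer
      // exactly the generated input.
      fn repair(values: &mut [i32]) {
        for i in 1..values.len() {
          if values[i - 1] % 2 == 0 && values[i] % 2 == 0 {
            values[i] += 1;
          }
        }
      }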

      edit: I misunderstood parallel testing - it’s for detecting races, not for performance.

      My guess as to why these things aren’t adopted is simply because they aren’t that important. I love property tests and I use them plenty. When I’ve run into more complex property tests I haven’t found the frameworks to be overly limiting to me in the majority of cases. Given that in 90% of cases I’m totally happy, in 9% of cases I’m able to work around things, and perhaps in 1% of cases I just don’t write a proptest because it seems too painful, I don’t really think I would be motivated to do things differently and the work to meet those metrics is far less than the work to eat the last 10%.

      1. 2

        Where I’m imagining that this is insufficient is something where random sequences of Action aren’t valid?

        Yes, you essentially want something like preconditions to solve this properly. Another thing that’s missing in your simple setup is “symbolic references”. Imagine you wanted to have an action which creates a new stack, then you’d want to generate something like [New(size), Push(Var(0), generated), Pop(Var(0))] where the Vars don’t exist until you start executing the real commands. A nice benefit from including New is that shrinking will find the smallest stack in which the property doesn’t hold (I do this in the queue example in the post).

        Parallel processing certainly seems nice but I usually expect those to be a matter of the testing harness. Rust tests already run in parallel

        This is a different kind of “parallel” than what I’m talking about, what I mean is that you use several threads to concurrently execute Actions against the software under test.

        1. 1

          Yes, [..] Makes sense.

          This is a different kind of “parallel”

          Ah I completely misinterpreted the parallel bit. I see now that you were referring to detecting errors where the issue comes up under parallel execution, not an optimization. Lazy reading on my part.

      2. 1

        One thing I like with hypothesis is that if you lean into the stateful testing tooling, then it can do things like spit out a unit test for a failing property test.

        For example, from their docs:

        AssertionError: assert set() == {b''}
        
        ------------ Hypothesis ------------
        
        state = DatabaseComparison()
        var1 = state.add_key(k=b'')
        var2 = state.add_value(v=var1)
        state.save(k=var1, v=var2)
        state.delete(k=var1, v=var2)
        state.values_agree(k=var1)
        state.teardown()
        

        You have this whole script that gets spat out; you can copy-paste it into your shell or the company Slack, etc. It’s not really dark magic, but it saves on the tedium.

    4. 2

      This is an interesting writeup, but the title feels like taking two concepts and declaring them to be the most important things you can have in a property testing library, to the exclusion of all else, and I don’t think that’s obviously true.

      On another point, it’s possible you should add jqwik; it may be worth considering for Java? It’s more actively developed than QuickTheories. I plan to consider porting my uses of QuickTheories to it, but that’s something I keep putting off.

      1. 2

        feels like taking two concepts and declaring them to be the most important things you can have in a property testing library, to the exclusion of all else, and I don’t think that’s obviously true.

        Well they do seem to be the foundation upon which the spin-off company was built. What other features would you say are as important?

        On another point, it’s possible you should add jqwik may be worth considering for Java?

        Added, thanks!

        1. 7

          What other features would you say are as important?

          Integrated shrinking seems extremely important. Beyond that, there’s a ton that goes into providing good generators, properly shrinking values like floats, etc. The ergonomics of the library are also extremely important.

        2. 5

          Re: features: I’ve found Hypothesis’ level of polish to be quite valuable. That seems to be a mix of real theory (Hypothesis’ what-I-assume-is-mostly swarm testing is surprisingly good at generating examples that pass even fairly stringent preconditions) and “just” an extensive built-in library of nasty generators and well-considered shrinkers for many common Python types.

    5. 2

      This is a fascinating perspective to me because, while I know of QuickCheck and have used Haskell, I primarily use something like Hypothesis instead. And, if it’s not available in my language of choice, then Hypothesis provides a pattern which can be ported, even to dynamically-typed systems. This means that the closed-source claims don’t quite land for me; if Hypothesis were closed-source, I’d just write my own.

      I liked the worked examples in the second half. It’s always fun to see what kinds of properties we can invent.

      You may be interested in this 2016 list of QuickCheck alternatives from the author of Hypothesis, which is also the article from which that big list in Wikipedia is sourced.

      1. 2

        This means that the closed-source claims don’t quite land for me; if Hypothesis were closed-source, I’d just write my own.

        Not sure I follow, what do you mean with “closed-source claims”?

        1. 2

          Ah, sorry, poor wording. I’m talking about the framing that Hughes sets up and to which you reply. I want to do a little close reading:

          [We] have founded a start-up … to develop … Quviq QuickCheck. Despite enthusiasm [for the first implementation] … it was never adopted in industry. We tried to give away the technology, and it didn’t work! … [W]e are putting a great deal of work into helping customers adopt the technology. It was naïve to expect that simply putting source code on the web would suffice to make that happen, and it would also be unreasonable to expect funding agencies to pay for all the work involved.

          I see a disconnection here. Mostly, there seems to be an expectation that corporations will readily adopt good engineering practices, combined with an expectation that corporations would contribute money or other resources to the open-source ecosystem which they are exploiting. Admittedly, the commons was not as rich twenty-five years ago and we had not earned the cynicism of the dot-com bust.