1. 6
    1. 6

      What are we even trying to do with testing? The end goal is to show correctness.

      I’m not sure I’d put it this way. For me, the more important goals are: 1) to spend more time with the code, deepening my understanding of it, 2) to force the code to be more modular and less coupled, and 3) to prevent someone else from accidentally breaking my code later.

      1. 7

        +1. The end goal of tests depends on the context in which they’re written, but most of the time the purpose is to reduce risk, not to show correctness. Correctness is defined by business requirements, which change arbitrarily and can’t really be modeled effectively as specifications.

      2. 1

        #1 and #2 are definitely good goals. That’s why I still recommend writing individual test scenarios.

        What I struggle with is, how do you get #3 without striving for correctness? What stops someone from breaking my code with inputs that I haven’t thought about testing for ahead of time?

        1. 2

          For #3 I was referring more to other developers refactoring or repurposing my code at a later point in time without fully understanding what it is meant to do, and thereby breaking my functionality.

          I agree that correctness is an important purpose for a test, especially if the parameter space of what you’re testing is large. Just laying out the tests that define its behavior is an important exercise for me, which is largely what I meant by #1.

          But for me, the idea of tests that can be automatically generated doesn’t fit the picture of code you’d need to have in your code base. I picture this more as a system, such as a fuzz-testing framework, that automatically checks your interfaces against a range of parameters. I agree that this is a very strong way of catching bugs, by the way.

    2. 5

      I think I’m inclined to agree, but I find it unclear whether, by “generated tests”, you mean property-based tests, a specific kind of property-based test, or something adjacent to property-based testing. The terms “generated tests” and “property-based testing” are both used, but the intended relation between them is never stated.

      1. 2

        I use the term “generated tests” because property-based testing is very closely associated with random data generation, and there are lots of other interesting ways to generate data.

        But, the idea is always the same: generate lots of input data, and use that to check a property.

        Also, if it helps, you can just assume that everything in this post is about property-based testing.
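
        As a minimal, made-up illustration of that shape in Python with the Hypothesis library (the property here is just “sorting is idempotent”):

        from hypothesis import given, strategies as st

        @given(xs=st.lists(st.integers()))
        def test_sorting_is_idempotent(xs):
            # Hypothesis generates lots of input lists; the property we check is
            # that sorting twice gives the same result as sorting once.
            assert sorted(sorted(xs)) == sorted(xs)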

    3. 2

      The thing that I struggle with is that this approach tends to handwave away the “verify correctness” step.

      When I try to actually write tests this way, I find that my verification logic quickly gets about as complex as the application code because it has to try to figure out the expected output from an arbitrary set of inputs, which is exactly what the application code does. The end result is that I’m now worried about subtle bugs in the test code in addition to subtle bugs in the application code, and if I’ve forgotten to handle an edge case in the application code, chances are I’ll also forget it in the verification function.

      I’d love to see a concrete example of this kind of testing in an application that isn’t something low level like a parser or a network protocol stack. Say, for example, testing the pricing rules of a shopping app that calculates taxes and shipping fees based on seller and buyer locations.

      1. 3

        Generative testing is definitely not trivial. When the logic is really tricky, as in a typical enterprise application, I think the most promising approach is to define a model of the logic (see the sketch below). The example application in that post isn’t huge, but it is a real working web application, which is better than the usual network-protocol-parsing examples.

        The same approach has been used for POSIX file system testing, which is also a pretty hairy domain.

        These are all valid concerns though. That’s one of the reasons I think correctness is such an interesting domain.
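
        To make the “define a model” idea concrete, here is a rough sketch against the pricing example above. Everything in it is invented for illustration: a toy pricing function stands in for the real application code, and the model is deliberately simpler than the code under test.

        from hypothesis import given, strategies as st

        TAX_RATES = {"US": 0.08, "DE": 0.19, "JP": 0.10}   # made-up rates
        MAX_SHIPPING = 4_99                                 # cents

        def price_order(subtotal: int, country: str) -> int:
            """Stand-in for the production logic: tax plus tiered shipping, in cents."""
            shipping = 0 if subtotal >= 50_00 else MAX_SHIPPING
            return subtotal + int(subtotal * TAX_RATES[country]) + shipping

        def model_price(subtotal: int, country: str) -> float:
            """Deliberately simplified model: tax only, no shipping, no rounding rules."""
            return subtotal * (1 + TAX_RATES[country])

        @given(subtotal=st.integers(min_value=0, max_value=1_000_00),
               country=st.sampled_from(sorted(TAX_RATES)))
        def test_price_stays_close_to_model(subtotal, country):
            total = price_order(subtotal, country)
            # The test never re-derives the exact answer; it only pins the real
            # price to within one rounding step below and one shipping fee above
            # the model, so the model can stay much simpler than the real code.
            assert model_price(subtotal, country) - 1 <= total
            assert total <= model_price(subtotal, country) + MAX_SHIPPING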

    4. 2

      I’d rather have ChatGPT write test cases for code I write than have to write tests for code ChatGPT writes (which is one of my biggest fears as a programmer).

      1. 3

        Interestingly, I feel like the “ChatGPT writing code” use case is one of the best reasons to start generating more tests. If that becomes the norm in the industry, it’ll be a lot easier to generate tests for AI-created code than to keep hand-writing test cases for code that you don’t understand or have intimate knowledge of.

      2. 1

        Don’t worry, LLMs should soon be perfectly capable of writing both code and tests.

        What they lack, though, is taste, vision, innovation and curation.

    5. 1

      property-based testing

      I wish I could property-based test, or fuzz-test, any of the code that I usually work with. As far as I know, it’s not possible. For example, how would you apply these techniques to an arbitrary HTTP handler?

      1. 4

        I think the key takeaway from the discussion so far is that this is possible but not easy: when you have complex situations to test, generative testing takes real skill. It’s a very different skill from writing tests by hand, and it has a much steeper learning curve. That’s what makes it hard to get into generative testing.

      2. 2

        By “handler”, do you mean a library that implements HTTP? I’m no expert, but two approaches that come to mind for me are:

        1. Throw random input at it and test that it doesn’t crash (with sanitizers, if available).

        2. Give the same input to it and to one or more other HTTP implementations and test that they understood it in the same way (or that yours did better, in cases in which the other implementation is known to be wrong).
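
        As a rough sketch of the first approach in Python (using Hypothesis to drive it, with the standard library’s header parser standing in for “your HTTP code”):

        from email.parser import BytesHeaderParser  # stand-in for the HTTP code under test
        from hypothesis import given, strategies as st

        @given(raw=st.binary(max_size=4096))
        def test_parser_survives_arbitrary_bytes(raw):
            # Arbitrary bytes go in; the only acceptable outcomes are a parsed
            # result or a controlled error, never an unexpected crash.
            try:
                BytesHeaderParser().parsebytes(raw)
            except ValueError:
                pass  # rejecting malformed input is fine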

        1. 1

          I think my use case reduces to an API that accepts an arbitrary type and returns an arbitrary type, both of which are too complex to exercise exhaustively. Automated testing would need to provide me a way to produce valid values of the input type somehow; throwing random input at it wouldn’t be useful, even with sanitizers.

          1. 4

            Automated testing would need to provide me a way to produce valid values of the input type, somehow

            Yeah, that’s the tricky bit. Ultimately, for non-trivial cases, this all boils down to “you need to come up with an ingenious way to generate random valid-ish inputs”. It is possible to do significantly better here than just throwing random bytes at the system, but this generally requires writing a non-trivial amount of code.

            https://fitzgeraldnick.com/2020/08/24/writing-a-test-case-generator.html is a good example here.

            For another real-world example, here’s what we use at work:

            https://github.com/tigerbeetledb/tigerbeetle/blob/main/src/state_machine/workload.zig

            Even for a relatively simple domain model, that’s almost 1k lines.
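
            For a flavor of what that hand-written generator code looks like, here is a toy sketch in Hypothesis (the ledger-ish command domain is entirely made up, and this is Python rather than the Zig workload above): it first generates a pool of account ids and then only generates transfers between accounts that actually exist, so the inputs are “valid-ish” by construction.

            from hypothesis import strategies as st

            @st.composite
            def command_batches(draw):
                # Create some accounts first...
                account_ids = draw(st.lists(st.integers(min_value=1, max_value=2**32),
                                            min_size=2, max_size=10, unique=True))
                creates = [("create_account", aid) for aid in account_ids]
                # ...then only ever transfer between accounts that exist.
                transfers = draw(st.lists(
                    st.tuples(st.just("transfer"),
                              st.sampled_from(account_ids),                 # debit side
                              st.sampled_from(account_ids),                 # credit side
                              st.integers(min_value=1, max_value=10_000)),  # amount
                    max_size=20))
                return creates + transfers

            A test would then take the generated batch via @given(batch=command_batches()) and replay it against both the system under test and a simple model.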

            1. 1

              That test case generator post is a goldmine. Thanks for sharing. CSmith is such a powerful tool, and it’s great to see a deep dive into how similar tools work.

          2. 2

            Here’s a post that talks about generating input data.

            It can be complex, for sure, but every property-based testing / fuzzing library has extensive API support for creating complex data.

            You can also create the data yourself non-randomly; one example is the category-partition method.
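
            As a rough sketch of category partitioning in Python with pytest (the handler and the categories are made up for illustration): pick a few representative categories per input dimension and test every combination.

            import itertools
            import pytest

            def handle(method: str, content_type, body: bytes) -> int:
                """Toy stand-in for a real handler; returns an HTTP-ish status code."""
                if method not in {"GET", "POST", "DELETE"}:
                    return 405
                if method == "POST" and content_type is None:
                    return 400
                if method == "GET" and body:
                    return 400
                return 200

            # One list of representative categories per input dimension.
            METHODS = ["GET", "POST", "DELETE", "BREW"]
            CONTENT_TYPES = ["application/json", "text/plain", None]
            BODIES = [b"", b"x", b"x" * 65536]

            @pytest.mark.parametrize("method,ctype,body",
                                     list(itertools.product(METHODS, CONTENT_TYPES, BODIES)))
            def test_every_partition_combination(method, ctype, body):
                # The cross product of the categories is enumerated deterministically,
                # with no random generation involved.
                status = handle(method, ctype, body)
                assert status in {200, 400, 405}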

            1. 1

              My point is that the code I generally want to fuzz doesn’t take raw input data; it takes input as typed values. So I basically need the fuzzer to generate values, not bytes.

              1. 4

                Here’s an example in Python for reference, though if you tell me what language you’re working in I could find an example in that language. Say we have an HTTP API for creating blog posts; the input request might look something like:

                from dataclasses import dataclass
                import typing as t

                @dataclass
                class CreateBlogPostRequest:
                    content: str
                    title: str
                    tags: t.List[str]
                

                We can generate values of CreateBlogPostRequest in a test like:

                from hypothesis import given, strategies as gen

                @given(blog_req=gen.from_type(CreateBlogPostRequest))
                def test_blog_post_creation(blog_req):
                    resp = api.handle(blog_req)       # api.handle: the endpoint under test
                    assert_something_about(resp)      # whatever property matters for responses
                

                This is using the Hypothesis property-based testing library for Python, but many languages have PBT libraries. In this case, the data is generated from the type definition.

                Check out property-based testing, not fuzzing. They’re similar, but fuzzing is more focused on generating random bytes whereas property-based testing is focused on creating properly typed values. What you’re describing is an everyday task with PBT.

              2. 2

                Fuzzers are moving towards “structure-aware” fuzzing, where they can produce (and, in some cases, operate directly on) values, not bytes. See for example

          3. 1

            Maybe I misunderstand what you’re saying, but it seems your case is exactly covered by fuzzing templates. Current state-of-the-art fuzzers (AFL comes to mind) allow you to describe input templates, with bits set to specific values or ranges. You are not forced to generate fully random input.

            1. 1

              If you want to fuzz an HTTP handler effectively, you can’t treat the input as an opaque sequence of bits, even if you group those bits in one way or another via templates. The fuzzer needs to generate input data with at least a minimal semantic understanding of the actual request type(s): it should randomize not opaque bits or bytes on the wire, but valid values as defined by the relevant type(s).

              1. 3

                That’s precisely what AFL lets you do. You can give it a grammar and it will generate streams of tokens that match that grammar. I believe it can also give you things that almost match the grammar, so you can check that you handle errors.

                1. 1

                  Ah, OK, so this is an approach where the fuzzer is a separate process from the code under test. My intuition is that fuzzing should, ideally, be part of the language-native testing apparatus.

                  1. 2

                    AFL has a bunch of modes for varying degrees of white-box testing. It also has an LLVM pass that, among other things, transforms branches that match on exact values into nested branches that match on bits, which helps it explore the coverage space more usefully. For something that implements an RPC interface, you can just ask the fuzzer to generate the RPC messages, with some hinting about the structure of the messages.

                  2. 1

                    No, AFL can run in many modes. The one I’ve seen most frequently is with the AFL entrypoint as a compilation target of the project: you have a parsing library, and you build a custom fuzzing binary by specifying where and how AFL should call your library.

                    The interfaces I know of are in C and C++, and I would be surprised if there aren’t bindings for other languages.

                    It honestly seems like AFL fully covers the ideal fuzzer setup you have in mind; you should check it out. It’s quite a clever piece of engineering.