1. 6
    1. 6

      What are we even trying to do with testing? The end goal is to show correctness.

      I’m not sure I’d put it this way. For me, the more important goals are: 1) to spend more time with the code, deepening my understanding of it, 2) to force the code to be more modular and less coupled, and 3) to prevent someone else from accidentally breaking my code later.

      1. 7

        +1. The end goal of tests depends on the context in which they’re written, but most of the time the purpose is to reduce risk, not to show correctness. Correctness is defined by business requirements, which change arbitrarily and can’t really be modeled effectively as specifications.

      2. 1

        #1 and #2 are definitely good goals. That’s why I still recommend writing individual test scenarios.

        What I struggle with is, how do you get #3 without striving for correctness? What stops someone from breaking my code with inputs that I haven’t thought about testing for ahead of time?

        1. 2

          For #3 I was referring more to other developers refactoring or repurposing my code at a later point in time without fully understanding what it is meant to do, and thereby breaking my functionality.

          I agree that correctness is an important purpose for a test, especially if the parameter space of what you’re testing is large. Just laying out the tests that define its behavior is an important exercise for me, which is largely what I meant by #1.

          But for me, the idea of tests that can be automatically generated doesn’t fit the picture of code you’d need to have in your code base. I picture this more as a system, such as a fuzz-testing framework, that automatically checks your interfaces against a range of parameters. I agree that this is a very strong way of catching bugs, by the way.

    2. 5

      I think I’m inclined to agree, but I find it unclear whether, by “generated tests”, you mean property-based tests, a specific kind of property-based test, or something adjacent to property-based testing. The terms “generated tests” and “property-based testing” are both used, but the intended relation between them is never stated.

      1. 2

        I use the term “generated tests” because property-based testing is very closely associated with random data generation, and there are lots of other interesting ways to generate data.

        But, the idea is always the same: generate lots of input data, and use that to check a property.

        Also, if it helps, you can just assume that everything in this post is about property-based testing.
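
        As a minimal, made-up illustration of that shape in Python with the Hypothesis library (the property here is just “sorting is idempotent”):

        from hypothesis import given, strategies as st

        @given(xs=st.lists(st.integers()))
        def test_sorting_is_idempotent(xs):
            # Hypothesis generates lots of input lists; the property we check is
            # that sorting twice gives the same result as sorting once.
            assert sorted(sorted(xs)) == sorted(xs)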

    3. 2

      The thing that I struggle with is that this approach tends to handwave away the “verify correctness” step.

      When I try to actually write tests this way, I find that my verification logic quickly gets about as complex as the application code because it has to try to figure out the expected output from an arbitrary set of inputs, which is exactly what the application code does. The end result is that I’m now worried about subtle bugs in the test code in addition to subtle bugs in the application code, and if I’ve forgotten to handle an edge case in the application code, chances are I’ll also forget it in the verification function.

      I’d love to see a concrete example of this kind of testing in an application that isn’t something low level like a parser or a network protocol stack. Say, for example, testing the pricing rules of a shopping app that calculates taxes and shipping fees based on seller and buyer locations.

      1. 3

        Generative testing is definitely not trivial. When the logic is really tricky, as in a typical enterprise application, I think the most promising approach is to define a model of the logic (see the sketch below). The example application in that post isn’t huge, but it is a real working web application, which is better than the usual network-protocol-parsing examples.

        The same approach has been used for POSIX file system testing, which is also a pretty hairy domain.

        These are all valid concerns though. That’s one of the reasons I think correctness is such an interesting domain.
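
        To make the “define a model” idea concrete, here is a rough sketch against the pricing example above. Everything in it is invented for illustration: a toy pricing function stands in for the real application code, and the model is deliberately simpler than the code under test.

        from hypothesis import given, strategies as st

        TAX_RATES = {"US": 0.08, "DE": 0.19, "JP": 0.10}   # made-up rates
        MAX_SHIPPING = 4_99                                 # cents

        def price_order(subtotal: int, country: str) -> int:
            """Stand-in for the production logic: tax plus tiered shipping, in cents."""
            shipping = 0 if subtotal >= 50_00 else MAX_SHIPPING
            return subtotal + int(subtotal * TAX_RATES[country]) + shipping

        def model_price(subtotal: int, country: str) -> float:
            """Deliberately simplified model: tax only, no shipping, no rounding rules."""
            return subtotal * (1 + TAX_RATES[country])

        @given(subtotal=st.integers(min_value=0, max_value=1_000_00),
               country=st.sampled_from(sorted(TAX_RATES)))
        def test_price_stays_close_to_model(subtotal, country):
            total = price_order(subtotal, country)
            # The test never re-derives the exact answer; it only pins the real
            # price to within one rounding step below and one shipping fee above
            # the model, so the model can stay much simpler than the real code.
            assert model_price(subtotal, country) - 1 <= total
            assert total <= model_price(subtotal, country) + MAX_SHIPPING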

    4. 2

      I’d rather have ChatGPT write test cases for code I write than have to write tests for code ChatGPT writes (which is one of my biggest fears as a programmer).

      1. 3

        Interestingly, I feel like the “ChatGPT writing code” use case is one of the best reasons to start generating more tests. If that becomes the norm in the industry, it’ll be a lot easier to generate tests for AI-created code than to keep hand-writing test cases for code that you don’t understand or have intimate knowledge of.

      2. 1

        Don’t worry, LLMs should soon be perfectly capable of writing both code and tests.

        What they lack, though, is taste, vision, innovation and curation.

    5. 1

      property-based testing

      I wish I could property-based test, or fuzz-test, any of the code that I usually work with. As far as I know, it’s not possible. For example, how would you apply these techniques to an arbitrary HTTP handler?

      1. 4

        I think the key takeaway from the discussion so far is that this is possible but not easy: when you have complex situations to test, generative testing takes real skill. It’s a very different skill from writing tests by hand, and it has a much steeper learning curve. That’s what makes it hard to get into generative testing.

      2. 2

        By “handler”, do you mean a library that implements HTTP? I’m no expert, but two approaches that come to mind for me are:

        1. Throw random input at it and test that it doesn’t crash (with sanitizers, if available).

        2. Give the same input to it and to one or more other HTTP implementations and test that they understood it in the same way (or that yours did better, in cases in which the other implementation is known to be wrong).
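
        As a rough sketch of the first approach in Python (using Hypothesis to drive it, with the standard library’s header parser standing in for “your HTTP code”):

        from email.parser import BytesHeaderParser  # stand-in for the HTTP code under test
        from hypothesis import given, strategies as st

        @given(raw=st.binary(max_size=4096))
        def test_parser_survives_arbitrary_bytes(raw):
            # Arbitrary bytes go in; the only acceptable outcomes are a parsed
            # result or a controlled error, never an unexpected crash.
            try:
                BytesHeaderParser().parsebytes(raw)
            except ValueError:
                pass  # rejecting malformed input is fine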

        1. 1

          I think my use case reduces to an API that accepts an arbitrary type and returns an arbitrary type, both of which are too complex to exercise exhaustively. Automated testing would need to provide me a way to produce valid values of the input type somehow; throwing random input at it wouldn’t be useful, even with sanitizers.

          1. 4

            Automated testing would need to provide me a way to produce valid values of the input type, somehow

            Yeah, that’s the tricky bit. Ultimately, for non-trivial cases, this all boils down to “you need to come up with an ingenious way to generate random valid-ish inputs”. It is possible to do significantly better here than just throwing random bytes at the system, but this generally requires writing a non-trivial amount of code.

            https://fitzgeraldnick.com/2020/08/24/writing-a-test-case-generator.html is a good example here.

            For another real-world example, here’s what we use at work:

            https://github.com/tigerbeetledb/tigerbeetle/blob/main/src/state_machine/workload.zig

            Even for a relatively simple domain model, that’s almost 1k lines.
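
            For a flavor of what that hand-written generator code looks like, here is a toy sketch in Hypothesis (the ledger-ish command domain is entirely made up, and this is Python rather than the Zig workload above): it first generates a pool of account ids and then only generates transfers between accounts that actually exist, so the inputs are “valid-ish” by construction.

            from hypothesis import strategies as st

            @st.composite
            def command_batches(draw):
                # Create some accounts first...
                account_ids = draw(st.lists(st.integers(min_value=1, max_value=2**32),
                                            min_size=2, max_size=10, unique=True))
                creates = [("create_account", aid) for aid in account_ids]
                # ...then only ever transfer between accounts that exist.
                transfers = draw(st.lists(
                    st.tuples(st.just("transfer"),
                              st.sampled_from(account_ids),                 # debit side
                              st.sampled_from(account_ids),                 # credit side
                              st.integers(min_value=1, max_value=10_000)),  # amount
                    max_size=20))
                return creates + transfers

            A test would then take the generated batch via @given(batch=command_batches()) and replay it against both the system under test and a simple model.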

            1. 1

              That test case generator post is a goldmine. Thanks for sharing. CSmith is such a powerful tool, and it’s great to see a deep dive into how similar tools work.

          2. 2

            Here’s a post that talks about generating input data.

            It can be complex, for sure, but every property-based testing / fuzzing library has extensive API support for creating complex data.

            You can also create the data yourself non-randomly; one example is the category-partition method.
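
            As a rough sketch of category partitioning in Python with pytest (the handler and the categories are made up for illustration): pick a few representative categories per input dimension and test every combination.

            import itertools
            import pytest

            def handle(method: str, content_type, body: bytes) -> int:
                """Toy stand-in for a real handler; returns an HTTP-ish status code."""
                if method not in {"GET", "POST", "DELETE"}:
                    return 405
                if method == "POST" and content_type is None:
                    return 400
                if method == "GET" and body:
                    return 400
                return 200

            # One list of representative categories per input dimension.
            METHODS = ["GET", "POST", "DELETE", "BREW"]
            CONTENT_TYPES = ["application/json", "text/plain", None]
            BODIES = [b"", b"x", b"x" * 65536]

            @pytest.mark.parametrize("method,ctype,body",
                                     list(itertools.product(METHODS, CONTENT_TYPES, BODIES)))
            def test_every_partition_combination(method, ctype, body):
                # The cross product of the categories is enumerated deterministically,
                # with no random generation involved.
                status = handle(method, ctype, body)
                assert status in {200, 400, 405}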

            1. 1

              My point is that the code I generally want to fuzz doesn’t take raw input data; it takes input as typed values. So I basically need the fuzzer to generate values, not bytes.

              1. 4

                Here’s an example in Python for reference, though if you tell me what language you’re working in I could find an example in that language. Say we have an HTTP API for creating blog posts; the input request might look something like:

                from dataclasses import dataclass
                import typing as t

                @dataclass
                class CreateBlogPostRequest:
                    content: str
                    title: str
                    tags: t.List[str]
                

                We can generate values of CreateBlogPostRequest in a test like:

                from hypothesis import given, strategies as gen

                @given(blog_req=gen.from_type(CreateBlogPostRequest))
                def test_blog_post_creation(blog_req):
                    resp = api.handle(blog_req)       # api.handle: the endpoint under test
                    assert_something_about(resp)      # whatever property matters for responses
                

                This is using the Hypothesis property-based testing library for Python, but many languages have PBT libraries. In this case, the data is generated from the type definition.

                Check out property-based testing, not fuzzing. They’re similar, but fuzzing is more focused on generating random bytes whereas property-based testing is focused on creating properly typed values. What you’re describing is an everyday task with PBT.

              2. 2

                Fuzzers are moving towards “structure-aware” fuzzing, where they can produce (and, in some cases, operate directly on) values, not bytes. See for example

          3. 1

            Maybe I misunderstand what you’re saying, but it seems your case is exactly covered by fuzzing templates. Current state-of-the-art fuzzers (AFL comes to mind) allow you to describe input templates, with bits set to specific values or ranges. You are not forced to generate fully random input.

            1. 1

              If you want to fuzz an HTTP handler effectively, you can’t treat the input as an opaque sequence of bits, even if you group those bits in one way or another via templates. The fuzzer needs to generate input data with at least a minimal semantic understanding of the actual request type(s): it should randomize not opaque bits or bytes on the wire, but valid values as defined by the relevant type(s).

              1. 3

                That’s precisely what AFL lets you do. You can give it a grammar and it will generate streams of tokens that match that grammar. I believe it can also give you things that almost match the grammar, so you can check that you handle errors.

                1. 1

                  Ah, OK, so this is an approach where the fuzzer is a separate process from the code under test. My intuition is that fuzzing should, ideally, be part of the language-native testing apparatus.

                  1. 2

                    AFL has a bunch of modes for varying degrees of white-box testing. It also has an LLVM pass that, among other things, transforms branches that match on exact values into nested branches that match on bits, which helps it explore the coverage space more usefully. For something that implements an RPC interface, you can just ask the fuzzer to generate the RPC messages, with some hinting about the structure of the messages.

                  2. 1

                    No, AFL can run in many modes. The one I’ve seen most frequently is with the AFL entrypoint as a compilation target of the project: you have a parsing library, and you build a custom fuzzing binary by specifying where and how AFL should call your library.

                    The interfaces I know of are in C and C++, and I would be surprised if there aren’t bindings for other languages.

                    It honestly seems like AFL fully covers the ideal fuzzer setup you have in mind; you should check it out. It’s quite a clever piece of engineering.