Oh my gosh I love this so much. It’s the kind of article that makes me retroactively unhappy with my current workflow, because now I know it could be so much better.
Oh ha thanks! I have a distant hope that one day this will just be the default way that people write tests, and the workflow will be universally supported, and everyone can enjoy it. I’ve talked to some of my coworkers about this technique and the overall consensus is “it’s crazy that this isn’t more widespread.”
As it stands right now it takes a bit of work to set it up in different places, and it’s a bit different in every language, and it’s “alien” enough that I think it’s easy for someone who has never tried it to write it off as not worth the effort. But it really is!
The sad thing is that this was the default way that people worked on Lisp machines and in Smalltalk on the Alto, but the machines where it wasn’t the default way of working took over.
Really! I’m too young to have used these, but I guess I just assumed that the workflow looked like a modern SLIME thing, where you had interactive expression eval built into the editor, but you didn’t typically write the results to disk. So you didn’t have the automatic repeating of REPL sessions, and you couldn’t share them with other people as tests. Is that wrong? Was there a facility for replaying REPL sessions and seeing if they changed?
Try playing with Squeak or Pharo; they’re a pretty close approximation of the original Smalltalk model. Smalltalk was a real imperative language. Most modern languages have a mix of imperative and declarative constructs. For example, Java classes and C structs are described declaratively, whereas function / method bodies contain imperative code. In Smalltalk, your interaction is a dialog with the system. If you want to create a new class, you send a #subclass: message to an existing class, which takes a class name as the argument and returns a new class. You then send that class messages to add (or remove / modify) instance variables (fields) and methods.
Smalltalk, traditionally, didn’t really have a notion of source code. You started with a running system and modified it. This was then serialised to disk. Classes are just objects, which are stored in serialised form. Methods are typically stored as bytecode (with comments along the side) because Smalltalk bytecode can be transformed back into code (this is actually how the pretty-printer works in some implementations). Other implementations stored the ‘source code’ for a method and generated the bytecode on demand.
Lisp machines were quite similar. They stored S expressions (which may be code, may be data - Lisp is far less fussy about that distinction than most languages) and provided in- and out-of-core storage.
In Pragmatic Smalltalk (which is now probably bit-rotted beyond any conceivable use) I tried to get to a half-way step and, in particular, persistently store only model objects. We had a framework called CoreObject that was used to provide persistent storage, undo, diffing, and merging of structured data. Initially I had a mode to store modified classes as text (source code) but my longer-term goal before I took a 13-year (so far) detour into designing hardware was to represent code using CoreObject as well.
The things that killed these systems were, roughly speaking, C and UNIX. C code ran on computers that cost less than 10% of the price of a Lisp Machine. The separate compile step necessitated having a notion of source code as something distinct from the internal structure of the program. Those cheaper machines had small amounts of memory and multi-level storage, so you needed filesystem abstractions as something distinct from your programmer view of objects.
The other thing that killed Smalltalk (and which I tried to address with Pragmatic Smalltalk) was that no one writes code in a single language. Even if you write a Java (or Kotlin) app for Android, you’re linking a load of C, C++, OpenCL C, Rust, and so on libraries. These languages all have a model where data is ephemerally stored in memory in non-introspectable forms, and so just persisting the object graph for the high-level language doesn’t work. This is one of the things that I want to solve with Verona: by building interop on top of an abstraction that can represent sandboxing, we get some nice security properties (code is either type-safe or sandboxed), but we also get a way of easily saying ‘this region that encapsulates an instance of a C library went away, you must recreate any state in it’. You have to do that anyway to handle crashes in the C code, and once you’ve done it you can also serialise the Verona world and restart the C bit at any given point in execution (this also requires the no-concurrently-mutable-state guarantee that the Verona type system provides).
Anyway, I’ve been working on aspects of this for the last 20 years now. I’m probably another 10 years or so away from having all of the building blocks that I need to be able to build this kind of system. Any decade now…
I had an idea of a smalltalk without subclassing, where your FFI would be via the dynamic object load model, and your smalltalk methods would all be files with the contents in a normal filesystem. This would allow replacing any code with an executable in a different language if you needed to, and would be very friendly to all the tooling we have around files-in-a-directory.
Might have some ideas worth integrating into your thing? I personally found Clojure and Zig and went “I lack the time to create my own tools of this caliber, so I’ll just use these.” I do hope someday to see a smalltalk that’s deeply pragmatic.
I was very sceptical when I started reading this. I’ve seen a lot of people claim that they love “REPL-driven development” because they can just take their REPL results and use them for test cases… but they always end up with terrible tests.
This is a pet peeve of mine: tests should be informative. They should be documentation that cannot get out-of-sync with the code. When a test fails, it should be clear why it has failed. If a test just feeds an inscrutable blob of input data to a function then checks that the output matches another inscrutable blob of data, it tells me nothing. What is the function for? What properties of the output are actually important? If the test fails, is it because the function is actually broken or is it because the output is different from the snapshot in some irrelevant way?
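As a concrete illustration of the difference (an entirely made-up Python example, not from the article):

import json

# Made-up example of the contrast described above. The first test compares one
# opaque blob to another and says nothing about which property of the output
# matters; the second states the single property it actually cares about.
RECORD = '{"id": 7, "name": "Ada", "tags": ["x", "y"], "extra": null}'

def test_opaque_blob():
    # If this fails, is the parser broken, or did an irrelevant field change?
    assert json.loads(RECORD) == {"id": 7, "name": "Ada", "tags": ["x", "y"], "extra": None}

def test_ids_parse_as_integers():
    # The one property under test: the id field comes back as the int 7.
    assert json.loads(RECORD)["id"] == 7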
I don’t really follow the TDD religion but one thing that I do like about it is the idea that you should think about the properties you want to test for first. This helps you write tests that actually convey information to someone reading them. Snapshot testing is pretty much the exact opposite.
So I was really happy to find that a large part of this blog post was all about what makes for a good test and how this style of REPL usage helps you to write good tests. It’s so cool to see a workflow that makes use of that friction-free, interactive, REPLy style and also leads to meaningful, informative test suites. This is excellent!
To add on to this a bit, I’m really sad that there are vastly more “snapshot testing” libraries than there are “inline snapshot testing” libraries. Lots of libraries that want to put the interesting half of your tests in a separate file, which makes the tests themselves completely unreadable… just go one step further! It’s not that hard to patch the source!
Timely! Just earlier today I implemented expectation testing for Zig:
https://github.com/tigerbeetle/tigerbeetle/pull/959/files#diff-41c6547f23285b9ca771fb6a2b4cb86cb375387272beff22b6fd28657085ed0c
It really is quite simple to do yourself, you don’t need a huge tool with a flashy website to do this.
A useful implementation tip is that instead of wrapping the whole test into a macro call, it’s enough to wrap just the string literal. That way, you can create a “self-updating string” and pass that as a normal value to helper functions.
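A rough Python sketch of that “self-updating string” idea, with invented names (expect, UPDATE_EXPECT) standing in for whatever a real implementation would use - this is not the linked Zig code:

import inspect
import os

UPDATE = os.environ.get("UPDATE_EXPECT") == "1"  # invented switch for update mode

class Expect:
    """A self-updating string: it remembers where its literal was written."""
    def __init__(self, expected, filename, lineno):
        self.expected = expected
        self.filename = filename
        self.lineno = lineno  # 1-based line of the expect(...) call

    def check(self, actual):
        if actual == self.expected:
            return
        if not UPDATE:
            raise AssertionError(f"expected {self.expected!r}, got {actual!r}")
        # Update mode: rewrite just the string literal at the recorded location.
        # Simplification: assumes the literal is written the way repr() prints it.
        with open(self.filename) as f:
            lines = f.readlines()
        line = lines[self.lineno - 1]
        lines[self.lineno - 1] = line.replace(repr(self.expected), repr(actual), 1)
        with open(self.filename, "w") as f:
            f.writelines(lines)

def expect(expected):
    caller = inspect.stack()[1]  # the test that contains the literal
    return Expect(expected, caller.filename, caller.lineno)

# Because the result is an ordinary value, it can be handed to helper functions:
def check_greeting(name, snapshot):
    snapshot.check(f'hello, {name}')

def test_greeting():
    check_greeting('world', expect('hello, world'))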
For people like me who haven’t worked with Zig before, I found a Python library that might explain the implementation a bit.
https://github.com/ezyang/expecttest/tree/main
This sounds exactly like Apple’s old MPW development environment (circa 1988 through 2000.)
And MPW drew inspiration from Domain/OS’ pads (IIRC MPW was written by a lot of ex-Apollo people), which IIRC drew inspiration from Oberon, which drew inspiration from Cedar.
Oh wow! I’ve never heard of that. I wanted to read more but the info I can find on the internet seems like it’s describing something more like eshell. MPW could embed and check output in C programs?
This is genius. And trivially easy to implement in Python - 10 minutes of hacking (it could, of course, use some work) and PyCharm is updating/checking MySnapshots with every Ctrl-Shift-F10:
class MySnapshot:
    NO_ARGUMENT = object()

    def __init__(self, expected_value=NO_ARGUMENT):
        self.expected_value = expected_value

    def __eq__(self, other):
        if self.expected_value is self.NO_ARGUMENT:
            from inspect import currentframe, getframeinfo
            caller_frame = getframeinfo(currentframe().f_back)
            calling_filename = caller_frame.filename
            calling_line = caller_frame.lineno  # calling line is 1 based
            replaced_contents = []
            with open(calling_filename) as f:
                for line_no, l in enumerate(f.readlines()):
                    if line_no + 1 == calling_line:
                        l = l.replace("MySnapshot()", f"MySnapshot({repr(other)})")
                    replaced_contents.append(l)
            with open(calling_filename, 'w') as f:
                f.writelines(replaced_contents)
            return True
        else:
            return other == self.expected_value

assert 1 == MySnapshot()  # this will be replaced
Extending Emacs to insert the value of the eval’d expression is pretty nifty! I just copy/paste from my REPL to my test suite file but I am 100% on board with the REPL to test case pipeline. It takes care of my issues with TDD while still driving me to generate good test coverage. I use Geiser for my Scheme REPL needs in Emacs and it would be cool to have a geiser-eval-and-insert-last-sexp function or something.
There’s a really nice advantage to this workflow that I didn’t fit into the original post, but your comment reminded me: this technique makes it really easy to update all of your tests. If you made a change that you expect to change behavior, you can automatically re-run every “test” and accept the new output as correct – you don’t have to spend any time updating tests by hand. This makes it a lot cheaper to make changes to code that already has good test coverage – it’s like running a hundred REPLs in parallel.
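For example, with the MySnapshot class sketched above, “accept the new output everywhere” could be approximated by a small (hypothetical) helper that resets the recorded values and lets one test run fill them back in:

import re
from pathlib import Path

# Hypothetical helper: strip every recorded expectation so that re-running the
# suite records the current behaviour again. The regex is naive - it assumes no
# snapshot value contains a closing parenthesis.
SNAPSHOT_CALL = re.compile(r"MySnapshot\([^)]*\)")

def reset_snapshots(test_dir="tests"):
    for path in Path(test_dir).rglob("*.py"):
        source = path.read_text()
        path.write_text(SNAPSHOT_CALL.sub("MySnapshot()", source))

# Usage: call reset_snapshots(), run the tests once to refill the values,
# then review the resulting diff like any other code change.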
Are you sure it doesn’t? I often do C-u C-x C-e (cider-eval-last-sexp) in Clojure using CIDER, and I’m pretty sure I use the same combination for other languages too.
Actually I got too curious and had to take a look myself, and geiser-eval-last-sexp does work the same:
Eval the previous sexp in the Geiser REPL.
With a prefix, revert the effect of geiser-mode-eval-last-sexp-to-buffer
You’re right! I just never knew about it!
I really, really like this, but isn’t there a potential footgun in the form of, “I generate the expectation and… that’s correct, right? I’m sure it is. Moving on.” where one might not scrutinize the expectation enough to notice that it’s not actually doing what they think it’s doing? That certainly isn’t specific to this workflow, though the lack of friction (and I’m not advocating for that friction here, just weighing the potential cost of removing it) may increase the likelihood that the developer fails to notice they’ve generated an error.
That’s a good question! I remember people asking the same thing on another article about this idea. I think it’s something that seems like a failure mode, but in practice I haven’t noticed this happening in years of working like this.
I feel like the assumption is that someone who’s cavalier with their test outputs would be more careful if they had to write down assertions explicitly. But I think that, realistically, the alternative is that they test code manually and don’t notice their mistake, or that they don’t test their code at all. Which is to say – this workflow does enable a new kind of sloppiness that is otherwise not available. But it’s preferable to the standard modes of sloppiness, because by committing a bad test, you can at least get more eyeballs on it at code review time.
Maybe a more interesting case is someone who is not generally sloppy, but who might cut a corner or two given this workflow. I think that that could be a sign that a test’s output is not very good – if it takes so much effort to verify that a test’s output is right or wrong that reasonable people are skipping that step, then it might be worth spending time to improve the test.
I agree with everything here. Thanks for laying it out! I can also imagine a case where this style might be difficult if the expectation is very large or very unwieldy, which of course is also a sign that there’s an issue with what’s being tested that might be worth refactoring.
Now I need to figure out how to get something like this set up for ruby. :)
Cinder has this for its test suite — there is a script to update snapshots. I wrote tests by writing the input, running the script, and checking it in after a once-over. No fancy unicode, but it is a nice test workflow.
I’ve done a janky, project specific version of this: I built a tool to take input data and build out tests for different ETL jobs. It was really helpful for locking changes into place. This seems like a much more elegant and general purpose version of this. I especially like using visualizations to extend the areas where this is usable.
This or property based tests is the only way I’ve ever managed to write tests. Everything else has too much latency or is too boring for my brain.
the ceiling for what a “good test” can look like is much, much higher when you’re using this technique.
I don’t understand why this ASCII art game state rendering (nice as it is) is connected to this technique of auto-generating expected test output. Is it because the ASCII art game state rendering would have been hard to write by hand?
Thanks for sharing this, it’s great! While I was reading the first part of the article, I kept thinking “this sounds quite a lot like expect tests”… I was happy to see you mentioned the same connection when I kept reading further on. 🙂
Now I just need to get around to using this workflow myself. Perhaps this article will finally nudge me into doing it!