The point about reproducing the issue is important and the discussion around it is good. But I’m going to be a broken record here and cite Dave Agans’ book as the best thing to read if you want to understand how to debug something. This post hits on some of it. Dave’s book does a great job of clarifying the stuff this post glosses over.
Here are the nine rules from Dave’s book. These are really the table stakes for debugging discussions, in my opinion:
1. Understand the system.
2. Make it fail.
3. Quit thinking and look.
4. Divide and conquer.
5. Change one thing at a time.
6. Keep an audit trail.
7. Check the plug.
8. Get a fresh view.
9. If you didn’t fix it, it ain’t fixed.
The hardest bug I fixed (taking over a month of dedicated time) was hard because neither the article’s advice (reproduce it) nor the advice above actually worked. The bug was very hard to reproduce on demand—on my development system at the time, it took several days of heavy use to trigger; on the production server it took hours. The resulting core dumps (and yes, I had plenty of core dumps to look at) were inconsistent with each other. The program would never crash in the same place twice. It was failing, yet I couldn’t make it fail deliberately.
What it finally took to solve it was just thinking hard about the code and how it worked, forming hypotheses, testing them, rejecting them, and thinking even more. And like all such bugs, the root cause was obvious once I identified it—I had a signal handler calling non-async-signal-safe functions.
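For the curious, here is a minimal sketch of that class of bug (the program itself is made up; the pattern is what matters). Only a small set of functions is async-signal-safe, and printf and malloc are not among them:

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* BUG: printf and malloc are not async-signal-safe. If the signal
     * arrives while the main program is inside malloc, the handler
     * re-enters the allocator on inconsistent state, and the process
     * crashes later, somewhere unrelated -- hence core dumps that
     * never agree with each other. */
    static void handler(int sig) {
        printf("caught signal %d\n", sig);  /* unsafe in a handler */
        char *p = malloc(64);               /* also unsafe */
        free(p);
    }

    int main(void) {
        signal(SIGINT, handler);
        for (;;) {
            char *p = malloc(128);  /* heavy allocator traffic widens
                                       the window for the race */
            free(p);
        }
    }

The usual fix is to have the handler do nothing but set a volatile sig_atomic_t flag, and let the main loop do the real work.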
I had a rare and inconsistent crash bug that (after some days of fiddling to reproduce and characterize) I eventually managed to reduce to about an hour of artificial load before the crash. I ran the reproducer under Mozilla rr (the record/replay time-travel debugger), which allowed a colleague to work out the cause.
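For anyone who hasn’t tried it, a session looks roughly like this (./myprog and addr are placeholders):

    rr record ./myprog     # run under recording until it crashes
    rr replay              # replay the exact same execution under gdb
    (rr) continue          # run forward to the crash
    (rr) watch -l *addr    # watchpoint on the corrupted location
    (rr) reverse-continue  # run backwards to whoever wrote it

Because the recording replays deterministically, a one-in-a-thousand crash only has to happen once.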
It’s good to have a rich collection of debugging tools and techniques.
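The other day I bought two more copies of that book so I can give them to two junior engineers I just met.
This book really is the gift that keeps on giving :)
It’s such a simple thing, but somehow new folks need to read it because this kind of stuff isn’t taught in formal education anywhere, AFAICT.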
Once you’ve solved the problem of reproducing a bug, the next good problems to work on are: Was it working before? If so, when did it break? And finally, perhaps most importantly: what changed?
I had a great boss who, when some of us got stuck in the weeds of a complex, critical bug, always asked that last question. It usually saved us a great deal of time. In my very subjective memory, 80% of the time, critical bugs are the result of some breaking change either to the code or to the environment. (If in code, git bisect is a great tool for finding the breaking commit.) If you can demonstrate that it was working before, figuring out what changed between working and broken can be the shortest path to solving the bug.
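For reference, a bisect session is only a handful of commands (v1.2.0 and ./repro.sh are placeholders for a known-good tag and whatever script reproduces the bug, exiting nonzero on failure):

    git bisect start
    git bisect bad               # the current commit is broken
    git bisect good v1.2.0       # a version known to have worked
    git bisect run ./repro.sh    # git checks out midpoints and runs
                                 # the script until one commit is left
    git bisect reset             # return to where you started

With a reliable reproducer, this finds the breaking commit in log2(n) steps, hands-free.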
I concur. At my previous job, whenever ops told us, “there’s a bug in your code,” I would shoot back with “well, the last time our code was deployed was three months ago [1], and if the bug is only showing up now, what have you changed?” In those cases, it was always something else that had changed that caused the issue.
[1] Our customer was the Oligarchic Cell Phone Company, who could veto our deployments into production. They did not move fast, and thus, four deployments per year was a “busy” year for our department.
Often the solution will naturally present itself in the course of debugging, usually by the time you’re successfully reproducing the issue. It could be that by reproducing the problem, you gain the knowledge required to fix it. However, I think there are very few actually hard problems, and you’re probably not smart and/or lucky enough to work on one. I’m not sure that I’ve ever worked on a hard problem, at least not for work. We’re mostly just shoveling mud around in various forms with various amounts of efficiency.
So, in order to attack a hard problem like debugging, we want to decompose it into easier problems. “Reproduce the bug” is a good tactic because it’s easy to test hypotheses when you can reproduce the bug. But the bugs where you fix the entire thing in the course of trying to reproduce it are the ones where that tactic failed: we decomposed “fix the bug” into “reproduce the bug and then do some other stuff”, which turned out to be equivalent to “fix the bug and then do some other stuff”.
IOW, when OP says “reproduce the bug”, I feel like they’re really just saying “draw the rest of the owl”, and that’s unsatisfying to me, so here’s my (short) take on how it’s actually done.
I think trying to reproduce bugs is a special case of applying (something akin to) the scientific method to them. That is, we try to explain and then disprove our own explanation until we have an explanation we can’t disprove. I bring all this up because sciencing it works even when reproducing the bug is impossible, although it still benefits from a repro if we have one.
At the beginning, hypotheses should usually be cheap and general, like “foo has the wrong value”, so you kind of learn the rules of the game as you play. Complicated ones like “this happens when you shoot a Scotsman with a longbow within the city walls of York” should wait until you’ve proved those elements are significant.
I would suggest some rules that lean more towards the case where reproduction is impossible and any fix will take weeks to verify. For me, empirically, these are the interesting ones to solve:
Understand the thing that’s broken in terms of something that is not. You can’t debug, say, C undefined behaviour by thinking about the C abstract machine (see the first sketch after this list).
Try to find and prove important invariants (see the second sketch after this list).
Make small hypotheses on your way to big ones, because they’ll still be useful after the big ones fail.
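To make the first rule concrete, here’s a stock example of why you can’t reason about undefined behaviour from inside the C abstract machine. Signed overflow is undefined, so an optimizing compiler may assume it never happens:

    /* Intended as an overflow check, but if x == INT_MAX then x + 1
     * is undefined behaviour, so the compiler may assume the branch
     * is unreachable and compile the whole function to "return 0".
     * Reading the C source won't show you that; reading the emitted
     * assembly will. */
    int will_overflow(int x) {
        if (x + 1 < x)
            return 1;
        return 0;
    }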
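And for the second rule, a minimal sketch of turning an invariant into an executable check (the data structure here is invented): assert the property at module boundaries so a violation crashes near its cause instead of hours downstream.

    #include <assert.h>
    #include <stddef.h>

    struct node { struct node *next; size_t size; };

    /* Invariants: the free list is acyclic (checked with Floyd's
     * tortoise-and-hare) and the nodes we visit have nonzero sizes.
     * A failed assert dumps core close to the corruption rather
     * than far away from it. */
    static void check_free_list(const struct node *head) {
        const struct node *slow = head, *fast = head;
        while (fast && fast->next) {
            slow = slow->next;
            fast = fast->next->next;
            assert(slow->size != 0);
            assert(slow != fast && "cycle in free list");
        }
    }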
complicated ones like “this happens when you shoot a Scotsman with a longbow within the city walls of York” should wait until you’ve proved those elements are significant.
If you’ve seen it happen when a Scotsman was shot with a longbow within the city walls of York, it’s fine to work top-down. So you start with that complex situation and try to reproduce it. If that works consistently, try shooting an Englishman with a longbow within the city walls of York, or shooting a Scotsman with a longbow within the city walls of a different city, etc. Stripping it down to “it has to be a Scotsman who dies” can be helpful.
Of course, if people actually have to die (i.e., it only happens in production and it’s totally unacceptable for it to break like it does), you might want to start bottom-up, with just the situation in mind, like you described.
I don’t disagree. I’m a bit prone to thinking of debugging with and without reproduction as almost qualitatively* different things, although they’re not. Most of what I said was with the no-repro case in mind, and the principle could be expressed more generally. Maybe: hypotheses should be only small jumps away from what you already know. I hope it still worked decently as an example.
* Yes, I know.
Debugging is one of the purest forms of applying the scientific method. If you’re great at debugging, you’d make a great scientist. The thinking tools are exactly the same.