Error recovery (was: The "too small to fail" memory-allocation rule)

Posted Dec 23, 2014 22:55 UTC (Tue) by cesarb (subscriber, #6266)
Parent article: The "too small to fail" memory-allocation rule

All this talk of barely tested error-recovery paths reminded me of the following parable:

> We went to lunch afterward, and I remarked to Dennis that easily half the code I was writing in Multics was error recovery code. He said, "We left all that stuff out. If there's an error, we have this routine called panic, and when it is called, the machine crashes, and you holler down the hall, 'Hey, reboot it.'"

-- http://www.multicians.org/unix.html

Error recovery (was: The "too small to fail" memory-allocation rule)

Posted Dec 24, 2014 7:39 UTC (Wed) by jezuch (subscriber, #52988) [Link] (3 responses)

>> easily half the code I was writing in Multics was error recovery code. He said, "We left all that stuff out. If there's an error, we have this routine called panic

And now in Linux we have both! :)

Error recovery (was: The "too small to fail" memory-allocation rule)

Posted Dec 24, 2014 21:07 UTC (Wed) by agrover (guest, #55381) [Link] (1 responses)

We've all been writing, reviewing, and debugging error-handling code in the kernel, hundreds of programmer-years of effort. It's a little insulting that it doesn't even get used. Seems to me the sooner we pull off the band-aid and enable all allocations to fail, the better.

If there are bugs that are "too scary" to contemplate fixing the right way, then we are all in BIG trouble.

Error recovery (was: The "too small to fail" memory-allocation rule)

Posted Dec 25, 2014 22:28 UTC (Thu) by epa (subscriber, #39769) [Link]

The terminology used is unfortunate. It is not that small allocations "cannot fail". Of course if the memory is unavailable, any memory allocation will fail. The question is what failure mode happens. Is it by a false status being returned to the caller, or is it some other kind of failure such as a kernel panic or hard lockup?
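To make the distinction between failure modes concrete, here is a minimal sketch in Python. Everything in it (`ToyPool`, `try_alloc`, `alloc_or_panic`, `OutOfMemory`) is a hypothetical name made up for illustration; it is a toy model of the two behaviours, not how the kernel's allocator is implemented.

```python
class OutOfMemory(Exception):
    """Hypothetical stand-in for a hard failure such as a panic."""


class ToyPool:
    """Toy fixed-capacity pool; an illustration only, not a real allocator."""

    def __init__(self, capacity):
        self.free = capacity

    def try_alloc(self, size):
        # Failure mode 1: a false status is returned and the caller
        # must run its own error-recovery path.
        if size > self.free:
            return False
        self.free -= size
        return True

    def alloc_or_panic(self, size):
        # Failure mode 2: the caller never sees a failure status;
        # everything stops instead.
        if not self.try_alloc(size):
            raise OutOfMemory(f"cannot satisfy {size} units, {self.free} free")


pool = ToyPool(8)
first = pool.try_alloc(6)   # fits, so the status is True
second = pool.try_alloc(4)  # does not fit, so the status is False
```

In both methods the allocation fails when memory is unavailable; the only difference is whether the caller gets a chance to react.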

Error recovery (was: The "too small to fail" memory-allocation rule)

Posted Dec 27, 2014 0:16 UTC (Sat) by reubenhwk (guest, #75803) [Link]

>> And now in Linux we have both! :)

If only we didn't have all that code, we might be able to satisfy that small allocation request.

Error recovery (was: The "too small to fail" memory-allocation rule)

Posted Dec 26, 2014 18:24 UTC (Fri) by rwmj (subscriber, #5474) [Link] (8 responses)

I used to think this was an indictment of Unix. But look at modern cloud systems with their multiple virtual machines, any one of which is expected to fail without affecting the service, or at Erlang with its philosophy of failing early and recovering failed processes. Well, now it's writing all that error-handling code that looks stupid.

Error recovery (was: The "too small to fail" memory-allocation rule)

Posted Dec 29, 2014 14:09 UTC (Mon) by epa (subscriber, #39769) [Link] (7 responses)

Yes, because the next time your mobile phone crashes you can seamlessly switch to one of the cloud of redundant phones you carry with you at all times...

Error recovery (was: The "too small to fail" memory-allocation rule)

Posted Dec 29, 2014 21:58 UTC (Mon) by rwmj (subscriber, #5474) [Link] (2 responses)

You obviously don't understand how Erlang works. Not many do, which I guess explains the state of programming these days.

Error recovery (was: The "too small to fail" memory-allocation rule)

Posted Dec 30, 2014 10:31 UTC (Tue) by epa (subscriber, #39769) [Link] (1 responses)

I'm sure Erlang works reliably, but the kernel cannot be written in Erlang.

Error recovery (was: The "too small to fail" memory-allocation rule)

Posted Jan 23, 2015 9:03 UTC (Fri) by Frej (guest, #4165) [Link]

No, but the philosophy could be followed. In many ways it is just the microkernel vs. monolithic kernel debate. If subsystems could be completely separate, you could just restart the subsystem and retry the request. But it's never quite that simple, and especially so for the kernel.
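As a rough sketch of the "restart the subsystem and retry the request" idea, here is what a tiny supervisor might look like in Python. The names `supervise` and `make_flaky_start` are invented for this example, and it assumes the ideal case where a subsystem shares no state with anything else, which is exactly the assumption that is hard to meet in a kernel.

```python
def supervise(start, request, max_restarts=3):
    """Send request to a subsystem, restarting the subsystem on failure.

    `start` is a hypothetical factory returning a fresh, isolated handler.
    Returns (result, number_of_restarts).
    """
    restarts = 0
    subsystem = start()
    while True:
        try:
            return subsystem(request), restarts
        except Exception:
            if restarts == max_restarts:
                raise  # give up and let the failure propagate upward
            restarts += 1
            subsystem = start()  # throw away all old state, start clean


def make_flaky_start(failures):
    """Build a factory whose first `failures` instances always crash."""
    started = 0

    def start():
        nonlocal started
        started += 1
        instance = started

        def handle(request):
            if instance <= failures:
                raise RuntimeError(f"instance {instance} crashed")
            return f"handled {request!r}"

        return handle

    return start


result, restarts = supervise(make_flaky_start(2), "read")
```

Note that the handler contains no error-recovery code at all; recovery lives entirely in the supervisor, and it only works because a restart is assumed to discard every trace of the failed instance.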

Error recovery (was: The "too small to fail" memory-allocation rule)

Posted Dec 30, 2014 15:59 UTC (Tue) by rgmoore (✭ supporter ✭, #75) [Link] (3 responses)

OTOH, Android expects its apps to be able to handle crashes cleanly. When it needs to free up memory, it just kills something in the background, and it's up to the app not to have problems from that. Seamless recovery from being killed is mandatory.

Error recovery (was: The "too small to fail" memory-allocation rule)

Posted Dec 30, 2014 16:40 UTC (Tue) by epa (subscriber, #39769) [Link]

It is sound design to make the application handle crashes cleanly, so it can recover without losing more than a tiny bit of work in progress. And that applies to the kernel too: journalling filesystems are designed so that even a hard crash will not lose data.

That doesn't really mean you can do without error handling code in the kernel, though. It's great if your filesystem doesn't get horribly corrupted when the machine crashes, but still the crash is not appreciated by the user. Yes, if you are running a farm of several machines then you can fail over to another and the service stays up; that doesn't really work as a remedy for your laptop locking up, unless you happen to carry around a redundant laptop with you at all times.

And in the case of Android, the apps are killed and restarted, but it would not be acceptable for the kernel itself to just panic on any error condition and require restarting the phone. Which is what we are talking about here: *kernel* error recovery.

Re: OTOH, Android expects its apps to be able to handle crashes cleanly.

Posted Dec 30, 2014 20:47 UTC (Tue) by ldo (guest, #40946) [Link] (1 responses)

Not quite. The framework will always explicitly call onDestroy() before killing your Activity. If this is happening because the system is running low on resources, not because of direct user action, then onSaveInstanceState() will be called before that, so you can save whatever is necessary of the state of your UI. When the user returns to your app, it can then be transparently restarted to make it look like it never stopped.

Re: OTOH, Android expects its apps to be able to handle crashes cleanly.

Posted Dec 30, 2014 21:41 UTC (Tue) by cesarb (subscriber, #6266) [Link]

Not true, see drivers/staging/android/lowmemorykiller.c. It directly sends SIGKILL to the process, without calling any Java function.
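For anyone wondering why SIGKILL rules out any callback in the process being killed, here is a small POSIX-only sketch in Python. It does not reproduce anything from lowmemorykiller.c; the child script, the marker file, and the `run_and_signal` helper are all made up for illustration. It compares a catchable signal (SIGTERM) with SIGKILL.

```python
import os
import signal
import subprocess
import sys
import tempfile

# Child program: installs a SIGTERM handler that records that cleanup ran.
CHILD = r"""
import signal, sys, time
marker = sys.argv[1]
def on_term(signum, frame):
    open(marker, "w").close()   # stands in for "save your state"
    sys.exit(0)
signal.signal(signal.SIGTERM, on_term)
print("ready", flush=True)
while True:
    time.sleep(1)
"""


def run_and_signal(sig):
    """Start the child, send it `sig`, return (exit status, cleanup_ran)."""
    with tempfile.TemporaryDirectory() as tmp:
        marker = os.path.join(tmp, "cleanup-ran")
        proc = subprocess.Popen(
            [sys.executable, "-c", CHILD, marker], stdout=subprocess.PIPE
        )
        proc.stdout.readline()  # wait until the handler is installed
        proc.send_signal(sig)
        returncode = proc.wait()
        proc.stdout.close()
        return returncode, os.path.exists(marker)


term_result = run_and_signal(signal.SIGTERM)
kill_result = run_and_signal(signal.SIGKILL)
```

With SIGTERM the child's handler runs and it exits normally; with SIGKILL the process is simply gone, so any state it wanted to keep must already have been saved beforehand.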