Re: does this 'break' exits the loop or is it simply an error exit?

Posted Dec 26, 2014 0:38 UTC (Fri) by ldo (guest, #40946)
In reply to: Or You Could Simplify The Error-Recovery Paths by Cyberax
Parent article: The "too small to fail" memory-allocation rule

Hint: check the condition.



Re: does this 'break' exits the loop or is it simply an error exit?

Posted Dec 27, 2014 1:10 UTC (Sat) by reubenhwk (guest, #75803) [Link] (40 responses)

Are you choosing this...
for (i = 0; i < MAX; ++i) {
   do { /*once*/
       if (err) {
           i = MAX;
           break;
       }
   } while (false);
}
over this?
goto cleanup;

Re: does this 'break' exits the loop or is it simply an error exit?

Posted Dec 27, 2014 1:24 UTC (Sat) by reubenhwk (guest, #75803) [Link] (39 responses)

This isn't good...
for (i = 0; i < IMAX; ++i)
{
    for (j = 0; j < JMAX; ++j)
    {
        do { /* once */
            if (PyErr_Occurred())
                break;
        } while (false);
        if (PyErr_Occurred())
            break;
    }
    if (PyErr_Occurred())
        break;
}
This is far better...
for (i = 0; i < IMAX; ++i)
{
    for (j = 0; j < JMAX; ++j)
    {
        if (PyErr_Occurred())
            goto done;
    }
}

done:
Dealing with complexity means splitting a complex problem up into simple things people can understand. Good programmers don't write complex code. Good programmers write simple code. Also, the *ONLY* time you should use a pseudo-loop like this is in a macro so the compiler will accept the semicolon following it...
#define COMPLEX_MACRO_SHOULD_BE_A_FUNC(x) do { \
    /* complex macro code isn't good either */ \
} while (0)
later...
COMPLEX_MACRO_SHOULD_BE_A_FUNC("hello world");
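To see why the do { ... } while (0) wrapper matters in a macro, here is a minimal sketch. The names SWAP_INTS and order_pair are hypothetical, made up for this illustration. If the macro body were a bare { ... } block, the semicolon after the macro call would end the if statement and the following else would fail to compile.

```c
/* Hypothetical multi-statement macro, wrapped so that it behaves
 * like a single statement and absorbs the trailing semicolon. */
#define SWAP_INTS(a, b) do { \
    int tmp_ = (a);          \
    (a) = (b);               \
    (b) = tmp_;              \
} while (0)

/* Ensures *lo <= *hi. Returns 1 if the values were swapped, 0 otherwise. */
int order_pair(int *lo, int *hi)
{
    if (*lo > *hi)
        SWAP_INTS(*lo, *hi);   /* the ';' here terminates the do/while */
    else
        return 0;              /* still attaches to the 'if' above */
    return 1;
}
```

Calling order_pair on the pair (5, 2) swaps the values and returns 1; calling it again on the now-ordered pair returns 0.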

Re: This is far better...

Posted Dec 27, 2014 20:44 UTC (Sat) by ldo (guest, #40946) [Link] (38 responses)

You do realize that PyErr_Occurred() is checking for an error from other API calls, right? So where do you put these API calls in your example?

Re: This is far better...

Posted Dec 27, 2014 20:55 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (37 responses)

Right before the PyErr_Occurred.

Re: Right before the PyErr_Occurred.

Posted Dec 27, 2014 21:32 UTC (Sat) by ldo (guest, #40946) [Link] (36 responses)

You need to fill in your example some more. Try, say, your version of the block beginning on line 480.

Re: Right before the PyErr_Occurred.

Posted Dec 28, 2014 3:51 UTC (Sun) by reubenhwk (guest, #75803) [Link] (35 responses)

for (i = 0; i < IMAX; ++i)
{
    for (j = 0; j < JMAX; ++j)
    {
        /* Some Python C calls, allocations, etc */

        if (PyErr_Occurred())
            goto done;
    }
}

done:
    free(whatever);
    if (fileptr)
        fclose(fileptr);

Don't fight it. Embrace goto statements. Goto is clearly better than calling PyErr_Occurred multiple times. For all we know, PyErr_Occurred checks the entire state of the Python interpreter (possibly a non-trivial, expensive operation). goto done will emit one machine instruction. Don't abuse other language constructions (do nothing loops) just to avoid using a goto...especially when you're effectively doing the same thing. Try it. You'll like it.
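A fleshed-out, compilable version of this pattern might look like the sketch below. Everything in it is hypothetical: fill_table stands in for the real work, and a simulated failure (the fail_at parameter) replaces PyErr_Occurred() so that the example builds without the Python headers.

```c
#include <stdlib.h>

/* Hypothetical example: sums the entries of a rows x cols multiplication
 * table, using a scratch buffer for each row. Returns 0 on success and
 * -1 on failure. fail_at simulates an error in that row (-1 = never).
 * Assumes cols > 0. */
int fill_table(int rows, int cols, int fail_at, long *sum_out)
{
    int *scratch = NULL;
    long sum = 0;
    int err = -1;
    int i, j;

    scratch = malloc((size_t)cols * sizeof *scratch);
    if (scratch == NULL)
        goto done;

    for (i = 0; i < rows; ++i) {
        for (j = 0; j < cols; ++j) {
            if (i == fail_at)
                goto done;        /* one jump leaves both loops */
            scratch[j] = (i + 1) * (j + 1);
            sum += scratch[j];
        }
    }

    *sum_out = sum;               /* only written on success */
    err = 0;

done:
    free(scratch);                /* free(NULL) is a no-op, so always safe */
    return err;
}
```

The error is tested once, at the point where it can occur, and the cleanup after the label runs on both the success and the failure path.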

Re: Right before the PyErr_Occurred.

Posted Dec 28, 2014 6:19 UTC (Sun) by ldo (guest, #40946) [Link] (34 responses)

You need to make it more explicit. Rewrite the part of my code that I pointed you to, so we can compare and see which is better.

There seems to be a marked reluctance among most of you nay-sayers to do this. But that is the only way you can prove that there is something to your point, more than mere hand-waving hot air.

Re: Right before the PyErr_Occurred.

Posted Dec 28, 2014 12:44 UTC (Sun) by mm7323 (subscriber, #87386) [Link] (33 responses)

Laurence - discussion of goto usage is historically well trodden territory and this thread seems a little off topic and repetitive.

Did you read the LWN Christmas/end of year message?

The Story So Far...

Posted Dec 28, 2014 22:17 UTC (Sun) by ldo (guest, #40946) [Link] (32 responses)

For those who just came in, and are too lazy to read the rest of the thread before sticking their oar in:

  • The article is about an assumption about (lack of) possible failures of certain memory allocations, which has somehow baked itself into many parts of the Linux kernel since long before anyone can now remember.
  • This assumption now seems to be unwise. However, it is going to be hard to remove, because of the complexity of error-recovery paths right through the kernel.
  • I offered up a general technique for simplifying such error-recovery paths, and gave a decently non-trivial example to illustrate it in action, solving a reasonably complex, real-world problem. A key part of the technique is that it avoids gotos. This way, you can be sure that cleanups will always execute, no matter which way the flow of control goes.
  • My efforts so far have been met with fear, hostility, and just plain prejudice. Several people have tried to claim that using gotos would be preferable to my technique.
  • Yet, when challenged, they are unable to actually prove their point, by offering simpler alternatives to my code.
  • So they resort to trying to cast aspersions on the quality of my code.
  • But that just brings us back to the same point: if they think my code is too complicated or just plain crap, why can’t they do better? Why can they not come up with something simpler to solve the same real-world problem? But they cannot.

Any questions?

The Story So Far...

Posted Dec 28, 2014 22:56 UTC (Sun) by nix (subscriber, #2304) [Link] (3 responses)

The problem in the kernel is not an inability to be sure that cleanups will execute. With the goto exception-handling technique, unless you omit a goto, they always execute, and if your cleanups are executed in the normal path (i.e. there is no return before the goto label) they will execute even if you omit a goto. This is *exactly the same* as in your technique.

The problem in the kernel is an inability to be sure that what those code paths do on error is even sane. They will execute if triggered, but because of this "too small to fail" rule, many of them have never executed, even when the system is out of memory.

Your digression on an idiosyncratic (and IMNHSO needlessly verbose and unreadable) technique to replace gotos is not relevant to this problem so cannot solve it.

To me, the technique appears similar to the panic that strikes people who started with one-return languages like Pascal when they are faced with languages that allow multiple returns. Yes, if carelessly used this can lead to unreadable spaghetti, but that doesn't mean the right answer is to ban it! It's amazing how many of my functions these days reduce to

int err = -1;

if (do-nothing-fastpath-test)
    return 0; /* early */

stuff that allocates memory or does other things needing cleanup in both success and failure cases
if (error check)
   goto cleanup_stuff;

stuff that doesn't need cleanup but can fail
if (error check)
   goto cleanup_stuff;

more stuff
if (error check)
   goto cleanup_more_stuff;

...

err = 0; /* success */

cleanup_more_stuff:
undo 'more stuff'

cleanup_stuff:
undo 'stuff'

return err;
It is clear that in this code -- other than in the early-fast-exit case -- the cleanups will always execute. You don't have to have a 'return' before them! The normal flow of control passes smoothly through them at all times: it's just that the cleanups can skip some of that flow (the parts relating to things they did that don't need cleanup, and where the cleanup is not something like free() that is harmless to do on something that was never allocated).

Notice how little code error handling includes here -- I've replaced the actual work with pseudocode, but even in the real code the error unwinding is *two lines per check*, one an unavoidable conditional, plus optional extra work in those conditionals to do error-specific stuff first. The cleanup is no extra code: it's just a couple of goto labels, and if you miss one of them out or somehow manage to do some cleanup at a too-deeply-nested level you get a nice easy-to-spot syntax error.
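As a concrete instance of the skeleton above, here is a small compilable sketch. The function sum_csv and its behaviour are hypothetical, chosen only because the task needs two temporary allocations and a step that can fail without allocating anything.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical example: sums a comma-separated list such as "1,2,3".
 * Returns 0 and stores the sum on success, -1 on any failure. */
int sum_csv(const char *text, long *sum_out)
{
    int err = -1;
    char *copy, *tok, *end;
    long *vals;
    size_t n = 0, i, len;
    long sum = 0;

    if (*text == '\0') {
        *sum_out = 0;
        return 0;                    /* early: nothing allocated yet */
    }

    len = strlen(text);
    copy = malloc(len + 1);          /* strtok() modifies its input */
    if (copy == NULL)
        goto out;
    memcpy(copy, text, len + 1);

    vals = malloc(len * sizeof *vals);   /* there are at most len tokens */
    if (vals == NULL)
        goto free_copy;

    for (tok = strtok(copy, ","); tok != NULL; tok = strtok(NULL, ",")) {
        errno = 0;
        vals[n] = strtol(tok, &end, 10);
        if (errno != 0 || end == tok || *end != '\0')
            goto free_vals;          /* malformed or out-of-range number */
        ++n;
    }

    for (i = 0; i < n; ++i)
        sum += vals[i];
    *sum_out = sum;
    err = 0;                         /* success falls through the cleanups */

free_vals:
    free(vals);
free_copy:
    free(copy);
out:
    return err;
}
```

The success path runs straight down the function and through every label; each failure path joins that flow at the first cleanup it actually needs.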

This is ever so much neater than your false-loops approach and involves much less spurious nesting. It's linear in the normal case, just like the actual code flow, with branches only in exceptional conditions, while your code makes *loops* look like the normal condition, even though those loops actually never happen. Plus, spurious break and continue are both correctly diagnosed as syntax errors: in your approach, they cause a silent and erroneous change in control flow, as if an error had happened. It is much easier to mistakenly type or transpose 'break' outside e.g. a switch than it is to accidentally type 'goto some_cleanup'!

In my view it is you who has manifestly failed to understand the merits of our approach, which is quite clearly beneficial on every metric I can see other than the unthinking one which assigns an infinite automatic negative value to any use of 'goto' regardless, even uses which only jump forward in control flow to the end of a function and are thus precisely equivalent to a block with a cleanup in it and then a return. Spaghetti code this is not! It is much less spaghetti than yours.

Note: I came from a Pascal background and was at first violently anti-goto *and* anti-multiple-returns. When I look at my code from those long-ago days the contorted control flow is almost unreadable. Almost all the time a nice linear flow is preferable. Save loops for things that can actually *loop* at least once -- and for the while (0) macro trick.

The Story So Far...

Posted Dec 29, 2014 0:41 UTC (Mon) by vonbrand (guest, #4458) [Link]

That is exactly the point: With the Linux goto usage the normal code path is linear and clear for all to see, failure exceptional code paths are kept apart, for separate analysis (which they require, as "normal situation" isn't guaranteed in them).

The Story So Far...

Posted Dec 29, 2014 4:20 UTC (Mon) by viro (subscriber, #7872) [Link] (1 responses)

TBH, my impression is that ldo either has never bothered to read any of the relevant papers (Dijkstra, Hoare, etc.) *or* has confused the proofs that goto doesn't add any expressive power (e.g. compared to mutual (tail) recursion) with the discussions of the reasons why goto often leads to brittle code that is hard to reason about. And these are very different things - by the very nature of such proofs, feeding them a hard-to-analyse ball of spaghetti yields an equally hard to analyse goto-free equivalent, so they actually demonstrate both that goto can be avoided *and* that avoiding goto doesn't guarantee avoiding the problems usually associated with it.

BTW, from the control flow graph decomposition POV, break is an atrocity almost equal to multiple returns (and if you add break <number of levels> a-la sh(1), it becomes _worse_ than multiple returns). Not to mention its inconsistency between if() and switch(), etc.

Frankly, this sort of attitude is best cured by careful reading of the standard papers circa late 60s--early 70s *plus* sorting through the usual fare on lambda-lifting, supercombinators and related stuff (circa early 80s) *plus* that on the monads sensu Peyton-Jones et al. Attitude re goto-phobia, that is - "my code is perfect, whaddya mean, idiosyncrasies?" one is cured only by sufficiently convincing LART...

Re: ldo either has never bothered to read any of the relevant papers

Posted Dec 30, 2014 20:57 UTC (Tue) by ldo (guest, #40946) [Link]

Or maybe you never bothered to look at my code?

Feel free to try any of the error-recovery-within-loop cases. That is where the rubber hits the road, and where all the goto-ists have come a cropper.

The Story So Far...

Posted Dec 28, 2014 23:33 UTC (Sun) by cesarb (subscriber, #6266) [Link] (5 responses)

> This way, you can be sure that cleanups will always execute, no matter which way the flow of control goes.

Stop right here. The problem is not with cleanups which should always be executed; these are already well-tested. The problem is with cleanups which should *never* be executed, unless an error happens.

In exception-handling terms, what you are talking about is a "finally" block, which executes both on success and on failure. The error-recovery paths in question, however, are "except" blocks, which execute only on failure.

For a filesystem-related example, since we're supposed to be talking about filesystems: let's say a user program just appended something into a file, and you're in the function which will submit the newly appended data to the block subsystem. As part of that, it has to allocate space for the new data, updating several of the filesystem's control structures, and allocate the control structures for the block subsystem.

If any allocation fails in the middle of this complex process, you have to unwind all the bookkeeping updates, otherwise the filesystem gets into an inconsistent state. If no allocation fails, however, the bookkeeping changes must be kept, since they represent the new state of the filesystem.

Is the pseudo-loop technique able to do the equivalent of an "except" block without getting even uglier? Remember that the rewind will have to be called from many places, since any of the allocations can fail, and that the failure can also be in the middle of the initial update of the control structures, so the rewind might be partial.

The traditional way of doing this in the Linux kernel is somewhat like the following:

int foo(...)
{
    int err = 0;
    /* do stuff */
    err = bar(...);
    if (err)
        goto error;
    /* do even more stuff */
    err = baz(...);
    if (err)
        goto error;
    /* do yet more stuff */
out:
    /* free temporary memory and unlock mutexes */
    return err;
error:
    /* unwind state updates */
    goto out;
}

The kernel developers are pragmatic. If a kernel developer feels that using a "pseudo-loop" results in cleaner code, they do use it. However, for error handling the usual design pattern is either a "goto error" (single error label, pointers initialized to NULL) or a "goto error_foo" (many error labels, each falling through to the next, releasing resources in reverse order). On success, either the unlock is duplicated (it's usually just an unlock, memory is rarely allocated just for the duration of one function), or the error handling does a "goto out" like in the example above.
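The "out"/"error" split can be made concrete with a small userspace sketch. It is not kernel code: struct ledger, ledger_append and the fail_after parameter are hypothetical, and rolling back by restoring a snapshot is a simplifying assumption (a real filesystem would unwind each update individually).

```c
#include <stdlib.h>

/* Hypothetical bookkeeping state standing in for filesystem metadata. */
struct ledger {
    int reserved_blocks;
    int entries;
};

/* Appends `count` entries, reserving one block each. `fail_after` simulates
 * an allocation failure after that many entries (-1 = never fail).
 * On failure the ledger is rolled back to its state on entry. */
int ledger_append(struct ledger *lg, int count, int fail_after)
{
    struct ledger saved = *lg;           /* snapshot used for rollback */
    int *scratch;
    int err = 0;
    int i;

    scratch = malloc(sizeof *scratch);   /* temporary working memory */
    if (scratch == NULL) {
        err = -1;
        goto error;
    }

    for (i = 0; i < count; ++i) {
        if (i == fail_after) {
            err = -1;
            goto error;
        }
        lg->reserved_blocks += 1;        /* updates that persist on success */
        lg->entries += 1;
    }

out:
    free(scratch);                       /* "finally": success and failure */
    return err;

error:
    *lg = saved;                         /* "except": failure only */
    goto out;
}
```

On success the bookkeeping changes are kept and only the temporary memory is released; on failure the partial updates are undone first and the same release then runs.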

Re: The problem is with cleanups which should *never* be executed, unless an error happens.

Posted Dec 29, 2014 17:35 UTC (Mon) by ldo (guest, #40946) [Link] (4 responses)

This is why cleanups should be idempotent--pass the cleanup routine a NULL argument, and it does nothing. So the steps become something like

  • Allocate the first thing needing cleanup; abort on failure
  • Allocate the second thing needing cleanup; abort on failure
  • If the second thing subsumes the first thing (so cleaning up the second thing automatically cleans up the first thing), then you can set the first thing to NULL at this point, to avoid a double cleanup.
  • And so on.

So when you get to the end, you simply execute all the relevant cleanups unconditionally, and the ones with NULL arguments turn into no-ops.
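The steps in this comment can be sketched as follows. This is an illustration only, not the code under discussion in the thread: struct record, record_new and record_free are hypothetical names, and the do { } while (0) block is the "pseudo-loop" being debated.

```c
#include <stdlib.h>
#include <string.h>

struct record {
    char *name;   /* owned by the record once attached */
};

/* Idempotent cleanup: does nothing when passed NULL. */
void record_free(struct record *r)
{
    if (r == NULL)
        return;
    free(r->name);
    free(r);
}

/* Returns a new record holding a copy of `name`, or NULL on failure. */
struct record *record_new(const char *name)
{
    char *name_copy = NULL;
    struct record *rec = NULL;
    struct record *result = NULL;

    do { /* once */
        name_copy = malloc(strlen(name) + 1);
        if (name_copy == NULL)
            break;
        strcpy(name_copy, name);

        rec = calloc(1, sizeof *rec);
        if (rec == NULL)
            break;

        rec->name = name_copy;
        name_copy = NULL;      /* rec now subsumes it: avoids a double free */

        result = rec;
        rec = NULL;            /* ownership passes to the caller */
    } while (0);

    /* Unconditional cleanups; the ones holding NULL are no-ops. */
    record_free(rec);
    free(name_copy);
    return result;
}
```

Every pointer starts as NULL and is reset to NULL when ownership moves on, so the cleanup calls at the end are safe on any path.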

Re: The problem is with cleanups which should *never* be executed, unless an error happens.

Posted Dec 29, 2014 18:14 UTC (Mon) by cesarb (subscriber, #6266) [Link] (3 responses)

> If the second thing subsumes the first thing (so cleaning up the second thing automatically cleans up the first thing), then you can set the first thing to NULL at this point, to avoid a double cleanup.

You still are in the "cleanup" mentality. But the problem with filesystems is not "cleanup", it's "rollback". There's nothing to subsume the relevant code, and it subsumes nothing else; it's really something which executes only on failure.

> So when you get to the end, you simply execute all the relevant cleanups unconditionally, and the ones with NULL arguments turn into no-ops.

That doesn't help with the situation in question, where the memory allocation routines never return "failure". The relevant error-recovery code would always get passed NULL, and so the branch which is called when it receives a non-NULL never gets tested.

Sure, you *called* the code unconditionally, but it doesn't change the fact that it *executes* conditionally. Whether the branch point is outside or inside it is immaterial. For instance, it's well-known that the standard "free(void *ptr)" function does nothing when called with a NULL argument, but that's because it begins with a conditional branch: "if (!ptr) return;". If it were always called with a NULL pointer, the most complex part of its code would never be exercised.
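A tiny sketch makes the distinction between calling and executing visible. The helper buffer_release and the counter are hypothetical, introduced only to count how often the non-trivial branch of a NULL-tolerant cleanup actually runs.

```c
#include <stdlib.h>

/* Counts how often the non-trivial branch of the cleanup really ran. */
static int real_cleanups;

/* Hypothetical cleanup helper: safe to call with NULL, like free(). */
void buffer_release(void *buf)
{
    if (buf == NULL)
        return;            /* the only path taken if callers always pass NULL */
    ++real_cleanups;       /* stands in for the complex, rarely tested part */
    free(buf);
}

/* Calls the cleanup unconditionally `calls` times. When `allocate` is zero,
 * every call passes NULL, so the real branch is never exercised. */
int count_real_cleanups(int calls, int allocate)
{
    int i;

    real_cleanups = 0;
    for (i = 0; i < calls; ++i)
        buffer_release(allocate ? malloc(16) : NULL);
    return real_cleanups;
}
```

With allocate set to zero the function is called every time yet the counter stays at zero, which is the situation described above: the branch exists but is never tested.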

Re: You still are in the "cleanup" mentality.

Posted Dec 30, 2014 20:34 UTC (Tue) by ldo (guest, #40946) [Link] (2 responses)

That’s right. You keep insisting that “cleanup” and “rollback” are entirely different, whereas they are really two aspects of the same thing, and can be treated as such.

> That doesn't help with the situation in question, where the memory
> allocation routines never return "failure".

Returning NULL from an allocation request is a failure.

> Sure, you *called* the code unconditionally, but it doesn't change
> the fact that it *executes* conditionally.

Do you know what “abstraction” means?

Re: You still are in the "cleanup" mentality.

Posted Dec 30, 2014 22:33 UTC (Tue) by cesarb (subscriber, #6266) [Link] (1 responses)

> > That doesn't help with the situation in question, where the memory allocation routines never return "failure".
> Returning NULL from an allocation request is a failure.

Please, reread the article this comment thread is attached to.

The whole issue is that, under certain conditions, the allocation requests were *never* returning NULL, even when they should!

> > Sure, you *called* the code unconditionally, but it doesn't change the fact that it *executes* conditionally.
> Do you know what “abstraction” means?

Please, reread the article this comment thread is attached to.

Abstraction or not, it doesn't change the fact that, since the allocation requests were *never* returning NULL, even when they should, the error handling code was *never* being executed. It doesn't matter whether it has been abstracted away or not, untested code is untested code.

Re: requests were *never* returning NULL, even when they should!

Posted Dec 31, 2014 21:00 UTC (Wed) by ldo (guest, #40946) [Link]

Precisely the point. And changing them to return NULL is fraught, because the error-handling paths in the callers are quite likely riddled with omissions where they should be dealing with this case. And finding and fixing those omissions is hard, because the error-handling paths are so complex.

Which is why I am advocating simpler error-handling paths, as in my example.

The Story So Far...

Posted Dec 28, 2014 23:59 UTC (Sun) by bronson (subscriber, #4806) [Link] (1 responses)

You think that, just because nobody wants to refactor your 1300-line hellishly-nested file with way-too-long functions, you win the argument?

Hardly. Reread Cyberax's replies with an open mind this time. He's right: carefully applied gotos would take care of your "do /* once */ {} while(false)" nesting and ambiguous breaks. If you want to make your code more readable, this would be a great first step. After that, you could give it a good scrubbing with https://www.kernel.org/doc/Documentation/CodingStyle

But, from your aggressive writing style, it sounds like you'd rather argue.

Re: you win the argument?

Posted Dec 29, 2014 17:28 UTC (Mon) by ldo (guest, #40946) [Link]

Yes.

I took the trouble to write all that code which proves my point. If you want to prove yours, you have to do the same.

The Story So Far...

Posted Dec 29, 2014 0:35 UTC (Mon) by cesarb (subscriber, #6266) [Link]

> So they resort to trying to cast aspersions on the quality of my code.

Taking a look at that code, it's easy to see why. It has a rather unorthodox coding style, which can make coders used only to common coding styles for both C *and* Python recoil in horror.

The code layout is the first thing one unavoidably notices when first looking at a new source code file, so even before they can consider the actual code quality, your code already has a negative score in their minds.

In particular, Python does have a standard coding style for C code: https://www.python.org/dev/peps/pep-0007/. When in Rome...

Here are, in no particular order, a few of the things which made *me* recoil in horror on a quick look at that code:

* The placement of the parameters in a function definition.
* The placement of the comment describing the function. Its traditional location (and the one all code-documentation tools expect) is before the function; you placed it just before the opening brace, where the parameter types were declared in pre-standard C.
* The unorthodox indentation of a multi-line expression, which placed each operator in its own line (instead of, as usually done, either at the end or at the beginning of a line).
* The use of an "end comment" for control structures. An actual example from that code file:

for (; i < 4; ++i)
{
    /* ... */
} /*if*/

Yes, the comment doesn't match the control structure it's attached to. This is why most people don't bother with these sorts of comments: unless they are part of the language (and thus forced to be correct by the compiler), they can and will end up being wrong.

None of that affects the quality of the compiled code; it's all stripped by the compiler (even the "dummy loops" are optimized out). The exception is all the PyErr_Occurred() calls; your code has redundant calls to that function (it's not a macro, so it won't be optimized out), which probably wouldn't happen with a more standard coding style.

Your code might or might not have a good quality; the strange coding style makes it unnecessarily hard to tell. And as others have noted, it has a few "code smells" (https://en.wikipedia.org/wiki/Code_smell), like excessively long functions, which usually correlate with a lower code quality.

The Story So Far...

Posted Dec 29, 2014 9:39 UTC (Mon) by itvirta (guest, #49997) [Link] (5 responses)

>The article is about an assumption about (lack of) possible failures of certain memory allocations,
>which has somehow baked itself into many parts of the Linux kernel since long before anyone can now remember.

About errors promised, but not returned, and the numerous error paths that are untested
because of that (plus a nice lock-up), if I got it right. The actual coding style of the error
paths seems unrelated.

> Yet, when challenged, they are unable to actually prove their point, by offering simpler
> alternatives to my code.
> So they resort to trying to cast aspersions on the quality of my code.

You've repeated about five times the suggestion that someone _else_ should prove their
way by reorganising _your_ code. Given that everyone else, including the Linux kernel folks,
seems to be happy with their gotos against your suggestion, it seems that the numbers are
against you, and I've seen no proof for your way other than your personal (and persistent)
assertion.

Not that mere numbers prove everything, but it makes one wonder. As does the fact that
someone tried to rework your code, but made mistakes. Doesn't that also reflect badly on your
code? Perhaps it isn't as well-readable as you assert? Some specific issues about the coding
style have been mentioned, but I haven't seen any real arguments to defend them, except the
dogmatic anti-goto attitude. (And you accuse others of prejudice!)

I stumbled upon a rather apt quote attributed to Dijkstra regarding his well-known article
about not using goto. Perhaps you should ponder on it for a moment:
"I have the uncomfortable feeling that others are making a religion out of it, as if the conceptual
problems of programming could be solved by a single trick, by a simple form of coding discipline!"

> But that just brings us back to the same point: if they think my code is too complicated or just
> plain crap, why can’t they do better? Why can they not come up with something simpler to solve
> the same real-world problem? But they cannot.

If this is just an exercise, perhaps a smaller one would do. Also of course, you could do your
part in comparing the options. It's not like anyone has a responsibility to attend to exercises
someone else gives on the Internet. If you want to take the lack of submissions as proof of
your superior position, you are of course free to do so. But it doesn't mean you are right,
or that others will want to talk to you after that.

Somehow, I'm starting to hope you are just trolling. That would at least _explain_ all of this. :)

Re: If this is just an exercise, perhaps a smaller one would do

Posted Dec 30, 2014 21:00 UTC (Tue) by ldo (guest, #40946) [Link] (4 responses)

My sample code already includes both simple cases and complex ones. Others have already tackled the simple cases, and frankly, I don’t think they prove anything either way. But that is the nature of classroom exercises; they never quite prepare you for real-world code.

It is the complex cases that demonstrate the scalability of my technique over GOTOs. This is evidenced by the fact that no one can come up with an equivalent using GOTOs that is even correct.

Re: If this is just an exercise, perhaps a smaller one would do

Posted Dec 30, 2014 22:56 UTC (Tue) by cesarb (subscriber, #6266) [Link] (3 responses)

> This is evidenced by the fact that no one can come up with an equivalent using GOTOs that is even correct.

Absence of evidence is not evidence of absence.

You posted a large block of source code, without any testcases, written in an unorthodox coding style. To do a decent rewrite, first one would have to understand the code, which is made harder by the uncommon code style and by the fact that it's a mathematical algorithm; experienced programmers tend to "pattern match" the source code visually to skip unimportant details, but this can only happen if one is familiar with the coding style. Then one would have to carefully reformat each function, converting the do-nothing loops into straight code without breaking the do-something loops; this is made harder because the same keyword (break) is being used to exit both. Finally, one would have to refactor the code into a more traditional style, being very careful to not break anything (since there are no testcases).

That's a lot of work for something which would be used solely to score a rhetorical point and then thrown away; the maintainer of the code (you) would not accept the changed code, since you're being so dogmatic about your coding style.

It could be worth doing the exercise if there was any chance that it would not be a wasted effort. Since there isn't, there are better uses for our time, and that's probably why nobody's bothering.

Re: Absence of evidence is not evidence of absence.

Posted Dec 31, 2014 21:01 UTC (Wed) by ldo (guest, #40946) [Link] (2 responses)

*Yawn* More content-free hand-waving hot air. Show us the code!

Re: Absence of evidence is not evidence of absence.

Posted Dec 31, 2014 22:25 UTC (Wed) by bronson (subscriber, #4806) [Link] (1 responses)

If you want something to be rewritten so bad, why don't you rewrite the goto-laden kernel examples cesarb posted? It should be easy to demonstrate how much more readable your technique is.

Re: why don't you rewrite the goto-laden kernel examples cesarb posted?

Posted Jan 1, 2015 21:46 UTC (Thu) by ldo (guest, #40946) [Link]

Because they’re just not that interesting.

The Story So Far...

Posted Dec 29, 2014 18:36 UTC (Mon) by acollins (guest, #94471) [Link] (12 responses)

Looking at your spuhelper.c example has convinced me of the exact opposite. I find your error paths nearly unreadable.

A goto label would provide a clear indication of where error handling takes place, instead, in your code I have to look at all the surrounding context to figure out what on earth "break" means in that particular context (is it exiting a real loop normally or is it a do-nothing loop to avoid a goto?). You mix error handling and normal loop control flow in a very confusing manner.

Contrast this with far more complex kernel code I've read that is much more understandable at first glance, largely due to readable error handling.

A number of people have already replied with similar thoughts but I'll reiterate, instead of lashing out, perhaps take a look at the feedback and reconsider your code.

Re: Contrast this with far more complex kernel code

Posted Dec 30, 2014 20:36 UTC (Tue) by ldo (guest, #40946) [Link] (11 responses)

I’d be curious to know if anybody can point to an example in the kernel which has to deal with error recovery from inside a loop, similar to my code.

That is the one case where the goto-ists have so far fallen flat on their faces when trying to “improve” my code.

Re: Contrast this with far more complex kernel code

Posted Dec 30, 2014 22:27 UTC (Tue) by cesarb (subscriber, #6266) [Link] (10 responses)

> I’d be curious to know if anybody can point to an example in the kernel which has to deal with error recovery from inside a loop, similar to my code.

Sure. A very simple one, which should be easy to follow: the deeply nested unuse_mm() loop, which can be found at mm/swapfile.c. This is one I'm familiar with, other kernel developers most certainly know of better examples.

The first thing to notice is that, for better readability, it's split into several functions, one for each nesting level of the loop. The outermost loop, within unuse_mm, loops over the vmas and calls unuse_vma for each one. The next level, within unuse_vma, calls unuse_pud_range for each pgd; unuse_pud_range calls unuse_pmd_range for each pud; unuse_pmd_range calls unuse_pte_range for each pmd; and unuse_pte_range calls unuse_pte for each pte. Finally, unuse_pte does the real work, and it's where an error can happen.

Yes, we have a 5-level nested loop here, 4 of them looping over the 4-level abstract page table, with errors propagating outward from the innermost loop. Since each loop is in its own function, it doesn't even need a "goto"; it can use a straight "return". But the innermost function (unuse_pte) does have an example of the traditional "cleanup" use of goto.
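The structure described here, one function per nesting level with the error handed outward by return, can be sketched in a few lines. This is not the mm/swapfile.c code: the grid, its dimensions and the visit_* names are hypothetical, and only three levels are shown.

```c
/* Hypothetical three-level walk over a grid, one function per level. */

#define ROWS 3
#define COLS 4

/* Innermost step: the only place where an error can originate. */
static int visit_cell(int grid[ROWS][COLS], int r, int c)
{
    if (grid[r][c] < 0)
        return -1;         /* error */
    grid[r][c] += 1;       /* the "real work" */
    return 0;
}

/* Middle level: loops over one row and stops at the first error. */
static int visit_row(int grid[ROWS][COLS], int r)
{
    int c, err;

    for (c = 0; c < COLS; ++c) {
        err = visit_cell(grid, r, c);
        if (err)
            return err;    /* leaves this loop and reports upward */
    }
    return 0;
}

/* Outer level: loops over all rows and passes any error to the caller. */
int visit_grid(int grid[ROWS][COLS])
{
    int r, err;

    for (r = 0; r < ROWS; ++r) {
        err = visit_row(grid, r);
        if (err)
            return err;
    }
    return 0;
}
```

Note that an error leaves the cells visited so far modified; nothing is rolled back at these levels, so any recovery is left to the outermost caller.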

Now how about an example from XFS, since we're supposed to be talking about XFS? I randomly looked at its source code, and found xlog_alloc_log. That function has to deal with error recovery from before the loop, from after the loop, and from within the loop, and the error recovery must be run only on failure. It's an allocation function; if there's no failure, it must keep everything it allocated, and if there's any failure, it must release everything it has allocated.
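The shape of such an allocation function, keeping everything on success and releasing everything on a failure inside the loop, can be sketched in userspace C. This is not xlog_alloc_log: struct node, chain_alloc, chain_free and the fail_at parameter are hypothetical.

```c
#include <stdlib.h>

struct node {
    int id;
    struct node *next;
};

/* Frees a whole (possibly partial) chain; safe to call with NULL. */
void chain_free(struct node *head)
{
    while (head != NULL) {
        struct node *next = head->next;
        free(head);
        head = next;
    }
}

/* Builds a chain of `count` nodes (count > 0). `fail_at` simulates an
 * allocation failure at that index (-1 = never).
 * Returns the head of the chain, or NULL on failure. */
struct node *chain_alloc(int count, int fail_at)
{
    struct node *head = NULL;
    struct node **tail = &head;
    int i;

    for (i = 0; i < count; ++i) {
        struct node *n = (i == fail_at) ? NULL : malloc(sizeof *n);
        if (n == NULL)
            goto out_free;     /* failure inside the loop */
        n->id = i;
        n->next = NULL;
        *tail = n;             /* link it in so the cleanup can find it */
        tail = &n->next;
    }
    return head;               /* success: keep everything */

out_free:
    chain_free(head);          /* failure: release what was built so far */
    return NULL;
}
```

Each node is linked into the chain as soon as it exists, so the single failure label can release a partial result of any length.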

Re: Finally, unuse_pte does the real work, and it's where an error can happen.

Posted Dec 31, 2014 21:56 UTC (Wed) by ldo (guest, #40946) [Link] (9 responses)

Speaking of which, I notice there is no error checking on this call to pte_offset_map_lock. Can that never fail?

And what happens if unuse_pte returns an error, anyway? Do the outer routines abort, and leave their work half-done? Is this supposed to be cleanup code, or not?

Re: Finally, unuse_pte does the real work, and it's where an error can happen.

Posted Dec 31, 2014 22:56 UTC (Wed) by cesarb (subscriber, #6266) [Link] (8 responses)

> Speaking of which, I notice there is no error checking on this call to pte_offset_map_lock. Can that never fail?

Looking at how it's implemented, that call does two things: it temporarily maps the page table, in a way which won't fail (in some architectures, it's a simple arithmetic operation, and in others, it uses a mechanism which has a number of slots reserved for temporary mappings), and it locks a spinlock. If the spinlock is unlocked, it can't fail; if the spinlock is locked, it will wait until its current owner unlocks it, so again it can't fail.

> And what happens if unuse_pte returns an error, anyway? Do the outer routines abort, and leave their work half-done?

The answer here is yes!

This code is ultimately called from within the swapoff system call (further down in the same file). There's in fact another outermost loop, try_to_unuse, which loops over the swapfile's pages and tries to unuse (yeah...) each one in turn. Here is where it's called:

set_current_oom_origin();
err = try_to_unuse(p->type, false, 0); /* force unuse all pages */
clear_current_oom_origin();

if (err) {
    /* re-insert swap space back into swap_list */
    reinsert_swap_info(p);
    goto out_dput;
}

Just before this fragment of code, the swapfile (or swap partition) is marked as disabled on the list of swapfiles. The try_to_unuse function then tries to move all the pages which currently reside in that swapfile back into main memory, and to make all page tables which pointed to these pages on the swap point to them in main memory, so the swapfile can be safely removed.

If try_to_unuse fails (usually because there's not enough memory to hold what's currently on the swapfile to be removed), this code enables the swapfile again (this part of the code used to be almost a duplicate of the corresponding part of the swapon code; I refactored it into a separate function used by both). It doesn't try to swap the pages it already swapped in back out; if there's a need to free some of the main memory, the normal memory management code will swap them out again.

If try_to_unuse succeeds, on the other hand, the swapfile is now empty; the code after the fragment I pasted above releases all the resources which were allocated by the swapon system call for this swapfile, and returns success to userspace.

Re:The answer here is yes!

Posted Jan 1, 2015 21:44 UTC (Thu) by ldo (guest, #40946) [Link] (7 responses)

In that case, this code is not very interesting. The interesting case would be the construction of a complex data structure, piece by piece, where each individual piece construction could fail. If any failures occur, then all the partially-constructed pieces so far need to be freed before returning an error indication to the caller. Only if all the construction steps succeed can the complete object be returned to the caller.

In my opinion, this would be just about the ultimate stress test of your error-recovery technique.

Search through my example for the comment “so I don't dispose of it yet” to see how I deal with this. You should find three instances.

Re:The answer here is yes!

Posted Jan 1, 2015 22:51 UTC (Thu) by reubenhwk (guest, #75803) [Link]

It seems like the vars around /* so I don't dispose of it yet */ are used as local variables where you're building/using something, and the point where result is assigned is where you're ready to call this data structure complete and ready to return. I do something similar. When I can, I try to ensure that a function either completely fails or completely succeeds with no in-between.

Re:The answer here is yes!

Posted Jan 1, 2015 23:12 UTC (Thu) by nix (subscriber, #2304) [Link]

There have been multiple worked examples of exactly this shown to you already.

Re:The answer here is yes!

Posted Jan 2, 2015 1:56 UTC (Fri) by cesarb (subscriber, #6266) [Link]

> In that case, this code is not very interesting. The interesting case would be the construction of a complex data structure, piece by piece, where each individual piece construction could fail.

Well, swapoff is the destruction of a complex data structure, not its construction. Its construction is the swapon system call, further down in the same file.

The same design pattern can be found all over the Linux kernel: there's a construction function, which constructs whatever complex data structure the module needs, and a destruction function, which releases it.

> If any failures occur, then all the partially-constructed pieces so far need to be freed before returning an error indication to the caller. Only if all the construction steps succeed can the complete object be returned to the caller.

In the swapon system call, there's a single error handling label "bad_swap" which frees all the partially-constructed data structures, undoes block device configuration, closes the opened file, and so on. It falls through to the "out" label, used for both the success and failure cases, which releases any temporarily used resources.
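
The shape of that exit path can be sketched like this. It is not the swapon code: struct area, area_create and the scratch buffer are hypothetical stand-ins, chosen only so there is something to free on the failure path and something temporary to release on every path.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

struct area {
    char *name;
    char *map;
};

int area_create(struct area *a, const char *name, size_t size)
{
    char *scratch = NULL;   /* temporary, released on every path */
    int err;

    a->name = NULL;
    a->map = NULL;

    scratch = malloc(strlen(name) + 1);
    if (!scratch) {
        err = -ENOMEM;
        goto bad_area;
    }
    strcpy(scratch, name);

    a->name = malloc(strlen(scratch) + 1);
    if (!a->name) {
        err = -ENOMEM;
        goto bad_area;
    }
    strcpy(a->name, scratch);

    if (size == 0) {        /* rejected after partial construction */
        err = -EINVAL;
        goto bad_area;
    }

    a->map = calloc(1, size);
    if (!a->map) {
        err = -ENOMEM;
        goto bad_area;
    }

    err = 0;
    goto out;               /* success skips the teardown */

bad_area:
    /* failure only: free(NULL) is a no-op, so no per-step labels needed */
    free(a->map);
    free(a->name);
    a->map = NULL;
    a->name = NULL;
out:
    /* both success and failure: release temporaries */
    free(scratch);
    return err;
}
```

Because every member starts out as NULL, the single failure label can free everything unconditionally, whichever step failed.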

> Search through my example for the comment “so I don't dispose of it yet” to see how I deal with this. You should find three instances.

I see. You have two variables for the return value of the function: one which has its reference counter decremented at the end, and one which is actually returned from the function. You keep the allocated structure at the first one, and when everything's ready, you swap it with the second one. So in the failure case, the reference counter is decremented and it returns NULL; in the success case, the reference counter is not decremented and it returns the pointer to the structure.

It's an elegant way of doing it (though I'm annoyed at your inconsistency: twice you called it "result" and once you called it "Result"). It's also orthogonal to the use of goto versus pseudo-loop for cleanup: this trick can help simplify the code in both cases.
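
Stripped of the Python reference counting, the trick might look like this in plain C. This is a sketch of the idea rather than your code: struct pair, pair_new, pair_free and dup_string are hypothetical names, free() stands in for the reference-count decrement, and I used goto here only to show that the trick doesn't depend on the pseudo-loop.

```c
#include <stdlib.h>
#include <string.h>

struct pair {
    char *first;
    char *second;
};

/* Safe to call with NULL, like free(). */
void pair_free(struct pair *p)
{
    if (!p)
        return;
    free(p->first);
    free(p->second);
    free(p);
}

/* Returns NULL if s is NULL or if the allocation fails. */
char *dup_string(const char *s)
{
    char *copy;

    if (!s)
        return NULL;
    copy = malloc(strlen(s) + 1);
    if (copy)
        strcpy(copy, s);
    return copy;
}

struct pair *pair_new(const char *first, const char *second)
{
    struct pair *result = NULL; /* what the caller receives */
    struct pair *tmp;           /* under construction, always disposed of */

    tmp = malloc(sizeof(*tmp));
    if (!tmp)
        goto done;
    tmp->first = NULL;
    tmp->second = NULL;

    tmp->first = dup_string(first);
    if (!tmp->first)
        goto done;
    tmp->second = dup_string(second);
    if (!tmp->second)
        goto done;

    /* all steps succeeded: hand the object over */
    result = tmp;
    tmp = NULL;                 /* so it is not disposed of below */

done:
    pair_free(tmp);             /* no-op on success */
    return result;
}
```

The cleanup at the end runs unconditionally; whether the object survives depends only on whether the handover happened.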

What you did is actually a design pattern in modern C++:

std::unique_ptr<foo> fn(/*args*/)
{
    auto ret = std::make_unique<foo>(/*args*/);
    /* ...code which can throw an exception... */
    return ret;
}

Here, "ret" is a std::unique_ptr<foo>, which contains a pointer to a "foo". When this variable goes out of scope, which will happen if an exception is thrown, whatever is pointed to by "ret" will be deleted. When it reaches the "return ret", however, the pointer is moved (as in std::move) to the return value of the function, so when it leaves the scope, "ret" is pointing to NULL, and the returned object isn't deleted.

Re:The answer here is yes!

Posted May 23, 2017 13:21 UTC (Tue) by mirabilos (subscriber, #84359) [Link] (3 responses)

What, you don’t initialise them to NULL and just free them afterwards?

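struct foo *foo = calloc(1, sizeof(*foo)); /* size of the struct, not of the pointer */

if (!foo)
    return (NULL); /* the struct itself could not be allocated */
if (!(foo->a = geta()))
    goto out;
if (!(foo->b = getb()))
    goto out;
if (!(foo->c = getc()))
    goto out;
if (!(foo->d = getd()))
    goto out;
return (foo);

out:
free(foo->d);
free(foo->c);
free(foo->b);
free(foo->a);
free(foo); /* release the container as well */
return (NULL); /* error */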

Re:The answer here is yes!

Posted May 23, 2017 14:08 UTC (Tue) by anselm (subscriber, #2796) [Link] (2 responses)

This sort of thing may work most of the time, but just for the record, while calloc() fills the whole structure in question with all-bits-zero bytes, there is no guarantee in the C standard that an individual structure entry like foo->a will in fact turn out to be a valid null pointer afterwards. (The C language does not require the null pointer to be “all bits zero”, even though expressions like “!(foo->a = geta())” must still return 1, as in “true”, if the geta() call yields a null pointer.)

If you're unlucky this means that if, say, you error out when trying to allocate foo->b, the “free(foo->d);” at the beginning of the out: path might try to free something at the all-bits-zero-address-that-isn't-a-null-pointer that hasn't previously been allocated, which isn't allowed. This shortcut looks enticingly convenient and probably works on most platforms today but people who are interested in safe, portable, and standard-conforming C code shouldn't be using it.
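
A portable variant only needs the members to be assigned NULL explicitly, since assigning the constant NULL always produces a real null pointer whatever its bit pattern. Here is a minimal sketch, assuming a struct with four pointer members; get_part and its fail_at parameter are hypothetical stand-ins for the geta()..getd() calls, added so that a failure can be provoked at a chosen step.

```c
#include <stdlib.h>

struct foo {
    void *a, *b, *c, *d;
};

/* Hypothetical part allocator: fails when step == fail_at. */
void *get_part(int step, int fail_at)
{
    return step == fail_at ? NULL : malloc(16);
}

void foo_free(struct foo *foo)
{
    if (!foo)
        return;
    free(foo->d);
    free(foo->c);
    free(foo->b);
    free(foo->a);
    free(foo);
}

struct foo *foo_new(int fail_at)
{
    struct foo *foo = malloc(sizeof(*foo));

    if (!foo)
        return NULL;

    /* Explicit assignment yields genuine null pointers, with no
     * reliance on all-bits-zero memory from calloc(). */
    foo->a = foo->b = foo->c = foo->d = NULL;

    if (!(foo->a = get_part(1, fail_at)))
        goto out;
    if (!(foo->b = get_part(2, fail_at)))
        goto out;
    if (!(foo->c = get_part(3, fail_at)))
        goto out;
    if (!(foo->d = get_part(4, fail_at)))
        goto out;
    return foo;

out:
    foo_free(foo);  /* free(NULL) is a no-op for the unset members */
    return NULL;
}
```

The cleanup path is the same as before; the only change is that the members it frees are guaranteed to be either null pointers or pointers obtained from malloc().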

Re:The answer here is yes!

Posted May 26, 2017 9:55 UTC (Fri) by mirabilos (subscriber, #84359) [Link] (1 responses)

Sure, but that can be implementation-defined, and POSIX does do this (in the next version):

http://austingroupbugs.net/view.php?id=940#c2696

Re:The answer here is yes!

Posted May 26, 2017 23:06 UTC (Fri) by nix (subscriber, #2304) [Link]

Note the freedom which still exists: there can be *other* bit patterns that also represent the null pointer. :)


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds