Leading items
Welcome to the LWN.net Weekly Edition for September 1, 2022
This edition contains the following feature content:
- Python multi-level break and continue: not all proposed extensions to Python are adopted; the ability to break out of multiple levels of loops was one that didn't make it.
- Debian to vote on its firmware path: the latest step in Debian's multi-decade discussion on how to handle non-free firmware.
- Ushering out strlcpy(): an unloved string-copy function may finally be removed from the kernel, but it's surprisingly complicated.
- Toward a better definition for i_version: after nearly 30 years, filesystem developers are trying to define (and refine) what the version count means.
- Crash recovery for user-space block drivers: an extension to the 6.0 user-space block device mechanism.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Python multi-level break and continue
A fairly lengthy discussion of whether there should be a way to break out of (or continue) more than one level of nested loops in Python recently took place in the Ideas category of the language's discussion forum. The idea is attractive, at least in an abstract sense—some other languages support jumping out of multiple loops at once—but it seems unlikely to go anywhere for Python. The barrier to new features is fairly high, for sure, but there is also a need for proponents to provide real-world examples that demonstrate their advantages. That, too, is a difficult bar to clear, as was seen in the discussion.
Idea
A user called "Python Millionaire" posted
an example of some loops that they had written to process data about some
basketball players; "I want to continue or break out of the nested loops
because I am no longer interested in the player
". They proposed
adding an integer to break and continue statements to
specify how
many loops to operate on. For example:
    for player in all_players:
        for player_tables in all_tables:
            for version in player_tables:
                # things have gone wrong, need to break
                break 2
            this_is_not_reached = True
        this_line_is_called()
If Python ever gets this feature, though, the (un-Pythonic?) integer mechanism will surely not be part of it. It is terribly fragile when code gets shuffled around, for one thing. Also, as Bryan Van de Ven observed, it would be "a usability nightmare" because it would be difficult "to quickly locate the target of your fancy goto by means of a simple code grep".
This is not the first time the idea has come up; the feature was raised 15 years ago by Matt Chisholm in PEP 3136 ("Labeled break and continue"). The PEP was rejected by Guido van Rossum for a variety of reasons, including a worry that the feature would "be abused more than it will be used right, leading to a net decrease in code clarity". Peter Suter pointed to the PEP in the discussion, noting that Millionaire "would presumably need at least some very convincing examples that outweigh the reasons given in the rejection notice".
PEP 3136 offered several possibilities for the syntax of the feature, and did not choose one, which was another reason Van Rossum rejected it. But it seems clear that the labeled version is seen as the most viable path, even among those who are against adding the feature. A labeled break might look something like the following:
    for a in a_list as a_loop:
        for b in b_list as b_loop:
            if ...
                break a_loop

That break would exit both loops immediately; a labeled continue would go to the next iteration of the named loop.
Millionaire thought that after 15 years it might be time to reconsider the idea. They lamented that the approaches suggested to work around the lack of multi-level break are "infinitely clumsier" and "anti-pythonic". Suter agreed with that to a certain extent, noting that the first search result for "python multiple for loop break" is a Stack Overflow answer that is overly clever. Suter adapted it to the original example as follows:
    for sport in all_sports:                          # "for sport" loop
        for player in all_players:
            for player_tables in all_tables:          # "for player_tables" loop
                for version in player_tables:
                    # things have gone wrong, go to next iteration of all_sports loop
                    break
                else:
                    continue
                break
            else:
                continue
            break

That uses the else clause for loops, which executes only if the loop runs to completion without a break. So if the innermost loop runs to completion, the continue in the else will result in another iteration of the "for player_tables" loop. If the inner loop uses break, however, it will break twice more, all the way back to the "for sport" loop. As can be seen from that convoluted description, the construct is far from readable—or maintainable.
Other ways
There are multiple ways to accomplish what Millionaire is trying to do, some of which were described in the discussion. Using flags is one obvious, perhaps clunky, mechanism; another is to use exceptions, though that may not be much less clunky (a sketch of the exception approach appears below, after D'Aprano's example). Overall, though, several participants thought that the code itself should be refactored in some fashion. Chris Angelico thought that moving the search operation into its own function, which can return once the outcome is known, would simplify things. Steven D'Aprano agreed:
The obvious fix for that ugly code is to refactor into a function:

    def handle_inner_loops(sport):
        for player in all_players:
            for player_tables in all_tables:
                for version in player_tables:
                    if condition:
                        # things have gone wrong, bail out early.
                        return
                    block()

    for sport in all_sports:
        handle_inner_loops(sport)

The solution to "Python needs a way to jump out of a chunk of code" is usually to put the chunk of code into a function, then return out of it.
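The exception-based workaround mentioned above was not spelled out in the thread; a minimal sketch, reusing the names from Millionaire's original example plus a hypothetical StopPlayer exception and something_is_wrong() test, might look like this:

    class StopPlayer(Exception):
        """Flow-control exception: stop processing the current player."""

    for player in all_players:
        try:
            for player_tables in all_tables:
                for version in player_tables:
                    if something_is_wrong(version):   # hypothetical condition
                        # the equivalent of "break 2": leave both inner loops
                        raise StopPlayer
        except StopPlayer:
            pass
        this_line_is_called()

It works, and it scales to any number of loop levels, but defining an exception class purely for flow control is exactly the kind of clunkiness that Millionaire was complaining about.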
Millionaire thought that requiring refactoring into a function was less than ideal. It is also not possible to implement a multi-level continue that way. Beyond that, Millionaire pushed back on the notion that labeled break/continue was a better syntactic choice in all cases; offering the numeric option too would give the most flexibility. There was little or no support for keeping the numeric version, however.
But the arguments given in support of the feature were generally fairly weak; they often used arbitrary, "made up" examples that demonstrated a place where multi-level break could be used, but were not particularly compelling. For example, "Gouvernathor" posted the following:
    for system in systems:
        for planet in system:
            for moon in planet.moons:
                if moon.has_no_titanium:
                    break 2  # I don't want to be in a system with a moon with no titanium
                if moon.has_atmosphere:
                    break  # I don't want to be in the same planetary system
As D'Aprano pointed out, though, that is hardly realistic; "your example seems so artificial, and implausible, as to be useless as a use-case for multilevel break". He reformulated the example in two different ways, neither of which exactly duplicated the constraints of Gouvernathor's example, however. He also had some thoughts on what it would take to continue pursuing the feature:
To make this proposal convincing, we need a realistic example of an algorithm that uses it, and that example needs to be significantly more readable and maintainable than the refactorings into functions, or the use of try…except (also a localised goto).

If you intend to continue to push this idea, I strongly suggest you look at prior art: find languages which have added this capability, and see why they added it.
Angelico noted that he has used the Pike programming language, which does have a labeled break. He found that he had used the feature twice in all of the Pike code he has written. Neither of the uses was particularly compelling in his opinion; one was in a quick-and-dirty script and the other is in need of refactoring if he were still working on that project, he said. That was essentially all of the real-world code that appeared in the discussion.
Paul Moore suggested that needing a multi-level break may be evidence that the code needs to be reworked; "Think of it in terms of 'having to break out of multiple loops is a code smell, indicating that you should re-think your approach'." Though he questioned the value of doing so, he did offer up a recent example:
I don't think everyone piling in with their code samples is particularly helpful. The most recent example I had, though, was a "try to fetch a URL 10 times before giving up" loop, inside the body of a function. I wanted to break out if one of the tries returned a 304 Not Modified status. A double-break would have worked. But in reality, stopping and thinking for a moment and factoring out the inner loop into a fetch_url function was far better, named the operation in a way that was more readable, and made the outer loop shorter and hence more readable itself.
Workarounds?
Millionaire complained that all of the suggestions that had been made for ways to restructure the code were workarounds of various sorts; "They are all ways of getting around the problem, not actual solutions presented by the programming language, which should be the case." But Oscar Benjamin said that he could not "picture in my mind real maintainable code where labelled break is significantly better than a reorganisation". All he can see in his mind is the feature "being used to extend the kind of spaghetti code that I already wish people didn't write". There is, of course, an alternative: "if real life examples were provided then we could discuss the pros and cons in those cases without depending on my imagination".
Meanwhile, others in the discussion pushed back against the workaround complaint and also reiterated calls for real-world code. Millionaire returned to their earlier basketball example, with a beefed-up version that uses labeled break and continue. While Millionaire seemed to think it was a perfectly readable chunk of code that way, others were less impressed. Angelico questioned some of the logic, while Van de Ven thought it did not demonstrate quite what Millionaire was claiming:
A 50-line loop body and eight levels of indentation (assuming this is inside a function) and this is the good version? Having a multi-break won't (didn't) fix that. All the complexity in that code stems from trying to do ad-hoc relational querying with imperative code, at the same time as pre- and post-processing.
Van de Ven and Millionaire went back and forth a few times, with Millionaire insisting that Van de Ven's refactorings and other suggestions were not mindful of various constraints (which were never mentioned up front, of course). Van de Ven thought that the episode was an example of an XY problem, where someone asks about their solution rather than their problem, but he still persisted in trying to show Millionaire alternative ways to structure their code. There are, seemingly, several avenues that Millionaire could pursue to improve their code overall, while also avoiding the need for multi-level break—if they wished to. But that is apparently not a viable path for Millionaire.
The discussion was locked by David Lord shortly thereafter; it was clear that it had run its course.
The convoluted examples presented in the thread were not particularly helpful to the cause, in truth. Users who want to add a feature to Python should have an eye on compelling use cases from the outset, rather than generalized feelings that "this would be a nice addition" to the language. If, for example, code from the standard library had been shown, where a multi-level break would have significantly improved it, the resurrected feature idea might have gained more traction. There are lots of other huge, open Python code bases out there, as well; any of those might provide reasonable examples. So far, at least, no one has brought anything like that to the fore.
This is something of a recurring theme in discussions about ideas for new Python features. To those who are proposing the feature, it seems like an extremely useful, rather straightforward addition to the language, but the reception to the idea is much different than expected. Python developers need to cast a critical eye on any change to the language and part of that is to determine whether the benefit outweighs the substantial costs of adopting it. That is not going to change, so it makes sense for those who are looking to add features to Python to marshal their arguments—examples—well.
Debian to vote on its firmware path
Dealing with the non-free firmware that is increasingly needed to install Debian has been a hot topic for the distribution over the past few months. The problem goes back further still, of course, but Steve McIntyre re-raised the issue in April, which resulted in a predictable lengthy discussion thread on the debian-devel mailing list. Now McIntyre has proposed a general resolution (GR) with the intent of resolving how to give users a way to install the distribution on their hardware while trying to avoid trampling on the "100% free" guarantee in the Debian Social Contract. Finding the right balance is going to be tricky as is shown by the multiple GR options that have been proposed in the discussion.
The basic problem is that the use of downloadable firmware in computer systems is on the rise and most of that firmware is not free software. The official Debian installer only incorporates free software (and firmware), which leads to serious problems for many users. McIntyre said in April:
Today, a user with a new laptop from most vendors will struggle to use it at all with our firmware-free Debian installation media. Modern laptops normally don't come with wired ethernet now. There won't be any usable graphics on the laptop's screen. A visually-impaired user won't get any audio prompts. These experiences are not acceptable, by any measure. There are new computers still available for purchase today which don't need firmware to be uploaded, but they are growing less and less common.
Currently, the Debian installer (sometimes abbreviated "d-i") image only includes packages from the official "main" repository that consists of software and firmware which conforms to the Debian Free Software Guidelines (DFSG). Obviously, main does not include the non-free firmware, which lives in the "non-free" repository instead. The same team that creates the official installer images also creates unofficial, non-free images, which is what most users actually need to install the distribution. The Debian community would much prefer not to have to provide the non-free version, but that is not really an option in today's hardware world—at least if the project wants users to actually be able to install and use Debian.
Beyond starting the mailing-list discussion, McIntyre also gave a talk at DebConf in July. One of the problems he had identified is that when users install with the non-free installer, it enables the non-free repository on their systems. That could well mean that those users unknowingly install additional non-free software simply because the Debian package manager (APT or something built on top of it) makes it directly available. That particular problem could be solved by creating a separate non-free-firmware repository; that repository was created as part of the work done during DebConf, though there is still more to do to use it in Debian.
Proposal
So McIntyre has proposed a GR with a single option; based on a suggestion by Russ Allbery back in April, McIntyre thinks it is "better to leave it for other people to come up with the text of options that they feel should also be on the ballot". His option is to include the non-free-firmware packages in the official installer, but to provide ways to inform users about the type of firmware being used and to give them ways to disable the non-free functionality and installation if desired. If that should pass, there would only be a single installer, so the "fully free" installer would no longer be built.
The proposal immediately elicited far more seconds than required (16 are shown on the GR page and five are needed). Naturally, it also drew some questions and comments, as well as some additional proposals for the ballot. Timo Lindfors asked for some additional information to be made available to users; for example:
As it is pretty impossible to write a clear definition of firmware, we should require packages in non-free-firmware to clearly explain where the code will get executed to allow people to make informed decisions. Some people are more ok with having code run on an external device than on the main CPU.
Lindfors also wanted the project to keep producing the fully free installer and to clearly distinguish between the two installers. McIntyre was amenable to adding firmware descriptions along the lines of what was requested, but thought that Lindfors's other requests were better handled with further ballot options; "I imagine that you will quite easily get seconds here".
Wouter Verhelst wondered about enabling the non-free-firmware repository on installed systems by default. He thinks that only makes sense "if the installer determines that packages from that component are useful for the running system (or if the user explicitly asked to do so)". McIntyre agreed, saying that his proposal text was unclear; he provided some modified text that would make that clearer.
But Ansgar Burchardt thought that it made sense to enable non-free-firmware even if the installer did not need it. Detachable devices (e.g. USB) might require firmware, for example. "For the same reason the system should probably install all (reasonable) firmware by default, just like we install all kernel drivers even for devices that are not present on the target system."
Simon Richter wondered whether McIntyre's proposal also required changing the Debian Social Contract (DSC); he pointed to the first section of the contract ("Debian will remain 100% free") and suggested that an official installer with non-free firmware would violate that. He also alluded to section five, which allows for the non-free and contrib repositories, but not as "part of Debian".
Some thread participants thought that the final line of section one ("We will never make the system require the use of a non-free component.") was not being violated by the proposal. But section five seems more problematic because it clearly says that non-free, thus by extension non-free-firmware, is not part of the Debian system, so how could an official installer incorporate that? As Simon Josefsson put it: "what is being proposed here is to replace our current DSC-compatible free software installer images with non-free. That goes significantly further than what the spirit of DSC§5 suggests."
Tobias Frost disagreed because there was no requirement that the non-free firmware be used; "there are just additional bits in there which help people to actually be able to install Debian on some modern machines". As might be guessed, others disagreed, but there are also some questions of what the majority requirements for passing the GR would be; the Debian Constitution requires a 3:1 supermajority for changing either the DSC or DFSG (the Constitution, too, for that matter). Josefsson is worried that those requirements may not be followed:
I believe it would be bad for the project if the supermajority requirements of changing a [foundational] document is worked around by approving a GR vote with simple majority that says things contrary to what the DSC says.
Project secretary Kurt Roeckx said that he had no plans to require a 3:1 supermajority for anything proposed so far, since none targets changing the social contract. He does not think that the secretary has the power to impose the supermajority requirement on the vote based on their interpretation of the proposal, though that might result in something of a mess. If a GR passes with a regular majority but might conflict with the DSC, DFSG, or Constitution, "the Secretary might have to decide if it conflicts or not, and if it conflicts void the GR".
For the purposes of putting the firmware question to rest for good, Allbery would like to see proposals that change the social contract and thus require a 3:1 supermajority. A simple addition to section five allowing non-DFSG-compliant firmware in the installer would suffice.
The failure mode that I'm worried about here is that a ballot option passes expressing a position that we should include non-free firmware but since it doesn't explicitly update the Social Contract some folks who disagree with this direction for Debian continue to believe doing so is invalid and we don't actually put the argument to rest. Also, if the 3:1 majority option doesn't pass but a 1:1 option that doesn't require a supermajority does pass, that's also useful information. (For example, I believe that would imply that such an installer has to continue to be labeled as unofficial and not a part of the Debian system, since I think that's the plain meaning of point 5 of the Social Contract.)
Other options
Two other proposals have been posted and seconded, and there is a third that is in an indeterminate state at this point, but may well make the GR ballot; it is also not impossible that another option or three could crop up before the discussion period ends on September 3. Gunnar Wolf simply proposed ensuring that both installers would continue to be built, though the non-free version would be highlighted:
Images that do include non-free firmware will be presented more prominently, so that newcomers will find them more easily; fully-free images will not be hidden away; they will be linked from the same project pages, but with less visual priority.
That proposal (proposal B on the GR page) garnered nine seconds (including McIntyre), but also drew another proposal that is seemingly in procedural limbo as of this writing. Josefsson proposed something that is effectively the antithesis of McIntyre's (proposal A) and proposal B; it also embodies the status quo at this point by forcing any installer with non-free firmware to be "unofficial" and to not be distributed as part of Debian. It appears to have been seconded enough times, though some of those seconds are not because the person agrees with Josefsson; they simply think that the option should be available to the voters. Meanwhile, there was a procedural hiccup and the proposal does not appear on the GR page (as of this writing).
That leaves proposal C, which is a simplified statement in support of non-free firmware for the installer that leaves out the details in McIntyre's and Wolf's proposals; it was proposed by Bart Martens and is just one sentence long:
The Debian project is permitted to make distribution media (installer images and live images) containing packages from the non-free section of the Debian archive available for download alongside with the free media in a way that the user is informed before downloading which media are the free ones.
It was seconded by five developers, including McIntyre again. In his message seconding it, Stefano Zacchiroli noted that it perhaps sidesteps conflicts with the social contract; making users choose is less than optimal, but may open some eyes as well:
Rationale: while it is not lost on me that in terms of usability having to choose between two options is a net loss for newcomers, I think this might be the only way to not run afoul of the Social Contract. Also, I think that on users that are even a little bit more knowledgeable and come to Debian for software freedom reasons, this choice might carry some real educational value (on how bad the consumer hardware market is these days, mostly).
There is a danger to pushing the free installer, which came up earlier in the thread, however. Since there are few systems that can actually work without the non-free firmware, using the free installer will lead to user unhappiness, which may impact more than just Debian's reputation, as Ted Ts'o noted:
Whether we recommend the one with non-free firmware or not (some have proposed that the "free" installer would have "visual priority", whatever that means), I suspect there will be various Linux newbie or FAQ's, external to Debian, that will warn users that using the "free" installer will just cause them pain and frustration.

So there may be some unintended consequences where new users may associate "100% free software" with "not functional" and "induces pain and frustration", such that it might end up *hurting* the cause of free software.
The voting will presumably start in early September and a resolution may come by mid-month. The constitutional question is cogent, so Allbery's suggestion to explicitly have an option that changes the social contract seems like a good one. It would be ugly to see something pass and then to get invalidated; even if it passed by 3:1 or more, which is a high bar to surmount, the question of conflicting language in the social contract would still linger. At a minimum, the GR will help determine the mood of the project with respect to non-free firmware in the installer, which is definitely a good start.
Ushering out strlcpy()
With all of the complex problems that must be solved in the kernel, one might think that copying a string would draw little attention. Even with the hazards that C strings present, simply moving some bytes should not be all that hard. But string-copy functions have been a frequent subject of debate over the years, with different variants being in fashion at times. Now it seems that the BSD-derived strlcpy() function may finally be on its way out of the kernel.
In the beginning, copying strings in C was simple. Your editor's dog-eared, first-edition copy of The C Programming Language provides an implementation of strcpy() on page 101:
    strcpy(s, t)
    char *s, *t;
    {
        while (*s++ = *t++)
            ;
    }
This function has a few shortcomings, the most obvious of which is that it will overrun the destination buffer if the source string is too long. Developers working in C eventually concluded that this could be a problem, so other string-copying functions were developed, starting with strncpy():
char *strncpy(char *dest, char *src, size_t n);
This function will copy at most n bytes from src to dest, so, if n is no larger than the length of dest, then that array cannot be overrun. strncpy() has a couple of quirks, though. It is defined to NUL-fill dest if src is shorter than n, so it ends up always writing the full array. If src is longer than n, then dest will not be NUL-terminated at all — an invitation to trouble if the caller does not take care. The return value is simply dest, regardless of whether truncation occurred, so it tells the caller nothing useful; as a result, checking for truncation is a bit tricky and often not done. [Thanks to Rasmus Villemoes for pointing out the error in our earlier description of the strncpy() return value.]
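The pitfall is easy to demonstrate outside of the kernel. This user-space sketch (not code from the discussion) shows that detecting truncation with strncpy() requires inspecting the destination buffer by hand:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char dest[8];
        const char *src = "a string that is too long";

        /* strncpy() writes at most sizeof(dest) bytes; because src is
           longer, dest is left without a terminating NUL byte. */
        strncpy(dest, src, sizeof(dest));

        /* Truncation must be detected (and termination restored) manually. */
        if (dest[sizeof(dest) - 1] != '\0') {
            dest[sizeof(dest) - 1] = '\0';
            printf("truncated: \"%s\"\n", dest);
        } else {
            printf("copied: \"%s\"\n", dest);
        }
        return 0;
    }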
strlcpy() and strscpy()
The BSD answer to the problems with strncpy() was to introduce a new function called strlcpy():
size_t strlcpy(char *dest, const char *src, size_t n);
This function, too, will copy a maximum of n bytes from src to dest; unlike strncpy(), it will always ensure that dest is NUL-terminated. The return value is always the length of src regardless of whether it was truncated in the copy or not; developers must compare the returned length against n to determine whether truncation has occurred.
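On systems whose C library provides strlcpy() (the BSDs and macOS, for example), the truncation check is a simple comparison against the buffer size; a minimal user-space sketch:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char dest[8];
        const char *src = "a string that is too long";

        /* dest is always NUL-terminated; the return value is the full
           length of src, even if only part of it was copied. */
        size_t len = strlcpy(dest, src, sizeof(dest));

        if (len >= sizeof(dest))
            printf("truncated to \"%s\" (%zu bytes would be needed)\n",
                   dest, len + 1);
        else
            printf("copied \"%s\"\n", dest);
        return 0;
    }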
The first uses of strlcpy() in the kernel entered briefly during the 2.4 stable series — sort of. The media subsystem had a couple of implementations defined as:
#define strlcpy(dest,src,len) strncpy(dest,src,(len)-1)
As one might imagine, there was not a lot of checking of return values going on at that point. That macro disappeared relatively quickly, but a real strlcpy() implementation appeared in the 2.5.70 release in May 2003; that release also converted many callers in the kernel over to this new function. Everything seemed good for quite some time.
In 2014, though, criticism of strlcpy() started to be heard, resulting in, among other things, an extended discussion over whether to add an implementation to the GNU C library; to this day, glibc lacks strlcpy(). Kernel developers, too, started to feel disenchanted with this API. In 2015, yet another string-copy function was added to the kernel by Chris Metcalf:
ssize_t strscpy(char *dest, const char *src, size_t count);
This function, like the others, will copy src to dest without overrunning the latter. Like strlcpy(), it ensures that the result is NUL-terminated. The difference is in the return value; it is the number of characters copied (without the trailing NUL byte) if the string fits, and -E2BIG otherwise.
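In kernel code, the corresponding check reduces to testing for a negative return value; a minimal sketch of a caller (the helper and its parameters are hypothetical, not taken from any of the patches discussed here):

    #include <linux/errno.h>
    #include <linux/string.h>

    /* Hypothetical helper: copy a label into a fixed-size buffer. */
    static int copy_label(char *dst, size_t dst_size, const char *src)
    {
        ssize_t ret = strscpy(dst, src, dst_size);

        if (ret < 0)        /* -E2BIG: src (plus its NUL) did not fit */
            return ret;

        /* ret is the number of characters copied, excluding the NUL */
        return 0;
    }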
Reasons to like strscpy()
Why is strscpy() better? One claimed advantage is the return value, which makes it easy to check whether the source string was truncated or not. There are a few other points as well, though; to get into those, it is instructive to look at the kernel's implementation of strlcpy():
    size_t strlcpy(char *dest, const char *src, size_t size)
    {
        size_t ret = strlen(src);

        if (size) {
            size_t len = (ret >= size) ? size - 1 : ret;
            memcpy(dest, src, len);
            dest[len] = '\0';
        }
        return ret;
    }
One obvious shortcoming is that this function will read the entire source string regardless of whether that data will be copied or not. Given the defined semantics of strlcpy(), this inefficiency simply cannot be fixed; there is no other way to return the length of the source string. This is not just a question of efficiency, though; as recently pointed out by Linus Torvalds, bad things can happen if the source string is untrusted — which is one of the intended use cases for this function. If src is not NUL-terminated, then strlcpy() will continue merrily off the end until it does find a NUL byte, which may be way beyond the source array — if it doesn't crash first.
Finally, strlcpy() is subject to a race condition. The length of src is calculated, then later used to perform the copy and returned to the caller. But if src changes in the middle, strange things could happen; at best the return value will not match what is actually in the dest string. This problem is specific to the implementation rather than the definition, and could thus be fixed, but nobody seems to think it's worth the effort.
The implementation of strscpy() avoids all of these problems and is also more efficient. It is also rather more complex as a result, of course.
The end of strlcpy() in the kernel?
When strlcpy() was first introduced, the intent was to replace all of the strncpy() calls in the kernel and get rid of the latter function altogether. In the 6.0-rc2 kernel, though, there are still nearly 900 strncpy() call sites remaining; that number grew by two in the 6.0 merge window. At the introduction of strscpy(), instead, Torvalds explicitly did not want to see any sort of mass conversion of strlcpy() calls. In 6.0-rc2, there are just over 1,400 strlcpy() calls and nearly 1,800 strscpy() calls.
Nearly seven years later, the attitude seems to have changed a bit; Torvalds now says that "strlcpy() does need to go". A number of subsystems have made conversion passes, and the number of strlcpy() call sites has fallen by 85 since 5.19. Whether it will ever be possible to remove strlcpy() entirely is unclear; strncpy() is still holding strong despite its known hazards and a decision to get rid of it nearly 20 years ago. Once something gets into the kernel, taking it out again can be a difficult process.
There may be hope, though, in this case. As Torvalds observed in response to a set of conversions from Wolfram Sang, most of the callers to strlcpy() never use the return value; those could all be converted to strscpy() with no change in behavior. All that would be needed, he suggested, was for somebody to create a Coccinelle script to do the work. Sang rose to the challenge and has created a branch with the conversions done. That work, obviously, won't be considered for 6.0, but might show up in a 6.1 pull request.
That would leave relatively few strlcpy() users in the kernel. Those could be cleaned up one by one, and it might just be possible to get rid of strlcpy() entirely. That would end a 20-year sporadic discussion on the best way to do bounded string copies in the kernel — all of those remaining strncpy() calls notwithstanding — at least until some clever developer comes up with an even better function and starts the whole process anew.
Toward a better definition for i_version
Filesystems maintain a lot of metadata about the files they hold; most of this metadata is for consumption by user space. Some metadata, though, stays buried within the filesystem and is not visible outside of the kernel. One such piece of metadata is the file version count, known as i_version. Current efforts to change how i_version is managed — and to make it visible to user space — have engendered a debate on what i_version actually means and what its behavior should be.
Early versions of i_version
Version 0.99.7 of the kernel was released on March 13, 1993. Those were exciting times; among other things, this release included a version of the mmap() system call that was, according to a young Linus Torvalds, "finally starting to really happen". This release also brought a new filesystem by Rémy Card called "ext2fs" — the distant ancestor of the ext4 filesystem currently used by many Linux systems.
As part of the ext2fs addition, the kernel's inode structure was augmented with a field called i_version, which was noted in a comment as being for the NFS filesystem. Nothing actually used that field until the 0.99.14 release in November of that year, when an ioctl() call was added to provide access to i_version. Those of us who were valiantly trying to use NFS on Linux in those days will remember that the server ran in user space then, so this ioctl() call was needed for i_version to be useful for NFS.
Initially, i_version was incremented whenever a given inode number was reused for a new file. This is an event that the NFS server needs to know about; otherwise a file handle created for one file could be used to access a completely different file that happened to end up with the same inode number, with aesthetically displeasing results. Version 2.2.3pre1 in 1999 added a new i_generation field to be used for this purpose instead, though it was not actually used until the 2.3.1pre1 development kernel in May of that year. When i_generation took over this role, i_version became a sort of counter for versions of the same file, incremented on changes in a filesystem-specific way (for filesystems that managed i_version at all).
While i_generation was all that the NFS server needed to carry out its task of creating the dreaded "stale file handle" errors when a file is replaced, there was still a role for i_version. NFS will perform far better if it can cache data locally, but doing so safely requires knowledge of when a file's contents change; i_version can be used for that purpose. Those who are interested in the details can read this article by Neil Brown on how cache consistency is maintained in current versions of NFS.
The trouble with i_version now
In the nearly 30 years since i_version was introduced, there has been little in the way of formal description of what the field is supposed to mean. In 2018, Jeff Layton added some comments describing how i_version was meant to be used, which clarified some details. As it turns out, though, some details remain to be nailed down, and they are creating trouble now.
Layton's text says: "The i_version must appear different to observers if there was a change to the inode's data or metadata since it was last queried". That has been the deal between the virtual filesystem (VFS) layer and the filesystems for years, but now there is a desire to alter it. In its current form, it seems that i_version is creating some performance difficulties.
As described above, NFS uses i_version to detect when a file has changed. If an NFS client has portions of a file cached, an i_version change will cause it to discard those caches, leading to more traffic with the server. The kernel's integrity measurement architecture (IMA), which ensures that files have not been tampered with by comparing them against trusted checksums, also uses i_version; if a file has changed, it must be re-checksummed before access can be allowed. In either case, spurious i_version increments will cause needless extra work to be done, hurting performance.
These unwanted increments are indeed happening, as it turns out, and the cause is an old villain: access-time (atime) tracking. By default, Unix filesystems will note every time that a file is read in that file's atime field. This record-keeping turns an otherwise read-only operation into a filesystem write and can be bad for performance on its own; for this reason, there are a number of options for disabling atime updates. If they are enabled, though, every atime update will, since it changes the metadata in a file's inode, increment i_version, with the bad results described above.
Rethinking i_version
Layton has decided to do something about that problem, resulting in a number of related patch sets. This patch, for example, makes i_version visible in the statx() system call, exposing it to user space for the first time (the old ext2 ioctl() command still exists, but it returns i_generation rather than i_version). The stated purpose is to make it easier to test its behavior and to facilitate the writing of user-space NFS servers. Another patch causes the XFS filesystem to not update i_version for atime updates; there is a similar patch for ext4. Finally, there is an update to the i_version comments making it explicit that atime updates should not increment that field.
Resistance to this work has come primarily from XFS developer Dave Chinner, who called the changed i_version rules "misguided". He had a number of complaints, starting with the fact that XFS sees i_version rather differently and updates it frequently:
In case you didn't realise, XFS can bump iversion 500+ times for a single 1MB write() on a 4kB block size filesystem, and only one of them is initial write() system call that copies the data into the page cache. The other 500+ are all the extent allocation and manipulation transactions that we might run when persisting the data to disk tens of seconds later.
This behavior, he said, is tied to how i_version is stored on-disk, meaning that changes to its semantics need to be treated like a disk-format change. He argued that what is being requested is essentially the lazytime mount option, which is implemented at the VFS level. If NFS needs lazytime-like semantics for i_version, he said, that should also be implemented at the VFS level so that all filesystems will behave in the same way.
Layton responded that lazytime semantics don't really help, since they simply defer the atime updates and will still result in unwanted i_version bumps. He also said that, since the only consumers for i_version are in the kernel, its semantics can be changed without creating further problems. Chinner disagreed with that claim, saying that his forensic-analysis tools make heavy use of that field in the on-disk images. It might not be possible to change the behavior of i_version in XFS without an on-disk format change.
Despite all of this, Chinner has let it be known that he is not really opposed to the change, except for one thing: he wants a tight specification of just how i_version is meant to behave, especially if it will be exposed to user space. Trond Myklebust suggested that i_version should only change in response to explicit operations — those in which user space has requested a change to the file. Changes to atime are, instead, implicit since user space has not asked for them, so they should not result in i_version updates. Layton said that it could simply be defined as any operation that updates an inode's mtime or ctime fields. Neil Brown had a more complex proposal that would use the ctime field directly while providing the higher resolution needed for NFS.
In the end, though, Layton argued that "the time to write a specification for i_version was when it was created" and that he's doing his best to fix the problems long after that time. But, he said, it is "probably best to define this as loosely as possible so that we can make it easier for a broad range of filesystems to implement it". An occasional spurious bump is not a huge problem, but the regular increments caused by atime updates are. Fixing that problem should be good enough.
For all the noise of the discussion, the disagreements are likely smaller than they seem. It is a good opportunity to get a better understanding of what this 30-year-old field really means, and to adjust its behavior to the benefit of Linux users. The next step would appear to be the posting of another version of the patches by Layton, at which point we will get a sense for whether there is enough of a consensus around the proposed changes to get them merged.
Crash recovery for user-space block drivers
A new user-space block driver mechanism entered the kernel during the 6.0 merge window. This subsystem, called "ublk", uses io_uring to communicate with user-space drivers, resulting in some impressive performance numbers. Ublk has a lot of interesting potential, but the current use cases for it are not entirely clear. The recently posted crash-recovery mechanism for ublk makes it clear, though, that those use cases do exist.
If an in-kernel block driver crashes, it is likely to bring down the entire kernel with it. Putting those drivers into user space can, theoretically, result in a more robust system, since the kernel can now survive a driver crash. With ublk as found in the 6.0 kernel, though, a driver crash will result in the associated devices disappearing and all outstanding I/O requests failing. From a user's point of view, this result may be nearly indistinguishable from a complete crash of the system. As patch author Ziyang Zhang notes in the cover letter, some users might be disappointed by this outcome:
This is not a good choice in practice because users do not expect aborted requests, I/O errors and a released device. They may want a recovery mechanism so that no requests are aborted and no I/O error occurs. Anyway, users just want everything works as usual.
The goal of this patch set is to grant this wish.
A user-space block driver that implements crash recovery should set up its ublk devices with the new UBLK_F_USER_RECOVERY flag. There is also an optional flag, UBLK_F_USER_RECOVERY_REISSUE, that controls how recovery is done; more on that below. After setup, no other changes are needed for normal driver operation.
Should a recovery-capable ublk driver crash, the kernel will stop the associated I/O request queues to prevent the addition of future requests, then wait patiently for a new driver process to come along. That wait can be indefinite; if a driver claims to be able to do recovery, then the kernel will expect it to live up to that claim. There is no notification mechanism for a driver crash; user space is required to notice on its own that the driver has come to an untimely end and start a new one.
That new driver process will connect to the ublk subsystem and issue the START_USER_RECOVERY command. That causes ublk to verify that the old driver is really gone and clean up after it, including dealing with all of the outstanding I/O requests. Any requests that showed up after the crash and were not accepted by the old driver can simply be requeued to the new one. Requests that were accepted may have to be handled a bit more carefully, though, since the kernel does not know if they were actually executed or not.
There are, evidently, some ublk back-ends that cannot properly deal with duplicated writes; such writes must be avoided in that case. That is what the UBLK_F_USER_RECOVERY_REISSUE flag is for; if it is present, all outstanding requests will be reissued. Otherwise, requests that had been picked up by the driver, but for which no completion status had been posted, will fail with an error status. This will happen even with read requests, which one would normally expect to be harmless if repeated.
After starting the recovery process, the new driver should reconnect to each device and issue a new FETCH_REQ command on each to enable the flow of I/O requests. Once all of the devices have been set up, an END_USER_RECOVERY command will restart the request queue and get everything moving again. With luck, users may not even notice that the block driver crashed and was replaced.
The ublk subsystem came out of Red Hat and only includes a simple file-backed driver, essentially replicating the loop driver, as an example. At the time, various use cases for this subsystem were mentioned in a vague way, but it was not clear how (or if) it is being used outside of a demonstration mode. It looks a bit like an interesting solution waiting for a problem.
The appearance of this recovery mechanism from a different company (Alibaba), just a few weeks after ublk was merged, suggests that more advanced use cases exist, and that ublk is, indeed, already in active use. This sort of recovery mechanism tends not to be developed in the absence of some hard experience indicating that it is necessary. Hopefully some of these real-world use cases will come to light — with code — so that the rest of the world can benefit from this work.
Just as usefully, this information might give some clues about where Linux is headed in the coming years. The effort to blur the boundaries between kernel tasks and those handled in user space shows no signs of slowing down; it would not be surprising to see more ublk-like mechanisms in the future. It would be interesting indeed to have an idea of where these changes are taking us — and to be shown that it isn't a world where development moves to proprietary, user-space drivers.
Page editor: Jonathan Corbet