That massive filesystem thread
Long, highly-technical, and animated discussion threads are certainly not unheard of on the linux-kernel mailing list. Even by linux-kernel standards, though, the thread that followed the 2.6.29 announcement was impressive. Over the course of hundreds of messages, kernel developers argued about several aspects of how filesystems and block I/O work on contemporary Linux systems. In the end (your editor will be optimistic and say that it has mostly ended), we had a lot of heat - and some useful, concrete results.
One can only pity Jesper Krogh, who almost certainly didn't know what he was getting into when he posted a report of a process which had been hung up waiting for disk I/O for several minutes. All he was hoping for was a suggestion on how to avoid these kinds of delays - which are a manifestation of the famous ext3 fsync() problem - on his server. What he got, instead, was to be copied on the entire discussion.
Journaling priority
One of the problems is at least somewhat understood: a call to fsync() on an ext3 filesystem will force the filesystem journal (and related file data) to be committed to disk. That operation can create a lot of write activity which must be waited for. But contemporary I/O schedulers tend to favor read operations over writes. Most of the time, that is a rational choice: there is usually a process waiting for a read to complete, but writes can be done asynchronously. A journal commit is not asynchronous, though, and it can cause a lot of things to wait while it is in progress. So it would be better not to put journal I/O operations at the end of the queue.
In fact, it would be better not to make journal operations contend with the rest of the system at all. To that end, Arjan van de Ven has long maintained a simple patch which gives the kjournald thread realtime I/O priority. According to Alan Cox, this patch alone is sufficient to make a lot of the problems go away. The patch has never made it into the mainline, though, because Andrew Morton has blocked it. This patch, he says, does not address the real problem, and it causes a lot of unrelated I/O traffic to benefit from elevated priority as well; the real fix, in his view, is harder.
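For the curious, the effect of Arjan's patch can be approximated from user space with the ioprio_set() system call. The sketch below is purely illustrative - it is not Arjan's code, and the actual patch sets the priority inside the kernel rather than through the syscall:

    /* Approximate user-space equivalent of giving kjournald realtime
       I/O priority. ioprio_set() has no glibc wrapper, so the raw
       syscall is used; the constants match the kernel's ioprio.h. */
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define IOPRIO_CLASS_RT     1
    #define IOPRIO_CLASS_SHIFT 13
    #define IOPRIO_WHO_PROCESS  1

    static int make_io_realtime(pid_t pid)
    {
        /* realtime class, highest priority level (0) */
        int ioprio = (IOPRIO_CLASS_RT << IOPRIO_CLASS_SHIFT) | 0;
        return syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, (int)pid, ioprio);
    }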
Bandaid or not, this approach has its adherents. The ext4 filesystem has a new mount option (journal_ioprio) which can be used to set the I/O priority for journaling operations; it defaults to something higher than normal (but not realtime). More recently, Ted Ts'o has posted a series of ext3 patches which set the WRITE_SYNC flag on some journal writes. That flag marks the operations as synchronous, which will keep them from being blocked by a long series of read operations. According to Ted, this change helps quite a bit, at least when there is a lot of read activity going on. The ext3 changes have not yet been merged for 2.6.30 as of this writing (none of Ted's trees have been pulled), but chances are they will go in before 2.6.30-rc1.
data=ordered, fsync(), and fbarrier()
The real problem, though, according to Ted, is the ext3 data=ordered mode. That is the mode which makes ext3 relatively robust in the face of crashes, but, says Ted, it has done so at the cost of performance and the encouragement of poor user-space programming. He went so far as to express his regrets for this behavior.
The only problem here is that not everybody believes that ext3's behavior is a bad thing - at least, with regard to robustness. Much of this branch of the discussion covered the same issues raised by LWN in Better than POSIX? a couple of weeks before. A significant subset of developers does not want the additional robustness provided by ext3 data=ordered mode to go away; Matthew Garrett expressed this position well.
One option which came up a couple of times was to extend POSIX with a new system call (called something like fbarrier()) which would enforce ordering between filesystem operations. A call to fbarrier() could, for example, cause the data written to a new file to be forced out to disk before that file could be renamed on top of another file. The idea has some appeal, but Linus dislikes it:
So rather than come up with new barriers that nobody will use, filesystem people should aim to make "badly written" code "just work" unless people are really really unlucky. Because like it or not, that's what 99% of all code is.
And that is almost certainly how things will have to work. In the end, a system which just works is the system that people will want to use.
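To make the argument concrete, the idiom at the center of the thread is atomic replacement via rename(); a minimal sketch (error handling abbreviated) follows. The fsync() call is the contended step: POSIX requires it for any durability guarantee, ext3's data=ordered mode made the pattern reasonably safe without it, and on ext3 it is also the call that makes well-behaved applications stall.

    /* The atomic-update idiom under discussion: write a complete new
       copy, then rename() it over the old name. The rename is atomic;
       whether the new data is on disk at crash time is the question. */
    #include <fcntl.h>
    #include <stddef.h>
    #include <unistd.h>

    static int replace_file(const char *path, const char *tmp,
                            const void *buf, size_t len)
    {
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0666);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        if (fsync(fd) != 0) {    /* the contended step: required for a   */
            close(fd);           /* durability guarantee, painful on ext3 */
            unlink(tmp);
            return -1;
        }
        close(fd);
        return rename(tmp, path);  /* atomically swaps in the new file */
    }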
relatime
Meanwhile, another branch of the conversation revisited an old topic: atime updates. Unix-style filesystems traditionally track the time that each file was last accessed ("atime"), even though, in reality, there is very little use for this information. Tracking atime is a performance problem, in that it turns every read operation into a filesystem write as well. For this reason, Linux has long had a "noatime" mount option which would disable atime updates on the indicated filesystem.
As it happens, though, there can be problems with disabling atime entirely. One of them is that the mutt mail client uses atime to determine whether there is new mail in a mailbox. If the time of last access is prior to the time of last modification, mutt knows that mail has been delivered into that mailbox since the owner last looked at it. Disabling atime breaks this mechanism. In response to this problem, the kernel added a "relatime" option which causes atime to be updated only if the previous value is earlier than the modification time. The relatime option makes mutt work, but it, too, turns out to be insufficient: some distributions have temporary-directory cleaning programs which delete anything which hasn't been used for a sufficiently long period. With relatime, files can appear to be totally unused, even if they are read frequently.
If relatime could be made to work, the benefits could be significant; the elimination of atime updates can get rid of a lot of writes to the disk. That, in turn, will reduce latencies for more useful traffic and will also help to avoid disk spin-ups on laptops. To that end, Matthew Garrett posted a patch to modify the relatime semantics slightly: it allows atime to be updated if the previous value is more than one day in the past. This approach eliminates almost all atime updates while still keeping the value close to current.
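In outline, the resulting rule looks something like the sketch below. This is illustrative only, not the kernel's actual code; the one-day window is the constant from Garrett's patch:

    /* Sketch of the extended relatime test: skip the atime write
       unless skipping it would mislead somebody. */
    #include <stdbool.h>
    #include <time.h>

    static bool atime_needs_update(time_t atime, time_t mtime,
                                   time_t ctime_ts, time_t now)
    {
        if (mtime >= atime || ctime_ts >= atime)
            return true;              /* keeps mutt's new-mail check working */
        if (now - atime >= 24 * 60 * 60)
            return true;              /* Garrett: never let atime go more
                                         than a day stale                    */
        return false;
    }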
This patch was proposed for merging, and more: it was suggested that relatime should be made the default mode for filesystems mounted under Linux. Anybody wanting the traditional atime behavior would have to mount their filesystems with the new "strictatime" mount option. This idea ran into some immediate opposition, for a couple of reasons. Andrew Morton didn't like the hardwired 24-hour value, saying, instead, that the update period should be given as a mount option. This option would be easy enough to implement, but few people think there is any reason to do so; it's hard to imagine a use case which requires any degree of control over the granularity of atime updates.
Alan Cox, instead, objected to the patch as an ABI change and a standards violation. He tried to "NAK" the patch, saying that, instead, this sort of change should be done by distributors. Linus, however, said he doesn't care; the relatime change and strictatime option were the very first things he merged when he opened the 2.6.30 merge window. His position is that the distributors have had more than a year to make this change, and they haven't done so. So the best thing to do, he says, is to change the default in the kernel and let people use strictatime if they really need that behavior.
For the curious, Valerie Aurora has written a detailed article about this change. She doesn't think that the patch will survive in its current form; your editor, though, does not see a whole lot of pressure for change at this point.
I/O barriers
Suppose you are a diligent application developer who codes proper fsync() calls where they are necessary. You might think that you are then protected against data loss in the face of a crash. But there is still a potential problem: the disk drive may lie to the operating system about having written the data to persistent media. Contemporary hardware performs aggressive caching of operations to improve performance; this caching will make a system run faster, but at the cost of adding another way for data to get lost.
There is, of course, a way to tell a drive to actually write data to persistent media. The block layer has long had support for barrier operations, which cause data to be flushed to disk before more operations can be initiated. But the ext3 filesystem does not use barriers by default because there is an associated performance penalty. With ext4, instead, barriers are on by default.
Jeff Garzik pointed out one associated problem: a call to fsync() does not necessarily cause the drive to flush data to the physical media. He suggested that fsync() should create a barrier, even if the filesystem as a whole is not using barriers. In that way, he says, fsync() might actually live up to the promise that it is making to application developers.
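Note that applications have no portable way to force the drive's own cache; Mac OS X exposes an explicit fcntl() for the purpose, shown here purely for contrast (a sketch, not part of any of the patches under discussion):

    /* fsync() only reaches the platter if the filesystem issues a
       barrier/cache flush underneath it. OS X offers an explicit
       request; on Linux the application is at the filesystem's mercy. */
    #include <fcntl.h>
    #include <unistd.h>

    static int sync_for_real(int fd)
    {
    #ifdef F_FULLFSYNC
        if (fcntl(fd, F_FULLFSYNC) == 0)  /* flush the drive cache too */
            return 0;                     /* (OS X only)                */
    #endif
        return fsync(fd);                 /* hope barriers are enabled  */
    }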
The idea was not controversial, even though people are, as a whole, less concerned with caches inside disk drives. Those caches tend to be short-lived, and they are quite likely to be written even if the operating system crashes or some other component of the system fails. So the chances of data loss at that level are much smaller than they are with data in an operating system cache. Still, it's possible to provide a higher-level guarantee, so Fernando Luis Vazquez Cao posted a series of patches to add barriers to fsync() calls. And that is when the trouble started.
The fundamental disagreement here is over what should happen when an attempt to send a flush operation to the device fails. Fernando's patch returned an ENOTSUPP error to the caller, but Linus asked for it to be removed. His position is that there is nothing that the caller can do about a failed barrier operation anyway, so there is no real reason to propagate that error upward. At most, the system should set a flag noting that the device doesn't support barriers. But, says Linus, filesystems should cope with what the storage device provides.
Ric Wheeler, instead, argues that filesystems should know if barrier operations are not working and be able to respond accordingly. Says Ric:
Basically, it lets the file system know that its data integrity building blocks are not really there and allows it (if it cares) to try and minimize the chance of data loss.
Alan Cox also jumped into this discussion, arguing in favor of stronger barriers.
Linus appears to be unswayed by these arguments, though. In his view, filesystems should do the best they can and accept what the underlying device is able to do. As of this writing, no patches adding barriers to fsync() have been merged into the mainline.
Related to this is the concept of laptop mode. It has been suggested that, when a system is in laptop mode, an fsync() call should not actually flush data to disk; flushing the data would cause the drive to spin up, defeating the intent of laptop mode. The response to I/O barrier requests would presumably be similar. Some developers oppose this idea, though, seeing it as a weakening of the promises provided by the API. This looks like a topic which could go a long time without any real resolution.
Performance tuning
Finally, there was some talk about trying to make the virtual memory subsystem perform better in general. Part of the problem here has been recognized for some time: memory sizes have grown faster than disk speeds. So it takes a lot longer to write out a full load of dirty pages than it did in the past. That simple dynamic is part of the reason why writeout operations can stall for long periods; it just takes that long to funnel gigabytes of data onto a disk drive. It is generally expected that solid-state drives will eventually make this problem go away, but it is also expected that it will be quite some time, yet, before those drives are universal.
In the mean time, one can try to improve performance by not allowing the system to accumulate as much data in need of writing. So, rather than letting dirty pages stay in cache for (say) 30 seconds, those pages should be flushed more frequently. Or the system could adjust the percentage of RAM which is allowed to be dirty, perhaps in response to observations about the actual bandwidth of the backing store devices. The kernel already has a "percentage dirty" limit, but some developers are now suggesting that the limit should be a fixed number of bytes instead. In particular, that limit should be set to the number of bytes which can be flushed to the backing store device in (say) one second.
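The knobs involved are already visible from user space; 2.6.29 added byte-granularity variants alongside the old percentage limits. A hypothetical /etc/sysctl.conf fragment (the values are illustrative, not recommendations):

    # Dirty-memory limits: the old percentage-of-RAM knobs...
    vm.dirty_ratio = 10
    vm.dirty_background_ratio = 5
    # ...and the byte-based variants added in 2.6.29; setting these
    # to nonzero values overrides the corresponding ratios.
    vm.dirty_bytes = 104857600
    vm.dirty_background_bytes = 52428800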
Nobody objects to the idea of a better-tuned virtual memory subsystem. But there is some real disagreement over how that tuning should be done. Some developers argue for exposing the tuning knobs to user space and letting the distributors work it out. Andrew is a strong proponent of this approach:
The fact that this hasn't even been _attempted_ (afaik) is deplorable. Why does everyone just sit around waiting for the kernel to put a new value into two magic numbers which userspace scripts could have set?
The objections to that approach follow these lines: the distributors cannot get these numbers right; in fact, they are not really even inclined to try to get them right. The proper tuning values tend to change from one kernel to the next, so it makes sense to keep them with the kernel itself. And the kernel should be able to get these things right if it is doing its job at all. Needless to say, Linus argues for this approach.
Linus has suggested (but not implemented) one set of heuristics which could help the system to tune itself. Neil Brown also has a suggested approach, based on measuring the actual performance of the system's storage devices. Fixing things at this level is likely to take some time; virtual memory changes always do. But some smart people are starting to think about the problem, and that's an important first step.
That, too, could be said for the discussion as a whole. There are clearly a lot of issues surrounding filesystems and I/O which have come to the surface and need to be discussed. The Linux kernel community as a whole needs to think through the sort of guarantees (for both robustness and performance) it will offer to its users and how those guarantees will be fulfilled. As it happens, the 2009 Linux Storage & Filesystems Workshop begins on April 6. Many of these topics are likely to be discussed there. Your editor has managed to talk his way into that room; stay tuned.
Index entries for this article
Kernel: Filesystems
Posted Apr 1, 2009 0:36 UTC (Wed)
by bojan (subscriber, #14302)
[Link] (37 responses)
> You may wish that was what they did, but reality is that "open(filename, O_TRUNC | O_CREAT, 0666)" thing.
Which is exactly what ext4 already works around. In line with reality.
> Harsh, I know. And in the end, even the _good_ applications will decide that it's not worth the performance penalty of doing an fsync(). In git, for example, where we generally try to be very very very careful, 'fsync()' on the object files is turned off by default.
Ah, thinking of doing fsync() after all, are we?
> Why? Because turning it on results in unacceptable behavior on ext3.
Chuckle :-)
And then, the real reality:
> Now, admittedly, the git design means that a lost new DB file isn't deadly, just potentially very very annoying and confusing - you may have to roll back and re-do your operation by hand, and you have to know enough to be able to do it in the first place.
Meaning, make your apps in such a way that an odd crash here and there cannot take out the whole thing.
Posted Apr 1, 2009 3:39 UTC (Wed)
by ajross (guest, #4563)
[Link] (27 responses)
And to be fair, there's a difference in designing around "the odd crash here and there" and a 30 Second Window of Doom for every file creation.
Posted Apr 1, 2009 4:19 UTC (Wed)
by bojan (subscriber, #14302)
[Link] (9 responses)
What is your point here exactly? That I should not post because you may not like reading it? If you are a moderator of the site, please feel free to remove my post.
I make no apologies for my snideness - I think it was well deserved. Essentially, just because one file system does something in an idiotic way, we should now drop a perfectly good system call. Shouldn't we instead FIX what's broken so that all system calls and all file systems can be used as designed?
Similarly, we have seen heaps of new system calls introduced into Linux in recent times (dup3 and friends + other, backup related stuff from Ulrich Drepper), which all have to do with files. Why? Because they were needed. No complaints there. I thought the deal was that they would never get used? (see, being snide again).
> And to be fair, there's a difference in designing around "the odd crash here and there" and a 30 Second Window of Doom for every file creation.
And to be fair, there is difference in designing around complete system lockups for a number of seconds and committing data when required.
Posted Apr 1, 2009 8:34 UTC (Wed)
by nix (subscriber, #2304)
[Link] (8 responses)
They're not really intended for use by everyman, anyway.
The problem with what one might call the fsync() RANDOMLY_LOSE option is that it is something which must be used by everyman to avoid data loss, which if you get it wrong there is no sign unless you lose power at exactly the right time, and which nearly all programs you might clap eyes on other than Emacs have historically got wrong, and which many utility programs *cannot* get right no matter what, because there's no way they can tell if the data they are operating on is 'important', and thus should be fsync()ed, or not. (Sure, you could add a new command-line option to tell them, but that option is not in POSIX so portable applications can't rely on it for a long long time).
That's a big difference.
Posted Apr 1, 2009 10:24 UTC (Wed)
by bojan (subscriber, #14302)
[Link] (6 responses)
You are kidding, right? dup3() is not for general use?
> That's a big difference.
Look, I'm not really bent on a particular mechanism of actually making sure that programmers have a reliable interface for doing this. Using fsync() before close() is the only portable solution now, but it is far from optimal. I think there is very little doubt about that. And we all know it sucks to high heaven on ext3 in ordered mode.
I don't know what the best way is: new call, some kind of flag to open that says O_ALWAYSDATABEFOREMETADATA, rename2(), close_with_magic() or whatever. But, saying that application programmers cannot grok this kind of stuff is just not true. They can and they will, only if given the tools. Just like they did dup3() and friends (and as you point out, there is little danger of misuse - these are new calls).
As I said many times before, overloading current pattern with non-portable behaviour is dangerous, because it provides false sense of robustness and ties one up to a particular FS and kernel. If we can get POSIX updated so that rename() actually means "always data before metadata, but don't put on disk now", then it may even fly. But, I don't know how that's going to make guarantees retroactively, when even Linux features file systems that don't do that (e.g. ext3 in writeback mode).
Also, having things like delayed allocation, where metadata can legitimately be committed before data, is really useful. Most short lived temporary files will never see disk platters, therefore making things faster and disks last longer. Meaning, keeping the old cruft around ain't that bad.
As for utility programs that are called from scripts, you can use dd with conv=fsync or conv=fdatasync in your pipe to commit files to disk today. On FreeBSD, they already have standalone fsync program for that. Yeah, I know. It sucks. But, your usual tools don't have to make any decisions on fsync()-ing - you can.
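For instance, a script that wants copied data committed before moving on could do something like this (GNU dd; the filenames are placeholders):

    dd if=upload.tmp of=upload.dat conv=fdatasync  # forces the output data out
                                                   # of the page cache before
                                                   # dd exits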
Posted Apr 1, 2009 18:09 UTC (Wed)
by quotemstr (subscriber, #45331)
[Link] (5 responses)
By your logic, we should never fix bugs. Remember the 25 year old readdir bug? Don't you agree it was good to fix that? What if a program, somewhere, depended on that behavior?
In reality, programs use rename for atomic replacement. POSIX doesn't say anything about guarantees after a hard system crash, and it's just disingenuous to think that by punishing application authors by giving them as little robustness as possible, you're doing them some kind of portability favor.
Posted Apr 1, 2009 20:55 UTC (Wed)
by bojan (subscriber, #14302)
[Link] (4 responses)
Quite the opposite. I'm all for fixing bugs and giving application programmers the _right_ tools for the job. If some Linux developers took a second to lift their noses out of the specifics of Linux and actually looked around, this could be fixed for _everyone_, not just for some Linux specific file systems. That is my point, in case you didn't get it by now.
Posted Apr 1, 2009 21:37 UTC (Wed)
by man_ls (guest, #15091)
[Link] (3 responses)
It is a worthless effort. Each filesystem must keep its house clean. Why invent a new system call which cannot (by necessity) be honored by ext2, or ext4 without a journal? Everything is working now fine in ext3, and if it doesn't work right in ext4 people will just look for a different filesystem.
Reading that Linus is not pulling from Mr Ts'o's trees made me suspicious. Well, now that Ts'o's commit rights have been officially revoked I think that the whole discussion is moot. I wonder if the next ext4 head maintainer will learn from this painful experience and just do the right thing.
Posted Apr 1, 2009 21:46 UTC (Wed)
by corbet (editor, #1)
[Link] (1 responses)
I'm confused. The article said that Ted's trees had not been pulled yet. In fact, that happened today; a bunch of ext4 work went into the mainline, including a number of patches which increase robustness for applications which don't use fsync(). I dunno what you were trying to link to, but it didn't work. I've not seen anything about revocation of commit rights. (It's hard to "revoke commit rights" in a distributed system in any case; at worst you can refuse to pull from somebody else's repository.)
Maybe it's an April 1 post that went over my head?
Posted Apr 2, 2009 6:21 UTC (Thu)
by man_ls (guest, #15091)
[Link]
Sorry, it was a stupid attempt from a foreigner at an April Fools' prank :D I was hoping that the recursive link would give it away, but maybe it was too plausible altogether.
Will try to do better next time :D)
Posted Apr 1, 2009 22:38 UTC (Wed)
by bojan (subscriber, #14302)
[Link]
> Why invent a new system call which cannot (by necessity) be honored by ext2, or ext4 without a journal?
Even if there was some kind of magical law that said that you could not order commits on the non-journaled file system this way, it can always be trivially implemented through - wait for it - fsync(), which has acceptable performance characteristics on such file systems.
> Everything is working now fine in ext3
Sure. Except fsync(), which locks the whole system for a few seconds. Hopefully, this will get fixed (or at least its effect reduced) as a result of the hoopla.
> Well, now that Ts'o's commit rights have been officially revoked I think that the whole discussion is moot.
Now you are really making a fool of yourself.
Posted Apr 2, 2009 23:16 UTC (Thu)
by anton (subscriber, #25547)
[Link]
> The problem with what one might call the fsync() RANDOMLY_LOSE option is that it is something which must be used by everyman to avoid data loss, which if you get it wrong there is no sign unless you lose power at exactly the right time, and which nearly all programs you might clap eyes on other than Emacs have historically got wrong
s/other than/including/. However, I don't agree that this application behaviour is wrong; if the application wants to jump through hoops to get a little bit of extra safety on low-quality file systems, that's ok, but if it doesn't, that's also ok. It's up to the users to choose which applications they run and on which file system.
Posted Apr 1, 2009 5:31 UTC (Wed)
by ncm (guest, #165)
[Link] (16 responses)
Posted Apr 1, 2009 6:07 UTC (Wed)
by bojan (subscriber, #14302)
[Link] (7 responses)
What exactly is not polite about that? Is sarcasm now verboten on LWN? I see plenty of it. Daily.
In a post not so long ago, someone accused me of hiding behind Ted's authority (although I actually used documentation to support my case - which many don't bother to read, of course). This time, I point out what to me is nonsense coming from an even bigger authority, but that's no good either. I'm not sure what position of mine would satisfy fragile sensibilities here. Only silence, I guess.
This time I was being accused of making snide remarks. So, I replied to ajross using his terminology, although I do not actually agree with that qualification (which you can see from my sarcastic: "see, being snide again" remark) and I should have used "so called snideness" in my reply instead. I am really just being sarcastic, because we are all supposed to rally behind the high priest or something.
Sure, Linus is a genius, but that doesn't mean that whatever he says is beyond criticism. And, I do not see how I am not being polite by exercising criticism with a hint of sarcasm.
What is it exactly that you have the issue with in my posts? What exactly is impolite?
Posted Apr 1, 2009 7:54 UTC (Wed)
by khim (subscriber, #9252)
[Link] (3 responses)
> If you read my original post in this thread, you will find that I am pointing at inconsistencies of what Linus describes as reality check. So, I ridicule (among other things) his conclusion that: ext3 sucks at doing fsync(), hence we should drop fsync().
Nope. You are being 100% smart-ass. Linus's reality check is not inconsistent. It's a description of reality, and reality is not consistent. When was it ever? You have different factors, and in different but quite real situations different factors prevail. That's a different facet of reality. When you consider reality from the kernel developer's POV, what the applications are doing is your "unchangeable fact", your "speed of light"; when you consider reality from the application developer's POV, what the kernel does is the "unchangeable fact" and you should deal with it. This is true even if the kernel developer and the application developer are the same person. You can only think differently if your application is designed to only be used "in-house" and you can always guarantee control over both kernel and userspace - and git was not designed to only be used "in-house"...
> And, I do not see how I am not being polite by exercising criticism with a hint of sarcasm.
You are exercising ignorance with a hint of sarcasm. That's different.
Posted Apr 1, 2009 8:29 UTC (Wed)
by bojan (subscriber, #14302)
[Link] (2 responses)
Let me review.
When another Unix kernel (or Linux) holds your data in buffers and commits metadata only (because it is allowed to), you, as an application developer, deal with it by ignoring that fact.
And, when your file system does crazy things with the perfectly good system call, you also ignore it as a kernel developer.
WOW, is that now the new "very special relativity"? We pick whichever behaviour is the most narrow to a specific file system and go with that?
Posted Apr 1, 2009 14:22 UTC (Wed)
by drag (guest, #31333)
[Link] (1 responses)
POSIX allows you never to write data to disk at all. That will make your file system very fast. After all you can have a POSIX-compliant file system that operates off of ramdisk quite easily.
POSIX file system access is designed to describe the interface layer between userland and the file system. It leaves the actual integration between the file system and the hardware, as well as the internals of the file system itself, up to the developer of the OS.
It is like if you discovered all of a sudden that a network service provided by an Apache-based web app uses SSL badly, so that all usernames and passwords are transmitted over the Web in plain text... then you complain about it and the developer says back to you that his application's behavior is allowed by TCP/HTTP/SSL and that you should be changing your password with each usage, like people who use his app correctly do. Then he emails you some documentation from a security expert that says you should change your password frequently and that many other protocols like telnet or ftp send your username and password over the network in plain text.
Posted Apr 1, 2009 16:10 UTC (Wed)
by foom (subscriber, #14868)
[Link]
[...] of the other article's threads. I'd like to suggest that it might be in everyone's interest to move on to more useful pastimes than rehashing the same arguments over and over again every time there's an update on the subject.
Posted Apr 2, 2009 23:17 UTC (Thu)
by xoddam (subscriber, #2322)
[Link] (1 responses)
I plead guilty and I apologise. That was immediately after replying to someone else's post the gist of which was "Ted wrote ext2 and ext3 in the first place, he is therefore above criticism." It concluded with the words "Know your place", which got me riled.
[proverb: in the midst of great anger, never answer anyone's letter]
Your words were not so condescending but they had much the same emphasis: all ur filesystems are belong to POSIX (not users) 'cos POSIX is the law, and by the way Ted's interpretation is the only correct one because he's the primary implementor.
I hope you understand where I was coming from. Forgive me.
Posted Apr 2, 2009 23:56 UTC (Thu)
by bojan (subscriber, #14302)
[Link]
Posted Apr 8, 2009 0:05 UTC (Wed)
by jschrod (subscriber, #1646)
[Link]
[...] with what comes out, it's on their own plate. But your self-righteousness doesn't allow you to understand this, obviously. Luckily, there are still some discussion threads where you don't try to take over. I hope the likes of you will remain few on LWN in the future; this is not Slashdot, after all.
Posted Apr 1, 2009 15:46 UTC (Wed)
by GreyWizard (guest, #1026)
[Link] (7 responses)
People get nasty in the comments here all the time. If there's something beautiful and fragile here it's already in a thousand jagged pieces. But people hector one another about being polite all the time too. That also wrecks the signal-to-noise ratio and solves nothing.
Posted Apr 4, 2009 9:05 UTC (Sat)
by jospoortvliet (guest, #33164)
[Link] (6 responses)
Living in a country where that mode of thinking is the norm, I can tell you it also has disadvantages... If only because the resulting hurt feelings can muddy the discussion more than you might think. Besides, it chases people away who would otherwise have contributed constructively - it's not acceptable behavior in all cultures. Ever wondered why the FOSS community is still predominantly western, despite many smart developers in countries like India?
A little decency now and then doesn't hurt. I know people who, knowing how blunt they can be, ask someone else to read certain emails before sending them. After all, reality is that people DO have feelings.
Posted Apr 5, 2009 3:34 UTC (Sun)
by GreyWizard (guest, #1026)
[Link] (5 responses)
But saying "be polite you jerk" merely drags things even further down into the muck.
Posted Apr 5, 2009 12:43 UTC (Sun)
by jospoortvliet (guest, #33164)
[Link] (4 responses)
First of all, some people don't notice their behavior is unnecessarily impolite. Pointing it out can help them (if they are willing to be reasonable in the first place). Never pointing out somebody's failures will make them fail forever.
Second, it shows you care about being polite. If others show they care too, a culture of 'you should be polite' can be maintained. As you might have noticed from the differences between FOSS communities, culture is important and heavily influential. And it can be changed.
Some things to note:
- people DO care about what others think of them. No matter how much they scream 'no I don't', they do. It is our nature.
- people should know their arguments are not supported by being mean - it is the other way around.
- I agree that a 'be polite you jerk' might not always be the best way to correct someone. A personal mail can do more. However, it won't show up in public (unless an apology is made), thus it does not do much to influence others who might think it is acceptable behavior because the guy got away with it. Of course, giving a good example is better than anything else.
- Of course discussing without end whether somebody was polite enough or not muddies the discussion and lowers the SNR.
Posted Apr 5, 2009 15:42 UTC (Sun)
by GreyWizard (guest, #1026)
[Link] (3 responses)
A truly polite request for more courtesy might help but it's difficult to be sure because such things are quite rare. Giving in to the temptation to scold even just a little makes the comment worse than useless. Unless you are absolutely certain you can do it right it's better to focus on substantive issues and avoid appointing yourself a courtesy cop.
Posted Apr 5, 2009 16:20 UTC (Sun)
by jospoortvliet (guest, #33164)
[Link] (2 responses)
Posted Apr 5, 2009 16:27 UTC (Sun)
by GreyWizard (guest, #1026)
[Link] (1 responses)
Posted Apr 5, 2009 17:11 UTC (Sun)
by jospoortvliet (guest, #33164)
[Link]
On re-reading the thread, I think you are right in that ajross was more impolite than bojan, which often leads to a downward spiral and isn't helpful... bojan's post wasn't that far off from the normal tone on this site.
Anyway. This went pretty far off-topic, and I think we mostly agree. For as far as we don't, we at least agree on that ;-)
Posted Apr 1, 2009 6:27 UTC (Wed)
by njs (guest, #40338)
[Link] (7 responses)
Well, yes, it's a nice goal. The problem is that *you can't* without calling fsync. When the guy who wrote the system calls it "very very annoying and confusing", then it's not really a great example of how we can make all our apps more awesome and usable in general. Unfortunately.
(During the whole ext4 discussion I spent some time trying to figure out how to abuse Ted's patches to create a transactional system that doesn't require rewriting whole databases on every write, and uses rename(2) for its write barrier instead of fsync(2). But I think block layer reordering makes it impossible. Maybe if there were an ioctl to trigger an immediate roll-over of the journal transaction.)
Posted Apr 1, 2009 7:15 UTC (Wed)
by bojan (subscriber, #14302)
[Link] (6 responses)
fsync() sucks because it is a "commit now" thing. Not everyone wants to commit now - I fully understand that. I'm a notebook user and I don't want my disk being spun up unnecessarily. But, current semantics are what they are, so ignoring them is looking for trouble elsewhere. Sucks - yes, but living in denial doesn't help either. And, as you say, not a great way to make our apps more awesome. Just a necessary evil right now. Some of it can be avoided with backup files, but the underlying badness will persist.
It would be nice to have a system call that guarantees "data before metadata, but not necessarily now", so that other systems interested in it may also implement it. Then the apps could comfortably rely on this when renaming files over the top of other ones. I was even thinking that we should ask POSIX to standardise that fsync(-fd) means exactly that (because fd is always positive, but we use int for it, which can also have negative values), but this may confuse things even more and is probably stupid.
Sure, some Linux file systems will continue making it more comfortable even with the undefined order of current semantics, which will please users (BTW, this is really interesting: http://lwn.net/Articles/326583/). But, the long term solution in the world of Unix should probably be a bit more inclusive.
PS. To be fair to fsync(), it is an indispensable tool for databases, so making it work as fast as possible is most definitely a good thing. What ext3 in ordered mode does with it is an abomination.
Posted Apr 1, 2009 7:50 UTC (Wed)
by ebiederm (subscriber, #35028)
[Link]
POSIX/UNIX semantics do not make guarantees about the filesystem state after an OS crash.
Not having to do fsck after a filesystem crash gives the illusion that the filesystem is not corrupted.
It turns out that at least with extN, after a crash, we see filesystem states that are illegal during normal operation. That is, despite not needing to run fsck, the filesystem was corrupted.
It would be nice if there were a filesystem that could guarantee that, if fsck did not need to be run, the visible state of the filesystem was:
- A legal state for the filesystem in normal operation.
- Everything that was fsynced was available.
Does anyone know of a journaling filesystem that guarantees not to give me a corrupt filesystem if fsck does not need to be run?
Posted Apr 1, 2009 8:05 UTC (Wed)
by mjthayer (guest, #39183)
[Link]
Posted Apr 1, 2009 8:39 UTC (Wed)
by nix (subscriber, #2304)
[Link] (3 responses)
Actually on many OSes it's a 'start a background force to disk now and return before it's done' operation; on Linux it's a 'lob it at the disk controller so it can cache it instead' operation. Still not necessarily useful (although that is changing to optionally emit a barrier to the disk controller too.)
(Speaking as the owner of an Areca RAID card with a quarter-gig of battery-backed cache, using non-RAIDed filesystems purely as an fs-cache storage area, I *like* the ability to turn off barriers: all they do is slow my system down with no reliability gain at all.)
Posted Apr 1, 2009 8:55 UTC (Wed)
by bojan (subscriber, #14302)
[Link]
Posted Apr 1, 2009 12:02 UTC (Wed)
by butlerm (subscriber, #13312)
[Link]
[...] indistinguishable from fsync. It is a "commit data before metadata" request.
The real technical problem here is that from the application perspective, the meta data update must take place immediately, i.e. before the system call returns. However, from a recovery perspective, it is highly desirable that the persistent meta data state not be committed until after the data has been committed. Unless a filesystem maintains two versions of its metadata (a la soft updates), that is an unusually difficult requirement to meet without serious performance problems.
The alternative that I would really like to see is undo records for a few critical operations like rename replacement, such that the physical data / meta data ordering requirements are removed, and on recovery the filesystem un-does rename replacements where the replacement data has not been committed to disk. That replaces the ideal of point-in-time recovery with the more practical ideal of consistent version recovery.
Posted Apr 1, 2009 12:17 UTC (Wed)
by butlerm (subscriber, #13312)
[Link]
[...] the one that the "barrier=1" mount option requests. The latter is a low level block I/O write barrier usually implemented with a full write cache flush (barring some sort of battery backup); the former is a data before meta data barrier.
Posted Apr 2, 2009 20:23 UTC (Thu)
by iabervon (subscriber, #722)
[Link]
Linus actually overstated git's use of fsync(). There are three relevant cases. [...] That is, git relies on the assumption that a rename() is atomic with respect to the disk and dependent on all operations previously issued on the inode that is being renamed. It uses fsync() only to make sure that operations to different files happen in the order that it wants. Now, obviously, if you want to be really sure to keep some data, write it once and never replace it at all. That'll do a good job of protecting against everything, including bugs where you do something like "open(), fsync(), close(), rename()" but forget or mess up the "write()". Obviously, this isn't an option for a lot of situations, however, but it's what git does for the most important data.
Posted Apr 1, 2009 7:11 UTC (Wed)
by TRS-80 (guest, #1804)
[Link] (4 responses)
Looks like LVM finally got support for barriers in 2.6.29 (a year after first submitted), although only for linear targets.
Posted Apr 1, 2009 13:40 UTC (Wed)
by quotemstr (subscriber, #45331)
[Link] (2 responses)
Posted Apr 1, 2009 19:03 UTC (Wed)
by sbergman27 (guest, #10767)
[Link] (1 responses)
> This change alone is making me consider using a kernel.org kernel, something I haven't done in years.
Are you saying that distro kernels already do this? If so, I'm not suggesting that you are wrong. Just interested in more info.
Posted Apr 1, 2009 19:08 UTC (Wed)
by quotemstr (subscriber, #45331)
[Link]
Posted Apr 1, 2009 13:44 UTC (Wed)
by masoncl (subscriber, #47138)
[Link]
(Credit to Eric Sandeen for tracking it down).
Posted Apr 1, 2009 7:38 UTC (Wed)
by bakterie (guest, #37541)
[Link] (2 responses)
[...] other filesystems in the discussions, but then again I haven't looked very hard. One of my computers is using reiserfs. How does reiserfs do it?
Posted Apr 1, 2009 11:17 UTC (Wed)
by masoncl (subscriber, #47138)
[Link]
Posted Apr 2, 2009 0:37 UTC (Thu)
by davecb (subscriber, #1574)
[Link]
Well, the Unix v6 filesystem implemented in-order writes, as did 4.x BSD and the other pre-journaled filesystems. POSIX allows reordering to make coalescence easy, as a lot of research was being done at that time to get better performance.
A colleague at ICL (hi, Ian!) did his masters at UofT on that, and found you could get a performance improvement and still preserve correctness by using what I'd characterize as tsort(1), which worked better than BSD/Solaris soft updates.
--dave
Posted Apr 1, 2009 8:09 UTC (Wed)
by mjthayer (guest, #39183)
[Link] (8 responses)
Posted Apr 1, 2009 13:58 UTC (Wed)
by TRS-80 (guest, #1804)
[Link] (3 responses)
People were suggesting the same thing with renames - queue them up until the data is written to disk, but apparently it's too complex (BSD FFS softupdates being proof of this).
Posted Apr 2, 2009 6:40 UTC (Thu)
by butlerm (subscriber, #13312)
[Link] (2 responses)
[...] versions of the meta data around at all times - not only that but converting back and forth between the two versions on demand.
You can't really "queue" a rename without doing something comparable to what softupdates does, because the rename has to take immediate effect from the application perspective. To do that, somewhere there must be a layer to keep track of the difference between what the user visible meta data is and what the committed meta data is. If the differences are sufficiently general, that is a major problem. If one wants high performance rename replacements, rename undo is much more practical.
It would be practical to update atimes on a low priority basis, with the caveat that a lot of memory may be consumed holding metadata blocks around until the atime updates are complete. In addition, on a system under sufficient load, moving I/O to a low priority thread doesn't really help anyway.
Posted Apr 2, 2009 14:01 UTC (Thu)
by xoddam (subscriber, #2322)
[Link] (1 responses)
I'm intrigued, but not satisfied. Telling the journal that a metadata change is 'committed' means that the post-crash-recovery state will reflect the change (journal replay).
Surely the only satisfactory way to commit data before committing the metadata change is to delay *all* journal commits in-order until after the relevant file data is written in place, or to journal the data itself.
For performance reasons it's probably much saner not to journal most data, especially for random access within large files, but I'm thinking that if it makes sense to allocate-on-commit to preserve the in-order semantics of atomic rename, it might also make good sense to special-case data journalling for newly-written (created or truncated) files when they are renamed (perhaps only for small files, and allocate-on-commit larger ones as users will likely expect a delay).
Having the ability to unwind a specific kind of metadata change seems very confusing. I fear that winding back a rename could well result in a violation of expected in-order semantics w.r.t. metadata after crash recovery. Or might it be possible to wind back an entire 'transaction', all other metadata changes since the rename included?
Posted Apr 2, 2009 18:13 UTC (Thu)
by butlerm (subscriber, #13312)
[Link]
"data=writeback" is the current alternative which doesn't make any pretense
Rename undo is a much less severe compromise to in-order semantics after a
In the case you mention, if you write a new version, rename it over the old
Posted Apr 1, 2009 19:41 UTC (Wed)
by Steve_Baker (guest, #265)
[Link] (1 responses)
[...] attribute when a filesystem has been mounted with the noatime or relatime option, to force strict atime updates for files so marked. That way you can mount your filesystem(s) noatime and only put the A attribute on your mailboxes and you're done.
Posted Jun 10, 2009 10:06 UTC (Wed)
by pjm (guest, #2080)
[Link]
Posted Apr 4, 2009 7:42 UTC (Sat)
by dirtyepic (guest, #30178)
[Link] (1 responses)
Posted Apr 4, 2009 7:44 UTC (Sat)
by dirtyepic (guest, #30178)
[Link]
Posted Apr 1, 2009 8:59 UTC (Wed)
by lmb (subscriber, #39048)
[Link]
(A possible extension then might be to have fsyncl(), which accepts a list of fds to sync at the same time, but it is not strictly required.)
Or, of course, to get application writers to use more async IO.
Posted Apr 1, 2009 11:22 UTC (Wed)
by rvfh (guest, #31018)
[Link] (4 responses)
The write-to-disk policy would thus be per-file, but it would be the kernel's decision to flush what needs to be when it deems necessary.
Posted Apr 1, 2009 12:20 UTC (Wed)
by RobSeace (subscriber, #4435)
[Link] (3 responses)
[...] How then are they expected to know when to flush?
Posted Apr 1, 2009 13:19 UTC (Wed)
by rvfh (guest, #31018)
[Link] (2 responses)
Yes, but with more granularity. O_SYNC means write everything immediately, whereas you might want to give the kernel some time to organise the reads/writes more efficiently:
* O_CRITICAL: 1 second
* O_EXPENDABLE: 30 seconds
* default: 5 seconds
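For illustration, application code under this proposal might look like the sketch below; O_CRITICAL and O_EXPENDABLE are hypothetical flags from this thread (the values are invented), not anything a real kernel accepts:

    /* Hypothetical per-file flush-policy flags, as floated above. */
    #include <fcntl.h>

    #define O_CRITICAL   010000000  /* hypothetical: flush within ~1s  */
    #define O_EXPENDABLE 020000000  /* hypothetical: flush within ~30s */

    static void open_examples(void)
    {
        int fd_db  = open("accounts.db", O_WRONLY | O_CREAT | O_CRITICAL, 0600);
        int fd_log = open("cache.tmp",   O_WRONLY | O_CREAT | O_EXPENDABLE, 0644);
        (void)fd_db; (void)fd_log;   /* sketch only */
    }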
Posted Apr 2, 2009 14:14 UTC (Thu)
by xoddam (subscriber, #2322)
[Link] (1 responses)
Things go strangely pear-shaped when the most irrelevant, trivial data (eg. GNOME configs when we're only using GNOME because it's a default someone else chose) goes missing or gets corrupted.
I most definitely don't care if GNOME forgets where I put a window or two. But I do care if it fails to start.
What we end-users want (I wear a developer hat much of the time but I'm *always* a user) is not to be annoyed by the things we don't care about. O_EXPENDABLE and its ilk are an invitation for corner-cases to bite end-users. End-users don't deserve such treatment.
Posted Apr 2, 2009 14:30 UTC (Thu)
by rvfh (guest, #31018)
[Link]
And anyway, do we really not know which files are important and which are not? Examples:
* pid file, browser cache: don't care
* conf file, document, code: care
* database file: care a lot
But I do thank you for challenging this idea ;-) Please feel free to give counter-examples and -arguments.
Posted Apr 1, 2009 13:28 UTC (Wed)
by ballombe (subscriber, #9523)
[Link] (1 responses)
Posted Apr 1, 2009 14:25 UTC (Wed)
by knobunc (subscriber, #4678)
[Link]
--
Alternatively, you can use chattr +A to set the noatime flag on all files and directories where you don't want atime updates, and then clear the flag for the Unix mbox files where you care about the atime updates. Since the noatime flag is inherited by default, you can get this behaviour by running chattr +A /mntpt right after the filesystem is first created and mounted; all files and directories created in that file system will have the noatime flag inherited.
--
Posted Apr 2, 2009 19:15 UTC (Thu)
by iabervon (subscriber, #722)
[Link] (7 responses)
Posted Apr 2, 2009 23:28 UTC (Thu)
by anton (subscriber, #25547)
[Link]
> And if you suddenly lose power, in his experience, the drive is actually much more likely to wipe out some arbitrary track of data from the disk than it is to have anything in the write cache and lose it.
While I have experienced drives that damage sectors or tracks on power loss, I consider these drives faulty; and with such drives the problem does not seem to be limited to drives that are trying to write something at the time. However, most drives don't wipe out arbitrary data in my experience.
But I have tested two drives with a test program for out-of-order writing, and found that they both wrote data several seconds out of order with a certain access sequence. If we don't see more frequent problems from this, that's probably because the disks don't optimize accesses as aggressively as some people imagine.
Posted Apr 2, 2009 23:31 UTC (Thu)
by xoddam (subscriber, #2322)
[Link] (5 responses)
Posted Apr 3, 2009 0:02 UTC (Fri)
by nix (subscriber, #2304)
[Link] (1 responses)
[...] particular can make it much worse (turning small-range corruption into apparent scattershot corruption).
A UPS, or battery-backing, is the answer (well, moves the failure point: if it's a UPS, the UPS must fail before you lose; if it's battery-backed, you often have to lose the battery first, then power, which is likely to happen because you often have no idea the battery has failed until it's too late).
In conclusion: we all suck, our data is doomed, the Second Law shall triumph, and Sod and Murphy shall dance above our mangled filesystems.
Posted Apr 4, 2009 0:01 UTC (Sat)
by giraffedata (guest, #1954)
[Link]
The answer is RAID and UPS, but not that way. The RAID goes over the UPS; e.g. a mirror of two disk drives, each with its own UPS.
Such redundancy also makes it possible to test the UPS regularly and avoid the problem of two dead batteries when the external power fails. The UPS doesn't count if you don't test, measure, and/or replace its battery regularly.
Posted Apr 3, 2009 23:49 UTC (Fri)
by giraffedata (guest, #1954)
[Link] (2 responses)
It's hard to believe there are disk drives out there (not counting an occasional broken one) that write trash over random areas as they power down. Disk drives I have seen have a special circuit to disconnect and park the head the moment voltage begins to drop. It has to park the head because you can't let the head land on good recording surface, and it has to cut off the write current because otherwise it's dragging a writing head all the way across the disk, pretty much guaranteeing the disk will never come back. I believe it's a simple circuit that doesn't involve any controller intelligence.
There is a related failure mode where the drive's client loses power and in its death throes ends up instructing the drive to trash itself while the drive still has enough power to operate normally. I've heard that's not unusual, and it's the best argument I know for a UPS that powers a system long enough for it to shut down cleanly.
Posted Apr 27, 2009 6:24 UTC (Mon)
by bersl2 (guest, #34928)
[Link] (1 responses)
> It's hard to believe there are disk drives out there (not counting an occasional broken one) that write trash over random areas as they power down. [...] There is a related failure mode where the drive's client loses power and in its death throes ends up instructing the drive to trash itself while the drive still has enough power to operate normally. I've heard that's not unusual, and it's the best argument I know for a UPS that powers a system long enough for it to shut down cleanly.
One of these happened to me. $DEITY as my witness, I will never run an important system without a UPS again.
Bonus: The drive was a Maxtor. Serves me right.
Double bonus: That still wasn't traumatic enough to compel me to make backups.
Posted Apr 27, 2009 10:43 UTC (Mon)
by nix (subscriber, #2304)
[Link]
[...] (and perhaps better because the battery failing doesn't take your machine down if the power is otherwise OK, while the UPS failing *does*).
Posted Apr 17, 2009 0:05 UTC (Fri)
by hozelda (guest, #19341)
[Link]
Put hardware in between RAM and Disk [...] features to work in between these [...]