ext4 and data consistency

Posted May 13, 2010 18:38 UTC (Thu) by anton (subscriber, #25547)
In reply to: The Next3 filesystem by cortana
Parent article: The Next3 filesystem

Ted T'so still believes that data consistency on OS crashes (not application crashes) is the job of the applications (with fsync() etc.), not of the file system. And most applications don't do that, and those few that try it are probably not well tested against that (because that's extremely hard).

He fixed one particularly frequent cause of data loss in ext4 (involving writing a file, then renaming it across an old one), but nothing else. So people will see data loss with ext4 less frequently than before, but not as infrequently as with ext3 (or has this data loss feature been backported from ext4 to ext3 to give us fewer reasons to stick with ext3?).

ext4 and data consistency

Posted May 13, 2010 20:08 UTC (Thu) by rahulsundaram (subscriber, #21946) [Link] (9 responses)

You are clearly overstating the case and the position. There were several issues fixed. Not just one.

ext4 and data consistency

Posted May 14, 2010 12:54 UTC (Fri) by anton (subscriber, #25547) [Link] (8 responses)

Am I? That's Ted T'so's position as reported on, e.g., LWN. But maybe you can show me where I was wrong in my statement of his position. And my impression is that if it was just up to him, he would not have made the rename fix.

ext4 and data consistency

Posted May 14, 2010 14:00 UTC (Fri) by rahulsundaram (subscriber, #21946) [Link]

You said there was only one fix. There were several and there are other Ext4 filesystem developers as well. What your state as his position is leaving out a lot of naunced arguments in a complex topic and making it sound very simplistic. If you can actually show a single case where Ext4 performs less robustly than Ext3, I would be interested.

ext4 and data consistency

Posted May 21, 2010 15:08 UTC (Fri) by Duncan (guest, #6647) [Link] (6 responses)

What bothers me is how they reduced the guarantees and stability of the long mature ext3 filesystem in the aftermath of all this, by defaulting it to data=writeback, a change from the old default data=ordered.

Presumably you used tun2fs or simply fstab to ensure your ext3 mounts remain stable with data=ordered after the kernel in question (was it 2.6.30 or 2.6.31?), right?

What'd be interesting to see would be how the distributions have handled it, since. Did they go with the new ext3 data=writeback default, or have they either reverted either that commit or now default their userspace to specify data=ordered by default?

I know at least one guy who was complaining of ext3 instability after installing a new kernel due to that, that went away when he returned to data=ordered for his ext3 volumes. The context of that discussion was the pan (nntp client) user list, IIRC.

Me, I've been on reiserfs for years on both my main system and (more recently) my netbook, and have been extremely happy with it since data=ordered became its default (2.6.6 according to a google hit on another LWN comment of mine). My most recent experience with extX is on no-journal ext4 formatted USB flash-based thumbdrives, where journaling isn't a good idea. I've been following btrfs with interest, and expect I'll upgrade to it once a few more of the kinks get worked out. (I've seen hints that the current 2.6.35 cycle will reduce the strength of the warning for its kernel config item, but I don't follow the btrfs list or lkml, and any detail of even plans has been harder to come by on the broader community sites such as LWN, HO, LXer, etc, that I follow.)

Duncan

ext4 and data consistency

Posted May 22, 2010 19:15 UTC (Sat) by anton (subscriber, #25547) [Link] (5 responses)

What bothers me is how they reduced the guarantees and stability of the long mature ext3 filesystem in the aftermath of all this, by defaulting it to data=writeback, a change from the old default data=ordered.

Yes, that's what was at the back of my mind when I wrote about "backporting the data loss feature from ext4 to ext3".

Presumably you used tun2fs or simply fstab to ensure your ext3 mounts remain stable with data=ordered after the kernel in question (was it 2.6.30 or 2.6.31?), right?

The youngest kernel we have is 2.6.30, and according to /proc/mounts it mounts our ext3 file systems with data=ordered. I guess we will go the fstab route once we get a kernel that defaults to data=writeback.

I am a little worried, though, because of what happened after data=journal was no longer the default; I then read that using data=journal resulted in corrupt file systems; I read that for a significant amount of time, and never read that this bug has been fixed (but haven't seen such reports for some time).

So if they made data=ordered non-default in 2.6.31 or some kernel, will they really care if it works? My confidence is limited. We should probably better stick with 2.6.30 until we migrate off extx file systems completely.

ext4 and data consistency

Posted May 22, 2010 20:36 UTC (Sat) by nix (subscriber, #2304) [Link] (4 responses)

Sticking with 2.6.30 is foolish. Bugs are fixed in ext[34] all the time, sometimes data loss bugs: by sticking with 2.6.30, you're depriving yourself of all of those.

(btw, you can put mount options in the superblock, and avoid modifying /etc/fstab.)

ext4 and data consistency

Posted May 23, 2010 11:44 UTC (Sun) by anton (subscriber, #25547) [Link] (3 responses)

And new bugs are introduced, and if they are for a non-default option like (now) data=ordered, they won't get noticed in time, and they won't get fixed for quite some time; at least that's what the non-default data=journal episode teaches. So what's higher: the risk of data loss from a well-known kernel, or from a new kernel in a non-default setting? Choosing the latter seems foolish to me.

Modifying fstab is not a big deal, why would I want to avoid it. The problem with doing it in the superblock is that I have to do it again when I transfer the system to another disk.

ext4 and data consistency

Posted May 23, 2010 11:50 UTC (Sun) by cortana (subscriber, #24596) [Link] (1 responses)

How can I check whether my distribution has changed the default value of the option in its kernels?

ext4 and data consistency

Posted May 23, 2010 13:29 UTC (Sun) by anton (subscriber, #25547) [Link]

One way is to mount such a file system with the default value (without overriding the default with tune2fs or in fstab), and the checking the actual options in /proc/mounts. That is what I do.

Another way would be to check CONFIG_EXT3_DEFAULTS_TO_ORDERED in the kernel config file.

ext4 and data consistency

Posted May 23, 2010 13:55 UTC (Sun) by nix (subscriber, #2304) [Link]

Well, OK, you're quite within your rights to stick with an old kernel: but I hope you encounter no other security bugs, or stability bugs, or new hardware, or *anything* else that might require a new kernel!

ext4 and data consistency

Posted May 13, 2010 20:19 UTC (Thu) by drag (guest, #31333) [Link] (21 responses)

There is no file system in Linux that tries to assure that renames are atomic functions.

Ext3, Ext4, XFS, JFS, etc etc.. all of these have the same consistency problems your complaining about.

The difference is that due to a fluke to Ext3's design the window that the 'zero length files' would be created on improper shutdown is much shorter then the same window for Ext4 (or XFS or whatever)

What is more with 2.6.30 a patch was added to Ext4 that attempted to detect and then replicate the same behavior in Ext3 in order to maintain backwards compatibility with application developer's assumptions on file system behavior with regards to renames.

So ya.. apparently that 'fsync' was always needed by application developers if they wanted to ensure that data was written to disk in a timely fashion.

---------------------------

I know that this issue has cropped up again due to the fact that in Ubuntu the dpkg program detects if it's running on Ext4 and goes into paranoid mode were it runs 'fsync' were as with Ext3 it does not. This causes Ubuntu installs to last significantly longer if you choose 'Ext4' file system.

If the dpkg folks were smart they'd enable paranoid mode on all file systems, except maybe Ext3 (due to Ext3's poor ability to handle that sort of workloads)

As far as my personal opinion this is a advantage for using Ext4 over Ext3 since upgrades will be much safer on my laptop...

---------------------------

The one feature that I like about Ext4 is that it takes a minute or two to run a full fsck on my home directory versus upwards to 15-20 minutes for the same operation on Ext3.

ext4 and data consistency

Posted May 13, 2010 20:45 UTC (Thu) by quotemstr (subscriber, #45331) [Link] (11 responses)

The difference is that due to a fluke to Ext3's design the window that the 'zero length files' would be created on improper shutdown is much shorter then the same window for Ext4 (or XFS or whatever)

That window must be vanishingly small because neither I nor anyone else has ever been able to make ext3 crease zero-length files in the way you describe. Quirk or not, rename atomicity is an important feature that works just fine on a running filesystem, and filesystems ought to preserve its qualities on a restart. Allowing random garbage to exist on the filesystem after a restart is terrible policy and reflects a profound ignorance on the part of filesystem developrse as to how applications and users expect their systems to work.

ext4 and data consistency

Posted May 13, 2010 23:07 UTC (Thu) by njs (guest, #40338) [Link] (10 responses)

And ext4 now has rename atomicity over crashes. I also think that this is the right decision, but I wince when I see people tear into filesystem developers over this; if anything, it seems to reflect a profound ignorance of the difficulty of the trade-offs fs developers have to make, the disparity between what people want from a fs and what fs's have historically provided, etc. Keep in mind that if you go two web-pages over, you can find people tearing into POSIX for providing *too* strong guarantees and how we absolutely need to relax them for real-world usage (atime is the obvious example, but there are others). So I can hardly blame fs developers for being *cautious* about introducing strong *new* guarantees.

ext4 and data consistency

Posted May 14, 2010 13:43 UTC (Fri) by anton (subscriber, #25547) [Link] (9 responses)

[...] trade-offs fs developers have to make, the disparity between what people want from a fs and what fs's have historically provided, etc.

Yes, different people expect different things from file systems.

E.g., I expect data consistency from a file system; Linux file systems don't give any guarantee on that, but at least ext3 does ok in most cases; some people may consider this a fluke (but is Stephen Tweedie, the creator of ext3 among them?), but that's the reality.

Other people expect maximum speed. And for these people Linux provides tmpfs and ext4.

Given this choice, ext4 is certainly not a replacemet of ext3 for me.

Keep in mind that if you go two web-pages over, you can find people tearing into POSIX for providing *too* strong guarantees and how we absolutely need to relax them for real-world usage (atime is the obvious example, but there are others).

Yes, there are different kinds of users. I lost quite a bit of time because Linux does not follow POSIX atime semantics by default anymore. I find them useful in my real-world usage. Those who don't want atime have been able to use noatime for a long time, and now there is relatime, but making it the default (especially with mounts that don't know about strictatime) is a bad practice.

ext4 and data consistency

Posted May 14, 2010 15:53 UTC (Fri) by bronson (subscriber, #4806) [Link] (5 responses)

What on earth do you use atime for? Personally, the last time I ever needed to worry about atime was in the 1990s, and it was very easy to replace.

ext4 and data consistency

Posted May 15, 2010 8:36 UTC (Sat) by anton (subscriber, #25547) [Link] (4 responses)

I use atime to check whether some complex software really does access the files that I think it does.

ext4 and data consistency

Posted May 20, 2010 19:23 UTC (Thu) by oak (guest, #2786) [Link] (3 responses)

> I use atime to check whether some complex software really does access the files that I think it does.

Wouldn't "strace -f" be handier for that kind of thing? With that you notice also a lot of other stuff that the SW does.

Strace-account script gives an overview of file accesses in the strace output:
http://blogs.gnome.org/mortenw/2005/12/14/strace-account/

ext4 and data consistency

Posted May 21, 2010 12:04 UTC (Fri) by anton (subscriber, #25547) [Link] (2 responses)

It would not be handier exactly because it tells me a huge amount of other stuff the software does and that I am not interested in.

ext4 and data consistency

Posted Jun 8, 2010 22:17 UTC (Tue) by elanthis (guest, #6227) [Link] (1 responses)

Meet grep. Grep is your friend. Grep can make your life much easier. Grep is here to help you.

ext4 and data consistency

Posted Jun 9, 2010 9:06 UTC (Wed) by anton (subscriber, #25547) [Link]

And how is that handier than just doing "stat <file>"?

ext4 and data consistency

Posted May 14, 2010 17:33 UTC (Fri) by njs (guest, #40338) [Link] (2 responses)

I'm not aware of any common filesystem that provides "data consistency" in any coherent sense, unless you do weird things like mount -o sync. Speed is too important -- Stephen Tweedie didn't make data=journal the default, either. At most you get guarantees in particular situations -- e.g., both ext3 and ext4 guarantee that a rename will not be committed to disk until writes to the file being renamed have been committed to disk. They even both try to guarantee that programmers who do horrible things like truncating the file and *then* rewriting it are somewhat protected from their incompetence.

But maybe there are other cases where ext3 does better than ext4. You must have some excellent ones in mind to lump ext4 in with tmpfs... can you give any examples?

ext4 and data consistency

Posted May 15, 2010 9:19 UTC (Sat) by anton (subscriber, #25547) [Link] (1 responses)

Speed is too important

For whom? For me data consistency is much more important. Before barriers were supported, we ran ext3 on IDE disks without write caching, and that's really slow. The file system was still fast enough.

Stephen Tweedie didn't make data=journal the default, either.

Actually he did, at least at the start. Later it got changed (by whom?) to data=ordered; that still has the potential to provide data consistency unless existing files are overwritten.

As for an example: Consider a process writing file A and then file B. With ext4 I expect that it can happen that after recovery B is present and A is not or is empty. With ext3 I expect that this does not happen. But given that I did not find any documented guarantees in Documentation/filesystems/ext3.fs, maybe we should lump ext3 with tmpfs, too.

Still, my search brought up a Linux file system that gives guarantees: In nilfs2.txt it says:

order=strict	Apply strict in-order semantics that preserves sequence
		of all file operations including overwriting of data
		blocks.  That means, it is guaranteed that no
		overtaking of events occurs in the recovered file
		system after a crash.

Yes, that's exactly the guarantee I want to see. This means that any application that keeps its files consistent as visible from other processes will also have consistent files after an OS crash.

ext4 and data consistency

Posted May 16, 2010 3:57 UTC (Sun) by njs (guest, #40338) [Link]

> For whom? For me data consistency is much more important

That's fine. I'd like data consistency too. But I still don't mount my disks with -o sync, nor does pretty much anyone else, even most of the people who say they want data consistency. That's the reality that fs developers live in.

Maybe on SSD (where nilfs2 is designed to live), we'll be able to get guaranteed data consistency as a matter of course. That'll be nice if it happens.

ext4 and data consistency

Posted May 13, 2010 21:23 UTC (Thu) by mjg59 (subscriber, #23239) [Link] (1 responses)

The difference is that due to a fluke to Ext3's design the window that the 'zero length files' would be created on improper shutdown is much shorter then the same window for Ext4 (or XFS or whatever)

My understanding is that ext3 would always have allocated the blocks for the new file and written it before the rename would occur. The 0-length file issue was due to ext4 performing delayed allocation and performing the rename before the data ever got written.

So ya.. apparently that 'fsync' was always needed by application developers if they wanted to ensure that data was written to disk in a timely fashion.

This is a misunderstanding. The desired behaviour was that operations occur in order. It's not terribly important to a user if they lose the configuration changes they made before a crash - it's pretty significant if the rename was performed before the data hit disk, resulting in the complete loss of their configuration.

It's true that POSIX doesn't require that filesystems behave this way. There's many things that POSIX doesn't require but which we expect anyway because the alternative is misery.

ext4 and data consistency

Posted May 14, 2010 12:32 UTC (Fri) by ricwheeler (subscriber, #4980) [Link]

You need to be careful not to confuse the rename issue specifically with the need to use fsync() properly to make sure that data is on disk.

Applications still have to understand when to use fsync() properly to move data from the page cache out to persistent storage (on disk, ssd, etc).

ext4 and data consistency

Posted May 14, 2010 13:21 UTC (Fri) by anton (subscriber, #25547) [Link] (5 responses)

There is no file system in Linux that tries to assure that renames are atomic functions.

That may be true (wrt. what happens on crashes; I do hope that they are all atomic wrt state visible to other processes in regular operations); I certainly have never seen any Linux file system give any guarantees about data consistency on crashes. Not doing renames properly would be pretty poor of Linux, though, given that this is a case where even the old BSD FFS goes to extra lengths to ensure at least meta-data consistency (it never cares about your data).

Concerning Linux file systems, I am pretty sure that ext3 with the default data=ordered mode can result in an inconsistent data state if file overwriting is happening, but data consistency would be achievable for files that are freshly created (I don't know if ext3 actually achieves it, though). For ext4 I don't expect any data consistency.

So ya.. apparently that 'fsync' was always needed by application developers if they wanted to ensure that data was written to disk in a timely fashion.

Yes, but that's neither necessary nor sufficient for data consistency.

[...] in Ubuntu the dpkg program detects if it's running on Ext4 and goes into paranoid mode were it runs 'fsync' were as with Ext3 it does not. This causes Ubuntu installs to last significantly longer if you choose 'Ext4' file system.

Oh, really? We have dozens of Debian systems running on ext3 (presumably without paranoid mode), and we have not had a single problem with a dpkg database corrupted by the file system. What does Ubuntu do with dpkg that makes a significant difference in the length of the installation life? And where can I find the statistics on which you base this claim?

ext4 and data consistency

Posted May 14, 2010 17:50 UTC (Fri) by njs (guest, #40338) [Link] (4 responses)

> We have dozens of Debian systems running on ext3 (presumably without paranoid mode), and we have not had a single problem with a dpkg database corrupted by the file system.

No filesystem goes out and corrupts the dpkg database, but dpkg failing to properly ensure on-disk consistency might make it possible for an untimely power failure (or whatever) to trash its database. How often do you pull the plug while dpkg is running?

That's why robustness is so hard -- it's almost impossible to test. That doesn't mean it isn't important. All it takes is one power failure with just the right timing to trash a datastore. Which is, of course, the whole problem here -- it means that as users we have to rely on external signals, like how I still don't really trust MySQL, because sure, I know they have transactions now, but do I *really* trust a group who was at one point talking about how useless they are to later have the necessary mind-numbing paranoia to catch every edge case? And hey, over here there's Postgres, whose developers clearly *are* absurdly paranoid, excellent...

Or, how you don't trust ext4, even though you have no statistics on it either, because of how Ted T'so's messages came across. It's just a mystery to me how his basically sensible posts gave you (and others) this image of him as some kind of data-eating monster.

ext4 and data consistency

Posted May 14, 2010 19:15 UTC (Fri) by nix (subscriber, #2304) [Link] (2 responses)

That's why robustness is so hard -- it's almost impossible to test. That doesn't mean it isn't important. All it takes is one power failure with just the right timing to trash a datastore.

Virtualization and CoW should have made this much, much easier to test in a finegrained fashion; halt the VM you're using to do the testing, CoW the file, start a new VM using the CoWed copy and mount it; note if it failed and if so how, kill the VM, remove the CoWed copy of the file and let the VM run for another few milliseconds (or, if you're being completely pedantic, another instruction!)

ext4 and data consistency

Posted May 14, 2010 19:36 UTC (Fri) by njs (guest, #40338) [Link] (1 responses)

That's a neat idea. I don't think we have cycle-accurate VMs in FOSS yet, but it doesn't matter for this, you can do the halt/check after every disk write, not every instruction. It still doesn't solve a major part of the problem -- you also need to exercise all the weird corner cases that only arise under certain sorts of memory pressure, or what happens if the disk is fragmented in *this* way and has *this* queue write depth and that makes the elevator algorithm tempted to reorder writes in an unfortunate way, etc. -- but it'd be really useful!

ext4 and data consistency

Posted May 14, 2010 20:41 UTC (Fri) by nix (subscriber, #2304) [Link]

I don't think we have cycle-accurate VMs in FOSS yet

They just need to be accurate enough that stuff works. We're not trying to make Second Reality run, here. I can't think of anything that runs on Core 2 but not AMD Phenom because of differing instruction timings!

all the weird corner cases that only arise under certain sorts of memory pressure

Seems to me that the balloon driver is what we want; it can add memory to the guest on command, can't it also take it away? I don't see why we can't do an analogue of what SQLite does in its testing procedures (use a customized allocator that forces specific allocations to fail). The disk-fragmentation stuff would take a lot more work, probably a custom block allocator, which is a bit tough since the block allocator is one of the things we're trying to test!

ext4 and data consistency

Posted May 15, 2010 9:57 UTC (Sat) by anton (subscriber, #25547) [Link]

No filesystem goes out and corrupts the dpkg database, but dpkg failing to properly ensure on-disk consistency might make it possible for an untimely power failure (or whatever) to trash its database.

The file system does not have to go out to do it, because it was entrusted with that data; so it can just fail to keep it consistent while staying at home. A good file system will properly ensure on-disk consistency without extra help from applications (beyond applications keeping the files consistent from the view of other processes).

How often do you pull the plug while dpkg is running?

Never. And I doubt it happens in a significant number of cases for Ubuntu users, either. And the subset of cases where ext3 corrupts the database is even smaller. That's why I questioned the drag's claim.

That's why robustness is so hard -- it's almost impossible to test.

And that's why I find the attitude that not the file system, but applications should be responsible for data consistency in case of an OS crash or power outage absurd. Instead of testing one or a few file systems, thousands of applications would have to be tested.

ext4 and data consistency with dpkg

Posted Jun 18, 2010 5:38 UTC (Fri) by guillemj (subscriber, #49706) [Link]

> I know that this issue has cropped up again due to the fact that in
> Ubuntu the dpkg program detects if it's running on Ext4 and goes into
> paranoid mode were it runs 'fsync' were as with Ext3 it does not. This
> causes Ubuntu installs to last significantly longer if you choose
> 'Ext4' file system.
>
> If the dpkg folks were smart they'd enable paranoid mode on all file
> systems, except maybe Ext3 (due to Ext3's poor ability to handle that
> sort of workloads)

dpkg has always done fsync() on the internal database, it was only
missing doing fsync() for the extracted control files from a package
to be installed/upgraded (which include maintainer scripts for example).

As of recently, dpkg started doing fsync() before rename on *all*
file systems for all extracted files from a package (there's actually
never been any kind of file system detection or special "paranoid mode").
It also does now fsync() on all database related directories.

The reason for this has been mainly the zero-length issues with ext4
(appearing even with the recent rename heuristic fixes), as we've had
no previous bug reports of broken systems due to zero-length files on
any other file system. But I consider it was still a bug for something
like dpkg to not fsync() files, just because the package status would
not match the package installed data, which is an issue, but not as
grave as having empty files left around (think boot loader, kernel or
libc as example).

But those changes produced major performance regressions *only* on
ext4 (that we know as of now), so we implemented per package delayed
fsync()s + rename()s, which helped a bit with ext4, but not enough. We
have now switched to use delayed sync() + rename()s *only* on Linux
(because it's the only place were sync() is synchronous) which brings
performance closer to the initial values. ext3 didn't have a noticable
performance degradation during the implementation iterations.

The still present zero-length issues and performance issues with fsync()
have been reported to ext4 upstream, the solutions offered were to either
not use fsync() because it's slow and it's not feasible to make it faster,
use non-portable sync() or ignore the problem as it's not a usual case...
(most of the hundreds of duped reports in Ubuntu, which happens to have
ext4 as default file system in latest releases, were due to sudden power
off, and not to system crash which were a minority).

Not to mention this will be an issue if someone happens to port ext4 to
any non-Linux kernel where sync() is asynchronous, then the only options
for developers are either massive performance degradation or possible
data loss in case of abrupt system crashes/shutdown...

> As far as my personal opinion this is a advantage for using Ext4 over
> Ext3 since upgrades will be much safer on my laptop...

Well, whatever happens in maintainer scripts for example is not synced,
so there's still room for data loss with dpkg on ext4...

I've just checked if rpm is doing any kind of sync for extracted files
before rename() and it does not seem so, I'm guessing other packaging
systems might be susceptible to this issue too, but I've not checked.
This is something they might also want to consider doing, in case those
systems start offering ext4 as installation file system, or they might
start suffering the same kind of bug reports as Ubuntu saw. :/