The Next3 filesystem
Posted May 13, 2010 16:57 UTC (Thu) by cortana (subscriber, #24596)
In reply to: The Next3 filesystem by anton
Parent article: The Next3 filesystem
Posted May 13, 2010 18:38 UTC (Thu)
by anton (subscriber, #25547)
[Link] (32 responses)
He fixed one particularly frequent cause of data loss in ext4 (involving
writing a file, then renaming it across an old one), but nothing else.
So people will see data loss with ext4 less frequently than before,
but not as infrequently as with ext3 (or has this data loss feature
been backported from ext4 to ext3 to give us fewer reasons to stick
with ext3?).
Posted May 13, 2010 20:08 UTC (Thu)
by rahulsundaram (subscriber, #21946)
[Link] (9 responses)
Posted May 14, 2010 12:54 UTC (Fri)
by anton (subscriber, #25547)
[Link] (8 responses)
Posted May 14, 2010 14:00 UTC (Fri)
by rahulsundaram (subscriber, #21946)
[Link]
Posted May 21, 2010 15:08 UTC (Fri)
by Duncan (guest, #6647)
[Link] (6 responses)
Presumably you used tune2fs or simply fstab to ensure your ext3 mounts remain stable with data=ordered after the kernel in question (was it 2.6.30 or 2.6.31?), right?
What would be interesting to see is how the distributions have handled it since. Did they go with the new ext3 data=writeback default, or have they either reverted that commit or set up their userspace to specify data=ordered by default?
I know of at least one guy who complained of ext3 instability after installing a new kernel because of that change; it went away when he returned to data=ordered for his ext3 volumes. The context of that discussion was the pan (nntp client) user list, IIRC.
Me, I've been on reiserfs for years on both my main system and (more recently) my netbook, and have been extremely happy with it since data=ordered became its default (2.6.6 according to a google hit on another LWN comment of mine). My most recent experience with extX is on no-journal ext4 formatted USB flash-based thumbdrives, where journaling isn't a good idea. I've been following btrfs with interest, and expect I'll upgrade to it once a few more of the kinks get worked out. (I've seen hints that the current 2.6.35 cycle will reduce the strength of the warning for its kernel config item, but I don't follow the btrfs list or lkml, and any detail of even plans has been harder to come by on the broader community sites such as LWN, HO, LXer, etc, that I follow.)
Duncan
Posted May 22, 2010 19:15 UTC (Sat)
by anton (subscriber, #25547)
[Link] (5 responses)
I am a little worried, though, because of what happened after
data=journal was no longer the default; I then read that using
data=journal resulted in corrupt file systems; I read that for a
significant amount of time, and never read that this bug has been
fixed (but haven't seen such reports for some time).
So if they made data=ordered non-default in 2.6.31 or some kernel,
will they really care if it works? My confidence is limited. We
should probably stick with 2.6.30 until we migrate off extX file
systems completely.
Posted May 22, 2010 20:36 UTC (Sat)
by nix (subscriber, #2304)
[Link] (4 responses)
(btw, you can put mount options in the superblock, and avoid modifying /etc/fstab.)
Posted May 23, 2010 11:44 UTC (Sun)
by anton (subscriber, #25547)
[Link] (3 responses)
Modifying fstab is not a big deal; why would I want to avoid it?
The problem with doing it in the superblock is that I have to do it
again when I transfer the system to another disk.
Posted May 23, 2010 11:50 UTC (Sun)
by cortana (subscriber, #24596)
[Link] (1 responses)
Posted May 23, 2010 13:29 UTC (Sun)
by anton (subscriber, #25547)
[Link]
Another way would be to check CONFIG_EXT3_DEFAULTS_TO_ORDERED in
the kernel config file.
Posted May 23, 2010 13:55 UTC (Sun)
by nix (subscriber, #2304)
[Link]
Posted May 13, 2010 20:19 UTC (Thu)
by drag (guest, #31333)
[Link] (21 responses)
Ext3, Ext4, XFS, JFS, etc. all have the same consistency problems you're complaining about.
The difference is that, due to a fluke of Ext3's design, the window in which 'zero length files' can be created on improper shutdown is much shorter than the same window for Ext4 (or XFS or whatever).
What is more, with 2.6.30 a patch was added to Ext4 that attempts to detect this pattern and replicate Ext3's behavior, in order to maintain backwards compatibility with application developers' assumptions about file system behavior with regard to renames.
So ya.. apparently that 'fsync' was always needed by application developers if they wanted to ensure that data was written to disk in a timely fashion.
---------------------------
I know that this issue has cropped up again because in Ubuntu the dpkg program detects whether it's running on Ext4 and goes into a paranoid mode where it runs 'fsync', whereas with Ext3 it does not. This causes Ubuntu installs to last significantly longer if you choose the 'Ext4' file system.
If the dpkg folks were smart they'd enable paranoid mode on all file systems, except maybe Ext3 (due to Ext3's poor ability to handle that sort of workload).
In my personal opinion this is an advantage of using Ext4 over Ext3, since upgrades will be much safer on my laptop...
---------------------------
The one feature that I like about Ext4 is that it takes a minute or two to run a full fsck on my home directory versus upwards of 15-20 minutes for the same operation on Ext3.
Posted May 13, 2010 20:45 UTC (Thu)
by quotemstr (subscriber, #45331)
[Link] (11 responses)
Posted May 13, 2010 23:07 UTC (Thu)
by njs (guest, #40338)
[Link] (10 responses)
Posted May 14, 2010 13:43 UTC (Fri)
by anton (subscriber, #25547)
[Link] (9 responses)
E.g., I expect data consistency from a file system; Linux file
systems don't give any guarantee on that, but at least ext3 does ok in
most cases; some people may consider this a fluke (but is Stephen
Tweedie, the creator of ext3, among them?), but that's the reality.
Other people expect maximum speed. And for these people Linux
provides tmpfs and ext4.
Given this choice, ext4 is certainly not a replacement for ext3 for
me.
Posted May 14, 2010 15:53 UTC (Fri)
by bronson (subscriber, #4806)
[Link] (5 responses)
Posted May 15, 2010 8:36 UTC (Sat)
by anton (subscriber, #25547)
[Link] (4 responses)
Posted May 20, 2010 19:23 UTC (Thu)
by oak (guest, #2786)
[Link] (3 responses)
Wouldn't "strace -f" be handier for that kind of thing? With that you also notice a lot of other stuff that the SW does.
Strace-account script gives an overview of file accesses in the strace output:
Posted May 21, 2010 12:04 UTC (Fri)
by anton (subscriber, #25547)
[Link] (2 responses)
Posted Jun 8, 2010 22:17 UTC (Tue)
by elanthis (guest, #6227)
[Link] (1 responses)
Posted Jun 9, 2010 9:06 UTC (Wed)
by anton (subscriber, #25547)
[Link]
Posted May 14, 2010 17:33 UTC (Fri)
by njs (guest, #40338)
[Link] (2 responses)
But maybe there are other cases where ext3 does better than ext4. You must have some excellent ones in mind to lump ext4 in with tmpfs... can you give any examples?
Posted May 15, 2010 9:19 UTC (Sat)
by anton (subscriber, #25547)
[Link] (1 responses)
As for an example: Consider a process writing file A and then file
B. With ext4 I expect that it can happen that after recovery B is
present and A is not or is empty. With ext3 I expect that this does
not happen. But given that I did not find any documented guarantees
in Documentation/filesystems/ext3.fs, maybe we should lump ext3 with
tmpfs, too.
Still, my search brought up a Linux file system that gives
guarantees: In nilfs2.txt it says:
Posted May 16, 2010 3:57 UTC (Sun)
by njs (guest, #40338)
[Link]
That's fine. I'd like data consistency too. But I still don't mount my disks with -o sync, nor does pretty much anyone else, even most of the people who say they want data consistency. That's the reality that fs developers live in.
Maybe on SSD (where nilfs2 is designed to live), we'll be able to get guaranteed data consistency as a matter of course. That'll be nice if it happens.
Posted May 13, 2010 21:23 UTC (Thu)
by mjg59 (subscriber, #23239)
[Link] (1 responses)
My understanding is that ext3 would always have allocated the blocks for the new file and written it before the rename would occur. The 0-length file issue was due to ext4 performing delayed allocation and performing the rename before the data ever got written.
So ya.. apparently that 'fsync' was always needed by application developers if they wanted to ensure that data was written to disk in a timely fashion.
This is a misunderstanding. The desired behaviour was that operations occur in order. It's not terribly important to a user if they lose the configuration changes they made before a crash - it's pretty significant if the rename was performed before the data hit disk, resulting in the complete loss of their configuration.
It's true that POSIX doesn't require that filesystems behave this way. There are many things that POSIX doesn't require but which we expect anyway because the alternative is misery.
Posted May 14, 2010 12:32 UTC (Fri)
by ricwheeler (subscriber, #4980)
[Link]
Applications still have to understand when to use fsync() properly to move data from the page cache out to persistent storage (on disk, ssd, etc).
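The write-then-rename pattern discussed in this subthread can be sketched as follows. This is only an illustration under assumptions, not what any particular application does: the helper name atomic_write is hypothetical, a POSIX system is assumed, and the final fsync() on the containing directory is an extra step that the comments above do not mention.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data` (bytes) using write, fsync, rename.

    After a crash, `path` should hold either the complete old contents
    or the complete new contents, not an empty or partial file.
    Hypothetical helper for illustration only.
    """
    dirname = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so that the
    # rename stays within one file system.
    fd, tmp = tempfile.mkstemp(dir=dirname, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            # Push the new data to stable storage before the rename
            # makes it visible under the final name.
            os.fsync(f.fileno())
        # os.replace() is a rename() on POSIX systems.
        os.replace(tmp, path)
    except BaseException:
        # The temporary file is still present if anything above failed.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    # Assumption: also flush the directory entry (works on Linux).
    dfd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

Without the fsync() before the rename, the outcome after a crash depends on whether the file system orders the data write before the rename, which is exactly the point of disagreement in this thread.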
Posted May 14, 2010 13:21 UTC (Fri)
by anton (subscriber, #25547)
[Link] (5 responses)
Concerning Linux file systems, I am pretty sure that ext3 with the
default data=ordered mode can result in an inconsistent data state if
file overwriting is happening, but data consistency would be
achievable for files that are freshly created (I don't know if ext3
actually achieves it, though). For ext4 I don't expect any data
consistency.
Posted May 14, 2010 17:50 UTC (Fri)
by njs (guest, #40338)
[Link] (4 responses)
No filesystem goes out and corrupts the dpkg database, but dpkg failing to properly ensure on-disk consistency might make it possible for an untimely power failure (or whatever) to trash its database. How often do you pull the plug while dpkg is running?
That's why robustness is so hard -- it's almost impossible to test. That doesn't mean it isn't important. All it takes is one power failure with just the right timing to trash a datastore. Which is, of course, the whole problem here -- it means that as users we have to rely on external signals, like how I still don't really trust MySQL, because sure, I know they have transactions now, but do I *really* trust a group who was at one point talking about how useless they are to later have the necessary mind-numbing paranoia to catch every edge case? And hey, over here there's Postgres, whose developers clearly *are* absurdly paranoid, excellent...
Or, how you don't trust ext4, even though you have no statistics on it either, because of how Ted Ts'o's messages came across. It's just a mystery to me how his basically sensible posts gave you (and others) this image of him as some kind of data-eating monster.
Posted May 14, 2010 19:15 UTC (Fri)
by nix (subscriber, #2304)
[Link] (2 responses)
Posted May 14, 2010 19:36 UTC (Fri)
by njs (guest, #40338)
[Link] (1 responses)
Posted May 14, 2010 20:41 UTC (Fri)
by nix (subscriber, #2304)
[Link]
Posted May 15, 2010 9:57 UTC (Sat)
by anton (subscriber, #25547)
[Link]
Posted Jun 18, 2010 5:38 UTC (Fri)
by guillemj (subscriber, #49706)
[Link]
dpkg has always done fsync() on the internal database, it was only
As of recently, dpkg started doing fsync() before rename on *all*
The reason for this has been mainly the zero-length issues with ext4
But those changes produced major performance regressions *only* on
The still present zero-length issues and performance issues with fsync()
Not to mention this will be an issue if someone happens to port ext4 to
> As far as my personal opinion this is an advantage for using Ext4 over
Well, whatever happens in maintainer scripts for example is not synced,
I've just checked if rpm is doing any kind of sync for extracted files
Ted Ts'o still believes that data consistency on OS crashes (not
application crashes) is the job of the applications (with fsync()
etc.), not of the file system. And most applications don't do that, and those few that try it
are probably not well tested against that (because that's extremely
hard).
ext4 and data consistency
Am I? That's Ted Ts'o's position as reported on, e.g., LWN. But
maybe you can show me where I was wrong in my statement of his
position. And my impression is that if it was just up to him, he
would not have made the rename fix.
ext4 and data consistency
What bothers me is how they reduced the guarantees and
stability of the long mature ext3 filesystem in the aftermath of all
this, by defaulting it to data=writeback, a change from the old
default data=ordered.
Yes, that's what was at the back of my mind when I wrote about
"backporting the data loss feature from ext4 to ext3".
Presumably you used tune2fs or simply fstab to ensure your
ext3 mounts remain stable with data=ordered after the kernel in
question (was it 2.6.30 or 2.6.31?), right?
The youngest kernel we have is 2.6.30, and according to /proc/mounts
it mounts our ext3 file systems with data=ordered. I guess we will go
the fstab route once we get a kernel that defaults to data=writeback.
ext4 and data consistency
And new bugs are introduced, and if they are for a non-default option
like (now) data=ordered, they won't get noticed in time, and they
won't get fixed for quite some time; at least that's what the
non-default data=journal episode teaches. So what's higher: the risk
of data loss from a well-known kernel, or from a new kernel in a
non-default setting? Choosing the latter seems foolish to me.
ext4 and data consistency
One way is to mount such a file system with the default value (without
overriding the default with tune2fs or in fstab), and then check the
actual options in /proc/mounts. That is what I do.
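As a rough sketch of that check, the following parses text in /proc/mounts format and reports the data= option of each ext3 mount. The function name ext3_data_modes is made up for this example, and a result of None only means that no data= option was listed for that mount; whether the kernel lists it at all is an assumption that may depend on the kernel version.

```python
def ext3_data_modes(mounts_text):
    """Map each ext3 mount point to its data= mode, or None if not listed.

    `mounts_text` is text in /proc/mounts format, with the fields
    device, mount point, fs type, options, dump, pass.
    Hypothetical helper for illustration only.
    """
    modes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "ext3":
            continue
        mode = None
        for option in fields[3].split(","):
            if option.startswith("data="):
                mode = option[len("data="):]
        modes[fields[1]] = mode
    return modes
```

On a Linux system it would be called with the contents of /proc/mounts, for example ext3_data_modes(open("/proc/mounts").read()).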
ext4 and data consistency
The difference is that, due to a fluke of Ext3's design, the window in which 'zero length files' can be created on improper shutdown is much shorter than the same window for Ext4 (or XFS or whatever).
That window must be vanishingly small because neither I nor anyone else has ever been able to make ext3 create zero-length files in the way you describe. Quirk or not, rename atomicity is an important feature that works just fine on a running filesystem, and filesystems ought to preserve its qualities on a restart.
Allowing random garbage to exist on the filesystem after a restart is terrible policy and reflects a profound ignorance on the part of filesystem developers as to how applications and users expect their systems to work.
ext4 and data consistency
[...] trade-offs fs developers have to make, the disparity between what
people want from a fs and what fs's have historically provided, etc.
Yes, different people expect different things from file systems.
Keep in mind that if you go two web-pages over, you can
find people tearing into POSIX for providing *too* strong guarantees
and how we absolutely need to relax them for real-world usage (atime
is the obvious example, but there are others).
Yes, there are different kinds of users. I lost quite a bit of time
because Linux does not follow POSIX atime semantics by default
anymore. I find them useful in my real-world usage. Those who don't
want atime have been able to use noatime for a long time, and now
there is relatime, but making it the default (especially with mounts
that don't know about strictatime) is a bad practice.
ext4 and data consistency
I use atime to check whether some complex software really does access
the files that I think it does.
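A minimal sketch of such an atime check, assuming a file system that actually updates access times: with noatime, or with relatime (which updates atime only if it is older than mtime or ctime, or more than a day old), a read may not change the value at all. The function names are invented for this illustration.

```python
import os


def atime_ns(path):
    """Return the last access time of `path` in nanoseconds."""
    return os.stat(path).st_atime_ns


def accessed_since(path, earlier_atime_ns):
    """True if the access time of `path` is later than a recorded value."""
    return atime_ns(path) > earlier_atime_ns
```

The idea is to record atime_ns() for the files of interest, run the software, and then call accessed_since() with the recorded values.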
ext4 and data consistency
http://blogs.gnome.org/mortenw/2005/12/14/strace-account/
It would not be handier exactly because it tells me a huge amount of other stuff the software does and that I am not interested in.
ext4 and data consistency
And how is that handier than just doing "stat <file>"?
ext4 and data consistency
Speed is too important
For whom? For me data consistency is much more important. Before
barriers were supported, we ran ext3 on IDE disks without write
caching, and that's really slow. The file system was still fast
enough.
Stephen Tweedie didn't make data=journal the default, either.
Actually he did, at least at the start. Later it got changed (by
whom?) to data=ordered; that still has the potential to provide data
consistency unless existing files are overwritten.
order=strict Apply strict in-order semantics that preserves sequence
of all file operations including overwriting of data
blocks. That means, it is guaranteed that no
overtaking of events occurs in the recovered file
system after a crash.
Yes, that's exactly the guarantee I want to see. This means that any
application that keeps its files consistent as visible from other
processes will also have consistent files after an OS crash.
ext4 and data consistency
The difference is that, due to a fluke of Ext3's design, the window in which 'zero length files' can be created on improper shutdown is much shorter than the same window for Ext4 (or XFS or whatever).
There is no file system in Linux that tries to assure
that renames are atomic functions.
That may be true (wrt. what happens on crashes; I do hope that they
are all atomic wrt state visible to other processes in regular
operations); I certainly have never seen any Linux file system give
any guarantees about data consistency on crashes. Not doing renames
properly would be pretty poor of Linux, though, given that this is a
case where even the old BSD FFS goes to extra lengths to ensure at
least meta-data consistency (it never cares about your data).
So ya.. apparently that 'fsync' was always needed by
application developers if they wanted to ensure that data was written
to disk in a timely fashion.
Yes, but that's neither necessary nor sufficient for data consistency.
[...] in Ubuntu the dpkg program detects if it's running on
Ext4 and goes into paranoid mode where it runs 'fsync' whereas with
Ext3 it does not. This causes Ubuntu installs to last significantly
longer if you choose 'Ext4' file system.
Oh, really? We have dozens of Debian systems running on ext3
(presumably without paranoid mode), and we have not had a single
problem with a dpkg database corrupted by the file system. What does
Ubuntu do with dpkg that makes a significant difference in
the length of the installation life? And where can I find the
statistics on which you base this claim?
ext4 and data consistency
That's why robustness is so hard -- it's almost impossible to test. That doesn't mean it isn't important. All it takes is one power failure with just the right timing to trash a datastore.
Virtualization and CoW should have made this much, much easier to test in a fine-grained fashion: halt the VM you're using to do the testing, CoW the file, start a new VM using the CoWed copy and mount it; note if it failed and if so how, kill the VM, remove the CoWed copy of the file and let the VM run for another few milliseconds (or, if you're being completely pedantic, another instruction!).
ext4 and data consistency
I don't think we have cycle-accurate VMs in FOSS yet
They just need to be accurate enough that stuff works. We're not trying to make Second Reality run, here. I can't think of anything that runs on Core 2 but not AMD Phenom because of differing instruction timings!
all the weird corner cases that only arise under certain sorts of memory pressure
Seems to me that the balloon driver is what we want; it can add memory to the guest on command, can't it also take it away? I don't see why we can't do an analogue of what SQLite does in its testing procedures (use a customized allocator that forces specific allocations to fail). The disk-fragmentation stuff would take a lot more work, probably a custom block allocator, which is a bit tough since the block allocator is one of the things we're trying to test!
ext4 and data consistency
No filesystem goes out and corrupts the dpkg database,
but dpkg failing to properly ensure on-disk consistency might make it
possible for an untimely power failure (or whatever) to trash its
database.
The file system does not have to go out to do it, because it was
entrusted with that data; so it can just fail to keep it consistent
while staying at home. A good file system will properly ensure
on-disk consistency without extra help from applications (beyond
applications keeping the files consistent from the view of other
processes).
How often do you pull the plug while dpkg is running?
Never. And I doubt it happens in a significant number of cases for
Ubuntu users, either. And the subset of cases where ext3 corrupts the
database is even smaller. That's why I questioned drag's claim.
That's why robustness is so hard -- it's almost impossible to test.
And that's why I find the attitude that not the file system, but
applications should be responsible for data consistency in case of an
OS crash or power outage absurd. Instead of testing one or a few file
systems, thousands of applications would have to be tested.
ext4 and data consistency with dpkg
> Ubuntu the dpkg program detects if it's running on Ext4 and goes into
> paranoid mode where it runs 'fsync' whereas with Ext3 it does not. This
> causes Ubuntu installs to last significantly longer if you choose
> 'Ext4' file system.
>
> If the dpkg folks were smart they'd enable paranoid mode on all file
> systems, except maybe Ext3 (due to Ext3's poor ability to handle that
> sort of workloads)
missing doing fsync() for the extracted control files from a package
to be installed/upgraded (which include maintainer scripts for example).
file systems for all extracted files from a package (there's actually
never been any kind of file system detection or special "paranoid mode").
It also does now fsync() on all database related directories.
(appearing even with the recent rename heuristic fixes), as we've had
no previous bug reports of broken systems due to zero-length files on
any other file system. But I consider it was still a bug for something
like dpkg to not fsync() files, just because the package status would
not match the package installed data, which is an issue, but not as
grave as having empty files left around (think boot loader, kernel or
libc as example).
ext4 (that we know as of now), so we implemented per package delayed
fsync()s + rename()s, which helped a bit with ext4, but not enough. We
have now switched to use delayed sync() + rename()s *only* on Linux
(because it's the only place where sync() is synchronous) which brings
performance closer to the initial values. ext3 didn't have a noticeable
performance degradation during the implementation iterations.
have been reported to ext4 upstream, the solutions offered were to either
not use fsync() because it's slow and it's not feasible to make it faster,
use non-portable sync() or ignore the problem as it's not a usual case...
(most of the hundreds of duped reports in Ubuntu, which happens to have
ext4 as default file system in latest releases, were due to sudden power
off, and not to system crash which were a minority).
any non-Linux kernel where sync() is asynchronous, then the only options
for developers are either massive performance degradation or possible
data loss in case of abrupt system crashes/shutdown...
> Ext3 since upgrades will be much safer on my laptop...
so there's still room for data loss with dpkg on ext4...
before rename() and it does not seem so, I'm guessing other packaging
systems might be susceptible to this issue too, but I've not checked.
This is something they might also want to consider doing, in case those
systems start offering ext4 as installation file system, or they might
start suffering the same kind of bug reports as Ubuntu saw. :/
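The delayed fsync() + rename() scheme described in this comment can be sketched roughly as below. This is not dpkg's actual code: the function install_staged, its arguments, and the choice between per-file fsync() and a single sync() are assumptions made for illustration, and calling fsync() on a descriptor opened read-only is assumed to work as it does on Linux.

```python
import os


def install_staged(staged, use_sync=False):
    """Flush all staged files first, then rename them into place.

    `staged` is a list of (temp_path, final_path) pairs whose temporary
    files have already been written. Hypothetical helper, for
    illustration only.
    """
    if use_sync:
        # One global flush. The comment above notes that sync() is only
        # synchronous on Linux, so this variant is not portable.
        os.sync()
    else:
        # Delayed per-file fsync(): all writes were issued earlier, so
        # the kernel has had time to start writing the data out.
        for tmp, _final in staged:
            fd = os.open(tmp, os.O_RDONLY)
            try:
                os.fsync(fd)
            finally:
                os.close(fd)
    # Only after the data is on disk are the files renamed into place.
    for tmp, final in staged:
        os.replace(tmp, final)
```

Batching the flushes separately from the renames avoids a write, fsync, rename cycle for each individual file, which is where the performance cost described above comes from.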