Re: [Qemu-devel] Ensuring data is written to disk


From: Bill C. Riemers
Subject: Re: [Qemu-devel] Ensuring data is written to disk
Date: Wed, 2 Aug 2006 11:56:16 -0400

Just to throw in my two cents: I notice that on the namesys website, they claim reiser4 is completely safe in the event of a power failure, while reiserfs 3 still requires some recovery.  Apparently reiser4 somehow sequences its writes so that each change is an atomic event: either the whole change is on disk, or none of it.  I am not sure how this is accomplished given the state of disk caching...  Perhaps that is why they don't consider reiser4 ready for prime-time use.

Bill


On 8/2/06, Jamie Lokier <address@hidden> wrote:
Jens Axboe wrote:
> > > For SATA you always need at least one cache flush (you need one if you
> > > have the FUA/Forced Unit Access write available, you need two if not).
> >
> > Well my question wasn't intended to be specific to ATA (sorry if that
> > wasn't clear), but a general question about writing to disks on Linux.
> >
> > And I don't understand your answer.  Are you saying that reiserfs on
> > Linux (presumably 2.6) commits data (and file metadata) to disk
> > platters before returning from fsync(), for all types of disk
> > including PATA, SATA and SCSI?  Or if not, is that a known property of
> > PATA only, or PATA and SATA only?  (And in all cases, presumably only
> > "ordinary" controllers can be depended on, not RAID controllers or
> > USB/Firewire bridges which ignore cache flushes for no good reason).
>
> blkdev_issue_flush() is brutal, but it works on SATA/PATA/SCSI. So yes,
> it should be reliable.

Ah, thanks.  I've looked at that bit of reiserfs, xfs and ext3 now.

It looks like adding a single call to blkdev_issue_flush() at the end
of ext3_sync_file() would do the trick.  I'm surprised that one-line
patch isn't in there already.

Of course that doesn't help with writing an application to reliably
commit on existing systems.

> > > > 2. Do you know if ext3 (in ordered mode) w/barriers on Linux does it too,
> > > >    for in-place writes which don't modify the inode and therefore don't
> > > >    have a journal entry?
> > >
> > > I don't think that it does, however it may have changed. A quick grep
> > > would seem to indicate that it has not changed.
> >
> > Ew.  What do databases do to be reliable then?  Or aren't they, on Linux?
>
> They probably run on better storage than commodity SATA drives with
> write back caching enabled. To my knowledge, Linux is one of the only OS
> that even attempts to fix this.

I would imagine most of the MySQL databases backing small web sites
run on commodity PATA or SATA drives, and that most people have
assumed fsync() to be good enough for database commits in the absence
of hardware failure, or when one disk goes down in a RAID.  Time to
correct those misassumptions!

> > > > On Darwin, fsync() does not issue CACHEFLUSH to the drive.  Instead,
> > > > it has an fcntl F_FULLFSYNC which does that, which is documented in
> > > > Darwin's fsync() page as working with all Darwin's filesystems,
> > > > provided the hardware honours CACHEFLUSH or the equivalent.
> > >
> > > That seems somewhat strange to me, I'd much rather be able to say that
> > > fsync() itself is safe. An added fcntl hack doesn't really help the
> > > applications that already rely on the correct behaviour.
> >
> > According to the Darwin fsync(2) man page, it claims Darwin is the
> > only OS which has a facility to commit the data to disk platters.
> > (And it claims to do this with IDE, SCSI and FibreChannel.  With
> > journalling filesystems, it requests the journal to do the commit but
> > the cache flush still ultimately reaches the disk.  Sounds like a good
> > implementation to me).
>
> The implementation may be nice, but it's the idea that is appalling to
> me. But it sounds like the Darwin man page is out of date, or at least
> untrue.
>
> > SQLite (a nice open source database) will use F_FULLFSYNC on Darwin to
> > do this, and it appears to add a large performance penalty relative to
> > using fsync() alone.  People noticed and wondered why.
>
> Disk cache flushes are nasty, they stall everything. But it's still
> typically faster than disabling write back caching, so...

I agree that it's nasty.  But then, the fsync() interface is rather
sub-optimal.  E.g. something like sendmail which writes a new file
needs to fsync() on the file _and_ its parent directory.  You don't
want two disk flushes then, just one after both fsync() calls have
completed.  Similarly if you're doing anything where you want to
commit data to more than one file.  An fsync_multi() interface would
be more efficient.

> > Other OSes show similar performance as Darwin with fsync() only.
> >
> > So it looks like the man page is probably accurate: other OSes,
> > particularly including Linux, don't commit the data reliably to disk
> > platters when using fsync().
>
> How did you reach that conclusion?

From seeing the reported timings for SQLite on Linux and Darwin
with/without F_FULLFSYNC.  The Linux timings were similar to Darwin
without F_FULLFSYNC.  Others and I assumed the timings were
probably I/O bound, and reflected the transactions going to disk.  But
it could be Darwin being slower :-)

> reiser certainly does it if you have barriers enabled (which you
> need anyways to be safe with write back caching), and with a little
> investigation we can perhaps conclude that XFS is safe as well.

Yes, reiser and XFS look quite convincing.  Although I notice the
blkdev_issue_flush is conditional in both, and the condition is
non-trivial.  I'll assume the authors thought specifically about this.

> > In which case, I'd imagine that's why Darwin has a separate option,
> > because if Darwin's fsync() was many times slower than all the other
> > OSes, most people would take that as a sign of a badly performing OS,
> > rather than understanding the benefits.
>
> That sounds like marketing driven engineering, nice. It requires app
> changes, which is pretty silly. I would much rather have a way of just
> enabling/disabling full flush on a per-device basis, you could use the
> cache type as the default indicator of whether to issue the cache flush
> or not. Then let the admin override it, if he wants to run unsafe but
> faster.

I agree, that makes sense to me too.

> > > > from what little documentation I've found, on Linux it appears to be
> > > > much less predictable.  It seems that some filesystems, with some
> > > > kernel versions, and some mount options, on some types of disk, with
> > > > some drive settings, will commit data to a platter before fsync()
> > > > returns, and others won't.  And an application calling fsync() has no
> > > > easy way to find out.  Have I got this wrong?
> > >
> > > Nope, I'm afraid that is pretty much true... reiser and (it looks like,
> > > just grepped) XFS have the best support for this. Unfortunately I don't think
> > > the user can actually tell if the OS does the right thing, outside of
> > > running a blktrace and verifying that it actually sends a flush cache
> > > down the queue.
> >
> > Ew.  So what do databases on Linux do?  Or are database commits
> > unreliable because of this?
>
> See above.

I conclude that database commits _are_ unreliable on Linux on a
disturbingly large number of smaller setups.

With ext3 on 2.6 and IDE write cache enabled, fsync() does not even
guarantee the ordering of writes, let alone commit them properly.
This is because it omits a journal commit (and hence IDE barrier), if
the data writes haven't changed the inode, which they don't if it's
within the 1-second mtime granularity.

O_SYNC on ext3 suffers the same problems.  (I don't know if O_SYNC
commits data to platters on reiser and XFS, or maintains write
ordering; I guess that fsync() should be called when those are
needed).

Considering the marketing of ext3 as offering data integrity, I'm
disappointed.

An ugly workaround suggests itself, which is to forcibly modify the
inode after writing and before calling fsync(): write, utime, utime,
fsync.  As a side effect of the journal barrier, it will cause a cache
flush to disk.

> > > > ps. (An aside question): do you happen to know of a good patch which
> > > > implements IDE barriers w/ ext3 on 2.4 kernels?  I found a patch by
> > > > googling, but it seemed that the ext3 parts might not be finished, so
> > > > I don't trust it.  I've found turning off the IDE write cache makes
> > > > writes safe, but with a huge performance cost.
> > >
> > > The hard part (the IDE code) can be grabbed from the SLES8 latest
> > > kernels, I developed and tested the code there. That also has the ext3
> > > bits, IIRC.
> >
> > Thanks muchly!  I will definitely take a look at that.  I'm working on
> > a uClinux project which must use a 2.4 kernel, and performance with
> > write cache off has been a real problem.  And I've seen fs corruption
> > after power cycles with write cache on many times, as expected.
>
> No problem.

Have looked, it's most helpful, and I will use your patches.
Ironically, that 2.4 patch seems to include reliable commits w/ ext3,
because every fsync() commits a journal entry.  Er, I think.  (It was
optimised away in 2.6: http://lkml.org/lkml/2004/3/18/36).

> > It's a shame the ext3 bits don't do fsync() to the platter though. :-/
>
> It really is, apparently none of the ext3 guys care about write back
> caching problems. The only guy wanting to help with the ext3 bits was
> Andrew. In the reiserfs guys' favor, they have actively been pursuing
> solutions to this problem. And XFS recently caught up and should be just
> as good on the barrier side, I have yet to verify the fsync() part.

There's a call to blkdev_issue_flush in XFS's fsync(), so it looks
promising.  I'm not sure exactly what the condition for calling it
depends on, but it seems likely the authors have thought it through.

> > To reliably commit data to an ext3 file, should we do ioctl(block_dev,
> > HDIO_SET_WCACHE, 1) on 2.6 kernels on IDE?  (The side effects look to
>
> Did you mean (..., 0)? And yes, it looks like it right now that fsync()
> isn't any better than other OS on ext3, so disabling write back caching
> is the safest.

I meant (..., 1).  For some reason I thought the call to
update_ordered() in ide-disk.c issued a barrier, a convenient side
effect of HDIO_SET_WCACHE.  But on re-reading, it doesn't issue a
barrier.  So that's not a solution.

(..., 0) sucks performance wise.  I think calling utime to dirty the
inode prior to fsync() will work with ext3, but it's ugly for many
reasons, not least that it will work on IDE, but it won't work on
anything (e.g. SCSI) which uses ordered tags rather than flushes.

> > me like they may create a barrier then flush the cache, even when it's
> > already enabled, but only on 2.6 kernels).  Or is there a better way?
> > (I don't see any way to do it on vanilla 2.4 kernels).
>
> 2.4 vanilla doesn't have barrier support, unfortunately.

I was wondering how to force an IDE cache flush on 2.4, from the
application after it's called fsync().  No barrier support implied.  I
guess there is some way to do it using the IDE taskfile ioctls?
Nothing is clear here, unfortunately.

I'm surprised blkdev_issue_flush (or the equivalent in 2.4) isn't
available to userspace through a block device ioctl.  There is
BLKFLSBUF which _almost_ pretends to do it, but that doesn't issue a
low-level disk flush, and it invalidates the read-cached data.

> > Should we change to only reiserfs and expect fsync() to commit data
> > reliably only with that fs?  I realise this is a lot of difficult
> > questions, that apply to more than just Qemu...
>
> Yes, reiser is the only one that works reliably across power loss with
> write back caching for the journal commits as well as fsync guarantees.

I'll try it.  I see enough problems with ext3 on a tiny embedded
system (writes stalling for a long time, read-cached data being
re-read from disk every 5 seconds) that I was avoiding reiser because
I thought it would be more complicated.  That, and I have high faith
in e2fsck.  But given the problems with ext3, maybe I'll get better
embedded results with reiser :)

-- Jamie

