Re: [Qemu-devel] [PATCH] bdrv_aio_flush


From: Jamie Lokier
Subject: Re: [Qemu-devel] [PATCH] bdrv_aio_flush
Date: Mon, 1 Sep 2008 14:25:02 +0100
User-agent: Mutt/1.5.13 (2006-08-11)

Ian Jackson wrote:
> Andrea Arcangeli writes ("[Qemu-devel] [PATCH] bdrv_aio_flush"):
> > while reading the aio/ide code I noticed the bdrv_flush operation is
> > unsafe. When a write command is submitted with bdrv_aio_write and
> > later bdrv_flush is called, fsync will do nothing. fsync only sees the
> > kernel writeback cache. But the write command is still queued in the
> > aio kernel thread and is still invisible to the kernel. bdrv_aio_flush
> > will instead see both the regular bdrv_write (that submits data to the
> > kernel synchronously) as well as the bdrv_aio_write as the fsync will
> > be queued at the end of the aio queue and it'll be issued by the aio
> > pthread thread itself.
> 
> I think this is fine.  We discussed this some time ago.  bdrv_flush
> guarantees that _already completed_ IO operations are flushed.  It
> does not guarantee that in flight AIO operations are completed and
> then flushed to disk.

Andrea thinks bdrv_aio_flush does guarantee that in flight operations
are flushed, while bdrv_flush definitely does not (fsync doesn't).

I vaguely recall from the earlier discussion that there was uncertainty
about whether that is true, and that therefore the right thing to do
was to wait for the in-flight AIOs to complete _first_ and then issue
an fsync or aio_fsync call.
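
To make that conservative approach concrete, here is a minimal sketch
in plain POSIX AIO (not the qemu block layer; the function name and
the way the aiocbs are handed in are made up for illustration): wait
for every submitted request to finish, and only then call fsync():

    #include <aio.h>
    #include <errno.h>
    #include <unistd.h>

    /* Drain every in-flight aio_write() before flushing, so fsync()
     * cannot miss data still sitting in an AIO worker thread. */
    static int flush_after_aio(int fd, struct aiocb **cbs, int ncbs)
    {
        int i;

        for (i = 0; i < ncbs; i++) {
            /* Block until this request is no longer in progress. */
            while (aio_error(cbs[i]) == EINPROGRESS) {
                const struct aiocb *const list[1] = { cbs[i] };
                aio_suspend(list, 1, NULL);
            }
            if (aio_return(cbs[i]) < 0)
                return -1;              /* the write itself failed */
        }

        /* Every submitted write is now visible to the kernel, so a
         * plain fsync() sees all of them. */
        return fsync(fd);
    }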

The Open Group text for aio_fsync says: "shall asynchronously force
all I/O operations [...]  queued at the time of the call to aio_fsync
[...]".

We weren't certain whether operations "queued at the time of the call"
definitely meant everything queued by earlier calls to aio_write(), or
whether, in implementations which start a thread to call write(),
"queued" might not include AIOs still sitting in a thread somewhere.

This is because we considered there might not be a strong ordering
between submitted AIOs: each one might be _as if_ it launched an
independent thread to process the synchronous equivalent.

This also came up when somebody asked why Glibc's AIO uses only one
thread per file descriptor, limiting its performance.

In fact there's a good reason for Glibc's one-thread limitation: it's
what keeps the writes in submission order.

Since then, I've read the Open Group specifications more closely,
along with some other OS man pages, and they are consistent in saying
that _writes_ always occur in the order they are submitted to
aio_write().

(The Linux man page is curiously an exception in not saying this.)

In other words, there is a queue, and submitted aio_writes() are
strongly ordered.[1]

So it seems very likely that aio_fsync _is_ specified as Andrea
thinks: it flushes all writes which were queued with aio_write()
before the call, and it's unfortunate that a different interpretation
of the specification's words is possible.[2]

[1] - This is especially important when writing to a socket, pipe or
      tape drive.  It's a little surprising it doesn't specify the
      same about reads, since reading from sockets, pipes and tape
      drives requires ordering guarantees too.

[2] - This is useful: with some devices, it can be much faster to keep
      the queue going than to wait for it to drain before issuing
      flushes for barriers.
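
For completeness, here is an equally minimal sketch of the reading
Andrea favours, again in plain POSIX AIO rather than qemu code (the
helper name is invented): the flush is itself queued with aio_fsync(),
behind the aio_write()s already submitted, so the queue never has to
drain:

    #include <aio.h>
    #include <fcntl.h>
    #include <string.h>

    /* Queue a flush behind all previously submitted aio_write()s,
     * relying on the write ordering described above. */
    static int queue_flush(int fd, struct aiocb *flush_cb)
    {
        memset(flush_cb, 0, sizeof(*flush_cb));
        flush_cb->aio_fildes = fd;
        /* Completion can be reported via flush_cb->aio_sigevent. */

        /* O_DSYNC flushes the data plus the metadata needed to read
         * it back; O_SYNC would flush the remaining metadata too. */
        return aio_fsync(O_DSYNC, flush_cb);
    }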

> > IDE works by luck because it can only submit one command at once (no
> > tagged queueing) so supposedly the guest kernel driver will wait the
> > IDE emulated device to return ready before issuing a journaling
> > barrier with WIN_FLUSH_CACHE* but with scsi and tagged command
> > queueing this bug in the aio common code will become visible and it'll
> > break the journaling guarantees of the guest if there's a power loss
> > in the host. So it's not urgent for IDE I think, but it clearly should
> > be fixed in the qemu block model eventually.
> 
> I don't think this criticism is correct because I think the IDE FLUSH
> CACHE command should be read the same way.  The spec I have here is
> admittedly quite unclear but I can't see any reason to think that the
> `write cache' which is referred to by the spec is regarded as
> containing data which has not yet been DMAd from the host to the disk
> because the command which does that transfer is not yet complete.

I have just checked ATA-8, the draft.  With ATA TCQ or NCQ (the
features that allow more than one command in flight), the only
queueable commands are specific types of reads and writes.

For all other commands, only one command in total may be in flight.

FLUSH CACHE cannot be queued: the OS must wait for preceding commands
to drain before it can issue FLUSH CACHE (it'll be refused otherwise).
So the question of flushing data not yet DMA'd doesn't apply (though
maybe in a future ATA spec it will).

What _can_ be queued is a WRITE FUA command: meaning write some data
and flush _this_ data to non-volatile storage.

Inside qemu, that should map to a write+fsync sequence somehow, or a
write using an O_SYNC or O_DIRECT file descriptor.  (Though a user
option to disable fsyncs and make _all_ writes cached would be handy,
for those temporary VMs you want to run as fast as possible.)
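
As a rough illustration only (not qemu code; the helper name is
invented and it assumes the image is an ordinary file), a queued WRITE
FUA could complete on the host along these lines:

    #include <sys/types.h>
    #include <unistd.h>

    /* Write the guest's data, then force just this file's data to
     * stable storage before completing the command, which is what the
     * FUA bit asks for. */
    static int write_fua(int fd, const void *buf, size_t len, off_t off)
    {
        ssize_t done = pwrite(fd, buf, len, off);

        if (done != (ssize_t)len)
            return -1;                  /* error or short write */

        return fdatasync(fd);
    }

An O_SYNC (or O_DSYNC) file descriptor would fold the two steps into
one system call.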

(By the way, READ FUA is also in the ATA spec.  It means force a read
from the non-volatile medium, don't read from cache.  If there is
dirty data in cache, flush it first.)

-- Jamie



