[Qemu-devel] qcow2 performance plan

From:	Avi Kivity
Subject:	[Qemu-devel] qcow2 performance plan
Date:	Tue, 14 Sep 2010 15:07:32 +0200

Here's a draft of a plan that should improve qcow2 performance.  It's
written in wiki syntax for eventual upload to wiki.qemu.org; lines
starting with # are numbered lists, not comments.


= Basics =

At the minimum level, no operation should block the main thread.  This
could be done in two ways: extending the state machine so that each
blocking operation can be performed asynchronously (<code>bdrv_aio_*</code>)
or by threading: each new operation is handed off to a worker thread.
Since a full state machine is prohibitively complex, this document
will discuss threading.

== Basic threading strategy ==

A first iteration of qcow2 threading adds a single mutex to an image.
The existing qcow2 code is then executed within a worker thread,
acquiring the mutex before starting any operation and releasing it
after completion.  Concurrent operations will simply block until the
operation is complete.  For operations which are already asynchronous,
the blocking time will be negligible since the code will call
<code>bdrv_aio_{read,write}</code> and return, releasing the mutex.
The immediate benefit is that currently blocking operations no longer block
the main thread; instead they only block the block operation, which is
blocking anyway.
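
As a rough illustration of this first iteration (a Python sketch, not QEMU code; <code>Image</code>, <code>submit</code> and <code>_blocking_op</code> are hypothetical names), each request is handed to a worker pool and the format code is serialized by one per-image mutex:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class Image:
    """Hypothetical stand-in for a qcow2 image guarded by a single mutex."""

    def __init__(self):
        self.mutex = threading.Lock()
        self.log = []          # names of the operations that have run

    def _blocking_op(self, name):
        # Stand-in for the existing (possibly blocking) qcow2 code path.
        self.log.append(name)
        return name

    def submit(self, pool, name):
        # Called from the main thread; returns a future immediately.
        def run():
            with self.mutex:   # one operation at a time per image
                return self._blocking_op(name)
        return pool.submit(run)

def demo():
    image = Image()
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [image.submit(pool, f"op{i}") for i in range(8)]
        results = [f.result() for f in futures]
    return results, sorted(image.log)
```

The main thread only ever touches futures here; all blocking happens inside the workers.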

== Eliminating the threading penalty ==

We can eliminate pointless context switches by using the worker thread
context we're in to issue the I/O.  This is trivial for synchronous calls
(<code>bdrv_read</code> and <code>bdrv_write</code>); we simply issue the I/O
from the same thread we're currently in.  The underlying raw block format
driver threading code needs to recognize we're in a worker thread context so
it doesn't need to use a worker thread of its own; perhaps using a thread
variable to see if it is in the main thread or an I/O worker thread.

For asynchronous operations, this is harder.  We may add a
<code>bdrv_queue_aio_read</code> and <code>bdrv_queue_aio_write</code>
to replace a

    bdrv_aio_read()
    mutex_unlock(bs.mutex)
    return;

sequence.  Alternatively, we can just eliminate asynchronous calls.  To
retain concurrency we drop the mutex while performing the operation
and convert a <code>bdrv_aio_read</code> to:

    mutex_unlock(bs.mutex)
    bdrv_read()
    mutex_lock(bs.mutex)

This allows the operations to proceed in parallel.
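
A small Python sketch of the unlock/read/lock pattern (names are hypothetical).  The demo uses a two-party barrier as the "I/O", so it only completes if two reads really are inside the I/O step at the same time:

```python
import threading

class Image:
    """Hypothetical image whose mutex is dropped around data I/O."""

    def __init__(self):
        self.mutex = threading.Lock()

    def read(self, do_io):
        self.mutex.acquire()
        # ... metadata lookups would happen here, under the mutex ...
        self.mutex.release()          # mutex_unlock(bs.mutex)
        try:
            data = do_io()            # bdrv_read(), runs unlocked
        finally:
            self.mutex.acquire()      # mutex_lock(bs.mutex)
        # ... completion bookkeeping would happen here ...
        self.mutex.release()
        return data

def demo():
    image = Image()
    barrier = threading.Barrier(2)
    results = []

    def do_io():
        # Passes only if both reads are in the I/O step simultaneously.
        barrier.wait(timeout=5)
        return b"data"

    threads = [threading.Thread(target=lambda: results.append(image.read(do_io)))
               for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```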

For asynchronous metadata operations, the code is simplified considerably.
Dependency lists that are maintained in metadata caches are replaced by a
mutex; instead of adding an operation to a dependency list, acquire the
mutex.  Then issue your metadata update synchronously.  If there is a lot of
contention on the resource, we can batch all updates into a single write:

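   mutex_lock(l1.mutex)
   if not l1.dirty:
       l1.future = copy(l1.data)
       l1.dirty = True
   l1.future[idx] = cluster
   mutex_unlock(l1.mutex)          # never wait for write_mutex holding l1.mutex
   mutex_lock(l1.write_mutex)      # one writer at a time
   mutex_lock(l1.mutex)
   if l1.dirty:
       tmp = copy(l1.future)       # snapshot of all updates queued so far
       mutex_unlock(l1.mutex)
       bdrv_write(tmp)
       sync
       mutex_lock(l1.mutex)
       l1.dirty = tmp != l1.future # still dirty if updates arrived meanwhile
   mutex_unlock(l1.mutex)
   mutex_unlock(l1.write_mutex)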
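
The batching idea can be tried out with the following Python sketch.  It is only an illustration under some assumptions: <code>L1Table</code> and <code>write_fn</code> (standing in for <code>bdrv_write</code> plus sync) are hypothetical, the table is copied rather than aliased, and the write mutex is always taken before the data mutex so that the lock order stays consistent.

```python
import threading

class L1Table:
    """Hypothetical in-memory L1 table with batched write-back."""

    def __init__(self, size, write_fn):
        self.data = [0] * size               # last contents written out
        self.future = None                   # contents incl. pending updates
        self.dirty = False
        self.mutex = threading.Lock()        # protects data/future/dirty
        self.write_mutex = threading.Lock()  # serializes table writes
        self.write_fn = write_fn             # stand-in for bdrv_write + sync

    def update(self, idx, cluster):
        with self.mutex:
            if not self.dirty:
                self.future = list(self.data)
                self.dirty = True
            self.future[idx] = cluster
        # mutex is dropped before waiting for write_mutex.
        with self.write_mutex:
            with self.mutex:
                if not self.dirty:
                    return                   # an earlier write covered us
                tmp = list(self.future)      # snapshot of queued updates
            self.write_fn(tmp)               # I/O without holding mutex
            with self.mutex:
                self.data = tmp
                self.dirty = tmp != self.future
```

An updater that finds the table clean after acquiring the write mutex returns without writing, which is where the batching comes from: its change went out with someone else's write.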

== Special casing linux-aio ==

There is one case where a worker thread approach is detrimental:
<code>cache=none</code> together with <code>aio=native</code>.  We can solve
this by checking for the case where we're ready to issue the operation with
no metadata I/O:

    if mutex_trylock(bs.mutex):
       m = metadata_lookup(offset, length)
       if m:
           bdrv_aio_read(bs, m, offset, length, callback) # or write
           mutex_unlock(bs.mutex)
           return
       mutex_unlock(bs.mutex)   # metadata not cached: fall back to a worker
    queue_task(operation, offset, length, callback)
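
The same fast path as a runnable Python sketch.  <code>FakeImage</code>, <code>cached_lookup</code> and <code>aio_read</code> are placeholders invented for the example; the point is only the try-lock, the in-memory lookup, and the fallback to the worker queue.

```python
import threading

class FakeImage:
    """Hypothetical image with an in-memory metadata cache."""

    def __init__(self, cache):
        self.mutex = threading.Lock()
        self.cache = cache       # guest offset -> host offset, already loaded
        self.issued = []         # native AIO requests issued directly

    def cached_lookup(self, offset, length):
        # Returns None when answering would need metadata I/O.
        return self.cache.get(offset)

    def aio_read(self, host_offset, length, callback):
        self.issued.append((host_offset, length))

def submit(image, offset, length, callback, queue_task):
    if image.mutex.acquire(blocking=False):      # mutex_trylock
        try:
            mapping = image.cached_lookup(offset, length)
            if mapping is not None:
                image.aio_read(mapping, length, callback)
                return "fast"
        finally:
            image.mutex.release()                # released on every path
    queue_task(offset, length, callback)         # slow path: worker thread
    return "queued"
```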

= Speeding up allocation =

When a write grows a qcow2 image, the following operations take place:

# clusters are allocated, and the refcount table is updated to reflect this
# sync to ensure the allocation is committed
# the data is written to the clusters
# the L2 table is located; if it doesn't exist, it is allocated and linked
# the L2 table is updated
# sync to ensure the L2->data pointer is committed

We can avoid the first sync by maintaining a volatile list of allocated
but not yet linked clusters.  This requires a tradeoff between the risk of
losing those clusters on an abort, and the performance gain.  To minimize the
risk, the list is flushed if there is no demand for it.

# we maintain low and high thresholds for the volatile free list
# if we're under the low threshold, we start a task to allocate clusters up to the midpoint
# if we're above the high threshold, we start a task to return clusters down to the midpoint
# if we ever need a cluster (extent) and find that the volatile list is empty, we double the low and high thresholds (up to a limit)
# once a second, we decrease the thresholds by 25%

This ensures that sustained writes will not block on allocation.
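
The threshold rules can be sketched as a small policy object.  This is an illustration only: the class name, the default numbers, the way the limit is applied (doubling is skipped once the high threshold would exceed it) and the lower bounds in <code>tick</code> are all assumptions, since the plan does not specify them.

```python
class FreeListPolicy:
    """Hypothetical low/high threshold policy for the volatile free list."""

    def __init__(self, low=8, high=32, limit=1024):
        self.low, self.high, self.limit = low, high, limit

    def midpoint(self):
        return (self.low + self.high) // 2

    def action(self, free_count):
        """Clusters to allocate (> 0), to return (< 0), or 0 for nothing."""
        if free_count < self.low or free_count > self.high:
            return self.midpoint() - free_count
        return 0

    def on_empty(self):
        # A writer found the list empty: double both thresholds, up to limit.
        if self.high * 2 <= self.limit:
            self.low *= 2
            self.high *= 2

    def tick(self):
        # Called once a second: decay both thresholds by 25%.
        self.low = max(1, self.low * 3 // 4)
        self.high = max(self.low + 1, self.high * 3 // 4)
```

For example, with thresholds 8 and 32 the midpoint is 20, so a list holding 4 clusters triggers an allocation of 16 and a list holding 40 triggers a return of 20.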

Note that a lost cluster is simply leaked; no data loss is involved.
The free list can be rebuilt if an unclean shutdown is detected.  Older
implementations can ignore those leaks.  To transport an image, it
is recommended to run qemu-img to reclaim any clusters in case it was
shut down uncleanly.


== Alternative implementation ==

We can avoid a volatile list by relying on guest concurrency.  We replace
<code>bdrv_aio_write</code> by <code>bdrv_aio_submit</code>, which issues
many operations in parallel (but completes each one separately).  This mimics
SCSI and virtio devices, which can trigger multiple ops with a single call
to the hardware.  We make a first pass over all write operations, seeing how
many clusters need to be allocated, allocate that in a single operation, then
submit all of the allocating writes.  Reads and non-allocating writes can
proceed in parallel.
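
The two-pass idea might look like this in Python.  Everything here is hypothetical (<code>plan_submit</code>, the <code>(kind, guest_cluster)</code> tuples, the dictionary standing in for the L2 mapping); the sketch only shows counting the needed clusters first and allocating them with a single call.

```python
def plan_submit(ops, l2_map, allocate_clusters):
    """ops: list of (kind, guest_cluster), kind is "read" or "write".

    l2_map maps already-allocated guest clusters to host clusters.
    allocate_clusters(n) returns n new host clusters in one operation.
    """
    # First pass: find every guest cluster that a write needs allocated.
    needed = []
    for kind, guest in ops:
        if kind == "write" and guest not in l2_map and guest not in needed:
            needed.append(guest)

    # One allocation for the whole batch.
    new = allocate_clusters(len(needed)) if needed else []
    pending = dict(zip(needed, new))

    # Second pass: build the list of I/Os to issue in parallel.
    plan = []
    for kind, guest in ops:
        if kind == "write" and guest in pending:
            plan.append((kind, guest, pending[guest]))    # allocating write
        else:
            plan.append((kind, guest, l2_map.get(guest)))
    return plan
```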

Note that this implementation (as well as the current qcow2 code) may
leak clusters if qemu aborts in the wrong place.  Avoiding leaks
completely requires either journalling, allocate-on-write, or a free
list rebuild.  The first two are slow due to the need for barriers.


= Avoiding L2 syncs =

Currently after updating an L2 table with a cluster pointer, we sync to avoid
loss of a cluster.  We can avoid this since the guest is required to sync
if it wants to ensure the data is on disk.  We need only to sync if we UNMAP
the cluster, before we free it in the refcount table.

= Copying L1 tables =

qcow2 requires copying of L1 tables in two cases: taking a snapshot, and
growing the physical image size beyond a certain boundary.  Since L1s
are relatively small, even for very large images, and growing L1 is very
rare, we can exclude all write operations by having a global
shared/exclusive lock taken for shared access by write operations, and
for exclusive access by grow/snapshot operations.
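
A minimal shared/exclusive lock in Python, built on a condition variable, to show the intended usage.  The class and method names are made up, and the sketch makes no attempt at fairness: a steady stream of writes could starve a grow or snapshot operation.

```python
import threading

class SharedExclusiveLock:
    """Hypothetical lock: many shared holders or one exclusive holder."""

    def __init__(self):
        self._cond = threading.Condition()
        self._shared = 0          # number of write operations in progress
        self._exclusive = False   # True while a grow/snapshot runs

    def acquire_shared(self):     # taken by ordinary write operations
        with self._cond:
            while self._exclusive:
                self._cond.wait()
            self._shared += 1

    def release_shared(self):
        with self._cond:
            self._shared -= 1
            if self._shared == 0:
                self._cond.notify_all()

    def acquire_exclusive(self):  # taken by grow/snapshot (L1 copy)
        with self._cond:
            while self._exclusive or self._shared:
                self._cond.wait()
            self._exclusive = True

    def release_exclusive(self):
        with self._cond:
            self._exclusive = False
            self._cond.notify_all()
```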

If concurrent growing and writing is desired, it can be achieved by
having a thread copy L1, and requiring each L1 update to update both
copies (for the region already copied) or just the source (for the
region that was not yet copied).
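
The dual-update rule can be sketched as follows, assuming the copy proceeds from index 0 upward and a single lock protects the copy progress (<code>L1Copier</code> and its methods are hypothetical names):

```python
import threading

class L1Copier:
    """Hypothetical incremental L1 copy that tolerates concurrent updates."""

    def __init__(self, source, new_size):
        self.src = source
        self.dst = [0] * new_size
        self.copied = 0                 # entries [0, copied) are in dst
        self.lock = threading.Lock()

    def copy_step(self, n):
        """Copy up to n more entries; return True when the copy is done."""
        with self.lock:
            end = min(self.copied + n, len(self.src))
            self.dst[self.copied:end] = self.src[self.copied:end]
            self.copied = end
            return self.copied == len(self.src)

    def update(self, idx, value):
        with self.lock:
            self.src[idx] = value       # the source is always updated
            if idx < self.copied:
                self.dst[idx] = value   # already copied: update both
```

Entries past the copy point need no second write because the copy thread will pick up the new value when it gets there.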


--
error compiling committee.c: too many arguments to function
