From: Kevin Wolf
Subject: Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
Date: Thu, 27 Jun 2019 18:54:34 +0200
User-agent: Mutt/1.11.3 (2019-02-01)

Am 27.06.2019 um 15:59 hat Alberto Garcia geschrieben:
> Hi all,
> 
> a couple of years ago I came to the mailing list with a proposal to
> extend the qcow2 format to add subcluster allocation.
> 
> You can read the original message (and the discussion thread that came
> afterwards) here:
> 
>    https://lists.gnu.org/archive/html/qemu-block/2017-04/msg00178.html
> 
> The description of the problem from the original proposal is still
> valid so I won't repeat it here.
> 
> What I have been doing during the past few weeks is picking up the
> code that I wrote in 2017, making it work with the latest QEMU and
> fixing many of its bugs. I have a working prototype again which is by
> no means complete, but it allows us to have up-to-date information
> about what we can expect from this feature.
> 
> My goal with this message is to resume the discussion and re-evaluate
> whether this is a feature that we'd like for QEMU, in light of the
> test results and all the changes that we have had in the past couple
> of years.
> 
> === Test results ===
> 
> I ran these tests with the same hardware configuration as in 2017: an
> SSD drive and random 4KB write requests to an empty 40GB qcow2 image.
> 
> Here are the results when the qcow2 file is backed by a fully
> populated image. There are 8 subclusters per cluster and the
> subcluster size is in brackets:
> 
> |-----------------+----------------+-----------------|
> |  Cluster size   | subclusters=on | subclusters=off |
> |-----------------+----------------+-----------------|
> |   2 MB (256 KB) |   571 IOPS     |  124 IOPS       |
> |   1 MB (128 KB) |   863 IOPS     |  212 IOPS       |
> | 512 KB  (64 KB) |  1678 IOPS     |  365 IOPS       |
> | 256 KB  (32 KB) |  2618 IOPS     |  568 IOPS       |
> | 128 KB  (16 KB) |  4907 IOPS     |  873 IOPS       |
> |  64 KB   (8 KB) | 10613 IOPS     | 1680 IOPS       |
> |  32 KB   (4 KB) | 13038 IOPS     | 2476 IOPS       |
> |   4 KB (512 B)  |   101 IOPS     |  101 IOPS       |
> |-----------------+----------------+-----------------|

So at first sight, if you compare the numbers in the same row,
subclusters=on is a clear winner.

But almost more interesting is the observation that, at least for large
cluster sizes, subcluster size X performs almost identically to cluster
size X without subclusters:

                as subcluster size  as cluster size, subclusters=off
    256 KB      571 IOPS            568 IOPS
    128 KB      863 IOPS            873 IOPS
    64 KB       1678 IOPS           1680 IOPS
    32 KB       2618 IOPS           2476 IOPS
    ...
    4 KB        13038 IOPS          101 IOPS
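
One way to read this (just a toy calculation, not QEMU code): with a
fully populated backing image, an allocating 4 KB write presumably
spends most of its time on the copy-on-write of the rest of its
allocation unit, and that unit has the same size in both columns: the
whole cluster without subclusters, only the subcluster with them. So
both configurations copy about the same amount of data per request:

    /* Toy calculation, not QEMU code: bytes of COW per allocating 4 KB
     * write with a fully populated backing file. The COW unit is the
     * cluster (subclusters=off) or the subcluster (subclusters=on); the
     * 4 KB request is assumed to be aligned to that unit. */
    #include <stdio.h>

    int main(void)
    {
        const unsigned long req_kb = 4;
        const unsigned long cow_unit_kb[] = { 256, 128, 64, 32, 16, 8, 4 };

        for (int i = 0; i < 7; i++) {
            unsigned long cow_kb =
                cow_unit_kb[i] > req_kb ? cow_unit_kb[i] - req_kb : 0;
            printf("%3lu KB COW unit -> %3lu KB copied per 4 KB write\n",
                   cow_unit_kb[i], cow_kb);
        }
        return 0;
    }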

Something interesting happens in the part that you didn't benchmark
between 4 KB and 32 KB (actually, maybe it has already started for the
32 KB case): Performance collapses for small cluster sizes, but it
reaches record highs for small subclusters. I suspect that this is
because L2 tables are becoming very small with 4 KB clusters, but they
are still 32 KB if 4 KB is only the subcluster size. (By the way, did
the L2 cache cover the whole disk in your benchmarks?)
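
To put rough numbers on the L2 size point (a back-of-the-envelope
calculation, not QEMU code; it only assumes the standard 8-byte L2
entries and that an L2 table occupies exactly one cluster):

    /* Back-of-the-envelope: guest data mapped by one L2 table, and the
     * total amount of L2 tables needed to cover the 40 GB test image,
     * assuming 8-byte L2 entries and one cluster per L2 table. With
     * 32 KB clusters and 4 KB subclusters the L2 coverage stays on the
     * 32 KB line, even though the COW unit shrinks to 4 KB. */
    #include <stdio.h>

    int main(void)
    {
        const unsigned long long disk = 40ULL << 30;   /* 40 GB image */
        const unsigned long long cluster_kb[] = { 4, 32, 64 };

        for (int i = 0; i < 3; i++) {
            unsigned long long cs = cluster_kb[i] << 10;
            unsigned long long entries = cs / 8;       /* per L2 table */
            unsigned long long mapped = entries * cs;  /* data per L2 table */
            unsigned long long l2_total = disk / mapped * cs;

            printf("cluster %2llu KB: one L2 table maps %3llu MB, "
                   "covering 40 GB needs %5llu KB of L2 tables\n",
                   cs >> 10, mapped >> 20, l2_total >> 10);
        }
        return 0;
    }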

I think this gives us two completely different motivations why
subclusters could be useful, depending on the cluster size you're using:

1. If you use small cluster sizes like 32 KB/4 KB, then obviously you
   can get IOPS rates during cluster allocation that you couldn't even
   come close to before. I think this is quite a strong argument in
   favour of the feature.

2. With larger cluster sizes, you don't get a significant difference in
   the performance during cluster allocation compared to just using the
   subcluster size as the cluster size without having subclusters. Here,
   the motivation could be something along the lines of avoiding
   fragmentation. This would probably need more benchmarks to check how
   fragmentation affects the performance after the initial write.

   This one could possibly be a valid justification, too, but I think it
   would need more work on demonstrating that the effects are real and
   justify the implementation and long-term maintenance effort required
   for subclusters.

> Some comments about the results, after comparing them with those from
> 2017:
> 
> - As expected, 32KB clusters / 4 KB subclusters give the best results
>   because that matches the size of the write request and therefore
>   there's no copy-on-write involved.
> 
> - Allocation is generally faster now in all cases (between 20-90%,
>   depending on the case). We have made several optimizations to the
>   code since last time, and I suppose that the COW changes made in
>   commits b3cf1c7cf8 and ee22a9d869 are probably the main factor
>   behind these improvements.
> 
> - Apart from the 64KB/8KB case (which is much faster), the patterns are
>   generally the same: subcluster allocation offers similar performance
>   benefits compared to last time, so there are no surprises in this
>   area.
> 
> Then I ran the tests again using the same environment but without a
> backing image. The goal is to measure the impact of subcluster
> allocation on completely empty images.
> 
> Here we have an important change: since commit c8bb23cbdb empty
> clusters are preallocated and filled with zeroes using an efficient
> operation (typically fallocate() with FALLOC_FL_ZERO_RANGE) instead of
> writing the zeroes with the usual pwrite() call.
> 
> The effects of this are dramatic, so I decided to run two sets of
> tests: one with this optimization and one without it.
> 
> Here are the results:
> 
> |-----------------+----------------+-----------------+----------------+-----------------|
> |                 | Initialization with fallocate()  | Initialization with pwritev()    |
> |-----------------+----------------+-----------------+----------------+-----------------|
> |  Cluster size   | subclusters=on | subclusters=off | subclusters=on | subclusters=off |
> |-----------------+----------------+-----------------+----------------+-----------------|
> |   2 MB (256 KB) | 14468 IOPS     | 14776 IOPS      |  1181 IOPS     |  260 IOPS       |
> |   1 MB (128 KB) | 13752 IOPS     | 14956 IOPS      |  1916 IOPS     |  358 IOPS       |
> | 512 KB  (64 KB) | 12961 IOPS     | 14776 IOPS      |  4038 IOPS     |  684 IOPS       |
> | 256 KB  (32 KB) | 12790 IOPS     | 14534 IOPS      |  6172 IOPS     | 1213 IOPS       |
> | 128 KB  (16 KB) | 12550 IOPS     | 13967 IOPS      |  8700 IOPS     | 1976 IOPS       |
> |  64 KB   (8 KB) | 12491 IOPS     | 13432 IOPS      | 11735 IOPS     | 4267 IOPS       |
> |  32 KB   (4 KB) | 13203 IOPS     | 11752 IOPS      | 12366 IOPS     | 6306 IOPS       |
> |   4 KB (512 B)  |   103 IOPS     |   101 IOPS      |   101 IOPS     |  101 IOPS       |
> |-----------------+----------------+-----------------+----------------+-----------------|
> 
> Comments:
> 
> - With the old-style allocation method using pwritev() we get similar
>   benefits to those we saw last time. The comments from the test with
>   a backing image apply to this one as well.
> 
> - However, the new allocation method is so efficient that having
>   subclusters does not offer any performance benefit. It even slows
>   things down a bit in most cases, so we'd probably need to fine-tune
>   the algorithm in order to get similar results.
> 
> - In light of these numbers I also think that even when there's a
>   backing image we could preallocate the full cluster but only do COW
>   on the affected subclusters. This would leave the rest of the
>   cluster preallocated on disk but unallocated in the bitmap. This
>   would probably reduce on-disk fragmentation, which was one of the
>   concerns raised during the original discussion.

Yes, especially when we have to do some COW anyway, this would come at
nearly zero cost because fallocate() gets called in that case either way.
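
For reference, this is the kind of call involved; a minimal sketch, not
the actual QEMU code path, assuming Linux and a filesystem that supports
FALLOC_FL_ZERO_RANGE (which is what the commit quoted above typically
uses):

    /* Minimal sketch, not QEMU code: zero a cluster-sized range cheaply
     * with fallocate(FALLOC_FL_ZERO_RANGE). On filesystems without
     * support this fails (e.g. EOPNOTSUPP) and the caller would have to
     * fall back to writing a buffer of zeroes, i.e. the pwritev() path. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>
    #include <errno.h>

    int zero_cluster(int fd, off_t offset, off_t cluster_size)
    {
        if (fallocate(fd, FALLOC_FL_ZERO_RANGE, offset, cluster_size) == 0) {
            return 0;
        }
        return -errno;   /* e.g. -EOPNOTSUPP */
    }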

I'm not sure whether it's worth doing when we don't have to do COW. We
will at least avoid qcow2 fragmentation because of the large cluster
size. And file systems are a lot cleverer than qcow2 at avoiding
fragmentation on the file system level. So it might not actually make a
big difference in practice.

This is pure theory, though. We'd have to benchmark things to give a
definite answer.

> I also ran some tests on a rotating HDD drive. Here having subclusters
> doesn't make a big difference regardless of whether there is a backing
> image or not, so we can ignore this scenario.

Interesting, this is kind of unexpected. Why would avoiding COW not make
a difference on rotating HDDs? (All of this is cache=none, right?)

> === Changes to the on-disk format ===
> 
> In my original proposal I described 3 different alternatives for
> storing the subcluster bitmaps. I'm naming them here, but refer to
> that message for more details.
> 
> (1) Storing the bitmap inside the 64-bit entry
> (2) Making L2 entries 128-bit wide
> (3) Storing the bitmap somewhere else
> 
> I used (1) for this implementation for simplicity, but I think (2) is
> probably the best one.

Which would give us 32 bits for the subclusters, so you'd get 128k/4k or
2M/64k. Or would you intend to use some of these 32 bits for something
different?
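
Purely as an illustration of what option (2) could look like (the field
names and the use of the second 32 bits below are hypothetical, nothing
here has been decided):

    /* Hypothetical sketch of a 128-bit L2 entry for option (2), not part
     * of the proposal. The first 64 bits keep today's meaning (host
     * cluster descriptor with the existing flag bits); 32 of the extra
     * bits form the subcluster allocation bitmap, which gives 32
     * subclusters per cluster (e.g. 2 MB / 64 KB or 128 KB / 4 KB). The
     * remaining 32 bits are shown as reserved; they could equally hold a
     * second bitmap (e.g. "reads as zero") or other per-cluster state. */
    #include <stdint.h>

    typedef struct QCow2L2Entry128 {
        uint64_t cluster_descriptor;    /* today's 64-bit L2 entry, unchanged */
        uint32_t subcluster_allocated;  /* bit n set: subcluster n is allocated */
        uint32_t reserved;              /* spare / possible second bitmap */
    } QCow2L2Entry128;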

I think (3) is the worst because it adds another kind of metadata table
that we have to consider for ordering updates. So it might come with
more frequent cache flushes.

> ===========================
> 
> And I think that's all. As you can see, I didn't want to go much into
> the open technical questions (I think the on-disk format would be the
> main one); the first goal should be to decide whether this is still an
> interesting feature or not.
> 
> So, any questions or comments will be much appreciated.

It does look very interesting to me, at least for small subcluster sizes.

For the larger ones, I suspect that the Virtuozzo guys might be
interested in performing more benchmarks to see whether it improves the
fragmentation problems that they have talked about a lot. It might end
up being interesting for these cases, too.

Kevin


