Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images

From: Alberto Garcia
Subject: Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
Date: Fri, 28 Jun 2019 14:57:56 +0200
User-agent: Notmuch/0.18.2 (http://notmuchmail.org) Emacs/24.4.1 (i586-pc-linux-gnu)
On Thu 27 Jun 2019 06:54:34 PM CEST, Kevin Wolf wrote:
>> |----------------------+----------------+-----------------|
>> | Cluster (subcluster) | subclusters=on | subclusters=off |
>> |----------------------+----------------+-----------------|
>> | 2 MB (256 KB)        |       571 IOPS |        124 IOPS |
>> | 1 MB (128 KB)        |       863 IOPS |        212 IOPS |
>> | 512 KB (64 KB)       |      1678 IOPS |        365 IOPS |
>> | 256 KB (32 KB)       |      2618 IOPS |        568 IOPS |
>> | 128 KB (16 KB)       |      4907 IOPS |        873 IOPS |
>> | 64 KB (8 KB)         |     10613 IOPS |       1680 IOPS |
>> | 32 KB (4 KB)         |     13038 IOPS |       2476 IOPS |
>> | 4 KB (512 B)         |       101 IOPS |        101 IOPS |
>> |----------------------+----------------+-----------------|
>
> So at the first sight, if you compare the numbers in the same row,
> subclusters=on is a clear winner.
Yes, as expected.
> But almost more interesting is the observation that at least for large
> cluster sizes, subcluster size X performs almost identical to cluster
> size X without subclusters:
But that's also to be expected, isn't it? The only difference (in terms
of I/O) between allocating a 64KB cluster and a 64KB subcluster is how
the L2 entry is updated. The amount of data that is read and written is
the same.
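The point can be sketched with some back-of-the-envelope arithmetic (an illustrative sketch, not code from the patches; it assumes a 4 KB guest write into an unallocated region backed by data, and it counts only guest data, not metadata updates):

```python
# Rough sketch of the data moved when a 4 KB guest write triggers
# copy-on-write (COW) during allocation. Illustrative only: it counts
# guest data, not L2/refcount metadata writes.

def cow_bytes(allocation_unit, guest_write=4096):
    """Bytes read from the backing file and written to the image
    when allocating one unit of `allocation_unit` bytes."""
    read = allocation_unit - guest_write   # head + tail copied from backing
    written = allocation_unit              # the whole unit is written out
    return read, written

# Allocating a 64 KB cluster vs. a 64 KB subcluster of a 2 MB cluster:
# the amount of guest data read and written is identical either way.
print(cow_bytes(64 * 1024))   # (61440, 65536) in both cases
```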
> Something interesting happens in the part that you didn't benchmark
> between 4 KB and 32 KB (actually, maybe it has already started for the
> 32 KB case): Performance collapses for small cluster sizes, but it
> reaches record highs for small subclusters.
I didn't measure those initially because I thought that having
subclusters < 4KB was not very useful. The 512b case was just to see how
it would perform in the extreme case. I decided to get the rest of the
numbers anyway, so here's the complete table with the missing rows:
|--------------+-----------------+-----------------------+------------------------|
| Cluster (KB) | Subcluster (KB) | subclusters=on (IOPS) | subclusters=off (IOPS) |
|--------------+-----------------+-----------------------+------------------------|
|         2048 |             256 |                   571 |                    124 |
|         1024 |             128 |                   863 |                    212 |
|          512 |              64 |                  1678 |                    365 |
|          256 |              32 |                  2618 |                    568 |
|          128 |              16 |                  4907 |                    873 |
|           64 |               8 |                 10613 |                   1680 |
|           32 |               4 |                 13038 |                   2476 |
|           16 |               2 |                  7555 |                   3389 |
|            8 |               1 |                   299 |                    420 |
|            4 |           512 B |                   101 |                    101 |
|--------------+-----------------+-----------------------+------------------------|
> I suspect that this is because L2 tables are becoming very small with
> 4 KB clusters, but they are still 32 KB if 4 KB is only the subcluster
> size.
Yes, I explained that in my original proposal from 2017. I didn't
actually investigate further, but my take is that 4KB clusters require
constant allocations and refcount updates, plus L2 tables fill up very
quickly.
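The numbers behind that intuition can be sketched as follows (a hedged illustration; it assumes the standard qcow2 layout where an L2 table occupies exactly one cluster and a plain L2 entry is 8 bytes — with wider 128-bit entries the coverage would halve):

```python
# How much guest data one L2 table maps, assuming an L2 table is one
# cluster in size and each 64-bit (8-byte) entry maps one cluster.

def l2_coverage(cluster_size):
    entries = cluster_size // 8
    return entries * cluster_size

# 4 KB clusters: each L2 table maps only 2 MB of guest data, so tables
# fill up fast and new ones must be allocated constantly.
print(l2_coverage(4096))        # 2097152 bytes = 2 MB

# 32 KB clusters with 4 KB subclusters: the allocation granularity is
# still 4 KB, but one L2 table now maps 128 MB.
print(l2_coverage(32 * 1024))   # 134217728 bytes = 128 MB
```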
> (By the way, did the L2 cache cover the whole disk in your
> benchmarks?)
Yes, in all cases (I forgot to mention that, sorry).
> I think this gives us two completely different motivations why
> subclusters could be useful, depending on the cluster size you're
> using:
>
> 1. If you use small cluster sizes like 32 KB/4 KB, then obviously you
> can get IOPS rates during cluster allocation that you couldn't come
> even close to before. I think this is a quite strong argument in
> favour of the feature.
Yes, indeed. You would need to select the subcluster size so it matches
the size of guest I/O requests (the size of the filesystem block is
probably the best choice).
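For illustration, creating such an image might look like this (hedged: at this RFC stage the option name was not settled; `subclusters=on` follows the prototype's naming and the exact spelling depends on the version of the patches):

```shell
# Hypothetical invocation matching the prototype's naming: a small
# cluster/subcluster ratio tuned to a 4 KB guest filesystem block size.
qemu-img create -f qcow2 \
    -o cluster_size=32k,subclusters=on \
    -o backing_file=base.qcow2 \
    overlay.qcow2 10G
```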
> 2. With larger cluster sizes, you don't get a significant difference
> in the performance during cluster allocation compared to just using
> the subcluster size as the cluster size without having
> subclusters. Here, the motivation could be something along the
> lines of avoiding fragmentation. This would probably need more
> benchmarks to check how fragmentation affects the performance after
> the initial write.
>
> This one could possibly be a valid justification, too, but I think it
> would need more work on demonstrating that the effects are real and
> justify the implementation and long-term maintenance effort required
> for subclusters.
I agree. However, another benefit of large cluster sizes is that they
reduce the amount of metadata, so you get the same performance with a
smaller L2 cache.
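As a rough illustration (assuming standard 64-bit L2 entries, so covering the whole disk needs disk_size * 8 / cluster_size bytes of L2 cache; 128-bit entries would double this):

```python
# L2 cache needed to cover a whole disk, with one 8-byte L2 entry per
# cluster: cache = disk_size * 8 / cluster_size.

def l2_cache_for(disk_size, cluster_size):
    return disk_size * 8 // cluster_size

TB = 1024 ** 4
# A 1 TB disk needs 128 MB of L2 cache with 64 KB clusters...
print(l2_cache_for(TB, 64 * 1024))        # 134217728 bytes = 128 MB
# ...but only 4 MB with 2 MB clusters.
print(l2_cache_for(TB, 2 * 1024 * 1024))  # 4194304 bytes = 4 MB
```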
>> I also ran some tests on a rotating HDD drive. Here having
>> subclusters doesn't make a big difference regardless of whether there
>> is a backing image or not, so we can ignore this scenario.
>
> Interesting, this is kind of unexpected. Why would avoided COW not
> make a difference on rotating HDDs? (All of this is cache=none,
> right?)
The 32K/4K case with no COW is obviously much faster.
>
>> === Changes to the on-disk format ===
>>
>> In my original proposal I described 3 different alternatives for
>> storing the subcluster bitmaps. I'm naming them here, but refer to
>> that message for more details.
>>
>> (1) Storing the bitmap inside the 64-bit entry.
>> (2) Making L2 entries 128 bits wide.
>> (3) Storing the bitmap somewhere else.
>>
>> I used (1) for this implementation for simplicity, but I think (2) is
>> probably the best one.
>
> Which would give us 32 bits for the subclusters, so you'd get 128k/4k or
> 2M/64k. Or would you intend to use some of these 32 bits for something
> different?
>
> I think (3) is the worst because it adds another kind of metadata table
> that we have to consider for ordering updates. So it might come with
> more frequent cache flushes.
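A 128-bit entry along the lines of option (2) could be sketched like this (a purely hypothetical layout for illustration; the actual bit assignment of the extra 64 bits was still an open question in this thread):

```python
# Hypothetical 128-bit L2 entry: the low 64 bits keep the classic qcow2
# entry (host cluster offset + flags), and 32 of the extra bits hold an
# "allocated" bitmap with one bit per subcluster. This layout is an
# assumption for illustration, not the on-disk format.

SUBCLUSTERS_PER_CLUSTER = 32   # e.g. 128 KB cluster / 4 KB subcluster

def is_subcluster_allocated(bitmap, index):
    """Test whether subcluster `index` is marked allocated."""
    return bool((bitmap >> index) & 1)

def allocate_subcluster(bitmap, index):
    """Return the bitmap with subcluster `index` marked allocated."""
    return bitmap | (1 << index)

bitmap = 0
bitmap = allocate_subcluster(bitmap, 3)
print(is_subcluster_allocated(bitmap, 3))   # True
print(is_subcluster_allocated(bitmap, 4))   # False
```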
>
>> ===========================
>>
>> And I think that's all. As you can see I didn't want to go much into
>> the open technical questions (I think the on-disk format would be the
>> main one), the first goal should be to decide whether this is still an
>> interesting feature or not.
>>
>> So, any questions or comments will be much appreciated.
>
> It does look very interesting to me, at least for small subcluster sizes.
>
> For the larger ones, I suspect that the Virtuozzo guys might be
> interested in performing more benchmarks to see whether it improves the
> fragmentation problems that they have talked about a lot. It might end
> up being interesting for these cases, too.
>
> Kevin