From: Alberto Garcia
Subject: Re: [Qemu-block] [RFC] Re-evaluating subcluster allocation for qcow2 images
Date: Thu, 11 Jul 2019 16:56:14 +0200
User-agent: Notmuch/0.18.2 (http://notmuchmail.org) Emacs/24.4.1 (i586-pc-linux-gnu)

On Thu 11 Jul 2019 04:32:34 PM CEST, Kevin Wolf wrote:

>> - It is possible to configure very easily the number of subclusters per
>>   cluster. It is now hardcoded to 32 in qcow2_do_open() but any power of
>>   2 would work (just change the number there if you want to test
>>   it). Would an option for this be worth adding?
>
> I think for testing we can just change the constant. Once the feature
> is merged and used in production, I don't think there is any reason to
> leave bits unused.

Me neither, unless we want to allow the 64-subcluster scenario that I
mentioned.

>> - We would now have "all zeroes" bits at the cluster and subcluster
>> levels, so there's an ambiguity here that we need to solve. In
>> particular, what happens if we have a QCOW2_CLUSTER_ZERO_ALLOC
>> cluster but some bits from the bitmap are set? Do we ignore them
>> completely?
>
> The (super)cluster zero bit should probably always be clear if
> subclusters are used. If it's set, we have a corrupted image.

I mentioned in an earlier e-mail that one possibility is to leave that
bit as it is now and use the bitmap only for the allocation status (so
we'd have 64 subclusters). If QCOW_OFLAG_ZERO is set and the subcluster
is not allocated then it's all zeroes.

With this we'd double the number of subclusters, but we'd lose the
possibility of having zero and unallocated subclusters at the same time.
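
To make the trade-off concrete, here is a minimal sketch of how the status
of one subcluster could be decoded under each layout. The bit positions
(low 32 bits for allocation, high 32 bits for zero, QCOW_OFLAG_ZERO as
bit 0 of the L2 entry) and the names status_32(), status_64() and Status
are assumptions made for illustration, not the prototype's actual code.

```python
from enum import Enum

# Assumption: QCOW_OFLAG_ZERO is bit 0 of the regular 64-bit L2 entry.
QCOW_OFLAG_ZERO = 1 << 0


class Status(Enum):
    UNALLOCATED = 0
    ALLOCATED = 1
    ZERO = 2


def status_32(bitmap: int, index: int) -> Status:
    """32 subclusters: one allocation bit and one zero bit per subcluster.

    Assumed layout: bits 0-31 are allocation bits, bits 32-63 are zero bits.
    """
    if not 0 <= index < 32:
        raise IndexError("subcluster index out of range")
    allocated = (bitmap >> index) & 1
    zero = (bitmap >> (32 + index)) & 1
    if allocated and zero:
        # Assumption: this sketch simply rejects the combination.
        raise ValueError("both allocation and zero bits set")
    if zero:
        return Status.ZERO
    return Status.ALLOCATED if allocated else Status.UNALLOCATED


def status_64(l2_entry: int, bitmap: int, index: int) -> Status:
    """64 subclusters: the bitmap holds allocation bits only.

    The zero flag is per cluster, so every non-allocated subcluster is
    either all zeroes (flag set) or unallocated (flag clear).
    """
    if not 0 <= index < 64:
        raise IndexError("subcluster index out of range")
    if (bitmap >> index) & 1:
        return Status.ALLOCATED
    return Status.ZERO if l2_entry & QCOW_OFLAG_ZERO else Status.UNALLOCATED
```

In the 64-subcluster variant a single cluster can never report both
Status.ZERO and Status.UNALLOCATED, which is the limitation described
above.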

>> I also ran some I/O tests using a scenario similar to last time
>> (SSD drive, 40GB backing image). Here are the results, you can see
>> the difference between the previous prototype (8 subclusters per
>> cluster) and the new one (32):
>
> Is the 8 subclusters test run with the old version (64 bit L2 entries)
> or the new version (128 bit L2 entries) with bits left unused?

It's the old version of the code (I copied & pasted the numbers from the
previous table).

>> |--------------+----------------+---------------+-----------------|
>> | Cluster size | 32 subclusters | 8 subclusters | subclusters=off |
>> |--------------+----------------+---------------+-----------------|
>> |         4 KB |        80 IOPS |      101 IOPS |         92 IOPS |
>> |         8 KB |       108 IOPS |      299 IOPS |        417 IOPS |
>> |        16 KB |      3440 IOPS |     7555 IOPS |       3347 IOPS |
>> |        32 KB |     10718 IOPS |    13038 IOPS |       2435 IOPS |
>> |        64 KB |     12569 IOPS |    10613 IOPS |       1622 IOPS |
>> |       128 KB |     11444 IOPS |     4907 IOPS |        866 IOPS |
>> |       256 KB |      9335 IOPS |     2618 IOPS |        561 IOPS |
>> |       512 KB |       185 IOPS |     1678 IOPS |        353 IOPS |
>> |      1024 KB |      2477 IOPS |      863 IOPS |        212 IOPS |
>> |      2048 KB |      1536 IOPS |      571 IOPS |        123 IOPS |
>> |--------------+----------------+---------------+-----------------|
>> 
>> I'm surprised about the 256 KB cluster / 32 subclusters case (I would
>> expect ~3300 IOPS), but I ran it a few times and the results are always
>> the same. I still haven't investigated why that happens. The rest of the
>> results seem more or less normal.
>
> Shouldn't 256k/8k perform similarly to 64k/8k, or maybe a bit better?
> Why did you expect ~3300 IOPS?

Sorry, I meant the 512k/16k case, which is obviously the outlier there.
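
For reference, the "cluster/subcluster" shorthand used here follows from
dividing the cluster size by the number of subclusters. A small sketch
(the helper names fmt() and label() are made up for illustration):

```python
def fmt(size: int) -> str:
    """Format a byte count as 'Nk' when it is a whole number of KB."""
    return f"{size // 1024}k" if size % 1024 == 0 else f"{size}B"


def label(cluster_size: int, subclusters: int) -> str:
    """Return the 'cluster/subcluster' shorthand for a configuration."""
    if cluster_size % subclusters != 0:
        raise ValueError("cluster size must be a multiple of the subcluster count")
    return f"{fmt(cluster_size)}/{fmt(cluster_size // subclusters)}"


KB = 1024
for cluster_kb in (64, 128, 256, 512):
    print(label(cluster_kb * KB, 32), label(cluster_kb * KB, 8))
```

So "256k/8k" and "512k/16k" are rows of the 32-subcluster column, while
"64k/8k" is the 64 KB row of the 8-subcluster column.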
 
> I found other results more surprising. In particular:
>
> * Why does 64k/2k perform better than 128k/4k when the block size for
>   your requests is 4k?

They should perform similarly, because the only difference in practice is
that in the former case you set two bits in the bitmap and in the latter
only one. The difference is not too big; I could run the tests again, and
if the results are consistent I can investigate what's going on.

But yes, I would expect 128k/4k to be the fastest of them all.
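
The "two bits versus one" arithmetic can be checked with a short sketch.
It assumes 32 subclusters per cluster; the function name
subclusters_touched() is hypothetical and does not come from the patches.

```python
def subclusters_touched(offset: int, length: int, cluster_size: int,
                        subclusters_per_cluster: int = 32) -> int:
    """Count the subclusters (bitmap bits) covered by one request.

    offset and length are in bytes; the request may span several clusters.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    subcluster_size = cluster_size // subclusters_per_cluster
    first = offset // subcluster_size
    last = (offset + length - 1) // subcluster_size
    return last - first + 1


KB = 1024
# Aligned 4k request on 64k clusters (2k subclusters): two bits.
print(subclusters_touched(0, 4 * KB, 64 * KB))
# Aligned 4k request on 128k clusters (4k subclusters): one bit.
print(subclusters_touched(0, 4 * KB, 128 * KB))
```

With 128k clusters an aligned 4k request matches one subcluster exactly,
which is why that configuration would be expected to come out fastest.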

> * Why is the maximum for 8 subclusters higher than for 32 subclusters?
>   I guess this does make some sense if the 8 subclusters case actually
>   used 64 bit L2 entries. If you did use 128 bit entries for both 32 and
>   8 subclusters, I don't see why 8 subclusters should perform better in
>   any case.

I used 64-bit entries for the 8-subcluster case. I can try with the new
code and see what happens.

> * What causes the minimum at 512k with 32 subclusters?

That's the case that I meant earlier, and I still don't have a good
hypothesis for why that happens. I'll need to debug it.

Berto


