[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] [PATCH v3] block/rbd: implement .bdrv_get_allocated_fil
From: |
Max Reitz |
Subject: |
Re: [Qemu-devel] [PATCH v3] block/rbd: implement .bdrv_get_allocated_file_size callback |
Date: |
Tue, 9 Jul 2019 17:32:21 +0200 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.7.0 |
On 09.07.19 15:09, Stefano Garzarella wrote:
> On Tue, Jul 09, 2019 at 08:55:19AM -0400, Jason Dillaman wrote:
>> On Tue, Jul 9, 2019 at 5:45 AM Max Reitz <address@hidden> wrote:
>>>
>>> On 09.07.19 10:55, Max Reitz wrote:
>>>> On 09.07.19 05:08, Jason Dillaman wrote:
>>>>> On Fri, Jul 5, 2019 at 6:43 AM Stefano Garzarella <address@hidden> wrote:
>>>>>>
>>>>>> On Fri, Jul 05, 2019 at 11:58:43AM +0200, Max Reitz wrote:
>>>>>>> On 05.07.19 11:32, Stefano Garzarella wrote:
>>>>>>>> This patch allows 'qemu-img info' to show the 'disk size' for
>>>>>>>> the RBD images that have the fast-diff feature enabled.
>>>>>>>>
>>>>>>>> If this feature is enabled, we use the rbd_diff_iterate2() API
>>>>>>>> to calculate the allocated size for the image.
>>>>>>>>
>>>>>>>> Signed-off-by: Stefano Garzarella <address@hidden>
>>>>>>>> ---
>>>>>>>> v3:
>>>>>>>> - return -ENOTSUP instead of -1 when fast-diff is not available
>>>>>>>> [John, Jason]
>>>>>>>> v2:
>>>>>>>> - calculate the actual usage only if the fast-diff feature is
>>>>>>>> enabled [Jason]
>>>>>>>> ---
>>>>>>>> block/rbd.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>> 1 file changed, 54 insertions(+)
>>>>>>>
>>>>>>> Well, the librbd documentation is non-existing as always, but while
>>>>>>> googling, I at least found that libvirt has exactly the same code. So I
>>>>>>> suppose it must be quite correct, then.
>>>>>>>
>>>>>>
>>>>>> While I wrote this code I took a look at libvirt implementation and also
>>>>>> at the "rbd" tool in the ceph repository: compute_image_disk_usage() in
>>>>>> src/tools/rbd/action/DiskUsage.cc
>>>>>>
>>>>>>>> diff --git a/block/rbd.c b/block/rbd.c
>>>>>>>> index 59757b3120..b6bed683e5 100644
>>>>>>>> --- a/block/rbd.c
>>>>>>>> +++ b/block/rbd.c
>>>>>>>> @@ -1084,6 +1084,59 @@ static int64_t
>>>>>>>> qemu_rbd_getlength(BlockDriverState *bs)
>>>>>>>> return info.size;
>>>>>>>> }
>>>>>>>>
>>>>>>>> +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int
>>>>>>>> exists,
>>>>>>>> + void *arg)
>>>>>>>> +{
>>>>>>>> + int64_t *alloc_size = (int64_t *) arg;
>>>>>>>> +
>>>>>>>> + if (exists) {
>>>>>>>> + (*alloc_size) += len;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs)
>>>>>>>> +{
>>>>>>>> + BDRVRBDState *s = bs->opaque;
>>>>>>>> + uint64_t flags, features;
>>>>>>>> + int64_t alloc_size = 0;
>>>>>>>> + int r;
>>>>>>>> +
>>>>>>>> + r = rbd_get_flags(s->image, &flags);
>>>>>>>> + if (r < 0) {
>>>>>>>> + return r;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + r = rbd_get_features(s->image, &features);
>>>>>>>> + if (r < 0) {
>>>>>>>> + return r;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + /*
>>>>>>>> + * We use rbd_diff_iterate2() only if the RBD image have fast-diff
>>>>>>>> + * feature enabled. If it is disabled, rbd_diff_iterate2() could
>>>>>>>> be
>>>>>>>> + * very slow on a big image.
>>>>>>>> + */
>>>>>>>> + if (!(features & RBD_FEATURE_FAST_DIFF) ||
>>>>>>>> + (flags & RBD_FLAG_FAST_DIFF_INVALID)) {
>>>>>>>> + return -ENOTSUP;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + /*
>>>>>>>> + * rbd_diff_iterate2(), if the source snapshot name is NULL,
>>>>>>>> invokes
>>>>>>>> + * the callback on all allocated regions of the image.
>>>>>>>> + */
>>>>>>>> + r = rbd_diff_iterate2(s->image, NULL, 0,
>>>>>>>> + bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1,
>>>>>>>> + &rbd_allocated_size_cb, &alloc_size);
>>>>>>>
>>>>>>> But I have a question. This is basically block_status, right? So it
>>>>>>> gives us information on which areas are allocated and which are not.
>>>>>>> The result thus gives us a lower bound on the allocation size, but is it
>>>>>>> really exactly the allocation size?
>>>>>>>
>>>>>>> There are two things I’m concerned about:
>>>>>>>
>>>>>>> 1. What about metadata?
>>>>>>
>>>>>> Good question, I don't think it includes the size used by metadata and I
>>>>>> don't know if there is a way to know it. I'll check better.
>>>>>
>>>>> It does not include the size of metadata, the "rbd_diff_iterate2"
>>>>> function is literally just looking for touched data blocks within the
>>>>> RBD image.
>>>>>
>>>>>>>
>>>>>>> 2. If you have multiple snapshots, this will only report the overall
>>>>>>> allocation information, right? So say there is something like this:
>>>>>>>
>>>>>>> (“A” means an allocated MB, “-” is an unallocated MB)
>>>>>>>
>>>>>>> Snapshot 1: AAAA---
>>>>>>> Snapshot 2: --AAAAA
>>>>>>> Snapshot 3: -AAAA--
>>>>>>>
>>>>>>> I think the allocated data size is the number of As in total (13 MB).
>>>>>>> But I suppose this API will just return 7 MB, because it looks on
>>>>>>> everything an it sees the whole image range (7 MB) to be allocated. It
>>>>>>> doesn’t report in how many snapshots some region is allocated.
>>>>>
>>>>> It should return 13 dirty data blocks (multipled by the size of the
>>>>> data block) since when you don't provide a "from snapshot" name, it
>>>>> will iterate from the first snapshot to the HEAD revision.
>>>>
>>>> Have you tested that?
>>>>
>>>> I‘m so skeptical because the callback function interface has no way of
>>>> distinguishing between different layers of snapshots.
>>>>
>>>> And also because we have the bdrv_block_status_above() function which
>>>> just looks strikingly similar (with the difference that it does not
>>>> invoke a callback but just returns the next allocated range). If you
>>>> pass base=NULL to it, it will also “interpret that as the beginning of
>>>> time and return all allocated regions of the image” (or rather image
>>>> chain, in our case). But it would just return 7 MB as allocated. (Even
>>>> though it does in fact return layer information, i.e. where a given
>>>> continuous chunk of data is allocated.)
>>>>
>>>> Sure, there is no good reason for why our interface should by chance be
>>>> the same as librbd’s interface. But without having tested it, the fact
>>>> that the callback cannot detect which layer a chunk is allocated on just
>>>> makes me wary.
>>>
>>> And the librbd code doesn’t alleviate my concerns.
>>>
>>> From what I can see, it first creates a bitmap (two bits per entry) that
>>> covers the whole image and says which objects are allocated and which
>>> aren‘t. Through the whole chain, that is. So in the above example, the
>>> bitmap would report everything as allocated. (Or rather “updated“ in
>>> librbd‘s terms.)
>>>
>>> Then it has this piece:
>>>
>>>> uint64_t off = m_offset;
>>>> uint64_t left = m_length;
>>>>
>>>> while (left > 0) {
>>>> uint64_t period_off = off - (off % period);
>>>> uint64_t read_len = min(period_off + period - off, left);
>>>>
>>>> // map to extents
>>>> map<object_t,vector<ObjectExtent> > object_extents;
>>>> Striper::file_to_extents(cct, m_image_ctx.format_string,
>>>> &m_image_ctx.layout, off, read_len, 0,
>>>> object_extents, 0);
>>>>
>>>> // get snap info for each object
>>>> for (map<object_t,vector<ObjectExtent> >::iterator p =
>>>> object_extents.begin();
>>>> p != object_extents.end(); ++p)
>>> [...]
>>>> for (std::vector<ObjectExtent>::iterator q = p->second.begin();
>>>> q != p->second.end(); ++q) {
>>>> r = m_callback(off + q->offset, q->length, updated,
>>>> m_callback_arg);
>>> [...]
>>>> }
>>> [...]
>>>> }
>>>>> left -= read_len;
>>>> off += read_len;
>>>> }
>>>
>>> So that iterates over the whole image (in our case, because m_offset is
>>> 0 and m_length is the image length), then picks out a chunk of read_len
>>> length, converts that to objects, and then iterates over all of those
>>> objects and all of their extents.
>>>
>>> file_to_extents looks like it’s just an arithmetic operation. So it
>>> doesn‘t look like that function returns extents in multiple snapshots.
>>> It just converts a range into subranges called “objects” and “extents”
>>> (at least that’s the way it looks to me).
>>>
>>> So overall, this looks awfully like it simply iterates over the whole
>>> image and then returns allocation information gathered as a top-down
>>> view through all of the snapshots, but not for each snapshot individually.
>>
>> Sorry, you're correct. The API function was originally designed to
>> support delta extents for supporting diff exports, so while it does
>> open each snapshot's object-map, it only returns a unioned set of
>> delta extents. The rbd CLI's "disk-usage" action behaves how I
>> described by counting the actual dirty block usage between snapshots.
>
> Max, Jason, thanks for the details!
>
> If you agree, I'll try to implement something similar, iterating on all
> snapshots.
Yes, that should work.
> What about metadata?
> I saw some APIs (e.g. rbd_metadata_list()) that allow us to get the metadata.
> An idea could be to iterate over them and sum the keys-values size returned,
> but I'm not sure they are returned in the same format as they are stored.
I wouldn’t mind ignoring metadata too much. If you can get something to
work, even better, but I don’t consider that too important.
Max
signature.asc
Description: OpenPGP digital signature
- [Qemu-devel] [PATCH v3] block/rbd: implement .bdrv_get_allocated_file_size callback, Stefano Garzarella, 2019/07/05
- Re: [Qemu-devel] [PATCH v3] block/rbd: implement .bdrv_get_allocated_file_size callback, Max Reitz, 2019/07/05
- Re: [Qemu-devel] [PATCH v3] block/rbd: implement .bdrv_get_allocated_file_size callback, Stefano Garzarella, 2019/07/05
- Re: [Qemu-devel] [PATCH v3] block/rbd: implement .bdrv_get_allocated_file_size callback, Jason Dillaman, 2019/07/08
- Re: [Qemu-devel] [PATCH v3] block/rbd: implement .bdrv_get_allocated_file_size callback, Max Reitz, 2019/07/09
- Re: [Qemu-devel] [PATCH v3] block/rbd: implement .bdrv_get_allocated_file_size callback, Max Reitz, 2019/07/09
- Re: [Qemu-devel] [PATCH v3] block/rbd: implement .bdrv_get_allocated_file_size callback, Jason Dillaman, 2019/07/09
- Re: [Qemu-devel] [PATCH v3] block/rbd: implement .bdrv_get_allocated_file_size callback, Stefano Garzarella, 2019/07/09
- Re: [Qemu-devel] [PATCH v3] block/rbd: implement .bdrv_get_allocated_file_size callback,
Max Reitz <=
- Re: [Qemu-devel] [PATCH v3] block/rbd: implement .bdrv_get_allocated_file_size callback, Jason Dillaman, 2019/07/09
- Re: [Qemu-devel] [PATCH v3] block/rbd: implement .bdrv_get_allocated_file_size callback, Stefano Garzarella, 2019/07/10