From: Jason Dillaman
Subject: Re: [Qemu-devel] [PATCH v3] block/rbd: implement .bdrv_get_allocated_file_size callback
Date: Tue, 9 Jul 2019 21:42:14 -0400
On Tue, Jul 9, 2019 at 11:32 AM Max Reitz <address@hidden> wrote:
>
> On 09.07.19 15:09, Stefano Garzarella wrote:
> > On Tue, Jul 09, 2019 at 08:55:19AM -0400, Jason Dillaman wrote:
> >> On Tue, Jul 9, 2019 at 5:45 AM Max Reitz <address@hidden> wrote:
> >>>
> >>> On 09.07.19 10:55, Max Reitz wrote:
> >>>> On 09.07.19 05:08, Jason Dillaman wrote:
> >>>>> On Fri, Jul 5, 2019 at 6:43 AM Stefano Garzarella <address@hidden>
> >>>>> wrote:
> >>>>>>
> >>>>>> On Fri, Jul 05, 2019 at 11:58:43AM +0200, Max Reitz wrote:
> >>>>>>> On 05.07.19 11:32, Stefano Garzarella wrote:
> >>>>>>>> This patch allows 'qemu-img info' to show the 'disk size' for
> >>>>>>>> the RBD images that have the fast-diff feature enabled.
> >>>>>>>>
> >>>>>>>> If this feature is enabled, we use the rbd_diff_iterate2() API
> >>>>>>>> to calculate the allocated size for the image.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Stefano Garzarella <address@hidden>
> >>>>>>>> ---
> >>>>>>>> v3:
> >>>>>>>> - return -ENOTSUP instead of -1 when fast-diff is not available
> >>>>>>>> [John, Jason]
> >>>>>>>> v2:
> >>>>>>>> - calculate the actual usage only if the fast-diff feature is
> >>>>>>>> enabled [Jason]
> >>>>>>>> ---
> >>>>>>>>  block/rbd.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>>>>  1 file changed, 54 insertions(+)
> >>>>>>>
> >>>>>>> Well, the librbd documentation is non-existent as always, but while
> >>>>>>> googling, I at least found that libvirt has exactly the same code. So I
> >>>>>>> suppose it must be quite correct, then.
> >>>>>>>
> >>>>>>
> >>>>>> While writing this code I took a look at the libvirt implementation and
> >>>>>> also at the "rbd" tool in the ceph repository: compute_image_disk_usage()
> >>>>>> in src/tools/rbd/action/DiskUsage.cc
> >>>>>>
> >>>>>>>> diff --git a/block/rbd.c b/block/rbd.c
> >>>>>>>> index 59757b3120..b6bed683e5 100644
> >>>>>>>> --- a/block/rbd.c
> >>>>>>>> +++ b/block/rbd.c
> >>>>>>>> @@ -1084,6 +1084,59 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs)
> >>>>>>>>      return info.size;
> >>>>>>>>  }
> >>>>>>>>
> >>>>>>>> +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int exists,
> >>>>>>>> +                                 void *arg)
> >>>>>>>> +{
> >>>>>>>> +    int64_t *alloc_size = (int64_t *) arg;
> >>>>>>>> +
> >>>>>>>> +    if (exists) {
> >>>>>>>> +        (*alloc_size) += len;
> >>>>>>>> +    }
> >>>>>>>> +
> >>>>>>>> +    return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs)
> >>>>>>>> +{
> >>>>>>>> +    BDRVRBDState *s = bs->opaque;
> >>>>>>>> +    uint64_t flags, features;
> >>>>>>>> +    int64_t alloc_size = 0;
> >>>>>>>> +    int r;
> >>>>>>>> +
> >>>>>>>> +    r = rbd_get_flags(s->image, &flags);
> >>>>>>>> +    if (r < 0) {
> >>>>>>>> +        return r;
> >>>>>>>> +    }
> >>>>>>>> +
> >>>>>>>> +    r = rbd_get_features(s->image, &features);
> >>>>>>>> +    if (r < 0) {
> >>>>>>>> +        return r;
> >>>>>>>> +    }
> >>>>>>>> +
> >>>>>>>> +    /*
> >>>>>>>> +     * We use rbd_diff_iterate2() only if the RBD image has the fast-diff
> >>>>>>>> +     * feature enabled. If it is disabled, rbd_diff_iterate2() could be
> >>>>>>>> +     * very slow on a big image.
> >>>>>>>> +     */
> >>>>>>>> +    if (!(features & RBD_FEATURE_FAST_DIFF) ||
> >>>>>>>> +        (flags & RBD_FLAG_FAST_DIFF_INVALID)) {
> >>>>>>>> +        return -ENOTSUP;
> >>>>>>>> +    }
> >>>>>>>> +
> >>>>>>>> +    /*
> >>>>>>>> +     * rbd_diff_iterate2(), if the source snapshot name is NULL, invokes
> >>>>>>>> +     * the callback on all allocated regions of the image.
> >>>>>>>> +     */
> >>>>>>>> +    r = rbd_diff_iterate2(s->image, NULL, 0,
> >>>>>>>> +                          bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1,
> >>>>>>>> +                          &rbd_allocated_size_cb, &alloc_size);
> >>>>>>>
> >>>>>>> But I have a question. This is basically block_status, right? So it
> >>>>>>> gives us information on which areas are allocated and which are not.
> >>>>>>> The result thus gives us a lower bound on the allocation size, but is
> >>>>>>> it
> >>>>>>> really exactly the allocation size?
> >>>>>>>
> >>>>>>> There are two things I’m concerned about:
> >>>>>>>
> >>>>>>> 1. What about metadata?
> >>>>>>
> >>>>>> Good question. I don't think it includes the size used by metadata, and
> >>>>>> I don't know if there is a way to obtain it. I'll look into it further.
> >>>>>
> >>>>> It does not include the size of metadata; the "rbd_diff_iterate2"
> >>>>> function is literally just looking for touched data blocks within the
> >>>>> RBD image.
> >>>>>
> >>>>>>>
> >>>>>>> 2. If you have multiple snapshots, this will only report the overall
> >>>>>>> allocation information, right? So say there is something like this:
> >>>>>>>
> >>>>>>> (“A” means an allocated MB, “-” is an unallocated MB)
> >>>>>>>
> >>>>>>> Snapshot 1: AAAA---
> >>>>>>> Snapshot 2: --AAAAA
> >>>>>>> Snapshot 3: -AAAA--
> >>>>>>>
> >>>>>>> I think the allocated data size is the number of As in total (13 MB).
> >>>>>>> But I suppose this API will just return 7 MB, because it looks at
> >>>>>>> everything and sees the whole image range (7 MB) as allocated. It
> >>>>>>> doesn’t report in how many snapshots a given region is allocated.
> >>>>>
> >>>>> It should return 13 dirty data blocks (multiplied by the size of the
> >>>>> data block) since when you don't provide a "from snapshot" name, it
> >>>>> will iterate from the first snapshot to the HEAD revision.
> >>>>
> >>>> Have you tested that?
> >>>>
> >>>> I’m so skeptical because the callback function interface has no way of
> >>>> distinguishing between different layers of snapshots.
> >>>>
> >>>> And also because we have the bdrv_block_status_above() function which
> >>>> just looks strikingly similar (with the difference that it does not
> >>>> invoke a callback but just returns the next allocated range). If you
> >>>> pass base=NULL to it, it will also “interpret that as the beginning of
> >>>> time and return all allocated regions of the image” (or rather image
> >>>> chain, in our case). But it would just return 7 MB as allocated. (Even
> >>>> though it does in fact return layer information, i.e. where a given
> >>>> continuous chunk of data is allocated.)
> >>>>
> >>>> Sure, there is no good reason why our interface should by chance be
> >>>> the same as librbd’s interface. But without having tested it, the fact
> >>>> that the callback cannot detect which layer a chunk is allocated on just
> >>>> makes me wary.
> >>>
> >>> And the librbd code doesn’t alleviate my concerns.
> >>>
> >>> From what I can see, it first creates a bitmap (two bits per entry) that
> >>> covers the whole image and says which objects are allocated and which
> >>> aren’t. Through the whole chain, that is. So in the above example, the
> >>> bitmap would report everything as allocated. (Or rather “updated” in
> >>> librbd’s terms.)
> >>>
> >>> Then it has this piece:
> >>>
> >>>>   uint64_t off = m_offset;
> >>>>   uint64_t left = m_length;
> >>>>
> >>>>   while (left > 0) {
> >>>>     uint64_t period_off = off - (off % period);
> >>>>     uint64_t read_len = min(period_off + period - off, left);
> >>>>
> >>>>     // map to extents
> >>>>     map<object_t,vector<ObjectExtent> > object_extents;
> >>>>     Striper::file_to_extents(cct, m_image_ctx.format_string,
> >>>>                              &m_image_ctx.layout, off, read_len, 0,
> >>>>                              object_extents, 0);
> >>>>
> >>>>     // get snap info for each object
> >>>>     for (map<object_t,vector<ObjectExtent> >::iterator p =
> >>>>            object_extents.begin();
> >>>>          p != object_extents.end(); ++p)
> >>> [...]
> >>>>       for (std::vector<ObjectExtent>::iterator q = p->second.begin();
> >>>>            q != p->second.end(); ++q) {
> >>>>         r = m_callback(off + q->offset, q->length, updated,
> >>>>                        m_callback_arg);
> >>> [...]
> >>>>       }
> >>> [...]
> >>>>     }
> >>>>
> >>>>     left -= read_len;
> >>>>     off += read_len;
> >>>>   }
> >>>
> >>> So that iterates over the whole image (in our case, because m_offset is
> >>> 0 and m_length is the image length), then picks out a chunk of read_len
> >>> length, converts that to objects, and then iterates over all of those
> >>> objects and all of their extents.
> >>>
> >>> file_to_extents looks like it’s just an arithmetic operation. So it
> >>> doesn’t look like that function returns extents in multiple snapshots.
> >>> It just converts a range into subranges called “objects” and “extents”
> >>> (at least that’s the way it looks to me).
> >>>
> >>> So overall, this looks awfully like it simply iterates over the whole
> >>> image and then returns allocation information gathered as a top-down
> >>> view through all of the snapshots, but not for each snapshot individually.
> >>
> >> Sorry, you're correct. The API function was originally designed to
> >> provide delta extents for diff exports, so while it does open each
> >> snapshot's object-map, it only returns a unioned set of delta extents.
> >> The rbd CLI's "disk-usage" action behaves as I described, counting
> >> the actual dirty block usage between snapshots.
> >
> > Max, Jason, thanks for the details!
> >
> > If you agree, I'll try to implement something similar, iterating over all
> > snapshots.
>
> Yes, that should work.
>
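For reference, iterating over all the snapshots could look roughly like the
untested sketch below, modelled on compute_image_disk_usage() in
src/tools/rbd/action/DiskUsage.cc: point the image at each snapshot in turn
with rbd_snap_set() and diff it against the previous snapshot with
rbd_diff_iterate2(), reusing rbd_allocated_size_cb() from the patch. The
helper name, the fixed 16-entry snapshot array, and the minimal error
handling are illustrative assumptions only, not the final code:

static int64_t rbd_sum_snap_allocated_size(rbd_image_t image)
{
    /* Sketch only: fixed-size snapshot array and minimal error handling;
     * a real patch would size the array via the -ERANGE retry protocol. */
    rbd_snap_info_t snaps[16];
    int max_snaps = 16;
    const char *prev = NULL;        /* NULL == beginning of time */
    int64_t total = 0;
    int i, num_snaps, r;

    num_snaps = rbd_snap_list(image, snaps, &max_snaps);
    if (num_snaps < 0) {
        return num_snaps;           /* e.g. -ERANGE if the array is too small */
    }

    /* Walk snap 0 .. snap N-1, then the HEAD revision (cur == NULL). */
    for (i = 0; i <= num_snaps; i++) {
        const char *cur = (i < num_snaps) ? snaps[i].name : NULL;
        uint64_t size;

        /* Point reads at this snapshot (NULL resets to HEAD). */
        r = rbd_snap_set(image, cur);
        if (r < 0) {
            goto out;
        }

        if (i < num_snaps) {
            size = snaps[i].size;
        } else {
            rbd_image_info_t info;
            r = rbd_stat(image, &info, sizeof(info));
            if (r < 0) {
                goto out;
            }
            size = info.size;
        }

        /* Count only the extents dirtied since the previous snapshot. */
        r = rbd_diff_iterate2(image, prev, 0, size, 0, 1,
                              &rbd_allocated_size_cb, &total);
        if (r < 0) {
            goto out;
        }
        prev = cur;
    }
    r = 0;

out:
    rbd_snap_list_end(snaps);
    rbd_snap_set(image, NULL);      /* back to HEAD */
    return r < 0 ? r : total;
}
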
> > What about metadata?
> > I saw some APIs (e.g. rbd_metadata_list()) that allow us to get the
> > metadata.
> > An idea could be to iterate over them and sum the key-value sizes returned,
> > but I'm not sure they are returned in the same format as they are stored.
>
> I wouldn’t mind ignoring metadata too much. If you can get something to
> work, even better, but I don’t consider that too important.
I don't think it's a good idea to attempt to estimate it. If there is
enough desire, we can always add a supported "disk-usage" API method
that takes into account things like the image metadata, object-map,
journaling, etc. overhead. However, I wouldn't expect it to be anywhere
near the actual block usage (unless something has gone terribly wrong
with journaling and it fails to trim your log).
> Max
>
--
Jason