qemu-block
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-block] [PATCH v3] block/rbd: implement .bdrv_get_allocated_fil


From: Stefano Garzarella
Subject: Re: [Qemu-block] [PATCH v3] block/rbd: implement .bdrv_get_allocated_file_size callback
Date: Tue, 9 Jul 2019 15:09:48 +0200
User-agent: NeoMutt/20180716

On Tue, Jul 09, 2019 at 08:55:19AM -0400, Jason Dillaman wrote:
> On Tue, Jul 9, 2019 at 5:45 AM Max Reitz <address@hidden> wrote:
> >
> > On 09.07.19 10:55, Max Reitz wrote:
> > > On 09.07.19 05:08, Jason Dillaman wrote:
> > >> On Fri, Jul 5, 2019 at 6:43 AM Stefano Garzarella <address@hidden> wrote:
> > >>>
> > >>> On Fri, Jul 05, 2019 at 11:58:43AM +0200, Max Reitz wrote:
> > >>>> On 05.07.19 11:32, Stefano Garzarella wrote:
> > >>>>> This patch allows 'qemu-img info' to show the 'disk size' for
> > >>>>> the RBD images that have the fast-diff feature enabled.
> > >>>>>
> > >>>>> If this feature is enabled, we use the rbd_diff_iterate2() API
> > >>>>> to calculate the allocated size for the image.
> > >>>>>
> > >>>>> Signed-off-by: Stefano Garzarella <address@hidden>
> > >>>>> ---
> > >>>>> v3:
> > >>>>>   - return -ENOTSUP instead of -1 when fast-diff is not available
> > >>>>>     [John, Jason]
> > >>>>> v2:
> > >>>>>   - calculate the actual usage only if the fast-diff feature is
> > >>>>>     enabled [Jason]
> > >>>>> ---
> > >>>>>  block/rbd.c | 54 
> > >>>>> +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >>>>>  1 file changed, 54 insertions(+)
> > >>>>
> > >>>> Well, the librbd documentation is non-existing as always, but while
> > >>>> googling, I at least found that libvirt has exactly the same code.  So 
> > >>>> I
> > >>>> suppose it must be quite correct, then.
> > >>>>
> > >>>
> > >>> While I wrote this code I took a look at libvirt implementation and also
> > >>> at the "rbd" tool in the ceph repository: compute_image_disk_usage() in
> > >>> src/tools/rbd/action/DiskUsage.cc
> > >>>
> > >>>>> diff --git a/block/rbd.c b/block/rbd.c
> > >>>>> index 59757b3120..b6bed683e5 100644
> > >>>>> --- a/block/rbd.c
> > >>>>> +++ b/block/rbd.c
> > >>>>> @@ -1084,6 +1084,59 @@ static int64_t 
> > >>>>> qemu_rbd_getlength(BlockDriverState *bs)
> > >>>>>      return info.size;
> > >>>>>  }
> > >>>>>
> > >>>>> +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int 
> > >>>>> exists,
> > >>>>> +                                 void *arg)
> > >>>>> +{
> > >>>>> +    int64_t *alloc_size = (int64_t *) arg;
> > >>>>> +
> > >>>>> +    if (exists) {
> > >>>>> +        (*alloc_size) += len;
> > >>>>> +    }
> > >>>>> +
> > >>>>> +    return 0;
> > >>>>> +}
> > >>>>> +
> > >>>>> +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs)
> > >>>>> +{
> > >>>>> +    BDRVRBDState *s = bs->opaque;
> > >>>>> +    uint64_t flags, features;
> > >>>>> +    int64_t alloc_size = 0;
> > >>>>> +    int r;
> > >>>>> +
> > >>>>> +    r = rbd_get_flags(s->image, &flags);
> > >>>>> +    if (r < 0) {
> > >>>>> +        return r;
> > >>>>> +    }
> > >>>>> +
> > >>>>> +    r = rbd_get_features(s->image, &features);
> > >>>>> +    if (r < 0) {
> > >>>>> +        return r;
> > >>>>> +    }
> > >>>>> +
> > >>>>> +    /*
> > >>>>> +     * We use rbd_diff_iterate2() only if the RBD image have 
> > >>>>> fast-diff
> > >>>>> +     * feature enabled. If it is disabled, rbd_diff_iterate2() could 
> > >>>>> be
> > >>>>> +     * very slow on a big image.
> > >>>>> +     */
> > >>>>> +    if (!(features & RBD_FEATURE_FAST_DIFF) ||
> > >>>>> +        (flags & RBD_FLAG_FAST_DIFF_INVALID)) {
> > >>>>> +        return -ENOTSUP;
> > >>>>> +    }
> > >>>>> +
> > >>>>> +    /*
> > >>>>> +     * rbd_diff_iterate2(), if the source snapshot name is NULL, 
> > >>>>> invokes
> > >>>>> +     * the callback on all allocated regions of the image.
> > >>>>> +     */
> > >>>>> +    r = rbd_diff_iterate2(s->image, NULL, 0,
> > >>>>> +                          bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1,
> > >>>>> +                          &rbd_allocated_size_cb, &alloc_size);
> > >>>>
> > >>>> But I have a question.  This is basically block_status, right?  So it
> > >>>> gives us information on which areas are allocated and which are not.
> > >>>> The result thus gives us a lower bound on the allocation size, but is 
> > >>>> it
> > >>>> really exactly the allocation size?
> > >>>>
> > >>>> There are two things I’m concerned about:
> > >>>>
> > >>>> 1. What about metadata?
> > >>>
> > >>> Good question, I don't think it includes the size used by metadata and I
> > >>> don't know if there is a way to know it. I'll check better.
> > >>
> > >> It does not include the size of metadata, the "rbd_diff_iterate2"
> > >> function is literally just looking for touched data blocks within the
> > >> RBD image.
> > >>
> > >>>>
> > >>>> 2. If you have multiple snapshots, this will only report the overall
> > >>>> allocation information, right?  So say there is something like this:
> > >>>>
> > >>>> (“A” means an allocated MB, “-” is an unallocated MB)
> > >>>>
> > >>>> Snapshot 1: AAAA---
> > >>>> Snapshot 2: --AAAAA
> > >>>> Snapshot 3: -AAAA--
> > >>>>
> > >>>> I think the allocated data size is the number of As in total (13 MB).
> > >>>> But I suppose this API will just return 7 MB, because it looks on
> > >>>> everything an it sees the whole image range (7 MB) to be allocated.  It
> > >>>> doesn’t report in how many snapshots some region is allocated.
> > >>
> > >> It should return 13 dirty data blocks (multipled by the size of the
> > >> data block) since when you don't provide a "from snapshot" name, it
> > >> will iterate from the first snapshot to the HEAD revision.
> > >
> > > Have you tested that?
> > >
> > > I‘m so skeptical because the callback function interface has no way of
> > > distinguishing between different layers of snapshots.
> > >
> > > And also because we have the bdrv_block_status_above() function which
> > > just looks strikingly similar (with the difference that it does not
> > > invoke a callback but just returns the next allocated range).  If you
> > > pass base=NULL to it, it will also “interpret that as the beginning of
> > > time and return all allocated regions of the image” (or rather image
> > > chain, in our case).  But it would just return 7 MB as allocated.  (Even
> > > though it does in fact return layer information, i.e. where a given
> > > continuous chunk of data is allocated.)
> > >
> > > Sure, there is no good reason for why our interface should by chance be
> > > the same as librbd’s interface.  But without having tested it, the fact
> > > that the callback cannot detect which layer a chunk is allocated on just
> > > makes me wary.
> >
> > And the librbd code doesn’t alleviate my concerns.
> >
> > From what I can see, it first creates a bitmap (two bits per entry) that
> > covers the whole image and says which objects are allocated and which
> > aren‘t.  Through the whole chain, that is.  So in the above example, the
> > bitmap would report everything as allocated.  (Or rather “updated“ in
> > librbd‘s terms.)
> >
> > Then it has this piece:
> >
> > >   uint64_t off = m_offset;
> > >   uint64_t left = m_length;
> > >
> > >   while (left > 0) {
> > >     uint64_t period_off = off - (off % period);
> > >     uint64_t read_len = min(period_off + period - off, left);
> > >
> > >     // map to extents
> > >     map<object_t,vector<ObjectExtent> > object_extents;
> > >     Striper::file_to_extents(cct, m_image_ctx.format_string,
> > >                              &m_image_ctx.layout, off, read_len, 0,
> > >                              object_extents, 0);
> > >
> > >     // get snap info for each object
> > >     for (map<object_t,vector<ObjectExtent> >::iterator p =
> > >            object_extents.begin();
> > >          p != object_extents.end(); ++p)
> > [...]
> > >           for (std::vector<ObjectExtent>::iterator q = p->second.begin();
> > >                q != p->second.end(); ++q) {
> > >             r = m_callback(off + q->offset, q->length, updated, 
> > > m_callback_arg);
> > [...]
> > >           }
> > [...]
> > >     }
> > >>     left -= read_len;
> > >     off += read_len;
> > >   }
> >
> > So that iterates over the whole image (in our case, because m_offset is
> > 0 and m_length is the image length), then picks out a chunk of read_len
> > length, converts that to objects, and then iterates over all of those
> > objects and all of their extents.
> >
> > file_to_extents looks like it’s just an arithmetic operation.  So it
> > doesn‘t look like that function returns extents in multiple snapshots.
> > It just converts a range into subranges called “objects” and “extents”
> > (at least that’s the way it looks to me).
> >
> > So overall, this looks awfully like it simply iterates over the whole
> > image and then returns allocation information gathered as a top-down
> > view through all of the snapshots, but not for each snapshot individually.
> 
> Sorry, you're correct. The API function was originally designed to
> support delta extents for supporting diff exports, so while it does
> open each snapshot's object-map, it only returns a unioned set of
> delta extents. The rbd CLI's "disk-usage" action behaves how I
> described by counting the actual dirty block usage between snapshots.

Max, Jason, thanks for the details!

If you agree, I'll try to implement something similar, iterating on all
snapshots.

What about metadata?
I saw some APIs (e.g. rbd_metadata_list()) that allow us to get the metadata.
An idea could be to iterate over them and sum the keys-values size returned,
but I'm not sure they are returned in the same format as they are stored.

Thanks,
Stefano



reply via email to

[Prev in Thread] Current Thread [Next in Thread]