qemu-block
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 2/2] block/rbd: workaround for ceph issue #53784


From: Ilya Dryomov
Subject: Re: [PATCH 2/2] block/rbd: workaround for ceph issue #53784
Date: Wed, 12 Jan 2022 22:55:20 +0100

On Wed, Jan 12, 2022 at 9:37 PM Peter Lieven <pl@kamp.de> wrote:
>
> Am 12.01.22 um 20:51 schrieb Ilya Dryomov:
> > On Wed, Jan 12, 2022 at 1:32 PM Peter Lieven <pl@kamp.de> wrote:
> >> Am 12.01.22 um 13:22 schrieb Ilya Dryomov:
> >>> On Wed, Jan 12, 2022 at 12:55 PM Peter Lieven <pl@kamp.de> wrote:
> >>>> Am 12.01.22 um 10:59 schrieb Ilya Dryomov:
> >>>>> On Mon, Jan 10, 2022 at 12:43 PM Peter Lieven <pl@kamp.de> wrote:
> >>>>>> librbd had a bug until early 2022 that affected all versions of ceph 
> >>>>>> that
> >>>>>> supported fast-diff. This bug results in reporting of incorrect offsets
> >>>>>> if the offset parameter to rbd_diff_iterate2 is not object aligned.
> >>>>>> Work around this bug by rounding down the offset to object boundaries.
> >>>>>>
> >>>>>> Fixes: https://tracker.ceph.com/issues/53784
> >>>>> I don't think the Fixes tag is appropriate here.  Linking librbd
> >>>>> ticket is fine but this patch doesn't really fix anything.
> >>>> Okay, I will change that to See:
> >>> It's already linked in the source code, up to you if you also want to
> >>> link it in the description.
> >>>
> >>>>>> Cc: qemu-stable@nongnu.org
> >>>>>> Signed-off-by: Peter Lieven <pl@kamp.de>
> >>>>>> ---
> >>>>>>  block/rbd.c | 17 ++++++++++++++++-
> >>>>>>  1 file changed, 16 insertions(+), 1 deletion(-)
> >>>>>>
> >>>>>> diff --git a/block/rbd.c b/block/rbd.c
> >>>>>> index 5e9dc91d81..260cb9f4b4 100644
> >>>>>> --- a/block/rbd.c
> >>>>>> +++ b/block/rbd.c
> >>>>>> @@ -1333,6 +1333,7 @@ static int coroutine_fn 
> >>>>>> qemu_rbd_co_block_status(BlockDriverState *bs,
> >>>>>>      int status, r;
> >>>>>>      RBDDiffIterateReq req = { .offs = offset };
> >>>>>>      uint64_t features, flags;
> >>>>>> +    int64_t head;
> >>>>>>
> >>>>>>      assert(offset + bytes <= s->image_size);
> >>>>>>
> >>>>>> @@ -1360,6 +1361,19 @@ static int coroutine_fn 
> >>>>>> qemu_rbd_co_block_status(BlockDriverState *bs,
> >>>>>>          return status;
> >>>>>>      }
> >>>>>>
> >>>>>> +    /*
> >>>>>> +     * librbd had a bug until early 2022 that affected all versions 
> >>>>>> of ceph that
> >>>>>> +     * supported fast-diff. This bug results in reporting of 
> >>>>>> incorrect offsets
> >>>>>> +     * if the offset parameter to rbd_diff_iterate2 is not object 
> >>>>>> aligned.
> >>>>>> +     * Work around this bug by rounding down the offset to object 
> >>>>>> boundaries.
> >>>>>> +     *
> >>>>>> +     * See: https://tracker.ceph.com/issues/53784
> >>>>>> +     */
> >>>>>> +    head = offset & (s->object_size - 1);
> >>>>>> +    offset -= head;
> >>>>>> +    req.offs -= head;
> >>>>>> +    bytes += head;
> >>>>> So it looks like the intention is to have more or less a permanent
> >>>>> workaround since all librbd versions are affected, right?  For that,
> >>>>> I think we would need to also reject custom striping patterns and
> >>>>> clones.  For the above to be reliable, the image has to be standalone
> >>>>> and have a default striping pattern (stripe_unit == object_size &&
> >>>>> stripe_count == 1).  Otherwise, behave as if fast-diff is disabled or
> >>>>> invalid.
> >>>> Do you have a fealing how many users use a different striping pattern 
> >>>> than default?
> >>> Very few.
> >>>
> >>>> What about EC backed pools?
> >>> In this context EC pools behave exactly the same as replicated pools.
> >>>
> >>>> Do you have another idea how we can detect if the librbd version is 
> >>>> broken?
> >>> No.  Initially I wanted to just fix these bugs in librbd, relying on
> >>> the assumption that setups with a new QEMU should also have a fairly
> >>> new librbd.  But after looking at various distros and realizing the
> >>> extent of rbd_diff_iterate2() issues, I think a long-term workaround
> >>> in QEMU makes sense.  A configure-time check for known good versions
> >>> of librbd can be added later if someone feels like it.
> >>
> >> Okay, I will add a check on image open if the striping fits or
> >>
> >> we have a clone. Can you explain how I would best check for it?
> > ... if _we_ are a clone.  Use rbd_get_parent_info(), it returns 0 if an
> > image has a parent and -ENOENT if an image is standalone.  Keep in mind
> > that the image can be flattened (i.e. made standalone) at any moment so
> > this should be checked on each invocation.
> >
> > To get striping parameters, use rbd_get_stripe_unit() and
> > rbd_get_stripe_count() after checking for RBD_FEATURE_STRIPINGV2
> > feature.  This can be done just on image open, but since features
> > are queried on each invocation anyway and very few users are going
> > to have a custom striping pattern (and therefore have this feature
> > enabled), I'd do it here as well.
> >
> > Thanks,
> >
> >                 Ilya
>
> The flattening is not a problem because if we detect a clone we disable 
> get_block_status
>
> for the session. If it gets flattened later it doesn't hurt we have just no 
> allocation info.
>
> So at least these checks could be done at image open.

I see.  Doing both on image open and handling clones this way is
fine too.

>
>
> To be honest, all these checks and workaround stuff doesn't feel right. We 
> try to fix something that is broken
>
> from scratch and make the qemu block driver quite ugly doing so.
>
>
> I would personally go and drop get_block_status support for all existing 
> librbd versions and add a flag to librbd that reports that rbd_diff_iterate2 
> is not broken
>
> and check for that and only support get_block_status if it is present.

It's a couple of dozen lines of code which I certainly wouldn't
consider ugly.  A flag would be taking it too far IMO, a simple
lower-bound version check on the upcoming Quincy release should
be enough.

>
>
> Or if you can confirm that rbd_diff_iterate2 does at least work for a query 
> on the full image support get_block_status only for read-only
>
> opened image files (this is true for our both current use cases backing file 
> and qemu-img convert with rbd source) and read and cache
>
> the full allocation status at the first invocation of get_block_status (and 
> of course drop it on invalidate_cache or bdrv_reopen).
>
> This would also work around the nasty algorithm that tries to handle all 
> callback cases.

What nasty algorithm?  If you are talking about the first patch,
then I think it can be reduced to the addition of

    if (exists)
       return 0;

at the top of the callback.

Thanks,

                Ilya



reply via email to

[Prev in Thread] Current Thread [Next in Thread]