qemu-block
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 2/2] block/rbd: workaround for ceph issue #53784


From: Peter Lieven
Subject: Re: [PATCH 2/2] block/rbd: workaround for ceph issue #53784
Date: Wed, 12 Jan 2022 23:09:30 +0100


> Am 12.01.2022 um 22:57 schrieb Ilya Dryomov <idryomov@gmail.com>:
> 
> On Wed, Jan 12, 2022 at 9:37 PM Peter Lieven <pl@kamp.de> wrote:
>> 
>>> Am 12.01.22 um 20:51 schrieb Ilya Dryomov:
>>> On Wed, Jan 12, 2022 at 1:32 PM Peter Lieven <pl@kamp.de> wrote:
>>>> Am 12.01.22 um 13:22 schrieb Ilya Dryomov:
>>>>> On Wed, Jan 12, 2022 at 12:55 PM Peter Lieven <pl@kamp.de> wrote:
>>>>>> Am 12.01.22 um 10:59 schrieb Ilya Dryomov:
>>>>>>> On Mon, Jan 10, 2022 at 12:43 PM Peter Lieven <pl@kamp.de> wrote:
>>>>>>>> librbd had a bug until early 2022 that affected all versions of ceph 
>>>>>>>> that
>>>>>>>> supported fast-diff. This bug results in reporting of incorrect offsets
>>>>>>>> if the offset parameter to rbd_diff_iterate2 is not object aligned.
>>>>>>>> Work around this bug by rounding down the offset to object boundaries.
>>>>>>>> 
>>>>>>>> Fixes: https://tracker.ceph.com/issues/53784
>>>>>>> I don't think the Fixes tag is appropriate here.  Linking librbd
>>>>>>> ticket is fine but this patch doesn't really fix anything.
>>>>>> Okay, I will change that to See:
>>>>> It's already linked in the source code, up to you if you also want to
>>>>> link it in the description.
>>>>> 
>>>>>>>> Cc: qemu-stable@nongnu.org
>>>>>>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>>>>>>>> ---
>>>>>>>> block/rbd.c | 17 ++++++++++++++++-
>>>>>>>> 1 file changed, 16 insertions(+), 1 deletion(-)
>>>>>>>> 
>>>>>>>> diff --git a/block/rbd.c b/block/rbd.c
>>>>>>>> index 5e9dc91d81..260cb9f4b4 100644
>>>>>>>> --- a/block/rbd.c
>>>>>>>> +++ b/block/rbd.c
>>>>>>>> @@ -1333,6 +1333,7 @@ static int coroutine_fn 
>>>>>>>> qemu_rbd_co_block_status(BlockDriverState *bs,
>>>>>>>>     int status, r;
>>>>>>>>     RBDDiffIterateReq req = { .offs = offset };
>>>>>>>>     uint64_t features, flags;
>>>>>>>> +    int64_t head;
>>>>>>>> 
>>>>>>>>     assert(offset + bytes <= s->image_size);
>>>>>>>> 
>>>>>>>> @@ -1360,6 +1361,19 @@ static int coroutine_fn 
>>>>>>>> qemu_rbd_co_block_status(BlockDriverState *bs,
>>>>>>>>         return status;
>>>>>>>>     }
>>>>>>>> 
>>>>>>>> +    /*
>>>>>>>> +     * librbd had a bug until early 2022 that affected all versions 
>>>>>>>> of ceph that
>>>>>>>> +     * supported fast-diff. This bug results in reporting of 
>>>>>>>> incorrect offsets
>>>>>>>> +     * if the offset parameter to rbd_diff_iterate2 is not object 
>>>>>>>> aligned.
>>>>>>>> +     * Work around this bug by rounding down the offset to object 
>>>>>>>> boundaries.
>>>>>>>> +     *
>>>>>>>> +     * See: https://tracker.ceph.com/issues/53784
>>>>>>>> +     */
>>>>>>>> +    head = offset & (s->object_size - 1);
>>>>>>>> +    offset -= head;
>>>>>>>> +    req.offs -= head;
>>>>>>>> +    bytes += head;
>>>>>>> So it looks like the intention is to have more or less a permanent
>>>>>>> workaround since all librbd versions are affected, right?  For that,
>>>>>>> I think we would need to also reject custom striping patterns and
>>>>>>> clones.  For the above to be reliable, the image has to be standalone
>>>>>>> and have a default striping pattern (stripe_unit == object_size &&
>>>>>>> stripe_count == 1).  Otherwise, behave as if fast-diff is disabled or
>>>>>>> invalid.
>>>>>> Do you have a fealing how many users use a different striping pattern 
>>>>>> than default?
>>>>> Very few.
>>>>> 
>>>>>> What about EC backed pools?
>>>>> In this context EC pools behave exactly the same as replicated pools.
>>>>> 
>>>>>> Do you have another idea how we can detect if the librbd version is 
>>>>>> broken?
>>>>> No.  Initially I wanted to just fix these bugs in librbd, relying on
>>>>> the assumption that setups with a new QEMU should also have a fairly
>>>>> new librbd.  But after looking at various distros and realizing the
>>>>> extent of rbd_diff_iterate2() issues, I think a long-term workaround
>>>>> in QEMU makes sense.  A configure-time check for known good versions
>>>>> of librbd can be added later if someone feels like it.
>>>> 
>>>> Okay, I will add a check on image open if the striping fits or
>>>> 
>>>> we have a clone. Can you explain how I would best check for it?
>>> ... if _we_ are a clone.  Use rbd_get_parent_info(), it returns 0 if an
>>> image has a parent and -ENOENT if an image is standalone.  Keep in mind
>>> that the image can be flattened (i.e. made standalone) at any moment so
>>> this should be checked on each invocation.
>>> 
>>> To get striping parameters, use rbd_get_stripe_unit() and
>>> rbd_get_stripe_count() after checking for RBD_FEATURE_STRIPINGV2
>>> feature.  This can be done just on image open, but since features
>>> are queried on each invocation anyway and very few users are going
>>> to have a custom striping pattern (and therefore have this feature
>>> enabled), I'd do it here as well.
>>> 
>>> Thanks,
>>> 
>>>                Ilya
>> 
>> The flattening is not a problem because if we detect a clone we disable 
>> get_block_status
>> 
>> for the session. If it gets flattened later it doesn't hurt we have just no 
>> allocation info.
>> 
>> So at least these checks could be done at image open.
> 
> I see.  Doing both on image open and handling clones this way is
> fine too.
> 
>> 
>> 
>> To be honest, all these checks and workaround stuff doesn't feel right. We 
>> try to fix something that is broken
>> 
>> from scratch and make the qemu block driver quite ugly doing so.
>> 
>> 
>> I would personally go and drop get_block_status support for all existing 
>> librbd versions and add a flag to librbd that reports that rbd_diff_iterate2 
>> is not broken
>> 
>> and check for that and only support get_block_status if it is present.
> 
> It's a couple of dozen lines of code which I certainly wouldn't
> consider ugly.  A flag would be taking it too far IMO, a simple
> lower-bound version check on the upcoming Quincy release should
> be enough.

It’s just checks that need to be explained. I had all those ifdef’ry in mind 
that we had in the past. I will wrap everything that we add to fix the issue in 
an ifdef that’s false from quincy onwoards so that’s clear that we can remove 
it in a few years when the minimum required version is bumped to quincy or 
beyond.

> 
>> 
>> 
>> Or if you can confirm that rbd_diff_iterate2 does at least work for a query 
>> on the full image support get_block_status only for read-only
>> 
>> opened image files (this is true for our both current use cases backing file 
>> and qemu-img convert with rbd source) and read and cache
>> 
>> the full allocation status at the first invocation of get_block_status (and 
>> of course drop it on invalidate_cache or bdrv_reopen).
>> 
>> This would also work around the nasty algorithm that tries to handle all 
>> callback cases.
> 
> What nasty algorithm?  If you are talking about the first patch,
> then I think it can be reduced to the addition of
> 
>    if (exists)
>       return 0;
> 
> at the top of the callback.

Yes, I meant the first patch. I have answered the mails in the wrong order.

I will put it all together and send a V2 tomorrow.

Peter





reply via email to

[Prev in Thread] Current Thread [Next in Thread]