[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH 2/2] block/rbd: workaround for ceph issue #53784
From: |
Peter Lieven |
Subject: |
Re: [PATCH 2/2] block/rbd: workaround for ceph issue #53784 |
Date: |
Wed, 12 Jan 2022 23:09:30 +0100 |
> Am 12.01.2022 um 22:57 schrieb Ilya Dryomov <idryomov@gmail.com>:
>
> On Wed, Jan 12, 2022 at 9:37 PM Peter Lieven <pl@kamp.de> wrote:
>>
>>> Am 12.01.22 um 20:51 schrieb Ilya Dryomov:
>>> On Wed, Jan 12, 2022 at 1:32 PM Peter Lieven <pl@kamp.de> wrote:
>>>> Am 12.01.22 um 13:22 schrieb Ilya Dryomov:
>>>>> On Wed, Jan 12, 2022 at 12:55 PM Peter Lieven <pl@kamp.de> wrote:
>>>>>> Am 12.01.22 um 10:59 schrieb Ilya Dryomov:
>>>>>>> On Mon, Jan 10, 2022 at 12:43 PM Peter Lieven <pl@kamp.de> wrote:
>>>>>>>> librbd had a bug until early 2022 that affected all versions of ceph
>>>>>>>> that
>>>>>>>> supported fast-diff. This bug results in reporting of incorrect offsets
>>>>>>>> if the offset parameter to rbd_diff_iterate2 is not object aligned.
>>>>>>>> Work around this bug by rounding down the offset to object boundaries.
>>>>>>>>
>>>>>>>> Fixes: https://tracker.ceph.com/issues/53784
>>>>>>> I don't think the Fixes tag is appropriate here. Linking librbd
>>>>>>> ticket is fine but this patch doesn't really fix anything.
>>>>>> Okay, I will change that to See:
>>>>> It's already linked in the source code, up to you if you also want to
>>>>> link it in the description.
>>>>>
>>>>>>>> Cc: qemu-stable@nongnu.org
>>>>>>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>>>>>>>> ---
>>>>>>>> block/rbd.c | 17 ++++++++++++++++-
>>>>>>>> 1 file changed, 16 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/block/rbd.c b/block/rbd.c
>>>>>>>> index 5e9dc91d81..260cb9f4b4 100644
>>>>>>>> --- a/block/rbd.c
>>>>>>>> +++ b/block/rbd.c
>>>>>>>> @@ -1333,6 +1333,7 @@ static int coroutine_fn
>>>>>>>> qemu_rbd_co_block_status(BlockDriverState *bs,
>>>>>>>> int status, r;
>>>>>>>> RBDDiffIterateReq req = { .offs = offset };
>>>>>>>> uint64_t features, flags;
>>>>>>>> + int64_t head;
>>>>>>>>
>>>>>>>> assert(offset + bytes <= s->image_size);
>>>>>>>>
>>>>>>>> @@ -1360,6 +1361,19 @@ static int coroutine_fn
>>>>>>>> qemu_rbd_co_block_status(BlockDriverState *bs,
>>>>>>>> return status;
>>>>>>>> }
>>>>>>>>
>>>>>>>> + /*
>>>>>>>> + * librbd had a bug until early 2022 that affected all versions
>>>>>>>> of ceph that
>>>>>>>> + * supported fast-diff. This bug results in reporting of
>>>>>>>> incorrect offsets
>>>>>>>> + * if the offset parameter to rbd_diff_iterate2 is not object
>>>>>>>> aligned.
>>>>>>>> + * Work around this bug by rounding down the offset to object
>>>>>>>> boundaries.
>>>>>>>> + *
>>>>>>>> + * See: https://tracker.ceph.com/issues/53784
>>>>>>>> + */
>>>>>>>> + head = offset & (s->object_size - 1);
>>>>>>>> + offset -= head;
>>>>>>>> + req.offs -= head;
>>>>>>>> + bytes += head;
>>>>>>> So it looks like the intention is to have more or less a permanent
>>>>>>> workaround since all librbd versions are affected, right? For that,
>>>>>>> I think we would need to also reject custom striping patterns and
>>>>>>> clones. For the above to be reliable, the image has to be standalone
>>>>>>> and have a default striping pattern (stripe_unit == object_size &&
>>>>>>> stripe_count == 1). Otherwise, behave as if fast-diff is disabled or
>>>>>>> invalid.
>>>>>> Do you have a fealing how many users use a different striping pattern
>>>>>> than default?
>>>>> Very few.
>>>>>
>>>>>> What about EC backed pools?
>>>>> In this context EC pools behave exactly the same as replicated pools.
>>>>>
>>>>>> Do you have another idea how we can detect if the librbd version is
>>>>>> broken?
>>>>> No. Initially I wanted to just fix these bugs in librbd, relying on
>>>>> the assumption that setups with a new QEMU should also have a fairly
>>>>> new librbd. But after looking at various distros and realizing the
>>>>> extent of rbd_diff_iterate2() issues, I think a long-term workaround
>>>>> in QEMU makes sense. A configure-time check for known good versions
>>>>> of librbd can be added later if someone feels like it.
>>>>
>>>> Okay, I will add a check on image open if the striping fits or
>>>>
>>>> we have a clone. Can you explain how I would best check for it?
>>> ... if _we_ are a clone. Use rbd_get_parent_info(), it returns 0 if an
>>> image has a parent and -ENOENT if an image is standalone. Keep in mind
>>> that the image can be flattened (i.e. made standalone) at any moment so
>>> this should be checked on each invocation.
>>>
>>> To get striping parameters, use rbd_get_stripe_unit() and
>>> rbd_get_stripe_count() after checking for RBD_FEATURE_STRIPINGV2
>>> feature. This can be done just on image open, but since features
>>> are queried on each invocation anyway and very few users are going
>>> to have a custom striping pattern (and therefore have this feature
>>> enabled), I'd do it here as well.
>>>
>>> Thanks,
>>>
>>> Ilya
>>
>> The flattening is not a problem because if we detect a clone we disable
>> get_block_status
>>
>> for the session. If it gets flattened later it doesn't hurt we have just no
>> allocation info.
>>
>> So at least these checks could be done at image open.
>
> I see. Doing both on image open and handling clones this way is
> fine too.
>
>>
>>
>> To be honest, all these checks and workaround stuff doesn't feel right. We
>> try to fix something that is broken
>>
>> from scratch and make the qemu block driver quite ugly doing so.
>>
>>
>> I would personally go and drop get_block_status support for all existing
>> librbd versions and add a flag to librbd that reports that rbd_diff_iterate2
>> is not broken
>>
>> and check for that and only support get_block_status if it is present.
>
> It's a couple of dozen lines of code which I certainly wouldn't
> consider ugly. A flag would be taking it too far IMO, a simple
> lower-bound version check on the upcoming Quincy release should
> be enough.
It’s just checks that need to be explained. I had all those ifdef’ry in mind
that we had in the past. I will wrap everything that we add to fix the issue in
an ifdef that’s false from quincy onwoards so that’s clear that we can remove
it in a few years when the minimum required version is bumped to quincy or
beyond.
>
>>
>>
>> Or if you can confirm that rbd_diff_iterate2 does at least work for a query
>> on the full image support get_block_status only for read-only
>>
>> opened image files (this is true for our both current use cases backing file
>> and qemu-img convert with rbd source) and read and cache
>>
>> the full allocation status at the first invocation of get_block_status (and
>> of course drop it on invalidate_cache or bdrv_reopen).
>>
>> This would also work around the nasty algorithm that tries to handle all
>> callback cases.
>
> What nasty algorithm? If you are talking about the first patch,
> then I think it can be reduced to the addition of
>
> if (exists)
> return 0;
>
> at the top of the callback.
Yes, I meant the first patch. I have answered the mails in the wrong order.
I will put it all together and send a V2 tomorrow.
Peter
[PATCH 1/2] block/rbd: fix handling of holes in .bdrv_co_block_status, Peter Lieven, 2022/01/10