[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Inconsistent block status reply in qemu-nbd in qemu-img-6.2.0
From: |
Kevin Wolf |
Subject: |
Re: Inconsistent block status reply in qemu-nbd in qemu-img-6.2.0 |
Date: |
Mon, 17 Jan 2022 12:38:31 +0100 |
Am 17.01.2022 um 10:55 hat Hanna Reitz geschrieben:
> On 17.01.22 10:52, Kevin Wolf wrote:
> > Am 17.01.2022 um 09:46 hat Hanna Reitz geschrieben:
> > > On 16.01.22 19:09, Nir Soffer wrote:
> > > > On Sun, Jan 16, 2022 at 1:17 PM Nir Soffer <nsoffer@redhat.com> wrote:
> > > > > Some of our tests started to fail since qemu-img 6.2.0 released.
> > > > >
> > > > > Here is an example failure:
> > > > >
> > > > > $ truncate -s1g /data/scratch/empty-1g.raw
> > > > >
> > > > > $ qemu-nbd -r -t -e8 -k /tmp/nbd.sock -A -f raw
> > > > > /data/scratch/empty-1g.raw &
> > > > git bisect points to this commit:
> > > >
> > > > $ git bisect good
> > > > 0bc329fbb009f8601cec23bf2bc48ead0c5a5fa2 is the first bad commit
> > > > commit 0bc329fbb009f8601cec23bf2bc48ead0c5a5fa2
> > > > Author: Hanna Reitz <hreitz@redhat.com>
> > > > Date: Thu Aug 12 10:41:44 2021 +0200
> > > >
> > > > block: block-status cache for data regions
> > > >
> > > > The commit message says:
> > > >
> > > > (Note that only caching data regions but not zero regions means
> > > > that
> > > > returning false information from the cache is not catastrophic:
> > > > Treating
> > > > zeroes as data is fine. While we try to invalidate the cache on
> > > > zero
> > > > writes and discards, such incongruences may still occur when
> > > > there are
> > > > other processes writing to the image.)
> > > >
> > > > I don't think it is fine to report zero as data. This can cause severe
> > > > performance
> > > > issues when users copy unneeded zero extents over the network, instead
> > > > of
> > > > doing no work with zero bandwidth.
> > > You’re right, it was meant as a contrast to whether this might lead to
> > > catastrophic data failure.
> > >
> > > But it was also meant as a risk estimation. There wasn’t supposed to be a
> > > situation where the cache would report zeroes as data (other than with
> > > external writers), and so that’s why I thought it’d be fine. But, well,
> > > things are supposed to be one way, and then you (me) write buggy code...
> > >
> > > Thanks for the reproducer steps, I can reproduce the bug with your script
> > > (not with nbdinfo fwiw) and the problem seems to be this:
> > >
> > > diff --git a/block/io.c b/block/io.c
> > > index bb0a254def..83e94faeaf 100644
> > > --- a/block/io.c
> > > +++ b/block/io.c
> > > @@ -2498,7 +2498,8 @@ static int coroutine_fn
> > > bdrv_co_block_status(BlockDriverState *bs,
> > > * the cache requires an RCU update, so double check here to
> > > avoid
> > > * such an update if possible.
> > > */
> > > - if (ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID) &&
> > > + if (want_zero &&
> > > + ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID) &&
> > > QLIST_EMPTY(&bs->children))
> > > {
> > > /*
> > >
> > > We should update the cache only when we have accurate zero information, so
> > > only if want_zero was true.
> > Either that, or store the want_zero value as well and compare it before
> > using the cache.
> >
> > Depends on whether we need the optimisation for want_zero=false cases,
> > too. I think this mostly means bdrv_is_allocated(_above)(), which seems
> > to be relevant for the commit and stream jobs and 'qemu-img rebase' at
> > least and some less commonly used stuff. There are some other cases of
> > is_allocated (like in mirror) where it doesn't make a difference because
> > we always process the maximum number of bytes with the first call.
>
> I think we need the cache only for want_zero=true cases, because allocation
> information is handled by the format layer, which doesn’t use the cache at
> all. Also, the reason for the cache was to skip slow SEEK_HOLE/DATA calls
> on the protocol level, which is a want_zero=true case.
Hm, good point. But we would still call the protocol driver with
want_zero=false for raw images.
Obviously, this removes the commit case because it calls is_allocated
only on the overlay and that can't be raw. However, streaming (or more
specifically the copy-on-read driver) does seem to include the base
image in its bdrv_is_allocated_above(), which may be raw.
So it looks to me as if the optimisation were still relevant when you're
streaming from a raw base image into a qcow2 overlay.
Kevin