From: Kevin Wolf
Subject: Re: [PATCH] qcow2: Reduce write_zeroes size in handle_alloc_space()
Date: Wed, 10 Jun 2020 13:25:04 +0200

Am 10.06.2020 um 08:50 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 09.06.2020 19:19, Eric Blake wrote:
> > On 6/9/20 10:18 AM, Kevin Wolf wrote:
> > 
> > > > > > -        ret = bdrv_co_pwrite_zeroes(s->data_file, m->alloc_offset,
> > > > > > -                                    m->nb_clusters * s->cluster_size,
> > > > > > +        ret = bdrv_co_pwrite_zeroes(s->data_file, start, len,
> > > > > >                                      BDRV_REQ_NO_FALLBACK);
> > > > 
> > > > Good point. If we weren't using BDRV_REQ_NO_FALLBACK, then avoiding
> > > > a pre-zero pass over the middle is essential. But since we are
> > > > insisting that the pre-zero pass be fast or else immediately fail,
> > > > the time spent in pre-zeroing should not be a concern. Do you have
> > > > benchmark numbers stating otherwise?
> > > 
> > > I stumbled across this behaviour (write_zeroes for 2 MB, then
> > > overwriting almost everything) in the context of a different bug, and
> > > it just didn't make much sense to me. Is there really a file system
> > > where fragmentation is introduced by not zeroing the area first and
> > > then overwriting it?
> > > 
> > > I'm not insisting on making this change because the behaviour is
> > > harmless if odd, but if we think that writing twice to some blocks is an
> > > optimisation, maybe we should actually measure and document this.
> > > 
> > > 
> > > Anyway, let's talk about the reported bug that made me look at the
> > > strace showing this behaviour, because I feel it supports my last
> > > point. It's a bit messy, but anyway:
> > > 
> > >      https://bugzilla.redhat.com/show_bug.cgi?id=1666864
> > > 
> > > So initially, bad performance on a fragmented image file was reported.
> > > Not much to do there, but then in comment 16, QA reported a performance
> > > regression in this case between 4.0 and 4.2. This regression was caused
> > > by c8bb23cbdbe, i.e. the commit that introduced handle_alloc_space().
> > > 
> > > Turns out that BDRV_REQ_NO_FALLBACK doesn't always guarantee that it's
> > > _really_ fast. fallocate(FALLOC_FL_ZERO_RANGE) causes some kind of
> > > flush on XFS, while buffered writes don't. So with the old code,
> > > qemu-img convert to a file on a very full filesystem (which causes
> > > fragmentation) was much faster when writing a zero buffer than when
> > > using write_zeroes (because it didn't flush the result).
> > 
> > Wow. That makes it sound like we should NOT attempt
> > fallocate(FALLOC_FL_ZERO_RANGE) on the fast path, because we don't
> > have guarantees that it is fast.
> > 
> > I really wish the kernel would give us
> > fallocate(FALLOC_FL_ZERO_RANGE|FALLOC_FL_NO_FALLBACK), which would
> > fail fast rather than doing a flush or other slow fallback.
> > 
> > > 
> > > I don't fully understand why this is and hope that XFS can do
> > > something about it. I also don't really think we should revert the
> > > change in QEMU, though I'm not completely sure. But I just wanted
> > > to share this to show that "obvious" characteristics of certain
> > > types of requests aren't always true and doing obscure
> > > optimisations based on what we think filesystems may do can
> > > actually achieve the opposite in some cases.
> > 
> > It also goes to show us that the kernel does NOT yet give us enough
> > fine-grained control over what we really want (which is: 'pre-zero
> > this if it is fast, but don't waste time if it is not'). Most of the
> > kernel interfaces end up being 'pre-zero this, and it might be fast,
> > fail fast, or even fall back to something safe but slow, and you
> > can't tell the difference short of trying'.
> 
> Hmm, actually, for small COW areas (several bytes? several sectors?),
> I'm not surprised that directly writing a zeroed buffer may be faster
> than any kind of WRITE_ZERO request. In particular, expanding a write
> request by a small number of bytes may be faster than doing two
> requests instead. Possibly we need some heuristics here. And I think
> it would be good to add some benchmarks based on scripts/simplebench
> to get real numbers (we'll try).

I'll continue the discussion in the BZ, but yes, at the moment the
recommendation of the XFS people seems to be that we avoid fallocate()
(at least without FALLOC_FL_KEEP_SIZE and on local filesystems) for
small sizes.
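
For reference, what is being discussed boils down to a plain
fallocate(2) zero-range call. A minimal standalone sketch (Linux-only;
the file name and the 2 MiB size are arbitrary values for
illustration, not taken from QEMU):

    #define _GNU_SOURCE
    #include <fcntl.h>      /* open(), fallocate(), FALLOC_FL_* */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("test.img", O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
            perror("open");
            return EXIT_FAILURE;
        }

        /*
         * Zero the first 2 MiB of the file. As discussed above, on XFS
         * this can imply some kind of flush, so "fast" is not
         * guaranteed. Adding FALLOC_FL_KEEP_SIZE would additionally
         * keep i_size unchanged.
         */
        if (fallocate(fd, FALLOC_FL_ZERO_RANGE, 0, 2 * 1024 * 1024) < 0) {
            perror("fallocate(FALLOC_FL_ZERO_RANGE)");
        }

        close(fd);
        return EXIT_SUCCESS;
    }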

It's not obvious what "small sizes" means in practice, but I wouldn't be
surprised if a qcow2 cluster is always in this category, even if it's
2 MB. (The pathological qemu-img convert case does use 2 MB buffers.)
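
To make the heuristic idea concrete, a hypothetical sketch follows;
write_zero_region() and the 2 MiB cutoff are invented for illustration
and are not code from QEMU or from this patch. Below some cutoff,
write an explicitly zeroed buffer; above it, let the kernel zero the
range in one operation:

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>      /* fallocate(), FALLOC_FL_ZERO_RANGE */
    #include <stdlib.h>
    #include <unistd.h>     /* pwrite() */

    /*
     * Hypothetical cutoff; only benchmarks (e.g. something based on
     * scripts/simplebench) can tell where the real break-even point is.
     */
    #define ZERO_WRITE_CUTOFF (2 * 1024 * 1024)

    static int write_zero_region(int fd, off_t offset, size_t bytes)
    {
        if (bytes < ZERO_WRITE_CUTOFF) {
            /* Small region: plain write of a zeroed buffer
             * (short writes ignored for brevity). */
            void *buf = calloc(1, bytes);
            if (!buf) {
                return -ENOMEM;
            }
            ssize_t ret = pwrite(fd, buf, bytes, offset);
            free(buf);
            return ret < 0 ? -errno : 0;
        }

        /* Large region: one zero-range operation in the kernel. */
        if (fallocate(fd, FALLOC_FL_ZERO_RANGE, offset, bytes) < 0) {
            return -errno;
        }
        return 0;
    }

Whether the cutoff should be anywhere near a qcow2 cluster size is
exactly what such benchmarks would have to show.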

Kevin