From: De Backer, Fred (Nokia - BE/Antwerp)
Subject: Re: [Qemu-block] [Qemu-devel] Request for clarification on qemu-img convert behavior zeroing target host_device
Date: Thu, 13 Dec 2018 21:14:41 +0000

> >>> We observe that in Fedora 29 the qemu-img, before imaging the disk, it
> >>> fully zeroes it. Taking into account the disk size, the whole process
> >>> now takes 35 minutes instead of 50 seconds. This causes the
> >>> ironic-python-agent operation to time-out. The Fedora 27 qemu-img
> >>> doesn't do that.
> >>
> >> Known issue; Nir and Rich have posted a previous thread on the topic,
> >> and the conclusion is that we need to make qemu-img smarter about NOT
> >> requesting pre-zeroing of devices where that is more expensive than
> >> just zeroing as we go.
> >> https://lists.gnu.org/archive/html/qemu-devel/2018-11/msg01182.html
> >
> > Yes, we should be careful to avoid the fallback in this case.
> >
> > However, how could this ever go from 50 seconds for writing the whole
> > image to 35 minutes?! Even if you end up writing the whole image twice
> > because you write zeros first and then overwrite them everywhere with
> > data, shouldn't the maximum be doubling the time, i.e. 100 seconds?

I believe our situation differs from the one described, where I understand
source and destination have a comparable size (hence the doubling of the time).
In the ironic deployment scenario, the source is a relatively small cloud image
compared to the destination, which is a disk on a bare-metal server. I've
included the properties of the source (10G image; mostly sparse; compressed
qcow2 size is 584M) and of the destination (300G RAID device on an HP
SmartArray controller).

Source qcow2 image properties:
image: /tmp/centos7-biosmbr-lvm-1539159593.qcow2
file format: qcow2
virtual size: 9.3G (10000000000 bytes)
disk size: 567M
cluster_size: 65536
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false

Destination blockdevice properties:
# blockdev --getsz --getdiscardzeroes --getss --getpbsz --getiomin --getioopt \
    --getalignoff --getmaxsect --getbsz --getsize64 --getra --getfra /dev/sda
585871964
0
512
512
262144
262144
0
512
2048
299966445568
256
256
# lsblk /dev/sda
NAME MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda    8:0    0 279.4G  0 disk

The observation is that the whole 300GB disk gets zeroed before the "small" 
image is written.

Here is the timing for FC27:
# time qemu-img convert -t directsync -O host_device \
    /tmp/centos7-biosmbr-lvm-1539159593.qcow2 /dev/sda
real    0m50.935s
user    0m7.917s
sys     0m3.954s

And for FC29:
# time qemu-img convert -t directsync -O host_device \
    /tmp/centos7-biosmbr-lvm-1539159593.qcow2 /dev/sda
real    35m41.981s
user    0m8.520s
sys     0m12.232s
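
As a rough sanity check on these numbers, the sketch below estimates the write rate implied by the FC29 run. It assumes that all of the extra time on FC29 is spent zeroing the full device; the timings alone do not prove that, so treat the result as an estimate only. The sizes and times are copied from the output above; the variable and function names are just illustrative.

```python
# Figures copied from the blockdev, qemu-img info and time output above.
DEVICE_BYTES = 299_966_445_568        # blockdev --getsize64 /dev/sda
IMAGE_VIRTUAL_BYTES = 10_000_000_000  # qcow2 virtual size
FC27_SECONDS = 50.935                 # real 0m50.935s
FC29_SECONDS = 35 * 60 + 41.981       # real 35m41.981s


def mib_per_second(num_bytes, seconds):
    """Convert a byte count and a duration into MiB/s."""
    return num_bytes / seconds / (1024 * 1024)


# Assumption: the FC29 run does everything the FC27 run does, plus a
# full-device zeroing pass that accounts for all of the extra time.
extra_seconds = FC29_SECONDS - FC27_SECONDS
zeroing_rate = mib_per_second(DEVICE_BYTES, extra_seconds)

size_ratio = DEVICE_BYTES / IMAGE_VIRTUAL_BYTES  # destination vs. image size
time_ratio = FC29_SECONDS / FC27_SECONDS         # FC29 vs. FC27 wall time

print(f"extra time on FC29:   {extra_seconds:.1f} s")
print(f"implied zeroing rate: {zeroing_rate:.1f} MiB/s")
print(f"device/image size:    {size_ratio:.1f}x")
print(f"FC29/FC27 wall time:  {time_ratio:.1f}x")
```

Under that assumption the implied rate is roughly 137 MiB/s, which would be plausible for writing the whole device once. That would fit a runtime driven by the 300G destination rather than by the 10G image, but the strace output is needed to confirm it.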

> >
> > Why is the write_zeroes fallback _that_ slow? It will also hit guests
> > that request write_zeroes, so I feel this is worth investigating a bit
> > more nevertheless.
> >
> > Can you check with strace which operation actually succeeds writing
> > zeros to /dev/sda? The first thing we try is fallocate with
> > FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE. This should always be
> > fast, so I suppose this fails in your case. The next thing is
> > BLKZEROOUT, which I think can do a fallback in the kernel. Does this
> > return success?
> > Otherwise we have another fallback mechanism inside of QEMU, which
> > would use normal pwrite calls with a zeroed buffer.
> 
> It may also be a case of poor lseek(SEEK_HOLE) performance on the source (a
> known issue with at least some versions of tmpfs). The way qemu-img queries
> for block status, it ends up repeatedly hammering on lseek(), and if lseek()
> is already O(n) instead of O(1) in behavior, that explodes into some O(n^2)
> scaling because qemu-img isn't caching the answers it got previously.
> 
> >
> > Once we know which mechanism is used, we can look into why it is so
> > abysmally slow.
 
> Indeed, performance traces are important for issues like this.

See the attached strace output for both FC27 and FC29.
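
To answer the question of which zeroing mechanism actually succeeds, one option is to tally the relevant calls in the traces. The sketch below is one possible way to do that. It assumes plain strace text output with one call per line; the sample lines are hypothetical and are not taken from the attached traces. The helper names (`summarize`, `summarize_file`) are made up for this illustration.

```python
import gzip
import re
from collections import Counter

# Matches completed calls such as:
#   ioctl(11, BLKZEROOUT, [0, 2147483136]) = 0
#   fallocate(11, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 0, 4096) = -1 EOPNOTSUPP (...)
# Lines that do not fit (e.g. "<unfinished ...>") are skipped.
CALL_RE = re.compile(
    r"\b(fallocate|ioctl|pwrite64)\((.*)\)\s*=\s*(-?\d+)(?:\s+([A-Z]+))?"
)


def summarize(lines):
    """Count zeroing-related calls, keyed by (call, outcome)."""
    counts = Counter()
    for line in lines:
        match = CALL_RE.search(line)
        if match is None:
            continue
        name, args, ret, err = match.groups()
        if name == "ioctl":
            if "BLKZEROOUT" not in args:
                continue  # ignore unrelated ioctls
            name = "ioctl(BLKZEROOUT)"
        outcome = "ok" if int(ret) >= 0 else (err or "error")
        counts[(name, outcome)] += 1
    return counts


def summarize_file(path):
    """Summarize a gzip-compressed strace log."""
    with gzip.open(path, "rt", errors="replace") as handle:
        return summarize(handle)


# Hypothetical sample lines, for illustration only.
sample = [
    "fallocate(11, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 0, 2147483136)"
    " = -1 EOPNOTSUPP (Operation not supported)",
    "ioctl(11, BLKZEROOUT, [0, 2147483136]) = 0",
    "ioctl(11, BLKGETSIZE64, [299966445568]) = 0",
    'pwrite64(11, "\\0\\0\\0\\0"..., 2097152, 0) = 2097152',
]

summary = summarize(sample)
for (call, outcome), count in sorted(summary.items()):
    print(f"{call:20s} {outcome:12s} {count}")
```

Running `summarize_file("fc29_qemu-img.strace.gz")` and the same for the FC27 trace would show whether fallocate, BLKZEROOUT, or plain pwrite64 calls dominate in each case. Note that pwrite64 counts include ordinary data writes, so they only hint at the zeroed-buffer fallback.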

Fred

Attachment: fc27_qemu-img.strace.gz
Description: fc27_qemu-img.strace.gz

Attachment: fc29_qemu-img.strace.gz
Description: fc29_qemu-img.strace.gz

