From: Nir Soffer
Subject: Re: [PATCH] block: file-posix: Fail unmap with NO_FALLBACK on block device
Date: Mon, 15 Jun 2020 22:32:40 +0300

On Sat, Jun 13, 2020 at 8:08 PM Nir Soffer <nirsof@gmail.com> wrote:
>
> Punching holes on a block device uses blkdev_issue_zeroout() with
> BLKDEV_ZERO_NOFALLBACK, but there is no guarantee that this is fast
> enough for pre-zeroing an entire device.
>
> Zeroing a block device can be as slow as writing zeroes, or 100 times
> faster, depending on the storage. There is no way to tell if zeroing
> is fast enough. The kernel BLKDEV_ZERO_NOFALLBACK flag does not mean
> that the operation is fast; it just means that the kernel will not
> fall back to manual zeroing.
>
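For reference, the punch-hole path described above is plain fallocate() on
the block device node. A minimal userspace sketch of the equivalent call,
with error handling trimmed and the device path and length chosen only as
examples:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Example device node; any block device behaves the same way. */
    int fd = open("/dev/test/lv1", O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* On block devices the kernel maps this request to
     * blkdev_issue_zeroout() with BLKDEV_ZERO_NOFALLBACK;
     * here it zeroes the first 1 GiB. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  0, 1024LL * 1024 * 1024) < 0) {
        perror("fallocate");
        close(fd);
        return 1;
    }

    close(fd);
    return 0;
}
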
> Here is an example converting a 10g image with 8g of data to a block device:
>
> $ ./qemu-img info test.img
> image: test.img
> file format: raw
> virtual size: 10 GiB (10737418240 bytes)
> disk size: 8 GiB
>
> $ time ./qemu-img convert -f raw -O raw -t none -T none -W test.img /dev/test/lv1
>
> Before:
>
> real    1m20.483s
> user    0m0.490s
> sys     0m0.739s
>
> After:
>
> real    0m55.831s
> user    0m0.610s
> sys     0m0.956s

I did more testing with a real server and storage, and the results confirm
what I reported here from my VM-based environment with poor storage.


Testing this LUN:

# multipath -ll
3600a098038304437415d4b6a59684a52 dm-3 NETAPP,LUN C-Mode
size=5.0T features='3 queue_if_no_path pg_init_retries 50'
hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 18:0:0:0 sdb     8:16  active ready running
| `- 19:0:0:0 sdc     8:32  active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 20:0:0:0 sdd     8:48  active ready running
  `- 21:0:0:0 sde     8:64  active ready running


The destination is 100g logical volume on this LUN:

# qemu-img info test-lv
image: test-lv
file format: raw
virtual size: 100 GiB (107374182400 bytes)
disk size: 0 B


The source image is 100g image with 48g of data:

# qemu-img info fedora-31-100g-50p.raw
image: fedora-31-100g-50p.raw
file format: raw
virtual size: 100 GiB (107374182400 bytes)
disk size: 48.4 GiB


We can zero at about 2.3 GiB/s (100 GiB in ~44 seconds):

# time blkdiscard -z test-lv

real 0m43.902s
user 0m0.002s
sys 0m0.130s

(I should really test with fallocate instead of blkdiscard, but the results look
the same.)
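
For completeness, blkdiscard -z goes through the BLKZEROOUT ioctl rather
than fallocate(); both paths end up in the kernel's blkdev_issue_zeroout()
(with slightly different flags), which is why I expect the results to look
the same. A minimal sketch of the BLKZEROOUT path for a whole device:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s DEVICE\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Total device size in bytes. */
    uint64_t size;
    if (ioctl(fd, BLKGETSIZE64, &size) < 0) {
        perror("BLKGETSIZE64");
        return 1;
    }

    /* Zero the whole device, as blkdiscard -z does. */
    uint64_t range[2] = {0, size};
    if (ioctl(fd, BLKZEROOUT, range) < 0) {
        perror("BLKZEROOUT");
        return 1;
    }

    close(fd);
    return 0;
}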

# iostat -xdm dm-3 5

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
dm-3            20.80  301.40      0.98   2323.31     0.00     0.00 0.00   0.00   26.56  854.50 257.94    48.23  7893.41   0.73  23.58
dm-3            15.20  297.20      0.80   2321.67     0.00     0.00 0.00   0.00   26.43  836.06 248.72    53.80  7999.30   0.78  24.22


We can write at 445 MB/s:

# dd if=/dev/zero bs=2M count=51200 of=test-lv oflag=direct conv=fsync
107374182400 bytes (107 GB, 100 GiB) copied, 241.257 s, 445 MB/s

# iostat -xdm dm-3 5

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
dm-3             6.60 6910.00      0.39    431.85     0.00     0.00 0.00   0.00    2.48    2.70  15.19    60.73    64.00   0.14  98.84
dm-3            40.80 6682.60      1.59    417.61     0.00     0.00 0.00   0.00    1.71    2.73  14.92    40.00    63.99   0.15  97.60
dm-3             6.60 6887.40      0.39    430.46     0.00     0.00 0.00   0.00    2.15    2.66  14.92    60.73    64.00   0.14  98.22


Testing latest qemu-img:

# rpm -q qemu-img
qemu-img-4.2.0-22.module+el8.2.1+6758+cb8d64c2.x86_64

# time qemu-img convert -p -f raw -O raw -t none -W fedora-31-100g-50p.raw test-lv
    (100.00/100%)

real 2m2.337s
user 0m2.708s
sys 0m17.326s

# iostat -xdm dm-3 5

pre-zero phase:

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
dm-3            24.00  265.40      1.00   2123.20     0.00     0.00 0.00   0.00   36.81  543.52 144.99    42.48  8192.00   0.70  20.14
dm-3             9.60  283.60      0.59   2265.60     0.00     0.00 0.00   0.00   35.42  576.80 163.78    62.50  8180.44   0.70  20.58
dm-3            24.00  272.00      1.00   2176.00     0.00     0.00 0.00   0.00   22.89  512.40 139.77    42.48  8192.00   0.67  19.90

copy phase:

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
dm-3            27.20 10671.20      1.19    655.84     0.00     0.00 0.00   0.00    2.70   10.99 111.98    44.83    62.93   0.09  96.74
dm-3             6.40 11537.00      0.39    712.33     0.00     0.00 0.00   0.00    3.00   11.90 131.52    62.50    63.23   0.08  97.82
dm-3            27.20 12400.20      1.19    765.47     0.00     0.00 0.00   0.00    3.60   11.16 132.31    44.83    63.21   0.08  95.50
dm-3             9.60 11312.60      0.59    698.20     0.00     0.20 0.00   0.00    3.73   11.69 126.64    63.00    63.20   0.09  97.70


Testing latest qemu-img + this patch:

# rpm -q qemu-img
qemu-img-4.2.0-25.module+el8.2.1+6815+1c792dc8.nsoffer202006140516.x86_64

# time qemu-img convert -p -f raw -O raw -t none -W fedora-31-100g-50p.raw test-lv
    (100.00/100%)

real 1m42.083s
user 0m3.007s
sys 0m18.735s

# iostat -xdm dm-3 5

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
dm-3             6.60 7919.60      0.39   1136.67     0.00     0.00 0.00   0.00   14.70   15.32 117.43    60.73   146.97   0.10  77.84
dm-3            27.00 9065.00      1.19    571.38     0.00     0.20 0.00   0.00    2.52   14.64 128.21    45.13    64.54   0.11  97.46
dm-3             6.80 9467.40      0.40    814.75     0.00     0.00 0.00   0.00    2.74   12.15 110.25    60.82    88.12   0.10  90.46
dm-3            29.00 7713.20      1.32    996.48     0.00     0.40 0.00   0.01    5.40   14.48 107.98    46.60   132.29   0.11  83.76
dm-3            11.60 9661.60      0.70    703.54     0.00     0.40 0.00   0.00    2.26   11.22 103.56    61.72    74.57   0.10  97.98
dm-3            23.80 9639.20      0.99    696.82     0.00     0.00 0.00   0.00    1.98   11.54 106.49    42.80    74.03   0.10  93.68
dm-3            10.00 7184.60      0.60   1147.56     0.00     0.00 0.00   0.00   12.84   15.32 106.58    61.36   163.56   0.09  68.30
dm-3            35.00 6771.40      1.69   1293.37     0.00     0.00 0.00   0.00   17.44   18.06 119.48    49.58   195.59   0.10  66.46

With the patch there is no separate pre-zero phase: qemu-img starts writing
immediately, which accounts for the roughly 20 second improvement (1m42s
vs 2m2s).




>
> Signed-off-by: Nir Soffer <nsoffer@redhat.com>
> ---
>  block/file-posix.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 3ab8f5a0fa..cd2e409184 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1621,6 +1621,16 @@ static int handle_aiocb_write_zeroes_unmap(void *opaque)
>      /* First try to write zeros and unmap at the same time */
>
>  #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> +    /*
> +     * The block device fallocate() implementation in the kernel does set
> +     * BLKDEV_ZERO_NOFALLBACK, but it does not guarantee that the operation is
> +     * fast so we can't call this if we have to avoid slow fallbacks.
> +     */
> +    if (aiocb->aio_type & QEMU_AIO_BLKDEV &&
> +        aiocb->aio_type & QEMU_AIO_NO_FALLBACK) {
> +        return -ENOTSUP;
> +    }
> +
>      int ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>                             aiocb->aio_offset, aiocb->aio_nbytes);
>      if (ret != -ENOTSUP) {
> --
> 2.25.4
>
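
To make the caller side concrete: with BDRV_REQ_NO_FALLBACK, qemu-img
convert attempts a single fast pre-zero of the whole target, and when the
driver refuses with -ENOTSUP (which this patch makes file-posix do for
block devices) it skips pre-zeroing and zeroes only the ranges the copy
actually needs. Below is a hypothetical, self-contained sketch of both
sides; REQ_NO_FALLBACK, dev_write_zeroes() and try_prezero() are invented
names for illustration, not qemu's API:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

enum { REQ_NO_FALLBACK = 1 };   /* stands in for BDRV_REQ_NO_FALLBACK */

/*
 * Simplified write_zeroes with the patch's logic: on a block device,
 * refuse a no-fallback request instead of punching a hole, because the
 * kernel cannot promise that the zeroing is fast.
 */
static int dev_write_zeroes(int fd, off_t offset, off_t len, int flags)
{
    struct stat st;

    if (fstat(fd, &st) < 0) {
        return -errno;
    }

    if (S_ISBLK(st.st_mode) && (flags & REQ_NO_FALLBACK)) {
        return -ENOTSUP;
    }

    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  offset, len) < 0) {
        return -errno;
    }
    return 0;
}

/*
 * Caller-side pattern: one fast whole-device pre-zero attempt; if it is
 * refused, skip pre-zeroing and let the copy loop zero only where needed.
 */
static bool try_prezero(int fd, off_t size)
{
    return dev_write_zeroes(fd, 0, size, REQ_NO_FALLBACK) == 0;
}

The point is that -ENOTSUP here is not an error: it is the driver telling
the caller that the only way to zero would be too slow, so the caller can
make a better decision globally.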



