qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCHv4] block: optimize zero writes with bdrv_write_z


From: Peter Lieven
Subject: Re: [Qemu-devel] [PATCHv4] block: optimize zero writes with bdrv_write_zeroes
Date: Thu, 15 May 2014 07:16:26 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.4.0

Am 14.05.2014 13:41, schrieb Kevin Wolf:
> Am 08.05.2014 um 18:22 hat Peter Lieven geschrieben:
>> this patch tries to optimize zero write requests
>> by automatically using bdrv_write_zeroes if it is
>> supported by the format.
>>
>> This significantly speeds up file system initialization and
>> should speed zero write test used to test backend storage
>> performance.
>>
>> I ran the following 2 tests on my internal SSD with a
>> 50G QCOW2 container and on an attached iSCSI storage.
>>
>> a) mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/vdX
>>
>> QCOW2         [off]     [on]     [unmap]
>> -----
>> runtime:       14secs    1.1secs  1.1secs
>> filesize:      937M      18M      18M
>>
>> iSCSI         [off]     [on]     [unmap]
>> ----
>> runtime:       9.3s      0.9s     0.9s
>>
>> b) dd if=/dev/zero of=/dev/vdX bs=1M oflag=direct
>>
>> QCOW2         [off]     [on]     [unmap]
>> -----
>> runtime:       246secs   18secs   18secs
>> filesize:      51G       192K     192K
>> throughput:    203M/s    2.3G/s   2.3G/s
>>
>> iSCSI*        [off]     [on]     [unmap]
>> ----
>> runtime:       8mins     45secs   33secs
>> throughput:    106M/s    1.2G/s   1.6G/s
>> allocated:     100%      100%     0%
>>
>> * The storage was connected via an 1Gbit interface.
>>   It seems to internally handle writing zeroes
>>   via WRITESAME16 very fast.
>>
>> Signed-off-by: Peter Lieven <address@hidden>
>> ---
>> v3->v4: - use QAPI generated enum and lookup table [Kevin]
>>         - added more details about the options in the comments
>>           of the qapi-schema [Eric]
>>         - changed the type of detect_zeroes from str to
>>           BlockdevDetectZeroesOptions. I left the name
>>           as is because it is consistent with e.g.
>>           BlockdevDiscardOptions or BlockdevAioOptions [Eric]
>>         - changed the parse function in blockdev_init to
>>           be generic usable for other enum parameters
>>         
>> v2->v3: - moved parameter parsing to blockdev_init
>>         - added per device detect_zeroes status to 
>>           hmp (info block -v) and qmp (query-block) [Eric]
>>         - added support to enable detect-zeroes also
>>           for hot added devices [Eric]
>>         - added missing entry to qemu_common_drive_opts
>>         - fixed description of qemu_iovec_is_zero [Fam]
>>
>> v1->v2: - added tests to commit message (Markus)
>> RFCv2->v1: - fixed paramter parsing strncmp -> strcmp (Eric)
>>            - fixed typo (choosen->chosen) (Eric)
>>            - added second example to commit msg
>>
>> RFCv1->RFCv2: - add detect-zeroes=off|on|unmap knob to drive cmdline 
>> parameter
>>               - call zero detection only for format (bs->file != NULL)
>>
>>  block.c                   |   11 ++++++++++
>>  block/qapi.c              |    1 +
>>  blockdev.c                |   34 +++++++++++++++++++++++++++++
>>  hmp.c                     |    5 +++++
>>  include/block/block_int.h |    1 +
>>  include/qemu-common.h     |    1 +
>>  qapi-schema.json          |   52 
>> ++++++++++++++++++++++++++++++++-------------
>>  qemu-options.hx           |    6 ++++++
>>  qmp-commands.hx           |    3 +++
>>  util/iov.c                |   21 ++++++++++++++++++
>>  10 files changed, 120 insertions(+), 15 deletions(-)
>>
>> diff --git a/block.c b/block.c
>> index b749d31..aea4c77 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -3244,6 +3244,17 @@ static int coroutine_fn 
>> bdrv_aligned_pwritev(BlockDriverState *bs,
>>  
>>      ret = notifier_with_return_list_notify(&bs->before_write_notifiers, 
>> req);
>>  
>> +    if (!ret && bs->detect_zeroes != BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF &&
>> +        !(flags & BDRV_REQ_ZERO_WRITE) && bs->file &&
>> +        drv->bdrv_co_write_zeroes && qemu_iovec_is_zero(qiov)) {
> Pretty long condition. :-)
>
> Looks like most is obviously needed, but I wonder what the bs->file part
> is good for? That looks rather arbitrary.

What I wanted to achieve is that this condition is only true if we handle
the format (e.g. raw, qcow2, vmdk etc.). If e.g. qcow2 then sends a
zero write this should always end on the disk and should not be optimizable.


>
>> +        flags |= BDRV_REQ_ZERO_WRITE;
>> +        /* if the device was not opened with discard=on the below flag
>> +         * is immediately cleared again in bdrv_co_do_write_zeroes */
> Is it? I only see it being cleared in bdrv_co_write_zeroes(), but that's
> not a function that seems to be called from here.

You are right. Question, do we want to support detect_zeroes = unmap
if discard = ignore? If yes, I have to update the docs. Otherwise
I have to check for BDRV_O_DISCARD before setting BDRV_REQ_MAY_UNMAP.

>
>> +        if (bs->detect_zeroes == BLOCKDEV_DETECT_ZEROES_OPTIONS_UNMAP) {
>> +            flags |= BDRV_REQ_MAY_UNMAP;
>> +        }
>> +    }
>> +
>>      if (ret < 0) {
>>          /* Do nothing, write notifier decided to fail this request */
>>      } else if (flags & BDRV_REQ_ZERO_WRITE) {
>> diff --git a/block/qapi.c b/block/qapi.c
>> index af11445..75f44f1 100644
>> --- a/block/qapi.c
>> +++ b/block/qapi.c
>> @@ -50,6 +50,7 @@ BlockDeviceInfo *bdrv_block_device_info(BlockDriverState 
>> *bs)
>>      }
>>  
>>      info->backing_file_depth = bdrv_get_backing_file_depth(bs);
>> +    info->detect_zeroes = bs->detect_zeroes;
>>  
>>      if (bs->io_limits_enabled) {
>>          ThrottleConfig cfg;
>> diff --git a/blockdev.c b/blockdev.c
>> index 7810e9f..955bd49 100644
>> --- a/blockdev.c
>> +++ b/blockdev.c
>> @@ -288,6 +288,23 @@ static int parse_block_error_action(const char *buf, 
>> bool is_read, Error **errp)
>>      }
>>  }
>>  
>> +static int parse_enum_option(const char *lookup[], const char *buf, int max,
>> +                             int def, Error **errp)
>> +{
>> +    int i;
>> +    if (!buf) {
>> +        return def;
>> +    }
>> +    for (i = 0; i < max; i++) {
>> +        if (!strcmp(buf, lookup[i])) {
>> +            return i;
>> +        }
>> +    }
>> +    error_setg(errp, "invalid parameter value: %s",
>> +               buf);
>> +    return def;
>> +}
>> +
>>  static bool check_throttle_config(ThrottleConfig *cfg, Error **errp)
>>  {
>>      if (throttle_conflicting(cfg)) {
> This hunk could have been a patch of its own (not a reason for a respin,
> but if you need to respin to address one of the other comments, please
> split the patch).

Yes, will do. Erik also already suggested this.

>
>> @@ -324,6 +341,7 @@ static DriveInfo *blockdev_init(const char *file, QDict 
>> *bs_opts,
>>      QemuOpts *opts;
>>      const char *id;
>>      bool has_driver_specific_opts;
>> +    BlockdevDetectZeroesOptions detect_zeroes;
>>      BlockDriver *drv = NULL;
>>  
>>      /* Check common options by copying from bs_opts to opts, all other 
>> options
>> @@ -452,6 +470,17 @@ static DriveInfo *blockdev_init(const char *file, QDict 
>> *bs_opts,
>>          }
>>      }
>>  
>> +    detect_zeroes =
>> +        parse_enum_option(BlockdevDetectZeroesOptions_lookup,
>> +                          qemu_opt_get(opts, "detect-zeroes"),
>> +                          BLOCKDEV_DETECT_ZEROES_OPTIONS_MAX,
>> +                          BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF,
>> +                          &error);
>> +    if (error) {
>> +        error_propagate(errp, error);
>> +        goto early_err;
>> +    }
>> +
>>      /* init */
>>      dinfo = g_malloc0(sizeof(*dinfo));
>>      dinfo->id = g_strdup(qemu_opts_id(opts));
>> @@ -462,6 +491,7 @@ static DriveInfo *blockdev_init(const char *file, QDict 
>> *bs_opts,
>>      }
>>      dinfo->bdrv->open_flags = snapshot ? BDRV_O_SNAPSHOT : 0;
>>      dinfo->bdrv->read_only = ro;
>> +    dinfo->bdrv->detect_zeroes = detect_zeroes;
>>      dinfo->refcount = 1;
>>      if (serial != NULL) {
>>          dinfo->serial = g_strdup(serial);
>> @@ -2455,6 +2485,10 @@ QemuOptsList qemu_common_drive_opts = {
>>              .name = "copy-on-read",
>>              .type = QEMU_OPT_BOOL,
>>              .help = "copy read data from backing file into image file",
>> +        },{
>> +            .name = "detect-zeroes",
>> +            .type = QEMU_OPT_STRING,
>> +            .help = "try to optimize zero writes",
>>          },
>>          { /* end of list */ }
>>      },
>> diff --git a/hmp.c b/hmp.c
>> index ca869ba..a541e7d 100644
>> --- a/hmp.c
>> +++ b/hmp.c
>> @@ -336,6 +336,11 @@ void hmp_info_block(Monitor *mon, const QDict *qdict)
>>                             info->value->inserted->backing_file_depth);
>>          }
>>  
>> +        if (verbose) {
>> +            monitor_printf(mon, "    Detect zeroes:    %s\n",
>> +                           
>> BlockdevDetectZeroesOptions_lookup[info->value->inserted->detect_zeroes]);
>> +        }
> Perhaps we should always display the information if zero detection is
> enabled. This would be similar to things like I/O throttling limits,
> which are only printed if throttling is enabled.

Ok

>
>>          if (info->value->inserted->bps
>>              || info->value->inserted->bps_rd
>>              || info->value->inserted->bps_wr
>> diff --git a/include/block/block_int.h b/include/block/block_int.h
>> index 9ffcb69..b8cc926 100644
>> --- a/include/block/block_int.h
>> +++ b/include/block/block_int.h
>> @@ -364,6 +364,7 @@ struct BlockDriverState {
>>      BlockJob *job;
>>  
>>      QDict *options;
>> +    BlockdevDetectZeroesOptions detect_zeroes;
>>  };
>>  
>>  int get_tmp_filename(char *filename, int size);
>> diff --git a/include/qemu-common.h b/include/qemu-common.h
>> index a998e8d..8e3d6eb 100644
>> --- a/include/qemu-common.h
>> +++ b/include/qemu-common.h
>> @@ -330,6 +330,7 @@ void qemu_iovec_concat(QEMUIOVector *dst,
>>  void qemu_iovec_concat_iov(QEMUIOVector *dst,
>>                             struct iovec *src_iov, unsigned int src_cnt,
>>                             size_t soffset, size_t sbytes);
>> +bool qemu_iovec_is_zero(QEMUIOVector *qiov);
>>  void qemu_iovec_destroy(QEMUIOVector *qiov);
>>  void qemu_iovec_reset(QEMUIOVector *qiov);
>>  size_t qemu_iovec_to_buf(QEMUIOVector *qiov, size_t offset,
>> diff --git a/qapi-schema.json b/qapi-schema.json
>> index 0b00427..d1620b1 100644
>> --- a/qapi-schema.json
>> +++ b/qapi-schema.json
>> @@ -937,6 +937,8 @@
>>  # @encryption_key_missing: true if the backing device is encrypted but an
>>  #                          valid encryption key is missing
>>  #
>> +# @detect_zeroes: detect and optimize zero writes (Since 2.1)
>> +#
>>  # @bps: total throughput limit in bytes per second is specified
>>  #
>>  # @bps_rd: read throughput limit in bytes per second is specified
>> @@ -972,6 +974,7 @@
>>    'data': { 'file': 'str', '*node-name': 'str', 'ro': 'bool', 'drv': 'str',
>>              '*backing_file': 'str', 'backing_file_depth': 'int',
>>              'encrypted': 'bool', 'encryption_key_missing': 'bool',
>> +            'detect_zeroes': 'BlockdevDetectZeroesOptions',
>>              'bps': 'int', 'bps_rd': 'int', 'bps_wr': 'int',
>>              'iops': 'int', 'iops_rd': 'int', 'iops_wr': 'int',
>>              'image': 'ImageInfo',
>> @@ -4250,6 +4253,22 @@
>>    'data': [ 'ignore', 'unmap' ] }
>>  
>>  ##
>> +# @BlockdevDetectZeroesOptions
>> +#
>> +# Describes the operation mode for the automatic conversion of plain
>> +# zero writes by the OS to driver specific optimized zero write commands.
>> +#
>> +# @off:      Disabled (default)
>> +# @on:       Enabled
>> +# @unmap:    Enabled and even try to unmap blocks if possible. This requires
>> +#            also that @BlockdevDiscardOptions is set to unmap for this 
>> device.
>> +#
>> +# Since: 2.1
>> +##
>> +{ 'enum': 'BlockdevDetectZeroesOptions',
>> +  'data': [ 'off', 'on', 'unmap' ] }
>> +
>> +##
>>  # @BlockdevAioOptions
>>  #
>>  # Selects the AIO backend to handle I/O requests
>> @@ -4301,20 +4320,22 @@
>>  # Options that are available for all block devices, independent of the block
>>  # driver.
>>  #
>> -# @driver:      block driver name
>> -# @id:          #optional id by which the new block device can be referred 
>> to.
>> -#               This is a required option on the top level of blockdev-add, 
>> and
>> -#               currently not allowed on any other level.
>> -# @node-name:   #optional the name of a block driver state node (Since 2.0)
>> -# @discard:     #optional discard-related options (default: ignore)
>> -# @cache:       #optional cache-related options
>> -# @aio:         #optional AIO backend (default: threads)
>> -# @rerror:      #optional how to handle read errors on the device
>> -#               (default: report)
>> -# @werror:      #optional how to handle write errors on the device
>> -#               (default: enospc)
>> -# @read-only:   #optional whether the block device should be read-only
>> -#               (default: false)
>> +# @driver:        block driver name
>> +# @id:            #optional id by which the new block device can be 
>> referred to.
>> +#                 This is a required option on the top level of 
>> blockdev-add, and
>> +#                 currently not allowed on any other level.
>> +# @node-name:     #optional the name of a block driver state node (Since 
>> 2.0)
>> +# @discard:       #optional discard-related options (default: ignore)
>> +# @cache:         #optional cache-related options
>> +# @aio:           #optional AIO backend (default: threads)
>> +# @rerror:        #optional how to handle read errors on the device
>> +#                 (default: report)
>> +# @werror:        #optional how to handle write errors on the device
>> +#                 (default: enospc)
>> +# @read-only:     #optional whether the block device should be read-only
>> +#                 (default: false)
>> +# @detect-zeroes: #optional detect and optimize zero writes (Since 2.1)
>> +#                 (default: off)
>>  #
>>  # Since: 1.7
>>  ##
>> @@ -4327,7 +4348,8 @@
>>              '*aio': 'BlockdevAioOptions',
>>              '*rerror': 'BlockdevOnError',
>>              '*werror': 'BlockdevOnError',
>> -            '*read-only': 'bool' } }
>> +            '*read-only': 'bool',
>> +            '*detect-zeroes': 'BlockdevDetectZeroesOptions' } }
>>  
>>  ##
>>  # @BlockdevOptionsFile
>> diff --git a/qemu-options.hx b/qemu-options.hx
>> index 781af14..5ee94ea 100644
>> --- a/qemu-options.hx
>> +++ b/qemu-options.hx
>> @@ -414,6 +414,7 @@ DEF("drive", HAS_ARG, QEMU_OPTION_drive,
>>      "       [,serial=s][,addr=A][,rerror=ignore|stop|report]\n"
>>      "       
>> [,werror=ignore|stop|report|enospc][,id=name][,aio=threads|native]\n"
>>      "       [,readonly=on|off][,copy-on-read=on|off]\n"
>> +    "       [,detect-zeroes=on|off|unmap]\n"
>>      "       [[,bps=b]|[[,bps_rd=r][,bps_wr=w]]]\n"
>>      "       [[,iops=i]|[[,iops_rd=r][,iops_wr=w]]]\n"
>>      "       [[,bps_max=bm]|[[,bps_rd_max=rm][,bps_wr_max=wm]]]\n"
>> @@ -475,6 +476,11 @@ Open drive @option{file} as read-only. Guest write 
>> attempts will fail.
>>  @item address@hidden
>>  @var{copy-on-read} is "on" or "off" and enables whether to copy read backing
>>  file sectors into the image file.
>> address@hidden address@hidden
>> address@hidden is "off", "on" or "unmap" and enables the automatic
>> +conversion of plain zero writes by the OS to driver specific optimized
>> +zero write commands. If "unmap" is chosen and @var{discard} is "on"
>> +a zero write may even be converted to an UNMAP operation.
>>  @end table
>>  
>>  By default, the @option{cache=writeback} mode is used. It will report data
>> diff --git a/qmp-commands.hx b/qmp-commands.hx
>> index ed3ab92..a535955 100644
>> --- a/qmp-commands.hx
>> +++ b/qmp-commands.hx
>> @@ -2032,6 +2032,8 @@ Each json-object contain the following:
>>           - "iops_rd_max":  read I/O operations max (json-int)
>>           - "iops_wr_max":  write I/O operations max (json-int)
>>           - "iops_size": I/O size when limiting by iops (json-int)
>> +         - "detect_zeroes": detect and optimize zero writing (json-string)
>> +             - Possible values: "off", "on", "unmap"
>>           - "image": the detail of the image, it is a json-object containing
>>              the following:
>>               - "filename": image file name (json-string)
>> @@ -2108,6 +2110,7 @@ Example:
>>                 "iops_rd_max": 0,
>>                 "iops_wr_max": 0,
>>                 "iops_size": 0,
>> +               "detect_zeroes": "on",
>>                 "image":{
>>                    "filename":"disks/test.qcow2",
>>                    "format":"qcow2",
>> diff --git a/util/iov.c b/util/iov.c
>> index 6569b5a..f8c49a1 100644
>> --- a/util/iov.c
>> +++ b/util/iov.c
>> @@ -335,6 +335,27 @@ void qemu_iovec_concat(QEMUIOVector *dst,
>>      qemu_iovec_concat_iov(dst, src->iov, src->niov, soffset, sbytes);
>>  }
>>  
>> +/*
>> + *  check if the contents of the iovecs are all zero
>> + */
>> +bool qemu_iovec_is_zero(QEMUIOVector *qiov)
>> +{
>> +    int i;
>> +    for (i = 0; i < qiov->niov; i++) {
>> +        size_t offs = qiov->iov[i].iov_len & ~(4 * sizeof(long) - 1);
> I think it gets a bit more readable like this:
>
> size_t offs = QEMU_ALIGN_DOWN(qiov->iov[i].iov_len, 4 * sizeof(long));

Yes, indeed.

>
>> +        uint8_t *ptr = qiov->iov[i].iov_base;
>> +        if (offs && !buffer_is_zero(qiov->iov[i].iov_base, offs)) {
>> +            return false;
>> +        }
>> +        for (; offs < qiov->iov[i].iov_len; offs++) {
>> +            if (ptr[offs]) {
>> +                return false;
>> +            }
>> +        }
>> +    }
>> +    return true;
>> +}
>> +
> This function could be a separate patch as well.

Ok, will split as well.

Thank you for your and Erik for his comments,
Peter



reply via email to

[Prev in Thread] Current Thread [Next in Thread]