[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] [PATCHv4] block: optimize zero writes with bdrv_write_z
From: |
Peter Lieven |
Subject: |
Re: [Qemu-devel] [PATCHv4] block: optimize zero writes with bdrv_write_zeroes |
Date: |
Thu, 15 May 2014 07:16:26 +0200 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.4.0 |
Am 14.05.2014 13:41, schrieb Kevin Wolf:
> Am 08.05.2014 um 18:22 hat Peter Lieven geschrieben:
>> this patch tries to optimize zero write requests
>> by automatically using bdrv_write_zeroes if it is
>> supported by the format.
>>
>> This significantly speeds up file system initialization and
>> should speed zero write test used to test backend storage
>> performance.
>>
>> I ran the following 2 tests on my internal SSD with a
>> 50G QCOW2 container and on an attached iSCSI storage.
>>
>> a) mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/vdX
>>
>> QCOW2 [off] [on] [unmap]
>> -----
>> runtime: 14secs 1.1secs 1.1secs
>> filesize: 937M 18M 18M
>>
>> iSCSI [off] [on] [unmap]
>> ----
>> runtime: 9.3s 0.9s 0.9s
>>
>> b) dd if=/dev/zero of=/dev/vdX bs=1M oflag=direct
>>
>> QCOW2 [off] [on] [unmap]
>> -----
>> runtime: 246secs 18secs 18secs
>> filesize: 51G 192K 192K
>> throughput: 203M/s 2.3G/s 2.3G/s
>>
>> iSCSI* [off] [on] [unmap]
>> ----
>> runtime: 8mins 45secs 33secs
>> throughput: 106M/s 1.2G/s 1.6G/s
>> allocated: 100% 100% 0%
>>
>> * The storage was connected via an 1Gbit interface.
>> It seems to internally handle writing zeroes
>> via WRITESAME16 very fast.
>>
>> Signed-off-by: Peter Lieven <address@hidden>
>> ---
>> v3->v4: - use QAPI generated enum and lookup table [Kevin]
>> - added more details about the options in the comments
>> of the qapi-schema [Eric]
>> - changed the type of detect_zeroes from str to
>> BlockdevDetectZeroesOptions. I left the name
>> as is because it is consistent with e.g.
>> BlockdevDiscardOptions or BlockdevAioOptions [Eric]
>> - changed the parse function in blockdev_init to
>> be generic usable for other enum parameters
>>
>> v2->v3: - moved parameter parsing to blockdev_init
>> - added per device detect_zeroes status to
>> hmp (info block -v) and qmp (query-block) [Eric]
>> - added support to enable detect-zeroes also
>> for hot added devices [Eric]
>> - added missing entry to qemu_common_drive_opts
>> - fixed description of qemu_iovec_is_zero [Fam]
>>
>> v1->v2: - added tests to commit message (Markus)
>> RFCv2->v1: - fixed paramter parsing strncmp -> strcmp (Eric)
>> - fixed typo (choosen->chosen) (Eric)
>> - added second example to commit msg
>>
>> RFCv1->RFCv2: - add detect-zeroes=off|on|unmap knob to drive cmdline
>> parameter
>> - call zero detection only for format (bs->file != NULL)
>>
>> block.c | 11 ++++++++++
>> block/qapi.c | 1 +
>> blockdev.c | 34 +++++++++++++++++++++++++++++
>> hmp.c | 5 +++++
>> include/block/block_int.h | 1 +
>> include/qemu-common.h | 1 +
>> qapi-schema.json | 52
>> ++++++++++++++++++++++++++++++++-------------
>> qemu-options.hx | 6 ++++++
>> qmp-commands.hx | 3 +++
>> util/iov.c | 21 ++++++++++++++++++
>> 10 files changed, 120 insertions(+), 15 deletions(-)
>>
>> diff --git a/block.c b/block.c
>> index b749d31..aea4c77 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -3244,6 +3244,17 @@ static int coroutine_fn
>> bdrv_aligned_pwritev(BlockDriverState *bs,
>>
>> ret = notifier_with_return_list_notify(&bs->before_write_notifiers,
>> req);
>>
>> + if (!ret && bs->detect_zeroes != BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF &&
>> + !(flags & BDRV_REQ_ZERO_WRITE) && bs->file &&
>> + drv->bdrv_co_write_zeroes && qemu_iovec_is_zero(qiov)) {
> Pretty long condition. :-)
>
> Looks like most is obviously needed, but I wonder what the bs->file part
> is good for? That looks rather arbitrary.
What I wanted to achieve is that this condition is only true if we handle
the format (e.g. raw, qcow2, vmdk etc.). If e.g. qcow2 then sends a
zero write this should always end on the disk and should not be optimizable.
>
>> + flags |= BDRV_REQ_ZERO_WRITE;
>> + /* if the device was not opened with discard=on the below flag
>> + * is immediately cleared again in bdrv_co_do_write_zeroes */
> Is it? I only see it being cleared in bdrv_co_write_zeroes(), but that's
> not a function that seems to be called from here.
You are right. Question, do we want to support detect_zeroes = unmap
if discard = ignore? If yes, I have to update the docs. Otherwise
I have to check for BDRV_O_DISCARD before setting BDRV_REQ_MAY_UNMAP.
>
>> + if (bs->detect_zeroes == BLOCKDEV_DETECT_ZEROES_OPTIONS_UNMAP) {
>> + flags |= BDRV_REQ_MAY_UNMAP;
>> + }
>> + }
>> +
>> if (ret < 0) {
>> /* Do nothing, write notifier decided to fail this request */
>> } else if (flags & BDRV_REQ_ZERO_WRITE) {
>> diff --git a/block/qapi.c b/block/qapi.c
>> index af11445..75f44f1 100644
>> --- a/block/qapi.c
>> +++ b/block/qapi.c
>> @@ -50,6 +50,7 @@ BlockDeviceInfo *bdrv_block_device_info(BlockDriverState
>> *bs)
>> }
>>
>> info->backing_file_depth = bdrv_get_backing_file_depth(bs);
>> + info->detect_zeroes = bs->detect_zeroes;
>>
>> if (bs->io_limits_enabled) {
>> ThrottleConfig cfg;
>> diff --git a/blockdev.c b/blockdev.c
>> index 7810e9f..955bd49 100644
>> --- a/blockdev.c
>> +++ b/blockdev.c
>> @@ -288,6 +288,23 @@ static int parse_block_error_action(const char *buf,
>> bool is_read, Error **errp)
>> }
>> }
>>
>> +static int parse_enum_option(const char *lookup[], const char *buf, int max,
>> + int def, Error **errp)
>> +{
>> + int i;
>> + if (!buf) {
>> + return def;
>> + }
>> + for (i = 0; i < max; i++) {
>> + if (!strcmp(buf, lookup[i])) {
>> + return i;
>> + }
>> + }
>> + error_setg(errp, "invalid parameter value: %s",
>> + buf);
>> + return def;
>> +}
>> +
>> static bool check_throttle_config(ThrottleConfig *cfg, Error **errp)
>> {
>> if (throttle_conflicting(cfg)) {
> This hunk could have been a patch of its own (not a reason for a respin,
> but if you need to respin to address one of the other comments, please
> split the patch).
Yes, will do. Erik also already suggested this.
>
>> @@ -324,6 +341,7 @@ static DriveInfo *blockdev_init(const char *file, QDict
>> *bs_opts,
>> QemuOpts *opts;
>> const char *id;
>> bool has_driver_specific_opts;
>> + BlockdevDetectZeroesOptions detect_zeroes;
>> BlockDriver *drv = NULL;
>>
>> /* Check common options by copying from bs_opts to opts, all other
>> options
>> @@ -452,6 +470,17 @@ static DriveInfo *blockdev_init(const char *file, QDict
>> *bs_opts,
>> }
>> }
>>
>> + detect_zeroes =
>> + parse_enum_option(BlockdevDetectZeroesOptions_lookup,
>> + qemu_opt_get(opts, "detect-zeroes"),
>> + BLOCKDEV_DETECT_ZEROES_OPTIONS_MAX,
>> + BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF,
>> + &error);
>> + if (error) {
>> + error_propagate(errp, error);
>> + goto early_err;
>> + }
>> +
>> /* init */
>> dinfo = g_malloc0(sizeof(*dinfo));
>> dinfo->id = g_strdup(qemu_opts_id(opts));
>> @@ -462,6 +491,7 @@ static DriveInfo *blockdev_init(const char *file, QDict
>> *bs_opts,
>> }
>> dinfo->bdrv->open_flags = snapshot ? BDRV_O_SNAPSHOT : 0;
>> dinfo->bdrv->read_only = ro;
>> + dinfo->bdrv->detect_zeroes = detect_zeroes;
>> dinfo->refcount = 1;
>> if (serial != NULL) {
>> dinfo->serial = g_strdup(serial);
>> @@ -2455,6 +2485,10 @@ QemuOptsList qemu_common_drive_opts = {
>> .name = "copy-on-read",
>> .type = QEMU_OPT_BOOL,
>> .help = "copy read data from backing file into image file",
>> + },{
>> + .name = "detect-zeroes",
>> + .type = QEMU_OPT_STRING,
>> + .help = "try to optimize zero writes",
>> },
>> { /* end of list */ }
>> },
>> diff --git a/hmp.c b/hmp.c
>> index ca869ba..a541e7d 100644
>> --- a/hmp.c
>> +++ b/hmp.c
>> @@ -336,6 +336,11 @@ void hmp_info_block(Monitor *mon, const QDict *qdict)
>> info->value->inserted->backing_file_depth);
>> }
>>
>> + if (verbose) {
>> + monitor_printf(mon, " Detect zeroes: %s\n",
>> +
>> BlockdevDetectZeroesOptions_lookup[info->value->inserted->detect_zeroes]);
>> + }
> Perhaps we should always display the information if zero detection is
> enabled. This would be similar to things like I/O throttling limits,
> which are only printed if throttling is enabled.
Ok
>
>> if (info->value->inserted->bps
>> || info->value->inserted->bps_rd
>> || info->value->inserted->bps_wr
>> diff --git a/include/block/block_int.h b/include/block/block_int.h
>> index 9ffcb69..b8cc926 100644
>> --- a/include/block/block_int.h
>> +++ b/include/block/block_int.h
>> @@ -364,6 +364,7 @@ struct BlockDriverState {
>> BlockJob *job;
>>
>> QDict *options;
>> + BlockdevDetectZeroesOptions detect_zeroes;
>> };
>>
>> int get_tmp_filename(char *filename, int size);
>> diff --git a/include/qemu-common.h b/include/qemu-common.h
>> index a998e8d..8e3d6eb 100644
>> --- a/include/qemu-common.h
>> +++ b/include/qemu-common.h
>> @@ -330,6 +330,7 @@ void qemu_iovec_concat(QEMUIOVector *dst,
>> void qemu_iovec_concat_iov(QEMUIOVector *dst,
>> struct iovec *src_iov, unsigned int src_cnt,
>> size_t soffset, size_t sbytes);
>> +bool qemu_iovec_is_zero(QEMUIOVector *qiov);
>> void qemu_iovec_destroy(QEMUIOVector *qiov);
>> void qemu_iovec_reset(QEMUIOVector *qiov);
>> size_t qemu_iovec_to_buf(QEMUIOVector *qiov, size_t offset,
>> diff --git a/qapi-schema.json b/qapi-schema.json
>> index 0b00427..d1620b1 100644
>> --- a/qapi-schema.json
>> +++ b/qapi-schema.json
>> @@ -937,6 +937,8 @@
>> # @encryption_key_missing: true if the backing device is encrypted but an
>> # valid encryption key is missing
>> #
>> +# @detect_zeroes: detect and optimize zero writes (Since 2.1)
>> +#
>> # @bps: total throughput limit in bytes per second is specified
>> #
>> # @bps_rd: read throughput limit in bytes per second is specified
>> @@ -972,6 +974,7 @@
>> 'data': { 'file': 'str', '*node-name': 'str', 'ro': 'bool', 'drv': 'str',
>> '*backing_file': 'str', 'backing_file_depth': 'int',
>> 'encrypted': 'bool', 'encryption_key_missing': 'bool',
>> + 'detect_zeroes': 'BlockdevDetectZeroesOptions',
>> 'bps': 'int', 'bps_rd': 'int', 'bps_wr': 'int',
>> 'iops': 'int', 'iops_rd': 'int', 'iops_wr': 'int',
>> 'image': 'ImageInfo',
>> @@ -4250,6 +4253,22 @@
>> 'data': [ 'ignore', 'unmap' ] }
>>
>> ##
>> +# @BlockdevDetectZeroesOptions
>> +#
>> +# Describes the operation mode for the automatic conversion of plain
>> +# zero writes by the OS to driver specific optimized zero write commands.
>> +#
>> +# @off: Disabled (default)
>> +# @on: Enabled
>> +# @unmap: Enabled and even try to unmap blocks if possible. This requires
>> +# also that @BlockdevDiscardOptions is set to unmap for this
>> device.
>> +#
>> +# Since: 2.1
>> +##
>> +{ 'enum': 'BlockdevDetectZeroesOptions',
>> + 'data': [ 'off', 'on', 'unmap' ] }
>> +
>> +##
>> # @BlockdevAioOptions
>> #
>> # Selects the AIO backend to handle I/O requests
>> @@ -4301,20 +4320,22 @@
>> # Options that are available for all block devices, independent of the block
>> # driver.
>> #
>> -# @driver: block driver name
>> -# @id: #optional id by which the new block device can be referred
>> to.
>> -# This is a required option on the top level of blockdev-add,
>> and
>> -# currently not allowed on any other level.
>> -# @node-name: #optional the name of a block driver state node (Since 2.0)
>> -# @discard: #optional discard-related options (default: ignore)
>> -# @cache: #optional cache-related options
>> -# @aio: #optional AIO backend (default: threads)
>> -# @rerror: #optional how to handle read errors on the device
>> -# (default: report)
>> -# @werror: #optional how to handle write errors on the device
>> -# (default: enospc)
>> -# @read-only: #optional whether the block device should be read-only
>> -# (default: false)
>> +# @driver: block driver name
>> +# @id: #optional id by which the new block device can be
>> referred to.
>> +# This is a required option on the top level of
>> blockdev-add, and
>> +# currently not allowed on any other level.
>> +# @node-name: #optional the name of a block driver state node (Since
>> 2.0)
>> +# @discard: #optional discard-related options (default: ignore)
>> +# @cache: #optional cache-related options
>> +# @aio: #optional AIO backend (default: threads)
>> +# @rerror: #optional how to handle read errors on the device
>> +# (default: report)
>> +# @werror: #optional how to handle write errors on the device
>> +# (default: enospc)
>> +# @read-only: #optional whether the block device should be read-only
>> +# (default: false)
>> +# @detect-zeroes: #optional detect and optimize zero writes (Since 2.1)
>> +# (default: off)
>> #
>> # Since: 1.7
>> ##
>> @@ -4327,7 +4348,8 @@
>> '*aio': 'BlockdevAioOptions',
>> '*rerror': 'BlockdevOnError',
>> '*werror': 'BlockdevOnError',
>> - '*read-only': 'bool' } }
>> + '*read-only': 'bool',
>> + '*detect-zeroes': 'BlockdevDetectZeroesOptions' } }
>>
>> ##
>> # @BlockdevOptionsFile
>> diff --git a/qemu-options.hx b/qemu-options.hx
>> index 781af14..5ee94ea 100644
>> --- a/qemu-options.hx
>> +++ b/qemu-options.hx
>> @@ -414,6 +414,7 @@ DEF("drive", HAS_ARG, QEMU_OPTION_drive,
>> " [,serial=s][,addr=A][,rerror=ignore|stop|report]\n"
>> "
>> [,werror=ignore|stop|report|enospc][,id=name][,aio=threads|native]\n"
>> " [,readonly=on|off][,copy-on-read=on|off]\n"
>> + " [,detect-zeroes=on|off|unmap]\n"
>> " [[,bps=b]|[[,bps_rd=r][,bps_wr=w]]]\n"
>> " [[,iops=i]|[[,iops_rd=r][,iops_wr=w]]]\n"
>> " [[,bps_max=bm]|[[,bps_rd_max=rm][,bps_wr_max=wm]]]\n"
>> @@ -475,6 +476,11 @@ Open drive @option{file} as read-only. Guest write
>> attempts will fail.
>> @item address@hidden
>> @var{copy-on-read} is "on" or "off" and enables whether to copy read backing
>> file sectors into the image file.
>> address@hidden address@hidden
>> address@hidden is "off", "on" or "unmap" and enables the automatic
>> +conversion of plain zero writes by the OS to driver specific optimized
>> +zero write commands. If "unmap" is chosen and @var{discard} is "on"
>> +a zero write may even be converted to an UNMAP operation.
>> @end table
>>
>> By default, the @option{cache=writeback} mode is used. It will report data
>> diff --git a/qmp-commands.hx b/qmp-commands.hx
>> index ed3ab92..a535955 100644
>> --- a/qmp-commands.hx
>> +++ b/qmp-commands.hx
>> @@ -2032,6 +2032,8 @@ Each json-object contain the following:
>> - "iops_rd_max": read I/O operations max (json-int)
>> - "iops_wr_max": write I/O operations max (json-int)
>> - "iops_size": I/O size when limiting by iops (json-int)
>> + - "detect_zeroes": detect and optimize zero writing (json-string)
>> + - Possible values: "off", "on", "unmap"
>> - "image": the detail of the image, it is a json-object containing
>> the following:
>> - "filename": image file name (json-string)
>> @@ -2108,6 +2110,7 @@ Example:
>> "iops_rd_max": 0,
>> "iops_wr_max": 0,
>> "iops_size": 0,
>> + "detect_zeroes": "on",
>> "image":{
>> "filename":"disks/test.qcow2",
>> "format":"qcow2",
>> diff --git a/util/iov.c b/util/iov.c
>> index 6569b5a..f8c49a1 100644
>> --- a/util/iov.c
>> +++ b/util/iov.c
>> @@ -335,6 +335,27 @@ void qemu_iovec_concat(QEMUIOVector *dst,
>> qemu_iovec_concat_iov(dst, src->iov, src->niov, soffset, sbytes);
>> }
>>
>> +/*
>> + * check if the contents of the iovecs are all zero
>> + */
>> +bool qemu_iovec_is_zero(QEMUIOVector *qiov)
>> +{
>> + int i;
>> + for (i = 0; i < qiov->niov; i++) {
>> + size_t offs = qiov->iov[i].iov_len & ~(4 * sizeof(long) - 1);
> I think it gets a bit more readable like this:
>
> size_t offs = QEMU_ALIGN_DOWN(qiov->iov[i].iov_len, 4 * sizeof(long));
Yes, indeed.
>
>> + uint8_t *ptr = qiov->iov[i].iov_base;
>> + if (offs && !buffer_is_zero(qiov->iov[i].iov_base, offs)) {
>> + return false;
>> + }
>> + for (; offs < qiov->iov[i].iov_len; offs++) {
>> + if (ptr[offs]) {
>> + return false;
>> + }
>> + }
>> + }
>> + return true;
>> +}
>> +
> This function could be a separate patch as well.
Ok, will split as well.
Thank you for your and Erik for his comments,
Peter