Re: [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co

qemu-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_d

From:	Peter Lieven
Subject:	Re: [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs
Date:	Sat, 27 Dec 2014 21:01:26 +0100
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0

Am 27.12.2014 um 18:42 schrieb Denis V. Lunev:
> On 27/12/14 17:52, Peter Lieven wrote:
>> Am 26.12.2014 um 20:15 schrieb Denis V. Lunev:
>>> On 26/12/14 16:32, Denis V. Lunev wrote:
>>>> On 26/12/14 16:13, Peter Lieven wrote:
>>>>> Am 26.12.2014 um 13:35 schrieb Denis V. Lunev:
>>>>>> The check for maximum length was added by
>>>>>>     commit c31cb70728d2c0c8900b35a66784baa446fd5147
>>>>>>     Author: Peter Lieven <address@hidden>
>>>>>>     Date:   Thu Oct 24 12:06:58 2013 +0200
>>>>>>       block: honour BlockLimits in bdrv_co_do_write_zeroes
>>>>>>
>>>>>> but actually if driver provides .bdrv_co_write_zeroes callback, there is
>>>>>> no need to limit the size to 32 MB. Callback should provide effective
>>>>>> implementation which normally should not write any zeroes in comparable
>>>>>> amount.
>>>>>
>>>>> NACK.
>>>>>
>>>>> First there is no guarantee that bdrv_co_do_write_zeroes is a fast 
>>>>> operation.
>>>>> This heaviliy depends on several circumstances that the block layer is 
>>>>> not aware of.
>>>>> If a specific protocol knows it is very fast in writing zeroes under any 
>>>>> circumstance
>>>>> it should provide INT_MAX in bs->bl.max_write_zeroes. It is then still 
>>>>> allowed to
>>>>> return -ENOTSUP if the request size or alignment doesn't fit.
>>>>
>>>> the idea is that (from my point of view) if .bdrv_co_do_write_zeroes is
>>>> specified, the cost is almost the same for any amount of zeroes
>>>> written. This is true for fallocate from my point of view. The amount
>>>> of actually written data will be in several orders less than specified
>>>> except slow path, which honors 32 MB limit.
>>>>
>>>> If the operation is complex in realization, then it will be rate-limited
>>>> below, in actual implementation.
>>>>
>>>>> There are known backends e.g. anything that deals with SCSI that have a 
>>>>> known
>>>>> limitation of the maximum number of zeroes they can write fast in a 
>>>>> single request.
>>>>> This number MUST NOT be exceeded. The below patch would break all those 
>>>>> backends.
>>>>
>>>> could you pls point me this backends. Actually, from my point of
>>>> view, they should properly setup max_write_zeroes for themselves.
>>>> This is done at least in block/iscsi.c and it would be consistent
>>>> way of doing so.
>>>>
>>>>>
>>>>> What issue are you trying to fix with this patch? Maybe there is a better 
>>>>> way to fix
>>>>> it at another point in the code.
>>>>>
>>>>
>>>> I am trying to minimize amount of metadata updates for a file.
>>>> This provides some speedup even on ext4 and this will provide
>>>> even more speedup with a distributed filesystem like CEPH
>>>> where size updates of the files and block allocation are
>>>> costly.
>>>>
>>>> Regards,
>>>>      Den
>>> First of all, the patch is really wrong :) It was written using
>>> wrong assumptions.
>>>
>>> OK. I have spent some time reading your original patchset and
>>> and did not found any useful justification for default limit
>>> for both discard and write zero.
>>
>> 32768 is the largest power of two fitting into a uint16.
>> And uint16 is quite common for nb_sectors in backends.
>>
>
> ok. This could be reasonable.
>
>
>>>
>>> Yes, there are drivers which requires block level to call
>>> .bdrv_co_do_write_zeroes with alignment and with upper limit.
>>> But in this case driver setups max_write_zeroes. All buggy
>>> drivers should do that not to affect not buggy ones from
>>> my opinion.
>>>
>>> This is the only purpose of the original patches for limits.
>>> I have wrongly interpret BlockLimits as something connected
>>> with time of the operation. Sorry for that.
>>>
>>> Therefore there is no good reason for limiting the amount of
>>> data sent to drv->bdrv_co_writev with any data size. The only
>>> thing is that it would be good not to allocate too many memory
>>> at once. We could do something like
>>>
>>> base = qemu_try_blockalign(bs, MIN(2048, num) * BDRV_SECTOR_SIZE);
>>> added = 0;
>>> for (added = 0; added < num; add += MIN(2048, num)) {
>>>      qemu_iovec_add(qiov, base, MIN(2048, num));
>>> }
>>>
>>> to avoid really big allocations here even if .max_write_zeroes is
>>> very high. Do you think that this might be useful?
>>>
>>> As for .bdrv_co_do_write_zeroes itself, can we still drop
>>> default 32 Mb limit? If there are some buggy drivers, they
>>> should have .max_write_zeroes specified.
>>>
>>> The same applies to .max_discard
>>
>> Its always risky to change default behaviour. In the original discussion we
>> agreed that there should be a limit for each request. I think the 2^15 was
>> Paolos suggestion.
>>
>> You where talking of metadata updates for a file. So the operation that is 
>> too slow
>> for you is bdrv_write_zeroes inside a container file? What is the 
>> underlaying filesystem?
>> What is the exact operation that you try to optimize?
>>
>> I am wondering because as far as I can see write zeroes is only supported for
>> XFS and block devices which support BLKZEROOUT. The latter only works for
>> cache=none. So its not that easy to end up in an optimized (fast) path 
>> anyway.
>>
>> Peter
>>
>>
> you have missed 6 patches below ;) f.e. patch 2/7
>
> OK. I'll redo changes and fix on raw-posix level.

I was not in CC on the series. Can you please include me in CC for all patches 
when you respin.

Thanks,
Peter

[Prev in Thread]

Current Thread

[Next in Thread]

[Qemu-devel] [PATCH 3/7] block/raw-posix: create do_fallocate helper, (continued)
- [Qemu-devel] [PATCH 3/7] block/raw-posix: create do_fallocate helper, Denis V. Lunev, 2014/12/26
- [Qemu-devel] [PATCH 2/7] block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes, Denis V. Lunev, 2014/12/26
- [Qemu-devel] [PATCH 5/7] block/raw-posix: refactor handle_aiocb_write_zeroes a bit, Denis V. Lunev, 2014/12/26
- [Qemu-devel] [PATCH 7/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes, Denis V. Lunev, 2014/12/26
- [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs, Denis V. Lunev, 2014/12/26
  - Re: [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs, Peter Lieven, 2014/12/26
    - Re: [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs, Denis V. Lunev, 2014/12/26
    - Re: [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs, Denis V. Lunev, 2014/12/26
    - Re: [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs, Peter Lieven, 2014/12/27
    - Re: [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs, Denis V. Lunev, 2014/12/27
    - Re: [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs, Peter Lieven <=
    - Re: [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs, Denis V. Lunev, 2014/12/27
- [Qemu-devel] [PATCH 4/7] block/raw-posix: create translate_err helper to merge errno values, Denis V. Lunev, 2014/12/26
- [Qemu-devel] [PATCH 6/7] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes, Denis V. Lunev, 2014/12/26

Prev by Date: Re: [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs
Next by Date: Re: [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs
Previous by thread: Re: [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs
Next by thread: Re: [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs
Index(es):
- Date
- Thread