Re: [Qemu-block] RFC: some problems with bdrv_get_block_status


From: Denis V. Lunev
Subject: Re: [Qemu-block] RFC: some problems with bdrv_get_block_status
Date: Fri, 28 Apr 2017 22:37:23 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0

On 04/28/2017 10:35 PM, Denis V. Lunev wrote:
> On 04/28/2017 09:31 PM, Eric Blake wrote:
>> On 04/28/2017 11:17 AM, Denis V. Lunev wrote:
>>> Hello, All!
>>>
>>> Recently we have experienced problems with a very slow
>>> bdrv_get_block_status() call, which is called massively, e.g.
>>> from mirror_run().
>>>
>>> The problem was narrowed down to a slow lseek64() system
>>> call, which can take 1-2-3 seconds.
>> I'm guessing you meant one-to-three (the range), not one-two-three
>> (three separate digits), and just had an unfortunate abbreviation of
>> 'to' turning into the phonetically-similar '2'.
>>
> from 1 to 3, you are correct
>
>
>>> address@hidden ~]# strace -f -p 77048 -T  -e lseek
>>> Process 77048 attached with 14 threads
>>> [pid 77053] lseek(23, 6359613440, SEEK_DATA) = 6360276992 <1.250323>
>> That sounds like a bug in your choice of filesystem.  It's been
>> mentioned before that lseek has some pathetically poor behavior (I think
>> tmpfs was one of the culprits), but I maintain that it is better to
>> hammer on the kernel folks to fix the poor behavior than it is to have
>> to implement user-space workarounds in every single affected program.
>>
>> That said, a workaround of being able to request the avoidance of lseek
>> has been brought up on this list before.
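
[For readers who have not looked at the protocol layer: file-posix answers these block-status queries by probing the image file with lseek(SEEK_DATA)/lseek(SEEK_HOLE), which is exactly the call visible in the strace output above. A minimal stand-alone sketch of such a probe, written purely for illustration and not taken from the QEMU code, looks roughly like this:

/*
 * Illustrative sketch only -- not the actual file-posix.c code.  It shows
 * the kind of lseek(SEEK_DATA)/lseek(SEEK_HOLE) probe behind the slow call
 * seen in the strace output above.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
    off_t start = 6359613440;     /* offset taken from the strace output */

    if (argc < 2) {
        fprintf(stderr, "usage: %s <image-file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* SEEK_DATA returns the start of the next data extent at or after
     * 'start'; if that is beyond 'start', the bytes in between are a hole. */
    off_t data = lseek(fd, start, SEEK_DATA);
    if (data == (off_t)-1) {
        perror("lseek(SEEK_DATA)");   /* e.g. ENXIO: no data past 'start' */
    } else if (data > start) {
        printf("hole: %lld..%lld\n", (long long)start, (long long)data);
    } else {
        /* 'start' itself is data; find where this data extent ends. */
        off_t hole = lseek(fd, start, SEEK_HOLE);
        printf("data: %lld..%lld\n", (long long)start, (long long)hole);
    }

    close(fd);
    return 0;
}

On a well-behaved filesystem each probe is cheap; the problem reported here is that a single such call can take seconds.]
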
>>
>>
>>> The problem comes from this branch of the code
>>>
>>> bdrv_co_get_block_status
>>>     .......
>>>     if (bs->file &&
>>>         (ret & BDRV_BLOCK_DATA) && !(ret & BDRV_BLOCK_ZERO) &&
>>>         (ret & BDRV_BLOCK_OFFSET_VALID)) {
>>>         int file_pnum;
>>>
>>>         ret2 = bdrv_co_get_block_status(bs->file, ret >> BDRV_SECTOR_BITS,
>>>                                         *pnum, &file_pnum);
>>>         if (ret2 >= 0) {
>>>             /* Ignore errors.  This is just providing extra information, it
>>>              * is useful but not necessary.
>>>              */
>>>             if (!file_pnum) {
>>>                 /* !file_pnum indicates an offset at or beyond the EOF; it is
>>>                  * perfectly valid for the format block driver to point to such
>>>                  * offsets, so catch it and mark everything as zero */
>>>                 ret |= BDRV_BLOCK_ZERO;
>>>             } else {
>>>                 /* Limit request to the range reported by the protocol driver */
>>>                 *pnum = file_pnum;
>>>                 ret |= (ret2 & BDRV_BLOCK_ZERO);
>>>             }
>>>         }
>>>     }
>> I don't see anything wrong with this code. It's nice to know that we
>> have data (qcow2 says we have allocated bytes due to this layer, so
>> don't refer to the backing files), but when the underlying file can ALSO
>> tell us that the underlying protocol is sparse and we are on a hole,
>> then we know that we have BDRV_BLOCK_ZERO condition which can in turn
>> enable other optimizations in quite a bit of the stack.  It IS valid to
>> return BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO (we don't have to limit
>> ourselves to BDRV_BLOCK_DATA), and we should probably do so any time
>> that such computation is cheap.
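
[To make the benefit concrete, here is a stand-alone sketch of how a copy loop can exploit such a combined answer. The flag values and the copy_chunk() helper are invented for this example; this is not the mirror/backup code, only the per-chunk decision it gets to make:

#include <stdio.h>
#include <stdint.h>

/* Flag values redefined here only to keep the example self-contained;
 * they mimic the meaning of BDRV_BLOCK_DATA / BDRV_BLOCK_ZERO. */
#define BLOCK_DATA  (1 << 0)   /* this layer allocated the range       */
#define BLOCK_ZERO  (1 << 1)   /* the range is known to read as zeroes */

static void copy_chunk(int64_t offset, int64_t bytes, int status)
{
    if (status & BLOCK_ZERO) {
        /* Allocated or not, the content reads as zeroes: the destination
         * can be zeroed (or left sparse) without reading the source. */
        printf("%10lld +%lld: write_zeroes, no read needed\n",
               (long long)offset, (long long)bytes);
    } else if (status & BLOCK_DATA) {
        /* Real data: it has to be read and written. */
        printf("%10lld +%lld: read + write\n",
               (long long)offset, (long long)bytes);
    } else {
        /* Neither flag: the range is taken from the backing file. */
        printf("%10lld +%lld: handled via backing file\n",
               (long long)offset, (long long)bytes);
    }
}

int main(void)
{
    copy_chunk(0,      65536, BLOCK_DATA);              /* must be copied */
    copy_chunk(65536,  65536, BLOCK_DATA | BLOCK_ZERO); /* can be skipped */
    copy_chunk(131072, 65536, 0);                       /* backing file   */
    return 0;
}

The middle chunk is the interesting one: without the extra BDRV_BLOCK_ZERO bit it would have to be read and copied like ordinary data.]
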
>>
>> I agree with your analysis that on a poor filesystem, where lseek()
>> behavior sucks, it is no longer cheap to determine where the holes are.
>>
>> But the proper workaround for those filesystems should NOT be made by
>> undoing this part of bdrv_co_get_block_status(), but should rather be
>> fixed in file-posix.c by the addition of a user-controllable knob on
>> whether to skip lseek().  In other words, if we're going to work around
>> the poor filesystem performance, the workaround should live in
>> file-posix.c, not in the generic block/io.c.  Once the workaround is
>> enabled, the 'ret2 = bdrv_co_get_block_status()' nested call should be
>> lightning fast (because file-posix.c would no longer be using lseek when
>> you've requested the workaround).
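
[Purely as a sketch of the kind of knob being suggested -- every name below (the option, struct, flags and helper) is invented for illustration, and no such option existed in file-posix.c when this thread was written:

#define _GNU_SOURCE
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

#define SKETCH_DATA  (1 << 0)
#define SKETCH_ZERO  (1 << 1)
#define MIN(a, b) ((a) < (b) ? (a) : (b))

typedef struct RawStateSketch {
    int fd;
    bool skip_lseek;   /* would come from a hypothetical "seek-hole=off" option */
} RawStateSketch;

/* Classify [start, start + bytes).  With the knob enabled, no lseek() is
 * issued and the range is simply reported as data -- always a safe, if
 * pessimistic, answer -- so the nested ret2 call returns instantly. */
static int sketch_block_status(RawStateSketch *s, int64_t start,
                               int64_t bytes, int64_t *pnum)
{
    if (s->skip_lseek) {
        *pnum = bytes;
        return SKETCH_DATA;
    }

    off_t data = lseek(s->fd, start, SEEK_DATA);
    if (data == (off_t)-1 || data > start) {
        /* A hole (or nothing but EOF) at 'start': reads as zeroes. */
        *pnum = (data == (off_t)-1) ? bytes : MIN(bytes, data - start);
        return SKETCH_ZERO;
    }

    off_t hole = lseek(s->fd, start, SEEK_HOLE);
    *pnum = MIN(bytes, hole - start);
    return SKETCH_DATA;
}

int main(void)
{
    RawStateSketch s = { .fd = -1, .skip_lseek = true };
    int64_t pnum = 0;

    /* With skip_lseek set, the file descriptor is never touched and the
     * answer comes back immediately. */
    int status = sketch_block_status(&s, 0, 1 << 20, &pnum);
    return !(status & SKETCH_DATA);
}

The point of the knob is only to trade accuracy (holes are reported as data) for latency; the generic code in block/io.c stays untouched.]
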
>>
>>> Frankly speaking, this optimization should not gain much.
>>> If the upper layer format (QCOW2) says that we have data
>>> here, then nowadays in 99.9% of cases we do have the data.
>> You are correct that we have BDRV_BLOCK_DATA; but it IS useful to ALSO
>> know if that data reads back as all zeros (so we can skip reading it).
>> Just because qcow2 reports something as data does NOT rule out whether
>> the data still reads as zeros.
>>
>>> Meanwhile, this branch can cause problems. We would
>>> need the block to be cleared entirely to get the benefit in most
>>> cases for mirror and backup operations.
>>>
>>> In my opinion it is worth dropping this altogether.
>>>
>>> Guys, do you have any opinion?
>> Again, my opinion is to not change this part of block/io.c.  Rather,
>> work should be expended on individual protocol drivers to quit wasting
>> unnecessary time on reporting BDRV_BLOCK_ZERO if the time spent to
>> determine that is not worth the effort (as it is when using lseek() on
>> inefficient filesystems).  It is always safe for a protocol driver to
>> report BDRV_BLOCK_DATA instead of BDRV_BLOCK_ZERO, if asking the
>> question is too expensive (treating holes as data may pessimize other
>> operations, but that's okay if the penalty for asking is worse than what
>> optimizations are able to regain).
>>
>>> Den
>>>
>>> P.S. The kernel is one based on RedHat 3.10.0-514. The same
>>>       problem was observed in 3.10.0-327 too.
>> Red Hat is always two words (I'm allowed to point that out, as they
>> employ me ;).  But if it really is a Red Hat kernel problem, be sure to
>> use your support contract to point out to the kernel developers that they
>> really need to fix lseek() - and you'll need to give details on which
>> filesystem you're using (as not all filesystems have that abysmal
>> performance).
>>
Please ignore the letter above completely. I pressed 'Send' in the wrong
window.
The correct answer is above in the thread.

Den


