qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH v2] block/raw-posix: Try both FIEMAP and SEEK_HO


From: Eric Blake
Subject: Re: [Qemu-devel] [PATCH v2] block/raw-posix: Try both FIEMAP and SEEK_HOLE
Date: Tue, 06 May 2014 15:18:33 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.5.0

On 05/06/2014 01:00 PM, Max Reitz wrote:
> The current version of raw-posix always uses ioctl(FS_IOC_FIEMAP) if
> FIEMAP is available; lseek with SEEK_HOLE/SEEK_DATA are not even
> compiled in in this case. However, there may be implementations which
> support the latter but not the former (e.g., NFSv4.2) as well as vice
> versa.
> 
> To cover both cases, always try SEEK_HOLE/SEEK_DATA (as this will
> probably be covered by POSIX soon) and if that does not work, fall back
> to FIEMAP; and if that does not work either, treat everything as
> allocated.
> 
> Signed-off-by: Max Reitz <address@hidden>
> ---
> v2:
>  - reworked using static functions [Stefan]
>  - changed order of FIEMAP and SEEK_HOLE [Eric]
> ---
>  block/raw-posix.c | 135 
> ++++++++++++++++++++++++++++++++++--------------------
>  1 file changed, 85 insertions(+), 50 deletions(-)
> 

> +++ b/block/raw-posix.c
> @@ -146,6 +146,12 @@ typedef struct BDRVRawState {
>      bool has_discard:1;
>      bool has_write_zeroes:1;
>      bool discard_zeroes:1;
> +#if defined SEEK_HOLE && defined SEEK_DATA
> +    bool skip_seek_hole;

Remember, SEEK_HOLE support requires both support of the kernel and the
filesystem - we may still have cases where there are filesystems that
lack SEEK_HOLE but still have FIEMAP, even when compiled against kernel
headers where SEEK_HOLE is defined - on those filesystems, the kernel
guarantees SEEK_HOLE will report the entire file as allocated, even
though FIEMAP might be able to find holes.  On the other hand, the whole
point of this is to optimize cases; where the fallback to treating the
whole file as allocated may be slower but still gives correct behavior
to the guest.

I like how you have a per-struct state to track whether you have
encountered a previous failure, to skip the known-failing probe the next
time around.  But I wonder if you need a tweak...

> +static int64_t try_seek_hole(BlockDriverState *bs, off_t start, off_t *data,
> +                             off_t *hole, int *pnum)
> +{
> +#if defined SEEK_HOLE && defined SEEK_DATA
>      BDRVRawState *s = bs->opaque;
>  
> -    hole = lseek(s->fd, start, SEEK_HOLE);
> -    if (hole == -1) {
> +    if (s->skip_seek_hole) {
> +        return -ENOTSUP;
> +    }
> +
> +    *hole = lseek(s->fd, start, SEEK_HOLE);
> +    if (*hole == -1) {
>          /* -ENXIO indicates that sector_num was past the end of the file.
>           * There is a virtual hole there.  */
>          assert(errno != -ENXIO);
>  
> -        /* Most likely EINVAL.  Assume everything is allocated.  */
> -        *pnum = nb_sectors;
> -        return ret;
> +        s->skip_seek_hole = true;
> +        return -errno;
>      }

...if you are on a file system where SEEK_HOLE triggers the kernel
fallback of "entire file is allocated", but where FIEMAP is wired up for
that file system, would it make sense to have try_seek_hole return -1 in
situations where lseek(s->fd, 0, SEEK_HOLE) returns the end of the file?
 Even more, should skip_seek_hole be a tri-state?

state: "unknown" - if lseek(SEEK_HOLE) returns -1, state changes to
"skip"; if it returns something other than EOF, state changes to "use";
if it returns EOF, state remains "unknown" (as punching a hole or
resizing the image may work to create a hole in what is currently a
fully-allocated image)

state: "skip" - we've had a previous lseek failure, no need to try
seeking for holes on this file

state: "use" - we've had success probing a hole, so we want to always
trust SEEK_HOLE for this file

>  
> -    if (hole > start) {
> -        data = start;
> +    if (*hole > start) {
> +        *data = start;
>      } else {
>          /* On a hole.  We need another syscall to find its end.  */
> -        data = lseek(s->fd, start, SEEK_DATA);
> -        if (data == -1) {
> -            data = lseek(s->fd, 0, SEEK_END);
> +        *data = lseek(s->fd, start, SEEK_DATA);
> +        if (*data == -1) {
> +            *data = lseek(s->fd, 0, SEEK_END);
>          }

Pre-existing, and unaffected by this patch, but lseek() changes the file
offset.  Does that affect any other code that might do a read()/write(),
or are we safe in always using pread()/pwrite() so that we don't care
about file offset?

> + *
> + * 'nb_sectors' is the max value 'pnum' should be set to.  If nb_sectors goes
> + * beyond the end of the disk image it will be clamped.
> + */
> +static int64_t coroutine_fn raw_co_get_block_status(BlockDriverState *bs,
> +                                            int64_t sector_num,
> +                                            int nb_sectors, int *pnum)

Indentation is off.

> +{
> +    off_t start, data = 0, hole = 0;
> +    int64_t ret;
> +
> +    ret = fd_open(bs);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    start = sector_num * BDRV_SECTOR_SIZE;
> +
> +    ret = try_seek_hole(bs, start, &data, &hole, pnum);

Again, a tri-state return (-1 for skipped, 0 for unknown, 1 for hole or
EOF found) might make more sense.  That way, you hit the fallback of
declaring the whole file as allocated only if both probes reported no
holes.  Or hide the tri-state in the helper function, but map both
"skip" and "unknown" into -1 return, and only "use" into 0 return.

> +    if (ret < 0) {
> +        ret = try_fiemap(bs, start, &data, &hole, nb_sectors, pnum);
> +        if (ret < 0) {
> +            /* Assume everything is allocated. */
> +            data = 0;
> +            hole = start + nb_sectors * BDRV_SECTOR_SIZE;
> +            ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID | start;
> +        }
> +    }
>  
>      if (data <= start) {
>          /* On a data extent, compute sectors to the end of the extent.  */
> 

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Attachment: signature.asc
Description: OpenPGP digital signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]