qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH v2] block: posix: Handle undetectable alignment


From: Nir Soffer
Subject: Re: [Qemu-devel] [PATCH v2] block: posix: Handle undetectable alignment
Date: Tue, 13 Aug 2019 16:12:54 +0300

On Tue, Aug 13, 2019 at 2:21 PM Kevin Wolf <address@hidden> wrote:

> Am 13.08.2019 um 12:45 hat Nir Soffer geschrieben:
> > On Mon, Aug 12, 2019 at 5:23 PM Kevin Wolf <address@hidden> wrote:
> >
> > > Am 11.08.2019 um 22:50 hat Nir Soffer geschrieben:
> > > > In some cases buf_align or request_alignment cannot be detected:
> > > >
> > > > - With Gluster, buf_align cannot be detected since the actual I/O is
> > > >   done on Gluster server, and qemu buffer alignment does not matter.
> > >
> > > If it doesn't matter, the best value would be buf_align = 1.
> > >
> >
> > Right, if we know that this is gluster.
> >
> > > - With local XFS filesystem, buf_align cannot be detected if reading
> > > >   from unallocated area.
> > >
> > > Here, we actually do need alignment, but it's unknown whether it would
> > > be 512 or 4096 or something entirely. Failing to align requests
> > > correctly results in I/O errors.
> > >
> > > > - With Gluster backed by XFS, request_alignment cannot be detected if
> > > >   reading from unallocated area.
> > >
> > > This is like buf_align for XFS: We don't know the right value, but
> > > getting it wrong causes I/O errors.
> > >
> > > > - With NFS, the server does not use direct I/O, so both buf_align
> > > >   cannot be detected.
> > >
> > > This suggests that byte-aligned requests are fine for NFS, i.e.
> > > buf_align = request_alignment = 1 would be optimal in this case.
> > >
> >
> > Right, but again we don't know this is NFS.
>
> Yes, I agree. I was just trying to list the optimal settings for each
> case so I could compare them against the actual results the path
> provides. I'm well aware that we don't know a way to get the optimal
> results for all four cases.
>
> > > These cases seems to work when storage sector size is 512 bytes,
> because
> > > > the current code starts checking align=512. If the check succeeds
> > > > because alignment cannot be detected we use 512. But this does not
> work
> > > > for storage with 4k sector size.
> > > >
> > > > Practically the alignment requirements are the same for buffer
> > > > alignment, buffer length, and offset in file. So in case we cannot
> > > > detect buf_align, we can use request alignment. If we cannot detect
> > > > request alignment, we can fallback to a safe value.
> > >
> > > This makes sense in general.
> > >
> > > What the commit message doesn't explain, but probably should do is how
> > > we determine whether we could successfully detect request alignment.
> The
> > > approach taken here is that a detected alignment of 1 is understood as
> > > failure to detect the real alignment.
> >
> > Failing with EINVAL when using 1, and succeeding with another value is
> > considered a successful detection.
> >
> > We have 3 issues preventing detection:
> > - filesystem not using direct I/O on the remote server (NFS, Gluster
> > when network.remote-dio=on)
> > - area probed is unallocated with XFS or Gluster backed by XFS
> > - filesystem without buffer alignment requirement (e.g. Gluster)
>
> I would say case 1 is effectively a subset of case 3 (i.e. it's just one
> specific reason why we don't have a buffer alignment requirement).
>
> > For handling unallocated areas, we can:
> > - always allocate the first block when creating an image (qemu-img
> > create/convert)
> > - use write() instead of read().
> >
> > In oVirt we went with the second option - when we initialize a file
> > storage domain, we create a special file and do direct write to this
> > file with 1, 512, and 4096 bytes length. If we detect 512 or 4096, we
> > use this value for creating the domain (e.g. for sanlock).  If we
> > detect 1, we use the user provided value (default 512).
>
> Yes, but there's the important difference that oVirt controls the image
> files, whereas QEMU doesn't. Even if qemu-img create made sure that we
> allocate the first block, the user could still pass us an image that
> was created using a different way.
>
> Using write() is actually an interesting thought. Obviously, we can't
> just overwrite the user image. But maybe what we could do is read the
> first block and then try to rewrite it with different alignments.
>

Yes, this is what we do in ovirt-imageio for file based storage:
https://github.com/oVirt/ovirt-imageio/blob/ca70170886b0c1fbeca8640b12bcf54f01a3fea0/common/ovirt_imageio_common/backends/file.py#L311

But we have lot of assumptions that may not work for qemu:
- we don't support read only images
- we assume nobody else is writing to the image imageio uses
  (enforced by oVirt)

So this will not work for qemu-img read-only operations.

However, this will break down with read-only images, so if we can't
> write, we'd still have to fall back to a safe default.
>
> Also, given the straces we saw, I'm afraid we might trigger gluster bugs
> where writes that failed with EINVAL mess up the internal state so that
> even later aligned requests would fail.
>

If this happen only when using Gluster performance.strict-o-direct = off,
we will
enforce performance.strict-o-direct = in oVirt.

Otherwise this is a Gluster bug and it should be fixed in Gluster.

> You can see the code here:
> >
> https://github.com/oVirt/vdsm/blob/4733018f9a719729242738b486906d3b9ed058cd/lib/vdsm/storage/fileSD.py#L838
> >
> > One way we can use in qemu is to create a temporary file:
> >
> >     /path/to/image.tmp9vo8US
> >
> > Delete the file, keeping the fd open, and detect the alignment on this
> > file using write().
>
> This isn't going to fly. We might not have write permission to the
> directory even for read-write images. Worse, we might get passed only a
> file descriptor instead of a path. So whatever we do, we must do it with
> the image file descriptor.
>
> > With this we fixed all the cases listed above, but creating new files
> > requires write permission in the directory where the image is in, and
> > will not work for some strange setups (.e.g bind-mount images).
> >
> > One issue with this is that there is no guarantee that the temporary
> > file will be deleted so the user will have to deal with leftover
> > files.
>
> On Linux, we could use O_TMPFILE for this. However, as I mentioned
> above, we may not even know the directory.
>
...

Nir


reply via email to

[Prev in Thread] Current Thread [Next in Thread]