qemu-devel

Re: towards a workable O_DIRECT outmigration to a file


From: Dr. David Alan Gilbert
Subject: Re: towards a workable O_DIRECT outmigration to a file
Date: Thu, 18 Aug 2022 19:49:27 +0100
User-agent: Mutt/2.2.6 (2022-06-05)

* Claudio Fontana (cfontana@suse.de) wrote:
> On 8/18/22 18:31, Dr. David Alan Gilbert wrote:
> > * Claudio Fontana (cfontana@suse.de) wrote:
> >> On 8/18/22 14:38, Dr. David Alan Gilbert wrote:
> >>> * Nikolay Borisov (nborisov@suse.com) wrote:
> >>>> [adding Juan and David to cc as I had missed them. ]
> >>>
> >>> Hi Nikolay,
> >>>
> >>>> On 11.08.22 at 16:47, Nikolay Borisov wrote:
> >>>>> Hello,
> >>>>>
> >>>>> I'm currently looking into implementing a 'file:' uri for migration save
> >>>>> in qemu. Ideally the solution will be O_DIRECT compatible. I'm aware of
> >>>>> the branch https://gitlab.com/berrange/qemu/-/tree/mig-file. In the
> >>>>> process of brainstorming what a solution would look like, a couple of
> >>>>> questions transpired that I think warrant wider discussion in the
> >>>>> community.
> >>>
> >>> OK, so this seems to be a continuation with Claudio and Daniel and co as
> >>> of a few months back.  I'd definitely be leaving libvirt sides of the
> >>> question here to Dan, and so that also means definitely looking at that
> >>> tree above.
> >>
> >> Hi Dave, yes, Nikolay is trying to continue on the qemu side.
> >>
> >> We have something working with libvirt for our short term needs which 
> >> offers good performance,
> >> but it is clear that that simple solution is barred for upstream libvirt 
> >> merging.
> >>
> >>
> >>>
> >>>>> First, implementing a solution which is self-contained within qemu would
> >>>>> be easy enough (famous last words), but the gist is that one only has to care
> >>>>> about the format within qemu. However, I'm being told that what libvirt
> >>>>> does is prepend its own custom header to the resulting saved file, then
> >>>>> slipstreams the migration stream from qemu. Now with the solution that I
> >>>>> envision I intend to keep all write-related logic inside qemu, this
> >>>>> means there's no way to incorporate the logic of libvirt. The reason I'd
> >>>>> like to keep the write process within qemu is to avoid an extra copy of
> >>>>> data between the two processes (qemu outgoing migration and libvirt);
> >>>>> with the current fd approach qemu is passed an fd, data is copied
> >>>>> between qemu/libvirt and finally the libvirt_iohelper writes the data.
> >>>>> So the question which remains to be answered is how would libvirt make
> >>>>> use of this new functionality in qemu? I was thinking something along
> >>>>> the lines of :
> >>>>>
> >>>>> 1. Qemu writes its migration stream to a file, ideally on a filesystem
> >>>>> which supports reflink - xfs/btrfs
> >>>>>
> >>>>> 2. Libvirt writes its header to a separate file
> >>>>> 2.1 Reflinks qemu's stream right after its header
> >>>>> 2.2 Writes its trailer
> >>>>>
> >>>>> 3. Unlink() qemu's file, now only libvirt's file remains on-disk.
> >>>>>
> >>>>> I wouldn't call this solution hacky though it definitely leaves some
> >>>>> bitter aftertaste.
> >>>
> >>> Wouldn't it be simpler to tell libvirt to write its header, then tell
> >>> qemu to append everything?
> >>
> >> I would think so as well. 
> >>
> >>>
> >>>>> Another solution would be to extend the 'fd:' protocol to allow multiple
> >>>>> descriptors (for multifd) support to be passed in. The reason dup()
> >>>>> can't be used is because in order for multifd to be supported it's
> >>>>> required to be able to write to multiple, non-overlapping regions of the
> >>>>> file. And duplicated fd's share their offsets etc. But that really seems
> >>>>> more or less hacky. Alternatively, pwrite() could be used
> >>>>> to write to non-overlapping regions in the file. Any feedback is
> >>>>> welcomed.
> >>>
> >>> I do like the idea of letting fd: take multiple fd's.
> >>
> >> Fine in my view, I think we will still need then a helper process in 
> >> libvirt to merge the data into a single file, no?
> >> In case the libvirt multifd to single file multithreaded helper I proposed 
> >> before is helpful as a reference you could reuse/modify those patches.
> > 
> > Eww that's messy isn't it.
> > (You don't fancy a huge sparse file do you?)
> 
> Wait am I missing something obvious here?
> 
> Maybe we don't need any libvirt extra process.
> 
> why don't we open the _single_ file multiple times from libvirt,
> 
> Let's say the "main channel" fd is opened, we write the libvirt header,
> then reopen the same file multiple times,
> and finally pass all fds to qemu, one fd for each parallel transfer channel 
> we want to use
> (so we solve all the permissions, security labels issues etc).
> 
> And then from QEMU we can write to those fds at the right offsets for each 
> separate channel,
> which is easier from QEMU because we can know exactly how much data we need 
> to transfer before starting the migration,
> so we have even less need for "holes", possibly only minor ones for single 
> byte adjustments
> for uneven division of the interleaved file.
> 
> What is wrong with this one, or does anyone see some other better approach?

You'd have to know exactly how to space the channels' positions in the
file, unless you somehow controlled it; the allocation of data across the
multifd threads depends on load and scheduling and is effectively random,
I think, so you'd have to assume the worst case of everything going to one
thread. I.e. a big sparse area per channel and then something (an index)
to tell you where the data actually is.
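
To make that worst case concrete, here is a minimal sketch (Python only for
brevity; this is not QEMU or libvirt code). It assumes a hypothetical header
blob, a 4096-byte alignment as a stand-in for O_DIRECT constraints, and plain
buffered I/O rather than real O_DIRECT. Each channel gets a region large
enough to hold all of the data, the file is reopened once per channel so that
each fd has its own offset, and the function returns the (start, length)
index that a reader would need.

```python
import os
import tempfile

HEADER = b"LIBVIRT-HDR"  # hypothetical stand-in for libvirt's header
ALIGN = 4096             # assumed alignment; real O_DIRECT limits vary


def align_up(n, a=ALIGN):
    """Round n up to the next multiple of a."""
    return (n + a - 1) // a * a


def layout(header_len, total_bytes, channels):
    """Start offset of each channel's region.

    Worst case: any single channel may end up carrying all the data,
    so every region is sized for total_bytes.
    """
    base = align_up(header_len)
    region = align_up(total_bytes)
    return [base + i * region for i in range(channels)]


def save(path, chunks_per_channel):
    """Write the header, then each channel's chunks into its own region.

    Returns an index of (start, length) pairs, one per channel.
    """
    total = sum(len(c) for chunks in chunks_per_channel for c in chunks)
    starts = layout(len(HEADER), total, len(chunks_per_channel))

    # "Main channel": create the file and write the header at offset 0.
    fd0 = os.open(path, os.O_CREAT | os.O_TRUNC | os.O_WRONLY, 0o600)
    os.pwrite(fd0, HEADER, 0)
    os.close(fd0)

    index = []
    for start, chunks in zip(starts, chunks_per_channel):
        # A separate open() gives a separate open file description,
        # unlike dup(), which would share the file offset.
        fd = os.open(path, os.O_WRONLY)
        off = start
        for c in chunks:
            off += os.pwrite(fd, c, off)  # explicit offset, no seeking
        os.close(fd)
        index.append((start, off - start))
    return index
```

With three channels and 5300 bytes of payload in total, each region is 8192
bytes, so most of the file is a hole, and without the returned index a reader
cannot tell where each channel's data ends.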

Dave

> Thanks,
> 
> C
> 
> > 
> >> Maybe this new way will be acceptable to libvirt,
> >> ie avoiding the multifd code -> socket, but still merging the data from 
> >> the multiple fds into a single file?
> > 
> > It feels to me like the problem here is really that what we want is something
> > closer to a dump than the migration code; you don't need all that
> > overhead of the code to deal with live migration bitmaps and dirty pages
> > that aren't going to happen.
> > Something that just does a nice single write(2) (for each memory
> > region);
> > and then ties the device state on.
> > 
> > Dave
> > 
> >>>
> >>> Dave
> >>>
> >>
> >> Thanks for your comments,
> >>
> >> Claudio
> >>>>>
> >>>>>
> >>>>> Regards,
> >>>>> Nikolay
> >>>>
> >>
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



