qemu-devel
Re: [PATCH v1] docs/devel: Add VFIO device migration documentation


From: Alex Williamson
Subject: Re: [PATCH v1] docs/devel: Add VFIO device migration documentation
Date: Thu, 5 Nov 2020 14:26:45 -0700

On Fri, 6 Nov 2020 02:22:11 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 11/6/2020 12:41 AM, Alex Williamson wrote:
> > On Fri, 6 Nov 2020 00:29:36 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 11/4/2020 6:15 PM, Alex Williamson wrote:  
> >>> On Wed, 4 Nov 2020 13:25:40 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>      
> >>>> On 11/4/2020 1:57 AM, Alex Williamson wrote:  
> >>>>> On Wed, 4 Nov 2020 01:18:12 +0530
> >>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>         
> >>>>>> On 10/30/2020 12:35 AM, Alex Williamson wrote:  
> >>>>>>> On Thu, 29 Oct 2020 23:11:16 +0530
> >>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>>>            
> >>>>>>
> >>>>>> <snip>
> >>>>>>        
> >>>>>>>>>> +System memory dirty pages tracking
> >>>>>>>>>> +----------------------------------
> >>>>>>>>>> +
> >>>>>>>>>> +A ``log_sync`` memory listener callback is added to mark system 
> >>>>>>>>>> memory pages  
> >>>>>>>>>
> >>>>>>>>> s/is added to mark/marks those/
> >>>>>>>>>               
> >>>>>>>>>> +as dirty which are used for DMA by VFIO device. Dirty pages 
> >>>>>>>>>> bitmap is queried  
> >>>>>>>>>
> >>>>>>>>> s/by/by the/
> >>>>>>>>> s/Dirty/The dirty/
> >>>>>>>>>               
> >>>>>>>>>> +per container. All pages pinned by vendor driver through 
> >>>>>>>>>> vfio_pin_pages()  
> >>>>>>>>>
> >>>>>>>>> s/by/by the/
> >>>>>>>>>               
> >>>>>>>>>> +external API have to be marked as dirty during migration. When 
> >>>>>>>>>> there are CPU
> >>>>>>>>>> +writes, CPU dirty page tracking can identify dirtied pages, but 
> >>>>>>>>>> any page pinned
> >>>>>>>>>> +by vendor driver can also be written by device. There is 
> >>>>>>>>>> currently no device  
> >>>>>>>>>
> >>>>>>>>> s/by/by the/ (x2)
> >>>>>>>>>               
> >>>>>>>>>> +which has hardware support for dirty page tracking. So all pages 
> >>>>>>>>>> which are
> >>>>>>>>>> +pinned by vendor driver are considered as dirty.
> >>>>>>>>>> +Dirty pages are tracked when device is in stop-and-copy phase 
> >>>>>>>>>> because if pages
> >>>>>>>>>> +are marked dirty during pre-copy phase and content is transferred 
> >>>>>>>>>> from source to
> >>>>>>>>>> +destination, there is no way to know newly dirtied pages from the 
> >>>>>>>>>> point they
> >>>>>>>>>> +were copied earlier until device stops. To avoid repeated copy of 
> >>>>>>>>>> same content,
> >>>>>>>>>> +pinned pages are marked dirty only during stop-and-copy phase.  
> >>>>>>>>
> >>>>>>>>           
> >>>>>>>>> Let me take a quick stab at rewriting this paragraph (not sure if I
> >>>>>>>>> understood it correctly):
> >>>>>>>>>
> >>>>>>>>> "Dirty pages are tracked when the device is in the stop-and-copy 
> >>>>>>>>> phase.
> >>>>>>>>> During the pre-copy phase, it is not possible to distinguish a dirty
> >>>>>>>>> page that has been transferred from the source to the destination 
> >>>>>>>>> from
> >>>>>>>>> newly dirtied pages, which would lead to repeated copying of the 
> >>>>>>>>> same
> >>>>>>>>> content. Therefore, pinned pages are only marked dirty during the
> >>>>>>>>> stop-and-copy phase." ?
> >>>>>>>>>               
> >>>>>>>>
> >>>>>>>> I think above rephrase only talks about repeated copying in pre-copy
> >>>>>>>> phase. Used "copied earlier until device stops" to indicate both
> >>>>>>>> pre-copy and stop-and-copy till device stops.  
> >>>>>>>
> >>>>>>>
> >>>>>>> Now I'm confused, I thought we had abandoned the idea that we can only
> >>>>>>> report pinned pages during stop-and-copy.  Doesn't the device need to
> >>>>>>> expose its dirty memory footprint during the iterative phase 
> >>>>>>> regardless
> >>>>>>> of whether that causes repeat copies?  If QEMU iterates and sees that
> >>>>>>> all memory is still dirty, it may have transferred more data, but it
> >>>>>>> can actually predict if it can achieve its downtime tolerances.  Which
> >>>>>>> is more important, less data transfer or predictability?  Thanks,
> >>>>>>>            
> >>>>>>
> >>>>>> Even if QEMU copies and transfers the content of all sys mem pages during
> >>>>>> pre-copy (the worst case, with an IOMMU-backed mdev device whose vendor
> >>>>>> driver is not smart enough to pin pages explicitly, so all sys mem pages
> >>>>>> are marked dirty), its prediction about downtime tolerance will still
> >>>>>> not be correct, because during stop-and-copy all pages need to be
> >>>>>> copied again, as the device can write to any of those pinned pages.  
> >>>>>
> >>>>> I think you're only reiterating my point.  If QEMU copies all of guest
> >>>>> memory during the iterative phase and each time it sees that all memory
> >>>>> is dirty, such as if CPUs or devices (including assigned devices) are
> >>>>> dirtying pages as fast as it copies them (or continuously marks them
> >>>>> dirty), then QEMU can predict that downtime will require copying all
> >>>>> pages.  
> >>>>
> >>>> But as of now there is no way to know if device has dirtied pages during
> >>>> iterative phase.  
> >>>
> >>>
> >>> This claim doesn't make any sense, pinned pages are considered
> >>> persistently dirtied, during the iterative phase and while stopped.
> >>>
> >>>        
> >>>>> If instead devices don't mark dirty pages until the VM is
> >>>>> stopped, then QEMU might iterate through memory copy and predict a short
> >>>>> downtime because not much memory is dirty, only to be surprised that
> >>>>> all of memory is suddenly dirty.  At that point it's too late, the VM
> >>>>> is already stopped, the predicted short downtime takes far longer than
> >>>>> expected.  This is exactly why we made the kernel interface mark pinned
> >>>>> pages persistently dirty when it was proposed that we only report
> >>>>> pinned pages once.  Thanks,
> >>>>>         
> >>>>
> >>>> Since there is no way to know if device dirtied pages during iterative
> >>>> phase, QEMU should query pinned pages in stop-and-copy phase.  
> >>>
> >>>
> >>> As above, I don't believe this is true.
> >>>
> >>>      
> >>>> Whenever there will be hardware support or some software mechanism to
> >>>> report pages dirtied by device then we will add a capability bit in
> >>>> migration capability and based on that capability bit qemu/user space
> >>>> app should decide to query dirty pages in iterative phase.  
> >>>
> >>>
> >>> Yes, we could advertise support for fine granularity dirty page
> >>> tracking, but I completely disagree that we should consider pinned
> >>> pages clean until suddenly exposing them as dirty once the VM is
> >>> stopped.  Thanks,
> >>>      
> >>
> >> Should QEMU copy dirtied pages twice, during iterative phase and then
> >> when VM is stopped?  
> > 
> > I don't understand why this is controversial.  We cannot decide within
> > the vfio device to only expose device dirtied pages in the final stage
> > of migration.  It's not our job to minimize the number of pages copied
> > beyond the hardware granularity.  If core QEMU migration code asks for
> > dirty pages, we provide them, regardless of how many times we report a
> > page as dirty.  So yes, if that migration code asks for dirty pages in
> > the iterative stage and the stopped stage, we provide them both times.  
> 
> Wouldn't that increase total migration time?

As I explained, that's not a policy decision that we as a device within
the VM should be making.  We do not have the visibility to determine
how the footprint of our device will affect the migration and by
preventing QEMU migration code from understanding the device footprint,
we're creating a scenario where QEMU absolutely cannot predict the
downtime.
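
To illustrate the predictability argument, here is a minimal sketch of a
toy model, not QEMU's actual migration code. It assumes the downtime
estimate is simply the last dirty page count observed during the
iterative phase divided by bandwidth, that pinned pages and CPU-dirtied
pages do not overlap, and that every pinned page must be copied once the
VM is stopped. The names `downtime_ms`, `predict_vs_actual` and
`report_pinned_in_precopy` are hypothetical and exist only for this
illustration.

```python
PAGE_SIZE = 4096  # bytes; assumed page size for this illustration


def downtime_ms(dirty_pages, bandwidth_bytes_per_s):
    """Time in milliseconds to copy dirty_pages while the VM is stopped."""
    return dirty_pages * PAGE_SIZE / bandwidth_bytes_per_s * 1000


def predict_vs_actual(pinned_pages, cpu_dirty_pages, bandwidth_bytes_per_s,
                      report_pinned_in_precopy):
    """Return (predicted_ms, actual_ms) for the toy model.

    predicted_ms is what the migration code would estimate from the dirty
    bitmap it sees during the iterative phase; actual_ms is the cost of the
    final copy, where pinned pages always count as dirty.
    """
    # Dirty pages visible to the migration code while iterating.
    observed = cpu_dirty_pages
    if report_pinned_in_precopy:
        observed += pinned_pages

    # Dirty pages that must really be copied after the VM stops.
    final = cpu_dirty_pages + pinned_pages

    return (downtime_ms(observed, bandwidth_bytes_per_s),
            downtime_ms(final, bandwidth_bytes_per_s))


if __name__ == "__main__":
    GIB = 1 << 30
    # Assumed workload: 1 GiB pinned by the vendor driver, 4 MiB CPU-dirtied.
    for report in (True, False):
        predicted, actual = predict_vs_actual(262144, 1024, GIB, report)
        print(f"report_pinned_in_precopy={report}: "
              f"predicted {predicted:.1f} ms, actual {actual:.1f} ms")
```

In this model, reporting pinned pages during the iterative phase makes the
prediction match the final copy, while hiding them until the VM stops makes
the prediction far too optimistic. The total data transferred is larger in
the first case, which is the trade-off being discussed.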
 
> > If someone wants to skip the iterative phase altogether, I imagine
> > there are migration parameters that allow it, but we should not be
> > determining that policy at the device level.  Thanks,
> >   
> 
> What is that parameter? Should that be documented here?

Dunno, but clearly we could pause the VM, migrate, and resume on the
target.  I imagine there are migration tuning parameters that might do
essentially that automatically.  Thanks,

Alex



