From: Peter Xu
Subject: Re: [Qemu-devel] [PATCH 00/11] kvm/migration: support KVM_CLEAR_DIRTY_LOG
Date: Wed, 8 May 2019 19:39:29 +0800
User-agent: Mutt/1.10.1 (2018-07-13)

On Wed, May 08, 2019 at 12:09:00PM +0200, Paolo Bonzini wrote:
> On 08/05/19 01:15, Peter Xu wrote:
> > Design
> > ===================
> > 
> > I started with a naive/stupid design where I always passed all 1's
> > to KVM for a memory range to clear every dirty bit within that
> > range, but then I hit guest oopses - simply because we cannot clear
> > any dirty bit from QEMU unless we are _sure_ that the bit is dirty
> > in the kernel.  Otherwise we might accidentally clear a bit that we
> > don't even know about (e.g., a bit that is clear in migration's
> > dirty bitmap in QEMU) while that page was just being written, so
> > QEMU would never remember to migrate the new content of that page.
> > 
> > The new design is built around a dirty bitmap cache within the QEMU
> > kvm layer (one per kvm memory slot).  With it we know what was
> > dirty in the kernel at the last sync (note: the kernel bitmap keeps
> > growing all the time, so the cache will only ever be a subset of
> > the real-time kernel bitmap, but that is good enough for us), and
> > so we can be sure never to accidentally clear dirty pages we don't
> > know about.
> > 
> > This method also lets us avoid a race when multiple users (e.g.,
> > DIRTY_MEMORY_VGA and DIRTY_MEMORY_MIGRATION) want to clear the same
> > bit more than once.  Without the per-memory-slot cached dirty
> > bitmap we would not know which bits have already been cleared, so
> > if we sent the CLEAR operation for the same bit twice (or more) we
> > could still end up accidentally clearing something we shouldn't.
> > 
> > Summary: we really need to be careful about which bits we clear,
> > otherwise anything can happen after migration completes.  I hope
> > this series has taken all of that into account.
> 
> The disadvantage of this is that you won't clear in the kernel those
> dirty bits that come from other sources (e.g. vhost or
> address_space_map).  This can lead to double-copying of pages.
> 
> Migration already makes a local copy in rb->bmap, and
> memory_region_snapshot_and_clear_dirty can also do the clear.  Would it
> be possible to invoke the clear using rb->bmap instead of the KVMSlot's
> new bitmap?

Do you mean to invoke the clear notifications and deliver the bitmap
from the very top of the stack (for VGA it would probably be the
DirtyBitmapSnapshot, for MIGRATION it's the rb->bmap)?

Actually that's what I did in my first version before posting this
work, but I noticed that there seems to be a race condition in that
design.  The problem is that we have multiple copies of the same KVM
dirty bitmap, and the race can happen between those multiple users
(each user's bitmap can be a merged version containing KVM and other
sources like vhost, address_space_map, etc., but let's keep it simple
and leave those out for now).  Here's one possible race condition
(assuming migration has started); a toy sketch of the sequence follows
the steps:

  (1) page P is written by the guest (let's version its data as P1),
      its dirty bit D is set in KVM

  (2) QEMU syncs the dirty log and fetches D for page P (which is
      set).  D is not cleared in KVM due to the MANUAL_PROTECT cap.

  (3) QEMU copies the D bit to all users of dirty bmap (MIGRATION and
      VGA as example).

  (4) MIGRATION code collects bit D into the migration bitmap, clears
      it, sends CLEAR to KVM to clear the bit on the kernel side (so D
      in KVM is cleared too), and sends page P with content P1 to the
      destination.  CAUTION: at this point the VGA bitmap still has
      the D bit set.

  (5) page P is written by the guest again (let's version its data as
      P2), bit D set again in KVM.

  (6) VGA code collects bit D from the VGA bitmap (note: we haven't
      synced the log again, so this is the _old_ dirty bit from step 3
      above), clears it, and sends CLEAR to KVM to clear the bit on
      the kernel side.  The D bit in KVM is now cleared again.  At
      this point the D bit is clear in all bitmaps (MIGRATION, VGA,
      KVM).

  (7) page P is never written again before migration completes (even
      if we sync again, the D bit stays clear).

  (8) On the destination VM page P will contain content P1 rather
      than P2 - this is data loss...
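
To make the ordering concrete, here is a toy model of the race (plain
C with made-up names, not QEMU code): each user keeps its own copy of
the same KVM dirty bit, and the stale copy held by the second user
turns into a harmful CLEAR.

/* Toy model of steps (1)-(6) above -- made-up names, not QEMU code. */
#include <stdbool.h>
#include <stdio.h>

static bool kvm_dirty;       /* bit D as the kernel sees it         */
static bool migration_dirty; /* MIGRATION's copy from the last sync */
static bool vga_dirty;       /* VGA's copy from the last sync       */

static void guest_write(void)    { kvm_dirty = true; }
static void sync_dirty_log(void) { migration_dirty = vga_dirty = kvm_dirty; }
static void clear_from(bool *u)  { *u = false; kvm_dirty = false; }

int main(void)
{
    guest_write();                /* (1) guest writes P1, D set in KVM     */
    sync_dirty_log();             /* (2)(3) both users now see D set       */
    clear_from(&migration_dirty); /* (4) migration sends P1, clears D      */
    guest_write();                /* (5) guest writes P2, D set again      */
    clear_from(&vga_dirty);       /* (6) stale VGA copy causes a 2nd CLEAR */
    printf("D in KVM after (6): %d\n", kvm_dirty); /* prints 0: P2 lost */
    return 0;
}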

After I noticed this, I added the kvm bitmap layer with the lock to
make sure we won't send the 2nd CLEAR at step (6) above: with the
per-memslot bitmap we can tell that a bit has already been cleared by
one user, so we shouldn't clear it again (as long as we don't SYNC
again).  However, later on I noticed that I may not even need the top
layer bitmap at all, because it doesn't seem to bring much benefit
while adding a lot of complexity - I would need bitmap conversions
(upper layer bitmaps are per ramblock, while kvm bitmaps are per
memslot), not to mention ANDing the two bitmaps (the upper layer
bitmap and the QEMU kvm cached bitmap) to avoid the race above.  So in
the end I simply removed the top layer bitmap and only kept the
per-memslot bitmap, which is what this series does.
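
For clarity, here is a rough sketch of what the per-memslot guard
boils down to (hypothetical names, not the actual patch; it assumes
the slot's cached bitmap was filled at the last sync and that
start+count stays within the slot): only bits still recorded as dirty
in the cache are forwarded to KVM_CLEAR_DIRTY_LOG, and they are
dropped from the cache at the same time, so a second user cannot
clear them again before the next sync.

#include "qemu/osdep.h"
#include "qemu/bitmap.h"
#include "qemu/thread.h"

/* Sketch only -- hypothetical structure, not the actual patch. */
typedef struct KVMSlotCache {
    QemuMutex lock;
    unsigned long *dirty_bmap;  /* filled from KVM_GET_DIRTY_LOG at sync */
    unsigned long nr_pages;
} KVMSlotCache;

static void kvm_slot_clear_dirty(KVMSlotCache *slot,
                                 unsigned long start, unsigned long count)
{
    unsigned long *to_clear = bitmap_new(slot->nr_pages);

    qemu_mutex_lock(&slot->lock);
    /* Requested range, restricted to bits we know are dirty in KVM. */
    bitmap_set(to_clear, start, count);
    bitmap_and(to_clear, to_clear, slot->dirty_bmap, slot->nr_pages);
    /* Forget them in the cache so a second user can't clear them again. */
    bitmap_andnot(slot->dirty_bmap, slot->dirty_bmap, to_clear,
                  slot->nr_pages);
    qemu_mutex_unlock(&slot->lock);

    if (!bitmap_empty(to_clear, slot->nr_pages)) {
        /* ...issue KVM_CLEAR_DIRTY_LOG for the bits in to_clear... */
    }
    g_free(to_clear);
}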

Thanks,

-- 
Peter Xu


