qemu-devel

Re: [PATCH v1] docs/devel: Add VFIO device migration documentation


From: Kirti Wankhede
Subject: Re: [PATCH v1] docs/devel: Add VFIO device migration documentation
Date: Sat, 7 Nov 2020 00:27:46 +0530
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Thunderbird/68.12.1



On 11/6/2020 2:56 AM, Alex Williamson wrote:
On Fri, 6 Nov 2020 02:22:11 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

On 11/6/2020 12:41 AM, Alex Williamson wrote:
On Fri, 6 Nov 2020 00:29:36 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 11/4/2020 6:15 PM, Alex Williamson wrote:
On Wed, 4 Nov 2020 13:25:40 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 11/4/2020 1:57 AM, Alex Williamson wrote:
On Wed, 4 Nov 2020 01:18:12 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 10/30/2020 12:35 AM, Alex Williamson wrote:
On Thu, 29 Oct 2020 23:11:16 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

<snip>
+System memory dirty pages tracking
+----------------------------------
+
+A ``log_sync`` memory listener callback is added to mark system memory pages

s/is added to mark/marks those/
+as dirty which are used for DMA by VFIO device. Dirty pages bitmap is queried

s/by/by the/
s/Dirty/The dirty/
+per container. All pages pinned by vendor driver through vfio_pin_pages()

s/by/by the/
+external API have to be marked as dirty during migration. When there are CPU
+writes, CPU dirty page tracking can identify dirtied pages, but any page pinned
+by vendor driver can also be written by device. There is currently no device

s/by/by the/ (x2)
+which has hardware support for dirty page tracking. So all pages which are
+pinned by vendor driver are considered as dirty.
+Dirty pages are tracked when device is in stop-and-copy phase because if pages
+are marked dirty during pre-copy phase and content is transfered from source to
+destination, there is no way to know newly dirtied pages from the point they
+were copied earlier until device stops. To avoid repeated copy of same content,
+pinned pages are marked dirty only during stop-and-copy phase.

Let me take a quick stab at rewriting this paragraph (not sure if I
understood it correctly):

"Dirty pages are tracked when the device is in the stop-and-copy phase.
During the pre-copy phase, it is not possible to distinguish a dirty
page that has been transferred from the source to the destination from
newly dirtied pages, which would lead to repeated copying of the same
content. Therefore, pinned pages are only marked dirty during the
stop-and-copy phase." ?

I think the above rephrase only talks about repeated copying in the
pre-copy phase. I used "copied earlier until device stops" to indicate
both pre-copy and stop-and-copy, up until the device stops.
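
[For context on the mechanism in the patch text above, a rough sketch of
the ``log_sync`` path is shown below. This is only an illustrative
sketch, not the actual hw/vfio/common.c code; vfio_query_dirty_bitmap()
is a made-up placeholder for the container's dirty-bitmap query (the
VFIO_IOMMU_DIRTY_PAGES ioctl), while MemoryListener,
cpu_physical_memory_set_dirty_lebitmap() and the other calls are real
QEMU interfaces.]

#include "qemu/osdep.h"
#include "exec/memory.h"
#include "exec/ram_addr.h"

/* Hypothetical helper: stands in for the container's dirty-bitmap query. */
unsigned long *vfio_query_dirty_bitmap(hwaddr iova, hwaddr size);

static void vfio_log_sync_sketch(MemoryListener *listener,
                                 MemoryRegionSection *section)
{
    hwaddr iova = section->offset_within_address_space;
    hwaddr size = int128_get64(section->size);
    unsigned long *bitmap;

    if (!memory_region_is_ram(section->mr)) {
        return;
    }

    /* Pages pinned through vfio_pin_pages() are reported as dirty here. */
    bitmap = vfio_query_dirty_bitmap(iova, size);

    /* Feed the device's dirty pages into QEMU's RAM dirty bitmap. */
    cpu_physical_memory_set_dirty_lebitmap(bitmap,
                                           memory_region_get_ram_addr(section->mr) +
                                           section->offset_within_region,
                                           size >> TARGET_PAGE_BITS);
    g_free(bitmap);
}

static const MemoryListener vfio_dirty_listener_sketch = {
    .log_sync = vfio_log_sync_sketch,  /* registered once per container */
};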


Now I'm confused, I thought we had abandoned the idea that we can only
report pinned pages during stop-and-copy.  Doesn't the device need to
expose its dirty memory footprint during the iterative phase regardless
of whether that causes repeat copies?  If QEMU iterates and sees that
all memory is still dirty, it may have transferred more data, but it
can actually predict if it can achieve its downtime tolerances.  Which
is more important, less data transfer or predictability?  Thanks,

Even if QEMU copies and transfers the content of all system memory pages
during pre-copy (the worst case with an IOMMU-backed mdev device whose
vendor driver is not smart enough to pin pages explicitly, so all system
memory pages are marked dirty), its prediction about downtime tolerance
will still not be correct, because during stop-and-copy all pages need
to be copied again, since the device can write to any of those pinned
pages.

I think you're only reiterating my point.  If QEMU copies all of guest
memory during the iterative phase and each time it sees that all memory
is dirty, such as if CPUs or devices (including assigned devices) are
dirtying pages as fast as it copies them (or continuously marks them
dirty), then QEMU can predict that downtime will require copying all
pages.

But as of now there is no way to know if the device has dirtied pages
during the iterative phase.


This claim doesn't make any sense; pinned pages are considered
persistently dirtied, during the iterative phase and while stopped.

If instead devices don't mark dirty pages until the VM is
stopped, then QEMU might iterate through memory copy and predict a short
downtime because not much memory is dirty, only to be surprised that
all of memory is suddenly dirty.  At that point it's too late, the VM
is already stopped, the predicted short downtime takes far longer than
expected.  This is exactly why we made the kernel interface mark pinned
pages persistently dirty when it was proposed that we only report
pinned pages once.  Thanks,

Since there is no way to know if the device dirtied pages during the
iterative phase, QEMU should query pinned pages in the stop-and-copy
phase.


As above, I don't believe this is true.
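
[To put rough numbers on the downtime argument above, here is a
back-of-the-envelope sketch; all values are made up for illustration and
are not measurements from any device.]

/* Why hiding pinned pages until stop-and-copy breaks the downtime
 * estimate.  All numbers are illustrative only. */
#include <stdio.h>

int main(void)
{
    double bw_mb_per_s  = 10000.0; /* migration bandwidth               */
    double cpu_dirty_mb = 200.0;   /* CPU-dirtied pages seen per pass   */
    double pinned_mb    = 4096.0;  /* pages pinned by the vendor driver */

    /* Pinned pages reported as dirty on every pass: QEMU's estimate
     * already includes them, so the prediction matches reality. */
    double predicted_reported = (cpu_dirty_mb + pinned_mb) / bw_mb_per_s;

    /* Pinned pages hidden until the VM stops: QEMU predicts a short
     * downtime, but stop-and-copy still has to move everything. */
    double predicted_hidden = cpu_dirty_mb / bw_mb_per_s;
    double actual           = (cpu_dirty_mb + pinned_mb) / bw_mb_per_s;

    printf("predicted downtime, pinned pages reported: %4.0f ms\n",
           predicted_reported * 1000);
    printf("predicted downtime, pinned pages hidden:   %4.0f ms\n",
           predicted_hidden * 1000);
    printf("actual downtime either way:                %4.0f ms\n",
           actual * 1000);
    return 0;
}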

Whenever there is hardware support or some software mechanism to report
pages dirtied by the device, we will add a capability bit to the
migration capabilities, and based on that capability bit QEMU or the
user space app should decide whether to query dirty pages in the
iterative phase.


Yes, we could advertise support for fine granularity dirty page
tracking, but I completely disagree that we should consider pinned
pages clean until suddenly exposing them as dirty once the VM is
stopped.  Thanks,

Should QEMU copy dirtied pages twice, once during the iterative phase
and then again when the VM is stopped?

I don't understand why this is controversial.  We cannot decide within
the vfio device to only expose device dirtied pages in the final stage
of migration.  It's not our job to minimize the number of pages copied
beyond the hardware granularity.  If core QEMU migration code asks for
dirty pages, we provide them, regardless of how many times we report a
page as dirty.  So yes, if that migration code asks for dirty pages in
the iterative stage and the stopped stage, we provide them both times.

Wouldn't that increase the total migration time?

As I explained, that's not a policy decision that we as a device within
the VM should be making.  We do not have the visibility to determine
how the footprint of our device will affect the migration and by
preventing QEMU migration code from understanding the device footprint,
we're creating a scenario where QEMU absolutely cannot predict the
downtime.


If the VFIO device dirtied a lot of memory and the VFIO device reported all of those pages as dirty on every pass of the iterative phase, then migration would never converge.

If there is a way to track dirty pages during the iterative phase and a way to know which pages weren't dirtied again, then it makes sense for them to be reported during the iterative phase.

Since there is no hardware support to track pages dirtied by the device, for now the best way would be to report those during stop-and-copy.

Thanks,
Kirti

If someone wants to skip the iterative phase altogether, I imagine
there are migration parameters that allow it, but we should not be
determining that policy at the device level.  Thanks,

What is that parameter? Should that be documented here?

Dunno, but clearly we could pause the VM, do the migration, and resume on the
target.  I imagine there are migration tuning parameters that might do
essentially that automatically.  Thanks,

Alex
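
[For reference, one crude way to get the pause/migrate/resume behaviour
by hand from the HMP monitor is sketched below; there may well be a
tuning parameter that does this more gracefully. The destination host
name and port are placeholders.]

# on the source monitor: pause the guest, then start the migration
(qemu) stop
(qemu) migrate -d tcp:dest-host:4444
(qemu) info migrate        # poll until "Migration status: completed"

# on the destination (started with -incoming tcp:0:4444); the guest
# arrives paused because it was stopped on the source, so resume it:
(qemu) cont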



