Re: [PATCH RFC 00/26] Multifd 🔀 device state transfer support with VFIO


From: Maciej S. Szmigiero
Subject: Re: [PATCH RFC 00/26] Multifd 🔀 device state transfer support with VFIO consumer
Date: Thu, 18 Apr 2024 20:14:15 +0200
User-agent: Mozilla Thunderbird

On 18.04.2024 12:39, Daniel P. Berrangé wrote:
On Thu, Apr 18, 2024 at 11:50:12AM +0200, Maciej S. Szmigiero wrote:
On 17.04.2024 18:35, Daniel P. Berrangé wrote:
On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote:
On 17.04.2024 10:36, Daniel P. Berrangé wrote:
On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote:
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
(..)
That said, the idea of reserving channels specifically for VFIO doesn't
make a whole lot of sense to me either.

Once we've done the RAM transfer, and are in the switchover phase
doing device state transfer, all the multifd channels are idle.
We should just use all those channels to transfer the device state,
in parallel.  Reserving channels just guarantees many idle channels
during RAM transfer, and further idle channels during vmstate
transfer.

IMHO it is more flexible to just use all available multifd channel
resources all the time.

The reason for having dedicated device state channels is that they
provide lower downtime in my tests.

With either 15 or 11 mixed multifd channels (no dedicated device state
channels) I get a downtime of about 1250 msec.

Comparing that with 15 total multifd channels / 4 dedicated device
state channels, which gives a downtime of about 1100 ms, using
dedicated channels brings roughly a 14% downtime improvement.
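
(For clarity, the arithmetic behind the ~14% figure, assuming it is
computed relative to the 1100 ms dedicated-channel result:

    1250 ms - 1100 ms = 150 ms saved
    150 ms / 1100 ms  ~= 13.6%, i.e. roughly 14%)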

Hmm, can you clarify /when/ the VFIO vmstate transfer is taking
place? Is it transferred concurrently with the RAM? I had thought
this series still has the RAM transfer iterations running first,
and then the VFIO VMstate at the end, simply making use of multifd
channels for parallelism of the end phase. Your reply, though, makes
me question my interpretation.

Let me try to illustrate channel flow in various scenarios, time
flowing left to right:

1. serialized RAM, then serialized VM state  (ie historical migration)

        main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State |


2. parallel RAM, then serialized VM state (ie today's multifd)

        main: | Init |                                            | VM state |
    multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
    multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
    multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |


3. parallel RAM, then parallel VM state

        main: | Init |                                            | VM state |
    multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
    multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
    multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
    multifd4:                                                     | VFIO VM state |
    multifd5:                                                     | VFIO VM state |


4. parallel RAM and VFIO VM state, then remaining VM state

        main: | Init |                                            | VM state |
    multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
    multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
    multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
    multifd4:        | VFIO VM state                                         |
    multifd5:        | VFIO VM state                                         |


I thought this series was implementing approximately (3), but are you
actually implementing (4), or something else entirely?

You are right that this series is approximately implementing the
scheme described as number 3 in your diagrams.

However, there are some additional details worth mentioning:
* There's a relatively small amount of VFIO data being transferred
from the "save_live_iterate" SaveVMHandler while the VM is still
running (a toy sketch of this handler split follows below).

This is still happening via the main migration channel.
Parallelizing this transfer in the future might make sense too,
although obviously this doesn't impact the downtime.

* After the VM is stopped and downtime starts, the main (~400 MiB)
VFIO device state gets transferred via multifd channels.

However, these multifd channels (if they are not dedicated to device
state transfer) aren't idle during that time.
Rather, they seem to be transferring the residual RAM data.

That's most likely what causes the additional observed downtime
when dedicated device state transfer multifd channels aren't used.
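
For reference, here is a minimal, self-contained toy sketch of the
handler split mentioned above. It only mimics the shape of QEMU's
SaveVMHandlers split (iterative phase vs. stop-copy phase); the names
and bodies are illustrative, not the actual VFIO migration code.

    /* Toy model only - not QEMU code. It mirrors the idea of an
     * iterative handler that runs while the VM is live and a
     * stop-copy handler that runs during downtime. */
    #include <stdio.h>
    #include <stdbool.h>

    typedef struct ToyVMStateHandlers {
        /* Called repeatedly while the VM is still running; sends
         * small, bounded chunks (today over the main channel). */
        int (*save_live_iterate)(void *opaque);
        /* Called once after the VM is stopped; sends the bulk of
         * the device state (spread over multifd in this series). */
        int (*save_live_complete)(void *opaque);
    } ToyVMStateHandlers;

    static int toy_iterate(void *opaque)
    {
        printf("iterative phase: small chunk, main channel\n");
        return 0;
    }

    static int toy_complete(void *opaque)
    {
        printf("stop-copy phase: bulk state, multifd channels\n");
        return 0;
    }

    int main(void)
    {
        ToyVMStateHandlers h = {
            .save_live_iterate = toy_iterate,
            .save_live_complete = toy_complete,
        };
        bool vm_running = true;

        while (vm_running) {
            h.save_live_iterate(NULL);
            vm_running = false;   /* pretend the data converged */
        }
        h.save_live_complete(NULL);   /* downtime starts here */
        return 0;
    }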

Ahh yes, I forgot about the residual dirty RAM; that makes sense as
an explanation. Allow me to work through the scenarios though, as I
still think my suggestion not to have separate dedicated channels is
better....


Let's say, hypothetically, we have an existing deployment today that
uses 6 multifd channels for RAM, i.e.:
         main: | Init |                                            | VM state |
     multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
     multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
     multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
     multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
     multifd5:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
     multifd6:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |

That value of 6 was chosen because it corresponds to the amount
of network & CPU utilization the admin wants to allow for this
VM to migrate. All 6 channels are fully utilized at all times.


If we now want to parallelize VFIO VM state, the peak network
and CPU utilization the admin wants to reserve for the VM should
not change. Thus the admin will still want to configure only 6
channels in total.
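
For reference, the per-VM channel budget being discussed here is the
existing "multifd-channels" migration parameter; an illustrative way
to set it on the source via QMP, given the admin's limit of 6:

    { "execute": "migrate-set-parameters",
      "arguments": { "multifd-channels": 6 } }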

With your proposal the admin has to reduce RAM transfer to 4 of the
channels, in order to then reserve 2 channels for VFIO VM state, so we
get a flow like:

         main: | Init |                                            | VM state |
     multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
     multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
     multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
     multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
     multifd5:                                                     | VFIO VM state |
     multifd6:                                                     | VFIO VM state |

This is bad, as it reduces the performance of the RAM transfer. VFIO
VM state transfer is better, but that's not a net win overall.



So let's say the admin was happy to increase the number of multifd
channels from 6 to 8.

This series proposes that they would leave RAM using 6 channels as
before, and now reserve the 2 extra ones for VFIO VM state:

         main: | Init |                                            | VM state |
     multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
     multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
     multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
     multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
     multifd5:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
     multifd6:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
     multifd7:                                                     | VFIO VM state |
     multifd8:                                                     | VFIO VM state |


RAM would perform as well as it did historically, and VM state would
improve thanks to the 2 parallel channels and to not competing with
the residual RAM transfer.

This is what your latency comparison numbers show as a benefit for
this channel reservation design.

I believe this comparison is inappropriate / unfair though, as it is
comparing a situation with 6 total channels against a situation with
8 total channels.

If the admin was happy to increase the total channels to 8, then they
should allow RAM to use all 8 channels, and then VFIO VM state +
residual RAM to also use the very same set of 8 channels:

         main: | Init |                                            | VM state |
     multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state |
     multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state |
     multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state |
     multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state |
     multifd5:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state |
     multifd6:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state |
     multifd7:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state |
     multifd8:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state |

This will speed up the initial RAM iters still further & the final
switchover phase even more. If residual RAM is larger than VFIO VM
state, then it will dominate the switchover latency, so having VFIO
VM state compete is not a problem. If VFIO VM state is larger than
residual RAM, then allowing it access to all 8 channels instead of
only 2 channels will be a clear win.
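
To put the argument in rough back-of-the-envelope form (R, V, n, k
and B below are my shorthand for residual RAM bytes, VFIO state
bytes, total channels, dedicated channels and per-channel bandwidth;
they are not figures from this thread):

    t_shared ~= (R + V) / (n * B)
    t_split  ~= max( R / ((n - k) * B),  V / (k * B) )

Since max(a/x, b/y) >= (a + b) / (x + y) for positive values, the
shared layout is never slower in this idealized model. It does ignore
per-channel and sync overheads, though.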

I re-did the measurement with an increased number of multifd
channels, first to (total count / dedicated count) 25/0, then to
100/0.

The results did not improve:
With the 25/0 mixed multifd channel config I still get around 1250
msec downtime - the same as with the 15/0 or 11/0 mixed configs I
measured earlier.

But with the (pretty insane) 100/0 mixed channel config the whole
setup gets so far into the law of diminishing returns that the
results actually get worse: the downtime is now about 1450 msec.
I guess that's from all the extra overhead from switching between 100
multifd channels.

I think one of the reasons for these results is that mixed (RAM +
device state) multifd channels participate in the RAM sync process
(MULTIFD_FLAG_SYNC) whereas dedicated device state channels don't.

It is possible that there are other subtle performance interactions too,
but I am not 100% sure about that.

With regards,
Daniel

Best regards,
Maciej



