Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
From: Chegu Vinod
Subject: Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
Date: Thu, 06 Jun 2013 16:51:40 -0700
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20130509 Thunderbird/17.0.6
On 6/1/2013 9:09 PM, Michael R. Hines wrote:
All,
I have successfully performed over 1000 back-to-back RDMA migrations
in an automated loop using a heavy-weight memory-stress benchmark here
at IBM.
Migration success is verified by capturing the actual serial console
output of the virtual machine while the benchmark is running and
redirecting each migration's output to a file, then checking that the
output matches the expected output of a successful migration. For half
of the 1000 migrations, I used a 14GB virtual machine (the largest VM
I can create), and for the remaining 500 migrations I used a 2GB
virtual machine (to make sure I was testing both 32-bit and 64-bit
address boundaries). The benchmark is configured for 75% stores and
25% loads and uses 80% of the allocatable free memory of the VM (i.e.
no swapping allowed).
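For reference, a minimal stand-in for such a stressor might look like
the following (this is illustrative only, not IBM's actual benchmark;
the buffer size and the 3:1 store/load mix are assumptions):

/* Sketch of a 75%-store / 25%-load memory stressor over a large
 * buffer. Size the buffer to ~80% of the guest's free memory. */
#include <stdlib.h>
#include <stdint.h>

int main(void)
{
    size_t len = (size_t)1 << 30;      /* illustrative: tune to ~80% of free RAM */
    volatile uint8_t *buf = malloc(len);
    uint64_t x = 88172645463325252ULL; /* xorshift64 PRNG state */
    uint8_t sink = 0;

    if (!buf)
        return 1;
    for (;;) {
        /* xorshift64 to pick a pseudo-random offset */
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;
        size_t off = x % len;
        if ((x & 3) != 0)              /* 3 of 4 ops: store */
            buf[off] = (uint8_t)x;
        else                           /* 1 of 4 ops: load */
            sink ^= buf[off];
    }
    return (int)sink;                  /* unreachable */
}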
I have defined a successful migration, per the output file, as follows:
1. The memory benchmark is still running and active (CPU near 100% and
memory usage is high).
2. There are no kernel panics in the console output (scanned with regex
keywords such as "panic", "BUG", "oom"; a sketch of such a scan follows
this list).
3. The VM is still responding to network activity (pings).
4. The console is still responsive: a process inside the VM prints
periodic messages to the console throughout the life of the VM using
the 'write' command in an infinite loop.
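As a rough illustration of the panic scan in criterion #2, a captured
console log can be checked with POSIX regexes. This is only a sketch;
the file name and keyword list are assumptions, not my actual harness:

/* Scan a captured serial console log for kernel-panic keywords. */
#include <regex.h>
#include <stdio.h>

int main(void)
{
    regex_t re;
    char line[4096];
    FILE *f = fopen("console.log", "r"); /* captured serial output */
    int hits = 0;

    if (!f || regcomp(&re, "panic|BUG|oom", REG_EXTENDED | REG_NOSUB))
        return 2;
    while (fgets(line, sizeof(line), f))
        if (regexec(&re, line, 0, NULL, 0) == 0)
            hits++;
    regfree(&re);
    fclose(f);
    printf("%d suspicious lines\n", hits);
    return hits ? 1 : 0;                 /* nonzero exit = failed run */
}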
With this method in a loop, I believe I've ironed out all the
regression-testing bugs that I can find. You all may find the
following bugs interesting. The original version of this patch was
written in 2010 (before my time @ IBM).
Bug #1: In the original 2010 patch, each write operation used the same
"identifier" (a "Work Request ID" in infiniband terminology).
This is not typical (though allowed by the hardware); instead, each
operation should have its own unique identifier so that the write
operation can be tracked properly as it completes.
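For readers less familiar with the verbs API, the fix boils down to
stamping each ibv_send_wr with a fresh wr_id before posting, so that
completions polled from the completion queue can be matched back to
specific writes. A minimal sketch (the counter and helper function are
assumptions, not the actual patch code; only the verbs calls are real):

/* Post one RDMA write with a unique work request ID so its
 * completion can later be matched via ibv_poll_cq(). */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

static uint64_t next_wr_id;  /* monotonically increasing identifier */

int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                    void *local, uint64_t remote_addr, uint32_t rkey,
                    uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id = next_wr_id++;          /* unique ID per operation */
    wr.opcode = IBV_WR_RDMA_WRITE;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}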
Bug #2: Also in the original 2010 patch, write operations were grouped
into separate "signaled" and "unsignaled" work requests, which is also
not typical (but allowed by the hardware). "Signaling" is infiniband
terminology for whether or not the sender is notified when an RDMA
operation has completed. (Note: the receiver is never notified - which
is what a DMA is supposed to be.)
In normal operation per the infiniband specifications, "unsignaled"
operations (which tell the hardware *not* to notify the sender of
completion) are *supposed* to be paired with a signaled operation using
the *same* work request identifier. Instead, the original patch was
using *different* work requests for signaled and unsignaled writes,
which means that most of the writes were transmitted without ever being
tracked for completion at all. (Per the infiniband specifications,
signaled and unsignaled writes must be grouped together because the
hardware does not deliver the completion notification until *all* of
the writes of the same request have actually completed.)
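In verbs terms, the usual pattern is to leave IBV_SEND_SIGNALED off on
all but the last write in a batch; when the last (signaled) write
completes, the earlier unsignaled ones on the same queue pair are known
to have finished too. A minimal sketch of that grouping (the batching
helper is an assumption, not the patch's code; the caller is assumed to
have filled in wr_id, opcode, sg_list, and the RDMA fields of each wr):

/* Chain a batch of RDMA writes so that only the final one is
 * signaled; its completion implies the unsignaled ones finished. */
#include <infiniband/verbs.h>
#include <stddef.h>

int post_write_batch(struct ibv_qp *qp, struct ibv_send_wr *wrs, int n)
{
    struct ibv_send_wr *bad_wr;
    int i;

    for (i = 0; i < n; i++) {
        wrs[i].next = (i == n - 1) ? NULL : &wrs[i + 1];
        /* Only the final write asks for a completion notification. */
        wrs[i].send_flags = (i == n - 1) ? IBV_SEND_SIGNALED : 0;
    }
    return ibv_post_send(qp, wrs, &bad_wr);
}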
Bug #3: Finally, in the original 2010 patch, ordering was not being
handled. Per the infiniband specifications, writes can complete
completely out of order. Not only that, but PCI Express itself can
reorder the writes as well. It was only after the first two bugs were
fixed that I could actually manifest this bug *in code*: What was
happening was that a very large group of requests would "burst" from
the QEMU migration thread, and not all of the requests would finish.
A short time later, the next iteration would start while the virtual
machine's writable working set was still "hovering" in the same
vicinity of the address space as the previous burst of writes that had
not yet completed. When this happened, the new writes were much smaller
(not part of a larger "chunk" per our algorithms). Since the new writes
were smaller, they would complete faster than the larger, older writes
to the same address range. Because they completed out of order, the
newer writes would then get clobbered by the older writes - resulting
in an inconsistent virtual machine.
So, to solve this: before issuing each new write, we now do a "search"
to see whether the address range of the requested write matches or
overlaps the address range of any previous "outstanding" writes still
in transit - and I found several hits. The fix was simple: block until
the conflicting write has completed before issuing the new write to
the hardware.
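A sketch of that overlap check, with hypothetical bookkeeping (the
tracking array and helpers are illustrative assumptions, not the
patch's actual code; only ibv_poll_cq() is the real verbs API):

/* Before posting a new write, block until any in-flight write whose
 * address range overlaps the new one has completed. */
#include <infiniband/verbs.h>
#include <stdbool.h>
#include <stdint.h>

struct outstanding {
    uint64_t addr, len;   /* guest address range of the write */
    uint64_t wr_id;       /* ID used when it was posted */
    bool in_flight;
};

#define MAX_OUTSTANDING 1024
static struct outstanding pending[MAX_OUTSTANDING];

static bool overlaps(uint64_t a, uint64_t alen, uint64_t b, uint64_t blen)
{
    return a < b + blen && b < a + alen;
}

static void mark_done(uint64_t wr_id)
{
    for (int i = 0; i < MAX_OUTSTANDING; i++)
        if (pending[i].in_flight && pending[i].wr_id == wr_id)
            pending[i].in_flight = false;
}

/* Returns 0 once no in-flight write overlaps [addr, addr+len). */
int wait_for_overlap(struct ibv_cq *cq, uint64_t addr, uint64_t len)
{
    for (;;) {
        bool conflict = false;
        for (int i = 0; i < MAX_OUTSTANDING; i++)
            if (pending[i].in_flight &&
                overlaps(pending[i].addr, pending[i].len, addr, len))
                conflict = true;
        if (!conflict)
            return 0;

        struct ibv_wc wc;
        int n = ibv_poll_cq(cq, 1, &wc); /* spin until one completes */
        if (n < 0)
            return -1;
        if (n == 1 && wc.status == IBV_WC_SUCCESS)
            mark_done(wc.wr_id);
        else if (n == 1)
            return -1;                   /* completion error */
    }
}

After a new write is posted, the caller would record its address range
and wr_id in the tracking structure so later writes can check against it.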
- Michael
Hi Michael,
Got some limited time on the systems, so I gave your latest bits a
quick try today (with the default of no pinning), and it seems to be
better than before.
Ran a Java warehouse workload where the guest was 85-90% busy...
For both cases:
(qemu) migrate_set_speed 40G
(qemu) migrate_set_downtime 2
(qemu) migrate -d x-rdma:<ip>:<port>
...
20VCPU/256G guest
(qemu) info migrate
capabilities: xbzrle: off x-rdma-pin-all: off
Migration status: completed
total time: 106994 milliseconds
downtime: 3795 milliseconds
transferred ram: 15425453 kbytes
throughput: 20418.27 mbps
remaining ram: 0 kbytes
total ram: 268444224 kbytes
duplicate: 64707112 pages
skipped: 0 pages
normal: 3839625 pages
normal bytes: 15358500 kbytes
----
40VCPU/512G guest <- I had more warehouse threads with a higher
heap size etc. to keep the guest busy... hence it seems to have taken
a while to converge.
(qemu) info migrate
capabilities: xbzrle: off x-rdma-pin-all: off
Migration status: completed
total time: 2470056 milliseconds
downtime: 6254 milliseconds
transferred ram: 3230142002 kbytes
throughput: 22118.67 mbps
remaining ram: 0 kbytes
total ram: 536879680 kbytes
duplicate: 127436402 pages
skipped: 0 pages
normal: 807307274 pages
normal bytes: 3229229096 kbytes
<..>