Thanks,
zhanghailiang
/var/log/messages:
Mar 31 18:32:53 master kernel: [15908.524721] memslot dirty bitmap of '4' was
destroyed, base_gfn 0x100, npages 524032
Mar 31 18:32:53 master kernel: [15908.657853] +++ gfn=0xcc is marked dirty,
id=2, base_gfn=0xc0, npages=524096
Mar 31 18:32:53 master kernel: [15908.665105] +++ gfn=0xcb is marked dirty,
id=2, base_gfn=0xc0, npages=524096
Mar 31 18:32:53 master kernel: [15908.672360] +++ gfn=0xca is marked dirty,
id=2, base_gfn=0xc0, npages=524096
Mar 31 18:32:53 master kernel: [15908.682058] memslot dirty bitmap of '4' was
destroyed, base_gfn 0xfebc0, npages 16
Mar 31 18:32:53 master kernel: [15908.849527] +++ gfn=0xca is marked dirty,
id=2, base_gfn=0xc0, npages=524096
Mar 31 18:32:53 master kernel: [15908.856845] +++ gfn=0xcb is marked dirty,
id=2, base_gfn=0xc0, npages=524096
Mar 31 18:32:53 master kernel: [15908.864161] +++ gfn=0xcc is marked dirty,
id=2, base_gfn=0xc0, npages=524096
Mar 31 18:32:53 master kernel: [15908.872676] memslot dirty bitmap of '4' was
destroyed, base_gfn 0xfeb80, npages 64
Mar 31 18:32:53 master kernel: [15908.882694] +++ gfn=0xca is marked dirty,
id=2, base_gfn=0xc0, npages=524096
Mar 31 18:32:53 master kernel: [15908.889948] +++ gfn=0xcb is marked dirty,
id=2, base_gfn=0xc0, npages=524096
Mar 31 18:32:53 master kernel: [15908.897202] +++ gfn=0xcc is marked dirty,
id=2, base_gfn=0xc0, npages=524096
Mar 31 18:32:53 master kernel: [15908.911652] +++ gfn=0xca is marked dirty,
id=2, base_gfn=0xc0, npages=524096
Mar 31 18:32:53 master kernel: [15908.919543] +++ gfn=0xca is marked dirty,
id=2, base_gfn=0xc0, npages=524096
Mar 31 18:32:53 master kernel: [15908.927419] +++ gfn=0xcb is marked dirty,
id=2, base_gfn=0xc0, npages=524096
Mar 31 18:32:53 master kernel: [15908.935317] +++ gfn=0xcb is marked dirty,
id=2, base_gfn=0xc0, npages=524096
Mar 31 18:32:53 master kernel: [15908.943209] +++ gfn=0xcb is marked dirty,
id=2, base_gfn=0xc0, npages=524096
Mar 31 18:32:53 master kernel: [15908.951083] +++ gfn=0xcb is marked dirty,
id=2, base_gfn=0xc0, npages=524096
Mar 31 18:32:53 master kernel: [15908.958971] +++ gfn=0xcc is marked dirty,
id=2, base_gfn=0xc0, npages=524096
Mar 31 18:32:53 master kernel: [15908.966837] +++ gfn=0xcc is marked dirty,
id=2, base_gfn=0xc0, npages=524096
Mar 31 18:32:53 master kernel: [15908.974707] +++ gfn=0xcc is marked dirty,
id=2, base_gfn=0xc0, npages=524096
Mar 31 18:32:53 master kernel: [15908.988470] memslot dirty bitmap of '2' was
destroyed, base_gfn 0xc0, npages 524096
Mar 31 18:32:53 master kernel: [15909.002403] kvm: zapping shadow pages for
mmio generation wraparound
Mar 31 18:32:53 master kernel: [15909.002523] memslot dirty bitmap of '2' was
destroyed, base_gfn 0xc0, npages 8
Mar 31 18:32:53 master kernel: [15909.010110] memslot dirty bitmap of '4' was
destroyed, base_gfn 0xc8, npages 524088
Mar 31 18:32:53 master kernel: [15909.023988] memslot dirty bitmap of '5' was
destroyed, base_gfn 0xcd, npages 3
Mar 31 18:32:53 master kernel: [15909.031594] memslot dirty bitmap of '6' was
destroyed, base_gfn 0xd0, npages 524080
Mar 31 18:32:53 master kernel: [15909.044708] memslot dirty bitmap of '5' was
destroyed, base_gfn 0xcd, npages 11
Mar 31 18:32:53 master kernel: [15909.052392] memslot dirty bitmap of '6' was
destroyed, base_gfn 0xd8, npages 524072
Mar 31 18:32:53 master kernel: [15909.065651] memslot dirty bitmap of '5' was
destroyed, base_gfn 0xcd, npages 19
Mar 31 18:32:53 master kernel: [15909.073329] memslot dirty bitmap of '6' was
destroyed, base_gfn 0xe0, npages 524064
Mar 31 18:32:53 master kernel: [15909.086379] memslot dirty bitmap of '5' was
destroyed, base_gfn 0xcd, npages 27
Mar 31 18:32:53 master kernel: [15909.094084] memslot dirty bitmap of '6' was
destroyed, base_gfn 0xe8, npages 524056
Mar 31 18:32:53 master kernel: [15909.107354] memslot dirty bitmap of '6' was
destroyed, base_gfn 0xec, npages 524052
Mar 31 18:32:54 master dhcpcd[5408]: eth0: timed out
Mar 31 18:32:54 master dhcpcd[5408]: eth0: trying to use old lease in
`/var/lib/dhcpcd/dhcpcd-eth0.info'
Mar 31 18:32:54 master dhcpcd[5408]: eth0: lease expired 30265 seconds ago
Mar 31 18:32:54 master dhcpcd[5408]: eth0: broadcasting for a lease
Mar 31 18:33:02 master qemu-system-x86_64: ====qemu: The 0 times do
checkpoing===
Mar 31 18:33:02 master qemu-system-x86_64: == migration_bitmap_sync ==
Mar 31 18:33:02 master qemu-system-x86_64: qemu: --- addr=0xca000
(hva=0x7f4fa32ca000), who's bitmap not set ---
Mar 31 18:33:02 master qemu-system-x86_64: qemu: --- addr=0xcb000
(hva=0x7f4fa32cb000), who's bitmap not set ---
Mar 31 18:33:02 master qemu-system-x86_64: qemu: --- addr=0xcc000
(hva=0x7f4fa32cc000), who's bitmap not set ---
Mar 31 18:33:05 master kernel: [15921.057246] device eth2 left promiscuous mode
Mar 31 18:33:07 master avahi-daemon[5773]: Withdrawing address record for
fe80::90be:cfff:fe9e:f03e on tap0.
Mar 31 18:33:07 master kernel: [15922.513313] br0: port 2(tap0) entered
disabled state
Mar 31 18:33:07 master kernel: [15922.513480] device tap0 left promiscuous mode
Mar 31 18:33:07 master kernel: [15922.513508] br0: port 2(tap0) entered
disabled state
Mar 31 18:33:07 master kernel: [15922.562591] memslot dirty bitmap of '8' was
destroyed, base_gfn 0x100, npages 524032
Mar 31 18:33:07 master kernel: [15922.570948] memslot dirty bitmap of '3' was
destroyed, base_gfn 0xfd000, npages 4096
Mar 31 18:33:07 master kernel: [15922.578967] memslot dirty bitmap of '0' was
destroyed, base_gfn 0x0, npages 160
Mar 31 18:33:07 master kernel: [15922.586556] memslot dirty bitmap of '1' was
destroyed, base_gfn 0xfffc0, npages 64
Mar 31 18:33:07 master kernel: [15922.586574] memslot dirty bitmap of '5' was
destroyed, base_gfn 0xcd, npages 31
Mar 31 18:33:07 master kernel: [15922.586575] memslot dirty bitmap of '7' was
destroyed, base_gfn 0xf0, npages 16
Mar 31 18:33:07 master kernel: [15922.586577] memslot dirty bitmap of '2' was
destroyed, base_gfn 0xc0, npages 10
Mar 31 18:33:07 master kernel: [15922.586578] memslot dirty bitmap of '6' was
destroyed, base_gfn 0xec, npages 4
Mar 31 18:33:07 master kernel: [15922.586579] memslot dirty bitmap of '4' was
destroyed, base_gfn 0xca, npages 3
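To read the log above, it can help to check which memslot a given gfn falls into as the slots are resized. The following is a minimal sketch in plain user-space C, assuming a slot covers the half-open range [base_gfn, base_gfn + npages). The struct and function names are hypothetical stand-ins for the fields printed in the log, not KVM's own definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields printed in the log; not KVM's struct. */
struct slot_range {
    int id;
    uint64_t base_gfn;
    uint64_t npages;
};

/* True if gfn lies in [base_gfn, base_gfn + npages). */
static bool slot_contains(const struct slot_range *s, uint64_t gfn)
{
    assert(s != NULL);
    return gfn >= s->base_gfn && gfn - s->base_gfn < s->npages;
}

/* Return the id of the first slot containing gfn, or -1 if none does. */
static int find_slot_id(const struct slot_range *slots, int n, uint64_t gfn)
{
    for (int i = 0; i < n; i++) {
        if (slot_contains(&slots[i], gfn)) {
            return slots[i].id;
        }
    }
    return -1;
}
```

With the ranges printed in the log, gfn 0xca lies in slot 2 while that slot spans base_gfn 0xc0, npages 524096. Once slot 2 is reported with npages 8 (0xc0 to 0xc7), the same gfn lies in the range reported for slot 4 (base_gfn 0xc8, npages 524088).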
PS:
QEMU:
static void migration_bitmap_sync(void)
{
trace_migration_bitmap_sync_end(migration_dirty_pages
- num_dirty_pages_init);
num_dirty_pages_period += migration_dirty_pages - num_dirty_pages_init;
end_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+ syslog(LOG_INFO, "== migration_bitmap_sync ==");
+ check_bitmap();
+void check_bitmap(void)
+{
+ int i;
+ char *host;
+ ram_addr_t addr[3] = { 0xca000, 0xcb000, 0xcc000 };
+ RAMBlock *block = NULL;
+ int ret;
+
+ block = QLIST_FIRST_RCU(&ram_list.blocks);
+ for (i = 0; i < 3; i++) {
+ unsigned long nr = block->mr->ram_addr + (addr[i] >> TARGET_PAGE_BITS);
+ host = memory_region_get_ram_ptr(block->mr) + addr[i];
+ ret = test_bit(nr, migration_bitmap);
+ if (ret == 0) {
+ syslog(LOG_INFO, "qemu: --- addr=0x%llx (hva=%p), who's bitmap not set ---\n",
+ (unsigned long long)addr[i], host);
+ } else {
+ syslog(LOG_INFO, "qemu: +++ OK, addr=0x%llx (hva=%p), who's bitmap is set +++\n",
+ (unsigned long long)addr[i], host);
+ }
+ }
+}
KVM:
static void mark_page_dirty_in_slot(struct kvm *kvm,
struct kvm_memory_slot *memslot,
gfn_t gfn)
{
+ /* Check memslot itself first so the debug test cannot dereference NULL. */
+ if ((gfn == 0xca || gfn == 0xcb || gfn == 0xcc) &&
+ (!memslot || !memslot->dirty_bitmap)) {
+ printk("oops, dirty_bitmap is null, gfn=0x%llx\n", gfn);
+ }
if (memslot && memslot->dirty_bitmap) {
unsigned long rel_gfn = gfn - memslot->base_gfn;
+ if (gfn == 0xca || gfn == 0xcb || gfn == 0xcc) {
+ printk("+++ gfn=0x%llx is marked dirty, id=%d, base_gfn=0x%llx, npages=%lu\n",
+ gfn, memslot->id, memslot->base_gfn, memslot->npages);
+ }
set_bit_le(rel_gfn, memslot->dirty_bitmap);
}
}
static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
{
if (!memslot->dirty_bitmap)
return;
+ if (memslot->id != 9) {
+ printk("memslot dirty bitmap of '%d' was destroyed, base_gfn 0x%llx, npages %lu\n",
+ memslot->id, memslot->base_gfn, memslot->npages);
+ }
kvm_kvfree(memslot->dirty_bitmap);
memslot->dirty_bitmap = NULL;
}
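One possible reading of the instrumentation above is that a page is marked dirty in a slot's bitmap and the bitmap is then destroyed before userspace has read it. The toy model below is a user-space sketch of that sequence only, not kernel code: every name is hypothetical, the bitmap is a plain heap array, and it makes no claim about where the actual bug lies.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical toy slot; dirty_bitmap is NULL while logging is off. */
struct toy_slot {
    uint64_t base_gfn;
    uint64_t npages;
    unsigned long *dirty_bitmap;
};

/* Allocate a zeroed bitmap with one bit per page. */
static bool toy_enable_logging(struct toy_slot *s)
{
    size_t words = (s->npages + BITS_PER_WORD - 1) / BITS_PER_WORD;
    s->dirty_bitmap = calloc(words, sizeof(unsigned long));
    return s->dirty_bitmap != NULL;
}

/* Like the patched function above: a no-op when there is no bitmap. */
static void toy_mark_dirty(struct toy_slot *s, uint64_t gfn)
{
    if (s && s->dirty_bitmap) {
        assert(gfn >= s->base_gfn && gfn - s->base_gfn < s->npages);
        uint64_t rel = gfn - s->base_gfn;
        s->dirty_bitmap[rel / BITS_PER_WORD] |= 1UL << (rel % BITS_PER_WORD);
    }
}

/* Report the dirty bit for gfn; false when the slot has no bitmap. */
static bool toy_is_dirty(const struct toy_slot *s, uint64_t gfn)
{
    if (!s || !s->dirty_bitmap) {
        return false;
    }
    uint64_t rel = gfn - s->base_gfn;
    return (s->dirty_bitmap[rel / BITS_PER_WORD] >> (rel % BITS_PER_WORD)) & 1UL;
}

/* Free the bitmap; any bits not yet read by the caller are lost. */
static void toy_destroy_bitmap(struct toy_slot *s)
{
    free(s->dirty_bitmap);
    s->dirty_bitmap = NULL;
}
```

In this model, marking gfn 0xca and then calling toy_destroy_bitmap() before reading the bit leaves no record that the page was ever dirtied, which is the pattern the log would show if the bits were never collected in between.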
filed against migration during windows boot, I'd not considered that it might
be devices not updating the bitmap.
Dave
Thanks,
zhanghailiang
What kind of load were you having when reproducing this issue?
Just to confirm, you have been able to reproduce this without COLO
patches, right?
(qemu) migrate tcp:192.168.3.8:3004
before saving ram complete
ff703f6889ab8701e4e040872d079a28
md_host : after saving ram complete
ff703f6889ab8701e4e040872d079a28
DST: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu qemu64,-kvmclock
-netdev tap,id=hn0,vhost=on -device virtio-net-pci,id=net-pci0,netdev=hn0 -boot
c -drive
file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe
-device
virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -vnc
:7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio
-incoming tcp:0:3004
(qemu) QEMU_VM_SECTION_END, after loading ram
230e1e68ece9cd4e769630e1bcb5ddfb
md_host : after loading all vmstate
230e1e68ece9cd4e769630e1bcb5ddfb
md_host : after cpu_synchronize_all_post_init
230e1e68ece9cd4e769630e1bcb5ddfb
This happens occasionally, and it is easier to reproduce when the migration
command is issued during the VM's startup.
OK, a couple of things. Memory doesn't have to be exactly identical.
Virtio devices in particular do funny things on "post-load". There
are no guarantees for that as far as I know; we should end up with an
equivalent device state in memory.
We have done further tests and found that some pages have been dirtied but their
corresponding bits in migration_bitmap are not set.
We can't figure out which module of QEMU fails to set the bitmap when it dirties
a page of the VM;
it is very difficult for us to trace every action that dirties the VM's pages.
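One way to narrow this down is to compare a saved snapshot of guest RAM with its current contents and report the pages that differ while their dirty bit is clear. The following is a minimal sketch, assuming 4 KiB pages and a bitmap with one bit per page; the function name and memory layout are illustrative only and are not QEMU's API.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE_BYTES 4096u
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Test bit nr in a bitmap stored as an array of unsigned long. */
static bool bit_is_set(const unsigned long *bitmap, size_t nr)
{
    return (bitmap[nr / BITS_PER_WORD] >> (nr % BITS_PER_WORD)) & 1UL;
}

/*
 * Scan npages pages. A page is "missed" if its contents differ between
 * old_ram and new_ram while its bit in dirty_bitmap is clear. Up to
 * max_out missed page indexes are stored in out (out may be NULL).
 * Returns the total number of missed pages.
 */
static size_t find_missed_pages(const unsigned char *old_ram,
                                const unsigned char *new_ram,
                                size_t npages,
                                const unsigned long *dirty_bitmap,
                                size_t *out, size_t max_out)
{
    size_t found = 0;

    assert(old_ram && new_ram && dirty_bitmap);
    for (size_t i = 0; i < npages; i++) {
        size_t off = i * PAGE_SIZE_BYTES;
        bool changed = memcmp(old_ram + off, new_ram + off,
                              PAGE_SIZE_BYTES) != 0;

        if (changed && !bit_is_set(dirty_bitmap, i)) {
            if (out && found < max_out) {
                out[found] = i;
            }
            found++;
        }
    }
    return found;
}
```

A check of this kind only identifies which pages were missed, not who wrote to them, so it complements rather than replaces tracing the writers.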
This seems to point to a bug in one of the devices.
Actually, we first found this problem during COLO FT development,
where it triggered some strange issues in the
VM, all of which pointed to inconsistent VM memory. (We have tried
saving all of the VM's memory to the slave side at every
checkpoint in COLO FT, and then everything is OK.)
Is it OK for some pages not to be transferred to the destination during migration?
Or is it a bug?
The pages transferred should be the same; it is after device state transmission
that things could change.
This issue has blocked our COLO development... :(
Any help will be greatly appreciated!
Later, Juan.
--
Dr. David Alan Gilbert / address@hidden / Manchester, UK