From: zhanghailiang
Subject: Re: [Qemu-devel] [Migration Bug? ] Occasionally, the content of VM's memory is inconsistent between Source and Destination of migration
Date: Sat, 28 Mar 2015 09:08:50 +0800
User-agent: Mozilla/5.0 (Windows NT 6.1; rv:31.0) Gecko/20100101 Thunderbird/31.1.1

On 2015/3/27 18:51, Juan Quintela wrote:
zhanghailiang <address@hidden> wrote:
On 2015/3/26 11:52, Li Zhijian wrote:
On 03/26/2015 11:12 AM, Wen Congyang wrote:
On 03/25/2015 05:50 PM, Juan Quintela wrote:
zhanghailiang<address@hidden>  wrote:
Hi all,

We found that, sometimes, the content of the VM's memory is
inconsistent between the source side and the destination side
when we check it just after migration finishes but before the VM continues to run.

We used a patch like the one below to find this issue; you can find it
in the attachment. Steps to reproduce:

(1) Compile QEMU:
   ./configure --target-list=x86_64-softmmu  --extra-ldflags="-lssl" && make

(2) Command and output:
SRC: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu
qemu64,-kvmclock -netdev tap,id=hn0-device
virtio-net-pci,id=net-pci0,netdev=hn0 -boot c -drive
file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe
-device
virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
-vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet
-monitor stdio
Could you try to reproduce:
- without vhost
- without virtio-net
- cache=unsafe is going to give you trouble, but trouble should only
    happen after the migration of pages has finished.
If I use ide disk, it doesn't happen.
Even if I use virtio-net with vhost=on, it still doesn't happen. I guess
it is because I migrate the guest when it is booting. The virtio net
device is not used in this case.
Er~~
it reproduces with my IDE disk too;
there is no virtio device at all. My command line is as below:

x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu qemu64,-kvmclock -net none
-boot c -drive file=/home/lizj/ubuntu.raw -vnc :7 -m 2048 -smp 2 -machine
usb=off -no-user-config -nodefaults -monitor stdio -vga std

it seems easy to reproduce this issue with the following steps in an _ubuntu_ guest:
1. on the source side, choose memtest in GRUB
2. start live migration
3. exit memtest (press Esc while memory is being tested)
4. wait for migration to complete


Yes, it is a thorny problem, and it is indeed easy to reproduce, just as
your steps above show.

Thanks for the test case.  I will give it a try on Monday.  Now that
we have a test case, we should be able to instrument things.  As the
problem shows up in memtest, it clearly can't be the disk :p

OK, thanks.



This is my test result: (I also tested with accel=tcg; it can be reproduced there as well.)
Source side:
# x86_64-softmmu/qemu-system-x86_64 -machine
pc-i440fx-2.3,accel=kvm,usb=off -no-user-config -nodefaults -cpu
qemu64,-kvmclock -boot c -drive
file=/mnt/sdb/pure_IMG/ubuntu/ubuntu_14.04_server_64_2U_raw -device
cirrus-vga,id=video0,vgamem_mb=8 -vnc :7 -m 2048 -smp 2 -monitor stdio
(qemu) ACPI_BUILD: init ACPI tables
ACPI_BUILD: init ACPI tables
migrate tcp:9.61.1.8:3004
ACPI_BUILD: init ACPI tables
before cpu_synchronize_all_states
5a8f72d66732cac80d6a0d5713654c0e
md_host : before saving ram complete
5a8f72d66732cac80d6a0d5713654c0e
md_host : after saving ram complete
5a8f72d66732cac80d6a0d5713654c0e
(qemu)

Destination side:
# x86_64-softmmu/qemu-system-x86_64 -machine
pc-i440fx-2.3,accel=kvm,usb=off -no-user-config -nodefaults -cpu
qemu64,-kvmclock -boot c -drive
file=/mnt/sdb/pure_IMG/ubuntu/ubuntu_14.04_server_64_2U_raw -device
cirrus-vga,id=video0,vgamem_mb=8 -vnc :7 -m 2048 -smp 2 -monitor stdio
-incoming tcp:0:3004
(qemu) QEMU_VM_SECTION_END, after loading ram
d7cb0d8a4bdd1557fb0e78baee50c986
md_host : after loading all vmstate
d7cb0d8a4bdd1557fb0e78baee50c986
md_host : after cpu_synchronize_all_post_init
d7cb0d8a4bdd1557fb0e78baee50c986
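For context, the md_host lines above are printed by the debugging patch mentioned at the start of the thread: it hashes all guest RAM at fixed checkpoints so the digests on the source and destination can be compared. Below is a minimal self-contained sketch of that idea only, not the actual patch: the real patch used MD5 via OpenSSL (hence the -lssl in the configure line), while this sketch uses a simple FNV-1a hash, and the ram_block struct is an invented stand-in for QEMU's RAMBlock list.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* FNV-1a, folded over a buffer starting from hash state h. */
static uint64_t fnv1a(const uint8_t *buf, size_t len, uint64_t h)
{
    for (size_t i = 0; i < len; i++) {
        h ^= buf[i];
        h *= 0x100000001b3ULL;   /* FNV-1a 64-bit prime */
    }
    return h;
}

/* Hypothetical stand-in for QEMU's list of guest RAM blocks. */
struct ram_block {
    const uint8_t *host;  /* host pointer to the block's memory */
    size_t len;           /* block length in bytes */
};

/* Hash every RAM block and print the digest tagged with the
 * checkpoint name, so source and destination logs can be diffed. */
static uint64_t check_host_ram(const struct ram_block *blocks, size_t n,
                               const char *stage)
{
    uint64_t h = 0xcbf29ce484222325ULL;  /* FNV offset basis */
    for (size_t i = 0; i < n; i++) {
        h = fnv1a(blocks[i].host, blocks[i].len, h);
    }
    printf("md_host : %s : %016llx\n", stage, (unsigned long long)h);
    return h;
}
```

With calls placed before and after saving RAM on the source and after loading RAM on the destination, equal guest memory must produce equal digests; a mismatch like the one quoted above means some page content diverged.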


Thanks,
zhang


What kind of load were you having when reproducing this issue?
Just to confirm, you have been able to reproduce this without COLO
patches, right?

(qemu) migrate tcp:192.168.3.8:3004
before saving ram complete
ff703f6889ab8701e4e040872d079a28
md_host : after saving ram complete
ff703f6889ab8701e4e040872d079a28

DST: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu
qemu64,-kvmclock -netdev tap,id=hn0,vhost=on -device
virtio-net-pci,id=net-pci0,netdev=hn0 -boot c -drive
file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe
-device
virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
-vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet
-monitor stdio -incoming tcp:0:3004
(qemu) QEMU_VM_SECTION_END, after loading ram
230e1e68ece9cd4e769630e1bcb5ddfb
md_host : after loading all vmstate
230e1e68ece9cd4e769630e1bcb5ddfb
md_host : after cpu_synchronize_all_post_init
230e1e68ece9cd4e769630e1bcb5ddfb

This happens occasionally, and it is easier to reproduce when the
migration command is issued during the VM's startup.
OK, a couple of things.  Memory doesn't have to be exactly identical.
Virtio devices in particular do funny things on "post-load".  There
are no guarantees about that as far as I know; we should just end up with an
equivalent device state in memory.
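To illustrate that point with a toy model (this is invented for illustration, not QEMU's VMState API): after the RAM pages themselves are loaded, a device's post-load hook may legitimately republish derived state into guest memory, so a byte-for-byte comparison taken after device state is restored can differ even when the device state is equivalent.

```c
#include <stdint.h>

/* Invented toy device that publishes an index into a shared region
 * of guest RAM, the way some paravirtual devices do. */
struct toy_device {
    uint8_t *guest_ram;  /* shared region the device writes into */
    uint16_t next_idx;   /* migrated device field */
};

/* Hypothetical post-load hook: after the device's fields arrive on
 * the destination, rewrite the derived bytes in guest memory. */
static void toy_post_load(struct toy_device *d)
{
    d->guest_ram[0] = (uint8_t)(d->next_idx & 0xff);
    d->guest_ram[1] = (uint8_t)(d->next_idx >> 8);
}
```

This is why a checksum mismatch is only conclusive when it appears before any device post-load hooks have run, as in the memtest reproduction above.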

We have done further testing and found that some pages have been
dirtied but their corresponding bits in migration_bitmap are not set.
We can't figure out which module of QEMU misses setting the
bitmap when dirtying the VM's pages;
it is very difficult for us to trace all the code paths that dirty the VM's memory.
This seems to point to a bug in one of the devices.
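The invariant being hunted here can be sketched in a simplified model (the names and layout below are illustrative, not QEMU's real migration_bitmap code): every store into guest RAM must set the bit for the page it touches, or the final RAM-save pass will skip a modified page, which is exactly the symptom in this thread.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* Simplified model of guest RAM plus its migration dirty bitmap. */
struct guest_ram {
    uint8_t *host;         /* guest RAM backing store */
    unsigned long *dirty;  /* one bit per page */
    size_t npages;
};

static void set_dirty(struct guest_ram *g, size_t page)
{
    g->dirty[page / BITS_PER_LONG] |= 1UL << (page % BITS_PER_LONG);
}

static int test_dirty(const struct guest_ram *g, size_t page)
{
    return !!(g->dirty[page / BITS_PER_LONG] &
              (1UL << (page % BITS_PER_LONG)));
}

/* Correct write path: the store and the bitmap update go together.
 * A code path that writes g->host directly without marking the bitmap
 * leaves a page that is modified but never re-sent to the destination. */
static void guest_write(struct guest_ram *g, size_t addr,
                        const void *buf, size_t len)
{
    memcpy(g->host + addr, buf, len);
    for (size_t p = addr >> PAGE_SHIFT;
         p <= (addr + len - 1) >> PAGE_SHIFT; p++) {
        set_dirty(g, p);
    }
}
```

Instrumenting the suspect write paths with a check of this form (page modified but bit clear) is one way to catch the offending module in the act.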

Actually, the first time we found this problem was during COLO FT
development, where it triggered some strange issues in the
VM which all pointed to inconsistency of the VM's
memory. (We tried saving all of the VM's memory to the slave side every
time we did a checkpoint in COLO FT, and then everything was OK.)

Is it OK for some pages not to be transferred to the destination during
migration? Or is it a bug?
The pages transferred should be the same; it is only after the device state
is transmitted that things could change.

This issue has blocked our COLO development... :(

Any help will be greatly appreciated!
Later, Juan.










