From: Jason Wang
Subject: Re: [Qemu-devel] [Migration Bug? ] Occasionally, the content of VM's memory is inconsistent between Source and Destination of migration
Date: Fri, 03 Apr 2015 16:51:11 +0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0


On 04/02/2015 07:52 PM, zhanghailiang wrote:
> On 2015/4/1 3:06, Dr. David Alan Gilbert wrote:
>> * zhanghailiang (address@hidden) wrote:
>>> On 2015/3/30 15:59, Dr. David Alan Gilbert wrote:
>>>> * zhanghailiang (address@hidden) wrote:
>>>>> On 2015/3/27 18:18, Dr. David Alan Gilbert wrote:
>>>>>> * zhanghailiang (address@hidden) wrote:
>>>>>>> On 2015/3/26 11:52, Li Zhijian wrote:
>>>>>>>> On 03/26/2015 11:12 AM, Wen Congyang wrote:
>>>>>>>>> On 03/25/2015 05:50 PM, Juan Quintela wrote:
>>>>>>>>>> zhanghailiang<address@hidden>  wrote:
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> We found that, sometimes, the content of the VM's memory is
>>>>>>>>>>> inconsistent between the Source side and the Destination side
>>>>>>>>>>> when we check it just after migration finishes but before the
>>>>>>>>>>> VM continues to run.
>>>>>>>>>>>
>>>>>>>>>>> We use a patch like the one below (attached) to detect this
>>>>>>>>>>> issue. Steps to reproduce:
>>>>>>>>>>>
>>>>>>>>>>> (1) Compile QEMU:
>>>>>>>>>>>   ./configure --target-list=x86_64-softmmu 
>>>>>>>>>>> --extra-ldflags="-lssl" && make
>>>>>>>>>>>
>>>>>>>>>>> (2) Command and output:
>>>>>>>>>>> SRC: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu
>>>>>>>>>>> qemu64,-kvmclock -netdev tap,id=hn0 -device
>>>>>>>>>>> virtio-net-pci,id=net-pci0,netdev=hn0 -boot c -drive
>>>>>>>>>>> file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe
>>>>>>>>>>> -device
>>>>>>>>>>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
>>>>>>>>>>> -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device
>>>>>>>>>>> usb-tablet -monitor stdio
>>>>>>>>>> Could you try to reproduce:
>>>>>>>>>> - without vhost
>>>>>>>>>> - without virtio-net
>>>>>>>>>> - cache=unsafe is going to give you trouble, but trouble
>>>>>>>>>> should only
>>>>>>>>>>    happen after migration of the pages has finished.
>>>>>>>>> If I use an IDE disk, it doesn't happen.
>>>>>>>>> Even if I use virtio-net with vhost=on, it still doesn't
>>>>>>>>> happen. I guess it is because I migrate the guest while it is
>>>>>>>>> booting, so the virtio-net device is not used in this case.
>>>>>>>> Er...
>>>>>>>> It reproduces with my IDE disk too;
>>>>>>>> there is no virtio device at all. My command line is like below:
>>>>>>>>
>>>>>>>> x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu
>>>>>>>> qemu64,-kvmclock -net none
>>>>>>>> -boot c -drive file=/home/lizj/ubuntu.raw -vnc :7 -m 2048 -smp
>>>>>>>> 2 -machine
>>>>>>>> usb=off -no-user-config -nodefaults -monitor stdio -vga std
>>>>>>>>
>>>>>>>> it seems easy to reproduce this issue with the following steps
>>>>>>>> in an _ubuntu_ guest:
>>>>>>>> 1. on the source side, choose memtest in grub
>>>>>>>> 2. do live migration
>>>>>>>> 3. exit memtest (press Esc while the memory test is running)
>>>>>>>> 4. wait for the migration to complete
>>>>>>>>
>>>>>>>
>>>>>>> Yes, it is a thorny problem. It is indeed easy to reproduce,
>>>>>>> just as with your steps above.
>>>>>>>
>>>>>>> This is my test result (I also tested accel=tcg; it can be
>>>>>>> reproduced there as well):
>>>>>>> Source side:
>>>>>>> # x86_64-softmmu/qemu-system-x86_64 -machine
>>>>>>> pc-i440fx-2.3,accel=kvm,usb=off -no-user-config -nodefaults 
>>>>>>> -cpu qemu64,-kvmclock -boot c -drive
>>>>>>> file=/mnt/sdb/pure_IMG/ubuntu/ubuntu_14.04_server_64_2U_raw
>>>>>>> -device cirrus-vga,id=video0,vgamem_mb=8 -vnc :7 -m 2048 -smp 2
>>>>>>> -monitor stdio
>>>>>>> (qemu) ACPI_BUILD: init ACPI tables
>>>>>>> ACPI_BUILD: init ACPI tables
>>>>>>> migrate tcp:9.61.1.8:3004
>>>>>>> ACPI_BUILD: init ACPI tables
>>>>>>> before cpu_synchronize_all_states
>>>>>>> 5a8f72d66732cac80d6a0d5713654c0e
>>>>>>> md_host : before saving ram complete
>>>>>>> 5a8f72d66732cac80d6a0d5713654c0e
>>>>>>> md_host : after saving ram complete
>>>>>>> 5a8f72d66732cac80d6a0d5713654c0e
>>>>>>> (qemu)
>>>>>>>
>>>>>>> Destination side:
>>>>>>> # x86_64-softmmu/qemu-system-x86_64 -machine
>>>>>>> pc-i440fx-2.3,accel=kvm,usb=off -no-user-config -nodefaults 
>>>>>>> -cpu qemu64,-kvmclock -boot c -drive
>>>>>>> file=/mnt/sdb/pure_IMG/ubuntu/ubuntu_14.04_server_64_2U_raw
>>>>>>> -device cirrus-vga,id=video0,vgamem_mb=8 -vnc :7 -m 2048 -smp 2
>>>>>>> -monitor stdio -incoming tcp:0:3004
>>>>>>> (qemu) QEMU_VM_SECTION_END, after loading ram
>>>>>>> d7cb0d8a4bdd1557fb0e78baee50c986
>>>>>>> md_host : after loading all vmstate
>>>>>>> d7cb0d8a4bdd1557fb0e78baee50c986
>>>>>>> md_host : after cpu_synchronize_all_post_init
>>>>>>> d7cb0d8a4bdd1557fb0e78baee50c986
>>>>>>
>>>>>> Hmm, that's not good.  I suggest you md5 each of the RAMBlocks
>>>>>> individually, to see if it's main RAM that's different or
>>>>>> something more subtle like video RAM.
>>>>>>
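A minimal sketch of what such a per-RAMBlock hash could look like, assuming a
2.3-era tree (ram_list.blocks walked under rcu_read_lock(), RAMBlock fields
idstr/host/used_length) and OpenSSL's MD5(), matching the -lssl link flag used
above; the helper name and exact field names are assumptions, not code from
this thread:

    #include <stdio.h>
    #include <openssl/md5.h>

    /* Print one MD5 per RAMBlock so the mismatching block can be spotted. */
    static void md5_each_ramblock(const char *tag)
    {
        RAMBlock *block;
        unsigned char md[MD5_DIGEST_LENGTH];
        int i;

        rcu_read_lock();
        QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
            MD5(block->host, block->used_length, md);
            printf("%s %s: ", tag, block->idstr);
            for (i = 0; i < MD5_DIGEST_LENGTH; i++) {
                printf("%02x", md[i]);
            }
            printf("\n");
        }
        rcu_read_unlock();
    }

Calling this at the same points as the existing md_host prints would show
which block actually diverges.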
>>>>>
>>>>> Er, all my previous tests md5'd the 'pc.ram' block only.
>>>>>
>>>>>> But then maybe it's easier just to dump the whole of RAM to file
>>>>>> and byte compare it (hexdump the two dumps and diff ?)
>>>>>
>>>>> Hmm, we also used memcmp to compare every page, but the addresses
>>>>> of the differing pages seem to be random.
>>>>>
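A self-contained sketch of that byte-compare idea: diff two raw RAM dumps
(one taken on each side) page by page and print the offsets that differ. The
4 KiB page size and the assumption that both dumps are the same size are mine,
not from the thread:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s src.ram dst.ram\n", argv[0]);
            return 1;
        }
        FILE *a = fopen(argv[1], "rb");
        FILE *b = fopen(argv[2], "rb");
        if (!a || !b) {
            perror("fopen");
            return 1;
        }

        unsigned char pa[PAGE_SIZE], pb[PAGE_SIZE];
        unsigned long page = 0, diffs = 0;
        size_t ra, rb;

        /* Dumps are expected to be the same size. */
        while ((ra = fread(pa, 1, PAGE_SIZE, a)) > 0 &&
               (rb = fread(pb, 1, PAGE_SIZE, b)) > 0) {
            if (ra != rb || memcmp(pa, pb, ra) != 0) {
                printf("page %lu (offset 0x%lx) differs\n",
                       page, page * (unsigned long)PAGE_SIZE);
                diffs++;
            }
            page++;
        }
        printf("%lu differing page(s) out of %lu\n", diffs, page);
        fclose(a);
        fclose(b);
        return diffs ? 2 : 0;
    }

The printed offsets can then be matched against the RAMBlock layout to see
whether the random-looking addresses cluster in any particular region.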
>>>>> Besides, in our previous tests, we found it seems to be easier to
>>>>> reproduce when migration occurs during the VM's start-up or reboot
>>>>> process.
>>>>>
>>>>> Is it possible that some devices get special treatment during VM
>>>>> start-up which may miss setting the dirty bitmap?
>>>>
>>>> I don't think there should be, but the code paths used during
>>>> startup are probably much less tested with migration.  I'm sure the
>>>> startup code uses different parts of device emulation.  I do know we
>>>> have some bugs
>>>
>>> Er, maybe there is a special case:
>>>
>>> During the VM's start-up, I found that the KVM slots changed many
>>> times; it was a process of smashing the total memory space into
>>> smaller slots.
>>>
>>> If some pages were dirtied and their bits were set in the KVM
>>> module's dirty bitmap, but we didn't sync that bitmap to QEMU
>>> user-space before the slot was smashed and its previous bitmap
>>> destroyed, the dirty pages recorded in the previous KVMSlot may be
>>> missed.
>>>
>>> What's your opinion? Can the situation I described above happen?
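A toy model of the suspected ordering problem (purely illustrative, not QEMU
code; all names here are invented): if the per-slot dirty log is freed before
user space harvests it, the dirty bit never reaches the migration bitmap.

    #include <stdio.h>
    #include <stdlib.h>

    #define PAGES 16

    typedef struct {
        unsigned char dirty[PAGES];     /* per-slot dirty log ("KVM side") */
    } ToySlot;

    static unsigned char migration_bitmap[PAGES];  /* "QEMU side" bitmap */

    static void sync_slot(ToySlot *s)   /* harvest the slot's dirty bits */
    {
        for (int i = 0; i < PAGES; i++) {
            migration_bitmap[i] |= s->dirty[i];
            s->dirty[i] = 0;
        }
    }

    int main(int argc, char **argv)
    {
        (void)argv;
        /* run with any argument to model "sync before destroy" */
        int sync_before_destroy = argc > 1;
        ToySlot *slot = calloc(1, sizeof(*slot));

        slot->dirty[3] = 1;             /* guest dirties page 3 during boot */

        if (sync_before_destroy) {
            sync_slot(slot);            /* what the patch further down aims for */
        }
        free(slot);                     /* slot smashed: its dirty log is gone */

        printf("page 3 marked for resend: %s\n",
               migration_bitmap[3] ? "yes" : "NO (lost)");
        return 0;
    }

Run without an argument, page 3 is never re-sent; run with one, the
sync-before-destroy ordering keeps it in the migration bitmap.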
>>>
>>> The log below was grabbed when I tried to figure out a very similar
>>> question (some pages missing their dirty-bitmap setting) that we
>>> found in COLO:
>>> Occasionally, there will be an error report on the SLAVE side:
>>>
>>>      qemu: warning: error while loading state for instance 0x0 of
>>>      device 'kvm-tpr-opt'
>>>      qemu-system-x86_64: loadvm failed
>>>
>>> We found that it is related to three addresses (gpa: 0xca000,
>>> 0xcb000, 0xcc000, which are the addresses of 'kvmvapic.rom'?), and
>>> sometimes their corresponding dirty bits will be missed on the
>>> Master side, because their KVMSlot is destroyed before we sync its
>>> dirty bitmap to QEMU.
>>>
>>> (I'm still not quite sure whether this can also happen in normal
>>> migration; I will try to test it there.)
>>
> Hi,
>
> We have found two bugs (places) that miss setting the migration bitmap
> for dirty pages.
> The virtio-blk related one can be fixed by Wen Congyang's patch; you
> can find his reply in the list.
> The 'kvm-tpr-opt' related one can be fixed by the following patch.
>
> Thanks,
> zhang
>
> From 0c63687d0f14f928d6eb4903022a7981db6ba59f Mon Sep 17 00:00:00 2001
> From: zhanghailiang <address@hidden>
> Date: Thu, 2 Apr 2015 19:26:31 +0000
> Subject: [PATCH] kvm-all: Sync dirty-bitmap from kvm before kvm
>  destroys the corresponding dirty_bitmap
>
> Sometimes, we destroy the dirty_bitmap in kvm_memory_slot before any
> sync action occurs; the bits set in that dirty_bitmap are then lost,
> which leads to the corresponding dirty pages being missed in migration.
>
> This usually happens when migration is done during the VM's start-up
> or reboot.
>
> Signed-off-by: zhanghailiang <address@hidden>
> ---
>  exec.c    | 2 +-
>  kvm-all.c | 4 +++-
>  2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/exec.c b/exec.c
> index 874ecfc..4b1b39b 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -59,7 +59,7 @@
>  //#define DEBUG_SUBPAGE
>
>  #if !defined(CONFIG_USER_ONLY)
> -static bool in_migration;
> +bool in_migration;
>
>  /* ram_list is read under rcu_read_lock()/rcu_read_unlock().  Writes
>   * are protected by the ramlist lock.
> diff --git a/kvm-all.c b/kvm-all.c
> index 335438a..dd75eff 100644
> --- a/kvm-all.c
> +++ b/kvm-all.c
> @@ -128,6 +128,8 @@ bool kvm_allowed;
>  bool kvm_readonly_mem_allowed;
>  bool kvm_vm_attributes_allowed;
>
> +extern bool in_migration;
> +
>  static const KVMCapabilityInfo kvm_required_capabilites[] = {
>      KVM_CAP_INFO(USER_MEMORY),
>      KVM_CAP_INFO(DESTROY_MEMORY_REGION_WORKS),
> @@ -715,7 +717,7 @@ static void kvm_set_phys_mem(MemoryRegionSection *section, bool add)
>
>          old = *mem;
>
> -        if (mem->flags & KVM_MEM_LOG_DIRTY_PAGES) {
> +        if (mem->flags & KVM_MEM_LOG_DIRTY_PAGES || in_migration) {
>              kvm_physical_sync_dirty_bitmap(section);
>          }
>
> -- 
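For context on what the new condition checks: in_migration is the flag that
exec.c already uses to track whether migration-time dirty logging is active.
In the 2.3-era tree it is, as far as I recall (treat the exact call sites as
an assumption, not something quoted in this thread), toggled roughly like
this:

    /* exec.c, paraphrased from memory -- not part of the patch above */
    void cpu_physical_memory_set_dirty_tracking(bool enable)
    {
        in_migration = enable;
    }

So the patch forces a final kvm_physical_sync_dirty_bitmap() on any slot
being taken down while a migration is in flight, with the aim of pulling the
slot's accumulated dirty bits into QEMU's bitmap before KVM's copy of the log
is destroyed.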

I can still see an XFS panic complaining "Corruption of in-memory data
detected." in the guest after migration even with this patch and an IDE disk.
