qemu-devel


From: zhanghailiang
Subject: Re: [Qemu-devel] [BUG/RFC] Two cpus are not brought up normally in SLES11 sp3 VM after reboot
Date: Tue, 7 Jul 2015 20:39:32 +0800
User-agent: Mozilla/5.0 (Windows NT 6.1; rv:31.0) Gecko/20100101 Thunderbird/31.1.1

On 2015/7/7 20:21, Igor Mammedov wrote:
On Tue, 7 Jul 2015 19:43:35 +0800
zhanghailiang <address@hidden> wrote:

On 2015/7/7 19:23, Igor Mammedov wrote:
On Mon, 6 Jul 2015 17:59:10 +0800
zhanghailiang <address@hidden> wrote:

On 2015/7/6 16:45, Paolo Bonzini wrote:


On 06/07/2015 09:54, zhanghailiang wrote:

From the host, we found that the QEMU vcpu1 and vcpu7 threads were not
consuming any CPU (they should be in the idle state).
All of the vCPUs' stacks on the host look like this:

[<ffffffffa07089b5>] kvm_vcpu_block+0x65/0xa0 [kvm]
[<ffffffffa071c7c1>] __vcpu_run+0xd1/0x260 [kvm]
[<ffffffffa071d508>] kvm_arch_vcpu_ioctl_run+0x68/0x1a0 [kvm]
[<ffffffffa0709cee>] kvm_vcpu_ioctl+0x38e/0x580 [kvm]
[<ffffffff8116be8b>] do_vfs_ioctl+0x8b/0x3b0
[<ffffffff8116c251>] sys_ioctl+0xa1/0xb0
[<ffffffff81468092>] system_call_fastpath+0x16/0x1b
[<00002ab9fe1f99a7>] 0x2ab9fe1f99a7
[<ffffffffffffffff>] 0xffffffffffffffff

We looked into the kernel code that could lead to the above 'Stuck'
warning,
In current upstream there isn't any printk(...Stuck...) left, since that code
path has been reworked.
I've often seen this on an over-committed host during guest CPU up/down torture
tests.
Could you update the guest kernel to upstream and see if the issue reproduces?


Hmm, unfortunately it is very hard to reproduce, and we are still trying to
reproduce it.

For your test case, is it a kernel bug?
Or has any related patch that could solve your test problem been merged
upstream?
I don't remember all the prerequisite patches, but you should be able to find
   http://marc.info/?l=linux-kernel&m=140326703108009&w=2
   "x86/smpboot: Initialize secondary CPU only if master CPU will wait for it"
and then look for dependencies.


Er, we have investigated this patch, and it is not related to our problem. :)

Thanks.



Thanks,
zhanghailiang

and found that the only possibility is that the emulation of the 'cpuid'
instruction in kvm/qemu is doing something wrong.
But since we can’t reproduce this problem, we are not quite sure.
Is it possible that the cpuid emulation in kvm/qemu has a bug?

Can you explain the relationship to the cpuid emulation?  What do the
traces say about vcpus 1 and 7?

OK, we searched the VM's kernel code for the 'Stuck' message, and it is
located in do_boot_cpu(). It runs in BSP context; the call chain is:
BSP executes start_kernel() -> smp_init() -> smp_boot_cpus() -> do_boot_cpu()
-> wakeup_secondary_cpu_via_init() to trigger the APs.
It waits 5s for an AP to start up; if the AP does not start up normally, it
prints 'CPU%d Stuck' or 'CPU%d: Not responding'.
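
To make the two outcomes concrete, here is a minimal C sketch of the BSP-side wait. It is only a model and not the guest kernel source: struct ap_state, wait_for_ap() and the trampoline_entered flag are hypothetical names, and the sketch assumes that the message chosen on timeout depends on whether the AP ever began executing the trampoline code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the BSP wait; names are illustrative only. */
enum boot_result { BOOT_OK, BOOT_STUCK, BOOT_NOT_RESPONDING };

struct ap_state {
    int  callin_after_polls;  /* poll index at which the AP sets its callin bit, -1 = never */
    bool trampoline_entered;  /* true if the AP started executing the trampoline */
};

/* Poll up to max_polls times for the AP's callin bit, then classify the result. */
enum boot_result wait_for_ap(const struct ap_state *ap, int max_polls)
{
    assert(ap != NULL && max_polls > 0);

    for (int poll = 0; poll < max_polls; poll++) {
        if (ap->callin_after_polls >= 0 && poll >= ap->callin_after_polls)
            return BOOT_OK;   /* the AP reached smp_callin() in time */
        /* a real BSP would delay here before polling again */
    }

    /* Timed out: distinguish "started but never called in" from "never started". */
    return ap->trampoline_entered ? BOOT_STUCK : BOOT_NOT_RESPONDING;
}

/* Map the result to the message text quoted in this mail. */
const char *boot_result_message(enum boot_result r)
{
    switch (r) {
    case BOOT_STUCK:          return "Stuck";
    case BOOT_NOT_RESPONDING: return "Not responding";
    default:                  return "OK";
    }
}
```

With max_polls set to 50000, an AP that entered the trampoline but never sets its callin bit yields BOOT_STUCK, which is the case we are looking at.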

If it prints 'Stuck', it means the AP has received the SIPI interrupt and
begun to execute the code at
'ENTRY(trampoline_data)' (trampoline_64.S), but got stuck somewhere before
smp_callin() (smpboot.c).
The following is the startup process of the BSP and the AP.
BSP:
start_kernel()
     ->smp_init()
        ->smp_boot_cpus()
          ->do_boot_cpu()
              ->start_ip = trampoline_address(); // set the address the AP will start executing at
              ->wakeup_secondary_cpu_via_init(); // kick the secondary CPU
              ->for (timeout = 0; timeout < 50000; timeout++)
                  if (cpumask_test_cpu(cpu, cpu_callin_mask)) break; // check whether the AP has started up

APs:
ENTRY(trampoline_data) (trampoline_64.S)
         ->ENTRY(secondary_startup_64) (head_64.S)
            ->start_secondary() (smpboot.c)
               ->cpu_init();
               ->smp_callin();
                   ->cpumask_set_cpu(cpuid, cpu_callin_mask); // Note: if the AP gets here, the BSP will not print the error message.

From the above call chain, we can be sure that the AP got stuck between
trampoline_data and the cpumask_set_cpu() call in
smp_callin(). We looked through this code path carefully and found only one
'hlt' instruction that could block the process.
It is located in trampoline_data():

ENTRY(trampoline_data)
           ...

        call    verify_cpu              # Verify the cpu supports long mode
        testl   %eax, %eax              # Check for return code
        jnz     no_longmode

           ...

no_longmode:
        hlt
        jmp no_longmode

In verify_cpu(),
the only sensitive instruction we can find that could cause a VM exit from
non-root mode is 'cpuid'.
This is why we suspect that wrong cpuid emulation in KVM/QEMU led to
the failure in verify_cpu.
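
For reference, the long-mode bit that this check depends on can also be read from user space. The sketch below assumes a GCC or Clang toolchain that provides <cpuid.h>; it tests only the long-mode bit (CPUID leaf 0x80000001, EDX bit 29), which is just one of the things verify_cpu looks at. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header, assumed to be available */
#endif

#define CPUID_EXT_FEATURES_LEAF 0x80000001u  /* extended processor features */
#define CPUID_EDX_LONG_MODE_BIT 29           /* LM: 64-bit long mode */

/* Return true if the given bit is set in a 32-bit CPUID output register. */
bool cpuid_bit_set(uint32_t reg, unsigned int bit)
{
    assert(bit < 32);  /* shifting a 32-bit value by 32 or more is undefined */
    return ((reg >> bit) & 1u) != 0;
}

/* Interpret an EDX value returned by leaf 0x80000001. */
bool edx_reports_long_mode(uint32_t edx)
{
    return cpuid_bit_set(edx, CPUID_EDX_LONG_MODE_BIT);
}

/*
 * Execute cpuid on the current CPU.
 * Returns 1 if long mode is reported, 0 if it is not,
 * and -1 if the leaf is unavailable or this is not an x86 build.
 */
int query_long_mode(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(CPUID_EXT_FEATURES_LEAF, &eax, &ebx, &ecx, &edx))
        return -1;
    return edx_reports_long_mode(edx) ? 1 : 0;
#else
    return -1;
#endif
}
```

Run inside a guest, query_long_mode() shows what the emulated cpuid reports on the vCPU that executes it.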

From the messages in the VM, we know something is wrong with vcpu1 and vcpu7.
[    5.060042] CPU1: Stuck ??
[   10.170815] CPU7: Stuck ??
[   10.171648] Brought up 6 CPUs

Besides, the following is the CPU information obtained from the host.
80FF72F5-FF6D-E411-A8C8-000000821800:/home/fsp/hrg # virsh qemu-monitor-command instance-0000000
* CPU #0: pc=0x00007f64160c683d thread_id=68570
     CPU #1: pc=0xffffffff810301f1 (halted) thread_id=68573
     CPU #2: pc=0xffffffff810301e2 (halted) thread_id=68575
     CPU #3: pc=0xffffffff810301e2 (halted) thread_id=68576
     CPU #4: pc=0xffffffff810301e2 (halted) thread_id=68577
     CPU #5: pc=0xffffffff810301e2 (halted) thread_id=68578
     CPU #6: pc=0xffffffff810301e2 (halted) thread_id=68583
     CPU #7: pc=0xffffffff810301f1 (halted) thread_id=68584
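
In this listing CPU1 and CPU7 report pc=0xffffffff810301f1, while CPU2 to CPU6 report pc=0xffffffff810301e2. A small C sketch for pulling these fields out of such lines, so that the values can be compared mechanically, could look like the following. The line format is taken from the listing above; struct cpu_line and parse_cpu_line() are hypothetical names and not part of QEMU or libvirt.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* One parsed "CPU #n: pc=0x... [(halted)] thread_id=n" line. */
struct cpu_line {
    int      index;   /* vCPU number */
    uint64_t pc;      /* program counter reported by the monitor */
    bool     halted;  /* true if the line carries the "(halted)" marker */
};

/* Parse one line of the listing; returns false if the line does not match. */
bool parse_cpu_line(const char *line, struct cpu_line *out)
{
    assert(line != NULL && out != NULL);

    /* Skip indentation and the '*' that marks the currently selected CPU. */
    while (*line == ' ' || *line == '\t' || *line == '*')
        line++;

    /* %x conversions accept the leading "0x" prefix. */
    if (sscanf(line, "CPU #%d: pc=%" SCNx64, &out->index, &out->pc) != 2)
        return false;

    out->halted = strstr(line, "(halted)") != NULL;
    return true;
}
```

Feeding each line of the listing to parse_cpu_line() and grouping the halted entries by pc separates CPU1 and CPU7 from the other halted vCPUs.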

Oh, I also forgot to mention in the above message that we have bound each vCPU
to a different physical CPU on the
host.

Thanks,
zhanghailiang






.






.






reply via email to

[Prev in Thread] Current Thread [Next in Thread]