Re: [Qemu-devel] E5-2620v2 - emulation stop error


From: Dr. David Alan Gilbert
Subject: Re: [Qemu-devel] E5-2620v2 - emulation stop error
Date: Tue, 10 Mar 2015 18:16:52 +0000
User-agent: Mutt/1.5.23 (2014-03-12)

* Andrey Korolyov (address@hidden) wrote:
> On Tue, Mar 10, 2015 at 7:57 PM, Dr. David Alan Gilbert
> <address@hidden> wrote:
> > * Andrey Korolyov (address@hidden) wrote:
> >> On Sat, Mar 7, 2015 at 3:00 AM, Andrey Korolyov <address@hidden> wrote:
> >> > On Fri, Mar 6, 2015 at 7:57 PM, Bandan Das <address@hidden> wrote:
> >> >> Andrey Korolyov <address@hidden> writes:
> >> >>
> >> >>> On Fri, Mar 6, 2015 at 1:14 AM, Andrey Korolyov <address@hidden> wrote:
> >> >>>> Hello,
> >> >>>>
> >> >>>> Recently I've got a couple of shiny new Intel 2620v2s for future
> >> >>>> replacement of the E5-2620v1, but I have experienced relatively many
> >> >>>> emulation errors; all traces look similar to the one below. I am
> >> >>>> running qemu-2.1 on x86 on top of the 3.10 branch for testing purposes but
> >> >>>> can switch to some other versions if necessary. Most of the crashes
> >> >>>> happened during a reboot cycle or at the end of an ACPI-based shutdown
> >> >>>> action, if this can help. I have zero clues as to what can introduce such
> >> >>>> a mess inside the same processor family using identical software, as the
> >> >>>> 2620v1 has never had a similar problem. Please let me know if there are
> >> >>>> any side measures that could make the entire story clearer.
> >> >>>>
> >> >>>> Thanks!
> >> >>>>
> >> >>>> KVM internal error. Suberror: 2
> >> >>>> extra data[0]: 800000d1
> >> >>>> extra data[1]: 80000b0d
> >> >>>> EAX=00000003 EBX=00000000 ECX=00000000 EDX=00000000
> >> >>>> ESI=00000000 EDI=00000000 EBP=00000000 ESP=00006cd4
> >> >>>> EIP=0000d3f9 EFL=00010202 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
> >> >>>> ES =0000 00000000 0000ffff 00009300
> >> >>>> CS =f000 000f0000 0000ffff 00009b00
> >> >>>> SS =0000 00000000 0000ffff 00009300
> >> >>>> DS =0000 00000000 0000ffff 00009300
> >> >>>> FS =0000 00000000 0000ffff 00009300
> >> >>>> GS =0000 00000000 0000ffff 00009300
> >> >>>> LDT=0000 00000000 0000ffff 00008200
> >> >>>> TR =0000 00000000 0000ffff 00008b00
> >> >>>> GDT=     000f6e98 00000037
> >> >>>> IDT=     00000000 000003ff
> >> >>>> CR0=00000010 CR2=00000000 CR3=00000000 CR4=00000000
> >> >>>> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000
> >> >>>> DR3=0000000000000000
> >> >>>> DR6=00000000ffff0ff0 DR7=0000000000000400
> >> >>>> EFER=0000000000000000
> >> >>>> Code=48 18 67 8c 00 8c d1 8e d9 66 5a 66 58 66 5d 66 c3 cd 02 cb <cd>
> >> >>>> 10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd 19 cb cd 1c cb fa fc 66
> >> >>>> b8 00 e0 00 00 8e
> >> >>>
> >> >>>
> >> >>> It turns out that those errors are introduced by APICv, which gets
> >> >>> enabled due to the different feature set. If anyone is interested in
> >> >>> reproducing/fixing this exactly on 3.10, it takes about one hundred
> >> >>> migrations/power state changes for the issue to appear; the guest OS can be
> >> >>> Linux or Win.
> >> >>
> >> >> Are you able to reproduce this on a more recent upstream kernel as well?
> >> >>
> >> >> Bandan
> >> >
> >> > I'll go through a test cycle with 3.18 and 2603v2 around tomorrow and
> >> > follow up with any reproducible results.
> >>
> >> Heh.. the issue is not triggered on 2603v2 at all, at least I am not able
> >> to hit it. The only difference from the 2620v2, except lower frequency, is
> >> the Intel Dynamic Acceleration feature. I'd appreciate any testing with
> >> higher CPU models with the same or a richer feature set. The testing itself
> >> can be done on either generic 3.10 or RH7 kernels, as both of them
> >> experience this issue. I conducted all tests with C-states disabled,
> >> so I advise doing the same as a first reproduction step.
> >>
> >> Thanks!
> >>
> >> model name      : Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
> >> stepping        : 4
> >> microcode       : 0x416
> >> cpu MHz         : 2100.039
> >> cache size      : 15360 KB
> >> siblings        : 12
> >> apicid          : 43
> >> initial apicid  : 43
> >> fpu             : yes
> >> fpu_exception   : yes
> >> cpuid level     : 13
> >> wp              : yes
> >> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> >> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
> >> syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts
> >> rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq
> >> dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca
> >> sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c
> >> rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi
> >> flexpriority ept vpid fsgsbase smep erms
> >
> > I'm seeing something similar; it's very intermittent and generally
> > happens right at boot of the guest. I'm running this on qemu
> > head+my postcopy world (but it's happening right at boot before postcopy
> > gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case,
> > but hey, maybe I'm seeing a different bug.
> >
> > Dave
> 
> Yep, looks like we are hitting the same bug - two thirds of my failure
> events hit during a boot/reboot cycle and approx. one third of the events
> happened in the middle of runtime. Which CPU are you using, v0 or v2
> (in other words, is APICv enabled)?
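
One way to answer the "is APICv enabled" question on a given host is to read the kvm_intel module parameter. The sketch below assumes the running kernel exposes it as /sys/module/kvm_intel/parameters/enable_apicv and reports it as Y or N; the helper name apicv_enabled is made up for illustration, and on kernels without that parameter it simply returns None.

```python
from pathlib import Path

# Assumed location of the kvm_intel module parameter; older or differently
# configured kernels may not expose it at all.
APICV_PARAM = Path("/sys/module/kvm_intel/parameters/enable_apicv")


def apicv_enabled(param=APICV_PARAM):
    """Return True/False from the module parameter, or None if it is unreadable."""
    try:
        value = Path(param).read_text().strip()
    except OSError:
        # Module not loaded, parameter missing, or no permission.
        return None
    return value in ("Y", "1")


if __name__ == "__main__":
    state = apicv_enabled()
    print({True: "APICv enabled", False: "APICv disabled"}.get(state, "unknown"))
```

The parameter only says what kvm_intel ended up using, so it is worth recording alongside the CPU model and kernel version when comparing failing and non-failing hosts.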

processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 45
model name      : Intel(R) Xeon(R) CPU E5-2407 0 @ 2.20GHz
stepping        : 7
microcode       : 0x70d
cpu MHz         : 2200.000
cache size      : 10240 KB
physical id     : 1
siblings        : 4
core id         : 3
cpu cores       : 4
apicid          : 38
initial apicid  : 38
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb 
rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology 
nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est 
tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt 
tsc_deadline_timer aes xsave avx lahf_lm arat pln pts dtherm tpr_shadow vnmi 
flexpriority ept vpid xsaveopt
bugs            :
bogomips        : 4409.23
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
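
Since the thread ties the problem to a "different feature set", a quick way to see what differs between two hosts is to diff their cpuinfo flag lists. This is only a sketch: cpu_flags and flag_diff are illustrative names, and it expects the flags on a single line as /proc/cpuinfo prints them (the wrapped lists pasted in this thread would need rejoining first).

```python
def cpu_flags(cpuinfo_text):
    """Return the flag set from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def flag_diff(host_a, host_b):
    """Return (flags only on host A, flags only on host B), each sorted."""
    a, b = cpu_flags(host_a), cpu_flags(host_b)
    return sorted(a - b), sorted(b - a)


if __name__ == "__main__":
    # Compare the local host against itself as a smoke test; in practice the
    # second text would be a cpuinfo dump saved from the other machine.
    with open("/proc/cpuinfo") as f:
        local = f.read()
    print(flag_diff(local, local))
```

Applied to the two flag lists quoted in this thread, the E5-2620 v2 side additionally reports f16c, rdrand, ida, epb, fsgsbase, smep and erms, and the E5-2407 side has nothing extra. Neither list contains an entry for APICv itself, so the flag diff alone does not settle that question.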

It's really random as well; I had two within half an hour yesterday, and then
it survived overnight with no change.

KVM internal error. Suberror: 1
emulation failure
EAX=00000000 EBX=00000000 ECX=00000000 EDX=000fd2bc
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
EIP=000fd2c5 EFL=00010007 [-----PC] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
SS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
FS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
GS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT=     000f6a80 00000037
IDT=     000f6abe 00000000
CR0=60000011 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=66 ba bc d2 0f 00 e9 a2 fe f3 90 f0 0f ba 2d 04 ff fb bf 00 <72> f3 8b 25 00 ff fb bf e8 44 66 ff ff c7 05 04 ff fb bf 00 00 00 00 f4 eb fd fa fc 66 b8
KVM internal error. Suberror: 1
emulation failure

and

11:37:49 INFO | [qemu output] KVM internal error. Suberror: 1
11:37:49 INFO | [qemu output] emulation failure
11:37:49 INFO | [qemu output] EAX=00000000 EBX=00000000 ECX=00000000 EDX=000fd2bc
11:37:49 INFO | [qemu output] ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
11:37:49 INFO | [qemu output] EIP=000fd2bc EFL=00010007 [-----PC] CPL=0 II=0 A20=1 SMM=0 HLT=0
11:37:49 INFO | [qemu output] ES =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
11:37:49 INFO | [qemu output] CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
11:37:49 INFO | [qemu output] SS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
11:37:49 INFO | [qemu output] DS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
11:37:49 INFO | [qemu output] FS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
11:37:49 INFO | [qemu output] GS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
11:37:49 INFO | [qemu output] LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
11:37:49 INFO | [qemu output] TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
11:37:49 INFO | [qemu output] GDT=     000f6a80 00000037
11:37:49 INFO | [qemu output] IDT=     000f6abe 00000000
11:37:49 INFO | [qemu output] CR0=60000011 CR2=00000000 CR3=00000000 CR4=00000000
11:37:49 INFO | [qemu output] DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
11:37:49 INFO | [qemu output] DR6=00000000ffff0ff0 DR7=0000000000000400
11:37:49 INFO | [qemu output] EFER=0000000000000000
11:37:49 INFO | [qemu output] Code=0a 00 e8 a0 64 ff ff 0f aa 66 ba bc d2 0f 00 e9 a2 fe f3 90 <f0> 0f ba 2d 04 ff fb 3f 00 72 f3 8b 25 00 ff fb 3f e8 44 66 ff ff c7 05 04 ff fb 3f 00 00

Note that the code in the second dump is in the middle of the BIOS,
but a few of its bytes differ from what an objdump gets,
so I'm not quite sure whether something is stamping on the BIOS or
whether that's a separate issue.
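
For comparing bytes like this, the two Code= dumps above can also be lined up against each other by address. The sketch below assumes the <xx> byte in each dump sits at EIP and that CS has base 0 (as both dumps show), so EIP can be used directly as the address; parse_code and diff_dumps are illustrative helper names, not part of QEMU.

```python
def parse_code(code, eip):
    """Map address -> byte for a 'Code=' dump; the <xx> token marks the byte at EIP."""
    tokens = code.split()
    marker = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    base = eip - marker  # address of the first byte in the dump
    return {base + i: int(tok.strip("<>"), 16) for i, tok in enumerate(tokens)}


def diff_dumps(a, b):
    """Return (address, byte_a, byte_b) for every shared address that differs."""
    shared = sorted(a.keys() & b.keys())
    return [(addr, a[addr], b[addr]) for addr in shared if a[addr] != b[addr]]


# Code= bytes copied from the two dumps in this mail (EIP=000fd2c5 and EIP=000fd2bc).
FIRST = ("66 ba bc d2 0f 00 e9 a2 fe f3 90 f0 0f ba 2d 04 ff fb bf 00 <72> f3 8b 25 "
         "00 ff fb bf e8 44 66 ff ff c7 05 04 ff fb bf 00 00 00 00 f4 eb fd fa fc 66 b8")
SECOND = ("0a 00 e8 a0 64 ff ff 0f aa 66 ba bc d2 0f 00 e9 a2 fe f3 90 <f0> 0f ba 2d "
          "04 ff fb 3f 00 72 f3 8b 25 00 ff fb 3f e8 44 66 ff ff c7 05 04 ff fb 3f 00 00")

differences = diff_dumps(parse_code(FIRST, 0xFD2C5), parse_code(SECOND, 0xFD2BC))
for addr, x, y in differences:
    print(f"{addr:#x}: {x:02x} != {y:02x}")
```

The same two helpers would work for comparing a dump against bytes extracted from the BIOS image, given the image's load address; that address is not stated in this thread, so it is left out here.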

Dave

--
Dr. David Alan Gilbert / address@hidden / Manchester, UK


