qemu-devel

Re: [Qemu-devel] State of ARM FIQ in Qemu


From: Greg Bellows
Subject: Re: [Qemu-devel] State of ARM FIQ in Qemu
Date: Thu, 13 Nov 2014 09:09:33 -0600



On 13 November 2014 07:58, Tim Sander <address@hidden> wrote:
Am Mittwoch, 12. November 2014, 10:00:03 schrieb Greg Bellows:
> On 12 November 2014 07:56, Tim Sander <address@hidden> wrote:
> > Hi Greg
> >
> > > > Bad mode in data abort handler detected
> > > > Internal error: Oops - bad mode: 0 [#1] PREEMPT SMP ARM
> > > > Modules linked in: firq(O) ipv6
> > > > CPU: 0 PID: 103 Comm: systemd-udevd Tainted: G           O 3.14.0 #1
> > > > task: bf2b9300 ti: bf362000 task.ti: bf362000
> > > > PC is at 0xffff1240
> > > > LR is at handle_fasteoi_irq+0x9c/0x13c
> > > > pc : [<ffff1240>]    lr : [<8005cda0>]    psr: 600f01d1
> > > > sp : bf363e70  ip : 07a7e79d  fp : 00000000
> > > > r10: 76f92008  r9 : 80590080  r8 : 76e8e4d0
> > > > r7 : f8200100  r6 : bf363fb0  r5 : bf008414  r4 : bf0083c0
> > > > r3 : 80230d04  r2 : 0000002f  r1 : 00000000  r0 : bf0083c0
> > > > Flags: nZCv  IRQs off  FIQs off  Mode FIQ_32  ISA ARM  Segment user
> > >
> > > It looks like we are in FIQ mode and interrupts have been masked.
> >
> > Indeed.
> >
> > > > Control: 10c53c7d  Table: 60004059  DAC: 00000015
> > > > Process systemd-udevd (pid: 103, stack limit = 0xbf362240)
> > > > Stack: (0xbf363e70 to 0xbf364000)
> > > > 3e60:                                     bf0083c0 00000000 0000002f 80230d04
> > > > 3e80: bf0083c0 bf008414 bf363fb0 f8200100 76e8e4d0 80590080 76f92008 00000000
> > > > 3ea0: 07a7e79d bf363e70 8005cda0 ffff1240 600f01d1 ffffffff 8005cd04 0000002f
> > > > 3ec0: 0000002f 800598bc 8058cc70 8000ed00 f820010c 8059684c bf363ef8 80008528
> > > > 3ee0: 80023730 80023744 200f0113 ffffffff bf363f2c 80012180 00000000 805baa00
> > > > 3f00: 00000000 00000100 00000002 00000022 00000000 bf362000 76e8e4d0 80590080
> > > > 3f20: 76f92008 00000000 0000000a bf363f40 80023730 80023744 200f0113 ffffffff
> > > > 3f40: bf007a14 8059ac00 00000000 0000000a ffff8dd7 00400140 bf0079c0 8058cc70
> > > > 3f60: 00000022 00000000 f8200100 76e8e4d0 76f9201c 76f92008 00000000 80023af0
> > > > 3f80: 8058cc70 8000ed04 f820010c 8059684c bf363fb0 80008528 00000000 76dd3b44
> > > > 3fa0: 600f0010 ffffffff 0000000c 8001233c 00000000 00000000 76f93428 76f93428
> > > > 3fc0: 76f93438 00000000 76f93448 0000000c 76e8e4d0 76f9201c 76f92008 00000000
> > > > 3fe0: 00000000 7ec115c0 76f60914 76dd3b44 600f0010 ffffffff 9fffd821 9fffdc21
> > > > [<8005cda0>] (handle_fasteoi_irq) from [<80230d04>] (gic_eoi_irq+0x0/0x4c)
> >
> > > It certainly looks like we are going down the standard IRQ path as you
> > > suggested.  I'm not a Linux driver guy, but do you see any kind of activity
> > > (break points, printfs, ...) through your FIQ handler?
> >
> > I am reaching 0xffff1224, which I believe is the FIQ vector address on the
> > vexpress?
>
> Hmmm.... not sure.  As you mentioned previously (and as seen in the above
> register dump), I would expect offset 0x1240 (pc=0xffff1240) for an FIQ.
> I'm not sure what is at offset 0x1224, but on my Linux kernel it appears
> that offset 0x1220 is vector_addrexcptn (not pabort), which happens to
> occupy the HYP trap vector.
Zounds! You're right, I think this was a typo in my debug script, which I
didn't notice. But I am reaching 0x1240 and not 0x1244, which means

I wouldn't expect it to reach 0x1244 as that is the word after what I believe should be a branch at 0x1240 to the FIQ handler.  This would mean we are not overrunning the vector table though.  
 
it aborts on the first FIQ instruction. Here is the "-d int" output directly
after the FIQ hits:
Taking exception 3 [Prefetch Abort]
...with IFSR 0x5 IFAR 0x800c8dcc  //kmem_cache_alloc
Taking exception 3 [Prefetch Abort]
...with IFSR 0x5 IFAR 0x8001be00 //v7_pabort
Taking exception 3 [Prefetch Abort]
and then it continues to fail on v7_pabort repeatedly. This shows that there is
something fishy going on: it is failing on the presumed handler for the
prefetch abort. But since I see earlier prefetch aborts being resolved, I can
conclude that things work up to the point where the CPU enters FIQ mode.
FIQ is special in that statically mapped memory is needed to avoid a page
lookup, since that fails under Linux in FIQ mode. But 0x800c8dcc (kmem_cache_alloc)
is not called in the FIQ handler, which obviously can't use any Linux
infrastructure. And as I do not reach the breakpoint at 0xffff1244, these misses
happen on the execution of the first address of the FIQ handler.

Can we check the vector table to see if the FIQ entry is as expected?  It appears that the pabort may be in the right place, but it would be good to see whether the FIQ entry is correct (branching to the right place).  I'd expect that we should be branching to __fiq_svc?  Maybe setting a breakpoint in the first-level handler would be useful?
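(One way to do that check is through QEMU's gdbstub. A sketch of such a session; the 0xffff0000 base and the 0x1c FIQ vector offset are the standard ARM high-vectors layout, and 0xffff1240 is the stub address from the dump above, so treat the exact addresses as assumptions for this kernel:)

```
(gdb) target remote :1234       # qemu started with -s
(gdb) x/8wx 0xffff0000          # the eight exception vectors; FIQ is the last
(gdb) x/i 0xffff001c            # disassemble the FIQ vector instruction
(gdb) x/8i 0xffff1240           # the stub the FIQ vector should branch into
(gdb) break *0xffff1240         # stop on the first FIQ-stub instruction
```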
 

> > > > [<80230d04>] (gic_eoi_irq) from [<f8200100>] (0xf8200100)
> > > > Code: ee02af10 f57ff06f e59d8000 e59d9004 (e599b00c)
> > > > ---[ end trace 3dc3571209a017e1 ]---
> > > > Kernel panic - not syncing: Fatal exception in interrupt
> > >
> > > It is hard to determine entirely what is happening here based on this
> > > info.  I do have code of my own that routes KGDB interrupts as FIQs and
> > > with the workaround I see the FIQs handled as expected.  Some things we can
> > > try to get more info in hopes of pinpointing where to look:
> > >    1. At the top of hw/intc/arm_gic.c there is the following commented out
> > >    line:
> > >        //#define DEBUG_GIC
> > >
> > >    Uncomment the line, rebuild and rerun.  This will give us some trace on
> > >    what is going through the GIC code.
> >
> > I have commented out some debug lines but I see:
> > Breakpoint 1, gic_update_with_grouping (s=0x5555564dba80) at
> > hw/intc/arm_gic.c:120
> > 120                         DPRINTF("Raised pending FIQ %d (cpu %d)\n",
> > best_irq, cpu);
> >
> > With the expected irq nr. 49 (32+17).
> >
> > >    2. Run qemu with the "-d int" option which will print a message on each
> > >    interrupt.  We should see an FIQ at some point if they are occurring.  The
> > >    only issue is that there will be numerous IRQs, so you'll have to parse
> > >    through them to find an "exception 6 [FIQ]".
> >
> > Here is the relevant output when the FIQ hits:
> > Taking exception 2 [SVC]
> > Taking exception 2 [SVC]
> > pml: pml_timer_tick: raise_irq
> > arm_gic: Raised pending FIQ 49 (cpu 0)
> > Taking exception 6 [FIQ]
>
> This looks to me like the GIC has caught the interrupt and communicated it
> to the CPU causing it to take the FIQ exception.
>
> > pml: pml_write: update control flags: 1
> > pml: pml_update: start timer
> > pml: pml_update: lower irq
> > pml: pml_read: read magic
> > pml: pml_write: update control flags: 3
> > pml: pml_update: start timer
>
> Is pml your test driver?  It looks like it initiates the interrupt and
> possibly performs some handling following it?
Yes, it's just a simple set of registers to control an interrupt. I added
debug output to this driver to see if and when the FIQ handler is accessing
the registers, but I see no accesses from FIQ mode.

> > Taking exception 3 [Prefetch Abort]
> > ...with IFSR 0x5 IFAR 0x80221d70
> > Taking exception 4 [Data Abort]
> > ...with DFSR 0x805 DFAR 0x805c604c
> > Taking exception 4 [Data Abort]
> > ...with DFSR 0x805 DFAR 0x805c604c
> > Taking exception 4 [Data Abort]
> >
> > So the FIQ is hitting, but unfortunately I have no idea where the data
> > aborts are coming from.
>
> The data aborts are likely a side effect of the prefetch abort taken before
> them; it is the interesting one.
Still, as above, the address is odd. In FIQ mode it should not jump to this
address at all!? This is definitely Linux kernel memory, and I am not calling
anything Linux-related from the FIQ handler.

I'm a bit confused, as it appears the exception pattern has changed. Previously we were seeing pabt, dabt, dabt, ..., but up above the output is pabt, pabt, pabt, ... .  So either we are jumping somewhere random, breaking repeatability, or something else changed?  This is also reflected in the A15 output below, but it's different.
 

> > I have shifted all other IRQs besides 49 to group 1 so that only irq 49 is
> > an FIQ. Might it be that I am seeing some secure violations...
> > The address in the IFAR, __idr_pre_get, which lives in the Linux kernel in
> > lib/idr.c, seems to implement integer ID management.
> >
> > >    3. If you set a breakpoint in your driver, is it possible to see that
> > >    FIQs are on from the kernel debugger.  Clearly you have to try this from
> > >    a path where interrupts are masked.  I see the following on my system
> > >    mentioned above:
> > >    ...
> > >    Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
> > >    ...
> >
> > So you mean debugging via the qemu debug port? I have not enabled kgdb.
> > As stated above, I was not able to catch the FIQ there. But it might
> > be that I get
> >
> > I have debugged qemu to see if the irq is routed correctly. The deepest
> > call I could find is this: bt
> > #0  tcg_handle_interrupt (cpu=0x555556450790, mask=16) at /home/sander/speedy/soc/qemu/translate-all.c:1503
> > #1  0x0000555555755323 in cpu_interrupt (cpu=0x555556450790, mask=16) at /home/sander/speedy/soc/qemu/include/qom/cpu.h:556
> > #2  0x00005555557561b7 in arm_cpu_set_irq (opaque=0x555556450790, irq=1, level=1) at /home/sander/speedy/soc/qemu/target-arm/cpu.c:261
> > #3  0x00005555558193ec in qemu_set_irq (irq=0x55555642c840, level=1) at hw/core/irq.c:43
> > #4  0x0000555555879073 in gic_update_with_grouping (s=0x5555564dba80) at hw/intc/arm_gic.c:132
> > #5  0x000055555587936d in gic_update (s=0x5555564dba80) at hw/intc/arm_gic.c:180
> > #6  0x00005555558798a7 in gic_set_irq (opaque=0x5555564dba80, irq=49, level=1) at hw/intc/arm_gic.c:264
> > #7  0x00005555558193ec in qemu_set_irq (irq=0x555556432b00, level=1) at hw/core/irq.c:43
> > #8  0x0000555555661d4d in a9mp_priv_set_irq (opaque=0x5555564d7260, irq=17, level=1) at /home/sander/speedy/soc/qemu/hw/cpu/a9mpcore.c:17
> > #9  0x00005555558193ec in qemu_set_irq (irq=0x5555564f3c00, level=1) at hw/core/irq.c:43
> > #10 0x00005555558f6fed in qemu_irq_raise (irq=0x5555564f3c00) at /home/sander/speedy/soc/qemu/include/hw/irq.h:16
> > #11 0x00005555558f7363 in pml_timer_tick (opaque=0x555556595020) at hw/timer/pml.c:95
> > #12 0x000055555599be6e in aio_bh_poll (ctx=0x5555563fdad0) at async.c:82
> > #13 0x00005555559b2d9f in aio_dispatch (ctx=0x5555563fdad0) at aio-posix.c:137
> > #14 0x000055555599c2cb in aio_ctx_dispatch (source=0x5555563fdad0, callback=0x0, user_data=0x0) at async.c:221
> > #15 0x00007ffff7901e04 in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
> > #16 0x00005555559b0a79 in glib_pollfds_poll () at main-loop.c:200
> > #17 0x00005555559b0b7a in os_host_main_loop_wait (timeout=0) at main-loop.c:245
> > #18 0x00005555559b0c52 in main_loop_wait (nonblocking=1) at main-loop.c:494
> > #19 0x0000555555791d8b in main_loop () at vl.c:1872
> > #20 0x00005555557998d5 in main (argc=22, argv=0x7fffffffda38, envp=0x7fffffffdaf0) at vl.c:4348
> >
> > I am not sure if arm_cpu_set_irq(opaque=0x555556450790, irq=1, level=1)
> > represents an FIQ and if mask 16 is the correct mask for the FIQ request.
>
> Yeah this routine handles both IRQs and FIQs.  I don't see anything above
> that stands out as suspicious.  It may be interesting to try the same test
> driver on an A15 emulation if it is not too much trouble.  This would rule
> out the A9 workaround not being sufficient for being GICv2.
Given that the addresses where the faults appear are bogus and not accessed by
the FIQ handler at all, trying another CPU seemed worthwhile. I have seen that
starting up a different CPU is just a matter of a command-line option, so I
started my modified vexpress board (pml hw added) with a Cortex-A15 CPU.
Unfortunately the results are pretty similar:
pml: pml_timer_tick: raise_irq
arm_gic: Raised pending FIQ 49 (cpu 0)
Taking exception 6 [FIQ]
pml: pml_write: update control flags: 1
pml: pml_update: start timer
pml: pml_update: lower irq
pml: pml_read: read magic
pml: pml_write: update control flags: 3
pml: pml_update: start timer
Taking exception 4 [Data Abort]
...with DFSR 0x5 DFAR 0xbf3d2334 //address not in Kernel space?
Taking exception 3 [Prefetch Abort]
...with IFSR 0x5 IFAR 0x800120e0 //__dabt_svc
Taking exception 3 [Prefetch Abort]
...with IFSR 0x5 IFAR 0x80012240 //__pabt_svc
Taking exception 3 [Prefetch Abort]
...with IFSR 0x5 IFAR 0x80012240 //__pabt_svc
Taking exception 3 [Prefetch Abort]


Good data point.  Interesting that we take a data abort first rather than a prefetch abort.  
 
> > Frame #6 shows clearly that irq 49, configured to Group 0, is triggered. All
> > other interrupts are configured to Group 1 from my Linux kernel. The call in
> > frame #4 to gic_update_with_grouping shows that grouping within the GIC is
> > enabled and that the irq is triggered as an FIQ within qemu. All of this
> > looks good as far as I understand, so I am pretty confident that qemu is
> > working correctly (minus the Prefetch and Data Aborts).
>
> I agree that QEMU appears to be handling the FIQ properly and it appears
> that the CPU is trying to dispatch it.  I understand that the Linux FIQ
> handling is a little trickier than IRQs, so I suspect that either something
> in the Linux kernel handling or your driver is going awry during handling
> or as a result of the FIQ.
Yes, FIQs are tricky, as you need to avoid page-table lookup failures; these
are undesirable in an FIQ anyway. So all the memory I access is statically
mapped so that it's always available in the page table.
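(For context, the usual mainline route to such a statically-mapped handler is the arch/arm FIQ API in arch/arm/kernel/fiq.c: set_fiq_handler() copies the handler code into the vector page itself, which sidesteps exactly the paging problem described above. A kernel-side sketch, not standalone-runnable; the handler symbols and the PML_FIQ number are hypothetical placeholders:)

```c
#include <asm/fiq.h>        /* claim_fiq(), set_fiq_handler(), enable_fiq() */
#include <linux/module.h>

/* Hypothetical handler written in assembly elsewhere; it must be
 * position-independent because set_fiq_handler() copies it into the
 * vector page, where it runs without needing any page-table walks. */
extern unsigned char my_fiq_handler_start, my_fiq_handler_end;

static struct fiq_handler my_fh = { .name = "pml-fiq" };
static struct pt_regs fiq_regs;     /* zeroed; set banked r8-r13 as needed */

static int __init my_fiq_init(void)
{
    int ret = claim_fiq(&my_fh);    /* take ownership of the FIQ vector */

    if (ret)
        return ret;
    set_fiq_handler(&my_fiq_handler_start,
                    &my_fiq_handler_end - &my_fiq_handler_start);
    set_fiq_regs(&fiq_regs);        /* preload the banked FIQ registers */
    enable_fiq(PML_FIQ);            /* hypothetical platform FIQ number */
    return 0;
}
```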

Best regards
Tim

