Re: mac99 SMP

Thu, 6 Mar 2025, 19:22 BALATON Zoltan <balaton@eik.bme.hu>:

On Thu, 6 Mar 2025, Andrew Randrianasulu wrote:
> On Thu, Mar 6, 2025 at 6:57 PM Andrew Randrianasulu
> <randrianasulu@gmail.com> wrote:
>> On Thu, Mar 6, 2025 at 6:41 PM BALATON Zoltan <balaton@eik.bme.hu> wrote:
>>>
>>> On Thu, 6 Mar 2025, Andrew Randrianasulu wrote:
>>>> Thu, 6 Mar 2025, 18:16 BALATON Zoltan <balaton@eik.bme.hu>:
>>>>> On Thu, 6 Mar 2025, Andrew Randrianasulu wrote:
>>>>>> On Thu, Mar 6, 2025 at 4:12 PM BALATON Zoltan <balaton@eik.bme.hu>
>>>>> wrote:
>>>>>>>
>>>>>>> On Thu, 6 Mar 2025, Andrew Randrianasulu wrote:
>>>>>>>> Thu, 6 Mar 2025, 05:10 BALATON Zoltan <balaton@eik.bme.hu>:
>>>>>>>>
>>>>>>>>> On Thu, 6 Mar 2025, Andrew Randrianasulu wrote:
>>>>>>>>>> On Thu, Mar 6, 2025 at 2:02 AM Andrew Randrianasulu
>>>>>>>>>> <randrianasulu@gmail.com> wrote:
>>>>>>>>>>> On Thu, Mar 6, 2025 at 12:21 AM BALATON Zoltan <balaton@eik.bme.hu>
>>>>>>>>> wrote:
>>>>>>>>>>>> So is that the ISI that I saw? Line 308 is end of DSI handler but
>>>>> log
>>>>>>>>> name
>>>>>>>>>>>> shows ISI handler. But you had no ISI logs with -d int so I don't
>>>>> get
>>>>>>>>> it.
>>>>>>>>>>>> What are the registers of that CPU at that point? One of those
>>>>> should
>>>>>>>>> tell
>>>>>>>>>>>> from where it got to the ISI handler but backtrace does not show
>>>>> that.
>>>>>>>>>>>> (Check CPU docs which reg has the address that caused the
>>>>> exception, I
>>>>>>>>>>>> don't remember.)
>>>>>>>>>>>>
>>>>>>>>>>>>> this was case when I set sstep bits to 0x1 - it does not bring up
>>>>>>>>> second
>>>>>>>>>>>>> cpu, AND does not single step into irq_save function in
>>>>>>>>> core99_kick_cpu :(
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> So I assumed least impactful mode (0x1) is not very useful for
>>>>>>>>> detailed
>>>>>>>>>>>>> single stepping into this specific function.
>>>>>>>>>>>>>
>>>>>>>>>>>>> But 0x3, 0x5 and default 0x7 all works, as far as single stepping
>>>>> and
>>>>>>>>>>>>> bringing up secondary cpu are concerned.
>>>>>>>>>>>>
>>>>>>>>>>>> You were stepping through CPU0 but the interesting part is what
>>>>> CPU1 is
>>>>>>>>>>>> doing so maybe try to trace that:
>>>>>>>>>>>>
>>>>>>>>>>>> thread 2
>>>>>>>>>>>> b *0x100
>>>>>>>>>>>
>>>>>>>>>>> It advances to 0x400 (second cpu) but no further than this
>>>>>>>>>
>>>>>>>>> 0x400 is the ISI vector so it seems it hits that for some reason
>>>>> which is
>>>>>>>>> the same I saw with -d int,mmu and probably it shouldn't get those
>>>>> before
>>>>>>>>> it copies MMU setup from CPU0.
>>>>>>>>>
>>>>>>>>>>> see gdb log
>>>>>>>>>>>
>>>>>>>>>>> Note that I used
>>>>>>>>>>>
>>>>>>>>>>> maintenance packet Qqemu.sstep=0x1
>>>>>>>>>>> sending: Qqemu.sstep=0x1
>>>>>>>>>>> received: "OK"
>>>>>>>>>>>
>>>>>>>>>>> and ctrl-c first thread when it failed to single step (single
>>>>> stepping
>>>>>>>>>>> on thread 2/CPU1 was already impossible)
>>>>>>>>>>>
>>>>>>>>>>> It show different state but no obvious function pointers :/
>>>>>>>>>>
>>>>>>>>>> Aw, this one was without second qemu in mttcg mode :/
>>>>>>>>>>
>>>>>>>>>> New gdb log attached
>>>>>>>>>
>>>>>>>>> As much as I understand it shows CPU0 waiting for CPU1 to set its
>>>>>>>>> call_in_map entry (that's OK and expected) while CPU1 is getting ISIs
>>>>> or
>>>>>>>>> some other exceptions (which it likely shouldn't get) but I still
>>>>> don't
>>>>>>>>> see how far CPU1 got in its init code and what triggers these ISIs?
>>>>> When
>>>>>>>>> in ISI handler there's a register that has the fault address which is
>>>>>>>>> where it jumped there from. I told you to check PPC manual for that. I
>>>>>>>>> looked it up now it's SRR0 ("Set to the effective address of the
>>>>>>>>> instruction that the processor would have attempted to execute next
>>>>> if no
>>>>>>>>> exception conditions were present (if the exception occurs on
>>>>> attempting
>>>>>>>>> to fetch a branch target, SRR0 is set to the branch target address)").
>>>>>>>>> What code that address belongs to? That's what causing the ISIs and we
>>>>>>>>> should find out why.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks for looking it up
>>>>>>>>
>>>>>>>> At very first moment when 0x100 breakpoint hit and gdb autoswitches to
>>>>>>>> thread 2 it shows empty r0-r32 and
>>>>>>>>
>>>>>>>> srr0 0x100 256
>>>>>>>>
>>>>>>>> at next "step" (not really step because single stepping starts run away
>>>>>>>> execution with sstep bits set to 0x1)
>>>>>>>>
>>>>>>>> srr0 0xc000439c -1073724516
>>>>>>>>
>>>>>>>> same as pc (program counter)
>>>>>>>>
>>>>>>>> pc 0xc000439c 0xc000439c <InstructionAccess_virt>
>>>>>>>>
>>>>>>>> so is it already in its bad state? Not sure how to get any in-between
>>>>> state?
>>>>>>>>
>>>>>>>> Maybe enable the normal sstep bits (0x7) just after cpu1 has hit its breakpoint
>>>>>>>> and then single step ?
>>>>>>>
>>>>>>> You can step by assembly instruction when there's no source or line
>>>>>>> number info with gdb "stepi" command, with that you should be able to
>>>>> step
>>>>>>> through assembly code.
>>>>>>
>>>>>> Thanks! I replaced thread 1 / step with stepi on thread 2
>>>>>>
>>>>>> it ended up with
>>>>>>
>>>>>> * 2 Thread 1.2 (CPU#1 [running]) 0xc0006d30 in
>>>>> vmap_stack_overflow_virt () at
>>>>>> arch/powerpc/kernel/head_book3s_32.S:375
>>>>>> ++thread 2
>>>>>> [Switching to thread 2 (Thread 1.2)]
>>>>>> #0 0xc0006d30 in vmap_stack_overflow_virt () at
>>>>> arch/powerpc/kernel/head_book3s
>>>>>> _32.S:375
>>>>>> 375 b interrupt_return
>>>>>> ++backtrace
>>>>>> #0 0xc0006d30 in vmap_stack_overflow_virt () at
>>>>> arch/powerpc/kernel/head_book3s
>>>>>> _32.S:375
>>>>>> #1 0x00000000 in ?? ()
>>>>>
>>>>> This does not make much sense because
>>>>> arch/powerpc/kernel/head_book3s_32.S:375 is end of FPU unavailable
>>>>> exception (which probably should not happen) and has nothing to do with
>>>>> vmap_stack_overflow_virt() (which I don't know where is as it's not
>>>>> present in the older Linux sources I was looking at). In any case it looks
>>>>> like the problem is that unexpected exceptions are happening that cause
>>>>> CPU1 to interrupt its code execution and jump to uninitialised or wrong
>>>>> vectors and prevent it from initialising correctly. I don't know if this is because
>>>>> on real machine this would run from cache and won't cause exceptions or
>>>>> something is not correctly emulated so I don't know how to fix. I remember
>>>>> a similar problem with MorphOS which worked on real machine but caused
>>>>> problem in QEMU but could be prevented by turning off the MSR DR IR bits
>>>>> until the exception vectors were correctly set up. But in that case
>>>>> OpenBIOS enabled these bits and MorphOS did not disable before trying to
>>>>> change exception vectors but here for second CPU I think it should start
>>>>> after a reset with these bits disabled so the question is what enables
>>>>> them in Linux and at that point is it ready to get exceptions?
>>>>
>>>>
>>>> well, cpu0 starts ok at least ...
>>>>
>>>>
>>>> can we just add disassembly command to gdb script?
>>>
>>> You can try
>>>
>>> display x/i $pc
>>
>> ++display x/i $pc
>> my-qemu-gdb-script.gdb:27: Error in sourced command file:
>> No symbol "x" in current context.
>>
>
> https://stackoverflow.com/questions/1902901/show-current-assembly-instruction-in-gdb
>
> set disassemble-next-line on
> show disassemble-next-line
>
>
> result attached

And where is that code in the Linux sources? The line you quoted sets it
to be Reset but I don't see where that's defined.

I clicked on the word EXCEPTION and picked the first result ...

https://elixir.bootlin.com/linux/v6.12.17/source/arch/powerpc/kernel/head_32.h#L186

but then it goes deeper ...

https://elixir.bootlin.com/linux/v6.12.17/source/arch/powerpc/kernel/head_32.h#L13

for the exception prolog

https://elixir.bootlin.com/linux/v6.12.17/source/arch/powerpc/kernel/entry_32.S#L55

for prepare transfer to handler ....

This seems to be early after CPU1 has started, when it tries to jump into
kernel code. Without the MMU set up, how is that supposed to work? Why does
this not cause exceptions on a real machine? Also, CPU0 is handling a
decrementer interrupt meanwhile. Is that relevant to the problem, and is
there some unwanted interaction here?
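
One way to approach the MMU question is to check whether address translation
is enabled on CPU1 each time it stops. The following is a minimal sketch, not
part of the original debugging session. It assumes the MSR value has been
read by hand from gdb on thread 2, and it assumes the 32-bit Book3S bit
layout (MSR[IR] = 0x20, MSR[DR] = 0x10). The function name translation_state
is hypothetical.

```python
# Minimal sketch: decode the address-translation bits of a 32-bit PowerPC MSR.
# Assumption: msr is a value copied by hand from the debugger.
MSR_IR = 0x20  # instruction address translation enabled
MSR_DR = 0x10  # data address translation enabled


def translation_state(msr):
    """Return which translation bits are set in the given MSR value."""
    return {
        "IR": bool(msr & MSR_IR),
        "DR": bool(msr & MSR_DR),
    }


if __name__ == "__main__":
    # Illustrative values only, not taken from the attached logs.
    for value in (0x00000000, 0x00000030):
        print(hex(value), translation_state(value))
```

If IR is already set when CPU1 reaches a kernel virtual address, the ISI would
point at missing translation setup rather than at translation being off.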

Maybe? Earlier, when we set a breakpoint at smp_core99_kick_cpu on CPU0, it started OK ...
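
Since the outcome seems to depend on how CPU0 is stopped and stepped, it may
help to spell out what the sstep values tried earlier in the thread (0x1, 0x3,
0x5, 0x7) mean. This is a minimal sketch, assuming the bit values that QEMU's
gdbstub reports for qemu.sstepbits (ENABLE=1, NOIRQ=2, NOTIMER=4). The helper
name describe_sstep is hypothetical.

```python
# Minimal sketch: decode a QEMU gdbstub single-step mask (Qqemu.sstep=...).
# Assumption: bit values as reported by the gdbstub's qemu.sstepbits query.
SSTEP_ENABLE = 0x1   # single stepping enabled
SSTEP_NOIRQ = 0x2    # interrupts are held off while stepping
SSTEP_NOTIMER = 0x4  # timers are held off while stepping

_FLAGS = (
    ("ENABLE", SSTEP_ENABLE),
    ("NOIRQ", SSTEP_NOIRQ),
    ("NOTIMER", SSTEP_NOTIMER),
)


def describe_sstep(mask):
    """Return the names of the flags set in a single-step mask."""
    return [name for name, bit in _FLAGS if mask & bit]


if __name__ == "__main__":
    for mask in (0x1, 0x3, 0x5, 0x7):
        print(hex(mask), describe_sstep(mask))
```

With 0x1 neither interrupts nor timers are held off during a step, so that is
the only mode tried here in which both can still fire while stepping.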

Regards,
BALATON Zoltan

From:	Andrew Randrianasulu
Subject:	Re: mac99 SMP
Date:	Thu, 6 Mar 2025 20:37:34 +0300