Re: Boot flakiness with QEMU 3.1.0 and Clang built kernels
From: Cédric Le Goater
Subject: Re: Boot flakiness with QEMU 3.1.0 and Clang built kernels
Date: Sun, 12 Apr 2020 14:03:01 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.6.0

On 4/11/20 3:57 PM, Nicholas Piggin wrote:
> Nicholas Piggin's on April 11, 2020 7:32 pm:
>> Nathan Chancellor's on April 11, 2020 10:53 am:
>>> The tt.config values are needed to reproduce but I did not verify that
>>> ONLY tt.config was needed. Other than that, no, we are just building
>>> either pseries_defconfig or powernv_defconfig with those configs and
>>> letting it boot up with a simple initramfs, which prints the version
>>> string then shuts the machine down.
>>>
>>> Let me know if you need any more information, cheers!
>>
>> Okay I can reproduce it. Sometimes it eventually recovers after a long
>> pause, and some keyboard input often helps it along. So that seems like
>> it might be a lost interrupt.
>>
>> POWER8 vs POWER9 might just be a timing thing if P9 is still hanging
>> sometimes. I wasn't able to reproduce it with defconfig+tt.config, I
>> needed your other config with various other debug options.
>>
>> Thanks for the very good report. I'll let you know what I find.
>
> It looks like a qemu bug. Booting with '-d int' shows the decrementer
> simply stops firing at the point of the hang, even though MSR[EE]=1 and
> the DEC register is wrapping. Linux appears to be doing the right thing
> as far as I can tell (not losing interrupts).
>
> This qemu patch fixes the boot hang for me. I don't know that qemu
> really has the right idea of "context synchronizing" as defined in the
> powerpc architecture -- mtmsrd L=1 is not context synchronizing but that
> does not mean it can avoid looking at exceptions until the next such
> event. It looks like the decrementer exception goes high but the
> execution of mtmsrd L=1 is ignoring it.
>
> Prior to the Linux patch 3282a3da25b you bisected to, interrupt replay
> code would return with an 'rfi' instruction as part of interrupt return,
> which probably helped to get things moving along a bit. However it would
> not be foolproof, and Cedric did say he encountered some mysterious
> lockups under load with qemu powernv before that patch was merged, so
> maybe it's the same issue?
Nope :/ but this is a fix for an important problem reported by Anton in
November. Attached is the test case.
Thanks,
C.

Attachment: test.S
Description: Text document
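
For readers following the quoted analysis, the failure mode Nicholas describes (a decrementer exception that is already pending when mtmsrd L=1 sets MSR[EE], and is then never looked at) can be illustrated with a small toy model. This is only a sketch under stated assumptions: it is neither QEMU's code nor the attached test.S, and the names ToyCPU, check_interrupts, mtmsrd_l1, scenario and the recheck flag are hypothetical. The only architectural facts it relies on are that mtmsrd with L=1 updates just the EE and RI bits, and that taking an interrupt clears MSR[EE].

```python
MSR_EE = 1 << 15  # External interrupt Enable (MSR bit 48 in IBM numbering)
MSR_RI = 1 << 1   # Recoverable Interrupt (MSR bit 62 in IBM numbering)


class ToyCPU:
    """Hypothetical toy model of one CPU; not QEMU's data structures."""

    def __init__(self):
        self.msr = MSR_RI          # start with external interrupts masked
        self.dec_pending = False   # decrementer exception raised, not yet taken
        self.dec_taken = 0         # decrementer interrupts actually delivered

    def check_interrupts(self):
        """Deliver the decrementer interrupt if it is pending and EE=1."""
        if self.dec_pending and (self.msr & MSR_EE):
            self.msr &= ~MSR_EE    # interrupt entry clears MSR[EE]
            self.dec_taken += 1    # handler reprogramming DEC is not modelled
            return True
        return False

    def mtmsrd_l1(self, rs, recheck):
        """mtmsrd with L=1: only the EE and RI bits are copied from rs.

        `recheck` is an assumption of this sketch: it selects whether
        pending exceptions are re-examined right after the MSR update.
        """
        mask = MSR_EE | MSR_RI
        self.msr = (self.msr & ~mask) | (rs & mask)
        if recheck:
            self.check_interrupts()


def scenario(recheck):
    """Return how many decrementer interrupts are delivered."""
    cpu = ToyCPU()
    cpu.dec_pending = True         # DEC wraps while EE=0
    cpu.check_interrupts()         # masked, so nothing is delivered
    cpu.mtmsrd_l1(MSR_EE | MSR_RI, recheck)
    return cpu.dec_taken


if __name__ == "__main__":
    print("no recheck:", scenario(False))
    print("recheck:   ", scenario(True))
```

Without the recheck the model ends with MSR[EE]=1 and the exception still pending but undelivered, which matches the symptom seen with '-d int'; with the recheck the interrupt is taken as soon as EE is set.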