qemu-riscv
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Re: [PATCH] target/riscv: Exit current TB after an sfence.vma


From: Palmer Dabbelt
Subject: Re: Re: [PATCH] target/riscv: Exit current TB after an sfence.vma
Date: Wed, 30 Mar 2022 10:06:38 -0700 (PDT)

On Wed, 30 Mar 2022 09:11:18 PDT (-0700), Palmer Dabbelt wrote:
[re-ordering the top post]

+linux-riscv, as this may very well be a kernel bug

On Wed, 30 Mar 2022 05:38:30 PDT (-0700), phantom@zju.edu.cn wrote:
-----Original Messages-----
From: "Idan Horowitz" <idan.horowitz@gmail.com>
Sent Time: 2022-03-30 15:35:19 (Wednesday)
To: "Atish Patra" <atishp@atishpatra.org>
Cc: phantom@zju.edu.cn, "open list:RISC-V" <qemu-riscv@nongnu.org>, "Alistair Francis" 
<Alistair.Francis@wdc.com>, "qemu-devel@nongnu.org Developers" <qemu-devel@nongnu.org>
Subject: Re: [PATCH] target/riscv: Exit current TB after an sfence.vma

On Wed, 30 Mar 2022 at 10:28, Atish Patra <atishp@atishpatra.org> wrote:
>
> I tested on v5.17 built from defconfig for rv64.
>
> Here is the kernel code executing sfence.vma
> https://elixir.bootlin.com/linux/v5.17/source/arch/riscv/kernel/head.S#L122
>

I believe this is a kernel bug and not a QEMU one. They perform a
write to the SATP with the same ASID as the one used before (0) and

I seem to remember ASID 0 being a special one, with some global-ness,
but I couldn't find that in a quick poke into the spec.  As below, I'm
going to read through this a bit more...

(Also, I haven't had any coffee yet)

then expect it to be used, without performing an sfence.vma following
it.
This was exposed by my change, as previously the write to the satp was
performed in the same TB block as the sfence.vma *before it*, which
meant the TLB was not filled in between the previous sfence and the
write to SATP following it.
I was able to reproduce the issue with the Fedora Rawhide image in the
wiki, and I was able to resolve it by artificially forcing a TLB flush
following all writes to SATP.
I think the correct course of action here is to:
1. Report the issue to the linux kernel mailing list and/or contribute
a patch that adds said missing sfence.vma following the SATP write.
(Atish: Are you able to test if adding an sfence.vma in your kernel
build fixes the issue?)

If it's a kernel bug we should fix it, but I'm not entirely convinced
that's the case.  I can confirm that the following makes Linux boot
again

diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
index ec07f991866a..83373c2bd7d0 100644
--- a/arch/riscv/kernel/head.S
+++ b/arch/riscv/kernel/head.S
@@ -121,6 +121,7 @@ relocate:
        or a0, a0, a1
        sfence.vma
        csrw CSR_SATP, a0
+       sfence.vma
 .align 2
 1:
        /* Set trap vector to spin forever to help debug */

It's been a while since I read through the rules here so I'm going to go
read them again, but IIRC that shouldn't be necessary: that first
sfence.vma should be sufficiently global to ensure all prior writes to
the page tables are visible, regardless of what's in SATP.  That said, I
remember there being a lot of subtly here in the spec wording so I'm
going to go read the spec again.

2. Restore the patch

Presumably you mean "revert" here?  That might be the right way to go,
just to avoid breaking users (even if we fix the kernel bug, it'll take
a while to get everyone to update).  That said, this smells like the
sort of thing that's going to crop up at arbitrary times in dynamic
systems so while a revert looks like it'd work around the boot issue we
might be making more headaches for folks down the road.

I agree with you partly, my test case is actually from linux kernel, I notice
the strange sfence.vma before write satp during write our teaching kernel.
I think, the strange code is used to bypass the qemu bug that Idan patched.
Because in hardware, if the stap is empty, sfence.vma will do nothing.
And that's why nobody report it.

IIUC the sfence.vma before the SATP write is very explicitly necessary:
without that fence old mappings could be utilized directly after the
SATP write, so we might not even be able to fetch the next instruction.

Before patch, qemu won't end a BB after sfence (but jump and CSR write do).
So, the kernel author reodered write stap and sfence.vma to make sfence.vma
place in the same BB with write satp, instead of the following write stvec.
(If don't reorder, sfence.vma will place in the same BB with write stvec,
that will crash kernel, see my origin analysis).

However, in hardware, since tlb is empty, put the first sfence.vma before or
after write satp is not really matters.
In qemu, as Atish's log shown, we should do a imply invisible sfence.vma after
write stap to invalid qemu's translation cache.

The ISA manual is quite explicit about SATP not enforcing these
orderings.  If what I remember about ASID 0 is true then I do think we'd
need one, though, to avoid the aliasing -- hopefully I'll make a bit
more sense soon, though...

OK, I must have just been crazy -- there's nothing special about ASID=0. Looks to me like flushing the TLB on SATP writes is necessary, I just sent a patch with a more concrete description of why.

Thanks!



reply via email to

[Prev in Thread] Current Thread [Next in Thread]