qemu-devel

Re: [PATCH v3 2/2] cxl/core: add poison creation event handler


From: Shiyang Ruan
Subject: Re: [PATCH v3 2/2] cxl/core: add poison creation event handler
Date: Fri, 24 May 2024 23:15:07 +0800
User-agent: Mozilla Thunderbird



On 2024/5/22 14:45, Dan Williams wrote:
Shiyang Ruan wrote:
[..]
My expectation is MF_ACTION_REQUIRED is not appropriate for CXL event
reported errors since action is only required for direct consumption
events and those need not be reported through the device event queue.
Got it.

I'm not very sure about the 'Host write/read' types.  In my opinion, these
two types of event should be sent by the device when the CPU accesses a
bad memory address, so they could be thought of as sync events, which need

Hmm, no that's not my understanding of a sync event. I expect when error
notifications are synchronous the CPU is guaranteed not to make forward
progress past the point of encountering the error. MSI-signaled
component-events are always asynchronous by that definition because the
CPU is free running while the interrupt is in-flight.

Understood.  In the OS-First path, it cannot be a sync event.


the 'MF_ACTION_REQUIRED' flag.  Then, we can determine the flag by the
types like this:
- CXL_EVENT_TRANSACTION_READ | CXL_EVENT_TRANSACTION_WRITE
                                                => MF_ACTION_REQUIRED
- CXL_EVENT_TRANSACTION_INJECT_POISON         => MF_SW_SIMULATED
- others                                      => 0
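
For illustration only, the mapping proposed in the list above could be sketched as a small userspace C function. The identifier names are taken from this thread, but the numeric values and the function name proposed_mf_flags() are placeholders chosen for this sketch, not the kernel's definitions. The reply that follows argues against deriving policy from the transaction type, so this only shows what was proposed.

```c
#include <assert.h>

/* Placeholder values: these are NOT the kernel's or the CXL spec's numbers. */
enum cxl_event_transaction {
	CXL_EVENT_TRANSACTION_UNKNOWN = 0,
	CXL_EVENT_TRANSACTION_READ,
	CXL_EVENT_TRANSACTION_WRITE,
	CXL_EVENT_TRANSACTION_INJECT_POISON,
};

/* Placeholder flag bits standing in for the kernel's memory_failure() flags. */
#define MF_ACTION_REQUIRED (1 << 0)
#define MF_SW_SIMULATED    (1 << 1)

static_assert((MF_ACTION_REQUIRED & MF_SW_SIMULATED) == 0,
	      "placeholder flags must be distinct bits");

/* Hypothetical helper: the transaction-type-to-flags table proposed above. */
int proposed_mf_flags(enum cxl_event_transaction type)
{
	switch (type) {
	case CXL_EVENT_TRANSACTION_READ:
	case CXL_EVENT_TRANSACTION_WRITE:
		return MF_ACTION_REQUIRED;
	case CXL_EVENT_TRANSACTION_INJECT_POISON:
		return MF_SW_SIMULATED;
	default:
		return 0;
	}
}
```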

I doubt any reasonable policy can be inferred from the transaction type.
Consider that the CPU itself does not take a synchronous exception when
writes encounter poison. At most those are flagged via CMCI
(corrected machine check interrupt). The only events that cause
exceptions are CPU reads that consume poison. The device has no idea
whether read events are coming from a CPU or a DMA event.

MF_SW_SIMULATED is purely for software simulated poison events as
injected poison can still cause system fatal damage if the poison is
ingested in an unrecoverable path.

So, I think all CXL poison notification events should trigger an action
optional memory_failure(). I expect this needs to make sure that
duplicates are not a problem. I.e. in the case of CPU consumption of CXL
poison, that causes a synchronous MF_ACTION_REQUIRED event via the MCE
path *and* it may trigger the device to send an error record for the
same page. As far as I can see, duplicate reports (MCE + CXL device) are
unavoidable.
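
A minimal userspace sketch of that idea, assuming a simple per-page "already poisoned" record: the CXL event path always reports with no flags (action optional), the MCE path reports with an action-required flag, and whichever report arrives second for the same page is treated as a harmless duplicate. All names here (sketch_memory_failure(), sketch_cxl_poison_event(), sketch_mce_consumed_poison()) and the flag value are hypothetical stand-ins, not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder bit standing in for the kernel's MF_ACTION_REQUIRED. */
#define SKETCH_MF_ACTION_REQUIRED (1 << 1)
#define SKETCH_MAX_POISONED 64

/* Toy record of page frame numbers that were already reported as poisoned. */
unsigned long poisoned_pfns[SKETCH_MAX_POISONED];
size_t nr_poisoned;

bool sketch_already_poisoned(unsigned long pfn)
{
	for (size_t i = 0; i < nr_poisoned; i++)
		if (poisoned_pfns[i] == pfn)
			return true;
	return false;
}

/*
 * Stand-in for memory_failure(): returns true when this is the first
 * report for the page and false when it is a duplicate. The flags only
 * express urgency in this model; they do not change the dedup decision.
 */
bool sketch_memory_failure(unsigned long pfn, int flags)
{
	(void)flags;
	if (sketch_already_poisoned(pfn))
		return false;
	assert(nr_poisoned < SKETCH_MAX_POISONED);
	poisoned_pfns[nr_poisoned++] = pfn;
	return true;
}

/* CXL device event path: always action optional, i.e. no flags. */
bool sketch_cxl_poison_event(unsigned long pfn)
{
	return sketch_memory_failure(pfn, 0);
}

/* CPU consumption path (MCE): synchronous, action required. */
bool sketch_mce_consumed_poison(unsigned long pfn)
{
	return sketch_memory_failure(pfn, SKETCH_MF_ACTION_REQUIRED);
}
```

In this model the order of the two reports does not matter: the second one for a given page is simply recognized as a duplicate and dropped.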

I think my previous understanding about MCE was wrong. Here is my current understanding after some research:

Since a CXL device is a memory device, when the CPU consumes a poisoned page on a CXL device it always triggers an MCE via interrupt (INT18), no matter which path (FW-First or OS-First) is configured. This is the first report. Then, currently, in the FW-First path, the poison event is delivered as follows: CXL device -> firmware -> OS: ACPI -> APEI -> GHES -> MCE. This is the second report. These two MCEs refer to the same poisoned page, which is the so-called "duplicate report", right? The memory_failure() handling I'm trying to add in the OS-First path would then be yet another duplicate report.

So, the primary issue to be solved is the second MCE report. As you suggested, I will make it action optional.

Please correct me if I'm wrong.  Thank you very much!

--
Ruan.


