Shiyang Ruan wrote:
[..]
My expectation is MF_ACTION_REQUIRED is not appropriate for CXL event
reported errors since action is only required for direct consumption
events and those need not be reported through the device event queue.
Got it.
I'm not very sure about 'Host write/read' type. In my opinion, these
two types of event should be sent from device when CPU is accessing a
bad memory address, they could be thought of as a sync event which needs
Hmm, no that's not my understanding of a sync event. I expect when error
notifications are synchronous the CPU is guaranteed not to make forward
progress past the point of encountering the error. MSI-signaled
component-events are always asynchronous by that definition because the
CPU is free running while the interrupt is in-flight.
the 'MF_ACTION_REQUIRED' flag. Then, we can determine the flag by the
types like this:
- CXL_EVENT_TRANSACTION_READ | CXL_EVENT_TRANSACTION_WRITE
=> MF_ACTION_REQUIRED
- CXL_EVENT_TRANSACTION_INJECT_POISON => MF_SW_SIMULATED
- others => 0
I doubt any reasonable policy can be inferred from the transaction type.
Consider that the CPU itself does not take a synchronous exception when
writes encounter poison. At most those are flagged via CMCI
(corrected machine check interrupt). The only events that cause
exceptions are CPU reads that consume poison. The device has no idea
whether read events are coming from a CPU or a DMA event.
MF_SW_SIMULATED is purely for software-simulated poison events, since
injected poison can still cause system fatal damage if the poison is
ingested in an unrecoverable path.
So, I think all CXL poison notification events should trigger an
action-optional memory_failure(). I expect this needs to make sure that
duplicates are not a problem. I.e. in the case of CPU consumption of CXL
poison, that causes a synchronous MF_ACTION_REQUIRED event via the MCE
path *and* it may trigger the device to send an error record for the
same page. As far as I can see, duplicate reports (MCE + CXL device) are
unavoidable.
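
A rough userspace sketch of that policy, not kernel code: device event
records map to no flags (action optional), only the synchronous MCE path
maps to MF_ACTION_REQUIRED, and a second report for the same page is
dropped. The enum, the tracker, and both helper names are hypothetical,
and the MF_* values here are stand-ins for illustration rather than the
kernel's definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in flag values for this sketch; the kernel has its own definitions. */
#define MF_ACTION_REQUIRED (1 << 1)
#define MF_SW_SIMULATED    (1 << 5)

/* Hypothetical: how a poison report reached the kernel. */
enum poison_source {
	POISON_SRC_MCE_CONSUMED,  /* synchronous: a CPU read consumed poison */
	POISON_SRC_CXL_EVENT,     /* asynchronous: device event record */
	POISON_SRC_SW_SIMULATED,  /* purely software-simulated poison */
};

/* Pick flags from how the report arrived, not from the transaction type. */
static int poison_mf_flags(enum poison_source src)
{
	switch (src) {
	case POISON_SRC_MCE_CONSUMED:
		return MF_ACTION_REQUIRED;
	case POISON_SRC_SW_SIMULATED:
		return MF_SW_SIMULATED;
	case POISON_SRC_CXL_EVENT:
	default:
		return 0; /* action optional */
	}
}

#define MAX_TRACKED 16

/* Hypothetical record of page frame numbers already reported as poisoned. */
struct poison_tracker {
	unsigned long pfns[MAX_TRACKED];
	size_t count;
};

/*
 * Returns true for the first report of a pfn and false for a duplicate.
 * If the table is full the pfn is not recorded, so later reports for it
 * are treated as new.
 */
static bool poison_report(struct poison_tracker *t, unsigned long pfn)
{
	for (size_t i = 0; i < t->count; i++)
		if (t->pfns[i] == pfn)
			return false;
	if (t->count < MAX_TRACKED)
		t->pfns[t->count++] = pfn;
	return true;
}
```

With this shape, an MCE report followed by a device event record for the
same page is handled once, and the second report is a no-op.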