Re: [lmi] BERT: Error records from previous boot


From: Vadim Zeitlin
Subject: Re: [lmi] BERT: Error records from previous boot
Date: Tue, 12 May 2020 19:42:25 +0200

On Tue, 12 May 2020 15:42:09 +0000 Greg Chicares <address@hidden> wrote:

GC> Vadim--I just rebooted (debian stable) and saw this flying by;
GC> I was able to recapture it from `dmesg`:
GC> 
GC> [    1.982866] BERT: Error records from previous boot:
GC> [    1.985400] [Hardware Error]: event severity: fatal
GC> [    1.985458] [Hardware Error]:  Error 0, type: fatal
GC> [    1.985515] [Hardware Error]:   section_type: PCIe error
GC> [    1.985572] [Hardware Error]:   port_type: 4, root port
GC> [    1.985629] [Hardware Error]:   version: 1.16
GC> [    1.985685] [Hardware Error]:   command: 0x0010, status: 0x0000
GC> [    1.985744] [Hardware Error]:   device_id: 0000:00:02.0
GC> [    1.985801] [Hardware Error]:   slot: 0
GC> [    1.985856] [Hardware Error]:   secondary_bus: 0x00
GC> [    1.985914] [Hardware Error]:   vendor_id: 0x8086, device_id: 0x2f04
GC> [    1.985973] [Hardware Error]:   class_code: 000604
GC> [    1.986029] [Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x0000
GC> [    1.986100] [Hardware Error]:   aer_uncor_status: 0x00000000, aer_uncor_mask: 0x00000000
GC> [    1.986170] [Hardware Error]:   aer_uncor_severity: 0x00062030
GC> [    1.986228] [Hardware Error]:   TLP Header: 00000000 00000000 00000000 00000000
GC> 
GC> I'd never seen anything like this before.

 Unfortunately I can't help much because I have never seen this either
and didn't know that BERT (Boot Error Record Table, as it turns out)
existed until today. And I couldn't find much about it even now; the best
description is arguably the one in the patch which added support for it to
Linux:

        From: Tony Luck <tony dot luck at intel dot com>
        Subject: [PATCH] ACPI/APEI: Add BERT data driver
        Date: Mon, 14 Aug 2017 09:56:13 -0700

        The ACPI Boot Error Record Table provides a method for platform
        firmware to give information to the operating system about error
        that occurred prior to boot (of particular interest are problems
        that caused the previous OS instance to crash).

But it doesn't even seem like there was a crash in your case, so I'm not
sure how relevant this information is here...


GC> Device 00:02.0 looks like it might be important:
GC> 
GC> /home/greg[0]#lspci |grep "00:02.0"  
GC> 00:02.0 PCI bridge: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCI Express Root Port 2 (rev 02)
GC> 
GC> Yesterday I had to reboot anyway (after some mishap led to "disk full"),
GC> but I can't think of any other recent catastrophe. I don't recall any
GC> attempted reboot that ever failed, but if any ever did, I would most
GC> likely have rebooted again immediately and forgotten about the failure.
GC> 
GC> Is there anything I should do about this?

 I really have no idea, but, from a (very) high-level point of view, the PCIe
error must be due to either the host/controller itself or one of the
devices using it. If it's the host/controller, the only thing to do is to
replace it, i.e. the motherboard, and you would probably be unwilling to do
that until it just stops working in any case. If it's one of the devices, you
could perhaps run stress tests on it. I don't know what kind of devices
you have on this bus; some common candidates would be a graphics card or an
SSD. If it's the former, it's not really a big deal either, as in the worst
case you would just replace it too when/if it stops working. If it's the
latter, it's potentially more concerning, but if smartmontools doesn't show
any errors/problems I wouldn't do anything about it yet either.
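
 To find out which devices actually sit behind the root port at 00:02.0,
one option is to look at sysfs. This is only a minimal sketch: it assumes
the usual Linux sysfs layout, in which devices behind a bridge appear as
subdirectories of the bridge's own directory under /sys/bus/pci/devices,
and the helper name devices_behind is made up for illustration.

```python
import os
import re

# PCI addresses look like "0000:03:00.0" (domain:bus:device.function).
PCI_ADDR = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$")

def devices_behind(bridge_dir):
    """Return sorted PCI addresses found directly under bridge_dir.

    bridge_dir is assumed to be a sysfs directory such as
    /sys/bus/pci/devices/0000:00:02.0 (hypothetical helper).
    """
    if not os.path.isdir(bridge_dir):
        return []
    return sorted(
        name for name in os.listdir(bridge_dir)
        if PCI_ADDR.match(name)
        and os.path.isdir(os.path.join(bridge_dir, name))
    )

if __name__ == "__main__":
    # Prints nothing if the path does not exist or nothing is attached.
    for addr in devices_behind("/sys/bus/pci/devices/0000:00:02.0"):
        print(addr)
```

 Any address it prints can then be passed to `lspci -s` to see what the
device is.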

 So, basically, I'd just check that your backups work and can be restored
from, and monitor the logs for any future errors, but otherwise wouldn't do
anything yet.
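
 Monitoring the logs could be as simple as filtering the kernel log for
"[Hardware Error]" lines after each boot. A minimal sketch, assuming the
log text has the same "[timestamp] [Hardware Error]: message" shape as the
dmesg output quoted above; the function name hardware_errors is
hypothetical:

```python
import re

# Matches kernel log lines such as
# "[    1.985400] [Hardware Error]: event severity: fatal"
HW_ERROR = re.compile(r"^\[\s*(\d+\.\d+)\]\s+\[Hardware Error\]:\s*(.*)$")

def hardware_errors(log_text):
    """Return (timestamp, message) pairs for each [Hardware Error] line."""
    found = []
    for line in log_text.splitlines():
        m = HW_ERROR.match(line.strip())
        if m:
            found.append((float(m.group(1)), m.group(2).strip()))
    return found
```

 The input could be the saved output of `dmesg` (which may require root,
depending on the system configuration); an empty result means no such
records were logged in that text.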

 Regards,
VZ
