A hardware error related to an ECC threshold / CPU issue is logged on the granet iDRAC.
All the alerts were raised on 2021-10-01 around 21:00:
2021-10-01 21:05:33 MEM8000 Correctable memory error logging disabled for a memory device at location DIMM_B3.
2021-10-01 21:05:33 CPU9000 An OEM diagnostic event occurred.
2021-10-01 21:05:32 CPU9000 An OEM diagnostic event occurred.
2021-10-01 21:05:31 CPU9000 An OEM diagnostic event occurred.
2021-10-01 21:05:31 CPU9000 An OEM diagnostic event occurred.
2021-10-01 21:05:30 CPU9000 An OEM diagnostic event occurred.
2021-10-01 21:05:30 CPU9000 An OEM diagnostic event occurred.
2021-10-01 21:05:29 CPU9000 An OEM diagnostic event occurred.
2021-10-01 21:05:28 CPU9000 An OEM diagnostic event occurred.
2021-10-01 21:05:28 CPU9000 An OEM diagnostic event occurred.
2021-10-01 21:05:27 CPU9000 An OEM diagnostic event occurred.
2021-10-01 21:05:26 CPU9000 An OEM diagnostic event occurred.
2021-10-01 21:05:25 CPU9000 An OEM diagnostic event occurred.
2021-10-01 21:05:24 CPU0012 Correctable Machine Check Exception detected on CPU 2.
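To summarize a log like the one above, the entries can be split on their timestamps and counted per message ID. The following is a minimal sketch, assuming the log has been copied as text from the iDRAC; the names `LOG`, `ENTRY`, `parse_entries` and `count_by_id` are hypothetical, and `LOG` holds only a shortened excerpt of the entries above.

```python
import re
from collections import Counter

# Shortened excerpt of the log above. Entries may be separated by spaces
# or by newlines; the pattern below handles both.
LOG = (
    "2021-10-01 21:05:33 MEM8000 Correctable memory error logging disabled "
    "for a memory device at location DIMM_B3. "
    "2021-10-01 21:05:33 CPU9000 An OEM diagnostic event occurred. "
    "2021-10-01 21:05:24 CPU0012 Correctable Machine Check Exception "
    "detected on CPU 2."
)

TIMESTAMP = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"

# One entry = timestamp, message ID (e.g. MEM8000), then free text running
# up to the next timestamp or the end of the input.
ENTRY = re.compile(
    rf"({TIMESTAMP})\s+([A-Z]{{3,4}}\d{{4}})\s+(.*?)(?=\s*{TIMESTAMP}\s|\s*$)",
    re.DOTALL,
)


def parse_entries(text):
    """Return a list of (timestamp, message_id, message) tuples."""
    return [(ts, mid, msg.strip()) for ts, mid, msg in ENTRY.findall(text)]


def count_by_id(text):
    """Count how many entries were logged for each message ID."""
    return Counter(mid for _, mid, _ in parse_entries(text))


if __name__ == "__main__":
    for message_id, count in count_by_id(LOG).most_common():
        print(message_id, count)
```

Run against the full log, this kind of count makes it easier to see that a single CPU0012 event is followed by a burst of CPU9000 events and one MEM8000 event.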
According to the Dell manual:
- CPU0012 [1]
  Message: Correctable Machine Check Exception detected on CPU arg1.
  Arguments: arg1 = number
  Detailed Description: None.
  Recommended Response Action: Review System Event Log and Operating System Logs. If the issue persists, contact technical support. Refer to the product documentation to choose a convenient contact method.
  Category: System Health
  Subcategory: CPU = Processor
  Severity: Severity 2 (Warning)
  Trap/EventID: 2242
  LCD Message: No LCD message display defined.
  Initial Default: IPMI Alert; LC Log
  Server Administrator Event ID: 5603
  Server Administrator Trap ID: 5603
- CPU9000 [1]
  Message: An OEM diagnostic event occurred.
  Detailed Description: None
  Recommended Response Action: No response action is required.
  Category: System Health
  Subcategory: CPU = Processor
  Severity: Severity 3 (Informational)
  LCD Message: No LCD message display defined.
  Initial Default: LC Log
  Server Administrator Event ID: Not Applicable
  Server Administrator Trap ID: Not Applicable
- MEM8000 [2]
  Message: Correctable memory error logging disabled for a memory device at location arg1.
  Arguments: arg1 = location
  Detailed Description: Errors are being corrected but no longer logged.
  Recommended Response Action: Review system logs for memory exceptions. Re-install memory at location <location>
  Category: System Health
  Subcategory: MEM = Memory
  Severity: Severity 1 (Critical)
  Trap/EventID: 2265
  LCD Message: SBE log disabled on <location>. Reseat memory
  Initial Default: LC Log
  Server Administrator Event ID: Not Applicable
  Server Administrator Trap ID: Not Applicable
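The severities listed in the manual excerpts above can be used to decide which event to look at first. This is only an illustrative sketch: the `SEVERITY` table is copied by hand from those excerpts, and the helper names `triage` and `most_severe` are hypothetical.

```python
# Severity per message ID, as given in the manual excerpts above
# (1 = Critical, 2 = Warning, 3 = Informational).
SEVERITY = {
    "MEM8000": 1,
    "CPU0012": 2,
    "CPU9000": 3,
}
LABELS = {1: "Critical", 2: "Warning", 3: "Informational"}


def triage(message_ids):
    """Return the distinct message IDs, most severe first."""
    return sorted(set(message_ids), key=lambda mid: (SEVERITY[mid], mid))


def most_severe(message_ids):
    """Return (message_id, severity_label) for the most severe event seen."""
    first = triage(message_ids)[0]
    return first, LABELS[SEVERITY[first]]


if __name__ == "__main__":
    seen = ["CPU0012", "CPU9000", "CPU9000", "MEM8000"]
    print(triage(seen))
    print(most_severe(seen))
```

Ordered this way, the MEM8000 event on DIMM_B3 comes out on top, which matches the focus on the memory module in the rest of this report.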
The BIOS version of this server is 2.3.2.
According to the memory self-healing documentation [3], a PPR (Post Package Repair) is not scheduled unless other errors are detected.
The recommended first action is to reboot:
With BIOS 2.1.x or later, the first recommended step is to reboot/restart (without moving DIMMs to a different slot). This allows the new BIOS enhancements to run, potentially resolving (self-healing) the DIMM errors without the need to schedule any DIMM replacements.
It is also recommended to upgrade the BIOS and iDRAC firmware to improve error detection, but this can be left for later, in case the problem is still present after the reboot.
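The "BIOS 2.1.x or later" condition from the Dell article can be checked by comparing the version numbers component by component. A minimal sketch, assuming the BIOS version is available as a dotted string; `parse_version` and `reboot_first_applies` are hypothetical helper names.

```python
def parse_version(version):
    """Turn a dotted version string such as '2.3.2' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def reboot_first_applies(bios_version, minimum="2.1"):
    """True if the BIOS is at least `minimum`, so a reboot is the first step."""
    # Tuples compare element by element, so (2, 3, 2) >= (2, 1) holds,
    # whereas a plain string comparison would get '2.10' vs '2.9' wrong.
    return parse_version(bios_version) >= parse_version(minimum)


if __name__ == "__main__":
    print(reboot_first_applies("2.3.2"))
```

For this server (BIOS 2.3.2) the condition holds, so the reboot comes before any DIMM reseat or replacement.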
[1] https://www.dell.com/support/manuals/fr-fr/dell-opnmang-sw-v8.0.1/eemi_13g-v1/cpu-event-messages?guid=guid-789ec7d2-2a52-4063-a753-c5dc51e91359&lang=en-us
[2] https://www.dell.com/support/manuals/fr-fr/dell-opnmang-sw-v8.0.1/eemi_13g-v1/mem-event-messages?guid=guid-ff360c01-4e4c-4f20-871d-1d24ced52985&lang=en-us
[3] https://www.dell.com/support/kbdoc/fr-fr/000053203/what-is-ddr4-self-healing-on-dell-poweredge-servers-with-intel-xeon-scalable-processors?lang=en