
ECC corrections too important on one memory module of granet
Closed, Migrated

Description

A hardware error related to an ECC threshold / CPU issue is logged on the granet iDRAC.

All the alerts were raised on 2021-10-01 at around 21:00:

2021-10-01 21:05:33 	MEM8000 	Correctable memory error logging disabled for a memory device at location DIMM_B3.
2021-10-01 21:05:33 	CPU9000 	An OEM diagnostic event occurred.
2021-10-01 21:05:32 	CPU9000 	An OEM diagnostic event occurred.
2021-10-01 21:05:31 	CPU9000 	An OEM diagnostic event occurred.
2021-10-01 21:05:31 	CPU9000 	An OEM diagnostic event occurred.
2021-10-01 21:05:30 	CPU9000 	An OEM diagnostic event occurred.
2021-10-01 21:05:30 	CPU9000 	An OEM diagnostic event occurred.
2021-10-01 21:05:29 	CPU9000 	An OEM diagnostic event occurred.
2021-10-01 21:05:28 	CPU9000 	An OEM diagnostic event occurred.
2021-10-01 21:05:28 	CPU9000 	An OEM diagnostic event occurred.
2021-10-01 21:05:27 	CPU9000 	An OEM diagnostic event occurred.
2021-10-01 21:05:26 	CPU9000 	An OEM diagnostic event occurred.
2021-10-01 21:05:25 	CPU9000 	An OEM diagnostic event occurred.
2021-10-01 21:05:24 	CPU0012 	Correctable Machine Check Exception detected on CPU 2.
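
To see at a glance which message IDs dominate a log like the one above, the lines can be tallied per ID. The following is only a minimal sketch: it assumes the log was copied as text in the `date time ID message` layout shown above, and the names `count_events` and `SAMPLE` are made up for this illustration (they are not part of any iDRAC tooling). The sample is a shortened subset of the log, not the full excerpt.

```python
import re
from collections import Counter

# Matches "<date> <time> <message ID> <message text>"; the whitespace between
# fields may be spaces or tabs, as in the copied log.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+([A-Z]{3}\d{4})\s+(.*\S)\s*$"
)


def count_events(log_text):
    """Return a Counter mapping message ID (e.g. 'CPU9000') to occurrences."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:  # lines that do not look like log entries are skipped
            counts[match.group(2)] += 1
    return counts


# Shortened sample taken from the log above (hypothetical variable name).
SAMPLE = """\
2021-10-01 21:05:33 \tMEM8000 \tCorrectable memory error logging disabled for a memory device at location DIMM_B3.
2021-10-01 21:05:33 \tCPU9000 \tAn OEM diagnostic event occurred.
2021-10-01 21:05:32 \tCPU9000 \tAn OEM diagnostic event occurred.
2021-10-01 21:05:24 \tCPU0012 \tCorrectable Machine Check Exception detected on CPU 2.
"""

if __name__ == "__main__":
    for message_id, n in sorted(count_events(SAMPLE).items()):
        print(message_id, n)
```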

According to the Dell manual:

  • CPU0012 [1]
CPU0012

Message
    Correctable Machine Check Exception detected on CPU arg1 . 
Arguments

        arg1 = number

Detailed Description
    None. 
Recommended Response Action
    Review System Event Log and Operating System Logs. If the issue persists, contact technical support. Refer to the product documentation to choose a convenient contact method. 
Category
    System Health 
Subcategory
    CPU = Processor 
Severity
    Severity 2 (Warning)
Trap/EventID
    2242
LCD Message
    No LCD message display defined.
Initial Default
    IPMI Alert;LC Log
Server Administrator Event ID
    5603
Server Administrator Trap ID
    5603
  • CPU9000 [1]
CPU9000

Message
    An OEM diagnostic event occurred. 
Detailed Description
    None 
Recommended Response Action
    No response action is required. 
Category
    System Health 
Subcategory
    CPU = Processor 
Severity
    Severity 3 (Informational)
LCD Message
    No LCD message display defined.
Initial Default
    LC Log
Server Administrator Event ID
    Not Applicable
Server Administrator Trap ID
    Not Applicable
  • MEM8000 [2]
MEM8000

Message
    Correctable memory error logging disabled for a memory device at location arg1 . 
Arguments

        arg1 = location

Detailed Description
    Errors are being corrected but no longer logged. 
Recommended Response Action
    Review system logs for memory exceptions. Re-install memory at location <location> 
Category
    System Health 
Subcategory
    MEM = Memory 
Severity
    Severity 1 (Critical)
Trap/EventID
    2265
LCD Message
    SBE log disabled on <location>. Reseat memory
Initial Default
    LC Log
Server Administrator Event ID
    Not Applicable
Server Administrator Trap ID
    Not Applicable

The BIOS version of this server is 2.3.2.
According to the memory auto-repair documentation [3], a PPR (Post Package Repair) is not scheduled unless further errors are detected.
The recommended first action is to reboot:

With BIOS 2.1.x or later, the first recommended step is to reboot/restart (without moving DIMMs to a different slot). This allows the new BIOS enhancements to run, potentially resolving (self-healing) the DIMM errors without the need to schedule any DIMM replacements.

It is also recommended to upgrade the BIOS and iDRAC software to improve error detection, but we will leave this for later, in case the problem is still present after the reboot.

[1] https://www.dell.com/support/manuals/fr-fr/dell-opnmang-sw-v8.0.1/eemi_13g-v1/cpu-event-messages?guid=guid-789ec7d2-2a52-4063-a753-c5dc51e91359&lang=en-us
[2] https://www.dell.com/support/manuals/fr-fr/dell-opnmang-sw-v8.0.1/eemi_13g-v1/mem-event-messages?guid=guid-ff360c01-4e4c-4f20-871d-1d24ced52985&lang=en-us
[3] https://www.dell.com/support/kbdoc/fr-fr/000053203/what-is-ddr4-self-healing-on-dell-poweredge-servers-with-intel-xeon-scalable-processors?lang=en

Event Timeline

vsellier renamed this task from ECC correction too important on one memory slot of granet to ECC corrections too important on one memory slot of granet. Nov 2 2021, 4:01 PM
vsellier changed the task status from Open to Work in Progress.
vsellier claimed this task.
vsellier triaged this task as High priority.
vsellier created this task.
vsellier moved this task from Backlog to in-progress on the System administration board.
vsellier renamed this task from ECC corrections too important on one memory slot of granet to ECC corrections too important on one memory module of granet. Nov 2 2021, 4:07 PM
vsellier updated the task description.

The server was rebooted, so the ECC counters were reset and the alert was closed.
We will check whether the error occurs again before asking Dell to replace the memory module.
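
One possible way to watch for a recurrence from the OS side, in addition to the iDRAC log, is to read the correctable/uncorrectable error counters that the Linux EDAC subsystem exposes in sysfs. This is only a sketch under assumptions: it assumes an EDAC driver is loaded on granet (not confirmed in this task), and the function name `read_error_counts` is made up for this illustration. The `/sys/devices/system/edac/mc/mcN/ce_count` and `ue_count` files are the standard kernel EDAC interface.

```python
from pathlib import Path

# Standard Linux EDAC sysfs location; only present when an EDAC driver is loaded.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_error_counts(root=EDAC_ROOT):
    """Return {controller: {"ce": int, "ue": int}} for each mcN directory."""
    counts = {}
    if not root.is_dir():
        return counts  # no EDAC driver loaded (or not a Linux host)
    for mc_dir in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for key, filename in (("ce", "ce_count"), ("ue", "ue_count")):
            path = mc_dir / filename
            if path.is_file():
                entry[key] = int(path.read_text().strip())
        counts[mc_dir.name] = entry
    return counts


if __name__ == "__main__":
    for name, entry in read_error_counts().items():
        print(name, entry)
```

A non-zero and growing `ce` value after the reboot would suggest the correctable errors are coming back.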

The graph services restarted correctly:

root@granet:/etc/systemd/system# systemctl status swhgraphshm.service 
* swhgraphshm.service - swh graph shm mapper
     Loaded: loaded (/etc/systemd/system/swhgraphshm.service; disabled; vendor preset: enabled)
     Active: active (exited) since Wed 2021-11-03 08:37:25 UTC; 2min 25s ago
    Process: 3425 ExecStart=/usr/bin/mkdir -p /dev/shm/swh-graph/default (code=exited, status=0/SUCCESS)
    Process: 3459 ExecStart=/usr/bin/sh -c ln -s /srv/softwareheritage/ssd/graph/2020-12-15/compressed/* /dev/shm/swh-graph/default (c
    Process: 3476 ExecStart=/usr/bin/sh -c cp --remove-destination /srv/softwareheritage/ssd/graph/2020-12-15/compressed/graph.graph /
    Process: 65920 ExecStart=/usr/bin/sh -c cp --remove-destination /srv/softwareheritage/ssd/graph/2020-12-15/compressed/graph-transp
   Main PID: 65920 (code=exited, status=0/SUCCESS)

Nov 03 08:31:53 granet systemd[1]: Starting swh graph shm mapper...
Nov 03 08:37:25 granet systemd[1]: Finished swh graph shm mapper.
* swhgraphdev.service - swh graph
     Loaded: loaded (/etc/systemd/system/swhgraphdev.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2021-11-03 08:37:25 UTC; 4min 52s ago
   Main PID: 107224 (swh)
      Tasks: 62 (limit: 629145)
     Memory: 50.2G
     CGroup: /system.slice/swhgraphdev.service
             |-107224 /opt/swhgraph_venv/bin/python3 /opt/swhgraph_venv/bin/swh graph rpc-serve -g /dev/shm/swh-graph/default/graph
             `-107333 java -classpath /opt/swhgraph_venv/share/py4j/py4j0.10.8.1.jar:/opt/swhgraph_venv/share/swh-graph/swh-graph-0.5.

Nov 03 08:37:25 granet systemd[1]: Started swh graph.
Nov 03 08:37:26 granet swh[107224]: INFO:root:using swh-graph JAR: /opt/swhgraph_venv/share/swh-graph/swh-graph-0.5.0.jar
Nov 03 08:37:27 granet swh[107333]: Loading graph /dev/shm/swh-graph/default/graph ...
Nov 03 08:39:13 granet swh[107333]: Graph loaded.
Nov 03 08:39:28 granet swh[107224]: INFO:aiohttp.access:127.0.0.1 [03/Nov/2021:08:39:28 +0000] "GET / HTTP/1.1" 200 485 "-" "check_http/v2.2 (monitoring-plugins 2.2)"
Nov 03 08:40:28 granet swh[107224]: INFO:aiohttp.access:127.0.0.1 [03/Nov/2021:08:40:28 +0000] "GET / HTTP/1.1" 200 485 "-" "check_http/v2.2 (monitoring-plugins 2.2)"
Nov 03 08:41:28 granet swh[107224]: INFO:aiohttp.access:127.0.0.1 [03/Nov/2021:08:41:28 +0000] "GET / HTTP/1.1" 200 485 "-" "check_http/v2.2 (monitoring-plugins 2.2)"
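
The manual `systemctl status` check above could also be scripted, for example to re-run it after a future reboot. The sketch below is an assumption-laden illustration, not the procedure that was actually used: it relies on `systemctl is-active`, which prints the unit state, and the helper name `unit_is_active` is made up here. The `run` parameter exists only so the function can be exercised without systemd.

```python
import subprocess


def unit_is_active(unit, run=subprocess.run):
    """Return True if `systemctl is-active <unit>` reports 'active'."""
    try:
        result = run(
            ["systemctl", "is-active", unit],
            capture_output=True,
            text=True,
            check=False,  # a non-zero exit code just means "not active"
        )
    except FileNotFoundError:
        return False  # systemctl is not available on this host
    return result.stdout.strip() == "active"


if __name__ == "__main__":
    # Unit names taken from the status output above.
    for unit in ("swhgraphshm.service", "swhgraphdev.service"):
        print(unit, unit_is_active(unit))
```

A oneshot-style unit shown as `active (exited)` in `systemctl status`, like swhgraphshm.service above, is still reported as `active` by `is-active`.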