ECC 内存故障

一台很老旧的服务器terminal不断吐出一些错误提示:

1
2
3
4
5
6
7
\[ 1250.944486\] mce: \[Hardware Error\]: Machine check events logged
\[ 1250.944493\] \[Hardware Error\]: Corrected error, no action required.
\[ 1250.948666\] \[Hardware Error\]: CPU:24 (10:9:1) MC4_STATUS\[OverCEMiscV-AddrVCECC\]: 0xdc0a400079080a13
\[ 1250.952631\] \[Hardware Error\]: Error Addr: 0x00000004abffce80
\[ 1250.952633\] \[Hardware Error\]: MC4 Error (node 6): DRAM ECC error detected on the NB.
\[ 1250.952654\] EDAC MC6: 1 CE on mc#6csrow#3channel#0 (csrow:3 channel:0 page:0x4abffc offset:0xe80 grain:0 syndrome:0x7914)
\[ 1250.952656\] \[Hardware Error\]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)

可以看到是内存出现了错误,不过错误被纠正了,但内存显然是出现故障了。

先看看系统cpu节点信息

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 4
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 16
Model: 9
Model name: AMD Opteron(tm) Processor 6128
Stepping: 1
CPU MHz: 800.000
CPU max MHz: 2000.0000
CPU min MHz: 800.0000
BogoMIPS: 4000.04
Virtualization: AMD-V
L1d cache: 64K
L1i cache: 64K
L2 cache: 512K
L3 cache: 5118K
NUMA node0 CPU(s): 0-3
NUMA node1 CPU(s): 4-7
NUMA node2 CPU(s): 8-11
NUMA node3 CPU(s): 12-15
NUMA node4 CPU(s): 16-19
NUMA node5 CPU(s): 20-23
NUMA node6 CPU(s): 24-27
NUMA node7 CPU(s): 28-31
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt nodeid_msr hw_pstate vmmcall npt lbrv svm_lock nrip_save pausefilter

共有四个socket,四颗cpu,每颗CPU有八个核心,总共是32核心,对于NUMA结构的机器,一般来讲每颗CPU应该至少有一个本地的内存控制器

安装edac-util,查看内存控制器信息

1
2
3
4
5
6
7
$ sudo apt install edac-utils
$ edac-util -vs
edac-util: EDAC drivers are loaded. 4 MCs detected:
mc0:F10h
mc2:F10h
mc4:F10h
mc6:F10h

可以看到有四个内存控制器,再查看每个内存控制器可能存在的错误

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26

$ edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc2: 0 Uncorrected Errors with no DIMM info
mc2: 0 Corrected Errors with no DIMM info
mc2: csrow2: 0 Uncorrected Errors
mc2: csrow2: mc#2csrow#2channel#0: 0 Corrected Errors
mc2: csrow3: 0 Uncorrected Errors
mc2: csrow3: mc#2csrow#3channel#0: 0 Corrected Errors
mc4: 0 Uncorrected Errors with no DIMM info
mc4: 0 Corrected Errors with no DIMM info
mc4: csrow2: 0 Uncorrected Errors
mc4: csrow2: mc#4csrow#2channel#0: 0 Corrected Errors
mc4: csrow3: 0 Uncorrected Errors
mc4: csrow3: mc#4csrow#3channel#0: 0 Corrected Errors
mc6: 0 Uncorrected Errors with no DIMM info
mc6: 0 Corrected Errors with no DIMM info
mc6: csrow2: 0 Uncorrected Errors
mc6: csrow2: mc#6csrow#2channel#0: 2 Corrected Errors
mc6: csrow3: 0 Uncorrected Errors
mc6: csrow3: mc#6csrow#3channel#0: 4 Corrected Errors

或者这样查看:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
$ grep \[0-9\] /sys/devices/system/edac/mc/mc*/csrow*/*ce_count
/sys/devices/system/edac/mc/mc0/csrow2/ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow3/ce_count:0
/sys/devices/system/edac/mc/mc0/csrow3/ch0_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow2/ce_count:0
/sys/devices/system/edac/mc/mc2/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow3/ce_count:0
/sys/devices/system/edac/mc/mc2/csrow3/ch0_ce_count:0
/sys/devices/system/edac/mc/mc4/csrow2/ce_count:0
/sys/devices/system/edac/mc/mc4/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc4/csrow3/ce_count:0
/sys/devices/system/edac/mc/mc4/csrow3/ch0_ce_count:0
/sys/devices/system/edac/mc/mc6/csrow2/ce_count:3
/sys/devices/system/edac/mc/mc6/csrow2/ch0_ce_count:3
/sys/devices/system/edac/mc/mc6/csrow3/ce_count:6
/sys/devices/system/edac/mc/mc6/csrow3/ch0_ce_count:6

可以看到出现错误的内存位于MC6,csrow2和csrow3,也就是问题出现在第四个(CPU的)内存控制器的0通道DIMM0内存这里。

References:
[1]How to identify defective DIMM from EDAC error on Linux
[2]内存条物理结构分析