I hit an SSD with an early failure in production

Good evening! This is halfrack, fresh off an incident response!
An alert came in suggesting that mysqld had gone down, so I took a look and found the following messages on the Xen DomU.

-bash-3.2# dmesg | tail
IPVS: Registered protocols (TCP, UDP, AH, ESP)
IPVS: Connection hash table configured (size=4096, memory=64Kbytes)
IPVS: ipvs loaded.
ip_tables: (C) 2000-2006 Netfilter Core Team
Netfilter messages via NETLINK v0.30.
ip_conntrack version 2.4 (8192 buckets, 65536 max) - 304 bytes per conntrack
IPv4 over IPv4 tunneling driver
end_request: I/O error, dev xvda1, sector 282578400
end_request: I/O error, dev xvda1, sector 282578328
end_request: I/O error, dev xvda1, sector 282578416
-bash-3.2# 

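This host runs on an SSD, and I had a hunch about what was going on, so I ran smartctl against it.
Sure enough, the RAW value of Reallocated_Sector_Ct is counting up like crazy, woohoo!!!
I have also never seen Reported_Uncorrect, Hardware_ECC_Recovered, and Reallocated_Event_Count look like this on a host with only about 150 power-on hours (the errors were logged at 151 hours).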

[root@x99xxx99x99 ~]# smartctl -a /dev/sda
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     M4-CT512M4SSD2
Serial Number:    00000000111403xxxxxx
Firmware Version: 0001
User Capacity:    512,110,190,592 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   9
ATA Standard is:  Not recognized. Minor revision code: 0x28
Local Time is:    Sat Jul  2 03:02:18 2011 JST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

(snip)

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   050    Pre-fail  Always       -       5
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       28672
  9 Power_On_Hours          0x0032   100   100   001    Old_age   Always       -       153
 12 Power_Cycle_Count       0x0032   100   100   001    Old_age   Always       -       3
(snip)
187 Reported_Uncorrect      0x0032   100   100   001    Old_age   Always       -       12
(snip)
195 Hardware_ECC_Recovered  0x003c   100   100   001    Old_age   Offline      -       1226
196 Reallocated_Event_Count 0x0032   100   100   001    Old_age   Always       -       7
197 Current_Pending_Sector  0x0032   100   100   001    Old_age   Always       -       0
(snip)

SMART Error Log Version: 1
Warning: ATA error count 0 inconsistent with error log pointer 2
(snip)
Error 0 occurred at disk power-on lifetime: 151 hours (6 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 52 00 fa 9a e0  Error: UNC 82 sectors at LBA = 0x009afa00 = 10156544

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 a8 aa f9 9a e0 00   6d+07:57:00.000  READ DMA EXT
  27 00 00 00 00 00 e0 00   6d+07:57:00.000  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   6d+07:57:00.000  IDENTIFY DEVICE
  ef 03 45 00 00 00 a0 00   6d+07:57:00.000  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00   6d+07:57:00.000  READ NATIVE MAX ADDRESS EXT
(snip)
[root@x99xxx99x99 ~]# 

This is my first encounter with an early SSD failure and it is the middle of the night, so I am oddly hyped up, but here is what I want to say!

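  • SSDs can apparently have early failures too
  • At least on Intel/Marvell-based controllers, the Reallocated_Sector_Ct value can be used to judge a failure
  • The SSD's internal ECC seems to be doing its job, and broken blocks properly return a Medium Error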

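This helps with HDDs too, so my takeaway is: graph your bad-sector counts.
Also, judging from the power-on hours and a comparison with other hosts, this looks every bit like an early failure, so I published at the speed of light, but this entry was written in a real rush.
Something in it may turn out to be wrong once I look into things properly. Please keep an eye out for follow-ups.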
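As a rough illustration of "graph your bad-sector counts", here is a minimal Python sketch that pulls the raw values out of a smartctl attribute table so they can be fed to whatever graphing tool you use. The names WATCHED, parse_attributes and watched_metrics are made up for this example, and the regular expression assumes the column layout shown in the smartctl 5.38 output in this post; other versions or drives may print the table differently.

```python
import re

# Attributes this post calls out; adjust for your own drives (assumption).
WATCHED = (
    "Reallocated_Sector_Ct",
    "Reported_Uncorrect",
    "Hardware_ECC_Recovered",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
)

# One row of the attribute table:
# ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}"
    r"\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_attributes(smartctl_output):
    """Return {attribute_id: (name, raw_value)} for each attribute row.

    Keyed by ID because several rows can share the name Unknown_Attribute.
    """
    attrs = {}
    for line in smartctl_output.splitlines():
        m = ROW.match(line)
        if m:
            attrs[int(m.group(1))] = (m.group(2), int(m.group(3)))
    return attrs


def watched_metrics(smartctl_output):
    """Return {name: raw_value} for just the attributes listed in WATCHED."""
    return {
        name: raw
        for name, raw in parse_attributes(smartctl_output).values()
        if name in WATCHED
    }


# Rows copied from the smartctl output in this post.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       28672
  9 Power_On_Hours          0x0032   100   100   001    Old_age   Always       -       153
187 Reported_Uncorrect      0x0032   100   100   001    Old_age   Always       -       12
195 Hardware_ECC_Recovered  0x003c   100   100   001    Old_age   Offline      -       1226
196 Reallocated_Event_Count 0x0032   100   100   001    Old_age   Always       -       7
197 Current_Pending_Sector  0x0032   100   100   001    Old_age   Always       -       0
"""

if __name__ == "__main__":
    for name, raw in sorted(watched_metrics(SAMPLE).items()):
        print(name, raw)
```

In practice you would feed it the text captured from something like smartctl -A /dev/sda on a schedule and record each value with a timestamp. The point is that a raw count that keeps climbing on a nearly new drive stands out on a graph, even while the normalized VALUE column still reads 100.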

Below is the raw log. The timestamps and such are about as raw as it gets, no?

[root@x99xxx99x99 ~]# smartctl -a /dev/sda
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     M4-CT512M4SSD2
Serial Number:    00000000111403xxxxxx
Firmware Version: 0001
User Capacity:    512,110,190,592 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   9
ATA Standard is:  Not recognized. Minor revision code: 0x28
Local Time is:    Sat Jul  2 03:02:18 2011 JST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                 (2380) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  39) minutes.
Conveyance self-test routine
recommended polling time:        (   3) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   050    Pre-fail  Always       -       5
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       28672
  9 Power_On_Hours          0x0032   100   100   001    Old_age   Always       -       153
 12 Power_Cycle_Count       0x0032   100   100   001    Old_age   Always       -       3
170 Unknown_Attribute       0x0033   100   100   010    Pre-fail  Always       -       7
171 Unknown_Attribute       0x0032   100   100   001    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   001    Old_age   Always       -       0
173 Unknown_Attribute       0x0033   100   100   010    Pre-fail  Always       -       0
174 Unknown_Attribute       0x0032   100   100   001    Old_age   Always       -       0
181 Unknown_Attribute       0x0022   100   100   001    Old_age   Always       -       1309979312215
183 Unknown_Attribute       0x0032   100   100   001    Old_age   Always       -       0
184 Unknown_Attribute       0x0033   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   001    Old_age   Always       -       12
188 Unknown_Attribute       0x0032   100   100   001    Old_age   Always       -       0
189 High_Fly_Writes         0x000e   100   100   001    Old_age   Always       -       216
195 Hardware_ECC_Recovered  0x003c   100   100   001    Old_age   Offline      -       1226
196 Reallocated_Event_Count 0x0032   100   100   001    Old_age   Always       -       7
197 Current_Pending_Sector  0x0032   100   100   001    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   001    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   001    Old_age   Always       -       0
202 TA_Increase_Count       0x0018   100   100   001    Old_age   Offline      -       0
206 Flying_Height           0x000e   100   100   001    Old_age   Always       -       0

SMART Error Log Version: 1
Warning: ATA error count 0 inconsistent with error log pointer 2

ATA Error Count: 0
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 0 occurred at disk power-on lifetime: 151 hours (6 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 52 00 fa 9a e0  Error: UNC 82 sectors at LBA = 0x009afa00 = 10156544

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 a8 aa f9 9a e0 00   6d+07:57:00.000  READ DMA EXT
  27 00 00 00 00 00 e0 00   6d+07:57:00.000  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   6d+07:57:00.000  IDENTIFY DEVICE
  ef 03 45 00 00 00 a0 00   6d+07:57:00.000  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00   6d+07:57:00.000  READ NATIVE MAX ADDRESS EXT

Error -1 occurred at disk power-on lifetime: 151 hours (6 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 52 00 fa 9a e0  Error: UNC 82 sectors at LBA = 0x009afa00 = 10156544

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 a8 aa f9 9a e0 00   6d+07:57:00.000  READ DMA EXT
  27 00 00 00 00 00 e0 00   6d+07:57:00.000  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   6d+07:57:00.000  IDENTIFY DEVICE
  ef 03 45 00 00 00 a0 00   6d+07:57:00.000  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00   6d+07:57:00.000  READ NATIVE MAX ADDRESS EXT

Error -2 occurred at disk power-on lifetime: 151 hours (6 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 52 00 fa 9a e0  Error: UNC 82 sectors at LBA = 0x009afa00 = 10156544

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 a8 aa f9 9a e0 00   6d+07:57:00.000  READ DMA EXT
  27 00 00 00 00 00 e0 00   6d+07:57:00.000  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   6d+07:57:00.000  IDENTIFY DEVICE
  ef 03 45 00 00 00 a0 00   6d+07:57:00.000  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00   6d+07:57:00.000  READ NATIVE MAX ADDRESS EXT

Error -3 occurred at disk power-on lifetime: 151 hours (6 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 52 00 fa 9a e0  Error: UNC 82 sectors at LBA = 0x009afa00 = 10156544

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 a8 aa f9 9a e0 00   6d+07:57:00.000  READ DMA EXT
  27 00 00 00 00 00 e0 00   6d+07:57:00.000  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   6d+07:57:00.000  IDENTIFY DEVICE
  ef 03 45 00 00 00 a0 00   6d+07:57:00.000  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00   6d+07:57:00.000  READ NATIVE MAX ADDRESS EXT

Error -4 occurred at disk power-on lifetime: 151 hours (6 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 52 00 fa 9a e0  Error: UNC 82 sectors at LBA = 0x009afa00 = 10156544

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 a8 aa f9 9a e0 00   6d+07:57:00.000  READ DMA EXT
  27 00 00 00 00 00 e0 00   6d+07:57:00.000  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   6d+07:57:00.000  IDENTIFY DEVICE
  ef 03 45 00 00 00 a0 00   6d+07:57:00.000  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00   6d+07:57:00.000  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

[root@x99xxx99x99 ~]# 

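About that "I had a hunch" remark: just a few days ago, an SSD we had been running normally in production failed. SSDs rarely break, so I had long been wondering when one would, and then one finally did.
I think digging into that one properly will yield some interesting findings, so please bear with me a little longer until I publish it.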