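So far I have bought five used HDDs.
The third one (bought in June 2016) reached its limit (its Reallocated_Sector_Ct value fell below THRESH), so I bought a sixth as a replacement. Once again I stubbornly chose a Seagate Barracuda ES.2 1TB, because I think sticking with the same model builds more experience.
As usual, the initial check starts with the S.M.A.R.T. values.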
smartctl 5.43 2016-09-28 r4347 [x86_64-linux-2.6.32-754.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 082 063 044 Pre-fail Always - 168507570
3 Spin_Up_Time 0x0003 097 091 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 127
5 Reallocated_Sector_Ct 0x0033 099 099 036 Pre-fail Always - 39
7 Seek_Error_Rate 0x000f 061 060 030 Pre-fail Always - 4296392929
9 Power_On_Hours 0x0032 088 088 000 Old_age Always - 10830
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 1
12 Power_Cycle_Count 0x0032 099 037 020 Old_age Always - 1413
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 068 046 045 Old_age Always - 32 (Min/Max 25/32)
194 Temperature_Celsius 0x0022 032 054 000 Old_age Always - 32 (0 24 0 0 0)
195 Hardware_ECC_Recovered 0x001a 048 004 000 Old_age Always - 168507570
197 Current_Pending_Sector 0x0012 002 002 000 Old_age Always - 2008
198 Offline_Uncorrectable 0x0010 002 002 000 Old_age Offline - 2008
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
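Power-on time was 10830 hours (about 451 days), but Power_Cycle_Count is on the high side at 1413 (the drives I obtained earlier were around 100), so the drive was probably powered on only when it was needed. In addition, Current_Pending_Sector is high at 2008, so left as it is the drive would run into I/O errors before long.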
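The arithmetic behind that reading can be checked with a few lines of Python. This is only a sketch: the two input numbers are copied from the SMART table above, and the helper names `days_and_hours` and `hours_per_power_cycle` are made up for illustration.

```python
# Raw values copied from the SMART table above.
power_on_hours = 10830
power_cycle_count = 1413


def days_and_hours(hours):
    """Split an hour count into whole days and leftover hours."""
    return divmod(hours, 24)


def hours_per_power_cycle(hours, cycles):
    """Average powered-on time per power cycle, in hours."""
    return hours / cycles


days, rest = days_and_hours(power_on_hours)
avg = hours_per_power_cycle(power_on_hours, power_cycle_count)
print(f"{days} days + {rest} hours, about {avg:.1f} hours per power cycle")
```

An average of under eight hours per power cycle is consistent with a drive that was switched on for a working session and then switched off again, although the counters alone cannot prove how it was used.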
Of the six used HDDs I have obtained so far, this one is in the worst condition, but it was sold as junk, so I got it very cheaply (the lowest price of the six).
In my experience, an HDD in this state can often be refreshed with SecureErase or with the procedure linked here, and I expect it to remain quite usable as a member of a ZFS raid pool.
So this time I refreshed the drive with the linked procedure rather than SecureErase.
The results are as follows.
smartctl 5.43 2016-09-28 r4347 [x86_64-linux-2.6.32-754.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 060 060 044 Pre-fail Always - 205654349
3 Spin_Up_Time 0x0003 098 091 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 128
5 Reallocated_Sector_Ct 0x0033 099 099 036 Pre-fail Always - 39
7 Seek_Error_Rate 0x000f 066 060 030 Pre-fail Always - 4299157096
9 Power_On_Hours 0x0032 088 088 000 Old_age Always - 11030
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 1
12 Power_Cycle_Count 0x0032 099 037 020 Old_age Always - 1414
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 024 024 000 Old_age Always - 76
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 066 046 045 Old_age Always - 34 (Min/Max 31/34)
194 Temperature_Celsius 0x0022 034 054 000 Old_age Always - 34 (0 24 0 0 0)
195 Hardware_ECC_Recovered 0x001a 052 004 000 Old_age Always - 205654349
197 Current_Pending_Sector 0x0012 100 002 000 Old_age Always - 3
198 Offline_Uncorrectable 0x0010 100 002 000 Old_age Offline - 3
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
Current_Pending_Sector did not drop to zero, but it went down from 2008 to 3.
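Comparing the two attribute tables by eye is error-prone, so here is a minimal Python sketch that parses a few rows and reports which raw values changed. The rows are copied from the two tables above; the function names `parse_attributes` and `changed_raw_values` are hypothetical, and the parser assumes the ten-column layout that this smartctl 5.43 output uses.

```python
BEFORE = """\
  5 Reallocated_Sector_Ct 0x0033 099 099 036 Pre-fail Always - 39
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 002 002 000 Old_age Always - 2008
198 Offline_Uncorrectable 0x0010 002 002 000 Old_age Offline - 2008
"""

AFTER = """\
  5 Reallocated_Sector_Ct 0x0033 099 099 036 Pre-fail Always - 39
187 Reported_Uncorrect 0x0032 024 024 000 Old_age Always - 76
197 Current_Pending_Sector 0x0012 100 002 000 Old_age Always - 3
198 Offline_Uncorrectable 0x0010 100 002 000 Old_age Offline - 3
"""


def parse_attributes(text):
    """Map attribute name -> (VALUE, THRESH, first integer of RAW_VALUE)."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # A data row has at least 10 columns and starts with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs[fields[1]] = (int(fields[3]), int(fields[5]), int(fields[9]))
    return attrs


def changed_raw_values(before, after):
    """Return {name: (raw_before, raw_after)} for raw values that differ."""
    return {
        name: (before[name][2], after[name][2])
        for name in before
        if name in after and before[name][2] != after[name][2]
    }


changes = changed_raw_values(parse_attributes(BEFORE), parse_attributes(AFTER))
for name, (old, new) in changes.items():
    print(f"{name}: {old} -> {new}")
```

Besides the drop in pending sectors, the same comparison shows Reported_Uncorrect rising from 0 to 76 during the refresh, while Reallocated_Sector_Ct stays at 39.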
Using it as a standalone disk would be dangerous, but in my experience it should still be usable in a ZFS raid pool, so I went ahead and added it to one.
[root@hoge ~]# zpool status tankQ
pool: tankQ
state: ONLINE
scan: resilvered 104K in 0h0m with 0 errors on Thu Oct 18 17:36:46 2018
config:
NAME STATE READ WRITE CKSUM
tankQ ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
tankQf ONLINE 0 0 0
tankQk ONLINE 0 0 0
tankQe ONLINE 0 0 0
tankQc ONLINE 0 0 0
errors: No known data errors
ZFS now reports no errors, and zpool scrub no longer reports any errors either. Note that in tankQ each disk is encrypted with LUKS before it is used.
The rest of the initial-check data follows.
[root@hoge ~]# hdparm -i /dev/sdk
/dev/sdk:
Model=ST31000340NS, FwRev=SN06, SerialNo=9xxxxxxH
Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% }
RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
BuffType=unknown, BuffSize=unknown, MaxMultSect=16, MultSect=16
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=1953525168
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6
AdvancedPM=no WriteCache=enabled
Drive conforms to: unknown: ATA/ATAPI-4,5,6,7
* signifies the current active mode
[root@hoge ~]# hdparm -I /dev/sdk
/dev/sdk:
ATA device, with non-removable media
Model Number: ST31000340NS
Serial Number: 9xxxxxxH
Firmware Revision: SN06
Transport: Serial
Standards:
Used: unknown (minor revision code 0x0029)
Supported: 8 7 6 5
Likely used: 8
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 1953525168
Logical/Physical Sector size: 512 bytes
device size with M = 1024*1024: 953869 MBytes
device size with M = 1000*1000: 1000204 MBytes (1000 GB)
cache/buffer size = unknown
Nominal Media Rotation Rate: 7200
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16 Current = ?
Recommended acoustic management value: 254, current value: 0
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART feature set
Security Mode feature set
* Power Management feature set
* Write cache
* Look-ahead
* Host Protected Area feature set
* WRITE_BUFFER command
* READ_BUFFER command
* DOWNLOAD_MICROCODE
SET_MAX security extension
* 48-bit Address feature set
* Device Configuration Overlay feature set
* Mandatory FLUSH_CACHE
* FLUSH_CACHE_EXT
* SMART error logging
* SMART self-test
* General Purpose Logging feature set
* 64-bit World wide name
Write-Read-Verify feature set
* WRITE_UNCORRECTABLE_EXT command
* {READ,WRITE}_DMA_EXT_GPL commands
* Segmented DOWNLOAD_MICROCODE
* Gen1 signaling speed (1.5Gb/s)
* Gen2 signaling speed (3.0Gb/s)
* Native Command Queueing (NCQ)
* Phy event counters
* Software settings preservation
* SMART Command Transport (SCT) feature set
* SCT Write Same (AC2)
* SCT Error Recovery Control (AC3)
* SCT Features Control (AC4)
* SCT Data Tables (AC5)
unknown 206[12] (vendor specific)
Security:
Master password revision code = 65534
supported
not enabled
not locked
not frozen
not expired: security count
supported: enhanced erase
192min for SECURITY ERASE UNIT. 192min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 5000c500yyyyyyy9
NAA : 5
IEEE OUI : 000c50
Unique ID : 0yyyyyyy9
Checksum: correct
smartctl 5.43 2016-09-28 r4347 [x86_64-linux-2.6.32-754.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda ES.2
Device Model: ST31000340NS
Serial Number: 9xxxxxxH
LU WWN Device Id: 5 000c50 0yyyyyyy9
Firmware Version: SN06
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Thu Oct 18 17:49:17 2018 JST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 625) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 225) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x103d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 060 060 044 Pre-fail Always - 205654349
3 Spin_Up_Time 0x0003 098 091 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 128
5 Reallocated_Sector_Ct 0x0033 099 099 036 Pre-fail Always - 39
7 Seek_Error_Rate 0x000f 066 060 030 Pre-fail Always - 4299157087
9 Power_On_Hours 0x0032 088 088 000 Old_age Always - 11030
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 1
12 Power_Cycle_Count 0x0032 099 037 020 Old_age Always - 1414
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 024 024 000 Old_age Always - 76
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 066 046 045 Old_age Always - 34 (Min/Max 31/34)
194 Temperature_Celsius 0x0022 034 054 000 Old_age Always - 34 (0 24 0 0 0)
195 Hardware_ECC_Recovered 0x001a 052 004 000 Old_age Always - 205654349
197 Current_Pending_Sector 0x0012 100 002 000 Old_age Always - 3
198 Offline_Uncorrectable 0x0010 100 002 000 Old_age Offline - 3
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
SMART Error Log Version: 1
ATA Error Count: 119 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 119 occurred at disk power-on lifetime: 11004 hours (458 days + 12 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 fd 25 6c 00 Error: UNC at LBA = 0x006c25fd = 7087613
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 e0 b0 26 6c 40 00 7d+05:46:35.674 READ FPDMA QUEUED
60 00 e0 d0 25 6c 40 00 7d+05:46:35.669 READ FPDMA QUEUED
60 00 f0 d8 24 6c 40 00 7d+05:46:35.669 READ FPDMA QUEUED
60 00 28 78 25 6c 40 00 7d+05:46:35.664 READ FPDMA QUEUED
60 00 30 a8 24 6c 40 00 7d+05:46:35.663 READ FPDMA QUEUED
Error 118 occurred at disk power-on lifetime: 10830 hours (451 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 71 04 9d 00 32 40 Device Fault; Error: ABRT 4 sectors at LBA = 0x0032009d = 3276957
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 d0 e1 7d 40 00 00:15:51.654 READ DMA EXT
25 00 08 d0 e1 7d 40 00 00:15:51.654 READ DMA EXT
25 00 08 d0 e1 7d 40 00 00:15:51.654 READ DMA EXT
25 00 08 d0 e1 7d 40 00 00:15:51.654 READ DMA EXT
25 00 08 d0 e1 7d 40 00 00:15:51.653 READ DMA EXT
Error 117 occurred at disk power-on lifetime: 10830 hours (451 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 71 04 9d 00 32 40 Device Fault; Error: ABRT 4 sectors at LBA = 0x0032009d = 3276957
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 d0 e1 7d 40 00 00:15:51.653 READ DMA EXT
25 00 08 c8 c5 2d 40 00 00:15:51.209 READ DMA EXT
25 00 08 c8 c5 2d 40 00 00:15:51.209 READ DMA EXT
25 00 08 c8 c5 2d 40 00 00:15:51.209 READ DMA EXT
25 00 08 c8 c5 2d 40 00 00:15:51.208 READ DMA EXT
Error 116 occurred at disk power-on lifetime: 10830 hours (451 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 71 04 9d 00 32 40 Device Fault; Error: ABRT 4 sectors at LBA = 0x0032009d = 3276957
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 c8 c5 2d 40 00 00:15:51.209 READ DMA EXT
25 00 08 c8 c5 2d 40 00 00:15:51.209 READ DMA EXT
25 00 08 c8 c5 2d 40 00 00:15:51.209 READ DMA EXT
25 00 08 c8 c5 2d 40 00 00:15:51.208 READ DMA EXT
25 00 08 c8 c5 2d 40 00 00:15:51.208 READ DMA EXT
Error 115 occurred at disk power-on lifetime: 10830 hours (451 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 71 04 9d 00 32 40 Device Fault; Error: ABRT 4 sectors at LBA = 0x0032009d = 3276957
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 c8 c5 2d 40 00 00:15:51.208 READ DMA EXT
25 00 08 b8 c6 2d 40 00 00:15:51.081 READ DMA EXT
25 00 08 b8 c6 2d 40 00 00:15:51.081 READ DMA EXT
25 00 08 b8 c6 2d 40 00 00:15:51.080 READ DMA EXT
25 00 08 b8 c6 2d 40 00 00:15:51.080 READ DMA EXT
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 11007 -
# 2 Short offline Completed without error 00% 11004 -
# 3 Short offline Completed without error 00% 10953 -
# 4 Selective offline Completed without error 00% 10837 -
# 5 Selective offline Completed: read failure 90% 10837 1887270886
# 6 Selective offline Completed: read failure 90% 10837 1887261750
# 7 Selective offline Completed: read failure 90% 10836 1887217021
# 8 Selective offline Completed: read failure 90% 10833 63511735
# 9 Selective offline Completed: read failure 90% 10833 63502125
#10 Selective offline Completed: read failure 90% 10833 63490659
#11 Selective offline Completed: read failure 90% 10833 12121842
#12 Selective offline Completed: read failure 90% 10833 12110355
#13 Selective offline Completed: read failure 90% 10833 12098051
#14 Selective offline Completed: read failure 90% 10833 12089280
#15 Selective offline Completed: read failure 90% 10833 12078170
#16 Selective offline Completed: read failure 90% 10833 12068537
#17 Selective offline Completed: read failure 90% 10833 12059282
#18 Selective offline Completed: read failure 90% 10833 11972284
#19 Selective offline Completed: read failure 90% 10833 11957107
#20 Selective offline Completed: read failure 90% 10833 11947496
#21 Selective offline Completed: read failure 90% 10833 10545773
17 of 17 failed self-tests are outdated by newer successful extended offline self-test # 1
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 1887270886 1953525167 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
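HDD behavior varies considerably from model to model, but if you read this article and attempt a refresh yourself, I think the Self-test log will be a useful reference. Note that some HDD models apparently cannot display a Self-test log (perhaps the feature is not implemented?).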
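As one way to use the Self-test log, the following Python sketch pulls the LBA_of_first_error column out of the log rows. The sample rows are a subset copied from the log above; the regular expression and the names `parse_self_tests` and `failed_lbas` are my own assumptions about how to read this particular smartctl 5.43 layout, not an official format definition.

```python
import re

SELF_TEST_LOG = """\
# 1 Extended offline Completed without error 00% 11007 -
# 4 Selective offline Completed without error 00% 10837 -
# 5 Selective offline Completed: read failure 90% 10837 1887270886
# 8 Selective offline Completed: read failure 90% 10833 63511735
#21 Selective offline Completed: read failure 90% 10833 10545773
"""

# Row number, description plus status, remaining %, lifetime hours, LBA or "-".
ROW = re.compile(r"^#\s*(\d+)\s+(.+?)\s+(\d+)%\s+(\d+)\s+(\d+|-)\s*$")


def parse_self_tests(text):
    """Parse self-test log rows into dictionaries; non-matching lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue
        num, desc, remaining, hours, lba = m.groups()
        rows.append({
            "num": int(num),
            "text": desc,
            "remaining": int(remaining),
            "hours": int(hours),
            "lba": None if lba == "-" else int(lba),
        })
    return rows


def failed_lbas(rows):
    """Return the first-error LBA of every test that recorded one."""
    return [r["lba"] for r in rows if r["lba"] is not None]


rows = parse_self_tests(SELF_TEST_LOG)
print(failed_lbas(rows))
```

Each reported LBA marks where a selective self-test stopped on a read failure, which is the information needed to decide where the next selective test span should start.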
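Finally, a warning: an HDD in this condition is usable only with something like ZFS or Btrfs, which implement end-to-end data checksums, and only in a raid configuration. Standalone use is of course out of the question, and I think using it in a hardware RAID is dangerous too, so please be careful. It is fine for experiments to learn how HDDs and the OS behave, though...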
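To illustrate why both conditions matter, here is a toy Python sketch of a checksummed read with redundancy. It is not how ZFS or Btrfs are implemented: it assumes two plain copies of a block (mirror-like, whereas tankQ is raidz1) and uses SHA-256 simply because it is in the standard library. The names `checksum` and `read_with_repair` are hypothetical.

```python
import hashlib


def checksum(data):
    """Checksum stored separately from the data block it describes."""
    return hashlib.sha256(data).hexdigest()


def read_with_repair(copies, expected):
    """Return the first copy that matches the checksum and rewrite bad copies.

    copies is a mutable list of bytes objects holding redundant copies of
    one block. Raises IOError when no copy passes verification.
    """
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        raise IOError("all copies failed checksum verification")
    for i, c in enumerate(copies):
        if checksum(c) != expected:
            copies[i] = good  # repair the damaged copy from the good one
    return good


block = b"backup data"
expected = checksum(block)
copies = [b"backup dat\x00", block]  # first copy silently corrupted
data = read_with_repair(copies, expected)
print(data, copies[0] == block)
```

Without the checksum the corrupted first copy could be returned as if it were valid, and without the second copy the corruption could be detected but not repaired.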
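I do not use tankQ as a primary data area myself; I use it as a secondary area for backups and the like, where losing it in the worst case is acceptable. As a humble OS engineer, my main aim is to gain hands-on experience with Linux (CentOS6) + ZFS and with HDD behavior, particularly the recovery behavior when sector errors occur.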