Using S.M.A.R.T. under NetBSD

NetBSD has supported S.M.A.R.T. for a long time, but the functionality is well hidden inside atactl. You can enable S.M.A.R.T. and check a single disk like this:

# atactl wd0 smart enable
SMART supported, SMART enabled
# atactl wd0 smart status
SMART supported, SMART enabled
 id value thresh crit collect reliability description                 raw
  1   200     51 yes  online  positive    Raw read error rate         0
  3   151     21 yes  online  positive    Spin-up time                9441
  4   100      0 no   online  positive    Start/stop count            16
  5   200    140 yes  online  positive    Reallocated sector count    0
  7   200      0 no   online  positive    Seek error rate             0
  9    89      0 no   online  positive    Power-on hours count        8477
 10   100      0 no   online  positive    Spin retry count            0
 11   100      0 no   online  positive    Calibration retry count     0
 12   100      0 no   online  positive    Device power cycle count    15
192   200      0 no   online  positive    Power-off retract count     4
193   134      0 no   online  positive    Load cycle count            199998
194   114      0 no   online  positive    Temperature                 38
196   200      0 no   online  positive    Reallocated event count     0
197   200      0 no   online  positive    Current pending sector      0
198   100      0 no   offline positive    Offline uncorrectable       0
199   200      0 no   online  positive    Ultra DMA CRC error count   0
200   100      0 no   offline positive    Write error rate            0
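For a quick manual look at a single attribute, the status output can be filtered by attribute id. The function below is only a sketch: the name smart_raw is made up for this post, and it assumes the layout shown above, where the id is the first field of a line and the raw value is the last one.

```shell
# Hypothetical helper: print the raw value of one S.M.A.R.T. attribute.
# Reads "atactl <disk> smart status" output on stdin.
# Assumption: the id is the first field and the raw value the last field.
smart_raw() {
        # $1 = attribute id, e.g. 194 for "Temperature"
        awk -v id="$1" '$1 == id { print $NF; exit }'
}

# Usage on a live system (commented out because it needs a real disk):
# atactl wd0 smart status | smart_raw 194
```

With the output above, asking for attribute 194 would print the raw temperature value, 38.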

While atactl is very useful for manual checks, it doesn’t provide automatic health reporting. The recent abrupt failure of the backup hard disk in a friend’s machine reminded me of the importance of such monitoring. I therefore decided to implement an automated solution on top of NetBSD’s S.M.A.R.T. support.

The first step was to enable S.M.A.R.T. at system startup. I added the following lines to /etc/rc.local to make that happen:

echo "Turning on S.M.A.R.T.:"
for disk in $(sysctl -n hw.disknames | tr " " \\n | grep ^wd)
do
        echo -n "${disk}: "
        atactl $disk smart enable
done
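The disk enumeration in that loop could also be wrapped in a small function. What follows is a sketch rather than what I actually run: the name wd_disks is invented, and the function takes the list of disk names as an argument so that it can be tried on a machine without NetBSD.

```shell
# Hypothetical helper: print one wd disk name per line.
# $1 = space-separated list of disk names, as printed by
#      "sysctl -n hw.disknames" on NetBSD.
wd_disks() {
        for name in $1; do
                case $name in
                wd[0-9]*) echo "$name" ;;   # keep IDE/SATA disks only
                esac
        done
}

# Usage on a live system (commented out because it needs NetBSD):
# for disk in $(wd_disks "$(sysctl -n hw.disknames)"); do
#         atactl $disk smart enable
# done
```

Unlike grep ^wd, the case pattern only accepts names where "wd" is followed by a digit.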

Now I only needed something that checks the reported metrics every night. I therefore added the following snippet to /etc/daily.local:

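found=
for disk in $(sysctl -n hw.disknames | tr " " \\n | grep ^wd)
do
        # Raw value of attribute 5; empty if the disk does not report it.
        relocated=$(atactl $disk smart status |
          sed -n -e 's/.* Reallocated sector count[^0-9]*//p')
        # Default to 0 so that an empty value does not break the test.
        if [ "${relocated:-0}" -gt 0 ]; then
                if [ -z "$found" ]; then
                        # Print the section header only once.
                        found=true
                        echo ""
                        echo "SMART checks:"
                fi
                echo "Disk $disk has $relocated reallocated sectors."
        fi
done
unset disk found relocated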

The above shell code reports any IDE and SATA hard disks with reallocated sectors. If a hard disk reports a lot of reallocated sectors, or their number grows quickly within a short time frame, the disk will probably fail very soon.
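Spotting a quickly growing count requires remembering the value from the previous run. The following sketch shows one way this could be done; it is not part of my setup. The function name report_growth, the state directory and the file naming are all assumptions.

```shell
# Hypothetical helper: compare the current reallocated sector count with
# the one saved by the previous run and report any increase.
# $1 = disk name, $2 = current count, $3 = state directory (assumed to exist)
report_growth() {
        rg_file="$3/$1.reallocated"
        rg_previous=0
        if [ -r "$rg_file" ]; then
                rg_previous=$(cat "$rg_file")
        fi
        if [ "$2" -gt "$rg_previous" ]; then
                echo "Disk $1: reallocated sectors grew from $rg_previous to $2."
        fi
        # Remember the current count for the next run.
        echo "$2" > "$rg_file"
        unset rg_file rg_previous
}

# Usage inside the nightly loop (the directory is an assumption):
# report_growth "$disk" "${relocated:-0}" /var/db/smart
```

The first run for a disk compares against 0, so a disk that already has reallocated sectors is reported once and then only again when the count increases.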

Let’s hope that this way I will get an advance warning before the next major catastrophe.
