Using S.M.A.R.T. under NetBSD

NetBSD has supported S.M.A.R.T. for a long time, but the functionality is well hidden. You can enable S.M.A.R.T. on a single disk and query its status like this:

# atactl wd0 smart enable
SMART supported, SMART enabled
# atactl wd0 smart status
SMART supported, SMART enabled
id value thresh crit collect reliability description                    raw
1 200   51     yes online  positive    Raw read error rate            0
3 151   21     yes online  positive    Spin-up time                   9441
4 100    0     no  online  positive    Start/stop count               16
5 200  140     yes online  positive    Reallocated sector count       0
7 200    0     no  online  positive    Seek error rate                0
9  89    0     no  online  positive    Power-on hours count           8477
10 100    0     no  online  positive    Spin retry count               0
11 100    0     no  online  positive    Calibration retry count        0
12 100    0     no  online  positive    Device power cycle count       15
192 200    0     no  online  positive    Power-off retract count        4
193 134    0     no  online  positive    Load cycle count               199998
194 114    0     no  online  positive    Temperature                    38
196 200    0     no  online  positive    Reallocated event count        0
197 200    0     no  online  positive    Current pending sector         0
198 100    0     no  offline positive    Offline uncorrectable          0
199 200    0     no  online  positive    Ultra DMA CRC error count      0
200 100    0     no  offline positive    Write error rate               0
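To pull a single number out of that table from a script, a small filter is enough. The following is only a sketch: smart_raw is a hypothetical helper name, and it assumes that the raw value is the last field on the line, as it is in the sample output above. That may not hold for every attribute on every drive.

```shell
# smart_raw ID (hypothetical helper): read the output of
# "atactl <disk> smart status" on stdin and print the raw value of the
# attribute whose id equals ID.
# Assumption: the raw value is the last whitespace-separated field.
smart_raw() {
        awk -v id="$1" '$1 == id { print $NF }'
}
```

With that in place, something like "atactl wd0 smart status | smart_raw 5" should print the reallocated sector count of wd0.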

While these commands are very useful for manual checks, they don’t provide automatic health reporting. The recent abrupt failure of the backup hard disk in a friend’s machine reminded me of how important such monitoring is. I therefore decided to implement an automated solution on top of NetBSD’s S.M.A.R.T. support.

The first step was to enable S.M.A.R.T. at system startup. I added the following lines to /etc/rc.local to make that happen:

echo "Turning on S.M.A.R.T.:"
for disk in $(sysctl -n hw.disknames | tr " " \\n | grep ^wd)
do
        echo -n "${disk}: "
        atactl $disk smart enable
done

Now I only needed something that checks the reported metrics every night. I therefore added the following snippet to /etc/daily.local:

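# Report every wd(4) disk whose "Reallocated sector count" is non-zero.
found=
for disk in $(sysctl -n hw.disknames | tr " " \\n | grep ^wd)
do
        # The raw value is whatever follows the attribute description.
        relocated=$(atactl $disk smart status |
          sed -n -e 's/.* Reallocated sector count[^0-9]*//p')
        # Quote and default to 0 so that the test does not fail with a
        # syntax error when a disk does not report this attribute.
        if [ "${relocated:-0}" -gt 0 ]; then
                if [ -z "$found" ]; then
                        # Print the section header only once.
                        found=true
                        echo ""
                        echo "SMART checks:"
                fi
                echo "Disk $disk has $relocated reallocated sectors."
        fi
done
unset disk found relocated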

The above shell code reports any IDE and SATA hard disks with reallocated sectors. If a hard disk reports a lot of reallocated sectors, or if their number grows quickly, the disk will probably fail very soon.
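Spotting a growing count requires remembering the value from the previous run. The sketch below shows one way that could be done. It is not part of my setup: report_growth is a hypothetical helper, and the state directory /var/db/smart (overridable through SMART_STATE_DIR) is an assumption, so pick any location that survives reboots.

```shell
# Directory for the saved counts. Both the variable name and the default
# path are assumptions made for this sketch.
state_dir=${SMART_STATE_DIR:-/var/db/smart}

# report_growth DISK COUNT (hypothetical helper): compare COUNT with the
# value saved by the previous run, print a warning if it increased, then
# save COUNT for the next run.
report_growth() {
        disk=$1
        current=$2
        file="$state_dir/$disk.reallocated"
        previous=0
        if [ -r "$file" ]; then
                # An empty or unreadable file counts as "no previous value".
                read -r previous < "$file" || previous=0
        fi
        if [ "$current" -gt "${previous:-0}" ]; then
                echo "Disk $disk: reallocated sectors grew from ${previous:-0} to $current."
        fi
        mkdir -p "$state_dir" && echo "$current" > "$file"
}
```

Inside the loop in /etc/daily.local this could be called as report_growth "$disk" "${relocated:-0}". On the first run there is no saved value, so any non-zero count is reported as growth from 0.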

Let’s hope that this way I will get an advance warning before the next major catastrophe.