When setting up your own home server, a proper monitoring system is one of the most important things to have in place. Monit is the monitoring tool I decided to use. It's easy to configure but still very powerful. I also decided to use the ZFS filesystem on my server. It's a bit more advanced than ext4 or NTFS, so there are more things to check regarding the health of your pools.
Thankfully, I found a very cool ZFS health check script on Calomel.org that sends an email if something is wrong with any of the pools. I decided to adjust it a bit to work well with Monit.
To be runnable by Monit, the script needs to return an exit code instead of sending an email. The next difference is that in the script below, lines 126 and 127 are uncommented (and lines 130 and 131 commented). This is to support the Ubuntu date format; if you want to use this script on FreeBSD, you want those lines to look like they do in the original script. Next, at the beginning of the script, I added support for input parameters for the maximum capacity and the scrub expiration of your zpools. This keeps the configuration in one place (i.e. Monit's configuration file). Last but not least is the "Output for monit user interface" section. As its title says, it prints the zpool status to the console so it can be recorded by Monit and displayed in its user interface.
#! /bin/bash
#
## ZFS health check script for monit.
## v1.0.2
#
## Should be compatible with FreeBSD and Linux. Tested on Ubuntu.
## If you want to use it on FreeBSD then go to Scrub Expired section and Trim Expired section
## and comment two Ubuntu date lines and uncomment two FreeBSD lines in Scrub Expired section.
## In Trim Expired section adjust the date format directly in the for loop's awk parameter.
#
## Assumed usage in monitrc (where 80 is max capacity in percentages
## and 691200 is scrub and trim expiration in seconds):
## check program zfs_health with path "/path/to/this/script 80 691200"
## if status != 0 then alert
#
## Scrub and Trim share the same expiration threshold for backward compatibility reasons.
#
## Original script from:
## Calomel.org
## https://calomel.org/zfs_health_check_script.html
## FreeBSD ZFS Health Check script
## zfs_health.sh @ Version 0.17
#
## Main difference from the original script is that this one exits
## with a return code instead of sending an e-mail
# Parameters
maxCapacity=$1 # in percentages
scrubExpire=$2 # in seconds (691200 = 8 days)
trimExpire=$2 # in seconds (691200 = 8 days)
usage="Usage: $0 maxCapacityInPercentages scrubExpireInSeconds\n"
if [ ! "${maxCapacity}" ]; then
printf "Missing arguments\n"
printf "${usage}"
exit 1
fi
if [ ! "${scrubExpire}" ]; then
printf "Missing second argument\n"
printf "${usage}"
exit 1
fi
# Output for monit user interface
printf "==== ZPOOL STATUS ====\n"
printf "$(/sbin/zpool status)"
printf "\n\n==== ZPOOL LIST ====\n"
printf "%s\n" "$(/sbin/zpool list)"
# Health - Check if all zfs volumes are in good condition. We are looking for
# any keyword signifying a degraded or broken array.
condition=$(/sbin/zpool status | grep -E 'DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED|FAIL|DESTROYED|corrupt|cannot|unrecover')
if [ "${condition}" ]; then
printf "\n==== ERROR ====\n"
printf "One of the pools is in one of these statuses: DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED|FAIL|DESTROYED|corrupt|cannot|unrecover!\n"
printf "$condition"
exit 1
fi
# Capacity - Make sure the pool capacity is below 80% for best performance. The
# percentage really depends on how large your volume is. If you have a 128GB
# SSD then 80% is reasonable. If you have a 60TB raid-z2 array then you can
# probably set the warning closer to 95%.
#
# ZFS uses a copy-on-write scheme. The file system writes new data to
# sequential free blocks first and when the uberblock has been updated the new
# inode pointers become valid. This works only when the pool has
# enough free sequential blocks. If the pool is at capacity and space limited,
# ZFS will have to randomly write blocks. This means ZFS cannot create an
# optimal set of sequential writes, and write performance is severely impacted.
capacity=$(/sbin/zpool list -H -o capacity | cut -d'%' -f1)
for line in ${capacity}
do
if [ $line -ge $maxCapacity ]; then
printf "\n==== ERROR ====\n"
printf "One of the pools has reached it's max capacity!"
exit 1
fi
done
# Errors - Check the columns for READ, WRITE and CKSUM (checksum) drive errors
# on all volumes and all drives using "zpool status". If any non-zero errors
# are reported an email will be sent out. You should then look to replace the
# faulty drive and run "zpool scrub" on the affected volume after resilvering.
errors=$(/sbin/zpool status | grep ONLINE | grep -v state | awk '{print $3 $4 $5}' | grep -v 000)
if [ "${errors}" ]; then
printf "\n==== ERROR ====\n"
printf "One of the pools contains errors!"
printf "$errors"
exit 1
fi
# Scrub Expired - Check if all volumes have been scrubbed in at least the last
# 8 days. The general guide is to scrub volumes on desktop quality drives once
# a week and volumes on enterprise class drives once a month. You can always
# use cron to schedule "zpool scrub" in off hours. We scrub our volumes every
# Sunday morning for example.
#
# Check your /etc/cron.d/zfsutils_linux for any already scheduled jobs
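# As an illustration only (not part of the original script): a weekly scrub could
# be scheduled with a cron entry like the one below, where "tank" is a
# hypothetical pool name.
#   0 2 * * 0   root   /sbin/zpool scrub tank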
#
# Scrubbing traverses all the data in the pool once and verifies all blocks can
# be read. Scrubbing proceeds as fast as the devices allow, though the
# priority of any I/O remains below that of normal calls. This operation might
# negatively impact performance, but the file system will remain usable and
# responsive while scrubbing occurs. To initiate an explicit scrub, use the
# "zpool scrub" command.
#
# The scrubExpire variable is in seconds.
currentDate=$(date +%s)
zfsVolumes=$(/sbin/zpool list -H -o name)
for volume in ${zfsVolumes}
do
if [ $(/sbin/zpool status $volume | grep -E -c "none requested") -ge 1 ]; then
printf "\n==== ERROR ====\n"
printf "ERROR: You need to run \"zpool scrub $volume\" before this script can monitor the scrub expiration time."
break
fi
if [ $(/sbin/zpool status $volume | grep -E -c "scrub in progress|resilver") -ge 1 ]; then
break
fi
### Ubuntu with GNU supported date format - compatible with ZFS v2.0.3 output
scrubRawDate=$(/sbin/zpool status $volume | grep scrub | awk '{print $11" "$12" " $13" " $14" "$15}')
scrubDate=$(date -d "$scrubRawDate" +%s)
### FreeBSD with *nix supported date format
#scrubRawDate=$(/sbin/zpool status $volume | grep scrub | awk '{print $15 $12 $13}')
#scrubDate=$(date -j -f '%Y%b%e-%H%M%S' $scrubRawDate'-000000' +%s)
if [ $(($currentDate - $scrubDate)) -ge $scrubExpire ]; then
printf "\n==== ERROR ====\n"
printf "${volume}'s scrub date is too far in the past!"
exit 1
fi
done
# TRIM Expired - Check if all volumes have been trimmed in at least the last
# 8 days. The general guide is to manually trim volumes on desktop quality drives once
# a week and volumes on enterprise class drives once a month. You can always
# use cron to schedule "zpool trim" in off hours. We trim our volumes every
# Sunday morning for example.
#
# Check your /etc/cron.d/zfsutils_linux for any already scheduled jobs
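# As an illustration only (not part of the original script): a weekly trim could
# be scheduled the same way, again with "tank" as a hypothetical pool name.
#   0 3 * * 0   root   /sbin/zpool trim tank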
#
# Manual trimming is recommended even if the autotrim feature is turned on for your pool.
# From ZFS documentation:
# > Since the automatic TRIM will skip ranges it considers too small there is value in occasionally
# > running a full zpool trim. This may occur when the freed blocks are small and not enough time
# > was allowed to aggregate them. An automatic TRIM and a manual zpool trim may be run concurrently,
# > in which case the automatic TRIM will yield to the manual TRIM.
for volume in ${zfsVolumes}
do
if [ $(/sbin/zpool status -t $volume | grep -E -c "trim unsupported") -ge 1 ]; then
break
fi
### Ubuntu with GNU supported date format - compatible with ZFS v2.0.3 output - for other systems and versions adjust the awk parameters below
trimRawDates=$(/sbin/zpool status -t $volume | grep trim | awk '{print $10" "$11" " $12" " $13" "$14}')
while IFS= read -r trimRawDate
do
trimDate=$(date -d "$trimRawDate" +%s)
if [ $(($currentDate - $trimDate)) -ge $trimExpire ]; then
printf "\n==== ERROR ====\n"
printf "${volume}'s trim date is too far in the past!"
exit 1
fi
done <<< "$trimRawDates"
done
# Finish - If we made it here then everything is fine
exit 0
Direct link to the script file
To use the script, go to your monitrc file and add the following lines:
check program zfs_health with path "/path/to/this/script 80 691200"
if status != 0 then alert
Where 80 is the max capacity in percentages and 691200 is the scrub expiration in seconds. This will make Monit notify you every time something is wrong with your zpool.
As a bonus you will get a nice status with the output of the last script run in the web user interface:
If you don't see the whole output for your zpools, you probably need to set a higher PROGRAMOUTPUT limit in your monitrc file. The default one is 512 bytes.
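For example, assuming a reasonably recent Monit version, the limit can be raised in monitrc with the set limits statement (2048 B below is just an arbitrary example value):
set limits {
    programOutput: 2048 B
}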
The script is also available as a GitHub Gist.
Comments
I tried your script, it works perfectly from the command line. But apparently Monit does not support passing arguments to the scripts; I wonder how you got this running? This is my status output:
Missing arguments
Usage: zfs_health_check.sh maxCapacityInPercentages scrubExpireInSeconds
when running it like this:
check program zfs_health with path "/bin/bash -c /root/monit_scripts/zfs_check.sh 80 691200"
if status != 0 then alert
Rychu
@evgenymagata skip the "/bin/bash -c" part in the "with path" statement. Take a look at the example in the article. In your case it should be:
check program zfs_health with path "/root/monit_scripts/zfs_check.sh 80 691200"
if status != 0 then alert
Let me know if that helps.
Thomas
Thanks for the script.
I'm using it on Debian 10. However, I'm getting an error when it checks for the last scrub date:
date: invalid date ‘errors on Sun Jul 14’
/etc/monit/scripts/zfs.sh: 133: /etc/monit/scripts/zfs.sh: arithmetic expression: expecting primary: "1563119164 - "
Any idea what causes it?
Rychu
@Thomas take a look at this: https://stackoverflow.com/questions/30719911/arithmetic-expression-expecting-primary
How do you run the script from the Monit config? Make sure you use bash. Does it give you the same error when you run it manually from a console?
Thomas
Thanks for quick reply.
If I run it from the console I get the same error. If I specifically use /bin/bash /etc/monit/scripts/zfs.sh I get this error:
date: invalid date ‘errors on Sun Jul 14’
/etc/monit/scripts/zfs.sh: line 133: 1563121516 - : syntax error: operand expected (error token is "- ")
Rychu
Are you 100% sure you didn't make a mistake while copy-pasting the script? Maybe try once more :)
Try with wget: https://pawelrychlicki.pl/Shared/GetFile/27
Actually, I'm not sure the problem is really about the dates. But anyway, try running this script:
#! /bin/sh
scrubExpire=691200
currentDate=$(date +%s)
zfsVolumes=$(/sbin/zpool list -H -o name)
for volume in ${zfsVolumes}
do
scrubRawDate=$(/sbin/zpool status $volume | grep scrub | awk '{print $11" "$12" " $13" " $14" "$15}')
scrubDate=$(date -d "$scrubRawDate" +%s)
echo "scrubRawDate: $scrubRawDate"
echo "currentDate: $currentDate"
echo "scrubDate: $scrubDate"
echo "currentDate - scrubDate = $(($currentDate - $scrubDate))"
[ $(($currentDate - $scrubDate)) -ge $scrubExpire ] && echo "greater or equal!"
[ $(($currentDate - $scrubDate)) -lt $scrubExpire ] && echo "less!"
echo "------"
done
My output is as follows:
scrubRawDate: Thu Jul 11 03:01:03 2019
currentDate: 1563127385
scrubDate: 1562806863
currentDate - scrubDate = 320522
less!
------
scrubRawDate: Thu Jul 11 06:59:28 2019
currentDate: 1563127385
scrubDate: 1562821168
currentDate - scrubDate = 306217
less!
------
Maybe this will tell you something.
Thomas
Well, I used the copy function here on the site. It seems to have copied well enough. I tried wget'ing the link you posted, renaming it to .sh and chmod +x, but it gives this error when executed:
./27.sh 80 691200
-bash: ./27.sh: /bin/sh^M: bad interpreter: No such file or directory
Odd...
I also tried the script you just linked but the error is exactly the same:
root@naser:/tmp# ./script.sh
date: invalid date ‘errors on Sun Jul 14’
scrubRawDate: errors on Sun Jul 14
currentDate: 1563132082
scrubDate:
./script.sh: 16: ./script.sh: arithmetic expression: expecting primary: "1563132082 - "
Rychu
I think the output from your "zpool status" command is a bit different. Try changing the awk parameters to:
$13" "$14" " $15" " $16" "$17
in the script from my last comment and see if that helps. If not, try playing a bit with the numbers :)
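For reference, the scrub date line in the debug script above would then look like this (only the awk column numbers change):
scrubRawDate=$(/sbin/zpool status $volume | grep scrub | awk '{print $13" "$14" "$15" "$16" "$17}')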
Thomas
That fixed the issue - thanks!
I also tried lowering the "seconds" parameter to something less than the time since the last scrub, and I'm getting properly warned about it.
What exactly makes my 'awk parameters' differ?
Rychu
awk is used here to parse the output of /sbin/zpool status $volume | grep scrub. For some reason (maybe a newer version of /sbin/zpool) the line containing the date of the last scrub run contains two additional words compared to mine and Calomel's.
What is the output of the command below:
/sbin/zpool status | grep scrub | head -n 1
I get this:
scan: scrub repaired 0B in 0h1m with 0 errors on Thu Jul 11 03:01:03 2019
Thomas
Yeah, indeed. It's different:
scan: scrub repaired 0B in 0 days 04:07:54 with 0 errors on Sun Jul 14 12:40:33 2019
Rychu
Yup, so that's why you have to jump two "columns" in awk :)
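(As a side note, a rough, version-independent alternative to counting columns would be to take everything after "errors on", assuming GNU date and a scan line that ends with the date:
scrubRawDate=$(/sbin/zpool status $volume | grep scrub | sed -n 's/.*errors on //p')
scrubDate=$(date -d "$scrubRawDate" +%s)
The script in the article sticks to awk, though.)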
Hi, I have been using this for a few weeks now. But I'm using enterprise disks and decided to expand the scrub expiration time to one month and one day.
/root/.scripts/monitzfs.sh 85 2714400
Unfortunately that does not work and I'm getting this error:
/root/.scripts/monitzfs.sh: 142: [: Illegal number: 2714400
The funny thing is, if I revert back to 8 days the error remains. I can execute the script from the command line without any error regardless of which time I use.
So it seems to be a Monit problem, but I can't see it. This is my monitrc entry (which worked before):
check program zfs_health with path "/root/.scripts/monitzfs.sh 85 2714400"
if status != 0 then alert
Any idea why that error happens?
Rychu
After upgrading ZFS to version 0.8.3 I encountered the same error as @Thomas, so I changed the script to the $13" "$14" " $15" " $16" "$17 version to handle it.
Thanks for the great script, but I get the message "/etc/monit/conf.d/chk_zfs.sh:29: syntax error 'maxCapacity=$1'"
My entry in Monit is:
check program zfs_health with path "/etc/monit/conf.d/chk_zfs.sh 80 691200"
if status != 0 then alert
The script works when run manually, just not when started from Monit... What is causing this?