Binary Talks

Hadoop DFS Check Plugin for Nagios

Since I’m currently busy kicking off a Hadoop cluster several terabytes in size, I thought it’d be a good idea to provide a few Nagios plugins related to Hadoop.

Just in case you don’t know: Hadoop is a free Java software framework that supports data-intensive distributed applications. It has its own filesystem called HDFS, which scales up to petabytes of storage and runs on top of the operating system’s filesystem. Hadoop is inspired by Google’s MapReduce and GFS, the Google File System.

Yahoo, for example, uses several Hadoop clusters with 100,000 CPUs in total across 20,000 boxes (2,000 boxes per cluster), according to the official Hadoop website. If you haven’t checked it out yet, do so! This is, sorry for that, crazy shit, and it’s working!
To get started, I’d recommend reading the Wikipedia article and the official documentation and, of course, watching the Hadoop-related lectures over at Cloudera.

Anyway, the first plugin is almost finished. It checks the number of running DataNodes and the capacity of the DFS. It’s working, but I want to add a few more features so that the plugin becomes more flexible. Nevertheless, feel free to give it a try if you’re in the mood.

The Script

Hadoop is operated entirely by the hadoop user, and since it has its own filesystem, it has its own permissions as well. To keep things clean and safe, we won’t give Hadoop-related permissions to the Nagios user. Instead, we’ll allow the Nagios user to run a small shell script as root; that script runs a command as the hadoop user to get information about the cluster.

Add the following small script, named get-dfsreport.sh, to a directory of your choice, e.g. /usr/local/sbin:

#!/bin/sh
su -s /bin/bash - hadoop -c 'hadoop dfsadmin -report'

Afterwards, alter its permissions so that it’s readable, writable, and executable by root only:

user@host: ~ $ sudo chmod 700 /usr/local/sbin/get-dfsreport.sh

Then allow the Nagios user to run the script with sudo by adding the following line to /etc/sudoers (or, better, by editing it with visudo):

nagios          ALL=(ALL)       NOPASSWD: /usr/local/sbin/get-dfsreport.sh

That’s it for the prerequisites. You may then run the plugin provided below for copying and pasting (or svn co, your choice).
If you’ve chosen a directory other than /usr/local/sbin, you have to provide the path via -s/--path-sh when running the plugin.

user@host: ~ $ svn co svn://svn.matejunkie.com/nagios-plugins/stable/check_hadoop-dfs/ check_hadoop-dfs/

#!/bin/sh
 
#   This program is free software; you can redistribute it and/or modify
#   it under the terms of the GNU General Public License as published by
#   the Free Software Foundation; either version 2 of the License, or
#   (at your option) any later version.
#
#   This program is distributed in the hope that it will be useful,
#   but WITHOUT ANY WARRANTY; without even the implied warranty of
#   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#   GNU General Public License for more details.
#
#   You should have received a copy of the GNU General Public License
#   along with this program; if not, write to the Free Software
#   Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
 
PROGNAME=`basename $0`
VERSION="Version 1.0,"
AUTHOR="2009, Mike Adolphs (http://www.matejunkie.com/)"
 
ST_OK=0
ST_WR=1
ST_CR=2
ST_UK=3
 
path_sh="/usr/local/sbin"
 
print_version() {
    echo "$VERSION $AUTHOR"
}
 
print_help() {
    print_version $PROGNAME $VERSION
    echo ""
    echo "$PROGNAME is a Nagios plugin to check the status of HDFS, Hadoop's"
    echo "underlying, redundant, distributed file system."
    echo ""
    echo "$PROGNAME -s /usr/local/sbin [-w 10] [-c 5]"
    echo ""
    echo "Options:"
    echo "  -s|--path-sh)"
    echo "     Path to the shell script that is mentioned in the"
    echo "     documentation. Default is: /usr/local/sbin"
    echo "  -w|--warning)"
    echo "     Defines the warning level for available datanodes. Default"
    echo "     is: off"
    echo "  -c|--critical)"
    echo "     Defines the critical level for available datanodes. Default"
    echo "     is: off"
    exit $ST_UK
}
 
while test -n "$1"; do
    case "$1" in
        --help|-h)
            print_help
            exit $ST_UK
            ;;
        --version|-v)
            print_version $PROGNAME $VERSION
            exit $ST_UK
            ;;
        --path-sh|-s)
            path_sh=$2
            shift
            ;;
        --warning|-w)
            warning=$2
            shift
            ;;
        --critical|-c)
            critical=$2
            shift
            ;;
        *)
            echo "Unknown argument: $1"
            print_help
            exit $ST_UK
            ;;
        esac
    shift
done
 
get_wcdiff() {
    if [ ! -z "$warning" -a ! -z "$critical" ]
    then
        wclvls=1
 
        if [ "${warning}" -lt "${critical}" ]
        then
            wcdiff=1
        fi
    elif [ ! -z "$warning" -a -z "$critical" ]
    then
        wcdiff=2
    elif [ -z "$warning" -a ! -z "$critical" ]
    then
        wcdiff=3
    fi
}
 
val_wcdiff() {
    if [ "$wcdiff" = 1 ]
    then
        echo "Please adjust your warning/critical thresholds. The warning \
must be higher than the critical level!"
        exit $ST_UK
    elif [ "$wcdiff" = 2 ]
    then
        echo "Please also set a critical value when you want to use \
warning/critical thresholds!"
        exit $ST_UK
    elif [ "$wcdiff" = 3 ]
    then
        echo "Please also set a warning value when you want to use \
warning/critical thresholds!"
        exit $ST_UK
    fi
}
 
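get_vals() {
    tmp_vals=`sudo ${path_sh}/get-dfsreport.sh`
    # printf instead of "echo -e": plain sh does not guarantee that echo accepts -e
    dn_avail=`printf '%s\n' "$tmp_vals" | grep -m1 "Datanodes available:" | awk '{print $3}'`
    dfs_used=`printf '%s\n' "$tmp_vals" | grep -m1 "Used raw bytes:" | awk '{print $4}'`
    dfs_used_p=`printf '%s\n' "$tmp_vals" | grep -m1 "% used:" | awk '{print $3}'`
    dfs_total=`printf '%s\n' "$tmp_vals" | grep -m1 "Total raw bytes:" | awk '{print $4}'`
    # Without a usable report the state is unknown, not OK
    if [ -z "$dn_avail" ] || [ -z "$dfs_used" ] || [ -z "$dfs_total" ]
    then
        echo "UNKNOWN - could not read the DFS report"
        exit $ST_UK
    fi
    # Convert bytes to MB
    dfs_used=`expr ${dfs_used} / 1024 / 1024`
    dfs_total=`expr ${dfs_total} / 1024 / 1024`
}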
 
do_output() {
    output="Datanodes up and running: ${dn_avail}, DFS total: \
${dfs_total} MB, DFS used: ${dfs_used} MB (${dfs_used_p})"
}
 
do_perfdata() {
    perfdata="'datanodes_available'=${dn_avail} 'dfs_total'=${dfs_total} \
'dfs_used'=${dfs_used}"
}
 
# Here we go!
get_wcdiff
val_wcdiff
 
get_vals
 
do_output
do_perfdata
 
if [ -n "$warning" -a -n "$critical" ]
then
    if [ "$dn_avail" -le "$warning" -a "$dn_avail" -gt "$critical" ]
    then
        echo "WARNING - ${output} | ${perfdata}"
        exit $ST_WR
    elif [ "$dn_avail" -le "$critical" ]
    then
        echo "CRITICAL - ${output} | ${perfdata}"
        exit $ST_CR
    else
        echo "OK - ${output} | ${perfdata}"
        exit $ST_OK
    fi
else
    echo "OK - ${output} | ${perfdata}"
    exit $ST_OK
fi
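
If you’d like to see what get_vals pulls out of the report without having a cluster at hand, the parsing step can be tried against a canned report. What follows is only a sketch under a few assumptions: the sample text is made up and contains just the four labels the plugin looks for (the real output of hadoop dfsadmin -report has more lines and may look different in other Hadoop versions), and parse_report is a hypothetical helper that is not part of the plugin. It reads everything in a single awk pass rather than running one grep/awk pipeline per value.

```shell
#!/bin/sh
# Made-up sample report: only the labels the plugin looks for are included.
sample_report='Total raw bytes: 21126043402240 (19.21 TB)
Used raw bytes: 4857856 (4.63 MB)
% used: 0%
Datanodes available: 50'

# parse_report (hypothetical helper): reads a report on stdin and prints
# "dn_avail dfs_total_mb dfs_used_mb dfs_used_p" on a single line.
# Exits with status 1 if any of the four values is missing.
parse_report() {
    awk '
        /Datanodes available:/ && dn == ""    { dn = $3 }
        /Total raw bytes:/     && total == "" { total = $4 }
        /Used raw bytes:/      && used == ""  { used = $4 }
        /% used:/              && pct == ""   { pct = $3 }
        END {
            if (dn == "" || total == "" || used == "" || pct == "") exit 1
            # 1048576 = 1024 * 1024, i.e. bytes to MB (rounded down)
            printf "%d %d %d %s\n", dn, int(total / 1048576), int(used / 1048576), pct
        }'
}

if vals=`printf '%s\n' "$sample_report" | parse_report`
then
    echo "parsed: $vals"
else
    echo "UNKNOWN - could not parse the DFS report"
fi
```

The first match per label wins, which mirrors what grep -m1 does in the plugin. Feeding the function an empty or unrelated text makes it fail, so a caller can report UNKNOWN rather than an OK with empty values.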

Output example:

If everything went fine you should see output like the following:

user@host: ~ $ ./check_hadoop-dfs.sh -s /var/nagios/home/bin
OK - Datanodes up and running: 50, DFS total: 20147365 MB, DFS used: 0 MB (0%) | 'datanodes_available'=50 'dfs_total'=20147365 'dfs_used'=0
user@host: ~ $ ./check_hadoop-dfs.sh -s /var/nagios/home/bin -w 40 -c 30
OK - Datanodes up and running: 50, DFS total: 20147365 MB, DFS used: 0 MB (0%) | 'datanodes_available'=50 'dfs_total'=20147365 'dfs_used'=0
user@host: ~ $ ./check_hadoop-dfs.sh -s /var/nagios/home/bin -w 60 -c 40
WARNING - Datanodes up and running: 50, DFS total: 20147365 MB, DFS used: 0 MB (0%) | 'datanodes_available'=50 'dfs_total'=20147365 'dfs_used'=0
user@host: ~ $ ./check_hadoop-dfs.sh -s /var/nagios/home/bin -w 70 -c 60
CRITICAL - Datanodes up and running: 50, DFS total: 20147365 MB, DFS used: 0 MB (0%) | 'datanodes_available'=50 'dfs_total'=20147365 'dfs_used'=0
user@host: ~ $ ./check_hadoop-dfs.sh -s /var/nagios/home/bin -w 20 -c 40
Please adjust your warning/critical thresholds. The warning must be higher than the critical level!
user@host: ~ $ ./check_hadoop-dfs.sh -s /var/nagios/home/bin -w 30
Please also set a critical value when you want to use warning/critical thresholds!
user@host: ~ $ ./check_hadoop-dfs.sh -s /var/nagios/home/bin -c 30
Please also set a warning value when you want to use warning/critical thresholds!
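
The runs above boil down to two small decisions: first check that the threshold pair makes sense, then compare the number of DataNodes against it. Here is a minimal sketch of that logic as one function, for illustration only. classify_datanodes is a hypothetical name and not something the plugin defines; it only prints the state, and mapping the state to the Nagios exit code (0, 1, 2 or 3) is left to the caller.

```shell
#!/bin/sh
# classify_datanodes AVAIL [WARNING CRITICAL]   (hypothetical helper)
# Prints OK, WARNING, CRITICAL or UNKNOWN for the given DataNode count.
classify_datanodes() {
    avail=$1 warn=$2 crit=$3

    # No thresholds at all: the check is purely informational.
    if [ -z "$warn" ] && [ -z "$crit" ]; then
        echo "OK"
        return 0
    fi

    # Only one threshold given, or warning below critical: refuse to guess.
    if [ -z "$warn" ] || [ -z "$crit" ] || [ "$warn" -lt "$crit" ]; then
        echo "UNKNOWN"
        return 0
    fi

    # Fewer DataNodes is worse, so test the critical level first.
    if [ "$avail" -le "$crit" ]; then
        echo "CRITICAL"
    elif [ "$avail" -le "$warn" ]; then
        echo "WARNING"
    else
        echo "OK"
    fi
}

classify_datanodes 50 40 30   # prints OK
classify_datanodes 50 60 40   # prints WARNING
classify_datanodes 50 70 60   # prints CRITICAL
classify_datanodes 50 20 40   # prints UNKNOWN
```

The four calls use the same numbers as the runs above with 50 DataNodes available. Thresholds are not checked for being numeric here; that would be a sensible addition.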

The License

As always, this little script is meant to be sh-compliant and is released under the terms of the GPL Version 2 only. Feel free to subscribe via RSS to get updates on this one. More options will be added in the future.
