Since I’m currently busy bringing up a Hadoop cluster several terabytes in size, I thought it’d be a good idea to provide a few Nagios plugins for Hadoop.
Just in case you don’t know: Hadoop is a free Java software framework that supports data-intensive distributed applications. It has its own filesystem, HDFS, which scales up to petabytes of storage and runs on top of the operating system’s filesystem. Hadoop is inspired by Google’s MapReduce and GFS, the Google File System.
Yahoo, for example, uses several Hadoop clusters with 100,000 CPUs in total across 20,000 boxes (2,000 boxes per cluster), according to the official Hadoop website. If you haven’t checked it out yet, do so! This is – sorry for that – crazy shit and it’s working!
To get started, I’d recommend reading the Wikipedia article and the official documentation and, of course, watching the Hadoop-related lectures over at Cloudera.
Anyway, the first plugin is almost finished. It checks the number of running DataNodes and the capacity of the DFS. It’s working, but I want to add a few more features so that the plugin becomes more flexible. Nevertheless, feel free to give it a try if you’re in the mood.
The Script
Hadoop is operated entirely by the hadoop user, and since it has its own filesystem it has its own permissions as well. To keep things clean and safe we won’t give Hadoop-related permissions to the Nagios user. Instead, we’ll allow the Nagios user to run a small shell script as root, which in turn runs a command as the hadoop user to get information about the cluster.
Save the following small script as get-dfsreport.sh in a directory of your choice, e.g. /usr/local/sbin:
#!/bin/sh
su -s /bin/bash - hadoop -c 'hadoop dfsadmin -report'
Afterwards, change its permissions so that only root can read, write and execute it:
user@host: ~ $ sudo chmod 700 /usr/local/sbin/get-dfsreport.sh
Then allow the Nagios user to run the script through sudo by adding the following line to /etc/sudoers (preferably using visudo):
nagios ALL=(ALL) NOPASSWD: /usr/local/sbin/get-dfsreport.sh
That’s it for the prerequisites. You can now use the plugin itself, which is provided below for copy’n’pasting (or via svn co, your choice).
If you’ve chosen a directory other than /usr/local/sbin, you have to provide the path via -s/--path-sh when running the plugin.
user@host: ~ $ svn co svn://svn.matejunkie.com/nagios-plugins/stable/check_hadoop-dfs/ check_hadoop-dfs/
#!/bin/sh

# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA

PROGNAME=`basename $0`
VERSION="Version 1.0,"
AUTHOR="2009, Mike Adolphs (http://www.matejunkie.com/)"

# Nagios exit codes
ST_OK=0
ST_WR=1
ST_CR=2
ST_UK=3

# Default directory of get-dfsreport.sh
path_sh="/usr/local/sbin"

print_version() {
    echo "$VERSION $AUTHOR"
}

print_help() {
    print_version $PROGNAME $VERSION
    echo ""
    echo "$PROGNAME is a Nagios plugin to check the status of HDFS, Hadoop's"
    echo "underlying, redundant, distributed file system."
    echo ""
    echo "$PROGNAME -s /usr/local/sbin [-w 10] [-c 5]"
    echo ""
    echo "Options:"
    echo "  -s|--path-sh)"
    echo "     Path to the shell script that is mentioned in the"
    echo "     documentation. Default is: /usr/local/sbin"
    echo "  -w|--warning)"
    echo "     Defines the warning level for available datanodes. Default"
    echo "     is: off"
    echo "  -c|--critical)"
    echo "     Defines the critical level for available datanodes. Default"
    echo "     is: off"
    exit $ST_UK
}

while test -n "$1"; do
    case "$1" in
        --help|-h)
            print_help
            exit $ST_UK
            ;;
        --version|-v)
            print_version $PROGNAME $VERSION
            exit $ST_UK
            ;;
        --path-sh|-s)
            path_sh=$2
            shift
            ;;
        --warning|-w)
            warning=$2
            shift
            ;;
        --critical|-c)
            critical=$2
            shift
            ;;
        *)
            echo "Unknown argument: $1"
            print_help
            exit $ST_UK
            ;;
    esac
    shift
done

# Work out whether the thresholds were given as a consistent pair
get_wcdiff() {
    if [ ! -z "$warning" -a ! -z "$critical" ]
    then
        wclvls=1

        if [ ${warning} -lt ${critical} ]
        then
            wcdiff=1
        fi
    elif [ ! -z "$warning" -a -z "$critical" ]
    then
        wcdiff=2
    elif [ -z "$warning" -a ! -z "$critical" ]
    then
        wcdiff=3
    fi
}

# Abort with UNKNOWN if the thresholds are inconsistent or incomplete
val_wcdiff() {
    if [ "$wcdiff" = 1 ]
    then
        echo "Please adjust your warning/critical thresholds. The warning \
must be higher than the critical level!"
        exit $ST_UK
    elif [ "$wcdiff" = 2 ]
    then
        echo "Please also set a critical value when you want to use \
warning/critical thresholds!"
        exit $ST_UK
    elif [ "$wcdiff" = 3 ]
    then
        echo "Please also set a warning value when you want to use \
warning/critical thresholds!"
        exit $ST_UK
    fi
}

# Fetch the DFS report and extract the values (byte counts become MB)
get_vals() {
    tmp_vals=`sudo ${path_sh}/get-dfsreport.sh`
    dn_avail=`echo -e "$tmp_vals" | grep -m1 "Datanodes available:" | awk '{print $3}'`
    dfs_used=`echo -e "$tmp_vals" | grep -m1 "Used raw bytes:" | awk '{print $4}'`
    dfs_used=`expr ${dfs_used} / 1024 / 1024`
    dfs_used_p=`echo -e "$tmp_vals" | grep -m1 "% used:" | awk '{print $3}'`
    dfs_total=`echo -e "$tmp_vals" | grep -m1 "Total raw bytes:" | awk '{print $4}'`
    dfs_total=`expr ${dfs_total} / 1024 / 1024`
}

do_output() {
    output="Datanodes up and running: ${dn_avail}, DFS total: \
${dfs_total} MB, DFS used: ${dfs_used} MB (${dfs_used_p})"
}

do_perfdata() {
    perfdata="'datanodes_available'=${dn_avail} 'dfs_total'=${dfs_total} \
'dfs_used'=${dfs_used}"
}

# Here we go!
get_wcdiff
val_wcdiff

get_vals
do_output
do_perfdata

if [ -n "$warning" -a -n "$critical" ]
then
    if [ "$dn_avail" -le "$warning" -a "$dn_avail" -gt "$critical" ]
    then
        echo "WARNING - ${output} | ${perfdata}"
        exit $ST_WR
    elif [ "$dn_avail" -le "$critical" ]
    then
        echo "CRITICAL - ${output} | ${perfdata}"
        exit $ST_CR
    else
        echo "OK - ${output} | ${perfdata} "
        exit $ST_OK
    fi
else
    echo "OK - ${output} | ${perfdata}"
    exit $ST_OK
fi
Output example:
If everything went fine you should see output like the following:
user@host: ~ $ ./check_hadoop-dfs.sh -s /var/nagios/home/bin
OK - Datanodes up and running: 50, DFS total: 20147365 MB, DFS used: 0 MB (0%) | 'datanodes_available'=50 'dfs_total'=20147365 'dfs_used'=0

user@host: ~ $ ./check_hadoop-dfs.sh -s /var/nagios/home/bin -w 40 -c 30
OK - Datanodes up and running: 50, DFS total: 20147365 MB, DFS used: 0 MB (0%) | 'datanodes_available'=50 'dfs_total'=20147365 'dfs_used'=0

user@host: ~ $ ./check_hadoop-dfs.sh -s /var/nagios/home/bin -w 60 -c 40
WARNING - Datanodes up and running: 50, DFS total: 20147365 MB, DFS used: 0 MB (0%) | 'datanodes_available'=50 'dfs_total'=20147365 'dfs_used'=0

user@host: ~ $ ./check_hadoop-dfs.sh -s /var/nagios/home/bin -w 70 -c 60
CRITICAL - Datanodes up and running: 50, DFS total: 20147365 MB, DFS used: 0 MB (0%) | 'datanodes_available'=50 'dfs_total'=20147365 'dfs_used'=0

user@host: ~ $ ./check_hadoop-dfs.sh -s /var/nagios/home/bin -w 20 -c 40
Please adjust your warning/critical thresholds. The warning must be higher than the critical level!

user@host: ~ $ ./check_hadoop-dfs.sh -s /var/nagios/home/bin -w 30
Please also set a critical value when you want to use warning/critical thresholds!

user@host: ~ $ ./check_hadoop-dfs.sh -s /var/nagios/home/bin -c 30
Please also set a warning value when you want to use warning/critical thresholds!
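Note that the thresholds count available DataNodes, so lower means worse: the plugin warns once the count drops to the warning level and goes critical once it drops to the critical level. Here is a minimal sketch of just that decision, pulled out into a helper. The name classify_datanodes is made up for illustration and is not part of the plugin; it assumes the warning level is higher than the critical level, which the plugin enforces before it gets this far.

```shell
#!/bin/sh
# Hypothetical helper, not part of the plugin: maps a DataNode count to a
# Nagios state name using the same comparisons as the plugin.
# Assumes warn > crit (the plugin rejects anything else up front).
classify_datanodes() {
    avail=$1
    warn=$2
    crit=$3

    if [ "$avail" -le "$crit" ]; then
        echo "CRITICAL"
    elif [ "$avail" -le "$warn" ]; then
        echo "WARNING"
    else
        echo "OK"
    fi
}

# The three threshold cases from the output example (50 DataNodes up):
classify_datanodes 50 40 30   # prints OK
classify_datanodes 50 60 40   # prints WARNING
classify_datanodes 50 70 60   # prints CRITICAL
```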
The License
As always, this little script is meant to be sh-compliant and is released under the terms of the GPL, version 2 or (as stated in the script header) any later version. Feel free to subscribe via RSS to get updates on this one. More options will be added in the future.
It looks like you have some good plugins for Nagios. We just launched a new website exchange.nagios.org. It would be great if you could add your plugins to the new site.
Thanks!
Thanks Mary. I’ve already registered and uploaded two plugins. There’s more to come in the next few days.
I think the output of dfsadmin -report has changed in the 0.20 release, or is it possibly a localized output change?
A sample of the new output is below:
Configured Capacity: 19777590296576 (17.99 TB)
Present Capacity: 11989662912512 (10.9 TB)
DFS Remaining: 9583081635840 (8.72 TB)
DFS Used: 2406581276672 (2.19 TB)
DFS Used%: 20.07%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 2 (2 total, 0 dead)
Name: 10.200.100.75:50010
Rack: /datacenter1/rack2
Decommission Status : Normal
Configured Capacity: 9888795148288 (8.99 TB)
DFS Used: 1203288698880 (1.09 TB)
Non DFS Used: 2953713659904 (2.69 TB)
DFS Remaining: 5731792789504(5.21 TB)
DFS Used%: 12.17%
DFS Remaining%: 57.96%
Last contact: Wed Sep 16 14:43:46 EDT 2009
Name: 10.200.100.170:50010
Rack: /datacenter1/rack1
Decommission Status : Normal
Configured Capacity: 9888795148288 (8.99 TB)
DFS Used: 1203292577792 (1.09 TB)
Non DFS Used: 4834213724160 (4.4 TB)
DFS Remaining: 3851288846336(3.5 TB)
DFS Used%: 12.17%
DFS Remaining%: 38.95%
Last contact: Wed Sep 16 14:43:47 EDT 2009
Thanks for the example. I’ll alter the plugin in the next few days so that it works with 0.20.x and higher as well. We’re still using 0.18.3 here since it’s quite stable.
Changes to make it work with 0.20.1
124,126c124,125
< dfs_used=`echo -e "$tmp_vals" | grep -m1 "DFS Used:" | awk '{print $3}'`
< dfs_used=`expr $dfs_used / 1024 / 1024`
---
> dfs_used=`echo -e "$tmp_vals" | grep -m1 "Used raw bytes:" | awk '{print $4}'`
> dfs_used=`expr ${dfs_used} / 1024 / 1024`
128,130c127,128
< dfs_total=`echo -e "$tmp_vals" | grep -m1 "Configured Capacity:" | awk '{print $3}'`
< dfs_total=`expr $dfs_total / 1024 / 1024`
---
> dfs_total=`echo -e "$tmp_vals" | grep -m1 "Total raw bytes:" | awk '{print $4}'`
> dfs_total=`expr ${dfs_total} / 1024 / 1024`
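One thing the diff leaves alone is the percentage lookup. Judging by the sample 0.20 output quoted above, the summary line now reads "DFS Used%:", so the plugin's grep for "% used:" would probably come back empty. Below is a rough sketch of how all four values could be pulled from a 0.20-style report. The function name parse_report_020 is hypothetical, and the field positions are taken only from that sample, so treat this as an assumption rather than a tested fix for every 0.20.x release.

```shell
#!/bin/sh
# Sketch only: read a 0.20-style "dfsadmin -report" from stdin and print
# "<datanodes> <total MB> <used MB> <used %>" on one line.
# parse_report_020 is a made-up name; the labels and field positions are
# assumptions based on the sample output quoted in the comments.
parse_report_020() {
    report=$(cat)

    # Take the first match only: the cluster summary comes before the
    # per-node sections, which repeat the same labels.
    dn_avail=$(printf '%s\n' "$report" | awk '/^Datanodes available:/ {print $3; exit}')
    total_b=$(printf '%s\n' "$report" | awk '/^Configured Capacity:/ {print $3; exit}')
    used_b=$(printf '%s\n' "$report" | awk '/^DFS Used:/ {print $3; exit}')
    used_p=$(printf '%s\n' "$report" | awk '/^DFS Used%:/ {print $3; exit}')

    # Bytes to MB with integer division, as the plugin does.
    total_mb=$(expr "$total_b" / 1024 / 1024)
    used_mb=$(expr "$used_b" / 1024 / 1024)

    echo "$dn_avail $total_mb $used_mb $used_p"
}
```

You could try it with something like /usr/local/sbin/get-dfsreport.sh | parse_report_020, run as a user that is allowed to execute the wrapper script.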