Things i've made


Saturday, December 3, 2011

High performance NSCA alternative for Nagios

I've worked with some non-trivial Nagios setups: ~5 Nagios nodes, each performing 20-50 service checks per second. If you have never worked with a distributed Nagios deployment, you may be unaware of the pain that NSCA can cause. I should start by saying that NSCA is responsible for reporting the results of service checks on remote Nagios nodes back to the master Nagios nodes, which is what allows the setup to be distributed. What no one tells you when you deploy NSCA is that it sends service check results in series, while Nagios performs service checks in parallel. If your Nagios node is already operating at capacity and you have it report to a master node, service check results queue up and services start to look as if they are flapping.
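
To put rough numbers on that backlog, here is a back-of-the-envelope sketch. Every figure in it is made up for illustration (50 results per second produced, 30 ms per serial send); these are not measurements, and the function names are hypothetical.

```shell
#!/bin/sh
# All numbers below are made up for illustration.

# max_sent_per_sec SEND_MS: results per second a strictly serial sender
# can deliver if each send takes SEND_MS milliseconds (integer math).
max_sent_per_sec() {
    echo $(( 1000 / $1 ))
}

# backlog_per_sec PRODUCED SEND_MS: how fast the queue grows, never below 0.
backlog_per_sec() {
    sent=$(max_sent_per_sec "$2")
    if [ "$1" -gt "$sent" ]; then
        echo $(( $1 - sent ))
    else
        echo 0
    fi
}

produced=50   # assumed check results per second on one node
send_ms=30    # assumed time for one serial send, in milliseconds

echo "serial sender tops out at $(max_sent_per_sec "$send_ms") results/s"
echo "queue grows by $(backlog_per_sec "$produced" "$send_ms") results/s"
```

With these assumed figures the sender tops out at 33 results per second, so the queue grows by 17 results every second for as long as the node stays that busy.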

Not being able to find a simple solution to the NSCA backlog problem, I created a micro daemon that replaces NSCA with something much higher performance, so instead of NSCA being the bottleneck, that role is back on Nagios itself. Another useful feature of this NSCA replacement is that it emits service checks for itself, so you can monitor how many service checks per second are being proxied and how many bytes per second that is, which is great if you have little bandwidth.

Three nagios proxies in one setup showing their own status
To deploy the daemon, compile it on the master and slave nodes and set it up as shown below:
  Slave 1                     --.
  ./nagios_proxy 10.16.250.30    \       Master (10.16.250.30)
                                 |----> ./nagios_proxy
                                 |
  Slave 2                     ---/
  ./nagios_proxy 10.16.250.30


The daemons need to be started in the same directory as the nagios.cmd pipe, which in this setup is /opt/nagios/var/rw.
Define the commands in your nagios config to send check results via the proxy:
define command {
        command_name    send_service_check_proxy
        command_line    /opt/nagios/libexec/send_service_check_proxy $HOSTNAME$ '$SERVICEDESC$' $SERVICESTATEID$ '$SERVICEOUTPUT$'
}

define command {
        command_name    send_host_check_proxy
        command_line    /opt/nagios/libexec/send_host_check_proxy $HOSTNAME$ $HOSTSTATEID$ '$HOSTOUTPUT$'
}
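
Defining the commands is not enough on its own; something has to call them after each check. A minimal sketch, assuming the slaves use the standard Nagios obsess-over settings (ocsp_command and ochp_command) to call the two commands above. The post does not say this outright, so treat the directive values, the config path, and the `check_cfg` helper as assumptions:

```shell
#!/bin/sh
# check_cfg FILE DIRECTIVE...: hypothetical helper that reports which
# key=value directives are missing from a Nagios main config file.
# Matches whole lines exactly, so stray whitespace counts as missing.
check_cfg() {
    cfg_file=$1
    shift
    cfg_missing=0
    for cfg_want in "$@"; do
        if ! grep -qxF -- "$cfg_want" "$cfg_file"; then
            echo "missing: $cfg_want"
            cfg_missing=1
        fi
    done
    return "$cfg_missing"
}

# Assumed path; adjust to your install.
slave_cfg=/opt/nagios/etc/nagios.cfg
if [ -r "$slave_cfg" ]; then
    check_cfg "$slave_cfg" \
        obsess_over_services=1 \
        ocsp_command=send_service_check_proxy \
        obsess_over_hosts=1 \
        ochp_command=send_host_check_proxy || true
fi
```

On the master the same helper can be pointed at check_external_commands=1, accept_passive_service_checks=1 and accept_passive_host_checks=1, since results arrive there as passive checks through the command pipe.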

The send_host_check_proxy command itself:
#!/bin/bash
/bin/echo "PROCESS_HOST_CHECK_RESULT;$1;$2;$3" > /opt/nagios/var/rw/nagios-remote.cmd

The send_service_check_proxy command:
#!/bin/bash
/bin/echo "PROCESS_SERVICE_CHECK_RESULT;$1;$2;$3;$4" > /opt/nagios/var/rw/nagios-remote.cmd
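
A quick way to check the wiring before involving Nagios is to build the same line by hand and look at what lands on a pipe. The sketch below is only a test harness: `format_service_result`, the host name web01 and the throwaway FIFO are hypothetical, and a plain `cat` stands in for the proxy.

```shell
#!/bin/sh
# Build the same line that send_service_check_proxy writes.
# format_service_result is a hypothetical helper for testing only.
format_service_result() {
    # $1=host  $2=service description  $3=state id  $4=plugin output
    printf 'PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' "$1" "$2" "$3" "$4"
}

# Use a throwaway FIFO instead of the real nagios-remote.cmd.
tmpdir=$(mktemp -d)
fifo="$tmpdir/test.cmd"
mkfifo "$fifo"

# A background reader stands in for the proxy.
cat "$fifo" > "$tmpdir/received" &
reader=$!

# Write one result; the reader sees end-of-file when the writer closes.
format_service_result web01 'HTTP' 0 'HTTP OK' > "$fifo"
wait "$reader"

cat "$tmpdir/received"
rm -rf "$tmpdir"
```

If the line that comes out the other end has the four fields in the right order, the same arguments should work when Nagios calls the real script.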

Now, to make things a bit more fault tolerant, I usually run the proxy from a script that keeps restarting it if it quits. This is important because the proxy will terminate if it determines that something has gone wrong, knowing full well that it will be restarted. This is useful in large outages since the system will almost always come back up by itself. It's also nice to have a log file to refer to after the fact. It's probably worth noting that this service script also takes care of the pipe that is used by the proxy, so if you don't use this script then you will have to manage that yourself.
#!/bin/bash
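# Raise the open file descriptor limit (assumed intent; a bare "ulimit N" sets the file size limit instead).
ulimit -n 1000000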

while true; do
  ./nagios_proxy 10.16.250.30 >>/var/log/nagios_proxy.log 2>&1
  sleep 2
done
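
If you manage the pipe yourself, it has to exist as a FIFO before the send scripts write to it, otherwise the `echo` redirect quietly creates a regular file at that path. A minimal sketch, assuming the proxy expects the FIFO to be there already and does not create it itself; `ensure_pipe` is a hypothetical helper, not part of the proxy:

```shell
#!/bin/sh
# ensure_pipe PATH: hypothetical helper. Create the FIFO if it is missing
# and replace anything at that path that is not a FIFO.
ensure_pipe() {
    pipe_path=$1
    if [ -e "$pipe_path" ] && [ ! -p "$pipe_path" ]; then
        rm -f "$pipe_path"    # e.g. a regular file left by a stray echo
    fi
    [ -p "$pipe_path" ] || mkfifo -m 660 "$pipe_path"
}

# Demonstrate on a throwaway directory rather than the live one.
demo_dir=$(mktemp -d)
touch "$demo_dir/nagios-remote.cmd"          # simulate a stray regular file
ensure_pipe "$demo_dir/nagios-remote.cmd"
ls -l "$demo_dir/nagios-remote.cmd"
rm -rf "$demo_dir"
```

In the wrapper this would be called once before the loop with the real path, which in this setup is /opt/nagios/var/rw/nagios-remote.cmd.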

If you are Red Hat inclined, the service script below will help keep the whole thing running smoothly across system restarts, log rotations, etc. Converting it to something other than Red Hat is up to you.
#!/bin/sh
# 
# chkconfig: 345 99 01
# description: Nagios distributed high-performance proxy/aggregator
#
# File : nagiosproxy
#
# Author : Ryan Krumins (ryan.krumins@gmail.com)
# 
# Changelog :
#
# 2009-05-12 Ryan Krumins 
#  - initial implementation
#
# Description : Start and stops the Nagios proxy/aggregator
#               used to collect and distribute passive nagios checks
#

INST_DIR=/opt/nagios/var/rw
PROXY=$INST_DIR/nagios_proxy
PROXY_WRAP=$INST_DIR/run_proxy.sh
PROXY_PIPE=$INST_DIR/nagios-remote.cmd

# Sanity checks.
[ -x $PROXY ] || exit 0 
[ -x $PROXY_WRAP ] || exit 0 

# Source functions library
. /etc/init.d/functions

RETVAL=0

start () {
    echo -n $"Starting nagiosproxy: "
    if [ -n "`/sbin/pidof -o %PPID nagios_proxy`" ]; then
        echo -n $"nagios_proxy: already running"
        failure
        echo
        return 1
    fi
    cd $INST_DIR
    # Start the wrapper in the background, discarding its output.
    $PROXY_WRAP >/dev/null 2>&1 &
    sleep 1
    if [ -n "`/sbin/pidof -o %PPID nagios_proxy`" ]; then
        success
    else
        failure
    fi;
    echo
    return $RETVAL
}

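stop() {
    # Kill the wrapper first so it cannot restart the proxy.
    killall run_proxy.sh
    killall nagios_proxy
}

restart() {
    # Killing only the proxy is enough: the wrapper loop restarts it.
    killall nagios_proxy
}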

case "$1" in
  start)
        start
        ;;
  stop)
        stop
        ;;
  restart|force-reload|reload)
        restart
        ;;
  status)
        echo "nagios_proxy: " `/sbin/pidof nagios_proxy`
        ;;
esac

exit $?

Lastly, the important part: the proxy source code itself for the NSCA replacement. I posted this on pastebin since it displays much nicer there.

So there you have it: an NSCA replacement that handles thousands of service check results per second, monitors itself in a useful and meaningful way, tolerates faults, and rarely needs any interaction regardless of what occurs in your environment. If you use this, I would appreciate your comments. This is only a quickly thrown together hack, but if it's useful enough then I would consider working on feature requests.

5 comments:

  1. Could you, please, post it on the Nagios Exchange?
    http://exchange.nagios.org
    This will benefit the Nagios community. Thank you!

    Replies
    1. Hi Ludmil,

      Thanks for the suggestion!

      I had actually already added this to the nagios exchange. If it becomes more popular I would consider packaging it up properly as well.

  2. Downloading and testing. sounds like exactly what I was looking for :)

  3. Hi, like the idea but can't see how exactly they piece together. Does nagios-remote.cmd need compiled?

    Replies
    1. Additionally, I'm running Nagios 4.0.5 (on CentOS/OEL). is this compatible?
