As in our own environment at nugg.ad, every startup or HA environment needs a proper monitoring solution to meet the high-availability demands of its customers. Without a solution you can trust, you’re screwed. It’s as simple as that.
You need to be the first to know whether your application servers are melting, your database/NoSQL backend is about to burst, or someone accidentally the whole thing (you know the meme, don’t you?). Not only so you can inform customers before they call you, but also so you can plan the further growth of your environment.
Since I’m currently rewriting our whole monitoring environment based on 3 1/2 years of experience with Nagios and its competitors, I thought I’d share that knowledge with you. Just in case someone has a use for it.
Choosing a monitoring solution
First rule: Stick with the monitoring solution you’re at least a bit experienced with and which fits your environment. In huge environments Zabbix is capable of doing the monitoring work for you, and so are Zenoss and Nagios. In smaller environments MRTG or Munin might also do the job.
All big Open Source monitoring solutions are highly customizable and extensible. You just need to know how to find the right plugins, or how to ask the community properly how something can be achieved.
If you’re not experienced with Open Source monitoring solutions at all, take a first look at the feature sets of the various solutions on Wikipedia. Then choose wisely and, most important: Stick to that solution for quite a while to explore its advantages and to get better at anger management when facing its disadvantages. Sooner or later you’ll get the big picture.
It’s not that important which software you choose. It’s more important what you make out of it for your environment!
The user’s demands
What I’ve experienced over the last years is that the demands are quite comprehensive:
- The operations team needs to be informed almost instantly in case of a real emergency, via various contact channels
  … and it’ll make you stay longer at the office to fix this if they get woken up each night by false alarms
- The CTO needs proper escalation methods to be informed when something’s broken and not being taken care of
- The executives and their board need a nice and shiny visualization of the platform to present the company’s growth and its state
- The consulting or support team needs a simple read-only web interface to get a proper impression of whether everything’s all right in case of an unexpected customer call.
All of these demands have something in common: The basis of all operations is trust.
The operations team has to trust your monitoring solution, so that it fixes problems quickly but in a considered way instead of ignoring them after the fourth false alarm in a week. The CTO needs to know that your escalation strategies are working and that you don’t screw him. The executives need proper graphs without spikes or even downtimes that they’re unable to explain in meetings. And the consulting or support team needs an overview of the almost real-time state of your environment without getting confused.
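One way to keep that trust is to alert only once a problem has been confirmed by several consecutive failed checks, which is roughly the idea behind Nagios’ soft and hard states and its `max_check_attempts` setting. The following Python snippet is only a minimal sketch of that idea, not how Nagios implements it; the class name `CheckState` and the returned labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CheckState:
    """Tracks consecutive failures of one check (hypothetical helper)."""
    max_attempts: int = 3   # assumption: failures needed before anyone is paged
    failures: int = 0
    notified: bool = False

    def record(self, ok: bool) -> str:
        """Feed in one check result and get back what should happen."""
        if ok:
            was_notified = self.notified
            self.failures = 0
            self.notified = False
            # Only announce a recovery if somebody was alerted before.
            return "recovery" if was_notified else "ok"

        self.failures += 1
        if self.failures >= self.max_attempts and not self.notified:
            self.notified = True
            return "notify"   # problem confirmed: wake up the on-call person
        # Either still unconfirmed ("soft") or already alerted ("hard").
        return "hard" if self.notified else "soft"
```

A single failed check followed by a success never reaches "notify", so a short network hiccup at 3 a.m. does not wake anyone up.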
My monitoring environment
I’m currently using Nagios as the basic monitoring solution for our environment with several plugins attached to it.
To reuse all the data provided by Nagios, I let it write its information to a MySQL database via ndoutils. This lets you use almost any software that understands the ndoutils database layout, e.g. nice and shiny web interfaces or visualization tools like NagVis.
For the graphing I’m currently using two solutions. pnp4nagios 0.6.x with its highly recommended NPCD daemon acts as the basis for proper graphing of various system information. Since the pnp4nagios web interface is only recommended for Unix/Linux system administrators, I’m reusing the RRD databases within Cacti to provide a better overview of the whole platform. Cacti in turn is mostly used for SNMP-based checks that need no alerting.
You’ll now see the difference: System metrics that need proper alerting are handled by Nagios itself, and metrics that only need to be recorded for their statistics are handled by Cacti (and therefore mostly SNMP). This eases your configuration work, keeps system resources in balance and avoids misunderstandings within the team that has to take care of your environment while you’re enjoying your Martini on the Keys.
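The graphing in pnp4nagios is fed by the performance data that Nagios plugins append to their output after a `|` character, while the plugin’s exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) drives the alerting. As a rough illustration, here is a minimal Python sketch of what such a plugin computes. The metric name `load1` and the thresholds are made-up examples, and the helper functions are hypothetical, not part of any Nagios library.

```python
# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measured value to a state (higher value = worse)."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def format_output(label, value, warn, crit):
    """Return (exit_code, output_line) for one metric.

    The part after '|' is the performance data (label=value;warn;crit)
    that graphing add-ons can pick up.
    """
    state = evaluate(value, warn, crit)
    text = f"{STATE_NAMES[state]} - {label} is {value}"
    perfdata = f"{label}={value};{warn};{crit}"
    return state, f"{text} | {perfdata}"


# A real plugin would print the line and exit with the state, e.g.:
#   state, line = format_output("load1", 4.2, 4.0, 8.0)
#   print(line); sys.exit(state)
```

The same check therefore serves both purposes: the exit code decides whether somebody gets alerted, and the performance data ends up in the graphs.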
To give the consultants a good overview of the platform’s health, I’m using two different tools. First, NagVis provides a graphical overview of the system’s health without providing too much detail. A single traffic light with three states (green, yellow and red) might be too little information, but a Google Maps based view of your various datacenter locations with green, yellow or red icons will do the job. The support staff are then able to explain to the customer that datacenter 123 is down due to a failure within the system, which is enough in most cases. All other cases that demand a clearer view will be redirected to you, no worries. They know their job.
Second, for the consultants who are more experienced with the system itself, I provide a nice and shiny Nagios interface within Cacti itself. They’re then able to access Cacti’s performance graphs as well as the metrics provided by Nagios.
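The logic behind such a per-datacenter icon can be as simple as "show the worst state of anything inside". The Python sketch below illustrates that idea using the Nagios service state numbers (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). It is not NagVis’ actual implementation, and showing UNKNOWN or an empty datacenter as yellow is my own assumption.

```python
# Nagios service states mapped to traffic light colors.
# Assumption: UNKNOWN (3) is shown as yellow rather than red.
COLOR = {0: "green", 1: "yellow", 2: "red", 3: "yellow"}
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def datacenter_color(service_states):
    """Return the worst color among all service states of one datacenter.

    Assumption: a datacenter without any data is shown as yellow,
    because "no information" should not look healthy.
    """
    colors = [COLOR[state] for state in service_states]
    return max(colors, key=SEVERITY.get, default="yellow")
```

For example, `datacenter_color([0, 0, 1])` gives "yellow" and `datacenter_color([0, 2, 1])` gives "red", which is about the level of detail a supporter needs on the phone.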
Conclusion
Maintaining an Open Source monitoring solution means a lot of work and you have to grapple with your favorite monitoring solution for quite a while before achieving your goals. But if you do, you’re the peacekeeper that lets your colleagues sleep during the night, the magician that casts nice and shiny graphs instantly and the master who has a global overview of your platform.
I hope this gave you a basic idea of how to find the right monitoring solution for you without going into the technical details. If you’ve got any questions, let me know.