Better Enterprise/Network Monitoring with Component Diagnostics

And when I say "Enterprise" I do mean a collection of servers and application components, although the "U.S.S. Enterprise" does have a cameo!

Most server-based applications installed today have no system monitoring. You find out the system is down when the phone calls start coming. Sometimes at home. This sort of thing can really put a damper on the pay raise department.

And this topic gets richer: when you have to do an emergency upgrade, how do you know that something you changed didn't screw up the production system in a whole different way? Phone calls again? More pay raise dampening.

A collection of simple system monitors and diagnostics can help you to be sure that your application is installed correctly and currently functioning correctly. And when there is a problem, not only can you be notified immediately, you can view the information to find out where, exactly, the problem is. If the problem is fixed quickly, or fixed before anybody notices, this bodes well for the pay raise stuff.

Nagios is a popular tool for a variety of system monitoring, but in my experience it is used almost exclusively for monitoring hardware. So if your application can no longer talk to the database, Nagios thinks everything is just fine. The solution is to add a few simple diagnostics to your app that you and Nagios can use to make sure everything is okay.

A simple servlet is a great way to go for this sort of thing. A diagnostic can return something like "OK" or "FAIL", so any web browser can run a diagnostic on a component. Plus, Nagios (and similar products) can exercise web pages too.
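To make the "OK"/"FAIL" idea concrete, here is a minimal sketch of such an endpoint. The servlet API needs a container to run, so this sketch stands in with the JDK's built-in com.sun.net.httpserver.HttpServer; in a real servlet the same logic would sit in doGet. The class name, the /diagnostic path, the check itself, and the choice to pair "FAIL" with an HTTP 503 status are all assumptions made for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.function.BooleanSupplier;

// Hypothetical diagnostic endpoint; the names and the URL path are illustrative.
class DiagnosticEndpoint {

    // Turns a check into the plain-text body a browser or monitor would see.
    // A check that throws is reported as FAIL instead of crashing the endpoint.
    static String respond(BooleanSupplier check) {
        try {
            return check.getAsBoolean() ? "OK" : "FAIL";
        } catch (RuntimeException e) {
            return "FAIL";
        }
    }

    // Starts a small HTTP server that serves the diagnostic at /diagnostic.
    // Passing port 0 lets the operating system pick a free port.
    static HttpServer start(int port, BooleanSupplier check) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/diagnostic", exchange -> {
            String text = respond(check);
            byte[] body = text.getBytes(StandardCharsets.UTF_8);
            // Assumption: a non-200 status on failure, so monitors that only
            // look at the status code still notice the problem.
            int status = text.equals("OK") ? 200 : 503;
            exchange.getResponseHeaders().set("Content-Type", "text/plain");
            exchange.sendResponseHeaders(status, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

Calling DiagnosticEndpoint.start(8080, () -> true) from startup code would serve the page at http://localhost:8080/diagnostic. The always-true check is only a placeholder; a more useful one might run a trivial query against the database, which is exactly the failure that hardware monitoring misses.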

In "Star Trek: The Next Generation", Picard will tell somebody to run a "Level 3 diagnostic" for some part of the Enterprise. It turns out that the writers of this show have worked out what, exactly, that means. The following is an attempt to use their terminology to fit the needs of server development. Please note that level 1 and level 2 diagnostics require shutting down the application. I think that this sort of thing is better represented by testing, so I've left those out.

Level 5 Diagnostic

Level 4 Diagnostic

Level 3 Diagnostic

The needs of the real world go a bit beyond those of Star Trek. A few more diagnostics for the repertoire:

Ping

Installation Diagnostic

Status page

If you implement this servlet with only ping, you have already made a big leap. Calling ping, even manually, verifies that your component is not locked up and that the network is functioning, at least in part. If you later connect Nagios to the component ping, you can probably find out about a problem with your component long before you get that first call from a user.
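The monitoring side of a ping can be sketched in a few lines as well. The example below fetches a ping URL with a timeout and maps the result onto the Nagios plugin exit codes (0 for OK, 2 for CRITICAL). The class name, the expected "OK" body, and the decision to treat every failure as CRITICAL are assumptions for illustration, not something Nagios prescribes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical Nagios-style check for a component's ping URL.
class PingCheck {
    static final int OK = 0;        // Nagios plugin exit code for OK
    static final int CRITICAL = 2;  // Nagios plugin exit code for CRITICAL

    // Only an HTTP 200 whose first body line is "OK" counts as healthy.
    static int exitCodeFor(int httpStatus, String body) {
        boolean healthy = httpStatus == 200
                && body != null
                && body.trim().equals("OK");
        return healthy ? OK : CRITICAL;
    }

    // Fetches the ping URL. A timeout, a refused connection, or a malformed
    // URL is reported as CRITICAL, since a locked-up component usually
    // shows up as a request that never gets answered.
    static int check(String pingUrl, int timeoutMillis) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(pingUrl).toURL().openConnection();
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            int status = conn.getResponseCode();
            if (status != 200) {
                // getInputStream() would throw for error statuses, so stop here.
                return CRITICAL;
            }
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    conn.getInputStream(), StandardCharsets.UTF_8))) {
                return exitCodeFor(status, in.readLine());
            }
        } catch (IOException | RuntimeException e) {
            return CRITICAL;
        }
    }
}
```

A small main method that prints a one-line status and passes the result of PingCheck.check(url, 5000) to System.exit would turn this into something Nagios can run as a plugin. The stock check_http plugin can do much the same job without custom code, so a sketch like this is mainly useful when the check needs logic of its own.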

Overall, implementing all aspects of this servlet would probably take less than a day. Connecting it to Nagios would take about an hour. And then ... if there ever is a problem, you should be able to resolve it about 20 times (maybe 100 times) faster than without it.