And when I say "Enterprise" I do mean a collection of servers and application components, although the "U.S.S. Enterprise" does have a cameo!
Most server-based applications installed today have no system monitoring. The way you find out the system is down is that the phone calls start coming, sometimes at home. This sort of thing can really put a damper on the pay raise department.
And this topic gets richer - when you have to do an emergency upgrade, how do you know something you changed didn't screw up the production system in a whole different way? Phone calls again? More pay raise dampening.
A collection of simple system monitors and diagnostics can help you to be sure that your application is installed correctly and currently functioning correctly. And when there is a problem, not only can you be notified immediately, you can view the information to find out where, exactly, the problem is. If the problem is fixed quickly, or fixed before anybody notices, this bodes well for the pay raise stuff.
Nagios is a popular tool for a variety of system monitoring, but in my experience, it is used almost exclusively for monitoring hardware. So if your application can no longer talk to the database, Nagios thinks everything is just fine. The solution is to add a few simple diagnostics to your app that you and Nagios can use to make sure everything is okay.
A simple servlet is a great way to go for this sort of thing. A diagnostic can return something like "OK" or "FAIL", so any web browser can run a diagnostic on a component. Plus, Nagios (and similar products) can exercise web pages too.
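To make the "OK"/"FAIL" idea concrete, here is a minimal sketch of the text such a servlet could return. The class and method names (DiagnosticReport, add, render) are made up for illustration, and the servlet plumbing is left out so the example needs nothing beyond the JDK; a real servlet would simply write the rendered string to its response.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Hypothetical helper: runs named checks and renders a plain-text report.
class DiagnosticReport {
    private final Map<String, BooleanSupplier> checks = new LinkedHashMap<>();

    DiagnosticReport add(String name, BooleanSupplier check) {
        checks.put(name, check);
        return this;
    }

    // First line is "OK" or "FAIL"; the remaining lines are for humans.
    String render() {
        boolean allPassed = true;
        StringBuilder details = new StringBuilder();
        for (Map.Entry<String, BooleanSupplier> entry : checks.entrySet()) {
            boolean passed;
            try {
                passed = entry.getValue().getAsBoolean();
            } catch (RuntimeException ex) {
                passed = false; // a check that throws counts as a failure
            }
            allPassed &= passed;
            details.append(passed ? "ok   " : "FAIL ")
                   .append(entry.getKey()).append('\n');
        }
        return (allPassed ? "OK" : "FAIL") + "\n" + details;
    }
}
```

Usage would look something like new DiagnosticReport().add("database", () -> pingDatabase()).render(), where pingDatabase() stands in for whatever check your application needs.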
In "Star Trek: The Next Generation", Picard will tell somebody to run a "Level 3 diagnostic" for some part of the Enterprise. It turns out that the writers of this show have worked out what, exactly, that means. The following is an attempt to use their terminology to fit the needs of server development. Please note that level 1 and level 2 diagnostics require shutting down the application. I think that this sort of thing is better represented by testing, so I've left those out.
Level 5 Diagnostic
-
Select a few things to do that can complete in two seconds or less.
Configure Nagios to exercise this diagnostic about once every two minutes.
The first line of text returned must be "OK" or "FAIL" for Nagios.
Additional human readable text would be nice.
Some things that might be exercised:
-
read from the database (verify that communication with the database is functioning)
ability to put a NO-OP message into a JMS queue (verify that JMS is functioning)
make sure all JMS queue sizes meet expected parameters
verify that app can read and write to the file system (testing for permission problems and file handle problems)
number of objects in memory meets a threshold
number of objects in the file system meets a threshold
certain files (logs?) are not getting too large
certain recent activities are within "normal operating parameters"
HTTP server is serving a tiny web page
recently logged error or warning
is data current?
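A couple of the items above can be sketched with nothing but the JDK. This is only an illustration: the class name Level5Checks, the method names and the thresholds are my own assumptions, and the database and JMS checks are left out because they depend on your driver and broker. The withinBudget helper shows one way to enforce a time budget such as the two seconds mentioned above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical Level 5 checks; names and thresholds are illustrative.
class Level5Checks {
    // Write a small probe file, read it back, then delete it.
    static boolean canReadAndWrite(Path dir) {
        try {
            Path probe = Files.createTempFile(dir, "diag", ".tmp");
            try {
                byte[] data = "probe".getBytes(StandardCharsets.UTF_8);
                Files.write(probe, data);
                return Arrays.equals(data, Files.readAllBytes(probe));
            } finally {
                Files.deleteIfExists(probe);
            }
        } catch (IOException e) {
            return false; // permission or file handle problems end up here
        }
    }

    // True when the file is missing or no larger than maxBytes.
    static boolean fileNotTooLarge(Path file, long maxBytes) {
        try {
            return !Files.exists(file) || Files.size(file) <= maxBytes;
        } catch (IOException e) {
            return false;
        }
    }

    // Runs a check with a time limit; a timeout or exception is a failure.
    static boolean withinBudget(Callable<Boolean> check, long millis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<Boolean> result = pool.submit(check);
            return Boolean.TRUE.equals(result.get(millis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException | ExecutionException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow(); // interrupt a check that is still running
        }
    }
}
```

Wrapping the whole Level 5 run in withinBudget(..., 2000) means a hung dependency shows up as a "FAIL" rather than as a diagnostic that never returns.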
Level 4 Diagnostic
-
Select a rich list of things to do that can complete in 30 seconds or less.
Configure Nagios to exercise this diagnostic about once every two hours.
The first line of text returned must be "OK" or "FAIL" for Nagios.
Additional human readable text would be nice.
Some things that might be exercised:
-
exercise the Level 5 Diagnostic for this component
check something in the Level 5 diagnostic list above that could not fit into 2 seconds
ping components the app depends on (verify that the network is functioning between the two components)
exercise all JMS queues
FTP read/write
exercise EJB interface
exercise MDB interface
exercise SOAP interface
exercise RMI interface
examine JDBC driver version
exercise a sophisticated algorithm
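For the "ping components the app depends on" item, one simple approach is to open a TCP connection to the dependency and see whether it succeeds within a timeout. This is a sketch under that assumption; the class name DependencyPing and its method are hypothetical, and a successful connect only shows the network path and the listener are up, not that the service behind it is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper for checking that a dependency is reachable.
class DependencyPing {
    // True when a TCP connection to host:port opens within timeoutMillis.
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false; // refused, unreachable, timed out, or unknown host
        }
    }
}
```

A Level 4 diagnostic could call canConnect once per dependency (database host, JMS broker, FTP server) and list each result in the human readable part of the response.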
Level 3 Diagnostic
-
Select a rich list of things to do that can complete in 5 minutes or less.
Configure Nagios to exercise this diagnostic about once every day.
The first line of text returned must be "OK" or "FAIL" for Nagios.
Additional human readable text would be nice.
Some things that might be exercised:
-
exercise the Level 4 Diagnostic for this component
check for when licenses expire for third party products
exercise a workflow.
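The license expiry item is a nice one to sketch because it is pure date arithmetic. The example below assumes you already know the expiry date from somewhere (a config file, the vendor's license file); the class name LicenseCheck and the warning window are made up for illustration.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hypothetical license check: fail early, while there is time to renew.
class LicenseCheck {
    // Days from today until expiry; negative when already expired.
    static long daysLeft(LocalDate today, LocalDate expires) {
        return ChronoUnit.DAYS.between(today, expires);
    }

    // "FAIL" once fewer than warnDays remain, "OK" otherwise.
    static String status(LocalDate today, LocalDate expires, long warnDays) {
        long left = daysLeft(today, expires);
        return (left >= warnDays ? "OK" : "FAIL")
                + "\nlicense expires in " + left + " days";
    }
}
```

Passing today in as a parameter (rather than calling LocalDate.now() inside) keeps the check easy to test with fixed dates.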
The needs of the real world are a bit more than in Star Trek. A few more diagnostics for the repertoire:
Ping
-
Configure Nagios to exercise this diagnostic about once every 30 seconds.
Just return "OK".
This shows that the network and most systems are functioning and that the component is not locked up.
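Here is a minimal sketch of such a ping. To keep it runnable without a servlet container, it uses the small HTTP server that ships with the JDK (com.sun.net.httpserver) as a stand-in for a servlet; the class name PingServer and the /ping path are assumptions of mine, not anything Nagios requires.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a ping servlet, using the JDK's HTTP server.
class PingServer {
    // Starts a server whose /ping path always answers "OK".
    // Port 0 asks the operating system for any free port.
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/ping", exchange -> {
            byte[] body = "OK\n".getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders()
                    .set("Content-Type", "text/plain; charset=utf-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

In a servlet the equivalent is a doGet that writes "OK" to the response. On the Nagios side, the standard check_http plugin can be pointed at the URL and told to expect that string; check the plugin's help output for the exact options on your installation.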
Installation Diagnostic
-
Called by the person installing the component (not by Nagios) immediately after installation.
Recycle stuff from the level 3, 4 and 5 diagnostics.
Check to make sure jar file code can be exercised.
Called as needed by developers and system operators (not by Nagios).
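One cheap way to "check that jar file code can be exercised" is to try loading a class you know lives in each jar. This sketch assumes that approach; the class name JarCheck is hypothetical, and the class names you pass in would be ones from your own third party jars.

```java
// Hypothetical installation check: can a class from a given jar be loaded?
class JarCheck {
    static boolean classLoads(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing jar, or a jar built against incompatible classes.
            return false;
        }
    }
}
```

Loading a class only proves the jar is on the classpath and links; calling a harmless method on it afterwards would go a step further.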
Status page
-
Called as needed by developers and system operators (not by Nagios).
This might show:
-
a detailed list of which systems are functioning correctly and which are not so good
log data
the version numbers of the JDBC driver, the app server and Java
the current sizes of JMS queues
the number of servlet requests
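As a starting point, a status page can report what the JVM already knows about itself. The sketch below sticks to that; the class name StatusPage and the key names are my own, and the JDBC driver version, app server version and JMS queue sizes are left out because they depend on your environment (the JDBC one is available from java.sql.DatabaseMetaData once you have a connection).

```java
import java.lang.management.ManagementFactory;

// Hypothetical status page body built from JVM-provided information.
class StatusPage {
    static String render() {
        Runtime rt = Runtime.getRuntime();
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        long uptimeSec = ManagementFactory.getRuntimeMXBean().getUptime() / 1000;
        StringBuilder sb = new StringBuilder();
        sb.append("java.version: ").append(System.getProperty("java.version")).append('\n');
        sb.append("os.name: ").append(System.getProperty("os.name")).append('\n');
        sb.append("uptime.seconds: ").append(uptimeSec).append('\n');
        sb.append("heap.used.mb: ").append(usedMb).append('\n');
        return sb.toString();
    }
}
```

One "key: value" pair per line keeps the page readable in a browser and easy to grep from a script.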
If you implement this servlet with only the ping, you have made a big leap. Calling this ping, even manually, will verify that your component is not locked up and that the network is functioning at least in part. If you later connect Nagios to the component ping, you can probably find out about a problem with your component long before you get that first call from a user of the component.
Overall, implementing all aspects of this servlet would probably take less than a day. Connecting it to Nagios would take about an hour. And then ... if there ever is a problem, you should be able to resolve the problem about 20 times (maybe 100 times) faster than without it.