
Cluster Monitoring
The topic of this chapter is cluster monitoring, a very versatile topic. There is plenty of software available to monitor distributed systems; however, it is difficult to find one project that provides a solution for all needs. Among those needs may be the desire to efficiently gather as many metrics as possible about the utilization of “worker” nodes of a high performance cluster. Efficiency in this case is no meaningless advertising word, it is very important: nobody wants to tolerate load imbalance just because some data to create graphs is collected, the priorities in high performance computing are pretty clear in that regard. Another necessity may be the possibility to observe certain values that are not allowed to surpass a specified threshold, such as the temperature of the water used for cooling things like CPUs or the allocated space on a hard disk drive. Ideally the monitoring software would have the possibility to commence counter-measures as long as a value is above its threshold. In the course of this chapter two different monitoring solutions are installed on a cluster of virtual machines. First, Icinga, a fork of the widely used Nagios, is tested. After that Ganglia is used. Both solutions are Open Source and rather different in the functionality they offer.



Figure 4.1 provides an overview of both the used cluster (of virtual machines) and the used software. All nodes use the most current Ubuntu LTS release. In the case of Ganglia the software version is 3.5, the most current one, which is therefore compiled from source. For Icinga version 1.8.4 is used.

Icinga
Icinga is an Open Source monitoring solution. It is a fork of Nagios and maintains backwards compatibility. Thus all Nagios plugins also work with Icinga. The version provided by the official Ubuntu repositories in Ubuntu 12.04 is 1.6.1. To get a more current version the package provided by a Personal Package Archive (PPA) is used.

Installation
Thanks to the provided PPA the installation was rather simple. There was only one minor nuisance: the official guide for installation using packages for Ubuntu suggested installing Icinga like this:

Unfortunately this failed, since the installation of one of the Icinga packages required a working database (either PostgreSQL or MySQL). So one has to switch the order of the packages or simply install PostgreSQL before Icinga.
 * Listing 4.1 Suggested order of packages to install Icinga on Ubuntu.
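A working order might be sketched like this (the PPA name and exact package set are assumptions, not taken from the original listing):

```shell
# Add the Icinga PPA and refresh the package index
sudo add-apt-repository ppa:formorer/icinga
sudo apt-get update

# Install the database *before* Icinga, so that the
# database-dependent Icinga packages can configure themselves
sudo apt-get install postgresql
sudo apt-get install icinga icinga-doc
```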

Configuration
After the installation of Icinga the provided web interface was accessible right away (using port forwarding to access the virtual machine). By default, some plugins were already enabled that monitor the host on which Icinga is installed.



Figure 4.2 shows the service status details for these plugins on the master node. Getting Icinga to monitor remote hosts (the worker nodes) required much more configuration. A look into the configuration folder of Icinga revealed how the master node was configured to display the information of figure 4.2. The information is split into two parts: host identification and service specification. The host identification consists of a host name, an alias and an address. A service is specified by a host name, a service description and a check command. The check command accepts a Nagios plugin or a custom plugin, which has to be configured in another Icinga configuration file.



Figure 4.3 shows some important parts of the modified default configuration file used for the master node. As can be seen, both the host and the service section start with a use statement, which names the template that is going to be used. Icinga ships with a default (generic) template for hosts and services which is sufficient for us.
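A minimal sketch of such a host and service definition (the host name, address and check command here are illustrative, not our actual values):

```
define host {
    use        generic-host       ; inherit the shipped template
    host_name  master
    alias      Master Node
    address    192.168.0.1
}

define service {
    use                 generic-service   ; inherit the shipped template
    host_name           master
    service_description Current Load
    check_command       check_load!5.0!4.0!3.0!10.0!6.0!4.0
}
```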



The question of how to achieve a setup as presented in figure 4.4 now arises. We want to use Icinga to monitor our worker nodes. For that purpose Icinga provides two different methods, which work the same way but use different techniques. In either case the Icinga instance running on the master node periodically asks the worker nodes for data. (The opposite approach would have been that Icinga just listens for data and the worker nodes initiate the communication themselves.) The two methods are SSH and NRPE. The manuals compare both methods and recommend NRPE at the cost of increased configuration effort: NRPE causes less CPU overhead, while SSH is available on nearly every Linux machine and thus does not need to be configured. For our purpose decreased CPU overhead is a selling point and therefore NRPE is used. The next sections describe how Icinga has to be configured to monitor remote hosts with NRPE.

Master

In order to use NRPE, additional software has to be installed on the master node. The package nagios-nrpe-plugin provides Icinga with the possibility to use NRPE to gather data from remote hosts. Unfortunately that package is part of Nagios, and thus upon installation the whole Nagios project is supposed to be installed as a dependency. Luckily, using the option --no-install-recommends for apt-get we can skip the installation of those packages. The now installed package provides a new check command, check_nrpe, that can be used during the service definition for a new host. That command can be used to execute a Nagios plugin or a custom command on a remote host. As figure 4.4 shows, we want to be able to check “gmond” (a daemon of the next monitoring solution: Ganglia) and whether two NFS folders are mounted correctly. For that purpose we create a new configuration file for each worker and change the host section presented in figure 4.3 to the hostname and IP of the desired worker. The check command in the service section has to be used like this:

The NRPE check command accepts one argument: a command that is going to be executed on the remote host specified in the host section. In this case that command is a custom shell script, not a plugin from the Nagios plugin package. The next section describes the configuration that is necessary on the remote host before this command works.
 * Listing 4.2 NRPE check command in worker configuration file.
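A sketch of such a service section (the worker hostname and the remote command alias are placeholders):

```
define service {
    use                 generic-service
    host_name           worker1
    service_description Check gmond
    ; check_nrpe asks the NRPE daemon on worker1 to run
    ; the command registered under the alias "check_gmond"
    check_command       check_nrpe!check_gmond
}
```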

Worker

Additional software has to be installed on the workers as well. In order to be able to respond to NRPE commands from the master, the package nagios-nrpe-server has to be installed. That package provides Nagios plugins and a service that answers the NRPE requests from the master. We are not going to use a Nagios plugin; instead we write three basic shell scripts that will make sure that (as shown in figure 4.4):


 * 1) The gmond service of Ganglia is running.
 * 2) Both NFS folders are correctly mounted from the master.

Before we can define those commands we have to allow the master to connect to our worker nodes:

After that we can edit the file /etc/nagios/nrpe.cfg and add an alias and a path for each of the three scripts. The commands will be available to the master under the name of the specified alias.
 * Listing 4.3 Add the IP address of the master to /.
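Assuming the master has the IP 192.168.0.1 (an illustrative address), the relevant line in the worker's NRPE configuration could look like this:

```
# /etc/nagios/nrpe.cfg
# Comma-separated list of hosts that may talk to this NRPE daemon
allowed_hosts=127.0.0.1,192.168.0.1
```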

This is all that has to be done on the worker. One can check if everything is set up correctly with a simple command from the master, as listing 4.5 shows:
 * Listing 4.4 Add custom commands to
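The command aliases might be registered like this (script names and paths are illustrative, as the original file names were not preserved):

```
# Each command[<alias>] line maps an alias to the script that is
# executed when the master requests <alias> via check_nrpe.
command[check_gmond]=/usr/local/lib/nagios/check_gmond.sh
command[check_nfs_home]=/usr/local/lib/nagios/check_mount.sh /home
command[check_nfs_opt]=/usr/local/lib/nagios/check_mount.sh /opt
```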

Unfortunately, in our case some extra steps were needed, as the above command returned an error from every worker node. After turning the debug mode in the NRPE configuration on (and off again) on the worker nodes, the command returned the NRPE version and everything worked as expected. That is some strange behaviour, especially since it had to be done on every worker node.
 * Listing 4.5 Check if NRPE is setup correctly with.
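Calling check_nrpe without a command argument simply queries the NRPE version, which makes it a convenient connectivity test (plugin path as installed by the Ubuntu package; the hostname is a placeholder):

```shell
# Should print the NRPE version if the worker accepts our connection
/usr/lib/nagios/plugins/check_nrpe -H worker1
```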


 * Listing 4.6 Success!

Usage
Figure 4.5 shows the service status details for all hosts. Our custom commands are all working as expected. If that were not the case, they would appear like the one failing process in the figure, whose status is critical and visible at first glance. The Icinga plugin API allows four different return states:

 * 0) OK
 * 1) WARNING
 * 2) CRITICAL
 * 3) UNKNOWN

In addition to the return code it is possible to return some text output. In our example we only return “Everything ok!”. The plugin which checks the failing process uses that text output to give a reason for the critical service status, which is quite self-explanatory.
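A minimal sketch of such a plugin, here a hypothetical mount check (not the actual script from our cluster) that follows the return-code convention and prints one line of status text:

```shell
#!/bin/sh
# check_mount: prints Nagios/Icinga-style status text and uses the
# matching exit code (0 = OK, 2 = CRITICAL).
check_mount() {
    # /proc/mounts lists every mounted filesystem, one per line:
    # "<device> <mountpoint> <fstype> ..."
    if grep -qs " $1 " /proc/mounts; then
        echo "OK - $1 is mounted"
        return 0
    else
        echo "CRITICAL - $1 is not mounted"
        return 2
    fi
}

check_mount "${1:-/}"
```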



Ganglia
Ganglia is an open source distributed monitoring system specifically designed for high performance computing. It relies on RRDTool for data storage and visualization and is available in all major distributions. The newest version added some interesting features, which is why we did not use the older one provided by the official Ubuntu repositories.

Installation
The installation of Ganglia was pretty straightforward. We downloaded the latest packages for Ganglia and RRDTool, which is used to generate the nice graphs. RRDTool itself also needed some dependencies to be installed. After the compilation (no special configure flags were set) and installation we had to integrate RRDTool into the environment such that Ganglia is able to use it. This usually means adjusting the environment variables PATH and LD_LIBRARY_PATH. Out of personal preference we chose another solution, as listing 4.7 shows.

Ganglia also needs libconfuse and additionally libapr. Both also have to be installed on the worker nodes. It was important to specify --with-gmetad during the configuration.
 * Listing 4.7 Integrating RRDTool into the environment.
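For comparison, the conventional environment-variable approach would look roughly like this (the install prefix is an assumption; it is not necessarily the solution we actually used):

```shell
# Assuming RRDTool was compiled and installed to /opt/rrdtool
export PATH=/opt/rrdtool/bin:$PATH
export LD_LIBRARY_PATH=/opt/rrdtool/lib:$LD_LIBRARY_PATH
```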


 * Listing 4.8 Installation of Ganglia.
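The build itself followed the usual configure/make pattern; a sketch (the tarball name is an assumption, --with-gmetad as described above):

```shell
tar xzf ganglia-3.5.0.tar.gz
cd ganglia-3.5.0
./configure --with-gmetad   # gmetad is not built by default
make
sudo make install
```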

Configuration


Ganglia consists of two major components: gmond and gmetad. Gmond is a monitoring daemon that has to run on every node that is supposed to be monitored. Gmetad is a daemon that polls other gmond daemons and stores their data in RRD databases, which are then used to visualize the data in the Ganglia web interface. The goal was to configure Ganglia as shown in figure 4.6. The master runs two gmond daemons: one specifically for collecting data from the master, and the other one just to gather data from the gmond daemons running on the worker nodes. We installed Ganglia to a directory that is mounted on every worker via NFS.

In order to start the gmond and gmetad processes on the master and worker nodes, init scripts were used. The problem was that there were no suitable init scripts provided by the downloaded tarball. Our first idea was to extract the init script from the (older) packages of the Ubuntu repositories. That init script didn’t work as expected: restarting and stopping the gmond service caused problems on the master node, since two gmond processes were running there. Instead of using the PID of the service, they were killed by name, obviously no good idea. We tried to change that behaviour manually, but unfortunately that didn’t work. After the gmond process is started, the init system reads the PID of the started service and stores it in a gmond.pid file. The problem is that the gmond process daemonizes after starting and changes the running user (from root to nobody). Those actions also change the PID, which means the .pid file is no longer valid, and stopping and restarting the service won’t work. After a lot of trial and error we found a working upstart (the new init system used by Ubuntu) script in the most recent (not yet released) Ubuntu version 13.04. In that script we only had to adjust service names and make sure that the NFS partition is mounted before the service is started.
For some magical reason that setup even works on the master node with two gmond processes.
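The idea behind such an upstart job can be sketched as follows (all paths and the exact fork behaviour are assumptions, not a copy of the 13.04 script):

```
# /etc/init/gmond-worker.conf -- illustrative upstart job
description "Ganglia monitoring daemon (gmond)"

# wait until the NFS share with the Ganglia installation is mounted
start on mounted MOUNTPOINT=/opt/ganglia
stop on runlevel [!2345]

# gmond daemonizes after starting; "expect daemon" lets upstart
# track the final PID instead of the one of the launching process
expect daemon
respawn

exec /opt/ganglia/sbin/gmond --conf /opt/ganglia/etc/gmond.conf
```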

Master

At first we configured the gmetad daemon. We specified two data sources: “Infrastructure” (the master node) and “Cluster Nodes” (the workers). Gmetad gathers the data for these sources from the two gmond processes running on the master. To prevent conflicts, both accept connections on different ports: 8649 (Infrastructure) and 8650 (Cluster Nodes). We also adjusted the grid name and the directory in which the RRD databases are stored.
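In gmetad.conf this boils down to a few lines (the grid name and RRD directory shown here are illustrative):

```
# Two data sources, served by the two gmond processes on the master
data_source "Infrastructure" localhost:8649
data_source "Cluster Nodes" localhost:8650

gridname "Our Cluster"
# directory in which the RRD databases are stored
rrd_rootdir "/opt/ganglia/rrds"
```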

The next step was to configure the two gmond processes on the master: gmond_master and gmond_collector. Since the gmond_master process doesn’t communicate with other gmond daemons, no communication configuration was necessary. We only had to specify a tcp_accept_channel on which the gmond responds to queries from gmetad. Additionally, one can specify names for the host, cluster and owners and provide a location (for example the particular rack).
 * Listing 4.9 Interesting parts of.
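The interesting parts of the gmond_master configuration could look like this (cluster name, owner and location are illustrative):

```
cluster {
  name  = "Infrastructure"
  owner = "Cluster Handbook"
}

host {
  location = "rack 0"
}

# answer queries from gmetad on this port
tcp_accept_channel {
  port = 8649
}
```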

The gmond_collector process needs to communicate with the four gmond_worker processes. There are two different communication methods present in Ganglia: unicast and multicast. We chose unicast and the setup was easy. The gmond_collector process additionally has to accept queries from the gmetad process, which is why we specified another tcp_accept_channel. On the specified udp_recv_channel the gmond_collector waits for data from the gmond_worker processes.
 * Listing 4.10 Configuration of.
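A sketch of the relevant gmond_collector sections (the UDP port is a placeholder; the TCP port matches the gmetad data source described above):

```
cluster {
  name = "Cluster Nodes"
}

# receive metric packets from the gmond_worker processes
udp_recv_channel {
  port = 8666
}

# answer queries from gmetad
tcp_accept_channel {
  port = 8650
}
```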

Worker
 * Listing 4.11 Configuration of.

The gmond_worker processes neither listen to other gmond processes nor accept queries from a gmetad daemon. Thus the only interesting part of the configuration file is the sending mechanism of that gmond daemon.

 * Listing 4.12 Configuration of.
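The send channel on the workers could therefore be as simple as this (master IP and port are placeholders, matching the receive channel of the collector):

```
# unicast all metrics to the gmond_collector on the master
udp_send_channel {
  host = 192.168.0.1
  port = 8666
}
```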

Usage
By default, Ganglia already gathers and visualizes data about the CPU, memory, network and storage. It is also possible to extend the monitoring capabilities with custom plugins. The gathered data can be viewed in many small graphs, each only featuring one data source, or in larger aggregated “reports”.



The front page of Ganglia shows many of those aggregated reports for the whole grid and the “sub clusters”. Figure 4.7 shows that front page, from where it is possible to navigate to the separate sub clusters and also to specific nodes. The reports on that page also show some interesting details: the master node, for example, has some outgoing network traffic every 5 minutes. By default all reports show data from the last hour, but it is also possible to show the data of the last two or four hours, the last week, month or year.

Graph aggregation

An especially interesting feature is custom graph aggregation. Let’s say there is a report available that visualizes the CPU utilization of all (for example ten) available nodes. If you run a job that requires four of these nodes, you are likely not interested in the data of the other six nodes. With Ganglia you can create a custom report that only includes the nodes matching a regular expression that you specify.





If that is not enough, it is also possible to create entirely custom aggregated graphs, where you can specify the used metrics, axis limits and labels, the graph type (line or stacked) and the nodes. In figure 4.8 we specified such a graph: we chose a custom title, set the Y-axis label to percent, set the lower and upper axis limits to 0 and 100, and chose the system CPU utilization as the metric. It is also possible to choose more than one metric, as long as the composition is meaningful.