
SLURM
A cluster is a network of resources that must be controlled and managed in order to achieve error-free operation. The nodes must be able to communicate with each other; generally there are two categories - the login node (also called master or server node) and the worker nodes (also called client nodes). Commonly, users cannot access the worker nodes directly but can run programs on all nodes. Usually several users claim the resources for themselves, so their distribution cannot be set by the users themselves but follows specific rules and strategies. All these tasks are taken over by the job and resource management system - a batch system.

Batch-System overview
A batch system is a service for managing resources and is also the interface for the user. The user submits jobs - tasks with programs to execute and a description of the required resources and conditions. All jobs are managed by the batch system. The major components of a batch system are a server and the clients. The server is the main component and provides an interface for monitoring; its main task is assigning resources to the registered clients. The main task of the clients is the execution of the pending programs. A client also collects information about the progress of its programs and the system status; this information can be provided to the server on request. A third, optional component of a batch system is the scheduler. Some batch systems have a built-in scheduler, but all offer the option to integrate an external scheduler into the system. The scheduler decides, according to certain rules, who may use how many resources, and when. A batch system with all the components mentioned is SLURM, which is presented in this section with background information and instructions for installation and use. Qlustar also ships with a SLURM integration.

SLURM Basics
SLURM (Simple Linux Utility for Resource Management) is a free batch system with an integrated job scheduler. SLURM was created in 2002, mainly as a joint effort of Lawrence Livermore National Laboratory, Linux NetworX, Hewlett-Packard, and Groupe Bull. Soon, more than 100 developers had contributed to the project. The result of these efforts is a software that is used in many high-performance computers of the TOP500 list (including the currently fastest, Tianhe-2). SLURM is characterized by very high fault tolerance, scalability, and efficiency. There are backups for the daemons (see section 5.3) and various options to respond dynamically to errors. It can manage more than 100,000 jobs, accept up to 1,000 jobs per second, and execute up to 600 jobs per second. Currently unused nodes can be shut down in order to save power. Moreover, SLURM is compatible with a variety of operating systems. Originally developed for Linux, many more platforms are supported today: AIX, *BSD (FreeBSD, NetBSD, and OpenBSD), Mac OS X, and Solaris. It is also possible to interconnect different systems and run jobs across them. For scheduling, a mature concept with a variety of options has been developed. With policy options, several levels can be defined, each of which can be managed separately. Thus, a database can be integrated in which user groups and projects are recorded, each subject to its own rules; a user can also be granted rights as part of his group or project. SLURM is an actively developed project. In 2010, the developers founded the company SchedMD, which offers paid support for SLURM.

Setup
The heart of SLURM are two daemons - slurmctld and slurmd. Both can have a backup. The control daemon slurmctld, as the name suggests, runs on the server. It initializes, controls, and logs all activity of the resource manager. This service is divided into three parts: the job manager, which manages the queue of waiting jobs; the node manager, which holds status information of the nodes; and the partition manager, which allocates the nodes. The second daemon, slurmd, runs on each client and performs the instructions coming from slurmctld and srun. With a special command, the client additionally reports its status information to the server. Once the connection is established, the various SLURM commands can be used from the server. Some of these can theoretically also be called from a client, but usually they are carried out only on the server.



Picture 5.1 exemplifies some of these commands. The five most important ones are explained in detail below.

sinfo
This command displays the node and partition information. With additional options the output can be filtered and sorted.

The column PARTITION shows the name of the partition; an asterisk marks the default partition. The column AVAIL refers to the partition and can show up or down. TIMELIMIT displays the user-specified time limit; unless specified, the value is assumed to be infinite. STATE indicates the status of the listed nodes. Possible states are allocated, completing, down, drained, draining, fail, failing, idle, and unknown, abbreviated as alloc, comp, down, drain, drng, fail, failg, idle, and unk respectively. An asterisk after a state means that no feedback was obtained from the node. NODELIST shows the node names set in the configuration file. The command accepts several options that can, on the one hand, query additional information and, on the other, format the output as desired. Complete list - https://computing.llnl.gov/linux/slurm/sinfo.html
 * Listing 5.1 Output of the sinfo Command.
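Since the original listing is not reproduced here, the following sketch shows what sinfo output typically looks like (partition and node names are hypothetical):

```
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      2   idle node[01-02]
```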

srun
With this command you can interactively send jobs and/or allocate nodes.

In this example, a job is to run on 2 nodes with 2 CPUs in total (not per node).
 * Listing 5.2 Interactive srun Usage.
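As the original listing is missing, a sketch of such a call (the program name is hypothetical):

```
srun -N 2 -n 2 ./my_program
```

Here -N sets the number of nodes and -n the total number of tasks.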

With the allocate option you first reserve resources; within this allocation, programs can then be run as long as they do not exceed the scope of the registered resources.
 * Listing 5.3 srun Command with some options.

A complete list of options - https://computing.llnl.gov/linux/slurm/srun.html

scancel
This command is used to abort a job or one or more job steps. As parameter, you pass the ID of the job that is to be stopped. The user's rights determine which jobs he is allowed to cancel.

In the example, all jobs are to be cancelled that are in the pending state, belong to a certain user, and are in a certain partition. If you do not have permission, the output will say so.
 * Listing 5.4 scancel Command with some options.
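A sketch of such a call (user and partition names are hypothetical):

```
scancel --state=PENDING --user=alice --partition=debug
```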

Complete list of options - https://computing.llnl.gov/linux/slurm/scancel.html

squeue
This command displays job-specific information. Again, the content and extent of the output can be controlled with additional options.

JOBID indicates the identification number of a job. The column NAME shows the corresponding name of the job, which can be modified or extended manually. ST is the status of the job, for example PD - pending or R - running (there are many other statuses). Accordingly, the clock under TIME runs only for jobs whose status is running. This time is not a limit but the current running time of the job; while a job is on hold, the timer remains at 0:00. Under NODES you see the number of nodes required for the job; in the example, the running job needs only one node, the waiting one two. The last column, NODELIST(REASON), shows either the allocated nodes or the reason why the job is not running. For more detailed information, further options must be passed.
 * Listing 5.5 Output of the squeue Command.
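A sketch of such output, matching the situation described above (job and user names are hypothetical):

```
JOBID PARTITION  NAME   USER ST  TIME NODES NODELIST(REASON)
  105     debug  job2  alice PD  0:00     2 (Resources)
  104     debug  job1  alice  R  1:23     1 node01
```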

A complete list - https://computing.llnl.gov/linux/slurm/squeue.html

scontrol
This command is used to view or modify the SLURM configuration and the state of one or more jobs. Most operations can be performed only by the root user. One can write the desired subcommands directly after the call, or call scontrol alone and continue working in its interactive context. The example shows the use of the command from this context. It can also be set how much information specific queries return.
 * Listing 5.6 Using the scontrol command.
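A sketch of such an interactive session (the job ID and time limit are hypothetical):

```
$ scontrol
scontrol: show job 104
scontrol: update JobId=104 TimeLimit=30
scontrol: exit
```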

A complete list of options - https://computing.llnl.gov/linux/slurm/scontrol.html

Mode of operation
There are basically two modes of operation. You can call a compiled program interactively from the server, passing a number of options such as the number of nodes and processes on which the program is to run. It is often more convenient to write job scripts, in which the options are kept in one place and can be commented. The following sections explain the syntax and semantics of the options in interactive mode; job scripts are explained as well.

Interactive
The key command here is srun. With it one can perform jobs interactively and allocate resources. The command is followed by options that are passed to the batch system. There are options whose values are stored in SLURM environment variables (see Section 5.3).

Jobscript
A jobscript is a file containing, for example, bash commands. There are no input or output parameters; in a jobscript, the environment variables are set directly. Lines that set SLURM options must therefore be marked with #SBATCH. Other lines that start with a hash are comments. The main part of a jobscript is the program call. Optionally, additional parameters can be passed as in interactive mode.

The script is called as usual and runs all included commands. A jobscript can contain several program calls, even of different programs. Each program call can have its own options attached, which override the environment variables. If no further details are given, the options specified at the beginning of the script apply.
 * Listing 5.7 Example for a jobscript.
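A minimal sketch of such a jobscript (job name and program are hypothetical):

```
#!/bin/bash
#SBATCH --job-name=example      # name shown by squeue
#SBATCH --nodes=2               # number of nodes
#SBATCH --ntasks=2              # total number of tasks
#SBATCH --time=00:10:00         # time limit

srun ./my_program               # the actual program call
srun -n 1 ./my_program --serial # per-call options override the values above
```

The script would then be submitted with sbatch.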

Installation
SLURM can, like other software, be installed either from a prebuilt Ubuntu package or manually, which is significantly more complex. For the current version there is not yet a finished package; if the advantages of the newer version listed below are not required, the older version is sufficient. The following sections provide instructions for both methods. Improvements in version 2.5.3:


 * Race conditions eliminated at job dependencies
 * Effective cleanup of terminated jobs
 * Correct handling of  and
 * Bugs fixed for newer compilers
 * Better GPU-Support

Package
The prebuilt package contains version 2.3.2 and is installed with the distribution's package manager. In the second step, a configuration file must be created. There is a website that accepts manual entries and generates the file automatically - https://computing.llnl.gov/linux/slurm/configurator.html. In this case, only the name of the machine has to be adjusted. Most importantly, this file must be identical on all clients and on the server. The daemons are started automatically. With sinfo you can quickly check whether everything went well.
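On Ubuntu systems of that time the package was named slurm-llnl (an assumption - check the package name for your release):

```
sudo apt-get install slurm-llnl
```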

Manually
Via a link you download the archive with the latest version - 2.5.3. The archive must be unpacked.

Furthermore, the configure script must be called. It would suffice to specify no options; however, one should make sure that the installation directory is the same on the server and the clients, because otherwise you would have to configure each client individually. If make and a C compiler are available, you can build and install SLURM.
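A sketch of the build steps (the prefix is an assumption; any path shared by server and clients works):

```
./configure --prefix=/opt/slurm
make
sudo make install
```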

As in the last section, the configuration file must be created (see link above) for the manual installation as well. The name of the machine has to be adjusted. The option ReturnToService should get the value 2; otherwise, after a failure a node is no longer used, even if it is available again. In addition, Munge should be selected as the authentication mechanism.

Munge

Munge is a highly scalable authentication service. It is needed so that a client node responds only to requests from the “real” server, not from an arbitrary one. For this, a key must be generated and copied to the server and all associated client nodes. Munge can be installed as usual. In order for authentication requests to be validated correctly, the clocks on all machines must be synchronized; for this, the ntp service is sufficient, since the relative error is within Munge's tolerance. The SLURM server should also be registered as the time server.
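One way to generate and distribute the key, sketched (paths as in the stock Munge packages; the hostname is hypothetical):

```
sudo /usr/sbin/create-munge-key
sudo scp /etc/munge/munge.key node01:/etc/munge/munge.key
```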

After Munge has been installed, a system user has been added; it must match the user specified in the configuration file.

The goal is that each user can submit jobs and execute SLURM commands from his working directory. This requires that the PATH variable is customized in the shell profile (e.g. ~/.bashrc). Add the following path:
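A sketch, assuming SLURM was installed under /opt/slurm:

```
export PATH=$PATH:/opt/slurm/bin:/opt/slurm/sbin
```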

The path refers, of course, to the directory where SLURM has been installed and may differ. Now the user can call SLURM commands from any directory. However, this does not include sudo commands, because sudo does not read the modified PATH variable. For this you use the following detour:
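A sketch of the detour (daemon name as installed above):

```
sudo $(which slurmctld)
```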

With which one gets the full path of a command, read from the modified environment variable; this path is then passed to the sudo command. This is useful because, after a manual installation, both daemons must be started manually: on the server slurmctld is executed, and slurmd on the client machines. With the additional options -Dvvvv you can see the error messages in more detail if something went wrong: -D stands for debugging and -v for verbose; the more “v”s are strung together, the more detailed the output.

Scheduler
The scheduler is an arbitration logic that controls the temporal order of the jobs. This section covers the internal scheduler of SLURM with its various options, as well as potential external schedulers.

Internal Scheduler
In the configuration file, three methods can be selected - builtin, backfill, and gang.

Builtin
This method works on the FIFO principle without further intervention.

Backfill
The backfill method is a kind of FIFO with more efficient allocation. If a job requires currently free resources and is queued behind other jobs whose resource claims cannot currently be satisfied, the “minor” job is preferred. The time limit defined by the user is decisive here.



The left side of graphic [fig:schedul] shows a starting situation with three jobs. Job1 and job3 each need only one node, while job2 needs two. Both nodes are free in the beginning. Following the FIFO strategy, job1 would be executed and block one node; the other two jobs would have to wait in the queue although a node is still available. Following the backfill strategy, job3 would be preferred over job2, since the resources required for job3 are currently free. Very important here is the prescribed time limit: job3 finishes before job1 and thus does not delay job2's execution once both nodes become available. If job3 had a longer time limit, it would not be preferred.

Gang
Gang scheduling is only applicable if resources are not allocated exclusively (the Shared option must be set accordingly). It switches between time slices. Gang scheduling causes processes that belong together to be handled in the same time slots where possible, thus reducing context switches. The SchedulerTimeSlice option specifies the length of a time slot in seconds. Within a time slice, different distribution strategies can be used if multiple processes compete for resources. In older versions (below 2.1) the distribution follows the round-robin principle.

To use the Gang-scheduling at least three options have to be set:
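A sketch of the corresponding slurm.conf entries (the exact set of options is an assumption for SLURM 2.x; values and names are examples):

```
PreemptMode=GANG         # activate gang scheduling
SchedulerTimeSlice=30    # length of a time slot in seconds
PartitionName=debug Nodes=node[01-02] Shared=FORCE:4   # non-exclusive allocation
```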

By default, up to 4 jobs would be able to share the same resources. With the option Shared=FORCE:xy the number xy can be defined.

Policy
However, the scheduling capabilities of SLURM are not limited to these three strategies. With the help of policy options, each strategy can be refined and adjusted to the needs of users and administrators. The policy is a kind of house rules to which every job, user, group, or project is subject. In large systems this quickly becomes confusing when each user has his own set of specific rules; therefore SLURM also supports database connections via MySQL or PostgreSQL. To use a database, it must be explicitly configured for SLURM, and on the SLURM side certain options need to be set so that the policy rules can be applied to the defined groups. Detailed description - https://computing.llnl.gov/linux/slurm/accounting.html.

Most scheduling options target the priority of jobs. SLURM uses a complex calculation method to determine the priority - the priority multifactor plugin. Five factors play a role in the calculation: age (waiting time of a pending job), fair-share (the difference between allocated and consumed resources), job size (number of requested nodes), partition (a factor assigned to a node group), and QOS (a quality-of-service factor). Each of these factors also receives a weighting, so some factors can be made more dominant. The overall priority is the sum of the weighted factors (the factors are float values between 0.0 and 1.0):
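Based on SLURM's multifactor priority documentation, the sum has this form:

```
Job_priority = (PriorityWeightAge)       * (age_factor)
             + (PriorityWeightFairshare) * (fair-share_factor)
             + (PriorityWeightJobSize)   * (job_size_factor)
             + (PriorityWeightPartition) * (partition_factor)
             + (PriorityWeightQOS)       * (QOS_factor)
```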

The detailed descriptions of the factors and their composition can be found here: https://computing.llnl.gov/linux/slurm/priority_multifactor.html.

Particularly interesting is the QOS (Quality of Service) factor. The prerequisite for its usage is the multifactor plugin and a nonzero QOS weighting. A user can specify a QOS for each job; this affects the scheduling, context switches, and limits. The allowed QOSs are specified as a comma-separated list in the database; QOSs in that list can be used by the users of the associated group. The default value normal does not affect the calculations. However, if a user knows that his job is particularly short, he could define his jobscript as follows:
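A sketch, assuming the administrator has defined a QOS named short:

```
#!/bin/bash
#SBATCH --qos=short

srun ./my_program
```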

This option increases the priority of the job (if configured accordingly), but the job is cancelled once it exceeds the time limit defined in its QOS. Thus, one should estimate the running time realistically. The available QOSs can be displayed with the command:
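With the accounting tool sacctmgr:

```
sacctmgr show qos
```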

The default values of the QOS look like this:

These values can be changed by the administrator in the configuration. Example:
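A sketch of such a modification (QOS name and limit are hypothetical):

```
sacctmgr modify qos short set MaxWall=00:15:00
```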

Complete list - https://computing.llnl.gov/linux/slurm/resource_limits.html

External Scheduler
SLURM is compatible with various other schedulers, including Maui, Moab, LSF, and Catalina. If an external scheduler is to be integrated, the corresponding scheduler type must be selected in the configuration file.

Maui
Maui is a freely available scheduler by Adaptive Computing. The development of Maui was discontinued in 2005. The package can be downloaded after registration on the manufacturer's site. Java is required for the installation. Maui features numerous policy and scheduling options, which, however, are now also offered by SLURM itself.

Moab
Moab is the successor of Maui. Since the Maui project was discontinued, Adaptive Computing has developed the package under the name Moab and under a commercial license. Moab is said to scale better than Maui. Paid support is also available for the product.

LSF
LSF - Load Sharing Facility - is commercial software from IBM. The product is suitable not only for IBM machines, but also for systems with Windows or Linux operating systems.

Catalina

Catalina is a project that has been ongoing for years, of which there is currently a pre-production release. It includes many features of Maui, supports grid computing, and allows guaranteeing the availability of nodes after a certain time. Python is required for its use.

Conclusion
The product can be used without much effort. The development of SLURM adapts to current needs, so it can be used not only on a small scale (fewer than 100 cores) but also in leading, highly scalable architectures. This is supported by SLURM's reliability and sophisticated fault tolerance. SLURM's scheduler options leave little to be desired; no extensions are necessary. All in all, SLURM is a very well-made, up-to-date software.