Use of the compute cluster

We operate a small HPC cluster at the department. Compared to a large HPC cluster such as LiDO, which is operated by the ITMC, our system offers some conveniences that make interactive work easier, and its software is tailored more closely to the needs of our department.

Before you can access the system for the first time, your account must be activated for cluster use. This is also possible for students without any problems; simply contact the sysadmins.

Basic usage

Resource management on the cluster is handled by the SLURM resource manager. You always communicate with the cluster from our shell server shell.statistik.tu-dortmund.de. This has the advantage that you can also submit jobs or check their status from home (via a VPN connection to the university). For direct interaction with the cluster, the commands salloc, sbatch and srun are used to start jobs, scancel to cancel jobs, squeue to view the current job queue and sacct to view the job history. The option -u $USER restricts the output of the job queue and job history to your own account.

Starting jobs directly with the aforementioned commands should only be necessary in exceptional cases. Normally, you will use one of the scripts described below.

A job is canceled with scancel $JobID. The JobID is a sequential number assigned to each job when it is started; it can also be looked up with squeue. If you have many jobs that you want to cancel, for example because you have found an error in your script, you can use the kill_all_jobs command. After a confirmation prompt, this script does exactly what its name says and cancels all jobs belonging to your account.
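For example, to inspect and cancel your own jobs from shell.statistik.tu-dortmund.de:

squeue -u $USER      # your jobs currently queued or running
sacct -u $USER       # your job history
scancel $JobID       # cancel a single job
kill_all_jobs        # cancel all of your jobs (after a confirmation prompt)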

The official documentation for the SLURM resource manager can be found here.

Walltime and memory

In order to assign a suitable compute node to each job, the resource manager needs information about the expected runtime and memory consumption. You should think about this briefly before starting a job; however, there is no need to search manually for a machine with free CPUs and sufficient free memory.

There are currently five partitions. The partition keller contains all machines in the Linuxpool (nodes fisher, rao, kendall, pearson, tukey and neyman), which are all similar in speed. The partition all contains every available machine, i.e. the Linuxpool nodes plus two additional, faster servers (vaughan and eicker). The partitions interactive and matlab contain only these two servers. The partition gpu contains only the server vaughan and can be used for calculations on the GPU. You can see which node your job is running on with squeue.

When selecting walltime and memory, keep in mind that these limits are hard: if they are exceeded, the job is terminated automatically. The smaller the requested values, the easier it is for the batch system to find a free slot. Nevertheless, the walltime should be chosen rather conservatively, as our compute nodes differ in speed. While the Linuxpool nodes (fisher, kendall, neyman, pearson, rao, tukey) are about equally fast, the speed of the compute servers (eicker, vaughan) varies with load and ambient temperature.

The maximum walltime for batch jobs is 2 days. Interactive jobs can request up to 14 days of walltime; however, jobs of this length are not recommended. We reserve the right to change these values.

If you do not know how long your jobs will run and how much memory they require, it is advisable to start one or two test jobs first. In batch mode, R prints the runtime of the job at the end of the log file by default. If you additionally call the garbage collector with gc() as the last command of an R script, you also get an output of the maximum memory used.
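For example, the R script of a test job could end with a call to gc(); in the log file, the "max used" column of its output shows the peak memory consumption of the session (a minimal sketch with a made-up workload):

# ... actual computations of the test job (hypothetical example) ...
x <- matrix(rnorm(1e7), ncol = 100)
fit <- lm(x[, 1] ~ x[, 2])

# Last command: print garbage collector statistics; the "max used"
# column shows the peak memory consumption of this R session.
gc()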

If a job is terminated prematurely by the resource manager, the reason is usually given at the end of the log file. If there is nothing there (this can happen if the job was killed hard), you can use sacct to request information about your own jobs from the resource manager afterwards. By default, sacct shows brief information for the current day; detailed information from a specific date onwards can be obtained with sacct -S MMDD -l. Information on currently running jobs can be obtained with sstat.
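For example (the start date is given in MMDD form; the job ID is a placeholder):

sacct -u $USER -S 0601 -l    # detailed job history since 1 June
sstat -j $JobID              # statistics for a currently running job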

A total of 168 (physical) CPU cores and 832 GB of memory are available in our cluster, but they are not evenly distributed across the compute nodes. Additional jobs are queued and executed automatically as running jobs finish. If a compute node fails (hardware defect, power failure, etc.), the batch system automatically restarts the batch jobs that were running on it on another node.

The execution order of jobs in the queue depends on their priority. The priority is calculated from the computing time used over the last few weeks and the waiting time, in order to achieve a balanced allocation of resources between the different users.

Here is an overview of how many jobs with a certain memory consumption can be run simultaneously on our cluster.

Memory per job (up to MB)    Number of simultaneous jobs
  3000                       168
  7000                       114
 14000                        52
 16000                        46
 28000                        26
 32000                        20
 64000                        10
128000                         5
256000                         2
384000                         1

Gaps in the node utilization are of course filled by the batch system as well as possible. Even while one 384 GB job is running, the system could, for example, still start two 128 GB jobs and twelve 14 GB jobs on the nodes with less memory.

Interactive Jobs

As previously mentioned, our cluster offers a very convenient way of obtaining a job for interactive work. In contrast to a normal HPC cluster, you do not have to join the regular job queue.

It is recommended to use batch mode whenever possible.

The script interactive is used to start an interactive job: it starts an interactive shell on a compute node. You can also run it inside screen so that you can disconnect in the meantime. If you call the script without options, it asks for the walltime (desired usage time) and the memory requirement and then establishes a connection to the assigned node. There is no need to search manually for a machine with sufficient free memory and free CPUs. You can then start your programs (e.g. R) as usual.

If you do not specify anything for the walltime and memory, you will get 12 hours of time and 1024 MB of memory.

Warning

Once the walltime has expired or the requested memory is exceeded, the session is terminated automatically, without exception.

The script accepts two additional arguments, ncpus and queue. With ncpus (modeled on the LiDO PBS option ppn) you can reserve several CPUs for the interactive job (e.g. ncpus=4 for 4 cores). This is useful if you want to parallelize with the parallel package in R. The queue option can be used to switch to another queue on the cluster; this is only useful for testing purposes by the admins.

The Matlab queue is only accessible to members of the LSIckstadt and LSRahnenfuehrer groups.

The complete help for the command is available with interactive help.

However, this convenient way of using the cluster has some disadvantages. If you forget a job (e.g. because you have detached your screen session), it is terminated by the batch system without further inquiry once the walltime expires; there is no admin who will ask beforehand. (On the other hand, forgotten jobs can no longer block a machine permanently.) If the assigned compute node fails, jobs started in interactive mode cannot be restarted automatically by the batch system.

Example: starting an interactive job with resource query (here 3:30 h usage time, 512 MB memory), starting R, quitting R and ending the interactive job.

skrey@shell:~$ interactive
Walltime ([dd-hh:]mm[:ss]) : 3:30:00
Memory                     : 512
salloc: Granted job allocation 9898
skrey@computed:~$ R
>
> quit()
Save workspace image? [y/n/c]: n
skrey@computed:~$ exit
logout
salloc: Relinquishing job allocation 9898
skrey@shell:~$

Further example: starting an interactive job without the resource query (passing the values directly as arguments), requesting 2 cores and switching to the matlab queue.

skrey@shell:~$ interactive walltime=3:30:00 memory=512 ncpus=2 queue=matlab
salloc: Granted job allocation 9899
skrey@computec:~$ matlab
>>
>> quit
skrey@computec:~$ exit
logout
salloc: Relinquishing job allocation 9899
skrey@shell:~$

Here you can see directly that the resource manager has assigned different compute nodes. If you enter just a number without a colon as the walltime, it is interpreted as minutes (LiDO would interpret it as seconds).

All optional arguments can also be entered in a configuration file for the script (.interactiveconf in the user's home directory). They are then no longer queried interactively.

R in batch mode

The submitR script is used to start R batch jobs. In comparison to interactive, it also asks for a mail address for notifications about job start, end and errors (this can be switched off with the option mail=false); the default is login@statistik.tu-dortmund.de. The default value for the memory is again only 1024 MB. There are also the options outdir, which defines the location of the job's log files (default is the current directory; if the specified directory does not exist, it is created), and autosubmit. The default autosubmit=true sends the job directly to the batch system; with autosubmit=false only a job file is generated, which you then submit manually with sbatch (practical if you want to set options that submitR does not support).

This script does not support the ncpus option for reserving multiple processors. Instead, there are a few other options that make it easy to start parallelized jobs or a large number of jobs that differ only slightly (e.g. for a simulation study).

If you do not need inter-process communication, you can start a script with different settings using the replicate option (e.g. replicate=3, which is equivalent to replicate=1-3, or replicate=1,3-5). This starts the corresponding number of jobs and sets, for each job, an environment variable with the replicate number, which can be read with Sys.getenv("PBS_ARRAYID") and used to index an object with parameters, set seeds or similar. Compared to parallelization via multiple CPU cores or MPI, this way of creating several R jobs has the advantage that the individual jobs are smaller and therefore find a free slot on the cluster more easily. In addition, parallelization is very easy with this method, as only minimal changes to serial R scripts are necessary.

After the jobs have finished, you should always check the log files to identify any errors that may have occurred. If a job was aborted due to a walltime or memory overrun, this is also shown in the log file. A quick way to search the log files for such aborts is fgrep -i -e killed -e canceled *.out | awk '{print $1}'.

If you want to ensure that jobs always run on the same type of node, for example to compare job runtimes, you can use the excludenodes option of submitR to restrict the node selection to nodes of one type.

The Rversion option can be used to specify the R version, for example Rversion=3.0.3. The default is the latest installed version, which is also what a plain call of R starts in interactive mode.

Example: R batch job with default values.

skrey@shell:~$ submitR test.R
Notify address             :
Using default address 'skrey@statistik.tu-dortmund.de'.
Walltime ([dd-hh:]mm[:ss]) :
Using default walltime of 12:00:00.
Memory                     :
Using default memory of 1024 MB.
Chosen queue               : all
Chosen job id              : 1369223637
Working directory          : /home/skrey
PBS Job ID                 : Submitted batch job 9904
skrey@shell:~$

The script has now started a job with a 12 h runtime and 1024 MB of allocated memory. Notifications are sent to the user's statistics mail address. The PBS Job ID is the number of the job within the batch system. To cancel this job, enter scancel 9904.

Example: several R batch jobs that differ only in the entry of a parameter list they use, started with replicate. The walltime is set to 2 minutes on the command line and the log files go to the artest directory.

skrey@shell:~$ submitR artest.R walltime=2 replicate=1,3-5 outdir=artest
Notify address             :
Using default address 'skrey@statistik.tu-dortmund.de'.
Memory                     :
Using default memory of 1024 MB.
Chosen queue               : all
Chosen job id              : 1369225331
Working directory          : /home/skrey
PBS Job ID                 : Submitted batch job 9913
PBS Job ID                 : Submitted batch job 9914
PBS Job ID                 : Submitted batch job 9915
PBS Job ID                 : Submitted batch job 9916
skrey@shell:~$

The corresponding R script looks like this:

p <- as.integer(Sys.getenv("PBS_ARRAYID"))
print(p)

param <- list(list(ar = c(1, -0.9, 0.3)),
              list(ar = c(1, -0.4, 0.1)),
              list(ar = c(1, -0.4, 0.1)),
              list(ar = c(0.8897, -0.4858), ma = c(-0.2279, 0.2488)),
              list(order = c(1,1,0), ar = 0.7)
              )

# Sensible initialization of the random number generator for parallel
# calculations:
library("rlecuyer")
.lec.SetPackageSeed(c(6,13,73,4,52,1)) # rlecuyer equivalent of set.seed()
nstreams <- 5 # number of random number streams
names <- paste("myrngstream", 1:nstreams, sep="") # names for the RNG streams
.lec.CreateStream(names) # create the random number streams
.lec.CurrentStream(names[p]) # select the p-th stream

y <- arima.sim(n = 1000, param[[p]])

z <- ar(y)
print(z)
save(z, file=paste("artest-", p, ".RData", sep=""))

First, the environment variable PBS_ARRAYID is read and stored as an integer in p. This p is then used to index the parameter list. Subsequently, five independent random number streams are generated with rlecuyer and the p-th stream is selected (important for the reproducibility of simulations). This stream is then used to simulate data, estimate an AR model and save the model. The file name is again composed of a self-chosen part and the parameter index p.

submitR also offers the possibility to use the different parallelization systems of R.

If you start submitR with the mcparallel option before the .R file to be executed (e.g. submitR mcparallel mctest.R), the multicore functions of the parallel and multicore packages (mclapply, mcparallel, etc.) can be used directly, just as on a desktop computer; no initialization is necessary. In addition to the settings described above, the number of CPU cores to be used is also queried here. It can also be specified on the command line via the ntasks option. The maximum number of CPU cores on the compute servers is 28 (per CPU); the Linuxpool machines have 8 cores per CPU. A sensible compromise between speeding up the computation through parallelization and minimizing the waiting time for a sufficiently large compute node is ntasks=4.
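A minimal sketch of an R script for such a multicore job (assuming it was submitted with submitR mcparallel and ntasks=4; the workload is made up for illustration):

library("parallel")

# Number of cores requested via ntasks (here assumed to be 4).
ncores <- 4

# Hypothetical workload: repeat a small simulation 100 times,
# distributed over the reserved cores with mclapply().
res <- mclapply(1:100,
                function(i) mean(rnorm(1e6)),
                mc.cores = ncores)

str(head(res, 3))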

With the parallel option before the .R file, R is started in an MPI environment and the parallel package (as well as the obsolete snow and snowfall packages) can be used with Rmpi. When using Rmpi, remember to call mpi.exit() at the end of the R script to terminate the job cleanly. Without this call, the batch system assumes that the MPI job did not terminate cleanly and reports a job abort in the associated e-mail. MPI is not limited to the CPU cores of a single compute node but can also make use of several nodes.

By default, the node allocation for MPI jobs prefers homogeneous nodes in order to maximize execution speed: all nodes with different hardware are excluded via the excludenodes option of submitR. Under certain circumstances this can significantly increase the time until the job is scheduled. The automatic selection can be overridden by passing the option manually. Alternatively, the option sensiblempi=false can be used to obtain the requested number of CPUs as quickly as possible (at the risk of getting mixed nodes, which reduces the execution speed).
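An example invocation (assuming the same key=value syntax as the other submitR options):

skrey@shell:~$ submitR parallel testp.R ntasks=12 sensiblempi=false walltime=10 memory=2048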

Finally, an example of a parallel R job with MPI:

skrey@shell:~$ submitR parallel testp.R autosubmit=true email=skrey walltime=10 memory=2048
Number of tasks (#)        : 12
Chosen queue               : all
Chosen job id              : 1369228604
Working directory          : /home/skrey
PBS Job ID                 : Submitted batch job 9919
skrey@shell:~$

After the number of tasks is queried (it can also be passed directly with the ntasks option), the parallel R job starts.

The following are a few minimal examples of how you can (should) set up a script for a parallel R job with MPI.

Parallel with MPI:

library("Rmpi")
library("parallel")

# Since R always requires a control process, the number of workers must be 1
# less than the number of tasks assigned by the batch system.
ntasks <- mpi.universe.size() - 1

# Initialize cluster
cl <- makeCluster(ntasks, type = "MPI")

# Initialize random number generators on all workers of the cluster with a
# specific seed (prevents cycles in the random numbers between the workers)
clusterSetRNGStream(cl, iseed=42)

# Parallel Code, e.g. with clusterApply, parSapply(), clusterEvalQ(), etc.

# End Cluster and MPI
stopCluster(cl)
mpi.exit()

Warning

Currently, jobs do not finish cleanly with the above examples. A workaround is to shut down the cluster in the wrong order; after your computations have finished, the cluster is then closed with a segfault (which can essentially be ignored):

...
mpi.exit()
stopCluster(cl) # causes an expected segfault now

Snowfall with MPI:

library("Rmpi")
library("snowfall")

# Since R always requires a control process, the number of workers must be 1
# less than the number of tasks assigned by the batch system.
ntasks <- mpi.universe.size() - 1

# Initialize snowfall
sfInit(parallel=TRUE, type="MPI", cpus=ntasks)

# Initialize random number generators on all workers of the cluster with a
# specific seed (prevents cycles in the random numbers between the workers)
sfClusterSetupRNGstream(seed=c(6,13,73,4,52,1))

# Parallel Code, e.g. with sfSapply()

# End Snowfall and MPI
sfStop()
mpi.exit()

When writing scripts for parallel MPI jobs, you should use the parallel package. It provides all the relevant functions of snow and is included in the standard installation of R. snowfall is very convenient to use, but unfortunately its communication with the individual worker processes is very slow, so that a large part of the speed gained through parallelization is wasted.

It is not advisable to use the multicore package. It does not recognize the number of assigned CPU cores and therefore starts far too many R processes. This increases memory consumption unnecessarily and slows down the computation, as the operating system constantly switches between the many R processes. The mc* functions of the parallel package do not have this problem.

All optional arguments can also be entered in a configuration file for the script (.submitrconf in the user's home directory). They are then no longer queried interactively.

R with BatchJobs

A cluster-specific template file and a personal configuration file are required to use the R package BatchJobs. The template file for our cluster is located under /opt/R/BatchJobs/dortmund_fk_statistik.tmpl.

There is also a sample configuration file in this directory that you can copy: cp /opt/R/BatchJobs/.BatchJobs.R ~/. Before use, the e-mail settings (in particular the recipient in mail.to) must be adjusted. Depending on the job, you will also have to adjust the runtime (walltime) and the memory consumption (memory) per job. The walltime is specified here in seconds.

cluster.functions = makeClusterFunctionsSLURM("/opt/R/BatchJobs/dortmund_fk_statistik.tmpl")

default.resources = list(
  walltime = 3600L,
  memory = 1024L,
  ntasks = 1L,
  ncpus = 1L
)

mail.start = "first+last"
mail.done = "first+last"
mail.error = "all"
mail.from = "<slurm@statistik.tu-dortmund.de>"
mail.to = "<login@statistik.tu-dortmund.de>"
mail.control = list(smtpServer="mail.statistik.tu-dortmund.de")

debug = FALSE

The complete documentation can be found on the BatchJobs project page. In particular, you should read through the page about using it in Dortmund.
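A minimal sketch of a BatchJobs workflow that uses this configuration (the registry name and the toy function are made up for illustration):

library("BatchJobs")

# Create a registry; it stores the job definitions, results and log files.
reg <- makeRegistry(id = "example", file.dir = "example-files")

# One job per parameter: apply a toy function to the values 1 to 10.
batchMap(reg, function(x) x^2, 1:10)

# Submit all jobs, overriding the default resources where necessary
# (walltime in seconds, memory in MB, as in the configuration file).
submitJobs(reg, resources = list(walltime = 600L, memory = 1024L))

# Wait for the jobs to finish and collect the results.
waitForJobs(reg)
loadResults(reg)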

Matlab in batch mode

The Matlab installation is available for members of the LSIckstadt and LSRahnenfuehrer groups.

The submitM script is used to start Matlab batch jobs. In comparison to interactive, it also asks for a mail address for notifications about job start, end and errors (this can be switched off with the option mail=false); the default is login@statistik.tu-dortmund.de. The queue option is not available, as only the matlab queue is relevant here. The default value for the memory is again only 1024 MB. There are also the options outdir, which defines the location of the job's log files (default is the current directory; if the specified directory does not exist, it is created), and autosubmit. The default autosubmit=true sends the job directly to the batch system; with autosubmit=false only a job file is generated, which you then submit manually with sbatch (practical if you want to set options that submitM does not support).

Example: Matlab batch job for 5 minutes, log file output in the directory testjob

skrey@shell:~$ submitM test.m outdir=testjob
Notify address             :
Using default address 'skrey@statistik.tu-dortmund.de'.
Walltime ([dd-hh:]mm[:ss]) : 5
Memory                     :
Using default memory of 1024 MB.
Chosen queue               : matlab
Working directory          : /home/skrey
PBS Job ID                 : Submitted batch job 9900
skrey@shell:~$

I have not made any entries here for the mail address and memory, so the default values have been used.

In the next example, I have switched off autosubmit and specified all options directly when calling:

skrey@shell:~$ submitM test.m autosubmit=false ncpus=2 email=skrey walltime=5 memory=256
Chosen queue               : matlab
Working directory          : /home/skrey
--- Generated batch job --------------------------------------------------------
--------------------------------------------------------------------------------
Run

  sbatch submitM.Nb2YWX

to submit the job to the batch system. You can then safely remove the job file:

  rm -f submitM.Nb2YWX

The script has now generated the job file submitM.Nb2YWX, which can be customized and then started with sbatch.

All optional arguments can also be entered in a configuration file for the script (.submitmconf in the user's home directory). They are then no longer queried interactively.