MPI cluster

This cluster is designed to run MPI jobs efficiently.

The cluster can be accessed by logging into mpi.sam.pitt.edu with your PittID.

$ ssh <pittid>@login0a.mpi.sam.pitt.edu

If access is required from Pitt Wireless or from off campus, refer to our documentation on VPN access.

Node configuration

  • There are 32 compute nodes in total with the following configuration.
MPI compute nodes
CPU         E5-2660 v3 (Haswell)
Speed       2.60 GHz
Cores       20
Memory      128 GB 2133 MHz
Disk        256 GB SSD
InfiniBand  56 Gb/s FDR
  • There are two login nodes that can be used for compilation.
MPI login nodes
CPU         E5-2620 v3 (Haswell)
Speed       2.40 GHz
Cores       12 (24 hyperthreads)
Memory      64 GB 1867 MHz
InfiniBand  56 Gb/s FDR

Operating System and Kernel

For performance reasons the following configuration has been chosen for compute nodes and login nodes.

OS      Red Hat Enterprise Linux 6.6
Kernel  2.6.32-358

Filesystems

All nodes in the MPI cluster mount the following file servers.

  • It is important to note that the $HOME directories are shared with other clusters and configuration files may not be compatible. Please check your .bashrc, .bash_profile and all other dotfiles if you encounter problems.
Filesystem              Mount
Home 0                  /home
Home 1                  /home1
Home 2                  /home2
Gscratch2               /gscratch2
MobyDisk                /mnt/mobydisk
Scratch (compute only)  /scratch
Misc 0                  /opt/sam/
  • Note that the contents of /opt/sam/ are not shared with any other cluster.

Compilers

The Intel C/C++ and Fortran compilers have been installed on the MPI cluster and are automatically placed in your PATH when you log in, as are the system GNU compilers. Newer GNU compilers are available as module environments.

Compiler       Version  Executable  AVX2 support
Intel C        15.0.3   icc         Yes
Intel C++      15.0.3   icpc        Yes
Intel Fortran  15.0.3   ifort       Yes
GNU C          4.9.2*   gcc         Yes
GNU C++        4.9.2*   g++         Yes
GNU Fortran    4.9.2*   gfortran    Yes
GNU C          4.4.7    gcc         No
GNU C++        4.4.7    g++         No
GNU Fortran    4.4.7    gfortran    No

See the man pages (man <executable>) for more information about compiler flags.

  • NOTE: The Intel compilers will return an error when using the -V flag to determine the version. This is expected behavior due to the way the compilers have been configured on this cluster.

  • GCC 4.9.2 is available through the Spack Application Environment. See below.

Instruction sets

The Haswell CPUs support AVX2 instructions. To have the best chance of utilizing these instructions, and to ensure that your executable and libraries are optimized, use the Intel compilers and add the -xHOST flag; because the login nodes have the same architecture as the compute nodes, code optimized for the login node is also optimized for the compute nodes.

The GCC 4.9.2 compiler also supports AVX2 with the -march=core-avx2 flag.
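As a brief illustration (prog.c is a placeholder source file), the flags are used as follows:

#Intel compilers: optimize for the host architecture, including AVX2
icc -O2 -xHOST prog.c -o prog

#GCC 4.9.2: enable AVX2 code generation
gcc -O2 -march=core-avx2 prog.c -o prog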

Intel MKL

The Intel Math Kernel Library provides widely used libraries for BLAS, Lapack, Fast Fourier Transform and Transcendental functions.

Version 11.2.3 has been installed on the cluster.

To link your application with MKL add one of the following options when using Intel Compilers.

Flag             Description
-mkl=sequential  Use only the serial implementation.
-mkl=parallel    Use the parallel implementation. The number of threads is controlled by MKL_NUM_THREADS.
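For example, a minimal sketch of building against the threaded MKL and controlling the thread count at run time (prog.c and the thread count are placeholders):

icc -O2 -xHOST -mkl=parallel prog.c -o prog
export MKL_NUM_THREADS=4
./prog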

The following demonstrates how to link with ScaLAPACK. The $MKLROOT environment variable is set when you log in.

> spack load openmpi%intel
> spack env openmpi%intel mpicc prog.c -I$MKLROOT/include -L$MKLROOT/lib/intel64 \
      -lmkl_scalapack_lp64 \
      -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core \
      -lmkl_blacs_intelmpi_lp64 -liomp5 -lpthread -lm
 
 
> spack load openmpi%intel
> spack env openmpi%intel mpif90 prog.f90 -I$MKLROOT/include -L$MKLROOT/lib/intel64 \
      -lmkl_scalapack_lp64 \
      -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core \
      -lmkl_blacs_intelmpi_lp64 -liomp5 -lpthread -lm

GNU compilers

If you are using the GNU compilers and wish to link with MKL include the following flags.

  • For sequential linking
-L$MKLROOT/lib/intel64 \
-Wl,--start-group      \
 -lmkl_gf_lp64         \
 -lmkl_sequential      \
 -lmkl_core            \
 -lm                   \
-Wl,--end-group        \
-lpthread
  • For parallel linking
-L$MKLROOT/lib/intel64 \
-Wl,--start-group      \
 -lmkl_gf_lp64         \
 -lmkl_gnu_thread      \
 -lmkl_core            \
 -lm                   \
-Wl,--end-group        \
-lgomp -lpthread
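Putting the sequential variant together, a full link command with gcc might look like the following sketch (prog.c is a placeholder):

gcc prog.c -o prog -I$MKLROOT/include \
  -L$MKLROOT/lib/intel64 \
  -Wl,--start-group      \
   -lmkl_gf_lp64         \
   -lmkl_sequential      \
   -lmkl_core            \
   -lm                   \
  -Wl,--end-group        \
  -lpthread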

Application environment

Spack will be used by cluster administrators to provide optimized builds of commonly used software. Applications compiled with Spack are available to users through the Lmod module environment commands. No modules are loaded by default when you log in.

  • If you wish to use Spack to install packages in your home directory, run git clone /opt/sam/spack to obtain your own copy of Spack. Be certain to source <path-to-my-spack>/share/spack/setup-env.sh afterwards, as in the sketch below.
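A minimal sketch, assuming ~/spack as the clone destination (any writable path works):

git clone /opt/sam/spack ~/spack
source ~/spack/share/spack/setup-env.sh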

Installed packages

Use the command spack find to list all installed packages. The packages are listed by the compiler used and by architecture. The architecture for the MPI Cluster is called haswell, which means that codes have been compiled to make the best possible use of the AVX2 instruction set.

defusco@login0a:~>spack find
==> 14 installed packages.
-- haswell / gcc@4.4.7 ------------------------------------------
binutils@2.25  gcc@4.9.2  gmp@6.0.0a  hwloc@1.9  libelf@0.8.13  mpc@1.0.2  mpfr@3.1.3
 
-- haswell / intel@15.0.3 ---------------------------------------
boost@1.58.0  cp2k@2.6.1  fftw@3.3.4  libint@1.1.4  libxc@2.2.2  mpich@3.0.4  openmpi@1.8.6

Module environment files have been created for each of these packages and can be easily loaded into your shell with spack load. Executables have been compiled with RPATH, so LD_LIBRARY_PATH or any other environment variable cannot change the paths of the libraries that have been linked. Multiple packages built with different compilers and dependencies can be loaded safely; the executables will always load the correct libraries.

In the example below I have loaded the Cp2k package compiled with the Intel compilers into my environment. The cp2k.popt executable is now in my PATH and all of the required libraries have been found.

defusco@login0a:~>spack load cp2k
defusco@login0a:~>which cp2k.popt
/opt/sam/spack/opt/spack/haswell/intel-15.0.3/cp2k-2.6.1-2rzb75e56oe5hvwz3wz2az4lgzlrlxwl/bin/cp2k.popt
defusco@login0a:~>ldd `which cp2k.popt`
        linux-vdso.so.1 =>  (0x00007fff264e9000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x000000371c600000)
        libmkl_rt.so => /opt/sam/intel/composer_xe_2015.3.187/mkl/lib/intel64/libmkl_rt.so (0x00007f2872ba9000)
        libmkl_scalapack_lp64.so => /opt/sam/intel/composer_xe_2015.3.187/mkl/lib/intel64/libmkl_scalapack_lp64.so (0x00007f28722bf000)
        libiomp5.so => /opt/sam/intel/composer_xe_2015.3.187/compiler/lib/intel64/libiomp5.so (0x00007f2871f81000)
        libmpi_usempif08.so.0 => /opt/sam/spack/opt/spack/haswell/intel-15.0.3/openmpi-1.8.6-ukxwbpmjjjc4ema2lwjqcukelqn5b64a/lib/libmpi_usempif08.so.0 (0x00007f2871cd4000)
        ...

To unload a package from your environment use module rm <package> or module purge.

See the installed packages page for details about each package supported by SaM.

Using compilers and libraries installed with Spack

To use libraries installed with Spack, load the corresponding modules into your environment and add the appropriate flags to your compilation and linking commands.


The following compilers have been installed with Spack and require special commands when used outside of Spack for development.

-- haswell / gcc@4.4.7 ------------------------------------------
gcc@4.9.2
-- haswell / intel@15.0.3 ---------------------------------------
openmpi@1.8.6

Two steps are required to use these compilers in your shell. First, load the compiler into your environment. Then call the compiler using spack env <compiler-package> <executable>. The second step ensures that environment variables unique to Spack have been set; if it is skipped the compiler will behave incorrectly.

defusco@login0a:~>spack load gcc
defusco@login0a:~>spack env gcc gcc file.c

The same is true for the MPI compiler wrappers. Notice that the %intel modifier is required to make sure that the wrapper uses the Intel compiler. If the %intel modifier is not included the wrapper falls back to the system GCC 4.4.7, as shown in the second example below.

defusco@login0a:~>spack load openmpi
defusco@login0a:~>spack env openmpi%intel mpicc -v
Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.3.187 Build 20150407
Copyright (C) 1985-2015 Intel Corporation.  All rights reserved.
 
defusco@login0a:~>spack env openmpi mpicc -v
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre --enable-libgcj-multifile --enable-java-maintainer-mode --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib --with-ppl --with-cloog --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.4.7 20120313 (Red Hat 4.4.7-11) (GCC)

To use the Spack compiler packages in a Makefile you need to specify that you want a login shell.

SHELL=/bin/bash -l
 
mpiModule = spack load openmpi
MPICC     = spack env openmpi%intel mpicc
 
pi: pi.cpp
        $(mpiModule) && $(MPICC) $< -DUSE_OPENMP -DUSE_MPI -openmp -o $@

Slurm Workload Manager

The MPI cluster uses Slurm for batch job queuing. Thirty compute nodes belong to the haswell partition, which is the default partition. The develop partition contains two nodes that have been reserved for development and debugging. The sinfo command provides an overview of the state of the nodes within the cluster.

defusco@login0a:~>sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
haswell*     up 6-00:00:00      7  alloc n[199-205]
haswell*     up 6-00:00:00     23   idle n[206-228]
develop      up    2:00:00      2   idle n[197-198]

Nodes in the alloc state are running a job. The asterisk next to the haswell partition means that it is the default partition for all jobs.

squeue shows the list of running and queued jobs.

defusco@login0a:~>squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              2432   haswell cp2k.sba  defusco PD       0:00     30 (Resources)
              2435   haswell cp2k.sba  defusco PD       0:00      5 (Priority)
              2434   haswell cp2k.sba  defusco  R       0:19      5 n[199-203]
              2431   haswell     bash  defusco  R       3:14      4 n[207-209]

The most common states for jobs in squeue are described below. See man squeue for more details.

Abbreviation  State       Description
CA            CANCELLED   Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
CD            COMPLETED   Job has terminated all processes on all nodes.
CG            COMPLETING  Job is in the process of completing. Some processes on some nodes may still be active.
F             FAILED      Job terminated with non-zero exit code or other failure condition.
PD            PENDING     Job is awaiting resource allocation.
R             RUNNING     Job currently has an allocation.
TO            TIMEOUT     Job terminated upon reaching its time limit.

In the above example jobs 2434 and 2431 are running, using 5 and 4 nodes respectively, and their node assignments are printed. Jobs 2432 and 2435 are both pending. Job 2432, which requested 30 nodes, is listed first because it has the highest priority; not all 30 nodes in the haswell partition are available, so the job is pending until Resources become available. Job 2435 requests 5 nodes and is pending because a job of higher Priority is also pending. It is likely that job 2435 will run before 2432 due to backfill. See man squeue for a complete description of the possible REASONs for pending jobs.

To see when all jobs are expected to start run squeue --start.

defusco@login0a:mpi>squeue --start
             JOBID PARTITION     NAME     USER ST          START_TIME  NODES SCHEDNODES           NODELIST(REASON)
              2708   haswell     bash  defusco PD                 N/A      2 (null)               (QOSUsageThreshold)
              2710   haswell     bash  defusco PD 2015-07-18T14:33:32     28 (null)               (Resources)
  • Note: not all jobs have a definite start time.

Slurm jobs

The three most important commands in Slurm are sbatch, srun and scancel. sbatch is used to submit a job script, such as the one below called host.sbatch, to the queue. srun defines a Job Step and launches the specified command on the compute nodes; it can be used to launch an MPI program or to replicate a serial task. Multiple Job Steps can be used within a single batch script. If srun is omitted the command will only run on the first allocated node. Jobs on the MPI cluster can only be assigned whole nodes and the user must request at least two. Jobs can be cancelled with scancel.

  • Note: do not use mpirun or mpiexec. srun is the only supported MPI launcher.
#!/bin/bash
 
#SBATCH --nodes=4
#SBATCH --time=5:00
 
srun hostname
  • NOTE: requests for walltime extensions will not be granted

This job is submitted with the command sbatch. By default, standard output is redirected to slurm-<jobid>.out.

defusco@login0a:~>sbatch host.sbatch
Submitted batch job 2437
defusco@login0a:~>cat slurm-2437.out
n197.mpi.sam.pitt.edu
n198.mpi.sam.pitt.edu
n199.mpi.sam.pitt.edu
n200.mpi.sam.pitt.edu
  • Note: By default the working directory of your job is the directory from which the batch script was submitted. See below for more information about job environments.

The sbatch arguments here are the minimal subset required to accurately specify a job on the MPI cluster. Please refer to man sbatch for more options.

sbatch argument   Description
-N, --nodes       Maximum number of nodes to be used by each Job Step.
--tasks-per-node  Maximum number of MPI ranks that each Job Step is allowed to run.
--cpus-per-task   Maximum number of CPUs required for each MPI rank. OMP_NUM_THREADS is NOT set by default.
-e, --error       File to redirect standard error.
-J, --job-name    The job name.
-t, --time        Total time required for the job. The format is days-hh:mm:ss.
--qos             Declare the Quality of Service to be used. The default is normal.
--partition       Select the partition to submit the job to. The default is haswell.

The above arguments can be provided in a batch script by preceding them with #SBATCH. Note that the shebang (#!) line must be present. The shebang line can call any shell or scripting language available on the cluster, for example #!/bin/bash, #!/bin/tcsh, #!/usr/bin/env python or #!/usr/bin/env perl.
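For instance, a minimal sketch of a batch script header combining several of the arguments above (the job name, error file, time and executable are placeholder values):

#!/bin/bash

#SBATCH --job-name=my_job
#SBATCH --error=my_job.err
#SBATCH --nodes=4
#SBATCH --tasks-per-node=20
#SBATCH --time=1-00:00:00
#SBATCH --qos=normal

srun ./my-executable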

srun also accepts the --nodes, --tasks-per-node and --cpus-per-task arguments, which allow each Job Step to change the resources it uses, but they cannot exceed those given to sbatch.
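As a sketch (the executable name is a placeholder), individual Job Steps may use a subset of the allocation:

#!/bin/bash

#SBATCH --nodes=4
#SBATCH --tasks-per-node=20
#SBATCH --time=30:00

#Step 1: the full allocation, 4 nodes with 20 ranks each
srun ./my-mpi-program

#Step 2: a smaller step limited to 2 of the allocated nodes
srun --nodes=2 --tasks-per-node=20 ./my-mpi-program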

Partitions

Two partitions have been configured on the MPI cluster. The haswell partition is for production jobs and the develop partition is intended for code development and debugging. develop should not be used to test job inputs on production codes.

Partition name  Number of nodes  Max walltime
develop         2                2:00:00
haswell         30               6-00:00:00

Quality of Service

All jobs submitted to Slurm must be assigned a Quality of Service (QoS). QoS levels define resource limitations. The default QoS is normal.

Quality of Service  Max Walltime  Min Nodes per Job  Max Nodes per Job  Priority factor
short               12:00:00      2                  30                 1.0
normal              3-00:00:00    2                  15                 0.75
long                6-00:00:00    2                  15                 0.5
  • Walltime is specified in days-hh:mm:ss

If your job does not meet these requirements it will not be accepted.

defusco@login0a:mpi>sbatch -N 30 --qos=normal job.sh
salloc: error: Job submit/allocate failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

Jobs that only request one node in the haswell partition will also not be accepted.

defusco@login0a:hybridPi (master *)>sbatch -N 1 --partition=haswell job.sh
sbatch: error: Job submit/allocate failed: Node count specification invalid

Development partition

To use the development partition both --partition=develop and --qos=develop must be specified in the job. The limit is 2 nodes per job and each user can only have one job running at a time.

#!/bin/bash
 
#SBATCH --partition=develop
#SBATCH --qos=develop
#SBATCH --nodes=1
#SBATCH --tasks-per-node=20
 
srun ...

If the --qos flag is omitted the job will be rejected with the following error.

salloc: error: Job submit/allocate failed: Job has invalid qos

Similarly, the develop qos cannot be used on the haswell partition.

Job priorities

Jobs on the MPI cluster are executed in order of priority. The priority function has four components: Age, FairShare, QoS and JobSize. Each component has a value between 0 and 1, and each is weighted separately in the total job priority. Only the Age factor increases as the job waits.

  • NOTE: The priority weights are intended to favor jobs that use more nodes for shorter wall times.
Priority factor  Description                                       Weight
Age              Total time queued. Factor reaches 1 at 14 days.   2000
QoS              Priority factor from the QoS levels above.        2000
JobSize          Factor approaches 1 as more nodes are requested.  4000
FairShare        FairShare factor described below.                 2000
  • The maximum priority value is 10000 for any job.

To view the priority components of all jobs run sprio. In this example three jobs have been submitted and are waiting to run.

root@head0:~>squeue --format "%.18i %.9P %.9q %.8j %.8u %.2t %.10M %.6D %R"
             JOBID PARTITION       QOS     NAME     USER ST       TIME  NODES NODELIST(REASON)
              2766   haswell     short     bash  defusco PD       0:00     30 (Resources)
              2765   haswell    normal     bash  defusco PD       0:00      2 (Resources)
              2764   haswell      long     bash  defusco PD       0:00      2 (Resources)
              2763   haswell     short     bash  defusco  R      26:42     30 n[199-228]

All three jobs were submitted by the same user at the same time so the FairShare and Age factors are identical. Job 2766 has a much larger JobSize factor because it requested 30 nodes.

root@login0a:~>sprio
          JOBID   PRIORITY        AGE  FAIRSHARE    JOBSIZE        QOS
           2764       1253          3          1        250       1000
           2765       1753          3          1        250       1500
           2766       5753          2          1       3750       2000

sprio shows the weighted priority factors and the total. To see the values of each priority factor before the weights are applied use the -n flag.

root@login0a:~>sprio -n
          JOBID PRIORITY   AGE        FAIRSHARE  JOBSIZE    QOS
           2764 0.00000029 0.0012946  0.0003503  0.0625000  0.5000000
           2765 0.00000040 0.0012897  0.0003503  0.0625000  0.7500000
           2766 0.00000133 0.0012393  0.0003503  0.9375000  1.0000000

Backfill

Even though jobs are expected to run in order of decreasing priority, backfill allows jobs with lower priority to fit into the gaps. A job will be allowed to run through backfill if its execution does not delay the start of higher-priority jobs. To use backfill effectively, users are encouraged to submit jobs with as short a walltime as possible.
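For example, a job that requests a walltime close to its expected runtime (the values here are illustrative) is far easier to backfill than one that requests the QoS maximum:

sbatch --qos=short --time=2:00:00 -N 4 job.sh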

Fairshare policies

FairShare has been enabled, which adjusts job priorities based on historical usage of the cluster. The FairShare priority factor is explained on the Slurm website.

To see the current FairShare priority factor run sshare. Several options are available, please refer to man sshare for more details.

The FairShare factors for all users are listed with sshare -a.

root@head0:~>sshare -a
             Account       User Raw Shares Norm Shares   Raw Usage Effectv Usage  FairShare 
-------------------- ---------- ---------- ----------- ----------- ------------- ---------- 
root                                          1.000000    48208745      1.000000   0.500000 
 root                      root          1    0.250000      113558      0.002356   0.993490 
 cssd                                    1    0.250000       22554      0.000470   0.998699 
  cssd                     jar7          1    0.125000           0      0.000235   0.998699 
  cssd                   jaw171          1    0.125000       22554      0.000469   0.997404 
 kjordan                                 1    0.250000    13320473      0.277344   0.463495 
  kjordan                 keg56          1    0.250000    13320473      0.277344   0.463495 
 sam                                     1    0.250000    34752158      0.719822   0.135909 
  sam                       amf          1    0.062500           0      0.179956   0.135909 
  sam                   defusco          1    0.062500    34752158      0.720607   0.000338 
  sam                   kimwong          1    0.062500           0      0.179956   0.135909 
  sam                      php8          1    0.062500           0      0.179956   0.135909

On the MPI cluster all groups are given equal shares. In the above output there are four groups, including root, and their Norm Shares are each 0.25. All users in a group share that normalized value. As users run jobs on the cluster, the raw processor-seconds are recorded in Raw Usage. The Effective Usage is a decaying function of Raw Usage with a half-life of 14 days: if a user has not run a job in 14 days their Effective Usage will drop by one half. The FairShare column is the factor provided to the priority function above and is a function of the user's Effective Usage and Norm Shares. Users with values close to 1 have a higher priority because they have not used the cluster recently. Values close to 0.5 mean that the user has met the Norm Shares target utilization of the cluster. Values close to 0 mean that the user has exceeded the Norm Shares target utilization and will have a lower priority for newly submitted jobs.

Since all users within a group share the group's Norm Shares, users of the cssd group above that have not used the cluster have a larger FairShare factor than users of the sam group.

Process affinity

On the MPI Cluster Slurm has been configured with task/affinity to allow automatic task affinity and binding. Slurm has several options for process affinity and by default will bind each process to the core on which it was placed. By default MPI processes are assigned to sockets in a cyclic fashion, so the first two processes on a node will be placed on different sockets. Add --cpu_bind=verbose to your srun job step to print affinity information.

  • While the automatic binding should work in many cases, it never hurts to force binding to cores with --cpu_bind=cores.

Users are encouraged to experiment with affinity and placement options in srun. See the Multi-Core documentation for more details.
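For example, a job step that forces core binding and reports the resulting affinity (my-executable is a placeholder):

srun --cpu_bind=verbose,cores ./my-executable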

For this section I will be demonstrating basic affinity concepts with the hybridPi project. The executables are at /home/sam/training/mpi/hybridPi.

Running MPI programs

#!/bin/bash

#SBATCH --nodes=10
#SBATCH --tasks-per-node=2

srun /home/sam/training/mpi/hybridPi/pi-mpi
  • --tasks-per-node cannot exceed 20.

In this example 2 MPI ranks per node have been launched. One process goes to core 0 (socket 0) and the second goes to either core 10 or core 11 (socket 1). As more MPI ranks (--tasks-per-node) are requested they will alternate between cores on socket 0 and socket 1.

defusco@login0a:~>cat slurm-2600.out
20 MPI processes
local value is 0.1998335829 on core  0 of n197.mpi.sam.pitt.edu ( 0.9677040577 s)
local value is 0.1988410271 on core 10 of n197.mpi.sam.pitt.edu ( 0.9604279995 s)
local value is 0.1968851805 on core  0 of n198.mpi.sam.pitt.edu ( 0.9539489746 s)
local value is 0.1940224490 on core 10 of n198.mpi.sam.pitt.edu ( 0.9507009983 s)
local value is 0.1903324131 on core  0 of n199.mpi.sam.pitt.edu ( 0.9584050179 s)
local value is 0.1859125254 on core 10 of n199.mpi.sam.pitt.edu ( 0.9440989494 s)
local value is 0.1808720996 on core  0 of n200.mpi.sam.pitt.edu ( 0.9467079639 s)
local value is 0.1753262309 on core 10 of n200.mpi.sam.pitt.edu ( 0.9456501007 s)
local value is 0.1693901961 on core  0 of n201.mpi.sam.pitt.edu ( 0.9552078247 s)
local value is 0.1631747315 on core 11 of n201.mpi.sam.pitt.edu ( 0.9380021095 s)
local value is 0.1567824077 on core  0 of n202.mpi.sam.pitt.edu ( 0.9471399784 s)
local value is 0.1503051574 on core 10 of n202.mpi.sam.pitt.edu ( 0.9291751385 s)
local value is 0.1438228813 on core  0 of n203.mpi.sam.pitt.edu ( 0.9552040100 s)
local value is 0.1374029752 on core 11 of n203.mpi.sam.pitt.edu ( 0.9365890026 s)
local value is 0.1311005776 on core  0 of n204.mpi.sam.pitt.edu ( 0.9551730156 s)
local value is 0.1249593337 on core 10 of n204.mpi.sam.pitt.edu ( 0.9315221310 s)
local value is 0.1190124881 on core  0 of n205.mpi.sam.pitt.edu ( 0.9677009583 s)
local value is 0.1132841502 on core 10 of n205.mpi.sam.pitt.edu ( 0.9499299526 s)
local value is 0.1077906123 on core  0 of n206.mpi.sam.pitt.edu ( 0.9657518864 s)
local value is 0.1025416341 on core 10 of n206.mpi.sam.pitt.edu ( 0.9558229446 s)
 
Global Pi is   3.1415926533
 
3.412212 seconds

Running MPI+OpenMP programs

Slurm does not set OMP_NUM_THREADS automatically; the default behavior is to launch 20 OpenMP threads no matter how many MPI ranks have been launched. For optimal performance it is highly recommended that --cpus-per-task be equal to your desired OMP_NUM_THREADS.

#!/bin/bash
 
#SBATCH --nodes=4
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --time=5:00
 
export OMP_NUM_THREADS=1
srun /home/sam/training/mpi/hybridPi/pi
 
echo
echo
echo
 
export OMP_NUM_THREADS=8
srun /home/sam/training/mpi/hybridPi/pi
  • The product of --tasks-per-node and --cpus-per-task cannot exceed 20.

In this example the MPI+OpenMP program was run twice with varying numbers of OpenMP threads. For both Job Steps one MPI rank per node has been launched and may be started on either socket 0 or socket 1. In the second Job Step 8 OpenMP threads have been spawned, which decreases the runtime of each of the local value computations. OpenMP threads are spawned on the same socket as the MPI rank and bound to the core on which they start.

defusco@login0a:~>cat slurm-2592.out
4 MPI processes
1 openMP threads
local value is 0.9799146525 on core 19 of n197.mpi.sam.pitt.edu ( 4.3307490349 s)
local value is 0.8746757834 on core 19 of n198.mpi.sam.pitt.edu ( 4.3204240799 s)
local value is 0.7194139991 on core 19 of n199.mpi.sam.pitt.edu ( 4.3240640163 s)
local value is 0.5675882183 on core 19 of n200.mpi.sam.pitt.edu ( 4.3306980133 s)
 
Global Pi is   3.1415926533
4.660805 seconds
 
 
 
4 MPI processes
8 openMP threads
local value is 0.9799146525 on core 13 of n197.mpi.sam.pitt.edu ( 0.6780118942 s)
local value is 0.8746757834 on core 3 of n198.mpi.sam.pitt.edu ( 0.6207699776 s)
local value is 0.7194139991 on core 2 of n199.mpi.sam.pitt.edu ( 0.6578431129 s)
local value is 0.5675882183 on core 18 of n200.mpi.sam.pitt.edu ( 0.6780018806 s)
 
Global Pi is   3.1415926533
0.961753 seconds

Local Scratch directory

In many cases a parallel scratch filesystem is preferable for MPI programs, but some programs can benefit from, or require, a fast local scratch filesystem. Each node in the MPI Cluster has a single scratch disk for temporary data generated by the job. Local scratch directories are created on each node in the following location at the start of a job or allocation.

/scratch/slurm-$SLURM_JOB_ID

The $LOCAL environment variable is then set in the job's environment to the above scratch directory.

  • The $LOCAL directories are removed from each node at the completion of the job

To copy files to the $LOCAL scratch disk on the master compute node just use cp or rsync. Remember, the initial working directory for the job is the directory from which the job was submitted. To make srun run the job step from the $LOCAL scratch directory, add --chdir.

#!/bin/bash
 
#SBATCH -N 2
 
cp file $LOCAL
rsync -a dir $LOCAL
 
#The file 'out' is written to the directory from which the job was submitted
srun --chdir=$LOCAL my-executable > out
 
#copy files back from the master with cp
cp $LOCAL/out2 out2

To copy a single file to all compute nodes allocated to the job use sbcast. See man sbcast for more options.

#!/bin/bash
 
#SBATCH -N 2
 
sbcast file $LOCAL/file
 
#The file 'out' is written to the directory from which the job was submitted
srun --chdir=$LOCAL my-executable > out
 
#copy files back from the master with cp
cp $LOCAL/out2 out2
  • The destination filename must appear in the second argument to sbcast.

To copy multiple files to all of the compute nodes allocated to the job, a more complicated rsync script is required. The script below ensures that only one rsync process per node is spawned by srun.

#!/bin/bash
 
#SBATCH -N 2
 
srun --tasks-per-node=1 -c1 -n $SLURM_NNODES sh -c "rsync -a dir $LOCAL/"
 
#The file 'out' is written to the directory from which the job was submitted
srun --chdir=$LOCAL my-executable > out
 
#copy files back from the master with cp
cp $LOCAL/out2 out2

To copy files back from the compute nodes use sgather. See man sgather for more information.
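A minimal sketch, following the naming in the examples above (sgather typically appends the source node's hostname to the destination file; check man sgather for the exact behavior):

#gather $LOCAL/out2 from every allocated node
sgather $LOCAL/out2 out2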

If more complicated data movement is needed, consider writing custom prolog and epilog scripts. The paths to the scripts can be provided to srun with the --prolog and --epilog arguments. See man srun for more information. SPANK plugins can also be utilized; see man spank.

Job environment

By default Slurm will not execute a login shell; it will not source ~/.bashrc, ~/.bash_profile, /etc/bashrc or /etc/profile. Instead, it migrates your current environment from the submission host.

In this example the my_bashrc variable is set to yes in my ~/.bashrc, but I set it to no just before running sbatch and it remains no on the compute node.

>cat env.sh
#!/bin/bash
 
#SBATCH --nodes=1
#SBATCH --time=10:00
#SBATCH --output=env.out
#SBATCH --error=env.err
 
hostname
env
 
>tail ~/.bashrc
export my_bashrc=yes
 
>export my_bashrc=no
>sbatch env.sh
Submitted batch job 700
 
>grep my_bashrc env.out
my_bashrc=no

The user can force the creation of a login shell as follows, but the behavior then changes: /etc/profile, /etc/bashrc, ~/.bash_profile and ~/.bashrc are now sourced, and values exported in your working environment before submission are overridden by them.

>cat env.sh
#!/bin/bash --login
 
#SBATCH --nodes=1
#SBATCH --time=10:00
#SBATCH --output=env.out
#SBATCH --error=env.err
 
hostname
env
 
>tail ~/.bash_profile
export my_bashprofile=yes
 
>export my_bashrc=no
>export my_bashprofile=no
>sbatch env.sh
Submitted batch job 702
 
>grep my env.out
my_bashprofile=yes
my_bashrc=yes