Using local scratch disk for jobs

What and Why

The compute nodes on Frank access the home filesystem via NFS over the limited-bandwidth gigabit ethernet network. When jobs on the compute nodes read or write data under $HOME or one of its subdirectories (usually because their working directory is located there), every I/O operation takes a chunk of this limited bandwidth. Such I/O is not only very slow and unreliable; when used heavily it can also slow down or bring down other critical resources (think of the I-376 tunnels during rush hour, with four lanes of traffic squeezing into the two tunnel lanes).

For this reason, it is recommended that you use the alternative scratch space for I/O. This scratch space is a filesystem (usually a disk with about a terabyte of capacity) that is local to each compute node.

The downside is that execution on local disk is not as straightforward as running under $HOME and requires some care. At the very least, you'll need to copy the input files from $HOME to the local disk, cd there, run the job, and copy the output files back.

Job scratch directories

The local disk on each compute node is mounted as /scratch. Each node's /scratch lives on a different physical disk, and the nodes cannot see one another's disks. For example, files written on node n0 will not be visible on node n1. The same holds for the login node, i.e. your files on the local drives will not be accessible once the job finishes. There is a grace period of a few days past the job end time, after which scratch files are flushed. If you have an urgent need to recover these files, submit a ticket within a day or two of job completion.

Once a job is started in the queue, a scratch directory is created as a subdirectory of the local filesystem on the compute node(s) on which the job is running. A separate scratch directory is created for each job, which is writable only by the job owner. This is to prevent accidental writes or deletes by other jobs or users.

The path to the scratch subdirectory is a site-specific configuration value and is subject to change in the future [1]. Whatever the value is, it is stored in the $SCRATCH environment variable, which is set correctly by the queue system. In other words, it is strongly recommended that you refer to the scratch path by the environment variable $SCRATCH (or, synonymously, $LOCAL) and not by its hard-coded path.
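
A quick sanity check inside a job script might look like the following (a minimal sketch; the path and permissions shown in the comments are illustrative):

echo $SCRATCH     # e.g. /scratch/<jobid>; the exact layout is site specific
ls -ld $SCRATCH   # should show drwx------ <you> ...: writable only by the job owner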

For example, consider a job which launches an executable called run.x that reads a single input file, input.dat, and produces a single output file, output.dat, which we want to save at the end of the job. The job script, ignoring the PBS directives, might look like:

cp input.dat $LOCAL              # copy the input file to local scratch
cd $LOCAL                        # make local scratch the working directory
$HOME/bin/run.x                  # run; all I/O now happens on the local disk
cp output.dat $PBS_O_WORKDIR     # copy the output back before the job ends
# optionally, clean up the scratch files:
rm -rf $LOCAL

where $PBS_O_WORKDIR contains the directory from which the job was submitted. Note that the executable in this example lives somewhere under $HOME; it doesn't need to be on $LOCAL. Remember, the compute nodes can access any and all files under $HOME (this job script ultimately executes on the compute node). That is how input.dat is accessed by a simple cp command, the executable is read and loaded on the compute node, and output.dat is copied back at the end. Optionally, the job cleans up after itself by removing all its scratch files, freeing space for subsequent runs.

Qsub flags

To request the amount of scratch space your job requires, use the following flag.

#PBS -l ddisk=<size>gb

where <size> is the amount of disk space your job will need. The queue system will then find a node which has at least this much space available before running the job. Please read the queue configuration page for the default disk space sizes.
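
For instance, the preamble of a job script requesting 50 GB of local scratch might look like the following (a sketch; the values are arbitrary, and the nodes/ppn/walltime resources follow the usual PBS/Torque syntax):

#PBS -l nodes=1:ppn=4
#PBS -l walltime=04:00:00
#PBS -l ddisk=50gb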

Automating and Securing Postrun Activity with a trap

Sometimes a job dies unexpectedly, usually by running out of walltime during execution, and never reaches the post-run activity (copy, cleanup, etc.) at the end. In this scenario, if the job was writing valuable output to $SCRATCH, the copy and cleanup are never performed. To prevent this from happening, you can set a bash-shell trap.

  • Only use the trap for emergencies, such as walltime exceeded! Torque will allow 5 seconds for the trap to complete before killing the process if the walltime has been exceeded. Traps should not be used for normal execution.

  • Traps will still be executed during normal termination and will still be limited to 5 seconds.

Continuing the previous example:

cp input.dat $LOCAL
cd $LOCAL
 
# The trap should be set BEFORE the execution begins.
# It is sprung when the script EXITs for any reason,
# error or not: "sprung" meaning the command given as
# the first argument to trap is executed on EXIT.
trap "cp output.dat $PBS_O_WORKDIR" EXIT
 
# Run as usual
$HOME/bin/run.x
 
# No need to repeat cp at the end, trap will be sprung

You can set only one trap at a time for a given signal, but a trap can do more through a function. For example, let's include the scratch cleanup:

cp input.dat $LOCAL
cd $LOCAL
 
 
# post-run activity
run_at_exit () {
  set -x                         # trace these cleanup commands in the job log
  cp output.dat $PBS_O_WORKDIR
  rm -rf $SCRATCH
}
 
# set trap, this time to call the function
trap run_at_exit EXIT
 
# Run as usual
$HOME/bin/run.x
 
# No need to call the function explicitly, trap will be sprung 
# even at successful termination


Monitoring output while the job is running

As mentioned above, files on scratch cannot be accessed outside of a running job; moreover, in the above example any output that run.x writes to the terminal will only become available at the very end, in the job output file.

In order to follow the progress while the job is running, you may want to redirect the terminal output to a location under $HOME. Further enhancing our running example:

cp input.dat $LOCAL
cd $LOCAL
trap "cp output.dat $PBS_O_WORKDIR" EXIT
 
# Redirect output to the submission directory
$HOME/bin/run.x > $PBS_O_WORKDIR/log

If your executable doesn't write useful progress information to the terminal but reports progress in files instead, there is no equally easy way to follow along; one possible workaround is sketched below. If you experience a situation like this, please don't hesitate to ask for assistance in the Forum areas.
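
The sketch assumes the executable writes its progress to a file named progress.log (a hypothetical name; substitute whatever file your code actually produces). A background loop copies that file back to the submission directory every five minutes:

cp input.dat $LOCAL
cd $LOCAL
trap "cp output.dat $PBS_O_WORKDIR" EXIT
 
# copy the (hypothetical) progress file home every 5 minutes, in the background
while true; do
  cp progress.log $PBS_O_WORKDIR 2>/dev/null
  sleep 300
done &
MONITOR_PID=$!
 
# Run as usual
$HOME/bin/run.x
 
kill $MONITOR_PID   # stop the monitor once the run finishes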

Scratch policies

The scratch space is intended for writing and reading temporary files during the course of a job. Shortly after a job terminates, all these temporary files are deleted (subject to the grace period mentioned above).

It is therefore important that you copy any valuable output files from scratch to a home directory at the very end of a job. You don't need to copy temporary files (especially if they tend to be big; they're best left behind and forgotten!).

Accessing scratch directories

SSH access to the compute nodes is provided for the retrieval of scratch files. In order to SSH directly into a compute node, you will first need to set up SSH keys.

If you have never set up SSH keys on Frank, follow these steps. When asked to enter a passphrase, you may enter one or leave it blank.

user@login0a:~>ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/group/user/.ssh/id_rsa): 
Created directory '/home/group/user/.ssh'.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/group/user/.ssh/id_rsa.
Your public key has been saved in /home/group/user/.ssh/id_rsa.pub.
The key fingerprint is:
c7:71:30:e4:2b:82:43:ab:7d:c9:5b:27:5e:e9:d9:be user@login0a.frank.sam.pitt.edu
The key's randomart image is:
+--[ RSA 2048]----+
|       .+        |
|       . o       |
|  .     o .      |
| . o     =       |
|  + o . S o      |
| o o o . o       |
|. . + o +        |
|   . + = o       |
|    . . o.E.     |
+-----------------+
user@login0a:~>cat .ssh/id_rsa.pub >> .ssh/authorized_keys

You will now be able to log in directly to a compute node.
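
For example, to inspect the scratch files of a running job you could look up which node it is executing on and then SSH there (a sketch; 123456 and n123 are placeholder job and node names):

qstat -f 123456 | grep exec_host          # reports e.g. exec_host = n123/0
ssh n123
ls /scratch/123456.clusman0.localdomain   # the job's scratch directory (see footnote 1)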


  1. At the time of writing (Jan 2014), the job scratch path is /scratch/$PBS_JOBID; e.g., for job 123456 it is /scratch/123456.clusman0.localdomain.