Using Slurm on Janus

Janus Overview

Logging in

ssh bracken@login.rc.colorado.edu

Your password is PPPPDDDDDD (PPPP = your PIN, DDDDDD = the one-time password from your OTP device).
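For example, with a hypothetical PIN of 1234 and a one-time password of 789012 (both values made up), the login looks like:

ssh bracken@login.rc.colorado.edu
# At the "Password:" prompt, type the PIN immediately followed by the OTP:
# 1234 + 789012  ->  1234789012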

File systems

https://www.rc.colorado.edu/services/storage/filesystemstorage

  1. Home directory (~) - limited storage but snapshotted regularly, so you can use this for code
  2. Projects directory (/projects/$USER) - 256 GB storage per user, snapshotted regularly, use this for storing data
  3. Lustre (/lustre/janus_scratch/$USER) - Intended for parallel IO from jobs, not for long term storage. Has some usage restrictions.
  4. Peta Library - Long term storage
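A rough sketch of how these are typically used (the directory names below are just examples):

# code lives in the home directory (snapshotted)
mkdir -p ~/code

# data sets go under the projects directory (snapshotted, 256 GB)
mkdir -p /projects/$USER/data

# scratch is for job input/output only, not long-term storage
mkdir -p /lustre/janus_scratch/$USER/run1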

Using Slurm

Good intro

Add the following line to ~/.my.bash_profile, or type it at the prompt:

module load slurm
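For example, to append the line to ~/.my.bash_profile and confirm the module is loaded (module list is a standard Environment Modules command):

echo 'module load slurm' >> ~/.my.bash_profile
module load slurm
module list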

Make a work directory

mkdir testing
cd testing

Test job script. Copy all of the following into a file named testjob_submit.sh (one way to do this from the command prompt is sketched after the script):

#!/bin/bash
# Lines starting with #SBATCH are treated by bash as comments, but interpreted by slurm
# as arguments.  

# Set the name of the job
#SBATCH -J test_job

# Set a walltime for the job. The time format is HH:MM:SS - in this case we run for 5 minutes.
#SBATCH --time=0:05:00

# Select one node
#SBATCH -N 1

# Select one task per node (similar to one processor per node)
#SBATCH --ntasks-per-node 1

# Set output file name with job number
#SBATCH -o testjob-%j.out
# Use the janus-debug QOS
#SBATCH --qos=janus-debug

# The following commands will be executed when this script is run.
echo The job has begun
echo Wait one minute...
sleep 60
echo Wait a second minute...
sleep 60
echo Wait a third minute...
sleep 60
echo Enough waiting. Job completed.

# End of example job shell script
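One way to create the job script directly from the command prompt is a quoted heredoc; this is only a sketch, and the body should be the full script shown above:

cat > testjob_submit.sh <<'EOF'
#!/bin/bash
#SBATCH -J test_job
# ... the rest of the job script above goes here ...
EOF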

Submit the job:

sbatch testjob_submit.sh
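sbatch replies with the assigned job ID; the output looks something like this (the number will differ):

Submitted batch job 183359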

Check on the job:

squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            183359     janus get_clus  bracken  R       0:04      2 node[0433-0434]
scontrol show job <jobid>
   UserId=bracken(1000397) GroupId=brackenpgrp(1000397)
   Priority=602 Nice=0 Account=ucb00000307 QOS=janus-debug
   JobState=COMPLETING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:10 TimeLimit=00:01:00 TimeMin=N/A
   SubmitTime=2014-09-11T13:16:54 EligibleTime=2014-09-11T13:16:54
   StartTime=2014-09-11T13:16:55 EndTime=2014-09-11T13:17:05
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=janus AllocNode:Sid=janus-compile3:17163
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node[0433-0434]
   BatchHost=node0433
   NumNodes=2 NumCPUs=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=12:0:*:* CoreSpec=0
   MinCPUsNode=12 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/bracken/testing/job.sh
   WorkDir=/home/bracken/testing
   StdErr=/home/bracken/testing/output-testjob-183359.out
   StdIn=/dev/null
   StdOut=/home/bracken/testing/output-testjob-183359.out
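When the job finishes, its output is in the file named by the -o directive. A queued or running job can be cancelled with scancel:

cat testjob-<jobid>.out     # stdout/stderr from the test job above
scancel <jobid>             # cancel a queued or running job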

QOSs (Qualities of Service, a.k.a. Queues)

The QOSs available on Research Computing resources are the following:
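One way to see the QOSs defined on the cluster, along with their limits, is to query the Slurm accounting database (the output columns vary with the Slurm version):

sacctmgr show qos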

Memory Limits
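Memory can be requested per node or per CPU in the job script with standard Slurm directives; a minimal sketch (the values below are examples only, not Janus limits):

#SBATCH --mem=20000           # total memory per node, in MB
#SBATCH --mem-per-cpu=1700    # or: memory per allocated CPU, in MB (use one or the other)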

Using parallel R on Janus

First, follow this tutorial for setting up Rmpi.

Test job script

#!/bin/bash

## job.sh example testing Rmpi
## you should see output that has 2 different node names 

# Set the name of the job
#SBATCH -J get_cluster_names

# Set a walltime for the job. The time format is HH:MM:SS
#SBATCH --time=00:00:30

# Select nodes
#SBATCH -N 2

# Select twelve tasks per node (one task per core on a Janus node)
#SBATCH --ntasks-per-node 12

# Set output file name with job number
#SBATCH -o output-testjob-%j.out

# Use the janus-debug QOS for this short test (switch to the normal QOS for production runs)
#SBATCH --qos=janus-debug

#SBATCH --mail-type=ALL #Type of email notification- BEGIN,END,FAIL,ALL 
#SBATCH --mail-user=cameron.bracken@colorado.edu #Email to which notifications will be sent

# Total number of MPI tasks = nodes * tasks per node
nodes=2
ppn=12
np=$(($nodes*$ppn))

# Get OpenMPI in our PATH.  openmpi_ipath and openmpi_ib
# can also be used if running over those interconnects. 
module load openmpi/openmpi-1.6.4_gcc-4.8.1_ib

`which mpirun` -n 1 `which R` --vanilla --slave <<EOF

library(parallel)
cl <- makeCluster($np,type="MPI")
clusterCall(cl, function() Sys.info()[c("nodename","machine")])
clusterCall(cl, runif, $np)
stopCluster(cl)
mpi.quit()
EOF
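Submit and monitor the job the same way as the first example, assuming the script is saved as job.sh (the path shown in the scontrol output above):

sbatch job.sh
squeue -u $USER
cat output-testjob-<jobid>.out   # should show two different node names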

Other resources: