JURECA
[Figure 6]

q_cpuquota

Job accounting is done via a central database at JSC, and the information about JURECA jobs is completed once per day, around midnight, based on information obtained from the Slurm accounting database. Users get information about their current quota status or the usage of single jobs by using the command q_cpuquota.

Command format:
q_cpuquota [OPTIONS]

Some useful options of the q_cpuquota command are:

Option          Description
-?              Print usage information
-h <cluster>    Show information for the specified system, e.g. JURECA
-j <jobID>      Show accounting information for the specified job
-t <time>       Show information about jobs in the specified time period
-d <number>     Show information about jobs of the last specified days

4 Batch Jobs

Users submit batch applications (usually bash scripts) using the sbatch command. Inside the job scripts, #SBATCH directives must be used in order to define the sbatch parameters. The script is executed on the first compute node in the allocation. To execute parallel MPI tasks users call srun within their script. With srun users can also create job steps; a job step can allocate the whole or a subset of the resources already allocated by sbatch. With these commands Slurm offers a mechanism to al
2. usr local software jureca TC FullToolchains gpsolf 2015 06 intel 2015 07 D intel para 2015 06 intel para 2015 07 D usr local software jureca TC Compilers MPI gpsmpi 2015 06 iimpi 2015 07 D ipsmpi 2015 06 mt ipsmpi 2015 07 mt usr local software jureca TC Compilers GCC 5 1 0 D icc 2015 3 187 D iccifort 2015 3 187 D Load a toolchain without version the default is used module load intel para List all loaded modules from the current toolchain module list Currently Loaded Modules 1 z1ib 1 2 8 7 GCC bare 4 9 3 T3 PlOGE 2015 5 cs 2 binutils 2 25 8 popt 1 16 14 psmpi 5 1 4 1 3 ncurses 5 9 9 pscom 5 0 45 1 15 imk1 11 2 3 187 4 libatomic_ops 7 4 2 10 ipsmpi 2015 07 16 intel para 2015 07 5 gc 7 4 2 11 iccifort 2015 3 187 GCC bare 4 9 3 6 util wrapper 1 1 12 icc 2015 3 187 GCC bare 4 9 3 List all application modules available for the current toolchain module avail usr local software jureca Stage2 modules all Toolchain intel para 2015 07 ABINIT 7 10 2 ASE 3 8 1 3440 Python 2 7 10 Autoconf 2 69 Automake 1 13 Automake 1 15 D Bison 3 0 2 Boost 1 58 0 Python 2 7 10 BuildEnv defaults CDO 1 6 9 CMake 3 1 3 Get information about a package module spider Boost or module spider Boost 1 56 0 Load an application module module load Boost 1 56
Hybrid jobs with SMT

Example 9: This example shows a hybrid application that will start 4 tasks per node on 3 allocated nodes, with 12 threads per task, using in total 48 cores per node (SMT enabled).

#!/bin/bash
#SBATCH --nodes=3
#SBATCH --ntasks=12
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:20:00

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./hybrid-prog

Example 10: This example presents a hybrid application which will execute hybrid-prog on 3 nodes, using 2 MPI tasks per node and 24 OpenMP threads per task (48 CPUs per node).

#!/bin/bash
#SBATCH --nodes=3
#SBATCH --ntasks=6
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=24
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:20:00
#SBATCH --partition=batch

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./hybrid-prog

Intel MPI jobs

In order to run Intel MPI jobs users can use srun. The mpirun command is currently not supported. That means that, for now, users cannot export and use the environment variables from Intel MPI, because srun does not work with them. Users will be informed when mpirun is supported.

4.2 Job steps

In a previous chapter we described job steps as small allocations or jobs inside the current job. Each call of srun will create a new job step. It is up to the users to decide
Job Submission Filter

During job submission a submission filter is configured to take certain actions, depending on the partition and the resources that are requested. Here is the list of the configured rules for this filter:

• When a job is submitted to the gpus partition, deny submission if no gpu GRES was requested.
• When a job is submitted to the mem512 partition, deny submission if no mem512 GRES was requested.

By default, when a job is submitted and has no memXXX GRES requested, mem128 is added to the GRES list of the job. This helps us with our accounting statistics.

The GRES can be defined during submission with the option --gres=<comma separated list of gres> of the commands sbatch and salloc. In chapter 3.2 examples will be given on how to use the GRES option during submission.

2.7 Priorities

Slurm schedules the jobs according to their priorities, which means that the jobs with the highest priorities will be executed next. With the backfilling algorithm, though, jobs (usually small ones) with lower priorities can be scheduled earlier if they fit and can run on the available resources before the next high-priority job is scheduled to start. Slurm has a very simple and well defined priority mechanism that allows us to define exactly the batch model we want. In the following we present how Slurm calculates the priority for each job:

Job priority = PriorityWeightAge * age_factor + PriorityWeightFairshare * fa
5. 36 Examples Show job information in long format for default period starting from 00 00 today until now sacct 1 o u i i l i i o o Show job only information without jobsteps starting from the defined date until now sacct S 2014 10 01T07 33 00 X Show job and jobstep information printing only the specified fields sacct S 2014 10 01 format jobid elapsed nnodes state sacctmgr The sacctmgr command is mainly used by the administrators to view or modify accounting information and data in the accounting database This command provides also an interface with limited permissions to the users for some querying actions Command format sacctmgr OPTIONS COMMAND Some of the most useful commands for sacctmgr are Command Description show list cluster Show cluster information show association where user lt name gt List all visible associations or the ones for the specified user show event where node lt node_name gt List all events for all or for the specified nodes show qos where name lt gos_name gt List all or the specified QoS show user Show some user information like privileges etc show and list commands are the same for sacctmgr Examples Show cluster information sacctmgr show cluster Show the association of user1 sacctmgr show association where user userl Print all QoSs sacctmgr show qos Show the privileges of my user sacct
intel/2015.07

7.3 Compilation

MPI program example (file: mpi.c):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;

    /* Initialize the MPI environment */
    MPI_Init(&argc, &argv);
    /* Get the number of processes */
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    /* Get the rank of the process */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Get the name of the processor */
    MPI_Get_processor_name(processor_name, &name_len);
    /* Print out */
    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, rank, size);
    /* Finalize the MPI environment */
    MPI_Finalize();
    return 0;
}

Hybrid program example (file: hybrid.c):

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

#define _NUM_THREADS 16

int main(int argc, char *argv[])
{
    int rank, size, count, total;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;

    omp_set_num_threads(_NUM_THREADS);

    /* Initialize the MPI environment */
    MPI_Init(&argc, &argv);
    /* Get the number of processes */
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    /* Get the rank of the process */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Get the name of the processor */
    MPI_Get_processor_name(processor_name, &name_len);

    count = 0;
    #pragma omp parallel reduction(+:count)
Define the mail address to receive mail notifications.

--mail-type=<type>                                  Define when to send mail notifications. Valid options: BEGIN, END, FAIL, REQUEUE or ALL.
-N <minnodes[-maxnodes]>, --nodes=<minnodes[-maxnodes]>   Number of compute nodes used by the job. Can be omitted if --ntasks and --ntasks-per-node are given.
-n <number>, --ntasks=<number>                      Number of tasks (MPI processes). Can be omitted if --nodes and --ntasks-per-node are given.
--ntasks-per-core=<ntasks>                          Number of tasks that will run on each core.
--ntasks-per-node=<ntasks>                          Number of tasks per compute node.
-o <filename pattern>, --output=<filename pattern>  Path to the job's standard output. Slurm supports format strings containing replacement symbols such as %j (job ID).
-p <partition_names>, --partition=<partition_names> Partition to be used. On JURECA the argument can be devel, batch, etc. If omitted, batch is the default.
--reservation=<name>                                Allocate resources from the specified reservation.
-t <time>, --time=<time>                            Maximal wall clock time of the job.
--tasks-per-node=<n>                                Same as --ntasks-per-node.

Note: srun can also be used to start interactive jobs, but we suggest to use salloc; srun should be used only to start job steps and spawn the processes (like MPI tasks) inside an allocation.

Implied allocation options
8. hybridscript sh 7 5 Job Control Hold a job scontrol hold 14900 squeue JOBID PARTITION NAME 14900 batch hybridte Release a job scontrol release 14900 squeue JOBID PARTITION NAME 14900 batch hybridte Cancel a job scancel 14905 7 6 Query Commands Check the Queue squeue JOBID PARTITION NAME 44210 batch Simulati 44211 batch Simulati 44213 batch Simulati 44214 batch Simulati 44215 batch Simulati 44216 batch Simulati 44217 batch Simulati 44241 batch equil2 1 44283 batch expl g01 43141 batch scr N50M 43140 batch scr N50M 43856 batch Vito ANA 43847 batch F3T_CR 44342 batch run scri 44230 batch submit3a 44231 batch submit3b 44238 batch submit5a 44239 batch submit5b 44242 batch submit6a 40618 batch bridgel 43800 batch job 43799 batch job 43796 batch job 41497 batch job sh 43932 batch Simulati USER paschoul USER paschoul USER jics4002 jics4002 jics4002 jics4002 jics4002 jics4002 jics4002 cao paj15340 esmi2000 esmi2000 hgr221 hku230 jias5002 jiek6000 jiek6000 jiek6000 jiek6000 jiek6000 jiff3006 hgr240 hgr240 hgr240 jics6402 hgr283 sT PD sT R sT PD PD PD PD Hu y Id 000 D Dddd TIME 0 00 TIME 0 01 TIME 0 00 0 00 0 00 0 00 0 00 0 00 0 00 29 17 1559757 19 46 04 19 52 01 19 32 11 16 02 55 51 07 51 07 51 07 51207 51 07 49 08 14 06 56 14 45 36 14 47 35 15 01 28 14 06 56 11 26 17 59 NODES 4
The JURECA compute nodes are interconnected with Mellanox EDR (100 Gbps) technology, organized in a fully non-blocking fat tree. The WORK and HOME filesystems are mounted from the JUST storage cluster, offering site-wide access to user data.

JURECA also features major advances in the software stack. The system is launched with the latest CentOS 7 Linux enterprise distribution, a Parastation MPI implementation with MPI-3 support, and a hierarchical module environment for the simplified usage of the software offerings by JSC. The batch system on JURECA is the open-source Slurm workload manager together with the Parastation resource management, which has been a core element of the JUROPA software stack.

1.2 Cluster Nodes

The available compute nodes of JURECA are (type, hostnames, CPU, cores/SMT, RAM, resources, number of nodes):

Thin Compute: jrc[0036-0455,0491-0940,1138-1382,1395-1884], 2x Intel Xeon E5-2680 v3 (Haswell) 2.5 GHz, 24 cores / 48 SMT, 128 GB DDR4, mem128, 1605 nodes
Fat type 1: jrc[1010-1137], 2x Intel Xeon E5-2680 v3 (Haswell) 2.5 GHz, 24 cores / 48 SMT, 256 GB DDR4, mem128 mem256, 128 nodes
Fat type 2: jrc[0946-1009], 2x Intel Xeon E5-2680 v3 (Haswell) 2.5 GHz, 24 cores / 48 SMT, 512 GB DDR4, mem512, 64 nodes
GPUs: jrc[0001-0035,0456-0490,0941-0945], 2x Intel Xeon E5-2680 v3 (Haswell) 2.5 GHz plus 2x NVIDIA K80, 24 cores / 48 SMT, 128 GB DDR4, mem128 gpu:4, 75 nodes
Visualization type 1: jrc[1383-1392], 2x Intel Xeon E5-2680 v3 (Haswell), 24 cores / 48 SMT, 512 GB DDR4, mem512 gpu:2, 2.
partitions and nodes.

sacct is used to retrieve accounting information about jobs and job steps. For older jobs sacct queries the accounting database.

sacctmgr is primarily used by the administrators to view or modify accounting information in Slurm's database. However, it also allows the users to query some information about their accounts and other accounting information.

Note: Man pages exist for all daemons, commands and API functions. The command option --help also provides a brief summary of the available options.

3.2 Allocation Commands (sbatch & salloc)

The commands sbatch and salloc can be used to allocate resources. sbatch is used for batch jobs: the arguments for the sbatch command are the allocation options followed by the job script. sbatch takes the allocation options either from the command line or from the job script (using #SBATCH directives). salloc is used to allocate resources for interactive jobs.

Command format:
sbatch [options] jobscript [args...]
salloc [options] [<command> [command args]]

Here we present some useful options only for the sbatch command:

Option                                    Description
-a <indexes>, --array=<indexes>           Submit a job array (set of jobs). Each job can be identified by its index number.
--export=<env variables|ALL|NONE>         Specify which environment variables will be passed to the job. Default is ALL.
--ignore-pbs                              Ignore any PBS options in the job scr
11. 10 25GHz 2x Nvidia K40 Visualization jrc 1393 1394 2x Intel Xeon 24 48 TB mem1024 gpu 2 type 2 E5 2680 v3 Haswell DDR4 q 2 2 5GHz 2x Nvidia K40 1 JURECA s login nodes ie External Hostname Internal Hostname CPU Cores SMT RAM ode Num Login jureca fz juelich de jrl 01 12 2x Intel Xeon 24 48 128 GB 12 jureca 01 12 fz juelich de E5 2680 v3 Haswell DDR4 2 5GHz The external hostname jureca fz juelich de is an alias for redirecting to the login nodes in a round robin fashion 1 3 Data Management Filesystems On JURECA we provide GPFS shared filesystems We provide home scratch and archive file systems which have different purposes The home filesystems are supposed to be used for user s data storage with the safety of backups TSM backup the scratch filesystem should be used as a fast storage for the data produced by the jobs no backup and purged regularly and the archive ones are to be used for long term data archiving Here is a small matrix with all filesystems available to the users Filesystem Mount Point Description GPFS WORK work Scratch filesystem without backup GPFS HOME homea Home filesystems with TSM backup homeb homec GPFS ARCH arch Archiving filesystems with TSM backup Available only on the arch2 login nodes GPFS DATA data Special filesystem used only by certain groups with TSM backu
12. 36031 kraused 1 jrc0300 COMPLETED 0 0 36032 kraused 1 jrc0301 COMPLETED 0 0 63 8 Changelog Version 2 0 1 Fixed some typos and the borders of a few tables Version 2 0 0 e Extended documentation for Phase 2 complete system All nodes types and the new partitions are documented in this version Also new sections were added about the General Resourses GRES of Slurm Version 1 0 0 e First version of this document NOTE This document was created using LibreOffice Writer and then it was exported to the PDF format There is a known issue where the users cannot copy from this document some commands and then paste them on their terminals Currently we don t know if there is a fix for this issue but we will investigate it 64
48
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:15:00
#SBATCH --partition=batch

srun ./mpi-prog

Hybrid Jobs

Example 7: In this example a hybrid MPI/OpenMP job is presented. This job will allocate 5 compute nodes for 2 hours. The job will have 30 MPI tasks in total, 6 tasks per node and 4 OpenMP threads per task. On each node 24 cores will be used (no SMT enabled).

Note: It is important to define the environment variable OMP_NUM_THREADS, and its value must match the value that was given to the option --cpus-per-task.

#!/bin/bash
#SBATCH -J TestJob
#SBATCH -N 5
#SBATCH -o TestJob-%j.out
#SBATCH -e TestJob-%j.err
#SBATCH --time=02:00:00
#SBATCH --partition=batch

export OMP_NUM_THREADS=4
srun -N 5 --ntasks-per-node=6 --cpus-per-task=4 ./hybrid-prog

Example 8: In this example there is a hybrid application which will start 2 tasks per node on 4 allocated nodes, with 12 threads per node (no SMT). In order to set the environment variable OMP_NUM_THREADS, Slurm's variable SLURM_CPUS_PER_TASK is used, which is defined by the option --cpus-per-task.

#!/bin/bash
#SBATCH -N 4
#SBATCH -n 8                  # can be omitted
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=6
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:20:00
#SBATCH --partition=batch

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./hybrid-prog
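For hybrid scripts like the two above, it can be worth confirming in the job output that the thread count really follows --cpus-per-task. The following is only an illustrative sketch (the echo lines and the hybrid-prog name are placeholders, not part of the official examples):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=12
#SBATCH --time=00:05:00

# Print what Slurm allocated and what OpenMP will use, so any mismatch
# between --cpus-per-task and OMP_NUM_THREADS shows up in the job output.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
echo "SLURM_CPUS_PER_TASK = ${SLURM_CPUS_PER_TASK}"
echo "OMP_NUM_THREADS     = ${OMP_NUM_THREADS}"

srun ./hybrid-prog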
14. Currently psslurm is under active development by ParTec and JSC in the context of the JuRoPA collaboration The Batch System manages the compute nodes which are the main resource entity of the cluster Slurm groups the compute nodes into partitions These partitions are the equivalent of queues in Moab It is possible for different partitions to overlap which means that the compute nodes can belong to multiple partitions Also partitions can be configured with certain limits for the jobs that will be executed Jobs are the allocations of resources by the users in order to execute tasks on the cluster for a specified period of time Slurm introduces also the concept of job steps which are sets of possibly parallel tasks within the jobs One can imagine job steps as smaller allocations or jobs within the job which can be executed sequentially or in parallel during the main job allocation In Figure 1 we present the architecture of the daemons and their interactions with the user commands of Slurm Compute Node 11 l i l l i l Login Nodes Compute Node Ni Figure 1 2 2 Slurm Configuration High Availability for the main controllers slurmctld and slurmdbd Backfilling scheduling algorithm No node sharing Job scheduling according to priorities Accounting mechanism slurmdbd with MySQL database as back end storage User and job limits enforced by QoS Quality of Service and some
15. NODES 4 NODES w PDHNDDOHFLHFHH HR WONDPPRWOHHFHHRHHRHRH NODELIST REASON JobHeldUser NODELIST REASON jrc 120 123 NODELIST REASON Dependency Dependency Dependency Dependency Dependency Dependency Dependency jrc 0387 0388 0406 0407 0450 0453 jrc 0251 0341 0342 jrc 0369 0372 jrc 0439 0440 0443 0444 jrc 0454 0455 jrc 0343 0368 0400 0403 jrc 0415 0417 jrc0305 jrc0418 jrc0441 jrc0442 jrc0304 jrc 0136 0140 0321 0325 jrc 0134 0316 jrc 0141 0317 jrc0319 jrc 0249 0250 jrc 0152 0447 0449 Check the Queue for one user squeue u paschoul JOBID PARTITION NAME USER ST TIME NODES NODELIST REASON 14910 batch mpitest paschoul PD 0300 4 QOSResourceLimit 14911 batch mpitest paschoul PD 0 00 4 QOSResourceLimit 14912 batch hybridte paschoul R 0 02 4 jrc 0120 0123 14908 batch mpitest paschoul R 0 02 4 jrc 095 098 Check partitions and nodes sinfo PARTITION AVAIL TIMELIMIT NODES STATE NODELIST batch up 2 00 00 5 drain j3c 090 115 116 119 124 batch up 2 00 00 65 idle j3c 061 089 091 114 117 118 120 123 125 130 large down 1 00 00 5 drain j3c 090 115 116 119 124 large down 1 00 00 65 idle j3c 061 089 091 114 117 118 120 123 125 130 Check off line nodes sinfo R REASON USER TIMESTAMP NODELIST 401 NodeHardware root 2015 07 29T21 18 54 jrc0320 403 NodeHardware root 2015 07 30T07 58 53 jrc0340 402 NodeHardware root 2015 07 30T04 12 14 jrc0144 Gold
This is necessary if you want to debug your program.
-O[0-3]         Sets the optimization level.
-L              A path can be given in which the linker searches for libraries.
-D              Defines a macro.
-U              Undefines a macro.
-I              Allows to add further directories to the include file search path.
-H              Gives the include file order. This option is very useful if you want to find out which directories are used and in which order they are applied.
-sox            Stores useful information like compiler version, options used etc. in the executable.
-ipo            Inter-procedural optimization.
-axCORE-AVX2    Indicates the processor for which code is created.
-help           Gives a long list of quite a big amount of options.

Compilation Examples

Compile an MPI program in C++:
mpicxx -O2 -o mpi_prog program.cpp

Compile a hybrid MPI/OpenMP program in C:
mpicc -openmp -o mpi_prog program.c

1.8 Batch model & Accounting

In the following we present the main policies concerning the batch model and accounting that are applied on JURECA:

• Job scheduling according to priorities. The jobs with the highest priorities will be scheduled next.
• Backfilling scheduling algorithm. The scheduler checks the queue and may schedule jobs with lower priorities that can fit in the gap created by freeing resources for the next highest priority jobs.
• No node sharing. The smallest allocation for jobs is one compute node. Running jobs do not disturb each other.
• For each project a Linux group i
17. hard limits configured in the partition settings There is a QoS for each contingent state normal lowcont nocont and suspended Users without contingent are set to a different QoS and get a penalty for their job priorities No preemption configured Running jobs cannot be preempted Prologue and Epilogue with pshealthcheck from Parastation The prologue checks the status of the nodes at job start and Epilogue cleans up the nodes after job completion Same limits configurations for batch and interactive jobs no difference between batch and interactive jobs for Slurm different behavior than Moab 9 2 3 Partitions In Slurm multiple nodes can be grouped into partitions which are sets of nodes with associated limits for wall clock time job size etc These limits are hard limits for the jobs and can not be overruled by the specified limits in QoS s Partitions may overlap and nodes may belong to more than one partition making partitions serve as general purpose queues like queues in Moab The following table shows the partitions on JURECA Phase 1 with their configured maximum limits and default values Partition attr Limit Value devel Maximum wall clock time for each job 2 hours interactive jobs Default wall clock time for each job 30 minutes thin nodes 2 Minimum Maximum number of nodes per job 1 8 nodes Default number of nodes for each job 1 node Max number of running submitted jobs per user QoS dependen
how they will create job steps. It is possible to have one job step after another, using all the allocated nodes each time, or to have many job steps running in parallel. Instead of submitting many single-node jobs (known as farming), it is suggested that users do farming using job steps inside a single job. In this case, since all CPUs are available to the job, the only bounding factors are the memory per task and the walltime. The users will be accounted for all the nodes of the allocation, regardless of whether all nodes are used for job steps or not.

Example 11: The following example shows how to execute MPI programs in different job steps, sequentially, inside a job allocation. In total 4 nodes are allocated for 2 hours. In this job 3 job steps will be created. The first job step will run on 4 nodes with 1 MPI task per node for 20 minutes. After that, the second job step will be executed on 3 nodes with 24 MPI tasks per node for 1 hour. In the end, the last job step will run on 4 nodes with 48 MPI tasks per node, using all virtual cores on each node (SMT); it will finish when the MPI application completes, or it will be canceled by the scheduler if it reaches the walltime limit.

#!/bin/bash
#SBATCH --nodes=4
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=02:00:00

srun -N4 --ntasks-per-node=1 --time=00:20:00 ./mpi-prog1
srun -N3 --ntasks-per-node=24 --time=01:00:00 ./mpi-prog2
srun -N4 --ntasks-p
mem256 jobscript

Submit a job requesting 4 nodes in the mem512 partition (will be denied if no --gres=mem512 is given):
sbatch -N4 -p mem512 --gres=mem512 jobscript

Submit a job requesting 8 nodes and 2 GPUs per node in the gpus partition (must give --gres=gpu:X):
sbatch -N8 -p gpus --gres=gpu:2 cuda-jobscript

Submit a job requesting 32 nodes and 4 GPUs per node in the gpus partition:
sbatch -N32 -p gpus --gres=gpu:4 jobscript

Submit a job requesting the 2 fat visualization nodes with 2 GPUs per node in the vis partition:
sbatch -N2 -p vis --gres=mem1024,gpu:2 jobscript

3.3 Spawning commands (srun)

With srun the users can spawn any kind of application, process or task inside a job allocation: a shell command, any single- or multi-threaded executable in binary or script format, an MPI application, or a hybrid application with MPI and OpenMP. When no allocation options are defined with the srun command, the options from sbatch or salloc are inherited. srun should be used either:

1. Inside a job script submitted by sbatch, or
2. After calling salloc.

Note: To start an application with Parastation MPI the users should use only srun and not mpiexec. For Intel MPI, mpirun is not supported yet but it will be later.

Command format:
srun [options] executable [args...]

The allocation options of srun for the job steps are almost the same as for sbatch and salloc (please see the table above with the allocation options). There are also some us
[Figure 3]

In Figure 3 we have the visualization of the processor affinity of a 48-task job step on a single JURECA node. Each column corresponds to a logical core and each row to a task (process). A red dot indicates that the task can be scheduled on the corresponding core. For the purpose of presentation, stars are used to highlight cores/tasks 0, 6, 12 and 18.
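If you want to check the binding that Slurm applies to your own job steps instead of reading it off a figure, srun can be asked to report it. The sketch below is illustrative only; the exact report format depends on the installed Slurm version:

# Report the CPU binding chosen for each task of the job step.
srun -N1 -n4 --cpu_bind=verbose,rank hostname

# Alternatively, let each task print its own allowed-CPU list (Linux-specific).
srun -N1 -n4 bash -c 'grep Cpus_allowed_list /proc/self/status'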
Depending on the combination of the allocation options that are used during submission, some other allocation options can be omitted because they are implied: the system calculates them automatically or uses the default values. The following table shows these combinations:

Used options                        Implied options (can be omitted)
--nodes & --ntasks                  --ntasks-per-node
--nodes & --ntasks-per-node         --ntasks
--nodes                             --ntasks (default is 1 task per node)
--ntasks                            --nodes & --ntasks-per-node

Examples

Submit a job requesting 2 nodes for 1 hour with 24 tasks per node (implied value of --ntasks: 48):
sbatch -N2 --ntasks-per-node=24 --time=1:00:00 jobscript

Submit a job script allocating 4 nodes with 16 tasks in total (implied: 4 tasks per node) for 30 minutes:
sbatch -N4 -n16 -t 30 jobscript

Submit a job array of 4 jobs with 1 node per job, with the default walltime:
sbatch --array=0-3 -N1 jobscript

Submit a job script in the batch partition requesting 64 nodes for 2 hours:
sbatch -N64 -p batch -t 2:00:00 jobscript

Submit a job without a job script, wrapping a shell command:
sbatch -N4 -n4 --wrap="srun hostname"

Submit a job requesting the execution to start after the specified date:
sbatch --begin=2015-01-11T12:00:00 -N2 --time=2:00:00 jobscript

Submit a job requesting all available mail notifications to the specified email address:
sbatch -N2 --mail-user=myemail@address.com --mail-t
........ 59
7.6 Query Commands ........ 59
7.7 Accounting Commands ........ 62
8 Changelog ........ 64

1 Cluster Information

1.1 Introduction

After more than five years of successful operation, the JUROPA general purpose supercomputer was shut off on the 24th of June 2015. Its successor JURECA (Juelich Research on Exascale Cluster Architectures) is projected to reach a peak performance of about 1.8 PFLOP/s once fully installed. In order to minimize the service interruption for users, the system is installed in two phases. The first phase consisted of 260 compute nodes; since the 2nd of November 2015 the second phase is available and in production, including in total 1884 compute nodes.

The JURECA system is based on Intel Xeon E5-2680 v3 (Haswell) CPUs with 12 cores per CPU and utilizes the scalable V-class server architecture of T-Platforms. Compute nodes are dual-socket systems with 24 cores per node. Different sizes of DDR4 memory are offered in the full system. The normal thin nodes are equipped with 128 GiB of memory; for applications with higher memory demands, two other types of nodes with 256 GiB and 512 GiB per node are available. Accelerated applications can take advantage of the compute nodes equipped with NVIDIA K80 GPUs. Several login nodes are available. Additionally, visualization nodes with large main memory and latest-generation NVIDIA K40 GPUs for pre-/post-processing are available.
23. the entries that have Partition as value it means that the limits are inherited from the Partitions where the jobs are running in parenthesis you can see the configured QoS limits 11 Each association in Slurm s database belongs to one user only In each association there are two entries regarding the QoSs One entry with the list of available QoSs and another entry with the Default QoS used when QoS is not specified with options In every association only one available QoS is defined same as default for each user depending on the contingent status This is implemented in JSC s accounting mechanism and the users are not allowed to change their QoS The limits are enforced to the users by setting the correct QoS for their association according to their contingent Job limits are enforced by that QoS in combination with the partition limits If the users request allocations over the limits then the submission will fail flag DenyOnLimit 2 6 Generic Resources GRES Slurm provides the functionality to define generic resources GRES for each node type These generic resources can be used during job submissions in order to allocate nodes with specific resources or features Slurm can be configured also to deny allocations which don t specify any GRES for certain nodes or partitions and this feature is used for the some of JURECA s partitions like gpus and mem512 The GRES configuration can be used also to extract important accounting informa
0

Unload all currently loaded modules:
module purge

Accessing Old Software

Software on JURECA is organized in stages. By default only the most recent stage with up-to-date software is available. To access older or in-development versions of software installations you must manually extend your module path using the command:

module use /usr/local/software/jureca/<Other Stage>

1.7 Compilers

On JURECA we offer some wrappers to the users in order to compile and execute parallel jobs using MPI. Different wrappers are provided depending on the MPI version that is used. Users can choose the compiler's version using the module command (see the modules section). The following table shows the names of the MPI wrapper procedures for the Intel compilers as well as the names of the compilers themselves. The wrappers build up the MPI environment for your compilation task, so please always use the wrappers instead of the compilers.

Programming Language    Compiler    Parastation MPI Wrapper    Intel MPI Wrapper
Fortran 90              ifort       mpif90                     mpiifort
Fortran 77              ifort       mpif77                     mpiifort
C++                     icpc        mpicxx                     mpiicpc
C                       icc         mpicc                      mpiicc

In the following table we present some useful compiler options that are commonly used:

Option          Description
-openmp         Enables the parallelizer to generate multi-threaded code based on the OpenMP directives.
-g              Creates debugging information in the object files.
[Figure 4]

In Figure 4 we have the visualization of the processor affinity of an 8-task job step with the option --cpus-per-task=6, i.e. a hybrid MPI/OpenMP job with 8 MPI processes and OMP_NUM_THREADS=6. Pinning of the individual threads spawned by each task is not in the hands of the resource management system but is managed by the runtime, e.g. the OpenMP runtime library.

Note: It is important to specify the correct --cpus-per-task count to ensure an optimal pinning for hybrid applications.

The processor affinity masks generated with the options --cpu_bind=rank and --cpu_bind=threads coincide with the default binding scheme.

Note: The distribution of processes across sockets can be affected with the option -m of the srun command. Please see the man page of srun for more information.

Binding to sockets

With the option --cpu_bind=sockets processes can be bound to sockets (see Figure 5).

[Figure 5]

In Figure 5 we have the visualization of the processor affinity for a two-task job step with the option --cpu_bind=sockets. This option can be further combined with --hint=nomultithread to restrict task zero to
........ 20
3.2 Allocation Commands ........ 21
sbatch & salloc ........ 21
Generic Resources (GRES) ........ 24
3.3 Spawning commands ........ 25
srun ........ 25
3.4 Query Commands ........ 26
squeue ........ 26
sinfo ........ 27
........ 27
smap ........ 29
sprio ........ 30
scontrol ........ 30
........ 31
........ 34
3.6 Job Utility Commands ........ 34
sattach ........ 34
sstat ........ 35
3.7 Job Accounting Commands ........ 36
sacct ........ 36
sacctmgr ........ 37
3.8 Custom commands from JSC ........ 38
llview ........ 38
q_cpuquota ........ 39
4 Batch Jobs ........ 40
4.1 Job script examples ........ 41
Serial jobs ........ 41
Parallel jobs ........ 41
OpenMP jobs ........ 41
MPI jobs ........ 42
MPI jobs with SMT ........ 42
Hybrid jobs ........ 43
Hybrid jobs with SMT ........ 43
Intel MPI jobs ........ 44
4.2 Job steps ........ 44
4.3 Dependency Chains ........ 45
4.4 Job Arrays ........ 46
4.5 MPMD ........ 47
5 Interactive Jobs ........ 48
5.1 Interactive Session ........ 48
5.2 X Forwarding ........ 49
6 From Moab/Torque to Slurm ........ 50
6.1 Differences between the systems ........ 50
6.2 User Commands Comparison ........ 51
7 Examples ........ 52
7.1 Template job scripts ........ 52
........ 52
7.3 Compilation ........ 57
........ 58
........
27. 2015 3 187 GCC bare 4 9 3 icc 2015 3 187 GCC bare 4 9 3 ifort 2015 3 187 D icc 2015 3 187 D usr local software jureca UI Tools EasyBuild 2 2 0 binutils 2 25 ncurses 5 9 util wrapper 1 1 Inspector 2015_update1 gc 7 4 2 popt 1 16 zlib 1 2 8 JUBE 2 0 6 RDP Sle Za Ze pscom 5 0 45 1 VTune 2015_update4 libatomic_ops 7 4 2 tbb 4 3 6 211 usr local software jureca Devel Developers InstallSoftware D Stages Devel S D Stages Current S Stages Legacy S Where S Module is Sticky requires force to unload or purge D Default Module Use module spider to find all possible modules Use module keyword keyl key2 to search for all possible modules matching any of the keys Load a module module load OpenFOAM 2 3 1 module list Currently Loaded Modules 1 zi 1a 28 8 popt 1 16 15 imk1 11 2 3 187 2 binutils 2 25 9 pscom 5 0 45 1 16 intel para 2015 07 3 ncurses 5 9 10 ipsmpi 2015 07 17 libreadline 6 3 4 libatomic_ops 7 4 2 11 iccifort 2015 3 187 GCC bare 4 9 3 18 SCOTCH 6 0 3 5 gc 7 4 2 12 icc 2015 3 187 GCC bare 4 9 3 19 OpenFOAM 2 3 1 6 util wrapper 1 1 13 ifort 2015 3 187 GCC bare 4 9 3 7 GCC bare 4 9 3 14 psmpi 5 1 4 1 Purge all modules module purge module list No modules loaded Check a package module spider Boost Description Boost provid
28. 752 11183 0 3569 0 35800 14400 10831 0 3569 0 36848 4878 4821 2 56 0 36858 1615 1558 2 56 0 39621 1609 1552 2 56 0 39672 983 926 2 56 0 39714 1749 1692 2 56 0 39726 1770 IS 2 56 0 40608 101329 771 0 598 100000 44116 744 688 0 56 0 44122 744 688 0 56 0 44257 100662 606 0 56 100000 44258 100661 606 0 56 100000 44259 100661 606 0 56 100000 44260 100661 606 0 56 100000 44261 100661 606 0 56 100000 44262 100661 606 0 56 100000 44263 100661 606 0 56 100000 44264 100661 606 0 56 100000 44265 100661 606 0 56 100000 44266 100661 606 0 56 100000 44267 100661 606 0 56 100000 61 7 7 Accounting Commands Check user association sacctmgr show assoc where user paschoul Cluster Account User Partition Share GrpJobs GrpNodes GrpCPUs GrpMem GrpSubmit GrpWall GrpCPUMins MaxJobs MaxNodes MaxCPUs MaxSubmit MaxWall MaxCPUMins Qos Def QOS GrpCPURunMins jureca zam paschoul 3000 normal normal jureca root paschoul maint il nolimits nolimits Check all QoSs sacctmgr show qos Name Priority GraceTime Preempt PreemptMode Flags UsageThres UsageFactor GrpCPUs GrpCPUMins GrpCPURunMins GrpJobs GrpMem GrpNodes GrpSubmit GrpWall MaxCPUs MaxCPUMins MaxNodes MaxWall MaxCPUsPU MaxJobsPU MaxNodesPU MaxSubmitPU normal 100000 00 00 00 cluster DenyOnLimit 1 000000 280 1 00 00 00 280 280 4096 lowcont 0 00 00 00 cluster DenyOnLimit 1 000000 280 06 00 00 24 280 4096 nocont 0 00 00 00 cluster DenyOnLimit 1 000000 280 06 00 00 24 280 4096 suspended 0 00 00 00
29. A J LICH FORSCHUNGSZENTRUM JURECA User s Manual for the Batch System Slurm Slurm integrated with Parastation Author Chrysovalantis Paschoulas Support sc fz juelich de Contributors Dorian Krause Philipp Th rnig Eric Gregory Matthias Nicolai Theodoros Stylianos Kondylis Ulrich Detert Document version 2 0 1 2015 Nov 03 Table of Contents 1 J ster TEE At OMe A o 1 A ae Pocus rien ereignete 1 1 2 Gl ster Nodes ienen its 1 1 3 Data Management Filesystems u e ea ee nenne 2 1 4 Access tothe Cluster io E a need 2 AI a a a A E a AOA aan ESOT a EAEAN aSa 3 A A E E E Du len sei E 3 Modules and Toolchains hierarehy u ae Reale 3 Using the module command Aue ee 4 Accessing Old SoftWare nur aeg 5 1 7 COMpilet Saen ae een ae 6 Compilation Example en Een Win 6 1 8 Batch model amp ACCOUNTS u ER AAA 7 2 Batch System MMM A ii 8 2 O 8 2 2 Sl rm Conheiralion anreisen 9 2 3 PAVITCONS ON 10 2 4 Slurm s Accounting Database as en 11 Zo O nE DOS ri 11 2 6 Generic Resources GRES maria RER 12 Job Submission Filtet aeni nian a a sen vata e eames e gl ase ae ee 13 2 7 O 14 2 8 MOM PEM yA OMIMEM eisen 15 2 9S M a EEE aah ied te SE nse Sine aac ee Gia OO Re let 15 Using SM rR Pa nennen ceca 16 Howto Proton MES tiie ED RE IE IDEE a 16 2 10 Processor Afe a isa 16 Default processor affinity a 17 Binding socket ee ee win Sachsens 18 EN RR alas 18 Disabling PA DER E o 19 acom User E OS a cd 20 A A O E
#SBATCH --time=30

sleep 5
hostname

Parallel job

In order to start a parallel job, users have to use the srun command, which will spawn processes on the allocated compute nodes of the job. Options given to srun will override the allocation options from sbatch. If no srun options are given, the options defined with #SBATCH (or the defaults) will be used.

Example 2: Here is a simple example of a job script where we allocate 4 compute nodes for 1 hour. Inside the job script, with the srun command we request to execute on 2 nodes, with 1 process per node, the system command hostname, in a time frame of 10 minutes.

#!/bin/bash
#SBATCH -J TestJob
#SBATCH -N 4
#SBATCH -o TestJob-%j.out
#SBATCH -e TestJob-%j.err
#SBATCH --time=60

srun -N2 --ntasks-per-node=1 -t 10 hostname

OpenMP job

Example 3: In this example the job will execute an OpenMP application named omp-prog. The allocation is for 1 node and, by default, since there is no node sharing, all CPUs of the node are available for the application. The output filenames are also defined and a walltime of 2 hours is requested.

Note: It is important to define and export the variable OMP_NUM_THREADS that will be used by the executable.

#!/bin/bash
#SBATCH -J TestOMP
#SBATCH -N 1
#SBATCH -o TestOMP-%j.out
#SBATCH -e TestOMP-%j.err
#SBATCH --time=02:00:00

export OMP_NUM_THREADS=48
/home/user/test/omp-prog
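A minimal way to try one of these example scripts and follow it through the queue is sketched below; the file name job.sh is a placeholder for whichever script you saved:

# Submit the script; sbatch prints the assigned job ID.
sbatch job.sh

# Watch your own jobs until the job starts and completes.
squeue -u $USER

# Afterwards, inspect the output and error files defined with -o and -e
# (here the file names follow the patterns used in Example 2).
cat TestJob-*.out TestJob-*.err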
31. Parastation MPICH MPI iimpi icc ifort Intel MPI FullToolchains gpsolf gpsmipi OpenBLAS FFTW and ScaLAPACK intel para ipsmpi2 Intel Math Kernel Library imkl intel iimpi Intel Math Kernel Library imkl Using the module command Users should load unload and query modules though the module command Several useful module commands are Command Description module avail Shows the available toolchains and what modules are compatible to load right now according to the currently loaded toolchain module load lt modname gt lt modversion gt Loads a specific module Default version if it is not given module list Lists what modules are currently loaded module unload lt modname gt lt modversion gt Unloads a module module purge Unloads all modules module spider lt modname gt Finds the location of a module within the module hierarchy As we said above in order to load a desired application module it is necessary first to load the correct toolchain Therefore preparing the module environment includes two steps 1 First load one of the available toolchains The intel para toolchain from the Fulltoolchains has the most supported software at this moment 2 Second load other application modules which where built with currently loaded toolchain Following we will give some examples of the module command List the available toolchains module avail
32. alltime gt Partition Queue selection q lt queue gt partition lt queue gt p lt queue gt Email for notifications M lt email gt mail user lt email gt Event types for notifications m lt mode gt mail type lt mode gt Job name N lt jobname gt job name lt jobname gt J lt jobname gt Interactive jobs I None use salloc or srun Job dependencies W depend lt mode gt lt jobID gt dependency lt dependency_list gt d lt dependency_list gt 51 7 Examples 7 1 Template job scripts Template MPI job script bin bash SBATCH J lt jobname gt SBATCH N lt number gt SBATCH n lt number gt can be omitted SBATCH ntasks per node lt number gt SBATCH o lt jobname gt j out SBATCH e lt jobname gt j err SBATCH mail type lt BEGIN END FAIL or ALL gt SBATCH mail user lt email gt SBATCH partition lt batch devel gt SBATCH time lt time gt run MPI application below with srun Template Hybrid job script bin bash SBATCH J lt jobname gt SBATCH N lt number gt SBATCH n lt number gt can be omitted SBATCH ntasks per node lt number gt SBATCH cpus per task lt number gt SBATCH o lt jobname gt j out SBATCH e lt jobname gt j err SBATCH mail type lt BEGIN END FAIL or ALL gt SBATCH mail user lt email gt SBATCH partition lt batch d
33. ate Please check the man pages for the available time formats j lt job step gt jobs lt job step gt Show information only for the specified jobs job steps 1 Show full report with all available fields for each long reported job job step n Do not display the header in the beginning of the noheader output N lt node_list gt nodelist lt node_ list gt Show information only for jobs that ran on the specified nodes name lt jobname_list gt Show information about jobs with the specified names o lt field list gt format lt field_list gt Specify the list of fields that will be displayed in the output Available fields can be found with e option or in the man pages r lt partition_name gt partition lt partition_name gt Show information only for jobs that ran in the specified partitions Default is all partitions s lt state_list gt state lt state_list gt Filter and show information only about jobs with the specified states like completed cancelled failed etc Please check the man pages for the full list of states S lt start_time gt starttime lt start_time gt List jobs with any state or with specified states using option state after the given date The default value is 00 00 00 of current date Check man page for date formats X allocations Show information only for jobs and not for job steps
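As a concrete illustration of how these options combine, the sketch below lists whole jobs (no job steps) for a chosen period with a compact field selection; the start date and the job IDs are placeholders to adapt:

# Jobs only (no job steps) since the given date, with selected fields.
sacct -X -S 2015-11-01 --format=jobid,jobname,nnodes,elapsed,state,exitcode

# The same kind of query restricted to specific jobs.
sacct -X -j 44210,44211 --format=jobid,nnodes,elapsed,state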
ation commands users can change this default behavior. The users can load modules and prepare the desired environment before job submission, and this environment will then be passed to the jobs that are submitted. Of course, a good practice is to include module commands inside the job scripts in order to have full control over the environment of the jobs.

2.9 SMT

Similar to the Intel Nehalem processors in JUROPA, the Haswell processors in JURECA offer the possibility of Simultaneous Multi-Threading (SMT) in the form of the Intel Hyper-Threading (HT) Technology. With HT enabled, each physical processor core can execute two threads or tasks simultaneously. The operating system thus lists a total of 48 logical cores or Hardware Threads (HWT). Therefore a maximum of 48 processes can be executed on each compute node without overbooking. Each compute node on JURECA consists of two CPUs, located on sockets zero and one, with 12 physical cores each. These cores are numbered 0 to 23 and the hardware threads are named 0 to 47 in a round-robin fashion. Figure 2 depicts a node schematically and illustrates the naming convention.

[Figure 2]
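To see this naming on a node yourself, you can query Slurm's node record or the Linux CPU topology. This is only a sketch: the node name jrc0130 is an example taken from elsewhere in this manual, and the field names may vary slightly between Slurm versions.

# Slurm's view of a compute node: expect Sockets=2, CoresPerSocket=12,
# ThreadsPerCore=2 and CPUTot=48 in the output.
scontrol show node jrc0130 | grep -E 'CPUTot|Sockets|CoresPerSocket|ThreadsPerCore'

# The operating system's view of the same topology, run inside a job step.
srun -N1 -n1 lscpu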
35. children of the accounts In each association it is possible to specify fair share job limits and QoS To interact with slurmdbd and get accounting information from the database Slurm provides the commands sacct and sacctmgr 2 5 Job Limits QoS As we describe above the limits of the partitions are the hard limits that put an upper limit for the jobs However the actual job limits are enforced by the limits specified in both partitions and Quality of Services which means that first the QoS limits are checked enforced but these limits can never go over the partition limits One QoS is configured for each possible contingent status normal lowcont nocont These QoSs play the most important role to define the job priorities By defining those QoSs the available range of priorities is separated into three sub ranges one for each contingent mode Also one more QoS is defined with the name suspended which will be given to all associations that belong to users projects that have ended and or are not allowed to submit jobs anymore Following we present the list with the configured Quality of Services Name Priority Flags MaxNodes Job MaxWall Job MaxJobs User MaxSubmittedJobs normal 100 000 DenyOnLimit Partition Partition 1d 32 4096 lowcont 50 000 DenyOnLimit Partition Partition 6h 4096 nocont 0 DenyOnLimit Partition Partition 6h 4096 suspended 0 DenyOnLimit 0 0 Note For
36. cluster DenyOnLimit 1 000000 0 00 00 00 0 0 0 nolimits 100000 00 00 00 cluster DenyOnLimit 1 000000 Check one QoS sacctmgr show qos where name normal Name Priority GraceTime Preempt PreemptMode Flags UsageThres UsageFactor GrpCPUs GrpCPUMins GrpCPURunMins GrpJobs GrpMem GrpNodes GrpSubmit GrpWall MaxCPUs MaxCPUMins MaxNodes MaxWall MaxCPUsPU MaxJobsPU MaxNodesPU MaxSubmitPU normal 100000 00 00 00 cluster DenyOnLimit 1 000000 280 1 00 00 00 280 280 4096 Check old jobs history sacct X u hgul47 JobID JobName Partition Account AllocCPUS State ExitCode 42861 T34pH4 batch hgul4 48 TIMEOUT 0 1 42866 T28pH8 batch hgul4 48 TIMEOUT 0 1 42872 T26pH8 batch hgul4 48 RUNNING 0 0 42873 T26pH5 batch hgul4 48 RUNNING 0 0 42874 T26pH4 batch hgul4 48 RUNNING 0 0 42875 T24pH8 batch hgul4 48 RUNNING 0 0 62 42876 T24pH5 batch hgul4 48 RUNNING 0 0 42877 T22pH8 batch hgul4 48 RUNNING 0 0 42878 T22pH5 batch hgul4 48 RUNNING 0 0 Check old jobs with different format and specified time frame sacct X u kraused format jobid user nnodes nodelist state exit S 2014 11 15T00 00 00 E 2015 11 17T18 00 00 JobID User NNodes NodeList State ExitCode 2 kraused 1 jrc0106 COMPLETED 0 0 3 kraused 8 jrc 0106 0113 FAILED 2 0 10 kraused 8 jrc 0108 0115 COMPLETED 0 0 297 kraused 70 jrc 0282 0306 CANCELLED 0 0 298 kraused 70 jrc 0106 0125 FAILED 127 0 299 kraused 2 None assigned CANCELLED 0 0 300 kraused 2 None assigned CANCELLED 0 0 301 kraused 2 N
core depends on the total number of tasks per node (--ntasks-per-node and --cpus-per-task) and the process pinning. The use of the last 24 hardware threads can be disabled with the option --hint=nomultithread of the srun command. This option leads to overbooking of the same logical cores as soon as more than 24 threads are executed. For most applications this option is not beneficial and the default value (--hint=multithread) should be used. In chapter 4.1 (Job script examples) there are some examples about SMT.

How to profit from SMT

Processes which are running on the same physical core share several of the resources available to that particular core. Therefore applications will profit most from SMT if processes running on the same core are complementary in their usage of resources, e.g. complementary computation and memory access phases. On the other hand, processes with similar resource usage may compete for bandwidth or functional units and hamper each other. We recommend to test whether your code profits from SMT or not. In order to test whether your application benefits from SMT, one should compare the timings of two runs on the same number of physical cores, i.e. the number of nodes specified with --nodes should be the same for both jobs: one job without SMT (t1) and one job with SMT (t2). If t2 is lower than t1, your application benefits from SMT. In practice t1/t2 will be less than 1.5, e.g. a runtime improvement of maximal 50% wi
cores 0 to 11 and task two to cores 12 to 23. On JURECA, locality domains coincide with sockets, so that the options --cpu_bind=ldoms and --cpu_bind=sockets give the same results.

Manual pinning

For advanced use cases it can be desirable to manually specify the binding masks or core sets for each task. This is possible using the options --cpu_bind=map_cpu and --cpu_bind=mask_cpu. For example, the following command spawns two tasks pinned to cores 1 and 5, respectively:

srun -n 2 --cpu_bind=map_cpu:1,5

The next command spawns two tasks pinned to cores 0 and 1 (mask 0x3) and cores 2 and 3 (mask 0xC), respectively:

srun -n 2 --cpu_bind=mask_cpu:0x3,0xC

Disabling pinning

Processor binding can be disabled using the argument --cpu_bind=none of srun. In this case each thread may execute on any of the 48 logical cores, and the scheduling of the processes is up to the operating system. On JURECA the options --cpu_bind=none and --cpu_bind=boards achieve the same result.

3 Slurm User Commands

In this section we will first give a list of all commands with a short description, and then we will describe the functionality of each command in more detail, giving also some examples.

3.1 List of Commands

Slurm offers a variety of user commands for all the necessary actions concerning the jobs. With these commands the users have a rich interface to allocate resources, query job status, control jobs, manage accounting i
39. d P Print information in a parsable way Delimit output parsable2 with without a in the end Examples Print information about the user s shares in a long format sshare 1 Print information about the user s shares in a parsable way sshare P Print information about the user s shares without the initial header in the output sshare n 32 3 5 Job Control Commands scancel With scancel we can signal or cancel jobs job arrays or job steps Command format scancel OPTIONS job_id _array_id step_id Some of the most useful options of the scancel command are Option Description A lt account gt account lt account gt Restrict the operation only to the jobs under the specified account b Send a signal to the batch job shell and its child batch processes i Enables interactive mode User must confirm for each interactive operation n lt job_name gt name lt job_name gt Cancel a job with the specified name p lt partition_name gt partition lt partition_name gt Restrict the operation only to the jobs that are running in the specified partition R lt reservation name gt reservation lt reservation name gt Restrict the operation only to the jobs that are running using the specified reservation s lt signal name gt signal lt signal name gt Send a signal to the specified job
40. d to transfer a file from all allocated nodes to the currently active job This command can be used only inside a job script scontrol is primarily used by the administrators to view or modify Slurm configuration like partitions nodes reservations jobs etc However it provides also some functionality for the users to manage jobs or query and get some information about the system configuration sinfo is used to retrieve information about the partitions reservations and node states It has a wide variety of filtering sorting and formatting options smap graphically shows the state of the partitions and nodes using a curses interface We recommend llview as an alternative which is supported on all JSC machines sprio can be used to query job priorities squeue allows to query the list of pending and running jobs By default it reports the list of pending jobs sorted by priority and the list of running jobs sorted separately according to the job priority 20 srun is used to initiate job steps mainly within a job or start an interactive job srun has a wide variety of options to specify resource requirements A job can contain multiple job steps executing sequentially or in parallel on independent or shared nodes within the job s node allocation sshare is used to retrieve fair share information for each user sstat allows to query status information about a running job sview is a graphical user interface to get state information for jobs
41. dard output error streams Command format sattach options lt jobid stepid gt Some of the most useful options of sattach are Option Description input filter lt task number gt output filter lt task number gt error filter lt task number gt Transfer the standard input or print the standard output error only from the specified task 1 Add the task number in the beginning of each line of label standard output error layout Print the task layout information of the job step without attaching to its O streams pty Run task number zero in pseudo terminal Examples Attach to the output of job 777 and job step 1 sattach 777 1 Attach to the output of job 777 and job step 2 adding the task ID in the beginning of each line sattach 1 777 2 34 sstat With sstat we can get various status information about running job steps for example minimum maximum and average values for metrics like CPU time Virtual Memory VM usage Resident Set Size RSS Disk I O Tasks number etc Command format sstat OPTIONS Some of the most useful options of sstat are Option Description a Show information about all steps for the specified job allsteps e Show the list of fields that can be specified with the helpformat format option i Show information about the pids for each jobstep pidformat j lt job step gt jobs
42. de is in an advanced reservation and not generally available smap With smap we can get a graphical overview of the cluster It shows information about the nodes and the jobs that are running on them Command format Some of the most useful smap options are Option Description c Send output to the command line without using commandline curses D lt option gt Define the display mode of smap Please read the man display lt option gt pages for more information h Do not print the header of the output noheader H Show information about hidden partitions and their show_hidden jobs i lt seconds gt Repeatedly print information at the specified interval iterate lt seconds gt n lt node_list gt Show information only for the specified nodes nodes lt node_list gt 29 sprio With sprio we can check the priorities of all pending jobs in the queue Command format sprio OPTIONS Some of the most useful sprio options are Option Description h Do not print the header of the output noheader j lt job_id_list gt Show information only about the requested jobs jobs lt job_id list gt 1 Report more information long n Print the the normalized priority factors of the jobs norm o lt output_format gt Specify the information that will be printed columns format lt output_format gt Please read the man pa
43. des in a summarized way sinfo s List all reservations sinfo R Show jobs that belong to a specific user sinfo T Show information for partition devel sinfo p devel 28 Depending on the options the srun command will print the states of the partitions and the nodes The partitions may be in state UP DOWN or INACTIVE The UP state means that a partition will accept new submissions and the jobs will be scheduled The DOWN state allows submissions to a partition but the jobs will not be scheduled The INACTIVE state means that not submissions are allowed The nodes also can be in various states Node state code may be shortened according to the size of the printed field A node can have also a combination of states like IDLE MAINT The following table shows the most common node states Shortened State State Name Description alloc ALLOCATED The node has been allocated comp COMPLETING The job associated with this node is in the state of COMPLETING down DOWN The node is unavailable for use drain DRAINING amp While in DRAINING state any running job on the node will be DRAINED allowed to run until completion After that and in DRAIN state the node will be unavailable for use idle IDLE The node is not allocated to any jobs and is available for use maint MAINT The node is currently in a reservation with a flag of maintenance resv RESERVED The no
44. eful options only for srun Option Description forward x Enable X11 forwarding only for interactive jobs multi prog lt filename gt Run different programs with different arguments for each task specified in a text file pty Execute the first task in pseudo terminal mode r lt num gt Execute a jobstep inside allocation with relative index relative lt num gt of a node exclusive Allocate distinct cores for each task Examples Spawn 48 tasks on 4 nodes 12 tasks per node for 30 minutes srun N4 n48 t 30 executable Spawn 12 tasks on 2 nodes 6 tasks per node specifying in a file the executables for each task srun n12 N2 multi prog tasks conf tasks conf 0 5 hostname 6 11 executable2 Inside a job script execute 6 tasks on 1 node without sharing cores with other job steps srun exclusive n6 N1 mpi prog 25 3 4 Query Commands squeue With squeue we can see the current status information of the queued and running jobs Command format squeue OPTIONS Some of the most useful squeue options are Option Description A lt account_list gt account lt account_list gt List jobs for the specified accounts a Show information about jobs and job steps for all all partitions r Optimized display for job arrays array h Do not print the header of the output noheader i lt seconds gt iterate lt seconds g
45. en client root 2015 07 20T16 59 45 jrc0386 401 NodeHardware root 2015 07 29T21 18 54 jrc0320 403 NodeHardware root 2015 07 30T07 58 53 jrc0340 402 NodeHardware root 2015 07 30T04 12 14 jrc0144 Golden client root 2015 07 20T16 59 45 jrc0386 401 NodeHardware root 2015 07 29T21 18 54 jrc0320 Check reservations sinfo T RESV_NAME STATE START_TIME END TIME DURATION NODELIST test ACTIVE 2014 11 14T15 24 47 2015 10 01T00 00 00 320 07 35 13 Check one partition scontrol show partition batch PartitionName batch AllowGroups ALL AllowAccounts ALL AllowQos ALL AllocNodes ALL Default YES DefaultTime 01 00 00 DisableRootJobs NO GraceTime 0 Hidden NO MaxNodes 64 MaxTime 1 00 00 00 MinNodes 1 LLN NO MaxCPUsPerNode 48 Nodes jrc 0116 0155 0246 0455 Priority 1 RootOnly NO ReqResv NO Shared NO PreemptMode OFF State UP TotalCPUs 12000 TotalNodes 250 SelectTypeParameters N A DefMemPerNode UNLIMITED MaxMemPerNode UNLIMITED Check one node scontrol show node jrc0130 NodeName jrc0130 Arch x86_64 CoresPerSocket 12 CPUAlloc 48 CPUErr 0 CPUTot 48 CPULoad 24 03 Features normal Gres mem128 no_consume 1 NodeAddr jrc0130 NodeHostName jrc0130 Version psslurm 41 p14 03 OS Linux RealMemory 128952 AllocMem 0 Sockets 2 Boards 1 State ALLOCATED ThreadsPerCore 2 TmpDisk 0 Weight 1 BootTime 2015 07 27T11 34 29 SlurmdStartTime 2015 07 27T11 34 53 CurrentWatts 0 LowestJoules 0 ConsumedJoules 0 ExtSensorsJoules n s ExtSensorsWatts 0 ExtSensorsTemp n
46. ended for small interactive jobs focused on development and application optimizations The batch partition is intended for the normal production jobs The default partition is batch which includes the thin and the fat type 1 compute nodes The mem512 partition is intended for memory bounded jobs and includes the fat type 2 nodes The gpus partition includes the nodes that are equipped with 2 Nvidia K80 GPUs Note 4x nvidia devices GPUs available on each node because each K80 card has 2 GPUs inside The vis partition includes both visualization node types with 512 GB and 1024 GB memory and they are equipped with 2 Nvidia K40 GPUs The large partition includes the same nodes as batch and is intended to be used later for very large jobs The maint partition is not for normal usage because it is used only by admins usually during off line maintenance 2 4 Slurm s Accounting Database Slurm manages its own data in two different ways First there is a runtime engine in memory backed up with state files that is managed by slurmctld and second there is the MySQL database that is managed by slurmdbd Slurm stores all the important information in its MySQL database like cluster information events accounts users associations QoS and job history An association is the combination of cluster account user and partition Associations are stored in a tree like hierarchical structure starting with the root node with the accounts as its children and users as
er node 48 mpi prog3

Example 12: In the following example we show a job script where two different job steps are initiated within one job. Two nodes are allocated in total and each job step uses 24 cores on one of the compute nodes. With the option --exclusive we ensure that distinct CPUs (virtual cores) are allocated for each job step. Here the job steps will be executed in parallel: in order to put the processes in the background, an & is needed at the end of each command line, and the wait shell command after the srun commands ensures that the job will wait until all job steps are completed.

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:20:00

srun -N1 --exclusive -n 24 ./mpi_prog1 &
srun -N1 --exclusive -n 24 ./mpi_prog2 &
wait

4.3 Dependency Chains
Slurm supports dependency chains, which are collections of batch jobs with defined dependencies, similar to the job chains of Moab on JUROPA. Job dependencies can be defined using the --dependency argument of sbatch. The format is:

sbatch --dependency=<type>:<jobID> <jobscript>
sbatch -d <type>:<jobID> <jobscript>

The available dependency types for job chains are after, afterany, afternotok and afterok. For more information please check the man page of sbatch.

Example 13: Below is an example of a job script for the handling
48. ere multiple executables can have one common MPI_COMM WORLD communicator For this purpose Slurm provides the option multi prog for the srun command only This option expects a configuration text file as an argument and the format is srun OPTIONS multi prog lt text file gt Each line of the configuration file can have two or three possible fields separated by space and the format is like this lt list of task ranks gt lt executable gt lt possible arguments gt In the first field is defined a comma separated list of ranks for the MPI tasks that will be spawned Possible values are integer numbers or ranges of numbers The second field is the path name of the executable And the third field is optional and defines the arguments of the program Example 16 In this example there is a simple configuration file with name multi conf This file defines three MPI programs For the first executable mpi prog1 only one instance will be executed with rank 0 and one integer argument For the second program mpi prog2 Slurm will create two tasks with ranks 4 and 6 and each one will have the path of a file as argument For the third program mpi prog3 five MPI tasks will be executed with ranks 1 2 3 5 and 7 without any arguments mpi progl 0 mpi prog2 tmp txt 0 4 1 3 5 7 mpi prog3 6 3 Following is the job script that will start this MPMD job The job script allocates 4 nodes fo
49. es Partitions one partition only can overlap and we can specify limits Priorities Complex priorities mechanism Easy to configure maintain and manage The desired batch model from JSC can be easily applied Limits Policy Good support for limits and policies Highly configurable define limits and configuration policies per partition account user Enforce limits with QoS Job scripts Define job script options with MSUB Define job script options with SBATCH In the following table yo u can see some of the differences between Torque and Slurm environment they must use the option Vy Torque Slurm Scheduling Integrates only a simple FIFO Slurm is a capable scheduler with support scheduler needs external scheduler for backfilling algorithm Output files With the default options stores output Standard output and error files are created locally on the nodes Upon completion in the final destination immediately files are gathered at destination Working directory Must explicitly change to current Jobs start to run in the directory where working directory they were submitted from Job Steps Not supported by Torque Flexible allocations within jobs Task Distribution Possible to specify different number of Possible to specify only the same number tasks per set of nodes e g of tasks on all nodes with the allocation 1 nodes 1 ppn 2 nodes 4 ppn 8 options Environment If users wan
50. es free peer reviewed portable C source libraries Homepage http www boost org Versions Boost 1 57 0 Python 2 7 Boost 1 58 0 Python 2 7 Boost 1 58 0 Python 2 7 e o vo To find detailed information about Boost please enter the full name For example module spider Boost 1 58 0 Python 2 7 10 55 Check a specific version of a package module spider Boost 1 58 0 Python 2 7 10 Description Boost provides free peer reviewed portable C source libraries Homepage http www boost org This module can only be loaded through the following modules Stages Stage2 gpsolf 2015 06 Stages Stage2 intel para 2015 06 Stages Stage2 intel para 2015 06 mt Stages Stage2 intel para 2015 07 Stages Stage2 intel para 2015 07 mt Stages Stage2 intel 2015 04 Stages Stage2 intel 2015 07 Stages Stage3 gpsolf 2015 06 Stages Stage3 gpsolf 2015 07 Stages Stage3 intel para 2015 07 Stages Stage3 intel 2015 07 Stages Current gpsolf 2015 06 Stages Current gpsolf 2015 07 Stages Current intel para 2015 07 Stages Current intel 2015 07 Stages Devel gpsolf 2015 06 Stages Devel gpsolf 2015 07 Stages Devel intel para 2015 07 Stages Devel intel 2015 07 Stages Legacy gpsolf 2015 06 Stages Legacy intel para 2015 06 Stages Legacy intel para 2015 06 mt Stages Legacy intel para 2015 07 Stages Legacy intel para 2015 07 mt Stages Legacy Stages Legacy gpsolf 2015 06 gpsolf 2015 07 intel 2015 04 intel 2015 07 intel para 2015 07
51. evel gt SBATCH time lt time gt export OMP_NUM_THREADS SLURM_CPUS_PER_TASK run Hybrid application below with srun 7 2 Modules Check loaded modules module list No modules loaded Check available Toolchains module avail gpsolf 2015 06 gpsolf 2015 07 D gpsmpi 2015 06 gpsmpi 2015 07 D Gcc 5 1 0 intel 2015 07 intel para 2015 07 mt usr local software jureca UI FullToolchains intel para 2015 07 D usr local software jureca UI Compilers MPI iimpi 2015 07 ipsmpi 2015 07 mt ipsmpi 2015 07 D usr local software jureca UI Compilers iccifort 2015 3 187 GCC bare 4 9 3 52 GCC 5 2 0 D iccifort 2015 3 187 D GCC bare 4 9 3 ifort 2015 3 187 GCC bare 4 9 3 icc 2015 3 187 GCC bare 4 9 3 ifort 2015 3 187 D icc 2015 3 187 D usr local software jureca UI Tools EasyBuild 2 2 0 binutils 2 25 ncurses 5 9 util wrapper 1 1 Inspector 2015_updatel gc 7 4 2 popt 1 16 zlib 1 2 8 JUBE 2 0 6 ipp 8 2 2 187 pscom 5 0 45 1 VTune 2015_update4 libatomic_ops 7 4 2 tbb 4 3 6 211 usr local software jureca Devel Developers InstallSoftware D Stages Devel S D Stages Current S Stages Legacy S Where S Module is Sticky requires force to unload or purge D Default Module Use module spider to find all possible modules Use module keyword keyl key2
52. f accessing and managing these data is implemented in slurmdbd In our case slurmdbd is configured to use a MySQL database as the back end storage To interact with slurmdbd and get information from the accounting database Slurm provides commands like sacct and sacctmgr In contrast to the Moab Torque combination where Moab provides scheduling and Torque performs resource management like batch job start or node health monitoring Slurm combines the functionality of the batch system and resource management For this purpose Slurm provides the slurmd daemon which runs on the compute nodes and interacts with slurmctld For the executing of user processes slurmstepd instances are spawned by slurmd to shepherd the user processes On JURECA cluster no slurmd slurmstepd daemons are running on the compute nodes Instead the process management is performed by psid the management daemon from the Parastation Cluster Suite which has a proven track record on the JUROPA system Similar to the architecture of the JUROPA resource management system where a psid plugin called psmom replaces the Torque daemon on the compute nodes a plugin of psid called psslurm replaces slurmd on the compute nodes of JURECA Therefore only one daemon is required on the compute nodes for the resource management which minimizes jitter which can affect large scale applications For the end users there is no real difference visible because of this integration between Slurm and Parastation
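To see what the accounting database actually stores for your own account, sacctmgr can be queried directly from a login node. This is only a sketch; the exact columns that are displayed depend on the site configuration.

    # Show your user record as stored in the Slurm database
    sacctmgr show user $USER

    # Show the associations (cluster, account, user, partition) you belong to
    sacctmgr show associations user=$USER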
53. fy the stdin stdout and stderr file names option sa will be replaced by the value of SLURM_ARRAY_JOB_ID and option a will be replaced by the value of SLURM_ARRAY TASK ID Also each job in an array has its own normal unique job ID This ID is exported in the environment variable SLURM_JOBID Example 14 In the following example the job script will create a job array of 4 jobs with indices 0 3 Each job will run on 1 node with walltime of 1 hour and will execute a different bash script script_ 0 3 sh bin bash SBATCH nodes 1 SBATCH output prog A_ a out SBATCH error prog A_ a err SBATCH time 01 00 00 SBATCH array 0 3 script_ SLURM_ARRAY_TASK_ID sh Example 15 In the following job script a job array of 20 jobs will be submitted with indices 1 20 Each job will run on a separate node with 2 hours walltime limit Some may be running and some may be waiting in the queue For this job array all jobs will execute the same binary prog with different input files input_ 1 20 txt bin bash x SBATCH nodes 1 SBATCH output prog A_ a out SBATCH error prog A_ a err SBATCH time 02 00 00 SBATCH array 1 20 SBATCH partition batch srun N1 ntasks per node 1 prog input_ SLURM_ARRAY TASK_ID txt 46 4 5 MPMD Slurm supports the MPMD model Multiple Program Multiple Data Execution Model that can be used for MPI applications wh
54. ges for more information u lt user_list gt Show information about the jobs of the specified users user lt user_list gt wW Print the configured weights for each factor weights Examples Show information about priorities of all queued jobs in a long format sprio 1 Show priority information for job 777 sprio j 777 Show the priorities of all jobs that belong to the specified user sprio u userl Show priority information in a custom format printing only job ID priority and user sprio o 7i 10Y 8u scontrol This command is primarily used by the administrators to manage Slurm s configuration However it provides also some functionality for the users to manage jobs or query and get some information about the system configuration Here we present the way to query and get various information with scontrol 30 Command format scontrol OPTIONS COMMAND Some of the most useful scontrol query commands are Command Description show hostlist lt host_list gt Return a compressed regular expression for the given comma separated host list show hostlistsorted lt host_list gt Return a compressed and sorted regular expression for the given comma separated host list show hostnames lt host_regex gt Expand the given regular expression to a full list of hosts show job lt job_id gt Show information about all jobs or about the specified job
55. ipt wrap lt command string gt Wraps a command in a simple sh shell script d lt dependency_list gt Delay the start of the job until the specified dependency lt dependency_list gt dependencies have been satisfied 21 These three commands sbatch salloc and srun share many allocation options The most useful and commonly used allocation options are explained in following table Option Description begin lt time gt Delay and schedule job after the specified time cores per socket lt cores gt Allocate nodes with at least the specified number of cores per socket c lt ncpus gt cpus per task lt ncpus gt Number of logical CPUs hardware threads per task This option is only relevant for hybrid OpenMP jobs D lt directory gt Set the working directory of the job e lt filename pattern gt error lt filename pattern gt Path to the job s standard error Slurm supports format strings containing replacement symbols such as j job ID gres lt list of gres gt Comma separated list of GRES H hold Job will be submitted in a held state zero priority Can be released with scontrol release lt job_id gt i lt filename pattern gt input lt filename pattern gt Connect the jobscript s standard input directly to the specified file J lt jobname gt job name lt jobname gt Set the name of the job mail user
56. ir share factor PriorityWeightJobSize job size factor PriorityWeightPartition partition factor PriorityWeightQOS QOS factor Slurm uses five factors to calculate the job priorities Age Fairshare Job Size Partition and QoS The possible range of values for the factors is between 0 0 min and 1 0 max For each factor we have defined a weight that is used in the job priority equation Following is the list of weights we have configured Weight Value WeightQOS 100 000 WeightAge 32 500 WeightJobSize 14 500 WeightFairshare 3 000 WeightPartition 0 It is clear now that QoS plays an important role for the calculation of the priorities With the different QoSs that have been defined it is possible to create different priority ranges according to the contingent of the users Below follows a table with the priority ranges for each contingent mode Contingent Status Priority Ranges normal 100 001 150 000 lowcont 50 001 100 000 nocont 0 50 000 suspended For each contingent state the available range for priorities is 50k and is calculated from three factors a job age b job size and c fair share In current setup the partition factor is not used which means no difference in the priorities between different partitions 14 On the compute nodes the whole shell environment is passed to the jobs during submission With some options of the alloc
57. le an MPI application will start 96 tasks on 4 nodes running 24 tasks per node no SMT requesting a walltime limit of 15 minutes in batch partition Each MPI task will run on a separate core of the CPU bin bash SBATCH nodes 4 SBATCH ntasks 96 SBATCH output mpi out j SBATCH error mpi err j SBATCH time 00 15 00 SBATCH partition batch srun N4 ntasks per node 24 mpi prog MPI jobs with SMT On each node there are 28 real cores available and with SMT enabled 48 virtual cores In order to enable SMT the users just have to request from Slurm to allocate more than 24 CPUs on each compute node Following there are some examples where SMT is enabled Example 5 In this example we have an MPI application starting 1536 tasks in total on 32 nodes using 48 logical CPUs hardware threads per node SMT enabled requesting a time period of 20 minutes The batch partition is used bin bash x SBATCH nodes 32 SBATCH ntasks 1536 SBATCH ntasks per node 48 can be omitted SBATCH output mpi out j SBATCH error mpi err j SBATCH time 00 20 00 SBATCH partition batch srun mpi prog Example 6 In this example the job script will start the program mpi prog on 4 nodes using 48 MPI tasks per node where two MPI tasks will be executed on each physical core bin bash SBATCH nodes 4 SBATCH ntasks 192 can be omitted SBATCH ntasks per node
ll be achieved through SMT. However, applications may show a smaller benefit or even slow down when using SMT. Please note that the process binding may have a significant impact on the measured run times t1 and t2.

2.10 Processor Affinity
Each JURECA compute node features 24 physical and 48 logical cores. The Linux operating system on each node has been designed to balance the computational load dynamically by migrating processes between cores where necessary. For many high performance computing applications, however, dynamic load balancing is not beneficial, since the load can be predicted a priori and process migration may lead to performance loss on the JURECA compute nodes, which fall in the category of Non-Uniform Memory Access (NUMA) architectures. To avoid process migration, processes can be pinned, or bound, to a logical core through the resource management system. A pinned process or thread is bound to a specific set of cores (which may be a single or multiple logical cores) and will only run on the cores in this set.
While the available options of srun are standard across all Slurm installations, the implementation of process affinity is done in plugins and thus may differ between installations. On JURECA a custom pinning implementation is used. In contrast to other options, the processor affinity
59. locate resources for a certain walltime and then run many parallel jobs in that frame The following table describes the most common or necessary allocation options that can be defined in a job script Option Default value Description SBATCH nodes lt number gt 1 Number of nodes for the allocation SBATCH N lt number gt SBATCH ntasks lt number gt 1 Number of tasks MPI processes Can be omitted if SBATCH n lt number gt nodes and ntasks per node are given SBATCH ntasks per node lt num gt 1 Number of tasks per node If keyword omitted the SBATCH tasks per node lt num gt default value is used but there are still available maximum 56 CPUs per node for current allocation SBATCH cpus per task lt num gt 1 Number of threads VCores per task Used only for SBATCH c lt num gt OpenMP or hybrid jobs SBATCH output lt path gt slurm lt jobID gt out Path to the file for the standard output SBATCH o lt path gt SBATCH error lt path gt slurm lt jobID gt out Path to the file for the standard error SBATCH e lt path gt SBATCH time lt walltime gt Depends on the Requested walltime limit for the job SBATCH t lt walltime gt partition SBATCH partition lt name gt batch Partition to run the job Currently available batch and SBATCH p lt name gt devel partitions SBATCH mail user lt email gt username Email address for notifica
60. lt job step gt Show information for the specified jobs or jobsteps n noheader Do not display the header in the beginning of the output o lt field list gt format lt field_list gt fields lt field_list gt Specify the comma separated list of fields that will be displayed in the output Available fields can be found with e option or in the man pages P Print information in a parsable way Output will be parsable delimited with P parsable2 Examples Display default status information for job 777 sstat j 777 Display the defined metrics for job 777 in parsable format sstat P format JobID AveCPU AvePages AveRSS AveVMSize j 777 35 3 7 Job Accounting Commands sacct With sacct we can get accounting information and data for the jobs and jobsteps that are stored in Slurm s accounting database Slurm stores the history of all jobs in the database but each user has permissions to check only his her own jobs Command format sacct OPTIONS Some of the most useful options of sacct are Option Description b Show a brief listing with the fields jobid status and brief exitcode e Show the list of fields that can be specified with the helpformat format option E lt end_time gt endtime lt end_time gt List jobs with any state or with specified states using option state before the given d
mgr show user

3.8 Custom commands from JSC

llview
llview is a cluster monitoring tool implemented in JSC that shows a graphical overview of the cluster. The nodes are grouped and presented per rack, and a different coloring is used per job for each allocation on the nodes. The GUI shows the list of all current jobs in the queue and also gives information about the utilization of the cluster. Below in Figure 6 there is a screenshot of llview.

[Figure 6: llview screenshot of JURECA, showing the per-rack node display, the overall utilization, and the list of running and waiting jobs.]
62. n 2 7 10 flex 2 5 38 GDB 7 8 flex 2 5 39 D GEOS 3 4 2 freetype 2 5 5 GLib 2 42 2 gettext 0 19 4 GMP 5 1 3 git 2 3 2 53 GMP 6 0 0a GPAW 0 10 0 11364 Python 2 7 10 GROMACS 5 0 5 hybrid GSL 1 16 GTI 1 4 0 HDF5 1 8 14 gpfs HPL 2 1 Harminv 1 3 1 Hypre 2 10 0b IOR 2 10 3 mpiio JasPer 1 900 1 LAMMPS 20150210 LWM2 1 1 Libint 1 1 5 Libint 2 0 3 M4 1 4 17 METIS 5 1 0 MUMPS 5 0 0 parmetis MUST 1 4 0 MethPipe 3 3 1 NAG Mark24 NASM 2 11 08 NCO 4 4 8 OPARI2 1 1 3 OPARI2 1 1 4 OTF 1 12 5 OTF2 1 5 1 OpenFOAM 2 0 1 OpenFOAM 2 2 2 OpenFOAM 2 3 1 OpenSSL 1 0 1i PAPI 5 4 0 PCRE 8 36 PMA PDT 3 20 PETSc 3 6 0 downloads complex debug PETSc 3 6 0 downloads complex PETSc 3 6 0 downloads debug PETSc 3 6 0_ downloads _int8 debug PETSc 3 6 0 downloads _int8 PETSc 3 6 0 downloads ParMETIS 4 0 3 Perl 5 20 1 PnMPI 1 2 0 PnMPI 1 4 0 PyYAML 3 10 Python 2 7 10 Python 2 7 10 Python 3 4 3 0t 4 8 5 QuantumESPRESSO 5 1 1 SCOTCH 6 0 3 SION1ib 1 5 5 SQLite 3 8 8 1 Scalasca 2 2 Scalasca 2 2 1 D D D D D D D D D glproto 1 4 17 gnuplot 5 0 0 grace 5 1 25 h5py 2 4 0 Python 2 7 10 h5py 2 5 0 Python 2 7 10 D inputproto 2 3 kbproto 1 0 6 1ibICE 1 0 9 1ibsM 1 2 2 libX11 1 6 1 libXau 1 0 8 libXaw 1 0 12 libXmu 1 1 2 libXpm 3 5 11 libxt 1 1 4 libdrm 2 4 60 libdwarf 20140805 libelf 0 8 13 libffi 3 2 1 libgd 2 1 1 libpciaccess 0 13 3 libpng 1 6 16 libpth
63. name count total omp_get_num_threads Print out printf Hello world from processor s processor name rank size total Finalize the MPI environment MPI_Finalize return 0 Compile the MPI program mpicc o mpi prog mpi c Compile the Hybrid program mpicc openmp o hybrid prog hybrid c 7 4 Job submission Job script for an MPI job file mpiscript sh bin bash SBATCH J mpitest SBATCH N 4 SBATCH ntasks per node 24 SBATCH o mpitest j out SBATCH e mpitest j err SBATCH mail type END SBATCH mail user c paschoulas fz juelich de SBATCH partition batch SBATCH time 00 30 00 run MPI application below with srun srun N 4 ntasks per node 24 mpi prog Submit the MPI job script sbatch mpiscript sh Job script for a Hybrid job file hybridtest sh bin bash SBATCH J hybridtest SBATCH N 4 SBATCH ntasks per node 24 SBATCH cpus per task 2 SBATCH o hybridtest j out SBATCH e hybridtest j err SBATCH mail type END SBATCH mail user c paschoulas fz juelich de SBATCH partition batch SBATCH time 00 30 00 export OMP_NUM_THREADS SLURM_CPUS_PER_TASK run Hybrid application below with srun count omp_get_num threads rank d out of d processors srun N 4 ntasks per node 24 c SLURM_CPUS_PER TASK hybrid prog 58 OpenMP threads sd n Submit the Hybrid job script sbatch
64. nformation and to simplify their work with some utility commands Here is the list of all Slurm s user commands salloc is used to request interactive jobs allocations When the job is started a shell or other program specified on the command line is started on the submission host login node From the shell srun can be used to interactively spawn parallel applications The allocation is released when the user exits the shell sattach is used to attach standard input output and error plus signal capabilities to a currently running job or job step One can attach to and detach from jobs multiple times sbatch is used to submit a batch script which can be a bash Perl or Python script The script will be executed on the first node in the allocation chosen by the scheduler The working directory coincides with the working directory of the sbatch directory Within the script one or multiple srun commands can be used to create job steps and execute MPI parallel applications Note mpiexec is not supported on JURECA srun is the only supported method to spawn MPI applications In the future the mpirun command from Intel MPI may be supported scancel is used to cancel a pending or running job or job step It can also be used to send an arbitrary signal to all processes associated with a running job or job step sbcast is used to transfer a file to all nodes allocated for a job This command can be used only inside a job script sgather is use
65. o protect the SSH key with a non trivial pass phrase to fulfill the FZJ security policy The generated public ssh key contained in the file id_dsa pub or id_rsa pub on user s workstation must be uploaded through the web interface from Dispatch when initially applying for a user account on JURECA system This SSH key afterwards will be automatically stored in the file SHOME ssh authorized_keys on the cluster 1 5 Shell Environment The default shell for all users on JURECA is BASH bin bash After a successful login user s shell environment is defined in files HOME bash_profile and HOME bashrc Since the GPFS filesystems are shared between different clusters in JSC that means the users home directories are also shared on all system where the users have access to This makes it more difficult for the users to create the correct or desired shell environment for each system In order to solve this issue a file has been created on all systems which contains a string with the system s name The file is etc FZJ systemname This file is available on all login and compute nodes The users can read this file and depending on the system they are logged in they can set the desired environment On JURECA the string that is stored in that file is jureca 1 6 Modules The installed software on JURECA is organized through a hierarchy of modules Loading a module adapts your environment variables to give you acce
66. o the remote shell it is possible to run srun again from that remote shell in order to execute interactively applications without any delays no scheduling delays since the allocation has already been granted Below follows a transcript of an exemplary interactive session salloc nodes 2 time 00 01 00 salloc Pending job allocation 4749 salloc job 4749 queued and waiting for resources salloc job 4749 has been allocated resources salloc Granted job allocation 4749 hostname jr103 srun ntasks 2 ntasks per node 2 hostname jrc01l61 jrc0162 srun cpu bind none nodes 1 ntasks 1 pty bin bash i hostname jrc0161 logout hostname jr103 exit exit salloc Relinquishing job allocation 4749 salloc Job allocation 4749 has been revoked 48 Note When the users want to start a remote shell on the compute nodes they should always give the option cpu_bind none to the srun command in order to disable the default pinning If this option is not given then the default CPU binding settings will pin the processes in an unexpected way e g sometimes restricting the processes on one core only Here is an example how it should be used srun cpu_bind none nodes 1 ntasks 1 pty bin bash i 5 2 X Forwarding The X11 forwarding support has been implemented with the forward x option of the srun command It is similar to the option msub x from Moab X11 for
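For a single short test it is not even necessary to keep an allocation open: srun can create the allocation itself, run one parallel command and release the nodes again. A sketch, where the partition, sizes and program name are placeholders:

    # One-shot interactive run: allocate 2 nodes in devel, run the program, release them
    srun -p devel -N 2 -n 8 -t 00:10:00 ./mpi_prog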
of job chains. The script submits a chain of ${NO_OF_JOBS} jobs. A job will only start after successful completion of its predecessor. Please note that a job which exceeds its time limit is not marked successful.

#!/bin/bash -x
# submit a chain of jobs with dependency
# number of jobs to submit
NO_OF_JOBS=<no of jobs>
# define jobscript
JOB_SCRIPT=<jobscript>
echo "sbatch ${JOB_SCRIPT}"
JOBID=$(sbatch ${JOB_SCRIPT} 2>&1 | awk '{print $(NF)}')
I=0
while [ ${I} -le ${NO_OF_JOBS} ]; do
    echo "sbatch -d afterok:${JOBID} ${JOB_SCRIPT}"
    JOBID=$(sbatch -d afterok:${JOBID} ${JOB_SCRIPT} 2>&1 | awk '{print $(NF)}')
    let I=${I}+1
done

4.4 Job Arrays
Slurm supports job arrays and offers a mechanism to easily manage these collections of jobs. Job arrays are only supported for the sbatch command and, as we described previously, they can be defined using the options --array or -a. To address a job array, Slurm provides a base array ID and an array index unique for each job. The format for specifying an array job is the base array job ID, followed by an underscore and then the array index: <base job id>_<array index>. Slurm exports two environment variables that can be used in the job script to identify each array job:
SLURM_ARRAY_JOB_ID   base array job ID
SLURM_ARRAY_TASK_ID  array index
Some additional options are available to speci
68. one assigned CANCELLED 0 0 302 kraused 2 None assigned CANCELLED 0 0 303 kraused 2 None assigned CANCELLED 0 0 304 kraused 2 None assigned CANCELLED 0 0 305 kraused 2 None assigned CANCELLED 0 0 306 kraused 2 None assigned CANCELLED 0 0 307 kraused 2 None assigned CANCELLED 0 0 308 kraused 2 None assigned CANCELLED 0 0 309 kraused 2 None assigned CANCELLED 0 0 310 kraused 2 None assigned CANCELLED 0 0 311 kraused 2 None assigned CANCELLED 0 0 312 kraused 2 None assigned CANCELLED 0 0 313 kraused 2 None assigned CANCELLED 0 0 314 kraused 2 None assigned CANCELLED 0 0 35 kraused 2 None assigned CANCELLED 0 0 316 kraused 70 jrc 0282 0306 CANCELLED 0 0 317 kraused 70 jrc 0106 0125 CANCELLED 0 0 318 kraused 2 None assigned CANCELLED 0 0 319 kraused 2 None assigned CANCELLED 0 0 320 kraused 2 None assigned CANCELLED 0 0 321 kraused 2 None assigned CANCELLED 0 0 36016 kraused 1 jrc0454 FAILED 127 0 36017 kraused El jrc0455 FAILED 127 0 36018 kraused 1 jrc0307 COMPLETED 0 0 36019 kraused 1 jrc0411 COMPLETED 0 0 36020 kraused El jrc0412 COMPLETED 0 0 36021 kraused 1 jrc0413 COMPLETED 0 0 36022 kraused 1 jrc0414 COMPLETED 0 0 36023 kraused 1 jrc0415 COMPLETED 0 0 36024 kraused i jrc0429 COMPLETED 0 0 36025 kraused 1 jrc0430 COMPLETED 0 0 36026 kraused 1 jrc0431 COMPLETED 0 0 36027 kraused i jrc0432 COMPLETED 0 0 36028 kraused l jrc0433 COMPLETED 0 0 36029 kraused 1 jrc0298 COMPLETED 0 0 36030 kraused 1 jrc0299 COMPLETED 0 0
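Listings like the one above quickly become unwieldy. Restricting sacct to a time window and a few columns is usually enough; the following is a sketch that uses standard sacct format fields and assumes a GNU date command is available.

    # Only top-level jobs (no job steps) of the last 7 days, with a compact set of columns
    sacct -X -u $USER -S $(date -d '7 days ago' +%F) \
          --format=JobID,JobName%20,Partition,NNodes,State,ExitCode,Elapsed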
options need to be passed directly to srun and must not be given to sbatch or salloc. In particular, the options cannot be specified in the header of a batch script. Since the majority of applications benefit from strict pinning that prevents migration, all tasks in a job step are pinned to a set of cores unless this is explicitly prevented; the pinning implementation heuristically determines the optimal core set based on the job step specification. In job steps with --cpus-per-task=1 (the default) each task (process) is assigned to, and pinned on, a single logical core, as shown in Figure 3. In job steps with a --cpus-per-task count larger than one, e.g. threaded applications, each task is assigned a set of cores whose cardinality matches the value of --cpus-per-task, see Figure 4. Slurm allows users to modify the process binding by means of the --cpu_bind option of srun.

[Figures 3 and 4: default processor affinity on a JURECA compute node, for job steps with one logical CPU per task and with several logical CPUs per task.]

Note: The option --cpu_bind=cores is not supported on JURECA and will be rejected.
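The exact set of --cpu_bind values accepted by the JURECA pinning implementation is not listed here, so the lines below are only a sketch using values from the standard srun documentation. The option is given to srun inside the job script, never to sbatch.

    # Bind each task to hardware threads and report the resulting binding masks
    srun --cpu_bind=verbose,threads -n 24 ./mpi_prog

    # Disable pinning completely, e.g. for tools that manage their own affinity
    srun --cpu_bind=none -n 24 ./my_tool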
70. p User local binaries GPFS usr local Software repository available via module commands The GPFS filesystems on JURECA are mounted from JUST storage cluster JUQUEEN and JUDGE users should be aware that they will work in the same HOME and WORK directories as on these production machines Please note that JSC has already done an automatic migration of all user data from Lustre to GPFS for HOME and WORK directories The old home can be found under juropa 1 4 Access to the Cluster Users can have access to the login nodes of the system only through SSH connections There are 12 login nodes in total There is configured a round robin shared hostname between the login nodes jureca fz juelich de The users can still connect to specific login nodes by using the individual hostname of each node jureca 01 12 fz juelich de For example to connect to the system users must execute from their workstation the following command ls ssh username jureca fz juelich de or to a specific login node the second one for example ssh username jureca02 fz juelich de It is not possible to login by suppling username password credentials Instead password free login based on SSH key exchange is required The public private ssh key pair has to be generated on the workstation you are using for accessing JURECA On Linux or UNIX based systems the key pair can be generated by executing ssh keygen t dsa rsa It is required t
71. r 1 hour The command srun will start this MPMD application where all 4 nodes will be used with 2 MPI tasks per node 8 tasks in total It can be submitted with sbatch bin bash SBATCH nodes 4 SBATCH time 01 00 00 srun N4 ntasks per node 2 multi prog multi conf The multi prog option can be used of course for any kind of binary and its usage is not restricted to MPI jobs only but it is the only way to apply the MPMD model 47 5 Interactive Jobs 5 1 Interactive Session Interactive sessions can be allocated using the salloc command The following command for example will allocate 2 nodes for 30 minutes salloc nodes 2 time 00 30 00 Once an allocation has been made the salloc command will start a bash on the login node where the submission was done After a successful allocation the users can execute srun from that shell and they can spawn interactively their applications For example srun ntasks 4 ntasks per node 2 cpus per task 7 hybrid prog The interactive session is terminated by exiting the shell In order to obtain a shell on the first allocated compute nodes like command msub 1 from Moab the users can start a remote shell from within the current session and connect it to a pseudo terminal pty using the srun command with a shell as an argument For example srun cpu_bind none nodes 2 pty bin bash After gaining access t
72. re are plenty of possible job states for Slurm The following table describes the most common states State Code State Name Description CA CANCELLED Job was explicitly cancelled by the user or an administrator The job may or may not have been initiated cD COMPLETED Job has terminated all processes on all nodes CF CONFIGURING Job has been allocated resources but is waiting for them to become ready for use CG COMPLETING Job is in the process of completing Some processes on some nodes may still be active Usually Slurm is running job s epilogue during this state F FAILED Job terminated with non zero exit code or other failure condition NF NODE FAIL Job terminated due to failure of one or more allocated nodes PD PENDING Job is awaiting resource allocation R RUNNING Job currently has an allocation Note Slurm is always running the prologue at the beginning of each job before the actual execution of user s application TO TIMEOUT Job terminated upon reaching its walltime limit sview With sview we get a graphical overview of the cluster It shows information about system configuration partitions nodes jobs reservations Some actions also are possible through the GUI No options are available for sview Users can just call the command and they will get the graphical window sinfo With sinfo we can get information and check the current state of partitions nodes and reservation
73. read stubs 0 3 libreadline 6 3 librsb 1 2 0 rcl libtool 2 4 6 libunwind 1 1 libxc 2 0 2 bxe 2 222 D libxcb 1 11 Python 2 7 10 libxm12 2 9 2 libyam1 0 1 5 makedepend 1 0 5 matplotlib 1 3 1 Python 2 7 10 motif 2 3 4 mpiP 3 4 1 netCDF 4 3 2 netCDF C 4 2 netCDF C 4 4 2 1 netCDF Fortran 4 4 2 netCDF4 python 1 1 7 1 Python 2 7 10 numpy 1 7 1 Python 2 7 10 parallel netcdf 1 6 0 pkg config 0 28 sprng 1 sundials 2 6 1 tcsh 6 18 01 xbitmaps 1 1 1 xcb proto 1 11 Python 2 7 10 xextproto 7 3 0 xorg macros 1 19 0 xproto 7 0 27 xtrans 1 3 5 zsh 5 0 2 usr local software jureca Stage3 modules all Toolchain iccifort 2015 3 187 GCC bare 4 9 3 psmpi 5 1 4 1 D impi 5 1 0 079 psmpi 5 1 4 1 mt usr local software jureca Stage3 modules all Toolchain ipsmpi 2015 07 imk1 11 2 3 187 usr local software jureca UI FullToolchains gpsolf 2015 06 intel 2015 07 gpsolf 2015 07 D intel para 2015 07 D intel para 2015 07 mt usr local software jureca UI Compilers MPI gpsmpi 2015 06 gpsmpi 2015 07 D iimpi 2015 07 ipsmpi 2015 07 mt ipsmpi 2015 07 D usr local software jureca UI Compilers iccifort 2015 3 187 GCC bare 4 9 3 iccifort 2015 3 187 D GCC 5 1 0 GCC 5 2 0 D 54 GCC bare 4 9 3 ifort
74. s t lt job_state_name gt state lt job state name gt Restrict the operation only to the jobs that have the specified state Please check the man page u lt user name gt user lt user name gt Cancel job s only from the specified user If no job ID is given then cancel all jobs of this user Examples Cancel jobs with ID 777 and 778 scancel 777 778 Cancel jobs with the specified names scancel n testjobl testjob2 Cancel all jobs in queue pending running etc from user1 scancel u userl Cancel all jobs in partition devel that belong to user1 scancel p devel u userl Cancel all jobs from user1 that are in pending state scancel t PENDING u userl 33 scontrol The scontrol command can be also used to manage and do some actions on the jobs Command Description hold lt job_list gt Prevent a pending job from being started release lt job_list gt Release a previously held job so it can start notify lt job_id gt lt message gt Send message to the standard error stderr of a job Examples Put jobs 777 and 778 in hold scontrol hold 777 778 Release job 777 from hold scontrol release 77 3 6 Job Utility Commands sattach With sattach we can attach to a running job step and get or manage the IO streams of the tasks in that job step By default without options it attaches to the stan
75. s 60 jrc0128 Check the shares sshare Account User Raw Shares Norm Shares Raw Usage Effectv Usage FairShare root 1 000000 7347830935 1 000000 0 500000 root paschoul 1 0 000002 0 0 000000 1 000000 deep 3000 0 004902 0 0 000000 1 000000 eau00 3000 0 004902 0 0 000000 1 000000 ecy00 3000 0 004902 460420812 0 062667 0 000142 esmil7 3000 0 004902 0 0 000000 1 000000 esmil9 3000 0 004902 1494 0 000000 0 999971 esmi20 3000 0 004902 65827880 0 008957 0 281814 grs200 3000 0 004902 1257977 0 000171 0 976079 grs300 3000 0 004902 478 0 000000 0 999991 grs400 3000 0 004902 45140069 0 006144 0 419461 hac29 3000 0 004902 87 0 000000 0 999998 hbn15 3000 0 004902 0 0 000000 1 000000 hbn23 3000 0 004902 0 0 000000 1 000000 hbn29 3000 0 004902 79553390 0 010826 0 216354 hbn30 3000 0 004902 19171256 0 002609 0 691441 hbn31 3000 0 004902 117928315 0 016051 0 103343 hbn32 3000 0 004902 0 0 000000 1 000000 hbn33 3000 0 004902 0 0 000000 1 000000 zam 3000 0 004902 66646822 0 009071 0 277284 zam paschoul 3000 0 000037 46031 0 000075 0 247122 zdv590 3000 0 004902 0 0 000000 1 000000 Check the priorities sprio JOBID PRIORITY AGE FAIRSHARE JOBSIZE QoS 203 46776 32500 0 14277 0 1771 46776 32500 0 14277 0 6659 34303 32500 15 2788 0 6660 34303 32500 15 2788 0 6767 36084 32500 15 3569 0 8435 1016 794 0 223 0 8597 5208 4985 0 223 0 8633 932 710 0 223 0 8797 32555 32500 0 56 0 8801 32555 32500 0 56 0 8805 32555 32500 0 56 0 8886 1304 1082 0 223 0 8996 14
76. s This command is useful for checking the availability of the nodes Command format sinfo OPTIONS 27 Some of the most useful sinfo options are Option Description a Show information about all partitions all d Show information only for the non responding dead dead nodes i lt seconds gt iterate lt seconds gt Repeatedly print information at the specified interval 1 long Report more information n lt nodes gt nodes lt nodes gt Show information only about the specified nodes N Node Show information in a node oriented format o lt output_format gt format lt output_ format gt Specify the information that will be printed columns Please read the man pages for more information p lt partition gt partition lt partition gt Show information in a node oriented format r responding Show information only for the responding nodes R list reasons List the reasons why nodes are not in a healthy state s summarize List partitions without many details for the nodes t lt states gt states lt states gt List nodes only with the specified state e g allocated down drain idle maint etc T reservation Show information about the reservations Examples Show information about nodes in idle state sinfo t idle Show information about partitions and no
77. s created where the users belong to Each user has available contingent from one project only CPU Quota modes monthly and fixed The projects are charged on a monthly base or get a fixed amount until it is completely used Contingent CPU Quota states for the projects normal low contingent no contingent Contingent priorities normal gt lowcont gt nocont Users without contingent get a penalty to the priorities of their jobs but they are still allowed to submit and run jobs 2 Batch System Slurm 2 1 Slurm Overview Slurm is the Batch System Workload Manager of JURECA cluster Slurm Simple Linux Utility for Resource Management is a free open source resource manager and scheduler It is a modern extensible batch system that is widely deployed around the world on clusters of various sizes A Slurm installation consists of several programs and daemons The Slurm control daemon slurmctld is the central brain of the batch system responsible for monitoring the available resources and scheduling batch jobs The slurmctld runs on an administrative node with a special setup to ensure availability of the services in case of hardware failures Most user programs such as srun sbatch salloc and scontrol interact with the slurmctld For the purpose of job accounting slurmctld communicates with Slurm database daemon slurmdbd Slurm stores all the information about users jobs and accounting data in its own database The functionality o
78. show node lt node_name gt Show information about all nodes or about the specified node show partition lt partition_name gt Show information about all partitions or about the specified one show reservation lt reservation_name gt Show information about all reservations or about the specified one show step lt step_id gt Show information about all jobsteps or about the specified one Examples Expand and print a list of hostnames for the specified range scontrol show hostname jrc 0106 0115 Show information about the job 777 scontrol show job 777 Show information about the node jrc0117 scontrol show node jrc0117 Show information about the partition batch scontrol show partition batch sshare With sshare we can retrieve fairshare information and check the current value of the fairshare factor that is used to calculate the priorities of the jobs Command format sshare OPTIONS 31 Some of the most useful options of sshare are Option Description A lt account_list gt accounts lt account_list gt Show information for the specified accounts By default users belong only to one account h Do not display the header in the beginning of the noheader output 1 Show more information long P Print information in a parsable way Delimit output parsable with with a in the en
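A minimal sketch of typical sshare calls; replace the account name with your own project.

    # Long fair-share listing for your own user
    sshare -l -u $USER

    # Fair-share standing of a whole account
    sshare -l -A zam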
79. ss to a specific set of software and its dependencies The hierarchical organization of the modules ensures that you get a consistent set of dependencies for example all built with the same compiler version or all relying on the same implementation of MPI The module hierarchy is built upon toolchains Toolchain modules in the lowest level contain just a compiler suite like Intel compilers icc and ifort Toolchains in the second level contain a compiler suite and a compatible implementation of MPI The third and highest level contains full toolchains with a compiler suite an MPI implementation and compatible mathematical libraries such as SCALAPACK An application is only accessible to the user when its module is loaded You can load the application module only when the toolchain modules containing its dependencies are loaded first Modules and Toolchains hierarchy If you know the dependencies of the application you would like to run you can simply load a Toolchain module bundle from one of the three levels Compilers Compilers MPI or FullToolchains Here is a quick reference to the tools provided by each toolchain module Type Modules available Compilers GCC Gnu compilers with frontends for C C Objective C Fortran Java amp Ada ifort Intel Fortran compiler icc Intel C and C compilers iccifort icc ifort Intel C and Fortran compilers together Compilers MPI gpsmpi2 GCC Parastation MPICH MPI ipsmpi2 icc ifort
80. t Repeatedly print information at the specified interval 1 long Report more information o lt output_format gt format lt output_format gt Specify the information that will be printed columns Please read the man pages for more information p lt part_list gt partition lt part_list gt List jobs only from the specified partitions R lt reservation name gt reservation lt reservation name gt List jobs only for the specified reservation S lt sort_list gt sort lt sort_list gt Specify the order of the listed jobs start Print the expected start time for each job in the queue t lt state_list gt states lt state_list gt List jobs only with the specified state failed pending running etc u lt user list gt user lt user list gt Print the jobs of the specified user Examples Repeatedly print queue status every 4 seconds squeue i 4 Show jobs in the devel partition _ squeue p devel 26 Show jobs that belong to a specific user squeue u user0l Print queue status with a custom format showing only job ID partition user and job state squeue format 18i 9P 8u 2t Normally the jobs will pass through several states during their life cycle Typical job states from submission until completion are PENDING PD RUNNING R COMPLETING CG and COMPLETED CD However the
81. t batch Maximum job wall clock time normal nocont 1 day 6 hours default partition Default wall clock time for each job 1 hour batch jobs thin fat type 1 Minimum Maximum number of nodes per job Default number of nodes for each job Max number of running submitted jobs per user 1 128 nodes 1 node QoS dependent mem512 Maximum job wall clock time normal nocont 1 day 6 hours mem bounded jobs Default wall clock time for each job 1 hour fat type 2 Minimum Maximum number of nodes per job 1 32 nodes Default number of nodes for each job 1 node Max number of running submitted jobs per user QoS dependent gpus Maximum job wall clock time normal nocont 1 day 6 hours gpu acc jobs Default wall clock time for each job 1 hour 2x Nvidia K80 Minimum Maximum number of nodes per job 1 32 nodes Default number of nodes for each job 1 node Max number of running submitted jobs per user QoS dependent vis Maximum job wall clock time normal nocont 1 day 6 hours gpu acc jobs Default wall clock time for each job 1 hour 2x Nvidia K40 i Minimum Maximum number of nodes per job 1 4 nodes Default number of nodes for each job 1 node Max number of running submitted jobs per user QoS dependent large Same as batch except the max nodes limit overlaps with batch Note Normally it will be in state DOWN maint Used by admins during offline maintenance 10 The devel partition is int
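One way to read the table is as a checklist for the allocation options of sbatch. For illustration only (the job script name is a placeholder), a submission that stays within the batch partition limits could look like this:

    # 64 nodes for 6 hours in the batch partition, within the node and walltime limits above
    sbatch --partition=batch --nodes=64 --time=06:00:00 jobscript.sh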
82. t to export the whole shell The environment defined in user s shell during submission will be automatically exported to the job 50 6 2 User Commands Comparison The following table presents commands with similar functionality from Slurm Moab and Torque User Commands Slurm Moab Torque Job Submission sbatch msub qsub Job deletion scancel canceljob qdel Job status squeue checkjob qstat scontrol show job Job hold scontrol hold mjobctl h qhold Job release scontrol release mjobctl u qris Queue list squeue showq qstat Q Cluster status sinfo qstat a Node list scontrol show nodes pbsnodes 1 GUI sview xpbsmon The table below compares the allocation options of msub and sbatch Allocation option Moab Torque msub Slurm sbatch Number of nodes 1 nodes lt number gt nodes lt number gt N lt number gt Number of total tasks None ntasks lt number gt n lt number gt Number of tasks cpus per node 1 ppn lt number gt ntasks per node lt num gt tasks per node lt num gt Number of threads per task v tpt lt number gt Ccpus per task lt num gt c lt num gt File for the standard output o lt path gt output lt path gt o lt path gt File for the standard error e lt path gt error lt path gt e lt path gt Walltime limit 1 walltime lt time gt time lt walltime gt t lt w
83. tion about the types of resources that the users are requesting allocating The following table includes all configured generic resources on JURECA GRES Name Description mem128 128 GB memory on node mem256 256 GB memory on node mem512 512 GB memory on node mem1024 1024 GB memory on node gpu Node equipped with GPUs The following table show the GRES that are configured for each node type Node Type List of GRES thin mem128 fat type 1 mem128 mem256 fat type 2 mem128 mem256 mem512 gpu mem128 gpu 4 vis type 1 mem128 mem256 mem512 gpu 2 vis type 2 mem128 mem256 mem512 mem1024 gpu 2 As it is shown on the previous table Slurm allows to define multiple resources for each type of nodes The mem GRES is not consumable and it has one count but for the gpu GRES it is configured a number which defines how many GPUs are available for each node type 12 The following table shows the partitions and the list of GRES for the nodes that are included in them Partition List of GRES devel mem128 batch mem128 mem256 mem512 mem512 mem128 mem256 gpu mem128 gpu 4 vis mem512 mem1024 gpu 2 mem128 mem256 large mem128 mem256 maint mem128 mem256 mem512 mem1024 gpu 2 4 During job submissions Slurm will deny any submission when a user requests a GRES that is not configured for the desired partition or set of nodes
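For illustration, two hedged submission sketches that combine a partition with a matching GRES from the tables above; the script names are placeholders and the authoritative examples are given elsewhere in this guide.

    # Request 2 nodes in the gpus partition with all 4 GPUs on each node
    sbatch -N 2 -p gpus --gres=gpu:4 jobscript.sh

    # Request fat type 2 nodes through the mem512 partition and the matching GRES
    sbatch -N 2 -p mem512 --gres=mem512 jobscript.sh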
84. tions SBATCH mail type lt mode gt NONE Event types for email notifications SBATCH job name lt jobname gt jobscript s name Job name SBATCH J lt jobname gt SBATCH gres lt list gt mem128 Generic resources Multiple srun calls can be placed in a single batch script Options such as nodes ntasks and ntasks per node are by default taken from the sbatch arguments but can be overwritten for each srun invocation If nasks per node is omitted or set to a value higher than 24 then SMT simultaneous multi threading will be enabled Each compute node has 24 physical cores and 48 logical cores As we described before the job script is submitted using sbatch OPTIONS lt jobscript gt On success sbatch writes the job ID to standard out Note In case some allocation options are defined in both command line and inside the job script then the options that were given as arguments in the command line will be used and the options in the job script will be ignored 40 4 1 Job script examples Serial job Example 1 Here is a simple example where some system commands are executed inside the job script This job will have the name TestJob One compute node will be allocated for 30 minutes Output will be written in the defined files The job will run in the default partition batch bin bash SBATCH J TestJob SBATCH N 1 SBATCH o TestJob 3j out SBATCH e TestJob j err SB
85. to search for all possible modules matching any of the keys Load a Toolchain and check loaded modules module load intel para 2015 07 module list Currently Loaded Modules 1 z11b 1 2 8 9 pscom 5 0 45 1 2 binutils 2 25 10 ipsmpi 2015 07 3 ncurses 5 9 11 iccifort 2015 3 187 GCC bare 4 9 3 4 libatomic_ops 7 4 2 12 icc 2015 3 187 GCC bare 4 9 3 5 ge 7 4 2 13 ifort 2015 3 187 GCC bare 4 9 3 6 util wrapper 1 1 14 psmpi 5 1 4 1 7 GCC bare 4 9 3 15 imk1 11 2 3 187 8 popt 1 16 16 intel para 2015 07 Check available packages module avail usr local software jureca Stage3 modules all Toolchain intel para 2015 07 ABINIT 7 10 2 ScientificPython 2 9 4 Python 2 7 10 ASE 3 8 1 3440 Python 2 7 10 Score P 1 4 1 Autoconf 2 69 Score P 1 4 2 D Automake 1 13 Szip 2 1 Automake 1 15 D TAU 2 24 Bison 3 0 2 TAU 2 24 1 D Boost 1 58 0 Python 2 7 10 Tc1 8 5 9 BuildEnv defaults Tc1 8 6 3 D CDO 1 6 9 UDUNITS 2 1 24 CMake 3 1 3 UltraScan3 3 3 2002 CMake 3 2 3 D VampirTrace 5 14 4 CP2K 2 6 0 XML LibXML 2 0118 Perl 5 20 1 Cube 4 3 1 Xerces C 3 1 2 Cube 4 3 2 D YAXT 0 3 0 Cython 0 22 Python 2 7 10 adf 2014 07 DB_File 1 831 Perl 5 20 1 arpack ng 3 1 3 Doxygen 1 8 9 1 bzip2 1 0 6 ELPA 2014 06 001 generic simple CURL 7 41 0 ELPA 2014 06 001 hybrid D cpmd 4 1 Elemental 0 85 darshan runtime 2 3 1 FFTW 2 1 5 darshan util 2 3 1 FFTW 3 3 4 D expat 2 1 0 FIAT 1 1 Pytho
86. warding is required for users who want to use applications or tools which provide a GUI Here is an example that shows how to use this feature salloc nodes 1 time 00 01 00 srun cpu_bind none nodes 1 ntasks 1 forward x pty bin bash i GUI App Note User accounts will be charged per allocation whether the compute nodes are used or not Batch submission is the preferred way to execute jobs 49 6 From Moa b Torque to Slurm On JUROPA we are using the combination of Moab and Torque for the Batch System Moab works as the scheduler and Torque is the resource manager However on JURECA we use Slurm as scheduler and resource manager In this chapter we will compare and give some information about these two solutions and we will try to help the users have an easier migration from Moab Torque to Slurm 6 1 Differences between the Systems Here we will compare and declare some differences between Moab and Slurm Moab Slurm Resource Not supported Needs an external A flexible and capable resource manager Management Resource Manager like Torque in our case psslurm on the nodes Nodes It is possible to set nodes for batch and No difference between batch and interactive jobs only or both interactive jobs for Slurm Queues Partitions separate node into groups Slurm defines only partitions For Slurm Queues are used for job submission on the partitions are used as queu
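A sketch of the X11 workflow described above. Here xterm stands in for any GUI application, and the spelling --forward-x follows the option name given in this guide; since it is a JURECA-specific srun extension, check srun --help on the system if it is rejected.

    # Allocate one node, then start a GUI application with X11 forwarded to your display
    salloc --nodes=1 --time=01:00:00
    srun --cpu_bind=none --nodes=1 --ntasks=1 --forward-x --pty xterm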
ype=ALL jobscript
Specify a job name and the standard output/error files:
    sbatch -N1 -J myjob -o MyJob-%j.out -e MyJob-%j.err jobscript
Start an interactive job and allocate 4 nodes for 1 hour:
    salloc -N4 --time=60
Start an interactive job with srun and allocate 1 node for 10 minutes in the devel partition:
    srun -N1 -p devel -t 10 --pty -u /bin/bash -i

Generic Resources (GRES)
As we described in chapter 2.6, generic resources have been configured for each node type. In order to request nodes with specific GRES resources, the option --gres must be used during submission. The following table shows the combinations of GRES types that are available for each partition:

Partition   List of GRES
devel       MAY  --gres=mem128
batch       MAY  --gres=mem128 OR --gres=mem256
mem512      MUST --gres=mem512
gpus        MAY  --gres=mem128   MUST --gres=gpu:1-4
vis         MAY  --gres=mem512 OR --gres=mem1024   MUST --gres=gpu:1-2
large       MAY  --gres=mem128 OR --gres=mem256
maint       MAY  --gres=mem128 OR --gres=mem256 OR --gres=mem512 OR --gres=mem1024 OR --gres=gpu:2-4

Examples
Submit a job requesting 2 nodes in the devel partition (by default GRES mem128 will be added):
    sbatch -N2 -p devel jobscript
Submit a job requesting 32 nodes in the batch partition (by default GRES mem128 will be added):
    sbatch -N32 -p batch jobscript
Submit a job requesting 8 nodes in the batch partition with 256 GB memory:
    sbatch -N8 -p batch --gres=mem256 jobscript
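To check which GRES a particular node actually offers before requesting it, the scontrol output shown earlier can be filtered; the node name below is just an example.

    # The Gres= field lists the generic resources configured on the node
    scontrol show node jrc0130 | grep -i gres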