HLRE-3 MISTRAL user's manual
1. The following table shows the names of the MPI wrapper procedures for the Intel compilers as well as the names of the compilers themselves. The wrappers build up the MPI environment for your compilation task, so we recommend using the wrappers instead of the compilers themselves.

   language             compiler   Intel MPI wrapper   bullx MPI wrapper
   Fortran 90/95/2003   ifort      mpiifort            mpif90
   Fortran 77           ifort      mpiifort            mpif77
   C++                  icpc       mpiicpc             mpic++
   C                    icc        mpiicc              mpicc

   Table 2.3: MPI compiler wrapper overview for the Intel compiler

The table below lists some useful compiler options that are commonly used for the Intel compiler. For further information please refer to the man pages of the compiler or the comprehensive documentation on the Intel website: https://software.intel.com/en-us/intel-software-technical-documentation

   option              description
   -openmp             Enables the parallelizer to generate multi-threaded code based on the OpenMP directives
   -g                  Creates debugging information in the object files. This is necessary if you want to debug your program.
   -O[0-3]             Sets the optimization level
   -L <library path>   A path can be given in which the linker searches for libraries
   -D                  Defines a macro
   -U                  Undefines a macro
   -I <include path>   Allows to add further directories to the include file search path
   -sox                Stores useful information like compiler version, ...
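To see how a wrapper and a few of the options above combine on the command line, here is a minimal sketch (the module names, macro, paths and source file are placeholders for illustration, not prescriptions from the manual):

   module load intel intelmpi
   mpiifort -O2 -g -DMY_MACRO -I$HOME/include -L$HOME/lib -o my_prog my_prog.f90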
2.
   $ squeue -j 13258
     JOBID PARTITION  NAME  USER     ST  TIME  NODES  NODELIST(REASON)
     13258 compute    bash  x123456  R   0:11  2      m[10001-10002]
   $ hostname            # we are still on the login node
   mlogin103
   $ ssh m10001
   user@m10001:~$ hostname
   m10001
   user@m10001:~$ exit
   logout
   Connection to m10001 closed.
   $ exit                # we need to exit in order to release the allocation
   salloc: Relinquishing job allocation 13258
   salloc: Job allocation 13258 has been revoked.

4.2.2 Spawning Command

With srun users can spawn any kind of application, process or task inside a job allocation, or directly start executing a parallel job (and indirectly ask SLURM to create the appropriate allocation). It can be a shell command, any single- or multi-threaded executable in binary or script format, an MPI application, or a hybrid application with MPI and OpenMP. When no allocation options are defined with the srun command, the options from sbatch or salloc are inherited. srun should preferably be used either

1. inside a job script submitted by sbatch (see 4.2.3), or
2. after calling salloc.

The allocation options of srun for the job steps are almost the same as for sbatch and salloc (please see the table in section 4.2.3 for some allocation options).

Examples

Spawn 48 tasks on 2 nodes (24 tasks per node) for 30 minutes:

   srun -N 2 -n 48 -t 30 -A xy1234 my_small_test_job

You will have to specify the account to be used for this job in the ...
3.
   $ /sw/rhel6-x64/netcdf/netcdf_c-4.3.2-gcc48/bin/nc-config --libs
   -L/sw/rhel6-x64/netcdf/netcdf_c-4.3.2-gcc48/lib -Wl,-rpath,/sw/rhel6-x64/netcdf/netcdf_c-4.3.2-gcc48/lib -lnetcdf

   # Get paths to Fortran netCDF include files
   $ /sw/rhel6-x64/netcdf/netcdf_fortran-4.4.2-intel14/bin/nf-config --fflags
   -I/sw/rhel6-x64/netcdf/netcdf_fortran-4.4.2-intel14/include

   # Get options needed to link a Fortran program to netCDF
   $ /sw/rhel6-x64/netcdf/netcdf_fortran-4.4.2-intel14/bin/nf-config --flibs
   -L/sw/rhel6-x64/netcdf/netcdf_fortran-4.4.2-intel14/lib -lnetcdff -Wl,-rpath,/sw/rhel6-x64/netcdf/netcdf_fortran-4.4.2-intel14/lib
   -L/sw/rhel6-x64/netcdf/netcdf_c-4.3.2-gcc48/lib -Wl,-rpath,/sw/rhel6-x64/netcdf/netcdf_c-4.3.2-gcc48/lib
   -L/sw/rhel6-x64/hdf5/hdf5-1.8.14-threadsafe-gcc48/lib -Wl,-rpath,/sw/rhel6-x64/hdf5/hdf5-1.8.14-threadsafe-gcc48/lib
   -L/sw/rhel6-x64/sys/libaec-0.3.2-gcc48/lib -Wl,-rpath,/sw/rhel6-x64/sys/libaec-0.3.2-gcc48/lib
   -lnetcdf -lhdf5_hl -lhdf5 -lsz -lcurl -lz

Chapter 3: Batch System — SLURM

3.1 SLURM Overview

SLURM is the batch system and workload manager used on the MISTRAL cluster. SLURM (Simple Linux Utility for Resource Management) is a free, open-source resource manager and scheduler. It is a modern, extensible batch system that is widely deployed around the world o...
4. A non-login bash shell or bash subshell reads the .bashrc file. Tcsh always reads and executes the .cshrc file; if tcsh is invoked as login shell, the file .login is sourced additionally.

Typical tasks and settings that can be put in shell setup files are, for example:

- creation of a custom prompt
- modification of the search path for external commands and programs
- definition of environment variables needed by programs or scripts
- definition of aliases
- execution of commands, e.g. module load <modname>/<version>

Chapter 2: Software Environment

2.1 Modules

To cover the software needs of DKRZ users and to maintain different software versions, DKRZ uses the module environment. Loading a module adapts your environment variables to give you access to a specific set of software and its dependencies. The modules are not organized hierarchically, but they have internal consistency checks for dependencies and can uniquely be identified by the naming convention <modname>/<modversion>. Optionally, the version of the compiler that was used to build the software is also encoded in the name; for example, all modules built with the same Intel compiler version are labelled with e.g. intel14.

2.1.1 Modules Available

Table 2.1 provides a quick reference to some module categories. The list of available modules will steadily grow to cover the general software needs of DKRZ users. A complete list is dynamically updated whenever ne...
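A minimal sketch of what such a setup file could contain (the prompt string, alias, path and module name are purely illustrative and not taken from the manual):

   # ~/.bashrc (example)
   export PS1='\u@\h:\w> '          # custom prompt
   export PATH="$HOME/bin:$PATH"    # extend the command search path
   export EDITOR=vim                # environment variable used by other programs
   alias ll='ls -l'                 # alias definition
   module load cdo                  # execute a command, e.g. load a module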
5. The processor clock rate is 2.5 GHz. The peak performance of the system is about 1.5 PFLOPS. The aggregated main memory is 115 TB. The parallel file system (Lustre) provides 20 PB of usable disk space. Four kinds of nodes are available to users: 8 login nodes, 1496 compute nodes for running scientific models, 48 nodes for interactive use and pre- and postprocessing of data, and 12 visualisation nodes. See Table 1.1 for a listing of the specifics of the different node types.

   type      nodes   hostname                      CPU                                GPUs                        memory
   login     8       mlogin[100-107]               2x 12-core Intel Haswell @ 2.5GHz  none                        256 GB
   compute   1386    m[10000-11367],               2x 12-core Intel Haswell @ 2.5GHz  none                        64 GB
                     m[11404-11421]
   compute   110     m[11368-11403], m[11422-11431],  2x 12-core Intel Haswell @ 2.5GHz  none                     128 GB
                     m[11440-11511]
   pre/post  48      m[11512-11559]                2x 12-core Intel Haswell @ 2.5GHz  none                        256 GB
   viz       12      mg[100-111]                   2x 12-core Intel Haswell @ 2.5GHz  2x Nvidia Tesla K80         256 GB
                                                                                      (GK110BGL)

   Table 1.1: MISTRAL node configuration

The operating system on the MISTRAL cluster is Red Hat Enterprise Linux release 6.4 (Santiago). All compute, pre-/postprocessing and visualization nodes are integrated in one FDR InfiniBand (IB) fabric with three Mellanox SX6536 director switches and a fat tree topology with a blocking factor of 1:2:2. The measured bandwidth between two arbitrary compute nodes is 5.9 GByte/s with a latency of 2.7 µs.

1.3 Data Management — File...
6. ...case you need to combine the two binding methods mentioned above. Keep in mind that we are using --threads-per-core=2 throughout the cluster. Hence, you need to specify the number of CPUs per process/task on the basis of HyperThreads, even if you do not intend to use HyperThreads. The following table gives an overview of how to achieve correct binding using a full node (MPI intranode distribution of tasks):

   no OpenMP, no HT:
      #SBATCH --tasks-per-node=24, srun --cpu_bind=cores
      --distribution=block:block    task0: cpu 0,24    task1: cpu 1,25
      --distribution=block:cyclic   task0: cpu 0,24    task1: cpu 12,36

   no OpenMP, with HT:
      #SBATCH --tasks-per-node=48, srun --cpu_bind=threads
      --distribution=block:block    task0: cpu 0    task1: cpu 24    task2: cpu 1
      --distribution=block:cyclic   task0: cpu 0    task1: cpu 12    task2: cpu 1

   hybrid, no HT:
      #SBATCH --tasks-per-node=6, export OMP_NUM_THREADS=4,
      export KMP_AFFINITY=granularity=core,compact,1,
      srun --cpu_bind=cores --cpus-per-task=8
      --distribution=block:block    task0: cpu 0,1,2,3,24,25,26,27    task1: cpu 4,5,6,7,28,29,30,31
      --distribution=block:cyclic   task0: cpu 0,1,2,3,24,25,26,27    task1: cpu 12,13,14,15,36,37,38,39
      (in both cases: task0 thread0 on cpu 0/24, task0 thread1 on cpu 1/25, ...)

   hybrid, with HT:
      #SBATCH --tasks-per-no...
7. ...job 4242:

   bash$ scontrol show job 4242

Hold a job:

   bash$ scontrol hold 4242
   bash$ squeue
   JOBID PARTITION  NAME     USER     ST  TIME  NODES  NODELIST(REASON)
   4242  nightly    tst_job  b123456  PD  0:00  1      (JobHeldUser)

Release a job:

   bash$ scontrol release 4242
   bash$ squeue
   JOBID PARTITION  NAME     USER     ST  TIME  NODES  NODELIST(REASON)
   4242  nightly    tst_job  b123456  R   0:01  1      m[10007-10011]

With scancel one can signal or cancel jobs, job arrays or job steps.

Cancel a specific job:

   bash$ scancel 4711

Cancel all jobs (pending, running, etc.) of user x123456 in interactive mode (the user must confirm each operation):

   bash$ scancel --interactive -u x123456

With sstat one can get various status information about running job steps, for example minimum, maximum and average values for metrics like CPU time, Virtual Memory (VM) usage, Resident Set Size (RSS), Disk I/O, number of tasks, etc.

Display the default status information for job 4242:

   bash$ sstat -j 4242

Display the defined metrics for job 4242 in parsable format:

   bash$ sstat -P --format=JobID,AveCPU,AvePages,AveRSS,AveVMSize -j 4242

4.5.3 Accounting Commands

With sacct one can get accounting information and data for the jobs and job steps that are stored in SLURM's accounting database. SLURM stores the history of all jobs in the database, but each user has permission to check only his/h...
8. ...running, or the reason why the job is pending. Typical reasons for pending jobs are waiting for resources to become available (Resources) and queuing behind a job with higher priority (Priority).

sbatch   submit a batch script. The script will be executed on the first node of the allocation. The working directory coincides with the directory from which sbatch was invoked. Within the script, one or multiple srun commands can be used to create job steps and execute parallel applications.

scancel  cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.

salloc   request an interactive job allocation. When the job is started, a shell (or another program specified on the command line) is started on the submission host (login node). From this shell you should use srun to interactively start parallel applications. The allocation is released when the user exits the shell.

srun     initiate parallel job steps within a job, or start an interactive job.

scontrol primarily used by the administrators; it provides some functionality for the users to manage jobs or to get information about the system configuration, such as nodes, partitions, jobs and configurations.

sprio    query job priorities.

sshare   retrieve fair-share information for each account the user belongs to.

sstat    query status information (related to CPU, task, node, RSS and virtual memory) about a running ...
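As a quick sketch of how a few of these commands combine in a typical session (the job ID 4711 and the script name are purely illustrative):

   sbatch my_jobscript.sh     # submit the batch script; SLURM prints the job ID, e.g. 4711
   squeue -u $USER            # check the state of your pending and running jobs
   scontrol show job 4711     # inspect the full configuration of the job
   scancel 4711               # cancel the job if necessary
   sacct -j 4711              # accounting information once the job has finished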
9. ...same manner as for salloc and sbatch.

4.2.3 Batch Jobs

Users submit batch applications using the sbatch command. The batch script is usually a shell script consisting of two parts: resource requests and job steps. Resource requests are, for example, the number of nodes needed to execute the job, the number of tasks, the time duration of the job, etc. Job steps are the user's tasks that must be executed. The resource requests and other SLURM submission options are prefixed by '#SBATCH' and must precede any executable commands in the batch script. For example:

   #!/bin/bash
   #SBATCH --partition=compute
   #SBATCH --account=xz0123
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=24
   #SBATCH --time=00:30:00

   # Begin of section with executable commands
   set -e
   ls -l
   srun my_program

The script itself is regarded by SLURM as the first job step and is serially executed on the first compute node in the job allocation. To execute parallel (MPI) tasks, users call srun within their script; thereby a new job step is initiated. It is possible to execute parallel programs in the form of job steps in any configuration within the job allocation. This means a job step can use all allocated resources, or several job steps (created via multiple srun calls) can each use a subset of the allocated resources.

The following table describes the most common or required allocation options that can be defined in a batch script.
10. Table 2.2: module command overview

For all details of the module command please refer to the man page or execute 'module help'. To use the module command in a script, you can source one of the following files in your script before any invocation of the module command:

   # in a bash or ksh script:
   source /sw/rhel6-x64/etc/profile.mistral

   # in a tcsh or csh script:
   source /sw/rhel6-x64/etc/csh.mistral

The 'module avail' command provides up-to-date information on installed software and versions.

2.2 Compiler and MPI

On MISTRAL we provide the Intel, GCC (GNU Compiler Collection) and NAG compilers and several Message Passing Interface (MPI) implementations: bullx MPI (with and without the Mellanox MXM and FCA tools), Intel MPI, MVAPICH2 and OpenMPI. No compilers and MPIs are loaded by default.

For most applications we recommend using the Intel compilers and the bullx MPI library with Mellanox tools to achieve optimal performance on MISTRAL. For some applications running on a small number of nodes, slightly better performance might be achieved with the Intel compilers and Intel MPI.

A compiler and an appropriate MPI library can be selected by loading the corresponding module files, for example:

   # Use the default versions of the Intel compiler and bullx MPI with Mellanox MXM/FCA tools:
   module load intel mxm fca bullxmpi_mlx

   # Use the default versions of the Intel compiler and Intel MPI:
   module load intel intelmpi
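An illustrative module session might look as follows (the module names and version numbers are examples only and may differ from what is actually installed):

   module avail intel               # list the available Intel compiler versions
   module load intel/14.0.3         # load one specific version
   module list                      # show all currently loaded modules
   module switch intel intel/15.0.1 # exchange the loaded version for another one
   module purge                     # unload all modules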
11.     4.4.3 MPMD
        4.4.4 Frequency Scaling
        4.4.5 Job Steps
        4.4.6 Dependency Chains
        4.4.7 Job Arrays
     4.5 SLURM Command Examples
        4.5.1 Query Commands
        4.5.2 Job Control
        4.5.3 Accounting Commands

Chapter 1: Cluster Information

1.1 Introduction

MISTRAL, the High Performance Computing system for Earth system research (HLRE-3), is DKRZ's first petascale supercomputer, built by Atos. The phase 1 configuration, with a peak performance of roughly 1.5 PetaFLOPS, consists of more than 1,500 compute nodes, 36,000 compute cores, 110 Terabytes of memory and 20 Petabytes of disk.

To access MISTRAL you need to be a member of at least one active HLRE project, to have a valid user account, and to accept DKRZ's "Guidelines for the use of information processing systems of the Deutsches Klimarechenzentrum GmbH (DKRZ)".

1.2 Cluster Nodes

The MISTRAL cluster in phase 1 consists of more than 1,500 nodes. The compute nodes are housed in bullx B700 DLC (Direct Liquid Cooling) blade systems, with two nodes forming one blade. Each node has two sockets, each equipped with an Intel Xeon E5-2680 v3 12-core processor (Haswell). Thus, 24 physical cores per node are available. Due to active Hyper-Threading, the operating system recognizes two logical cores per physical core.
12. DEUTSCHES KLIMARECHENZENTRUM
HLRE-3 MISTRAL User's Manual
Support: beratung@dkrz.de
2015-08-25

Contents

1  Cluster Information
   1.1 Introduction
   1.2 Cluster Nodes
   1.3 Data Management — Filesystems
       1.3.1 Data Migration from Blizzard
   1.4 Access to the Cluster
       1.4.1 Login
       1.4.2 Password
       1.4.3 Login Shell
2  Software Environment
   2.1 Modules
       2.1.1 Modules Available
       2.1.2 Using the Module Command
   2.2 Compiler and MPI
       2.2.1 Compilation Examples
       2.2.2 Recommendations
3  Batch System — SLURM
   3.1 SLURM Overview
   3.2 SLURM Partitions
   3.3 Job Limits — QoS
   3.4 Priorities and Accounting
   3.5 Job Environment
4  SLURM Usage
   4.1 SLURM Command Overview
   4.2 Allocation Commands
       4.2.1 Interactive Jobs
       4.2.2 Spawning Command
       4.2.3 Batch Jobs
   4.3 Job Script Examples
   4.4 Advanced SLURM Features
       4.4.1 Hyper-Threading (HT)
       4.4.2 Process and Thread Binding
13.
   export MXM_LOG_LEVEL=ERROR
   # Disable GHC algorithm for collective communication
   export OMPI_MCA_coll=^ghc

   # Environment settings to run a MPI/OpenMP parallel program compiled with Intel MPI;
   # load environment
   module load intelmpi
   export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so

   # Use srun (not mpirun or mpiexec) to launch programs compiled with any MPI library
   srun -l --cpu_bind=cores --cpus-per-task=8 myprog

4.4 Advanced SLURM Features

4.4.1 Hyper-Threading (HT)

Similar to the IBM Power6 used in BLIZZARD, the Haswell processors deployed for MISTRAL offer the possibility of Simultaneous Multithreading (SMT) in the form of the Intel Hyper-Threading (HT) Technology. With HT enabled, each physical processor core can execute two threads or tasks simultaneously. The operating system thus lists a total of 48 logical CPUs or Hardware Threads (HWTs). Therefore, a maximum of 48 processes can be executed on each compute node without overbooking.

Each compute node on MISTRAL consists of two Intel Xeon E5-2680 v3 processors, located on sockets zero and one, with 12 physical cores each. These cores are numbered 0 to 23, and the corresponding hardware threads are numbered 24 to 47. Figure 4.1 depicts a node schematically and illustrates the naming convention.

On MISTRAL we have HT enabled on each compute node, and SLURM always uses the option --threads-per-core=2 implicitly, such that the user is urged to bind tasks and threads in an appropriate way. In Sec...
14. ...ams compiled with any MPI library:

   srun -l --cpu_bind=cores --cpus-per-task=8 myprog

Hybrid MPI/OpenMP job with Hyper-Threading

The following example will run on 2 compute nodes, with 6 MPI tasks per node and 8 threads per task, using Hyper-Threading:

   #!/bin/bash
   #SBATCH --job-name=my_job        # job name
   #SBATCH --partition=compute      # partition name
   #SBATCH --nodes=2                # number of nodes
   #SBATCH --ntasks-per-node=6      # number of MPI tasks on each node
   #SBATCH --time=01:00:00          # Set a limit on the total run time
   #SBATCH --mail-type=FAIL         # Notify user by email
   #SBATCH --mail-user=your_email   # Set your e-mail address
   #SBATCH --account=xz0123         # Charge resources on project account
   #SBATCH --output=my_job.o%j      # File name for standard output
   #SBATCH --error=my_job.e%j       # File name for standard error output

   # Bind your OpenMP threads
   export OMP_NUM_THREADS=8
   export KMP_AFFINITY=verbose,granularity=thread,compact,1
   export KMP_STACKSIZE=64m

   # Environment settings to run a MPI/OpenMP parallel program compiled with
   # bullx MPI and Mellanox libraries; load environment
   module load intel
   module load mxm/3.3.3002
   module load fca/2.5.2379
   module load bullxmpi_mlx/bullxmpi_mlx-1.2.8.3

   # Settings for Open MPI and MXM (MellanoX Messaging) library
   export OMPI_MCA_pml=cm
   export OMPI_MCA_mtl=mxm
   export OMPI_MCA_mtl_mxm_np=0
   export MXM_RDMA_PORTS=mlx5_0:1
   export MXM_LO...
15. ...de=12

   hybrid, with HT:
      #SBATCH --tasks-per-node=12, export OMP_NUM_THREADS=4,
      export KMP_AFFINITY=granularity=thread,compact,1,
      srun --cpu_bind=threads --cpus-per-task=4
      --distribution=block:block    task0: cpu 0,1,24,25    task1: cpu 2,3,26,27
      --distribution=block:cyclic   task0: cpu 0,1,24,25    task1: cpu 12,13,36,37
      (in both cases: task0 thread0 on cpu 0, thread1 on cpu 1, thread2 on cpu 24, ...)

4.4.3 MPMD

SLURM supports the MPMD (Multiple Program Multiple Data) execution model, which can be used for MPI applications where multiple executables share one common MPI_COMM_WORLD communicator. In order to use MPMD, the user has to set the srun option --multi-prog <filename>. This option expects a configuration text file as an argument, in contrast to the SPMD (Single Program Multiple Data) case where srun has to be given the executable.

Each line of the configuration file can have two or three fields separated by spaces, and the format is

   <list of task ranks> <executable> [<possible arguments>]

In the first field, a comma-separated list of ranks for the MPI tasks that will be spawned is defined. Possible values are integer numbers or ranges of numbers. The second field is the path/name of the executable, and the third field is optional and defines the arguments of the program.

Example:

Listing 4.1: Jobscript fra...
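Before the full job script listing, here is a minimal sketch of what such a configuration file could look like (the rank ranges and executable names are invented for illustration only):

   # mpmd.conf — ranks 0-11 run the ocean model, ranks 12-47 the atmosphere model
   0-11    ./ocean_model
   12-47   ./atmosphere_model -v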
16. ...e must respect the order of loading the modules: compiler, MXM, FCA and afterwards bullx MPI. If the MXM/FCA environment is not loaded, one will use the bullx MPI without the MXM and FCA tools.

In order to use MXM (Mellanox Messaging) to accelerate the underlying send/receive (or put/get) messages, the following variables have to be used:

   export OMPI_MCA_pml=cm
   export OMPI_MCA_mtl=mxm
   export MXM_RDMA_PORTS=mlx5_0:1

Furthermore, FCA (Fabric Collectives Accelerations) accelerates the underlying collective operations used by the MPI/PGAS languages. To use FCA, one must specify the following variables:

   export OMPI_MCA_coll=^ghc
   export OMPI_MCA_coll_fca_priority=95
   export OMPI_MCA_coll_fca_enable=1

You will find the bullxMPI documentation by Atos at https://www.dkrz.de/Nutzerportal-en/doku/mistral/manuals

Libraries

There is no module to set netCDF paths for the user. If you need to specify such paths in Makefiles or similar, please use the nc-config and nf-config tools to get the needed compiler flags and libraries, e.g.

   # Get paths to netCDF include files
   $ /sw/rhel6-x64/netcdf/netcdf_c-4.3.2-gcc48/bin/nc-config --cflags
   -I/sw/rhel6-x64/netcdf/netcdf_c-4.3.2-gcc48/include -I/sw/rhel6-x64/sys/libaec-0.3.2-gcc48/include -I/sw/rhel6-x64/hdf5/hdf5-1.8.14-threadsafe-gcc48/include

   # Get options needed to link a C program to netCDF
   $ /sw/rhel6-x64/netcdf/netcdf_c-4.3...
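In a build script you could capture these flags once instead of hard-coding the long library paths (the compiler, program name and netCDF installation below are placeholders, assuming the same installation as in the example above):

   NFCONFIG=/sw/rhel6-x64/netcdf/netcdf_fortran-4.4.2-intel14/bin/nf-config
   FFLAGS="$($NFCONFIG --fflags)"
   FLIBS="$($NFCONFIG --flibs)"
   mpiifort -O2 $FFLAGS -o my_model my_model.f90 $FLIBS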
17. ...ent. For details please take a look at the Intel manuals or contact DKRZ user's consultancy. In most cases use

   export KMP_AFFINITY=granularity=core,compact,1

if you do not want to use HyperThreads, and

   export KMP_AFFINITY=granularity=thread,compact,1

if you intend to use HyperThreads. You might also try 'scatter' instead of 'compact' placement to benefit from the bigger L3 cache.

MPI jobs

Process/task binding can be done via the srun options --cpu_bind and --distribution. The syntax is

   --cpu_bind=[quiet|verbose,]<type>
   --distribution=<block|cyclic|arbitrary|plane=<options>>[:block|cyclic]

with

- type: cores — bind to physical cores; threads — bind to logical CPUs (HyperThreads)
- the first distribution method (before the ":") controls the distribution of resources across nodes
- the second, optional distribution method (after the ":") controls the distribution of resources across sockets within a node

For details please take a look at the man page of srun or contact DKRZ user's consultancy. In most cases use

   srun --cpu_bind=verbose,cores --distribution=block:cyclic ./myapp

if you do not want to use HyperThreads, and

   srun --cpu_bind=verbose,threads --distribution=block:cyclic ./myapp

if you intend to use HyperThreads. You might also benefit from task distributions other than block:cyclic.

Hybrid MPI/OpenMP jobs

In this ...
18. ...er own jobs.

Show job information in long format for the default period (from 00:00 today until now):

   bash$ sacct -l

Show job-only information (without job steps), starting from the defined date until now:

   bash$ sacct -S 2015-01-07T00:42:00 -X

Show job information with a different format and a specified time frame:

   bash$ sacct -X -u b123456 --format=jobid,nnodes,nodelist,state,exit -S 2015-01-01 -E 2015-01-31T23:59:59

The sacctmgr command is mainly used by the administrators to view or modify accounting information and data in the accounting database. This command also provides an interface with limited permissions to the users for some querying actions.

The most useful command shows all associations a user is allowed to submit jobs with:

   bash$ sacctmgr show assoc where user=<user_id>

List all or the specified QoS:

   bash$ sacctmgr show qos where name=<qos_name>

Show the privileges of my user:

   bash$ sacctmgr show user

Show cluster information:

   bash$ sacctmgr show cluster
19.
   Shared         exclusive    yes        yes          exclusive
   MaxMemPerCPU   node limit   5 GByte    2.5 GByte    5 GByte

   Table 3.2: Overview of SLURM partitions on MISTRAL

3.3 Job Limits — QoS

As stated above, the partitions have several hard limits that put an upper bound on the wall clock time or other constraints of a job. However, the actual job limits are enforced by the limits specified in both the partitions and the so-called Quality of Service (QoS); this means that by using a special QoS the user might weaken the partition limits. These QoS also play an important role in defining the job priorities: by defining some QoS, the possible priorities can be modified in order to, e.g., enable an earlier start time of jobs.

In the following we present the current list of configured Quality of Services. If users have any demand for creating a new QoS, we kindly ask them to contact us.

   QoS       description       limits
   express   higher priority   4 nodes, 20 min wallclock

   Table 3.3: Overview of SLURM QoS on MISTRAL

3.4 Priorities and Accounting

The main policies concerning the batch model and accounting that are applied on MISTRAL are also defined via SLURM:

- SLURM schedules the jobs according to their priorities. The jobs with the highest priorities will be scheduled next.
- Usage of the backfilling scheduling algorithm: the SLURM scheduler checks the queue and may schedule jobs with lower priorities that fit into the gap created by freeing resources for the n...
20. ...ext highest priority jobs.

- For each project a SLURM account is created, to which its users belong. Each user might use the contingent of several projects that he or she belongs to.
- Users can submit jobs even when the granted shares are already used up; this results in a low priority, but the job might still start when the system is empty.

SLURM has a simple and well-defined priority mechanism that allows different weighting models to be defined. The actual priority is based on five factors:

   Job_priority = PriorityWeightAge       * age_factor
                + PriorityWeightFairshare * fairshare_factor
                + PriorityWeightJobSize   * job_size_factor      (3.1)
                + PriorityWeightPartition * partition_factor
                + PriorityWeightQOS       * QOS_factor

For each factor a weight is defined to balance the job priority equation:

- WeightQOS = 10000
- WeightAge = 1000
- WeightJobSize = TODO
- WeightFairshare = 100000
- WeightPartition = 10000

3.5 Job Environment

On the compute nodes, the whole shell environment is passed to the jobs during submission. With some options of the allocation commands (like --export for the sbatch command), users can change this default behaviour. Users can load modules and prepare the desired environment before job submission, and this environment will then be passed to the jobs that are submitted. Of course, a good practice is to include module commands inside the job ...
21. ...he application using only parts of the allocated resources, one needs to give again all relevant allocation options to srun (like --ntasks or --ntasks-per-node), e.g.

   srun --ntasks=2 --ntasks-per-node=1 --cpu_bind=cores --distribution=block:cyclic <my_binary>

All environment variables set at the time of submission are propagated to the SLURM jobs. With some options of the allocation commands (like --export for sbatch or srun), users can change this default behaviour. Users can load modules and prepare the desired environment before job submission, and this environment will then be passed to the jobs that are submitted. Of course, a good practice is to include module commands in job scripts in order to have full control over the environment of the jobs.

NOTE: on the MISTRAL cluster, setting -A resp. --account is necessary to submit a job; otherwise the submission will be rejected. You can query the accounts for which job submission is allowed using the command:

   bash$ sacctmgr list assoc format=account,qos,maxjobs user=$USER

4.3 Job Script Examples

Serial job

   #!/bin/bash
   #SBATCH --job-name=my_job        # Specify job name
   #SBATCH --partition=shared       # Specify partition name
   #SBATCH --ntasks=1               # Specify max. number of tasks to be invoked
   #SBATCH --mem-per-cpu=<MB>       # Specify real memory required per CPU
   #SBATCH --time=00:30:00          # Set a limit on the total run time
   #SBATCH --mail-...
22. ...hile in DRAINING state, any running job on the node will be allowed to run until completion. After that, and in DRAINED state, the node will be unavailable for use.
   idle    IDLE       The node is not allocated to any jobs and is available for use.
   maint   MAINT      The node is currently in a reservation with a flag of "maintenance".
   resv    RESERVED   The node is in an advanced reservation and not generally available.

A listing based on nodes can be viewed as follows:

   bash$ sinfo -N
   NODELIST                                            NODES  PARTITION  STATE
   m[10000,10278,10286,10438,10498,10518,10554]        13     compute*   down*
   m10000                                              1      shared     down*
   m[10001-10017,10036-10049,11296-11313]              63     shared     idle
   m[10001-10017,10036-10053,10072-10107]              1318   compute*   idle
   m[10018-10035,11314-11331]                          36     shared     alloc
   m[10018-10035,10054-10071,10108-10110,10113]        165    compute*   alloc
   m[11512-11517,11519-11531,11533-11553,11555]        45     prepost    idle
   m[11518,11532,11554]                                3      prepost    drain*
   mg[100-101,103-111]                                 11     gpu        idle
   mg102                                               1      gpu        down

Query configuration and limits for one specific partition (here: compute):

   bash$ scontrol show partition compute

Check one node (here: m10010):

   bash$ scontrol show node m10010

4.5.2 Job Control

The scontrol command is primarily used by the administrators to manage SLURM's configuration. However, it also provides some functionality for the users to manage jobs and to get some information about the system configuration.

Show information about the ...
23. ...ia --cpus-per-task and --cpu_bind, one might also use the srun option --hint=[no]multithread. The following example allocates one full node and uses 24 tasks without HyperThreads for the first program run, and then 48 tasks using HyperThreads for the second run. Such a procedure might be used in order to see whether an application benefits from the use of HyperThreads or not.

   #!/bin/bash
   #SBATCH --job-name=my_job        # Specify job name
   #SBATCH --partition=compute      # Specify partition name
   #SBATCH --nodes=1                # Specify number of nodes
   #SBATCH --time=00:30:00          # Set a limit on the total run time
   #SBATCH --account=x12345         # Charge resources on this project account

   # Environment settings to execute a parallel program compiled with Intel MPI
   module load intelmpi
   export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
   export I_MPI_FABRICS=shm:dapl
   export I_MPI_FALLBACK=0
   export I_MPI_DAPL_UD=enable

   # First check how myprog performs without Hyper-Threads
   srun -l --cpu_bind=verbose --hint=nomultithread --ntasks=24 myprog

   # Second check how myprog performs with Hyper-Threads
   srun -l --cpu_bind=verbose --hint=multithread --ntasks=48 myprog

Hybrid MPI/OpenMP job without Hyper-Threading

The following job example will allocate 4 compute nodes for 1 hour. The job will launch 24 MPI tasks in total, 6 tasks per node and 4 OpenMP threads per task. On each node 24 cores will be used.

   #!/bin/bash
   #SBATCH ...
24.
   #SBATCH --partition=compute      # Specify partition name
   #SBATCH --nodes=4                # Specify number of nodes
   #SBATCH --ntasks-per-node=48     # Specify number of tasks on each node
   #SBATCH --time=00:30:00          # Set a limit on the total run time
   #SBATCH --mail-type=FAIL         # Notify user by email
   #SBATCH --mail-user=your_email   # Set your e-mail address
   #SBATCH --account=xz0123         # Charge resources on project account
   #SBATCH --output=my_job.o%j      # File name for standard output
   #SBATCH --error=my_job.e%j       # File name for standard error output

   # Environment settings to run a MPI parallel program compiled with
   # bullx MPI and Mellanox libraries; load environment
   module load intel
   module load mxm/3.3.3002
   module load fca/2.5.2379
   module load bullxmpi_mlx/bullxmpi_mlx-1.2.8.3

   # Settings for Open MPI and MXM (MellanoX Messaging) library
   export OMPI_MCA_pml=cm
   export OMPI_MCA_mtl=mxm
   export OMPI_MCA_mtl_mxm_np=0
   export MXM_RDMA_PORTS=mlx5_0:1
   export MXM_LOG_LEVEL=ERROR
   # Disable GHC algorithm for collective communication
   export OMPI_MCA_coll=^ghc

   # Environment settings to run a MPI parallel program compiled with Intel MPI;
   # load environment
   module load intelmpi
   export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so

   # Use srun (not mpirun or mpiexec) to launch programs compiled with any MPI library
   srun -l --cpu_bind=threads --distribution=block:cyclic myprog

Instead of specifying the choice to use HyperThreads or not explicitly v...
25. ...in/bash
   #SBATCH --job-name=my_job        # Specify job name
   #SBATCH --partition=shared       # Specify partition name
   #SBATCH --ntasks=1               # Specify max. number of tasks to be invoked
   #SBATCH --cpus-per-task=8        # Specify number of CPUs per task
   #SBATCH --time=00:30:00          # Set a limit on the total run time
   #SBATCH --account=x12345         # Charge resources on this project account

   # bind your OpenMP threads
   export OMP_NUM_THREADS=8
   export KMP_AFFINITY=verbose,granularity=thread,compact,1
   export KMP_STACKSIZE=64M

   # execute OpenMP programs, e.g. cdo
   cdo -P 8 <operator> <ifile> <ofile>

MPI job without HyperThreading

The overall structure of the batch script does not vary whether one is using Intel MPI, bullx MPI or any other MPI implementation. Only specific modules might be used and/or environment variables should be set in order to fine-tune the MPI that is used. In particular, the parallel application should always be started using the srun command instead of invoking mpirun, mpiexec or others. The following example allocates 8 full nodes and uses 24 physical cores per node; the total number of tasks is 192.

   #!/bin/bash
   #SBATCH --job-name=my_job        # Specify job name
   #SBATCH --partition=compute      # Specify partition name
   #SBATCH --nodes=8                # Specify number of nodes
   #SBATCH --ntasks-per-node=24     # Specify number of tasks on each node
   #SBATCH --time=00:30:00          # Set a limit on the ...
26. ...in different job steps, sequentially after each other and also in parallel to each other, inside the same job allocation. In total 4 nodes are allocated; the first 2 job steps run on all nodes after each other, while job steps 3 and 4 run in parallel, each using only a part of the allocated nodes (1 and 3 nodes, respectively).

   #!/bin/bash
   #SBATCH --nodes=4
   #SBATCH --time=00:30:00
   #SBATCH --account=x12345

   # run 2 job steps after each other
   srun -N4 --ntasks-per-node=24 --time=00:10:00 mpi_prog1
   srun -N4 --ntasks-per-node=24 --time=00:20:00 mpi_prog2

   # run 2 job steps in parallel
   srun -N1 -n24 mpi_prog3 &
   srun -N3 --ntasks-per-node=24 mpi_prog4 &
   wait

4.4.6 Dependency Chains

SLURM supports dependency chains, which are collections of batch jobs with defined dependencies. Job dependencies can be defined using the --dependency argument of sbatch:

   #!/bin/bash
   #SBATCH --dependency=<type>

The available dependency types for job chains are:

- after:<jobID>       job starts when the job with <jobID> has begun execution
- afterany:<jobID>    job starts when the job with <jobID> terminates
- afterok:<jobID>     job starts when the job with <jobID> terminates successfully
- afternotok:<jobID>  job starts when the job with <jobID> terminates with failure
- singleton           job starts when any previously submitted job with the same job name and user terminates

4.4.7 Job Arrays

SLURM supports j...
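Returning to the dependency types listed above, a minimal sketch of a two-member chain submitted from the command line (the script names are placeholders; --parsable is a standard sbatch option that prints only the job ID, assuming a SLURM version that supports it):

   JOBID=$(sbatch --parsable prepare_data.sh)
   sbatch --dependency=afterok:$JOBID run_model.sh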
27. ...job.

sacct   retrieve accounting information about jobs and job steps. For completed jobs, sacct queries the accounting database.

4.2 Allocation Commands

A job allocation, i.e. a request for compute resources, can be created using the SLURM salloc, sbatch or srun command. The usual way to allocate resources and execute a job on MISTRAL is to write a batch script and submit it to SLURM with the sbatch command (see section 4.2.3 for details). Alternatively, an interactive allocation can be used via the salloc command, or a parallel job can be started directly with the srun command.

4.2.1 Interactive Jobs

Interactive sessions can be allocated using the salloc command. The following command, for example, will allocate 2 nodes for 30 minutes:

   salloc --nodes=2 --time=00:30:00 --account=x12345

Once an allocation has been made, the salloc command will start a bash shell on the login node where the submission was done. After a successful allocation, the users can execute srun from that shell to spawn their applications interactively. For example:

   srun --ntasks=4 --ntasks-per-node=2 --cpus-per-task=4 my_code

The interactive session is terminated by exiting the shell. In order to run commands directly on the allocated compute nodes, the user has to use ssh to connect to the desired nodes. For example:

   salloc --nodes=2 --time=00:30:00 --account=x12345
   salloc: Granted job allocation 13258
28.
   #SBATCH --job-name=my_job        # job name
   #SBATCH --partition=compute      # partition name
   #SBATCH --nodes=4                # number of nodes
   #SBATCH --ntasks-per-node=6      # number of MPI tasks per node
   #SBATCH --time=01:00:00          # Set a limit on the total run time
   #SBATCH --mail-type=FAIL         # Notify user by email
   #SBATCH --mail-user=your_email   # Set your e-mail address
   #SBATCH --account=xz0123         # Charge resources on project account
   #SBATCH --output=my_job.o%j      # File name for standard output
   #SBATCH --error=my_job.e%j       # File name for standard error output

   # Bind your OpenMP threads
   export OMP_NUM_THREADS=4
   export KMP_AFFINITY=verbose,granularity=core,compact,1
   export KMP_STACKSIZE=64m

   # Environment settings to run a MPI/OpenMP parallel program compiled with
   # bullx MPI and Mellanox libraries; load environment
   module load intel
   module load mxm/3.3.3002
   module load fca/2.5.2379
   module load bullxmpi_mlx/bullxmpi_mlx-1.2.8.3

   # Settings for Open MPI and MXM (MellanoX Messaging) library
   export OMPI_MCA_pml=cm
   export OMPI_MCA_mtl=mxm
   export OMPI_MCA_mtl_mxm_np=0
   export MXM_RDMA_PORTS=mlx5_0:1
   export MXM_LOG_LEVEL=ERROR
   # Disable GHC algorithm for collective communication
   export OMPI_MCA_coll=^ghc

   # Environment settings to run a MPI/OpenMP parallel program compiled with Intel MPI;
   # load environment
   module load intelmpi
   export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so

   # Use srun (not mpirun or mpiexec) to launch progr...
29. ...me for the coupled MPI-ESM model using 8 nodes:

   #!/bin/bash
   #SBATCH --nodes=8
   #SBATCH --ntasks-per-node=24
   #SBATCH --time=00:30:00
   #SBATCH --exclusive
   #SBATCH --account=x12345

   # Atmosphere
   ECHAM_NPROCA=6
   ECHAM_NPROCB=16

   # Ocean
   MPIOM_NPROCX=12
   MPIOM_NPROCY=8

   # Paths to executables
   ECHAM_EXECUTABLE=bin/echam6
   MPIOM_EXECUTABLE=bin/mpiom.x

   # Derived values useful for running
   (( ECHAM_NCPU = ECHAM_NPROCA * ECHAM_NPROCB ))
   (( MPIOM_NCPU = MPIOM_NPROCX * MPIOM_NPROCY ))
   (( NCPU = ECHAM_NCPU + MPIOM_NCPU ))
   (( MPIOM_LAST_CPU = MPIOM_NCPU - 1 ))
   (( ECHAM_LAST_CPU = NCPU - 1 ))

   # create MPMD configuration file
   cat > mpmd.conf <<EOF
   0-${MPIOM_LAST_CPU} ${MPIOM_EXECUTABLE}
   ${MPIOM_NCPU}-${ECHAM_LAST_CPU} ${ECHAM_EXECUTABLE}
   EOF
30. ...n a job is requeued, the batch script is initiated from its beginning. (Table 4.1: SLURM sbatch options)

Multiple srun calls can be placed in a single batch script. Options such as --nodes, --ntasks and --ntasks-per-node are inherited from the sbatch arguments, but can be overwritten for each srun invocation. The complete list of parameters can be inquired from the sbatch man page (man sbatch).

As already mentioned above, the batch script is submitted using the SLURM sbatch command:

   sbatch [OPTIONS] <jobscript>

On success, sbatch writes the job ID to standard output. Options provided on the command line supersede the same options defined in the batch script.

Remember the difference between options for selection, allocation and distribution in SLURM: selection and allocation work with sbatch, but task distribution and binding should be specified directly with srun (within an sbatch script). The following steps give an overview; for details see the further documentation below.

1. Resource selection, e.g.
   - #SBATCH --nodes=2
   - #SBATCH --sockets-per-node=2
   - #SBATCH --cores-per-socket=12

2. Resource allocation, e.g.
   - #SBATCH --ntasks=12
   - #SBATCH --ntasks-per-node=6
   - #SBATCH --ntasks-per-socket=3

3. Start the application relying on the sbatch options only; task binding and distribution with srun, e.g.
   srun --cpu_bind=cores --distribution=block:cyclic <my_binary>

4. Start t...
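Putting the selection, allocation and distribution steps above together, a minimal job script could look like this sketch (partition, account and binary are placeholders):

   #!/bin/bash
   #SBATCH --nodes=2                # resource selection
   #SBATCH --ntasks-per-node=6      # resource allocation
   #SBATCH --partition=compute
   #SBATCH --account=xz0123
   #SBATCH --time=00:30:00

   # task distribution and binding are given to srun, not to sbatch
   srun --cpu_bind=cores --distribution=block:cyclic ./my_binary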
31. ...n clusters of various sizes. A SLURM installation consists of several programs (user commands and daemons), which are shown in Table 3.1 and Figure 3.1.

   daemon       description
   slurmctld    control daemon, responsible for monitoring available resources and scheduling batch jobs; it runs on the admin nodes as an HA resource
   slurmdbd     database daemon, accessing and managing the MySQL database which stores all the information about users, jobs and accounting data
   slurmd       slurm daemon providing the functionality of the batch system and resource management; it runs on each compute node
   slurmstepd   step daemon, a job step manager spawned by slurmd to guide the user processes

   Table 3.1: Overview of SLURM components

(Figure 3.1: SLURM daemons and their interaction — the user commands salloc, sbatch, scancel, scontrol, sinfo, smap, sprio and sacctmgr run on the login nodes and communicate with slurmctld and slurmdbd, while slurmd runs on every compute node.)

SLURM manages the compute, pre-/postprocessing and visualisation nodes as the main resources of the cluster. Several nodes are grouped together into partitions, which might overlap, i.e. one node might be contained in several partitions. Compared to LoadLeveler on BLIZZARD, partitions are the equivalent of classes; hence partitions are the main concept for users to start jobs on the MISTRAL cluster.

3.2 SLURM Partitions

In SLURM, multiple nodes can be grouped in...
32. ...ob arrays, which are a mechanism for submitting and managing collections of similar jobs quickly and easily. Job arrays are only supported for the sbatch command and are defined using the option --array=<indices>. All jobs use the same initial options (e.g. number of nodes, time limit, etc.); however, since each part of the job array has access to the SLURM_ARRAY_TASK_ID environment variable, individual settings for each job are possible. For example, the following job submission

   bash$ sbatch --array=1-3 -N1 slurm_job_script.sh

will generate a job array containing three jobs. Assuming that the job ID reported by sbatch is 42, the parts of the array will have the following environment variables set:

   # array index 1
   SLURM_JOBID=42
   SLURM_ARRAY_JOB_ID=42
   SLURM_ARRAY_TASK_ID=1

   # array index 2
   SLURM_JOBID=43
   SLURM_ARRAY_JOB_ID=42
   SLURM_ARRAY_TASK_ID=2

   # array index 3
   SLURM_JOBID=44
   SLURM_ARRAY_JOB_ID=42
   SLURM_ARRAY_TASK_ID=3

Some additional options are available to specify the stdin, stdout and stderr file names: %A will be replaced by the value of SLURM_ARRAY_JOB_ID and %a will be replaced by the value of SLURM_ARRAY_TASK_ID.

The following example creates a job array of 42 jobs with indices 0-41. Each job will run on a separate node with 24 tasks per node. Depending on the queuing situation, some jobs may be running and some may be waiting in the queue. Each part of the job array will execu...
33. ...of large data sets and scripts, of output from running applications, and of frequently accessed data.

                          HOME                              WORK                                SCRATCH
   quota                  24 GB                             according to annual                 15 TB
                                                            project allocation
   backup                 yes (please contact DKRZ user's   no                                  no
                          consultancy to restore files
                          deleted by mistake)
   automatic data         no                                no                                  yes
   deletion
   data lifetime          until user account deletion       1 month after project expiration    14 days since the last file access time

   Table 1.2: MISTRAL file system configuration

1.3.1 Data Migration from Blizzard

- The users' home directories from Blizzard have been copied to MISTRAL under /mnt/lustre01/rsync/pf. The last copy was made on August 1st, 2015. Please copy the files you need to your actual home directory on MISTRAL: /pf/[a,b,g,k,m,u]/<userid>.
- /pool/data is mirrored from Blizzard to the same directory on MISTRAL.
- The project directories on /work have been copied to MISTRAL under /mnt/lustre01/rsync/work. The last copy was made on August 1st, 2015. Please move all data you want to keep to your actual project directory /work/<projectid>.

1.4 Access to the Cluster

The High Performance Computing system MISTRAL can only be accessed via the Secure Shell (SSH) network protocol. For file transfer between different hosts, SSH provides SCP and SFTP.

1.4.1 Login

You can log into MISTRAL with the following ssh command, replacing <userid> by your username:

   bash$ ssh <userid>@mistral.dkrz.de

After having l...
34. ...ogged into MISTRAL, you will find yourself on one of the eight login nodes mlogin100-mlogin107. The login nodes serve as front end to the compute nodes of the HPC cluster. They are intended for file editing and compilation of source code, as well as for submitting, monitoring and cancelling batch jobs. They can also be used for non time- and memory-intensive serial processing tasks. Routine data analysis and visualization, however, have to be performed on the pre-/postprocessing nodes or on the visualization servers. For interactive testing and debugging of parallel programs, you can use the SLURM salloc command to allocate the required number of nodes.

1.4.2 Password

All DKRZ systems are managed by the LDAP protocol. The password can be changed through the DKRZ online services. A user-defined password must contain at least eight non-blank characters and must be a combination of upper- and lower-case letters, numbers and special characters. In case you do not remember your password, please contact DKRZ user's consultancy. Members of MPI and UniHH/CEN should contact CIS (CEN-IT).

1.4.3 Login Shell

The default login shell for new DKRZ users is bash. You can change your login shell to tcsh or ksh using the DKRZ online services. The settings you would like to use every time you log in can be put into special shell setup files. A login bash shell looks for .bash_profile, .bash_login or .profile in your home directory and executes commands from the first file found.
35. ...ollowing four partitions are currently defined on MISTRAL:

compute   This is the default partition, consisting of 1496 compute nodes and intended for running parallel scientific applications. The compute nodes allocated for a job are used exclusively and cannot be shared with other jobs.

shared    This partition is defined on 100 nodes and can be used to run small jobs that do not require a whole node for their execution, so that one compute node can be shared between different jobs. The partition is dedicated to the execution of shared-memory applications parallelized with OpenMP or pthreads, as well as to serial and parallel data processing jobs.

prepost   The prepost partition is made up of 48 large-memory nodes. It is dedicated to memory-intensive data processing jobs. Furthermore, interactive usage of nodes is permitted on this partition. If over-subscription is explicitly requested by the user (using the --share option on job submission), resources can be shared with other jobs.

gpu       The 12 nodes in this partition are additionally equipped with Nvidia Tesla K80 GPUs and can be used for 3-dimensional data visualization or for the execution of applications ported to GPUs. The nodes in this partition will replace the Halo cluster in the future.

The limits configured for the different partitions are listed in the table below:

   partition      compute     prepost    shared     gpu
   MaxNodes       512         2          1          1
   MaxTime        8 hours     4 hours    7 days     4 hours
   Shared ...
36. ...option(s) used, etc. in the executable
   -ipo                  Inter-procedural optimization
   -xAVX or -xCORE-AVX2  Indicates the processor for which code is created
   -help                 Gives a long list of quite a big amount of options

   Table 2.4: Intel compiler options

2.2.1 Compilation Examples

Compile a hybrid MPI/OpenMP program using the Intel Fortran compiler and bullx MPI with MXM and FCA:

   module add intel mxm fca bullxmpi_mlx
   mpif90 -openmp -O2 -xCORE-AVX2 -fp-model source -o mpi_omp_prog program.f90

Compile an MPI program in Fortran using the Intel Fortran compiler and Intel MPI:

   module add intel intelmpi
   mpiifort -O2 -xCORE-AVX2 -fp-model source -o mpi_prog program.f90

2.2.2 Recommendations

Intel Compiler

Using the compiler option -xCORE-AVX2 (resp. -xHost) causes the Intel compiler to use full AVX2 support (vectorization with FMA instructions), which might result in binaries that do not produce MPI-decomposition-independent results. Switching to -xAVX should solve this issue, but results in up to 15% slower runtime.

MPI

The bullx MPI was used throughout for the benchmarks of the HLRE-3 procurement. From the BULL/ATOS point of view, a good environment is to use bullxMPI_mlx with MXM, i.e. load the specific environment before compiling:

   module add intel mxm/3.3.3002 fca/2.5.2379 bullxmpi_mlx/bullxmpi_mlx-1.2.8.3
   mpif90 -O2 -xCORE-AVX2 -o mpi_prog program.f90

On...
37.
   option                                   default value        description
   --nodes=<number>, -N <number>            1                    Number of nodes for the allocation
   --ntasks=<number>, -n <number>           1                    Number of tasks (MPI processes). Can be omitted if --nodes and --ntasks-per-node are given.
   --ntasks-per-node=<number>               1                    Number of tasks per node. If the keyword is omitted, the default value is used, but there are still 48 CPUs available per node for the current allocation (if not shared).
   --cpus-per-task=<number>, -c <number>    1                    Number of threads (logical cores) per task. Used mainly for OpenMP or hybrid jobs.
   --output=<path>/<file pattern>, -o ...   slurm-%j.out         Standard output file
   --error=<path>/<file pattern>, -e ...    slurm-%j.out         Standard error file
   --time=<walltime>, -t <walltime>         partition dependent  Requested walltime limit for the job
   --partition=<name>, -p <name>            compute              Partition to run the job
   --mail-user=<email>                      username             Email address for notifications
   --mail-type=<mode>                       NONE                 Event types for email notifications. Possible values are NONE, BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT.
   --job-name=<jobname>, -J <jobname>       job script's name    Job name
   --account=<project>, -A <project>        none                 Project that should be charged
   --requeue / --no-requeue                                      Specifies whether the batch job should be requeued after a node failure. Whe...
38. ...scripts in order to have full control of the environment of the jobs.

Chapter 4: SLURM Usage

This chapter serves as an overview of the user commands provided by SLURM and of how users should use the SLURM batch system in order to run jobs on MISTRAL. For a comparison to LoadLeveler commands see http://slurm.schedmd.com/rosetta.pdf, or read the more detailed description in each command's man page. A concise cheat sheet for SLURM can be downloaded here: http://slurm.schedmd.com/pdfs/summary.pdf

4.1 SLURM Command Overview

SLURM offers a variety of user commands for all the necessary actions concerning jobs. With these commands the users have a rich interface to allocate resources, query job status, control jobs, manage accounting information and simplify their work with some utility commands. For examples of how to use these commands, see Chapter 4.5.

sinfo   show information about all partitions and nodes managed by SLURM, as well as about the general system state. It has a wide variety of filtering, sorting and formatting options.

squeue  query the list of pending and running jobs. By default it reports the list of pending jobs sorted by priority and the list of running jobs sorted separately according to the job priority. The most relevant job states are running (R), pending (PD), completing (CG), completed (CD) and cancelled (CA). The TIME field shows the actual job execution time. The NODELIST(REASON) field indicates on which nodes the job is ...
39.
   bash$ squeue
   JOBID PARTITION  NAME      USER     ST  TIME  NODES  NODELIST(REASON)
   13194 compute    MR_2_01P  k203059  PD  0:00  13     (PartitionTimeLimit)
   13263 compute    LR0014_r  k208024  R   4:03  16     m[10002-10017]

Check the queue for one user only:

   bash$ squeue -u $USER
   JOBID PARTITION  NAME      USER     ST  TIME  NODES  NODELIST(REASON)
   13263 compute    LR0014_r  k208024  R   4:03  16     m[10002-10017]

Depending on the options, the sinfo command will print the states of the partitions and of the nodes. The partitions may be in state UP, DOWN or INACTIVE. The UP state means that a partition will accept new submissions and the jobs will be scheduled. The DOWN state allows submissions to a partition, but the jobs will not be scheduled. The INACTIVE state means that no submissions are allowed.

   bash$ sinfo
   PARTITION  AVAIL  TIMELIMIT   NODES  STATE   NODELIST
   compute*   up     8:00:00     31     maint   m[11440-11470]
   compute*   up     8:00:00     13     down*   m[10000,10278,10286,10438]
   compute*   up     8:00:00     812    idle    m[10001-10034,10036-10041]
   gpu        up     4:00:00     1      down    mg102
   gpu        up     4:00:00     11     idle    mg[100-101,103-111]

The nodes can also be in various states. The node state code may be shortened according to the size of the printed field. The following shows the most common node states:

   alloc   ALLOCATED          The node has been allocated.
   comp    COMPLETING         The job associated with this node is in the state COMPLETING.
   down    DOWN               The node is unavailable for use.
   drain   DRAINING/DRAINED   W...
40. ...systems

On MISTRAL we provide the Lustre parallel file system version 2.5. Users have access to three different storage spaces: HOME, WORK and SCRATCH. Each storage area has a specific purpose as described below.

HOME is the file system where users' sessions start upon login to MISTRAL. It is backed up and should be used to store shell setup files, source codes, scripts and important files.

WORK is a project space available through the allocation process and shared between all users of a project. It provides disk space for large amounts of data, but it is not backed up. It can be used e.g. for writing raw model output and for processing of data that is accessible to all project members.

SCRATCH is provided for temporary storage and processing of large data. To prevent the file system from overflowing, old data is automatically deleted; the granted retention period is 14 days.

All file systems are available on all nodes (login and compute), so you can use them during interactive sessions and in batch jobs. The table below provides further details on the available file systems.

                 HOME                               WORK                  SCRATCH
   path          /pf/[a,b,g,k,m,u]/<userid>         /work/<project>       /scratch/[a,b,g,k,m,u]/<userid>
   envVar        HOME                               —                     —
   description   Assigned to user account.          Assigned to           Assigned to user account.
                 Storage of personal settings       project account.      Temporary storage and
                 files, source codes                Interim storage       processing ...
41. ...te the same binary but with different input files:

   #!/bin/bash
   #SBATCH --nodes=1
   #SBATCH --output=prog.%A_%a.out
   #SBATCH --error=prog.%A_%a.err
   #SBATCH --time=00:30:00
   #SBATCH --array=0-41
   #SBATCH --account=x12345

   srun --ntasks-per-node=24 prog input_${SLURM_ARRAY_TASK_ID}.txt

4.5 SLURM Command Examples

4.5.1 Query Commands

Normally, jobs will pass through several states during their life cycle. Typical job states from submission until completion are PENDING (PD), RUNNING (R), COMPLETING (CG) and COMPLETED (CD). However, there are plenty of possible job states in SLURM. The following describes the most common states:

   CA  CANCELLED    Job was explicitly cancelled by the user or an administrator. The job may or may not have been initiated.
   CD  COMPLETED    Job has terminated all processes on all nodes.
   CF  CONFIGURING  Job has been allocated resources, but is waiting for them to become ready for use.
   CG  COMPLETING   Job is in the process of completing. Some processes on some nodes may still be active.
   F   FAILED       Job terminated with a non-zero exit code or another failure condition.
   NF  NODE_FAIL    Job terminated due to failure of one or more allocated nodes.
   PD  PENDING      Job is awaiting resource allocation.
   R   RUNNING      Job currently has an allocation.
   TO  TIMEOUT      Job terminated upon reaching its walltime limit.

Some examples of how users can query their job status: ba...
42. tion 4 3 there are examples commands and job scripts on how to use HT or not 27 139 98 1 80 S LOZI8 90 UY s1eg jeasiyd sexepu 1919 150H ayze pU ooo oran orn ene pi ayze PLT ayze PLT ayze pU ayze pU ayze PLT awoe 7 Yed Nd 8E d Nd 2E d Nd 964d Nd DZS ltd HS3PoS 189r9 l d SPONYANN cera na zerana tera na DE d Md 62d Nd 82 d Md it d Md 9d Nd St d Nd bad Md aioe 1 O d 1890S anta O d spou gin 19982 U suyo Figure 4 1 Schematic illustration of compute nodes 4 4 2 Process and Thread Binding OpenMP jobs Thread binding is done via Intel runtime library using the KMP_AFFINITY environment variable The syntax is KMP_AFFINITY lt offset gt lt modifier gt lt type gt lt permute gt with e modifier verbose giving detailed output on how binding was done 28 granularity core reserve full physical cores i e two logical CPUs to run threads on granularity thread fine reserve logical CPUs HyperThreads to run threads e type compact places the threads as close to each other as possible scatter distributes the threads as evenly as possible across the entire allocation e permute controls which levels are most significant when sorting the machine topol ogy map i e 0O CPUs default 1 cores 2 sockets LLC e offset indicates the starting position for thread assignm
43. to partitions which are sets of nodes with associated limits for wall clock time job size etc These limits are hard limits for the jobs and can not be overruled Jobs are the allocations of resources by the users in order to execute tasks on the cluster for a specified period of time Furthermore the concept of jobsteps is used by SLURM to describe a set of different tasks within the job One can imagine jobsteps as smaller allocations or jobs within the job which can be executed sequentially or in parallel during the main job allocation The SLURM sinfo command lists all partitions and nodes managed by SLURM on MISTRAL as well as provides general information about the current nodes status bash sinfo PARTITION AVAIL TIMELIMIT NODES STATE NODELIST compute up 8 00 00 13 down m 10000 10278 10286 10438 compute up 8 00 00 168 alloc m 10036 10045 10108 10125 compute up 8 00 00 1315 idle m 10001 10035 10046 10107 prepost up 4 00 00 3 drain m 11518 11532 11554 prepost up 4 00 00 45 idle m 11512 11517 11519 11531 shared up 7 00 00 00 1 down m10000 shared up 7 00 00 00 28 alloc m 10036 10045 11314 11331 shared up 7 00 00 00 71 idle m 10001 10035 10046 10049 gpu up 4 00 00 1 down mg102 gpu up 4 00 00 11 idle mg 100 101 103 111 For detailed information about all available partitions and their limits use the SLURM scontrol command as follows scontrol show partition The f
44.
   #SBATCH --time=00:30:00          # Set a limit on the total run time
   #SBATCH --mail-type=FAIL         # Notify user by email
   #SBATCH --mail-user=your_email   # Set your e-mail address
   #SBATCH --account=xz0123         # Charge resources on project account
   #SBATCH --output=my_job.o%j      # File name for standard output
   #SBATCH --error=my_job.e%j       # File name for standard error output

   # Environment settings to run a MPI parallel program compiled with
   # bullx MPI and Mellanox libraries; load environment
   module load intel
   module load mxm/3.3.3002
   module load fca/2.5.2379
   module load bullxmpi_mlx/bullxmpi_mlx-1.2.8.3

   # Settings for Open MPI and MXM (MellanoX Messaging) library
   export OMPI_MCA_pml=cm
   export OMPI_MCA_mtl=mxm
   export OMPI_MCA_mtl_mxm_np=0
   export MXM_RDMA_PORTS=mlx5_0:1
   export MXM_LOG_LEVEL=ERROR
   # Disable GHC algorithm for collective communication
   export OMPI_MCA_coll=^ghc

   # Environment settings to run a MPI parallel program compiled with Intel MPI;
   # load environment
   module load intelmpi
   export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so

   # Use srun (not mpirun or mpiexec) to launch programs compiled with any MPI library
   srun -l --cpu_bind=cores --distribution=block:cyclic myprog

MPI job with HyperThreading

The following example allocates 4 full nodes and uses 48 logical CPUs per node; the total number of tasks is 192.

   #!/bin/bash
   #SBATCH --job-name=my_job        # Specify job name
   #SBATCH --partition=compute      # Spec...
45. type FAIL Notify user by email in case of job failure BATCH mail user you email Set your e mail address 5BATCH account x12345 Charge resources on this project account execute serial programs e g cdo lt operator gt lt ifile gt lt ofile gt Note The shared partition has a limit of 1280MB memory per CPU In case your serial job needs more memory you have to increase the number of tasks using option ntasks although you might not use all these CPUs OpenMP job without HyperThreading bin bash ASBATCH job name my_job Specify job name BATCH partition shared Specify partition name BATCH ntasks 1 Specify max number of tasks to be invoked FSBATCH cpus per task 16 Specify number of CPUs per task BATCH time 00 30 00 Set a limit on the total run time ASBATCH account x12345 Charge resources on this project account bind your OpenMP threads export OMP_NUM_THREADS 8 export KMP_AFFINITY verbose granularity core compact 1 export KMP_STACKSIZE 64M execute OpenMP programs e g cdo P 8 lt operator gt lt ifile gt lt ofile gt Note You need to specify the value of cpus per task as multiple of Hyper Threads HT The environment variable KMP_AFFINITY needs to be set correspondingly Whether HT is used or not is defined via the envVar KMP_AFFINITY see 4 4 2 for details OpenMP job with HyperThreading 441 b
46. ...w software is built, and can be found at https://www.dkrz.de/Nutzerportal-en/doku/mistral/softwarelist

   type       modules available
   compiler   intel — Intel compilers with frontends for C, C++ and Fortran
              gcc — GNU compiler suite
              nag — NAG compiler
   MPI        intelmpi — Intel MPI
              bullxmpi — bullx MPI with/without Mellanox libraries
              mvapich2 — MVAPICH2 (an MPI-3 implementation)
              openmpi — OpenMPI
   tools      allinea-forge — Allinea DDT debugger and MAP profiler
              cdo — Climate Data Operators, command line operators to manipulate and analyse climate and NWP model data
              ncl — NCAR Command Language
              ncview — visual browser for netCDF format files
              python — Python

   Table 2.1: MISTRAL module overview

2.1.2 Using the Module Command

Users can load, unload and query modules through the module command. The most important module sub-commands are listed in the table below:

   command                                                    description
   module avail                                               Shows the list of all available modules
   module show <modname>/<version>                            Shows the environment changes the modulefile <modname>/<version> will cause if loaded
   module add <modname>/<version>                             Loads a specific module. The default version is loaded if the version is not given.
   module list                                                Lists all modules currently loaded
   module rm <modname>/<version>                              Unloads a module
   module purge                                               Unloads all modules
   module switch <modname>/<version1> <modname>/<version2>    Replaces one module with another
47. ...y be active at a time. The default governor is "ondemand", which allows the operating system to scale down the CPU frequency on the compute nodes to 1.2 GHz if they are in idle state. The user can set the governor to "userspace" in order to allow for different CPU frequencies. To do so, the batch job needs to define the desired behaviour via the environment variable SLURM_CPU_FREQ_REQ or via the srun option --cpu-freq.

To set a fixed frequency of 2.5 GHz (2500000 kHz) use:

   export SLURM_CPU_FREQ_REQ=2500000

Other allowed frequencies are 1.2, 1.3, ..., 2.5 GHz. To enable automatic frequency scaling depending on the workload use:

   export SLURM_CPU_FREQ_REQ=ondemand

By default, srun configures all CPUs to run at a fixed frequency of 2.5 GHz in order to get similar wallclock runtimes between different jobs if no options or binaries are changed.

4.4.5 Job Steps

Job steps can be thought of as small allocations or jobs inside the current job allocation. Each call of srun creates a job step, which implies that one job allocation given via sbatch can have one or several job steps executed in parallel or sequentially. Instead of submitting many single-node jobs, the user might also use job steps inside a single job having multiple nodes allocated. A job using job steps will be accounted for all the nodes of the allocation, regardless of whether all nodes are used for job steps or not. The following example uses job steps to execute MPI programs ...