Bright Cluster Manager 6.0 User Manual
Revision: 3790
Date: Wed, 15 May 2013

Table of Contents

1 Introduction ............................................... 1
  1.1 What Is A Beowulf Cluster? .............................. 1
  1.2 Brief Network Description ............................... 2
2 Cluster Usage .............................................. 5
  2.1 Login To The Cluster Environment ........................ 5
  2.2 Setting Up The User Environment ......................... 6
  2.3 Environment Modules ..................................... 6
  2.4 Compiling Applications .................................. 9
3 Using MPI ................................................. 11
  3.1 Interconnects .......................................... 11
  3.2 Selecting An MPI Implementation ........................ 12
  3.3 Example MPI Run ........................................ 12
4 Workload Management ....................................... 17
  4.1 What Is A Workload Manager? ............................ 17
  4.2 Why Use A Workload Manager? ............................ 17
  4.3 What Does A Workload Manager Do? ....................... 17
  4.4 Job Submission Process ................................. 18
  4.5 What Do Job Scripts Look Like? ......................... 18
  4.6 Running Jobs On A Workload Manager ..................... 18
  4.7 Running Jobs In Cluster Extension Cloud Nodes Using cmsub 19
5 SLURM ..................................................... 21
  5.1 Loading SLURM Modules And Compiling The Executable ..... 21
  5.2 Running The Executable With salloc ..................... 22
  5.3 Running The Executable As A SLURM Job Script ........... 24
6 SGE ....................................................... 29
  6.1 Writing A Job Script ................................... 29
  6.2 Submitting A Job ....................................... 33
  6.3 Monitoring A Job ....................................... 34
  6.4 Deleting A Job ......................................... 35
7 PBS Variants: Torque And PBS Pro
module add mvapich/gcc/64/1.2rc1

Depending on the libraries and compilers installed on the system, the availability of these packages might differ. To see a full list on the system, the command

module avail

can be typed.

3.3.1 Compiling And Preparing The Application
The code must be compiled with MPI libraries and an underlying compiler. The correct compiler wrapper command can be found in the following table:

Language   C      C++    Fortran77  Fortran90  Fortran95
Command    mpicc  mpiCC  mpif77     mpif90     mpif95

The following example uses MPI with a C compiler:

mpicc myapp.c

This creates a binary a.out which can then be executed using the mpirun command.

3.3.2 Creating A Machine File
A machine file contains a list of nodes which can be used by an MPI program.
The workload management system creates a machine file based on the nodes allocated for a job when the job is submitted with the workload manager job submission tool. So if the user chooses to have the workload management system allocate nodes for the job, then creating a machine file is not needed.
However, if an MPI application is being run "by hand" outside the workload manager, then the user is responsible for creating a machine file manually. Depending on the MPI implementation, the layout of this file may differ.
Machine files can generally be created in two ways:
• Listing the same node several times
The man page for cmsub gives details on the cloud-related option values.

SLURM

SLURM (Simple Linux Utility for Resource Management) is a workload management system developed originally at the Lawrence Livermore National Laboratory. It has both a graphical interface and command line tools for submitting, monitoring, modifying and deleting jobs.
SLURM is normally used with job scripts to submit and execute jobs. Various settings can be put in the job script, such as number of processors, resource usage and application-specific variables.
The steps for running a job through SLURM are to:
• Create the script or executable that will be handled as a job
• Create a job script that sets the resources for the script/executable
• Submit the job script to the workload management system
The details of SLURM usage depend upon the MPI implementation used. The description in this chapter covers using SLURM's Open MPI implementation, which is quite standard. SLURM documentation can be consulted (https://computing.llnl.gov/linux/slurm/mpi_guide.html) if the implementation the user is using is very different.

5.1 Loading SLURM Modules And Compiling The Executable
In section 3.3.3 an MPI "Hello, world!" executable that can run in parallel is created and run in parallel outside a workload manager.
The executable can be run in parallel using the SLURM workload manager. For this, the SLURM module should first be
4.    gpus   1    node002 cm cluster  state   free  np   3    gpus   1    For PBS Pro the display resembles  some output elided       fred bright52     pbsnodes  a  node001 cm cluster  Mom   node001 cm cluster  ntype   PBS  state   free  pcpus   3    resources_available arch   linux  resources_available host   node001    sharing   default_shared    node002 cm cluster  Mom   node002 cm cluster  PBS    state   free    ntype      Bright Computing  Inc     Using GPUs    GPUs  Graphics Processing Units  are chips that provide specialized parallel pro   cessing power  Originally  GPUs were designed to handle graphics processing  as part of the video processor  but their ability to handle non graphics tasks in a  similar manner has become important for general computing  GPUs designed for  general purpose computing task are commonly called General Purpose GPUs  or  GPGPUs    A GPU is suited for processing an algorithm that naturally breaks down into  a process requiring many similar calculations running in parallel    Physically  one GPU is typically a built in part of the motherboard of a node  or a board in a node  and consists of hundreds of processing cores  There are also  dedicated standalone units  commonly called GPU Units  consisting of several  GPUs in one chassis  and which are typically assigned to particular nodes via  PCI Express connections    Bright Cluster Manager contains several tools which can be used to set up and  program GPUs for general purpose computations     
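To make the "many similar calculations running in parallel" pattern concrete, the following is a minimal, self-contained CUDA sketch. It is not part of the Bright Cluster Manager tools; the kernel name vecAdd and all variable names are invented for illustration. Each GPU thread adds one pair of array elements, and the kernel is launched over enough thread blocks to cover the whole array.

#include <cuda_runtime.h>
#include <stdio.h>

// Each GPU thread performs the same small calculation on one element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1024;
    size_t size = n * sizeof(float);
    float a[1024], b[1024], c[1024];
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    // Allocate device memory and copy the input arrays to the GPU
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements
    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    // Copy the result back and free device memory
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);

    printf("c[0] = %f\n", c[0]);   /* expect 3.000000 */
    return 0;
}

Assuming it is saved as, say, vecadd.cu, it could be compiled with the nvcc compiler described in section 8.4 (nvcc vecadd.cu -o vecadd) and run on a node with a GPU.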
SLURM Environment Variables
Available environment variables include:

SLURM_CPUS_ON_NODE - processors available to the job on this node
SLURM_JOB_ID - job ID of executing job
SLURM_LAUNCH_NODE_IPADDR - IP address of node where job launched
SLURM_NNODES - total number of nodes
SLURM_NODEID - relative node ID of current node
SLURM_NODELIST - list of nodes allocated to job
SLURM_NTASKS - total number of processes in current job
SLURM_PROCID - MPI rank (or relative process ID) of the current process
SLURM_SUBMIT_DIR - directory from which the job was launched
SLURM_TASK_PID - process ID of task started
SLURM_TASKS_PER_NODE - number of tasks to be run on each node
CUDA_VISIBLE_DEVICES - which GPUs are available for use

Typically, end users use SLURM_PROCID in a program so that an input of a parallel calculation depends on it. The calculation is thus spread across processors according to the assigned SLURM_PROCID, so that each processor handles the parallel part of the calculation with different values.
More information on environment variables is also to be found in the man page for sbatch.

5.3.4 Submitting The SLURM Job Script
Submitting a SLURM job script created like in the previous section is done by executing the job script with sbatch:

[fred@bright52 ~]$ sbatch slurmhello.sh
Submitted batch job 703
[fred@bright52 ~]$ cat slurm-703.out
Hello w
can be seen with the output of sbatch --help. The more overviewable usage output from sbatch --usage may also be helpful.
Some of the more useful ones are listed in the following table:

Description                                            Specified As
Name the job <jobname>                                 #SBATCH -J <jobname>
Request at least <minnodes> nodes                      #SBATCH -N <minnodes>
Request <minnodes> to <maxnodes> nodes                 #SBATCH -N <minnodes>-<maxnodes>
Request at least <MB> amount of temporary disk space   #SBATCH --tmp <MB>
Run the job for a time of <walltime>                   #SBATCH -t <walltime>
Run the job at <time>                                  #SBATCH --begin <time>
Set the working directory to <directorypath>           #SBATCH -D <directorypath>
Set error log name to <jobname.err>                    #SBATCH -e <jobname.err>
Set output log name to <jobname.log>                   #SBATCH -o <jobname.log>
Mail <user@address>                                    #SBATCH --mail-user=<user@address>
Mail on any event                                      #SBATCH --mail-type=ALL
Mail on job end                                        #SBATCH --mail-type=END
Run job in partition                                   #SBATCH -p <destination>
Run job using GPU with ID <number> (section 8.5.2)     #SBATCH --gres=gpu:<number>

By default, both standard output and standard error go to a file "slurm-<j>.out", where <j> is the job number.

5.3.3
compiled with MPI libraries runs on nodes in parallel when submitted with mpirun. When using mpirun manually, outside a workload manager environment, the number of processes (-np) as well as the number of hosts (-machinefile) should be specified. For example, on a 2-compute-node, 4-processor cluster:

Example:

[fred@bright52 ~]$ module add mvapich/gcc/64/1.2rc1   (or as appropriate)
[fred@bright52 ~]$ mpirun -np 4 -machinefile mpirun.hosts -nolocal hello
Hello world from process 003 out of 004, processor name node002.cm.cluster
Hello world from process 002 out of 004, processor name node001.cm.cluster
Hello world from process 001 out of 004, processor name node002.cm.cluster
Hello world from process 000 out of 004, processor name node001.cm.cluster

Here, the -nolocal option prevents the executable running on the local node itself, and the file mpirun.hosts is a list of node names:

Example:

[fred@bright52 ~]$ cat mpirun.hosts
node001
node002

Running the executable with mpirun as shown does not take the resources of the cluster into account. To handle running jobs with cluster resources is of course what workload managers such as SLURM are designed to do. Running an application through a workload manager is introduced in Chapter 4.
Appendix A contains a number of simple MPI programs.

Workload Management

4.1 What Is A Workload Manager?
A wo
directory.

8.4 Compiling Code
Both CUDA and OpenCL involve running code on different platforms:
• host: with one or more CPUs
• device: with one or more CUDA-enabled GPUs
Accordingly, both the host and device manage their own memory space, and it is possible to copy data between them. The CUDA and OpenCL Best Practices Guides in the doc directory, provided by the CUDA toolkit package, have more information on how to handle both platforms and their limitations.
The nvcc command by default compiles code and links the objects for both the host system and the GPU. The nvcc command distinguishes between the two and can hide the details from the developer. To compile the host code, nvcc will use gcc automatically.

nvcc [options] <inputfile>

A simple example to compile CUDA code to an executable is:

nvcc testcode.cu -o testcode

The most used options are:
• -g or --debug <level>: this generates debuggable code for the host
• -G or --device-debug <level>: this generates debuggable code for the GPU
• -o or --output-file <file>: this creates an executable with the name <file>
• -arch=sm_13: this can be enabled if the CUDA device supports compute capability 1.3, which includes double precision
If double-precision floating point is not supported or the flag is not set, warnings such as the following will come up:

warning : Double is not supported. Demoting to float

The nvcc documentation manual, "The CUDA Co
9.  e  JOBID   myapp   ot  JOBID     6 2 1 Submitting To A Specific Queue  Some clusters have specific queues for jobs which run are configured to  house a certain type of job  long and short duration jobs  high resource  jobs  or a queue for a specific type of node    To see which queues are available on the cluster the qstat command  can be used     qstat  g c  CLUSTER QUEUE CQLOAD USED RES AVAIL TOTAL aoACDS cdsuE       Bright Computing  Inc        34 SGE  long q 0 01 0 0 144 288 0 144  default q 0 01 0 0 144 288 0 144    The job is then submitted  for example to the long  q queue     qsub  q long q sleeper sh    6 3 Monitoring A Job    The job status can be viewed with qstat  In this example the sleeper  sh  script has been submitted  Using qstat without options will only display  a list of jobs  with no queue status options       qstat  job ID prior name user state submit start at queue slots  249 0 00000 Sleeperi root qw 12 03 2008 07 29 00 1  250 0 00000 Sleeperi root qw 12 03 2008 07 29 01 1  251 0 00000 Sleeperi root qw 12 03 2008 07 29 02 1  252 0 00000 Sleeperi root qw 12 03 2008 07 29 02 1  253 0 00000 Sleeperi root qw 12 03 2008 07 29 03 1    More details are visible when using the      for full  option      The Queuetype qtype can be Batch  B  or Interactive  I      e The used tot or used free column is the count of used  free slots  in the queue        The states column is the state of the queue       qstat  f   queuename qtype used tot  load_avg arch states  all 
encoded string appear less than expected because there are unprintable characters in the encoding, due to the cipher used being not exactly rot13.

User Portal

The user portal allows users to login via a browser and view the state of the cluster themselves. It is a read-only interface.
The first time a browser is used to login to the cluster portal, a warning about the site certificate being untrusted appears in a default Bright Cluster configuration. This can safely be accepted.

9.0.5 Home Page
The default home page allows a quick glance to convey the most important cluster-related information for users (figure 9.1).

[Screenshot: browser view of the user portal home page, showing the message-of-the-day, documentation, contact, cluster overview, and workload overview panels.]
going to be done by relying on defaults. Instead, node specifications are supplied to SLURM along with the executable.
To understand SLURM node specifications, the following cases consider and explain where the node specification is valid and invalid.

Number of nodes requested: The value assigned to the -N|--nodes option is the number of nodes from the cluster that is requested for allocation for the executable. In the current cluster example it can only be 1. For a cluster with, for example, 1000 nodes, it could be a number up to 1000.
A resource allocation request for 2 nodes with the --nodes option causes an error on the current 1-node cluster example:

[fred@bright52 ~]$ salloc -N2 mpirun hello
salloc: error: Failed to allocate resources: Node count specification invalid
salloc: Relinquishing job allocation 573

Number of tasks requested per cluster: The value assigned to the -n|--ntasks option is the number of tasks that are requested for allocation from the cluster for the executable. In the current cluster example, it can be 1 to 4 tasks. The default resources available on a cluster are the number of available processor cores.
A resource allocation request for 5 tasks with the --ntasks option causes an error because it exceeds the default resources available on the 4-core cluster:

[fred@bright52 ~]$ salloc -n5 mpirun hello
salloc: error: Failed to allocate resources: More processors requested than permitted

Adding a
12.  run via the job script   e Creating the job script  adding directives  applications  runtime pa   rameters  and application specific variables to the script   e Submitting the script to the workload management system    This chapter covers the using the workload managers and job scripts  with the PBS variants so that users can get a basic understanding of how  they are used  and can get started with typical cluster usage    In this chapter        section 7 1 covers the components of a job script and job script ex   amples       section 7 2 1 covers submitting  monitoring  and deleting a job with  a job script    More depth on using these workload managers is to be found in the  PBS Professional User Guide and in the online Torque documentation at  http    www adaptivecomputing com resources docs         Bright Computing  Inc     Pro    38    PBS Variants  Torque And PBS Pro       7 1 Components Of A Job Script    To use Torque or PBS Pro  a batch job script is created by the user  The  job script is a shell script containing the set of commands that the user  wants to run  It also contains the resource requirement directives and  other specifications for the job  After preparation  the job script is submit   ted to the workload manager using the qsub command  The workload  manager then tries to make the job run according to the job script specifi   cations    A job script can be resubmitted with different parameters  e g  differ   ent sets of data or variables      7 1 1 Sam
13.  the number assigned by SGE when the job is submitted  using qsub  Only jobs belonging to the logged in user can be deleted  Us   ing qdel will delete a user s job regardless of whether the job is running  or in the queue     O Bright Computing  Inc     PBS Variants  Torque And PBS    Bright Cluster Manager works with Torque and PBS Pro  which are two  forks of Portable Batch System  PBS   PBS was a workload management  and job scheduling system first developed to manage computing re   sources at NASA in the 1990s    Torque and PBS Pro can differ significantly in the output they present  when using their GUI visual tools  However because of their historical  legacy  their basic design structure and job submission methods from the  command line remain very similar for the user  Both Torque and PBS  Pro are therefore covered in this chapter  The possible Torque schedulers   Torque s built in scheduler  Maui  or Moab  are also covered when dis   cussing Torque    Torque and PBS Pro both offer a graphical interface and command line  tools for submitting  monitoring  modifying and deleting jobs    For submission and execution of jobs  both workload managers use  PBS    job scripts     The user puts values into a job script for the resources  being requested  such as the number of processors  memory  Other values  are also set for the runtime parameters and application specific variables    The steps for running a job through a PBS job script are     e Creating an application to be
update jobid=254 EligibleTime=2011-10-18T22:00:00

An approximate GUI SLURM equivalent to scontrol is the sview tool. This allows the job to be viewed under its jobs tab, and the job to be edited with a right-click menu item. It can also carry out many other functions, including canceling a job.
Web-browser-accessible job viewing is possible from the workload tab of the User Portal (section 9.0.6).

SGE

Sun Grid Engine (SGE) is a workload management and job scheduling system first developed to manage computing resources by Sun Microsystems. SGE has both a graphical interface and command line tools for submitting, monitoring, modifying and deleting jobs.
SGE uses job scripts to submit and execute jobs. Various settings can be put in the job script, such as number of processors, resource usage and application-specific variables.
The steps for running a job through SGE are to:
• Create a job script
• Select the directives to use
• Add the scripts and applications and runtime parameters
• Submit it to the workload management system

6.1 Writing A Job Script
A binary cannot be submitted directly to SGE; a job script is needed for that. A job script can contain various settings and variables to go with the application. A job script format looks like:

#!/bin/bash
#$ Script options      # Optional script directives
shell commands         # Optional shell commands
application            # Application itself

6.1.1 Directives
I
15.  used  by a user  the path to a package  can be set    Because there is a huge choice of software packages and versions   it can be hard to set up the right environment variables and paths for  software that is to be used  To make setting up the environment easier   Bright Cluster Manager provides preconfigured environment modules   section 2 3      2 3 Environment Modules    It can be quite hard to set up the correct environment to use a particular  software package and version    For instance  managing several MPI software packages on the same  system or even different versions of the same MPI software package is  quite difficult for most users on a standard SUSE or Red Hat system be   cause many software packages use the same names for executables and  libraries    A user could end up with the problem of never being quite sure which  libraries have been used for the compilation of a program as multiple li   braries with the same name may be installed  Very often a user would like  to test new versions of a software package before permanently installing  the package  Within Red Hat or SuSE this would be quite a complex task  to achieve  Environment modules  using the module command  make this  process much easier    2 3 1 Available commands    module help    Modules Release 3 2 6 2007 02 14  Copyright GNU GPL v2 1991      Usage  module   switches     subcommand    subcommand args      Switches    H   help this usage info   V   version modules version  amp  configuration opt
16. 8 1 Packages    A number of different GPU related packages are included in Bright Cluster Man   ager  For CUDA 4 0 these are       e cuda40 driver  Provides the GPU driver   e cuda40 libs  Provides the libraries that come with the driver  libcuda etc   e cuda40 toolkit  Provides the compilers  cuda gdb  and math libraries   e cuda40 tools  Provides the CUDA tools SDK   e cuda40 profiler  Provides the CUDA visual profiler    e cuda40 sdk  Provides additional tools  development files and source ex   amples    CUDA versions 4 1  4 2  and 5 0 are also provided by Bright Cluster Manager   The exact implementation depends on how the system administrator has config   ured CUDA     8 2 Using CUDA    After installation of the packages  for general usage and compilation itis sufficient  to load just the CUDA4 toolkit module     module add cuda40 toolkit    O Bright Computing  Inc     52    Using GPUs       Also available are several other modules related to CUDA      cuda40 blas 4 0 17  Provides paths and settings for the CUBLAS library      cuda40 fft  Provides paths and settings for the CUFFT library     The toolkit comes with the necessary tools and compilers to compile CUDA  C code    Extensive documentation on how to get started  the various tools  and how to  use the CUDA suite is in the  CUDA_INSTALL_PATH doc directory     8 3 Using OpenCL    OpenCL functionality is provided with the environment module cuda40 toolkit   Examples of OpenCL code can be found in the  CUDA_SDK OpenCL
an up an MPI process:

MPI_Init(&argc, &argv);
MPI_Finalize();

A.4 What Is The Current Process? How Many Processes Are There?
Typically, a process in a parallel application needs to know who it is (its rank) and how many other processes exist.
A process finds out its own rank by calling MPI_Comm_rank():

int myrank;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

The total number of processes is returned by MPI_Comm_size():

int nprocs;
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

A.5 Sending Messages
A message is an array of elements of a given data type. MPI supports all the basic data types and allows a more elaborate application to construct new data types at runtime. A message is sent to a specific process and is marked by a tag (integer value) specified by the user. Tags are used to distinguish between different message types a process might send/receive. In the sample code above, the tag is used to distinguish between work and termination messages.

MPI_Send(buffer, count, datatype, destination, tag, MPI_COMM_WORLD);

A.6 Receiving Messages
A receiving process specifies the tag and the rank of the sending process. MPI_ANY_TAG and MPI_ANY_SOURCE may be used optionally to receive a message of any tag and from any sending process.

MPI_Recv(buffer, maxcount, datatype, source, tag, MPI_COMM_WORLD, &status);

Information about the received message is returned in a status variabl
The description column of the table, continued from the previous page:
• bind job to advance reservation
• account string in accounting record
• handle command as binary
• binds job to processor cores
• define type of checkpointing for job
• request checkpoint method
• skip previous definitions for job
• use current working directory
• define command prefix for job script
• delete context variable(s)
• request a deadline initiation time
• specify standard error stream path(s)
• place user hold on job
• consider following requests "hard"
• print this help
• define jobnet interdependencies
• define jobnet array interdependencies
• specify standard input stream file(s)
• merge stdout and stderr stream of job
• share tree or functional job share

Table 6.1.3: SGE Job Script Options (continued)

Option and parameter: Description
-jsv jsv_url: job submission verification script to be used
-l resource_list: request the given resources
-m mail_options: define mail notification events
-masterq wc_queue_list: bind master task to queue(s)
-notify: notify job before killing/suspending it
-now y[es]|n[o]: start job immediately or not at all
-M mail_list: notify these e-mail addresses
-N name: specify job name
-o path_list: specify standard output stream path(s)
-P project_name: set job's project
-p priority: define job's relative priority
-pe pe_name slot_range: request slot range for parallel jobs
-q wc_queue_list: bind job to queue(s)
-R y[es]|n[o]: reservation desired
-r y[es]|n[o]: define job as (not) restartable
-sc context_list: set job context (replaces old context)
-shell y[es]|n[o]: start command with or without wrapping <loginshell> -c
-soft: consider following requests as soft
-sync y[es]|n[o]: wait for job to end and return exit code
-S path_list: command interpreter to be used
-t task_id_range: create a job array with these tasks
-tc max_running_tasks: throttle the number of concurrent tasks (experimental)
-terse: tersed output, print only the job id
-v variable_list: export these environment variables
-verify: do not submit, just verify
-V: export all environment variables
-w e|w|n|v|p: verify mode (error|warning|none|just verify|poke) for jobs
-wd working_directory: use working_directory
@file: read commandline input from file
19. ault       23 59 59    0 0    ER  1 4    showq From Maui   If the Maui scheduler is running  and the Maui module loaded  module  add maui   then Maui   s showq command displays a similar output  In this  example  one dual core node is available  1 node  2 processors   one job is  running and 3 are queued  in the Idle state        showq   ACTIVE JOBS              JOBNAME USERNAME STATE PROC REMAINING STARTTIME   45 cvsupport Running 2 1 59 57 Tue Jul 14 12 46 20  1 Active Job 2 of 2 Processors Active  100 00      1 of 1 Nodes Active  100  00      O Bright Computing  Inc     7 2 Submitting    Job       IDLE JOBS                JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME   46 cvsupport Idle 2 2 00 00 Tue Jul 14 12 46 20  47 cvsupport Idle 2 2 00 00 Tue Jul 14 12 46 21  48 cvsupport Idle 2 2 00 00 Tue Jul 14 12 46 22  3 Idle Jobs    BLOCKED JOBS            JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME    Total Jobs  4 Active Jobs  1 Idle Jobs  3 Blocked Jobs  0    Viewing Job Details With qstat And checkjob   Job Details With qstat With qstat  f the full output of the job is dis   played  The output shows what the jobname is  where the error and out   put files are stored  and various other settings and variables       qstat  f  Job Id  19 mascm4 cm cluster   Job_Name   TestJobPBS   Job_Owner   cvsupport mascm4 cm cluster  Q    queue   testq    job_state    server   mascm4 cm cluster   Checkpoint   u   ctime   Tue Jul 14 12 35 31 2009   Error_Path   mascm4 cm cluster   home cvs
aying the current jobs running on the cluster.

9.0.7 The NODES Tab
The page opened by clicking on the NODES tab shows a list of nodes on the cluster (figure 9.3).

[Screenshot: the DEVICE INFORMATION table of the nodes page.]

Figure 9.3: User Portal: Nodes Page

The following information about the head or regular nodes is presented:
• Hostname: the node name
• State: for example, UP, DOWN, INSTALLING
• Memory: RAM on the node
• Cores: number of cores on the node
• CPU: type of CPU, for example, Dual-Core AMD Opteron
• Speed: processor speed
• GPU: number of GPUs on the node, if any
• NICs: number of network interface cards on the node, if any
• IB: number of InfiniBand interconnects on the node, if any
• Category: the node category that the node has been allocated by the administrator (by default it is default)

9.0.8 The GRAPHS Tab
By default the GRAPHS tab displays the cluster occupation rate for the last hour (figure 9.4).

[Screenshot: the GRAPHS page.]
bang line: the shell definition line.

#SBATCH lines: optional job script directives (section 5.3.2).

shell commands: optional shell commands, such as loading necessary modules.

application execution line: execution of the MPI application using sbatch, the SLURM submission wrapper.

In #SBATCH lines, "#SBATCH" is used to submit options. The various meanings of lines starting with "#" are:

Line Starts With   Treated As
#                  Comment in shell and SLURM
#SBATCH            Comment in shell, option in SLURM
# SBATCH           Comment in shell and SLURM

After the SLURM job script is run with the sbatch command (section 5.3.4), the output goes into file my.stdout, as specified by the "-o" option.
If the output file is not specified, then the file takes a name of the form "slurm-<jobnumber>.out", where <jobnumber> is a number starting from 1.
The command "sbatch --usage" lists possible options that can be used on the command line or in the job script. Command line values override script-provided values.

5.3.2 SLURM Job Script Options
Options, sometimes called "directives", can be set in the job script file using this line format for each option:

#SBATCH {option} {parameter}

Directives are used to specify the resource allocation for a job so that SLURM can manage the job optimally. Available options and their descriptions
be used:

nvidia-smi -L
GPU 0: 05E710DE:068F10DE Tesla T10 Processor (S/N: 706539258209)
GPU 1: 05E710DE:068F10DE Tesla T10 Processor (S/N: 2486719292433)

To set the ruleset on the GPU:

nvidia-smi -i 0 -c 1

The ruleset may be one of the following:
• 0 - Default mode: multiple applications allowed on the GPU
• 1 - Exclusive thread mode: only one compute context is allowed to run on the GPU, usable from one thread at a time
• 2 - Prohibited mode: no compute contexts are allowed to run on the GPU
• 3 - Exclusive process mode: only one compute context is allowed to run on the GPU, usable from multiple threads at a time

To check the state of the GPU:

nvidia-smi -i 0 -q
COMPUTE mode rules for GPU 0: 1

In this example, GPU0 is locked, and there is a running application using GPU0. A second application attempting to run on this GPU will not be able to run on this GPU:

histogram: device 0, main.cpp 101 : cudaSafeCall() Runtime API error : no CUDA-capable device is available.

After use, the GPU can be unlocked to allow multiple users:

nvidia-smi -i 0 -c 0

8.5.3 CUDA Utility Library
CUTIL is a simple utility library designed for use in the CUDA SDK samples. There are 2 parts, for CUDA and OpenCL. The locations are:
• $CUDA_SDK/C/lib
• $CUDA_SDK/OpenCL/common/lib
Other applications may also refer to them, and the toolkit libraries have already been pre-configured ac
ch are launched without any specified number of nodes, run on a single free node chosen by the workload manager.
The executable line to run a program myprog that has been compiled with MPI libraries is run by placing the job launcher command mpirun before it as follows:

mpirun myprog

Using cm-launcher With mpirun In The Executable Line
For Torque, for some MPI implementations, jobs sometimes leave processes behind after they have ended. A default Bright Cluster Manager installation provides a cleanup utility that removes such processes. To use it, the user simply runs the executable line using the cm-launcher wrapper before the mpirun job launcher command:

cm-launcher mpirun myprog

The wrapper tracks processes that the workload manager launches. When it sees processes that the workload manager is unable to clean up after the job is over, it carries out the cleanup instead. Using cm-launcher is recommended if jobs that do not get cleaned up correctly are an issue for the user or administrator.

7.1.4 Example Batch Submission Scripts
Node Availability
The following job script tests which out of 4 nodes requested with "-l nodes" are made available to the job in the workload manager:

Example:

#!/bin/bash
#PBS -l walltime=1:00
#PBS -l nodes=4
echo -n "I am on: "
hostname
echo finding ssh-accessible nodes:
for node in $(cat ${PBS_NODEFILE})
do
  ec
cordingly. However, they need to be compiled prior to use. Depending on the cluster, this might already have been done:

[fred@demo ~]$ cd
[fred@demo ~]$ cp -r $CUDA_SDK .
[fred@demo ~]$ cd $(basename $CUDA_SDK); cd C
[fred@demo C]$ make
[fred@demo C]$ cd $(basename $CUDA_SDK); cd OpenCL
[fred@demo OpenCL]$ make

CUTIL provides functions for:
• parsing command line arguments
• reading and writing binary files and PPM format images
• comparing data arrays (typically used for comparing GPU results with CPU results)
• timers
• macros for checking error codes
• checking for shared memory bank conflicts

8.5.4 CUDA "Hello world" Example
A hello world example code using CUDA is:

Example:

/*
  CUDA example: Hello World, using shift13, a rot13-like function.
  Encoded on CPU, decoded on GPU.

  rot13 cycles between 26 normal alphabet characters.
  shift13 shifts 13 steps along the normal alphabet characters, so it
  translates half the alphabet into non-alphabet characters.
  shift13 is used because it is simpler than rot13 in C, so we can
  focus on the point.

  (c) Bright Computing
  Taras Shapovalov <taras.shapovalov@brightcomputing.com>
*/

#include <cuda.h>
#include <cutil_inline.h>
#include <stdio.h>

// CUDA kernel definition: undo shift13
__global__ void helloWorld(char* str) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  str[idx] -= 13;
}

int ma
25. ct 8    is recommended instead        Bright Computing  Inc     40    PBS Variants  Torque And PBS Pro       Further examples of node resource specification are given in a table  on page 41     Job Directives  Job Name  Logs  And IDs   If the name of the job script file is jobname  then by default the output and  error streams are logged to jobname o lt number gt  and jobname e lt number gt   respectively  where  lt number gt  indicates the associated Job number  The  default paths for the logs can be changed by using the  o and  e directives  respectively  while the base name  jobname here  can be changed using  the  N directive    Often  a user may simply merge both logs together into one of the two  streams using the  j directive  Thus  in the preceding example      j oe     merges the logs to the output log path  while     j eo    would merge it to  error log path    The job ID is an identifier based on the job number and the FODN of  the login node  For a login node called bright52 cm  cluster  the job ID  for a job number with the associated value  lt number gt  from earlier  would  by default be  lt number gt  bright52 cm cluster  but it can also simply be ab   breviated to  lt number gt      Job Queues   Sending a job to a particular job queue is sometimes appropriate  An ad   ministrator may have set queues up so that some queues are for very long  term jobs  or some queues are for users that require GPUs  Submitting a  job to a particular queue  lt destination gt  i
Figure 9.1: User Portal: Default Home Page

The following items are displayed on a default home page:
• a Message Of The Day. The administrator may put up important messages for users here
• links to the documentation for the cluster
• contact information. This typically shows how to contact technical support
• an overview of the cluster state, displaying some cluster parameters
• a workload overview. This is a table displaying a summary of queues and their associated jobs

9.0.6 The WORKLOAD Tab
The page opened by clicking on the WORKLOAD tab allows a user to see workload-related information for the cluster (figure 9.2).

[Screenshot: the OVERVIEW and JOBS RUNNING tables of the workload page.]

Figure 9.2: User Portal: Workload Page

The following two tables are displayed:
• A workload overview table (the same as the table in the home page)
• A table displ
27. de    5 nodes  2 processors per node  and 1 GPU per node    5 nodes  2 processors per node   and 1 GPU per node    5 nodes  2 processors per node  3  virtual processors for MPI code    5 nodes  2 processors per node   using any GPU on the nodes    5 nodes  2 processors per node     mem 500mb  walltime 03 10 30    nodes 8    select 8     nodes 2 ppn 1   nodes 3 ppn 8  nodes 5 ppn 2 gpus 1   select 5 ncpus 2 ngpus 1    select 5 ncpus 2 mpiprocs 3      select 5 ncpus 2 ngpus 1      select 5 ncpus 2 gpu_id 0      using a GPU with ID 0 from  nodes        For Torque 2 5 5    For PBS Pro 11    Some of the examples illustrate requests for GPU resource usage  GPUs  and the CUDA utilities for Nvidia are introduced in Chapter 8  In the  Torque and PBS Pro workload managers  GPU usage is treated like the  attributes of a resource which the cluster administrator will have pre   configured according to local requirements    For further details on resource list directives  the Torque and PBS Pro  user documentation should be consulted     7 1 3 The Executable Line   In the job script structure  section 7 1 1   the executable line is launched  with the job launcher command after the directives lines have been dealt  with  and after any other shell commands have been carried out to set up  the execution environment     Using mpirun In The Executable Line  The mpirun command is used for executables compiled with MPI libraries   Executables that have not been compiled with MPI libraries  or whi
der: v1 v10 v11 v12 v13 v2 v3 v4 v5 v6 v7 v8 v9

2.4 Compiling Applications
Compiling an application is usually done on the head node or login node. Typically, there are several compilers available on the head node, for example: GNU compiler collection, Open64 compiler, Intel compilers, Portland Group compilers. The following table summarizes the available compiler commands on the cluster:

Language   GNU       Open64         Portland  Intel
C          gcc       opencc         pgcc      icc
C++        g++       openCC         pgCC      icc
Fortran77  gfortran  openf90 -ff77  pgf77     ifort
Fortran90  gfortran  openf90        pgf90     ifort
Fortran95  gfortran  openf95        pgf95     ifort

GNU compilers are the de facto standard on Linux and are installed by default. They do not require a license. AMD's Open64 is also installed by default on Bright Cluster Manager. Commercial compilers by Portland and Intel are available as packages via the Bright Cluster Manager YUM repository, and require the purchase of a license to use them. To make a compiler available to be used in a user's shell commands, the appropriate environment module (section 2.3) must be loaded first. On most clusters, two versions of GCC are available:

1. The version of GCC that comes along with the Linux distribution. For example, for CentOS 6.0:

Example:

[fred@bright52 ~]$ which gcc; gcc --version | head -1
/usr/bin/gcc
gcc (GCC) 4.4.4 20100726 (Red Hat 4.4.4-13)

2. The latest version sui
e. The received message tag is status.MPI_TAG and the rank of the sending process is status.MPI_SOURCE. Another function, not used in the sample code, returns the number of data type elements received. It is used when the number of elements received might be smaller than maxcount:

MPI_Get_count(&status, datatype, &nelements);

With these few functions, almost any application can be programmed. There are many other, more exotic functions in MPI, but all can be built upon those presented here so far.
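A minimal, self-contained sketch tying these calls together follows. It is not one of the manual's own examples; the tag value, buffer sizes, and variable names are invented for illustration. Rank 0 sends three integers to rank 1, and the receiver uses the status variable together with MPI_Get_count() to discover the sender, the tag, and how many elements actually arrived.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0 && nprocs > 1) {
        int data[3] = {1, 2, 3};
        /* send 3 integers to rank 1 with tag 7 */
        MPI_Send(data, 3, MPI_INT, 1, 7, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int buf[10], nelements;
        MPI_Status status;
        /* receive at most 10 integers, from any sender, with any tag */
        MPI_Recv(buf, 10, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_INT, &nelements);
        printf("received %d ints from rank %d with tag %d\n",
               nelements, status.MPI_SOURCE, status.MPI_TAG);
    }

    MPI_Finalize();
    return 0;
}

Compiled with mpicc and run under mpirun with at least two processes, rank 1 reports that it received 3 ints from rank 0 with tag 7.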
enf90 from Open64. In such cases it is possible to override the default compiler path environment variable, for example:

[fred@bright52 ~]$ module list
Currently Loaded Modulefiles:
  1) null     3) gcc/4.4.6            5) torque/2.5.5
  2) shared   4) openmpi/gcc/64/1.4.2
[fred@bright52 ~]$ mpicc --version --showme; mpif90 --version --showme
gcc --version
gfortran --version
[fred@bright52 ~]$ export OMPI_CC=icc; export OMPI_FC=openf90
[fred@bright52 ~]$ mpicc --version --showme; mpif90 --version --showme
icc --version
openf90 --version

Variables that may be set are OMPI_CC, OMPI_FC, OMPI_F77, and OMPI_CXX. More on overriding the Open MPI wrapper settings is documented in the man pages of mpicc, in the environment section.

Using MPI

MPI libraries allow the compilation of code so that it can be used over many processors at the same time.
The available MPI implementations for the variant MPI-1 are MPICH and MVAPICH. For the variant MPI-2 they are MPICH2 and MVAPICH2. Open MPI supports both variants. These MPI libraries can be compiled with GCC, Open64, Intel, or PGI.
Also, depending on the cluster, the interconnect available may be: Ethernet (GE), InfiniBand (IB) or Myrinet (MX).
Also depending on the cluster configuration, MPI implementations for different compilers can be loaded. By default MPI implementations that are installed are compiled and made available using both GCC and Open64.
The interconnect and compiler impleme
31. er    Nodes are configured and controlled by the head node  and do only  what they are told to do  One of the main differences between Beowulf  and a Cluster of Workstations  COW  is the fact that Beowulf behaves  more like a single machine rather than many workstations  In most cases  nodes do not have keyboards or monitors  and are accessed only via re   mote login or possibly serial terminal  Beowulf nodes can be thought of  as a CPU   memory package which can be plugged into the cluster  just  like a CPU or memory module can be plugged into a motherboard     1 2 Brief Network Description    A Beowulf Cluster consists of a login  compile and job submission node   called the head  and one or more compute nodes  often referred to as  worker nodes  A second  fail over  head node may be present in order to  take control of the cluster in case the main head node fails  Furthermore   a second fast network may also have been installed for high performance  communication between the  head and the  nodes  see figure 1 1         Local network         Head node Failover node        InfiniBand  switch    N   Ethernet    switch    PAN             Figure 1 1  Cluster layout    The login node is used to compile software  to submit a parallel or  batch program to a job queueing system and to gather analyze results     O Bright Computing  Inc     1 2 Brief Network Description       Therefore  it should rarely be necessary for a user to log on to one of  the nodes and in some cases node log
es, or which are launched without any specified number of nodes, run on a single free node chosen by the workload manager.
The executable line to run a program myprog that has been compiled with MPI libraries is run by placing the job launcher command mpirun before it as follows:

mpirun myprog

Using cm-launcher With mpirun In The Executable Line
For SGE, for some MPI implementations, jobs sometimes leave processes behind after they have ended. A default Bright Cluster Manager installation provides a cleanup utility that removes such processes. To use it, the user simply runs the executable line using the cm-launcher wrapper before the mpirun job launcher command:

cm-launcher mpirun myprog

The wrapper tracks processes that the workload manager launches. When it sees processes that the workload manager is unable to clean up after a job is over, it carries out the cleanup instead. Using cm-launcher is recommended if jobs that do not get cleaned up correctly are an issue for the user or administrator.

6.1.5 Job Script Examples
Some job script examples are given in this section. Each job script can use a number of variables and directives.

Single Node Example Script
An example script for SGE:

#!/bin/sh
#$ -N sleep
#$ -S /bin/sh
# Make sure that the .e and .o file arrive in the
# working directory
#$ -cwd
# Merge the standard out and standard error to one file
#$ -j y
sleep 60
echo Now it is `date`

Parallel Example Sc
-n, --ntasks: request this many tasks on the cluster. Defaults to 1 task per node.
-c, --cpus-per-task: request this many CPUs per task (not implemented by Open MPI yet). Default: none.
--ntasks-per-node: request this number of tasks per node.

The full options list and syntax for salloc can be viewed with "man salloc".
The requirement of specified options to salloc must be met before the executable is allowed to run. So, for example, if --nodes=4 and the cluster only has 3 nodes, then the executable does not run.

5.2.1 Node Allocation Examples
The following session illustrates and explains some node allocation options and issues for SLURM, using a cluster with just 1 compute node and 4 CPU cores.

Default settings: The hello MPI executable with default settings of SLURM runs successfully over the first (and in this case, the only) node that it finds:

[fred@bright52 ~]$ salloc mpirun hello
salloc: Granted job allocation 572
Hello world from process 0 out of 4, host name node001
Hello world from process 1 out of 4, host name node001
Hello world from process 2 out of 4, host name node001
Hello world from process 3 out of 4, host name node001
salloc: Relinquishing job allocation 572

The preceding output also displays if -N1 (indicating 1 node) is specified, or if -n4 (indicating 4 tasks) is specified.
The node and task allocation is almost certainly not
for loading (some output elided):

Example:

[fred@bright52 ~]$ module avail
--------------------------- /cm/local/modulefiles ----------------------------
cluster-tools/5.2    dot               module-info    shared
cmd                  freeipmi/1.0.2    null           use.own
cmsh                 ipmitool/1.8.11   openldap       version
--------------------------- /cm/shared/modulefiles ---------------------------
acml/gcc/64/4.4.0            intel/compiler/64/12.0/2011.5.220
acml/gcc-mp/64/4.4.0         intel-cluster-checker/1.7
acml/gcc-int64/64/4.4.0      intel-cluster-runtime/3.2
acml/gcc-int64-mp/64/4.4.0   intel-tbb-oss/ia32/30_221oss
acml/open64/64/4.4.0         intel-tbb-oss/intel64/30_221oss

In the list there are two kinds of modules:
• local modules, which are specific to the node, or head node only
• shared modules, which are made available from a shared storage, and which only become available for loading after the shared module is loaded
The shared module is obviously a useful local module, and is therefore loaded for the user by default on a default cluster.
Although version numbers are shown in the "module avail" output, it is not necessary to specify version numbers, unless multiple versions are available for a module.
To remove one or more modules, the "module unload" or "module rm" command is used.
To remove all modules from the user's environment, the "module purge" command is used.
The user should be aware that some loaded modules can conflict with o
35. ftware and hardware has long been superseded  the name  given to this basic design remains    Beowulf class cluster computer     or  less formally    Beowulf cluster        1 1 2 Brief Hardware And Software Description   On the hardware side  commodity hardware is generally used in Be   owulf clusters to keep costs down  These components are mainly Linux   compatible PCs  standard Ethernet adapters  InfiniBand interconnects   and switches    On the software side  commodity software is generally used in Be   owulf clusters to keep costs down  For example  the Linux operating  system  the GNU C compiler and the Message Passing Interface  MPI   standard    The head node controls the whole cluster and serves files and infor   mation to the nodes  It is also the cluster   s console and gateway to the       Bright Computing  Inc     Introduction       outside world  Large Beowulf machines might have more than one head  node  and possibly other nodes dedicated to particular tasks  for exam   ple consoles or monitoring stations  In most cases compute nodes in a  Beowulf system are dumb   in general  the dumber the better   with the  focus on the processing capability of the node within the cluster  rather  than other abilities a computer might generally have  A node may there   fore have       one or several    number crunching    processors  The processors may  also be GPUs       enough memory to deal with the processes passed on to the node     a connection to the rest of the clust
More detail on these options and their use is found in the man page for qsub.

6.1.4 The Executable Line
In a job script, the executable line is launched with the job launcher command after the directives lines have been dealt with, and after any other shell commands have been carried out to set up the execution environment.
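As an illustration of how the directives, the shell commands, and the executable line fit together, the following is a minimal sketch of an SGE job script for an MPI program. It is not one of the manual's own examples: the job name, the parallel environment name "openmpi", the slot count, and the program name myprog are placeholders that depend on how the cluster has been configured, and the module loaded should match one of the modules actually available (section 3.2).

#!/bin/bash
# Job script directives
#$ -N myjob           # job name
#$ -cwd               # run in the current working directory
#$ -j y               # merge stdout and stderr
#$ -pe openmpi 4      # request 4 slots in a site-specific parallel environment

# Shell commands that set up the execution environment
module add shared openmpi/gcc/64/1.4.2

# The executable line: the job launcher command followed by the application
mpirun myprog

Submitted with qsub, such a script would run myprog over the slots that SGE allocates for the job.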
ho -n "running on: "
  /usr/bin/ssh $node hostname
done

The directive specifying walltime means the script runs at most for 1 minute. The ${PBS_NODEFILE} array used by the script is created and appended with hosts by the queueing system. The script illustrates how the workload manager generates a ${PBS_NODEFILE} array based on the requested number of nodes, and which can be used in a job script to spawn child processes. When the script is submitted, the output from the log will look like:

I am on: node001
finding ssh-accessible nodes:
running on: node001
running on: node002
running on: node003
running on: node004

This illustrates that the job starts up on a node, and that no more than the number of nodes that were asked for in the resource specification are provided.
The list of all nodes for a cluster can be found using the pbsnodes command (section 7.2.6).

Using InfiniBand
A sample PBS script for InfiniBand is:

#!/bin/bash
#
# Sample PBS file
#
# Name of job
#PBS -N MPI
# Number of nodes (in this case 8 nodes with 4 CPUs each).
# The total number of nodes passed to mpirun will be nodes*ppn.
# Second entry: Total amount of wall-clock time (true time);
# 02:00:00 indicates 02 hours.
#PBS -l nodes=8:ppn=4,walltime=02:00:00
# Mail to user when job terminates or aborts
#PBS -m ae
# If modules are needed by the script, then source module
in(int argc, char** argv)
{
  char s[] = "Hello World!";
  printf("String for encode/decode: %s\n", s);

  // CPU shift13
  int len = sizeof(s);
  for (int i = 0; i < len; i++) {
    s[i] += 13;
  }
  printf("String encoded on CPU as: %s\n", s);

  // Allocate memory on the CUDA device
  char *cuda_s;
  cudaMalloc((void**)&cuda_s, len);

  // Copy the string to the CUDA device
  cudaMemcpy(cuda_s, s, len, cudaMemcpyHostToDevice);

  // Set the grid and block sizes (dim3 is a type),
  // and "Hello World!" is 12 characters, say 3x4
  dim3 dimGrid(3);
  dim3 dimBlock(4);

  // Invoke the kernel to undo shift13 in GPU
  helloWorld<<< dimGrid, dimBlock >>>(cuda_s);

  // Retrieve the results from the CUDA device
  cudaMemcpy(s, cuda_s, len, cudaMemcpyDeviceToHost);

  // Free up the allocated memory on the CUDA device
  cudaFree(cuda_s);

  printf("String decoded on GPU as: %s\n", s);
  return 0;
}

The preceding code example may be compiled and run with:

[fred@bright52 ~]$ nvcc hello.cu -o hello
[fred@bright52 ~]$ module add shared openmpi/gcc/64/1.4.4 slurm
[fred@bright52 ~]$ salloc -n1 --gres=gpu:1 mpirun hello
salloc: Granted job allocation 2263
String for encode/decode: Hello World!
String encoded on CPU as: Uryy dlyq
String decoded on GPU as: Hello World!
salloc: Relinquishing job allocation 2263
salloc: Job allocation 2263 has been revoked.

The number of characters displayed in the
#include <string.h>

#define BUFSIZE 128
#define TAG 0

int main(int argc, char *argv[])
{
  char idstr[32];
  char buff[BUFSIZE];
  int numprocs;
  int myid;
  int i;
  MPI_Status stat;

  /* all MPI programs start with MPI_Init; all 'N' processes exist thereafter */
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs); /* find out how big the SPMD world is */
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);     /* and this processes' rank is */

  /* At this point, all the programs are running equivalently; the rank is used to
     distinguish the roles of the programs in the SPMD model, with rank 0 often used
     specially... */
  if (myid == 0) {
    printf("%d: We have %d processors\n", myid, numprocs);
    for (i = 1; i < numprocs; i++) {
      sprintf(buff, "Hello %d! ", i);
      MPI_Send(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD);
    }
    for (i = 1; i < numprocs; i++) {
      MPI_Recv(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD, &stat);
      printf("%d: %s\n", myid, buff);
    }
  } else {
    /* receive from rank 0: */
    MPI_Recv(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, &stat);
    sprintf(idstr, "Processor %d ", myid);
    strcat(buff, idstr);
    strcat(buff, "reporting for duty\n");
    /* send to rank 0: */
    MPI_Send(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD);
  }

  /* MPI programs end with MPI_Finalize; this is a weak synchronization point */
  MPI_Finalize();
  return 0;
}

A.2 MPI Skeleton
The sample code below co
40. ins are disabled altogether  The  head  login and compute nodes usually communicate with each other  through a gigabit Ethernet network  capable of transmitting information  at a maximum rate of 1000 Mbps  In some clusters 10 gigabit Ethernet   10GE  10GBE  or 10GigE  is used  capable of up to 10 Gbps rates    Sometimes an additional network is used by the cluster for even faster  communication between the compute nodes  This particular network is  mainly used for programs dedicated to solving large scale computational  problems  which may require multiple machines and could involve the  exchange of vast amounts of information  One such network topology is  InfiniBand  commonly capable of transmitting information at a maximum  rate of 56Gbps and about 1 2us latency on small packets  for clusters in  2011 The commonly available maximum transmission rates will increase  over the years as the technology advances    Applications relying on message passing benefit greatly from lower  latency  The fast network is usually complementary to a slower Ethernet   based network     O Bright Computing  Inc     Cluster Usage    2 1 Login To The Cluster Environment    The login node is the node where the user logs in and works from  Simple  clusters have a single login node  but large clusters sometimes have mul   tiple login nodes to improve the reliability of the cluster  In most clusters   the login node is also the head node from where the cluster is monitored  and installed  On the logi
ions:
  -f | --force            force active dependency resolution
  -t | --terse            terse format avail and list format
  -l | --long             long format avail and list format
  -h | --human            readable format avail and list format
  -v | --verbose          enable verbose messages
  -s | --silent           disable verbose messages
  -c | --create           create caches for avail and apropos
  -i | --icase            case insensitive
  -u | --userlvl <lvl>    set user level to (nov[ice], exp[ert], adv[anced])

Available SubCommands and Args:
  add|load        modulefile [modulefile ...]
  rm|unload       modulefile [modulefile ...]
  switch|swap     [modulefile1] modulefile2
  display|show    modulefile [modulefile ...]
  avail           [modulefile [modulefile ...]]
  use [-a|--append] dir [dir ...]
  unuse           dir [dir ...]
  update
  refresh
  purge
  list
  clear
  help            [modulefile [modulefile ...]]
  whatis          [modulefile [modulefile ...]]
  apropos|keyword string
  initadd         modulefile [modulefile ...]
  initprepend     modulefile [modulefile ...]
  initrm          modulefile [modulefile ...]
  initswitch      modulefile1 modulefile2
  initlist
  initclear

2.3.2 Changing The Current Environment

The modules loaded into the user's environment can be seen with:

$ module list

Modules can be loaded using the add or load options. A list of modules can be added by spacing them:

$ module add shared open64 openmpi/open64

The "module avail" command lists all modules that are available.
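As a short illustration of the subcommands above, the following sequence loads a compiler and an MPI library and then verifies the result. The module names here are examples only; the ones available on a given cluster will differ:

$ module avail gcc
$ module add gcc openmpi/gcc/64
$ module list

The last command should now show the two newly added modulefiles alongside anything that was loaded previously.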
lementation

Once the appropriate compiler module has been loaded, the MPI implementation is selected along with the appropriate library modules. In the following list, <compiler> indicates a choice of gcc, intel, open64, or pgi:

• mpich/ge/<compiler>/64/1.2.7
• mpich2/smpd/ge/<compiler>/64/1.3.2p1
• mvapich/<compiler>/64/1.2rc1
• mvapich2/<compiler>/64/1.6
• openmpi/<compiler>/64/1.4.2
• blas/<compiler>/64/1 (1)
• blacs/openmpi/<compiler>/64/1.1patch03
• globalarrays/<compiler>/openmpi/64/5.0.2
• gotoblas/<name of CPU>(2)/64/1.26

After the appropriate MPI module has been added to the user environment, the user can start compiling applications. The mpich, mpich2 and openmpi implementations may be used on Ethernet. On InfiniBand, mvapich, mvapich2 and openmpi may be used. The openmpi MPI implementation will attempt to use InfiniBand, but will revert to Ethernet if InfiniBand is not available.

3.3 Example MPI Run

This example covers an MPI run, which can be run inside and outside of a queuing system.

To use mpirun, the relevant environment modules must be loaded. For example, to use mpich over Gigabit Ethernet (ge), compiled with GCC:

$ module add mpich/ge/gcc

Similarly, to use InfiniBand:

(1) Not recommended, except for testing purposes, due to lack of optimization.
(2) <name of CPU> indicates a choice of the following: barcelona, core, opteron, penryn, or prescott.
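The "Similarly, to use InfiniBand" example is not continued here; as a sketch, an InfiniBand-capable implementation such as MVAPICH compiled with GCC would be loaded in the same way, with the exact module name and version depending on what is installed on the system:

$ module add mvapich/gcc/64

Compilation then proceeds with mpicc as described in section 3.3.1.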
loaded by the user, on top of the chosen MPI implementation (in this case Open MPI).

Example:

[fred@bright52 ~]$ module list
Currently Loaded Modulefiles:
  1) gcc/4.4.6              3) shared
  2) openmpi/gcc/64/1.4.2   4) cuda40/toolkit/4.0.17
[fred@bright52 ~]$ module add slurm; module list
Currently Loaded Modulefiles:
  1) gcc/4.4.6              3) shared                  5) slurm/2.2.4
  2) openmpi/gcc/64/1.4.2   4) cuda40/toolkit/4.0.17

The "hello world" executable from section 3.3.3 can then be compiled and run for one task outside the workload manager as:

mpicc hello.c -o hello
mpirun -np 1 hello

5.2 Running The Executable With salloc

Running it as a job managed by SLURM can be done interactively with the SLURM allocation command, salloc, as follows:

[fred@bright52 ~]$ salloc mpirun hello

SLURM is more typically run as a batch job (section 5.3). However, execution via salloc uses the same options, and it is more convenient as an introduction because of its interactive behavior.

In a default Bright Cluster Manager configuration, SLURM auto-detects the cores available and by default spreads the tasks across the cores that are part of the allocation request.

How SLURM spreads the executable across nodes is typically determined by the options in the following table:

  Short     Long
  Option    Option      Description
  -N        --nodes=    Request this many nodes on the cluster.
                        Use all cores on each node by de
mpiler Driver NVCC" has more information on compiler options.

The CUDA SDK has more programming examples and information, accessible from the file CUDA_SDK/C/Samples.html.

For OpenCL, code compilation can be done by linking against the OpenCL library:

gcc test.c -lOpenCL
g++ test.cpp -lOpenCL
nvcc test.c -lOpenCL

8.5 Available Tools

8.5.1 CUDA gdb
The CUDA debugger can be started using: cuda-gdb. Details of how to use it are available in the "CUDA-GDB (NVIDIA CUDA Debugger)" manual, in the doc directory. It is based on GDB, the GNU Project debugger, and requires the use of the "-g" or "-G" option when compiling.

Example:

nvcc -g -G testcode.cu -o testcode

8.5.2 nvidia-smi
The NVIDIA System Management Interface command, nvidia-smi, can be used to allow exclusive access to the GPU. This means only one application can run on a GPU. By default, a GPU will allow multiple running applications.

Syntax:

nvidia-smi [OPTION1 [ARG1]] [OPTION2 [ARG2]] ...

The steps for making a GPU exclusive:

• List GPUs
• Select a GPU
• Lock the GPU to a compute mode
• After use, release the GPU

After setting the compute rule on the GPU, the first application which executes on the GPU will block out all others attempting to run. This application does not necessarily have to be the one started by the user that set the exclusivity lock on the GPU.

To list the GPUs, the -L argument can
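For example (a sketch; the device names and the exact output format depend on the GPUs installed and the driver version):

$ nvidia-smi -L

Each GPU is reported with an index, which can then be used to select it in the following steps.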
45. n node        code can be compiled      applications can be developed      applications can be submitted to the cluster for execution     running applications can be monitored    To carry out an ssh login to the cluster  a terminal can be used from  unix like operating systems     Example    ssh myname cluster hostname    On a Windows operating system  an SSH client such as for PuTTY can be  downloaded  The cluster   s address must be added  and the connect but   ton clicked  The username and password must be entered when prompted    If the administrator has changed the default SSH port from 22 to some   thing else  the port can be specified with the  p  lt port gt  option       ssh  p  lt port gt   lt user gt   lt cluster gt     Optionally  after logging in  the password used can be changed using  the passwd command       passwd       Bright Computing  Inc     Cluster Usage       2 2 Setting Up The User Environment    By default  each user uses the bash shell interpreter  Each time a user  login takes place  a file named  bashrc is executed to set up the shell  environment for the user  The shell environment can be customized to  suit user preferences  For example     e the prompt can be changed to indicate the current host and direc   tory    the size of the command history file can be increased     aliases can be added for frequently used command sequences    e environment variables can be created or modified    the location of software packages and versions that are to be
nd configuring just one more node to the current cluster would allow the resource allocation to succeed, since an added node would provide at least one more processor to the cluster.

Number of tasks requested per node: The value assigned to the --ntasks-per-node option is the number of tasks that are requested for allocation from each node on the cluster. In the current cluster example it can be 1 to 4 tasks. A resource allocation request for 5 tasks per node with --ntasks-per-node fails on this 4-core cluster, giving an output like:

[fred@bright52 ~]$ salloc --ntasks-per-node=5 mpirun hello
salloc: error: Failed to allocate resources: More processors requested than permitted

Adding and configuring another 4-core node to the current cluster would still not allow resource allocation to succeed, because the request is for at least 5 cores per node, rather than per cluster.

Restricting the number of tasks that can run per node: A resource allocation request for 2 tasks per node with the --ntasks-per-node option, and simultaneously an allocation request for 1 task to run on the cluster using the --ntasks option, runs successfully, although it uselessly ties up resources for 1 task per node:

[fred@bright52 ~]$ salloc --ntasks-per-node=2 --ntasks=1 mpirun hello
salloc: Granted job allocation 574
Hello world from process 0 out of 1, host name node005
salloc: Relinquishing job allocation 574

The
47. ne       Control the jobs  freeze hold the job  resume the job  delete the job     Some advanced options in workload managers can prioritize jobs and  add checkpoints to freeze a job        Bright Computing  Inc     18    Workload Management       4 4 Job Submission Process    Whenever a job is submitted  the workload management system checks  on the resources requested by the job script  It assigns cores and memory  to the job  and sends the job to the nodes for computation  If the required  number of cores or memory are not yet available  it queues the job until  these resources become available  If the job requests resources that are  always going to exceed those that can become available  then the job ac   cordingly remains queued indefinitely    The workload management system keeps track of the status of the job  and returns the resources to the available pool when a job has finished   that is  been deleted  has crashed or successfully completed      4 5 What Do Job Scripts Look Like     A job script looks very much like an ordinary shell script  and certain  commands and variables can be put in there that are needed for the job   The exact composition of a job script depends on the workload manager  used  but normally includes     e commands to load relevant modules or set environment variables       directives for the workload manager to request resources  control  the output  set email addresses for messages to go to       an execution  job submission  line    When ru
nning a job script, the workload manager is normally responsible for generating a machine file based on the requested number of processor cores (np), as well as being responsible for the allocation of any other requested resources.

The executable submission line in a job script is the line where the job is submitted to the workload manager. This can take various forms.

Example: For the SLURM workload manager, the line might look like:

srun --mpi=mpich1_p4 ./a.out

Example: For Torque or PBS Pro it may simply be:

mpirun ./a.out

Example: For SGE it may look like:

mpirun -np 4 -machinefile $TMP/machines ./a.out

4.6 Running Jobs On A Workload Manager

The details of running jobs through the following workload managers are discussed next:

• SLURM (Chapter 5)
• SGE (Chapter 6)
• Torque (with Maui or Moab) and PBS Pro (Chapter 7)

4.7 Running Jobs In Cluster Extension Cloud Nodes Using cmsub

Extra computational power from cloud service providers such as Amazon can be used by an appropriately configured cluster managed by Bright Cluster Manager.

If the head node is running outside a cloud services provider, and at least some of the compute nodes are in the cloud, then this "hybrid" cluster configuration is called a Cluster Extension cluster, with the compute nodes in the cloud being the cloud extension of the cluster.

For a Cl
ntains the complete communications skeleton for a dynamically load-balanced head/compute node application. Following the code is a description of some of the functions necessary for writing typical parallel applications.

#include <mpi.h>
#define WORKTAG 1
#define DIETAG 2

main(argc, argv)
int argc;
char *argv[];
{
    int myrank;
    MPI_Init(&argc, &argv);          /* initialize MPI */
    MPI_Comm_rank(MPI_COMM_WORLD,    /* always use this */
                  &myrank);          /* process rank, 0 thru N-1 */
    if (myrank == 0) {
        head();
    } else {
        computenode();
    }
    MPI_Finalize();                  /* cleanup MPI */
}

head()
{
    int ntasks, rank, work;
    double result;
    MPI_Status status;

    MPI_Comm_size(MPI_COMM_WORLD,    /* always use this */
                  &ntasks);          /* #processes in application */

    /* Seed the compute nodes */
    for (rank = 1; rank < ntasks; ++rank) {
        work = /* get_next_work_request */;
        MPI_Send(&work,              /* message buffer */
                 1,                  /* one data item */
                 MPI_INT,            /* data item is an integer */
                 rank,               /* destination process rank */
                 WORKTAG,            /* user chosen message tag */
                 MPI_COMM_WORLD);    /* always use this */
    }

    /* Receive a result from any compute node and dispatch a new work
       request until work requests have been exhausted */
    work = /* get_next_work_request */;
    while ( /* valid new work request */ ) {
        MPI_Recv(&result,            /* message buffer */
                 1,                  /* one data item */
                 MPI_DOUBLE,         /* of type double real */
                 MPI_ANY_SOURCE,     /* receive fr
50. ntation can be worked out  from looking at the module and package name  The modules available  can be searched through for the compiler variant  and then the package  providing it can be found     Example     fred bright52       search for modules starting with the name openmpi   fred bright52     module  1 avail 2 gt  amp 1   grep    openmpi    openmpi gcc 64 1 4 2 2011 05 03 0 37 51  openmpi intel 64 1 4 2 2011 05 02 8 24 28  openmpi open64 64 1 4 2 2011 05 02 8 43 13     fred bright52     rpm  qa   grep    openmpi  openmpi ge gcc 64 1 4 2 108_cm5 2 x86_64  openmpi geib open64 64 1 4 2 108_cm5 2 x86_64  openmpi geib intel 64 1 4 2 108_cm5 2 x86_64    Here  for example   openmpi geib intel 64 x86_64  implies  Open MPI compiled for both Gigabit Ethernet     ge     and Infini     Band     ib      compiled with the Intel compiler for a 64 bit architecture     3 1 Interconnects    Jobs can use a certain network for intra node communication        Bright Computing  Inc     12    Using MPI       3 1 1 Gigabit Ethernet   Gigabit Ethernet is the interconnect that is most commonly available  For  Gigabit Ethernet  no additional modules or libraries are needed  The  Open MPI  MPICH and MPICH2 implementations will work over Giga   bit Ethernet     3 1 2 InfiniBand   InfiniBand is a high performance switched fabric which is characterized  by its high throughput and low latency  Open MPI  MVAPICH and MVA   PICH2 are suitable MPI implementations for InfiniBand     3 2 Selecting An MPI imp
51. o 37  7 1 Components Of A Job Seript              38    72 Submitting AJob                 ooo    o  44    Table of Contents       8 Using GPUs 51  SL Packages  zen atada Deren een arena 51  8 2  Usa CUDASS n ts is A A 51  8 3  Using OpeneE  L  sad amu a la a on 52  8 4 Compiling Code                      52  85 Available Tools                           53   9 User Portal 57   A MPI Examples 61  ALT    Hell   world    eco pus a ar en 61  A2    MPESkeleton   isobaras be a ba 62  A 3 MPI Initialization And Finalization                64  A 4 What Is The Current Process  How Many Processes Are   There  eee A Ge eat De BEE ana S Ee ES 64  A 5 Sending Messages                    ee eee 64  A 6 Receiving Messages    2    ee ee ee 64    Preface    Welcome to the User Manual for the Bright Cluster Manager 6 0 clus   ter environment  This manual is intended for users of a cluster running  Bright Cluster Manager    This manual covers the basics of using the Bright Cluster Manager  user environment to run compute jobs on the cluster  Although it does  cover some aspects of general Linux usage  it isby no means comprehen   sive in this area  Readers are advised to make themselves familiar with  the basics of a Linux environment    Our manuals constantly evolve to match the development of the Bright  Cluster Manager environment  the addition of new hardware and  or ap   plications and the incorporation of customer feedback  Your input as a  user and or administrator is of great value to u
52. ob is stopped  and  an error message like the following is displayed       gt  PBS  job killed  walltime  lt runningtime gt  exceeded limit  lt settime gt     Here   lt runningtime gt  indicates the time for which the job actually went  on to run  while  lt settime gt  indicates the time that the user set as the wall   time resource limit     Resource List Directives   Resource list directives specify arguments to the  1 directive of the job  script  and allow users to specify values to use instead of the system de   faults    For example  in the sample script structure earlier  a job walltime of  one hour and a memory space of at least 500MB are requested  the script  requires the size of the space be spelled in lower case  so    500mb    is used     If a requested resource list value exceeds what is available  the job is  queued until resources become available    For example  if nodes only have 2000MB to spare and 4000MB is re   quested  then the job is queued indefinitely  and it is up to the user to fix  the problem    Resource list directives also allow  for example  the number of nodes    1 nodes  and the virtual processor cores per nodes   1 ppn  to be spec   ified  If no value is specified  the default is 1 core per node    If 8 cores are wanted  and it does not matter how the cores are allo   cated  e g  8 per node or 1 on 8 nodes  the directive used in Torque is      PBS  1 nodes 8    For PBS Pro v11 this also works  but is deprecated  and the form     PBS   1 sele
om any sender */
                 MPI_ANY_TAG,        /* any type of message */
                 MPI_COMM_WORLD,     /* always use this */
                 &status);           /* received message info */
        MPI_Send(&work, 1, MPI_INT, status.MPI_SOURCE,
                 WORKTAG, MPI_COMM_WORLD);
        work = /* get_next_work_request */;
    }

    /* Receive results for outstanding work requests */
    for (rank = 1; rank < ntasks; ++rank) {
        MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
                 MPI_ANY_TAG, MPI_COMM_WORLD, &status);
    }

    /* Tell all the compute nodes to exit */
    for (rank = 1; rank < ntasks; ++rank) {
        MPI_Send(0, 0, MPI_INT, rank, DIETAG, MPI_COMM_WORLD);
    }
}

computenode()
{
    double result;
    int work;
    MPI_Status status;
    for (;;) {
        MPI_Recv(&work, 1, MPI_INT, 0, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        /* Check the tag of the received message */
        if (status.MPI_TAG == DIETAG) {
            return;
        }
        result = /* do the work */;
        MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }
}

Processes are represented by a unique rank (integer), and ranks are numbered 0, 1, 2, ..., N-1. MPI_COMM_WORLD means all the processes in the MPI application. It is called a communicator and it provides all information necessary to do message passing. Portable libraries do more with communicators to provide synchronisation protection that most other systems cannot handle.

A.3 MPI Initialization And Finalization

As with other systems, two functions are provided to initialize and cle
54. ork   NONE  Memory  gt   0 Disk  gt   0 Swap  gt   0  Opsys   NONE  Arch   NONE  Features   NONE   Dedicated Resources Per Task  PROCS  1 MEM  495M    IWD   NONE  Executable   NONE   Bypass  O StartCount  0  PartitionMask   ALL    Flags  RESTARTABLE    PE  1 01 StartPriority  173   job cannot run in partition DEFAULT  idle procs do not meet requirement   s   0 of 1 procs found    idle procs  3 feasible procs  0    Rejection Reasons   CPU   3     The  v option gives slightly more detail     7 2 5 Deleting A Job  An already submitted job can be deleted using the qdel command       qdel  lt jobid gt     The job ID is printed to the terminal when the job is submitted  To get the job  ID of a job if it has been forgotten  the following can be used       qstat  or      showq    7 2 6 Monitoring Nodes In Torque And PBS Pro  The nodes that the workload manager knows about can be viewed using the  pbsnodes command    The following output is from a cluster made up of 2 core nodes  as indicated  by the value of 2 for ncpu for Torque and PBS Pro  If the node is available to run  scripts  then its state is free or time shared  When a node is used exclusively   section 8 5 2  by one script  the state is job exclusive    For Torque the display resembles  some output elided         Bright Computing  Inc     7 2 Submitting A Job    49        fred bright52     pbsnodes  a  node001 cm cluster   state   free   np   3   ntype   cluster    status   rectime 1317911358 varattr  jobs 96   ncpus 2    
55. orld from process 001 out of 004  processor name node001    Queues in SLURM terminology are called    partitions     SLURM has a  default queue called defq  The administrator may have removed this or  created others    If a particular queue is to be used  this is typically set in the job script  using the  p or   partition option      SBATCH   partition bitcoinsq    It can also be specified as an option to the sbatch command during sub   mission to SLURM     5 3 5 Checking And Changing Queued Job Status   After a job has gone into a queue  the queue status can be checked using  the squeue command  The job number can be specified with the  j option  to avoid seeing other jobs  The man page for squeue covers other options    Jobs can be canceled with    scancel  lt job number gt        The scontrol command allows users to see and change the job direc   tives while the job is still queued  For example  a user may have speci   fied a job  using the   begin directive  to start at 10am the next day by  mistake  To change the job to start at 10pm tonight  something like the  following session may take place      fred bright52     scontrol show jobid 254   grep Time  RunTime 00 00 04 TimeLimit UNLIMITED TimeMin N A  SubmitTime 2011 10 18T17 41 34 EligibleTime 2011 10 19T10 00 00  StartTime 2011 10 18T17 44 15 EndTime Unknown    SuspendTime None SecsPreSuspend 0    The parameter that should be changed is    EligibleTime     which can  be done as follows      fred bright52     scontrol
other way round, that is, a resource allocation request for 1 task per node with the --ntasks-per-node option, and simultaneously an allocation request for 2 tasks to run on the cluster using the --ntasks option, fails because on the 1-node cluster only 1 task can be allocated resources on the single node, while resources for more tasks than that are being asked for on the cluster:

[fred@bright52 ~]$ salloc --ntasks-per-node=1 --ntasks=3 mpirun hello
salloc: error: Failed to allocate resources: Requested node configuration is not available
salloc: Job allocation 575 has been revoked

5.3 Running The Executable As A SLURM Job Script

Instead of using options appended to the salloc command line as in section 5.2, it is usually more convenient to send jobs to SLURM with the sbatch command acting on a job script.

A job script is also sometimes called a batch file. In a job script, the user can add and adjust the SLURM options, which are the same as the salloc options of section 5.2. The various settings and variables that go with the application can also be adjusted.

5.3.1 SLURM Job Script Structure

A job script submission for the SLURM batch job script format is illustrated by the following:

[fred@bright52 ~]$ cat slurmhello.sh
#!/bin/sh
#SBATCH -o my.stdout
#SBATCH --time=30     # time limit to batch job
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=4
module add shared openmpi/gcc/64/1.4.2 slurm
mpirun hello

The structure is: she
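Once such a script is saved (here assumed to be slurmhello.sh, as above), handing it to the batch system and checking on it each take one command. This is a sketch of typical usage, with an illustrative job number, rather than output reproduced from the manual:

$ sbatch slurmhello.sh
Submitted batch job 703
$ squeue -j 703

The job number printed by sbatch is the one that squeue and scancel (section 5.3.5) act on.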
57. ple Script Structure  A job script in PBS Pro or Torque has a structure illustrated by the follow   ing basic example     Example       bin bash        PBS  1 walltime 1 00 00   PBS  1 nodes 4    PBS  1 mem 500mb    PBS  j oe    cd   HOME  myprogs  mpirun myprog a b c    The first line is the standard    shebang    line used for scripts    The lines that start with  PBS are PBS directive lines  described shortly  in section 7 1 2    The last two lines are an example of setting remaining options or con   figuration settings up for the script to run  In this case  a change to the  directory myprogs is made  and then run the executable myprog with ar   guments a b c  The line that runs the program is called the executable  line  section 7 1 3     To run the executable file in the executable line in parallel  the job  launcher mpirun is placed immediately before the executable file  The  number of nodes the parallel job is to run on is assumed to have been  specified in the PBS directives     7 1 2 Directives   Job Script Directives And qsub Options   A job script typically has several configurable values called job script di    rectives  set with job script directive lines  These are lines that start with   a     PBS     Any directive lines beyond the first executable line are ignored   The lines are comments as far as the shell is concerned because they   start with a          However  at the same time the lines are special com    mands when the job script is processed by the q
58. pt  may be submitted as follows     Example    qsub mpirun job  A job may be submitted to a specific queue testq as follows   Example    qsub  q testq mpirun  job    The man page for qsub describes these and other options  The options  correspond to PBS directives in job scripts  section 7 1 1   If a particular  item is specified by a qsub option as well as by a PBS directive  then the  qsub option takes precedence     7 2 3 Job Output   By default  the output from the job script  lt scriptname gt  goes into the home  directory of the user for Torque  or into the current working directory for  PBS Pro    By default  error output is written to  lt scriptname gt  e lt jobid gt  and the  application output is written to  lt scriptname gt   o lt jobid gt   where  lt jobid gt  is a  unique number that the workload manager allocates  Specific output and  error files can be set using the  o and  e options respectively  The error  and output files can usefully be concatenated into one file with the  j oe  or  j eo options  More details on this can be found in the qsub man page     7 2 4 Monitoring A Job   To use the commands in this section  the appropriate workload manager  module must be loaded  For example  for Torque  torque module needs  to be loaded       module add torque    qstat Basics  The main component is qstat  which has several options  In this exam   ple  the most frequently used options are discussed    In PBS Torque  the command    qstat  an    shows what jobs are cu
59. q node001 cm cluster BI 0 16  NA  1x26 amd64 au  all q node002 cm cluster BI 0 16  NA  1x26 amd64 au    HHHHHHHH HHH HHH HHH HEHE HHH HH ERH HH RBHEH RBH HH RBH RHERHEHRRH RHR RH RHR RH RHEE RHEE HE    PENDING JOBS   PENDING JOBS   PENDING JOBS   PENDING JOBS   PENDING JOBS  HHHHHHHH HHH HHH HHH AHHH RHEE RBH AHRRHHH RBH EH RBH RHRBH RHE RHRHRBH RHR RH RHE H RHEE HE  249 0 55500 Sleeperi root qw 12 03 2008 07 29 00 1  250 0 55500 Sleeperi root qw 12 03 2008 07 29 01 1    Job state can be   e d eletion    e E rror       h old       r unning    e R estarted        s uspended     O Bright Computing  Inc     6 4 Deleting A Job 35       S uspended   t ransfering   T hreshold     w aiting     The queue state can be     u nknown  if the corresponding sge_execd cannot be contacted  a larm    the load threshold is currently exceeded   A larm    the suspend threshold is currently exceeded  C alendar suspended    see calendar_conf   s uspended    see qmod   S ubordinate    d isabled    see qmod   D isabled    see calendar_conf    E rror    sge_execd was unable to locate the sge_shepherd   use  qmod to fix it     o rphaned    for queue instances    By default the qstat command shows only jobs belonging to the cur   rent user  i e  the command is executed with the option  u  user  To see  jobs from other users too  the following format is used       qstat  u              6 4    Deleting A Job    Ajob can be deleted in SGE with the following command      qdel  lt jobid gt     The job id is
60. r   rently submitted or running on the queuing system  An example output  is      fred bright52     qstat  an  bright52 cm cluster   Req   d Req   d Elap       Bright Computing  Inc     46    PBS Variants  Torque And PBS Pro       Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time  78 bright52 fred shortq tjob 10476 1 1 555mb 00 01 R 00 00  79 bright52 fred shortq tjob    1 1 555mb 00 01 Q       The output shows the Job ID  the user who owns the job  the queue   the job name  the session ID for a running job  the number of nodes re   quested  the number of CPUs or tasks requested  the time requested   1  walltime   the job state  S  and the elapsed time  In this example  one job  is seen to be running  R   and one is still queued  Q   The  n parameter  causes nodes that are in use by a running job to display at the end of that  line    Possible job states are        Job States Description             Job is completed  regardless of success or failure   Job is exiting after having run   Job is held   job is queued  eligible to run or routed   job is running   job is suspend    job is being moved to new location      HH nn vo m Pi a    Job is waiting for its execution time       The command    qstat  q    shows what queues are available  In the fol   lowing example  there is one job running in the testq queue and 4 are  queued       qstat  q    server  master cm cluster    Queue Memory CPU Time Walltime Node Run Que Lm State   testq       23 59 59    1 4   ER   def
ript.

For parallel jobs the pe environment must be assigned to the script. Depending on the interconnect, there may be a choice between a number of parallel environments such as MPICH (Ethernet) or MVAPICH (InfiniBand).

#!/bin/sh
#
# Your job name
#$ -N My_Job
#
# Use current working directory
#$ -cwd
#
# Join stdout and stderr
#$ -j y
#
# pe (Parallel environment) request. Set your number of processors here.
#$ -pe mpich NUMBER_OF_CPUS
#
# Run job through bash shell
#$ -S /bin/bash

# If modules are needed, source modules environment:
. /etc/profile.d/modules.sh

# Add any modules you might require:
module add shared

# The following output will show in the output file. Used for debugging.
echo "Got $NSLOTS processors."
echo "Machines:"
cat $TMPDIR/machines

# Use MPIRUN to run the application
mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./application

6.2 Submitting A Job

The SGE module must be loaded first so that SGE commands can be accessed:

$ module add shared sge

With SGE a job can be submitted with qsub. The qsub command has the following syntax:

qsub [ options ] [ jobscript | -- [ jobscript args ]]

After completion (either successful or not), output is put in the user's current directory, appended with the job number which is assigned by SGE. By default, there is an error and an output file: myapp
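As a concrete sketch of this submission step (assuming the parallel script above is saved as myjob.sge, that NUMBER_OF_CPUS has been replaced by an actual count such as 4, and that an mpich parallel environment has been configured by the administrator):

$ module add shared sge
$ qsub myjob.sge

qstat can then be used to follow the job, and the output and error files appear in the submission directory once it completes.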
62. rkload management system  also known as a queueing system  job  scheduler or batch submission system  manages the available resources  such as CPUs  GPUs  and memory for jobs submitted to the system by  users    Jobs are submitted by the users using job scripts  Job scripts are con   structed by users and include requests for resources  How resources are  allocated depends upon policies that the system administrator sets up for  the workload manager     4 22 Why Use A Workload Manager     Workload managers are used so that users do not manually have to keep  track of node usage in a cluster in order to plan efficient and fair use of  cluster resources    Users may still perhaps run jobs on the compute nodes outside of  the workload manager  if that is administratively permitted  However   running jobs outside a workload manager tends to eventually lead to an  abuse of the cluster resources as more people use the cluster  and thus  inefficient use of available resources  It is therefore usually forbidden as  a policy by the system administrator on production clusters     4 33 What Does A Workload Manager Do     A workload manager uses policies to ensure that the resources of a cluster  are used efficiently  and must therefore track cluster resources and jobs   To do this  a workload manager must     e Monitor the node status  up  down  load average   e Monitor all available resources  available cores  memory on the nodes   e Monitor the jobs state  queued  on hold  deleted  do
63. s and we would be very  grateful if you could report any comments  suggestions or corrections to  us at manuals brightcomputing com     Introduction    This manual is intended for cluster users who need a quick introduction  to the Bright Beowulf Cluster Environment  It explains how to use the  MPI and batch environments  how to submit jobs to the queueing system   and how to check job progress  The specific combination of hardware and  software installed may differ depending on the specification of the cluster   which means that parts of this manual may not be relevant to the user   s  particular cluster     1 1 What   s A Beowulf Cluster     1 1 1 Background And History   In the history of the English language  Beowulf is the earliest surviving  epic poem written in English  It is a story about a hero with the strength  of many men who defeated a fearsome monster called Grendel    In computing  a Beowulf class cluster computer is a multicomputer ar   chitecture used for parallel computations  i e   it uses many computers to   gether so that it has the brute force to defeat fearsome number crunching  problems    The architecture was first popularized in the Linux community when  the source code used for the original Beowulf cluster built at NASA was  made widely available  The Beowulf class cluster computer design usu   ally consists of one head node and one or more regular nodes connected  together via Ethernet or some other type of network  While the origi   nal Beowulf so
64. s done by using the directive     PBS   q  lt destination gt         Directives Summary  A summary of the job directives covered  with a few extras  are shown in  the following table                 Directive Description Specified As   Name the job  lt jobname gt   PBS  N  lt jobname gt   Run the job for a time of  lt walltime gt   PBS  1  lt walltime gt   Run the job at  lt time gt   PBS  a  lt time gt    Set error log name to  lt jobname err gt  HPBS  e  lt jobname err gt     Set output log name to  lt jobname log gt    PBS  o  lt jobname Log gt     Join error messages to output log  PBS  j eo  Join output messages to error log  PBS  j oe  Mail to  lt user address gt   PBS  M  lt user address gt   Mail on  lt event gt   PBS  m  lt event gt    where  lt event gt  takes the  a  bort   value of the letter in  b egin   the parentheses  e nd    n  do not send email   Queue is  lt destination gt  HPBS  q  lt destination gt   Login shell path is  lt shellpath gt  HPBS  S  lt shellpath gt        O Bright Computing  Inc     7 1 Components Of A Job Script    41       Resource List Directives Examples  Examples of how requests for resource list directives work are shown in    the following table           Resource Example Description        PBS  1    Specification          Request 500MB memory    Set a maximum runtime of 3  hours 10 minutes and 30 seconds    8 nodes  anywhere on the cluster  8 nodes  anywhere on the cluster  2 nodes  1 processor per node   3 nodes  8 processors per no
65. s environment        etc profile d modules sh      Add any modules you might require   module add shared mvapich gcc torque maui pbspro       Full path to application   application name  application   lt application gt         Run options for the application  options   lt options gt         Work directory  workdir   lt work dir gt      HERRIA HORROR HORROR RODAR RODAR RODAR RODAR RODAR RODAR A EERE A EHR RR EERE      You should not have to change anything below this line        tttttttitittitititttttittitttttitttttittttttttttttitittttttttttt       change the working directory  default is home directory   cd  workdir    echo Running on host   hostname    echo Time is   date    echo Directory is   pwd    echo PBS job ID is  PBS_JOBID   echo This job runs on the following machines   echo   cat  PBS_NODEFILE   uniq      mpirun_command  mpirun  application  options      Run the parallel MPI executable  nodes ppn     echo Running  mpirun_command  eval  mpirun_command       Bright Computing  Inc     44    PBS Variants  Torque And PBS Pro       In the preceding script  no machine file is needed  since it is automati   cally built by the workload manager and passed on to the mpirun parallel  job launcher utility  The job is given a unique ID and run in parallel on  the nodes based on the resource specification     7 1 5 Links To Other Resources About Job Scripts In Torque  And PBS Pro    A number of useful links are        Torque examples   http   bmi cchmc org resources software torq
66. sub command  The dif    ference is illustrated by the following        The following shell comment is only a comment for a job script pro   cessed by qsub       PBS       Bright Computing  Inc     7 1 Components Of A Job Script    39          The following shell comment is also a job script directive when pro   cessed by qsub      PBS    Job script directive lines with the     PBS     part removed are the same  as options applied to the qsub command  so a look at the man pages of  qsub describes the possible directives and how they are used  If there is  both a job script directive and a qsub command option set for the same  item  the qsub option takes precedence    Since the job script file is a shell script  the shell interpreter used can be  changed to another shell interpreter by modifying the first line  the           line  to the preferred shell  Any shell specified by the first line can also be  overridden by using the     PBS  S    directive to set the shell path     Walltime Directive  The workload manager typically has default walltime limits per queue  with a value limit set by the administrator  The user sets walltime limit  by setting the     PBS  1 walltime    directive to a specific time  The time  specified is the maximum time that the user expects the job should run  for  and it allows the workload manager to work out an optimum time to  run the job  The job can then run sooner than it would by default    If the walltime limit is exceeded by a job  then the j
[Screenshot of the User Portal graphs page in a web browser, showing the cluster occupation rate graph.]

Figure 9.4: User Portal: Graphs Page

Selecting other values is possible for:

• Workload Management Metrics. The following workload manager metrics can be viewed:
  - RunningJobs
  - QueuedJobs
  - FailedJobs
  - CompletedJobs
  - EstimatedDelay
  - AvgJobDuration
  - AvgExpFactor

• Cluster Management Metrics. The following metrics can be viewed:
  - OccupationRate
  - NetworkBytesRecv
  - NetworkBytesSent
  - DevicesUp
  - NodesUp
  - TotalMemoryUsed
  - TotalSwapUsed
  - PhaseLoad
  - CPUCoresAvailable
  - GPUAvailable
  - TotalCPUUser
  - TotalCPUSystem
  - TotalCPUIdle

• Datapoints: The number of points used for the graph can be specified. The points are interpolated if necessary.

• Interval (Hours): The period over which the data points are displayed.

The Update button must be clicked to display any changes made.

MPI Examples

A.1 "Hello world"

A quick application to test the MPI libraries and the network.

/*
  "Hello World" Type MPI Test Program
*/
#include <mpi.h>
#include <stdio.h>
68. t is possible to specify options     directives     to SGE by using             in the  script  The difference in the meaning of lines that start with the       char   acter in the job script file should be noted        Line Starts With Treated As               Comment in shell and SGE     Comment in shell  directive in SGE      Comment in shell and SGE          Bright Computing  Inc     30    SGE       6 1 2 SGE Environment Variables  Available environment variables      HOME   Home directory on execution machine     USER   User ID of job owner   JOB_ID   Current job ID     JOB_NAME   Current job name   like the  N option in qsub  qsh  qrsh  q     login and qalter    HOSTNAME   Name of the execution host   TASK_ID   Array job task index number    6 1 3 Job Script Options    Options can be set in the job script file using this line format for each    option         option   parameter     Available options and their descriptions can be seen with the output    of qsub  help     Table 6 1 3  SGE Job Script Options          Option and parameter    Description        a date_time    ac context_list    ar ar_id    A account_string    b yles   n o     binding  env pelset  expllin str   c ckpt_selector    ckpt ckpt name    clear    cwd    C directive_prefix    dc simple_context_list    dl date_time    e path_list    h    hard    help    hold_jid job_identifier_list   hold_jid_ad job_identifier_list   i file_list    j yles   n o      js job_share    request a start time   add context v
69. table for general use that is packaged as a mod   ule by Bright Computing     Example     fred bright52     module load gcc    fred bright52     which gcc  gcc   version   head  1   cm shared apps gcc 4 4 6 bin gcc   gcc  GCC  4 4 6    To use the latest version of GCC  the gcc module must be loaded  To  revert to the version of GCC that comes natively with the Linux distribu   tion  the gcc module must be unloaded    The compilers in the preceding table are ordinarily used for applica   tions that run on a single node  However  the applications used may fork   thread  and run across as many nodes and processors as they can access  if the application is designed that way    The standard  structured way of running applications in parallel is to  use the MPI based libraries  which link to the underlying compilers in the    O Bright Computing  Inc     10    Cluster Usage       preceding table  The underlying compilers are automatically made avail   able after choosing the parallel environment  MPICH  MVAPICH  Open  MPI  etc   via the following compiler commands        Language C C   Fortran77 Fortran90 Fortran95          Command mpicc mpiCC mpif77 mpif90 mpif95          2 4 1 Mixing Compilers   Bright Cluster Manager comes with multiple Open MPI packages corre   sponding to the different available compilers  However  sometimes mix   ing compilers is desirable  For example  C compilation may be preferred  using icc from Intel  while Fortran90 compilation may be preferred using  op
70. thers loaded at the same time  For example  loading openmpi  gcc 64   without removing an already loaded openmpi  gcc 64  can result in con   fusion about what compiler opencc is meant to use     2 3 3 Changing The Default Environment   The initial state of modules in the user environment can be set as a default  using the    module init     subcommands  The more useful ones of these  are     e module initadd  add a module to the initial state  e module initrm  remove a module from the initial state     module initlist  list all modules loaded initially    e module initclear  clear all modules from the list of modules  loaded initially    Example      module initclear     module initlist   bash initialization file  HOME  bashrc loads modules   null     module initadd shared gcc 4 4 6 openmpi gcc 64 torque     module initlist   bash initialization file  HOME  bashrc loads modules   null shared gcc 4 4 6 openmpi gcc 64 torque    The new initial state module environment for the user is loaded from  the next login onwards    If the user is unsure about what the module does  it can be checked  using    module whatis          module whatis sge  sge   Adds sge to your environment    The man pages for module gives further details on usage         For multiple versions  when no version is specified  the alphabetically last version is  chosen  This usually is the latest  but can be an issue when versions move from  say  9  to  10  For example  the following is sorted in alphabetical or
to indicate that more than one process should be started on each node:

node001
node001
node002
node002

• Listing nodes once, but with a suffix for the number of CPU cores to use on each node:

node001:2
node002:2

3.3.3 Running The Application

A Simple Parallel Processing Executable
A simple "hello world" program designed for parallel processing can be built with MPI. After compiling, it can be used to send a message about how and where it is running:

[fred@bright52 ~]$ cat hello.c
#include <stdio.h>
#include <mpi.h>

int main (int argc, char *argv[])
{
    int id, np, i;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int processor_name_len;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Get_processor_name(processor_name, &processor_name_len);

    for (i = 1; i < 2; i++)
        printf("Hello world from process %03d out of %03d, processor name %s\n",
               id, np, processor_name);

    MPI_Finalize();
    return 0;
}
[fred@bright52 ~]$ mpicc hello.c -o hello
[fred@bright52 ~]$ ./hello
Hello world from process 000 out of 001, processor name bright52.cm.cluster

However, it still runs on a single processor unless it is submitted to the system in a special way.

Running An MPI Executable In Parallel Without A Workload Manager
After the relevant module files are chosen (section 3.3) for MPI, an executable
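Picking up from the machine files described in section 3.3.2, a minimal sketch of such a run outside the workload manager, assuming the hello binary compiled above and a machine file named machines in the current directory, is:

$ mpirun -np 4 -machinefile machines ./hello

In this mode the number given to -np and the contents of the machine file are the user's responsibility; the workload manager examples from chapter 4 onwards take care of both automatically.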
72. ue examples    e PBS Pro script files   http   www ccs tulane edu computing pbs pbs phtml    e   Running PBS Pro jobs and directives   http    wiki hpc ufl edu index php Job_Submission_Queues    7 2 Submitting A Job    7 2 1 Preliminaries  Loading The Modules Environment  To submit a job to the workload management system  the user must en   sure that the following environment modules are loaded        If using Torque with no external scheduler     module add shared torque     If using Torque with Maui     module add shared torque maui     If using Torque with Moab     module add shared torque moab     If using PBS Pro     module add shared pbspro    Users can pre load particular environment modules as their default  using the    module init     commands  section 2 3 3      7 2 2 Using qsub   The command qsub is used to submit jobs to the workload manager sys   tem  The command returns a unique job identifier  which is used to query  and control the job and to identify output  The usage format of qsub and  some useful options are listed here     USAGE  qsub   lt options gt    lt job script gt     Option Hint Description    O Bright Computing  Inc     7 2 Submitting A Job        a at run the job at a certain time   1 list request certain resource s    q queue job is run in this queue    N name name of job    5 shell shell to run job under    j join join output and error files    For example  a job script called mpirun  job with all the relevant di   rectives set inside the scri
73. upport test package TestJobPBS   el9   Hold_Types   n   Join_Path   n   Keep_Files   n   Mail_Points   a   mtime   Tue Jul 14 12 35 31 2009   Output_Path   mascm4 cm cluster   home cvsupport test package Test JobPB  S o19   Priority   O   qtime   Tue Jul 14 12 35 31 2009   Rerunable   True   Resource_List nodect   1   Resource_List nodes   1 ppn 2   Resource_List walltime   02 00 00   Variable_List   PBS_O_HOME  home cvsupport  PBS_O_LANG en_US UTF 8   PBS_O_LOGNAME cvsupport   PBS_O_PATH  usr kerberos bin   usr local bin   bin   usr bin  sbin  usr   sbin  home cvsupport bin  cm shared apps torque 2 3 5 bin  cm shar  ed apps torque 2 3 5 sbin PBS_0_MAIL  var spool mail cvsupport   PBS_0_SHELL  bin bash PBS_SERVER mascm4 cm cluster   PBS_0_HOST mascm4 cm cluster   PBS_0_WORKDIR  home cvsupport test package  PBS_0_QUEUE default   etime   Tue Jul 14 12 35 31 2009   submit_args   pbs job  q default    O Bright Computing  Inc     48    PBS Variants  Torque And PBS Pro       Job Details With checkjob The checkjob command  only for Maui  is  particularly useful for checking why a job has not yet executed  For a job    that has an excessive memory requirement  the output looks something  like      fred bright52     checkjob 65  checking job 65    State  Idle  Creds  user fred group fred class shortq qos DEFAULT  WallTime  00 00 00 of 00 01 00  SubmitTime  Tue Sep 13 15 22 44   Time Queued Total  2 53 41 Eligible  2 53 41     Total Tasks  1    Req 0  TaskCount  1 Partition  ALL   Netw
uster Extension cluster, job scripts to a workload manager should be submitted using the cmsub cluster manager utility. This allows the job to be considered for running on the extension (the cloud nodes). Jobs that are to run on the local regular nodes (not in a cloud) are not dealt with by cmsub.

The environment module (section 2.3) cluster-tools must be loaded in order for the cmsub utility to be available for use.

The basic usage for cmsub is:

cmsub [OPTIONS] script

Users that are used to running jobs as root should note that the root user cannot usefully run a job with cmsub.

The user can submit some cloud-related values as options to cmsub on the command line, followed by the job script.

Example:

$ cat myscript1
#!/bin/sh
hostname
$ cmsub --regions=eu-west-1 myscript1
Upload job id: 1
User job id: 2
Download job id: 3

The cloud-related values can also be specified in a job-directive-style format in the job script itself, using the "#CMSUB" tag to indicate a cloud-related option.

Example:

$ cat myscript2
#!/bin/sh
#CMSUB --regions=us-west-2,eu-west-1
#CMSUB --input-list=/home/user/myjob.in
#CMSUB --output-list=/home/user/myjob.out
#CMSUB --remote-output-list=/home/user/file-which-will-be-created
#CMSUB --input=/home/user/onemoreinput.dat
#CMSUB --input=/home/user/myexec
myexec
$ cmsub myscript2
Upload job id: 4
User job id: 5
Download job id: 6
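The directive style can carry an ordinary script body alongside the staging options, just as myscript2 does. The following is a sketch only, reusing options already introduced above; the region name, paths, and application name are placeholders that would need to be adapted to the actual cluster:

$ cat cloudjob.sh
#!/bin/sh
#CMSUB --regions=eu-west-1
#CMSUB --input=/home/user/mydata.dat
#CMSUB --output-list=/home/user/results.out
./myapp mydata.dat results.out
$ cmsub cloudjob.sh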
    